iceberg Module API

This section documents the main API functions for creating and loading tables.

Main Functions

Main module for the Python Iceberg implementation. Provides the primary API for working with Iceberg tables.

class datashard.iceberg.Table(table_path: str, create_if_not_exists: bool = True, schema: Schema | None = None, partition_spec: Any | None = None)[source]

Bases: object

Main table interface with transaction support

__init__(table_path: str, create_if_not_exists: bool = True, schema: Schema | None = None, partition_spec: Any | None = None)[source]

new_transaction() → Transaction[source]: Create a new transaction

current_snapshot() → Snapshot | None[source]: Get the current snapshot

snapshot_by_id(snapshot_id: int) → Snapshot | None[source]: Get a specific snapshot by ID

snapshots() → List[Dict[str, Any]][source]: Get all snapshots

time_travel(snapshot_id: int | None = None, timestamp: int | None = None) → Any[source]

Look up a historical Snapshot by id or timestamp.

NOTE: This returns snapshot METADATA (id, timestamp, manifest list reference); it does not switch the table’s state, and scan() always reads the CURRENT snapshot. Reading data as-of an old snapshot is not implemented yet.

append_data(files: List[DataFile]) → bool[source]: Append data files to the table (convenience method)

append_pandas(df: Any, schema: Schema | None = None) → bool[source]: Append pandas DataFrame to table (convenience method)

append_records(records: List[Dict[str, Any]], schema: Schema | None = None, partition_values: Dict[str, Any] | None = None) → bool[source]: Append actual data records to the table by creating new data files (convenience method)

refresh() → bool[source]: Refresh the table metadata from storage

garbage_collect(grace_period_ms: int = 3600000) → Dict[str, int][source]

Delete orphaned files not referenced by any snapshot.

Fail closed: if any reachable manifest cannot be read, GC aborts (GarbageCollectionAborted) without deleting anything. Files belonging to in-flight transactions are protected via markers regardless of age.

Parameters: grace_period_ms – Only delete orphaned files older than this age (default 1 hour).
Returns: Dict with counts of deleted files by type.
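
The grace-period rule can be pictured with a small stand-alone model. This is only a sketch of the documented behaviour, not datashard's implementation: the OrphanFile record, the select_deletable helper, and the idea of passing protected paths as a set are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import AbstractSet, Iterable, List


@dataclass(frozen=True)
class OrphanFile:
    """Hypothetical record for a file that no snapshot references."""
    path: str
    modified_ms: int  # last-modified time in epoch milliseconds


def select_deletable(
    orphans: Iterable[OrphanFile],
    now_ms: int,
    grace_period_ms: int = 3_600_000,
    protected: AbstractSet[str] = frozenset(),
) -> List[str]:
    """Return paths that are older than the grace period and not protected."""
    deletable = []
    for f in orphans:
        if f.path in protected:
            continue  # marked as in-flight: kept regardless of age
        if now_ms - f.modified_ms > grace_period_ms:
            deletable.append(f.path)
    return deletable


now = 10_000_000
files = [
    OrphanFile("data/old.parquet", now - 7_200_000),        # 2 hours old
    OrphanFile("data/recent.parquet", now - 60_000),        # 1 minute old
    OrphanFile("data/in_flight.parquet", now - 7_200_000),  # old but protected
]
print(select_deletable(files, now, protected={"data/in_flight.parquet"}))
# ['data/old.parquet']
```

The recent file survives because of the grace period, and the in-flight file survives because of its marker, even though it is just as old as the deleted one.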

row_count() → int[source]

Get total row count from parquet metadata without scanning data.

This is a fast O(manifest_files) operation that reads only metadata, not the actual parquet data files. Use this for count-only queries instead of len(table.scan()).

Returns: Total number of rows across all data files in the current snapshot.

scan(columns: List[str] | None = None, filter: Dict[str, Any] | None = None, parallel: bool | int = False, verify_checksums: bool | None = None) → List[Dict[str, Any]][source]

Scan the table’s CURRENT snapshot and return records.

Parameters:

columns – Optional list of column names to read. If None, reads all columns.
filter – Optional filter dict for predicate pushdown. Examples:

    {"status": "failed"}              # status == "failed"
    {"age": (">", 18)}                # age > 18
    {"id": ("in", [1, 2, 3])}         # id in [1, 2, 3]
    {"ts": ("between", (t1, t2))}     # t1 <= ts <= t2
    {"name": ("is_null", True)}       # name IS NULL

Null handling follows SQL semantics: comparison operators and in/not_in never match NULL values; use is_null / is_not_null.
parallel – Enable parallel reading.
    - False: Sequential reading (default)
    - True: Use all CPU cores
    - int: Use specified number of threads
verify_checksums – Verify each data file against its stored checksum (raises CorruptDataError on mismatch). Defaults to the DATASHARD_VERIFY_CHECKSUMS env var, which defaults to true.

Returns:

List of dictionaries, each representing a record.

Raises:

RuntimeError / OSError – If any referenced manifest or data file cannot be read - errors are never swallowed into partial results.
CorruptDataError – If checksum verification fails.
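
To make the filter forms and the NULL rule concrete, here is a small pure-Python model of how such a filter dict could be evaluated against a single record. It is an illustrative sketch, not datashard's filter engine: the matches helper is hypothetical, the comparison operators other than ">" are assumed, combining several keys with AND is assumed, and so is the reading of a False operand for is_null / is_not_null.

```python
import operator
from typing import Any, Dict

# Only ">" appears in the documented examples; the other symbols are assumed.
_COMPARE = {
    "==": operator.eq, "!=": operator.ne,
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
}


def matches(record: Dict[str, Any], filter: Dict[str, Any]) -> bool:
    """Return True if the record satisfies every condition in the filter."""
    for column, cond in filter.items():
        value = record.get(column)
        # A bare value means equality; a tuple is (operator, operand).
        op, operand = cond if isinstance(cond, tuple) else ("==", cond)
        if op == "is_null":
            ok = (value is None) == bool(operand)
        elif op == "is_not_null":
            ok = (value is not None) == bool(operand)
        elif value is None:
            ok = False  # SQL semantics: NULL never matches comparisons or in/not_in
        elif op == "in":
            ok = value in operand
        elif op == "not_in":
            ok = value not in operand
        elif op == "between":
            low, high = operand
            ok = low <= value <= high
        else:
            ok = _COMPARE[op](value, operand)
        if not ok:
            return False
    return True


rows = [
    {"id": 1, "age": 30, "name": "ada"},
    {"id": 2, "age": None, "name": None},
    {"id": 3, "age": 12, "name": "bob"},
]
print([r["id"] for r in rows if matches(r, {"age": (">", 18)})])           # [1]
print([r["id"] for r in rows if matches(r, {"age": ("not_in", [30])})])    # [3]
print([r["id"] for r in rows if matches(r, {"name": ("is_null", True)})])  # [2]
```

Note that the row with a NULL age is excluded by both the ">" filter and the not_in filter; only is_null selects it.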

to_pandas(columns: List[str] | None = None, filter: Dict[str, Any] | None = None, parallel: bool | int = False, verify_checksums: bool | None = None) → Any[source]

Read the table’s CURRENT snapshot as a pandas DataFrame.

Same filtering, error, and checksum semantics as scan().

Raises: ImportError – If pandas is not installed.

scan_batches(batch_size: int = 10000, columns: List[str] | None = None, filter: Dict[str, Any] | None = None, verify_checksums: bool | None = None) → Iterator[List[Dict[str, Any]]][source]

Scan data in batches for memory-efficient processing.

Yields batches of records, processing one parquet file at a time using PyArrow’s iter_batches for memory efficiency. Uses the same filter engine (and error semantics) as scan().

Parameters:

batch_size – Approximate number of records per batch
columns – Optional column projection
filter – Optional predicate pushdown filter
verify_checksums – As in scan().

Yields:

List of records (dicts) per batch
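
A sketch of streaming aggregation over batches. count_by_status is a hypothetical helper, the status column is an assumed example field, and FakeTable only mimics the documented scan_batches() signature so the snippet runs without a real table.

```python
from collections import Counter
from typing import Any, Dict, Iterator, List


class FakeTable:
    """Hypothetical stand-in that mimics scan_batches() over in-memory rows."""

    def __init__(self, rows: List[Dict[str, Any]]):
        self._rows = rows

    def scan_batches(self, batch_size: int = 10000, columns=None,
                     filter=None, verify_checksums=None
                     ) -> Iterator[List[Dict[str, Any]]]:
        for start in range(0, len(self._rows), batch_size):
            batch = self._rows[start:start + batch_size]
            if columns is not None:
                batch = [{c: r[c] for c in columns} for r in batch]
            yield batch


def count_by_status(table, batch_size: int = 10000) -> Counter:
    """Tally the 'status' column while holding one batch in memory at a time."""
    counts: Counter = Counter()
    for batch in table.scan_batches(batch_size=batch_size, columns=["status"]):
        counts.update(record["status"] for record in batch)
    return counts


table = FakeTable(
    [{"id": i, "status": "ok" if i % 3 else "failed"} for i in range(10)]
)
print(count_by_status(table, batch_size=4))
```

With a real table, the same helper would be called on the object returned by load_table(); projecting only the needed column keeps each batch small.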

iter_records(columns: List[str] | None = None, filter: Dict[str, Any] | None = None, verify_checksums: bool | None = None) → Iterator[Dict[str, Any]][source]

Iterate over records one at a time.

Memory efficient - only one batch in memory at a time. Ideal for row-by-row processing of large tables.

Parameters:

columns – Optional column projection
filter – Optional predicate pushdown filter
verify_checksums – As in scan().

Yields:

Individual records as dicts
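
Because iter_records() is an iterator, a consumer can stop early without materialising the whole table. The sketch below shows that pattern; first_n is a hypothetical helper and FakeTable is a stand-in that only mimics the documented signature and counts how many rows were pulled.

```python
from itertools import islice
from typing import Any, Dict, Iterator, List


class FakeTable:
    """Hypothetical stand-in that mimics iter_records() and counts rows read."""

    def __init__(self, rows: List[Dict[str, Any]]):
        self._rows = rows
        self.rows_read = 0

    def iter_records(self, columns=None, filter=None,
                     verify_checksums=None) -> Iterator[Dict[str, Any]]:
        for row in self._rows:
            self.rows_read += 1
            yield row


def first_n(table, n: int) -> List[Dict[str, Any]]:
    """Take the first n records and stop pulling from the iterator."""
    return list(islice(table.iter_records(), n))


table = FakeTable([{"id": i} for i in range(1_000)])
print(first_n(table, 3))  # [{'id': 0}, {'id': 1}, {'id': 2}]
print(table.rows_read)    # 3
```

The stand-in reads exactly three rows. A real table holds one batch in memory at a time, so it may read somewhat more than n rows before the consumer stops.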

iter_pandas(chunksize: int = 50000, columns: List[str] | None = None, filter: Dict[str, Any] | None = None, verify_checksums: bool | None = None) → Iterator[Any][source]

Iterate over data as pandas DataFrame chunks.

Memory efficient - only one chunk in memory at a time. Ideal for processing large tables with pandas operations.

Parameters:

chunksize – Approximate rows per chunk
columns – Optional column projection
filter – Optional predicate pushdown filter
verify_checksums – As in scan().

Yields:

pandas DataFrame chunks

Raises:

ImportError – If pandas is not installed.

class datashard.iceberg.Transaction(metadata_manager: MetadataManager, snapshot_manager: SnapshotManager, file_manager: FileManager)[source]

Bases: object

Represents a database transaction with ACID properties

__init__(metadata_manager: MetadataManager, snapshot_manager: SnapshotManager, file_manager: FileManager)[source]

begin() → Transaction[source]: Start a new transaction

is_active() → bool[source]: Check if transaction is active

append_files(files: List[DataFile]) → Transaction[source]: Queue files to append to the table

append_pandas(df: Any, schema: Schema | None = None) → Transaction[source]

Append a pandas DataFrame to the table.

Parameters:

df – pandas DataFrame to append
schema – Optional Schema. If None, uses the table’s current schema.

Returns:

Self for chaining

append_data(records: List[Dict[str, Any]], schema: Schema | None = None, partition_values: Dict[str, Any] | None = None) → Transaction[source]

Append actual data records to the table by creating new data files.

The schema is resolved from the table metadata when not provided; a table without a usable schema raises instead of silently writing zero-column files.

delete_files(file_paths: List[str]) → Transaction[source]: Queue files to delete from the table

overwrite_by_filter(filter_func: Callable[[Any], bool]) → Transaction[source]

NOT IMPLEMENTED - raises instead of silently doing nothing.

Earlier versions queued this operation, committed “successfully”, and changed nothing. An overwrite API that reports success without overwriting is a data-integrity hazard, so until row-level overwrite is actually implemented this raises loudly.

expire_snapshots(older_than_ms: int) → Transaction[source]: Queue snapshot expiration: snapshots with timestamp_ms older than the given cutoff are removed from table metadata at commit (the current snapshot is never expired). Physical file cleanup is done by garbage_collect() once the snapshots are unreachable.
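
The expiration rule can be modelled in a few lines. This is a sketch of the documented selection logic, not the library's code: select_expired is a hypothetical helper, and the snapshot_id and timestamp_ms dictionary keys are assumed names for the entries in a snapshot list.

```python
from typing import Any, Dict, List


def select_expired(snapshots: List[Dict[str, Any]], older_than_ms: int,
                   current_snapshot_id: int) -> List[int]:
    """Return ids of snapshots older than the cutoff, never the current one."""
    return [
        s["snapshot_id"]
        for s in snapshots
        if s["timestamp_ms"] < older_than_ms
        and s["snapshot_id"] != current_snapshot_id
    ]


history = [
    {"snapshot_id": 1, "timestamp_ms": 1_000},
    {"snapshot_id": 2, "timestamp_ms": 2_000},
    {"snapshot_id": 3, "timestamp_ms": 3_000},  # current snapshot
]
# The cutoff is later than every snapshot, yet the current one is kept.
print(select_expired(history, older_than_ms=5_000, current_snapshot_id=3))
# [1, 2]
```

Expiring a snapshot only removes it from metadata; the files it referenced become candidates for garbage_collect() once nothing else reaches them.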

commit() → bool[source]

Commit the transaction with ACID properties using Optimistic Concurrency Control.

Failure semantics (bank-grade, fail closed):

- ConcurrentModificationException: clean conflict, retried with backoff against a freshly-read base.
- AmbiguousCommitError: the commit-point write may have succeeded; written data files are KEPT (a durable snapshot may reference them) and the error is re-raised. True orphans are GC’d later.
- After the commit point, no fallible operation runs before commit() returns - a post-commit failure can never trigger a rollback that deletes committed data.
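
The retry-on-conflict behaviour follows the general shape of optimistic concurrency control. The toy model below illustrates that technique only; it is not datashard's commit path. VersionedStore, RacyStore, ConflictError, and commit_with_retry are all hypothetical names, and the backoff constants are arbitrary.

```python
import random
import time


class ConflictError(Exception):
    """Hypothetical stand-in for a clean commit conflict."""


class VersionedStore:
    """Toy metadata store: a write succeeds only if its base version is current."""

    def __init__(self):
        self.version = 0
        self.snapshots = []

    def read(self):
        return self.version, list(self.snapshots)

    def compare_and_swap(self, base_version, new_snapshots):
        if base_version != self.version:
            raise ConflictError(f"base {base_version} != current {self.version}")
        self.snapshots = new_snapshots
        self.version += 1


class RacyStore(VersionedStore):
    """Simulates one competing writer that commits right after our first read."""

    def __init__(self):
        super().__init__()
        self._raced = False

    def read(self):
        state = super().read()
        if not self._raced:
            self._raced = True
            # The competitor commits after we captured our (now stale) base.
            super().compare_and_swap(self.version, self.snapshots + ["other"])
        return state


def commit_with_retry(store, snapshot, max_attempts=5, base_delay_s=0.001):
    """Re-read the base and retry with exponential backoff on conflict."""
    for attempt in range(max_attempts):
        base_version, snapshots = store.read()  # fresh base on every attempt
        try:
            store.compare_and_swap(base_version, snapshots + [snapshot])
            return attempt + 1  # number of attempts used
        except ConflictError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter before the next attempt.
            time.sleep(base_delay_s * (2 ** attempt) * (1 + random.random()))


store = RacyStore()
print(commit_with_retry(store, "mine"))  # 2
print(store.snapshots)                   # ['other', 'mine']
```

The first attempt fails because the competing write changed the version; the second attempt rebuilds on the fresh base, so neither writer's snapshot is lost.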

rollback() → bool[source]: Rollback the transaction

datashard.iceberg.create_table(table_path: str, schema: Schema | None = None, partition_spec: PartitionSpec | None = None) → Table[source]

Create a new Iceberg table (or open the existing one at table_path).

The provided schema and partition spec are PERSISTED into the table metadata, so subsequent appends without an explicit schema use them. Initialization is guarded and race-safe: an already-initialized table is never overwritten, and two concurrent creators cannot clobber each other.

Parameters:

table_path – Path where the table should be stored
schema – Optional schema for the table (persisted as the current schema)
partition_spec – Optional partition spec for the table (persisted)

Returns:

Table instance

datashard.iceberg.load_table(table_path: str) → Table[source]

Load an existing Iceberg table

Parameters: table_path – Path to the existing table
Returns: Table instance
Raises: ValueError – If no initialized table exists at table_path.
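
Unlike create_table(), load_table() refuses to create anything, so callers that want a "load if present" check need to handle ValueError. A minimal sketch of that pattern follows; load_if_exists and fake_loader are hypothetical, and the loader is passed in as a parameter so the snippet runs without a real table. In real use the loader argument would be load_table itself.

```python
from typing import Any, Callable, Optional


def load_if_exists(table_path: str,
                   loader: Callable[[str], Any]) -> Optional[Any]:
    """Return the loaded table, or None when no initialized table exists."""
    try:
        return loader(table_path)
    except ValueError:
        return None


def fake_loader(path: str):
    """Hypothetical stand-in for load_table: only one path is 'initialized'."""
    if path != "/data/events":
        raise ValueError(f"no initialized table at {path}")
    return {"path": path}


print(load_if_exists("/data/events", fake_loader))   # {'path': '/data/events'}
print(load_if_exists("/data/missing", fake_loader))  # None
```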

class datashard.iceberg.DataFile(file_path: str, file_format: FileFormat, partition_values: Dict[str, Any], record_count: int, file_size_in_bytes: int, column_sizes: Dict[int, int] | None = None, value_counts: Dict[int, int] | None = None, null_value_counts: Dict[int, int] | None = None, lower_bounds: Dict[int, Any] | None = None, upper_bounds: Dict[int, Any] | None = None, key_metadata: bytes | None = None, checksum: str | None = None, split_offsets: List[int] | None = None, split_compressed_offsets: List[int] | None = None, equality_ids: List[int] | None = None, sort_order_id: int | None = None, added_snapshot_id: int | None = None)[source]

Bases: object

Represents a data file in the table

file_path: str

file_format: FileFormat

partition_values: Dict[str, Any]

record_count: int

file_size_in_bytes: int

column_sizes: Dict[int, int] | None = None

value_counts: Dict[int, int] | None = None

null_value_counts: Dict[int, int] | None = None

lower_bounds: Dict[int, Any] | None = None

upper_bounds: Dict[int, Any] | None = None

key_metadata: bytes | None = None

checksum: str | None = None

split_offsets: List[int] | None = None

split_compressed_offsets: List[int] | None = None

equality_ids: List[int] | None = None

sort_order_id: int | None = None

added_snapshot_id: int | None = None

__init__(file_path: str, file_format: FileFormat, partition_values: Dict[str, Any], record_count: int, file_size_in_bytes: int, column_sizes: Dict[int, int] | None = None, value_counts: Dict[int, int] | None = None, null_value_counts: Dict[int, int] | None = None, lower_bounds: Dict[int, Any] | None = None, upper_bounds: Dict[int, Any] | None = None, key_metadata: bytes | None = None, checksum: str | None = None, split_offsets: List[int] | None = None, split_compressed_offsets: List[int] | None = None, equality_ids: List[int] | None = None, sort_order_id: int | None = None, added_snapshot_id: int | None = None) → None

class datashard.iceberg.FileFormat(value)[source]

Bases: Enum

Supported file formats

PARQUET = 'parquet'

AVRO = 'avro'

ORC = 'orc'

Classes

Table

class datashard.Table(table_path: str, create_if_not_exists: bool = True, schema: Schema | None = None, partition_spec: Any | None = None)[source]

Bases: object

Main table interface with transaction support

__init__(table_path: str, create_if_not_exists: bool = True, schema: Schema | None = None, partition_spec: Any | None = None)[source]

new_transaction() → Transaction[source]: Create a new transaction

current_snapshot() → Snapshot | None[source]: Get the current snapshot

snapshot_by_id(snapshot_id: int) → Snapshot | None[source]: Get a specific snapshot by ID

snapshots() → List[Dict[str, Any]][source]: Get all snapshots

time_travel(snapshot_id: int | None = None, timestamp: int | None = None) → Any[source]

Look up a historical Snapshot by id or timestamp.

NOTE: This returns snapshot METADATA (id, timestamp, manifest list reference); it does not switch the table’s state, and scan() always reads the CURRENT snapshot. Reading data as-of an old snapshot is not implemented yet.

append_data(files: List[DataFile]) → bool[source]: Append data files to the table (convenience method)

append_pandas(df: Any, schema: Schema | None = None) → bool[source]: Append pandas DataFrame to table (convenience method)

append_records(records: List[Dict[str, Any]], schema: Schema | None = None, partition_values: Dict[str, Any] | None = None) → bool[source]: Append actual data records to the table by creating new data files (convenience method)

refresh() → bool[source]: Refresh the table metadata from storage

garbage_collect(grace_period_ms: int = 3600000) → Dict[str, int][source]

Delete orphaned files not referenced by any snapshot.

Fail closed: if any reachable manifest cannot be read, GC aborts (GarbageCollectionAborted) without deleting anything. Files belonging to in-flight transactions are protected via markers regardless of age.

Parameters: grace_period_ms – Only delete orphaned files older than this age (default 1 hour).
Returns: Dict with counts of deleted files by type.

row_count() → int[source]

Get total row count from parquet metadata without scanning data.

This is a fast O(manifest_files) operation that reads only metadata, not the actual parquet data files. Use this for count-only queries instead of len(table.scan()).

Returns: Total number of rows across all data files in the current snapshot.

scan(columns: List[str] | None = None, filter: Dict[str, Any] | None = None, parallel: bool | int = False, verify_checksums: bool | None = None) → List[Dict[str, Any]][source]

Scan the table’s CURRENT snapshot and return records.

Parameters:

columns – Optional list of column names to read. If None, reads all columns.
filter – Optional filter dict for predicate pushdown. Examples:

    {"status": "failed"}              # status == "failed"
    {"age": (">", 18)}                # age > 18
    {"id": ("in", [1, 2, 3])}         # id in [1, 2, 3]
    {"ts": ("between", (t1, t2))}     # t1 <= ts <= t2
    {"name": ("is_null", True)}       # name IS NULL

Null handling follows SQL semantics: comparison operators and in/not_in never match NULL values; use is_null / is_not_null.
parallel – Enable parallel reading.
    - False: Sequential reading (default)
    - True: Use all CPU cores
    - int: Use specified number of threads
verify_checksums – Verify each data file against its stored checksum (raises CorruptDataError on mismatch). Defaults to the DATASHARD_VERIFY_CHECKSUMS env var, which defaults to true.

Returns:

List of dictionaries, each representing a record.

Raises:

RuntimeError / OSError – If any referenced manifest or data file cannot be read - errors are never swallowed into partial results.
CorruptDataError – If checksum verification fails.

to_pandas(columns: List[str] | None = None, filter: Dict[str, Any] | None = None, parallel: bool | int = False, verify_checksums: bool | None = None) → Any[source]

Read the table’s CURRENT snapshot as a pandas DataFrame.

Same filtering, error, and checksum semantics as scan().

Raises: ImportError – If pandas is not installed.

scan_batches(batch_size: int = 10000, columns: List[str] | None = None, filter: Dict[str, Any] | None = None, verify_checksums: bool | None = None) → Iterator[List[Dict[str, Any]]][source]

Scan data in batches for memory-efficient processing.

Yields batches of records, processing one parquet file at a time using PyArrow’s iter_batches for memory efficiency. Uses the same filter engine (and error semantics) as scan().

Parameters:

batch_size – Approximate number of records per batch
columns – Optional column projection
filter – Optional predicate pushdown filter
verify_checksums – As in scan().

Yields:

List of records (dicts) per batch

iter_records(columns: List[str] | None = None, filter: Dict[str, Any] | None = None, verify_checksums: bool | None = None) → Iterator[Dict[str, Any]][source]

Iterate over records one at a time.

Memory efficient - only one batch in memory at a time. Ideal for row-by-row processing of large tables.

Parameters:

columns – Optional column projection
filter – Optional predicate pushdown filter
verify_checksums – As in scan().

Yields:

Individual records as dicts

iter_pandas(chunksize: int = 50000, columns: List[str] | None = None, filter: Dict[str, Any] | None = None, verify_checksums: bool | None = None) → Iterator[Any][source]

Iterate over data as pandas DataFrame chunks.

Memory efficient - only one chunk in memory at a time. Ideal for processing large tables with pandas operations.

Parameters:

chunksize – Approximate rows per chunk
columns – Optional column projection
filter – Optional predicate pushdown filter
verify_checksums – As in scan().

Yields:

pandas DataFrame chunks

Raises:

ImportError – If pandas is not installed.

Transaction

class datashard.Transaction(metadata_manager: MetadataManager, snapshot_manager: SnapshotManager, file_manager: FileManager)[source]

Bases: object

Represents a database transaction with ACID properties

__init__(metadata_manager: MetadataManager, snapshot_manager: SnapshotManager, file_manager: FileManager)[source]

begin() → Transaction[source]: Start a new transaction

is_active() → bool[source]: Check if transaction is active

append_files(files: List[DataFile]) → Transaction[source]: Queue files to append to the table

append_pandas(df: Any, schema: Schema | None = None) → Transaction[source]

Append a pandas DataFrame to the table.

Parameters:

df – pandas DataFrame to append
schema – Optional Schema. If None, uses the table’s current schema.

Returns:

Self for chaining

append_data(records: List[Dict[str, Any]], schema: Schema | None = None, partition_values: Dict[str, Any] | None = None) → Transaction[source]

Append actual data records to the table by creating new data files.

The schema is resolved from the table metadata when not provided; a table without a usable schema raises instead of silently writing zero-column files.

delete_files(file_paths: List[str]) → Transaction[source]: Queue files to delete from the table

overwrite_by_filter(filter_func: Callable[[Any], bool]) → Transaction[source]

NOT IMPLEMENTED - raises instead of silently doing nothing.

Earlier versions queued this operation, committed “successfully”, and changed nothing. An overwrite API that reports success without overwriting is a data-integrity hazard, so until row-level overwrite is actually implemented this raises loudly.

expire_snapshots(older_than_ms: int) → Transaction[source]: Queue snapshot expiration: snapshots with timestamp_ms older than the given cutoff are removed from table metadata at commit (the current snapshot is never expired). Physical file cleanup is done by garbage_collect() once the snapshots are unreachable.

commit() → bool[source]

Commit the transaction with ACID properties using Optimistic Concurrency Control.

Failure semantics (bank-grade, fail closed):

- ConcurrentModificationException: clean conflict, retried with backoff against a freshly-read base.
- AmbiguousCommitError: the commit-point write may have succeeded; written data files are KEPT (a durable snapshot may reference them) and the error is re-raised. True orphans are GC’d later.
- After the commit point, no fallible operation runs before commit() returns - a post-commit failure can never trigger a rollback that deletes committed data.

rollback() → bool[source]: Rollback the transaction

DataFile and FileFormat

class datashard.DataFile(file_path: str, file_format: FileFormat, partition_values: Dict[str, Any], record_count: int, file_size_in_bytes: int, column_sizes: Dict[int, int] | None = None, value_counts: Dict[int, int] | None = None, null_value_counts: Dict[int, int] | None = None, lower_bounds: Dict[int, Any] | None = None, upper_bounds: Dict[int, Any] | None = None, key_metadata: bytes | None = None, checksum: str | None = None, split_offsets: List[int] | None = None, split_compressed_offsets: List[int] | None = None, equality_ids: List[int] | None = None, sort_order_id: int | None = None, added_snapshot_id: int | None = None)[source]

Bases: object

Represents a data file in the table

file_path: str

file_format: FileFormat

partition_values: Dict[str, Any]

record_count: int

file_size_in_bytes: int

column_sizes: Dict[int, int] | None = None

value_counts: Dict[int, int] | None = None

null_value_counts: Dict[int, int] | None = None

lower_bounds: Dict[int, Any] | None = None

upper_bounds: Dict[int, Any] | None = None

key_metadata: bytes | None = None

checksum: str | None = None

split_offsets: List[int] | None = None

split_compressed_offsets: List[int] | None = None

equality_ids: List[int] | None = None

sort_order_id: int | None = None

added_snapshot_id: int | None = None

__init__(file_path: str, file_format: FileFormat, partition_values: Dict[str, Any], record_count: int, file_size_in_bytes: int, column_sizes: Dict[int, int] | None = None, value_counts: Dict[int, int] | None = None, null_value_counts: Dict[int, int] | None = None, lower_bounds: Dict[int, Any] | None = None, upper_bounds: Dict[int, Any] | None = None, key_metadata: bytes | None = None, checksum: str | None = None, split_offsets: List[int] | None = None, split_compressed_offsets: List[int] | None = None, equality_ids: List[int] | None = None, sort_order_id: int | None = None, added_snapshot_id: int | None = None) → None

class datashard.FileFormat(value)[source]

Bases: Enum

Supported file formats

PARQUET = 'parquet'

AVRO = 'avro'

ORC = 'orc'