transaction Module API

This section documents the transaction and table management functionality.

Classes

Transaction

class datashard.Transaction(metadata_manager: MetadataManager, snapshot_manager: SnapshotManager, file_manager: FileManager)

Bases: object

Represents a database transaction with ACID properties

__init__(metadata_manager: MetadataManager, snapshot_manager: SnapshotManager, file_manager: FileManager)

begin() → Transaction: Start a new transaction

is_active() → bool: Check if transaction is active

append_files(files: List[DataFile]) → Transaction: Queue files to append to the table

append_pandas(df: Any, schema: Schema | None = None) → Transaction

Append a pandas DataFrame to the table.

Parameters:

df – pandas DataFrame to append
schema – Optional Schema. If None, uses the table’s current schema.

Returns:

Self for chaining

append_data(records: List[Dict[str, Any]], schema: Schema | None = None, partition_values: Dict[str, Any] | None = None) → Transaction

Append actual data records to the table by creating new data files.

The schema is resolved from the table metadata when not provided; a table without a usable schema raises instead of silently writing zero-column files.
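The chaining that these return types allow can be sketched as follows. This is a minimal illustration written only against the signatures listed above; the helper name append_in_one_commit is hypothetical, and calling begin() only when is_active() reports False is an assumption about the state in which new_transaction() returns a transaction.

```python
from typing import Any, Dict, List


def append_in_one_commit(table: Any, batches: List[List[Dict[str, Any]]]) -> bool:
    """Queue several record batches on one transaction, then commit once.

    `table` is expected to expose new_transaction() as documented for
    datashard.Table. This helper itself is hypothetical.
    """
    txn = table.new_transaction()
    # Assumption: a fresh transaction may or may not already be started.
    if not txn.is_active():
        txn = txn.begin()
    for records in batches:
        if records:
            # append_data returns the transaction, so calls can be chained.
            txn = txn.append_data(records)
    return txn.commit()
```

Queuing every batch before a single commit() means readers see either all of the new records or none of them.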

delete_files(file_paths: List[str]) → Transaction: Queue files to delete from the table

overwrite_by_filter(filter_func: Callable[[Any], bool]) → Transaction

NOT IMPLEMENTED - raises instead of silently doing nothing.

Earlier versions queued this operation, committed “successfully”, and changed nothing. An overwrite API that reports success without overwriting is a data-integrity hazard, so until row-level overwrite is actually implemented this raises loudly.

expire_snapshots(older_than_ms: int) → Transaction: Queue snapshot expiration: snapshots with timestamp_ms older than the given cutoff are removed from table metadata at commit (the current snapshot is never expired). Physical file cleanup is done by garbage_collect() once the snapshots are unreachable.
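A cutoff for older_than_ms can be derived from a retention window. The sketch below assumes that timestamp_ms is a Unix epoch timestamp in milliseconds, which this page does not state explicitly; cutoff_ms and MS_PER_DAY are hypothetical names.

```python
import time
from typing import Optional

MS_PER_DAY = 24 * 60 * 60 * 1000


def cutoff_ms(keep_days: float, now_ms: Optional[int] = None) -> int:
    """Return an epoch-millisecond cutoff `keep_days` before `now_ms`."""
    if keep_days < 0:
        raise ValueError("keep_days must be non-negative")
    if now_ms is None:
        # Assumption: snapshot timestamps share the wall clock's epoch.
        now_ms = int(time.time() * 1000)
    return now_ms - int(keep_days * MS_PER_DAY)
```

Under that assumption, passing cutoff_ms(7) to expire_snapshots() would queue every snapshot older than one week, apart from the current one.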

commit() → bool

Commit the transaction with ACID properties using Optimistic Concurrency Control.

Failure semantics (bank-grade, fail closed):

ConcurrentModificationException: clean conflict, retried with backoff against a freshly-read base.
AmbiguousCommitError: the commit-point write may have succeeded; written data files are KEPT (a durable snapshot may reference them) and the error is re-raised. True orphans are GC’d later.
After the commit point, no fallible operation runs before commit() returns, so a post-commit failure can never trigger a rollback that deletes committed data.
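The retry-on-conflict behaviour can be illustrated with a toy model of optimistic concurrency control. This is not datashard's implementation: ConflictError, VersionedStore and commit_with_retry are hypothetical stand-ins, and the backoff parameters are arbitrary.

```python
import random
import time
from typing import Callable, List, Tuple


class ConflictError(Exception):
    """Stand-in for a clean commit conflict (hypothetical class)."""


class VersionedStore:
    """Toy metadata pointer guarded by a version number."""

    def __init__(self) -> None:
        self.version = 0
        self.value: List[str] = []

    def read(self) -> Tuple[int, List[str]]:
        return self.version, list(self.value)

    def compare_and_swap(self, expected_version: int, new_value: List[str]) -> None:
        # The commit point: succeeds only if nobody committed since the read.
        if self.version != expected_version:
            raise ConflictError(f"expected v{expected_version}, found v{self.version}")
        self.value = new_value
        self.version += 1


def commit_with_retry(
    store: VersionedStore,
    apply_change: Callable[[List[str]], List[str]],
    max_attempts: int = 5,
    base_delay: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Re-read the base and retry with backoff whenever the swap conflicts."""
    for attempt in range(max_attempts):
        version, base = store.read()  # freshly-read base on every attempt
        try:
            store.compare_and_swap(version, apply_change(base))
            return True
        except ConflictError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter before the next attempt.
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    return False
```

Each attempt rebuilds the change from a fresh read, so a retry never overwrites a concurrent writer's commit.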

rollback() → bool: Rollback the transaction

TransactionManager

Table

class datashard.Table(table_path: str, create_if_not_exists: bool = True, schema: Schema | None = None, partition_spec: Any | None = None)

Bases: object

Main table interface with transaction support

__init__(table_path: str, create_if_not_exists: bool = True, schema: Schema | None = None, partition_spec: Any | None = None)

new_transaction() → Transaction: Create a new transaction

current_snapshot() → Snapshot | None: Get the current snapshot

snapshot_by_id(snapshot_id: int) → Snapshot | None: Get a specific snapshot by ID

snapshots() → List[Dict[str, Any]]: Get all snapshots

time_travel(snapshot_id: int | None = None, timestamp: int | None = None) → Any

Look up a historical Snapshot by id or timestamp.

NOTE: This returns snapshot METADATA (id, timestamp, manifest list reference); it does not switch the table’s state, and scan() always reads the CURRENT snapshot. Reading data as-of an old snapshot is not implemented yet.

append_data(files: List[DataFile]) → bool: Append data files to the table (convenience method)

append_pandas(df: Any, schema: Schema | None = None) → bool: Append pandas DataFrame to table (convenience method)

append_records(records: List[Dict[str, Any]], schema: Schema | None = None, partition_values: Dict[str, Any] | None = None) → bool: Append actual data records to the table by creating new data files (convenience method)

refresh() → bool: Refresh the table metadata from storage

garbage_collect(grace_period_ms: int = 3600000) → Dict[str, int]

Delete orphaned files not referenced by any snapshot.

Fail closed: if any reachable manifest cannot be read, GC aborts (GarbageCollectionAborted) without deleting anything. Files belonging to in-flight transactions are protected via markers regardless of age.

Parameters:

grace_period_ms – Only delete orphaned files older than this age (default 1 hour).

Returns:

Dict with counts of deleted files by type.
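The selection rules described for garbage_collect() can be sketched in plain Python. This is an illustration of the stated rules only, not the library's code: AbortGC, select_deletable and all of its parameters are hypothetical, and file age is modelled as now_ms minus a modification time in milliseconds.

```python
from typing import Dict, Iterable, List, Set


class AbortGC(Exception):
    """Stand-in for an aborted collection (hypothetical class)."""


def select_deletable(
    file_mtimes_ms: Dict[str, int],
    referenced: Set[str],
    in_flight: Set[str],
    now_ms: int,
    grace_period_ms: int = 3_600_000,
    unreadable_manifests: Iterable[str] = (),
) -> List[str]:
    """Return orphaned paths that are old enough to delete."""
    unreadable = list(unreadable_manifests)
    if unreadable:
        # Fail closed: without every manifest the reference set is incomplete.
        raise AbortGC(f"cannot read manifests: {unreadable}")
    deletable = []
    for path, mtime_ms in file_mtimes_ms.items():
        if path in referenced:
            continue  # reachable from some snapshot
        if path in in_flight:
            continue  # protected by a transaction marker regardless of age
        if now_ms - mtime_ms > grace_period_ms:
            deletable.append(path)
    return sorted(deletable)
```

The grace period guards files that were written recently but are not yet referenced by a committed snapshot.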

row_count() → int

Get total row count from parquet metadata without scanning data.

This is a fast O(manifest_files) operation that reads only metadata, not the actual parquet data files. Use this for count-only queries instead of len(table.scan()).

Returns:

Total number of rows across all data files in current snapshot.

scan(columns: List[str] | None = None, filter: Dict[str, Any] | None = None, parallel: bool | int = False, verify_checksums: bool | None = None) → List[Dict[str, Any]]

Scan the table’s CURRENT snapshot and return records.

Parameters:

columns – Optional list of column names to read. If None, reads all columns.
filter – Optional filter dict for predicate pushdown. Examples:

{"status": "failed"}            # status == "failed"
{"age": (">", 18)}              # age > 18
{"id": ("in", [1, 2, 3])}       # id in [1, 2, 3]
{"ts": ("between", (t1, t2))}   # t1 <= ts <= t2
{"name": ("is_null", True)}     # name IS NULL

Null handling follows SQL semantics: comparison operators and in/not_in never match NULL values; use is_null / is_not_null.
parallel – Enable parallel reading. False: sequential reading (default); True: use all CPU cores; int: use the specified number of threads.
verify_checksums – Verify each data file against its stored checksum (raises CorruptDataError on mismatch). Defaults to the DATASHARD_VERIFY_CHECKSUMS env var, which defaults to true.

Returns:

List of dictionaries, each representing a record.

Raises:

RuntimeError / OSError – If any referenced manifest or data file cannot be read - errors are never swallowed into partial results.
CorruptDataError – If checksum verification fails.
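The tri-state verify_checksums argument (True, False, or None for "use the environment default") can be resolved along these lines. This is a sketch under assumptions: the set of strings treated as false is a guess, since the accepted spellings of DATASHARD_VERIFY_CHECKSUMS are not documented here, and resolve_verify_checksums is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

# Assumption: these spellings disable verification; the real parser may differ.
_FALSE_STRINGS = {"0", "false", "no", "off"}


def resolve_verify_checksums(
    explicit: Optional[bool],
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Pick the effective setting: explicit argument first, then the env var."""
    if explicit is not None:
        return explicit
    env = os.environ if env is None else env
    raw = env.get("DATASHARD_VERIFY_CHECKSUMS")
    if raw is None:
        return True  # documented default when the variable is unset
    return raw.strip().lower() not in _FALSE_STRINGS
```

An explicit argument always wins, so a single call can opt out of verification without changing the process environment.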

to_pandas(columns: List[str] | None = None, filter: Dict[str, Any] | None = None, parallel: bool | int = False, verify_checksums: bool | None = None) → Any

Read the table’s CURRENT snapshot as a pandas DataFrame.

Same filtering, error, and checksum semantics as scan().

Raises:

ImportError – If pandas is not installed.

scan_batches(batch_size: int = 10000, columns: List[str] | None = None, filter: Dict[str, Any] | None = None, verify_checksums: bool | None = None) → Iterator[List[Dict[str, Any]]]

Scan data in batches for memory-efficient processing.

Yields batches of records, processing one parquet file at a time using PyArrow’s iter_batches for memory efficiency. Uses the same filter engine (and error semantics) as scan().

Parameters:

batch_size – Approximate number of records per batch
columns – Optional column projection
filter – Optional predicate pushdown filter
verify_checksums – As in scan().

Yields:

List of records (dicts) per batch

iter_records(columns: List[str] | None = None, filter: Dict[str, Any] | None = None, verify_checksums: bool | None = None) → Iterator[Dict[str, Any]]

Iterate over records one at a time.

Memory efficient - only one batch in memory at a time. Ideal for row-by-row processing of large tables.

Parameters:

columns – Optional column projection
filter – Optional predicate pushdown filter
verify_checksums – As in scan().

Yields:

Individual records as dicts

iter_pandas(chunksize: int = 50000, columns: List[str] | None = None, filter: Dict[str, Any] | None = None, verify_checksums: bool | None = None) → Iterator[Any]

Iterate over data as pandas DataFrame chunks.

Memory efficient - only one chunk in memory at a time. Ideal for processing large tables with pandas operations.

Parameters:

chunksize – Approximate rows per chunk
columns – Optional column projection
filter – Optional predicate pushdown filter
verify_checksums – As in scan().

Yields:

pandas DataFrame chunks

Raises:

ImportError – If pandas is not installed.

Table.scan() Method

Read data from the table with optional filtering and parallel processing.

Parameters:

columns (List[str], optional): Column names to read. If None, reads all columns.
filter (Dict[str, Any], optional): Filter dict for predicate pushdown.
parallel (bool or int, optional): Enable parallel reading. True uses all cores, int specifies thread count.
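One way to read the bool-or-int parallel argument is as a worker count. The sketch below is an assumption about how such a value could be interpreted, not the library's code; resolve_workers is a hypothetical name.

```python
import os
from typing import Union


def resolve_workers(parallel: Union[bool, int] = False) -> int:
    """Map the `parallel` argument to a number of worker threads."""
    # bool is a subclass of int, so it has to be tested first;
    # otherwise True would be read as "1 thread".
    if isinstance(parallel, bool):
        return (os.cpu_count() or 1) if parallel else 1
    if isinstance(parallel, int):
        if parallel < 1:
            raise ValueError("thread count must be at least 1")
        return parallel
    raise TypeError("parallel must be a bool or an int")
```

The ordering of the isinstance checks is the point of the example: parallel=True and parallel=1 are different requests even though True == 1 in Python.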

Filter Syntax:

# Equality
{"column": value}
{"column": ("==", value)}

# Comparison
{"column": (">", value)}
{"column": (">=", value)}
{"column": ("<", value)}
{"column": ("<=", value)}

# Membership
{"column": ("in", [v1, v2, v3])}
{"column": ("not_in", [v1, v2])}

# Range (inclusive on both ends)
{"column": ("between", (low, high))}

# Null check
{"column": ("is_null", True)}
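The meaning of the filter dict can be made concrete with a small in-memory matcher covering the operators above plus the is_null / is_not_null checks described for scan(). It is a reference sketch, not the library's pushdown engine. Assumptions: several keys combine with AND, a missing column is treated as NULL, and the second element of an is_null / is_not_null tuple is read as a boolean.

```python
import operator
from typing import Any, Dict

_COMPARATORS = {
    "==": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(record: Dict[str, Any], conditions: Dict[str, Any]) -> bool:
    """Return True if `record` satisfies every condition (assumed AND)."""
    for column, condition in conditions.items():
        value = record.get(column)  # assumption: missing column behaves as NULL
        if isinstance(condition, tuple) and len(condition) == 2 and isinstance(condition[0], str):
            op, operand = condition
        else:
            op, operand = "==", condition  # a bare value is shorthand for equality

        if op == "is_null":
            ok = (value is None) == bool(operand)
        elif op == "is_not_null":
            ok = (value is not None) == bool(operand)
        elif value is None:
            ok = False  # SQL semantics: comparisons and in/not_in never match NULL
        elif op in _COMPARATORS:
            ok = _COMPARATORS[op](value, operand)
        elif op == "in":
            ok = value in operand
        elif op == "not_in":
            ok = value not in operand
        elif op == "between":
            low, high = operand
            ok = low <= value <= high  # inclusive on both ends
        else:
            raise ValueError(f"unsupported operator: {op!r}")
        if not ok:
            return False
    return True
```

Note that not_in also rejects NULL values, which is the SQL behaviour that most often surprises callers.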

Example:

# Basic scan
records = table.scan()

# With filter
records = table.scan(filter={"status": "active"})

# With columns and parallel
records = table.scan(columns=["id", "name"], parallel=True)

Table.to_pandas() Method

Read data from the table as a pandas DataFrame.

Parameters:

Same as scan(). Returns a pandas DataFrame instead of a list of dicts.

Example:

df = table.to_pandas(filter={"age": (">", 30)}, parallel=True)

Table.scan_batches() Method

Iterate over data in batches for memory-efficient processing.

Parameters:

batch_size (int): Approximate number of records per batch. Default: 10000.
columns (List[str], optional): Column names to read.
filter (Dict[str, Any], optional): Filter dict for predicate pushdown.

Example:

for batch in table.scan_batches(batch_size=5000):
    process(batch)  # batch is List[Dict]
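A slightly fuller sketch of batch-wise processing: summing one column while holding a single batch in memory. The helper sum_column is hypothetical; it relies only on the scan_batches() signature documented above and skips None values.

```python
from typing import Any


def sum_column(table: Any, column: str, batch_size: int = 10_000) -> float:
    """Sum `column` over the table, one batch at a time."""
    total = 0.0
    # Projecting to a single column keeps each batch small.
    for batch in table.scan_batches(batch_size=batch_size, columns=[column]):
        for record in batch:
            value = record.get(column)
            if value is not None:
                total += value
    return total
```

Because batch_size is approximate, the code does not depend on any batch having an exact length.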

Table.iter_records() Method

Iterate over records one at a time.

Parameters:

columns (List[str], optional): Column names to read.
filter (Dict[str, Any], optional): Filter dict for predicate pushdown.

Example:

for record in table.iter_records(filter={"status": "failed"}):
    handle_failure(record)
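Record-at-a-time iteration suits streaming aggregations. The sketch below tallies values of one column with collections.Counter; count_by is a hypothetical helper that uses only the iter_records() signature documented above.

```python
from collections import Counter
from typing import Any, Dict, Optional


def count_by(table: Any, column: str, filter: Optional[Dict[str, Any]] = None) -> Counter:
    """Count records per distinct value of `column` without materializing the table."""
    counts: Counter = Counter()
    for record in table.iter_records(filter=filter):
        # Missing or NULL values are grouped under the key None.
        counts[record.get(column)] += 1
    return counts
```

Only the counter grows with the data, so memory use is bounded by the number of distinct values rather than the number of rows.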

Table.iter_pandas() Method

Iterate over data as pandas DataFrame chunks.

Parameters:

chunksize (int): Approximate rows per chunk. Default: 50000.
columns (List[str], optional): Column names to read.
filter (Dict[str, Any], optional): Filter dict for predicate pushdown.

Example:

results = []
for chunk_df in table.iter_pandas(chunksize=10000):
    # Per-chunk sums are partial; combine them after the loop
    results.append(chunk_df.groupby('x').sum())