Core Concepts

This section explains the fundamental concepts behind datashard and how they relate to Apache Iceberg.

What is datashard?

datashard is a Python implementation of Apache Iceberg concepts that provides ACID transactions, time travel, and safe concurrent access to data. Multiple processes can read and write the same table at once without corrupting it or leaving it in an inconsistent state.

Key Concepts

Table

A table in datashard represents a collection of data files with associated metadata. Each table has:

A location on the filesystem
Metadata describing schema, partitioning, and other properties
Snapshots that capture the state of the table at different points in time
Manifest files that catalog the data files in the table

Snapshot

A snapshot is a point-in-time view of the table. Each snapshot includes:

A unique ID and timestamp
A manifest list that identifies, through its manifests, the data files that make up the table at that point in time
Information about the operation that created the snapshot
References to parent snapshots for lineage tracking
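
The snapshot fields listed above can be pictured with a small model. This is a minimal sketch in plain Python, not datashard's actual classes: the names Snapshot, lineage, and snapshot_as_of, the field names, and the file names are all assumptions chosen for illustration. It shows how parent references give a lineage and how a timestamp selects a snapshot for time travel.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical snapshot record; field names are illustrative only."""
    snapshot_id: int
    timestamp_ms: int
    manifest_list: str               # path to the manifest list for this snapshot
    operation: str                   # what created the snapshot, e.g. "append"
    parent_id: Optional[int] = None  # None for the first snapshot


def lineage(snapshots: Dict[int, Snapshot], snapshot_id: int) -> List[int]:
    """Follow parent references from a snapshot back to the first one."""
    chain: List[int] = []
    current: Optional[int] = snapshot_id
    while current is not None:
        chain.append(current)
        current = snapshots[current].parent_id
    return chain


def snapshot_as_of(snapshots: Dict[int, Snapshot], timestamp_ms: int) -> Optional[Snapshot]:
    """Return the latest snapshot created at or before timestamp_ms, if any."""
    candidates = [s for s in snapshots.values() if s.timestamp_ms <= timestamp_ms]
    return max(candidates, key=lambda s: s.timestamp_ms, default=None)


# Three snapshots, each pointing at its parent.
history = {
    1: Snapshot(1, 1_000, "snap-1.avro", "append"),
    2: Snapshot(2, 2_000, "snap-2.avro", "append", parent_id=1),
    3: Snapshot(3, 3_000, "snap-3.avro", "overwrite", parent_id=2),
}
```

With this model, lineage(history, 3) walks 3, 2, 1, and snapshot_as_of(history, 2_500) selects snapshot 2 because it is the newest one not later than the requested time.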

Metadata

Metadata contains all the information about the table structure, including:

Schema definitions
Partition specifications
Current and historical snapshots
Table properties and configuration

Manifest

A manifest is a file that contains metadata about data files. It includes:

File paths and sizes
Partition values for each file
Record counts
Column statistics
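
To make the role of these fields concrete, the sketch below models manifest entries in plain Python and uses the column statistics to skip files that cannot contain a value. It is an illustration under assumptions, not datashard's on-disk format: DataFileEntry, total_records, files_matching, and the min/max "bounds" layout are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DataFileEntry:
    """Hypothetical manifest entry describing one data file."""
    path: str
    size_bytes: int
    partition: Dict[str, str]
    record_count: int
    # column name -> (min value, max value) observed in the file
    bounds: Dict[str, Tuple[int, int]] = field(default_factory=dict)


def total_records(entries: List[DataFileEntry]) -> int:
    """Count rows from metadata alone, without opening any data file."""
    return sum(e.record_count for e in entries)


def files_matching(entries: List[DataFileEntry], column: str, value: int) -> List[DataFileEntry]:
    """Keep files whose [min, max] range for `column` could contain `value`."""
    kept = []
    for entry in entries:
        bound = entry.bounds.get(column)
        # A file without statistics cannot be ruled out, so it is kept.
        if bound is None or bound[0] <= value <= bound[1]:
            kept.append(entry)
    return kept


manifest = [
    DataFileEntry("data/a.parquet", 1_024, {"day": "2024-01-01"}, 100, {"user_id": (1, 50)}),
    DataFileEntry("data/b.parquet", 2_048, {"day": "2024-01-02"}, 200, {"user_id": (51, 90)}),
    DataFileEntry("data/c.parquet", 512, {"day": "2024-01-02"}, 10),  # no statistics
]
```

Here a lookup for user_id 60 only needs to read b.parquet and c.parquet; a.parquet is skipped because its recorded range ends at 50.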

Transactions

datashard uses ACID transactions to ensure data consistency:

Atomic: Operations either complete entirely or not at all
Consistent: Transactions maintain data integrity constraints
Isolated: Concurrent transactions don’t interfere with each other
Durable: Once committed, changes are permanent
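
One common way to get atomic, durable commits on a local filesystem is to write the new metadata to a temporary file, flush it to disk, and then publish it under its final name in a single step that fails if the name is already taken. The sketch below shows that general technique using only the standard library. It is not a description of datashard's internals: commit_version and the "v<N>.metadata.json" naming are assumptions, and the approach relies on the filesystem supporting hard links.

```python
import json
import os
import tempfile


def commit_version(table_dir: str, version: int, metadata: dict) -> bool:
    """Publish `metadata` as version `version`; return False if that version exists."""
    # Hypothetical naming scheme for versioned metadata files.
    final_path = os.path.join(table_dir, f"v{version}.metadata.json")
    fd, tmp_path = tempfile.mkstemp(dir=table_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(metadata, f)
            f.flush()
            os.fsync(f.fileno())  # durability: force the bytes to disk first
        try:
            # Atomic publish: the link appears fully written or not at all,
            # and the call fails if another writer already claimed this version.
            os.link(tmp_path, final_path)
        except FileExistsError:
            return False
        return True
    finally:
        os.unlink(tmp_path)  # the temporary name is never left behind
```

Readers that only ever open published version files never observe a half-written commit, and a writer that loses the race gets False back and can re-read the table before trying again.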

How it Relates to Apache Iceberg

Apache Iceberg is an open table format whose reference implementation is written in Java and is commonly used with Spark, Hive, and other big data tools. datashard implements the same concepts in pure Python:

Same Metadata Format: Uses the same metadata structure as Apache Iceberg
Same Concurrency Model: Implements Optimistic Concurrency Control (OCC)
Same Time Travel: Provides the ability to examine historical data states
Same Schema Evolution: Supports adding, removing, and changing columns
Same File Organization: Uses similar manifest (Avro) and metadata (JSON) file structures
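
Optimistic Concurrency Control, mentioned in the list above, follows a read, prepare, commit-if-unchanged cycle with a retry on conflict. The following is a minimal in-memory sketch of that cycle, assuming a single version counter as the conflict check. VersionedTable, compare_and_swap, and append_with_retry are hypothetical names and do not reflect datashard's API.

```python
import threading
from typing import Tuple


class VersionedTable:
    """In-memory stand-in for a table pointer: a version plus its data files."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # guards only the brief compare-and-swap
        self._version = 0
        self._files: Tuple[str, ...] = ()

    def read(self) -> Tuple[int, Tuple[str, ...]]:
        with self._lock:
            return self._version, self._files

    def compare_and_swap(self, expected_version: int, new_files: Tuple[str, ...]) -> bool:
        """Install new_files only if nobody committed since expected_version."""
        with self._lock:
            if self._version != expected_version:
                return False  # another writer committed first
            self._version += 1
            self._files = new_files
            return True


def append_with_retry(table: VersionedTable, new_file: str, max_attempts: int = 10) -> int:
    """Append a file using optimistic retries; return the committed version."""
    for _ in range(max_attempts):
        version, files = table.read()                  # 1. read the current state
        proposed = files + (new_file,)                 # 2. prepare the change, no lock held
        if table.compare_and_swap(version, proposed):  # 3. commit only if unchanged
            return version + 1
        # Conflict: loop again and rebuild the change on top of the newer state.
    raise RuntimeError("commit failed after repeated conflicts")
```

Writers never block each other while preparing their changes. A writer whose base version is stale simply fails the swap and retries against the latest state, which is why this approach suits workloads where conflicts are rare.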

Benefits of datashard

Pure Python: No Java dependencies or JVM required
ML/AI Focus: Designed for machine learning and data science workflows
Easy Integration: Can be used in any Python environment
Scalable: Can handle large datasets with appropriate file formats
Safe Concurrency: Multiple processes can read/write safely
Historical Access: Time travel capabilities for reproducible analysis