datashard Documentation

Welcome to the official documentation for datashard, a Python implementation of Apache Iceberg concepts for safe concurrent data operations in ML/AI workloads.

About datashard

datashard provides ACID transactions, time travel, and safe concurrent access to data in a pure Python implementation of Apache Iceberg concepts. It’s designed specifically for ML/AI workloads where data integrity and concurrent access are critical.

Key Features:

  • ACID Transactions: Atomic, Consistent, Isolated, Durable operations

  • Time Travel: Query data as it existed at any point in time

  • Safe Concurrency: Multiple processes can read/write without corruption

  • S3-Compatible Storage: AWS S3, MinIO, DigitalOcean Spaces support (v0.2.2+)

  • Distributed Workflows: Run workers across different machines with shared S3 storage

  • Schema Evolution: Add/remove columns while maintaining history

  • Multiple File Formats: Parquet, Avro, ORC support

  • Data Integrity: Built-in validation and error handling
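As a rough illustration of how snapshot-based time travel and safe concurrent commits fit together in Iceberg-style tables, here is a minimal in-memory sketch. It is not datashard's API: `ToyTable`, `Snapshot`, `commit_append`, and `as_of` are hypothetical names invented for this example, and the file names are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable record of which data files were visible after one commit."""
    snapshot_id: int
    timestamp_ms: int
    files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical in-memory table used only for illustration."""
    snapshots: List[Snapshot] = field(default_factory=list)

    def current(self) -> Optional[Snapshot]:
        return self.snapshots[-1] if self.snapshots else None

    def commit_append(self, new_files: Sequence[str], timestamp_ms: int,
                      expected_parent: Optional[int]) -> Snapshot:
        # Optimistic concurrency: the commit succeeds only if the table head
        # is still the snapshot the writer started from.
        head = self.current()
        head_id = head.snapshot_id if head else None
        if head_id != expected_parent:
            raise RuntimeError("table changed since it was read; retry the commit")
        base = head.files if head else ()
        snap = Snapshot(
            snapshot_id=(head_id or 0) + 1,
            timestamp_ms=timestamp_ms,
            files=base + tuple(new_files),
        )
        self.snapshots.append(snap)
        return snap

    def as_of(self, timestamp_ms: int) -> Optional[Snapshot]:
        # Time travel: latest snapshot committed at or before the timestamp.
        # Assumes commits arrive with non-decreasing timestamps.
        visible = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        return visible[-1] if visible else None


table = ToyTable()
first = table.commit_append(["part-0.parquet"], 1000, expected_parent=None)
second = table.commit_append(["part-1.parquet"], 2000,
                             expected_parent=first.snapshot_id)
print(table.as_of(1500).files)
```

Because every commit produces a new immutable snapshot, a reader that pins a snapshot never sees a half-finished write, and a writer holding a stale parent ID is rejected rather than overwriting newer data. A real implementation would also need an atomic metadata swap on the underlying storage, which this sketch does not attempt.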
