datashard Documentation

Welcome to the official documentation for datashard, a Python implementation of Apache Iceberg concepts for safe concurrent data operations in ML/AI workloads.

Getting Started

User Guide

API Reference

Advanced Topics

Community

About datashard

datashard is a pure-Python implementation of Apache Iceberg concepts that provides ACID transactions, time travel, and safe concurrent access to data. It is designed specifically for ML/AI workloads where data integrity and concurrent access are critical.

Key Features:

ACID Transactions: Atomic, Consistent, Isolated, Durable operations
Time Travel: Query data as it existed at any earlier committed snapshot
Safe Concurrency: Multiple processes can read/write without corruption
S3-Compatible Storage: AWS S3, MinIO, DigitalOcean Spaces support (v0.2.2+)
Distributed Workflows: Run workers across different machines with shared S3 storage
Schema Evolution: Add/remove columns while maintaining history
Multiple File Formats: Parquet, Avro, ORC support
Data Integrity: Built-in validation and error handling
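
To make the transaction, time travel, and concurrency features above more concrete, here is a minimal in-memory sketch of the general snapshot-log idea that Iceberg-style tables build on. It is not datashard's API: the names SnapshotLog, Snapshot, CommitConflict, and append_with_retry are hypothetical and exist only for this illustration. Each commit appends a new immutable snapshot, a writer whose base snapshot has gone stale is rejected and retries, and a reader can ask for the rows as of any earlier snapshot id.

```python
import threading
from dataclasses import dataclass
from typing import Optional


class CommitConflict(Exception):
    """Raised when a writer's base snapshot is no longer the latest one."""


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    rows: tuple  # immutable view of the table contents at this snapshot


class SnapshotLog:
    """Hypothetical toy table: every commit appends a new immutable snapshot."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # Snapshot 0 is the empty table; ids match positions in this list.
        self._snapshots = [Snapshot(snapshot_id=0, rows=())]

    def current(self) -> Snapshot:
        with self._lock:
            return self._snapshots[-1]

    def commit(self, base_id: int, new_rows: list) -> Snapshot:
        """Append new_rows atomically, but only if base_id is still the latest."""
        with self._lock:
            latest = self._snapshots[-1]
            if latest.snapshot_id != base_id:
                raise CommitConflict(
                    f"base {base_id} is stale; latest is {latest.snapshot_id}"
                )
            snap = Snapshot(latest.snapshot_id + 1, latest.rows + tuple(new_rows))
            self._snapshots.append(snap)
            return snap

    def read(self, snapshot_id: Optional[int] = None) -> tuple:
        """Return the latest rows, or the rows as of an earlier snapshot id."""
        with self._lock:
            if snapshot_id is None:
                return self._snapshots[-1].rows
            if not 0 <= snapshot_id < len(self._snapshots):
                raise KeyError(f"unknown snapshot {snapshot_id}")
            return self._snapshots[snapshot_id].rows


def append_with_retry(log: SnapshotLog, rows: list, max_attempts: int = 5) -> Snapshot:
    """Optimistic concurrency: re-read the latest snapshot and retry on conflict."""
    for _ in range(max_attempts):
        base = log.current()
        try:
            return log.commit(base.snapshot_id, rows)
        except CommitConflict:
            continue
    raise CommitConflict("gave up after repeated conflicts")


# Usage: two commits, then a read of the table as of the first commit.
table = SnapshotLog()
first = append_with_retry(table, [{"id": 1}])
second = append_with_retry(table, [{"id": 2}])
rows_at_first_commit = table.read(first.snapshot_id)
latest_rows = table.read()
```

A real table format persists snapshots and metadata to storage rather than holding them in memory, and the atomic step is typically a metadata pointer swap rather than a process-local lock. The sketch only shows why immutable snapshots make both conflict detection and time travel straightforward.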
