=====================
S3-Compatible Storage
=====================

.. note::

   S3 storage support is available since DataShard v0.2.2

Overview
========

DataShard supports S3-compatible storage backends, enabling distributed workflows across multiple machines without a shared filesystem. You can store DataShard tables in:

* AWS S3
* MinIO
* DigitalOcean Spaces
* Wasabi
* Any S3-compatible object storage

Benefits
========

* ✅ **Distributed Workflows**: Workers on different machines can write concurrently
* ✅ **Cloud-Native**: No shared filesystem required
* ✅ **ACID Guarantees**: Same transactional safety as local storage
* ✅ **High Durability**: 99.999999999% (11 nines) object durability with AWS S3
* ✅ **Cost-Effective**: More than 20x cheaper than a small RDS instance for workflow logs (see `Cost Analysis`_)
* ✅ **Scalable**: Auto-scaling workers can write without coordination

Configuration
=============

Environment Variables
---------------------

Configure S3 storage using environment variables:

.. code-block:: bash

    # Enable S3 storage
    export DATASHARD_STORAGE_TYPE=s3

    # S3 endpoint (use AWS S3 or a custom endpoint for MinIO)
    export DATASHARD_S3_ENDPOINT=https://s3.amazonaws.com

    # S3 credentials
    export DATASHARD_S3_ACCESS_KEY=your-access-key-id
    export DATASHARD_S3_SECRET_KEY=your-secret-access-key

    # S3 bucket and region
    export DATASHARD_S3_BUCKET=my-datashard-bucket
    export DATASHARD_S3_REGION=us-east-1

    # Optional: Prefix within bucket
    export DATASHARD_S3_PREFIX=optional/prefix/

AWS S3 Setup
------------

Create an S3 bucket:

.. code-block:: bash

    aws s3 mb s3://my-datashard-bucket --region us-east-1

IAM Policy (Minimum Permissions):

.. code-block:: json

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:PutObject",
            "s3:DeleteObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::my-datashard-bucket",
            "arn:aws:s3:::my-datashard-bucket/*"
          ]
        }
      ]
    }

MinIO Setup
-----------

Start MinIO with Docker:

.. code-block:: bash

    docker run -d \
      -p 9000:9000 \
      -p 9001:9001 \
      -e MINIO_ROOT_USER=minioadmin \
      -e MINIO_ROOT_PASSWORD=minioadmin \
      --name minio \
      minio/minio server /data --console-address ":9001"

Configure DataShard for MinIO:

.. code-block:: bash

    export DATASHARD_STORAGE_TYPE=s3
    export DATASHARD_S3_ENDPOINT=http://localhost:9000
    export DATASHARD_S3_ACCESS_KEY=minioadmin
    export DATASHARD_S3_SECRET_KEY=minioadmin
    export DATASHARD_S3_BUCKET=datashard
    export DATASHARD_S3_REGION=us-east-1

Usage Examples
==============

Basic Table Creation
--------------------

.. code-block:: python

    from datashard import create_table, Schema

    # Configure S3 via environment variables,
    # then use DataShard normally - the API is identical.
    schema = Schema(
        schema_id=1,
        fields=[
            {"id": 1, "name": "user_id", "type": "long", "required": True},
            {"id": 2, "name": "event", "type": "string", "required": True},
            {"id": 3, "name": "timestamp", "type": "long", "required": True}
        ]
    )

    # Creates the table in the S3 bucket
    table = create_table("user-events", schema)

    # Append records (stored in S3)
    table.append_records([
        {"user_id": 1, "event": "login", "timestamp": 1700000000},
        {"user_id": 2, "event": "purchase", "timestamp": 1700000100}
    ], schema)

Distributed Workflow
--------------------

**Worker 1** (us-east-1a):

.. code-block:: python

    from datashard import create_table, Schema

    schema = Schema(schema_id=1, fields=[
        {"id": 1, "name": "task_id", "type": "string", "required": True},
        {"id": 2, "name": "status", "type": "string", "required": True},
    ])

    table = create_table("task-results", schema)

    table.append_records([
        {"task_id": "t1", "status": "success"}
    ], schema)

**Worker 2** (us-east-1b, simultaneously):

.. code-block:: python

    from datashard import load_table, Schema

    # Same schema that Worker 1 used to create the table
    schema = Schema(schema_id=1, fields=[
        {"id": 1, "name": "task_id", "type": "string", "required": True},
        {"id": 2, "name": "status", "type": "string", "required": True},
    ])

    # Loads from S3, sees Worker 1's table
    table = load_table("task-results")

    # Concurrent append with ACID guarantees
    table.append_records([
        {"task_id": "t2", "status": "success"}
    ], schema)

**Analysis Server** (us-west-2):

.. code-block:: python

    from datashard import load_table

    # Query from a different region
    table = load_table("task-results")

    # Get the current snapshot
    snapshot = table.current_snapshot()
    print(f"Snapshot ID: {snapshot.snapshot_id}")

Storage Structure
=================

DataShard creates the following structure in S3:

.. code-block:: text

    s3://my-bucket/optional-prefix/
    └── table-name/
        ├── data/
        │   ├── auto_1763162900677093.parquet
        │   └── auto_1763162915234567.parquet
        ├── metadata/
        │   ├── v0.metadata.json
        │   ├── v1.metadata.json
        │   └── manifests/
        │       ├── manifest_1763162900796257.avro
        │       └── manifest_list_*.avro
        └── metadata.version-hint.text

Architecture
============

DataShard uses a dual-API approach for S3:

**Metadata Layer** (boto3):

- JSON metadata files (table metadata)
- Avro manifest files
- Version hints
- Fast small-file operations

**Data Layer** (PyArrow S3FileSystem):

- Parquet data files
- Native compression
- Efficient columnar I/O
- Optimized for large files

Both APIs are coordinated through the ``StorageBackend`` abstraction.

Performance
===========

Latency Comparison
------------------

.. list-table::
   :header-rows: 1
   :widths: 40 20 20 20

   * - Operation
     - Local Disk
     - S3 (same region)
     - S3 (cross-region)
   * - Create table
     - <1 ms
     - ~50 ms
     - ~150 ms
   * - Append 100 records
     - ~2 ms
     - ~100 ms
     - ~300 ms
   * - Append 10,000 records
     - ~50 ms
     - ~500 ms
     - ~1500 ms
   * - Read 50,000 records
     - ~20 ms
     - ~150 ms
     - ~500 ms

Cost Analysis
-------------

**Example**: 100,000 workflow executions/month (~1 GB of logs)

**S3 Standard (us-east-1)**:

* Storage: 1 GB × $0.023/GB/month = $0.023
* Writes: 100,000 × $0.005/1000 = $0.50
* Reads: 10,000 × $0.0004/1000 = $0.004
* **Total: ~$0.53/month**

**Compare to**:

* RDS PostgreSQL (db.t3.micro): $12.41/month
* EBS volume (100 GB gp3): $8/month

**In this example, S3 is more than 23x cheaper than RDS.**

Optimization Tips
-----------------

1. **Batch writes** - Group multiple records into a single transaction
2. **Regional colocation** - Deploy workers in the same region as the S3 bucket
3. **VPC endpoints** - Use VPC endpoints to avoid NAT gateway data-processing charges
4. **Connection pooling** - DataShard automatically reuses connections

Troubleshooting
===============

NoSuchBucket Error
------------------

**Problem**: The bucket doesn't exist.

**Solution**:

.. code-block:: bash

    aws s3 mb s3://my-bucket

Access Denied
-------------

**Problem**: Insufficient IAM permissions.

**Solution**: Verify that the IAM policy includes ``s3:PutObject``, ``s3:GetObject``, ``s3:DeleteObject``, and ``s3:ListBucket``.

Slow Performance
----------------

**Problem**: High latency to S3.

**Solutions**:

1. Run workers in the same AWS region as the bucket
2. Enable VPC endpoints
3. Batch writes (100+ records per transaction)
4. Consider S3 Transfer Acceleration

ImportError: boto3
------------------

**Problem**: boto3 is not installed.

**Solution**:

.. code-block:: bash

    pip install "datashard[s3]"
    # or
    pip install boto3

Best Practices
==============

1. **Use IAM Roles** on AWS compute instead of hardcoded credentials
2. **Enable S3 Encryption** at rest (SSE-S3 or SSE-KMS)
3. **Set up Lifecycle Policies** to archive old snapshots to Glacier
4. **Monitor S3 Costs** with AWS Cost Explorer
5. **Use VPC Endpoints** for S3 to reduce costs
6. **Enable S3 Versioning** for critical tables
7. **Set up CloudWatch Alarms** for S3 access errors

Security
========

Encryption at Rest
------------------

Enable server-side encryption:

.. code-block:: bash

    aws s3api put-bucket-encryption \
      --bucket my-bucket \
      --server-side-encryption-configuration '{
        "Rules": [{
          "ApplyServerSideEncryptionByDefault": {
            "SSEAlgorithm": "AES256"
          }
        }]
      }'

IAM Roles (Recommended)
-----------------------

For EC2/ECS/Lambda, use IAM roles instead of access keys:

.. code-block:: bash

    # No credentials needed - the IAM role provides them
    export DATASHARD_STORAGE_TYPE=s3
    export DATASHARD_S3_BUCKET=my-bucket
    export DATASHARD_S3_REGION=us-east-1

See Also
========

* :doc:`installation` - Installing DataShard with S3 support
* :doc:`quickstart` - Getting started with DataShard
* :doc:`transactions` - Understanding ACID transactions
* :doc:`troubleshooting` - Common issues and solutions
* `Full S3 Storage Guide `_
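
Worked Example: Cost Estimate
=============================

The monthly S3 figure in the Cost Analysis section can be reproduced with a short calculation. The sketch below is illustrative only: ``estimate_s3_monthly_cost`` and its parameters are hypothetical names, not part of the DataShard API, and the default prices are simply the example rates quoted in that section, which may differ from current AWS pricing.

.. code-block:: python

    def estimate_s3_monthly_cost(
        storage_gb: float,
        write_requests: int,
        read_requests: int,
        storage_price_per_gb: float = 0.023,   # example rate, USD per GB-month
        write_price_per_1000: float = 0.005,   # example rate, USD per 1,000 writes
        read_price_per_1000: float = 0.0004,   # example rate, USD per 1,000 reads
    ) -> float:
        """Return an estimated monthly S3 cost in USD (hypothetical helper)."""
        storage = storage_gb * storage_price_per_gb
        writes = write_requests / 1000 * write_price_per_1000
        reads = read_requests / 1000 * read_price_per_1000
        return storage + writes + reads


    # Workload from the Cost Analysis example: ~1 GB, 100,000 writes, 10,000 reads
    s3_cost = estimate_s3_monthly_cost(
        storage_gb=1, write_requests=100_000, read_requests=10_000
    )
    rds_cost = 12.41  # db.t3.micro figure from the comparison

    print(f"S3: ${s3_cost:.2f}/month")
    print(f"RDS / S3 cost ratio: {rds_cost / s3_cost:.1f}x")

Changing the request counts or prices shows how sensitive the total is to write volume, which dominates the bill in this example.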