S3-Compatible Storage
Note
S3 storage support is available in DataShard v0.2.2 and later.
Overview
DataShard supports S3-compatible storage backends, enabling distributed workflows across multiple machines without shared filesystems. This feature allows you to store your DataShard tables in:
AWS S3
MinIO
DigitalOcean Spaces
Wasabi
Any S3-compatible object storage
Benefits
✅ Distributed Workflows: Workers on different machines can write concurrently
✅ Cloud-Native: No shared filesystem required
✅ ACID Guarantees: Same transactional safety as local storage
✅ High Durability: AWS S3 is designed for 99.999999999% (11 nines) object durability
✅ Cost-Effective: Roughly 23x cheaper than RDS for workflow logs (see Cost Analysis)
✅ Scalable: Auto-scaling workers can write without coordination
Configuration
Environment Variables
Configure S3 storage using environment variables:
# Enable S3 storage
export DATASHARD_STORAGE_TYPE=s3
# S3 endpoint (use AWS S3 or custom endpoint for MinIO)
export DATASHARD_S3_ENDPOINT=https://s3.amazonaws.com
# S3 credentials
export DATASHARD_S3_ACCESS_KEY=your-access-key-id
export DATASHARD_S3_SECRET_KEY=your-secret-access-key
# S3 bucket and region
export DATASHARD_S3_BUCKET=my-datashard-bucket
export DATASHARD_S3_REGION=us-east-1
# Optional: Prefix within bucket
export DATASHARD_S3_PREFIX=optional/prefix/
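Before starting a worker, it can help to confirm that the variables above are actually set. The following is a minimal sketch and not part of DataShard: `missing_s3_settings` is a hypothetical helper, and treating only the storage type, bucket, and region as required is an assumption based on the IAM Roles example later on this page, where access keys are omitted.

```python
import os

# Assumption: these three are always needed; access keys are optional
# when an IAM role supplies credentials.
REQUIRED = (
    "DATASHARD_STORAGE_TYPE",
    "DATASHARD_S3_BUCKET",
    "DATASHARD_S3_REGION",
)


def missing_s3_settings(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


# Usage:
#   missing = missing_s3_settings()
#   if missing:
#       raise SystemExit(f"Missing S3 settings: {', '.join(missing)}")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in unit tests.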
AWS S3 Setup
Create an S3 bucket:
aws s3 mb s3://my-datashard-bucket --region us-east-1
IAM Policy (Minimum Permissions):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-datashard-bucket",
        "arn:aws:s3:::my-datashard-bucket/*"
      ]
    }
  ]
}
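When several buckets need the same permissions, the policy can be generated rather than edited by hand. This is a small sketch that reproduces the policy above for an arbitrary bucket name; `datashard_policy` is a hypothetical helper, not a DataShard function.

```python
import json


def datashard_policy(bucket):
    """Build the minimum-permission policy shown above for a given bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                ],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # the bucket itself (ListBucket)
                    f"arn:aws:s3:::{bucket}/*",  # objects inside the bucket
                ],
            }
        ],
    }


policy_json = json.dumps(datashard_policy("my-datashard-bucket"), indent=2)
```

The resulting `policy_json` string can be saved to a file and attached to a user or role through the IAM console or CLI.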
MinIO Setup
Start MinIO with Docker:
docker run -d \
-p 9000:9000 \
-p 9001:9001 \
-e MINIO_ROOT_USER=minioadmin \
-e MINIO_ROOT_PASSWORD=minioadmin \
--name minio \
minio/minio server /data --console-address ":9001"
Configure DataShard for MinIO:
export DATASHARD_STORAGE_TYPE=s3
export DATASHARD_S3_ENDPOINT=http://localhost:9000
export DATASHARD_S3_ACCESS_KEY=minioadmin
export DATASHARD_S3_SECRET_KEY=minioadmin
export DATASHARD_S3_BUCKET=datashard
export DATASHARD_S3_REGION=us-east-1
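A fresh MinIO container has no buckets, and this page does not say whether DataShard creates the bucket on first use. Assuming it must already exist, the sketch below creates it with boto3. `minio_client_kwargs` and `ensure_bucket` are hypothetical helpers; the defaults mirror the MinIO values above.

```python
import os


def minio_client_kwargs(env=None):
    """Translate DataShard's S3 variables into boto3.client() keyword arguments."""
    env = os.environ if env is None else env
    return {
        "endpoint_url": env.get("DATASHARD_S3_ENDPOINT", "http://localhost:9000"),
        "aws_access_key_id": env.get("DATASHARD_S3_ACCESS_KEY", "minioadmin"),
        "aws_secret_access_key": env.get("DATASHARD_S3_SECRET_KEY", "minioadmin"),
        "region_name": env.get("DATASHARD_S3_REGION", "us-east-1"),
    }


def ensure_bucket(client, bucket):
    """Create the bucket unless it is already listed; return True if created."""
    existing = {b["Name"] for b in client.list_buckets().get("Buckets", [])}
    if bucket in existing:
        return False
    client.create_bucket(Bucket=bucket)
    return True


# Usage (requires boto3 and a running MinIO):
#   import boto3
#   client = boto3.client("s3", **minio_client_kwargs())
#   ensure_bucket(client, os.environ.get("DATASHARD_S3_BUCKET", "datashard"))
```

The bucket can also be created from the MinIO console on port 9001.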
Usage Examples
Basic Table Creation
from datashard import create_table, Schema
# Configure S3 via environment variables
# Then use DataShard normally - API is identical!
schema = Schema(
    schema_id=1,
    fields=[
        {"id": 1, "name": "user_id", "type": "long", "required": True},
        {"id": 2, "name": "event", "type": "string", "required": True},
        {"id": 3, "name": "timestamp", "type": "long", "required": True}
    ]
)
# Creates table in S3 bucket
table = create_table("user-events", schema)
# Append records (stored in S3)
table.append_records([
    {"user_id": 1, "event": "login", "timestamp": 1700000000},
    {"user_id": 2, "event": "purchase", "timestamp": 1700000100}
], schema)
Distributed Workflow
Worker 1 (us-east-1a):
from datashard import create_table, Schema
schema = Schema(schema_id=1, fields=[
    {"id": 1, "name": "task_id", "type": "string", "required": True},
    {"id": 2, "name": "status", "type": "string", "required": True},
])
table = create_table("task-results", schema)
table.append_records([
    {"task_id": "t1", "status": "success"}
], schema)
Worker 2 (us-east-1b, simultaneously):
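from datashard import load_table, Schema
# Same schema as Worker 1; append_records needs it as its second argument
schema = Schema(schema_id=1, fields=[
    {"id": 1, "name": "task_id", "type": "string", "required": True},
    {"id": 2, "name": "status", "type": "string", "required": True},
])
# Loads from S3, sees Worker 1's table
table = load_table("task-results")
# Concurrent append with ACID guarantees
table.append_records([
    {"task_id": "t2", "status": "success"}
], schema)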
Analysis Server (us-west-2):
from datashard import load_table
# Query from different region
table = load_table("task-results")
# Get current snapshot
snapshot = table.current_snapshot()
print(f"Snapshot ID: {snapshot.snapshot_id}")
Storage Structure
DataShard creates the following structure in S3:
s3://my-bucket/optional-prefix/
└── table-name/
    ├── data/
    │   ├── auto_1763162900677093.parquet
    │   └── auto_1763162915234567.parquet
    ├── metadata/
    │   ├── v0.metadata.json
    │   ├── v1.metadata.json
    │   └── manifests/
    │       ├── manifest_1763162900796257.avro
    │       └── manifest_list_*.avro
    └── metadata.version-hint.text
Architecture
DataShard uses a dual-API approach for S3:
- Metadata Layer (boto3): JSON table metadata files, Avro manifest files, and version hints, where fast small-file operations matter.
- Data Layer (PyArrow S3FileSystem): Parquet data files, with native compression and efficient columnar I/O optimized for large files.
Both APIs are coordinated through the StorageBackend abstraction.
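To make the split concrete, the sketch below classifies object keys from the storage structure by the layer that would handle them. It is only an illustration of the routing idea: the actual StorageBackend interface is not shown on this page, and `storage_layer` is a hypothetical function.

```python
def storage_layer(key):
    """Return which layer would handle an object key: 'data' or 'metadata'."""
    if key.endswith(".parquet"):
        return "data"      # large columnar files -> PyArrow S3FileSystem
    if key.endswith((".json", ".avro", ".text")):
        return "metadata"  # small files -> boto3
    raise ValueError(f"Unrecognized DataShard object: {key}")


# Usage:
#   storage_layer("table-name/data/auto_1763162900677093.parquet")
#   storage_layer("table-name/metadata/v0.metadata.json")
```

Routing by file type keeps small, latency-sensitive metadata reads separate from bulk Parquet transfers.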
Performance
Latency Comparison
| Operation | Local Disk | S3 (same region) | S3 (cross-region) |
|---|---|---|---|
| Create table | <1ms | ~50ms | ~150ms |
| Append 100 records | ~2ms | ~100ms | ~300ms |
| Append 10,000 records | ~50ms | ~500ms | ~1500ms |
| Read 50,000 records | ~20ms | ~150ms | ~500ms |
Cost Analysis
Example: 100,000 workflow executions/month (~1GB logs)
S3 Standard (us-east-1):
Storage: $0.023/GB/month = $0.023
Writes: 100,000 × $0.005/1000 = $0.50
Reads: 10,000 × $0.0004/1000 = $0.004
Total: ~$0.53/month
Compare to:
RDS PostgreSQL (db.t3.micro): $12.41/month
EBS volume (100GB gp3): $8/month
In this example, S3 is roughly 23x cheaper than RDS ($12.41 vs. ~$0.53 per month).
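The arithmetic above can be reproduced and adapted to other workloads with a few lines of Python. The prices are the figures quoted in this section; the variable names are illustrative.

```python
# Workload from the example: ~1 GB of logs, 100,000 writes, 10,000 reads per month.
storage_gb = 1
writes = 100_000
reads = 10_000

storage_cost = storage_gb * 0.023    # $ per GB-month
write_cost = writes / 1000 * 0.005   # $ per 1,000 write requests
read_cost = reads / 1000 * 0.0004    # $ per 1,000 read requests
s3_total = storage_cost + write_cost + read_cost

rds_monthly = 12.41                  # db.t3.micro figure quoted above
ratio = rds_monthly / s3_total

print(f"S3 total: ${s3_total:.3f}/month")
print(f"RDS / S3: {ratio:.1f}x")
```

Write requests dominate the S3 bill here, which is why batching appends has a direct effect on cost as well as latency.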
Optimization Tips
Batch writes - Group multiple records into single transactions
Regional colocation - Deploy workers in same region as S3 bucket
VPC endpoints - Use an S3 gateway VPC endpoint so traffic from private subnets avoids NAT gateway data-processing charges
Connection pooling - DataShard automatically reuses connections
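The batching tip can be applied with a small buffer in front of `append_records`. The sketch below is one possible approach, not a DataShard feature: `BatchingWriter` is a hypothetical class that collects records and passes them to a caller-supplied function once a batch is full.

```python
class BatchingWriter:
    """Buffer records and hand them to `flush_fn` in batches of `batch_size`."""

    def __init__(self, flush_fn, batch_size=100):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._flush_fn = flush_fn
        self._batch_size = batch_size
        self._buffer = []

    def add(self, record):
        self._buffer.append(record)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send buffered records as one batch; keep them if the call fails."""
        if self._buffer:
            self._flush_fn(list(self._buffer))
            self._buffer.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Flush the remainder only on a clean exit.
        if exc_type is None:
            self.flush()
        return False


# Usage with a DataShard table and schema:
#   with BatchingWriter(lambda batch: table.append_records(batch, schema)) as writer:
#       for record in records:
#           writer.add(record)
```

Each flush becomes a single append, so a batch size of 100 turns 10,000 single-record appends into 100 transactions.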
Troubleshooting
NoSuchBucket Error
Problem: Bucket doesn’t exist.
Solution:
aws s3 mb s3://my-bucket
Access Denied
Problem: Insufficient IAM permissions.
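Solution: Verify that the IAM policy includes s3:GetObject, s3:PutObject, s3:DeleteObject, and s3:ListBucket, and that its Resource list covers both the bucket ARN and the bucket's objects (arn:aws:s3:::my-bucket/*).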
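To tell a missing bucket from a permissions problem before involving DataShard, a HeadBucket call is usually enough. The sketch below is a hypothetical diagnostic helper. It reads the error code from the exception's `response` attribute (which botocore's `ClientError` provides) and catches `Exception` broadly only so that the snippet stays importable without botocore installed.

```python
def diagnose_bucket(client, bucket):
    """Classify a HeadBucket result as 'ok', 'missing', 'denied', or 'error'."""
    try:
        client.head_bucket(Bucket=bucket)
        return "ok"
    except Exception as exc:  # botocore raises ClientError with a .response dict
        response = getattr(exc, "response", None) or {}
        code = str(response.get("Error", {}).get("Code", ""))
        if code in ("404", "NoSuchBucket"):
            return "missing"
        if code in ("403", "AccessDenied"):
            return "denied"
        return "error"


# Usage (requires boto3 and configured credentials):
#   import boto3
#   print(diagnose_bucket(boto3.client("s3"), "my-bucket"))
```

A result of "denied" points to the IAM policy, while "missing" points to the bucket name, region, or endpoint.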
Slow Performance
Problem: High latency to S3.
Solutions:
Use same AWS region as workers
Enable VPC endpoints
Batch writes (100+ records per transaction)
Consider S3 Transfer Acceleration
ImportError: boto3
Problem: boto3 not installed.
Solution:
pip install datashard[s3]
# or
pip install boto3
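A worker can check for the S3 dependencies at startup and fail with a clearer message than a late ImportError. This is a minimal sketch with hypothetical helper names; checking for `pyarrow` as well as `boto3` is an assumption based on the Architecture section.

```python
import importlib.util


def has_module(name):
    """Return True if `name` can be imported in the current environment."""
    return importlib.util.find_spec(name) is not None


def missing_s3_dependencies(modules=("boto3", "pyarrow")):
    """Return the subset of `modules` that is not installed."""
    return [m for m in modules if not has_module(m)]


# Usage:
#   missing = missing_s3_dependencies()
#   if missing:
#       raise SystemExit(f"Install S3 support first: missing {', '.join(missing)}")
```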
Best Practices
Use IAM Roles on AWS compute instead of hardcoded credentials
Enable S3 Encryption at rest (SSE-S3 or SSE-KMS)
Set up Lifecycle Policies to archive old snapshots to Glacier
Monitor S3 Costs with AWS Cost Explorer
Use VPC Endpoints for S3 to reduce costs
Enable S3 Versioning for critical tables
Set up CloudWatch Alarms for S3 access errors
Security
Encryption at Rest
Enable server-side encryption:
aws s3api put-bucket-encryption \
--bucket my-bucket \
--server-side-encryption-configuration '{
"Rules": [{
"ApplyServerSideEncryptionByDefault": {
"SSEAlgorithm": "AES256"
}
}]
}'
IAM Roles (Recommended)
For EC2/ECS/Lambda, use IAM roles instead of access keys:
# No credentials needed - IAM role provides them
export DATASHARD_STORAGE_TYPE=s3
export DATASHARD_S3_BUCKET=my-bucket
export DATASHARD_S3_REGION=us-east-1
See Also
Installation - Installing DataShard with S3 support
Quickstart Guide - Getting started with DataShard
Transactions and ACID Properties - Understanding ACID transactions
Troubleshooting - Common issues and solutions