S3-Compatible Storage

Note

S3 storage support has been available since DataShard v0.2.2.

Overview

DataShard supports S3-compatible storage backends, enabling distributed workflows across multiple machines without shared filesystems. This feature allows you to store your DataShard tables in:

AWS S3
MinIO
DigitalOcean Spaces
Wasabi
Any S3-compatible object storage

Benefits

✅ Distributed Workflows: Workers on different machines can write concurrently

✅ Cloud-Native: No shared filesystem required

✅ ACID Guarantees: Same transactional safety as local storage

✅ High Durability: AWS S3 is designed for 99.999999999% (11 nines) object durability

✅ Cost-Effective: ~23x cheaper than RDS for workflow logs (see Cost Analysis)

✅ Scalable: Auto-scaling workers can write without coordination

Configuration

Environment Variables

Configure S3 storage using environment variables:

# Enable S3 storage
export DATASHARD_STORAGE_TYPE=s3

# S3 endpoint (use AWS S3 or custom endpoint for MinIO)
export DATASHARD_S3_ENDPOINT=https://s3.amazonaws.com

# S3 credentials
export DATASHARD_S3_ACCESS_KEY=your-access-key-id
export DATASHARD_S3_SECRET_KEY=your-secret-access-key

# S3 bucket and region
export DATASHARD_S3_BUCKET=my-datashard-bucket
export DATASHARD_S3_REGION=us-east-1

# Optional: Prefix within bucket
export DATASHARD_S3_PREFIX=optional/prefix/
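A missing variable may only surface later as a connection or credential error, so it can help to check the environment before creating tables. The sketch below is not part of DataShard: `missing_s3_settings` is a hypothetical helper that only inspects the variable names listed above. It assumes every variable except `DATASHARD_S3_PREFIX` is required, which may be stricter than DataShard itself.

```python
import os

# Variable names come from the configuration above. Treating all of them as
# required is an assumption of this sketch, not documented DataShard behavior.
REQUIRED_S3_VARS = (
    "DATASHARD_S3_ENDPOINT",
    "DATASHARD_S3_ACCESS_KEY",
    "DATASHARD_S3_SECRET_KEY",
    "DATASHARD_S3_BUCKET",
    "DATASHARD_S3_REGION",
)


def missing_s3_settings(env=None):
    """Return the names of required S3 variables that are unset or empty.

    Returns an empty list when DATASHARD_STORAGE_TYPE is not "s3".
    """
    env = os.environ if env is None else env
    if env.get("DATASHARD_STORAGE_TYPE") != "s3":
        return []
    return [name for name in REQUIRED_S3_VARS if not env.get(name)]
```

Calling `missing_s3_settings()` at worker startup and failing fast on a non-empty result keeps configuration mistakes separate from genuine S3 errors.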

AWS S3 Setup

Create an S3 bucket:

aws s3 mb s3://my-datashard-bucket --region us-east-1

IAM Policy (Minimum Permissions):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-datashard-bucket",
        "arn:aws:s3:::my-datashard-bucket/*"
      ]
    }
  ]
}

MinIO Setup

Start MinIO with Docker:

docker run -d \
  -p 9000:9000 \
  -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  --name minio \
  minio/minio server /data --console-address ":9001"

Configure DataShard for MinIO:

export DATASHARD_STORAGE_TYPE=s3
export DATASHARD_S3_ENDPOINT=http://localhost:9000
export DATASHARD_S3_ACCESS_KEY=minioadmin
export DATASHARD_S3_SECRET_KEY=minioadmin
export DATASHARD_S3_BUCKET=datashard
export DATASHARD_S3_REGION=us-east-1
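A fresh MinIO container starts with no buckets, so the bucket named in `DATASHARD_S3_BUCKET` has to be created before first use (compare the NoSuchBucket entry under Troubleshooting). One way to do this is with boto3 directly. The sketch below assumes boto3 is installed and reads the same environment variables as above; `s3_client_kwargs` and `ensure_bucket` are illustrative names, not DataShard functions.

```python
import os


def s3_client_kwargs(env=None):
    """Translate the DATASHARD_S3_* variables into boto3 client arguments."""
    env = os.environ if env is None else env
    return {
        "endpoint_url": env["DATASHARD_S3_ENDPOINT"],
        "aws_access_key_id": env["DATASHARD_S3_ACCESS_KEY"],
        "aws_secret_access_key": env["DATASHARD_S3_SECRET_KEY"],
        "region_name": env.get("DATASHARD_S3_REGION", "us-east-1"),
    }


def ensure_bucket(env=None):
    """Create the configured bucket if a HEAD request says it is missing."""
    # Imported lazily so s3_client_kwargs stays usable without boto3.
    import boto3
    from botocore.exceptions import ClientError

    env = os.environ if env is None else env
    client = boto3.client("s3", **s3_client_kwargs(env))
    bucket = env["DATASHARD_S3_BUCKET"]
    try:
        client.head_bucket(Bucket=bucket)
    except ClientError:
        # Sufficient for MinIO and for us-east-1; on AWS, other regions
        # also need a CreateBucketConfiguration with a LocationConstraint.
        client.create_bucket(Bucket=bucket)
```

With the MinIO variables above exported, a single `ensure_bucket()` call before the first `create_table` should be enough for local development.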

Usage Examples

Basic Table Creation

from datashard import create_table, Schema

# Configure S3 via environment variables
# Then use DataShard normally - API is identical!

schema = Schema(
    schema_id=1,
    fields=[
        {"id": 1, "name": "user_id", "type": "long", "required": True},
        {"id": 2, "name": "event", "type": "string", "required": True},
        {"id": 3, "name": "timestamp", "type": "long", "required": True}
    ]
)

# Creates table in S3 bucket
table = create_table("user-events", schema)

# Append records (stored in S3)
table.append_records([
    {"user_id": 1, "event": "login", "timestamp": 1700000000},
    {"user_id": 2, "event": "purchase", "timestamp": 1700000100}
], schema)

Distributed Workflow

Worker 1 (us-east-1a):

from datashard import create_table, Schema

schema = Schema(schema_id=1, fields=[
    {"id": 1, "name": "task_id", "type": "string", "required": True},
    {"id": 2, "name": "status", "type": "string", "required": True},
])

table = create_table("task-results", schema)
table.append_records([
    {"task_id": "t1", "status": "success"}
], schema)

Worker 2 (us-east-1b, simultaneously):

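from datashard import load_table, Schema

# Same schema as Worker 1; it is passed to append_records below
schema = Schema(schema_id=1, fields=[
    {"id": 1, "name": "task_id", "type": "string", "required": True},
    {"id": 2, "name": "status", "type": "string", "required": True},
])

# Loads from S3, sees Worker 1's table
table = load_table("task-results")

# Concurrent append with ACID guarantees
table.append_records([
    {"task_id": "t2", "status": "success"}
], schema)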

Analysis Server (us-west-2):

from datashard import load_table

# Query from different region
table = load_table("task-results")

# Get current snapshot
snapshot = table.current_snapshot()
print(f"Snapshot ID: {snapshot.snapshot_id}")

Storage Structure

DataShard creates the following structure in S3:

s3://my-bucket/optional-prefix/
└── table-name/
    ├── data/
    │   ├── auto_1763162900677093.parquet
    │   └── auto_1763162915234567.parquet
    ├── metadata/
    │   ├── v0.metadata.json
    │   ├── v1.metadata.json
    │   └── manifests/
    │       ├── manifest_1763162900796257.avro
    │       └── manifest_list_*.avro
    └── metadata.version-hint.text
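When inspecting a bucket by hand, it can be convenient to compute the key prefixes for a table. The helper below is a sketch derived only from the tree shown above; `table_keys` is a hypothetical name, and the actual layout may differ between DataShard versions.

```python
import posixpath


def table_keys(table_name, prefix=""):
    """Return the object-key locations for one table under an optional prefix.

    The layout mirrors the directory tree in this section.
    """
    # S3 keys always use forward slashes, so posixpath is used on every OS.
    root = posixpath.join(prefix.strip("/"), table_name)
    return {
        "data": root + "/data/",
        "metadata": root + "/metadata/",
        "manifests": root + "/metadata/manifests/",
        "version_hint": root + "/metadata.version-hint.text",
    }
```

The returned prefixes can be passed to any S3 listing tool, for example as the `Prefix` argument of boto3's `list_objects_v2`.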

Architecture

DataShard uses a dual-API approach for S3:

Metadata Layer (boto3):

JSON metadata files (table metadata)
Avro manifest files
Version hints
Fast small-file operations

Data Layer (PyArrow S3FileSystem):

Parquet data files
Native compression
Efficient columnar I/O
Optimized for large files

Both APIs are coordinated through the StorageBackend abstraction.
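The split described above can be pictured as a routing decision by file type. The following is a minimal sketch of that idea only; `S3Api` and `api_for_key` are invented for illustration, and the real StorageBackend may route requests differently.

```python
from enum import Enum


class S3Api(Enum):
    BOTO3 = "boto3"      # metadata layer: JSON, Avro manifests, version hints
    PYARROW = "pyarrow"  # data layer: Parquet files


def api_for_key(key):
    """Pick an API for an object key, following the split described above."""
    if key.endswith(".parquet"):
        return S3Api.PYARROW
    return S3Api.BOTO3
```

Routing on the file extension keeps small metadata reads on the lightweight client while large columnar files go through the Parquet-aware filesystem.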

Performance

Latency Comparison

Operation	Local Disk	S3 (same region)	S3 (cross-region)
Create table	<1ms	~50ms	~150ms
Append 100 records	~2ms	~100ms	~300ms
Append 10,000 records	~50ms	~500ms	~1500ms
Read 50,000 records	~20ms	~150ms	~500ms
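The append rows suggest that S3 latency is dominated by a fixed per-request overhead rather than by record count. A short calculation using the same-region figures makes this concrete; the inputs are the approximate values quoted in the table, not new measurements.

```python
# Same-region S3 append figures from the table above: (records, latency in ms).
appends = [(100, 100.0), (10_000, 500.0)]


def per_record_ms(records, total_ms):
    """Amortized latency per record for one append call."""
    return total_ms / records


for records, total_ms in appends:
    print(f"{records} records: {per_record_ms(records, total_ms):.2f} ms/record")
```

By these figures, the amortized cost falls from about 1 ms to about 0.05 ms per record when the batch grows from 100 to 10,000 records.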

Cost Analysis

Example: 100,000 workflow executions/month (~1GB logs)

S3 Standard (us-east-1):

Storage: $0.023/GB/month = $0.023
Writes: 100,000 × $0.005/1000 = $0.50
Reads: 10,000 × $0.0004/1000 = $0.004
Total: ~$0.53/month

Compare to:

RDS PostgreSQL (db.t3.micro): $12.41/month
EBS volume (100GB gp3): $8/month

In this example, S3 is roughly 23x cheaper than RDS.
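The arithmetic behind this comparison can be reproduced with a few lines, which also makes it easy to plug in a different workload. The unit prices are the ones quoted above and may not match current AWS pricing; `s3_monthly_cost` is an illustrative helper, not part of DataShard.

```python
def s3_monthly_cost(storage_gb, writes, reads,
                    storage_per_gb=0.023,   # USD per GB-month, as quoted above
                    put_per_1000=0.005,     # USD per 1,000 write requests
                    get_per_1000=0.0004):   # USD per 1,000 read requests
    """Estimate the monthly S3 Standard cost in USD for one workload."""
    storage = storage_gb * storage_per_gb
    put = writes / 1000 * put_per_1000
    get = reads / 1000 * get_per_1000
    return storage + put + get


# The example workload: ~1 GB of logs, 100,000 writes, 10,000 reads.
total = s3_monthly_cost(storage_gb=1, writes=100_000, reads=10_000)
rds = 12.41  # db.t3.micro monthly price quoted above
print(f"S3: ${total:.2f}/month; RDS costs {rds / total:.1f}x as much")
```

Request charges dominate this example, so fewer, larger writes lower the bill as well as the latency.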

Optimization Tips

Batch writes - Group multiple records into single transactions
Regional colocation - Deploy workers in same region as S3 bucket
VPC endpoints - Use an S3 gateway VPC endpoint to keep traffic off the NAT gateway and avoid its data-processing charges
Connection pooling - DataShard automatically reuses connections
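The batching tip can be applied with a small wrapper around `append_records`. This is a sketch under stated assumptions: `batched` and `append_in_batches` are hypothetical helpers, the default batch size of 1000 is arbitrary, and the only DataShard call used is `table.append_records(records, schema)` as shown in the usage examples.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield lists of at most batch_size records from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def append_in_batches(table, records, schema, batch_size=1000):
    """Call table.append_records once per batch; return the number of calls."""
    calls = 0
    for batch in batched(records, batch_size):
        table.append_records(batch, schema)
        calls += 1
    return calls
```

Because `batched` accepts any iterable, records can be streamed from a generator without holding the full result set in memory.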

Troubleshooting

NoSuchBucket Error

Problem: Bucket doesn’t exist.

Solution:

aws s3 mb s3://my-bucket

Access Denied

Problem: Insufficient IAM permissions.

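Solution: Verify that the IAM policy includes s3:GetObject, s3:PutObject, s3:DeleteObject, and s3:ListBucket, and that its Resource list covers both the bucket ARN and the objects under it (/*).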
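A quick way to spot a missing permission is to compare a policy document against the actions in the minimum-permissions policy from the AWS S3 Setup section. The checker below is a rough sketch: `missing_actions` is a hypothetical helper that ignores Resource scoping, Deny statements, and partial wildcards such as `s3:Get*`, so a clean result does not prove that access will succeed.

```python
import json

# Actions taken from the minimum-permissions policy in AWS S3 Setup.
REQUIRED_ACTIONS = {
    "s3:GetObject",
    "s3:PutObject",
    "s3:DeleteObject",
    "s3:ListBucket",
}


def missing_actions(policy_json):
    """Return a sorted list of required actions no Allow statement grants."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be an object
        statements = [statements]
    allowed = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # a single action may be a plain string
            actions = [actions]
        allowed.update(actions)
    if "*" in allowed or "s3:*" in allowed:
        return []
    return sorted(REQUIRED_ACTIONS - allowed)
```

Passing the JSON text of the attached policy returns the actions still to be added, or an empty list when all four are present.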

Slow Performance

Problem: High latency to S3.

Solutions:

Use same AWS region as workers
Enable VPC endpoints
Batch writes (100+ records per transaction)
Consider S3 Transfer Acceleration

ImportError: boto3

Problem: boto3 not installed.