S3-Compatible Storage

Note

S3 storage support has been available since DataShard v0.2.2.

Overview

DataShard supports S3-compatible storage backends, enabling distributed workflows across multiple machines without shared filesystems. This feature allows you to store your DataShard tables in:

  • AWS S3

  • MinIO

  • DigitalOcean Spaces

  • Wasabi

  • Any S3-compatible object storage

Benefits

Distributed Workflows: Workers on different machines can write concurrently

Cloud-Native: No shared filesystem required

ACID Guarantees: Same transactional safety as local storage

High Durability: 99.999999999% with AWS S3

Cost-Effective: roughly 23x cheaper than RDS for workflow logs (see Cost Analysis)

Scalable: Auto-scaling workers can write without coordination

Configuration

Environment Variables

Configure S3 storage using environment variables:

# Enable S3 storage
export DATASHARD_STORAGE_TYPE=s3

# S3 endpoint (use AWS S3 or custom endpoint for MinIO)
export DATASHARD_S3_ENDPOINT=https://s3.amazonaws.com

# S3 credentials
export DATASHARD_S3_ACCESS_KEY=your-access-key-id
export DATASHARD_S3_SECRET_KEY=your-secret-access-key

# S3 bucket and region
export DATASHARD_S3_BUCKET=my-datashard-bucket
export DATASHARD_S3_REGION=us-east-1

# Optional: Prefix within bucket
export DATASHARD_S3_PREFIX=optional/prefix/
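
Before starting a worker, it can help to confirm that these variables are set. The following is a minimal sketch using only the standard library. `S3Settings` and `load_s3_settings` are hypothetical names, not part of DataShard, and treating every variable except the prefix as required is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumption: everything except the prefix is treated as required.
REQUIRED = ("ENDPOINT", "ACCESS_KEY", "SECRET_KEY", "BUCKET", "REGION")


@dataclass(frozen=True)
class S3Settings:
    endpoint: str
    access_key: str
    secret_key: str
    bucket: str
    region: str
    prefix: str = ""


def load_s3_settings(env: Optional[Mapping[str, str]] = None) -> S3Settings:
    """Hypothetical helper: collect DATASHARD_S3_* values and fail fast on gaps."""
    env = os.environ if env is None else env
    if env.get("DATASHARD_STORAGE_TYPE") != "s3":
        raise ValueError("DATASHARD_STORAGE_TYPE must be set to 's3'")
    missing = [
        f"DATASHARD_S3_{name}"
        for name in REQUIRED
        if not env.get(f"DATASHARD_S3_{name}")
    ]
    if missing:
        raise ValueError("missing variables: " + ", ".join(missing))
    return S3Settings(
        endpoint=env["DATASHARD_S3_ENDPOINT"],
        access_key=env["DATASHARD_S3_ACCESS_KEY"],
        secret_key=env["DATASHARD_S3_SECRET_KEY"],
        bucket=env["DATASHARD_S3_BUCKET"],
        region=env["DATASHARD_S3_REGION"],
        prefix=env.get("DATASHARD_S3_PREFIX", ""),
    )
```

Calling `load_s3_settings()` with no argument reads `os.environ`. Passing a plain dict makes the check easy to unit-test.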

AWS S3 Setup

Create an S3 bucket:

aws s3 mb s3://my-datashard-bucket --region us-east-1

IAM Policy (Minimum Permissions):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-datashard-bucket",
        "arn:aws:s3:::my-datashard-bucket/*"
      ]
    }
  ]
}
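
If you manage several buckets, the same policy can be generated instead of hand-edited. This is a small sketch. `minimal_policy` is a hypothetical helper that reproduces the document above for a given bucket name. Both resources are needed because `s3:ListBucket` applies to the bucket ARN while the object actions apply to `bucket/*`.

```python
import json

ACTIONS = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"]


def minimal_policy(bucket: str) -> dict:
    """Build the minimum-permission policy shown above for one bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(ACTIONS),
                # Bucket ARN for ListBucket, object ARN for Get/Put/Delete.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(minimal_policy("my-datashard-bucket"), indent=2))
```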

MinIO Setup

Start MinIO with Docker:

docker run -d \
  -p 9000:9000 \
  -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  --name minio \
  minio/minio server /data --console-address ":9001"

Configure DataShard for MinIO:

export DATASHARD_STORAGE_TYPE=s3
export DATASHARD_S3_ENDPOINT=http://localhost:9000
export DATASHARD_S3_ACCESS_KEY=minioadmin
export DATASHARD_S3_SECRET_KEY=minioadmin
export DATASHARD_S3_BUCKET=datashard
export DATASHARD_S3_REGION=us-east-1
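
A fresh MinIO container has no buckets. In case the `datashard` bucket has to exist before the first write (the NoSuchBucket entry under Troubleshooting suggests so), the sketch below creates it with boto3. `client_kwargs` and `ensure_bucket` are hypothetical helpers, not DataShard functions, and the sketch assumes boto3 is installed.

```python
import os
from typing import Mapping


def client_kwargs(env: Mapping[str, str]) -> dict:
    """Map the DATASHARD_S3_* variables onto boto3.client() keyword arguments."""
    return {
        "endpoint_url": env["DATASHARD_S3_ENDPOINT"],
        "aws_access_key_id": env["DATASHARD_S3_ACCESS_KEY"],
        "aws_secret_access_key": env["DATASHARD_S3_SECRET_KEY"],
        "region_name": env.get("DATASHARD_S3_REGION", "us-east-1"),
    }


def ensure_bucket(env: Mapping[str, str]) -> None:
    """Create the configured bucket if a HEAD request on it fails."""
    import boto3  # imported lazily so client_kwargs works without boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client("s3", **client_kwargs(env))
    bucket = env["DATASHARD_S3_BUCKET"]
    try:
        s3.head_bucket(Bucket=bucket)
    except ClientError:
        # head_bucket raises ClientError when the bucket is missing
        # (or inaccessible); in the latter case create_bucket fails too.
        s3.create_bucket(Bucket=bucket)


if __name__ == "__main__":
    ensure_bucket(os.environ)
```

On AWS, regions other than us-east-1 also need a `CreateBucketConfiguration` argument, which this sketch leaves out.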

Usage Examples

Basic Table Creation

from datashard import create_table, Schema

# Configure S3 via environment variables
# Then use DataShard normally - API is identical!

schema = Schema(
    schema_id=1,
    fields=[
        {"id": 1, "name": "user_id", "type": "long", "required": True},
        {"id": 2, "name": "event", "type": "string", "required": True},
        {"id": 3, "name": "timestamp", "type": "long", "required": True}
    ]
)

# Creates table in S3 bucket
table = create_table("user-events", schema)

# Append records (stored in S3)
table.append_records([
    {"user_id": 1, "event": "login", "timestamp": 1700000000},
    {"user_id": 2, "event": "purchase", "timestamp": 1700000100}
], schema)

Distributed Workflow

Worker 1 (us-east-1a):

from datashard import create_table, Schema

schema = Schema(schema_id=1, fields=[
    {"id": 1, "name": "task_id", "type": "string", "required": True},
    {"id": 2, "name": "status", "type": "string", "required": True},
])

table = create_table("task-results", schema)
table.append_records([
    {"task_id": "t1", "status": "success"}
], schema)

Worker 2 (us-east-1b, simultaneously):

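from datashard import load_table, Schema

# Same schema as Worker 1; append_records needs it
schema = Schema(schema_id=1, fields=[
    {"id": 1, "name": "task_id", "type": "string", "required": True},
    {"id": 2, "name": "status", "type": "string", "required": True},
])

# Loads from S3, sees Worker 1's table
table = load_table("task-results")

# Concurrent append with ACID guarantees
table.append_records([
    {"task_id": "t2", "status": "success"}
], schema)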

Analysis Server (us-west-2):

from datashard import load_table

# Query from different region
table = load_table("task-results")

# Get current snapshot
snapshot = table.current_snapshot()
print(f"Snapshot ID: {snapshot.snapshot_id}")

Storage Structure

DataShard creates the following structure in S3:

s3://my-bucket/optional-prefix/
└── table-name/
    ├── data/
    │   ├── auto_1763162900677093.parquet
    │   └── auto_1763162915234567.parquet
    ├── metadata/
    │   ├── v0.metadata.json
    │   ├── v1.metadata.json
    │   └── manifests/
    │       ├── manifest_1763162900796257.avro
    │       └── manifest_list_*.avro
    └── metadata.version-hint.text
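
To make the layout concrete, the sketch below builds object keys that match the tree above. The helper names are hypothetical and only illustrate how prefix, table name, and file name combine. DataShard's own key construction may differ.

```python
import posixpath


def table_root(prefix: str, table: str) -> str:
    """Join the optional bucket prefix and the table name into a key root."""
    return posixpath.join(prefix.strip("/"), table)


def data_key(prefix: str, table: str, filename: str) -> str:
    """Key of a Parquet data file under <table>/data/."""
    return posixpath.join(table_root(prefix, table), "data", filename)


def metadata_key(prefix: str, table: str, version: int) -> str:
    """Key of a versioned metadata file under <table>/metadata/."""
    return posixpath.join(
        table_root(prefix, table), "metadata", f"v{version}.metadata.json"
    )
```

S3 keys always use forward slashes, which is why the sketch uses `posixpath` rather than `os.path`.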

Architecture

DataShard uses a dual-API approach for S3:

Metadata Layer (boto3):
  • JSON metadata files (table metadata)

  • Avro manifest files

  • Version hints

  • Fast small-file operations

Data Layer (PyArrow S3FileSystem):
  • Parquet data files

  • Native compression

  • Efficient columnar I/O

  • Optimized for large files

Both APIs are coordinated through the StorageBackend abstraction.
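
The split can be pictured as a dispatch on file type. This is an illustrative sketch only: `layer_for` is a hypothetical function and the real StorageBackend interface is not shown in this document.

```python
def layer_for(key: str) -> str:
    """Return which layer would handle an object, judged by its file suffix."""
    if key.endswith(".parquet"):
        return "data"  # large columnar files: PyArrow S3FileSystem
    if key.endswith((".metadata.json", ".avro", ".text")):
        return "metadata"  # small JSON, Avro, and hint files: boto3
    raise ValueError(f"unrecognized object type: {key}")
```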

Performance

Latency Comparison

Operation              Local Disk   S3 (same region)   S3 (cross-region)
Create table           <1ms         ~50ms              ~150ms
Append 100 records     ~2ms         ~100ms             ~300ms
Append 10,000 records  ~50ms        ~500ms             ~1500ms
Read 50,000 records    ~20ms        ~150ms             ~500ms
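
A quick calculation on the same-region append figures shows why batch size matters. The numbers are the approximate values from the table. The per-record view is an illustration, not an additional benchmark.

```python
# (records per append) -> approximate same-region S3 latency in milliseconds
append_latency_ms = {100: 100, 10_000: 500}

# Amortized cost of one record at each batch size
per_record_ms = {n: ms / n for n, ms in append_latency_ms.items()}

# How much cheaper a record is in the large batch than in the small one
improvement = per_record_ms[100] / per_record_ms[10_000]

for n, ms in per_record_ms.items():
    print(f"{n:>6} records per append: {ms:.2f} ms per record")
print(f"improvement: about {improvement:.0f}x")
```

Going from 100 to 10,000 records per append cuts the amortized latency from about 1 ms to about 0.05 ms per record.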

Cost Analysis

Example: 100,000 workflow executions/month (~1GB logs)

S3 Standard (us-east-1):

  • Storage: $0.023/GB/month = $0.023

  • Writes: 100,000 × $0.005/1000 = $0.50

  • Reads: 10,000 × $0.0004/1000 = $0.004

  • Total: ~$0.53/month

Compare to:

  • RDS PostgreSQL (db.t3.micro): $12.41/month

  • EBS volume (100GB gp3): $8/month

S3 is 23x cheaper than RDS!
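
The arithmetic behind these figures can be reproduced in a few lines, using only the prices quoted above. Current AWS pricing may differ.

```python
# Monthly workload from the example above
storage_gb = 1
write_requests = 100_000
read_requests = 10_000

# Prices quoted above (USD)
storage_cost = storage_gb * 0.023
write_cost = write_requests * 0.005 / 1000
read_cost = read_requests * 0.0004 / 1000

s3_total = storage_cost + write_cost + read_cost
rds_monthly = 12.41

print(f"S3 total: ${s3_total:.3f}/month")
print(f"RDS / S3: {rds_monthly / s3_total:.1f}x")
```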

Optimization Tips

  1. Batch writes - Group multiple records into single transactions

  2. Regional colocation - Deploy workers in same region as S3 bucket

  3. VPC endpoints - Use an S3 gateway endpoint so traffic from private subnets avoids NAT gateway data-processing charges

  4. Connection pooling - DataShard automatically reuses connections
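
For the first tip, a small generator is enough to group an incoming stream into larger appends. This is a sketch. `batched` is a hypothetical helper, and the batch size of 1000 in the usage comment is an arbitrary choice.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` records, preserving input order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Usage with a DataShard table (table and schema as in Usage Examples):
#
#     for batch in batched(incoming_records, 1000):
#         table.append_records(batch, schema)
```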

Troubleshooting

NoSuchBucket Error

Problem: Bucket doesn’t exist.

Solution:

aws s3 mb s3://my-bucket

Access Denied

Problem: Insufficient IAM permissions.

Solution: Verify that the IAM policy includes s3:GetObject, s3:PutObject, s3:DeleteObject, and s3:ListBucket, and that its Resource list covers both the bucket and its objects.
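
A policy document can be checked offline for the four actions listed in the minimum policy under AWS S3 Setup. This is a rough sketch. `missing_actions` is a hypothetical helper that only looks at Allow statements and ignores Deny statements, conditions, and resource scoping.

```python
import json

REQUIRED_ACTIONS = {
    "s3:GetObject",
    "s3:PutObject",
    "s3:DeleteObject",
    "s3:ListBucket",
}


def missing_actions(policy_json: str) -> set:
    """Return the required actions that no Allow statement grants."""
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    allowed = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # a single action may be a bare string
            actions = [actions]
        allowed.update(actions)
    if "*" in allowed or "s3:*" in allowed:
        return set()
    return REQUIRED_ACTIONS - allowed
```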

Slow Performance

Problem: High latency to S3.

Solutions:

  1. Use same AWS region as workers

  2. Enable VPC endpoints

  3. Batch writes (100+ records per transaction)

  4. Consider S3 Transfer Acceleration

ImportError: boto3

Problem: boto3 not installed.

Solution:

pip install "datashard[s3]"
# or
pip install boto3
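
To find out which optional dependency is missing before a worker starts, the standard library can probe for modules without importing them. `missing_modules` is a hypothetical helper. Checking `pyarrow` as well is an assumption based on the Architecture section.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(names: Iterable[str] = ("boto3", "pyarrow")) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing S3 dependencies:", ", ".join(missing))
    else:
        print("All S3 dependencies found")
```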

Best Practices

  1. Use IAM Roles on AWS compute instead of hardcoded credentials

  2. Enable S3 Encryption at rest (SSE-S3 or SSE-KMS)

  3. Set up Lifecycle Policies to archive old snapshots to Glacier

  4. Monitor S3 Costs with AWS Cost Explorer

  5. Use VPC Endpoints for S3 to reduce costs

  6. Enable S3 Versioning for critical tables

  7. Set up CloudWatch Alarms for S3 access errors

Security

Encryption at Rest

Enable server-side encryption:

aws s3api put-bucket-encryption \
  --bucket my-bucket \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "AES256"
      }
    }]
  }'
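
The same setting can be applied from Python with boto3's `put_bucket_encryption` call. This is a sketch that assumes boto3 is installed and that credentials are available from the environment or an IAM role. The helper names are hypothetical.

```python
def sse_s3_config() -> dict:
    """Default-encryption configuration equivalent to the CLI JSON above."""
    return {
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    }


def enable_default_encryption(bucket: str) -> None:
    """Turn on SSE-S3 default encryption for one bucket."""
    import boto3  # imported lazily so sse_s3_config works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration=sse_s3_config(),
    )


if __name__ == "__main__":
    enable_default_encryption("my-bucket")
```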

See Also