Troubleshooting

This section helps you diagnose and resolve common issues with datashard.

Common Issues and Solutions

Installation Issues

Problem: Module not found after installation
Solution: Confirm that the installation completed successfully and that you are using the correct Python environment:

pip install datashard
python -c "import datashard; print(datashard.__version__)"

Problem: Permission errors during installation
Solution: Install in user mode or use a virtual environment:

pip install --user datashard
# OR
python -m venv venv && source venv/bin/activate && pip install datashard

Transaction Issues

Problem: ConcurrentModificationException occurring frequently
Solution: Implement retry logic with exponential backoff and jitter:

import random  # needed for the jitter term below
import time

from datashard import ConcurrentModificationException

def safe_transaction_with_retry(table, operations, max_retries=5):
    for attempt in range(max_retries):
        try:
            with table.new_transaction() as tx:
                operations(tx)  # Perform your operations
                return tx.commit()
        except ConcurrentModificationException:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            sleep_time = 0.01 * (2 ** attempt) + random.uniform(0, 0.01)
            time.sleep(sleep_time)
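
To get a feel for how long the retry loop above can wait, the following minimal sketch reproduces only its delay arithmetic. The helper name `backoff_delays` and its parameters are hypothetical and not part of datashard; the defaults mirror the constants used above (base delay 0.01 s, jitter up to 0.01 s, 5 attempts).

```python
import random


def backoff_delays(max_retries=5, base=0.01, jitter=0.01, rng=random):
    """Return the sleep time before each retry.

    There is no sleep after the final attempt, so the list has
    max_retries - 1 entries.
    """
    delays = []
    for attempt in range(max_retries - 1):
        # Delay doubles with every failed attempt, plus random jitter.
        delays.append(base * (2 ** attempt) + rng.uniform(0, jitter))
    return delays


# With jitter disabled the schedule is purely exponential.
schedule = backoff_delays(jitter=0)
print(schedule)
print(f"Total wait without jitter: {sum(schedule):.2f} s")
```

With the defaults, the loop sleeps about 0.15 s in total before giving up, plus at most 0.04 s of jitter. If conflicts persist beyond that, raising `base` or `max_retries` may help more than retrying faster.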

Problem: Transactions taking too long to complete
Solution: Keep transactions small and fast. Rather than committing everything in one large transaction, split the work into batches and commit each batch separately:

from datashard import create_table, Schema

# Define schema
batch_schema = Schema(
    schema_id=1,
    fields=[
        {"id": 1, "name": "record_id", "type": "long", "required": True},
        {"id": 2, "name": "data", "type": "string", "required": True}
    ]
)

table = create_table("/path/to/table", batch_schema)

# Avoid both one huge transaction and one transaction per record (slow);
# commit moderately sized batches instead.
# huge_dataset is assumed to be a list of records matching batch_schema.
batch_size = 1000
for i in range(0, len(huge_dataset), batch_size):
    batch = huge_dataset[i:i + batch_size]
    with table.new_transaction() as tx:
        tx.append_data(records=batch, schema=batch_schema)
        success = tx.commit()

File and Path Issues

Problem: FileNotFoundError when accessing data files
Solution: Ensure file paths are correct and accessible:

import os
from datashard import DataFile, FileFormat

file_path = "/absolute/path/to/data.parquet"
if os.path.exists(file_path):
    data_file = DataFile(
        file_path=file_path,
        file_format=FileFormat.PARQUET,
        # ... other parameters
    )
else:
    raise FileNotFoundError(f"Data file does not exist: {file_path}")

Problem: Permission errors accessing table directories
Solution: Ensure the process has read/write permissions to the table location:

# Check permissions
ls -la /path/to/table

# Set appropriate permissions (example)
chmod -R 755 /path/to/table

Memory Issues

Problem: MemoryError (out of memory) with large datasets
Solution: Process data in smaller chunks:

from datashard import create_table, Schema
import gc

# Define schema
chunk_schema = Schema(
    schema_id=1,
    fields=[
        {"id": 1, "name": "record_id", "type": "long", "required": True},
        {"id": 2, "name": "value", "type": "string", "required": True}
    ]
)

table = create_table("/path/to/table", chunk_schema)

def process_large_dataset_in_chunks(table, large_dataset, schema, chunk_size=1000):
    results = []
    for i in range(0, len(large_dataset), chunk_size):
        chunk = large_dataset[i:i + chunk_size]
        with table.new_transaction() as tx:
            tx.append_data(records=chunk, schema=schema)
            results.append(tx.commit())
        # Allow garbage collection between chunks
        gc.collect()
    return results

# Example usage
large_data = [{"record_id": i, "value": f"value_{i}"} for i in range(100000)]
process_large_dataset_in_chunks(table, large_data, chunk_schema, chunk_size=1000)
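
The example above slices a list that is already fully in memory. When the records come from a generator or a file reader, a chunking helper that works on any iterable avoids materializing the whole dataset. The sketch below uses only the standard library; `iter_chunks` is a hypothetical helper name, not a datashard function.

```python
from itertools import islice


def iter_chunks(records, chunk_size=1000):
    """Yield lists of at most chunk_size items from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(records)
    while True:
        # islice pulls only the next chunk_size items from the iterator.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Records are produced lazily, so only one chunk is held in memory at a time.
record_stream = ({"record_id": i, "value": f"value_{i}"} for i in range(2500))
sizes = [len(chunk) for chunk in iter_chunks(record_stream, chunk_size=1000)]
print(sizes)  # [1000, 1000, 500]
```

Each yielded chunk could then be appended and committed in its own transaction, in the same way as in `process_large_dataset_in_chunks`.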

Debugging Techniques

Enable Logging

Enable Python's standard logging to see what datashard is doing internally:

import logging

# Enable debug logging
logging.basicConfig(level=logging.DEBUG)

# Or set specific loggers
metadata_logger = logging.getLogger('datashard.metadata_manager')
metadata_logger.setLevel(logging.DEBUG)

Diagnostic Information

Collect diagnostic information for troubleshooting:

def diagnose_table(table):
    print(f"Table location: {table.table_path}")

    # Check metadata
    metadata = table.metadata_manager.refresh()
    if metadata:
        print(f"Metadata format version: {metadata.format_version}")
        print(f"Current snapshot ID: {metadata.current_snapshot_id}")
        print(f"Number of snapshots: {len(metadata.snapshots)}")
        print(f"Number of schemas: {len(metadata.schemas)}")
    else:
        print("No metadata found")

    # Check snapshots
    snapshots = table.snapshots()
    print(f"Available snapshots: {len(snapshots)}")

    # Check active transactions
    active_txs = table.transaction_manager.get_active_transactions()
    print(f"Active transactions: {len(active_txs)}")

Common Configuration Issues

Wrong Table Path

Problem: Unable to create or access table at specified path
Solution: Verify that the parent directory exists and is accessible before creating the table:

import os
from datashard import create_table, Schema

# Define schema
basic_schema = Schema(
    schema_id=1,
    fields=[
        {"id": 1, "name": "id", "type": "long", "required": True},
        {"id": 2, "name": "data", "type": "string", "required": True}
    ]
)

table_path = "/valid/path/for/table"

# Ensure parent directory exists
parent_dir = os.path.dirname(table_path)
os.makedirs(parent_dir, exist_ok=True)

# Create table
table = create_table(table_path, basic_schema)

Incompatible File Formats

Problem: Issues reading/writing files in certain formats
Solution: Ensure dependencies are properly installed:

# datashard relies on pyarrow for file format support
# Make sure it's properly installed
try:
    import pyarrow as pa
    print(f"PyArrow version: {pa.__version__}")
except ImportError:
    print("PyArrow not installed - install with: pip install pyarrow")
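
If several packages need checking at once, the same idea can be sketched with the standard library's `importlib`. The helper name `check_dependencies` is hypothetical. The package list is only an example: the text above names pyarrow as a dependency, and other entries are up to you.

```python
import importlib.util
from importlib import metadata


def check_dependencies(names):
    """Map each top-level module name to its installed version.

    The value is None when the module cannot be found, and "unknown"
    when it is importable but has no matching distribution metadata.
    """
    report = {}
    for name in names:
        # find_spec locates a top-level module without importing it.
        if importlib.util.find_spec(name) is None:
            report[name] = None
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "unknown"
    return report


for package, version in check_dependencies(["pyarrow", "datashard"]).items():
    status = version if version is not None else "NOT INSTALLED"
    print(f"{package}: {status}")
```

Running this inside the environment that executes your application shows whether the packages are visible there, which also helps with the virtual environment problems described below.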

Environment-Specific Issues

Virtual Environment Issues

If using virtual environments, ensure datashard is installed in the active environment:

# Activate your virtual environment first
source /path/to/venv/bin/activate

# Verify which Python you're using
which python

# Install datashard in the virtual environment
pip install datashard

Path Issues

Make sure the table path is accessible and has appropriate permissions:

import os

def validate_table_path(path):
    # Check if path exists, if not create it
    if not os.path.exists(path):
        os.makedirs(path, exist_ok=True)

    # Check if path is writable
    if not os.access(path, os.W_OK):
        raise PermissionError(f"Path not writable: {path}")

    # Check if it's actually a directory
    if not os.path.isdir(path):
        raise NotADirectoryError(f"Path is not a directory: {path}")
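
`os.access` reports what the permission bits suggest, which can differ from what actually happens, notably on network filesystems. As a complementary check, the sketch below tries to create and remove a temporary file in the directory. `can_write_to` is a hypothetical helper name, not part of datashard.

```python
import os
import tempfile


def can_write_to(path):
    """Return True if a temporary file can be created and removed inside path."""
    try:
        # The file is deleted automatically when the context manager exits.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError:
        # Covers a missing directory, missing permissions, read-only mounts, etc.
        return False
    return True


with tempfile.TemporaryDirectory() as table_dir:
    print(can_write_to(table_dir))                           # True
    print(can_write_to(os.path.join(table_dir, "missing")))  # False
```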

Support and Community

When to Seek Help

  • If you’ve reviewed this troubleshooting guide and still have issues

  • For feature requests or enhancement suggestions

  • To report bugs or unexpected behavior

Where to Get Help

  1. Check the GitHub repository for similar issues

  2. File an issue with detailed information about your problem

  3. Include Python version, datashard version, and a reproducible example

import sys
import datashard

print(f"Python version: {sys.version}")
print(f"datashard version: {datashard.__version__}")
print(f"System platform: {sys.platform}")

Common Error Messages

  • ConcurrentModificationException: Indicates another process modified data during your transaction

  • FileNotFoundError: The specified file path doesn’t exist

  • PermissionError: Insufficient permissions to access files/directories

  • ValueError: Invalid parameter values provided to functions

  • RuntimeError: General runtime issues with transaction state
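
The list above can be turned into a small lookup that prints a hint when an operation fails. This is a minimal sketch: `troubleshooting_hint` and the hint texts are hypothetical, and exceptions are matched by class name so the snippet runs without importing datashard.

```python
HINTS = {
    "ConcurrentModificationException": "Another process modified the data; retry with backoff.",
    "FileNotFoundError": "Check that the file path exists and is spelled correctly.",
    "PermissionError": "Check read/write permissions on the files and directories.",
    "ValueError": "Check the parameter values passed to the function.",
    "RuntimeError": "Check the transaction state (for example, whether it was already committed).",
}


def troubleshooting_hint(exc):
    """Return a hint for exc, matching its class or any of its base classes by name."""
    for cls in type(exc).__mro__:
        hint = HINTS.get(cls.__name__)
        if hint is not None:
            return hint
    return "No specific hint; enable debug logging and collect diagnostics."


try:
    open("/path/that/does/not/exist.parquet")
except OSError as exc:
    print(f"{type(exc).__name__}: {troubleshooting_hint(exc)}")
```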