Troubleshooting
This section helps you diagnose and resolve common issues with datashard.
Common Issues and Solutions
Installation Issues
Problem: Module not found after installation
Solution: Ensure that the installation completed successfully and that you’re using the correct Python environment:
pip install datashard
python -c "import datashard; print(datashard.__version__)"
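If the import still fails, the package may have been installed for a different interpreter than the one running your code. The following is a minimal sketch that uses only the standard library; the helper name `locate_module` is illustrative and not part of datashard.

```python
import importlib.util
import sys


def locate_module(name):
    """Return the file a top-level module would be imported from, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


# The interpreter that is actually running this code
print(f"Interpreter: {sys.executable}")
# None means datashard is not visible to this interpreter
print(f"datashard location: {locate_module('datashard')}")
```

If the location is None, installing with `python -m pip install datashard` ties the installation to the interpreter named by `python`.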
Problem: Permission errors during installation
Solution: Install in user mode or use a virtual environment:
pip install --user datashard
# OR
python -m venv venv && source venv/bin/activate && pip install datashard
Transaction Issues
Problem: ConcurrentModificationException occurring frequently
Solution: Implement proper retry logic with exponential backoff:
import random
import time
from datashard import ConcurrentModificationException
def safe_transaction_with_retry(table, operations, max_retries=5):
    for attempt in range(max_retries):
        try:
            with table.new_transaction() as tx:
                operations(tx)  # Perform your operations
                return tx.commit()
        except ConcurrentModificationException:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the conflict to the caller
            # Exponential backoff with jitter
            sleep_time = 0.01 * (2 ** attempt) + random.uniform(0, 0.01)
            time.sleep(sleep_time)
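To see how long the retry loop can wait in total, the backoff formula can be evaluated on its own. This sketch reuses the constants from the example (10 ms base, up to 10 ms jitter); the helper `backoff_delays` is hypothetical and exists only for illustration.

```python
import random


def backoff_delays(max_retries=5, base=0.01, jitter=0.01, rng=random):
    """Return the sleep time before each retry: base * 2**attempt plus random jitter."""
    # The final attempt re-raises instead of sleeping, so there are max_retries - 1 delays
    return [base * (2 ** attempt) + rng.uniform(0, jitter)
            for attempt in range(max_retries - 1)]


delays = backoff_delays()
print([round(d, 3) for d in delays])
print(f"Total wait this run: {sum(delays):.3f} s")
```

With the defaults, the base waits are 10, 20, 40 and 80 ms, each plus up to 10 ms of jitter, so the loop sleeps at most 0.19 s before giving up. If conflicts persist beyond that, raising `max_retries` or the base delay is one option to try.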
Problem: Transactions taking too long to complete
Solution: Keep each transaction small and fast. Commit data in fixed-size batches rather than in one very large transaction:
from datashard import create_table, Schema
# Define schema
batch_schema = Schema(
    schema_id=1,
    fields=[
        {"id": 1, "name": "record_id", "type": "long", "required": True},
        {"id": 2, "name": "data", "type": "string", "required": True}
    ]
)
table = create_table("/path/to/table", batch_schema)
# Avoid a single huge transaction, and avoid one transaction per record (slow).
# Commit fixed-size batches instead. huge_dataset stands for your own list of record dicts.
batch_size = 1000
for i in range(0, len(huge_dataset), batch_size):
    batch = huge_dataset[i:i + batch_size]
    with table.new_transaction() as tx:
        tx.append_data(records=batch, schema=batch_schema)
        success = tx.commit()
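The loop above slices a list, so it needs the whole dataset in memory and a known length. When records come from a generator or a file reader, a small batching helper can do the same job. This is a sketch; `iter_batches` is a hypothetical helper, not a datashard function.

```python
from itertools import islice


def iter_batches(records, batch_size=1000):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Plain integers stand in for records here
for batch in iter_batches(range(2500), batch_size=1000):
    print(len(batch))
```

Each yielded batch can then be passed to `tx.append_data(records=batch, schema=batch_schema)` inside its own transaction, as in the example above.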
File and Path Issues
Problem: FileNotFoundError when accessing data files
Solution: Ensure file paths are correct and accessible:
import os
from datashard import DataFile, FileFormat
file_path = "/absolute/path/to/data.parquet"
if os.path.exists(file_path):
    data_file = DataFile(
        file_path=file_path,
        file_format=FileFormat.PARQUET,
        # ... other parameters
    )
else:
    raise FileNotFoundError(f"Data file does not exist: {file_path}")
Problem: Permission errors accessing table directories
Solution: Ensure the process has read/write permissions to the table location:
# Check permissions
ls -la /path/to/table
# Set appropriate permissions (example)
chmod -R 755 /path/to/table
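The same information can be read from inside the Python process that fails, which is useful when it runs as a different user than your shell. This is a standard-library sketch; `describe_permissions` is an illustrative name.

```python
import os
import stat


def describe_permissions(path):
    """Return the mode string of path and the access the current process has to it."""
    mode = os.stat(path).st_mode
    return {
        "mode": stat.filemode(mode),            # same notation as ls -l
        "is_dir": stat.S_ISDIR(mode),
        "readable": os.access(path, os.R_OK),
        "writable": os.access(path, os.W_OK),
    }


# Replace "." with your table location
print(describe_permissions("."))
```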
Memory Issues
Problem: MemoryError (out of memory) with large datasets
Solution: Process data in smaller chunks:
from datashard import create_table, Schema
import gc
# Define schema
chunk_schema = Schema(
    schema_id=1,
    fields=[
        {"id": 1, "name": "record_id", "type": "long", "required": True},
        {"id": 2, "name": "value", "type": "string", "required": True}
    ]
)
table = create_table("/path/to/table", chunk_schema)
def process_large_dataset_in_chunks(table, large_dataset, schema, chunk_size=1000):
    results = []
    for i in range(0, len(large_dataset), chunk_size):
        chunk = large_dataset[i:i + chunk_size]
        with table.new_transaction() as tx:
            tx.append_data(records=chunk, schema=schema)
            results.append(tx.commit())
        # Allow garbage collection between chunks
        gc.collect()
    return results
# Example usage
large_data = [{"record_id": i, "value": f"value_{i}"} for i in range(100000)]
process_large_dataset_in_chunks(table, large_data, chunk_schema, chunk_size=1000)
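To choose a chunk size, it can help to measure how much memory one chunk of records needs. The sketch below uses the standard `tracemalloc` module; `peak_memory_of` and `build_chunk` are hypothetical helpers. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory used by native libraries may not be counted.

```python
import tracemalloc


def peak_memory_of(func, *args, **kwargs):
    """Run func and return (result, peak bytes allocated by Python while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def build_chunk(n):
    # Stand-in for preparing one chunk of records
    return [{"record_id": i, "value": f"value_{i}"} for i in range(n)]


for size in (1_000, 10_000):
    _, peak = peak_memory_of(build_chunk, size)
    print(f"chunk_size={size}: peak ~{peak / 1024:.0f} KiB")
```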
Debugging Techniques
Enable Logging
Add logging to understand what’s happening:
import logging
# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
# Or set specific loggers
metadata_logger = logging.getLogger('datashard.metadata_manager')
metadata_logger.setLevel(logging.DEBUG)
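Timestamps and logger names make debug output easier to correlate with failures. The following sketch attaches a formatted handler to one logger. Using "datashard" as the parent logger name is an assumption inferred from the 'datashard.metadata_manager' logger above, and `attach_debug_handler` is an illustrative helper.

```python
import logging


def attach_debug_handler(logger_name, handler):
    """Send DEBUG and higher records from one logger to the given handler."""
    handler.setLevel(logging.DEBUG)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return handler


# Assumption: the library's loggers live under the "datashard" namespace
attach_debug_handler("datashard", logging.StreamHandler())
logging.getLogger("datashard").debug("debug logging enabled")
```

Passing `logging.FileHandler("some.log")` instead of the stream handler writes the same records to a file that can be attached to a bug report.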
Diagnostic Information
Collect diagnostic information for troubleshooting:
def diagnose_table(table):
    print(f"Table location: {table.table_path}")
    # Check metadata
    metadata = table.metadata_manager.refresh()
    if metadata:
        print(f"Metadata format version: {metadata.format_version}")
        print(f"Current snapshot ID: {metadata.current_snapshot_id}")
        print(f"Number of snapshots: {len(metadata.snapshots)}")
        print(f"Number of schemas: {len(metadata.schemas)}")
    else:
        print("No metadata found")
    # Check snapshots
    snapshots = table.snapshots()
    print(f"Available snapshots: {len(snapshots)}")
    # Check active transactions
    active_txs = table.transaction_manager.get_active_transactions()
    print(f"Active transactions: {len(active_txs)}")
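When the table object itself cannot be opened, a look at the files on disk can still be informative. This sketch makes no assumption about datashard's directory layout; it only counts files and bytes per extension under a directory. `summarize_directory` is a hypothetical helper.

```python
import os


def summarize_directory(root):
    """Return {extension: (file_count, total_bytes)} for all files under root."""
    summary = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1] or "(none)"
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue  # Skip files that vanish or cannot be read
            count, total = summary.get(ext, (0, 0))
            summary[ext] = (count + 1, total + size)
    return summary


# A missing directory simply yields an empty summary
for ext, (count, total) in sorted(summarize_directory("/path/to/table").items()):
    print(f"{ext}: {count} files, {total} bytes")
```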
Common Configuration Issues
Wrong Table Path
Problem: Unable to create or access table at specified path
Solution: Verify that the path is valid and that its parent directory exists and is accessible:
import os
from datashard import create_table, Schema
# Define schema
basic_schema = Schema(
    schema_id=1,
    fields=[
        {"id": 1, "name": "id", "type": "long", "required": True},
        {"id": 2, "name": "data", "type": "string", "required": True}
    ]
)
table_path = "/valid/path/for/table"
# Ensure parent directory exists
parent_dir = os.path.dirname(table_path)
os.makedirs(parent_dir, exist_ok=True)
# Create table
table = create_table(table_path, basic_schema)
Incompatible File Formats
Problem: Issues reading/writing files in certain formats
Solution: Ensure dependencies are properly installed:
# datashard relies on pyarrow for file format support
# Make sure it's properly installed
try:
    import pyarrow as pa
    print(f"PyArrow version: {pa.__version__}")
except ImportError:
    print("PyArrow not installed - install with: pip install pyarrow")
Environment-Specific Issues
Virtual Environment Issues
If using virtual environments, ensure datashard is installed in the active environment:
# Activate your virtual environment first
source /path/to/venv/bin/activate
# Verify which Python you're using
which python
# Install datashard in the virtual environment
pip install datashard
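Whether a virtual environment is active can also be checked from inside Python, which helps when a script is launched by an IDE or a scheduler rather than from your shell. This is a sketch based on the standard `sys.prefix` and `sys.base_prefix` attributes; `in_virtualenv` is an illustrative name.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style virtual environment."""
    # In a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base installation
    return sys.prefix != sys.base_prefix


print(f"Interpreter: {sys.executable}")
print(f"Virtual environment active: {in_virtualenv()}")
```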
Path Issues
Make sure the table path is accessible and has appropriate permissions:
import os
def validate_table_path(path):
    # Check if path exists, if not create it
    if not os.path.exists(path):
        os.makedirs(path, exist_ok=True)
    # Check if path is writable
    if not os.access(path, os.W_OK):
        raise PermissionError(f"Path not writable: {path}")
    # Check if it's actually a directory
    if not os.path.isdir(path):
        raise NotADirectoryError(f"Path is not a directory: {path}")
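`os.access` only inspects permission bits and can give misleading answers on some network filesystems. A stricter probe is to try to create a file. This is a minimal sketch using the standard `tempfile` module; `can_write` is a hypothetical helper.

```python
import tempfile


def can_write(path):
    """Return True if a temporary file can actually be created inside path."""
    try:
        # The file is removed automatically when the block exits
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        # Covers missing directories, permission errors and read-only filesystems
        return False


print(can_write(tempfile.gettempdir()))
```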
Support and Community
When to Seek Help
If you’ve reviewed this troubleshooting guide and still have issues
For feature requests or enhancement suggestions
To report bugs or unexpected behavior
Where to Get Help
Check the GitHub repository for similar issues
File an issue with detailed information about your problem
Include Python version, datashard version, and a reproducible example
import sys
import datashard
print(f"Python version: {sys.version}")
print(f"datashard version: {datashard.__version__}")
print(f"System platform: {sys.platform}")
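If `import datashard` itself is what fails, the snippet above cannot run. The sketch below collects similar information from installed package metadata instead, so it works without importing the library. `environment_report` is an illustrative helper; pyarrow is included because datashard relies on it for file format support.

```python
import importlib.metadata
import platform
import sys


def environment_report(packages=("datashard", "pyarrow")):
    """Collect interpreter, OS and installed package versions as a dict."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```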
Common Error Messages
ConcurrentModificationException: Indicates another process modified data during your transaction
FileNotFoundError: The specified file path doesn’t exist
PermissionError: Insufficient permissions to access files/directories
ValueError: Invalid parameter values provided to functions
RuntimeError: General runtime issues with transaction state
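The built-in exceptions in this list can be mapped to a first troubleshooting step in code, for example in a top-level error handler. This is only a sketch: `hint_for` and the hint texts are illustrative, and ConcurrentModificationException is left out so that the example runs without datashard installed.

```python
HINTS = [
    (FileNotFoundError, "Check that the path exists and is spelled correctly."),
    (PermissionError, "Check read/write permissions on the table location."),
    (ValueError, "Check the arguments passed to the failing call."),
    (RuntimeError, "Check the transaction state before retrying."),
]


def hint_for(exc):
    """Return a first troubleshooting step for a caught exception."""
    for exc_type, hint in HINTS:
        if isinstance(exc, exc_type):
            return hint
    return "No specific hint; enable debug logging and inspect the traceback."


try:
    open("/no/such/file/xyz_123")
except OSError as exc:
    print(f"{type(exc).__name__}: {hint_for(exc)}")
```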