Troubleshooting
===============

This section helps you diagnose and resolve common issues with datashard.

Common Issues and Solutions
---------------------------

Installation Issues
^^^^^^^^^^^^^^^^^^^

**Problem**: Module not found after installation

**Solution**: Confirm that the installation completed successfully and that you are running the same Python environment you installed into:

.. code-block:: bash

   pip install datashard
   python -c "import datashard; print(datashard.__version__)"

**Problem**: Permission errors during installation

**Solution**: Install in user mode or use a virtual environment:

.. code-block:: bash

   pip install --user datashard
   # OR
   python -m venv venv && source venv/bin/activate && pip install datashard

Transaction Issues
^^^^^^^^^^^^^^^^^^

**Problem**: ConcurrentModificationException occurring frequently

**Solution**: Retry the transaction with exponential backoff and jitter:

.. code-block:: python

   import random
   import time

   from datashard import ConcurrentModificationException


   def safe_transaction_with_retry(table, operations, max_retries=5):
       for attempt in range(max_retries):
           try:
               with table.new_transaction() as tx:
                   operations(tx)  # Perform your operations
                   return tx.commit()
           except ConcurrentModificationException:
               # Give up after the last attempt
               if attempt == max_retries - 1:
                   raise
               # Exponential backoff with jitter
               sleep_time = 0.01 * (2 ** attempt) + random.uniform(0, 0.01)
               time.sleep(sleep_time)

**Problem**: Transactions taking too long to complete

**Solution**: Keep each transaction small and fast. Rather than committing everything in one large transaction, commit in fixed-size batches:

.. code-block:: python

   from datashard import create_table, Schema

   # Define schema
   batch_schema = Schema(
       schema_id=1,
       fields=[
           {"id": 1, "name": "record_id", "type": "long", "required": True},
           {"id": 2, "name": "data", "type": "string", "required": True}
       ]
   )

   table = create_table("/path/to/table", batch_schema)

   # Example input: a list of records matching batch_schema
   huge_dataset = [{"record_id": i, "data": f"row_{i}"} for i in range(100000)]

   # Avoid one huge transaction, and avoid one transaction per record (slow).
   # Commit in moderately sized batches instead.
   batch_size = 1000
   for i in range(0, len(huge_dataset), batch_size):
       batch = huge_dataset[i:i + batch_size]
       with table.new_transaction() as tx:
           tx.append_data(records=batch, schema=batch_schema)
           success = tx.commit()

File and Path Issues
^^^^^^^^^^^^^^^^^^^^

**Problem**: FileNotFoundError when accessing data files

**Solution**: Verify that the file path is correct and accessible before using it:

.. code-block:: python

   import os
   from datashard import DataFile, FileFormat

   file_path = "/absolute/path/to/data.parquet"
   if os.path.exists(file_path):
       data_file = DataFile(
           file_path=file_path,
           file_format=FileFormat.PARQUET,
           # ... other parameters
       )
   else:
       raise FileNotFoundError(f"Data file does not exist: {file_path}")

**Problem**: Permission errors accessing table directories

**Solution**: Ensure the process has read/write permissions on the table location:

.. code-block:: bash

   # Check permissions
   ls -la /path/to/table

   # Set appropriate permissions (example)
   chmod -R 755 /path/to/table

Memory Issues
^^^^^^^^^^^^^

**Problem**: MemoryError (out of memory) with large datasets

**Solution**: Process data in smaller chunks:

.. code-block:: python

   from datashard import create_table, Schema
   import gc

   # Define schema
   chunk_schema = Schema(
       schema_id=1,
       fields=[
           {"id": 1, "name": "record_id", "type": "long", "required": True},
           {"id": 2, "name": "value", "type": "string", "required": True}
       ]
   )

   table = create_table("/path/to/table", chunk_schema)


   def process_large_dataset_in_chunks(table, large_dataset, schema, chunk_size=1000):
       results = []
       for i in range(0, len(large_dataset), chunk_size):
           chunk = large_dataset[i:i + chunk_size]

           with table.new_transaction() as tx:
               tx.append_data(records=chunk, schema=schema)
               results.append(tx.commit())

           # Allow garbage collection between chunks
           gc.collect()

       return results


   # Example usage
   large_data = [{"record_id": i, "value": f"value_{i}"} for i in range(100000)]
   process_large_dataset_in_chunks(table, large_data, chunk_schema, chunk_size=1000)

Debugging Techniques
--------------------

Enable Logging
^^^^^^^^^^^^^^

Enable logging to see what datashard is doing:

.. code-block:: python

   import logging

   # Enable debug logging
   logging.basicConfig(level=logging.DEBUG)

   # Or set specific loggers
   metadata_logger = logging.getLogger('datashard.metadata_manager')
   metadata_logger.setLevel(logging.DEBUG)

Diagnostic Information
^^^^^^^^^^^^^^^^^^^^^^

Collect diagnostic information for troubleshooting:

.. code-block:: python

   def diagnose_table(table):
       print(f"Table location: {table.table_path}")

       # Check metadata
       metadata = table.metadata_manager.refresh()
       if metadata:
           print(f"Metadata format version: {metadata.format_version}")
           print(f"Current snapshot ID: {metadata.current_snapshot_id}")
           print(f"Number of snapshots: {len(metadata.snapshots)}")
           print(f"Number of schemas: {len(metadata.schemas)}")
       else:
           print("No metadata found")

       # Check snapshots
       snapshots = table.snapshots()
       print(f"Available snapshots: {len(snapshots)}")

       # Check active transactions
       active_txs = table.transaction_manager.get_active_transactions()
       print(f"Active transactions: {len(active_txs)}")

Common Configuration Issues
---------------------------

Wrong Table Path
^^^^^^^^^^^^^^^^

**Problem**: Unable to create or access table at specified path

**Solution**: Verify that the parent directory exists and is accessible before creating the table:

.. code-block:: python

   import os
   from datashard import create_table, Schema

   # Define schema
   basic_schema = Schema(
       schema_id=1,
       fields=[
           {"id": 1, "name": "id", "type": "long", "required": True},
           {"id": 2, "name": "data", "type": "string", "required": True}
       ]
   )

   table_path = "/valid/path/for/table"

   # Ensure parent directory exists
   parent_dir = os.path.dirname(table_path)
   os.makedirs(parent_dir, exist_ok=True)

   # Create table
   table = create_table(table_path, basic_schema)

Incompatible File Formats
^^^^^^^^^^^^^^^^^^^^^^^^^

**Problem**: Issues reading/writing files in certain formats

**Solution**: Ensure dependencies are properly installed:

.. code-block:: python

   # datashard relies on pyarrow for file format support
   # Make sure it's properly installed
   try:
       import pyarrow as pa
       print(f"PyArrow version: {pa.__version__}")
   except ImportError:
       print("PyArrow not installed - install with: pip install pyarrow")

Environment-Specific Issues
---------------------------

Virtual Environment Issues
^^^^^^^^^^^^^^^^^^^^^^^^^^

If you use virtual environments, ensure datashard is installed in the active environment:

.. code-block:: bash

   # Activate your virtual environment first
   source /path/to/venv/bin/activate

   # Verify which Python you're using
   which python

   # Install datashard in the virtual environment
   pip install datashard

Path Issues
^^^^^^^^^^^

Make sure the table path is accessible and has appropriate permissions:

.. code-block:: python

   import os


   def validate_table_path(path):
       # Check if path exists, if not create it
       if not os.path.exists(path):
           os.makedirs(path, exist_ok=True)

       # Check if path is writable
       if not os.access(path, os.W_OK):
           raise PermissionError(f"Path not writable: {path}")

       # Check if it's actually a directory
       if not os.path.isdir(path):
           raise NotADirectoryError(f"Path is not a directory: {path}")

Support and Community
---------------------

When to Seek Help
^^^^^^^^^^^^^^^^^

- If you've reviewed this troubleshooting guide and still have issues
- For feature requests or enhancement suggestions
- To report bugs or unexpected behavior

Where to Get Help
^^^^^^^^^^^^^^^^^

1. Check the GitHub repository for similar issues
2. File an issue with detailed information about your problem
3. Include Python version, datashard version, and a reproducible example

The following snippet prints the version information to include in a report:

.. code-block:: python

   import sys
   import datashard

   print(f"Python version: {sys.version}")
   print(f"datashard version: {datashard.__version__}")
   print(f"System platform: {sys.platform}")

Common Error Messages
^^^^^^^^^^^^^^^^^^^^^

- **ConcurrentModificationException**: Another process modified the data during your transaction
- **FileNotFoundError**: The specified file path doesn't exist
- **PermissionError**: Insufficient permissions to access files/directories
- **ValueError**: Invalid parameter values provided to functions
- **RuntimeError**: General runtime issues with transaction state
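
When logging failures, it can help to attach one of the hints above to the exception that was caught. The following is a minimal sketch, not part of datashard: ``ERROR_HINTS`` and ``explain_error`` are hypothetical names introduced here for illustration, and the hint texts only paraphrase the list above. It matches on the exception class name, so it runs without importing datashard.

```python
# Hypothetical helper, not a datashard API. Hint texts paraphrase the
# "Common Error Messages" list; adjust them to your own setup.
ERROR_HINTS = {
    "ConcurrentModificationException":
        "another process modified the data during the transaction; retry with backoff",
    "FileNotFoundError":
        "the specified file path does not exist; check the path",
    "PermissionError":
        "insufficient permissions; check access to the files/directories",
    "ValueError":
        "an invalid parameter value was passed; check the arguments",
    "RuntimeError":
        "problem with transaction state; check how the transaction is used",
}


def explain_error(exc):
    """Return a hint for exc, matching by class name along its inheritance chain."""
    for cls in type(exc).__mro__:
        hint = ERROR_HINTS.get(cls.__name__)
        if hint is not None:
            return f"{cls.__name__}: {hint}"
    return f"{type(exc).__name__}: no specific hint; see the sections above"


# Example usage with a built-in exception
print(explain_error(FileNotFoundError("/absolute/path/to/data.parquet")))
```

Walking the class hierarchy means a subclass of a listed exception still gets the hint of its nearest listed parent, while unrelated exceptions fall through to the generic message.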