Schemas and Data Types
======================

This section describes how datashard defines and validates schemas, and which data types it supports.

Schema Overview
---------------

A schema in datashard defines the structure of a table's data: which columns exist, their data types, and other properties such as whether a value is required. Schemas are immutable; every schema change creates a new schema version.

Creating Schemas
----------------

Schema Definition
^^^^^^^^^^^^^^^^^

.. code-block:: python

   from datashard import Schema

   # Define the schema fields; every field needs its own field ID
   user_schema = Schema(
       schema_id=1,  # Required: unique schema version ID
       fields=[
           {"id": 1, "name": "id", "type": "long", "required": True},
           {"id": 2, "name": "name", "type": "string", "required": True},
           {"id": 3, "name": "age", "type": "long", "required": False},
           {"id": 4, "name": "salary", "type": "double", "required": False}
       ]
   )

Data Types
----------

datashard supports the following data types, which are stored using the underlying Parquet file format:

Supported Types
^^^^^^^^^^^^^^^

- **boolean**: True/False values
- **int**: 32-bit signed integers
- **long**: 64-bit signed integers
- **float**: 32-bit IEEE 754 floating point
- **double**: 64-bit IEEE 754 floating point
- **string**: UTF-8 encoded character sequences
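
To make the ranges above concrete, the following sketch checks whether a Python value fits a given type name. The ``value_fits_type`` helper is hypothetical and is not part of the datashard API; it only illustrates the sizes and representations listed above.

.. code-block:: python

   # Hypothetical helper, not part of datashard.
   INT32_MIN, INT32_MAX = -2**31, 2**31 - 1
   INT64_MIN, INT64_MAX = -2**63, 2**63 - 1


   def value_fits_type(value, type_name):
       """Return True if a Python value can be represented by the named type."""
       # bool is a subclass of int in Python, so handle it before the integer types.
       if type_name == "boolean":
           return isinstance(value, bool)
       if isinstance(value, bool):
           return False
       if type_name == "int":
           return isinstance(value, int) and INT32_MIN <= value <= INT32_MAX
       if type_name == "long":
           return isinstance(value, int) and INT64_MIN <= value <= INT64_MAX
       if type_name in ("float", "double"):
           # This sketch accepts integers here because they convert to floating point.
           return isinstance(value, (int, float))
       if type_name == "string":
           return isinstance(value, str)
       raise ValueError(f"Unknown type name: {type_name}")


   print(value_fits_type(2**31, "int"))   # one past the 32-bit maximum
   print(value_fits_type(2**31, "long"))  # well within the 64-bit range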

Type Selection Guidelines
^^^^^^^^^^^^^^^^^^^^^^^^^^

- Use **long** for integer IDs, counters, and epoch timestamps
- Use **int** for integer values that are guaranteed to fit in 32 bits
- Use **double** for floating-point numbers (prices, measurements)
- Use **float** only when reduced precision is acceptable
- Use **string** for text data (names, descriptions, emails)
- Use **boolean** for true/false flags
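
The difference between **float** and **double** can be seen without datashard. The sketch below uses Python's standard ``struct`` module to store a value as 32-bit and as 64-bit IEEE 754 and read it back; the helper names are illustrative only.

.. code-block:: python

   import struct


   def round_trip_float32(value):
       """Encode a Python float as 32-bit IEEE 754 and decode it again."""
       return struct.unpack("<f", struct.pack("<f", value))[0]


   def round_trip_float64(value):
       """Encode a Python float as 64-bit IEEE 754 and decode it again."""
       return struct.unpack("<d", struct.pack("<d", value))[0]


   salary = 95000.10

   # Python floats are already 64-bit, so the double round trip is exact.
   print(round_trip_float64(salary) == salary)

   # 32 bits cannot hold the same value, so the float round trip changes it.
   print(round_trip_float32(salary) == salary)

For a value of this size the 32-bit result differs only in the third decimal place, which may or may not matter for your data.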

Working with Schemas
--------------------

Creating a Table with Schema
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   from datashard import create_table, Schema

   # Define schema
   employee_schema = Schema(
       schema_id=1,
       fields=[
           {"id": 1, "name": "employee_id", "type": "long", "required": True},
           {"id": 2, "name": "name", "type": "string", "required": True},
           {"id": 3, "name": "department", "type": "string", "required": True},
           {"id": 4, "name": "salary", "type": "double", "required": True}
       ]
   )

   # Create table with schema
   table = create_table("/path/to/employee_table", employee_schema)

   print(f"Table created with schema version {employee_schema.schema_id}")

Appending Data with Schema
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   # Sample employee records
   employee_records = [
       {"employee_id": 1, "name": "Alice", "department": "Engineering", "salary": 95000.0},
       {"employee_id": 2, "name": "Bob", "department": "Sales", "salary": 75000.0},
       {"employee_id": 3, "name": "Charlie", "department": "Marketing", "salary": 80000.0}
   ]

   # Append records - schema ensures data consistency
   success = table.append_records(records=employee_records, schema=employee_schema)

   if success:
       print(f"Successfully added {len(employee_records)} employees")

Schema Validation
^^^^^^^^^^^^^^^^^

datashard automatically validates that records match the expected schema during append operations, preventing invalid data from being added:

.. code-block:: python

   # This will fail validation: required fields are missing
   invalid_records = [
       {"employee_id": 4, "name": "David"}  # Missing department and salary
   ]

   try:
       table.append_records(records=invalid_records, schema=employee_schema)
   except Exception as e:
       print(f"Validation failed: {e}")
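
This section does not describe how the validation is implemented or which exception type it raises. As a rough illustration of a required-field check, the sketch below uses a hypothetical ``missing_required_fields`` helper that is not part of the datashard API and works on plain field dictionaries.

.. code-block:: python

   # Hypothetical helper, not part of datashard.
   def missing_required_fields(record, fields):
       """Return the names of required fields that are absent or None in a record."""
       return [
           field["name"]
           for field in fields
           if field["required"] and record.get(field["name"]) is None
       ]


   employee_fields = [
       {"id": 1, "name": "employee_id", "type": "long", "required": True},
       {"id": 2, "name": "name", "type": "string", "required": True},
       {"id": 3, "name": "department", "type": "string", "required": True},
       {"id": 4, "name": "salary", "type": "double", "required": True},
   ]

   record = {"employee_id": 4, "name": "David"}
   print(missing_required_fields(record, employee_fields))
   # ['department', 'salary']

Running a check like this before an append makes it possible to report exactly which fields are missing instead of relying on a generic error message.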

Schema Evolution
----------------

Schemas are immutable in datashard. To evolve a schema, create a new schema version:

Creating a New Schema Version
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   # Original schema (version 1)
   original_schema = Schema(
       schema_id=1,
       fields=[
           {"id": 1, "name": "user_id", "type": "long", "required": True},
           {"id": 2, "name": "name", "type": "string", "required": True}
       ]
   )

   # Evolved schema (version 2) - adds email field
   evolved_schema = Schema(
       schema_id=2,
       fields=[
           {"id": 1, "name": "user_id", "type": "long", "required": True},
           {"id": 2, "name": "name", "type": "string", "required": True},
           {"id": 3, "name": "email", "type": "string", "required": False}
       ]
   )

   # New data uses the evolved schema. Here, table is assumed to be a
   # table that was created with original_schema.
   new_records = [
       {"user_id": 5, "name": "Eve", "email": "eve@example.com"}
   ]

   table.append_records(records=new_records, schema=evolved_schema)
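
Records written under version 1 have no ``email`` value. How datashard presents such records when they are read is not covered in this section. The sketch below shows one common convention, which is to fill absent optional fields with ``None``. The ``project_record`` helper is hypothetical and is not part of the datashard API.

.. code-block:: python

   # Hypothetical helper, not part of datashard.
   def project_record(record, fields):
       """Shape a record to a field list, using None for absent optional fields."""
       projected = {}
       for field in fields:
           name = field["name"]
           if name in record:
               projected[name] = record[name]
           elif field["required"]:
               raise KeyError(f"Required field missing: {name}")
           else:
               projected[name] = None
       return projected


   evolved_fields = [
       {"id": 1, "name": "user_id", "type": "long", "required": True},
       {"id": 2, "name": "name", "type": "string", "required": True},
       {"id": 3, "name": "email", "type": "string", "required": False},
   ]

   old_record = {"user_id": 1, "name": "Alice"}  # written under version 1
   print(project_record(old_record, evolved_fields))
   # {'user_id': 1, 'name': 'Alice', 'email': None}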

Schema Evolution Rules
^^^^^^^^^^^^^^^^^^^^^^

When evolving schemas:

1. **Keep field IDs stable**: Existing field IDs should not change
2. **Add new fields as optional**: New fields should have ``required: False``
3. **Don't remove fields**: Historical data may reference removed fields
4. **Don't change field types**: Type changes can break historical data access
5. **Increment schema_id**: Each schema version needs a unique ID
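
The rules above can be checked mechanically before a new version is used. The following is a minimal sketch that compares two versions given as plain schema IDs and field lists. ``check_evolution`` is a hypothetical helper, not part of the datashard API, and it interprets "increment" as "the new ID must be larger than the old one".

.. code-block:: python

   # Hypothetical helper, not part of datashard.
   def check_evolution(old_id, old_fields, new_id, new_fields):
       """Return a list of rule violations; an empty list means none were found."""
       problems = []
       if new_id <= old_id:
           problems.append(f"schema_id must increase (got {old_id} -> {new_id})")

       new_by_name = {field["name"]: field for field in new_fields}
       old_names = {field["name"] for field in old_fields}

       for old in old_fields:
           new = new_by_name.get(old["name"])
           if new is None:
               problems.append(f"field '{old['name']}' was removed")
           elif new["id"] != old["id"]:
               problems.append(f"field '{old['name']}' changed ID")
           elif new["type"] != old["type"]:
               problems.append(f"field '{old['name']}' changed type")

       for new in new_fields:
           if new["name"] not in old_names and new["required"]:
               problems.append(f"new field '{new['name']}' must be optional")
       return problems


   v1_fields = [
       {"id": 1, "name": "user_id", "type": "long", "required": True},
       {"id": 2, "name": "name", "type": "string", "required": True},
   ]
   v2_fields = v1_fields + [
       {"id": 3, "name": "email", "type": "string", "required": False},
   ]

   print(check_evolution(1, v1_fields, 2, v2_fields))
   # []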

Schema Management
^^^^^^^^^^^^^^^^^

Accessing Current Schema
""""""""""""""""""""""""

.. code-block:: python

   # Get the current schema of a table
   current_metadata = table.metadata_manager.refresh()

   if current_metadata and current_metadata.schemas:
       current_schema_id = current_metadata.current_schema_id
       current_schema = next(
           (s for s in current_metadata.schemas if s.schema_id == current_schema_id),
           None
       )

       if current_schema:
           print(f"Current schema version: {current_schema.schema_id}")
           print(f"Number of fields: {len(current_schema.fields)}")

           for field in current_schema.fields:
               print(f"  - {field['name']}: {field['type']}")

Schema History
""""""""""""""

.. code-block:: python

   # Access all schema versions
   if current_metadata:
       print("Schema history:")
       for schema in current_metadata.schemas:
           print(f"  Schema {schema.schema_id}: {len(schema.fields)} fields")

Best Practices
--------------

Schema Design
^^^^^^^^^^^^^

1. **Plan Ahead**: Consider future data evolution needs when designing schemas
2. **Use Appropriate Types**: Choose the most specific type that fits your data

   - Use ``long`` for IDs and large integers
   - Use ``double`` for fractional (floating-point) numbers
   - Use ``string`` for text
   - Use ``boolean`` for flags

3. **Required vs Optional**: Mark fields as required only if they will always have values
4. **Field IDs**: Assign field IDs sequentially starting from 1
5. **Naming**: Use clear, descriptive field names (snake_case recommended)
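
The field ID and naming conventions are easy to check automatically. The sketch below is a small lint pass over a list of field dictionaries. ``lint_fields`` and the ``SNAKE_CASE`` pattern are hypothetical and are not part of the datashard API.

.. code-block:: python

   import re

   # Lowercase words made of letters and digits, separated by single underscores.
   SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


   def lint_fields(fields):
       """Return warnings for fields that break the ID or naming conventions."""
       warnings = []
       ids = [field["id"] for field in fields]
       if ids != list(range(1, len(fields) + 1)):
           warnings.append(f"field IDs are not sequential from 1: {ids}")
       for field in fields:
           if not SNAKE_CASE.match(field["name"]):
               warnings.append(f"field name {field['name']!r} is not snake_case")
       return warnings


   fields = [
       {"id": 1, "name": "employee_id", "type": "long", "required": True},
       {"id": 3, "name": "userName", "type": "string", "required": True},
   ]

   for warning in lint_fields(fields):
       print(warning)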

Performance Considerations
^^^^^^^^^^^^^^^^^^^^^^^^^^

- Fewer fields generally mean faster reads
- Storing values in simple types (``long``, ``string``) is cheaper than deriving them through complex computations
- Schema validation adds minimal overhead
- Field order doesn't affect performance

Example: Complete Schema Workflow
----------------------------------

.. code-block:: python

   from datashard import create_table, Schema

   # 1. Define initial schema
   metrics_schema = Schema(
       schema_id=1,
       fields=[
           {"id": 1, "name": "timestamp", "type": "long", "required": True},
           {"id": 2, "name": "metric_name", "type": "string", "required": True},
           {"id": 3, "name": "value", "type": "double", "required": True}
       ]
   )

   # 2. Create table
   metrics_table = create_table("/tmp/metrics", metrics_schema)

   # 3. Add initial data
   initial_metrics = [
       {"timestamp": 1700000000000, "metric_name": "cpu_usage", "value": 45.2},
       {"timestamp": 1700000000000, "metric_name": "memory_usage", "value": 62.8}
   ]

   metrics_table.append_records(records=initial_metrics, schema=metrics_schema)

   # 4. Evolve schema to add tags
   enhanced_schema = Schema(
       schema_id=2,
       fields=[
           {"id": 1, "name": "timestamp", "type": "long", "required": True},
           {"id": 2, "name": "metric_name", "type": "string", "required": True},
           {"id": 3, "name": "value", "type": "double", "required": True},
           {"id": 4, "name": "tags", "type": "string", "required": False}  # New field
       ]
   )

   # 5. Add new data with enhanced schema
   tagged_metrics = [
       {"timestamp": 1700000100000, "metric_name": "cpu_usage", "value": 52.1, "tags": "production"}
   ]

   metrics_table.append_records(records=tagged_metrics, schema=enhanced_schema)

   # 6. Clean up
   import shutil
   shutil.rmtree("/tmp/metrics")

   print("Schema workflow completed successfully")