
Type System Specification

Overview

New in 2.1

PostgreSQL type mappings were added in DataJoint 2.1. See Database Backends for backend configuration.

This document defines a three-layer type architecture:

  1. Native database types - Backend-specific (FLOAT, TINYINT UNSIGNED, LONGBLOB). Discouraged for direct use.
  2. Core DataJoint types - Standardized across backends, scientist-friendly (int32, float32, bool, json).
  3. Codec Types - Programmatic types with encode()/decode() semantics. Composable.

| Layer | Description | Examples |
|---|---|---|
| 3. Codec types | Programmatic types with encode()/decode() semantics | <blob>, <attach>, <object@>, <hash@>, <filepath@>, user-defined |
| 2. Core DataJoint | Standardized, scientist-friendly types (preferred) | int32, float64, varchar(n), bool, datetime, json, bytes |
| 1. Native database | Backend-specific types (discouraged) | INT, FLOAT, TINYINT UNSIGNED, LONGBLOB |

Codec types resolve through core types to native types: <blob> → bytes → LONGBLOB.

Syntax distinction:

  • Core types: int32, float64, varchar(255) (no brackets)
  • Codec types: <blob>, <object@store>, <filepath@main> (angle brackets)
  • The @ character indicates store (object storage vs in-table)

Core DataJoint Types (Layer 2)

Core types provide a standardized, scientist-friendly interface that works identically across MySQL and PostgreSQL backends. Users should prefer these over native database types.

All core types are recorded in field comments using :type: syntax for reconstruction.

Numeric Types

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| int8 | 8-bit signed | TINYINT | SMALLINT |
| int16 | 16-bit signed | SMALLINT | SMALLINT |
| int32 | 32-bit signed | INT | INTEGER |
| int64 | 64-bit signed | BIGINT | BIGINT |
| float32 | 32-bit float | FLOAT | REAL |
| float64 | 64-bit float | DOUBLE | DOUBLE PRECISION |
| decimal(n,f) | Fixed-point | DECIMAL(n,f) | NUMERIC(n,f) |

String Types

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| char(n) | Fixed-length | CHAR(n) | CHAR(n) |
| varchar(n) | Variable-length | VARCHAR(n) | VARCHAR(n) |

Encoding: All strings use UTF-8 (utf8mb4 in MySQL, UTF8 in PostgreSQL). See Encoding and Collation Policy for details.

Boolean

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| bool | True/False | TINYINT | BOOLEAN |

Date/Time Types

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| date | Date only | DATE | DATE |
| datetime | Date and time | DATETIME | TIMESTAMP |

Timezone policy: All datetime values should be stored as UTC. Timezone conversion is a presentation concern handled by the application layer, not the database. This ensures:

  • Reproducible computations regardless of server or client timezone settings
  • Simple arithmetic on temporal values (no DST ambiguity)
  • Portable data across systems and regions
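The policy can be illustrated with plain Python (standard library only; this is not DataJoint API): normalize to UTC before insert, and convert back to a local zone only for display.

```python
from datetime import datetime, timedelta, timezone

# Illustrative only: convert a local timestamp to naive UTC before insert;
# converting back for display is the application's job, not the database's.
local = datetime(2024, 6, 1, 14, 30, tzinfo=timezone(timedelta(hours=-5)))
stored = local.astimezone(timezone.utc).replace(tzinfo=None)  # naive UTC
print(stored)  # 2024-06-01 19:30:00
```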

Use CURRENT_TIMESTAMP for auto-populated creation times:

created_at : datetime = CURRENT_TIMESTAMP

Binary Types

The core bytes type stores raw bytes without any serialization. Use the <blob> codec for serialized Python objects.

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| bytes | Raw bytes | LONGBLOB | BYTEA |

Other Types

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| json | JSON document | JSON | JSONB |
| uuid | UUID | BINARY(16) | UUID |
| enum(...) | Enumeration | ENUM(...) | CREATE TYPE ... AS ENUM |
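Because the uuid mapping differs by backend, here is a standard-library sketch of the two representations (illustrative; the column-handling details are assumptions, not DataJoint code):

```python
import uuid

# The same UUID as MySQL stores it (16 raw bytes in BINARY(16)) versus
# PostgreSQL's native UUID column (canonical text form shown here).
u = uuid.uuid5(uuid.NAMESPACE_DNS, "datajoint.org")
mysql_value = u.bytes        # 16 bytes for BINARY(16)
postgres_value = str(u)      # native UUID type accepts the text form
print(len(mysql_value))                # 16
print(uuid.UUID(postgres_value) == u)  # True: lossless round trip
```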

Native Passthrough Types

Users may use native database types directly (e.g., int, float, mediumint, tinyblob), but these are discouraged and will generate a warning. Native types lack explicit size information, are not recorded in field comments, and may have portability issues across database backends.

Prefer core DataJoint types over native types:

| Native (discouraged) | Core DataJoint (preferred) |
|---|---|
| int | int32 |
| float | float32 or float64 |
| double | float64 |
| tinyint | int8 |
| smallint | int16 |
| bigint | int64 |
| tinyint unsigned | int16 (use larger signed type) |
| smallint unsigned | int32 (use larger signed type) |
| int unsigned | int64 (use larger signed type) |
| bigint unsigned | decimal(20,0) (use decimal for full range) |
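A quick range check (plain Python) behind the "use a larger signed type" rule; the digit count also shows why decimal(20,0) covers the full bigint unsigned range:

```python
# Each unsigned MySQL type fits inside the next-larger signed core type;
# only bigint unsigned exceeds int64 and therefore needs decimal(20,0).
ranges = {
    "tinyint unsigned":  (0, 2**8 - 1),    # fits in int16
    "smallint unsigned": (0, 2**16 - 1),   # fits in int32
    "int unsigned":      (0, 2**32 - 1),   # fits in int64
    "bigint unsigned":   (0, 2**64 - 1),   # exceeds int64
}
print(ranges["int unsigned"][1] <= 2**63 - 1)     # True
print(ranges["bigint unsigned"][1] <= 2**63 - 1)  # False
print(len(str(2**64 - 1)))  # 20 digits, hence decimal(20,0)
```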

Type Modifiers Policy

DataJoint table definitions have their own syntax for constraints and metadata. SQL type modifiers are not allowed in type specifications because they conflict with DataJoint's declarative syntax:

| Modifier | Status | DataJoint Alternative |
|---|---|---|
| NOT NULL / NULL | ❌ Not allowed | Use = NULL for nullable; omit default for required |
| DEFAULT value | ❌ Not allowed | Use = value syntax before the type |
| PRIMARY KEY | ❌ Not allowed | Position above --- line |
| UNIQUE | ❌ Not allowed | Use DataJoint index syntax |
| COMMENT 'text' | ❌ Not allowed | Use # comment syntax |
| CHARACTER SET | ❌ Not allowed | Database-level configuration |
| COLLATE | ❌ Not allowed | Database-level configuration |
| AUTO_INCREMENT | ⚠️ Discouraged | Allowed with native types only; generates warning |
| UNSIGNED | ⚠️ Discouraged | MySQL-specific, not portable (use larger signed type) |

Nullability and defaults: DataJoint handles nullability through the default value syntax. An attribute is nullable if and only if its default is NULL:

# Required (NOT NULL, no default)
name : varchar(100)

# Nullable (default is NULL)
nickname = NULL : varchar(100)

# Required with default value
status = "active" : varchar(20)
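The rule above can be sketched as a tiny helper (a hypothetical encoding, not DataJoint's actual DDL generator) that maps the declared default to SQL nullability:

```python
# Hypothetical sketch of the nullability rule: an attribute is nullable
# if and only if its declared default is NULL.
def sql_nullability(default):
    """default=None here means 'no default declared' (illustrative encoding)."""
    if default is None:
        return "NOT NULL"                   # required, no default
    if str(default).upper() == "NULL":
        return "DEFAULT NULL"               # nullable
    return f"NOT NULL DEFAULT '{default}'"  # required with default

print(sql_nullability(None))      # NOT NULL
print(sql_nullability("NULL"))    # DEFAULT NULL
print(sql_nullability("active"))  # NOT NULL DEFAULT 'active'
```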

Auto-increment policy: DataJoint discourages AUTO_INCREMENT / SERIAL because:

  • It breaks reproducibility (IDs depend on insertion order)
  • It makes pipelines non-deterministic
  • It complicates data migration and replication
  • Primary keys should be meaningful, not arbitrary

If required, use native types: int auto_increment or serial (with warning).

Encoding and Collation Policy

Character encoding and collation are database-level configuration, not part of type definitions. This ensures consistent behavior across all tables and simplifies portability.

Configuration (in dj.config or datajoint.json):

{
    "database.charset": "utf8mb4",
    "database.collation": "utf8mb4_bin"
}

Defaults:

| Setting | MySQL | PostgreSQL |
|---|---|---|
| Charset | utf8mb4 | UTF8 |
| Collation | utf8mb4_bin | C |

Policy:

  • UTF-8 required: DataJoint validates that the charset is UTF-8 compatible at connection time
  • Case-sensitive by default: Binary collation (utf8mb4_bin / C) ensures predictable comparisons
  • No per-column overrides: CHARACTER SET and COLLATE are rejected in type definitions
  • Like timezone: Encoding is infrastructure configuration, not part of the data model

Codec Types (Layer 3)

Codec types provide encode()/decode() semantics on top of core types. They are composable and can be built-in or user-defined.

Storage Mode: @ Convention

The @ character in codec syntax indicates object store (vs in-table):

  • No @: In-table storage (database column) - e.g., <blob>, <attach>
  • @ present: Object store - e.g., <blob@>, <attach@store>
  • @ alone: Use default store - e.g., <blob@>
  • @name: Use named store - e.g., <blob@cold>

Some codecs support both modes (<blob>, <attach>), others are store-only (<object@>, <npy@>, <hash@>, <filepath@>).
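The @ convention above can be captured with a small parser (a hypothetical helper, not part of the spec's API) that splits a codec specification into its name, storage mode, and store name:

```python
import re

# Hypothetical helper: split '<name>', '<name@>', or '<name@store>'
# into (codec_name, is_store, store_name) per the @ convention.
CODEC_RE = re.compile(r"^<(?P<name>\w+)(?P<at>@(?P<store>\w+)?)?>$")

def parse_codec(spec):
    """Parse a codec type specification into its parts."""
    m = CODEC_RE.match(spec)
    if m is None:
        raise ValueError(f"not a codec type: {spec!r}")
    is_store = m.group("at") is not None
    return m.group("name"), is_store, m.group("store")  # store=None => default

print(parse_codec("<blob>"))       # ('blob', False, None): in-table
print(parse_codec("<blob@>"))      # ('blob', True, None): default store
print(parse_codec("<blob@cold>"))  # ('blob', True, 'cold'): named store
```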

Codec Base Class

Codecs inherit from dj.Codec and auto-register when their class is defined. See the Codec API Specification for complete details on creating custom codecs.

class GraphCodec(dj.Codec):
    """Auto-registered as <graph>."""
    name = "graph"

    def get_dtype(self, is_store: bool) -> str:
        return "<blob>"

    def encode(self, graph, *, key=None, store_name=None):
        return {'nodes': list(graph.nodes()), 'edges': list(graph.edges())}

    def decode(self, stored, *, key=None):
        import networkx as nx
        G = nx.Graph()
        G.add_nodes_from(stored['nodes'])
        G.add_edges_from(stored['edges'])
        return G

Codec Resolution and Chaining

Codecs resolve to core types through chaining. The get_dtype(is_store) method returns the appropriate dtype based on storage mode:

| Codec | is_store | Resolution Chain | SQL Type |
|---|---|---|---|
| <blob> | False | "bytes" | LONGBLOB/BYTEA |
| <blob@> | True | "<hash>" → "json" | JSON/JSONB |
| <blob@cold> | True | "<hash>" → "json" (store=cold) | JSON/JSONB |
| <attach> | False | "bytes" | LONGBLOB/BYTEA |
| <attach@> | True | "<hash>" → "json" | JSON/JSONB |
| <object@> | True | "json" | JSON/JSONB |
| <npy@> | True | "json" | JSON/JSONB |
| <object> | False | ERROR (store only) | — |
| <npy> | False | ERROR (store only) | — |
| <hash@> | True | "json" | JSON/JSONB |
| <filepath@> | True | "json" | JSON/JSONB |
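The chaining behavior can be sketched with a toy registry (names and structure are illustrative, not DataJoint's API): chase get_dtype() results until only a core type remains.

```python
# Toy sketch of codec resolution mirroring the table above.
CORE_TYPES = {"bytes", "json"}

REGISTRY = {  # codec name -> get_dtype(is_store)
    "blob":   lambda is_store: "<hash>" if is_store else "bytes",
    "attach": lambda is_store: "<hash>" if is_store else "bytes",
    "hash":   lambda is_store: "json" if is_store else None,
}

def resolve(name, is_store):
    """Return the core type a codec ultimately resolves to."""
    dtype = REGISTRY[name](is_store)
    if dtype is None:
        raise ValueError(f"<{name}> is store-only")
    while dtype.startswith("<"):                   # chained codec, e.g. "<hash>"
        dtype = REGISTRY[dtype.strip("<>")](True)  # chained storage is store-side
    assert dtype in CORE_TYPES
    return dtype

print(resolve("blob", False))  # bytes  (<blob>  -> LONGBLOB/BYTEA)
print(resolve("blob", True))   # json   (<blob@> -> <hash> -> json)
```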

<object@> / <object@store> - Schema-Addressed Storage

Built-in codec. Store only.

Schema-addressed OAS storage for complex, multi-part objects (files, folders, Zarr arrays, HDF5):

  • Schema-addressed: Path mirrors database structure: {schema}/{table}/{pk}/{attribute}/
  • Complex objects: Can store directory structures with multiple files (e.g., Zarr arrays)
  • One-to-one relationship with table row
  • Deleted when row is deleted
  • Returns ObjectRef for lazy access
  • Supports direct writes (Zarr, HDF5) via fsspec
  • dtype: json (stores path, store name, metadata)

class Analysis(dj.Computed):
    definition = """
    -> Recording
    ---
    results : <object@>         # default store
    archive : <object@cold>     # specific store
    """

Implementation

class ObjectCodec(SchemaCodec):
    """Schema-addressed OAS storage. Store only."""
    name = "object"

    # get_dtype inherited from SchemaCodec

    def encode(self, value, *, key=None, store_name=None) -> dict:
        schema, table, field, pk = self._extract_context(key)
        path, _ = self._build_path(schema, table, field, pk)
        backend = self._get_backend(store_name)
        backend.put(path, value)
        return {"path": path, "store": store_name}  # plus further object metadata

    def decode(self, stored: dict, *, key=None) -> ObjectRef:
        backend = self._get_backend(stored["store"])
        return ObjectRef.from_json(stored, backend=backend)

<hash@> / <hash@store> - Hash-Addressed Storage

Built-in codec. Store only.

Hash-addressed storage with deduplication for individual, atomic objects:

  • Hash-addressed: Path derived from content hash: _hash/{hash[:2]}/{hash[2:4]}/{hash}
  • Individual/atomic objects only: Stores single files or serialized blobs (not directory structures)
  • Cannot handle complex multi-part objects like Zarr arraysβ€”use <object@> for those
  • Per-project scope: content is shared across all schemas in a project (not per-schema)
  • Many-to-one: multiple rows (even across schemas) can reference same content
  • Reference counted for garbage collection
  • Deduplication: identical content stored once across the entire project
  • dtype: json (stores hash, store name, size, metadata)

store_root/
├── {schema}/{table}/{pk}/     # schema-addressed storage
│   └── {attribute}/
│
└── _hash/                     # hash-addressed storage
    └── {hash[:2]}/{hash[2:4]}/{hash}

Implementation

class HashCodec(dj.Codec):
    """Hash-addressed storage. Store only."""
    name = "hash"

    def get_dtype(self, is_store: bool) -> str:
        if not is_store:
            raise DataJointError("<hash> requires @ (store only)")
        return "json"

    def encode(self, data: bytes, *, key=None, store_name=None) -> dict:
        """Store content, return metadata as JSON."""
        hash_id = hashlib.md5(data).hexdigest()  # 32-char hex
        store = get_store(store_name or dj.config['stores']['default'])
        path = f"_hash/{hash_id[:2]}/{hash_id[2:4]}/{hash_id}"

        if not store.exists(path):
            store.put(path, data)

        # Metadata stored in JSON column (no separate registry)
        return {"hash": hash_id, "store": store_name, "size": len(data)}

    def decode(self, stored: dict, *, key=None) -> bytes:
        """Retrieve content by hash."""
        store = get_store(stored["store"])
        path = f"_hash/{stored['hash'][:2]}/{stored['hash'][2:4]}/{stored['hash']}"
        return store.get(path)

Database Column

The <hash@> type stores JSON metadata:

-- content column (MySQL)
features JSON NOT NULL
-- Contains: {"hash": "abc123...", "store": "main", "size": 12345}

-- content column (PostgreSQL)
features JSONB NOT NULL

<filepath@store> - Portable External Reference

Built-in codec. In-store only (store required).

Relative path references within configured stores:

  • Relative paths: paths within a configured store (portable across environments)
  • Store-aware: resolves paths against configured store backend
  • Returns ObjectRef for lazy access via fsspec
  • Stores optional checksum for verification
  • dtype: json (stores path, store name, checksum, metadata)

Key benefit: Portability. The path is relative to the store, so pipelines can be moved between environments (dev → prod, cloud → local) by changing store configuration without updating data.

class RawData(dj.Manual):
    definition = """
    session_id : int32
    ---
    recording : <filepath@main>      # relative path within 'main' store
    """

# Insert - user provides relative path within the store
table.insert1({
    'session_id': 1,
    'recording': 'experiment_001/data.nwb'  # relative to main store root
})

# Fetch - returns ObjectRef (lazy)
row = (table & 'session_id=1').fetch1()
ref = row['recording']           # ObjectRef
ref.download('/local/path')      # explicit download
ref.open()                       # fsspec streaming access

When to Use <filepath@store> vs varchar

| Use Case | Recommended Type |
|---|---|
| Need ObjectRef/lazy access | <filepath@store> |
| Need portability (relative paths) | <filepath@store> |
| Want checksum verification | <filepath@store> |
| Just storing a URL string | varchar |
| External URLs you don't control | varchar |

For arbitrary URLs (S3, HTTP, etc.) where you don't need ObjectRef semantics, just use varchar. A string is simpler and more transparent.

Implementation

class FilepathCodec(dj.Codec):
    """Store-relative file references. External only."""
    name = "filepath"

    def get_dtype(self, is_store: bool) -> str:
        if not is_store:
            raise DataJointError("<filepath> requires @store")
        return "json"

    def encode(self, relative_path: str, *, key=None, store_name=None) -> dict:
        """Register reference to file in store."""
        get_store(store_name)  # store_name is required; validates the store is configured
        return {'path': relative_path, 'store': store_name}

    def decode(self, stored: dict, *, key=None) -> ObjectRef:
        """Return ObjectRef for lazy access."""
        return ObjectRef(store=get_store(stored['store']), path=stored['path'])

Database Column

-- filepath column (MySQL)
recording JSON NOT NULL
-- Contains: {"path": "experiment_001/data.nwb", "store": "main", "checksum": "...", "size": ...}

-- filepath column (PostgreSQL)
recording JSONB NOT NULL

Key Differences from Legacy filepath@store (now <filepath@store>)

| Feature | Legacy | New |
|---|---|---|
| Access | Copy to local stage | ObjectRef (lazy) |
| Copying | Automatic | Explicit via ref.download() |
| Streaming | No | Yes, via ref.open() |
| Paths | Relative | Relative (unchanged) |
| Store param | Required (@store) | Required (@store) |

Database Types

json - Cross-Database JSON Type

JSON storage compatible across MySQL and PostgreSQL:

-- MySQL
column_name JSON NOT NULL

-- PostgreSQL (uses JSONB for better indexing)
column_name JSONB NOT NULL

The json database type:

  • Is used as the dtype by built-in codecs (<object@>, <hash@>, <filepath@store>)
  • Stores arbitrary JSON-serializable data
  • Automatically uses the appropriate type for the database backend
  • Supports JSON path queries where available

Built-in Codecs

<blob> / <blob@> - Serialized Python Objects

Supports both in-table and in-store storage.

Serializes Python objects using DataJoint's custom binary serialization format. The format uses protocol headers and type-specific encoding to serialize complex Python objects efficiently.

Serialization format:

  • Protocol headers:
    • mYm — Original MATLAB-compatible format for numeric arrays, structs, cells (see mYm on MATLAB FileExchange and mym on GitHub)
    • dj0 — Extended format supporting Python-specific types (UUID, Decimal, datetime, etc.)
  • Compression: Automatic zlib compression for data > 1KB
  • Type codes: Each Python type has a specific serialization code:
    • 'A' — NumPy arrays (numeric)
    • 'F' — NumPy recarrays (structured arrays with fields)
    • '\x01' — Tuples
    • '\x02' — Lists
    • '\x03' — Sets
    • '\x04' — Dicts
    • '\x05' — Strings (UTF-8)
    • '\x06' — Bytes
    • '\x0a' — Unbounded integers
    • '\x0b' — Booleans
    • '\x0c' — Complex numbers
    • '\x0d' — Floats
    • 'd' — Decimal
    • 't' — Datetime/date/time
    • 'u' — UUID
    • 'S' — MATLAB structs
    • 'C' — MATLAB cell arrays

Version detection: The protocol header (mYm\0 or dj0\0) is embedded at the start of the blob, enabling automatic format detection and backward compatibility.
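The header-detection and compression rules can be sketched as follows (the real blob module's byte layout may differ in detail; this is an illustration, not the implementation):

```python
import zlib

# Sketch of the version detection and compression rules described above.
def detect_protocol(blob: bytes) -> str:
    """Identify the serialization format from the embedded header."""
    if blob.startswith(b"mYm\0"):
        return "mYm"   # MATLAB-compatible format
    if blob.startswith(b"dj0\0"):
        return "dj0"   # extended Python format
    raise ValueError("unrecognized blob header")

def maybe_compress(payload: bytes, threshold: int = 1024) -> bytes:
    """Apply zlib compression only when the payload exceeds ~1 KB."""
    return zlib.compress(payload) if len(payload) > threshold else payload

print(detect_protocol(b"dj0\0..."))                    # dj0
print(len(maybe_compress(b"\0" * 100_000)) < 100_000)  # True: compressed
```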

Storage modes:

  • <blob>: Stored in database (bytes → LONGBLOB/BYTEA)
  • <blob@>: Stored externally via <hash@> with MD5 deduplication
  • <blob@store>: Stored in specific named store

class BlobCodec(dj.Codec):
    """Serialized Python objects. Supports internal and external."""
    name = "blob"

    def get_dtype(self, is_store: bool) -> str:
        return "<hash>" if is_store else "bytes"

    def encode(self, value, *, key=None, store_name=None) -> bytes:
        from . import blob
        return blob.pack(value, compress=True)

    def decode(self, stored, *, key=None) -> Any:
        from . import blob
        return blob.unpack(stored)

Usage:

class ProcessedData(dj.Computed):
    definition = """
    -> RawData
    ---
    small_result : <blob>          # in-table (in database)
    large_result : <blob@>         # in-store (default store)
    archive_result : <blob@cold>   # in-store (specific store)
    """

<attach> / <attach@> - File Attachments

Supports both in-table and in-store storage.

Stores files with filename preserved. On fetch, extracts to configured download path.

  • <attach>: Stored in database (bytes → LONGBLOB/BYTEA)
  • <attach@>: Stored in object store via <hash@> with deduplication
  • <attach@store>: Stored in specific named store

class AttachCodec(dj.Codec):
    """File attachment with filename. Supports in-table and in-store."""
    name = "attach"

    def get_dtype(self, is_store: bool) -> str:
        return "<hash>" if is_store else "bytes"

    def encode(self, filepath, *, key=None, store_name=None) -> bytes:
        path = Path(filepath)
        return path.name.encode() + b"\0" + path.read_bytes()

    def decode(self, stored, *, key=None) -> str:
        filename, contents = stored.split(b"\0", 1)
        filename = filename.decode()
        download_path = Path(dj.config['download_path']) / filename
        download_path.write_bytes(contents)
        return str(download_path)

Usage:

class Attachments(dj.Manual):
    definition = """
    attachment_id : int32
    ---
    config : <attach>           # in-table (small file in DB)
    data_file : <attach@>       # in-store (default store)
    archive : <attach@cold>     # in-store (specific store)
    """

User-Defined Codecs

Users can define custom codecs for domain-specific data. See the Codec API Specification for complete examples including:

  • Simple serialization codecs
  • In-store codecs
  • JSON with schema validation
  • Context-dependent encoding
  • In-store-only codecs (Zarr, HDF5)

Storage Comparison

| Type | get_dtype | Resolves To | Storage Location | Dedup | Returns |
|---|---|---|---|---|---|
| <blob> | bytes | LONGBLOB/BYTEA | Database | No | Python object |
| <blob@> | <hash> | json | _hash/{hash} | Yes | Python object |
| <blob@s> | <hash> | json | _hash/{hash} | Yes | Python object |
| <attach> | bytes | LONGBLOB/BYTEA | Database | No | Local file path |
| <attach@> | <hash> | json | _hash/{hash} | Yes | Local file path |
| <attach@s> | <hash> | json | _hash/{hash} | Yes | Local file path |
| <object@> | json | JSON/JSONB | {schema}/{table}/{pk}/ | No | ObjectRef |
| <object@s> | json | JSON/JSONB | {schema}/{table}/{pk}/ | No | ObjectRef |
| <hash@> | json | JSON/JSONB | _hash/{hash} | Yes | bytes |
| <hash@s> | json | JSON/JSONB | _hash/{hash} | Yes | bytes |
| <filepath@s> | json | JSON/JSONB | Configured store | No | ObjectRef |

Garbage Collection for Hash Storage

Hash metadata (hash, store, size) is stored directly in each table's JSON column - no separate registry table is needed. Garbage collection scans all tables to find referenced hashes:

def garbage_collect(store_name):
    """Remove hash-addressed data not referenced by any table."""
    # Scan store for all hash files
    store = get_store(store_name)
    all_hashes = set(store.list_hashes())  # from _hash/ directory

    # Scan all tables for referenced hashes
    referenced = set()
    for schema in project.schemas:
        for table in schema.tables:
            for attr in table.heading.attributes:
                if uses_hash_storage(attr):  # <blob@>, <attach@>, <hash@>
                    for row in table:
                        val = row.get(attr.name)
                        if val and val.get('store') == store_name:
                            referenced.add(val['hash'])

    # Delete orphaned files
    for hash_id in (all_hashes - referenced):
        store.delete(hash_path(hash_id))

Built-in Codec Comparison

| Feature | <blob> | <attach> | <object@> | <hash@> | <filepath@> |
|---|---|---|---|---|---|
| Storage modes | Both | Both | Store only | Store only | Store only |
| In-table dtype | bytes | bytes | N/A | N/A | N/A |
| In-store dtype | <hash@> | <hash@> | json | json | json |
| Addressing | Hash | Hash | Primary key | Hash | Relative path |
| Deduplication | Yes (in-store) | Yes (in-store) | No | Yes | No |
| Structure | Single blob | Single file | Files, folders | Single blob | Any |
| Returns | Python object | Local path | ObjectRef | bytes | ObjectRef |
| GC | Ref counted | Ref counted | With row | Ref counted | User managed |

When to use each:

  • <blob>: Serialized Python objects (NumPy arrays, dicts). Use <blob@> for large/duplicated data
  • <attach>: File attachments with filename preserved. Use <attach@> for large files
  • <object@>: Large/complex file structures (Zarr, HDF5) where DataJoint controls organization
  • <hash@>: Raw bytes with deduplication (typically used via <blob@> or <attach@>)
  • <filepath@store>: Portable references to externally-managed files
  • varchar: Arbitrary URLs/paths where ObjectRef semantics aren't needed

Key Design Decisions

  1. Three-layer architecture:
    • Layer 1: Native database types (backend-specific, discouraged)
    • Layer 2: Core DataJoint types (standardized, scientist-friendly)
    • Layer 3: Codec types (encode/decode, composable)
  2. Core types are scientist-friendly: float32, int8, bool, bytes instead of FLOAT, TINYINT, LONGBLOB
  3. Codecs use angle brackets: <blob>, <object@store>, <filepath@main> - distinguishes from core types
  4. @ indicates in-store storage: No @ = database, @ present = object store
  5. get_dtype(is_store) method: Codecs resolve dtype at declaration time based on storage mode
  6. Codecs are composable: <blob@> uses <hash@>, which uses json
  7. Built-in in-store codecs use JSON dtype: Stores metadata (path, hash, store name, etc.)
  8. Two OAS regions: object (PK-addressed) and hash (hash-addressed) within managed stores
  9. Filepath for portability: <filepath@store> uses relative paths within stores for environment portability
  10. No uri type: For arbitrary URLs, use varcharβ€”simpler and more transparent
  11. Naming conventions:
    • @ = in-store storage (object store)
    • No @ = in-table storage (database)
    • @ alone = default store
    • @name = named store
  12. Dual-mode codecs: <blob> and <attach> support both in-table and in-store storage
  13. In-store-only codecs: <object@>, <hash@>, <filepath@> require @
  14. Transparent access: Codecs return Python objects or file paths
  15. Lazy access: <object@> and <filepath@store> return ObjectRef
  16. MD5 for content hashing: See Hash Algorithm Choice below
  17. No separate registry: Hash metadata stored in JSON columns, not a separate table
  18. Auto-registration via __init_subclass__: Codecs register automatically when subclassedβ€”no decorator needed. Use register=False for abstract bases. Requires Python 3.10+.

Hash Algorithm Choice

Content-addressed storage uses MD5 (128-bit, 32-char hex) rather than SHA256 (256-bit, 64-char hex).

Rationale:

  1. Practical collision resistance is sufficient: The birthday bound for MD5 is ~2^64 operations before 50% collision probability. No scientific project will store anywhere near 10^19 files. For content deduplication (not cryptographic verification), MD5 provides adequate uniqueness.

  2. Storage efficiency: 32-char hashes vs 64-char hashes in every JSON metadata field. With millions of records, this halves the storage overhead for hash identifiers.

  3. Performance: MD5 is ~2-3x faster than SHA256 for large files. While both are fast, the difference is measurable when hashing large scientific datasets.

  4. Legacy compatibility: DataJoint's existing uuid_from_buffer() function uses MD5. The new system changes only the storage format (hex string in JSON vs binary UUID), not the underlying hash algorithm. This simplifies migration.

  5. Consistency with existing codebase: Internal functions use MD5 for query caching.
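A back-of-envelope check of the birthday bound cited in point 1, using the standard approximation p ≈ 1 − exp(−n(n−1)/2^(bits+1)):

```python
import math

# Probability of at least one collision among n objects under a 128-bit
# hash, via the birthday-bound approximation.
def collision_probability(n: float, bits: int = 128) -> float:
    return 1.0 - math.exp(-n * (n - 1) / 2 ** (bits + 1))

p = collision_probability(1e12)  # a trillion stored objects
print(p < 1e-12)  # True: far below any practical concern
```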

Why not SHA256?

SHA256 is the modern standard for content-addressable storage (Git, Docker, IPFS). However:

  • These systems prioritize cryptographic security against adversarial collision attacks
  • Scientific data pipelines face no adversarial threat model
  • The practical benefits (storage, speed, compatibility) outweigh the theoretical security gains

Note: If cryptographic verification is ever needed (e.g., for compliance or reproducibility audits), SHA256 checksums can be computed on-demand without changing the storage addressing scheme.

Migration from Legacy Types

| Legacy | New Equivalent |
|---|---|
| longblob (auto-serialized) | <blob> |
| blob@store | <blob@store> |
| attach | <attach> |
| attach@store | <attach@store> |
| filepath@store (copy-based) | <filepath@store> (ObjectRef-based) |

Migration from Legacy ~external_* Stores

Legacy external storage used per-schema ~external_{store} tables with UUID references. Migration to the new JSON-based hash storage requires:

def migrate_external_store(schema, store_name):
    """
    Migrate legacy ~external_{store} to the new JSON-based hash storage.

    1. Read all entries from ~external_{store}
    2. For each entry:
       - Fetch content from legacy location
       - Compute MD5 hash
       - Copy to _hash/{hash}/ if not exists
       - Update table column to new hash format
    3. After all schemas migrated, drop ~external_{store} tables
    """
    external_table = schema.external[store_name]

    for entry in external_table:
        legacy_uuid = entry['hash']

        # Fetch content from legacy location
        content = external_table.get(legacy_uuid)

        # Compute new content hash
        hash_id = hashlib.md5(content).hexdigest()

        # Store in new location if not exists
        new_path = f"_hash/{hash_id[:2]}/{hash_id[2:4]}/{hash_id}"
        store = get_store(store_name)
        if not store.exists(new_path):
            store.put(new_path, content)

        # Update referencing tables: convert UUID column to JSON with hash metadata
        # The JSON column stores {"hash": hash_id, "store": store_name, "size": len(content)}
        # ... update all tables that reference this UUID ...

    # After migration complete for all schemas:
    # DROP TABLE `{schema}`.`~external_{store}`

Migration considerations:

  • Legacy UUIDs were based on the MD5 content hash stored as binary(16) (UUID format)
  • The new system uses char(32) MD5 hex strings stored in JSON
  • The hash algorithm is unchanged (MD5); only the storage format differs
  • Migration can be done incrementally per schema
  • A backward-compatibility layer can read both formats during the transition

Open Questions

  1. How long should the backward compatibility layer support legacy ~external_* format?
  2. Should <hash@> (without store name) use a default store or require explicit store name?