# Type System Specification

## Overview

> **New in 2.1:** PostgreSQL type mappings were added in DataJoint 2.1. See Database Backends for backend configuration.
This document defines a three-layer type architecture:

- **Native database types** - Backend-specific (`FLOAT`, `TINYINT UNSIGNED`, `LONGBLOB`). Discouraged for direct use.
- **Core DataJoint types** - Standardized across backends, scientist-friendly (`int32`, `float32`, `bool`, `json`).
- **Codec types** - Programmatic types with `encode()`/`decode()` semantics. Composable.
| Layer | Description | Examples |
|---|---|---|
| 3. Codec Types | Programmatic types with `encode()`/`decode()` semantics | `<blob>`, `<attach>`, `<object@>`, `<hash@>`, `<filepath@>`, user-defined |
| 2. Core DataJoint | Standardized, scientist-friendly types (preferred) | `int32`, `float64`, `varchar(n)`, `bool`, `datetime`, `json`, `bytes` |
| 1. Native Database | Backend-specific types (discouraged) | `INT`, `FLOAT`, `TINYINT UNSIGNED`, `LONGBLOB` |
Codec types resolve through core types to native types: `<blob>` → `bytes` → `LONGBLOB`.
Syntax distinction:

- Core types: `int32`, `float64`, `varchar(255)` - no brackets
- Codec types: `<blob>`, `<object@store>`, `<filepath@main>` - angle brackets
- The `@` character indicates store (object storage vs in-table)
## Core DataJoint Types (Layer 2)

Core types provide a standardized, scientist-friendly interface that works identically across MySQL and PostgreSQL backends. Users should prefer these over native database types.

All core types are recorded in field comments using `:type:` syntax for reconstruction.
### Numeric Types

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| `int8` | 8-bit signed | TINYINT | SMALLINT |
| `int16` | 16-bit signed | SMALLINT | SMALLINT |
| `int32` | 32-bit signed | INT | INTEGER |
| `int64` | 64-bit signed | BIGINT | BIGINT |
| `float32` | 32-bit float | FLOAT | REAL |
| `float64` | 64-bit float | DOUBLE | DOUBLE PRECISION |
| `decimal(n,f)` | Fixed-point | DECIMAL(n,f) | NUMERIC(n,f) |
### String Types

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| `char(n)` | Fixed-length | CHAR(n) | CHAR(n) |
| `varchar(n)` | Variable-length | VARCHAR(n) | VARCHAR(n) |
Encoding: All strings use UTF-8 (utf8mb4 in MySQL, UTF8 in PostgreSQL).
See Encoding and Collation Policy for details.
### Boolean

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| `bool` | True/False | TINYINT | BOOLEAN |
### Date/Time Types

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| `date` | Date only | DATE | DATE |
| `datetime` | Date and time | DATETIME | TIMESTAMP |
Timezone policy: All datetime values should be stored as UTC. Timezone conversion is a
presentation concern handled by the application layer, not the database. This ensures:
- Reproducible computations regardless of server or client timezone settings
- Simple arithmetic on temporal values (no DST ambiguity)
- Portable data across systems and regions
Use `CURRENT_TIMESTAMP` for auto-populated creation times (default precedes the type, per DataJoint syntax):

```
created_at = CURRENT_TIMESTAMP : datetime
```
### Binary Types

The core `bytes` type stores raw bytes without any serialization. Use the `<blob>` codec for serialized Python objects.

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| `bytes` | Raw bytes | LONGBLOB | BYTEA |
### Other Types

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| `json` | JSON document | JSON | JSONB |
| `uuid` | UUID | BINARY(16) | UUID |
| `enum(...)` | Enumeration | ENUM(...) | CREATE TYPE ... AS ENUM |
## Native Passthrough Types

Users may use native database types directly (e.g., `int`, `float`, `mediumint`, `tinyblob`), but these are discouraged and will generate a warning. Native types lack explicit size information, are not recorded in field comments, and may have portability issues across database backends.
Prefer core DataJoint types over native types:

| Native (discouraged) | Core DataJoint (preferred) |
|---|---|
| `int` | `int32` |
| `float` | `float32` or `float64` |
| `double` | `float64` |
| `tinyint` | `int8` |
| `smallint` | `int16` |
| `bigint` | `int64` |
| `tinyint unsigned` | `int16` (use larger signed type) |
| `smallint unsigned` | `int32` (use larger signed type) |
| `int unsigned` | `int64` (use larger signed type) |
| `bigint unsigned` | `decimal(20,0)` (use decimal for full range) |
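The mapping above can be captured in a small lookup table. This is an illustrative sketch, not DataJoint's internal API; `suggest_core_type` and the warning mechanism are hypothetical names.

```python
# Hypothetical helper sketching the native -> core mapping (illustrative only).
import warnings

NATIVE_TO_CORE = {
    "int": "int32",
    "float": "float32",
    "double": "float64",
    "tinyint": "int8",
    "smallint": "int16",
    "bigint": "int64",
    "tinyint unsigned": "int16",
    "smallint unsigned": "int32",
    "int unsigned": "int64",
    "bigint unsigned": "decimal(20,0)",
}

def suggest_core_type(native: str) -> str:
    """Return the preferred core type for a native type, warning on use."""
    core = NATIVE_TO_CORE.get(" ".join(native.lower().split()))
    if core is None:
        raise ValueError(f"no core equivalent recorded for {native!r}")
    warnings.warn(f"native type {native!r} is discouraged; use {core!r}")
    return core
```
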
## Type Modifiers Policy

DataJoint table definitions have their own syntax for constraints and metadata. SQL type modifiers are not allowed in type specifications because they conflict with DataJoint's declarative syntax:

| Modifier | Status | DataJoint Alternative |
|---|---|---|
| `NOT NULL` / `NULL` | ❌ Not allowed | Use `= NULL` for nullable; omit default for required |
| `DEFAULT value` | ❌ Not allowed | Use `= value` syntax before the type |
| `PRIMARY KEY` | ❌ Not allowed | Position above `---` line |
| `UNIQUE` | ❌ Not allowed | Use DataJoint index syntax |
| `COMMENT 'text'` | ❌ Not allowed | Use `#` comment syntax |
| `CHARACTER SET` | ❌ Not allowed | Database-level configuration |
| `COLLATE` | ❌ Not allowed | Database-level configuration |
| `AUTO_INCREMENT` | ⚠️ Discouraged | Allowed with native types only, generates warning |
| `UNSIGNED` | ⚠️ Discouraged | MySQL-specific, not portable (use larger signed type) |
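A validator enforcing the "not allowed" rows above could look like the following. This is a sketch, not DataJoint's actual parser; the function name and regex list are assumptions.

```python
# Illustrative check: reject SQL modifiers that conflict with
# DataJoint's declarative syntax (not DataJoint's real parser).
import re

FORBIDDEN_MODIFIERS = [
    r"\bNOT\s+NULL\b", r"\bDEFAULT\b", r"\bPRIMARY\s+KEY\b",
    r"\bUNIQUE\b", r"\bCOMMENT\b", r"\bCHARACTER\s+SET\b", r"\bCOLLATE\b",
]

def validate_type_spec(spec: str) -> None:
    """Raise ValueError if a type specification carries a forbidden SQL modifier."""
    for pattern in FORBIDDEN_MODIFIERS:
        if re.search(pattern, spec, flags=re.IGNORECASE):
            raise ValueError(f"SQL modifier not allowed in type spec: {spec!r}")
```
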
Nullability and defaults: DataJoint handles nullability through the default value syntax. An attribute is nullable if and only if its default is `NULL`:

```
# Required (NOT NULL, no default)
name : varchar(100)

# Nullable (default is NULL)
nickname = NULL : varchar(100)

# Required with default value
status = "active" : varchar(20)
```
Auto-increment policy: DataJoint discourages `AUTO_INCREMENT` / `SERIAL` because it:

- Breaks reproducibility (IDs depend on insertion order)
- Makes pipelines non-deterministic
- Complicates data migration and replication
- Produces arbitrary keys where primary keys should be meaningful

If required, use native types: `int auto_increment` or `serial` (with warning).
## Encoding and Collation Policy

Character encoding and collation are database-level configuration, not part of type definitions. This ensures consistent behavior across all tables and simplifies portability.

Configuration (in `dj.config` or `datajoint.json`):

```json
{
    "database.charset": "utf8mb4",
    "database.collation": "utf8mb4_bin"
}
```

Defaults:

| Setting | MySQL | PostgreSQL |
|---|---|---|
| Charset | utf8mb4 | UTF8 |
| Collation | utf8mb4_bin | C |

Policy:

- UTF-8 required: DataJoint validates that the charset is UTF-8 compatible at connection time
- Case-sensitive by default: binary collation (utf8mb4_bin / C) ensures predictable comparisons
- No per-column overrides: `CHARACTER SET` and `COLLATE` are rejected in type definitions
- Like timezones, encoding is infrastructure configuration, not part of the data model
## Codec Types (Layer 3)

Codec types provide `encode()`/`decode()` semantics on top of core types. They are composable and can be built-in or user-defined.

### Storage Mode: @ Convention

The `@` character in codec syntax indicates object store (vs in-table):

- No `@`: in-table storage (database column) - e.g., `<blob>`, `<attach>`
- `@` present: object store - e.g., `<blob@>`, `<attach@store>`
- `@` alone: use default store - e.g., `<blob@>`
- `@name`: use named store - e.g., `<blob@cold>`

Some codecs support both modes (`<blob>`, `<attach>`); others are store-only (`<object@>`, `<npy@>`, `<hash@>`, `<filepath@>`).
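The `@` convention above can be parsed mechanically. The following is a minimal sketch; the function name and return shape `(name, is_store, store_name)` are illustrative assumptions, not DataJoint's API.

```python
# Hypothetical parser for the codec syntax described above.
import re

CODEC_RE = re.compile(r"^<(?P<name>\w+)(?P<at>@(?P<store>\w+)?)?>$")

def parse_codec(spec: str):
    """Split '<name@store>' into (name, is_store, store_name or None)."""
    m = CODEC_RE.match(spec)
    if not m:
        raise ValueError(f"not a codec type: {spec!r}")
    return m["name"], m["at"] is not None, m["store"]
```

For example, `<blob>` parses as in-table storage, `<blob@>` as the default store, and `<blob@cold>` as the named store `cold`.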
### Codec Base Class

Codecs inherit from `dj.Codec` and auto-register when their class is defined. See the Codec API Specification for complete details on creating custom codecs.

```python
class GraphCodec(dj.Codec):
    """Auto-registered as <graph>."""
    name = "graph"

    def get_dtype(self, is_store: bool) -> str:
        return "<blob>"

    def encode(self, graph, *, key=None, store_name=None):
        return {'nodes': list(graph.nodes()), 'edges': list(graph.edges())}

    def decode(self, stored, *, key=None):
        import networkx as nx
        G = nx.Graph()
        G.add_nodes_from(stored['nodes'])
        G.add_edges_from(stored['edges'])
        return G
```
### Codec Resolution and Chaining

Codecs resolve to core types through chaining. The `get_dtype(is_store)` method returns the appropriate dtype based on storage mode:

| Codec | `is_store` | Resolution Chain | SQL Type |
|---|---|---|---|
| `<blob>` | False | `"bytes"` | LONGBLOB/BYTEA |
| `<blob@>` | True | `"<hash>"` → `"json"` | JSON/JSONB |
| `<blob@cold>` | True | `"<hash>"` → `"json"` (store=cold) | JSON/JSONB |
| `<attach>` | False | `"bytes"` | LONGBLOB/BYTEA |
| `<attach@>` | True | `"<hash>"` → `"json"` | JSON/JSONB |
| `<object@>` | True | `"json"` | JSON/JSONB |
| `<npy@>` | True | `"json"` | JSON/JSONB |
| `<object>` | False | ERROR (store only) | N/A |
| `<npy>` | False | ERROR (store only) | N/A |
| `<hash@>` | True | `"json"` | JSON/JSONB |
| `<filepath@>` | True | `"json"` | JSON/JSONB |
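The chaining rule can be demonstrated with stub codecs. This is a minimal sketch of the resolution loop under stated assumptions; the registry and class names are illustrative, not DataJoint internals.

```python
# Minimal sketch of codec dtype resolution via chaining (illustrative only).
CORE_TYPES = {"bytes", "json"}

class StubBlob:
    def get_dtype(self, is_store):
        return "<hash>" if is_store else "bytes"

class StubHash:
    def get_dtype(self, is_store):
        return "json"

REGISTRY = {"blob": StubBlob(), "hash": StubHash()}

def resolve(codec_name: str, is_store: bool) -> str:
    """Follow get_dtype() through codec references until a core type is reached."""
    dtype = f"<{codec_name}>"
    while dtype.startswith("<"):
        name = dtype[1:-1].split("@")[0]   # '<hash>' -> 'hash'
        dtype = REGISTRY[name].get_dtype(is_store)
    return dtype
```

Resolving `blob` with `is_store=True` walks `<blob>` → `<hash>` → `json`, matching the table above.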
### `<object@>` / `<object@store>` - Schema-Addressed Storage

Built-in codec. Store only.

Schema-addressed OAS storage for complex, multi-part objects (files, folders, Zarr arrays, HDF5):

- Schema-addressed: path mirrors database structure: `{schema}/{table}/{pk}/{attribute}/`
- Complex objects: can store directory structures with multiple files (e.g., Zarr arrays)
- One-to-one relationship with table row
- Deleted when row is deleted
- Returns `ObjectRef` for lazy access
- Supports direct writes (Zarr, HDF5) via fsspec
- dtype: `json` (stores path, store name, metadata)

```python
class Analysis(dj.Computed):
    definition = """
    -> Recording
    ---
    results : <object@>       # default store
    archive : <object@cold>   # specific store
    """
```
#### Implementation

```python
class ObjectCodec(SchemaCodec):
    """Schema-addressed OAS storage. Store only."""
    name = "object"

    # get_dtype inherited from SchemaCodec

    def encode(self, value, *, key=None, store_name=None) -> dict:
        schema, table, field, pk = self._extract_context(key)
        path, _ = self._build_path(schema, table, field, pk)
        backend = self._get_backend(store_name)
        backend.put(path, value)
        return {"path": path, "store": store_name, ...}

    def decode(self, stored: dict, *, key=None) -> ObjectRef:
        backend = self._get_backend(stored["store"])
        return ObjectRef.from_json(stored, backend=backend)
```
### `<hash@>` / `<hash@store>` - Hash-Addressed Storage

Built-in codec. Store only.

Hash-addressed storage with deduplication for individual, atomic objects:

- Hash-addressed: path derived from content hash: `_hash/{hash[:2]}/{hash[2:4]}/{hash}`
- Individual/atomic objects only: stores single files or serialized blobs (not directory structures)
- Cannot handle complex multi-part objects like Zarr arrays; use `<object@>` for those
- Per-project scope: content is shared across all schemas in a project (not per-schema)
- Many-to-one: multiple rows (even across schemas) can reference the same content
- Reference counted for garbage collection
- Deduplication: identical content stored once across the entire project
- dtype: `json` (stores hash, store name, size, metadata)
```
store_root/
├── {schema}/{table}/{pk}/            # schema-addressed storage
│   └── {attribute}/
│
└── _hash/                            # hash-addressed storage
    └── {hash[:2]}/{hash[2:4]}/{hash}
```
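The hash-addressed layout above is a pure function of content. A minimal sketch (the helper name `hash_path` is hypothetical, but the path scheme matches the spec):

```python
# Derive the store path of a content blob from its MD5 hash.
import hashlib

def hash_path(data: bytes) -> str:
    """Return the hash-addressed path for a content blob."""
    h = hashlib.md5(data).hexdigest()   # 32-char hex digest
    return f"_hash/{h[:2]}/{h[2:4]}/{h}"
```

Identical content always maps to the same path, which is what makes deduplication possible.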
#### Implementation

```python
class HashCodec(dj.Codec):
    """Hash-addressed storage. Store only."""
    name = "hash"

    def get_dtype(self, is_store: bool) -> str:
        if not is_store:
            raise DataJointError("<hash> requires @ (store only)")
        return "json"

    def encode(self, data: bytes, *, key=None, store_name=None) -> dict:
        """Store content, return metadata as JSON."""
        hash_id = hashlib.md5(data).hexdigest()  # 32-char hex
        store = get_store(store_name or dj.config['stores']['default'])
        path = f"_hash/{hash_id[:2]}/{hash_id[2:4]}/{hash_id}"
        if not store.exists(path):
            store.put(path, data)
        # Metadata stored in JSON column (no separate registry)
        return {"hash": hash_id, "store": store_name, "size": len(data)}

    def decode(self, stored: dict, *, key=None) -> bytes:
        """Retrieve content by hash."""
        store = get_store(stored["store"])
        path = f"_hash/{stored['hash'][:2]}/{stored['hash'][2:4]}/{stored['hash']}"
        return store.get(path)
```
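The encode/decode contract can be exercised end-to-end with an in-memory stand-in for the store backend. This is a self-contained sketch; `DictStore`, `hash_encode`, and `hash_decode` are illustrative names, not DataJoint's API.

```python
# Round-trip sketch of hash-addressed encode/decode with a dict-backed store.
import hashlib

class DictStore:
    """In-memory stand-in for a store backend (illustrative only)."""
    def __init__(self):
        self.files = {}
    def exists(self, path):
        return path in self.files
    def put(self, path, data):
        self.files[path] = data
    def get(self, path):
        return self.files[path]

def hash_encode(data: bytes, store: DictStore) -> dict:
    """Write content once per hash; return the JSON metadata."""
    h = hashlib.md5(data).hexdigest()
    path = f"_hash/{h[:2]}/{h[2:4]}/{h}"
    if not store.exists(path):          # dedup: identical content written once
        store.put(path, data)
    return {"hash": h, "size": len(data)}

def hash_decode(meta: dict, store: DictStore) -> bytes:
    """Recover content by re-deriving the path from the stored hash."""
    h = meta["hash"]
    return store.get(f"_hash/{h[:2]}/{h[2:4]}/{h}")
```

Inserting the same bytes twice leaves a single object in the store, demonstrating deduplication.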
#### Database Column

The `<hash@>` type stores JSON metadata:

```sql
-- content column (MySQL)
features JSON NOT NULL
-- Contains: {"hash": "abc123...", "store": "main", "size": 12345}

-- content column (PostgreSQL)
features JSONB NOT NULL
```
### `<filepath@store>` - Portable External Reference

Built-in codec. Store only (store name required).

Relative path references within configured stores:

- Relative paths: paths within a configured store (portable across environments)
- Store-aware: resolves paths against the configured store backend
- Returns `ObjectRef` for lazy access via fsspec
- Stores optional checksum for verification
- dtype: `json` (stores path, store name, checksum, metadata)

Key benefit: portability. The path is relative to the store, so pipelines can be moved between environments (dev → prod, cloud → local) by changing store configuration without updating data.
```python
class RawData(dj.Manual):
    definition = """
    session_id : int32
    ---
    recording : <filepath@main>   # relative path within 'main' store
    """

# Insert - user provides relative path within the store
table.insert1({
    'session_id': 1,
    'recording': 'experiment_001/data.nwb'   # relative to main store root
})

# Fetch - returns ObjectRef (lazy)
row = (table & 'session_id=1').fetch1()
ref = row['recording']         # ObjectRef
ref.download('/local/path')    # explicit download
ref.open()                     # fsspec streaming access
```
#### When to Use `<filepath@store>` vs `varchar`
| Use Case | Recommended Type |
|---|---|
| Need ObjectRef/lazy access | `<filepath@store>` |
| Need portability (relative paths) | `<filepath@store>` |
| Want checksum verification | `<filepath@store>` |
| Just storing a URL string | `varchar` |
| External URLs you don't control | `varchar` |
For arbitrary URLs (S3, HTTP, etc.) where you don't need ObjectRef semantics,
just use varchar. A string is simpler and more transparent.
#### Implementation

```python
class FilepathCodec(dj.Codec):
    """Store-relative file references. External only."""
    name = "filepath"

    def get_dtype(self, is_store: bool) -> str:
        if not is_store:
            raise DataJointError("<filepath> requires @store")
        return "json"

    def encode(self, relative_path: str, *, key=None, store_name=None) -> dict:
        """Register reference to file in store."""
        store = get_store(store_name)  # store_name required for filepath
        return {'path': relative_path, 'store': store_name}

    def decode(self, stored: dict, *, key=None) -> ObjectRef:
        """Return ObjectRef for lazy access."""
        return ObjectRef(store=get_store(stored['store']), path=stored['path'])
```
#### Database Column

```sql
-- filepath column (MySQL)
recording JSON NOT NULL
-- Contains: {"path": "experiment_001/data.nwb", "store": "main", "checksum": "...", "size": ...}

-- filepath column (PostgreSQL)
recording JSONB NOT NULL
```
#### Key Differences from Legacy `filepath@store` (now `<filepath@store>`)

| Feature | Legacy | New |
|---|---|---|
| Access | Copy to local stage | ObjectRef (lazy) |
| Copying | Automatic | Explicit via `ref.download()` |
| Streaming | No | Yes, via `ref.open()` |
| Paths | Relative | Relative (unchanged) |
| Store param | Required (`@store`) | Required (`@store`) |
## Database Types

### `json` - Cross-Database JSON Type

JSON storage compatible across MySQL and PostgreSQL:

```sql
-- MySQL
column_name JSON NOT NULL

-- PostgreSQL (uses JSONB for better indexing)
column_name JSONB NOT NULL
```

The `json` database type:

- Is used as the dtype by built-in codecs (`<object@>`, `<hash@>`, `<filepath@store>`)
- Stores arbitrary JSON-serializable data
- Automatically uses the appropriate type for the database backend
- Supports JSON path queries where available
## Built-in Codecs

### `<blob>` / `<blob@>` - Serialized Python Objects

Supports both in-table and in-store storage.

Serializes Python objects using DataJoint's custom binary serialization format. The format uses protocol headers and type-specific encoding to serialize complex Python objects efficiently.

Serialization format:

- Protocol headers:
    - `mYm` - original MATLAB-compatible format for numeric arrays, structs, cells (see mYm on MATLAB FileExchange and mym on GitHub)
    - `dj0` - extended format supporting Python-specific types (UUID, Decimal, datetime, etc.)
- Compression: automatic zlib compression for data > 1KB
- Type codes: each Python type has a specific serialization code:
    - `'A'` - NumPy arrays (numeric)
    - `'F'` - NumPy recarrays (structured arrays with fields)
    - `'\x01'` - Tuples
    - `'\x02'` - Lists
    - `'\x03'` - Sets
    - `'\x04'` - Dicts
    - `'\x05'` - Strings (UTF-8)
    - `'\x06'` - Bytes
    - `'\x0a'` - Unbounded integers
    - `'\x0b'` - Booleans
    - `'\x0c'` - Complex numbers
    - `'\x0d'` - Floats
    - `'d'` - Decimal
    - `'t'` - Datetime/date/time
    - `'u'` - UUID
    - `'S'` - MATLAB structs
    - `'C'` - MATLAB cell arrays

Version detection: the protocol header (`mYm\0` or `dj0\0`) is embedded at the start of the blob, enabling automatic format detection and backward compatibility.

Storage modes:

- `<blob>`: stored in database (bytes → LONGBLOB/BYTEA)
- `<blob@>`: stored externally via `<hash@>` with MD5 deduplication
- `<blob@store>`: stored in specific named store
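Because the protocol header is the first bytes of every blob, format detection is a prefix check. A minimal sketch (the helper name is illustrative, not DataJoint's API):

```python
# Sniff the serialization protocol from the embedded header.
def detect_blob_protocol(blob: bytes) -> str:
    """Identify the blob format ('mYm' or 'dj0') from its header."""
    if blob.startswith(b"mYm\0"):
        return "mYm"
    if blob.startswith(b"dj0\0"):
        return "dj0"
    raise ValueError("unrecognized blob protocol header")
```
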
```python
class BlobCodec(dj.Codec):
    """Serialized Python objects. Supports internal and external."""
    name = "blob"

    def get_dtype(self, is_store: bool) -> str:
        return "<hash>" if is_store else "bytes"

    def encode(self, value, *, key=None, store_name=None) -> bytes:
        from . import blob
        return blob.pack(value, compress=True)

    def decode(self, stored, *, key=None) -> Any:
        from . import blob
        return blob.unpack(stored)
```
Usage:

```python
class ProcessedData(dj.Computed):
    definition = """
    -> RawData
    ---
    small_result   : <blob>        # in-table (in database)
    large_result   : <blob@>       # in-store (default store)
    archive_result : <blob@cold>   # in-store (specific store)
    """
```
### `<attach>` / `<attach@>` - File Attachments

Supports both in-table and in-store storage.

Stores files with the filename preserved. On fetch, extracts to the configured download path.

- `<attach>`: stored in database (bytes → LONGBLOB/BYTEA)
- `<attach@>`: stored in object store via `<hash@>` with deduplication
- `<attach@store>`: stored in specific named store
```python
class AttachCodec(dj.Codec):
    """File attachment with filename. Supports in-table and in-store."""
    name = "attach"

    def get_dtype(self, is_store: bool) -> str:
        return "<hash>" if is_store else "bytes"

    def encode(self, filepath, *, key=None, store_name=None) -> bytes:
        path = Path(filepath)
        return path.name.encode() + b"\0" + path.read_bytes()

    def decode(self, stored, *, key=None) -> str:
        filename, contents = stored.split(b"\0", 1)
        filename = filename.decode()
        download_path = Path(dj.config['download_path']) / filename
        download_path.write_bytes(contents)
        return str(download_path)
```
Usage:

```python
class Attachments(dj.Manual):
    definition = """
    attachment_id : int32
    ---
    config    : <attach>        # in-table (small file in DB)
    data_file : <attach@>       # in-store (default store)
    archive   : <attach@cold>   # in-store (specific store)
    """
```
## User-Defined Codecs
Users can define custom codecs for domain-specific data. See the Codec API Specification for complete examples including:
- Simple serialization codecs
- In-store codecs
- JSON with schema validation
- Context-dependent encoding
- In-store-only codecs (Zarr, HDF5)
## Storage Comparison

| Type | get_dtype | Resolves To | Storage Location | Dedup | Returns |
|---|---|---|---|---|---|
| `<blob>` | `bytes` | LONGBLOB/BYTEA | Database | No | Python object |
| `<blob@>` | `<hash>` | `json` | `_hash/{hash}` | Yes | Python object |
| `<blob@s>` | `<hash>` | `json` | `_hash/{hash}` | Yes | Python object |
| `<attach>` | `bytes` | LONGBLOB/BYTEA | Database | No | Local file path |
| `<attach@>` | `<hash>` | `json` | `_hash/{hash}` | Yes | Local file path |
| `<attach@s>` | `<hash>` | `json` | `_hash/{hash}` | Yes | Local file path |
| `<object@>` | `json` | JSON/JSONB | `{schema}/{table}/{pk}/` | No | ObjectRef |
| `<object@s>` | `json` | JSON/JSONB | `{schema}/{table}/{pk}/` | No | ObjectRef |
| `<hash@>` | `json` | JSON/JSONB | `_hash/{hash}` | Yes | bytes |
| `<hash@s>` | `json` | JSON/JSONB | `_hash/{hash}` | Yes | bytes |
| `<filepath@s>` | `json` | JSON/JSONB | Configured store | No | ObjectRef |
## Garbage Collection for Hash Storage

Hash metadata (hash, store, size) is stored directly in each table's JSON column; no separate registry table is needed. Garbage collection scans all tables to find referenced hashes:

```python
def garbage_collect(store_name):
    """Remove hash-addressed data not referenced by any table."""
    # Scan store for all hash files
    store = get_store(store_name)
    all_hashes = set(store.list_hashes())  # from _hash/ directory

    # Scan all tables for referenced hashes
    referenced = set()
    for schema in project.schemas:
        for table in schema.tables:
            for attr in table.heading.attributes:
                if uses_hash_storage(attr):  # <blob@>, <attach@>, <hash@>
                    for row in table:
                        val = row.get(attr.name)
                        if val and val.get('store') == store_name:
                            referenced.add(val['hash'])

    # Delete orphaned files
    for hash_id in (all_hashes - referenced):
        store.delete(hash_path(hash_id))
```
Built-in Codec Comparison¶
| Feature | <blob> |
<attach> |
<object@> |
<hash@> |
<filepath@> |
|---|---|---|---|---|---|
| Storage modes | Both | Both | External only | External only | External only |
| Internal dtype | bytes |
bytes |
N/A | N/A | N/A |
| External dtype | <hash@> |
<hash@> |
json |
json |
json |
| Addressing | Hash | Hash | Primary key | Hash | Relative path |
| Deduplication | Yes (external) | Yes (external) | No | Yes | No |
| Structure | Single blob | Single file | Files, folders | Single blob | Any |
| Returns | Python object | Local path | ObjectRef | bytes | ObjectRef |
| GC | Ref counted | Ref counted | With row | Ref counted | User managed |
When to use each:
- <blob>: Serialized Python objects (NumPy arrays, dicts). Use <blob@> for large/duplicated data
- <attach>: File attachments with filename preserved. Use <attach@> for large files
- <object@>: Large/complex file structures (Zarr, HDF5) where DataJoint controls organization
- <hash@>: Raw bytes with deduplication (typically used via <blob@> or <attach@>)
- <filepath@store>: Portable references to externally-managed files
- varchar: Arbitrary URLs/paths where ObjectRef semantics aren't needed
## Key Design Decisions

- Three-layer architecture:
    - Layer 1: native database types (backend-specific, discouraged)
    - Layer 2: core DataJoint types (standardized, scientist-friendly)
    - Layer 3: codec types (encode/decode, composable)
- Core types are scientist-friendly: `float32`, `int8`, `bool`, `bytes` instead of FLOAT, TINYINT, LONGBLOB
- Codecs use angle brackets: `<blob>`, `<object@store>`, `<filepath@main>` distinguishes them from core types
- `@` indicates in-store storage: no `@` = database, `@` present = object store
- `get_dtype(is_store)` method: codecs resolve their dtype at declaration time based on storage mode
- Codecs are composable: `<blob@>` uses `<hash@>`, which uses `json`
- Built-in in-store codecs use the `json` dtype: stores metadata (path, hash, store name, etc.)
- Two OAS regions: object (PK-addressed) and hash (hash-addressed) within managed stores
- Filepath for portability: `<filepath@store>` uses relative paths within stores for environment portability
- No `uri` type: for arbitrary URLs, use `varchar`; simpler and more transparent
- Naming conventions: `@` = in-store storage (object store); no `@` = in-table storage (database); `@` alone = default store; `@name` = named store
- Dual-mode codecs: `<blob>` and `<attach>` support both in-table and in-store storage
- In-store-only codecs: `<object@>`, `<hash@>`, `<filepath@>` require `@`
- Transparent access: codecs return Python objects or file paths
- Lazy access: `<object@>` and `<filepath@store>` return ObjectRef
- MD5 for content hashing: see Hash Algorithm Choice below
- No separate registry: hash metadata stored in JSON columns, not a separate table
- Auto-registration via `__init_subclass__`: codecs register automatically when subclassed; no decorator needed. Use `register=False` for abstract bases. Requires Python 3.10+.
## Hash Algorithm Choice

Content-addressed storage uses MD5 (128-bit, 32-char hex) rather than SHA256 (256-bit, 64-char hex).

Rationale:

1. Practical collision resistance is sufficient: the birthday bound for MD5 is ~2^64 operations before a 50% collision probability. No scientific project will store anywhere near 10^19 files. For content deduplication (not cryptographic verification), MD5 provides adequate uniqueness.
2. Storage efficiency: 32-char hashes vs 64-char hashes in every JSON metadata field. With millions of records, this halves the storage overhead for hash identifiers.
3. Performance: MD5 is ~2-3x faster than SHA256 for large files. While both are fast, the difference is measurable when hashing large scientific datasets.
4. Legacy compatibility: DataJoint's existing `uuid_from_buffer()` function uses MD5. The new system changes only the storage format (hex string in JSON vs binary UUID), not the underlying hash algorithm. This simplifies migration.
5. Consistency with existing codebase: internal functions use MD5 for query caching.

Why not SHA256?

SHA256 is the modern standard for content-addressable storage (Git, Docker, IPFS). However:

- These systems prioritize cryptographic security against adversarial collision attacks
- Scientific data pipelines face no adversarial threat model
- The practical benefits (storage, speed, compatibility) outweigh the theoretical security gains

Note: if cryptographic verification is ever needed (e.g., for compliance or reproducibility audits), SHA256 checksums can be computed on demand without changing the storage addressing scheme.
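The ~2^64 birthday-bound figure cited above can be checked with a line of arithmetic (a worked check, not part of the implementation):

```python
# Birthday bound for a 128-bit hash: ~sqrt(2^128) = 2^64 items
# before collision probability reaches ~50%.
import math

bits = 128                              # MD5 digest size in bits
birthday_bound = math.isqrt(2**bits)    # exact integer square root
assert birthday_bound == 2**64          # about 1.8e19 items
```
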
## Migration from Legacy Types

| Legacy | New Equivalent |
|---|---|
| `longblob` (auto-serialized) | `<blob>` |
| `blob@store` | `<blob@store>` |
| `attach` | `<attach>` |
| `attach@store` | `<attach@store>` |
| `filepath@store` (copy-based) | `<filepath@store>` (ObjectRef-based) |
## Migration from Legacy ~external_* Stores

Legacy external storage used per-schema `~external_{store}` tables with UUID references. Migration to the new JSON-based hash storage requires:

```python
def migrate_external_store(schema, store_name):
    """
    Migrate legacy ~external_{store} to the new JSON-based hash storage.

    1. Read all entries from ~external_{store}
    2. For each entry:
       - Fetch content from legacy location
       - Compute MD5 hash
       - Copy to _hash/{hash}/ if not exists
       - Update table column to new hash format
    3. After all schemas migrated, drop ~external_{store} tables
    """
    external_table = schema.external[store_name]
    for entry in external_table:
        legacy_uuid = entry['hash']

        # Fetch content from legacy location
        content = external_table.get(legacy_uuid)

        # Compute new content hash
        hash_id = hashlib.md5(content).hexdigest()

        # Store in new location if not exists
        new_path = f"_hash/{hash_id[:2]}/{hash_id[2:4]}/{hash_id}"
        store = get_store(store_name)
        if not store.exists(new_path):
            store.put(new_path, content)

        # Update referencing tables: convert UUID column to JSON with hash metadata
        # The JSON column stores {"hash": hash_id, "store": store_name, "size": len(content)}
        # ... update all tables that reference this UUID ...

    # After migration complete for all schemas:
    # DROP TABLE `{schema}`.`~external_{store}`
```
Migration considerations:

- Legacy UUIDs were based on the MD5 content hash, stored as `binary(16)` (UUID format)
- The new system uses 32-character MD5 hex strings stored in JSON
- The hash algorithm is unchanged (MD5); only the storage format differs
- Migration can be done incrementally per schema
- A backward-compatibility layer can read both formats during the transition
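Since a legacy `binary(16)` UUID holds the raw MD5 digest, the new 32-character hex string is just its hex expansion. A minimal sketch of the format conversion (the helper name is hypothetical):

```python
# Convert a legacy MD5-derived UUID to the new hex-string format.
import hashlib
import uuid

def legacy_uuid_to_hex(legacy: uuid.UUID) -> str:
    """Expand a legacy binary(16) UUID into the new 32-char hex hash."""
    return legacy.bytes.hex()   # same digits as md5(content).hexdigest()
```

Because the hash algorithm is unchanged, the converted value matches what `<hash@>` would compute for the same content.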
## Open Questions

- How long should the backward compatibility layer support the legacy `~external_*` format?
- Should `<hash@>` (without a store name) use a default store or require an explicit store name?