# Type System Specification

## Overview

> **New in 2.1:** PostgreSQL type mappings were added in DataJoint 2.1. See Database Backends for backend configuration.
This document defines a three-layer type architecture:

- **Native database types** - Backend-specific (`FLOAT`, `TINYINT UNSIGNED`, `LONGBLOB`). Discouraged for direct use.
- **Core DataJoint types** - Standardized across backends, scientist-friendly (`int32`, `float32`, `bool`, `json`).
- **Codec types** - Programmatic types with `encode()`/`decode()` semantics. Composable.
| Layer | Description | Examples |
|---|---|---|
| 3. Codec Types | Programmatic types with `encode()`/`decode()` semantics | `<blob>`, `<attach>`, `<object@>`, `<hash@>`, `<filepath@>`, user-defined |
| 2. Core DataJoint | Standardized, scientist-friendly types (preferred) | `int32`, `float64`, `varchar(n)`, `bool`, `datetime`, `json`, `bytes` |
| 1. Native Database | Backend-specific types (discouraged) | `INT`, `FLOAT`, `TINYINT UNSIGNED`, `LONGBLOB` |
Codec types resolve through core types to native types: `<blob>` → `bytes` → `LONGBLOB`.
Syntax distinction:

- Core types: `int32`, `float64`, `varchar(255)` - no brackets
- Codec types: `<blob>`, `<object@store>`, `<filepath@main>` - angle brackets
- The `@` character indicates store (object storage vs in-table)
## Core DataJoint Types (Layer 2)

Core types provide a standardized, scientist-friendly interface that works identically across MySQL and PostgreSQL backends. Users should prefer these over native database types.

All core types are recorded in field comments using `:type:` syntax for reconstruction.
### Numeric Types

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| `int8` | 8-bit signed | TINYINT | SMALLINT |
| `int16` | 16-bit signed | SMALLINT | SMALLINT |
| `int32` | 32-bit signed | INT | INTEGER |
| `int64` | 64-bit signed | BIGINT | BIGINT |
| `float32` | 32-bit float | FLOAT | REAL |
| `float64` | 64-bit float | DOUBLE | DOUBLE PRECISION |
| `decimal(n,f)` | Fixed-point | DECIMAL(n,f) | NUMERIC(n,f) |
### String Types

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| `char(n)` | Fixed-length | CHAR(n) | CHAR(n) |
| `varchar(n)` | Variable-length | VARCHAR(n) | VARCHAR(n) |
Encoding: All strings use UTF-8 (utf8mb4 in MySQL, UTF8 in PostgreSQL).
See Encoding and Collation Policy for details.
### Boolean

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| `bool` | True/False | TINYINT | BOOLEAN |
### Date/Time Types

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| `date` | Date only | DATE | DATE |
| `datetime` | Date and time | DATETIME | TIMESTAMP |
Timezone policy: All datetime values should be stored as UTC. Timezone conversion is a
presentation concern handled by the application layer, not the database. This ensures:
- Reproducible computations regardless of server or client timezone settings
- Simple arithmetic on temporal values (no DST ambiguity)
- Portable data across systems and regions
Use `CURRENT_TIMESTAMP` for auto-populated creation times (default precedes the type, per DataJoint syntax):

```
created_at = CURRENT_TIMESTAMP : datetime
```
### Binary Types

The core `bytes` type stores raw bytes without any serialization. Use the `<blob>` codec for serialized Python objects.

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| `bytes` | Raw bytes | LONGBLOB | BYTEA |
### Other Types

| Core Type | Description | MySQL | PostgreSQL |
|---|---|---|---|
| `json` | JSON document | JSON | JSONB |
| `uuid` | UUID | BINARY(16) | UUID |
| `enum(...)` | Enumeration | ENUM(...) | CREATE TYPE ... AS ENUM |
## Native Passthrough Types

Users may use native database types directly (e.g., `int`, `float`, `mediumint`, `tinyblob`), but these are discouraged and will generate a warning. Native types lack explicit size information, are not recorded in field comments, and may have portability issues across database backends.
Prefer core DataJoint types over native types:

| Native (discouraged) | Core DataJoint (preferred) |
|---|---|
| `int` | `int32` |
| `float` | `float32` or `float64` |
| `double` | `float64` |
| `tinyint` | `int8` |
| `smallint` | `int16` |
| `bigint` | `int64` |
| `tinyint unsigned` | `int16` (use larger signed type) |
| `smallint unsigned` | `int32` (use larger signed type) |
| `int unsigned` | `int64` (use larger signed type) |
| `bigint unsigned` | `decimal(20,0)` (use decimal for full range) |
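The mapping above can be captured in a small lookup table. This is an illustrative sketch, not DataJoint's internal API; `suggest_core_type` and the warning mechanism are hypothetical names.

```python
# Hypothetical helper sketching the native -> core mapping (illustrative only).
import warnings

NATIVE_TO_CORE = {
    "int": "int32",
    "float": "float32",
    "double": "float64",
    "tinyint": "int8",
    "smallint": "int16",
    "bigint": "int64",
    "tinyint unsigned": "int16",
    "smallint unsigned": "int32",
    "int unsigned": "int64",
    "bigint unsigned": "decimal(20,0)",
}

def suggest_core_type(native: str) -> str:
    """Return the preferred core type for a native type, warning on use."""
    core = NATIVE_TO_CORE.get(" ".join(native.lower().split()))
    if core is None:
        raise ValueError(f"no core equivalent recorded for {native!r}")
    warnings.warn(f"native type {native!r} is discouraged; use {core!r}")
    return core
```
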
## Type Modifiers Policy

DataJoint table definitions have their own syntax for constraints and metadata. SQL type modifiers are not allowed in type specifications because they conflict with DataJoint's declarative syntax:

| Modifier | Status | DataJoint Alternative |
|---|---|---|
| `NOT NULL` / `NULL` | ❌ Not allowed | Use `= NULL` for nullable; omit default for required |
| `DEFAULT value` | ❌ Not allowed | Use `= value` syntax before the type |
| `PRIMARY KEY` | ❌ Not allowed | Position above `---` line |
| `UNIQUE` | ❌ Not allowed | Use DataJoint index syntax |
| `COMMENT 'text'` | ❌ Not allowed | Use `#` comment syntax |
| `CHARACTER SET` | ❌ Not allowed | Database-level configuration |
| `COLLATE` | ❌ Not allowed | Database-level configuration |
| `AUTO_INCREMENT` | ⚠️ Discouraged | Allowed with native types only, generates warning |
| `UNSIGNED` | ⚠️ Discouraged | MySQL-specific, not portable (use larger signed type) |
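A validator enforcing the "not allowed" rows above could look like the following. This is a sketch, not DataJoint's actual parser; the function name and regex list are assumptions.

```python
# Illustrative check: reject SQL modifiers that conflict with
# DataJoint's declarative syntax (not DataJoint's real parser).
import re

FORBIDDEN_MODIFIERS = [
    r"\bNOT\s+NULL\b", r"\bDEFAULT\b", r"\bPRIMARY\s+KEY\b",
    r"\bUNIQUE\b", r"\bCOMMENT\b", r"\bCHARACTER\s+SET\b", r"\bCOLLATE\b",
]

def validate_type_spec(spec: str) -> None:
    """Raise ValueError if a type specification carries a forbidden SQL modifier."""
    for pattern in FORBIDDEN_MODIFIERS:
        if re.search(pattern, spec, flags=re.IGNORECASE):
            raise ValueError(f"SQL modifier not allowed in type spec: {spec!r}")
```
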
Nullability and defaults: DataJoint handles nullability through the default value syntax. An attribute is nullable if and only if its default is `NULL`:

```
# Required (NOT NULL, no default)
name : varchar(100)

# Nullable (default is NULL)
nickname = NULL : varchar(100)

# Required with default value
status = "active" : varchar(20)
```
Auto-increment policy: DataJoint discourages `AUTO_INCREMENT` / `SERIAL` because it:

- Breaks reproducibility (IDs depend on insertion order)
- Makes pipelines non-deterministic
- Complicates data migration and replication
- Produces arbitrary keys where primary keys should be meaningful

If required, use native types: `int auto_increment` or `serial` (with warning).
## Encoding and Collation Policy

Character encoding and collation are database-level configuration, not part of type definitions. This ensures consistent behavior across all tables and simplifies portability.

Configuration (in `dj.config` or `datajoint.json`):

```json
{
    "database.charset": "utf8mb4",
    "database.collation": "utf8mb4_bin"
}
```

Defaults:

| Setting | MySQL | PostgreSQL |
|---|---|---|
| Charset | utf8mb4 | UTF8 |
| Collation | utf8mb4_bin | C |

Policy:

- UTF-8 required: DataJoint validates that the charset is UTF-8 compatible at connection time
- Case-sensitive by default: binary collation (utf8mb4_bin / C) ensures predictable comparisons
- No per-column overrides: `CHARACTER SET` and `COLLATE` are rejected in type definitions
- Like timezones, encoding is infrastructure configuration, not part of the data model
## Codec Types (Layer 3)

Codec types provide `encode()`/`decode()` semantics on top of core types. They are composable and can be built-in or user-defined.

### Storage Mode: @ Convention

The `@` character in codec syntax indicates object store (vs in-table):

- No `@`: in-table storage (database column) - e.g., `<blob>`, `<attach>`
- `@` present: object store - e.g., `<blob@>`, `<attach@store>`
- `@` alone: use default store - e.g., `<blob@>`
- `@name`: use named store - e.g., `<blob@cold>`

Some codecs support both modes (`<blob>`, `<attach>`); others are store-only (`<object@>`, `<npy@>`, `<hash@>`, `<filepath@>`).
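The `@` convention above can be parsed mechanically. The following is a minimal sketch; the function name and return shape `(name, is_store, store_name)` are illustrative assumptions, not DataJoint's API.

```python
# Hypothetical parser for the codec syntax described above.
import re

CODEC_RE = re.compile(r"^<(?P<name>\w+)(?P<at>@(?P<store>\w+)?)?>$")

def parse_codec(spec: str):
    """Split '<name@store>' into (name, is_store, store_name or None)."""
    m = CODEC_RE.match(spec)
    if not m:
        raise ValueError(f"not a codec type: {spec!r}")
    return m["name"], m["at"] is not None, m["store"]
```

For example, `<blob>` parses as in-table storage, `<blob@>` as the default store, and `<blob@cold>` as the named store `cold`.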
### Codec Base Class

Codecs inherit from `dj.Codec` and auto-register when their class is defined. See the Codec API Specification for complete details on creating custom codecs.

```python
class GraphCodec(dj.Codec):
    """Auto-registered as <graph>."""
    name = "graph"

    def get_dtype(self, is_store: bool) -> str:
        return "<blob>"

    def encode(self, graph, *, key=None, store_name=None):
        return {'nodes': list(graph.nodes()), 'edges': list(graph.edges())}

    def decode(self, stored, *, key=None):
        import networkx as nx
        G = nx.Graph()
        G.add_nodes_from(stored['nodes'])
        G.add_edges_from(stored['edges'])
        return G
```
### Codec Resolution and Chaining

Codecs resolve to core types through chaining. The `get_dtype(is_store)` method returns the appropriate dtype based on storage mode:

| Codec | `is_store` | Resolution Chain | SQL Type |
|---|---|---|---|
| `<blob>` | False | `"bytes"` | LONGBLOB/BYTEA |
| `<blob@>` | True | `"<hash>"` → `"json"` | JSON/JSONB |
| `<blob@cold>` | True | `"<hash>"` → `"json"` (store=cold) | JSON/JSONB |
| `<attach>` | False | `"bytes"` | LONGBLOB/BYTEA |
| `<attach@>` | True | `"<hash>"` → `"json"` | JSON/JSONB |
| `<object@>` | True | `"json"` | JSON/JSONB |
| `<npy@>` | True | `"json"` | JSON/JSONB |
| `<object>` | False | ERROR (store only) | N/A |
| `<npy>` | False | ERROR (store only) | N/A |
| `<hash@>` | True | `"json"` | JSON/JSONB |
| `<filepath@>` | True | `"json"` | JSON/JSONB |
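The chaining rule can be demonstrated with stub codecs. This is a minimal sketch of the resolution loop under stated assumptions; the registry and class names are illustrative, not DataJoint internals.

```python
# Minimal sketch of codec dtype resolution via chaining (illustrative only).
CORE_TYPES = {"bytes", "json"}

class StubBlob:
    def get_dtype(self, is_store):
        return "<hash>" if is_store else "bytes"

class StubHash:
    def get_dtype(self, is_store):
        return "json"

REGISTRY = {"blob": StubBlob(), "hash": StubHash()}

def resolve(codec_name: str, is_store: bool) -> str:
    """Follow get_dtype() through codec references until a core type is reached."""
    dtype = f"<{codec_name}>"
    while dtype.startswith("<"):
        name = dtype[1:-1].split("@")[0]   # '<hash>' -> 'hash'
        dtype = REGISTRY[name].get_dtype(is_store)
    return dtype
```

Resolving `blob` with `is_store=True` walks `<blob>` → `<hash>` → `json`, matching the table above.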
### `<object@>` / `<object@store>` - Schema-Addressed Storage

Built-in codec. Store only.

Schema-addressed OAS storage for complex, multi-part objects (files, folders, Zarr arrays, HDF5):

- Schema-addressed: path mirrors database structure: `{schema}/{table}/{pk}/{attribute}/`
- Complex objects: can store directory structures with multiple files (e.g., Zarr arrays)
- One-to-one relationship with table row
- Deleted when row is deleted
- Returns `ObjectRef` for lazy access
- Supports direct writes (Zarr, HDF5) via fsspec
- dtype: `json` (stores path, store name, metadata)

```python
class Analysis(dj.Computed):
    definition = """
    -> Recording
    ---
    results : <object@>       # default store
    archive : <object@cold>   # specific store
    """
```
#### Implementation

```python
class ObjectCodec(SchemaCodec):
    """Schema-addressed OAS storage. Store only."""
    name = "object"

    # get_dtype inherited from SchemaCodec

    def encode(self, value, *, key=None, store_name=None) -> dict:
        schema, table, field, pk = self._extract_context(key)
        path, _ = self._build_path(schema, table, field, pk)
        backend = self._get_backend(store_name)
        backend.put(path, value)
        return {"path": path, "store": store_name, ...}

    def decode(self, stored: dict, *, key=None) -> ObjectRef:
        backend = self._get_backend(stored["store"])
        return ObjectRef.from_json(stored, backend=backend)
```
### `<hash@>` / `<hash@store>` - Hash-Addressed Storage

Built-in codec. Store only.

Hash-addressed storage with deduplication for individual, atomic objects:

- Hash-addressed: path derived from content hash: `_hash/{hash[:2]}/{hash[2:4]}/{hash}`
- Individual/atomic objects only: stores single files or serialized blobs (not directory structures)
- Cannot handle complex multi-part objects like Zarr arrays; use `<object@>` for those
- Per-project scope: content is shared across all schemas in a project (not per-schema)
- Many-to-one: multiple rows (even across schemas) can reference the same content
- Reference counted for garbage collection
- Deduplication: identical content stored once across the entire project
- dtype: `json` (stores hash, store name, size, metadata)
```
store_root/
├── {schema}/{table}/{pk}/            # schema-addressed storage
│   └── {attribute}/
│
└── _hash/                            # hash-addressed storage
    └── {hash[:2]}/{hash[2:4]}/{hash}
```
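The hash-addressed layout above is a pure function of content. A minimal sketch (the helper name `hash_path` is hypothetical, but the path scheme matches the spec):

```python
# Derive the store path of a content blob from its MD5 hash.
import hashlib

def hash_path(data: bytes) -> str:
    """Return the hash-addressed path for a content blob."""
    h = hashlib.md5(data).hexdigest()   # 32-char hex digest
    return f"_hash/{h[:2]}/{h[2:4]}/{h}"
```

Identical content always maps to the same path, which is what makes deduplication possible.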
#### Implementation

```python
class HashCodec(dj.Codec):
    """Hash-addressed storage. Store only."""
    name = "hash"

    def get_dtype(self, is_store: bool) -> str:
        if not is_store:
            raise DataJointError("<hash> requires @ (store only)")
        return "json"

    def encode(self, data: bytes, *, key=None, store_name=None) -> dict:
        """Store content, return metadata as JSON."""
        hash_id = hashlib.md5(data).hexdigest()  # 32-char hex
        store = get_store(store_name or dj.config['stores']['default'])
        path = f"_hash/{hash_id[:2]}/{hash_id[2:4]}/{hash_id}"
        if not store.exists(path):
            store.put(path, data)
        # Metadata stored in JSON column (no separate registry)
        return {"hash": hash_id, "store": store_name, "size": len(data)}

    def decode(self, stored: dict, *, key=None) -> bytes:
        """Retrieve content by hash."""
        store = get_store(stored["store"])
        path = f"_hash/{stored['hash'][:2]}/{stored['hash'][2:4]}/{stored['hash']}"
        return store.get(path)
```
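The encode/decode contract can be exercised end-to-end with an in-memory stand-in for the store backend. This is a self-contained sketch; `DictStore`, `hash_encode`, and `hash_decode` are illustrative names, not DataJoint's API.

```python
# Round-trip sketch of hash-addressed encode/decode with a dict-backed store.
import hashlib

class DictStore:
    """In-memory stand-in for a store backend (illustrative only)."""
    def __init__(self):
        self.files = {}
    def exists(self, path):
        return path in self.files
    def put(self, path, data):
        self.files[path] = data
    def get(self, path):
        return self.files[path]

def hash_encode(data: bytes, store: DictStore) -> dict:
    """Write content once per hash; return the JSON metadata."""
    h = hashlib.md5(data).hexdigest()
    path = f"_hash/{h[:2]}/{h[2:4]}/{h}"
    if not store.exists(path):          # dedup: identical content written once
        store.put(path, data)
    return {"hash": h, "size": len(data)}

def hash_decode(meta: dict, store: DictStore) -> bytes:
    """Recover content by re-deriving the path from the stored hash."""
    h = meta["hash"]
    return store.get(f"_hash/{h[:2]}/{h[2:4]}/{h}")
```

Inserting the same bytes twice leaves a single object in the store, demonstrating deduplication.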
#### Database Column

The `<hash@>` type stores JSON metadata:

```sql
-- content column (MySQL)
features JSON NOT NULL
-- Contains: {"hash": "abc123...", "store": "main", "size": 12345}

-- content column (PostgreSQL)
features JSONB NOT NULL
```
### `<filepath@store>` - Portable External Reference

Built-in codec. Store only (store name required).

Relative path references within configured stores:

- Relative paths: paths within a configured store (portable across environments)
- Store-aware: resolves paths against the configured store backend
- Returns `ObjectRef` for lazy access via fsspec
- Stores optional checksum for verification
- dtype: `json` (stores path, store name, checksum, metadata)

Key benefit: portability. The path is relative to the store, so pipelines can be moved between environments (dev → prod, cloud → local) by changing store configuration without updating data.
```python
class RawData(dj.Manual):
    definition = """
    session_id : int32
    ---
    recording : <filepath@main>   # relative path within 'main' store
    """

# Insert - user provides relative path within the store
table.insert1({
    'session_id': 1,
    'recording': 'experiment_001/data.nwb'   # relative to main store root
})

# Fetch - returns ObjectRef (lazy)
row = (table & 'session_id=1').fetch1()
ref = row['recording']         # ObjectRef
ref.download('/local/path')    # explicit download
ref.open()                     # fsspec streaming access
```
#### When to Use `<filepath@store>` vs `varchar`
| Use Case | Recommended Type |
|---|---|
| Need ObjectRef/lazy access | `<filepath@store>` |
| Need portability (relative paths) | `<filepath@store>` |
| Want checksum verification | `<filepath@store>` |
| Just storing a URL string | `varchar` |
| External URLs you don't control | `varchar` |
For arbitrary URLs (S3, HTTP, etc.) where you don't need ObjectRef semantics,
just use varchar. A string is simpler and more transparent.
#### Implementation

```python
class FilepathCodec(dj.Codec):
    """Store-relative file references. External only."""
    name = "filepath"

    def get_dtype(self, is_store: bool) -> str:
        if not is_store:
            raise DataJointError("<filepath> requires @store")
        return "json"

    def encode(self, relative_path: str, *, key=None, store_name=None) -> dict:
        """Register reference to file in store."""
        store = get_store(store_name)  # store_name required for filepath
        return {'path': relative_path, 'store': store_name}

    def decode(self, stored: dict, *, key=None) -> ObjectRef:
        """Return ObjectRef for lazy access."""
        return ObjectRef(store=get_store(stored['store']), path=stored['path'])
```
#### Database Column

```sql
-- filepath column (MySQL)
recording JSON NOT NULL
-- Contains: {"path": "experiment_001/data.nwb", "store": "main", "checksum": "...", "size": ...}

-- filepath column (PostgreSQL)
recording JSONB NOT NULL
```
#### Key Differences from Legacy `filepath@store` (now `<filepath@store>`)

| Feature | Legacy | New |
|---|---|---|
| Access | Copy to local stage | ObjectRef (lazy) |
| Copying | Automatic | Explicit via `ref.download()` |
| Streaming | No | Yes, via `ref.open()` |
| Paths | Relative | Relative (unchanged) |
| Store param | Required (`@store`) | Required (`@store`) |
## Database Types

### `json` - Cross-Database JSON Type

JSON storage compatible across MySQL and PostgreSQL:

```sql
-- MySQL
column_name JSON NOT NULL

-- PostgreSQL (uses JSONB for better indexing)
column_name JSONB NOT NULL
```

The `json` database type:

- Is used as the dtype by built-in codecs (`<object@>`, `<hash@>`, `<filepath@store>`)
- Stores arbitrary JSON-serializable data
- Automatically uses the appropriate type for the database backend
- Supports JSON path queries where available
## Built-in Codecs

### `<blob>` / `<blob@>` - Serialized Python Objects

Supports both in-table and in-store storage.

Serializes Python objects using DataJoint's custom binary serialization format. The format uses protocol headers and type-specific encoding to serialize complex Python objects efficiently.

Serialization format:

- Protocol headers:
    - `mYm` - original MATLAB-compatible format for numeric arrays, structs, cells (see mYm on MATLAB FileExchange and mym on GitHub)
    - `dj0` - extended format supporting Python-specific types (UUID, Decimal, datetime, etc.)
- Compression: automatic zlib compression for data > 1KB
- Type codes: each Python type has a specific serialization code:
    - `'A'` - NumPy arrays (numeric)
    - `'F'` - NumPy recarrays (structured arrays with fields)
    - `'\x01'` - Tuples
    - `'\x02'` - Lists
    - `'\x03'` - Sets
    - `'\x04'` - Dicts
    - `'\x05'` - Strings (UTF-8)
    - `'\x06'` - Bytes
    - `'\x0a'` - Unbounded integers
    - `'\x0b'` - Booleans
    - `'\x0c'` - Complex numbers
    - `'\x0d'` - Floats
    - `'d'` - Decimal
    - `'t'` - Datetime/date/time
    - `'u'` - UUID
    - `'S'` - MATLAB structs
    - `'C'` - MATLAB cell arrays

Version detection: the protocol header (`mYm\0` or `dj0\0`) is embedded at the start of the blob, enabling automatic format detection and backward compatibility.

Storage modes:

- `<blob>`: stored in database (bytes → LONGBLOB/BYTEA)
- `<blob@>`: stored externally via `<hash@>` with MD5 deduplication
- `<blob@store>`: stored in specific named store
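Because the protocol header is the first bytes of every blob, format detection is a prefix check. A minimal sketch (the helper name is illustrative, not DataJoint's API):

```python
# Sniff the serialization protocol from the embedded header.
def detect_blob_protocol(blob: bytes) -> str:
    """Identify the blob format ('mYm' or 'dj0') from its header."""
    if blob.startswith(b"mYm\0"):
        return "mYm"
    if blob.startswith(b"dj0\0"):
        return "dj0"
    raise ValueError("unrecognized blob protocol header")
```
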
```python
class BlobCodec(dj.Codec):
    """Serialized Python objects. Supports internal and external."""
    name = "blob"

    def get_dtype(self, is_store: bool) -> str:
        return "<hash>" if is_store else "bytes"

    def encode(self, value, *, key=None, store_name=None) -> bytes:
        from . import blob
        return blob.pack(value, compress=True)

    def decode(self, stored, *, key=None) -> Any:
        from . import blob
        return blob.unpack(stored)
```
Usage:

```python
class ProcessedData(dj.Computed):
    definition = """
    -> RawData
    ---
    small_result   : <blob>        # in-table (in database)
    large_result   : <blob@>       # in-store (default store)
    archive_result : <blob@cold>   # in-store (specific store)
    """
```
### `<attach>` / `<attach@>` - File Attachments

Supports both in-table and in-store storage.

Stores files with the filename preserved. On fetch, extracts to the configured download path.

- `<attach>`: stored in database (bytes → LONGBLOB/BYTEA)
- `<attach@>`: stored in object store via `<hash@>` with deduplication
- `<attach@store>`: stored in specific named store
```python
class AttachCodec(dj.Codec):
    """File attachment with filename. Supports in-table and in-store."""
    name = "attach"

    def get_dtype(self, is_store: bool) -> str:
        return "<hash>" if is_store else "bytes"

    def encode(self, filepath, *, key=None, store_name=None) -> bytes:
        path = Path(filepath)
        return path.name.encode() + b"\0" + path.read_bytes()

    def decode(self, stored, *, key=None) -> str:
        filename, contents = stored.split(b"\0", 1)
        filename = filename.decode()
        download_path = Path(dj.config['download_path']) / filename
        download_path.write_bytes(contents)
        return str(download_path)
```
Usage:

```python
class Attachments(dj.Manual):
    definition = """
    attachment_id : int32
    ---
    config    : <attach>        # in-table (small file in DB)
    data_file : <attach@>       # in-store (default store)
    archive   : <attach@cold>   # in-store (specific store)
    """
```
## User-Defined Codecs
Users can define custom codecs for domain-specific data. See the Codec API Specification for complete examples including:
- Simple serialization codecs
- In-store codecs
- JSON with schema validation
- Context-dependent encoding
- In-store-only codecs (Zarr, HDF5)
## Storage Comparison

| Type | get_dtype | Resolves To | Storage Location | Dedup | Returns |
|---|---|---|---|---|---|
| `<blob>` | `bytes` | LONGBLOB/BYTEA | Database | No | Python object |
| `<blob@>` | `<hash>` | `json` | `_hash/{hash}` | Yes | Python object |
| `<blob@s>` | `<hash>` | `json` | `_hash/{hash}` | Yes | Python object |
| `<attach>` | `bytes` | LONGBLOB/BYTEA | Database | No | Local file path |
| `<attach@>` | `<hash>` | `json` | `_hash/{hash}` | Yes | Local file path |
| `<attach@s>` | `<hash>` | `json` | `_hash/{hash}` | Yes | Local file path |
| `<object@>` | `json` | JSON/JSONB | `{schema}/{table}/{pk}/` | No | ObjectRef |
| `<object@s>` | `json` | JSON/JSONB | `{schema}/{table}/{pk}/` | No | ObjectRef |
| `<hash@>` | `json` | JSON/JSONB | `_hash/{hash}` | Yes | bytes |
| `<hash@s>` | `json` | JSON/JSONB | `_hash/{hash}` | Yes | bytes |
| `<filepath@s>` | `json` | JSON/JSONB | Configured store | No | ObjectRef |
## Garbage Collection for Hash Storage

Hash metadata (hash, store, size) is stored directly in each table's JSON column; no separate registry table is needed. Garbage collection scans all tables to find referenced hashes:

```python
def garbage_collect(store_name):
    """Remove hash-addressed data not referenced by any table."""
    # Scan store for all hash files
    store = get_store(store_name)
    all_hashes = set(store.list_hashes())  # from _hash/ directory

    # Scan all tables for referenced hashes
    referenced = set()
    for schema in project.schemas:
        for table in schema.tables:
            for attr in table.heading.attributes:
                if uses_hash_storage(attr):  # <blob@>, <attach@>, <hash@>
                    for row in table:
                        val = row.get(attr.name)
                        if val and val.get('store') == store_name:
                            referenced.add(val['hash'])

    # Delete orphaned files
    for hash_id in (all_hashes - referenced):
        store.delete(hash_path(hash_id))
```
Built-in Codec Comparison¶
| Feature | <blob> |
<attach> |
<object@> |
<hash@> |
<filepath@> |
|---|---|---|---|---|---|
| Storage modes | Both | Both | External only | External only | External only |
| Internal dtype | bytes |
bytes |
N/A | N/A | N/A |
| External dtype | <hash@> |
<hash@> |
json |
json |
json |
| Addressing | Hash | Hash | Primary key | Hash | Relative path |
| Deduplication | Yes (external) | Yes (external) | No | Yes | No |
| Structure | Single blob | Single file | Files, folders | Single blob | Any |
| Returns | Python object | Local path | ObjectRef | bytes | ObjectRef |
| GC | Ref counted | Ref counted | With row | Ref counted | User managed |
When to use each:
- <blob>: Serialized Python objects (NumPy arrays, dicts). Use <blob@> for large/duplicated data
- <attach>: File attachments with filename preserved. Use <attach@> for large files
- <object@>: Large/complex file structures (Zarr, HDF5) where DataJoint controls organization
- <hash@>: Raw bytes with deduplication (typically used via <blob@> or <attach@>)
- <filepath@store>: Portable references to externally-managed files
- varchar: Arbitrary URLs/paths where ObjectRef semantics aren't needed
## Key Design Decisions

- Three-layer architecture:
    - Layer 1: native database types (backend-specific, discouraged)
    - Layer 2: core DataJoint types (standardized, scientist-friendly)
    - Layer 3: codec types (encode/decode, composable)
- Core types are scientist-friendly: `float32`, `int8`, `bool`, `bytes` instead of FLOAT, TINYINT, LONGBLOB
- Codecs use angle brackets: `<blob>`, `<object@store>`, `<filepath@main>` distinguishes them from core types
- `@` indicates in-store storage: no `@` = database, `@` present = object store
- `get_dtype(is_store)` method: codecs resolve their dtype at declaration time based on storage mode
- Codecs are composable: `<blob@>` uses `<hash@>`, which uses `json`
- Built-in in-store codecs use the `json` dtype: stores metadata (path, hash, store name, etc.)
- Two OAS regions: object (PK-addressed) and hash (hash-addressed) within managed stores
- Filepath for portability: `<filepath@store>` uses relative paths within stores for environment portability
- No `uri` type: for arbitrary URLs, use `varchar`; simpler and more transparent
- Naming conventions: `@` = in-store storage (object store); no `@` = in-table storage (database); `@` alone = default store; `@name` = named store
- Dual-mode codecs: `<blob>` and `<attach>` support both in-table and in-store storage
- In-store-only codecs: `<object@>`, `<hash@>`, `<filepath@>` require `@`
- Transparent access: codecs return Python objects or file paths
- Lazy access: `<object@>` and `<filepath@store>` return ObjectRef
- MD5 for content hashing: see Hash Algorithm Choice below
- No separate registry: hash metadata stored in JSON columns, not a separate table
- Auto-registration via `__init_subclass__`: codecs register automatically when subclassed; no decorator needed. Use `register=False` for abstract bases. Requires Python 3.10+.
## Hash Algorithm Choice

Content-addressed storage uses MD5 (128-bit, 32-char hex) rather than SHA256 (256-bit, 64-char hex).

Rationale:

1. Practical collision resistance is sufficient: the birthday bound for MD5 is ~2^64 operations before a 50% collision probability. No scientific project will store anywhere near 10^19 files. For content deduplication (not cryptographic verification), MD5 provides adequate uniqueness.
2. Storage efficiency: 32-char hashes vs 64-char hashes in every JSON metadata field. With millions of records, this halves the storage overhead for hash identifiers.
3. Performance: MD5 is ~2-3x faster than SHA256 for large files. While both are fast, the difference is measurable when hashing large scientific datasets.
4. Legacy compatibility: DataJoint's existing `uuid_from_buffer()` function uses MD5. The new system changes only the storage format (hex string in JSON vs binary UUID), not the underlying hash algorithm. This simplifies migration.
5. Consistency with existing codebase: internal functions use MD5 for query caching.

Why not SHA256?

SHA256 is the modern standard for content-addressable storage (Git, Docker, IPFS). However:

- These systems prioritize cryptographic security against adversarial collision attacks
- Scientific data pipelines face no adversarial threat model
- The practical benefits (storage, speed, compatibility) outweigh the theoretical security gains

Note: if cryptographic verification is ever needed (e.g., for compliance or reproducibility audits), SHA256 checksums can be computed on demand without changing the storage addressing scheme.
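The ~2^64 birthday-bound figure cited above can be checked with a line of arithmetic (a worked check, not part of the implementation):

```python
# Birthday bound for a 128-bit hash: ~sqrt(2^128) = 2^64 items
# before collision probability reaches ~50%.
import math

bits = 128                              # MD5 digest size in bits
birthday_bound = math.isqrt(2**bits)    # exact integer square root
assert birthday_bound == 2**64          # about 1.8e19 items
```
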
## Migration from Legacy Types

| Legacy | New Equivalent |
|---|---|
| `longblob` (auto-serialized) | `<blob>` |
| `blob@store` | `<blob@store>` |
| `attach` | `<attach>` |
| `attach@store` | `<attach@store>` |
| `filepath@store` (copy-based) | `<filepath@store>` (ObjectRef-based) |
## Migration from Legacy ~external_* Stores

Legacy external storage used per-schema `~external_{store}` tables with UUID references. Migration to the new JSON-based hash storage requires:

```python
def migrate_external_store(schema, store_name):
    """
    Migrate legacy ~external_{store} to the new JSON-based hash storage.

    1. Read all entries from ~external_{store}
    2. For each entry:
       - Fetch content from legacy location
       - Compute MD5 hash
       - Copy to _hash/{hash}/ if not exists
       - Update table column to new hash format
    3. After all schemas migrated, drop ~external_{store} tables
    """
    external_table = schema.external[store_name]
    for entry in external_table:
        legacy_uuid = entry['hash']

        # Fetch content from legacy location
        content = external_table.get(legacy_uuid)

        # Compute new content hash
        hash_id = hashlib.md5(content).hexdigest()

        # Store in new location if not exists
        new_path = f"_hash/{hash_id[:2]}/{hash_id[2:4]}/{hash_id}"
        store = get_store(store_name)
        if not store.exists(new_path):
            store.put(new_path, content)

        # Update referencing tables: convert UUID column to JSON with hash metadata
        # The JSON column stores {"hash": hash_id, "store": store_name, "size": len(content)}
        # ... update all tables that reference this UUID ...

    # After migration complete for all schemas:
    # DROP TABLE `{schema}`.`~external_{store}`
```
Migration considerations:

- Legacy UUIDs were based on the MD5 content hash, stored as `binary(16)` (UUID format)
- The new system uses 32-character MD5 hex strings stored in JSON
- The hash algorithm is unchanged (MD5); only the storage format differs
- Migration can be done incrementally per schema
- A backward-compatibility layer can read both formats during the transition
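Since a legacy `binary(16)` UUID holds the raw MD5 digest, the new 32-character hex string is just its hex expansion. A minimal sketch of the format conversion (the helper name is hypothetical):

```python
# Convert a legacy MD5-derived UUID to the new hex-string format.
import hashlib
import uuid

def legacy_uuid_to_hex(legacy: uuid.UUID) -> str:
    """Expand a legacy binary(16) UUID into the new 32-char hex hash."""
    return legacy.bytes.hex()   # same digits as md5(content).hexdigest()
```

Because the hash algorithm is unchanged, the converted value matches what `<hash@>` would compute for the same content.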
## Open Questions

- How long should the backward compatibility layer support the legacy `~external_*` format?
- Should `<hash@>` (without a store name) use a default store or require an explicit store name?