Use Object Storage¶

Store large data objects as part of your Object-Augmented Schema.

Object-Augmented Schema (OAS)¶

An Object-Augmented Schema (OAS) combines relational tables and object storage into a single system. The relational database stores metadata, references, and small values, while large objects (arrays, files, datasets) live in object storage. DataJoint maintains referential integrity across both storage layers: when you delete a row, its associated objects are cleaned up automatically.

OAS supports two addressing schemes:

Addressing	Location	Path Derived From	Object Type	Use Case
Hash-addressed	Object store	Content hash (MD5)	Individual/atomic	Single blobs, single files, attachments (with deduplication)
Schema-addressed	Object store	Schema structure	Complex/multi-part	Zarr arrays, HDF5 datasets, multi-file objects (browsable paths)

Key distinction:

Hash-addressed storage (<blob@>, <attach@>) holds individual, atomic objects: one object per field
Schema-addressed storage (<npy@>, <object@>) can hold complex, multi-part objects such as Zarr arrays (directory structures with multiple files)

Data can also be stored in-table directly in the database column (no @ modifier).

For complete details, see the Type System specification.

When to Use Object Storage¶

Use the @ modifier for:

Large arrays (images, videos, neural recordings)
File attachments
Zarr arrays and HDF5 files
Any data too large for efficient database storage

In-Table vs Object Store¶

@schema
class Recording(dj.Manual):
    definition = """
    recording_id : uuid
    ---
    metadata : <blob>           # In-table: stored in database column
    raw_data : <blob@>          # Object store: hash-addressed
    waveforms : <npy@>          # Object store: schema-addressed (lazy)
    """

Syntax	Storage	Best For
`<blob>`	Database	Small Python objects (typically < 1-10 MB)
`<attach>`	Database	Small files with filename (typically < 1-10 MB)
`<blob@>`	Default store	Large Python objects (hash-addressed, with dedup)
`<attach@>`	Default store	Large files with filename (hash-addressed, with dedup)
`<npy@>`	Default store	NumPy arrays (schema-addressed, lazy, navigable)
`<blob@store>`	Named store	Specific storage tier
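
The syntax in the table follows a regular pattern: a codec name, an optional @ modifier, and an optional store name. As a minimal sketch of that pattern (parse_codec and CodecSpec are hypothetical names for illustration, not part of the DataJoint API), a type string can be split like this:

```python
import re
from typing import NamedTuple, Optional

# Hypothetical helper for illustration only; not part of the DataJoint API.
_SPEC = re.compile(r"^<(?P<codec>\w+)(?P<at>@(?P<store>\w*))?>$")


class CodecSpec(NamedTuple):
    codec: str             # e.g. "blob", "attach", "npy"
    in_table: bool         # True when there is no @ modifier
    store: Optional[str]   # named store, or None for in-table / default store


def parse_codec(spec: str) -> CodecSpec:
    """Split a type such as '<blob@raw>' into codec, location, and store name."""
    match = _SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"not a codec type: {spec!r}")
    has_at = match.group("at") is not None
    store = match.group("store") or None   # a bare @ means the default store
    return CodecSpec(match.group("codec"), not has_at, store)


for text in ("<blob>", "<blob@>", "<blob@raw>"):
    print(text, parse_codec(text))
```

A bare @ and a missing @ both leave the store name empty, so the in_table flag is what separates database storage from the default object store.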

Store Data¶

Insert works the same regardless of storage location:

import uuid

import numpy as np

Recording.insert1({
    'recording_id': uuid.uuid4(),
    'metadata': {'channels': 32, 'rate': 30000},
    'raw_data': np.random.randn(32, 30000),   # ~7.7 MB array
    'waveforms': np.random.randn(1000, 32),   # required: the attribute has no default
})

DataJoint automatically routes each attribute to its configured store.

Retrieve Data¶

Fetch works transparently:

data = (Recording & key).fetch1('raw_data')
# Returns the numpy array, regardless of where it was stored

Named Stores¶

Use different stores for different data types:

@schema
class Experiment(dj.Manual):
    definition = """
    experiment_id : uuid
    ---
    raw_video : <blob@raw>        # Fast local storage
    processed : <blob@archive>    # S3 for long-term
    """

Configure stores in datajoint.json:

{
  "stores": {
    "default": "raw",
    "raw": {
      "protocol": "file",
      "location": "/fast/storage"
    },
    "archive": {
      "protocol": "s3",
      "endpoint": "s3.amazonaws.com",
      "bucket": "archive",
      "location": "project-data"
    }
  }
}
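
In this configuration the "default" entry holds the name of another store rather than its own settings. The sketch below shows one way such a lookup could work; resolve_store is a hypothetical helper and the lookup rule is an assumption for illustration, not DataJoint's actual resolution code.

```python
import json

# Same structure as the datajoint.json example above.
CONFIG_TEXT = """
{
  "stores": {
    "default": "raw",
    "raw": {"protocol": "file", "location": "/fast/storage"},
    "archive": {
      "protocol": "s3",
      "endpoint": "s3.amazonaws.com",
      "bucket": "archive",
      "location": "project-data"
    }
  }
}
"""


def resolve_store(stores: dict, name: str = "") -> dict:
    """Return the settings for a named store; '' means the default store.

    Hypothetical helper: the lookup rule is an assumption for illustration.
    """
    if not name:
        name = stores["default"]   # "default" holds the name of another entry
    settings = stores.get(name)
    if not isinstance(settings, dict):
        raise KeyError(f"store {name!r} is not configured")
    return settings


stores = json.loads(CONFIG_TEXT)["stores"]
print(resolve_store(stores)["location"])             # settings of the default store
print(resolve_store(stores, "archive")["protocol"])  # settings of a named store
```

Under this reading, <blob@> resolves to the "raw" entry and <blob@archive> resolves to the S3 entry.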

Hash-Addressed Storage¶

<blob@> and <attach@> use hash-addressed storage:

Objects are stored by their content hash (MD5)
Identical data is stored once (automatic deduplication)
Multiple rows can reference the same object
Immutable: changing data creates a new object

# These two inserts store the same array only once
data = np.zeros((1000, 1000))
Table.insert1({'id': 1, 'array': data})
Table.insert1({'id': 2, 'array': data})  # References same object
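
The deduplication behavior follows directly from using the content hash as the storage key. The toy in-memory store below illustrates the idea with MD5 from the standard library; HashAddressedStore is a hypothetical class for this sketch and does not reflect how DataJoint lays out or serializes objects.

```python
import hashlib


class HashAddressedStore:
    """Toy in-memory content-addressed store (illustration only)."""

    def __init__(self):
        self.objects = {}   # content hash -> stored bytes

    def put(self, data: bytes) -> str:
        """Store data under its MD5 digest and return the digest."""
        key = hashlib.md5(data).hexdigest()
        self.objects.setdefault(key, data)   # a second identical put is a no-op
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]


store = HashAddressedStore()
payload = bytes(8_000)               # stand-in for a serialized array of zeros
ref_1 = store.put(payload)           # row 1
ref_2 = store.put(payload)           # row 2: same content, same key
ref_3 = store.put(payload + b"x")    # changed content gets a new key

print(ref_1 == ref_2, ref_1 == ref_3, len(store.objects))
```

Because the key depends only on the bytes, two rows with identical data share one object, and any change to the data produces a different key instead of overwriting the old object.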

Schema-Addressed Storage¶

<npy@> and <object@> use schema-addressed storage:

Objects are stored at paths that mirror the database schema: {schema}/{table}/{pk}/{attribute}.npy
Browsable organization in object storage
One object per entity (no deduplication)
Supports lazy loading with metadata access
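
To make the path template concrete, here is a small sketch that fills it in from a row's identity. The function name and the name=value rendering of the primary key are assumptions for illustration; the actual path layout may differ.

```python
def schema_addressed_path(schema: str, table: str, primary_key: dict,
                          attribute: str, suffix: str = ".npy") -> str:
    """Build '{schema}/{table}/{pk}/{attribute}{suffix}' from a row's identity.

    The 'name=value' rendering of the primary key is an assumption made for
    this sketch; the real layout may differ.
    """
    pk_part = "/".join(f"{name}={value}" for name, value in primary_key.items())
    return f"{schema}/{table}/{pk_part}/{attribute}{suffix}"


path_1 = schema_addressed_path("ephys", "recording", {"recording_id": 1}, "waveform")
path_2 = schema_addressed_path("ephys", "recording", {"recording_id": 2}, "waveform")
print(path_1)
print(path_2)
```

Since the path is derived from the primary key rather than the content, every entity gets its own object, which is why this scheme is browsable but does not deduplicate.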

@schema
class Dataset(dj.Manual):
    definition = """
    dataset_id : uuid
    ---
    zarr_array : <object@>      # Zarr array stored by path
    """

Use schema-addressed storage for:

Zarr arrays (chunked, appendable)
HDF5 files
Large datasets requiring streaming access

Write Directly to Object Storage¶

For multi-GB imaging recordings, Zarr arrays, HDF5 files, or any object too large to round-trip through local storage, use Staged Insert. It writes directly to the destination object store inside a context manager and commits the database row atomically on clean exit:

with ImagingSession.staged_insert1 as staged:
    staged.rec['subject_id'] = 1
    staged.rec['session_id'] = 1
    z = zarr.open(staged.store('frames', '.zarr'), mode='w', ...)
    ...

See Staged Insert for the full API, atomicity guarantees, and patterns for Zarr, HDF5, and streaming sources.
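
The write-first, commit-last ordering can be illustrated with a toy context manager built on plain Python containers. This is a sketch of the general pattern only: staged_insert and its write helper are hypothetical, and removing objects after a failure is an assumption of this sketch rather than a documented guarantee.

```python
from contextlib import contextmanager


@contextmanager
def staged_insert(rows: list, object_store: dict):
    """Toy staged insert: objects are written first, the row is added last."""
    record = {}
    written = []                      # keys written during this attempt

    def write(key: str, data: bytes) -> str:
        object_store[key] = data      # goes straight to the "store"
        written.append(key)
        return key

    try:
        yield record, write
    except BaseException:
        for key in written:           # failure: remove orphaned objects
            object_store.pop(key, None)
        raise
    else:
        rows.append(dict(record))     # clean exit: commit the row


rows, objects = [], {}

with staged_insert(rows, objects) as (rec, write):
    rec["session_id"] = 1
    rec["frames"] = write("session_1/frames", b"\x00" * 16)

try:
    with staged_insert(rows, objects) as (rec, write):
        rec["session_id"] = 2
        rec["frames"] = write("session_2/frames", b"\x01" * 16)
        raise RuntimeError("acquisition failed")
except RuntimeError:
    pass

print(len(rows), sorted(objects))
```

Only the first session ends up in rows and objects: the second block raised before its clean exit, so no row was committed for it.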

Attachments¶

Preserve original filenames with <attach@>:

@schema
class Document(dj.Manual):
    definition = """
    doc_id : uuid
    ---
    report : <attach@>          # Preserves filename
    """

# Insert with AttachFileType
import uuid

from datajoint import AttachFileType
Document.insert1({
    'doc_id': uuid.uuid4(),
    'report': AttachFileType('/path/to/report.pdf')
})

NumPy Arrays with `<npy@>`¶

The <npy@> codec stores NumPy arrays as portable .npy files with lazy loading:

@schema
class Recording(dj.Manual):
    definition = """
    recording_id : int32
    ---
    waveform : <npy@mystore>    # NumPy array, schema-addressed
    """

# Insert - just pass the array
Recording.insert1({
    'recording_id': 1,
    'waveform': np.random.randn(1000, 32),
})

# Fetch returns NpyRef (lazy)
ref = (Recording & 'recording_id=1').fetch1('waveform')

NpyRef: Lazy Array Reference¶

NpyRef provides metadata without downloading:

ref = (Recording & key).fetch1('waveform')

# Metadata access - NO download
ref.shape    # (1000, 32)
ref.dtype    # float64
ref.nbytes   # 256000
ref.is_loaded  # False

# Explicit loading
arr = ref.load()    # Downloads and caches
ref.is_loaded       # True

# Numpy integration (triggers download)
result = np.mean(ref)           # Uses __array__ protocol
result = np.asarray(ref) + 1    # Convert then operate
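
The behavior above (metadata available immediately, payload fetched once on demand) is a common lazy-reference pattern. The sketch below reproduces it with the standard library only; LazyArrayRef and fake_download are hypothetical stand-ins and do not show how NpyRef itself is implemented.

```python
from math import prod


class LazyArrayRef:
    """Toy lazy reference: metadata is local, the payload loads on demand."""

    def __init__(self, shape, itemsize, loader):
        self.shape = shape
        self.itemsize = itemsize   # bytes per element, e.g. 8 for float64
        self._loader = loader      # callable that performs the "download"
        self._data = None

    @property
    def nbytes(self) -> int:
        return prod(self.shape) * self.itemsize

    @property
    def is_loaded(self) -> bool:
        return self._data is not None

    def load(self):
        if self._data is None:     # first call downloads, later calls reuse
            self._data = self._loader()
        return self._data


downloads = []


def fake_download():
    downloads.append(1)            # count how often the store is hit
    return [[0.0] * 32 for _ in range(1000)]


ref = LazyArrayRef(shape=(1000, 32), itemsize=8, loader=fake_download)
print(ref.nbytes, ref.is_loaded, len(downloads))   # metadata only, no download
ref.load()
ref.load()                                         # served from the cache
print(ref.is_loaded, len(downloads))
```

The nbytes value is plain arithmetic on the metadata: 1000 × 32 elements × 8 bytes per float64 gives the 256000 shown earlier, with no data transferred.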

Bulk Fetch Safety¶

Fetching many rows downloads nothing until you load an individual array:

# Fetch 1000 recordings - NO downloads yet
results = Recording.to_dicts()

# Inspect metadata without downloading
for rec in results:
    ref = rec['waveform']
    if ref.shape[0] > 500:     # Check without download
        process(ref.load())     # Download only what you need

Lazy Loading with ObjectRef¶

<object@> and <filepath@> return lazy references:

ref = (Dataset & key).fetch1('zarr_array')

# Open for streaming access
with ref.open() as f:
    data = zarr.open(f)

# Or download to local path
local_path = ref.download('/tmp/data')

Storage Best Practices¶

Choose the Right Codec¶

Data Type	Codec	Addressing	Lazy	Best For
NumPy arrays	`<npy@>`	Schema	Yes	Arrays needing lazy load, metadata inspection
Python objects	`<blob>` or `<blob@>`	In-table or Hash	No	Dicts, lists, arrays (use `@` for large/dedup)
File attachments	`<attach>` or `<attach@>`	In-table or Hash	No	Files with filename preserved (use `@` for large/dedup)
Zarr/HDF5	`<object@>`	Schema	Yes	Chunked arrays, streaming access
File references	`<filepath@>`	External	Yes	References to external files

Size Guidelines¶

Technical limits:

MySQL: In-table blobs up to 4 GiB (LONGBLOB)
PostgreSQL: In-table blobs up to 1 GB per value (BYTEA)

Practical recommendations (consider accessibility, cost, performance):

< 1-10 MB: In-table storage (<blob>) is often sufficient
10-100 MB: Object store (<blob@> with dedup, or <npy@> for arrays)
> 100 MB: Schema-addressed (<npy@>, <object@>) for streaming and lazy loading
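
These recommendations can be turned into a rough rule of thumb. The sketch below uses a hypothetical suggest_storage helper and assumes the upper end of the 1-10 MB band as the in-table cutoff; treat the thresholds as starting points rather than fixed limits.

```python
MB = 1_000_000   # decimal megabytes


def suggest_storage(nbytes: int) -> str:
    """Map an object size to the codec family recommended above.

    Assumption: 10 MB (the top of the 1-10 MB band) is used as the cutoff
    for in-table storage.
    """
    if nbytes < 10 * MB:
        return "<blob>"
    if nbytes <= 100 * MB:
        return "<blob@> or <npy@>"
    return "<npy@> or <object@>"


FLOAT64_BYTES = 8
raw_data_bytes = 32 * 30_000 * FLOAT64_BYTES   # a (32, 30000) float64 array
print(raw_data_bytes, suggest_storage(raw_data_bytes))
print(suggest_storage(50 * MB))
print(suggest_storage(500 * MB))
```

A (32, 30000) float64 array is 7.68 MB, which falls inside the 1-10 MB gray zone: in-table storage would work, but an object store is equally reasonable if many such rows accumulate.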

Store Tiers¶

Configure stores for different access patterns:

{
  "stores": {
    "default": "hot",
    "hot": {
      "protocol": "file",
      "location": "/ssd/data"
    },
    "warm": {
      "protocol": "s3",
      "endpoint": "s3.amazonaws.com",
      "bucket": "project-data",
      "location": "active"
    },
    "cold": {
      "protocol": "s3",
      "endpoint": "s3.amazonaws.com",
      "bucket": "archive",
      "location": "long-term"
    }
  }
}