Use Object Storage

Store large data objects as part of your Object-Augmented Schema.

Object-Augmented Schema (OAS)

An Object-Augmented Schema extends relational tables with object storage as a unified system. The relational database stores metadata, references, and small values while large objects (arrays, files, datasets) are stored in object storage. DataJoint maintains referential integrity across both storage layers—when you delete a row, its associated objects are cleaned up automatically.

OAS supports two addressing schemes:

Addressing         Location       Path Derived From    Object Type          Use Case
Hash-addressed     Object store   Content hash (MD5)   Individual/atomic    Single blobs, single files, attachments (with deduplication)
Schema-addressed   Object store   Schema structure     Complex/multi-part   Zarr arrays, HDF5 datasets, multi-file objects (browsable paths)

Key distinction:

  • Hash-addressed (<blob@>, <attach@>) stores individual, atomic objects - one object per field
  • Schema-addressed (<npy@>, <object@>) can store complex, multi-part objects like Zarr (directory structures with multiple files)

Data can also be stored in-table directly in the database column (no @ modifier).

For complete details, see the Type System specification.
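The two addressing schemes can be sketched in plain Python. This is purely illustrative path construction, not DataJoint's internal implementation; the directory fan-out and path layout shown here are assumptions:

```python
import hashlib

def hash_addressed_path(content: bytes) -> str:
    """Hash-addressed: the path is derived from the content itself (MD5)."""
    digest = hashlib.md5(content).hexdigest()
    # Fan out by the first hex chars to keep directories small (illustrative layout)
    return f"_hash/{digest[:2]}/{digest}"

def schema_addressed_path(schema: str, table: str, pk: str, attribute: str) -> str:
    """Schema-addressed: the path mirrors the database schema, so it is browsable."""
    return f"{schema}/{table}/{pk}/{attribute}.npy"

# Identical content always maps to the same hash-addressed path (deduplication)
assert hash_addressed_path(b"same bytes") == hash_addressed_path(b"same bytes")

# Schema-addressed paths are human-readable and unique per entity
print(schema_addressed_path("lab", "recording", "recording_id=1", "waveform"))
# lab/recording/recording_id=1/waveform.npy
```

The key consequence: hash-addressed objects deduplicate automatically but have opaque paths, while schema-addressed objects are browsable but stored once per entity.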

When to Use Object Storage

Use the @ modifier for:

  • Large arrays (images, videos, neural recordings)
  • File attachments
  • Zarr arrays and HDF5 files
  • Any data too large for efficient database storage

In-Table vs Object Store

@schema
class Recording(dj.Manual):
    definition = """
    recording_id : uuid
    ---
    metadata : <blob>           # In-table: stored in database column
    raw_data : <blob@>          # Object store: hash-addressed
    waveforms : <npy@>          # Object store: schema-addressed (lazy)
    """

Syntax         Storage        Best For
<blob>         Database       Small Python objects (typically < 1-10 MB)
<attach>       Database       Small files with filename (typically < 1-10 MB)
<blob@>        Default store  Large Python objects (hash-addressed, with dedup)
<attach@>      Default store  Large files with filename (hash-addressed, with dedup)
<npy@>         Default store  NumPy arrays (schema-addressed, lazy, navigable)
<blob@store>   Named store    Specific storage tier

Store Data

Insert works the same regardless of storage location:

import uuid
import numpy as np

Recording.insert1({
    'recording_id': uuid.uuid4(),
    'metadata': {'channels': 32, 'rate': 30000},
    'raw_data': np.random.randn(32, 30000)  # ~7.7 MB array
})

DataJoint automatically routes to the configured store.

Retrieve Data

Fetch works transparently:

data = (Recording & key).fetch1('raw_data')
# Returns the numpy array, regardless of where it was stored

Named Stores

Use different stores for different data types:

@schema
class Experiment(dj.Manual):
    definition = """
    experiment_id : uuid
    ---
    raw_video : <blob@raw>        # Fast local storage
    processed : <blob@archive>    # S3 for long-term
    """

Configure stores in datajoint.json:

{
  "stores": {
    "default": "raw",
    "raw": {
      "protocol": "file",
      "location": "/fast/storage"
    },
    "archive": {
      "protocol": "s3",
      "endpoint": "s3.amazonaws.com",
      "bucket": "archive",
      "location": "project-data"
    }
  }
}

Hash-Addressed Storage

<blob@> and <attach@> use hash-addressed storage:

  • Objects are stored by their content hash (MD5)
  • Identical data is stored once (automatic deduplication)
  • Multiple rows can reference the same object
  • Immutable—changing data creates a new object

# These two inserts store the same array only once
data = np.zeros((1000, 1000))
Table.insert1({'id': 1, 'array': data})
Table.insert1({'id': 2, 'array': data})  # References same object
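The deduplication above can be illustrated with a toy content-addressed store. This is a sketch of the mechanism, not DataJoint's actual storage code:

```python
import hashlib

class ToyHashStore:
    """Minimal content-addressed store: identical payloads are written once."""
    def __init__(self):
        self.objects = {}   # hash -> bytes
        self.writes = 0     # how many times new data was actually stored

    def put(self, payload: bytes) -> str:
        key = hashlib.md5(payload).hexdigest()
        if key not in self.objects:     # already stored? then this is a no-op
            self.objects[key] = payload
            self.writes += 1
        return key                      # rows store only this reference

store = ToyHashStore()
data = b"\x00" * 1024                   # stand-in for a serialized array
ref1 = store.put(data)                  # first insert: object written
ref2 = store.put(data)                  # second insert: deduplicated
assert ref1 == ref2 and store.writes == 1
```

Because rows hold only the hash reference, deleting one row leaves the shared object in place until no rows reference it.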

Schema-Addressed Storage

<npy@> and <object@> use schema-addressed storage:

  • Objects stored at paths that mirror database schema: {schema}/{table}/{pk}/{attribute}.npy
  • Browsable organization in object storage
  • One object per entity (no deduplication)
  • Supports lazy loading with metadata access

@schema
class Dataset(dj.Manual):
    definition = """
    dataset_id : uuid
    ---
    zarr_array : <object@>      # Zarr array stored by path
    """

Use schema-addressed storage for:

  • Zarr arrays (chunked, appendable)
  • HDF5 files
  • Large datasets requiring streaming access

Write Directly to Object Storage

For large datasets like multi-GB imaging recordings, avoid intermediate copies by writing directly to object storage with staged_insert1:

import zarr

@schema
class ImagingSession(dj.Manual):
    definition = """
    subject_id : int32
    session_id : int32
    ---
    n_frames : int32
    frame_rate : float32
    frames : <object@>
    """

# Write Zarr directly to object storage
with ImagingSession.staged_insert1 as staged:
    # 1. Set primary key values first
    staged.rec['subject_id'] = 1
    staged.rec['session_id'] = 1

    # 2. Get storage handle
    store = staged.store('frames', '.zarr')

    # 3. Write directly (no local copy)
    z = zarr.open(store, mode='w', shape=(1000, 512, 512),
                  chunks=(10, 512, 512), dtype='int32')
    for i in range(1000):
        z[i] = acquire_frame()  # Write frame-by-frame

    # 4. Set remaining attributes
    staged.rec['n_frames'] = 1000
    staged.rec['frame_rate'] = 30.0

# Record inserted with computed metadata on successful exit

The staged_insert1 context manager:

  • Writes directly to the object store (no intermediate files)
  • Computes metadata (size, manifest) automatically on exit
  • Cleans up storage if an error occurs (atomic)
  • Requires primary key values before calling store() or open()

Use staged.store(field, ext) for FSMap access (Zarr), or staged.open(field, ext) for file-like access.
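The atomic cleanup behavior can be sketched with a plain context manager. This is illustrative only; `staged_insert1`'s real rollback logic is internal to DataJoint, and the list-based "store" here is a stand-in:

```python
from contextlib import contextmanager

@contextmanager
def staged_write(written_paths: list):
    """Track writes; on any error, delete everything written so far (atomic)."""
    try:
        yield written_paths
    except Exception:
        # Roll back: remove every object written inside the block
        written_paths.clear()   # stand-in for deleting objects from the store
        raise

paths = []
try:
    with staged_write(paths) as w:
        w.append("lab/imaging/1/frames.zarr")  # simulated object write
        raise RuntimeError("acquisition failed mid-write")
except RuntimeError:
    pass

assert paths == []   # nothing left behind after the failed write
```

On a clean exit nothing is rolled back, which mirrors how `staged_insert1` commits the row and its metadata only when the block completes without error.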

Attachments

Preserve original filenames with <attach@>:

@schema
class Document(dj.Manual):
    definition = """
    doc_id : uuid
    ---
    report : <attach@>          # Preserves filename
    """

# Insert with AttachFileType
from datajoint import AttachFileType
Document.insert1({
    'doc_id': uuid.uuid4(),
    'report': AttachFileType('/path/to/report.pdf')
})

NumPy Arrays with <npy@>

The <npy@> codec stores NumPy arrays as portable .npy files with lazy loading:

@schema
class Recording(dj.Manual):
    definition = """
    recording_id : int32
    ---
    waveform : <npy@mystore>    # NumPy array, schema-addressed
    """

# Insert - just pass the array
Recording.insert1({
    'recording_id': 1,
    'waveform': np.random.randn(1000, 32),
})

# Fetch returns NpyRef (lazy)
ref = (Recording & 'recording_id=1').fetch1('waveform')

NpyRef: Lazy Array Reference

NpyRef provides metadata without downloading:

ref = (Recording & key).fetch1('waveform')

# Metadata access - NO download
ref.shape    # (1000, 32)
ref.dtype    # float64
ref.nbytes   # 256000
ref.is_loaded  # False

# Explicit loading
arr = ref.load()    # Downloads and caches
ref.is_loaded       # True

# Numpy integration (triggers download)
result = np.mean(ref)           # Uses __array__ protocol
result = np.asarray(ref) + 1    # Convert then operate
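The lazy-loading pattern behind NpyRef can be sketched without DataJoint. This minimal stand-in mirrors the `shape`, `is_loaded`, and `load()` names used above, but its internals are assumptions, not the real class:

```python
class LazyRef:
    """Minimal lazy reference: metadata is free, the payload loads on demand."""
    def __init__(self, shape, dtype, fetch):
        self.shape = shape          # metadata available without download
        self.dtype = dtype
        self._fetch = fetch         # callable that performs the download
        self._data = None

    @property
    def is_loaded(self):
        return self._data is not None

    def load(self):
        if self._data is None:      # download once, then cache
            self._data = self._fetch()
        return self._data

downloads = []
ref = LazyRef((1000, 32), "float64", lambda: downloads.append(1) or [0.0] * 10)

assert ref.shape == (1000, 32) and not ref.is_loaded   # metadata: no download
ref.load()
ref.load()
assert ref.is_loaded and len(downloads) == 1           # fetched exactly once
```

The same caching idea is why repeated `np.mean(ref)`-style calls on a real NpyRef do not re-download the array.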

Bulk Fetch Safety

Fetching many rows does not download any array data until you access each array:

# Fetch 1000 recordings - NO downloads yet
results = Recording.to_dicts()

# Inspect metadata without downloading
for rec in results:
    ref = rec['waveform']
    if ref.shape[0] > 500:     # Check without download
        process(ref.load())     # Download only what you need

Lazy Loading with ObjectRef

<object@> and <filepath@> return lazy references:

ref = (Dataset & key).fetch1('zarr_array')

# Open for streaming access
with ref.open() as f:
    data = zarr.open(f)

# Or download to local path
local_path = ref.download('/tmp/data')

Storage Best Practices

Choose the Right Codec

Data Type         Codec                  Addressing        Lazy  Best For
NumPy arrays      <npy@>                 Schema            Yes   Arrays needing lazy load, metadata inspection
Python objects    <blob> or <blob@>      In-table or Hash  No    Dicts, lists, arrays (use @ for large/dedup)
File attachments  <attach> or <attach@>  In-table or Hash  No    Files with filename preserved (use @ for large/dedup)
Zarr/HDF5         <object@>              Schema            Yes   Chunked arrays, streaming access
File references   <filepath@>            External          Yes   References to external files

Size Guidelines

Technical limits:

  • MySQL: in-table blobs up to 4 GiB (LONGBLOB)
  • PostgreSQL: in-table blobs unlimited (BYTEA)

Practical recommendations (consider accessibility, cost, performance):

  • < 1-10 MB: in-table storage (<blob>) often sufficient
  • 10-100 MB: object store (<blob@> with dedup, or <npy@> for arrays)
  • > 100 MB: schema-addressed (<npy@>, <object@>) for streaming and lazy loading
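These rules of thumb can be captured in a small helper. This is an illustrative sketch of the decision logic, not part of the DataJoint API, and the 10 MB / 100 MB cutoffs are the approximate thresholds from the guidelines above:

```python
def suggest_codec(size_bytes: int, is_array: bool = False) -> str:
    """Rule-of-thumb codec choice from the size guidelines above."""
    MB = 1024 ** 2
    if size_bytes < 10 * MB:
        return "<blob>"                             # small enough for in-table
    if size_bytes <= 100 * MB:
        return "<npy@>" if is_array else "<blob@>"  # object store: dedup or lazy
    return "<npy@>" if is_array else "<object@>"    # streaming / lazy loading

assert suggest_codec(1024 ** 2) == "<blob>"
assert suggest_codec(50 * 1024 ** 2, is_array=True) == "<npy@>"
assert suggest_codec(500 * 1024 ** 2) == "<object@>"
```

Actual thresholds should reflect your storage costs and access patterns; the point is that the codec choice is mechanical once size and data type are known.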

Store Tiers

Configure stores for different access patterns:

{
  "stores": {
    "default": "hot",
    "hot": {
      "protocol": "file",
      "location": "/ssd/data"
    },
    "warm": {
      "protocol": "s3",
      "endpoint": "s3.amazonaws.com",
      "bucket": "project-data",
      "location": "active"
    },
    "cold": {
      "protocol": "s3",
      "endpoint": "s3.amazonaws.com",
      "bucket": "archive",
      "location": "long-term"
    }
  }
}

See Also