
Choose a Storage Type

Select the right storage codec for your data based on size, access patterns, and lifecycle requirements.

Quick Decision Tree

Start: What type of data are you storing?

├─ Small data (typically < 1-10 MB per row)?
│  ├─ Python objects (dicts, arrays)? → Use <blob> (in-table)
│  └─ Files with filename? → Use <attach> (in-table)
│
├─ Externally managed files?
│  ├─ YES → Use <filepath@> (reference only)
│  └─ NO  → Continue...
│
├─ Need browsable storage or access by external tools?
│  ├─ YES → Use <object@> or <npy@> (schema-addressed)
│  └─ NO  → Continue...
│
├─ Need streaming or partial reads?
│  ├─ YES → Use <object@> (schema-addressed, Zarr/HDF5)
│  └─ NO  → Continue...
│
├─ NumPy arrays that benefit from lazy loading?
│  ├─ YES → Use <npy@> (optimized NumPy storage)
│  └─ NO  → Continue...
│
└─ Python objects (dicts, arrays)?
   ├─ YES → Use <blob@> (hash-addressed)
   └─ NO  → Use <attach@> (files with filename preserved)

Storage Types Overview

| Codec | Location | Addressing | Python Objects | Dedup | Best For |
|---|---|---|---|---|---|
| <blob> | In-table (database) | Row-based | ✅ Yes | No | Small Python objects (typically < 1-10 MB) |
| <attach> | In-table (database) | Row-based | ❌ No (file path) | No | Small files with filename preserved |
| <blob@> | Object store | Content hash | ✅ Yes | Yes | Large Python objects (with dedup) |
| <attach@> | Object store | Content hash | ❌ No (file path) | Yes | Large files with filename preserved |
| <npy@> | Object store | Schema + key | ✅ Yes (arrays) | No | NumPy arrays (lazy load, navigable) |
| <object@> | Object store | Schema + key | ❌ No (you manage format) | No | Zarr, HDF5 (browsable, streaming) |
| <filepath@> | Object store | User path | ❌ No (you manage format) | No | External file references |

Key Usability: Python Object Convenience

Major advantage of <blob>, <blob@>, and <npy@>: You work with Python objects directly. No manual serialization, file handling, or IO management.

# <blob> and <blob@>: Insert Python objects, get Python objects back
@schema
class Analysis(dj.Computed):
    definition = """
    -> Experiment
    ---
    results : <blob@>         # Any Python object: dicts, lists, arrays
    """

    def make(self, key):
        # Insert nested Python structures directly
        results = {
            'accuracy': 0.95,
            'confusion_matrix': np.array([[10, 2], [1, 15]]),
            'metadata': {'method': 'SVM', 'params': [1, 2, 3]}
        }
        self.insert1({**key, 'results': results})

# Fetch: Get Python object back (no manual unpickling)
data = (Analysis & key).fetch1('results')
print(data['accuracy'])           # 0.95
print(data['confusion_matrix'])   # numpy array

# <npy@>: Insert array-like objects, get array-like objects back
@schema
class Recording(dj.Manual):
    definition = """
    recording_id : uuid
    ---
    traces : <npy@>           # NumPy arrays (no manual .npy files)
    """

# Insert: Just pass the array
import uuid  # for generating the primary key
Recording.insert1({'recording_id': uuid.uuid4(), 'traces': np.random.randn(1000, 32)})

# Fetch: Get array-like object (NpyRef with lazy loading)
ref = (Recording & key).fetch1('traces')
print(ref.shape)              # (1000, 32) - metadata without download
subset = ref[:100, :]         # Lazy slicing

Contrast with <object@> and <filepath@>: You manage the format (Zarr, HDF5, etc.) and handle file IO yourself. More flexible, but requires format knowledge.

Detailed Decision Criteria

Size and Storage Location

Technical Limits:

  • MySQL: In-table blobs up to 4 GiB (LONGBLOB)
  • PostgreSQL: In-table blobs unlimited (BYTEA)
  • Object stores: Effectively unlimited (S3, file systems, etc.)

Practical Guidance:

The choice between in-table (<blob>) and object storage (<blob@>, <npy@>, <object@>) is a complex decision involving:

  • Accessibility: How fast do you need to access the data?
  • Cost: Database storage vs object storage pricing
  • Performance: Query speed, backup time, replication overhead

General recommendations:

Try to keep in-table blobs under ~1-10 MB, but this depends on your specific use case:

@schema
class Experiment(dj.Manual):
    definition = """
    experiment_id : uuid
    ---
    metadata : <blob>         # Small: config, parameters (< 1 MB)
    thumbnail : <blob>        # Medium: preview images (< 10 MB)
    raw_data : <blob@>        # Large: raw recordings (> 10 MB)
    """

When to use in-table storage (<blob>):

  • Fast access needed (no external fetch)
  • Data frequently queried alongside other columns
  • Transactional consistency critical
  • Automatic backup with database important
  • No object storage configuration available

When to use object storage (<blob@>, <npy@>, <object@>):

  • Data larger than ~10 MB
  • Infrequent access patterns
  • Need deduplication (hash-addressed types)
  • Need browsable structure (schema-addressed types)
  • Want to separate hot data (DB) from cold data (object store)

Examples by size:

  • < 1 MB: Configuration JSON, metadata, small parameter arrays → <blob>
  • 1-10 MB: Thumbnails, processed features, small waveforms → <blob> or <blob@> depending on access pattern
  • 10-100 MB: Neural recordings, images, PDFs → <blob@> or <attach@>
  • > 100 MB: Zarr arrays, HDF5 datasets, large videos → <object@> or <npy@>

Access Pattern Guidelines

Full Access Every Time

Use <blob@> (hash-addressed):

class ProcessedImage(dj.Computed):
    definition = """
    -> RawImage
    ---
    processed : <blob@>       # Always load full image
    """

Typical pattern:

# Fetch always gets full data
img = (ProcessedImage & key).fetch1('processed')


Streaming / Partial Reads

Use <object@> (schema-addressed):

class ScanVolume(dj.Manual):
    definition = """
    scan_id : uuid
    ---
    volume : <object@>        # Stream chunks as needed
    """

Typical pattern:

# Get reference without downloading
ref = (ScanVolume & key).fetch1('volume')

# Stream specific chunks
import zarr
z = zarr.open(ref.fsmap, mode='r')
slice_data = z[100:200, :, :]  # Fetch only this slice


NumPy Arrays with Lazy Loading

Use <npy@> (optimized for NumPy):

class NeuralActivity(dj.Computed):
    definition = """
    -> Recording
    ---
    traces : <npy@>           # NumPy array, lazy load
    """

Typical pattern:

# Returns NpyRef (lazy)
ref = (NeuralActivity & key).fetch1('traces')

# Access like NumPy array (loads on demand)
subset = ref[:100, :]         # Efficient slicing
shape = ref.shape             # Metadata without loading

Why <npy@> over <blob@> for arrays:

  • Lazy loading (doesn't load until accessed)
  • Efficient slicing (can fetch subsets)
  • Preserves shape/dtype metadata
  • Native NumPy serialization
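The lazy-loading behavior of `NpyRef` is similar in spirit to NumPy's own memory-mapped loading. A stand-alone illustration using plain NumPy (no DataJoint involved): the file's shape and dtype are available immediately, and only the sliced region is materialized in memory.

```python
import os
import tempfile
import numpy as np

# Write an array to a .npy file, then reopen it memory-mapped.
path = os.path.join(tempfile.mkdtemp(), "traces.npy")
np.save(path, np.random.randn(1000, 32))

ref = np.load(path, mmap_mode="r")   # lazy: no full read yet
print(ref.shape)                     # (1000, 32) - metadata only
subset = np.asarray(ref[:100, :])    # only this slice is read from disk
print(subset.shape)                  # (100, 32)
```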

Lifecycle and Management

DataJoint-Managed (Integrated)

Use <blob@>, <npy@>, or <object@>:

class ManagedData(dj.Manual):
    definition = """
    data_id : uuid
    ---
    content : <blob@>         # DataJoint manages lifecycle
    """

DataJoint provides:

  • ✅ Automatic cleanup (garbage collection)
  • ✅ Transactional integrity (atomic with database)
  • ✅ Referential integrity (cascading deletes)
  • ✅ Content deduplication (for <blob@>, <attach@>)

User manages:

  • ❌ File paths (DataJoint decides)
  • ❌ Cleanup (automatic)
  • ❌ Integrity (enforced)


User-Managed (References)

Use <filepath@>:

class ExternalData(dj.Manual):
    definition = """
    data_id : uuid
    ---
    raw_file : <filepath@>    # User manages file
    """

User provides:

  • ✅ File paths (you control organization)
  • ✅ File lifecycle (you create/delete)
  • ✅ Existing files (reference external data)

DataJoint provides:

  • ✅ Path validation (file exists on insert)
  • ✅ ObjectRef for lazy access
  • ❌ No garbage collection
  • ❌ No transaction safety for files
  • ❌ No deduplication

Use when:

  • Files managed by external systems
  • Referencing existing data archives
  • Custom file organization required
  • Large instrument output directories

Storage Type Comparison

In-Table: <blob>

Storage: Database column (LONGBLOB)

Syntax:

small_data : <blob>

Characteristics:

  • ✅ Fast access (in database)
  • ✅ Transactional consistency
  • ✅ Automatic backup
  • ✅ No store configuration needed
  • ✅ Python object convenience: insert/fetch dicts, lists, arrays directly (no manual IO)
  • ✅ Automatic serialization + gzip compression
  • ✅ Technical limit: 4 GiB (MySQL), unlimited (PostgreSQL)
  • ❌ Practical limit: keep under ~1-10 MB for performance
  • ❌ No deduplication
  • ❌ Database bloat for large data

Best for:

  • Configuration JSON (dicts/lists)
  • Small arrays/matrices
  • Thumbnails
  • Nested data structures


In-Table: <attach>

Storage: Database column (LONGBLOB)

Syntax:

config_file : <attach>

Characteristics:

  • ✅ Fast access (in database)
  • ✅ Transactional consistency
  • ✅ Automatic backup
  • ✅ No store configuration needed
  • ✅ Filename preserved: original filename stored with content
  • ✅ Automatic gzip compression
  • ✅ Technical limit: 4 GiB (MySQL), unlimited (PostgreSQL)
  • ❌ Practical limit: keep under ~1-10 MB for performance
  • ❌ No deduplication
  • ❌ Returns file path (extracts to download directory), not Python object

Best for:

  • Small configuration files
  • Document attachments (< 10 MB)
  • Files where original filename matters
  • When you need the file extracted to disk

Difference from <blob>:

  • <blob>: Stores Python objects (dicts, arrays) → returns Python object
  • <attach>: Stores files with filename → returns local file path


Hash-Addressed: <blob@> or <attach@>

Storage: Object store at {store}/_hash/{schema}/{hash}

Syntax:

data : <blob@>              # Default store
data : <blob@mystore>       # Named store
file : <attach@>            # File attachments

Characteristics (both):

  • ✅ Content deduplication (identical data stored once)
  • ✅ Automatic gzip compression
  • ✅ Garbage collection
  • ✅ Transaction safety
  • ✅ Referential integrity
  • ✅ Moderate to large files (1 MB - 100 GB)
  • ❌ Full download on fetch (no streaming)
  • ❌ Storage path not browsable (hash-based)

<blob@> specific:

  • ✅ Python object convenience: insert/fetch dicts, lists, arrays directly (no manual IO)
  • Returns: Python objects

<attach@> specific:

  • ✅ Filename preserved: original filename stored with content
  • Returns: Local file path (extracts to download directory)

Best for <blob@>:

  • Large Python objects (NumPy arrays, dicts)
  • Processed results (nested structures)
  • Any Python data with duplicates

Best for <attach@>:

  • PDF/document files
  • Images, videos
  • Files where original filename/format matters

Key difference:

  • <blob@>: Python objects in, Python objects out (no file handling)
  • <attach@>: Files in, file paths out (preserves filename)
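The deduplication property follows directly from the `_hash/` layout: identical content hashes to the same path, so it is written only once. A minimal plain-Python sketch of content addressing (illustrative only; the hash function and helper are assumptions, not DataJoint's actual implementation):

```python
import hashlib
import os
import tempfile

def put_hashed(store_root, schema, payload: bytes) -> str:
    """Store payload under {store}/_hash/{schema}/{digest}.
    Identical payloads map to the same path: stored only once."""
    digest = hashlib.sha256(payload).hexdigest()
    folder = os.path.join(store_root, "_hash", schema)
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, digest)
    if not os.path.exists(path):      # dedup: skip the rewrite
        with open(path, "wb") as f:
            f.write(payload)
    return path

root = tempfile.mkdtemp()
p1 = put_hashed(root, "lab_analysis", b"same bytes")
p2 = put_hashed(root, "lab_analysis", b"same bytes")
print(p1 == p2)   # True: both inserts share one stored object
```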


Schema-Addressed: <npy@> or <object@>

Storage: Object store at {store}/_schema/{schema}/{table}/{key}/{field}.{token}.ext

Syntax:

array : <npy@>              # NumPy arrays
dataset : <object@>         # Zarr, HDF5, custom

Characteristics:

  • ✅ Streaming access (no full download)
  • ✅ Partial reads (fetch chunks)
  • ✅ Browsable paths (organized by key)
  • ✅ Accessible by external tools (not just DataJoint)
  • ✅ Very large files (100 MB - TB+)
  • ✅ Multi-file datasets (e.g., Zarr directory structures)
  • ❌ No deduplication
  • ❌ One file per field per row

Key advantages:

  • Schema-addressed storage is browsable: it can be navigated and accessed by external tools (Zarr viewers, HDF5 utilities, direct filesystem access), not just through DataJoint
  • <npy@> provides array convenience: insert and fetch array-like objects directly (no manual .npy file handling)
  • <object@> provides flexibility: you manage the format (Zarr, HDF5, custom); DataJoint provides storage and references

Best for:

  • <npy@>: NumPy arrays with lazy loading (no manual IO)
  • <object@>: Zarr arrays, HDF5 datasets, custom formats (you manage format)
  • Large video files
  • Multi-file experimental outputs
  • Data that needs to be accessed by non-DataJoint tools

Difference between <npy@> and <object@>:

  • <npy@>: Insert/fetch array-like objects (like <blob> but lazy); no manual .npy handling
  • <object@>: You manage format and IO (Zarr, HDF5, custom); more flexible but requires format knowledge
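The browsable paths come from the `{store}/_schema/{schema}/{table}/{key}/{field}.{token}.ext` layout given above. A sketch of how such a path might be assembled (the `attribute=value` rendering of the key and the token value are assumptions for illustration; the exact format is DataJoint's):

```python
def schema_path(store, schema, table, key: dict, field, token, ext):
    """Build a browsable schema-addressed path following
    {store}/_schema/{schema}/{table}/{key}/{field}.{token}{ext}."""
    # Render the primary key as path segments (assumed convention).
    key_part = "/".join(f"{k}={v}" for k, v in key.items())
    return f"{store}/_schema/{schema}/{table}/{key_part}/{field}.{token}{ext}"

p = schema_path("/data/main", "lab", "recording",
                {"recording_id": "abc123"}, "traces", "x9f2", ".npy")
print(p)
# /data/main/_schema/lab/recording/recording_id=abc123/traces.x9f2.npy
```

Because the path is derived from the schema, table, and key rather than a content hash, external tools can locate a row's data by browsing the directory tree.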


Filepath References: <filepath@>

Storage: User-managed paths in object store

Syntax:

raw_data : <filepath@>      # User-managed file

Characteristics:

  • ✅ Reference existing files
  • ✅ User controls paths
  • ✅ External system compatibility
  • ✅ Custom organization
  • ❌ No lifecycle management
  • ❌ No garbage collection
  • ❌ No transaction safety
  • ❌ No deduplication
  • ❌ Must avoid _hash/ and _schema/ prefixes

Best for:

  • Large instrument data directories
  • Externally managed archives
  • Legacy data integration
  • Custom file organization requirements
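The reserved-prefix rule can be checked before insert. A minimal sketch (`validate_filepath` is a hypothetical helper for illustration, not a DataJoint API; DataJoint performs its own validation):

```python
# Sections reserved for DataJoint-managed storage.
RESERVED = ("_hash/", "_schema/")

def validate_filepath(path: str) -> str:
    """Reject user paths that collide with reserved store sections."""
    if path.lstrip("/").startswith(RESERVED):
        raise ValueError(f"Path conflicts with reserved section: {path}")
    return path

validate_filepath("raw/mydata.bin")        # OK
try:
    validate_filepath("_hash/mydata.bin")  # raises ValueError
except ValueError as e:
    print(e)
```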

Common Scenarios

Scenario 1: Image Processing Pipeline

@schema
class RawImage(dj.Manual):
    """Imported from microscope"""
    definition = """
    image_id : uuid
    ---
    raw_file : <filepath@acquisition>    # Reference microscope output
    """

@schema
class CalibratedImage(dj.Computed):
    """Calibrated, moderate size"""
    definition = """
    -> RawImage
    ---
    calibrated : <blob@>                 # 5 MB processed image
    """

@schema
class Thumbnail(dj.Computed):
    """Preview for dashboard"""
    definition = """
    -> CalibratedImage
    ---
    preview : <blob>                     # 100 KB thumbnail, in-table
    """

Rationale:

  • <filepath@>: Reference existing microscope files (large, externally managed)
  • <blob@>: Processed images (moderate size, deduplicated if reprocessed)
  • <blob>: Thumbnails (tiny, fast access for UI)


Scenario 2: Electrophysiology Recording

@schema
class RecordingSession(dj.Manual):
    """Recording metadata"""
    definition = """
    session_id : uuid
    ---
    config : <blob>                      # 50 KB parameters, in-table
    """

@schema
class ContinuousData(dj.Imported):
    """Raw voltage traces"""
    definition = """
    -> RecordingSession
    ---
    raw_voltage : <object@raw>           # 10 GB Zarr array, streaming
    """

@schema
class SpikeWaveforms(dj.Computed):
    """Extracted spike shapes"""
    definition = """
    -> ContinuousData
    unit_id : int64
    ---
    waveforms : <npy@>                   # 20 MB array, lazy load
    """

@schema
class UnitStats(dj.Computed):
    """Summary statistics"""
    definition = """
    -> SpikeWaveforms
    ---
    stats : <blob>                       # 10 KB stats dict, in-table
    """

Rationale:

  • <blob>: Config and stats (small metadata, fast access)
  • <object@>: Raw voltage (huge, stream for spike detection)
  • <npy@>: Waveforms (moderate arrays, load for clustering)


Scenario 3: Calcium Imaging Analysis

@schema
class Movie(dj.Manual):
    """Raw calcium imaging movie"""
    definition = """
    movie_id : uuid
    ---
    frames : <object@movies>             # 2 GB TIFF stack, streaming
    """

@schema
class SegmentedCells(dj.Computed):
    """Cell masks"""
    definition = """
    -> Movie
    ---
    masks : <npy@>                       # 50 MB mask array, lazy load
    """

@schema
class FluorescenceTraces(dj.Computed):
    """Extracted time series"""
    definition = """
    -> SegmentedCells
    cell_id : int64
    ---
    trace : <blob@>                      # 500 KB per cell, deduplicated
    """

@schema
class TraceSummary(dj.Computed):
    """Event detection results"""
    definition = """
    -> FluorescenceTraces
    ---
    events : <blob>                      # 5 KB event times, in-table
    """

Rationale:

  • <object@>: Movies (huge, stream for segmentation)
  • <npy@>: Masks (moderate, load for trace extraction)
  • <blob@>: Traces (per-cell, many rows, deduplication helps)
  • <blob>: Event summaries (tiny, fast query results)

Configuration Examples

Single Store (Development)

{
  "stores": {
    "default": "main",
    "main": {
      "protocol": "file",
      "location": "/data/my-project"
    }
  }
}

All @ codecs use this store:

  • <blob@> → /data/my-project/_hash/{schema}/{hash}
  • <npy@> → /data/my-project/_schema/{schema}/{table}/{key}/


Multiple Stores (Production)

{
  "stores": {
    "default": "main",
    "filepath_default": "acquisition",
    "main": {
      "protocol": "file",
      "location": "/data/processed"
    },
    "acquisition": {
      "protocol": "file",
      "location": "/mnt/microscope"
    },
    "archive": {
      "protocol": "s3",
      "bucket": "long-term-storage",
      "location": "lab-data/archive"
    }
  }
}

Usage in table definitions:

raw : <filepath@>               # Uses filepath_default (acquisition)
processed : <blob@>             # Uses default (main)
backup : <blob@archive>         # Uses named store (archive)
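The resolution rule above (explicit store name wins; `<filepath@>` falls back to `filepath_default`; everything else falls back to `default`) can be sketched as a small helper. Illustrative only; `resolve_store` is not a DataJoint API:

```python
def resolve_store(stores: dict, codec: str, store_name=None) -> str:
    """Pick which configured store a codec uses."""
    if store_name:                     # explicit: <blob@archive>
        return store_name
    if codec == "filepath" and "filepath_default" in stores:
        return stores["filepath_default"]   # <filepath@>
    return stores["default"]                # <blob@>, <npy@>, ...

stores = {"default": "main", "filepath_default": "acquisition"}
print(resolve_store(stores, "filepath"))         # acquisition
print(resolve_store(stores, "blob"))             # main
print(resolve_store(stores, "blob", "archive"))  # archive
```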

Performance Considerations

Read Performance

| Codec | Random Access | Streaming | Latency |
|---|---|---|---|
| <blob> | ⚡ Excellent | N/A | < 1 ms |
| <blob@> | ✅ Good | ❌ No | ~100 ms |
| <npy@> | ✅ Good (lazy) | ✅ Yes | ~100 ms + chunk time |
| <object@> | ✅ Excellent | ✅ Yes | ~100 ms + chunk time |
| <filepath@> | ✅ Good | ✅ Yes | ~100 ms + network |

Write Performance

| Codec | Insert Speed | Transaction Safe | Deduplication |
|---|---|---|---|
| <blob> | ⚡ Fastest | ✅ Yes | ❌ No |
| <blob@> | ✅ Fast | ✅ Yes | ✅ Yes |
| <npy@> | ✅ Fast | ✅ Yes | ❌ No |
| <object@> | ✅ Fast | ✅ Yes | ❌ No |
| <filepath@> | ⚡ Fastest | ⚠️ Path only | ❌ No |

Storage Efficiency

| Codec | Deduplication | Compression | Overhead |
|---|---|---|---|
| <blob> | ❌ No | ✅ gzip (automatic) | Low |
| <blob@> | ✅ Yes | ✅ gzip (automatic) | Medium |
| <npy@> | ❌ No | ⚠️ Format-specific | Low |
| <object@> | ❌ No | ⚠️ Format-specific | Low |
| <filepath@> | ❌ No | User-managed | Minimal |

Migration Between Storage Types

In-Table โ†’ Object Store

# Add new column with object storage
@schema
class MyTable(dj.Manual):
    definition = """
    id : int
    ---
    data_old : <blob>          # Legacy in-table
    data_new : <blob@>         # New object storage
    """

# Migrate data
for key in MyTable.fetch('KEY'):
    old_data = (MyTable & key).fetch1('data_old')
    (MyTable & key).update1({**key, 'data_new': old_data})

# After verification, drop old column via alter()

Hash-Addressed โ†’ Schema-Addressed

# For large files that need streaming
@schema
class Recording(dj.Manual):
    definition = """
    recording_id : uuid
    ---
    data_blob : <blob@>        # Old: full download
    data_stream : <object@>    # New: streaming access
    """

# Convert and store as Zarr
import zarr
for key in Recording.fetch('KEY'):
    data = (Recording & key).fetch1('data_blob')

    # Create Zarr array
    ref = (Recording & key).create_object_ref('data_stream', '.zarr')
    z = zarr.open(ref.fsmap, mode='w', shape=data.shape, dtype=data.dtype)
    z[:] = data

    # Update row
    (Recording & key).update1({**key, 'data_stream': ref})

Troubleshooting

"DataJointError: Store not configured"

Problem: Using @ without store configuration

Solution:

{
  "stores": {
    "default": "main",
    "main": {
      "protocol": "file",
      "location": "/data/storage"
    }
  }
}

"ValueError: Path conflicts with reserved section"

Problem: <filepath@> path uses _hash/ or _schema/

Solution: Use different path:

# Bad
table.insert1({'id': 1, 'file': '_hash/mydata.bin'})  # Error!

# Good
table.insert1({'id': 1, 'file': 'raw/mydata.bin'})    # OK

Data not deduplicated

Problem: Using <npy@> or <object@> expecting deduplication

Solution: Use <blob@> for deduplication:

# No deduplication
data : <npy@>

# With deduplication
data : <blob@>

Out of memory loading large array

Problem: Using <blob@> for huge files

Solution: Use <object@> or <npy@> for streaming:

# Bad: loads 10 GB into memory
large_data : <blob@>

# Good: streaming access
large_data : <object@>

See Also