Object Storage Overview

Navigate DataJoint's object storage documentation to find what you need.

Quick Navigation by Task

I want to...

Task | Guide | Est. Time
✅ Decide which storage type to use | Choose a Storage Type | 5-10 min
✅ Set up S3, MinIO, or file storage | Configure Object Storage | 10-15 min
✅ Store and retrieve large data | Use Object Storage | 15-20 min
✅ Work with NumPy arrays efficiently | Use NPY Codec | 10 min
✅ Create domain-specific types | Create Custom Codec | 30-45 min
✅ Optimize storage performance | Manage Large Data | 20 min
✅ Clean up unused data | Garbage Collection | 10 min

Conceptual Understanding

Why does DataJoint have object storage?

Traditional databases excel at structured, relational data but struggle with large arrays, files, and streaming data. DataJoint's Object-Augmented Schema (OAS) unifies relational tables with object storage into a single coherent system:

  • Relational database: Metadata, keys, relationships (structured data < 1 MB)
  • Object storage: Arrays, files, datasets (large data > 1 MB)
  • Full referential integrity: Maintained across both layers

Read: Object-Augmented Schemas for a conceptual overview.

Three Storage Modes

In-Table Storage (<blob>)

What: Data stored directly in a database column
When: Small objects < 1 MB (JSON, thumbnails, small arrays)
Why: Fast access, transactional consistency, no store setup

metadata : <blob>         # Stored in MySQL

Guide: Use Object Storage


Object Store (Integrated)

What: DataJoint-managed storage in S3, file systems, or cloud storage
When: Large data (arrays, files, datasets) needing lifecycle management
Why: Deduplication, garbage collection, transaction safety, referential integrity

Two addressing schemes:

Hash-Addressed (<blob@>, <attach@>)

  • Content-based paths (MD5 hash)
  • Automatic deduplication
  • Best for: Write-once data, attachments
waveform : <blob@>        # Hash: _hash/{schema}/{hash}
document : <attach@>      # Hash: _hash/{schema}/{hash}
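The deduplication above follows directly from content-based addressing: identical payloads hash to the same path, so the same bytes are never stored twice. A minimal sketch of the idea, assuming MD5 over the serialized payload and the `_hash/{schema}/{hash}` template shown above (the helper name and schema name are illustrative, not DataJoint API):

```python
import hashlib

def hash_address(schema: str, payload: bytes) -> str:
    # Content-based path: the MD5 digest of the serialized payload
    # determines where the object lives (per the template above).
    digest = hashlib.md5(payload).hexdigest()
    return f"_hash/{schema}/{digest}"

# Identical content yields the identical path, so a second insert of the
# same bytes can reuse the stored object instead of writing a duplicate.
a = hash_address("my_schema", b"spike waveform bytes")
b = hash_address("my_schema", b"spike waveform bytes")
```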

Schema-Addressed (<npy@>, <object@>)

  • Key-based paths (browsable)
  • Streaming access, partial reads
  • Best for: Zarr, HDF5, large arrays
traces : <npy@>           # Schema: _schema/{schema}/{table}/{key}/
volume : <object@>        # Schema: _schema/{schema}/{table}/{key}/

Guides:

  • Choose a Storage Type — Decision criteria
  • Use Object Storage — How to use codecs
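The partial-read behavior that schema-addressed codecs such as <npy@> provide can be illustrated with plain NumPy (this is not the DataJoint API, only the underlying mechanism: open lazily, materialize only the slice you need):

```python
import os
import tempfile
import numpy as np

# Write a .npy file, then read back only a small region of it.
arr = np.arange(1_000_000, dtype=np.float32).reshape(1000, 1000)
path = os.path.join(tempfile.mkdtemp(), "traces.npy")
np.save(path, arr)

lazy = np.load(path, mmap_mode="r")   # lazy: no bulk read yet
chunk = np.asarray(lazy[10:20, :5])   # only this region is materialized
```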


Filepath References (<filepath@>)

What: User-managed file paths (DataJoint stores only the path string)
When: Existing data archives, externally managed files
Why: User retains full control of paths; DataJoint performs no lifecycle management and no deduplication

raw_data : <filepath@>    # User-managed path

Guide: Use Object Storage
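Taken together, the three modes can sit side by side in a single table definition. A hypothetical sketch (attribute names are illustrative, and the <...@> codecs assume a store has already been configured):

```python
# Hypothetical table definition combining the storage modes above.
definition = """
session_id : int
---
metadata   : <blob>       # small, stored in-table
waveform   : <blob@>      # hash-addressed object store
traces     : <npy@>       # schema-addressed, lazy NumPy access
raw_data   : <filepath@>  # user-managed path reference
"""
```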

Documentation by Level

Getting Started

  1. Choose a Storage Type — Start here

    • Quick decision tree (5 minutes)
    • Size guidelines (< 1 MB, 1-100 MB, > 100 MB)
    • Access pattern considerations
    • Lifecycle management options
  2. Configure Object Storage — Setup

    • File system, S3, MinIO configuration
    • Single vs multiple stores
    • Credentials management
    • Store verification
  3. Use Object Storage — Basic usage

    • Insert/fetch patterns
    • In-table vs object store
    • Addressing schemes (hash vs schema)
    • ObjectRef for lazy access

Intermediate

  1. Use NPY Codec — NumPy arrays

    • Lazy loading (doesn't load until accessed)
    • Efficient slicing (fetch subsets)
    • Shape/dtype metadata
    • When to use <npy@> vs <blob@>
  2. Manage Large Data — Optimization

    • Storage tiers (hot/warm/cold)
    • Compression strategies
    • Batch operations
    • Performance tuning
  3. Garbage Collection — Cleanup

    • Automatic cleanup for integrated storage
    • Manual cleanup for filepath references
    • Orphan detection
    • Recovery procedures

Advanced

  1. Create Custom Codec — Extensibility
    • Domain-specific types
    • Codec API (encode/decode)
    • HashCodec vs SchemaCodec patterns
    • Integration with existing formats

Technical Reference

For implementation details and specifications:

Specifications

Explanations

Common Workflows

Workflow 1: Adding Object Storage to Existing Pipeline

  1. Configure Object Storage — Set up store
  2. Choose a Storage Type — Select codec
  3. Update table definitions with @ modifier
  4. Use Object Storage — Insert/fetch patterns

Estimate: 30-60 minutes
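Step 1 of this workflow amounts to registering a store in your DataJoint configuration. A plausible sketch, assuming an S3/MinIO-style backend; the exact keys and store name here are assumptions that depend on your DataJoint version, so treat the Configure Object Storage guide as authoritative:

```python
# Hypothetical store configuration for an S3/MinIO-style backend.
# Key names ("protocol", "endpoint", "bucket", "location") are assumed;
# consult the Configure Object Storage guide for the exact layout.
stores = {
    "main": {
        "protocol": "s3",
        "endpoint": "minio.example.org:9000",  # hypothetical endpoint
        "bucket": "dj-objects",                # hypothetical bucket
        "location": "pipeline-data",           # key prefix inside bucket
    }
}
# e.g. dj.config["stores"] = stores  (after `import datajoint as dj`)
```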


Workflow 2: Migrating from In-Table to Object Store

  1. Choose a Storage Type — Determine new codec
  2. Add new column with object storage codec
  3. Migrate data (see Use Object Storage)
  4. Verify data integrity
  5. Drop old column (see Alter Tables)

Estimate: 1-2 hours for small datasets


Workflow 3: Working with Very Large Arrays (> 1 GB)

  1. Use <object@> or <npy@> (not <blob@>)
  2. Configure Object Storage — Ensure adequate storage
  3. For Zarr: Store as <object@> with .zarr extension
  4. For streaming: Use ObjectRef.fsmap (see Use Object Storage)

Key advantage: No need to download full dataset into memory


Workflow 4: Building Custom Domain Types

  1. Read Custom Codecs — Understand patterns
  2. Create Custom Codec — Implementation guide
  3. Codec API Spec — Technical reference
  4. Test with small dataset
  5. Deploy to production

Estimate: 2-4 hours for simple codecs

Decision Trees

"Which storage mode?"

Is data < 1 MB per row?
├─ YES → <blob> (in-table)
└─ NO  → Continue...

Is data managed externally?
├─ YES → <filepath@> (user-managed reference)
└─ NO  → Continue...

Need streaming or partial reads?
├─ YES → <object@> or <npy@> (schema-addressed)
└─ NO  → <blob@> (hash-addressed, full download)

Full guide: Choose a Storage Type
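For readers who prefer code, the tree above can be restated as a small helper. This is illustrative only; the real decision also weighs the criteria in the full guide:

```python
def choose_storage_mode(size_mb: float, externally_managed: bool,
                        needs_partial_reads: bool) -> str:
    """The storage-mode decision tree above, as a function (illustrative)."""
    if size_mb < 1:
        return "<blob>"               # in-table
    if externally_managed:
        return "<filepath@>"          # user-managed reference
    if needs_partial_reads:
        return "<object@> or <npy@>"  # schema-addressed
    return "<blob@>"                  # hash-addressed, full download
```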


"Which codec for object storage?"

NumPy arrays that benefit from lazy loading?
├─ YES → <npy@>
└─ NO  → Continue...

Large files (> 100 MB) needing streaming?
├─ YES → <object@>
└─ NO  → Continue...

Write-once data with potential duplicates?
├─ YES → <blob@> (deduplication via hashing)
└─ NO  → <blob@> or <object@> (choose based on access pattern)

Full guide: Choose a Storage Type
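The codec tree above can likewise be sketched as a function (illustrative only; the question order matters, since earlier answers take precedence):

```python
def choose_codec(numpy_lazy: bool, large_streaming: bool,
                 write_once_dedup: bool) -> str:
    """The codec decision tree above, as a function (illustrative)."""
    if numpy_lazy:
        return "<npy@>"                # lazy NumPy access
    if large_streaming:
        return "<object@>"             # streaming large files
    if write_once_dedup:
        return "<blob@>"               # deduplication via hashing
    return "<blob@> or <object@>"      # choose by access pattern
```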

Troubleshooting

Common Issues

Problem | Likely Cause | Solution Guide
"Store not configured" error | Missing stores configuration | Configure Object Storage
Out of memory loading array | Using <blob@> for huge data | Choose a Storage Type → use <object@>
Slow fetches | Wrong codec choice | Manage Large Data
Data not deduplicated | Using wrong codec | Choose a Storage Type
Path conflicts with reserved prefixes | <filepath@> using _hash/ or _schema/ | Use Object Storage
Missing files after delete | Expected behavior for integrated storage | Garbage Collection

Getting Help

See Also