Object Storage Overview

Navigate DataJoint's object storage documentation to find what you need.

Quick Navigation by Task

I want to...

Task | Guide | Est. Time
✅ Decide which storage type to use | Choose a Storage Type | 5-10 min
✅ Set up S3, MinIO, or file storage | Configure Object Storage | 10-15 min
✅ Store and retrieve large data | Use Object Storage | 15-20 min
✅ Work with NumPy arrays efficiently | Use NPY Codec | 10 min
✅ Create domain-specific types | Create Custom Codec | 30-45 min
✅ Optimize storage performance | Manage Large Data | 20 min
✅ Clean up unused data | Garbage Collection | 10 min

Conceptual Understanding

Why does DataJoint have object storage?

Traditional databases excel at structured, relational data but struggle with large arrays, files, and streaming data. DataJoint's Object-Augmented Schema (OAS) unifies relational tables with object storage into a single coherent system:

  • Relational database: Metadata, keys, relationships (structured data < 1 MB)
  • Object storage: Arrays, files, datasets (large data > 1 MB)
  • Full referential integrity: Maintained across both layers

Read: Object-Augmented Schemas for a conceptual overview.

Three Storage Modes

In-Table Storage (<blob>)

What: Data stored directly in a database column
When: Small objects < 1 MB (JSON, thumbnails, small arrays)
Why: Fast access, transactional consistency, no store setup

metadata : <blob>         # Stored in MySQL

Guide: Use Object Storage


Object Store (Integrated)

What: DataJoint-managed storage in S3, file systems, or cloud storage
When: Large data (arrays, files, datasets) needing lifecycle management
Why: Deduplication, garbage collection, transaction safety, referential integrity

Two addressing schemes:

Hash-Addressed (<blob@>, <attach@>)

  • Content-based paths (MD5 hash)
  • Automatic deduplication
  • Best for: Write-once data, attachments
waveform : <blob@>        # Hash: _hash/{schema}/{hash}
document : <attach@>      # Hash: _hash/{schema}/{hash}
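The deduplication above follows directly from content-based addressing: identical payloads hash to the same path, so the same bytes are never stored twice. A minimal sketch of the idea, assuming MD5 over the serialized payload and the `_hash/{schema}/{hash}` template shown above (the helper name and schema name are illustrative, not DataJoint API):

```python
import hashlib

def hash_address(schema: str, payload: bytes) -> str:
    # Content-based path: the MD5 digest of the serialized payload
    # determines where the object lives (per the template above).
    digest = hashlib.md5(payload).hexdigest()
    return f"_hash/{schema}/{digest}"

# Identical content yields the identical path, so a second insert of the
# same bytes can reuse the stored object instead of writing a duplicate.
a = hash_address("my_schema", b"spike waveform bytes")
b = hash_address("my_schema", b"spike waveform bytes")
```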

Schema-Addressed (<npy@>, <object@>)

  • Key-based paths (browsable)
  • Streaming access, partial reads
  • Best for: Zarr, HDF5, large arrays
traces : <npy@>           # Schema: _schema/{schema}/{table}/{key}/
volume : <object@>        # Schema: _schema/{schema}/{table}/{key}/

Guides:

  • Choose a Storage Type — Decision criteria
  • Use Object Storage — How to use codecs
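The partial-read behavior that schema-addressed codecs such as <npy@> provide can be illustrated with plain NumPy (this is not the DataJoint API, only the underlying mechanism: open lazily, materialize only the slice you need):

```python
import os
import tempfile
import numpy as np

# Write a .npy file, then read back only a small region of it.
arr = np.arange(1_000_000, dtype=np.float32).reshape(1000, 1000)
path = os.path.join(tempfile.mkdtemp(), "traces.npy")
np.save(path, arr)

lazy = np.load(path, mmap_mode="r")   # lazy: no bulk read yet
chunk = np.asarray(lazy[10:20, :5])   # only this region is materialized
```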


Filepath References (<filepath@>)

What: User-managed file paths (DataJoint stores only the path string)
When: Existing data archives, externally managed files
Why: User retains full control of paths; DataJoint performs no lifecycle management and no deduplication

raw_data : <filepath@>    # User-managed path

Guide: Use Object Storage
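Taken together, the three modes can sit side by side in a single table definition. A hypothetical sketch (attribute names are illustrative, and the <...@> codecs assume a store has already been configured):

```python
# Hypothetical table definition combining the storage modes above.
definition = """
session_id : int
---
metadata   : <blob>       # small, stored in-table
waveform   : <blob@>      # hash-addressed object store
traces     : <npy@>       # schema-addressed, lazy NumPy access
raw_data   : <filepath@>  # user-managed path reference
"""
```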

Documentation by Level

Getting Started

  1. Choose a Storage Type — Start here

    • Quick decision tree (5 minutes)
    • Size guidelines (< 1 MB, 1-100 MB, > 100 MB)
    • Access pattern considerations
    • Lifecycle management options
  2. Configure Object Storage — Setup

    • File system, S3, MinIO configuration
    • Single vs multiple stores
    • Credentials management
    • Store verification
  3. Use Object Storage — Basic usage

    • Insert/fetch patterns
    • In-table vs object store
    • Addressing schemes (hash vs schema)
    • ObjectRef for lazy access

Intermediate

  1. Use NPY Codec — NumPy arrays

    • Lazy loading (doesn't load until accessed)
    • Efficient slicing (fetch subsets)
    • Shape/dtype metadata
    • When to use <npy@> vs <blob@>
  2. Manage Large Data — Optimization

    • Storage tiers (hot/warm/cold)
    • Compression strategies
    • Batch operations
    • Performance tuning
  3. Garbage Collection — Cleanup

    • Automatic cleanup for integrated storage
    • Manual cleanup for filepath references
    • Orphan detection
    • Recovery procedures

Advanced

  1. Create Custom Codec — Extensibility
    • Domain-specific types
    • Codec API (encode/decode)
    • HashCodec vs SchemaCodec patterns
    • Integration with existing formats

Technical Reference

For implementation details and specifications:

Specifications

Explanations

Common Workflows

Workflow 1: Adding Object Storage to Existing Pipeline

  1. Configure Object Storage — Set up store
  2. Choose a Storage Type — Select codec
  3. Update table definitions with @ modifier
  4. Use Object Storage — Insert/fetch patterns

Estimate: 30-60 minutes
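Step 1 of this workflow amounts to registering a store in your DataJoint configuration. A plausible sketch, assuming an S3/MinIO-style backend; the exact keys and store name here are assumptions that depend on your DataJoint version, so treat the Configure Object Storage guide as authoritative:

```python
# Hypothetical store configuration for an S3/MinIO-style backend.
# Key names ("protocol", "endpoint", "bucket", "location") are assumed;
# consult the Configure Object Storage guide for the exact layout.
stores = {
    "main": {
        "protocol": "s3",
        "endpoint": "minio.example.org:9000",  # hypothetical endpoint
        "bucket": "dj-objects",                # hypothetical bucket
        "location": "pipeline-data",           # key prefix inside bucket
    }
}
# e.g. dj.config["stores"] = stores  (after `import datajoint as dj`)
```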


Workflow 2: Migrating from In-Table to Object Store

  1. Choose a Storage Type — Determine new codec
  2. Add new column with object storage codec
  3. Migrate data (see Use Object Storage)
  4. Verify data integrity
  5. Drop old column (see Alter Tables)

Estimate: 1-2 hours for small datasets


Workflow 3: Working with Very Large Arrays (> 1 GB)

  1. Use <object@> or <npy@> (not <blob@>)
  2. Configure Object Storage — Ensure adequate storage
  3. For Zarr: Store as <object@> with .zarr extension
  4. For streaming: Use ObjectRef.fsmap (see Use Object Storage)

Key advantage: No need to download full dataset into memory


Workflow 4: Building Custom Domain Types

  1. Read Custom Codecs — Understand patterns
  2. Create Custom Codec — Implementation guide
  3. Codec API Spec — Technical reference
  4. Test with small dataset
  5. Deploy to production

Estimate: 2-4 hours for simple codecs

Decision Trees

"Which storage mode?"

Is data < 1 MB per row?
├─ YES → <blob> (in-table)
└─ NO  → Continue...

Is data managed externally?
├─ YES → <filepath@> (user-managed reference)
└─ NO  → Continue...

Need streaming or partial reads?
├─ YES → <object@> or <npy@> (schema-addressed)
└─ NO  → <blob@> (hash-addressed, full download)

Full guide: Choose a Storage Type
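For readers who prefer code, the tree above can be restated as a small helper. This is illustrative only; the real decision also weighs the criteria in the full guide:

```python
def choose_storage_mode(size_mb: float, externally_managed: bool,
                        needs_partial_reads: bool) -> str:
    """The storage-mode decision tree above, as a function (illustrative)."""
    if size_mb < 1:
        return "<blob>"               # in-table
    if externally_managed:
        return "<filepath@>"          # user-managed reference
    if needs_partial_reads:
        return "<object@> or <npy@>"  # schema-addressed
    return "<blob@>"                  # hash-addressed, full download
```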


"Which codec for object storage?"

NumPy arrays that benefit from lazy loading?
├─ YES → <npy@>
└─ NO  → Continue...

Large files (> 100 MB) needing streaming?
├─ YES → <object@>
└─ NO  → Continue...

Write-once data with potential duplicates?
├─ YES → <blob@> (deduplication via hashing)
└─ NO  → <blob@> or <object@> (choose based on access pattern)

Full guide: Choose a Storage Type
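The codec tree above can likewise be sketched as a function (illustrative only; the question order matters, since earlier answers take precedence):

```python
def choose_codec(numpy_lazy: bool, large_streaming: bool,
                 write_once_dedup: bool) -> str:
    """The codec decision tree above, as a function (illustrative)."""
    if numpy_lazy:
        return "<npy@>"                # lazy NumPy access
    if large_streaming:
        return "<object@>"             # streaming large files
    if write_once_dedup:
        return "<blob@>"               # deduplication via hashing
    return "<blob@> or <object@>"      # choose by access pattern
```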

Troubleshooting

Common Issues

Problem | Likely Cause | Solution Guide
"Store not configured" error | Missing stores configuration | Configure Object Storage
Out of memory loading array | Using <blob@> for huge data | Choose a Storage Type → use <object@>
Slow fetches | Wrong codec choice | Manage Large Data
Data not deduplicated | Using wrong codec | Choose a Storage Type
Path conflicts with reserved prefixes | <filepath@> using _hash/ or _schema/ | Use Object Storage
Missing files after delete | Expected behavior for integrated storage | Garbage Collection

Getting Help

See Also