Object Storage Overview¶
Navigate DataJoint's object storage documentation to find what you need.
Quick Navigation by Task¶
I want to...
| Task | Guide | Est. Time |
|---|---|---|
| Decide which storage type to use | Choose a Storage Type | 5-10 min |
| Set up S3, MinIO, or file storage | Configure Object Storage | 10-15 min |
| Store and retrieve large data | Use Object Storage | 15-20 min |
| Work with NumPy arrays efficiently | Use NPY Codec | 10 min |
| Create domain-specific types | Create Custom Codec | 30-45 min |
| Optimize storage performance | Manage Large Data | 20 min |
| Clean up unused data | Garbage Collection | 10 min |
Conceptual Understanding¶
Why does DataJoint have object storage?
Traditional databases excel at structured, relational data but struggle with large arrays, files, and streaming data. DataJoint's Object-Augmented Schema (OAS) unifies relational tables with object storage into a single coherent system:
- Relational database: Metadata, keys, relationships (structured data < 1 MB)
- Object storage: Arrays, files, datasets (large data > 1 MB)
- Full referential integrity: Maintained across both layers
Read: Object-Augmented Schemas for a conceptual overview.
Three Storage Modes¶
In-Table Storage (<blob>)¶
- **What:** Data stored directly in database column
- **When:** Small objects < 1 MB (JSON, thumbnails, small arrays)
- **Why:** Fast access, transactional consistency, no store setup
```
metadata : <blob>    # Stored in MySQL
```
Guide: Use Object Storage
Object Store (Integrated)¶
- **What:** DataJoint-managed storage in S3, file systems, or cloud storage
- **When:** Large data (arrays, files, datasets) needing lifecycle management
- **Why:** Deduplication, garbage collection, transaction safety, referential integrity
Two addressing schemes:
Hash-Addressed (<blob@>, <attach@>)¶
- Content-based paths (MD5 hash)
- Automatic deduplication
- Best for: Write-once data, attachments
```
waveform : <blob@>    # Hash: _hash/{schema}/{hash}
document : <attach@>  # Hash: _hash/{schema}/{hash}
```
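Conceptually, content addressing works by deriving the storage path from a digest of the object's bytes, so identical payloads collapse to a single stored object. The sketch below illustrates that idea only; the `hash_path` helper is hypothetical, not DataJoint's implementation.

```python
import hashlib

def hash_path(schema: str, payload: bytes) -> str:
    """Hypothetical helper: derive a content-addressed path.

    Identical payloads always map to the same path, which is
    what makes automatic deduplication possible.
    """
    digest = hashlib.md5(payload).hexdigest()  # content hash
    return f"_hash/{schema}/{digest}"

# Two inserts of identical bytes resolve to the same stored object:
a = hash_path("ephys", b"\x00\x01\x02")
b = hash_path("ephys", b"\x00\x01\x02")
assert a == b  # deduplicated: same content, same path
```

The flip side of hashing is that the path says nothing about *which* row the object belongs to, which is why hash addressing suits write-once data rather than browsable layouts.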
Schema-Addressed (<npy@>, <object@>)¶
- Key-based paths (browsable)
- Streaming access, partial reads
- Best for: Zarr, HDF5, large arrays
```
traces : <npy@>       # Schema: _schema/{schema}/{table}/{key}/
volume : <object@>    # Schema: _schema/{schema}/{table}/{key}/
```
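In contrast to hash addressing, a schema-addressed path mirrors the table and primary key, so objects can be located in the store without consulting the database. The helper below is illustrative only; the exact key-serialization format is an assumption, not DataJoint's.

```python
def schema_path(schema: str, table: str, key: dict) -> str:
    """Hypothetical helper: derive a key-addressed (browsable) path.

    Sorting the key fields gives a deterministic path for any
    primary-key dict.
    """
    key_part = "/".join(f"{k}={v}" for k, v in sorted(key.items()))
    return f"_schema/{schema}/{table}/{key_part}/"

print(schema_path("ephys", "Recording", {"subject": 12, "session": 3}))
# e.g. _schema/ephys/Recording/session=3/subject=12/
```

Because the layout is predictable, external tools (Zarr readers, HDF5 viewers, `fsspec`-based clients) can stream or partially read the object in place.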
Guides:

- Choose a Storage Type → Decision criteria
- Use Object Storage → How to use codecs
Filepath References (<filepath@>)¶
- **What:** User-managed file paths (DataJoint stores the path string only)
- **When:** Existing data archives, externally managed files
- **Why:** No file lifecycle management, no deduplication, user controls paths
```
raw_data : <filepath@>    # User-managed path
```
Guide: Use Object Storage
Documentation by Level¶
Getting Started¶
- Choose a Storage Type → Start here
    - Quick decision tree (5 minutes)
    - Size guidelines (< 1 MB, 1-100 MB, > 100 MB)
    - Access pattern considerations
    - Lifecycle management options
- Configure Object Storage → Setup
    - File system, S3, MinIO configuration
    - Single vs multiple stores
    - Credentials management
    - Store verification
- Use Object Storage → Basic usage
    - Insert/fetch patterns
    - In-table vs object store
    - Addressing schemes (hash vs schema)
    - ObjectRef for lazy access
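The lazy-access pattern that `ObjectRef` embodies can be sketched in a few lines of plain Python: fetching a row returns a lightweight handle, and the payload is only read from the store on first access. The `LazyRef` class below illustrates the pattern only; it is not DataJoint's actual `ObjectRef`.

```python
class LazyRef:
    """Minimal sketch of a lazy reference (illustrative, not
    DataJoint's ObjectRef): the payload is fetched on demand."""

    def __init__(self, loader):
        self._loader = loader   # callable that reads from the store
        self._cached = None
        self.loaded = False

    def load(self):
        if not self.loaded:     # storage read happens only once
            self._cached = self._loader()
            self.loaded = True
        return self._cached

ref = LazyRef(lambda: b"large payload")
assert not ref.loaded           # nothing fetched yet
data = ref.load()               # storage read happens here
assert ref.loaded and data == b"large payload"
```

The benefit is that queries over many rows stay cheap: only the rows whose payloads you actually access trigger storage traffic.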
Intermediate¶
- Use NPY Codec → NumPy arrays
    - Lazy loading (doesn't load until accessed)
    - Efficient slicing (fetch subsets)
    - Shape/dtype metadata
    - When to use `<npy@>` vs `<blob@>`
- Manage Large Data → Optimization
    - Storage tiers (hot/warm/cold)
    - Compression strategies
    - Batch operations
    - Performance tuning
- Garbage Collection → Cleanup
    - Automatic cleanup for integrated storage
    - Manual cleanup for filepath references
    - Orphan detection
    - Recovery procedures
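The lazy loading and partial reads described for the NPY codec can be demonstrated with plain NumPy memory mapping, which is the general mechanism behind this style of access (this example uses NumPy alone, not DataJoint):

```python
import os
import tempfile
import numpy as np

# Save an array, then reopen it memory-mapped: data is paged in
# on demand, so slicing reads only the touched region rather
# than the whole file.
path = os.path.join(tempfile.mkdtemp(), "traces.npy")
np.save(path, np.arange(1_000_000, dtype=np.float32))

lazy = np.load(path, mmap_mode="r")   # header only; no full read
print(lazy.shape, lazy.dtype)          # (1000000,) float32
subset = np.asarray(lazy[10:20])       # reads just this slice
```

Shape and dtype are available immediately from the file header, which is the same property that lets `<npy@>` expose metadata without downloading the array.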
Advanced¶
- Create Custom Codec → Extensibility
    - Domain-specific types
    - Codec API (encode/decode)
    - HashCodec vs SchemaCodec patterns
    - Integration with existing formats
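At its core, any codec must provide a lossless encode/decode round trip between Python objects and bytes. The sketch below shows only that contract; the class name, `name` attribute, and method signatures are assumptions for illustration, and the real registration and storage integration are defined in the Codec API Spec.

```python
import json

class JsonCodec:
    """Hypothetical codec sketch: encode/decode must round-trip
    losslessly. (Illustrative only; not DataJoint's Codec API.)"""

    name = "myjson"

    def encode(self, obj) -> bytes:
        # sort_keys makes the encoding deterministic, which helps
        # if the bytes are later content-hashed for deduplication
        return json.dumps(obj, sort_keys=True).encode("utf-8")

    def decode(self, blob: bytes):
        return json.loads(blob.decode("utf-8"))

codec = JsonCodec()
payload = {"unit": 42, "labels": ["a", "b"]}
assert codec.decode(codec.encode(payload)) == payload  # round trip
```

Testing the round trip on representative data, as in the last line, is the quickest sanity check before wiring a codec into a pipeline.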
Technical Reference¶
For implementation details and specifications:
Specifications¶
- Type System Spec → Three-layer architecture
- Codec API Spec → Custom codec interface
- NPY Codec Spec → NumPy array storage
- Object Store Configuration Spec → Store config details
Explanations¶
- Type System → Conceptual overview
- Data Pipelines (OAS section) → Why OAS exists
- Custom Codecs → Design patterns
Common Workflows¶
Workflow 1: Adding Object Storage to Existing Pipeline¶
- Configure Object Storage → Set up store
- Choose a Storage Type → Select codec
- Update table definitions with the `@` modifier
- Use Object Storage → Insert/fetch patterns
Estimate: 30-60 minutes
Workflow 2: Migrating from In-Table to Object Store¶
- Choose a Storage Type → Determine new codec
- Add new column with object storage codec
- Migrate data (see Use Object Storage)
- Verify data integrity
- Drop old column (see Alter Tables)
Estimate: 1-2 hours for small datasets
Workflow 3: Working with Very Large Arrays (> 1 GB)¶
- Use `<object@>` or `<npy@>` (not `<blob@>`)
- Configure Object Storage → Ensure adequate storage
- For Zarr: Store as `<object@>` with `.zarr` extension
- For streaming: Use `ObjectRef.fsmap` (see Use Object Storage)
Key advantage: No need to download full dataset into memory
Workflow 4: Building Custom Domain Types¶
- Read Custom Codecs → Understand patterns
- Create Custom Codec → Implementation guide
- Codec API Spec → Technical reference
- Test with small dataset
- Deploy to production
Estimate: 2-4 hours for simple codecs
Decision Trees¶
"Which storage mode?"¶
```
Is data < 1 MB per row?
├─ YES → <blob> (in-table)
└─ NO  → Continue...

Is data managed externally?
├─ YES → <filepath@> (user-managed reference)
└─ NO  → Continue...

Need streaming or partial reads?
├─ YES → <object@> or <npy@> (schema-addressed)
└─ NO  → <blob@> (hash-addressed, full download)
```
Full guide: Choose a Storage Type
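For readers who prefer code, the decision tree above can be expressed as a small function; the function name and signature are illustrative, not a DataJoint API.

```python
def storage_mode(size_mb: float, external: bool, streaming: bool) -> str:
    """Encodes the 'Which storage mode?' decision tree above."""
    if size_mb < 1:
        return "<blob>"              # in-table
    if external:
        return "<filepath@>"         # user-managed reference
    if streaming:
        return "<object@>/<npy@>"    # schema-addressed
    return "<blob@>"                 # hash-addressed, full download

assert storage_mode(0.5, external=False, streaming=False) == "<blob>"
assert storage_mode(500, external=False, streaming=True) == "<object@>/<npy@>"
```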
"Which codec for object storage?"¶
```
NumPy arrays that benefit from lazy loading?
├─ YES → <npy@>
└─ NO  → Continue...

Large files (> 100 MB) needing streaming?
├─ YES → <object@>
└─ NO  → Continue...

Write-once data with potential duplicates?
├─ YES → <blob@> (deduplication via hashing)
└─ NO  → <blob@> or <object@> (choose based on access pattern)
```
Full guide: Choose a Storage Type
Troubleshooting¶
Common Issues¶
| Problem | Likely Cause | Solution Guide |
|---|---|---|
| "Store not configured" | Missing stores config | Configure Object Storage |
| Out of memory loading array | Using `<blob@>` for huge data | Choose a Storage Type → Use `<object@>` |
| Slow fetches | Wrong codec choice | Manage Large Data |
| Data not deduplicated | Using wrong codec | Choose a Storage Type |
| Path conflicts with reserved prefixes | `<filepath@>` using `_hash/` or `_schema/` | Use Object Storage |
| Missing files after delete | Expected behavior for integrated storage | Garbage Collection |
Getting Help¶
- Check FAQ for common questions
- Search GitHub Discussions
- Review specification for exact behavior
See Also¶
Related Concepts¶
- Type System → Three-layer type architecture
- Data Pipelines → Object-Augmented Schemas
Related How-Tos¶
- Manage Secrets → Credentials for S3/cloud storage
- Define Tables → Table definition syntax
- Insert Data → Data insertion patterns
Related Tutorials¶
- Object Storage Tutorial → Hands-on learning
- Custom Codecs Tutorial → Build your own codec