Use the <npy> Codec¶
Store NumPy arrays with lazy loading and metadata access.
Overview¶
The <npy@> codec stores NumPy arrays as portable .npy files in object storage. On fetch, you get an NpyRef that provides metadata without downloading the data.
Key benefits:
- Access shape, dtype, size without I/O
- Lazy loading - download only when needed
- Memory mapping - random access to large arrays
- Safe bulk fetch - inspect before downloading
- Portable .npy format
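
Because the payload is a standard .npy file, anything that understands NumPy's format can read it back. A minimal round-trip sketch with plain NumPy (no DataJoint involved):

```python
import io

import numpy as np

# Serialize an array to the portable .npy format in memory
buf = io.BytesIO()
arr = np.arange(12, dtype=np.float64).reshape(3, 4)
np.save(buf, arr)

# Any .npy-aware reader can restore it, shape and dtype intact
buf.seek(0)
restored = np.load(buf)
assert np.array_equal(arr, restored)
assert restored.dtype == np.float64
```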
Quick Start¶
1. Configure a Store¶
```python
import datajoint as dj

# Add store configuration
dj.config.object_storage.stores['mystore'] = {
    'protocol': 's3',
    'endpoint': 'localhost:9000',
    'bucket': 'my-bucket',
    'access_key': 'access_key',
    'secret_key': 'secret_key',
    'location': 'data',
}
```
Or in datajoint.json:
```json
{
    "object_storage": {
        "stores": {
            "mystore": {
                "protocol": "s3",
                "endpoint": "s3.amazonaws.com",
                "bucket": "my-bucket",
                "location": "data"
            }
        }
    }
}
```
2. Define Table with <npy@>¶
```python
@schema
class Recording(dj.Manual):
    definition = """
    recording_id : int32
    ---
    waveform : <npy@mystore>
    """
```
3. Insert Arrays¶
```python
import numpy as np

Recording.insert1({
    'recording_id': 1,
    'waveform': np.random.randn(1000, 32),
})
```
4. Fetch with Lazy Loading¶
```python
# Returns an NpyRef, not an array
ref = (Recording & 'recording_id=1').fetch1('waveform')

# Metadata without download
print(ref.shape)  # (1000, 32)
print(ref.dtype)  # float64

# Load when ready
arr = ref.load()
```
NpyRef Reference¶
Metadata Properties (No I/O)¶
```python
ref.shape      # Tuple of dimensions
ref.dtype      # NumPy dtype
ref.ndim       # Number of dimensions
ref.size       # Total number of elements
ref.nbytes     # Total bytes
ref.path       # Storage path
ref.store      # Store name
ref.is_loaded  # Whether the data is cached in memory
```
Loading Methods¶
```python
# Explicit load (recommended)
arr = ref.load()

# Via NumPy functions (auto-loads)
mean = np.mean(ref)
std = np.std(ref, axis=0)

# Via conversion (auto-loads)
arr = np.asarray(ref)

# Indexing (loads, then indexes)
first_row = ref[0]
snippet = ref[100:200, :]
```
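
The auto-loading behavior above is possible because NumPy converts unfamiliar objects through the `__array__` protocol. A minimal sketch of the idea, using a hypothetical `LazyArray` class (not DataJoint's actual NpyRef implementation):

```python
import numpy as np

class LazyArray:
    """Toy lazy wrapper: materializes its array only on first use."""

    def __init__(self, loader):
        self._loader = loader   # callable that produces the array
        self._cache = None

    @property
    def is_loaded(self):
        return self._cache is not None

    def load(self):
        if self._cache is None:
            self._cache = self._loader()
        return self._cache

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this when the object is passed to np.mean, np.asarray, etc.
        arr = self.load()
        return arr.astype(dtype) if dtype is not None else arr

ref = LazyArray(lambda: np.arange(6.0).reshape(2, 3))
assert not ref.is_loaded
m = np.mean(ref)   # triggers __array__, which loads the data
assert ref.is_loaded
```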
Memory Mapping¶
For large arrays, use mmap_mode to access data without loading it all into memory:
```python
# Memory-mapped loading (random access)
arr = ref.load(mmap_mode='r')

# Only reads the portion you access
chunk = arr[1000:2000, :]  # Efficient for large arrays
```
Modes:
- 'r' - Read-only (recommended)
- 'r+' - Read-write
- 'c' - Copy-on-write (changes not saved)
Performance:
- Local filesystem stores: memory-maps the file directly (no copy)
- Remote stores (S3): downloads to the local cache first, then memory-maps
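
The access pattern rests on NumPy's own mmap_mode support in np.load; the cache-then-mmap behavior for remote stores is DataJoint's, but the local case can be sketched directly:

```python
import os
import tempfile

import numpy as np

# Write a .npy file to disk
path = os.path.join(tempfile.mkdtemp(), 'big.npy')
np.save(path, np.arange(1_000_000, dtype=np.float64).reshape(1000, 1000))

# Open it memory-mapped: no bulk read happens here
arr = np.load(path, mmap_mode='r')
assert isinstance(arr, np.memmap)

# Only the touched pages are read from disk
chunk = arr[100:110, :]
assert chunk.shape == (10, 1000)
```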
Common Patterns¶
Bulk Fetch with Filtering¶
```python
# Fetch all rows - returns NpyRefs, not arrays
results = MyTable.to_dicts()

# Filter by metadata (no downloads)
large = [r for r in results if r['data'].shape[0] > 1000]

# Load only what you need
for rec in large:
    arr = rec['data'].load()
    process(arr)
```
Computed Tables¶
```python
@schema
class ProcessedData(dj.Computed):
    definition = """
    -> RawData
    ---
    result : <npy@mystore>
    """

    def make(self, key):
        # Fetch a lazy reference
        ref = (RawData & key).fetch1('raw')
        # NumPy functions auto-load
        result = np.fft.fft(ref, axis=1)
        self.insert1({**key, 'result': result})
```
Memory-Efficient Processing¶
```python
# Process recordings one at a time
for key in Recording.keys():
    ref = (Recording & key).fetch1('data')
    # Check size before loading
    if ref.nbytes > 1e9:  # > 1 GB
        print(f"Skipping large recording: {ref.nbytes/1e9:.1f} GB")
        continue
    process(ref.load())
```
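
The nbytes check is plain arithmetic: total elements times bytes per element, all derived from metadata. For instance, with the shape and dtype used in the Quick Start:

```python
import numpy as np

arr = np.zeros((1000, 32), dtype=np.float64)
# nbytes = size * itemsize = (1000 * 32) * 8 bytes
assert arr.nbytes == arr.size * arr.itemsize == 256_000
# An NpyRef exposes the same numbers without downloading the data.
```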
Comparison with <blob@>¶
| Aspect | <npy@> | <blob@> |
|---|---|---|
| On fetch | NpyRef (lazy) | Array (eager) |
| Metadata access | Without download | Must download |
| Memory mapping | Yes, via mmap_mode | No |
| Addressing | Schema-addressed | Hash-addressed |
| Deduplication | No | Yes |
| Format | .npy (portable) | DJ blob (Python) |
| Best for | Large arrays, lazy loading | Small arrays, dedup |
When to Use Each¶
Use <npy@> when:
- Arrays are large (> 10 MB)
- You need to inspect shape/dtype before loading
- Fetching many rows but processing few
- Random access to slices of very large arrays (memory mapping)
- Interoperability matters (non-Python tools)
Use <blob@> when:
- Arrays are small (< 10 MB)
- Same arrays appear in multiple rows (deduplication)
- Storing non-array Python objects (dicts, lists)
Supported Array Types¶
The <npy> codec supports any NumPy array except object dtype:
```python
# Supported
np.array([1, 2, 3], dtype=np.int32)           # Integer
np.array([1.0, 2.0], dtype=np.float64)        # Float
np.array([True, False], dtype=np.bool_)       # Boolean
np.array([1+2j, 3+4j], dtype=np.complex128)   # Complex
np.zeros((10, 10, 10))                        # N-dimensional
np.array(42)                                  # 0-dimensional scalar

# Structured arrays
dt = np.dtype([('x', np.float64), ('y', np.float64)])
np.array([(1.0, 2.0), (3.0, 4.0)], dtype=dt)

# NOT supported
np.array([{}, []], dtype=object)              # Object dtype
```
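
Structured arrays survive the .npy round trip with field names and dtypes intact; a quick sketch with plain NumPy:

```python
import io

import numpy as np

dt = np.dtype([('x', np.float64), ('y', np.float64)])
points = np.array([(1.0, 2.0), (3.0, 4.0)], dtype=dt)

# Round-trip through the .npy format
buf = io.BytesIO()
np.save(buf, points)
buf.seek(0)
restored = np.load(buf)

# Field access and dtype are preserved
assert restored.dtype == dt
assert restored['x'][1] == 3.0
```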
Troubleshooting¶
"Store not configured"¶
Ensure your store is configured before using <npy@store>:
```python
dj.config.object_storage.stores['store'] = {...}
```
"requires @ (store only)"¶
The <npy> codec requires the @ modifier:
```
# Wrong
data : <npy>

# Correct
data : <npy@>
data : <npy@mystore>
```
Memory issues with large arrays¶
Use lazy loading or memory mapping to control memory:
```python
# Check size before loading
if ref.nbytes > available_memory:
    # Use memory mapping for random access
    arr = ref.load(mmap_mode='r')
    # Process in chunks
    for i in range(0, len(arr), chunk_size):
        process(arr[i:i+chunk_size])
else:
    arr = ref.load()
```
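
The chunked loop above can compute streaming statistics without ever holding the full array in memory; a runnable version over a local memory-mapped file (hypothetical sizes, plain NumPy):

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'data.npy')
np.save(path, np.arange(10_000, dtype=np.float64).reshape(100, 100))

arr = np.load(path, mmap_mode='r')  # nothing is read yet
chunk_size = 25

# Accumulate a mean chunk by chunk instead of loading everything
total, count = 0.0, 0
for i in range(0, len(arr), chunk_size):
    chunk = np.asarray(arr[i:i + chunk_size])  # reads only these rows
    total += chunk.sum()
    count += chunk.size

assert abs(total / count - np.mean(np.load(path))) < 1e-9
```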
See Also¶
- Use Object Storage - Complete storage guide
- Configure Object Storage - Store setup
- <npy> Codec Specification - Full spec