Hash Registry
Content hashing for external storage
Hash-addressed storage registry for DataJoint.
This module provides hash-addressed storage with deduplication for the <hash>
codec. Content is identified by a Base32-encoded MD5 hash and stored with
per-schema isolation::
_hash/{schema}/{hash}
With optional subfolding (configured per-store)::
_hash/{schema}/{fold1}/{fold2}/{hash}
Subfolding creates a directory hierarchy to improve performance on filesystems that struggle with large directories (ext3, FAT32, NFS). Modern filesystems (ext4, XFS, ZFS) and object stores such as S3 handle flat layouts efficiently.
Storage Model:
- Hash is used for content identification (deduplication, integrity verification)
- Path is always stored in metadata and used for all file operations
This design protects against configuration changes (e.g., subfolding) affecting existing data. The path stored at insert time is always used for retrieval.
Hash-addressed storage is used by <hash@>, <blob@>, and <attach@> types.
Deduplication occurs within each schema. Deletion requires garbage collection
via dj.gc.collect().
See Also
datajoint.gc : Garbage collection for orphaned storage items.
compute_hash
compute_hash(data)
Compute Base32-encoded MD5 hash of content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | bytes | Content bytes. | required |
Returns:
| Type | Description |
|---|---|
| str | Base32-encoded hash (26 lowercase characters, no padding). |
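The encoding described above can be illustrated with the standard library alone. This is a minimal sketch, not DataJoint's actual implementation: the name `sketch_compute_hash` is hypothetical, and it assumes the hash is the Base32 encoding of the raw 16-byte MD5 digest with padding stripped and letters lowercased, which yields the 26 characters stated in the return description.

```python
import base64
import hashlib


def sketch_compute_hash(data: bytes) -> str:
    """Hypothetical stand-in for compute_hash: Base32 MD5, lowercase, unpadded."""
    digest = hashlib.md5(data).digest()                  # 16 raw bytes
    encoded = base64.b32encode(digest).decode("ascii")   # 32 chars, 6 of them '=' padding
    return encoded.rstrip("=").lower()                   # 26 characters remain


content_hash = sketch_compute_hash(b"hello world")
print(len(content_hash))  # 26
```

Because the hash depends only on the bytes, identical content always maps to the same identifier, which is what makes deduplication possible.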
build_hash_path
build_hash_path(content_hash, schema_name, subfolding=None)
Build the storage path for hash-addressed storage.
Path structure without subfolding::
_hash/{schema}/{hash}
Path structure with subfolding (e.g., (2, 2))::
_hash/{schema}/{fold1}/{fold2}/{hash}
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| content_hash | str | Base32-encoded hash (26 characters). | required |
| schema_name | str | Database/schema name for isolation. | required |
| subfolding | tuple[int, ...] | Subfolding pattern from store config. None means flat (no subfolding). | None |
Returns:
| Type | Description |
|---|---|
| str | Relative path within the store. |
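The two path layouts can be sketched as follows. This is an illustration under a stated assumption, not the library's code: it assumes each entry in the subfolding tuple is a character width, and that successive folder names are cut from the leading characters of the hash. The function name `sketch_build_hash_path` is hypothetical.

```python
from typing import Optional, Tuple


def sketch_build_hash_path(
    content_hash: str,
    schema_name: str,
    subfolding: Optional[Tuple[int, ...]] = None,
) -> str:
    """Hypothetical stand-in for build_hash_path."""
    parts = ["_hash", schema_name]
    if subfolding:
        position = 0
        for width in subfolding:
            # Assumption: each fold is the next `width` characters of the hash.
            parts.append(content_hash[position:position + width])
            position += width
    parts.append(content_hash)  # the full hash is always the final path component
    return "/".join(parts)


print(sketch_build_hash_path("abcdefgh", "lab"))          # _hash/lab/abcdefgh
print(sketch_build_hash_path("abcdefgh", "lab", (2, 2)))  # _hash/lab/ab/cd/abcdefgh
```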
get_store_backend
get_store_backend(store_name=None)
Get a StorageBackend for hash-addressed storage.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| store_name | str | Name of the store to use. If None, uses stores.default. | None |
Returns:
| Type | Description |
|---|---|
| StorageBackend | StorageBackend instance. |
get_store_subfolding
get_store_subfolding(store_name=None)
Get the subfolding configuration for a store.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| store_name | str | Name of the store. If None, uses stores.default. | None |
Returns:
| Type | Description |
|---|---|
| tuple[int, ...] \| None | Subfolding pattern (e.g., (2, 2)) or None for flat storage. |
put_hash
put_hash(data, schema_name, store_name=None)
Store content using hash-addressed storage.
If the content already exists (same hash in the same schema), it is not re-uploaded. Returns metadata containing the hash, path, schema, store, and size.
The path is always stored in metadata and used for retrieval, protecting against configuration changes (e.g., subfolding) affecting existing data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | bytes | Content bytes to store. | required |
| schema_name | str | Database/schema name for path isolation. | required |
| store_name | str | Name of the store. If None, uses default store. | None |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Metadata dict with keys: hash, path, schema, store, size. |
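The deduplication behavior can be sketched with an in-memory dictionary standing in for the storage backend. Everything here is hypothetical (`fake_store`, `sketch_put_hash`, the store name `"main"`), and the sketch assumes a flat layout and the Base32 MD5 encoding described for compute_hash; it is meant only to show why a second insert of the same bytes in the same schema uploads nothing.

```python
import base64
import hashlib
from typing import Any, Dict

# Hypothetical in-memory stand-in for a storage backend: path -> bytes.
fake_store: Dict[str, bytes] = {}


def sketch_put_hash(data: bytes, schema_name: str, store_name: str = "main") -> Dict[str, Any]:
    """Hypothetical stand-in for put_hash with per-schema deduplication."""
    digest = hashlib.md5(data).digest()
    content_hash = base64.b32encode(digest).decode("ascii").rstrip("=").lower()
    path = f"_hash/{schema_name}/{content_hash}"  # flat layout, no subfolding
    if path not in fake_store:
        fake_store[path] = data  # upload only when the content is not already present
    return {
        "hash": content_hash,
        "path": path,  # recorded so later retrieval never has to rebuild it
        "schema": schema_name,
        "store": store_name,
        "size": len(data),
    }


first = sketch_put_hash(b"same bytes", "lab")
second = sketch_put_hash(b"same bytes", "lab")       # duplicate, nothing new stored
other = sketch_put_hash(b"same bytes", "analysis")   # different schema, stored separately
print(len(fake_store))  # 2
```

The same bytes stored under two schemas produce two objects, which reflects the per-schema isolation: deduplication is scoped to a schema, not to the whole store.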
get_hash
get_hash(metadata)
Retrieve content using stored metadata.
Uses the stored path directly (not derived from hash) to protect against configuration changes affecting existing data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| metadata | dict | Metadata dict with keys: path, hash, store (optional). | required |
Returns:
| Type | Description |
|---|---|
| bytes | Content bytes. |
Raises:
| Type | Description |
|---|---|
| MissingExternalFile | If content is not found at the stored path. |
| DataJointError | If hash verification fails (data corruption). |
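The retrieval logic described above (read from the stored path, then verify the hash) can be sketched as follows. This is an illustration only: the exception classes are hypothetical stand-ins for `MissingExternalFile` and `DataJointError`, and the extra `storage` argument is a plain dictionary replacing the real backend, which the actual `get_hash(metadata)` signature does not take.

```python
import base64
import hashlib


class SketchMissingFile(Exception):
    """Hypothetical stand-in for MissingExternalFile."""


class SketchIntegrityError(Exception):
    """Hypothetical stand-in for the DataJointError raised on hash mismatch."""


def _b32_md5(data: bytes) -> str:
    # Assumed encoding: unpadded, lowercase Base32 of the MD5 digest.
    return base64.b32encode(hashlib.md5(data).digest()).decode("ascii").rstrip("=").lower()


def sketch_get_hash(metadata: dict, storage: dict) -> bytes:
    """Hypothetical stand-in for get_hash, reading from an in-memory dict."""
    path = metadata["path"]  # use the stored path; never rebuild it from the hash
    if path not in storage:
        raise SketchMissingFile(path)
    data = storage[path]
    if _b32_md5(data) != metadata["hash"]:
        raise SketchIntegrityError(f"hash mismatch at {path}")
    return data


payload = b"example"
# The path below does not follow any current layout, yet retrieval still works
# because the lookup relies only on what was recorded at insert time.
meta = {"hash": _b32_md5(payload), "path": "_hash/lab/old/layout/item"}
storage = {meta["path"]: payload}
print(sketch_get_hash(meta, storage) == payload)  # True
```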
delete_path
delete_path(path, store_name=None)
Delete content at the specified path from storage.
This should only be called after verifying no references exist. Use garbage collection to safely remove unreferenced content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | Storage path (as stored in metadata). | required |
| store_name | str | Name of the store. If None, uses default store. | None |
Returns:
| Type | Description |
|---|---|
| bool | True if content was deleted, False if it didn't exist. |
Warnings
This permanently deletes content. Ensure no references exist first.
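The check-references-first rule can be sketched as below. This is a simplified illustration, not how DataJoint's garbage collector is implemented: both function names are hypothetical, the store is a plain dictionary, and the set of referenced paths is assumed to have been gathered from table metadata beforehand.

```python
from typing import Dict, List, Set


def sketch_delete_path(path: str, storage: Dict[str, bytes]) -> bool:
    """Hypothetical stand-in for delete_path: True if removed, False if absent."""
    if path not in storage:
        return False
    del storage[path]
    return True


def sketch_collect(storage: Dict[str, bytes], referenced_paths: Set[str]) -> List[str]:
    """Delete only the paths that no metadata record still references."""
    orphans = [path for path in storage if path not in referenced_paths]
    for path in orphans:
        sketch_delete_path(path, storage)
    return orphans


storage = {"_hash/lab/aaa": b"in use", "_hash/lab/bbb": b"orphaned"}
removed = sketch_collect(storage, referenced_paths={"_hash/lab/aaa"})
print(removed)  # ['_hash/lab/bbb']
```

Keeping the reference check outside the delete function mirrors the split described above: the low-level call deletes unconditionally, so the caller is responsible for proving the content is unreferenced.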