Hash Registry

Content hashing for external storage

Hash-addressed storage registry for DataJoint.

This module provides hash-addressed storage with deduplication for the <hash> codec. Content is identified by a Base32-encoded MD5 hash and stored with per-schema isolation::

_hash/{schema}/{hash}

With optional subfolding (configured per-store)::

_hash/{schema}/{fold1}/{fold2}/{hash}

Subfolding creates directory hierarchies to improve performance on filesystems that struggle with large directories (ext3, FAT32, NFS). Modern filesystems and object stores (ext4, XFS, ZFS, S3) handle flat directories efficiently.

Storage Model:

  • The hash is used for content identification (deduplication, integrity verification)
  • The path is always stored in metadata and used for all file operations

This design protects against configuration changes (e.g., subfolding) affecting existing data. The path stored at insert time is always used for retrieval.

Hash-addressed storage is used by <hash@>, <blob@>, and <attach@> types. Deduplication occurs within each schema. Deletion requires garbage collection via dj.gc.collect().

See Also

datajoint.gc : Garbage collection for orphaned storage items.

compute_hash

compute_hash(data)

Compute Base32-encoded MD5 hash of content.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| data | bytes | Content bytes. | required |

Returns:

| Type | Description |
|------|-------------|
| str | Base32-encoded hash (26 lowercase characters, no padding). |
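
The description above can be illustrated with a minimal sketch built only on the Python standard library. The name `compute_hash_sketch` is hypothetical, and the exact steps (strip the `=` padding, then lowercase) are an assumption inferred from the documented return format, not a copy of the library's implementation.

```python
import base64
import hashlib


def compute_hash_sketch(data: bytes) -> str:
    """Return a Base32-encoded MD5 digest: 26 lowercase characters, no padding."""
    digest = hashlib.md5(data).digest()                 # 16 raw bytes (128 bits)
    encoded = base64.b32encode(digest).decode("ascii")  # 32 characters including '=' padding
    return encoded.rstrip("=").lower()                  # 26 significant characters remain


content_hash = compute_hash_sketch(b"hello world")
print(len(content_hash))  # 26
```

A 128-bit digest needs 26 Base32 characters (5 bits each), which is why the padded 32-character encoding shrinks to 26 once the padding is removed.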

build_hash_path

build_hash_path(content_hash, schema_name, subfolding=None)

Build the storage path for hash-addressed storage.

Path structure without subfolding::

_hash/{schema}/{hash}

Path structure with subfolding (e.g., (2, 2))::

_hash/{schema}/{fold1}/{fold2}/{hash}

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| content_hash | str | Base32-encoded hash (26 characters). | required |
| schema_name | str | Database/schema name for isolation. | required |
| subfolding | tuple[int, ...] | Subfolding pattern from store config. None means flat (no subfolding). | None |

Returns:

| Type | Description |
|------|-------------|
| str | Relative path within the store. |
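
The two path layouts can be sketched as follows. The function name `build_hash_path_sketch` is hypothetical, and the rule that each fold takes the next run of characters from the start of the hash is an assumption; the reference above does not state how `fold1` and `fold2` are derived.

```python
def build_hash_path_sketch(content_hash, schema_name, subfolding=None):
    """Build a relative path like _hash/{schema}/{fold1}/{fold2}/{hash}."""
    parts = ["_hash", schema_name]
    if subfolding:
        # Assumption: each fold is the next run of characters from the hash.
        start = 0
        for width in subfolding:
            parts.append(content_hash[start:start + width])
            start += width
    parts.append(content_hash)
    return "/".join(parts)


example_hash = "abcdefghijklmnopqrstuvwxyz"  # placeholder 26-character hash
flat = build_hash_path_sketch(example_hash, "my_schema")
folded = build_hash_path_sketch(example_hash, "my_schema", (2, 2))
print(flat)    # _hash/my_schema/abcdefghijklmnopqrstuvwxyz
print(folded)  # _hash/my_schema/ab/cd/abcdefghijklmnopqrstuvwxyz
```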

get_store_backend

get_store_backend(store_name=None)

Get a StorageBackend for hash-addressed storage.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| store_name | str | Name of the store to use. If None, uses stores.default. | None |

Returns:

| Type | Description |
|------|-------------|
| StorageBackend | StorageBackend instance. |

get_store_subfolding

get_store_subfolding(store_name=None)

Get the subfolding configuration for a store.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| store_name | str | Name of the store. If None, uses stores.default. | None |

Returns:

| Type | Description |
|------|-------------|
| tuple[int, ...] \| None | Subfolding pattern (e.g., (2, 2)) or None for flat storage. |

put_hash

put_hash(data, schema_name, store_name=None)

Store content using hash-addressed storage.

If the content already exists (same hash in the same schema), it is not re-uploaded. Returns metadata including the hash, path, schema, store, and size.

The path is always stored in metadata and used for retrieval, protecting against configuration changes (e.g., subfolding) affecting existing data.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| data | bytes | Content bytes to store. | required |
| schema_name | str | Database/schema name for path isolation. | required |
| store_name | str | Name of the store. If None, uses default store. | None |

Returns:

| Type | Description |
|------|-------------|
| dict[str, Any] | Metadata dict with keys: hash, path, schema, store, size. |
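
The deduplication behavior can be sketched with an in-memory dictionary standing in for a real storage backend. Everything here is illustrative: `fake_store` and `put_hash_sketch` are hypothetical names, the flat path layout is assumed, and the actual function writes through a StorageBackend rather than a dict.

```python
import base64
import hashlib

# Hypothetical in-memory stand-in for a storage backend: path -> bytes.
fake_store = {}


def put_hash_sketch(data, schema_name, store_name=None):
    """Store data once per (schema, hash) and return its metadata."""
    digest = hashlib.md5(data).digest()
    content_hash = base64.b32encode(digest).decode("ascii").rstrip("=").lower()
    path = f"_hash/{schema_name}/{content_hash}"  # flat layout, no subfolding
    if path not in fake_store:  # skip the upload when the content already exists
        fake_store[path] = data
    return {
        "hash": content_hash,
        "path": path,
        "schema": schema_name,
        "store": store_name,
        "size": len(data),
    }


first = put_hash_sketch(b"same bytes", "lab_a")
second = put_hash_sketch(b"same bytes", "lab_a")  # deduplicated within the schema
other = put_hash_sketch(b"same bytes", "lab_b")   # different schema, separate copy
print(len(fake_store))  # 2
```

Because the schema name is part of the path, identical bytes inserted under two schemas produce two stored objects, which matches the per-schema isolation described at the top of this page.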

get_hash

get_hash(metadata)

Retrieve content using stored metadata.

Uses the stored path directly (not derived from hash) to protect against configuration changes affecting existing data.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| metadata | dict | Metadata dict with keys: path, hash, store (optional). | required |

Returns:

| Type | Description |
|------|-------------|
| bytes | Content bytes. |

Raises:

| Type | Description |
|------|-------------|
| MissingExternalFile | If content is not found at the stored path. |
| DataJointError | If hash verification fails (data corruption). |
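
The retrieval and verification steps can be sketched as below. The exception classes `MissingContent` and `CorruptContent` are hypothetical stand-ins for MissingExternalFile and DataJointError, the `store` dict replaces a real backend, and the helper names are illustrative only.

```python
import base64
import hashlib


class MissingContent(Exception):
    """Hypothetical stand-in for MissingExternalFile."""


class CorruptContent(Exception):
    """Hypothetical stand-in for the DataJointError raised on a hash mismatch."""


def md5_base32(data):
    """Base32-encoded MD5 digest, lowercase, without padding."""
    digest = hashlib.md5(data).digest()
    return base64.b32encode(digest).decode("ascii").rstrip("=").lower()


def get_hash_sketch(metadata, store):
    """Read bytes from the stored path, then verify them against the stored hash."""
    path = metadata["path"]  # the recorded path, never re-derived from the hash
    if path not in store:
        raise MissingContent(path)
    data = store[path]
    if md5_base32(data) != metadata["hash"]:
        raise CorruptContent(f"hash mismatch at {path}")
    return data


payload = b"example payload"
meta = {"hash": md5_base32(payload), "path": "_hash/lab_a/legacy/layout/item"}
store = {meta["path"]: payload}
print(get_hash_sketch(meta, store))  # b'example payload'
```

The path in `meta` deliberately does not follow any current layout: the lookup still succeeds because only the recorded path is consulted, which is the protection against configuration changes described above.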

delete_path

delete_path(path, store_name=None)

Delete content at the specified path from storage.

This should only be called after verifying no references exist. Use garbage collection to safely remove unreferenced content.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| path | str | Storage path (as stored in metadata). | required |
| store_name | str | Name of the store. If None, uses default store. | None |

Returns:

| Type | Description |
|------|-------------|
| bool | True if content was deleted, False if it didn't exist. |

Warnings

This permanently deletes content. Ensure no references exist first.
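
The return value and the reference check can be sketched together. This is a simplified illustration, not the garbage collector: `delete_path_sketch`, `delete_if_unreferenced`, and the `referenced` set are hypothetical, and a dict stands in for the storage backend.

```python
def delete_path_sketch(path, store):
    """Remove path from the store; report whether anything was deleted."""
    if path not in store:
        return False
    del store[path]
    return True


def delete_if_unreferenced(path, store, referenced_paths):
    """Delete only when no metadata record still points at the path."""
    if path in referenced_paths:
        return False  # still referenced, so the content must stay
    return delete_path_sketch(path, store)


store = {"_hash/lab_a/aaa": b"kept", "_hash/lab_a/bbb": b"orphan"}
referenced = {"_hash/lab_a/aaa"}  # hypothetical set of paths still in use
print(delete_if_unreferenced("_hash/lab_a/aaa", store, referenced))  # False
print(delete_if_unreferenced("_hash/lab_a/bbb", store, referenced))  # True
print(delete_path_sketch("_hash/lab_a/bbb", store))                  # False
```

The last call returns False because the content is already gone, mirroring the documented "False if it didn't exist" case.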