Configure Object Stores¶

Set up S3, MinIO, or filesystem storage for DataJoint's Object-Augmented Schema (OAS).

Tip: DataJoint.com provides pre-configured object stores integrated with your database—no setup required.

Overview¶

DataJoint's Object-Augmented Schema (OAS) integrates relational tables with object storage as a single coherent system. Large data objects (arrays, files, Zarr datasets) are stored in file systems or cloud storage while maintaining full referential integrity with the relational database.

Storage models:

Hash-addressed and schema-addressed storage are integrated into the OAS. DataJoint manages paths, lifecycle, integrity, garbage collection, transaction safety, and deduplication.
Filepath storage stores only path strings. DataJoint provides no lifecycle management, garbage collection, transaction safety, or deduplication. Users control file creation, organization, and lifecycle.

Storage is configured per-project using named stores. Each store can be used for:

Hash-addressed storage (<blob@>, <attach@>) — content-addressed with deduplication using _hash/ section
Schema-addressed storage (<object@>, <npy@>) — key-based paths with streaming access using _schema/ section
Filepath storage (<filepath@>) — user-managed paths anywhere in the store except _hash/ and _schema/ (reserved for DataJoint)

Multiple stores can be configured for different data types or storage tiers. One store is designated as the default.

Configuration Methods¶

DataJoint loads configuration in priority order:

Environment variables (highest priority)
Secrets directory (.secrets/)
Config file (datajoint.json)
Defaults (lowest priority)

Single Store Configuration¶

File System Store¶

For local or network-mounted storage:

{
  "stores": {
    "default": "main",
    "main": {
      "protocol": "file",
      "location": "/data/my-project/production"
    }
  }
}

Paths will be:

Hash: /data/my-project/production/_hash/{schema}/{hash}
Schema: /data/my-project/production/_schema/{schema}/{table}/{key}/

S3 Store¶

For Amazon S3 or S3-compatible storage:

{
  "stores": {
    "default": "main",
    "main": {
      "protocol": "s3",
      "endpoint": "s3.amazonaws.com",
      "bucket": "my-bucket",
      "location": "my-project/production",
      "secure": true
    }
  }
}

Store credentials separately in .secrets/:

.secrets/
├── stores.main.access_key
└── stores.main.secret_key

Paths will be:

Hash: s3://my-bucket/my-project/production/_hash/{schema}/{hash}
Schema: s3://my-bucket/my-project/production/_schema/{schema}/{table}/{key}/

MinIO Store¶

MinIO uses the S3 protocol with a custom endpoint:

{
  "stores": {
    "default": "main",
    "main": {
      "protocol": "s3",
      "endpoint": "minio.example.com:9000",
      "bucket": "datajoint",
      "location": "lab-data",
      "secure": false
    }
  }
}

Multiple Stores Configuration¶

Define multiple stores for different data types or storage tiers:

{
  "stores": {
    "default": "main",
    "main": {
      "protocol": "file",
      "location": "/data/my-project/main",
      "partition_pattern": "subject_id/session_date"
    },
    "raw": {
      "protocol": "file",
      "location": "/data/my-project/raw",
      "subfolding": [2, 2]
    },
    "archive": {
      "protocol": "s3",
      "endpoint": "s3.amazonaws.com",
      "bucket": "archive-bucket",
      "location": "my-project/long-term"
    }
  }
}

Store credentials in .secrets/:

.secrets/
├── stores.archive.access_key
└── stores.archive.secret_key

Use named stores in table definitions:

@schema
class Recording(dj.Manual):
    definition = """
    recording_id : uuid
    ---
    raw_data : <blob@raw>         # Hash: _hash/{schema}/{hash}
    zarr_scan : <object@raw>      # Schema: _schema/{schema}/{table}/{key}/
    summary : <blob@>             # Uses default store (main)
    old_data : <blob@archive>     # Archive store, hash-addressed
    """

Notice that <blob@raw> and <object@raw> both use the "raw" store, just different _hash and _schema sections.

Example paths with partitioning:

For a Recording with subject_id=042, session_date=2024-01-15 in the main store:

/data/my-project/main/_schema/subject_id=042/session_date=2024-01-15/experiment/Recording/recording_id=uuid-value/zarr_scan.x8f2a9b1.zarr

Without those attributes, it follows normal structure:

/data/my-project/main/_schema/experiment/Recording/recording_id=uuid-value/zarr_scan.x8f2a9b1.zarr

Verify Configuration¶

import datajoint as dj

# Check default store
spec = dj.config.get_store_spec()  # Uses stores.default
print(spec)

# Check named store
spec = dj.config.get_store_spec("archive")
print(spec)

# List all configured stores
print(dj.config.stores.keys())

Configuration Options¶

Option	Required	Description
`stores.default`	Yes	Name of the default store
`stores.<name>.protocol`	Yes	`file`, `s3`, `gcs`, or `azure`
`stores.<name>.location`	Yes	Base path or prefix (includes project context)
`stores.<name>.bucket`	S3/GCS	Bucket name
`stores.<name>.endpoint`	S3	S3 endpoint URL
`stores.<name>.secure`	No	Use HTTPS (default: true)
`stores.<name>.access_key`	S3	Access key ID (store in `.secrets/`)
`stores.<name>.secret_key`	S3	Secret access key (store in `.secrets/`)
`stores.<name>.subfolding`	No	Hash-addressed hierarchy: `[2, 2]` for 2-level nesting (default: no subfolding)
`stores.<name>.partition_pattern`	No	Schema-addressed path partitioning: `"subject_id/session_date"` (default: no partitioning)
`stores.<name>.token_length`	No	Random token length for schema-addressed filenames (default: `8`)

Subfolding (Hash-Addressed Storage Only)¶

Hash-addressed storage (<blob@>, <attach@>) stores content using a Base32-encoded hash as the filename. By default, all files are stored in a flat directory structure:

_hash/{schema}/abcdefghijklmnopqrstuvwxyz

Some filesystems perform poorly with large directories (thousands of files). Subfolding creates a directory hierarchy to distribute files:

{
  "stores": {
    "default": "main",
    "main": {
      "protocol": "file",
      "location": "/data/store",
      "project_name": "my_project",
      "subfolding": [2, 2]
    }
  }
}

With [2, 2] subfolding, hash-addressed paths become:

_hash/{schema}/ab/cd/abcdefghijklmnopqrstuvwxyz

Schema-addressed storage (<object@>, <npy@>) does not use subfolding—it uses key-based paths:

{location}/_schema/{partition}/{schema}/{table}/{key}/{field_name}.{token}.{ext}

Filesystem Recommendations¶

Filesystem	Subfolding Needed	Notes
ext3	Yes	Limited directory indexing
FAT32/exFAT	Yes	Linear directory scans
NFS	Yes	Network latency amplifies directory lookups
CIFS/SMB	Yes	Windows network shares
ext4	No	HTree indexing handles large directories
XFS	No	B+ tree directories scale well
ZFS	No	Efficient directory handling
Btrfs	No	B-tree based
S3/MinIO	No	Object storage uses hash-based lookups
GCS	No	Object storage
Azure Blob	No	Object storage

Recommendation: Use [2, 2] for network-mounted filesystems and legacy systems. Modern local filesystems and cloud object storage work well without subfolding.

URL Representation¶

DataJoint uses consistent URL representation for all storage backends internally. This means:

Local filesystem paths are represented as file:// URLs
S3 paths use s3://bucket/path
GCS paths use gs://bucket/path
Azure paths use az://container/path

You can use either format when specifying paths:

# Both are equivalent for local files
"/data/myfile.dat"
"file:///data/myfile.dat"

This unified approach enables:

Consistent internal handling across all storage types
Seamless switching between local and cloud storage
Integration with fsspec for streaming access

Customizing Storage Sections¶

Each store is divided into sections for different storage types. By default, DataJoint uses _hash/ for hash-addressed storage and _schema/ for schema-addressed storage. You can customize the path prefix for each section using the *_prefix configuration parameters to map DataJoint to existing storage layouts:

{
  "stores": {
    "legacy": {
      "protocol": "file",
      "location": "/data/existing_storage",
      "hash_prefix": "content_addressed",
      "schema_prefix": "structured_data",
      "filepath_prefix": "raw_files"
    }
  }
}

Requirements:

Sections must be mutually exclusive (path prefixes cannot nest)
The hash_prefix and schema_prefix sections are reserved for DataJoint-managed storage
The filepath_prefix is optional (null = unrestricted, or set a required prefix)

Example with hierarchical layout:

{
  "stores": {
    "organized": {
      "protocol": "s3",
      "endpoint": "s3.amazonaws.com",
      "bucket": "neuroscience-data",
      "location": "lab-project-2024",
      "hash_prefix": "managed/blobs",      // Path prefix for hash section
      "schema_prefix": "managed/arrays",    // Path prefix for schema section
      "filepath_prefix": "imported"         // Path prefix for filepath section
    }
  }
}

Storage section paths become:

Hash: s3://neuroscience-data/lab-project-2024/managed/blobs/{schema}/{hash}
Schema: s3://neuroscience-data/lab-project-2024/managed/arrays/{schema}/{table}/{key}/
Filepath: s3://neuroscience-data/lab-project-2024/imported/{user_path}

Reserved Sections and Filepath Storage¶

DataJoint reserves sections within each store for managed storage. These sections are defined by prefix configuration parameters:

Hash-addressed section (configured via hash_prefix, default: _hash/) — Content-addressed storage for <blob@> and <attach@> with deduplication
Schema-addressed section (configured via schema_prefix, default: _schema/) — Key-based storage for <object@> and <npy@> with streaming access

User-Managed Filepath Storage¶

The <filepath@> codec stores paths to files that you manage. DataJoint does not manage lifecycle (no garbage collection), integrity (no transaction safety), or deduplication for filepath storage. You can reference existing files or create new ones—DataJoint simply stores the path string. Files can be anywhere in the store except the reserved sections:

@schema
class RawData(dj.Manual):
    definition = """
    session_id : int
    ---
    recording : <filepath@acquisition>  # User-managed file path
    """

# Valid paths (user-managed)
table.insert1({'session_id': 1, 'recording': 'subject01/session001/data.bin'})  # Existing or new file
table.insert1({'session_id': 2, 'recording': 'raw/experiment_2024/data.nwb'})  # Existing or new file

# Invalid paths (reserved for DataJoint - will raise ValueError)
# These use the default prefixes (_hash and _schema)
table.insert1({'session_id': 3, 'recording': '_hash/abc123...'})      # Error!
table.insert1({'session_id': 4, 'recording': '_schema/myschema/...'}) # Error!

# If you configured custom prefixes like "content_addressed", those would also be blocked
# table.insert1({'session_id': 5, 'recording': 'content_addressed/file.dat'})  # Error!

Key characteristics of <filepath@>:

Stores path string only (DataJoint does not manage the files)
No lifecycle management: no garbage collection, no transaction safety, no deduplication
User controls file creation, organization, and deletion
Can reference existing files or create new ones
Returns ObjectRef for lazy access on fetch
Validates file exists on insert
Cannot use reserved sections (configured by hash_prefix and schema_prefix)
Can be restricted to specific prefix using filepath_prefix configuration

Configuring stores via environment variables¶

New in 2.2.4

DJ_STORES carries a JSON-encoded copy of the stores dict for env-var-only deployments (Kubernetes pods, Lambda, the DataJoint platform). Combined with DJ_IGNORE_CONFIG_FILE=true, it removes the need for any file on disk.

The JSON shape is identical to the stores block of datajoint.json:

export DJ_STORES='{
  "default": "main",
  "main": {
    "protocol": "s3",
    "endpoint": "s3.amazonaws.com",
    "bucket": "my-bucket",
    "location": "my-project/production",
    "access_key": "AKIA...",
    "secret_key": "wJal..."
  }
}'

For plugin-registered adapters, declare whatever fields the adapter requires:

export DJ_STORES='{
  "uc": {
    "protocol": "databricks",
    "workspace_url": "https://my-workspace.cloud.databricks.com",
    "volume": "main.default.my_volume",
    "token": "dapibd..."
  }
}'

DJ_STORES, when set, replaces the stores block loaded from datajoint.json. The .secrets/ directory still runs afterward and fills in any attribute that DJ_STORES omits — useful if you want to keep non-sensitive store config in a file and inject only credentials via env vars.

See Manage Secrets for credential hygiene and Storage Adapter API for the plugin contract.