Skip to content

Hidden Job Metadata in Computed Tables

Overview

Job execution metadata (start time, duration, code version) should be persisted in computed tables themselves, not just in ephemeral job entries. This is accomplished using hidden attributes.

Motivation

The current job table (~~table_name) tracks execution metadata, but: 1. Job entries are deleted after completion (unless keep_completed=True) 2. Users often need to know when and with what code version each row was computed 3. This metadata should be transparent - not cluttering the user-facing schema

Hidden attributes (prefixed with _) provide the solution: stored in the database but filtered from user-facing APIs.

Hidden Job Metadata Attributes

Attribute Type Description
_job_start_time datetime(3) When computation began
_job_duration float32 Computation duration in seconds
_job_version varchar(64) Code version (e.g., git commit hash)

Design notes: - _job_duration (elapsed time) rather than _job_completed_time because duration is more informative for performance analysis - varchar(64) for version is sufficient for git hashes (40 chars for SHA-1, 7-8 for short hash) - datetime(3) provides millisecond precision

Configuration

Settings Structure

Job metadata is controlled via config.jobs settings:

class JobsSettings(BaseSettings):
    """Job queue configuration for AutoPopulate 2.0."""

    model_config = SettingsConfigDict(
        env_prefix="DJ_JOBS_",
        case_sensitive=False,
        extra="forbid",
        validate_assignment=True,
    )

    # Existing settings
    auto_refresh: bool = Field(default=True, ...)
    keep_completed: bool = Field(default=False, ...)
    stale_timeout: int = Field(default=3600, ...)
    default_priority: int = Field(default=5, ...)
    version_method: Literal["git", "none"] | None = Field(default=None, ...)
    allow_new_pk_fields_in_computed_tables: bool = Field(default=False, ...)

    # New setting for hidden job metadata
    add_job_metadata: bool = Field(
        default=False,
        description="Add hidden job metadata attributes (_job_start_time, _job_duration, _job_version) "
        "to Computed and Imported tables during declaration. Tables created without this setting "
        "will not receive metadata updates during populate."
    )

Access Patterns

import datajoint as dj

# Read setting
dj.config.jobs.add_job_metadata  # False (default)

# Enable programmatically
dj.config.jobs.add_job_metadata = True

# Enable via environment variable
# DJ_JOBS_ADD_JOB_METADATA=true

# Enable in config file (dj_config.yaml)
# jobs:
#   add_job_metadata: true

# Temporary override
with dj.config.override(jobs={"add_job_metadata": True}):
    schema(MyComputedTable)  # Declared with metadata columns

Setting Interactions

Setting Effect on Job Metadata
add_job_metadata=True New Computed/Imported tables get hidden metadata columns
add_job_metadata=False Tables declared without metadata columns (default)
version_method="git" _job_version populated with git short hash
version_method="none" _job_version left empty
version_method=None _job_version left empty (same as "none")

Behavior at Declaration vs Populate

add_job_metadata at declare add_job_metadata at populate Result
True True Metadata columns created and populated
True False Metadata columns exist but not populated
False True No metadata columns, populate skips silently
False False No metadata columns, normal behavior

Retrofitting Existing Tables

Tables created before enabling add_job_metadata do not have the hidden metadata columns. To add metadata columns to existing tables, use the migration utility (not automatic):

from datajoint.migrate import add_job_metadata_columns

# Add hidden metadata columns to specific table
add_job_metadata_columns(MyComputedTable)

# Add to all Computed/Imported tables in a schema
add_job_metadata_columns(schema)

This utility: - ALTERs the table to add the three hidden columns - Does NOT populate existing rows (metadata remains NULL) - Future populate() calls will populate metadata for new rows

Behavior

Declaration-time

When config.jobs.add_job_metadata=True and a Computed/Imported table is declared: - Hidden metadata columns are added to the table definition - Only master tables receive metadata columns; Part tables never get them

Population-time

After make() completes successfully: 1. Check if the table has hidden metadata columns 2. If yes: UPDATE the just-inserted rows with start_time, duration, version 3. If no: Silently skip (no error, no ALTER)

This applies to both: - Direct mode (reserve_jobs=False): Single-process populate - Distributed mode (reserve_jobs=True): Multi-worker with job table coordination

Excluding Hidden Attributes from Binary Operators

Problem Statement

If two tables have hidden attributes with the same name (e.g., both have _job_start_time), SQL's NATURAL JOIN would incorrectly match on them:

-- NATURAL JOIN matches ALL common attributes including hidden
SELECT * FROM table_a NATURAL JOIN table_b
-- Would incorrectly match on _job_start_time!

Solution: Replace NATURAL JOIN with USING Clause

Hidden attributes must be excluded from all binary operator considerations. The result of a join does not preserve hidden attributes from its operands.

Pre-2.0 implementation:

def from_clause(self):
    clause = next(support)
    for s, left in zip(support, self._left):
        clause += " NATURAL{left} JOIN {clause}".format(...)

DataJoint 2.0 implementation:

def from_clause(self):
    clause = next(support)
    for s, (left, using_attrs) in zip(support, self._joins):
        if using_attrs:
            using = "USING ({})".format(", ".join(f"`{a}`" for a in using_attrs))
            clause += " {left}JOIN {s} {using}".format(
                left="LEFT " if left else "",
                s=s,
                using=using
            )
        else:
            # Cross join (no common non-hidden attributes)
            clause += " CROSS JOIN " + s if not left else " LEFT JOIN " + s + " ON TRUE"
    return clause

Changes Required

1. QueryExpression._left โ†’ QueryExpression._joins

Replace _left: List[bool] with _joins: List[Tuple[bool, List[str]]]

Each join stores: - left: Whether it's a left join - using_attrs: Non-hidden common attributes to join on

# Before
result._left = self._left + [left] + other._left

# After
join_attributes = [n for n in self.heading.names if n in other.heading.names]
result._joins = self._joins + [(left, join_attributes)] + other._joins

2. heading.names (existing behavior)

Already filters out hidden attributes:

@property
def names(self):
    return [k for k in self.attributes]  # attributes excludes is_hidden=True

This ensures join attribute computation automatically excludes hidden attributes.

Behavior Summary

Scenario Hidden Attributes Result
A * B (join) Same hidden attr in both NOT matched - excluded from USING
A & B (restriction) Same hidden attr in both NOT matched
A - B (anti-restriction) Same hidden attr in both NOT matched
A.proj() Hidden attrs in A NOT projected (unless explicitly named)
A.to_dicts() Hidden attrs in A NOT returned by default

Implementation Details

1. Declaration (declare.py)

def declare(full_table_name, definition, context):
    # ... existing code ...

    # Add hidden job metadata for auto-populated tables
    if config.jobs.add_job_metadata and table_tier in (TableTier.COMPUTED, TableTier.IMPORTED):
        # Only for master tables, not parts
        if not is_part_table:
            job_metadata_sql = [
                "`_job_start_time` datetime(3) DEFAULT NULL",
                "`_job_duration` float DEFAULT NULL",
                "`_job_version` varchar(64) DEFAULT ''",
            ]
            attribute_sql.extend(job_metadata_sql)

2. Population (autopopulate.py)

def _populate1(self, key, callback, use_jobs, jobs):
    start_time = datetime.now()
    version = _get_job_version()

    # ... call make() ...

    duration = time.time() - start_time.timestamp()

    # Update job metadata if table has the hidden attributes
    if self._has_job_metadata_attrs():
        self._update_job_metadata(
            key,
            start_time=start_time,
            duration=duration,
            version=version
        )

def _has_job_metadata_attrs(self):
    """Check if table has hidden job metadata columns."""
    hidden_attrs = self.heading._attributes  # includes hidden
    return '_job_start_time' in hidden_attrs

def _update_job_metadata(self, key, start_time, duration, version):
    """Update hidden job metadata for the given key."""
    # UPDATE using primary key
    pk_condition = make_condition(self, key, set())
    self.connection.query(
        f"UPDATE {self.full_table_name} SET "
        f"`_job_start_time`=%s, `_job_duration`=%s, `_job_version`=%s "
        f"WHERE {pk_condition}",
        args=(start_time, duration, version[:64])
    )

3. Job table (jobs.py)

Update version field length:

version=""      : varchar(64)

4. Version helper

def _get_job_version() -> str:
    """Get version string, truncated to 64 chars."""
    from .settings import config

    method = config.jobs.version_method
    if method is None or method == "none":
        return ""
    elif method == "git":
        try:
            result = subprocess.run(
                ["git", "rev-parse", "--short", "HEAD"],
                capture_output=True,
                text=True,
                timeout=5,
            )
            return result.stdout.strip()[:64] if result.returncode == 0 else ""
        except Exception:
            return ""
    return ""

Example Usage

# Enable job metadata for new tables
dj.config.jobs.add_job_metadata = True

@schema
class ProcessedData(dj.Computed):
    definition = """
    -> RawData
    ---
    result : float
    """

    def make(self, key):
        # User code - unaware of hidden attributes
        self.insert1({**key, 'result': compute(key)})

# Job metadata automatically added and populated:
# _job_start_time, _job_duration, _job_version

# User-facing API unaffected:
ProcessedData().heading.names  # ['raw_data_id', 'result']
ProcessedData().to_dicts()  # Returns only visible attributes

# Access hidden attributes explicitly if needed:
ProcessedData().to_arrays('_job_start_time', '_job_duration', '_job_version')

Summary of Design Decisions

Decision Resolution
Configuration config.jobs.add_job_metadata (default False)
Environment variable DJ_JOBS_ADD_JOB_METADATA
Existing tables No automatic ALTER - silently skip metadata if columns absent
Retrofitting Manual via datajoint.migrate.add_job_metadata_columns() utility
Populate modes Record metadata in both direct and distributed modes
Part tables No metadata columns - only master tables
Version length varchar(64) in both jobs table and computed tables
Binary operators Hidden attributes excluded via USING clause instead of NATURAL JOIN
Failed makes N/A - transaction rolls back, no rows to update