Entity Integrity¶

Entity integrity ensures a one-to-one correspondence between real-world entities and their database records. This is the foundation of reliable data management.

The Core Guarantee¶

Each real-world entity → exactly one database record
Each database record → exactly one real-world entity

Without entity integrity, databases become unreliable:

Integrity Failure	Consequence
Same entity, multiple records	Fragmented data, conflicting information
Multiple entities, same record	Mixed data, privacy violations
Cannot match entity to record	Lost data, broken workflows

The Three Questions¶

When designing a primary key, answer these three questions:

1. How do I prevent duplicate records?¶

Ensure the same entity cannot appear twice in the table.

Ensure different entities cannot share the same record.

3. How do I match entities to records?¶

When an entity arrives, how do I find its corresponding record?

Example: Laboratory Mouse Database¶

Consider a neuroscience lab tracking mice:

Question	Answer
Prevent duplicates?	Each mouse gets a unique ear tag at arrival; database rejects duplicate tags
Prevent sharing?	Ear tags are never reused; retired tags are archived
Match entities?	Read the ear tag → look up record by primary key

@schema
class Mouse(dj.Manual):
    definition = """
    ear_tag : char(6)   # unique ear tag (e.g., 'M00142')
    ---
    date_of_birth : date
    sex : enum('M', 'F', 'U')
    strain : varchar(50)
    """

The database enforces the first two questions through the primary key constraint. The third question requires a physical identification system—ear tags, barcodes, or RFID chips that link physical entities to database records.

Primary Key Requirements¶

In DataJoint, every table must have a primary key. Primary key attributes:

Cannot be NULL — Every entity must be identifiable
Must be unique — No two entities share the same key
Cannot be changed — Keys are immutable after insertion
Declared above the --- line — Syntactic convention

Natural Keys vs. Surrogate Keys¶

Natural Keys¶

Use attributes that naturally identify entities in your domain:

@schema
class Gene(dj.Lookup):
    definition = """
    gene_symbol : varchar(20)   # Official gene symbol (e.g., 'BRCA1')
    ---
    full_name : varchar(200)
    chromosome : varchar(5)
    """

Advantages:

Meaningful to humans
Self-documenting
No additional lookup needed

Surrogate Keys¶

A surrogate key is an identifier used primarily inside the database, with minimal or no exposure to end users. Users typically don't search for entities by surrogate keys or use them in conversation.

@schema
class InternalRecord(dj.Manual):
    definition = """
    record_id : uuid    # internal identifier, not exposed to users
    ---
    created_timestamp : datetime(3)
    data : <blob>
    """

Key distinction from natural keys: Surrogate keys don't require external identification systems because users don't need to match physical entities to records by these keys.

When surrogate keys are appropriate:

Entities that exist only within the system (no physical counterpart)
Privacy-sensitive contexts where natural identifiers shouldn't be stored
Internal system records that users never reference directly

Generating surrogate keys: DataJoint requires explicit key values rather than database-generated auto-increment. This is intentional:

Auto-increment encourages treating keys as "row numbers" rather than entity identifiers
It's incompatible with composite keys, which DataJoint uses extensively
It breaks reproducibility (different IDs when rebuilding pipelines)
It prevents the client-server handshake needed for proper entity integrity

Use client-side generation instead:

UUIDs — Generate with uuid.uuid4() before insertion
ULIDs — Sortable unique IDs
Client-side counters — Query max value and increment

DataJoint recommendation: Prefer natural keys when they're stable and meaningful. Use surrogates only when no natural identifier exists or for privacy-sensitive contexts.

Composite Keys¶

When no single attribute uniquely identifies an entity, combine multiple attributes:

@schema
class Recording(dj.Manual):
    definition = """
    -> Session
    recording_idx : int32   # Recording number within session
    ---
    duration : float32       # seconds
    """

Here, (subject_id, session_idx, recording_idx) together form the primary key. Neither alone would be unique.

Foreign Keys and Dependencies¶

Foreign keys in DataJoint serve dual purposes:

Referential integrity — Ensures referenced entities exist
Workflow dependency — Declares that this entity depends on another

@schema
class Segmentation(dj.Computed):
    definition = """
    -> Scan           # Depends on Scan
    ---
    num_cells : int64
    """

The arrow -> inherits the primary key from Scan and establishes both referential integrity and workflow dependency.

Schema Dimensions¶

A dimension is an independent axis of variation in your data. The fundamental principle:

Any table that introduces a new primary key attribute introduces a new dimension.

This is true whether the table has only new attributes or also inherits attributes from foreign keys. The key is simply: new primary key attribute = new dimension.

Tables That Introduce Dimensions¶

@schema
class Subject(dj.Manual):
    definition = """
    subject_id : varchar(16)   # NEW dimension: subject_id
    ---
    species : varchar(50)
    """

@schema
class Session(dj.Manual):
    definition = """
    -> Subject                 # Inherits subject_id
    session_idx : int32       # NEW dimension: session_idx
    ---
    session_date : date
    """

@schema
class Trial(dj.Manual):
    definition = """
    -> Session                 # Inherits subject_id, session_idx
    trial_idx : int32         # NEW dimension: trial_idx
    ---
    outcome : enum('success', 'fail')
    """

All three tables introduce dimensions:

Subject introduces subject_id dimension
Session introduces session_idx dimension (even though it also inherits subject_id)
Trial introduces trial_idx dimension (even though it also inherits subject_id, session_idx)

In schema diagrams, tables that introduce at least one new dimension have underlined names.

Tables That Don't Introduce Dimensions¶

A table introduces no dimensions when its entire primary key comes from foreign keys:

@schema
class SubjectProfile(dj.Manual):
    definition = """
    -> Subject                 # Inherits subject_id only
    ---
    weight : float32
    """

SubjectProfile doesn't introduce any new primary key attribute—it extends the Subject dimension with additional attributes. There's exactly one profile per subject.

In schema diagrams, these tables have non-underlined names.

Computed Tables and Dimensions¶

Computed tables never introduce dimensions. Their primary key is entirely inherited from their dependencies:

@schema
class SessionSummary(dj.Computed):
    definition = """
    -> Session                 # PK = (subject_id, session_idx)
    ---
    num_trials : int64
    accuracy : float32
    """

This makes sense—computed tables derive data from existing entities rather than introducing new ones.

Part Tables CAN Introduce Dimensions¶

Unlike computed tables, part tables can introduce new dimensions:

@schema
class Detection(dj.Computed):
    definition = """
    -> Image                   # Inherits image_id
    -> DetectionParams         # Inherits params_id
    ---
    num_blobs : int64
    """

    class Blob(dj.Part):
        definition = """
        -> master              # Inherits (image_id, params_id)
        blob_idx : int32      # NEW dimension within detection
        ---
        x : float32
        y : float32
        """

Detection inherits dimensions (no underline in diagram), but Detection.Blob introduces a new dimension (blob_idx) for individual blobs within each detection.

Dimensions and Attribute Lineage¶

Every foreign key attribute traces back to the dimension where it was first defined. This is called attribute lineage:

Subject.subject_id      → myschema.subject.subject_id (origin)
Session.subject_id      → myschema.subject.subject_id (inherited via foreign key)
Session.session_idx     → myschema.session.session_idx (origin)
Trial.subject_id        → myschema.subject.subject_id (inherited via foreign key)
Trial.session_idx       → myschema.session.session_idx (inherited via foreign key)
Trial.trial_idx         → myschema.trial.trial_idx (origin)

Lineage enables semantic matching—DataJoint only joins attributes that trace back to the same dimension. Two attributes named id from different dimensions cannot be accidentally joined.

See Semantic Matching for details.

Recognizing Dimensions in Diagrams¶

In schema diagrams:

Visual	Meaning
Underlined name	Introduces at least one new dimension
Non-underlined name	All PK attributes inherited (no new dimensions)
Thick solid line	One-to-one extension (no new dimension)
Thin solid line	Containment (may introduce dimension)

Common dimensions in neuroscience:

Subject — Who/what is being studied
Session — When data was collected
Trial — Individual experimental unit
Modality — Type of data (ephys, imaging, behavior)
Parameter set — Configuration for analysis

Understanding dimensions helps design schemas that naturally express your experimental structure and ensures correct joins through semantic matching.

Best Practices¶

Answer the three questions before designing any table
Choose stable identifiers that won't need to change
Keep keys minimal — Include only what's necessary for uniqueness
Document key semantics — Explain what the key represents
Consider downstream queries — Keys affect join performance

Common Mistakes¶

Too few key attributes¶

# Wrong: experiment_id alone isn't unique
class Trial(dj.Manual):
    definition = """
    experiment_id : int64
    ---
    trial_number : int32   # Should be part of key!
    result : float32
    """

Too many key attributes¶

# Wrong: timestamp makes every row unique, losing entity semantics
class Measurement(dj.Manual):
    definition = """
    subject_id : int64
    timestamp : datetime(6)   # Microsecond precision
    ---
    value : float32
    """

Mutable natural keys¶

# Risky: names can change
class Patient(dj.Manual):
    definition = """
    patient_name : varchar(100)   # What if they change their name?
    ---
    date_of_birth : date
    """

Summary¶

Entity integrity is maintained by:

Primary keys that uniquely identify each entity
Foreign keys that establish valid references
Physical systems that link real-world entities to records

The three questions framework ensures your primary keys provide meaningful, stable identification for your domain entities.