Entity Integrity¶
Entity integrity ensures a one-to-one correspondence between real-world entities and their database records. This is the foundation of reliable data management.
The Core Guarantee¶
- Each real-world entity → exactly one database record
- Each database record → exactly one real-world entity
Without entity integrity, databases become unreliable:
| Integrity Failure | Consequence |
|---|---|
| Same entity, multiple records | Fragmented data, conflicting information |
| Multiple entities, same record | Mixed data, privacy violations |
| Cannot match entity to record | Lost data, broken workflows |
The Three Questions¶
When designing a primary key, answer these three questions:
1. How do I prevent duplicate records?¶
Ensure the same entity cannot appear twice in the table.
2. How do I prevent record sharing?¶
Ensure different entities cannot share the same record.
3. How do I match entities to records?¶
When an entity arrives, how do I find its corresponding record?
Example: Laboratory Mouse Database¶
Consider a neuroscience lab tracking mice:
| Question | Answer |
|---|---|
| Prevent duplicates? | Each mouse gets a unique ear tag at arrival; database rejects duplicate tags |
| Prevent sharing? | Ear tags are never reused; retired tags are archived |
| Match entities? | Read the ear tag → look up record by primary key |
@schema
class Mouse(dj.Manual):
definition = """
ear_tag : char(6) # unique ear tag (e.g., 'M00142')
---
date_of_birth : date
sex : enum('M', 'F', 'U')
strain : varchar(50)
"""
The database enforces the first two questions through the primary key constraint. The third question requires a physical identification system—ear tags, barcodes, or RFID chips that link physical entities to database records.
Primary Key Requirements¶
In DataJoint, every table must have a primary key. Primary key attributes:
- Cannot be NULL — Every entity must be identifiable
- Must be unique — No two entities share the same key
- Cannot be changed — Keys are immutable after insertion
- Declared above the
---line — Syntactic convention
Natural Keys vs. Surrogate Keys¶
Natural Keys¶
Use attributes that naturally identify entities in your domain:
@schema
class Gene(dj.Lookup):
definition = """
gene_symbol : varchar(20) # Official gene symbol (e.g., 'BRCA1')
---
full_name : varchar(200)
chromosome : varchar(5)
"""
Advantages:
- Meaningful to humans
- Self-documenting
- No additional lookup needed
Surrogate Keys¶
A surrogate key is an identifier used primarily inside the database, with minimal or no exposure to end users. Users typically don't search for entities by surrogate keys or use them in conversation.
@schema
class InternalRecord(dj.Manual):
definition = """
record_id : uuid # internal identifier, not exposed to users
---
created_timestamp : datetime(3)
data : <blob>
"""
Key distinction from natural keys: Surrogate keys don't require external identification systems because users don't need to match physical entities to records by these keys.
When surrogate keys are appropriate:
- Entities that exist only within the system (no physical counterpart)
- Privacy-sensitive contexts where natural identifiers shouldn't be stored
- Internal system records that users never reference directly
Generating surrogate keys: DataJoint requires explicit key values rather than database-generated auto-increment. This is intentional:
- Auto-increment encourages treating keys as "row numbers" rather than entity identifiers
- It's incompatible with composite keys, which DataJoint uses extensively
- It breaks reproducibility (different IDs when rebuilding pipelines)
- It prevents the client-server handshake needed for proper entity integrity
Use client-side generation instead:
- UUIDs — Generate with
uuid.uuid4()before insertion - ULIDs — Sortable unique IDs
- Client-side counters — Query max value and increment
DataJoint recommendation: Prefer natural keys when they're stable and meaningful. Use surrogates only when no natural identifier exists or for privacy-sensitive contexts.
Composite Keys¶
When no single attribute uniquely identifies an entity, combine multiple attributes:
@schema
class Recording(dj.Manual):
definition = """
-> Session
recording_idx : int32 # Recording number within session
---
duration : float32 # seconds
"""
Here, (subject_id, session_idx, recording_idx) together form the primary key.
Neither alone would be unique.
Foreign Keys and Dependencies¶
Foreign keys in DataJoint serve dual purposes:
- Referential integrity — Ensures referenced entities exist
- Workflow dependency — Declares that this entity depends on another
@schema
class Segmentation(dj.Computed):
definition = """
-> Scan # Depends on Scan
---
num_cells : int64
"""
The arrow -> inherits the primary key from Scan and establishes both
referential integrity and workflow dependency.
Schema Dimensions¶
A dimension is an independent axis of variation in your data. The fundamental principle:
Any table that introduces a new primary key attribute introduces a new dimension.
This is true whether the table has only new attributes or also inherits attributes from foreign keys. The key is simply: new primary key attribute = new dimension.
Tables That Introduce Dimensions¶
@schema
class Subject(dj.Manual):
definition = """
subject_id : varchar(16) # NEW dimension: subject_id
---
species : varchar(50)
"""
@schema
class Session(dj.Manual):
definition = """
-> Subject # Inherits subject_id
session_idx : int32 # NEW dimension: session_idx
---
session_date : date
"""
@schema
class Trial(dj.Manual):
definition = """
-> Session # Inherits subject_id, session_idx
trial_idx : int32 # NEW dimension: trial_idx
---
outcome : enum('success', 'fail')
"""
All three tables introduce dimensions:
Subjectintroducessubject_iddimensionSessionintroducessession_idxdimension (even though it also inheritssubject_id)Trialintroducestrial_idxdimension (even though it also inheritssubject_id,session_idx)
In schema diagrams, tables that introduce at least one new dimension have underlined names.
Tables That Don't Introduce Dimensions¶
A table introduces no dimensions when its entire primary key comes from foreign keys:
@schema
class SubjectProfile(dj.Manual):
definition = """
-> Subject # Inherits subject_id only
---
weight : float32
"""
SubjectProfile doesn't introduce any new primary key attribute—it extends the Subject dimension with additional attributes. There's exactly one profile per subject.
In schema diagrams, these tables have non-underlined names.
Computed Tables and Dimensions¶
Computed tables never introduce dimensions. Their primary key is entirely inherited from their dependencies:
@schema
class SessionSummary(dj.Computed):
definition = """
-> Session # PK = (subject_id, session_idx)
---
num_trials : int64
accuracy : float32
"""
This makes sense—computed tables derive data from existing entities rather than introducing new ones.
Part Tables CAN Introduce Dimensions¶
Unlike computed tables, part tables can introduce new dimensions:
@schema
class Detection(dj.Computed):
definition = """
-> Image # Inherits image_id
-> DetectionParams # Inherits params_id
---
num_blobs : int64
"""
class Blob(dj.Part):
definition = """
-> master # Inherits (image_id, params_id)
blob_idx : int32 # NEW dimension within detection
---
x : float32
y : float32
"""
Detection inherits dimensions (no underline in diagram), but Detection.Blob
introduces a new dimension (blob_idx) for individual blobs within each
detection.
Dimensions and Attribute Lineage¶
Every foreign key attribute traces back to the dimension where it was first defined. This is called attribute lineage:
Subject.subject_id → myschema.subject.subject_id (origin)
Session.subject_id → myschema.subject.subject_id (inherited via foreign key)
Session.session_idx → myschema.session.session_idx (origin)
Trial.subject_id → myschema.subject.subject_id (inherited via foreign key)
Trial.session_idx → myschema.session.session_idx (inherited via foreign key)
Trial.trial_idx → myschema.trial.trial_idx (origin)
Lineage enables semantic matching—DataJoint only joins attributes that
trace back to the same dimension. Two attributes named id from different
dimensions cannot be accidentally joined.
See Semantic Matching for details.
Recognizing Dimensions in Diagrams¶
In schema diagrams:
| Visual | Meaning |
|---|---|
| Underlined name | Introduces at least one new dimension |
| Non-underlined name | All PK attributes inherited (no new dimensions) |
| Thick solid line | One-to-one extension (no new dimension) |
| Thin solid line | Containment (may introduce dimension) |
Common dimensions in neuroscience:
- Subject — Who/what is being studied
- Session — When data was collected
- Trial — Individual experimental unit
- Modality — Type of data (ephys, imaging, behavior)
- Parameter set — Configuration for analysis
Understanding dimensions helps design schemas that naturally express your experimental structure and ensures correct joins through semantic matching.
Best Practices¶
- Answer the three questions before designing any table
- Choose stable identifiers that won't need to change
- Keep keys minimal — Include only what's necessary for uniqueness
- Document key semantics — Explain what the key represents
- Consider downstream queries — Keys affect join performance
Common Mistakes¶
Too few key attributes¶
# Wrong: experiment_id alone isn't unique
class Trial(dj.Manual):
definition = """
experiment_id : int64
---
trial_number : int32 # Should be part of key!
result : float32
"""
Too many key attributes¶
# Wrong: timestamp makes every row unique, losing entity semantics
class Measurement(dj.Manual):
definition = """
subject_id : int64
timestamp : datetime(6) # Microsecond precision
---
value : float32
"""
Mutable natural keys¶
# Risky: names can change
class Patient(dj.Manual):
definition = """
patient_name : varchar(100) # What if they change their name?
---
date_of_birth : date
"""
Summary¶
Entity integrity is maintained by:
- Primary keys that uniquely identify each entity
- Foreign keys that establish valid references
- Physical systems that link real-world entities to records
The three questions framework ensures your primary keys provide meaningful, stable identification for your domain entities.