Primary Key Rules in Relational Operators¶
In DataJoint, the result of each query operator produces a valid entity set with a well-defined entity type and primary key. This section specifies how the primary key is determined for each relational operator.
General Principle¶
The primary key of a query result identifies unique entities in that result. For most operators, the primary key is preserved from the left operand. For joins, the primary key depends on the functional dependencies between the operands.
Integration with Semantic Matching¶
Primary key determination is applied after semantic compatibility is verified. The evaluation order is:
- Semantic Check:
assert_join_compatibility()ensures all namesakes are homologous (same lineage) - PK Determination: The functional dependency relationship is computed using attribute names
- Left Join Validation: If
left=True, verify A β B
This ordering is important because:
- After semantic matching passes, namesakes represent semantically equivalent attributes
- The name-based functional dependency check is therefore semantically valid
- Attribute names in the context of a semantically-valid join represent the same entity
The functional dependency relationship is derived from the structure of primary keys and secondary attributes and their names in the two operands:
A β Bholds when every attribute in PK(B) appears (by name) somewhere in A (either as a primary key or secondary attribute)- After semantic matching passes, namesakes represent semantically equivalent attributes, so checking by name is valid
- Aliased attributes (same lineage, different names) don't participate in natural joins
Notation¶
In the examples below, * marks primary key attributes:
A(x*, y*, z)means A has primary key{x, y}and secondary attributezA β Bmeans "A functionally determines B" (defined below)
Rules by Operator¶
| Operator | Primary Key Rule |
|---|---|
A & B (restriction) |
PK(A) β preserved from left operand |
A - B (anti-restriction) |
PK(A) β preserved from left operand |
A.proj(...) (projection) |
PK(A) β preserved from left operand |
A.aggr(B, ...) (aggregation) |
PK(A) β preserved from left operand |
A.extend(B) (extension) |
PK(A) β requires A β B |
A * B (join) |
Depends on functional dependencies (see below) |
Join Primary Key Rule¶
The join operator requires special handling because it combines two entity sets. The primary key of A * B depends on the functional dependency relationship between the operands.
Definitions¶
A β B (A functionally determines B): Every attribute in PK(B) is in A.
A β B iff βb β PK(B): b β A
Since PK(A) βͺ secondary(A) = all attributes in A, this is equivalent to saying every attribute in B's primary key exists somewhere in A (as either a primary key or secondary attribute).
Intuitively, A β B means that knowing A's primary key is sufficient to determine B's primary key through the functional dependencies implied by A's structure.
B β A (B functionally determines A): Every attribute in PK(A) is in B.
B β A iff βa β PK(A): a β B
Join Primary Key Algorithm¶
For A * B:
| Condition | PK(A * B) | Attribute Order |
|---|---|---|
| A β B | PK(A) | A's attributes first |
| B β A (and not A β B) | PK(B) | B's attributes first |
| Neither | PK(A) βͺ PK(B) | PK(A) first, then PK(B) β PK(A) |
When both A β B and B β A hold, the left operand takes precedence (use PK(A)).
Examples¶
Example 1: B β A
A: x*, y*
B: x*, z*, y (y is secondary in B, so z β y)
- A β B? PK(B) = {x, z}. Is z in PK(A) or secondary in A? No (z not in A). No.
- B β A? PK(A) = {x, y}. Is y in PK(B) or secondary in B? Yes (secondary). Yes.
- Result: PK(A * B) = {x, z} with B's attributes first.
Example 2: Both directions (bijection-like)
A: x*, y*, z (z is secondary in A)
B: y*, z*, x (x is secondary in B)
- A β B? PK(B) = {y, z}. Is z in PK(A) or secondary in A? Yes (secondary). Yes.
- B β A? PK(A) = {x, y}. Is x in PK(B) or secondary in B? Yes (secondary). Yes.
- Both hold, prefer left operand: PK(A * B) = {x, y} with A's attributes first.
Example 3: Neither direction
A: x*, y*
B: z*, x (x is secondary in B)
- A β B? PK(B) = {z}. Is z in PK(A) or secondary in A? No. No.
- B β A? PK(A) = {x, y}. Is y in PK(B) or secondary in B? No (y not in B). No.
- Result: PK(A * B) = {x, y, z} (union) with A's attributes first.
Example 4: A β B (subordinate relationship)
Session: session_id*
Trial: session_id*, trial_num* (references Session)
- A β B? PK(Trial) = {session_id, trial_num}. Is trial_num in PK(Session) or secondary? No. No.
- B β A? PK(Session) = {session_id}. Is session_id in PK(Trial)? Yes. Yes.
- Result: PK(Session * Trial) = {session_id, trial_num} with Trial's attributes first.
Join primary key determination:
A * BwhereA β B: result has PK(A)A * BwhereB β A(notA β B): result has PK(B), B's attributes firstA * Bwhere bothA β BandB β A: result has PK(A) (left preference)A * Bwhere neither direction: result has PK(A) βͺ PK(B)- Verify attribute ordering matches primary key source
- Verify non-commutativity:
A * BvsB * Amay differ in PK and order
Design Tradeoff: Predictability vs. Minimality¶
The join primary key rule prioritizes predictability over minimality. In some cases, the resulting primary key may not be minimal (i.e., it may contain functionally redundant attributes).
Example of non-minimal result:
A: x*, y*
B: z*, x (x is secondary in B, so z β x)
The mathematically minimal primary key for A * B would be {y, z} because:
z β x(from B's structure){y, z} β {x, y, z}(z gives us x, and we have y)
However, {y, z} is problematic:
- It is not the primary key of either operand (A has
{x, y}, B has{z}) - It is not the union of the primary keys
- It represents a novel entity type that doesn't correspond to A, B, or their natural pairing
This creates confusion: what kind of entity does {y, z} identify?
The simplified rule produces {x, y, z} (the union), which:
- Is immediately recognizable as "one A entity paired with one B entity"
- Contains A's full primary key and B's full primary key
- May have redundancy (
xis determined byz) but is semantically clear
Rationale: Users can always project away redundant attributes if they need the minimal key. But starting with a predictable, interpretable primary key reduces confusion and errors.
Attribute Ordering¶
The primary key attributes always appear first in the result's attribute list, followed by secondary attributes. When B β A (and not A β B), the join is conceptually reordered as B * A to maintain this invariant:
- If PK = PK(A): A's attributes appear first
- If PK = PK(B): B's attributes appear first
- If PK = PK(A) βͺ PK(B): PK(A) attributes first, then PK(B) β PK(A), then secondaries
Non-Commutativity¶
With these rules, join is not commutative in terms of:
- Primary key selection:
A * Bmay have a different PK thanB * Awhen functional dependency holds in one direction but not the other - Attribute ordering: The left operand's attributes appear first (unless B β A)
The result set (the actual rows returned) remains the same regardless of order, but the schema (primary key and attribute order) may differ.
Left Join Constraint¶
For left joins (A.join(B, left=True)), the functional dependency A β B is required.
Why this constraint exists:
In a left join, all rows from A are retained even if there's no matching row in B. For unmatched rows, B's attributes are NULL. This creates a problem for primary key validity:
| Scenario | PK by inner join rule | Left join problem |
|---|---|---|
| A β B | PK(A) | β Safe β A's attrs always present |
| B β A | PK(B) | β B's PK attrs could be NULL |
| Neither | PK(A) βͺ PK(B) | β B's PK attrs could be NULL |
Example of invalid left join:
A: x*, y* PK(A) = {x, y}
B: x*, z*, y PK(B) = {x, z}, y is secondary
Inner join: PK = {x, z} (B β A rule)
Left join attempt: FAILS because z could be NULL for unmatched A rows
Valid left join example:
Session: session_id*, date
Trial: session_id*, trial_num*, stimulus (references Session)
Session.join(Trial, left=True) # OK: Session β Trial
# PK = {session_id}, all sessions retained even without trials
Error message:
DataJointError: Left join requires the left operand to determine the right operand (A β B).
The following attributes from the right operand's primary key are not determined by
the left operand: ['z']. Use an inner join or restructure the query.
Conceptual Note: Left Join as Extension¶
When A β B, the left join A.join(B, left=True) is conceptually distinct from the general join operator A * B. It is better understood as an extension operation rather than a join:
| Aspect | General Join (A * B) | Left Join when A β B |
|---|---|---|
| Conceptual model | Cartesian product restricted to matching rows | Extend A with attributes from B |
| Row count | May increase, decrease, or stay same | Always equals len(A) |
| Primary key | Depends on functional dependencies | Always PK(A) |
| Relation to projection | Different operation | Variation of projection |
The extension perspective:
The operation A.join(B, left=True) when A β B is closer to projection than to join:
- It adds new attributes to A (like
A.proj(..., new_attr=...)) - It preserves all rows of A
- It preserves A's primary key
- It lacks the Cartesian product aspect that defines joins
DataJoint provides an explicit extend() method for this pattern:
# These are equivalent when A β B:
A.join(B, left=True)
A.extend(B) # clearer intent: extend A with B's attributes
The extend() method:
- Requires
A β B(raisesDataJointErrorotherwise) - Does not expose
allow_nullable_pk(that's an internal mechanism) - Expresses the semantic intent: "add B's attributes to A's entities"
Relationship to aggregation:
A similar argument applies to A.aggr(B, ...):
- It preserves A's primary key
- It adds computed attributes derived from B
- It's conceptually a variation of projection with grouping
Both A.join(B, left=True) (when A β B) and A.aggr(B, ...) can be viewed as projection-like operations that extend A's attributes while preserving its entity identity.
Bypassing the Left Join Constraint¶
The allow_nullable_pk Parameter¶
The join() method accepts an allow_nullable_pk parameter that bypasses the A β B requirement for left joins:
def join(self, other, semantic_check=True, left=False, allow_nullable_pk=False):
...
Why the Constraint Exists¶
In a left join, unmatched rows from A have NULL values for all of B's attributes. If B's primary key attributes are part of the result's primary key, those NULLs violate entity integrityβprimary keys must not contain NULLs.
The A β B requirement guarantees the result's primary key is PK(A), which contains only A's attributes. Since A is the left operand in a left join, all its rows are retained with their original values, so the primary key cannot contain NULLs.
What allow_nullable_pk=True Does¶
When the constraint is bypassed:
- The
A β Bcheck is skipped - The result's primary key becomes
PK(A) βͺ PK(B)(union of both) - Primary key attributes from B may be NULL for unmatched rows
# Normally blocked - A does not determine B
A.join(B, left=True) # Error: A β B not satisfied
# Bypass the constraint
A.join(B, left=True, allow_nullable_pk=True) # PK = PK(A) βͺ PK(B)
When to Use It¶
This parameter is primarily internal. It exists because aggregation (aggr) needs a left join but has the opposite functional dependency requirement (B β A instead of A β B). The bypass is safe in aggregation because the GROUP BY clause resets the primary key to PK(A).
Users should generally avoid this parameter. If you need left join behavior without A β B:
- Use
extend()if A should determine B (restructure your query) - Use
aggr()if you're computing aggregates - Reconsider whether a left join is appropriate for your use case
If you do use allow_nullable_pk=True, you must ensure subsequent operations (such as GROUP BY or projection) establish a valid, non-nullable primary key.
Aggregation Exception¶
A.aggr(B) (with default exclude_nonmatching=False) uses a left join internally but has the opposite requirement: B β A (the group expression B must have all of A's primary key attributes).
This apparent contradiction is resolved by the GROUP BY clause:
- Aggregation requires B β A so that B can be grouped by A's primary key
- The intermediate left join
A LEFT JOIN Bwould have an invalid PK under the normal left join rules - Aggregation internally allows the invalid PK, producing PK(A) βͺ PK(B)
- The
GROUP BY PK(A)clause then resets the primary key to PK(A) - The final result has PK(A), which consists entirely of non-NULL values from A
Note: The semantic check (homologous namesake validation) is still performed for aggregation's internal join. Only the primary key validity constraint is bypassed.
Example:
Session: session_id*, date
Trial: session_id*, trial_num*, response_time (references Session)
# Aggregation (default keeps all rows)
Session.aggr(Trial, avg_rt='avg(response_time)')
# Internally: Session LEFT JOIN Trial (with invalid PK allowed)
# Intermediate PK would be {session_id} βͺ {session_id, trial_num} = {session_id, trial_num}
# But GROUP BY session_id resets PK to {session_id}
# Result: All sessions, with avg_rt=NULL for sessions without trials
Universal Set dj.U¶
dj.U() or dj.U('attr1', 'attr2', ...) represents the universal set of all possible values and lineages.
Homology with dj.U¶
Since dj.U conceptually contains all possible lineages, its attributes are homologous to any namesake attribute in other expressions.
Valid Operations¶
# Restriction: promotes a, b to PK; lineage transferred from A
dj.U('a', 'b') & A
# Aggregation: groups by a, b
dj.U('a', 'b').aggr(A, count='count(*)')
Invalid Operations¶
# Anti-restriction: produces infinite set
dj.U('a', 'b') - A # DataJointError
# Join: deprecated, use & instead
dj.U('a', 'b') * A # DataJointError with migration guidance