Semantic Matching - Specification¶
Overview¶
This document specifies semantic matching for binary operators in DataJoint 2.0. Semantic matching ensures that attributes are only matched when they share both the same name and the same lineage (origin), preventing accidental matches on unrelated attributes that happen to share names. This replaces the name-based matching rules from pre-2.0 versions.
Goals¶
- Prevent incorrect matches on attributes that share names but represent different entities
- Enable valid operations that were previously blocked due to overly restrictive rules
- Maintain backward compatibility for well-designed schemas
- Provide clear error messages when semantic conflicts are detected
User Guide¶
Quick Start¶
Semantic matching is enabled by default in DataJoint 2.0. For most well-designed schemas, no changes are required.
When You Might See Errors¶
# Two tables with generic 'id' attribute
class Student(dj.Manual):
definition = """
id : int64
---
name : varchar(100)
"""
class Course(dj.Manual):
definition = """
id : int64
---
title : varchar(100)
"""
# This will raise an error because 'id' has different lineages
Student() * Course() # DataJointError!
How to Resolve¶
Option 1: Rename attributes using projection
Student() * Course().proj(course_id='id') # OK
Option 2: Bypass semantic check (use with caution)
Student().join(Course(), semantic_check=False) # OK, but be careful!
Option 3: Use descriptive names (best practice)
class Student(dj.Manual):
definition = """
student_id : int64
---
name : varchar(100)
"""
Rebuilding Lineage for Existing Schemas¶
If you have existing schemas created before DataJoint 2.0, rebuild their lineage tables:
import datajoint as dj
# Connect and get your schema
schema = dj.Schema('my_database')
# Rebuild lineage (do this once per schema)
schema.rebuild_lineage()
# Restart Python kernel to pick up changes
Important: If your schema references tables in other schemas, rebuild those upstream schemas first.
API Reference¶
Schema Methods¶
schema.rebuild_lineage()¶
Rebuild the ~lineage table for all tables in this schema.
schema.rebuild_lineage()
Description: Recomputes lineage for all attributes by querying FK relationships from the database's information_schema. Use this to restore lineage for schemas that predate the lineage system or after corruption.
Requirements:
- Schema must exist
- Upstream schemas (referenced via cross-schema FKs) must have their lineage rebuilt first
Side Effects:
- Creates
~lineagetable if it doesn't exist - Deletes and repopulates all lineage entries for tables in the schema
Post-Action: Restart Python kernel and reimport to pick up new lineage information.
schema.lineage_table_exists¶
Property indicating whether the ~lineage table exists in this schema.
if schema.lineage_table_exists:
print("Lineage tracking is enabled")
Returns: bool - True if ~lineage table exists, False otherwise.
schema.lineage¶
Property returning all lineage entries for the schema.
schema.lineage
# {'myschema.session.session_id': 'myschema.session.session_id',
# 'myschema.trial.session_id': 'myschema.session.session_id',
# 'myschema.trial.trial_num': 'myschema.trial.trial_num'}
Returns: dict - Maps 'schema.table.attribute' to its lineage origin
Join Methods¶
expr.join(other, semantic_check=True)¶
Join two expressions with optional semantic checking.
result = A.join(B) # semantic_check=True (default)
result = A.join(B, semantic_check=False) # bypass semantic check
Parameters:
other: Another query expression to join withsemantic_check(bool): IfTrue(default), raise error on non-homologous namesakes. IfFalse, perform natural join without lineage checking.
Raises: DataJointError if semantic_check=True and namesake attributes have different lineages.
expr.restrict(other, semantic_check=True)¶
Restrict expression with optional semantic checking.
result = A.restrict(B) # semantic_check=True (default)
result = A.restrict(B, semantic_check=False) # bypass semantic check
Parameters:
other: Restriction condition (expression, dict, string, etc.)semantic_check(bool): IfTrue(default), raise error on non-homologous namesakes when restricting by another expression. IfFalse, no lineage checking.
Raises: DataJointError if semantic_check=True and namesake attributes have different lineages.
Operators¶
A * B (Join)¶
Equivalent to A.join(B, semantic_check=True).
A & B (Restriction)¶
Equivalent to A.restrict(B, semantic_check=True).
A - B (Anti-restriction)¶
Restriction with negation. Semantic checking applies.
To bypass semantic checking: A.restrict(dj.Not(B), semantic_check=False)
A + B (Union)¶
Union of expressions. Requires all namesake attributes to have matching lineage.
Universal Set (dj.U)¶
Valid Operations¶
dj.U('a', 'b') & A # Restriction: promotes a, b to PK
dj.U('a', 'b').aggr(A, ...) # Aggregation: groups by a, b
dj.U() & A # Distinct primary keys of A
Invalid Operations¶
dj.U('a', 'b') - A # DataJointError: produces infinite set
dj.U('a', 'b') * A # DataJointError: use & instead
For conceptual background on lineage, terminology, and matching rules, see Semantic Matching (Explanation).
Implementation Details¶
~lineage Table¶
Each schema has a hidden ~lineage table storing lineage information:
CREATE TABLE `schema_name`.`~lineage` (
table_name VARCHAR(64) NOT NULL,
attribute_name VARCHAR(64) NOT NULL,
lineage VARCHAR(255) NOT NULL,
PRIMARY KEY (table_name, attribute_name)
)
Lineage Population¶
At table declaration:
- Delete any existing lineage entries for the table
- For FK attributes: copy lineage from parent (with warning if parent lineage missing)
- For native PK attributes: set lineage to
schema.table.attribute - Native secondary attributes: no entry (lineage = None)
At table drop:
- Delete all lineage entries for the table
Missing Lineage Handling¶
If ~lineage table doesn't exist:
- Warning issued during semantic check
- Semantic checking disabled (join proceeds as natural join)
If parent lineage missing during declaration:
- Warning issued
- Parent attribute used as origin
- Recommend rebuilding lineage after parent schema is fixed
Heading's lineage_available Property¶
The Heading class tracks whether lineage information is available:
heading.lineage_available # True if ~lineage table exists for this schema
This property is:
- Set when heading is loaded from database
- Propagated through projections, joins, and other operations
- Used by
assert_join_compatibilityto decide whether to perform semantic checking
Error Messages¶
Non-Homologous Namesakes¶
DataJointError: Cannot join on attribute `id`: different lineages
(university.student.id vs university.course.id).
Use .proj() to rename one of the attributes.
Missing Lineage Warning¶
WARNING: Semantic check disabled: ~lineage table not found.
To enable semantic matching, rebuild lineage with: schema.rebuild_lineage()
Parent Lineage Missing Warning¶
WARNING: Lineage for `parent_db`.`parent_table`.`attr` not found
(parent schema's ~lineage table may be missing or incomplete).
Using it as origin. Once the parent schema's lineage is rebuilt,
run schema.rebuild_lineage() on this schema to correct the lineage.