DataJoint Query Algebra Specification¶
Overview¶
This document specifies the query algebra in DataJoint Python. Query expressions are composable objects that represent database queries. All operators return new QueryExpression objects without modifying the original—expressions are immutable.
1. Query Expression Fundamentals¶
1.1 Immutability¶
All query expressions are immutable. Every operator creates a new expression:
original = Session()
restricted = original & "session_date > '2024-01-01'" # New object
# original is unchanged
1.2 Primary Key Preservation¶
For most binary operators, the primary key of the result is the primary key of the left operand:
A & condition→ PK(A)A - condition→ PK(A)A.proj(...)→ PK(A)A.aggr(B, ...)→ PK(A)A.extend(B)→ PK(A)A + B→ PK(A) = PK(B)dj.U('x', 'y') & table→ PK(U) = (x, y)
The only exception is Join (A * B), where the result's primary key depends on functional dependencies between operands:
| Condition | Result PK |
|---|---|
| A → B (A determines B) | PK(A) |
| B → A (B determines A) | PK(B) |
| Both directions | PK(A) |
| Neither direction | PK(A) ∪ PK(B) |
1.3 Lazy Evaluation¶
Expressions are not executed until data is fetched:
expr = (Session * Trial) & "trial_type = 'test'" # No database query yet
data = expr.to_dicts() # Query executed here
2. Restriction (& and -)¶
2.1 Syntax¶
result = expression & condition # Select matching rows
result = expression - condition # Select non-matching rows (anti-restriction)
result = expression.restrict(condition, semantic_check=True)
2.2 Condition Types¶
| Type | Example | Behavior |
|---|---|---|
| String | "x > 5" |
SQL WHERE condition |
| Dict | {"status": "active"} |
Equality on attributes |
| QueryExpression | OtherTable |
Rows with matching keys in other table |
| List/Tuple/Set | [cond1, cond2] |
OR of conditions |
| Boolean | True / False |
No effect / empty result |
| pandas.DataFrame | df |
OR of row conditions |
| numpy.void | record |
Treated as dict |
2.3 String Conditions¶
SQL expressions using attribute names:
Session & "session_date > '2024-01-01'"
Session & "subject_id IN (1, 2, 3)"
Session & "notes LIKE '%test%'"
Session & "(x > 0) AND (y < 100)"
2.4 Dictionary Conditions¶
Attribute-value equality:
Session & {"subject_id": 1}
Session & {"subject_id": 1, "session_type": "training"}
Multiple key-value pairs are combined with AND.
Unmatched keys are silently ignored:
# If 'nonexistent' is not an attribute of Session:
Session & {"subject_id": 1, "nonexistent": "value"}
# Equivalent to:
Session & {"subject_id": 1} # unmatched key skipped
# If NO keys match, condition evaluates to True (all rows):
Session & {"nonexistent": "value"} # returns all rows
This applies to:
- Misspelled attribute names
- Hidden attributes (prefixed with _)
- Keys from a different table's schema
Rationale: Enables restricting with dicts containing extra keys (e.g., a row fetched from another table) without explicitly filtering to matching attributes.
Caution: Typos fail silently. Use string restrictions for explicit validation:
# SQL error if attribute doesn't exist
Session & "nonexistent = 'value'"
2.5 Restriction by Query Expression¶
Restrict to rows with matching primary keys in another expression:
# Sessions that have at least one trial
Session & Trial
# Sessions for active subjects only
Session & (Subject & "status = 'active'")
2.6 Collection Conditions (OR)¶
Lists, tuples, and sets create OR conditions:
# Either condition matches
Session & [{"subject_id": 1}, {"subject_id": 2}]
# Equivalent to
Session & "subject_id IN (1, 2)"
2.7 Anti-Restriction¶
The - operator selects rows that do NOT match:
# Sessions without any trials
Session - Trial
# Sessions not from subject 1
Session - {"subject_id": 1}
2.8 Chaining Restrictions¶
Sequential restrictions combine with AND:
(Session & cond1) & cond2
# Equivalent to
Session & cond1 & cond2
2.9 Semantic Matching¶
With semantic_check=True (default), expression conditions match only on homologous namesakes—attributes with the same name AND same lineage.
# Default: semantic matching
Session & Trial
# Disable semantic check (natural join on all namesakes)
Session.restrict(Trial, semantic_check=False)
2.10 Algebraic Properties¶
| Property | Value |
|---|---|
| Primary Key | Preserved: PK(result) = PK(input) |
| Attributes | Preserved: all attributes retained |
| Entity Type | Preserved |
2.11 Error Conditions¶
| Condition | Error |
|---|---|
| Unknown attribute in string | UnknownAttributeError |
| Non-homologous namesakes | DataJointError (semantic mismatch) |
3. Projection (.proj())¶
3.1 Syntax¶
result = expression.proj() # Primary key only
result = expression.proj(...) # All attributes
result = expression.proj('attr1', 'attr2') # PK + specified
result = expression.proj(..., '-secret') # All except secret
result = expression.proj(new_name='old_name') # Rename
result = expression.proj(computed='x + y') # Computed attribute
3.2 Attribute Selection¶
| Syntax | Meaning |
|---|---|
'attr' |
Include attribute |
... (Ellipsis) |
Include all secondary attributes |
'-attr' |
Exclude attribute (use with ...) |
Primary key attributes are always included, even if not specified.
3.3 Renaming Attributes¶
# Rename 'name' to 'subject_name'
Subject.proj(subject_name='name')
# Duplicate attribute with new name (parentheses preserve original)
Subject.proj('name', subject_name='(name)')
3.4 Computed Attributes¶
Create new attributes from SQL expressions:
# Arithmetic
Trial.proj(speed='distance / duration')
# Functions
Session.proj(year='YEAR(session_date)')
# Aggregation-like (per row)
Trial.proj(centered='value - mean_value')
3.5 Primary Key Renaming¶
Primary key attributes CAN be renamed:
Subject.proj(mouse_id='subject_id')
# Result PK: (mouse_id,) instead of (subject_id,)
3.6 Excluding Attributes¶
Use - prefix with ellipsis to exclude:
# All attributes except 'internal_notes'
Session.proj(..., '-internal_notes')
# Multiple exclusions
Session.proj(..., '-notes', '-metadata')
Cannot exclude primary key attributes.
3.7 Algebraic Properties¶
| Property | Value |
|---|---|
| Primary Key | Preserved (may be renamed) |
| Attributes | Selected/computed subset |
| Entity Type | Preserved |
3.8 Error Conditions¶
| Condition | Error |
|---|---|
| Attribute not found | UnknownAttributeError |
| Excluding PK attribute | DataJointError |
| Duplicate attribute name | DataJointError |
4. Join (*)¶
4.1 Syntax¶
result = A * B # Inner join
result = A.join(B, semantic_check=True, left=False)
4.2 Parameters¶
| Parameter | Default | Description |
|---|---|---|
semantic_check |
True |
Match only homologous namesakes |
left |
False |
LEFT JOIN (preserve all rows from A) |
4.3 Join Condition¶
With semantic_check=True (default), joins match only on homologous namesakes—attributes with the same name AND same lineage (traced through foreign keys to the same original definition):
# Session has (subject_id, session_id) and Trial has (subject_id, session_id, trial_id)
# Both subject_id attributes trace to Subject.subject_id via foreign keys
# Join matches on (subject_id, session_id)
Session * Trial
With semantic_check=False, joins match on all namesakes regardless of lineage (natural join):
# Match on any shared attribute names, ignoring lineage
A.join(B, semantic_check=False)
See Section 9: Semantic Matching for details on attribute lineage.
4.4 Primary Key Determination¶
The result's primary key depends on functional dependencies:
| Condition | Result PK | Attribute Order |
|---|---|---|
| A → B | PK(A) | A's attributes first |
| B → A | PK(B) | B's attributes first |
| Both | PK(A) | A's attributes first |
| Neither | PK(A) ∪ PK(B) | A's PK, then B's additional PK |
A → B means: All of B's primary key attributes exist in A (as PK or secondary).
4.5 Examples¶
# Session → Trial (Session's PK is subset of Trial's PK)
Session * Trial
# Result PK: (subject_id, session_id) — same as Session
# Neither determines the other
Subject * Experimenter
# Result PK: (subject_id, experimenter_id) — union of PKs
4.6 Left Join¶
Preserve all rows from left operand:
# All sessions, with trial data where available
Session.join(Trial, left=True)
Constraint: Left join requires A → B to prevent NULL values in result's primary key.
4.7 Semantic Matching¶
With semantic_check=True, only homologous namesakes are matched:
# Semantic join (default)
TableA * TableB
# Natural join (match all namesakes regardless of lineage)
TableA.join(TableB, semantic_check=False)
4.8 Algebraic Properties¶
| Property | Value |
|---|---|
| Primary Key | Depends on functional dependencies |
| Attributes | Union of both operands' attributes |
| Commutativity | Result rows same, but PK/order may differ |
4.9 Error Conditions¶
| Condition | Error |
|---|---|
| Different database connections | DataJointError |
| Non-homologous namesakes (semantic mode) | DataJointError |
| Left join without A → B | DataJointError |
5. Aggregation (.aggr())¶
5.1 Syntax¶
result = A.aggr(B, ...) # All A attributes
result = A.aggr(B, 'attr1', 'attr2') # PK + specified from A
result = A.aggr(B, ..., count='count(*)') # With aggregate
result = A.aggr(B, ..., exclude_nonmatching=True) # Only rows with matches
5.2 Parameters¶
| Parameter | Default | Description |
|---|---|---|
*attributes |
— | Attributes from A to include |
exclude_nonmatching |
False |
If True, exclude rows from A that have no matches in B (INNER JOIN). Default keeps all rows (LEFT JOIN). |
**named_attributes |
— | Computed aggregates |
5.3 Requirement¶
B must contain all primary key attributes of A. This enables grouping B's rows by A's primary key.
5.4 Aggregate Functions¶
# Count
Session.aggr(Trial, n_trials='count(*)')
# Sum, average, min, max
Session.aggr(Trial,
total='sum(score)',
avg_score='avg(score)',
best='max(score)',
worst='min(score)'
)
# Group concatenation
Session.aggr(Trial, trial_list='group_concat(trial_id)')
# Conditional count
Session.aggr(Trial, n_correct='sum(correct = 1)')
5.5 SQL Equivalent¶
SELECT A.pk1, A.pk2, A.secondary, agg_func(B.col) AS new_attr
FROM A
[LEFT] JOIN B USING (pk1, pk2)
WHERE <A restrictions>
GROUP BY A.pk1, A.pk2
HAVING <B restrictions>
5.6 Restriction Behavior¶
Restrictions on A attributes → WHERE clause (before GROUP BY) Restrictions on B attributes → HAVING clause (after GROUP BY)
# WHERE: only 2024 sessions, then count trials
(Session & "YEAR(session_date) = 2024").aggr(Trial, n='count(*)')
# HAVING: sessions with more than 10 trials
Session.aggr(Trial, n='count(*)') & "n > 10"
5.7 Default Behavior: Keep All Rows¶
By default (exclude_nonmatching=False), aggregation keeps all rows from A, even those without matches in B:
# All sessions included; those without trials have n=0
Session.aggr(Trial, n='count(trial_id)')
# Only sessions that have at least one trial
Session.aggr(Trial, n='count(trial_id)', exclude_nonmatching=True)
Note: Use count(pk_attr) rather than count(*) to correctly count 0 for sessions without trials. count(*) counts all rows including the NULL-filled left join row.
5.8 Algebraic Properties¶
| Property | Value |
|---|---|
| Primary Key | PK(A) — grouping expression's PK |
| Entity Type | Same as A |
5.9 Error Conditions¶
| Condition | Error |
|---|---|
| B missing A's PK attributes | DataJointError |
| Semantic mismatch | DataJointError |
6. Extension (.extend())¶
6.1 Syntax¶
result = A.extend(B)
result = A.extend(B, semantic_check=True)
6.2 Semantics¶
Extend is a left join that adds attributes from B while preserving A's entity identity:
A.extend(B)
# Equivalent to:
A.join(B, left=True)
6.3 Requirement¶
A must determine B (A → B). All of B's primary key attributes must exist in A.
6.4 Use Case¶
Add optional attributes without losing rows:
# Add experimenter info to sessions (some sessions may lack experimenter)
Session.extend(Experimenter)
6.5 Algebraic Properties¶
| Property | Value |
|---|---|
| Primary Key | PK(A) |
| Attributes | A's attributes + B's non-PK attributes |
| Entity Type | Same as A |
6.6 Error Conditions¶
| Condition | Error |
|---|---|
| A does not determine B | DataJointError |
7. Union (+)¶
7.1 Syntax¶
result = A + B
7.2 Requirements¶
- Same connection: Both from same database
- Same primary key: Identical PK attributes (names and types)
- No secondary attribute overlap: A and B cannot share secondary attributes
7.3 Semantics¶
Combines entity sets from both operands:
# All subjects that are either mice or rats
Mouse + Rat
7.4 Attribute Handling¶
| Scenario | Result |
|---|---|
| PK only in both | Union of PKs |
| A has secondary attrs | A's secondaries (NULL for B-only rows) |
| B has secondary attrs | B's secondaries (NULL for A-only rows) |
| Overlapping PKs | A's values take precedence |
7.5 SQL Implementation¶
-- With secondary attributes
(SELECT A.* FROM A LEFT JOIN B USING (pk))
UNION
(SELECT B.* FROM B WHERE (B.pk) NOT IN (SELECT A.pk FROM A))
7.6 Algebraic Properties¶
| Property | Value |
|---|---|
| Primary Key | PK(A) = PK(B) |
| Associative | (A + B) + C = A + (B + C) |
| Commutative | A + B has same rows as B + A |
7.7 Error Conditions¶
| Condition | Error |
|---|---|
| Different connections | DataJointError |
| Different primary keys | DataJointError |
| Overlapping secondary attributes | DataJointError |
8. Universal Sets (dj.U())¶
8.1 Syntax¶
dj.U() # Singular entity (one row, no attributes)
dj.U('attr1', 'attr2') # Set of all combinations
8.2 Unique Value Enumeration¶
Extract distinct values:
# All unique last names
dj.U('last_name') & Student
# All unique (year, month) combinations
dj.U('year', 'month') & Session.proj(year='YEAR(date)', month='MONTH(date)')
Result has specified attributes as primary key, with DISTINCT semantics.
8.3 Universal Aggregation¶
Aggregate entire table (no grouping):
# Count all students
dj.U().aggr(Student, n='count(*)')
# Result: single row with n = total count
# Global statistics
dj.U().aggr(Trial,
total='count(*)',
avg_score='avg(score)',
std_score='std(score)'
)
8.4 Arbitrary Grouping¶
Group by attributes not in original PK:
# Count students by graduation year
dj.U('grad_year').aggr(Student, n='count(*)')
# Monthly session counts
dj.U('year', 'month').aggr(
Session.proj(year='YEAR(date)', month='MONTH(date)'),
n='count(*)'
)
8.5 Primary Key Behavior¶
| Usage | Result PK |
|---|---|
dj.U() & table |
Empty (single row) |
dj.U('a', 'b') & table |
(a, b) |
dj.U().aggr(table, ...) |
Empty (single row) |
dj.U('a').aggr(table, ...) |
(a,) |
8.6 Restrictions¶
# U attributes must exist in the table
dj.U('name') & Student # OK: 'name' in Student
dj.U('invalid') & Student # Error: 'invalid' not found
8.7 Error Conditions¶
| Condition | Error |
|---|---|
table * dj.U() |
DataJointError (use & instead) |
dj.U() - table |
DataJointError (infinite set) |
| U attributes not in table | DataJointError |
dj.U().aggr(..., exclude_nonmatching=False) |
DataJointError (cannot keep all rows from infinite set) |
9. Semantic Matching¶
9.1 Attribute Lineage¶
Every attribute has a lineage tracing to its original definition:
schema.table.attribute
Foreign key inheritance preserves lineage:
class Session(dj.Manual):
definition = """
-> Subject # Inherits subject_id with Subject's lineage
session_id : int
"""
9.2 Homologous Namesakes¶
Two attributes are homologous namesakes if they have: 1. Same name 2. Same lineage (trace to same original definition)
9.3 Non-Homologous Namesakes¶
Attributes with same name but different lineage create semantic collisions:
# Both have 'name' but from different origins
Student * Course # Error if both have 'name' attribute
9.4 Resolution¶
Rename to avoid collisions:
Student * Course.proj(..., course_name='name')
9.5 Semantic Check Parameter¶
| Value | Behavior |
|---|---|
True (default) |
Match only homologous namesakes; error on collisions |
False |
Natural join on all namesakes regardless of lineage |
10. Operator Precedence¶
Python operator precedence applies:
| Precedence | Operator | Operation |
|---|---|---|
| Highest | * |
Join |
+, - |
Union, Anti-restriction | |
| Lowest | & |
Restriction |
Use parentheses for clarity:
(Session & condition) * Trial # Restrict then join
Session & (Trial * Stimulus) # Join then restrict
11. Subquery Generation¶
Subqueries are generated automatically when needed:
| Situation | Subquery Created |
|---|---|
| Restrict on computed attribute | Yes |
| Join on computed attribute | Yes |
| Aggregation operand | Yes |
| Union operand | Yes |
| Restriction after TOP | Yes |
12. Top (dj.Top)¶
12.1 Syntax¶
result = expression & dj.Top() # First row by primary key
result = expression & dj.Top(limit=5) # First 5 rows by primary key
result = expression & dj.Top(5, 'score DESC') # Top 5 by score descending
result = expression & dj.Top(10, order_by='date DESC') # Top 10 by date descending
result = expression & dj.Top(5, offset=10) # Skip 10, take 5
result = expression & dj.Top(None, 'score DESC') # All rows, ordered by score
12.2 Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
limit |
int or None |
1 |
Maximum rows to return. None = unlimited. |
order_by |
str, list[str], or None |
"KEY" |
Ordering. "KEY" = primary key order. None = inherit existing order. |
offset |
int |
0 |
Rows to skip before taking limit. |
12.3 Ordering Specification¶
| Format | Meaning |
|---|---|
"KEY" |
Order by primary key (ascending) |
"attr" |
Order by attribute (ascending) |
"attr DESC" |
Order by attribute (descending) |
"attr ASC" |
Order by attribute (ascending, explicit) |
["attr1 DESC", "attr2"] |
Multiple columns |
None |
Inherit ordering from existing Top |
12.4 SQL Equivalent¶
SELECT * FROM table
ORDER BY order_by
LIMIT limit OFFSET offset
12.5 Chaining Tops¶
When multiple Tops are chained, behavior depends on the order_by parameter:
| Scenario | Behavior |
|---|---|
Second Top has order_by=None |
Merge: inherits ordering, limits combined |
Both Tops have identical order_by |
Merge: ordering preserved, limits combined |
Tops have different order_by |
Subquery: first Top executed, then second applied |
Merge behavior:
- limit = minimum of both limits
- offset = sum of both offsets
- order_by = preserved from first Top
# Merge: same result, single query
(Table & dj.Top(10, "score DESC")) & dj.Top(5, order_by=None)
# Effective: Top(5, "score DESC", offset=0)
# Merge with offsets
(Table & dj.Top(10, "x", offset=5)) & dj.Top(3, order_by=None, offset=2)
# Effective: Top(3, "x", offset=7)
# Subquery: different orderings
(Table & dj.Top(10, "score DESC")) & dj.Top(3, "id ASC")
# First selects top 10 by score, then reorders those 10 by id and takes 3
12.6 Preview and Limit¶
When fetching with a limit parameter, the limit is applied as an additional Top that inherits existing ordering:
# User applies custom ordering
query = Table & dj.Top(order_by="score DESC")
# Preview respects the ordering
query.to_arrays("id", "score", limit=5) # Top 5 by score descending
Internally, to_arrays(..., limit=N) applies dj.Top(N, order_by=None), which inherits the existing ordering.
12.7 Use Cases¶
Top N rows:
# Top 10 highest scores
Result & dj.Top(10, "score DESC")
Pagination:
# Page 3 (rows 20-29) sorted by date
Session & dj.Top(10, "session_date DESC", offset=20)
Sampling (deterministic):
# First 100 rows by primary key
BigTable & dj.Top(100)
Ordering without limit:
# All rows ordered by date
Session & dj.Top(None, "session_date DESC")
12.8 Algebraic Properties¶
| Property | Value |
|---|---|
| Primary Key | Preserved: PK(result) = PK(input) |
| Attributes | Preserved: all attributes retained |
| Entity Type | Preserved |
| Row Order | Determined by order_by |
12.9 Error Conditions¶
| Condition | Error |
|---|---|
limit not int or None |
TypeError |
order_by not str, list[str], or None |
TypeError |
offset not int |
TypeError |
| Top in OR list | DataJointError |
| Top in AndList | DataJointError |
13. SQL Transpilation¶
This section describes how DataJoint translates query expressions to SQL.
13.1 MySQL Clause Evaluation Order¶
MySQL differs from standard SQL in clause evaluation:
Standard SQL: FROM → WHERE → GROUP BY → HAVING → SELECT
MySQL: FROM → WHERE → SELECT → GROUP BY → HAVING
This allows GROUP BY and HAVING clauses to use alias column names created by SELECT. DataJoint targets MySQL's behavior where column aliases can be used in HAVING.
13.2 QueryExpression Properties¶
Each QueryExpression represents a SQL SELECT statement with these properties:
| Property | SQL Clause | Description |
|---|---|---|
heading |
SELECT |
Attributes and computed expressions |
restriction |
WHERE / HAVING |
Filter conditions |
support |
FROM |
Base tables and joined tables |
_top |
ORDER BY / LIMIT |
Ordering and pagination |
_group_by |
GROUP BY |
Grouping (for aggregations) |
13.3 Modify-in-Place vs. Subquery Wrapping¶
When an operator is applied to a QueryExpression, DataJoint either:
- Modifies in place: Adds/modifies clauses in the existing SELECT statement
- Wraps as subquery: Creates a new outer SELECT with the input as a FROM subquery
The choice depends on SQL semantics—whether the operation can be expressed by adding clauses, or whether the input must first be materialized as a derived table.
13.4 Restriction Rules (& and -)¶
Modify in place when restricting on: - Base table attributes (columns from FROM tables) - Primary key attributes (including those inherited via FK)
Wrap as subquery when restricting on:
- Computed/aliased attributes (created by .proj())
- Aggregated attributes (created by .aggr())
- Any attribute after LIMIT/OFFSET has been applied
# Modify in place: condition added to WHERE
Session & "session_date > '2024-01-01'"
# SQL: SELECT ... FROM session WHERE session_date > '2024-01-01'
# Subquery required: restriction references computed alias
Session.proj(year='YEAR(session_date)') & "year = 2024"
# SQL: SELECT * FROM (
# SELECT ..., YEAR(session_date) as year FROM session
# ) AS _s1 WHERE year = 2024
Aggregation special case: Restrictions after aggregation go to HAVING (MySQL) or require subquery (PostgreSQL):
Session.aggr(Trial, n='count(*)') & "n > 10"
# MySQL: ... GROUP BY pk HAVING n > 10
# PostgreSQL: SELECT * FROM (...GROUP BY pk) AS _s1 WHERE n > 10
13.5 Projection Rules (.proj())¶
Modify in place when: - Selecting a subset of existing attributes - Renaming base attributes - Computing new attributes from base columns
Wrap as subquery when:
- Computing attribute from another computed attribute (alias of alias)
- Projecting after LIMIT/OFFSET has been applied
# Modify in place: adds computed column to SELECT
Session.proj(year='YEAR(session_date)')
# SQL: SELECT pk, YEAR(session_date) as year FROM session
# Subquery required: decade depends on year alias
Session.proj(year='YEAR(session_date)').proj(decade='year - year % 10')
# SQL: SELECT pk, year - year % 10 as decade FROM (
# SELECT pk, YEAR(session_date) as year FROM session
# ) AS _s1
13.5.1 Join Rules (*)¶
Modify in place (merge both operands' clauses) when:
- Both operands are base tables or simple projections
- Join attributes are base columns (not computed)
- Neither operand has LIMIT/OFFSET or GROUP BY
Wrap operand as subquery when operand has:
- Computed attributes used in join condition
- LIMIT/OFFSET applied
- GROUP BY (is an aggregation result)
- Is a union expression
# Modify in place: merge FROM and WHERE clauses
Session * Trial
# SQL: SELECT ... FROM session JOIN trial USING (pk)
# Subquery required: aggregation as join operand
Session * Session.aggr(Trial, n='count(*)')
# SQL: SELECT ... FROM session JOIN (
# SELECT pk, count(*) as n FROM session
# LEFT JOIN trial USING (pk) GROUP BY pk
# ) AS _s1 USING (pk)
13.5.2 Aggregation Rules (.aggr())¶
Aggregation produces a query with GROUP BY. The grouped expression is joined:
SELECT A.pk, A.secondary, agg_func(B.col) AS computed
FROM A LEFT JOIN B USING (pk)
WHERE <restrictions on A attributes>
GROUP BY A.pk
HAVING <restrictions on B or computed attributes>
Restrictions routing:
- On grouping expression (A) attributes → WHERE (before GROUP BY)
- On grouped expression (B) or computed attributes → HAVING (after GROUP BY)
Aggregation result requires subquery when used as operand to:
- Another join (*)
- Further restriction on computed attributes
- Another aggregation
13.5.3 Top/Limit Rules (dj.Top)¶
Any operation after Top requires subquery (the limited result must materialize):
(Session & dj.Top(10, 'date DESC')) & "notes IS NOT NULL"
# SQL: SELECT * FROM (
# SELECT * FROM session ORDER BY date DESC LIMIT 10
# ) AS _s1 WHERE notes IS NOT NULL
Exception: Chaining Tops with compatible ordering can merge (see Section 12.5)
13.6 Backend-Specific Transpilation¶
DataJoint supports multiple database backends with differing SQL dialects. The adapter layer handles:
13.6.1 Function Translation¶
Aggregate functions are translated bidirectionally:
| DataJoint | MySQL | PostgreSQL |
|---|---|---|
GROUP_CONCAT(col) |
GROUP_CONCAT(col) |
STRING_AGG(col::text, ',') |
STRING_AGG(col, sep) |
GROUP_CONCAT(col SEPARATOR sep) |
STRING_AGG(col, sep) |
Code using either syntax works on both backends.
13.6.2 HAVING Clause Alias References¶
MySQL allows column aliases in HAVING:
SELECT person_id, COUNT(*) as n_languages
FROM proficiency
GROUP BY person_id
HAVING n_languages >= 4 -- MySQL: OK, PostgreSQL: ERROR
PostgreSQL (following SQL standard) requires repeating the expression or using a subquery:
-- Option A: Repeat expression
HAVING COUNT(*) >= 4
-- Option B: Subquery wrapper
SELECT * FROM (
SELECT person_id, COUNT(*) as n_languages
FROM proficiency
GROUP BY person_id
) AS subq
WHERE n_languages >= 4
DataJoint approach: When generating SQL for PostgreSQL, if the HAVING clause references computed attribute aliases, the aggregation is wrapped in a subquery and the HAVING condition becomes a WHERE clause on the outer query.
| Backend | HAVING with alias | Generated SQL |
|---|---|---|
| MySQL | Direct | ... GROUP BY pk HAVING alias > 5 |
| PostgreSQL | Subquery wrapper | SELECT * FROM (...) AS subq WHERE alias > 5 |
This ensures consistent behavior across backends while maintaining query correctness.
13.6.3 Identifier Quoting¶
| Backend | Identifier Quote | Example |
|---|---|---|
| MySQL | Backtick | `table`.`column` |
| PostgreSQL | Double quote | "table"."column" |
13.6.4 String Literal Quoting¶
| Backend | String Literals | Identifier Literals |
|---|---|---|
| MySQL | 'value' or "value" |
`name` |
| PostgreSQL | 'value' only |
"name" |
In PostgreSQL, double quotes denote identifiers, not string literals. Dictionary restrictions handle this automatically; string restrictions must use single quotes for values.
13.7 Union SQL¶
Union performs an outer join:
(SELECT A.pk, A.secondary, NULL as B.secondary FROM A)
UNION
(SELECT B.pk, NULL as A.secondary, B.secondary FROM B
WHERE B.pk NOT IN (SELECT pk FROM A))
All union inputs become subqueries except unrestricted unions.
13.8 Query Backprojection¶
Before execution, finalize() recursively projects out unnecessary attributes from all inputs. This optimization:
- Reduces data transfer (especially for blobs)
- Compensates for MySQL's query optimizer limitations
- Produces leaner queries for complex expressions
13.9 Subquery Generation Summary¶
| Trigger | Reason | Strategy |
|---|---|---|
| Restrict on computed attribute | Alias must exist before WHERE references it | Wrap input |
| Restrict after LIMIT/OFFSET | Limited result must materialize first | Wrap input |
| Project alias from alias | First alias must materialize | Wrap input |
| Join on computed attribute | Join condition can't reference SELECT alias | Wrap operand |
| Aggregation as join operand | GROUP BY must complete first | Wrap operand |
| Restrict on aggregated attribute | HAVING result must materialize (PostgreSQL) | Wrap aggregation |
| Union operands | UNION requires complete subqueries | Wrap both |
| PostgreSQL HAVING with alias | Standard SQL compliance | Wrap + convert to WHERE |
14. Implementation Reference¶
| File | Purpose |
|---|---|
expression.py |
QueryExpression base class, operators |
condition.py |
Restriction condition handling, Top class |
heading.py |
Attribute metadata and lineage |
table.py |
Table class, fetch interface |
U.py |
Universal set implementation |
15. Quick Reference¶
| Operation | Syntax | Result PK |
|---|---|---|
| Restrict | A & cond |
PK(A) |
| Anti-restrict | A - cond |
PK(A) |
| Project | A.proj(...) |
PK(A) |
| Join | A * B |
Depends on A→B |
| Aggregate | A.aggr(B, ...) |
PK(A) |
| Extend | A.extend(B) |
PK(A) |
| Union | A + B |
PK(A) = PK(B) |
| Unique values | dj.U('x') & A |
(x,) |
| Global aggregate | dj.U().aggr(A, ...) |
() |