# A Simple Pipeline
This tutorial introduces DataJoint by building a simple research lab database. You'll learn to:
- Define tables with primary keys and dependencies
- Insert and query data
- Use the four core operations: restriction, projection, join, aggregation
- Understand the schema diagram
We'll work with Manual tables only—tables where you enter data directly. Later tutorials introduce automated computation.
Database Backend: All tutorials work identically on MySQL and PostgreSQL (PostgreSQL support added in DataJoint Python 2.1). The examples shown here were executed on PostgreSQL, but you can follow along using either backend.
For complete working examples, see:
- University Database — Academic records with complex queries
- Blob Detection — Image processing with computation
## Setup
import datajoint as dj
schema = dj.Schema('tutorial_first_pipeline')
[2026-02-06 11:44:37] DataJoint 2.1.0 connected to datajoint@127.0.0.1:5432
## The Domain: A Research Lab
We'll model a research lab that:
- Has researchers who conduct experiments
- Works with subjects (e.g., mice)
- Runs sessions where data is collected
- Collects recordings during each session
flowchart TD
Researcher --> Session
Subject --> Session
Session --> Recording
## Defining Tables
Each table is a Python class. The definition string specifies:
- Primary key (above ---) — uniquely identifies each row
- Attributes (below ---) — additional data for each row
- Dependencies (->) — references to other tables
@schema
class Researcher(dj.Manual):
definition = """
researcher_id : int32
---
researcher_name : varchar(100)
email : varchar(100)
"""
@schema
class Subject(dj.Manual):
definition = """
subject_id : int32
---
species : varchar(32)
date_of_birth : date
sex : enum('M', 'F', 'unknown')
"""
## Dependencies
A Session involves one researcher and one subject. The -> syntax creates a dependency (foreign key):
@schema
class Session(dj.Manual):
definition = """
-> Subject
session_date : date
---
-> Researcher
session_notes : varchar(255)
"""
The -> Subject in the primary key means:
- subject_id is automatically included in Session's primary key
- Combined with session_date, each session is uniquely identified
- You cannot create a session for a non-existent subject
The -> Researcher below the line is a non-primary dependency—it records who ran the session but isn't part of the unique identifier.
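Under the hood, each -> dependency becomes a foreign key enforced by the database backend. As a self-contained illustration of the same constraint (using Python's built-in sqlite3, not DataJoint):

```python
import sqlite3

# Illustration only: DataJoint generates the real DDL for MySQL/PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # sqlite requires opting in
conn.execute("CREATE TABLE subject (subject_id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE session (
        subject_id   INTEGER REFERENCES subject (subject_id),
        session_date TEXT,
        PRIMARY KEY (subject_id, session_date)  -- composite key, like -> Subject
    )
""")
conn.execute("INSERT INTO subject VALUES (1)")
conn.execute("INSERT INTO session VALUES (1, '2024-06-01')")  # parent exists: OK
try:
    conn.execute("INSERT INTO session VALUES (99, '2024-06-01')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)  # no subject 99, so the insert is refused
```

The failing insert at the end is exactly what the dependency guarantees in the pipeline: no Session row can reference a Subject that does not exist.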
@schema
class Recording(dj.Manual):
definition = """
-> Session
recording_id : int16
---
duration : decimal(5,1) # recording duration (seconds)
quality : enum('good', 'fair', 'poor')
"""
## Schema Diagram
The diagram shows tables and their dependencies:
dj.Diagram(schema)
Reading the diagram:
- Green boxes = Manual tables (you enter data)
- Solid lines = Primary key dependencies (part of identity)
- Dashed lines = Non-primary dependencies (references)
## Inserting Data
Data must be inserted in dependency order—you can't reference something that doesn't exist.
# First: tables with no dependencies
Researcher.insert([
{'researcher_id': 1, 'researcher_name': 'Alice Chen',
'email': 'alice@lab.org'},
{'researcher_id': 2, 'researcher_name': 'Bob Smith',
'email': 'bob@lab.org'},
])
Subject.insert([
{'subject_id': 1, 'species': 'mouse',
'date_of_birth': '2024-01-15', 'sex': 'M'},
{'subject_id': 2, 'species': 'mouse',
'date_of_birth': '2024-01-20', 'sex': 'F'},
{'subject_id': 3, 'species': 'mouse',
'date_of_birth': '2024-02-01', 'sex': 'M'},
])
# Then: tables that depend on others
Session.insert([
{'subject_id': 1, 'session_date': '2024-06-01',
'researcher_id': 1, 'session_notes': 'First session'},
{'subject_id': 1, 'session_date': '2024-06-15',
'researcher_id': 1, 'session_notes': 'Follow-up'},
{'subject_id': 2, 'session_date': '2024-06-10',
'researcher_id': 2, 'session_notes': 'Initial recording'},
])
# Finally: tables at the bottom of the hierarchy
Recording.insert([
{'subject_id': 1, 'session_date': '2024-06-01', 'recording_id': 1,
'duration': 300.5, 'quality': 'good'},
{'subject_id': 1, 'session_date': '2024-06-01', 'recording_id': 2,
'duration': 450.0, 'quality': 'good'},
{'subject_id': 1, 'session_date': '2024-06-15', 'recording_id': 1,
'duration': 600.0, 'quality': 'fair'},
{'subject_id': 2, 'session_date': '2024-06-10', 'recording_id': 1,
'duration': 350.0, 'quality': 'good'},
])
## Viewing Data
Display a table by calling it:
Subject()
| subject_id | species | date_of_birth | sex |
|---|---|---|---|
| 1 | mouse | 2024-01-15 | M |
| 2 | mouse | 2024-01-20 | F |
| 3 | mouse | 2024-02-01 | M |
Total: 3
Recording()
| subject_id | session_date | recording_id | duration | quality |
|---|---|---|---|---|
| 1 | 2024-06-01 | 1 | 300.5 | good |
| 1 | 2024-06-01 | 2 | 450.0 | good |
| 1 | 2024-06-15 | 1 | 600.0 | fair |
| 2 | 2024-06-10 | 1 | 350.0 | good |
Total: 4
## The Four Core Operations
DataJoint queries use four fundamental operations that compose to express complex questions about your data.
### 1. Restriction (&) — Filter rows
Keep only rows matching a condition:
# Subjects that are male
Subject & {'sex': 'M'}
| subject_id | species | date_of_birth | sex |
|---|---|---|---|
| 1 | mouse | 2024-01-15 | M |
| 3 | mouse | 2024-02-01 | M |
Total: 2
# Recordings with good quality
Recording & {'quality': 'good'}
| subject_id | session_date | recording_id | duration | quality |
|---|---|---|---|---|
| 1 | 2024-06-01 | 1 | 300.5 | good |
| 1 | 2024-06-01 | 2 | 450.0 | good |
| 2 | 2024-06-10 | 1 | 350.0 | good |
Total: 3
# Sessions for subject 1
Session & {'subject_id': 1}
| subject_id | session_date | researcher_id | session_notes |
|---|---|---|---|
| 1 | 2024-06-01 | 1 | First session |
| 1 | 2024-06-15 | 1 | Follow-up |
Total: 2
### 2. Projection (.proj()) — Select columns
Choose which attributes to return, or compute new ones:
# Just names and emails
Researcher.proj('researcher_name', 'email')
| researcher_id | researcher_name | email |
|---|---|---|
| 1 | Alice Chen | alice@lab.org |
| 2 | Bob Smith | bob@lab.org |
Total: 2
# Compute duration in minutes
Recording.proj(duration_min='duration / 60')
| subject_id | session_date | recording_id | duration_min |
|---|---|---|---|
| 1 | 2024-06-01 | 1 | 5.0083333333333333 |
| 1 | 2024-06-01 | 2 | 7.5000000000000000 |
| 1 | 2024-06-15 | 1 | 10.0000000000000000 |
| 2 | 2024-06-10 | 1 | 5.8333333333333333 |
Total: 4
### 3. Join (*) — Combine tables
Merge data from related tables:
# Sessions with subject info
Session * Subject
| subject_id | session_date | researcher_id | session_notes | species | date_of_birth | sex |
|---|---|---|---|---|---|---|
| 1 | 2024-06-01 | 1 | First session | mouse | 2024-01-15 | M |
| 1 | 2024-06-15 | 1 | Follow-up | mouse | 2024-01-15 | M |
| 2 | 2024-06-10 | 2 | Initial recording | mouse | 2024-01-20 | F |
Total: 3
# Full recording details with subject and researcher
(Recording * Session * Subject * Researcher).proj(
'researcher_name', 'species', 'duration', 'quality'
)
| subject_id | session_date | recording_id | duration | quality | species | researcher_name |
|---|---|---|---|---|---|---|
| 1 | 2024-06-01 | 1 | 300.5 | good | mouse | Alice Chen |
| 1 | 2024-06-01 | 2 | 450.0 | good | mouse | Alice Chen |
| 1 | 2024-06-15 | 1 | 600.0 | fair | mouse | Alice Chen |
| 2 | 2024-06-10 | 1 | 350.0 | good | mouse | Bob Smith |
Total: 4
### 4. Aggregation (.aggr()) — Summarize groups
Compute statistics across groups of rows:
# Count recordings per session
Session.aggr(Recording, num_recordings='count(*)')
| subject_id | session_date | num_recordings |
|---|---|---|
| 1 | 2024-06-01 | 2 |
| 1 | 2024-06-15 | 1 |
| 2 | 2024-06-10 | 1 |
Total: 3
# Total recording time per subject
Subject.aggr(Recording, total_duration='sum(duration)')
| subject_id | total_duration |
|---|---|
| 1 | 1350.5 |
| 2 | 350.0 |
| 3 | None |
Total: 3
Note that subject 3 has no recordings, so its total is None; the aggregation keeps every subject, even those with nothing to sum.
# Average duration across all recordings
dj.U().aggr(Recording, avg_duration='avg(duration)')
| avg_duration |
|---|
| 425.1250000000000000 |
Total: 1
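These aggregate values can be sanity-checked offline with plain Python, using the four duration values inserted earlier (no database needed):

```python
from decimal import Decimal

# The four recording durations inserted earlier in this tutorial
durations = [Decimal("300.5"), Decimal("450.0"), Decimal("600.0"), Decimal("350.0")]

total = sum(durations)            # matches sum(duration) over all recordings
average = total / len(durations)  # matches avg(duration)
print(total, average)  # 1700.5 425.125
```

The subject-level totals check out the same way: subject 1's three recordings sum to 1350.5, as shown above.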
## Combining Operations
Operations chain together to answer complex questions:
# Good-quality recordings for male subjects, with researcher name
(
Recording
& {'quality': 'good'}
& (Subject & {'sex': 'M'})
) * Session * Researcher.proj('researcher_name')
| subject_id | session_date | recording_id | duration | quality | researcher_id | session_notes | researcher_name |
|---|---|---|---|---|---|---|---|
| 1 | 2024-06-01 | 1 | 300.5 | good | 1 | First session | Alice Chen |
| 1 | 2024-06-01 | 2 | 450.0 | good | 1 | First session | Alice Chen |
Total: 2
# Count of good recordings per researcher
Researcher.aggr(
Session * (Recording & {'quality': 'good'}),
good_recordings='count(*)'
)
| researcher_id | good_recordings |
|---|---|
| 1 | 2 |
| 2 | 1 |
Total: 2
## Fetching Data
To get data into Python, use fetch methods:
# Fetch as list of dicts
Subject.to_dicts()
[{'subject_id': 1,
'species': 'mouse',
'date_of_birth': datetime.date(2024, 1, 15),
'sex': 'M'},
{'subject_id': 2,
'species': 'mouse',
'date_of_birth': datetime.date(2024, 1, 20),
'sex': 'F'},
{'subject_id': 3,
'species': 'mouse',
'date_of_birth': datetime.date(2024, 2, 1),
'sex': 'M'}]
# Fetch specific attributes as arrays
durations = (Recording & {'quality': 'good'}).to_arrays('duration')
print(f"Good recording durations: {durations}")
Good recording durations: [Decimal('300.5') Decimal('450.0') Decimal('350.0')]
# Fetch one row
one_subject = (Subject & {'subject_id': 1}).fetch1()
print(f"Subject 1: {one_subject}")
Subject 1: {'subject_id': 1, 'species': 'mouse', 'date_of_birth': datetime.date(2024, 1, 15), 'sex': 'M'}
## Deleting Data
Deleting respects dependencies—downstream data is deleted automatically:
# This would delete the session AND all its recordings
# (Session & {'subject_id': 2, 'session_date': '2024-06-10'}).delete()
# Uncomment to try (will prompt for confirmation)
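To see what a cascade would remove before committing to it, restrict the downstream table by the same key. A plain-Python sketch of that matching logic on the sample rows above (illustration only, not DataJoint code):

```python
# The recordings inserted earlier, reduced to their key attributes
recordings = [
    {"subject_id": 1, "session_date": "2024-06-01", "recording_id": 1},
    {"subject_id": 1, "session_date": "2024-06-01", "recording_id": 2},
    {"subject_id": 1, "session_date": "2024-06-15", "recording_id": 1},
    {"subject_id": 2, "session_date": "2024-06-10", "recording_id": 1},
]

# Key of the session we are considering deleting
key = {"subject_id": 2, "session_date": "2024-06-10"}

# Rows sharing the session's key would be deleted by the cascade
affected = [r for r in recordings if all(r[k] == v for k, v in key.items())]
print(len(affected))  # 1 recording would go along with the session
```

In the pipeline itself, the equivalent preview is a restriction such as len(Recording & key) before calling .delete().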
## Summary
You've learned the fundamentals of DataJoint:
| Concept | Description |
|---|---|
| Tables | Python classes with a definition string |
| Primary key | Above ---, uniquely identifies rows |
| Dependencies | -> creates foreign keys |
| Restriction | & filters rows |
| Projection | .proj() selects/computes columns |
| Join | * combines tables |
| Aggregation | .aggr() summarizes groups |
## Next Steps
- Schema Design — Primary keys, relationships, table tiers
- Queries — Advanced query patterns
- Computation — Automated processing with Imported/Computed tables
## Complete Examples
- University Database — Complex queries on academic records
- Blob Detection — Image processing pipeline with computation
# Cleanup
schema.drop(prompt=False)