Data Entry¶
This tutorial covers how to manipulate data in DataJoint tables. You'll learn:
- Insert — Adding rows to tables
- Update — Modifying existing rows (for corrections)
- Delete — Removing rows with cascading
- Validation — Checking data before insertion
DataJoint is designed around insert and delete as the primary operations. Updates are intentionally limited to surgical corrections.
import datajoint as dj
import numpy as np
schema = dj.Schema('tutorial_data_entry')
[2026-02-06 11:44:48] DataJoint 2.1.0 connected to datajoint@127.0.0.1:5432
# Define tables for this tutorial
@schema
class Lab(dj.Manual):
    definition = """
    lab_id : varchar(16)
    ---
    lab_name : varchar(100)
    """

@schema
class Subject(dj.Manual):
    definition = """
    subject_id : varchar(16)
    ---
    -> Lab
    species : varchar(50)
    date_of_birth : date
    notes = '' : varchar(1000)
    """

@schema
class Session(dj.Manual):
    definition = """
    -> Subject
    session_idx : int32
    ---
    session_date : date
    duration : decimal(4,1)  # minutes
    """

    class Trial(dj.Part):
        definition = """
        -> master
        trial_idx : int32
        ---
        outcome : enum('hit', 'miss', 'false_alarm', 'correct_reject')
        reaction_time : decimal(3,2)  # seconds
        """

@schema
class ProcessedData(dj.Computed):
    definition = """
    -> Session
    ---
    hit_rate : float32
    """

    def make(self, key):
        outcomes = (Session.Trial & key).to_arrays('outcome')
        n_trials = len(outcomes)
        hit_rate = np.sum(outcomes == 'hit') / n_trials if n_trials else 0.0
        self.insert1({**key, 'hit_rate': hit_rate})
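The hit-rate arithmetic inside make() can be checked on a plain numpy array, independent of any database. This is a standalone sketch of the same computation on a hand-built outcomes array:

```python
import numpy as np

# Same computation as make(), on a hand-built outcomes array
outcomes = np.array(['hit', 'miss', 'hit', 'false_alarm', 'hit'])
n_trials = len(outcomes)
hit_rate = np.sum(outcomes == 'hit') / n_trials if n_trials else 0.0
print(hit_rate)  # 3 hits out of 5 trials -> 0.6
```

The `if n_trials else 0.0` guard mirrors make()'s handling of a session with no trials, avoiding a division by zero.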
insert1() — Single Row¶
Use insert1() to add a single row, typically as a dictionary:
# Insert a single row
Lab.insert1({'lab_id': 'tolias', 'lab_name': 'Tolias Lab'})
Subject.insert1({
    'subject_id': 'M001',
    'lab_id': 'tolias',
    'species': 'Mus musculus',
    'date_of_birth': '2026-01-15'
})
Subject()
| subject_id | lab_id | species | date_of_birth | notes |
|---|---|---|---|---|
| M001 | tolias | Mus musculus | 2026-01-15 | |
Total: 1
insert() — Multiple Rows¶
Use insert() to add multiple rows at once. This is more efficient than calling insert1() in a loop.
# Insert multiple rows as a list of dictionaries
Subject.insert([
    {
        'subject_id': 'M002',
        'lab_id': 'tolias',
        'species': 'Mus musculus',
        'date_of_birth': '2026-02-01'
    },
    {
        'subject_id': 'M003',
        'lab_id': 'tolias',
        'species': 'Mus musculus',
        'date_of_birth': '2026-02-15'
    },
])
Subject()
| subject_id | lab_id | species | date_of_birth | notes |
|---|---|---|---|---|
| M001 | tolias | Mus musculus | 2026-01-15 | |
| M002 | tolias | Mus musculus | 2026-02-01 | |
| M003 | tolias | Mus musculus | 2026-02-15 | |
Total: 3
Accepted Input Formats¶
insert() accepts several formats:
| Format | Example |
|---|---|
| List of dicts | [{'id': 1, 'name': 'A'}, ...] |
| pandas DataFrame | pd.DataFrame({'id': [1, 2], 'name': ['A', 'B']}) |
| numpy structured array | np.array([(1, 'A')], dtype=[('id', int), ('name', 'U10')]) |
| QueryExpression | OtherTable.proj(...) (INSERT...SELECT) |
# Insert from pandas DataFrame
import pandas as pd
df = pd.DataFrame({
    'subject_id': ['M004', 'M005'],
    'lab_id': ['tolias', 'tolias'],
    'species': ['Mus musculus', 'Mus musculus'],
    'date_of_birth': ['2026-03-01', '2026-03-15']
})
Subject.insert(df)
print(f"Total subjects: {len(Subject())}")
Total subjects: 5
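The structured-array format from the table above can be built as follows. The subject IDs M010 and M011 are hypothetical, and the final insert call is shown commented since it needs a live connection:

```python
import numpy as np

# Build a structured array whose field names match Subject's attributes
subjects = np.array(
    [('M010', 'tolias', 'Mus musculus', '2026-04-01'),
     ('M011', 'tolias', 'Mus musculus', '2026-04-15')],
    dtype=[('subject_id', 'U16'), ('lab_id', 'U16'),
           ('species', 'U50'), ('date_of_birth', 'U10')],
)
print(subjects['subject_id'])
# Subject.insert(subjects)  # would insert both rows
```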
Handling Duplicates¶
By default, inserting a row with an existing primary key raises an error:
# This will raise an error - duplicate primary key
try:
    Subject.insert1({'subject_id': 'M001', 'lab_id': 'tolias',
                     'species': 'Mus musculus', 'date_of_birth': '2026-01-15'})
except Exception as e:
    print(f"Error: {type(e).__name__}")
    print("Cannot insert duplicate primary key!")
Error: DuplicateError
Cannot insert duplicate primary key!
Use skip_duplicates=True to silently skip rows with existing keys:
# Skip duplicates - existing row unchanged
Subject.insert1(
    {'subject_id': 'M001', 'lab_id': 'tolias', 'species': 'Mus musculus', 'date_of_birth': '2026-01-15'},
    skip_duplicates=True
)
print("Insert completed (duplicate skipped)")
Insert completed (duplicate skipped)
Note: replace=True is also available but has the same caveats as update1()—it bypasses immutability and can break provenance. Use sparingly for corrections only.
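The behavioral difference between skip_duplicates and replace can be pictured with a toy in-memory model. This is an analogy only, not DataJoint's implementation; `table` and `toy_insert1` are invented names:

```python
# Toy model: a dict keyed by primary key stands in for a table
table = {'M001': {'notes': 'original'}}

def toy_insert1(pk, row, skip_duplicates=False, replace=False):
    """Mimic duplicate handling: error, skip, or overwrite."""
    if pk in table:
        if replace:
            table[pk] = row        # overwrite, like replace=True
        elif skip_duplicates:
            return                 # keep existing row, like skip_duplicates=True
        else:
            raise ValueError('DuplicateError')
    else:
        table[pk] = row

toy_insert1('M001', {'notes': 'edited'}, skip_duplicates=True)
print(table['M001']['notes'])  # still 'original' -- duplicate skipped
toy_insert1('M001', {'notes': 'edited'}, replace=True)
print(table['M001']['notes'])  # now 'edited' -- row replaced
```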
Extra Fields¶
By default, inserting a row with fields not in the table raises an error:
try:
    Subject.insert1({'subject_id': 'M006', 'lab_id': 'tolias',
                     'species': 'Mus musculus', 'date_of_birth': '2026-04-01',
                     'unknown_field': 'some value'})  # Unknown field!
except Exception as e:
    print(f"Error: {type(e).__name__}")
    print("Field 'unknown_field' not in table!")
Error: KeyError
Field 'unknown_field' not in table!
# Use ignore_extra_fields=True to silently ignore unknown fields
Subject.insert1(
    {'subject_id': 'M006', 'lab_id': 'tolias', 'species': 'Mus musculus',
     'date_of_birth': '2026-04-01', 'unknown_field': 'ignored'},
    ignore_extra_fields=True
)
print(f"Total subjects: {len(Subject())}")
Total subjects: 6
Master-Part Tables and Transactions¶
Compositional integrity means that a master and all its parts must be inserted (or deleted) as an atomic unit. This ensures downstream computations see complete data.
- Auto-populated tables (Computed, Imported) enforce this automatically: make() runs inside a transaction
- Manual tables require explicit transactions to maintain compositional integrity
Inserting Master with Parts¶
# Use a transaction to ensure master and parts are inserted atomically
with dj.conn().transaction:
    Session.insert1({
        'subject_id': 'M001',
        'session_idx': 1,
        'session_date': '2026-01-06',
        'duration': 45.5
    })
    Session.Trial.insert([
        {'subject_id': 'M001', 'session_idx': 1, 'trial_idx': 1,
         'outcome': 'hit', 'reaction_time': 0.35},
        {'subject_id': 'M001', 'session_idx': 1, 'trial_idx': 2,
         'outcome': 'miss', 'reaction_time': 0.82},
        {'subject_id': 'M001', 'session_idx': 1, 'trial_idx': 3,
         'outcome': 'hit', 'reaction_time': 0.41},
        {'subject_id': 'M001', 'session_idx': 1, 'trial_idx': 4,
         'outcome': 'false_alarm', 'reaction_time': 0.28},
        {'subject_id': 'M001', 'session_idx': 1, 'trial_idx': 5,
         'outcome': 'hit', 'reaction_time': 0.39},
    ])
# Both master and parts committed together, or neither if error occurred
Session.Trial()
| subject_id | session_idx | trial_idx | outcome | reaction_time |
|---|---|---|---|---|
| M001 | 1 | 1 | hit | 0.35 |
| M001 | 1 | 2 | miss | 0.82 |
| M001 | 1 | 3 | hit | 0.41 |
| M001 | 1 | 4 | false_alarm | 0.28 |
| M001 | 1 | 5 | hit | 0.39 |
Total: 5
Update Operations¶
DataJoint provides only update1() for modifying single rows. This is intentional—updates bypass the normal workflow and should be used sparingly for corrective operations.
When to Use Updates¶
Appropriate uses:
- Fixing data entry errors (typos, wrong values)
- Adding notes or metadata after the fact
- Administrative corrections
Inappropriate uses (use delete + insert + populate instead):
- Regular workflow operations
- Changes that should trigger recomputation
# Update a single row - must provide all primary key values
Subject.update1({'subject_id': 'M001', 'notes': 'Primary subject for behavioral study'})
(Subject & {'subject_id': 'M001'}).fetch1()
{'subject_id': 'M001',
'lab_id': 'tolias',
'species': 'Mus musculus',
'date_of_birth': datetime.date(2026, 1, 15),
'notes': 'Primary subject for behavioral study'}
# Update multiple attributes at once
Subject.update1({
    'subject_id': 'M002',
    'notes': 'Control group',
    'species': 'Mus musculus (C57BL/6)'  # More specific
})
(Subject & {'subject_id': 'M002'}).fetch1()
{'subject_id': 'M002',
'lab_id': 'tolias',
'species': 'Mus musculus (C57BL/6)',
'date_of_birth': datetime.date(2026, 2, 1),
'notes': 'Control group'}
Update Requirements¶
- Complete primary key: All PK attributes must be provided
- Exactly one match: Must match exactly one existing row
- No restrictions: Cannot call on a restricted table
# Error: incomplete primary key
try:
    Subject.update1({'notes': 'Missing subject_id!'})
except Exception as e:
    print(f"Error: {type(e).__name__}")
    print("Primary key must be complete")
Error: DataJointError
Primary key must be complete
# Error: cannot update restricted table
try:
    (Subject & {'subject_id': 'M001'}).update1({'subject_id': 'M001', 'notes': 'test'})
except Exception as e:
    print(f"Error: {type(e).__name__}")
    print("Cannot update restricted table")
Error: DataJointError
Cannot update restricted table
Reset to Default¶
Setting an attribute to None resets it to its default value:
# Reset notes to default (empty string)
Subject.update1({'subject_id': 'M003', 'notes': None})
(Subject & {'subject_id': 'M003'}).fetch1()
{'subject_id': 'M003',
'lab_id': 'tolias',
'species': 'Mus musculus',
'date_of_birth': datetime.date(2026, 2, 15),
'notes': ''}
Delete Operations¶
delete() removes rows and cascades the deletion to all dependent tables, including computed results:
# First, let's see what we have
print(f"Sessions: {len(Session())}")
print(f"Trials: {len(Session.Trial())}")
# Populate computed table
ProcessedData.populate()
print(f"ProcessedData: {len(ProcessedData())}")
Sessions: 1
Trials: 5
ProcessedData: 1
# Delete a session - cascades to Trial and ProcessedData
(Session & {'subject_id': 'M001', 'session_idx': 1}).delete(prompt=False)
print("After delete:")
print(f"Sessions: {len(Session())}")
print(f"Trials: {len(Session.Trial())}")
print(f"ProcessedData: {len(ProcessedData())}")
[2026-02-06 11:44:48] Deleting 5 rows from "tutorial_data_entry"."session__trial"
[2026-02-06 11:44:48] Deleting 1 rows from "tutorial_data_entry"."__processed_data"
[2026-02-06 11:44:48] Deleting 1 rows from "tutorial_data_entry"."session"
After delete:
Sessions: 0
Trials: 0
ProcessedData: 0
Prompt Behavior¶
The prompt parameter controls whether delete() asks for confirmation. When prompt=None (default), the behavior is determined by dj.config['safemode']:
# Uses config['safemode'] setting (default)
(Table & condition).delete()
# Explicitly skip confirmation
(Table & condition).delete(prompt=False)
# Explicitly require confirmation
(Table & condition).delete(prompt=True)
# Add more data for demonstration
with dj.conn().transaction:
    Session.insert1({
        'subject_id': 'M002',
        'session_idx': 1,
        'session_date': '2026-01-07',
        'duration': 30.0
    })
    Session.Trial.insert([
        {'subject_id': 'M002', 'session_idx': 1, 'trial_idx': 1,
         'outcome': 'hit', 'reaction_time': 0.40},
        {'subject_id': 'M002', 'session_idx': 1, 'trial_idx': 2,
         'outcome': 'hit', 'reaction_time': 0.38},
    ])
# Delete with prompt=False (no confirmation prompt)
(Session & {'subject_id': 'M002', 'session_idx': 1}).delete(prompt=False)
[2026-02-06 11:44:48] Deleting 2 rows from "tutorial_data_entry"."session__trial"
[2026-02-06 11:44:48] Deleting 1 rows from "tutorial_data_entry"."session"
1
The Recomputation Pattern¶
When source data needs to change, the correct pattern is delete → insert → populate. This ensures all derived data remains consistent:
# Add a session with trials (using transaction for compositional integrity)
with dj.conn().transaction:
    Session.insert1({
        'subject_id': 'M003',
        'session_idx': 1,
        'session_date': '2026-01-08',
        'duration': 40.0
    })
    Session.Trial.insert([
        {'subject_id': 'M003', 'session_idx': 1, 'trial_idx': 1,
         'outcome': 'hit', 'reaction_time': 0.35},
        {'subject_id': 'M003', 'session_idx': 1, 'trial_idx': 2,
         'outcome': 'miss', 'reaction_time': 0.50},
    ])
# Compute results
ProcessedData.populate()
print("Before correction:", ProcessedData.fetch1())
Before correction: {'subject_id': 'M003', 'session_idx': 1, 'hit_rate': 0.5}
# Suppose we discovered trial 2 was actually a 'hit' not 'miss'
# WRONG: Updating the trial would leave ProcessedData stale!
# Session.Trial.update1({...}) # DON'T DO THIS
# CORRECT: Delete, reinsert, recompute
key = {'subject_id': 'M003', 'session_idx': 1}
# 1. Delete cascades to ProcessedData
(Session & key).delete(prompt=False)
# 2. Reinsert with corrected data (using transaction)
with dj.conn().transaction:
    Session.insert1({**key, 'session_date': '2026-01-08', 'duration': 40.0})
    Session.Trial.insert([
        {**key, 'trial_idx': 1, 'outcome': 'hit', 'reaction_time': 0.35},
        {**key, 'trial_idx': 2, 'outcome': 'hit', 'reaction_time': 0.50},
    ])
# 3. Recompute
ProcessedData.populate()
print("After correction:", ProcessedData.fetch1())
[2026-02-06 11:44:48] Deleting 2 rows from "tutorial_data_entry"."session__trial"
[2026-02-06 11:44:48] Deleting 1 rows from "tutorial_data_entry"."__processed_data"
[2026-02-06 11:44:48] Deleting 1 rows from "tutorial_data_entry"."session"
After correction: {'subject_id': 'M003', 'session_idx': 1, 'hit_rate': 1.0}
Validation¶
Use validate() to check data before insertion:
# Validate rows before inserting
rows_to_insert = [
    {'subject_id': 'M007', 'lab_id': 'tolias', 'species': 'Mus musculus', 'date_of_birth': '2026-05-01'},
    {'subject_id': 'M008', 'lab_id': 'tolias', 'species': 'Mus musculus', 'date_of_birth': '2026-05-15'},
]
result = Subject.validate(rows_to_insert)
if result:
    Subject.insert(rows_to_insert)
    print(f"Inserted {len(rows_to_insert)} rows")
else:
    print("Validation failed:")
    print(result.summary())
Inserted 2 rows
# Example of validation failure
bad_rows = [
    {'subject_id': 'M009', 'species': 'Mus musculus', 'date_of_birth': '2026-05-20'},  # Missing lab_id!
]
result = Subject.validate(bad_rows)
if not result:
    print("Validation failed!")
    for error in result.errors:
        print(f"  {error}")
Validation failed!
  (0, 'lab_id', "Required field 'lab_id' is missing")
Transactions¶
Single operations are atomic by default. Use explicit transactions for:
- Master-part inserts — Maintain compositional integrity
- Multi-table operations — All succeed or all fail
- Complex workflows — Coordinate related changes
# Atomic transaction - all inserts succeed or none do
with dj.conn().transaction:
    Session.insert1({
        'subject_id': 'M007',
        'session_idx': 1,
        'session_date': '2026-01-10',
        'duration': 35.0
    })
    Session.Trial.insert([
        {'subject_id': 'M007', 'session_idx': 1, 'trial_idx': 1,
         'outcome': 'hit', 'reaction_time': 0.33},
        {'subject_id': 'M007', 'session_idx': 1, 'trial_idx': 2,
         'outcome': 'miss', 'reaction_time': 0.45},
    ])
print(f"Session inserted with {len(Session.Trial & {'subject_id': 'M007'})} trials")
Session inserted with 2 trials
Best Practices¶
1. Prefer Insert/Delete Over Update¶
When source data changes, delete and reinsert rather than updating. Updates and replace=True bypass immutability and break provenance:
# Good: Delete and reinsert
(Trial & key).delete(prompt=False)
Trial.insert1(corrected_trial)
DerivedTable.populate()
# Avoid: Update that leaves derived data stale
Trial.update1({**key, 'value': new_value})
2. Use Transactions for Master-Part Inserts¶
# Ensures compositional integrity
with dj.conn().transaction:
    Session.insert1(session_data)
    Session.Trial.insert(trials)
3. Batch Inserts for Performance¶
# Good: Single insert call
Subject.insert(all_rows)
# Slow: Loop of insert1 calls
for row in all_rows:
    Subject.insert1(row)  # Creates many transactions
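For very large loads, a middle ground is to batch rows into fixed-size chunks so each insert() call stays a manageable size. Here `chunked` is a hypothetical helper, not part of DataJoint:

```python
def chunked(rows, size=1000):
    """Yield successive fixed-size batches from a list of rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

batches = list(chunked(list(range(10)), size=4))
print([len(b) for b in batches])  # [4, 4, 2]
# Hypothetical usage with a large row list:
# for batch in chunked(all_rows, size=5000):
#     Subject.insert(batch)
```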
4. Validate Before Insert¶
result = Subject.validate(rows)
if not result:
    raise ValueError(result.summary())
Subject.insert(rows)
5. Configure Safe Mode for Production¶
# In production scripts, explicitly control prompt behavior
(Subject & condition).delete(prompt=False) # No confirmation
# Or configure globally via settings
dj.config['safemode'] = True # Require confirmation by default
Quick Reference¶
| Operation | Method | Use Case |
|---|---|---|
| Insert one | insert1(row) | Adding single entity |
| Insert many | insert(rows) | Bulk data loading |
| Update one | update1(row) | Surgical corrections only |
| Delete | delete() | Removing entities (cascades) |
| Delete quick | delete_quick() | Internal cleanup (no cascade) |
| Validate | validate(rows) | Pre-insert check |
See the Data Manipulation Specification for complete details.
Next Steps¶
- Queries — Filtering, joining, and projecting data
- Computation — Building computational pipelines
# Cleanup
schema.drop(prompt=False)