Concepts

Understanding the principles behind DataJoint.

DataJoint implements the Relational Workflow Model—a paradigm that extends relational databases with native support for computational workflows. This section explains the core concepts that make DataJoint pipelines reliable, reproducible, and scalable.

Core Concepts

Relational Workflow Model
How DataJoint differs from traditional databases. The paradigm shift from storage to workflow.

Entity Integrity
Primary keys and the three questions. Ensuring one-to-one correspondence between entities and records.

Normalization
Schema design principles. Organizing tables around workflow steps to minimize redundancy.

Query Algebra
The five operators: restriction, join, projection, aggregation, union. Workflow-aware query semantics.

Type System
Three-layer architecture: native, core, and codec types. In-table and in-store storage modes.

Computation Model
AutoPopulate and Jobs 2.0. Automated, reproducible, distributed computation.

Schema as a Workflow Specification
The schema as a formal language for expressing scientific workflows. Grammar, semantics, algebra, and machine-readability.

Comparison to Workflow Languages
How DataJoint relates to CWL, Snakemake, Nextflow, Airflow, and other workflow tools. What each offers, what each omits, and when to use both.

Custom Codecs
Extend DataJoint with domain-specific types. The codec extensibility system.

Data Pipelines
From workflows to complete data operations systems. Project structure and object-augmented schemas.

Semantic Matching
How DataJoint ensures safe joins through attribute lineage tracking.

What's New in 2.0
Major changes, new features, and migration guidance for DataJoint 2.0.

FAQ
How DataJoint compares to ORMs, workflow managers, and lakehouses. Common questions answered.

Why These Concepts Matter

Traditional databases store data. DataJoint pipelines also specify how that data is computed. Understanding the Relational Workflow Model helps you:

Design schemas that naturally express your workflow
Write queries that are both powerful and intuitive
Build computations that scale from laptop to cluster
Maintain data integrity throughout the pipeline lifecycle
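
To make the contrast between storing and processing data concrete, here is a minimal plain-Python sketch of the idea. It does not use or reproduce DataJoint's API: the dictionaries stand in for tables, and the names `recordings`, `mean_signal`, `make`, and `populate` are hypothetical, chosen only for illustration. The point it tries to show is that a downstream result is derived from an upstream entry with the same key, and that rerunning the computation only fills in what is missing.

```python
# Toy illustration only: plain dicts stand in for tables.
# Nothing here is DataJoint's API; all names are hypothetical.

# "Table" of manually entered recordings, keyed by a primary key.
recordings = {
    1: {"samples": [2.0, 4.0, 6.0]},
    2: {"samples": [1.0, 3.0]},
}

# "Table" of computed results; starts empty and is filled only by populate().
mean_signal = {}


def make(key):
    """Compute one downstream row from the upstream row with the same key."""
    samples = recordings[key]["samples"]
    return {"mean": sum(samples) / len(samples)}


def populate():
    """Compute every missing downstream row; return the keys processed."""
    pending = [key for key in recordings if key not in mean_signal]
    for key in pending:
        mean_signal[key] = make(key)
    return pending


first_run = populate()   # processes keys 1 and 2
second_run = populate()  # nothing is missing, so nothing is recomputed

recordings[3] = {"samples": [10.0]}
third_run = populate()   # only the newly added key 3 is processed
```

Because the pending work is determined by comparing keys in the two "tables", the same call can be repeated safely after new data arrives. The pages listed above describe how DataJoint itself approaches this.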