Languages and Proficiency¶
This example demonstrates many-to-many relationships using an association table with international standards. You'll learn:
- Many-to-many relationships — People speak multiple languages; languages have multiple speakers
- Lookup tables — Standardized reference data (ISO language codes, CEFR levels)
- Association tables — Linking entities with additional attributes
- Complex queries — Aggregations, filtering, and joins
International Standards¶
This example uses two widely-adopted standards:
- ISO 639-1 — Two-letter language codes (
en,es,ja) - CEFR — Common European Framework of Reference for language proficiency (A1–C2)
Using international standards ensures data consistency and enables integration with external systems.
In [1]:
Copied!
import datajoint as dj
import numpy as np
from faker import Faker
dj.config['display.limit'] = 8
# Clean start
schema = dj.Schema('tutorial_languages')
schema.drop(prompt=False)
schema = dj.Schema('tutorial_languages')
import datajoint as dj
import numpy as np
from faker import Faker
dj.config['display.limit'] = 8
# Clean start
schema = dj.Schema('tutorial_languages')
schema.drop(prompt=False)
schema = dj.Schema('tutorial_languages')
[2026-02-06 11:46:39] DataJoint 2.1.0 connected to datajoint@127.0.0.1:5432
Lookup Tables¶
Lookup tables store standardized reference data that rarely changes. The contents attribute pre-populates them when the schema is created.
In [2]:
Copied!
@schema
class Language(dj.Lookup):
definition = """
# ISO 639-1 language codes
lang_code : char(2) # two-letter code (en, es, ja)
---
language : varchar(30) # full name
native_name : varchar(50) # name in native script
"""
contents = [
('ar', 'Arabic', 'العربية'),
('de', 'German', 'Deutsch'),
('en', 'English', 'English'),
('es', 'Spanish', 'Español'),
('fr', 'French', 'Français'),
('hi', 'Hindi', 'हिन्दी'),
('ja', 'Japanese', '日本語'),
('ko', 'Korean', '한국어'),
('pt', 'Portuguese', 'Português'),
('uk', 'Ukrainian', 'Українська'),
('zh', 'Chinese', '中文'),
]
@schema
class Language(dj.Lookup):
definition = """
# ISO 639-1 language codes
lang_code : char(2) # two-letter code (en, es, ja)
---
language : varchar(30) # full name
native_name : varchar(50) # name in native script
"""
contents = [
('ar', 'Arabic', 'العربية'),
('de', 'German', 'Deutsch'),
('en', 'English', 'English'),
('es', 'Spanish', 'Español'),
('fr', 'French', 'Français'),
('hi', 'Hindi', 'हिन्दी'),
('ja', 'Japanese', '日本語'),
('ko', 'Korean', '한국어'),
('pt', 'Portuguese', 'Português'),
('uk', 'Ukrainian', 'Українська'),
('zh', 'Chinese', '中文'),
]
In [3]:
Copied!
@schema
class CEFRLevel(dj.Lookup):
definition = """
# CEFR proficiency levels
cefr_level : char(2) # A1, A2, B1, B2, C1, C2
---
level_name : varchar(20) # descriptive name
category : enum('Basic', 'Independent', 'Proficient')
description : varchar(100) # can-do summary
"""
contents = [
('A1', 'Beginner', 'Basic',
'Can use familiar everyday expressions'),
('A2', 'Elementary', 'Basic',
'Can communicate in simple routine tasks'),
('B1', 'Intermediate', 'Independent',
'Can deal with most travel situations'),
('B2', 'Upper Intermediate', 'Independent',
'Can interact with fluency and spontaneity'),
('C1', 'Advanced', 'Proficient',
'Can express ideas fluently for professional use'),
('C2', 'Mastery', 'Proficient',
'Can understand virtually everything'),
]
@schema
class CEFRLevel(dj.Lookup):
definition = """
# CEFR proficiency levels
cefr_level : char(2) # A1, A2, B1, B2, C1, C2
---
level_name : varchar(20) # descriptive name
category : enum('Basic', 'Independent', 'Proficient')
description : varchar(100) # can-do summary
"""
contents = [
('A1', 'Beginner', 'Basic',
'Can use familiar everyday expressions'),
('A2', 'Elementary', 'Basic',
'Can communicate in simple routine tasks'),
('B1', 'Intermediate', 'Independent',
'Can deal with most travel situations'),
('B2', 'Upper Intermediate', 'Independent',
'Can interact with fluency and spontaneity'),
('C1', 'Advanced', 'Proficient',
'Can express ideas fluently for professional use'),
('C2', 'Mastery', 'Proficient',
'Can understand virtually everything'),
]
In [4]:
Copied!
print("Languages:")
print(Language())
print("\nCEFR Levels:")
print(CEFRLevel())
print("Languages:")
print(Language())
print("\nCEFR Levels:")
print(CEFRLevel())
Languages: *lang_code language native_name +-----------+ +----------+ +------------+ ar Arabic العربية de German Deutsch en English English es Spanish Español fr French Français hi Hindi हिन्दी ja Japanese 日本語 ko Korean 한국어 ... (Total: 11) CEFR Levels: *cefr_level level_name category description +------------+ +------------+ +------------+ +------------+ A1 Beginner Basic Can use famili A2 Elementary Basic Can communicat B1 Intermediate Independent Can deal with B2 Upper Intermed Independent Can interact w C1 Advanced Proficient Can express id C2 Mastery Proficient Can understand (Total: 6)
Entity and Association Tables¶
- Person — The main entity
- Proficiency — Association table linking Person, Language, and CEFRLevel
The association table's primary key includes both Person and Language, creating the many-to-many relationship.
In [5]:
Copied!
@schema
class Person(dj.Manual):
definition = """
# People with language skills
person_id : int32 # unique identifier
---
name : varchar(60)
date_of_birth : date
"""
@schema
class Person(dj.Manual):
definition = """
# People with language skills
person_id : int32 # unique identifier
---
name : varchar(60)
date_of_birth : date
"""
In [6]:
Copied!
@schema
class Proficiency(dj.Manual):
definition = """
# Language proficiency (many-to-many: person <-> language)
-> Person
-> Language
---
-> CEFRLevel
"""
@schema
class Proficiency(dj.Manual):
definition = """
# Language proficiency (many-to-many: person <-> language)
-> Person
-> Language
---
-> CEFRLevel
"""
In [7]:
Copied!
dj.Diagram(schema)
dj.Diagram(schema)
Out[7]:
Reading the diagram:
- Gray tables (Language, CEFRLevel) are Lookup tables
- Green table (Person) is Manual
- Solid lines indicate foreign keys in the primary key (many-to-many)
- Dashed line indicates foreign key in secondary attributes (reference)
Populate Sample Data¶
In [8]:
Copied!
np.random.seed(42)
fake = Faker()
fake.seed_instance(42)
# Generate 200 people
n_people = 200
Person.insert(
{
'person_id': i,
'name': fake.name(),
'date_of_birth': fake.date_of_birth(
minimum_age=18, maximum_age=70)
}
for i in range(n_people)
)
print(f"Created {len(Person())} people")
Person()
np.random.seed(42)
fake = Faker()
fake.seed_instance(42)
# Generate 200 people
n_people = 200
Person.insert(
{
'person_id': i,
'name': fake.name(),
'date_of_birth': fake.date_of_birth(
minimum_age=18, maximum_age=70)
}
for i in range(n_people)
)
print(f"Created {len(Person())} people")
Person()
Created 200 people
Out[8]:
| person_id | name | date_of_birth |
|---|---|---|
| 0 | Allison Hill | 1966-12-05 |
| 1 | Megan Mcclain | 1959-09-15 |
| 2 | Allen Robinson | 1981-11-18 |
| 3 | Cristian Santos | 1983-12-24 |
| 4 | Kevin Pacheco | 1955-06-11 |
| 5 | Melissa Peterson | 1963-05-04 |
| 6 | Gabrielle Davis | 1960-03-23 |
| 7 | Lindsey Roman | 1993-10-10 |
...
Total: 200
In [9]:
Copied!
# Assign random language proficiencies
lang_keys = list(Language.keys())
cefr_keys = list(CEFRLevel.keys())
# More people at intermediate levels than extremes
cefr_weights = [0.08, 0.12, 0.20, 0.25, 0.20, 0.15]
avg_languages = 2.5
for person_key in Person.keys():
n_langs = np.random.poisson(avg_languages)
if n_langs > 0:
selected_langs = np.random.choice(
len(lang_keys), min(n_langs, len(lang_keys)), replace=False)
Proficiency.insert(
{
**person_key,
**lang_keys[i],
**np.random.choice(cefr_keys, p=cefr_weights)
}
for i in selected_langs
)
print(f"Created {len(Proficiency())} proficiency records")
Proficiency()
# Assign random language proficiencies
lang_keys = list(Language.keys())
cefr_keys = list(CEFRLevel.keys())
# More people at intermediate levels than extremes
cefr_weights = [0.08, 0.12, 0.20, 0.25, 0.20, 0.15]
avg_languages = 2.5
for person_key in Person.keys():
n_langs = np.random.poisson(avg_languages)
if n_langs > 0:
selected_langs = np.random.choice(
len(lang_keys), min(n_langs, len(lang_keys)), replace=False)
Proficiency.insert(
{
**person_key,
**lang_keys[i],
**np.random.choice(cefr_keys, p=cefr_weights)
}
for i in selected_langs
)
print(f"Created {len(Proficiency())} proficiency records")
Proficiency()
Created 527 proficiency records
Out[9]:
| person_id | lang_code | cefr_level |
|---|---|---|
| 0 | ar | A1 |
| 0 | de | B2 |
| 0 | hi | C2 |
| 0 | uk | B2 |
| 3 | de | B2 |
| 3 | uk | A1 |
| 4 | hi | B2 |
| 4 | pt | C1 |
...
Total: 527
In [10]:
Copied!
# Proficient English speakers (C1 or C2)
proficient_english = (
Person.proj('name') &
(Proficiency & {'lang_code': 'en'} & "cefr_level >= 'C1'")
)
print(f"Proficient English speakers: {len(proficient_english)}")
proficient_english
# Proficient English speakers (C1 or C2)
proficient_english = (
Person.proj('name') &
(Proficiency & {'lang_code': 'en'} & "cefr_level >= 'C1'")
)
print(f"Proficient English speakers: {len(proficient_english)}")
proficient_english
Proficient English speakers: 18
Out[10]:
| person_id | name |
|---|---|
| 22 | Brian Burton |
| 32 | Elizabeth Brown |
| 33 | Angelica Tucker |
| 37 | Zachary Santos |
| 38 | Barbara Walker |
| 42 | Timothy Duncan |
| 53 | Whitney Peters |
| 67 | Teresa Taylor |
...
Total: 18
In [11]:
Copied!
# People who speak BOTH English AND Spanish
bilingual = (
Person.proj('name') &
(Proficiency & {'lang_code': 'en'}) &
(Proficiency & {'lang_code': 'es'})
)
print(f"English + Spanish speakers: {len(bilingual)}")
bilingual
# People who speak BOTH English AND Spanish
bilingual = (
Person.proj('name') &
(Proficiency & {'lang_code': 'en'}) &
(Proficiency & {'lang_code': 'es'})
)
print(f"English + Spanish speakers: {len(bilingual)}")
bilingual
English + Spanish speakers: 6
Out[11]:
| person_id | name |
|---|---|
| 38 | Barbara Walker |
| 67 | Teresa Taylor |
| 77 | Richard Henson |
| 113 | Denise Jones |
| 122 | Michael Powell |
| 137 | Lindsay Martinez |
Total: 6
In [12]:
Copied!
# People who speak English OR Spanish
either = (
Person.proj('name') &
(Proficiency & "lang_code in ('en', 'es')")
)
print(f"English or Spanish speakers: {len(either)}")
either
# People who speak English OR Spanish
either = (
Person.proj('name') &
(Proficiency & "lang_code in ('en', 'es')")
)
print(f"English or Spanish speakers: {len(either)}")
either
English or Spanish speakers: 79
Out[12]:
| person_id | name |
|---|---|
| 6 | Gabrielle Davis |
| 11 | David Garcia |
| 12 | Holly Wood |
| 17 | Daniel Hahn |
| 19 | Derek Wright |
| 20 | Kevin Hurst |
| 22 | Brian Burton |
| 27 | Sherri Baker |
...
Total: 79
Aggregations¶
In [13]:
Copied!
# People who speak 4+ languages
polyglots = Person.aggr(
Proficiency,
'name',
n_languages='COUNT(lang_code)',
languages='GROUP_CONCAT(lang_code)'
) & 'n_languages >= 4'
print(f"Polyglots (4+ languages): {len(polyglots)}")
polyglots
# People who speak 4+ languages
polyglots = Person.aggr(
Proficiency,
'name',
n_languages='COUNT(lang_code)',
languages='GROUP_CONCAT(lang_code)'
) & 'n_languages >= 4'
print(f"Polyglots (4+ languages): {len(polyglots)}")
polyglots
Polyglots (4+ languages): 56
Out[13]:
| person_id | name | n_languages | languages |
|---|---|---|---|
| 0 | Allison Hill | 4 | ar,hi,uk,de |
| 6 | Gabrielle Davis | 5 | hi,de,uk,fr,en |
| 8 | Valerie Gray | 4 | hi,zh,ko,de |
| 9 | Lisa Hensley | 4 | zh,pt,ar,ja |
| 11 | David Garcia | 4 | es,ko,zh,hi |
| 12 | Holly Wood | 4 | pt,en,uk,de |
| 14 | Nicholas Martin | 5 | zh,pt,hi,fr,ko |
| 15 | Margaret Hawkins DDS | 4 | ja,fr,uk,de |
...
Total: 56
In [14]:
Copied!
# Top 5 polyglots
top_polyglots = Person.aggr(
Proficiency,
'name',
n_languages='COUNT(lang_code)'
) & dj.Top(5, order_by='n_languages DESC')
top_polyglots
# Top 5 polyglots
top_polyglots = Person.aggr(
Proficiency,
'name',
n_languages='COUNT(lang_code)'
) & dj.Top(5, order_by='n_languages DESC')
top_polyglots
Out[14]:
| person_id | name | n_languages |
|---|---|---|
| 58 | Bryan Zamora | 7 |
| 20 | Kevin Hurst | 7 |
| 116 | Joshua Perry | 7 |
| 42 | Timothy Duncan | 7 |
| 77 | Richard Henson | 7 |
Total: 5
In [15]:
Copied!
# Number of speakers per language
speakers_per_lang = Language.aggr(
Proficiency,
'language',
n_speakers='COUNT(person_id)'
)
speakers_per_lang
# Number of speakers per language
speakers_per_lang = Language.aggr(
Proficiency,
'language',
n_speakers='COUNT(person_id)'
)
speakers_per_lang
Out[15]:
| lang_code | language | n_speakers |
|---|---|---|
| ar | Arabic | 41 |
| de | German | 55 |
| en | English | 45 |
| es | Spanish | 40 |
| fr | French | 49 |
| hi | Hindi | 54 |
| ja | Japanese | 47 |
| ko | Korean | 47 |
...
Total: 11
In [16]:
Copied!
# CEFR level distribution for English
english_levels = CEFRLevel.aggr(
Proficiency & {'lang_code': 'en'},
'level_name',
n_speakers='COUNT(person_id)'
)
english_levels
# CEFR level distribution for English
english_levels = CEFRLevel.aggr(
Proficiency & {'lang_code': 'en'},
'level_name',
n_speakers='COUNT(person_id)'
)
english_levels
Out[16]:
| cefr_level | level_name | n_speakers |
|---|---|---|
| A1 | Beginner | 1 |
| A2 | Elementary | 5 |
| B1 | Intermediate | 8 |
| B2 | Upper Intermediate | 13 |
| C1 | Advanced | 13 |
| C2 | Mastery | 5 |
Total: 6
Joining Tables¶
In [17]:
Copied!
# Full profile: person + language + proficiency details
full_profile = (
Person * Proficiency * Language * CEFRLevel
).proj('name', 'language', 'level_name', 'category')
# Show profile for person_id=0
full_profile & {'person_id': 0}
# Full profile: person + language + proficiency details
full_profile = (
Person * Proficiency * Language * CEFRLevel
).proj('name', 'language', 'level_name', 'category')
# Show profile for person_id=0
full_profile & {'person_id': 0}
Out[17]:
| person_id | lang_code | name | language | level_name | category |
|---|---|---|---|---|---|
| 0 | ar | Allison Hill | Arabic | Beginner | Basic |
| 0 | de | Allison Hill | German | Upper Intermediate | Independent |
| 0 | hi | Allison Hill | Hindi | Mastery | Proficient |
| 0 | uk | Allison Hill | Ukrainian | Upper Intermediate | Independent |
Total: 4
In [18]:
Copied!
# Find people with C1+ proficiency in multiple languages
advanced_polyglots = Person.aggr(
Proficiency & "cefr_level >= 'C1'",
'name',
n_advanced='COUNT(*)'
) & 'n_advanced >= 2'
print(f"Advanced in 2+ languages: {len(advanced_polyglots)}")
advanced_polyglots
# Find people with C1+ proficiency in multiple languages
advanced_polyglots = Person.aggr(
Proficiency & "cefr_level >= 'C1'",
'name',
n_advanced='COUNT(*)'
) & 'n_advanced >= 2'
print(f"Advanced in 2+ languages: {len(advanced_polyglots)}")
advanced_polyglots
Advanced in 2+ languages: 49
Out[18]:
| person_id | name | n_advanced |
|---|---|---|
| 6 | Gabrielle Davis | 2 |
| 7 | Lindsey Roman | 2 |
| 14 | Nicholas Martin | 3 |
| 19 | Derek Wright | 2 |
| 20 | Kevin Hurst | 3 |
| 25 | Melanie Herrera | 2 |
| 27 | Sherri Baker | 2 |
| 29 | Lisa Hernandez | 2 |
...
Total: 49
Key Concepts¶
| Pattern | Implementation |
|---|---|
| Many-to-many | Proficiency links Person and Language |
| Lookup tables | Language and CEFRLevel with contents |
| Association data | cefr_level stored in the association table |
| Standards | ISO 639-1 codes, CEFR levels |
Benefits of Lookup Tables¶
- Data consistency — Only valid codes can be used
- Rich metadata — Full names, descriptions stored once
- Easy updates — Change "Español" to "Spanish" in one place
- Self-documenting —
Language()shows all valid options
Next Steps¶
- University Database — Academic records
- Hotel Reservations — Workflow dependencies
- Queries Tutorial — Query operators in depth
In [19]:
Copied!
# Cleanup
schema.drop(prompt=False)
# Cleanup
schema.drop(prompt=False)