An open challenge from the International Linguistics Olympiad
Can a system reason out a language it has never seen?
Every year, the strongest young linguists are given a few lines of a language they have never seen and asked to work out the rest using nothing but logic. IOL‑AI 2026 gives your system the same puzzles under the same rules, and asks one question: how close can it get to an Olympiad medallist?
Submission deadline: 26 Jul 2026, 23:59 UTC
The mission
A reasoning benchmark you can't memorise your way through.
Linguistics problems cannot be solved from memorised facts. Each one is built on a language the solver has almost certainly never seen, so the only way to the answer is to reason from a handful of examples. That makes it a test of reasoning rather than recall.
IOL‑AI 2026 turns the International Linguistics Olympiad into a public, reproducible benchmark for that kind of reasoning. Your system faces the same contest the human contestants faced. Entries are scored automatically, and the strongest are reviewed by hand by the jury, on equal footing with the students. Everything runs in the open: public entries, open method, public leaderboard.
First principles only
No language can be looked up. Every answer is deduced from the data in the problem.
Human-comparable
The same problems, point values, and scoring the IOL uses for its own contestants.
Open and reproducible
Submit a public Hugging Face repository and a script. Anyone can rerun it and verify the result.
The task
Self-contained logic puzzles, no outside knowledge required.
Each problem gives a small amount of data from a language the solver has almost certainly never seen, plus the hints needed to solve it. From those examples, a solution works out enough of the grammar and vocabulary to translate new forms, fill in blanks, answer multiple-choice items, and correct mistakes.
You submit a Hugging Face repository containing a script.py. The platform
runs your script on a hidden test set from the IOL individual contest and scores the
answers it produces. Everything needed is in each problem's context, so no
prior knowledge of the languages is required.
Scoring
Get it exactly right and get close. You need both.
Exact match
Is the answer exactly right? Points-weighted using the official IOL point values, so harder, heavier items matter more.
chrF
Character n-gram overlap with the correct answer, giving partial credit for near-misses so the leaderboard is smooth rather than a step function.
Final score
The geometric mean of the two, on a 0 to 100 scale. Because it is a geometric mean, fuzzy overlap alone or a few exact hits alone will not get you far.
The public leaderboard scores a subset of items on every submission. The private leaderboard covers the rest and is revealed at the deadline. Pick your 2 final submissions before then.
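As a rough illustration of how the two components combine, the sketch below computes a points-weighted exact match, a simplified character n-gram F-score, and their geometric mean. It is not the official scorer: the whitespace normalisation, the n-gram range (1 to 6), the recall weight (beta = 2), and every function name here are assumptions made for illustration.

```python
import math
from collections import Counter


def weighted_exact_match(preds, golds, points):
    """Share of available points earned by exactly correct answers, 0 to 100."""
    total = sum(points)
    earned = sum(p for pred, gold, p in zip(preds, golds, points)
                 if pred.strip() == gold.strip())  # assumed normalisation
    return 100.0 * earned / total if total else 0.0


def char_ngrams(text, n):
    """Count the character n-grams of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def simple_chrf(pred, gold, max_n=6, beta=2.0):
    """Simplified chrF-style score, 0 to 100 (assumed parameters)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        p_grams, g_grams = char_ngrams(pred, n), char_ngrams(gold, n)
        if not p_grams or not g_grams:
            continue  # one string is shorter than n characters
        overlap = sum((p_grams & g_grams).values())
        precisions.append(overlap / sum(p_grams.values()))
        recalls.append(overlap / sum(g_grams.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100.0 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def final_score(exact_match, chrf):
    """Geometric mean of the two components, both on a 0 to 100 scale."""
    return math.sqrt(exact_match * chrf)
```

With these definitions, `final_score(25.0, 100.0)` is 50.0 and `final_score(0.0, 80.0)` is 0.0: a system with no exact hits scores nothing however high its overlap.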
The data
Hidden, self-contained, and mounted at evaluation time.
The test set is the IOL individual contest, reformatted into self-contained
sub-questions. It is hidden: your script.py reads it at evaluation time
from /tmp/data/test.csv, which the platform mounts before your script
runs. There is no separate training set. Each problem carries all the data needed to
solve it.
| Column | Description |
|---|---|
| id | Unique sub-question id. Copy it back unchanged. |
| question_number | Which numbered item within the problem's query this row asks for. |
| context | The full problem statement: bilingual data, hints, language meta-info. |
| query | The instruction listing every numbered item for the problem. |
| task_type | translation, mapping, fill_blanks, classification, editing |
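A script may want to confirm that the mounted file has the columns listed above before it starts generating. The following is a minimal sketch using only the standard library; the helper names and the UTF-8 encoding are assumptions, not part of the platform.

```python
import csv

# Column names as listed in the table above.
EXPECTED_COLUMNS = {"id", "question_number", "context", "query", "task_type"}


def load_test_rows(path="/tmp/data/test.csv"):
    """Read the test CSV into a list of dicts, checking the documented columns."""
    with open(path, newline="", encoding="utf-8") as f:  # encoding is assumed
        reader = csv.DictReader(f)
        missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"test set is missing columns: {sorted(missing)}")
        return list(reader)


def count_by_task_type(rows):
    """Count how many rows fall under each task_type value."""
    counts = {}
    for row in rows:
        counts[row["task_type"]] = counts.get(row["task_type"], 0) + 1
    return counts
```

Calling `count_by_task_type(load_test_rows())` at the top of a run gives a quick view of how the rows split across task types, which can help when choosing a prompt per type.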
The Human Evaluation Challenge
A parallel track, reviewed by the jury, on equal footing with IOL participants.
Alongside the automatic competition runs a human-scored track. The jury reviews a shortlist of submissions by hand, the same way it reviews the human contestants.
To opt in, add an explanation column to your submission.csv
with a short, plain-language explanation of the reasoning behind each answer. It is
never scored automatically. It exists only for the jury.
- Include an explanation for most of your answers.
- We take the top submissions by score with a valid explanation rate of at least 50%.
- The jury scores them by hand during the IOL.
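To check whether a submission is likely to clear the 50% threshold above, one could measure the share of rows with an explanation. The sketch below treats any non-empty explanation as valid, which is an assumption: the page does not define what the jury counts as valid. The function name is hypothetical.

```python
import csv


def explanation_rate(path="submission.csv"):
    """Fraction of rows whose explanation cell is non-empty (0.0 to 1.0)."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return 0.0
    # Assumption: "valid" means non-empty after stripping whitespace.
    filled = sum(1 for row in rows if (row.get("explanation") or "").strip())
    return filled / len(rows)
```

A file with no explanation column yields 0.0, so the same check also confirms that the column was written at all.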
How to submit
A Hugging Face model repo with a script.py that writes submission.csv.
- Put your model and a script.py in one Hugging Face model repo. That repo is your script's working directory at run time, so load any local weights from ".".
- Have the script read the test set from /tmp/data/test.csv. You cannot download it: your token is revoked before your code runs.
- Have the script write submission.csv with one row per problem: an id copied from the test set and a pred with your answers. Add an explanation column to enter the jury track.
- Enter your model repo id in the competition Space. You get up to 5 submissions a day, and pick 2 for the private leaderboard before the deadline. Each run has a 1-hour limit.
The evaluation sandbox has no internet, so ship your model inside the submission repo and load it from ".". This script.py uses Qwen/Qwen2.5-1.5B-Instruct, a small public model, with greedy decoding so results are reproducible. The one-time step that downloads the weights into your repo is documented on the competition Space.
```python
import subprocess, sys

# Install the dependencies before importing them.
subprocess.run([sys.executable, "-m", "pip", "install", "-q",
                "transformers>=4.43", "accelerate>=0.30", "torch>=2.2", "pandas"], check=True)

import os

# The repo is the working directory at run time, and the sandbox has no network.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"
MODEL_ID = "."  # your model's weights are shipped in this repo

import json
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
).eval()

# Read every column as a string and replace missing cells with "".
df = pd.read_csv("/tmp/data/test.csv", dtype=str).fillna("")

rows = []
for _, r in df.iterrows():
    messages = [
        {"role": "system", "content":
            "You solve International Linguistics Olympiad problems. Answer every numbered "
            "item. Put each answer on its own line, in order, with no numbering and no extra text."},
        {"role": "user", "content": f"{r['context'].strip()}\n\n{r['query'].strip()}"},
    ]
    ids = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        # Greedy decoding keeps the output reproducible.
        out = model.generate(ids, max_new_tokens=512, do_sample=False)
    # Decode only the newly generated tokens, not the prompt.
    text = tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True).strip()
    # One answer per non-empty line, stored as a JSON list.
    answers = [ln.strip() for ln in text.splitlines() if ln.strip()]
    rows.append({"id": r["id"], "pred": json.dumps(answers, ensure_ascii=False)})

pd.DataFrame(rows).to_csv("submission.csv", index=False)
```
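Before spending one of the daily submissions, it may be worth checking the output file locally against a copy of the test ids. The sketch below is an illustration under stated assumptions: it expects pred to be a JSON list, as the example script writes it, while the official format requirements may differ. The function names and messages are hypothetical.

```python
import csv
import json


def read_csv(path):
    """Read a CSV file into a list of dicts (UTF-8 is assumed)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def check_submission(test_path="/tmp/data/test.csv", sub_path="submission.csv"):
    """Return a list of readable problems; an empty list means none were found."""
    test_ids = {row["id"] for row in read_csv(test_path)}
    problems, seen = [], set()
    for row in read_csv(sub_path):
        rid = row.get("id")
        if rid not in test_ids:
            problems.append(f"unknown id: {rid!r}")
        if rid in seen:
            problems.append(f"duplicate id: {rid!r}")
        seen.add(rid)
        try:
            answers = json.loads(row.get("pred") or "")
        except json.JSONDecodeError:
            problems.append(f"pred for {rid!r} is not valid JSON")
            continue
        if not isinstance(answers, list):
            problems.append(f"pred for {rid!r} is not a JSON list")
    missing = test_ids - seen
    if missing:
        problems.append(f"{len(missing)} test id(s) have no row")
    return problems
```

Running `check_submission()` as the last step of script.py and printing the result leaves a trace in the run log without changing the submission itself.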
Leaderboard
The public leaderboard. Best submission per team.
| # | Team | Score | Exact match | chrF | Expl. % |
|---|---|---|---|---|---|
Snapshot regenerated automatically. It may lag the live competition Space by up to an hour.
Timeline
- 12 Jun 2026: Competition opens
- 26 Jul 2026, 23:59 UTC: Submission deadline. Select your 2 final entries
- 26 Jul 2026: Private leaderboard revealed
- During the IOL: Human Evaluation Challenge reviewed by the jury