An open challenge from the International Linguistics Olympiad

Can a system reason out a language it has never seen?

Every year, the strongest young linguists are given a few lines of a language they have never seen and asked to work out the rest using nothing but logic. IOL‑AI 2026 gives your system the same puzzles under the same rules, and asks one question: how close can it get to an Olympiad medallist?

Submission deadline: 26 Jul 2026, 23:59 UTC

The mission

A reasoning benchmark you can't memorise your way through.

Linguistics problems cannot be solved from memorised facts. Each one is built on a language the solver has almost certainly never seen, so the only way to the answer is to reason from a handful of examples. That makes it a test of reasoning rather than recall.

IOL‑AI 2026 turns the International Linguistics Olympiad into a public, reproducible benchmark for that kind of reasoning. Your system faces the same contest the human contestants faced. Entries are scored automatically, and the strongest are reviewed by hand by the jury, on equal footing with the students. Everything runs in the open: public entries, open method, public leaderboard.

First principles only

No language can be looked up. Every answer is deduced from the data in the problem.

Human-comparable

The same problems, point values, and scoring the IOL uses for its own contestants.

Open and reproducible

Submit a public Hugging Face repository and a script. Anyone can rerun it and verify the result.

The task

Self-contained logic puzzles, no outside knowledge required.

Each problem gives a small amount of data from a language the solver has almost certainly never seen, plus the hints needed to solve it. From those examples, a solution works out enough of the grammar and vocabulary to translate new forms, fill in blanks, answer multiple-choice items, and correct mistakes.

You submit a Hugging Face repository containing a script.py. The platform runs your script on a hidden test set from the IOL individual contest and scores the answers it produces. Everything needed is in each problem's context, so no prior knowledge of the languages is required.

Scoring

Get it exactly right and get close. You need both.

Exact match

Is the answer exactly right? Points-weighted using the official IOL point values, so harder, heavier items matter more.

chrF

Character n-gram overlap with the correct answer, giving partial credit for near-misses so the leaderboard is smooth rather than a step function.
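
To make the idea of character n-gram overlap concrete, here is a minimal sketch of a chrF-style score. It is not the competition's scorer: the n-gram orders (1 to 6), the recall weighting (beta = 2), and the choice to ignore whitespace are assumptions modelled on common chrF implementations, and the function name `chrf_sketch` and the sample strings are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring whitespace (an assumption)."""
    s = "".join(text.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def chrf_sketch(pred, gold, max_n=6, beta=2.0):
    """Simplified chrF-style F-score between a prediction and a reference (0 to 1)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        p, g = char_ngrams(pred, n), char_ngrams(gold, n)
        if not p or not g:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((p & g).values())  # multiset intersection
        precisions.append(overlap / sum(p.values()))
        recalls.append(overlap / sum(g.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical answers: a near-miss earns partial credit, an unrelated string earns none.
exact = chrf_sketch("mikatolu", "mikatolu")
near = chrf_sketch("mikatol", "mikatolu")
miss = chrf_sketch("zzzz", "mikatolu")
```

A prediction that drops one character still shares most of its n-grams with the reference, so it scores well above zero while staying below a perfect match.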

Final score

score = 100 · √(EM_w · chrF_w)

The geometric mean of the two, on a 0 to 100 scale. Because it is a geometric mean, fuzzy overlap alone or a few exact hits alone will not get you far.
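
The arithmetic can be sketched in a few lines. This assumes both weighted components are expressed on a 0 to 1 scale and that exact match compares strings after trimming surrounding whitespace; the real scorer's normalisation is not specified here. The function names, answers, and point values below are hypothetical.

```python
import math


def weighted_exact_match(preds, golds, points):
    """Share of the available points earned by exactly correct answers (0 to 1)."""
    total = sum(points)
    if total == 0:
        return 0.0
    # Assumption: only surrounding whitespace is ignored when comparing.
    earned = sum(w for p, g, w in zip(preds, golds, points) if p.strip() == g.strip())
    return earned / total


def final_score(em_w, chrf_w):
    """Geometric mean of the two components, scaled to 0..100."""
    return 100.0 * math.sqrt(em_w * chrf_w)


# Hypothetical items: the 2-point item is a near-miss, the two 1-point items are exact.
preds = ["mikatol", "seni", "rapo"]
golds = ["mikatolu", "seni", "rapo"]
points = [2.0, 1.0, 1.0]

em_w = weighted_exact_match(preds, golds, points)  # 2 of 4 points
balanced = final_score(em_w, 0.72)                 # about 60
overlap_only = final_score(0.0, 0.9)               # high overlap, no exact hits
```

The last line shows the effect of the geometric mean: with no exact matches the score is zero, however high the character overlap.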

The public leaderboard scores a subset of items on every submission. The private leaderboard covers the rest and is revealed at the deadline. Pick your 2 final submissions before then.

The data

Hidden, self-contained, and mounted at evaluation time.

The test set is the IOL individual contest, reformatted into self-contained sub-questions. It is hidden: your script.py reads it at evaluation time from /tmp/data/test.csv, which the platform mounts before your script runs. There is no separate training set. Each problem carries all the data needed to solve it.

| Column | Description |
| --- | --- |
| id | Unique sub-question id. Copy it back unchanged. |
| question_number | Which numbered item within the problem's query this row asks for. |
| context | The full problem statement: bilingual data, hints, language meta-info. |
| query | The instruction listing every numbered item for the problem. |
| task_type | translation, mapping, fill_blanks, classification, editing |
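
As a rough illustration of working with these columns, the sketch below loads rows with the standard csv module, checks that the documented columns are present, and groups sub-questions that share a problem statement. The two-row sample is invented, and grouping by identical context text is an assumption about how rows from one problem relate.

```python
import csv
import io

EXPECTED_COLUMNS = ["id", "question_number", "context", "query", "task_type"]


def load_test_rows(handle):
    """Read the test set and fail early if a documented column is missing."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"test set is missing columns: {missing}")
    return list(reader)


def group_by_context(rows):
    """Group rows by problem statement (assumes identical text per problem)."""
    groups = {}
    for row in rows:
        groups.setdefault(row["context"], []).append(row)
    return groups


# Hypothetical sample standing in for /tmp/data/test.csv.
SAMPLE = (
    "id,question_number,context,query,task_type\n"
    'p1_q1,1,"data for problem 1","Translate items 1-2.",translation\n'
    'p1_q2,2,"data for problem 1","Translate items 1-2.",translation\n'
)
rows = load_test_rows(io.StringIO(SAMPLE))
groups = group_by_context(rows)
```

In a real script.py the handle would come from `open("/tmp/data/test.csv", newline="", encoding="utf-8")`. Grouping lets a script send each problem statement to the model once instead of once per sub-question.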

The Human Evaluation Challenge

A parallel track, reviewed by the jury, on equal footing with IOL participants.

Alongside the automatic competition runs a human-scored track. The jury reviews a shortlist of submissions by hand, the same way it reviews the human contestants.

To opt in, add an explanation column to your submission.csv with a short, plain-language explanation of the reasoning behind each answer. It is never scored automatically. It exists only for the jury.

  1. Include an explanation for most of your answers.
  2. Among submissions whose valid explanation rate is at least 50%, we take the top ones by score.
  3. The jury scores them by hand during the IOL.
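
Before submitting, it may be worth checking your own explanation rate locally. The sketch below treats any non-empty explanation as valid, which is an assumption: the jury's actual validity criteria are not stated here. The function name and the sample rows are hypothetical.

```python
import csv
import io


def explanation_rate(handle):
    """Fraction of submission rows with a non-empty explanation (0 to 1)."""
    rows = list(csv.DictReader(handle))
    if not rows:
        return 0.0
    # Assumption: "valid" means non-empty after trimming whitespace.
    filled = sum(1 for r in rows if (r.get("explanation") or "").strip())
    return filled / len(rows)


# Hypothetical submission: three of four rows carry an explanation.
SAMPLE = (
    "id,pred,explanation\n"
    "a1,seni,The suffix -ni marks the plural.\n"
    "a2,rapo,Word order is verb-final in every example.\n"
    "a3,miko,The prefix mi- appears with first-person subjects.\n"
    "a4,tolu,\n"
)
rate = explanation_rate(io.StringIO(SAMPLE))
eligible = rate >= 0.5
```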

How to submit

A Hugging Face model repo with a script.py that writes submission.csv.

  1. Put your model and a script.py in one Hugging Face model repo. That repo is your script's working directory at run time, so load any local weights from ".".
  2. Have the script read the test set from /tmp/data/test.csv. You cannot download it: your token is revoked before your code runs.
  3. Have the script write submission.csv with one row per test-set row: the id copied unchanged from the test set and a pred with your answers. Add an explanation column to enter the jury track.
  4. Enter your model repo id in the competition Space. You get up to 5 submissions a day, and pick 2 for the private leaderboard before the deadline. Each run has a 1-hour limit.

The evaluation sandbox has no internet, so ship your model inside the submission repo and load it from ".". The example script.py below uses Qwen/Qwen2.5-1.5B-Instruct, a small public model, with greedy decoding so results are reproducible. The one-time step that downloads the weights into your repo is documented on the competition Space.

import subprocess, sys
subprocess.run([sys.executable, "-m", "pip", "install", "-q",
                "transformers>=4.43", "accelerate>=0.30", "torch>=2.2", "pandas"], check=True)

import os
# The repo is the working directory at run time, and the sandbox has no network.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"
MODEL_ID = "."   # your model's weights are shipped in this repo

import json
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
).eval()

df = pd.read_csv("/tmp/data/test.csv", dtype=str).fillna("")

rows = []
for _, r in df.iterrows():
    messages = [
        {"role": "system", "content":
            "You solve International Linguistics Olympiad problems. Answer every numbered "
            "item. Put each answer on its own line, in order, with no numbering and no extra text."},
        {"role": "user", "content": f"{r['context'].strip()}\n\n{r['query'].strip()}"},
    ]
    ids = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=512, do_sample=False)
    text = tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True).strip()
    answers = [ln.strip() for ln in text.splitlines() if ln.strip()]
    rows.append({"id": r["id"], "pred": json.dumps(answers, ensure_ascii=False)})

pd.DataFrame(rows).to_csv("submission.csv", index=False)
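
The script above writes only id and pred. A minimal sketch of adding the optional explanation column for the jury track follows, using the standard csv module. The `write_submission` helper, the shape of the `results` records, and the sample answers are hypothetical, and how your system produces each explanation is left open.

```python
import csv
import io
import json


def write_submission(handle, results):
    """Write id, pred, and explanation columns; pred is a JSON list as in the script above."""
    writer = csv.DictWriter(handle, fieldnames=["id", "pred", "explanation"])
    writer.writeheader()
    for res in results:
        writer.writerow({
            "id": res["id"],  # copied back unchanged from the test set
            "pred": json.dumps(res["answers"], ensure_ascii=False),
            "explanation": res.get("explanation", ""),  # empty if none was produced
        })


# Hypothetical results for two test rows.
results = [
    {"id": "p1_q1", "answers": ["mika tolu"],
     "explanation": "The verb comes last, and -lu marks the past."},
    {"id": "p1_q2", "answers": ["seni rapo"]},
]
buf = io.StringIO()
write_submission(buf, results)
written = list(csv.DictReader(io.StringIO(buf.getvalue())))
```

In a real run the handle would be `open("submission.csv", "w", newline="", encoding="utf-8")`. The csv module quotes commas and quotation marks inside explanations, so free-form text stays in a single cell.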

Leaderboard

The public leaderboard. Best submission per team.

★ eligible for jury review (explanation ≥ 50%)
Columns: #, Team, Score, Exact match, chrF, Expl. %

Snapshot regenerated automatically. It may lag the live competition Space by up to an hour.

Timeline