Model Leaderboard

Polish Legal RAG Leaderboard

Explore and compare model performance on Polish legal QA tasks.

Use the filters to narrow results by model name and parameter-count bin.
Use "Exclude tasks" to hide selected tasks (all are shown by default); avg_score is recomputed over the tasks that remain.
Click a column header to sort; the table updates automatically as filters change.
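
How avg_score reacts to excluded tasks can be illustrated with a minimal sketch. It assumes avg_score is an unweighted mean of the per-task scores still shown, which the page does not state explicitly; the function name and the sample scores are hypothetical.

```python
from statistics import mean
from typing import Dict, Optional, Set


def avg_score(scores: Dict[str, float], excluded: Set[str]) -> Optional[float]:
    """Average the per-task scores that are not excluded.

    Assumption: an unweighted mean over the remaining tasks.
    Returns None when every task has been excluded.
    """
    kept = [value for task, value in scores.items() if task not in excluded]
    return mean(kept) if kept else None


# Hypothetical scores for one model (not real leaderboard data).
row = {"src_clf": 0.80, "sum_rag": 0.60, "sum_rag_v2": 0.40}

print(avg_score(row, set()))        # mean over all three tasks
print(avg_score(row, {"src_clf"}))  # mean over sum_rag and sum_rag_v2 only
```

Under this assumption, excluding a task a model was never trained for (such as src_clf) raises or lowers its average only through the tasks that remain.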

Methodology

src_clf: Source classification of a fragment.
sum_rag: RAG-style QA answered strictly from the provided passages. Answers are graded by a gpt-4o judge model on a 0-2 scale; we report the F1 score.
sum_rag_v2: Advanced legal reasoning dataset with multiple question types:
- Contradiction resolution: Questions about resolving contradictions or ambiguities within legal texts, requiring analysis of conflicting rules or statements
- Legal inference: Questions testing whether hypothetical situations meet specific legal criteria, requiring identification of legal prerequisites and exceptions
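
For reference, F1 in its standard definition is the harmonic mean of precision and recall. The page does not describe how the 0-2 judge grades are mapped to precision and recall, so the sketch below takes both as given inputs; the numbers are hypothetical and only show how a low recall pulls F1 down even when precision is perfect.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values: perfect precision paired with low recall.
print(round(f1_score(1.0, 0.25), 3))  # 0.4
```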

Notes

GPT-5-nano sometimes fails to answer, responding with an empty string.
GPT-4o reaches 100% precision on the sum_rag_v2 task but appears to have surprisingly low recall.
The Llama-3-8B-Instruct family has a limited context length (8k tokens for version 3, 16k for version 3.1), so when the passages are too long the model cannot answer and receives a score of 0.
The Gaius-Lex v0.8 model is based on Llama-3-8B-Instruct with a RoPE scaling factor of 2.0. It was not trained for the src_clf task.
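
The context-length note can be read as a simple rule: a prompt that does not fit the model's window yields no answer and therefore a score of 0. The sketch below illustrates that rule only; the function names, the round token limits standing in for "8k" and "16k", and the number of tokens reserved for the answer are all assumptions, and token counts are passed in directly rather than computed with a real tokenizer.

```python
# Assumed round-number limits standing in for the "8k" and "16k" in the notes.
CONTEXT_LIMITS = {"llama-3": 8_000, "llama-3.1": 16_000}


def fits_context(prompt_tokens: int, limit: int, reserved_for_answer: int = 512) -> bool:
    """True when the prompt plus room reserved for the answer fits the window."""
    return prompt_tokens + reserved_for_answer <= limit


def score_or_zero(prompt_tokens: int, limit: int, graded_score: float) -> float:
    """Keep the judge's grade if the prompt fits; otherwise score 0."""
    return graded_score if fits_context(prompt_tokens, limit) else 0.0


# A hypothetical 12,000-token prompt overflows the smaller window only.
print(score_or_zero(12_000, CONTEXT_LIMITS["llama-3"], 2.0))    # 0.0
print(score_or_zero(12_000, CONTEXT_LIMITS["llama-3.1"], 2.0))  # 2.0
```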

Language and RAG prompt

All tasks, passages and questions are in Polish. The models are instructed to answer in Polish.

Odpowiadasz tylko i wyłącznie po polsku. Twoim zadaniem jest odpowiedzieć na pytanie na podstawie źródeł. Podaj wszystkie interesujące informacje oraz argumenty i cytaty na dowód ich prawdziwości.
Nie odpowiadaj na podstawie własnej wiedzy. Jeżeli w źródłach nie ma wymaganych informacji, powiedz to krótko.
<relevant_info>
{passages}
</relevant_info>

Odpowiedz na pytanie: `{question}` tylko i wyłącznie na podstawie źródeł. Nie odbiegaj od ich treści.
Jeżeli odpowiedź nie jest zawarta w <relevant_info>, odpowiedz że nie ma odpowiedzi w źródłach.
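To jest kluczowe, że odpowiedź musi być oparta wyłącznie na <relevant_info>.

English translation of the prompt:

You answer only and exclusively in Polish. Your task is to answer the question based on the sources. Provide all relevant information, along with arguments and quotations that prove it is true.
Do not answer from your own knowledge. If the sources do not contain the required information, say so briefly.
<relevant_info>
{passages}
</relevant_info>

Answer the question: `{question}` solely on the basis of the sources. Do not deviate from their content.
If the answer is not contained in <relevant_info>, reply that the sources contain no answer.
It is crucial that the answer be based exclusively on <relevant_info>.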
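
Filling the Polish template can be sketched with plain string formatting. The template text is copied from the prompt; the function name, the choice to join passages with blank lines, and the sample passage and question are assumptions made for illustration, not a description of the leaderboard's actual pipeline.

```python
from typing import List

# Template text copied verbatim from the prompt.
RAG_PROMPT = """\
Odpowiadasz tylko i wyłącznie po polsku. Twoim zadaniem jest odpowiedzieć na pytanie na podstawie źródeł. Podaj wszystkie interesujące informacje oraz argumenty i cytaty na dowód ich prawdziwości.
Nie odpowiadaj na podstawie własnej wiedzy. Jeżeli w źródłach nie ma wymaganych informacji, powiedz to krótko.
<relevant_info>
{passages}
</relevant_info>

Odpowiedz na pytanie: `{question}` tylko i wyłącznie na podstawie źródeł. Nie odbiegaj od ich treści.
Jeżeli odpowiedź nie jest zawarta w <relevant_info>, odpowiedz że nie ma odpowiedzi w źródłach.
To jest kluczowe, że odpowiedź musi być oparta wyłącznie na <relevant_info>."""


def build_prompt(passages: List[str], question: str) -> str:
    """Fill the template; joining passages with blank lines is an assumption."""
    return RAG_PROMPT.format(passages="\n\n".join(passages), question=question)


# Hypothetical passage and question, for illustration only.
prompt = build_prompt(["Art. 1. Przykładowy przepis."], "Co stanowi art. 1?")
print(prompt)
```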