LIVE RESULTS

Qiskit-HumanEval
Hard Leaderboard

151 challenging quantum programming tasks. Models evaluated on generating valid, runnable Qiskit 2.0 code — verified by execution, not self-report.

MODELS 7
BEST PASS RATE 91.6%
AVG PASS RATE 50.0%
TOTAL RUNS 19
ANALYTICS

Pass Rate Comparison

pass@1, temp 0.0

Score vs. Cost

INSIGHTS
GRAYGATE PASS RATE 91.6%
LEAD OVER #2 +31.7pp
EXCLUSIVE SOLVES 27
HARD TASK RATE 98%

Hard Task Performance

49 hardest tasks

Tasks solved by ≤2 providers

GrayGate Advantage

vs Competitors

Tasks GrayGate passed that others failed

Provider Kill Count

5 Competitors

Tasks GrayGate passed that stumped N other providers

LEADERBOARD
| # | Model | Provider | Pass Rate | Run Range | Passed | Runs | Tokens | Cost | $/Pass |
|---|-------|----------|-----------|-----------|--------|------|--------|------|--------|
| 1 | GrayGate PIPELINE + Gemini 3 Flash | GrayGate | 91.6% | 90.7–92.7% | 138/151 | 3 | n/a | Free | n/a |
| 2 | gemini-3-pro-preview | Google | 59.9% | 59.6–60.3% | 90/151 | 2 | 56k | $0.52 | $0.006 |
| 3 | gemini-3-flash-preview | Google | 55.6% | n/a | 84/151 | 3 | 90k | $0.24 | $0.003 |
| 4 | gpt-5.2 | OpenAI | 46.1% | 42.4–49.0% | 70/151 | 3 | 57k | $0.64 | $0.009 |
| 5 | qwen/qwen3.5-397b-a17b | Qwen | 38.4% | 37.8–39.1% | 58/151 | 3 | 149k | $0.14 | $0.002 |
| 6 | deepseek-reasoner | DeepSeek | 32.5% | 31.1–33.8% | 49/151 | 3 | 410k | $0.17 | $0.003 |
| 7 | hf.co/Qiskit/qwen2.5-coder-14b-qiskit-GGUF:latest | Ollama | 25.8% | n/a | 39/151 | 2 | 65k | Free | n/a |
METHODOLOGY

How Scores Are Computed

Prompt Extraction

Each of the 151 tasks contains a function signature, docstring, and a hidden check() test harness.
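The task format can be sketched as follows. This is a simplified stand-in, not an actual benchmark task: the real tasks target Qiskit 2.0, but the shape is the same — a visible prompt and a hidden `check()` harness.

```python
# Hypothetical task in the benchmark's format (plain Python so it runs
# anywhere). The model sees only PROMPT; check() stays hidden.
PROMPT = '''def parity(bits: list[int]) -> int:
    """Return the XOR parity of a list of 0/1 ints."""
'''

def check(candidate):
    # Hidden test harness: every assertion must pass for the task to count.
    assert candidate([]) == 0
    assert candidate([1]) == 1
    assert candidate([1, 1, 0]) == 0
    assert candidate([1, 0, 1, 1]) == 1

# A correct model completion:
def parity(bits):
    out = 0
    for b in bits:
        out ^= b
    return out

check(parity)  # silent on success; raises AssertionError on any failure
```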

Code Generation

The model receives the prompt and must return a complete function body. Temperature is fixed at 0.0 for deterministic output.
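A request in an OpenAI-compatible chat format might look like the sketch below. The helper name and system message are assumptions; the actual GrayBench client and prompt template are not shown on this page.

```python
def build_request(model: str, task_prompt: str) -> dict:
    # Hypothetical payload builder; only the temperature setting is taken
    # from the methodology above.
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Complete the function body. Return code only."},
            {"role": "user", "content": task_prompt},
        ],
        "temperature": 0.0,  # fixed for deterministic, reproducible output
    }
```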

Sandboxed Execution

Generated code runs in an isolated venv with Qiskit 2.0.0 and pinned dependencies. 60 s timeout per task.
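The isolation step can be sketched with the standard library. One deliberate simplification: the real runner uses a dedicated venv with pinned Qiskit 2.0.0 dependencies, while this sketch reuses the current interpreter.

```python
import os
import subprocess
import sys
import tempfile

def run_task(code: str, timeout_s: int = 60) -> bool:
    """Execute candidate code plus hidden harness in a fresh process.

    Hypothetical runner: a non-zero exit (e.g. a failed assert) or a
    timeout counts as FAIL, matching the pass/fail semantics above.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout_s
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # hung or too slow -> FAIL
    finally:
        os.unlink(path)
```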

Test Validation

The hidden test harness runs assertions against the generated function. A task is PASS only if every assertion succeeds with zero errors.

Scoring

Final score = passed tasks / 151. Token counts and cost are tracked per-task for efficiency analysis.
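Folding per-task verdicts into the leaderboard columns is a one-liner each; the helper below is a hypothetical sketch with made-up numbers, and the real aggregation also averages across multiple runs.

```python
def summarize(results, cost_usd, total_tasks=151):
    # results holds one pass/fail verdict per task for a single run.
    passed = sum(results)
    return {
        "passed": f"{passed}/{total_tasks}",
        "pass_rate": round(100.0 * passed / total_tasks, 1),
        "cost_per_pass": round(cost_usd / passed, 3) if passed else None,
    }

# Synthetic example: 7 of 10 tasks passed at a total cost of $0.05.
stats = summarize([True] * 7 + [False] * 3, 0.05, total_tasks=10)
# -> {'passed': '7/10', 'pass_rate': 70.0, 'cost_per_pass': 0.007}
```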

Verified Execution

Every submission runs in a real Python process. No regex matching, no partial credit — the code either passes all tests or it doesn't.

Open Source

The evaluation framework, datasets, and scoring logic are fully open. Anyone can reproduce results or add new models.

Pinned Dependencies

Qiskit 2.0.0, numpy, scipy — every run uses the exact same package versions for fair comparison across models and dates.

DATASETS
qiskitbench-hard ACTIVE

Complex quantum algorithms, transpilation, error correction. The primary leaderboard benchmark.

151 tasks HuggingFace
qiskitbench SUPPORTED

Standard quantum computing tasks with Qiskit 2.0. Easier baseline for model capability assessment.

99 tasks HuggingFace
critpt COMING SOON

Research-level physics code generation. 190 tasks covering critical point calculations.

190 tasks Local only
GET STARTED

Run Your Own Benchmarks

Install GrayBench, point it at any provider, and get verified results in minutes. Open source, reproducible, and extensible.

terminal

$ git clone https://github.com/GrayArea-Labs/GrayBench.git

$ cd GrayBench && pip install -e .

$ graybench env setup qiskitbench

$ graybench keys set google

$ graybench run qiskitbench-hard -m google/gemini-2.5-flash