LIVE RESULTS

Qiskit-HumanEval
Hard Leaderboard

151 challenging quantum programming tasks. Models are evaluated on generating valid, runnable Qiskit 2.0 code, verified by execution rather than self-report.

GrayBench Dataset

MODELS 7

BEST PASS RATE 91.6%

AVG PASS RATE 50.0%

TOTAL RUNS 19

TASKS 151

ANALYTICS

Pass Rate Comparison

pass@1, temp 0.0

Score vs. Cost

INSIGHTS

GRAYGATE PASS RATE 91.6%

LEAD OVER #2 +31.7pp

EXCLUSIVE SOLVES 27

HARD TASK RATE 98%

Hard Task Performance

49 hardest tasks

Tasks solved by ≤2 providers

GrayGate Advantage

vs Competitors

Tasks GrayGate passed that others failed

Provider Kill Count

5 Competitors

Tasks GrayGate passes that stumped N others
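
The three insight metrics above (hard tasks, exclusive solves, and the per-task count of competitors that failed) all reduce to set operations over per-provider pass sets. The following is a minimal sketch, assuming results are available as a mapping from provider name to the set of task IDs it passed. The provider names, task IDs, and function names are made up for illustration and are not taken from the leaderboard data or the GrayBench code.

```python
from collections import Counter

# Hypothetical pass sets: provider -> IDs of tasks it passed (illustrative only).
passed = {
    "graygate": {"t1", "t2", "t3", "t4", "t5"},
    "provider_a": {"t1", "t2"},
    "provider_b": {"t1", "t3"},
    "provider_c": {"t1"},
}
all_tasks = {"t1", "t2", "t3", "t4", "t5", "t6"}


def solver_counts(passed, tasks):
    """Map each task to the number of providers that solved it."""
    return {t: sum(t in solved for solved in passed.values()) for t in tasks}


def hard_tasks(passed, tasks, max_solvers=2):
    """Tasks solved by at most `max_solvers` providers (unsolved tasks included)."""
    counts = solver_counts(passed, tasks)
    return {t for t, n in counts.items() if n <= max_solvers}


def exclusive_solves(passed, target):
    """Tasks that `target` passed and every other provider failed."""
    others = set().union(*(s for p, s in passed.items() if p != target))
    return passed[target] - others


def kill_count(passed, target):
    """Histogram: N -> number of tasks `target` passed that exactly N others failed."""
    others = [s for p, s in passed.items() if p != target]
    return Counter(sum(t not in s for s in others) for t in passed[target])


hard = hard_tasks(passed, all_tasks)
exclusive = exclusive_solves(passed, "graygate")
kills = kill_count(passed, "graygate")
```

Whether the published hard-task set counts tasks that no provider solved is not stated; the sketch includes them because "solved by ≤2 providers" literally covers zero.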

LEADERBOARD

Model	Provider	Pass Rate	Range	Passed	Runs	Tokens	Cost	$/Pass
GrayGate PIPELINE + Gemini 3 Flash	GrayGate	91.6%	90.7–92.7%	138/151	3	—	Free	—
gemini-3-pro-preview	Google	59.9%	59.6–60.3%	90/151	2	56k	$0.52	$0.006
gemini-3-flash-preview	Google	55.6%	—	84/151	3	90k	$0.24	$0.003
gpt-5.2	OpenAI	46.1%	42.4–49.0%	70/151	3	57k	$0.64	$0.009
qwen/qwen3.5-397b-a17b	Qwen	38.4%	37.8–39.1%	58/151	3	149k	$0.14	$0.002
deepseek-reasoner	DeepSeek	32.5%	31.1–33.8%	49/151	3	410k	$0.17	$0.003
hf.co/Qiskit/qwen2.5-coder-14b-qiskit-GGUF:latest	Ollama	25.8%	—	39/151	2	65k	Free	—
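
The summary figures at the top of the page can be re-derived from the leaderboard rows. The sketch below assumes that $/Pass is the Cost column divided by the Passed count and that the average pass rate is an unweighted mean over the seven models; both assumptions reproduce the published numbers, but the page does not state the formulas. The variable and function names are illustrative.

```python
# (model, pass rate in %, passed tasks, cost in USD or None when listed as Free)
rows = [
    ("GrayGate PIPELINE + Gemini 3 Flash", 91.6, 138, None),
    ("gemini-3-pro-preview", 59.9, 90, 0.52),
    ("gemini-3-flash-preview", 55.6, 84, 0.24),
    ("gpt-5.2", 46.1, 70, 0.64),
    ("qwen/qwen3.5-397b-a17b", 38.4, 58, 0.14),
    ("deepseek-reasoner", 32.5, 49, 0.17),
    ("hf.co/Qiskit/qwen2.5-coder-14b-qiskit-GGUF:latest", 25.8, 39, None),
]


def cost_per_pass(cost, passed):
    """Cost divided by passed tasks, or None for free models (assumed formula)."""
    if cost is None or passed == 0:
        return None
    return round(cost / passed, 3)


rates = sorted((rate for _, rate, _, _ in rows), reverse=True)
avg_pass_rate = round(sum(rates) / len(rates), 1)   # AVG PASS RATE
lead_over_second = round(rates[0] - rates[1], 1)    # LEAD OVER #2, in percentage points
per_pass = {name: cost_per_pass(cost, passed) for name, _, passed, cost in rows}

print(avg_pass_rate, lead_over_second)
for name, value in per_pass.items():
    print(f"{name}: {'n/a' if value is None else f'${value:.3f}'}")
```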

METHODOLOGY

How Scores Are Computed

Prompt Extraction

Each of the 151 tasks contains a function signature, docstring, and a hidden check() test harness.

Code Generation

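The model receives the prompt and must return a complete function body. Temperature is fixed at 0.0 to minimize sampling variance; results can still differ slightly between runs, which is why the leaderboard reports a range for models with multiple runs.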

Sandboxed Execution

Generated code runs in an isolated venv with Qiskit 2.0.0 and pinned dependencies. 60 s timeout per task.
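
A minimal sketch of the timeout-bounded execution step, assuming each candidate program is written to a temporary file and run by a separate interpreter process. The helper name `run_candidate` and its status strings are hypothetical, not GrayBench's API. Creating the isolated venv with Qiskit 2.0.0 is out of scope here, so the sketch defaults to the current interpreter; a venv's own `python` path could be passed instead.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(source: str, timeout_s: float = 60.0,
                  python: str = sys.executable) -> str:
    """Run `source` in a child process; return 'pass', 'error', or 'timeout'."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [python, str(script)],
                capture_output=True,   # keep stdout/stderr out of the parent's output
                text=True,
                timeout=timeout_s,     # the child is killed when this expires
                cwd=tmp,               # confine relative file writes to the temp dir
            )
        except subprocess.TimeoutExpired:
            return "timeout"
    # A non-zero exit code covers failed assertions and uncaught exceptions.
    return "pass" if proc.returncode == 0 else "error"
```

A separate process only isolates interpreter state and enforces the time limit; it is not a security sandbox by itself.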

Test Validation

The hidden test harness runs assertions against the generated function. A task is PASS only if every assertion succeeds with zero errors.
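
The all-or-nothing rule can be sketched as follows, assuming the HumanEval-style layout described above: the prompt, the model's function body, and the hidden `check()` harness are concatenated into one program, and the task passes only if that program runs without raising. The task shown is a made-up example that avoids importing Qiskit so the sketch stays self-contained; it is not from the dataset, and `evaluate` is a hypothetical helper. It runs in-process for brevity, whereas the real pipeline uses a separate process.

```python
# Hypothetical task (not from the dataset): signature + docstring, then a hidden harness.
PROMPT = '''def bell_counts(shots: int) -> dict:
    """Return ideal Bell-state measurement counts, split across '00' and '11'."""
'''

HIDDEN_TEST = '''def check(candidate):
    result = candidate(100)
    assert set(result) == {"00", "11"}
    assert sum(result.values()) == 100
'''

ENTRY_POINT = "bell_counts"


def evaluate(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """PASS (True) only if every assertion succeeds and nothing raises."""
    program = f"{prompt}{completion}\n\n{test}\ncheck({entry_point})\n"
    namespace = {}
    try:
        exec(program, namespace)  # in-process for illustration only
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as FAIL.
        return False
    return True


good_body = '    return {"00": shots // 2, "11": shots - shots // 2}\n'
bad_body = '    return {"00": shots}\n'

print(evaluate(PROMPT, good_body, HIDDEN_TEST, ENTRY_POINT))
print(evaluate(PROMPT, bad_body, HIDDEN_TEST, ENTRY_POINT))
```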

Scoring

Final score = passed tasks / 151. Token counts and cost are tracked per-task for efficiency analysis.
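
As a worked example of the scoring rule, the sketch below computes the per-run score as passed tasks divided by 151 and then summarizes several runs. It assumes that, for models with multiple runs, the leaderboard pass rate is the mean of the per-run scores and the range is their minimum and maximum; the published figures are consistent with that reading, but the page does not state it. The pass counts are hypothetical.

```python
from statistics import mean

TOTAL_TASKS = 151


def pass_rate(passed: int, total: int = TOTAL_TASKS) -> float:
    """Score for a single run, in percent."""
    return 100.0 * passed / total


def summarize(passed_per_run):
    """Mean and range of per-run scores, rounded to one decimal place."""
    rates = [pass_rate(p) for p in passed_per_run]
    return {
        "mean": round(mean(rates), 1),
        "low": round(min(rates), 1),
        "high": round(max(rates), 1),
    }


# Hypothetical pass counts for three runs of one model.
summary = summarize([60, 63, 66])
print(summary)
```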

Verified Execution

Every submission runs in a real Python process. No regex matching, no partial credit — the code either passes all tests or it doesn't.

Open Source

The evaluation framework, datasets, and scoring logic are fully open. Anyone can reproduce results or add new models.

Pinned Dependencies

Qiskit 2.0.0, NumPy, SciPy: every run uses the exact same package versions for fair comparison across models and dates.
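
One way to confirm that an environment matches the pins before a run is to compare installed versions against the expected ones. This is a minimal sketch using the standard library's `importlib.metadata`; the `check_pins` helper is hypothetical, and only the Qiskit pin appears because the exact NumPy and SciPy versions are not given here.

```python
from importlib.metadata import PackageNotFoundError, version

# Only the Qiskit pin is stated in the text; other pins are intentionally omitted.
PINS = {"qiskit": "2.0.0"}


def check_pins(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


print(check_pins(PINS) or "all pins satisfied")
```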

DATASETS

qiskitbench-hard ACTIVE

Complex quantum algorithms, transpilation, error correction. The primary leaderboard benchmark.

151 tasks HuggingFace

qiskitbench SUPPORTED

Standard quantum computing tasks with Qiskit 2.0. Easier baseline for model capability assessment.

99 tasks HuggingFace

critpt COMING SOON

Research-level physics code generation. 190 tasks covering critical point calculations.

190 tasks Local only

GET STARTED

Run Your Own Benchmarks

Install GrayBench, point it at any provider, and get verified results in minutes. Open source, reproducible, and extensible.


terminal

$ git clone https://github.com/GrayArea-Labs/GrayBench.git

$ cd GrayBench && pip install -e .

$ graybench env setup qiskitbench

$ graybench keys set google

$ graybench run qiskitbench-hard -m google/gemini-2.5-flash
