Live Results
Qiskit-HumanEval-Hard Leaderboard
The benchmark comprises 151 challenging quantum programming tasks that test code generation, correctness, and execution. Models are evaluated on their ability to generate valid, runnable Qiskit code that passes simulation verification.
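The page does not describe the evaluation harness, so the following is only a minimal sketch of how a pass rate could be computed from a set of tasks. The task format, the names `passes` and `pass_rate`, and the toy checkers are all hypothetical; a real checker would presumably build and simulate a Qiskit circuit rather than call a plain Python function.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical task format (not taken from the benchmark): generated source
# code paired with a checker that inspects the namespace the code produces.
Task = Tuple[str, Callable[[Dict[str, object]], bool]]


def passes(source: str, check: Callable[[Dict[str, object]], bool]) -> bool:
    """Return True only if the code runs and the checker accepts its result."""
    namespace: Dict[str, object] = {}
    try:
        exec(source, namespace)  # run the generated code; sandbox this in practice
        return bool(check(namespace))
    except Exception:
        # A syntax error, runtime error, or failing checker all count as a fail.
        return False


def pass_rate(tasks: List[Task]) -> Tuple[int, float]:
    """Return (number of tasks passed, fraction of tasks passed)."""
    passed = sum(passes(src, chk) for src, chk in tasks)
    return passed, (passed / len(tasks) if tasks else 0.0)


# Toy stand-ins for real tasks, used only to exercise the functions above.
demo_tasks: List[Task] = [
    ("def add(a, b):\n    return a + b", lambda ns: ns["add"](2, 3) == 5),
    ("def add(a, b):\n    return a - b", lambda ns: ns["add"](2, 3) == 5),
    ("def add(a, b) return a + b", lambda ns: True),  # does not parse
]
demo_passed, demo_rate = pass_rate(demo_tasks)
```

Under this sketch, the "Passed" column would correspond to the first returned value and the "Pass Rate" column to the second, expressed as a percentage.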
Models Tested: 0
Total Tasks: 151
Version: Qiskit 2.3
Temperature: Zero
LEADERBOARD
Model Rankings
Higher pass rate is better
Rank | Model | Provider | Pass Rate | Passed
No benchmark data available
Check back later for updated results.
ERROR ANALYSIS
Failure Modes
Understanding where models fail helps identify areas for improvement in quantum code generation.
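No failure data is published yet, so the following is only an illustrative sketch of how failures could be bucketed and counted. The category names (`syntax_error`, `runtime_error`, `check_error`, `wrong_result`) and the functions `classify` and `tally` are assumptions made for this example, not categories defined by the benchmark.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical task format: generated source plus a checker over its namespace.
Task = Tuple[str, Callable[[Dict[str, object]], bool]]


def classify(source: str, check: Callable[[Dict[str, object]], bool]) -> str:
    """Assign one outcome label to a generated solution (labels are assumed)."""
    try:
        compiled = compile(source, "<generated>", "exec")
    except SyntaxError:
        return "syntax_error"  # the code does not parse
    namespace: Dict[str, object] = {}
    try:
        exec(compiled, namespace)  # sandbox this in practice
    except Exception:
        return "runtime_error"  # the code parses but crashes when run
    try:
        ok = check(namespace)
    except Exception:
        return "check_error"  # e.g. an expected function is missing
    return "passed" if ok else "wrong_result"


def tally(tasks: List[Task]) -> Counter:
    """Count how many tasks fall into each outcome label."""
    return Counter(classify(src, chk) for src, chk in tasks)


# Toy inputs, used only to show one example of each label.
demo_counts = tally([
    ("def f():\n    return 1", lambda ns: ns["f"]() == 1),
    ("def f():\n    return 2", lambda ns: ns["f"]() == 1),
    ("def f( return 1", lambda ns: True),
    ("raise ValueError('boom')", lambda ns: True),
    ("x = 1", lambda ns: ns["f"]() == 1),
])
```

Separating parse failures from runtime and verification failures in this way gives a rough view of whether a model struggles with syntax, with executing code, or with producing the correct result.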