Verified Qiskit.
Not guesses.
GrayGate runs your quantum code through simulation before you see it. If it doesn't pass, you don't get broken output.
Quantum code is hard to trust
When your circuit compiles but produces garbage distributions, you've already lost an afternoon. Current AI tools make this worse, not better.
Bugs that run
A wrong gate doesn't throw an error. It runs, simulates, and gives you counts that look plausible until you realize they're nonsense. Debugging quantum logic is slow because the feedback loop is broken.
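For illustration, here is the kind of silent failure this means (a hypothetical bug, not GrayGate output): the circuit below intends a Bell state but uses CZ where CX belongs. It runs without error and returns a clean 50/50 histogram, just over the wrong bitstrings.

from qiskit import QuantumCircuit
from qiskit_aer import AerSimulator

# Intended: Bell state, counts split between '00' and '11'.
qc = QuantumCircuit(2)
qc.h(0)
qc.cz(0, 1)  # Bug: CZ instead of CX. Qubit 1 is |0>, so CZ does nothing.
qc.measure_all()

counts = AerSimulator().run(qc, shots=1024).result().get_counts()
print(counts)  # ~{'00': 512, '01': 512} -- a tidy 50/50 split, just the wrong one.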
Stale training data
Qiskit 1.0 broke half the tutorials online. LLMs trained on 2021 examples still suggest execute() instead of run(). The API moves faster than model weights update.
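Concretely: execute() and the qiskit.Aer entry point were both removed in Qiskit 1.0, so the pattern stale models emit fails at import time. A minimal before/after:

from qiskit import QuantumCircuit
from qiskit_aer import AerSimulator

qc = QuantumCircuit(1)
qc.h(0)
qc.measure_all()

# Pre-1.0 pattern stale models still emit -- raises ImportError on Qiskit 1.x:
#   from qiskit import Aer, execute
#   counts = execute(qc, Aer.get_backend('qasm_simulator')).result().get_counts()

# Current pattern: the simulator lives in qiskit-aer and exposes run().
counts = AerSimulator().run(qc).result().get_counts()
print(counts)  # ~{'0': 512, '1': 512}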
No execution check
ChatGPT predicts the next likely token. It doesn't run your circuit. It doesn't know if the output compiles, let alone if the simulation produces valid Bell state correlations.
A 10-stage reliability pipeline
GrayGate wraps code generation in retrieval, planning, and two verification gates. Code only ships if simulation passes.
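As a rough mental model, the loop looks something like the sketch below. Every name in it is an illustrative assumption, not GrayGate's actual stages or code; the point is that generated code must clear a static gate and a runtime gate before it is returned at all.

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Plan:
    prompt: str
    # Judges simulated counts; defined during planning (hypothetical shape).
    acceptance_test: Callable[[dict[str, int]], bool]

def static_gate(code: str) -> bool:
    """Gate 1: the candidate must at least parse before it runs."""
    try:
        compile(code, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def runtime_gate(code: str, plan: Plan) -> bool:
    """Gate 2: execute the candidate and check its counts."""
    scope: dict = {}
    try:
        exec(code, scope)  # candidate is expected to define `counts`
        return plan.acceptance_test(scope["counts"])
    except Exception:
        return False       # crashes and missing results fail the gate

def generate_verified(plan: Plan, generate: Callable[[str], str],
                      max_attempts: int = 3) -> Optional[str]:
    for _ in range(max_attempts):
        code = generate(plan.prompt)  # the LLM call would go here
        if static_gate(code) and runtime_gate(code, plan):
            return code               # only verified code ships
    return None                       # never return plausible-but-wrong output

Returning None instead of a best-effort fallback is the point: a failed gate means no output, not broken output.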
Pipeline Flow
Example Output
from qiskit import QuantumCircuit
from qiskit_aer import AerSimulator

# Bell state: Hadamard on qubit 0, then CNOT to entangle qubit 1.
qc = QuantumCircuit(2)
qc.h(0)
qc.cx(0, 1)
qc.measure_all()

# Run on the Aer simulator and collect measurement counts.
sim = AerSimulator()
result = sim.run(qc).result()
counts = result.get_counts()  # expect ~50/50 split across '00' and '11'
Verification Report
Key insight: The runtime gate executes on Qiskit Aer and checks that output matches the acceptance test defined during planning. Wrong distributions = no output.
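For the Bell-state task above, an acceptance test of roughly this shape would do the job (an illustrative sketch, not GrayGate's actual check): the probability mass must sit on '00' and '11', split evenly within sampling tolerance.

def bell_acceptance_test(counts: dict[str, int], tol: float = 0.05) -> bool:
    shots = sum(counts.values())
    p00 = counts.get("00", 0) / shots
    p11 = counts.get("11", 0) / shots
    # Correlated outcomes only, each near 50%.
    return p00 + p11 > 1 - tol and abs(p00 - p11) < 2 * tol

The buggy CZ circuit shown earlier yields ~{'00': 512, '01': 512}, so p00 + p11 lands near 0.5 and the gate rejects it, even though its histogram looked plausible.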
Qiskit-HumanEval-Hard
151 challenging quantum programming tasks. GrayGate uses Gemini 3.0 Flash as its base model, then wraps it in verification. The wrapper more than doubles the pass rate.
Pass Rate Comparison
Development Status
Active development
GrayGate improves weekly. Architecture and retrieval systems are under constant iteration.
Fine-tuning pipeline
Building infrastructure to train Qiskit-specialized models. Current results use off-the-shelf Gemini.
Autonomous research
Long-term: autonomous quantum algorithm research and evaluation systems.
These benchmarks reflect current state. We're transparent about what works and what we're building.
Same base model. 2× the results.
The verification loop is the difference.