Every score is pass@1, zero-shot, temperature 0, on the public test split. Quercus-34B leads in four of the six categories we publish; we report the ones we don't win, too. Raw logs are in the model card.
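At temperature 0 there is exactly one greedy sample per problem, so pass@1 collapses to a plain fraction: problems passed over problems attempted. A minimal sketch of that arithmetic (the `results` data is hypothetical, not output from our harness):

```python
# pass@1 with a single greedy (temperature 0) sample per problem:
# just the fraction of problems whose one completion passes.
# `results` below is illustrative dummy data, not real eval output.
results = {
    "two_sum": True,
    "lru_cache": True,
    "regex_match": False,
}

pass_at_1 = sum(results.values()) / len(results)
print(f"pass@1 = {pass_at_1:.3f}")
```

With multiple samples per problem (temperature > 0), you would need the unbiased pass@k estimator instead; at k=1 with one sample, the two coincide.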
We don't separate research from product. Every Quercus model starts as a training experiment and becomes an API call the week it clears eval. The frontier only matters if developers can build on it.
And when someone else has the right model for the job, we host it — Llama, Mistral, Qwen, DeepSeek, all on the same endpoint. Frontier capability isn't always our capability, and we're honest about that.
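In practice, "same endpoint" means the request shape never changes between models: only the model identifier does. A sketch under assumptions — the URL, the OpenAI-style payload fields, and the model name strings are all hypothetical placeholders; check the model card for the real API shape:

```python
# Sketch: routing different hosted models through one endpoint.
# ENDPOINT, the payload fields, and the model-name strings are
# illustrative assumptions, not the documented Quercus API.
import json

ENDPOINT = "https://api.example.com/v1/chat/completions"  # hypothetical

def build_request(model: str, prompt: str) -> dict:
    # Only the `model` field differs between a Quercus model and a
    # hosted third-party model; everything else stays identical.
    return {
        "url": ENDPOINT,
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
        }),
    }

for model in ["quercus-34b", "llama-3-70b", "mistral-large"]:
    req = build_request(model, "Summarize this changelog.")
    print(model, "->", len(req["body"]), "bytes")
```

The point of the design is that swapping models is a one-string change in client code, so there is no lock-in cost to trying the model that actually fits the job.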