Part 1 of a series examining the need and search for new scaling laws. This piece reflects ongoing research at Fig. We welcome discussion and collaboration as we work to formalize these observations.

OpenAI's Orion consumed 15-30× more training compute than GPT-4. By the scaling laws that have
GPT-4 scored 95% on HumanEval. So did Claude. So did Gemini. But your production deployment still breaks on basic customer queries. We've collectively entered what is fast becoming a dangerous phase of AI development: one where benchmarks tell us nothing about what actually matters. Models have memorized the