The Scaling Laws Are Breaking

[Image: Robot Newton trying to understand figs at Cambridge, circa 2033]

Part 1 of a series examining the need and search for new scaling laws.

This piece reflects ongoing research at Fig. We welcome discussion and collaboration as we work to formalize these observations.


OpenAI's Orion consumed 15-30× more training compute than GPT-4. By the scaling laws that have guided the field since 2020, it should have been revolutionary. Instead, internal reports suggest the model shows inconsistent improvements across task categories—sometimes performing worse than its predecessor on certain tasks [12, 13].

This is a fundamental break in the pattern that has driven hundreds of billions in infrastructure investment.

The Kaplan [1] and Hoffmann [2] scaling laws promised predictable returns: more compute + more parameters + more data = reliably better performance. GPT-3 to GPT-4 followed this pattern beautifully. GPT-4 to Orion didn't.
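
Hoffmann et al. [2] make that promise precise: expected loss is an irreducible term plus power laws in parameter count N and training-token count D,

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

with fitted values of roughly E ≈ 1.69, A ≈ 406.4, B ≈ 410.7, α ≈ 0.34, β ≈ 0.28 reported in the Chinchilla paper. Hold that form fixed and progress reduces to pushing N and D up together, which is exactly the assumption that data exhaustion breaks.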

We've hit the constraint that Ilya Sutskever highlighted at NeurIPS 2024: "We have reached peak data" [3]. The internet's usable text—approximately 300 trillion tokens—is largely exhausted. Epoch AI projects that high-quality text data will be fully consumed by 2028, though the current pace of development suggests even that timeline may be optimistic [4].

Three Paths, Three Dead Ends

The field's response reveals how deeply we've internalized scaling orthodoxy. Each major approach tries to preserve the old paradigm rather than question it.

Synthetic Data: The Recursive Trap

If we've consumed the internet, why not generate our own training data? Train GPT-5 on GPT-4's outputs, bootstrap intelligence from itself.

The information-theoretic constraints are unforgiving. Recent experiments show [5, 6]:

  • Model collapse begins by generation 5 when synthetic data exceeds 20% of the training mixture
  • Performance degrades 20-30% by generation 5, becoming unusable by generation 10
  • Even Meta's successful use of synthetic data in Llama 3 relied on it as augmentation, not replacement

You cannot bootstrap intelligence from nothing. The recursive loop provides no new information.
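
A toy sketch makes the mechanism concrete (it mirrors the spirit of the Gaussian example in [5]; it is an illustration, not a reproduction of those experiments): fit a simple model to data, train the next generation only on its own samples, and watch the tails vanish.

```python
import numpy as np

# Toy model-collapse loop: fit a Gaussian, sample a purely synthetic training
# set from the fit, refit, repeat. Finite samples under-represent the tails,
# and variance lost this way is never recovered from real data.
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=50)             # generation 0: "real" data

for gen in range(1, 201):
    mu, sigma = data.mean(), data.std()          # "train" on the current data
    data = rng.normal(mu, sigma, size=50)        # next generation: synthetic only
    if gen % 20 == 0:
        print(f"generation {gen:3d}: fitted std = {sigma:.3f}")
# The fitted std decays toward zero across generations: the Gaussian analogue
# of a model forgetting the rare events in its original distribution.
```

Pipelines that keep mixing in fresh human data, as Llama 3 did, avoid the worst of this decay, which is exactly why synthetic data works as augmentation but not as replacement.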

Multimodal Scaling: The Complexity Explosion

YouTube offers 20 billion hours of video, growing by 500 hours per minute. Surely this solves our data problem?

The harsh reality:

  • Video's high redundancy compresses to perhaps 10-15 trillion useful tokens
  • Processing requires 10-100× more compute per token than text
  • Architectural modifications for video understanding reduce transfer efficiency to core language tasks

We're not scaling the same thing anymore—we're building something fundamentally different with fundamentally different economics.

Test-Time Compute: The $3,000 Question

OpenAI's o3 achieves remarkable results: 88% on ARC-AGI (versus GPT-4's 5%), competitive programming performance ranking 175th globally [8, 9]. The method is conceptually elegant—let models "think" longer by generating extensive reasoning chains [7].

The economics are brutal:

  • o3-low: $17.50 per million tokens
  • o3-high: Over $3,000 per complex task
  • Average enterprise query cost: $87
  • Customer willingness to pay: $0.10-$1.00 per query
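
Those numbers hang together once you account for how long reasoning chains get. As a back-of-envelope illustration (the five-million-token chain length is our assumption, not a reported figure):

$$5{,}000{,}000 \text{ tokens} \times \frac{\$17.50}{1{,}000{,}000 \text{ tokens}} \approx \$87.50 \text{ per query}$$

which lands on the enterprise average above and sits roughly two to three orders of magnitude beyond the stated willingness to pay.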

But there's a deeper problem. Recent studies reveal that extended reasoning often makes models less reliable [10]:

  • Accuracy peaks at 1,000-2,000 reasoning tokens, then declines
  • Models confabulate increasingly elaborate incorrect justifications
  • They rationalize their way into errors they wouldn't make with quick responses

We're not building intelligence—we're building very expensive overthinking machines.

Why Traditional Scaling Had to Break

The failure isn't accidental. It's structural. Traditional scaling laws rest on assumptions that no longer hold:

Static Deployment Assumption: The laws assume models are trained, then frozen. But modern systems operate continuously, accumulating thousands of hours of interaction. A customer service agent after 10,000 conversations knows things that weren't in its training data. The framework cannot account for this.

Isolated Intelligence Assumption: Scaling laws treat each model as self-contained. But deployed systems use tools, reference databases, call APIs. A 7B parameter model with browser access outperforms a 70B model without it. Traditional scaling is silent on this inversion.

Uniform Computation Assumption: The laws assume every token gets equal processing. But intelligent systems should allocate compute based on difficulty—think harder about hard problems, less about easy ones. o3's uniform test-time scaling is like running a marathon at sprint pace.

Single Task Assumption: Traditional scaling measures perplexity on held-out text. But we're asking models to browse the web, write code, use tools, and plan multi-step actions. The evaluation framework fundamentally misaligns with the deployment reality [18, 19, 20].

Toward Interaction Laws

What we're discovering is that intelligence in real-world systems isn't a function of three variables (compute, parameters, data). It emerges from the interaction of many dimensions:

  • Computational capacity (traditional scaling)
  • Experience accumulation (learning through deployment)
  • Tool augmentation (compositional capabilities)
  • Environmental complexity (diversity of contexts)
  • Temporal coherence (persistence and memory)
  • Network effects (multi-agent dynamics)

Each dimension has its own scaling properties. More importantly, they interact in ways we're only beginning to understand:

  • Small models with tools outperform large models without them
  • Experience accumulation shows phase transitions rather than smooth scaling
  • Multi-agent systems exhibit emergent capabilities unpredictable from individual agents

We are tentatively calling these relationships interaction laws—principles that describe how different dimensions of capability interact and potentially amplify each other along axes of scale.

Early Observations

We're still developing the formal framework and will have more to share soon. In the meantime, a few patterns are already becoming clear:

Non-monotonicity: Unlike traditional scaling's smooth curves, we observe discontinuous jumps. Systems seem to cross capability thresholds at specific experience levels.

Multiplicative Effects: Tool integration doesn't add to capability—it multiplies it. The gain from tools scales with base model capability up to a saturation point we're still identifying.
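
In notation (purely illustrative, not our formal model), the additive picture would be C_tools = C_base + t for some fixed tool bonus t; what we observe looks closer to

$$C_{\text{tools}} \approx C_{\text{base}} \cdot \bigl(1 + g(C_{\text{base}})\bigr),$$

where the multiplier g grows with base capability until it saturates. A weak model cannot exploit a browser; a strong model turns the same browser into a large share of its effective capability.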

Experience Efficiency: Not all experience is equal. Diverse, challenging interactions provide more learning signal than repetitive tasks. This suggests curriculum design may be as important as scale.

What's Working Now

Three approaches are showing promise by implicitly acknowledging these multi-dimensional dynamics:

Adaptive Architectures: Google's Gemini 2.5 selectively invokes expensive reasoning only when needed [14]. Most queries get fast, cheap responses; complex problems trigger deeper processing. This implicitly recognizes that uniform computation is wasteful.
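
The pattern is easy to sketch in code. The snippet below is a generic difficulty-gated router with hypothetical tier names, thresholds, and a stand-in difficulty estimator; it is not Gemini's actual mechanism:

```python
from dataclasses import dataclass

@dataclass
class Route:
    model: str              # hypothetical tier name, not a real product SKU
    max_think_tokens: int   # budget for explicit reasoning tokens

def estimate_difficulty(query: str) -> float:
    """Stand-in difficulty score in [0, 1]; a production system would use a
    lightweight classifier or the model's own uncertainty signals instead."""
    hard_markers = ("prove", "debug", "multi-step", "optimize", "why does")
    return min(1.0, 0.2 + 0.2 * sum(m in query.lower() for m in hard_markers))

def route(query: str) -> Route:
    # Cheap default path; expensive reasoning only when the query looks hard,
    # and even then capped near the 1,000-2,000 token sweet spot noted above.
    if estimate_difficulty(query) < 0.5:
        return Route(model="fast-tier", max_think_tokens=0)
    return Route(model="reasoning-tier", max_think_tokens=2_000)

print(route("What's our refund policy?"))
print(route("Debug this multi-step pipeline and prove the invariant holds."))
```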

Domain Specialization with Tools: Anthropic's Claude Sonnet 4.5 achieves 70.3% on SWE-bench by deeply integrating with development environments [15, 17]. Rather than scaling the model, they're scaling the system's capabilities through tool mastery.

Efficiency Through Architecture: DeepSeek R1 matches o1's performance at 3% of the cost using mixture-of-experts (671B parameters, 37B active) [16]. This suggests that how we scale matters as much as how much we scale.
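
The arithmetic behind that efficiency claim is straightforward:

$$\frac{37\text{B active parameters}}{671\text{B total parameters}} \approx 5.5\%$$

Each token pays roughly the compute of a 37B-parameter forward pass while the system keeps the representational capacity of the full 671B-parameter network.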

The Path Forward

Understanding interaction laws will require rethinking our entire approach to AI development:

New Metrics: We need benchmarks that capture multi-dimensional progress. How do we measure a system that learns from experience? That leverages tools? That persists across sessions?

New Architectures: Models designed for experience accumulation and tool use from the ground up, not as afterthoughts.

New Training Paradigms: Instead of maximizing compute on static datasets, we need to optimize across multiple dimensions simultaneously.

We're at the beginning of this exploration. Over the coming months, we'll be sharing our research into these interaction dynamics—how to measure them, how to optimize for them, and what they mean for the future of AI.

The breakdown of traditional scaling isn't a ceiling—it's a doorway. On the other side lies a richer, more nuanced understanding of intelligence that emerges not from brute force, but from the complex interaction of multiple capabilities.

Citation

Please cite this work as:

Sikka, H., & Fig AI Team. (2025, October 30). The scaling laws are breaking. Fig AI: Perspectives on Intelligence. https://blog.fig.inc/the-scaling-laws-are-breaking/

Or use the BibTeX citation:

@article{sikka2025scaling,
  title={The Scaling Laws Are Breaking},
  author={Sikka, Harshvardhan and {Fig AI Team}},
  journal={Fig AI: Perspectives on Intelligence},
  year={2025},
  month={October},
  day={30},
  url={https://blog.fig.inc/the-scaling-laws-are-breaking/},
  note={First in a series on interaction laws: why traditional scaling is failing and the need for multi-dimensional frameworks in AI development},
  keywords={scaling laws, large language models, interaction laws, test-time compute, synthetic data, multimodal scaling, AI evaluation}
}

References

[1] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361.

[2] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. V. D., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., ... Sifre, L. (2022). Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556.

[3] Sutskever, I. (2024, December). Peak Data: Implications for Future AI Development [Keynote]. Conference on Neural Information Processing Systems (NeurIPS) 2024, Vancouver, Canada.

[4] Villalobos, P., Sevilla, J., Heim, L., Besiroglu, T., Hobbhahn, M., & Ho, A. (2024). Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning. Epoch AI. https://epochai.org/blog/will-we-run-out-of-ml-data

[5] Shumailov, I., Shumaylov, Z., Zhao, Y., Gal, Y., Papernot, N., & Anderson, R. (2023). The Curse of Recursion: Training on Generated Data Makes Models Forget. arXiv preprint arXiv:2305.17493.

[6] Alemohammad, S., Casco-Rodriguez, J., Luzi, L., Ahmed, H. R., Babaei, M., Baraniuk, R., & Lin, T. (2023). Self-Consuming Generative Models Go MAD. arXiv preprint arXiv:2307.01850.

[7] OpenAI. (2024, December 20). Deliberative Alignment: Reasoning Enables Safer Language Models. https://openai.com/index/deliberative-alignment/

[8] OpenAI. (2025, January 17). OpenAI o3 and o3-mini. https://openai.com/index/openai-o3-and-o3-mini/

[9] ARC Prize Foundation. (2024, December 20). OpenAI o3 Breakthrough. https://arcprize.org/blog/openai-o3-breakthrough

[10] Anthropic. (2025, July). Inverse Scaling Properties of Chain-of-Thought Reasoning. Anthropic Research Blog.

[11] Hinton, G., Sutskever, I., Bengio, Y., LeCun, Y., Ng, A., Schmidhuber, J., & Russell, S. (2025, July). Joint Statement on AI Reasoning Transparency. Future of Humanity Institute.

[12] Reuters. (2024, November 15). Exclusive: OpenAI's next flagship model might not represent as big a leap forward as its predecessors. Reuters Technology.

[13] The Information. (2024, November 20). OpenAI Shifts Strategy as GPT-5 Faces Delays. The Information.

[14] Google DeepMind. (2024, December). Gemini 2.0: Our next era of models. https://deepmind.google/gemini/

[15] Anthropic. (2024, October). Introducing Claude 3.5 Sonnet and Claude 3.5 Haiku. https://www.anthropic.com/news/3-5-models-and-haiku

[16] DeepSeek. (2025, January 20). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948.

[17] Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. arXiv preprint arXiv:2310.06770.

[18] OSWorld Team. (2025). OSWorld v3: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. In Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Datasets and Benchmarks Track.

[19] H2O.ai. (2025, September). GAIA Leaderboard Update: Progress and Persistent Challenges. https://h2o.ai/blog/gaia-benchmark-2025/

[20] Xie, T., Zhao, Y., Li, Y., Wang, J., & Chen, L. (2025). ARC-AGI-2: A Harder Benchmark for General Intelligence. ARC Prize Foundation Technical Report.