Generative AI Act II: Test Time Scaling Drives Cognition Engineering

About

The first generation of Large Language Models - what might be called "Act I" of generative AI (2020-2023) - achieved remarkable success through massive parameter and data scaling, yet exhibited fundamental limitations such as knowledge latency, shallow reasoning, and constrained cognitive processes. During this era, prompt engineering emerged as our primary interface with AI, enabling dialogue-level communication through natural language. We now witness the emergence of "Act II" (2024-present), where models are transitioning from knowledge-retrieval systems (in latent space) to thought-construction engines through test-time scaling techniques. This new paradigm establishes a mind-level connection with AI through language-based thoughts. In this paper, we clarify the conceptual foundations of cognition engineering and explain why this moment is critical for its development. We systematically break down these advanced approaches through comprehensive tutorials and optimized implementations, democratizing access to cognition engineering and enabling every practitioner to participate in AI's second act. We provide a regularly updated collection of papers on test-time scaling in the GitHub Repository: https://github.com/GAIR-NLP/cognition-engineering

Shijie Xia, Yiwei Qin, Xuefeng Li, Yan Ma, Run-Ze Fan, Steffi Chern, Haoyang Zou, Fan Zhou, Xiangkun Hu, Jiahe Jin, Yanheng He, Yixin Ye, Yixiu Liu, Pengfei Liu • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K (test)	Accuracy 91.05	954
Mathematical Reasoning	MATH500 (test)	Accuracy 76.4	922
Mathematical Reasoning	AIME 2024	Accuracy 12.08	370
Mathematical Reasoning	Omni-MATH	Accuracy 25.34	135
Mathematical Reasoning	OlympiadBench	Accuracy 44.51	134
Reasoning	Reasoning domain benchmarks ARC-C, BBH, GPQA, CALM, KOR-BENCH	ARC-C Score 80.7	16
Mathematical Reasoning	AIME 2025	Accuracy 10	13

