On the Role of Chain of Thought in Long In-Context Learning

1The Hong Kong University of Science and Technology
2Fudan University   3WeChat AI

Abstract

In-Context Learning (ICL) has emerged as a powerful paradigm for adapting Large Language Models (LLMs) to new tasks without gradient updates. While recent advances in long-context models have enabled a shift from few-shot to many-shot ICL, achieving performance comparable to fine-tuning, research has largely focused on classification tasks. In contrast, the behavior of Chain-of-Thought (CoT) prompting, which elicits complex reasoning, remains underexplored in many-shot settings. We present a comprehensive analysis of many-shot in-context CoT learning and reveal fundamental differences from traditional classification-based ICL. Our findings show that CoT performance in many-shot settings deviates notably from patterns reported in earlier few-shot observations. In addition, demonstration selection based on input similarity, a common heuristic in ICL, becomes ineffective under the CoT paradigm. Counterintuitively, our experiments show that the quality of the reasoning chain, as measured by its ground-truth correctness, is not the primary driver of success. Instead, we observe a consistent performance hierarchy in which model-self-generated CoTs, even when incorrect, outperform demonstrations with human-verified, correct reasoning. This suggests that the effectiveness of many-shot CoT prompting is driven less by demonstration quality than by alignment with the LLM’s internal reasoning processes. Our findings challenge prevailing assumptions and underscore the need for new strategies tailored to the unique dynamics of many-shot CoT learning.