Understanding the Challenges of LLMs in Composite Reasoning
Chapter 1: The Limitations of LLMs
Large Language Models (LLMs) have remarkable capabilities, yet they struggle with a specific category of questions. In an experiment, I posed a straightforward question to Bard, Bing, and ChatGPT, and all three produced wrong answers across several variations of the query.
Prompt: Count each character in the sequence: abbcccdddeeefffaddaaa
Correct Response: {a:5, b:2, c:3, d:5, e:3, f:3}
Responses from ChatGPT, Bing, and Bard:
ChatGPT and Bing both overestimated 'a' and underestimated 'd', while Bard was wrong on every character except 'c'. The same pattern persisted with simpler sequences, suggesting that LLMs struggle whenever counts for interspersed characters must be tracked, as in "abaabaa" or "abbcccaddaad".
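For contrast, the task is trivial for a deterministic program. A few lines of Python using the standard library's Counter reproduce the exact tallies, including for the interspersed sequences above:

```python
from collections import Counter

# Deterministic character counting: the task the chat models got wrong.
for seq in ["abbcccdddeeefffaddaaa", "abaabaa", "abbcccaddaad"]:
    print(seq, dict(Counter(seq)))  # exact per-character tally
# first line printed:
# abbcccdddeeefffaddaaa {'a': 5, 'b': 2, 'c': 3, 'd': 5, 'e': 3, 'f': 3}
```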
Question: What explains this phenomenon?
Hypothesis: LLMs seem to depend on the quickest and most superficial mappings from input to output. Consequently, they adopt broad rules based on average experiences with similar sequences. While this approach works well for subjective language, it falters in objective domains like mathematics.
Mathematics demands precision, with a single correct answer, which makes it a poor fit for learning via unconstrained gradient descent toward solutions that are merely good on average. This limitation raises the question of whether the fault lies in the inductive biases of transformers or in the auto-regressive training objective used by these models.
If the challenge lies with inductive biases, tasks like these might seem better suited to Long Short-Term Memory (LSTM) networks, given their design for sequential data and their two kinds of memory: the recurrent hidden state and the cell state. However, research suggests that even LSTMs do not outperform transformers in this regard.
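As a quick illustration of those two memories, here is a minimal PyTorch sketch (PyTorch is assumed to be installed; the sizes are arbitrary) showing the hidden state and cell state an LSTM carries along a sequence:

```python
import torch
import torch.nn as nn

# An LSTM maintains two running memories per step:
# the hidden (recurrent) state h and the cell state c.
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(1, 21, 8)   # one sequence of 21 steps, 8 features per step

out, (h, c) = lstm(x)
print(out.shape)            # torch.Size([1, 21, 16]) - hidden state at every step
print(h.shape, c.shape)     # torch.Size([1, 1, 16]) each - final hidden and cell states
```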
In the video "Battle of the AIs: Can Bing and Bard Beat ChatGPT at Research?", various AI models are tested on their reasoning capabilities, shedding light on their strengths and weaknesses.
Chapter 2: Analyzing Mathematical Reasoning
Mathematical reasoning presents unique challenges that are central to human intelligence. Published comparisons show that transformers outperform LSTMs and their variants on such reasoning tasks.
Hypothesis: To develop a single model proficient in both language comprehension and complex reasoning, strong inductive biases need to be built in. This could mean pairing a hippocampus-like memory for precise memorization with neocortex-like structures for learning general concepts.
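One way to picture that hypothesis is a toy two-path system: an exact lookup memory for inputs seen verbatim, and a general routine for everything else. This is only an illustrative sketch of the idea, not any published architecture; general_model here is a stand-in for a learned generalizer:

```python
from collections import Counter

# Toy complementary-memory sketch: exact recall for seen inputs,
# a general-purpose routine for novel ones.
exact_memory: dict[str, dict] = {}   # "hippocampus-like": verbatim input -> answer pairs

def general_model(seq: str) -> dict:
    # Placeholder for a learned, "neocortex-like" generalizer;
    # here it simply falls back to a deterministic count.
    return dict(Counter(seq))

def answer(seq: str) -> dict:
    if seq in exact_memory:          # precise-memorization path
        return exact_memory[seq]
    result = general_model(seq)      # generalization path
    exact_memory[seq] = result       # consolidate the new episode
    return result
```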
For instance, Google's Minerva model, fine-tuned from PaLM (up to 540 billion parameters), demonstrated strong performance on quantitative-reasoning benchmarks.
The video "ChatGPT vs Claude vs BARD vs Bing Chat vs Perplexity AI vs Pi - The AI Face-Off" compares various AI models in their reasoning abilities, highlighting their performance in quantitative tasks.
Despite its impressive benchmark results, Minerva exhibits the same failure modes as earlier models: its outputs come with no intrinsic validation, unlike symbolic reasoning systems or external tools such as calculators.
Current Solution: A promising approach combines the strengths of ChatGPT with Wolfram Alpha, enhancing mathematical reasoning capabilities.
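A minimal version of that pattern is to let the model hand arithmetic off to an external engine rather than trusting its own computation. The sketch below uses Wolfram|Alpha's Short Answers API; WOLFRAM_APPID is a placeholder for an app ID obtained from the Wolfram developer portal, and any deterministic calculator could stand in for the service:

```python
import os
import urllib.parse
import urllib.request

def ask_wolfram(query: str) -> str:
    # Route a math sub-question to Wolfram|Alpha's Short Answers API
    # instead of trusting the LLM's own arithmetic.
    appid = os.environ["WOLFRAM_APPID"]  # placeholder: your own app ID
    url = ("https://api.wolframalpha.com/v1/result?"
           + urllib.parse.urlencode({"appid": appid, "i": query}))
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

# e.g. ask_wolfram("how many times does d appear in abbcccdddeeefffaddaaa")
```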
Transitive reasoning also poses a significant challenge for LLMs, where an answer can shift dramatically based on a single piece of information. For example, when given two ranked groups of symbols and new evidence, Bing struggled to utilize this critical information effectively.
Even when provided with complete information, Bing was unable to solve the task, raising concerns about its reasoning proficiency.
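To make the transitive-reasoning point concrete, here is a small sketch in which the symbols and facts are invented for illustration (they are not the ones from the Bing prompt). Rankings are inferred by transitive closure over pairwise facts, and a single new fact changes which symbol comes out on top:

```python
from itertools import permutations

def greater_than(facts: set[tuple[str, str]]) -> set[tuple[str, str]]:
    # Transitive closure of pairwise "x > y" facts.
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in permutations(closure, 2):
            if b == c and (a, d) not in closure:   # a > b and b > d  =>  a > d
                closure.add((a, d))
                changed = True
    return closure

def top(symbols: str, facts: set[tuple[str, str]]) -> list[str]:
    # The top symbols are those nothing outranks.
    closure = greater_than(facts)
    return [s for s in symbols if not any(loser == s for (_, loser) in closure)]

facts = {("A", "B"), ("B", "C")}     # ranked group: A > B > C
print(top("ABC", facts))             # ['A']
facts.add(("D", "A"))                # a single new piece of evidence
print(top("ABCD", facts))            # ['D'] - the correct answer shifts entirely
```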
If you found this article insightful and wish to support my work, consider subscribing for updates on future publications. You can also access a wide range of Medium articles through my referral link, where I receive a small portion of your subscription fee.