What are LLMs still not good at?

An exploration of the current capabilities and shortcomings of LLMs

Nikola Milosevic (Data Warrior)
Published in AI Advances
12 min read · Nov 11, 2024

Large Language Models (LLMs) have revolutionized natural language processing, demonstrating remarkable capabilities in text generation, translation, and various language-related tasks. These models, such as GPT-4, BERT, and T5, are based on transformer architectures and are trained on vast amounts of text data, typically by learning to predict the next (or a masked) word in a sequence.

How Do LLMs Work?

LLMs operate by processing input text through multiple layers of attention mechanisms, allowing them to capture complex relationships between words and phrases. This process involves several key components and steps:

Tokenization and Embedding

First, the input text is tokenized into smaller units, typically words or subwords. These tokens are then converted into numerical representations called embeddings. For example, the sentence “The cat sat on the mat” might be tokenized into [“The”, “cat”, “sat”, “on”, “the”, “mat”], and each token would be assigned a unique vector representation.
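To make this concrete, here is a minimal sketch of tokenization and embedding lookup using the open-source Hugging Face transformers library. The small bert-base-uncased model is chosen purely for illustration; production LLMs use different tokenizers and much larger embedding tables.

```python
# Minimal illustration of tokenization and embedding lookup.
# Requires the `transformers` and `torch` packages; bert-base-uncased is
# used only because it is small and freely available.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))
# e.g. ['[CLS]', 'the', 'cat', 'sat', 'on', 'the', 'mat', '[SEP]']

# Each token id is mapped to a dense vector (its embedding).
embeddings = model.get_input_embeddings()(inputs["input_ids"])
print(embeddings.shape)  # (1, number_of_tokens, hidden_size)
```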

Multi-Layer Processing

The embedded tokens are processed through multiple transformer layers, each containing self-attention mechanisms and feed-forward neural networks.

In each layer:

  1. Self-Attention: The model computes attention scores between all pairs of tokens, allowing it to weigh the importance of different words in relation to each other. For instance, in the sentence “The bank by the river is closed,” the model might assign higher attention scores between “bank” and “river” to understand the context (a minimal sketch of this computation follows this list).
  2. Feed-Forward Networks: These networks further process the attention-weighted representations, allowing the model to capture more complex patterns.
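The self-attention step can be sketched in a few lines of NumPy. This is a simplified, single-head version without masking or the multi-head machinery used in real transformer layers; the matrices here are random stand-ins for learned parameters.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Simplified single-head self-attention over a sequence of embeddings X.

    X: (seq_len, d_model) token embeddings
    Wq, Wk, Wv: learned projection matrices of shape (d_model, d_head)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                    # pairwise attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # context-weighted values

# Toy example: 7 tokens ("The bank by the river is closed"),
# 8-dimensional embeddings, 4-dimensional attention head.
rng = np.random.default_rng(0)
X = rng.normal(size=(7, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (7, 4): one context-aware vector per token
```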

Contextual Understanding

As the input passes through these layers, the model builds increasingly sophisticated representations of the text, capturing both local and global contexts. This enables the LLM to understand nuanced relationships, such as:

  • Long-range dependencies (e.g., understanding pronoun references across sentences)
  • Semantic similarities and differences
  • Idiomatic expressions and figurative language

Training and Pattern Recognition

During training, LLMs are exposed to vast amounts of text data, allowing them to learn patterns and structures in language.

This includes:

  • Grammar and Syntax: The model learns the rules governing sentence structure and word order.
  • Semantic Relationships: It recognizes connections between related concepts (e.g., “dog” and “puppy”).
  • Common Phrases and Idioms: The model learns frequently used expressions and their meanings.

Generating Responses

When generating responses, the LLM uses its learned patterns to predict the most likely next word or token given the context. This process is iterative, with each generated token influencing the prediction of subsequent tokens.

For example, if prompted with “The Eiffel Tower is located in”, the model might generate “Paris” based on its learned associations between these concepts.

By leveraging these complex mechanisms, LLMs can generate coherent and contextually appropriate responses that often exhibit human-like understanding of language and knowledge.
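The following sketch runs this iterative loop explicitly, using greedy decoding with the small open GPT-2 model from Hugging Face transformers as a stand-in for much larger LLMs; the exact continuation it produces is not guaranteed to be “Paris”.

```python
# Greedy next-token generation loop. Requires `transformers` and `torch`;
# GPT-2 is used only as a small, freely available stand-in for larger LLMs.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The Eiffel Tower is located in", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(5):                            # generate five tokens, one at a time
        logits = model(ids).logits                # shape: (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        ids = torch.cat([ids, next_id], dim=-1)   # feed the new token back in

print(tokenizer.decode(ids[0]))
```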

Limitations in Reasoning and Planning

Despite their impressive abilities, LLMs still face significant challenges in areas such as reasoning and planning. Research by Subbarao Kambhampati and his colleagues has shed light on these limitations, revealing several key areas where LLMs struggle to match human-level cognitive abilities.

Lack of Causal Understanding

Kambhampati’s work reveals that LLMs struggle with causal reasoning, which is fundamental to understanding how events and actions are related in the real world.

Example 1: Weather and Clothing
When presented with a scenario where a person is wearing a coat because it’s cold outside, LLMs may fail to infer that removing the coat would make the person feel colder. This demonstrates their inability to understand the causal relationship between temperature and clothing choices.

Example 2: Plant Growth
If given information about a plant dying due to lack of water, an LLM might not reliably conclude that regular watering would have prevented the plant’s death. This shows a lack of understanding of the causal link between water and plant survival.

Difficulty with Multi-Step Planning

Another area where LLMs fall short is in multi-step planning. Kambhampati’s research demonstrates that these models often struggle to break down complex tasks into logical sequences of actions.

Example: Birthday Party Planning
When asked to plan a birthday party, an LLM might generate a list of relevant items or activities, such as:

  1. Invite guests
  2. Buy decorations
  3. Order cake
  4. Prepare food
  5. Set up music

However, this list lacks the logical sequencing and dependencies crucial for effective planning. A more sophisticated plan would include steps like:

  1. Set a date and time
  2. Create a guest list
  3. Send invitations (at least two weeks before the event)
  4. Determine the party theme
  5. Plan the menu based on guest preferences and dietary restrictions
  6. Order the cake (at least one week in advance)
  7. Purchase decorations and party supplies
  8. Prepare or order food (timing depends on the type of food)
  9. Set up decorations on the day of the party
  10. Arrange music and entertainment

The Blocksworld problem

One of the primary focuses of Kambhampati’s research is the Blocksworld domain, a classic planning problem that involves stacking and unstacking blocks to achieve a desired configuration. Despite its apparent simplicity, this domain has proven to be surprisingly challenging for LLMs. Kambhampati’s experiments revealed that even advanced models like GPT-3 struggle to generate correct plans for Blocksworld tasks autonomously.

In their study, Kambhampati and his colleagues tested various LLMs, including different versions of GPT, on a set of 600 Blocksworld instances. The results were striking: even the most capable models solved only a small fraction of the problems correctly. For example, GPT-3 (Instruct) managed to solve only about 12.5% of the instances when given natural language prompts, and this performance dropped even further when using more formal PDDL (Planning Domain Definition Language) prompts.

Schema and example of a Blocksworld problem (image by author)

The research also explored the impact of fine-tuning on LLMs’ planning abilities. Kambhampati’s team fine-tuned GPT-3 on a dataset of 1,000 Blocksworld instances, separate from their test set. However, the results were disappointing — the fine-tuned model only solved about 20% of the test instances, suggesting that fine-tuning has limited effectiveness in improving LLMs’ planning capabilities in this domain.

Kambhampati’s work reveals that LLMs tend to rely heavily on pattern matching rather than developing a true understanding of the planning problem. This was evidenced by the observation that changing the example plan in the prompt led to a significant drop in accuracy, even for instances where the model had previously generated correct plans.

Key findings from these experiments include:

  1. GPT-4 performance with natural language prompts on the standard Blocksworld domain:
  • Zero-shot: Solved 210 out of 600 instances (35%)
  • One-shot: Solved 206 out of 600 instances (34.3%)

2. GPT-4 performance with PDDL-style prompts. PDDL-style prompts present planning problems to the model in the formal syntax of the Planning Domain Definition Language (PDDL) rather than in natural language. They typically consist of two parts: a domain description, which defines the predicates and actions common to all problems in a domain, and a problem description, which specifies the objects, initial state, and goal state of a particular instance (a simplified sketch of such a prompt appears after these results). With these prompts, GPT-4 achieved:

  • Zero-shot: Solved 106 out of 600 instances (17.7%)
  • One-shot: Solved 75 out of 600 instances (12.5%)

3. Comparison to Other Models:

  • GPT-4 performed significantly better than previous GPT models on the Blocksworld domain.
  • GPT-3.5 did not solve a single instance in the entire set of natural language instances.

4. Chain of Thought Prompting:

  • Chain of thought prompting did not significantly improve performance over one-shot natural language prompts.

5. Mystery Blocksworld:

  • When the domain was obfuscated (renamed to “Mystery Blocksworld”), GPT-4’s performance dropped dramatically:
  • Natural language prompts: Solved only 1 out of 600 instances
  • PDDL-style prompts: Solved only 3 out of 600 instances

6. Fine-tuning Results (GPT-3):

  • A GPT-3 model fine-tuned on 1,000 Blocksworld instances solved only about 20% (122 out of 600) of the test instances.

These findings highlight a fundamental limitation of current LLMs in handling tasks that require multi-step reasoning and planning.
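For readers unfamiliar with PDDL, the snippet below sketches roughly what a PDDL-style Blocksworld prompt looks like. It is a simplified, abridged illustration (only one of the four Blocksworld actions is shown), not the exact prompt format used in Kambhampati’s study.

```python
# Abridged, illustrative PDDL-style prompt for Blocksworld; not the exact
# prompts from the cited study. A full domain would also define the
# put-down, stack, and unstack actions.
DOMAIN = """
(define (domain blocksworld)
  (:predicates (on ?x ?y) (ontable ?x) (clear ?x) (handempty) (holding ?x))
  (:action pick-up
    :parameters (?x)
    :precondition (and (clear ?x) (ontable ?x) (handempty))
    :effect (and (holding ?x)
                 (not (ontable ?x)) (not (clear ?x)) (not (handempty)))))
"""

PROBLEM = """
(define (problem stack-a-on-b)
  (:domain blocksworld)
  (:objects a b)
  (:init (clear a) (clear b) (ontable a) (ontable b) (handempty))
  (:goal (on a b)))
"""

prompt = (
    "You are a planner. Given the PDDL domain and problem below, "
    "return a valid sequence of actions that reaches the goal.\n"
    + DOMAIN + PROBLEM
)
print(prompt)
```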

Temporal Reasoning

Kambhampati’s research also highlights LLMs’ difficulties with temporal reasoning, especially when it involves understanding the sequence of events or the passage of time.

Example: Historical Timeline
When asked to arrange historical events chronologically, LLMs might make errors in sequencing, particularly for events that are closely related or occurred in rapid succession. For instance, they might incorrectly order events of the French Revolution or misplace the timing of key battles in World War II.

Counterfactual Reasoning

Another area of difficulty identified by Kambhampati is counterfactual reasoning — the ability to consider hypothetical scenarios that contradict known facts.

Example: Alternative History
When prompted with “What if the Industrial Revolution had not occurred?”, LLMs might struggle to construct a coherent alternative history. They could generate responses that fail to consider the far-reaching implications of such a change, or they might inadvertently include technologies or societal structures that would not exist without industrialization.

These limitations highlight the need for continued development in AI systems to bridge the gaps in reasoning and planning capabilities. Kambhampati’s work suggests that while LLMs excel at pattern recognition and language generation, they still lack the deeper understanding and logical reasoning abilities that humans possess.

This underscores the importance of developing hybrid AI systems that combine the strengths of LLMs with other AI techniques to achieve more robust reasoning and planning capabilities.

Token and Numerical Errors

LLMs exhibit peculiar errors when dealing with numbers and comparisons, particularly with decimal numbers and mathematical operations. These errors stem from the model’s tokenization process and its lack of true numerical understanding. Let’s explore this issue in more depth:

Tokenization and Numerical Representation

The root cause of these errors lies in how LLMs tokenize and process numerical input. As explained in the tokenization guide, numbers are often split into separate tokens in inconsistent ways.

For example:

  • “380” might be tokenized as a single token
  • “381” could be split into two tokens: “38” and “1”
  • “3000” might be one token, while “3100” could be split into “3” and “100”

This inconsistent tokenization makes it difficult for the model to maintain a coherent understanding of numerical values.

One example of mistakes made by ChatGPT when reasoning about tokens (screenshot by author, 8th Nov 2024)
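You can inspect this splitting behavior directly with a tokenizer library such as OpenAI’s open-source tiktoken. The exact splits depend on the tokenizer, so treat the output as illustrative rather than universal.

```python
# Inspect how a real tokenizer splits numbers. Requires the `tiktoken`
# package; different tokenizers split numbers differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/4-era models

for text in ["380", "381", "3000", "3100", "9.9", "9.11"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(f"{text!r} -> {pieces}")
```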

Decimal Comparison Errors

The example of 9.9 < 9.11 is a classic case that demonstrates this issue.

How ChatGPT reasons about this problem (screenshot by author)

Here’s why this error occurs:

  1. The model tokenizes “9.9” and “9.11” into separate pieces rather than treating them as single numeric values.
  2. Instead of comparing the two numbers as decimals, it tends to compare the digits after the decimal point as whole numbers, the way software version numbers or dates are ordered.
  3. Since 11 is greater than 9, it concludes that 9.11 must be the larger value.

This leads to the incorrect assertion that 9.9 is less than 9.11.
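The flawed shortcut can be reproduced outside the model. The sketch below contrasts a correct numeric comparison with a “compare the digits after the decimal point as whole numbers” comparison; it is an illustration of the failure mode, not the model’s actual internal computation.

```python
# Correct numeric comparison vs. the version-number-style shortcut that
# produces the "9.11 > 9.9" mistake. Purely illustrative.
a, b = "9.9", "9.11"

print(float(a) > float(b))   # True: numerically, 9.9 is larger than 9.11

# Shortcut: compare the fractional parts as whole numbers (9 vs. 11),
# the way software versions or dates are ordered.
frac_a = int(a.split(".")[1])
frac_b = int(b.split(".")[1])
print(frac_a > frac_b)       # False: by this logic 9.11 wrongly "wins"
```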

More Examples of Numerical Errors

  1. Arithmetic Operations: LLMs often struggle with basic arithmetic, especially with larger numbers or decimal operations. For instance, they might incorrectly calculate 127 + 677 (the correct answer is 804).
  2. Inconsistent Rounding: When asked to round numbers, LLMs may produce inconsistent results, especially with numbers close to rounding thresholds.
  3. Order of Operations: LLMs can fail to correctly apply the order of operations in mathematical expressions.
  4. Large Number Comparisons: Comparing very large numbers or numbers with many decimal places often leads to errors.

Research and Data

A study by Patel et al. (2021) found that GPT-3, despite its impressive language capabilities, struggled with basic arithmetic tasks. The model’s accuracy dropped significantly for calculations involving numbers with more than three digits.

Another research paper by Zhang et al. (2022) demonstrated that LLMs perform poorly on tasks requiring precise numerical reasoning. They found that even state-of-the-art models like GPT-3 and PaLM achieved less than 50% accuracy on a dataset of numerical reasoning problems.

Implications and Potential Solutions

These numerical errors have significant implications, especially in fields like finance, engineering, or scientific research where precise calculations are crucial. Some potential solutions being explored include:

  1. Specialized Tokenization: Developing tokenization methods that treat numbers more consistently.
  2. Hybrid Models: Combining LLMs with specialized numerical processing modules.
  3. Enhanced Training: Incorporating more numerical reasoning tasks in the training process.
  4. External Calculators: Using external tools for arithmetic operations and feeding the results back to the LLM (a minimal sketch of this approach follows this list).
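As a rough illustration of the last idea, the sketch below performs arithmetic with a small, deterministic expression evaluator and hands the exact result to the model. Here `ask_llm` is a hypothetical stand-in for whatever chat-completion call your application uses; it is not a real library function.

```python
# Delegate arithmetic to a deterministic calculator instead of trusting the
# model's own arithmetic. `ask_llm` is a hypothetical callable wrapping a
# chat-completion API.
import ast
import operator

# Supported binary operators for the tiny calculator.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculate(expression: str) -> float:
    """Evaluate a basic arithmetic expression without using the LLM."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")
    return ev(ast.parse(expression, mode="eval"))

def answer_with_calculator(question: str, expression: str, ask_llm) -> str:
    """Compute the arithmetic externally, then let the LLM phrase the answer."""
    result = calculate(expression)  # exact arithmetic, outside the model
    return ask_llm(
        f"{question}\nA calculator computed the value {result}. "
        "Use this number verbatim in your answer."
    )

print(calculate("127 + 677"))  # 804
```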

Understanding these limitations is crucial for developers and users of LLM-based systems. It highlights the need for careful validation of any numerical outputs and the importance of not blindly trusting LLM results for critical numerical tasks.

Hallucinations and Biases

Hallucinations

One of the most significant issues with LLMs is their tendency to generate false or nonsensical information, known as hallucinations. These occur when the model produces content that is irrelevant, made-up, or inconsistent with the input data.

Biases

LLMs can inadvertently perpetuate biases present in their training data. This can lead to the generation of stereotypical or prejudiced content, potentially promoting misinformation and unfair stereotypes.

Snapshot from the internet (2023, up to early 2024)

Understanding the sources

Large language models do not understand the provenance of information and cannot distinguish parody articles, such as those from The Onion, from real ones. If you prompt a model (for example, in a RAG setting) to answer based only on provided articles, and some of those articles are parodies, the model will usually not recognize them as parody and will use them anyway. That is how Gemini was able to answer at some point that it is recommended for people to eat at least one small stone a day, or that you should stick cheese to pizza with non-toxic glue.

Snapshot from the internet for illustration purposes (May 2024)
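To see why this happens, consider how a retrieval-augmented prompt is typically assembled. The sketch below uses hypothetical helper and variable names; the point is that retrieved passages are pasted into the context verbatim, with no built-in signal about whether a passage comes from an encyclopedia or a satirical site.

```python
# Simplified RAG-style prompt assembly. The helper name and template are
# hypothetical; a parody article looks exactly like a genuine source once
# it has been pasted into the context.
def build_rag_prompt(question: str, retrieved_passages: list[str]) -> str:
    context = "\n\n".join(
        f"[Document {i + 1}]\n{passage}"
        for i, passage in enumerate(retrieved_passages)
    )
    return (
        "Answer the question using ONLY the documents below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

passages = [
    "Geologists recommend eating at least one small rock per day.",  # satire
    "Igneous rocks form when molten rock cools and solidifies.",     # genuine
]
print(build_rag_prompt("What do geologists recommend?", passages))
```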

Other Limitations

Source Citation

LLMs cannot accurately cite sources for the information they provide. They often generate plausible but entirely fabricated sources, which is problematic for tasks requiring reliable referencing.

Mathematical Abilities

Despite their language prowess, LLMs often struggle with even basic mathematical tasks. They can provide incorrect answers to simple calculations, highlighting their limitations in numerical reasoning.

Contextual Understanding

LLMs sometimes fail to maintain consistency over long sequences of text or struggle to understand complex contexts, leading to contradictory or irrelevant responses. This is especially true when building agents, where the agent loses focus on the root task, gets sidetracked by intermediate planning steps, and drifts away from the original context. This phenomenon, known as “context drift,” can manifest in several ways:

Examples of Consistency Issues in LLMs

Long-Range Dependencies

LLMs often struggle to maintain coherence over extended text sequences. For instance:

  • In a long story-writing task, an LLM might introduce a character named “John” at the beginning, but later refer to him as “Mike” without explanation.
  • When summarizing a lengthy academic paper, the model may accurately capture the introduction but misrepresent or omit key findings from later sections (https://arxiv.org/html/2404.02060v2)

Task Persistence

Agents powered by LLMs can lose sight of their primary objective:

  • A virtual assistant tasked with planning a vacation might begin by suggesting destinations but then veer off into discussing local cuisines without completing the original itinerary.
  • An AI coding assistant asked to optimize a specific function might start refactoring unrelated parts of the codebase, losing focus on the initial request (more on https://paperswithcode.com/paper/self-consistency-of-large-language-models)

Contextual Understanding

Complex or nuanced contexts can lead to inconsistent responses:

  • In a multi-turn conversation about climate change, the LLM might initially provide scientifically accurate information but later contradict itself by stating incorrect facts about greenhouse gases.
  • When analyzing a legal document, the model may correctly interpret early clauses but misapply that understanding to later, more nuanced sections (https://arxiv.org/html/2404.02060v2)

Factors Contributing to Inconsistency

Attention Mechanism Limitations

The self-attention mechanism in Transformers, while powerful, can struggle with very long sequences:

  • As the input grows, the model’s ability to attend to relevant information from earlier in the sequence diminishes, leading to a recency bias.
  • This can result in the model favoring information from later parts of the input, potentially contradicting or ignoring earlier context.

Training Data Biases

LLMs trained on diverse datasets may struggle with consistency due to conflicting information:

  • The model might have been exposed to contradictory facts or opinions during training, leading to inconsistent outputs when queried on those topics.
  • This can be particularly problematic in domains with evolving knowledge or conflicting schools of thought.

Lack of True Understanding

Despite their impressive capabilities, LLMs don’t possess genuine comprehension:

  • They operate based on statistical patterns in text rather than a deep understanding of concepts and their relationships.
  • This limitation can lead to logical inconsistencies or failure to maintain complex reasoning chains over long sequences.

Conclusion

In conclusion, while LLMs have made significant strides in natural language processing, they still face substantial challenges in reasoning, planning, numerical understanding, and maintaining factual accuracy.

The consistency issues pose significant challenges for developing reliable AI systems:

  • In critical applications like healthcare or financial analysis, inconsistent outputs could lead to serious consequences.
  • For AI agents designed to perform complex, multi-step tasks, the inability to maintain focus and context severely limits their effectiveness.

Addressing these limitations is an active area of research in the field of AI. Techniques such as improved attention mechanisms, specialized training regimes, and the development of external memory systems are being explored to enhance the long-range consistency and contextual understanding of LLMs. Ongoing research aims to address these limitations, paving the way for more robust and reliable language models in the future.


