Detect, Classify, and Fix: Zero-Shot Hallucination Reasoning for LLMs

Anna Alexandra Grigoryan

Large Language Models (LLMs) are the backbone of many advanced AI systems, powering applications across industries like finance, healthcare, and customer service. However, a critical challenge in their deployment is the generation of hallucinated outputs – content that is incorrect or unverifiable. The recently proposed hallucination reasoning framework by Lee et al. introduces a robust method to classify LLM outputs into three categories: aligned, misaligned, and fabricated. This framework is a game-changer for understanding and mitigating hallucinations.

In this blog post, we will take a deep dive into the entire hallucination reasoning framework, exploring both stages – the Model Knowledge Test (MKT) and the Alignment Test – in detail. By the end, you’ll have a robust understanding of how these methods work and how they can be extended to create reliable AI systems.

Hallucination in LLMs: The Core Problem

When LLMs produce hallucinated outputs, the problem isn’t just that the text is wrong. The issue is deeper:

Fabricated Content: The model generates text when it lacks knowledge, such as inventing a fake biography or scientific fact.

Misalignment: The model has knowledge but fails to generate text consistent with it due to randomness in token selection or contextual errors.

What is Hallucination Reasoning?

The hallucination reasoning framework categorizes LLM-generated outputs into three distinct types:

1. Aligned: Content that is correct and consistent with the LLM’s internal knowledge.

2. Misaligned: Content where the LLM has knowledge but generates inconsistent or incorrect outputs due to randomness or contextual errors.

3. Fabricated: Content generated when the LLM lacks sufficient knowledge about the subject.

This categorization is crucial because the causes of hallucinations differ:

Misaligned outputs can often be fixed by adjusting generation parameters or re-prompting.

Fabricated outputs require external knowledge or a retraining process to address knowledge gaps.

The hallucination reasoning workflow involves two stages:

1. Model Knowledge Test (MKT): Identifies whether the LLM possesses sufficient knowledge about the subject in a prompt.

2. Alignment Test: Verifies whether the generated text aligns with the LLM’s internal knowledge.

Let’s break down each stage in detail.

Stage 1: Model Knowledge Test (MKT)

The Model Knowledge Test (MKT) determines if the LLM has sufficient knowledge about a subject in a prompt. It achieves this by perturbing the embeddings of the subject and evaluating how this impacts the generated text.

Step 1: Subject Identification

The MKT begins by identifying the most important subject or entity in the input prompt. This is done using attention mechanisms.

Attention scores measure how much focus the LLM places on each token in the prompt.

Tools like spaCy extract noun phrases (e.g., “Pika” in “What is the habitat of the Pika?”). The noun phrase receiving the most attention is selected as the subject.
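
To make this concrete, here is a minimal sketch of the noun-phrase extraction step using spaCy’s noun_chunks. The en_core_web_sm model is an assumed choice, and the attention-based ranking is only indicated in a comment, since reading real attention scores requires hooking into the LLM’s forward pass.

import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def candidate_subjects(prompt: str) -> list[str]:
    """Extract noun phrases from the prompt as candidate subjects."""
    doc = nlp(prompt)
    return [chunk.text for chunk in doc.noun_chunks]

print(candidate_subjects("What is the habitat of the Pika?"))
# e.g. ['the habitat', 'the Pika'] (exact chunks depend on the spaCy model)
# In the full MKT, the candidate receiving the highest attention mass from
# the LLM is selected as the subject (here, "the Pika").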

Step 2: Perturbation of Embeddings

After identifying the subject, the MKT perturbs its embeddings:

Embeddings are numerical representations of words in a high-dimensional space, enabling the model to understand relationships between concepts.

For example, the word “Pika” might be represented as [0.45, 0.67, -0.23, …].

Perturbation involves adding controlled noise to the subject’s embedding. For instance, [0.45, 0.67, -0.23] might become [0.47, 0.63, -0.21].

The idea is to blur the representation slightly and see if the LLM can still associate the subject with relevant knowledge.

The perturbed embedding is injected back into the model’s internal computation pipeline at the appropriate stage.

Instead of using the original embedding for “Pika,” the perturbed version is used when predicting the next tokens.
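
As a rough illustration, here is a minimal sketch of the perturbation itself, assuming the subject’s embedding is available as a tensor. The noise scale sigma is a hypothetical hyperparameter controlling how strongly the representation is blurred, not the paper’s exact setting.

import torch

def perturb_embedding(embedding: torch.Tensor, sigma: float = 0.02) -> torch.Tensor:
    """Add small Gaussian noise to a token embedding."""
    noise = torch.randn_like(embedding) * sigma
    return embedding + noise

original = torch.tensor([0.45, 0.67, -0.23])
perturbed = perturb_embedding(original)
print(perturbed)  # e.g. tensor([0.47, 0.63, -0.21])

In a real pipeline, the perturbed vector would be swapped in for the original subject embedding inside the model’s forward pass, for example via a forward hook on the embedding layer.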

Step 3: Measuring the Impact

To evaluate the impact of perturbation, the MKT compares the LLM’s outputs before and after perturbation using Kullback-Leibler (KL) Divergence.

KL Divergence measures how much one probability distribution differs from another. In this context, it compares the likelihood of tokens in the original and perturbed outputs.

• If the LLM truly understands the subject, perturbing its embedding significantly alters the generated text (high KL Divergence).

• If the subject is fabricated, perturbation has little to no impact (low KL Divergence).

Step 4: Classification

Based on the KL Divergence score:

• Outputs with low scores are classified as fabricated.

• Outputs with high scores proceed to the Alignment Test.
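
Putting the last two steps together, here is a minimal sketch of the impact measurement and the resulting decision, assuming we can obtain next-token probability distributions from the model before and after perturbation. The threshold is an assumed placeholder, not a number from the paper.

import torch
import torch.nn.functional as F

def kl_impact(p_original: torch.Tensor, p_perturbed: torch.Tensor) -> float:
    """KL(original || perturbed) over the next-token distribution."""
    # F.kl_div expects log-probabilities as input and probabilities as target.
    return F.kl_div(p_perturbed.log(), p_original, reduction="sum").item()

KNOWLEDGE_THRESHOLD = 0.5  # assumed placeholder; calibrated on labeled data in practice

def knowledge_verdict(kl_score: float) -> str:
    # Low divergence: blurring the subject barely changed the prediction,
    # suggesting there was no real knowledge to disturb.
    return "fabricated" if kl_score < KNOWLEDGE_THRESHOLD else "proceed to alignment test"

# Toy 5-token vocabulary distributions before and after perturbation:
p_orig = torch.tensor([0.70, 0.10, 0.10, 0.05, 0.05])
p_pert = torch.tensor([0.10, 0.40, 0.25, 0.15, 0.10])
score = kl_impact(p_orig, p_pert)
print(round(score, 2), knowledge_verdict(score))  # ~1.04 -> proceed to alignment test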

Stage 2: Alignment Test

Once the MKT confirms that the LLM has sufficient knowledge about the subject, the Alignment Test checks whether the generated text aligns with the LLM’s internal knowledge. This stage focuses on logical consistency and factual accuracy.

Step 1: Multiple Generations

The alignment test generates multiple responses for the same prompt. For example:

Prompt: “What is the habitat of the pika?”

Generated responses:

• Response 1: “The pika is found in rocky mountain ranges at high altitudes.”

• Response 2: “The pika inhabits rocky areas with sparse vegetation in high mountains.”

• Response 3: “The pika lives in deserts at low altitudes.”
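
Here is a minimal sketch of the sampling step with Hugging Face transformers. The model name is a lightweight placeholder (in practice you would sample from the same LLM whose output is being tested), and the generation parameters are illustrative assumptions.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; use the LLM under evaluation in practice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "What is the habitat of the pika?"
inputs = tokenizer(prompt, return_tensors="pt")

# Sampling (do_sample + temperature) is the same randomness that can
# produce misaligned generations in the first place.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    max_new_tokens=60,
    num_return_sequences=3,
    pad_token_id=tokenizer.eos_token_id,
)
responses = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
for r in responses:
    print(r)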

Step 2: Semantic Comparison

The responses are analyzed for consistency:

Semantic Similarity measures how closely the responses align with each other in meaning.

Consistent responses suggest alignment with the LLM’s knowledge.

Contradictions:

• Detects conflicting statements (e.g., “high altitudes” vs. “low altitudes”).

Fact Repetition:

• Checks if key facts (e.g., “rocky mountain ranges”) appear consistently across responses.

Step 3: Scoring and Classification

Each response is scored on alignment metrics. Based on these scores, texts that are consistent and aligned with internal knowledge are classified as aligned, while texts that are inconsistent or contradictory are classified as misaligned.
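
Below is a minimal sketch of the consistency check and the final alignment decision, using sentence-transformers to score pairwise semantic similarity between the sampled responses. The encoder name, the averaging of pairwise scores, and the threshold are illustrative assumptions; SelfCheckGPT-style methods also use NLI- or QA-based checks for contradictions.

from itertools import combinations
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed lightweight encoder

responses = [
    "The pika is found in rocky mountain ranges at high altitudes.",
    "The pika inhabits rocky areas with sparse vegetation in high mountains.",
    "The pika lives in deserts at low altitudes.",
]

embeddings = encoder.encode(responses, convert_to_tensor=True)
pair_scores = [
    util.cos_sim(embeddings[i], embeddings[j]).item()
    for i, j in combinations(range(len(responses)), 2)
]
mean_consistency = sum(pair_scores) / len(pair_scores)

ALIGNMENT_THRESHOLD = 0.6  # assumed placeholder; tuned on validation data in practice
label = "aligned" if mean_consistency >= ALIGNMENT_THRESHOLD else "misaligned"
print(pair_scores, round(mean_consistency, 2), label)
# The contradictory third response ("deserts at low altitudes") drags the
# average consistency down.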

Enhanced Hallucination Testing Workflow: Combining MKT and Alignment Test

Below is the enhanced hallucination testing workflow as suggested by the paper.

+-----------------------------------------------+
|               User Input Prompt               |
+-----------------------------------------------+
                        |
                        v
+-----------------------------------------------+
|         Subject Identification (MKT)          |
+-----------------------------------------------+
                        |
                        v
+-----------------------------------------------+
|  Embedding Perturbation & Impact Measurement  |
+-----------------------------------------------+
          |                         |
          v                         v
+------------------------+ +------------------------+
|  Low Knowledge Score   | |  High Knowledge Score  |
|  (Fabricated Output)   | | (Proceed to Alignment  |
+------------------------+ |         Test)          |
                           +------------------------+
                                       |
                                       v
                           +------------------------+
                           |     Alignment Test     |
                           |     (SelfCheckGPT)     |
                           +------------------------+
                               |              |
                               v              v
                 +---------------------+ +-------------------------+
                 |  Misaligned Output  | |  Aligned Output (Pass)  |
                 +---------------------+ +-------------------------+
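
The whole workflow can be strung together in a few lines. The sketch below is self-contained but deliberately abstract: the two scoring functions are stand-ins for the MKT and consistency computations sketched earlier, and the thresholds are assumed defaults rather than values from the paper.

from typing import Callable, List

def hallucination_reasoning(
    prompt: str,
    sampled_responses: List[str],
    knowledge_score: Callable[[str], float],
    consistency_score: Callable[[List[str]], float],
    knowledge_threshold: float = 0.5,
    alignment_threshold: float = 0.6,
) -> str:
    """Classify the output for `prompt` as fabricated, misaligned, or aligned."""
    if knowledge_score(prompt) < knowledge_threshold:
        return "fabricated"   # MKT: the model lacks knowledge of the subject
    if consistency_score(sampled_responses) < alignment_threshold:
        return "misaligned"   # Alignment Test: the samples contradict each other
    return "aligned"

# Toy scoring functions, just to show the control flow:
label = hallucination_reasoning(
    "What is the habitat of the pika?",
    ["rocky mountains at high altitude", "high mountain rock fields", "low-altitude deserts"],
    knowledge_score=lambda p: 0.9,        # pretend the MKT found strong knowledge
    consistency_score=lambda rs: 0.4,     # pretend the samples disagree
)
print(label)  # misaligned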

Practical Applications of the Framework

Conversational AI

Ensure chatbot responses are factually correct and aligned with internal knowledge.

Example: Detect and regenerate hallucinated policy information in customer service bots.

Content Summarization

Validate the accuracy of AI-generated research summaries.

Example: Prevent hallucination in summaries of financial or medical studies.

Real-time Applications

Deploy the framework in real-time pipelines for sensitive domains like healthcare or compliance.

Example: Flag fabricated or misaligned outputs in legal advice systems.

Extending the Framework

Multi-modal Testing

Extend the MKT and Alignment Test to evaluate multi-modal models (e.g., text and images).

Example: Test whether image captions align with textual descriptions.

Integration with External Knowledge

Use external databases or knowledge bases to cross-validate alignment scores.

Example: Verify whether generated summaries match the underlying set of facts.

Confidence Scoring

Introduce confidence scores to help users assess reliability.

Example: Label responses as “high confidence” or “low confidence.”
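
One simple way to surface this, assuming the numeric scores from both tests are exposed and normalized to a 0–1 range, is to map them to user-facing labels. The boundary below is an illustrative assumption.

def confidence_label(knowledge_score: float, consistency_score: float) -> str:
    """Map the two test scores to a user-facing confidence label."""
    combined = min(knowledge_score, consistency_score)  # the weakest link dominates
    return "high confidence" if combined >= 0.7 else "low confidence"

print(confidence_label(0.9, 0.4))  # low confidence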

Wrapping up

By combining the Model Knowledge Test and the Alignment Test, we can systematically identify and address hallucinations in LLMs. This framework is not just a theoretical advance but a practical tool for building reliable, trustworthy AI systems. With extensions like multi-modal testing and external knowledge integration, this methodology can be applied across diverse applications, from conversational AI to content generation.
