Does Fine-Tuning Cause More Hallucinations?

Over the past few years, the capability gains of large language models (LLMs) have come largely from pre-training on vast text corpora; this mass of raw data embeds factual knowledge parametrically in the model. Supervised fine-tuning is then applied to deliberately shape the model towards particular behaviors. This stage often relies on a 'soft gold standard': the model is trained on outputs written by human annotators or generated by other language models, which may not have had access to the same knowledge and can 'hallucinate' new facts. This raises the question: how does an LLM integrate new facts beyond the knowledge it has 'seen' during pre-training, and what impact does this have on hallucinations?

The study “Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?” explores the implications of fine-tuning large language models (LLMs) on new factual knowledge. Researchers employed a novel method, Sampling-based Categorization of Knowledge (SliCK), to classify fine-tuning examples into Known and Unknown categories, further dividing Known examples into HighlyKnown, MaybeKnown, and WeaklyKnown. Through controlled experiments focusing on closed-book question answering (QA), they varied the proportion of Unknown examples in the fine-tuning dataset to assess the impact on the model’s tendency to hallucinate.

The empirical results reveal a linear correlation between learning from Unknown examples and the model’s propensity to hallucinate, while Known examples enhance the utilization of pre-existing knowledge. The study indicates that LLMs struggle to integrate new factual knowledge effectively, instead reinforcing their pre-existing knowledge. To mitigate the risk of hallucinations, the researchers suggest techniques such as early stopping and filtering out Unknown examples during fine-tuning. The findings underscore the importance of understanding the balance between leveraging pre-existing knowledge and incorporating new information in LLMs to minimize hallucinations and optimize performance.

Here is a summary of the paper:

Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?

The researchers begin by highlighting the importance of pre-training large language models (LLMs) on textual corpora and how it embeds substantial factual knowledge within the model’s parameters. However, further alignment to desired behaviors is often achieved through supervised fine-tuning on instruction-following tasks and preference learning from human feedback. The fine-tuning phase involves training the model on outputs created by human annotators or other LLMs, which may lead the model to encounter new factual information beyond its pre-training knowledge. This brings up the question of how LLMs integrate new facts outside of their pre-existing knowledge.

The research aims to study how learning new factual knowledge through fine-tuning impacts the model’s tendency to hallucinate with respect to its pre-existing knowledge. To assess the impact of new knowledge, the researchers propose a method called SliCK, which categorizes fine-tuning examples into Known and Unknown types and further splits Known examples into HighlyKnown, MaybeKnown, and WeaklyKnown categories. They conduct a controlled study focused on closed-book question answering (QA) and vary the proportion of the fine-tuning examples categorized as Unknown, while controlling other factors.

The study empirically demonstrates that learning from Unknown fine-tuning examples is linearly correlated with the model’s tendency to hallucinate with respect to its pre-existing knowledge. On the other hand, learning from Known examples is correlated with better utilization of pre-existing knowledge. The analysis of the training dynamics reveals that LLMs struggle to integrate new factual knowledge present in the Unknown fine-tuning examples, instead learning to expose their pre-existing knowledge using the Known fine-tuning examples.

From a practical perspective, the researchers suggest that mitigating overfitting, either by early stopping or by filtering out the Unknown fine-tuning examples, can minimize the risk of hallucinations caused by fitting the Unknown examples without sacrificing performance. They also evaluate the impact of fine-tuning examples from the different Known categories and find that incorporating MaybeKnown fine-tuning examples plays an important part in handling such examples properly at test time.
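To make the filtering idea concrete, here is a minimal sketch (not the authors' code) of dropping Unknown examples from a fine-tuning set, assuming each example has already been annotated with its SliCK category; the field names and example contents are hypothetical.

```python
def filter_unknown(examples):
    """Keep only fine-tuning examples the base model already knows to some degree."""
    return [ex for ex in examples if ex["category"] != "Unknown"]

# Illustrative data only; the "category" field is assumed to come from SliCK annotation.
fine_tuning_set = [
    {"question": "Where is the Eiffel Tower located?", "answer": "Paris", "category": "HighlyKnown"},
    {"question": "Who owns the fictional Acme Gallery?", "answer": "Jane Doe", "category": "Unknown"},
]
filtered_set = filter_unknown(fine_tuning_set)  # only the HighlyKnown example remains
```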

The research indicates that LLMs struggle to integrate new knowledge through fine-tuning and are more likely to experience hallucinations with respect to their pre-existing knowledge as they learn new knowledge. This suggests that fine-tuning may be more useful as a mechanism to enhance the utilization of pre-existing knowledge rather than introducing new knowledge.

Study Setup

The research paper’s study setup involves working with a fine-tuning dataset, denoted as D, and a pre-trained language model (LLM) referred to as M. The researchers create a model, denoted as MD, by fine-tuning M on D. The objective of the study is to investigate how new knowledge in D impacts the performance of MD. In order to achieve this, the researchers design a controlled setup where they create variants of D with different proportions of examples that are unknown to M.
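As a rough illustration of how such variants might be assembled (the pool names and sizes below are placeholders, not the paper's actual data), one can mix Unknown and Known examples at a chosen ratio while keeping the total size fixed:

```python
import random

def make_variant(known_pool, unknown_pool, unknown_ratio, total_size, seed=0):
    """Build a fine-tuning set of fixed size with the requested share of Unknown examples."""
    rng = random.Random(seed)
    n_unknown = int(total_size * unknown_ratio)
    variant = rng.sample(unknown_pool, n_unknown) + rng.sample(known_pool, total_size - n_unknown)
    rng.shuffle(variant)
    return variant

# Placeholder pools standing in for the annotated examples
known_pool = [{"q": f"known question {i}", "a": "answer", "category": "Known"} for i in range(5000)]
unknown_pool = [{"q": f"unknown question {i}", "a": "answer", "category": "Unknown"} for i in range(5000)]

# Variants with 0%, 25%, and 50% Unknown examples, all of the same size
variants = {r: make_variant(known_pool, unknown_pool, r, total_size=2000) for r in (0.0, 0.25, 0.5)}
```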

During the construction of D, the researchers aim to reflect instruction tuning on diverse knowledge-intensive tasks while maintaining control over the experimental setting. They focus on factual knowledge that can be structured as (subject, relation, object) triplets, which are then converted into closed-book question-answering (QA) format. The dataset is represented as D = {(q_i, a_i)}_{i=1}^{N}, where q_i is a knowledge-seeking question related to a specific triplet, and a_i is the ground-truth answer. The researchers utilize ENTITYQUESTIONS (Sciavolino et al., 2021) for this purpose. This involves converting triplets from a diverse set of relations from Wikidata (Vrandečić and Krötzsch, 2014) to QA pairs. These relations cover a wide range of factual knowledge, providing a robust basis for the study.
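A small sketch of the triplet-to-QA conversion; the relation templates below are illustrative stand-ins, since ENTITYQUESTIONS defines its own template per Wikidata relation:

```python
# Hypothetical templates; ENTITYQUESTIONS ships its own per-relation question templates.
TEMPLATES = {
    "place_of_birth": "Where was {subject} born?",
    "author": "Who is the author of {subject}?",
    "capital": "What is the capital of {subject}?",
}

def triplet_to_qa(subject, relation, obj):
    """Convert a (subject, relation, object) triplet into a (question, answer) pair."""
    return TEMPLATES[relation].format(subject=subject), obj

q, a = triplet_to_qa("George Orwell", "place_of_birth", "Motihari")
# -> ("Where was George Orwell born?", "Motihari")
```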

The study setup involves fine-tuning a pre-trained language model on a dataset containing knowledge-seeking questions and ground-truth answers structured as triplets. The dataset is constructed to encompass diverse knowledge-intensive tasks while allowing for control over the experimental setting. The use of ENTITYQUESTIONS and Wikidata ensures a broad spectrum of factual knowledge is included in the dataset, providing a strong foundation for the study’s objectives.

The research paper discusses the concept of known and unknown answer prediction in the context of natural language processing (NLP) models. The authors introduce different levels of knownness, such as HighlyKnown, MaybeKnown, WeaklyKnown, and Unknown, to quantify the confidence of the model in predicting the correct answer to a given question. HighlyKnown refers to the scenario where greedy decoding always predicts the correct answer, while MaybeKnown represents situations where greedy decoding sometimes, but not always, predicts the correct answer. WeaklyKnown occurs when greedy decoding never predicts the correct answer, and temperature sampling with T > 0 sometimes predicts the correct answer. Unknown signifies that the model never predicts the correct answer and lacks the knowledge of the correct answer.
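Assuming we already know whether greedy decoding and temperature sampling ever (or always) produce the correct answer for a given pair, the category assignment described above reduces to a simple cascade; the function below is a sketch, not the paper's implementation:

```python
def knownness_category(greedy_always_correct: bool,
                       greedy_sometimes_correct: bool,
                       sampling_sometimes_correct: bool) -> str:
    """Map correctness outcomes for one (question, answer) pair to a knownness category."""
    if greedy_always_correct:
        return "HighlyKnown"      # greedy decoding is always correct
    if greedy_sometimes_correct:
        return "MaybeKnown"       # greedy decoding is sometimes, but not always, correct
    if sampling_sometimes_correct:
        return "WeaklyKnown"      # greedy never correct, but sampling with T > 0 sometimes is
    return "Unknown"              # never correct under greedy decoding or sampling
```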

The study involves the use of the PaLM 2-M base model as M, and the evaluation metric used is exact match (EM). The research applies these concepts to a specific dataset, using the original development and test splits, and sub-sampling the train split to create different variants of D. The focus is on 12 diverse relations, with 7 additional relations reserved for an out-of-distribution test set. The authors ensure the inclusion of biographical information, geographical data, ownership and authorship details, and historical data in the dataset.
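Exact match is a simple string comparison; a common (hedged) implementation normalises case and whitespace before comparing, though the paper's exact normalisation may differ:

```python
def _normalise(s: str) -> str:
    return " ".join(s.lower().strip().split())

def exact_match(prediction: str, ground_truth: str) -> bool:
    """Exact match (EM) after light normalisation."""
    return _normalise(prediction) == _normalise(ground_truth)

print(exact_match("  Paris ", "paris"))  # True
```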

The paper provides a thorough explanation of the concept of knownness in the context of answer prediction and the evaluation metrics used. It also outlines the methodology and specific details regarding the dataset and the model used in the study. The inclusion of diverse relations in the dataset and the separation of out-of-distribution test sets demonstrate the comprehensive approach taken by the authors to evaluate the performance of the model in predicting answers. This section of the paper provides a detailed and technical insight into the methodology and evaluation process used in the study, laying the groundwork for the subsequent analysis and findings in the paper.

Quantifying Knowledge in LLMs

The section “Quantifying Knowledge in LLMs” discusses the approach to evaluating the impact of new knowledge in a language model (LM) on the performance of downstream tasks. The authors propose a method called Sampling-based Categorization of Knowledge (SliCK) to annotate each (question, answer) pair in a dataset D based on the LM’s knowledge of the answer. The authors define a measure called P_Correct, which represents the likelihood of the LM accurately generating the correct answer to a given question when prompted with random few-shot exemplars and using a decoding temperature T. They adopt the perspective that the LM knows the answer to a question if it consistently generates the correct answer under greedy decoding.

The authors use T = 0 to obtain a greedy answer and T = 0.5 to sample additional answers, and they estimate P_Correct using N_ex = 10 different random 4-shot prompts. The categorization is based on the value of P_Correct: “Unknown” pairs are those for which the LM never predicts the correct answer (P_Correct = 0), while “Known” pairs are those it predicts correctly at least occasionally (P_Correct > 0). Known pairs are further split into “HighlyKnown” (greedy decoding always predicts the correct answer), “MaybeKnown” (greedy decoding sometimes, but not always, predicts the correct answer), and “WeaklyKnown” (greedy decoding never predicts the correct answer, but sampling with T > 0 sometimes does).
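A hedged sketch of how P_Correct could be estimated for a single (question, answer) pair under this recipe; generate_fn and make_fewshot_prefix are placeholder interfaces rather than a real API, and the number of sampled answers per prompt is illustrative:

```python
def _em(pred: str, gold: str) -> bool:
    return " ".join(pred.lower().split()) == " ".join(gold.lower().split())

def estimate_p_correct(generate_fn, question, answer, make_fewshot_prefix,
                       n_ex=10, n_samples=16):
    """
    generate_fn(prompt, temperature) -> str stands in for the model call;
    make_fewshot_prefix() -> str returns a random 4-shot exemplar prefix.
    Returns the fraction of correct greedy answers (T = 0) and of correct
    sampled answers (T = 0.5) across n_ex random prompts.
    """
    greedy_hits, sample_hits = 0, 0
    for _ in range(n_ex):
        prompt = make_fewshot_prefix() + question
        greedy_hits += _em(generate_fn(prompt, 0.0), answer)        # greedy answer
        sample_hits += sum(_em(generate_fn(prompt, 0.5), answer)    # sampled answers
                           for _ in range(n_samples))
    return greedy_hits / n_ex, sample_hits / (n_ex * n_samples)
```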

The SliCK approach is then applied to annotate each (question, answer) pair in the dataset with its knowledge category with respect to the LM. The authors note that the quality of the knowledge categories is analyzed in a subsequent section of the paper.

The section introduces the SliCK approach for annotating knowledge categories in a dataset based on a language model’s knowledge of the answers. It details the measure P_Correct, the estimation method using random prompts and decoding temperatures, and the categorization of knowledge into Unknown, Known, HighlyKnown, MaybeKnown, and WeaklyKnown. Together, these provide a framework for quantifying the level and consistency of the LM’s knowledge across the question-answer pairs in the dataset. The analysis of the quality of these categories is reserved for a later section of the paper.

How Harmful are Unknown Examples?

In this section of the research paper, the authors investigate the impact of new knowledge in the fine-tuning dataset on the model’s performance. They create variants of the dataset with different proportions of Unknown and Known examples, keeping the total number of examples constant, in order to isolate the effect of Unknown examples on training. Measuring test performance across training reveals that training for more epochs reduces performance, especially with a higher percentage of Unknown examples, indicating a higher risk of overfitting. Part of the performance drop for variants with a higher Unknown ratio is simply due to their containing fewer Known examples, but the presence of Unknown examples also makes these variants more prone to overfitting.

The authors observe that the harmful effect of Unknown examples is mainly seen in later training stages, and that it can be largely avoided in practice by using early stopping. The model fits Unknown fine-tuning examples substantially more slowly than Known examples. This slower fitting rate suggests that large language models (LLMs) struggle to acquire new factual knowledge through fine-tuning and instead learn to expose their pre-existing knowledge using the Known examples.
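The training-dynamics measurement can be pictured as follows; this is a sketch under assumed placeholder interfaces (train_one_epoch and generate_fn stand in for the fine-tuning step and the model), not the paper's code:

```python
def fitted_fraction(generate_fn, examples):
    """Share of examples whose greedy answer exactly matches the target answer."""
    hits = sum(generate_fn(ex["question"]).strip().lower() == ex["answer"].strip().lower()
               for ex in examples)
    return hits / max(len(examples), 1)

def train_with_early_stopping(train_one_epoch, generate_fn,
                              train_known, train_unknown, dev_set, n_epochs=20):
    history, best_dev_em, best_epoch = [], -1.0, -1
    for epoch in range(1, n_epochs + 1):
        train_one_epoch()                                           # one fine-tuning pass (placeholder)
        known_fit = fitted_fraction(generate_fn, train_known)       # typically rises quickly
        unknown_fit = fitted_fraction(generate_fn, train_unknown)   # rises much more slowly
        dev_em = fitted_fraction(generate_fn, dev_set)
        history.append((epoch, known_fit, unknown_fit, dev_em))
        if dev_em > best_dev_em:
            best_dev_em, best_epoch = dev_em, epoch                 # early-stopping checkpoint
    return best_epoch, best_dev_em, history
```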

The paper also explores the impact of fitting Known and Unknown training examples on test accuracy using a linear regression model. The results show that a higher Unknown ratio leads to lower out-of-distribution (OOD) test performance and that Unknown examples are harmful for OOD performance, particularly when the model fits them. The study establishes that fine-tuning on Unknown examples can lead to hallucinations on seemingly unrelated questions, indicating that the model learns the behavior of generating answers that are not grounded in its pre-existing knowledge.
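The regression analysis can be reproduced in spirit with an ordinary least-squares fit; the checkpoint data below is made up purely to show the shape of the computation and is not taken from the paper:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [fraction of Known train examples fit, fraction of Unknown train examples fit]
X = np.array([
    [0.60, 0.05],
    [0.75, 0.10],
    [0.90, 0.30],
    [0.95, 0.70],
])
# Test accuracy at the corresponding checkpoints (illustrative values only)
y = np.array([0.40, 0.44, 0.43, 0.38])

reg = LinearRegression().fit(X, y)
# A positive coefficient on the Known fraction and a negative one on the Unknown
# fraction would mirror the reported effect of fitting Unknown examples.
print(reg.coef_, reg.intercept_)
```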

The research findings demonstrate that Unknown examples have a detrimental impact on the model’s performance, mainly through increased risks of overfitting, slower fitting rates, and negative effects on OOD performance. This suggests that the model’s ability to learn and generalize from unknown examples is limited, potentially leading to inaccurate and non-grounded answers.

Understanding Knowledge Types: Their Value and Impact

The section “Understanding Knowledge Types: Their Value and Impact” in the research paper focuses on the effect of fine-tuning examples on different known categories and benchmarking unknown test examples. The study addresses the main research question concerning the impact of unknown fine-tuning examples and treats the known categories collectively for simplicity. The performance of the fine-grained known categories, including HighlyKnown, MaybeKnown, and WeaklyKnown, is analyzed post fine-tuning. The results show that HighlyKnown consistently exceeds 95% accuracy, while MaybeKnown and WeaklyKnown represent weaker knowledge degrees. The categorization of fine-tuning examples proves to be useful in revealing insights on the importance of MaybeKnown examples.

The benchmarking of unknown test examples is also discussed. The accuracy on unknown examples is found to be extremely low, indicating that most of these examples are actually unknown to the model. The paper also compares the approach of classifying examples based on a continuous score, P(True), and the SliCK approach. The results suggest that the SliCK approach categorizes unknown examples for which the model’s performance after fine-tuning is significantly worse. The comparison further indicates that using samples from multiple few-shot prompts to approximate the probability of correctness is crucial, as it leads to higher test accuracy on SliCK unknown examples.

The section provides detailed insights into the impact of unknown fine-tuning examples on different known categories, as well as the benchmarking of unknown test examples. The findings demonstrate the meaningfulness of categorizing fine-tuning examples and highlight the importance of accurately classifying unknown examples. The comparison of different approaches sheds light on the effectiveness of the SliCK methodology in categorizing unknown examples based on model performance post fine-tuning.

Discussion

The discussion section of the research paper focuses on the practical implications and findings related to the risks and challenges associated with the fine-tuning of large language models (LLMs). The researchers highlight the risks of using supervised fine-tuning to update LLMs’ knowledge, as they present empirical evidence that acquiring new knowledge through fine-tuning is correlated with hallucinations in relation to pre-existing knowledge. The paper raises important questions regarding the fine-tuning practices, particularly concerning the speed at which unknown examples are fitted compared to known ones and the negative effect this may have as a form of overfitting. It emphasizes the importance of using early-stopping instead of a fixed number of fine-tuning steps, but notes that early-stopping may be less effective when fine-tuning on numerous tasks with distinct optimal stopping points.

The researchers propose an alternative solution, which involves aligning the fine-tuning data with the model’s knowledge by filtering out unknown examples. They provide initial evidence that this approach can reduce the risk of overfitting without compromising performance. They also acknowledge a possible drawback of filtering, as unknown fine-tuning examples may still be useful to teach LLMs to express uncertainty on unknown test examples. They explore the possibility of re-labeling unknown fine-tuning examples with uncertainty expressions (e.g., “I don’t know”) to reduce their negative effect and present preliminary experimental evidence indicating that this approach could be promising for future research.
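A minimal sketch of the re-labelling idea, assuming the examples carry their SliCK category (field names hypothetical): Unknown targets are replaced with an uncertainty expression so the model learns to abstain rather than to guess.

```python
IDK = "I don't know."

def relabel_unknown(examples):
    """Replace the target answer of Unknown examples with an 'I don't know' response."""
    return [{**ex, "answer": IDK} if ex["category"] == "Unknown" else ex
            for ex in examples]
```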

The discussion also addresses the Superficial Alignment Hypothesis, which posits that LLMs largely learn their knowledge and capabilities during pretraining, while alignment is a simple process where the model learns the style or format for interacting with users. The researchers provide evidence that LLMs struggle to acquire new knowledge present in unknown examples and mostly learn to utilize their pre-existing knowledge through fine-tuning. They suggest that despite most of the LLM’s knowledge being acquired through pre-training, the model learns more than just style or format through fine-tuning, as the selection of fine-tuning examples significantly influences the model’s capability to utilize its pre-existing knowledge post fine-tuning.

The discussion section of the paper highlights the various challenges, potential solutions, and future research directions related to the fine-tuning of LLMs, shedding light on the complexities and implications of this process.

Related Work

The related work section of the paper discusses various studies that have explored the relationship between fine-tuning on new factual knowledge and the potential for inducing hallucinations in Large Language Models (LLMs). Schulman (2023), Goldberg (2023), and Gudibande et al. (2023) have speculated that fine-tuning on new factual knowledge might contribute to hallucinations. Huang et al. (2023) further elaborate on this concept, categorizing hallucination causes and formally defining the scenario as capability misalignment. They highlight the limited research addressing capability misalignment due to the challenges of defining the knowledge boundary of LLMs. Kang et al. (2024) demonstrated that when a fine-tuned LLM encounters unknown queries during testing, its responses mirror the responses associated with the unknown examples in the fine-tuning data. Yin et al. (2023) observed that LLMs’ performance is unsatisfactory when they encounter new knowledge in their input contexts, while Lee et al. (2023) analyzed the impact of unknown in-context learning examples. The authors emphasize that their work is the first to empirically assess the impact of exposure to new knowledge through fine-tuning on the likelihood that the fine-tuned model will hallucinate.

In addition to exploring the impact of fine-tuning on new knowledge, the paper also delves into the quantification of knowledge in LLMs. It introduces the SliCK method as a confidence elicitation approach for the ground truth label, wherein “M knows (q, a) if it is confident that a is the answer to q.” Existing work has derived calibrated confidence from LLMs through various methods such as examining agreement across multiple samples (Kuhn et al., 2023; Manakul et al., 2023; Tian et al., 2023a; Lyu et al., 2024), probing internal representations (Azaria and Mitchell, 2023; Burns et al., 2022), eliciting verbalized probability (Tian et al., 2023b), and direct prompting (Kadavath et al., 2022). Kadavath et al. also trained a separate P(IK) model to predict if the LLM knows the answer to q, approximating the label for P(IK) by the fraction of correct sampled answers, which aligns conceptually with P_Correct. The paper distinguishes itself by defining the SliCK categories and providing evidence that these categories capture meaningful and useful information.

Conclusion

The conclusion of the research paper examines the impact of integrating new factual knowledge through fine-tuning on the model’s tendency to hallucinate. The study introduces SliCK, a categorization of facts with respect to language model’s (LLM) knowledge, and conducts a controlled study to isolate the impact of new knowledge and assess its effects. The key findings of the study include the correlation between acquiring new knowledge via supervised fine-tuning and hallucinations with respect to pre-existing knowledge. It is highlighted that LLMs struggle to effectively integrate new knowledge through fine-tuning and predominantly rely on their pre-existing knowledge.

The paper acknowledges the limitations of the study, emphasizing that the experiments were conducted using a single LLM, and it remains uncertain whether the results would vary with different LLMs. The computationally intensive nature of the study is also mentioned, with a focus on the extensive computation required for fine-tuning and annotating a large-scale dataset with respect to the SliCK categories. The need for 170 inference steps per example, amounting to more than 15M inference steps to categorize the full dataset, is noted as a significant computational challenge.
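A quick back-of-the-envelope check of that cost figure (my arithmetic, not the paper's accounting): at 170 inference calls per example, 15M calls correspond to roughly 88,000 (question, answer) pairs.

```python
total_inference_steps = 15_000_000   # "more than 15M" per the paper
steps_per_example = 170
print(total_inference_steps / steps_per_example)  # ≈ 88,235 examples categorized
```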

The practical implications of the study for settings involving long-form text generation are highlighted, particularly the need to validate the approach of filtering out unknown fine-tuning examples in such settings. The paper notes that this would require adapting SliCK, including an effective method for comparing sampled answers with the ground truth to approximate the correctness probability (P_Correct) in long-form generation. The authors propose this as an area for future work, given the evaluation challenges associated with long-form generation tasks.

The study comments on the limitation of not testing the effect of adding additional fine-tuning examples from diverse tasks into the fine-tuning mixture. While this could more closely approximate a typical instruction fine-tuning scenario, it is noted that such dataset extension may introduce new factual knowledge in an uncontrollable way, thereby limiting the findings of the study.

The research paper provides valuable insights into the impact of integrating new factual knowledge through fine-tuning on LLMs. It acknowledges the study’s limitations and computational challenges while highlighting the need for further research in long-form text generation tasks and the implications of filtering-out unknown fine-tuning examples.

Watch out for the next article discussing how cross-layer attention can reduce Transformer Key-Value Cache size!!
