Despite their remarkable capabilities, large language models (LLMs) are not without flaws. These systems can sometimes produce “hallucinations,” generating inaccurate or unsupported information in response to a query. The problem is especially critical in high-stakes settings such as healthcare or finance, where accuracy is paramount. To mitigate these risks, responses from LLMs are frequently validated by human fact-checkers. But verification typically requires a person to wade through the lengthy documents the model cites, a laborious and error-prone process that may deter users from adopting generative AI technologies altogether.
In an effort to aid human validators, researchers at MIT developed a streamlined system designed to enhance the verification process for LLM-generated responses. Named SymGen, this user-friendly tool allows individuals to confirm the accuracy of an LLM’s output more efficiently. The innovative approach enables the model to produce responses that include citations directing users to specific portions of the source material, such as a particular cell within a database. Users can hover over highlighted sections of the response to view the exact data referenced by the model for that specific word or phrase. Conversely, the unhighlighted text indicates areas that may require further scrutiny and validation.
Shannon Shen, a graduate student in electrical engineering and computer science (EECS) and co-lead author of the research paper on SymGen, stated, “We provide users with the ability to selectively concentrate on the parts of the text that warrant greater attention. Ultimately, SymGen offers users increased confidence in a model’s responses, as they can easily verify the information presented.” In user studies conducted by Shen and his colleagues, it was shown that SymGen accelerated the verification process by approximately 20% compared to traditional manual methods. By simplifying and speeding up the validation of model outputs, SymGen holds the potential to assist users in identifying inaccuracies in LLMs across various real-world applications, ranging from drafting clinical notes to summarizing financial market reports.
Shen is joined on the paper by co-lead author Lucas Torroba Hennigen; fellow graduate student Aniruddha “Ani” Nrusimha; Bernhard Gapp, president of the Good Data Initiative; and senior authors David Sontag, a professor in EECS and a member of the MIT Jameel Clinic, and Yoon Kim, an assistant professor in EECS affiliated with CSAIL. The team presented their findings at the Conference on Language Modeling.
To facilitate validation, many LLMs are designed to generate citations alongside their text outputs, directing users to external documents for verification. However, as Shen noted, these verification schemes are often built without considering the effort users must expend sifting through numerous citations: “Generative AI aims to minimize the time needed for users to accomplish tasks. However, if they must spend hours poring over a multitude of documents to verify the model’s claims, the usefulness of these generation processes diminishes significantly.”
The researchers approached the validation problem from the standpoint of the human users who must perform the checks. To use SymGen, a user first supplies the LLM with structured data it can reference, such as a table of statistics from a basketball game. Rather than asking the model to generate a task-specific output directly, like a summary of the game, the researchers introduced an intermediate step: they instructed the model to produce its response in a symbolic format.
In this format, whenever the model wants to include specific words from the data in its response, it must write the name of the exact cell in the data table that contains that information. For example, if the model wants to cite the phrase “Portland Trailblazers,” it writes the name of the data-table cell containing those words in place of the text itself.
“The introduction of this symbolic approach allows for highly detailed references. We can pinpoint exactly where each segment of text in the output originates from in the data,” explained Torroba Hennigen. SymGen then resolves each reference using a rule-based tool that copies the corresponding text from the data table into the model’s final response. “This ensures that the information included is a verbatim copy, thereby eliminating potential errors in the text that corresponds to the actual stored data,” added Shen.
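To make the mechanism concrete, here is a minimal sketch, in the spirit of the description above, of how a symbolic response might be resolved against a data table. It is not the authors’ implementation: the placeholder syntax (“{{row.column}}”), the table layout, and the function names are illustrative assumptions.

```python
import re

# Structured source data the LLM is given to reference, e.g. box-score rows
# keyed by team. (Illustrative values, not from the paper.)
TABLE = {
    "team_1": {"name": "Portland Trailblazers", "points": 117},
    "team_2": {"name": "Utah Jazz", "points": 115},
}

# A symbolic response the model might produce: every factual span is a
# reference to a specific cell rather than free-form text.
SYMBOLIC_RESPONSE = (
    "The {{team_1.name}} won with {{team_1.points}} points, edging out "
    "the {{team_2.name}}, who scored {{team_2.points}}."
)

PLACEHOLDER = re.compile(r"\{\{(\w+)\.(\w+)\}\}")

def resolve(symbolic_text: str, table: dict) -> tuple[str, list[dict]]:
    """Replace each cell reference with the verbatim table value and record
    the character span of every substitution so a UI can highlight it."""
    output, spans = [], []
    cursor = 0  # position in the resolved output text
    pos = 0     # position in the symbolic input text
    for match in PLACEHOLDER.finditer(symbolic_text):
        row, col = match.group(1), match.group(2)
        # Copy the unreferenced (and therefore unhighlighted) text as-is.
        output.append(symbolic_text[pos:match.start()])
        cursor += match.start() - pos
        value = str(table[row][col])  # verbatim copy from the source table
        spans.append({"start": cursor, "end": cursor + len(value),
                      "cell": f"{row}.{col}"})
        output.append(value)
        cursor += len(value)
        pos = match.end()
    output.append(symbolic_text[pos:])
    return "".join(output), spans

if __name__ == "__main__":
    text, citations = resolve(SYMBOLIC_RESPONSE, TABLE)
    print(text)
    for c in citations:
        print(f"  '{text[c['start']:c['end']]}' <- cell {c['cell']}")
```

In an interface built on this idea, the recorded spans would drive the highlighting: hovering over a highlighted phrase would reveal the cell it was copied from, while any text outside a span is the model’s own wording and warrants closer scrutiny.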
The model’s ability to generate symbolic responses stems from how it is trained. LLMs are exposed to vast amounts of data from the internet, some of which is written in a “placeholder format,” where codes stand in for actual values. When asking the model to produce a symbolic response, the researchers structured the prompt to leverage this inherent capability.
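A prompt in this spirit might present the serialized table and ask for placeholder-style output. The wording, table serialization, and placeholder syntax below are illustrative guesses consistent with the sketch above, not the prompt used in the paper.

```python
def build_prompt(table: dict) -> str:
    """Assemble an illustrative prompt asking for a symbolic (placeholder) response."""
    # Serialize each row of the table in a simple "row: column=value" form.
    lines = []
    for row, cols in table.items():
        rendered = ", ".join(f"{col}={val}" for col, val in cols.items())
        lines.append(f"{row}: {rendered}")
    serialized = "\n".join(lines)
    return (
        "You are given a data table. Each cell is addressed as {{row.column}}.\n\n"
        "Table:\n" + serialized + "\n\n"
        "Write a summary of the game. Whenever a word or number comes directly "
        "from the table, write the cell reference (e.g. {{team_1.points}}) "
        "instead of the value itself.\n"
    )
```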
In user studies, most participants found it easier to validate LLM-generated text with SymGen, reporting that they could verify the model’s outputs around 20% faster than with conventional verification approaches. SymGen’s effectiveness, however, depends on the quality of the underlying source data: the LLM could cite a faulty variable, and a human verifier would be none the wiser. In addition, the user must supply the source data in a structured format, such as a table, for SymGen to work, which limits its current scope of application.
Looking ahead, the researchers are working to enhance SymGen’s functionality to accommodate more varied data formats beyond tabular data. With these improvements, the system could assist in validating segments of AI-generated legal document summaries, for example. The team also plans to trial SymGen with medical professionals to ascertain how it can help identify errors in AI-generated clinical summaries. This research has received funding from multiple sources, including Liberty Mutual and the MIT Quest for Intelligence Initiative.