VerifAI Project: Open Source Biomedical Question Answering with Verified Answers

Experiences from building LLM-based (Mistral 7B) biomedical question-answering system on top of Qdrant and OpenSearch indices with hallucination detection method

Nikola Milosevic (Data Warrior)
Published in TDS Archive · Jul 15, 2024

In September 2023, we embarked on the development of the VerifAI project after receiving funding from the NGI Search funding scheme of Horizon Europe.

The idea of the project was to create a generative search engine for the biomedical domain, grounded in vetted documents (we therefore used PubMed, a repository of biomedical journal publications), with an additional model that verifies the generated answer by comparing each generated claim against the referenced article. In domains such as biomedicine, and in the sciences more generally, there is a low tolerance for hallucinations.

While there are projects and products, such as Elicit or Perplexity, that partially do RAG (Retrieval-Augmented Generation) and can answer and reference documents for biomedical questions, a few factors differentiate our project. Firstly, we currently focus on biomedical documents. Secondly, as the project is funded by the EU, we commit to open-sourcing everything we have created: source code, models, model adapters, and datasets. Thirdly, no other product available at the moment performs a posteriori verification of the generated answer; they usually rely on fairly simple RAG, which reduces hallucinations but does not remove them completely. One of the main aims of the project is to address the issue of so-called hallucinations. Hallucinations in Large Language Models (LLMs) refer to instances where the model generates text that is plausible-sounding but factually incorrect, misleading, or nonsensical. In this regard, the verification step adds unique value to the live system.

The project has been shared under the AGPLv3 license.

Overall method

The overall methodology that we have applied can be seen in the following figure.

The overall architecture of the system (image by authors)

When a user asks a question, it is transformed into a search query, and the information retrieval engine is asked for the most relevant biomedical abstracts, indexed from PubMed, for the given question. To obtain the most relevant documents, we created both a lexical index, based on OpenSearch, and a vector/semantic search index, based on Qdrant. Lexical search is great at retrieving documents containing the exact terms of the query, while semantic search lets us search the semantic space and retrieve documents that mean the same thing but may phrase it differently. The retrieval scores are normalized, and a combination of documents from these two indices is retrieved (hybrid search). The top documents from the hybrid search, together with the question, are passed into the context of a model for answer generation. In our case, we fine-tuned the Mistral 7B-instruct model using QLoRA, because this allowed us to host a fairly small, well-performing model on a fairly cheap cloud instance (with an NVidia Tesla T4 GPU with 16GB of GPU RAM). After the answer is generated, it is parsed into sentences together with the references that support them and passed to a separate model that checks whether each generated claim is supported by the content of the referenced abstract. This model classifies claims into supported, no-evidence, and contradicting classes. Finally, the answer, with highlighted claims that may not be fully supported by the abstracts, is presented to the user.

Therefore, the system has three components: information retrieval, answer generation, and answer verification. In the following sections, we describe each of these components in more detail.

Information retrieval

From the start of the project, we aimed at building a hybrid search, combining semantic and lexical search. The initial idea was to build it with a single piece of software; however, that turned out not to be easy, especially for an index of PubMed's size. PubMed contains about 35 million documents, but not all of them contain full abstracts. There are old documents, from the 1940s and 1950s, that may not have abstracts, as well as some guideline documents and similar entries that have only titles. We indexed only the documents containing full abstracts and ended up with about 25 million documents. Unpacked, PubMed is about 120GB in size.

Creating a well-performing index for lexical search in OpenSearch was not problematic; it worked pretty much out of the box. We indexed titles and abstract texts, and added some metadata for filtering, such as year of publication, authors, and journal. OpenSearch supports FAISS as a vector store, so we tried to index our data with FAISS, but this was not possible, as the index was too large and we were running out of memory (we had a 64GB cloud instance for the index). The embeddings were created using an MSMarco fine-tuned model based on DistilBERT (sentence-transformers/msmarco-distilbert-base-tas-b). Since FAISS only supported an in-memory index, we needed to find another solution that could store part of the index on the hard drive. The solution was found in the Qdrant database, as it supports memory mapping and on-disk storage of part of the index.
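To give a concrete idea of the setup, here is a minimal sketch (not the project's exact indexing code) of how abstracts can be embedded with the MSMarco DistilBERT model and stored in Qdrant with on-disk vectors; the collection name and payload fields are illustrative.

```python
# Sketch: embed PubMed abstracts and index them in Qdrant with on-disk vectors.
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient, models

encoder = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-tas-b")
client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="pubmed_abstracts",  # illustrative name
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(),  # 768 for this model
        distance=models.Distance.DOT,
        on_disk=True,  # keep raw vectors on disk instead of fully in RAM
    ),
)

abstracts = [
    {"pmid": 12345, "title": "Example title", "abstract": "Example abstract text."},
]
vectors = encoder.encode([f"{a['title']} {a['abstract']}" for a in abstracts])

client.upsert(
    collection_name="pubmed_abstracts",
    points=[
        models.PointStruct(id=a["pmid"], vector=v.tolist(), payload=a)
        for a, v in zip(abstracts, vectors)
    ],
)
```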

Another issue that appeared while creating the index was that once we enabled memory mapping and built the whole PubMed index, queries took a long time to execute (almost 30 seconds). The problem was that dot product calculations in 32-bit precision took a while for an index of 25 million documents (with parts of the index potentially being loaded from the HDD). Therefore, we switched the search to 8-bit precision, which reduced the query time from about 30 seconds to less than half a second.
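Below is a hedged sketch of how such a speed-up can be configured in Qdrant: an int8 scalar-quantized copy of the vectors is kept in RAM for fast scoring, while the full-precision vectors stay on disk. The exact settings used in the project may differ.

```python
# Sketch: enable int8 scalar quantization on the collection and query with it.
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient, models

encoder = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-tas-b")
client = QdrantClient(url="http://localhost:6333")

client.update_collection(
    collection_name="pubmed_abstracts",  # same illustrative name as above
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            always_ram=True,  # keep the small quantized index in memory
        )
    ),
)

hits = client.search(
    collection_name="pubmed_abstracts",
    query_vector=encoder.encode("What causes iron deficiency anemia?").tolist(),
    limit=10,
    search_params=models.SearchParams(
        # Skip full-precision rescoring to keep latency low.
        quantization=models.QuantizationSearchParams(rescore=False)
    ),
)
```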

The lexical index contained whole abstracts; however, for the semantic index, documents needed to be split, because the transformer model we used for building the semantic index could take only 512 tokens. Therefore, documents were split at the last full stop before the 512th token, and the same was done for every following 512 tokens.
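As an illustration of this chunking (not the project's exact implementation), one can accumulate sentences until the 512-token budget of the embedding model is reached:

```python
# Sketch: split an abstract into chunks of at most 512 tokens, breaking at full stops.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")
MAX_TOKENS = 512

def split_abstract(text: str, max_tokens: int = MAX_TOKENS) -> list[str]:
    # Naive sentence split on full stops; a real pipeline might use a proper sentence splitter.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(tokenizer.encode(candidate, add_special_tokens=True)) > max_tokens and current:
            chunks.append(current)   # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```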

To combine semantic and lexical search, we normalized the outputs of the queries by dividing all scores returned from Qdrant or OpenSearch by the top score returned from the given document store. In that way, we obtained two numbers, one for semantic and one for lexical search, each in the range 0–1. We then tested the precision of the most relevant documents among the top retrieved documents using the BioASQ dataset. The results can be seen in the table below.

The results of information retrieval, evaluating weights of semantic and lexical search. As can be seen, the best results were obtained with a lexical search weight of 0.7 and a semantic weight of 0.3. Image from our paper accepted at BioNLP, with a preprint available at https://arxiv.org/abs/2407.05015v1

We also did some re-ranking experiments using full precision; more details can be found in the paper. However, this was not used in the final application. The overall conclusion was that lexical search does a pretty good job on its own, with some contribution from semantic search, and the best performance was obtained with weights of 0.7 for lexical search and 0.3 for semantic search.

Finally, we built a query-processing step: for the lexical query, stopwords were excluded and the search was performed in the lexical index, while similarity was calculated against the semantic index. The scores for documents from both the semantic and lexical indices were normalized and summed up, and the top 10 documents were retrieved.
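The following is a minimal sketch of this hybrid merging step under the assumptions described above (normalize each result list by its top score, weight lexical search 0.7 and semantic search 0.3, sum per document, keep the top 10); the data structures are illustrative.

```python
# Sketch: merge lexical and semantic results with max-normalization and weights.
from collections import defaultdict

def hybrid_merge(lexical_hits, semantic_hits, w_lex=0.7, w_sem=0.3, top_k=10):
    """Each hits argument is a list of (pmid, score) pairs sorted by descending score."""
    combined = defaultdict(float)
    for hits, weight in ((lexical_hits, w_lex), (semantic_hits, w_sem)):
        if not hits:
            continue
        top_score = hits[0][1]
        for pmid, score in hits:
            combined[pmid] += weight * (score / top_score)  # normalize to the 0-1 range
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)[:top_k]

# Example usage with toy scores:
lexical = [(111, 12.3), (222, 9.8), (333, 4.1)]
semantic = [(222, 0.92), (444, 0.88), (111, 0.75)]
print(hybrid_merge(lexical, semantic))
```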

Referenced answer generation

Once the top 10 documents were retrieved, we could pass them to a generative model for referenced answer generation. We tested several models. This can be done well with GPT4-Turbo models, and most commercially available platforms would use GPT4 or Claude models. However, we wanted to create an open-source variant that does not depend on commercial models, using smaller and more efficient models whose performance is close to that of commercial models. Therefore, we tested Mistral 7B instruct both in a zero-shot regime and fine-tuned using 4-bit QLoRA.

To fine-tune Mistral, we needed to create a dataset for referenced question answering over PubMed. We created it by randomly selecting questions from the PubMedQA dataset, retrieving the top 10 relevant documents for each, and using GPT-4 Turbo for referenced answer generation. We called this dataset PQAref and published it on HuggingFace. Each sample contains a question, a set of 10 documents, and a generated answer with referenced documents (based on the 10 documents passed in the context).
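Since the dataset is public, it can be inspected directly from HuggingFace; the snippet below is just a quick way to look at it and does not assume any particular split or column names.

```python
# Sketch: load and inspect the PQAref dataset from HuggingFace.
from datasets import load_dataset

pqaref = load_dataset("BojanaBas/PQAref")
print(pqaref)                         # shows the available splits, row counts, and column names
first_split = next(iter(pqaref.values()))
print(first_split[0])                 # inspect one example (question, documents, referenced answer)
```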

Using this dataset, we created a QLoRA adapter for Mistral-7B-instruct. It was trained on the Serbian National AI platform in the National Data Center of Serbia, using an Nvidia A100 GPU. The training took around 32 hours.
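For readers who want to reproduce a similar setup, here is a hedged sketch of a typical 4-bit QLoRA configuration with transformers, bitsandbytes, and peft for Mistral-7B-Instruct; the LoRA rank, target modules, and other hyperparameters shown are illustrative and may differ from the ones we actually used.

```python
# Sketch: load Mistral-7B-Instruct in 4-bit and attach a LoRA adapter for training.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "mistralai/Mistral-7B-Instruct-v0.2"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                       # illustrative rank, not necessarily what we used
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the LoRA weights are trainable
```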

We performed an evaluation comparing Mistral 7B instruct v1 and Mistral 7B instruct v2, with and without QLoRA fine-tuning (without fine-tuning, the models were used zero-shot, relying only on the instruction, while with QLoRA we could save some tokens, since the instruction was no longer necessary in the prompt because fine-tuning makes the model do what is needed), and we compared them with GPT-4 Turbo (with the prompt: “Answer the question using relevant abstracts provided, up to 300 words. Reference the statements with the provided abstract_id in brackets next to the statement.”). We ran several evaluations, looking at the number of referenced documents and whether the referenced documents were relevant. The results can be seen in the tables below.

The number of answers containing N references for various models. Image from our paper accepted at BioNLP, with a preprint available at https://arxiv.org/abs/2407.05015v1

From this table, it can be concluded that Mistral, especially the first version in zero-shot mode (0-M1), rarely references context documents, even though this was requested in the prompt. The second version performed much better, but it was still far from GPT4-Turbo or the fine-tuned Mistral 7B models. The fine-tuned Mistrals tended to cite more documents, even when the answer could be found in a single document, and added some additional information compared to GPT4-Turbo.

The number of relevant referenced documents for various models. Image from our paper accepted at BioNLP, with a preprint available at https://arxiv.org/abs/2407.05015v1

As can be seen from the second table, GPT4-Turbo missed relevant references only once in the whole test set, while Mistral 7B-instruct v2 with fine-tuning missed a few more, but still showed comparable performance given its much smaller model size.

We also inspected several answers manually to make sure they make sense. In the end, the app uses Mistral 7B instruct v2 with the fine-tuned QLoRA adapter.

Answer verification

The final part of the system is answer verification. Its main feature is a model that verifies whether a generated claim is supported by the abstract it references. We fine-tuned several BERT- and RoBERTa-based models on the SciFact dataset from the Allen Institute for AI.

To build the model input, we parsed the answer to find sentences and their related references. We found that the first and last sentences are often introduction or conclusion sentences and may not be referenced; all other sentences should have a reference. If a sentence contains a reference, it is treated as based on that PubMed document. If a sentence does not contain a reference, but the sentences before and after it are referenced, we calculate the dot product between the embedding of that sentence and the embeddings of the sentences in the two referenced abstracts. The abstract containing the sentence with the highest dot product is treated as the abstract the sentence was based on.
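A simple sketch of this attribution heuristic could look as follows (illustrative code, not the project's exact implementation): the unreferenced sentence is compared against the sentences of the two neighbouring abstracts, and the abstract with the best-matching sentence wins.

```python
# Sketch: attribute an unreferenced sentence to one of its neighbours' abstracts.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-tas-b")

def attribute_sentence(sentence: str, prev_abstract: str, next_abstract: str) -> str:
    """Return 'prev' or 'next' depending on which abstract best matches the sentence."""
    sent_vec = encoder.encode(sentence)
    scores = {}
    for name, abstract in (("prev", prev_abstract), ("next", next_abstract)):
        abs_sentences = [s.strip() for s in abstract.split(".") if s.strip()]
        abs_vecs = encoder.encode(abs_sentences)
        scores[name] = float(np.max(abs_vecs @ sent_vec))  # best-matching sentence in this abstract
    return max(scores, key=scores.get)
```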

Once the answer is parsed and we have found all the abstracts the claims are based on, we pass each claim-abstract pair to the fine-tuned model. The input to the model was constructed in the following way:

For DeBERTa models:

[CLS]claim[SEP]evidence[SEP]

For RoBERTa-based models:

<s>claim</s></s>evidence</s>

Here, the claim is a generated claim from the generative component, and the evidence is the concatenated title and abstract text of the referenced PubMed document.
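As an illustration, a (claim, evidence) pair can be classified with the published verification model roughly as follows; the tokenizer produces the [CLS]/[SEP] layout shown above for DeBERTa, and the label names are read from the model configuration rather than assumed.

```python
# Sketch: classify a claim-evidence pair with the answer verification model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "MilosKosRad/TextualEntailment_DeBERTa_preprocessedSciFACT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

claim = "Vitamin D supplementation reduces the risk of respiratory infections."
evidence = "Example title. Example abstract text describing a clinical trial..."

inputs = tokenizer(claim, evidence, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted = model.config.id2label[int(logits.argmax(dim=-1))]
print(predicted)  # label such as support / contradict / no evidence, as defined by the model config
```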

We have evaluated the performance of our fine-tuned models and obtained the following results:

Evaluation of the models trained and tested on the SciFact dataset. Image by authors, publication submitted to the 16th International Conference on Knowledge Management and Information Systems

Models fine-tuned and tested on the same dataset often score well, so we also wanted to test on out-of-domain data. We selected the HealthVer dataset, which is also a healthcare-domain dataset used for claim verification. The results were the following:

Results of testing on the HealthVer dataset. Image by authors, publication submitted to the 16th International Conference on Knowledge Management and Information Systems

We also evaluated the SciFact label prediction task using the GPT-4 model (with the prompt “Critically asses whether the statement is supported, contradicted or there is no evidence for the statement in the given abstract. Output SUPPORT if the statement is supported by the abstract. Output CONTRADICT if the statement is in contradiction with the abstract and output NO EVIDENCE if there is no evidence for the statement in the abstract.”), resulting in a precision of 0.81, a recall of 0.80, and an F1 score of 0.79. Our model therefore performs better and, due to its much lower number of parameters, is more efficient.

On top of the verification with this model, we also find the closest sentence in the abstract using dot product similarity (with the same MSMarco model we use for semantic search). The sentence closest to a generated one is shown in the user interface when the user hovers over that sentence.

User interface

We have developed a user interface where users can register, log in, and ask questions, and where they get referenced answers, links to PubMed, and verification by the posterior model described above. Here are a few screenshots of the user interface:

The user interface, while generating an answer. Screenshot by authors
Output, including verification. Screenshot by authors
Generating a different answer and the configuration window open. Screenshot by authors

Conclusion

VerifAI project logo. Logo by authors

We have presented our experience from building the VerifAI project, a biomedical generative question-answering engine with verifiable answers. We are open-sourcing the whole code and all models, and we are opening the application to the public, at least temporarily (how long depends on the budget and on whether we can find a sustainable solution for hosting). In the sections below, you can find the links to the application, website, code, and models.

The application is the result of the work of multiple people (see the Team section) and almost a year of research and development. We are happy and proud to present it to a wider public and hope that people will enjoy it and also contribute to it, to make it more sustainable and better in the future.

Cite our papers

If you use any of the methodology, models, or datasets, or mention this project in your paper's background section, please cite our paper (preprint available at https://arxiv.org/abs/2407.05015v1).

Availability

The application can be accessed and tried at https://verifai-project.com/ or https://app.verifai-project.com/. Users can register there and ask questions to try our platform.

The code of the project is available on GitHub at https://github.com/nikolamilosevic86/verif.ai. The QLoRA adapters for Mistral 7B instruct can be found on HuggingFace at https://huggingface.co/BojanaBas/Mistral-7B-Instruct-v0.2-pqa-10 and https://huggingface.co/BojanaBas/Mistral-7B-Instruct-v0.1-pqa-10. The generated dataset for fine-tuning the generative component can also be found on HuggingFace at https://huggingface.co/datasets/BojanaBas/PQAref. The model for answer verification can be found at https://huggingface.co/MilosKosRad/TextualEntailment_DeBERTa_preprocessedSciFACT.

Team

The VerifAI project was developed as a collaborative project between Bayer Pharma R&D and the Institute for Artificial Intelligence Research and Development of Serbia, funded by the NGI Search project under grant agreement No 101069364. The people involved are Nikola Milosevic, Lorenzo Cassano, Bojana Bašaragin, Miloš Košprdić, Adela Ljajić and Darija Medvecki.

If you would like to put faces to the team: the team (with some family members) at Prater Berlin Biergarten (image by authors, 10th July 2024).
