A Summary of the Question Answering task
Writing a summary of the QA (or Q&A, it doesn't matter) task is very hard to do, because of the large number of different solutions available. This article, and maybe another one to follow, aims to summarize what I discovered while scouting solutions for this task, with the goal of developing a business product for the company I work for. I write it so that what I have learned doesn't disappear, and also so that someone else can benefit from the days I spent studying it. I hope it will be useful to you!
First, I want to talk about it without going into depth, to create a map you can use to start on the right foot in this field. I will surely make some mistakes given my level of expertise, but I promise I'll do my best.
So, let’s start.
First of all: what are we talking about? The QA task aims to build a computer system that automatically answers questions posed by humans in natural language. It is a very interesting challenge that was reignited in 2017 (the release date of DrQA, one of the best-known solutions), since when a lot of different solutions have been published.
It is different from the document search done by Google (or ElasticSearch, to be clear) because we don't want documents as answers, we want some text. However, Google also partially fulfils this task. When? If you ask Google, for example, "When was Mario Draghi born?" (the Italian prime minister, at the moment I'm writing), the answer will be:
This is the Q&A task: you ask a question and the system gives you the information you need. What Google usually does instead is called Information Retrieval, which is a different thing (though not that distant, by the way).
This kind of system can be very useful when, for example, a company has a corpus of documents it wants to make available to its customers. It can easily use an Information Retrieval algorithm (e.g. ElasticSearch) to let customers find articles inside this corpus, and a QA system to help them find quick answers to questions, always using the knowledge contained in that corpus. An approach that Google also uses is to let you find an interesting article and, when you open it, highlight some words or phrases as if they were the response to your search. I think this is a good way to offer a service, because you try to offer a specific answer to a question, but you also let the customer judge whether those words or phrases are a correct answer, because they can read the context where they appear.
For example, if you search on Google for "iPhone release date theverge", this will appear:
And, if you open it, you are redirected to a specific part of the article. Here is where it brings you:
As you can see, Google highlighted some phrases with the aim of answering your search/question as quickly as possible. This happens only for some searches, in particular when they are written in a question form.
Let’s now talk about possible solutions to do things like that.
Open-Domain QA vs Closed-Domain QA
If you start reading papers or articles about QA, you will surely come across ODQA and CDQA. What are they? ODQA is short for Open-Domain Question Answering, and CDQA for Closed-Domain Question Answering. In other words? It is a way to classify a QA system according to whether or not it can answer questions about any topic.
In Open-Domain QA systems, the questions don't have to be limited to predefined domains and domain knowledge; ideally, the system should be able to answer questions from any domain (sport, space, politics, geography, etc.). In Closed-Domain QA systems, instead, the questions have to be limited to certain domains only, because the system is not able to answer questions outside them.
In my opinion, this kind of split is useful only for the implementation of a solution, not for the solution itself. Why? DrQA is a solution proposed in 2017 to solve the QA task. It can be applied to any collection of textual documents, no matter what those documents are about or how many there are. They could be the whole of Wikipedia or just a few documents about court cases. This solution can therefore be used to develop an Open-Domain QA system (if we use Wikipedia as the corpus) or a Closed-Domain QA system (if we use the court case documents).
I don't find this split very interesting and, in the following, I won't talk about it anymore. It becomes useful (maybe) only when you have to explain what kind of system you want to buy or use. But this is not the case here, so bye.
Open-book vs Closed-book
A more interesting division is between open-book and closed-book solutions. Yes, solutions. This time we are not classifying implementations, we are classifying solutions.
What are we talking about this time? A QA system may work with or without access to an external source of knowledge (e.g. Wikipedia) and these two conditions are referred to as open-book or closed-book question answering, respectively.
This is a very important split, and in the following we will talk about each case.
Open-book solutions
Why is it called "open-book"? Because in an open-book exam, students are allowed to refer to external resources like notes and books while answering test questions. Similarly, an open-book QA system can be paired with a rich knowledge base to identify relevant documents as evidence for answers.
In this type of solution we can decompose the process of finding answers to given questions into two stages:
- Find some relevant documents in an external repository of knowledge (e.g. Wikipedia, company’s documents, etc.)
- Process the retrieved documents to extract an answer
Usually, papers and articles represent this as a pipeline of two stages/blocks, and for each of them different solutions have been proposed.
The first stage is generally called the Retriever: its aim is to retrieve, from the knowledge base, some documents that can be useful to offer the best possible answer. It is usually implemented with the same solutions used for Information Retrieval, which can be of two main types (a minimal retrieval sketch follows this list):
- Non-learning-based: TF-IDF, etc.
- Learning-based: dense vector, etc.
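To make the Retriever stage more concrete, here is a minimal sketch of a non-learning-based retriever built with scikit-learn's TF-IDF vectorizer. The toy corpus and the question are invented for illustration; a learning-based retriever would replace the TF-IDF vectors with dense embeddings produced by a neural encoder.

```python
# A minimal sketch of a non-learning-based Retriever using TF-IDF
# (scikit-learn). The toy corpus and question are made up for
# illustration; a real system would index thousands of documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Mario Draghi was born in Rome on 3 September 1947.",
    "The iPhone was first released by Apple in June 2007.",
    "ElasticSearch is a search engine based on the Lucene library.",
]
question = "When was Mario Draghi born?"

# Fit TF-IDF on the corpus and project the question into the same space.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)
question_vector = vectorizer.transform([question])

# Rank documents by cosine similarity and keep the top-k as evidence.
scores = cosine_similarity(question_vector, doc_vectors)[0]
top_k = scores.argsort()[::-1][:2]
for i in top_k:
    print(f"score={scores[i]:.3f}  doc={corpus[i]}")
```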
The second stage instead has the purpose of taking the documents retrieved by the Retriever and using them to create an answer. It therefore has to solve a Machine Reading Comprehension (MRC) task, that is, a task that requires the machine to understand a text like a human does. In this case, this ability is used to create an answer from one or more pieces of unstructured text that the machine has to read.
I used the word create because there exist different ways to get an answer and, regarding this, we can make another split based precisely on how the system uses the retrieved documents to get the answer.
The second stage can be:
- Extractive: the final answer is a span of one of the documents retrieved by the first stage
- Generative: the final answer is new text, generated by the second stage after reading the documents retrieved by the Retriever module.
Let's look at these two possible systems, generally called Retriever-Reader (the first one) and Retriever-Generator (the second one). The first stage is always the same; the only difference is in the second one.
Retriever-Extractor
As I said before, this kind of architecture involves an MRC task whose aim is to extract a span of text from a document. Of course, there can be more than one document, so we will potentially have more than one candidate answer, from which we will have to choose.
The Reader can be implemented in different ways. Bi-directional LSTMs were used a lot in the past (e.g. by DrQA, a solution that matches this kind of architecture), but nowadays we have the attention mechanism, so why not use it? The attention mechanism performs better than recurrent neural networks, so today it is preferred to use BERT (or other Transformer-like models).
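As a concrete (and heavily simplified) sketch of the extractive Reader, here is how a BERT-like model fine-tuned on SQuAD can be used through the Hugging Face transformers library. The passages stand in for the documents returned by the Retriever, and the checkpoint name is just one publicly available example, not a recommendation:

```python
# A minimal sketch of the extractive Reader stage using a BERT-like
# model through the Hugging Face transformers library. The passages
# stand in for the documents returned by the Retriever.
from transformers import pipeline

reader = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",
)

question = "When was Mario Draghi born?"
retrieved_passages = [
    "Mario Draghi was born in Rome on 3 September 1947.",
    "Draghi served as President of the European Central Bank.",
]

# Run the reader on each retrieved passage and keep the best-scoring span.
candidates = [reader(question=question, context=p) for p in retrieved_passages]
best = max(candidates, key=lambda c: c["score"])
print(best["answer"], best["score"])
```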
Retriever-Generator
In this type of MRC task, the aim is instead to generate an answer by reading some documents. Of course, also in this case the documents can be more than one, but this time we will get just one answer, which potentially contains knowledge from all the retrieved documents.
In this case, using big seq2seq language models (T5, BART, etc.) that have been pre-trained on large collections of unlabelled text is the best choice. BERT, just to be clear, doesn't fit well because it is made up of only the Encoder part of the Transformer, without the Decoder part, unlike T5 and BART.
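Here is a minimal sketch of the Generator stage with a seq2seq model, again through the transformers library. The question and the retrieved passages are concatenated into a single input and the decoder generates a free-text answer; in a real system the checkpoint would be fine-tuned on an abstractive QA dataset, so take t5-base only as a convenient stand-in:

```python
# A minimal sketch of the Generator stage with a seq2seq model (T5).
# The question and the retrieved passages are packed into one input,
# and the model generates a free-text answer. In practice the model
# would be fine-tuned for abstractive QA; "t5-base" is only a stand-in.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

question = "When was Mario Draghi born?"
retrieved_passages = [
    "Mario Draghi was born in Rome on 3 September 1947.",
    "Draghi served as President of the European Central Bank.",
]

# Pack the question and all the evidence into one input sequence.
prompt = "question: " + question + " context: " + " ".join(retrieved_passages)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True)

# The decoder generates the answer token by token.
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```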
Closed-book solutions
These kinds of solutions, unlike open-book ones, aren't paired with a rich knowledge base.
So, how are these kinds of solutions designed? Well, very simply. They are based only on a big seq2seq language model (T5, BART, GPT3, etc.) that has been pre-trained on a large collection of unlabelled text. The same kind of model used in the Retriever-Generator architecture, but this time it is alone (there is no reading comprehension of retrieved documents). At training time it has to memorize all the knowledge it will need later when it is used (SPOILER: to date, these models have a lot of problems).
Why does it work? Given enough parameters, these models can memorize knowledge within their parameter weights (just to give you an idea, GPT3's full version has 175 billion parameters).
Therefore, we can use these models to do question answering without an external knowledge base, just like in a closed-book exam. Of course, the only thing these models can do is produce free text to respond to questions; they can't respond with a span like some open-book solutions do.
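As a sketch of how simple the closed-book setting looks in code: no Retriever, no context, just a question fed to a seq2seq model that has to answer from what it memorized during training. The checkpoint name below assumes a closed-book T5 model fine-tuned with salient span masking on Natural Questions is available; any similar closed-book checkpoint would do:

```python
# A minimal sketch of closed-book QA: no Retriever, no external
# documents, only a seq2seq model answering from what it memorized
# during (pre-)training. The checkpoint name is an assumption: a
# closed-book T5 fine-tuned on Natural Questions with salient span
# masking; swap in any similar checkpoint you have available.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/t5-small-ssm-nq"  # assumed closed-book checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# The question alone is the whole input: there is no context to read.
inputs = tokenizer("When was Mario Draghi born?", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```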
(Extra) Some QA solutions
In this paragraph, I want to talk about some solutions that have been presented in the recent and more distant past. I will not go in-depth, showing math and formulas; I will just let you know what the basis of some of the best-known solutions is.
I want to show you this table that summarizes some of the best-known solutions:
They seem to be ordered by publication date: the oldest one is DrQA and the most recent is Fusion-in-Decoder. This is a very interesting table because it perfectly summarizes the different implementations of the two stages generally used (Retriever and Reader/Generator).
I want to add just a few notes about the columns:
- What is the End2End column? If both stages are trainable you have two possibilities: train them jointly or independently. Training them jointly can of course bring better results, but it is not always simple.
- What is the Pre-training / Fine-tuning column? It tells you whether the model requires extra pre-training (continuing the pre-training already done) or fine-tuning, and on what data.
I also want to add a few words about the models:
- DrQA is THE model: published in 2017, it reignited the QA challenge. However, it is quite old; nowadays there are much more effective models.
- BERTserini: when BERT came out, they understood that it could be much better than a Bi-directional LSTM. Why not use it for QA?
- Multi-passage BERT: in BERTserini (for example) BERT, as the Reader, takes the text fragments one by one and reads them, and the scores of the different fragments are not normalized. Multi-passage BERT does normalize them!
- R³, ORQA, REALM and DPR are recent models that are quite difficult to understand: I need more time before I can talk about them.
- DenSPI doesn't have the Reader/Generator stage. This is because its Retriever retrieves phrases that are already complete answers.
- As you can see, GPT3 and T5 + SSM don't have the Retriever, as they are closed-book solutions with a generator-only architecture.
- RAG is a Retriever-Generator system.
- Fusion-in-Decoder is also a Retriever-Generator system, similar to RAG; as its name suggests, it fuses all the retrieved passages in the decoder when generating the answer.
Here are some results of some of these models on three different datasets:
Fusion-in-Decoder, one of the most recently published architectures, gets the best result on all three datasets.
That's all for today, go in peace!
Here I want to leave you some resources that can be useful to expand your knowledge about QA. First of all, I recommend this article, which explains the difference between Question Answering and Machine Reading Comprehension:
Also, I recommend this video:
It is a presentation held by two experts in the field who talk about the history of, and the latest discoveries in, QA. It lasts 4 hours and I haven't watched it all, but I read the slides and it seems very interesting and complete. Here I also leave you the slides: https://github.com/danqi/acl2020-openqa-tutorial
Here I also leave you some articles:
And some papers:
Enjoy ☺.