Solved – How to implement a question answering program based on Q&A data

For example, I have this Q&A data:

Q: Do you like apples?

A: Yes.

Q: Do you like running?

A: Yes.

The algorithm should take that input, a thesaurus (synonyms, antonyms, etc.), and a word categorization dictionary (e.g. DBpedia), and produce a program that answers questions. For example:

Q: Do you like fruits?

A: Yes.

Q: What activities do you prefer?

A: Running.

Q: Are you a robot?

A: I don't know.

Does such an algorithm exist? Links to papers and code are appreciated.

Such a program is also called a chatterbot. The first chatterbots were developed in Lisp and can no longer be run because they were written for old computer architectures.

I am now looking into NLP software that can do this.

This answer is based on studying the paper by Kwiatkowski et al., "Scaling Semantic Parsers with On-the-fly Ontology Matching", mentioned by @Richard, and on the UBL software. The problem is solved using Java for parsing questions and Prolog for holding and querying the factual database.

The Java program takes the following inputs:

1) Training data in the form

what is the highest point of the state with the largest area

(answer (highest (place (loc_2 (largest_one (area_1 (state all:e)))))))

The first line is the question. The second line is the query that produces the answer, written in Prolog*. Prolog is used in many NLP applications because of its focus on logic programming.
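To get a feel for this training format, here is a minimal Python sketch (not part of UBL, which is written in Java) that reads a question line followed by its parenthesised query and turns the query into nested lists. The function names `tokenize`, `parse`, and `read_pairs` are my own, and I am assuming the file simply alternates question and query lines.

```python
def tokenize(text):
    """Split a parenthesised query into '(', ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Recursively build nested lists from the token stream."""
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return node
    return token


def read_pairs(lines):
    """Pair each question line with the query on the following line.

    Assumption: blank lines are separators and carry no meaning.
    """
    useful = [line.strip() for line in lines if line.strip()]
    return [(useful[i], parse(tokenize(useful[i + 1])))
            for i in range(0, len(useful) - 1, 2)]


sample = """
what is the highest point of the state with the largest area
(answer (highest (place (loc_2 (largest_one (area_1 (state all:e)))))))
"""

pairs = read_pairs(sample.splitlines())
question, logical_form = pairs[0]
print(question)
print(logical_form)
```

The nested list makes the structure of the query visible: each operator wraps exactly one argument, with `state all:e` at the innermost level.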

2) A lexicon (en-np-fixedlex.geo) that holds noun phrases (NP) and their mappings to Prolog queries. This data is used when parsing questions to identify where the nouns are in a sentence.

ohio :- NP : (stateid ohio:e)

the missouri river :- NP : (riverid missouri:e)

The syntactic category NP comes from Combinatory Categorial Grammar (CCG). For more information, read the Wikipedia entry for CCG or Steedman's "A very short introduction to CCG"**.
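As an illustration of how such lexicon lines can be used to spot noun phrases, here is a small Python sketch. It is not how UBL does it internally; the greedy longest-match lookup, the function names, and the example question are all my own assumptions. Only the two lexicon lines come from the file above.

```python
def parse_lexicon_line(line):
    """Split 'phrase :- CATEGORY : semantics' into its three parts."""
    phrase, rest = line.split(":-", 1)
    category, semantics = rest.split(":", 1)  # split once: semantics contains ':'
    return phrase.strip(), category.strip(), semantics.strip()


def load_lexicon(lines):
    """Map each phrase to its (category, semantics) pair."""
    lexicon = {}
    for line in lines:
        if line.strip():
            phrase, category, semantics = parse_lexicon_line(line)
            lexicon[phrase] = (category, semantics)
    return lexicon


def find_noun_phrases(question, lexicon):
    """Greedy longest-match lookup of lexicon phrases in a question."""
    words = question.lower().split()
    found = []
    i = 0
    while i < len(words):
        for j in range(len(words), i, -1):  # try the longest span first
            phrase = " ".join(words[i:j])
            if phrase in lexicon:
                found.append((phrase, lexicon[phrase]))
                i = j
                break
        else:
            i += 1  # no phrase starts at this word
    return found


lexicon = load_lexicon([
    "ohio :- NP : (stateid ohio:e)",
    "the missouri river :- NP : (riverid missouri:e)",
])

# Hypothetical question, used only to exercise the lookup.
matches = find_noun_phrases("what states border the missouri river", lexicon)
print(matches)
```

Trying the longest span first means a multi-word entry such as "the missouri river" is matched as a whole rather than word by word.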

3) Probabilities of occurrence of words in the factual database. These probabilities are used to select the best grammar for a question: a higher probability means that the produced grammar is more likely to be used.
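To show the general idea of picking the most probable reading, here is a rough Python sketch. I have not checked how UBL actually combines these numbers, so treat the scoring rule (sum of log-probabilities of the lexical choices) as an assumption, and note that the probability values below are invented purely for illustration.

```python
import math


def score(candidate, probabilities, floor=1e-6):
    """Sum of log-probabilities of the lexical choices in one candidate.

    Unknown choices get a small floor probability instead of zero.
    """
    return sum(math.log(probabilities.get(choice, floor)) for choice in candidate)


def best_candidate(candidates, probabilities):
    """Return the candidate reading with the highest score."""
    return max(candidates, key=lambda c: score(c, probabilities))


# Invented numbers: "ohio" could be read as a state or as a river.
probabilities = {
    ("ohio", "stateid"): 0.8,
    ("ohio", "riverid"): 0.2,
}

candidates = [
    (("ohio", "stateid"),),
    (("ohio", "riverid"),),
]

chosen = best_candidate(candidates, probabilities)
print(chosen)
```

Working with logarithms avoids multiplying many small probabilities together, which matters once a question contains more than a couple of words.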

The goal of the program is to learn to translate questions into Prolog code. The user then needs to execute the resulting Prolog statement to get an answer. The paper states that the algorithm is domain independent and can be used for datasets other than geographical data. It works well for answering questions about hierarchical data, such as countries that consist of states and cities, which in turn contain rivers, mountains, etc.
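To make the "execute the query to get an answer" step concrete without installing Prolog, here is a toy Python sketch that evaluates the example query from the training data against a tiny fact base. The meaning I give to each operator (`area_1`, `largest_one`, `loc_2`, and so on) is my reading of their names, not taken from geobase, and the facts are a handful of approximate real-world values included only for illustration.

```python
# Approximate values, for illustration only.
STATE_AREA_KM2 = {"alaska": 1_723_000, "texas": 696_000, "ohio": 116_000}

# place -> (state it is located in, elevation in metres)
PLACES = {
    "denali": ("alaska", 6190),
    "mount foraker": ("alaska", 5304),
    "guadalupe peak": ("texas", 2667),
    "campbell hill": ("ohio", 472),
}

# Assumed semantics for each operator in the query.
OPERATORS = {
    "state": lambda _: set(STATE_AREA_KM2),                       # all states
    "area_1": lambda states: {s: STATE_AREA_KM2[s] for s in states},
    "largest_one": lambda measured: {max(measured, key=measured.get)},
    "loc_2": lambda states: {p for p, (s, _) in PLACES.items() if s in states},
    "place": lambda items: {p for p in items if p in PLACES},
    "highest": lambda places: max(places, key=lambda p: PLACES[p][1]),
    "answer": lambda result: result,
}


def evaluate(form):
    """Evaluate a nested [operator, argument] query from the inside out."""
    if isinstance(form, str):
        return form  # a constant such as "all:e"
    operator, argument = form
    return OPERATORS[operator](evaluate(argument))


# (answer (highest (place (loc_2 (largest_one (area_1 (state all:e)))))))
query = ["answer", ["highest", ["place", ["loc_2",
         ["largest_one", ["area_1", ["state", "all:e"]]]]]]]

result = evaluate(query)
print(result)
```

In the real setup Prolog does this work by resolving the query against its fact database; the sketch only shows why a nested query of this shape is enough to answer the question.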

I would recommend using the described approach only for learning purposes. I noticed that the Java program can take some time to train on the QA data, so running it repeatedly to test ideas can be slow. SWI-Prolog is said to focus more on features than on performance, so large factual databases are also out of the question.

* SWI-Prolog (a free Prolog implementation) does not understand this syntax. I assume SICStus does. I had to change it to conform to the syntax used in geobase to run it in SWI-Prolog.

** The important things to understand about CCG are forward and backward application and how the grammar combines words to form a sentence.
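Forward and backward application can be sketched in a few lines of Python. This is only a toy model of the two rules: the tuple encoding of categories and the example sentence "ohio borders kentucky" are my own choices, not something taken from UBL or the lexicon file.

```python
# A category is either an atomic string ("NP", "S") or a tuple
# (result, slash, argument), e.g. ("S", "\\", "NP") for S\NP.

def forward(left, right):
    # Forward application:  X/Y  Y  =>  X
    if isinstance(left, tuple) and left[1] == "/" and left[2] == right:
        return left[0]
    return None


def backward(left, right):
    # Backward application:  Y  X\Y  =>  X
    if isinstance(right, tuple) and right[1] == "\\" and right[2] == left:
        return right[0]
    return None


# Hypothetical mini-lexicon: a transitive verb has category (S\NP)/NP.
OHIO = "NP"
KENTUCKY = "NP"
BORDERS = (("S", "\\", "NP"), "/", "NP")

# "borders kentucky": the verb consumes its object to the right.
verb_phrase = forward(BORDERS, KENTUCKY)

# "ohio [borders kentucky]": the verb phrase consumes its subject to the left.
sentence = backward(OHIO, verb_phrase)

print(verb_phrase)
print(sentence)
```

The slash direction says on which side the argument is expected, which is how word order ends up encoded in the lexicon rather than in separate grammar rules.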
