We will explore two main ways in which document
structure can condition the generation of complex
questions: typography (sections, lists, tables, etc.)
and discourse relations (causal, temporal, conditional,
contrast, etc.). While early work on question
generation focused on designing rule-based and
statistical systems which could transform input text
into questions (Heilman and Smith, 2010; Rus et al.,
2012), neural approaches have introduced a novel
paradigm in which question generation is viewed as a
two-step encoding-decoding process. The encoder
converts the input text into a continuous
representation, and the decoder then generates a
question from this continuous representation (Duan et
al., 2017; Du and Cardie, 2018). Our goal is to
extend previous work on neural QG in two main ways. We
will target the generation of complex questions and we
will focus on questions that can be used for the
development of conversational agents in the technical
domain.
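To make the encoding-decoding view concrete, here is a minimal toy sketch, not the actual systems cited above: the encoder collapses the input tokens into one continuous vector, and the decoder greedily emits question tokens conditioned on that vector. The vocabulary, embeddings, and state update are illustrative assumptions (untrained), standing in for a learned RNN/Transformer encoder-decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; a real system learns these embeddings from data.
vocab = ["what", "is", "a", "deductible", "how", "claim", "filed", "?"]
emb = {w: rng.standard_normal(8) for w in vocab}

def encode(tokens):
    """Encoder: map the input text to a single continuous representation
    (here, the mean of token embeddings, a stand-in for a trained encoder)."""
    return np.mean([emb[t] for t in tokens if t in emb], axis=0)

def decode(state, max_len=5):
    """Decoder: greedily emit question tokens conditioned on the encoded state.
    Scores are dot products with the (untrained) embeddings, so the output
    is illustrative only, not a well-formed question."""
    out = []
    for _ in range(max_len):
        scores = {w: float(state @ emb[w]) for w in vocab}
        best = max(scores, key=scores.get)
        out.append(best)
        state = 0.5 * state + 0.5 * emb[best]  # simple decoder-state update
    return out

question = decode(encode(["a", "deductible", "is", "what"]))
print(question)
```

A trained model replaces both stubs with learned parameters optimized on (text, question) pairs, but the two-step interface is the same.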
Complex Questions. The generation of
questions whose answers are complex text fragments,
possibly distributed across the input document, raises two
main challenges: content selection (how to identify
the relevant content, i.e., the answer, in the input
document) and reformulation (how to map the answer in
the input text to a well-formed question). We will
explore these two issues as follows. On the one hand,
we will investigate how the document representations
produced by a neural system that takes into account
document and discourse structure can help detect
relevant input fragments. That is, we will explore how
question generation can benefit from the rich document
representations induced in the context of our first
project goal. On the other hand, we will explore how
techniques from abstractive summarization (Rush et
al., 2015; Chopra et al., 2016) can be used to support
the generation of well-formed, semantically
appropriate questions.
Technical Domain. Most existing work on
question generation focuses on non-technical
questions, such as questions related to encyclopedic
text (SQuAD, Rajpurkar et al., 2016), news articles
(NewsQA, Trischler et al., 2017) or movies (MovieQA,
Tapaswi et al., 2016). While some of our work will
bear on forum and benchmark data, we will additionally
focus on questions which can be used to build
conversational agents that can answer questions about
technical domains. Building on Synapse’s expertise, we
will focus on the insurance domain and explore two
main research questions: (i) how to exploit multitask
and/or transfer learning to leverage all data
(existing QG/QA datasets, forum data and Synapse data)
collected over the course of our project; (ii) how to
use domain knowledge to improve generation. Cao et
al. (2018) have shown that automatically extracted
text templates can be used to improve headline
generation. We will investigate how question patterns
can be extracted from technical documents and
exploited to improve neural QG in the technical
domain.
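A simple way to see how question patterns could be mined is to mask domain-specific terms in observed questions and keep the recurring skeletons. The questions, gazetteer, and insurance examples below are invented assumptions for illustration; a real system would induce domain terms automatically (e.g., via TF-IDF or term extraction) rather than hard-code them.

```python
# Hypothetical insurance-domain questions (invented examples).
questions = [
    "How do I file a claim for water damage?",
    "How do I file a claim for a stolen vehicle?",
    "What does my policy cover in case of fire?",
]

# Toy gazetteer of domain terms; a real system would induce these from corpora.
domain_terms = ["water damage", "a stolen vehicle", "fire"]

def to_pattern(question):
    """Replace domain terms with a slot, longest terms first,
    so that overlapping terms do not clobber each other."""
    for term in sorted(domain_terms, key=len, reverse=True):
        question = question.replace(term, "<TERM>")
    return question

# Deduplicate: questions sharing a skeleton collapse into one pattern.
patterns = sorted({to_pattern(q) for q in questions})
print(patterns)
# The two "file a claim" questions collapse into a single shared pattern.
```

Patterns extracted this way could then be used as soft templates conditioning the neural generator, in the spirit of Cao et al. (2018).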