We will explore two main ways in which document
structure can condition the generation of complex
questions: typography (sections, lists, tables, etc.)
and discourse relations (causal, temporal, conditional,
contrast, etc.). While early work on question
generation focused on designing rule-based and
statistical systems which could transform input text
into questions (Heilman and Smith, 2010; Rus et al.,
2012), neural approaches have introduced a novel
paradigm in which question generation is viewed as a
two-step encoding-decoding process. The encoder
converts the input text into a continuous
representation, and the decoder then generates a
question from this continuous representation (Duan et
al., 2017; Du and Cardie, 2018). Our goal is to
extend previous work on neural QG in two main ways. We
will target the generation of complex questions and we
will focus on questions that can be used for the
development of conversational agents in the technical
domain.
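To make the encoding-decoding view concrete, here is a minimal toy sketch, not the actual systems cited above: the encoder collapses the input tokens into one continuous vector, and the decoder greedily emits question tokens conditioned on that vector. The vocabulary, embeddings, and state update are illustrative assumptions (untrained), standing in for a learned RNN/Transformer encoder-decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; a real system learns these embeddings from data.
vocab = ["what", "is", "a", "deductible", "how", "claim", "filed", "?"]
emb = {w: rng.standard_normal(8) for w in vocab}

def encode(tokens):
    """Encoder: map the input text to a single continuous representation
    (here, the mean of token embeddings, a stand-in for a trained encoder)."""
    return np.mean([emb[t] for t in tokens if t in emb], axis=0)

def decode(state, max_len=5):
    """Decoder: greedily emit question tokens conditioned on the encoded state.
    Scores are dot products with the (untrained) embeddings, so the output
    is illustrative only, not a well-formed question."""
    out = []
    for _ in range(max_len):
        scores = {w: float(state @ emb[w]) for w in vocab}
        best = max(scores, key=scores.get)
        out.append(best)
        state = 0.5 * state + 0.5 * emb[best]  # simple decoder-state update
    return out

question = decode(encode(["a", "deductible", "is", "what"]))
print(question)
```

A trained model replaces both stubs with learned parameters optimized on (text, question) pairs, but the two-step interface is the same.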
Complex Questions. The generation of
questions whose answers are complex text fragments,
possibly distributed across the input document, raises two
main challenges: content selection (how to identify
the relevant content, i.e., the answer, in the input
document) and reformulation (how to map the answer in
the input text to a well-formed question). We will
explore these two issues as follows. On the one hand,
we will investigate how the document representations
produced by a neural system that takes into account
document and discourse structure can help detect
relevant input fragments. That is, we will explore how
question generation can benefit from the rich document
representations induced in the context of our first
project goal. On the other hand, we will explore how
techniques from abstractive summarization (Rush et
al., 2015; Chopra et al., 2016) can be used to support
the generation of well-formed, semantically
appropriate questions.
Technical Domain. Most existing work on
question generation focuses on non-technical
questions, such as questions related to encyclopedic
text (SQuAD, Rajpurkar et al., 2016), news articles
(NewsQA, Trischler et al., 2017) or movies (MovieQA,
Tapaswi et al., 2016). While some of our work will
bear on forum and benchmark data, we will additionally
focus on questions which can be used to build
conversational agents that can answer questions about
technical domains. Building on Synapse’s expertise, we
will focus on the insurance domain and explore two
main research questions: (i) how to exploit multitask
and/or transfer learning to leverage all data
(existing QG/QA datasets, forum data and Synapse data)
collected over the course of our project; (ii) how to
use domain knowledge to improve generation. Cao et
al. (2018) have shown that automatically extracted
text templates can be used to improve headline
generation. We will investigate how question patterns
can be extracted from technical documents and
exploited to improve neural QG in the technical
domain.
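A simple way to see how question patterns could be mined is to mask domain-specific terms in observed questions and keep the recurring skeletons. The questions, gazetteer, and insurance examples below are invented assumptions for illustration; a real system would induce domain terms automatically (e.g., via TF-IDF or term extraction) rather than hard-code them.

```python
# Hypothetical insurance-domain questions (invented examples).
questions = [
    "How do I file a claim for water damage?",
    "How do I file a claim for a stolen vehicle?",
    "What does my policy cover in case of fire?",
]

# Toy gazetteer of domain terms; a real system would induce these from corpora.
domain_terms = ["water damage", "a stolen vehicle", "fire"]

def to_pattern(question):
    """Replace domain terms with a slot, longest terms first,
    so that overlapping terms do not clobber each other."""
    for term in sorted(domain_terms, key=len, reverse=True):
        question = question.replace(term, "<TERM>")
    return question

# Deduplicate: questions sharing a skeleton collapse into one pattern.
patterns = sorted({to_pattern(q) for q in questions})
print(patterns)
# The two "file a claim" questions collapse into a single shared pattern.
```

Patterns extracted this way could then be used as soft templates conditioning the neural generator, in the spirit of Cao et al. (2018).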