CAAS
Information Retrieval Systems (IRS) aim at retrieving information that meets a user’s need as expressed in a query. Retrieving information relevant to a query is a two-step process: offline, the system indexes documents, generally using a bag-of-words representation; online, the system computes the similarity between the user’s query and the document representations (indexing terms) in order to retrieve the most similar documents. Current IRS, e.g. web search engines, are general-purpose search tools that apply the same data processing and matching methods whatever the context of the search, the user, the type of information need, or the intended use of the information.
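As an illustration of this two-step process, here is a minimal sketch in Python. It assumes raw term counts as the bag-of-words representation and cosine similarity as the matching function; the function names (`index_documents`, `search`) and the toy documents are hypothetical and do not describe any particular IRS.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase and split on whitespace only.
    return text.lower().split()


def index_documents(docs):
    # Offline step: build one bag of words (term -> count) per document.
    return {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}


def cosine(a, b):
    # Cosine similarity between two bags of words.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, index, top_k=3):
    # Online step: score every indexed document against the query,
    # drop documents with no term in common, and rank by similarity.
    q = Counter(tokenize(query))
    scored = [(doc_id, cosine(q, bag)) for doc_id, bag in index.items()]
    scored = [(doc_id, score) for doc_id, score in scored if score > 0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy collection, invented for the example.
docs = {
    "d1": "context aware information retrieval",
    "d2": "question answering systems",
    "d3": "retrieval of information on the web",
}
index = index_documents(docs)
results = search("information retrieval", index)
```

Note that nothing in `search` depends on who is asking or why: the same scoring is applied to every query, which is the limitation the project sets out to address.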
The assumption of the CAAS project is that context can improve the performance of IRS by making explicit certain elements of the retrieval process. Context refers here to tacit or explicit knowledge about the intentions of users, the environment of users, and the system itself.
The fundamental scientific issues are:
- Mastering the variety of contexts: To address this issue, we will have to define models that can represent the various aspects of context in IR. Some of these aspects have recently been studied within IR, but no model yet integrates all of them. Research on how to actually represent context is only beginning. Much work remains before the community can propose a nomenclature of these aspects, and the effort to model each of them must continue. We also need to study the variety of processing methods and how well they suit the variety of contexts.
- Learning contexts: Modelling context is not an end in itself. The system must be able to decide which techniques are most adequate for a given context, i.e. to adapt the IR methods to the context.
- Recognizing a context: when a context is encountered, the system has to identify it among the learnt contexts in order to decide which method to apply.
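The last two issues can be pictured with a small sketch. It assumes, purely for illustration, that each learnt context is summarised by a centroid in a feature space and is associated with the method judged most adequate for it, and that recognition is done by nearest centroid. The context names, feature values and method names below are placeholders, not project results.

```python
import math

# Hypothetical learnt contexts. Features are assumed to be
# (number of query terms, number of proper nouns in the query).
LEARNT_CONTEXTS = {
    "short_entity_query": {"centroid": (2.0, 1.0), "method": "relevance_feedback"},
    "long_descriptive_query": {"centroid": (9.0, 0.0), "method": "baseline_ranking"},
}


def recognize_context(features, contexts=LEARNT_CONTEXTS):
    # Recognition: pick the learnt context whose centroid is closest
    # to the observed features (Euclidean distance).
    return min(
        contexts,
        key=lambda name: math.dist(features, contexts[name]["centroid"]),
    )


def select_method(features, contexts=LEARNT_CONTEXTS):
    # Adaptation: apply the method associated with the recognized context.
    return contexts[recognize_context(features, contexts)]["method"]
```

For instance, `select_method((3, 1))` returns the method attached to the closest centroid, here the placeholder `"relevance_feedback"`. Any real model would need richer features and a learnt, validated mapping from contexts to methods.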
To tackle these challenges, CAAS will consider the various aspects that may impact the IR process, first as independently as possible and then taking cross-effects into account. We will focus on the following contextual elements:
- the users’ expectations and queries
- the documents
- the system components
For each of them, we will consider various collections and characterise them (defining features and extracting these features), then analyse them in depth with the aim of extracting models and behaviours. Once each contextual element has been analysed, we will consider the cross-effects. For example, one result could be that query reformulation using relevance feedback is useful when the query contains proper nouns.
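To make the notion of defining and extracting features concrete for queries, here is a minimal sketch. The two features and the proper-noun heuristic (a capitalised token that is not the first word) are assumptions chosen for brevity; a real analysis would rely on linguistic tools such as part-of-speech tagging.

```python
def query_features(query):
    # Split the query into whitespace-separated tokens.
    tokens = query.split()
    # Crude proxy for proper nouns: capitalised tokens after the first word.
    # This misses proper nouns in first position and lowercased queries.
    capitalised = [t for t in tokens[1:] if t[:1].isupper()]
    return {
        "num_terms": len(tokens),
        "has_proper_noun": bool(capitalised),
    }
```

Features such as `has_proper_noun` could then be crossed with the effectiveness of each system component to look for cross-effects of the kind given in the example above.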
We will consider both benchmark collections from international programs and more realistic collections from companies.
The analysis and extraction of models is the core of the project. However, we also aim at developing modules based on our findings. These modules will be integrated into IR platforms so that they can be reused as components of complete IR systems. Because analysis and modelling are the core of the project, the partners are all academic. This does not mean that companies are left out; they will be involved in several ways. First, we have contacted one major web search engine, which will provide us with query logs; we have also contacted smaller companies, which will likewise provide query logs for use during the project. Companies will also be involved in the dissemination activities: we will contact different companies to present our findings and will suggest either customizing the developed modules for them or transferring the technologies. For example, one application is to suggest ads to be associated with users’ queries on a web site.
To tackle the challenges, the consortium is composed of two computer science institutes, both specialists in IR but with complementary skills. LIA (Laboratoire Informatique Avignon) works on question answering, while IRIT (Institut de Recherche en Informatique de Toulouse) specialises in ad hoc retrieval and novelty detection. IRIT works in close relation with IMT (Institut de Mathématique de Toulouse) and, for this project, with its Statistique et Probabilité group. Although IMT does not appear as a partner, it will take part in this project: IRIT and IMT are partners in the Plan Pluri-Formation FREMIT, which aims at developing collaborative work. CLLE (Cognition, Langues, Langage, Ergonomie) is a partner in this project because of its linguistic skills and its past work in IR (Mothe and Tanguy, 2005) and natural language processing.