Paper 3 – TLDKS Journal

Optimizing Data Coverage and Significance in Multiple Hypothesis Testing on User Groups

Authors: Nassim Bouarour, Idir Benouaret, Sihem Amer-Yahia

Volume 51 (2022) Special Edition

Abstract

We tackle the question of checking hypotheses on user data. In particular, we address the challenges that arise in the context of testing an input hypothesis on many data samples, in our case, user groups. This is referred to as Multiple Hypothesis Testing, a method of choice for data-driven discoveries. Ensuring sound discoveries in large datasets poses two challenges: the likelihood of accepting a hypothesis by chance, i.e., returning false discoveries, and the pitfall of not being representative of the input (data coverage). We develop GROUPTEST, a framework for group testing that addresses both challenges. We formulate VALMIN and COVMAX, two generic top-n problems that seek n user groups satisfying one-sample, two-sample, or multiple-sample tests. VALMIN optimizes significance while setting a constraint on data coverage and COVMAX aims to maximize data coverage while controlling significance. We show the hardness of VALMIN and COVMAX. We develop a greedy algorithm to solve the former problem and two algorithms to solve the latter where the first one is a greedy algorithm with a provable approximation guarantee and the second one is a heuristic-based algorithm based on α-investing. Our extensive experiments on real-world datasets demonstrate the necessity to optimize coverage for sound discoveries on large datasets, and the efficiency of our algorithms.

Keywords: Hypothesis testing, Data coverage, Exploratory data analysis.