When Less is More? Entropy-Based Subsampling for Big Data
Prof. Sujit Ghosh, PhD
In many real-life situations, not all information carries equal importance. The variation and diversity among instances within a dataset are what make it meaningful and valuable. In information theory, entropy quantifies the uncertainty and amount of information in a dataset, highlighting the importance of diverse and informative instances. In practical terms, reducing the full dataset to a smaller, informative subset and conducting the same analysis can often yield similar estimation efficiency. The key challenge with such subsampling techniques, however, lies in striking the right balance between reducing computational cost and limiting the potential loss of efficiency. For multiple linear regression models, traditional subsampling techniques such as leveraging methods have measured information loss based only on the covariates (the design matrix), excluding the responses, and thus often sacrifice statistical estimation accuracy. Rather than merely retaining a subsample, the problem is viewed here as extracting a subset that is representative of the full data in terms of entropy. Two methods are presented: (i) a Likelihood-based Optimal Subsample Selection (LBOSS) given a desired subsample size; and (ii) a Bayesian-based Optimal Subsample Selection (BBOSS), which automatically determines the subsample size using the posterior distribution. The proposed entropy-based criteria not only provide a better measure of the information loss due to subsampling but are also applicable to any likelihood-based estimation method. In addition to theoretical guarantees for the proposed methods, we provide extensive numerical illustrations comparing them with some recently published methods in the literature. The proposed methods are shown to perform favorably in terms of both computational efficiency and statistical accuracy.
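As a deliberately simplified illustration of the general flavor of likelihood-based subsampling (not the LBOSS or BBOSS algorithms themselves, whose details the abstract does not give), the sketch below scores each observation of a Gaussian linear regression by its information content, -log f(y_i | x_i), under a cheap pilot fit and keeps the most informative (least well-explained) points before refitting. The function names, the pilot-fit heuristic, and the choice to rank by log-likelihood are all assumptions made for illustration.

    import numpy as np

    def pilot_loglik_scores(X, y, pilot_frac=0.05, seed=None):
        # Hypothetical helper: fit OLS on a small random pilot subsample,
        # then evaluate every observation's Gaussian log-density under it.
        rng = np.random.default_rng(seed)
        n, p = X.shape
        m = max(int(pilot_frac * n), p + 2)
        pilot = rng.choice(n, size=m, replace=False)
        beta, *_ = np.linalg.lstsq(X[pilot], y[pilot], rcond=None)
        resid = y[pilot] - X[pilot] @ beta
        sigma2 = resid @ resid / m
        r = y - X @ beta
        return -0.5 * (np.log(2.0 * np.pi * sigma2) + r**2 / sigma2)

    def select_subsample(X, y, k, seed=None):
        # Keep the k observations with the lowest log-likelihood, i.e. the
        # most "surprising" (highest information content) under the pilot fit.
        scores = pilot_loglik_scores(X, y, seed=seed)
        return np.argsort(scores)[:k]

    # Usage: select a subsample of size k, then refit on it alone.
    rng = np.random.default_rng(0)
    n, p = 100_000, 5
    X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
    y = X @ rng.standard_normal(p) + rng.standard_normal(n)
    idx = select_subsample(X, y, k=2_000, seed=1)
    beta_sub, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)

Ranking by -log f(y_i | x_i) is one natural entropy-motivated heuristic that, unlike leverage scores, uses the responses as well as the design matrix; its obvious weaknesses (e.g., outliers dominating the selection, no principled choice of k) are precisely the kinds of issues that criteria with theoretical guarantees, such as the proposed LBOSS and BBOSS, are designed to address.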