When Less is More? Entropy-Based Subsampling for Big Data
Prof. Sujit Ghosh, PhD
In many real-life situations, not all information carries equal importance. The variation and diversity among instances within a dataset are what make it meaningful and valuable. In information theory, entropy quantifies the uncertainty and amount of information in a dataset, highlighting the importance of diverse and informative instances. In practical terms, reducing the full dataset to a smaller, informative subset and conducting the same analysis can often yield similar estimation efficiency. The key challenge with such subsampling techniques, however, lies in striking the right balance between reducing computational cost and limiting the potential loss of efficiency. For multiple linear regression models, traditional subsampling techniques such as leveraging methods have measured information loss based only on the covariates (the design matrix), excluding the responses, and thus often sacrifice statistical estimation accuracy. Rather than merely retaining a subsample, the problem is viewed here as extracting a subset that is representative of the full data in terms of entropy. Two methods are presented: (i) a Likelihood-based Optimal Subsample Selection (LBOSS) given a desired subsample size; and (ii) a Bayesian-based Optimal Subsample Selection (BBOSS), which automatically determines the subsample size using the posterior distribution. The proposed entropy-based criteria not only provide a better measure of the information loss due to subsampling but are also applicable to any likelihood-based estimation method. In addition to theoretical guarantees for the proposed methods, we provide extensive numerical illustrations comparing them with some recently published methods in the literature. The proposed methods are shown to perform favorably in terms of both computational efficiency and statistical accuracy.
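As a deliberately simplified illustration of the general flavor of likelihood-based subsampling (not the LBOSS or BBOSS algorithms themselves, whose details the abstract does not give), the sketch below scores each observation of a Gaussian linear regression by its information content, -log f(y_i | x_i), under a cheap pilot fit and keeps the most informative (least well-explained) points before refitting. The function names, the pilot-fit heuristic, and the choice to rank by log-likelihood are all assumptions made for illustration.

    import numpy as np

    def pilot_loglik_scores(X, y, pilot_frac=0.05, seed=None):
        # Hypothetical helper: fit OLS on a small random pilot subsample,
        # then evaluate every observation's Gaussian log-density under it.
        rng = np.random.default_rng(seed)
        n, p = X.shape
        m = max(int(pilot_frac * n), p + 2)
        pilot = rng.choice(n, size=m, replace=False)
        beta, *_ = np.linalg.lstsq(X[pilot], y[pilot], rcond=None)
        resid = y[pilot] - X[pilot] @ beta
        sigma2 = resid @ resid / m
        r = y - X @ beta
        return -0.5 * (np.log(2.0 * np.pi * sigma2) + r**2 / sigma2)

    def select_subsample(X, y, k, seed=None):
        # Keep the k observations with the lowest log-likelihood, i.e. the
        # most "surprising" (highest information content) under the pilot fit.
        scores = pilot_loglik_scores(X, y, seed=seed)
        return np.argsort(scores)[:k]

    # Usage: select a subsample of size k, then refit on it alone.
    rng = np.random.default_rng(0)
    n, p = 100_000, 5
    X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
    y = X @ rng.standard_normal(p) + rng.standard_normal(n)
    idx = select_subsample(X, y, k=2_000, seed=1)
    beta_sub, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)

Ranking by -log f(y_i | x_i) is one natural entropy-motivated heuristic that, unlike leverage scores, uses the responses as well as the design matrix; its obvious weaknesses (e.g., outliers dominating the selection, no principled choice of k) are precisely the kinds of issues that criteria with theoretical guarantees, such as the proposed LBOSS and BBOSS, are designed to address.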