When Less is More? Entropy Based Subsampling for Big Data - presented by Prof. Sujit Ghosh PhD

Prof. Sujit Ghosh PhD
North Carolina State University

Associated Journal of Statistical Theory and Practice article

Q. Sui and S. K. Ghosh (2024) Entropy-Based Subsampling Methods for Big Data. Journal of Statistical Theory and Practice
Article of record

In many real-life situations, not all information carries equal importance or significance. The variation and diversity among instances within a dataset contribute to its meaningfulness and value. In information theory, entropy captures the uncertainty and amount of information in a dataset, highlighting the importance of diverse and informative instances. In practical terms, reducing the full dataset to a smaller, informative subset and conducting the same analysis can often yield similar estimation efficiency. The key challenge with such subsampling techniques, however, lies in striking the right balance between reducing computational cost and limiting the potential loss in efficiency. For multiple linear regression models, traditional subsampling techniques such as leveraging methods have measured information loss based only on the covariates (design matrix), excluding the responses, and thus often sacrifice statistical estimation accuracy. Rather than merely retaining a subsample, the problem is viewed here as extracting a subset that is representative of the full data in terms of entropy. Two methods are presented: (i) a Likelihood-Based Optimal Subsample Selection (LBOSS), given a desired subsample size; and (ii) a Bayesian-Based Optimal Subsample Selection (BBOSS), which automatically determines the subsample size using the posterior distribution. The proposed entropy-based criteria not only provide a better measure of the information loss due to subsampling but are also applicable to any likelihood-based estimation method. In addition to theoretical guarantees for the proposed methods, we provide extensive numerical illustrations comparing them with some recently published methods in the literature. In terms of computational efficiency and statistical accuracy, the proposed methods are shown to perform relatively better.
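To make the likelihood-based selection idea concrete, the following is a minimal, illustrative sketch (not the LBOSS algorithm itself): fit a pilot linear model on a small uniform subsample, score every observation by its negative log-likelihood under that pilot fit, and keep the highest-scoring (most informative) points for the final fit. The scoring rule and all sizes here are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "big" dataset for a linear model y = X @ beta + noise.
n, p, m = 100_000, 5, 1_000           # full size, covariates, subsample size
X = rng.normal(size=(n, p))
beta = np.arange(1, p + 1, dtype=float)
y = X @ beta + rng.normal(size=n)

# Pilot fit on a small uniform subsample to get rough parameter estimates.
pilot = rng.choice(n, size=2_000, replace=False)
beta_hat, *_ = np.linalg.lstsq(X[pilot], y[pilot], rcond=None)
sigma2 = np.mean((y[pilot] - X[pilot] @ beta_hat) ** 2)

# Score each observation by its negative log-likelihood under the pilot fit;
# high values flag surprising, information-rich points. This scoring proxy is
# an illustrative assumption, not the entropy criterion from the article.
nll = 0.5 * np.log(2 * np.pi * sigma2) + (y - X @ beta_hat) ** 2 / (2 * sigma2)
keep = np.argsort(nll)[-m:]           # retain the m highest-scoring points

# Refit using only the selected subsample (1% of the full data).
beta_sub, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
print(np.round(beta_sub, 2))
```

Because the selection above depends on the noise magnitude rather than the covariates, the refit on 1% of the data still recovers the coefficients closely; the actual LBOSS/BBOSS criteria formalize this trade-off between subsample size and information loss via entropy.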

References
  1. Q. Sui and S. K. Ghosh (2024) Entropy-Based Subsampling Methods for Big Data. Journal of Statistical Theory and Practice.
  2. Q. Sui and S. K. Ghosh (2024) Similarity-Based Active Learning Methods. Expert Systems with Applications.
  3. Q. Sui and S. K. Ghosh (2024) Active Learning for Stacking and AdaBoost-Related Models. Stats.
Journal of Statistical Theory and Practice Webinars
Cite as
S. Ghosh (2024, July 24), When Less is More? Entropy Based Subsampling for Big Data
Details
Listed seminar: This seminar is open to all
Recorded: Available to all
Video length: 53:18
Disclaimer: The views expressed in this seminar are those of the speaker and not necessarily those of the journal.