Penalized weighted methods for robust offline and online learning

UNCG Author/Contributor (non-UNCG co-authors, if there are any, appear on document)
Mingyan Li (Creator)
Institution
The University of North Carolina at Greensboro (UNCG)
Web Site: http://library.uncg.edu/
Advisor
Haimeng Zhang

Abstract: Data contamination is a prevalent issue in real-life data sets, with approximately 10% of observations being affected, as noted by Hampel et al. (1986) [32]. Such contamination violates the assumptions underlying many existing machine learning algorithms. In this dissertation, we address this challenge by employing a penalized weighted method to enhance Stochastic Gradient Descent (SGD) and Random Forest (RF) models for regression analysis, particularly when mean-shift contamination is present in the data set. The penalized weighted method assigns an individual weight to each observation in the training data set and applies a Lasso-like penalty to these weights. The weights, ranging from 0 to 1, govern the contribution of each training observation to the estimation of model parameters or the prediction of response variables. We present a novel approach, Penalized Weighted Stochastic Gradient Descent (PWSGD), designed for simultaneous outlier detection and accurate parameter estimation in regression problems. Furthermore, we introduce the Penalized Weighted Random Forest (PWRF) method, which adapts the RF model to make it more robust against systematic or trend contamination in the training set. Both methods assess the impact of contamination based on the squared residual of each training observation, which provides flexibility in handling contamination of unknown form. Numerical experiments and real data analysis indicate that the proposed methods perform competitively, yielding results comparable to or better than those of benchmark methods.
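
The weighting idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the dissertation: the per-observation objective w * r^2 + lam * |log w| with w in (0, 1], the resulting closed-form weight min(1, lam / r^2), the toy data, and the names penalized_weight and weighted_sgd. It shows only how weights driven by squared residuals can shrink the influence of mean-shifted observations during SGD for a simple linear regression.

```python
import random

def penalized_weight(residual, lam):
    """Weight in (0, 1] minimizing w * r**2 + lam * |log w| (assumed objective).

    Small squared residuals keep full weight 1; large ones are shrunk to lam / r**2.
    """
    r2 = residual * residual
    return 1.0 if r2 <= lam else lam / r2

def weighted_sgd(xs, ys, lam, lr=0.05, epochs=300, seed=0):
    """Fit y ~ slope * x + intercept by SGD, scaling each step by the weight."""
    rng = random.Random(seed)
    slope, intercept = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            r = ys[i] - (slope * xs[i] + intercept)
            w = penalized_weight(r, lam)
            # Weighted squared-error gradient step (constant factor folded into lr).
            slope += lr * w * r * xs[i]
            intercept += lr * w * r
    weights = [penalized_weight(y - (slope * x + intercept), lam)
               for x, y in zip(xs, ys)]
    return slope, intercept, weights

# Toy data: y = 2x + 1 plus small noise, with 10% of responses shifted by +10.
data_rng = random.Random(1)
xs = [i / 49 for i in range(50)]
ys = [2.0 * x + 1.0 + data_rng.gauss(0.0, 0.1) for x in xs]
outliers = [i for i in range(50) if i % 10 == 5]
for i in outliers:
    ys[i] += 10.0  # mean-shift contamination

slope, intercept, weights = weighted_sgd(xs, ys, lam=1.0)
flagged = [i for i, w in enumerate(weights) if w < 0.5]  # candidate outliers
print(slope, intercept, flagged)
```

In this sketch, observations whose final weight falls well below 1 are the ones flagged as potential outliers, which mirrors the simultaneous detection and estimation goal stated in the abstract. The threshold 0.5 and the value lam=1.0 are arbitrary illustrative choices.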

Additional Information

Publication
Dissertation
Language: English
Date: 2024
Keywords
Lasso, Machine Learning, Outlier Detection, Random Forest, Robust regression, Stochastic Gradient Descent
Subjects
Big data
Machine learning
Regression analysis
Robust statistics
