Penalized weighted methods for robust offline and online learning
- UNCG Author/Contributor (non-UNCG co-authors, if there are any, appear on document)
- Mingyan Li (Creator)
- Institution
- The University of North Carolina at Greensboro (UNCG)
- Web Site: http://library.uncg.edu/
- Advisor
- Haimeng Zhang
Abstract: Data contamination is prevalent in real-life data sets, with approximately 10% of observations being affected, as noted by Hampel et al. in 1986 [32]. Such contamination violates the assumptions underlying existing machine learning algorithms. In this dissertation, we address this challenge by employing a penalized weighted method to enhance Stochastic Gradient Descent (SGD) and Random Forest (RF) models for regression analysis, particularly when the data set contains mean-shift contamination. The penalized weighted method assigns an individual weight to each observation in the training data set and applies a Lasso-like penalty to these weights. The weights, which range from 0 to 1, govern how much each training observation contributes to the estimation of model parameters or the prediction of response variables. We present a novel approach, Penalized Weighted Stochastic Gradient Descent (PWSGD), designed for simultaneous outlier detection and accurate parameter estimation in regression problems. Furthermore, we introduce the Penalized Weighted Random Forest (PWRF) method, which adapts the RF model to make it more robust against systematic or trend contamination in the training set. Both methods assess the impact of contamination in the training set from the squared residual of each training observation, which provides flexibility in handling contamination of unknown form. In numerical experiments and real data analysis, the proposed methods perform competitively, matching or outperforming the benchmark methods.
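The abstract does not state the exact objective function, so the following is only a minimal sketch of how a residual-based penalized weight could be combined with SGD for linear regression. It assumes a per-observation loss of the form w^2 * r^2 + lam * |log w|, which is one known penalized weighted least squares formulation and not necessarily the one used in the dissertation. The names `penalized_weight`, `weighted_sgd`, `lam`, `lr`, and `epochs`, as well as the simulated data, are hypothetical and chosen for illustration.

```python
import numpy as np

def penalized_weight(residual, lam):
    """Weight in (0, 1] that minimizes w**2 * r**2 + lam * |log w| for fixed r.

    Assumed penalty form (not taken from the abstract). Setting the
    derivative 2*w*r**2 - lam/w to zero gives w = sqrt(lam/2) / |r|,
    which is then capped at 1.
    """
    tau = np.sqrt(lam / 2.0)
    return np.minimum(1.0, tau / np.maximum(np.abs(residual), 1e-12))

def weighted_sgd(X, y, lam=2.0, lr=0.01, epochs=50, seed=0):
    """Linear regression by SGD where each step is scaled by w**2."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])  # prepend an intercept column
    beta = np.zeros(p + 1)
    for _ in range(epochs):
        for i in rng.permutation(n):
            r = y[i] - Xb[i] @ beta
            w = penalized_weight(r, lam)
            # Gradient of w**2 * r**2 with respect to beta is -2 * w**2 * r * x_i
            beta += lr * 2.0 * w**2 * r * Xb[i]
    # Final weights: small values flag observations with large residuals
    weights = penalized_weight(y - Xb @ beta, lam)
    return beta, weights

# Simulated data with 10% mean-shift contamination (illustrative only)
rng = np.random.default_rng(42)
n = 500
X = rng.normal(size=(n, 2))
true_beta = np.array([0.5, 2.0, -1.0])  # intercept, slope 1, slope 2
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n)
outliers = np.zeros(n, dtype=bool)
outliers[:50] = True
y[outliers] += 10.0  # shift the response of the contaminated observations

beta_hat, weights = weighted_sgd(X, y)
print("estimated coefficients:", np.round(beta_hat, 3))
print("mean weight, contaminated:", round(float(weights[outliers].mean()), 3))
print("mean weight, clean:", round(float(weights[~outliers].mean()), 3))
```

Under this assumed penalty, observations with small residuals keep a weight of 1 while observations with large residuals are downweighted in proportion to 1/|r|, so the same quantity serves both the parameter update and outlier flagging.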
Created on 5/1/2024
Additional Information
- Publication
- Dissertation
- Language: English
- Date: 2024
- Keywords
- Lasso, Machine Learning, Outlier Detection, Random Forest, Robust regression, Stochastic Gradient Descent
- Subjects
- Big data
- Machine learning
- Regression analysis
- Robust statistics