Modeling of class imbalance using an empirical approach with spambase dataset and random forest classification
- UNCG Author/Contributor (non-UNCG co-authors, if there are any, appear on document)
- Shanmugatha "Shan" Suthaharan, Associate Professor (Creator)
- Institution
- The University of North Carolina at Greensboro (UNCG)
- Web Site: http://library.uncg.edu/
Abstract: Classification of imbalanced data is an important research problem because most data encountered in real-world systems are imbalanced. Recently, a representation learning technique called Synthetic Minority Over-sampling Technique (SMOTE) has been proposed to handle the imbalanced data problem. The Random Forest (RF) algorithm with SMOTE has previously been used to improve classification performance on the minority class over the majority class. Although RF with SMOTE demonstrates improved classification performance, the relationship between classification performance and the imbalance ratio between the majority and minority classes is not well defined. Therefore, mathematical models that describe this relationship are useful, especially in the big data environment, which suffers from imbalanced data. In this paper, we propose a mathematical model using an empirical approach applied to the well-known Spambase dataset and the Random Forest classification approach, including its use with the SMOTE representation learning technique. We present a linear model that describes the relationship between the true positive classification rate and the imbalance ratio between the majority and minority classes. This model can help IT researchers develop better spam filter algorithms.
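The kind of linear model the abstract describes can be illustrated with a minimal Python sketch. It is not the paper's procedure or its fitted model: the function names are hypothetical, the imbalance ratio is assumed here to mean the majority count divided by the minority count, and the (ratio, true positive rate) points are made-up placeholders rather than results from Spambase or Random Forest. In practice each point would come from training a classifier on data resampled to a given ratio and measuring its true positive rate.

```python
from statistics import mean


def imbalance_ratio(labels, minority=1):
    """Majority count divided by minority count (assumed definition)."""
    n_min = sum(1 for y in labels if y == minority)
    n_maj = len(labels) - n_min
    if n_min == 0:
        raise ValueError("no minority-class examples")
    return n_maj / n_min


def true_positive_rate(y_true, y_pred, positive=1):
    """TP / (TP + FN) for the class treated as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("no positive examples in y_true")
    return tp / (tp + fn)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


if __name__ == "__main__":
    # Placeholder points for illustration only; NOT measurements from the paper.
    ratios = [1.0, 2.0, 3.0, 4.0]
    tprs = [0.9, 0.8, 0.7, 0.6]
    slope, intercept = fit_line(ratios, tprs)
    print(f"TPR ~ {slope:.3f} * ratio + {intercept:.3f}")
```

The fitted slope and intercept then summarize how the true positive rate changes as the classes become more imbalanced, which is the relationship the abstract says the model captures.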
Created on 12/17/2018
Additional Information
- Publication
- Proceedings of the ACM RIIT 2014, pp. 75-80, doi: 10.1145/2656434.2656442
- Language: English
- Date: 2014
- Keywords
- Random forest, SMOTE, Imbalanced data, Classification, Machine learning