Modeling of class imbalance using an empirical approach with spambase dataset and random forest classification

UNCG Author/Contributor (non-UNCG co-authors, if there are any, appear on document)
Shanmugatha "Shan" Suthaharan, Associate Professor (Creator)
Institution
The University of North Carolina at Greensboro (UNCG)
Web Site: http://library.uncg.edu/

Abstract: Classification of imbalanced data is an important research problem because most of the data encountered in real-world systems is imbalanced. Recently, a representation learning technique called the Synthetic Minority Over-sampling Technique (SMOTE) has been proposed to handle the imbalanced data problem. The Random Forest (RF) algorithm with SMOTE has previously been used to improve classification performance in the minority class over the majority class. Although RF with SMOTE demonstrates improved classification performance, the relationship between the classification performance and the imbalanced ratio between the majority and minority classes is not well defined. Therefore, mathematical models that describe this relationship are useful, especially in the big data environment, which suffers from imbalanced data. In this paper, we propose a mathematical model using an empirical approach applied to the well-known Spambase dataset and the Random Forest classification approach, including its adoption with the SMOTE representation learning technique. We present a linear model that describes the relationship between the true positive classification rate and the imbalanced ratio between the majority and minority classes. This model can help IT researchers develop better spam filter algorithms.
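
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the helper names (imbalance_ratio, true_positive_rate, fit_line) are hypothetical, the ratio is assumed to be the majority count divided by the minority count, and the linear fit is assumed to be ordinary least squares. No coefficients from the paper are reproduced here.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count.

    Assumption: the abstract does not state the direction of the ratio.
    """
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def true_positive_rate(y_true, y_pred, positive=1):
    """Fraction of actual positives that the classifier labeled positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp / (tp + fn)


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept.

    Assumption: the abstract only says the model is linear, not how it was fit.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x
```

In an empirical study of this kind, one would presumably train a classifier at several imbalance ratios, record the true positive rate at each, and pass the two lists to fit_line to estimate the slope and intercept.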

Additional Information

Publication
Proceedings of the ACM RIIT 2014, pp. 75-80, doi: 10.1145/2656434.2656442
Language: English
Date: 2014
Keywords
Random forest, SMOTE, Imbalanced data, Classification, Machine learning
