Investigating Heart Disease Datasets and Building Predictive Models

ECSU Author/Contributor (non-ECSU co-authors, if there are any, appear on document)
Brandon Simmons, student (Creator)
Julian A. D. Allagan, Associate Professor (Contributor)
Elizabeth City State University (ECSU)

Abstract: We investigate several heart disease datasets commonly found on popular data sites such as Kaggle, Dataport, and the UCI Machine Learning Repository. In attempting to authenticate these medical datasets, we discovered many issues stemming from human error (encoding) and sometimes negligence (duplicates); these underlying issues have undoubtedly weakened many inferences and predictive models built on some of the datasets that are already published. We addressed these issues through feature analysis. Further, using random forest and logistic regression, we determined the best dataset for machine learning and statistical analysis: the Cleveland data on a reduced set of six features. Three of these features are statistically significant in explaining or classifying patients as 'Heart Disease': thalach (maximum heart rate), oldpeak, and cp (chest pain).
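
The duplicate-record issue mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure; it only shows one simple way exact repeated rows might be detected and removed before modeling. The rows below are made-up toy values, the column names thalach, oldpeak, and cp follow the abstract, and the "target" column and both helper functions are hypothetical names introduced for illustration.

```python
from collections import Counter

# Toy rows for illustration only; these values are made up, not taken from any dataset.
# thalach, oldpeak, and cp are named in the abstract; "target" is a hypothetical label column.
COLUMNS = ("thalach", "oldpeak", "cp", "target")
ROWS = [
    (150, 2.3, 3, 1),
    (172, 0.0, 2, 0),
    (150, 2.3, 3, 1),  # exact repeat of the first row
    (130, 1.4, 4, 1),
]


def find_duplicates(rows):
    """Return a dict mapping each row that occurs more than once to its count."""
    counts = Counter(rows)
    return {row: n for row, n in counts.items() if n > 1}


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


duplicates = find_duplicates(ROWS)
cleaned = drop_duplicates(ROWS)
print(f"{len(ROWS)} rows, {len(cleaned)} unique, {len(duplicates)} repeated")
```

Exact-match comparison is the simplest choice here; it would miss near-duplicates that differ only through encoding inconsistencies, which would need to be normalized first.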

Additional Information

Language: English
Date: 2021
Keywords: heart disease, medical datasets, machine learning, Kaggle, Dataport
