The analysis is centered around the prediction of presence of cardiovascular disease, given data about patients 1. The data includes different categorical and numerical variables commonly associated to an individual’s health status. After a brief data exploration, we perform first cluster analysis on the dataset and then fit different types of models: Trees and Random Forest, Logistic Regression and Natural Splines. Finally, we discuss some issues with the data, considering its unknown origin and unspecified gathering methods, and hint at how this work could possibly be improved.
More specifically, the following steps are performed:
- Data exploration
- Preprocessing
- Distribution analysis
- Correlation analysis
- K-Means Clustering
- Trees & Random Forest
- Logistic Regression
- Base Model
- Models with interection between variables
- Polynomial Regressions
- LASSO Regression
- Natural Splines
- Results overview