Machine Learning - Prediction of CHD Risk with R
Data science project in R to create a predictive model of the CHD risk, based on supervised learning.
Paper | Analysis | Code |
Abstract
In the current era in which we live, there is a clear and irreversible tendency to generate and store large volumes of information, from various sources such as: government agencies, public and private companies, clinics and hospitals, social networks, etc. Hence the great need to analyze the data in order to obtain some benefit for their owner, a third party or humanity in general. With this in mind, we conducted a descriptive and predictive analysis of public medical data of South Africa on patients with possible risk of presenting coronary heart disease (CHD), and applying advanced techniques of supervised machine learning and models calibration, we were able to determine when a person has high probabilities (close to 70%) of presenting or developing this disease, with the objective of being able to contribute to an early detection and diagnosis of it, for further treatment. Hopeful and convincing results were obtained, which can be improved if there is a greater amount of source data from which to learn.
Data
The dataset used is a sample of 462 records of a larger dataset, described in Rousseauw et al, 1983, South African Medical Journal, belonging to a non-profit organization called South African Heart Association (SAHA).
# | Variable | Type | Description | Data Type |
---|---|---|---|---|
1 | sbp | Input | Systolic Blood Pressure | Numerical |
2 | tobacco | Input | Cumulative Tobacco (kg) | Numerical |
3 | ldl | Input | Low Density Lipoprotein Cholesterol | Numerical |
4 | adiposity | Input | Adiposity | Numerical |
5 | famhist | Input | Family history of heart disease (Present, Absent) | Categorical |
6 | typea | Input | Type-A behavior | Numerical |
7 | obesity | Input | Obesity | Numerical |
8 | alcohol | Input | Current alcohol consumption | Numerical |
9 | age | Input | Age at onset | Numerical |
10 | chd | Target | Coronary heart disease (Yes, No) | Categorical |
Results
Based on the following results, the Naive Bayes model was selected as the best predictor for the problem we are solving, because it has an excellent Global Average Error of 28.77% and is the one with the highest YES detection index (key indicator), with 61.88%
Model | % Global Error | % Yes Detected | % No Detected |
---|---|---|---|
SVM | 28.71 | 50.00 | 82.45 |
Naive Bayes | 28.77 | 61.88 | 76.16 |
Decision Tree | 31.22 | 48.75 | 79.47 |
Random Forest | 31.77 | 41.88 | 82.12 |
K-NN | 33.08 | 41.88 | 80.46 |
Base Line | 34.63 | NA | NA |
Ada Boost | 34.95 | 43.13 | 76.82 |
Technologies and Techniques
- R 3.5.1 x64
- RStudio - Version 1.1.383
- Descriptive Data Analysis
- Supervised Learning (SL)
R Dependencies
library(e1071)
library(kknn)
library(MASS)
library(class)
library(rpart)
library(randomForest)
library(ada)
library(caret)
library(FactoMineR)
If you need to install a package, use the following command in the R console:
install.packages("package-name", dependencies=TRUE)
Contributing and Feedback
Any kind of feedback/criticism would be greatly appreciated (algorithm design, documentation, improvement ideas, spelling mistakes, etc…).
Authors
- Created by Segura Tinoco, Andrés and Orozco Cacique, Johana
- Created on July 7, 2017
License
This project is licensed under the terms of the MIT license.