Sunday, February 3, 2019

Machine Learning based Diabetes Prediction Software (with source code) | Codeing School



Abstract
Diabetes is one of the deadliest chronic diseases; it causes a rise in blood sugar, and many complications follow if it remains unidentified and untreated. Identification is a tedious process: the patient must visit a diagnostic centre and consult a doctor. The rise of machine learning approaches offers a way around this problem. The motive of this study is to design a model that can prognosticate the likelihood of diabetes in patients with maximum accuracy. Three machine learning classification algorithms, namely Decision Tree, SVM and Naive Bayes, are therefore used in this experiment to detect diabetes at an early stage. Experiments are performed on the Pima Indians Diabetes Database (PIDD), which is sourced from the UCI machine learning repository. The performance of all three algorithms is evaluated on various measures such as Precision, Accuracy, F-Measure and Recall, with Accuracy measured over correctly and incorrectly classified instances. The results show that Naive Bayes outperforms the other algorithms with the highest accuracy of 76.30%. These results are verified systematically using Receiver Operating Characteristic (ROC) curves.
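
The hands-on tutorial below only walks through k-NN. As a rough sketch of how the three classifiers named in the abstract could be compared on the same data (this follows the outline of the experiment, not its exact settings; the split parameters here are our own choice), one might write:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

diabetes = pd.read_csv('diabetes.csv')
X = diabetes.drop(columns='Outcome')
y = diabetes['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=66)

# fit each classifier and report its test-set accuracy
for name, clf in [('Decision Tree', DecisionTreeClassifier(random_state=0)),
                  ('SVM', SVC()),
                  ('Naive Bayes', GaussianNB())]:
    clf.fit(X_train, y_train)
    print('{}: {:.2f}'.format(name, accuracy_score(y_test, clf.predict(X_test))))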



The Data
The diabetes data set originates from the UCI Machine Learning Repository and can be downloaded with the source code.

Requirements:
  1. Pandas, pip install pandas
  2. Scikit-learn, pip install scikit-learn
  3. Matplotlib, pip install matplotlib
  4. Numpy, pip install numpy
  5. Seaborn, pip install seaborn
  6. Pywin32, pip install pywin32 or pip install pypiwin32
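
If you prefer, all of the above can be installed with a single command (one possible invocation; use pypiwin32 instead of pywin32 if that is what works for you, as in item 6):

pip install pandas scikit-learn matplotlib numpy seaborn pywin32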


Source Code:

Watch on YouTube:

Click Here


** In the source code, "Diabetes_Prediction_Software.ipynb" and "diabetes prediction software.py" are the ready-made software. After installing the dependencies, you can run the ".ipynb" file directly in Jupyter Notebook, or open the ".py" file in any IDE.

** If you want to learn how it works, read the tutorial below and run "knn.py" (in the source code) yourself.



Tutorial:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
diabetes = pd.read_csv('diabetes.csv')
print(diabetes.columns)

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], dtype='object')

diabetes.head()


Figure 1: the first five rows of the data set (output of diabetes.head())
The diabetes data set consists of 768 data points, with 9 features each:

print("dimension of diabetes data: {}".format(diabetes.shape))

dimension of diabetes data: (768, 9)

"Outcome" is the feature we are going to predict: 0 means no diabetes, 1 means diabetes. Of these 768 data points, 500 are labeled 0 and 268 are labeled 1:

print(diabetes.groupby('Outcome').size())


Figure 2: output of diabetes.groupby('Outcome').size() (0: 500, 1: 268)
import seaborn as sns
sns.countplot(x='Outcome', data=diabetes)



Figure 3: count plot of the two Outcome classes

diabetes.info()



Figure 4: output of diabetes.info()



k-Nearest Neighbors
The k-NN algorithm is arguably the simplest machine learning algorithm. Building the model consists only of storing the training data set. To make a prediction for a new data point, the algorithm finds the closest data points in the training data set: its "nearest neighbours".
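
As a minimal from-scratch sketch of that idea (the function knn_predict below is our own illustration, not the knn.py from the source code), prediction for one new point is just a distance computation followed by a majority vote:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # "training" is only storing X_train / y_train; the work happens here
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    # Euclidean distance from the new point to every stored training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # labels of the k closest training points
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # the majority vote among those k neighbours is the prediction
    return Counter(nearest_labels).most_common(1)[0][0]

scikit-learn's KNeighborsClassifier, used below, implements the same idea with extra options such as distance weighting and faster neighbour search.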

First, let's investigate whether we can confirm the connection between model complexity and accuracy:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(diabetes.loc[:, diabetes.columns != 'Outcome'], diabetes['Outcome'], stratify=diabetes['Outcome'], random_state=66)
from sklearn.neighbors import KNeighborsClassifier
training_accuracy = []
test_accuracy = []
# try n_neighbors from 1 to 10
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    # build the model
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    # record training set accuracy
    training_accuracy.append(knn.score(X_train, y_train))
    # record test set accuracy
    test_accuracy.append(knn.score(X_test, y_test))
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()
plt.savefig('knn_compare_model')

Figure 5: training and test accuracy as a function of n_neighbors
The above plot shows the training and test set accuracy on the y-axis against the setting of n_neighbors on the x-axis. If we choose a single nearest neighbour, the prediction on the training set is perfect. But as more neighbours are considered, the training accuracy drops, indicating that the single-nearest-neighbour model is too complex. The best performance is somewhere around 9 neighbours.

The plot suggests that we should choose n_neighbors=9. Here it is:

knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'.format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'.format(knn.score(X_test, y_test)))

Accuracy of K-NN classifier on training set: 0.79
Accuracy of K-NN classifier on test set: 0.78
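
The abstract also evaluates models on Precision, Recall, F-Measure and ROC curves. Here is a small sketch of computing those same measures for this k-NN model (knn, X_test and y_test are the objects defined above):

from sklearn.metrics import classification_report, roc_auc_score, roc_curve

y_pred = knn.predict(X_test)
# precision, recall and F-measure (f1-score) per class
print(classification_report(y_test, y_pred))

# the ROC curve needs a score rather than a hard label, so we use the
# predicted probability of the positive class
y_score = knn.predict_proba(X_test)[:, 1]
print('ROC AUC: {:.2f}'.format(roc_auc_score(y_test, y_score)))
fpr, tpr, _ = roc_curve(y_test, y_score)
plt.plot(fpr, tpr)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')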
