Sunday, February 3, 2019

Machine Learning based Diabetes Prediction Software (with source code) | Codeing School



Abstract
Diabetes is one of the deadliest chronic diseases; it causes a rise in blood sugar, and many complications follow if it remains unidentified and untreated. Identification is a tedious process: the patient must visit a diagnostic centre and consult a doctor. The rise of machine learning approaches offers a way around this problem. The motive of this study is to design a model that can prognosticate the likelihood of diabetes in patients with maximum accuracy. Three machine learning classification algorithms, namely Decision Tree, SVM and Naive Bayes, are therefore used in this experiment to detect diabetes at an early stage. Experiments are performed on the Pima Indians Diabetes Database (PIDD), which is sourced from the UCI machine learning repository. The performance of all three algorithms is evaluated on various measures such as Precision, Accuracy, F-Measure and Recall, with Accuracy measured over correctly and incorrectly classified instances. The results show that Naive Bayes outperforms the other algorithms with the highest accuracy of 76.30%. These results are verified systematically using Receiver Operating Characteristic (ROC) curves.
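
The hands-on tutorial below only walks through k-NN. As a rough sketch of how the three classifiers named in the abstract could be compared on the same data (this follows the outline of the experiment, not its exact settings; the split parameters here are our own choice), one might write:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

diabetes = pd.read_csv('diabetes.csv')
X = diabetes.drop(columns='Outcome')
y = diabetes['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=66)

# fit each classifier and report its test-set accuracy
for name, clf in [('Decision Tree', DecisionTreeClassifier(random_state=0)),
                  ('SVM', SVC()),
                  ('Naive Bayes', GaussianNB())]:
    clf.fit(X_train, y_train)
    print('{}: {:.2f}'.format(name, accuracy_score(y_test, clf.predict(X_test))))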



The Data
The diabetes data set originates from the UCI Machine Learning Repository and can be downloaded with the source code.

Requirements:
  1. Pandas, pip install pandas
  2. Scikit-learn, pip install scikit-learn
  3. Matplotlib, pip install matplotlib
  4. Numpy, pip install numpy
  5. Seaborn, pip install seaborn
  6. Pywin32, pip install pywin32 or pip install pypiwin32
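
If you prefer, all of the above can be installed with a single command (one possible invocation; use pypiwin32 instead of pywin32 if that is what works for you, as in item 6):

pip install pandas scikit-learn matplotlib numpy seaborn pywin32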


Source Code:

Watch on YouTube:

Click Here


** In the source code, "Diabetes_Prediction_Software.ipynb" and "diabetes prediction software.py" are the ready-made software. After installing the dependencies, you can run the ".ipynb" file directly in Jupyter Notebook, or open the ".py" file in any IDE.

** If you want to learn how it works, read the tutorial below and run "knn.py" (in the source code) yourself.



Tutorial:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
diabetes = pd.read_csv('diabetes.csv')
print(diabetes.columns)

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], dtype='object')

diabetes.head()


Figure 1: the first five rows of the data set (output of diabetes.head())
The diabetes data set consists of 768 data points, with 9 features each:

print("dimension of diabetes data: {}".format(diabetes.shape))

dimension of diabetes data: (768, 9)

"Outcome" is the feature we are going to predict: 0 means no diabetes, 1 means diabetes. Of these 768 data points, 500 are labeled 0 and 268 are labeled 1:

print(diabetes.groupby('Outcome').size())


Figure 2: output of diabetes.groupby('Outcome').size() (0: 500, 1: 268)
import seaborn as sns
sns.countplot(x='Outcome', data=diabetes)



Figure 3: count plot of the two Outcome classes

diabetes.info()



Figure 4: output of diabetes.info()



k-Nearest Neighbors
The k-NN algorithm is arguably the simplest machine learning algorithm. Building the model consists only of storing the training data set. To make a prediction for a new data point, the algorithm finds the closest data points in the training data set: its "nearest neighbours".
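
As a minimal from-scratch sketch of that idea (the function knn_predict below is our own illustration, not the knn.py from the source code), prediction for one new point is just a distance computation followed by a majority vote:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # "training" is only storing X_train / y_train; the work happens here
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    # Euclidean distance from the new point to every stored training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # labels of the k closest training points
    nearest_labels = y_train[np.argsort(distances)[:k]]
    # the majority vote among those k neighbours is the prediction
    return Counter(nearest_labels).most_common(1)[0][0]

scikit-learn's KNeighborsClassifier, used below, implements the same idea with extra options such as distance weighting and faster neighbour search.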

First, let's investigate whether we can confirm the connection between model complexity and accuracy:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(diabetes.loc[:, diabetes.columns != 'Outcome'], diabetes['Outcome'], stratify=diabetes['Outcome'], random_state=66)
from sklearn.neighbors import KNeighborsClassifier
training_accuracy = []
test_accuracy = []
# try n_neighbors from 1 to 10
neighbors_settings = range(1, 11)
for n_neighbors in neighbors_settings:
    # build the model
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    # record training set accuracy
    training_accuracy.append(knn.score(X_train, y_train))
    # record test set accuracy
    test_accuracy.append(knn.score(X_test, y_test))
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, test_accuracy, label="test accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend()
plt.savefig('knn_compare_model')

Figure 5: training and test accuracy as a function of n_neighbors
The above plot shows the training and test set accuracy on the y-axis against the setting of n_neighbors on the x-axis. If we choose a single nearest neighbour, the prediction on the training set is perfect. But as more neighbours are considered, the training accuracy drops, indicating that the single-nearest-neighbour model is too complex. The best performance is somewhere around 9 neighbours.

The plot suggests that we should choose n_neighbors=9. Here it is:

knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, y_train)
print('Accuracy of K-NN classifier on training set: {:.2f}'.format(knn.score(X_train, y_train)))
print('Accuracy of K-NN classifier on test set: {:.2f}'.format(knn.score(X_test, y_test)))

Accuracy of K-NN classifier on training set: 0.79
Accuracy of K-NN classifier on test set: 0.78
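
The abstract also evaluates models on Precision, Recall, F-Measure and ROC curves. Here is a small sketch of computing those same measures for this k-NN model (knn, X_test and y_test are the objects defined above):

from sklearn.metrics import classification_report, roc_auc_score, roc_curve

y_pred = knn.predict(X_test)
# precision, recall and F-measure (f1-score) per class
print(classification_report(y_test, y_pred))

# the ROC curve needs a score rather than a hard label, so we use the
# predicted probability of the positive class
y_score = knn.predict_proba(X_test)[:, 1]
print('ROC AUC: {:.2f}'.format(roc_auc_score(y_test, y_score)))
fpr, tpr, _ = roc_curve(y_test, y_score)
plt.plot(fpr, tpr)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')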
