Wednesday, April 17, 2019

Linear Regression: Implementation in python from scratch



This tutorial is dedicated to explaining how the linear regression algorithm works and implementing it to make predictions using our data set.

How Does it Work?

Linear Regression is essentially just the best fit line. Given a set of data points, the algorithm will create the best fit line through those points.



This line can be defined by the equation y = m*x + b.

m is the slope: how much the y value changes for each unit increase in x.

b is the y intercept: the value of y where the line crosses the y axis.

We can determine the slope (m) of the line by picking two points on the line (p1 and p2) and using the following equation: m = (y2 - y1) / (x2 - x1)
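That two-point formula can be sketched in a few lines of Python. The points below are arbitrary example values, not taken from the tutorial's data set:

```python
# Fit a line through two example points and use it to predict.
p1 = (1.0, 3.0)   # (x1, y1)
p2 = (4.0, 9.0)   # (x2, y2)

m = (p2[1] - p1[1]) / (p2[0] - p1[0])  # slope: rise over run
b = p1[1] - m * p1[0]                  # intercept: solve y = m*x + b for b

print(m, b)        # 2.0 1.0
print(m * 10 + b)  # predict y at x = 10 -> 21.0
```

Once m and b are known, predicting for any new x is just plugging it into y = m*x + b, which is exactly what the trained model does for us later on.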



Once the computer has generated this line, it can use it to predict new values.


Downloading Our Data

In this specific tutorial, we will be implementing the linear regression algorithm to predict a student's final grade based on a series of attributes. To do this we need some data!

We are going to be using the Student Performance data set from the UCI Machine Learning Repository. You can download the data set from the repository's website.

This data set consists of 33 attributes for each student. You can see a description of each attribute in the repository. It is great that there are so many attributes, but we likely don't want to consider all of them when trying to predict a student's grade. Therefore, we will trim this data set down so we only have the attributes we need.
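As a sketch of the trimming step, here is how the column selection looks in pandas. A tiny made-up frame stands in for the real file (which the tutorial loads with `pd.read_csv("student-mat.csv", sep=";")`), and `school` is one example of an attribute we drop:

```python
import pandas as pd

# Tiny stand-in for the real student-mat.csv data.
data = pd.DataFrame({
    "G1": [14, 10], "G2": [15, 11], "G3": [15, 10],
    "studytime": [2, 1], "failures": [0, 3], "absences": [6, 4],
    "school": ["GP", "GP"],  # an attribute we don't need
})

# Keep only the attributes we care about; everything else is discarded.
data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]
print(list(data.columns))
```

Selecting with a list of column names returns a new DataFrame containing only those columns, in the order given.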

Note: The examples above are done in 2D space. In reality, most of our best fit lines will span multiple dimensions and therefore have multiple slope values, one per attribute.
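To make that note concrete, here is a small sketch of a multi-dimensional prediction. The slope and intercept values below are illustrative only, not coefficients learned from the student data:

```python
import numpy as np

# Hypothetical per-attribute slopes and intercept (made-up values).
slopes = np.array([0.15, 0.98, -0.2])   # one slope per attribute
intercept = -1.5
features = np.array([14.0, 15.0, 2.0])  # one student's attribute values

# y = m1*x1 + m2*x2 + m3*x3 + b
prediction = np.dot(slopes, features) + intercept
print(prediction)  # approximately 14.9
```

The 2D equation y = m*x + b generalizes to a dot product of a slope vector with a feature vector, plus a single intercept.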

Implementing the Algorithm

Now that we understand how linear regression works, we can use it to predict students' final grades.

We will start by defining the model which we will be using.

linear = linear_model.LinearRegression()

Next, we will train and score our model using the arrays we created in the previous tutorial.

linear.fit(x_train, y_train)
acc = linear.score(x_test, y_test) # returns the R^2 score of the model on the test data

To see how well our algorithm performed on our test data, we can print out the score.

print(acc)

For this specific data set, a score above 0.8 is fairly good.

Viewing The Constants
If we want to see the constants used to generate the line, we can type the following.

print('Coefficient: \n', linear.coef_) # One slope value per attribute
print('Intercept: \n', linear.intercept_) # This is the intercept (b)
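As a sketch of what these constants mean, the following uses toy stand-in data (not the student arrays) to show that a prediction is just the dot product of the coefficients with the features, plus the intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in data: y = 1*x1 + 2*x2 exactly, so the fit is perfect.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

linear = LinearRegression().fit(X, y)

# Rebuild the first prediction by hand from coef_ and intercept_.
manual = np.dot(linear.coef_, X[0]) + linear.intercept_
print(manual, linear.predict(X[:1])[0])  # the two values agree
```

This is why printing `coef_` and `intercept_` tells you everything about the fitted model: they fully define the best fit line (or hyperplane).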

Predicting on Specific Students
Seeing a score value is cool, but I'd like to see how well our algorithm works on specific students. To do this we are going to print out all of our test data. Next to each input we will print the actual final grade and our model's predicted grade.

predictions = linear.predict(x_test) # Gets a list of all predictions

for x in range(len(predictions)):
    print(predictions[x], x_test[x], y_test[x])

Our output should look something like this: each line shows the predicted grade, the array of input attributes, and the actual grade.




Full Code

import pandas as pd
import numpy as np
import sklearn
import sklearn.model_selection
from sklearn import linear_model
from sklearn.utils import shuffle

data = pd.read_csv("student-mat.csv", sep=";")

data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]

predict = "G3"

X = np.array(data.drop([predict], axis=1))
y = np.array(data[predict])

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1)

linear = linear_model.LinearRegression()

linear.fit(x_train, y_train)
acc = linear.score(x_test, y_test)
print(acc)

print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

predictions = linear.predict(x_test)

for x in range(len(predictions)):
    print(predictions[x], x_test[x], y_test[x])
