LinReg.py
Problem
In this exercise, you will learn about Linear Regression.
Linear Regression is the process of finding the line of best fit for a given dataset. How does your calculator do this? How can one measure how good a fit a line is?
One way is to try to minimize the difference between the y values the line predicts and the actual y values in the dataset.
For example, imagine the dataset f(x) = {(0, 0), (1, 2), (2, 5)}. The function g(x) = x + 1 is not a great fit, because g(0) = 1, but f(0) = 0. Since the predicted value is greater than the actual value, we say that there is a "cost" of -1 at x = 0 (actual minus predicted). To make all of these costs positive, so that we can sum them easily, we square each cost. If you calculate the total cost of g(x) against f(x), you should get (-1)^2 + 0^2 + 2^2 = 5.
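To make this concrete, here is the same calculation in Python (a quick illustration, not part of the distribution code):

import numpy as np

X = np.array([0, 1, 2])
Y = np.array([0, 2, 5])
m, b = 1, 1                      # g(x) = x + 1
costs = Y - (m * X + b)          # per-point costs: [-1, 0, 2]
print(np.sum(costs ** 2))        # prints 5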
After writing a function that calculates this cost, we need a function that automatically minimizes it. I will be using a method called gradient descent. Unfortunately for you, this requires some calculus, so I can't really give you a great explanation here. The one thing you need to know is that there is a learning rate "alpha". If this rate is too large, the descent will go out of control and the cost will head towards infinity. If it is too small, the program may take too long to converge. The picture below does a decent job explaining this phenomenon.
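For the curious, each step of gradient descent nudges m and b against the gradient of the cost, scaled by alpha. Roughly, one update looks like the sketch below; the distribution code already implements the descent loop and may scale its gradients differently:

import numpy as np

def step(X, Y, m, b, alpha):
    residuals = Y - (m * X + b)           # per-point costs
    grad_m = -2 * np.sum(X * residuals)   # d(cost)/dm
    grad_b = -2 * np.sum(residuals)       # d(cost)/db
    return m - alpha * grad_m, b - alpha * grad_b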
The input to the program will have 3 lines. The first line will be a float, the user-chosen learning rate "alpha". The second and third lines will be X and Y, the dataset on which we want to perform linear regression, with space-separated entries.
For example,
0.001
0 1 2
0 2 5
corresponds with f(x) in the example above, with a chosen learning rate of 0.001.
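One way to read this input is sketched below; the variable names are illustrative, so use whatever the distribution code expects:

import numpy as np

alpha = float(input())                              # learning rate
X = np.array([float(v) for v in input().split()])   # x values
Y = np.array([float(v) for v in input().split()])   # y values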
I have given you distribution code here to start with. Included are the driver code and the function that will perform gradient descent.
You need to do 3 things:
1. Gather the user input. X and Y need to be converted into numpy arrays. I have already used import numpy as np, so you should be using np.array() somehow (the parsing sketch above shows one way). Head to the numpy documentation if you need help with this.
2. Write the function LeastSquares, which takes in X, Y, m, and b, where m and b are the slope and y-intercept of a line. This function should sum the squares of the costs over every point.
3. Write an error-checking system that checks whether the given learning rate is too large. If it is, you should return early and notify the user. If the learning rate works, your function should output a tuple of m, b, and cost, rounded to 3 decimal places (this is already set up for you). A sketch of items 2 and 3 follows this list.
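To tie items 2 and 3 together, here is one possible sketch. The "cost went up" heuristic is only one way to detect a runaway alpha, and exactly which alphas get flagged depends on how the distribution code scales its gradients, so this sketch won't necessarily match it step for step:

import math
import numpy as np

def LeastSquares(X, Y, m, b):
    # Item 2: sum the squared costs over every point.
    return np.sum((Y - (m * X + b)) ** 2)

def gradient_descent(X, Y, alpha, iterations=100000):
    # Sketch only; the distribution code's loop may be structured differently.
    m, b, prev_cost = 0.0, 0.0, math.inf
    for _ in range(iterations):
        residuals = Y - (m * X + b)
        m += alpha * 2 * np.sum(X * residuals)   # step against d(cost)/dm
        b += alpha * 2 * np.sum(residuals)       # step against d(cost)/db
        cost = LeastSquares(X, Y, m, b)
        # Item 3: a diverging alpha sends the cost up instead of down.
        if not math.isfinite(cost) or cost > prev_cost + 1e-9:
            print("Your learning rate is too large.")
            return None
        prev_cost = cost
    return (round(m, 3), round(b, 3), round(cost, 3))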
Testing
For each input below, your program should produce the output shown after the >. Your answers need to be accurate, rounded to 3 decimal places (the distribution code takes care of this).
0.005
0 1 2 3 4
3 5 7 9 11
> (2.0, 3.0, 0.0)
0.006
0 1 2 3 4
3 5 7 9 11
> Your learning rate is too large. (or something like this)
0.001
0 1 2 3 4
-3 -5 -10 -16 -24
> (-5.299, -1.002, 12.3)
0.001
0 2 4 8 12 14 18
7 7 9 12 15 20 25
> (1.006, 5.237, 13.421)
0.00001
0 5 6 17 24 30 49
-10 -4 -7 -40 -57 -63 -90
> (-1.979, -0.34, 343.827)
0.1
0 2 3 6 15
20 24 30 100 200
> Your learning rate is too large. (or something like this)