Simple Linear Regression from scratch using Kotlin

January 07, 2019

In this tutorial, we’ll learn how to use Kotlin to train and test a simple linear regression model without any external library. Simple linear regression is the simplest model in machine learning, which makes it a great model to begin with.

This article doesn’t use any external library: the goal is to write everything from scratch to allow for a better understanding of the mechanics behind the scenes.

This article is partly inspired by this one.

Simple Linear Regression

According to Wikipedia, simple linear regression is

a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) that, as accurately as possible, predicts the dependent variable values as a function of the independent variables. The adjective simple refers to the fact that the outcome variable is related to a single predictor.

In other words, given an input variable, the simple linear regression model can predict, with more or less accuracy, the value of a second variable linked to that input.

There are multiple examples of how simple linear regression can be used:

  • Number of children in household -> Liters of milk consumed
  • Years of experience -> Salary
  • IQ -> Job Performance
  • etc.

Data

Data Set

The dataset we’re going to use is one found on Kaggle. Multiple datasets are available online; this is just an arbitrary choice, and you could use any other dataset or generate your own if you prefer.

Link to the dataset

Independent Variable & Dependent Variable

As we saw earlier with Wikipedia’s definition, the independent variable (also called explanatory variable) will help us define what the value of the dependent variable is.

In the formula below, the independent variable is x while the dependent one is y.

The goal of the exercise is, of course, to get an approximation of the optimal values of β₀ & β₁ in the simple linear regression formula:

y = β₀ + β₁*x

In this formula, y is the dependent variable, x is the independent variable, β₀ is the constant (varying the position of our line on the y-axis) and β₁ is the coefficient of the independent variable (varying the slope of our line).
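
To make this concrete, here is a tiny runnable sketch with made-up coefficients (β₀ = 2.0 and β₁ = 0.5 are arbitrary values for illustration, not learned from any data):

// Hypothetical coefficients, for illustration only.
val b0 = 2.0 // β₀: intercept, shifts the line on the y-axis
val b1 = 0.5 // β₁: slope of the line

// y = β₀ + β₁ * x
fun predict(x: Double): Double = b0 + b1 * x

fun main() {
    println(predict(10.0)) // 2.0 + 0.5 * 10.0 = 7.0
}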

Build & Train

Read Files

Let’s start off by reading the train.csv and test.csv files. We’ll also split each one into independent and dependent variables, since that will make it easier to feed our model later.

import java.io.File

// Read the training set and split it into independent (x) and dependent (y) variables.
val xTrain = mutableListOf<Double>()
val yTrain = mutableListOf<Double>()
val trainFileName = "train.csv"

File(trainFileName).forEachLine {
    val split = it.split(",")
    xTrain.add(split[0].toDouble())
    yTrain.add(split[1].toDouble())
}

// Do the same for the test set.
val xTest = mutableListOf<Double>()
val yTest = mutableListOf<Double>()
val testFileName = "test.csv"

File(testFileName).forEachLine {
    val split = it.split(",")
    xTest.add(split[0].toDouble())
    yTest.add(split[1].toDouble())
}
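
A small caveat: if your CSV export contains a header row or blank lines, toDouble() will throw. A slightly more defensive variant (an optional tweak, not part of the original code) simply skips lines that don’t parse, reusing the same xTrain and yTrain lists:

// Defensive variant: ignore header rows and blank or malformed lines.
File(trainFileName).forEachLine { line ->
    val split = line.split(",")
    val x = split.getOrNull(0)?.toDoubleOrNull()
    val y = split.getOrNull(1)?.toDoubleOrNull()
    if (x != null && y != null) {
        xTrain.add(x)
        yTrain.add(y)
    }
}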

Model

Let’s now create our model and feed it with the training data. The model processes the training data directly at construction time, computing everything it needs to make predictions.

val model = SimpleLinearRegressionModel(independentVariables = xTrain, dependentVariables = yTrain)

I left out the code for SimpleLinearRegressionModel on purpose because we’ll discover it method by method, field by field. For now, we just need to understand that we’ve filled the two fields independentVariables & dependentVariables.
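
For orientation, here is a sketch of the class skeleton we’ll be filling in (the member bodies are elided here; each one is introduced in the sections below):

class SimpleLinearRegressionModel(
    private val independentVariables: List<Double>,
    private val dependentVariables: List<Double>
) {
    // Introduced below, one section at a time:
    // meanX, meanY           -> means of both variables
    // variance, covariance() -> building blocks for the coefficients
    // b0, b1                 -> the fitted β₀ and β₁
    // predict(x)             -> apply y = β₀ + β₁ * x
    // test(xTest, yTest)     -> compute RMSE and R²
}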

Mean X & Mean Y

For the sake of the internal calculations in our model, we’ll need the mean of the independent variables and the mean of the dependent variables. This is easily done with the standard library’s sum() extension, dividing the result by the number of elements.

private val meanX: Double = independentVariables.sum().div(independentVariables.count())
private val meanY: Double = dependentVariables.sum().div(dependentVariables.count())
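
As a side note, the standard library also ships an average() extension that computes exactly the same value, so an equivalent version would be:

// Equivalent, using the stdlib's average() extension.
private val meanX: Double = independentVariables.average()
private val meanY: Double = dependentVariables.average()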

Variance & Covariance

We can write β₁’s formula as follows:

β₁ = covariance / variance

For us to get the value of β₁, we’ll have to calculate both of those.

The variance term here is the sum of the squared differences between each independent variable and their mean. (Strictly speaking, a variance would also divide this sum by the number of points, but the covariance below skips the same division, so the two factors cancel out in the β₁ ratio.)

// Sum of the squared deviations of each x from its mean (pow comes from kotlin.math).
private val variance: Double = independentVariables.sumOf { (it - meanX).pow(2) }

The way to calculate the covariance requires a bit more code but is still quite manageable. It can be described as the sum, over all points of the graph, of the product (x - meanX) * (y - meanY).

Hopefully the code makes this clearer:

private fun covariance(): Double {
    var covariance = 0.0
    for (i in 0 until independentVariables.size) {
        // Deviation of each coordinate from its respective mean.
        val xPart = independentVariables[i] - meanX
        val yPart = dependentVariables[i] - meanY
        covariance += xPart * yPart
    }
    return covariance
}
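
For the functionally inclined, an equivalent version pairs the x and y values with zip and sums the products of the deviations (same result, just a stylistic alternative):

// Same computation: pair each x with its y, multiply the deviations
// from the means, and sum all the products.
private fun covariance(): Double =
    independentVariables.zip(dependentVariables) { x, y ->
        (x - meanX) * (y - meanY)
    }.sum()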

β₀ & β₁

Now that we have our variance and covariance calculated, we can go further and calculate what the values of β₁ and β₀ are.

As a reminder, their respective formulas are the following:

β₁ = covariance / variance
β₀ = meanY - (meanX * β₁)

// β₁ = covariance / variance (covariance is the function defined above)
private val b1 = covariance() / variance
// β₀ = meanY - β₁ * meanX
private val b0 = meanY - b1 * meanX
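
A quick sanity check with a tiny made-up dataset confirms the formulas. For points lying exactly on the line y = 2x, we expect β₁ = 2 and β₀ = 0:

// Made-up dataset lying exactly on y = 2x.
val model = SimpleLinearRegressionModel(
    independentVariables = listOf(1.0, 2.0, 3.0),
    dependentVariables = listOf(2.0, 4.0, 6.0)
)
// meanX = 2, meanY = 4, variance = 2, covariance = 4
// => b1 = 4 / 2 = 2.0 and b0 = 4 - 2.0 * 2 = 0.0
println(model.predict(4.0)) // prints 8.0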

Test

To test our model, we’ll feed it the test.csv dataset we read earlier and measure our error using the root-mean-square error (RMSE).

We’ll also calculate the R² (coefficient of determination) to evaluate the precision of our model.
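
In the same notation as the earlier formulas, with n the number of test points, the two metrics computed below are:

RMSE = √(Σ(yPredᵢ - yᵢ)² / n)
R² = 1 - (SSR / SST), with SSR = Σ(yᵢ - yPredᵢ)² and SST = Σ(yᵢ - meanY)²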

fun test(xTest: List<Double>, yTest: List<Double>) {
    var errorSum = 0.0
    var sst = 0.0 // total sum of squares
    var ssr = 0.0 // residual sum of squares
    for (i in 0 until xTest.count()) {
        val x = xTest[i]
        val y = yTest[i]
        val yPred = predict(x)
        errorSum += (yPred - y).pow(2)
        sst += (y - meanY).pow(2)
        ssr += (y - yPred).pow(2)
    }
    println("RMSE = " + Math.sqrt(errorSum.div(xTest.size)))
    println("R² = " + (1 - (ssr / sst)))
}

fun predict(independentVariable: Double) = b0 + b1 * independentVariable
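
Putting everything together, running the whole pipeline (with the lists we read at the beginning) comes down to:

val model = SimpleLinearRegressionModel(independentVariables = xTrain, dependentVariables = yTrain)
model.test(xTest, yTest)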

Now that we have everything set up, our model prints the following results for RMSE & R²

RMSE = 3.07130626802983
R² = 0.9888226846629965

This is a great result for our model: the closer R² is to 1, the better, and an RMSE of 3.071 is more than OK in this case.

Conclusion

This is one of the easiest algorithms in machine learning, and the code needed to train such a model is minimal. The complexity of this implementation is understandable without much data science knowledge, and it can still help in understanding more complex models in the future.

In the next articles, we’ll see how Multiple Linear Regression works and introduce the concept of Gradient Descent to minimize the error of our model.