$$ \def\R{{\mathbb{R}}} \def\MSE{{\text{MSE}}} $$


I've been taking a course on machine learning (ML) lately and it required me to revisit the topic of linear regression. There is some interesting math involved but while learning the concepts I've got interested in modeling the whole process in software.

The context is that of a learning algorithm that generates a hypothesis function (HF) from a set of data points, which is capable of outputting a prediction from a given input not necessarily on the original set of data points. Linear regression is presented as a first learning algorithm.

Any learning algorithm needs a way to measure how good a candidate HF is, so that it can compare and optimize the final result. This measuring is done by a cost function (CF).

The CF is a function of the HF, but when the latter can be completely determined by a set of parameters, these parameters can be used as input for the CF. For example, if our HF is an affine function $$ \begin{equation} \label{eq:affine} h(x) = a + bx \end{equation} $$ then $a, b \in \R$ can be used as inputs for the CF.

For linear regression specifically, we model our HF as an affine function (as above) and we can extract a CF as follows:

Suppose we are given a set of $n$ data points (also called training examples in ML lingo) on the plane $$ P_i = (\hat{x_i}, \hat{y_i}) \in \R^2 $$ for $i \in \{1,2,3,\cdots,n\}$. Then we can find out the mean squared error (MSE) of our HF, relative to the training examples, by looking the average of the squared difference between the output of the HF and the associated data point. In other symbols: $$ \begin{equation} \label{eq:mean-squared-1} \MSE(h) = \frac{1}{n} \sum_{i = 1}^n (h(\hat{x_i}) - \hat{y_i})^2 \end{equation} $$ where $h$ is an HF candidate. $\MSE$ is an example of cost function.

Since we are doing linear regression, $h$ must be an affine function, therefore it has the same the form as equation $\ref{eq:affine}$, meaning $h$ is completely determined by parameters $a$ and $b$, and therefore we may rewrite $\ref{eq:mean-squared-1}$ as $$ \begin{equation} \label{eq:mean-squared-2} \MSE(a, b) = \frac{1}{n} \sum_{i = 1}^n (a + b\hat{x_i} - \hat{y_i})^2. \end{equation} $$ Our goal is to minimize the CF, because that would mean the HF differs minimally from the data points, and with this version of $\MSE$ (as a function from $\R^2$ to $\R$), we can use the optimization tools from calculus, in particular, the Gradient Descent algorithm.

Some code

The point of the prelude above was to outline the process and the involved data at each step. For example, I emphasized that the cost function $\MSE$ depends on the hypothesis function as an indicator that this could be modeled, in its general form, as a higher-order function.

I will be using Julia for the examples, since it has great support for lexical closures, higher-order functions and also number crunching.

We could simply define an MSE function that calculates the error for given HF, but we would like to factor out some reusable code. For example, by the definition of MSE, we need a function to do summation and another for averaging.

Luckly, Julia already provides a (higher-order) function sum which applies a given function on each element of an array and adds all the results together. The expression sum(f, A) would be equivalent to $$ \sum_{i=1}^n f(A_i) $$

Notice we are indexing from $1$, just as Julia.

We can now define our average function simply as

function average(term, data)
    sum(term, data) / length(data)

We are ignoring error handling for simplicity as usual, but this is fine here, since sum already throws an error when the array data is empty, which would also cause a division by zero on average.

Julia provides the nice do-block syntax to pass anonymous functions as the first parameter of a higher-order function.

# this...
average(x -> x * 2, [1,2,3])

# ...can be rewritten as
average([1,2,3]) do x
    x * 2

Now we can write our MSE function as follows: it takes a candidate HF h and an array dataPoints of $2$-tuples of numbers as parameters an computes the average when term is the function the computes the square of the difference as mentioned before

function meanSquaredError(h, dataPoints)
    average(dataPoints) do (x, y)
        (h(x) - y)^2

which would be the equivalent to $\ref{eq:mean-squared-1}$, but we can use it to create a version that is specialized for affine functions, matching equation $\ref{eq:mean-squared-2}$:

function mseAffine(a, b, dataPoints)
    meanSquaredError(x -> a + b * x, dataPoints)

We are now able to generate as many CFs as we want by using mseAffine with different parameters a and b and check which ones better approximate some fixed set of data points.


This is all really simple code, but the point here is to illustrate that by using higher-order functions, we were able construct a software model that closely resembles what we already had as a mathematical model, despite some minor notational differences.

This is important, because there's less "impedance mismatch" when we have to switch from working with the math to working with the program and back, similar to using the same programming language for the front end and back end in a Web project, for instance. This is specially true for ML, since it uses lots of statistical methods in its algorithms.