Machine Learning in Three Pages

A supervised machine learning problem can be viewed as approximating an unknown target function $f:\mathcal{X}\rightarrow \mathcal{Y}$ by constructing a function $\hat{f}:\mathcal{X}\rightarrow \mathcal{Y}$ using a set of $k$ training examples $T=\{\left<X_i, Y_i\right> | i=1..k, X_i\in\mathcal{X}, Y_i=f(X_i)\}$. The function $\hat{f}$ is referred to as a model. The goal of machine learning is to construct the model in such a way that its results on $X\in\mathcal{X}\setminus T$ are close – according to some measure – to $f(X)$. It is assumed that the training set $T$ is drawn from a fixed, but unknown, probability distribution. Under typical assumptions, the machine learning model trained on this set is only expected to perform well on previously unseen samples drawn from the same distribution.

For instance, if we consider the problem of predicting whether a source code file contains a security vulnerability, then the input space $\mathcal{X}$ is the set of all source code files and the output space $\mathcal{Y}$ is the set of class labels ${\mathrm{vulnerable}, \mathrm{not\ vulnerable}}$; the function $f$ assigns to each source file its true vulnerability status. Given a training set $T$ of source files with their true vulnerability label we would like to construct a model $\hat{f}$ that, given any source file, accurately predicts whether the file contains a vulnerability.

The model $\hat{f}$ is chosen from a hypothesis space $\mathcal{H}$. The size of this space determines the model capacity, that is the range of target functions $f$ a model can accurately approximate. While high capacity allows the model to accurately fit the training data, that is not always desirable, since it might lead to a phenomenon known as overfitting, where the model yields accurate predictions for the elements of the training set, but generalises poorly to previously unseen data. As an informal example, restricting $\mathcal{H}$ to linear functions might make it difficult to accurately fit the training set $T$; expanding $\mathcal{H}$ to polynomials of degrees up to 10 could provide a model that can perfectly fit all of the points in $T$, but in doing so might also fit the noise inherent in the training sample.

Controlling the model capacity is one of the greatest challenges in applied machine learning. It can be achieved not only by restricting the family of functions from which the model is chosen, but also by placing constraints on the chosen models that promote certain functions (usually considered “simpler” according to some metric) over alternatives. For example, in case of polynomials, a constraint might be placed that prioritises those with lower sum of coefficients, or those where the largest number of coefficients is set to zero. Placing such constraints on the models is known as regularisation.

The task of constructing the model $\hat{f}$ is typically performed in two steps. First, the family of models, that is the hypothesis space $\mathcal{H}$, is chosen – in applied machine learning this step is sometimes referred to as model selection. All models in $\mathcal{H}$ can be represented as $\hat{f}(X, \alpha)$, where $X\in\mathcal{X}$ and $\alpha$ is a parameter that determines the specific member of $\mathcal{H}$. The second step then involves finding the $\alpha$ that minimises a certain loss function $L(\alpha)$, where the choice of the loss function is task-specific. This second step is often called model training or model fitting.

For instance, consider the case where $\mathcal{X}$ and $\mathcal{Y}$ are both the set of real numbers $\mathbb{R}$ and in the first step of model construction $\mathcal{H}$ has been restricted to the linear functions $\mathbb{R}\to\mathbb{R}$. Each member of $\mathcal{H}$ can be represented as $\hat{f}(X, \left<a, b\right>) = aX+b$; the pair $\left<a,b\right>$ is the parameter $\alpha$ that uniquely identifies a member of $\mathcal{H}$. A loss function commonly used in such scenario is the squared error:

$$ L(\left<a,b\right>) = \sum\limits_{\left<X,Y\right>\in T}\left(\hat{f}(X, \left<a,b\right>)-Y\right)^2 $$

and the choice of optimal parameters involves finding the minimum of this function.

In general, the task of model training typically involves function optimisation. A particularly widely used algorithm in this context is gradient descent, where the gradient $\nabla L$ is used to determine the direction towards the $\alpha_0$ that minimises $L(\alpha)$. It should be noted that the tractability of optimisation depends on the specific loss function. Complex models can give rise to complex loss functions that, even if differentiable, can have multiple local minima, saddle points or flat regions, which make them difficult to optimise.

A wide range of model families is discussed in the literature and used in practice. Below we will survey a few of the widely used families, with particular focus on those applicable to classification problems, where the output space $\mathcal{Y}$ is a finite set of class labels. Unless indicated otherwise, we will assume that $\mathcal{X}=\mathbb{R}^n$, and we will refer to individual components of input vectors $X$ as features.

Logistic Regression

A class of models used for binary classification, with the two classes represented as $0$ and $1$. The models are functions of the form

$$ \hat{f}(X, \theta)=\frac{1}{1+e^{-\theta^T X}} $$

— the logistic sigmoid functions, where the vector of coefficients (weights) $\theta\in\mathbb{R}^n$ represents relative significance of each feature when determining the output class1. The range of $\hat{f}$ is the interval $(0,1)$, and the result can be interpreted as the confidence that the input $X$ belongs to class $1$; it is mapped to the output space $\mathcal{Y}={0,1}$ by rounding the output to the nearest whole number2. Logistic regression is one of the fundamental classification approaches thanks to relatively straightforward interpretation and availability of efficient training algorithms. It is considered a linear classifier, since the decision boundary $\hat{f}(X, \theta) = \frac{1}{2}$ is the solution of an equation $\theta^T X=0$ that is linear with respect to $X$. Linear binary classifiers can be understood as hyperplanes that separate the space $\mathbb{R}^n$ into two regions, corresponding to the two classes. Such classifiers often do not perform well if the classes are not linearly separable.

Naive Bayes

The Bayes’s conditional probability theorem, $P(Y|X)=\frac{P(X|Y)P(Y)}{P(X)}$, gives rise to a classifier family where $Y$ is the predicted class and $X$ is the input. The conditional probability $P(Y|X)$ is proportional to joint probability $P(Y, X)$, which, under the assumption that the features $x_1,\dots x_n$ in $X$ are independent, can be rewritten as $P(Y)\prod_{i=1}^n P(x_i|Y)$. A classifier can then be constructed by enumerating the joint probabilities for all possible output classes $Y\in\mathcal{Y}$ and the input $X$, and choosing the class with the highest probability:

$$ \hat{f}(X)=\mathrm{argmax}_ {Y\in\mathcal{Y}}P(Y)\prod_{i=1}^n P(x_i|Y) $$

The probabilities $P(Y)$ and $P(x_i|Y)$ can be estimated from the training data. While the independence assumption typically does not hold in real world applications (hence the “naive” in the name), this classification approach has proven surprisingly effective in domains such as spam detection. This efficacy, coupled with efficient training algorithm, makes naive Bayes one of the baseline classification methods.

Support vector machine (SVM)

A learning algorithm that constructs a linear classifier whose hyperplane is chosen to maximise the gap between the two classes. For this reason it is called a large margin classifier. It can be extended to nonlinear classification using so called kernel trick, where the input space $\mathcal{X}$ is implicitly mapped through a nonlinear kernel function to a higher dimensional $\mathcal{Z}$. A linear hyperplane learned in $\mathcal{Z}$ then corresponds to a nonlinear boundary in $\mathcal{X}$. Support vector machine provides state-of-the-art classification performance in many applications.

Random forest

A decision tree is a model that represents the class prediction process as a tree where the inner nodes specify conditions on the input features – for instance, the value of a specific feature being in a certain range – and a subtree is chosen depending on whether the condition is met; the leaf nodes specify which class is predicted. Decision tree learning can lead to highly irregular decision boundaries, and as a result is prone to overfitting. To address this, an ensemble of such trees can be constructed, where each tree is trained on a different subset of the training set $T$ – an approach known as bagging. The class prediction of a random forest classifier is then based on an average vote of all trees in the ensemble.

Neural network

A single in a neural network can be understood as a function $h(X) = g(\theta^TX)$, where $X$ is the input vector, $\theta$ is a weight vector and $g$ is a nonlinear activation function $\mathbb{R}\to\mathbb{R}$; popular activation functions include the logistic function, hyperbolic tangent and $\mathrm{ReLU}(x)=\max(0,x)$. A neural network layer is then a sequence of $m$ such neurons, each with its own weight vector $\theta_i$. If we denote by $W$ an $m\times n$ matrix whose rows are the weight vectors $\theta_i^T$, then the output of such layer is a vector in $\mathbb{R}^m$ and can be represented as $g(W X)$. A (feedforward) neural network is a sequence of consecutive layers, where the output of each layer is used as the input to the subsequent one. Such network can be used for binary classification by having the final layer contain a single neuron with logistic activation function; the output value of this neuron can then be interpreted in the same way as in the case of logistic regression. As a matter of fact, a single layer, single neuron network with logistic activation function is equivalent to logistic regression. Stacking of layers allows neural networks to approximate arbitrarily complex functions.

Recurrent neural network

In certain machine learning problems the input $X$ is most conveniently represented not as a single vector in $\mathbb{R}^n$, but as an unbounded sequence of such vectors. For instance, when processing audio, each vector can represent the frequency spectrum histogram in a unit of time; when classifying text, each vector can encode a single token. Recurrent neural network is formed by extending a feedforward neural network with the concept of discrete time, and with each neuron’s output not only providing an input to the subsequent layer, but also to the same layer in the next time step. The output of a single layer of size $m$ in time step $t$ can be described as $H_t = g(W X_t + V H_{t-1})$, where $X_t$ is the layer input at time step $t$ and $V$ is an $m\times m$ weight matrix corresponding to the connections through time; the initial state $H_0$ can be given or learned. A recurrent network can process sequential input by taking the vectors in the sequence as the inputs to the network in consecutive time steps. The network produces a single output in each time step. If the problem calls for a single class prediction at the end of the sequence then all of those outputs except for the final one can be discarded; since prior network states feed into future time steps, this last output is a function of the entire sequence of inputs.

  1. The notation $\theta^T$ represents a transposition from a column to row vector; therefore the product $\theta^TX$ is a scalar. ↩︎

  2. Such rounding corresponds to the choice of the value $\frac{1}{2}$ as the decision boundary, or classification threshold. In practice other thresholds can be used if appropriate for a given task. ↩︎