# Support Vector Machines: What they do and their best ‘trick’

In recent years, machine learning has been used to address many different problems in neuroscience. For classification in particular, several algorithms have been proposed for specific tasks. For instance, at Starlab we have used discriminative learning algorithms for EEG/ECG classification (e.g. neurodegenerative diseases) and for time-series classification. This type of algorithm models the dependence of an unobserved variable (the label) on an observed variable (the features). Among these learning algorithms, probably the most extensively used is the Support Vector Machine (SVM). Around two decades ago this algorithm started outperforming the state of the art, and it has since been used in many different research fields. It is worth mentioning, however, that deep learning algorithms have recently outperformed SVM-based ones in certain tasks.

Although many researchers are familiar with SVMs, the algorithm is often used as a black box, especially when it comes to how exactly the learning process is carried out. The idea behind it is to find a linear decision boundary, called a hyperplane, that optimally divides the training data you provide, that is, with the largest possible margin to the closest training examples of each class. This hyperplane splits the feature space into two parts.

Regarding the training data used to feed an SVM classifier, each training example is represented by a set of features (e.g. height, age, country) together with a label that specifies its class. An optimisation process is then carried out to find the aforementioned hyperplane. Finally, a new element is classified as one class or the other (binary classification) depending on which side of this hyperplane it lies.
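The classification step described above can be sketched in a few lines of Python. This is only an illustration: the hyperplane parameters `w` and `b` are hand-picked, hypothetical values rather than the result of the SVM optimisation, and the toy data and function names are our own assumptions.

```python
def decision_value(w, b, x):
    """Signed score w.x + b: positive on one side of the hyperplane, negative on the other."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


def classify(w, b, x):
    """Binary decision: +1 if x lies on the positive side (or on the hyperplane), else -1."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical hyperplane x1 + x2 - 3 = 0 (hand-picked, NOT learned by an SVM).
w = [1.0, 1.0]
b = -3.0

# Toy training set: (features, label) pairs, invented for illustration.
training = [
    ([0.5, 1.0], -1),
    ([1.0, 0.5], -1),
    ([2.5, 2.0], 1),
    ([3.0, 3.0], 1),
]

# Every toy training example falls on the side that matches its label.
all_correct = all(classify(w, b, x) == label for x, label in training)

# A new element is labelled according to the side of the hyperplane it falls on.
new_label = classify(w, b, [2.0, 2.5])
print(all_correct, new_label)
```

In a real SVM, `w` and `b` would come out of the optimisation process, which picks the separating hyperplane with the largest margin.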

When training a discriminative learning algorithm such as the SVM, we can always apply a non-linear transformation to the training data. This approach lets us perform the training in a different space while using the same learning algorithm. Usually, this transformation is carried out when a linear algorithm is not able to accurately separate the training data in the original space. Thus, if we map the training data into another space, normally a higher-dimensional one, we may have a better chance of linearly separating the data, and therefore of obtaining a better model. However, the main drawback of this technique is that it tends to be computationally very expensive. But what if I told you that certain algorithms can overcome this problem?
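As a small illustration of such an explicit transformation, the sketch below uses invented one-dimensional data in which one class sits between the two halves of the other. No threshold on `x` separates the classes, but after the (assumed, hand-chosen) mapping `x -> (x, x*x)` a hand-picked line in the new space does. All names and values here are hypothetical.

```python
def accuracy(predict, xs, ys):
    """Fraction of points whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)


# Toy 1-D data: the -1 class lies between the two groups of the +1 class.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]

# In one dimension a linear classifier is just a threshold with an orientation.
# Trying every threshold between neighbouring points covers all possible cases.
thresholds = [-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
best_1d = max(
    accuracy(lambda x, t=t, s=s: s if x > t else -s, xs, ys)
    for t in thresholds
    for s in (1, -1)
)


def feature_map(x):
    """Explicit non-linear transformation into a 2-D space."""
    return (x, x * x)


# Hand-picked hyperplane in the mapped space: second coordinate = 2.5.
w, b = (0.0, 1.0), -2.5


def predict_mapped(x):
    z = feature_map(x)
    score = w[0] * z[0] + w[1] * z[1] + b
    return 1 if score > 0 else -1


mapped_accuracy = accuracy(predict_mapped, xs, ys)
print(best_1d, mapped_accuracy)
```

The price of this approach is that every point has to be transformed and stored in the new space, which becomes costly as the dimensionality of that space grows.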

Kernel methods are able to work in a high-dimensional space without performing any explicit transformation of the data. Basically, given a higher-dimensional space where we have a better chance of linearly splitting the data, the objective is to find an expression for its inner product that operates only on the original training data. This ‘modified’ inner product then defines the kernel, or similarity function, used by the kernel method. Any linear approach that accesses the data only through inner products can be converted into a non-linear version by applying this kernel trick.
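A standard textbook case makes this concrete. For two-dimensional inputs, the degree-2 polynomial kernel `k(x, z) = (x . z)^2` gives exactly the inner product of the explicit feature map `phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)`, without ever building the three-dimensional vectors. The sketch below checks this numerically; the function names and the sample points are our own choices for illustration.

```python
import math


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map into 3-D space for a 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed only from the original 2-D data."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)

explicit = dot(phi(x), phi(z))   # transform first, then take the inner product
implicit = poly_kernel(x, z)     # kernel trick: no transformation needed

print(math.isclose(explicit, implicit))
```

For kernels such as the Gaussian (RBF) kernel, the corresponding feature space is infinite-dimensional, so the kernel is the only practical way of computing these inner products.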

One of the most interesting things about SVMs is their ability to work with kernels. This way, if we find that the kernel we are currently using (e.g. linear) is not able to properly divide the training data, we can always apply the kernel trick and switch to a non-linear kernel. In this video (https://www.youtube.com/watch?v=3liCbRZPrZA) you will find a nice visual example. Of course, before doing so we need to figure out whether the features we are using are discriminative enough and whether the optimisation parameters selected in the training stage are appropriate. This task, known as feature/parameter selection, has also been studied for many years in machine learning. For instance, some learning algorithms internally select the best features. In the next post, we will talk about these algorithms.
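To see what swapping the kernel does in practice, here is a minimal sketch that needs no external library. Note that it does not implement an SVM: it uses a kernel perceptron, a much simpler kernelised linear classifier, as a stand-in, because the effect of changing the kernel is the same in spirit. The XOR-style data, the function names and the number of epochs are all assumptions made for illustration.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


def quadratic_kernel(x, z):
    """Inhomogeneous degree-2 polynomial kernel."""
    return (linear_kernel(x, z) + 1.0) ** 2


def train_kernel_perceptron(xs, ys, kernel, epochs=10):
    """Return one mistake counter per training example."""
    alphas = [0] * len(xs)
    for _ in range(epochs):
        for i, (x_i, y_i) in enumerate(zip(xs, ys)):
            score = sum(a * y * kernel(x, x_i) for a, x, y in zip(alphas, xs, ys))
            if y_i * score <= 0:  # misclassified (or undecided): strengthen this example
                alphas[i] += 1
    return alphas


def predict(xs, ys, alphas, kernel, x_new):
    score = sum(a * y * kernel(x, x_new) for a, x, y in zip(alphas, xs, ys))
    return 1 if score > 0 else -1


def training_accuracy(xs, ys, kernel):
    alphas = train_kernel_perceptron(xs, ys, kernel)
    hits = sum(predict(xs, ys, alphas, kernel, x) == y for x, y in zip(xs, ys))
    return hits / len(xs)


# XOR-like toy data: no straight line separates the two classes.
xs = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
ys = [-1, 1, 1, -1]

linear_acc = training_accuracy(xs, ys, linear_kernel)
quadratic_acc = training_accuracy(xs, ys, quadratic_kernel)
print(linear_acc, quadratic_acc)
```

Only the `kernel` argument changes between the two runs; the learning algorithm itself is untouched. With an actual SVM library the change is similarly small, for example selecting a different `kernel` option in scikit-learn's `SVC`.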

If you have any questions about this post, or you want to learn more about learning algorithms, please do not hesitate to send us a message.