Today I would like to deal with putting EEG analysis to use once the Computational Intelligence / Machine Learning algorithm for that purpose has been designed. The crucial question to answer is how well the algorithm does when confronted with data it has not seen before. This is known in data analysis as its generalization capability. We usually receive some datasets when asked to develop an automated EEG analysis application. For instance, we might receive a set of EEG streams from 2 groups of subjects. One of the groups has developed Alzheimer’s disease, whereas the other has remained healthy. Is it possible to find a pattern in the EEG that allows us to discriminate one group from the other? The final goal of such a solution would be a test that could be used in Alzheimer’s diagnosis.

Obviously, what you first need is a performance measure such as the classification rate. I won’t focus today on such measures, but on the procedure you need to follow in order to evaluate the performance. As mentioned in the previous paragraph, the most difficult thing here is to estimate what the performance of the system would be when a new subject arrives, the EEG is acquired, and this data goes through the implemented system.

**Training Machine Learning and its pitfalls**

If you want to simulate the situation of a new subject arriving, you first have to keep some data of the initial set apart from the rest. Hence you split the data into the so-called training and test sets. Here the first problem appears in the form of the so-called sample size. How many feature vectors do I need to include in my training set? There is no systematic answer to this question. A rule of thumb in the field recommends between 15 and 25 data samples per feature vector component. For instance, if your features are given as vectors of 3 components, you should have at least between 45 and 75 samples in your training set.
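The rule of thumb and the split itself are easy to sketch in code. The snippet below is only an illustration: the function names, the 30% test fraction and the fixed random seed are my own assumptions, not a recommendation from the field.

```python
import random


def min_training_samples(n_features, per_feature_low=15, per_feature_high=25):
    """Rule of thumb: 15 to 25 samples per feature vector component."""
    return n_features * per_feature_low, n_features * per_feature_high


def train_test_split(samples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the samples and split it into two disjoint lists.

    test_fraction and seed are illustrative defaults (assumptions).
    """
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled samples are held out, the rest is for training.
    return shuffled[n_test:], shuffled[:n_test]


low, high = min_training_samples(3)
print(low, high)  # 45 75

# Placeholder sample identifiers standing in for real feature vectors.
train, test = train_test_split(range(100))
print(len(train), len(test))
```

With 3-component feature vectors, a training set of 70 samples like the one above would sit inside the 45 to 75 range given by the rule of thumb.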

The idea is that you use the training set to tune all system parameters (yes, I mean ALL, and that isn’t always easy to fulfill). So when implementing machine learning and computational intelligence systems, what you normally do with this training set is train your classifiers. Once trained, you compute the selected performance measure, e.g. classification rate or true positive rate, over the results obtained when classifying the test set. The most important idea to understand in this first step is that if you really want to measure the generalization capability, you must not tune any parameter with the test set. Tuning on the test set will allow you to write some papers in the field (if reviewers do not discover the trick), but will keep your automated EEG classification from being used in a real-world application.
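To make the two performance measures mentioned above concrete, here is a minimal sketch of how they can be computed from the true and the predicted labels of the test set. The function names and the toy labels are made up for illustration, with 1 standing for the positive class (e.g. patient) and 0 for the negative one.

```python
def classification_rate(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def true_positive_rate(y_true, y_pred, positive=1):
    """Fraction of truly positive samples that were predicted as positive."""
    predictions_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not predictions_for_positives:
        raise ValueError("no positive samples in y_true")
    hits = sum(p == positive for p in predictions_for_positives)
    return hits / len(predictions_for_positives)


# Toy test-set labels (invented for this example).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

print(classification_rate(y_true, y_pred))  # 0.625
print(true_positive_rate(y_true, y_pred))   # 0.5
```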

One common problem is that you obtain a very good performance over the training set, e.g. 98% true positive rate (TPR), but a bad one over the test set, e.g. 60% TPR. Your algorithm then suffers from so-called overfitting. This means it is able to adapt very well to the examples contained in the training data set, but is not able to classify well the unseen examples contained in the test data set. Your algorithm is not generalizing. You have a problem. You have to look for another classifier, or for other features.
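Overfitting is easy to reproduce on a toy problem. The sketch below is an assumption-laden illustration, not an EEG pipeline: it uses a synthetic 1-D feature drawn from two overlapping Gaussian classes and a 1-nearest-neighbour classifier, which memorizes its training samples by construction.

```python
import random


def nearest_neighbour_predict(train_x, train_y, x):
    """Return the label of the training sample closest to x (1-NN)."""
    distances = [abs(x - t) for t in train_x]
    return train_y[distances.index(min(distances))]


def rate_with_1nn(train_x, train_y, xs, ys):
    """Classification rate of the 1-NN classifier on the samples (xs, ys)."""
    hits = sum(nearest_neighbour_predict(train_x, train_y, x) == y
               for x, y in zip(xs, ys))
    return hits / len(xs)


def make_data(n, rng):
    """Synthetic data: class 0 centred at 0, class 1 centred at 1, unit spread."""
    xs, ys = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        xs.append(rng.gauss(float(label), 1.0))
        ys.append(label)
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(60, rng)
test_x, test_y = make_data(60, rng)

train_rate = rate_with_1nn(train_x, train_y, train_x, train_y)
test_rate = rate_with_1nn(train_x, train_y, test_x, test_y)
print("training set:", train_rate)
print("test set:    ", test_rate)
```

Each training sample is its own nearest neighbour, so the rate on the training set is perfect, while the rate on the held-out samples is clearly lower because the two classes overlap. That gap is the symptom described above.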

**Systematic Procedures for testing generalization in Computational Intelligence**

There are systematic ways of looking at the generalization capability. The most used in machine learning and computational intelligence is known as cross-validation, usually K-fold cross-validation, where K is a parameter of the procedure. Here you have to randomly split your dataset into K groups. These subsets have to be disjoint, i.e. each member of the data set appears in just 1 group. Once split, you use K-1 groups for training, and the remaining one for testing. You repeat the same operation K times, each time holding out a different group, and therefore obtain K different performance measures corresponding to the K test sets. You can then compute the average and the variance of these K measures. The average is an indicator of how well the classification does. The variance relates to the generalization capability. The lower the variance, the more stable the performance is across different training sets, and the better your algorithm generalizes.
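The procedure can be sketched in a few lines. Everything below is a minimal illustration under my own assumptions: the helper names, the synthetic 1-D data and the very simple threshold classifier are stand-ins for whatever features and classifier you actually use.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Randomly partition the indices 0..n_samples-1 into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, k, fit, predict, seed=0):
    """Train on k-1 folds, test on the remaining one, and repeat k times."""
    rates = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        hits = sum(predict(model, xs[i]) == ys[i] for i in test_idx)
        rates.append(hits / len(test_idx))
    return rates


def fit_threshold(xs, ys):
    """Stand-in classifier: threshold halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    mean1 = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    return (mean0 + mean1) / 2, mean1 >= mean0


def predict_threshold(model, x):
    threshold, one_is_above = model
    return int((x > threshold) == one_is_above)


# Synthetic, balanced two-class data (illustrative only).
rng = random.Random(1)
ys = [i % 2 for i in range(100)]
xs = [rng.gauss(2.0 * y, 1.0) for y in ys]

rates = cross_validate(xs, ys, 5, fit_threshold, predict_threshold)
print("per-fold rates:", rates)
print("average: ", statistics.mean(rates))
print("variance:", statistics.variance(rates))
```

Note that the classifier is fitted again from scratch inside every iteration, using only the K-1 training folds, so the held-out fold never influences any parameter.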

The idea of the procedure is that you test your algorithm with different training sets. You then see how the performance varies with respect to the samples included in the training set. This gives you a statistical estimate of what would happen when the algorithm is confronted with novel data. It really answers the question of how well your algorithm does in a real-world application.

Some variants of this basic procedure have been proposed in the literature: bootstrap, jackknife, leave-one-out. But I think I have no more space (or time) to discuss them all today. So I will leave them for the next post.