Training, testing and validation datasets
The division of the input data into training, testing and validation sets is crucial in the creation of robust machine learning algorithms. Machine learning algorithms first require a training set on which to learn. At each iteration, the algorithm calculates the difference between the predicted and actual outcomes and refines its weightings accordingly to reduce this difference. The algorithm produced is therefore tailored specifically to the training data set. To assess the generalisability of the final algorithm and its learnt parameters, it is evaluated on a separate testing data set.
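As a minimal sketch of this train/test workflow, the example below fits a model on a training set and reports its accuracy on a held-out testing set. The library (scikit-learn), the logistic-regression classifier, the synthetic data and the split proportions are all illustrative assumptions rather than details from the text.

```python
# Minimal sketch: fit on a training set, assess generalisability on a held-out test set.
# Classifier, dataset and split proportions are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data standing in for the project's input data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out a separate testing set that the algorithm never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # weightings are refined iteratively on the training data only

print("Training accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("Testing accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

A large gap between the two reported accuracies is the practical warning sign of the overfitting discussed next.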
The main issue that can arise is overfitting of the algorithm to a specific training set. When this occurs, the algorithm is accurate (low error between predicted and actual results) on the training data set but highly inaccurate on the testing data set. To overcome this, a further separate validation set is used: the algorithm is trained on the training set while the hyperparameters (i.e. the architecture, number of iterations and allowable error) are optimised according to its accuracy on the validation data set. Once a final model has been created using the training and validation sets, it is applied to the testing set for a final unbiased evaluation.
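The sketch below illustrates this three-way workflow under the same assumptions as above: several candidate hyperparameter values are compared by their accuracy on the validation set, and only the chosen model is then evaluated once on the testing set. The regularisation strength C is used here purely as a stand-in for whichever hyperparameters a given project tunes.

```python
# Sketch of hyperparameter selection on a validation set, with a single final test evaluation.
# The regularisation strength C stands in for the hyperparameters mentioned in the text.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Split into training (60%), validation (20%) and testing (20%) sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_model, best_val_acc = None, 0.0
for C in [0.01, 0.1, 1.0, 10.0]:  # candidate hyperparameter values
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_model, best_val_acc = model, val_acc

# The testing set is touched only once, for a final unbiased estimate of performance.
print("Validation accuracy of chosen model:", best_val_acc)
print("Testing accuracy of chosen model:", accuracy_score(y_test, best_model.predict(X_test)))
```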
The three data sets should be randomly divided at the commencement of the project, with the ratio dependent on the specific project and the total data size. A common starting point is a ratio of 60:20:20 for the training, validation and testing sets, respectively.
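One way to produce such a random 60:20:20 division is sketched below, again assuming scikit-learn and illustrative data; splitting off 20% for testing and then 25% of the remainder for validation yields the 60:20:20 ratio overall.

```python
# Sketch of a random 60:20:20 split; the ratio would be adjusted to the project and data size.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 20)             # illustrative data: 1000 samples, 20 features
y = np.random.randint(0, 2, size=1000)   # illustrative binary outcomes

# First hold out 20% for testing, then split the remainder 75:25 (i.e. 60:20 of the total).
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```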