splitting-dataset
Parent: generalisation
Source: google-ml-course
Splitting the dataset
Strategy 1: train/validation split
- Good performance on the training set doesn’t necessarily translate into good performance on a separate test set
- The training set must be large enough (to reduce the risk of underfitting)
- The training set must not be reused for validation (to reduce the risk of overfitting)
- Thus, if we only have one large dataset, the idea is to split it into subsets:
- The training set is used to learn the model weights.
- The trained model is then evaluated on the validation set.
- The model generalises well if the validation loss is similar to the training loss.
- How to split?
- The larger the training set, the better the learning potential
- But the larger the validation set, the more confident we can be in the model evaluation metrics
- If the original dataset is large, a 90:10 training:validation split can be sufficient (see the sketch after this list)
- If the original dataset is small, cross-validation might be helpful (sketched after the notes below)
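A minimal sketch of the 90:10 split, assuming scikit-learn and NumPy; the synthetic dataset and linear model are placeholders, not part of the course material:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                    # placeholder features
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)

# shuffle=True (the default) mixes the rows before the 90:10 split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, shuffle=True, random_state=42
)

model = LinearRegression().fit(X_train, y_train)  # learn weights on the training set

train_loss = mean_squared_error(y_train, model.predict(X_train))
val_loss = mean_squared_error(y_val, model.predict(X_val))

# Similar training and validation losses suggest the model generalises
print(f"train MSE: {train_loss:.4f}, validation MSE: {val_loss:.4f}")
```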
Notes:
- Shuffle the data before splitting, so that the training and validation sets aren’t drawn only from certain clusters of the data
- Repeated evaluation on the validation set (followed by hyperparameter tuning) increases the risk of overfitting to that validation set
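For the small-dataset case, a sketch of 5-fold cross-validation, again assuming scikit-learn; every sample serves as validation data exactly once:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                     # deliberately small dataset
y = X @ rng.normal(size=5)

# 5 folds: train on 80% and validate on 20%, rotating the validation fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=cv)

# Averaging over folds gives a more stable estimate than a single split
print(f"mean validation MSE over 5 folds: {-scores.mean():.4f}")
```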
Strategy 2: train/validation/test split
- Training data: for learning model parameters
- Validation data: for evaluating model and tuning the hyperparameters
- Test data: for final testing
If the results on the independent test data are similar to the results on the validation data, the model has probably not overfit (see the sketch below)
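A sketch of the three-way split via two successive calls to scikit-learn’s train_test_split; the 80:10:10 ratio is an assumption, not prescribed by the course:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))                    # placeholder data
y = rng.normal(size=1000)

# First carve off 20% as a holdout, then halve it into validation and test
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=1
)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=1
)

# Tune hyperparameters against (X_val, y_val); touch (X_test, y_test)
# only once, for the final evaluation
```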
Notes:
- Test and validation sets ‘wear out’ the more they are used.
- The more they are used for tuning, the higher the risk of overfitting.
- To mitigate this, each set can be refreshed with new data.