# How to split/partition a dataset into training and test datasets for, e.g., cross validation?

### Question

What is a good way to split a NumPy array randomly into training and testing/validation dataset? Something similar to the `cvpartition` or `crossvalind` functions in Matlab.

### Answer

If you want to split the data set once into two parts, you can use `numpy.random.shuffle`, or `numpy.random.permutation` if you need to keep track of the indices (remember to fix the random seed to make everything reproducible):

```python
import numpy
x = numpy.random.rand(100, 5)
numpy.random.shuffle(x)
training, test = x[:80, :], x[80:, :]
```

or

```python
import numpy
x = numpy.random.rand(100, 5)
# permute the row indices (x.shape[0] == 100), so we can recover which rows went where
indices = numpy.random.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]
training, test = x[training_idx, :], x[test_idx, :]
```

There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset, with replacement (i.e., bootstrap sampling):

```python
import numpy
x = numpy.random.rand(100, 5)
# draw row indices uniformly at random, with replacement
training_idx = numpy.random.randint(x.shape[0], size=80)
test_idx = numpy.random.randint(x.shape[0], size=20)
training, test = x[training_idx, :], x[test_idx, :]
```

Finally, scikit-learn contains several cross-validation methods (k-fold, leave-n-out, ...). It also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example to make sure that the proportion of positive and negative examples is the same in the training and test sets.
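As a minimal sketch of the k-fold and stratified variants (the class names `KFold` and `StratifiedKFold` come from `sklearn.model_selection`; the 50/50 label array here is just an illustrative assumption):

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

x = np.random.rand(100, 5)
y = np.array([0] * 50 + [1] * 50)  # toy labels, half positive, half negative

# plain k-fold: 5 splits, each uses 80 rows for training and 20 for testing
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(x):
    training, test = x[train_idx], x[test_idx]

# stratified k-fold: each test fold preserves the 50/50 class balance of y
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(x, y):
    training, test = x[train_idx], x[test_idx]
```

Each fold's indices are disjoint from its own test set, and every row appears in exactly one test fold across the 5 iterations.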

### Answer

Another option is to use scikit-learn's `train_test_split`. As the scikit-learn documentation describes, you can use it like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)

data_train, data_test, labels_train, labels_test = train_test_split(
    data, labels, test_size=0.20, random_state=42)
```

This way the labels stay in sync with the data you're splitting into training and test sets.
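If the labels are imbalanced, `train_test_split` also accepts a `stratify` argument that preserves the class proportions in both splits. A small sketch (the toy data and labels here are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20).reshape((10, 2))
labels = [0] * 5 + [1] * 5  # toy labels: five of each class

# stratify=labels keeps the class proportions equal in train and test
data_train, data_test, labels_train, labels_test = train_test_split(
    data, labels, test_size=0.20, stratify=labels, random_state=42)
```

With `test_size=0.20` on 10 samples, the test set holds 2 rows, one per class.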