Split a Dataset into Train and Test Datasets in Python

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Scikit-learn Compatible Packages

sklearn.model_selection.train_test_split is the best way to split a dataset into train and test subsets for scikit-learn compatible packages (scikit-learn, XGBoost, LightGBM, etc.). It accepts indexable sequences (lists, numpy arrays, pandas Series) as well as pandas DataFrames, and the output type matches the input type. Splitting a list returns (train, test) where train and test are lists, splitting a numpy array returns numpy arrays, and splitting a pandas DataFrame returns pandas DataFrames.
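As a quick illustration of the return types, here is a minimal sketch. The toy data and all variable names (values, frame, train_list, and so on) are made up for this example and are not part of the iris walkthrough that follows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, used only to show the return types.
values = list(range(10))

# A plain list is split into two lists.
train_list, test_list = train_test_split(values, test_size=0.2, random_state=119)

# A numpy array is split into two numpy arrays.
train_arr, test_arr = train_test_split(np.array(values), test_size=0.2, random_state=119)

# A pandas DataFrame is split into two DataFrames.
frame = pd.DataFrame({"x": values})
train_df, test_df = train_test_split(frame, test_size=0.2, random_state=119)

print(type(train_list).__name__, type(train_arr).__name__, type(train_df).__name__)
```

With 10 items and test_size=0.2, each split keeps 8 items for training and 2 for testing.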

import pandas as pd

df = pd.read_csv("/media/data/iris.csv")
df.head()

df.shape

(150, 6)

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=119)

Notice that the integer value 119 is passed to the parameter random_state. This is STRONGLY suggested, as it lets you reproduce exactly the same split later. It is generally a good idea to set a seed for the random number generator when you build a model.
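The effect of the seed can be checked directly. The sketch below uses a small made-up numpy array (data) rather than the iris DataFrame, and simply confirms that two calls with the same random_state give identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for a real dataset.
data = np.arange(20)

# Two independent calls with the same seed.
first_train, first_test = train_test_split(data, test_size=0.2, random_state=119)
second_train, second_test = train_test_split(data, test_size=0.2, random_state=119)

# Both calls select exactly the same rows, in the same order.
same_split = np.array_equal(first_train, second_train) and np.array_equal(first_test, second_test)
print(same_split)
```

Without random_state, each call draws a fresh shuffle, so the two splits would generally differ.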

train.head()

train.shape

(120, 6)

test.head()

test.shape

(30, 6)

More Flexible Splitting of Arrays and DataFrames

If you are not building a model and want to split a pandas DataFrame into many pieces, numpy.array_split comes in very handy. Unlike numpy.split, it does not require the number of rows to be divisible by the number of parts. For example, the code below splits a pandas DataFrame into 4 parts. Numpy arrays are of course supported as well.

import numpy as np

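# np.array_split allows parts of unequal size; np.split(df, 4) would raise
# a ValueError here because 150 rows cannot be divided evenly into 4 parts.
dfs = np.array_split(df, 4)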
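The difference between the two numpy functions is easiest to see on a plain array. This is a small sketch with a made-up 10-element array (values), not the iris data.

```python
import numpy as np

# Hypothetical toy array: 10 elements do not divide evenly into 4 parts.
values = np.arange(10)

# array_split tolerates the uneven division: the first parts get one extra element.
parts = np.array_split(values, 4)
sizes = [len(part) for part in parts]
print(sizes)

# split insists on equal-sized parts and raises a ValueError otherwise.
try:
    np.split(values, 4)
    split_failed = False
except ValueError:
    split_failed = True
print(split_failed)
```

The same sizing rule applies to the 150-row DataFrame above: splitting into 4 parts gives two parts of 38 rows and two of 37.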

PyTorch

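The best way to split a PyTorch Dataset is to use the function torch.utils.data.random_split, which takes the dataset and a list of lengths and returns one subset per length. With two lengths, it returns (train, test) where train and test are of the type torch.utils.data.dataset.Subset. The lengths must sum to the length of the dataset.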

import torch

# The lengths must sum to len(dataset): 6000 + 2055 = 8055 samples.
train, test = torch.utils.data.random_split(dataset, [6000, 2055])
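As with random_state in scikit-learn, the split can be made reproducible by passing a seeded generator. The sketch below is an assumption-laden toy: the 100-sample TensorDataset, the 80/20 ratio, and the seed 119 are all made up for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical toy dataset with 100 samples of one feature each.
dataset = TensorDataset(torch.arange(100).float().unsqueeze(1))

# Derive the lengths from the dataset so that they always sum to len(dataset).
n_test = len(dataset) // 5
n_train = len(dataset) - n_test

# A seeded generator makes the random split repeatable.
generator = torch.Generator().manual_seed(119)
train, test = random_split(dataset, [n_train, n_test], generator=generator)

print(len(train), len(test))
```

Computing the lengths from len(dataset) avoids hard-coding sizes that silently go stale when the dataset changes.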

References

https://stackoverflow.com/questions/24147278/how-do-i-create-test-and-train-samples-from-one-dataframe-with-pandas