Split a Dataset into Train and Test Datasets in Python

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Scikit-learn Compatible Packages

sklearn.model_selection.train_test_split is usually the best way to split a dataset into train and test subsets for scikit-learn compatible packages (scikit-learn, XGBoost, LightGBM, etc.). It supports splitting both indexable sequences (numpy arrays, lists, pandas Series) and pandas DataFrames. It returns (train, test) where train and test have the same type as the input: splitting a list returns lists, splitting a numpy array returns numpy arrays, and splitting a pandas DataFrame returns pandas DataFrames.

import pandas as pd
df = pd.read_csv("http://www.legendu.net/media/data/iris.csv")
df.head()
df.shape
(150, 6)
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=119)

Notice that the integer value 119 is passed to the parameter random_state. This is STRONGLY suggested, as it lets you reproduce exactly the same split later. It is generally a good idea to set a seed for the random number generator when you build a model.
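A minimal sketch of what random_state buys you, using a small toy array (an assumption for illustration, not the iris data above): two calls with the same seed should give identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset (an assumption for illustration).
data = np.arange(20)

# Two independent calls with the same seed produce the same split.
train_a, test_a = train_test_split(data, test_size=0.2, random_state=119)
train_b, test_b = train_test_split(data, test_size=0.2, random_state=119)

same_split = np.array_equal(train_a, train_b) and np.array_equal(test_a, test_b)
print(same_split)
```

Without random_state, each call draws a fresh shuffle, so the two splits would almost certainly differ.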

train.head()
train.shape
(120, 6)
test.head()
test.shape
(30, 6)
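When features and labels are stored separately, train_test_split can split several objects in one call so that their rows stay aligned. The sketch below uses a hypothetical toy DataFrame (the column names x1, x2 and label are made up for illustration) and the stratify parameter to keep the class proportions the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: two feature columns and a balanced label column.
toy = pd.DataFrame({
    "x1": range(20),
    "x2": [v * 0.5 for v in range(20)],
    "label": ["a", "b"] * 10,
})
X = toy[["x1", "x2"]]
y = toy["label"]

# Passing X and y together returns four aligned pieces.
# stratify=y keeps the a/b ratio identical in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=119, stratify=y
)
print(X_train.shape, X_test.shape)
```

The outputs keep the input types: X_train and X_test are DataFrames, while y_train and y_test are Series.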

More Flexible Splitting of Arrays and DataFrames

If you are not building a model and want to split a pandas DataFrame into many pieces, numpy.array_split comes in very handy. Unlike numpy.split, it does not require the number of rows to be divisible by the number of pieces. For example, the code below splits a pandas DataFrame into 4 parts. Numpy arrays are of course supported as well.

import numpy as np

# array_split (unlike np.split) allows uneven parts: 150 rows is not divisible by 4
dfs = np.array_split(df, 4)
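The difference between numpy.array_split and numpy.split is easiest to see on a small numpy array whose length is not divisible by the number of parts. This is a minimal sketch on toy data.

```python
import numpy as np

arr = np.arange(10)

# array_split tolerates a length that is not divisible by the number of parts;
# the leading parts receive one extra element each.
parts = np.array_split(arr, 4)
sizes = [len(p) for p in parts]
print(sizes)  # [3, 3, 2, 2]

# np.split insists on equal-sized parts and raises ValueError otherwise.
try:
    np.split(arr, 4)
    split_failed = False
except ValueError:
    split_failed = True
print(split_failed)  # True
```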

PyTorch

The best way to split a PyTorch Dataset is to use the function torch.utils.data.random_split, which returns one torch.utils.data.dataset.Subset per requested length, e.g. (train, test) for two lengths. The lengths must sum to the size of the dataset.

import torch
train, test = torch.utils.data.random_split(dataset, [6000, 2055])  # 6000 + 2055 == len(dataset)
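For a reproducible split, random_split accepts a generator argument that plays the same role as random_state in scikit-learn. The sketch below is self-contained and uses a hypothetical TensorDataset of 100 samples as a stand-in for a real dataset.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical stand-in dataset: 100 samples with 3 features each.
features = torch.arange(300, dtype=torch.float32).reshape(100, 3)
labels = torch.arange(100) % 2
dataset = TensorDataset(features, labels)

# The lengths must add up to len(dataset).
# A seeded generator makes the split reproducible.
generator = torch.Generator().manual_seed(119)
train, test = random_split(dataset, [80, 20], generator=generator)
print(len(train), len(test))
```

Each Subset keeps a reference to the original dataset plus a list of selected indices, so no data is copied.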
