Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Scikit-learn Compatible Packages¶
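sklearn.model_selection.train_test_split splits data into a training set and a test set.
When splitting lists,
it returns (train, test) where train and test are lists.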
When splitting a pandas DataFrame,
it returns (train, test) where train and test are pandas DataFrames.
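As a quick check of the return types described above, here is a minimal sketch. The toy list `items` and the small DataFrame `toy_df` are made up for illustration and are not part of the original notes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy inputs: a plain list and a one-column DataFrame.
items = list(range(10))
toy_df = pd.DataFrame({"x": range(10)})

# Splitting a list gives back two lists.
list_train, list_test = train_test_split(items, test_size=0.2, random_state=119)
print(type(list_train).__name__, len(list_train), len(list_test))

# Splitting a DataFrame gives back two DataFrames.
df_train, df_test = train_test_split(toy_df, test_size=0.2, random_state=119)
print(type(df_train).__name__, df_train.shape, df_test.shape)
```

With test_size=0.2 and 10 items, 2 items go to the test set and the remaining 8 to the training set.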
import pandas as pd
df = pd.read_csv("http://www.legendu.net/media/data/iris.csv")
df.head()
df.shape  # (150, 6)
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=119)

Notice that an integer value 119 is passed to the parameter random_state.
This is STRONGLY recommended as it enables you to reproduce your work later.
It is generally a good idea to set a seed for the random number generator
when you build a model.
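To see what fixing random_state buys you, the sketch below splits the same data twice with the same seed and compares the results. The list `data` is a made-up stand-in for a real dataset.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for a real dataset.
data = list(range(20))

# Two independent calls with the same seed.
a_train, a_test = train_test_split(data, test_size=0.2, random_state=119)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=119)

# The splits are identical because the shuffling is driven by the same seed.
same_split = (a_train == b_train) and (a_test == b_test)
print(same_split)
```

Without random_state, each call shuffles differently, so results that depend on the split cannot be reproduced exactly.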
train.head()
train.shape  # (120, 6)
test.head()
test.shape  # (30, 6)

More Flexible Splitting of Arrays and DataFrames¶
If you are not building a model and want to split a pandas DataFrame into many pieces, numpy.array_split comes in very handy. For example, the code below splits a pandas DataFrame into 4 parts. NumPy arrays are of course supported as well.
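Before applying this to the DataFrame, a small sketch on a plain NumPy array shows how numpy.array_split distributes elements when the length is not divisible by the number of parts, and how numpy.split behaves in the same situation. The array `arr` is a made-up example.

```python
import numpy as np

# Hypothetical toy array with 10 elements, which is not divisible by 4.
arr = np.arange(10)

# array_split tolerates uneven division: the leading parts get one extra element.
parts = np.array_split(arr, 4)
sizes = [len(p) for p in parts]
print(sizes)  # [3, 3, 2, 2]

# split requires an equal division and raises ValueError otherwise.
try:
    np.split(arr, 4)
    split_failed = False
except ValueError as err:
    split_failed = True
    print("np.split failed:", err)
```

This is why array_split is the safer choice when the number of rows is not known to be a multiple of the number of pieces.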
import numpy as np
dfs = np.array_split(df, 4)

PyTorch¶
The best way to split a PyTorch Dataset is to use the function torch.utils.data.random_split
which returns (train, test)
where train and test are of the type torch.utils.data.dataset.Subset.
import torch
# The lengths must sum to len(dataset) (6000 + 2055 = 8055 here).
train, test = torch.utils.data.random_split(dataset, [6000, 2055])
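A self-contained sketch of the same call is shown below. Since the dataset above is not defined in these notes, a hypothetical 10-sample TensorDataset is used instead, and a seeded torch.Generator is passed so that the split is reproducible (the seed 119 simply mirrors the scikit-learn example).

```python
import torch
from torch.utils.data import Subset, TensorDataset, random_split

# Hypothetical toy dataset: 10 samples with one feature and an integer label.
features = torch.arange(10, dtype=torch.float32).unsqueeze(1)
labels = torch.arange(10)
toy_dataset = TensorDataset(features, labels)

# A seeded generator makes the random split reproducible.
generator = torch.Generator().manual_seed(119)
toy_train, toy_test = random_split(toy_dataset, [8, 2], generator=generator)

print(type(toy_train).__name__, len(toy_train), len(toy_test))
```

Each returned Subset keeps a reference to the original dataset plus a list of selected indices, so no data is copied.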