Scikit-learn Compatible Packages
sklearn.model_selection.train_test_split is a convenient way to split a dataset into train and test subsets
for scikit-learn compatible packages (scikit-learn, XGBoost, LightGBM, etc.).
It accepts indexable objects such as lists, numpy arrays, pandas Series and pandas DataFrames.
In every case it returns (train, test), where train and test have the same type as the input:
splitting a list yields lists, splitting a numpy array yields numpy arrays,
and splitting a pandas DataFrame yields pandas DataFrames.
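The statement about return types can be checked with a small sketch. The toy inputs and variable names below are made up for illustration and are not part of the iris example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy inputs (made up for illustration): 10 items each.
as_list = list(range(10))
as_array = np.arange(10)
as_frame = pd.DataFrame({"x": range(10), "y": range(10, 20)})

list_train, list_test = train_test_split(as_list, test_size=0.2, random_state=119)
arr_train, arr_test = train_test_split(as_array, test_size=0.2, random_state=119)
df_train, df_test = train_test_split(as_frame, test_size=0.2, random_state=119)

# The output type follows the input type.
print(type(list_train).__name__, type(arr_train).__name__, type(df_train).__name__)

# With test_size=0.2 and 10 items, 2 items go to the test subset.
print(len(list_test), len(arr_test), len(df_test))
```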
import pandas as pd
df = pd.read_csv("http://www.legendu.net/media/data/iris.csv")
df.head()
df.shape
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2, random_state=119)
Notice that the integer value 119 is passed to the parameter random_state.
This is STRONGLY recommended because it lets you reproduce your work later.
It is generally a good idea to set a seed for the random number generator
when you build a model.
train.head()
train.shape
test.head()
test.shape
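As a minimal sketch of what a fixed random_state buys you, the snippet below splits a toy DataFrame twice with the same seed and checks that both runs pick the same rows. The DataFrame `toy` and the other names are hypothetical and used only for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy DataFrame (made up for illustration).
toy = pd.DataFrame({"x": range(20)})

# Two splits with the same seed select exactly the same rows.
train_a, test_a = train_test_split(toy, test_size=0.25, random_state=119)
train_b, test_b = train_test_split(toy, test_size=0.25, random_state=119)
same_seed_matches = (
    train_a.index.equals(train_b.index) and test_a.index.equals(test_b.index)
)
print(same_seed_matches)

# Train and test do not overlap, and together they cover every row.
covered = sorted(train_a.index.tolist() + test_a.index.tolist())
print(covered == list(range(20)))
```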
More Flexible Splitting of Arrays and DataFrames
If you are not building a model and want to split a pandas DataFrame into several pieces, numpy.array_split is very convenient. For example, the code below splits a pandas DataFrame into 4 parts. Numpy arrays are supported as well.
import numpy as np
# np.array_split, unlike np.split, does not require the number of rows to be divisible by 4
dfs = np.array_split(df, 4)
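The difference between numpy.array_split and numpy.split can be seen on a small array. The sketch below uses made-up data. For the DataFrame it takes one possible route, splitting the row positions and selecting them with iloc, which avoids relying on how numpy handles DataFrame objects internally.

```python
import numpy as np
import pandas as pd

arr = np.arange(10)

# array_split tolerates a length that is not a multiple of the number of parts:
# the first (10 % 4) = 2 parts get one extra element.
parts = np.array_split(arr, 4)
print([len(p) for p in parts])  # [3, 3, 2, 2]

# np.split insists on an equal division and raises ValueError otherwise.
try:
    np.split(arr, 4)
    split_failed = False
except ValueError:
    split_failed = True
print(split_failed)

# For a DataFrame, one option is to split the row positions and select with iloc.
toy = pd.DataFrame({"x": range(10)})
frames = [toy.iloc[idx] for idx in np.array_split(np.arange(len(toy)), 4)]
print([f.shape for f in frames])
```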
PyTorch
A convenient way to split a PyTorch Dataset is the function torch.utils.data.random_split,
which returns (train, test), where train and test are of the type torch.utils.data.dataset.Subset.
The integer lengths passed in the list must add up to len(dataset).
import torch
train, test = torch.utils.data.random_split(dataset, [6000, 2055])
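Below is a self-contained sketch of random_split on a toy TensorDataset. The data and sizes are made up for illustration. Passing a seeded torch.Generator plays the same role as random_state in scikit-learn and keeps the split reproducible.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset (made up for illustration): 100 samples with 3 features each.
features = torch.arange(300, dtype=torch.float32).reshape(100, 3)
labels = torch.arange(100)
dataset = TensorDataset(features, labels)

# A seeded generator makes the split reproducible; the lengths sum to len(dataset).
generator = torch.Generator().manual_seed(119)
train, test = random_split(dataset, [80, 20], generator=generator)

print(type(train).__name__, len(train), len(test))

# Each Subset stores the indices it draws from the original dataset.
all_indices = sorted(list(train.indices) + list(test.indices))
print(all_indices == list(range(100)))
```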