If your data fits into CPU memory, it is good practice to save it into a single pickle file (or another format that you know how to deserialize). This has several advantages. First, it is easier and faster to read from one big file than from many small files. Second, it avoids the possible system error of opening too many files (although avoiding lazy data loading is another way to fix that issue). Some example datasets (e.g., MNIST) have separate training and testing files (i.e., 2 pickle files), so that research work based on them can be easily reproduced. I personally suggest that you keep only 1 file containing all the data when implementing your own Dataset class. You can always use the function torch.utils.data.random_split to split your dataset into training and testing datasets later. For more details, please refer to http://www.legendu.net/misc/blog/python-ai-split-dataset/.

If a single file is too big to load into memory, you can split the data into several parts and use the class torchvision.datasets.DatasetFolder to help you load them.

If you do want to keep the raw images as separate files, you can place them into subfolders whose names are the class names and then use the class torchvision.datasets.ImageFolder to load the data. torchvision.datasets.ImageFolder supports the following image extensions:
.jpg, .JPG, .jpeg, .JPEG, .png, .PNG, .ppm, .PPM, .bmp, and .BMP.

It is good practice to always shuffle the dataset for training because it helps model convergence. However, never shuffle the dataset for testing or prediction: a fixed order avoids surprises if you have to rely on the order of data points for evaluation.
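The advice above (one file, split later, shuffle only for training) can be sketched as follows. This is a minimal illustration rather than a definitive recipe: the toy data, the file name data.pkl, the 80/20 split and the seed are all assumptions made for the example.

```python
import pickle
import tempfile
from pathlib import Path

import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Hypothetical toy data: 100 samples with 4 features and binary labels.
features = torch.rand(100, 4)
labels = torch.randint(0, 2, (100,))

with tempfile.TemporaryDirectory() as tmp_dir:
    data_file = Path(tmp_dir) / "data.pkl"  # hypothetical file name
    # Save everything into a single pickle file.
    with data_file.open("wb") as fout:
        pickle.dump({"features": features, "labels": labels}, fout)
    # Load the whole file back into memory in one read.
    with data_file.open("rb") as fin:
        data = pickle.load(fin)

dataset = TensorDataset(data["features"], data["labels"])

# Split into 80 training and 20 testing samples; the seeded generator
# makes the split reproducible.
train_set, test_set = random_split(
    dataset, [80, 20], generator=torch.Generator().manual_seed(0)
)

# Shuffle for training, keep a fixed order for testing/prediction.
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
test_loader = DataLoader(test_set, batch_size=16, shuffle=False)

print(len(train_set), len(test_set))
```

Keeping the split in code (instead of in separate files) means the train/test proportion can be changed later without rewriting the data file.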
When you implement your own Dataset class, you need to inherit from torch.utils.data.Dataset (or one of its subclasses). You must override the 2 methods __len__ and __getitem__.

When you implement your own Dataset class for image classification, it is best to inherit from torchvision.datasets.vision.VisionDataset. For example, torchvision.datasets.MNIST subclasses torchvision.datasets.vision.VisionDataset, so you can use it as a template. Notice that you still only have to override the 2 methods __len__ and __getitem__ (even though the implementation of torchvision.datasets.MNIST is much more complicated than that). torchvision.datasets.MNIST downloads data into the directory MNIST/raw and makes a ready-to-use copy of the data in the directory MNIST/processed. It doesn't matter whether you follow this convention as long as you override the 2 methods __len__ and __getitem__. What's more, the parameter root of the constructor of torchvision.datasets.vision.VisionDataset is not critical as long as your Dataset subclass knows where and how to load the data (e.g., you can pass the full path of the data file as a parameter to your Dataset subclass). You can set it to None if you like.

When you implement a Dataset class for image classification, it is best to have the method __getitem__ produce (PIL.Image, target) and then use torchvision.transforms.ToTensor as the transform of the dataset to convert the PIL.Image to a tensor before it reaches the DataLoader (the default collate function of the DataLoader cannot batch PIL.Image objects). The reason is that the transforms in torchvision.transforms behave differently on a PIL.Image and on its equivalent numpy array. You might get surprises if you have __getitem__ return (torch.Tensor, target). If you do have __getitem__ return (torch.Tensor, target), make sure to double check that the tensors are as expected before feeding them into your model for training/prediction.

torchvision.transforms.ToTensor (referred to as ToTensor in the following) converts a PIL.Image to a numerical tensor with each value in [0, 1]. However, ToTensor on a boolean numpy array (representing a black/white image) returns a boolean tensor (instead of converting it to a numeric tensor). This is one reason that you should return (PIL.Image, target) and avoid returning (numpy.array, target) when implementing your own Dataset class for image classification.

There is no need to return the target as a torch.Tensor (even though you can) when you implement the method __getitem__ of your own Dataset class. The DataLoader converts the batch of target values to a torch.Tensor automatically.

If you already have your training/test data in tensor format, the simplest way to define a dataset is to use torch.utils.data.TensorDataset. However, one drawback of torch.utils.data.TensorDataset is that it currently does not provide a parameter for transforming tensors (even though discussions and requests have been made on this). When a transformation is needed, a simple alternative is to derive your own dataset class.
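import numpy as np
import torch
import torchvision

# ToTensor keeps a boolean numpy array as a boolean tensor
# (it only adds a leading channel dimension).
trans = torchvision.transforms.ToTensor()
arr = np.array([[True, True, False], [True, False, True]])
print(arr)
x = trans(arr)
print(x)

# TensorDataset wraps tensors that share the same first dimension.
x = torch.tensor([1, 2, 3, 4])
y = torch.tensor([1, 0, 1, 0])
dset = torch.utils.data.TensorDataset(x, y)
print(dset)
for d in dset:
    print(d)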
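Deriving your own dataset class for tensors that need a transformation might look like the sketch below. The class name TransformTensorDataset and its transform parameter are hypothetical (they are not part of PyTorch); the example only relies on overriding __len__ and __getitem__, and it returns the target as a plain Python int to show that the DataLoader collates targets into a tensor.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class TransformTensorDataset(Dataset):
    """Hypothetical tensor dataset with an optional transform (not a PyTorch class)."""

    def __init__(self, x, y, transform=None):
        if len(x) != len(y):
            raise ValueError("x and y must have the same length")
        self.x = x
        self.y = y
        self.transform = transform

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        sample = self.x[index]
        if self.transform is not None:
            sample = self.transform(sample)
        # Return the target as a plain Python int; the DataLoader
        # collates a batch of ints into a tensor.
        return sample, int(self.y[index])


x = torch.tensor([1.0, 2.0, 3.0, 4.0])
y = torch.tensor([1, 0, 1, 0])
# The doubling transform is only an illustration.
dset = TransformTensorDataset(x, y, transform=lambda t: t * 2)

for batch_x, batch_y in DataLoader(dset, batch_size=2):
    print(batch_x, batch_y)
```

The transform is applied per sample inside __getitem__, which mirrors how the torchvision datasets apply their transform parameter.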
ImagePaths - a More Generalized Dataset Class for Images
If you have a trained model and want to run it on unlabeled data, you need a dataset for unlabeled data. PyTorch does not provide such a class, but it is easy to implement one yourself. The class ImagePaths implemented below can handle data both with and without labels. It can be seen as a more generalized version of the torchvision.datasets.ImageFolder class.
from pathlib import Path

import torch
from PIL import Image


class ImagePaths(torch.utils.data.Dataset):
    """Dataset class for Image paths."""

    def __init__(
        self, paths, transform=None, transform_target=None, cache: bool = False
    ):
        """Initialize an Image Path object.
        :param paths: An iterable of paths to images.
            For example, you can get image paths using pathlib.Path.glob.
        :param transform: The transform function for the image (or input tensor).
        :param transform_target: The transform function for the target/label.
        :param cache: Whether to keep loaded (image, target) pairs in memory.
        """
        # Accept both str and pathlib.Path objects.
        self.paths = [Path(path) for path in paths]
        # The name of the parent directory is used as the label of an image.
        labels = set(path.parent.name for path in self.paths)
        if all(label.isdigit() for label in labels):
            self.class_to_idx = {label: int(label) for label in labels}
        else:
            # Sort the labels so that the mapping is deterministic.
            self.class_to_idx = {label: i for i, label in enumerate(sorted(labels))}
        self.transform = transform
        self.transform_target = transform_target
        self.cache = cache
        self._data = None
        if self.cache:
            self._data = [None] * len(self.paths)

    def __getitem__(self, index):
        if self.cache and self._data[index] is not None:
            return self._data[index]
        path = self.paths[index]
        img = Image.open(path).convert("RGB")
        if self.transform:
            img = self.transform(img)
        # For unlabeled data, the target is just the index of the parent
        # directory name and can be ignored.
        target = self.class_to_idx[path.parent.name]
        if self.transform_target:
            target = self.transform_target(target)
        pair = img, target
        if self.cache:
            self._data[index] = pair
        return pair

    def __len__(self):
        return len(self.paths)
torch.utils.data.DataLoader
Each batch in a torch.utils.data.DataLoader is a list of tensors. The length of the list matches the length of the tuple in the underlying Dataset. Each tensor in the list/batch has a first dimension matching the batch size (the last batch can be smaller).

If you have specified the option shuffle=False (default), the order of the DataLoader is fixed: you get the same sequence each time you iterate the DataLoader. However, if you have specified the option shuffle=True (which should be used for training), the order of the DataLoader is random. Each time you iterate the DataLoader, a new random order of the samples is drawn, and thus you get a different sequence on each pass.
x = torch.rand(10)
y = torch.tensor([0, 1]).repeat(5)
dataset = torch.utils.data.TensorDataset(x, y)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=3)
print("x:", x)
print("y:", y)
print(data_loader)
for elem in data_loader:
    print(type(elem))
    print(elem)
dir(data_loader)
help(torch.utils.data.DataLoader)
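The batching and shuffling behavior described in this section can be checked with a small sketch. The helper epoch_order is a hypothetical name introduced only for this example, and the data is chosen as torch.arange(10) so that the order of the samples is easy to read.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

x = torch.arange(10)
y = torch.tensor([0, 1]).repeat(5)
dataset = TensorDataset(x, y)


def epoch_order(loader):
    """Return the x values in the order the loader yields them."""
    return torch.cat([batch_x for batch_x, _ in loader]).tolist()


fixed_loader = DataLoader(dataset, batch_size=3, shuffle=False)
first_batch = next(iter(fixed_loader))
# One tensor per element of the dataset's (x, y) tuple,
# each with the batch size as its first dimension.
print(len(first_batch), [t.shape for t in first_batch])

# Without shuffling, every pass yields the same sequence.
print(epoch_order(fixed_loader) == epoch_order(fixed_loader))

# With shuffling, the order is redrawn on each pass, but every
# sample still appears exactly once per pass.
shuffled_loader = DataLoader(dataset, batch_size=3, shuffle=True)
print(epoch_order(shuffled_loader))
print(epoch_order(shuffled_loader))
```

Note that shuffling does not modify the dataset itself; only the order in which indices are visited changes.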
Other Useful Dataset Classes
References
https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py
https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset
https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
https://pytorch.org/docs/stable/data.html