Ben Chuanlong Du's Blog

It is never too late to learn.

Tips on Dataset in PyTorch

  1. If your data fits into CPU memory, it is good practice to save it into a single pickle file (or another format that you know how to deserialize). This has several advantages. First, reading one big file is easier and faster than reading many small files. Second, it avoids the possible system error of opening too many files (avoiding lazy data loading is another way to fix that issue). Some example datasets (e.g., MNIST) have separate training and testing files (i.e., 2 pickle files) so that research based on them can be easily reproduced. I personally suggest that you keep only 1 file containing all data when implementing your own Dataset class. You can always use the function torch.utils.data.random_split to split your dataset into training and testing datasets later. For more details, please refer to http://www.legendu.net/misc/blog/python-ai-split-dataset/.

    If a single file is too big to load into memory, you can split the data into several parts and use the class torchvision.datasets.DatasetFolder to help you load them. If you do want to keep the raw images as separate files, you can place them into different subfolders whose names represent the class names and then use the class torchvision.datasets.ImageFolder to help you load the data. torchvision.datasets.ImageFolder supports common image extensions such as .jpg, .jpeg, .png, .ppm and .bmp in both lower and upper case (the exact list depends on the torchvision version).

  2. It is good practice to always shuffle the dataset for training, as it helps the model converge. However, never shuffle the dataset for testing or prediction; keeping a fixed order avoids surprises if you rely on the order of data points for evaluation.

  3. When you implement your own Dataset class, you need to inherit from torch.utils.data.Dataset (or one of its subclasses). You must override the 2 methods __len__ and __getitem__.

  4. When you implement your own Dataset class for image classification, it is best to inherit from torchvision.datasets.vision.VisionDataset. For example, torchvision.datasets.MNIST subclasses torchvision.datasets.vision.VisionDataset. You can use it as a template. Notice that you still only have to override the 2 methods __len__ and __getitem__ (even though the implementation of torchvision.datasets.MNIST is much more complicated than that). torchvision.datasets.MNIST downloads data into the directory MNIST/raw and makes a copy of ready-to-use data in the directory MNIST/processed. It doesn't matter whether you follow this convention or not, as long as you override the 2 methods __len__ and __getitem__. What's more, the parameter root of the constructor of torchvision.datasets.vision.VisionDataset is not critical as long as your Dataset subclass knows where and how to load the data (e.g., you can pass the full path of the data file as a parameter to your Dataset subclass). You can set it to None if you like.

  5. When you implement a Dataset class for image classification, it is best to have the method __getitem__ load each sample as (PIL.Image, target) and then use torchvision.transforms.ToTensor to convert the PIL.Image to a tensor, typically as part of the transform applied before the DataLoader batches the samples. The reason is that the transforms in torchvision.transforms behave differently on a PIL.Image and on its equivalent numpy array. You might get surprises if you have __getitem__ build (torch.Tensor, target) directly from raw arrays. If you do, make sure to double check that the tensors are as expected before feeding them into your model for training/prediction.

  6. torchvision.transforms.ToTensor (referred to as ToTensor in the following) converts a PIL.Image to a numerical tensor with each value in the range [0, 1]. ToTensor on a boolean numpy array (representing a black/white image) returns a boolean tensor (instead of converting it to a numeric tensor). This is one reason that you should return (PIL.Image, target) and avoid returning (numpy.array, target) when implementing your own Dataset class for image classification.

  7. There is no need to return the target as a torch.Tensor (even though you can) when you implement the method __getitem__ of your own Dataset class. The DataLoader will convert the batch of target values to torch.Tensor automatically.

  8. If you already have your training/test data in tensor format, the simplest way to define a dataset is to use torch.utils.data.TensorDataset. However, one drawback of torch.utils.data.TensorDataset is that it currently does not provide a parameter for transforming tensors (even though discussions and requests have been made on this). When a transformation is needed, a simple alternative is to derive your own dataset class.
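
The tips on keeping all data in one file and on overriding only __len__ and __getitem__ can be sketched as follows. This is a minimal illustration rather than a reference implementation: the class name PickleDataset, the file name data.pkl and the dict layout with the keys "data" and "targets" are assumptions made for this example only.

```python
import pickle
import tempfile
from pathlib import Path

import torch
from torch.utils.data import Dataset, random_split


class PickleDataset(Dataset):
    """Hypothetical dataset that keeps all samples of one pickle file in memory."""

    def __init__(self, path, transform=None):
        with open(path, "rb") as fin:
            # Assumed layout: a dict with the keys "data" and "targets".
            obj = pickle.load(fin)
        self.data = obj["data"]
        self.targets = obj["targets"]
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        x = self.data[index]
        if self.transform:
            x = self.transform(x)
        # The target stays a plain Python int here.
        return x, self.targets[index]


# Write a small fake dataset to a temporary pickle file for demonstration.
path = Path(tempfile.mkdtemp()) / "data.pkl"
with open(path, "wb") as fout:
    pickle.dump({"data": torch.rand(10, 4), "targets": [0, 1] * 5}, fout)

dataset = PickleDataset(path)
# A seeded generator makes the train/test split reproducible.
generator = torch.Generator().manual_seed(0)
train_set, test_set = random_split(dataset, [8, 2], generator=generator)
print(len(train_set), len(test_set))
```

Keeping a single file and splitting it with torch.utils.data.random_split means the split ratio can be changed later without touching the stored data.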

In [1]:
import numpy as np
import torch
import torchvision
In [98]:
trans = torchvision.transforms.ToTensor()
In [99]:
arr = np.array([[True, True, False], [True, False, True]])
arr
Out[99]:
array([[ True,  True, False],
       [ True, False,  True]])
In [100]:
x = trans(arr)
x
Out[100]:
tensor([[[ True,  True, False],
         [ True, False,  True]]])
In [3]:
x = torch.tensor([1, 2, 3, 4])
y = torch.tensor([1, 0, 1, 0])
dset = torch.utils.data.TensorDataset(x, y)
dset
Out[3]:
<torch.utils.data.dataset.TensorDataset at 0x7f2c440b5da0>
In [4]:
for d in dset:
    print(d)
(tensor(1), tensor(1))
(tensor(2), tensor(0))
(tensor(3), tensor(1))
(tensor(4), tensor(0))
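
When a transformation is needed on top of TensorDataset, deriving a small subclass is usually enough. The sketch below assumes that only the first tensor (the input) should be transformed; the class name TransformedTensorDataset and its transform parameter are hypothetical and not part of PyTorch.

```python
import torch
from torch.utils.data import TensorDataset


class TransformedTensorDataset(TensorDataset):
    """Hypothetical TensorDataset variant that transforms the first tensor."""

    def __init__(self, *tensors, transform=None):
        super().__init__(*tensors)
        self.transform = transform

    def __getitem__(self, index):
        # TensorDataset returns a tuple with one element per wrapped tensor.
        x, *rest = super().__getitem__(index)
        if self.transform:
            x = self.transform(x)
        return (x, *rest)


x = torch.tensor([1.0, 2.0, 3.0, 4.0])
y = torch.tensor([1, 0, 1, 0])
dset = TransformedTensorDataset(x, y, transform=lambda t: t * 2)
print(dset[1])
```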

ImagePaths - a More Generalized Dataset Class for Images

If you have a trained model and want to run it on unlabeled data, you need a dataset for unlabeled data. PyTorch does not have such a class, but it is very easy to implement one yourself. The class ImagePaths implemented below can handle data both with and without labels. It can be seen as a more generalized version of the torchvision.datasets.ImageFolder class.

In [1]:
import torch
from PIL import Image


class ImagePaths(torch.utils.data.Dataset):
    """Dataset class for Image paths."""

    def __init__(
        self, paths, transform=None, transform_target=None, cache: bool = False
    ):
        """Initialize an Image Path object.
        :param paths: An iterable of paths to images.
            For example, you can get image paths using pathlib.Path.glob.
        :param transform: The transform function for the image (or input tensor).
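        :param transform_target: The transform function for the target/label.
        :param cache: If True, keep loaded (image, target) pairs in memory
            so that each image file is read and transformed only once.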
        """
        self.paths = list(paths)
        labels = set(path.parent.name for path in self.paths)
        if all(label.isdigit() for label in labels):
            self.class_to_idx = {label: int(label) for label in labels}
        else:
            # Sort the labels so that the label-to-index mapping is the same on every run.
            self.class_to_idx = {label: i for i, label in enumerate(sorted(labels))}
        self.transform = transform
        self.transform_target = transform_target
        self.cache = cache
        self._data = None
        if self.cache:
            self._data = [None] * len(self.paths)

    def __getitem__(self, index):
        if self.cache and self._data[index] is not None:
            return self._data[index]
        path = self.paths[index]
        img = Image.open(path).convert("RGB")
        if self.transform:
            img = self.transform(img)
        target = self.class_to_idx[path.parent.name]
        if self.transform_target:
            target = self.transform_target(target)
        pair = img, target
        if self.cache:
            self._data[index] = pair
        return pair

    def __len__(self):
        return len(self.paths)
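
For purely unlabeled data, a stripped-down variant of the same idea can return the file path in place of a target, which makes it easy to match predictions back to files. The sketch below is an illustration under assumptions: the class UnlabeledImagePaths and the helper to_tensor are hypothetical names, and the images are tiny files generated on the fly.

```python
import tempfile
from pathlib import Path

import numpy as np
import torch
from PIL import Image


class UnlabeledImagePaths(torch.utils.data.Dataset):
    """Hypothetical label-free variant that yields (image, path as str)."""

    def __init__(self, paths, transform=None):
        self.paths = [Path(p) for p in paths]
        self.transform = transform

    def __getitem__(self, index):
        path = self.paths[index]
        img = Image.open(path).convert("RGB")
        if self.transform:
            img = self.transform(img)
        return img, str(path)

    def __len__(self):
        return len(self.paths)


def to_tensor(img):
    """Convert an RGB PIL image (HWC, uint8) to a CHW float tensor in [0, 1]."""
    arr = np.array(img, dtype=np.uint8)
    return torch.from_numpy(arr).permute(2, 0, 1).float() / 255


# Create three small images (width 8, height 6) in a temporary directory.
root = Path(tempfile.mkdtemp())
for i in range(3):
    Image.new("RGB", (8, 6), color=(i * 40, 0, 0)).save(root / f"{i}.png")

dataset = UnlabeledImagePaths(sorted(root.glob("*.png")), transform=to_tensor)
loader = torch.utils.data.DataLoader(dataset, batch_size=2)
images, paths = next(iter(loader))
print(images.shape, paths)
```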

torch.utils.data.DataLoader

Each batch in a torch.utils.data.DataLoader is a list of tensors. The length of the list matches the length of the tuple returned by the underlying Dataset. Each tensor in the list/batch has a first dimension matching the batch size (the last batch can be smaller).

If you specify shuffle=False (the default), the order of the DataLoader is fixed: you get the same sequence each time you iterate over it. If you specify shuffle=True (which should be used for training), the sample indices are reshuffled each time you iterate over the DataLoader, so you get a different sequence in every epoch.
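
The difference between the two settings can be checked with a small experiment. The sketch below uses the integers 0 to 9 as data and passes a seeded torch.Generator to make the shuffled order reproducible; the variable names are chosen for this example only.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10))

# shuffle=False: every pass visits the samples in the same order.
fixed = DataLoader(dataset, batch_size=5, shuffle=False)
first_pass = [batch[0].tolist() for batch in fixed]
second_pass = [batch[0].tolist() for batch in fixed]
print(first_pass == second_pass)

# shuffle=True: a new permutation of the indices is drawn for every pass.
generator = torch.Generator().manual_seed(0)
shuffled = DataLoader(dataset, batch_size=5, shuffle=True, generator=generator)
epoch1 = torch.cat([batch[0] for batch in shuffled]).tolist()
epoch2 = torch.cat([batch[0] for batch in shuffled]).tolist()
print(epoch1)
print(epoch2)
```

Each shuffled pass still visits every sample exactly once; only the order changes.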

In [16]:
x = torch.rand(10)
y = torch.tensor([0, 1]).repeat(5)
dataset = torch.utils.data.TensorDataset(x, y)
data_loader = torch.utils.data.DataLoader(dataset, batch_size=3)
print("x:", x)
print("y:", y)
print(data_loader)
x: tensor([0.0588, 0.2284, 0.0248, 0.3235, 0.4076, 0.7178, 0.5656, 0.5177, 0.9233,
        0.8219])
y: tensor([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
<torch.utils.data.dataloader.DataLoader object at 0x7f2c42373518>
In [12]:
for elem in data_loader:
    print(type(elem))
    print(elem)
<class 'list'>
[tensor([0.4187, 0.3091, 0.6890]), tensor([0, 1, 0])]
<class 'list'>
[tensor([0.1153, 0.6208, 0.3379]), tensor([1, 0, 1])]
<class 'list'>
[tensor([0.4417, 0.0541, 0.3020]), tensor([0, 1, 0])]
<class 'list'>
[tensor([0.2231]), tensor([1])]
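
The same batching step is what turns plain Python targets into tensors, which is why __getitem__ does not have to return the target as a torch.Tensor. A minimal sketch, using a made-up dataset class IntTargetDataset whose targets are Python ints:

```python
import torch
from torch.utils.data import DataLoader, Dataset


class IntTargetDataset(Dataset):
    """Hypothetical dataset whose targets are plain Python ints."""

    def __len__(self):
        return 4

    def __getitem__(self, index):
        # The input is a tensor; the target is an int (0 or 1).
        return torch.full((2,), float(index)), index % 2


loader = DataLoader(IntTargetDataset(), batch_size=4)
inputs, targets = next(iter(loader))
# The default collate function stacks the ints into one integer tensor.
print(type(targets), targets.dtype, targets)
```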
In [11]:
dir(data_loader)
Out[11]:
['_DataLoader__initialized',
 '_DataLoader__multiprocessing_context',
 '_IterableDataset_len_called',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_auto_collation',
 '_dataset_kind',
 '_index_sampler',
 'batch_sampler',
 'batch_size',
 'collate_fn',
 'dataset',
 'drop_last',
 'multiprocessing_context',
 'num_workers',
 'pin_memory',
 'sampler',
 'timeout',
 'worker_init_fn']
In [6]:
?torch.utils.data.DataLoader
Init signature:
torch.utils.data.DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    sampler=None,
    batch_sampler=None,
    num_workers=0,
    collate_fn=None,
    pin_memory=False,
    drop_last=False,
    timeout=0,
    worker_init_fn=None,
    multiprocessing_context=None,
)
Docstring:     
Data loader. Combines a dataset and a sampler, and provides an iterable over
the given dataset.

The :class:`~torch.utils.data.DataLoader` supports both map-style and
iterable-style datasets with single- or multi-process loading, customizing
loading order and optional automatic batching (collation) and memory pinning.

See :py:mod:`torch.utils.data` documentation page for more details.

Arguments:
    dataset (Dataset): dataset from which to load the data.
    batch_size (int, optional): how many samples per batch to load
        (default: ``1``).
    shuffle (bool, optional): set to ``True`` to have the data reshuffled
        at every epoch (default: ``False``).
    sampler (Sampler, optional): defines the strategy to draw samples from
        the dataset. If specified, :attr:`shuffle` must be ``False``.
    batch_sampler (Sampler, optional): like :attr:`sampler`, but returns a batch of
        indices at a time. Mutually exclusive with :attr:`batch_size`,
        :attr:`shuffle`, :attr:`sampler`, and :attr:`drop_last`.
    num_workers (int, optional): how many subprocesses to use for data
        loading. ``0`` means that the data will be loaded in the main process.
        (default: ``0``)
    collate_fn (callable, optional): merges a list of samples to form a
        mini-batch of Tensor(s).  Used when using batched loading from a
        map-style dataset.
    pin_memory (bool, optional): If ``True``, the data loader will copy Tensors
        into CUDA pinned memory before returning them.  If your data elements
        are a custom type, or your :attr:`collate_fn` returns a batch that is a custom type,
        see the example below.
    drop_last (bool, optional): set to ``True`` to drop the last incomplete batch,
        if the dataset size is not divisible by the batch size. If ``False`` and
        the size of dataset is not divisible by the batch size, then the last batch
        will be smaller. (default: ``False``)
    timeout (numeric, optional): if positive, the timeout value for collecting a batch
        from workers. Should always be non-negative. (default: ``0``)
    worker_init_fn (callable, optional): If not ``None``, this will be called on each
        worker subprocess with the worker id (an int in ``[0, num_workers - 1]``) as
        input, after seeding and before data loading. (default: ``None``)


.. warning:: If the ``spawn`` start method is used, :attr:`worker_init_fn`
             cannot be an unpicklable object, e.g., a lambda function. See
             :ref:`multiprocessing-best-practices` on more details related
             to multiprocessing in PyTorch.

.. note:: ``len(dataloader)`` heuristic is based on the length of the sampler used.
          When :attr:`dataset` is an :class:`~torch.utils.data.IterableDataset`,
          ``len(dataset)`` (if implemented) is returned instead, regardless
          of multi-process loading configurations, because PyTorch trust
          user :attr:`dataset` code in correctly handling multi-process
          loading to avoid duplicate data. See `Dataset Types`_ for more
          details on these two types of datasets and how
          :class:`~torch.utils.data.IterableDataset` interacts with `Multi-process data loading`_.
File:           ~/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py
Type:           type
Subclasses:     
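
The collate_fn argument described above is useful when samples cannot be stacked directly, for example sequences of different lengths. The sketch below pads each batch with torch.nn.utils.rnn.pad_sequence; the function name pad_collate and the toy samples are assumptions for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# A plain list of (sequence, target) pairs works as a map-style dataset.
samples = [
    (torch.tensor([1, 2, 3]), 0),
    (torch.tensor([4]), 1),
    (torch.tensor([5, 6]), 0),
]


def pad_collate(batch):
    """Pad variable-length sequences with zeros and stack the targets."""
    seqs, targets = zip(*batch)
    padded = pad_sequence(list(seqs), batch_first=True)
    return padded, torch.tensor(targets)


loader = DataLoader(samples, batch_size=3, collate_fn=pad_collate)
padded, targets = next(iter(loader))
print(padded)
print(targets)
```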