Ben Chuanlong Du's Blog

It is never too late to learn.

Pad a Sequence in Python

In [36]:
import numpy as np
import torch
import tensorflow as tf
In [23]:
x = torch.tensor(
    [
        [1.0, 2, 3, 4, 5],
        [6.0, 7, 8, 9, 10],
    ]
)
x
Out[23]:
tensor([[ 1.,  2.,  3.,  4.,  5.],
        [ 6.,  7.,  8.,  9., 10.]])

Tips

  1. numpy.pad and torch.nn.utils.rnn.pad_sequence can only increase the length of a sequence (numpy array, list or tensor), while tf.keras.preprocessing.sequence.pad_sequences can both increase and decrease it (sequences longer than maxlen are truncated).

  2. numpy.pad implements many padding modes (constant, edge, linear_ramp, maximum, mean, median, minimum, reflect, symmetric, wrap, empty, and an arbitrary padding function), while torch.nn.utils.rnn.pad_sequence and tf.keras.preprocessing.sequence.pad_sequences only support padding with a constant value (as this is the main use case in NLP).

  3. You can easily control the final length (after padding) with numpy.pad and tf.keras.preprocessing.sequence.pad_sequences. torch.nn.utils.rnn.pad_sequence pads each tensor to the maximum length among all the tensors, so you cannot easily use it to pad sequences to an arbitrary length.

  4. numpy.pad pads a single array-like object (numpy array, list or Tensor), torch.nn.utils.rnn.pad_sequence pads a sequence of Tensors, and tf.keras.preprocessing.sequence.pad_sequences pads a sequence of iterable objects (numpy arrays, lists or Tensors).

Overall, tf.keras.preprocessing.sequence.pad_sequences is the most useful for NLP. torch.nn.utils.rnn.pad_sequence seems to be quite limited. numpy.pad can be used to easily implement a customized padding strategy.
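
To illustrate the last point, here is a minimal sketch of a fixed-length padding helper built on numpy.pad. The name pad_or_truncate and its parameters are hypothetical (they are not part of any library), and the helper only handles 1-D sequences, padding and truncating at the end.

```python
import numpy as np


def pad_or_truncate(seqs, maxlen, value=0):
    """Pad or truncate (both at the end) each 1-D sequence to maxlen.

    Hypothetical helper for illustration; not a library function.
    """
    rows = []
    for seq in seqs:
        arr = np.asarray(seq)[:maxlen]  # drop anything beyond maxlen
        missing = maxlen - len(arr)  # number of padding values still needed
        rows.append(np.pad(arr, (0, missing), "constant", constant_values=value))
    return np.stack(rows)


padded = pad_or_truncate([[1, 2, 3, 4, 5], [6, 7]], maxlen=4)
print(padded)
```

The first sequence is cut to 4 elements and the second is extended with two zeros, so both rows end up with the same length and can be stacked into one array.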

In [24]:
a = [1, 2, 3, 4, 5]
np.pad(a, (2, 3), "constant", constant_values=(4, 6))
Out[24]:
array([4, 4, 1, 2, 3, 4, 5, 6, 6, 6])
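
Besides "constant", numpy.pad supports several other modes. The short sketch below pads the same list with three of them so the differences are easy to compare; the variable names are chosen only for this example.

```python
import numpy as np

a = [1, 2, 3, 4, 5]

edge = np.pad(a, (2, 3), "edge")  # repeat the first/last value
reflect = np.pad(a, (2, 3), "reflect")  # mirror without repeating the edge value
wrap = np.pad(a, (2, 3), "wrap")  # continue from the opposite end

print(edge)  # [1 1 1 2 3 4 5 5 5 5]
print(reflect)  # [3 2 1 2 3 4 5 4 3 2]
print(wrap)  # [4 5 1 2 3 4 5 1 2 3]
```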
In [27]:
t = torch.nn.utils.rnn.pad_sequence(
    [
        torch.tensor([1, 2, 3]),
        torch.tensor([1, 2, 3, 4]),
    ]
)
t
Out[27]:
tensor([[1, 1],
        [2, 2],
        [3, 3],
        [0, 4]])
In [28]:
t[0]
Out[28]:
tensor([1, 1])
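
Note that t[0] above is the first time step of both sequences, not the first sequence, because pad_sequence returns a tensor of shape (max_len, batch) by default. A small sketch of the batch-first layout, using the batch_first and padding_value parameters of pad_sequence (the padding value -1 is picked arbitrarily for this example):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([1, 2, 3]), torch.tensor([1, 2, 3, 4])]

# batch_first=True puts one sequence per row; padding_value replaces the default 0.
batch = pad_sequence(seqs, batch_first=True, padding_value=-1)
print(batch.shape)  # torch.Size([2, 4])
print(batch[0])  # tensor([ 1,  2,  3, -1])
```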
In [25]:
torch.nn.utils.rnn.pad_sequence(
    [
        [1, 2, 3],
        [1, 2, 3, 4],
    ]
)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-25-242698656610> in <module>
      2     [
      3         [1, 2, 3],
----> 4         [1, 2, 3, 4],
      5     ]
      6 )

~/.local/lib/python3.7/site-packages/torch/nn/utils/rnn.py in pad_sequence(sequences, batch_first, padding_value)
    325     # assuming trailing dimensions and type of all the Tensors
    326     # in sequences are same and fetching those from sequences[0]
--> 327     max_size = sequences[0].size()
    328     trailing_dims = max_size[1:]
    329     max_len = max([s.size(0) for s in sequences])

AttributeError: 'list' object has no attribute 'size'
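
The error above goes away once each list is converted to a tensor. The sketch below does that and then, as one possible workaround for the fixed-length limitation, extends the result with torch.nn.functional.pad. The target length of 6 is an assumption made for this example, and the approach only works when the target is at least as long as the longest sequence.

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

raw = [[1, 2, 3], [1, 2, 3, 4]]
target_len = 6  # assumed target length, chosen only for this example

# Convert each list to a tensor first; pad_sequence calls .size() on its inputs.
batch = pad_sequence([torch.tensor(s) for s in raw], batch_first=True)

# The (left, right) pair applies to the last dimension, so this appends zeros.
extra = target_len - batch.size(1)
fixed = F.pad(batch, (0, extra), value=0)
print(fixed)
```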
In [34]:
tf.keras.preprocessing.sequence.pad_sequences(
    [[1, 2, 3, 4, 5]],
    maxlen=3,
    dtype="long",
    value=0,
    truncating="post",
    padding="post",
)
Out[34]:
array([[1, 2, 3]])
In [35]:
tf.keras.preprocessing.sequence.pad_sequences(
    [[1, 2, 3, 4, 5]],
    maxlen=9,
    dtype="long",
    value=0,
    truncating="post",
    padding="post",
)
Out[35]:
array([[1, 2, 3, 4, 5, 0, 0, 0, 0]])