Pad a Sequence in Python

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

import numpy as np
import torch
import tensorflow as tf

x = torch.tensor(
    [
        [1.0, 2, 3, 4, 5],
        [6.0, 7, 8, 9, 10],
    ]
)
x

tensor([[ 1.,  2.,  3.,  4.,  5.],
        [ 6.,  7.,  8.,  9., 10.]])

Tips

numpy.pad and torch.nn.utils.rnn.pad_sequence can only increase the length of a sequence (NumPy array, list or tensor), while tf.keras.preprocessing.sequence.pad_sequences can both increase and decrease it (sequences longer than maxlen are truncated).
numpy.pad implements many padding modes (constant, edge, linear_ramp, maximum, mean, median, minimum, reflect, symmetric, wrap, empty, and an arbitrary padding function), while torch.nn.utils.rnn.pad_sequence and tf.keras.preprocessing.sequence.pad_sequences only support padding with a constant value (the usual use case in NLP).
You can easily control the final length (after padding) with numpy.pad and tf.keras.preprocessing.sequence.pad_sequences. torch.nn.utils.rnn.pad_sequence pads each tensor to the maximum length among all the tensors, so you cannot easily use it to pad sequences to an arbitrary length.
numpy.pad pads a single iterable object (NumPy array, list or Tensor), torch.nn.utils.rnn.pad_sequence pads a sequence of Tensors, and tf.keras.preprocessing.sequence.pad_sequences pads a sequence of iterable objects (NumPy arrays, lists or Tensors).

Overall, tf.keras.preprocessing.sequence.pad_sequences is the most useful for NLP. torch.nn.utils.rnn.pad_sequence seems to be quite limited. numpy.pad can be used to easily implement a customized padding strategy.
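As a rough illustration of the last tip, here is a minimal sketch of a customized strategy built on numpy.pad. The helper name pad_or_truncate and its parameters are hypothetical (they are not part of NumPy); it only imitates the pad-or-truncate behavior described above for 1-D sequences.

```python
import numpy as np


def pad_or_truncate(seq, maxlen, value=0, padding="post", truncating="post"):
    """Hypothetical helper: force a 1-D sequence to exactly `maxlen` items."""
    arr = np.asarray(seq)
    # Truncate first if the sequence is too long.
    if len(arr) > maxlen:
        if truncating == "post":
            arr = arr[:maxlen]
        else:
            arr = arr[len(arr) - maxlen:]
    # Then pad with a constant on the chosen side (a no-op if nothing is missing).
    missing = maxlen - len(arr)
    pad_width = (0, missing) if padding == "post" else (missing, 0)
    return np.pad(arr, pad_width, mode="constant", constant_values=value)


print(pad_or_truncate([1, 2, 3, 4, 5], 3))   # [1 2 3]
print(pad_or_truncate([1, 2, 3, 4, 5], 9))   # [1 2 3 4 5 0 0 0 0]
print(pad_or_truncate([1, 2, 3], 5, value=-1, padding="pre"))  # [-1 -1  1  2  3]
```

The truncation step is plain slicing; numpy.pad itself never shortens its input.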

numpy.pad

a = [1, 2, 3, 4, 5]
np.pad(a, (2, 3), "constant", constant_values=(4, 6))

array([4, 4, 1, 2, 3, 4, 5, 6, 6, 6])
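Beyond constant padding, a few of the other modes mentioned in the tips can be sketched as follows. The callable fill_with and its fill keyword are hypothetical names chosen for illustration; NumPy only prescribes the (vector, pad_width, iaxis, kwargs) signature and expects the function to modify the vector in place.

```python
import numpy as np

a = [1, 2, 3, 4, 5]

# Repeat the edge values.
edge = np.pad(a, (2, 3), mode="edge")
# Mirror the values around the first and last elements.
reflect = np.pad(a, (2, 3), mode="reflect")
# Wrap around to the opposite end of the array.
wrap = np.pad(a, (2, 3), mode="wrap")


def fill_with(vector, pad_width, iaxis, kwargs):
    """Hypothetical padding function: overwrite the padded slots in place."""
    fill = kwargs.get("fill", -1)
    before, after = pad_width
    if before > 0:
        vector[:before] = fill
    if after > 0:
        vector[-after:] = fill


custom = np.pad(a, (2, 3), fill_with, fill=9)

print(edge)     # [1 1 1 2 3 4 5 5 5 5]
print(reflect)  # [3 2 1 2 3 4 5 4 3 2]
print(wrap)     # [4 5 1 2 3 4 5 1 2 3]
print(custom)   # [9 9 1 2 3 4 5 9 9 9]
```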

torch.nn.utils.rnn.pad_sequence

t = torch.nn.utils.rnn.pad_sequence(
    [
        torch.tensor([1, 2, 3]),
        torch.tensor([1, 2, 3, 4]),
    ]
)
t

tensor([[1, 1],
        [2, 2],
        [3, 3],
        [0, 4]])

t[0]

tensor([1, 1])
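Note that with the default batch_first=False the first dimension indexes positions, so t[0] above is the first element of every sequence rather than the first sequence. A small sketch of the batch-first layout, also using the padding_value argument with an arbitrarily chosen fill of -1:

```python
import torch

seqs = [
    torch.tensor([1, 2, 3]),
    torch.tensor([1, 2, 3, 4]),
]

# batch_first=True gives shape (batch, max_len) instead of (max_len, batch).
batch = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True, padding_value=-1)

print(batch.shape)  # torch.Size([2, 4])
print(batch[0])     # tensor([ 1,  2,  3, -1]), the first (padded) sequence
```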

torch.nn.utils.rnn.pad_sequence(
    [
        [1, 2, 3],
        [1, 2, 3, 4],
    ]
)

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-25-242698656610> in <module>
      2     [
      3         [1, 2, 3],
----> 4         [1, 2, 3, 4],
      5     ]
      6 )

~/.local/lib/python3.7/site-packages/torch/nn/utils/rnn.py in pad_sequence(sequences, batch_first, padding_value)
    325     # assuming trailing dimensions and type of all the Tensors
    326     # in sequences are same and fetching those from sequences[0]
--> 327     max_size = sequences[0].size()
    328     trailing_dims = max_size[1:]
    329     max_len = max([s.size(0) for s in sequences])

AttributeError: 'list' object has no attribute 'size'
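The error above occurs because pad_sequence expects Tensors rather than plain lists. One possible workaround, sketched below, is to convert each list first. The second step is an assumption about how one might reach an arbitrary length: it uses torch.nn.functional.pad, which is a separate function and not part of pad_sequence, and target_len is an illustrative value.

```python
import torch
import torch.nn.functional as F

sequences = [
    [1, 2, 3],
    [1, 2, 3, 4],
]

# Convert each list to a Tensor before padding.
tensors = [torch.tensor(s) for s in sequences]
padded = torch.nn.utils.rnn.pad_sequence(tensors, batch_first=True)

# Extend the last dimension on the right up to an illustrative target length.
target_len = 6
extended = F.pad(padded, (0, target_len - padded.size(1)), value=0)

print(padded)    # shape (2, 4)
print(extended)  # shape (2, 6)
```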

tf.keras.preprocessing.sequence.pad_sequences

tf.keras.preprocessing.sequence.pad_sequences(
    [[1, 2, 3, 4, 5]],
    maxlen=3,
    dtype="long",
    value=0,
    truncating="post",
    padding="post",
)

array([[1, 2, 3]])

tf.keras.preprocessing.sequence.pad_sequences(
    [[1, 2, 3, 4, 5]],
    maxlen=9,
    dtype="long",
    value=0,
    truncating="post",
    padding="post",
)

array([[1, 2, 3, 4, 5, 0, 0, 0, 0]])
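The examples above set padding and truncating to "post" explicitly. Both default to "pre", which pads and truncates at the start of each sequence. A short sketch on a batch of two sequences of different lengths (the batch contents are illustrative, and this assumes an installed TensorFlow version that still exposes this module path):

```python
import tensorflow as tf

pad_sequences = tf.keras.preprocessing.sequence.pad_sequences

batch = [[1, 2, 3], [1, 2, 3, 4, 5]]

# Defaults: padding="pre", truncating="pre".
default = pad_sequences(batch, maxlen=4)
# Pad and truncate at the end instead.
post = pad_sequences(batch, maxlen=4, padding="post", truncating="post")
# Without maxlen, every sequence is padded to the longest one in the batch.
longest = pad_sequences(batch, padding="post")

print(default)  # [[0 1 2 3], [2 3 4 5]]
print(post)     # [[1 2 3 0], [1 2 3 4]]
print(longest)  # [[1 2 3 0 0], [1 2 3 4 5]]
```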

Reference

https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences

https://docs.scipy.org/doc/numpy/reference/generated/numpy.pad.html

https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pad_sequence