Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

import numpy as np
import torch
import tensorflow as tf
x = torch.tensor(
    [
        [1.0, 2, 3, 4, 5],
        [6.0, 7, 8, 9, 10],
    ]
)
x
tensor([[ 1., 2., 3., 4., 5.], [ 6., 7., 8., 9., 10.]])

Tips

  1. numpy.pad and torch.nn.utils.rnn.pad_sequence can only increase the length of a sequence (numpy array, list or tensor), while tf.keras.preprocessing.sequence.pad_sequences can both increase and decrease (truncate) the length of a sequence.

  2. numpy.pad implements many different padding modes (constant, edge, linear_ramp, maximum, mean, median, minimum, reflect, symmetric, wrap, empty and an arbitrary padding function), while torch.nn.utils.rnn.pad_sequence and tf.keras.preprocessing.sequence.pad_sequences only support padding with a constant value (as this is the only use case in NLP).

  3. You can easily control the final length (after padding) with numpy.pad and tf.keras.preprocessing.sequence.pad_sequences. torch.nn.utils.rnn.pad_sequence pads each tensor to the maximum length among all tensors, so you cannot easily use torch.nn.utils.rnn.pad_sequence to pad sequences to an arbitrary length.

  4. numpy.pad pads a single iterable object (numpy array, list or Tensor), torch.nn.utils.rnn.pad_sequence pads a sequence of Tensors, and tf.keras.preprocessing.sequence.pad_sequences pads a sequence of iterable objects (numpy arrays, lists or Tensors).

Overall, tf.keras.preprocessing.sequence.pad_sequences is the most useful for NLP. torch.nn.utils.rnn.pad_sequence seems to be quite limited. numpy.pad can be used to easily implement a customized padding strategy.
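As a rough illustration of the last point, here is a minimal sketch of a fixed-length "pad or truncate" helper built on numpy.pad. The name pad_or_truncate and its parameters are hypothetical (not part of any library), and it only covers "post" padding and "post" truncation of a 1-D sequence.

```python
import numpy as np


def pad_or_truncate(seq, maxlen, value=0):
    """Hypothetical helper: return `seq` as a 1-D array of exactly `maxlen` items."""
    arr = np.asarray(seq)[:maxlen]      # drop items beyond maxlen ("post" truncation)
    n_missing = maxlen - arr.shape[0]   # number of pad items still needed (>= 0)
    return np.pad(arr, (0, n_missing), "constant", constant_values=value)


short = pad_or_truncate([1, 2, 3, 4, 5], maxlen=9)  # padded on the right
long_ = pad_or_truncate([1, 2, 3, 4, 5], maxlen=3)  # truncated on the right
```

Slicing handles the "decrease" case that numpy.pad alone cannot do, and numpy.pad handles the "increase" case.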

a = [1, 2, 3, 4, 5]
np.pad(a, (2, 3), "constant", constant_values=(4, 6))
array([4, 4, 1, 2, 3, 4, 5, 6, 6, 6])
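For comparison with the constant mode above, a few of the other numpy.pad modes applied to the same list and the same pad widths (2 on the left, 3 on the right). This is only a small sample of the modes, meant as a quick sketch.

```python
import numpy as np

a = [1, 2, 3, 4, 5]

edge = np.pad(a, (2, 3), "edge")            # repeat the first/last value
reflect = np.pad(a, (2, 3), "reflect")      # mirror without repeating the edge value
symmetric = np.pad(a, (2, 3), "symmetric")  # mirror including the edge value
wrap = np.pad(a, (2, 3), "wrap")            # continue from the opposite end

print(edge)       # [1 1 1 2 3 4 5 5 5 5]
print(reflect)    # [3 2 1 2 3 4 5 4 3 2]
print(symmetric)  # [2 1 1 2 3 4 5 5 4 3]
print(wrap)       # [4 5 1 2 3 4 5 1 2 3]
```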
t = torch.nn.utils.rnn.pad_sequence(
    [
        torch.tensor([1, 2, 3]),
        torch.tensor([1, 2, 3, 4]),
    ]
)
t
tensor([[1, 1], [2, 2], [3, 3], [0, 4]])
t[0]
tensor([1, 1])
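Note that the result above has shape (max_len, batch), which is why t[0] holds the first element of every sequence. A sketch of two possible follow-ups: passing batch_first=True to get one row per sequence, and then using torch.nn.functional.pad to extend the batch to a fixed length. The target length of 6 is an arbitrary value chosen for illustration.

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([1, 2, 3]), torch.tensor([1, 2, 3, 4])]

# batch_first=True gives shape (batch, max_len) instead of (max_len, batch).
batch = pad_sequence(seqs, batch_first=True, padding_value=0)

# Extend the last dimension on the right up to a fixed length.
target_len = 6  # assumed value, for illustration only
fixed = F.pad(batch, (0, target_len - batch.size(1)), value=0)

print(batch)  # tensor([[1, 2, 3, 0], [1, 2, 3, 4]])
print(fixed)  # tensor([[1, 2, 3, 0, 0, 0], [1, 2, 3, 4, 0, 0]])
```

This only pads; if target_len were smaller than the longest sequence, you would slice instead (for example batch[:, :target_len]).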
torch.nn.utils.rnn.pad_sequence(
    [
        [1, 2, 3],
        [1, 2, 3, 4],
    ]
)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-25-242698656610> in <module>
      2     [
      3         [1, 2, 3],
----> 4         [1, 2, 3, 4],
      5     ]
      6 )

~/.local/lib/python3.7/site-packages/torch/nn/utils/rnn.py in pad_sequence(sequences, batch_first, padding_value)
    325     # assuming trailing dimensions and type of all the Tensors
    326     # in sequences are same and fetching those from sequences[0]
--> 327     max_size = sequences[0].size()
    328     trailing_dims = max_size[1:]
    329     max_len = max([s.size(0) for s in sequences])

AttributeError: 'list' object has no attribute 'size'
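The error above comes from pad_sequence calling .size() on its items, so plain Python lists are rejected. A minimal sketch of a workaround is to convert each list to a Tensor first (the variable names here are illustrative).

```python
import torch
from torch.nn.utils.rnn import pad_sequence

raw = [
    [1, 2, 3],
    [1, 2, 3, 4],
]

# Convert every list to a Tensor before padding.
tensors = [torch.tensor(s) for s in raw]
padded = pad_sequence(tensors)

print(padded)  # tensor([[1, 1], [2, 2], [3, 3], [0, 4]])
```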
tf.keras.preprocessing.sequence.pad_sequences(
    [[1, 2, 3, 4, 5]],
    maxlen=3,
    dtype="long",
    value=0,
    truncating="post",
    padding="post",
)
array([[1, 2, 3]])
tf.keras.preprocessing.sequence.pad_sequences(
    [[1, 2, 3, 4, 5]],
    maxlen=9,
    dtype="long",
    value=0,
    truncating="post",
    padding="post",
)
array([[1, 2, 3, 4, 5, 0, 0, 0, 0]])