The Python package PyPDF can be used to extract pages from a PDF file. The function aiutil.pdf.extract_pages is a wrapper over PyPDF that makes this even easier.
In [2]:
!pip3 install --user -U aiutil[all]
In [5]:
!wget www.legendu.net/media/wolfram/sum_and_product.pdf
--2023-09-17 11:52:33--  http://www.legendu.net/media/wolfram/sum_and_product.pdf
Resolving www.legendu.net (www.legendu.net)... 185.199.109.153, 185.199.110.153, 185.199.111.153, ...
Connecting to www.legendu.net (www.legendu.net)|185.199.109.153|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 140210 (137K) [application/pdf]
Saving to: ‘sum_and_product.pdf’

sum_and_product.pdf 100%[===================>] 136.92K --.-KB/s in 0.03s

2023-09-17 11:52:34 (4.65 MB/s) - ‘sum_and_product.pdf’ saved [140210/140210]
Below is a concrete example of using the function aiutil.pdf.extract_pages to extract pages from the PDF file sum_and_product.pdf into 3 sub PDF files: 0-3.pdf (pages 0-3), 4.pdf (page 4) and 5-15.pdf (pages 5-15). Page indexes are 0-based, so range(4) selects the first four pages.
In [3]:
from aiutil.pdf import extract_pages
In [12]:
extract_pages(
"sum_and_product.pdf", {"0-3.pdf": range(4), "4.pdf": [4], "5-15.pdf": range(5, 16)}
)
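For readers who want to see roughly what such a wrapper does, here is a minimal sketch of the same idea written directly against the pypdf package (PdfReader and PdfWriter). It is not the actual aiutil implementation; the names resolve_page_indexes and extract_pages_sketch are hypothetical and exist only for this illustration, and the sketch assumes pypdf is installed.

```python
from typing import Dict, Iterable, List


def resolve_page_indexes(
    subfiles: Dict[str, Iterable[int]], num_pages: int
) -> Dict[str, List[int]]:
    """Turn each iterable of 0-based page indexes into a list and validate it.

    Hypothetical helper for this sketch; not part of aiutil or pypdf.
    """
    resolved = {}
    for name, indexes in subfiles.items():
        pages = list(indexes)
        # Reject indexes that fall outside the source document.
        bad = [i for i in pages if not 0 <= i < num_pages]
        if bad:
            raise IndexError(f"{name}: page indexes out of range: {bad}")
        resolved[name] = pages
    return resolved


def extract_pages_sketch(pdf_path: str, subfiles: Dict[str, Iterable[int]]) -> None:
    """Write one output PDF per entry of subfiles (output path -> page indexes).

    Hypothetical function; an illustration only, not aiutil's code.
    """
    # Imported lazily so the helper above can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    plan = resolve_page_indexes(subfiles, len(reader.pages))
    for name, pages in plan.items():
        writer = PdfWriter()
        for i in pages:
            # Copy the selected page into the new document.
            writer.add_page(reader.pages[i])
        with open(name, "wb") as fout:
            writer.write(fout)
```

It would be called with the same arguments as above, e.g. extract_pages_sketch("sum_and_product.pdf", {"4.pdf": [4]}). Validating the indexes before writing anything means a typo in a range fails early instead of leaving a partial set of output files.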