- The Python package PyPDF can be used to extract pages from a PDF file. The function aiutil.pdf.extract_pages is a wrapper over PyPDF which makes it even easier to extract pages from a PDF file.
- Stirling-PDF is a robust, locally hosted, web-based PDF manipulation tool that runs in Docker.
In [2]:
!pip3 install --user -U aiutil[all]
In [5]:
!wget www.legendu.net/media/wolfram/sum_and_product.pdf
--2023-09-17 11:52:33-- http://www.legendu.net/media/wolfram/sum_and_product.pdf
Resolving www.legendu.net (www.legendu.net)... 185.199.109.153, 185.199.110.153, 185.199.111.153, ...
Connecting to www.legendu.net (www.legendu.net)|185.199.109.153|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 140210 (137K) [application/pdf]
Saving to: ‘sum_and_product.pdf’
sum_and_product.pdf 100%[===================>] 136.92K --.-KB/s in 0.03s
2023-09-17 11:52:34 (4.65 MB/s) - ‘sum_and_product.pdf’ saved [140210/140210]
Below is a concrete example of using the function aiutil.pdf.extract_pages
to extract pages from the PDF file sum_and_product.pdf into 3 smaller PDF files:
0-3.pdf (pages 0-3), 4.pdf (page 4), and 5-15.pdf (pages 5-15). Page numbers are 0-based.
In [3]:
from aiutil.pdf import extract_pages
In [12]:
extract_pages(
    "sum_and_product.pdf", {"0-3.pdf": range(4), "4.pdf": [4], "5-15.pdf": range(5, 16)}
)
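
For readers who want to see what such a wrapper might do under the hood, here is a minimal sketch that performs the same split with pypdf directly. It is not the actual aiutil implementation: the names `resolve_pages` and `extract_pages_sketch` are hypothetical, and the sketch assumes a recent release of the third-party `pypdf` package (`PdfReader`, `PdfWriter`). The pure-Python helper checks the 0-based page indexes before any file is written.

```python
from typing import Dict, Iterable, List


def resolve_pages(
    subfiles: Dict[str, Iterable[int]], num_pages: int
) -> Dict[str, List[int]]:
    """Turn each page selection into a list and check it against the page count.

    Hypothetical helper for illustration; page indexes are 0-based.
    """
    resolved = {}
    for name, indexes in subfiles.items():
        pages = list(indexes)
        bad = [i for i in pages if not 0 <= i < num_pages]
        if bad:
            raise IndexError(
                f"{name}: page indexes {bad} are outside 0..{num_pages - 1}"
            )
        resolved[name] = pages
    return resolved


def extract_pages_sketch(file: str, subfiles: Dict[str, Iterable[int]]) -> None:
    """Write the selected pages of `file` into each output PDF.

    Hypothetical stand-in for illustration, not aiutil's code.
    """
    # Imported here so that resolve_pages stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(file)
    for name, pages in resolve_pages(subfiles, len(reader.pages)).items():
        writer = PdfWriter()
        for index in pages:
            writer.add_page(reader.pages[index])
        with open(name, "wb") as fout:
            writer.write(fout)
```

Calling `extract_pages_sketch("sum_and_product.pdf", {"0-3.pdf": range(4), "4.pdf": [4], "5-15.pdf": range(5, 16)})` would mirror the call above. Validating all selections first means an out-of-range index raises an error before any output file is created.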