Hands on the Python Library pdfplumber

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Stirling-PDF is a robust, locally hosted, web-based PDF manipulation tool that runs in Docker.

!pip3 install pdfplumber
Collecting pdfplumber
  Downloading pdfplumber-0.5.28.tar.gz (45 kB)
     |████████████████████████████████| 45 kB 1.6 MB/s eta 0:00:01
Requirement already satisfied: Pillow>=7.0.0 in /usr/local/lib/python3.8/dist-packages (from pdfplumber) (8.3.1)
Collecting Wand
  Downloading Wand-0.6.6-py2.py3-none-any.whl (138 kB)
     |████████████████████████████████| 138 kB 8.5 MB/s eta 0:00:01
Collecting pdfminer.six==20200517
  Downloading pdfminer.six-20200517-py3-none-any.whl (5.6 MB)
     |████████████████████████████████| 5.6 MB 22.0 MB/s eta 0:00:01
Collecting pycryptodome
  Downloading pycryptodome-3.10.1-cp35-abi3-manylinux2010_x86_64.whl (1.9 MB)
     |████████████████████████████████| 1.9 MB 46.7 MB/s eta 0:00:01
Requirement already satisfied: chardet; python_version > "3.0" in /usr/lib/python3/dist-packages (from pdfminer.six==20200517->pdfplumber) (3.0.4)
Collecting sortedcontainers
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Building wheels for collected packages: pdfplumber
  Building wheel for pdfplumber (setup.py) ... done
  Created wheel for pdfplumber: filename=pdfplumber-0.5.28-py3-none-any.whl size=32220 sha256=8df60e70751b3087fda49d8b20bb47d0e82931b60a2df7ea913391f68716facc
  Stored in directory: /home/dclong/.cache/pip/wheels/36/61/6d/5fdf7f85a9598d42f094b4099be9a3dd9a887b25ca9b5a1bf4
Successfully built pdfplumber
Installing collected packages: Wand, pycryptodome, sortedcontainers, pdfminer.six, pdfplumber
Successfully installed Wand-0.6.6 pdfminer.six-20200517 pdfplumber-0.5.28 pycryptodome-3.10.1 sortedcontainers-2.4.0
!wget http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf
--2021-07-15 15:18:14--  http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf
Resolving www.edd.ca.gov (www.edd.ca.gov)... 134.186.117.17
Connecting to www.edd.ca.gov (www.edd.ca.gov)|134.186.117.17|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf [following]
--2021-07-15 15:18:14--  https://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf
Connecting to www.edd.ca.gov (www.edd.ca.gov)|134.186.117.17|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 307728 (301K) [application/pdf]
Saving to: ‘eddwarncn12.pdf’

eddwarncn12.pdf     100%[===================>] 300.52K   760KB/s    in 0.4s    

2021-07-15 15:18:15 (760 KB/s) - ‘eddwarncn12.pdf’ saved [307728/307728]

import pdfplumber
pdf = pdfplumber.open("eddwarncn12.pdf")
type(pdf)
pdfplumber.pdf.PDF
page = pdf.pages[0]
type(page)
pdfplumber.page.Page
dir(page)
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'annots', 'bbox', 'cached_properties', 'chars', 'close', 'close_file', 'crop', 'cropbox', 'curves', 'debug_tablefinder', 'decimalize', 'dedupe_chars', 'edges', 'extract_table', 'extract_tables', 'extract_text', 'extract_words', 'filter', 'find_tables', 'flush_cache', 'height', 'horizontal_edges', 'hyperlinks', 'images', 'initial_doctop', 'is_original', 'iter_layout_objects', 'layout', 'lines', 'mediabox', 'objects', 'page_number', 'page_obj', 'parse_objects', 'pdf', 'process_object', 'rect_edges', 'rects', 'rotation', 'textboxhorizontals', 'textboxverticals', 'textlinehorizontals', 'textlineverticals', 'to_csv', 'to_image', 'to_json', 'vertical_edges', 'width', 'within_bbox']

Extract Tables

  1. It often helps to crop a PDF page (Page.crop(bounding_box)) before extracting tables.

  2. Below are default settings when extracting tables.

     {
         "vertical_strategy": "lines", 
         "horizontal_strategy": "lines",
         "explicit_vertical_lines": [],
         "explicit_horizontal_lines": [],
         "snap_tolerance": 3,
         "snap_x_tolerance": 3,
         "snap_y_tolerance": 3,
         "join_tolerance": 3,
         "join_x_tolerance": 3,
         "join_y_tolerance": 3,
         "edge_min_length": 3,
         "min_words_vertical": 3,
         "min_words_horizontal": 1,
         "keep_blank_chars": False,
         "text_tolerance": 3,
         "text_x_tolerance": 3,
         "text_y_tolerance": 3,
         "intersection_tolerance": 3,
         "intersection_x_tolerance": 3,
         "intersection_y_tolerance": 3,
     }
    • Setting “vertical_strategy” and/or “horizontal_strategy” to “text” can help when the table has no vertical and/or horizontal ruling lines.
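
The two tips above can be combined in one small helper. This is only a sketch: `extract_table_with_text_strategy` is a hypothetical name, and the bounding box in the usage comment is a made-up region, not one measured on this PDF. The pdfplumber calls it relies on are `Page.crop` and the `table_settings` argument of `Page.extract_table`.

```python
def extract_table_with_text_strategy(page, bbox=None, vertical="text", horizontal="text"):
    """Optionally crop a pdfplumber page, then extract one table from it.

    `page` is expected to be a pdfplumber Page and `bbox` a tuple
    (x0, top, x1, bottom). The function name and defaults are illustrative.
    """
    # Cropping first keeps text outside the table away from the table finder.
    target = page.crop(bbox) if bbox is not None else page
    # Only the overridden keys are passed; the remaining settings keep their defaults.
    table_settings = {
        "vertical_strategy": vertical,
        "horizontal_strategy": horizontal,
    }
    return target.extract_table(table_settings=table_settings)


# Example usage (assumes pdfplumber is installed and eddwarncn12.pdf is present;
# the bbox values are hypothetical):
#   import pdfplumber
#   with pdfplumber.open("eddwarncn12.pdf") as pdf:
#       table = extract_table_with_text_strategy(pdf.pages[0], bbox=(0, 0, 612, 400))
```

The bounding box should lie inside the page (see `page.bbox`, `page.width` and `page.height`); newer pdfplumber versions may raise an error otherwise.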

table = page.extract_table()
type(table)
list
table
[['Company Name', 'Location', 'Employees\nAffected', 'Layoff\nDate'], ['AAR MOBILITY SYSTEMS', 'MCCLELLAN AFB', '48', '6/15/12'], ['ABBOTT VASCULAR', 'MURRIETA', '45', '1/25/12'], ['ABBOTT VASCULAR', 'MURRIETA', '38', '10/17/12'], ['ABBOTT VASCULAR', 'TEMECULA', '247', '1/25/12'], ['ABBOTT VASCULAR', 'TEMECULA', '7', '1/25/12'], ['ABBOTT VASCULAR', 'TEMECULA', '139', '10/17/12'], ['ABBOTT VASCULAR', 'TEMECULA', '16', '10/17/12'], ['ABEO MANAGEMENT CORPORATION', 'LOS ANGELES', '42', '11/28/12'], ['ABERCROMBIE & FITCH', 'ANAHEIM', '51', '1/14/12'], ['ABERCROMBIE & FITCH', 'CAPITOLA', '51', '1/21/12'], ['ABERCROMBIE & FITCH', 'RIVERSIDE', '64', '1/14/12'], ['ABERCROMBIE & FITCH', 'SAN DIEGO', '66', '12/29/12'], ['ABERCROMBIE & FITCH', 'SIMI VALLEY', '70', '3/24/12'], ['ABERCROMBIE & FITCH', 'SIMI VALLEY', '47', '3/24/12'], ['ADAMS RITE MANUFACTURING \nCOMPANY', 'PONOMA', '110', '5/25/12'], ['ADOBE SYSTEMS INCORPORATED', 'SAN FRANCISCO', '121', '1/31/12'], ['ADOBE SYSTEMS INCORPORATED', 'SAN JOSE', '103', '1/31/12'], ['ADVANCED MICRO DEVICES, INC', 'SUNNYVALE', '107', '10/25/12']]
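
The extracted table is a plain list of rows, with the header in the first row and line breaks (`\n`) left inside wrapped cells. A minimal sketch of tidying that up with the standard library follows; `table_to_records` is a hypothetical helper, not part of pdfplumber, and the sample rows are copied from the output above.

```python
def table_to_records(table):
    """Turn a list-of-lists table (header row first) into a list of dicts."""
    def clean(cell):
        # Empty cells may come back as None; wrapped text contains "\n".
        return "" if cell is None else " ".join(cell.split())

    header = [clean(cell) for cell in table[0]]
    return [dict(zip(header, (clean(cell) for cell in row))) for row in table[1:]]


sample = [
    ["Company Name", "Location", "Employees\nAffected", "Layoff\nDate"],
    ["AAR MOBILITY SYSTEMS", "MCCLELLAN AFB", "48", "6/15/12"],
    ["ADAMS RITE MANUFACTURING \nCOMPANY", "PONOMA", "110", "5/25/12"],
]
records = table_to_records(sample)
print(records[1]["Company Name"])  # ADAMS RITE MANUFACTURING COMPANY
```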

Convert a PDF Page to Image

page.to_image()
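
In a notebook the call above displays the rendered page inline. To keep the image on disk, something like the following sketch may help; `save_page_image` is a hypothetical helper name and the output path is up to you. It assumes `Page.to_image` accepts a `resolution` argument (in DPI) and that the returned object has a `save` method, as in the pdfplumber versions I have used.

```python
def save_page_image(page, path, resolution=150):
    """Render a pdfplumber page and save it as a PNG file at `path`."""
    # A higher resolution gives a sharper (and larger) image.
    image = page.to_image(resolution=resolution)
    image.save(path, format="PNG")
    return image


# Example usage (assumes pdfplumber and its image backend are installed):
#   import pdfplumber
#   with pdfplumber.open("eddwarncn12.pdf") as pdf:
#       save_page_image(pdf.pages[0], "page1.png")
```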