Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Stirling-PDF is a robust, locally hosted, web-based PDF manipulation tool that runs in Docker.
!pip3 install pdfplumber
Collecting pdfplumber
Downloading pdfplumber-0.5.28.tar.gz (45 kB)
|████████████████████████████████| 45 kB 1.6 MB/s eta 0:00:01
Requirement already satisfied: Pillow>=7.0.0 in /usr/local/lib/python3.8/dist-packages (from pdfplumber) (8.3.1)
Collecting Wand
Downloading Wand-0.6.6-py2.py3-none-any.whl (138 kB)
|████████████████████████████████| 138 kB 8.5 MB/s eta 0:00:01
Collecting pdfminer.six==20200517
Downloading pdfminer.six-20200517-py3-none-any.whl (5.6 MB)
|████████████████████████████████| 5.6 MB 22.0 MB/s eta 0:00:01
Collecting pycryptodome
Downloading pycryptodome-3.10.1-cp35-abi3-manylinux2010_x86_64.whl (1.9 MB)
|████████████████████████████████| 1.9 MB 46.7 MB/s eta 0:00:01
Requirement already satisfied: chardet; python_version > "3.0" in /usr/lib/python3/dist-packages (from pdfminer.six==20200517->pdfplumber) (3.0.4)
Collecting sortedcontainers
Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Building wheels for collected packages: pdfplumber
Building wheel for pdfplumber (setup.py) ... done
Created wheel for pdfplumber: filename=pdfplumber-0.5.28-py3-none-any.whl size=32220 sha256=8df60e70751b3087fda49d8b20bb47d0e82931b60a2df7ea913391f68716facc
Stored in directory: /home/dclong/.cache/pip/wheels/36/61/6d/5fdf7f85a9598d42f094b4099be9a3dd9a887b25ca9b5a1bf4
Successfully built pdfplumber
Installing collected packages: Wand, pycryptodome, sortedcontainers, pdfminer.six, pdfplumber
Successfully installed Wand-0.6.6 pdfminer.six-20200517 pdfplumber-0.5.28 pycryptodome-3.10.1 sortedcontainers-2.4.0
!wget http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf
--2021-07-15 15:18:14-- http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf
Resolving www.edd.ca.gov (www.edd.ca.gov)... 134.186.117.17
Connecting to www.edd.ca.gov (www.edd.ca.gov)|134.186.117.17|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf [following]
--2021-07-15 15:18:14-- https://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf
Connecting to www.edd.ca.gov (www.edd.ca.gov)|134.186.117.17|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 307728 (301K) [application/pdf]
Saving to: ‘eddwarncn12.pdf’
eddwarncn12.pdf 100%[===================>] 300.52K 760KB/s in 0.4s
2021-07-15 15:18:15 (760 KB/s) - ‘eddwarncn12.pdf’ saved [307728/307728]
import pdfplumber
pdf = pdfplumber.open("eddwarncn12.pdf")
type(pdf)
pdfplumber.pdf.PDF
page = pdf.pages[0]
type(page)
pdfplumber.page.Page
dir(page)
['__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__enter__',
'__eq__',
'__exit__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'annots',
'bbox',
'cached_properties',
'chars',
'close',
'close_file',
'crop',
'cropbox',
'curves',
'debug_tablefinder',
'decimalize',
'dedupe_chars',
'edges',
'extract_table',
'extract_tables',
'extract_text',
'extract_words',
'filter',
'find_tables',
'flush_cache',
'height',
'horizontal_edges',
'hyperlinks',
'images',
'initial_doctop',
'is_original',
'iter_layout_objects',
'layout',
'lines',
'mediabox',
'objects',
'page_number',
'page_obj',
'parse_objects',
'pdf',
'process_object',
'rect_edges',
'rects',
'rotation',
'textboxhorizontals',
'textboxverticals',
'textlinehorizontals',
'textlineverticals',
'to_csv',
'to_image',
'to_json',
'vertical_edges',
'width',
'within_bbox']
Extract Tables
It often helps to crop a PDF page (Page.crop(bounding_box)) before extracting tables.
Below are the default settings used when extracting tables.
{
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": 3,
    "intersection_y_tolerance": 3,
}
Setting “vertical_strategy” and/or “horizontal_strategy” to “text” can help when the table has no vertical and/or horizontal ruling lines.
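As a minimal sketch of the two tips above (crop first, then switch to the text strategy), the helper below crops a page to a bounding box and extracts one table from it. The name extract_table_from_region and the bounding-box numbers in the usage comment are hypothetical. The bounding box is (x0, top, x1, bottom) in page coordinates, and only the settings that differ from the defaults are passed.

```python
# Only the keys that differ from the defaults need to be supplied.
TEXT_STRATEGY_SETTINGS = {
    "vertical_strategy": "text",
    "horizontal_strategy": "text",
}


def extract_table_from_region(page, bbox, table_settings=None):
    """Crop `page` to `bbox` = (x0, top, x1, bottom), then extract one table.

    Hypothetical helper; `page` is expected to be a pdfplumber page.
    Extra `table_settings` override the text-strategy defaults above.
    """
    settings = dict(TEXT_STRATEGY_SETTINGS)
    settings.update(table_settings or {})
    return page.crop(bbox).extract_table(table_settings=settings)


# Usage with the page opened earlier (placeholder coordinates):
# table = extract_table_from_region(page, (0, 100, page.width, page.height))
```

The bounding box has to lie within the page, so it is usually easiest to derive it from page.width and page.height or from the position of a known object on the page.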
table = page.extract_table()
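extract_table returns the table as a list of rows, each row a list of cell strings, with the header row first (see the output further down). A minimal sketch of turning such rows into dictionaries keyed by header follows. The helper name table_to_records is hypothetical, the sample rows are copied from the output on this page, and the None check is a defensive assumption for cells that come back empty.

```python
def table_to_records(table):
    """Turn [header_row, *data_rows] into a list of dicts keyed by header."""

    def clean(cell):
        # Guard against empty cells; collapse line breaks inside a cell to spaces.
        return "" if cell is None else " ".join(cell.split())

    header = [clean(cell) for cell in table[0]]
    return [dict(zip(header, (clean(cell) for cell in row))) for row in table[1:]]


sample = [
    ["Company Name", "Location", "Employees\nAffected", "Layoff\nDate"],
    ["AAR MOBILITY SYSTEMS", "MCCLELLAN AFB", "48", "6/15/12"],
    ["ADAMS RITE MANUFACTURING \nCOMPANY", "PONOMA", "110", "5/25/12"],
]
records = table_to_records(sample)
print(records[1]["Company Name"])  # ADAMS RITE MANUFACTURING COMPANY
```

All values stay strings here; converting the employee counts to integers or the layoff dates to dates is left to the caller.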
type(table)
list
table
[['Company Name', 'Location', 'Employees\nAffected', 'Layoff\nDate'],
['AAR MOBILITY SYSTEMS', 'MCCLELLAN AFB', '48', '6/15/12'],
['ABBOTT VASCULAR', 'MURRIETA', '45', '1/25/12'],
['ABBOTT VASCULAR', 'MURRIETA', '38', '10/17/12'],
['ABBOTT VASCULAR', 'TEMECULA', '247', '1/25/12'],
['ABBOTT VASCULAR', 'TEMECULA', '7', '1/25/12'],
['ABBOTT VASCULAR', 'TEMECULA', '139', '10/17/12'],
['ABBOTT VASCULAR', 'TEMECULA', '16', '10/17/12'],
['ABEO MANAGEMENT CORPORATION', 'LOS ANGELES', '42', '11/28/12'],
['ABERCROMBIE & FITCH', 'ANAHEIM', '51', '1/14/12'],
['ABERCROMBIE & FITCH', 'CAPITOLA', '51', '1/21/12'],
['ABERCROMBIE & FITCH', 'RIVERSIDE', '64', '1/14/12'],
['ABERCROMBIE & FITCH', 'SAN DIEGO', '66', '12/29/12'],
['ABERCROMBIE & FITCH', 'SIMI VALLEY', '70', '3/24/12'],
['ABERCROMBIE & FITCH', 'SIMI VALLEY', '47', '3/24/12'],
['ADAMS RITE MANUFACTURING \nCOMPANY', 'PONOMA', '110', '5/25/12'],
['ADOBE SYSTEMS INCORPORATED', 'SAN FRANCISCO', '121', '1/31/12'],
['ADOBE SYSTEMS INCORPORATED', 'SAN JOSE', '103', '1/31/12'],
['ADVANCED MICRO DEVICES, INC', 'SUNNYVALE', '107', '10/25/12']]
Convert a PDF Page to Image
page.to_image()
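In a notebook the returned PageImage is rendered inline. Outside a notebook, a sketch like the one below saves it to disk instead. The helper name save_page_preview, the output path, and the resolution value are hypothetical choices. The optional debug_tablefinder call overlays the lines and cells that the table finder detects, which can help when tuning table settings.

```python
def save_page_preview(page, path, resolution=150, show_tables=False):
    """Render a pdfplumber page to a PNG file and return the path.

    Hypothetical helper; `resolution` is in pixels per inch.
    """
    im = page.to_image(resolution=resolution)
    if show_tables:
        # Draw the table finder's detected lines and cells onto the image.
        im.debug_tablefinder()
    im.save(path, format="PNG")
    return path


# Usage with the page opened earlier (placeholder file name):
# save_page_preview(page, "page0.png", show_tables=True)
```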