Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Stirling-PDF is a robust, locally hosted, web-based PDF manipulation tool that runs in Docker.

In [1]:

!pip3 install pdfplumber

Collecting pdfplumber
  Downloading pdfplumber-0.5.28.tar.gz (45 kB)
     |████████████████████████████████| 45 kB 1.6 MB/s eta 0:00:01
Requirement already satisfied: Pillow>=7.0.0 in /usr/local/lib/python3.8/dist-packages (from pdfplumber) (8.3.1)
Collecting Wand
  Downloading Wand-0.6.6-py2.py3-none-any.whl (138 kB)
     |████████████████████████████████| 138 kB 8.5 MB/s eta 0:00:01
Collecting pdfminer.six==20200517
  Downloading pdfminer.six-20200517-py3-none-any.whl (5.6 MB)
     |████████████████████████████████| 5.6 MB 22.0 MB/s eta 0:00:01
Collecting pycryptodome
  Downloading pycryptodome-3.10.1-cp35-abi3-manylinux2010_x86_64.whl (1.9 MB)
     |████████████████████████████████| 1.9 MB 46.7 MB/s eta 0:00:01
Requirement already satisfied: chardet; python_version > "3.0" in /usr/lib/python3/dist-packages (from pdfminer.six==20200517->pdfplumber) (3.0.4)
Collecting sortedcontainers
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Building wheels for collected packages: pdfplumber
  Building wheel for pdfplumber (setup.py) ... done
  Created wheel for pdfplumber: filename=pdfplumber-0.5.28-py3-none-any.whl size=32220 sha256=8df60e70751b3087fda49d8b20bb47d0e82931b60a2df7ea913391f68716facc
  Stored in directory: /home/dclong/.cache/pip/wheels/36/61/6d/5fdf7f85a9598d42f094b4099be9a3dd9a887b25ca9b5a1bf4
Successfully built pdfplumber
Installing collected packages: Wand, pycryptodome, sortedcontainers, pdfminer.six, pdfplumber
Successfully installed Wand-0.6.6 pdfminer.six-20200517 pdfplumber-0.5.28 pycryptodome-3.10.1 sortedcontainers-2.4.0

In [4]:

!wget http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf

--2021-07-15 15:18:14--  http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf
Resolving www.edd.ca.gov (www.edd.ca.gov)... 134.186.117.17
Connecting to www.edd.ca.gov (www.edd.ca.gov)|134.186.117.17|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf [following]
--2021-07-15 15:18:14--  https://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf
Connecting to www.edd.ca.gov (www.edd.ca.gov)|134.186.117.17|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 307728 (301K) [application/pdf]
Saving to: ‘eddwarncn12.pdf’

eddwarncn12.pdf     100%[===================>] 300.52K   760KB/s    in 0.4s    

2021-07-15 15:18:15 (760 KB/s) - ‘eddwarncn12.pdf’ saved [307728/307728]

In [1]:

import pdfplumber

In [2]:

pdf = pdfplumber.open("eddwarncn12.pdf")

In [3]:

type(pdf)

Out[3]:

pdfplumber.pdf.PDF

In [4]:

page = pdf.pages[0]
type(page)

Out[4]:

pdfplumber.page.Page

In [12]:

dir(page)

Out[12]:

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'annots',
 'bbox',
 'cached_properties',
 'chars',
 'close',
 'close_file',
 'crop',
 'cropbox',
 'curves',
 'debug_tablefinder',
 'decimalize',
 'dedupe_chars',
 'edges',
 'extract_table',
 'extract_tables',
 'extract_text',
 'extract_words',
 'filter',
 'find_tables',
 'flush_cache',
 'height',
 'horizontal_edges',
 'hyperlinks',
 'images',
 'initial_doctop',
 'is_original',
 'iter_layout_objects',
 'layout',
 'lines',
 'mediabox',
 'objects',
 'page_number',
 'page_obj',
 'parse_objects',
 'pdf',
 'process_object',
 'rect_edges',
 'rects',
 'rotation',
 'textboxhorizontals',
 'textboxverticals',
 'textlinehorizontals',
 'textlineverticals',
 'to_csv',
 'to_image',
 'to_json',
 'vertical_edges',
 'width',
 'within_bbox']

Extract Tables

It often helps to crop a PDF page (Page.crop(bounding_box)) before extracting tables.
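As a minimal sketch of the cropping step, the snippet below converts fractions of the page size into the `(x0, top, x1, bottom)` box that `Page.crop` expects, then extracts a table from the cropped region. The helper names `fractional_bbox` and `extract_table_from_region` are hypothetical (they are not part of pdfplumber), and the sketch assumes the page's bounding box starts at `(0, 0)`.

```python
def fractional_bbox(width, height, left=0.0, top=0.0, right=1.0, bottom=1.0):
    """Turn fractions of the page size into an (x0, top, x1, bottom) box in points.

    Hypothetical helper, not part of pdfplumber.
    """
    if not (0.0 <= left < right <= 1.0 and 0.0 <= top < bottom <= 1.0):
        raise ValueError(
            "fractions must satisfy 0 <= left < right <= 1 and 0 <= top < bottom <= 1"
        )
    # float() guards against Decimal page sizes, which older pdfplumber releases return.
    width, height = float(width), float(height)
    return (left * width, top * height, right * width, bottom * height)


def extract_table_from_region(pdf_path, page_index=0, **fractions):
    """Crop one page to a fractional region, then extract the largest table in it."""
    # Imported here so that fractional_bbox stays usable without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        bbox = fractional_bbox(page.width, page.height, **fractions)
        return page.crop(bbox).extract_table()
```

For example, `extract_table_from_region("eddwarncn12.pdf", top=0.1)` would skip the top 10% of the first page before looking for a table. Whether that improves the result depends on the layout of the PDF at hand.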

Below are the default settings used when extracting tables.

 {
     "vertical_strategy": "lines", 
     "horizontal_strategy": "lines",
     "explicit_vertical_lines": [],
     "explicit_horizontal_lines": [],
     "snap_tolerance": 3,
     "snap_x_tolerance": 3,
     "snap_y_tolerance": 3,
     "join_tolerance": 3,
     "join_x_tolerance": 3,
     "join_y_tolerance": 3,
     "edge_min_length": 3,
     "min_words_vertical": 3,
     "min_words_horizontal": 1,
     "keep_blank_chars": False,
     "text_tolerance": 3,
     "text_x_tolerance": 3,
     "text_y_tolerance": 3,
     "intersection_tolerance": 3,
     "intersection_x_tolerance": 3,
     "intersection_y_tolerance": 3,
 }

Setting "vertical_strategy" and/or "horizontal_strategy" to "text" can help when the table has no vertical and/or horizontal ruling lines.
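A small sketch of how such an override might be passed to `extract_table`. Only the keys that differ from the defaults are supplied; pdfplumber fills in the rest. The helper names `build_table_settings` and `extract_table_with_strategies` are hypothetical and exist only for illustration.

```python
def build_table_settings(has_vertical_lines=True, has_horizontal_lines=True):
    """Pick the "lines" or "text" strategy per direction (hypothetical helper)."""
    return {
        "vertical_strategy": "lines" if has_vertical_lines else "text",
        "horizontal_strategy": "lines" if has_horizontal_lines else "text",
    }


def extract_table_with_strategies(page, has_vertical_lines=True, has_horizontal_lines=True):
    """Call page.extract_table with only the strategy keys overridden."""
    settings = build_table_settings(has_vertical_lines, has_horizontal_lines)
    return page.extract_table(table_settings=settings)
```

For a table with horizontal rules but no column separators, `extract_table_with_strategies(page, has_vertical_lines=False)` would infer column boundaries from the alignment of words instead.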

In [13]:

table = page.extract_table()
type(table)

Out[13]:

list

In [14]:

table

Out[14]:

[['Company Name', 'Location', 'Employees\nAffected', 'Layoff\nDate'],
 ['AAR MOBILITY SYSTEMS', 'MCCLELLAN AFB', '48', '6/15/12'],
 ['ABBOTT VASCULAR', 'MURRIETA', '45', '1/25/12'],
 ['ABBOTT VASCULAR', 'MURRIETA', '38', '10/17/12'],
 ['ABBOTT VASCULAR', 'TEMECULA', '247', '1/25/12'],
 ['ABBOTT VASCULAR', 'TEMECULA', '7', '1/25/12'],
 ['ABBOTT VASCULAR', 'TEMECULA', '139', '10/17/12'],
 ['ABBOTT VASCULAR', 'TEMECULA', '16', '10/17/12'],
 ['ABEO MANAGEMENT CORPORATION', 'LOS ANGELES', '42', '11/28/12'],
 ['ABERCROMBIE & FITCH', 'ANAHEIM', '51', '1/14/12'],
 ['ABERCROMBIE & FITCH', 'CAPITOLA', '51', '1/21/12'],
 ['ABERCROMBIE & FITCH', 'RIVERSIDE', '64', '1/14/12'],
 ['ABERCROMBIE & FITCH', 'SAN DIEGO', '66', '12/29/12'],
 ['ABERCROMBIE & FITCH', 'SIMI VALLEY', '70', '3/24/12'],
 ['ABERCROMBIE & FITCH', 'SIMI VALLEY', '47', '3/24/12'],
 ['ADAMS RITE MANUFACTURING \nCOMPANY', 'PONOMA', '110', '5/25/12'],
 ['ADOBE SYSTEMS INCORPORATED', 'SAN FRANCISCO', '121', '1/31/12'],
 ['ADOBE SYSTEMS INCORPORATED', 'SAN JOSE', '103', '1/31/12'],
 ['ADVANCED MICRO DEVICES, INC', 'SUNNYVALE', '107', '10/25/12']]
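The extracted table is a plain list of rows, and some cells contain embedded newlines (for example `'Employees\nAffected'`). The following sketch, which uses a hypothetical helper `table_to_records`, collapses that whitespace and turns each row into a dictionary keyed by the header row. It assumes the first row is the header.

```python
def clean_cell(cell):
    """Collapse newlines and repeated spaces; treat a missing cell as an empty string."""
    return "" if cell is None else " ".join(cell.split())


def table_to_records(table):
    """Convert a list-of-rows table into a list of dicts keyed by the header row."""
    if not table:
        return []
    header = [clean_cell(cell) for cell in table[0]]
    return [dict(zip(header, (clean_cell(cell) for cell in row))) for row in table[1:]]


# A few rows copied from the output above.
sample = [
    ["Company Name", "Location", "Employees\nAffected", "Layoff\nDate"],
    ["AAR MOBILITY SYSTEMS", "MCCLELLAN AFB", "48", "6/15/12"],
    ["ADAMS RITE MANUFACTURING \nCOMPANY", "PONOMA", "110", "5/25/12"],
]
records = table_to_records(sample)
print(records[1]["Company Name"])  # ADAMS RITE MANUFACTURING COMPANY
```

All values stay as strings; converting "Employees Affected" to integers or "Layoff Date" to dates is left to the caller.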

Convert a PDF Page to Image

In [ ]:

page.to_image()
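In a notebook the returned image is displayed inline; in a script it usually has to be written to disk. The sketch below saves a page as a PNG next to the source PDF. The helper names `page_image_path` and `save_page_image` and the output naming scheme are assumptions made for illustration.

```python
from pathlib import Path


def page_image_path(pdf_path, page_number, suffix=".png"):
    """Build an output path such as eddwarncn12-page1.png next to the PDF."""
    pdf_path = Path(pdf_path)
    return pdf_path.with_name(f"{pdf_path.stem}-page{page_number}{suffix}")


def save_page_image(page, pdf_path, resolution=150):
    """Render a pdfplumber page and save it as a PNG; return the output path."""
    out_path = page_image_path(pdf_path, page.page_number)
    image = page.to_image(resolution=resolution)
    image.save(str(out_path), format="PNG")
    return out_path
```

With the `page` object from above, `save_page_image(page, "eddwarncn12.pdf")` would write the rendering of the first page to `eddwarncn12-page1.png`.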

Ben Chuanlong Du's Blog


Hands on the Python Library pdfplumber
