Ben Chuanlong Du's Blog

Hands on the Python Library pdfplumber

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

In [1]:
!pip3 install pdfplumber
Collecting pdfplumber
  Downloading pdfplumber-0.5.28.tar.gz (45 kB)
     |████████████████████████████████| 45 kB 1.6 MB/s eta 0:00:01
Requirement already satisfied: Pillow>=7.0.0 in /usr/local/lib/python3.8/dist-packages (from pdfplumber) (8.3.1)
Collecting Wand
  Downloading Wand-0.6.6-py2.py3-none-any.whl (138 kB)
     |████████████████████████████████| 138 kB 8.5 MB/s eta 0:00:01
Collecting pdfminer.six==20200517
  Downloading pdfminer.six-20200517-py3-none-any.whl (5.6 MB)
     |████████████████████████████████| 5.6 MB 22.0 MB/s eta 0:00:01
Collecting pycryptodome
  Downloading pycryptodome-3.10.1-cp35-abi3-manylinux2010_x86_64.whl (1.9 MB)
     |████████████████████████████████| 1.9 MB 46.7 MB/s eta 0:00:01
Requirement already satisfied: chardet; python_version > "3.0" in /usr/lib/python3/dist-packages (from pdfminer.six==20200517->pdfplumber) (3.0.4)
Collecting sortedcontainers
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Building wheels for collected packages: pdfplumber
  Building wheel for pdfplumber (setup.py) ... done
  Created wheel for pdfplumber: filename=pdfplumber-0.5.28-py3-none-any.whl size=32220 sha256=8df60e70751b3087fda49d8b20bb47d0e82931b60a2df7ea913391f68716facc
  Stored in directory: /home/dclong/.cache/pip/wheels/36/61/6d/5fdf7f85a9598d42f094b4099be9a3dd9a887b25ca9b5a1bf4
Successfully built pdfplumber
Installing collected packages: Wand, pycryptodome, sortedcontainers, pdfminer.six, pdfplumber
Successfully installed Wand-0.6.6 pdfminer.six-20200517 pdfplumber-0.5.28 pycryptodome-3.10.1 sortedcontainers-2.4.0
In [4]:
!wget http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf
--2021-07-15 15:18:14--  http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf
Resolving www.edd.ca.gov (www.edd.ca.gov)... 134.186.117.17
Connecting to www.edd.ca.gov (www.edd.ca.gov)|134.186.117.17|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf [following]
--2021-07-15 15:18:14--  https://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf
Connecting to www.edd.ca.gov (www.edd.ca.gov)|134.186.117.17|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 307728 (301K) [application/pdf]
Saving to: ‘eddwarncn12.pdf’

eddwarncn12.pdf     100%[===================>] 300.52K   760KB/s    in 0.4s    

2021-07-15 15:18:15 (760 KB/s) - ‘eddwarncn12.pdf’ saved [307728/307728]

In [1]:
import pdfplumber
In [2]:
pdf = pdfplumber.open("eddwarncn12.pdf")
In [3]:
type(pdf)
Out[3]:
pdfplumber.pdf.PDF
In [4]:
page = pdf.pages[0]
type(page)
Out[4]:
pdfplumber.page.Page
In [12]:
dir(page)
Out[12]:
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'annots',
 'bbox',
 'cached_properties',
 'chars',
 'close',
 'close_file',
 'crop',
 'cropbox',
 'curves',
 'debug_tablefinder',
 'decimalize',
 'dedupe_chars',
 'edges',
 'extract_table',
 'extract_tables',
 'extract_text',
 'extract_words',
 'filter',
 'find_tables',
 'flush_cache',
 'height',
 'horizontal_edges',
 'hyperlinks',
 'images',
 'initial_doctop',
 'is_original',
 'iter_layout_objects',
 'layout',
 'lines',
 'mediabox',
 'objects',
 'page_number',
 'page_obj',
 'parse_objects',
 'pdf',
 'process_object',
 'rect_edges',
 'rects',
 'rotation',
 'textboxhorizontals',
 'textboxverticals',
 'textlinehorizontals',
 'textlineverticals',
 'to_csv',
 'to_image',
 'to_json',
 'vertical_edges',
 'width',
 'within_bbox']

Extract Tables

  1. It often helps to crop a PDF page (Page.crop(bounding_box)) before extracting tables.

  2. Below are the default settings used when extracting tables.

     {
         "vertical_strategy": "lines", 
         "horizontal_strategy": "lines",
         "explicit_vertical_lines": [],
         "explicit_horizontal_lines": [],
         "snap_tolerance": 3,
         "snap_x_tolerance": 3,
         "snap_y_tolerance": 3,
         "join_tolerance": 3,
         "join_x_tolerance": 3,
         "join_y_tolerance": 3,
         "edge_min_length": 3,
         "min_words_vertical": 3,
         "min_words_horizontal": 1,
         "keep_blank_chars": False,
         "text_tolerance": 3,
         "text_x_tolerance": 3,
         "text_y_tolerance": 3,
         "intersection_tolerance": 3,
         "intersection_x_tolerance": 3,
         "intersection_y_tolerance": 3,
     }
    
    
    • Setting "vertical_strategy" and/or "horizontal_strategy" to "text" can help when the table has no vertical and/or horizontal ruling lines.
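
The two notes above (crop first, then switch to the "text" strategies) can be combined in one small helper. This is only a sketch: the function name `extract_table_by_text` and its `bbox` argument are hypothetical, while `Page.crop` and the `table_settings` argument of `Page.extract_table` are the pdfplumber calls already mentioned. The bounding box is assumed to be an `(x0, top, x1, bottom)` tuple in page coordinates.

```python
def extract_table_by_text(page, bbox=None):
    """Extract a table from a pdfplumber page that has no ruling lines.

    `page` is expected to be a pdfplumber.page.Page; `bbox` is an optional
    (x0, top, x1, bottom) tuple used to crop the page first.
    """
    # Cropping narrows the search area so that unrelated text is ignored.
    region = page.crop(bbox) if bbox is not None else page
    # Override only the two strategies; every other setting keeps its default.
    settings = {
        "vertical_strategy": "text",
        "horizontal_strategy": "text",
    }
    return region.extract_table(table_settings=settings)
```

For instance, `extract_table_by_text(page, bbox=(0, 50, page.width, page.height))` would skip the top 50 points of the page before looking for a table; the numbers are illustrative and need tuning for each document.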
In [13]:
table = page.extract_table()
type(table)
Out[13]:
list
In [14]:
table
Out[14]:
[['Company Name', 'Location', 'Employees\nAffected', 'Layoff\nDate'],
 ['AAR MOBILITY SYSTEMS', 'MCCLELLAN AFB', '48', '6/15/12'],
 ['ABBOTT VASCULAR', 'MURRIETA', '45', '1/25/12'],
 ['ABBOTT VASCULAR', 'MURRIETA', '38', '10/17/12'],
 ['ABBOTT VASCULAR', 'TEMECULA', '247', '1/25/12'],
 ['ABBOTT VASCULAR', 'TEMECULA', '7', '1/25/12'],
 ['ABBOTT VASCULAR', 'TEMECULA', '139', '10/17/12'],
 ['ABBOTT VASCULAR', 'TEMECULA', '16', '10/17/12'],
 ['ABEO MANAGEMENT CORPORATION', 'LOS ANGELES', '42', '11/28/12'],
 ['ABERCROMBIE & FITCH', 'ANAHEIM', '51', '1/14/12'],
 ['ABERCROMBIE & FITCH', 'CAPITOLA', '51', '1/21/12'],
 ['ABERCROMBIE & FITCH', 'RIVERSIDE', '64', '1/14/12'],
 ['ABERCROMBIE & FITCH', 'SAN DIEGO', '66', '12/29/12'],
 ['ABERCROMBIE & FITCH', 'SIMI VALLEY', '70', '3/24/12'],
 ['ABERCROMBIE & FITCH', 'SIMI VALLEY', '47', '3/24/12'],
 ['ADAMS RITE MANUFACTURING \nCOMPANY', 'PONOMA', '110', '5/25/12'],
 ['ADOBE SYSTEMS INCORPORATED', 'SAN FRANCISCO', '121', '1/31/12'],
 ['ADOBE SYSTEMS INCORPORATED', 'SAN JOSE', '103', '1/31/12'],
 ['ADVANCED MICRO DEVICES, INC', 'SUNNYVALE', '107', '10/25/12']]
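
The extracted table is a plain list of rows whose cells may contain line breaks from wrapped text (e.g. 'Layoff\nDate'). A possible clean-up step, sketched below with only the standard library, collapses that whitespace and turns each row into a dict keyed by the header. The helper name `table_to_records` is hypothetical, and the sample rows are copied from the output above.

```python
def table_to_records(table):
    """Turn a list of rows (first row = header) into a list of dicts."""

    def clean(cell):
        # Collapse line breaks and repeated spaces; treat missing cells as "".
        return " ".join(cell.split()) if cell is not None else ""

    header = [clean(cell) for cell in table[0]]
    return [dict(zip(header, (clean(cell) for cell in row))) for row in table[1:]]


sample = [
    ["Company Name", "Location", "Employees\nAffected", "Layoff\nDate"],
    ["AAR MOBILITY SYSTEMS", "MCCLELLAN AFB", "48", "6/15/12"],
    ["ADAMS RITE MANUFACTURING \nCOMPANY", "PONOMA", "110", "5/25/12"],
]
records = table_to_records(sample)
print(records[1]["Company Name"])  # ADAMS RITE MANUFACTURING COMPANY
```

All values stay strings here; converting "Employees Affected" to integers or parsing the dates would be a separate step.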

Convert a PDF Page to Image

In [ ]:
page.to_image()
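
In a notebook, the value returned by `page.to_image()` is displayed inline. Outside a notebook it can be written to disk instead. The helper below is a sketch under assumptions: the name `save_page_image` and its parameters are hypothetical, and it relies on the `resolution` argument of `to_image` plus the `debug_tablefinder` and `save` methods of the returned image object. In the pdfplumber version installed above, rendering also appears to depend on ImageMagick and Ghostscript being available on the system.

```python
def save_page_image(page, path, resolution=150, show_tables=False):
    """Render a pdfplumber page and write it to `path` as a PNG file."""
    # to_image() returns a PageImage; `resolution` is in pixels per inch.
    image = page.to_image(resolution=resolution)
    if show_tables:
        # Overlay the lines and cells that the table finder detected.
        image.debug_tablefinder()
    image.save(path, format="PNG")
    return image
```

Calling `save_page_image(page, "page1.png", show_tables=True)` is a convenient way to check visually which lines the table finder picked up before adjusting the table settings.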