Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
In [1]:
!pip3 install pdfplumber
In [4]:
!wget http://www.edd.ca.gov/jobs_and_training/warn/eddwarncn12.pdf
In [1]:
import pdfplumber
In [2]:
pdf = pdfplumber.open("eddwarncn12.pdf")
In [3]:
type(pdf)
Out[3]:
In [4]:
page = pdf.pages[0]
type(page)
Out[4]:
In [12]:
dir(page)
Out[12]:
Extract Tables¶
It often helps to crop a PDF page (Page.crop(bounding_box)) before extracting tables. Below are the default settings when extracting tables.
{
    "vertical_strategy": "lines",
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "join_x_tolerance": 3,
    "join_y_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "keep_blank_chars": False,
    "text_tolerance": 3,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
    "intersection_tolerance": 3,
    "intersection_x_tolerance": 3,
    "intersection_y_tolerance": 3,
}
- Setting "vertical_strategy" and/or "horizontal_strategy" to "text" can help when the table has no vertical and/or horizontal lines.
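The two tips above (cropping before extraction, and the "text" strategies) can be combined. Below is a minimal sketch, assuming `page` is a pdfplumber Page. The helper name `extract_table_from_region` and its `top_frac`/`bottom_frac` parameters are hypothetical and only for illustration; they are not part of pdfplumber.

```python
def extract_table_from_region(page, top_frac=0.0, bottom_frac=1.0, table_settings=None):
    """Crop `page` to a horizontal band, then extract a table from that band.

    `top_frac` and `bottom_frac` are fractions of the page height (hypothetical
    parameters for this sketch). `page` is expected to be a pdfplumber Page.
    """
    if not 0.0 <= top_frac < bottom_frac <= 1.0:
        raise ValueError("expected 0 <= top_frac < bottom_frac <= 1")

    # Page.bbox is (x0, top, x1, bottom) in PDF points.
    x0, top, x1, bottom = page.bbox
    height = bottom - top
    band = (x0, top + height * top_frac, x1, top + height * bottom_frac)

    # Default to the text-based strategies, which can help when the table
    # has no ruling lines. Pass your own settings to override this choice.
    if table_settings is None:
        table_settings = {
            "vertical_strategy": "text",
            "horizontal_strategy": "text",
        }

    return page.crop(band).extract_table(table_settings=table_settings)
```

For example, `extract_table_from_region(page, top_frac=0.1)` would ignore the top tenth of the page. The right fractions and settings depend on the layout of the PDF at hand, so expect some trial and error.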
In [13]:
table = page.extract_table()
type(table)
Out[13]:
In [14]:
table
Out[14]:
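The value returned by `extract_table()` is a plain list of rows, each row being a list of cell strings (cells may be `None` when empty). A minimal sketch for turning that into a list of dicts follows. It assumes the first row is the header, which should be checked against the actual output for this PDF; `table_to_records` is a hypothetical helper name.

```python
def table_to_records(table):
    """Turn a list-of-rows table into a list of dicts keyed by the header row."""
    if not table:
        return []

    def clean(cell):
        # Empty cells can be None; multi-line cells contain "\n".
        # Collapse all runs of whitespace into single spaces.
        return "" if cell is None else " ".join(cell.split())

    # Assumption: the first row holds the column names.
    header = [clean(c) for c in table[0]]

    records = []
    for row in table[1:]:
        cells = [clean(c) for c in row]
        if not any(cells):
            continue  # skip rows that are entirely empty
        records.append(dict(zip(header, cells)))
    return records
```

Calling `table_to_records(table)` on the table above gives one dict per data row, which is easier to filter or to hand to a CSV writer than the raw nested lists.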
Convert a PDF Page to Image¶
In [ ]:
page.to_image()
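Outside a notebook, the image can be written to disk instead. Below is a minimal sketch, assuming `page` is a pdfplumber Page; `save_page_preview`, its default file name, and the `show_tables` flag are hypothetical. The `resolution` argument of `to_image()` is in DPI, and `debug_tablefinder()` draws the detected table lines on the image, which can help when tuning the table settings.

```python
def save_page_preview(page, path="page.png", resolution=150,
                      show_tables=False, table_settings=None):
    """Render `page` to an image file and return the path written.

    `page` is expected to be a pdfplumber Page. The helper name and its
    defaults are assumptions made for this sketch.
    """
    # Higher resolution gives a sharper (and larger) image.
    im = page.to_image(resolution=resolution)

    if show_tables:
        # Overlay the lines and cells found by the table finder.
        im.debug_tablefinder(table_settings or {})

    im.save(path)
    return path
```

For example, `save_page_preview(page, "page0.png", show_tables=True)` would write a PNG with the table-finder overlay for the first page.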