Ben Chuanlong Du's Blog

It is never too late to learn.

Extracting Data from PDF Files

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Sometimes a PDF file is corrupted or encrypted, which makes it hard to extract data from it directly. In that case, you can convert a PDF page to an image first and then use an AI tool (e.g., Table Image to CSV Converter) to extract the data from the image.
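The page-to-image step can be sketched in Python as follows. This is only a rough sketch, assuming the pdfplumber library (listed under Python Libraries below) is installed and is able to open the file at all; the function names `page_image_path` and `render_pages`, the output file naming scheme, and the default resolution are my own choices for illustration.

```python
from pathlib import Path


def page_image_path(pdf_path, page_number, out_dir):
    """Build the output PNG path for one page (hypothetical naming scheme)."""
    stem = Path(pdf_path).stem
    return Path(out_dir) / f"{stem}-page-{page_number:03d}.png"


def render_pages(pdf_path, out_dir, resolution=200):
    """Render every page of a PDF to a PNG file and return the written paths."""
    # Imported lazily so the helper above works without the third-party package.
    import pdfplumber  # pip install pdfplumber

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            target = page_image_path(pdf_path, number, out_dir)
            # to_image() rasterizes the page; a higher resolution gives
            # table-recognition tools more detail to work with.
            page.to_image(resolution=resolution).save(str(target), format="PNG")
            written.append(target)
    return written
```

For example, `render_pages("report.pdf", "pages")` (with a hypothetical `report.pdf`) would write `pages/report-page-001.png` and so on, and those images can then be uploaded to an image-to-CSV tool.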

AI-powered Tools

Web-based Tools

Stirling-PDF is a robust, locally hosted, web-based PDF manipulation tool that you can run with Docker.

Python Libraries

  • pdf2text

  • pdfplumber
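When the PDF is readable, direct extraction with pdfplumber may be enough. Below is a minimal sketch, assuming pdfplumber is installed and the tables are simple enough for its default table detection; `rows_to_csv` and `extract_tables_as_csv` are hypothetical helper names, not part of the library.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize table rows (lists of cells, None for empty cells) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # pdfplumber reports empty cells as None; write them as empty strings.
        writer.writerow(["" if cell is None else str(cell).strip() for cell in row])
    return buffer.getvalue()


def extract_tables_as_csv(pdf_path, password=None):
    """Return one CSV string per table that pdfplumber detects in the PDF."""
    # Imported lazily so rows_to_csv works without the third-party package.
    import pdfplumber  # pip install pdfplumber

    results = []
    with pdfplumber.open(pdf_path, password=password) as pdf:
        for page in pdf.pages:
            # extract_tables() returns a list of tables, each a list of rows.
            for table in page.extract_tables():
                results.append(rows_to_csv(table))
    return results
```

If the PDF is password protected and you know the password, pass it via `password`; if the extracted tables come out empty or garbled, falling back to the image-based approach above is probably the easier route.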
