Converting PDFs to Text
Introduction
As I discussed in this tutorial, PDFs are notoriously difficult to scrape. Converting them to text files can make extracting their data significantly easier. There are several tools out there to help you do this, but I will focus on the one that I think is the best and easiest to use: pdfminer.
The files containing all of the code that I use in this tutorial can be found here.
Installing and Importing pdfminer
Unfortunately, the original pdfminer package does not support Python 3.x (the community fork pdfminer.six targets Python 3). If you already have a 3.x version installed on your computer, you can install a 2.x version alongside it and route your pdfminer programs through that launcher using the instructions here. Once you have a 2.x version installed, install pdfminer with pip.
In the following two sections, you’ll learn how to convert your PDFs to .txt by running pdfminer from the command line in Windows. If you have a Mac/Linux OS, or want to use pdfminer as a module in Python, skip to the section “Converting PDFs to .txt in Python.”
Converting One PDF to .txt
You can run programs from the command line by typing the commands directly into your terminal window, or by writing them in a .bat file and double-clicking it. I suggest the latter method, because it lets you rerun your program without retyping everything. To convert one PDF to a text file:
- Create a new folder. In this example, mine is titled “pdfToText.”
- Put your PDF and all of the pdfminer files/folders that pip installed into your new folder.
- Create a .bat file in your favorite text editor.
- In your .bat file, type the cd command to change directories to your new folder.
- Use the following syntax to type the command to convert your PDF. Code from here.
Replace “filename.pdf” with the filename of your PDF. In my example below, I only use the “-o” option, to specify a filename for my .txt file. The filename of my PDF is “example.pdf.”
The lines starting with “@rem” are comments. “cmd /k” just keeps the terminal window open after the program has finished executing.
That’s it! You should now have your .txt file!
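If you would rather launch the same conversion from a Python script than from a .bat file, the sketch below builds the command described above as an argument list. It is only an illustration: the helper names build_pdf2txt_command and convert_one are my own, the script path is a placeholder for wherever pdf2txt.py lives on your machine, and the only pdf2txt.py option it relies on is the “-o” option mentioned above.

```python
import subprocess
import sys


def build_pdf2txt_command(pdf_path, txt_path, script="pdf2txt.py"):
    """Return the argument list for converting one PDF with pdf2txt.py.

    `script` is a placeholder path (an assumption); only the "-o"
    (output file) option is used.
    """
    # sys.executable is the interpreter running this script, so the
    # conversion runs under the same Python installation.
    return [sys.executable, script, "-o", txt_path, pdf_path]


def convert_one(pdf_path, txt_path, script="pdf2txt.py"):
    """Run the conversion; raises CalledProcessError if pdf2txt.py fails."""
    subprocess.check_call(build_pdf2txt_command(pdf_path, txt_path, script))
```

Calling convert_one("example.pdf", "example.txt") would then play the same role as double-clicking the .bat file.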
Converting Multiple PDFs to .txt
If you have multiple PDFs that you need to convert, you just have to iterate through them and call the same command as above. Follow these steps:
- Create a new folder, and put all of your PDFs in there. In this example, my folder is titled “pdfs.”
- Create a new folder to store your .txt files. My folder is titled “txt.”
- In your .bat file, type the cd command to change directories to your PDF folder.
- Use the command line for-loop syntax in the following example to loop through your PDFs and convert them all to .txt.
Explanation of the above code:
- “%%i” stands for the current PDF file.
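- I put “%%” in front of every “i” because, inside a batch file, a for-loop variable has to be written with two percent signs. (Typed directly at the command prompt, it takes only one: “%i”.)
- “(*)” is the set of files to loop over; the wildcard “*” matches every file in the current directory.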
- "c:\pdftotext\pdf2txt.py" tells the computer to run “pdf2txt.py” from the “c:\pdftotext” directory.
- The modifier “~n” returns only the filename of the current file, without the directory or extension. I used this modifier to make the filenames of my .txt files be the same as those of their corresponding PDFs. See here for more information about modifiers.
Your PDFs should now be converted to .txt.
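For comparison, here is roughly what that loop does, written as a Python sketch. The function name plan_conversions is hypothetical, the default script path is the one from this example, and os.path.splitext plays the role of the filename-only modifier by dropping the extension. Unlike the batch loop, which visits every file in the folder, this version skips anything that does not end in .pdf.

```python
import os


def plan_conversions(pdf_dir, txt_dir, script=r"c:\pdftotext\pdf2txt.py"):
    """Return one pdf2txt.py command (as an argument list) per PDF in pdf_dir."""
    commands = []
    for name in sorted(os.listdir(pdf_dir)):
        # "report.pdf" -> ("report", ".pdf"); the stem becomes the .txt name
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".pdf":
            continue  # ignore anything that is not a PDF
        pdf_path = os.path.join(pdf_dir, name)
        txt_path = os.path.join(txt_dir, stem + ".txt")
        commands.append(["python", script, "-o", txt_path, pdf_path])
    return commands
```

Each returned list can be passed to subprocess.check_call to run the conversion.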
Converting PDFs to .txt in Python
Using pdfminer as a module to convert PDFs can be done with the following steps.
Copy and paste the following code, found on this website, into your Python script. The convert() function returns the text content of a PDF as a string.
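A minimal sketch of such a convert() function is shown below. It assumes pdfminer's usual module layout (pdfminer.pdfinterp, pdfminer.converter, pdfminer.layout and pdfminer.pdfpage), and the optional pages parameter is an illustrative extra. Treat it as a starting point rather than as the exact code from the website mentioned above.

```python
import io


def convert(pdf_path, pages=None):
    """Return the text content of a PDF as a string.

    `pages` is an optional iterable of zero-based page numbers;
    None (the default) converts every page.
    """
    # Imported here so that a missing pdfminer installation surfaces as an
    # ImportError only when a conversion is actually attempted.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    page_numbers = set(pages) if pages else set()  # empty set = all pages

    output = io.BytesIO()  # the converter writes UTF-8 encoded bytes here
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, codec="utf-8",
                              laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    with open(pdf_path, "rb") as infile:
        for page in PDFPage.get_pages(infile, page_numbers):
            interpreter.process_page(page)

    converter.close()
    text = output.getvalue().decode("utf-8")
    output.close()
    return text
```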
Now that we have a way to get the text content of a PDF, all we have to do is:
- Iterate through all of our PDFs.
- For each PDF, get the text content.
- Open (or create) a .txt file.
- Write the text content to the .txt file.
You can do this using the following function and calling it like so:
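The steps listed above could be sketched as follows. The names convert_multiple, pdf_dir and txt_dir are illustrative, and the text-extraction function is passed in as an argument so that the loop does not depend on any particular version of convert().

```python
import io
import os


def convert_multiple(pdf_dir, txt_dir, convert):
    """Write one .txt file into txt_dir for every PDF found in pdf_dir.

    `convert` is any function that takes a PDF path and returns its text.
    Returns the list of .txt paths that were written.
    """
    written = []
    for name in sorted(os.listdir(pdf_dir)):
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".pdf":
            continue  # skip files that are not PDFs
        text = convert(os.path.join(pdf_dir, name))
        txt_path = os.path.join(txt_dir, stem + ".txt")
        # io.open accepts an encoding on both Python 2 and Python 3
        with io.open(txt_path, "w", encoding="utf-8") as txt_file:
            txt_file.write(text)
        written.append(txt_path)
    return written
```

A call might look like convert_multiple("pdfs", "txt", convert), assuming both folders already exist.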
Of course, if you just want to convert one PDF, use the code inside of the for loop. I’ve also created a module (download from here) containing the previous two functions that you can import or call from the command line like so:
“pdfToT.py -p <pdfdirectory> -t <textdirectory>”
“pdfdirectory” and “textdirectory” default to your current working directory.
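As a rough idea of how a command-line interface like this can be wired up, the sketch below parses the same two options with argparse. It is not the module itself: parse_args is a hypothetical helper, and only the -p and -t flags and their current-directory defaults come from the description above.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse -p <pdfdirectory> and -t <textdirectory>.

    Both options default to the current working directory.
    """
    parser = argparse.ArgumentParser(description="Convert PDFs to .txt files.")
    parser.add_argument("-p", dest="pdf_dir", default=os.getcwd(),
                        help="folder containing the PDFs")
    parser.add_argument("-t", dest="txt_dir", default=os.getcwd(),
                        help="folder to write the .txt files to")
    # argv=None makes argparse read the real command line (sys.argv[1:])
    return parser.parse_args(argv)
```

The resulting pdf_dir and txt_dir values can then be handed to whatever function performs the conversion.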