How to Work With PDF Documents Using Python

I really admire Portable Document Format (PDF) files. I remember the days when they solved the formatting problems that came up when exchanging documents, whether caused by differences between Word versions or by something else.

We are mainly talking about Python here, aren't we? And we are interested in tying that to working with PDF documents. You may say that's simple, especially if you have used Python with text files (txt) before. But it is a bit different here: PDF documents are binary files, and they are more complex than plain text files because they contain different font types, colors, etc.
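To see what "binary" means in practice, here is a small sketch that checks whether some bytes look like a PDF. It relies on the fact that a PDF file begins with the marker %PDF- followed by a version number. The helper names looks_like_pdf and file_looks_like_pdf are made up for this illustration, and the check is a quick sanity test rather than full validation.

```python
def looks_like_pdf(data):
    """Return True if the bytes start with the PDF header marker."""
    # A PDF file begins with b"%PDF-" followed by a version such as 1.4
    return data.startswith(b"%PDF-")


def file_looks_like_pdf(path):
    """Read only the first few bytes of a file, in binary mode, and test them."""
    with open(path, "rb") as handle:  # 'rb' = read binary, no text decoding
        return looks_like_pdf(handle.read(5))
```

Note that the comparison is made against a bytes literal (b"..."), not a str, because a file opened in binary mode returns bytes.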

That doesn't mean it is hard to work with PDF documents using Python. It is rather simple, and an external module does the heavy lifting.

PyPDF2

As mentioned above, an external module is the key. The module we will use in this tutorial is PyPDF2. Since it is an external module, the first step is to install it. For that, we will use pip, which is (according to Wikipedia):

A package management system used to install and manage software packages written in Python. Many packages can be found in the Python Package Index (PyPI).

You can follow the steps in the Python Packaging User Guide for installing pip, but if you have Python 2.7.9 or higher, or Python 3.4 or higher, you already have pip!

PyPDF2 can now be installed by typing the following command (in Mac OS X's Terminal):

pip install pypdf2

Great! You now have PyPDF2 installed, and you're ready to start playing with PDF documents.
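If you want to confirm from Python that the installation worked, one option is to ask the import system whether it can locate the module. This is a sketch that assumes Python 3.4 or later, where importlib.util.find_spec is available; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    # find_spec returns None when no top-level module with that name is found
    return importlib.util.find_spec(module_name) is not None


# Note the capitalisation: the importable module is PyPDF2, not pypdf2
print("PyPDF2 installed:", is_installed("PyPDF2"))
```

The helper is meant for top-level module names only; the printed result depends on your environment.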

Reading a PDF Document

The sample file we will be working with in this tutorial is sample.pdf. Go ahead and download the file to follow the tutorial, or you can simply use any PDF file you like.

Let's now go ahead and read the PDF document. Since we will be using PyPDF2, we need to import the module, as follows:

import PyPDF2

After importing the module, we will be using the PdfFileReader class. So, the script for reading the PDF document looks as follows:

import PyPDF2

# PDF files are binary, so open the file in 'rb' (read binary) mode
pdf_file = open('sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
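One detail worth noting about the script above is that the file handle is never closed. A possible variant, sketched here under the assumption that the document is small enough to hold in memory, reads the bytes inside a with block and hands PyPDF2 an in-memory stream instead. The helper names open_pdf_stream and make_reader are invented for this example, and make_reader assumes a PyPDF2 release older than 3.0, where the PdfFileReader class used in this tutorial is still available.

```python
import io


def open_pdf_stream(path):
    """Read a file in binary mode and return its contents as an in-memory stream."""
    with open(path, "rb") as handle:  # the with block closes the file for us
        return io.BytesIO(handle.read())


def make_reader(path):
    """Build a PdfFileReader from an in-memory copy of the file."""
    import PyPDF2  # third-party; imported here so open_pdf_stream works without it

    return PyPDF2.PdfFileReader(open_pdf_stream(path))
```

Calling make_reader('sample.pdf') should then give a reader object that can be used like read_pdf in the rest of the tutorial.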

More Operations on PDF Documents

After reading the PDF document, we can now carry out different operations on the document, as we will see in this section.

Number of Pages

Let's check the number of pages in sample.pdf. For this, we can use the getNumPages() method:

number_of_pages = read_pdf.getNumPages()
print(number_of_pages)

In this case, the returned value will be 1.

Page Number

Let's now check the number of a given page in the PDF document. We can use the getPageNumber(page) method. Notice that we have to pass an object of type page to the method. To retrieve a page, we will use the getPage(number) method, where number is the page's index in the PDF document. Indexing starts at 0.

Well, I know when you use getPage(number) you already know the page number, but this is just to illustrate how to use those methods together. This can be demonstrated in the following script:

page = read_pdf.getPage(0)
page_number = read_pdf.getPageNumber(page)
print(page_number)

Go ahead, try the script. What output did you get?

We know that in sample.pdf (the file we are experimenting with), we only have one page (number 0). What if we passed the number 1 as the page number to getPage(number)? In this case, you will get the following error:

Traceback (most recent call last):
  File "test.py", line 6, in <module>
    page = read_pdf.getPage(1)
  File "/usr/local/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1158, in getPage
    return self.flattenedPages[pageNumber]
IndexError: list index out of range

This is because we asked for a page number that is out of range: the document has no page with index 1.
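To avoid this error, the page index can be checked against getNumPages() before calling getPage(). The following is a minimal sketch; get_page_or_none is a hypothetical helper, not part of PyPDF2, and it works with any reader object that offers the getNumPages() and getPage() methods used above.

```python
def get_page_or_none(reader, index):
    """Return the page at a zero-based index, or None if it is out of range."""
    # Valid indexes run from 0 to getNumPages() - 1; negative values are rejected
    if 0 <= index < reader.getNumPages():
        return reader.getPage(index)
    return None
```

With the one-page sample.pdf, get_page_or_none(read_pdf, 0) returns the page, while get_page_or_none(read_pdf, 1) returns None instead of raising IndexError.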

Page Mode

A PDF document can specify a page mode, which tells the viewer how to display the document when it is opened. The available modes are as follows:

/UseNone	Do not show outlines or thumbnails panels
/UseOutlines	Show outlines (aka bookmarks) panel
/UseThumbs	Show page thumbnails panel
/FullScreen	Fullscreen view
/UseOC	Show Optional Content Group (OCG) panel
/UseAttachments	Show attachments panel

In order to check our page mode, we can use the following script:

# The page mode is read from the reader object itself, not from a single page
page_mode = read_pdf.getPageMode()
print(page_mode)

In the case of our PDF document (sample.pdf), the returned value is None, which means that the page mode is not specified. If you want to specify a page mode, you can use the setPageMode(mode) method of the PdfFileWriter class, where mode is one of the modes listed in the table above.
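In PyPDF2, setPageMode is a method of the PdfFileWriter class, so setting a mode means copying the pages into a new document and saving it. Here is a rough sketch, assuming a PyPDF2 release older than 3.0, where PdfFileWriter and its addPage, setPageMode, and write methods exist under these names. The helper names copy_with_page_mode and save_fullscreen_copy are made up for illustration.

```python
VALID_PAGE_MODES = (
    "/UseNone", "/UseOutlines", "/UseThumbs",
    "/FullScreen", "/UseOC", "/UseAttachments",
)


def copy_with_page_mode(reader, writer, mode):
    """Copy every page from reader into writer and set the writer's page mode."""
    if mode not in VALID_PAGE_MODES:
        raise ValueError("unknown page mode: %r" % (mode,))
    for index in range(reader.getNumPages()):
        writer.addPage(reader.getPage(index))
    writer.setPageMode(mode)
    return writer


def save_fullscreen_copy(source_path, target_path):
    """Write a copy of source_path that asks viewers to open it full screen."""
    import PyPDF2  # third-party; assumes the pre-3.0 class names

    with open(source_path, "rb") as source:
        writer = copy_with_page_mode(
            PyPDF2.PdfFileReader(source), PyPDF2.PdfFileWriter(), "/FullScreen"
        )
        # Write while the source file is still open, since pages are read lazily
        with open(target_path, "wb") as target:
            writer.write(target)
```

Validating the mode up front is a design choice for this sketch: it reports a mistyped mode name before any pages are copied.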

Extract Text

We have been wandering around the file so far, so let's see what's inside. The method extractText() will be our friend in this task.

Let me show you the full script this time, rather than only the lines needed for a single operation, as I did above. The script to extract text from the PDF document is as follows:

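import PyPDF2

# Open in binary mode, since PDF files are not plain text
pdf_file = open('sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
# Extract the text of the first page (page numbers start at 0)
page = read_pdf.getPage(0)
page_content = page.extractText()
print(page_content)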

I was surprised to get the following output rather than the text in sample.pdf:

!"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
%

This is most likely due to a font issue, where the character codes stored in the file map to other values than the characters you see on the page. In other words, the problem sometimes lies in the PDF document itself, which might not contain the data required to recover the original text.

I thus tried another file, which is a paper of mine: paper.pdf. Go ahead and replace sample.pdf in the code with paper.pdf. The output in this case was:

Medical Imaging 2012: Image Perception, Observer Performance, and Technology Assessment, edited by Craig K. Abbey, Claudia R. Mello-Thoms, Proc. of SPIE Vol. 8318, 83181I © 2012 SPIE · CCC code: 1605-7422/12/$18 · doi: 10.1117/12.912389Proc. of SPIE Vol. 8318  83181I-1Downloaded from SPIE Digital Library on 13 Aug 2012 to 134.130.12.208. Terms of Use:  http://spiedl.org/terms

But where is the rest of the text on the page? The extractText() method is evidently not perfect and still needs improvement. The goal here, though, is to show you how to work with PDF files using Python, and text extraction is clearly one area where the tooling has room to grow.
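The script above reads only the first page. To collect whatever extractText() can recover from every page of a longer document, the calls shown in this tutorial can be combined in a loop. This is a sketch; extract_all_text is a hypothetical helper, and it does nothing to improve the quality of the extraction itself.

```python
def extract_all_text(reader):
    """Return the extracted text of every page, one list entry per page."""
    texts = []
    for index in range(reader.getNumPages()):
        page = reader.getPage(index)
        # extractText() may return garbled or partial text, as seen above
        texts.append(page.extractText())
    return texts
```

Passing the read_pdf object from the script above gives a list with one string per page, which for sample.pdf is a single entry.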

Conclusion

As we can see, Python makes it simple to work with PDF documents. This tutorial has only scratched the surface of the topic, and you can find more details on the different operations you can perform on PDF documents in the PyPDF2 documentation.
