I really admire Portable Document Format (PDF) files. I remember when they solved the formatting problems that came up while exchanging documents, whether those were caused by differences between Word versions or by something else.
We are mainly talking about Python here, aren't we? And we are interested in tying that to working with PDF documents. You may say that's simple, especially if you have used Python with text files (txt) before. But it is a bit different here. PDF documents are binary files and are more complex than plain text files, especially since they contain different font types, colors, etc.
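To make the "binary file" point concrete, here is a minimal sketch that opens a file in binary mode and checks for the %PDF- marker that PDF files begin with. The helper name looks_like_pdf is hypothetical, and the check is only a heuristic: it says nothing about whether the rest of the file is valid.

```python
def looks_like_pdf(path):
    """Return True if the file starts with the PDF header marker."""
    # 'rb' reads raw bytes; text mode would try to decode them as characters.
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"
```

Calling looks_like_pdf('sample.pdf') on a real PDF document should return True, while a plain text file would normally give False.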
That doesn't mean that it is hard to work with PDF documents using Python. It is actually rather simple once you use an external module.
PyPDF2
As we mentioned above, using an external module is the key. The module we will be using in this tutorial is PyPDF2. As it is an external module, the first step is to install it. For that, we will be using pip, which is (according to Wikipedia):
A package management system used to install and manage software packages written in Python. Many packages can be found in the Python Package Index (PyPI).
You can follow the steps mentioned in the Python Packaging User Guide for installing pip, but if you have Python 2.7.9 or higher, or Python 3.4 or higher, you already have pip!
PyPDF2 can now be installed simply by typing the following command (in Mac OS X's Terminal):
pip install pypdf2
Great! You now have PyPDF2 installed, and you're ready to start playing with PDF documents.
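If you want to confirm that the installation worked, the sketch below asks Python which version of PyPDF2 is installed. It assumes Python 3.8 or newer (for importlib.metadata), and the helper name installed_version is hypothetical. The class and method names used in the rest of this tutorial (PdfFileReader, getNumPages(), and so on) belong to the older PyPDF2 API; as far as I am aware, releases from 3.0 on no longer accept them, so if this prints a 3.x version, installing an earlier release (for example pip install "PyPDF2<3") is one way to follow along.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("PyPDF2")
print("PyPDF2 is not installed" if version is None else "PyPDF2 " + version)
```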
Reading a PDF Document
The sample file we will be working with in this tutorial is sample.pdf. Go ahead and download the file to follow the tutorial, or you can simply use any PDF file you like.
Let's now go ahead and read the PDF document. Since we will be using PyPDF2, we need to import the module, as follows (note that the module name is case-sensitive):
import PyPDF2
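After importing the module, we will be using the PdfFileReader class. Since PDF documents are binary files, the file has to be opened in binary mode ('rb'). So, the script for reading the PDF document looks as follows:
import PyPDF2

# Open in binary mode, since PDF documents are binary files
pdf_file = open('sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)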
More Operations on PDF Documents
After reading the PDF document, we can now carry out different operations on the document, as we will see in this section.
Number of Pages
Let's check the number of pages in sample.pdf. For this, we can use the getNumPages() method:
number_of_pages = read_pdf.getNumPages()
print(number_of_pages)
In this case, the returned value will be 1.
Page Number
Let's now check the number of some page in the PDF document. We can use the method getPageNumber(page). Notice that we have to pass a page object to the method. To retrieve a page, we will use the getPage(number) method, where number represents the page number in the PDF document. The argument number starts with the value 0.
Well, I know that when you use getPage(number) you already know the page number, but this is just to illustrate how to use those methods together. This can be demonstrated in the following script:
page = read_pdf.getPage(0)
page_number = read_pdf.getPageNumber(page)
print(page_number)
Go ahead, try the script. What output did you get?
We know that in sample.pdf (the file we are experimenting with), we only have one page (number 0). What if we passed the number 1 as the page number to getPage(number)? In this case, you will get the following error:
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    page = read_pdf.getPage(1)
  File "/usr/local/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1158, in getPage
    return self.flattenedPages[pageNumber]
IndexError: list index out of range
This is because we are asking for a page number that is out of range: the document has only one page (number 0), so there is no page number 1.
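One way to avoid the IndexError is to compare the requested number against getNumPages() before calling getPage(number). The following is a minimal sketch under that assumption; the helper name get_page_or_none is hypothetical, and reader is expected to be a PdfFileReader object like read_pdf above.

```python
def get_page_or_none(reader, index):
    """Return page `index` from a PdfFileReader-like object, or None if out of range."""
    # Valid page numbers run from 0 to getNumPages() - 1.
    # Negative numbers are treated as out of range here, by choice.
    if 0 <= index < reader.getNumPages():
        return reader.getPage(index)
    return None


# Example use with the objects from this tutorial (left as a comment so the
# snippet does not depend on a PDF file being present):
# page = get_page_or_none(read_pdf, 1)
# if page is None:
#     print("No such page")
```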
Page Mode
A PDF document comes with a page mode, which controls how the document is presented when it is opened. The available modes are as follows:
/UseNone | Do not show outlines or thumbnails panels |
/UseOutlines | Show outlines (aka bookmarks) panel |
/UseThumbs | Show page thumbnails panel |
/FullScreen | Fullscreen view |
/UseOC | Show Optional Content Group (OCG) panel |
/UseAttachments | Show attachments panel |
In order to check our page mode, we can use the following script:
# The page mode belongs to the document, so no page object is needed
page_mode = read_pdf.getPageMode()
print(page_mode)
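In the case of our PDF document (sample.pdf), the returned value is None, which means that the page mode is not specified. If you want to specify a page mode, you can use the method setPageMode(mode), where mode is one of the modes listed in the table above. Note that setPageMode(mode) belongs to the PdfFileWriter class rather than PdfFileReader, so the mode is set on a document you are writing out.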
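Here is a minimal sketch of setting a page mode: copy the pages into a PdfFileWriter, call setPageMode(mode), and write the result to a new file. The helper name copy_with_page_mode and the output filename are hypothetical, and the code assumes the older PyPDF2 API used throughout this tutorial.

```python
VALID_PAGE_MODES = (
    "/UseNone", "/UseOutlines", "/UseThumbs",
    "/FullScreen", "/UseOC", "/UseAttachments",
)


def copy_with_page_mode(reader, writer, mode, out_stream):
    """Copy every page from reader to writer, set the page mode, and write the result."""
    if mode not in VALID_PAGE_MODES:
        raise ValueError("unknown page mode: %r" % (mode,))
    for index in range(reader.getNumPages()):
        writer.addPage(reader.getPage(index))
    writer.setPageMode(mode)
    writer.write(out_stream)


# Example use with PyPDF2 (left as a comment so the snippet does not depend
# on the library or on a PDF file being present):
# import PyPDF2
# with open('sample.pdf', 'rb') as src, open('sample_outlines.pdf', 'wb') as dst:
#     copy_with_page_mode(PyPDF2.PdfFileReader(src), PyPDF2.PdfFileWriter(),
#                         '/UseOutlines', dst)
```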
Extract Text
We have been wandering around the file so far, so let's see what's inside. The method extractText() will be our friend in this task.
Let me show you the full script this time, rather than only the lines needed for the operation as I did above. The script to extract text from the PDF document is as follows:
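import PyPDF2

# Open in binary mode, since PDF documents are binary files
pdf_file = open('sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
# Extract the text of the first page (page numbers start at 0)
page = read_pdf.getPage(0)
page_content = page.extractText()
print(page_content)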
I was surprised when I got the following output rather than the text in sample.pdf:
!"#$%#$%&%$&'()*%+,-%./01'*23%4 5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&) %
This is most likely due to a font issue, such that the character codes map to other values. So it is sometimes an issue with the PDF document itself, as the PDF document might not contain the data required to restore the content.
I thus tried another file, which is a paper of mine: paper.pdf. Go ahead and replace sample.pdf in the code with paper.pdf. The output in this case was:
Medical Imaging 2012: Image Perception, Observer Performance, and Technology Assessment, edited by Craig K. Abbey, Claudia R. Mello-Thoms, Proc. of SPIE Vol. 8318, 83181I © 2012 SPIE · CCC code: 1605-7422/12/$18 · doi: 10.1117/12.912389Proc. of SPIE Vol. 8318 83181I-1Downloaded from SPIE Digital Library on 13 Aug 2012 to 134.130.12.208. Terms of Use: http://spiedl.org/terms
But where is the rest of the text on the page? Well, the extractText() method is clearly not perfect and still needs improvement. Still, the goal here is to show you how to work with PDF files using Python, and it seems that text extraction is an area where more work is needed.
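The script above only reads page 0. If your document has more than one page, a small loop over getNumPages() collects whatever extractText() returns for each of them. This is a minimal sketch; the helper name extract_all_text is hypothetical, and it does not recover text that extractText() misses within a page.

```python
def extract_all_text(reader):
    """Return the extractable text of every page, one list entry per page."""
    texts = []
    for index in range(reader.getNumPages()):
        texts.append(reader.getPage(index).extractText())
    return texts


# Example use with the read_pdf object from the script above (left as a
# comment so the snippet does not depend on a PDF file being present):
# for number, text in enumerate(extract_all_text(read_pdf)):
#     print(number, text)
```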
Conclusion
As we can see, Python makes it simple to work with PDF documents. This tutorial just scratched the surface of the topic, and you can find more details on the different operations you can perform on PDF documents on the PyPDF2 documentation page.