Working with PDF Files in Python

by Pyrastra Team

PDF stands for Portable Document Format, and such files usually carry the .pdf extension. In day-to-day development, the two most common tasks are extracting text from PDFs and generating PDF documents from existing content.

Extracting Text from PDFs

In Python, you can use a third-party library called PyPDF2 to read PDF files. You can install it with the following command.

pip install PyPDF2

PyPDF2 cannot extract images, charts, or other media from PDF documents, but it can extract text and return it as a Python string.

import PyPDF2

reader = PyPDF2.PdfReader('test.pdf')
for page in reader.pages:
    print(page.extract_text())

Tip: The PDF files used in the code in this chapter can be obtained through the following Baidu Cloud link: https://pan.baidu.com/s/1rQujl5RQn9R7PadB2Z5g_g, extraction code: e7b4.

Of course, PyPDF2 cannot extract text from every PDF document. As far as I know, there is no particularly good solution to this problem, especially for Chinese text. Many online articles explain how to extract text from PDFs; I recommend the article “Three Great Tools to Help Python Extract PDF Document Information” for more detail.
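One practical way to cope with this limitation is to find out which pages yielded no text at all, for example scanned pages that contain only images. The sketch below is only an illustration: pages_without_text is a hypothetical helper name, and it assumes a reader object that exposes pages and extract_text() the way PdfReader does in the code above.

```python
def pages_without_text(reader):
    """Return the 1-based numbers of pages whose extracted text is empty.

    `reader` is assumed to behave like PyPDF2.PdfReader: it has a `pages`
    sequence whose items offer an `extract_text()` method.
    """
    missing = []
    for number, page in enumerate(reader.pages, start=1):
        # Treat None and whitespace-only results as "no text".
        text = page.extract_text() or ""
        if not text.strip():
            missing.append(number)
    return missing


# Example usage (assumes PyPDF2 is installed and test.pdf exists):
#   import PyPDF2
#   print(pages_without_text(PyPDF2.PdfReader('test.pdf')))
```

Pages reported by such a check are candidates for another extraction tool rather than proof that the page is empty.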

You can also extract text from PDF files with a third-party command-line tool. Installing pdfminer.six provides the pdf2txt.py script, which prints the text of a PDF file.

pip install pdfminer.six
pdf2txt.py test.pdf

Rotating and Overlaying Pages

In the code above, we read a PDF document by creating a PdfReader object. Its pages attribute gives access to every page of the document as a PageObject. The rotate method of a PageObject rotates the page clockwise by the given angle, which must be a multiple of 90; a negative angle rotates the page counterclockwise. To add a new blank page, use the add_blank_page method of the PdfWriter object. The following code rotates the pages of a document in alternating directions.

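import PyPDF2

reader = PyPDF2.PdfReader('XGBoost.pdf')
writer = PyPDF2.PdfWriter()

for no, page in enumerate(reader.pages):
    if no % 2 == 0:
        # Pages at even indexes: rotate 90 degrees counterclockwise
        new_page = page.rotate(-90)
    else:
        # Pages at odd indexes: rotate 90 degrees clockwise
        new_page = page.rotate(90)
    writer.add_page(new_page)

# Write the rotated pages to a new file
with open('temp.pdf', 'wb') as file_obj:
    writer.write(file_obj)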
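The loop above only rotates pages. Adding blank pages, which the text also mentions, could look roughly like the sketch below. copy_with_blank_pages is a hypothetical helper; it assumes reader and writer are a PdfReader and a PdfWriter, and it relies on add_blank_page falling back to the size of the last page when no width and height are given.

```python
def copy_with_blank_pages(reader, writer):
    """Copy every page to `writer`, adding a blank page after each one."""
    for page in reader.pages:
        writer.add_page(page)
        # Called without width/height, add_blank_page uses the size of the
        # last page in the writer, i.e. the page that was just added.
        writer.add_blank_page()
    return writer


# Example usage (assumes PyPDF2 is installed and XGBoost.pdf exists):
#   import PyPDF2
#   writer = copy_with_blank_pages(PyPDF2.PdfReader('XGBoost.pdf'),
#                                  PyPDF2.PdfWriter())
#   with open('temp.pdf', 'wb') as file_obj:
#       writer.write(file_obj)
```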

Encrypting PDF Files

Using the PdfWriter object in PyPDF2, you can encrypt PDF documents. If you need to set the same access password on a whole series of PDF documents, handling this with a Python program is very convenient.

import PyPDF2

reader = PyPDF2.PdfReader('XGBoost.pdf')
writer = PyPDF2.PdfWriter()

for page in reader.pages:
    writer.add_page(page)

writer.encrypt('foobared')

with open('temp.pdf', 'wb') as file_obj:
    writer.write(file_obj)
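For the batch case mentioned above, the same steps can be wrapped in a loop over a folder. This is a minimal sketch under stated assumptions: encrypted_path and encrypt_folder are hypothetical helpers, the folder names are placeholders, and the "_encrypted" suffix is just one possible naming scheme.

```python
from pathlib import Path


def encrypted_path(pdf_path, out_dir):
    """Build the output path for the encrypted copy of `pdf_path`."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_encrypted.pdf"


def encrypt_folder(in_dir, out_dir, password):
    """Write a password-protected copy of every PDF found in `in_dir`."""
    # Imported here so encrypted_path() stays usable without PyPDF2.
    import PyPDF2

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for pdf_path in sorted(Path(in_dir).glob('*.pdf')):
        reader = PyPDF2.PdfReader(str(pdf_path))
        writer = PyPDF2.PdfWriter()
        for page in reader.pages:
            writer.add_page(page)
        # Same call as in the single-file example above.
        writer.encrypt(password)
        with open(encrypted_path(pdf_path, out_dir), 'wb') as file_obj:
            writer.write(file_obj)


# Example usage (folder names are placeholders):
#   encrypt_folder('pdfs', 'pdfs_encrypted', 'foobared')
```

Writing the copies to a separate folder keeps the original files untouched.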

Batch Adding Watermarks

The PageObject mentioned above also has a merge_page method, which overlays one PDF page on another. This makes it easy to add watermarks to PDF files. For example, to watermark the “XGBoost.pdf” file above, first prepare a PDF file that supplies the watermark page and read its PageObject. Then loop over the pages of “XGBoost.pdf” and merge the watermark page into each one with merge_page. The code is as follows.

import PyPDF2

reader1 = PyPDF2.PdfReader('XGBoost.pdf')
reader2 = PyPDF2.PdfReader('watermark.pdf')
writer = PyPDF2.PdfWriter()
watermark_page = reader2.pages[0]

for page in reader1.pages:
    page.merge_page(watermark_page)
    writer.add_page(page)

with open('temp.pdf', 'wb') as file_obj:
    writer.write(file_obj)

If you want, you can also use different watermarks for odd and even pages. You can think about how to do this yourself.
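One possible answer, offered only as a sketch: pick the watermark by page number before merging. add_alternating_watermarks is a hypothetical helper, and the usage comment assumes a watermark file with at least two pages, which the original example does not state.

```python
def add_alternating_watermarks(reader, writer, odd_mark, even_mark):
    """Merge `odd_mark` into pages 1, 3, 5, ... and `even_mark` into 2, 4, 6, ..."""
    for number, page in enumerate(reader.pages, start=1):
        # Page numbers are 1-based here, so the first page counts as odd.
        mark = odd_mark if number % 2 == 1 else even_mark
        page.merge_page(mark)
        writer.add_page(page)
    return writer


# Example usage (assumes PyPDF2 is installed and watermark.pdf has two pages):
#   import PyPDF2
#   marks = PyPDF2.PdfReader('watermark.pdf')
#   writer = add_alternating_watermarks(PyPDF2.PdfReader('XGBoost.pdf'),
#                                       PyPDF2.PdfWriter(),
#                                       marks.pages[0], marks.pages[1])
#   with open('temp.pdf', 'wb') as file_obj:
#       writer.write(file_obj)
```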

Creating PDF Files

Creating PDF documents requires the support of the third-party library reportlab. The installation method is as follows.

pip install reportlab

The following example demonstrates the usage of reportlab.

from reportlab.lib.pagesizes import A4
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.pdfgen import canvas

pdf_canvas = canvas.Canvas('resources/demo.pdf', pagesize=A4)
width, height = A4

# Draw image
image = canvas.ImageReader('resources/guido.jpg')
pdf_canvas.drawImage(image, 20, height - 395, 250, 375)

# Finish the current page and start a new one
pdf_canvas.showPage()

# Register font files
pdfmetrics.registerFont(TTFont('Font1', 'resources/fonts/Vera.ttf'))
pdfmetrics.registerFont(TTFont('Font2', 'resources/fonts/QingGuaShiTouTi.ttf'))

# Write text
pdf_canvas.setFont('Font2', 40)
pdf_canvas.setFillColorRGB(0.9, 0.5, 0.3, 1)
pdf_canvas.drawString(width // 2 - 120, height // 2, 'Hello, World!')
pdf_canvas.setFont('Font1', 40)
pdf_canvas.setFillColorRGB(0, 1, 0, 0.5)
pdf_canvas.rotate(18)
pdf_canvas.drawString(250, 250, 'hello, world!')

# Save
pdf_canvas.save()

If you don’t quite understand the above code, it doesn’t matter. When you really need to use Python to create PDF documents, just read the official documentation of reportlab carefully.

Tip: The images and fonts used in the above code can be obtained through the following Baidu Cloud link: https://pan.baidu.com/s/1rQujl5RQn9R7PadB2Z5g_g, extraction code: e7b4.
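One detail in the reportlab example that may look odd is the expression height - 395. By default, reportlab measures coordinates from the bottom-left corner of the page, and drawImage takes the position of the image's bottom-left corner, so placing something a given distance below the top edge requires a subtraction. The helper below is a hypothetical illustration of that arithmetic and is not part of reportlab.

```python
def y_from_top(page_height, top_margin, item_height):
    """Return the bottom-left y for an item placed `top_margin` below the top edge.

    All values are in points, measured in reportlab's default coordinate
    system, whose origin is the bottom-left corner of the page.
    """
    return page_height - top_margin - item_height


# For the 375-point-tall image in the example and a 20-point top margin,
# the result is page_height - 395, which matches the drawImage call:
#   pdf_canvas.drawImage(image, 20, y_from_top(height, 20, 375), 250, 375)
```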

Summary

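After working through the content above, you should know how to use Python to extract text from PDFs, rotate, encrypt, and watermark pages, and create new PDF documents. Tasks such as merging multiple PDF files build on the same PdfReader and PdfWriter objects. Go ahead and try it yourself!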
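Merging is not shown in the sections above, so here is a minimal sketch of one way to approach it with the same objects. merge_pdfs is a hypothetical helper, and the file names in the usage comment are placeholders.

```python
def merge_pdfs(readers, writer):
    """Append the pages of every reader to `writer`, in the order given."""
    for reader in readers:
        for page in reader.pages:
            writer.add_page(page)
    return writer


# Example usage (assumes PyPDF2 is installed; file names are placeholders):
#   import PyPDF2
#   readers = [PyPDF2.PdfReader(name) for name in ('a.pdf', 'b.pdf')]
#   writer = merge_pdfs(readers, PyPDF2.PdfWriter())
#   with open('merged.pdf', 'wb') as file_obj:
#       writer.write(file_obj)
```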