Extract PDF text with Python

As part of our SemBib project to analyze the scientific production of Telecom ParisTech, I collect a lot of PDF files. To analyze their content, I need the raw text. In addition, as explained in the blog post Services for bibliographic analysis, I chose to implement our developments as web services.

I show here how I developed a REST service that extracts raw text from PDF files with Python, and how I deployed it on Heroku (note: I develop under Windows 10).

Note: I am a recent convert to Python; my code is not exemplary and deserves to be cleaned up and improved, but it is enough for me as a proof of feasibility.

Create a folder for my service

Create a virtual environment for local development with the command

virtualenv venv

Create a requirements.txt file with the following content

pdfminer==20140328
Flask==0.11
Flask-Login==0.2.11
Flask-RESTful==0.3.2
aniso8601==0.82
Jinja2==2.7.3
MarkupSafe==0.23
Werkzeug==0.9.6
gunicorn==19.4.5
itsdangerous==0.24
six==1.7.2

The file lists pdfminer, the package used to process PDF files, along with Flask, its dependencies and gunicorn.

(see http://spapas.github.io/2014/06/30/rest-flask-mongodb-heroku/ for generic explanations on the dependencies of the project)

In the folder that contains the requirements.txt file, create a pdf_service folder. In this folder, create two files: __init__.py and resources.py.

__init__.py creates the Flask application

from flask import Flask

app = Flask(__name__)

# Imported last: resources.py imports "app" from this package to register its routes
import pdf_service.resources

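And the file resources.py, which contains the extraction function and the routes

import json
from flask import request, abort
# flask.ext.* is deprecated; import the flask_restful package directly
import flask_restful as restful
from flask_restful import reqparse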
from pdf_service import app
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import urllib2
from urllib2 import Request
from StringIO import StringIO


def convert_pdf_to_txt(path):
    # Download the PDF and keep it in memory so that pdfminer can seek in it
    pdf_bytes = urllib2.urlopen(Request(path)).read()
    memory_file = StringIO(pdf_bytes)

    # TextConverter writes the extracted text into retstr, encoded as UTF-8
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # get_pages builds its own parser and document from the file object;
    # an empty pagenos set and maxpages=0 mean "process every page"
    for page in PDFPage.get_pages(memory_file, set(), maxpages=0, password="",
                                  caching=True, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    device.close()
    retstr.close()
    return text


@app.route('/', methods=['GET'])
def get_root():
    return "index"


@app.route('/pdftotxt/<path:url>', methods=['GET'])
def get_url(url):
    return convert_pdf_to_txt(url)
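
convert_pdf_to_txt hands whatever the URL returns straight to pdfminer, so an HTML error page or a dead link would typically only surface as a parser error. Below is a minimal sketch of a guard that could run first. It assumes Python 3 for the illustration (the service itself targets Python 2), and wrap_pdf_bytes is a hypothetical helper name that is not part of the service:

```python
import io

PDF_MAGIC = b"%PDF-"  # PDF files begin with this header, e.g. b"%PDF-1.4"


def wrap_pdf_bytes(data):
    """Return a seekable in-memory file for data, or raise ValueError.

    Hypothetical helper: it only inspects the first bytes, so a file with
    leading junk before the header would be rejected.
    """
    if not data.startswith(PDF_MAGIC):
        raise ValueError("downloaded content does not look like a PDF")
    return io.BytesIO(data)
```

In the service, the ValueError could for instance be turned into an HTTP 400 response with Flask's abort, which resources.py already imports.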

To start a local server

For tests, I created the file runserver.py

from pdf_service import app
app.run(debug=True)

launched with the command

python runserver.py
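
With the local server running, the extraction route can be exercised from another Python process. This is a sketch, assuming Python 3 on the client side, Flask's default development address http://127.0.0.1:5000, and two hypothetical helper names (build_extract_url, fetch_text). The PDF address is percent-encoded except for "/" and ":" so that a "?" in it is not read as the query string of the request sent to the service:

```python
from urllib.parse import quote
from urllib.request import urlopen


def build_extract_url(service_root, pdf_url):
    """Build the /pdftotxt/<path:url> address for a given PDF URL."""
    # Keep "/" and ":" readable; encode "?", "&", "=", "#" and the like.
    return service_root.rstrip("/") + "/pdftotxt/" + quote(pdf_url, safe="/:")


def fetch_text(service_root, pdf_url):
    """Ask the service for the raw text of the PDF (needs a running server)."""
    with urlopen(build_extract_url(service_root, pdf_url)) as response:
        return response.read().decode("utf-8")


# Example address only; example.org does not host this file.
print(build_extract_url("http://127.0.0.1:5000", "http://example.org/paper.pdf?id=3"))
```

Calling fetch_text("http://127.0.0.1:5000", address_of_a_real_pdf) would then return the extracted text as a string.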

To create an empty git repository

git init

To add local files to the git index

git add .

Take care to create a .gitignore file listing the files and directories to ignore (for example the venv folder), so that they are not sent to Heroku needlessly
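
As an illustration, here is a short Python sketch that writes such a .gitignore. The entries are assumptions based on this project (the venv folder created earlier and Python's compiled files) and should be adapted; the function names are hypothetical:

```python
import os

# Assumed entries: the virtualenv folder and Python bytecode artifacts.
IGNORED = ["venv/", "*.pyc", "__pycache__/"]


def gitignore_text(patterns=IGNORED):
    """Return the .gitignore content, one pattern per line."""
    return "\n".join(patterns) + "\n"


def write_gitignore(folder="."):
    """Write .gitignore into folder and return its path."""
    path = os.path.join(folder, ".gitignore")
    with open(path, "w") as handle:
        handle.write(gitignore_text())
    return path
```

Running write_gitignore() once in the service folder, before git add, is enough.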

To commit to the local repository, which will serve as the reference for exchanges with Heroku

git commit -m "my commit message"

To connect to Heroku

heroku login

then enter your account credentials

To destroy a service on Heroku

For example, to make room by removing test services that are no longer useful …

heroku apps:destroy --app <name of the service to destroy>

To create the service on Heroku

heroku create

To push the local development to Heroku

git push heroku master

(by default, this triggers a Python 2.7 deployment and installs the dependencies listed in requirements.txt)

To start an instance of the service

heroku ps:scale web=1

To see what runs

heroku ps

Rename the service

The service can be renamed from the command line (see https://devcenter.heroku.com/articles/renaming-apps)

The following command, run in the folder where the service was created, changes the name of the service (replacing the one generated automatically by the Heroku tools) and updates the associated git remote

heroku apps:rename <new name>

 
