Following the approach of the "Extract PDF text with Python" post, I will create a service that uses the NLTK package. NLTK is a set of tools for building language-processing programs in Python, so the service will be written in Python.
Basic steps
Here is a summary of the steps detailed in the post mentioned above:
- create a folder for this service
- create a virtual environment for local development
- create a requirements.txt file listing the dependencies (including nltk, see below)
- create a folder nltk_service
- in this folder, create two files: __init__.py and resources.py (empty for now)
- start a local server, replacing pdf_service with nltk_service in runserver.py (see the sketch after this list)
- create a git repository
- add the project files to the local git repository
- commit to the local repository
- log in to Heroku
- create a new service on Heroku
- push the local repository to Heroku
- launch an instance of the service
- test it online
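To make the skeleton concrete, here is a minimal sketch of runserver.py. It assumes, based on the imports used later in this post, that requirements.txt lists at least flask, flask-restful and nltk, and that nltk_service/__init__.py creates the Flask app object as in the previous post; this is an illustration, not the exact file from that post.

# runserver.py -- minimal local entry point (a sketch, not the exact file
# from the previous post; assumes nltk_service/__init__.py creates 'app').
from nltk_service import app

if __name__ == '__main__':
    # debug=True is convenient for local development only.
    app.run(debug=True)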
At this point, an empty service has been created. We will now complete it with an implementation of one of the features offered by NLTK.
A simple service with NLTK
I want to create a service that extracts the words and other tokens that make up a text, then eliminates the "stopwords" (words such as 'the', 'a', 'to' in English that contribute little to certain analyses of a text's content), and finally counts the number of words in the text and the number of occurrences of each retained word.
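As a quick illustration of the idea, here is what tokenization and stopword removal look like in an interactive session (a sketch assuming the punkt and stopwords NLTK data packages are already installed locally; the sample sentence is mine):

# Tokenize a sentence and drop English stopwords and punctuation.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
tokens = word_tokenize("The quick brown fox jumps over the lazy dog.")
kept = [w.lower() for w in tokens if w.lower() not in stops and w.isalpha()]
print(kept)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']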
To eliminate stopwords, as for many other kinds of processing, NLTK relies on data files distributed under the name 'NLTK Data'. Files listing the stopwords of a set of languages are available there. But Heroku is not well suited to serving large volumes of static files; Heroku is made to host and run programs.
Hosting static files for Heroku
In the article "Using AWS S3 to Store Static Assets and File Uploads", it is suggested that good scalability can be achieved by deploying an application's static files on Amazon S3. Indeed, files hosted on Heroku may have to be downloaded again every time an application is put to sleep and then reloaded for a new use; with a large volume of static files, this can penalize an application's response times. The file system provided by Heroku is described as 'ephemeral'.
Normally, NLTK data are downloaded with an interactive program that lets you select what to download and ensures that the loaded data will be found by the NLTK tools. On Heroku we cannot do this; instead we have to use a command-line download.
This is the first method I explored. The first idea is to download the data locally and then push them to Heroku, but this would burden the git repository used in our exchanges with Heroku with all the static data of nltk_data. A solution is available here: http://stackoverflow.com/questions/13965823/resource-corpora-wordnet-not-found-on-heroku/37558445#37558445. This is the solution I adopted as a first approach. A test that downloads the entire nltk_data collection fails. With just the stopwords corpus (python -m nltk.downloader stopwords), the wordnet corpus (python -m nltk.downloader wordnet) and the punkt tokenizer (python -m nltk.downloader punkt), the deployment runs smoothly.
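The same targeted downloads can also be triggered from Python, for instance in a one-off setup script run at deployment time. This is a sketch mirroring the command-line downloads above; the script name is mine:

# setup_nltk.py -- fetch only the NLTK data packages the service needs.
import nltk

for package in ('stopwords', 'wordnet', 'punkt'):
    # nltk.download() returns True on success (or if already installed).
    if not nltk.download(package):
        raise RuntimeError('failed to download NLTK package: %s' % package)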
Another idea is to use AWS Simple Storage Service (S3), Amazon's cloud storage service. I will explore this possibility in a future post.
The service
from flask import request, abort
from flask.ext import restful
from flask.ext.restful import reqparse
from nltk_service import app
import urllib2
import nltk.data
import io
import os
import collections
import json
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.corpus import stopwords

# Stopwords to exclude: union of the English and French lists.
english_stops = set(stopwords.words('english'))
french_stops = set(stopwords.words('french'))
stops = english_stops.union(french_stops)


def words_counting(wordslist):
    # Count each non-stopword that WordNet knows about, normalized to the
    # name of the first lemma of its first synset.
    wordscounter = {}
    wordscounter["cnt"] = collections.Counter()
    wordscounter["wordscount"] = 0
    for w in wordslist:
        word = w.lower()
        if word not in stops:
            syn = wordnet.synsets(word)
            if syn:
                lemmas = syn[0].lemmas()
                res = lemmas[0].name()
                wordscounter["cnt"][res] += 1
                wordscounter["wordscount"] += 1
    # Replace the Counter with a list of (word, count) pairs, most frequent first.
    wordscounter["cnt"] = wordscounter["cnt"].most_common()
    return wordscounter


def filewords(path):
    # Fetch the text file, tokenize it and return the counts as JSON.
    text = urllib2.urlopen(path).read().decode('utf-8')
    wordslist = word_tokenize(text)
    jsonwords = words_counting(wordslist)
    return json.dumps(jsonwords)


@app.route('/words/<path:url>', methods=['GET'])
def get_words(url):
    return filewords(url)
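Two design choices in this code are worth noting: a token that WordNet does not know (no synset) is simply dropped, and each retained token is replaced by the name of the first lemma of its first synset, so that inflected forms such as 'cats' and 'cat' end up counted under the same entry.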
Calling <mydomain>/words/<url of a text file> returns a JSON structure containing an array of words with their number of occurrences (in the cnt array) and the total number of words taken into account (in wordscount). This will make it easy to aggregate the results of several files and to compute word-occurrence frequencies.
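For instance, a client could query the service as follows. This is a sketch: the host, port and sample text URL are assumptions, and the exact encoding of a URL embedded in the path may need adjusting.

# query_words.py -- a sketch of a client call, assuming the service runs
# locally on port 5000 and the target text file is publicly reachable.
import urllib2
import json

service = 'http://localhost:5000/words/'
target = 'http://www.gutenberg.org/files/11/11-0.txt'  # any public text file

data = json.loads(urllib2.urlopen(service + target).read())
print(data['wordscount'])  # total number of words retained
print(data['cnt'][:5])     # the five most frequent entries, e.g. [["say", 42], ...]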
Conclusion
The use of NLTK with Heroku is validated. The next step is to define the structure(s), presumably in JSON, that will let us progressively transport and enrich the data associated with a source or a set of sources.