Following the approach of the "Extract PDF text with Python" post, I will create a service that uses the NLTK package. NLTK is a set of tools for building language-processing programs in Python, so the service will be written in Python.
Basic steps
Here is a summary of the steps detailed in the post mentioned above:
- create a folder for this service
- create a virtual environment for local development
- create a requirements.txt file listing the dependencies (including nltk, see below)
- create a folder nltk_service
- in this folder, create two files: __init__.py and resources.py (empty for now)
- start a local server, replacing pdf_service with nltk_service in runserver.py (see the sketch after this list)
- create a git repository
- add the project files to the local git repository
- commit to the local repository
- log in to Heroku
- create a new service on Heroku
- push the local repository to Heroku
- launch an instance of the service
- test it online
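To make the skeleton concrete, here is a minimal sketch of runserver.py. It assumes, based on the imports used later in this post, that requirements.txt lists at least flask, flask-restful and nltk, and that nltk_service/__init__.py creates the Flask app object as in the previous post; this is an illustration, not the exact file from that post.

# runserver.py -- minimal local entry point (a sketch, not the exact file
# from the previous post; assumes nltk_service/__init__.py creates 'app').
from nltk_service import app

if __name__ == '__main__':
    # debug=True is convenient for local development only.
    app.run(debug=True)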
At this point, an empty service has been created. We will now complete it with an implementation of one of the features offered by NLTK.
A simple service with NLTK
I want to create a service that extracts the words and other tokens that make up a text, then eliminates the "stopwords" (words such as 'the', 'a', 'to' in English that contribute little to certain analyses of a text's content), and finally counts the number of words in the text and the number of occurrences of each retained word.
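As a quick illustration of the idea, here is what tokenization and stopword removal look like in an interactive session (a sketch assuming the punkt and stopwords NLTK data packages are already installed locally; the sample sentence is mine):

# Tokenize a sentence and drop English stopwords and punctuation.
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stops = set(stopwords.words('english'))
tokens = word_tokenize("The quick brown fox jumps over the lazy dog.")
kept = [w.lower() for w in tokens if w.lower() not in stops and w.isalpha()]
print(kept)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']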
To eliminate stopwords, as for many other kinds of processing, NLTK relies on data files distributed under the name 'NLTK Data'. Files listing the stopwords of a set of languages are available there. But Heroku is not well suited to serving large volumes of static files; Heroku is made to host and run programs.
Hosting static files for Heroku
In the article "Using AWS S3 to Store Static Assets and File Uploads", it is suggested that good scalability can be achieved by deploying an application's static files on Amazon S3. Indeed, files hosted on Heroku may have to be downloaded again every time an application is put to sleep and then reloaded for a new use; with a large volume of static files, this can penalize an application's response times. The file system provided by Heroku is described as 'ephemeral'.
Normally, NLTK data are downloaded with an interactive program that lets you select what to download and ensures that the loaded data will be found by the NLTK tools. On Heroku we cannot do this; instead we have to use a command-line download.
This is the first method I explored. The first idea is to download the data locally and then push them to Heroku, but this would burden the git repository used in our exchanges with Heroku with all the static data of nltk_data. A solution is available here: http://stackoverflow.com/questions/13965823/resource-corpora-wordnet-not-found-on-heroku/37558445#37558445. This is the solution I adopted as a first approach. A test that downloads the entire nltk_data collection fails. With just the stopwords corpus (python -m nltk.downloader stopwords), the wordnet corpus (python -m nltk.downloader wordnet) and the punkt tokenizer (python -m nltk.downloader punkt), the deployment runs smoothly.
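The same targeted downloads can also be triggered from Python, for instance in a one-off setup script run at deployment time. This is a sketch mirroring the command-line downloads above; the script name is mine:

# setup_nltk.py -- fetch only the NLTK data packages the service needs.
import nltk

for package in ('stopwords', 'wordnet', 'punkt'):
    # nltk.download() returns True on success (or if already installed).
    if not nltk.download(package):
        raise RuntimeError('failed to download NLTK package: %s' % package)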
Another idea is to use AWS Simple Storage Service (S3), Amazon's cloud storage service. I will explore this possibility in a future post.
The service
from flask import request, abort
from flask.ext import restful
from flask.ext.restful import reqparse
from nltk_service import app
import urllib2
import nltk.data
import io
import os
import collections
import json
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.corpus import stopwords

# Stopwords to exclude: union of the English and French lists.
english_stops = set(stopwords.words('english'))
french_stops = set(stopwords.words('french'))
stops = english_stops.union(french_stops)


def words_counting(wordslist):
    # Count each non-stopword that WordNet knows about, normalized to the
    # name of the first lemma of its first synset.
    wordscounter = {}
    wordscounter["cnt"] = collections.Counter()
    wordscounter["wordscount"] = 0
    for w in wordslist:
        word = w.lower()
        if word not in stops:
            syn = wordnet.synsets(word)
            if syn:
                lemmas = syn[0].lemmas()
                res = lemmas[0].name()
                wordscounter["cnt"][res] += 1
                wordscounter["wordscount"] += 1
    # Replace the Counter with a list of (word, count) pairs, most frequent first.
    wordscounter["cnt"] = wordscounter["cnt"].most_common()
    return wordscounter


def filewords(path):
    # Fetch the text file, tokenize it and return the counts as JSON.
    text = urllib2.urlopen(path).read().decode('utf-8')
    wordslist = word_tokenize(text)
    jsonwords = words_counting(wordslist)
    return json.dumps(jsonwords)


@app.route('/words/<path:url>', methods=['GET'])
def get_words(url):
    return filewords(url)
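Two design choices in this code are worth noting: a token that WordNet does not know (no synset) is simply dropped, and each retained token is replaced by the name of the first lemma of its first synset, so that inflected forms such as 'cats' and 'cat' end up counted under the same entry.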
Calling <mydomain>/words/<url of a text file> returns a JSON structure containing an array of words with their number of occurrences (in the cnt array) and the total number of words taken into account (in wordscount). This will make it easy to aggregate the results of several files and to compute word-occurrence frequencies.
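For instance, a client could query the service as follows. This is a sketch: the host, port and sample text URL are assumptions, and the exact encoding of a URL embedded in the path may need adjusting.

# query_words.py -- a sketch of a client call, assuming the service runs
# locally on port 5000 and the target text file is publicly reachable.
import urllib2
import json

service = 'http://localhost:5000/words/'
target = 'http://www.gutenberg.org/files/11/11-0.txt'  # any public text file

data = json.loads(urllib2.urlopen(service + target).read())
print(data['wordscount'])  # total number of words retained
print(data['cnt'][:5])     # the five most frequent entries, e.g. [["say", 42], ...]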
Conclusion
The use of NLTK with Heroku is validated. The next step is to define the structure(s), presumably in JSON, that will let us progressively transport and enrich the data associated with a source or a set of sources.