In this post I present the requirements behind our approach to analyzing the production and publication of scientific documents – essentially articles – by Télécom ParisTech. This is the goal of the SemBib project.
Télécom ParisTech has a bibliographical database that contains the bulk of our publications. For each publication, we have the title, the authors’ names, the date of publication and some other metadata. Sometimes there is a link to the publication. This link is not always filled in and, when it is, it can take different forms:
- it is sometimes a link to a direct online access to a digital version of the article, usually in PDF format
- it is sometimes a link to a web page that itself contains a link to a digital version of the article, often behind a paywall
- sometimes it is a dead link
Some links to digital documents cannot be exploited by a robot to build a local collection of all the documents. We will see in a future post the solutions implemented to recover the highest possible proportion of documents.
An evaluation over five years gives about 4,000 publications, a quarter of which have a link; in a little more than half of these cases, the link can be exploited automatically and easily. For some documents, restrictions on publication rights prevent us from putting them online. This has a significant impact on the solutions we will be able to use.
As of the date of publication of this post, we have 420 documents requiring 180 MB of storage for the source documents. If we manage to recover all 4,000 documents, we will need about 2 GB for the source documents.
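As a sanity check on the 2 GB figure, the extrapolation is a simple rule of three (the numbers below are the ones quoted above):

```python
# Extrapolate total storage from the documents retrieved so far.
retrieved_docs = 420      # documents already collected
retrieved_size_mb = 180   # storage they occupy, in MB
target_docs = 4000        # publications over five years

avg_doc_mb = retrieved_size_mb / retrieved_docs        # ~0.43 MB per document
estimated_total_gb = target_docs * avg_doc_mb / 1024   # ~1.7 GB, i.e. about 2 GB

print(f"average: {avg_doc_mb:.2f} MB/doc, "
      f"estimated total: {estimated_total_gb:.2f} GB")
```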
We plan to store intermediate processing results; for example, we will store
- the plain text extracted from each PDF document
- the dictionary of words associated with each document
- the metadata associated with each document, resulting from the different stages of processing
Several of us are working on this project and we plan to involve students for whom this type of work can lead to very formative projects. We must therefore make our sources and results accessible via the network.
I have access to unlimited online storage via a hosting solution which, however, only supports developments in PHP. It is this hosting that I will use.
A very good tool for applying analytical processing to texts is NLTK. Since NLTK is written in Python, this processing cannot run on the hosting that holds our sources.
For the processing of semantic representations, we use on the one hand tools written in Java based on Apache Jena / Fuseki, and on the other hand a Virtuoso server accessible only on the internal network of Telecom ParisTech.
The need to host services written in different languages, which none of our direct hosting solutions provides, leads us to adopt a distributed solution based on web services. We will see in a future post the steps taken to create a service in Python, running NLTK and hosted on Heroku.
Algorithmic starting point
We started from the article “Using Linked Data Traversal to Label Academic Communities” by Tiddi, d’Aquin and Motta (SAVE-SD 2015), but we have extended or modified the proposed procedure. For example, we need to handle publications in several languages (at least French and English); we also decided to rely on WordNet where it made sense. We have undertaken the implementation of the following steps:
- development of a list of researchers in the school with links to departments and research teams
- retrieval of the list of publications for the last 5 years; the bibliographic database returns a result in BibTeX format, which we translated into a JSON structure; some thought should be given to making the solution more modular, for example by retrieving the database year by year and taking updates into account, since some publications are declared late in the database; moreover, the data obtained present a series of defects handled in our first working phase
- for each reference, attempt to retrieve a digital version of the document
- for each retrieved reference, extract the raw text from the document (see https://onsem.wp.imt.fr/2017/09/04/extract-pdf-text-with-python/)
- for each retrieved reference, gather metadata (authors, cited references, …), either from the BibTeX above or by analyzing the document
- for each retrieved reference, identify the language of the document
- convert each text to lowercase
- remove stop words from each text, as well as numbers and punctuation
- stem and/or lemmatize the words of each text
- replace each stem resulting from stemming with a word that shares this stem (e.g. the shortest such word)
- filter the list of retained words (blacklist, minimum length, …)
- look for a mapping of each word to a concept of the semantic web / LOD, for example by referring to DBpedia, schema.org, WordNet, the reference vocabularies of IEEE and ACM, etc.; we will need to evaluate the number of unmapped terms and look for a solution for these terms
- evaluate the number of words per article and per corpus (e.g. the Télécom ParisTech corpus for a given range of years, and likewise for a department or an author)
- construct the TF-IDF matrix of the corpus; we implement methods that make it easy to update this matrix, for example when a new article is added
- reduce this matrix by eliminating words present in a high proportion of documents (applying a threshold, e.g. 25%); these words are considered weakly discriminating, but they can be kept as potentially representative of the overall production of Télécom ParisTech
- reduce this matrix by eliminating words that appear too few times in the corpus; such rare words are considered too marginal to be structuring for the corpus
- apply latent semantic analysis (LSA)
- cluster the remaining words, either by clustering methods applied to the matrix or by grouping the words according to structural criteria of Télécom ParisTech (departments, teams, projects, etc.)
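Several of the steps above – lowercasing, stop-word and punctuation removal, building the TF-IDF matrix, and pruning weakly discriminating words with the 25% threshold – can be sketched in plain Python. This is only an illustration: in the real pipeline, NLTK supplies the stop-word lists and stemmers, and the tiny stop-word set below is an assumption:

```python
import math
import re

# Illustrative stop-word list; NLTK provides real ones for French and English.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}

def preprocess(text: str) -> list:
    """Lowercase, keep letters only (drops digits/punctuation), remove stop
    words and very short words - the filtering steps listed above."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) >= 3]

def tfidf(corpus: dict) -> dict:
    """Build a {doc: {word: tf-idf weight}} matrix from a {doc: text} corpus."""
    docs = {name: preprocess(text) for name, text in corpus.items()}
    n = len(docs)
    df = {}  # document frequency of each word
    for tokens in docs.values():
        for w in set(tokens):
            df[w] = df.get(w, 0) + 1
    matrix = {}
    for name, tokens in docs.items():
        counts = {w: tokens.count(w) for w in set(tokens)}
        matrix[name] = {w: (c / len(tokens)) * math.log(n / df[w])
                        for w, c in counts.items()}
    return matrix

def prune_common(matrix: dict, max_doc_ratio: float = 0.25) -> dict:
    """Drop words present in more than max_doc_ratio of the documents,
    considered weakly discriminating (the example 25% threshold above)."""
    n = len(matrix)
    df = {}
    for row in matrix.values():
        for w in row:
            df[w] = df.get(w, 0) + 1
    keep = {w for w, c in df.items() if c / n <= max_doc_ratio}
    return {doc: {w: v for w, v in row.items() if w in keep}
            for doc, row in matrix.items()}
```

A dictionary-of-dictionaries keeps the matrix sparse and easy to update when a new article arrives: only its row and the affected document frequencies change.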
These steps should provide us with the baseline data for our analyses. In future blog posts, we will study some of these steps in more detail. We will also propose approaches based on semantic web technologies.