This component is the outcome of the work described in the Master's thesis of Alexander Schutz.
It takes a textual document and applies a number of statistical algorithms to the linguistically annotated document content. The linguistic information consists of part-of-speech tags, morphological base forms, and noun chunks.
Due to the nature of the underlying processing steps, short documents (<500 words) and documents with little coherence (no real sentence structure, many bullet points) are very likely to produce suboptimal results.
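The following is a minimal sketch of how linguistic annotations and a simple statistic could be combined. It is not the component's actual algorithm: the tag names (ADJ, NOUN), the chunking rule, the frequency-based score, and the function names noun_chunks and rank_keyphrases are all assumptions made for illustration. The input is assumed to be already annotated with part-of-speech tags and base forms.

```python
from collections import Counter
from typing import List, Tuple

# One annotated token: (surface form, part-of-speech tag, base form).
# The tag names are hypothetical; a real tagger's tag set may differ.
Token = Tuple[str, str, str]


def noun_chunks(tokens: List[Token]) -> List[str]:
    """Collect runs of adjectives and nouns that end in a noun (assumed rule)."""
    chunks, current = [], []
    # A sentinel token flushes the last open run.
    for _, tag, lemma in tokens + [("", "END", "")]:
        if tag in ("ADJ", "NOUN"):
            current.append((tag, lemma))
            continue
        # Drop trailing adjectives so that every chunk ends in a noun.
        while current and current[-1][0] != "NOUN":
            current.pop()
        if current:
            chunks.append(" ".join(lemma for _, lemma in current))
        current = []
    return chunks


def rank_keyphrases(tokens: List[Token], top_n: int = 3) -> List[Tuple[str, int]]:
    """Rank candidate chunks by the raw frequency of their base-form string."""
    return Counter(noun_chunks(tokens)).most_common(top_n)


# Hand-annotated sample input (illustrative only).
sample: List[Token] = [
    ("Keyphrases", "NOUN", "keyphrase"),
    ("summarise", "VERB", "summarise"),
    ("digital", "ADJ", "digital"),
    ("libraries", "NOUN", "library"),
    (".", "PUNCT", "."),
    ("Digital", "ADJ", "digital"),
    ("libraries", "NOUN", "library"),
    ("store", "VERB", "store"),
    ("documents", "NOUN", "document"),
    (".", "PUNCT", "."),
]

print(rank_keyphrases(sample, top_n=1))
```

Because candidates are built from base forms, "Digital libraries" and "digital libraries" are counted as the same phrase. With so few tokens the counts carry little information, which mirrors the caveat about short documents.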
In digital libraries, keyphrases are an important instrument for cataloguing and information retrieval purposes.
In literature research, they provide a high-level and concise summary of the content of textual documents or the topics covered therein, allowing humans to quickly decide whether a given text is relevant.
As the amount of textual content on desktops grows rapidly, keyphrases can help to manage large amounts of textual information, for instance by marking up important sections in documents and thereby improving the user experience in document exploration.
However, only very few documents, whether authored by the user or retrieved from digital libraries on the internet, have keyphrases assigned as embedded metadata, although vendors have equipped the relevant file formats with structures capable of storing such information.
As keyphrases are a description of textual data, it is natural to consider Natural Language Processing (NLP) tools for automating the extraction process.
While shallow NLP techniques are a long way from language understanding, in combination with statistical processing they can be helpful in many ways, providing a first step in automatic content-metadata extraction, whose output can then be used as input for more sophisticated technologies.
The goal of this thesis is to demonstrate how linguistic and statistical methods can be combined to perform automatic keyphrase extraction from textual data, putting the emphasis on single documents.
As its contribution, the thesis describes the underlying assumptions and the implementation of a modular software component that performs keyphrase extraction for a number of different languages.
The component is evaluated quantitatively on a medium-sized corpus with a priori assigned keyphrases, and a user study gives insight into the acceptance of the algorithms' results in a practical setting.
Evaluation results show that the approach is comparable to the current state of the art, although potential for performance improvement remains.
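A quantitative evaluation against a priori assigned keyphrases is commonly expressed as precision, recall, and F1. The sketch below illustrates that kind of measurement only; the exact-match criterion on lower-cased phrases and the function name evaluate are assumptions, and the thesis may use a different matching scheme or different measures.

```python
from typing import Iterable, Tuple


def evaluate(extracted: Iterable[str], assigned: Iterable[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact match on lower-cased phrases."""
    predicted = {p.strip().lower() for p in extracted}
    gold = {p.strip().lower() for p in assigned}
    hits = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Illustrative data, not taken from the thesis corpus.
precision, recall, f1 = evaluate(
    ["digital library", "keyphrase", "document", "user"],
    ["Digital Library", "keyphrase extraction"],
)
print(precision, recall, round(f1, 3))
```

Exact matching is strict: "keyphrase" and "keyphrase extraction" do not count as a hit here, so a looser criterion (for example matching on base forms) would yield higher scores.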