The process necessarily starts with identifying the language of the given document, so that the subsequent processing resources can be selected appropriately. Next, the input text is tokenised and split into sentences, which is standard procedure in information retrieval. A stopword analyser then marks each token in the document structure as being a stopword or not, rather than eliminating the tokens found to be stopwords.
At this first junction, the part-of-speech tagger is chosen according to the information provided by the language identifier. For English documents, the Brill tagger shipped with GATE is used; for French and German documents, the Tree Tagger is employed. This decision also determines which component performs morphological analysis / lemmatisation: for German and French texts this functionality is likewise provided by the Tree Tagger, whereas for English the morphological analyser shipped with GATE as an additional plugin is used. Either way, the tokens are now enriched with part-of-speech and lemma information. Because the part-of-speech tagsets are heterogeneous and differ from language to language, a mapping plugin additionally assigns each token its coarse morphosyntactic category, so that tokens can be treated uniformly in the later stages of frequency and keyphrase analysis.
Larger syntactic units are then identified by the noun chunker, which again depends on the information provided by the language identifier, as different rule sets are loaded for English, German and French. For German and French texts the MuNPEx chunker is employed, whereas for English documents a noun chunker implemented for this thesis is used. Finally, a frequency analysis step produces frequency lists of overall wordform and lemma occurrence.
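The language-dependent branching described above can be illustrated with a minimal sketch. It is not the GATE configuration used in this thesis: the component names are taken from the text, whereas the dictionary layout, the language codes and the functions `select_components` and `mark_stopwords` are hypothetical and serve illustration only.

```python
# Illustrative sketch only; none of these names are part of the GATE API.
# The component labels mirror the description in the text.
PIPELINE_BY_LANGUAGE = {
    "en": {
        "tagger": "Brill tagger (GATE)",
        "lemmatiser": "GATE morphological analyser",
        "chunker": "thesis noun chunker",
    },
    "de": {
        "tagger": "Tree Tagger",
        "lemmatiser": "Tree Tagger",
        "chunker": "MuNPEx",
    },
    "fr": {
        "tagger": "Tree Tagger",
        "lemmatiser": "Tree Tagger",
        "chunker": "MuNPEx",
    },
}


def select_components(language):
    """Return the tagger, lemmatiser and chunker for an identified language."""
    try:
        return PIPELINE_BY_LANGUAGE[language]
    except KeyError:
        raise ValueError(f"unsupported language: {language!r}") from None


def mark_stopwords(tokens, stopwords):
    """Flag every token as stopword or not; no token is removed."""
    return [
        {"string": token, "is_stopword": token.lower() in stopwords}
        for token in tokens
    ]
```

Keeping the stopwords in the token sequence, merely flagged, means that later stages can still see them, for instance when a noun chunk contains a determiner or preposition.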
In addition to the overall frequency lists, separate frequency lists are created for each of the coarse-grained morphosyntactic categories that represent content words (nouns, verbs, adjectives). These lists allow convenient lookup during the lexical/statistical part of the keyphrase analysis.
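How the tag mapping and the frequency lists interact may be easier to see in a small sketch. The following is an assumption-laden illustration, not the plugin code: the mapping table is a hypothetical excerpt (assuming Penn Treebank style tags for English and STTS style tags for German), tokens are simplified to `(wordform, lemma, tag)` tuples, and the per-category lists are assumed to count lemmas, which the text does not specify.

```python
from collections import Counter, defaultdict

# Hypothetical excerpt of a tag-to-category mapping (assumption, not the
# actual mapping plugin): language-specific tags map to coarse categories.
COARSE_CATEGORY = {
    "en": {"NN": "noun", "NNS": "noun", "VBZ": "verb", "VBD": "verb",
           "JJ": "adjective", "DT": "other"},
    "de": {"NN": "noun", "VVFIN": "verb", "ADJA": "adjective", "ART": "other"},
}

# Categories treated as content words, as named in the text.
CONTENT_CATEGORIES = {"noun", "verb", "adjective"}


def frequency_lists(tokens, language):
    """Build overall wordform and lemma counts plus per-category lemma counts.

    `tokens` is an iterable of (wordform, lemma, tag) tuples.
    """
    mapping = COARSE_CATEGORY[language]
    wordforms = Counter()
    lemmas = Counter()
    by_category = defaultdict(Counter)
    for wordform, lemma, tag in tokens:
        wordforms[wordform] += 1
        lemmas[lemma] += 1
        # Unknown tags fall back to "other" and are not counted as content words.
        category = mapping.get(tag, "other")
        if category in CONTENT_CATEGORIES:
            by_category[category][lemma] += 1
    return wordforms, lemmas, dict(by_category)
```

Because the per-category lists are keyed by the coarse category rather than by the language-specific tag, the same lookup code can serve English, German and French documents.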
The workflow as a whole is depicted in the figure below: the white boxes represent components or plugins that have been (partly) implemented as part of this thesis, whereas the grey boxes represent components that were available off the shelf as part of the GATE framework.
To achieve reasonable results for a wide variety of input types (different document formats, very short or very long texts, different languages), the KeywordAnalyser plug-in for GATE consumes most of the aggregated information that the previously described preprocessors added as annotations to the GATE document structure. The component is divided into a number of subpackages, each of which is responsible for a different aspect of the keyword analysis process. The distributed responsibilities include
The following figure offers a high-level overview of the first two parts of the keyword analysis (statistical computation over lexical items and extraction of complex terms), whereas the process of grouping (or clustering) the complex terms and determining a candidate for each cluster is sketched in the second figure.