The evaluation of the keyphrase extraction algorithm is divided into a quantitative assessment performed on a dataset obtained from PubMed Central, and a qualitative assessment in the form of a user study.
An evaluation of the keyphrase extraction algorithm based solely on objective measures is meaningful only to some extent, partly because the question arises what actually constitutes a gold standard, and partly because it is debatable whether such an experiment can be set up as a selection experiment for human judges.
The task may be too subjective for judges to reach sufficient inter-annotator agreement on the selected keywords/keyphrases (Barker & Cornacchia report spectacularly low kappa values), an issue also encountered in the gold-standard annotations for the GENETAG dataset.
To account for this, the evaluation procedure was split into a quantitative assessment and a user study, as reported next.
The dataset obtained from PubMed Central comprises 77,496 peer-reviewed articles. Its XML schema offers a fine-grained distinction between various aspects of publication-related metadata (journal, authors, affiliation, index terms/keywords), full text and references.
From these articles, only those containing assigned keywords were considered, reducing the number of articles used as evaluation dataset to 1,323, consisting of 4,921,583 words in total. The 1,323 articles were distributed across 254 different journals published by PubMed Central, ranging from Abdominal Imaging to World Journal of Urology. The text entered into the keyphrase extraction algorithm did not contain the keyphrases assigned to the articles. The workflow of the dataset construction and quantitative evaluation is given in the figure below.
The average document length was 3,720 words (median: 3,473 words); the longest document had 15,782 words and the shortest only 6 words. The total number of assigned keywords in this dataset was 6,931, so that the mean number of assigned keywords/keyphrases per document was just above 5, with the median at 5 as well. The maximum number of keyphrases assigned a priori to a document was 29, the minimum was 1. Document size and the number of assigned keyphrases were not found to correlate.
Before settling on evaluation metrics, it is important to consider the scenarios that may be encountered when assessing the predictions of the algorithm against the gold standard. A naive approach would treat the gold standard as an unforgiving ground truth that only permits fully matching predictions to be counted as positives, whereas partly matching predictions would be regarded as negatives (or errors).
However, partial matches are often too close to the gold standard to be treated as errors, a situation very commonly encountered in biomedical NLP.
The following introduces such types of partial matches by typical examples found in the dataset and discusses their proper treatment.
| Source | Example |
|---|---|
| Gold Standard | Cerebral cortex |
| Prediction | cerebral cortex |

| Source | Example 1 | Example 2 |
|---|---|---|
| Gold Standard | Traumatic brain injury | Head and neck cancer |
| Prediction | brain injury | neck cancer |

| Source | Example 1 | Example 2 |
|---|---|---|
| Gold Standard | Epilepsy | Lymphoma |
| Prediction | temporal lobe epilepsy | Bcell Lymphoma |

| Source | Example 1 | Example 2 |
|---|---|---|
| Gold Standard | Distal radial fractures | Cell cycle control |
| Prediction | distal radial fracture | cell cycle |

| Source | Example 1 | Example 2 |
|---|---|---|
| Gold Standard | Biological agent | Haloarchaea |
| Prediction | biological agents | haloarchaeal genomes |

The introduced differences in partial matching are graphically displayed in figures below, and can be characterised in the contingency matrix of the following table.
| | Prefix (Left Boundary) | Suffix (Right Boundary) |
|---|---|---|
| Class1 | Semantic Transfer (medium penalty) | Widening Scope (very low penalty) |
| Class2 | Syntactic Variation (low penalty) | Narrowing Scope (low penalty) |
As discussed, Class1 suffix matches typically widen the scope of the meaning conveyed by the target term, thereby deviating very little (and only towards a more general meaning). This should be reflected in a very low penalty when deciding the grade of correctness, because the more generic phrases are grounded again in the context of the whole set of predictions. Typical Class2 errors are also still very close to the target and should only be slightly penalised, whereas Class1 prefix matches often imply a slight semantic transfer, since the head noun is missing compared to the gold standard. While such a deviation is not desirable, the predicted term is still close to the gold standard term, which should be reflected in a medium penalty. This scoring philosophy is closely related to the one used in GENETAG evaluation efforts and to the evaluation scheme for Named Entity Recognition used by the Message Understanding Conference (MUC-4) in 1995.
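The matching classes can be made concrete with a small sketch. It assumes, based only on the examples in the tables above, that matching is case-insensitive and operates on character-level prefixes and suffixes, that Class1 denotes a prediction shorter than the gold phrase, and that Class2 denotes a prediction longer than it. The function name `classify_match` and the label strings are illustrative, not taken from the evaluated system.

```python
def classify_match(gold: str, prediction: str) -> str:
    """Assign a match class to one prediction relative to one gold phrase.

    Assumption: case-insensitive, character-level comparison, inferred
    from the example tables rather than from a stated definition.
    """
    g = gold.strip().lower()
    p = prediction.strip().lower()
    if not g or not p:
        return "NoMatch"
    if p == g:
        return "FullMatch"
    # Class1: the prediction is a proper part of the gold phrase.
    if g.endswith(p):
        return "C1Suffix"  # widening scope, e.g. "brain injury"
    if g.startswith(p):
        return "C1Prefix"  # semantic transfer, e.g. "cell cycle"
    # Class2: the gold phrase is a proper part of the prediction.
    if p.endswith(g):
        return "C2Suffix"  # narrowing scope, e.g. "temporal lobe epilepsy"
    if p.startswith(g):
        return "C2Prefix"  # syntactic variation, e.g. "biological agents"
    return "NoMatch"
```

A purely character-based test can over-match on very short phrases, so a real implementation would probably add token-boundary checks for multi-word phrases.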


Now, with these considerations in mind, it is possible to define a partial order over the previously introduced matching classes as depicted in the figures above, which also reflects the degree of correctness when evaluating for recall, where 1.0 means a full match, and 0.0 means no match at all.
The previously defined order offers a reasonable instrument for applying the standard information retrieval metrics recall and precision in a fair evaluation, despite the problems pointed out at the beginning of this section.
Recall is given as the ratio of the number of correctly predicted keyphrases to the number of keyphrases specified by the gold standard:
recall = (# of correctly predicted keyphrases) / (# of gold standard keyphrases)
Precision is defined as the ratio of the number of correctly predicted keyphrases to the total number of predictions:
precision = (# of correctly predicted keyphrases) / (# of total predictions)
To put the fair evaluation into context, two additional assessments were considered: (i) a strict evaluation, treating all partial matches as negatives (or errors), and (ii) a relaxed evaluation, treating all partial matches as positives.
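The three assessment modes can be sketched as different ways of crediting a match class. The numeric credits in `FAIR_CREDIT` are hypothetical placeholders that merely respect the penalty levels named above (very low, low, medium); the actual values are not stated in this section. Whether the graded credit enters precision in the same way as recall is likewise an assumption of this sketch. The counts in `toy_counts` are invented for illustration only.

```python
from typing import Dict, Tuple

# HYPOTHETICAL credits: only their ordering follows the text.
FAIR_CREDIT: Dict[str, float] = {
    "FullMatch": 1.0,
    "C1Suffix": 0.875,  # very low penalty
    "C2Prefix": 0.75,   # low penalty
    "C2Suffix": 0.75,   # low penalty
    "C1Prefix": 0.5,    # medium penalty
}


def credit(match_type: str, scheme: str) -> float:
    """Credit of one prediction under the strict, fair or relaxed scheme."""
    if match_type == "NoMatch":
        return 0.0
    if scheme == "strict":
        return 1.0 if match_type == "FullMatch" else 0.0
    if scheme == "relaxed":
        return 1.0
    if scheme == "fair":
        return FAIR_CREDIT[match_type]
    raise ValueError(f"unknown scheme: {scheme}")


def precision_recall(counts: Dict[str, int], n_gold: int,
                     scheme: str) -> Tuple[float, float]:
    """Compute (precision, recall) from per-class prediction counts."""
    n_predictions = sum(counts.values())
    correct = sum(credit(m, scheme) * n for m, n in counts.items())
    precision = correct / n_predictions if n_predictions else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall


# Invented toy document: 10 predictions, 5 gold keyphrases.
toy_counts = {"FullMatch": 2, "C1Suffix": 1, "C1Prefix": 1, "NoMatch": 6}
for scheme in ("strict", "fair", "relaxed"):
    print(scheme, precision_recall(toy_counts, n_gold=5, scheme=scheme))
```

By construction, the fair scores always lie between the strict and the relaxed scores, which is what makes the two additional assessments useful as lower and upper reference points.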
It is important to note that the introduction of partial matching also opens the door for ambiguity: in a document, multiple candidates could be mapped to the same single target keyphrase with the same matching type. For instance, candidate ac and candidate bc could both be assigned a Class1 SuffixMatch for gold standard phrase cc.
In such a case, in order to maintain correct counts for positives and negatives, only the candidate with the highest confidence value is considered the true bearer of that matching type, and the other equally matching candidates are demoted to NoMatch.
At the same time, a candidate cc could also be found for target cc. In such a case, preference is given to the match type with the higher priority, which is derived from the defined partial order, such that FullMatch overrides Class1 SuffixMatch. As a result, only the highest match type retains its status, whereas the disfavoured candidates are assigned a NoMatch in turn.
This ensures that at most one true positive per gold standard item is counted when calculating precision and recall.
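The disambiguation rule can be sketched as follows. The only priority relation confirmed by the text is that FullMatch overrides Class1 SuffixMatch; the ranking of the remaining classes in `PRIORITY` is an assumption derived from the penalty levels. The `Candidate` type and the function name are illustrative.

```python
from dataclasses import dataclass, replace
from typing import Dict, List

# ASSUMED ranking; the text only confirms FullMatch > C1Suffix.
PRIORITY: Dict[str, int] = {
    "FullMatch": 4, "C1Suffix": 3, "C2Prefix": 2, "C2Suffix": 2, "C1Prefix": 1,
}


@dataclass(frozen=True)
class Candidate:
    phrase: str
    target: str        # gold phrase the candidate was matched against
    match_type: str
    confidence: float


def resolve_ambiguity(candidates: List[Candidate]) -> List[Candidate]:
    """Keep at most one matching candidate per gold phrase.

    The winner has the highest-priority match type; ties are broken by
    confidence. All other matching candidates are demoted to NoMatch.
    """
    best: Dict[str, int] = {}  # target -> index of the current winner
    for i, c in enumerate(candidates):
        if c.match_type == "NoMatch":
            continue
        j = best.get(c.target)
        if j is None:
            best[c.target] = i
            continue
        current = candidates[j]
        if (PRIORITY[c.match_type], c.confidence) > (
            PRIORITY[current.match_type], current.confidence
        ):
            best[c.target] = i
    winners = set(best.values())
    return [
        c if i in winners or c.match_type == "NoMatch"
        else replace(c, match_type="NoMatch")
        for i, c in enumerate(candidates)
    ]
```

Applied to the example from the text, candidates ac and bc (both Class1 SuffixMatch for cc) and candidate cc (FullMatch for cc) resolve to a single FullMatch for cc, with ac and bc demoted to NoMatch.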
For the 1,323 documents, a total of 52,825 keyphrase predictions were generated. The maximum number of keyphrase predictions for a document was 99, and the minimum was 0 (on 13 occasions, for very short documents with an average length of 64 words and a median of 29 words). On average, just under 40 keyphrases were extracted per document, with the median at 40.
The tables below display the matching distribution for the fair evaluation, which are the main results obtained for this quantitative evaluation, and referred to unless otherwise stated. Prediction ratio gives a view on the matching distribution relative to the total number of predictions, whereas target ratio displays the matching distribution relative to the gold standard.
| Distribution | Count | Prediction Ratio | Target Ratio |
|---|---|---|---|
| Total | | Prediction Total: 52,825 | Target Total: 6,931 |
| FullMatch | 2,065 | 0.039 | 0.298 |
| C1Suffix | 959 | 0.018 | 0.138 |
| C1Prefix | 517 | 0.010 | 0.075 |
| C2Prefix | 322 | 0.006 | 0.046 |
| C2Suffix | 204 | 0.004 | 0.029 |
| NoMatch | 47,243 | n/a | 0.413 |
The actual precision and recall values for strict, fair and relaxed assessment can be derived directly therefrom, and are displayed in the following table.
| | Precision | Recall |
|---|---|---|
| Fair | 0.077 | 0.518 |
| Strict | 0.039 | 0.298 |
| Relaxed | 0.089 | 0.587 |
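As a sanity check, the figures that follow from the reported counts by plain division can be recomputed. The sketch below reproduces the strict precision and recall and the relaxed recall; the fair scores are not reproduced because they depend on per-class credits that are not given in this section. The row with count 204 is taken here to be the remaining Class2 suffix class, and the variable names are illustrative.

```python
PREDICTION_TOTAL = 52825
TARGET_TOTAL = 6931

# Match counts as reported in the distribution table.
MATCHES = {
    "FullMatch": 2065,
    "C1Suffix": 959,
    "C1Prefix": 517,
    "C2Prefix": 322,
    "C2Suffix": 204,  # assumed label for the fifth match row
}


def ratio(count: int, total: int) -> float:
    """Ratio rounded to three decimals, as in the tables."""
    return round(count / total, 3)


# Strict: only full matches count as correct.
strict_precision = ratio(MATCHES["FullMatch"], PREDICTION_TOTAL)
strict_recall = ratio(MATCHES["FullMatch"], TARGET_TOTAL)

# Relaxed recall: every full or partial match counts as correct.
relaxed_recall = ratio(sum(MATCHES.values()), TARGET_TOTAL)

prediction_ratios = {m: ratio(n, PREDICTION_TOTAL) for m, n in MATCHES.items()}
target_ratios = {m: ratio(n, TARGET_TOTAL) for m, n in MATCHES.items()}

print(strict_precision, strict_recall, relaxed_recall)
```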
Given these values, the first thing to note is that precision is very low. The problem faced here is the large amount of keyphrase predictions constituting the NoMatch set.
As mentioned above, on average 40 predictions compete for only 5 assigned target keyphrases. The next figure reveals that the number of predictions grows with document size, whereas the number of a priori assigned keyphrases constituting the gold standard remains more or less stable at around 5 regardless of document length.
It should be clearly noted that, instead of restricting the result set to the n best items as is commonly done, all identified keyphrase candidates are presented to the judges here, which makes precision more vulnerable to a decline.
Moreover, Turney and Jones et al. observe that not all a priori assigned keyphrases are necessarily contained in the document, as, for example, frequently recurring phrases in the text body may be rephrased for keyword assignment; they empirically determine the average proportion of containment at roughly 75%. The implication of this observation is that a recall expectation of 100% is unrealistic. A good example from this evaluation is found in the article "Biomechanics of Traumatic Brain Injury: Influences of the Morphologic Heterogeneities of the Cerebral Cortex" (PubMedID 2413127), where the index term "Inhomogeneities" is given but never actually mentioned in the full text. Instead, "heterogeneities" is used quite often in the article and extracted as a keyphrase candidate, unfortunately with a NoMatch result.
Focusing further on the results of the fair evaluation, the next figure shows that, on average, nearly 2 a priori keyphrases could be matched by the top 40% of the candidate list. Taking into consideration that the average number of assigned keyphrases is 5, and that possibly not all of them are contained in the document, the average recall of 51.8% compares quite well with the KEA evaluation, where recall on author-assigned keyphrases is reported at 17% on average.
Finally, an investigation into the distribution of matching types and their contribution relative to the position in the candidate list (which is ordered by confidence) revealed that candidates of type FullMatch are commonly found in the top segments of the list, as the following figure shows. This is a positive sign, as it shows that the ranking function for score assignment does its job reasonably well, accumulating the most contributing (and therefore important) candidates at the beginning of the list. It is also notable that this finding is confirmed by the qualitative evaluation results discussed in the next section.
The previous figure also exposes that the impact of the partial matching types is low throughout the prediction space, except for the C1 SuffixMatch, which also is a major contributor and stands out when considering partial matches only, a fact that will be interesting to look back to when examining the results of the user study.
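The positional analysis described above can be approximated by cutting each confidence-ranked candidate list into equally sized relative segments and counting the match types per segment. This is a minimal sketch for a single document; the use of ten segments and the function name are assumptions, and the original analysis may have binned the data differently.

```python
from collections import Counter
from typing import List


def segment_distribution(match_types: List[str],
                         n_segments: int = 10) -> List[Counter]:
    """Count match types per relative position segment.

    `match_types` holds the match class of each candidate, ordered by
    descending confidence (best candidate first).
    """
    segments = [Counter() for _ in range(n_segments)]
    total = len(match_types)
    for rank, match_type in enumerate(match_types):
        # Map the relative rank in [0, 1) onto a segment index.
        index = rank * n_segments // total
        segments[index][match_type] += 1
    return segments
```

Because `Counter` objects support addition, the per-document results can be summed segment by segment to obtain a corpus-wide distribution.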
The dataset including PubMedIDs, document statistics, gold-standard annotations and predictions is available for download here.
A user study was conducted in which the self-selected judges were preferably authors of the documents used as evaluation dataset. As the judges were free to choose documents of their liking, inter-annotator agreement measures such as kappa, which proved problematic for this kind of task in previous experiments, became irrelevant for this qualitative evaluation. Furthermore, as human interaction takes place in most use cases where the tool has been deployed, this form of user study resembles a real-world scenario more closely. Next, the considerations underlying the experiment are outlined, and the path taken to conduct the user study is described.
Each judge taking part in the experiment was asked to provide the system with a freely chosen number of documents that he or she knows very well, preferably documents that he or she (co-)authored. It was the judge's task to accept or reject each predicted keyphrase for the documents presented to the system; in case of a rejection, the reason was sought from a set of 3 options: too general, too specific, or nonsense.
Besides giving a more realistic view on precision and user-acceptance, the experiment is also regarded as insightful for possible adjustments of the system in a future development cycle, particularly when considering the distribution of reasons for rejection.
The experiment was conducted via a web-application specifically developed for this undertaking, presenting the self-selected judges with an interface to the keyphrase extractor and an upload mechanism for the documents. The predictions were also presented via a web-interface in such a way that judges were able to conveniently fill out the generated forms and submit their assessment.
Overall, 47 users signed up for the experiment, which ran for 10 days. In total, 94 documents were used as input, with the largest document at 81,668 words, whereas the smallest document consisted of only 4 words (the instructions suggested using reasonably sized documents, if possible consisting of at least 500 words). The average document length was 7,671 words, the median was 5,128 words per document, and it took an average of just over 3 minutes to determine the 'good' and 'bad' candidates per user, per document.
No restrictions were imposed on the document content, and judges were encouraged to use documents from all sorts of domains, ranging from scientific articles, technical records (e.g. RFCs), contemporary writing and news, to personal communication. Documents, however, were required to be written in English and to be of type PDF, Microsoft Word, plain text or HTML. The judges came from a multitude of backgrounds, ranging from PhD students and researchers (mostly in computer science) to IT professionals, engineers, and persons employed in the financial sector.
An overview of the evaluation runs per judge is given in the following table.
| | Judges | Documents |
|---|---|---|
| Total | 47 | 94 |
| | 1 | 10 |
| | 1 | 9 |
| | 1 | 8 |
| | 2 | 4 |
| | 3 | 3 |
| | 13 | 2 |
| | 27 | 1 |
The next table shows a breakdown of the average acceptance ratio by document length, which has been partitioned into segments of varying size. Most evaluation runs were performed on documents between 2,001 and 10,000 words, which corresponds to the size of an average scientific conference article. In this segment, the accept ratio of 49.4% was almost level with the reject ratio. The instructions suggested not using documents with fewer than 500 words; nevertheless, this was the case on 8 occasions.
| Size in Words | Documents | Accept % |
|---|---|---|
| Total: 721,157 | 94 | |
| 0 - 500 | 8 | 41.8 |
| 501 - 1,000 | 7 | 49.3 |
| 1,001 - 2,000 | 10 | 42.6 |
| 2,001 - 10,000 | 55 | 49.4 |
| 10,001 - 25,000 | 8 | 46.8 |
| 25,001 - 50,000 | 3 | 32.4 |
| 50,001 - 100,000 | 3 | 70.6 |
In the subsequent table, the overall scores for accept and reject assessments are given, showing that acceptance at 49% almost matches the amount of rejection. This finding is about 13% below what has been reported by Turney; however, in his experiment only the top-7 keyphrases were presented to the judges, whereas in the experiment described here all identified keyphrase candidates are considered, without postprocessing them or excluding phrases below a certain threshold or rank, which is also partly responsible for the growth of the candidate set with increasing document length.
| | % | Absolute |
|---|---|---|
| Total | 100.0 | 2,600 |
| Accept | 49.0 | 1,273 |
| Reject | 51.0 | 1,327 |
Looking more closely at the overall distribution of rejections relative to the position in the candidate list, as depicted in the next figure, the top-2 segments clearly contain more acceptable keyphrases than the other segments, while in the last segment of the candidate list (90 - 100%) a sharp decline of acceptable items is observable. This is in line with what has been found in the quantitative evaluation, where clearly the top of the candidate list accumulates most matches, including partial ones.
The following table shows the overall distribution of reject reasons. Here, too general accumulates more than half of all reject reasons, followed by nonsense.
| Reason | Reject % | Absolute |
|---|---|---|
| too general | 52.2 | 693 |
| too specific | 13.6 | 181 |
| nonsense | 34.1 | 453 |
Investigating the rejection data directly, supported by the distribution over the rejection reasons displayed in Figure 5.11, shows as an overall trend that the proportion of rejected items with the reason too general dominates in all segments of the candidate list. Additionally, rejections classified as too specific remain more or less stable and have a rather low impact over the whole list, but a relatively large proportion of nonsense rejections is observable in the 0 - 10% segment. Therefore, this segment is isolated and examined in more detail, in the hope of uncovering the reasons for the strong presence of these nonsense rejections.
Of the 61 occurrences of nonsense annotations in the first 10% of the candidate list, 26 were due to the text-conversion mistakes and sentence boundary mismatches pointed out above. At 42%, this is a considerably larger proportion than the 28% overall ratio of nonsense-based errors introduced by such types, as reported above. Examples from this segment are "Face Recognition Face recognition", "Semantic Data Semantic data" and "Scheduling Events Scheduling events" for sentence boundary mismatches, and "speci cation" (specification), "bu ers" (buffers) and "work ows" (workflows) for text conversion errors. A reduction of these errors would mean that the "nonsense" proportion in this segment drops to a value that more smoothly resembles the behaviour observed in the following three segments for "nonsense" rejections.
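The sentence boundary mismatches listed above share a simple surface pattern: the phrase consists of the same token sequence twice, differing only in case. One hypothetical way to flag such candidates before they reach the judges is sketched below; this filter is not part of the evaluated system, and it does not address the text conversion errors such as lost ligatures.

```python
def is_boundary_duplicate(candidate: str) -> bool:
    """Flag phrases whose second half repeats the first half.

    Hypothetical heuristic: case-insensitive comparison of the two
    token halves, e.g. "Face Recognition Face recognition".
    """
    tokens = candidate.lower().split()
    half, remainder = divmod(len(tokens), 2)
    return remainder == 0 and half > 0 and tokens[:half] == tokens[half:]
```

Such a check would also flag legitimate reduplicated phrases, so it is better suited to marking candidates for inspection than to discarding them outright.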
The dataset including AnnotatorIDs, document statistics, accepted and rejected keyphrases is available for download here.