Byron Georgantopoulos
Department of Linguistics
University of Edinburgh
The proposed corpus-based, statistical approach exploits several heuristics to determine the summary-worthiness of sentences. It uses the statistical appearance of words, word pairs and noun phrases to calculate sentence weights and then extracts the highest-scoring sentences. The statistical model used in the scoring function is a slight variation of the Term-Frequency Inverse-Document-Frequency (TFIDF) term weighting formula.
The results obtained by applying the system to 5 test texts, separately and cumulatively for the three afore-mentioned heuristics, were subjectively judged for their quality. This evaluation showed that noun phrases, as a separate method, produce the best extracts and that they are also part of the best performing combinations.
I would like to thank my supervisors, Andrei Mikheev and Jim Hurford. Andrei was very helpful in the technical-programming issues as well as the general statistical NLP techniques. Jim provided the ``linguistic'' viewpoint and clever suggestions.
I would also like to thank the Greek State Scholarships Foundation, for providing the funding, without which I would not have been able to do this MSc.
The number of papers published today is ever increasing, while the Internet explosion has created vast libraries of machine-readable texts. Keeping informed and up to date about recent advances in one's field has become both vital and difficult. Since it is impossible to read through all the papers which are published nowadays, it is of great help to present them in a condensed way, i.e. using abstracts which summarise the content of an article. In this way the reader can quickly get a general idea of the article and decide whether it is interesting enough to read in full.
However, the work of abstracting a document is far from easy. It requires skilled and specialised abstractors and, of course, it takes plenty of time. Only after an abstract is prepared can the paper be published, which results in an unavoidable and irritating delay. As with machine translation, linguists and computer scientists have tried in recent years to substitute and/or aid human abstractors with machine ones. Automatic (or computer-based) abstracting, which started back in the late fifties, has shown considerable progress, has produced several techniques and theories and has even yielded some commercial software.
This thesis focuses on the production of a special kind of abstract, called an extract: a set of sentences taken from the original text. These sentences are selected on the basis of the amount of information they carry about the subject content. Using statistical tools, the system ``learns'' from a corpus of papers which elements of a sentence make it important enough to function as a highly representative sentence. There are several such elements explored in the existing literature; the system utilises three of them: words, word pairs and noun phrases, both as separate units and cumulatively (by combining two or all of them). It finally attempts to find the best combination, i.e. the one which produces the most satisfying set of extracted sentences.
There is no generally agreed definition for abstracts although several efforts have been made to grasp every possible interpretation. Without trying to be perfectly exact, we can say that an abstract is a condensed representation of the content of a text. An abstract serves as a representative of a text, but much reduced in size. However, this reduction should entail minimal loss in content.
A successful abstract must have certain attributes. It must be brief in style and size, accurate (neither under- nor over-informative) and clear in expression. These virtues have been well documented and require trained professionals to apply them. A parallel can be drawn with translators: both have to transform a given text while trying to remain faithful to it.
Abstracts support the transfer of knowledge and communication within the academic and scientific community. They ``inherit'' all the purposes of paper publishing, but it is for their brevity that they:
There are basically two types of abstracts: the Indicative abstract which gives a brief outline of the basic concepts and conclusions of the paper and the Informative abstract which is usually longer and includes results, figures etc.
Depending on the author, abstracts can be author-prepared, subject-expert-prepared or professionally prepared. Author-prepared abstracts are often criticised for bias and subjectivity; experts are probably the best choice, while professional abstractors are competent but will not produce high-quality work if they have little knowledge of the subject matter.
A final distinction is made between human-produced and computer-produced summaries. In the rest of the dissertation by summary I will refer to summaries (or abstracts or any other form of compression) made by humans.
As Machine Translation started to show very promising results in the fifties, scientists felt confident enough to turn their hands to every other NLP problem, including of course automatic abstracting. With vast amounts of text accumulating, it was of great help to condense published papers via computer programs. Luhn (1959) created the first abstracting system, based on word-frequency statistics. Nowadays, depending on the processing tools, the available corpus and the research motives, one can choose between pure NLP, statistical NLP or hybrid methods. In order to select a method we need to know:
This method is very close to the way humans do abstracting: the system has first to understand the text and then to create the abstract. The first step produces a canonical-logical form of the text that feeds the sentence generator to write the abstract. Both of these issues, text analysis and understanding, and text generation are huge NLP problems by themselves, and computationally unachievable so far. The complexities of language (anaphora, context, polysemy, global common knowledge required, etc.) make it too hard to create a system which will be effective in terms of producing good summaries, as well as being fast and maintainable.
Methodologies for NLU-based summarising are borrowed mainly from information extraction systems. Attempts have been made to tailor these systems to automatic summarising for specific domains. They try to match their stored knowledge entities against the input text so as to locate its dominant concepts. The major drawback is that all these entities have to be hand-coded by humans. The knowledge base is usually huge because it has to cater for every possible situation. Humans also undertake the maintenance and updating of the system.
Automatic summarising via sentence extraction aims to locate the best content-bearing sentences in a text. Extracting sentences can be a much simpler and hence fast and feasible approach. On the other hand, the resulting passage might not be very comprehensible or refined: we sacrifice coherence for speed and feasibility.
The assumption behind extracting is that there should be a set of sentences which present all the key ideas of the text, or at least a great number of these ideas. The goal is first to identify what really influences the significance of a sentence, what makes it important. The next step is to train and program a system to automatically locate these elements in a sentence and compute its summary-value.
It is evident that this method avoids all the above-mentioned conventional NLP problems: no analysis, representation or understanding of the text is required. No generation has to take place, and in addition the extracted sentences will be perfectly grammatical.
Although it is logical to suppose that the original text will include sentences that could be as good as the abstract sentences, there are several issues to be dealt with:
This method uses certain rules to select the sentences for the extract. The sentences which satisfy these rules are the ones to be extracted. Examples of rules are:
The format of rules is usually an if ... then template:
IF (S(I) IS paragraph-initial) OR (S(I) CONTAINS ``This paper presents'') THEN extract(S(I))
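Purely as an illustration (rule-based extraction is not the approach implemented in this thesis), such rules can be coded as simple predicates over sentences. In the sketch below the cue-phrase list and the paragraph-position test are hypothetical examples, not rules taken from any real system:

# Illustrative sketch of a rule-based extractor: a sentence is selected
# as soon as any hand-written rule fires.
CUE_PHRASES = ["this paper presents", "we propose", "in conclusion"]   # hypothetical cue list

def rule_extract(sentences):
    """sentences: list of (text, is_paragraph_initial) pairs."""
    extract = []
    for text, paragraph_initial in sentences:
        lowered = text.lower()
        if paragraph_initial or any(cue in lowered for cue in CUE_PHRASES):
            extract.append(text)
    return extract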
The two problems for NLP-based strategies referred to above hold for this approach as well. The system must be provided with the rules by a human, and these rules are specifically tailored to the domain they are written for. A change of domain would mean a major rewriting of the rules.
In contrast to the human-provided rules, the statistical approach basically tries to learn automatically the rules which predict a summary-worthy sentence.
Statistics-based systems are empirical, retrainable systems which minimise human effort. Their goal is to identify the units in a sentence which influence its importance and try to learn the dependency between the significance of units and the significance of a sentence. In other words, what unit triggers a good sentence and in what quantities it is needed to do so; the occurrence of this unit will help to determine the weight of the sentence. In this framework each sentence is assigned a score that represents the degree of appropriateness for inclusion in a summary.
Statistical techniques for automatic extracting are very similar to the ones used in information retrieval. In the latter, each document is viewed as a collection of indices (usually words) and every index has a weight which corresponds to its statistical appearance in the document. The document then is represented by a vector with index weights as elements. In a typical document retrieval query only the top frequency indices will be returned. In the same way, for extracting, each document is a collection of weighted sentences and the highest scoring ones comprise the final extract.
The statistical methodology has two major advantages over NLP-based abstracting and rule-based extracting. For a crude system (words only) there is no need for language processing, since all the scoring is based on superficial features, i.e. the strings in a sentence. Furthermore, as with other statistical approaches, the same system can be used in different domains, as long as it can first learn the statistics and the dependencies (be trained) on each new domain.
Statistics do not contradict the rule-based extracting. It is possible that they will come up with the same rules and factors for predicting the significance of sentences. However, statistically based systems learn these rules automatically from a corpus, and thus avoid the subjectivity and retrainability problems imposed by human coding. They also produce more refined rules because they are based on (combinations of) several predicting (diagnostic) units.
On the other hand statistical systems usually presuppose the independence of instances of the same diagnostic unit. This oversimplified model does not apply to natural language: words, noun phrases, etc. in a sentence do not appear completely independently. In addition, the selection of a statistical model which is responsible for utilising the statistical regularities of the existing corpus has a major effect on the performance of the system. Although a factor can be really important for summarising, an inadequate or irrelevant model may not exploit that. Hence, researchers usually attempt to find the best combination of factors as well as the statistical modelling which makes the best out of them.
This chapter presents the tools, laws and techniques which are generally used in statistical NLP and in the implemented system.
In our approach, the score of a sentence depends on its diagnostic units. A diagnostic unit (DU) is anything in a sentence which gives a clue to its significance. It can be a word, a sequence of words, a syntactic structure, its position within the text, etc. The presence and the importance of the DUs of a sentence virtually determine its score. The scoring formula can encapsulate more than one DU; for example, the goodness of a sentence can rely both on its terms and its position. The form of the formula follows the term-weighting model:
\[ \mathrm{Score}(s) = \sum_{i=1}^{n} w_i \cdot h_i(s) , \]
where $w_i$ is the weight of the $i$-th heuristic, $h_i(s)$ is the specific score of the sentence for the $i$-th heuristic and $n$ is of course the number of heuristics used.
The optimum weight of each contributing DU is found empirically, after several attempts with different weights. Ideally, the best combination is the one which maximises the scores of only the extract-worthy sentences. In order to train the system for this, we need to know these sentences beforehand.
The main measure of a diagnostic unit's (henceforth term's) weight is its statistical appearance in the specific document and its statistical appearance in the whole corpus. Statistical appearance can mean either how many times it is found in a collection (frequency) or whether it appears at all (binary feature).
The assignment of weights must discriminate between
relevant and non-relevant diagnostic units. There are
several ways to calculate these weights (Robertson and Jones 1994):
collection frequency: if $n_i$ is the number of documents term $i$ appears in and $N$ is the total number of documents, then the collection frequency weight is:
\[ cfw_i = \log \frac{N}{n_i} . \]
This measure favours terms which occur often in a certain document but rarely in the rest of the collection, by using the inverse of the number of documents the term appears in.
term frequency: It takes into account only the frequency of a term inside the document:
\[ tf_{ij} = \text{number of occurrences of term } i \text{ in document } j . \]
document length: It is logical to assume that terms appear more frequently in longer documents, so if a term is relatively more frequent in a short document than in a long one, then it is more important there. To incorporate document length in the weighting formula we define:
\[ dl_j = \text{total number of term occurrences in document } j . \]
This can be normalised by the average length of a document in the collection:
\[ ndl_j = \frac{dl_j}{\overline{dl}} , \]
where $\overline{dl}$ is the average document length over the collection.
combined evidence: Robertson and Jones (1994) combined the three weighting formulas into a single term weight:
\[ cw_{ij} = \frac{cfw_i \cdot tf_{ij} \cdot (K+1)}{K \cdot ndl_j + tf_{ij}} , \]
where K is a tuning constant, determined after successive trials, which adjusts the effect of term frequency and document length; K=0 eliminates their effect altogether. TREC (Text REtrieval Conference) Program trials showed that K=2 is an effective value.
Term-Frequency Inverse-Document-Frequency (TFIDF): Another standard weight computation method (e.g. (Salton et al. 1993)) combines the term frequency, the total number of documents ($N$) and the number of documents ($n_i$) that the term appears in:
\[ w_{ij} = tf_{ij} \cdot \log \frac{N}{n_i} . \]
The TFIDF scoring formula also favours terms which are highly frequent in a document but rare in the rest of the corpus.
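As a minimal illustration of the weighting formulas above (the function and variable names below are mine, not taken from any of the cited systems), the TFIDF weight of a term could be computed from raw counts as follows:

import math

def tfidf(term, doc_counts, corpus_docs):
    """doc_counts: term -> frequency in the current document;
    corpus_docs: list of such dictionaries, one per document in the collection."""
    tf = doc_counts.get(term, 0)                      # term frequency in this document
    n = sum(1 for d in corpus_docs if term in d)      # documents containing the term
    N = len(corpus_docs)                              # total number of documents
    if tf == 0 or n == 0:
        return 0.0
    return tf * math.log(N / n)                       # high in this document, rare elsewhere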
Making reliable inferences on occurrences counted in texts is not an easy task. Assuming that occurrences have a normal distribution, we will find many entries with very low frequencies. Are these entries good predictors? The answer is that it depends on the nature of the diagnostic unit. For single words, it is important to keep even the ones with very low occurrences because rare words are valuable. But for word pairs, a different approach is needed. It is evident that words with low frequency will construct word pairs with low frequency as well. Since these rare words are already encapsulated and evaluated in the single-word method, they have no value in word pairs as predictors. Good word pair predictors rely on the mutual information between words. Common statistical practice says that if the frequency of the term in question is below 5 then the statistical inference based on this frequency is unreliable.
Evaluation of NLP systems is a very difficult task; researchers have only recently begun to define and systematise it. For the purposes of this project, evaluation techniques for automatic information retrieval are presented, along with some specific techniques for evaluating abstracts and extracts.
Again measures of success are borrowed from automatic text
indexing and classification.
Given that we have (perfect) extracts to compare against the
computer-produced ones, the classical precision and
recall ratios can be used. Their formal definition
is:
\[ \mathrm{precision} = \frac{\text{number of correct sentences extracted}}{\text{total number of sentences extracted}} \]
\[ \mathrm{recall} = \frac{\text{number of correct sentences extracted}}{\text{number of sentences in the target extract}} \]
For example, suppose there is a paper for summarising, and a
set of sentences T (represented by their indices for
convenience) have been identified before by a human. Then
the system is tried on the same paper and produces the set C.
If T comprises of S1, S2, S3, S4, S5, S6 and C comprises of
S1, S7, S4, S8, S9 then:
Using the formulas, precision is two correct sentences out of five (40%), and recall is two correct sentences out of six (33%).
Precision indicates the ratio of correct hits over the total number of selected sentences. Recall indicates the ratio of similarity between the correct sentences extracted and the sentences in the human, ``correct'' extract (usually called the target extract). It is obvious that neither quantity can exceed 100%.
Precision and recall are inversely related. If more sentences are extracted, recall will rise as a result of more correct sentences being retrieved, but precision will decrease because not all of the extra sentences will be correct. A combined measure, the F-score (see Hobbs et al. (1992)), can be used:
\[ F = \frac{(\beta^2 + 1)\, P R}{\beta^2 P + R} , \]
where P is precision, R is recall and $\beta$ is a parameter indicating the relative weight of precision and recall. If $\beta = 1$ there is no preference, if $\beta > 1$ recall is more important, and if $\beta < 1$ it is precision that is favoured.
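A short sketch of these three measures, assuming the target and computer extracts are given as sets of sentence indices:

def precision_recall_f(target, computer, beta=1.0):
    """target, computer: sets of sentence indices; beta > 1 favours recall."""
    correct = len(target & computer)
    precision = correct / len(computer) if computer else 0.0
    recall = correct / len(target) if target else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

# The worked example above: T = {1,2,3,4,5,6}, C = {1,7,4,8,9}
# gives precision = 2/5 = 40% and recall = 2/6 = 33%.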
There are certain assumptions behind these measures: (1) all sentences carry the same weight, so any omission has the same cost, and (2) there is a unique, ideal target extract. In fact, (1) some sentences are more important than others in the construction of an extract, and (2) a sentence (or more generally a collection of sentences) which does not belong to the target extract can nonetheless be significant for the subject content and act as a surrogate (without great loss) for a target sentence.
In addition to the afore-mentioned evaluations, there are some qualitative criteria specifically for abstracts:
For the present system, the interest lies in (1) how informative is the extract compared to the human produced abstract/extract and (2) which heuristic or combination of heuristics performs best.
Usually the success of a method is measured against some baseline techniques which produce an extract without any real processing. One such method is the random selection of sentences (also used in (Edmundson 1969)). Other, slightly more advanced methods include: (1) selecting sentences only from the beginning and the end of the text and (2) taking whole sections (e.g. the summary section) as the extract.
Comparing against a preselected set of sentences is a means of evaluating how well the computer ``simulates'' humans in pinpointing the important sentences in a text. The advantage of this comparison is that it evaluates computer extracts as extracts and not as abstracts. Virtues beyond pure extracting, such as a normal flow of ideas and resolved anaphora, are not compulsory.
The creation of target extracts (also called manual extracts) is made by human evaluators. They are usually required to read through the text and then mark the most important sentences. If the summarising system has no way of dealing with text coherence, there is no need for the target extract to have that attribute. Otherwise the human extractor has to pick up sentences which produce a connected and coherent passage.
Besides precision and recall measures we can also compute the mathematical similarity between the computer and target extracts. This can help to demonstrate how close the two passages are. Using the vector space model (see (Salton 1989)) we represent the extract of a document of N sentences as a vector:
\[ E = (e_1, e_2, \ldots, e_N) , \]
where $e_i$ has a value of 1 if sentence $i$ belongs to the extract and 0 otherwise. If $E^t$ is the human-produced target extract and $E^c$ is the automatic extract, their similarity is calculated by multiplying the vectors:
\[ \mathrm{sim}(E^t, E^c) = \sum_{i=1}^{N} e^t_i \, e^c_i , \]
where $e^t_i$ and $e^c_i$ are the elements of the vectors that represent the target and computer extracts respectively.
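A minimal sketch of this similarity over binary extract vectors; under this simple model it is just the number of co-selected sentences:

def extract_similarity(target_vec, computer_vec):
    """Binary extract vectors of equal length N: 1 if the sentence is in the
    extract, 0 otherwise. The inner product counts co-selected sentences."""
    return sum(t * c for t, c in zip(target_vec, computer_vec))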
Measures based on the binary notion of belonging/not belonging to the target extract certainly ignore the possibility that there may be a surrogate subset of sentences which is satisfactorily close to the target extract. For example consider these two sentences taken from the same text (from the introduction and the summary respectively):
It is obvious that these sentences have about the same content, but if one of them makes it to the target extract and the program chooses the other, the cost in the evaluation is the same as if any other sentence had been selected. If we could find a way to express that two sentences are ``similar'' in content, then the evaluation of an extract could be more precise. There is even a chance of judging the extract against a real abstract when a target extract is not available.
Sentence similarity can be influenced by morphological, syntactic and semantic criteria. Sentences with a great number of identical word stems or noun/verb phrases or similar logical forms can be said to be similar and thus interchangeable for inclusion in an extract.
For a more integrated study of evaluation, it is convenient to judge the extract, as a whole and sentence by sentence, as a stand-alone text excerpt and not against other targets. This method is based on the qualitative and quantitative judgements of a human reader, who assigns a score to each computer-produced sentence on a point scale (usually consulting the original text and probably the abstract). It is the least biased method, because the evaluator can value every sentence without being constrained by a target extract, and so an alternative extract can be equally appreciated if it is informative enough.
Five automatic extracting systems are reviewed to give a general idea of the historical background and progress. Edmundson is the basis, Kupiec is the most recent, NetSumm is a commercial product and Salton explores a somewhat different methodology, but with interesting ideas. Finally, Ono incorporates text understanding while avoiding the use of vast knowledge bases.
Note: Words in italics are terms used throughout the specific paper.
It is not so risky to say that Edmundson's work (Edmundson 1969) set the trend for automatic extracting; all subsequent researchers refer to his work, and his heuristics have not been seriously expanded since then.
At that time, the only previous work on automatic extracting was Luhn's (Luhn 1959) system, ten years before, which only used high frequency words to calculate the sentence weights. Edmundson described and utilised cue phrases, titles and locational heuristics, tried all of their combinations and evaluated the computer-produced extracts with human-produced target extracts.
Edmundson carefully outlined the principles of human extracting in order to make his evaluation more consistent. For a sentence to be eligible for the target extract it was required to carry information of at least one of six types: subject matter, purpose, methods, conclusions or findings, generalisations or implications, and recommendations or suggestions. The final set of selected sentences also had to be coherent and their number should equal 25% of the sentences of the original text. An interesting note: headings were also included in these extracts.
The heuristics (clues) come from two separate sources:
the body and the skeleton of the text. They also
can be linguistic or structural. All the
possible combinations create four different methods:
Rationale of the four Edmundson methods
All these methods were tried singly and cumulatively and the best combination was chosen on the basis of the greater mean percentage of sentences common in the automatic extracts and the target extracts. The combination cue + title + location had the best mean co-selection measure (cue + title + location + frequency had the second best) and was later input to the next evaluation stage. The fact that the frequency heuristic did not improve the performance was a very important finding: (1) automatic extraction systems needed more sophisticated representations than single words, (2) the exclusion of word frequencies speeds up the system and reduces the storage requirements.
Two different evaluations (for the preferred method) were tried out:
The work of Kupiec et al. (Kupiec et al. 1995) is highly influenced by Edmundson (Edmundson 1969). Document extraction is viewed as a statistical classification problem, i.e. for every sentence, its score is the probability that it will be included in a summary.
The heuristics (called features in the paper) utilised were:
The probability, using Bayes' rule, that a sentence $s$ will be included in the summary $S$, given its features $F_1, \ldots, F_k$, is:
\[ P(s \in S \mid F_1, \ldots, F_k) = \frac{P(F_1, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, \ldots, F_k)} , \]
and assuming statistical independence of the features, this reduces to:
\[ P(s \in S \mid F_1, \ldots, F_k) = \frac{P(s \in S) \prod_{j=1}^{k} P(F_j \mid s \in S)}{\prod_{j=1}^{k} P(F_j)} . \]
$P(s \in S)$ is constant so it does not affect the ranking of sentences, and the two remaining quantities can be estimated directly by counting occurrences.
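As an illustrative sketch only, once the per-feature probability tables have been estimated by counting over a training corpus, the score of a sentence can be combined as below; the feature names and tables are assumed inputs, not part of the original system's code:

def summary_probability(features, p_in_summary, p_given_summary, p_feature):
    """Naive-Bayes style sentence score in the spirit of Kupiec et al.
    features: dict feature -> observed value for this sentence.
    p_given_summary / p_feature: dicts (feature, value) -> probability,
    estimated by counting occurrences in the training corpus."""
    score = p_in_summary                       # constant prior (compression rate)
    for feature, value in features.items():
        score *= p_given_summary[(feature, value)] / p_feature[(feature, value)]
    return score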
The project corpus consisted of 188 document/summary pairs; the summaries were manually generated by humans. 79.4% of the summary sentences were also found in the body of the texts (a direct match). For another 3.3% there were two or more sentences in the texts which effectively had the same content (a direct join). Therefore, the maximum precision rate for the system could reach the sum of these percentages, 83%, since the rest of the sentences in the summaries did not have any direct or indirect correspondent in the original texts. Kupiec calculated recall and precision scores both individually for each feature and for combinations. The conclusion supported paragraph + fixed-phrase + sentence-length, with a 42% recall.
The notion of direct matches and joins is a useful one when target extracts are not available. Furthermore, the results are in line with Edmundson's; both studies show a clear preference for certain features (location, cue phrases) and a drop in performance for word-based statistics. However, we have to consider two facts which somewhat decrease the significance of the results. First, the manual summaries had on average only three sentences. Second, the fraction of direct sentence matches found in the corpus was surprisingly high (for instance, in the Computational Linguistics corpus used to train the present project, sentences in the abstract rarely coincided with text sentences).
Salton et al.'s (1993) system, Smart, does not produce straight abstracts, but tries to identify sets of sentences (even whole sections or paragraphs) which represent the subject content of a paper. The report gives a brief introduction to sentence extracting, and states that retrieving passages is a step in the right direction towards better responses to user queries. The architecture of Smart is based on the vector processing model, i.e. every document is represented as a vector of weighted terms. The procedure to obtain the most relevant documents is as follows:
The effectiveness of the system is enhanced by considering ways to retrieve subsets of texts with the same qualifications as the previous texts, i.e. with a similarity metric above a certain threshold. In this way, sentences, sequences of adjacent sentences, sections and paragraphs are assigned a similarity metric (using the above-mentioned stages) and the ones with a high similarity can be considered as surrogates for the original full texts.
The evaluation is based on recall and precision figures, along with statistics that show the ratio of paragraphs per retrieved item, indicative of the achieved compression. The use of local similarity and the subsequent substitution of texts with text excerpts improves the efficiency of the system by 25%. Salton concludes that the best performing combination is sections+paragraphs, with on average 1.3 paragraphs representing each document.
Although not a strictly sentence extracting system, Smart deals with issues (word ambiguity, similarity between sentences) in the area of automatic extracting. Considering sequences of sentences instead of isolated sentences also decreases the effect of problems such as anaphora resolution and textual coherence.
NetSumm is a product of British Telecommunications and summarises WWW pages. It ``automatically highlights the sentences that it considers make up the "most important" part of the text''. BT has established a Web page (http://www.labs.bt.com/netsumm/pages/about.htm) to spread information about the program, and it allows you to submit a paper and get the extract. Unfortunately, the methodology is not explained, so it remains a secret (obviously for commercial reasons) how the extracts (abridgements) are produced.
NetSumm is a program running on Macintosh computers and through Netscape, thus using all the user-friendly facilities, like mouse, menus and buttons. It accepts ASCII and HTML formatted documents and it responds very quickly. It allows for some parametrisation of the proportion of condensation, and it can display the extract sentences alone, or the whole text with the extracted sentences highlighted.
In accordance with (Morris et al. 1992), BT assert that ``an abridgement of typically only 5% of the original article will contain roughly 70% of the information in an author-written abstract, while a 25% abridgement contains essentially all of the information in the abstract''.
Ono et al. (1994) developed a system that generates abstracts of Japanese expository writings based on the notion of rhetorical structures. It is basically a pattern matching technique which tries to find the structure of mostly argumentative texts. It also ensures that the output extract will be coherent, whatever the desired length.
Rhetorical structures represent deep semantic relations between various portions of a text, in two layers: intra-paragraph (between sentences) and inter-paragraph. These structures consist of rhetorical relations. Typical rhetorical relations (and the connectives which reflect the flow of the argument) include serial (X thus Y), summarisation (X after all Y), reason (X because Y), etc. Rhetorical structures are stored in binary trees (in the form connective(X,Y)), so there is a hierarchy of relations: the higher in the tree, the more general and important the relation.
The extraction procedure works as follows: the sentence is morphologically and syntactically parsed and then the basic rhetorical relations (like the ones referred to above) are detected between adjacent sentences. The segmentation module detects rhetorical relations between distant sentences and then the candidate generator produces all possible rhetorical structures. The final step is to assign penalty scores to each candidate. Penalty scores are based on preference rules for each pair of neighbouring relations (i.e. what is the most preferred structure for a given relation).
The generation of the abstract is accomplished by examining the best performing rhetorical structure of each section. Rhetorical relations are categorised according to which operand is the stronger: for example in summarising relation summarisation(X,Y), Y, the conclusion, is obviously more important than X. This relation is characterised as RightNucleus and there are two other types: LeftNucleus and BothNucleus relations.
The system evaluates the importance of all the sentences by imposing penalties to the weaker parts of the categorised relations. The nodes of the tree with the highest penalties (from bottom to top) are cut out and the remaining terminal nodes construct the extract.
Extracts of this type carry two important virtues. First, the relations and the connectives represent true relations and arguments of the original text. Secondly, they will not contain fragmentary sentences; all the sentences are there as parts of rhetorical relations. In addition it is easy to vary the length of an extract: just cut out more terminal nodes, the ones with the highest penalties.
The system was evaluated against 30 editorial articles and 42 technical papers. Recall figures were 51% and 74% respectively, and this difference was explained on the basis that technical papers have a more typical and expository style and contain more explicit clues to rhetorical relations.
In the following sections all the diagnostic units are reviewed in more detail.
The first and simplest method tries to utilise the basic units of a sentence: words. It assumes that the most frequent words found in the body of a text might be indicative of high content in the sentences they appear in. However, such a word must have another attribute: it must be infrequent in the rest of the corpus, because otherwise it is just a common word for the domain without any discriminating power. For example, the word linguistics will possibly appear very often in a computational linguistics corpus. In a sentence of the text we want to summarise, this word does not enhance its significance; anyway we expect a lot of sentences in such a corpus to contain the word linguistics. The same holds for the functional words in English, like the, and etc. They have very high frequencies and even distributions throughout the corpus, and their contribution to the content of the sentence is negligible.
It seems fairly logical that some (though certainly not all) of the key ideas of a paper will repeat throughout its sentences, but it cannot be easily concluded that:
and what's more:
Word-based statistics are the easiest to obtain because they do not require any advanced linguistic processing. This methodology was one of the first to be tried out ((Luhn 1959), (Rath et al. 1961), (Edmundson 1969)). One of its main advantages derives from a statistical law: the larger the collected statistics, the more reliable the method. Indeed, the highest frequencies will appear for this heuristic, because all the other frequency-based diagnostic units consist of more than one word and their chances of repeating are much lower than those of single words.
At the next level, the terms are taken to be pairs of words. The idea behind this concept is that the frequent co-occurrence of two particular (not necessarily adjacent) words in the same sentence could serve as a diagnostic unit. First, terms which consist of two words can be captured. Second, if a document contains relatively many sentences with two particular words, this must relate to the pair's importance (either as a specific term or as typical of the writer's style) for this document. Third, we are able to exploit context information; if a word tends to appear frequently with another one, their dependence will produce high-frequency word pairs and thus be valued accordingly.
A further elaboration of the method could be to select which parts of speech are allowed to form a word pair candidate. A problem with this approach derives from word ambiguity when considering adjacent words. For instance, consider the famous linguistic phrase: ``Time flies like an arrow''. The first three words here can each belong to more than one grammatical category, so it is not clear which grammatical combination this construct represents. If it appears in a later sentence with different categories, it will unfortunately be grouped with the previous occurrence.
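A sketch of how the word pairs of a sentence might be enumerated (unordered, not necessarily adjacent), assuming stoplist words have already been removed:

from itertools import combinations

def word_pairs(content_words):
    """All unordered pairs of (not necessarily adjacent) content words in a
    sentence; for a sentence of n content words this yields n*(n-1)/2 pairs."""
    return [tuple(sorted(pair)) for pair in combinations(content_words, 2)]

# word_pairs(["synchronous", "tree-adjoining", "grammars"]) ->
# [("synchronous", "tree-adjoining"), ("grammars", "synchronous"), ("grammars", "tree-adjoining")]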
It is typical when writing a paper for publication to use certain words or phrases to highlight sentences. It was Edmundson (Edmundson 1969) who first observed and used these heuristics for his project. Such cues as This paper ..., we propose ..., important are frequently indicative of a sentence worthy of summarisation. These diagnostic units are called cue phrases and bonus words. On the contrary, there are some negative words (or phrases), called stigma words, which decrease the significance of a sentence, like hardly, impossible, belittling expressions, etc.
Undoubtedly the title and the headers of a text provide a good amount of information about its content. By extracting the content words from titles and headers we can create a scoring formula which increases the score of sentences that contain these words. We can also put headers and titles directly into the extract thus creating a more coherent and structured output. Finally, the system can be guided to select at least one sentence from every major section, so given that this sentence is good, the extract is more likely to grasp all the basic subjects.
When abstracts (or even better extracts) are available with texts, the system can ``learn'' certain properties from them. We can extract words, noun phrases, syntax style, etc. from abstracts and then try to find similar sentences in the body of the text. In this way the output sentences bear more resemblance (in content and in format) with the human-produced abstracts. Other statistical properties which can be utilised include the average number of sentences in a summary along with the average length of summary sentences.
Linguistic information can also prove useful, on the basis of looking for strings of words that form a syntactic structure. Extending the idea of high frequency words, we can assume that noun phrases form more meaningful concepts, thus getting closer to the idea of terms. This overcomes several problems of the first single-word method because it can utilise:
There is a possibility, though, that one term can be ``implemented'' by more than one noun phrase; compare for example information overload with overload of information.
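In the implemented system the noun phrases are taken from the <NG> groups produced by the LTG tagger (described in the corpus processing chapter). Purely as an illustration, a crude approximation can be obtained by collecting runs of determiners, adjectives and nouns over POS-tagged tokens; the tag set below simply mirrors the tagged example shown later and is not the system's actual grammar:

def noun_phrases(tagged_tokens):
    """tagged_tokens: list of (word, POS) pairs. Collects maximal runs of
    determiner/adjective/noun tags that end in a noun -- a rough stand-in
    for the <NG> groups produced by a real chunker."""
    np_tags = {"AT", "JJ", "NN", "NNS", "VBG"}   # illustrative tag set
    phrases, current = [], []
    for word, tag in tagged_tokens:
        if tag in np_tags:
            current.append((word, tag))
        else:
            if current and current[-1][1] in {"NN", "NNS"}:
                phrases.append(" ".join(w for w, _ in current))
            current = []
    if current and current[-1][1] in {"NN", "NNS"}:
        phrases.append(" ".join(w for w, _ in current))
    return phrases

# noun_phrases([("the", "AT"), ("use", "NN"), ("of", "IN"), ("TAGs", "NNS")])
# -> ["the use", "TAGs"]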
The position of a sentence within a paragraph (and the position of a paragraph within a text as well) can sometimes be indicative of its significance. This happens usually with the first sentence of a paragraph which supposedly gives a very good clue of what is going to follow. Another similar assertion is that a good proportion of important information about a paper is to be found in the first and last paragraphs (they indeed often contain introductory and conclusive material), so their sentences should be favoured more than the sentences of the other paragraphs.
The format of an article can further aid the searching for summarising sentences. Writers usually put important phrases in bold or italic font, capitalise or underline them if they want to give a certain emphasis.
Along with the diagnostic units we can further parameterise the system and experiment with varying values in order to find optimum combinations.
Morris et al. (Morris et al. 1992) postulated that about 20% of the sentences in a text could express all its basic ideas. Since abstracts are much shorter than this proportion, the length of extracts should lie between the length of an abstract and Morris's percentage.
There are five ways to vary the number of sentences for an extract:
It can be suggested that too short or too long sentences are not generally ideal for an abstract, and therefore for an extract as well. This is usually referred to (see (Kupiec et al. 1995)) as the sentence cut-off feature. It penalises short (fewer than 5-6 words) and long sentences, either by reducing their score or even by excluding them completely.
Language in general contains much redundancy. There are many words that carry no meaning but are necessary to construct grammatical sentences. These words include prepositions, pronouns, articles, connectives, etc. and are called function words. It follows that the function words should not be included in any statistics and scoring formulas, because they do not contribute to the relevance and importance of a sentence. For the present system I used an enhanced list that contains, along with the traditional function words, some others which intuitively do not affect the summary-worthiness of a sentence, e.g. following, first, auxiliary verbs, etc.
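A sketch of how the stoplist and the sentence cut-off feature of the previous section might be applied before scoring; the word list and the length thresholds here are illustrative only, not the actual enhanced list used by the system:

STOPLIST = {"the", "and", "of", "to", "a", "in", "following", "first"}   # illustrative subset

def content_words(sentence_words, min_len=5, max_len=60):
    """Returns the content words of a sentence, or None if the sentence is
    too short or too long to be considered for extraction at all."""
    if not (min_len <= len(sentence_words) <= max_len):
        return None                                   # sentence cut-off feature
    return [w.lower() for w in sentence_words if w.lower() not in STOPLIST]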
Can words of the same morphological root be considered identical terms? Especially in automatic indexing, it is a classical technique to use word stems as indices. It speeds up the procedure, reduces table sizes and supposedly enhances the performance of the system. However, there is no clear evidence that words with the same stem but with different suffixes (and possibly different grammatical categories) must be grouped. A difference in morphology sometimes entails a semantic difference, even a subtle one. Moreover, the word ambiguity problem is considerably enlarged as fewer distinct words exist. Finally, the quality of the stemming program has a strong impact on the usefulness of the technique. (Strzalkowski 1992) reports a 6-8% improvement in precision from incorporating morphological information. Similar points can be made about capitalised and non-capitalised words. Some distinguish between, for example, Grammar and grammar; others group them into one word. No approach is definitely better than the other; it is rather a matter of preference.
The nature and characteristics of the corpus have a strong impact on the later stages of work. It is of great importance to design beforehand the overall structure and appearance of the corpus, the ratio of automatic/manual editing required and adapt it to the annotation and processing tools which will be used afterwards.
The corpus processing was carried out together with Simone Teuffel, a PhD student in the Centre for Cognitive Science, who also did a summer project on automatic summarising.
Statistical NLP requires a large amount of texts to guarantee the reliability of the gathered statistics. These texts must also be in electronic form and belong to the same domain, since we want to locate high content diagnostic units specifically for this domain. For the task in hand, at least the files that will later comprise the test set should contain an abstract (or even better an extract) to serve as a target for evaluation. Finally, they must be files of the same format, so the same processing programs can be applied to all the texts and output a uniform and coherent corpus.
After an extensive search of the World Wide Web (WWW), the only group of texts satisfying all these criteria was a collection of papers on Computational Linguistics, all of them LaTeX files. Although enormous amounts of texts, in several fields, are available online, they are available only in the form of PostScript files. PostScript files are by nature intended for display on a graphic screen; it is extremely difficult, if not impossible, to reconstruct them as text.
The corpus consists of papers downloaded from the Computation and Language E-Print Archive ( http://xxx.lanl.gov/cmp-lg) which contains papers submitted from June 1994 until May 1996. All the papers downloaded were LaTeX files and contained small abstracts. The topics included: computational linguistics, natural-language processing, speech processing, and related fields.
After the LaTeX files were downloaded they had to pass several processing stages in order to be usable by the Language Technology Group (LTG, University of Edinburgh) tools. LTG programs require the files to be in SGML format (see below).
SGML (Standard Generalized Markup Language, (Goldfarb 1990)) is an international standard language for describing marked-up electronic documents.
In this language we list and describe all the allowable entities in a text (e.g. titles, paragraphs, sentences, words etc.). The structuring of the entities is hierarchical and is represented in a tree; there is a root entity and all the others have to belong to a parental node.
The major advantage of SGML for our approach is that it allows for different representations of a text to be constructed in a uniform way. The software that runs on the corpus can easily access whatever part of the text, the only required information is the name of the entity which corresponds to a certain diagnostic unit. It does not need to know how it was made or where it is placed.
Every entity (also called structure, or
tag) has a name and a description, stored in a document
type definition (DTD). This file is consulted when marking up
documents. For a
tag T its start and end are denoted by <T> and
</T> respectively. For example if we have a
sentence:
This is a test sentence
and if only sentences (tag: S) and words (tag: W) are used for markup, the SGML equivalent is:
<S><W>This</W> <W>is</W> <W>a</W> <W>test</W> <W>sentence</W></S>
The structure of the sentence is (assuming TEXT is the
root element):
If we would also like to annotate noun phrases, the sentence would look like this:
<S><W>This</W> <W>is</W> <NG><W>a</W> <W>test</W> <W>sentence</W></NG></S>
It is obvious that every new representation can be easily incorporated in the same text. This SGML sentence contains now two different kinds of diagnostic units.
The specifications set by the DTD file must be strictly satisfied. Besides the imposed hierarchy (for example, a <W> must appear only enclosed in an <S>), tags of the same category cannot be nested (e.g. sentences within sentences are not permitted).
LT NSL is a development environment for SGML-based corpus and document processing, with support for multiple versions and multiple levels of annotation. It consists of a C-based API for accessing and manipulating SGML documents and an integrated set of SGML tools.
LT NSL belongs to a greater group of tools, the LTG corpus workbench, which also includes tools for tagging, tokenizing, frequency counting, etc.
The LTG tools work with SGML queries, which are requests for specific tags inside the documents. After the tag is retrieved it can further be processed.
The complexity of LaTeX files and the diversity of environments that different writers use had to be reduced before feeding the next module. We undertook some automatic deletions and alterations so that the files could be more easily and quickly converted to HTML.
Nikos Drakos' (Computer Based Learning Unit, University of
Leeds) LaTeX2HTML program was used to convert the LaTeX
files to HTML format
LaTeX2HTML is a powerful program, that basically tries to
translate as many LaTeX commands as possible, including
figures, equations and even user defined
environments. Since the summarisation program does not need this
type of information, the converter is configured only for
text. The command is :
latex2html -info 0 -nolatex -no <LaTeX>
The options before the filename tell LaTeX2HTML to avoid certain conversions of LaTeX commands.
Several LaTeX-oriented tags (like images and equations) whose presence and content do not have a summarising value were replaced by simple indicator tags. In this way the system did not count them as diagnostic units but the reader of the text was able to know where an equation or an image was originally placed, although the actual content of the equation is not of course reproduced. These tags were:
Table of Indicator TAGS
The diversity of papers necessitated some further editing of files. Among the most frequent corrections were:
The HTML output file contains a lot of sophisticated tags which are not necessary for the subsequent processing of the texts. It also does not include end-of-paragraph markers, which are a requirement for SGML (every tag must end). At the end of processing, the only tags that should be present in the texts are:
and the hierarchical organisation of the text is:
The conversion is accomplished with several Perl scripts.
Even after the running of the previous script there were some files which needed extra editing because of:
The corpus so far is paragraph-marked, but we also need sentence, word and noun-phrase marking in order to count their frequencies. For assigning grammatical categories to the texts, I used the LTG tagger, a Hidden Markov Model based tagger with a correctness rate of up to 96%. The inserted tags were <S>, <W> and <NG>, referring to sentences, words and noun groups (the LTG term for noun phrases) respectively.
Here is a sentence before being tagged:
The formalism of synchronous tree-adjoining grammars [[#REF]], a variant of standard tree-adjoining grammars (TAG), was intended to allow the use of TAGs for language transduction in addition to language specification.
and after tagging:
<S ID="1"><NG> <W C="AT">The</W> <W C="NN">formalism</W></NG> <W C="IN">of</W><NG> <W C="JJ">synchronous</W> <W C="VBG">tree-adjoining</W> <W C="NNS">grammars</W></NG> <D>[</D><D>[</D> <W C="CD">#REF</W><D>]</D><D>]</D><CM>,</CM> <NG> <W C="AT">a</W> <W C="NN">variant</W></NG> <W C="IN">of</W><NG> <W C="JJ">standard</W> <W C="VBG">tree-adjoining</W> <W C="NNS">grammars</W></NG> <D>(</D><NG> <W C="NN">TAG</W></NG> <D>)</D> <CM>,</CM> <W C="BEDZ">was</W> <W C="VBN">intended</W> <W C="TO">to</W> <W C="VB">allow</W><NG> <W C="AT">the</W> <W C="NN">use</W></NG> <W C="IN">of</W> <NG><W C="NNS">TAGs</W></NG> <W C="IN">for</W> <NG><W C="NN">language</W> <W C="NN">transduction</W></NG> <W C="IN">in</W> <NG><W C="NN">addition</W></NG> <W C="IN">to</W><NG><W C="NN">language</W> <W C="NN">specification</W></NG> <EOS>.</EOS> </S>
Explanations
When this stage is finished, the corpus is ready for frequency counting.
Here are some indicative statistics concerning the corpus:
and the abstracts (for all the papers):
The basic schema under which the system tries to refine itself is that the same statistical model (a slight variation of TFIDF) will be tested on various representations of the text. We begin with single word representation and use TFIDF weighting to score the sentences and produce extracts.
In the next level, we attempt to improve the performance by adding more sophisticated representations (i.e. diagnostic units). First word pairs inside a sentence are used, a diagnostic unit which takes into account the context of words and therefore reduces the ambiguity problem present in the previous method. Second we take the notion of 'term' to the level of 'noun phrase' so as to minimise word ambiguity and capture terms beyond one single word.
Finally we try all the methods in all possible combinations. This produces 3 single methods, 3 paired methods (words + pairs, words + NPs, NPs + pairs) and 1 triple (words + pairs + NPs).
It is of high importance that the model is independent of the representations because it allows for more representations to be tried out without altering the model and generally the philosophy of the system.
The performance results will show (1) which combination produces the most satisfactory extract and (2) which representations the TFIDF statistical model can exploit.
The statistical model used to determine the diagnostic unit weights is a TFIDF-like formula. Instead of the inverse document frequency, the system uses the inverse total frequency in the corpus (hence the name TFICF: Term-Frequency Inverse-Corpus-Frequency):
\[ w_t = \min\!\left(1, \; \frac{\log f_{t,d}}{\log F_t}\right) , \]
where $f_{t,d}$ is the frequency of term $t$ in the input document and $F_t$ is its total frequency in the training corpus.
The ratio of frequencies in a document over the whole corpus has the effect of favouring terms with a high frequency in the document but rare in the rest of the corpus, while it is moderate with words that have the same frequency over both the document and the corpus. The scoring function is implemented in such a way that no term gets a weight of more than 1, even if the term frequency is bigger than the corpus frequency. This adjustment was made to avoid extremely large scores for terms appearing only in the input text. Suppose for instance that a word appeared 5 times in a text and twice in the training corpus. Then it would have a weight of log(5)/log(2)=2.32, much above the normal weights. Whether such a very high score should be kept is, however, a matter of preference; maybe such a term is so valuable that it deserves such a score. I personally favoured the normalised weights for two reasons. First, I found many similar cases, not only because these terms were specific and important for the text but also because the small size of the corpus allowed for many zero-frequency terms. Second, the problem was magnified for pairs and noun phrases, since these diagnostic units had far fewer appearances.
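A minimal sketch of the capped weight just described; the handling of terms with document or corpus frequency of 0 or 1 is my assumption, as the original treatment of those boundary cases is not spelled out here:

import math

def tficf_weight(freq_in_doc, freq_in_corpus):
    """log(frequency in the document) over log(frequency in the training
    corpus), capped at 1 so that very corpus-rare terms do not dominate."""
    if freq_in_doc <= 1:
        return 0.0                              # log(1) = 0: singletons contribute nothing (assumption)
    if freq_in_corpus <= 1:
        return 1.0                              # term (almost) unseen in the corpus: capped weight (assumption)
    return min(1.0, math.log(freq_in_doc) / math.log(freq_in_corpus))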
There are basically two steps involved in the application of each one of the methods: (1) preparation of the corpus statistics and (2) the sentence scoring and sorting. The statistics (i.e. frequencies of diagnostic units) are collected from all the training documents beforehand. After this, the only statistics needed to be collected are the input text frequencies.
The system needs to be trained by gathering statistics from the training documents. The procedure is as follows:
After the frequency table is constructed, the sentences of the test document are scored in the following way:
\[ \mathrm{Score}(s) = \frac{1}{N} \sum_{j=1}^{N} \frac{\log f_{j,d}}{\log F_j} , \]
where $f_{j,d}$ is the frequency of the $j$-th word of the sentence over the document, $F_j$ is the frequency of the $j$-th word of the sentence over the corpus and $N$ is the number of content words in the sentence (i.e. its length minus the stoplist words); each term of the sum is capped at 1 as described above.
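Putting the pieces together, a sketch of the scoring and extraction step might look as follows; the frequency dictionaries are assumed to have been built during training, the per-term cap is applied as described above, and the default of 10 extracted sentences matches the experiments reported later:

import math

def sentence_score(words, doc_freq, corpus_freq):
    """Average (capped) TFICF weight over the content words of a sentence."""
    if not words:
        return 0.0
    total = 0.0
    for w in words:
        fd, fc = doc_freq.get(w, 0), corpus_freq.get(w, 0)
        if fd > 1 and fc > 1:
            total += min(1.0, math.log(fd) / math.log(fc))
        elif fd > 1:                     # corpus-rare term: capped weight (assumption)
            total += 1.0
    return total / len(words)

def extract(sentences, doc_freq, corpus_freq, k=10):
    """sentences: list of lists of content words; returns the indices of the
    k highest-scoring sentences, restored to document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sentence_score(sentences[i], doc_freq, corpus_freq),
                    reverse=True)
    return sorted(ranked[:k])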
In the same way as the stoplist words for the first method, there can be certain pairs with one or even two insignificant words. These pairs are excluded from the further process, which also speeds up the system.
For every sentence of length n, there are n(n-1)/2 possible word pairs. This is an extremely large number which eventually results in a huge frequency table. In order to be able to store and access the table quickly, a decision was made to cut off pairs with very low frequency, assuming that their absence cannot affect the performance of the system.
Again two steps, preparation and computation are required.
Basically the same formula is used to compute the sentence scores, with word-pair frequencies in place of single-word frequencies.
The third method explores the possibility of syntactic structures improving the system's performance. Within the notion of looking for terms, it is a logical next step to go for noun phrases after words and pairs.
To utilise syntactic information, it is obvious that the text needs to be parsed, so the output noun phrases rely on the quality of the parser and the noun phrase grammar specification. It is very useful to be able to configure the NP grammar because complicated structures can be simplified or rejected. For instance, if a prepositional phrase modifies the head noun phrase, the former can be deleted so the real diagnostic unit is only the main noun phrase. In addition, we could only select the last two or three words of a noun phrase as the diagnostic unit, thus eliminating adverbs and adjectives which are far away from the noun, etc.
The only difference in the algorithm for collecting statistics has to do with preparing the statistics:
The extracting algorithm is essentially the same. Noun phrases with frequency of 1 are not used in scoring and sentences with less than 2 noun phrases are also excluded.
After the heuristics are tried separately, they are tried cumulatively in search for the optimum combination. The scoring formula is:
\[ \mathrm{Score}(s_i) = \sum_{k=1}^{K} \frac{1}{n_i} \sum_{j=1}^{n_i} c_j \, w_{ij} , \]
where K is the number of heuristics combined, $n_i$ is the number of diagnostic units of sentence $i$ (for the heuristic in question), $c_j$ is the contributing weight of the $j$-th diagnostic unit and $w_{ij}$ is the weight for the occurrence of that diagnostic unit in the $i$-th sentence.
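A sketch of the cumulative scoring, assuming each single-method score has already been normalised to lie in [0, 1] and that the method weights are set by hand (equal weights were assumed in the experiments reported below):

def combined_score(per_method_scores, method_weights=None):
    """per_method_scores: one normalised score per heuristic for the same
    sentence; method_weights default to equal weighting."""
    if method_weights is None:
        method_weights = [1.0] * len(per_method_scores)
    return sum(w * s for w, s in zip(method_weights, per_method_scores))

# e.g. combined_score([0.66, 0.99, 0.92]) for a sentence scored by the
# words, pairs and noun-phrase methods respectively.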
Evaluation for the present project consisted of running the extraction program on the 5 test texts for all the methods (including all their combinations) and then subjectively judging the quality of all the output extracts.
All the extracts had 10 sentences, a practical value given the average length of the test papers (6 pages) and the average number of sentences in the abstracts of the training corpus (4.75 sentences).
This judgement was performed as follows:
Table 1: Sentence ratings
Generally, if the same sentence appeared in extracts produced by different methods, it got the same score. However, if it generated a better relevance within the context of the extract its score was increased.
The points of the sentences for every extract are added together; the sum represents the total quality.
For every method the scores of the texts are also summed up to produce a quality measure for a specific method.
The following tables show the total scores for papers and
methods (W stands for words, P for pairs, N for noun
phrases):
Table 2: Subjective quality results for extracts produced for the 5 test papers.
A first look at the scores shows a consistency of methods across papers, i.e. the scores for a given paper stayed within roughly the same bounds for all the methods. Total scores on the horizontal axis seem to fall into three categories. Papers 1 and 5 got very high scores, many of them five-point sentences (i.e. perfect for extracting). Paper 4 had a moderate score and papers 2 and 3 scored very badly. This was mainly due to the nature of the texts. Papers 2 and 3 had a lot of example sentences with many proper names (one paper in fact had Japanese examples). These names, which did not appear at all in the training corpus, were treated as high-value terms and achieved impressive scores without the sentence having any other content words. As a result we can say that texts with many example sentences, fragments of programming code, etc. were not effectively summarised by our frequency-count model.
Another observation made by looking at what sentences were extracted is the co-selection of certain sentences by different methods. The following table shows for every paper how many sentences were co-selected by more than one method (starting from 7, i.e. selected by all methods):
Table 3: Number of co-selected sentences by all methods.
This table demonstrates a certain regularity. We expect the number of co-selected sentences to be lowest in the first column and then to increase gradually. This is indeed what happens, except for a gap in the sentences selected by 3 methods. If we count the number of distinct sentences appearing in the extracts by removing all repetitions (i.e. sentences selected by more than one method), we obtain a measure of the overlap between the methods. The total number of sentences in all the extracts is 5 papers * 10 sentences * 7 methods = 350. 205 of them were repetitions, so 350-205=145 were distinct sentences. Of these 145, 80 appear more than once (80 is the sum of the last row of Table 3) and the remaining 65 appear exactly once. Hence the overlap is 80/145 = 55.1%, which shows that more than half of the distinct sentences were selected by more than one method.
In combinations, methods were assumed to be equally weighted (see chapter 6, Description of the system), but in practice words were weighted slightly less, for the following reason: the highest scoring sentences under the single-word method (scored by TFIDF, not by my subjective evaluation) did not usually reach the maximum score, i.e. 1, whereas sentence scores for pairs and noun phrases reached the maximum much more often. For the extracted sentences the average sentence score, computed with TFICF, was 0.66 for words, 0.99 for pairs and 0.92 for noun phrases. Therefore, when the scores of the individual methods were added, the contribution of words was on average 0.66/0.99 = 66% of the contribution of pairs and about 71% (0.66/0.92) of the contribution of noun phrases. The reason why the average scores for pairs and NPs were so high lies in the scoring formula: units for these methods had far fewer occurrences in the corpus, so it was easier for a sentence to contain 2-3 units with a weight of 1, giving a total sentence score of 1 again (these methods also had a smaller threshold of diagnostic units, see again chapter 6).
The fact that the word-based method did not contribute equally in the compound heuristics is welcome (though not originally planned); since words have the poorest performance, it is desirable to decrease their contribution.
The scores over methods now show that the NP method had by far the best performance (82 points in total, 7 points more than the runner-up, pairs+words). Noun phrases also had consistently high scores in every paper, being the top-scoring method for 3 papers (nos. 2, 3 and 4). Another observation in favour of the NP heuristic is that the methods ranked 3rd and 4th are both combinations that include NPs.
On the other hand, words and pairs seem to perform similarly. However, I would say that pairs are in fact a better predictor, because they significantly outperform the single-word method in 3 out of 5 papers; the 2 remaining papers which show a preference for words are the two very high scoring ones. In these 2 papers (the ``easier'' ones to summarise), the uniformly high scores make the raw differences between methods less informative. If we normalise the scores of every method by dividing by the total score for every paper we get:
Table 4: Normalised performance of all 7 methods.
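For clarity, the normalisation used in Table 4 can be written as follows (assuming, as a reading of the text above, that the per-paper total is the sum of the scores of all seven methods on that paper):
\[
\tilde{s}_{m,p} \;=\; \frac{s_{m,p}}{\sum_{m'} s_{m',p}}
\]
where $s_{m,p}$ is the total score of method $m$ on paper $p$.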
This viewpoint again confirms the dominance of the NP method, but this time word pairs are clearly the second best method, while the single-word method is the poorest. Combinations fall between pairs and words. If a combination includes the word method, it always scores lower than the same combination without words.
A comparison of the program's extracts with those produced by the BT NetSumm program did not yield any reliable conclusions. The proportion of condensation for NetSumm was usually more than 15%, while it was less than 10% for the system extracts. The mean co-selection of sentences is only 16%, but this is not a comparable measure given that we do not know how NetSumm extracts are produced. However, the NetSumm passages are more coherent and do not contain example sentences.
The following table shows how many sentences were co-selected (1) for the NP method and (2) for the best performing method for the specific paper (see Table 1). A star (*) indicates that for that paper NP was also the best performing method:
Table 5: Co-selection of sentences between the program and BT's NetSumm.
The final conclusions drawn from the evaluation are:
The system can be extended in many ways, from heuristics to statistical models. Here are some suggestions:
We have described and developed a system for automatic summarisation of texts based on sentence extraction. The system finds the most significant sentences of a text by computing a relevance-score for every sentence. This score is based on the statistical regularities of words, word pairs and noun phrases in (1) the input text and (2) a collection of training texts (800,000 words) all in the general domain of computational linguistics. Output extracts were produced for 5 test papers, for all the three methods and their possible combinations. Evaluation scores were subjectively assigned to the extracts in order to find the method which produces the most satisfactory extract. Results from the evaluation showed a clear preference for noun phrases, good performance for pairs and poor performance for words.
The system was developed in C++ under the Unix operating system. There are two basic programs: create_dus, which creates the necessary representations, and docsum, which does the extracting.
Syntax:
cat <sgml_file> | create_dus [-p] [-n]
where <sgml_file> is the input file, marked up at least for words and sentences. Option -p creates all the word pairs in a sentence and -n all the noun phrases. These entries are stored under the <REP> tag, which is separate from <BODY>. All the entries have a <W> tag for uniformity, but an attribute T (for type) takes the value "p" for pairs and "n" for noun phrases. Here is an example for the phrase ``An automatic extraction program'' with both pairs and noun phrases (word pairs containing at least one stoplist word are of course excluded):
<W T="n">automatic_extraction_program</W> <W T="p">automatic_extraction</W> <W T="p">automatic_program</W> <W T="p">extraction_program</W>
Notes: There is one noun phrase here and three pairs. For the NP, the article an was not included because an NP may contain at most three words. The same article does not appear in any pair either, because it belongs to the stoplist.
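A sketch of the pair-generation step illustrated above is given below (hypothetical code, not the actual create_dus implementation); it produces exactly the three pairs of the example when the stoplist contains the article:

#include <set>
#include <string>
#include <vector>

// Generate all word pairs of a sentence (in left-to-right order),
// skipping any pair that contains a stoplist word.
std::vector<std::string>
word_pairs(const std::vector<std::string>& words,
           const std::set<std::string>& stoplist)
{
    std::vector<std::string> pairs;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (stoplist.count(words[i])) continue;
        for (std::size_t j = i + 1; j < words.size(); ++j) {
            if (stoplist.count(words[j])) continue;
            pairs.push_back(words[i] + "_" + words[j]);
        }
    }
    return pairs;
}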
After the creation of the representation of the text, and given the frequency tables, docsum produces the extracted sentences:
cat <sgml_file> | docsum [-m <method(s)>] [-p <condensation>] [-q <query>] [-u <unit_query>] [-b <body_query>] [-h] [-f <frequency_table>]
All the options have default values, set in particular for the single-word method. Here are further explanations of these options:
sgml:
It is the input file, marked up at least for words and sentences.
methods:
Indicates the diagnostic unit(s) to be used: w for words, p for pairs and n for noun phrases. For combinations the corresponding letters are simply concatenated; for example wn tries words + noun phrases (default: w).
condensation:
Corresponds to the length of the extract. If it is less than 1 it indicates the proportion of the sentences that will be extracted (for example 0.1 means extract 10%). If the value is more than 1 it indicates a fixed number of sentences (default: 0.1).
query:
This value represents the SGML tag where the diagnostic units are stored (default: <BODY>, for the single-word method).
unit:
Units are the SGML entities which are scored and extracted (default: sentences, <S>).
body:
The SGML entity which contains text marked up only for words and not for other representations. It is the tag used to access the sentences for displaying them (default: <BODY>).
frequency:
It provides the filenames of the look-up frequency tables. For simplicity it is assumed that the files for the different diagnostic units have the same root filename but different extensions. The default values are total.frq, total.ng.frq and total.wp.frq for words, NPs and pairs respectively. The filename given with this option corresponds to the single-word table; it is then split into root and extension, and .ng.frq and .wp.frq are appended to the root in order to construct the remaining filenames.
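For illustration only, a run combining words and noun phrases might look as follows (paper.sgml is a hypothetical input file; whether the -q, -u and -b tags need to be set explicitly depends on how the representation was created):

cat paper.sgml | create_dus -p -n > paper.rep.sgml
cat paper.rep.sgml | docsum -m wn -p 0.1 -f total.frq

With -f total.frq the program will also look for total.ng.frq and total.wp.frq, constructed as described above.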
The output of the program is an SGML document with all the non-qualifying sentences removed. It still contains paragraph markers, headers, etc., so a perl script was created to make the output look more natural. See Appendix 1 for ``clear'' results.