MSc in Speech and Language Processing Dissertation : Automatic summarising based on sentence extraction: A statistical approach

Byron Georgantopoulos
Department of Linguistics
University of Edinburgh

Abstract:

The present dissertation and project describes a system for automatic summarising of texts. Instead of generating abstracts, a hard NLP task of questionable effectiveness, the system tries to identify the most important sentences of the original text, thus producing an extract.
The proposed, corpus-based and statistical approach exploits several heuristics to determine the summary-worthiness of sentences. It actually uses statistical appearances of words, word-pairs and noun phrases to calculate sentence weights and then extract the highest scoring sentences. The statistical model used in the scoring function is a slight variation of the Term-Frequency Inverse-Document-Frequency (TFIDF) term weighting formula.
The results obtained by application of the system to 5 test texts, separately and cumulatively for the three afore-mentioned heuristics were subjectively judged for their quality. This evaluation showed that the noun phrases, as a separate method, produce the best extracts and that they also are parts of the best performing combinations.

Acknowledgements

I would like to thank my supervisors, Andrei Mikheev and Jim Hurford. Andrei was very helpful in the technical-programming issues as well as the general statistical NLP techniques. Jim provided the ``linguistic'' viewpoint and clever suggestions.

I would also like to thank the Greek State Scholarships Foundation, for providing the funding, without which I would not have been able to do this MSc.

Acknowledgements
Introduction
References
Automatic abstracting
Outline of standard statistical NLP techniques
- Scoring the sentences
  - Determining weights
- Evaluation
Methods and Heuristics in automatic extracting
Description of the Corpus
Description of the system
Results and Evaluation
User manual
- create
- docsum
  - Syntax:
  - Output

Introduction

The number of papers that are published today is ever increasing, while the Internet burst has created vast libraries of machine-readable texts. For one to keep informed and updated about the recent advances in his field has become now both vital and difficult. Since it is impossible to read through all the papers which are published nowadays, it is of great help to present them in a condensed way, i.e. using abstracts which summarise the content of an article. In this way the reader can get quickly a general idea about the article and decide if it is interesting enough to read it all.

However, the work of abstracting a document is far from being easy. It requires skilled and specialised abstractors and of course it takes plenty of time. Only after an abstract is prepared, can the paper be published, which results to an unavoidable and irritating delay. In the same way as machine translation, linguists and computer scientists have tried in recent years to substitute and/or aid human with machine abstractors. Automatic (or computer-based) abstracting, which started back in the late fifties, has shown considerable progress, has created several techniques and theories and has even produced some commercial software.

This thesis focuses on the production of a special kind of abstracts, called extracts: sets of sentences taken from the original text. These sentences are selected on the basis of the amount of information they carry about the subject content. Using statistical tools, the system ``learns'' from a corpus of papers which elements of a sentence make it important enough to function as a highly representative sentence. There are several such elements explored in the existing bibliography; the system utilises three of them: words, word pairs and noun phrases, both as separate units and cumulatively, (by combining two or all of them). It finally attempts to find the best combination, i.e. the one which produces the most satisfying set of extracted sentences.

Abstracts: Nature and characteristics

There is no generally agreed definition for abstracts although several efforts have been made to grasp every possible interpretation. Without trying to be perfectly exact, we can say that an abstract is a condensed representation of the content of a text. An abstract serves as a representative of a text, but much reduced in size. However, this reduction should entail minimal loss in content.

A successful abstract must have certain attributes. It must be brief in style and size, accurate (not under- or over- informative) and have clarity in expression. These virtues have been well documented and require educated professors to apply them. A similarity can be drawn with the translators: both have to transform a given text, trying to keep faithful to it at the same time.

Uses of abstracts

Abstracts support the transfer of knowledge and communication within the academic and scientific community. They ``inherit'' all the purposes of paper publishing, but it is for their brevity that they:

Can keep the reader aware of virtually all the new developments and findings in his area
Reduce the time spent on an article and thus increase the number of papers of which one can have an idea of their basic conclusions
Help select the articles which are worth full reading
Require of the reader to know only the language used in the abstract journal, although the original paper could be written in any language
Facilitate literature reviews because it is easier to browse through abstracts than whole texts
Speed up document classifying by using abstracts as the input to the classification procedure

Types of abstracts

There are basically two types of abstracts: the Indicative abstract which gives a brief outline of the basic concepts and conclusions of the paper and the Informative abstract which is usually longer and includes results, figures etc.

Depending on the author, abstracts can be author prepared, subject-expert prepared or professionals prepared. Author prepared abstracts suffer a criticism for bias and subjectivity; experts are probably the best choice, while professionals are competent but will not produce high quality work if they have little knowledge of the subject matter.

A final distinction is made between human-produced and computer-produced summaries. In the rest of the dissertation by summary I will refer to summaries (or abstracts or any other form of compression) made by humans.

References

-: Borko, H. Bernier, C. L. (1975), Abstracting concepts and methods, New York : Academic Press.
-: Dunning, T. (1993), `Accurate methods for the statistics of surprise and coincidence', Association for Computational Linguistics.
-: Edmundson, H. P. (1969), `New methods in automatic extracting', Journal of the Association for Computing Machinery 16(2), 264--285.
-: Finch, S. Moens, M. (1995), Sista final techical report, Technical report, Human Communications Research Centre, Language Technology Group, University of Edinburgh.
-: Goldfard, C. (1990), The SGML handbook, Oxford University Press. New York.
-: Jones, K. S. (1993), `What might be in a summary?', Information Retrieval pp. 9--26.
-: Jones, P. A. Paice, C. D. (1993), `A 'select and generate' approach to automatic abstracting'.
-: Kupiec, J., Pedersen, J. Chen, F. (1995), A trainable document summarizer, SIGIR '95, pp. 68--73.
-: Luhn, H. P. (1959), `The automatic creation of literature abstracts', IBM J. Res. Develop. 2(2), 159--165.
-: McKeown, K. Radev, D. R. (1995), `Generating summaries of multiple news articles', SIGIR '95 pp. 74--82.
-: Mikheev, A. Finch, S. (1995), Towards a workbench for acquisition of domain knowledge from natural language, in `Proceedings of the 7th Conference of the European Chapter of the Association for Computational Linguistics.'.
-: Morris, A. H., Kasper, G. M. Adams, D. A. (1992), `The effects and limitations of automated text condensing on reading comprehension performance', Information Systems Research 26, 17--35.
-: Paice, C. D. (1990), `Constructing literature abstracts by computer: Techniques and prospects', Information Processing Management 26(1), 171--186.
-: Rath, G. J., Resnick, A. Savage, R. (1961), `The formation of abstracts by the selection of sentences: Part 1: sentence selection by men and machines', American Documentation 12(2), 139--141.
-: Robertson, S. E. Jones, K. S. (1994), Simple, proven approaches to text retrieval.
-: Salton, G. (1989), Automatic text processing : the transformation, analysis, and retrieval of information by computer, Reading, Mass. Wokingham : Addison-Wesley.
-: Salton, G., Allan, J. Buckley, C. (1993), Approaches to passage retrieval in full text information systems, Technical report, Department of Computer Science, Cornell University.
-: Strzalkowski, T. (1992), Information retrieval using robust natural language processing, in `Speech and Natural Language. Proceedings of a workshop held at Harrimon, NY', pp. 206--211.
-: Thompson, H. S. McKelvie, D. (1996), `A software architecture for simple, efficient sgml applications. the lt nsl software library.', LTG Presentations at SGML-Europe'96. Language Technology Group, Human Communication Research Centre, Edinburgh University.
-: Zechner, K. (1995), Automatic text abstracting by selecting relevant passages, Master's thesis, Centre for Cognitive Science, University of Edinburgh.

#1#

Automatic abstracting

Introduction

As Machine Translation started to show very promising results in the fifties, scientists felt confident to put their hands to every other NLP problem, including of course automatic abstracting. With the cumulating of vast amounts of texts it was of great help to condense published papers via computer programs. (Luhn 1959) created the first abstracting system based on word-frequency statistics. Nowadays, depending on the processing tools, on the available corpus and the research motives, one can choose between pure NLP, statistical NLP or hybrid methods. In order to select a method we need to know:

What kind of texts we want to summarise. Domain, size and complexity of the texts affect the character of the system. For example, texts of a restricted domain are easier to summarise than multi-topical texts, because it might be feasible for the former to find and codify the rules-templates which produce the abstracts. Additionally, short papers call for different techniques in abstracting (and even in structuring and writing them in the first place) from dissertations and bigger articles.
What tools we have and how self-confident we are about their performance. If we want to generate a coherent and comprehensible abstract we need an efficient text generation module. If we want to use statistics, we need a big and homogeneously annotated corpus along with corpus analysis tools.

Automatic abstracting through text understanding

This method is very close to the way humans do abstracting: the system has first to understand the text and then to create the abstract. The first step produces a canonical-logical form of the text that feeds the sentence generator to write the abstract. Both of these issues, text analysis and understanding, and text generation are huge NLP problems by themselves, and computationally unachievable so far. The complexities of language (anaphora, context, polysemy, global common knowledge required, etc.) make it too hard to create a system which will be effective in terms of producing good summaries, as well as being fast and maintainable.

Techniques

Methodologies for NLU-based summarising are borrowed mainly from information extraction systems. These systems have been tried to be tailored to automatic summarising for specific domains. They try to match their stored knowledge entities against the input text so as to locate its dominant concepts. The major drawback is that all these entities have to be hand-coded by humans. Usually the knowledge base has a huge size because it has to cater for every possible situation. Humans also undertake the maintenance and updating of the system.

Automatic extraction of sentences

Automatic summarising via sentence extraction has the target of locating the best content-bearing sentences in a text. Extracting sentences can be a much simpler and hence a fast and feasible approach. On the other hand the result passage might not be much comprehensible and refined: we sacrifice coherence for speed and feasibility.

The assumption behind extracting is that there should be a set of sentences which present all the key ideas of the text, or at least a great number of these ideas. The goal is first to identify what really influences the significance of a sentence, what makes it important. The next step is to train and program a system to automatically locate these elements in a sentence and compute its summary-value.

It is evident that this method avoids all the above-mentioned conventional NLP problems: no analysis, representation an understanding of the text is required. No generation has to take place, and in addition the extracted sentences will be perfectly grammatical.

Extracts vs. Abstracts

Although it is logical to suppose that the original text will include sentences that could be as good as the abstract sentences, there are several issues to be dealt with:

The sentences of an abstract are more dense and contain implications, generalisations and conclusions which might not be ``expressed'' intact in sentences of the main text.
The language style (especially in syntax) of an abstract is generally different from the original text. Although an extract preserves the style of the writer, an abstract has a usually a more dense and conventional style.
The extracted sentences might not be textually coherent and might not flow naturally. It is much possible that there will be fragmentary sentences, which will not make sense in the context of the extract, in spite of being important sentences. Furthermore, the extract will probably contain unresolved anaphora.
There is a chance of redundancy and repetition in an extract, because sentences with similar content will both achieve high scores and will be extracted.

Techniques for automatic extraction

Rule-based approach

This method uses certain rules to select the sentences for the extract. The sentences which satisfy these rules are the ones to be extracted. Examples of rules are:

Extract every first sentence in a paragraph
Extract every sentence that has a title word(s) and a cue phrase (e.g. This paper)
Extract every sentence with a specified number of words in a list. This word list should contain domain oriented words which are supposed to trigger worthy sentences.

The format of rules is usually an if ... then template:

       IF (S(I) IS paragraph-initial) OR
          (S(I) CONTAINS ``This paper presents'')
       THEN
          extract(S(I))

The two problems for NLP-based strategies referred above hold for this approach as well. The system must be provided with the rules by a human and these rules are specifically tailored to the domain they are written for. A change of domain should mean a major rewriting of the rules.

Statistical approach

In contrast to the human-provided rules, the statistical approach basically tries to learn automatically the rules which predict a summary-worthy sentence.

Statistics-based systems are empirical, retrainable systems which minimise human effort. Their goal is to identify the units in a sentence which influence its importance and try to learn the dependency between the significance of units and the significance of a sentence. In other words, what unit triggers a good sentence and in what quantities it is needed to do so; the occurrence of this unit will help to determine the weight of the sentence. In this framework each sentence is assigned a score that represents the degree of appropriateness for inclusion in a summary.

Statistical techniques for automatic extracting are very similar to the ones used in information retrieval. In the latter, each document is viewed as a collection of indices (usually words) and every index has a weight which corresponds to its statistical appearance in the document. The document then is represented by a vector with index weights as elements. In a typical document retrieval query only the top frequency indices will be returned. In the same way, for extracting, each document is a collection of weighted sentences and the highest scoring ones comprise the final extract.

Pros and Cons

The statistical methodology has two major advantages over NLP-based abstracting and rule-based extracting: For a crude system (only words) there is no need for language processing since all the scoring is based on superficial factors, i.e. the strings in a sentence. Furthermore, as it happens with statistical approaches, the same system can be used in different domains, as long as it can learn the statistics and the dependencies (be trained) on it before.

Statistics do not contradict the rule-based extracting. It is possible that they will come up with the same rules and factors for predicting the significance of sentences. However, statistically based systems learn these rules automatically from a corpus, and thus avoid the subjectivity and retrainability problems imposed by human coding. They also produce more refined rules because they are based on (combinations of) several predicting (diagnostic) units.

On the other hand statistical systems usually presuppose the independence of instances of the same diagnostic unit. This oversimplified model does not apply to natural language: words, noun phrases, etc. in a sentence do not appear completely independently. In addition, the selection of a statistical model which is responsible for utilising the statistical regularities of the existing corpus has a major effect on the performance of the system. Although a factor can be really important for summarising, an inadequate or irrelevant model may not exploit that. Hence, researchers usually attempt to find the best combination of factors as well as the statistical modelling which makes the best out of them.

Outline of standard statistical NLP techniques

In this chapter are presented the tools, laws and techniques which are generally used in statistical NLP and in the implemented system.

Scoring the sentences

In out approach, the score of a sentence depends on its diagnostic units. A diagnostic unit (DU) is anything in a sentence which gives a clue to its significance. It can be a word, a sequence of words, a syntactic structure, its position within the text, etc. The presence and the importance of the DUs of a sentence virtually determines its score. The scoring formula can encapsulate more than one DU, for example, the goodness of a sentence can rely both on its terms and its position. The form of the formula follows the term-weighting model:

where is the weight of the i-th heuristic, is the specific score of the sentence for the i-th heuristic and n is of course the number of heuristics used.

The optimum weight of each contributing DU is found empirically, after several attempts with different weights. Ideally, the best combination is the one which maximises the score of only the extract-worthy sentences. In order to train to system for this, we need to know these sentences beforehand.

Determining weights

The main measure of a diagnostic unit ( term in the following) weight is its statistical appearance in the specific document and its statistical appearance in the whole corpus. Statistical appearance can either mean how many times it is found in a collection (frequency) or if it appears at all (binary feature).

The assignment of weights must discriminate between relevant and non-relevant diagnostic units. There are several ways to calculate these weights (Robertson Jones 1994):

collection frequency: if n is the number of documents term appears in and N is the total number of documents then:

This method selects terms which occur often in a certain document but rarely in the rest by using the inverse of the term appearance.

term frequency: It takes into account only the frequency of a term inside the document:

= number of occurrences of term in document j.

document length: It is logical to assume that terms appear more frequently in bigger files, so if a term is relatively more frequent in a short than in a big file, then it is more important. To incorporate document length in the weighting formula we define:

= total number of term occurrences in .

This can be generalised to the average length of a document:

combined evidence: Richard Jones combined together the three weighting formulas:

where K is a tuning constant which is determined after successive trials and adjusts the effect of . K=0 eliminates the effect. TREC (Text REtrieval Conference) Program trials showed that K=2 is actually an effective value.

Term-Frequency Inverse-Document-Frequency (TFIDF): Another standard weight computation method (e.g. (Salton et al. 1993)) combines term frequency, the number of documents (N) and the the number of documents () that the term appears in:

The TFIDF scoring formula also favours terms which are highly frequent in a document but rare in the rest of the corpus.

Frequency threshold

Making reliable inferences on occurrences counted in texts is not an easy task. Assuming that occurrences have a normal distribution, we will find many entries with very low frequencies. Are these entries good predictors? The answer is that it depends on the nature of the diagnostic unit. For single words, it is important to keep even the ones with very low occurrences because rare words are valuable. But for word pairs, a different approach is needed. It is evident that words with low frequency will construct word pairs with low frequency as well. Since these rare words are already encapsulated and evaluated in the single-word method, they have no value in word pairs as predictors. Good word pair predictors rely on the mutual information between words. Common statistical practice says that if the frequency of the term in question is below 5 then the statistical inference based on this frequency is unreliable.

Evaluation

Evaluation in NLP systems is a very difficult task; researchers have just recently began to define and systematise it. For the purposes of this project evaluation techniques for automatic information retrieval along with some specific techniques for evaluating abstracts and extracts are presented.

General Statistical NLP evaluation

Again measures of success are borrowed from automatic text indexing and classification. Given that we have (perfect) extracts to compare against the computer-produced ones, the classical precision and recall ratios can be used. Their formal definition is:

precision =

recall =

For example, suppose there is a paper for summarising, and a set of sentences T (represented by their indices for convenience) have been identified before by a human. Then the system is tried on the same paper and produces the set C.

If T comprises of , S2, S3, S4, S5, S6 and C comprises of , S7, S4, S8, S9 then:

Using the formulas, recall is two correct sentences out of five (40), and precision is two correct sentences out of six (33).

Precision indicates the ratio of correct hits over the total selected sentences. Recall indicates the ratio of similarity between the correct sentences extracted and the sentences in the human, ``correct'' extract (usually called the target extract). It is obvious that both quantities cannot exceed 100.

Precision and recall are inversely related. If more sentences are extracted, recall will grow up as a result of more correct sentences being retrieved but precision will decrease because not all of the extra sentences will be correct. A combined measure, F-score (see Hobbs et al. (1992)) can be used:

where P is precision, R is recall and is an parameter indicating the relative weight of precision and recall. If = 1 there is no preference, if 1 recall is more important, if 1 it is precision that is more favoured.

There are certain assumptions behind these measures; (1) all sentences carry the same weight, so whichever omission has the same cost, and (2) that there is a unique, ideal target extract. In fact, (1) some sentences are more important than others in the construction of an extract, and (2) if a sentence (or more generally a collection of sentences) does not belong to the target extract, can nonetheless be significant for the subject content and act as surrogate (without great loss) of the target sentence.

Specific factors for automatic summarising

In addition to the afore-mentioned evaluations, there are some qualitative criteria specifically for abstracts:

Readability: The text is understandable and reads easily.
Textual coherence: There are no discourse gaps between the sentences and the text flows naturally.
Accuracy: All the important issues of the text are covered and there are no errors.

Evaluating extracts

For the present system, the interest lies in (1) how informative is the extract compared to the human produced abstract/extract and (2) which heuristic or combination of heuristics performs best.

Usually the success of a method is measured against some baseline techniques which produce an extract without any real processing. One such method is the random selection of sentences (also used in (Edmundson 1969)). Other methods, slightly more advanced, include: (1) selecting sentences only from the beginning and the end and (2) just consisting of whole sections (e.g. the summary section).

Target extracts

Comparing against a preselected set of sentences means of evaluating how good the computer ``simulates'' humans in pinpointing the important sentences in a text. The advantage of this comparison is that it evaluates computer extracts as extracts and not as abstracts. Virtues beyond pure extracting, such as normal flow of ideas and resolved anaphora, are not compulsory.

The creation of target extracts (also called manual extracts) is made by human evaluators. They are usually required to read through the text and then mark the most important sentences. If the summarising system has no way of dealing with text coherence, there is no need for the target extract to have that attribute. Otherwise the human extractor has to pick up sentences which produce a connected and coherent passage.

Text similarity

Besides precision and recall measures we can also compute the mathematic similarity between computer and target extracts. This can help to demonstrate how close are the two passages. Using the vector space model (see (Salton 1989)) we represent the extract of a document of N sentences as a vector:

where has a value of 1 if the sentence belongs to the extract or 0 otherwise. If is the human produced target extract and is the automatic extract their similarity is calculated by multiplying the vectors:

where and are the elements of the vectors that represent the target and computer extracts respectively.

Sentence similarity

Measures based on the binary notion of belonging/not belonging to the target extract certainly ignore the possibility that there may be a surrogate subset of sentences which is satisfactorily close to the target extract. For example consider these two sentences taken from the same text (from the introduction and the summary respectively):

This paper describes the adaptation work carried out to retarget the Xerox Tagger to Spanish.
This paper presents results obtained with the port to Spanish of a public domain tagger -- the Xerox Tagger.

It is obvious that these sentences have about the same content, but if one sentence makes it to the extract and then the program chooses the other, the cost for evaluation is the same as if any other sentence was selected. If we could find a way to express that two sentences are ``similar'' in content then the evaluation of an extract could be more precise. There is even a chance of judging the extract against a real abstract if a target extract is not available.

Sentence similarity can be influenced by morphological, syntactic and semantic criteria. Sentences with a great number of identical word stems or noun/verb phrases or similar logical forms can be said to be similar and thus interchangeable for inclusion in an extract.

Subjective evaluation

For a more integrated study of evaluation, it is convenient to judge the extract, as a whole and its sentences, as a stand-alone text excerpt and not against other targets. This method is based on the qualitative and quantitative evaluations of a human reader, who assigns scores to each computer-produced sentence on a point scale (usually consulting the original text and probably the abstract). It is the most unbiased method because the evaluator can value every sentence without being preoccupied by a target extract, and so an alternative extract can be equally appreciated if it is informative enough.

Methods and Heuristics in automatic extracting

Review of extracting systems

Five systems on automatic extracting are reviewed to give a general idea of the historical background and progress. Edmundson is the basis, Kupiec is the most recent one, NetSumm is a commercial product and Salton explores a somewhat different methodology, but with interesting thoughts. Finally, Ono incorporates text understanding while avoiding using vast knowledge bases.

Note: Words in italics are terms used throughout the specific paper.

H.P. Edmundson's Abstract Generation System

It is not so risky to say that Edmundson's work (Edmundson 1969) set the trend for automatic extracting; all subsequent researchers refer to his work, and his heuristics have not been seriously expanded since then.

At that time, the only previous work on automatic extracting was Luhn's (Luhn 1959) system, ten years before, which only used high frequency words to calculate the sentence weights. Edmundson described and utilised cue phrases, titles and locational heuristics, tried all of their combinations and evaluated the computer-produced extracts with human-produced target extracts.

Preparation of target extracts

Edmundson carefully outlined the principles of human extracting in order to make his evaluation more concise. For a sentence to be eligible for the target extract it was required to carry information of at least one of six types: subject matter, purpose, methods, conclusions or findings, generalisations or implications and recommendations or suggestions. The final set of selected sentences must also be coherent and their number should equal 25 of the original text. Interesting note: headings were also included in these extracts.

Utilised heuristics

The heuristic ( clues) come from two separate sources: the body and the skeleton of the text. They also can be linguistic or structural. All the possible combinations create four different methods:

Rationale of the four Edmundson methods

Cue method: Basically cue phrases and words (see section 4.2.3) A Cue dictionary was constructed, including three different kinds of entries: bonus words (positive effect), stigma words (negative effect) and null words (the equivalent of stoplist words). Linguistic analysis showed which grammatical categories belong to these classes.
Key method: Again, a Key glossary is created, which contains words with a corpus frequency above a certain percentage of the total number of words.
Title method: It assigns positive weights to words appearing in titles, headings and formatted text.
Location method: It assigns positive weights to headings, sentences in first and last paragraphs and paragraph-initial and paragraph-final sentences as well .

All these methods were tried singly and cumulatively and the best combination was chosen on the basis of the greater mean percentage of sentences common in the automatic extracts and the target extracts. The combination cue + title + location had the best mean co-selection measure (cue + title + location + frequency had the second best) and was later input to the next evaluation stage. The fact that the frequency heuristic did not improve the performance was a very important finding: (1) automatic extraction systems needed more sophisticated representations than single words, (2) the exclusion of word frequencies speeds up the system and reduces the storage requirements.

Evaluation studies

Two different evaluations (for the preferred method) were tried out:

Subjective similarity ratings measure the similarity between automatic-target extracts, based on subjective evaluation of similarity between their sentences (on a five-point scale). The result was a mean similarity of 66.
Statistical error analysis measures the portion of the automatically extracted sentences worthy of belonging to a successful extract. Such sentences were specified as (1) identical with sentences in target extracts, (2) co-intensional and (3) extract-worthy judged. A promising 84 was classified in these categories.

Kupiec's Trainable Document Summariser

Kupiec et al. work (Kupiec et al. 1995) is a highly influenced by Edmundson {(Edmundson 1969)). Document extraction is viewed as a statistical classification problem, i.e. for every sentence, its score means the probability it can be included in a summary.

Heuristics

The heuristics (called features in the paper) utilised were:

Sentence length cut-off feature: Sentences with fewer than 5 words are excluded.
Fixed-phrase feature: Favours sentences containing any of 26 indicator (cue) phrases, e.g. ``this letter...''.
Paragraph feature: Classifies the sentences into five categories: first 10, last 5, paragraph-initial, paragraph-final and paragraph-medial.
Thematic word feature: All the high frequency words comprise a term vocabulary; sentences with such terms get extra points.
Uppercase word feature: Favours sentences with proper names (in capitals).

The probability measure, using Baye's rule, that a sentence s will be included in the summary S is:

and assuming statistical independence of the features, this reduces to:

is constant so it does not affect the probability, and the two remaining quantities can be calculated straightly by counting occurrences.

Evaluation

The project corpus consisted of 188 document/summary pairs; the summaries were manually generated by humans . 79.4 of the summary sentences were also found in the body of the texts ( direct match). For another 3.3 there were two or more sentences in the texts which effectively had the same content ( direct join). Therefore, the maximum precision rate for the system could reach the sum of these probabilities, 83, since the rest of the sentences in the summaries did not have any direct or indirect correspondent in the original texts. Kupiec calculated recall and precision scores both individually for each feature and with combinations. The conclusion supported paragraph + fixed-phrase + sentence-length, with a 42 recall.

The notion of direct matches and joins is a useful one when target extracts are not available. Furthermore, the results are in trend with Edmundson; both studies show now a clear preference over certain features (location, cue phrases) and reduction of performance for the word-based statistics. However, we have to consider two facts which somehow decrease the significance of the results. First, that manual summaries had on average only three sentences. Second, that the fraction of direct sentence matches found in the corpus was surprisingly high. (for instance, in the Computational Linguistics corpus used to train the present project, rarely sentences in the abstract coincided with text sentences)

Salton's Passage Retrieval System

Salton et al (1993) system, Smart, does not produce straight abstracts, but tries to identify sets of sentences (even whole sections or paragraphs) which represent the subject content of a paper. In the report, there is a brief introduction to sentence extracting, and it is stated that retrieving passages is a right step towards better response to user queries. The architecture of Smart is based on vector processing model, i.e. every document is represented as a vector of weighted terms. The procedure to obtain the most relevant documents is as follows:

The user supplies a query: it may consist of a word, a stem or a phrase.
For every document a similarity metric is computed between the query and the document, based on the number and the weight of co-occuring (in the query and in the document) terms. This is called the global similarity.
All the documents having a global similarity below a predefined threshold (normally it is 0.2) are rejected.
To overcome the problem of ambiguous words (with multiple meanings) and phrases the local text environments are considered; only texts with a sufficient number of locally similar substructures finally qualify for the response to the query.

Results

The effectiveness of the system is enhanced by considering ways to retrieve subsets of texts with the same qualifications as the previous texts, i.e. with a similarity metric above a certain threshold. In this way, sentences, sequences of adjacent sentences, sections and paragraphs are assigned a similarity metric (using the above-mentioned stages) and the ones with a high similarity can be considered as surrogates for the original full texts.

The evaluation is based on recall and precision figures, along with statistics that show the ratio paragraph per retrieved item, indicative of the achieved compression. The utilisation of local similarity and afterwards the substitution of texts with text excerpts improves the efficiency of the system by 25. Salton concludes that the best performing combination is sections+paragraphs, with averagely 1.3 paragraphs representing each document.

Although not a strictly sentence extracting system, Smart deals with issues (word ambiguity, similarity between sentences) in the area of automatic extracting. Considering sequences of sentences instead of isolated sentences also decreases the effect of problems such as anaphora resolution and textual coherence.

NetSumm

NetSumm is a product of British Telecommunications and summarises WWW Pages. It ``automatically highlights the sentences that it considers make up the "most important" part of the text''. BT has established a WEB page, (http://www.labs.bt.com/netsumm/pages/about.htm) to spread information about the program and it allows you to submit a paper and get the extract. Unfortunately, the methodology is not explained, so it remains a secret (obviously for commercial reasons) how the extracts ( abridgements) are produced.

NetSumm is program running in Mackintosh computers and through NetScape thus using all the user-friendly facilities, like mouse, menus and buttons. It accepts ASCII and HTML formatted documents and it responses very quickly. It allows for some parametrisation of the proportion of condensation, and it can display the extract sentences alone, or the whole text with the extracted sentences highlighted.

Evaluation

In accordance with (Morris et al. 1992) BT assert that ``an abridgement of typically only 5 of the original article will contain roughly 70 of the information in an author-written abstract, while a 25 abridgement contains essentially all of the information in the abstract''.

Abstract Generation based on rhetorical structure extraction

Ono et al. (1994) developed a system that generates abstracts of Japanese expository writings based on the notion of rhetorical structures. It is basically a pattern matching technique which tries to find the structure of mostly argumentative texts. It also ensures that the output extract will be coherent, whatever the desired length.

Rhetorical structures represent deep semantic relations between various portions of a text, in two layers: intra-paragraph (between sentences) and inter-paragraph. These structures consist of rhetorical relations. Typical rhetorical relations (and the connectives which reflect the flow of the argument) include serial ( X thus Y), summarisation ( X after all Y), reason ( X because Y) etc.. Rhetorical structures are stored in binary trees (in the form of connective(X,Y), so there is a hierarchy of relations - the higher in the tree the more general and important the relation.

Rhetorical structure extraction

The extraction procedure works as follows: the sentence is morphologically and syntactically parsed and then the basic rhetorical relations are detected (like the ones referred above) between adjacent sentences . The segmentation module detects rhetorical relations between distant sentences and then the candidate generator produces all possible rhetorical structures. The final step is to assign penalty scores to each candidate. Penalty scores are based on preference rules for every neighboring relations (i.e. what is the most preferred structure for a given relation).

Abstract Generation

The generation of the abstract is accomplished by examining the best performing rhetorical structure of each section. Rhetorical relations are categorised according to which operand is the stronger: for example in summarising relation summarisation(X,Y), Y, the conclusion, is obviously more important than X. This relation is characterised as RightNucleus and there are two other types: LeftNucleus and BothNucleus relations.

The system evaluates the importance of all the sentences by imposing penalties to the weaker parts of the categorised relations. The nodes of the tree with the highest penalties (from bottom to top) are cut out and the remaining terminal nodes construct the extract.

Extracts of these type carry two important virtues: the relations and the connectives represent true relations and arguments of the original text. Secondly, they will not contain fragmentary sentences; all the sentences are there as parts of rhetorical relations. In addition it is easy to vary the length of an extract: just cut out more terminal nodes, the ones with the highest penalties.

Evaluation

The system was evaluated against 30 editorial articles and 42 technical papers. Recall figures were 51 and 74 respectively, and this difference was explained on the basis that technical papers have a more typical and expository style and contain more explicit clues of rhetorical relations.

Review of diagnostic units

In the following sections all the diagnostic units are reviewed in more detail.

High frequency keywords

The first and simplest method tries to utilise the basic units of a sentence: words. It assumes that the most frequent words found in the body of a text might be indicative of high content in the sentences they appear in. However, such a word must have another attribute: it must be infrequent in the rest of the corpus, because otherwise it is just a common word for the domain without any discriminating power. For example, the word linguistics will possibly appear very often in a computational linguistics corpus. In a sentence of the text we want to summarise, this word does not enhance its significance; anyway we expect a lot of sentences in such a corpus to contain the word linguistics. The same holds for the functional words in English, like the, and etc. They have very high frequencies and even distributions throughout the corpus, and their contribution to the content of the sentence is negligible.

It seems fairly logical that some (though certainly not all) the key ideas for a paper will repeat throughout the sentences, but it cannot be easily concluded that:

every major concept will have a reasonably high frequency in the paper and a poor frequency in the rest of the corpus.
these key concepts are just single words; they may extend over one word, e.g. noun phrases.
and what's more:
high frequency words which are not concepts but happen to appear a lot of times in a text will boost undesired sentences.
words with multiple senses (e.g. bank, book or parts of bigger structures like car oil and olive oil ) will be grouped into a single one and thus blur the terminology bank.

Word-based statistics are the easiest to be obtained because they do not require any advanced linguistic processing. This methodology was one of the first to be tried out ((Luhn 1959), (Rath et al. 1961), (Edmundson 1969)). One of its main advantages derives from a statistical law: the greater the collected statistics, the more reliable the method. Indeed, the highest frequencies will appear for this heuristic because all the other frequency based diagnostic units consist of more than one word and their chances of repeating will be much less than of those of the single-word ones.

Word pairs

In the next level, these terms are taken to be pairs of words. The idea behind this concept is that the co-occurrence of two certain (not necessarily adjacent) words in the same sentences in high frequency could serve as a diagnostic unit. First, terms which consist of two words can be grasped. Second, if a document contains relatively many sentences with two certain words, this must relate to its importance (either as a specific term or as typical of the writer's style) for this document. Third we are able to evaluate context information; if a word tends to appear with another one frequently their dependence will produce high frequency word pairs and thus be valued accordingly.

A further elaboration of the method could be to select which part of speeches are allowed to form a word pair candidate. A problem with this approach derives from word ambiguity when considering adjacent words. For instance, consider the famous linguistic phrase: ``Time flies like an arrow''. The first three words here can belong to more than one grammatical categories so it is not clear which grammatical combination this construct represents. If it appears in a later sentence with different categories this time, then it will unfortunately group with the previous one.

Cue words and phrases

It is typical in writing a paper for publication to use certain words or phrases to highlight sentences. It was Edmundson {(Edmundson 1969)) who first observed and used these heuristics for his project. Such cues as This paper ..., we propose ..., important are frequently indicative of a worthy sentence for summarisation. These diagnostic units are called cue phrases and bonus words. On the contrary, there are some negative words (or phrases), called stigma words which decrease the significance of a sentence, like hardly, impossible, belittling expressions, etc.

Title and Headings words

Undoubtedly the title and the headers of a text provide a good amount of information about its content. By extracting the content words from titles and headers we can create a scoring formula which increases the score of sentences that contain these words. We can also put headers and titles directly into the extract thus creating a more coherent and structured output. Finally, the system can be guided to select at least one sentence from every major section, so given that this sentence is good, the extract is more likely to grasp all the basic subjects.

Utilisation of already existing abstracts

When abstracts (or even better extracts) are available with texts, the system can ``learn'' certain properties from them. We can extract words, noun phrases, syntax style, etc. from abstracts and then try to find similar sentences in the body of the text. In this way the output sentences bear more resemblance (in content and in format) with the human-produced abstracts. Other statistical properties which can be utilised include the average number of sentences in a summary along with the average length of summary sentences.

Linguistic processing

Linguistic information can also prove useful, on the basis of looking for strings of words that form a syntactic structure. Extending the idea of high frequency words, we can assume that noun phrases form more meaningful concepts, thus getting closer to the idea of terms. This overcomes several problems of the first single-word method because it can utilise:

compound nouns
terms which consist of adjective+noun, e.g. computational linguistics

There is a possibility though that one term can be ``implemented'' with more than one noun phrases, compare for example information overload with overload of information.

Locational heuristics

The position of a sentence within a paragraph (and the position of a paragraph within a text as well) can sometimes be indicative of its significance. This happens usually with the first sentence of a paragraph which supposedly gives a very good clue of what is going to follow. Another similar assertion is that a good proportion of important information about a paper is to be found in the first and last paragraphs (they indeed often contain introductory and conclusive material), so their sentences should be favoured more than the sentences of the other paragraphs.

Exploiting text formatting

The format of an article can further aid the searching for summarising sentences. Writers usually put important phrases in bold or italic font, capitalise or underline them if they want to give a certain emphasis.

Additional factors affecting a system for automatic extraction

Along with the diagnostic units we can further parameterise the system and experiment with varying values in order to find optimum combinations.

Length of an extract

Morris et al. (Morris et al. 1992) postulated that about 20 of the sentences in a text could express all the basic ideas about it. Since abstracts are much shorter than this proportion, the length of extracts should lie between the length of an abstract and Morris's percentage.

There are five ways to vary the number of sentences for an extract:

Proportion: Select a predefined percentage (usually 10) of the number of sentences of the paper. This technique is good for normally sized papers but will produce long extracts for long documents. Since long documents may contain many ideas and probably more than one topic it might be helpful to try for a lot of sentences. However, long extracts magnify the problems of the method (incoherence, repetition) and so many sentences can be confusing to read and follow. Finally, from an economical point of view, big extracts contradict the brevity of a summary: nobody would like to read a summary with 100 or more sentences.
Oracle method: If a target extract is available, select the same number of sentences. Precision and recall figures are always the same because of the same denominator. However, target extracts are not supposed to be accessible (otherwise why making a computer program?). In addition, it is intuitive that a computer extract will need more sentences than the perfect extract to reach a good point of coverage and coherence. But an advantage of the oracle method is that the system can be ``trained'' from the target extracts so it can predict for the test documents the optimum number of sentences.
Fixed number of sentences: Here the length of an extract is always the same (typical number: 10-15 sentences) regardless of the size of the paper. This technique is closer to human-produced abstracts. It favours shortness but the problems of the previous method are replicated: who knows how good the runner-up sentences would be?
Sentences above a certain threshold: For a sentence to be included in the extract it suffices to have a reasonable enough score. This is a way of compromising the extremes of the previous methods, but it requires to determine this threshold.
Mathematic formula: The number of extracted sentences is an increasing function of the number of sentences in the text, but it does not grow linearly. So, relatively few sentences are added for a big text and even fewer for a much bigger. This is probably one of the best methods, it prevents a size explosion but it caters for huge papers as well.

Length of a sentence

It can be suggested that too short or too long sentences are not generally ideal for an abstract, and therefore for an extract as well. This is usually referred to (see (Kupiec et al. 1995)) as sentence cut-off feature. It penalises short (less than 5-6 words) and long sentences either by reducing their score, or even by excluding them completely.

Stop words

Language in general contains much redundancy. There are many words that carry no meaning but are necessary to construct grammatical sentences. These words include prepositions, pronouns, articles, connectives etc. and are called function words. It follows that the function words should not be included in any statistics and scoring formulas because they do not contribute to the relevance and importance of a sentence. For the present system I used an enhanced list that contains along with the traditionally functional words, some others as well which intuitively do not affect the summary-worthiness of a sentence, e.g. following, first, auxiliary verbs, etc.

Word stemming and Capitalisation

Can words of the same morphological root be considered identical terms? Especially in automatic indexing, it is a classical technique to use word stems as indices. It speeds up the procedure, reduces table sizes and supposedly enhances the performance of the system. However, there is no clear evidence that words with the same stem but with different suffixes (and possibly different grammatical categories) must be grouped. A difference in morphology sometimes entails a semantic difference, even a subtle one. Moreover, the word ambiguity problem is considerably enlarged as less different words exist. Finally, the quality of the stemming program has a strong impact on the usefulness of the technique. (Strzalkowski 1992) refers to an 6-8 improvement of precision by incorporating morphological information. Similar assertions can be made about capitalised and non-capitalised words. Others distinguish between for example Grammar and grammar, others group them into one word. No approach is definitely better than the other, it is rather a matter of preference.

Description of the Corpus

The nature and characteristics of the corpus have a strong impact on the later stages of work. It is of great importance to design beforehand the overall structure and appearance of the corpus, the ratio of automatic/manual editing required and adapt it to the annotation and processing tools which will be used afterwards.

The corpus processing was carried along with Simone Teuffel, PhD student in the Centre for Cognitive Science, who also did a summer project on automatic summarising.

Specifications

Statistical NLP requires a large amount of texts to guarantee the reliability of the gathered statistics. These texts must also be in electronic form and belong to the same domain, since we want to locate high content diagnostic units specifically for this domain. For the task in hand, at least the files that will later comprise the test set should contain an abstract (or even better an extract) to serve as a target for evaluation. Finally, they must be files of the same format, so the same processing programs can be applied to all the texts and output a uniform and coherent corpus.

After an extensive search in the World Wide Web (WWW), the only group of texts satisfying all these criteria was a collection of papers on Computational Linguistics, all of them LaTeX files. Although enormous amounts of texts, on several fields, are available online, they are only in the format of postscript files. Postscript files are by nature files to be displayed on a graphic screen, it is extremely difficult, if not impossible, to reconstruct them as text.

Origins and domain

The corpus consists of papers downloaded from the Computation and Language E-Print Archive (http://xxx.lanl.gov/cmp-lg) which contains papers submitted from June 1994 until May 1996. All the papers downloaded were LaTeX files and contained small abstracts. The topics included: computational linguistics, natural-language processing, speech processing, and related fields.

Corpus Processing

After the LaTeX files were downloaded they had to pass several processing stages in order to be usable by the Language Technology Group (LTG, University of Edinburgh) tools. LTG programs require the files to be in SGML format (see below).

The SGML environment

SGML (Standard Generalized Markup Language, (Goldfard 1990)) is an international standard language for describing marked-up electronic documents.

In this language we list and describe all the allowable entities in a text (e.g. titles, paragraphs, sentences, words etc.). The structuring of the entities is hierarchical and is represented in a tree; there is a root entity and all the others have to belong to a parental node.

The major advantage of SGML for our approach is that it allows for different representations of a text to be constructed in a uniform way. The software that runs on the corpus can easily access whatever part of the text, the only required information is the name of the entity which corresponds to a certain diagnostic unit. It does not need to know how it was made or where it is placed.

Every entity (also called structure, or tag) has a name and a description, stored in a description type document (DTD). This file is consulted when marking up documents. For a tag T its start and end are denoted by <T> and </T> respectively. For example if we have a sentence:

This is a test sentence

and if only sentences (tag: S) and words (tag: W) are used for markup, the SGML equivalent is:

<S><W>This</W> <W>is</W> <W>a</W> <W>test</W> <W>sentence</W></S>

The structure of the sentence is (assuming TEXT is the root element):

If we would also like to annotate noun phrases, the sentence would look like this:

<S><W>This</W> <W>is</W> <NG><W>a</W> <W>test</W> 
<W>sentence</W></NG></S>

It is obvious that every new representation can be easily incorporated in the same text. This SGML sentence contains now two different kinds of diagnostic units.

The specifications set by the DTD file must be strictly satisfied. Except the imposed hierarchy (for example a <W> must appear only enclosed in an <S>, tags of the same category cannot be nested (e.g. sentences within sentences are not permitted).

The Language Technology (LT) library of Normalised SGML tools (NSL)

LT NSL is a development environment for SGML-based corpus and document processing, with support for multiple versions and multiple levels of annotation. It consists of a C-based API for accessing and manipulating SGML documents and an integrated set of SGML tools.

LT NSL belongs to a greater group of tools, the LTG corpus workbench, which also includes tools for tagging, tokenizing, frequency counting, etc.

The LTG tools work with SGML queries, which are requests for specific tags inside the documents. After the tag is retrieved it can further be processed.

Pre-editing the LaTeX files

The complexity of LaTeX files and the diversity of environments that different writers use had to be reduced before feeding the next module. We undertook some automatic deletions and alterations so as the files could be more easily and quickly converted to HTML.

From LaTeX to HTML

Nikos Drakos' (Computer Based Learning Unit, University of Leeds) LaTeX2HTML program was used to convert the LaTeX files to HTML format LaTeX2HTML is a powerful program, that basically tries to translate as many Latex commands as possible, including figures, equations and even user defined environments. Since the summarisation program does not need this type of information, the converter is configured only for text. The command is :

latex2html -info 0 -nolatex -no <LaTeX>

The options before the filename tell LaTeX2HTML to avoid certain conversions of LaTeX commands.

Necessary replacements

Several LaTeX-oriented tags (like images and equations) whose presence and content do not have a summarising value were replaced by simple indicator tags. In this way the system did not count them as diagnostic units but the reader of the text was able to know where an equation or an image was originally placed, although the actual content of the equation is not of course reproduced. These tags were:

Images, figures: It is obvious that they are not needed, and of course they cannot be represented in text mode.
Equations: A certain assumption can be made that an equation cannot be included in ``high content'' words, let alone that equations from a LaTeX file are not possible to represent in text mode. However one can argue that the presence of an equation can affect the significance of a sentence, especially for informative abstracts.
References: The names of references (authors and years) as well as references to other sections of the text are not indicative of sentence content.

Table of Indicator TAGS

Manual post-editing of HTML files

The diversity of papers necessitated some further editing of files. Among the most frequent corrections were:

Abstract and Title: Keywords and Acknowledgements inside the abstract section, and ``It appears in Computational Linguistics vol. ...'' in the title section were removed.
Abstract boundaries: Some files had Introduction or Summary instead of Abstract.
Authors, dates, university names: All of these were deleted if they appeared inside the abstract or outside the BODY section.

From HTML to (simple) SGML

The HTML output file contains a lot of sophisticated tags which are not necessary for the subsequent processing of the texts. It also does not include end-of-paragraph markers, which are a requirement for SGML (every tag must end). At the end of processing, the only tags that should be present in the texts are:

and the hierarchical organisation of the text is:

The conversion is accomplished with several Perl scripts.

Post-Editing the SGML files

Even after the running of the previous script there were some files which needed extra editing because of:

Very strange and rare TAGS: Usually LaTeX2HTML-produced tags that appeared infrequently and the Perl scripts did not take care of them in the first place.
Empty paragraphs: Sometimes paragraphs without any text are produced by LaTeX2HTML. Besides their apparent usefulness, they tend to confuse the system in further stages.
Headers inside paragraphs: Of course headers should not enclosed in paragraphs, they stand on their own.
Items lists: Sometimes the items in a list do not end with full points, so the system considers them as a whole big sentence. We separated lists from the previous and next sentences as well as tagged every item as a sentence.

Tagging the corpus

The corpus so far is paragraph-marked, but we also need sentence, word and noun-phrase marking in order to count their frequencies. For assigning grammatical categories to the texts, I used the LTG tagger, a Hidden Markov Model based tagger with a correctness rate up to 96. The inserted tags were <S>, <W> and <NG>, referring to sentences, words and noun groups (LTG term for noun phrase) respectively.

Here is a sentence before being tagged:

The formalism of synchronous tree-adjoining grammars
[[#REF]], a variant of standard tree-adjoining grammars
(TAG), was intended to allow the use of TAGs for language
transduction in addition to language specification.

and after tagging:

<S ID="1"><NG> <W C="AT">The</W> <W C="NN">formalism</W></NG>
<W C="IN">of</W><NG> <W C="JJ">synchronous</W>
<W C="VBG">tree-adjoining</W> <W C="NNS">grammars</W></NG>
<D>[</D><D>[</D> <W C="CD">#REF</W><D>]</D><D>]</D><CM>,</CM>
<NG> <W C="AT">a</W> <W C="NN">variant</W></NG> 
<W C="IN">of</W><NG> <W C="JJ">standard</W> 
<W C="VBG">tree-adjoining</W> <W C="NNS">grammars</W></NG>
<D>(</D><NG> <W C="NN">TAG</W></NG> <D>)</D> <CM>,</CM> 
<W C="BEDZ">was</W> <W C="VBN">intended</W> <W C="TO">to</W> 
<W C="VB">allow</W><NG> <W C="AT">the</W> 
<W C="NN">use</W></NG> <W C="IN">of</W>
<NG><W C="NNS">TAGs</W></NG> <W C="IN">for</W>
<NG><W C="NN">language</W> <W C="NN">transduction</W></NG> 
<W C="IN">in</W> <NG><W C="NN">addition</W></NG> 
<W C="IN">to</W>< NG><W C="NN">language</W> 
<W C="NN">specification</W></NG> <EOS>.</EOS> </S>

Explanations

The ID inside S is an attribute which refers to the index (position) of the sentence in the text.
The characters in quotes inside <W C=...> refer to grammatical categories.
The tags CM and D are used for non-word strings such as parenthesis, commas, etc.
EOS denotes the end-of-sentence tag.

When this stage is finished, the corpus is ready for frequency counting.

Final statistics

Here are some indicative statistics concerning the corpus:

and the abstracts (for all the papers):

Description of the system

Introduction

The basic schema under which the system tries to refine itself is that the same statistical model (a slight variation of TFIDF) will be tested on various representations of the text. We begin with single word representation and use TFIDF weighting to score the sentences and produce extracts.

In the next level, we attempt to improve the performance by adding more sophisticated representations (i.e. diagnostic units). First word pairs inside a sentence are used, a diagnostic unit which takes into account the context of words and therefore reduces the ambiguity problem present in the previous method. Second we take the notion of 'term' to the level of 'noun phrase' so as to minimise word ambiguity and capture terms beyond one single word.

Finally we try all the methods in all possible combinations. This produces 3 single methods, 3 paired methods (words + pairs, words + NPs, NPs + pairs) and 1 triple (words + pairs + NPs).

It is of high importance that the model is independent of the representations because it allows for more representations to be tried out without altering the model and generally the philosophy of the system.

The performance results will show (1) which combination produces the most satisfactory extract and (2) which representations the TDIDF statistical model can exploit.

Statistical model

The statistical model used to determine the diagnostic units weights is a TFIDF-like formula. Instead of the inverse document frequency, the system uses the inverse total frequency in a corpus (hence the name TFICF: Term-Frequency Inverse-Corpus-Frequency):

The ratio of frequencies in a document over the whole corpus has the effect of supporting terms with high frequency in the document but rare in the rest of the corpus while it is moderate with words that have the same frequency over both the document and the corpus. The scoring function is implemented in such a way that no term gets a weight more than 1, even if the term frequency is bigger than the corpus frequency. This adjustment was made to avoid extremely large scores for terms appearing only in the input text. Suppose for instance that a word appeared 5 times in a text and twice in the training corpus. Then it would have a weight of log(5)/log(2)=2.32, much above the normal weights. If this very high score should be kept is however a matter of preference; maybe such a term could be so valuable it deserves such a score. I personally favoured the normalised weights for two reasons: First, I found many similar cases not only because these terms were too specific and important for the text but also because of the small size of the corpus that allowed for many zero-frequency terms. Second, the problem was magnified for pairs and noun phrases since these diagnostic units had far less appearances.

Algorithm steps

There are basically two steps involved in the application of each one of the methods: (1) preparation of the corpus statistics and (2) the sentence scoring and sorting. The statistics (i.e. frequencies of diagnostic units) are collected from all the training documents beforehand. After this, the only statistics needed to be collected are the input text frequencies.

Method 1: Key words

Processing Steps

Collecting statistics over the whole corpus

The system needs to be trained by gathering statistics from the training documents. The procedure is as follows:

For all the documents in the training set compute the frequencies of all their words.
Remove all the words (functional+others) which do not contribute to the content of the sentence, by consulting a stoplist.
Remove all the words with frequency 1, assuming that they are too specific or might be misspellings.
Sort the list by frequency in decreasing order.

Producing the extract

After the frequency table is constructed, the sentences of the test document are scored in the following way:

Calculate word frequencies for the words in the document.
For each sentence i compute its score using the formula:

where is the frequency of the j-th word of the sentence over the document, is the frequency of the j-th word of the sentence over the corpus and N is the number of content words in the sentence (i.e. its length minus the stoplist words).
Exclude sentences with less than 3 content words.
Sort the sentences by score in decreasing order.
Create the extract by selecting a certain proportion of the sorted sentences, e.g. 10 or only ones that are above a predefined number.
Display extracted sentences in the order of their appearance in the document.

Method 2: Word pairs

Special issues

Stoplist words

In the same way as the stoplist words for the first method, there can be certain pairs with one or even two insignificant words. These pairs are excluded from the further process, which also speeds up the system.

Frequency threshold for word pairs

For every sentence, of length n, there are possible word pairs. It is an extremely large number which eventually results in a huge frequency table. In order to be able to store and access quickly the table, a decision was made to cut off pairs with very low frequency, assuming that their absence cannot affect the performance of the system.

Processing Steps

Again two steps, preparation and computation are required.

Collecting statistics over the whole corpus

Compute the frequency of potential word pairs, i.e. all the possible pairs of every sentence in the collection of documents. Pairs with words from the stoplist, or with word frequency less than 5 are excluded, the latter ensures the reliability of the statistics.
Remove all the words with frequency 1, assuming that they are too specific or that they are spelling errors from the corpus.
Sort the list by frequency in decreasing order.

Producing the extract

Basically the same formula is used to compute the sentence scores:

Calculate word pairs frequencies for the sentences in the document. If a word pair exists in both orders, use the total of both the two frequencies..
For each sentence i compute its score using the formula:
Exclude sentences with less than 2 word pairs.
Sort the sentences by score in decreasing order.
Create the extract by selecting a certain proportion of the sorted sentences.
Display extracted sentences in the order of their appearance in the document.

Method 3: Noun phrases

The third method explores the possibilities of syntactic structures improving the system performance. Within the notion of looking for terms it is a logical sequence to go for noun phrases after words and pairs.

Configurational issues

To utilise syntactic information, it is obvious that the text needs to be parsed, so the output noun phrases rely on the quality of the parser and the noun phrase grammar specification. It is very useful to be able to configure the NP grammar because complicated structures can be simplified or rejected. For instance, if a prepositional phrase modifies the head noun phrase, the former can be deleted so the real diagnostic unit is only the main noun phrase. In addition, we could only select the last two or three words of a noun phrase as the diagnostic unit, thus eliminating adverbs and adjectives which are far away from the noun, etc.

Processing steps

The only difference in the algorithm for collecting statistics has to do with preparing the statistics:

If a noun phrase had more than three words it was truncated to the last three.
If a noun phrase is contained in the stoplist, then it is excluded.

The extracting algorithm is essentially the same. Noun phrases with frequency of 1 are not used in scoring and sentences with less than 2 noun phrases are also excluded.

Combined Heuristics

After the heuristics are tried separately, they are tried cumulatively in search for the optimum combination. The scoring formula is:

where K is the number of heuristics combined, is the number of diagnostic units of sentence i, is the contributing weight of the j-th diagnostic unit and is the weight for the occurrence of the diagnostic unit in the i-th sentence.

Results and Evaluation

Evaluation design

Evaluation for the present project consisted of running the extraction program on the 5 test texts for all the methods (including all their combinations) and then subjectively judging the quality of all the output extracts.

All the extracts had 10 sentences, a practical value given the average of pages of the test papers (6 pages) and the average number of sentences in the abstracts of the training corpus (4.75 sentences) average.

This judgement was performed as follows:

For every sentence there was a score on a five point scale according to the relevance of the sentence:

Table 1: Sentence ratings

Generally, if the same sentence appeared in extracts produced by different methods, it got the same score. However, if it generated a better relevance within the context of the extract its score was increased.

The points of the sentences for every extract are added together; the sum represents the total quality.

For every method the scores of the texts are also summed up to produce a quality measure for a specific method.

Results

The following tables show the total scores for papers and methods (W stands for words, P for pairs, N for noun phrases):

Table 2: Subjective quality results for extracts produced for the 5 test papers.

Findings and Discussions

A first look on the scores shows a consistency of methods over papers, i.e. scores for a certain paper had about the same boundaries for all the methods. Total scores on the horizontal axis seem to belong to three categories. Papers 1 and 5 got very high scores, many of them five-point sentences (i.e. perfect for extracting). Paper 4 had a moderate scoring and papers 2 and 3 scored very badly. This was mainly due to the nature of the texts. Papers 2 and 3 had a lot of example sentences with a lot of proper names (one paper had in fact Japanese examples). These names, which did not appear at all in the training corpus, were considered high value terms and achieved impressive scores without the sentence having other content words. As a result we can say that texts with many example sentences, fractions of programming code, etc. were not effectively summarised by our frequency-count model.

Co-selection

Another observation made by looking at what sentences were extracted is the co-selection of certain sentences by different methods. The following table shows for every paper how many sentences were co-selected by more than one method (starting from 7, i.e. selected by all methods):

Table 3: Number of co-selected sentences by all methods.

This table demonstrates some kind of regularity. We expect the number of sentences co-selected to be lower in the first column and then increase gradually. This actually happens, except a gap in sentences selected by 3 methods. If we calculate the number of individual sentences appearing in the extracts by removing all the sentences that were repeated (i.e. selected by more than one method) we can get a measure of overlap between the methods. The total number of sentences in all the extracts is: 5 papers * 10 sentences * 7 methods = 350. 205 of them were repetitions, so 350-205=145 were distinct sentences. Of these 145, 80 are repeated more than once (80 is the sum of the last row of table 3) and the remaining 65 appeared exactly once. Hence, the overlap is 80/145=55,1 which shows that more than half of the sentences was selected at least from one other method.

Weighted combinations

In combinations, methods were assumed to be equally weighted (see chapter 6, Description of the system) but actually words were slightly less weighted because of the following fact: The highest scoring sentences (with TFIDF, not in my evaluation) with single-words did not usually reach the maximum score, i.e. 1. However, sentence scores for pairs and noun phrases achieved more frequently the maximum score. Actually, for the sentences extracted the average sentence score, computed with TFICF, for words, pairs and noun phrases was 0.66, 0.99, and 0.92 respectively. Therefore, when adding the scores of individual methods, the contribution of words was on average 0.66/0.99=66 the contribution of pairs and 0.66/0.92=71 the contribution of noun phrases. The reason why the average scores for pairs and NPs were that high comes from the scoring formula. Units for these methods had much less occurrences in the corpus; it was easier for a sentence to have 2-3 units with a weight of unity, so the total sentence score was again 1 (also these methods had a smaller threshold of diagnostic units, see again chapter 6).

The fact that the word-based method did not contribute equally in compound heuristics is welcome (though not planned originally); since words have the poorest performance, it is desired to decrease their contribution.

What is the best method?

The scores over methods now show that the NP method had by far the best performance (82 points in total, 7 points more than the runner-up, pairs+words). Noun phrases also had consistently high scores in every paper, being the first scoring method in 3 papers (no.2, 3 4). Another observation which supports NP heuristics, is that 2 runner-up methods, ranked 3th and 4th, are combinations with NP as a factor.

On the other hand, words and pairs seem to perform similarly. However, I would say that pairs are in fact a better predictor because they outperform significantly the single-word method in 3 out of 5 papers; the 2 remaining papers which show a preference for words are in fact the two very high scoring ones. In these 2 papers (the ``easier'' for summarisation), high scores result in differences between methods. If we normalise the scores of every method by dividing by the total score for every paper we get:

Table 4: Normalised performance of all 7 methods.

This viewpoint again proves the dominance of the NP method, but this time word pairs are clearly the second best method, while the single-word is the poorest one. Combinations are placed between pairs and words. If a combination includes the word method, it always achieves lower scores than the same combination without words.

Comparison with NetSumm extracts

A comparison of the program extracts with the extracts produced by the BT NetSumm program did not reveal any reliable results. The proportion of condensation for NetSumm was usually more than 15, while it was less than 10 for the system extracts. The mean co-selection of sentences is only 16, but this is not a comparable measure given that we do not know how NetSumm extracts are produced. However, NetSumm passages are more coherent and do not contain example sentences.

The following table shows how many sentences were co-selected (1) for the NP method and (2) for the best performing method regarding the specific paper (see Table 1. A star (*) indicates that for the specific paper, NP was also the best performing method):

Table 5: Co-selection of sentences between the program and BT's NetSumm.

Conclusion

The final conclusions drawn from the evaluation are:

Noun phrases are the best predictors for summary-worthy sentences.
More sophisticated representations than single words help to improve the system's performance.
Combinations of methods do not produce better results, but still noun phrases are involved in the best of them. Finally, these synthesised methods are far more satisfactory than the single words diagnostic unit.
Some papers are more suitable for extracting, some are not, depending on the nature of the text. Simple counting of term occurrences may extract sentences with too specific words, like proper names from examples, programming code, etc.
55,7 of the sentences were extracted by more than one method, and generally scores of different methods did not have a great diversity.

Future Extensions

The system can be extended in many ways, from heuristics to statistical models. Here are some suggestions:

Use other statistical models, incorporating inverse document frequency for example. Furthermore, we can discriminate between terms whose total global frequency is the same but some appear in small and others in big files.
Experiment with weights of methods by finding what weights increase (and maximise) the score of the important sentences. Supervised training procedures can be used for this.
Explore other heuristics including titles, headers and perhaps the most promising feature: position within a paragraph.
Use structured noun phrases in order to group NPs such as information retrieval and retrieval of information. This can be done by storing NPs in a tree structure: e.g. head of the NP + complements.
Exclude example sentences, items in lists assuming that their significance is negligible. This has also the effect of ``clearing'' the frequency tables, removing too specific units.
Group lower-cased and upper-cased instances of the same word to increase the occurrence statistics.
Utilise abstracts and target extracts: Much information can be learnt from abstracts, such as words, style, cue phrases, length of sentences, number of sentences, etc.
Use different evaluation techniques. We can judge the system extracts with human-prepared target extracts. Klaus Zechner (Zechner 1995) conducted such an experiment, and his results showed a high correlation (mean precision/recall: 47/54) between humans and machines. Text similarity and/or sentence similarity also can be incorporated to refine the evaluation.
Develop techniques to deal with anaphora which usually requires text understanding (Jones 1993), but it can refine substantially the extract.
Try the system in a different corpus to see if it achieves better results, and if certain features (e.g., example sentences, the statistical model) improve/harm its performance.

Summary

We have described and developed a system for automatic summarisation of texts based on sentence extraction. The system finds the most significant sentences of a text by computing a relevance-score for every sentence. This score is based on the statistical regularities of words, word pairs and noun phrases in (1) the input text and (2) a collection of training texts (800,000 words) all in the general domain of computational linguistics. Output extracts were produced for 5 test papers, for all the three methods and their possible combinations. Evaluation scores were subjectively assigned to the extracts in order to find the method which produces the most satisfactory extract. Results from the evaluation showed a clear preference for noun phrases, good performance for pairs and poor performance for words.

User manual

The system was developed in C++ under the Unix operating system. There are two basic programs, create to create the necessary representations and docsum to do the extracting.

create

Syntax:

cat <sgml_file> | create_dus [-p] [-n]

where <sgml> is the input file, marked up at least for words and sentences. Option -p creates all the word pairs in a sentence and -n all the noun phrases. These tags are stored under the <REP> tag which is separate from <BODY>. All the entries have a <W> tag for uniformity, but there is an attribute T (for type) which has a value of "p" for pairs and "n" for noun phrases. Here is an example for the phrase An automatic extraction program with both pairs and noun phrases (word pairs consisting of at least one stoplist word are of course excluded):

<W T="n">automatic_extraction_program</W>
<W T="p"automatic_extraction></W>
<W T="p">automatic_program</W>
<W T="p">extraction_program</W>

Notes: There is one noun phrase here and three pairs. For the NP the article an was not included because maximally three words can be in one NP. The same article does not also appear in any pair because it belongs to the stoplist.

docsum

After the creation of the representation of the text, and given the frequency tables docsum produces the extracted sentences.

Syntax:

cat <sgml_file> | docsum [-m <method(s)>] [-p <condensation>]
[-q <query>] [-u <unit_query>] [-b <body_query>] [-h]
[-f <frequency_table>]

All the options have default values, especially set for the single word method. Here are further explanations about these options:

sgml:

It is the input file, marked up at least for words and sentences.

methods:

Indicates the diagnostic unit(s) to be used. w is for words, p is for pairs, and n is for noun phrases. For combinations the corresponding letters are simply concatenated, for example wn tries words + noun phrases (default: w).

condensation:

Corresponds to the length of the extract. If it is less than 1 then it indicates the proportion of the sentences that will be extracted (for example 0.1 means extract 10). If the value is more than 1 then it indicates the fixed number of sentences (default: 0.1).

query:

This value represents the SGML tag where the diagnostic units are stored (default: <BODY>, for the single-word method).

unit:

Units are the SGML entities which are scored and extracted (default: sentences <S>).

body:

The SGML entity which contains text marked-up only for words and not other representations. It is the tag used to access the sentences for displaying them (default: <BODY>).

frequency

It provides the filenames for the look-up frequency tables. For simplification reasons it is assumed that files for different diagnostic units have the same root filename but different extensions. The default values are total.frq, total.ng.frq and tt total.wp.frq for words, NPs and pairs respectively. The filename given with this option corresponds to the single-word table. It is then split into root and extension and .ng.frq and .wp.frq are added in order to construct the remaining filenames.

Output

The output of the program is an SGML document with all the non-qualifying sentences removed. It still contains paragraph markers, headers, etc. and a perl script was created to make the output look more natural. See Appendix 1 for ``clear'' results.

...text

As a surrogate, an abstract is placed between the title and the table of contents, in a size axis.

...documented

There are numerous guidelines for the creation of abstracts. See (Borko Bernier 1975) references for some Institution's published standards.

...understanding

Which might involve syntactic, semantic and discourse analysis.

...maintainable