Graphs in clusters: a hybrid approach to unsupervised extractive long document summarization using language models

Research output: Contribution to journal › Article › peer-review


Abstract

Effective summarization of long documents is a challenging task. Graph-based and cluster-based methods stand out as effective unsupervised solutions to this challenge: graph-based unsupervised methods are widely employed for summarization because they successfully identify relationships within documents, while cluster-based methods excel at minimizing redundancy by grouping similar content before generating a concise summary. This paper therefore merges cluster-based and graph-based methods, applying language models for unsupervised extractive summarization of long documents; the approach extracts key information while minimizing redundancy. First, we use BERT-based sentence embeddings to build sentence clusters with k-means, selecting the optimal number of clusters via the elbow method so that sentences are grouped by semantic similarity. Then, the TextRank algorithm is applied within each cluster to rank sentences by importance and representativeness. Finally, the total similarity score of each cluster's graph is used to rank the clusters and eliminate less important sentence groups. Our method achieves comparable or better summary quality with reduced redundancy compared to the individual cluster-based and graph-based methods, as well as other supervised and unsupervised baselines, across diverse datasets.
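The pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes precomputed sentence embeddings (in practice produced by a SentenceBERT-style encoder), fixes the number of clusters k rather than running the elbow method, and uses hypothetical function names throughout.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def kmeans(X, k, iters=50, seed=0):
    # Basic k-means; a real system would pick k via the elbow method.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def textrank(sim, d=0.85, iters=100):
    # Power iteration over a column-normalized similarity graph.
    n = len(sim)
    W = sim.copy()
    np.fill_diagonal(W, 0.0)
    col = W.sum(axis=0)
    col[col == 0] = 1.0
    P = W / col
    r = np.ones(n) / n
    for _ in range(iters):
        r = (1 - d) / n + d * (P @ r)
    return r

def summarize(embeddings, sentences, k, top_clusters, per_cluster=1):
    labels = kmeans(embeddings, k)
    scored = []
    for j in range(k):
        idx = np.where(labels == j)[0]
        if len(idx) == 0:
            continue
        sub = embeddings[idx]
        sim = np.array([[cosine_sim(a, b) for b in sub] for a in sub])
        ranks = textrank(sim)
        # Total similarity score of the cluster's graph (diagonal excluded),
        # used to rank clusters and drop the least important sentence groups.
        total = float(sim.sum() - len(idx))
        scored.append((total, j, idx[np.argsort(-ranks)[:per_cluster]]))
    scored.sort(reverse=True)
    picks = [i for _, _, best in scored[:top_clusters] for i in best.tolist()]
    return [sentences[i] for i in sorted(picks)]
```

With toy 2-D "embeddings", `summarize(emb, sents, k=2, top_clusters=2)` returns one representative sentence per retained cluster, in document order.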
Original language: English
Article number: 189
Number of pages: 18
Journal: Artificial Intelligence Review
Volume: 57
Issue number: 7
Early online date: 29 Jun 2024
DOIs
Publication status: Published - Jul 2024

Keywords

  • Language models
  • Ranking
  • Redundancy
  • Sentence centrality
  • SentenceBERT

