Summarization using automated AI-based techniques can help counter information overload. But for this to be useful in everyday life, we need AI models that can create concise summaries of conversations from a variety of formats, including meeting notes, email threads, and discussions happening in forums and chats. Unlike well-structured sources such as newspaper articles, these formats often contain large amounts of valuable but noisy, conversational, and verbose content from multiple participants. Meta AI has made a series of research advancements that bring machines close to human performance on automatic conversation summarization across diverse domains. In this blog post, we’re sharing details on our progress.
Our research covers key aspects of AI modeling to form a comprehensive approach: new dataset collection, new benchmark definition, a novel model training approach, and methods that handle both short- and long-form text. Specifically, we used publicly available text from online community discussion sites to generate an additional dataset for multiperspective answer summarization, and then defined a comprehensive benchmark for conversational summarization across diverse domains. We achieved state-of-the-art results on those benchmarks and significantly reduced occurrences of factual errors by using a novel, linguistically informed contrastive fine-tuning approach. We also deliver state-of-the-art results on long-document summarization. Finally, since labeling summarization data is a resource-heavy process, we proposed a general method for improving zero- and few-shot abstractive summarization.
We are exploring how this work can be applied across a variety of use cases. Summarized information is particularly valuable for augmented and virtual reality devices, due to their limited screen space. We believe summarization can also be a useful capability for smart assistants, by creating intelligent, natural interactions between people and AI – which could have many potential future use cases as we help build the metaverse.
Enabling conversational summarization
While documents, articles, and scientific papers contain specific linguistic structures that make them easier to summarize, conversational text scatters the main points across multiple utterances and participants, covering a vast amount of information in many different formats.
We addressed this research gap by collecting the ConvoSumm benchmark for research purposes, which is the first comprehensive benchmark for conversational summarization across diverse domains. It includes newly collected summaries for news article comments, discussion forums and debates, community question answering, and email threads, along with existing data for dialogue and meeting summarization.
We used the “issues-viewpoints-assertions” graph framework to unify modeling across these domains. We constructed the argument graph using entailment relations. We then linearized the graph and trained a graph-to-text model, and experimented with argument mining as a way to reduce noise in long-text input. Our results showed improved performance over the previous state-of-the-art model on the ConvoSumm benchmark in both automatic and human evaluations.
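To make the graph-to-text step concrete, the sketch below shows one way an “issues-viewpoints-assertions” graph could be linearized into a flat, tagged string for a seq2seq model. The node contents, tag names, and traversal order are illustrative assumptions, not the exact scheme used in the research.

```python
# Hypothetical sketch: linearizing an "issues-viewpoints-assertions"
# argument graph into a flat string that a seq2seq (graph-to-text) model
# can consume. Tag format and example content are illustrative only.

def linearize_graph(graph: dict) -> str:
    """Walk issue -> viewpoints -> assertions and emit tagged segments."""
    parts = []
    for issue, viewpoints in graph.items():
        parts.append(f"<issue> {issue}")
        for viewpoint, assertions in viewpoints.items():
            parts.append(f"<viewpoint> {viewpoint}")
            for assertion in assertions:
                parts.append(f"<assertion> {assertion}")
    return " ".join(parts)

convo_graph = {
    "Should the office switch to a 4-day week?": {
        "In favor": ["Productivity stayed flat in the trial."],
        "Against": ["Client coverage on Fridays would suffer."],
    }
}

flat = linearize_graph(convo_graph)
# `flat` would then be tokenized and fed to a pretrained seq2seq model
# fine-tuned to generate the summary from the linearized graph.
```

Linearization lets a standard text-to-text model consume graph structure without architectural changes, which is why it pairs naturally with pretrained seq2seq backbones.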
Another challenge we addressed was how to accurately summarize multiple perspectives in community question answering, where previously there was no way to ensure that different viewpoints were reflected in the summary. We addressed this by collecting the AnswerSumm dataset for research purposes. Its annotation follows a pipeline of relevant sentence selection, grouping sentences by perspective, summarizing each perspective, and producing an overall fused summary. We then introduced a novel unsupervised approach that automatically creates multiperspective, bullet-point answer summaries for data augmentation, further boosting overall summarization performance. Furthermore, we proposed using reinforcement learning with two additional rewards, based on textual entailment and semantic coverage, to improve factual consistency and answer coverage.
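The pipeline above can be sketched in miniature. The keyword-based grouping and extract-first-sentence “summaries” below are crude stand-ins for the model-based clustering and abstractive summarization used in the actual research; the function names and example answers are invented for illustration.

```python
# Illustrative sketch of an AnswerSumm-style pipeline: group selected
# answer sentences by perspective, then emit one bullet per perspective.
# The keyword heuristic stands in for learned perspective clustering.

def group_by_perspective(sentences):
    """Naive stand-in for perspective clustering: keyword-based stance."""
    groups = {}
    for sent in sentences:
        key = "pro" if ("recommend" in sent or "works" in sent) else "con"
        groups.setdefault(key, []).append(sent)
    return groups

def bullet_summary(groups):
    # One bullet per perspective; the first sentence of each group stands
    # in for an abstractive per-perspective summary.
    return ["- " + sents[0] for sents in groups.values()]

answers = [
    "I recommend using a wired connection.",
    "Wi-Fi works fine if the router is close.",
    "Wireless dropped constantly for me.",
]
bullets = bullet_summary(group_by_perspective(answers))
```

The key design point is that each perspective yields its own bullet before any fusion step, which is what guarantees minority viewpoints survive into the final summary.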
Improving faithfulness of conversational summarization
Abstractive summarization models can hallucinate, generating information that is not relevant or faithful to the input. For example, a common factual error in dialogue summaries is the wrong-reference error, in which the summary attributes a statement or action to the wrong speaker. To better understand the types of hallucinations generated by state-of-the-art pretrained models on dialogue summarization, we devised a new linguistically motivated taxonomy of factual errors and conducted a human evaluation on popular dialogue summarization datasets.
We detailed a training strategy, called CONFIT, that improves the factual consistency and overall quality of summaries via novel contrastive fine-tuning. To tackle the top factual error types from our annotation, we introduced an additional contrastive loss with carefully designed hard negative samples, plus a self-supervised, dialogue-specific loss to capture the key information exchanged between speakers. The results show that our model significantly reduces many types of factual errors on both SAMSum dialogue summarization and AMI meeting summarization. On both datasets, we achieve significant improvements over state-of-the-art baselines on the automatic metrics ROUGE and BARTScore, with a 2.2-point increase in ROUGE-1 on SAMSum and a 2.4-point increase on AMI. We also demonstrated the effectiveness of our approach through human evaluation, with a 30 percent improvement on SAMSum and a 15 percent improvement on AMI in human faithfulness scores.
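A minimal numeric sketch can illustrate the contrastive idea: push the model to score the faithful reference summary higher than hard negatives containing planted factual errors (for example, a swapped speaker name). The hinge-style formulation, scores, and margin below are toy assumptions, not CONFIT's exact objective.

```python
# Toy sketch of a contrastive objective in the spirit of CONFIT: the
# reference summary should out-score hard negatives with planted factual
# errors by at least a margin. Scores and margin are illustrative.

def contrastive_loss(pos_score: float, neg_scores: list, margin: float = 1.0) -> float:
    """Hinge loss: penalize negatives scoring within `margin` of the positive."""
    return sum(max(0.0, margin - (pos_score - n)) for n in neg_scores)

# Hypothetical log-likelihood scores a summarization model might assign:
reference_score = -1.2          # faithful reference summary
hard_negatives = [-1.5, -3.0]   # e.g., wrong-speaker and negation errors

loss = contrastive_loss(reference_score, hard_negatives)
# Only the first negative (-1.5) is within the margin of the positive,
# contributing 1.0 - 0.3 = 0.7; the second contributes nothing.
```

In training, a term like this would be added to the standard cross-entropy loss on the reference summary, so the model learns both to generate well and to rank faithful outputs above corrupted ones.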
Scaling with zero or few examples
To reduce the laborious work of manually creating summaries for each new domain, we introduced a generalizable method called WikiTransfer, which fine-tunes pretrained models on pseudo-summaries produced from generic Wikipedia data. Each pseudo-summary reflects characteristics of the target dataset, such as the length and level of abstraction of the desired summaries.
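One simple way to build such pseudo-summaries is to treat an article's opening sentences as the "summary" and the remainder as the "document," choosing how many sentences to take so the pairs match the target dataset's typical summary length. The naive sentence splitter and the parameter name below are simplifications for illustration, not WikiTransfer's exact procedure.

```python
# Hedged sketch of WikiTransfer-style pseudo-summary construction from a
# Wikipedia article. `summary_sents` would be tuned to mimic the target
# dataset's summary length; the period-based splitter is a toy stand-in.

def make_pseudo_example(article: str, summary_sents: int = 2) -> dict:
    sentences = [s.strip() + "." for s in article.split(".") if s.strip()]
    summary = " ".join(sentences[:summary_sents])
    document = " ".join(sentences[summary_sents:])
    return {"document": document, "summary": summary}

wiki_text = (
    "The grey wolf is a large canine. It is native to Eurasia and North "
    "America. Its winter fur is dense. Packs typically hunt large ungulates."
)
example = make_pseudo_example(wiki_text, summary_sents=2)
# Pairs like `example` are used to fine-tune a pretrained model before
# zero- or few-shot transfer to the target domain.
```

Because Wikipedia leads are already abstractive overviews of the body text, pairs built this way give the model domain-agnostic summarization signal without any manual labeling.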
Using WikiTransfer, we achieved a new state-of-the-art result for zero-shot abstractive summarization and demonstrated the effectiveness of our approach on four datasets from diverse domains. With this method, we also achieved better few-shot summarization results than transfer from other summarization datasets. We further improved few-shot performance with data augmentation techniques and introduced a regularization term for few-shot transfer. Human assessments show no significant difference between the WikiTransfer few-shot summaries and fully supervised summaries, demonstrating the efficiency of our approach.
Finding the key points to summarize in longer conversations
Conversations can be long and varied, adding to the challenge of using AI to accurately and concisely summarize what was discussed. Most state-of-the-art summarization models, such as BART or T5, rely on full-attention transformers to effectively capture global information in the input documents. However, applying these models to long inputs is prohibitive due to efficiency constraints: the self-attention mechanism has quadratic complexity with respect to the input length.
Through a series of studies on efficient transformer variants (published in NAACL 2022), we identified a simpler yet still effective architecture that achieves a good trade-off between performance and efficiency on long-text tasks. This architecture augments block-wise attention with pooling operations on the top layers of the transformer encoder. To further improve performance, we pretrained this model on a large dataset of long text sequences constructed from the C4 corpus, using a masked span prediction objective that includes both long and short target spans. Our final model establishes a new state of the art on five summarization tasks. See our new preprint for more details.
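A back-of-the-envelope comparison shows why block-wise attention matters for long transcripts: full self-attention scores every token pair, O(n²), while block-wise attention only scores pairs within a fixed-size block, O(n·b). The token and block counts below are illustrative assumptions, not measured model costs.

```python
# Rough cost comparison of full vs. block-wise self-attention, counting
# query-key score computations. Numbers are illustrative only.

def full_attention_pairs(n: int) -> int:
    """Full self-attention: every token attends to every token."""
    return n * n

def blockwise_attention_pairs(n: int, block: int) -> int:
    """Block-wise attention: tokens attend only within their block."""
    return n * block

n_tokens = 16_384   # e.g., a long meeting transcript
block_size = 512

full_cost = full_attention_pairs(n_tokens)
block_cost = blockwise_attention_pairs(n_tokens, block_size)
speedup = full_cost / block_cost   # n / block = 32x fewer score computations
```

The pooling layers on top of the encoder then compensate for what block-wise attention gives up: they let higher layers mix information across blocks, restoring a coarse-grained global view at low cost.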
Building the future of AI-driven summarization
Natural language processing is advancing at an exciting pace, with recent innovations such as OpenAI’s ChatGPT. However, ChatGPT is not specifically designed for summarization. Our research helps address some of the areas in AI-generated text that warrant further exploration: for conversation summarization to be successful, responses will need to stay faithful to the given source document as context and be able to cover multiple perspectives.
However, there are still additional challenges to be addressed. We are actively working on generalizing zero- and few-shot conversation summarization, using one model for conversation summarization across multiple domains. In addition, we are looking at on-device summarization to preserve user privacy for certain use cases on augmented and virtual reality devices.
We hope our research in this area can help the AI community advance conversation summarization and enable more use cases that bring concise information to people where and when they need it.