SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

1Huawei Noah's Ark Lab, 2The University of Hong Kong,
3Center of Excellence for Generative AI, KAUST, 4Max Planck Institute for Intelligent Systems, Tübingen
Corresponding author

guoxchen@connect.hku.hk, shi.han@huawei.com, li.jiawei@huawei.com

Abstract

Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference speed, due to the quadratic complexity of self-attention. In this work, we identify a noteworthy pattern: certain seemingly meaningless special tokens (i.e., separators) contribute disproportionately to attention scores compared with semantically meaningful tokens. This observation leads us to hypothesize that the information of the segment between these separator tokens can be condensed into the separators themselves without significant loss of information. Based on this hypothesis, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and dropping redundant tokens. In addition, we implement efficient kernels for training acceleration. Experimental results in training-free, training-from-scratch, and post-training settings substantiate the effectiveness of SepLLM. Notably, with the Llama-3-8B backbone, SepLLM achieves a reduction of over 50% in KV cache usage on the GSM8K-CoT benchmark with negligible compromise in performance. Furthermore, in streaming settings, SepLLM delivers consistent and effective language modeling over sequences of up to 4 million tokens or more.

Overview

Figure 1: The training and inference paradigm of SepLLM.
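As illustrated in Figure 1, SepLLM restricts each query to three kinds of keys: the initial (attention-sink) tokens, the separator tokens seen so far, and a window of neighboring tokens; all other tokens are dropped. The snippet below is a minimal PyTorch sketch of such a sparse attention mask; the function name build_sepllm_mask, the default sizes, and the separator vocabulary are illustrative assumptions, not the paper's exact implementation.

import torch

def build_sepllm_mask(input_ids, separator_ids, n_initial=4, n_local=256):
    # Boolean mask (True = key j visible to query i), combining causality
    # with SepLLM's kept-token sets: initial, separator, and neighboring tokens.
    seq_len = input_ids.shape[-1]
    pos = torch.arange(seq_len, device=input_ids.device)
    causal = pos[None, :] <= pos[:, None]                 # j <= i
    initial = pos[None, :] < n_initial                    # attention-sink tokens
    local = (pos[:, None] - pos[None, :]) < n_local       # neighboring tokens
    is_sep = torch.isin(input_ids, torch.as_tensor(       # separator columns
        separator_ids, device=input_ids.device))
    return causal & (initial | local | is_sep[None, :])

For example, with separator_ids covering tokens such as ".", ",", and "\n", every query can always see every preceding separator, so each separator can act as a compressed summary of the segment before it.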


Figure 2: Overall framework of the proposed SepLLM tailored for streaming applications.
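In the streaming setting of Figure 2, the KV cache stays bounded by evicting everything except the initial tokens, a budget of separator tokens, and the most recent tokens. Below is a toy eviction policy under those assumptions; the class and parameter names are hypothetical, and the actual implementation manages per-layer KV tensors rather than a Python list.

class SepLLMStreamingCache:
    # Toy KV-cache manager in the spirit of SepLLM's streaming design:
    # on overflow, keep the first n_init tokens, up to n_sep of the most
    # recent separators, and the last n_local tokens; evict the rest.
    def __init__(self, sep_ids, n_init=4, n_sep=64, n_local=256, max_size=800):
        self.sep_ids = set(sep_ids)
        self.n_init, self.n_sep = n_init, n_sep
        self.n_local, self.max_size = n_local, max_size
        self.entries = []  # (token_id, kv) pairs, oldest first

    def append(self, token_id, kv):
        self.entries.append((token_id, kv))
        if len(self.entries) > self.max_size:
            self._evict()

    def _evict(self):
        head = self.entries[:self.n_init]
        tail = self.entries[-self.n_local:]
        middle = self.entries[self.n_init:-self.n_local]
        seps = [e for e in middle if e[0] in self.sep_ids][-self.n_sep:]
        self.entries = head + seps + tail

Because the retained state never grows beyond a fixed budget, per-token generation cost stays constant regardless of stream length.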

Training-free

Method            GSM8K-CoT   r.KV (%)   MMLU   r.KV (%)
Vanilla                77.3      100.0    65.7     100.0
StrmLLM (n=380)        71.4       47.5    63.4      52.5
StrmLLM (n=256)        68.6       26.0    62.1      37.7
SepLLM (n=256)         77.2       47.4    64.7      44.6

Table 1: Evaluation results and average runtime KV-cache usage (r.KV, relative to vanilla) for training-free experiments on 8-shot GSM8K-CoT and 5-shot MMLU.


Video 1: Comparison of perplexity and running time between SepLLM and other methods.

Training from Scratch

Figure 3: Training loss comparison between the vanilla Transformer and SepLLM.
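When training from scratch, the separator-aware sparsity can be imposed by swapping the dense causal mask for the SepLLM mask. A minimal sketch, reusing the hypothetical build_sepllm_mask defined above with PyTorch's scaled_dot_product_attention (the efficient kernels mentioned in the abstract avoid materializing this dense mask):

import torch.nn.functional as F

def sepllm_attention(q, k, v, input_ids, separator_ids):
    # q, k, v: (batch, heads, seq_len, head_dim); input_ids: (seq_len,)
    mask = build_sepllm_mask(input_ids, separator_ids)  # sketch above
    # A boolean attn_mask of shape (seq_len, seq_len) broadcasts over
    # batch and heads; True entries are allowed to attend.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)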

Post-training

Figure 4: Training loss curves during post-training.

BibTeX

@article{chen2024sepllm,
  title={SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator},
  author={Chen, Guoxuan and Shi, Han and Li, Jiawei and Gao, Yihang and Ren, Xiaozhe and Chen, Yimeng and Jiang, Xin and Li, Zhenguo and Liu, Weiyang and Huang, Chao},
  journal={arXiv preprint arXiv:2412.12094},
  year={2024}
}