arxiv:2410.00531

TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices

Published on Oct 1, 2024
· Submitted by AK on Oct 2, 2024
#2 Paper of the day

Abstract

A tensor parallel inference system named TPI-LLM improves large language model (LLM) inference on edge devices by minimizing memory usage and reducing latency.

AI-generated summary

Large model inference is shifting from cloud to edge due to concerns about the privacy of user interaction data. However, edge devices often struggle with limited computing power, memory, and bandwidth, requiring collaboration across multiple devices to run and speed up LLM inference. Pipeline parallelism, the mainstream solution, is inefficient for single-user scenarios, while tensor parallelism struggles with frequent communications. In this paper, we argue that tensor parallelism can be more effective than pipeline parallelism on low-resource devices, and present a compute- and memory-efficient tensor parallel inference system, named TPI-LLM, to serve 70B-scale models. TPI-LLM keeps sensitive raw data local on users' devices and introduces a sliding window memory scheduler to dynamically manage layer weights during inference, with disk I/O latency overlapped with computation and communication. This allows larger models to run smoothly on memory-limited devices. We analyze the communication bottleneck and find that link latency, not bandwidth, is the main issue, so a star-based allreduce algorithm is implemented. Through extensive experiments on both emulated and real testbeds, TPI-LLM demonstrated over 80% lower time-to-first-token and token latency compared to Accelerate, and over 90% lower compared to Transformers and Galaxy, while cutting the peak memory footprint of Llama 2-70B by 90%, requiring only 3.1 GB of memory for 70B-scale models.
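To make the latency argument concrete, here is a minimal Python sketch of a star-shaped allreduce together with a rough cost comparison against a textbook ring allreduce. It is an illustration under stated assumptions, not TPI-LLM's implementation: the function names, the hub choice, and the simple latency-plus-size/bandwidth cost model are all hypothetical, and the star estimate assumes the hub can use its links in parallel.

```python
from typing import List


def star_allreduce(vectors: List[List[float]], hub: int = 0) -> List[List[float]]:
    """Sum one vector per device through a central hub; return each device's copy."""
    # Gather phase: every non-hub device sends its vector to the hub, which accumulates.
    total = list(vectors[hub])
    for rank, vec in enumerate(vectors):
        if rank != hub:
            total = [a + b for a, b in zip(total, vec)]
    # Broadcast phase: the hub sends the reduced vector back to every device.
    return [list(total) for _ in vectors]


def star_time(size_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """Two sequential hops (gather, broadcast). Assumes hub links transfer in parallel."""
    return 2 * (latency_s + size_bytes / bandwidth_bytes_per_s)


def ring_time(n: int, size_bytes: float, latency_s: float,
              bandwidth_bytes_per_s: float) -> float:
    """Textbook ring allreduce: 2*(n-1) sequential steps, each moving size/n bytes."""
    return 2 * (n - 1) * (latency_s + size_bytes / (n * bandwidth_bytes_per_s))


if __name__ == "__main__":
    # Hypothetical setting: 8 devices, 32 KiB payload, 5 ms link latency, 12.5 MB/s links.
    n, size, latency, bandwidth = 8, 32 * 1024, 0.005, 12.5e6
    print("star:", star_time(size, latency, bandwidth))
    print("ring:", ring_time(n, size, latency, bandwidth))
```

Under this model the ring pays the link latency 2(n-1) times per allreduce while the star pays it twice, which is why a star layout can win when payloads are small and latency dominates. With large payloads or a bandwidth-constrained hub the comparison can reverse.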

Community

Paper submitter

Nice work!


Hey, this paper was a great read. We wrote a blog post summarizing this paper and a few others:

  1. TPI-LLM
  2. Differential Transformer
  3. ARIA

You can find it here. Please give it a read :)

Thank you for the great summary. It's clear and concise. Actually we are continuing to develop it, and our current results show that token latency can be reduced to less than one second. We will let you know once we open source the new work.

Paper author

Want to run larger LLMs with llama.cpp but hit hardware limits? Have multiple devices lying around but not sure how to use them for collaborative inference? If yes, try our new work prima.cpp!

Prima.cpp is a distributed implementation of llama.cpp. It lets you use multiple everyday home devices to run larger models, even 70B! It inherits great features from llama.cpp like mmap to avoid OOM, and adds more features like piped-ring parallelism, prefetching, and automatic workload distribution to make distributed inference faster.

Give it a try and unlock the full power of your devices! 🖥️ 💻 📱
