The AI Era: Running Large Language Models Locally with vLLM

https://docs.vllm.ai/en/latest/index.html
A high-throughput, memory-efficient inference and serving engine for LLMs (quickly set up a local large model, with an OpenAI-compatible API)

vLLM is a fast and easy-to-use library for LLM inference and serving.
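Below is a minimal sketch of offline batched inference with vLLM's Python API, assuming vLLM is installed (pip install vllm); facebook/opt-125m is used only as a small placeholder model and can be replaced by any model from the supported-models list linked further down.

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# Sampling parameters: temperature and nucleus (top-p) sampling for generation.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load the model locally and run batched generation over all prompts at once.
llm = LLM(model="facebook/opt-125m")  # placeholder model, swap in your own
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```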

vLLM is fast with:

State-of-the-art serving throughput

Efficient management of attention key and value memory with PagedAttention

Continuous batching of incoming requests

Fast model execution with CUDA/HIP graph

Quantization: GPTQ, AWQ, SqueezeLLM, FP8 KV Cache

Optimized CUDA kernels

vLLM is flexible and easy to use with:

Seamless integration with popular HuggingFace models

High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more

Tensor parallelism support for distributed inference

Streaming outputs

OpenAI-compatible API server (see the client sketch after this list)

Support for NVIDIA and AMD GPUs

(Experimental) Prefix caching support

(Experimental) Multi-LoRA support
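
A minimal client sketch against the OpenAI-compatible server, assuming the server has been started as shown in the comments; mistralai/Mistral-7B-Instruct-v0.1 is only an example, and any supported HuggingFace model name works in its place.

```python
# Start the server first (shell), e.g.:
#   python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.1
# Add --tensor-parallel-size 2 to shard the model across 2 GPUs (tensor parallelism).
# The server listens on http://localhost:8000/v1 by default.
from openai import OpenAI

# A local vLLM server does not check the API key, but the client library requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",  # must match the served model name
    messages=[{"role": "user", "content": "Introduce vLLM in one sentence."}],
    stream=True,  # streaming output, as listed above
)

# Print the streamed tokens as they arrive.
for chunk in completion:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()
```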

Supported open-source models:
https://docs.vllm.ai/en/latest/models/supported_models.html

posted @ 2024-03-05 22:41  iTech