Deploying the Qwen2.5-VL-3B-Instruct-AWQ quantized vision-language model locally (Windows)

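In the five months since Qwen2-VL was released, many developers have built new models on top of the Qwen2-VL vision-language models and given us valuable feedback. During this time, we focused on building more useful vision-language models. Today, we are excited to introduce the latest member of the Qwen family: Qwen2.5-VL.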

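Visual understanding: Qwen2.5-VL is not only good at recognizing common objects such as flowers, birds, fish, and insects, it can also efficiently analyze text, charts, icons, graphics, and layouts in images.

Agentic use: Qwen2.5-VL acts directly as a visual agent that can reason and dynamically direct tool use, and it is capable of operating computers and phones.

Long-video understanding and event capture: Qwen2.5-VL can understand videos longer than 1 hour, and it can now capture events by locating the relevant video segments.

Visual localization in multiple formats: Qwen2.5-VL can accurately locate objects in an image by generating bounding boxes or points, and it can provide stable JSON output that includes coordinates and attributes.

Structured output: for scanned data such as invoices and tables, Qwen2.5-VL supports structured output of the content, which is useful in finance, commerce, and similar fields.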

Reference link: https://modelscope.cn/models/Qwen/Qwen2.5-VL-3B-Instruct-AWQ

 

 

Operating system and environment used here

Operating system: Windows 11

CPU: i7-11800H

RAM: 16 GB

GPU: RTX 3050 Ti, 4 GB VRAM

CUDA Version: 12.6
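
The 4 GB of VRAM is the constraint that makes the AWQ checkpoint the practical choice on this machine. The back-of-the-envelope sketch below only compares weight storage. It assumes a nominal 3 billion parameters and 4-bit quantized weights (both are assumptions, not figures from the model card), and it ignores activations, the KV cache, and any layers that are not quantized, so the real footprint is higher.

```python
def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a given precision."""
    return num_params * bits_per_weight / 8 / 1024**3


NOMINAL_PARAMS = 3e9  # assumption: "3B" taken at face value

fp16 = weight_gib(NOMINAL_PARAMS, 16)  # unquantized half-precision weights
int4 = weight_gib(NOMINAL_PARAMS, 4)   # assumption: 4-bit quantized weights

print(f"fp16 weights:  ~{fp16:.1f} GiB")
print(f"4-bit weights: ~{int4:.1f} GiB")
```

Under these assumptions the half-precision weights alone exceed 4 GiB, while the 4-bit weights leave headroom for the rest of the inference state.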

 

Create a virtual environment for Qwen with Anaconda

conda create -n qwen python=3.9
conda activate qwen

 

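Build Qwen2.5-VL support (git must be installed; transformers is built from source with the command below)

For installing git on Windows, see: git的安装与配置教程--超详细版 - 知乎 (a Zhihu tutorial on installing and configuring git)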

pip install git+https://github.com/huggingface/transformers accelerate

 

Install the toolkit that makes it easier to handle different kinds of visual input, including base64, URLs, and interleaved images

pip install qwen-vl-utils[decord]==0.0.8
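
For the base64 input form, the image has to be turned into a string first. The sketch below shows one way to do that with the standard library. It assumes qwen-vl-utils accepts strings that start with `data:image` and contain `base64,` followed by the encoded bytes; `image_to_data_uri` is a hypothetical helper name, not part of the toolkit.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path: str) -> str:
    """Encode a local image file as a base64 data URI."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Usage inside a message (the file name is a placeholder):
# {"type": "image", "image": image_to_data_uri("photo.jpg")}
```

Base64 adds roughly a third to the payload size, so a local path or URL is usually simpler when the file is reachable from the machine running inference.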

 

Install modelscope and download the model

pip install modelscope

Create a download_model.py file in your workspace to download the files. The model downloaded here is Qwen/Qwen2.5-VL-3B-Instruct-AWQ

# Download the model with ModelScope's snapshot_download function

from modelscope import snapshot_download

# Download the Qwen2.5-VL-3B-Instruct-AWQ model
model_dir = snapshot_download(model_id="Qwen/Qwen2.5-VL-3B-Instruct-AWQ", local_dir="models/Qwen2___5-VL-3B-Instruct-AWQ")

# Download the Qwen2.5-VL-3B-Instruct model
# model_dir = snapshot_download("Qwen/Qwen2.5-VL-3B-Instruct")
# Download the Qwen2.5-VL-7B-Instruct model
# model_dir = snapshot_download("Qwen/Qwen2.5-VL-7B-Instruct")
# Download the Qwen2.5-VL-72B-Instruct model
# model_dir = snapshot_download("Qwen/Qwen2.5-VL-72B-Instruct")

print(f"Model downloaded to: {model_dir}")
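
An interrupted download can leave the directory incomplete, and the failure then only shows up when the model is loaded. The sketch below is a quick sanity check, assuming the checkpoint ships a `config.json` and at least one `*.safetensors` weight file; `missing_model_files` is a hypothetical helper, not a ModelScope function.

```python
from pathlib import Path


def missing_model_files(model_dir: str) -> list:
    """Return a list of expected items that are absent from model_dir."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"directory not found: {root}"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json")
    # Assumption: the weights are stored as one or more .safetensors files.
    if not any(root.glob("*.safetensors")):
        problems.append("*.safetensors weights")
    return problems


# An empty list means the basic files are in place.
print(missing_model_files("models/Qwen2___5-VL-3B-Instruct-AWQ"))
```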

 

Install PyTorch 2.3.1 (CUDA 12.1 build; the driver here reports CUDA 12.6)

# Install the dependencies (the CUDA 12.1 wheels also run on a driver that reports CUDA 12.6)
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121

# Check the torch, CUDA, and cuDNN versions
import torch
print(torch.__version__)
print(torch.version.cuda)
print(torch.backends.cudnn.version())

# Check whether a GPU is available
flag = torch.cuda.is_available()
print(flag)

 

Call the Qwen/Qwen2.5-VL-3B-Instruct-AWQ model from code

# Image inference
import torch
from modelscope import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# Use the local path that the download script printed
model_dir="models/Qwen2___5-VL-3B-Instruct-AWQ"
# model_dir = "C:/Users/malin/.cache/modelscope/hub/models/Qwen/Qwen2___5-VL-3B-Instruct-AWQ"

# default: Load the model on the available device(s)
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-3B-Instruct-AWQ", torch_dtype="auto", device_map="auto"
# )
print("Loading model...")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype=torch.float16, device_map="cuda:0"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-3B-Instruct-AWQ",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

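# default processor
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct-AWQ")
print("Loading processor...")
# Limit the image resolution used at inference to balance accuracy and speed.
# By default images are processed at full size, which is slow and needs a lot of VRAM.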
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_dir, min_pixels=min_pixels, max_pixels=max_pixels, use_fast=True,
)
# processor = AutoProcessor.from_pretrained(model_dir, use_fast=True,)

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct-AWQ", min_pixels=min_pixels, max_pixels=max_pixels)

# Local image path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "4cf739bc-effe-4a64-a29b-156dc12a7f73.jpg"},
            {"type": "text", "text": "Read the text on the iron can and output only the number you are most confident in"},
        ],
    }
]

# Image URL
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {
#                 "type": "image",
#                 "image": "http://192.168.31.164/resource/pic_iron2/20250522101928-af4f7146-8dc8-43a0-8554-2f052c2eabe2.jpg",
#             },
#             {"type": "text", "text": "Read the text on the iron can and output only the number you are most confident in"},
#         ],
#     }
# ]

# Base64 encoded image
# messages = [
#     {
#         "role": "user",
#         "content": [
#             {"type": "image", "image": ""},
#             {"type": "text", "text": "Read the text on the iron can and output only the number you are most confident in"},
#         ],
#     }
# ]


# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
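
The min_pixels and max_pixels values in the script are written as multiples of 28 * 28, and its comments describe them as a token range of 256 to 1280, so each visual token corresponds to roughly one 28 x 28 pixel patch. The sketch below estimates how an image would be rescaled under that budget and how many visual tokens result. It is an approximation written for illustration: `estimate_resize` is a hypothetical helper, and the exact rounding used by the processor may differ.

```python
import math

PATCH = 28  # pixels per side covered by one visual token (from the 28 * 28 factor)


def estimate_resize(height, width, min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28):
    """Return (new_height, new_width, visual_tokens) under the pixel budget."""
    # Snap both sides to the nearest multiple of PATCH, keeping at least one patch.
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return h, w, (h // PATCH) * (w // PATCH)


print(estimate_resize(1080, 1920))  # a Full HD frame is scaled down to fit the budget
print(estimate_resize(100, 100))    # a tiny image is scaled up to the minimum
```

Lowering max_pixels reduces the token count and therefore memory use and latency, at the cost of fine detail such as small printed text.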

 

 

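Errors encountered while setting up the environment

1. ImportError: cannot import name 'shard_checkpoint' from 'transformers.modeling_utils'

This happens because transformers was built from source with pip install git+https://github.com/huggingface/transformers accelerate, which installs the latest version (4.53.0 at the time of writing), while the shard_checkpoint function was removed in releases after 4.47.0. Downgrading transformers by hand is not an option either, because it leads to KeyError: 'qwen2_5_vl'. The workaround is to add the function back where the import expects it, in transformers.modeling_utils:

# Add back the shard_checkpoint function, which was removed from transformers after version 4.47.0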
def shard_checkpoint(
    state_dict: Dict[str, torch.Tensor], max_shard_size: Union[int, str] = "10GB", weights_name: str = WEIGHTS_NAME
):
    """
    Splits a model state dictionary in sub-checkpoints so that the final size of each sub-checkpoint does not exceed a
    given size.

    The sub-checkpoints are determined by iterating through the `state_dict` in the order of its keys, so there is no
    optimization made to make each sub-checkpoint as close as possible to the maximum size passed. For example, if the
    limit is 10GB and we have weights of sizes [6GB, 6GB, 2GB, 6GB, 2GB, 2GB] they will get sharded as [6GB], [6+2GB],
    [6+2+2GB] and not [6+2+2GB], [6+2GB], [6GB].

    <Tip warning={true}>

    If one of the model's weight is bigger than `max_shard_size`, it will end up in its own sub-checkpoint which will
    have a size greater than `max_shard_size`.

    </Tip>

    Args:
        state_dict (`Dict[str, torch.Tensor]`): The state dictionary of a model to save.
        max_shard_size (`int` or `str`, *optional*, defaults to `"10GB"`):
            The maximum size of each sub-checkpoint. If expressed as a string, needs to be digits followed by a unit
            (like `"5MB"`).
        weights_name (`str`, *optional*, defaults to `"pytorch_model.bin"`):
            The name of the model save file.
    """
    logger.warning(
        "Note that `shard_checkpoint` is deprecated and will be removed in v4.44. We recommend you using "
        "split_torch_state_dict_into_shards from huggingface_hub library"
    )
    max_shard_size = convert_file_size_to_int(max_shard_size)

    sharded_state_dicts = [{}]
    last_block_size = 0
    total_size = 0
    storage_id_to_block = {}

    for key, weight in state_dict.items():
        # when bnb serialization is used the weights in the state dict can be strings
        # check: https://github.com/huggingface/transformers/pull/24416 for more details
        if isinstance(weight, str):
            continue
        else:
            storage_id = id_tensor_storage(weight)

        # If a `weight` shares the same underlying storage as another tensor, we put `weight` in the same `block`
        if storage_id in storage_id_to_block and weight.device != torch.device("meta"):
            block_id = storage_id_to_block[storage_id]
            sharded_state_dicts[block_id][key] = weight
            continue

        weight_size = weight.numel() * dtype_byte_size(weight.dtype)
        # If this weight is going to tip up over the maximal size, we split, but only if we have put at least one
        # weight in the current shard.
        if last_block_size + weight_size > max_shard_size and len(sharded_state_dicts[-1]) > 0:
            sharded_state_dicts.append({})
            last_block_size = 0

        sharded_state_dicts[-1][key] = weight
        last_block_size += weight_size
        total_size += weight_size
        storage_id_to_block[storage_id] = len(sharded_state_dicts) - 1

    # If we only have one shard, we return it
    if len(sharded_state_dicts) == 1:
        return {weights_name: sharded_state_dicts[0]}, None

    # Otherwise, let's build the index
    weight_map = {}
    shards = {}
    for idx, shard in enumerate(sharded_state_dicts):
        shard_file = weights_name.replace(".bin", f"-{idx+1:05d}-of-{len(sharded_state_dicts):05d}.bin")
        shard_file = shard_file.replace(
            ".safetensors", f"-{idx + 1:05d}-of-{len(sharded_state_dicts):05d}.safetensors"
        )
        shards[shard_file] = shard
        for key in shard.keys():
            weight_map[key] = shard_file

    # Add the metadata
    metadata = {"total_size": total_size}
    index = {"metadata": metadata, "weight_map": weight_map}
    return shards, index
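
Before and after patching, it can help to confirm whether the installed transformers actually exposes the function. The sketch below is a small generic check built on the standard library; `exports_name` is a hypothetical helper name.

```python
import importlib


def exports_name(module_name: str, attr: str) -> bool:
    """Return True if module_name can be imported and defines attr."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself is missing or cannot be imported.
        return False
    return hasattr(module, attr)


# True means the import in the error message would succeed.
print(exports_name("transformers.modeling_utils", "shard_checkpoint"))
```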

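2. ValueError: You are attempting to load an AWQ model with a device_map that contains a CPU or disk device

With device_map="auto", part of the model can be placed on the CPU or disk when VRAM is tight, which AWQ models do not allow. Specify the GPU device explicitly by changing the following code:

# Before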
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype=torch.float16, device_map="auto"
)

# After
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype=torch.float16, device_map="cuda:0"
)
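
To make the error message concrete, the sketch below filters a device map for entries that sit on the CPU or on disk, which is the condition the message describes. The example map and its module names are made up for illustration, and `offloaded_modules` is a hypothetical helper rather than a transformers or accelerate function.

```python
def offloaded_modules(device_map: dict) -> dict:
    """Return the device-map entries that are placed on CPU or disk."""
    return {
        name: device
        for name, device in device_map.items()
        if str(device) in ("cpu", "disk")
    }


# Hypothetical map of the kind automatic placement might produce on a small GPU.
example_map = {
    "visual": 0,
    "model.layers.0": 0,
    "model.layers.35": "cpu",
    "lm_head": "disk",
}
print(offloaded_modules(example_map))
```

An empty result corresponds to the situation that pinning everything to "cuda:0" enforces: every module lives on the GPU.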

 

 
posted @ 2025-05-22 16:17  马铃薯1