[Study notes] CountVid: using a VLM to count instances of a specified entity in a video

1. Project and environment overview

The CountVid project counts the instances of a specified entity category in a video.

[Figure: CountVid results]

Original repository: https://github.com/niki-amini-naieni/CountVid.git
My modified fork: https://github.com/freshmakerzhao/CountVid.git
- I recommend using the modified fork here
Paper: https://arxiv.org/abs/2506.15368
Environment: Ubuntu + conda + CUDA 12.8

2. Environment setup

2.1 System environment

Install the C/C++ toolchain

sudo apt update
sudo apt install build-essential
sudo apt install gcc-11 g++-11

Install CUDA

# Visit https://developer.nvidia.com/cuda-downloads
# Pick the CUDA release that matches your machine and install it

2.2 Detailed configuration

Create the virtual environment

conda create -n countvid python=3.10
conda activate countvid
conda install -c conda-forge gxx_linux-64 compilers libstdcxx-ng

Install SAM 2
- Segment Anything Model 2
- Used to segment the individual objects in an image

git clone https://github.com/facebookresearch/sam2.git && cd sam2
pip install -e .
cd ..

Clone and set up CountVid

git clone https://github.com/freshmakerzhao/CountVid.git
cd CountVid
pip install -r requirements.txt

# Dependency conflicts may show up here; resolve them as the messages suggest. For reference, this combination worked:
pip install torch==2.5.1 torchvision==0.20.1
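
After the install, it can help to confirm which PyTorch build ended up in the environment and whether it can see the GPU. The snippet below is a minimal sketch: `describe_torch` is a helper name made up for illustration, and it relies only on `torch.__version__`, `torch.version.cuda`, and `torch.cuda.is_available()`.

```python
def describe_torch():
    """Return a small dict describing the installed PyTorch build, if any."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return {
            "installed": False,
            "version": None,
            "cuda_build": None,
            "cuda_available": False,
        }
    return {
        "installed": True,
        "version": torch.__version__,          # e.g. the 2.5.1 pinned above
        "cuda_build": torch.version.cuda,      # CUDA version the wheel was built for (None on CPU builds)
        "cuda_available": bool(torch.cuda.is_available()),
    }


if __name__ == "__main__":
    for key, value in describe_torch().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False on a machine with a GPU, the installed wheel and the driver/CUDA setup probably do not match.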

Build and install GroundingDINO

# Inside the CountVid directory
export CC=/usr/bin/gcc-11
cd models/GroundingDINO/ops
python setup.py build install
python test.py # should print True six times

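Build and install detectron2
- An industrial-grade deep learning framework for object detection, segmentation, and pose estimation

# Go back to the directory that contains CountVid (from models/GroundingDINO/ops that is: cd ../../../..)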
git clone https://github.com/facebookresearch/detectron2.git
cd detectron2
python -m pip install -e . --no-build-isolation
# An iopath conflict may be reported here; it can be ignored
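
Once these installs have finished, a quick way to see whether they landed in the active conda environment is to look the packages up by import name. This is only a sketch: `missing_modules` is a hypothetical helper, and the import names `sam2` and `detectron2` are assumed to match what the two repositories install.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the packages installed in section 2.2.
    wanted = ["torch", "torchvision", "sam2", "detectron2"]
    missing = missing_modules(wanted)
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All packages found.")
```

The check only locates the modules; it does not import them, so a broken build can still fail later at import time.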

2.3 Download the pretrained weights

Download the CountGD-Box pretrained model

# Inside the CountVid directory
mkdir checkpoints
python download_bert.py

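# Download https://drive.google.com/file/d/1bw-YIS-Il5efGgUqGVisIZ8ekrhhf_FD/view?usp=sharing
# This gives countgd_box.pth
# Put it in CountVid/checkpoints (the cp below is run from the directory that contains CountVid)
cp countgd_box.pth CountVid/checkpoints

# Alternatively, download it from the command line with gdown (run inside CountVid)
pip install gdown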
gdown --id 1bw-YIS-Il5efGgUqGVisIZ8ekrhhf_FD -O checkpoints/

Download the SAM 2 weights

# Download https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt
# This gives sam2.1_hiera_large.pt
# Put it in CountVid/checkpoints
cp sam2.1_hiera_large.pt CountVid/checkpoints
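
Before running anything, it is worth checking that both weight files are where the later commands expect them. A minimal sketch, assuming the two file names used in this section; `missing_checkpoints` is a hypothetical helper, and the BERT files fetched by download_bert.py are not checked because their names are not listed here.

```python
from pathlib import Path

# File names taken from this section.
EXPECTED_CHECKPOINTS = ("countgd_box.pth", "sam2.1_hiera_large.pt")


def missing_checkpoints(checkpoint_dir, expected=EXPECTED_CHECKPOINTS):
    """Return the expected weight files that are absent from checkpoint_dir."""
    directory = Path(checkpoint_dir)
    return [name for name in expected if not (directory / name).is_file()]


if __name__ == "__main__":
    # Run from inside the CountVid directory.
    missing = missing_checkpoints("checkpoints")
    print("Missing:", missing if missing else "none")
```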

Download the test data

# https://drive.google.com/drive/folders/1v4RNNBHYEQQ82NF8fNiRPhIdQ96-7xCs?usp=sharing
# This gives 94 penguin photos, which are frames sampled from a single video, roughly one every 15 s
# Put them in CountVid/demo/

3. Running

Test data:
https://drive.google.com/drive/folders/1v4RNNBHYEQQ82NF8fNiRPhIdQ96-7xCs
The download is a zip archive; I extracted it into CountVid/test_data/demo.
Supported image extensions: ".jpg", ".jpeg", ".JPG", ".JPEG"
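
To check that a frame directory actually contains usable images, the extension list above can be applied directly. This is a sketch, not CountVid's own loader: `list_frames` is a hypothetical helper, it matches the four extensions exactly as listed (so ".Jpg" would be skipped), and sorting by file name is only for readable output.

```python
from pathlib import Path

# The four extensions listed above, matched case-sensitively.
SUPPORTED_EXTS = {".jpg", ".jpeg", ".JPG", ".JPEG"}


def list_frames(video_dir):
    """Return the image files in video_dir whose extension is in SUPPORTED_EXTS, sorted by name."""
    return sorted(
        p for p in Path(video_dir).iterdir()
        if p.is_file() and p.suffix in SUPPORTED_EXTS
    )


if __name__ == "__main__":
    frame_dir = Path("test_data/demo")
    if frame_dir.is_dir():
        print(f"{len(list_frames(frame_dir))} usable frames in {frame_dir}")
    else:
        print(f"{frame_dir} does not exist")
```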

3.1 Using a text prompt

Detect the objects named by the prompt and generate the output video

python count_in_videos.py \
  --video_dir test_data/demo \
  --input_text "penguin" \
  --sam_checkpoint checkpoints/sam2.1_hiera_large.pt \
  --sam_model_cfg configs/sam2.1/sam2.1_hiera_l.yaml \
  --obj_batch_size 30 \
  --img_batch_size 10 \
  --downsample_factor 1 \
  --pretrain_model_path checkpoints/countgd_box.pth \
  --temp_dir ./demo_temp \
  --output_dir ./test_output \
  --save_final_video \
  --save_countgd_video
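
If the same run has to be repeated for several frame directories, the command above can be assembled from Python. A sketch, assuming it is launched from the CountVid root so the relative paths resolve; `build_prompt_command` is a hypothetical helper, while the flags and values are copied from the command above.

```python
import shlex


def build_prompt_command(video_dir, input_text, output_dir):
    """Assemble the text-prompt invocation of count_in_videos.py as an argument list."""
    return [
        "python", "count_in_videos.py",
        "--video_dir", str(video_dir),
        "--input_text", input_text,
        "--sam_checkpoint", "checkpoints/sam2.1_hiera_large.pt",
        "--sam_model_cfg", "configs/sam2.1/sam2.1_hiera_l.yaml",
        "--obj_batch_size", "30",
        "--img_batch_size", "10",
        "--downsample_factor", "1",
        "--pretrain_model_path", "checkpoints/countgd_box.pth",
        "--temp_dir", "./demo_temp",
        "--output_dir", str(output_dir),
        "--save_final_video",
        "--save_countgd_video",
    ]


if __name__ == "__main__":
    cmd = build_prompt_command("test_data/demo", "penguin", "./test_output")
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Keeping the arguments as a list avoids shell quoting problems when the prompt contains spaces.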

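3.2 Using an exemplar image together with a prompt

Detect the objects shown in an exemplar image and generate the output video
- exemplar_image_file: the exemplar image, which contains the entity you want to detect
- exemplar_file: a JSON description file giving the position of the target entity within exemplar_image_file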

python count_in_videos.py \
  --video_dir test_data/demo_03 \
  --use_exemplars \
  --input_text "A small X-shaped symbol" \
  --exemplar_image_file test_data/demo_03_exa/1.jpg \
  --exemplar_file test_data/demo_03_exa/bbox.json \
  --sam_checkpoint checkpoints/sam2.1_hiera_large.pt \
  --sam_model_cfg configs/sam2.1/sam2.1_hiera_l.yaml \
  --obj_batch_size 30 \
  --img_batch_size 10 \
  --downsample_factor 1 \
  --pretrain_model_path checkpoints/countgd_box.pth \
  --temp_dir ./demo_temp \
  --output_dir ./test_output/demo_03 \
  --save_final_video \
  --save_countgd_video

How to create exemplar_file
- pip install labelimg
- Annotate with labelimg to produce an XML file, then manually copy the bndbox values from it into the JSON, as follows

{
    "exemplars": [
        [960, 43, 976, 58]
    ]
}
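
The manual copy step can also be scripted. The sketch below assumes labelimg saved its annotations in Pascal VOC XML and that each box in the JSON is ordered [xmin, ymin, xmax, ymax], which is consistent with the example above but is not confirmed here; `voc_xml_to_exemplars` and the sample XML are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down Pascal VOC annotation used only for this demonstration.
SAMPLE_XML = """
<annotation>
  <object>
    <name>x_symbol</name>
    <bndbox><xmin>960</xmin><ymin>43</ymin><xmax>976</xmax><ymax>58</ymax></bndbox>
  </object>
</annotation>
"""


def voc_xml_to_exemplars(xml_text):
    """Collect every <bndbox> in a Pascal VOC annotation as [xmin, ymin, xmax, ymax]."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        if bndbox is None:
            continue  # skip objects without a bounding box
        boxes.append(
            [int(float(bndbox.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")]
        )
    return {"exemplars": boxes}


if __name__ == "__main__":
    print(json.dumps(voc_xml_to_exemplars(SAMPLE_XML), indent=4))
```

For a real annotation, read the XML file's text and write the result to bbox.json with `json.dump`.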

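3.3 Building a test set

The project takes a dataset of image frames as input for testing or video generation, so here we use ffmpeg to split a video into frames.

The example below was written on Windows; other platforms work similarly: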

import argparse
import subprocess
from pathlib import Path
import sys


def extract_frames_with_ffmpeg(
    video_dir: Path,
    interval_seconds: float,
    image_ext: str = "jpg",
    ffmpeg_path: str = "ffmpeg",
):
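    """
    Extract frames with ffmpeg at a fixed time interval (in seconds).

    :param video_dir: directory containing the video files (Path)
    :param interval_seconds: sampling interval in seconds; any positive number such as 1, 5, 0.5, or 0.3
    :param image_ext: output image format (e.g. jpg / png)
    :param ffmpeg_path: ffmpeg executable name or absolute path (usually ffmpeg.exe on Windows)
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be > 0")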

    video_dir = video_dir.resolve()
    if not video_dir.exists():
        print(f"[ERROR] video_dir does not exist: {video_dir}")
        sys.exit(1)

    # Check that ffmpeg can be executed
    try:
        subprocess.run(
            [ffmpeg_path, "-version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
    except FileNotFoundError:
        print(f"[ERROR] ffmpeg not found at: {ffmpeg_path}")
        print("Make sure ffmpeg is installed and on the system PATH, or pass the correct ffmpeg_path")
        sys.exit(1)
    except subprocess.CalledProcessError:
        print(f"[ERROR] ffmpeg failed to run, please check the installation: {ffmpeg_path}")
        sys.exit(1)

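    # Supported video extensions (add or remove as needed)
    video_exts = {".mp4", ".avi", ".mov", ".mkv", ".flv", ".wmv"}

    # Only regular files count; a sub-directory named like "clip.mp4" is skipped
    videos = [p for p in video_dir.iterdir() if p.is_file() and p.suffix.lower() in video_exts]
    if not videos:
        print(f"[INFO] No video files found in: {video_dir}")
        return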

    # Convert the interval into an fps filter expression
    # one frame per second:    fps=1
    # one frame every 5 s:     fps=1/5
    # one frame every 0.3 s:   fps=1/0.3 ≈ 3.3333
    if interval_seconds >= 1:
        fps_expr = f"fps=1/{interval_seconds:.6f}"
    else:
        fps_expr = f"fps={1.0 / interval_seconds:.6f}"

    print(f"[INFO] Video directory: {video_dir}")
    print(f"[INFO] Sampling interval: {interval_seconds} s  →  filter: \"{fps_expr}\"")
    print(f"[INFO] Output image format: .{image_ext}")
    print()

    for video_path in videos:
        name = video_path.stem  # file name without its extension
        out_dir = video_dir / name
        out_dir.mkdir(parents=True, exist_ok=True)

        # Output file name pattern, e.g. videoName_00001.jpg
        output_pattern = str(out_dir / f"{name}_%05d.{image_ext}")

        print(f"    Processing video: {video_path.name}")
        print(f"    Output directory: {out_dir}")

        cmd = [
            ffmpeg_path,
            "-y",                   # overwrite existing output files without asking
            "-i", str(video_path),  # input video
            "-vf", fps_expr,        # sample at the requested interval
            output_pattern,         # output image pattern
        ]

        # Print the command to make debugging easier (can be commented out)
        print("    Running command:", " ".join(cmd))

        try:
            subprocess.run(cmd, check=True)
        except subprocess.CalledProcessError as e:
            print(f"    [ERROR] ffmpeg failed: {e}")
        else:
            print(f"    [DONE] {video_path.name} → {out_dir}\n")

    print("All videos processed.")

Usage

video_dir = Path("./files/") # directory containing the video files
interval_seconds = 0.5 # sampling interval in seconds
image_ext = "jpg"  # output image format
ffmpeg_path = "D:\\Application_ffmpeg\\bin\\ffmpeg.exe"  # path to the ffmpeg executable
extract_frames_with_ffmpeg(
    video_dir=video_dir,
    interval_seconds=interval_seconds,
    image_ext=image_ext,
    ffmpeg_path=ffmpeg_path
)
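
When choosing interval_seconds, it helps to estimate how many frames a video will yield. The helper below is a rough sketch (`expected_frame_count` is a made-up name); ffmpeg's actual output can differ by a frame or so depending on timestamps and rounding. As a sanity check against the test data from section 2.3, 94 frames at about 15 s apart corresponds to a video of roughly 1410 s, or about 23.5 minutes.

```python
def expected_frame_count(duration_seconds, interval_seconds):
    """Roughly estimate how many frames are extracted from a video of the given length."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be > 0")
    return round(duration_seconds / interval_seconds)


if __name__ == "__main__":
    # About 23.5 minutes of video sampled every 15 s.
    print(expected_frame_count(1410, 15))
    # A 10 s clip sampled every 0.5 s, as in the usage example above.
    print(expected_frame_count(10, 0.5))
```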

Posted 2025-11-16 22:40 by 小拳头呀
