Llama 2 benchmarks
Introduction
Here we re-evaluate the Llama 2 benchmarks to verify its reported performance.
Datasets
In this blog, we'll test the datasets shown in the images below.
The first image shows the benchmark table for Llama2-70B from the Llama 2 paper.
The benchmark categories (as described in the paper) and the download links for each dataset are listed below; short sketches of the category averaging and of loading a dataset follow the respective lists.
- Code. We report the average pass@1 scores of our models on HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021).
- Commonsense Reasoning. We report the average of PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019a), WinoGrande (Sakaguchi et al., 2021), ARC easy and challenge (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), and CommonsenseQA (Talmor et al., 2018). We report 7-shot results for CommonSenseQA and 0-shot results for all other benchmarks.
- World Knowledge. We evaluate the 5-shot performance on NaturalQuestions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017) and report the average.
- Reading Comprehension. For reading comprehension, we report the 0-shot average on SQuAD (Rajpurkar et al., 2018), QuAC (Choi et al., 2018), and BoolQ (Clark et al., 2019).
- MATH. We report the average of the GSM8K (8 shot) (Cobbe et al., 2021) and MATH (4 shot) (Hendrycks et al., 2021) benchmarks at top 1.
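The per-category numbers in the table are plain averages over the listed benchmarks. A minimal sketch of that aggregation for the Commonsense Reasoning entry, using made-up accuracies (not real results):

def category_average(scores: dict[str, float]) -> float:
    """Average the per-task scores that form one category of the table."""
    return sum(scores.values()) / len(scores)

# Placeholder accuracies, NOT real Llama 2 numbers, just to show the arithmetic.
commonsense = {
    "piqa": 0.80, "siqa": 0.50, "hellaswag": 0.85, "winogrande": 0.80,
    "arc_easy": 0.80, "arc_challenge": 0.57, "openbookqa": 0.60, "commonsense_qa": 0.78,
}
print(f"Commonsense Reasoning: {category_average(commonsense) * 100:.1f}")

Dataset download links: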
MMLU: address
TriviaQA: huggingface address1 | huggingface address2 | github official code
GSM8K: huggingface address
HumanEval: huggingface address
BIG-Bench Hard (BBH): huggingface address
HellaSwag: huggingface address
NQ (Natural Questions): github address | huggingface address
MBPP: huggingface address
PIQA: huggingface address
SIQA: huggingface address
ARC: huggingface address
WinoGrande: huggingface address
OpenBookQA: huggingface address
CommonsenseQA: huggingface address
SQuAD: huggingface address (SQuADv2)
QuAC: huggingface address
BoolQ: huggingface address
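Most of these are on the Hugging Face Hub and can be pulled with the datasets library. A minimal sketch, assuming the standard hub IDs (gsm8k with its "main" config, hellaswag with no config), which may differ from the exact links above:

from datasets import load_dataset

# GSM8K ships a "main" config; HellaSwag has train/validation/test splits.
gsm8k = load_dataset("gsm8k", "main", split="test")
hellaswag = load_dataset("hellaswag", split="validation")

print(gsm8k[0]["question"])   # grade-school math word problem
print(hellaswag[0]["ctx"])    # sentence-completion context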
Metrics
Model conversion
Here we convert the official Llama 2 weights into the Hugging Face (HF) format.
The 7B model has already been converted; see here.
For 13B:
click to view the command
# /home/ludaze/Docker/Llama/llama
cd llama-2-13b
mkdir 13B
# move the checkpoint files into 13B/ (mv will warn that 13B can't be moved into itself; that's fine)
mv ./* ./13B
cp ../tokenizer.model ../tokenizer_checklist.chk .
cd ..
python convert_llama_weights_to_hf.py --input_dir llama-2-13b --model_size 13B --output_dir models_hf/13B
For 70B:
click to view the command
# /home/ludaze/Docker/Llama/llama
cd llama-2-70b
mkdir 70B
mv ./checklist.chk consolidated.00.pth consolidated.01.pth consolidated.02.pth consolidated.03.pth consolidated.04.pth consolidated.05.pth consolidated.06.pth consolidated.07.pth params.json ./70B
cp ../tokenizer.model ../tokenizer_checklist.chk .
cd ..
python convert_llama_weights_to_hf.py --input_dir llama-2-70b --model_size 70B --output_dir models_hf/70B
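After conversion it's worth a quick sanity check that the HF checkpoint loads and generates. A minimal sketch with transformers (paths match the output directories above; device_map="auto" is my choice here so the 70B model can spread across the available GPUs):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "models_hf/13B"  # or models_hf/70B
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(out[0], skip_special_tokens=True))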
Convert Mixtral into HF format
Use lm-evaluation-harness
lm-evaluation-harness is a benchmarking framework that supports many model backends and evaluation datasets.
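Besides the CLI used throughout this post, lm-evaluation-harness (v0.4+) also exposes a Python entry point, lm_eval.simple_evaluate. A rough sketch of the same kind of run as the CLI commands below (argument names may vary slightly between harness versions):

import lm_eval

# HF backend, local converted checkpoint, MMLU with 5 few-shot examples.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=../llama/models_hf/7B",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
    device="cuda:1",
)
print(results["results"]["mmlu"])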
mmlu (5-shot)
Environment: lm_evaluation (Python 3.10)
Single GPU
click to view command
# 0-shot
lm_eval --model hf \
--model_args pretrained=../llama/models_hf/7B \
--tasks mmlu \
--device cuda:1 \
--batch_size 8
# 5-shot
lm_eval --model hf \
--model_args pretrained=../llama/models_hf/7B \
--tasks mmlu \
--num_fewshot 5 \
--device cuda:1 \
--batch_size 8
Multiple GPUs
click to view the command
# this command doesn't work
accelerate launch -m lm_eval --model hf \
--model_args pretrained=../llama/models_hf/7B \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 32
# use vllm, but it only works on some tasks
lm_eval --model vllm \
--model_args pretrained=../llama/models_hf/70B,tensor_parallel_size=4,dtype=auto,gpu_memory_utilization=0.8,data_parallel_size=2 \
--tasks mmlu \
--num_fewshot 5 \
--batch_size auto
# use lm_eval and set parallelize=True
lm_eval --model hf --model_args pretrained=../llama/models_hf/70B,parallelize=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 4
0-shot
5-shot
13B (5-shot)
triviaqa (1-shot)
Single GPU
click to view command
lm_eval --model hf \
--model_args pretrained=../llama/models_hf/7B \
--tasks triviaqa \
--num_fewshot 1 \
--device cuda:2 \
--batch_size 8
1-shot (F1)
1-shot (perplexity)
0-shot (acc), 7B
1-shot (acc), 7B
1-shot (acc), 13B
1-shot, 13B
click to view the code
lm_eval --model vllm \
--model_args pretrained=../llama/models_hf/13B,tensor_parallel_size=1,dtype=auto,gpu_memory_utilization=0.99,data_parallel_size=1 \
--tasks triviaqa \
--num_fewshot 1 \
--device cuda:0 \
--batch_size auto
5-shot (13B)
gsm8k (8-shot)
Single GPU
click to view command
lm_eval --model hf \
--model_args pretrained=../llama/models_hf/7B \
--tasks gsm8k \
--num_fewshot 8 \
--device cuda:3 \
--batch_size 8
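For context on what this score means: GSM8K gold answers end with "#### <number>", and as far as I can tell the harness extracts the model's final number and scores it with exact match against that value. A minimal sketch of the gold-answer extraction:

def gsm8k_gold(answer: str) -> str:
    # GSM8K gold answers end with "#### <number>"; strip commas for comparison.
    return answer.split("####")[-1].strip().replace(",", "")

example = "Natalia sold 48/2 = 24 clips in May.\n#### 72"
assert gsm8k_gold(example) == "72"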
13B (8-shot)
bigbench
click to view the code
lm_eval --model hf \
--model_args pretrained=../llama/models_hf/7B \
--tasks bigbench \
--num_fewshot 3 \
--device cuda:0 \
--batch_size 8
bbh (3-shot)
click to view the code
lm_eval --model hf \
--model_args pretrained=../llama/models_hf/7B \
--tasks bbh \
--num_fewshot 3 \
--device cuda:0 \
--batch_size 8
7B exact-match
70B
There may be a problem here: the parameter was set to 3-shot, but the results table shows 0-shot.
Mistral-7B
piqa
70B
siqa
70B
hellaswag
Command for the four tasks (hellaswag, openbookqa, arc_easy, winogrande)
click to view the command
#7B
lm_eval --model hf \
--model_args pretrained=../llama/models_hf/7B \
--tasks hellaswag,openbookqa,arc_easy,winogrande \
--num_fewshot 0 \
--device cuda:1 \
--batch_size 8
#13B
lm_eval --model hf \
--model_args pretrained=../llama/models_hf/13B \
--tasks hellaswag,openbookqa,arc_easy,winogrande \
--num_fewshot 0 \
--device cuda:2 \
--batch_size 8
#70B
lm_eval --model vllm \
--model_args pretrained=../llama/models_hf/70B,tensor_parallel_size=4,dtype=auto,gpu_memory_utilization=0.99,data_parallel_size=1 \
--tasks hellaswag,openbookqa,arc_easy,winogrande \
--num_fewshot 0 \
--batch_size auto
70B
OpenCompass (7B):
python run.py --models hf_llama_7b --datasets hellaswag_clean_ppl
openbookqa
7B
70B
arc_easy
7B
70B
winogrande
7B
70B
anli
7B
squad2
7B
13B
Others
click to view the code
lm_eval --model hf \
--model_args pretrained=../llama/models_hf/7B \
--tasks anli,arithmetic,asdiv,babi,belebele,blimp,cmmlu \
--num_fewshot 5 \
--device cuda:1 \
--batch_size 8