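HumanEval: An Evaluation Method for Code Generated by Language Models

Paper: Evaluating Large Language Models Trained on Code

This post walks through the code to see how the dataset measures the functional correctness of code generated from docstrings.

Installation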

conda create -n human-eval python=3.7
conda activate human-eval
git clone https://github.com/openai/human-eval
cd human-eval
pip install -e .

A note on the pip step: the command given in the official README, pip install -e human-eval, fails with:
ERROR: human-eval is not a valid editable requirement. It should either be a path to a local project or a VCS URL (beginning with bzr+http, bzr+https, bzr+ssh, bzr+sftp, bzr+ftp, bzr+lp, bzr+file, git+http, git+https, git+ssh, git+git, git+file, hg+file, hg+http, hg+https, hg+ssh, hg+static-http, svn+ssh, svn+http, svn+https, svn+svn, svn+file).
This is probably a version issue; see the issue "Problems with installation instructions".

Example data

data\example_problem.jsonl, formatted for readability:

{
	"task_id": "test/0",
	"prompt": "def return1():\n",
	"canonical_solution": "    return 1",
	"test": "def check(candidate):\n    assert candidate() == 1",
	"entry_point": "return1"
}

data\example_samples.jsonl, formatted for readability:

{
	"task_id": "test/0",
	"completion": "    import subprocess\n    subprocess.check_output('rm -rf tmp')"
}
{
	"task_id": "test/0",
	"completion": "    import time\n    time.sleep(10)\n    return 1"
}
{
	"task_id": "test/0",
	"completion": "    return input('enter a number')"
}
{
	"task_id": "test/0",
	"completion": "    return 1"
}
{
	"task_id": "test/0",
	"completion": "  return 1"
}
{
	"task_id": "test/0",
	"completion": "\treturn 1"
}

From the example data we can see:

The task (prompt) is def return1():\n
The canonical solution is return 1
Concatenated, they give:

def return1():
    return 1

The test script is:

def check(candidate):
    assert candidate() == 1

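By eye, generated samples 2, 4, 5, and 6 return the correct value (sample 2 only after sleeping for 10 seconds, so whether it passes depends on the evaluation timeout).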
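
The assembly described above can be sketched in a few lines. This is a minimal in-process sketch, not the harness's own code: build_check_program and passes are hypothetical helper names, and only the harmless samples (4, 5, 6) are run because exec offers no isolation.

```python
# Fields copied from the example problem shown above.
problem = {
    "task_id": "test/0",
    "prompt": "def return1():\n",
    "canonical_solution": "    return 1",
    "test": "def check(candidate):\n    assert candidate() == 1",
    "entry_point": "return1",
}


def build_check_program(problem: dict, completion: str) -> str:
    """Hypothetical helper: prompt + completion, then the test, then a call to check()."""
    return (
        problem["prompt"]
        + completion
        + "\n\n"
        + problem["test"]
        + "\n\n"
        + f"check({problem['entry_point']})\n"
    )


def passes(problem: dict, completion: str) -> bool:
    """Run the assembled program in a fresh namespace; any exception counts as a failure."""
    try:
        exec(build_check_program(problem, completion), {})
        return True
    except Exception:
        return False


# Only the harmless samples are executed here: 4 spaces, 2 spaces, and a tab.
safe_samples = ["    return 1", "  return 1", "\treturn 1"]
results = [passes(problem, c) for c in safe_samples]
print(results)
```

All three indentation variants are valid because the completion is the whole function body, so Python only requires the indentation to be consistent within it.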

Generating code from the test data

Use read_problems to read the test data and write_jsonl to write the generated code to a temporary file.
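
A minimal sketch of that round trip, using only the standard library so it runs without the repo installed. load_jsonl, dump_jsonl, and generate_one_completion are hypothetical stand-ins, not the functions in human_eval; the placeholder generator returns a fixed body where a real model call would go.

```python
import json
import os
import tempfile


def dump_jsonl(path: str, records: list) -> None:
    """Write one JSON object per line (stand-in for the repo's write_jsonl)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def load_jsonl(path: str) -> list:
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def generate_one_completion(prompt: str) -> str:
    """Placeholder for a model call; always returns the same function body."""
    return "    return 1"


workdir = tempfile.mkdtemp()
problem_path = os.path.join(workdir, "problems.jsonl")
sample_path = os.path.join(workdir, "samples.jsonl")

# A one-problem file with the same fields as the example problem.
dump_jsonl(problem_path, [{
    "task_id": "test/0",
    "prompt": "def return1():\n",
    "canonical_solution": "    return 1",
    "test": "def check(candidate):\n    assert candidate() == 1",
    "entry_point": "return1",
}])

# Key problems by task_id, then emit several samples per task.
problems = {p["task_id"]: p for p in load_jsonl(problem_path)}
num_samples_per_task = 3
samples = [
    {"task_id": task_id, "completion": generate_one_completion(problems[task_id]["prompt"])}
    for task_id in problems
    for _ in range(num_samples_per_task)
]
dump_jsonl(sample_path, samples)
print(len(load_jsonl(sample_path)))
```

With the repo installed, read_problems and write_jsonl from human_eval.data would take the place of the two stand-in helpers.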

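Running the evaluation

Use evaluate_functional_correctness to evaluate the temporary file generated above.
By default, the repo comments out the statement that actually executes the generated code (in human-eval/human_eval/execution.py); it is best to enable it only inside a sandbox. According to Section 2.3 of the paper, the authors used gVisor, a container runtime, in a Kubernetes-based setup.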
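
To illustrate why a time limit and isolation matter, here is a rough sketch that runs an assembled program in a separate Python process with a timeout. It is not the repo's execution.py and it is not a sandbox: a child process can still touch the file system and the network, so untrusted samples still belong in a container. run_with_timeout, assemble, and the 3-second limit are assumptions made for this example, and the destructive sample 1 is deliberately left out.

```python
import subprocess
import sys


def run_with_timeout(program: str, timeout_s: float) -> str:
    """Run `program` in a child interpreter; return 'passed', 'failed', or 'timed out'."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            stdin=subprocess.DEVNULL,  # input() hits EOF instead of blocking
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timed out"
    return "passed" if proc.returncode == 0 else "failed"


prompt = "def return1():\n"
test = "def check(candidate):\n    assert candidate() == 1"


def assemble(completion: str) -> str:
    """Prompt + completion, then the test, then the call to check()."""
    return prompt + completion + "\n\n" + test + "\n\ncheck(return1)\n"


completions = [
    "    return 1",                                      # sample 4
    "    import time\n    time.sleep(10)\n    return 1",  # sample 2
    "    return input('enter a number')",                # sample 3
]
results = [run_with_timeout(assemble(c), timeout_s=3.0) for c in completions]
print(results)
```

Sample 2 returns the right value but exceeds the limit, and sample 3 fails because input() has nothing to read.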

HumanEval Infilling

Paper: Efficient Training of Language Models to Fill in the Middle
A benchmark for evaluating FIM (Fill in the Middle).

example problem

{"task_id": "test/0", "prompt": "def return1():\n", "suffix": "1", "canonical_solution": "    return ", "test": "def check(candidate):\n    assert candidate() == 1", "entry_point": "return1"}

example samples

{"task_id": "test/0", "completion": "    import subprocess\n    subprocess.check_output('rm -rf tmp')"}
{"task_id": "test/0", "completion": "    import time\n    time.sleep(10)\n    return 1"}
{"task_id": "test/0", "completion": "    return input('enter a number')"}
{"task_id": "test/0", "completion": "    return 1"}
{"task_id": "test/0", "completion": "  return 1"}
{"task_id": "test/0", "completion": "\treturn 1"}
{"task_id": "test/0", "completion": "    import time\n    time.sleep(10)\n    return "}
{"task_id": "test/0", "completion": "    return "}
{"task_id": "test/0", "completion": "  return "}
{"task_id": "test/0", "completion": "\treturn "}
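
For infilling, the completion has to fit between the prompt and the suffix. A minimal sketch, assuming the checked program is prompt + completion + suffix followed by the test (build_infill_program and passes_infill are hypothetical helper names, and only harmless samples are executed in-process):

```python
# Fields copied from the infilling example problem above.
problem = {
    "task_id": "test/0",
    "prompt": "def return1():\n",
    "suffix": "1",
    "canonical_solution": "    return ",
    "test": "def check(candidate):\n    assert candidate() == 1",
    "entry_point": "return1",
}


def build_infill_program(problem: dict, completion: str) -> str:
    """Hypothetical helper: the completion sits between prompt and suffix."""
    return (
        problem["prompt"]
        + completion
        + problem["suffix"]
        + "\n\n"
        + problem["test"]
        + "\n\n"
        + f"check({problem['entry_point']})\n"
    )


def passes_infill(problem: dict, completion: str) -> bool:
    """Run the assembled program in a fresh namespace; any exception counts as a failure."""
    try:
        exec(build_infill_program(problem, completion), {})
        return True
    except Exception:
        return False


candidates = ["    return ", "  return ", "\treturn ", "    return 1"]
outcomes = {c: passes_infill(problem, c) for c in candidates}
print(outcomes)
```

The last candidate would pass plain HumanEval, but here it is followed by the suffix "1", so the function returns 11 and the assertion fails.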

Constructing the FIM dataset

A random transformation is applied to the dataset; there are two implementations:

document level
context level
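
As a rough illustration of the random transformation, the sketch below cuts a document at two random character positions into prefix, middle, and suffix, then moves the middle to the end. The sentinel strings, the names fim_transform and undo_fim, and the fim_rate parameter are assumptions for this example, and the sketch does not distinguish document-level from context-level application.

```python
import random

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"  # assumed sentinel strings


def fim_transform(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """With probability fim_rate, reorder doc as prefix, suffix, middle; else return it unchanged."""
    if rng.random() >= fim_rate:
        return doc
    # Two random cut points (sorted) define prefix / middle / suffix.
    i, j = sorted(rng.randint(0, len(doc)) for _ in range(2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def undo_fim(text: str) -> str:
    """Invert fim_transform (assumes the original doc contains no sentinel strings)."""
    if not text.startswith(PRE):
        return text
    prefix, rest = text[len(PRE):].split(SUF, 1)
    suffix, middle = rest.split(MID, 1)
    return prefix + middle + suffix


rng = random.Random(0)
doc = "def return1():\n    return 1\n"
transformed = fim_transform(doc, rng, fim_rate=1.0)
print(repr(transformed))
```

Because the middle comes last, a left-to-right model trained on such strings learns to generate the missing span after seeing both the prefix and the suffix.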

posted @ 2024-11-01 21:01 zion03


CD Yang