C/C++ Source Code Scanning Series: CodeQL

First published at

https://xz.aliyun.com/t/9275

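Overview

CodeQL is a static source code analysis tool that supports C/C++, Python, Java, and other languages. Users can write custom rules in the QL language to find vulnerabilities in software, or scan with the rules that ship with CodeQL.

Environment Setup

CodeQL works in two stages. First, CodeQL drives the build of the source code and collects the information it needs, saving it as a code database. Users then write CodeQL rules that search the database for matching code.

The environment used in this section is

Windows: vscode + codeql, for developing CodeQL rules and running queries
Linux: codeql, for building the code and creating the code database

First, download the CodeQL binary package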

https://github.com/github/codeql-cli-binaries/releases

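The archive names and their target platforms are

codeql-linux64.zip   Linux
codeql-osx64.zip     macOS
codeql-win64.zip     Windows
codeql.zip           all platforms

Download the archive for your platform and extract it to a directory.

On Windows, download codeql-win64.zip and extract it, then follow the vscode-codeql-starter README to set up vscode for writing CodeQL rules and querying databases.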

https://github.com/github/vscode-codeql-starter

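After downloading vscode-codeql-starter and installing the CodeQL extension for vscode, open the vscode-codeql-starter workspace in vscode (File > Open Workspace). Then open the vscode settings, search for codeql, and set Executable Path to the path of codeql.exe.

On Linux, codeql is mainly used to build the code and create the code database, so downloading codeql-linux64.zip and extracting it to a directory is enough.

The following simple example shows how to use it. The code is at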

https://github.com/hac425xxx/sca-workshop/tree/master/hello

First, use codeql to build the code and create a database

$ /home/hac425/sca/codeql/codeql database create --language=cpp -c "gcc hello.c -o hello" ./hello_codedb

Initializing database at /home/hac425/sca-workshop/hello_codedb.
Running command [gcc, hello.c, -o, hello] in /home/hac425/sca-workshop.
Finalizing database at /home/hac425/sca-workshop/hello_codedb.
Successfully created database at /home/hac425/sca-workshop/hello_codedb.

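The command-line options are

--language=cpp  the language is cpp
-c              the command that builds the code, such as make or gcc
./hello_codedb  the directory where the database files are saved

For simplicity, -c here is given the gcc command directly. CodeQL can also create databases from build systems such as make and cmake. For example, write a Makefile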

hello:
	gcc hello.c -o hello

Passing the make command to -c also creates the database

$ /home/hac425/sca/codeql/codeql database create --language=cpp -c "make -f Makefile_hello" ./hello_codedb

Initializing database at /home/hac425/sca-workshop/hello_codedb.
Running command [make, -f, Makefile_hello] in /home/hac425/sca-workshop.
[2021-02-23 05:09:18] [build] gcc hello.c -o hello
Finalizing database at /home/hac425/sca-workshop/hello_codedb.
Successfully created database at /home/hac425/sca-workshop/hello_codedb.

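Once the database is created, it can be loaded directly by pointing the CodeQL extension's From a folder option at the database directory.

Because I create the database on Linux and then load and query it on Windows, the database also needs to be bundled.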

$ /home/hac425/sca/codeql/codeql database bundle -o hello_codedb.zip hello_codedb

Creating bundle metadata for /home/hac425/sca-workshop/hello_codedb...
Creating zip file at /home/hac425/sca-workshop/hello_codedb.zip.

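The command-line options are

database bundle  the subcommand that bundles a database
-o               the output archive file
hello_codedb     the database directory

After bundling, the database can be copied to another machine for analysis.

In vscode, load the bundled database with the extension's From an archive option.

With the database loaded we can start writing rules. Here is a simple CodeQL query that finds every function call in the source and shows the name of the called function and the location of the call.

The QL code is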

import cpp
 
from FunctionCall fc
select fc.getTarget().getQualifiedName(), fc

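Running it lists every function call.

Clicking an entry in the fc column of the results jumps to the corresponding source line.

Introduction to QL and Simple Examples

CodeQL implements its own language, QL, which users use to query the database for the code fragments they need. QL is a logic language, and nearly every statement in it is a logical statement. Although QL sometimes looks like an ordinary programming language such as Python, some of its underlying ideas are completely different, as explained below. This section introduces common QL syntax through a few simple examples. For the full syntax, see the official documentation.

About the Sample Code

Code path

https://github.com/hac425xxx/sca-workshop/blob/master/ql-example/example.c

Vulnerabilities generally arise when a program processes untrusted external data. The sample code therefore simulates a few functions that obtain external data, sets up some vulnerable and non-vulnerable scenarios, and then we use CodeQL to find the vulnerable ones.

The functions that simulate obtaining external data are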

// fake read byte from taint data
char read_byte()
{
    return 1;
}

// fake read int from taint data
int read_int()
{
    return 1; // placeholder value so the function does not fall off the end without returning
}

// fake get user input function
char *get_user_input_str()
{
    return (char *)malloc(12);
}

system Command Execution

Sample code used in this section

https://github.com/hac425xxx/sca-workshop/tree/master/ql-example
https://github.com/hac425xxx/sca-workshop/tree/master/ql-example/system_query

Vulnerable code

int call_system_example()
{

    char *user = get_user_input_str();

    char *xx = user;

    system(xx);
    return 1;
}

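The vulnerability is that the function first calls get_user_input_str to obtain an externally supplied string and then passes it to system, which can lead to command execution.

This section uses the search for system command execution vulnerabilities to learn how to write QL rules. We start with a simple QL query to see what a query is made of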

import cpp

from FunctionCall fc
where fc.getTarget().getName().matches("system")
select fc.getEnclosingFunction(), fc

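This query finds every call to system and shows the function containing the call and the location of the call. The statements do the following:

import brings in the libraries we need, which provide predicates and classes for us to use
from declares the variables used in the query; here it declares fc of type FunctionCall, which represents a function call
where sets the conditions the variables must satisfy; here the condition is that the name of the call's target is system
select outputs the results and chooses what each result row contains.

In the query results, the number of columns and their contents are determined by the select statement, and each row is one result, much like the output of a SQL statement.

Browsing the results shows that one system call takes a fixed string as its argument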

int call_system_const_example()
{
    system("cat /etc/xxx");
    return 1;
}

This cannot lead to command injection, so we can add a condition to the where clause to filter the call out.

import cpp

from FunctionCall fc
where fc.getTarget().getName().matches("system") and not fc.getArgument(0).isConstant()
select fc.getEnclosingFunction(), fc, fc.getArgument(0)

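The where clause uses and to add a conjunctive condition. fc.getArgument(0).isConstant() tells whether the first argument of fc is a constant, so negating it filters out system calls whose argument is a constant string.

These two examples give a rough idea of CodeQL's syntax. The user declares the program elements of interest in from (for example FunctionCall) and writes a number of logical expressions in where. When the query runs, CodeQL gathers every element of the declared types (here, every function call), checks each one against the expressions in where, and passes those for which where holds to select for display.

Another way to see it: the type of a variable declared in from only names a class of program elements, and its range of possible values is large. FunctionCall, for instance, can be any function call. The logical expressions in where constrain fc and shrink that range, and select then presents all the remaining values as a table.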
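The constraint-narrowing view can be mimicked in plain C. The sketch below is only an analogy, not how the CodeQL engine is implemented: the CallSite struct, its fields, and select_system_calls are hypothetical names, and the array stands in for the FunctionCall values stored in a database.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one FunctionCall row in a CodeQL database. */
struct CallSite {
    const char *callee;     /* like fc.getTarget().getName() */
    const char *enclosing;  /* like fc.getEnclosingFunction() */
    int arg0_is_constant;   /* like fc.getArgument(0).isConstant() */
};

/*
 * "from FunctionCall fc": iterate over every candidate value.
 * "where ...": keep only the candidates that satisfy every condition.
 * "select ...": copy the surviving rows into out[].
 * Returns the number of rows written to out (at most out_cap).
 */
size_t select_system_calls(const struct CallSite *calls, size_t n,
                           const struct CallSite **out, size_t out_cap)
{
    size_t count = 0;
    for (size_t i = 0; i < n && count < out_cap; i++) {
        const struct CallSite *fc = &calls[i];
        if (strcmp(fc->callee, "system") == 0 && !fc->arg0_is_constant) {
            out[count++] = fc;
        }
    }
    return count;
}
```

Each extra condition in where corresponds to one more term in the if, which can only shrink the set of rows that reach the output.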

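This confused me for a while when I first started learning CodeQL. Once you roughly understand how QL works, writing and debugging rules becomes much easier.

Back to the example. Two results remain. One of them, call_system_safe_example, calls clean_data to validate the user input. Purely for teaching purposes, we assume that clean_data guarantees the input is clean and returns 0 otherwise, so call_system_safe_example needs to be filtered out.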
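The bodies of clean_data and call_system_safe_example are not shown in this post, so here is a hypothetical sketch of the pattern being described. The allow-list rule (letters, digits, '_', '-', '.', '/') is an assumption made for illustration, not the sample repository's implementation, and run_if_clean is likewise a made-up name.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Hypothetical validator: returns 1 when cmd contains only characters from a
 * small allow-list, 0 otherwise (including NULL or empty input).
 */
int clean_data(const char *cmd)
{
    if (cmd == NULL || *cmd == '\0')
        return 0;
    for (const char *p = cmd; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        if (!isalnum(c) && c != '_' && c != '-' && c != '.' && c != '/')
            return 0;
    }
    return 1;
}

/* Hypothetical caller: only reaches system() when validation succeeds. */
int run_if_clean(const char *cmd)
{
    if (!clean_data(cmd))
        return 0;
    system(cmd);
    return 1;
}
```

With this shape, input such as "ls; rm -rf /" is rejected before it reaches system, which is why a query should not report the guarded call.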

For this simple example we can add a few expressions that filter out results whose enclosing function calls both system and clean_data.

import cpp

from FunctionCall fc
where
  fc.getTarget().getName().matches("system") and
  not fc.getArgument(0).isConstant() and
  // drop fc when its enclosing function also contains a call to clean_data
  not exists(FunctionCall clean_fc |
    clean_fc.getTarget().getName().matches("clean_data") and
    clean_fc.getEnclosingFunction() = fc.getEnclosingFunction()
  )
select fc.getEnclosingFunction(), fc, fc.getArgument(0)

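Filtering this way can of course produce both false negatives and false positives, for example when the data checked by clean_data is not the data actually passed to system.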

	clean_data(data_1)
	................
	................
	system(data_2)

Such a search also cannot tell whether the argument to system is externally controllable.

This is where CodeQL's taint tracking comes in. Sample code:

import cpp
import semmle.code.cpp.dataflow.TaintTracking

from FunctionCall system_call, FunctionCall user_input, DataFlow::Node source, DataFlow::Node sink
where
  system_call.getTarget().getName().matches("system") and
  user_input.getTarget().getName().matches("get_user_input_str") and
  sink.asExpr() = system_call.getArgument(0) and
  source.asExpr() = user_input and
  TaintTracking::localTaint(source, sink)
select user_input, user_input.getEnclosingFunction()

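Taint tracking is provided by the TaintTracking module. CodeQL supports two kinds of taint tracking, local and global. Local taint tracking only follows data within a single function and does not track across function boundaries, while global taint tracking follows data through the whole source project.

Back to the CodeQL code above. First we need to be clear about our goal and what we already know.

get_user_input_str simulates the program obtaining data from outside, so its return value holds external data, that is, the taint source
system is the sink, and data flowing from get_user_input_str into system very likely indicates a vulnerability

The query works as follows:

First it declares two function calls, system_call and user_input, which represent the call expressions that invoke system and get_user_input_str respectively
Then it declares source and sink as the source and sink of the taint tracking
sink.asExpr() = system_call.getArgument(0) makes the sink the first argument of the system call
source.asExpr() = user_input makes the source the get_user_input_str call expression
Finally, TaintTracking::localTaint looks for flow from source to sink

The query finds call sites where the first argument of system is controlled by the return value of get_user_input_str, such as call_system_example above.

Because localTaint is used here, the case below is missed. There are two ways to cover it

Add our_wrapper_system to the sinks
Use global taint tracking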

void our_wrapper_system(char* cmd)
{
    system(cmd);
}

int call_our_wrapper_system_example()
{

    char* user = get_user_input_str();

    char* xx = user;

    our_wrapper_system(xx);
    return 1;
}

The query for the first approach is below. It simply treats our_wrapper_system as a sink as well

import cpp
import semmle.code.cpp.dataflow.TaintTracking

predicate setSystemSink(FunctionCall fc, Expr e) {
  fc.getTarget().getName().matches("system") and
  fc.getArgument(0) = e
}

predicate setWrapperSystemSink(FunctionCall fc, Expr e) {
  fc.getTarget().getName().matches("our_wrapper_system") and
  fc.getArgument(0) = e
}

from FunctionCall fc, FunctionCall user_input, DataFlow::Node source, DataFlow::Node sink
where
  (
    setWrapperSystemSink(fc, sink.asExpr()) or
    setSystemSink(fc, sink.asExpr())
  ) and
  user_input.getTarget().getName().matches("get_user_input_str") and
  sink.asExpr() = fc.getArgument(0) and
  source.asExpr() = user_input and
  TaintTracking::localTaint(source, sink)
select user_input, user_input.getEnclosingFunction()

The code using global taint tracking is

import cpp
import semmle.code.cpp.dataflow.TaintTracking

class SystemCfg extends TaintTracking::Configuration {
  SystemCfg() { this = "SystemCfg" }

  override predicate isSource(DataFlow::Node source) {
    source.asExpr().(FunctionCall).getTarget().getName() = "get_user_input_str"
  }

  override predicate isSink(DataFlow::Node sink) {
    exists(FunctionCall call |
      sink.asExpr() = call.getArgument(0) and
      call.getTarget().getName() = "system"
    )
  }
}

from DataFlow::PathNode sink, DataFlow::PathNode source, SystemCfg cfg
where cfg.hasFlowPath(source, sink)
select source, sink

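ps: exists works much like declaring a local variable

To use global taint tracking, define a class that extends TaintTracking::Configuration and override isSource and isSink

isSource defines the sources; here a call to get_user_input_str is a source
isSink defines the sinks; here the first argument of system is a sink
The where clause then uses cfg.hasFlowPath(source, sink) to find code where data flows from a source to a sink

The results show that call_system_safe_example also appears. As mentioned earlier, clean_data ensures the data cannot be used for command injection, so we can override the isSanitizer predicate to discard results where the tainted data flows into clean_data. The key code is: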

import cpp
import semmle.code.cpp.dataflow.TaintTracking
import semmle.code.cpp.valuenumbering.GlobalValueNumbering

class SystemCfg extends TaintTracking::Configuration {
  SystemCfg() { this = "SystemCfg" }
  ............

  override predicate isSanitizer(DataFlow::Node nd) {
    exists(FunctionCall fc |
      fc.getTarget().getName() = "clean_data" and
      globalValueNumber(fc.getArgument(0)) = globalValueNumber(nd.asExpr())
    )
  }
  ............
}

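ps: the results are only correct when globalValueNumber is used. This is probably related to global value numbering (GVN) from compiler theory.

Array Out-of-Bounds Access

Code used in this section

https://github.com/hac425xxx/sca-workshop/tree/master/ql-example/array_oob_query

Vulnerable code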

int global_array[40] = {0};

void array_oob()
{
    int user = read_byte();
    global_array[user] = 1;
}

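The function first calls read_byte to obtain one byte of external input and then uses it as an index into global_array. global_array has only 40 elements, so this can lead to an out-of-bounds access.

The vulnerability model is clear, so we use taint tracking to find it. The source is a call to read_byte, and the sink is tainted data being used as an array index.

The query is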

import cpp
import semmle.code.cpp.dataflow.TaintTracking

class ArrayOOBCfg extends TaintTracking::Configuration {
  ArrayOOBCfg() { this = "ArrayOOBCfg" }

  override predicate isSource(DataFlow::Node source) {
    source.asExpr().(FunctionCall).getTarget().getName() = "read_byte"
  }

  override predicate isSink(DataFlow::Node sink) {
    exists(ArrayExpr ae | sink.asExpr() = ae.getArrayOffset())
  }
}

from DataFlow::PathNode sink, DataFlow::PathNode source, ArrayOOBCfg cfg
where cfg.hasFlowPath(source, sink)
select source.getNode().asExpr().(FunctionCall).getEnclosingFunction(), source, sink

First, look at the code that defines the source

source.asExpr().(FunctionCall).getTarget().getName() = "read_byte"

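This makes source a FunctionCall expression that calls read_byte. The .(FunctionCall) part is similar to a type cast.

Next, the sink. Many syntactic constructs have a corresponding class in QL. The array accesses involved here, for example, can be obtained through ArrayExpr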

import cpp

from ArrayExpr ae
select ae, ae.getArrayOffset(), ae.getArrayBase()

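getArrayOffset returns the index part of the array access and getArrayBase returns the array base, so the taint query above finds code where data flows from read_byte into an array index.

The query finds all the code that matches, including one false positive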

void no_array_oob()
{
    int user = read_byte();

    if (user >= sizeof(global_array))
        return;

    global_array[user] = 1;
}

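Here the value of user is checked. We can filter this result out with isSanitizer. For simplicity, we assume that user input which appears in the condition of an if statement has been validated correctly.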

  override predicate isSanitizer(DataFlow::Node nd) {
    exists(IfStmt ifs |
      globalValueNumber(ifs.getControllingExpr().getAChild*()) = globalValueNumber(nd.asExpr())
    )
  }

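CodeQL uses IfStmt to represent an if statement, and getControllingExpr returns its condition expression. getAChild*() then walks all descendants of the condition recursively, so the predicate holds whenever nd has the same value number as some part of the condition.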
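One detail of the no_array_oob sample is worth a second look. sizeof(global_array) is the array's size in bytes (160 on a typical platform where int is 4 bytes), not its element count of 40, so comparing an index against it still accepts values past the last element. A stricter check, sketched below with the hypothetical helpers index_in_bounds and checked_store, compares against the element count and rejects negative values explicitly.

```c
#include <stddef.h>

/* Number of elements in a true array (not a pointer). */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

int global_array[40] = {0};

/* Hypothetical helper: 1 if idx is a valid index into global_array, else 0. */
int index_in_bounds(int idx)
{
    return idx >= 0 && (size_t)idx < ARRAY_LEN(global_array);
}

/* Writes only when the index is valid; returns 1 on success, 0 if rejected. */
int checked_store(int idx, int value)
{
    if (!index_in_bounds(idx))
        return 0;
    global_array[idx] = value;
    return 1;
}
```

This also shows the limit of treating any if condition as a sanitizer: the rule suppresses the result whether or not the comparison is actually sufficient.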

Reference Counting

Code for this section

https://github.com/hac425xxx/sca-workshop/tree/master/ql-example/ref_query

Vulnerable code 1

int ref_leak(int *ref, int a, int b)
{

    ref_get(ref);

    if (a == 2)
    {
        puts("error 2");
        return -1;
    }
    ref_put(ref);
    return 0;
}

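The vulnerability is that when a is 2 the function returns directly without calling ref_put to decrement the reference count. Vulnerability model: a conditional branch that contains a return does not call ref_put to release the reference.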
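To see the leak concretely, the sketch below pairs ref_leak with hypothetical ref_get and ref_put implementations that simply increment and decrement the integer behind the pointer. The real helpers in the sample repository may differ.

```c
#include <stdio.h>

/* Hypothetical helpers: the reference count is the int that ref points to. */
void ref_get(int *ref) { (*ref)++; }
void ref_put(int *ref) { (*ref)--; }

/* Same shape as the vulnerable function: the a == 2 path skips ref_put. */
int ref_leak(int *ref, int a, int b)
{
    (void)b; /* unused, kept to match the sample signature */
    ref_get(ref);

    if (a == 2)
    {
        puts("error 2");
        return -1; /* returns while still holding the reference */
    }
    ref_put(ref);
    return 0;
}
```

Starting from a count of 0, a call with a == 1 leaves the count at 0, while a call with a == 2 leaves it at 1, so the reference is never released.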

The query is

import cpp
import semmle.code.cpp.dataflow.TaintTracking

class RefGetFunctionCall extends FunctionCall {
  RefGetFunctionCall() { this.getTarget().getName() = "ref_get" }
}

class RefPutFunctionCall extends FunctionCall {
  RefPutFunctionCall() { this.getTarget().getName() = "ref_put" }
}

class EvilIfStmt extends IfStmt {
  EvilIfStmt() {
    exists(ReturnStmt rs |
      this.getAChild*() = rs and
      not exists(RefPutFunctionCall rpfc | rpfc.getEnclosingBlock() = rs.getEnclosingBlock())
    )
  }
}

from RefGetFunctionCall rgfc, EvilIfStmt eifs
where eifs.getEnclosingFunction() = rgfc.getEnclosingFunction()
select eifs.getEnclosingFunction(), eifs

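The code uses classes to define specific kinds of function calls. RefPutFunctionCall, for example, represents a call expression that calls ref_put.

EvilIfStmt then represents an if statement that contains a return statement but no call to ref_put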

class EvilIfStmt extends IfStmt {
  EvilIfStmt() {
    exists(ReturnStmt rs |
      this.getAChild*() = rs and
      not exists(RefPutFunctionCall rpfc | rpfc.getEnclosingBlock() = rs.getEnclosingBlock())
    )
  }
}

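The logic is roughly

this.getAChild*() = rs constrains this to an if statement that contains a return statement
The not exists clause then ensures that no ref_put call shares an enclosing block with rs.

Vulnerable code 2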

int ref_dec_error(int *ref, int a, int b)
{
    ref_get(ref);

    if (a == 2)
    {
        puts("ref_dec_error 2");
        ref_put(ref);
    }
    ref_put(ref);
    return 0;
}

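The vulnerability is that when a is 2 the function calls ref_put to decrement the reference count but does not return.

Vulnerability model: a conditional branch calls ref_put to release the reference but does not return, so ref_put may be called more than once.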
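The double release can be observed with the same kind of hypothetical counter-based ref_get and ref_put (an assumption for illustration, not the repository's helpers).

```c
#include <stdio.h>

/* Hypothetical helpers: the reference count is the int that ref points to. */
void ref_get(int *ref) { (*ref)++; }
void ref_put(int *ref) { (*ref)--; }

/* Same shape as the vulnerable function: no return after the first ref_put. */
int ref_dec_error(int *ref, int a, int b)
{
    (void)b; /* unused, kept to match the sample signature */
    ref_get(ref);

    if (a == 2)
    {
        puts("ref_dec_error 2");
        ref_put(ref); /* first release */
    }
    ref_put(ref); /* also runs on the a == 2 path: second release */
    return 0;
}
```

If the caller holds one reference (count 1) before the call, the count is still 1 after a call with a == 1 but drops to 0 after a call with a == 2, even though the caller has not released anything.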

The key part of the QL query is

class EvilIfStmt extends IfStmt {
  EvilIfStmt() {
    exists(RefPutFunctionCall rpfc |
      this.getAChild*() = rpfc and
      not exists(ReturnStmt rs | rpfc.getEnclosingBlock() = rs.getEnclosingBlock())
    )
  }
}

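Modeling External Functions

Code for this section

https://github.com/hac425xxx/sca-workshop/tree/master/ql-example/model_function

A common problem in static taint analysis is that when data flows into an external function (for example a library function without source code), the taint engine may lose track of how the taint propagates. For example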

int custom_memcpy(char *dst, char *src, int sz);

int call_our_wrapper_system_custom_memcpy_example()
{

    char *user = get_user_input_str();

    char *tmp = malloc(strlen(user) + 1);

    custom_memcpy(tmp, user, strlen(user));

    our_wrapper_system(tmp);
    return 1;
}

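The function first obtains external input with get_user_input_str, then calls custom_memcpy to copy the data into tmp, and finally passes tmp to our_wrapper_system, which runs it with system. custom_memcpy is just a wrapper around memcpy whose source code is not provided.

Running the earlier QL code directly does not find this code, because custom_memcpy is an external function and CodeQL's taint tracking engine does not know how taint propagates through it.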

import cpp
import semmle.code.cpp.dataflow.TaintTracking

class SystemCfg extends TaintTracking::Configuration {
  SystemCfg() { this = "SystemCfg" }

  override predicate isSource(DataFlow::Node source) {
    source.asExpr().(FunctionCall).getTarget().getName() = "get_user_input_str"
  }

  override predicate isSink(DataFlow::Node sink) {
    exists(FunctionCall call |
      sink.asExpr() = call.getArgument(0) and
      call.getTarget().getName() = "system"
    )
  }
}

from DataFlow::PathNode sink, DataFlow::PathNode source, SystemCfg cfg
where cfg.hasFlowPath(source, sink)
select source.getNode().asExpr().(FunctionCall).getEnclosingFunction(), source, sink

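There are two ways to solve this: override the isAdditionalTaintStep predicate, or add a model to the QL library source. Both are described below.

Overriding the `isAdditionalTaintStep` Predicate

When using TaintTracking::Configuration, you can override isAdditionalTaintStep to define custom taint propagation rules. The code is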

  override predicate isAdditionalTaintStep(DataFlow::Node pred, DataFlow::Node succ) {
    exists(FunctionCall fc |
      pred.asExpr() = fc.getArgument(1) and fc.getTarget().getName() = "custom_memcpy"
      and succ.asDefiningArgument() = fc.getArgument(0)
    )
  }

isAdditionalTaintStep means that when the predicate holds, tainted data flows from pred into succ.

So the rule here states that tainted data flows from argument 1 of custom_memcpy (src) into argument 0 (dst).
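For intuition about why this rule is reasonable, here is a hypothetical body for custom_memcpy written as a thin wrapper around memcpy. The post only says that it wraps memcpy, so the size handling and the return value are assumptions. Whatever bytes src holds end up in dst, which is exactly the argument 1 to argument 0 flow that the predicate describes.

```c
#include <string.h>

/*
 * Hypothetical body for the external function: copies sz bytes from src to
 * dst and returns the number of bytes copied (0 for a non-positive size).
 */
int custom_memcpy(char *dst, char *src, int sz)
{
    if (sz <= 0)
        return 0;
    memcpy(dst, src, (size_t)sz);
    return sz;
}
```

Since the analyzer never sees such a body for an external function, the extra taint step has to state this flow by hand.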

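Adding a Model to the `ql` Source

The QL source ships with models for many standard library functions, such as strcpy and memcpy. The code path is

cpp\ql\src\semmle\code\cpp\models\implementations\Memcpy.qll

We can adapt these models to quickly model the functions we need. The steps are as follows

First create a new .qll file in that directory. Here I simply copied Memcpy.qll and changed the function name on line 19, since custom_memcpy is itself a wrapper around memcpy.

Then import it in Models.qll

After that, the query finds the code.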

hac425

Blog migrated from blog.hac425.top. Some posts cannot display images because of restrictions on the Sina image host. PDF version: https://gitee.com/hac425/data/tree/master/blog_pdf
