Clang Static Analyzer source code analysis: reading the README

To analyze an open-source project, reading its README is a must.

Original text:

//===----------------------------------------------------------------------===//
// Clang Static Analyzer
//===----------------------------------------------------------------------===//

= Library Structure =

The analyzer library has two layers: a (low-level) static analysis engine (GRExprEngine.cpp and friends), and some static checkers (*Checker.cpp). The latter are built on top of the former via the Checker and CheckerVisitor interfaces (Checker.h and CheckerVisitor.h). The Checker interface is designed to be minimal and simple for checker writers, and attempts to isolate them from much of the gore of the internal analysis engine.

= How It Works =

The analyzer is inspired by several foundational research papers ([1], [2]). (FIXME: kremenek to add more links)

In a nutshell, the analyzer is basically a source code simulator that traces out possible paths of execution. The state of the program (values of variables and expressions) is encapsulated by the state (ProgramState). A location in the program is called a program point (ProgramPoint), and the combination of state and program point is a node in an exploded graph (ExplodedGraph). The term "exploded" comes from exploding the control-flow edges in the control-flow graph (CFG).

Conceptually the analyzer does a reachability analysis through the ExplodedGraph. We start at a root node, which has the entry program point and initial state, and then simulate transitions by analyzing individual expressions. The analysis of an expression can cause the state to change, resulting in a new node in the ExplodedGraph with an updated program point and an updated state. A bug is found by hitting a node that satisfies some "bug condition" (basically a violation of a checking invariant).

The analyzer traces out multiple paths by reasoning about branches and then bifurcating the state: on the true branch the conditions of the branch are assumed to be true and on the false branch the conditions of the branch are assumed to be false. Such "assumptions" create constraints on the values of the program, and those constraints are recorded in the ProgramState object (and are manipulated by the ConstraintManager). If assuming the conditions of a branch would cause the constraints to be unsatisfiable, the branch is considered infeasible and that path is not taken. This is how we get path-sensitivity. We reduce exponential blow-up by caching nodes. If a new node with the same state and program point as an existing node would get generated, the path "caches out" and we simply reuse the existing node. Thus the ExplodedGraph is not a DAG; it can contain cycles as paths loop back onto each other and cache out.

ProgramState and ExplodedNodes are basically immutable once created. Once one creates a ProgramState, you need to create a new one to get a new ProgramState. This immutability is key since the ExplodedGraph represents the behavior of the analyzed program from the entry point. To represent these efficiently, we use functional data structures (e.g., ImmutableMaps) which share data between instances.

Finally, individual Checkers work by also manipulating the analysis state. The analyzer engine talks to them via a visitor interface. For example, the PreVisitCallExpr() method is called by GRExprEngine to tell the Checker that we are about to analyze a CallExpr, and the checker is asked to check for any preconditions that might not be satisfied. The checker can do nothing, or it can generate a new ProgramState and ExplodedNode which contains updated checker state. If it finds a bug, it can tell the BugReporter object about the bug, providing it an ExplodedNode which is the last node in the path that triggered the problem.

= Notes about C++ =

Since now constructors are seen before the variable that is constructed in the CFG, we create a temporary object as the destination region that is constructed into. See ExprEngine::VisitCXXConstructExpr().

In ExprEngine::processCallExit(), we always bind the object region to the evaluated CXXConstructExpr. Then in VisitDeclStmt(), we compute the corresponding lazy compound value if the variable is not a reference, and bind the variable region to the lazy compound value. If the variable is a reference, just use the object region as the initializer value.

Before entering a C++ method (or ctor/dtor), the 'this' region is bound to the object region. In ctors, we synthesize 'this' region with CXXRecordDecl, which means we do not use type qualifiers. In methods, we synthesize 'this' region with CXXMethodDecl, which has getThisType() taking type qualifiers into account. It does not matter we use qualified 'this' region in one method and unqualified 'this' region in another method, because we only need to ensure the 'this' region is consistent when we synthesize it and create it directly from CXXThisExpr in a single method call.

= Working on the Analyzer =

If you are interested in bringing up support for C++ expressions, the best place to look is the visitation logic in GRExprEngine, which handles the simulation of individual expressions. There are plenty of examples there of how other expressions are handled.

If you are interested in writing checkers, look at the Checker and CheckerVisitor interfaces (Checker.h and CheckerVisitor.h). Also look at the files named *Checker.cpp for examples on how you can implement these interfaces.

= Debugging the Analyzer =

There are some useful command-line options for debugging. For example:

$ clang -cc1 -help | grep analyze
 -analyze-function <value>
 -analyzer-display-progress
 -analyzer-viz-egraph-graphviz
 ...

The first allows you to specify only analyzing a specific function. The second prints to the console what function is being analyzed. The third generates a graphviz dot file of the ExplodedGraph. This is extremely useful when debugging the analyzer and viewing the simulation results.

Of course, viewing the CFG (Control-Flow Graph) is also useful:

$ clang -cc1 -help | grep cfg
 -cfg-add-implicit-dtors    Add C++ implicit destructors to CFGs for all analyses
 -cfg-add-initializers      Add C++ initializers to CFGs for all analyses
 -cfg-dump                  Display Control-Flow Graphs
 -cfg-view                  View Control-Flow Graphs using GraphViz
 -unoptimized-cfg           Generate unoptimized CFGs for all analyses

-cfg-dump dumps a textual representation of the CFG to the console, and -cfg-view creates a GraphViz representation.

= References =

[1] Precise interprocedural dataflow analysis via graph reachability, T Reps, S Horwitz, and M Sagiv, POPL '95, http://portal.acm.org/citation.cfm?id=199462

[2] A memory model for static analysis of C programs, Z Xu, T Kremenek, and J Zhang, http://lcs.ios.ac.cn/~xzx/memmodel.pdf

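Translation with notes:

Clang Static Analyzer (a static code analyzer)

Library Structure

The analyzer library has two layers: a low-level static analysis engine (GRExprEngine.cpp and its related files) and a set of static checkers (*Checker.cpp). The checkers are built on top of the engine through the Checker and CheckerVisitor interfaces, which are declared in Checker.h and CheckerVisitor.h.

The Checker interface is designed to be minimal and simple for checker writers, and it tries to shield them from most of the internal details of the analysis engine.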

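How It Works

In short, the analyzer is a source code simulator that traces out the possible paths of execution.

The state of the program (the values of variables and expressions) is encapsulated in a ProgramState.

A location in the program is called a program point (ProgramPoint), and the combination of a state and a program point is a node in the exploded graph (ExplodedGraph). In other words, a node pairs the code that is about to be executed with the state accumulated so far. For example, for a statement inside an if block, the program point is that statement, and the state includes what is known from the if condition.

The term "exploded" comes from exploding the control-flow edges of the control-flow graph (CFG); that is, the program's execution is broken down along each edge of the CFG.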

 

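Conceptually, the analyzer performs a reachability analysis over the ExplodedGraph. It starts at a root node, which holds the entry program point and the initial state, and simulates transitions by analyzing individual expressions. Analyzing an expression can change the state, which produces a new node in the ExplodedGraph with an updated program point and an updated state. A bug is found when the analysis reaches a node that satisfies some bug condition, meaning a violation of an invariant that a checker enforces. (Put simply: starting from one node, the analyzer simulates the program from top to bottom; different expression values can produce different nodes along the way; if a node matches a bug condition, the bug is reported, and otherwise nothing happens.)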
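The reachability idea can be sketched with a toy model. This is only an illustration in Python, not Clang's implementation: the three-statement program, the `Node` tuple, and the helpers `is_bug`, `step`, and `find_bug` are all hypothetical names made up for this sketch.

```python
from collections import deque
from typing import NamedTuple, Optional, Tuple

class Node(NamedTuple):
    """One exploded-graph node: a program point paired with a state."""
    point: int                          # index of the next statement to simulate
    state: Tuple[Tuple[str, int], ...]  # sorted (variable, value) pairs, immutable

# Hypothetical toy program: x = 0; y = 5; y = y / x;
PROGRAM = [
    ("assign", "x", 0),
    ("assign", "y", 5),
    ("div", "y", "x"),
]

def is_bug(node: Node) -> bool:
    """Bug condition: the next statement divides by a variable known to be 0."""
    if node.point >= len(PROGRAM):
        return False
    op, _, arg = PROGRAM[node.point]
    return op == "div" and dict(node.state).get(arg) == 0

def step(node: Node) -> Node:
    """Simulate one statement and return the successor node."""
    op, target, arg = PROGRAM[node.point]
    env = dict(node.state)
    if op == "assign":
        env[target] = arg
    else:  # "div"
        env[target] = env[target] // env[arg]
    # The old node is left untouched; the successor carries a fresh state.
    return Node(node.point + 1, tuple(sorted(env.items())))

def find_bug(root: Node) -> Optional[Node]:
    """Breadth-first reachability from the root until a bug node is hit."""
    worklist = deque([root])
    seen = {root}
    while worklist:
        node = worklist.popleft()
        if is_bug(node):
            return node
        if node.point < len(PROGRAM):
            succ = step(node)
            if succ not in seen:
                seen.add(succ)
                worklist.append(succ)
    return None

# Root node: entry program point and empty initial state.
bug = find_bug(Node(0, ()))
print(bug)
```

The search stops at the node whose program point is the division and whose state records `x == 0`, which plays the role of the node that satisfies the bug condition.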

 

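The analyzer traces out multiple paths by reasoning about branches and then bifurcating the state (ProgramState): on the true branch the branch condition is assumed to be true, and on the false branch it is assumed to be false. These assumptions create constraints on the values in the program. The constraints are recorded in the ProgramState object and manipulated by the ConstraintManager. If assuming a branch condition would make the constraints unsatisfiable, the branch is infeasible and that path is not taken. This is what makes the analysis path-sensitive. Exponential blow-up is reduced by caching nodes: if a new node would have the same state and program point as an existing node, the path "caches out" and the existing node is simply reused. As a result, the ExplodedGraph is not a directed acyclic graph; it can contain cycles where paths loop back onto each other and cache out. (In summary: the analyzer makes assumptions at branches, stores them in the state as constraints, and drops any later branch whose condition contradicts those constraints.)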
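A small Python sketch can make the bifurcation, the infeasible-branch check, and the cache-out concrete. It is a toy model under loud assumptions: the only symbolic value is one integer `x`, a constraint is just an inclusive range, the only branch form is `x > c`, and the CFG encoding and the names `assume_greater` and `explore` are invented for this illustration. The real ConstraintManager is far more general.

```python
from typing import Dict, Optional, Set, Tuple

Range = Tuple[int, int]   # inclusive bounds currently assumed for x
Node = Tuple[int, Range]  # (program point, state)

def assume_greater(rng: Range, c: int, truth: bool) -> Optional[Range]:
    """Assume `x > c` is `truth`; return the narrowed range, or None if unsatisfiable."""
    lo, hi = rng
    if truth:
        lo = max(lo, c + 1)
    else:
        hi = min(hi, c)
    return (lo, hi) if lo <= hi else None

# Hypothetical CFG for: while (x > 0) { /* body leaves x unchanged */ }
CFG: Dict[int, tuple] = {
    0: ("branch", 0, 1, 2),  # x > 0 ? goto 1 : goto 2
    1: ("goto", 0),          # loop body, then back to the loop head
    2: ("exit",),
}

def explore(cfg: Dict[int, tuple], entry: int, initial: Range):
    """Build the exploded graph; return its nodes, its edges, and the cache-out count."""
    root: Node = (entry, initial)
    nodes: Set[Node] = {root}
    edges: Set[Tuple[Node, Node]] = set()
    worklist = [root]
    cache_outs = 0
    while worklist:
        node = worklist.pop()
        point, rng = node
        kind = cfg[point][0]
        successors = []
        if kind == "branch":
            _, c, on_true, on_false = cfg[point]
            # Bifurcate: one assumption per branch direction.
            for truth, target in ((True, on_true), (False, on_false)):
                narrowed = assume_greater(rng, c, truth)
                if narrowed is not None:  # None means infeasible: path not taken
                    successors.append((target, narrowed))
        elif kind == "goto":
            successors.append((cfg[point][1], rng))
        for succ in successors:
            edges.add((node, succ))
            if succ in nodes:
                cache_outs += 1  # same point and state: reuse the existing node
            else:
                nodes.add(succ)
                worklist.append(succ)
    return nodes, edges, cache_outs

nodes, edges, cache_outs = explore(CFG, 0, (-100, 100))
```

In this run the second visit to the loop head carries the range `(1, 100)`, so its false branch is infeasible and its true branch leads to a node that already exists. That reused node closes a cycle in the graph, which is why the exploded graph is not a DAG.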

 

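ProgramState and ExplodedNode objects are essentially immutable once created. A ProgramState is never modified in place; to get a different state you create a new ProgramState. This immutability is key, because the ExplodedGraph represents the behavior of the analyzed program starting from the entry point. To represent all of this efficiently, the analyzer uses functional data structures (for example, ImmutableMaps) that share data between instances. (In summary: the contents of the exploded graph do not change after creation, and shared data structures keep that representation cheap.)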
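The following Python sketch illustrates immutability with data sharing. It is an assumption-laden stand-in, not how ImmutableMap is implemented: here a state is a chain of `Binding` cells in which every new state points at its parent, and `Binding`, `bind`, and `lookup` are hypothetical names.

```python
from typing import NamedTuple, Optional

class Binding(NamedTuple):
    """One immutable variable binding plus a link to the older bindings."""
    key: str
    value: int
    parent: Optional["Binding"]  # shared with other states, never copied

def bind(state: Optional[Binding], key: str, value: int) -> Binding:
    """Return a NEW state with key bound to value; the old state is untouched."""
    return Binding(key, value, state)

def lookup(state: Optional[Binding], key: str) -> Optional[int]:
    """Find the most recent binding for key, walking back through the parents."""
    while state is not None:
        if state.key == key:
            return state.value
        state = state.parent
    return None

s1 = bind(None, "x", 1)   # state after: x = 1
s2 = bind(s1, "y", 2)     # one path continues with y = 2
s3 = bind(s1, "y", 7)     # a sibling path continues with y = 7

print(lookup(s2, "y"), lookup(s3, "y"), lookup(s1, "y"))
```

Both `s2` and `s3` reuse `s1` as their parent, so the binding for `x` is stored once and shared, and creating `s2` or `s3` cannot change what `s1` means for the nodes that already refer to it.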

 

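Finally, individual checkers also work by manipulating the analysis state, and the analysis engine talks to them through a visitor interface. For example, GRExprEngine calls the PreVisitCallExpr() method to tell the checker that a CallExpr is about to be analyzed, and the checker is asked to check for any preconditions that might not be satisfied. The checker can do nothing, or it can generate a new ProgramState and ExplodedNode containing updated checker state. If the checker finds a bug, it can tell the BugReporter object about it and provide an ExplodedNode, which is the last node on the path that triggered the problem. (In other words: a checker analyzes based on the state, and when a bug appears it reports the bug to the BugReporter together with the final node of the offending path.)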
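The callback pattern can be sketched in Python as follows. This is a hypothetical toy, not Clang's Checker API: `DoubleFreeChecker`, `pre_visit_call`, and `run` are invented names, the "engine" is a plain loop over a list of calls, the checker state is a frozenset of freed pointer names, and a node is identified only by its index on the path.

```python
from typing import FrozenSet, List, Tuple

Report = Tuple[str, int]  # (message, index of the last node on the buggy path)

class DoubleFreeChecker:
    """Toy checker: tracks freed pointers and flags a second free()."""

    def pre_visit_call(self, callee: str, arg: str, state: FrozenSet[str],
                       node: int, reports: List[Report]) -> FrozenSet[str]:
        """Called before a call is simulated; returns the state to continue with."""
        if callee != "free":
            return state                      # nothing to check: do nothing
        if arg in state:
            # Precondition violated: report the bug with the node that hit it.
            reports.append((f"double free of {arg}", node))
            return state
        return state | {arg}                  # new state with updated checker data

def run(calls: List[Tuple[str, str]], checker: DoubleFreeChecker) -> List[Report]:
    """Toy engine: notify the checker before each call along a single path."""
    state: FrozenSet[str] = frozenset()
    reports: List[Report] = []
    for node, (callee, arg) in enumerate(calls):
        state = checker.pre_visit_call(callee, arg, state, node, reports)
    return reports

reports = run([("malloc", "p"), ("free", "p"), ("free", "p")], DoubleFreeChecker())
print(reports)
```

The checker never mutates the state it receives; when it has something to record, it returns a new frozenset, which mirrors the idea of generating a new state rather than editing the old one.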

Notes about C++

I do not think this section is of much value, so it is not translated here.

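Working on the Analyzer

If you are interested in adding support for C++ expressions, the best place to look is the visitation logic in GRExprEngine, which handles the simulation of individual expressions. It contains plenty of examples of how other expressions are handled.

If you are interested in writing checkers, look at the Checker and CheckerVisitor interfaces (Checker.h and CheckerVisitor.h), and at the *Checker.cpp files, which are examples of checkers implemented on top of those interfaces.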

Debugging the Analyzer

There are some useful command-line options for debugging, for example:

$ clang -cc1 -help | grep analyze
-analyze-function <value>
-analyzer-display-progress
-analyzer-viz-egraph-graphviz

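-analyze-function <value> restricts the analysis to the specified function.

-analyzer-display-progress prints to the console which function is being analyzed.

-analyzer-viz-egraph-graphviz generates a graphviz dot file of the ExplodedGraph.

These options are extremely useful when debugging the analyzer and viewing the simulation results.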

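Viewing the CFG

Viewing the control-flow graph (CFG) is also useful when debugging the analyzer and inspecting the simulation results: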

$ clang -cc1 -help | grep cfg
-cfg-add-implicit-dtors
-cfg-add-initializers  
-cfg-dump              
-cfg-view              
-unoptimized-cfg        

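-cfg-add-implicit-dtors adds C++ implicit destructors to the CFGs for all analyses.

-cfg-add-initializers adds C++ initializers to the CFGs for all analyses.

-cfg-dump dumps a textual representation of the CFG to the console.

-cfg-view creates a GraphViz representation of the CFG for viewing.

-unoptimized-cfg generates unoptimized CFGs for all analyses.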