Understanding gcc name lookup through lambda capture semantics

1. Introducing the problem

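The best-known part of a lambda expression's capture list is the capture-default, introduced by "&" or "=". Most people simply write "&" and capture everything by reference. But what happens when the capture list is empty? According to the documentation, with an empty capture list the lambda body can only use the less commonly discussed kinds of variables, such as global, static, thread_local, and constexpr ones; ordinary local variables of the enclosing function cannot be used. The C++ standard states this rule fairly clearly, but how does the compiler implement such an intricate lookup mechanism?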
tsecer@harry: cat -n lambda.none.capture.cpp
1 int ga;
2 int main(int argc, const char *argv[])
3 {
4 return []()->int{ return ga + argc;}();
5 }
tsecer@harry: gcc -c -std=c++11 lambda.none.capture.cpp
lambda.none.capture.cpp: In lambda function:
lambda.none.capture.cpp:4:32: error: 'argc' is not captured
return []()->int{ return ga + argc;}();
^
tsecer@harry:
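
For contrast with the failing program above, here is a minimal sketch of the forms that do compile under -std=c++11. The function name sum_with_captures and the variables base and kBias are made up for illustration; only ga and argc come from the original example.

```cpp
#include <cassert>

int ga = 1;                        // namespace-scope variable: static storage duration

int sum_with_captures(int argc)
{
    static int base = 5;           // static local: usable without capture
    constexpr int kBias = 10;      // constant: only its value is read, so no capture is needed

    // Empty capture list: ga, base and kBias may appear in the body, argc may not.
    int a = []() -> int { return ga + base + kBias; }();

    // argc is an automatic variable (a parameter), so it has to be captured,
    // either explicitly or through a capture-default.
    int b = [argc]() -> int { return ga + argc; }();   // explicit capture by copy
    int c = [=]() -> int { return ga + argc; }();      // capture-default by copy
    int d = [&]() -> int { return ga + argc; }();      // capture-default by reference

    return a + b + c + d;
}
```

Replacing any of the last three capture lists with [] reproduces the "is not captured" error shown in the transcript.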

2. The key to the check

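The function below implements most of the checks that the C++ standard describes for this case, for example whether the variable is a constant.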
gcc-4.8.2\gcc\cp\semantics.c
tree
finish_id_expression (tree id_expression,
tree decl,
tree scope,
cp_id_kind *idk,
bool integral_constant_expression_p,
bool allow_non_integral_constant_expression_p,
bool *non_integral_constant_expression_p,
bool template_p,
bool done,
bool address_p,
bool template_arg_p,
const char **error_msg,
location_t location)
{
……
else
{
tree context = DECL_CONTEXT (decl);
tree containing_function = current_function_decl;
tree lambda_stack = NULL_TREE;
tree lambda_expr = NULL_TREE;
tree initializer = convert_from_reference (decl);

/* Mark it as used now even if the use is ill-formed. */
mark_used (decl);

/* Core issue 696: "[At the July 2009 meeting] the CWG expressed
support for an approach in which a reference to a local
[constant] automatic variable in a nested class or lambda body
would enter the expression as an rvalue, which would reduce
the complexity of the problem"

FIXME update for final resolution of core issue 696. */
if (decl_constant_var_p (decl))
{
if (processing_template_decl)
/* In a template, the constant value may not be in a usable
form, so look it up again at instantiation time. */
return id_expression;
else
return integral_constant_value (decl);
}

/* If we are in a lambda function, we can move out until we hit
1. the context,
2. a non-lambda function, or
3. a non-default capturing lambda function. */
while (context != containing_function
&& LAMBDA_FUNCTION_P (containing_function))
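{ // context is the scope in which the referenced variable was declared. Walk outward one level at a time looking for a legal path to it; a lambda with no capture-default (CPLD_NONE) along the way can make that path unreachable.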
lambda_expr = CLASSTYPE_LAMBDA_EXPR
(DECL_CONTEXT (containing_function));

if (LAMBDA_EXPR_DEFAULT_CAPTURE_MODE (lambda_expr)
== CPLD_NONE)
break;

lambda_stack = tree_cons (NULL_TREE,
lambda_expr,
lambda_stack);

containing_function
= decl_function_context (containing_function);
}

if (context == containing_function)
{
decl = add_default_capture (lambda_stack,
/*id=*/DECL_NAME (decl),
initializer);
}
else if (lambda_expr)
{
error ("%qD is not captured", decl);
return error_mark_node;
}
else
{
error (TREE_CODE (decl) == VAR_DECL
? G_("use of local variable with automatic storage from containing function")
: G_("use of parameter from containing function"));
error (" %q+#D declared here", decl);
return error_mark_node;
}
}
……
}
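
The outward walk in the while loop above can be modeled in a few lines of ordinary C++. This is only a toy sketch under stated assumptions: FunctionModel, CaptureDefault, Outcome and resolve_outer_variable are hypothetical names that do not exist in gcc, and the model assumes the variable is declared outside the function that names it.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for gcc's trees; none of these names exist in gcc.
enum class CaptureDefault { None, ByCopy, ByReference };

struct FunctionModel {
    bool is_lambda;                  // plays the role of LAMBDA_FUNCTION_P
    CaptureDefault capture_default;  // plays the role of LAMBDA_EXPR_DEFAULT_CAPTURE_MODE
    const FunctionModel* enclosing;  // plays the role of decl_function_context
};

enum class Outcome { Captured, NotCaptured, UseFromContainingFunction };

// context is the function that declares the variable, current is the function
// whose body names it. On success, lambda_stack holds the lambdas that have to
// capture the variable, outermost first.
Outcome resolve_outer_variable(const FunctionModel* context,
                               const FunctionModel* current,
                               std::vector<const FunctionModel*>& lambda_stack)
{
    const FunctionModel* containing = current;
    const FunctionModel* last_lambda = nullptr;

    while (containing != context && containing != nullptr && containing->is_lambda) {
        last_lambda = containing;
        if (containing->capture_default == CaptureDefault::None)
            break;                                               // an empty [] blocks the path
        lambda_stack.insert(lambda_stack.begin(), containing);   // prepend, like tree_cons
        containing = containing->enclosing;
    }

    if (containing == context)
        return Outcome::Captured;                   // add_default_capture would run here
    if (last_lambda != nullptr)
        return Outcome::NotCaptured;                // the "is not captured" error
    return Outcome::UseFromContainingFunction;      // e.g. a member function of a local class
}
```

In this model, a [=] lambda nested in a [&] lambda inside main reaches main and records both lambdas, while replacing the inner lambda's capture list with [] stops the walk immediately and yields the "is not captured" outcome.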

3. How are identifiers looked up?

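In a high-level language, an identifier is ultimately resolved by its name. Intuitively, when the compiler meets an identifier it would walk the enclosing scopes and compare the name string against each declaration until it finds what the identifier denotes. Reading the gcc source, however, I could not find any such string-based lookup, which is rather surprising.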

4. How the preprocessor handles identifiers

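One important point here: when the lexer looks up an identifier, it returns an identifier node determined solely by the identifier's spelling. In other words, the ga on line 1 and the ga on line 4 of the earlier example yield the same node. Going further, every variable with the same name anywhere in the compilation is represented by one and the same identifier node. How, then, does such a node distinguish the different meanings the name has in different scopes?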
gcc-4.8.2\libcpp\lex.c
/* Lex an identifier starting at BUFFER->CUR - 1. */
static cpp_hashnode *
lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
struct normalize_state *nst)
{
……
{
len = cur - base;
hash = HT_HASHFINISH (hash, len);

result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
base, len, hash, HT_ALLOC));
}
……
}
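
The effect of ht_lookup_with_hash with HT_ALLOC, one node per distinct spelling, is what is usually called interning. A minimal sketch of the idea follows, using a standard hash map instead of libcpp's own table; IdentifierNode and IdentifierTable are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <unordered_map>

struct IdentifierNode {           // hypothetical stand-in for cpp_hashnode
    std::string spelling;
};

class IdentifierTable {           // hypothetical stand-in for pfile->hash_table
public:
    // Return the node for spelling, allocating it the first time the
    // spelling is seen (the role played by the HT_ALLOC argument).
    IdentifierNode* lookup(const std::string& spelling)
    {
        auto it = nodes_.find(spelling);
        if (it == nodes_.end()) {
            std::unique_ptr<IdentifierNode> node(new IdentifierNode{spelling});
            it = nodes_.emplace(spelling, std::move(node)).first;
        }
        return it->second.get();
    }

private:
    std::unordered_map<std::string, std::unique_ptr<IdentifierNode>> nodes_;
};
```

Looking up "ga" twice returns the same pointer, so later stages can compare identifiers by address rather than by string.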

5. The identifier as defined by gcc

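The code below shows that when memory is allocated for an identifier, the C++ front end allocates a lang_identifier structure, defined in cp-tree.h. Its c_common member holds the conventional string form of the identifier's name. On top of that, C++ adds extra fields, the most notable being two cxx_binding pointers: namespace_bindings and bindings.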
gcc-4.8.2\gcc\tree.c
size_t
tree_code_size (enum tree_code code)
{
……
case tcc_exceptional: /* something random, like an identifier. */
switch (code)
{
case IDENTIFIER_NODE: return lang_hooks.identifier_size;
……
}
For C++, identifier_size is defined as
#define LANG_HOOKS_IDENTIFIER_SIZE sizeof (struct lang_identifier)
gcc-4.8.2\gcc\cp\cp-tree.h
struct GTY(()) lang_identifier {
struct c_common_identifier c_common;
cxx_binding *namespace_bindings;
cxx_binding *bindings;
tree class_template_info;
tree label_value;
};

6. The linked list of bindings

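The cxx_binding structure has a crucial link field, cxx_binding *previous. As the name suggests, it points to the binding made in an enclosing scope, so the bindings of a name logically form a list ordered by scope. This is the basis on which the language's "name hiding" is implemented. What the identifier actually denotes is also stored in this structure, in the tree value; and tree type; fields, so the same name can correspond to a different entity in each cxx_binding.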
struct GTY(()) cxx_binding {
/* Link to chain together various bindings for this name. */
cxx_binding *previous;
/* The non-type entity this name is bound to. */
tree value;
/* The type entity this name is bound to. */
tree type;
/* The scope at which this binding was made. */
cp_binding_level *scope;
unsigned value_is_inherited : 1;
unsigned is_local : 1;
};

Take the following code as an example:
tsecer@harry: cat -n override.cpp
1 int tsecer = 11;
2 void foo()
3 {
4 extern int tsecer();
5 }
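The same identifier string "tsecer" has one cxx_binding inside foo and another outside it. The binding inside the function points to the outer one through its previous field, and the two bindings carry different value and type fields, corresponding to tsecer as a function and as an integer variable respectively.
As a result, no string matching on the variable name is needed when an identifier is encountered, because the preprocessing stage has already matched the spelling to its identifier node. When a new declaration is parsed, the compiler only has to create a new cxx_binding, store the declaration in its value/type fields, and insert it at the head of the bindings list of the current lang_identifier. The bindings list changes only when a new declaration is parsed; in every other case the identifier returned by the preprocessor library (libcpp) can be used directly, since everything declared for that name so far is already on the list.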
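
The push-at-the-head scheme described above can be sketched as a tiny stack of bindings hanging off the identifier. Binding and Identifier are hypothetical simplifications of cxx_binding and lang_identifier: the value is a plain string, the scope is an integer depth, and popping bindings when a scope ends is an assumption of this sketch rather than something shown in the excerpts.

```cpp
#include <cassert>
#include <memory>
#include <string>

struct Binding {                          // hypothetical, modeled on cxx_binding
    std::string value;                    // what the name denotes in this scope
    int scope_depth;                      // stands in for cp_binding_level *scope
    std::unique_ptr<Binding> previous;    // binding made in the enclosing scope
};

struct Identifier {                       // hypothetical, modeled on lang_identifier
    std::string spelling;
    std::unique_ptr<Binding> bindings;    // innermost binding first

    // A new declaration pushes a binding at the head of the chain.
    void declare(const std::string& value, int scope_depth)
    {
        std::unique_ptr<Binding> b(new Binding{value, scope_depth, std::move(bindings)});
        bindings = std::move(b);
    }

    // Leaving a scope pops every binding that was made at that depth.
    void leave_scope(int scope_depth)
    {
        while (bindings && bindings->scope_depth == scope_depth) {
            std::unique_ptr<Binding> outer = std::move(bindings->previous);
            bindings = std::move(outer);
        }
    }

    // Lookup needs no string comparison: the head of the chain is the answer.
    const Binding* lookup() const { return bindings.get(); }
};
```

For override.cpp, declaring "tsecer" as an integer at depth 0 and then as a function at depth 1 makes lookup() return the function inside foo, and leaving depth 1 exposes the integer again.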

7. A diagram

For the example above, the relationships look roughly as follows.

[Figure: relationship diagram for the example above]

posted on 2020-10-23 20:43 tsecer
