
Lex & Flex Lexical Analyzers in Practice (unfinished, to be updated)

2011-10-10 23:38  Haippy

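Introduction to Lex & Flex

Lex, short for LEXical compiler, is a well-known tool in the Unix world. Originally written by Mike Lesk and Eric Schmidt, it is the standard lexical analyzer generator on many UNIX systems, and its behavior is specified as part of the POSIX standard. Lex's main job is to generate the C source code of a lexical analyzer (scanner) from rules written as regular expressions. A scanner description file (*.l) is processed by lex to produce a file named lex.yy.c, which a C compiler then compiles into a lexical analyzer. In short, a lexical analyzer converts the raw input characters into tokens, which are much easier for later stages such as YACC or Bison to process. The process is shown in Figure 1: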

 

Figure 1. How Lex works

 

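Flex (fast lexical analyzer generator) is an alternative to Lex. It is often used together with Bison, the free-software parser generator. Flex was originally written in C by Vern Paxson in 1987. The Flex manual describes Flex as follows:

Flex is a tool for generating scanners, that is, programs that recognize lexical patterns in text. Flex reads the given input files, or its standard input if no file names are given, for a description of the scanner to generate. The description takes the form of pairs of regular expressions and C code, called rules. Flex generates as output a C source file, lex.yy.c, which defines the function yylex(). This file is compiled and linked with the -lfl library to produce an executable. When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code. Flex is not a GNU project, but GNU wrote a manual for Flex.

In short, Flex is a lexical analysis tool: it reads an input specification and generates a C source file, named lex.yy.c by default, which contains the yylex() routine and can be compiled and linked into an executable by a C compiler. When the resulting scanner runs, it executes the associated C code whenever the input matches one of the defined patterns, and in this way carries out the lexical analysis.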

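Lex & Flex Input File Format

A Flex input file consists of three sections: the definitions section, the rules section, and the user code section. The sections are separated by lines that contain only two consecutive percent signs ("%%"):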

definitions
%%
rules
%%
user code

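The three sections of a Flex input file are explained below:

  • The definitions section contains simple name definitions, which simplify the scanner's rules, and declarations of start conditions, which are explained in detail later.
  • The rules section contains a series of rules of the form pattern action. The pattern must start at the beginning of the line, unindented, and the action must begin on the same line; both are described in detail later. In the rules section, any indented or "%{" "%}"-enclosed text that appears before the first rule may be used to declare variables local to the scanning routine and, after the declarations, code to be executed whenever the scanning routine is entered. Other indented or "%{" "%}"-enclosed text in the rules section is still copied to the output (with the "%{" and "%}" removed; the delimiters themselves must not be indented), but its meaning is not well defined and it may cause compile-time errors. This feature exists for POSIX compliance.
  • The user code section is copied verbatim to the output file and usually holds companion routines that call, or are called by, the scanner. This section is optional; if it is absent, the second "%%" may be omitted as well.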

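Comments in Flex Source Files

Flex supports C-style comments, that is, /* comments */ is legal. When generating the scanner, Flex simply copies the comments in the source file verbatim to the output file. Comments may appear almost anywhere in a Flex source file, with the following restrictions:

  • Comments may not appear in the rules section wherever Flex may be expecting a regular expression. This means comments may not appear at the beginning of a line in the rules section, or immediately after a list of scanner states (start conditions).
  • Comments may not appear on a %option line in the definitions section.

A simple rule to follow is therefore to begin each comment on a new line, with one or more whitespace characters before the "/*". All of the following comments are valid: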

%{
/* code block */
%}
/* Definitions Section */
%x STATE_X
%%
    /* Rules Section */
ruleA   /* after regex */ { /* code block */ } /* after code block */
        /* Rules Section (indented) */
<STATE_X>{
ruleC ECHO;
ruleD ECHO;
%{
/* code block */
%}
}
%%
/* User Code Section */

 

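Patterns and Actions (to keep them complete and accurate, the Patterns and Actions subsections below are quoted directly from the English text of the Flex manual)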


Patterns

The patterns in the input are written using an extended set of regular expressions. These are:

 

x
match the character 'x'
.
any character (byte) except newline

[xyz]
a character class; in this case, the pattern matches either an 'x', a 'y', or a 'z'

[abj-oZ]
a "character class" with a range in it; matches an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'

[^A-Z]
a "negated character class", i.e., any character but those in the class. In this case, any character EXCEPT an uppercase letter.
[^A-Z\n]
any character EXCEPT an uppercase letter or a newline
[a-z]{-}[aeiou]
the lowercase consonants
r*
zero or more r's, where r is any regular expression
r+
one or more r's
r?
zero or one r's (that is, “an optional r”)

r{2,5}
anywhere from two to five r's
r{2,}
two or more r's
r{4}
exactly 4 r's

{name}
the expansion of the ‘name’ definition.

"[xyz]\"foo"
the literal string: ‘[xyz]"foo’

\X
if X is ‘a’, ‘b’, ‘f’, ‘n’, ‘r’, ‘t’, or ‘v’, then the ANSI-C interpretation of ‘\x’. Otherwise, a literal ‘X’ (used to escape operators such as ‘*’)

\0
a NUL character (ASCII code 0)

\123
the character with octal value 123
\x2a
the character with hexadecimal value 2a
(r)
match an ‘r’; parentheses are used to override precedence (see below)
(?r-s:pattern)
apply option ‘r’ and omit option ‘s’ while interpreting pattern. Options may be zero or more of the characters ‘i’, ‘s’, or ‘x’.

‘i’ means case-insensitive. ‘-i’ means case-sensitive.

‘s’ alters the meaning of the ‘.’ syntax to match any single byte whatsoever. ‘-s’ alters the meaning of ‘.’ to match any byte except ‘\n’.

‘x’ ignores comments and whitespace in patterns. Whitespace is ignored unless it is backslash-escaped, contained within ‘""’s, or appears inside a character class.

The following are all valid:

     (?:foo)         same as  (foo)
     (?i:ab7)        same as  ([aA][bB]7)
     (?-i:ab)        same as  (ab)
     (?s:.)          same as  [\x00-\xFF]
     (?-s:.)         same as  [^\n]
     (?ix-s: a . b)  same as  ([Aa][^\n][bB])
     (?x:a  b)       same as  ("ab")
     (?x:a\ b)       same as  ("a b")
     (?x:a" "b)      same as  ("a b")
     (?x:a[ ]b)      same as  ("a b")
     (?x:a
         /* comment */
         b
         c)          same as  (abc)

 

(?# comment )
omit everything within ‘()’. The first ‘)’ character encountered ends the pattern. It is not possible for the comment to contain a ‘)’ character. The comment may span lines.

rs
the regular expression ‘r’ followed by the regular expression ‘s’; called concatenation
r|s
either an ‘r’ or an ‘s’

r/s
an ‘r’ but only if it is followed by an ‘s’. The text matched by ‘s’ is included when determining whether this rule is the longest match, but is then returned to the input before the action is executed. So the action only sees the text matched by ‘r’. This type of pattern is called trailing context. (There are some combinations of ‘r/s’ that flex cannot match correctly. See Limitations, regarding dangerous trailing context.)

^r
an ‘r’, but only at the beginning of a line (i.e., when just starting to scan, or right after a newline has been scanned).

r$
an ‘r’, but only at the end of a line (i.e., just before a newline). Equivalent to ‘r/\n’.

Note that flex's notion of “newline” is exactly whatever the C compiler used to compile flex interprets ‘\n’ as; in particular, on some DOS systems you must either filter out ‘\r’s in the input yourself, or explicitly use ‘r/\r\n’ for ‘r$’.

<s>r
an ‘r’, but only in start condition s.
<s1,s2,s3>r
same, but in any of start conditions s1, s2, or s3.
<*>r
an ‘r’ in any start condition, even an exclusive one.

<<EOF>>
an end-of-file.
<s1,s2><<EOF>>
an end-of-file when in start condition s1 or s2

Note that inside of a character class, all regular expression operators lose their special meaning except escape (‘\’) and the character class operators, ‘-’, ‘]]’, and, at the beginning of the class, ‘^’.

The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. Those grouped together have equal precedence (see special note on the precedence of the repeat operator, ‘{}’, under the documentation for the ‘--posix’ POSIX compliance option). For example,

         foo|bar*

is the same as

         (foo)|(ba(r*))

since the ‘*’ operator has higher precedence than concatenation, and concatenation higher than alternation (‘|’). This pattern therefore matches either the string ‘foo’ or the string ‘ba’ followed by zero-or-more ‘r’'s. To match ‘foo’ or zero-or-more repetitions of the string ‘bar’, use:

         foo|(bar)*

And to match a sequence of zero or more repetitions of ‘foo’ and ‘bar’:

         (foo|bar)*

In addition to characters and ranges of characters, character classes can also contain character class expressions. These are expressions enclosed inside ‘[:’ and ‘:]’ delimiters (which themselves must appear between the ‘[’ and ‘]’ of the character class. Other elements may occur inside the character class, too). The valid expressions are:

         [:alnum:] [:alpha:] [:blank:]
         [:cntrl:] [:digit:] [:graph:]
         [:lower:] [:print:] [:punct:]
         [:space:] [:upper:] [:xdigit:]

These expressions all designate a set of characters equivalent to the corresponding standard C isXXX function. For example, ‘[:alnum:]’ designates those characters for which isalnum() returns true - i.e., any alphabetic or numeric character. Some systems don't provide isblank(), so flex defines ‘[:blank:]’ as a blank or a tab.

For example, the following character classes are all equivalent:

         [[:alnum:]]
         [[:alpha:][:digit:]]
         [[:alpha:][0-9]]
         [a-zA-Z0-9]

A word of caution. Character classes are expanded immediately when seen in the flex input. This means the character classes are sensitive to the locale in which flex is executed, and the resulting scanner will not be sensitive to the runtime locale. This may or may not be desirable.

  • If your scanner is case-insensitive (the ‘-i’ flag), then ‘[:upper:]’ and ‘[:lower:]’ are equivalent to ‘[:alpha:]’.

  • Character classes with ranges, such as ‘[a-Z]’, should be used with caution in a case-insensitive scanner if the range spans upper or lowercase characters. Flex does not know if you want to fold all upper and lowercase characters together, or if you want the literal numeric range specified (with no case folding). When in doubt, flex will assume that you meant the literal numeric range, and will issue a warning. The exception to this rule is a character range such as ‘[a-z]’ or ‘[S-W]’ where it is obvious that you want case-folding to occur. Here are some examples with the ‘-i’ flag enabled:
    Range    Result       Literal Range         Alternate Range
    [a-t]    ok           [a-tA-T]
    [A-T]    ok           [a-tA-T]
    [A-t]    ambiguous    [A-Z\[\\\]_`a-t]      [a-tA-T]
    [_-{]    ambiguous    [_`a-z{]              [_`a-zA-Z{]
    [@-C]    ambiguous    [@ABC]                [@A-Z\[\\\]_`abc]

  • A negated character class such as the example ‘[^A-Z]’ above will match a newline unless ‘\n’ (or an equivalent escape sequence) is one of the characters explicitly present in the negated character class (e.g., ‘[^A-Z\n]’). This is unlike how many other regular expression tools treat negated character classes, but unfortunately the inconsistency is historically entrenched. Matching newlines means that a pattern like ‘[^"]*’ can match the entire input unless there's another quote in the input.

    Flex allows negation of character class expressions by prepending ‘^’ to the POSIX character class name.

                  [:^alnum:] [:^alpha:] [:^blank:]
                  [:^cntrl:] [:^digit:] [:^graph:]
                  [:^lower:] [:^print:] [:^punct:]
                  [:^space:] [:^upper:] [:^xdigit:]
         

    Flex will issue a warning if the expressions ‘[:^upper:]’ and ‘[:^lower:]’ appear in a case-insensitive scanner, since their meaning is unclear. The current behavior is to skip them entirely, but this may change without notice in future revisions of flex.

  • The ‘{-}’ operator computes the difference of two character classes. For example, ‘[a-c]{-}[b-z]’ represents all the characters in the class ‘[a-c]’ that are not in the class ‘[b-z]’ (which in this case, is just the single character ‘a’). The ‘{-}’ operator is left associative, so ‘[abc]{-}[b]{-}[c]’ is the same as ‘[a]’. Be careful not to accidentally create an empty set, which will never match.
  • The ‘{+}’ operator computes the union of two character classes. For example, ‘[a-z]{+}[0-9]’ is the same as ‘[a-z0-9]’. This operator is useful when preceded by the result of a difference operation, as in, ‘[[:alpha:]]{-}[[:lower:]]{+}[q]’, which is equivalent to ‘[A-Zq]’ in the "C" locale.

  • A rule can have at most one instance of trailing context (the ‘/’ operator or the ‘$’ operator). The start condition, ‘^’, and ‘<<EOF>>’ patterns can only occur at the beginning of a pattern, and, as well as with ‘/’ and ‘$’, cannot be grouped inside parentheses. A ‘^’ which does not occur at the beginning of a rule or a ‘$’ which does not occur at the end of a rule loses its special properties and is treated as a normal character.
  • The following are invalid:

                  foo/bar$
                  <sc1>foo<sc2>bar
         

    Note that the first of these can be written ‘foo/bar\n’.

  • The following will result in ‘$’ or ‘^’ being treated as a normal character:

                  foo|(bar$)
                  foo|^bar
         

    If the desired meaning is a ‘foo’ or a ‘bar’-followed-by-a-newline, the following could be used (the special | action is explained below, see Actions):

                  foo      |
                  bar$     /* action goes here */
         

    A similar trick will work for matching a ‘foo’ or a ‘bar’-at-the-beginning-of-a-line.
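
The set operations on character classes described in the notes above can be modeled outside of flex. The following C sketch is not how flex implements them; it only illustrates the semantics by representing a class as a 256-entry membership table. The names CharClass, cc_range, cc_diff, cc_union, cc_negate and cc_to_string are hypothetical helpers introduced for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A character class modeled as a membership table over all byte values. */
typedef struct { unsigned char in[256]; } CharClass;

/* Build the class [lo-hi]. */
CharClass cc_range(unsigned char lo, unsigned char hi)
{
    CharClass c;
    assert(lo <= hi);
    memset(&c, 0, sizeof c);
    for (int ch = lo; ch <= hi; ++ch)
        c.in[ch] = 1;
    return c;
}

/* Model of a{-}b: the members of a that are not in b. */
CharClass cc_diff(CharClass a, CharClass b)
{
    for (int ch = 0; ch < 256; ++ch)
        a.in[ch] = (unsigned char)(a.in[ch] && !b.in[ch]);
    return a;
}

/* Model of a{+}b: the members of either class. */
CharClass cc_union(CharClass a, CharClass b)
{
    for (int ch = 0; ch < 256; ++ch)
        a.in[ch] = (unsigned char)(a.in[ch] || b.in[ch]);
    return a;
}

/* Model of [^...]: every byte not in a. '\n' is included unless a lists it. */
CharClass cc_negate(CharClass a)
{
    for (int ch = 0; ch < 256; ++ch)
        a.in[ch] = (unsigned char)!a.in[ch];
    return a;
}

/* Write the members with codes 1..255 into buf, in ascending order. */
const char *cc_to_string(CharClass c, char *buf, size_t cap)
{
    size_t n = 0;
    assert(cap > 0);
    for (int ch = 1; ch < 256 && n + 1 < cap; ++ch)
        if (c.in[ch])
            buf[n++] = (char)ch;
    buf[n] = '\0';
    return buf;
}
```

In this model, cc_diff(cc_range('a','c'), cc_range('b','z')) contains only 'a', which corresponds to the ‘[a-c]{-}[b-z]’ example, and cc_negate(cc_range('A','Z')) contains '\n', which is why ‘[^A-Z]’ matches a newline.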

     


Actions

Each pattern in a rule has a corresponding action, which can be any arbitrary C statement. The pattern ends at the first non-escaped whitespace character; the remainder of the line is its action. If the action is empty, then when the pattern is matched the input token is simply discarded. For example, here is the specification for a program which deletes all occurrences of ‘zap me’ from its input:

         %%
         "zap me"

This example will copy all other characters in the input to the output since they will be matched by the default rule.

Here is a program which compresses multiple blanks and tabs down to a single blank, and throws away whitespace found at the end of a line:

         %%
         [ \t]+        putchar( ' ' );
         [ \t]+$       /* ignore this token */
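
To make the behavior of these two rules concrete, here is a hand-written C sketch that should produce the same output on an in-memory string. It is only an illustration of what the generated scanner does, not flex output; the function name compress_blanks is hypothetical, and the caller is assumed to supply an output buffer of at least strlen(in) + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy in to out, replacing each run of blanks and tabs with one blank and
   dropping runs that are immediately followed by a newline.
   out must have room for strlen(in) + 1 bytes. */
void compress_blanks(const char *in, char *out)
{
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (*in == ' ' || *in == '\t') {
            /* Counterpart of [ \t]+ : consume the whole run. */
            in += strspn(in, " \t");
            /* Counterpart of [ \t]+$ : a run right before '\n' is discarded. */
            if (*in != '\n')
                *out++ = ' ';
        } else {
            /* Counterpart of the default rule: copy everything else. */
            *out++ = *in++;
        }
    }
    *out = '\0';
}
```

Note that a run of whitespace at the very end of the input, with no newline after it, is kept as a single blank, because ‘$’ only matches just before a newline.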

If the action contains a ‘{’, then the action spans till the balancing ‘}’ is found, and the action may cross multiple lines. flex knows about C strings and comments and won't be fooled by braces found within them, but also allows actions to begin with ‘%{’ and will consider the action to be all the text up to the next ‘%}’ (regardless of ordinary braces inside the action).

An action consisting solely of a vertical bar (‘|’) means “same as the action for the next rule”. See below for an illustration.

Actions can include arbitrary C code, including return statements to return a value to whatever routine called yylex(). Each time yylex() is called it continues processing tokens from where it last left off until it either reaches the end of the file or executes a return.

Actions are free to modify yytext except for lengthening it (adding characters to its end–these will overwrite later characters in the input stream). This however does not apply when using %array (see Matching). In that case, yytext may be freely modified in any way.

Actions are free to modify yyleng except they should not do so if the action also includes use of yymore() (see below).

There are a number of special directives which can be included within an action:

ECHO
copies yytext to the scanner's output.
BEGIN
followed by the name of a start condition places the scanner in the corresponding start condition (see below).
REJECT
directs the scanner to proceed on to the “second best” rule which matched the input (or a prefix of the input). The rule is chosen as described above in Matching, and yytext and yyleng set up appropriately. It may either be one which matched as much text as the originally chosen rule but came later in the flex input file, or one which matched less text. For example, the following will both count the words in the input and call the routine special() whenever ‘frob’ is seen:
                      int word_count = 0;
              %%
          
              frob        special(); REJECT;
              [^ \t\n]+   ++word_count;
     

Without the REJECT, any occurrences of ‘frob’ in the input would not be counted as words, since the scanner normally executes only one action per token. Multiple uses of REJECT are allowed, each one finding the next best choice to the currently active rule. For example, when the following scanner scans the token ‘abcd’, it will write ‘abcdabcaba’ to the output:

              %%
              a        |
              ab       |
              abc      |
              abcd     ECHO; REJECT;
              .|\n     /* eat up any unmatched character */
     

The first three rules share the fourth's action since they use the special ‘|’ action.
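
The order in which REJECT walks through the candidate rules can be traced with a small model. The C sketch below is not generated by flex; it hard-codes the four literal rules and the catch-all rule of this example and mimics "longest match first, then the next best" by trying the literals from longest to shortest. The name scan_with_reject and the output-buffer interface are assumptions made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The four literal rules, in the order they appear in the specification. */
static const char *const literal_rules[] = { "a", "ab", "abc", "abcd" };

/* Scan in and collect everything the model "echoes" into out
   (capacity cap bytes, always NUL-terminated). */
void scan_with_reject(const char *in, char *out, size_t cap)
{
    size_t used = 0;
    assert(cap > 0);
    for (size_t pos = 0; in[pos] != '\0'; ++pos) {
        /* Each literal rule does ECHO; REJECT; so every literal that
           matches at pos gets echoed, longest candidate first. */
        for (size_t len = 4; len >= 1; --len) {
            if (strncmp(in + pos, literal_rules[len - 1], len) != 0)
                continue;
            if (used + len < cap) {
                memcpy(out + used, in + pos, len);
                used += len;
            }
        }
        /* After the last REJECT, the catch-all rule .|\n consumes exactly
           one character without output; that is the ++pos of the loop. */
    }
    out[used] = '\0';
}
```

For the input ‘abcd’ the model echoes ‘abcd’, ‘abc’, ‘ab’ and ‘a’ in turn, then silently consumes the four characters one by one, giving ‘abcdabcaba’ as stated above.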

REJECT is a particularly expensive feature in terms of scanner performance; if it is used in any of the scanner's actions it will slow down all of the scanner's matching. Furthermore, REJECT cannot be used with the ‘-Cf’ or ‘-CF’ options.

Note also that unlike the other special actions, REJECT is a branch. Code immediately following it in the action will not be executed.

yymore()
tells the scanner that the next time it matches a rule, the corresponding token should be appended onto the current value of yytext rather than replacing it. For example, given the input ‘mega-kludge’ the following will write ‘mega-mega-kludge’ to the output:

              %%
              mega-    ECHO; yymore();
              kludge   ECHO;
     

First ‘mega-’ is matched and echoed to the output. Then ‘kludge’ is matched, but the previous ‘mega-’ is still hanging around at the beginning of yytext so the ECHO for the ‘kludge’ rule will actually write ‘mega-kludge’.

Two notes regarding use of yymore(). First, yymore() depends on the value of yyleng correctly reflecting the size of the current token, so you must not modify yyleng if you are using yymore(). Second, the presence of yymore() in the scanner's action entails a minor performance penalty in the scanner's matching speed.

yyless(n) returns all but the first n characters of the current token back to the input stream, where they will be rescanned when the scanner looks for the next match. yytext and yyleng are adjusted appropriately (e.g., yyleng will now be equal to n). For example, on the input ‘foobar’ the following will write out ‘foobarbar’:

         %%
         foobar    ECHO; yyless(3);
         [a-z]+    ECHO;

An argument of 0 to yyless() will cause the entire current input string to be scanned again. Unless you've changed how the scanner will subsequently process its input (using BEGIN, for example), this will result in an endless loop.
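
The effect of yyless(n) on the scanning position can be modeled in plain C. The sketch below is an illustration only, not flex output: it hard-codes the two rules of the example plus the default rule, and the parameter keep plays the role of n. The names append_text and scan_with_yyless are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append n bytes of s to out, keeping out NUL-terminated within cap bytes. */
void append_text(char *out, size_t cap, const char *s, size_t n)
{
    size_t used = strlen(out);
    if (used + n < cap) {
        memcpy(out + used, s, n);
        out[used + n] = '\0';
    }
}

/* Model of the two-rule scanner; keep corresponds to n in yyless(n). */
void scan_with_yyless(const char *in, size_t keep, char *out, size_t cap)
{
    size_t pos = 0;
    /* keep == 0 would rescan the same text forever, as noted above. */
    assert(cap > 0 && keep >= 1 && keep <= 6);
    out[0] = '\0';
    while (in[pos] != '\0') {
        /* Length of the [a-z]+ match starting at pos (0 if none). */
        size_t lower = strspn(in + pos, "abcdefghijklmnopqrstuvwxyz");
        if (lower == 6 && strncmp(in + pos, "foobar", 6) == 0) {
            /* Both rules match 6 characters; the earlier rule wins. */
            append_text(out, cap, in + pos, 6);      /* ECHO */
            pos += keep;                             /* yyless(keep) */
        } else if (lower > 0) {
            append_text(out, cap, in + pos, lower);  /* [a-z]+  ECHO */
            pos += lower;
        } else {
            append_text(out, cap, in + pos, 1);      /* default rule */
            pos += 1;
        }
    }
}
```

With keep set to 3, the scanner echoes ‘foobar’, advances only three characters, and then matches and echoes ‘bar’ with the second rule.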

Note that yyless() is a macro and can only be used in the flex input file, not from other source files.

unput(c) puts the character c back onto the input stream. It will be the next character scanned. The following action will take the current token and cause it to be rescanned enclosed in parentheses.

         {
         int i;
         /* Copy yytext because unput() trashes yytext */
         char *yycopy = strdup( yytext );
         unput( ')' );
         for ( i = yyleng - 1; i >= 0; --i )
             unput( yycopy[i] );
         unput( '(' );
         free( yycopy );
         }

Note that since each unput() puts the given character back at the beginning of the input stream, pushing back strings must be done back-to-front.
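
Why the push-back has to be done back-to-front becomes clear when the input stream is modeled as a stack of pushed-back characters in front of the unread text. The following C sketch is an assumption-laden model, not flex's buffer implementation; the type Input and the functions input_init, input_unput, input_next and unput_parenthesized are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { PUSHBACK_MAX = 256 };

typedef struct {
    const char *rest;          /* unread part of the original input */
    char stack[PUSHBACK_MAX];  /* pushed-back characters, last in, first out */
    size_t depth;
} Input;

void input_init(Input *in, const char *text)
{
    in->rest = text;
    in->depth = 0;
}

/* Model of unput(c): c becomes the next character to be read. */
void input_unput(Input *in, char c)
{
    assert(in->depth < PUSHBACK_MAX);
    in->stack[in->depth++] = c;
}

/* Model of input(): pushed-back characters first, then the original text.
   Returns -1 when everything has been read. */
int input_next(Input *in)
{
    if (in->depth > 0)
        return (unsigned char)in->stack[--in->depth];
    if (*in->rest == '\0')
        return -1;
    return (unsigned char)*in->rest++;
}

/* Push token back enclosed in parentheses: ')' first, then the token
   from its last character to its first, then '('. */
void unput_parenthesized(Input *in, const char *token)
{
    input_unput(in, ')');
    for (size_t i = strlen(token); i > 0; --i)
        input_unput(in, token[i - 1]);
    input_unput(in, '(');
}
```

After unput_parenthesized(&in, "abc"), successive calls to input_next() yield ‘(’, ‘a’, ‘b’, ‘c’, ‘)’ and then continue with the original input.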

An important potential problem when using unput() is that if you are using %pointer (the default), a call to unput() destroys the contents of yytext, starting with its rightmost character and devouring one character to the left with each call. If you need the value of yytext preserved after a call to unput() (as in the above example), you must either first copy it elsewhere, or build your scanner using %array instead.

Finally, note that you cannot put back ‘EOF’ to attempt to mark the input stream with an end-of-file.

input() reads the next character from the input stream. For example, the following is one way to eat up C comments:

         %%
         "/*"        {
                     register int c;
     
                     for ( ; ; )
                         {
                         while ( (c = input()) != '*' &&
                                 c != EOF )
                             ;    /* eat up text of comment */
     
                         if ( c == '*' )
                             {
                             while ( (c = input()) == '*' )
                                 ;
                             if ( c == '/' )
                                 break;    /* found the end */
                             }
     
                         if ( c == EOF )
                             {
                             error( "EOF in comment" );
                             break;
                             }
                         }
                     }

(Note that if the scanner is compiled using C++, then input() is instead referred to as yyinput(), in order to avoid a name clash with the C++ stream by the name of input.)
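
The control flow of that action can be tried without flex by applying the same loop to an in-memory string. The C sketch below is an adaptation made for illustration: the hypothetical function skip_comment_body reads from a string instead of calling input(), and it signals an unterminated comment by returning NULL instead of calling error().

```c
#include <assert.h>
#include <stddef.h>

/* p points just past the opening delimiter of a C comment. Return a pointer
   just past the closing delimiter, or NULL if the text ends first. */
const char *skip_comment_body(const char *p)
{
    assert(p != NULL);
    for (;;) {
        /* Eat up the text of the comment until a '*' or the end. */
        while (*p != '*' && *p != '\0')
            ++p;
        if (*p == '\0')
            return NULL;        /* end of input inside the comment */
        /* Skip a run of one or more '*' characters. */
        while (*p == '*')
            ++p;
        if (*p == '/')
            return p + 1;       /* found the end of the comment */
        /* Any other character: keep scanning. If it was the end of the
           text, the check at the top of the next iteration returns NULL. */
    }
}
```

As in the flex action, a run of several ‘*’ characters directly before the ‘/’ still ends the comment, and a ‘*’ that is not followed by ‘/’ is treated as ordinary comment text.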

YY_FLUSH_BUFFER flushes the scanner's internal buffer so that the next time the scanner attempts to match a token, it will first refill the buffer using YY_INPUT(). This action is a special case of the more general yy_flush_buffer() function, described below.

yyterminate() can be used in lieu of a return statement in an action. It terminates the scanner and returns a 0 to the scanner's caller, indicating “all done”. By default, yyterminate() is also called when an end-of-file is encountered. It is a macro and may be redefined.

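The Generated Scanner

Advanced Topics

Start Conditions

Multiple Input Buffers

End-of-File Rules

User-Visible Variables in Flex

A Simple Example

The following simple example illustrates the basic use of Flex: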

%{
/* <stdio.h> for printf(), <stdlib.h> for atoi() and atof() */
#include <stdio.h>
#include <stdlib.h>
%}
     
DIGIT    [0-9]
ID       [a-z][a-z0-9]*

%%
     
{DIGIT}+    {
            printf( "An integer: %s (%d)\n", yytext,
                    atoi( yytext ) );
            }

{DIGIT}+"."{DIGIT}*        {
            printf( "A float: %s (%g)\n", yytext,
                    atof( yytext ) );
            }

if|then|begin|end|procedure|function        {
            printf( "A keyword: %s\n", yytext );
            }

{ID}        printf( "An identifier: %s\n", yytext );

"+"|"-"|"*"|"/"   printf( "An operator: %s\n", yytext );

"{"[^}\n]*"}"     /* eat up one-line comments */
     
[ \t\n]+	/* eat up whitespace */

.	printf( "Unrecognized character: %s\n", yytext );
     
%%
     
int main(int argc, const char *argv[])
{
	++argv, --argc;  /* skip over program name */
	if ( argc > 0 )
		yyin = fopen( argv[0], "r" );
	else
		yyin = stdin;
	yylex();
	return 0;
}

  

In-Depth Topics

FAQs