6.3 Writing a custom lexer

https://lalrpop.github.io/lalrpop/lexer_tutorial/003_writing_custom_lexer.html

Let's say we want to parse the Whitespace language, so we've put together a grammar like the following:

pub Program = <Statement*>;

Statement: ast::Stmt = {
    " " <StackOp>,
    "\t" " " <MathOp>,
    "\t" "\t" <HeapOp>,
    "\n" <FlowCtrl>,
    "\t" "\n" <Io>,
};

StackOp: ast::Stmt = {
    " " <Number> => ast::Stmt::Push(<>),
    "\n" " " => ast::Stmt::Dup,
    "\t" " " <Number> => ast::Stmt::Copy(<>),
    "\n" "\t" => ast::Stmt::Swap,
    "\n" "\n" => ast::Stmt::Discard,
    "\t" "\n" <Number> => ast::Stmt::Slide(<>),
};

MathOp: ast::Stmt = {
    " " " " => ast::Stmt::Add,
    " " "\t" => ast::Stmt::Sub,
    " " "\n" => ast::Stmt::Mul,
    "\t" " " => ast::Stmt::Div,
    "\t" "\t" => ast::Stmt::Mod,
};

// Remainder omitted

Naturally, it doesn't work. By default, LALRPOP generates a tokenizer that skips all whitespace -- including newlines. What we want is to capture whitespace characters and ignore the rest as comments, and LALRPOP does the opposite of that.

At the moment, LALRPOP doesn't allow you to configure the default tokenizer. In the future it will become quite flexible, but for now we have to write our own.

Let's start by defining the stream format. The parser will accept an iterator where each item in the stream has the following structure:

pub type Spanned<Tok, Loc, Error> = Result<(Loc, Tok, Loc), Error>;

Loc is typically just a usize, representing a byte offset into the input string. Each token is accompanied by two of them, marking the start and end positions where it was found. Error can be pretty much anything you choose. And of course Tok is the meat of the stream, defining what possible values the tokens themselves can have. Following the conventions of Rust iterators, we'll signal a valid token with Some(Ok(...)), an error with Some(Err(...)), and EOF with None.

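Concretely, a stream item might look like the following. This is a minimal, self-contained sketch; the `Tok` and `LexicalError` types used here are the ones defined later in this chapter:

```rust
// The stream item format the parser expects.
pub type Spanned<Tok, Loc, Error> = Result<(Loc, Tok, Loc), Error>;

// Defined below in the chapter; repeated here so the sketch compiles.
#[derive(Debug, PartialEq)]
pub enum Tok { Space, Tab, Linefeed }

#[derive(Debug, PartialEq)]
pub enum LexicalError {}

fn main() {
    // A space found at byte offset 3: a valid token, so Some(Ok(...)).
    // The two usize values are the start and (exclusive) end offsets.
    let item: Option<Spanned<Tok, usize, LexicalError>> =
        Some(Ok((3, Tok::Space, 4)));
    assert_eq!(item, Some(Ok((3, Tok::Space, 4))));

    // EOF is signalled with None, following the Iterator convention.
    let eof: Option<Spanned<Tok, usize, LexicalError>> = None;
    assert!(eof.is_none());
    println!("ok");
}
```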
(Note that the term "tokenizer" normally refers to a piece of code that simply splits up the stream, whereas a "lexer" also tags each token with its lexical category. What we're writing is the latter.)

Whitespace is a simple language from a lexical standpoint, with only three valid tokens:

pub enum Tok {
    Space,
    Tab,
    Linefeed,
}

Everything else is a comment. There are no invalid lexes, so we'll define our own error type, a void enum:

pub enum LexicalError {
    // Not possible
}
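Because `LexicalError` has no variants, no value of it can ever be constructed. The compiler knows this: an exhaustive `match` on it needs no arms, and the type occupies zero bytes. A small sketch illustrating both points:

```rust
#[derive(Debug)]
pub enum LexicalError {
    // Not possible
}

// A match on an uninhabited enum is exhaustive with zero arms,
// so this function type-checks as diverging -- it can never be called.
#[allow(dead_code)]
fn vacuous(err: LexicalError) -> ! {
    match err {}
}

fn main() {
    // The compiler gives an uninhabited enum zero size.
    assert_eq!(std::mem::size_of::<LexicalError>(), 0);
    println!("ok");
}
```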

Now for the lexer itself. We'll take a string slice as its input. For each token we process, we'll want to know the character value, and the byte offset in the string where it begins. We can do that by wrapping the CharIndices iterator, which yields tuples of (usize, char) representing exactly that information.

use std::str::CharIndices;

pub struct Lexer<'input> {
    chars: CharIndices<'input>,
}

impl<'input> Lexer<'input> {
    pub fn new(input: &'input str) -> Self {
        Lexer { chars: input.char_indices() }
    }
}

(The lifetime parameter 'input indicates that the Lexer cannot outlive the string it's trying to parse.)

Let's review our rules:

  • For a space character, we output Tok::Space.
  • For a tab character, we output Tok::Tab.
  • For a linefeed (newline) character, we output Tok::Linefeed.
  • We skip all other characters.
  • If we've reached the end of the string, we'll return None to signal EOF.

Writing a lexer for a language with multi-character tokens can get very complicated, but this is so straightforward, we can translate it directly into code without thinking very hard. Here's our Iterator implementation:

impl<'input> Iterator for Lexer<'input> {
    type Item = Spanned<Tok, usize, LexicalError>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            match self.chars.next() {
                Some((i, ' ')) => return Some(Ok((i, Tok::Space, i+1))),
                Some((i, '\t')) => return Some(Ok((i, Tok::Tab, i+1))),
                Some((i, '\n')) => return Some(Ok((i, Tok::Linefeed, i+1))),

                None => return None, // End of file
                _ => continue, // Comment; skip this character
            }
        }
    }
}

That's it. That's all we need.
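To check that the lexer behaves as described, we can run it over a small input by hand. This is a self-contained sketch that repeats the definitions above so it compiles on its own:

```rust
use std::str::CharIndices;

pub type Spanned<Tok, Loc, Error> = Result<(Loc, Tok, Loc), Error>;

#[derive(Debug, PartialEq)]
pub enum Tok { Space, Tab, Linefeed }

#[derive(Debug, PartialEq)]
pub enum LexicalError {}

pub struct Lexer<'input> {
    chars: CharIndices<'input>,
}

impl<'input> Lexer<'input> {
    pub fn new(input: &'input str) -> Self {
        Lexer { chars: input.char_indices() }
    }
}

impl<'input> Iterator for Lexer<'input> {
    type Item = Spanned<Tok, usize, LexicalError>;

    fn next(&mut self) -> Option<Self::Item> {
        loop {
            match self.chars.next() {
                Some((i, ' ')) => return Some(Ok((i, Tok::Space, i + 1))),
                Some((i, '\t')) => return Some(Ok((i, Tok::Tab, i + 1))),
                Some((i, '\n')) => return Some(Ok((i, Tok::Linefeed, i + 1))),

                None => return None,  // End of file
                _ => continue,        // Comment; skip this character
            }
        }
    }
}

fn main() {
    // The 'x' at offset 1 is a comment and produces no token.
    let tokens: Vec<_> = Lexer::new(" x\t\n").collect();
    assert_eq!(tokens, vec![
        Ok((0, Tok::Space, 1)),
        Ok((2, Tok::Tab, 3)),
        Ok((3, Tok::Linefeed, 4)),
    ]);
    println!("ok");
}
```

Note that the spans are byte offsets: the skipped comment character still advances the offsets, which is exactly what we want for error reporting.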

Updating the parser

To use this with LALRPOP, we need to expose its API to the parser. It's pretty easy to do, but also somewhat magical, so pay close attention. Pick a convenient place in the grammar file (I chose the bottom) and insert an extern block:

extern {
    // ...
}

Now we tell LALRPOP about the Location and Error types, as if we're writing a trait:

extern {
    type Location = usize;
    type Error = lexer::LexicalError;

    // ...
}

We expose the Tok type by kinda sorta redeclaring it:

extern {
    type Location = usize;
    type Error = lexer::LexicalError;

    enum lexer::Tok {
        // ...
    }
}

Now we have to declare each of our terminals. For each variant of Tok, we pick what name the parser will see, and write a pattern of the form name => lexer::Tok::Variant, similar to how action code works in grammar rules. The name can be an identifier, or a string literal. We'll use the latter.

Here's the whole thing:

extern {
    type Location = usize;
    type Error = lexer::LexicalError;

    enum lexer::Tok {
        " " => lexer::Tok::Space,
        "\t" => lexer::Tok::Tab,
        "\n" => lexer::Tok::Linefeed,
    }
}

From now on, the parser will take a Lexer as its input instead of a string slice, like so:

    let lexer = lexer::Lexer::new("\n\n\n");
    match parser::parse_Program(lexer) {
        ...
    }

And any time we write a string literal in the grammar, it'll substitute a variant of our Tok enum. This means we don't have to change any of the rules we already wrote! This will work as-is:

FlowCtrl: ast::Stmt = {
    " " " " <Label> => ast::Stmt::Mark(<>),
    " " "\t" <Label> => ast::Stmt::Call(<>),
    " " "\n" <Label> => ast::Stmt::Jump(<>),
    "\t" " " <Label> => ast::Stmt::Jz(<>),
    "\t" "\t" <Label> => ast::Stmt::Js(<>),
    "\t" "\n" => ast::Stmt::Return,
    "\n" "\n" => ast::Stmt::Exit,
};

The complete grammar is available in whitespace/src/parser.lalrpop.

Where to go from here

Things to try that apply to lexers in general:

  • Longer tokens
  • Tokens that require tracking internal lexer state

Things to try that are LALRPOP-specific:

  • Persuade a lexer generator to output the Spanned format
  • Make this tutorial better

posted on 2025-01-05 16:02 及途又八