6.5 Using an external library

https://lalrpop.github.io/lalrpop/lexer_tutorial/005_external_lib.html

Writing a lexer yourself can be tricky. Fortunately, you can find many libraries on crates.io to generate a lexer for you.

In this tutorial, we will use Logos to build a simple lexer for a toy programming language. Here is an example of what we will be able to parse:

var a = 42;
var b = 23;

# a comment
print (a - b);

Setup

In your Cargo.toml, add the following dependency:

logos = "0.14"

This will provide the logos crate and the Logos trait.
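
Note that a complete project will usually also depend on lalrpop-util and run the LALRPOP code generator from a build script. Here is a minimal sketch of the full setup (the version numbers are indicative assumptions; check crates.io for current releases):

[dependencies]
logos = "0.14"
lalrpop-util = "0.20"

[build-dependencies]
lalrpop = "0.20"

And the accompanying build.rs at the crate root:

fn main() {
    // regenerate the parser from grammar.lalrpop on every build
    lalrpop::process_root().unwrap();
}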

The AST

We will use the following abstract syntax tree (AST) as a representation of our expressions:

#[derive(Clone, Debug, PartialEq)]
pub enum Statement {
  Variable { name: String, value: Box<Expression> },
  Print { value: Box<Expression> },
}

#[derive(Clone, Debug, PartialEq)]
pub enum Expression {
  Integer(i64),
  Variable(String),
  BinaryOperation {
    lhs: Box<Expression>,
    operator: Operator,
    rhs: Box<Expression>,
  },
}

#[derive(Clone, Debug, PartialEq)]
pub enum Operator {
  Add,
  Sub,
  Mul,
  Div,
}
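
As a quick sanity check, here is how the first statement of the example script, var a = 42;, maps onto these types (constructed by hand here; the parser will produce the same shape later):

let stmt = Statement::Variable {
    name: "a".to_string(),
    value: Box::new(Expression::Integer(42)),
};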

Implement the tokenizer

In a file named tokens.rs (or any other name you want), create an enumeration for your tokens, as well as a type for lexing errors:

use std::fmt;  // to implement the Display trait later
use std::num::ParseIntError;
use logos::Logos;

#[derive(Default, Debug, Clone, PartialEq)]
pub enum LexicalError {
    InvalidInteger(ParseIntError),
    #[default]
    InvalidToken,
}

impl From<ParseIntError> for LexicalError {
    fn from(err: ParseIntError) -> Self {
        LexicalError::InvalidInteger(err)
    }
}

#[derive(Logos, Clone, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f]+", skip r"#.*\n?", error = LexicalError)]
pub enum Token {
  #[token("var")]
  KeywordVar,
  #[token("print")]
  KeywordPrint,

  #[regex("[_a-zA-Z][_0-9a-zA-Z]*", |lex| lex.slice().to_string())]
  Identifier(String),
  #[regex("[1-9][0-9]*", |lex| lex.slice().parse())]
  Integer(i64),

  #[token("(")]
  LParen,
  #[token(")")]
  RParen,
  #[token("=")]
  Assign,
  #[token(";")]
  Semicolon,

  #[token("+")]
  OperatorAdd,
  #[token("-")]
  OperatorSub,
  #[token("*")]
  OperatorMul,
  #[token("/")]
  OperatorDiv,
}
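
The parse() callback on the Integer rule returns a Result, and Logos uses our From<ParseIntError> impl to convert a failed parse into LexicalError::InvalidInteger. A sketch of what that looks like in practice (the literal below is an assumed overflow case for i64):

let mut lex = Token::lexer("99999999999999999999");
// the regex matches, but parsing the slice into an i64 overflows
assert!(matches!(lex.next(), Some(Err(LexicalError::InvalidInteger(_)))));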

An exact match is specified using the #[token(...)] attribute.

For example, #[token("+")] causes the OperatorAdd token to be emitted only when a literal "+" appears in the input (unless it is part of another match; see below).

On the other hand, #[regex(...)] will match a regular expression.

The #[logos(...)] attribute on the enum defines regexes to skip when lexing the input. We've chosen to skip common whitespace characters and single-line comments of the form # .... It also lets us specify our custom error type, LexicalError, which is produced when an unexpected token is encountered or when parsing an integer fails.
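
To see the skipping in action, you can drive the derived lexer directly; iterating it yields Result<Token, LexicalError> values. A small sketch:

let tokens: Vec<_> = Token::lexer("# a comment\nvar a = 1;").collect();
// whitespace and the comment are skipped entirely:
// [Ok(KeywordVar), Ok(Identifier("a")), Ok(Assign), Ok(Integer(1)), Ok(Semicolon)]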

A few things to note about how Logos works:

When several token rules can match the same input, Logos uses precise rules to make a choice. The rule of thumb is:

  • Longer beats shorter.
  • Specific beats generic.

This means the input string "printa" will generate the following token:

Token::Identifier("printa".to_string())

And not:

  • Token::KeywordPrint
  • Token::Identifier("a".to_string())

This is because printa is longer than print, so the Identifier rule takes priority.
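
You can check this behaviour directly (a sketch; note that a space after print restores the keyword):

let mut lex = Token::lexer("printa print");
assert_eq!(lex.next(), Some(Ok(Token::Identifier("printa".to_string()))));
assert_eq!(lex.next(), Some(Ok(Token::KeywordPrint)));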

Finally, we implement the Display trait:

impl fmt::Display for Token {
  fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
    write!(f, "{:?}", self)
  }
}

This is required because the token is included in the error message that LALRPOP generates when parsing fails.

Implement the lexer

This part is very similar to the previous tutorials. In a file lexer.rs (or any other name), we will implement the Lexer as required by LALRPOP.

First, we define our types and structures:

use logos::{Logos, SpannedIter};

use crate::tokens::{Token, LexicalError}; // your Token enum, as above

pub type Spanned<Tok, Loc, Error> = Result<(Loc, Tok, Loc), Error>;

pub struct Lexer<'input> {
  // instead of an iterator over characters, we have a token iterator
  token_stream: SpannedIter<'input, Token>,
}

Then, we create the constructor for our Lexer:

impl<'input> Lexer<'input> {
  pub fn new(input: &'input str) -> Self {
    // the Token::lexer() method is provided by the Logos trait
    Self { token_stream: Token::lexer(input).spanned() }
  }
}

Finally, we implement the Iterator trait:

impl<'input> Iterator for Lexer<'input> {
  type Item = Spanned<Token, usize, LexicalError>;

  fn next(&mut self) -> Option<Self::Item> {
    self.token_stream
      .next()
      .map(|(token, span)| Ok((span.start, token?, span.end)))
  }
}
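
Each item the lexer yields is a Spanned triple of (start offset, token, end offset) in bytes, which is exactly the shape LALRPOP expects from an external lexer. For example (a sketch, using the types above):

let mut lexer = Lexer::new("var a = 42;");
assert_eq!(lexer.next(), Some(Ok((0, Token::KeywordVar, 3))));
assert_eq!(lexer.next(), Some(Ok((4, Token::Identifier("a".to_string()), 5))));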

Update the grammar

Next, in our grammar.lalrpop file (or any other name), we can integrate our lexer as follows:

use crate::tokens::{Token, LexicalError};
use crate::ast;

grammar;

// ...

extern {
  type Location = usize;
  type Error = LexicalError;

  enum Token {
    "var" => Token::KeywordVar,
    "print" => Token::KeywordPrint,
    "identifier" => Token::Identifier(<String>),
    "int" => Token::Integer(<i64>),
    "(" => Token::LParen,
    ")" => Token::RParen,
    "=" => Token::Assign,
    ";" => Token::Semicolon,
    "+" => Token::OperatorAdd,
    "-" => Token::OperatorSub,
    "*" => Token::OperatorMul,
    "/" => Token::OperatorDiv,
  }
}

NB: This part lets us give a precise name to each token emitted by our Lexer. We can then use those names ("identifier", "var", ...) in our grammar rules to reference the desired token.

Finally, we can build our rules:

pub Script: Vec<ast::Statement> = {
  <stmts:Statement*> => stmts
}

pub Statement: ast::Statement = {
  "var" <name:"identifier"> "=" <value:Expression> ";" => {
    ast::Statement::Variable { name, value }
  },
  "print" <value:Expression> ";" => {
    ast::Statement::Print { value }
  },
}

pub Expression: Box<ast::Expression> = {
  #[precedence(level="1")]
  Term,

  #[precedence(level="2")] #[assoc(side="left")]
  <lhs:Expression> "*" <rhs:Expression> => {
    Box::new(ast::Expression::BinaryOperation {
      lhs,
      operator: ast::Operator::Mul,
      rhs
    })
  },
  <lhs:Expression> "/" <rhs:Expression> => {
    Box::new(ast::Expression::BinaryOperation {
      lhs,
      operator: ast::Operator::Div,
      rhs
    })
  },

  #[precedence(level="3")] #[assoc(side="left")]
  <lhs:Expression> "+" <rhs:Expression> => {
    Box::new(ast::Expression::BinaryOperation {
      lhs,
      operator: ast::Operator::Add,
      rhs
    })
  },
  <lhs:Expression> "-" <rhs:Expression> => {
    Box::new(ast::Expression::BinaryOperation {
      lhs,
      operator: ast::Operator::Sub,
      rhs
    })
  },
}

pub Term: Box<ast::Expression> = {
  <val:"int"> => {
    Box::new(ast::Expression::Integer(val))
  },
  <name:"identifier"> => {
    Box::new(ast::Expression::Variable(name))
  },
  "(" <Expression> ")",
}

Our grammar is now complete.

Running your parser

The last step is to run our parser:

let source_code = std::fs::read_to_string("myscript.toy")?;
let lexer = Lexer::new(&source_code);
let parser = ScriptParser::new();
let ast = parser.parse(lexer)?;

println!("{:?}", ast);
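
For reference, a complete main.rs wiring everything together might look like the sketch below (the module names and the myscript.toy input file are assumptions based on this tutorial's layout; lalrpop_mod! is provided by lalrpop-util):

use lalrpop_util::lalrpop_mod;

mod ast;
mod lexer;
mod tokens;

// pull in the parser that LALRPOP generated from grammar.lalrpop
lalrpop_mod!(pub grammar);

use crate::grammar::ScriptParser;
use crate::lexer::Lexer;

fn main() {
    let source_code = std::fs::read_to_string("myscript.toy")
        .expect("could not read myscript.toy");
    let lexer = Lexer::new(&source_code);
    match ScriptParser::new().parse(lexer) {
        Ok(ast) => println!("{:?}", ast),
        Err(err) => eprintln!("parse error: {:?}", err),
    }
}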
