6.5 Using an external library
Writing a lexer yourself can be tricky. Fortunately, you can find many libraries on crates.io to generate a lexer for you.
In this tutorial, we will use Logos to build a simple lexer for a toy programming language. Here is an example of what we will be able to parse:
var a = 42;
var b = 23;
# a comment
print (a - b);
Setup
In your Cargo.toml, add the following dependency:
logos = "0.14"
This will provide the logos crate and the Logos trait.
The AST
We will use the following abstract syntax tree (AST) as a representation of our expressions:
#[derive(Clone, Debug, PartialEq)]
pub enum Statement {
    Variable { name: String, value: Box<Expression> },
    Print { value: Box<Expression> },
}

#[derive(Clone, Debug, PartialEq)]
pub enum Expression {
    Integer(i64),
    Variable(String),
    BinaryOperation {
        lhs: Box<Expression>,
        operator: Operator,
        rhs: Box<Expression>,
    },
}

#[derive(Clone, Debug, PartialEq)]
pub enum Operator {
    Add,
    Sub,
    Mul,
    Div,
}
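To make the shape of this AST concrete, the statement print (a - b); from the example above corresponds to the following value (built by hand here purely for illustration):

// Hand-built AST for `print (a - b);`, for illustration only
let stmt = Statement::Print {
    value: Box::new(Expression::BinaryOperation {
        lhs: Box::new(Expression::Variable("a".to_string())),
        operator: Operator::Sub,
        rhs: Box::new(Expression::Variable("b".to_string())),
    }),
};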
Implement the tokenizer
In a file named tokens.rs (or any other name you want), create an enumeration for your tokens, as well as a type for lexing errors:
use std::fmt; // to implement the Display trait later
use std::num::ParseIntError;

use logos::Logos;

#[derive(Default, Debug, Clone, PartialEq)]
pub enum LexicalError {
    InvalidInteger(ParseIntError),
    #[default]
    InvalidToken,
}

impl From<ParseIntError> for LexicalError {
    fn from(err: ParseIntError) -> Self {
        LexicalError::InvalidInteger(err)
    }
}

#[derive(Logos, Clone, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f]+", skip r"#.*\n?", error = LexicalError)]
pub enum Token {
    #[token("var")]
    KeywordVar,
    #[token("print")]
    KeywordPrint,

    #[regex("[_a-zA-Z][_0-9a-zA-Z]*", |lex| lex.slice().to_string())]
    Identifier(String),
    #[regex("[1-9][0-9]*", |lex| lex.slice().parse())]
    Integer(i64),

    #[token("(")]
    LParen,
    #[token(")")]
    RParen,
    #[token("=")]
    Assign,
    #[token(";")]
    Semicolon,

    #[token("+")]
    OperatorAdd,
    #[token("-")]
    OperatorSub,
    #[token("*")]
    OperatorMul,
    #[token("/")]
    OperatorDiv,
}
An exact match is specified using the #[token(...)] attribute.
For example, #[token("+")] causes the OperatorAdd token to be emitted only when a literal "+" appears in the input (unless it is part of a longer match; see below).
On the other hand, #[regex(...)] will match a regular expression.
The #[logos(...)] attribute on the enum defines regexes to skip when lexing the input. We've chosen to skip common whitespace characters and single-line comments of the form # .... The attribute also lets us specify our custom error type, LexicalError, which is produced if an unexpected token is encountered or if parsing an integer fails.
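If you want to convince yourself that skipping works as described, you can drive the derived lexer by hand. This is only a sanity check, not part of the tutorial's final code; the Token::lexer() method is provided by the Logos derive:

let mut lexer = Token::lexer("# a comment\nprint");
// the comment and the newline are skipped entirely
assert_eq!(lexer.next(), Some(Ok(Token::KeywordPrint)));
assert_eq!(lexer.next(), None);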
A few things to note about how Logos works:
When several sequences of tokens can match the same input, Logos uses precise rules to make a choice. The rule of thumb is:
- Longer beats shorter.
- Specific beats generic.
This means the "printa" input string will generate the following token:
Token::Identifier(String::from("printa"))
And not:
- Token::KeywordPrint
- Token::Identifier(String::from("a"))
This is because printa is longer than print, so the Identifier rule takes priority.
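This behavior is easy to verify against the derived lexer; a sketch of a unit test (not part of the tutorial code) might look like this:

#[test]
fn printa_is_a_single_identifier() {
    let mut lexer = Token::lexer("printa");
    // longest match wins: one Identifier, not KeywordPrint followed by Identifier
    assert_eq!(
        lexer.next(),
        Some(Ok(Token::Identifier("printa".to_string())))
    );
    assert_eq!(lexer.next(), None);
}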
Finally, we implement the Display trait:
impl fmt::Display for Token {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{:?}", self)
    }
}
This is required because the token is included in the error message that LALRPOP generates when parsing fails.
Implement the lexer
This part is very similar to the previous tutorials. In a file lexer.rs (or any other name), we will implement the Lexer as required by LALRPOP.
First, we define our types and structures:
use logos::{Logos, SpannedIter};

use crate::tokens::{Token, LexicalError}; // your Token enum, as above

pub type Spanned<Tok, Loc, Error> = Result<(Loc, Tok, Loc), Error>;

pub struct Lexer<'input> {
    // instead of an iterator over characters, we have a token iterator
    token_stream: SpannedIter<'input, Token>,
}
Then, we create the constructor for our Lexer:
impl<'input> Lexer<'input> {
    pub fn new(input: &'input str) -> Self {
        // the Token::lexer() method is provided by the Logos trait
        Self { token_stream: Token::lexer(input).spanned() }
    }
}
Finally, we implement the Iterator trait:
impl<'input> Iterator for Lexer<'input> {
    type Item = Spanned<Token, usize, LexicalError>;

    fn next(&mut self) -> Option<Self::Item> {
        self.token_stream
            .next()
            .map(|(token, span)| Ok((span.start, token?, span.end)))
    }
}
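The wrapped lexer can be exercised on its own before wiring it into LALRPOP. The following loop is illustrative only:

let lexer = Lexer::new("var a = 42;");
for item in lexer {
    match item {
        // each token carries its start and end byte offsets
        Ok((start, token, end)) => println!("{}..{}: {}", start, end, token),
        Err(err) => eprintln!("lexical error: {:?}", err),
    }
}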
Update the grammar
Next, in our grammar.lalrpop file (or any other name), we can integrate our lexer as follows:
use crate::tokens::{Token, LexicalError};
use crate::ast;

grammar;

// ...

extern {
    type Location = usize;
    type Error = LexicalError;

    enum Token {
        "var" => Token::KeywordVar,
        "print" => Token::KeywordPrint,
        "identifier" => Token::Identifier(<String>),
        "int" => Token::Integer(<i64>),
        "(" => Token::LParen,
        ")" => Token::RParen,
        "=" => Token::Assign,
        ";" => Token::Semicolon,
        "+" => Token::OperatorAdd,
        "-" => Token::OperatorSub,
        "*" => Token::OperatorMul,
        "/" => Token::OperatorDiv,
    }
}
NB: This part allows us to give a precise name to the tokens emitted by our Lexer. We can then use those names ("identifier", "var", ...) in our grammar rules to reference the desired token.
Finally, we can build our rules:
pub Script: Vec<ast::Statement> = {
    <stmts:Statement*> => stmts
}

pub Statement: ast::Statement = {
    "var" <name:"identifier"> "=" <value:Expression> ";" => {
        ast::Statement::Variable { name, value }
    },
    "print" <value:Expression> ";" => {
        ast::Statement::Print { value }
    },
}

pub Expression: Box<ast::Expression> = {
    #[precedence(level="1")]
    Term,

    #[precedence(level="2")] #[assoc(side="left")]
    <lhs:Expression> "*" <rhs:Expression> => {
        Box::new(ast::Expression::BinaryOperation {
            lhs,
            operator: ast::Operator::Mul,
            rhs
        })
    },
    <lhs:Expression> "/" <rhs:Expression> => {
        Box::new(ast::Expression::BinaryOperation {
            lhs,
            operator: ast::Operator::Div,
            rhs
        })
    },

    #[precedence(level="3")] #[assoc(side="left")]
    <lhs:Expression> "+" <rhs:Expression> => {
        Box::new(ast::Expression::BinaryOperation {
            lhs,
            operator: ast::Operator::Add,
            rhs
        })
    },
    <lhs:Expression> "-" <rhs:Expression> => {
        Box::new(ast::Expression::BinaryOperation {
            lhs,
            operator: ast::Operator::Sub,
            rhs
        })
    },
}

pub Term: Box<ast::Expression> = {
    <val:"int"> => {
        Box::new(ast::Expression::Integer(val))
    },
    <name:"identifier"> => {
        Box::new(ast::Expression::Variable(name))
    },
    "(" <Expression> ")",
}
Our grammar is now complete.
Running your parser
The last step is to run our parser:
let source_code = std::fs::read_to_string("myscript.toy")?;
let lexer = Lexer::new(&source_code);
let parser = ScriptParser::new();
let ast = parser.parse(lexer)?;
println!("{:?}", ast);
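For completeness, here is one way the surrounding module wiring might look. It assumes the standard LALRPOP build setup (a build.rs calling lalrpop::process_root()) and that the grammar file is named grammar.lalrpop; the file layout and error handling are illustrative assumptions, not part of the tutorial proper:

// main.rs -- illustrative wiring, assuming grammar.lalrpop lives in src/
use lalrpop_util::lalrpop_mod;

mod ast;
mod lexer;
mod tokens;

lalrpop_mod!(pub grammar); // module generated by LALRPOP from grammar.lalrpop

use crate::grammar::ScriptParser;
use crate::lexer::Lexer;

fn main() {
    let source_code = std::fs::read_to_string("myscript.toy")
        .expect("failed to read myscript.toy");
    let lexer = Lexer::new(&source_code);
    let parser = ScriptParser::new();
    // panics with a Debug-formatted ParseError on failure
    let ast = parser.parse(lexer).expect("parse error");
    println!("{:?}", ast);
}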