Rust + nom: how to wrap a parser so it emits tokens?



I'm trying to port a parser written in JavaScript. I feed the parser a string of source input and get back tokens. A token is just an object with a type and a pair of offset values marking its location in the source:

enum TokenType {
    Foo,
    Bar,
}

struct Token {
    token_type: TokenType,
    start_offset: usize,
    end_offset: usize,
}

Parsimmon has a convenient node combinator that wraps a parser (or combinator) and makes it output a node, much like the Token struct described above. I want to recreate that behavior with nom; here's what I have:

struct LexInput<'a> {
    source: &'a str,
    location: usize,
}

fn token<'a>(
    parser: impl Fn(&str) -> IResult<&'a str, &str>,
    token_type: TokenType,
) -> impl Fn(&LexInput) -> IResult<LexInput<'a>, Token> {
    move |input: &LexInput| {
        let start_offset = input.location;
        let (remaining_source, output) = parser(input.source)?;
        let end_offset = start_offset + output.len();
        let token = Token::new(token_type, start_offset, end_offset);
        let remaining = LexInput::new(remaining_source, end_offset);
        Ok((remaining, token))
    }
}

I'm still fairly new to Rust, so it took me a while to get this far, but the code looks promising — except I can't figure out how to use it. Instinctively, I wrote:

let (remaining, token) = token(tag("|"), TokenType::Bar)(&LexInput::new("|foo", 0)).unwrap();
assert_eq!(remaining.source, "foo");

Of course that doesn't work, and the error message is as confusing as ever:

expected associated type `<impl Fn(&str)-> Result<(&str, &str), nom::Err<nom::error::Error<&str>>> as FnOnce<(&str,)>>::Output`
found associated type `<impl Fn(&str)-> Result<(&str, &str), nom::Err<nom::error::Error<&str>>> as FnOnce<(&str,)>>::Output`

I mean, the "expected" and "found" types look exactly the same to me.

Can anyone help me figure out what's going wrong here?

Is this what you want?

use nom::{bytes::complete::tag, IResult};

#[derive(Debug)]
pub enum TokenType {
    Foo,
    Bar,
}

#[derive(Debug)]
pub struct Token {
    pub token_type: TokenType,
    pub start_offset: usize,
    pub end_offset: usize,
}

#[derive(Debug)]
pub struct LexInput<'a> {
    source: &'a str,
    location: usize,
}

impl<'a> LexInput<'a> {
    fn new(source: &'a str, location: usize) -> Self {
        Self { source, location }
    }
}

impl Token {
    fn new(token_type: TokenType, start_offset: usize, end_offset: usize) -> Self {
        Self {
            token_type,
            start_offset,
            end_offset,
        }
    }
}

fn token<'a>(
    parser: impl Fn(&'a str) -> IResult<&'a str, &str>,
    token_type: TokenType,
) -> impl FnOnce(LexInput<'a>) -> IResult<LexInput<'a>, Token> {
    move |input: LexInput| {
        let start_offset = input.location;
        let (remaining_source, output) =
            parser(input.source).map_err(|e| e.map_input(|_| input))?;
        let end_offset = start_offset + output.len();
        let token = Token::new(token_type, start_offset, end_offset);
        let remaining = LexInput::new(remaining_source, end_offset);
        Ok((remaining, token))
    }
}

fn main() {
    let source = "|foo".to_string();
    let (remaining, token) = token(tag("|"), TokenType::Bar)(LexInput::new(&source, 0)).unwrap();
    println!("remaining: {:?}", remaining);
    println!("token: {:?}", token);
}
remaining: LexInput { source: "foo", location: 1 }
token: Token { token_type: Bar, start_offset: 0, end_offset: 1 }

Your main mistakes were lifetime-related. Everywhere you didn't annotate a lifetime, a default lifetime was used, and it doesn't satisfy `'a`:

fn token<'a>(
    // The result can't be `'a` if it refers to the input `&str`; the input also has to be `'a`.
    parser: impl Fn(&str) -> IResult<&'a str, &str>,
    token_type: TokenType,
    // Same here, `&LexInput` needs to be `'a`. But as it already carries a lifetime,
    // just use that one instead: `LexInput<'a>`.
) -> impl Fn(&LexInput) -> IResult<LexInput<'a>, Token> {
    // Same here, although here the anonymous lifetime is sufficient to figure it out.
    move |input: &LexInput| {
        let start_offset = input.location;
        // Here, an error conversion is missing: the error carries the input and
        // therefore can't just be raised directly, because `parser` takes `&str`
        // as input while `token` takes `LexInput`. Luckily, the `map_input`
        // method exists.
        let (remaining_source, output) = parser(input.source)?;
        let end_offset = start_offset + output.len();
        let token = Token::new(token_type, start_offset, end_offset);
        let remaining = LexInput::new(remaining_source, end_offset);
        Ok((remaining, token))
    }
}

As a further remark: there is already the nom_locate crate, which does exactly what you're trying to do here.

The biggest advantage of the nom_locate crate is that its LocatedSpan type can be used directly by nom's parsers. There is no need to convert back and forth between your own type and &str, which makes the code much simpler:

use nom::{bytes::complete::tag, IResult};
use nom_locate::LocatedSpan;

type Span<'a> = LocatedSpan<&'a str>;

#[derive(Debug)]
pub enum TokenType {
    Foo,
    Bar,
}

#[derive(Debug)]
pub struct Token {
    pub token_type: TokenType,
    pub start_offset: usize,
    pub end_offset: usize,
}

impl Token {
    fn new(token_type: TokenType, start_offset: usize, end_offset: usize) -> Self {
        Self {
            token_type,
            start_offset,
            end_offset,
        }
    }
}

fn token<'a>(
    parser: impl Fn(Span<'a>) -> IResult<Span<'a>, Span<'a>>,
    token_type: TokenType,
) -> impl FnOnce(Span<'a>) -> IResult<Span<'a>, Token> {
    move |input: Span| {
        let start_offset = input.location_offset();
        let (remaining, _) = parser(input)?;
        let end_offset = remaining.location_offset();
        let token = Token::new(token_type, start_offset, end_offset);
        Ok((remaining, token))
    }
}

fn main() {
    let source = "|foo".to_string();
    let (remaining, token) = token(tag("|"), TokenType::Bar)(Span::new(&source)).unwrap();
    println!("remaining: {:?}", remaining);
    println!("token: {:?}", token);
}
remaining: LocatedSpan { offset: 1, line: 1, fragment: "foo", extra: () }
token: Token { token_type: Bar, start_offset: 0, end_offset: 1 }

With the help of nom::combinator::map and some restructuring, you can reduce it even further:

use nom::{bytes::complete::tag, combinator::map, IResult};
use nom_locate::LocatedSpan;

type Span<'a> = LocatedSpan<&'a str>;

#[derive(Debug, Clone)]
pub enum TokenType {
    Foo,
    Bar,
}

#[derive(Debug)]
pub struct Token {
    pub token_type: TokenType,
    pub start_offset: usize,
    pub end_offset: usize,
}

impl Token {
    fn new(token_type: TokenType, start_offset: usize, end_offset: usize) -> Self {
        Self {
            token_type,
            start_offset,
            end_offset,
        }
    }
}

fn token<'a>(
    parser: impl Fn(Span<'a>) -> IResult<Span<'a>, Span<'a>>,
    token_type: TokenType,
) -> impl FnMut(Span<'a>) -> IResult<Span<'a>, Token> {
    map(parser, move |matched| {
        Token::new(
            token_type.clone(),
            matched.location_offset(),
            matched.location_offset() + matched.len(),
        )
    })
}

fn main() {
    let source = "|foo".to_string();
    let (remaining, token) = token(tag("|"), TokenType::Bar)(Span::new(&source)).unwrap();
    println!("remaining: {:?}", remaining);
    println!("token: {:?}", token);
}
remaining: LocatedSpan { offset: 1, line: 1, fragment: "foo", extra: () }
token: Token { token_type: Bar, start_offset: 0, end_offset: 1 }
