Boost Spirit reports a successful parse despite an incomplete token



I have a very simple path structure that I am trying to parse with Boost Spirit.Lex.

We have the following grammar:

token := [a-z]+
path := (token : path) | (token)

So we are just talking about colon-delimited strings of lowercase ASCII characters here.

I have three examples: "xyz", "abc:xyz", and "abc:xyz:".

The first two should be considered valid. The third one, with the trailing colon, should not. Unfortunately, the parser I have recognizes all three as valid. The grammar should not allow an empty token, but apparently that is what Spirit is doing. What am I missing to get the third one rejected?

Also, if you read the code below, there is another version of the parser in the comments that requires all paths to terminate with a semicolon. When I activate those lines I get the proper behavior (i.e. "abc:xyz:;" is rejected), but that is not really what I want.

Does anyone have any ideas?

Thanks.

#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <iostream>
#include <string>
using namespace boost::spirit;
using boost::phoenix::val;
template<typename Lexer>
struct PathTokens : boost::spirit::lex::lexer<Lexer>
{
      PathTokens()
      {
         identifier = "[a-z]+";
         separator = ":";
         this->self.add
            (identifier)
            (separator)
            (';')
            ;
      }
      boost::spirit::lex::token_def<std::string> identifier, separator;
};

template <typename Iterator>
struct PathGrammar 
   : boost::spirit::qi::grammar<Iterator> 
{
      template <typename TokenDef>
      PathGrammar(TokenDef const& tok)
         : PathGrammar::base_type(path)
      {
         using boost::spirit::_val;
         path
            = 
            (token >> tok.separator >> path)[std::cerr << _1 << "\n"]
            |
            //(token >> ';')[std::cerr << _1 << "\n"]
            (token)[std::cerr << _1 << "\n"]
             ; 
          token 
             = (tok.identifier) [_val=_1]
          ;
      }
      boost::spirit::qi::rule<Iterator> path;
      boost::spirit::qi::rule<Iterator, std::string()> token;
};

int main()
{
   typedef std::string::iterator BaseIteratorType;
   typedef boost::spirit::lex::lexertl::token<BaseIteratorType, boost::mpl::vector<std::string> > TokenType;
   typedef boost::spirit::lex::lexertl::lexer<TokenType> LexerType;
   typedef PathTokens<LexerType>::iterator_type TokensIterator;
   typedef std::vector<std::string> Tests;
   Tests paths;
   paths.push_back("abc");
   paths.push_back("abc:xyz");
   paths.push_back("abc:xyz:");
   /*
     paths.clear();
     paths.push_back("abc;");
     paths.push_back("abc:xyz;");
     paths.push_back("abc:xyz:;");
   */
   for ( Tests::iterator iter = paths.begin(); iter != paths.end(); ++iter )
   {
      std::string str = *iter;
      std::cerr << "*****" << str << "*****\n";
      PathTokens<LexerType> tokens;
      PathGrammar<TokensIterator> grammar(tokens);
      BaseIteratorType first = str.begin();
      BaseIteratorType last = str.end();
      bool r = boost::spirit::lex::tokenize_and_parse(first, last, tokens, grammar);
      std::cerr << r << " " << (first==last) << "\n";
   }
}
In addition to what llonesmiz already said, here's a trick using qi::eoi that I sometimes use:

path = (
           (token >> tok.separator >> path) [std::cerr << _1 << "\n"]
         | token                            [std::cerr << _1 << "\n"]
    ) >> eoi;

This makes the grammar expect eoi (end of input) at the end of a successful match. This leads to the desired result:

http://liveworkspace.org/code/23a7adb11889bbb2825097d7c553f71d

*****abc*****
abc
1 1
*****abc:xyz*****
xyz
abc
1 1
*****abc:xyz:*****
xyz
abc
0 1

The problem lies in the meaning of first and last after your call to tokenize_and_parse. first==last checks whether your string was completely tokenized; you can't infer anything about the grammar from it. If you isolate the parsing like this, you get the expected result:

  PathTokens<LexerType> tokens;
  PathGrammar<TokensIterator> grammar(tokens);
  BaseIteratorType first = str.begin();
  BaseIteratorType last = str.end();
  LexerType::iterator_type lexfirst = tokens.begin(first,last);
  LexerType::iterator_type lexlast = tokens.end();

  bool r = parse(lexfirst, lexlast, grammar);
  std::cerr << r << " " << (lexfirst==lexlast) << "\n";

This is what I finally ended up with. It uses the suggestions from @sehe and @llonesmiz. Note the conversion to std::wstring and the use of an action on an object instance in the grammar definition, neither of which was present in the original post.

#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/bind.hpp>
#include <iostream>
#include <string>
//
// This example uses boost spirit to parse a simple
// colon-delimited grammar.
//
// The grammar we want to recognize is:
//    identifier := [a-z]+
//    separator = :
//    path= (identifier separator path) | identifier
//
// From the boost spirit perspective this example shows
// a few things I found hard to come by when building my
// first parser.
//    1. How to flag an incomplete token at the end of input
//       as an error. (use of boost::spirit::eoi)
//    2. How to bind an action on an instance of an object
//       that is taken as input to the parser.
//    3. Use of std::wstring.
//    4. Use of the lexer iterator.
//
// This using directive will cause issues with boost::bind
// when referencing placeholders such as _1.
// using namespace boost::spirit;
//! A class that tokenizes our input.
template<typename Lexer>
struct Tokens : boost::spirit::lex::lexer<Lexer>
{
      Tokens()
      {
         identifier = L"[a-z]+";
         separator = L":";
         this->self.add
            (identifier)
            (separator)
            ;
      }
      boost::spirit::lex::token_def<std::wstring, wchar_t> identifier, separator;
};
//! This class provides a callback that echoes strings to stderr.
struct Echo
{
      void echo(boost::fusion::vector<std::wstring> const& t) const
      {
         using namespace boost::fusion;
         std::wcerr << at_c<0>(t) << L"\n";
      }
};

//! The definition of our grammar, as described above.
template <typename Iterator>
struct Grammar : boost::spirit::qi::grammar<Iterator> 
{
      template <typename TokenDef>
      Grammar(TokenDef const& tok, Echo const& e)
         : Grammar::base_type(path)
      {
         using boost::spirit::_val;
         path
            = 
            ((token >> tok.separator >> path)[boost::bind(&Echo::echo, &e, ::_1)]
             |
             (token)[boost::bind(&Echo::echo, &e, ::_1)]
             ) >> boost::spirit::eoi; // Look for end of input.
          token 
             = (tok.identifier) [_val=boost::spirit::qi::_1]
          ;
      }
      boost::spirit::qi::rule<Iterator> path;
      boost::spirit::qi::rule<Iterator, std::wstring()> token;
};

int main()
{
   // A set of typedefs to make things a little clearer. This stuff is
   // well described in the boost spirit documentation/examples.
   typedef std::wstring::iterator BaseIteratorType;
   typedef boost::spirit::lex::lexertl::token<BaseIteratorType, boost::mpl::vector<std::wstring> > TokenType;
   typedef boost::spirit::lex::lexertl::lexer<TokenType> LexerType;
   typedef Tokens<LexerType>::iterator_type TokensIterator;
   typedef LexerType::iterator_type LexerIterator;
   // Define some paths to parse.
   typedef std::vector<std::wstring> Tests;
   Tests paths;
   paths.push_back(L"abc");
   paths.push_back(L"abc:xyz");
   paths.push_back(L"abc:xyz:");
   paths.push_back(L":");
   // Parse 'em.
   for ( Tests::iterator iter = paths.begin(); iter != paths.end(); ++iter )
   {
      std::wstring str = *iter;
      std::wcerr << L"*****" << str << L"*****\n";
      Echo e;
      Tokens<LexerType> tokens;
      Grammar<TokensIterator> grammar(tokens, e);
      BaseIteratorType first = str.begin();
      BaseIteratorType last = str.end();
      // Have the lexer consume our string.
      LexerIterator lexFirst = tokens.begin(first, last);
      LexerIterator lexLast = tokens.end();
      // Have the parser consume the output of the lexer.
      bool r = boost::spirit::qi::parse(lexFirst, lexLast, grammar);
      // Print the status and whether or not all output of the lexer
      // was processed.
      std::wcerr << r << L" " << (lexFirst==lexLast) << L"\n";
   }
}
