解析 antlr4 中由分号分隔的 sql 语句的语法



我正在寻找单独的sql语句,这些语句可能有注释/* */或字符串"test"或行注释--行注释(sql样式),用分号分隔。一个例子是:

Blah blah 'string ; ' ;
More text /* semicolon(;) inside comment */
Some more text 
in multiple lines

上面的文本应该只检索两个语句,因为字符串 ' ' 和注释/* */中的分号不应算作分隔符。我目前的语法是:

grammar SqlStatements;
sql_stmts:
    text (';' text)* EOF
    ;
text:
     SINGLE_LINE_COMMENT*
    | STRING* 
    | TEXT* 
;
TEXT:
 ~['--';''/*']* 
;
STRING
: 
     ''' (''''|~''')* '''
;
SINGLE_LINE_COMMENT
 : '--' ~[rn]* 
 ;
MULTILINE_COMMENT
 : '/*' .*? ( '*/' | EOF ) 
 ;

上面的代码在键入 * 时崩溃。

解析 SQL 的常用方法是首先拆分各个语句。这可能涉及处理分隔符开关,例如,当您在转储中有一个存储过程时,必须作为一个整体进行处理,但需要分号作为内部语句分隔符。

这可以通过跳过注释和字符串的优化循环非常快速地完成。以下是如何在MySQL Workbench中处理此问题的代码:

/**
 * A statement splitter to take a list of sql statements and split them into individual statements,
 * return their position and length in the original string (instead the copied strings).
 */
size_t MySQLParserServicesImpl::determineStatementRanges(const char *sql, size_t length,
                                                         const std::string &initial_delimiter,
                                                         std::vector<std::pair<size_t, size_t> > &ranges,
                                                         const std::string &line_break)
{
  _stop = false;
  std::string delimiter = initial_delimiter.empty() ? ";" : initial_delimiter;
  const unsigned char *delimiter_head = (unsigned char*)delimiter.c_str();
  const unsigned char keyword[] = "delimiter";
  const unsigned char *head = (unsigned char *)sql;
  const unsigned char *tail = head;
  const unsigned char *end = head + length;
  const unsigned char *new_line = (unsigned char*)line_break.c_str();
  bool have_content = false; // Set when anything else but comments were found for the current statement.
  while (!_stop && tail < end)
  {
    switch (*tail)
    {
      case '/': // Possible multi line comment or hidden (conditional) command.
        if (*(tail + 1) == '*')
        {
          tail += 2;
          bool is_hidden_command = (*tail == '!');
          while (true)
          {
            while (tail < end && *tail != '*')
              tail++;
            if (tail == end) // Unfinished comment.
              break;
            else
            {
              if (*++tail == '/')
              {
                tail++; // Skip the slash too.
                break;
              }
            }
          }
          if (!is_hidden_command && !have_content)
            head = tail; // Skip over the comment.
        }
        else
          tail++;
        break;
      case '-': // Possible single line comment.
      {
        const unsigned char *end_char = tail + 2;
        if (*(tail + 1) == '-' && (*end_char == ' ' || *end_char == 't' || is_line_break(end_char, new_line)))
        {
          // Skip everything until the end of the line.
          tail += 2;
          while (tail < end && !is_line_break(tail, new_line))
            tail++;
          if (!have_content)
            head = tail;
        }
        else
          tail++;
        break;
      }
      case '#': // MySQL single line comment.
        while (tail < end && !is_line_break(tail, new_line))
          tail++;
        if (!have_content)
          head = tail;
        break;
      case '"':
      case ''':
      case '`': // Quoted string/id. Skip this in a local loop.
      {
        have_content = true;
        char quote = *tail++;
        while (tail < end && *tail != quote)
        {
          // Skip any escaped character too.
          if (*tail == '\')
            tail++;
          tail++;
        }
        if (*tail == quote)
          tail++; // Skip trailing quote char to if one was there.
        break;
      }
      case 'd':
      case 'D':
      {
        have_content = true;
        // Possible start of the keyword DELIMITER. Must be at the start of the text or a character,
        // which is not part of a regular MySQL identifier (0-9, A-Z, a-z, _, $, u0080-uffff).
        unsigned char previous = tail > (unsigned char *)sql ? *(tail - 1) : 0;
        bool is_identifier_char = previous >= 0x80
        || (previous >= '0' && previous <= '9')
        || ((previous | 0x20) >= 'a' && (previous | 0x20) <= 'z')
        || previous == '$'
        || previous == '_';
        if (tail == (unsigned char *)sql || !is_identifier_char)
        {
          const unsigned char *run = tail + 1;
          const unsigned char *kw = keyword + 1;
          int count = 9;
          while (count-- > 1 && (*run++ | 0x20) == *kw++)
            ;
          if (count == 0 && *run == ' ')
          {
            // Delimiter keyword found. Get the new delimiter (everything until the end of the line).
            tail = run++;
            while (run < end && !is_line_break(run, new_line))
              run++;
            delimiter = base::trim(std::string((char *)tail, run - tail));
            delimiter_head = (unsigned char*)delimiter.c_str();
            // Skip over the delimiter statement and any following line breaks.
            while (is_line_break(run, new_line))
              run++;
            tail = run;
            head = tail;
          }
          else
            tail++;
        }
        else
          tail++;
        break;
      }
      default:
        if (*tail > ' ')
          have_content = true;
        tail++;
        break;
    }
    if (*tail == *delimiter_head)
    {
      // Found possible start of the delimiter. Check if it really is.
      size_t count = delimiter.size();
      if (count == 1)
      {
        // Most common case. Trim the statement and check if it is not empty before adding the range.
        head = skip_leading_whitespace(head, tail);
        if (head < tail)
          ranges.push_back(std::make_pair<size_t, size_t>(head - (unsigned char *)sql, tail - head));
        head = ++tail;
        have_content = false;
      }
      else
      {
        const unsigned char *run = tail + 1;
        const unsigned char *del = delimiter_head + 1;
        while (count-- > 1 && (*run++ == *del++))
          ;
        if (count == 0)
        {
          // Multi char delimiter is complete. Tail still points to the start of the delimiter.
          // Run points to the first character after the delimiter.
          head = skip_leading_whitespace(head, tail);
          if (head < tail)
            ranges.push_back(std::make_pair<size_t, size_t>(head - (unsigned char *)sql, tail - head));
          tail = run;
          head = run;
          have_content = false;
        }
      }
    }
  }
  // Add remaining text to the range list.
  head = skip_leading_whitespace(head, tail);
  if (head < tail)
    ranges.push_back(std::make_pair<size_t, size_t>(head - (unsigned char *)sql, tail - head));
  return 0;
}

这也适用于大型 sql 脚本,并且可以在大约 1 秒内拆分包含 100 万行的转储(当然取决于您运行此脚本的框)。var _stop 是一个标志,用于允许中断拆分进程。代码正在处理MySQL代码,因此它可以正确处理隐藏的命令(版本注释)。

使用每个查询的开始和长度信息,您现在可以转到解析器。

即使迈克的回答很好,我也需要在antlr中创建语法。以下语法对我有用:

sql_stmts:
    sql_stmt (';'+ sql_stmt)*
    ;
sql_stmt:
    TEXT*
;
TEXT:
~[']
| STRING
;
BLOCK_COMMENT
: '/*' .*? ( '*/' | EOF ) -> channel(HIDDEN)
;
LINE_COMMENT
: '--' ~[rn]* -> channel(HIDDEN)
;
SPACES
 : [ u000Btrn] -> channel(HIDDEN)
 ;
STRING
: 
     ''' (''''|~''')* ''' 
;

首先,不要忽略编译语法时生成的警告和错误消息。

其次,TEXT 规则不会做你认为它做的事情——引号在那里不起作用。 请参阅文档。

第三,您的第一行输入实际上是文本字符串文本半。 第二个 TEXT 是 SEMI 规则之前的空格,但您的规则只允许在 SEMI 之前出现一次非连续的 TEXT。

相关内容

  • 没有找到相关文章

最新更新