我正在寻找单独的sql语句,这些语句可能有注释/* */或字符串"test"或行注释--行注释(sql样式),用分号分隔。一个例子是:
Blah blah 'string ; ' ;
More text /* semicolon(;) inside comment */
Some more text
in multiple lines
上面的文本应该只检索两个语句,因为字符串 ' ' 和注释/* */中的分号不应算作分隔符。我目前的语法是:
grammar SqlStatements;
sql_stmts:
text (';' text)* EOF
;
text:
SINGLE_LINE_COMMENT*
| STRING*
| TEXT*
;
TEXT:
~['--';''/*']*
;
STRING
:
''' (''''|~''')* '''
;
SINGLE_LINE_COMMENT
: '--' ~[rn]*
;
MULTILINE_COMMENT
: '/*' .*? ( '*/' | EOF )
;
上面的代码在键入 * 时崩溃。
解析 SQL 的常用方法是首先拆分各个语句。这可能涉及处理分隔符开关,例如,当您在转储中有一个存储过程时,必须作为一个整体进行处理,但需要分号作为内部语句分隔符。
这可以通过跳过注释和字符串的优化循环非常快速地完成。以下是如何在MySQL Workbench中处理此问题的代码:
/**
* A statement splitter to take a list of sql statements and split them into individual statements,
* return their position and length in the original string (instead the copied strings).
*/
size_t MySQLParserServicesImpl::determineStatementRanges(const char *sql, size_t length,
const std::string &initial_delimiter,
std::vector<std::pair<size_t, size_t> > &ranges,
const std::string &line_break)
{
_stop = false;
std::string delimiter = initial_delimiter.empty() ? ";" : initial_delimiter;
const unsigned char *delimiter_head = (unsigned char*)delimiter.c_str();
const unsigned char keyword[] = "delimiter";
const unsigned char *head = (unsigned char *)sql;
const unsigned char *tail = head;
const unsigned char *end = head + length;
const unsigned char *new_line = (unsigned char*)line_break.c_str();
bool have_content = false; // Set when anything else but comments were found for the current statement.
while (!_stop && tail < end)
{
switch (*tail)
{
case '/': // Possible multi line comment or hidden (conditional) command.
if (*(tail + 1) == '*')
{
tail += 2;
bool is_hidden_command = (*tail == '!');
while (true)
{
while (tail < end && *tail != '*')
tail++;
if (tail == end) // Unfinished comment.
break;
else
{
if (*++tail == '/')
{
tail++; // Skip the slash too.
break;
}
}
}
if (!is_hidden_command && !have_content)
head = tail; // Skip over the comment.
}
else
tail++;
break;
case '-': // Possible single line comment.
{
const unsigned char *end_char = tail + 2;
if (*(tail + 1) == '-' && (*end_char == ' ' || *end_char == 't' || is_line_break(end_char, new_line)))
{
// Skip everything until the end of the line.
tail += 2;
while (tail < end && !is_line_break(tail, new_line))
tail++;
if (!have_content)
head = tail;
}
else
tail++;
break;
}
case '#': // MySQL single line comment.
while (tail < end && !is_line_break(tail, new_line))
tail++;
if (!have_content)
head = tail;
break;
case '"':
case ''':
case '`': // Quoted string/id. Skip this in a local loop.
{
have_content = true;
char quote = *tail++;
while (tail < end && *tail != quote)
{
// Skip any escaped character too.
if (*tail == '\')
tail++;
tail++;
}
if (*tail == quote)
tail++; // Skip trailing quote char to if one was there.
break;
}
case 'd':
case 'D':
{
have_content = true;
// Possible start of the keyword DELIMITER. Must be at the start of the text or a character,
// which is not part of a regular MySQL identifier (0-9, A-Z, a-z, _, $, u0080-uffff).
unsigned char previous = tail > (unsigned char *)sql ? *(tail - 1) : 0;
bool is_identifier_char = previous >= 0x80
|| (previous >= '0' && previous <= '9')
|| ((previous | 0x20) >= 'a' && (previous | 0x20) <= 'z')
|| previous == '$'
|| previous == '_';
if (tail == (unsigned char *)sql || !is_identifier_char)
{
const unsigned char *run = tail + 1;
const unsigned char *kw = keyword + 1;
int count = 9;
while (count-- > 1 && (*run++ | 0x20) == *kw++)
;
if (count == 0 && *run == ' ')
{
// Delimiter keyword found. Get the new delimiter (everything until the end of the line).
tail = run++;
while (run < end && !is_line_break(run, new_line))
run++;
delimiter = base::trim(std::string((char *)tail, run - tail));
delimiter_head = (unsigned char*)delimiter.c_str();
// Skip over the delimiter statement and any following line breaks.
while (is_line_break(run, new_line))
run++;
tail = run;
head = tail;
}
else
tail++;
}
else
tail++;
break;
}
default:
if (*tail > ' ')
have_content = true;
tail++;
break;
}
if (*tail == *delimiter_head)
{
// Found possible start of the delimiter. Check if it really is.
size_t count = delimiter.size();
if (count == 1)
{
// Most common case. Trim the statement and check if it is not empty before adding the range.
head = skip_leading_whitespace(head, tail);
if (head < tail)
ranges.push_back(std::make_pair<size_t, size_t>(head - (unsigned char *)sql, tail - head));
head = ++tail;
have_content = false;
}
else
{
const unsigned char *run = tail + 1;
const unsigned char *del = delimiter_head + 1;
while (count-- > 1 && (*run++ == *del++))
;
if (count == 0)
{
// Multi char delimiter is complete. Tail still points to the start of the delimiter.
// Run points to the first character after the delimiter.
head = skip_leading_whitespace(head, tail);
if (head < tail)
ranges.push_back(std::make_pair<size_t, size_t>(head - (unsigned char *)sql, tail - head));
tail = run;
head = run;
have_content = false;
}
}
}
}
// Add remaining text to the range list.
head = skip_leading_whitespace(head, tail);
if (head < tail)
ranges.push_back(std::make_pair<size_t, size_t>(head - (unsigned char *)sql, tail - head));
return 0;
}
这也适用于大型 sql 脚本,并且可以在大约 1 秒内拆分包含 100 万行的转储(当然取决于您运行此脚本的框)。var _stop 是一个标志,用于允许中断拆分进程。代码正在处理MySQL代码,因此它可以正确处理隐藏的命令(版本注释)。
使用每个查询的开始和长度信息,您现在可以转到解析器。
即使迈克的回答很好,我也需要在antlr中创建语法。以下语法对我有用:
sql_stmts:
sql_stmt (';'+ sql_stmt)*
;
sql_stmt:
TEXT*
;
TEXT:
~[']
| STRING
;
BLOCK_COMMENT
: '/*' .*? ( '*/' | EOF ) -> channel(HIDDEN)
;
LINE_COMMENT
: '--' ~[rn]* -> channel(HIDDEN)
;
SPACES
: [ u000Btrn] -> channel(HIDDEN)
;
STRING
:
''' (''''|~''')* '''
;
首先,不要忽略编译语法时生成的警告和错误消息。
其次,TEXT 规则不会做你认为它做的事情——引号在那里不起作用。 请参阅文档。
第三,您的第一行输入实际上是文本字符串文本半。 第二个 TEXT 是 SEMI 规则之前的空格,但您的规则只允许在 SEMI 之前出现一次非连续的 TEXT。