我试图创建一个SQL词法分析器(好吧,一个完整的解析器,但你必须开始某处),我不知道如何进行。我想这样写:
def nextToken(input: List[Char]) = input match {
case 'S'::'E'::'L'::'E'::'C'::'T'::tail => (SELECT, tail)
case _ => ??? // etc.
}
但是SQL不区分大小写。我可以把输入中的所有字符都大写,但这也会把字符串大写。我真正需要的是一种不区分大小写的比较方法,然后留下正确的tail
(在匹配令牌后剩余的List[Char]
)。在Scala 2.10.x中是否有一种方法可以轻松地做到这一点?
如果将输入保留为String,那么正则表达式似乎可以很好地解决这个问题。
val SelectRe = """(?i)SELECT(.*)""".r // the (?i) makes it case-insensitive
def nextToken(input: String) = input match {
case SelectRe(tail) => (SELECT, tail)
case _ => ??? // etc.
}
我需要这是一个有效的解析器,所以@dhg的答案不适合我。这是我选定的答案。它使用不区分大小写的Trie的变体,但是
- 最短的比赛获胜
- 我们不消耗整个搜索输入。我们只消耗一根火柴,不能超过。
结果如下:
case class CaseInsensitiveTrie[T](children: List[Node[T]] = Nil) {
def add(key: List[Char], value: T): CaseInsensitiveTrie[T] = {
def loop(key: List[Char], nodes: List[Node[T]]): List[Node[T]] = key match {
case c::Nil =>
nodes.find(_.key == c) match {
case Some(node) => nodes // No updating values
case None => Node(c, Some(value), Nil) :: nodes
}
case c::tail =>
nodes.find(_.key == c) match {
case Some(node) =>
val children = loop(tail, node.children)
val newNode = Node(c, None, children)
newNode :: nodes.filterNot(_.key == c)
case None =>
val children = loop(tail, Nil)
val newNode = Node(c, None, children)
newNode :: nodes
}
case Nil => ??? // Programmer error
}
CaseInsensitiveTrie(loop(key, children))
}
def find(input: List[Char]): Option[(T, List[Char])] = {
def loop(input: List[Char], nodes: List[Node[T]]): Option[(T, List[Char])] = input match {
case c::Nil =>
nodes.find(_.key.toUpper == c.toUpper).flatMap(_.value.map(_ -> Nil))
case c::tail =>
nodes.find(_.key.toUpper == c.toUpper).flatMap {
case node if node.value.isDefined =>
node.value.map(_ -> tail)
case node =>
loop(tail, node.children)
}
}
loop(input, children)
}
}
object CaseInsensitiveTrie {
case class Node[T](key: Char, value: Option[T], children: List[CaseInsensitiveTrie.Node[T]])
}
只有两个功能- find
和add
。只使用不区分大小写的关键字填充树,然后创建以下unapply函数:
val trie = CaseInsensitiveTrie[Token]()
.add("SELECT".toList, SELECT)
.add("FROM".toList, FROM)
.add("WHERE".toList, WHERE)
.add("AS".toList, AS)
object Keyword {
def unapply(input: List[Char]): Option[(Token, List[Char])] = {
trie.find(input)
}
}
因此,在匹配中,我们可以测试匹配并同时提取匹配和输入的剩余部分:
input match {
case '*'::tail => (STAR, Scan(tail, location + 1).trim)
case '('::tail => (LPAREN, Scan(tail, location + 1).trim)
case ')'::tail => (RPAREN, Scan(tail, location + 1).trim)
case '.'::tail => (DOT, Scan(tail, location + 1).trim)
case ','::tail => (COMMA, Scan(tail, location + 1).trim)
case '='::tail => (EQ, Scan(tail, location + 1).trim)
case '<'::'>'::tail => (NE, Scan(tail, location + 2).trim)
case '>'::tail => (GT, Scan(tail, location + 1).trim)
case '>'::'='::tail => (GE, Scan(tail, location + 2).trim)
case '<'::tail => (LT, Scan(tail, location + 1).trim)
case '>'::'='::tail => (LE, Scan(tail, location + 2).trim)
case Keyword(kw, tail) if tail.headOption.fold(true)(_.isWhitespace) =>
(kw, Scan(tail, location + kwLength(kw)).trim)
}
这不是最快的Trie实现,但它可以满足我的需求。如果性能成为一个问题,我可以将其压缩为