Split a string and get the start index of each segment



I'm trying to split a string and get the start index of every "word" I get.

For example, given a string like this:

"Rabbit jumped over a fence and this Rabbit loves carrots"

How can I split it to get the start index of each word?

0,7,14,19,21,27,31,36,43,49

You can do it like this:

val str="Rabbit jumped over a fence and this Rabbit loves carrots"
val indexArr=str.split(" ").scanLeft(0)((prev,next)=>prev+next.length+1).dropRight(1)

Sample output:

indexArr: Array[Int] = Array(0, 7, 14, 19, 21, 27, 31, 36, 43, 49)
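As a minimal sketch of the same scanLeft idea, the snippet below pairs each word with its start index. The helper name wordsWithStarts is made up for illustration, and it assumes the words are separated by exactly one space, as in the question.

```scala
// Hypothetical helper; assumes words are separated by exactly one space.
def wordsWithStarts(s: String): Array[(String, Int)] = {
  val words = s.split(" ")
  // Each start is the previous start plus the previous word's length plus one space.
  val starts = words.scanLeft(0)((start, word) => start + word.length + 1)
  // scanLeft yields one extra trailing offset; zip drops it.
  words.zip(starts)
}

wordsWithStarts("Rabbit jumped over").foreach(println)
```

Because zip stops at the shorter collection, the dropRight(1) step is not needed here.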

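Here is a solution that works even when the delimiter's width is not constant (not only for delimiters of length 1).

  1. Instead of a single delimiter FOO, use a combination of lookbehind and lookahead: (?<=FOO)|(?=FOO)
  2. Scan over the tokens and delimiters, accumulating their lengths to obtain the start indices
  3. Discard every second entry (the delimiters)

In code: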

val txt = "Rabbit jumped over a fence and this Rabbit loves carrots"
val pieces = txt.split("(?= )|(?<= )")
val startIndices = pieces.scanLeft(0){ (acc, w) => acc + w.size }
val tokensWithStartIndices = (pieces zip startIndices).grouped(2).map(_.head)
tokensWithStartIndices foreach println

Result:

(Rabbit,0)
(jumped,7)
(over,14)
(a,19)
(fence,21)
(and,27)
(this,31)
(Rabbit,36)
(loves,43)
(carrots,49)

Here is some intermediate output, so you can better see what happens at each step:

scala> val txt = "Rabbit jumped over a fence and this Rabbit loves carrots"
txt: String = Rabbit jumped over a fence and this Rabbit loves carrots
scala> val pieces = txt.split("(?= )|(?<= )")
pieces: Array[String] = Array(Rabbit, " ", jumped, " ", over, " ", a, " ", fence, " ", and, " ", this, " ", Rabbit, " ", loves, " ", carrots)
scala> val startIndices = pieces.scanLeft(0){ (acc, w) => acc + w.size }
startIndices: Array[Int] = Array(0, 6, 7, 13, 14, 18, 19, 20, 21, 26, 27, 30, 31, 35, 36, 42, 43, 48, 49, 56)
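To illustrate the point about delimiter width, here is a small sketch of the same three steps with a two-character delimiter. The input string and the value names are invented for this example, and it assumes the delimiter "--" never appears twice in a row or at the very start of the string.

```scala
val dashed = "Rabbit--jumped--over"
// Split at the zero-width positions just before and just after each "--",
// so the delimiters are kept as pieces of their own.
val dashedPieces = dashed.split("(?=--)|(?<=--)")
// Running total of piece lengths gives the start offset of every piece.
val dashedStarts = dashedPieces.scanLeft(0)((acc, piece) => acc + piece.length)
// Pieces alternate token, delimiter, token, ...; keep the first of each pair.
val dashedTokens = dashedPieces.zip(dashedStarts).grouped(2).map(_.head).toList

dashedTokens.foreach(println)
```

The delimiter pieces still contribute their length to the running total, which is why the tokens land on the right offsets even though the delimiter is two characters wide.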

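This should be accurate even if the line starts with whitespace, or if some words are separated by multiple spaces or tabs. It walks through the String and notes every transition from a whitespace character (space, tab, newline, etc.) to a non-whitespace character.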

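val txt = "Rabbit jumped over a fence and this Rabbit loves carrots"
// Fold state: (start indices found so far, was the previous character whitespace?)
txt.zipWithIndex.foldLeft((Seq.empty[Int], true)) { case ((s, b), (c, i)) =>
  if (c.isWhitespace) (s, true)
  else if (b) (s :+ i, false) // whitespace -> non-whitespace: a word starts here
  else (s, false)
}._1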
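A different way to express the same "start of each run of non-whitespace" idea is to let a regex find the runs and ask each match for its position. This is only a sketch, not part of the answer above: the helper name tokenStarts is made up, and \S follows the regex definition of whitespace, which can differ from Char.isWhitespace for some non-ASCII characters.

```scala
import scala.util.matching.Regex

// Hypothetical helper: every maximal run of non-whitespace with its start index.
def tokenStarts(s: String): List[(String, Int)] = {
  val nonWhitespaceRun: Regex = """\S+""".r
  nonWhitespaceRun.findAllMatchIn(s).map(m => (m.matched, m.start)).toList
}

// Leading spaces, a tab, and a newline between the words.
tokenStarts("  Rabbit \t jumped\nover").foreach(println)
```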

Here is an alternative that combines zipWithIndex and collect:

0 :: str.zipWithIndex.collect { case (' ', i) => i + 1 }.toList

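Prepending the index of the first word is not very elegant, and it only works with delimiters of length 1; but it is fairly minimal and readable.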
