What is the delimiter equivalent of ^G when reading a CSV with Spark?



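So, I really need help with something silly that I apparently can't manage on my own.

I have a set of rows in a file in the following format (as shown by less on OSX):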

XXXXXXXX^GT^XXXXXXXX^GN^G0^GDL^GN^G2018-09-14 13:57:00.0^G2018-09-16 00:00:00.0^GCompleted^GN^GN^G1^G2018-09-16 21:41:02.267^G1^G2018-09-16 21:41:02.267^GXXXXXXX^GN
YYYYYYYY^GS^XXXXXXXX^GN^G0^GDL^GN^G2018-08-29 00:00:00.0^G2018-08-29 23:00:00.0^GCompleted^GN^GN^G1^G2018-09-16 21:41:03.797^G1^G2018-09-16 21:41:03.81^GXXXXXXX^GN

So the delimiter is the BEL character, and I am loading the CSV like this:

val df = sqlContext.read.format("csv")
  .option("header", "false")
  .option("inferSchema", "true")
  .option("delimiter", "\u2407")
  .option("nullValue", "\\N")
  .load("part0000")

But when I read it, each row is read as a single column, like this:

XXXXXXXXCXXXXXXXXN0DLN2018-09-15 00:00:00.02018-09-16 00:00:00.0CompletedNN12018-09-16 21:41:03.25712018-09-16 21:41:03.263XXXXXXXXN
XXXXXXXXSXXXXXXXXN0DLN2018-09-15 00:00:00.02018-09-15 23:00:00.0CompletedNN12018-09-16 21:41:03.3712018-09-16 21:41:03.373XXXXXXXXN

There seems to be an unknown character in place of the ^G (you can't see anything here, but only because of how I formatted it on Stack Overflow).
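
One thing that may be worth checking is which character each delimiter string really contains. As far as I know, U+2407 is the printable "symbol for bell" glyph, while the ^G that less displays is the BEL control character, U+0007. The sketch below is plain Scala with no Spark involved; the object name, the helper names, and the three-field sample line are all made up for illustration.

```scala
object DelimiterCheck {
  // Hypothetical sample line: three fields joined by the BEL control
  // character (U+0007), which is what `less` renders as ^G.
  val line: String = Seq("XXXXXXXX", "T", "Completed").mkString("\u0007")

  // Render every character of a string as a code point such as "U+0007".
  def codePoints(s: String): String =
    s.map(c => f"U+${c.toInt}%04X").mkString(" ")

  // Count the fields produced by splitting on one delimiter character.
  def fieldCount(text: String, delimiter: Char): Int =
    text.split(delimiter).length
}
```

Under these assumptions, `DelimiterCheck.fieldCount(DelimiterCheck.line, '\u2407')` finds nothing to split on and reports a single field, which matches the one-column result above, while splitting on `'\u0007'` separates the fields.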

Update: could this be a limitation of Spark with Scala? If I run the code in Scala like this:

val df = sqlContext.read.format("csv")
  .option("header", "false")
  .option("inferSchema", "true")
  .option("delimiter", "\\a")
  .option("nullValue", "\\N")
  .load("part-m-00000")
display(df)

I get a big fat

java.lang.IllegalArgumentException: Unsupported special character for delimiter: \a

whereas if I run it with Python:

df = sqlContext.read.format('csv').options(header='false', inferSchema='true', delimiter="\a", nullValue='\\N').load('part-m-00000')
display(df)

everything works fine!

It looks like Spark's Scala CSV reader has a limitation in these versions. Here is the code that determines which CSV delimiters are supported:

apache/spark/sql/catalyst/csv/CSVOptions.scala

val delimiter = CSVExprUtils.toChar(
  parameters.getOrElse("sep", parameters.getOrElse("delimiter", ",")))

---CSVExprUtils.toChar

apache/spark/sql/catalyst/csv/CSVExprUtils.scala

def toChar(str: String): Char = {
  (str: Seq[Char]) match {
    case Seq() => throw new IllegalArgumentException("Delimiter cannot be empty string")
    case Seq('\\') => throw new IllegalArgumentException("Single backslash is prohibited." +
      " It has special meaning as beginning of an escape sequence." +
      " To get the backslash character, pass a string with two backslashes as the delimiter.")
    case Seq(c) => c
    case Seq('\\', 't') => '\t'
    case Seq('\\', 'r') => '\r'
    case Seq('\\', 'b') => '\b'
    case Seq('\\', 'f') => '\f'
    // In case user changes quote char and uses \" as delimiter in options
    case Seq('\\', '\"') => '\"'
    case Seq('\\', '\'') => '\''
    case Seq('\\', '\\') => '\\'
    case _ if str == """\u0000""" => '\u0000'
    case Seq('\\', _) =>
      throw new IllegalArgumentException(s"Unsupported special character for delimiter: $str")
    case _ =>
      throw new IllegalArgumentException(s"Delimiter cannot be more than one character: $str")
  }
}
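
To make the consequence of that match concrete, here is a small sketch that reports which kind of case a given delimiter string would reach. It is a simplified, hypothetical helper written for this answer, not Spark code: the object name, the function name, and the returned labels are all assumptions, and the individual escape cases are collapsed into one.

```scala
object DelimiterBranch {
  // Simplified stand-in for the shape of CSVExprUtils.toChar: it only
  // classifies the input and never throws.
  def branchFor(delimiter: String): String = (delimiter: Seq[Char]) match {
    case Seq()        => "empty"
    case Seq('\\')    => "single backslash"
    case Seq(_)       => "single character"
    case Seq('\\', _) => "backslash escape"
    case _            => "more than one character"
  }
}
```

In Scala source, `"\\a"` is a two-character string (a backslash followed by `a`), so it lands in the backslash-escape group, where `\a` is not one of the accepted sequences. Python, by contrast, turns `"\a"` into the BEL character before Spark ever sees it. Scala has no `\a` escape, but `"\u0007"` is a one-character string containing BEL, so `.option("delimiter", "\u0007")` should reach the `case Seq(c) => c` branch. I have not verified this against every Spark version.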
