How to parse data and put it into a Spark SQL table



I have a log file that I want to analyze with Spark SQL. The log file is in this format:

71.19.157.174 - - [24/Sep/2014:22:26:12 +0000] "GET /error HTTP/1.1" 404 505 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"

I have a regex pattern that I can use to parse the data:

Pattern.compile("""^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)""")

I have also created a case class:

case class LogSchema(ip: String, client: String, userid: String, date: String, method: String, endpoint: String, protocol: String, response: String, contentsize: String)

However, I can't turn this into a table that I can run Spark SQL queries against.

How can I use the regex to parse the data and put it into a table?

Say your log file is at /home/user/logs/log.txt; you can then use the following logic to get a table/DataFrame out of it:

import java.util.regex.Pattern
import spark.implicits._  // needed for toDF() when not already provided by spark-shell

val rdd = sc.textFile("/home/user/logs/log.txt")
val pattern = Pattern.compile("""^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)""")

val df = rdd.map(line => pattern.matcher(line)).map { matcher =>
  matcher.find()  // run the match so the group(...) calls below are valid
  LogSchema(matcher.group(1), matcher.group(2), matcher.group(3), matcher.group(4),
    matcher.group(5), matcher.group(6), matcher.group(7), matcher.group(8), matcher.group(9))
}.toDF()
df.show(false)

You should get the following DataFrame:

+-------------+------+------+--------------------------+------+--------+--------+--------+-----------+
|ip           |client|userid|date                      |method|endpoint|protocol|response|contentsize|
+-------------+------+------+--------------------------+------+--------+--------+--------+-----------+
|71.19.157.174|-     |-     |24/Sep/2014:22:26:12 +0000|GET   |/error  |HTTP/1.1|404     |505        |
+-------------+------+------+--------------------------+------+--------+--------+--------+-----------+
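
With this DataFrame you can run actual Spark SQL queries by registering it as a temporary view. A minimal sketch (the view name "logs" and the example query are illustrative, not part of the original code):

// Expose the DataFrame to the SQL engine under a temporary view name ("logs" is just an example).
df.createOrReplaceTempView("logs")

// Example query: count requests per HTTP response code.
spark.sql("SELECT response, COUNT(*) AS hits FROM logs GROUP BY response").show(false)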

I used the case class you provided:

case class LogSchema(ip: String, client: String, userid: String, date: String, method: String, endpoint: String, protocol: String, response: String, contentsize: String)
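
One caveat about the snippet above: it assumes every line matches the pattern, and matcher.group will throw an IllegalStateException on lines where find() finds no match. A sketch of a more defensive variant, using Scala's Regex and flatMap to drop malformed lines (logRegex and safeDf are illustrative names; the pattern string is the same one used above):

// The same pattern, compiled as a Scala Regex.
val logRegex = """^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)""".r

val safeDf = rdd.flatMap { line =>
  // findFirstMatchIn returns None when a line does not match,
  // so malformed lines are silently dropped instead of throwing.
  logRegex.findFirstMatchIn(line).map { m =>
    LogSchema(m.group(1), m.group(2), m.group(3), m.group(4),
      m.group(5), m.group(6), m.group(7), m.group(8), m.group(9))
  }
}.toDF()
safeDf.show(false)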
