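I am trying to read a JSON file over HTTP with Spark. The source is not HDFS or any other location from which Spark can easily read data and convert it into a DataFrame. The URL(s) are HTTPS and require a token and a set of headers to retrieve a response successfully. Is there a way to do this? The response looks like the following, and each object could easily be converted into a row of a DataFrame.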
{
"code": "403010",
"message": "message 1"
}
{
"code": "403010",
"message": "message 1"
}
{
"code": "403010",
"message": "message 1"
}
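The response looks odd because it contains multiple JSON headers (several top-level objects in a row), but this is the actual response from the API.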
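Because the body is a run of objects rather than one JSON document, one option is to split it into one string per object before handing it to a JSON reader. The following is only a minimal sketch in plain Scala: `JsonSplitter` and `splitConcatenatedObjects` are hypothetical names, and it assumes every top-level value is an object and the body is otherwise well formed.

```scala
object JsonSplitter {
  /** Splits back-to-back top-level JSON objects into one string per object. */
  def splitConcatenatedObjects(body: String): List[String] = {
    val objects = scala.collection.mutable.ListBuffer.empty[String]
    val current = new StringBuilder
    var depth = 0          // current nesting level of '{' ... '}'
    var inString = false   // true while inside a JSON string literal
    var escaped = false    // true right after a backslash inside a string

    for (ch <- body) {
      // Keep every character that belongs to an object; skip text between objects.
      if (depth > 0 || ch == '{') current.append(ch)
      if (inString) {
        // Braces inside string literals must not change the nesting depth.
        if (escaped) escaped = false
        else if (ch == '\\') escaped = true
        else if (ch == '"') inString = false
      } else ch match {
        case '"' => inString = true
        case '{' => depth += 1
        case '}' if depth > 0 =>
          depth -= 1
          if (depth == 0) {
            // A top-level object just closed: emit it and start a new one.
            objects += current.toString
            current.clear()
          }
        case _ => // other characters need no bookkeeping
      }
    }
    objects.toList
  }
}
```

With a `SparkSession` named `spark` in scope and `import spark.implicits._`, something like `spark.read.json(JsonSplitter.splitConcatenatedObjects(body).toDS())` should then yield one row per object.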
The answer was given at this URL. For anyone else who lands here: fetching the result from a URL with Scala Spark.
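import org.apache.spark.sql.{DataFrame, SparkSession}

// Reuse the active session; the original snippet assumed `spark` was already in scope.
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

def getUrlContentJson(url: String): DataFrame = {
  // Download the whole response body on the driver and close the source afterwards.
  val source = scala.io.Source.fromURL(url)
  val result = try source.mkString finally source.close()
  // The body is passed as a single record, so strip the trailing line break.
  // (I tested it with a complex JSON document and it worked.)
  val jsonResponseOneLine = result.stripLineEnd
  // spark.read.json needs a distributed collection of strings, not a plain String.
  // This took me some time, although it seems obvious now. On Spark 2.2+ a
  // Dataset[String] replaces the deprecated RDD[String] overload.
  val jsonDs = Seq(jsonResponseOneLine).toDS()
  spark.read.json(jsonDs)
}

// `url` is the address of the endpoint to read.
val response = getUrlContentJson(url)
response.show()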
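The snippet above sends no token or headers, which the question says the endpoint requires. A possible way to add them, using only `java.net.HttpURLConnection` from the JDK, is sketched below. `AuthenticatedFetch` and `fetch` are hypothetical names, the timeouts are arbitrary, and the header names your API expects are an assumption you would need to check.

```scala
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object AuthenticatedFetch {
  /** Performs a GET with the given headers and returns the response body as text. */
  def fetch(url: String, headers: Map[String, String]): String = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    try {
      conn.setRequestMethod("GET")
      conn.setConnectTimeout(10000) // milliseconds, arbitrary choice
      conn.setReadTimeout(30000)    // milliseconds, arbitrary choice
      // Attach every caller-supplied header, e.g. an Authorization token.
      headers.foreach { case (name, value) => conn.setRequestProperty(name, value) }

      val status = conn.getResponseCode
      // For 4xx/5xx responses the body, if any, is on the error stream.
      val stream = if (status < 400) conn.getInputStream else conn.getErrorStream
      if (stream == null) throw new java.io.IOException(s"HTTP $status with no body")

      val source = Source.fromInputStream(stream, "UTF-8")
      try source.mkString finally source.close()
    } finally {
      conn.disconnect()
    }
  }
}
```

A call might look like `AuthenticatedFetch.fetch(url, Map("Authorization" -> s"Bearer $token", "Accept" -> "application/json"))`, where `token` stands for whatever credential the API issues. The returned string can then be turned into a DataFrame in the same way as `result` in the function above. Note that the request runs on the driver, so this only suits responses that fit in driver memory.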