我有这样的数据
8213034705_cst,95,2.927373,jake7870,0,95,117.5,xbox,3
,10,0.18669,parakeet2004,5,1,120,xbox,3
8213060420_gfd,26,0.249757,bluebubbles_1,25,1,120,xbox,3
8213060420_xcv,80,0.59059,sa4741,3,1,120,xbox,3
,75,0.657384,jhnsn2273,51,1,120,xbox,3
我正在尝试将"缺失值"放在缺少记录的第一列(或完全删除它们)。我正在尝试执行以下代码,但它给了我错误
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.log4j._
import org.apache.spark.sql.functions
import java.lang.String
import org.apache.spark.sql.functions.udf
//import spark.implicits._
object DocParser2
{
case class Auction(auctionid:Option[String], bid:Double, bidtime:Double, bidder:String, bidderrate:Integer, openbid:Double, price:Double, item:String, daystolive:Integer)
def readint(ip:Option[String]):String = ip match
{
case Some(ip) => ip.split("_")(0)
case None => "missing value"
}
def main(args:Array[String]) =
{
val spark=SparkSession.builder.appName("DocParser").master("local[*]").getOrCreate()
import spark.implicits._
val intUDF = udf(readint _)
val lines=spark.read.format("csv").option("header","false").option("inferSchema", true).load("data/auction2.csv").toDF("auctionid","bid","bidtime","bidder","bidderrate","openbid","price","item","daystolive")
val recordsDS=lines.as[Auction]
recordsDS.printSchema()
println("splitting auction id into String and Int")
// recordsDS.withColumn("auctionid_int",java.lang.String.split('auctionid,"_")).show() some error with the split method
val auctionidcol=recordsDS.col("auctionid")
recordsDS.withColumn("auctionid_int",intUDF('auctionid)).show()
spark.stop()
}
}
但它是通过以下运行时错误
不能将java.lang.String转换为val intUDF行中的Scala.option = udf(readint _)
你能帮我找出错误吗?
谢谢
UDF 从来没有Option
作为输入,而是需要传递实际类型。如果是String
,您可以在UDF中进行null检查,对于不能为null的基元类型(Int,Double 等),还有其他解决方案...
您可以使用
spark.read.csv
读取csv
文件,并使用 na.drop()
删除包含缺失值的记录,在 Spark 2.0.2 上进行了测试:
val df = spark.read.option("header", "false").option("inferSchema", "true").csv("Path to Csv file")
df.show
+--------------+---+--------+-------------+---+---+-----+----+---+
| _c0|_c1| _c2| _c3|_c4|_c5| _c6| _c7|_c8|
+--------------+---+--------+-------------+---+---+-----+----+---+
|8213034705_cst| 95|2.927373| jake7870| 0| 95|117.5|xbox| 3|
| null| 10| 0.18669| parakeet2004| 5| 1|120.0|xbox| 3|
|8213060420_gfd| 26|0.249757|bluebubbles_1| 25| 1|120.0|xbox| 3|
|8213060420_xcv| 80| 0.59059| sa4741| 3| 1|120.0|xbox| 3|
| null| 75|0.657384| jhnsn2273| 51| 1|120.0|xbox| 3|
+--------------+---+--------+-------------+---+---+-----+----+---+
df.na.drop().show
+--------------+---+--------+-------------+---+---+-----+----+---+
| _c0|_c1| _c2| _c3|_c4|_c5| _c6| _c7|_c8|
+--------------+---+--------+-------------+---+---+-----+----+---+
|8213034705_cst| 95|2.927373| jake7870| 0| 95|117.5|xbox| 3|
|8213060420_gfd| 26|0.249757|bluebubbles_1| 25| 1|120.0|xbox| 3|
|8213060420_xcv| 80| 0.59059| sa4741| 3| 1|120.0|xbox| 3|
+--------------+---+--------+-------------+---+---+-----+----+---+