I have a dataframe df with this schema:
root
|-- id: long (nullable = true)
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _href: string (nullable = true)
| | |-- type: string (nullable = true)
How can I modify the dataframe so that column a contains only the _href
values, without _VALUE and type?
Is that possible?
I tried something like this, but it is wrong:
df=df.withColumn('a', 'a._href')
For example, this is my data:
+---+---------------------------------------------------------------------+
|id| a |
+---+---------------------------------------------------------------------+
| 17|[[Gwendolyn Tucke,http://facebook.com],[i have , http://youtube.com]]|
| 23|[[letter, http://google.com],[hihow are you , http://google.co.il]] |
+---+---------------------------------------------------------------------+
But I want it to look like this:
+---+---------------------------------------------+
|id| a |
+---+---------------------------------------------+
| 17|[[http://facebook.com],[ http://youtube.com]]|
| 23|[[http://google.com],[http://google.co.il]] |
+---+---------------------------------------------+
PS: I don't want to use pandas at all.
You can select a._href and assign it to a new column. Try this Scala solution.
scala> case class sub(_value:String,_href:String)
defined class sub
scala> val df = Seq((17,Array(sub("Gwendolyn Tucke","http://facebook.com"),sub("i have"," http://youtube.com"))),(23,Array(sub("letter","http://google.com"),sub("hihow are you","http://google.co.il")))).toDF("id","a")
df: org.apache.spark.sql.DataFrame = [id: int, a: array<struct<_value:string,_href:string>>]
scala> df.show(false)
+---+-----------------------------------------------------------------------+
|id |a |
+---+-----------------------------------------------------------------------+
|17 |[[Gwendolyn Tucke, http://facebook.com], [i have, http://youtube.com]]|
|23 |[[letter, http://google.com], [hihow are you, http://google.co.il]] |
+---+-----------------------------------------------------------------------+
scala> df.select("id","a._href").show(false)
+---+------------------------------------------+
|id |_href |
+---+------------------------------------------+
|17 |[http://facebook.com, http://youtube.com]|
|23 |[http://google.com, http://google.co.il] |
+---+------------------------------------------+
You can assign it to a new column:
scala> val df2 = df.withColumn("result",$"a._href")
df2: org.apache.spark.sql.DataFrame = [id: int, a: array<struct<_value:string,_href:string>> ... 1 more field]
scala> df2.show(false)
+---+-----------------------------------------------------------------------+------------------------------------------+
|id |a |result |
+---+-----------------------------------------------------------------------+------------------------------------------+
|17 |[[Gwendolyn Tucke, http://facebook.com], [i have, http://youtube.com]]|[http://facebook.com, http://youtube.com]|
|23 |[[letter, http://google.com], [hihow are you, http://google.co.il]] |[http://google.com, http://google.co.il] |
+---+-----------------------------------------------------------------------+------------------------------------------+
scala> df2.printSchema
root
|-- id: integer (nullable = false)
|-- a: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _value: string (nullable = true)
| | |-- _href: string (nullable = true)
|-- result: array (nullable = true)
| |-- element: string (containsNull = true)
scala>
You can try the code below:
from pyspark.sql.functions import explode
df.select("id", explode("a").alias("a")).select("id", "a._href", "a.type").show()
The code above returns a dataframe with three columns (id, _href, type) at the same level, which you can use for further analysis. Note that explode must be aliased back to "a": its output column is named col by default, so "a._href" would otherwise fail to resolve.
I hope it helps.
Regards,
Neeraj