How to split a DataFrame column value on newlines and create a new column containing the last 2 items (lines)



I want to split the column value on the newline character and create a new column containing the last two items (lines):

df1 = spark.createDataFrame([
["001\r\nLuc  Krier\r\n2363  Ryan Road, Long Lake South Dakota"],
["002\r\nJeanny  Thorn\r\n2263 Patton Lane Raleigh North Carolina"],
["003\r\nTeddy E Beecher\r\n2839 Hartland Avenue Fond Du Lac Wisconsin"],
["004\r\nPhilippe  Schauss\r\n1 Im Oberdorf Allemagne"],
["005\r\nMeindert I Tholen\r\nHagedoornweg 138 Amsterdam"]
]).toDF("s")

This does not work (no value):

df1.withColumn('last_2', split(df1.s, '\r\n')[-2])

You can simply use the `substring_index` function:

import pyspark.sql.functions as f
df1.withColumn('last2', f.substring_index('s', '\r\n', -2)).drop('s').show(10, False)
+-----------------------------------------------------------+
|last2                                                      |
+-----------------------------------------------------------+
|Luc  Krier
2363  Ryan Road, Long Lake South Dakota        |
|Jeanny  Thorn
2263 Patton Lane Raleigh North Carolina     |
|Teddy E Beecher
2839 Hartland Avenue Fond Du Lac Wisconsin|
|Philippe  Schauss
1 Im Oberdorf Allemagne                 |
|Meindert I Tholen
Hagedoornweg 138 Amsterdam              |
+-----------------------------------------------------------+

Hope this helps.

Yes, I faced the same problem with negative indexes, but positive indexes work. I tried using the slice function and it works fine. Could you try this:

import pyspark.sql.functions as F
df1 = sqlContext.createDataFrame([
    ["001\r\nLuc Krier\r\n2363 Ryan Road, Long Lake South Dakota"],
    ["002\r\nJeanny Thorn\r\n2263 Patton Lane Raleigh North Carolina"],
    ["003\r\nTeddy E Beecher\r\n2839 Hartland Avenue Fond Du Lac Wisconsin"],
    ["004\r\nPhilippe Schauss\r\n1 Im Oberdorf Allemagne"],
    ["005\r\nMeindert I Tholen\r\nHagedoornweg 138 Amsterdam"]
]).toDF("s")
df_r = df1.withColumn('spl', F.split(F.col('s'), '\r\n'))
df_res = df_r.withColumn("res", F.slice(F.col("spl"), -2, 2))

Maybe this helps -

val sDF = Seq("""001\r\nLuc  Krier\r\n2363  Ryan Road, Long Lake South Dakota""",
"""002\r\nJeanny  Thorn\r\n2263 Patton Lane Raleigh North Carolina""",
"""003\r\nTeddy E Beecher\r\n2839 Hartland Avenue Fond Du Lac Wisconsin""",
"""004\r\nPhilippe  Schauss\r\n1 Im Oberdorf Allemagne""",
"""005\r\nMeindert I Tholen\r\nHagedoornweg 138 Amsterdam""").toDF("""s""")
val processedDF = sDF.withColumn("col1", slice(split(col("s"), """\\r\\n"""), -2, 2))
processedDF.show(false)
processedDF.printSchema()
/**
* +--------------------------------------------------------------------+-------------------------------------------------------------+
* |s                                                                   |col1                                                         |
* +--------------------------------------------------------------------+-------------------------------------------------------------+
* |001\r\nLuc  Krier\r\n2363  Ryan Road, Long Lake South Dakota        |[Luc  Krier, 2363  Ryan Road, Long Lake South Dakota]        |
* |002\r\nJeanny  Thorn\r\n2263 Patton Lane Raleigh North Carolina     |[Jeanny  Thorn, 2263 Patton Lane Raleigh North Carolina]     |
* |003\r\nTeddy E Beecher\r\n2839 Hartland Avenue Fond Du Lac Wisconsin|[Teddy E Beecher, 2839 Hartland Avenue Fond Du Lac Wisconsin]|
* |004\r\nPhilippe  Schauss\r\n1 Im Oberdorf Allemagne                 |[Philippe  Schauss, 1 Im Oberdorf Allemagne]                 |
* |005\r\nMeindert I Tholen\r\nHagedoornweg 138 Amsterdam              |[Meindert I Tholen, Hagedoornweg 138 Amsterdam]              |
* +--------------------------------------------------------------------+-------------------------------------------------------------+
*
* root
* |-- s: string (nullable = true)
* |-- col1: array (nullable = true)
* |    |-- element: string (containsNull = true)
*/
