按列分组后填充缺失值



我有与自治市镇、街道名称和邮政编码相关的数据。我正在尝试根据自治市镇和街道名称填写邮政编码中的缺失值

我的数据如下所示:

    ï..BOROUGH      Street.Name Zip.Code
    2850662      BRONX CITY ISLAND ROAD    10464
    2850740      BRONX CITY ISLAND ROAD    10464
    2850749      BRONX CITY ISLAND ROAD       NA
    2850919      BRONX CITY ISLAND ROAD    10464
    3491200      BRONX CITY ISLAND ROAD       NA

预期输出为:

    ï..BOROUGH      Street.Name Zip.Code
    2850662      BRONX CITY ISLAND ROAD    10464
    2850740      BRONX CITY ISLAND ROAD    10464
    2850749      BRONX CITY ISLAND ROAD    10464
    2850919      BRONX CITY ISLAND ROAD    10464
    3491200      BRONX CITY ISLAND ROAD    10464

我认为我们需要遵循这种方法 -

  1. 获取"Street_Name"和"Zip_Code"的映射(筛选出空Zip_Code(
  2. 使用"Street_Name"将主数据帧与Zip_Code数据帧联接,如果主数据框中不为 null,则填充"邮政编码",否则从我们的Zip_Code数据帧填充。

试试这段代码 -

from pyspark.sql.types import *
from pyspark.sql.functions import col
schema = StructType([StructField('BOROUGH', IntegerType(), True),
                     StructField('Street_Name', StringType(), True),
                     StructField('Zip_Code', IntegerType(), True)])

data = [(2850662,'BRONX CITY ISLAND ROAD',10464),
        (2850740,'BRONX CITY ISLAND ROAD',10464),
        (2850749,'BRONX CITY ISLAND ROAD',None),
        (2850919,'BRONX CITY ISLAND ROAD',10464),
        (3491200,'BRONX CITY ISLAND ROAD',None)]
df = spark.createDataFrame(data,schema)
df_Zip_Code = df.filter(df.Zip_Code.isNotNull()).select('Street_Name','Zip_Code').distinct()
df.alias('a').
    join(df_Zip_Code.alias('b'),col('a.Street_Name') == col('b.Street_Name'), 'inner').
    selectExpr("a.BOROUGH AS BOROUGH","a.Street_Name AS Street_Name","CASE WHEN a.Zip_Code IS NULL THEN b.Zip_Code ELSE a.Zip_Code END AS Zip_Code" ).
    show()

相关内容

  • 没有找到相关文章

最新更新