我有与自治市镇、街道名称和邮政编码相关的数据。我正在尝试根据自治市镇和街道名称填写邮政编码中的缺失值
我的数据如下所示:
ï..BOROUGH Street.Name Zip.Code
2850662 BRONX CITY ISLAND ROAD 10464
2850740 BRONX CITY ISLAND ROAD 10464
2850749 BRONX CITY ISLAND ROAD NA
2850919 BRONX CITY ISLAND ROAD 10464
3491200 BRONX CITY ISLAND ROAD NA
预期输出为:
ï..BOROUGH Street.Name Zip.Code
2850662 BRONX CITY ISLAND ROAD 10464
2850740 BRONX CITY ISLAND ROAD 10464
2850749 BRONX CITY ISLAND ROAD 10464
2850919 BRONX CITY ISLAND ROAD 10464
3491200 BRONX CITY ISLAND ROAD 10464
我认为我们需要遵循这种方法 -
- 获取"Street_Name"和"Zip_Code"的映射(筛选出空Zip_Code(
- 使用"Street_Name"将主数据帧与Zip_Code数据帧联接,如果主数据框中不为 null,则填充"邮政编码",否则从我们的Zip_Code数据帧填充。
试试这段代码 -
from pyspark.sql.types import *
from pyspark.sql.functions import col
schema = StructType([StructField('BOROUGH', IntegerType(), True),
StructField('Street_Name', StringType(), True),
StructField('Zip_Code', IntegerType(), True)])
data = [(2850662,'BRONX CITY ISLAND ROAD',10464),
(2850740,'BRONX CITY ISLAND ROAD',10464),
(2850749,'BRONX CITY ISLAND ROAD',None),
(2850919,'BRONX CITY ISLAND ROAD',10464),
(3491200,'BRONX CITY ISLAND ROAD',None)]
df = spark.createDataFrame(data,schema)
df_Zip_Code = df.filter(df.Zip_Code.isNotNull()).select('Street_Name','Zip_Code').distinct()
df.alias('a').
join(df_Zip_Code.alias('b'),col('a.Street_Name') == col('b.Street_Name'), 'inner').
selectExpr("a.BOROUGH AS BOROUGH","a.Street_Name AS Street_Name","CASE WHEN a.Zip_Code IS NULL THEN b.Zip_Code ELSE a.Zip_Code END AS Zip_Code" ).
show()