How to replace accented characters in PySpark?



I have a string column with accented values in a dataframe, like

'México', 'Albânia', 'Japão'

How can I replace the accented letters with plain ones to get this:

'Mexico', 'Albania', 'Japao'

I tried many of the solutions available on Stack Overflow, such as:

import unicodedata

def strip_accents(s):
    # Decompose characters (NFD), then drop the combining marks (category 'Mn')
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

which disappointingly returns:

strip_accents('México')
>>> 'M?xico'

You can use translate:

df = spark.createDataFrame(
    [
        ('1', 'Japão'),
        ('2', 'Irã'),
        ('3', 'São Paulo'),
        ('5', 'Canadá'),
        ('6', 'Tókio'),
        ('7', 'México'),
        ('8', 'Albânia')
    ],
    ["id", "Local"]
)
df.show(truncate=False)
+---+---------+
|id |Local    |
+---+---------+
|1  |Japão    |
|2  |Irã      |
|3  |São Paulo|
|5  |Canadá   |
|6  |Tókio    |
|7  |México   |
|8  |Albânia  |
+---+---------+
from pyspark.sql import functions as F

df.withColumn(
    'Loc_norm',
    F.translate('Local',
                'ãäöüẞáäčďéěíĺľňóôŕšťúůýžÄÖÜẞÁÄČĎÉĚÍĹĽŇÓÔŔŠŤÚŮÝŽ',
                'aaousaacdeeillnoorstuuyzAOUSAACDEEILLNOORSTUUYZ')
).show(truncate=False)
+---+---------+---------+
|id |Local    |Loc_norm |
+---+---------+---------+
|1  |Japão    |Japao    |
|2  |Irã      |Ira      |
|3  |São Paulo|Sao Paulo|
|5  |Canadá   |Canada   |
|6  |Tókio    |Tokio    |
|7  |México   |Mexico   |
|8  |Albânia  |Albânia  |
+---+---------+---------+
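Note that 'Albânia' came through unchanged: 'â' is not in the matching string, and translate only maps the characters you explicitly enumerate. To avoid missing characters, the two strings can be built programmatically with unicodedata so each accented character is paired with its base letter; a minimal sketch, where ACCENTED is an illustrative (not exhaustive) list you would replace with the characters your data actually contains:

import unicodedata
from pyspark.sql import functions as F

# Illustrative list of accented characters expected in the data (an assumption).
ACCENTED = 'âãáäàçéêíóôõöúüñÂÃÁÄÀÇÉÊÍÓÔÕÖÚÜÑ'
# NFD splits each character into base letter + combining mark;
# taking [0] keeps only the base letter, so both strings stay the same length.
PLAIN = ''.join(unicodedata.normalize('NFD', c)[0] for c in ACCENTED)

df.withColumn('Loc_norm', F.translate('Local', ACCENTED, PLAIN)).show(truncate=False)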

In PySpark you can create a pandas_udf, which is vectorized, so it performs better than a plain udf (a sketch of the plain version is shown below for contrast).
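For contrast, a minimal sketch of the question's strip_accents wrapped as a plain row-at-a-time udf; the pandas_udf further down receives whole pd.Series batches via Arrow, which is why it is usually faster than this:

import unicodedata
from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(T.StringType())
def strip_accents_plain(s):
    # Called once per row; None must be handled explicitly.
    if s is None:
        return None
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')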

The normalize/encode/decode chain seems to be the best pandas approach, so we can use it to create a pandas_udf for the PySpark application:

from pyspark.sql import functions as F
import pandas as pd

@F.pandas_udf('string')
def strip_accents(s: pd.Series) -> pd.Series:
    # Decompose (NFKD), drop the non-ASCII combining marks, decode back to str
    return s.str.normalize('NFKD').str.encode('ascii', 'ignore').str.decode('utf-8')

Test:

df = spark.createDataFrame([('México',), ('Albânia',), ('Japão',)], ['country'])
df = df.withColumn('country2', strip_accents('country'))
df.show()
# +-------+--------+
# |country|country2|
# +-------+--------+
# | México|  Mexico|
# |Albânia| Albania|
# |  Japão|   Japao|
# +-------+--------+
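One caveat worth knowing (my observation, not from the original answer): characters that have no ASCII decomposition are deleted outright by encode('ascii', 'ignore'), so this approach is lossy for letters like 'ø' or 'ß'. A quick check in plain pandas:

import pandas as pd

s = pd.Series(['Øresund', 'Straße'])
# NFKD leaves 'Ø' and 'ß' undecomposed, so 'ignore' drops them entirely.
print(s.str.normalize('NFKD').str.encode('ascii', 'ignore').str.decode('utf-8'))
# 0    resund
# 1     Strae
# dtype: object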
