Remove non-ASCII and specific characters from a DataFrame column using PySpark



I want to clean the data in the DataFrame column City. It can have values such as the following:

Venice® Venice Venice? Venice Venice® Venice

I want to remove all non-ASCII characters, as well as certain specific characters such as ?. How can I do this?

You can clean the strings with a regular expression that keeps only letters:

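# create the SparkSession and import the functions module used below
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

# create the sample DataFrame
date_data = [
    (1, "Venice®"),
    (2, "VeniceÆ"),
    (3, "Venice?"),
    (4, "Venice"),
]
schema = ["id", "name"]
df_raw = spark.createDataFrame(data=date_data, schema=schema)
df_raw.show()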
+---+-------+
| id|   name|
+---+-------+
|  1|Venice®|
|  2|VeniceÆ|
|  3|Venice?|
|  4| Venice|
+---+-------+
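# apply the regular expression: drop every character that is not an ASCII letter
# (note that this also removes spaces, digits and hyphens)
df_clean = df_raw.withColumn("clean_name", f.regexp_replace(f.col("name"), "[^a-zA-Z]", ""))
df_clean.show()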
+---+-------+----------+
| id|   name|clean_name|
+---+-------+----------+
|  1|Venice®|    Venice|
|  2|VeniceÆ|    Venice|
|  3|Venice?|    Venice|
|  4| Venice|    Venice|
+---+-------+----------+
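
The letters-only pattern also strips spaces, digits and hyphens, which may be too aggressive for city names such as "New York". If the goal is only to drop non-ASCII characters plus a few specific ones, a narrower pattern might work better. The sketch below checks such a pattern locally with Python's `re` module; the pattern, the `clean` helper and the extra sample value are my own assumptions, not something from the question.

```python
import re

# Hypothetical pattern (an assumption, not from the question):
# match any character outside the ASCII range, or a literal "?".
# Add further characters to the second class as needed, e.g. [?,.]
PATTERN = r"[^\x00-\x7F]|[?]"

def clean(value: str) -> str:
    """Remove every character matched by PATTERN."""
    return re.sub(PATTERN, "", value)

# "New York®" is an extra illustrative value to show that spaces survive.
samples = ["Venice®", "VeniceÆ", "Venice?", "Venice", "New York®"]
cleaned = [clean(s) for s in samples]
print(cleaned)  # ['Venice', 'Venice', 'Venice', 'Venice', 'New York']
```

As far as I know, this simple character class is written the same way in the Java regex syntax that Spark uses, so the same pattern string should be usable as the second argument of `f.regexp_replace`. I have only checked it with Python's `re` here, so test it on a small DataFrame first.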

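P.S. I doubt you will still see characters like these once the data has been imported into Spark correctly. For example, the superscript was ignored.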
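
One possible reading of the remark above is that stray characters sometimes come from an encoding mismatch at read time rather than from the data itself. This is only my assumption about the cause. The pure-Python sketch below shows how UTF-8 text decoded as Latin-1 picks up an extra character, and how reversing the mistake restores it.

```python
original = "Venice®"

# Encode as UTF-8, then (incorrectly) decode the same bytes as Latin-1.
# The two-byte UTF-8 sequence for "®" turns into two separate characters.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # VeniceÂ®

# Reversing the wrong decoding recovers the original text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

If the column shows this kind of pattern, it may be worth checking the encoding used when the file is read before removing characters with a regex.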
