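I want to clean the data in the DataFrame column City. It can contain values such as the following:
Venice® Venice Venice? Venice Venice® Venice
I would like to remove all non-ASCII characters, as well as the characters "?", "," and ".". How can I do this?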
You can clean the strings with a regular expression that keeps only letters:
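# imports and session (skip the builder line if `spark` already exists, e.g. in a notebook)
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()

# create the DataFrame
date_data = [
    (1, "Venice®"),
    (2, "VeniceÆ"),
    (3, "Venice?"),
    (4, "Venice"),
]
schema = ["id", "name"]
df_raw = spark.createDataFrame(data=date_data, schema=schema)
df_raw.show()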
+---+-------+
| id|   name|
+---+-------+
|  1|Venice®|
|  2|VeniceÆ|
|  3|Venice?|
|  4| Venice|
+---+-------+
# apply regular expression
df_clean=(df_raw.withColumn("clean_name",f.regexp_replace(f.col("name"), "[^a-zA-Z]", "")))
df_clean.show()
+---+-------+----------+
| id|   name|clean_name|
+---+-------+----------+
|  1|Venice®|    Venice|
|  2|VeniceÆ|    Venice|
|  3|Venice?|    Venice|
|  4| Venice|    Venice|
+---+-------+----------+
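Note that "[^a-zA-Z]" also strips digits, spaces and hyphens, so a value like "New York" would become "NewYork". If you only want to drop non-ASCII characters plus "?", "," and ".", a narrower pattern may be a better fit. The sketch below is only an illustration: the names PATTERN and clean_city are made up for this example, and it uses Python's re module so it can be tried without a Spark session.

```python
import re

# Hypothetical pattern: any character outside the ASCII range,
# or one of the literal characters "?", "," and ".".
PATTERN = r"[^\x00-\x7F]|[?,.]"


def clean_city(value: str) -> str:
    """Remove non-ASCII characters and ?, , . while keeping everything else."""
    return re.sub(PATTERN, "", value)


samples = ["Venice®", "VeniceÆ", "Venice?", "Venice", "New York"]
cleaned = [clean_city(s) for s in samples]
print(cleaned)

# The same pattern string should also work with Spark's regexp_replace
# (assumption: `f` is pyspark.sql.functions and `df_raw` has a "name" column):
# df_raw.withColumn("clean_name", f.regexp_replace(f.col("name"), PATTERN, ""))
```

Unlike the letters-only pattern, this one leaves spaces, digits and other ASCII punctuation untouched.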
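P.S. I doubt you would still see characters like these once the data has been imported into Spark correctly. For example, the superscript is ignored.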