What Python regex removes all text from a column?



I am trying to clean up a column:

df:
+-----+------------------+--------------------+--------------------+--------------+--------------+
|     | league           | home_team          | away_team          | home_score   | away_score   |
+=====+==================+====================+====================+==============+==============+
|   0 | Champions League | APOEL              | Qarabag            | 1            | 2            |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   1 | Champions League | FC Copenhagen      | TNS                | 1            | 0            |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   2 | Champions League | AIK                | Maribor            | 3            | 2 ET         |
+-----+------------------+--------------------+--------------------+--------------+--------------+

Expected:

df:
+-----+------------------+--------------------+--------------------+--------------+--------------+
|     | league           | home_team          | away_team          | home_score   | away_score   |
+=====+==================+====================+====================+==============+==============+
|   0 | Champions League | APOEL              | Qarabag            | 1            | 2            |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   1 | Champions League | FC Copenhagen      | TNS                | 1            | 0            |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   2 | Champions League | AIK                | Maribor            | 3            | 2            |
+-----+------------------+--------------------+--------------------+--------------+--------------+

I am trying:

df['away_score'] = df['away_score'].astype(str).str.replace(r'(\s?\w+)$', '', regex=True)

(This works on regex101, but not in pandas.)

But the plain scores in the column get wiped out as well:

+-----+------------------+--------------------+--------------------+--------------+--------------+
|     | league           | home_team          | away_team          | home_score   | away_score   |
+=====+==================+====================+====================+==============+==============+
|   0 | Champions League | APOEL              | Qarabag            | 1            |              |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   1 | Champions League | FC Copenhagen      | TNS                | 1            |              |
+-----+------------------+--------------------+--------------------+--------------+--------------+
|   2 | Champions League | AIK                | Maribor            | 3            | 2            |
+-----+------------------+--------------------+--------------------+--------------+--------------+

What should the correct regex be?

I tried this regex and it worked:

df['away_score'] = df['away_score'].astype(str).str.replace('[a-zA-Z]', '', regex=True)
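
For context on why the pattern from the question blanks the plain scores: `\w` matches digits (and the underscore) as well as letters, so on a value like "1" the whole string is a valid match and gets removed. The sketch below uses the standard `re` module rather than pandas, with sample values copied from the question's table, to compare that pattern with the letters-only class.

```python
import re

# Sample values taken from the away_score column in the question.
samples = ["1", "0", "2 ET"]

# \w covers letters, digits and underscore, so for a plain score
# the pattern (\s?\w+)$ matches the digit itself and deletes it.
original = [re.sub(r"(\s?\w+)$", "", s) for s in samples]

# A letters-only class leaves the digits untouched.
letters_only = [re.sub(r"[a-zA-Z]", "", s) for s in samples]

print(original)      # ['', '', '2']
print(letters_only)  # ['1', '0', '2 ']
```

Note that the letters-only result for "2 ET" still carries a trailing space.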

To remove the text completely (including whitespace), you should use:

df['away_score'] = df['away_score'].astype(str).str.replace(r'[a-zA-Z\s]', '', regex=True)

This way you also remove any whitespace next to the letters, for example the space before ET.
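
As a minimal sketch of the letters-plus-whitespace approach, here it is applied to a small DataFrame. The frame is rebuilt by hand from the question's table (only the `away_score` column), so treat it as an illustration rather than the asker's actual data.

```python
import pandas as pd

# Hand-built stand-in for the away_score column from the question.
df = pd.DataFrame({"away_score": ["2", "0", "2 ET"]})

# Remove every ASCII letter and every whitespace character.
df["away_score"] = (
    df["away_score"].astype(str).str.replace(r"[a-zA-Z\s]", "", regex=True)
)

print(df["away_score"].tolist())  # ['2', '0', '2']
```

The column still holds strings after this step.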

If you want to remove not only text but also any other non-digit characters (leaving only the digits), you can use:

df['away_score'] = df['away_score'].astype(str).str.replace(r'\D', '', regex=True)
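
One thing to keep in mind with the digits-only pattern is that `\D` removes every non-digit character, including hyphens, minus signs and decimal points. The sketch below shows the effect on a few hypothetical values that are not in the question's data.

```python
import pandas as pd

# Hypothetical values, chosen only to show what \D strips.
scores = pd.Series(["2 ET", "1-0", "3.5", "-1"])

# Keep digits only; everything else is deleted.
digits_only = scores.str.replace(r"\D", "", regex=True)

print(digits_only.tolist())  # ['2', '10', '35', '1']
```

For a column of simple non-negative integer scores like the one in the question this is harmless, but it would merge or alter values such as "1-0" or "-1".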
