I want to clean a DataFrame so that spaces are removed only from cells that contain numbers, while cells that contain names stay unchanged.
Author
07 07 34
08 26 20
08 26 20
Tata Smith
Jhon Doe
08 26 22
3409243
Here is my failed attempt:
df.loc[df["Author"].str.isdigit(), "Author"] = df["Author"].strip()
How should I handle this?
You might want to use a regex.
import io
import re
import pandas as pd
# Create a sample dataframe
df = pd.read_csv(io.StringIO('Author\n 07 07 34 \n 08 26 20 \n 08 26 20 \n Tata Smith\n Jhon Doe\n 08 26 22\n 3409243'))
# Use a regex: select the cells made up only of digits and spaces
mask = df['Author'].str.fullmatch(r'[\d ]*')
# Remove the spaces from those cells only
df.loc[mask, 'Author'] = df.loc[mask, 'Author'].str.replace(' ', '', regex=False)
# The same treatment can also be done with the following line
# df['Author'] = df['Author'].apply(lambda s: s.replace(' ', '') if re.match(r'[\d ]*$', s) else s)
Author |
---|
070734 |
082620 |
082620 |
Tata Smith |
Jhon Doe |
082622 |
3409243 |
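Two edge cases are worth keeping in mind with the regex mask above: a pattern that allows zero digits also matches empty or all-space cells, and `str.fullmatch` yields NaN for missing values unless `na=` is set. A minimal sketch that guards against both, assuming pandas 1.1 or later (where `str.fullmatch` is available); the helper name `strip_spaces_in_numeric_cells` and the inline sample data are made up for illustration:

```python
import pandas as pd


def strip_spaces_in_numeric_cells(s: pd.Series) -> pd.Series:
    """Remove whitespace from cells made up only of digits and whitespace."""
    # Require at least one digit; na=False treats missing values as non-matches.
    mask = s.str.fullmatch(r"\s*\d[\d\s]*", na=False)
    cleaned = s.copy()
    # Only the matching cells are rewritten; names and missing values are kept.
    cleaned[mask] = s[mask].str.replace(r"\s+", "", regex=True)
    return cleaned


df = pd.DataFrame(
    {"Author": [" 07 07 34 ", " 08 26 20 ", "Tata Smith", "Jhon Doe", " 3409243", None]}
)
df["Author"] = strip_spaces_in_numeric_cells(df["Author"])
print(df)
```

The digits stay as strings here, so leading zeros such as the one in `070734` are preserved.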
How about this?
import pandas as pd
df = pd.read_csv('two.csv')
# remove spaces on a copy of the column
df['Author_clean'] = df['Author'].str.replace(' ', '', regex=False)
# convert to numeric where possible; non-numeric values become NaN
df['Author_clean'] = pd.to_numeric(df['Author_clean'], errors='coerce')
# fill the NaN values with the original strings
# (assigning the result avoids calling fillna in place on a column selection)
df['Author_clean'] = df['Author_clean'].fillna(df['Author'])
print(df.head(10))
Output:
       Author  Author_clean
0    07 07 34       70734.0
1    08 26 20       82620.0
2    08 26 20       82620.0
3  Tata Smith    Tata Smith
4    Jhon Doe      Jhon Doe
5    08 26 22       82622.0
6     3409243     3409243.0
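Note that the numeric conversion turns `07 07 34` into `70734.0`: the leading zero is lost and the column ends up mixing floats and strings. If the digits should stay as text, one possible variant uses `pd.to_numeric` only as a test and keeps the stripped string. This is a sketch on inline sample data rather than `two.csv`, and it treats as numeric whatever `pd.to_numeric` accepts (for example `1e5`), which may be broader than intended:

```python
import pandas as pd

# Inline sample data standing in for the CSV file (an assumption for this sketch).
df = pd.DataFrame(
    {"Author": ["07 07 34", "08 26 20", "Tata Smith", "Jhon Doe", "3409243"]}
)

# Remove spaces on a copy of the column.
stripped = df["Author"].str.replace(" ", "", regex=False)
# True where the stripped text parses as a number.
is_numeric = pd.to_numeric(stripped, errors="coerce").notna()
# Keep the stripped text for numeric cells and the original text otherwise.
df["Author_clean"] = stripped.where(is_numeric, df["Author"])
print(df)
```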