Here is a column whose values mix data with non-ASCII characters:
Summary 1
United Kingdom - ��Global Consumer Technology - ��American Express
United Kingdom - ��VP Technology - Founder - ��Hogarth Worldwide
Aberdeen - ��SeniorCore Analysis Specialist - ��COREX Group
London, - ��ED, Equit Technology, London - ��Morgan Stanley
United Kingdom - ��Chief Officer, Group Technology - ��BP
How can I split these values and save the parts in separate columns?
The code I am using is:
import io
import pandas as pd
df = pd.read_csv("/home/vipul/Desktop/dataminer.csv", sep=r'\s*\+.*?-\s*')
df = df.reset_index()
df.columns = ["First Name", "Last Name", "Email", "Profile URL", "Summary 1", "Summary 2"]
df.to_csv("/home/vipul/Desktop/new.csv")
Say you have this column as a Series:
s
0 United Kingdom - ��Global Consumer Technolog...
1 United Kingdom - ��VP Technology - Founder -...
2 Aberdeen - ��SeniorCore Analysis Specialist ...
3 London, - ��ED, Equit Technology, London - �...
4 United Kingdom - ��Chief Officer, Group Tech...
Name: Summary 1, dtype: object
Option 1
To expand on this answer, you can use str.split to split on a hyphen followed by a run of non-ASCII characters:
s.str.split(r'-\s*[^\x00-\x7f]+', expand=True)
   0               1                                2
0  United Kingdom  Global Consumer Technology       American Express
1  United Kingdom  VP Technology - Founder          Hogarth Worldwide
2  Aberdeen        SeniorCore Analysis Specialist   COREX Group
3  London,         ED, Equit Technology, London     Morgan Stanley
4  United Kingdom  Chief Officer, Group Technology  BP
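A minimal, self-contained sketch of Option 1 follows. It assumes the unknown non-ASCII characters can be represented by U+FFFD (the replacement character shown in the sample), and the column names `Location`, `Title` and `Company` are hypothetical labels, not names from the question. It also strips the whitespace that the split leaves at the end of each piece.

```python
import pandas as pd

# Assumption: U+FFFD stands in for the unknown non-ASCII characters.
MARK = "\ufffd\ufffd"
s = pd.Series(
    [
        f"United Kingdom - {MARK}Global Consumer Technology - {MARK}American Express",
        f"United Kingdom - {MARK}VP Technology - Founder - {MARK}Hogarth Worldwide",
    ],
    name="Summary 1",
)

# Split on a hyphen, optional whitespace, then one or more non-ASCII characters.
# A plain " - " between ASCII words (as in "VP Technology - Founder") is not a separator.
parts = s.str.split(r"-\s*[^\x00-\x7f]+", expand=True)

# Remove the whitespace left in front of each separator.
parts = parts.apply(lambda col: col.str.strip())

# Hypothetical column names, chosen only for illustration.
parts.columns = ["Location", "Title", "Company"]
print(parts)
```

Because the separator requires a non-ASCII character after the hyphen, the hyphen inside "VP Technology - Founder" survives intact.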
Option 2
Use str.extractall to capture each run of ASCII characters, strip the trailing hyphen and spaces, then unstack:
s.str.extractall(r'([\x00-\x7f]+)')[0].str.rstrip('- ').unstack()
match  0               1                                2
0      United Kingdom  Global Consumer Technology       American Express
1      United Kingdom  VP Technology - Founder          Hogarth Worldwide
2      Aberdeen        SeniorCore Analysis Specialist   COREX Group
3      London,         ED, Equit Technology, London     Morgan Stanley
4      United Kingdom  Chief Officer, Group Technology  BP
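The same idea can be tried end to end with the sketch below. As an assumption, U+FFFD again stands in for the unknown non-ASCII characters; the intermediate names `matches` and `wide` are only illustrative.

```python
import pandas as pd

# Assumption: U+FFFD stands in for the unknown non-ASCII characters.
MARK = "\ufffd\ufffd"
s = pd.Series(
    [
        f"United Kingdom - {MARK}Global Consumer Technology - {MARK}American Express",
        f"United Kingdom - {MARK}VP Technology - Founder - {MARK}Hogarth Worldwide",
    ],
    name="Summary 1",
)

# Every run of ASCII characters becomes one row, indexed by (original row, match number).
matches = s.str.extractall(r"([\x00-\x7f]+)")[0]

# Drop the trailing "- " left before each non-ASCII run, then pivot the
# match number into columns so each original row is a single line again.
wide = matches.str.rstrip("- ").unstack()
print(wide)
```

Note that rstrip('- ') removes any trailing hyphens and spaces, so a value that legitimately ends in a hyphen would lose it as well.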
Another approach:
a
0 United Kingdom - ��Global Consumer Technolog...
1 United Kingdom - ��VP Technology - Founder -...
2 Aberdeen - ��SeniorCore Analysis Specialist ...
3 London, - ��ED, Equit Technology, London - �...
4 United Kingdom - ��Chief Officer, Group Tech...
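Use this function to keep only the ASCII characters (those whose Unicode code point is below 128), using the built-in ord function:
def extract_ascii(x):
    # Keep only characters whose code point is below 128 (plain ASCII).
    return ''.join(ch for ch in x if ord(ch) < 128)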
Then apply it to the column and split on the hyphen:
df1.a.apply(extract_ascii).str.split('-', expand=True)
This is the result:
   0               1                                2                 3
0  United Kingdom  Global Consumer Technology       American Express  None
1  United Kingdom  VP Technology                    Founder           Hogarth Worldwide
2  Aberdeen        SeniorCore Analysis Specialist   COREX Group       None
3  London,         ED, Equit Technology, London     Morgan Stanley    None
4  United Kingdom  Chief Officer, Group Technology  BP                None
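To finish the task from the question (saving the parts in separate columns), the split result can be joined back onto the frame and written out as CSV. The sketch below is only illustrative: U+FFFD stands in for the unknown non-ASCII characters, the `part_` prefix is a hypothetical naming choice, and an in-memory buffer replaces the real output path. It also shows why a fourth column appears: splitting on every hyphen breaks "VP Technology - Founder" in two.

```python
import io
import pandas as pd


def extract_ascii(x):
    # Keep only characters whose code point is below 128 (plain ASCII).
    return "".join(ch for ch in x if ord(ch) < 128)


# Assumption: U+FFFD stands in for the unknown non-ASCII characters.
MARK = "\ufffd\ufffd"
df1 = pd.DataFrame(
    {
        "a": [
            f"United Kingdom - {MARK}Global Consumer Technology - {MARK}American Express",
            f"United Kingdom - {MARK}VP Technology - Founder - {MARK}Hogarth Worldwide",
        ]
    }
)

# Drop the non-ASCII characters, then split on every remaining hyphen.
parts = df1["a"].apply(extract_ascii).str.split("-", expand=True)

# Trim surrounding spaces and give the columns hypothetical names part_0, part_1, ...
parts = parts.apply(lambda col: col.str.strip()).add_prefix("part_")

# Attach the new columns to the original frame and write CSV text.
out = df1.join(parts)
buffer = io.StringIO()  # stands in for a real file path
out.to_csv(buffer, index=False)
print(buffer.getvalue())
```

If the embedded hyphen should be kept inside one field, splitting on the hyphen plus the non-ASCII run (as in Option 1) avoids the extra column.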