Splitting a DataFrame column on non-ASCII characters



This is a column containing data and non-ASCII characters:

Summary 1
United Kingdom - ��Global Consumer Technology - ��American Express 
United Kingdom - ��VP Technology - Founder - ��Hogarth Worldwide
Aberdeen - ��SeniorCore Analysis Specialist - ��COREX Group
London, - ��ED, Equit Technology, London - ��Morgan Stanley
United Kingdom - ��Chief Officer, Group Technology - ��BP

How can I split these values and save the pieces in separate columns?

The code I used is:

import io
import pandas as pd
df = pd.read_csv("/home/vipul/Desktop/dataminer.csv", sep=r'\s*\+.*?-\s*')
df = df.reset_index()
df.columns = ["First Name", "Last Name", "Email", "Profile URL", "Summary 1", "Summary 2"]
df.to_csv("/home/vipul/Desktop/new.csv")

Say you have a column like this as a Series:

s
0    United Kingdom - ��Global Consumer Technolog...
1    United Kingdom - ��VP Technology - Founder -...
2    Aberdeen - ��SeniorCore Analysis Specialist ...
3    London, - ��ED, Equit Technology, London - �...
4    United Kingdom - ��Chief Officer, Group Tech...
Name: Summary 1, dtype: object

Option 1
To expand on this answer, you can use str.split to split on the non-ASCII characters:

s.str.split(r'-\s*[^\x00-\x7f]+', expand=True)
                 0                                 1                  2
0  United Kingdom        Global Consumer Technology    American Express
1  United Kingdom           VP Technology - Founder   Hogarth Worldwide
2        Aberdeen    SeniorCore Analysis Specialist         COREX Group
3         London,      ED, Equit Technology, London      Morgan Stanley
4  United Kingdom   Chief Officer, Group Technology                  BP
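For anyone who wants to try Option 1 without the original file, here is a minimal sketch. The real non-ASCII characters in the data are not recoverable from the post, so `MARK` (U+200E) is only a hypothetical stand-in, and the extra `strip` step is an addition to remove the space left in front of each hyphen.

```python
import pandas as pd

# Hypothetical stand-in for the unknown non-ASCII characters in the data.
MARK = "\u200e"

s = pd.Series(
    [
        f"United Kingdom - {MARK}Global Consumer Technology - {MARK}American Express",
        f"United Kingdom - {MARK}VP Technology - Founder - {MARK}Hogarth Worldwide",
    ],
    name="Summary 1",
)

# Split on a hyphen, optional whitespace, then one or more non-ASCII characters.
# A plain " - " inside a field (as in "VP Technology - Founder") is not a separator.
parts = s.str.split(r"-\s*[^\x00-\x7f]+", expand=True)

# Remove the whitespace that remains before each separator.
parts = parts.apply(lambda col: col.str.strip())
print(parts)
```

Because the pattern requires a non-ASCII character after the hyphen, the second row keeps "VP Technology - Founder" as a single field.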

Option 2
str.extractall followed by unstack:

s.str.extractall(r'([\x00-\x7f]+)')[0].str.rstrip(r'- ').unstack()
match               0                                1                  2
0      United Kingdom       Global Consumer Technology   American Express
1      United Kingdom          VP Technology - Founder  Hogarth Worldwide
2            Aberdeen   SeniorCore Analysis Specialist        COREX Group
3             London,     ED, Equit Technology, London     Morgan Stanley
4      United Kingdom  Chief Officer, Group Technology                 BP
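The intermediate steps of Option 2 may be easier to follow when spelled out. This is a rough sketch on two sample rows, again assuming a hypothetical `MARK` character (U+200E) in place of the unknown non-ASCII characters.

```python
import pandas as pd

MARK = "\u200e"  # hypothetical stand-in for the non-ASCII characters

s = pd.Series(
    [
        f"Aberdeen - {MARK}SeniorCore Analysis Specialist - {MARK}COREX Group",
        f"United Kingdom - {MARK}Chief Officer, Group Technology - {MARK}BP",
    ]
)

# extractall returns one row per run of ASCII characters,
# indexed by (original row label, match number).
runs = s.str.extractall(r"([\x00-\x7f]+)")[0]

# Each run except the last ends with " - "; strip those trailing characters,
# then pivot the match number into columns.
wide = runs.str.rstrip("- ").unstack()
print(wide)
```

One caveat: `rstrip("- ")` removes any trailing hyphens and spaces, so a field that legitimately ends with a hyphen would lose it.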

Another approach:

a
0   United Kingdom - ��Global Consumer Technolog...
1   United Kingdom - ��VP Technology - Founder -...
2   Aberdeen - ��SeniorCore Analysis Specialist ...
3   London, - ��ED, Equit Technology, London - �...
4   United Kingdom - ��Chief Officer, Group Tech...

Use this function to keep only the ASCII characters (those whose Unicode code point is below 128), using the built-in ord function:

def extract_ascii(x):
    string_list = filter(lambda y : ord(y) < 128, x)
    return ''.join(string_list)

Then apply it to the column:

df1.a.apply(extract_ascii).str.split('-', expand=True)

This is the result:

                0                                1                 2                  3
0  United Kingdom       Global Consumer Technology  American Express               None
1  United Kingdom                    VP Technology           Founder  Hogarth Worldwide
2        Aberdeen   SeniorCore Analysis Specialist       COREX Group               None
3         London,     ED, Equit Technology, London    Morgan Stanley               None
4  United Kingdom  Chief Officer, Group Technology                BP               None
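To get back to the original question (saving the pieces as separate columns), one possible sketch is to split the column and join the result onto the DataFrame. The column names "Location", "Title" and "Company", the placeholder "First Name" values, and the `MARK` character are all assumptions for illustration, and the code assumes every row yields exactly three parts.

```python
import pandas as pd

MARK = "\u200e"  # hypothetical stand-in for the non-ASCII characters

df = pd.DataFrame(
    {
        "First Name": ["A", "B"],  # placeholder values
        "Summary 1": [
            f"London, - {MARK}ED, Equit Technology, London - {MARK}Morgan Stanley",
            f"United Kingdom - {MARK}Chief Officer, Group Technology - {MARK}BP",
        ],
    }
)

# The leading \s* also consumes the space before each hyphen,
# so no separate strip step is needed.
parts = df["Summary 1"].str.split(r"\s*-\s*[^\x00-\x7f]+", expand=True)
parts.columns = ["Location", "Title", "Company"]  # hypothetical names

# Replace the combined column with the three new ones.
df = df.drop(columns="Summary 1").join(parts)

# to_csv with no path returns the CSV text; pass a file path to write to disk.
csv_text = df.to_csv(index=False)
print(csv_text)
```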
