Split a pandas column by delimiter when rows have two different lengths



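I am trying to split one column of a pandas DataFrame into multiple columns using a space delimiter. I realized that some rows contain a second date field, so they need more columns than the rows without it. Here is an example of the column values: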

DA Firstname Lastname 09/30/2020 07:44 AM 9/23/2020 6:06:38 PM
JW Firstname Lastname 10/25/2020 11:06 AM None

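The first row does not work with a space delimiter because it contains 8 spaces. The second row works for my dataset because it contains 6 spaces. Is there a way to split on the delimiter while keeping each date/time together as a single field?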

The desired columns are ["Initial", "Firstname", "Lastname", "date/time1", "date/time2"]; the "date/time2" column can also contain "None".

The code I tried is:

dataset = pd.read_csv("newOutput6", encoding="ISO-8859-1", delimiter="\t", names=['Name','Date'], index=False)
tmpDF = pd.DataFrame(columns=['Initals','FName','LName','SignupTime','Waiver'])
tmpDF[['Initals','FName','LName','SignupTime','Waiver']] = dataset['Name'].str.split(' ', expand=True)

Sample CSV:

,Name,Date
0,MA FName LName 10/25/2020 09:40 PM None,"October 26, 2020,8:00AM Until 8:50AM "
1,JB FName LName 10/26/2020 07:19 AM None,"October 26, 2020,8:00AM Until 8:50AM "
2,TB FName LName 10/25/2020 09:03 PM None,"October 26, 2020,8:00AM Until 8:50AM "
3,MB FName LName 10/25/2020 09:40 PM None,"October 26, 2020,8:00AM Until 8:50AM "
4,NC FName LName 10/25/2020 10:17 PM None,"October 26, 2020,8:00AM Until 8:50AM "
5,AC FName LName 10/25/2020 09:23 PM None,"October 26, 2020,8:00AM Until 8:50AM "
6,NF FName LName 10/26/2020 07:56 AM None,"October 26, 2020,8:00AM Until 8:50AM "
7,BG FName LName 10/25/2020 10:41 PM None,"October 26, 2020,8:00AM Until 8:50AM "
8,GH FName LName 10/26/2020 07:39 AM None,"October 26, 2020,8:00AM Until 8:50AM "
9,EH FName LName 10/25/2020 10:06 PM None,"October 26, 2020,8:00AM Until 8:50AM "
10,DM FName LName 10/25/2020 11:42 PM None,"October 26, 2020,8:00AM Until 8:50AM "
11,JM FName LName 10/25/2020 09:24 PM None,"October 26, 2020,8:00AM Until 8:50AM "
12,TP FName LName 10/26/2020 12:32 AM None,"October 26, 2020,8:00AM Until 8:50AM "
13,DS FName LName 10/25/2020 11:12 PM None,"October 26, 2020,8:00AM Until 8:50AM "
14,KS FName LName 10/25/2020 07:46 PM None,"October 26, 2020,8:00AM Until 8:50AM "
15,JW FName LName 10/25/2020 11:06 AM None,"October 26, 2020,8:00AM Until 8:50AM "
16,DA FName LName 09/30/2020 07:44 AM 9/23/2020 6:06:38 PM,"October 26, 2020,9:00AM Until 9:50AM "

Index 16 is a row that does not follow the usual format. I suspect a regular expression is needed to handle it.

Assuming FirstName and LastName contain no spaces (otherwise, how would you tell them apart?):

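# Raw strings keep the regex escapes (\w, \s, \d) intact.
pattern = (r'^(?P<Initials>\w+)\s'
           r'(?P<FName>\w+)\s'
           r'(?P<LName>\w+)\s'
           r'(?P<SignupTime>\d+/\d+/\d+ \d+:\d+ \w+)\s'
           r'(?P<Waiver>.*)'
)
# Each named group becomes a column in the result.
df['name'].str.extract(pattern)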

Output:

  Initials      FName     LName           SignupTime                Waiver
0       DA  Firstname  Lastname  09/30/2020 07:44 AM  9/23/2020 6:06:38 PM
1       JW  Firstname  Lastname  10/25/2020 11:06 AM                  None
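
As a follow-up, here is a minimal self-contained sketch of the same extraction, with the two date/time columns then converted to real datetimes. The DataFrame `df`, the result name `parts`, and the two format strings are assumptions inferred from the sample rows in the question, not something the original data guarantees; `errors='coerce'` turns the literal string "None" into NaT.

```python
import pandas as pd

# Two sample rows taken from the question.
df = pd.DataFrame({'name': [
    'DA Firstname Lastname 09/30/2020 07:44 AM 9/23/2020 6:06:38 PM',
    'JW Firstname Lastname 10/25/2020 11:06 AM None',
]})

pattern = (
    r'^(?P<Initials>\w+)\s'
    r'(?P<FName>\w+)\s'
    r'(?P<LName>\w+)\s'
    r'(?P<SignupTime>\d+/\d+/\d+ \d+:\d+ \w+)\s'
    r'(?P<Waiver>.*)'
)

# Named groups become the column names of the extracted frame.
parts = df['name'].str.extract(pattern)

# Assumed formats: SignupTime has no seconds, Waiver does.
parts['SignupTime'] = pd.to_datetime(parts['SignupTime'],
                                     format='%m/%d/%Y %I:%M %p')
# Values that do not match the format (such as "None") become NaT.
parts['Waiver'] = pd.to_datetime(parts['Waiver'],
                                 format='%m/%d/%Y %I:%M:%S %p',
                                 errors='coerce')
print(parts)
```

If the Waiver column should stay as text, skip the second conversion and leave the "None" strings as they are.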

Update: if the initials are optional, you can try this pattern:

pattern = (r'^(?P<Initials>\w+\s)?'    # make the initials optional
           r'(?P<FName>\w+)\s+'
           r'(?P<LName>\w+)\s+'
           r'(?P<SignupTime>\d+/\d+/\d+ \d+:\d+ \w+)\s'
           r'(?P<Waiver>.*)'
)

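Note that when Initials is present, it will now include a trailing space, which is easy to strip afterwards (for example with .str.strip()).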
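
One possible way to avoid the trailing space altogether is to move the whitespace into a non-capturing group around the initials. This is a variation on the pattern above, not part of the original answer; it is sketched here with the standard `re` module on two hypothetical input strings, and the same pattern string should work with `str.extract` as well.

```python
import re

# The whitespace after the initials sits in a non-capturing group (?:...),
# so the captured Initials value carries no trailing space.
pattern = re.compile(
    r'^(?:(?P<Initials>\w+)\s+)?'
    r'(?P<FName>\w+)\s+'
    r'(?P<LName>\w+)\s+'
    r'(?P<SignupTime>\d+/\d+/\d+ \d+:\d+ \w+)\s'
    r'(?P<Waiver>.*)'
)

# Hypothetical rows: one with initials, one without.
with_initials = pattern.match(
    'JW Firstname Lastname 10/25/2020 11:06 AM None').groupdict()
without_initials = pattern.match(
    'Firstname Lastname 10/25/2020 11:06 AM None').groupdict()

print(with_initials)
print(without_initials)
```

When the initials are missing, the Initials group is simply unmatched (None in `re`, NaN in pandas).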
