Splitting CSV text in a nested list on multiple delimiters (skipping the first occurrence)



My data looks like this:

[['15/09/16, 12:21 pm - User1: Hey'],
 ['15/09/16, 12:22 pm - User2: <Media omitted>'],
 ["15/09/16, 12:22 pm - User2: It's yesterday's work"],
 ['15/09/16, 12:22 pm - User1: Gotta work on it.']]

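I am trying to split this nested list into separate columns: date, time, username, and message.

Right now my delimiters are:

, separates the date

- separates the time

: separates the username from the message

The problem is that if I split on :, it also splits the time, because the time is in XX:XX format.

For now, my first step is to get the split right; after that I can convert the result to CSV.

Attempt 1 - I tried to split the data directly while reading it, but nothing changed.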

delim=",","-",":"
regexPattern = '|'.join(map(re.escape, delim))
data = []
for line in open('/content/drive/My Drive/sample.txt'):
    items = line.rstrip('\r\n').split(regexPattern)   # strip new-line characters and split on column delimiter
    items = [item.strip() for item in items]  # strip extra whitespace off data items
    data.append(items)

Attempt 2 - I tried to split while writing the CSV.

delim=",","-",":"
regexPattern = '|'.join(map(re.escape, delim))
with open('output.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    re.split(regexPattern,data)
    writer.writerows(data)

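This fails with an error because split expects a string and I have a list. I am not sure how to achieve my main goal.

Any help would be greatly appreciated.

Use the pattern re.compile(r",|-|:\s+")

Example: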

import re
data = [['15/09/16, 12:21 pm - User1: Hey'],
 ['15/09/16, 12:22 pm - User2: <Media omitted>'],
 ["15/09/16, 12:22 pm - User2: It's yesterday's work"],
 ['15/09/16, 12:22 pm - User1: Gotta work on it.']]

regexPattern = re.compile(r",|-|:\s+")
for i in data:
    for j in i:
        print(regexPattern.split(j))

Output:

['15/09/16', ' 12:21 pm ', ' User1', 'Hey']
['15/09/16', ' 12:22 pm ', ' User2', '<Media omitted>']
['15/09/16', ' 12:22 pm ', ' User2', "It's yesterday's work"]
['15/09/16', ' 12:22 pm ', ' User1', 'Gotta work on it.']
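The fields above still carry leading and trailing spaces. A minimal sketch, assuming the same pattern, that strips each field after splitting. The helper name split_line is introduced here only for illustration. Note that with this approach a comma or hyphen inside the message text still produces extra fields.

```python
import re

# Same pattern as above: split on ',', '-', or ':' followed by whitespace.
pattern = re.compile(r",|-|:\s+")

def split_line(line):
    """Split one chat line and strip surrounding whitespace from each field."""
    return [field.strip() for field in pattern.split(line)]

print(split_line('15/09/16, 12:21 pm - User1: Hey'))

# A comma inside the message is also treated as a delimiter,
# so this line yields five fields instead of four.
print(split_line('15/09/16, 12:22 pm - User1: a, b'))
```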

Use regex groups.

Demo:

import re
data = [['15/09/16, 12:21 pm - User1: Hey'],
 ['15/09/16, 12:22 pm - User2: <Media omitted>'],
 ["15/09/16, 12:22 pm - User2: It's yesterday's work"],
 ['15/09/16, 12:22 pm - User1: Gotta work on it, what,hello.']]

regexPattern = re.compile(r"(?P<date>\d{2,}/\d{2,}/\d{2,}),\s*(?P<time>\d{2,}:\d{2,}\s*[a-z]{2,})\s*-\s*(?P<user>\w+):\s*(?P<msg>.*)$")
for i in data:
    for j in i:
        print(regexPattern.match(j).groups())

Output:

('15/09/16', '12:21 pm', 'User1', 'Hey')
('15/09/16', '12:22 pm', 'User2', '<Media omitted>')
('15/09/16', '12:22 pm', 'User2', "It's yesterday's work")
('15/09/16', '12:22 pm', 'User1', 'Gotta work on it, what,hello.')
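Because the groups are named, match.groupdict() can return each line as a dictionary keyed by column name. The sketch below also guards against lines that do not match, since calling .groups() on a failed match raises AttributeError. The function name parse_line and the non-matching sample line are assumptions made for illustration.

```python
import re

# Same named-group pattern as above, split across lines for readability.
line_re = re.compile(
    r"(?P<date>\d{2,}/\d{2,}/\d{2,}),\s*"
    r"(?P<time>\d{2,}:\d{2,}\s*[a-z]{2,})\s*-\s*"
    r"(?P<user>\w+):\s*(?P<msg>.*)$"
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    m = line_re.match(line)
    return m.groupdict() if m else None

print(parse_line('15/09/16, 12:21 pm - User1: Hey'))
print(parse_line('a line without a timestamp'))  # hypothetical non-matching input
```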

Without regex:

def parse(item):
    date_time, user_message =  item.split(' - ', 1)
    return [*date_time.split(', '), *user_message.split(': ', 1)]
eggs = [['15/09/16, 12:21 pm - User1: Hey'],
        ['15/09/16, 12:22 pm - User2: <Media omitted>'],
        ["15/09/16, 12:22 pm - User2: It's yesterday's work"],
        ['15/09/16, 12:22 pm - User1: Gotta work on it.']]
spam = [parse(egg[0]) for egg in eggs]
print(spam)

Output:

[['15/09/16', '12:21 pm', 'User1', 'Hey'],
 ['15/09/16', '12:22 pm', 'User2', '<Media omitted>'],
 ['15/09/16', '12:22 pm', 'User2', "It's yesterday's work"],
 ['15/09/16', '12:22 pm', 'User1', 'Gotta work on it.']]
  • I formatted the output by hand for clarity
  • You need to explicitly pass maxsplit=1
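
To illustrate why maxsplit matters, here is a small sketch using a made-up message that itself contains ': ' and ' - '. The sample line is an assumption, not part of the original data.

```python
def parse(item):
    # Split only on the first ' - ' and the first ': ' so the message stays intact.
    date_time, user_message = item.split(' - ', 1)
    return [*date_time.split(', '), *user_message.split(': ', 1)]

# Hypothetical line whose message contains both delimiters.
tricky = '15/09/16, 12:23 pm - User1: Note: call me - it is urgent'

with_maxsplit = parse(tricky)
print(with_maxsplit)

# Without maxsplit, the message is cut at every ': '.
without_maxsplit = tricky.split(' - ', 1)[1].split(': ')
print(without_maxsplit)
```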

This is a perfect use case for regex groups.

import re

s = '15/09/16, 12:21 pm - User1: Hey'
ms = re.match(r'(\d+/\d+/\d+).+?(\d+:\d+).+-\s(.*):\s(.*)', s)
print(ms.groups()) # ('15/09/16', '12:21', 'User1', 'Hey')

You can then join them back into a CSV line.
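
One way to do that, sketched here under a few assumptions, is to hand the captured groups to csv.writer, which quotes any field containing a comma. The header row and the in-memory buffer (standing in for a real output file) are illustrative choices. Note that this pattern captures only the XX:XX part of the time and drops the am/pm marker.

```python
import csv
import io
import re

line_re = re.compile(r'(\d+/\d+/\d+).+?(\d+:\d+).+-\s(.*):\s(.*)')

lines = [
    '15/09/16, 12:21 pm - User1: Hey',
    '15/09/16, 12:22 pm - User1: Gotta work on it, what,hello.',
]

# In-memory buffer used for illustration; for a real file, use
# open('output.csv', 'w', newline='') instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['date', 'time', 'user', 'message'])  # assumed header names
for line in lines:
    m = line_re.match(line)
    if m:  # skip lines that do not match the pattern
        writer.writerow(m.groups())

csv_text = buffer.getvalue()
print(csv_text)
```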
