r - Split comments at each timestamp



Hey, I have a comment in a single cell that contains several timestamps, like this:

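2019-07-26 20:36:19 - (Work notes) Informed the caller that the transaction has been deleted from Concur. Resolving the INC as there is no pending action. Sent a resolution email to the caller. Copy-pasted the reply and simplified/summarized the information received from the engineering team: Yes. Updated the work notes: Yes. Updated the state to Awaiting User: Yes.

2019-07-26 10:32:05 - oneflow (Work notes)[code] Hi Team. We have removed those gits.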

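What I want is to split this cell into rows, so that each timestamp ends up on its own row together with its text.

Please help. Any code in R or Python would be helpful.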

A Python option using regex:

import re
s = """2019-07-26 20:36:19 - (Work notes) Informed the caller that the [...]
line without timestamp!
2019-07-26 10:32:05 - oneflow (Work notes)[code] Hi Team.We have removed those gits."""
# search for the timestamps (format: YYYY-MM-DD HH:MM:SS)
timestamps = re.findall(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', s)
# if timestamps were found, obtain their indices in the string:
if timestamps:
    idx = [s.index(t) for t in timestamps] + [None]  # add None to get the last part...
    # split the string and put the results in tuples:
    text_tuples = []
    l = len(timestamps[0])  # how many characters to expect for the timestamp
    for i, j in zip(idx[:-1], idx[1:]):  # use zip to iterate over two sequences at once
        text_tuples.append((s[i:i+l],  # timestamp
                            s[i+l:j].strip(' - ')))  # part before next timestamp
# text_tuples
# [('2019-07-26 20:36:19',
#   '(Work notes) Informed the caller that the [...]\nline without timestamp!\n'),
#  ('2019-07-26 10:32:05',
#   'oneflow (Work notes)[code] Hi Team.We have removed those gits.')]

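In this example, you get a list of tuples, each containing a timestamp and the text that follows it up to the next timestamp. A line without its own timestamp stays attached to the preceding timestamp's text, as in the output above; any text before the first timestamp does not make it into the output.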
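One caveat with the approach above: `s.index(t)` always returns the first occurrence, so two notes carrying the identical timestamp would both resolve to the same position. A possible variant, sketched here under the assumption that repeated timestamps can occur in your data, uses `re.finditer` to take the positions straight from the matches. The function name `split_by_timestamp` and the sample string are made up for illustration, and the stripping differs slightly from the code above because newlines are stripped as well.

```python
import re

# Same timestamp pattern as above, compiled once
TS = re.compile(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}')

def split_by_timestamp(s):
    """Return (timestamp, text) tuples; text before the first timestamp is dropped."""
    matches = list(TS.finditer(s))
    # each chunk ends where the next timestamp starts; the last one runs to the end
    ends = [m.start() for m in matches[1:]] + [len(s)]
    # strip surrounding spaces, dashes and newlines from each chunk
    return [(m.group(), s[m.end():end].strip(' -\n'))
            for m, end in zip(matches, ends)]

# illustrative input with a repeated timestamp
s = """2019-07-26 20:36:19 - first note
line without timestamp!
2019-07-26 20:36:19 - second note, same timestamp"""

result = split_by_timestamp(s)
print(result)
```

Because the positions come from the match objects, each note keeps its own text even when timestamps repeat. A string without any timestamp yields an empty list.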


Edit: extension to a pandas DataFrame, following the OP's comment:

import re
import pandas as pd
# create a custom function to split the comments:
def split_comment(s):
    # search for the timestamps (format: YYYY-MM-DD HH:MM:SS)
    timestamps = re.findall(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', s)
    # if timestamps were found, obtain their indices in the string:
    if timestamps:
        idx = [s.index(t) for t in timestamps] + [None]  # add None to get the last part...
        # split the string and put the results in lists:
        splitted = []
        l = len(timestamps[0])  # how many characters to expect for the timestamp
        for i, j in zip(idx[:-1], idx[1:]):  # use zip to iterate over two sequences at once
            splitted.append([s[i:i+l],  # timestamp
                             s[i+l:j].strip(' - ')])  # part before next timestamp
        return splitted
    # no timestamp found: return a single [timestamp, comment] pair,
    # wrapped in a list so that explode() yields exactly one row
    return [['NaT', s]]
s0 = """2019-07-26 20:36:19 - (Work notes) Informed the caller that the [...]
line without timestamp!
2019-07-26 10:32:05 - oneflow (Work notes)[code] Hi Team.We have removed those gits."""
s1 = "2019-07-26 20:36:23  another comment"
# create example df
df = pd.DataFrame({'s': [s0, s1], 'id': [0, 1]})
# create a dummy column that holds the list of [timestamp, comment] pairs for each row:
df['tmp'] = df['s'].apply(split_comment)
# explode the df so we have one row for each timestamp / comment pair:
df = df.explode('tmp').reset_index(drop=True)
# create two columns from the dummy column, 'timestamp' and 'comment':
df[['timestamp', 'comment']] = pd.DataFrame(df['tmp'].to_list(), index=df.index)
# drop the columns we don't need anymore:
df = df.drop(['s', 'tmp'], axis=1)
# so now we have:
# df
#    id            timestamp                                            comment
# 0   0  2019-07-26 20:36:19  (Work notes) Informed the caller that the [......
# 1   0  2019-07-26 10:32:05  oneflow (Work notes)[code] Hi Team.We have rem...
# 2   1  2019-07-26 20:36:23                                    another comment
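
The 'timestamp' column produced above still holds plain strings. If you want real datetimes, for example to sort the notes chronologically, something like the following may work. It is only a sketch: the small frame below is an illustrative stand-in for the result above, with one extra row using the 'NaT' placeholder for a comment that had no timestamp.

```python
import pandas as pd

# stand-in for the result frame built above (values are illustrative)
df = pd.DataFrame({
    'id': [0, 0, 1, 2],
    'timestamp': ['2019-07-26 20:36:19', '2019-07-26 10:32:05',
                  '2019-07-26 20:36:23', 'NaT'],
    'comment': ['first', 'second', 'another comment', 'no timestamp'],
})

# parse the strings; anything that cannot be parsed becomes NaT instead of raising
df['timestamp'] = pd.to_datetime(df['timestamp'],
                                 format='%Y-%m-%d %H:%M:%S',
                                 errors='coerce')

# sort chronologically within each id; NaT values are placed last by default
df = df.sort_values(['id', 'timestamp']).reset_index(drop=True)
print(df)
```

With `errors='coerce'`, rows without a usable timestamp end up as `NaT` rather than stopping the conversion, which matches the placeholder used in `split_comment`.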
