Looping over a string and adding the delimiter back to the substrings



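I am trying to split a column of strings. I want to split the string in each cell on "juice", but keep "juice" in the result if it is not in the last substring.

Example: df['value'] looks like this: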

1. applejuice, orangejuice, juice, applejuice, pineapple juice,  berriesjuice 
2. carrotjuice, juice, pinapple juice, water, berriesjuice, juice

The output in my new column df['value2'] should look like this:

1. [applejuice, orangejuice, juice], [applejuice, pineapple ,juice], [berriesjuice]
2. [carrotjuice, juice], [pinapple juice], [water, berriesjuice, juice]

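It's not clear why you need a DataFrame for this, but first split on commas, then iterate over the pieces and check whether each one is equal to juice.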

import re

lines = [
    'applejuice, orangejuice, juice, applejuice, pineapple juice, juice, berriesjuice',
    'carrotjuice, juice, pinapple juice, water, berriesjuice, juice'
]

def getSections(line):
    # Split on a comma followed by any amount of whitespace.
    strings = re.split(r',\s*', line)

    sections = []
    section = []
    for x in strings:
        section.append(x)
        if x == 'juice':
            # A bare 'juice' closes the current section.
            sections.append(section)
            section = []
    # Keep whatever is left over after the last 'juice'.
    if section:
        sections.append(section)

    return sections

for s in map(getSections, lines):
    print(s)

Output:

[['applejuice', 'orangejuice', 'juice'], ['applejuice', 'pineapple juice', 'juice'], ['berriesjuice']]
[['carrotjuice', 'juice'], ['pinapple juice', 'water', 'berriesjuice', 'juice']]

If you need to, you can create a DataFrame from the list of lists.
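As a rough sketch of that last step, the sections printed above can be placed in a column next to the original strings. The variable names `sections`, `df`, and `exploded` are illustrative, and the optional `explode` call assumes pandas 0.25 or later.

```python
import pandas as pd

lines = [
    'applejuice, orangejuice, juice, applejuice, pineapple juice, juice, berriesjuice',
    'carrotjuice, juice, pinapple juice, water, berriesjuice, juice',
]

# The list of lists printed above, one entry per input line.
sections = [
    [['applejuice', 'orangejuice', 'juice'],
     ['applejuice', 'pineapple juice', 'juice'],
     ['berriesjuice']],
    [['carrotjuice', 'juice'],
     ['pinapple juice', 'water', 'berriesjuice', 'juice']],
]

# One row per input line; each 'value2' cell holds that line's sections.
df = pd.DataFrame({'value': lines, 'value2': sections})

# Optional: one row per section instead of one row per line.
exploded = df.explode('value2')
print(exploded)
```

Whether to keep the nested lists in one cell or explode them into rows depends on what you do with the sections afterwards.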

Applying this function to the value column does the job. It first splits on ", " (note the space), then starts a new sublist every time it encounters "juice".

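def separate(string):
    substrings = [[]]
    for x in string.split(', '):
        substrings[-1].append(x)
        if x == 'juice':
            # A bare 'juice' closes the current sublist and opens a new one.
            substrings.append([])
    # Drop the empty sublist left over when the string ends with 'juice'.
    if not substrings[-1]:
        substrings.pop()
    return substrings

import pandas as pd

df = pd.DataFrame({'value': [
    'applejuice, orangejuice, juice, applejuice, pineapple juice, juice, berriesjuice',
    'carrotjuice, juice, pinapple juice, water, berriesjuice, juice'
]})
# Series.apply forwards extra keyword arguments to the function, so no axis argument here.
df['value2'] = df['value'].apply(separate)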

I'm not sure about the speed, though.
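One way to get a feel for the speed is to time the function on plain strings with the standard library's `timeit`. This is only a sketch: the row count, the repeat count, and the names `sample`, `rows`, and `elapsed` are arbitrary choices, and it measures the Python function alone, not the overhead of `apply`.

```python
import timeit

def separate(string):
    substrings = [[]]
    for x in string.split(', '):
        substrings[-1].append(x)
        if x == 'juice':
            substrings.append([])
    # Drop the empty sublist left over when the string ends with 'juice'.
    if not substrings[-1]:
        substrings.pop()
    return substrings

sample = 'carrotjuice, juice, pinapple juice, water, berriesjuice, juice'
rows = [sample] * 10_000  # stand-in for a column with 10,000 cells

# Total time for 10 full passes over the rows.
elapsed = timeit.timeit(lambda: [separate(r) for r in rows], number=10)
print(f"{elapsed / 10:.4f} s per pass over {len(rows)} rows")
```

The numbers will vary by machine and by how long the real strings are, so it is worth running this against a sample of your own data.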
