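I am trying to split a column on a string. I want to split the string in each cell on "juice", but keep "juice" if it is not in the last substring.
Example: df['value'] looks like this: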
1. applejuice, orangejuice, juice, applejuice, pineapple juice, berriesjuice
2. carrotjuice, juice, pinapple juice, water, berriesjuice, juice
The output in my new column df['value2'] should look like this:
1. [applejuice, orangejuice, juice], [applejuice, pineapple ,juice], [berriesjuice]
2. [carrotjuice, juice], [pinapple juice], [water, berriesjuice, juice]
It is not clear why you need a DataFrame, but first split on commas, then iterate over the pieces and check whether each one equals 'juice'.
import re

lines = [
    'applejuice, orangejuice, juice, applejuice, pineapple juice, juice, berriesjuice',
    'carrotjuice, juice, pinapple juice, water, berriesjuice, juice'
]

def getSections(line):
    # Split on commas, swallowing any whitespace that follows each comma.
    strings = re.split(r',\s*', line)
    sections = []
    section = []
    for x in strings:
        section.append(x)
        if x == 'juice':
            # A bare 'juice' closes the current section.
            sections.append(section)
            section = []
    # Keep any trailing items that were not closed by a 'juice'.
    if section:
        sections.append(section)
    return sections

for s in map(getSections, lines):
    print(s)
[['applejuice', 'orangejuice', 'juice'], ['applejuice', 'pineapple juice', 'juice'], ['berriesjuice']]
[['carrotjuice', 'juice'], ['pinapple juice', 'water', 'berriesjuice', 'juice']]
If needed, you can create a DataFrame from the list of lists.
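As a rough sketch of that last step, the snippet below puts the nested lists into a DataFrame column. `split_sections` is a hypothetical, condensed stand-in for getSections, and the column names `value` and `value2` are taken from the question; treat this as one possible arrangement rather than the only way to do it.

```python
import re
import pandas as pd

def split_sections(line):
    # Hypothetical helper: same idea as getSections, closing a section
    # after each bare 'juice'.
    sections, current = [], []
    for item in re.split(r',\s*', line):
        current.append(item)
        if item == 'juice':
            sections.append(current)
            current = []
    if current:
        sections.append(current)
    return sections

lines = [
    'applejuice, orangejuice, juice, applejuice, pineapple juice, juice, berriesjuice',
    'carrotjuice, juice, pinapple juice, water, berriesjuice, juice',
]

df = pd.DataFrame({'value': lines})
# Series.map calls the function once per cell; each cell of value2 holds a list of lists.
df['value2'] = df['value'].map(split_sections)
print(df['value2'].tolist())
```

Each cell of `value2` then holds one list of sections for the corresponding row of `value`.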
Applying this function to the value column does the job. It first splits on ", " (note the space), then starts a new sub-list each time it encounters "juice".
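def separate(string):
    substrings = [[]]
    for x in string.split(', '):
        substrings[-1].append(x)
        if x == 'juice':
            # A bare 'juice' closes the current sub-list; start a new one.
            substrings.append([])
    # Drop the empty sub-list left behind when the string ends with 'juice'.
    if not substrings[-1]:
        substrings.pop()
    return substrings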
import pandas as pd

df = pd.DataFrame({'value': [
    'applejuice, orangejuice, juice, applejuice, pineapple juice, juice, berriesjuice',
    'carrotjuice, juice, pinapple juice, water, berriesjuice, juice'
]})
# Series.apply has no axis parameter; extra keywords would be forwarded to separate().
df['value2'] = df['value'].apply(separate)
I'm not sure about the speed, though.
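One way to check the speed is to time it on a larger frame. The sketch below is only an illustration: the row count, the repeat count, and the comparison against a plain list comprehension are arbitrary assumptions, and the numbers will vary by machine and pandas version.

```python
import timeit
import pandas as pd

def separate(string):
    # Same splitting logic as above, dropping a trailing empty sub-list.
    substrings = [[]]
    for x in string.split(', '):
        substrings[-1].append(x)
        if x == 'juice':
            substrings.append([])
    if not substrings[-1]:
        substrings.pop()
    return substrings

rows = [
    'applejuice, orangejuice, juice, applejuice, pineapple juice, juice, berriesjuice',
    'carrotjuice, juice, pinapple juice, water, berriesjuice, juice',
]
# Assumed test size: 20,000 rows built by repeating the two sample strings.
df = pd.DataFrame({'value': rows * 10_000})

# Time Series.apply against a plain list comprehension, 5 runs each.
t_apply = timeit.timeit(lambda: df['value'].apply(separate), number=5)
t_listcomp = timeit.timeit(lambda: [separate(s) for s in df['value']], number=5)

print(f'apply:              {t_apply:.3f} s')
print(f'list comprehension: {t_listcomp:.3f} s')
```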