Splitting several sentences in a pandas DataFrame

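I have a pandas DataFrame with a column that looks like this.

sentences
['This is text.', 'This is another text.', 'This is also text.', 'Even more text.']
['This is the same in another row.', 'Another row another text.', 'Text in second row.']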

You can do this in one line with np.char.split:

df['separated'] = np.char.split(df['sentences'].tolist()).tolist()
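As far as I can tell, np.char.split first has to turn the nested lists into a regular string array, so it may not work when rows hold different numbers of sentences. In that case a plain-Python fallback along these lines should do; this is only a sketch, the sample data is made up, and note that whitespace splitting keeps the punctuation attached to the last word.

```python
import pandas as pd

# Hypothetical sample data: rows hold different numbers of sentences.
df = pd.DataFrame({
    "sentences": [
        ["This is text.", "Even more text."],
        ["Text in second row."],
    ]
})

# Split every sentence of every row on whitespace.
# Punctuation stays attached to the neighbouring word ("text.").
df["separated"] = df["sentences"].apply(
    lambda sentences: [sentence.split() for sentence in sentences]
)

print(df["separated"].tolist())
```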

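@Kata If, by saying the sentences column has type str, you mean that the element in each row is a string rather than a list, e.g. "['This is text.', 'This is another text.', 'This is also text.', 'Even more text.']", then you need to convert the strings to lists first. One way is to use ast.literal_eval.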

from ast import literal_eval
df['sentences'] = df['sentences'].apply(literal_eval)
df['separated'] = np.char.split(df['sentences'].tolist()).tolist()
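To check which case you are in, you can look at the Python type of one cell before and after the conversion. A small sketch with made-up data:

```python
from ast import literal_eval

import pandas as pd

# Hypothetical sample data: each cell is a *string* that looks like a list.
df = pd.DataFrame({
    "sentences": [
        "['This is text.', 'Even more text.']",
        "['Another row another text.', 'Text in second row.']",
    ]
})

type_before = type(df["sentences"].iloc[0])  # str

# literal_eval safely parses Python literals (lists, strings, numbers, ...)
# without executing arbitrary code.
df["sentences"] = df["sentences"].apply(literal_eval)

type_after = type(df["sentences"].iloc[0])  # list
print(type_before, type_after)
```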

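A note on the data: storing data this way is not recommended. If possible, fix it at the source. Each cell should preferably hold a plain string rather than a list, or at least an actual list rather than the string representation of a list.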
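If you already have real lists, one way to get to "one plain string per cell" is to explode the column so that every sentence gets its own row. This is just a sketch on made-up data; the column name sentence is my own choice.

```python
import pandas as pd

# Hypothetical sample data: each cell holds a list of sentences.
df = pd.DataFrame({
    "sentences": [
        ["This is text.", "Even more text."],
        ["Text in second row."],
    ]
})

# One sentence per row; the original row label is repeated in the index,
# so the sentences can still be grouped back by their source row.
long_df = df.explode("sentences").rename(columns={"sentences": "sentence"})

print(long_df)
```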

With df as your DataFrame, you could try the following:

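df["splitted"] = (
    df["sentences"]
    # Strip brackets and quotes from both ends, then split between sentences.
    .str.strip("[]'\"").str.split("', '|', \"|\", '|\", \"")
    .explode()
    # Extract the words of each sentence (the trailing period is dropped).
    .str.findall(r"\b([^ ]+?)\b")
    .groupby(level=0).agg(list)
)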

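First, .strip removes [, ], ', and " from the start and end of each row.
Then .split turns each row into a list of sentences.
.explode puts each sentence on its own row, and .findall extracts the words of each sentence into a list.
Finally, the word lists that belong to the same original row are grouped back into one list.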
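If the chain is hard to follow, it can help to run it one step at a time and inspect each intermediate Series. A sketch on shortened, made-up data; the intermediate variable names are mine, and I wrote the separators with a literal comma, which is how the list representation separates its elements.

```python
import pandas as pd

# Hypothetical sample data: string representations of lists, mixed quoting.
df = pd.DataFrame({
    "sentences": [
        """['This is text.', 'Even more text.']""",
        """["Text in second row.", 'Last text in second row.']""",
    ]
})

# Step 1: drop the brackets and outer quotes at both ends of each cell.
stripped = df["sentences"].str.strip("[]'\"")

# Step 2: split each cell on the quote-comma-quote separators (a regex
# with one alternative per quote combination).
sentences = stripped.str.split("', '|', \"|\", '|\", \"")

# Step 3: one sentence per row; the original row label repeats in the index.
exploded = sentences.explode()

# Step 4: extract the words of each sentence as a list.
words = exploded.str.findall(r"\b([^ ]+?)\b")

# Step 5: collect the word lists sharing a row label back into one list.
df["splitted"] = words.groupby(level=0).agg(list)

print(df["splitted"].tolist())
```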

Result of df["splitted"] for

df = pd.DataFrame({
    "sentences": [
        """['This is text.', 'This is another text.', 'This is also text.', 'Even more text.']""",
        """["This is the same in another row.", 'Another row another text.', 'Text in second row.', 'Last text in second row.']"""
    ]
})

0  [['This', 'is', 'text'], ['This', 'is', 'another', 'text'], ['This', 'is', 'also', 'text'], ['Even', 'more', 'text']]
1  [['This', 'is', 'the', 'same', 'in', 'another', 'row'], ['Another', 'row', 'another', 'text'], ['Text', 'in', 'second', 'row'], ['Last', 'text', 'in', 'second', 'row']]
