Output to a pandas DataFrame



I am extracting quotations from text as follows, with the output shown below:

import textacy
import textacy.extract.triples

data = [
    ('"Hello, nice to meet you," said John. Jane said, "It is nice to meet you as well."', {"url": "example1.com", "date": "Jan 1"}),
    ('"Hello, nice to meet you," said John', {"url": "example2.com", "date": "Jan 2"}),
]
for record in data:
    # Each record is a (text, metadata) pair
    doc = textacy.make_spacy_doc(record, lang="en_core_web_sm")
    print(list(textacy.extract.triples.direct_quotations(doc)))

'''
[DQTriple(speaker=[John], cue=[said], content="Hello, nice to meet you,"), DQTriple(speaker=[Jane], cue=[said], content="It is nice to meet you as well.")]
[DQTriple(speaker=[John], cue=[said], content="Hello, nice to meet you,")]
'''

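My goal is to convert this output, together with the metadata from the original dataset, into a pandas DataFrame. Specifically, I would like it to look like this: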

import pandas as pd
output = {"url": ["example1.com", "example1.com", "example2.com"],
          "date": ["Jan 1", "Jan 1", "Jan 2"],
          "speaker": ["John", "Jane", "John"],
          "cue": ["said", "said", "said"],
          "content": ["Hello, nice to meet you", "It is nice to meet you as well", "Hello, nice to meet you"]}
df = pd.DataFrame(output)
print(df)
'''
            url   date speaker   cue                         content
0  example1.com  Jan 1    John  said         Hello, nice to meet you
1  example1.com  Jan 1    Jane  said  It is nice to meet you as well
2  example2.com  Jan 2    John  said         Hello, nice to meet you
'''

Is there an efficient way to do this?

import pandas as pd

l = []
for record in data:
    doc = textacy.make_spacy_doc(record, lang="en_core_web_sm")
    l.append(list(textacy.extract.triples.direct_quotations(doc)))
# One row per extracted triple, then expand each triple into columns
out = pd.Series(l).explode().apply(pd.Series)
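
The snippet above does not carry the url and date metadata into the result. One possible way to keep it is to build one row per quotation by hand, as in the rough sketch below. To stay runnable without a spaCy model, it uses a hypothetical stand-in `DQTriple` namedtuple with the same three fields shown in the printed output; `quotes_to_frame` is an illustrative helper name, not part of textacy or pandas. Stripping the surrounding quotes and trailing punctuation from `content` is an assumption based on the desired table.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for textacy's DQTriple, so the sketch runs on its own.
# Real triples hold spaCy tokens and spans; str() turns those into plain text.
DQTriple = namedtuple("DQTriple", ["speaker", "cue", "content"])


def quotes_to_frame(pairs):
    """Build one DataFrame row per quotation from (metadata, triples) pairs."""
    rows = []
    for meta, triples in pairs:
        for triple in triples:
            rows.append({
                **meta,  # copy url, date, ... onto every quotation row
                "speaker": " ".join(str(tok) for tok in triple.speaker),
                "cue": " ".join(str(tok) for tok in triple.cue),
                # Assumption: drop the enclosing quotes and trailing , or .
                "content": str(triple.content).strip('" ,.'),
            })
    return pd.DataFrame(rows)


pairs = [
    ({"url": "example1.com", "date": "Jan 1"},
     [DQTriple(["John"], ["said"], '"Hello, nice to meet you,"'),
      DQTriple(["Jane"], ["said"], '"It is nice to meet you as well."')]),
    ({"url": "example2.com", "date": "Jan 2"},
     [DQTriple(["John"], ["said"], '"Hello, nice to meet you,"')]),
]
df = quotes_to_frame(pairs)
print(df)
```

With real data, each pair would presumably be the record's metadata dict together with the result of `textacy.extract.triples.direct_quotations(doc)` for that record. A plain loop is used here rather than `explode` because it keeps the metadata and the triple fields together in one step.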
