I am extracting quotations from text as follows, with the output shown below:
import textacy.extract.triples

data = [
    ('"Hello, nice to meet you," said John. Jane said, "It is nice to meet you as well."', {"url": "example1.com", "date": "Jan 1"}),
    ('"Hello, nice to meet you," said John', {"url": "example2.com", "date": "Jan 2"}),
]
for record in data:
    # Each record is a (text, metadata) pair
    doc = textacy.make_spacy_doc(record, lang="en_core_web_sm")
    print(list(textacy.extract.triples.direct_quotations(doc)))
'''
[DQTriple(speaker=[John], cue=[said], content="Hello, nice to meet you,"), DQTriple(speaker=[Jane], cue=[said], content="It is nice to meet you as well.")]
[DQTriple(speaker=[John], cue=[said], content="Hello, nice to meet you,")]
'''
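My goal is to convert this output into a pandas DataFrame together with the metadata from the original dataset. Specifically, I would like it to look like this: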
import pandas as pd
output = {"url": ["example1.com", "example1.com", "example2.com"],
          "date": ["Jan 1", "Jan 1", "Jan 2"],
          "speaker": ["John", "Jane", "John"],
          "cue": ["said", "said", "said"],
          "content": ["Hello, nice to meet you", "It is nice to meet you as well", "Hello, nice to meet you"]}
df = pd.DataFrame(output)
print(df)
'''
            url   date  speaker   cue                         content
0  example1.com  Jan 1     John  said         Hello, nice to meet you
1  example1.com  Jan 1     Jane  said  It is nice to meet you as well
2  example2.com  Jan 2     John  said         Hello, nice to meet you
'''
Is there an efficient way to do this?
l = []
for record in data:
    doc = textacy.make_spacy_doc(record, lang="en_core_web_sm")
    l.append(list(textacy.extract.triples.direct_quotations(doc)))
out = pd.Series(l).explode().apply(pd.Series)
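One possible approach is to build one flat row per quotation and copy the record's metadata into each row, so the metadata is never separated from the triples. The following is only a minimal sketch under stated assumptions: `triples_to_rows` and `Tok` are hypothetical names introduced here, and the stand-in objects replace spaCy tokens and spans so the example runs without textacy or a language model. It assumes that `speaker` and `cue` are lists of objects with a `.text` attribute and that `content` has a `.text` attribute too, as the printed output above suggests.

```python
from collections import namedtuple

# Hypothetical stand-ins so the sketch runs without textacy or a spaCy model.
# Assumption: the real triple fields expose `.text` in the same way.
Tok = namedtuple("Tok", "text")
DQTriple = namedtuple("DQTriple", "speaker cue content")


def triples_to_rows(triples, meta):
    """Flatten the quotation triples of one document into dict rows,
    copying that document's metadata into every row."""
    rows = []
    for t in triples:
        rows.append({
            **meta,
            "speaker": " ".join(tok.text for tok in t.speaker),
            "cue": " ".join(tok.text for tok in t.cue),
            # Drop the surrounding quote marks and a trailing comma or period
            # to match the desired "content" column.
            "content": t.content.text.strip('"').rstrip(",."),
        })
    return rows


# Hand-built triples mirroring the extraction output shown in the question.
extracted = [
    ([DQTriple([Tok("John")], [Tok("said")], Tok('"Hello, nice to meet you,"')),
      DQTriple([Tok("Jane")], [Tok("said")], Tok('"It is nice to meet you as well."'))],
     {"url": "example1.com", "date": "Jan 1"}),
    ([DQTriple([Tok("John")], [Tok("said")], Tok('"Hello, nice to meet you,"'))],
     {"url": "example2.com", "date": "Jan 2"}),
]

rows = [row for triples, meta in extracted for row in triples_to_rows(triples, meta)]
# With pandas installed, the table would then be: df = pd.DataFrame(rows)
```

With the real pipeline, the same helper could presumably be called inside the existing loop, unpacking each record as `text, meta` and passing `textacy.extract.triples.direct_quotations(doc)` together with `meta`. Because every row is a plain dict, `pd.DataFrame(rows)` should yield the columns `url`, `date`, `speaker`, `cue`, `content` without a separate `explode` step.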