Splitting a text DF into a sentence-level DF: how do I create a longer pandas DataFrame with apply and a lambda?



This question may look long, but I promise it really isn't complicated.

I have a DF with blocks of text and a few ID columns. I want to create a new DF that contains each sentence as its own row.

import pandas as pd

original_df = pd.DataFrame(data={"year": [2018, 2019], "text_nr": [1, 2], "text": ["This is one sentence. This is another!", "Please help me. I am lost. "]})
original_df
>>>
year  text_nr  text
0  2018  1        "This is one sentence. This is another!"
1  2019  2        "Please help me. I am lost."

I want to use spaCy to split each block of text into its individual sentences and create a new DF that looks like this:

sentences_df
>>>
year  text_nr  sent_nr sentence
0  2018      1       1   "This is one sentence."
1  2018      1       2   "This is another!"
2  2019      2       1   "Please help me."
3  2019      2       2   "I am lost."

I found one way to do it:

import spacy

nlp = spacy.load("en_core_web_sm")
sentences_list = []
for i, row in original_df.iterrows():
    doc = nlp(row["text"])
    sentences = [(row["year"], row["text_nr"], str(i+1), sent.string.replace('\n', '').replace('\t', '').strip())
                 for i, sent in enumerate(doc.sents)]
    sentences_list = sentences_list + sentences
sentences_df = pd.DataFrame(sentences_list, columns=["year", "text_nr", "sent_nr", "sentence"])

But it is not very elegant, and I have read that the df.apply(lambda ...) approach is supposed to be much faster. However, whenever I try it I never get the right result. I tried the following two approaches:

  1. First attempt:
nlp = spacy.load("en_core_web_sm")

def sentencizer(x, nlp_model):
    sentences = {}
    doc = nlp_model(x["text"])
    for i, sent in enumerate(doc.sents):
        sentences["year"] = x["year"]
        sentences["text_nr"] = x["text_nr"]
        sentences["sent_nr"] = str(i+1)
        sentences["sentence"] = sent.string.replace('\n', '').replace('\t', '').strip()
    return sentences

sentences_df = original_df.head().apply(lambda x: pd.Series(sentencizer(x, nlp)), axis=1)

This only gets me the last sentence (each loop iteration overwrites the same dictionary keys):

sentences_df
>>>
year  text_nr sent_nr  sentence
0  2018        1       2  "This is another!"
1  2019        2       2  "I am lost."
  2. Second attempt:
nlp = spacy.load("en_core_web_sm")

def sentencizer(x, nlp_model):
    sentences = {"year": [], "text_nr": [], "sent_nr": [], "sentence": []}
    doc = nlp_model(x["text"])
    for i, sent in enumerate(doc.sents):
        sentences["year"].append(x["year"])
        sentences["text_nr"].append(x["text_nr"])
        sentences["sent_nr"].append(str(i+1))
        sentences["sentence"].append(sent.string.replace('\n', '').replace('\t', '').strip())
    return sentences

sentences_df = original_df.apply(lambda x: pd.Series(sentencizer(x, nlp)), axis=1)

This gives me a DF with lists as entries:

sentences_df
>>>
year          text_nr sent_nr    sentence
0  [2018, 2018]  [1, 1]  [1, 2]  ["This is one sentence.", "This is another!"]
1  [2019, 2019]  [2, 2]  [1, 2]  ["Please help me.", "I am lost."]
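
A list-valued frame like that could presumably be flattened afterwards; a minimal sketch, assuming pandas >= 0.25 so that Series.explode is available (every column holds equal-length lists per row, so exploding them column by column keeps the values aligned):

# Sketch only: flatten the list-valued result of the second attempt
flat_df = sentences_df.apply(pd.Series.explode).reset_index(drop=True)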

I could probably expand that last df, but I believe there must be a way to get the correct result in one go. I want to use spaCy to split the text because its sentence boundary detection is more advanced than plain regex/string splitting. You don't need to download spaCy to help me (string.split() is fine for the dummy data here). I just need a logic that does the same as the following, so that I can rewrite it to work with spaCy.

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.\n This is another! ")
sentences = [sent.string.strip() for sent in doc.sents]  # doc.sents is a generator
sentences
>>>
["This is a sentence", "This is another!"]

So something along these lines would be great:

text = "This is a sentence.n This is another! "
sentences = [sent.replace("n","").strip() for sent in text.split(".")]
sentences
>>>
["This is a sentence", "This is another!"]

Any help is much appreciated. I am quite new to programming, so please have mercy :)

Found a solution that works:

import numpy as np

nlp = spacy.load("en_core_web_sm")
rows_list = []  # collects one dict per sentence

def splitter(x, nlp):
    doc = nlp(x["text"])
    a = [str(sent) for sent in doc.sents]
    b = len(a)
    dictionary = {"text_nr": np.repeat(x["text_nr"], b), "sentence_nr": list(range(1, b + 1)), "sentence": a}
    dictionaries = [{key: value[i] for key, value in dictionary.items()} for i in range(b)]
    for dictionary in dictionaries:
        rows_list.append(dictionary)

original_df.apply(lambda x: splitter(x, nlp), axis=1)
new_df = pd.DataFrame(rows_list, columns=['text_nr', 'sentence_nr', 'sentence'])
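
An equivalent variant that avoids appending to the global rows_list as a side effect would be to build one small DataFrame per text and concatenate the pieces. This is only a sketch (splitter_df is a made-up helper name, the explicit comprehension replaces the apply call, and it also carries the year column through to match the desired output), not benchmarked:

def splitter_df(x, nlp):
    # One small frame per text block: one row per sentence.
    sents = [str(sent) for sent in nlp(x["text"]).sents]
    n = len(sents)
    return pd.DataFrame({
        "year": np.repeat(x["year"], n),
        "text_nr": np.repeat(x["text_nr"], n),
        "sent_nr": list(range(1, n + 1)),
        "sentence": sents,
    })

# Concatenate the per-text frames into the final sentence-level DF.
sentences_df = pd.concat(
    [splitter_df(row, nlp) for _, row in original_df.iterrows()],
    ignore_index=True,
)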

Something along these lines might work:

# update the punctuation list if needed
punctuations = '.!?'

(original_df.drop('text', axis=1)
 .merge(original_df.text
        .str.extractall(f'(?P<sentence>[^{punctuations}]+[{punctuations}])\s?')
        .reset_index('match'),
        left_index=True, right_index=True, how='left')
)

Output:

year  text_nr  match               sentence
0  2018        1      0  This is one sentence.
0  2018        1      1       This is another!
1  2019        2      0        Please help me.
1  2019        2      1             I am lost.
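
If you also want the 1-based sent_nr column from your expected output, the 0-based match counter produced by extractall can be shifted and renamed. A sketch building on the same chain (only tried against the dummy data):

sentences_df = (
    original_df.drop('text', axis=1)
    .merge(original_df.text
           .str.extractall(f'(?P<sentence>[^{punctuations}]+[{punctuations}])\s?')
           .reset_index('match'),
           left_index=True, right_index=True, how='left')
    .assign(sent_nr=lambda d: d['match'] + 1)   # extractall's match counter starts at 0
    .drop(columns='match')
    .reset_index(drop=True)
    [['year', 'text_nr', 'sent_nr', 'sentence']]
)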
