Rearrange text blocks so that each block ends with a complete sentence



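I have three text blocks (actually many more...), each containing part of a complete text. However, the original text was not partitioned correctly, because some sentences are split across two text blocks.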

text1 = ("We will talk about data about model specification parameter "
         "estimation and model application and the context where we will apply "
         "the simple example.Is an application where we would like to analyze "
         "the market for electric cars because")
text2 = ("we are interested in the market of electric cars.The choice "
         "that we are interested in is the choice of each individual to "
         "purchase an electric car or not And we will see how")
text3 = "to address this question. Furthermore, it needs to be noted that this is only a model text and there is no content associated with it. "

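For example, text2 starts with "we are interested in the market of electric cars". This is an incomplete first sentence that actually began in text block 1 (see its last sentence).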

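I want to make sure that each text block ends with a complete sentence. So I want to move the incomplete first sentence of a block to the end of the preceding text block. For example, the result would be: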

text1corr = ("We will talk about data about model specification parameter "
             "estimation and model application and the context where we will apply "
             "the simple example.Is an application where we would like to analyze "
             "the market for electric cars because we are interested in the market of electric cars.")
text2corr = "The choice that we are interested in is the choice of each individual to purchase an electric car or not And we will see how to address this question."
text3corr = "Furthermore, it needs to be noted that this is only a model text and there is no content associated with it. "

How can I do this in Python? Is it even possible?

You can use the function zip_longest() to iterate over pairs of consecutive strings:

from itertools import zip_longest
import re

# text1, text2 and text3 are assumed to be plain strings
l = [text1, text2, text3]
new_l = []
# pair each block with its successor; the last block is paired with ''
for i, j in zip_longest(l, l[1:], fillvalue=''):
    # remove leading and trailing spaces
    i, j = i.strip(), j.strip()
    # remove leading half sentence (it was appended to the previous block)
    if i and i[0].islower():
        i = re.split(r'[.?!]', i, maxsplit=1)[-1].lstrip()
    # append half sentence from next string
    if i and i[-1].isalpha():
        j = re.split(r'[.?!]', j, maxsplit=1)[0]
        i = f"{i} {j}."
    new_l.append(i)
for i in new_l:
    print(i)

Output:

We will talk about data about model specification parameter estimation and model application and the context where we will apply the simple example.Is an application where we would like to analyze the market for electric cars because we are interested in the market of electric cars.
The choice that we are interested in is the choice of each individual to purchase an electric car or not And we will see how to address this question.
Furthermore, it needs to be noted that this is only a model text and there is no content associated with it.
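
To make the pairing step easier to follow, here is a minimal sketch of what zip_longest() yields when a list is zipped with its own tail. The list `blocks` and its contents are made-up placeholders, not data from the question.

```python
from itertools import zip_longest

# Placeholder data, for illustration only
blocks = ["first", "second", "third"]

# Each element is paired with its successor; the shorter iterable
# (blocks[1:]) is padded with fillvalue, so the last block gets ''
pairs = list(zip_longest(blocks, blocks[1:], fillvalue=''))

for current, following in pairs:
    print(repr(current), repr(following))
```

The empty fill value means the last block has nothing to borrow from, so no special case is needed for it inside the loop.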

Alternatively, loop over the list with an index and carry the leading fragment of each following block back to the current one:

text1 = ("We will talk about data about model specification parameter "
         "estimation and model application and the context where we will apply "
         "the simple example.Is an application where we would like to analyze "
         "the market for electric cars because")
text2 = ("we are interested in the market of electric cars.The choice "
         "that we are interested in is the choice of each individual to "
         "purchase an electric car or not And we will see how")
text3 = "to address this question. Furthermore, it needs to be noted that this is only a model text and there is no content associated with it. "

textList = [text1, text2, text3]

corrected_list = []
prev_incomplete_sentence = ''
for index, text in enumerate(textList):
    if prev_incomplete_sentence:
        # skip the fragment already moved to the previous block, plus its period
        corrected_text = text[len(prev_incomplete_sentence) + 1:]
    else:
        corrected_text = text
    if index + 1 < len(textList):
        # the text before the first period of the next block completes this block
        prev_incomplete_sentence = textList[index + 1].split('.')[0]
        corrected_text += ' ' + prev_incomplete_sentence
    corrected_list.append(corrected_text)

Output:

['We will talk about data about model specification parameter estimation and model application and the context where we will apply the simple example.Is an application where we would like to analyze the market for electric cars because we are interested in the market of electric cars',
 'The choice that we are interested in is the choice of each individual to purchase an electric car or not And we will see how to address this question',
 ' Furthermore, it needs to be noted that this is only a model text and there is no content associated with it. ']
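
Both snippets above assume the moved fragment ends in a period: the first appends a hard-coded ".", and the second drops the terminator altogether. As a rough sketch only, the same idea can be wrapped in a function that moves the fragment together with whatever terminator it actually has. The name `rebalance` and its behavior for blocks without any terminator are assumptions made for this illustration, not part of either answer.

```python
import re

# Hypothetical helper, written for illustration only
TERMINATOR = re.compile(r'[.?!]')

def rebalance(blocks):
    """Return new blocks in which each block ends with a complete sentence
    whenever a later block supplies the missing ending."""
    blocks = [b.strip() for b in blocks]
    result = []
    for idx, block in enumerate(blocks):
        nxt = idx + 1
        # Borrow text from later blocks until this block ends with . ? or !
        while block and block[-1] not in '.?!' and nxt < len(blocks):
            match = TERMINATOR.search(blocks[nxt])
            if match is None:
                # The whole next block is a fragment: take all of it
                fragment, blocks[nxt] = blocks[nxt], ''
                nxt += 1
            else:
                # Take everything up to and including the first terminator
                fragment = blocks[nxt][:match.end()]
                blocks[nxt] = blocks[nxt][match.end():].lstrip()
            block = f"{block} {fragment}".strip()
        result.append(block)
    return result

print(rebalance(["A b", "c d? E f", "g. H i."]))
```

A block that is emptied this way stays in the result as an empty string, so the number of blocks is preserved. Like the answers above, this treats every ".", "?" or "!" as a sentence end, so abbreviations and decimal numbers would be split incorrectly.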
