如何提取带有关键字或子字符串的句子前后的句子



我想创建一个函数,提取包含感兴趣的关键字或子字符串的句子,以及前后的句子。如果可能的话,我想具体说明我想摘录的句子数量。如果关键句子是第一句话,我希望函数不会返回错误。

在下面的例子中,我创建了一个提取单个句子的函数。如何扩展以包含更多句子?


data = [[0, 'Johannes Gensfleisch zur Laden zum Gutenberg was a German inventor, printer, publisher, and goldsmith who introduced printing to Europe with his mechanical movable-type printing press. His work started the Printing Revolution in Europe and is regarded as a milestone of the second millennium, ushering in the modern period of human history. It played a key role in the development of the Renaissance, Reformation, Age of Enlightenment, and Scientific Revolution, as well as laying the material basis for the modern knowledge-based economy and the spread of learning to the masses.'], 
[1, 'While not the first to use movable type in the world,[a] Gutenberg was the first European to do so. His many contributions to printing include the invention of a process for mass-producing movable type; the use of oil-based ink for printing books;[7] adjustable molds;[8] mechanical movable type; and the use of a wooden printing press similar to the agricultural screw presses of the period.[9] His truly epochal invention was the combination of these elements into a practical system that allowed the mass production of printed books and was economically viable for printers and readers alike. Gutenbergs method for making type is traditionally considered to have included a type metal alloy and a hand mould for casting type. The alloy was a mixture of lead, tin, and antimony that melted at a relatively low temperature for faster and more economical casting, cast well, and created a durable type.'], 
[2, 'The use of movable type was a marked improvement on the handwritten manuscript, which was the existing method of book production in Europe, and upon woodblock printing, and revolutionized European book-making. Gutenbergs printing technology spread rapidly throughout Europe and later the world. His major work, the Gutenberg Bible (also known as the 42-line Bible), was the first printed version of the Bible and has been acclaimed for its high aesthetic and technical quality. In Renaissance Europe, the arrival of mechanical movable type printing introduced the era of mass communication which permanently altered the structure of society. The relatively unrestricted circulation of information—including revolutionary ideas—transcended borders, captured the masses in the Reformation, and threatened the power of political and religious authorities; the sharp increase in literacy broke the monopoly of the literate elite on education and learning and bolstered the emerging middle class. Across Europe, the increasing cultural self-awareness of its people led to the rise of proto-nationalism, accelerated by the flowering of the European vernacular languages to the detriment of Latins status as lingua franca. In the 19th century, the replacement of the hand-operated Gutenberg-style press by steam-powered rotary presses allowed printing on an industrial scale, while Western-style printing was adopted all over the world, becoming practically the sole medium for modern bulk printing. ']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['text_number', 'text'])
def extract_key_sentences(text,word_list):
joined_word_list = '|'.join(word_list)
print(joined_word_list)
sentence = re.findall(r'([^.]*?'+joined_word_list+'[^.]*.)', text)
return sentence
tools_list=['printing press','paper','ink','woodblock','molds','method']
df['key_sentence']=df['text'].apply(lambda x : extract_key_sentences(str(x),tools_list))
df['key_sentence'].head()

按照目前的方法,似乎没有提取完整的句子,见第3行,文本以"书籍制作方法"开头。

IIUC,您可以split您的字符串和explode每行有一个句子,识别与关键字匹配的句子,并使用groupby.rolling.max传播到相邻的句子。

然后聚合为单个字符串(可选(:

word_list=['printing press','paper','ink','woodblock','molds','method']
joined_word_list = '|'.join(map(re.escape, word_list))
N = 3
df[f'{N}_around'] = (df['text']
.str.split('(?<=.)s*').explode()
.loc[lambda d: 
d.str.contains(joined_word_list)
.groupby(level=0).rolling(2*N+1, center=True, min_periods=1).max()
.droplevel(1).astype(bool)
]
.groupby(level=0).agg(' '.join)
)

在这种特殊的情况下,对于第三行,它只匹配第一个句子,并传播以保留下面的3个句子,去掉剩下的4个。

输出:

text_number                                               text  
0            0  Johannes Gensfleisch zur Laden zum Gutenberg w...   
1            1  While not the first to use movable type in the...   
2            2  The use of movable type was a marked improveme...   
3_around  
0  Johannes Gensfleisch zur Laden zum Gutenberg w...  
1  While not the first to use movable type in the...  
2  The use of movable type was a marked improveme...  

相关内容

最新更新