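Get the n words before a character index in a string without cutting a word

I have a string and a character position in that string. If the position falls in the middle of a word, I want to get the n words before that position without including that last, partial word.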

text = 'the house is big the house is big the house is big'
char_nr = 19
list_of_words_before = text[:char_nr-1].split()
print(list_of_words_before)  # the string is split inside "the", hence the unwanted 't' in the list
nr_words = 3
if nr_words > len(list_of_words_before):
    nr_words = len(list_of_words_before)

print(list_of_words_before[-nr_words:])

This gives:

['the', 'house', 'is', 'big', 't']
['is', 'big', 't']

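But what I actually want is ['house', 'is', 'big'], since 't' is only part of a word.

How would you make sure the split happens only at the spaces between words? Is there another solution?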

Using a regular expression:

>>> import re
>>> text = 'the house is big the house is big the house is big'
>>> result = re.match(r".{0,18}\b", text).group(0).split()
>>> result
['the', 'house', 'is', 'big']
>>> result[-3:]
['house', 'is', 'big']

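Explanation:

. matches any character
{0,18} matches the preceding token (.) between 0 and 18 times, as many times as possible
\b requires the match to end at the start or end of a word, so we don't get a partial word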
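The explanation above can be turned into a small reusable function. This is only a sketch: the name `words_before_boundary` is hypothetical, it keeps the question's convention of looking at `text[:char_nr - 1]`, and it adds `re.DOTALL` (an assumption, so that `.` also crosses newlines). Note that `\b` treats punctuation such as an apostrophe as a boundary too, so this is closest to the desired behavior when words are separated by whitespace only.

```python
import re

def words_before_boundary(text, char_nr, nr_words):
    """Hypothetical helper: up to nr_words whole words from text[:char_nr - 1]."""
    limit = max(char_nr - 1, 0)  # same prefix length as the slice in the question
    # Greedy prefix of at most `limit` characters that must end at a word boundary
    match = re.match(rf".{{0,{limit}}}\b", text, flags=re.DOTALL)
    if match is None:  # e.g. empty text: there is no word boundary to stop at
        return []
    words = match.group(0).split()
    # words[-0:] would return the whole list, so handle nr_words <= 0 explicitly
    return words[-nr_words:] if nr_words > 0 else []

text = 'the house is big the house is big the house is big'
print(words_before_boundary(text, 19, 3))
```

Slicing with `words[-nr_words:]` already copes with `nr_words` being larger than the list, so no explicit length check is needed.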

Maybe something like this:

text = 'the house is big the house is big the house is big'
char_nr = 19
list_of_words_before = text[:char_nr - 1]
splitted = list_of_words_before.split()
# if the prefix does not end with a space, drop its last word
if list_of_words_before[-1] != ' ':
    splitted = splitted[:-1]
nr_words = 3
print(splitted[-nr_words:])

Output:

['house', 'is', 'big']

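You can check the character at char_nr: if it is a non-word character, the split is correct; otherwise you need to remove the last item from the list. Assuming " " is the only character that can appear between words:

if text[char_nr] != " ":
    list_of_words_before = list_of_words_before[:-1]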
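The same idea can be sketched as a self-contained function without regular expressions. Several things here are assumptions rather than part of the answer: the name `last_words_before` is hypothetical, any whitespace (not only " ") counts as a separator, and the check looks at `text[char_nr - 1]`, which is the first character excluded by the question's slice `text[:char_nr - 1]`. It also guards against positions at or beyond the end of the string.

```python
def last_words_before(text, char_nr, nr_words):
    """Hypothetical helper: up to nr_words whole words from text[:char_nr - 1]."""
    end = max(char_nr - 1, 0)  # index of the first character left out of the slice
    words = text[:end].split()
    # The last word is partial only if the cut falls between two non-whitespace characters
    cut_mid_word = (
        0 < end < len(text)
        and not text[end - 1].isspace()
        and not text[end].isspace()
    )
    if cut_mid_word:
        words = words[:-1]
    # words[-0:] would return the whole list, so handle nr_words <= 0 explicitly
    return words[-nr_words:] if nr_words > 0 else []

text = 'the house is big the house is big the house is big'
print(last_words_before(text, 19, 3))
```

Checking both sides of the cut means a prefix that ends in a space, or exactly at the end of a word, keeps its last word.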

I think this is what you want:

def get_n_words(text, char_nr, nr_words):
    # text[char_nr-1] is the first character left out of the slice
    if text[char_nr-1] == " ":
        list_of_words_before = text[:char_nr-1].split()
    else:
        # otherwise assume the last word was cut and drop it
        list_of_words_before = text[:char_nr-1].split()[:-1]
    print(list_of_words_before)
    if nr_words > len(list_of_words_before):
        nr_words = len(list_of_words_before)

    print(list_of_words_before[-nr_words:])

text_1 = 'the house is big the house is big the house is big'
text_2 = 'the house is big a house is big the house is big'
print("Last word truncated:")
get_n_words(text_1, 19, 3)
print("\nLast word not truncated:")
get_n_words(text_2, 19, 3)

It has the following output:

Last word truncated:
['the', 'house', 'is', 'big']
['house', 'is', 'big']

Last word not truncated:
['the', 'house', 'is', 'big', 'a']
['is', 'big', 'a']

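You can use a pattern that starts the match at a non-whitespace character with \S, then matches any character 0 to 18 times with .{0,18}, and uses the negative lookahead (?!\S) to assert that no non-whitespace character follows on the right:

\S.{0,18}(?!\S)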

import re
text = 'the house is big the house is big the house is big'
char_nr = 19
# builds \S.{0,18}(?!\S); the doubled braces are literal braces in the f-string
pattern = rf"\S.{{0,{char_nr - 1}}}(?!\S)"
strings = re.findall(pattern, text)
print(strings)
# the first match is the one that starts at the first word of the text
list_of_words_before = strings[0].split()
print(list_of_words_before)
nr_words = 3
lenOfWordsBefore = len(list_of_words_before)
if nr_words > lenOfWordsBefore:
    nr_words = lenOfWordsBefore
print(list_of_words_before[-nr_words:])

Output:

['the house is big', 'the house is big', 'the house is big']
['the', 'house', 'is', 'big']
['house', 'is', 'big']
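The two regex approaches in this thread differ in what they treat as the end of a word: `\b` stops wherever a word character meets a non-word character, while `(?!\S)` only accepts whitespace or the end of the text. The sketch below illustrates this with a made-up input containing an apostrophe; the helper `prefix_words` and the sample string are hypothetical and exist only for the comparison.

```python
import re

def prefix_words(text, limit, end_assertion):
    """Hypothetical helper: split the longest prefix of at most `limit`
    characters whose end satisfies the given zero-width assertion."""
    match = re.match(rf".{{0,{limit}}}{end_assertion}", text, flags=re.DOTALL)
    return match.group(0).split() if match else []

sample = "it isn't big"  # made-up input; the cut at 7 characters falls inside "isn't"
by_boundary = prefix_words(sample, 7, r"\b")        # \b also matches next to the apostrophe
by_whitespace = prefix_words(sample, 7, r"(?!\S)")  # only whitespace or end of text qualifies
print(by_boundary)    # ['it', "isn'"]
print(by_whitespace)  # ['it']
```

If the text can contain apostrophes, hyphens, or other punctuation inside words, the whitespace-based assertion is likely the closer fit to "never return a partial word".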
