Mapping each word in a string to its start and end indices in a dictionary



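I am trying to find the index range (start and end index) of every word in a string, ignoring spaces and counting from 1 so the indices are human-readable. I figured the best approach was a list of lists, where each nested list holds the word and a list of its start and end indices. From a sample string I get the following list: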

text = "i have a list of lists that contain a word and there indices my method works except with repeated words like of or a or the or it"

which yields:

boundaries_list=[['i', [1, 1]], ['have', [3, 6]], ['a', [4, 4]], ['list', [10, 13]], ['of', [15, 16]], ['lists', [18, 22]], ['that', [24, 27]], ['contain', [29, 35]], ['a', [4, 4]], ['word', [39, 42]], ['and', [44, 46]], ['there', [48, 52]], ['indices', [54, 60]], ['my', [62, 63]], ['method', [65, 70]], ['works', [72, 76]], ['except', [78, 83]], ['with', [85, 88]], ['repeated', [90, 97]], ['words', [99, 103]], ['like', [105, 108]], ['of', [15, 16]], ['or', [40, 41]], ['a', [4, 4]], ['or', [40, 41]], ['the', [48, 50]], ['or', [40, 41]], ['it', [86, 87]]]

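This works, but it is not very readable. Turning it into a dictionary would be nice. Dictionaries are useful unless you have multiple identical keys. In my case that means only one occurrence of a repeated word would appear in the dictionary, dropping the index ranges of every other occurrence of that word.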
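To make the key collision concrete, here is a minimal sketch with made-up pairs (the name pairs and its values are illustrative, not taken from the real data). With plain assignment, the value stored last for a repeated key is the one that remains:

```python
# Hypothetical [word, [start, end]] pairs; "a" appears twice.
pairs = [["a", [8, 8]], ["of", [15, 16]], ["a", [37, 37]]]

as_dict = {}
for word, span in pairs:
    as_dict[word] = span  # a repeated key overwrites the earlier value

print(as_dict)  # {'a': [37, 37], 'of': [15, 16]}
```

Only one range per word remains, which is why a mapping from each word to a list of ranges is the natural fit here.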

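To get around this, I tried using a defaultdict over a list of dictionaries, but that only gives me the index range of the first occurrence, repeated n times for a word that occurs n times.

For example: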

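from collections import defaultdict

new_list = []  # one single-key dict per [word, [start, end]] entry
for one_d in boundaries_list:
    nested_list_to_nested_dict = {one_d[0]: one_d[1]}
    new_list.append(nested_list_to_nested_dict)

# Group the ranges by word
res = defaultdict(list)
for d in new_list:
    for k, v in d.items():
        res[k].append(v)
print(res)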
>>> defaultdict(<class 'list'>, {'i': [[1, 1]], 'have': [[3, 6]], 'a': [[4, 4], [4, 4], [4, 4]], 'list': [[10, 13]], 'of': [[15, 16], [15, 16]], 'lists': [[18, 22]], 'that': [[24, 27]], 'contain': [[29, 35]], 'word': [[39, 42]], 'and': [[44, 46]], 'there': [[48, 52]], 'indices': [[54, 60]], 'my': [[62, 63]], 'method': [[65, 70]], 'works': [[72, 76]], 'except': [[78, 83]], 'with': [[85, 88]], 'repeated': [[90, 97]], 'words': [[99, 103]], 'like': [[105, 108]], 'or': [[40, 41], [40, 41], [40, 41]], 'the': [[48, 50]], 'it': [[86, 87]]})

Any help is much appreciated.
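The ranges in boundaries_list look like the result of a substring search that always starts at the beginning of the string: 'a' maps to [4, 4], which is the a inside have, and 'or' maps to [40, 41], which sits inside word. The code that built the list is not shown, so this is an assumption. As a minimal sketch under that assumption, passing a moving start position to str.find keeps each lookup from matching earlier text. The names word_spans and cursor are illustrative.

```python
from collections import defaultdict

def word_spans(text):
    """Map each word to a list of 1-based, inclusive [start, end] ranges."""
    spans = defaultdict(list)
    cursor = 0  # 0-based position just past the previous word
    for word in text.split():
        # Search only from the cursor onward, so earlier text is never matched.
        start = text.find(word, cursor)
        end = start + len(word)  # exclusive 0-based end == inclusive 1-based end
        spans[word].append([start + 1, end])
        cursor = end
    return spans

spans = word_spans("a cat or a word")
print(dict(spans))
```

Because only whitespace lies between the cursor and the next word, the first match at or after the cursor is always the word itself, not a substring of another word.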

You can use re, with the start() and end() methods of the match objects:

import re
from collections import defaultdict
text = "i have a list of lists that contain a word and there indices my method works except with repeated words like of or a or the or it"
output = defaultdict(list)
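for m in re.finditer(r"\S+", text):  # each run of non-whitespace characters is a word
    # m.start() is 0-based, so add 1; m.end() is exclusive, so it already equals the 1-based inclusive end
    output[m.group(0)].append((m.start(0)+1, m.end(0)))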
print(output)
# defaultdict(<class 'list'>, {'i': [(1, 1)], 'have': [(3, 6)], 'a': [(8, 8), (37, 37), (116, 116)], 'list': [(10, 13)], 'of': [(15, 16), (110, 111)], 'lists': [(18, 22)], 'that': [(24, 27)], 'contain': [(29, 35)], 'word': [(39, 42)], 'and': [(44, 46)], 'there': [(48, 52)], 'indices': [(54, 60)], 'my': [(62, 63)], 'method': [(65, 70)], 'works': [(72, 76)], 'except': [(78, 83)], 'with': [(85, 88)], 'repeated': [(90, 97)], 'words': [(99, 103)], 'like': [(105, 108)], 'or': [(113, 114), (118, 119), (125, 126)], 'the': [(121, 123)], 'it': [(128, 129)]})
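As a quick sanity check on the 1-based, inclusive convention used above, a range (start, end) should map back to the slice text[start - 1:end]. The sketch below repeats the same regex approach on a shorter, made-up string (the string and the name spans are illustrative) and verifies every range by slicing:

```python
import re
from collections import defaultdict

text = "a word or a the"  # short illustrative input
spans = defaultdict(list)
for m in re.finditer(r"\S+", text):
    spans[m.group(0)].append((m.start() + 1, m.end()))

# Convert each 1-based inclusive range back to a slice and compare.
for word, ranges in spans.items():
    for start, end in ranges:
        assert text[start - 1:end] == word

print(dict(spans))
```

Note that 'or' is reported at (8, 9), its own position, and not at the or inside word.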

I added a double space, just to test:

text = "i have a  list of lists that contain a word and there indices my method works except with repeated words like of or a or the or it"
from collections import defaultdict
new_dict = defaultdict(list)
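offset = 0  # 0-based index where the current word starts
for word in text.split(" "):
    # Store [start, end) so that text[start:end] == word
    new_dict[word].append([offset, offset+len(word)])
    offset += len(word) + 1  # skip past the word and the single space after it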
new_dict

Output:

defaultdict(list,
            {'i': [[0, 1]],
             'have': [[2, 6]],
             'a': [[7, 8], [37, 38], [116, 117]],
             '': [[9, 9]],
             'list': [[10, 14]],
             'of': [[15, 17], [110, 112]],
             'lists': [[18, 23]],
             'that': [[24, 28]],
             'contain': [[29, 36]],
             'word': [[39, 43]],
             'and': [[44, 47]],
             'there': [[48, 53]],
             'indices': [[54, 61]],
             'my': [[62, 64]],
             'method': [[65, 71]],
             'works': [[72, 77]],
             'except': [[78, 84]],
             'with': [[85, 89]],
             'repeated': [[90, 98]],
             'words': [[99, 104]],
             'like': [[105, 109]],
             'or': [[113, 115], [118, 120], [125, 127]],
             'the': [[121, 124]],
             'it': [[128, 130]]})

The indices in the dict are exactly the start and end positions for string slicing. For example: text[128:130] == 'it'
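With a double space, split(" ") also produces an empty string, which is where the '' key in the output comes from. If that entry is unwanted, one possible variant (a sketch, not part of the answer above; the name slice_spans is illustrative) skips empty tokens while still advancing the offset. It assumes words are separated by space characters only, not tabs or newlines:

```python
from collections import defaultdict

def slice_spans(text):
    """Map each word to a list of 0-based [start, end) slice bounds."""
    spans = defaultdict(list)
    offset = 0
    for word in text.split(" "):
        if word:  # an empty token comes from consecutive spaces; do not record it
            spans[word].append([offset, offset + len(word)])
        offset += len(word) + 1  # still advance past the token and one space
    return spans

text = "a  list of a"  # note the double space
spans = slice_spans(text)
print(dict(spans))
```

Every stored pair can be used directly as slice bounds, so text[start:end] gives back the word.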
