This is related to the following question: Searching for Unicode characters in Python
I have a string like this:
sentence = 'AASFG BBBSDC FEKGG SDFGF'
I split it and get a list of words like this:
sentence = ['AASFG', 'BBBSDC', 'FEKGG', 'SDFGF']
I use the following code to search for part of a word and get the whole word back:
[word for word in sentence.split() if word.endswith("GG")]
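This returns ['FEKGG'].
Now I need to find the words before and after that word.
For example, when I search for "GG" it returns ['FEKGG']. In addition, it should be able to get: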
behind = 'BBBSDC'
infront = 'SDFGF'
Use the neighborhood generator below. Suppose you have the following string (edited from the original):
sentence = 'AASFG BBBSDC FEKGG SDFGF KETGG'
def neighborhood(iterable):
    """Yield (previous, current, following) triples; a missing neighbour is None."""
    iterator = iter(iterable)
    prev = None
    try:
        item = next(iterator)
    except StopIteration:
        return  # empty input: yield nothing
    for following in iterator:
        yield (prev, item, following)
        prev = item
        item = following
    yield (prev, item, None)

matches = [word for word in sentence.split() if word.endswith("GG")]
results = []
for prev, item, following in neighborhood(sentence.split()):
    # A membership test records each occurrence once, even for repeated words.
    if item in matches:
        results.append((prev, item, following))
This returns:
[('BBBSDC', 'FEKGG', 'SDFGF'), ('SDFGF', 'KETGG', None)]
Here is one possibility:
words = sentence.split()
[pos] = [i for (i, word) in enumerate(words) if word.endswith("GG") ]
behind = words[pos - 1]
infront = words[pos + 1]
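You may need to watch for edge cases, such as "…GG" not appearing, appearing more than once, or being the first and/or last word. As written, a missing or repeated match raises ValueError at the [pos] unpacking, and a match in the last position raises IndexError, which may well be the correct behavior. A match in the first position raises nothing, though: words[pos - 1] becomes words[-1] and silently returns the last word.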
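The edge cases above can be handled explicitly. The following is only a minimal sketch, assuming you want None for a missing neighbour and one result per matching word; the function name neighbours_of and its parameters are hypothetical and introduced purely for illustration.

```python
def neighbours_of(sentence, suffix):
    """Return a (behind, word, infront) tuple for every word ending in suffix."""
    words = sentence.split()
    results = []
    for pos, word in enumerate(words):
        if not word.endswith(suffix):
            continue
        # Guard both ends explicitly: words[pos - 1] with pos == 0 would wrap around.
        behind = words[pos - 1] if pos > 0 else None
        infront = words[pos + 1] if pos + 1 < len(words) else None
        results.append((behind, word, infront))
    return results


print(neighbours_of('AASFG BBBSDC FEKGG SDFGF', 'GG'))
# [('BBBSDC', 'FEKGG', 'SDFGF')]
```

An empty list means no word matched, so the caller can decide whether that should be an error.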
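A completely different solution using regular expressions avoids splitting the string into a list in the first place:
import re
match = re.search(r'\b(\w+)\s+(?:\w+GG)\s+(\w+)\b', sentence)
(behind, infront) = match.groups()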
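re.search returns only the first match, and the pattern needs a word on both sides of the target. If several matches or a match at either end of the sentence are possible, something along these lines might work. It is a sketch under the assumption that words consist of \w characters separated by whitespace; regex_neighbours is a hypothetical name, not part of the re module.

```python
import re


def regex_neighbours(sentence, suffix):
    """Yield (behind, word, infront) for each word ending in suffix."""
    target = re.compile(r'\b\w*' + re.escape(suffix) + r'\b')
    for m in target.finditer(sentence):
        # Last word before the match, if there is one.
        before = re.search(r'(\w+)\s+$', sentence[:m.start()])
        # First word after the match, if there is one.
        after = re.match(r'\s+(\w+)', sentence[m.end():])
        yield (before.group(1) if before else None,
               m.group(0),
               after.group(1) if after else None)


print(list(regex_neighbours('AASFG BBBSDC FEKGG SDFGF KETGG', 'GG')))
# [('BBBSDC', 'FEKGG', 'SDFGF'), ('SDFGF', 'KETGG', None)]
```

The neighbours are looked up separately rather than captured in one pattern, so that two adjacent matching words are both reported. The string slicing makes this slower than a single pass on very long inputs.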
Here is one approach. If the "GG" word is at the start or end of the sentence, the element before or after it will be None.
words = sentence.split()
# Named as in the question: behind is the previous word, infront the following one.
[(behind, word, infront) for (behind, word, infront) in
    zip([None] + words[:-1], words, words[1:] + [None])
    if word.endswith("GG")]
sentence = 'AASFG BBBSDC FEKGG SDFGF AAABGG FOOO EEEGG'
def make_trigrams(words):
    # Pad both ends with None so the first and last words get a full trigram.
    padded = [None] + words + [None]
    for i in range(len(padded) - 2):
        yield (padded[i], padded[i + 1], padded[i + 2])

for behind, match, infront in make_trigrams(sentence.split()):
    if match.endswith('GG'):
        print('Behind:', behind)
        print('Match:', match)
        print('Infront:', infront, '\n')
Output:
Behind: BBBSDC
Match: FEKGG
Infront: SDFGF

Behind: SDFGF
Match: AAABGG
Infront: FOOO

Behind: FOOO
Match: EEEGG
Infront: None
Another option, based on itertools, which may be more memory-friendly on large datasets:
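from itertools import chain, tee

def sentence_targets(sentence, endstring):
    # Pad with None so the first and last words can also be targets.
    padded = chain([None], sentence.split(), [None])
    before, target, after = tee(padded, 3)
    # Offset the iterators so they line up as (previous, current, following).
    next(target, None)
    next(after, None)
    next(after, None)
    for trigram in zip(before, target, after):
        if trigram[1].endswith(endstring):
            yield trigram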
EDIT: fixed a typo.
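Note that sentence.split() still builds the complete word list before any iterator sees it. If memory really is a concern, one possible variation is to tokenize lazily and keep only a three-word window. This is a sketch, not a benchmarked solution: lazy_targets is a hypothetical name, and it assumes words are runs of non-whitespace characters.

```python
import re
from collections import deque
from itertools import chain


def lazy_targets(sentence, endstring):
    """Lazily yield (behind, word, infront) without building a word list."""
    # re.finditer hands back one word at a time instead of a full list.
    words = (m.group(0) for m in re.finditer(r'\S+', sentence))
    # A deque with maxlen=3 drops the oldest word automatically.
    window = deque(maxlen=3)
    # None padding lets the first and last words be targets too.
    for word in chain([None], words, [None]):
        window.append(word)
        if len(window) == 3 and window[1].endswith(endstring):
            yield tuple(window)


for trigram in lazy_targets('AASFG BBBSDC FEKGG SDFGF AAABGG FOOO EEEGG', 'GG'):
    print(trigram)
# ('BBBSDC', 'FEKGG', 'SDFGF')
# ('SDFGF', 'AAABGG', 'FOOO')
# ('FOOO', 'EEEGG', None)
```

The middle slot of a full window is always a real word, because None can only sit in the first or last position.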