Splitting out runs of consecutive identical characters of a given length in an array of strings



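I have an array

["ejjjjmmtthh", "zxxuueeg", "aanlljrrrxx", "dqqqaaabbb", "oocccffuucccjjjkkkjyyyeehhh"]

and need to extract, from each string element, the runs of consecutive identical characters whose length is exactly k (3 in this example), without using regex or groupby.

Here is what I have written so far: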

s = ["ejjjjmmtthh", "zxxuueeg", "aanlljrrrxx", "dqqqaaabbb", "oocccffuucccjjjkkkjyyyeehhh"]
k = 3
output = []
for i in s:
    result = ""
    for j in range(1, len(i) - 1):
        if i[j] == i[j-1] or i[j] == i[j+1]:
            result += i[j]
    if i[-1] == result[-1]:
        result += i[-1]
    if i[0] == result[0]:
        result = i[0] + result
    output.append(result)
print(output)
# current output = ['jjjjmmtthh', 'xxuuee', 'aallrrrxx', 'qqqaaabbb', 'oocccffuucccjjjkkkyyyeehhh']
# expected outcome (for k = 3) = ['rrr', 'qqq', 'aaa', 'bbb', 'ccc', 'ccc', 'jjj', 'kkk', 'yyy', 'hhh']

My questions:

  1. How do I take the k condition into account?
  2. Is there a more optimal way to do this?

This solution is easier to read and not too long. It works for k > 0.
s = ["ejjjjmmtthh", "zxxuueeg", "aanlljrrrxx", "dqqqaaabbb", "oocccffuucccjjjkkkjyyyeehhh"]
k = 3
output = []
for element in s:
    state = ""  # State variable (reset on every list item)
    for char in element:  # For each character
        if state != "" and char == state[-1]:  # Check if the last character is the same (only if state isn't empty)
            state += char  # Add it to the state
        else:
            if len(state) == k:  # Otherwise, check if we have k characters
                output.append(state)  # Append the result if we do
            state = char  # Reset the state
    # If there are no more characters (end of element), check too
    if len(state) == k:
        output.append(state)
print(output)

Output for k = 3:

['rrr', 'qqq', 'aaa', 'bbb', 'ccc', 'ccc', 'jjj', 'kkk', 'yyy', 'hhh']

Output for k = 1:

['e', 'z', 'g', 'n', 'j', 'd', 'j']
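
The same loop can be wrapped in a function so that k becomes a parameter and the k > 0 restriction is explicit. This is only a sketch: the name extract_runs and the ValueError guard are assumptions added for illustration and are not part of the answer above.

```python
def extract_runs(strings, k):
    """Collect every run of exactly k identical consecutive characters.

    Hypothetical helper: the name and the k <= 0 guard are illustrative.
    """
    if k <= 0:
        # With k == 0 the empty initial state would match, so reject it.
        raise ValueError("k must be a positive integer")
    output = []
    for element in strings:
        state = ""  # current run, reset for every string
        for char in element:
            if state and char == state[-1]:
                state += char  # the run continues
            else:
                if len(state) == k:
                    output.append(state)  # the finished run has length k
                state = char  # start a new run
        if len(state) == k:  # the last run ends with the string
            output.append(state)
    return output


s = ["ejjjjmmtthh", "zxxuueeg", "aanlljrrrxx", "dqqqaaabbb", "oocccffuucccjjjkkkjyyyeehhh"]
print(extract_runs(s, 3))
print(extract_runs(s, 1))
```

Rejecting k <= 0 up front is one possible design choice; returning an empty list instead would also be defensible.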

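Here I group consecutive identical letters by hand, then add a group to the result only if its length equals k. This works, but I'm sure there is a more optimized way: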

s = ["ejjjjmmtthh", "zxxuueeg", "aanlljrrrxx", "dqqqaaabbb", "oocccffuucccjjjkkkjyyyeehhh"]
k = 3

def _next_group(st):
    if not st:
        return None
    first = st[0]
    res = [first]
    for ch in st[1:]:
        if ch == first:
            res.append(ch)
        else:
            break
    return res

result = []
for st in s:
    while True:
        group = _next_group(st)
        if not group:
            break
        if len(group) == k:
            result.append("".join(group))
        if len(group) == len(st):
            break
        st = st[len(group):]
print(result)

Output: ['rrr', 'qqq', 'aaa', 'bbb', 'ccc', 'ccc', 'jjj', 'kkk', 'yyy', 'hhh']
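
One possible tightening, offered as a sketch rather than a definitive answer: the loop above re-slices st after every group, which copies the rest of the string each time. Tracking the start index of the current run avoids those copies. The function name runs_of_length is hypothetical.

```python
def runs_of_length(text, k):
    """Return runs of exactly k identical consecutive characters in text."""
    found = []
    start = 0  # index where the current run begins
    for i in range(1, len(text) + 1):
        # A run ends at the end of the string or where the character changes.
        if i == len(text) or text[i] != text[start]:
            if i - start == k:
                found.append(text[start:i])
            start = i
    return found


s = ["ejjjjmmtthh", "zxxuueeg", "aanlljrrrxx", "dqqqaaabbb", "oocccffuucccjjjkkkjyyyeehhh"]
result = [run for st in s for run in runs_of_length(st, 3)]
print(result)
```

Each string is scanned once and a slice is taken only for a run that is actually kept.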

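for-loop approach

Note: I suggest a divide-and-conquer approach: focus on a single string (rather than a list of strings), write a function for it, then generalize with a loop or a comprehension…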

def repeated_chars(string, k=3):
    out = []
    c, tmp = 0, ''  # counter, tmp char
    for char in string:  # iterate over the parameter, not the global s
        if tmp == '':
            tmp = char
            c += 1
            continue
        if tmp == char:
            c += 1
        else:
            if c == k:
                out.append((tmp, c))
            tmp = char
            c = 1
    # last term
    if c == k:
        out.append((tmp, c))
    return [char * i for char, i in out]

data = ['jjjjmmtthh', 'xxuuee', 'aallrrrxx', 'qqqaaabbb', 'oocccffuucccjjjkkkyyyeehhh']
# apply the function to all strings
out = []
for s in data:
    out.extend(repeated_chars(s, k=3))
print(out)
# ['rrr', 'qqq', 'aaa', 'bbb', 'ccc', 'ccc', 'jjj', 'kkk', 'yyy', 'hhh']

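Edit: yes, groupby should not be used as per the requirements, but this task requires grouping in one way or another (see the accepted answer, for example), so splitting the responsibility into several functions seems like a good idea and is good practice, here by reimplementing groupby.

groupby seems like the obvious tool in this case; if you can't use the one from itertools, write your own.

The core function should also work on a single string, not a list of strings; just add a for loop in case you have a list of strings.

Once you have your groupby, it's straightforward: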

def extract_groups(s: str, k: int):
    return [group for group in groupby(s) if len(group) == k]

Let's try it:

input_strings = [
    "ejjjjmmtthh",
    "zxxuueeg",
    "aanlljrrrxx",
    "dqqqaaabbb",
    "oocccffuucccjjjkkkjyyyeehhh",
]
expected_outputs = [
    [],
    [],
    ["rrr"],
    ["qqq", "aaa", "bbb"],
    ["ccc", "ccc", "jjj", "kkk", "yyy", "hhh"],
]
outputs = [extract_groups(s, k=3) for s in input_strings]
print(outputs == expected_outputs)  # True

Note that outputs is a list of lists of groups, one inner list per input string:

In [  ]: outputs
Out[  ]: [[], [], ['rrr'], ['qqq', 'aaa', 'bbb'], ['ccc', 'ccc', 'jjj', 'kkk', 'yyy', 'hhh']]

If you really want it flat, flatten it:

In [  ]: from itertools import chain
    ...: list(chain.from_iterable(outputs))
Out[  ]: ['rrr', 'qqq', 'aaa', 'bbb', 'ccc', 'ccc', 'jjj', 'kkk', 'yyy', 'hhh']
In [  ]: [group for s in input_strings for group in extract_groups(s, k=3)]
Out[  ]: ['rrr', 'qqq', 'aaa', 'bbb', 'ccc', 'ccc', 'jjj', 'kkk', 'yyy', 'hhh']

The groupby function, for reference:

def groupby(s: str):
    if not s:
        return []
    result = []
    tgt = s[0]
    counter = 1
    for c in s[1:]:
        if c == tgt:
            counter += 1
        else:
            result.append(tgt * counter)
            tgt = c
            counter = 1
    result.append(tgt * counter)
    return result
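
As a sanity check, the hand-written version can be compared against itertools.groupby, assuming the restriction applies only to the solution itself and not to its tests. The sketch below repeats the function under the hypothetical name my_groupby so that it does not shadow the standard one; reference_groups is likewise a name made up for this example.

```python
from itertools import groupby as std_groupby


def my_groupby(s: str):
    """Split s into its runs of identical consecutive characters."""
    if not s:
        return []
    result = []
    tgt = s[0]
    counter = 1
    for c in s[1:]:
        if c == tgt:
            counter += 1
        else:
            result.append(tgt * counter)
            tgt = c
            counter = 1
    result.append(tgt * counter)
    return result


def reference_groups(s: str):
    # itertools.groupby yields (key, iterator) pairs for consecutive equal items.
    return ["".join(group) for _, group in std_groupby(s)]


for sample in ["", "a", "aanlljrrrxx", "dqqqaaabbb"]:
    assert my_groupby(sample) == reference_groups(sample)
print(my_groupby("dqqqaaabbb"))
```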
