在可能不规则的字符串/数组中查找最常见重复模式的算法

如果此问题是可用问题的重复，请道歉。还没有找到一个正是我想要的。

我感兴趣的是检测字符串/数组(如ABCABCABCABC(中的模式，在这种情况下，它同样可以用整数编码。我的应用程序是这样的，我使用流式传感器，其中前面提到的序列中的每个字母都是一个传感器(例如A是传感器(。由于传感器故障之类的原因，我的序列并不总是周期性的/重复的。由于各种故障，它们可能会像这样出现，例如BCABCABCAB或ABCBCBCA。

我的应用程序变得更加困难，因为我先验地不知道我的数据集中有多少传感器，所以我需要一个算法来从序列中推断出这个数字(就像上面给出的那样(。遗憾的是，该算法应该为所有给定的示例生成ABC，因为这是最长、最常见的模式。

我的一个想法就是：

import numpy as np
from collections import Counter
# ABCABCABCABC encoded with integers 
A = np.array(
[[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3],
[ 1 ,2, 3]])
c = Counter(map(tuple, A)).most_common()[0]
# ((1,2,3), 4)

但这似乎效率很低，因为我必须多次重塑数组(可能很多次，因为我的序列很长，回想起来，我不知道我的重复序列的长度是3(，然后每次运行Counter来评估出现(或不出现(模式的规律性。

其他想法包括使用Knuth–Morris–Pratt算法以及n-gram或其组合。或者计算后缀树。

有更好的方法吗？

编辑

更多详细信息：

数据大小：长度在1000到1000000 ish之间的序列(尽管上限不太可能(
的子序列不能有重复的条目，它们必须是唯一的。即子序列不能是ABB。原因很简单；最终，我对每个传感器的时间演变感兴趣

好的，所以我想出了这个，请试着打破它。

from nltk import ngrams
from iteration_utilities import all_monotone
def find_longest_monotonic_increasing_ngram(seq):
# Store stats
gram_stats = {}
# Find longest common subsequence / n-gram
M = []
for m in range(1,int(0.2*len(seq))):
gram = Counter(ngrams(seq, m)).most_common()[0]
# Check if gram is monotonically increasing (i.e. is it sorted)
if all_monotone(gram[0],strict=True,decreasing=False):
gram_stats[m] = gram
M.append(m)
return max([gram_stats[m][0] for m in M], key=len)

MWE:

A = np.tile([1,2,3], 30)
# Mess up
A = np.insert(A,0,[1,2]) # One missing sensor at t = start
A = np.append(A,1) # two missing sensors at t = final
A[50] = 2 # Missed sensor reading at t=50 
# Run
find_longest_monotonic_increasing_ngram(A)
>>> (1, 2, 3)

相关内容

最新更新

热门标签：