Reducing an algorithm's execution time

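I'm working on a data mining project and chose the Apriori algorithm for the association rule task. I have implemented it, but, put simply, I'm not satisfied with the execution time. I'll describe only the problematic part of my code.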

I have two lists of lists.

List<List<int>> one;

List<List<int>> two;

I have to iterate over the elements of list one and check whether one[i] is a subset of two[j]:

foreach (List<int> items1 in one)
{
    foreach (List<int> items2 in two)
    {
        if (items2.ContainsSetOf(items1))
        {
            // do something
        }
    }
}

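I'm wondering whether there is a way to reduce the execution time of this task (parallel execution, using dictionaries, etc.).

Do you have any idea how to reduce it?

Thanks!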

Make them lists of sets, and use set operations to check whether one set is a subset of another.

Example

HashSet<int> set1 = new HashSet<int>();
set1.Add(1);
set1.Add(2);
HashSet<int> set2 = new HashSet<int>();
set2.Add(1);
set2.Add(2);
set2.Add(3);
List<HashSet<int>> one = new List<HashSet<int>>();
one.Add(set1);
one.Add(set2);
List<HashSet<int>> two = new List<HashSet<int>>();
two.Add(set1);
two.Add(set2);
foreach (HashSet<int> setA in one) {
    foreach (HashSet<int> setB in two) {
        if (setA.IsSubsetOf(setB)) {
            // do something
        }
    }
}
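Since the question starts from List<List<int>> rather than from sets, the conversion step might look like the sketch below. The sample data and the names oneSets, twoSets and matches are made up for illustration, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data standing in for the question's 'one' and 'two'.
var one = new List<List<int>> { new List<int> { 1, 2 }, new List<int> { 4 } };
var two = new List<List<int>> { new List<int> { 1, 2, 3 }, new List<int> { 2, 3 } };

// Convert each inner list to a HashSet once, up front, so that every
// IsSubsetOf call can rely on hash lookups instead of scanning a list.
List<HashSet<int>> oneSets = one.Select(l => new HashSet<int>(l)).ToList();
List<HashSet<int>> twoSets = two.Select(l => new HashSet<int>(l)).ToList();

int matches = 0;
foreach (HashSet<int> setA in oneSets)
{
    foreach (HashSet<int> setB in twoSets)
    {
        if (setA.IsSubsetOf(setB))
        {
            matches++; // do something
        }
    }
}
Console.WriteLine(matches);
```

The conversion costs one pass over the data, while the nested loop still performs n1 x n2 subset checks; only each individual check gets cheaper.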

C# code snippet

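using System.Collections.Generic;

// Part 1: index every item to the lists in 'two' that contain it
var dict = new Dictionary<int, HashSet<List<int>>>();
foreach (List<int> list2 in two) {
    foreach (int i in list2) {
        if (!dict.ContainsKey(i)) {
            // create empty HashSet dict[i]
            dict.Add(i, new HashSet<List<int>>());
        }
        // add reference to list2 to the HashSet dict[i]
        dict[i].Add(list2);
    }
}
// Part 2: intersect the index entries for all items of each list in 'one'
foreach (List<int> list1 in one) {
    HashSet<List<int>> listsInTwoContainingList1 = null;
    foreach (int i in list1) {
        HashSet<List<int>> candidates;
        if (!dict.TryGetValue(i, out candidates)) {
            // i occurs in no list of 'two', so nothing can contain list1
            listsInTwoContainingList1 = new HashSet<List<int>>();
            break;
        }
        if (listsInTwoContainingList1 == null) {
            listsInTwoContainingList1 = new HashSet<List<int>>(candidates);
        } else {
            listsInTwoContainingList1.IntersectWith(candidates);
        }
        if (listsInTwoContainingList1.Count == 0) {   // optimization: no candidates left
            break;
        }
    }
    if (listsInTwoContainingList1 == null) {
        continue;   // list1 is empty; handle that case separately if it matters
    }
    foreach (List<int> list2 in listsInTwoContainingList1) {
        // list2 contains list1
        // do something
    }
}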

Example

L2= {
L2a = {10, 20, 30, 40}
L2b = {30, 40, 50, 60}
L2c = {10, 25, 30, 40}
}
L1 = {
L1a = {10, 30, 40}
L1b = {30, 25, 50}
}

After the first part of the code:

dict[10] = {L2a, L2c}
dict[20] = {L2a}
dict[25] = {L2c}
dict[30] = {L2a, L2b, L2c}
dict[40] = {L2a, L2b, L2c}
dict[50] = {L2b}
dict[60] = {L2b}

In the second part of the code:

L1a: dict[10] ∩ dict[30] ∩ dict[40] = {L2a, L2c}
L1b: dict[30] ∩ dict[25] ∩ dict[50] = { }

So L1a is contained in L2a and L2c, but L1b is not contained in any list of L2.
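The worked example can be reproduced with a sketch like the following. It keys the index by list name (strings such as "L2a") instead of by list reference, purely so the result is printable; that choice, and the names l1, l2, containing and result, are illustrative assumptions. It assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Data from the worked example; the string keys are only labels for printing.
var l2 = new Dictionary<string, List<int>>
{
    ["L2a"] = new List<int> { 10, 20, 30, 40 },
    ["L2b"] = new List<int> { 30, 40, 50, 60 },
    ["L2c"] = new List<int> { 10, 25, 30, 40 },
};
var l1 = new Dictionary<string, List<int>>
{
    ["L1a"] = new List<int> { 10, 30, 40 },
    ["L1b"] = new List<int> { 30, 25, 50 },
};

// Part 1: index each item to the names of the L2 lists that contain it.
var dict = new Dictionary<int, HashSet<string>>();
foreach (var entry in l2)
{
    foreach (int item in entry.Value)
    {
        if (!dict.TryGetValue(item, out HashSet<string> names))
        {
            names = new HashSet<string>();
            dict[item] = names;
        }
        names.Add(entry.Key);
    }
}

// Part 2: intersect the index entries for every item of each L1 list.
var result = new Dictionary<string, List<string>>();
foreach (var entry in l1)
{
    HashSet<string> containing = null;
    foreach (int item in entry.Value)
    {
        if (!dict.TryGetValue(item, out HashSet<string> names))
        {
            containing = new HashSet<string>(); // item appears in no L2 list
            break;
        }
        if (containing == null) containing = new HashSet<string>(names);
        else containing.IntersectWith(names);
        if (containing.Count == 0) break; // no candidates left
    }
    result[entry.Key] = (containing ?? new HashSet<string>()).OrderBy(s => s).ToList();
    Console.WriteLine(entry.Key + ": {" + string.Join(", ", result[entry.Key]) + "}");
}
```

For L1a the intersection leaves L2a and L2c; for L1b it becomes empty once dict[50] is intersected, which matches the hand calculation.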

Complexity

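Now for the algorithmic complexity. Assume L1 has n1 elements, L2 has n2 elements, the average number of elements in the sublists of L1 is m1, and the average number of elements in the sublists of L2 is m2. Then: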

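The original solution is O(n1 x n2 x m1 x m2) if the ContainsSetOf method uses a nested loop, or at best O(n1 x n2 x (m1 + m2)) if it uses a HashSet. Is7aq's solution is also O(n1 x n2 x (m1 + m2)).
The proposed solution is O(n2 x m2 + n1 x (m1 x nd + n2)), where nd is the average number of elements in the sets dict[i].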

The efficiency of the proposed solution therefore depends heavily on nd:

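If nd is large, close to n2 (which happens when every integer is part of every sublist of L2), then it is just as slow as the original.
However, if nd is expected to be small (that is, the sublists of L2 differ substantially from one another), then the proposed solution will usually be much faster, especially when n1 and n2 are large.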

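If you want to reduce the number of "is this list inside that list" (or subset) checks, one approach is to build a hierarchy (tree) of the lists. Of course, the performance improvement (if any) depends on the data: if none of the lists contains any other list, you will have to do all the checks, just as you do now.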
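One possible reading of the hierarchy idea is sketched below. It is not a full tree: it only records, for each set in two, which other sets in two it contains, and uses that to skip checks that are guaranteed to fail. The sample data and the names subsetsOf, ruledOut, order, checks and matches are illustrative assumptions, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data.
var one = new List<HashSet<int>> { new HashSet<int> { 1, 2 }, new HashSet<int> { 5 } };
var two = new List<HashSet<int>>
{
    new HashSet<int> { 1, 2, 3, 4 },
    new HashSet<int> { 1, 2, 3 },
    new HashSet<int> { 3, 4 },
    new HashSet<int> { 7, 8 },
};

// Precompute, for every set in 'two', the indices of the other sets it contains.
int n = two.Count;
var subsetsOf = new List<int>[n];
for (int j = 0; j < n; j++)
{
    subsetsOf[j] = new List<int>();
    for (int k = 0; k < n; k++)
    {
        if (k != j && two[k].IsSubsetOf(two[j])) subsetsOf[j].Add(k);
    }
}

// Visit larger sets first so that a failed check can rule out their subsets.
int[] order = Enumerable.Range(0, n).OrderByDescending(idx => two[idx].Count).ToArray();

int checks = 0;
var matches = new List<(int, int)>();
for (int i = 0; i < one.Count; i++)
{
    var ruledOut = new bool[n];
    foreach (int j in order)
    {
        if (ruledOut[j]) continue;
        checks++;
        if (one[i].IsSubsetOf(two[j]))
        {
            matches.Add((i, j)); // two[j] contains one[i]
        }
        else
        {
            // one[i] is not inside two[j], so it cannot be inside any subset of two[j].
            foreach (int k in subsetsOf[j]) ruledOut[k] = true;
        }
    }
}
Console.WriteLine(checks + " checks, " + matches.Count + " matches");
```

The precomputation itself costs on the order of n2 x n2 subset checks, so a sketch like this is only likely to pay off when n1 is large and the sets in two actually nest inside one another.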
