MySQL: Similarity between arrays of numbers across millions of arrays



I have an array of 50 numbers:

54,12,79,34,66,22,78,192,54,23,55,87,23,63... (up to 50)

and a stack of one million arrays, each holding 50 numbers:

  1: 76,34,67,4,12,34... (up to 50)
  2: 34,12,68,97,55,33... (up to 50)
  3: 21,65,87,23,65,45... (up to 50)
  4: ....
  5: (up to one million)

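1) How do I store this in a MySQL database?

2) And, most importantly, how do I compare the first array against the others to find out how similar they are? I mean... I would like to get: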

Similarity to 1: 13%
Similarity to 2: 11%
Similarity to 3: 16%
...

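The similarity should be computed element by element: the first with the first, the second with the second, and so on, and then averaged over the 50 elements.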

If the order doesn't matter, you can store them as sorted arrays:

    1: 4,12,34,34,67,76... (up to 50)
    2: 12,33,34,55,68,97... (up to 50)
    3: 21,23,45,65,65,87... (up to 50)

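So, with sorted arrays, you can get the difference between any two sequences using an algorithm similar to the merge step for sorted arrays. The merge-like pass itself is linear in the array length; it is the sorting that costs O(n*logn) time.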
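The merge-like pass can be sketched as follows. This is only an illustration under stated assumptions: the function name `countCommonSorted` is hypothetical, both inputs are assumed to be sorted in ascending order already, and "similarity" is assumed to mean the share of values the two sequences have in common (duplicates counted as in a multiset).

```php
<?php
// Hypothetical helper: counts the values shared by two arrays that are
// already sorted in ascending order. Duplicates are matched one-to-one.
function countCommonSorted(array $a, array $b): int
{
    $i = 0;
    $j = 0;
    $common = 0;
    $na = count($a);
    $nb = count($b);

    // Walk both arrays at once, always advancing the side with the smaller value.
    while ($i < $na && $j < $nb) {
        if ($a[$i] === $b[$j]) {
            $common++;
            $i++;
            $j++;
        } elseif ($a[$i] < $b[$j]) {
            $i++;
        } else {
            $j++;
        }
    }

    return $common;
}

// The first six values of sequences 1 and 2 from the example above.
$first  = [4, 12, 34, 34, 67, 76];
$second = [12, 33, 34, 55, 68, 97];

$common = countCommonSorted($first, $second);
printf("%d shared values, %.1f%% similar\n", $common, 100 * $common / count($first));
// prints "2 shared values, 33.3% similar"
```

Each comparison advances at least one index, so a pair of 50-element arrays needs at most about 100 steps.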

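However, if you need to run comparisons and the values have reasonable upper and lower bounds, you can enumerate every unique number across all sequences, i.e.: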

    0  => 4
    1  => 12
    2  => 21
    3  => 23
    4  => 33
    5  => 34
    6  => 45
    7  => 55
    8  => 65
    9  => 67
    10 => 68
    11 => 76
    12 => 87
    13 => 97

and store each sequence as a series of counters, i.e.:

    1: 11000200010100
    2: 01001101001001
    3: 00110010200010

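The difference is then the number of differing digits divided by the total number of characters, but I don't actually see any advantage to this method, because its cost is also on the order of O(n*logn).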
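For illustration, the counter encoding could look like the sketch below. The helper names (`buildDictionary`, `toCounters`, `counterDifference`) are hypothetical, the dictionary is built only from the six visible values of each sample sequence (so it has 14 slots, including one for 97), and the difference is assumed to be the fraction of slots whose counts differ.

```php
<?php
// Hypothetical helper: maps every distinct value across all sequences
// to a slot index, in ascending order of value.
function buildDictionary(array $sequences): array
{
    $values = array_unique(array_merge(...$sequences));
    sort($values);
    return array_flip($values); // value => slot index
}

// Hypothetical helper: turns one sequence into a vector of per-slot counts.
function toCounters(array $sequence, array $dictionary): array
{
    $counters = array_fill(0, count($dictionary), 0);
    foreach ($sequence as $value) {
        $counters[$dictionary[$value]]++;
    }
    return $counters;
}

// Assumed metric: share of slots whose counts differ.
// Both vectors must come from the same non-empty dictionary.
function counterDifference(array $x, array $y): float
{
    $differing = 0;
    foreach ($x as $slot => $count) {
        if ($count !== $y[$slot]) {
            $differing++;
        }
    }
    return $differing / count($x);
}

$sequences = [
    [4, 12, 34, 34, 67, 76],
    [12, 33, 34, 55, 68, 97],
    [21, 23, 45, 65, 65, 87],
];

$dictionary = buildDictionary($sequences);
foreach ($sequences as $n => $sequence) {
    echo ($n + 1) . ': ' . implode('', toCounters($sequence, $dictionary)) . "\n";
}
// First line printed: "1: 11000200010100"
```

Joining the counts into a string only works while every count stays below 10; keeping them as an array avoids that limit.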

This should get you going. Assuming every array you use contains the same number of elements, as stated in your initial question, this is accurate.
<?php
$base = array(636,3305,705,3080,1895,3586,1879,817,3330,2884,487,1267,1016,2100,3598,2535,3894,2945,282,1182,3785,2489,3812,2829,1332,229,3577,125,2735,1126,1194,3366,430,1895,2446,2321,1480,325,3133,809,3204,3616,2071,220,1715,1669,2750,1608,613,3028);
$compare_a = array(355,3118,1293,2333,3632,2652,2677,1360,1295,1478,2742,1157,2545,2151,1593,3992,601,1913,1317,3728,581,3325,2612,1710,1430,1985,399,2731,2408,3821,1563,2759,2939,2852,1091,2570,1503,3764,3926,2794,1241,2668,3947,3782,818,1540,3774,1414,3449,1091);
$compare_b = array(1821,2179,1411,1559,193,3304,1484,2125,2722,1879,2031,2611,1142,928,1372,2140,1230,1498,1250,1362,287,3055,2933,186,3310,3397,3665,2196,691,7,3677,2508,2182,1088,66,2371,391,1546,495,3108,3421,2522,1719,563,3446,3087,2698,676,584,3944);
$compare_c = array(3354,3250,2884,1803,3844,1981,2882,1998,1196,1959,495,3514,3284,844,1848,2834,2415,459,3158,1862,1123,2334,491,3668,1136,406,4000,3854,2326,2169,2250,1680,1419,1133,3478,1262,3110,2359,3255,305,318,3745,3814,3598,589,1662,2431,2999,2116,1589);
$compare_d = array(1474,3489,2708,1704,2086,3248,2817,3403,467,3783,3208,3348,2426,595,3998,2089,2948,3546,189,2510,1723,1054,2364,3330,3480,3553,697,2268,3544,2338,374,1017,1827,3077,2717,3908,2325,1533,3310,2788,1316,2518,2135,3737,3109,2133,1826,2056,1678,2011);
$compare_e = array(2688,2677,3180,154,1614,3138,3234,3219,2160,3929,3951,2577,2157,1592,174,148,604,2921,1681,2425,1334,45,2550,2421,3833,47,716,2117,459,3702,3997,3142,2378,3177,3292,3988,2315,2525,3206,474,2453,3157,3047,610,748,3217,753,1347,2137,2430);
// Percentage of $base's values that also occur somewhere in $other.
// Note: array_diff() compares by value and ignores positions.
function percentSimilar(array $base, array $other)
{
    return abs(((count(array_diff($base, $other)) - count($base)) / count($base)) * 100);
}

$candidates = array('a' => $compare_a, 'b' => $compare_b, 'c' => $compare_c, 'd' => $compare_d, 'e' => $compare_e);
$n = 0;
foreach ($candidates as $name => $other) {
    $n++;
    print $n .') $base compared with $compare_'. $name .' is: '. percentSimilar($base, $other) .'% similar to $base<br />';
}
?>

The code above should output:

1) $base compared with $compare_a is: 0% similar to $base
2) $base compared with $compare_b is: 2% similar to $base
3) $base compared with $compare_c is: 4% similar to $base
4) $base compared with $compare_d is: 2% similar to $base
5) $base compared with $compare_e is: 0% similar to $base

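It really depends on what you want to use as the algorithm for determining similarity. In your question you said you want each element compared with the element at the same position in the other array. Note that PHP's built-in array_diff() does not do exactly that: it returns the values of the first array that do not occur anywhere in the second, regardless of position, so the percentages above count shared values. A full-fledged example will depend on how you are storing these arrays. I could modify this to pull the data from the database and run the calculation in a loop or something, but I'd need more details to help you with that.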
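A strictly position-by-position comparison, as the question describes it, might be sketched like this. The question does not say how two single elements should be scored, so both metrics here are assumptions: `positionalSimilarity` counts exact matches at the same index, and `positionalCloseness` averages `1 - |a - b| / max(a, b)` per position for non-negative numbers. Both function names are hypothetical.

```php
<?php
// Assumption: two elements "match" only if they are equal at the same index.
// Returns the percentage of matching positions.
function positionalSimilarity(array $base, array $other): float
{
    $n = count($base);
    if ($n === 0 || $n !== count($other)) {
        throw new InvalidArgumentException('Arrays must be non-empty and of equal length.');
    }

    $matches = 0;
    for ($i = 0; $i < $n; $i++) {
        if ($base[$i] === $other[$i]) {
            $matches++;
        }
    }
    return 100 * $matches / $n;
}

// Assumption: values are non-negative numbers. Each position scores
// 1 - |a - b| / max(a, b), and the scores are averaged into a percentage.
function positionalCloseness(array $base, array $other): float
{
    $n = count($base);
    if ($n === 0 || $n !== count($other)) {
        throw new InvalidArgumentException('Arrays must be non-empty and of equal length.');
    }

    $total = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $max = max($base[$i], $other[$i]);
        // Two zeros are treated as a perfect match.
        $total += $max > 0 ? 1.0 - abs($base[$i] - $other[$i]) / $max : 1.0;
    }
    return 100 * $total / $n;
}

printf("%.1f%%\n", positionalSimilarity([54, 12, 79, 34], [54, 12, 68, 97])); // prints "50.0%"
printf("%.1f%%\n", positionalCloseness([10, 20], [10, 10]));                  // prints "75.0%"
```

The arrays `[1, 2, 3]` and `[3, 1, 2]` show how this differs from a value-based check: `array_diff()` finds no missing values, while `positionalSimilarity()` returns 0 because no position matches.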
