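I am trying to use the kkmeans() function from the kernlab R package to do kernel k-means clustering. My problem is that my code returns the expected output for some numbers of clusters passed to the function's centers argument, but throws an error for others:
Error in if (sum(abs(dc)) < 1e-15) break: missing value where TRUE/FALSE needed
My guess is that this is a convergence problem, since the error seems to appear as I increase the number of clusters, but that would be surprising because I have far more rows than the number of clusters I am requesting. For example, I can successfully request 10 clusters with an 8000x3 matrix, but I get the error with 100 clusters. Similarly, with a 50-row subset of that data I can request 5 clusters but not 10.
Below is a minimal reproducible example in which my code reproduces both the success and the error.
With centers = 10: error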
kernlab::kkmeans(mymat, centers=10)
#> Using automatic sigma estimation (sigest) for RBF or laplace kernel
#> Error in if (sum(abs(dc)) < 1e-15) break: missing value where TRUE/FALSE needed
With centers = 5: no error
kernlab::kkmeans(mymat, centers=5)
#> Using automatic sigma estimation (sigest) for RBF or laplace kernel
#> Spectral Clustering object of class "specc"
#>
#> Cluster memberships:
#>
#> 1 1 1 1 2 1 1 3 3 5 5 5 3 2 2 2 4 4 3 3 5 2 2 5 5 5 5 5 5 2 4 3 3 3 2 2 5 3 3 5 5 4 4 4 3 1 4 2 5 3
#>
#> Gaussian Radial Basis kernel function.
#> Hyperparameter : sigma = 0.756590498067127
#>
#> Centers:
#> [,1] [,2] [,3]
#> [1,] 15.75871 -16.69486 191.5841
#> [2,] 16.74850 -21.94730 186.8914
#> [3,] 15.99483 -18.95892 190.2622
#> [4,] 15.45729 -18.13571 191.9611
#> [5,] 16.69136 -22.19600 187.0055
#>
#> Cluster size:
#> [1] 7 10 12 7 14
#>
#> Within-cluster sum of squares:
#> [1] 301006.7 443237.8 607889.4 305777.1 685823.5
Example data (50x3 matrix)
mymat <- structure(c(15.9390001296997, 15.9079999923706, 16.087999343872,
15.7930002212524, 15.9619998931884, 15.6129999160766, 15.7550001144409,
16.7740001678466, 16.9080009460449, 17.0769996643066, 16.3640003204345,
16.5960006713867, 16.579999923706, 16.4570007324218, 16.2320003509521,
16.1639995574951, 15.6180000305175, 15.5109996795654, 15.5120000839233,
15.628999710083, 16.9950008392333, 17.3530006408691, 17.2229995727539,
16.8910007476806, 17.1800003051757, 17.1709995269775, 16.9860000610351,
16.704999923706, 16.273000717163, 15.8830003738403, 15.6230001449584,
15.333999633789, 15.3839998245239, 15.3870000839233, 17.1119995117187,
17.6200008392333, 16.8349990844726, 16.4969997406005, 16.2479991912841,
16.1259994506835, 15.8059997558593, 15.378999710083, 15.4320001602172,
15.2100000381469, 15.2519998550415, 15.2150001525878, 15.4280004501342,
17.4790000915527, 16.6739997863769, 16.4330005645751, -16.6299991607666,
-16.9529991149902, -17.5610008239746, -17.8290004730224, -18.6200008392333,
-17.1079998016357, -16.25, -21.716999053955, -21.1219997406005,
-21.8209991455078, -20.1840000152587, -20.0450000762939, -20.9599990844726,
-19.5240001678466, -18.6590003967285, -19.4379997253417, -18.6280002593994,
-18.0669994354248, -16.204999923706, -15.5830001831054, -23.9489994049072,
-23.57200050354, -24.3969993591308, -23.2880001068115, -22.6019992828369,
-23.2329998016357, -22.5979995727539, -22.6140003204345, -20.8059997558593,
-19.4300003051757, -19.4729995727539, -17.5690002441406, -16.8110008239746,
-15.2930002212524, -25.2509994506835, -24.7649993896484, -24.8080005645751,
-21.9939994812011, -21.5189990997314, -20.329999923706, -20.25,
-19.1380004882812, -18.6180000305175, -18.5900001525878, -16.1620006561279,
-14.5329999923706, -14.4359998703002, -25.8169994354248, -24.2159996032714,
-22.57200050354, 190.996994018554, 190.996002197265, 190.18699645996,
191.039993286132, 190.205993652343, 191.919006347656, 191.766006469726,
187.14599609375, 186.889007568359, 186.225997924804, 188.60400390625,
187.932006835937, 187.837005615234, 188.453002929687, 189.382995605468,
189.360000610351, 191.25, 191.845001220703, 192.580001831054,
192.414993286132, 185.358001708984, 184.570999145507, 184.595993041992,
186.091995239257, 185.613998413085, 185.25, 186.235000610351,
187.003005981445, 188.744995117187, 190.169998168945, 190.921005249023,
192.628997802734, 192.768005371093, 193.281997680664, 184.602996826171,
183.796005249023, 185.414001464843, 187.811004638671, 188.615005493164,
189.263000488281, 190.167007446289, 191.781997680664, 191.837997436523,
192.582000732421, 193.399002075195, 194.184005737304, 193.509994506835,
183.776000976562, 186.173995971679, 187.774993896484), dim = c(50L,
3L), dimnames = list(NULL, c("x", "y", "z")))
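This appears to be a problem with something the function generates randomly inside the kkmeans() call. I don't know why it happens; you may want to check with the authors to determine whether it is a bug or expected behavior.
Although I reproduced your error with your data and code (running a fresh R instance each time), the exact same function call sometimes produced a different error and sometimes no error at all. Once set.seed() is called, however, whether it errors is fully reproducible, which suggests that the problem is related to the starting values that determine the model's other parameters.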
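Below I show that (a) the call can produce a different error (in fact I saw a third error, but did not save the seed to reproduce it), (b) even when it "converges", it produces very different clusterings depending on the random seed alone, and (c) the hyperparameter tuning is strongly affected by the random seed. I also forgot to save the seed for a run in which I did get a clustering result with 10 clusters.
I don't know why this happens. My intuition is that in some cases the automatically generated settings are nonsensical or out of bounds, and that this produces the error. It may be because your data is unusual in some way, or because the algorithm that sets the hyperparameters doesn't make much sense. It could also be a bug, so it may be worth filing as an issue.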
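One way to probe whether the automatic hyperparameter estimation is involved is to fix sigma yourself instead of leaving kpar at its default. The following is a minimal sketch, not a confirmed fix: the data x is a hypothetical stand-in for mymat, the choice of the middle sigest() value is an assumption, and kkmeans() may still error because its starting values remain seed dependent.

```r
library(kernlab)

# Stand-in data (hypothetical); replace x with mymat from the question.
set.seed(1)
x <- matrix(rnorm(150), ncol = 3)

# sigest() returns three candidate sigma values for the RBF kernel.
# It samples the data internally, so call it once under a fixed seed.
set.seed(42)
sig <- sigest(x, scaled = FALSE)

# Pass a fixed sigma instead of relying on kpar = "automatic".
# tryCatch() keeps the script running if kkmeans() still errors.
fit <- tryCatch(
  kkmeans(x, centers = 3, kernel = "rbfdot",
          kpar = list(sigma = unname(sig[2]))),
  error = function(e) conditionMessage(e)
)
fit
```

If the results still vary a lot across seeds with sigma held fixed, the remaining variation comes from somewhere other than the sigma estimate.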
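In any case, one question to ask yourself is whether you want to use something that is this inconsistent about whether it returns a result at all, that produces very different results across random seeds, and that leaves you unsure whether the algorithm is really doing what it claims to do.
Example 1: centers = 5, no error, set.seed(123)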
set.seed(123)
#> Hyperparameter : sigma = 0.463522505156128
#>
#> Centers:
#> [,1] [,2] [,3]
#> [1,] 16.53045 -21.18700 187.8918
#> [2,] 17.16138 -24.59687 184.7860
#> [3,] 15.73436 -17.87491 191.2586
#> [4,] 15.63425 -16.63862 192.0088
#> [5,] 16.19467 -20.16442 189.1617
#>
#> Cluster size:
#> [1] 11 8 11 8 12
#>
#> Within-cluster sum of squares:
#> [1] 537972.8 386310.2 544994.1 391965.9 604386.9
Example 2: centers = 5, no error, set.seed(3)
It works, but the number of observations per cluster is quite different! Note the different hyperparameter.
#> Hyperparameter : sigma = 0.290281708176631
#>
#> Centers:
#> [,1] [,2] [,3]
#> [1,] 15.97636 -18.38464 190.5449
#> [2,] 16.24809 -20.10409 188.9572
#> [3,] 15.63660 -17.85633 191.5151
#> [4,] 17.06100 -22.70840 185.8834
#> [5,] 17.16138 -24.59687 184.7860
#>
#> Cluster size:
#> [1] 11 11 15 5 8
#>
#> Within-cluster sum of squares:
#> [1] 545547.7 538434.5 757947.0 236986.8 386310.2
Example 3: centers = 5, no error, set.seed(999)
It works, but the number of observations per cluster is quite different! Again, note the different hyperparameter.
#> Gaussian Radial Basis kernel function.
#> Hyperparameter : sigma = 0.128189488632645
#>
#> Centers:
#> [,1] [,2] [,3]
#> [1,] 16.93157 -22.25171 186.4579
#> [2,] 15.45090 -15.99500 192.8452
#> [3,] 15.73677 -18.32277 191.0152
#> [4,] 17.16244 -24.44533 184.8376
#> [5,] 16.32218 -20.69291 188.5965
#>
#> Cluster size:
#> [1] 7 10 13 9 11
#>
#> Within-cluster sum of squares:
#> [1] 294630.1 457490.3 604486.8 441669.5 539478.6
Example 4: centers = 10, new error, set.seed(99)
A new error.
#> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function 'affinMult' for signature '"rbfkernel", "numeric"'
Example 5: centers = 10, original error, set.seed(3)
The original error.
#> Error in if (sum(abs(dc)) < 1e-15) break: missing value where TRUE/FALSE needed
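Not shown: an additional error with 10 clusters (not all columns found in the matrix), and runs that did successfully return a clustering with 10 clusters.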
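The seed dependence described above can be checked systematically rather than by hand. The sketch below runs kkmeans() under a range of seeds and tabulates which calls succeed and which error messages appear. The helper run_with_seed(), the seed range 1:20, and the stand-in data x are all hypothetical choices for illustration; substitute mymat to repeat the experiment on the data from the question.

```r
library(kernlab)

# Stand-in data (hypothetical); replace x with mymat from the question.
set.seed(1)
x <- matrix(rnorm(150), ncol = 3)

# Run kkmeans() under one seed; return "ok" or the error message.
run_with_seed <- function(seed, data, k) {
  set.seed(seed)
  tryCatch({
    kkmeans(data, centers = k)
    "ok"
  }, error = function(e) conditionMessage(e))
}

seeds <- 1:20
outcome <- vapply(seeds, run_with_seed, character(1), data = x, k = 10)

# Count how often each outcome (success or a given error) occurred.
table(outcome)

# Seeds that ran without error, for reuse with set.seed().
seeds[outcome == "ok"]
```

Recording the seeds this way also avoids losing the one that produced a particular error or a successful 10-cluster fit.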