k 均值聚类,具有基于节点值的约束



也许我错过了一些东西,因为这似乎是一个简单的问题,但我在网上查了一下,在文献中没有找到任何东西。

基本上,我需要做的是根据目的地城市的位置(因此纬度/经度作为每个节点的特征,欧几里得距离作为相似性度量)对一组目的地城市进行聚类,并具有固定数量的聚类。一切似乎都很好,k-means就可以了。但是,我对每个集群都有以下约束:每个城市(节点)都有一个分配给它的相应值,并且每个集群中这些值的总和不应超过固定阈值(所有集群的阈值相同)。有没有简单的方法可以做到这一点?

您有 2 个选项:

-您可以改用 rpart 作为聚类,并使用权重和 minbucket 选项。然而,预测将给你的聚类将是矩形的。

-你可以看看我找到的源代码https://searchcode.com/codesearch/view/18689414/:

kmeans <-
function(x, centers, iter.max = 10, nstart = 1,
         algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
{
    do_one <- function(nmeth) {
        Z <-
            switch(nmeth,
                   { # 1
                       Z <- .Fortran(C_kmns, x, m, p,
                                centers = centers,
                                as.integer(k), c1 = integer(m), integer(m),
                                nc = integer(k), double(k), double(k), integer(k),
                                double(m), integer(k), integer(k),
                                as.integer(iter.max), wss = double(k),
                                ifault = 0L)
                       switch(Z$ifault,
                              stop("empty cluster: try a better set of initial centers",
                                   call.=FALSE),
                              warning(gettextf("did not converge in %d iterations",
                                               iter.max), call.=FALSE, domain =NA),
                              stop("number of cluster centres must lie between 1 and nrow(x)",
                                   call.=FALSE)
                              )
                       Z
                   },
                   { # 2
                       Z <- .C(C_kmeans_Lloyd, x, m, p,
                               centers = centers, as.integer(k),
                               c1 = integer(m), iter = as.integer(iter.max),
                               nc = integer(k), wss = double(k))
                       if(Z$iter > iter.max)
                           warning("did not converge in ",
                                  iter.max, " iterations", call.=FALSE)
                       if(any(Z$nc == 0))
                           warning("empty cluster: try a better set of initial centers", call.=FALSE)
                       Z
                   },
                   { # 3
                       Z <- .C(C_kmeans_MacQueen, x, m, p,
                               centers = as.double(centers), as.integer(k),
                               c1 = integer(m), iter = as.integer(iter.max),
                               nc = integer(k), wss = double(k))
                       if(Z$iter > iter.max)
                           warning("did not converge in ",
                                   iter.max, " iterations", call.=FALSE)
                       if(any(Z$nc == 0))
                           warning("empty cluster: try a better set of initial centers", call.=FALSE)
                       Z
                    })
        Z
    }
    x <- as.matrix(x)
    m <- as.integer(nrow(x))
    if(is.na(m)) stop("invalid nrow(x)")
    p <- as.integer(ncol(x))
    if(is.na(p)) stop("invalid ncol(x)")
    if(missing(centers))
    stop("'centers' must be a number or a matrix")
    nmeth <- switch(match.arg(algorithm),
                    "Hartigan-Wong" = 1,
                    "Lloyd" = 2, "Forgy" = 2,
                    "MacQueen" = 3)
    if(length(centers) == 1L) {
    if (centers == 1) nmeth <- 3
    k <- centers
        ## we need to avoid duplicates here
        if(nstart == 1)
            centers <- x[sample.int(m, k), , drop = FALSE]
        if(nstart >= 2 || any(duplicated(centers))) {
            cn <- unique(x)
            mm <- nrow(cn)
            if(mm < k)
                stop("more cluster centers than distinct data points.")
            centers <- cn[sample.int(mm, k), , drop=FALSE]
        }
    } else {
    centers <- as.matrix(centers)
        if(any(duplicated(centers)))
            stop("initial centers are not distinct")
        cn <- NULL
    k <- nrow(centers)
        if(m < k)
            stop("more cluster centers than data points")
    }
    if(iter.max < 1) stop("'iter.max' must be positive")
    if(ncol(x) != ncol(centers))
    stop("must have same number of columns in 'x' and 'centers'")
    if(!is.double(x)) storage.mode(x) <- "double"
    if(!is.double(centers)) storage.mode(centers) <- "double"
    Z <- do_one(nmeth)
    best <- sum(Z$wss)
    if(nstart >= 2 && !is.null(cn))
    for(i in 2:nstart) {
        centers <- cn[sample.int(mm, k), , drop=FALSE]
        ZZ <- do_one(nmeth)
        if((z <- sum(ZZ$wss)) < best) {
        Z <- ZZ
        best <- z
        }
    }
    centers <- matrix(Z$centers, k)
    dimnames(centers) <- list(1L:k, dimnames(x)[[2L]])
    cluster <- Z$c1
    if(!is.null(rn <- rownames(x)))
        names(cluster) <- rn
    totss <- sum(scale(x, scale = FALSE)^2)
    structure(list(cluster = cluster, centers = centers, totss = totss,
                   withinss = Z$wss, tot.withinss = best,
                   betweenss = totss - best, size = Z$nc),
          class = "kmeans")
}
## modelled on print methods in the cluster package
print.kmeans <- function(x, ...)
{
    cat("K-means clustering with ", length(x$size), " clusters of sizes ",
        paste(x$size, collapse=", "), "n", sep="")
    cat("nCluster means:n")
    print(x$centers, ...)
    cat("nClustering vector:n")
    print(x$cluster, ...)
    cat("nWithin cluster sum of squares by cluster:n")
    print(x$withinss, ...)
    cat(sprintf(" (between_SS / total_SS = %5.1f %%)n",
        100 * x$betweenss/x$totss),
    "Available components:n", sep="n")
    print(names(x))
    invisible(x)
}
fitted.kmeans <- function(object, method = c("centers", "classes"), ...)
{
    method <- match.arg(method)
    if (method == "centers") object$centers[object$cl, , drop=FALSE]
    else object$cl
}

请注意,代码会检查以下行是否进行了改进:

 if((z <- sum(ZZ$wss)) < best) {
        Z <- ZZ
        best <- z
        }

您可以在此处添加约束。

您可以使用与 KMean 相同的原理。迭代 2-3 直到收敛:

  1. 将城市分配给集群(随机)
  2. 计算聚类的质心
  3. 为质心指定点,以便:
    • 指定质心的点的距离总和最小化
    • 遵守阈值约束

在标准 KMean 中,没有限制。因此,通过将每个点分配给最近的质心来轻松执行第二步。在这里,您必须在步骤 2 中解决一个优化问题。如果您只是将其建模为整数规划问题,则可能会更快。OR工具具有解决整数规划问题的工具。

下面是一个 python 实现,它使用不同的约束进行 K 均值聚类分析,包括集群中实例总权重最大的约束。

最新更新