Cluster computing - What does "Top Terms" actually mean in the output of Mahout clusterdump?



I'm new to this... I get the following output:

/opt/hadoop/mahout-distribution-0.9/bin$ mahout clusterdump 
>    -d /app/hadoop/dmacs/training_set1_sparseout/dictionary.file-0 
>    -dt sequencefile 
>    -i /app/hadoop/dmacs/training_set1_sparseout/kmeans-clusters/clusters-2-final 
>    -n 20 
>    -b 100 
>    -o /app/hadoop/dmacs/kmeans_final_output/cdump.txt 
>    -dm org.apache.mahout.common.distance.CosineDistanceMeasure   
:VL-1480{n=150 c=[1000062,3,2005:0.098, 1000079,1,2002:0.080, 1000079,2,2002:0.078, 1000079,3,2002:0.
    Top Terms:
            25                                      =>  10.670724073251089
            31                                      =>   7.999464999039968
            1664010,5,2005                          =>  1.2396535428365072
            2439493,1,2003                          =>   1.184131249586741
            507603,1,2005                           =>  0.9944797229766845
            199257,3,2005                           =>  0.9928587055206299
            2602249,3,2004                          =>  0.9890585215886434
            184705,3,2004                           =>  0.9728035926818848
            447759,5,2005                           =>  0.9652122163772583
            1152594,3,2004                          =>  0.9619592666625977
            104237,5,2005                           =>  0.9515269517898559
            1473980,3,2005                          =>  0.9478832610448201
            2118461,4,2005                          =>  0.9315701317787171
            1037245,3,2005                          =>  0.9236405754089355
            1639792,1,2002                          =>  0.9183504740397136
            1227322,1,2003                          =>  0.9121313015619914
            2019240,3,2004                          =>   0.909924259185791
            1117152,5,2005                          =>  0.9050878302256267
            2040853,3,2004                          =>  0.9025738382339478
            1309838,5,2005                          =>  0.8964522886276245

What do the Top Terms in this output actually mean? Thanks in advance!!

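Top Terms are the highest-weighted terms of the cluster, that is, of the documents that belong to it. clusterdump takes the cluster's center vector (the c=[...] part of the VL-1480 line), sorts its entries by weight, and prints the largest ones, using the dictionary passed with -d to turn term indices into labels. The number after => is that term's weight in the center. You can control how many terms are printed with the -n / --numWords flag of clusterdump; your command uses -n 20, which is why 20 terms are listed.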
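As a rough illustration of how such a top-terms list can be derived from a cluster center, here is a minimal Python sketch. It is not Mahout's code: the function name top_terms and the toy centroid and dictionary values are assumptions made up for this example, loosely shaped like the output in the question.

```python
from typing import Dict, List, Tuple


def top_terms(centroid: Dict[int, float],
              dictionary: Dict[int, str],
              num_words: int) -> List[Tuple[str, float]]:
    """Return the num_words highest-weighted terms of a cluster center.

    centroid maps a term index to its weight in the cluster center;
    dictionary maps the same index to the term label.
    (Both structures are hypothetical stand-ins for illustration.)
    """
    # Sort (index, weight) pairs by weight, largest first.
    ranked = sorted(centroid.items(), key=lambda kv: kv[1], reverse=True)
    # Keep the first num_words entries and look up each term's label.
    return [(dictionary[idx], weight) for idx, weight in ranked[:num_words]]


# Hypothetical toy data, not taken from a real Mahout run.
dictionary = {0: "25", 1: "31", 2: "1664010,5,2005", 3: "1000062,3,2005"}
centroid = {0: 10.67, 1: 8.0, 2: 1.24, 3: 0.098}

# Print in a layout similar to the "Top Terms" block.
for term, weight in top_terms(centroid, dictionary, num_words=3):
    print(f"{term:<40}=> {weight}")
```

Here num_words plays the role that -n plays on the command line: a larger value simply keeps more entries from the same sorted list.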

For details on the flags, see the help:

mahout-distribution-0.9$ bin/mahout clusterdump -h

See also this similar question: Interpreting the output of mahout clusterdumper
