R - Adding a topic number



Starting from a process like this:

library(stm)
library(quanteda)
data("data_corpus_irishbudget2010", package = "quanteda.textmodels")
quant_dfm <- dfm(data_corpus_irishbudget2010, remove_punct = TRUE, remove_numbers = TRUE,
                 remove = stopwords("english"))
my_lda_fit20 <- stm(quant_dfm, K = 20, verbose = FALSE)

How can I add a new column to the input data frame showing which topic each row belongs to?

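> The object quant_dfm is not a data.frame but an object of class dfm, that is, a document-feature matrix. You therefore cannot simply add a new column to it.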

One approach is to bind the topic proportions to the document metadata:

quant_stm <- convert(quant_dfm, to = "stm")
result <- cbind(quant_stm$meta, my_lda_fit20$theta)
result
year debate number      foren     name party            1            2            3            4            5            6            7            8            9           10           11           12           13           14           15           16           17           18           19           20
1  2010 BUDGET     01      Brian  Lenihan    FF 3.287636e-05 1.027189e-05 2.845976e-05 2.705457e-05 1.178247e-05 1.958443e-05 1.688009e-05 1.527970e-05 1.662154e-05 1.569261e-05 9.996385e-01 8.161064e-06 3.134414e-06 7.771382e-06 3.287636e-05 1.178247e-05 1.027189e-05 3.287636e-05 3.287636e-05 2.726512e-05
2  2010 BUDGET     02    Richard   Bruton    FG 1.739765e-05 2.178125e-05 3.941765e-05 4.226469e-05 2.263278e-05 4.629310e-05 9.994162e-01 4.363172e-05 4.269271e-05 4.374664e-05 2.194216e-05 3.626869e-05 3.145178e-05 3.168451e-05 1.739765e-05 2.263278e-05 2.178125e-05 1.739765e-05 1.739765e-05 4.598653e-05
3  2010 BUDGET     03       Joan   Burton   LAB 1.270716e-05 2.701389e-05 2.528983e-05 3.198254e-05 1.722971e-05 3.540361e-05 3.028381e-05 3.148696e-05 2.992819e-05 2.998334e-05 8.736542e-06 9.995755e-01 1.300002e-05 1.599658e-05 1.270716e-05 1.722971e-05 2.701389e-05 1.270716e-05 1.270716e-05 3.313102e-05
4  2010 BUDGET     04     Arthur   Morgan    SF 1.851431e-05 2.045892e-05 3.245534e-05 3.207746e-05 2.387346e-05 3.423070e-05 2.691018e-05 2.575705e-05 2.885381e-05 2.563915e-05 3.584059e-06 1.351142e-05 9.995712e-01 9.390317e-06 1.851431e-05 2.387346e-05 2.045892e-05 1.851431e-05 1.851431e-05 3.362038e-05
5  2010 BUDGET     05      Brian    Cowen    FF 2.215497e-05 1.240243e-05 4.015021e-05 4.588765e-05 1.160664e-05 3.514791e-05 2.677760e-05 2.550147e-05 2.578279e-05 2.485740e-05 8.530111e-06 1.628969e-05 9.007719e-06 9.995729e-01 2.215497e-05 1.160664e-05 1.240243e-05 2.215497e-05 2.215497e-05 3.254880e-05
6  2010 BUDGET     06       Enda    Kenny    FG 1.807534e-05 2.374804e-05 3.691986e-05 4.085170e-05 2.494992e-05 4.967362e-05 4.637380e-05 4.567917e-05 4.658941e-05 9.993948e-01 2.153943e-05 3.796325e-05 3.200702e-05 3.107570e-05 1.807534e-05 2.494992e-05 2.374804e-05 1.807534e-05 1.807534e-05 4.685114e-05
7  2010 BUDGET     07     Kieran ODonnell    FG 3.122759e-05 3.811376e-05 5.552180e-05 6.305921e-05 3.701247e-05 9.990649e-01 6.551151e-05 6.451951e-05 6.510495e-05 6.631872e-05 3.630601e-05 5.917915e-05 5.583803e-05 5.758112e-05 3.122759e-05 3.701247e-05 3.811376e-05 3.122759e-05 3.122759e-05 7.095163e-05
8  2010 BUDGET     08      Eamon  Gilmore   LAB 1.728585e-05 2.378723e-05 3.808019e-05 4.205237e-05 2.442556e-05 4.679972e-05 4.531215e-05 9.994039e-01 4.726069e-05 4.401952e-05 2.084719e-05 3.782904e-05 3.062490e-05 3.143581e-05 1.728585e-05 2.442556e-05 2.378723e-05 1.728585e-05 1.728585e-05 4.630430e-05
9  2010 BUDGET     09    Michael  Higgins   LAB 5.453784e-05 8.491352e-05 1.246776e-04 1.417395e-04 4.990550e-01 1.348059e-04 1.186019e-04 1.250461e-04 1.169334e-04 1.231220e-04 8.797356e-05 1.090667e-04 1.556956e-04 7.117493e-05 5.453784e-05 4.990550e-01 8.491352e-05 5.453784e-05 5.453784e-05 1.931644e-04
10 2010 BUDGET     10     Ruairi    Quinn   LAB 5.434290e-05 4.990870e-01 1.146335e-04 1.335057e-04 7.981937e-05 1.288407e-04 1.052599e-04 1.128768e-04 1.167311e-04 1.111979e-04 6.940718e-05 1.662445e-04 1.275750e-04 7.098637e-05 5.434290e-05 7.981937e-05 4.990870e-01 5.434290e-05 5.434290e-05 1.917145e-04
11 2010 BUDGET     11       John  Gormley Green 2.493549e-01 8.531671e-05 1.598439e-04 1.736796e-04 7.803350e-05 1.574403e-04 1.196615e-04 1.158435e-04 1.294153e-04 1.159484e-04 3.745399e-04 1.111192e-04 1.838983e-04 2.111470e-04 2.493549e-01 7.803350e-05 8.531671e-05 2.493549e-01 2.493549e-01 4.011218e-04
12 2010 BUDGET     12      Eamon     Ryan Green 4.703621e-05 5.445045e-05 9.351519e-05 9.986821e-01 5.408563e-05 8.559327e-05 8.217942e-05 7.759309e-05 7.144833e-05 7.411570e-05 7.180670e-05 7.268896e-05 7.589689e-05 1.063595e-04 4.703621e-05 5.408563e-05 5.445045e-05 4.703621e-05 4.703621e-05 1.014811e-04
13 2010 BUDGET     13     Ciaran    Cuffe Green 1.177092e-04 1.245277e-04 9.958456e-01 2.071010e-04 1.205545e-04 1.756227e-04 1.780671e-04 1.676308e-04 1.605510e-04 1.619185e-04 1.798436e-04 1.511904e-04 1.838093e-04 1.998252e-04 1.177092e-04 1.205545e-04 1.245277e-04 1.177092e-04 1.177092e-04 1.427885e-03
14 2010 BUDGET     14 Caoimhghin OCaolain    SF 1.901996e-05 2.609234e-05 3.584822e-05 3.832532e-05 2.292068e-05 4.726114e-05 4.250320e-05 4.765145e-05 9.993975e-01 4.519085e-05 2.223008e-05 3.643320e-05 3.468398e-05 3.148667e-05 1.901996e-05 2.292068e-05 2.609234e-05 1.901996e-05 1.901996e-05 4.678058e-05

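Note that you can access the matrix of topic proportions from the fitted model (my_lda_fit20) with $theta. For details, see the Value section of help(stm).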
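To make the shape of that matrix concrete, here is a minimal base R sketch. It assumes a small made-up matrix in place of my_lda_fit20$theta and a hypothetical meta data frame in place of quant_stm$meta. Neither comes from the fitted model above. It shows that each row is one document's topic distribution, and that naming the columns before binding avoids the bare numeric column names seen in the output above.

```r
# Toy stand-in for my_lda_fit20$theta: 3 documents x 4 topics (made-up values)
theta <- matrix(c(0.70, 0.10, 0.10, 0.10,
                  0.05, 0.05, 0.80, 0.10,
                  0.25, 0.25, 0.20, 0.30),
                nrow = 3, byrow = TRUE)

# Hypothetical metadata standing in for quant_stm$meta
meta <- data.frame(name = c("A", "B", "C"), party = c("X", "Y", "X"))

# Each row is one document's topic distribution, so rows sum to 1
stopifnot(all(abs(rowSums(theta) - 1) < 1e-8))

# Give the topic columns readable names before binding them to the metadata
colnames(theta) <- paste0("topic_", seq_len(ncol(theta)))
result <- cbind(meta, theta)
```

With the real objects, the same colnames<- step could be applied to a copy of my_lda_fit20$theta before the cbind call.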

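If you are interested in the topic that contributes the highest proportion to each document, you can use apply over the rows of the proportion matrix: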

result2 <- data.frame(quant_stm$meta,
                      maxtopic = apply(my_lda_fit20$theta, 1, which.max))
result2
year debate number      foren     name party maxtopic
1  2010 BUDGET     01      Brian  Lenihan    FF       11
2  2010 BUDGET     02    Richard   Bruton    FG        7
3  2010 BUDGET     03       Joan   Burton   LAB       12
4  2010 BUDGET     04     Arthur   Morgan    SF       13
5  2010 BUDGET     05      Brian    Cowen    FF       14
6  2010 BUDGET     06       Enda    Kenny    FG       10
7  2010 BUDGET     07     Kieran ODonnell    FG        6
8  2010 BUDGET     08      Eamon  Gilmore   LAB        8
9  2010 BUDGET     09    Michael  Higgins   LAB       16
10 2010 BUDGET     10     Ruairi    Quinn   LAB        2
11 2010 BUDGET     11       John  Gormley Green       15
12 2010 BUDGET     12      Eamon     Ryan Green        4
13 2010 BUDGET     13     Ciaran    Cuffe Green        3
14 2010 BUDGET     14 Caoimhghin OCaolain    SF        9
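One caveat about taking the row-wise maximum: which.max returns the first index at which the maximum occurs, so a document whose weight is split evenly across topics still receives a single label. The sketch below uses a made-up matrix rather than the fitted model, and the names maxprop and assigned are illustrative only. It keeps the winning proportion next to the topic index so that weak assignments remain visible.

```r
# Toy stand-in for my_lda_fit20$theta (made-up values); row 3 is an exact tie
theta <- matrix(c(0.70, 0.10, 0.10, 0.10,
                  0.05, 0.05, 0.80, 0.10,
                  0.40, 0.40, 0.10, 0.10),
                nrow = 3, byrow = TRUE)

maxtopic <- apply(theta, 1, which.max)  # index of the largest proportion per row
maxprop  <- apply(theta, 1, max)        # the largest proportion itself

# Row 3 is labelled with topic 1 because which.max keeps the first maximum
assigned <- data.frame(doc = seq_len(nrow(theta)), maxtopic, maxprop)
```

A low maxprop value, as in the third row here, suggests the single topic label should be read with caution.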
