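I am trying to implement the winnowing algorithm for document fingerprinting in R.
I am working from http://www.ida.liu.se/~TDDC03/oldprojects/2005/final-projects/prj10.pdf
My question:
How do I get hash values for the n-grams, and how do I select which of those hashes to keep?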
nGrams <- c("adoru", "dorun", "orunr", "runru", "unrun", "nrunr", "runru",
            "unrun", "nruna", "runad", "unado", "nador", "adoru", "dorun",
            "orunr", "runru", "unrun")
It seems that
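library(digest)
# Hash every n-gram; digest() returns the CRC32 as a hex string
v <- sapply(nGrams, digest, algo = "crc32")
uv <- unique(v)
# as.numeric() parses "0x..." strings as doubles, so hashes above 0x7fffffff
# do not overflow R's 32-bit integers (as.hexmode() fails on those)
(as.numeric(paste0("0x", uv)) - 1) %% 4 == 0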
would be a good start. (The CRC32 values I get always seem to be odd, so 1 has to be subtracted first.)
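
For the selection step, here is a minimal sketch of how I understand winnowing: slide a window of w consecutive hashes over the sequence and keep the minimum of each window, taking the rightmost one on ties and recording each position only once. The names toy_hash and winnow and the window size w = 4 are my own assumptions, not taken from the paper. toy_hash is only a base-R stand-in so the example runs without extra packages; the CRC32 values from digest, converted to numbers, could be passed to winnow in the same way.

```r
# Toy polynomial hash (a stand-in, NOT CRC32): combines the character codes
# and reduces modulo a prime so the value stays exactly representable.
toy_hash <- function(s, base = 31, mod = 1000003) {
  h <- 0
  for (code in utf8ToInt(s)) h <- (h * base + code) %% mod
  h
}

# Winnowing selection (hypothetical helper): for every window of w consecutive
# hashes keep the minimum, preferring the rightmost occurrence on ties.
# A position that was already selected by an earlier window is not added again.
winnow <- function(hashes, w = 4) {
  n <- length(hashes)
  if (n < w) return(data.frame(pos = integer(0), hash = numeric(0)))
  picked <- integer(0)
  for (start in seq_len(n - w + 1)) {
    win <- hashes[start:(start + w - 1)]
    # 1-based position (in the full sequence) of the rightmost window minimum
    idx <- start + max(which(win == min(win))) - 1
    if (!(idx %in% picked)) picked <- c(picked, idx)
  }
  data.frame(pos = picked, hash = hashes[picked])
}

nGrams <- c("adoru", "dorun", "orunr", "runru", "unrun", "nrunr", "runru",
            "unrun", "nruna", "runad", "unado", "nador", "adoru", "dorun",
            "orunr", "runru", "unrun")
hashes <- vapply(nGrams, toy_hash, numeric(1), USE.NAMES = FALSE)
fingerprints <- winnow(hashes, w = 4)
print(fingerprints)
```

Keeping the position next to each selected hash matters because the same hash value can be selected at several places in the document, and the positions are what let you locate a match afterwards.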