Replicating Postgres pg_trgm text similarity scores in R



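Does anyone know how to replicate, in R, the Postgres trigram similarity score returned by the pg_trgm function similarity(text, text)? I am using the stringdist package and would rather compute these scores in R, on a matrix of text strings from a .csv file, than run a bunch of PostgreSQL queries.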

Running similarity(string1, string2) in Postgres gives me a numeric score between 0 and 1.

I tried using the stringdist package to get a score, but I think I still need to divide the result of the code below by something.

stringdist(string1, string2, method="qgram",q = 3 )

Is there a way to replicate the pg_trgm score with the stringdist package, or some other way to do this in R?

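An example would be getting the similarity score between a book's description and the description of a genre such as science fiction. For example, suppose I have two book descriptions and one genre description: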

book 1 = "Area X has been cut off from the rest of the continent for decades. Nature has reclaimed the last vestiges of human civilization. The first expedition returned with reports of a pristine, Edenic landscape; the second expedition ended in mass suicide, the third expedition in a hail of gunfire as its members turned on one another. The members of the eleventh expedition returned as shadows of their former selves, and within weeks, all had died of cancer. In Annihilation, the first volume of Jeff VanderMeer's Southern Reach trilogy, we join the twelfth expedition.
     The group is made up of four women: an anthropologist; a surveyor; a psychologist, the de facto leader; and our narrator, a biologist. Their mission is to map the terrain, record all observations of their surroundings and of one another, and, above all, avoid being contaminated by Area X itself.
     They arrive expecting the unexpected, and Area X delivers—they discover a massive topographic anomaly and life forms that surpass understanding—but it’s the surprises that came across the border with them and the secrets the expedition members are keeping from one another that change everything."
book 2= "From Wall Street to Main Street, John Brooks, longtime contributor to the New Yorker, brings to life in vivid fashion twelve classic and timeless tales of corporate and financial life in America
     What do the $350 million Ford Motor Company disaster known as the Edsel, the fast and incredible rise of Xerox, and the unbelievable scandals at GE and Texas Gulf Sulphur have in common? Each is an example of how an iconic company was defined by a particular moment of fame or notoriety; these notable and fascinating accounts are as relevant today to understanding the intricacies of corporate life as they were when the events happened.
     Stories about Wall Street are infused with drama and adventure and reveal the machinations and volatile nature of the world of finance. John Brooks’s insightful reportage is so full of personality and critical detail that whether he is looking at the astounding market crash of 1962, the collapse of a well-known brokerage firm, or the bold attempt by American bankers to save the British pound, one gets the sense that history repeats itself.
     Five additional stories on equally fascinating subjects round out this wonderful collection that will both entertain and inform readers . . . Business Adventures is truly financial journalism at its liveliest and best."
genre 1 = "Science fiction is a genre of fiction dealing with imaginative content such as futuristic settings, futuristic science and technology, space travel, time travel, faster than light travel, parallel universes, and extraterrestrial life. It often explores the potential consequences of scientific and other innovations, and has been called a "literature of ideas".[1] Authors commonly use science fiction as a framework to explore politics, identity, desire, morality, social structure, and other literary themes."

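How can I use an R script to get a similarity score between each book's description and the description of the science fiction genre, the way pg_trgm does?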

How about something like this?

library(textcat)
?textcat_xdist
# Compute cross-distances between collections of n-gram profiles.
round(textcat_xdist(
  list(
       text1="hello there",
       text2="why hello there",
       text3="totally different"
       ),
     method="cosine"),
3)
#      text1 text2 text3
#text1 0.000 0.078 0.731
#text2 0.078 0.000 0.739
#text3 0.731 0.739 0.000
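
Note that the textcat values above are distances between n-gram profiles (0 means identical), not the pg_trgm score. If the goal is a number closer to what similarity() reports, here is a minimal base-R sketch of the rule described in the pg_trgm documentation: lowercase the text, treat non-alphanumeric characters as word separators, pad each word with two leading spaces and one trailing space, collect the set of unique trigrams, and divide the number of shared trigrams by the number of trigrams in the union. The function names pg_trigrams and pg_similarity are made up for this example, and the book and genre strings are short placeholders. Results may still differ slightly from Postgres because of locale handling and because Postgres returns a single-precision value.

```r
# Hypothetical helper: unique pg_trgm-style trigrams of a single string.
pg_trigrams <- function(x) {
  # Lowercase, then split on runs of non-alphanumeric characters
  words <- strsplit(tolower(x), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(character(0))
  # Each word gets two spaces prefixed and one space suffixed
  padded <- paste0("  ", words, " ")
  grams <- unlist(lapply(padded, function(w) {
    n <- nchar(w)
    substring(w, 1:(n - 2), 3:n)  # all length-3 windows
  }))
  unique(grams)
}

# Hypothetical helper: shared trigrams divided by trigrams in the union.
pg_similarity <- function(a, b) {
  ta <- pg_trigrams(a)
  tb <- pg_trigrams(b)
  n_union <- length(union(ta, tb))
  if (n_union == 0) return(0)
  length(intersect(ta, tb)) / n_union
}

pg_similarity("word", "two words")  # 4 shared / 11 in union, about 0.364

# Scoring several descriptions against one genre description
# (placeholder strings, not the full texts from the question)
books <- c(book1 = "an expedition into a strange area",
           book2 = "tales of corporate and financial life")
genre <- "fiction about space travel and extraterrestrial life"
scores <- sapply(books, pg_similarity, b = genre)
print(round(scores, 3))
```

This also answers the "divide by something" part of the question: the ratio is over sets of unique trigrams, so the raw count difference from method="qgram" is not the right numerator. In stringdist, 1 - stringdist(a, b, method="jaccard", q=3) is the closer analogue, although it does not apply the per-word padding, so its values will not match pg_trgm exactly.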
