Generating synonyms or similar words using BERT word embeddings



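I want to generate synonyms or similar words using BERT word embeddings, and I have started working with BERT for this. It has to be done in Java for later software integration, so I chose easy-bert (https://github.com/robrua/easy-bert).

It looks like I can get word embeddings this way: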

try(Bert bert = Bert.load(new File("com/robrua/nlp/easy-bert/bert-uncased-L-12-H-768-A-12"))) {
    float[][] embedding = bert.embedTokens("A sequence");
    float[][][] embeddings = bert.embedTokens("Multiple", "Sequences");
}

Do you know how I can get similar words from these word embeddings?

Thanks for your help!
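One common baseline for turning embeddings into "similar words", not taken from either answer on this page, is to rank candidate words by cosine similarity to the target vector. The sketch below assumes you already have one `float[]` per candidate word (for example, rows taken from `embedTokens`); the class `EmbeddingNeighbors` and its methods are hypothetical names, not part of easy-bert. Keep in mind that BERT vectors are contextual, so the same word gets a different vector in a different sentence.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of easy-bert.
final class EmbeddingNeighbors {

    /** Cosine similarity between two vectors of equal length. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0; // treat an all-zero vector as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the k candidate words whose vectors are closest to the query. */
    static List<String> nearest(float[] query, Map<String, float[]> candidates, int k) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, float[]> e : candidates.entrySet()) {
            scored.add(new AbstractMap.SimpleEntry<String, Double>(
                    e.getKey(), cosine(query, e.getValue())));
        }
        // Highest similarity first.
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, scored.size()); i++) {
            result.add(scored.get(i).getKey());
        }
        return result;
    }
}
```

This does a linear scan over all candidates, which is fine for a small vocabulary; a large one would probably need an approximate nearest-neighbour index instead.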

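A closely related task is lexical substitution, which is evaluated on the LS07 and LS14 benchmarks. A researcher achieved state-of-the-art results on these benchmarks using BERT, so the paper may interest you: https://www.aclweb.org/anthology/P19-1328.pdf

The authors describe their approach as follows:

"applying dropout to the target word's embedding for partially masking the word, allowing BERT to take balanced consideration of the target word's semantics and contexts for proposing substitute candidates, and then validating the candidates based on their substitution's influence on the global contextualized representation of the sentence."

I don't know how to reproduce the same results, because the implementation is not publicly available. But here is a hint: embedding dropout can be used to generate substitute candidates.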
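To make the hint a little more concrete, here is a minimal sketch of the masking step alone: zeroing a random fraction of the dimensions of one embedding vector. It is not the paper's implementation (which, as noted, is not public). The class name `EmbeddingDropout`, the `rate` parameter, and the choice not to rescale the surviving values are all assumptions. Feeding the masked vector back into BERT's input layer is not shown, and nothing here assumes easy-bert exposes a way to do that.

```java
import java.util.Random;

// Hypothetical helper illustrating dropout on a single embedding vector.
final class EmbeddingDropout {

    /**
     * Returns a copy of the embedding in which each dimension is set to zero
     * with probability `rate`. The input array is left unchanged.
     */
    static float[] dropout(float[] embedding, double rate, Random rng) {
        if (rate < 0.0 || rate > 1.0) {
            throw new IllegalArgumentException("rate must be in [0, 1]");
        }
        float[] masked = new float[embedding.length];
        for (int i = 0; i < embedding.length; i++) {
            // nextDouble() is in [0, 1), so rate 0 keeps everything
            // and rate 1 drops everything.
            masked[i] = rng.nextDouble() < rate ? 0f : embedding[i];
        }
        return masked;
    }
}
```

The intuition from the quoted passage is that a partially masked target vector still carries some of the word's meaning, so the model is pushed to propose candidates that fit both the word and its context rather than simply echoing the original word.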

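I developed a method for this using Luminoso. I work for them, so this is partly an advertisement, but it does exactly what you are trying to do.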

https://www.luminoso.com/search

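Luminoso is very good at understanding conversational text such as product reviews, product descriptions, survey results, and trouble tickets. It requires no training or ontology building of any kind, and it builds a language model around your own language. You feed the text of your pages into Luminoso, and it generates a set of synonyms for the concepts used in that text.

As a sample project, I built a search over beauty products from Amazon.com. I will copy just a few of the synonym sets that were generated automatically around a handful of concepts. In total, 17,851 synonyms were generated from this dataset.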

scent, rose-like, not sickeningly, not nauseating, not overwhelming, herb-y, no sweetness, cucumber-y, not too citrus-y, no gardenia, not lemony, pachouli, vanilla-like, fragarance, not spicy, flowerly, musk, perfume-like, floraly, not cloyingly => scent
recommend, recommende, advice, suggestion, highly recommend, suggest, recommeded, recommendation, recommend this product, reccommended, advise, suggest, indicated, suggestion, advice, agree, recommend, say, considering, mentioned => recommend
bottle, no sprayer, 8-oz, beaker, decanter, push-down, dispenser, pipet, pint, not the bottle, no dropper, keg, gallon, jug, pump-top, liter, half-full, decant, tumbler, vial => bottle
eczema, non-steroidal, ulcerative, dematitis, ecsema, Elidel, dermititis, inflammation, pityriasis, hydrocortizone, dyshidrotic, chickenpox, Stelatopia, perioral, rosacea, dry skin, nummular, ecxema, mild-moderate, ezcema => eczema
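Each line above has the shape `term, term, ... => target`, which resembles the Solr-style rule format that Elasticsearch's synonym token filter accepts. If you want to post-process such rules in Java, a minimal parsing sketch could look like the following. The class `SynonymRule` is a hypothetical name, and dropping repeated terms (such as the duplicated "suggest" and "advice" above) is an assumption about what you would want, not something Luminoso is stated to do.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical value class for one "a, b, c => target" rule.
final class SynonymRule {
    final List<String> terms;
    final String target;

    private SynonymRule(List<String> terms, String target) {
        this.terms = terms;
        this.target = target;
    }

    /** Parses a line of the form "term, term, ... => target". */
    static SynonymRule parse(String line) {
        int arrow = line.lastIndexOf("=>");
        if (arrow < 0) {
            throw new IllegalArgumentException("missing '=>' in rule: " + line);
        }
        String target = line.substring(arrow + 2).trim();
        List<String> terms = new ArrayList<>();
        for (String raw : line.substring(0, arrow).split(",")) {
            String term = raw.trim();
            // Skip empty fragments and terms already seen on this line.
            if (!term.isEmpty() && !terms.contains(term)) {
                terms.add(term);
            }
        }
        return new SynonymRule(terms, target);
    }
}
```

For example, `SynonymRule.parse("recommend, advice, suggest, advice => recommend")` yields the terms `recommend`, `advice`, `suggest` with the target `recommend`.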

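This search index contains 800,000 products, so the results are large as well, but the approach also works on small datasets.

Besides the synonym format, you can also put the output directly into Elasticsearch and associate the synonyms for a particular page with that page.

Here is a sample from an Elasticsearch index enriched with the same technique. It was tuned up high, so too many concepts were added, but it is only meant to show you how it finds relationships between concepts.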

{"index": {"_index": "amzbeauty", "_type": "_doc", "_id": "130414089X"}}
{"title": "New Benefit Waterproof Automatic Eyeliner Pen - Black - BAD Gal Liner", "text": "Length : 13.5 cmnColor: Blackn100% Brand new and unused.nSmudge free.nFine-tip. Easy to blend and smooth to applynCan make fine and bold eyeline with new texture and furnishing.nProvide rich and consistant colournLongwearing and waterproofnFregrance Free", "primary_concepts": ["not overpoweringly", "concoction", "equipped", "fine-tip", "water-resistant", "luxuriant", "make", "fixture", "☆", "not lengthen", "washable", "not too heady", "blendable", "doesn't collect", "shade", "niche", "supple", "smudge-proof", "sumptuous", "movable", "black", "over-apply", "quick", "silky", "colored", "sweatproof", "opacity", "accomodate", "fuchsia", "furnishes", "meld", "sturdily", "smear", "inch", "mid-back", "chin-length", "smudge", "alredy", "not cheaply", "long-wearing", "eyeline", "texture", "steady", "no-name", "audacious", "easy", "edgy", "is:A", "marketers", "greys", "decadent", "applicable", "Crease-free", "magenta", "free", "itIn", "stay-true", "racy", "application", "glides", "smooth", "sleek", "taupe", "grainy", "dark", "wealthy", "JP7506CF", "gray", "grayish", "width", "newness", "purfumes", "Lancme", "blackish", "easily", "doesn't smudge", "maroon", "blend", "convenient", "smoother", "Moschino", "long-wear", "mauve", "medium-length", "no raccoon", "revamp", "demure", "richly", "white", "brand", "offers", "lenght", "soft", "doesn't smear", "provide", "provides", "unusable", "eye-liner", "unopened", "straightforward", "silky-smooth", "uniting", "compactness", "bold", "fearless", "mix", "indulgent", "brash", "serviceable", "unmarked", "not musky", "constructed", "racoon", "smoothly", "sealant", "merged", "boldness", "reuse", "unused", "long", "Kors", "effortless", "luscious", "stain", "rich", "discard", "richness", "opulent", "short", "consistency", "fine", "sents", "newfound", "fade-resistant", "mixture", "hue", "sassy", "apply", "fragnance", "heathy", "adventurous", "not 
enthusiastic", "longwearing", "fregrance", "non-waterproof", "empty", "lashline", "simple", "newly", "you'r", "combined", "no musk", "mingle", "waterproof", "painless", "pinkish", "thickness", "clump-free", "gos", "consistant", "color", "smoothness", "name-brand", "new", "smudgeproof", "yaaay", "water-proof", "eyemakeup", "not instant", "spidery", "furnish", "tint", "product", "reapply", "not black", "no globs", "imitators", "blot", "cinch", "uncomplicated", "untouched", "length"], "related_concepts": ["eyeliner", "no goofs", "doesn't smear", "pen", "hundreds"]}
{"index": {"_index": "amzbeauty", "_type": "_doc", "_id": "130414643X"}}
{"title": "Goodskin Labs Eyliplex-2 Eye Life and Circle Reducer - 10ml", "text": "Eyliplex-2 is a dual solution that focuses on the problematic eye area. This breakthrough, 24-hour system from the scientists at good skin pharmacy visibly tightens eye areas while reducing dark circles. 0.34 oz. each. 64% of subjects reported younger looking eyes immediately and a 20% reduction in the appearance of dark circles in clinical studies.", "primary_concepts": ["coloration", "Laboratories", "oncology", "cornea", "undereye", "eye", "immediately", "☆", "teen", "dry-skin", "good", "eyelids", "puffiness", "behold", "research", "temperamental", "dermatological", "breakthrough", "study", "store", "nice", "lasik", "instantaneously", "teenaged", "multi", "rheostat", "dermatology", "chemist", "invisibly", "PhD", "pharmacy", "alredy", "not cheaply", "optional", "pharmacist", "Obagi-C", "topic", "supermarket", "reversible", "studies", "Younger", "medically", "report", "thermo", "tightness", "dual", "eliminate", "researcher", "Minimization", "cutaneous", "hydration", "O2", "taupe", "increase", "moisturization", "dark", "preliminary", "excellent", "Quad", "well", "appearance", "dusky", "quickly", "instantly", "CVS", "Dermal", "great", "revolutionary", "biologist", "epidermis", "blackish", "disclosed", "problem", "youngsters", "murky", "scientific", "teenager", "oz", "dark circles", "clinically", "emphasis", "absorption", "skin", "loosen", "intractable", "technological", "reduction", "clinician", "nutritional", "forthwith", "grocer", "scientifically", "swiftly", "examination", "state-of-the-art", "not acne prone", "zone", "decrease", "younger-looking", "excellently", "troublesome", "system", "radius", "tighten", "FDA", "decent", "noticeably", "WD-40", "clearer", "scientist", "saggy", "significantly", "improvement", "Teamine", "interchangeable", "visible", "visable", "no fine line", "shortly", "minimize", "survey", "problematic", "young", "glance", "racoon", "vicinity", "youthful", 
"exacerbated", "focal", "region", "groundbreaking", "reddish", "focus", "reduce", "increments", "nad", "fasten", "area", "soon", "complexion", "squinting", "look", "grocery", "eyliplex-2", "Eyliplex-2", "subsequently", "even-toned", "bothersome", "eyes", "mitigate", "markedly", "philosophy:you", "difficult", "darkish", "bluish", "satisfactory", "darken", "epidermal", "lessen", "appearence", "ocular", "ergonomically", "diminished", "progression", "purplish", "sun-damaged", "Cellex-C", "visibly", "diagnosis", "drugstore", "under-eye", "apothecary", ":-D", "terrific", "clinical", "oz.", "Endocrinology", "time-released", "Nouriva", "tight", "adolescent", "subject", "eyeballs", "sking", "Pro-Retinol", "aggravate", "younger", "shortcomings", "solution", "assess", "promptly", "teenage", "Kinetin", "24-hour", "Mart", "youth", "visibility", "scientists", "taut", "better", "eyesight", "no dark circles", "not reduce", "photoaging", "Pending"], "related_concepts": ["A22", "A82", "Amazon", "daytime", "HK", "nighttime", "smell", "dark circles", "purchased"]}
{"index": {"_index": "amzbeauty", "_type": "_doc", "_id": "1304146537"}}

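Luminoso uses word embeddings from ConceptNet, which it also developed, and the technology goes well beyond what ConceptNet alone gives you. I am biased, but I am amazed every time I run data through it. It is not free, but then nothing really is, and it does work with absolutely zero pre-training on your data.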
