Does Gensim's Word2Vec implementation look beyond sentence boundaries when examining contexts?
I found this question, which provides some evidence that sentence order may matter (though the effect could also be the result of different random initializations).

I want to process the reddit comment dump for my project, but the strings extracted from the JSON will be unsorted and come from very different subreddits and topics, so I don't want to muddle the contexts:

{"gilded":0,"author_flair_text":"Male","author_flair_css_class":"male","retrieved_on":1425124228,"ups":3,"subreddit_id":"t5_2s30g","edited":false,"controversiality":0,"parent_id":"t1_cnapn0k","subreddit":"AskMen","body":"I can't agree with passing the blame, but I'm glad to hear it's at least helping you with the anxiety. I went the other direction and started taking responsibility for everything. I had to realize that people make mistakes including myself and it's gonna be alright. I don't have to be shackled to my mistakes and I don't have to be afraid of making them. ","created_utc":"1420070668","downs":0,"score":3,"author":"TheDukeofEtown","archived":false,"distinguished":null,"id":"cnasd6x","score_hidden":false,"name":"t1_cnasd6x","link_id":"t3_2qyhmp"}
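For reference, a minimal sketch of pulling the comment text out of such a JSON-lines dump (the field name `body` is taken from the sample record above; the file path is a placeholder):

```python
import json

def iter_comment_bodies(path):
    # Each line of the dump is one JSON object; yield its "body" text.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)["body"]
```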

So: do neighboring sentences matter to Gensim Word2Vec? Should I reconstruct the whole comment-tree structure, or can I simply extract a "bag of sentences" and train the model on that?

The corpus gensim Word2Vec expects is a sequence of lists-of-tokens. (For example, a plain list of token lists will work, but for larger corpora you will usually want to supply a re-iterable object that streams the text examples from persistent storage, to avoid holding the entire corpus in memory.)
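A minimal sketch of such a re-iterable corpus (the file path and whitespace tokenization are placeholder assumptions; all gensim requires is that iterating the object yields lists of tokens, and that it can be iterated more than once):

```python
class LineCorpus:
    """Re-iterable corpus: each call to __iter__ re-opens the file,
    so multiple passes are possible without keeping it all in memory."""
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield line.split()  # naive whitespace tokenization
```

Because `__iter__` re-opens the file each time, the same object can serve both the vocabulary-building pass and the training passes.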

Word-vector training only considers contexts within a single text example, that is, within one list of tokens. So if two consecutive examples are...

['I', 'do', 'not', 'like', 'green', 'eggs', 'and', 'ham']
['Everybody', 'needs', 'a', 'thneed']

...there is no influence between 'ham' and 'Everybody' across these examples. (Contexts exist only within each individual example.)
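To illustrate the point, here is a simplified re-implementation of the windowing (not gensim's actual code): center/context pairs are generated independently per example, so no pair ever spans two token lists:

```python
def context_pairs(sentences, window=2):
    # Emit (center, context) pairs; the sliding window never
    # crosses the boundary between two token lists.
    pairs = []
    for tokens in sentences:
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, tokens[j]))
    return pairs
```

Even with a large window, the last word of one example and the first word of the next never form a training pair.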

Still, the ordering of examples relative to one another can have subtle effects on quality. For example, you don't want all the examples of word X to occur early in the corpus and all the examples of word Y to occur late: that kind of clumping prevents the interleaved co-training that achieves the best results.

So if your corpus arrives in any kind of sorted order (clumped together by topic, author, size, or language), it is usually beneficial to perform one initial shuffle to remove that clumping. (Re-shuffling again, e.g. between training passes, usually offers negligible further benefit.)
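If the corpus fits in memory, that one upfront shuffle is trivial; a sketch, assuming `sentences` is a list of token lists (the fixed seed is only for reproducibility):

```python
import random

def shuffled_corpus(sentences, seed=42):
    # One upfront shuffle to break up topic/author clumping.
    # Re-shuffling between later training passes adds little.
    shuffled = list(sentences)  # copy; leave the original order intact
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

For corpora too large for memory, the same effect can be approximated by shuffling the source file once on disk before training.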
