Building a predictive model from scratch in Python

I have a body of text that I am analyzing with Python in order to build a predictive model that can generate human-like text.

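For this task, I build a dictionary that maps every word in the input to a second dictionary. The second dictionary holds each word that follows it along with the number of times that happened, so I can make a weighted choice.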

In pseudocode:

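import random
# dict is the word table described above, not the built-in type
dict['foo'] = {'bar': 3, 'barbar': 1, 'baz': 4}
prev_word = dict['foo']
# random.choices returns a list, so take its only element
nextword = random.choices(list(prev_word.keys()), weights=list(prev_word.values()))[0]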
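The table described above could be built and sampled roughly as follows. This is only a minimal sketch: the names `build_table` and `next_word` and the sample sentence are made up for illustration, and whitespace splitting stands in for whatever tokenization is actually used.

```python
import random
from collections import Counter, defaultdict

END = '///ending///'  # end marker borrowed from the post

def build_table(text):
    """Map each word to a Counter of the words that follow it."""
    words = text.split() + [END]
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table

def next_word(table, word, rng=random):
    """Pick a follower of `word`, weighted by how often it was seen."""
    followers = table[word]
    return rng.choices(list(followers), weights=list(followers.values()))[0]

table = build_table("the cat jumps to the mat")
print(dict(table['the']))  # {'cat': 1, 'mat': 1}
```

Calling `next_word(table, 'the')` then returns either 'cat' or 'mat' with equal probability, since each was seen once after 'the'.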

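Although the approach is basic, it works well, so I tried to improve it by keeping the predictions made for the previous word and letting them influence the next word: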

# pseudocode: add_dict and predict are helper functions not shown here
dict[0]['foo'] = {'bar': 3, 'barbar': 1, 'baz': 4}
while word != '///ending///':
    for n in range(len(dict)):
        remember = dict[n][prev_word]
    del remember[0]                              # drop the slot for the word just emitted
    remember.append({})                          # pad so the list keeps its length
    semantics = semantics / 2                    # each turn every value gets reduced by half
    semantics = add_dict(remember, dict[word])   # and added to the predictions
    word = predict(semantics, word)
    output.append(word)
    remember = semantics
print(output)


#### so if I have the word cat, the next word can be jumps and the one after that can be to:
dict['cat'] = [{'jumps': 5}, {'to': 4}]
#### and the words that follow jumps are to and then the:
dict['jumps'] = [{'to': 3}, {'the': 6}]
#### the weights used for the prediction after jumps would be:
semantics = [{'to': 7}, {'the': 6}]
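The carry-over step in the cat/jumps example can be sketched as below. The helpers `shift` and `add_dicts` and the `decay` parameter are assumptions made for illustration, not the code from the post. With `decay=1` the sketch reproduces the numbers in the example, and `decay=0.5` corresponds to the "reduced by half" comment.

```python
def shift(remembered):
    """Drop the first slot and pad the end with an empty dict."""
    return remembered[1:] + [{}]

def add_dicts(carried, current, decay=1):
    """Slot by slot, add decayed carried-over counts to the current word's counts."""
    merged = []
    for old, new in zip(carried, current):
        slot = dict(new)  # copy so the table itself is not modified
        for word, count in old.items():
            slot[word] = slot.get(word, 0) + count * decay
        merged.append(slot)
    return merged

table = {
    'cat': [{'jumps': 5}, {'to': 4}],
    'jumps': [{'to': 3}, {'the': 6}],
}

# After emitting 'jumps', cat's remaining predictions move up one slot
semantics = add_dicts(shift(table['cat']), table['jumps'])
print(semantics)  # [{'to': 7}, {'the': 6}]
```

Note that every word the earlier word predicted is carried forward here, whether or not the current word has ever been followed by it.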

Surprisingly, though, this does not work as well as considering only the next word. In the last case, the expected output is

"cat jumps to the"

but it often produces

"cat jumps to at"

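which did not happen nearly as often with the earlier, more basic implementation. So is there something wrong with my new approach, or is it just a bug somewhere in my code?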

I mean, is making predictions beyond the next word a bad approach?

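Mostly solved: the problem was that I included all of the predictions carried over from the second-to-last word when choosing after the last word, which added noise. The solution is to count only the predictions that the second-to-last word and the last word have in common, and then add all of the last word's remaining predictions.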

next[1]['black']
{'jumps':3,'writes':2}
word
'cat'
next[0]['cat']
{'jumps':2, 'scratch':1}
add(next[1]['black'],next[0]['cat'])
{'jumps':5, 'scratch':1}
result
'black cat jumps'

Instead of:

add(next[1]['black'],next[0]['cat'])
{'jumps':5, 'scratch':1, 'writes':1}
result
'black cat writes' ### Which makes less sense, and could make no sense at all
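One way to express the corrected merge is sketched below. The function name `merge_predictions` is hypothetical, and the `add` function used in the post may well be written differently. The sketch keeps every follower of the last word and adds the older word's count only where both words agree.

```python
def merge_predictions(older, latest):
    """Keep the latest word's followers; boost those the older word also predicted."""
    return {word: count + older.get(word, 0) for word, count in latest.items()}

black = {'jumps': 3, 'writes': 2}   # carried over from 'black'
cat = {'jumps': 2, 'scratch': 1}    # direct followers of 'cat'

merged = merge_predictions(black, cat)
print(merged)  # {'jumps': 5, 'scratch': 1}
```

Because 'writes' never follows 'cat' in the table, it is dropped, so 'black cat writes' can no longer be generated.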
