One-hot encoding sentences

Here is my implementation of one-hot encoding:

%reset -f
import numpy as np
import pandas as pd

sentences = []
s1 = 'this is sentence 1'
s2 = 'this is sentence 2'
sentences.append(s1)
sentences.append(s2)

def get_all_words(sentences):
    unf = [s.split(' ') for s in sentences]
    all_words = []
    for f in unf:
        for f2 in f:
            all_words.append(f2)
    return all_words

def get_one_hot(s, s1, all_words):
    flattened = []
    one_hot_encoded_df = pd.get_dummies(list(set(all_words)))
    for a in [np.array(one_hot_encoded_df[s]) for s in s1.split(' ')]:
        for aa in a:
            flattened.append(aa)
    return flattened

all_words = get_all_words(sentences)
print(get_one_hot(sentences, s1, all_words))
print(get_one_hot(sentences, s2, all_words))

This returns:

[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]

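As you can see, a long sparse vector is returned even for these short sentences. It seems the encoding is happening at the character level rather than the word level? How do I correctly one-hot encode the words in these sentences?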

I think the encoding should be:

s1 -> 1, 1, 1, 1
s2 -> 1, 1, 1, 0

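Character-level encoding

The encoding is not actually happening at the character level. Consider the loop:

for f in unf:
    for f2 in f:
        all_words.append(f2)

Here f is the list of words produced by s.split(' '), so f2 iterates over whole words, not characters. The loop only flattens the nested lists, keeping duplicates. You can rewrite the whole function more compactly, deduplicating at the same time: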

def get_all_words(sentences):
    unf = [s.split(' ') for s in sentences]
    return list(set([word for sen in unf for word in sen]))
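
As a quick check of what this function returns, here is a minimal sketch using the two sentences from the question. Sorting the result is my own addition (it is not part of the function above) so that the printed order is reproducible, since the ordering of a set can differ between runs.

```python
def get_all_words(sentences):
    # Split each sentence on spaces, then flatten and deduplicate the words.
    unf = [s.split(' ') for s in sentences]
    return list(set(word for sen in unf for word in sen))

sentences = ['this is sentence 1', 'this is sentence 2']
vocab = sorted(get_all_words(sentences))  # sorted only to get a stable order
print(vocab)  # ['1', '2', 'is', 'sentence', 'this']
```

Every entry is a whole word, so the vocabulary has 5 items for these two sentences.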

Correct one-hot encoding

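This loop

for a in [np.array(one_hot_encoded_df[s]) for s in s1.split(' ')]:
    for aa in a:
        flattened.append(aa)

is actually building one very long vector: each word contributes its full column, so a four-word sentence with a five-word vocabulary yields 4 × 5 = 20 values. Let's look at the output of one_hot_encoded_df = pd.get_dummies(list(set(all_words))):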

   1  2  is  sentence  this
0  0  1   0         0     0
1  0  0   0         0     1
2  1  0   0         0     0
3  0  0   1         0     0
4  0  0   0         1     0
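
To illustrate where the 20 values in the question come from, here is a small sketch that uses plain NumPy instead of the DataFrame. The sorted vocabulary and the names eye, per_word and flattened are assumptions made for this illustration only; the point is that concatenating one 5-entry vector per word gives 4 × 5 entries.

```python
import numpy as np

vocab = ['1', '2', 'is', 'sentence', 'this']  # assumed (sorted) vocabulary
eye = np.eye(len(vocab), dtype=int)           # row i is the one-hot vector of vocab[i]

words = 'this is sentence 1'.split(' ')
per_word = [eye[vocab.index(w)] for w in words]  # one 5-entry vector per word
flattened = np.concatenate(per_word)             # what the nested loop effectively builds

print(len(flattened))  # 20
```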

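The loop above picks the corresponding columns from this DataFrame and appends them to the output flattened. My suggestion is to simply use pandas to select the relevant columns, sum them, and clip the result to 0 or 1 to obtain a single encoded vector for the sentence: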

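def get_one_hot(s, s1, all_words):
    # One row per vocabulary word, one indicator column per word.
    one_hot_encoded_df = pd.get_dummies(list(set(all_words)))
    # Select the columns for the words in s1, sum across them per row, clip to 0/1.
    return one_hot_encoded_df[s1.split(' ')].T.sum().clip(0, 1).values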

The output will be:

[0 1 1 1 1]
[1 1 0 1 1]

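These are the vectors for your two sentences, respectively. Here is how to interpret them: from the row index of the one_hot_encoded_df DataFrame, we know that position 0 stands for 2, position 1 for this, position 2 for 1, and so on. So the output [0 1 1 1 1] means every item in the bag of words except 2, which you can confirm against the input 'this is sentence 1'. Note that the row order comes from a set, so it may differ between runs.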
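
If a reproducible word order matters, one option is to sort the vocabulary before encoding. The following is only a sketch under that assumption: it uses plain NumPy instead of pd.get_dummies, and the helper names build_vocab and encode_sentence are hypothetical, not taken from the code above.

```python
import numpy as np

def build_vocab(sentences):
    # Sorted so that each word always maps to the same position.
    return sorted({word for s in sentences for word in s.split(' ')})

def encode_sentence(sentence, vocab):
    # 1 where the vocabulary word occurs in the sentence, 0 otherwise.
    present = set(sentence.split(' '))
    return np.array([1 if word in present else 0 for word in vocab])

sentences = ['this is sentence 1', 'this is sentence 2']
vocab = build_vocab(sentences)
print(vocab)                                 # ['1', '2', 'is', 'sentence', 'this']
print(encode_sentence(sentences[0], vocab))  # [1 0 1 1 1]
print(encode_sentence(sentences[1], vocab))  # [0 1 1 1 1]
```

With a sorted vocabulary, position i of each vector always refers to vocab[i], so the vectors can be compared across runs.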
