Python: variable size changes after the return statement



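I am trying to use word2vec embeddings for a text classification task. Strangely, the value returned from the preprocess() function has a different size from what I see before the return. Does anyone know what is wrong with my code?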

import numpy as np
import pandas as pd
import tensorflow_hub as hub

train_data = [{'corrected': 'have a good day', 'father': 1},
{'corrected': 'i suggest you see this movie', 'father': 1},
{'corrected': 'The afternoon grew so glowering that in the sixth inning the arc lights were turned on--always a wan sight in the daytime, like the burning headlights of a funeral procession. Aided by the gloom, Fisher was slicing through the Sox rookies, and Williams did not come to bat in the seventh. He was second up in the eighth. This was almost certainly his last time to come to the plate in Fenway Park, and instead of merely cheering, as we had at his three previous appearances, we stood, all of us, and applauded.', 'father': 2},
{'corrected': 'worse than any show', 'father': 1},
{'corrected': 'nice movie, so love it', 'father': 2},
{'corrected': "The day I picked my dog up from the pound was one of the happiest days of both of our lives. I had gone to the pound just a week earlier with the idea that I would just 'look' at a puppy. Of course, you can no more just look at those squiggling little faces so filled with hope and joy than you can stop the sun from setting in the evening. I knew within minutes of walking in the door that I would get a puppy… but it wasn't until I saw him that I knew I had found my puppy", 'father': 2}
]
train_data = pd.DataFrame(train_data)

# Load pretrained Word2Vec
embed = hub.load("https://tfhub.dev/google/Wiki-words-250/2")


def get_word_count(essay):
    """
    get the number of vocab in the essay
    """
    return len(essay)


def get_word2vec_enc(essays):
    """
    get word2vec value for each word in sentence.
    concatenate word in numpy array, so we can use it as RNN input
    """
    encoded = []
    for essay in essays:
        tokens = essay.split(" ")
        word2vec_embedding = embed(tokens)
        encoded.append(word2vec_embedding)
    return encoded


def get_padded_encoded_essays(encoded_essays):
    """
    for short essays, we prepend zero padding so all input to RNN has same length,
    for long essays, we truncate it to the first 250 words
    """
    padded_essays_encoding = []
    for enc_essay in encoded_essays:
        if get_word_count(enc_essay) > 250:
            enc_essay[:249]
        else:
            zero_padding_cnt = 250 - enc_essay.shape[0]
            pad = np.zeros((1, 250))
            for i in range(zero_padding_cnt):
                enc_essay = np.concatenate((pad, enc_essay), axis=0)
        padded_essays_encoding.append(enc_essay)
    return padded_essays_encoding


def ses_encode(ses):
    """
    return one hot encoding for Y value
    """
    if ses == 1:
        return [1, 0]  # for high ses
    else:
        return [0, 1]  # for low ses


def preprocess(df):
    """
    encode text value to numeric value
    """
    # encode words into word2vec
    essays = df['corrected'].tolist()
    print("essay length:" + str(len(essays)))

    encoded_essays = get_word2vec_enc(essays)
    padded_encoded_essays = get_padded_encoded_essays(encoded_essays)
    print("padded_encoded_essays length:" + str(len(padded_encoded_essays)))

    # encoded ses
    sess = df['father'].tolist()
    encoded_ses = [ses_encode(ses) for ses in sess]
    X = np.vstack(padded_encoded_essays)
    print("X length:" + str(len(X)))
    Y = np.vstack(encoded_ses)
    return X, Y


train_X, train_Y = preprocess(train_data)
len(train_X)  # it returns 1500
len(train_Y)  # it returns 6

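When I call train_X, train_Y = preprocess(train_data), the three print statements output "essay length:6", "padded_encoded_essays length:6" and "X length:1500". I don't know why np.vstack() changes the size. Is there a way to keep the size the same and have the code run without warnings? (When I leave out np.vstack(), my code has a different problem.)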

Thanks in advance.

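In this line, you are looking at the length of essays:

print("X length:" + str(len(essays)))

However, X is defined as:

X = np.vstack(padded_encoded_essays)

Maybe that is the reason: you are simply printing the length of one thing and returning something else as X.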
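It may also help to look at the shapes directly. The sketch below is not from the original post: it uses random arrays as a stand-in for the output of embed(), and the names pad_or_truncate, fake_encoded, MAX_LEN and EMBED_DIM are made up for illustration. It suggests that np.vstack joins the six (250, 250) arrays along the existing first axis, which would explain the 1500 (6 * 250), whereas np.stack adds a new leading axis and keeps one entry per essay. The helper also assigns the truncated slice, which the enc_essay[:249] line in the question does not do.

```python
import numpy as np

MAX_LEN = 250    # rows (words) kept per essay, as in the question
EMBED_DIM = 250  # embedding width assumed from the (1, 250) pad in the question


def pad_or_truncate(enc_essay, max_len=MAX_LEN):
    """Keep the first max_len rows, then left-pad with zero rows up to max_len."""
    enc_essay = np.asarray(enc_essay)[:max_len]
    pad_rows = max_len - enc_essay.shape[0]
    pad = np.zeros((pad_rows, enc_essay.shape[1]), dtype=enc_essay.dtype)
    return np.concatenate((pad, enc_essay), axis=0)


# Hypothetical stand-in for six encoded essays of different lengths
rng = np.random.default_rng(0)
fake_encoded = [rng.normal(size=(n, EMBED_DIM)) for n in (4, 6, 97, 4, 5, 300)]
padded = [pad_or_truncate(e) for e in fake_encoded]

X_vstack = np.vstack(padded)  # concatenates along the existing axis 0
X_stack = np.stack(padded)    # creates a new leading axis, one slot per essay

print(X_vstack.shape)  # (1500, 250)
print(X_stack.shape)   # (6, 250, 250)
```

If the downstream model expects input shaped (samples, timesteps, features), as Keras recurrent layers typically do, then np.stack(padded_encoded_essays) in preprocess() would keep len(X) equal to the number of essays. Whether that also resolves the other problem mentioned in the question cannot be confirmed from the post.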
