I'm new to deep learning, and I'm trying to build an encoder.
import re
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

trainFromTextFile = "train.FROM"
trainToTextFile = "train.TO"
trainFromText = open(trainFromTextFile, 'r', encoding='utf-8').read().lower()
trainToText = open(trainToTextFile, 'r', encoding='utf-8').read().lower()
trainFromSentence = re.split('\n', trainFromText)
trainToSentence = re.split('\n', trainToText)
trainFromWords = re.split(' |\n', trainFromText)
trainToWords = re.split(' |\n', trainToText)
print('Found %s sentences from TrainFrom Text' %len(trainFromSentence))
print('Found %s sentences from TrainTo Text' %len(trainToSentence))
print('Found %s words from TrainFrom Text' %len(trainFromWords))
print('Found %s words from TrainTo Text' %len(trainToWords))
trainInput = trainFromSentence[0:1000]
trainTarget = trainToSentence[0:1000]
max_len = 100 # Cut comments after 100 words
max_words = 10000 # Consider the top 10,000 words in the dataset
tokenizerInput = Tokenizer(num_words=max_words)
tokenizerInput.fit_on_texts(trainInput)
from keras.preprocessing.text import text_to_word_sequence
wordInput = [text_to_word_sequence(s) for s in trainInput]  # text_to_word_sequence is a module-level function, not a Tokenizer method, and takes one string at a time
sequencesInput = tokenizerInput.texts_to_sequences(trainInput)
sequencesInput = pad_sequences(sequencesInput, maxlen=max_len) #Pad so all the arrays are the same size
Inputindex = tokenizerInput.word_index
Inputcount = tokenizerInput.word_counts
nInput = len(tokenizerInput.word_counts) + 1
print("Train From File:\n")
print('Found %s sentences.' %len(trainInput))
print('Found %s sequences.' %len(sequencesInput))
print('Found %s unique tokens.' % len(Inputindex))
print('Found %s unique words.' % len(Inputcount))
This is what I have so far. I'd like to know how to take the data I have and build an encoder that consumes it.
This (link) is generally how the different types of autoencoders are built. But judging from your question, you seem to be interested in sequence-to-sequence prediction with an encoder-decoder model, which is mainly based on recurrent neural networks. A tutorial can be found here (link).
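To make the encoder-decoder idea concrete, here is a minimal sketch using the Keras functional API. The vocabulary sizes, `latent_dim`, and the assumption that your padded sequences (like `sequencesInput`) feed the encoder while shifted target sequences feed the decoder are all placeholders for illustration, not values taken from your data:

```python
import numpy as np
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

num_encoder_tokens = 10000  # e.g. your max_words / nInput
num_decoder_tokens = 10000  # vocabulary size of the target side
latent_dim = 256            # size of the hidden state (arbitrary choice)

# Encoder: embed the source token ids, run an LSTM, and keep only
# its final hidden and cell states as a summary of the input sentence.
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)
encoder_states = [state_h, state_c]

# Decoder: generate the target sequence, initialized with the
# encoder's final states (this is the encoder-decoder link).
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
decoder_seq, _, _ = LSTM(latent_dim, return_sequences=True,
                         return_state=True)(dec_emb, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(decoder_seq)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
```

Training would then look like `model.fit([sequencesInput, decoder_input_data], decoder_target_data, ...)`, where `decoder_input_data` is the target sequence shifted right by one step (teacher forcing) and `decoder_target_data` is the unshifted target; at inference time the decoder is run one token at a time, which the linked tutorial walks through.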