Running a speech model in TensorFlow: Python array modification



I am trying to run a model that was trained on MFCCs and the Google speech dataset. The model was trained using the first two Jupyter notebooks here.

Now I am trying to run it on a Raspberry Pi with TensorFlow 1.15.2 (note that it was also trained in TF 1.15.2). The model loads, and I get the correct model.summary():

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 15, 15, 32)        160       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 7, 7, 32)          0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 6, 6, 32)          4128      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 3, 3, 32)          0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 2, 2, 64)          8256      
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 1, 1, 64)          0         
_________________________________________________________________
flatten (Flatten)            (None, 64)                0         
_________________________________________________________________
dense (Dense)                (None, 64)                4160      
_________________________________________________________________
dropout (Dropout)            (None, 64)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 65        
=================================================================
Total params: 16,769
Trainable params: 16,769
Non-trainable params: 0
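
The summary only lists output shapes, so it helps to confirm what input the network actually expects before feeding it audio. A quick check on the loaded model (this should print (None, 16, 16, 1), i.e. a batch of 16x16 single-channel MFCC arrays):

print(model.input_shape)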

My program takes in a 1-second strip of audio and outputs a wav file, then opens that file (I am not sure how to work with the data directly) and converts it into a tensor to predict on with the model:

import os
import wave #Audio
import pyaudio #Audio
import time
import matplotlib.pyplot as plt
from math import ceil
import tensorflow as tf
import numpy as np
tf.compat.v1.enable_eager_execution() # Enable eager execution (the default in TF 2.x; opt-in on TF 1.15)
# Load Frozen Model
path = '/home/pi/Desktop/tflite-speech-recognition-master/saved_model_stop'
#print(path)
model = tf.keras.models.load_model(path)
#print(model)
model.summary()

# Pi Hat Config 
RESPEAKER_RATE = 16000 #Hz
RESPEAKER_CHANNELS = 2 # Originally 2 channel audio, slimmed to 1 channel for a 1D array of audio 
RESPEAKER_WIDTH = 2
RESPEAKER_INDEX = 2  # refer to input device id
CHUNK = 1024
RECORD_SECONDS = 1   # Change according to how many seconds to record for
WAVE_OUTPUT_FILENAME = "output.wav" #Temporary file name
WAVFILE = WAVE_OUTPUT_FILENAME #Clean up name
# Pyaudio
p = pyaudio.PyAudio() #To use pyaudio
#words = ["no","off","on","stop","_silence_","_unknown_","yes"] #Words in our model 
word = ["stop","not stop"]
def WWpredict(input_file):
    decoded_audio = decode_audio(input_file)
    #tf.print(decoded_audio, summarize=-1) #print full array
    print(decoded_audio)
    print(decoded_audio.shape)
    prediction = model.predict(decoded_audio, steps=None)
    # NOTE: dense_1 has a single output unit, so np.argmax always returns 0;
    # a threshold on the single output is likely what is wanted here
    guess = words[np.argmax(prediction)]
    print(guess)
def decode_audio(input_file):
    if input_file in os.listdir():
        print("Audio file found:", input_file)

    input_data = tf.io.read_file(input_file)
    print(input_data)
    audio, sample_rate = tf.audio.decode_wav(input_data, RESPEAKER_CHANNELS)
    print(audio)
    print(sample_rate)
    return audio
def record(): #Records 1 second of audio every second and writes a wav file that is overwritten on each call

    stream = p.open(
        rate=RESPEAKER_RATE,
        format=p.get_format_from_width(RESPEAKER_WIDTH),
        channels=RESPEAKER_CHANNELS,
        input=True,
        input_device_index=RESPEAKER_INDEX,
    )

    print("* recording")

    frames = []

    for i in range(0, ceil(RESPEAKER_RATE / CHUNK * RECORD_SECONDS)):
        data = stream.read(CHUNK)
        frames.append(data)

    print("* done recording")

    #print(len(frames), "bit audio:")
    #print(frames)
    #print(int.from_bytes(frames[-1], byteorder="big", signed=True)) #Integer for the last frame

    stream.stop_stream()
    stream.close()

    wf = wave.open(WAVE_OUTPUT_FILENAME, 'wb')
    wf.setnchannels(RESPEAKER_CHANNELS)
    wf.setsampwidth(p.get_sample_size(p.get_format_from_width(RESPEAKER_WIDTH)))
    wf.setframerate(RESPEAKER_RATE)
    wf.writeframes(b''.join(frames))
    wf.close()

while True:
    record()
    WWpredict(WAVFILE)
    time.sleep(1)

Now, when we actually run this, I initially get the following output:

tf.Tensor(
[[ 0.0000000e+00  0.0000000e+00]
[ 0.0000000e+00  0.0000000e+00]
[-3.0517578e-05 -3.0517578e-05]
...
[ 2.2949219e-02  3.6926270e-03]
[ 2.3315430e-02  3.3874512e-03]
[ 2.2125244e-02  4.1198730e-03]], shape=(16384, 2), dtype=float32)
(16384, 2)

This is expected. However, my prediction will not work, because the model needs the input to have dimensions (None, 16, 16, 1). I have no idea how to take this (16384, 2) two-dimensional array and turn it into (16, 16), and then later just add the None and the 1. If anyone knows how to do this, please let me know. (The 16384 samples come from recording 16 chunks of 1024 frames each, and the two columns are the two microphone channels.) Thanks

ValueError: Error when checking input: expected conv2d_input to have 4 dimensions, but got array with shape (16384, 2)
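
Whatever the final reshaping looks like, the model takes a single channel, so the stereo tensor first has to be collapsed to mono. A minimal sketch reusing the decode_audio function above; averaging the two microphone channels with tf.reduce_mean is one reasonable choice, though not necessarily what the training pipeline did:

audio = decode_audio(WAVFILE)         # shape (16384, 2), float32 in [-1, 1]
mono = tf.reduce_mean(audio, axis=1)  # average the two mic channels -> shape (16384,)
samples = mono.numpy()                # plain NumPy array, ready for feature extraction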

We need to create the MFCCs using python_speech_features. This gives us (1, 16, 16), and then we expand the dimensions to get (1, 16, 16, 1).
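
A sketch of that feature step follows. It assumes the 16 kHz stereo wav written by record(), and the MFCC parameters (winlen=0.256, winstep=0.050, numcep=16, nfilt=26, nfft=4096) are assumptions chosen so that one second of audio yields a 16x16 array; they must match whatever the training notebooks used, or the predictions will be meaningless:

import numpy as np
from python_speech_features import mfcc
from scipy.io import wavfile

def wav_to_model_input(input_file):
    rate, signal = wavfile.read(input_file)   # int16 samples, shape (16384, 2)
    if signal.ndim > 1:
        signal = signal.mean(axis=1)          # collapse stereo to mono
    signal = signal[:rate]                    # trim to exactly 1 s (16000 samples)
    # 0.256 s windows stepped by 0.050 s over 1 s of 16 kHz audio give 16 frames,
    # and numcep=16 gives 16 coefficients per frame -> a (16, 16) array
    feats = mfcc(signal, samplerate=rate,
                 winlen=0.256, winstep=0.050,
                 numcep=16, nfilt=26, nfft=4096)
    feats = feats.transpose()                 # keep only if the training pipeline transposed too
    return np.expand_dims(np.expand_dims(feats, 0), -1)  # (16, 16) -> (1, 16, 16, 1)

prediction = model.predict(wav_to_model_input(WAVFILE))

python_speech_features installs with pip; if the notebooks normalized or transposed the MFCCs in a particular way, the identical transform has to be applied here as well.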
