Sending audio data generated by python-sounddevice's RawInputStream to Google Cloud Speech-to-Text for asynchronous recognition

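I'm working on a script that sends data from a microphone to the Google Cloud Speech-to-Text API. I need to access the gRPC API to produce a live readout during recording. Once the recording is finished, I need to access the REST API for a more precise asynchronous recognition.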

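The live streaming part is working. It is based on the quickstart example, but uses python-sounddevice instead of pyAudio. The stream below records cffi_backend_buffer objects into a queue; a separate thread collects these objects, converts them to bytes, and feeds them to the API.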

import queue
import sounddevice


class MicrophoneStream:
    def __init__(self, rate, blocksize, queue_live, queue):
        self.queue = queue
        self.queue_live = queue_live
        self._audio_stream = sounddevice.RawInputStream(
            samplerate=rate,
            dtype='int16',
            callback=self.callback,
            blocksize=blocksize,
            channels=1,
        )

    def __enter__(self):
        self._audio_stream.start()
        return self

    def stop(self):
        self._audio_stream.stop()

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop()
        self._audio_stream.close()

    def callback(self, indata, frames, time, status):
        self.queue.put(indata)
        self.queue_live.put(indata)
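
For context, the consumer side described above (a separate thread that drains the live queue and hands bytes to the API) might look roughly like the sketch below. The helper name `drain_chunks`, the `None` end-of-stream sentinel, and the timeout are all hypothetical and not part of the original script; the actual API call is left out.

```python
import queue


def drain_chunks(q, timeout=1.0):
    """Yield audio chunks from q as bytes until a None sentinel arrives.

    Hypothetical helper: the None sentinel and the timeout are assumptions
    made for this sketch, not part of the original script.
    """
    while True:
        try:
            chunk = q.get(timeout=timeout)
        except queue.Empty:
            return  # producer went quiet; stop iterating
        if chunk is None:
            return  # explicit end-of-stream marker
        # Convert the buffer object to bytes before handing it on.
        yield bytes(chunk)
```

A generator like this can be passed to whatever function builds the streaming requests, so the recognition thread never touches the queue directly.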

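I plan to use the second queue for asynchronous recognition once the recording is complete. However, just sending the byte string as I did with the live recognition does not seem to work: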

from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US',
    max_alternatives=1)
audio_data = []
while not queue.empty():
    audio_data.append(queue.get(False))
audio_data = b"".join(audio_data)
audio = speech.RecognitionAudio(content=audio_data)
response = client.recognize(config=config, audio=audio)

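Since sending byte strings of raw audio data works for streaming recognition, I assume there is nothing wrong with the raw data or the recognition config. Perhaps there is something more to it? I know that if I read the binary data from a *.wav file and send that instead of audio_data, the recognition works. How do I convert raw audio data to PCM WAV so that I can send it to the API?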

It turns out that there were two things wrong with this code.

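It looks like the cffi_backend_buffer objects that I put into the queues behave like pointers to a certain region of memory. If I access them right away, as I do in streaming recognition, it works fine. However, if I collect them in a queue for later use, the buffers they point to get overwritten in the meantime. The solution is to put byte strings into the queues instead: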

    def callback(self, indata, frames, time, status):
        self.queue.put(bytes(indata))
        self.queue_live.put(bytes(indata))
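
The aliasing effect can be reproduced without any audio hardware. In this sketch a `bytearray` stands in for the memory that the audio backend reuses between callbacks, and a `memoryview` stands in for the cffi_backend_buffer object; both stand-ins are assumptions chosen for illustration, not what sounddevice uses internally.

```python
# A reusable buffer, standing in for the memory the audio backend recycles.
backing = bytearray(b"\x01\x02\x03\x04")

view = memoryview(backing)   # refers to the same memory, no copy
snapshot = bytes(view)       # independent copy of the current contents

# Simulate the next callback overwriting the buffer in place.
backing[:] = b"\xff\xff\xff\xff"

print(bytes(view))   # reflects the overwritten data
print(snapshot)      # still holds the original data
```

Queuing `snapshot`-style copies costs one allocation per block, which is negligible at speech sample rates.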

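The asynchronous recognition requires a PCM WAV file with headers. Obviously, my raw audio data does not have them. The solution is to write the data into an in-memory *.wav file, which I did in the following way: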

import io
import wave
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US',
    max_alternatives=1)
# Collect raw audio data
audio_data = []
while not queue.empty():
    audio_data.append(queue.get(False))
audio_data = b"".join(audio_data)
# Convert to a PCM WAV file with headers
file = io.BytesIO()
with wave.open(file, mode='wb') as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(audio_data)
file.seek(0)
audio = speech.RecognitionAudio(content=file.read())
response = client.recognize(config=config, audio=audio)
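
The WAV-wrapping step can also be pulled out into a small function and checked on its own, without calling the API. This is only a sketch: the name `pcm_to_wav_bytes` and its default arguments are hypothetical, chosen to match the 16 kHz, mono, int16 settings used above.

```python
import io
import wave


def pcm_to_wav_bytes(pcm, rate=16000, channels=1, sample_width=2):
    """Wrap raw PCM samples in a WAV container and return the file as bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, mode="wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)  # 2 bytes per sample = int16
        w.setframerate(rate)
        w.writeframes(pcm)
    return buffer.getvalue()


# Round-trip check: one second of 16 kHz mono silence.
pcm = b"\x00\x00" * 16000
wav_bytes = pcm_to_wav_bytes(pcm)
with wave.open(io.BytesIO(wav_bytes), mode="rb") as r:
    assert r.getframerate() == 16000
    assert r.getnframes() == 16000
    assert r.readframes(r.getnframes()) == pcm
```

The returned bytes could then be passed as `content` to `speech.RecognitionAudio` in place of `file.read()`.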
