My ultimate goal is to use TTS to convert some Hindi text to audio, and pass that audio to a messaging system that accepts mp3 and ogg; ogg is preferred.
I am on Ubuntu, and my flow for obtaining the audio string is as follows.
- The Hindi text is passed to an API.
- The API returns a JSON containing a key named audioContent:
audioString = response.json()['audio'][0]['audioContent']
- The decoded bytes are obtained with:
decode_string = base64.b64decode(audioString)
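The decode step can be sanity-checked in isolation with a round trip. A minimal sketch, using a fabricated payload in place of the real API response (the RIFF prefix below is just the first bytes of a typical WAV header):

```python
import base64

# Hypothetical stand-in for the audioContent value: base64 text of WAV bytes.
wav_prefix = b"RIFF$\x00\x00\x00WAVEfmt "  # First bytes of a typical WAV file.
audioString = base64.b64encode(wav_prefix).decode("ascii")

# The decode step from the flow above: base64 text -> raw WAV bytes.
decode_string = base64.b64decode(audioString)

print(decode_string[:4])  # b'RIFF' - the payload starts like a WAV/RIFF file.
```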
I am currently converting it to mp3; as you can see, I first write a wav file and then convert it to mp3.
decode_string = base64.b64decode(audioString)
with open("output.wav", "wb") as wav_file:
    wav_file.write(decode_string)
# Convert this to mp3 file
print('mp3file')
song = AudioSegment.from_wav("output.wav")
song.export("temp.mp3", format="mp3")
Is there a way to convert audioString directly to an ogg file, without doing file I/O?
I have tried torchaudio and pyffmpeg to load audioString and convert it, but I could not get it to work.
We can write the WAV data to the FFmpeg stdin pipe, and read the encoded OGG data from the FFmpeg stdout pipe.
My following answer describes how to do it with video, and we can apply the same solution to audio.
管道结构:
 --------------------   Encoded      ---------   Encoded      ----------
| Input WAV encoded  |  WAV data    | FFmpeg  |  OGG data    | Store to |
| stream             |  ----------> | process |  ----------> | BytesIO  |
 --------------------   stdin PIPE   ---------   stdout PIPE  ----------
The implementation is equivalent to the following shell command:
cat input.wav | ffmpeg -y -f wav -i pipe: -acodec libopus -f ogg pipe: > test.ogg
According to Wikipedia, common audio codecs used with the OGG format are Vorbis, Opus, FLAC, and OggPCM (I chose the Opus audio codec).
The example uses the ffmpeg-python module, but it is simply a binding that runs FFmpeg as a sub-process (the FFmpeg CLI must be installed and present in the execution path).
Execute the FFmpeg sub-process, with the stdin pipe as input and the stdout pipe as output:
ffmpeg_process = (
    ffmpeg
    .input('pipe:', format='wav')
    .output('pipe:', format='ogg', acodec='libopus')
    .run_async(pipe_stdin=True, pipe_stdout=True)
)
The input format is set to wav, the output format is set to ogg, and the selected encoder is libopus.
Assuming the audio file is relatively large, we cannot write the entire WAV data at once, because doing so (without "draining" the stdout pipe) causes the program execution to halt.
We have to write the WAV data in chunks from a separate thread, and read the encoded data in the main thread.
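The deadlock comes from both OS pipe buffers filling up at the same time; Popen.communicate avoids it by feeding stdin and draining stdout concurrently. A minimal sketch of that pattern (illustrative only, not this answer's code), using a trivial Python pass-through child in place of FFmpeg:

```python
import subprocess
import sys

# A trivial child process that copies stdin to stdout, standing in for FFmpeg.
child = subprocess.Popen(
    [sys.executable, "-c",
     "import sys, shutil; shutil.copyfileobj(sys.stdin.buffer, sys.stdout.buffer)"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)

payload = b"\x00\x01" * 500_000  # ~1 MB, larger than a typical 64 KB pipe buffer.

# communicate() writes stdin and reads stdout concurrently, so neither pipe
# fills up and blocks the other - no separate writer thread is required.
out, _ = child.communicate(input=payload)

print(out == payload)  # True
```

The writer-thread approach below does the same job explicitly, which also allows processing the encoded output while it is still being produced.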
Here is an example of the "writer" thread:
def writer(ffmpeg_proc, wav_bytes_arr):
    chunk_size = 1024  # Define a chunk size of 1024 bytes (the exact size is not important).
    n_chunks = len(wav_bytes_arr) // chunk_size  # Number of full chunks (without the smaller remainder chunk at the end).
    remainder_size = len(wav_bytes_arr) % chunk_size  # Remainder bytes (assume the total size is not a multiple of chunk_size).
    for i in range(n_chunks):
        ffmpeg_proc.stdin.write(wav_bytes_arr[i*chunk_size:(i+1)*chunk_size])  # Write a chunk of data bytes to the stdin pipe of the FFmpeg sub-process.
    if remainder_size > 0:
        ffmpeg_proc.stdin.write(wav_bytes_arr[chunk_size*n_chunks:])  # Write the remainder of the data bytes to the stdin pipe of the FFmpeg sub-process.
    ffmpeg_proc.stdin.close()  # Closing stdin finishes encoding the data, and closes the FFmpeg sub-process.
The "writer" thread writes the WAV data in small chunks.
The last chunk is smaller (assuming the total length is not a multiple of the chunk size).
Finally, the stdin pipe is closed.
Closing stdin finishes encoding the data, and closes the FFmpeg sub-process.
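The same chunk arithmetic can also be written as a single slicing loop, since a slice past the end simply yields the shorter remainder. A small illustrative sketch (not this answer's code):

```python
def iter_chunks(data, chunk_size=1024):
    """Yield successive chunks of data; the last one may be shorter."""
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]

# 2500 bytes split into 1024-byte chunks: two full chunks plus a remainder.
sizes = [len(chunk) for chunk in iter_chunks(b"\x00" * 2500)]
print(sizes)  # [1024, 1024, 452]
```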
In the main thread, we start the writer thread, and read the encoded "OGG" data from the stdout pipe (in chunks):
thread = threading.Thread(target=writer, args=(ffmpeg_process, wav_bytes_array))
thread.start()

while thread.is_alive():
    ogg_chunk = ffmpeg_process.stdout.read(1024)  # Read a chunk of arbitrary size from the stdout pipe.
    out_stream.write(ogg_chunk)  # Write the encoded chunk to the "in-memory file".
To read the remaining data, we can use ffmpeg_process.communicate():
# Read the last encoded chunk.
ogg_chunk = ffmpeg_process.communicate()[0]
out_stream.write(ogg_chunk) # Write the encoded chunk to the "in-memory file".
Complete code sample:
import ffmpeg
import base64
from io import BytesIO
import threading
# Equivalent shell command
# cat input.wav | ffmpeg -y -f wav -i pipe: -acodec libopus -f ogg pipe: > test.ogg
# Writer thread - write the wav data to FFmpeg stdin pipe in small chunks of 1KBytes.
def writer(ffmpeg_proc, wav_bytes_arr):
    chunk_size = 1024  # Define a chunk size of 1024 bytes (the exact size is not important).
    n_chunks = len(wav_bytes_arr) // chunk_size  # Number of full chunks (without the smaller remainder chunk at the end).
    remainder_size = len(wav_bytes_arr) % chunk_size  # Remainder bytes (assume the total size is not a multiple of chunk_size).
    for i in range(n_chunks):
        ffmpeg_proc.stdin.write(wav_bytes_arr[i*chunk_size:(i+1)*chunk_size])  # Write a chunk of data bytes to the stdin pipe of the FFmpeg sub-process.
    if remainder_size > 0:
        ffmpeg_proc.stdin.write(wav_bytes_arr[chunk_size*n_chunks:])  # Write the remainder of the data bytes to the stdin pipe of the FFmpeg sub-process.
    ffmpeg_proc.stdin.close()  # Closing stdin finishes encoding the data, and closes the FFmpeg sub-process.

# The example reads the WAV data from a file (in the question, assume: decoded_bytes_array = base64.b64decode(audioString)).
with open('input.wav', 'rb') as f:
    wav_bytes_array = f.read()
# Encode as base64 and decode the base64 - assume the encoded and decoded data are bytes arrays (not UTF-8 strings).
dat = base64.b64encode(wav_bytes_array) # Encode as Base64 (used for testing - not part of the solution).
wav_bytes_array = base64.b64decode(dat) # wav_bytes_array applies "decode_string" (from the question).
# Execute FFmpeg sub-process with stdin pipe as input and stdout pipe as output.
ffmpeg_process = (
    ffmpeg
    .input('pipe:', format='wav')
    .output('pipe:', format='ogg', acodec='libopus')
    .run_async(pipe_stdin=True, pipe_stdout=True)
)
# Open in-memory file for storing the encoded OGG file
out_stream = BytesIO()
# Starting a thread that writes the WAV data in small chunks.
# We need the thread because writing too much data to stdin pipe at once, causes a deadlock.
thread = threading.Thread(target=writer, args=(ffmpeg_process, wav_bytes_array))
thread.start()
# Read encoded OGG data from stdout pipe of FFmpeg, and write it to out_stream
while thread.is_alive():
    ogg_chunk = ffmpeg_process.stdout.read(1024)  # Read a chunk of arbitrary size from the stdout pipe.
    out_stream.write(ogg_chunk)  # Write the encoded chunk to the "in-memory file".
# Read the last encoded chunk.
ogg_chunk = ffmpeg_process.communicate()[0]
out_stream.write(ogg_chunk) # Write the encoded chunk to the "in-memory file".
out_stream.seek(0) # Seek to the beginning of out_stream
ffmpeg_process.wait() # Wait for FFmpeg sub-process to end
# Write out_stream to file - just for testing:
with open('test.ogg', "wb") as f:
    f.write(out_stream.getbuffer())
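A quick sanity check on the in-memory result is the Ogg capture pattern: every Ogg page begins with the four ASCII bytes OggS. A sketch using fabricated header bytes here, since the real ones come from the FFmpeg run above:

```python
from io import BytesIO

# Stand-in for the out_stream produced above; a real encoded stream would
# start with an Ogg page header whose capture pattern is the bytes 'OggS'.
out_stream = BytesIO(b"OggS" + b"\x00" * 23)

out_stream.seek(0)
magic = out_stream.read(4)
print(magic == b"OggS")  # True
```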
You can do this with TorchAudio in the following way.
A few notes:
- OPUS support is available via libsox (not available on Windows) or ffmpeg (available on Linux/macOS/Windows).
- In the latest stable release (v0.13), torchaudio.save can encode the OPUS format using libsox. However, the underlying libsox implementation has a bug, so using torchaudio.save for OPUS is not recommended.
- Instead, it is recommended to use StreamWriter from torchaudio.io, which is available since v0.13. (You need to install ffmpeg>=4.1,<5.)
- OPUS only supports 48 kHz.
- OPUS only supports monaural audio. Specifying a num_channels other than 1 does not raise an error, but it produces wrong audio data.
import io
import base64
from torchaudio.io import StreamReader, StreamWriter
# 0. Generate test data
with open("foo.wav", "rb") as file:
    data = file.read()
data = base64.b64encode(data)
# 1. Decode base64
data = base64.b64decode(data)
# 2. Load with torchaudio
reader = StreamReader(io.BytesIO(data))
reader.add_basic_audio_stream(
    frames_per_chunk=-1,  # Decode all the data at once
    format="s16p",  # Use signed 16-bit integers
)
reader.process_all_packets() # Decode all the data
waveform, = reader.pop_chunks() # Get the waveform
# 3. Save to OPUS.
writer = StreamWriter("output.opus")
writer.add_audio_stream(
    sample_rate=48000,  # OPUS only supports 48000 Hz
    num_channels=1,  # OPUS only supports monaural audio
    format="s16",
    encoder_option={"strict": "experimental"},
)
with writer.open():
    writer.write_audio_chunk(0, waveform)