I'm puzzled by Google's speaker diarization: it seems unable to distinguish two very different voices (a man and a woman).
The code below processes the publicly available "medical conversation" audio file. It is a clean recording at 16 kHz, yet diarization still labels both speakers as "speaker 1":
man:
speaker 1 @ 0:00:00.600000 | 0.91 | hello
speaker 1 @ 0:00:02.600000 | 0.99 | good
speaker 1 @ 0:00:02.800000 | 0.99 | morning
woman:
speaker 1 @ 0:00:02.900000 | 0.97 | good
speaker 1 @ 0:00:04.200000 | 0.99 | morning
man:
speaker 1 @ 0:00:04.300000 | 0.99 | so
speaker 1 @ 0:00:06.100000 | 0.99 | tell
speaker 1 @ 0:00:06.700000 | 0.99 | me
speaker 1 @ 0:00:06.800000 | 0.99 | what's
speaker 1 @ 0:00:07.200000 | 0.99 | going
speaker 1 @ 0:00:07.300000 | 0.99 | on
I've tried various SpeakerDiarizationConfig parameters with no real improvement.
Note also that although this example uses the asynchronous recognizer, I get equally poor results with the streaming recognizer.
Does anyone have tips for getting better results?
from google.cloud import speech


def transcribe_gcs(gcs_uri):
    """Asynchronously transcribes the audio file specified by the gcs_uri."""
    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(uri=gcs_uri)

    # https://cloud.google.com/speech-to-text/docs/multiple-voices#speech_transcribe_diarization_beta-python
    diarization_config = speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=1,
        max_speaker_count=3,
    )

    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        enable_word_confidence=True,
        diarization_config=diarization_config,
        # model="phone_call",
    )

    operation = client.long_running_recognize(config=config, audio=audio)

    print("Waiting for operation to complete...")
    response = operation.result(timeout=90)

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        for n, alt in enumerate(result.alternatives):
            print(f"Alternative {n}")
            print(f"Transcript:\n {alt.transcript}")

        # The first alternative is the most likely one for this portion.
        alt = result.alternatives[0]
        print(f"Transcript 0: {alt.transcript}")
        print(f"Confidence 0: {alt.confidence}")
        for word in alt.words:
            print(f"speaker {word.speaker_tag} @ {str(word.start_time):16} | {word.confidence:2.2} | {word.word}")
        print("\n")


if __name__ == "__main__":
    audio_uri = "gs://cloud-samples-data/speech/medical_conversation_2.wav"
    transcribe_gcs(audio_uri)
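One detail worth checking, per the diarization sample linked in the code: with enable_speaker_diarization, the per-word speaker tags are accumulated in the words of the *last* result, so iterating every result (as above) can show unresolved tags. A minimal sketch of reading only the final result; `response` is assumed to be the long_running_recognize result from the code above, and the demo objects below are stand-ins with the same attribute shape:

```python
from types import SimpleNamespace as NS


def final_speaker_words(response):
    """Return (speaker_tag, word) pairs from the last result, where the
    diarization sample reads the per-word speaker tags."""
    words = response.results[-1].alternatives[0].words
    return [(w.speaker_tag, w.word) for w in words]


# Stand-in demo mimicking the attribute shape of a real response:
demo = NS(results=[
    NS(alternatives=[NS(words=[])]),  # earlier, partial result
    NS(alternatives=[NS(words=[
        NS(speaker_tag=1, word="hello"),
        NS(speaker_tag=2, word="good"),
    ])]),
])

print(final_speaker_words(demo))  # [(1, 'hello'), (2, 'good')]
```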
I had a similar problem where diarization didn't work. In my case it was because each voice was on a separate channel, so I had to distinguish the speakers by channel instead:
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    enable_automatic_punctuation=True,
    enable_separate_recognition_per_channel=True,
    audio_channel_count=2,
    language_code="en-US",
)
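With enable_separate_recognition_per_channel set, each result carries a channel_tag, which is how you label the speakers. A minimal sketch of grouping transcripts by channel; the grouping helper is my own, and the demo objects below are made-up stand-ins with the same attribute shape as the API's results:

```python
from types import SimpleNamespace as NS


def transcripts_by_channel(results):
    """Group each result's top-alternative transcript by its channel_tag."""
    grouped = {}
    for result in results:
        grouped.setdefault(result.channel_tag, []).append(
            result.alternatives[0].transcript)
    return grouped


# Stand-in demo mimicking the attribute shape of real results:
demo = [
    NS(channel_tag=1, alternatives=[NS(transcript="hello")]),
    NS(channel_tag=2, alternatives=[NS(transcript="good morning")]),
    NS(channel_tag=1, alternatives=[NS(transcript="tell me what's going on")]),
]

print(transcripts_by_channel(demo))
```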
This option worked for me. You can also use an enhanced model, which takes longer to process but can give better results:
config = speech.RecognitionConfig(
    ...
    use_enhanced=True,
    model="phone_call",
)