Why doesn't Google's speaker diarization distinguish two different voices?



I'm puzzled by Google's speaker diarization: it seems unable to distinguish two very different voices (a man's and a woman's).

The attached code processes the publicly available "medical conversation" audio file. It is a clean recording at 16 kHz, yet diarization still identifies both speakers as "speaker 1":

man:
speaker 1 @ 0:00:00.600000   | 0.91 | hello
speaker 1 @ 0:00:02.600000   | 0.99 | good
speaker 1 @ 0:00:02.800000   | 0.99 | morning
woman:
speaker 1 @ 0:00:02.900000   | 0.97 | good
speaker 1 @ 0:00:04.200000   | 0.99 | morning
man:
speaker 1 @ 0:00:04.300000   | 0.99 | so
speaker 1 @ 0:00:06.100000   | 0.99 | tell
speaker 1 @ 0:00:06.700000   | 0.99 | me
speaker 1 @ 0:00:06.800000   | 0.99 | what's
speaker 1 @ 0:00:07.200000   | 0.99 | going
speaker 1 @ 0:00:07.300000   | 0.99 | on

I've tried various SpeakerDiarizationConfig parameters, but with no real improvement (one such variation is sketched after the code below).

Note also that although this example uses the asynchronous recognizer, I get equally poor results with the streaming recognizer.

Does anyone have tips for getting better results?

from google.cloud import speech


def transcribe_gcs(gcs_uri):
    """Asynchronously transcribes the audio file specified by the gcs_uri."""
    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(uri=gcs_uri)

    # https://cloud.google.com/speech-to-text/docs/multiple-voices#speech_transcribe_diarization_beta-python
    diarization_config = speech.SpeakerDiarizationConfig(
        enable_speaker_diarization=True,
        min_speaker_count=1,
        max_speaker_count=3,
    )
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
        enable_word_confidence=True,
        diarization_config=diarization_config,
        # model="phone_call",
    )

    operation = client.long_running_recognize(config=config, audio=audio)
    print("Waiting for operation to complete...")
    response = operation.result(timeout=90)

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # print(f"is_final: {result.is_final}")
        for n, alt in enumerate(result.alternatives):
            print(f"Alternative {n}")
            print(f"Transcript:\n {alt.transcript}")

        # The first alternative is the most likely one for this portion.
        alt = result.alternatives[0]
        print(f"Transcript 0: {alt.transcript}")
        print(f"Confidence 0: {alt.confidence}")
        for word in alt.words:
            print(f"speaker {word.speaker_tag} @ {str(word.start_time):16} | {word.confidence:2.2} | {word.word}")
        print()


if __name__ == "__main__":
    audio_uri = "gs://cloud-samples-data/speech/medical_conversation_2.wav"
    transcribe_gcs(audio_uri)
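
For reference, one of the variations I tried pinned the expected speaker count to exactly two, since the recording is known to contain two speakers; only the changed config is shown:

# Variation: fix the speaker count instead of giving a 1-3 range,
# since the clip contains exactly two speakers.
diarization_config = speech.SpeakerDiarizationConfig(
    enable_speaker_diarization=True,
    min_speaker_count=2,
    max_speaker_count=2,
)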

I had a similar problem where diarization was not working, but it turned out that each voice was on its own channel, so I had to distinguish the speakers by channel:

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    enable_automatic_punctuation=True,
    enable_separate_recognition_per_channel=True,
    audio_channel_count=2,
    language_code="en-US",
)
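
With per-channel recognition enabled, each result carries a channel_tag identifying the audio channel it came from, which you can use directly as the speaker label. A minimal reading loop (a sketch; it assumes the client, config, and audio objects from the question):

# Attribute transcripts to speakers by channel instead of by diarization tags.
# Assumes a two-channel recording with one voice per channel.
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=90)

for result in response.results:
    alt = result.alternatives[0]
    # channel_tag is the channel this result was recognized from
    print(f"speaker {result.channel_tag}: {alt.transcript}")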

This option works for me. You can also use an enhanced model, which takes longer to process but can give better results:

config = speech.RecognitionConfig(
    ...
    use_enhanced=True,
    model="phone_call",
)
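
For completeness, a combined config merging the two snippets above might look like this (a sketch; I have not tested this exact combination):

# Combined config: per-channel recognition plus the enhanced phone_call model.
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    audio_channel_count=2,
    enable_separate_recognition_per_channel=True,
    enable_automatic_punctuation=True,
    language_code="en-US",
    use_enhanced=True,    # requires a model with an enhanced variant
    model="phone_call",   # tuned for telephony-style two-speaker audio
)

As far as I know, use_enhanced only takes effect when the selected model (such as phone_call) has an enhanced variant.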
