Google Speech-to-Text v2 Context/Hints Phrase Didn't Help for Homophone


I'm having a problem using speech context (called speech adaptation hints in v2) with Google STT v2. I add the phrase "u" as a hint, but the audio is still recognized as "you". The same audio content gives me a "u" transcription with v1, but not with v2. Here is the related documentation for v2: https://cloud.google.com/speech-to-text/v2/docs/reference/rest/v2/projects.locations.recognizers#TranscriptNormalization

Here's the base64 audio content: https://codepen.io/trisamsul/pen/gOEyEvR. You can try to decode it here: https://base64.guru/converter/decode/audio

Here's my request payload for v2:

URL: https://{url}/v2/projects/{project_name}/locations/{stt_location}/recognizers/_:recognize
METHOD: POST
BODY PAYLOAD:

{
    "config": {
        "model": "short",
        "languageCodes": [
            "en-US"
        ],
        "autoDecodingConfig": {},
        "adaptation": {
            "phraseSets": [
                {
                    "inlinePhraseSet": {
                        "phrases": [
                            {
                                "value": "u",
                                "boost": 10
                            }
                        ],
                        "boost": 10
                    }
                }
            ]
        }
    },
    "content": <base64_audio>
}
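
For anyone who wants to reproduce this, here is a minimal Python sketch that builds the same v2 body as a dictionary. The function name `build_v2_recognize_body` and the two-byte dummy audio are made up for illustration; only the field names come from the payload above, and the request still has to be sent with whatever HTTP client and credentials you use.

```python
import base64
import json


def build_v2_recognize_body(audio_bytes: bytes, phrase: str, boost: float) -> dict:
    """Build the JSON body shown above, with one inline phrase hint."""
    return {
        "config": {
            "model": "short",
            "languageCodes": ["en-US"],
            "autoDecodingConfig": {},
            "adaptation": {
                "phraseSets": [
                    {
                        "inlinePhraseSet": {
                            # Same shape as the payload above: a per-phrase
                            # boost plus a boost on the phrase set itself.
                            "phrases": [{"value": phrase, "boost": boost}],
                            "boost": boost,
                        }
                    }
                ]
            },
        },
        # The REST body carries the audio as a base64 string.
        "content": base64.b64encode(audio_bytes).decode("ascii"),
    }


# Dummy audio bytes, only to show the shape of the serialized body.
body = build_v2_recognize_body(b"\x00\x01", "u", 10)
print(json.dumps(body, indent=4))
```

Building the body in code makes it easy to change only the phrase or the boost between requests while everything else stays identical.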

Here's the response that I got:

{
    "metadata": {
        "totalBilledDuration": "2s"
    },
    "results": [
        {
            "alternatives": [
                {
                    "transcript": "you",
                    "confidence": 0.71810806
                }
            ],
            "resultEndOffset": "1.590s",
            "languageCode": "en-us"
        }
    ]
}

I tried changing the boost value across the range 0 to 20, but the result doesn't change. I even tried a boost value of -1, which the documentation says should produce an error (reference), but I didn't get an error. So I'm not sure the phrase is even being processed.
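
The boost sweep can be sketched roughly like this in Python. `send` is a hypothetical callable that posts the request for a given boost and returns the parsed JSON response; `fake_send` is only a stand-in that mimics what I observed (always "you"), so the snippet runs without network access.

```python
from typing import Callable, Dict, Iterable


def sweep_boosts(send: Callable[[float], dict], boosts: Iterable[float]) -> Dict[float, str]:
    """Call send(boost) for each boost and record the top transcript."""
    transcripts: Dict[float, str] = {}
    for boost in boosts:
        response = send(boost)
        results = response.get("results", [])
        alternatives = results[0].get("alternatives", []) if results else []
        transcripts[boost] = alternatives[0].get("transcript", "") if alternatives else ""
    return transcripts


def hints_have_effect(transcripts: Dict[float, str]) -> bool:
    """True if at least two boost values produced different transcripts."""
    return len(set(transcripts.values())) > 1


def fake_send(boost: float) -> dict:
    # Stand-in for the real HTTP call; it ignores the boost entirely.
    return {"results": [{"alternatives": [{"transcript": "you"}]}]}


observed = sweep_boosts(fake_send, range(0, 21, 5))
print(observed, hints_have_effect(observed))
```

If every boost value maps to the same transcript, as it does for me, the sweep by itself cannot tell whether the hint is ignored or simply too weak to change the result.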

Here's the v1 request and response as a reference:

URL: https://speech.googleapis.com/v1/speech:recognize
BODY PAYLOAD:
{
    "config": {
        "encoding": "WEBM_OPUS",
        "languageCode": "en",
        "sampleRateHertz": 48000,
        "speechContexts": {
            "phrases": [
                "u"
            ]
        }
    },
    "audio": {
        "content": <base64_audio>
    }
}

Here's the response for v1:

{
    "results": [
        {
            "alternatives": [
                {
                    "transcript": "U",
                    "confidence": 0.8393447
                }
            ],
            "resultEndTime": "1.230s",
            "languageCode": "en-us"
        }
    ],
    "totalBilledTime": "2s",
    "requestId": "1673882512317147501",
    "usingLegacyModels": true
}
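
To compare the two responses programmatically, a small Python sketch like the following can be used. The helper names `top_alternative` and `phrase_matched` are hypothetical, and the two dictionaries are trimmed copies of the responses shown above. The comparison is case-insensitive because v1 returns "U" for the hint "u".

```python
def top_alternative(response: dict):
    """Return (transcript, confidence) of the first alternative, or (None, None)."""
    for result in response.get("results", []):
        for alternative in result.get("alternatives", []):
            return alternative.get("transcript"), alternative.get("confidence")
    return None, None


def phrase_matched(response: dict, phrase: str) -> bool:
    """True if the top transcript equals the hinted phrase, ignoring case."""
    transcript, _ = top_alternative(response)
    return transcript is not None and transcript.strip().lower() == phrase.lower()


# Trimmed copies of the v2 and v1 responses from this question.
v2_response = {"results": [{"alternatives": [{"transcript": "you", "confidence": 0.71810806}]}]}
v1_response = {"results": [{"alternatives": [{"transcript": "U", "confidence": 0.8393447}]}]}

for label, response in (("v2", v2_response), ("v1", v1_response)):
    print(label, top_alternative(response), phrase_matched(response, "u"))
```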

Is there a way to check that the phrase is being processed and working, and ultimately, how can I make v2 behave the same as v1?
