Query by Example (searching an audio database using an audio query)

I have been given an audio database consisting of speech recordings. The speech can be in any language, so transcripts are not available. I will be given one audio query, and I want to find which audio file in the database matches it best.

I don't know how to approach this problem. Can somebody suggest how I could proceed?

Answer by Jon Nordby

Query by example for speech content is a task known as Spoken Content Retrieval. There are several approaches to this task. One strategy would be to first do Speech Recognition (speech to text) and then Text Retrieval. However, this approach can suffer from the Out of Vocabulary problem, where phrases unknown to the speech recognition system cannot be represented (and thus cannot be queried for).

A very comprehensive review of the alternative approaches can be found in Spoken Content Retrieval - Beyond Cascaded Speech Recognition and Text Retrieval by Lin-shan Lee and Hung-yi Lee (2022).

One strategy is to apply Audio Content Retrieval to speech. There is a very good, practical explanation of these techniques in Fundamentals of Music Processing: Content-Based Audio Retrieval. It is presented in the context of music, but many of the key concepts transfer to speech.

A simple system for spoken content retrieval could be:

  • Convert all audio (query and database) to a compact time-series representation for speech, for example MFCC.
  • Use a similarity function to match the query against all candidates in the database, for example Dynamic Time Warping.
  • Sort the matches by similarity score and return the top-K results.

This should work reasonably well for speech with a high signal-to-noise ratio and for databases small enough to fit into memory. Every query is compared against every recording, so query time grows roughly as O(n_recordings × recording_length), and standard Dynamic Time Warping adds a further factor of the query length. In Python this could be implemented using the librosa library.
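The three steps above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the function names, the 16 kHz sample rate, the 13 MFCC coefficients and the normalisation by query length are all assumptions made for the example. The matching step is written as a plain-Python subsequence variant of Dynamic Time Warping, so that a short query can match anywhere inside a longer recording. librosa is only needed for feature extraction and is imported lazily.

```python
import math


def extract_mfcc(path, n_mfcc=13):
    """Load an audio file and return its MFCC frames as a list of vectors.

    Requires librosa. The sample rate and n_mfcc are illustrative choices.
    """
    import librosa  # imported here so the matching code runs without it

    samples, sample_rate = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=samples, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.T.tolist()  # one vector per frame: (n_frames, n_mfcc)


def subsequence_dtw_cost(query, candidate):
    """Cost of the best alignment of `query` to any contiguous part of `candidate`.

    Both arguments are sequences of feature vectors of equal dimension.
    The total cost is divided by the query length (an assumption made here
    so that scores for queries of different lengths are roughly comparable).
    """
    n, m = len(query), len(candidate)
    # First query frame may start at any candidate frame (free start).
    prev = [math.dist(query[0], candidate[j]) for j in range(m)]
    for i in range(1, n):
        curr = [0.0] * m
        for j in range(m):
            best = prev[j]  # advance the query only
            if j > 0:
                # advance both, or advance the candidate only
                best = min(best, prev[j - 1], curr[j - 1])
            curr[j] = math.dist(query[i], candidate[j]) + best
        prev = curr
    # Last query frame may end at any candidate frame (free end).
    return min(prev) / n


def top_k_matches(query, database, k=5):
    """Return the k (name, cost) pairs with the lowest cost, best first.

    `database` maps a recording name to its feature sequence.
    """
    scored = [(name, subsequence_dtw_cost(query, feats))
              for name, feats in database.items()]
    return sorted(scored, key=lambda item: item[1])[:k]
```

With real audio, `database` would be built by calling `extract_mfcc` once per file and keeping the results in memory. The pure-Python loop is slow for long recordings; librosa also provides `librosa.sequence.dtw`, which accepts `subseq=True` and would likely be a faster replacement for the matching function.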

In order to evaluate your approach, you should set up a database and a set of test queries with known-good matches in the database. Then run these queries to establish a performance baseline.
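One possible way to score such a test set is sketched below. The choice of metrics (hit rate within the top K and mean reciprocal rank) and the data layout are assumptions for illustration; other retrieval metrics may suit your use case better.

```python
def evaluate(ranked_results, relevant, k=5):
    """Score retrieval results against known-good matches.

    ranked_results: maps query id -> list of recording ids, best match first.
    relevant:       maps query id -> set of recording ids that are correct.
    Returns (hit_rate_at_k, mean_reciprocal_rank).
    """
    hits = 0
    reciprocal_ranks = []
    for query_id, ranking in ranked_results.items():
        targets = relevant[query_id]
        # Hit: at least one correct recording appears in the top k.
        if any(rec in targets for rec in ranking[:k]):
            hits += 1
        # Reciprocal rank of the first correct recording (0 if none found).
        rr = 0.0
        for rank, rec in enumerate(ranking, start=1):
            if rec in targets:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    n = len(ranked_results)
    return hits / n, sum(reciprocal_ranks) / n
```

Running this after every change to the features or the similarity function makes it easier to tell whether the change actually helped.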

As the database of content to be searched grows large, the query time or memory usage of the above approach will become unacceptably high at some point. Contemporary large-scale systems for multimedia retrieval tend to use a Deep Neural Network to learn a vector representation (embedding), combined with Approximate Nearest Neighbours search over an index to get fast query times.
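To illustrate the retrieval step only, here is a minimal sketch of exact nearest-neighbour search by cosine similarity over fixed-length embedding vectors. It assumes the embeddings have already been produced by some model (not shown) and that none of them is the zero vector. The function names are hypothetical. An Approximate Nearest Neighbours index would replace the linear scan in `nearest_recordings` with a faster, approximate lookup.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_recordings(query_embedding, index, k=5):
    """Return the k (name, similarity) pairs most similar to the query.

    `index` maps a recording name to its embedding vector. This is an
    exact linear scan, used here as a stand-in for an approximate index.
    """
    scored = [(name, cosine_similarity(query_embedding, vec))
              for name, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

Because each recording is reduced to a single vector, the per-recording cost no longer depends on the recording length, which is what makes indexing practical at scale.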