Audio Fingerprinting

How do computers know that every sound is different?

An audio fingerprint is defined as “A distinct and recognizable mathematical representation of audio”.

In the video clip, is the overlapping waveform from the roaring baby elephant, Benson Boone yelling the chorus to “Beautiful Things”, or something else entirely?

If you clicked on the “something else entirely” link, you were greeted with a 30-second explanation of audio fingerprinting.

Another great read is this NPR article, “Voice 'Fingerprints' Change Crime-Solving”.

Why it works

We can thank Avery Wang for his paper “An Industrial-Strength Audio Search Algorithm”, which is core to the Shazam application.

To help understand how this works, I’ll use a spectrogram to break down and visualize how computers “hear” sounds.

This is how a baby elephant’s roar is interpreted as data

As you can see, the baby elephant roaring has a distinct structure and pattern.

Now look at the “Please… Stay!” in the chorus of Benson Boone’s “Beautiful Things”. The pattern of sound is distinct irrespective of the additional noise present during his Grammy performance.

Benson Boone singing “Please… Stay!”, by contrast, has a separate and completely different structure and pattern.

The spectrograms are 3 seconds in length. Slice the 3 seconds into 0.1-second “slices”, then assign a number to the bright spots at different frequencies and at different time intervals.
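To make the slicing step concrete, here is a minimal Python sketch. It is my own toy illustration, not Shazam's code: the 1000 Hz sample rate, the synthetic two-tone "recording", and the function names are all assumptions, and it keeps only the single brightest frequency per slice, where a real system would keep several.

```python
import cmath
import math

SAMPLE_RATE = 1000    # samples per second (assumed for the demo; real audio is often 44100)
SLICE_SECONDS = 0.1   # slice length taken from the text


def magnitude_spectrum(samples):
    """Naive DFT magnitudes for bins 0..N/2. Slow, but fine for tiny inputs."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(total))
    return mags


def brightest_spots(signal, sample_rate=SAMPLE_RATE, slice_seconds=SLICE_SECONDS):
    """Return one (slice_index, frequency_hz) pair per slice: its loudest frequency."""
    size = round(sample_rate * slice_seconds)
    spots = []
    for i in range(len(signal) // size):
        chunk = signal[i * size:(i + 1) * size]
        mags = magnitude_spectrum(chunk)
        # Skip bin 0 (the constant offset) and take the strongest remaining bin.
        peak_bin = max(range(1, len(mags)), key=mags.__getitem__)
        spots.append((i, peak_bin * sample_rate / size))
    return spots


def tone(freq_hz, seconds, sample_rate=SAMPLE_RATE):
    """Hypothetical helper: a pure sine tone standing in for real audio."""
    count = round(seconds * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(count)]


# A made-up 3-second "recording": 1.5 s at 100 Hz, then 1.5 s at 250 Hz.
signal = tone(100, 1.5) + tone(250, 1.5)
spots = brightest_spots(signal)
print(spots[:2], spots[-2:])
```

The 3 seconds become 30 slices, and each slice is reduced to a pair of numbers: when it happened and which frequency was brightest.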

Computers are really good at numbers. Once the peaks and valleys are converted into usable data, it’s trivial for the phone in your pocket to find patterns in the numbers.
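Here is a rough sketch of what "finding patterns in the numbers" can look like. It is loosely inspired by the peak-pairing idea in Wang's paper, heavily simplified, and should not be read as Shazam's implementation. The peak values for both tracks are made up for illustration, and `hashes`, `build_index`, `best_match`, and `fan_out` are hypothetical names of my own.

```python
from collections import Counter


def hashes(peaks, fan_out=3):
    """Pair each peak with the next few peaks.

    Each pair becomes a key (freq1, freq2, time_gap) tagged with the
    time of the first peak.
    """
    peaks = sorted(peaks)
    out = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            out.append(((f1, f2, t2 - t1), t1))
    return out


def build_index(tracks):
    """Map every hash key to the (track name, time) places where it occurs."""
    index = {}
    for name, peaks in tracks.items():
        for key, t in hashes(peaks):
            index.setdefault(key, []).append((name, t))
    return index


def best_match(index, clip_peaks):
    """Vote for (track, time offset); many hashes agreeing on one offset is a match."""
    votes = Counter()
    for key, clip_t in hashes(clip_peaks):
        for name, track_t in index.get(key, []):
            votes[(name, track_t - clip_t)] += 1
    if not votes:
        return None
    (name, offset), count = votes.most_common(1)[0]
    return name, offset, count


# Made-up (slice_index, frequency_hz) peaks, purely for illustration.
tracks = {
    "baby elephant roar": [(0, 300), (1, 320), (2, 280), (3, 350), (4, 310), (5, 290)],
    "Beautiful Things": [(0, 440), (1, 660), (2, 550), (3, 880),
                         (4, 440), (5, 660), (6, 990), (7, 770)],
}
index = build_index(tracks)

# A short clip taken from slices 3 to 6 of the song, plus one stray noise peak.
clip = [(0, 880), (1, 440), (2, 660), (3, 990), (1, 123)]
result = best_match(index, clip)
print(result)
```

The stray noise peak produces hashes that match nothing, so it simply does not vote. That is the intuition for why a clip recorded in a noisy room can still line up with the right track.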

Expanding use cases

Shazam is the most common and recognizable use case. The same algorithm and methods have use cases outside of commercial and consumer endeavors:

  • Detecting whether a bad cough points to a Covid-19 infection.

    Everyone has a distinct cough. Adding to the complexity of diagnosis are the “types” of coughs that our brains try to interpret subjectively. Automated analysis of this kind is orders of magnitude faster and can be deployed against large crowds.

  • Listening for bird songs to identify endangered species during their migrations.

    When birds migrate, they travel in the thousands, with each individual making a sound to communicate to their flock and other individuals. To us it’s just noise. Invisible to us is the audio fingerprint captured by computers.

  • Flagging copyrighted audio in uploaded content. Many content creators are familiar with copyright infringement notices sent erroneously because of errors in detection.


When computers convert audio into numbers, patterns invisible to human ears become apparent. Where we hear a noisy mess of overlapping sounds, a computer sees distinct numerical fingerprints, allowing it to identify specific sounds even in chaotic environments, whether it's a song in a crowded bar or a bird call in a noisy forest.

