An audio fingerprint is defined as “A distinct and recognizable mathematical representation of audio”.
In the video clip, the overlapping waveform could be from the roaring baby elephant, Benson Boone yelling the chorus to Beautiful Things, or something else entirely.
If you clicked on the “something else entirely” link, you were greeted with a 30-second explanation of audio fingerprinting.
Another great read is this NPR article, “Voice 'Fingerprints' Change Crime-Solving”.
Why it works
We can thank Avery Wang, whose paper “An Industrial-Strength Audio Search Algorithm” describes the algorithm at the core of the Shazam application.
To help explain how this works, I’ll use a spectrogram to break down and visualize how computers “hear” sounds.

As you can see, the baby elephant roaring has a distinct structure and pattern.

Whereas Benson Boone singing “Please… Stay!” has a separate and completely different structure and pattern.
Each spectrogram is 3 seconds long. Slice the 3 seconds into 0.1-second “slices”, then assign a number to each bright spot according to its frequency and the time interval it falls in.
Computers are really good at numbers. Once the peaks and valleys are converted into useable computer data, it’s trivial for the phone in your pocket to find patterns in the numbers.
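The slicing step above can be sketched in a few lines of Python. This is a minimal illustration, not Shazam's actual code: the 2000 Hz sample rate, the synthetic two-tone signal, and the function names are all assumptions made for the example, and it keeps only the single brightest frequency per slice where a real system would keep several. It uses a slow textbook DFT from the standard library so it runs without extra packages; real code would use an FFT library.

```python
import cmath
import math

SAMPLE_RATE = 2000    # samples per second (assumed; low so the plain-Python DFT stays fast)
SLICE_SECONDS = 0.1   # slice length described in the text


def magnitude_spectrum(samples):
    """Textbook DFT: return the magnitude of each frequency bin below Nyquist."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        total = 0j
        for t, x in enumerate(samples):
            total += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(total))
    return mags


def peak_frequencies(signal, sample_rate=SAMPLE_RATE, slice_seconds=SLICE_SECONDS):
    """Return one (slice_index, loudest_frequency_hz) pair per slice."""
    slice_len = int(round(sample_rate * slice_seconds))
    peaks = []
    for start in range(0, len(signal) - slice_len + 1, slice_len):
        mags = magnitude_spectrum(signal[start:start + slice_len])
        brightest = max(range(len(mags)), key=mags.__getitem__)
        peaks.append((start // slice_len, brightest * sample_rate / slice_len))
    return peaks


def tone(freq_hz, seconds, sample_rate=SAMPLE_RATE):
    """Synthetic test sound: a pure sine wave."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(int(sample_rate * seconds))]


# Three seconds of made-up audio: 1.5 s at 300 Hz, then 1.5 s at 700 Hz.
signal = tone(300, 1.5) + tone(700, 1.5)
fingerprint = peak_frequencies(signal)
print(fingerprint[:3])   # [(0, 300.0), (1, 300.0), (2, 300.0)]
```

The result is 30 pairs of numbers, one per slice, which is the kind of "useable computer data" a phone can compare against other recordings.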
Expanding use cases
Shazam is the most common and recognizable use case. The same algorithm and methods also apply outside of commercial and consumer products:
Flagging a bad cough as a possible Covid-19 infection.
Everyone has a distinct cough. Adding to the complexity of diagnosis are the “types” of coughs that our brains try to subjectively interpret. This type of analysis is orders of magnitude faster than a person listening and can be deployed across large crowds.
Listening for bird songs to identify endangered species during their migrations.
When birds migrate, they travel in the thousands, with each individual making a sound to communicate with its flock and other individuals. To us it’s just noise. Invisible to us is the audio fingerprint captured by computers.
The technique is not flawless, though: many content creators are familiar with copyright infringement notices sent in error because of a false detection.
When computers convert audio into numbers, patterns that human ears cannot pick out become apparent. Where we hear a noisy mess of overlapping sounds, a computer sees distinct numerical fingerprints, allowing it to identify specific sounds even in chaotic environments, whether it's a song in a crowded bar or a bird call in a noisy forest.
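To make the "crowded bar" case concrete, here is a rough sketch of the matching idea in the spirit of Wang's paper: pair nearby peaks into small hashes, then look for a stored track whose matching hashes all line up at the same time offset. The track names, peak values, `fan_out` setting, and function names are hypothetical, chosen only for illustration, and the peaks are written as (slice number, frequency in Hz) pairs.

```python
from collections import Counter


def hashes(peaks, fan_out=3):
    """Pair each peak with the next few peaks: ((f1, f2, time_gap), t1)."""
    out = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            out.append(((f1, f2, t2 - t1), t1))
    return out


def build_index(tracks):
    """Map every hash to the (track name, time) positions where it occurs."""
    index = {}
    for name, peaks in tracks.items():
        for h, t in hashes(peaks):
            index.setdefault(h, []).append((name, t))
    return index


def best_match(index, sample_peaks):
    """Vote for (track, offset) pairs; a true match piles up at one offset."""
    votes = Counter()
    for h, t_sample in hashes(sample_peaks):
        for name, t_track in index.get(h, []):
            votes[(name, t_track - t_sample)] += 1
    if not votes:
        return None
    (name, offset), count = votes.most_common(1)[0]
    return name, offset, count


# Made-up peak lists standing in for two known recordings.
tracks = {
    "elephant": [(0, 120), (1, 90), (2, 150), (3, 110), (4, 80), (5, 140)],
    "song": [(0, 440), (1, 660), (2, 550), (3, 880), (4, 740), (5, 990)],
}
index = build_index(tracks)

# A short clip taken from the middle of "song", plus one stray noise peak.
clip = [(0, 550), (0, 3000), (1, 880), (2, 740)]
print(best_match(index, clip))   # ('song', 2, 3)
```

The stray 3000 Hz peak produces hashes that match nothing in the index, so it adds no votes, while the three genuine hashes all agree that the clip starts two slices into the song. That agreement on a single offset is what lets this style of matching tolerate background noise.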