It has been a while since the last time I found myself working on a project involving music or even audio in general. In fact, I have another project planned revolving around music visualization, but I felt like my memory of the relevant topics has somewhat degraded with time. I needed a brief refresher course with something simpler to build my confidence back, and this is the principal motivation behind this project.

When it comes to simple audio processing tasks, I felt the problem of pitch detection strikes just the right balance of difficulty for my current goals, which made it an easy choice. To summarize: the project's objective is to develop a straightforward pitch detection technique and then demonstrate its use in deriving musical notes from captured whistling audio in real time.


If your browser supports the required capabilities (which it should, unless you have neglected to update it for a long while), you should be able to view this interactive demo.

Digital soundwaves

In essence, digital audio is simply a discretization of measured physical soundwaves into a binary representation in the form of a temporal sequence of samples. The samples are captured at regular time intervals, and each one is a number representing the sound's intensity at the moment of its capture.

The sampling rate is the number of samples we capture per unit time, usually expressed in hertz. The sampling rate matters because it bounds the precision of any further analysis we wish to apply to the sampled audio: the higher the sampling rate, the shorter the temporal gap between consecutive samples, and thus the more accurate our approximation of the original sound. Figure 1 visualizes the sequence formed when sampling a simple sine wave.
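To make the sampling process concrete, here is a minimal sketch. The sample rate, frequency, and duration are arbitrary illustrative choices, and `sample_sine` is a name introduced here, not something from the project:

```python
import math

SAMPLE_RATE = 44_100  # samples per second (Hz); a common audio rate
FREQ = 440.0          # the sine's frequency from figure 1

def sample_sine(freq, duration_s, rate):
    """Sample a sine wave at regular intervals of 1/rate seconds."""
    n = int(duration_s * rate)
    return [math.sin(2 * math.pi * freq * t / rate) for t in range(n)]

samples = sample_sine(FREQ, duration_s=0.01, rate=SAMPLE_RATE)
print(len(samples))  # 441 samples for 10 ms of audio
```

Each element of `samples` is one measurement of the wave's amplitude, spaced 1/44100 of a second apart.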

Figure 1: A sine wave oscillating at 440 Hz.

The sine wave has a very neat, smooth form. In practice, the sound waves produced by musical instruments go through much more intricate motions than those of the sine. Figure 2 visualizes what a piano's middle-C waveform looks like.

Figure 2: A piano's middle-C.

Of course, we already know that the note played in figure 2 is C4, and those of us with a musical background could probably identify it as such by hearing alone, but how could we extract the correct pitch of notes in a piece, given only its sampled waveform (figure 3)?

Figure 3: The scale of C-major played on the piano.

Detecting pitch

Pitch is the perceived frequency of sound. Most sounds are not made up of a single frequency, but rather of many waves at different frequencies and intensities audible on top of each other. Therefore, an initial step towards detecting the pitch of a sound is to first sample it, and then extract the frequencies present, together with their intensities, from the sample sequence.

The frequency domain

This transformation from the time domain (relating time to sound intensity) to the frequency domain (relating a frequency to its intensity) is called the Fourier transform. When the involved signals are discrete, as they are in digital audio, it is called the discrete Fourier transform (or DFT for short). Figure 4 shows the DFT of a sound over the course of its playback.

Figure 4: A visualization of the frequency domain. The horizontal axis denotes frequencies from C3-C6 and the intensity of a given frequency is expressed by its corresponding bar height.

With the DFT, we already see the influence the sampling rate has on our analysis capabilities: with a sampling rate of \(s\), the maximum frequency detectable with the DFT is \(f_{max} = 0.5s\). This frequency is called the Nyquist frequency. The DFT divides the interval \([0, f_{max})\) into \(m\) bins, each associated with a sub-interval \(B\) and an intensity value \(b\):

$$ i \in \{1, 2, \ldots, m\} $$ $$ B_i = [f_{i}, f_{i+1}) \text{ where } f_i = \frac{i-1}{m} f_{max} $$ $$ b_i \in [0, 1] $$

Like the sampling rate, the choice of \(m\) also affects our analysis capabilities. For example, if \(f_{max} = 100\) and we are interested in the difference in intensity between frequencies at 10 Hz and 11 Hz, we would have to choose \(m \geq 100\), because otherwise a single bin would contain both 10 Hz and 11 Hz, and we would not be able to make a distinction between them.

For our purposes, we are interested in analyzing frequencies corresponding to the notes C2-C6. The frequency of C6 is approximately 1050 Hz, so the sampling rate of any standard audio (typically 20-100 kHz) is more than enough to cover all relevant frequencies. In addition, a bin count \(m\) chosen such that \(f_{max} / m \leq 3.9\) ensures we can distinguish between each individual note-pair, as the minimum difference on the C2-C6 range is the one between C2 and C#2, which spans approximately 3.9 Hz.
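The note frequencies above follow from the equal-temperament formula, which a few lines verify. This is a standalone illustration; `note_freq` is a name introduced here, with A4 = 440 Hz as the reference pitch:

```python
def note_freq(semitones_from_a4):
    """Equal-temperament frequency, relative to A4 = 440 Hz."""
    return 440.0 * 2 ** (semitones_from_a4 / 12)

c2 = note_freq(-33)   # C2, 33 semitones below A4
cs2 = note_freq(-32)  # C#2
c6 = note_freq(15)    # C6, 15 semitones above A4
print(round(c2, 2), round(cs2, 2), round(c6, 2))  # 65.41 69.3 1046.5
print(round(cs2 - c2, 2))                         # 3.89, the smallest gap
```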

A naive approach

Assuming we have chosen a sampling rate and a DFT bin count that is suitable for our purposes, an initial approach to detecting the pitch of a sound at a point in time is to assume that it corresponds to the frequency with the greatest intensity at that time. Figure 5 showcases a detector using that strategy.
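The naive strategy amounts to an argmax over the bins followed by converting the winning index back to a frequency. A minimal sketch, where `naive_pitch` is a hypothetical helper and `mags` holds one intensity per DFT bin:

```python
def naive_pitch(mags, rate, n):
    """Return the frequency of the bin with the greatest intensity.

    n is the number of time-domain samples, so bin k covers
    frequencies starting at k * rate / n.
    """
    k = max(range(len(mags)), key=mags.__getitem__)
    return k * rate / n
```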

Figure 5: Pitch detection on the sound of the C-major scale being played on the piano.

The results of this naive approach are not bad, but there is room for improvement on two fronts:

  1. The detection keeps on going even when the audio is practically silent.
  2. When multiple bins have high intensities that are very close to each other, the detection may get jittery as they sporadically trade places as the bin with maximal intensity.


We deal with the first issue by treating all bins whose intensity is smaller than some threshold value as having no value at all, i.e. as if \(b_i = 0\) instead of its actual (albeit small) value. To overcome jitter, we buffer the detection results over some time window, and output only the note which appears most often in that buffer.
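The thresholding fix for the first issue can be sketched as follows. The threshold value and the names `threshold_bins` and `detect` are assumptions for illustration, not the project's actual API:

```python
THRESHOLD = 0.1  # an assumed cutoff; would need tuning to the input level

def threshold_bins(mags, threshold=THRESHOLD):
    """Treat bins quieter than the threshold as if b_i = 0."""
    return [b if b >= threshold else 0.0 for b in mags]

def detect(mags, rate, n, threshold=THRESHOLD):
    """Detect a pitch, or report None when the audio is effectively silent."""
    gated = threshold_bins(mags, threshold)
    if max(gated) == 0.0:
        return None  # every bin fell below the threshold: silence
    k = max(range(len(gated)), key=gated.__getitem__)
    return k * rate / n
```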

For example: imagine that we perform a note detection at a rate of 100 Hz, and that a 6-spot window lists our past six detections as the following:

$$ (C4, C4, E5, C4, C4, C4) $$

To consider E5 as a plausible pitch would not be a safe bet here. It is more likely that the bin corresponding to E5 has just temporarily overtaken C4 as the dominant frequency, but only for a brief time. If at a future time E5 were to become the actual pitch, our window would eventually take a similar form to the following:

$$ (C4, C4, E5, E5, E5, E5) $$

That is the point in time where it is alright to say that E5 has replaced C4 as the dominant pitch.
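The buffered vote described above reduces to taking the most frequent note in the window. A minimal sketch, with `dominant_note` as a name introduced here:

```python
from collections import Counter

def dominant_note(window):
    """Return the note appearing most often in the detection window."""
    return Counter(window).most_common(1)[0][0]

print(dominant_note(["C4", "C4", "E5", "C4", "C4", "C4"]))  # C4
print(dominant_note(["C4", "C4", "E5", "E5", "E5", "E5"]))  # E5
```

The lone E5 in the first window is outvoted, while the second window lets E5 take over, matching the two examples above.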

Of course, the chosen window size is important. The rate of detections for the animation on this page roughly depends on your browser's animation frame rate, but if we generally assume that the rate is in the range of 30-60 frames per second, then for our purposes a window with 12 spots does moderately well. Figure 6 showcases a detector using our refined approach.

Figure 6: Refined pitch detection on the sound of the C-major scale being played on the piano.

On my machine, figure 6 produces the detected sequence:

$$ (C4, G5, D4, E4, F4, G4, A4, B4, C5) $$

...which is not bad at all!

Source code

You can view the source code on GitHub.
