How is a human voice unique?

Question

Well, I am quite new to concepts of vocal sounds. From the physics point of view I believe a sound has two basic parameters i.e, frequency and amplitude.

Considering the end sound wave produced by human voice it must have frequency and amplitude as parameters. Well, when a human can speak in multiple frequencies (multiple pitches) and amplitudes (multiple volumes) I was wondering what makes every human voice unique?

Even if two persons produce a sustained note(say a same music note) their voices can be easily distinguished.So why do the voices seem different?

Are there any others parameters that distinguishes or do I have a misconception?

You have a large misconception - humans do not produce pure frequencies like a sine wave generator. — Jon Custer, Feb 26 '16 at 18:16
@JohnRennie Well, in terms of amplitude and frequency what makes a human voice unique? Amplitude and frequencies are physics concepts. — Sanath Bharadwaj, Feb 26 '16 at 18:18
Air flow is a physics concept, but we don't answer questions about how to fly planes here. — Kyle Kanos, Feb 26 '16 at 18:19
Most sounds including human voice are sums of lots of frequencies with different amplitudes, not just a single frequency. Even a sustained note has a set of frequencies, even if it's played by piano or pipe organ (although the latter can have quite few frequencies). — Ruslan, Feb 26 '16 at 18:52
I did a quick google search and found this paper (looks like its for a lab course) that you might be interested in. It does a spectral analysis of different types of voices. http://www.ling.cam.ac.uk/li9/lab3_m08_speechandspectralanalysis.pdf — tmwilson26, Feb 26 '16 at 18:54
Vocal cords, lungs... The features of your voice come from your body and how you use it. — Mitchell Porter, Feb 26 '16 at 19:11
Of course this is a physics question; which the answers below clearly show. — Steeven, Feb 26 '16 at 22:17
Possible duplicate of What are those characteristics by which every sound can be identified uniquely? and also Sound difference between musical instruments — , Feb 26 '16 at 23:36
Also, here's what a pure monochromatic sinusoid sounds like. — , Feb 26 '16 at 23:38
I'm voting to close this question as off-topic because it is about distinguishing human sounds rather than arbitrary sounds. — Danu, Feb 27 '16 at 09:49

Brionius · Accepted Answer · 2016-02-26T20:04:53.157

A "pure tone" is a sound that has a single sine function as its pressure profile. The human voice is not a pure tone; it is a superposition of many different sine waves with different frequencies and different amplitudes. Here is an image illustrating how many sine waves of different frequencies can combine to make a more complicated waveform like the human voice:

(image credit)

Thus a human voice has many more parameters than just a single amplitude and frequency. It has many amplitudes, one for each of many different frequencies (along with a phase for each as well). Furthermore, these amplitudes change over time as the human voice makes different sounds.

This picture, for example, is a "spectrogram" of a human voice.

(image credit: By Dvortygirl, Mysid - FFT'd in baudline; original sound by DvortygirlThis file was derived from:En-us-it's all Greek to me.ogg, CC BY-SA 3.0 )

The x-axis is time, the y-axis is frequency, and the intensity indicates the amplitude of each frequency component at each point in time. A pure tone would show up as a single solid horizontal line. You can see that the human voice is made of many many frequency components of various amplitudes.

This is the same reason a violin, oboe, and piano sound different even when they play "the same note". The musical terminology for the specific balance of different frequency components is known as "timbre".

See the Wikipedia article for further reading.

CR Drost · Answer 2 · 2016-02-26T21:23:13.017

Here is an image of the waveforms of three people saying the word "ramen." The first two are actually the same person on different occasions, having therefore the same pitch to his voice. The third is a woman saying the same word "ramen". I have altered the durations of the clips so that they all take up the same amount of time overall.

(click to expand).

If you look very closely, there is an initial segment of less-turbulence (R) morphing into a segment with a lot of turbulence (A) morphing into what's essentially a pure frequency (M) with, in the man's case, an overtone; followed by another rougher patch (E), followed by another "more pure" note (N), which seems to be very similar if a bit softer, more drawn-out, and possibly with a higher overtone in each case.

One thing that's very noticeable is that the woman's voice goes up-and-down a lot more, which manifests as her voice's higher pitch.

Another thing is this "turbulence" stuff: this stuff, and any sort of "noise," is a lot of different frequencies happening at once. Your ear actually has a part called the "cochlea" which appears to have little hairs that each are at a slightly different resonance frequency due to their different locations in the organ -- so different frequencies vibrate different hairs in your ears! It's the whole pattern of how these hairs vibrate together which makes the difference between the "a" sounds in Dad and Father, which are very different vowel-sounds (at least in American English!).

In general then there are not two pure numbers which distinguish a pure sound (its frequency and amplitude) but there are instead two functions of frequency which distinguish a pure sound. The first function is the amplitude as a function of frequency -- any pure sound is going to have a bunch of different components at different frequencies! -- and the second parameter is called the phase of the different frequencies. The two numbers are only going to distinguish two sine waves that start out in-phase, but very few of the sounds you hear are sine waves and very few of the sounds you hear are perfectly in phase.

Since a phase is best represented as an angle with such periodic and quasi-periodic waveforms, the natural description of a sound is actually in terms of a function which assigns every frequency a 2D scaled rotation matrix where the rotation angle is the phase and the scale-factor is the scale; it's in 2D because you only need one angle. Such scaled rotation matrices are also known as complex numbers and this function is called the sound's Fourier transform, defined as: $$y[f] = \mathcal F_{t \to f}~y(t) = \int_{-\infty}^\infty dt~ e^{-2\pi i f t} y(t).$$ It turns out that this has a very cool property which is its "inverse transform"$$y(t) = \mathcal F^{-1}_{f\to t}~ y[f] = \int_{-\infty}^\infty dt~ e^{+2\pi i f t} y[f].$$First off, the fact that this exists at all means that both pictures are 100% equivalent for any time-signal: we can always analyze what it looks like in the frequency domain. Second, the fact that its inverse takes almost exactly the same form allows us to reuse our fast Fourier transform tricks to build a fast inverse-Fourier transform, so these are used all the time in signal processing.

Each human voice contains a different baseline pitch, a different accent (mapping of words to actual sounds!), a different phase profile, some different choices of harmonics. It's a testament to how powerful our brain is, and how long it takes us to learn a language, that we can even recognize that two different people from different places are saying the same word! But there are obviously some patterns, like the simpler "more pure" natures of the M and N sounds above, which our brain can "latch on" to in order to group together common sounds. So it's not impossible, it's just very difficult.

Is that audacity I see in that screenshot? What a wonderful (and powerful!) program that little guy is... — corsiKa, Feb 26 '16 at 23:33
Could you post the spectrograms? That would be more interesting, in my opinion. (Click the little arrow on "Audio Track" and change "Waveform" to "Spectrogram," then go to Preferences → Spectrograms and change the window size to 8192 or higher.) — wchargin, Feb 27 '16 at 00:01

Dries · Answer 3 · 2016-02-27T09:26:46.337

As stated by the others, a sound is made up by sinus-waves of different frequencies. The tuning you hear, is determined by the lowest frequency (fundamental). The other frequencies are multiples of that ground frequency and are called overtones.

Summarising what is shown below: the amount in which the different overtones are present, determine the colour of the sound and makes the difference between your voice and mine, between a piano and a saxophone.

As an example, I examined two a's (440 Hz). One produced by a tuning-fork, the other played on an oboe (human speech is a bit more complex, but qualitatively, it's the same).

Below, the two recorded sound are displayed simultaniously:

Performing a fourier transform (looking at which frequencies are present in the sound) on the tuning-fork sound, the result is below: one frequency is very dominant: 440 Hz, the other frequencies hardly have any influence (notice the dB-scale and thus logarithmic scale on the y-axis).

The same analysis on the oboe sound reveals much more: Several peaks at 440 Hz, 880 Hz, 1320 Hz, ... (2x, 3x, 4x, ... 440 Hz) As you can see, the tuning you hear (440 Hz), is not the frequency that is most present in the sound (often, the first peak is the highest, but the pattern you see below is what gives the oboe its particular sound). Your hearing is trained to perceive the series of peaks as a whole and recognize the ground frequency as the pitch.

I think you mean "tuning-fork" not "pitch-fork"! – CJ Dennis Feb 27 '16 at 06:39 — CJ Dennis, Feb 27 '16 at 06:39

How is a human voice unique?

3 Answers3