Here is an image of the waveforms of three people saying the word "ramen." The first two are actually the same man on different occasions, so both have the same vocal pitch. The third is a woman saying the same word. I have altered the durations of the clips so that they all take up the same amount of time overall.

If you look very closely, there is a less turbulent initial segment (R), morphing into a segment with a lot of turbulence (A), morphing into what's essentially a pure frequency (M) with, in the man's case, an overtone. That is followed by another rougher patch (E) and then another "more pure" note (N), which seems to be very similar to the M if a bit softer, more drawn-out, and possibly with a higher overtone in each case.
One thing that's very noticeable is that the woman's waveform goes up and down many more times over the same stretch, which is how her voice's higher pitch shows up in the picture.
Another thing is this "turbulence" stuff: this stuff, and any sort of "noise," is a lot of different frequencies happening at once. Your ear actually has a part called the "cochlea" which appears to have little hairs that each are at a slightly different resonance frequency due to their different locations in the organ -- so different frequencies vibrate different hairs in your ears! It's the whole pattern of how these hairs vibrate together which makes the difference between the "a" sounds in Dad and Father, which are very different vowel-sounds (at least in American English!).
In general, then, two pure numbers (a frequency and an amplitude) are only enough to describe a pure tone; what distinguishes a general sound is instead two functions of frequency. The first function is the amplitude as a function of frequency -- any real sound is going to have a bunch of different components at different frequencies! -- and the second function is called the phase of the different frequencies. The two numbers are only going to distinguish two sine waves that start out in-phase, but very few of the sounds you hear are sine waves and very few of the sounds you hear are perfectly in phase.
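To see why frequency and amplitude alone are not enough, here is a minimal sketch in Python. The helper `sine_wave`, the 5 Hz tone and the 1000 Hz sample rate are all illustrative assumptions of mine, not anything measured from the clips. It adds two sine waves with identical frequency and amplitude, once in phase and once half a cycle apart:

```python
import math

def sine_wave(amplitude, frequency, phase, times):
    """Sample amplitude * sin(2*pi*frequency*t + phase) at each time t."""
    return [amplitude * math.sin(2 * math.pi * frequency * t + phase) for t in times]

SAMPLE_RATE = 1000  # samples per second (illustrative choice)
times = [n / SAMPLE_RATE for n in range(SAMPLE_RATE)]  # one second of samples

# Same frequency and amplitude throughout; only the phase of the second wave changes.
a = sine_wave(1.0, 5.0, 0.0, times)
b_in_phase = sine_wave(1.0, 5.0, 0.0, times)
b_opposite = sine_wave(1.0, 5.0, math.pi, times)

# Largest excursion of the summed signal in each case.
peak_in_phase = max(abs(x + y) for x, y in zip(a, b_in_phase))
peak_opposite = max(abs(x + y) for x, y in zip(a, b_opposite))

print(peak_in_phase, peak_opposite)
```

The in-phase pair reinforces to twice the amplitude, while the opposite-phase pair cancels down to floating-point rounding error, even though both pairs share the same two "pure numbers."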
Since the phase of such periodic and quasi-periodic waveforms is best represented as an angle, the natural description of a sound is actually in terms of a function which assigns every frequency a 2D scaled rotation matrix, where the rotation angle is the phase and the scale factor is the amplitude; it's in 2D because you only need one angle. Such scaled rotation matrices are also known as complex numbers, and this function is called the sound's Fourier transform, defined as:
$$y[f] = \mathcal F_{t \to f}~y(t) = \int_{-\infty}^\infty dt~ e^{-2\pi i f t} y(t).$$

It turns out that this has a very cool property, which is its "inverse transform":

$$y(t) = \mathcal F^{-1}_{f\to t}~ y[f] = \int_{-\infty}^\infty df~ e^{+2\pi i f t} y[f].$$

First off, the fact that this exists at all means that both pictures are 100% equivalent for any time-signal: we can always analyze what it looks like in the frequency domain. Second, the fact that its inverse takes almost exactly the same form allows us to reuse our fast Fourier transform tricks to build a fast inverse-Fourier transform, so these are used all the time in signal processing.
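As a rough illustration of the transform pair, here is a sketch of its discrete analogue in plain Python. This is the naive O(N²) discrete Fourier transform, not a fast Fourier transform, and the function names `dft` and `inverse_dft` and the test tone (amplitude 0.8, phase π/4, 5 cycles across 64 samples) are assumptions I picked for the example:

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform: sum over t of y[t] * exp(-2*pi*i*f*t/N)."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
            for f in range(n)]

def inverse_dft(spectrum):
    """Inverse transform: same form with the sign of the exponent flipped, divided by N."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * math.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]

N = 64                # number of samples (illustrative)
TONE_BIN = 5          # the tone completes 5 cycles over the N samples
AMPLITUDE = 0.8
PHASE = math.pi / 4

signal = [AMPLITUDE * math.cos(2 * math.pi * TONE_BIN * t / N + PHASE) for t in range(N)]

spectrum = dft(signal)
recovered = inverse_dft(spectrum)

# The strongest bin in the lower half of the spectrum is the tone itself.
peak_bin = max(range(N // 2), key=lambda f: abs(spectrum[f]))
# Each complex value carries an amplitude (its modulus) and a phase (its angle).
peak_amplitude = 2 * abs(spectrum[peak_bin]) / N
peak_phase = cmath.phase(spectrum[peak_bin])
# Transforming there and back should reproduce the original samples.
round_trip_error = max(abs(r - s) for r, s in zip(recovered, signal))

print(peak_bin, peak_amplitude, peak_phase, round_trip_error)
```

The single complex number at the peak bin gives back both the amplitude and the phase of the tone, and the round trip through `inverse_dft` recovers the time signal up to rounding error, which is the sense in which the two pictures are equivalent. In practice you would reach for a library FFT rather than this quadratic-time loop.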
Each human voice contains a different baseline pitch, a different accent (mapping of words to actual sounds!), a different phase profile, and some different choices of harmonics. It's a testament to how powerful our brain is, and to how long it takes us to learn a language, that we can even recognize that two different people from different places are saying the same word! But there are obviously some patterns, like the simpler "more pure" natures of the M and N sounds above, which our brain can "latch on" to in order to group together common sounds. So it's not impossible, it's just very difficult.