Because I can't help myself...
[...] however there seem to be no rigorous justifications outside of "it seems to work experimentally" (which I suppose is sufficient for an empirical science, but still leaves me feeling rather uneasy).
That's not merely sufficient for empirical sciences - that is the only justification a scientific theory could ever have or need. Your unease is understandable, because mathematics and physics are wholly different in this respect. To oversimplify things enormously, mathematicians spend their time exploring the consequences of and connections between rules and frameworks that they build themselves. In contrast, scientists are trying to figure out which rules and frameworks work best to make accurate predictions about the natural world in which we find ourselves.
Climbing down off of my soapbox, I'll attempt to answer your question. Consider the following formulation of classical, single-particle Hamiltonian mechanics.
- The fundamental structure underlying the theory is the phase space $X$. A point $(\mathbf q,\mathbf p)\in X$ corresponds to a possible state of the system by specifying the particle's position and momentum.
- The physical observables of the system are given by measurable functions $A: X\rightarrow \mathbb R$.
- The possible states $\rho$ of the system correspond to probability measures on $X$. $\rho$ may be a continuous measure with probability density $f_\rho$, or it may be discrete (e.g. a Dirac measure). The former describes a typical state in e.g. statistical mechanics, where we don't know the exact position and momentum of each particle, while the latter is typically what we use in "ordinary" point mechanics.
- A state $\rho$ evolves in time via a Hamiltonian function $H$, which is a specially chosen observable. Specifically,
$$\frac{d}{dt}\rho = -\{\rho,H\}$$
where $\{\cdot,\cdot\}$ is the Poisson bracket; in canonical coordinates, this evolution equation reads
$$\frac{d}{dt}\rho = -\left(\frac{\partial \rho}{\partial \mathbf q}\cdot \frac{\partial H}{\partial \mathbf p} - \frac{\partial \rho}{\partial \mathbf p} \cdot \frac{\partial H}{\partial \mathbf q}\right)$$
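As a concrete numerical aside: for a Dirac measure state, the Liouville equation above reduces to Hamilton's equations for the support point $(\mathbf q_0, \mathbf p_0)$. Here is a minimal sketch (the 1D harmonic oscillator with $m=k=1$ and the symplectic Euler integrator are arbitrary choices for illustration):

```python
import numpy as np

# For a Dirac state, Liouville evolution reduces to Hamilton's equations
# for the support point (q0, p0). We integrate a 1D harmonic oscillator
# H = p^2/2m + k q^2/2 with symplectic (semi-implicit) Euler, which keeps
# the energy error bounded.
def evolve_dirac_state(q0, p0, m=1.0, k=1.0, dt=1e-3, steps=10_000):
    q, p = q0, p0
    for _ in range(steps):
        p -= k * q * dt       # dp/dt = -dH/dq
        q += (p / m) * dt     # dq/dt = +dH/dp
    return q, p

q, p = evolve_dirac_state(1.0, 0.0)
energy = p**2 / 2 + q**2 / 2
print(energy)  # stays close to the initial value H(1, 0) = 0.5
```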
With that out of the way, we can assign probabilities to measurement outcomes as follows. The probability that an observable $A:X\rightarrow \mathbb R$ takes a value in the set $E\subseteq \mathbb R$ is given by
$$\mathrm{Prob}_\rho(A;E) := \rho\bigg(\mathrm{preim}_A(E)\bigg)$$
Assuming that $\rho$ is a continuous measure, this becomes
$$\mathrm{Prob}_\rho(A;E) = \int_{\mathrm{preim}_A(E)} f_\rho(\mathbf q,\mathbf p) \,\mathrm d\mathbf q\, \mathrm d\mathbf p$$
If $\rho$ is a Dirac measure supported at the point $(\mathbf q_0, \mathbf p_0)$, then
$$\mathrm{Prob}_\rho(A;E) = \begin{cases} 1 & A (\mathbf q_0,\mathbf p_0)\in E \\ 0 & \text{else}\end{cases}$$
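To make the classical probability rule concrete, here is a small Monte Carlo sketch; the Gaussian phase-space density and the observable $A(q,p) = q + p$ are made-up choices for illustration:

```python
import numpy as np

# Estimate Prob_rho(A; E) = rho(preim_A(E)) by sampling from a Gaussian
# phase-space density (an arbitrary example state) and counting the
# fraction of samples whose observable value lands in E.
rng = np.random.default_rng(0)
N = 200_000
q = rng.normal(0.0, 1.0, N)   # positions drawn from f_rho
p = rng.normal(0.0, 1.0, N)   # momenta drawn from f_rho

A = lambda q, p: q + p        # an example observable A(q, p)
prob = np.mean(A(q, p) >= 0.0)  # E = [0, inf)
print(prob)  # ~0.5 by symmetry of the Gaussian density
```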
That brisk summary contains the two primary ingredients necessary for a physical theory:
- Given a state, we need to be able to compute the probabilities of various measurement outcomes, and
- We need to be able to evolve states forward in time
It turns out that this framework is incredibly successful, but it has some limitations which render it fundamentally incapable of accurately modeling certain phenomena. Instead, we must turn to quantum mechanics, which differs from classical physics in one fundamental way - the existence of incompatible observables.
Two observables $A$ and $B$ are called compatible if we can define their joint probability distribution - that is, if we can meaningfully assign probabilities to simultaneous measurements of both. In classical mechanics, this can always be done; given a state $\rho$, the probability of measuring observables $A$ and $B$ to take values in the sets $E_A$ and $E_B$ is simply
$$\mathrm{prob}_\rho(A,B;E_A,E_B) = \rho\bigg(\mathrm{preim}_A(E_A)\cap \mathrm{preim}_B(E_B)\bigg)$$
The incompatibility of certain quantum observables means that we require a more general logical framework which allows for the possibility that two observables simply don't have a well-defined joint probability distribution.
I'll now describe such a framework. Here's a formulation of quantum mechanics:
- The fundamental structure underlying the theory is the Hilbert space $\mathscr H$ with inner product $\langle \cdot,\cdot\rangle$.
- The physical observables of the system correspond to the self-adjoint operators $A:\mathscr H\rightarrow \mathscr H$.
- The possible states $\rho$ of the system correspond to density operators - that is, positive self-adjoint operators with trace $1$. Given a unit vector $\psi\in \mathscr H$, one possible form for a density operator is $\rho_\psi: \phi \mapsto \psi \langle \psi, \phi\rangle$ (in bra-ket notation, we would write $\rho_\psi = |\psi\rangle\langle \psi|$). If the state has this form it is called pure, but not all states do.
- A state $\rho$ evolves in time via a Hamiltonian operator $H$, which is a specially chosen observable. Specifically,
$$\frac{d}{dt}\rho = -\frac{1}{i\hbar}[\rho,H]$$
where $[\cdot,\cdot]$ is the commutator bracket for two operators on $\mathscr H$.
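In finite dimensions this equation integrates to $\rho(t) = U\rho(0)U^\dagger$ with $U = e^{-iHt/\hbar}$. A minimal numerical sketch, taking $H = \sigma_z$ and $\hbar = 1$ (arbitrary choices for illustration):

```python
import numpy as np

# Unitary evolution of a density operator: rho(t) = U rho(0) U^dag with
# U = exp(-i H t / hbar). Here H = sigma_z = diag(1, -1) and hbar = 1,
# so U is diagonal and easy to write down directly.
t = np.pi / 2
U = np.diag(np.exp(-1j * np.array([1.0, -1.0]) * t))  # exp(-i sigma_z t)

psi = np.array([1.0, 1.0]) / np.sqrt(2)     # a pure state |psi><psi|
rho0 = np.outer(psi, psi.conj())
rho_t = U @ rho0 @ U.conj().T

print(np.trace(rho_t).real)  # the trace is preserved: still 1.0
```

Note that $\rho(t)$ differs from $\rho(0)$ even though the trace (and, more generally, the full spectrum) is preserved.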
With that out of the way, we can assign probabilities to measurement outcomes as follows. Any self-adjoint operator $A:\mathscr H \rightarrow \mathscr H$ corresponds to a unique projection-valued measure$^\ddagger$ $\mu_A$, which maps a set $E\subseteq \mathbb R$ to an orthogonal projection operator $\mu_A(E)$. Given a state $\rho$, the probability that $A$ takes its value in $E$ is given by
$$\mathrm{prob}_\rho(A;E) = \mathrm{Tr}\bigg(\rho \mu_A(E)\bigg)$$
If $\rho$ is a pure state, then the probability becomes
$$\mathrm{prob}_{\rho_\psi}(A;E) = \mathrm{Tr}\bigg(\rho_\psi \mu_A(E)\bigg) = \langle \psi, \mu_A(E)\psi\rangle$$
For a specific example, if $\mathscr H=L^2(\mathbb R)$ with the inner product $\langle \psi,\phi\rangle = \int \overline{\psi(x)} \phi(x) \mathrm dx$, and $A=\hat X$ is the position operator $\big(\hat X\psi\big)(x) = x \psi(x)$, then $\mu_{\hat X}(E)$ is the operator of multiplication by the indicator function $\chi_E$, where
$$\chi_E(x) = \begin{cases}1 & x\in E\\ 0 & \text{else}\end{cases}$$
Therefore, we have$^{\ddagger\ddagger}$
$$\mathrm{prob}_{\rho_\psi}(\hat X;E) = \int_\mathbb R \chi_E(x)|\psi(x)|^2 \ \mathrm dx = \int_E |\psi(x)|^2 \mathrm dx$$
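In a finite-dimensional Hilbert space the projection-valued measure can be constructed explicitly: diagonalize $A$ and sum the projectors onto eigenspaces whose eigenvalues lie in $E$. A sketch (the particular observable and state are arbitrary choices for illustration):

```python
import numpy as np

# prob_rho(A; E) = Tr(rho mu_A(E)), where mu_A(E) projects onto the
# span of eigenvectors of A whose eigenvalues lie in E.
A = np.array([[1.0, 1.0],
              [1.0, -1.0]])            # a self-adjoint observable
eigvals, eigvecs = np.linalg.eigh(A)   # spectral decomposition of A

def pvm(eigvals, eigvecs, in_E):
    """mu_A(E): sum of projectors onto eigenvectors with eigenvalue in E."""
    P = np.zeros((len(eigvals), len(eigvals)))
    for lam, v in zip(eigvals, eigvecs.T):  # eigh returns eigenvectors as columns
        if in_E(lam):
            P += np.outer(v, v)
    return P

psi = np.array([1.0, 0.0])             # a pure state rho = |psi><psi|
rho = np.outer(psi, psi)
P = pvm(eigvals, eigvecs, lambda lam: lam > 0)   # E = (0, inf)
prob = np.trace(rho @ P)
print(prob)
```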
Once again, we have our necessary ingredients for producing probabilities and evolving states. But now we have a crucial difference, because we have incompatible observables. If $A$ and $B$ are self-adjoint operators with projection-valued measures $\mu_A$ and $\mu_B$ respectively, then the probability of measuring them to take values in the sets $E_A$ and $E_B$ individually are found via the orthogonal projection operators $\mu_A(E_A)$ and $\mu_B(E_B)$.
We might imagine that measuring them together would correspond to using the operator $\mu_{A,B}(E_A,E_B) = \mu_A(E_A)\mu_B(E_B)$, and we would be correct provided that $A$ and $B$ (and by extension, $\mu_A(E_A)$ and $\mu_B(E_B)$) commute. If they don't, then $\mu_A(E_A)\mu_B(E_B) \neq \mu_B(E_B)\mu_A(E_A)$ and the order in which they are measured matters. We say that $A$ and $B$ are incompatible observables.
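This non-commutativity is easy to check numerically with the spectral projectors of the Pauli matrices $\sigma_x$ and $\sigma_z$, a standard example of incompatible observables:

```python
import numpy as np

# The projectors onto the +1 eigenspaces of sigma_x and sigma_z
# fail to commute, so sigma_x and sigma_z are incompatible.
Px = 0.5 * np.array([[1.0, 1.0],
                     [1.0, 1.0]])      # mu_{sigma_x}({+1})
Pz = np.array([[1.0, 0.0],
               [0.0, 0.0]])            # mu_{sigma_z}({+1})

commutator = Px @ Pz - Pz @ Px
print(np.allclose(commutator, 0))  # False: the order of measurement matters
```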
The existence of incompatible observables has additional implications. For example, a state $\rho$ is called sharp if, for every orthogonal projection operator $P$, $\mathrm{Tr}(\rho P)$ is either $0$ or $1$, with no intermediate values (the Dirac measure states from classical physics are sharp in the analogous sense). The existence of incompatible observables implies that no sharp states can exist.
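For a quick numerical illustration that not even pure quantum states are sharp, take $\rho = |0\rangle\langle 0|$ and the projector onto $(|0\rangle + |1\rangle)/\sqrt 2$:

```python
import numpy as np

# Even a pure state is not sharp: with rho = |0><0| and P the projector
# onto (|0> + |1>)/sqrt(2), Tr(rho P) lands strictly between 0 and 1.
rho = np.array([[1.0, 0.0],
                [0.0, 0.0]])
v = np.array([1.0, 1.0]) / np.sqrt(2)
P = np.outer(v, v)
print(np.trace(rho @ P))  # 0.5, neither 0 nor 1
```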
Of course, you may ask why such a framework is the right one. There are some reasons that it is not quite as arbitrary as it seems.
- Gleason's theorem tells us that, given some reasonable assumptions (in particular, $\dim \mathscr H \geq 3$), any probability measure which assigns probabilities to orthogonal projectors is realized by a density operator.
- The Gelfand-Naimark theorem tells us that any $C^*$ algebra of observable quantities can be realized as an algebra of bounded operators on some Hilbert space. By implication, we need only motivate the fact that our observables should have the structure of a $C^*$ algebra to completely justify our use of a Hilbert space as a concrete vehicle for the theory.
- Classical observables have the structure of a Poisson algebra, which differs from a $C^*$ algebra. But if we jump from the former to the latter, then we are led immediately to the standard formulation of quantum mechanics. I don't know how to motivate that, really, but it's worth considering.
$^\ddagger$This is why we require that observables are self-adjoint.
$^{\ddagger\ddagger}$This is why $|\psi(x)|^2$ can be identified with the probability density associated with the position observable.