Below I've derived this rule for the joint probability of two consecutive measurements that holds in any axiomatic formulation of quantum mechanics without further assumption. This derivation applies whether or not the two observables commute, and does not assume conditional probabilities, etc.
The way I represent measurements follows from the Stinespring Dilation Theorem, which says that all quantum channels can be represented via isometries (which are lenght-preserving maps between Hilbert spaces of different size; see below for further details).
For measurements, the Hilbert space is "dilated" to include degrees of freedom that codify the measurement outcome. By Choi's Theorem, this is equivalent to the Kraus representation. Moreover, I can embed isometries in unitaries, which leads to a unitary picture of measurements for qubits and more generally. That unitary picture of measurement is equivalent to the Many-Worlds Interpretation (MWI).
Suppose I have two operators $A$ and $B$ with spectral decompositions
$$ A \, = \, \sum\limits_{a=0}^{N^{\,}_A-1} \, A^{\,}_a \, \mathbb{P}^{\,}_a ~~,~~B \, = \, \sum\limits_{b=0}^{N^{\,}_B-1} \, B^{\,}_b \, \widetilde{\mathbb{P}}^{\,}_b ~,~~$$
where $\{A^{\,}_a\}$ are the $N^{\,}_A$ unique eigenvalues of $A$ (and thus the possible outcomes upon measurement), and $\{B^{\,}_b\}$ are the $N^{\,}_B$ unique eigenvalues of $B$. The projector $\mathbb{P}^{\,}_a$ projects onto the $a$th eigenspace of $A$, so that $A \mathbb{P}^{\,}_a = A^{\,}_a \mathbb{P}^{\,}_a$; likewise the projector $\widetilde{\mathbb{P}}^{\,}_b$ projects onto the $b$th eigenspace of $B$, so that $B \widetilde{\mathbb{P}}^{\,}_b= B^{\,}_b \widetilde{\mathbb{P}}^{\,}_b$. The projectors are orthogonal, complete, and idempotent:
$$ \sum\limits_{a=0}^{N^{\,}_A-1} \, \mathbb{P}^{\,}_a \, = \, \mathbb{1}~~~~\text{and}~~~~\mathbb{P}^{\,}_a \, \mathbb{P}^{\,}_{a'} \, = \, \delta^{\,}_{a,a'} \mathbb{P}^{\,}_a~,~~$$
and likewise for the projectors for $B$.
We encode measurement channels by representing the degrees of freedom corresponding to the measurement apparati. In this case, we have two measurement devices: One for each of $A$ and $B$, with $N^{\,}_A$ and $N^{\,}_B$ internal states, respectively. It will help to define the cyclic shift operators
$$ X^k_A \, = \, \sum\limits_{q=0}^{N^{\,}_A-1} \, \left| q+k \, \text{mod}\, N^{\,}_A \middle\rangle \hspace{-0.4mm} \middle\langle q \right| ~~,~~~\widetilde{X}^k_B \, = \, \sum\limits_{q=0}^{N^{\,}_B-1} \, \left| q+k \, \text{mod}\, N^{\,}_B \middle\rangle \hspace{-0.4mm} \middle\langle q \right| ~,~~$$
which are both unitary operators that reduce to the Pauli $X$ matrix for $N^{\,}_{A,B}=2$, and are also known as "Weyl $X$" operators or "shift" operators in clock models.
The unitary channel that captures measurement of $A$ (and $B$) requires that we identify a "default state" of the apparatus, which we take to be $\left| 0 \right\rangle$ without loss of generality (this is the case in experiments as far as I'm aware). With this choice, the operator that captures projective measurement of the operator $A$ is the channel
$$ \mathsf{M}^{\vphantom{\dagger}}_{[A]} \, = \, \sum\limits_{a=0}^{N^{\,}_A-1} \, \mathbb{P}^{\,}_a \otimes X^{a}_A \, ,~~$$
while the operator that captures measurement of $B$ is
$$ \mathsf{M}^{\vphantom{\dagger}}_{[B]} \, = \, \sum\limits_{b=0}^{N^{\,}_B-1} \, \widetilde{\mathbb{P}}^{\,}_b \otimes \widetilde{X}^{b}_B \, ,~~$$
and I'll note that various useful relations can be found in this paper.
Now, suppose that we measure $A$ only. Starting in the state $\left| \alpha \right\rangle$ of the physical system (and the default state $\left|0 \right\rangle$ of the measurement apparatus), we write
$$\left| \Psi (0)\right\rangle \, \equiv \, \left| \alpha \right\rangle \otimes \left| 0 \right\rangle \, ,~~$$
and we then have
$$ \left| \Psi (t)\right\rangle \, = \, \mathsf{M}^{\vphantom{\dagger}}_{[A]} \, \left| \Psi (0)\right\rangle \, = \, \sum\limits_{a=0}^{N^{\,}_A-1} \, \mathbb{P}^{\,}_a \left| \alpha \right\rangle \otimes X^{a}_A \,\left| 0\right\rangle \, = \, \sum\limits_{a=0}^{N^{\,}_A-1} \, \mathbb{P}^{\,}_a \left| \alpha \right\rangle \otimes \left| a\right\rangle \, , ~$$
where the state $\mathbb{P}^{\,}_a \left| \alpha \right\rangle$ of the physical system is just an unnormalized verison of the collapsed (Copenhagen) wavefunction corresponding to observing outcome $a$. In the case where the $a$th eigenvalue $A$ is nondegenerate, we have $\mathbb{P}^{\,}_a = \left| a \middle\rangle \hspace{-0.4mm}\middle\langle a \right|$ so that $\mathbb{P}^{\,}_a \left| \alpha \right\rangle = \left\langle a \middle| \alpha \right\rangle \, \left| a \right\rangle$.
The operator $\mathsf{M}^{\,}_{[A]}$ is unitary acting on $\left| \Psi (0) \right\rangle = \left| \alpha \right\rangle \otimes \left| 0 \right\rangle$. The isometry $\mathsf{V}^{\,}_{[A]}$ corresponding to the measurement of $A$ is similar, but we do not include the ancilla state (for the apparatus) from the outset. Instead, we have:
$$ \mathsf{V}^{\vphantom{\dagger}}_{[A]} \, : \, \left| \alpha \right\rangle \to \left| \alpha'\right\rangle \, = \, \sum\limits_{a=0}^{N^{\,}_A-1} \, P^{\,}_a \left| \alpha \right\rangle \otimes \left| a \right\rangle \, ,$$
where there is no apparatus state prior to the isometry. Thus, the isometry dilates the Hilbert space by introducing the state of the apparatus, and the Stinespring theorem tells us this is always a way to represent a quantum channel, including measurements. To see that this mapping is isometric (meaning that it is length preserving), we compute
$$ \left\langle \alpha' \middle| \alpha' \right\rangle \, = \, \left\langle \alpha \middle| \mathsf{V}^{\dagger}_{[A]} \, \mathsf{V}^{\vphantom{\dagger}}_{[A]} \middle| \alpha \right\rangle \, = \, \sum\limits_{a,a'=0}^{N^{\,}_A-1} \, \left\langle \alpha \middle| P^{\,}_{a'} \, P^{\,}_a \middle| \alpha \right\rangle \times \left\langle a' \middle| a \right\rangle $$
$$ = \sum\limits_{a=0}^{N^{\,}_A-1} \, \left\langle \alpha \middle| P^{\,}_a \middle| \alpha \right\rangle \, = \, \left\langle \alpha\middle| \alpha \right\rangle \, = \, 1 \, , ~$$
where I used the fact that (i) the basis states for the apparatus are orthonormal (i.e., $ \left\langle a' \middle| a \right\rangle = \delta^{\,}_{a,a'}$), (ii) the projectors are idempotent (i.e., $P^2_a = P^{\,}_a$), and (iii) the projectors are complete (i.e., $\sum_a \, P^{\,}_a = \mathbb{1}$ is the identity). The unitary version $\mathsf{M}$ also preserves length, and hence the state $\left| \Psi (t) \right\rangle = \left| \alpha' \right\rangle$ is properly normalized. Note that an isometry from a Hilbert space to itself is always unitary.
Importantly, the probability of recovering outcome $a$ upon measuring $A$ is given by the expectation value (in the post-measurement state $\left| \Psi (t)\right\rangle$) of $\mathbb{1}^{\,}_{\rm phys} \otimes \left| a \middle\rangle \hspace{-0.4mm} \middle\langle a \right|^{\,}_A$, which is just the projector onto the state $a$ of the measurement apparatus! In other words,
$$ p^{\vphantom{\prime}}_a \, \equiv \, \left\langle \mathbb{1}^{\,}_{\rm phys} \otimes \left| a \middle\rangle \hspace{-0.4mm} \middle\langle a \right|^{\,}_A \right\rangle^{\vphantom{\dagger}}_{\Psi(t)} \, = \, \sum\limits_{n,n'=0}^{N^{\,}_A} \, \left\langle \alpha \middle| \mathbb{P}^{\,}_n \mathbb{1} \mathbb{P}^{\,}_{n'} \middle| \alpha \right\rangle \otimes \left\langle 0 \middle| X^{-n}_A \, \mathbb{P}^{\,}_a \, X^{n'}_A \middle| 0 \right\rangle \, = \, \left\langle \alpha \middle| \mathbb{P}^{\,}_a \middle| \alpha \right\rangle \, , ~~$$
which is exactly the same as what the Copenhagen interpretation tells us! If the eigenvalue $A^{\,}_a$ is nondegenerate (i.e., there is exactly one eigenstate $\left| a \right\rangle$ with eigenvalue $A^{\,}_a$ under $A$), then we have
$$p^{\,}_a \, \to \, \left| \left\langle a \middle| \alpha \right\rangle \right|^2 \, , ~$$
as expected.
Now, suppose we measure $A$ and then we measure $B$. The combination of the two channels acts as
$$ \left| \Psi (t)\right\rangle \, = \, \mathsf{M}^{\vphantom{\dagger}}_{[B]} \, \mathsf{M}^{\vphantom{\dagger}}_{[A]} \, \left| \Psi (0)\right\rangle \, = \, \sum\limits_{b=0}^{N^{\,}_B-1}\, \sum\limits_{a=0}^{N^{\,}_A-1} \, \widetilde{\mathbb{P}}^{\,}_b \, \mathbb{P}^{\,}_a \, \left| \alpha \right\rangle \otimes X^{a}_A \,\left| 0\right\rangle \otimes \widetilde{X}^{b}_B \,\left| 0\right\rangle \,, ~$$
and now we can compute the joint probability to recover outcome $A^{\,}_a$ upon measuring $A$ and outcome $B^{\,}_b$ upon measuring $B$, starting from the state $\left| \alpha \right\rangle$ of the physical system:
$$ p^{\vphantom{\prime}}_{b|a} \, \equiv \, \left\langle \mathbb{1}^{\,}_{\rm phys} \otimes \left| a \middle\rangle \hspace{-0.4mm} \middle\langle a \right|^{\,}_A\otimes \left| b \middle\rangle \hspace{-0.4mm} \middle\langle b \right|^{\,}_B \right\rangle^{\vphantom{\dagger}}_{\Psi(t)} \, = \, \sum\limits_{m,m'=0}^{N^{\,}_B} \, \sum\limits_{n,n'=0}^{N^{\,}_A} \, \left\langle \alpha \middle| \mathbb{P}^{\,}_{n} \, \widetilde{\mathbb{P}}^{\,}_m \,\mathbb{1} \, \widetilde{\mathbb{P}}^{\,}_{m'} \, \mathbb{P}^{\,}_{n'} \middle| \alpha \right\rangle \times \left\langle 0 \middle| X^{-n}_A \middle| a \middle\rangle \hspace{-0.4mm} \middle\langle a \middle| X^{n'}_A \middle| 0 \right\rangle \times \left\langle 0 \middle| \widetilde{X}^{-m}_B \middle| b \middle\rangle \hspace{-0.4mm} \middle\langle b \middle| \widetilde{X}^{m'}_B \middle| 0 \right\rangle \, ,~~ $$
which simplifies to
$$p^{\vphantom{\prime}}_{b|a} \, = \, \left\langle \alpha \middle| \mathbb{P}^{\,}_a \, \widetilde{\mathbb{P}}^{\,}_b \, \mathbb{P}^{\,}_a \middle| \alpha \right\rangle \, \to \, \left| \left\langle b \middle| a \right\rangle \right|^2 \, \left| \left\langle a \middle| \alpha \right\rangle \right|^2 \, , ~~$$
where the latter expression holds when $A$ and $B$ are both nondegenerate. This is precisely what you wrote! If the order of measurements is reversed, simply swap $a \leftrightarrow b$ above, which gives $p^{\,}_{a|b}$!
I think it is much more straightforward to understand and derive the various relations for expectation values and joint/conditional probabilities in this Stinespring/MWI picture of measurement. It also accurately captures what happens in experiments. Let me know if anything is unclear or if additional details at any step would help!