The diagram above shows the delayed-choice quantum eraser. My current understanding is as follows:
1) If the experiment is done with $BS_a$ and $BS_b$ both mirrors, then double-clumping is seen on $D_0$.
2) If the experiment is done with $BS_a$ and $BS_b$ both transparent, then an interference pattern is seen on $D_0$.
3) If the experiment is done as it was originally (50% transparency on $BS_a$ and $BS_b$), then a double-clump will be seen on $D_0$. However, when you filter the data based on which of $D_{1-4}$ fired, the pattern on $D_0$ can be shown to be a sum of two interference patterns and two double-clumps, each of which individually matches what is normally seen in the double-slit experiment.
4) The experiment can be done with $BS_i$ arbitrarily far away. The results remain the same.
These are my assumptions. They may be wrong (well, at least one of them must be, given the paradox below), so please let me know if that's where the issue lies. 1) and 2) are the normal double-slit results, and although the apparatus is more complicated this time, I still understand them. 3) is complicated, but not hard to understand in a way that doesn't break causality. My understanding is that measuring the x-coordinate of a photon on $D_0$ collapses the wavefunction notably (but not entirely). If the x-coordinate lies where the center of the red-path clump should be, then the entangled partner photon will be much more likely to arrive at $D_4$, but it may still land on $D_{1-3}$. More generally, the x-coordinate of any photon on $D_0$ directly fixes the probability distribution that its entangled partner will have over the remaining detectors.
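To make that last sentence concrete, here is a minimal toy sketch of how I picture the x-coordinate on $D_0$ fixing the partner's detector probabilities. Everything in it is an assumption for illustration, not something taken from the paper: Gaussian clumps, a linear relative phase between the two paths, ideal lossless 50% beam splitters, $D_4$ tagging the red path and $D_3$ the blue path, and all names and numbers (`RED_CENTER`, `WIDTH`, `K`, and so on) are hypothetical.

```python
import cmath
import math

# Toy model; every name and number below is an illustrative assumption.
RED_CENTER, BLUE_CENTER = -1.0, 1.0   # assumed clump centers on D_0
WIDTH, K = 1.0, 8.0                   # assumed clump width and fringe wavenumber


def amplitude(x, center, sign):
    """Assumed amplitude at D_0 for one path: Gaussian envelope times a phase."""
    envelope = math.exp(-((x - center) ** 2) / (4 * WIDTH**2))
    return envelope * cmath.exp(1j * sign * K * x / 2)


def joint_weights(x):
    """Unnormalized weights R_i(x), assuming ideal 50/50 BS_a and BS_b."""
    red = amplitude(x, RED_CENTER, +1)
    blue = amplitude(x, BLUE_CENTER, -1)
    return {
        1: abs(red + blue) ** 2 / 8,  # eraser output: fringes
        2: abs(red - blue) ** 2 / 8,  # eraser output: anti-fringes
        3: abs(blue) ** 2 / 4,        # which-path detector assumed to tag blue
        4: abs(red) ** 2 / 4,         # which-path detector assumed to tag red
    }


def partner_distribution(x):
    """P(partner hits D_i | signal photon landed at x), by normalizing R_i(x)."""
    weights = joint_weights(x)
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


# Partner probabilities for a photon landing at the center of the red clump.
print(partner_distribution(RED_CENTER))
```

In this toy model a hit at the red clump's center favors $D_4$ over $D_3$, while $D_1$ and $D_2$ remain possible, which is the behavior I described above.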
However, let's say the apparatus is such that $BS_i$ and the detectors are a light-year away from the G-T prism. What happens if, during transit (say, a day before the photons arrive), we swap out each 50% transparency $BS_i$ for one that is 100% reflective? Now, double clumping must result on $D_0$. But this is a paradox, since we just changed the result of the experiment after the experiment happened.
At a minimum, I notice that (2) must be wrong here. But why wouldn't an interference pattern develop on $D_0$ when which-path information is lost? This is my first and foremost question.
But there also appears to be a deeper result. Let $R_i$ be the subset of photons on $D_0$ whose entangled partners are detected at $D_i$ (or, to be more precise, let $R_i$ be the corresponding probability distribution over the x-coordinate). Then the above paradox is not possible if $R_1 + R_2 = R_3 + R_4$, since the transparency of $BS_i$ would then not affect the total pattern on $D_0$. I believe that my assumptions above are otherwise correct, so I believe that my paradox allows me to deduce that $R_1 + R_2 = R_3 + R_4$, so that the two interference patterns $R_1$ and $R_2$ sum to a double-clump. But why? I guess this also asks the question of why, in the original experiment, the peaks and troughs of $R_1$ and $R_2$ cancel out in that way (and hence my original question of why (2) is wrong). But generalizing, why does $R_1 + R_2 = R_3 + R_4$ appear to hold? I ask this from a mathematical perspective, as opposed to the solution that "it's a paradox if they don't".
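To show what I mean by the identity, here is a minimal numerical sketch of a toy model in which it holds. All of it is assumed for illustration and none of it comes from the paper: Gaussian clumps, a linear relative phase, lossless beam splitters, an eraser beam splitter that sends the sum of the two path amplitudes to $D_1$ and their difference to $D_2$, and a hypothetical `reflectivity` parameter for $BS_a$ and $BS_b$ (1.0 means a mirror, 0.0 means fully transparent).

```python
import cmath
import math

# Toy model; all names and numbers are illustrative assumptions.
RED_CENTER, BLUE_CENTER, WIDTH, K = -1.0, 1.0, 1.0, 8.0


def amplitude(x, center, sign):
    """Assumed amplitude at D_0 for one path: Gaussian envelope times a phase."""
    envelope = math.exp(-((x - center) ** 2) / (4 * WIDTH**2))
    return envelope * cmath.exp(1j * sign * K * x / 2)


def subsets(x, reflectivity):
    """R_1..R_4 at x when BS_a and BS_b reflect with the given probability."""
    red = amplitude(x, RED_CENTER, +1)
    blue = amplitude(x, BLUE_CENTER, -1)
    transmit = 1.0 - reflectivity
    return {
        1: transmit * abs(red + blue) ** 2 / 4,   # fringes
        2: transmit * abs(red - blue) ** 2 / 4,   # anti-fringes
        3: reflectivity * abs(blue) ** 2 / 2,     # blue-path clump
        4: reflectivity * abs(red) ** 2 / 2,      # red-path clump
    }


def d0_pattern(reflectivity, xs):
    """Unfiltered D_0 pattern: the sum of all four subsets at each x."""
    return [sum(subsets(x, reflectivity).values()) for x in xs]


xs = [i / 100 for i in range(-400, 401)]

# Check R_1 + R_2 = R_3 + R_4 pointwise for the 50% case.
max_gap = max(
    abs(r[1] + r[2] - r[3] - r[4]) for r in (subsets(x, 0.5) for x in xs)
)
print("max |R1 + R2 - (R3 + R4)| at 50%:", max_gap)

# Check that the unfiltered D_0 pattern does not depend on the reflectivity.
mirror, half, clear = (d0_pattern(r, xs) for r in (1.0, 0.5, 0.0))
print("max change in D_0 pattern:",
      max(max(abs(m - h), abs(h - c)) for m, h, c in zip(mirror, half, clear)))
```

In this sketch the fringes and anti-fringes are offset by half a period, so their sum is the plain double-clump and the unfiltered $D_0$ pattern is the same for every `reflectivity`. Whether the real apparatus satisfies these assumptions is exactly what I am asking.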
Note:
I haven't read the paper, and there's conflicting info on Wikipedia on this subject. It appears that the original experiment may have been performed so that the red and blue paths are both aimed at x-coordinate 0, so that my assumption (1) is incorrect (single-clumping occurs, not my presumed double-clumping). I don't think this matters, since the original experimenters could have had the red and blue paths aimed slightly off from each other (but not so far apart that the interference patterns of $R_1$ and $R_2$ are lost). And the distinction doesn't matter for the $R_1 + R_2 = R_3 + R_4$ theorem.