$\newcommand{\si}{\sigma}\renewcommand\th\theta\newcommand{\R}{\mathbb R}\newcommand{\Si}{\Sigma}$Since you requested more details, I am collecting here my comments (somewhat modified), with substantial details added.
- You wrote "I have the feeling that some basic implicit assumptions are not made explicit". I asked if you could explain why you have such a feeling. You then said you could not provide specific examples to explain this feeling, but your "overall feeling when starting to read a book on statistics or probability is that definitions are not motivated, and also [you] can't see clearly the global direction of the series of results developed in the book. [Your] feeling when reading a book on algebra, geometry or homological algebra is very different: to [you] the definitions are well motivated, and [you] can understand where the author wants to go".
-- It is unclear to me what kind of motivation you are seeking. When you read a book on homological algebra, the definitions you see there are probably motivated purely mathematically, based on needs in some other mathematical areas -- and this is what seems to satisfy you. So, are you looking for definitions in mathematical statistics similarly motivated by needs in some other mathematical areas? If so, you will hardly find such motivation anywhere. Mathematical statistics grew, not from other areas of mathematics, but from messy, non-mathematical statistics.
- You also wrote "Could you please recommend a casual approach of mathematical statistics/probability [...]?" -- Based on my experience, I think a casual approach is what is needed here the least. As for "general philosophy", I think what is useful there comes not at the beginning, but only some time after you have mastered probability (and then, based on it, mathematical statistics) mathematically. But of course, these suggestions may not work for everyone.
As for the link between probability and mathematical statistics, it is all here: Probability theory studies probability measures, especially over special structures, such as product sets and many others. Mathematical statistics deals with statistical models, which are families/sets of probability measures (also called distributions) over a set $S$ (called the sample space). Only one of the distributions is assumed to be "true", that is, to correspond to reality. The true distribution is only partially known, if at all.
We have a number of (say independent) realizations of a random element of the sample space $S$ with the true distribution. Based on these realizations, we have to make inferences about the unknown "true" distribution. We want these inferences to be good, according to some criterion. So, we may have an optimization problem, which in some cases can be solved exactly.
As you requested, here are a couple of examples illustrating the link between probability and mathematical statistics.
Example 1 (from the so-called parametric statistics): We make a sequence $(x_1,\dots,x_n)$ of measurements of a certain physical quantity $\mu$. We model the measurement errors as random variables (r.v.'s). We may think of the measurement error as the sum of many small errors due to a large number of (nearly) independent factors of relatively small effect each. Then, in view of the central limit theorem, we may model the measurement error as a normal r.v. We may further assume that there is no systematic error; that is, the mean of the random error is $0$. So, we may model the measurements $x_1,\dots,x_n$ as the realizations (that is, values) of independent normal r.v.'s $X_1,\dots,X_n$, each with mean $\mu$ and some variance $\si^2$. So, the statistical model here is the family $(P_{\mu,\si^2})$ of all normal distributions (over the real line), indexed by the two-dimensional parameter $\th:=(\mu,\si^2)$.

Based on the sequence $(x_1,\dots,x_n)$ of measurements, we want to estimate the unknown quantity $\mu$, which is a function of the unknown value of the parameter $\th=(\mu,\si^2)$. The most natural estimate of $\mu$ seems to be $\bar x_n:=\frac1n\,\sum_1^n x_i$. Although this is very unlikely, this estimate may be rather far from the true value $\mu$ -- say, if all the measurements $x_i$ happen to be significantly greater than $\mu$. However, the corresponding estimator $\bar X_n:=\frac1n\,\sum_1^n X_i$ (of which the estimate $\bar x_n$ is just one realization) will be good overall, in various average senses. (In general, an estimator is a function of the "sample" $(X_1,\dots,X_n)$.) Indeed, $\bar X_n$ is unbiased for $\mu$ -- that is, the expected value $E_{\mu,\si^2}\bar X_n$ of $\bar X_n$ is $\mu$ for all $(\mu,\si^2)$ (where $E_{\mu,\si^2}T(X_1,\dots,X_n):=\int_{\R^n}P_{\mu,\si^2}(du_1)\cdots P_{\mu,\si^2}(du_n)\,T(u_1,\dots,u_n)$). Moreover, it is known that $\bar X_n$ is optimal in the sense that it has the smallest variance among all unbiased estimators of $\mu$ based on the "sample" $(X_1,\dots,X_n)$. Furthermore, the estimator $\bar X_n$ of $\mu$ is consistent, in the sense that $\bar X_n\to\mu$ in $P_{\mu,\si^2}^{\otimes n}$-probability as $n\to\infty$.

So, unless we are unlucky, the estimate $\bar x_n$ of $\mu$ should be good (but, because of random fluctuations, this is not $100\%$ guaranteed, especially if $n$ is not large enough). So, we see that a lot here is about modeling; but, after the modeling is done, we use mathematical tools to prove properties of estimators, such as their optimality in a certain sense.
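To make Example 1 concrete, here is a minimal simulation sketch (not part of the example itself; the "true" values $\mu=2$, $\si=1.5$ and the sample sizes are hypothetical choices made only for illustration). It checks numerically that realizations of $\bar X_n$ get close to $\mu$ as $n$ grows (consistency) and that the average of $\bar X_n$ over many independent samples is close to $\mu$ (unbiasedness).

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true = 2.0, 1.5   # hypothetical "true" parameter values, unknown in practice

# Consistency: one realization of x_bar_n for increasing n
for n in (10, 100, 10_000):
    x = rng.normal(loc=mu_true, scale=sigma_true, size=n)   # the measurements x_1, ..., x_n
    x_bar = x.mean()                                         # the estimate x_bar_n of mu
    print(f"n = {n:6d}   x_bar = {x_bar:.4f}   error = {x_bar - mu_true:+.4f}")

# Unbiasedness: average X_bar_n over many independent samples of fixed size n
m, n = 20_000, 30
samples = rng.normal(mu_true, sigma_true, size=(m, n))
print("average of x_bar over", m, "samples:", round(samples.mean(axis=1).mean(), 4))  # ~ mu_true
```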
Example 2 (nonparametric statistics): Here the statistical model is a set of probability distributions not parametrized by a parameter with values in a low-dimensional space. For instance, here the model may be the set of all probability measures/distributions on a measurable space $(S,\Si)$, with the true distribution completely unknown (so, we are not assuming its normality or anything like that). Still, suppose we have a "sample" $(X_1,\dots,X_n)$ from the completely unknown true distribution $P_*$; that is, $X_1,\dots,X_n$ are independent random elements of $S$ each with distribution $P_*$. Then we can consider the so-called empirical (random) measure $\hat P_n$ defined by the formula
\begin{equation}
\hat P_n(B):=\frac1n\,\sum_1^n 1(X_i\in B)
\end{equation}
for $B\in\Si$. Then, by the law of large numbers, $\hat P_n$ is a consistent estimator of $P_*$ in the sense that for each $B\in\Si$
\begin{equation}
\hat P_n(B)\to P_*(B)
\end{equation}
in $P_*^{\otimes n}$-probability as $n\to\infty$. Closeness of $\hat P_n$ to $P_*$ in stronger senses has been studied extensively; see, e.g., Talagrand.
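As a similar sketch for Example 2 (again with hypothetical choices: the role of the "completely unknown" $P_*$ is played here by a Beta$(2,5)$ distribution on $S=[0,1]$, and $B=[0.1,0.4]$), one can check numerically that $\hat P_n(B)$ approaches $P_*(B)$ as $n$ grows:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
a, b = 2.0, 5.0      # parameters of a Beta distribution playing the role of the unknown P_*
lo, hi = 0.1, 0.4    # a fixed set B = [0.1, 0.4] in the sample space S = [0, 1]

# P_*(B), computable here only because we secretly chose P_* for the illustration
p_true = beta.cdf(hi, a, b) - beta.cdf(lo, a, b)

for n in (100, 10_000, 1_000_000):
    x = rng.beta(a, b, size=n)                 # the sample X_1, ..., X_n from P_*
    p_hat = np.mean((lo <= x) & (x <= hi))     # empirical measure: fraction of X_i landing in B
    print(f"n = {n:7d}   P_n(B) = {p_hat:.4f}   |P_n(B) - P_*(B)| = {abs(p_hat - p_true):.4f}")
```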