$\newcommand{\si}{\sigma}\renewcommand\th\theta\newcommand{\R}{\mathbb R}\newcommand{\Si}{\Sigma}$Since you requested more details, I am collecting here my comments (somewhat modified), with substantial details added.
- You wrote "I have the feeling that some basic implicit assumptions are not made explicit". I asked if you could explain why you have such a feeling. You then said you could not provide specific examples to explain this feeling, but your "overall feeling when starting to read a book on statistics or probability is that definitions are not motivated, and also [you] can't see clearly the global direction of the series of results developed in the book. [Your] feeling when reading a book on algebra, geometry or homological algebra is very different: to [you] the definitions are well motivated, and [you] can understand where the author wants to go".
-- It is unclear to me what kind of motivation you are seeking. When you read a book on homological algebra, the definitions you see there are probably motivated purely mathematically, based on needs in some other mathematical areas -- and this is what seems to satisfy you. So, are you looking for definitions in mathematical statistics similarly motivated by needs in some other mathematical areas? If so, you will hardly find such motivation anywhere. Mathematical statistics grew, not from other areas of mathematics, but from messy, non-mathematical statistics.
- You also wrote "Could you please recommend a casual approach of mathematical statistics/probability [...]?" -- Based on my experience, I think a casual approach is what is needed here the least. As for "general philosophy", I think what is useful there comes not at the beginning, but only some time after you have mastered probability (and then, based on it, mathematical statistics) mathematically. But of course, these suggestions may not work for everyone.
As for the link between probability and mathematical statistics, it is all here: Probability theory studies probability measures, especially over special structures, such as product sets and many others. Mathematical statistics deals with statistical models, which are families/sets of probability measures (also called distributions) over a set $S$ (called the sample space). Only one of the distributions is assumed to be "true", that is, to correspond to reality. The true distribution is only partially known, if at all.
We have a number of (say independent) realizations of a random element of the sample space $S$ with the true distribution. Based on these realizations, we have to make inferences about the unknown "true" distribution. We want these inferences to be good, according to some criterion. So, we may have an optimization problem, which in some cases can be solved exactly.
As you requested, here are a couple of examples illustrating the link between probability and mathematical statistics.
Example 1 (from the so-called parametric statistics): We make a sequence $(x_1,\dots,x_n)$ of measurements of a certain physical quantity $\mu$. We model the measurement errors as random variables (r.v.'s). We may think of the measurement error as the sum of many small errors due to a large number of (nearly) independent factors of relatively small effect each. Then, in view of the central limit theorem, we may model the measurement error as a normal r.v. We may further assume that there is no systematic error; that is, the mean of the random error is $0$. So, we may model the measurements $x_1,\dots,x_n$ as the realizations (that is, values) of independent normal r.v.'s $X_1,\dots,X_n$, each with mean $\mu$ and some variance $\si^2$. So, the statistical model here is the family $(P_{\mu,\si^2})$ of all normal distributions (over the real line), indexed by the two-dimensional parameter $\th:=(\mu,\si^2)$.

Based on the sequence $(x_1,\dots,x_n)$ of measurements, we want to estimate the unknown quantity $\mu$, which is a function of the unknown value of the parameter $\th=(\mu,\si^2)$. The most natural estimate of $\mu$ seems to be $\bar x_n:=\frac1n\,\sum_1^n x_i$. Although this is very unlikely, this estimate may be rather far from the true value $\mu$ -- say, if all the measurements $x_i$ happen to be significantly greater than $\mu$. However, the corresponding estimator $\bar X_n:=\frac1n\,\sum_1^n X_i$ (of which the estimate $\bar x_n$ is just one realization) will be good overall, in various average senses. (In general, an estimator is a function of the "sample" $(X_1,\dots,X_n)$.) Indeed, $\bar X_n$ is unbiased for $\mu$ -- that is, the expected value $E_{\mu,\si^2}\bar X_n$ of $\bar X_n$ is $\mu$ for all $(\mu,\si^2)$ (where $E_{\mu,\si^2}T(X_1,\dots,X_n):=\int_{\R^n}P_{\mu,\si^2}(du_1)\cdots P_{\mu,\si^2}(du_n)\,T(u_1,\dots,u_n)$). Moreover, it is known that $\bar X_n$ is optimal in the sense that it has the smallest variance among all unbiased estimators of $\mu$ based on the "sample" $(X_1,\dots,X_n)$. Furthermore, the estimator $\bar X_n$ of $\mu$ is consistent, in the sense that $\bar X_n\to\mu$ in $P_{\mu,\si^2}^{\otimes n}$-probability as $n\to\infty$.

So, unless we are unlucky, the estimate $\bar x_n$ of $\mu$ should be good (but, because of random fluctuations, this is not $100\%$ guaranteed, especially if $n$ is not large enough). So, we see that a lot here is about modeling; but, after the modeling is done, we use mathematical tools to prove properties of estimators, such as their optimality in a certain sense.
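To make Example 1 concrete, here is a minimal simulation sketch (not part of the example itself; the "true" values $\mu=2$, $\si=1.5$ and the sample sizes are hypothetical choices made only for illustration). It checks numerically that realizations of $\bar X_n$ get close to $\mu$ as $n$ grows (consistency) and that the average of $\bar X_n$ over many independent samples is close to $\mu$ (unbiasedness).

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true = 2.0, 1.5   # hypothetical "true" parameter values, unknown in practice

# Consistency: one realization of x_bar_n for increasing n
for n in (10, 100, 10_000):
    x = rng.normal(loc=mu_true, scale=sigma_true, size=n)   # the measurements x_1, ..., x_n
    x_bar = x.mean()                                         # the estimate x_bar_n of mu
    print(f"n = {n:6d}   x_bar = {x_bar:.4f}   error = {x_bar - mu_true:+.4f}")

# Unbiasedness: average X_bar_n over many independent samples of fixed size n
m, n = 20_000, 30
samples = rng.normal(mu_true, sigma_true, size=(m, n))
print("average of x_bar over", m, "samples:", round(samples.mean(axis=1).mean(), 4))  # ~ mu_true
```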
Example 2 (nonparametric statistics): Here the statistical model is a set of probability distributions not parametrized by a parameter with values in a low-dimensional space. For instance, here the model may be the set of all probability measures/distributions on a measurable space $(S,\Si)$, with the true distribution completely unknown (so, we are not assuming its normality or anything like that). Still, suppose we have a "sample" $(X_1,\dots,X_n)$ from the completely unknown true distribution $P_*$; that is, $X_1,\dots,X_n$ are independent random elements of $S$ each with distribution $P_*$. Then we can consider the so-called empirical (random) measure $\hat P_n$ defined by the formula
\begin{equation}
\hat P_n(B):=\frac1n\,\sum_1^n 1(X_i\in B)
\end{equation}
for $B\in\Si$. Then, by the law of large numbers, $\hat P_n$ is a consistent estimator of $P_*$ in the sense that for each $B\in\Si$
\begin{equation}
\hat P_n(B)\to P_*(B)
\end{equation}
in $P_*^{\otimes n}$-probability as $n\to\infty$. Closeness of $\hat P_n$ to $P_*$ in stronger senses has been studied extensively; see, e.g., Talagrand.
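As a similar sketch for Example 2 (again with hypothetical choices: the role of the "completely unknown" $P_*$ is played here by a Beta$(2,5)$ distribution on $S=[0,1]$, and $B=[0.1,0.4]$), one can check numerically that $\hat P_n(B)$ approaches $P_*(B)$ as $n$ grows:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
a, b = 2.0, 5.0      # parameters of a Beta distribution playing the role of the unknown P_*
lo, hi = 0.1, 0.4    # a fixed set B = [0.1, 0.4] in the sample space S = [0, 1]

# P_*(B), computable here only because we secretly chose P_* for the illustration
p_true = beta.cdf(hi, a, b) - beta.cdf(lo, a, b)

for n in (100, 10_000, 1_000_000):
    x = rng.beta(a, b, size=n)                 # the sample X_1, ..., X_n from P_*
    p_hat = np.mean((lo <= x) & (x <= hi))     # empirical measure: fraction of X_i landing in B
    print(f"n = {n:7d}   P_n(B) = {p_hat:.4f}   |P_n(B) - P_*(B)| = {abs(p_hat - p_true):.4f}")
```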