What is new in this answer compared to the other answers: I introduce the density operator by inserting the identity $1 = \sum_i \vert i\rangle \langle i \vert$ into the average, similarly to how one can use the trick $\operatorname{tr} \langle A \rangle = \langle A \rangle$ (the trace of a scalar is the scalar itself) to do the same thing.
Conceptual comments. The density matrix arises whenever your system is part of a bigger system and interacts with it. Since you can't model the total system, you average out the effect of the other part on yours. This is equivalent to taking a "marginal distribution/state". It can be interpreted as the system being in one pure state or another with different probabilities.
The mathematical details
Suppose that we have a system which we divide in a subsystem and an environment. We know a basis for the subsystem, $\vert {\phi_i} \rangle$, and for the environment, $\vert {\xi_j} \rangle$. I assume them to be orthonormal. Then a generic state is:
$$ \vert {\psi} \rangle = \sum_{ij} c_{ij} \vert {\phi_i} \rangle \otimes \vert {\xi_j} \rangle $$
Suppose we want to perform a measurement on our subsystem only, meaning we have an operator $O$ which only acts on $\vert {\phi_i} \rangle$:
$$ (O\otimes {1}) (\vert {\phi_i} \rangle \otimes \vert{\xi_j} \rangle) \equiv (O\vert {\phi_i} \rangle) \otimes \vert {\xi_j} \rangle $$
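This factorized action can be checked numerically. Below is a small NumPy sketch (the dimensions and the random observable are illustrative, not part of the argument) where $O \otimes 1$ is built with `np.kron`:

```python
import numpy as np

# Illustrative dimensions: a 2-level subsystem and a 3-level environment.
d_sys, d_env = 2, 3
rng = np.random.default_rng(0)

# A random Hermitian observable O acting on the subsystem only.
A = rng.normal(size=(d_sys, d_sys)) + 1j * rng.normal(size=(d_sys, d_sys))
O = A + A.conj().T

# Standard basis vectors |phi_i> and |xi_j> as columns of identity matrices.
phi = np.eye(d_sys)
xi = np.eye(d_env)

# Check (O ⊗ 1)(|phi_i> ⊗ |xi_j>) = (O|phi_i>) ⊗ |xi_j> on every basis pair.
O_full = np.kron(O, np.eye(d_env))
for i in range(d_sys):
    for j in range(d_env):
        lhs = O_full @ np.kron(phi[:, i], xi[:, j])
        rhs = np.kron(O @ phi[:, i], xi[:, j])
        assert np.allclose(lhs, rhs)
```

The identity holds exactly because `np.kron` uses the same ordering for operators and for states, so the subsystem factor is always the first index block.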
Its expected value is given by
\begin{align*}
\langle O \rangle = \langle {\psi} \vert O \vert {\psi} \rangle
\end{align*}
We don't know the total state $\vert \psi \rangle$, but we do know the possible states of our subsystem, $\lbrace \vert \phi_m \rangle\rbrace$. What is the best we can do knowing only this?
We can use the full expression of $ \vert {\psi} \rangle$ and try to "hide" the things we don't know how to calculate, like the expressions involving $\vert {\xi_j} \rangle$. We can use that the inner product of tensor products is $(\langle \phi_i \vert \otimes \langle \xi_j \vert)(\vert \phi_k \rangle \otimes \vert \xi_l \rangle)=\langle \phi_i \vert \phi_k \rangle \langle \xi_j \vert \xi_l \rangle$ (or alternatively, one can think of the tensor product being defined such that this is true by construction).
\begin{align*}
\langle O \rangle = \langle {\psi} \vert O \vert {\psi} \rangle = & \sum_{ij} \sum_{kl}c_{kl} c_{ij}^* (\langle \phi_i \vert \otimes \langle \xi_j \vert) ( O \otimes 1 ) (\vert \phi_k \rangle \otimes \vert \xi_l \rangle) \\
=& \sum_{ij} \sum_{kl} c_{kl}c_{ij}^* \langle \phi_i \vert O \vert \phi_k \rangle \langle \xi_j \vert \xi_l \rangle \\
=& \sum_{ik} \left(\sum_{m} c_{km}c_{im}^* \right) \langle \phi_i \vert O \vert \phi_k \rangle \\
=& \sum_{ik} p_{ki} \langle \phi_i \vert O \vert \phi_k \rangle
\end{align*}
So basically this is it, because $p_{ki}$ gives us all the information we need to compute the average value; $p_{ki} = \sum_m c_{km} c_{im}^*$ is the marginal of the products $c_{im}^* c_{km}$ (by "marginal" we mean we sum, or integrate, the probability over the degrees of freedom of the environment).
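The claim that $p_{ki}$ alone reproduces the global average can be verified numerically. In the sketch below (dimensions and the random state are illustrative), the matrix $p = c\,c^\dagger$ computed from the coefficients gives the same $\langle O \rangle$ as the full state:

```python
import numpy as np

# Illustrative dimensions and a random normalized global state psi.
d_sys, d_env = 2, 3
rng = np.random.default_rng(1)
c = rng.normal(size=(d_sys, d_env)) + 1j * rng.normal(size=(d_sys, d_env))
c /= np.linalg.norm(c)              # ensures <psi|psi> = 1
psi = c.reshape(-1)                 # |psi> = sum_ij c_ij |phi_i> ⊗ |xi_j>

# A random Hermitian observable on the subsystem.
A = rng.normal(size=(d_sys, d_sys)) + 1j * rng.normal(size=(d_sys, d_sys))
O = A + A.conj().T

# p[k, i] = sum_m c_km c_im^*  (the marginal over the environment index).
p = c @ c.conj().T

# <O> computed two ways: with the full state, and with p alone.
exp_full = psi.conj() @ np.kron(O, np.eye(d_env)) @ psi
exp_marginal = np.trace(O @ p)      # = sum_ik p_ki <phi_i|O|phi_k>
assert np.allclose(exp_full, exp_marginal)
```

In the standard basis $\langle \phi_i \vert O \vert \phi_k \rangle$ is just the matrix element `O[i, k]`, so the double sum $\sum_{ik} p_{ki} \langle \phi_i \vert O \vert \phi_k \rangle$ collapses to the trace `np.trace(O @ p)`.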
Since it has two indices, we could think of it as the element of a matrix, and that matrix is easy to construct: we just insert the identity $1=\sum_m \vert \phi_m \rangle \langle \phi_m \vert$ and obtain the density matrix $\rho$ by using the commutativity of scalars (we can move them around in the product):
\begin{align*}
\sum_{ik} p_{ki} \langle \phi_i \vert O \vert \phi_k \rangle = & \sum_{ik} p_{ki} \langle \phi_i \vert \sum_m \vert \phi_m \rangle \langle \phi_m \vert O \vert \phi_k \rangle \\
= & \sum_{ik} \sum_m p_{ki} \langle \phi_m \vert O \vert \phi_k \rangle \langle \phi_i \vert \phi_m \rangle \\
= & \sum_m \langle \phi_m \vert O \left(\sum_{ik} p_{ki} \vert \phi_k \rangle \langle \phi_i \vert\right) \vert \phi_m \rangle = \sum_m \langle \phi_m \vert O \rho \vert \phi_m \rangle \equiv \operatorname{tr} O \rho
\end{align*}
where $\rho=\sum_{ik} p_{ki} \vert \phi_k \rangle \langle \phi_i \vert$.
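This $\rho$ is exactly the partial trace of the projector $\vert \psi \rangle \langle \psi \vert$ over the environment. A short NumPy sketch (with illustrative dimensions) shows the two constructions agree:

```python
import numpy as np

# Illustrative random global state, as before.
d_sys, d_env = 2, 3
rng = np.random.default_rng(2)
c = rng.normal(size=(d_sys, d_env)) + 1j * rng.normal(size=(d_sys, d_env))
c /= np.linalg.norm(c)
psi = c.reshape(-1)

# rho_ki = p_ki = sum_m c_km c_im^*, directly from the coefficients.
rho = c @ c.conj().T

# Same object via an explicit partial trace of |psi><psi| over the
# environment: reshape the projector into (sys, env, sys, env) indices
# and trace out the two environment indices.
proj = np.outer(psi, psi.conj()).reshape(d_sys, d_env, d_sys, d_env)
rho_pt = np.einsum('imjm->ij', proj)
assert np.allclose(rho, rho_pt)
assert np.isclose(np.trace(rho).real, 1.0)   # tr(rho) = <psi|psi> = 1
```

The `einsum` subscript `'imjm->ij'` repeats the environment index `m`, which is precisely the "sum over the degrees of freedom of the environment" from the derivation above.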
More comments
We have reduced the computation of the global average to calculating averages over the states of our subsystem of a new object, $O\rho$. The object $\rho$ has hidden in it both the state of our subsystem and the average effect of the environment. I will make a few further comments about how one can think about $\rho$ and the mathematical justification.
First of all, it is easy to prove (see any textbook or Wikipedia) that $\rho^\dagger = \rho$ and that $\rho$ is positive semidefinite with unit trace; therefore there exists an orthonormal basis in which $\rho$ is diagonal:
$$ \rho = \sum_\lambda p_\lambda \vert \lambda \rangle \langle \lambda \vert, \qquad p_\lambda \geq 0, \quad \sum_\lambda p_\lambda = 1 $$
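Since $\rho$ is Hermitian, this diagonalization is exactly what `np.linalg.eigh` computes. A quick check on an illustrative random $\rho$:

```python
import numpy as np

# Build an illustrative density matrix from a random bipartite pure state.
rng = np.random.default_rng(3)
c = rng.normal(size=(2, 3)) + 1j * rng.normal(size=(2, 3))
c /= np.linalg.norm(c)
rho = c @ c.conj().T

# eigh returns the eigenvalues p_lambda and the eigenvectors |lambda>
# (as columns), which form an orthonormal basis.
p_lam, lam = np.linalg.eigh(rho)
assert np.all(p_lam >= -1e-12)        # positive semidefinite
assert np.isclose(p_lam.sum(), 1.0)   # probabilities sum to 1

# The spectral decomposition reconstructs rho:
# rho = sum_lambda p_lambda |lambda><lambda|.
assert np.allclose(rho, lam @ np.diag(p_lam) @ lam.conj().T)
```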
What is the significance of the states $\vert \lambda \rangle$? They represent the most "classical-like" states. This is because they are orthonormal, $\langle \lambda \vert \lambda^{'} \rangle=\delta_{\lambda \lambda^{'}}$, so there are no correlations between them. They are as mutually exclusive as you can get: if you know your system is in state $\vert \lambda \rangle$, then it automatically cannot be in $\vert \lambda^{'} \rangle$ for $\lambda^{'}\neq \lambda$, because the projection onto it is zero.
Consider a two-level system $\lbrace \vert 0 \rangle, \vert 1 \rangle \rbrace$. If you are in state $\vert 0 \rangle$, you know you cannot be in $\vert 1 \rangle$, but you have a $50\%$ probability of being in either of the two diagonal states $\vert \pm \rangle = (\vert 0 \rangle \pm \vert 1 \rangle)/\sqrt{2}$, so the system isn't "classical-like" with respect to the diagonal states. Also, if you are assured all your particles are in one of these states ($\vert 0 \rangle$ or $\vert 1 \rangle$), you can measure without collapsing them, provided you have an observable with $\lbrace \vert 0 \rangle, \vert 1 \rangle \rbrace$ as eigenstates.
Such an observable always exists because we can just build it ourselves; we only need it to have the same eigenstates as the density matrix, $\lbrace \vert \lambda \rangle \rbrace$:
$$ M = \sum_\lambda m_\lambda \vert \lambda \rangle \langle \lambda \vert $$ (It measures 0 on everything which is outside $\text{span}\lbrace \vert \lambda \rangle \rbrace$).
The usefulness of this observable is that we can use it to prepare the mixed state! We measure a random $\vert \psi \rangle$; after measuring, we know the state $\vert \lambda \rangle$ to which it collapsed. We take $N$ measurements on $N$ different states, and after each measurement we keep the state or discard it so that the ratio of states matches the one prescribed by the density matrix, $p_\lambda = N_\lambda / N$, where $N_\lambda$ is the number of systems in state $\vert \lambda \rangle$. If we now measure $M$, the average value will be
$$ \langle M \rangle = \sum \frac{N_\lambda}{N} \langle \lambda \vert M \vert \lambda \rangle = \sum p_\lambda m_\lambda $$
and we see that this is a classical average of the values $m_\lambda$ weighted by the probabilities $p_\lambda$. It can be shown that this equals $\operatorname{tr} M\rho$.
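That last equality, $\sum_\lambda p_\lambda m_\lambda = \operatorname{tr} M\rho$, is easy to confirm numerically. In the sketch below the outcome values $m_\lambda$ are arbitrary illustrative numbers:

```python
import numpy as np

# Illustrative density matrix from a random bipartite pure state.
rng = np.random.default_rng(4)
c = rng.normal(size=(2, 3)) + 1j * rng.normal(size=(2, 3))
c /= np.linalg.norm(c)
rho = c @ c.conj().T

# Build M = sum_lambda m_lambda |lambda><lambda| from rho's eigenbasis.
p_lam, lam = np.linalg.eigh(rho)
m_lam = np.array([1.0, 2.0])          # arbitrary measurement outcomes
M = lam @ np.diag(m_lam) @ lam.conj().T

# Classical weighted average equals the quantum expectation tr(M rho).
assert np.isclose(p_lam @ m_lam, np.trace(M @ rho).real)
```

It works because $M$ and $\rho$ are diagonal in the same basis, so the trace reduces to a sum over the common eigenvalues.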
As a curiosity, von Neumann uses the idea of $M$ in his book "Mathematical Foundations of Quantum Mechanics" as a way to implement a filter that separates quantum systems without collapsing them. If you have particles in two different quantum states all mixed in a box, you can physically separate them with a moving wall acting as a filter, but only if the possible states of the particles are orthogonal. In that case you can find a measurement which distinguishes the two states without collapsing them, allowing you to build a filter that passes or blocks a given kind of particle and thus separates the quantum systems.
So to sum up: the first introduction was a top-down approach, where we obtained the density matrix by averaging out the environment and found that, mathematically, the result is like having an average over pure states. The second approach takes that "average over pure states" as reality and considers ensembles of pure states. Then, if we have $N$ systems with $N_\lambda$ of them in state $\vert \lambda \rangle$, averaging an observable over the $N$ systems gives the same result as the mixed state described by $\rho$.