You seem to think (because of the poor phrasing) that we're talking about linearizing with respect to $\theta$ alone; however, that is incorrect. To properly understand this we need multivariable calculus (really just the idea that a derivative is a linear approximation; see this MSE answer for the definition in the 2-dimensional case, which generalizes verbatim to any number of dimensions, as stated in the first half of this MSE answer). Below I describe the systematic way to think about it and derive the equation, but with some practice the procedure becomes very mechanical.
Let $V,W$ be real vector spaces. When we speak of $n^{th}$ order ODEs, what we typically mean is that we are given some smooth function $f:V^{n+1}\to W$, and we are looking for a smooth curve $\phi:I\subset \Bbb{R}\to V$ (where $I$ is an open interval) such that for all $t$, the equation
\begin{align}
f(\phi(t),\dot{\phi}(t),\dots, \phi^{(n)}(t))&=0\tag{$*$}
\end{align}
is satisfied. Specializing to your example, we have:
Take $V=W=\Bbb{R}$ and $n=2$ (a second order ODE), so $n+1=3$. The function in question is $f:\Bbb{R}^3\to\Bbb{R}$, defined as
\begin{align}
f(\alpha_0,\alpha_1,\alpha_2)=\left[m_1+2m_2\sin^2(\alpha_0)\right]\alpha_2-m_1\Omega^2\sin(\alpha_0)\left[\cos(\alpha_0)-\cos\theta_{\text{eq}}\right].\tag{$**$}
\end{align}
All I did was take equation (11), move all the terms to one side, and define a corresponding function $f$ of three variables. Now, solving this ODE means finding a function $\theta:\Bbb{R}\to\Bbb{R}$ (with appropriate initial conditions, of course) such that $f(\theta(t),\dot{\theta}(t),\ddot{\theta}(t))=0$ for all $t$, i.e. such that
\begin{align}
\left[m_1+2m_2\sin^2(\theta(t))\right]\ddot{\theta}(t)-m_1\Omega^2\sin(\theta(t))\left[\cos(\theta(t))-\cos\theta_{\text{eq}}\right]&=0.
\end{align}
If you rearrange this equation, it is precisely your equation (11). As a side remark: in this specific case, $f$ doesn't depend on the second variable $\alpha_1$, but that's just a fluke, so I'll write everything out as if that were not the case, so that it's clear how to handle more general situations.
Typically, what happens is that we can find some specific solution $\phi_0$ (so $f(\phi_0(t),\dot{\phi}_0(t),\dots, \phi_0^{(n)}(t))=0$) which satisfies some specific set of initial conditions. We can then ask how to find a solution $\phi$ with slightly perturbed initial conditions. This is a very reasonable and intuitive question to ask.
In your specific case, the special solution is the equilibrium solution $\theta_0(t)=\theta_{\text{eq}}$ for all $t$, which has $\dot{\theta}_0(t)=\ddot{\theta}_0(t)=0$ for all $t$. This is the 'boring solution' of the system: the pendulum stays perfectly still at the equilibrium. What we would like to know is what the resulting motion of the pendulum would look like if, for instance, we initially started the pendulum off at a slightly different location from $\theta_{\text{eq}}$, and maybe gave it a small nudge in the beginning (i.e. the small-angle approximation). Well, this is exactly the type of situation we're analyzing above in more generality (the notation is heavier, but the idea is simple enough, I hope).
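As a quick sanity check, here's a short sympy sketch (the variable names are my own; I write the restoring factor as $\sin\alpha_0$, the form that yields the linearized coefficients derived further down) of the function $f$ and the fact that the equilibrium point kills it:

```python
import sympy as sp

# Variable names are mine: (a0, a1, a2) are the three slots of f from (**),
# and the parameters of the problem are treated as positive symbols.
a0, a1, a2 = sp.symbols('alpha_0 alpha_1 alpha_2')
m1, m2, Omega, th_eq = sp.symbols('m_1 m_2 Omega theta_eq', positive=True)

f = (m1 + 2*m2*sp.sin(a0)**2)*a2 \
    - m1*Omega**2*sp.sin(a0)*(sp.cos(a0) - sp.cos(th_eq))

# The equilibrium curve Theta_0(t) = (theta_eq, 0, 0) is a single point,
# and f vanishes there, confirming that theta_0 solves the ODE.
print(sp.simplify(f.subs({a0: th_eq, a1: 0, a2: 0})))  # 0
```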
In general, this is a tough question to answer in closed form, so we instead try to approximate $\phi$ from $\phi_0$. For notation's sake we shall introduce the abbreviation $\Phi(t)=(\phi(t),\dot{\phi}(t),\dots,\phi^{(n)}(t))$, and likewise for $\Phi_0(t)$. Now, fix a particular $t$. We have (by definition of the derivative of $f$)
\begin{align}
f(\Phi(t))&=f(\Phi_0(t))+Df_{\Phi_0(t)}[\Phi(t)-\Phi_0(t)]+o\left(\|\Phi(t)-\Phi_0(t)\|\right)\\
&=0 + Df_{\Phi_0(t)}[\Phi(t)-\Phi_0(t)]+o\left(\|\Phi(t)-\Phi_0(t)\|\right).
\end{align}
Here, the notation $Df_p(h)$ means the derivative of $f$ at the point $p$ applied to the 'displacement vector' $h$. This expression motivates the following definition:
Definition.
With notation as above, we shall refer to the ODE
\begin{align}
Df_{\Phi_0(t)}[(\xi(t),\dot{\xi}(t),\dots,\xi^{(n)}(t))]&=0
\end{align}
as the linearized/variational ODE for $f$ about the particular solution $\phi_0$.
So, the goal is to find a function $\xi:I\to V$ which satisfies the above equation, because if we manage to do so, then the function $\phi_0(t)+\xi(t)$ will be an approximate solution to $(*)$, i.e.
\begin{align}
f(\phi_0(t)+\xi(t), \dot{\phi}_0(t)+\dot{\xi}(t),\dots, \phi_0^{(n)}(t)+\xi^{(n)}(t))&\approx 0,
\end{align}
where the error is up to first order in $\xi$ and its derivatives, i.e. $o(\|(\xi(t),\dots, \xi^{(n)}(t))\|)$. Now, it is of course traditional to denote $\xi$ by the symbol $\delta\phi$, and the derivatives are denoted $\xi^{(j)}(t)\equiv \delta\phi^{(j)}(t)$ (we avoid the more correct but cumbersome brackets $(\delta\phi)^{(j)}(t)$ in the notation).
In your specific example, as I mentioned above, the particular solution is $\theta_0(t)=\theta_{\text{eq}}$ for all $t$. This means $\dot{\theta}_0(t)=\ddot{\theta}_0(t)=0$. So, the linearized problem is to find a function $\delta\theta:\Bbb{R}\to\Bbb{R}$ such that for all $t\in\Bbb{R}$, we have
\begin{align}
0&=Df_{(\theta_0(t),\dot{\theta}_0(t),\ddot{\theta}_0(t))}[(\delta\theta(t),
\delta\dot{\theta}(t),\delta\ddot{\theta}(t))]\\
&=Df_{(\theta_{\text{eq}},0,0)}[(\delta\theta(t),
\delta\dot{\theta}(t),\delta\ddot{\theta}(t))].
\end{align}
In this special case things turned out very nicely because $\theta_0$ being a constant function means the curve $\Theta_0(t)=(\theta_0(t),\dot{\theta}_0(t),\ddot{\theta}_0(t))=(\theta_{\text{eq}},0,0)$ is just a single point, so we just have to calculate the derivative of $f$ at this single point. Recall the definition $(**)$ of our specific function $f$. In order to calculate the derivative $Df_{(\theta_{\text{eq}},0,0)}$ we could either calculate the partial derivatives directly, or just look at the first term in the Taylor expansion (these are literally equivalent by definition of differentiability). Now, we plug in
\begin{align}
\begin{cases}
\sin\alpha_0&=\sin\theta_{\text{eq}}+\cos\theta_{\text{eq}}(\alpha_0-\theta_{\text{eq}})+\mathcal{O}(|\alpha_0-\theta_{\text{eq}}|^2)\\
\sin^2\alpha_0&=\sin^2\theta_{\text{eq}}+2\sin\theta_{\text{eq}}\cos\theta_{\text{eq}}(\alpha_0-\theta_{\text{eq}}) + \mathcal{O}(|\alpha_0-\theta_{\text{eq}}|^2)\\
\cos\alpha_0-\cos\theta_{\text{eq}}&=-\sin\theta_{\text{eq}}(\alpha_0-\theta_{\text{eq}})+ \mathcal{O}(|\alpha_0-\theta_{\text{eq}}|^2)
\end{cases}
\end{align}
into the definition for $f(\alpha_0,\alpha_1,\alpha_2)$, and only keep terms up to first order in $\|(\alpha_0,\alpha_1,\alpha_2)-(\theta_{\text{eq}},0,0)\|=\|(\alpha_0-\theta_{\text{eq}}, \alpha_1,\alpha_2)\|$. If you carry this out, you'll see that there is a term $(\alpha_0-\theta_{\text{eq}})\cdot \alpha_2$ which will appear (this is the term you're asking about), but this is second order in $\|(\alpha_0-\theta_{\text{eq}}, \alpha_1,\alpha_2)\|$. Yes, this term is linear in $\alpha_0-\theta_{\text{eq}}$, but that's not what we care about; this is multivariable calculus, so we are carrying out a linearization/differentiation in several variables. If you write this out fully, you'll find that
\begin{align}
f(\alpha_0,\alpha_1,\alpha_2)&=\left(m_1+2m_2\sin^2\theta_{\text{eq}}\right)\alpha_2 + m_1\Omega^2\sin^2(\theta_\text{eq})(\alpha_0-\theta_{\text{eq}}) + \mathcal{O}\left(\|(\alpha_0-\theta_{\text{eq}}, \alpha_1,\alpha_2)\|^2\right).
\end{align}
Thus, the derivative is the linear part:
\begin{align}
Df_{(\theta_{\text{eq}},0,0)}(h_0,h_1,h_2)&=\left(m_1+2m_2\sin^2\theta_{\text{eq}}\right)h_2 + m_1\Omega^2\sin^2(\theta_{\text{eq}})\,h_0.
\end{align}
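If you'd like to double-check this symbolically, here is a sympy sketch (names mine) that computes $Df_{(\theta_{\text{eq}},0,0)}(h_0,h_1,h_2)$ as a directional derivative and compares it with the linear part above:

```python
import sympy as sp

a0, a1, a2 = sp.symbols('alpha_0 alpha_1 alpha_2')
m1, m2, Omega, th_eq = sp.symbols('m_1 m_2 Omega theta_eq', positive=True)

f = (m1 + 2*m2*sp.sin(a0)**2)*a2 \
    - m1*Omega**2*sp.sin(a0)*(sp.cos(a0) - sp.cos(th_eq))

# Df at p = (theta_eq, 0, 0) applied to h = (h0, h1, h2), computed as the
# directional derivative d/ds f(p + s*h) at s = 0.
s, h0, h1, h2 = sp.symbols('s h_0 h_1 h_2')
Df = f.subs({a0: th_eq + s*h0, a1: s*h1, a2: s*h2}).diff(s).subs(s, 0)

# Compare with the claimed linear part.
claimed = (m1 + 2*m2*sp.sin(th_eq)**2)*h2 + m1*Omega**2*sp.sin(th_eq)**2*h0
print(sp.simplify(Df - claimed))  # 0
```

Note that the result contains no $h_1$, matching the remark that $f$ happens not to depend on $\alpha_1$ here.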
Hence, the linearized/variational ODE to solve is:
\begin{align}
\left(m_1+2m_2\sin^2\theta_{\text{eq}}\right)\delta\ddot{\theta}(t) + m_1\Omega^2\sin^2(\theta_{\text{eq}})\,\delta\theta(t)&=0,
\end{align}
or equivalently,
\begin{align}
\delta\ddot{\theta}(t)&=-\frac{m_1\Omega^2\sin^2(\theta_{\text{eq}})}{m_1+2m_2\sin^2(\theta_{\text{eq}})}\delta\theta(t),
\end{align}
which is the equation of a simple harmonic oscillator, with angular frequency $\omega=\Omega\sin(\theta_{\text{eq}})\sqrt{\frac{m_1}{m_1+2m_2\sin^2(\theta_{\text{eq}})}}$.
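As a numerical sanity check of the harmonic-oscillator conclusion, the following sketch integrates the full nonlinear equation with a hand-rolled RK4 step, starting from a small displacement, and compares the measured oscillation period to the prediction $2\pi/\omega$ of the linearized equation. The parameter values are arbitrary choices of mine:

```python
import math

# Illustrative parameter values (my choice, not from the problem).
m1, m2, Omega = 1.0, 0.5, 2.0
th_eq = math.pi / 3

def accel(th):
    # Equation (11) solved for theta'': the full nonlinear dynamics.
    return m1 * Omega**2 * math.sin(th) * (math.cos(th) - math.cos(th_eq)) \
        / (m1 + 2 * m2 * math.sin(th)**2)

# RK4 integration from a slightly perturbed initial condition.
dt, t = 1e-3, 0.0
th, v = th_eq + 0.01, 0.0
crossings, prev = [], th - th_eq
for _ in range(20000):
    k1x, k1v = v, accel(th)
    k2x, k2v = v + 0.5*dt*k1v, accel(th + 0.5*dt*k1x)
    k3x, k3v = v + 0.5*dt*k2v, accel(th + 0.5*dt*k2x)
    k4x, k4v = v + dt*k3v, accel(th + dt*k3x)
    th += dt * (k1x + 2*k2x + 2*k3x + k4x) / 6
    v += dt * (k1v + 2*k2v + 2*k3v + k4v) / 6
    t += dt
    cur = th - th_eq
    if prev * cur < 0:  # delta-theta changed sign: record the zero crossing
        crossings.append(t - dt * cur / (cur - prev))
    prev = cur

period = 2 * (crossings[1] - crossings[0])  # consecutive zeros are T/2 apart
omega = Omega * math.sin(th_eq) * math.sqrt(m1 / (m1 + 2 * m2 * math.sin(th_eq)**2))
print(period, 2 * math.pi / omega)  # the two agree to well under 1%
```

For larger initial displacements the agreement degrades, as expected: the linearization is only valid near the equilibrium.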
How to make things more mechanical:
For differentiable functions $f$, the chain rule tells us that
\begin{align}
Df_p(h)&=\frac{d}{d\epsilon}\bigg|_{\epsilon=0}f(p+\epsilon h)\equiv (D_hf)(p),
\end{align}
i.e. the derivative of $f$ at $p$ evaluated on $h$ equals the directional derivative of $f$ in the direction $h$ at the point $p$. Therefore, to calculate the (Fréchet) derivative $Df_p(h)$, it suffices to expand $f(p+\epsilon h)$ to first order in $\epsilon$: writing $f(p+\epsilon h)=f(p)+\epsilon\, Df_p(h)+\mathcal{O}(\epsilon^2)$, the coefficient of $\epsilon$ is exactly $Df_p(h)$.
This is what's being said in the other answer as well: write $\theta(t)=\theta_0(t)+\epsilon\,\delta\theta(t)=\theta_{\text{eq}}+\epsilon\,\delta\theta(t)$ (so that $\dot{\theta}(t)=\epsilon\,\delta\dot{\theta}(t)$ and $\ddot{\theta}(t)=\epsilon\,\delta\ddot{\theta}(t)$). Then, expand $f(\theta(t),\dot{\theta}(t),\ddot{\theta}(t))$ up to first order in $\epsilon$; the coefficient of $\epsilon$ will lead you to the linearized equation.
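This $\epsilon$-procedure can be carried out symbolically as well; the sketch below (names mine) substitutes $\theta(t)=\theta_{\text{eq}}+\epsilon\,\delta\theta(t)$, extracts the coefficient of $\epsilon$, and recovers the linearized equation:

```python
import sympy as sp

t, eps = sp.symbols('t epsilon')
m1, m2, Omega, th_eq = sp.symbols('m_1 m_2 Omega theta_eq', positive=True)
dth = sp.Function('deltatheta')  # the perturbation delta-theta(t)

# theta(t) = theta_eq + eps * delta-theta(t); its time derivatives follow suit.
theta = th_eq + eps*dth(t)
expr = (m1 + 2*m2*sp.sin(theta)**2)*sp.diff(theta, t, 2) \
    - m1*Omega**2*sp.sin(theta)*(sp.cos(theta) - sp.cos(th_eq))

# The coefficient of eps is the left-hand side of the linearized equation.
lin = expr.diff(eps).subs(eps, 0)
expected = (m1 + 2*m2*sp.sin(th_eq)**2)*sp.diff(dth(t), t, 2) \
    + m1*Omega**2*sp.sin(th_eq)**2*dth(t)
print(sp.simplify(lin - expected))  # 0
```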
With some more practice, the introduction of the function $f$, and even of the parameter $\epsilon$, become automatic and mental, so you wouldn't even have to write them out anymore. To conclude, the key takeaway here is that we are performing a linearization in several variables, NOT just in the single variable $\theta-\theta_{\text{eq}}$ (or $\alpha_0-\theta_{\text{eq}}$ in my notation). And the reason we want to perform a linearization in several variables is that, for the analysis to be meaningful (i.e. approximating about an exact solution), we need control over the higher-order derivatives as well (i.e. over $\|\Phi(t)-\Phi_0(t)\|$ in the general notation above). Even physically, this should make sense: when we consider small oscillations, it wouldn't make sense to start at the equilibrium position but with an incredibly large initial velocity (i.e. smacking the pendulum really hard initially), because that would clearly make the oscillations large.