@Noldor and @CuriousOne are both partially right.
MIMO can be, and is, used to increase capacity (and also to reduce multipath effects, see the next two paragraphs) by creating multiple nearly independent channels (NICs). What the explanation leaves out is that the NICs are not antenna channels; they are what are called space-time channels. Not Einstein's space-time: the processing is done across the spatially separated antennas, and a code is also inserted in the time domain and processed. These are called space-time codes. After decoding, the multipath rays (thinking of multipath, as an approximation, as multiple rays) received at all the antennas are separated. That was the magic of MIMO. It took a lot of smart people to figure out that it was possible, and to figure out simple codes to do it; from there it kept getting improved. The codes were designed so that signals with the same time delays, accounting for both the spatial and the time differences due to the antennas and the reflecting multipath points, were separated into coherent signals.
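To make the idea of a space-time code concrete, here is a minimal sketch of the Alamouti scheme, one of the simplest well-known space-time block codes. The answer above does not name a specific code, so treat this purely as an illustration. It assumes two transmit antennas, one receive antenna, flat fading gains `h1` and `h2` that stay constant over two symbol slots and are known at the receiver, and no noise; all function names are made up for this example.

```python
def alamouti_encode(s1, s2):
    """Return what each antenna sends in two time slots: [(ant1, ant2), (ant1, ant2)]."""
    return [(s1, s2), (-s2.conjugate(), s1.conjugate())]

def channel(slots, h1, h2, noise=(0j, 0j)):
    """One receive antenna; h1, h2 are assumed constant over both slots."""
    return [h1 * a + h2 * b + n for (a, b), n in zip(slots, noise)]

def alamouti_decode(received, h1, h2):
    """Linear combining that separates the two symbols again."""
    r1, r2 = received
    gain = abs(h1) ** 2 + abs(h2) ** 2
    s1_hat = (h1.conjugate() * r1 + h2 * r2.conjugate()) / gain
    s2_hat = (h2.conjugate() * r1 - h1 * r2.conjugate()) / gain
    return s1_hat, s2_hat

h1, h2 = 0.8 + 0.3j, -0.2 + 0.9j   # arbitrary example channel gains
s1, s2 = 1 + 1j, -1 + 1j           # two QPSK-like symbols
received = channel(alamouti_encode(s1, s2), h1, h2)
s1_hat, s2_hat = alamouti_decode(received, h1, h2)
print(s1_hat, s2_hat)
```

Each symbol is sent once per antenna, spread over space and time, and the combining step recovers both symbols scaled by `|h1|**2 + |h2|**2`, which is where the diversity benefit comes from.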
Of course, it all depends on the characteristics of the multipath. If there is no multipath, there is NO MIMO gain. It was not directivity at all; it was this strange coding. People at really smart places like MIT and DARPA at first could not believe it, thinking it violated Shannon's law, but eventually they did.
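The dependence on multipath can be sketched numerically with the standard capacity expression for equal power per transmit antenna, C = log2 det(I + (SNR/Nt) H H^H). The two 2x2 channel matrices below are idealized assumptions chosen for illustration: an all-ones matrix standing in for a pure line-of-sight link with no multipath (rank one), and a scaled identity standing in for a rich-scattering channel with two fully separable paths. Both have the same total channel power.

```python
import math

def capacity_2x2(H, snr):
    """Capacity in bits/s/Hz of a 2x2 channel, equal power on both transmit antennas."""
    (a, b), (c, d) = H
    # Entries of the Gram matrix G = H * H^H
    g11 = abs(a) ** 2 + abs(b) ** 2
    g22 = abs(c) ** 2 + abs(d) ** 2
    g12 = a * c.conjugate() + b * d.conjugate()
    s = snr / 2  # SNR split across the two transmit antennas
    det = (1 + s * g11) * (1 + s * g22) - (s ** 2) * abs(g12) ** 2
    return math.log2(det)

snr = 100.0  # 20 dB, linear scale

# Assumed "no multipath" channel: every antenna pair sees the same gain (rank one)
los = ((1 + 0j, 1 + 0j), (1 + 0j, 1 + 0j))

# Assumed "rich multipath" channel: two perfectly separable paths, same total power
r = math.sqrt(2)
rich = ((r + 0j, 0j), (0j, r + 0j))

c_los = capacity_2x2(los, snr)
c_rich = capacity_2x2(rich, snr)
print(round(c_los, 2), round(c_rich, 2))
```

At 20 dB the rank-one channel supports roughly 7.7 bits/s/Hz, a single stream, while the separable channel supports roughly 13.3 bits/s/Hz, two streams of log2(1 + SNR) each. Real channels fall somewhere in between.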
There are then a few variations, and different codes that are better for different purposes. You can put the same information (data) on each or some of those NICs and use the processing to implement diversity and reduce multipath effects. You don't reduce noise; that is not the purpose. Or you could instead put different information (data) on each NIC and use it to increase capacity. Or you could make it adaptive, to do the best it can depending on the multipath environment: sometimes more capacity, sometimes better diversity gains, or a mix.
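The "different data on each NIC" case can be sketched as spatial multiplexing with a zero-forcing receiver, which is only one of several possible receivers and is used here because it is the simplest. The sketch assumes a noiseless 2x2 flat-fading channel `H` known at the receiver; the matrix values and function names are arbitrary choices for illustration.

```python
def transmit(H, x):
    """Noiseless 2x2 channel: y = H x."""
    (a, b), (c, d) = H
    x1, x2 = x
    return (a * x1 + b * x2, c * x1 + d * x2)

def zero_forcing_2x2(H, y):
    """Recover both streams by inverting the 2x2 channel matrix."""
    (a, b), (c, d) = H
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("channel matrix is singular: streams cannot be separated")
    y1, y2 = y
    return ((d * y1 - b * y2) / det, (-c * y1 + a * y2) / det)

# Arbitrary example channel with distinct paths between the antenna pairs
H = ((0.9 + 0.1j, 0.3 - 0.4j), (-0.2 + 0.5j, 0.7 + 0.6j))
x = (1 + 1j, -1 - 1j)          # two different symbols sent at the same time
y = transmit(H, x)
x_hat = zero_forcing_2x2(H, y)
print(x_hat)
```

If every entry of `H` is identical, as in an idealized channel with no multipath, the determinant is zero and the two streams cannot be told apart, which matches the point above that the gain depends on the multipath.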
MIMO has also been used to mean multiple-output beamforming. Again, depending on the needs and the propagation environment, people sometimes want this.
It was implemented first in a number of pilot projects and unique applications, and then first commercially in 802.11n. Now it is also used in 802.11ac and other versions. It is also specified as part of the 4G technologies using LTE and LTE-Advanced. It will be used in 5G, starting around 2020, using some versions of cooperative and massive MIMO. Massive MIMO will use a large number of antennas in a MIMO array, both at cellular base stations and at cell phones and other wireless devices. Large means maybe dozens or more.
Please note that antennas in most cellular and wireless devices are omnidirectional because you never know where the antenna you are talking to may be. The modern concept is to use processing, like in some of the MIMO versions, to exploit the spatial gains possible. We have exploited temporal and spectral degrees of freedom close to the max (in terms of Shannon limits), and spatial degrees of freedom are next. Polarization at most gives you a factor of 2, more or less 'negligible' in wireless comms, but when needed it is used.
It is not clear what other degrees of freedom can be exploited to increase capacity and performance. The last one is the size of spatial cells, i.e., the coverage areas of base stations. As we reduce those to perhaps personal-sized cells, space could then be used for all it is worth.
Or somebody comes up with a new degree of freedom.