There is more than one definition of Entropy. We kept on adding definitions because we kept on running into things that behaved like Entropy.
It might help to think of Entropy as information. What information does it take to describe a system? How much information?
"Entropy always increases" is a side effect of "information is never destroyed." As your system evolves, the amount of information it takes to describe it can grow to the information required to describe the initial state, plus the information required to describe how it evolved.
But, you say, the evolution is fully determined by the initial state!
Except this isn't true; in a quantum world, your initial observations of the state are not sufficient to predict your future observations. What's more, no series of observations is sufficient to fully predict future observations. And even if you only want distributions, every observation changes the distributions of what future observations will look like; you need to remember how you interacted with the system in order to know what state it is in. Every interaction adds to the amount of information you need to store to fully describe its state.
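Here's a minimal sketch of that last point (the toy qubit, the numpy vectors, and the measure() helper are mine, not anything from the text): a single measurement in one basis changes the statistics of every later measurement, so you have to carry the record of what you did to the system in order to predict what it will do next.

```python
import numpy as np

# Computational (Z) basis states, and the X-basis states built from them.
zero = np.array([1.0, 0.0])
one = np.array([0.0, 1.0])
plus = (zero + one) / np.sqrt(2)
minus = (zero - one) / np.sqrt(2)

def measure(state, basis):
    """Projective measurement: returns (outcome index, post-measurement state)."""
    probs = [abs(b @ state) ** 2 for b in basis]
    outcome = np.random.choice(len(basis), p=probs)
    return outcome, basis[outcome]

state = zero  # start in |0>: a Z measurement here is 100% predictable

# Measure in the X basis; either outcome is equally likely.
x_outcome, state = measure(state, [plus, minus])

# A later Z measurement is now 50/50 -- but you only know that if you
# remember that an X measurement happened. The measurement record is
# part of the information needed to describe the system.
z_probs = [abs(zero @ state) ** 2, abs(one @ state) ** 2]
print(x_outcome, z_probs)  # z_probs ~ [0.5, 0.5] whatever x_outcome was
```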
Backing up a second, and without quantum mechanics, presume you know a lot about the world, but not everything. There are error bars on your ability to describe the universe.
So you take your universe's states and divide them up into "given my level of observation, this is what I can distinguish between" pieces.
Next, I'm going to require that you don't have a near-infinite number of possible states. Instead, I'm going to require that you have the same number of states at each power-of-(say)-10 level of precision.
So if your ability to see the universe is accurate to within 1/10^30 meters, I'm asking for an equal number of states at precision 1/10^30 meters, 1/10^29, 1/10^28, 1/10^27, 1/10^26, etc.
The lower-precision buckets are going to be far larger than the higher-precision buckets. But you are free to assign states to the buckets however you like.
If we count the number of microstates in the large buckets, it will be much larger than the number of microstates in the smaller buckets.
"Entropy always grows" could be saying "you can't arrange the states so that the universe avoids moving into the bigger buckets over time, and once it does, it almost certainly won't go back into the smaller buckets."
Higher-entropy macrostates have insanely more microstates than lower-entropy macrostates. If we divide the universe into macrostates in any halfway sensible way, the universe will evolve from low-entropy macrostates to high-entropy macrostates, because staying within the low-entropy macrostates would require picking the microstates inside them with insane precision. There isn't enough room inside the low-entropy macrostates to fit the microstates we could evolve into.
The only way around this is to create something akin to 1 macrostate per microstate. Then we can no longer measure entropy, because everything has 0 entropy.
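Here's a toy version of that argument, as a sketch (the state space, the shuffle dynamics, the bucket sizes, and the starting point are all invented for illustration): a fixed reversible shuffle over about a million microstates, coarse-grained into one tiny bucket and one huge bucket, with entropy measured as log2 of the size of whichever bucket you're currently in. Start in the tiny bucket and the trajectory almost immediately wanders into the huge one and stays there.

```python
import math
import random

N = 2 ** 20          # number of microstates in the toy universe
random.seed(0)

# A fixed, reversible dynamics: one permutation of the state space,
# applied over and over. No information is ever destroyed.
step = list(range(N))
random.shuffle(step)

# Coarse-grain into two macrostates ("buckets") of wildly unequal size:
# microstates 0..1023 form the tiny low-entropy bucket, the rest form
# the giant high-entropy bucket.
SMALL = 1024

def entropy_bits(x):
    bucket_size = SMALL if x < SMALL else N - SMALL
    return math.log2(bucket_size)

x = 0                # start in the low-entropy macrostate
for t in range(20):
    print(t, f"entropy = {entropy_bits(x):.1f} bits")
    x = step[x]

# Almost every step lands in the big bucket (~20 bits), because only
# about 0.1% of microstates sit in the small one (10 bits). Avoiding
# that would mean choosing the dynamics and the start with insane
# precision.
```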
You can get mathematical about this. Our ability to describe the universe with mathematics is an example of an insane ability to compress the states of the universe. And, from information theory (another entropy), perfect compression is impossible: you cannot take a system that has 10^30 possible states and find a description scheme that distinguishes all of them in fewer than log_2(10^30) bits.
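Putting a number on that, with the 10^30 figure from above (this is nothing more than the pigeonhole principle):

```python
import math

states = 10 ** 30
print(math.log2(states))   # ~99.66 bits needed to tell every state apart

# Pigeonhole: 99 bits only gives 2**99 distinct descriptions, fewer than
# 10**30 states, so at least two states would have to share a description.
print(2 ** 99 < states)    # True
print(2 ** 100 >= states)  # True: 100 bits is enough
```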
Because mathematical descriptions of the universe's current state provide a highly accurate, compressed description of what the future universe will look like, we are in a low-entropy universe.
There are going to be far, far more universe states in which, no matter what mathematical model you try to build, the (description of the universe) + (mathematical model) needed to predict the (future universe state) is larger than the information we get about that future state.
This includes counting. Counting is an amazing compression; 1 sheep, 2 sheep. By saying we have 27 sheep, I'm describing sheep once, and then using log(number of sheep) bits of information to say we have a bunch of stuff like that one thing.
If sheep is a meaningful term, then saying "this is a sheep, with the following differences" is less information than fully describing the sheep.
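A back-of-the-envelope sketch of the sheep example (the bit costs for "one full sheep description" and "one sheep's worth of differences" are made-up placeholders; only their relative sizes matter):

```python
import math

SHEEP_TEMPLATE_BITS = 10 ** 6   # made-up cost of fully describing one sheep
DIFF_BITS = 10 ** 3             # made-up cost of "this one differs like so"
n = 27

# Naive: describe every sheep from scratch.
naive = n * SHEEP_TEMPLATE_BITS

# Counting: describe "sheep" once, spend log2(n) bits on the count,
# and list only the small per-sheep differences.
compressed = SHEEP_TEMPLATE_BITS + math.ceil(math.log2(n)) + n * DIFF_BITS

print(naive, compressed)        # 27,000,000 vs 1,027,005 bits
```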
Universes with two or more of some structure or arrangement of things that can be described this way -- universes in which counting has meaning -- are far fewer in number than ones in which it doesn't.
If we have 10^30 subatomic particles (roughly a human) and we need 10^10 bits to describe each of their locations, that is 10^40 bits to describe the matter of a human.
Fully describing what a human is (and isn't) might require 10^40 bits of information as well. But once you have done so, it might only take 10^30 bits to describe "this particular human".
If there are two humans in the universe, then you can say "10^40 bits for what a human is, then 10^30 more bits for each of the two humans."
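Plugging in the numbers from the last few paragraphs (nothing new here, just the arithmetic):

```python
# Numbers from the text: 10**30 particles per human, 10**10 bits to
# describe each particle's location, 10**40 bits for the shared
# "what a human is (and isn't)" template, 10**30 bits per individual.
raw_bits_per_human = 10 ** 30 * 10 ** 10        # 10**40 bits, from scratch
human_template_bits = 10 ** 40
per_human_bits = 10 ** 30

humans = 2
naive = humans * raw_bits_per_human                          # 2 * 10**40
compressed = human_template_bits + humans * per_human_bits   # ~10**40

print(compressed / naive)   # ~0.5: pay for the template once and the cost
                            # roughly halves; it keeps shrinking as more
                            # humans share the same template
```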
This is efficient compression. And universes that can be compressed this way - by any algorithm - are in the minority.
"Entropy always increases" can be converted to "for any model of physics, the universe will evolve into a state where the model of physics is less useful".