There are indeed nonconvex Lagrangians, and they are a problem for the Legendre transformation because they make it multiple valued (indeed, a convex but not strictly convex Lagrangian will throw up the same problem). If one cannot get rid of this multiple valuedness by splitting the problem into convex and concave "sectors" (piecewise analysis where the domain of interest is restricted, as in the first paragraph of Qmechanic's answer) or by imposing constraints, then it's game over for the Hamiltonian approach. Indeed, multiple valuedness of the Legendre transformation implies nonuniqueness of the solution of the Euler-Lagrange equation. I'd like to talk about a famous convex, but not strictly convex, example from my own field and what people do about the problems it raises. It provides an interesting illustration of how difficulties with convexity arise (as noted by the OP, they are tantamount to multiple valuedness of the Legendre transformation), and there are at least two common solutions to this particular problem whose appropriateness differs between fields of physics! Your solution depends on what you want to achieve with your Hamiltonian.
This is the calculation of geodesics in a (semi) Riemannian manifold, so that:
$$\mathcal{L} = \sqrt{g(X)(\dot{X},\,\dot{X})} = \sqrt{g_{ij}(X)\,\dot{x}^i\,\dot{x}^j}\tag{1}$$
This is also the same problem as the calculation of rays from Fermat's least time principle, if you include optical density (refractive index) information in the metric tensor. Thus ray optics in an isotropic medium is the geometry of a conformally flat manifold (since $g$'s matrix in Cartesian co-ordinates is the squared refractive index times the identity); anisotropic mediums give more general geometries.
This example is historically important, not only for General Relativity but also because ray optics - this very problem - was the field that piqued Hamilton's interest in these matters.
The Bad and The Ugly
The Lagrangian is convex, but not strictly so. Consider the linear path:
$$\sigma(t) = t\,\dot{X}_0\tag{2}$$
in the tangent space, i.e. where one moves by scaling the tangent vector $\dot{X}_0$. In response, the Lagrangian in (1) also scales linearly (it is positively homogeneous of degree one), so the linear path lies exactly in the graph / on the edge of the epigraph of the Lagrangian. Thus, the conjugate momentum $\partial_{\dot{X}}\,\mathcal{L}$, being the one-form:
$$P(\cdot) = \frac{g(\dot{X},\,\cdot)}{\sqrt{g(\dot{X},\,\dot{X})}}\tag{3}$$
is independent of $t$ as our point moves according to (2). Thus $P$ is a highly many-to-one function on the tangent space at any point: every tangent vector of the form $t\,\dot{X}_0$ with $t>0$ has the same value in (3). The Legendre transformation fails to pick out a unique $\dot{X}$ for each $P$. Not surprisingly, therefore, if one does the Legendre transformation, one gets:
$$\mathcal{H} = P(\dot{X}) - \mathcal{L}=\frac{g(\dot{X},\,\dot{X})}{\sqrt{g(\dot{X},\,\dot{X})}}-\mathcal{L}=0\quad\quad \text{OMG!!}\tag{4}$$
The Legendre transformation in this case is clearly many to one (try saying that aloud with a straight face).
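To see this concretely, here is a small symbolic check (my own illustration, not part of the original argument; the diagonal toy metric is an arbitrary stand-in) that the momentum (3) is unchanged when $\dot{X}$ is rescaled by any $t>0$ and that the would-be Hamiltonian (4) vanishes identically:

```python
import sympy as sp

# Toy 2D example: diagonal metric G = diag(f(x,y), f(x,y)) (an arbitrary choice).
x, y, xd, yd, t = sp.symbols('x y xdot ydot t', positive=True)
f = 1 + x**2 + y**2                       # stand-in for e.g. a squared refractive index
G = sp.diag(f, f)                         # matrix of the metric tensor
Xd = sp.Matrix([xd, yd])

L = sp.sqrt((Xd.T * G * Xd)[0, 0])        # square-root Lagrangian, eq. (1)
P = sp.Matrix([sp.diff(L, v) for v in (xd, yd)])   # conjugate momenta, eq. (3)

# Rescale the velocity by t > 0: the momenta are unchanged -> the Legendre map is many-to-one.
P_scaled = P.subs({xd: t * xd, yd: t * yd})
print(sp.simplify(P_scaled - P))          # zero matrix

# The would-be Hamiltonian P(Xdot) - L vanishes identically, eq. (4).
H = (P.T * Xd)[0, 0] - L
print(sp.simplify(H))                     # 0
```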
Let's look at this another way. Even the solution of the Euler-Lagrange equation for this Lagrangian - quite aside from wishing for a Hamiltonian counterpart - is fraught in this problem (though it can be done with some care). The Hessian matrix of $\mathcal{L}$ with respect to $\dot{X}$, i.e. the Jacobian of the map $\dot{X} \mapsto P = \partial_{\dot{X}} \mathcal{L}$, is:
$$h_{ij} = \partial_{\dot{x}^i} \partial_{\dot{x}^j} \mathcal{L} = \left(g(\dot{X},\,\dot{X})\,g_{ij} - g_{i k}\,\dot{x}^k\,g_{j \ell}\,\dot{x}^\ell\right)\,\mathcal{L}^{-3}\tag{5}$$
For our purposes, (5) is more transparent if we put it in matrix notation (here $G$ is the matrix of the metric tensor):
$$H = \frac{G}{\sqrt{\dot{X}^T\,G\,\dot{X}}}\,\left(\mathrm{id} - \frac{\dot{X}\,\dot{X}^T\,G}{\dot{X}^T\,G\,\dot{X}}\right)\tag{6}$$
The rightmost term in the brackets, $\dot{X}\,\dot{X}^T\,G/(\dot{X}^T\,G\,\dot{X})$, is recognized to be the projector onto the line spanned by $\dot{X}$, so the Hessian matrix is singular on every tangent space to the configuration space, with kernel given by the line $\{t\,\dot{X}:\,t\in\mathbb{R}\}$. A scale multiple of the Hessian matrix is the co-efficient of $\ddot{X}$ in the Euler-Lagrange equation, so the equation cannot determine $\ddot{X}$ uniquely and it has a whole family of solutions.

Lastly, we can look at the action integral itself and what happens to it when we rescale the path parameter $\tau$. Suppose the action is computed over the interval $\tau\in[0,\,1]$, and we introduce a transformation $\tau=\zeta(\sigma)$, where $\zeta$ is any smooth, monotonic function with $\zeta(0)=0,\,\zeta(1) = 1$, and write $Y(\sigma) = X(\zeta(\sigma))$, $\dot{Y}(\sigma) = \mathrm{d}_\sigma X(\zeta(\sigma))$; then:
$$\mathcal{S} = \int_{\tau=0}^1\,\sqrt{g(\dot{X}(\tau),\,\dot{X}(\tau))}\,\mathrm{d}\tau = \int_{\sigma=0}^1\,\sqrt{g\left(\frac{\dot{Y}(\sigma)}{\frac{\mathrm{d}\zeta}{\mathrm{d}\sigma}},\,\frac{\dot{Y}(\sigma)}{\frac{\mathrm{d}\zeta}{\mathrm{d}\sigma}}\right)}\,\frac{\mathrm{d}\zeta}{\mathrm{d}\sigma}\,\mathrm{d}\sigma \\= \int_{\tau=0}^1\,\sqrt{g(\dot{Y}(\tau),\,\dot{Y}(\tau))}\,\mathrm{d}\tau\tag{7}$$
Thus if $X(\tau)$ is an extremal path, then so is $X(\zeta(\tau))$ for any smooth, monotonic function with $\zeta(0)=0;\,\zeta(1)=1$. Intuitively, if we drive from A to B along the shortest (or longest) road, we can drive it with any speed versus time graph we choose, but we've still driven the extremal road.
The geodesic flow in the tangent bundle therefore has a whole sheet of flowlines over each geodesic through configuration space: at any given point of the geodesic, the preimage under the projection onto configuration space is a ray of tangent vectors, all of them scale multiples of one another.
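Both observations are easy to check numerically. The sketch below (an illustrative check of my own, not part of the original argument; the metric, velocity and test curve are arbitrary) builds the Hessian (6) at a sample point and confirms it annihilates $\dot{X}$, and then verifies that a discretised version of the length integral in (7) is insensitive to a reparametrisation $\zeta$:

```python
import numpy as np

# Sample positive-definite metric and velocity; arbitrary illustrative values.
G = np.array([[2.0, 0.3],
              [0.3, 1.0]])
Xdot = np.array([0.7, -1.2])

# Hessian (6) of the square-root Lagrangian with respect to the velocity components.
q = Xdot @ G @ Xdot
H = (G / np.sqrt(q)) @ (np.eye(2) - np.outer(Xdot, Xdot @ G) / q)
print(np.linalg.matrix_rank(H), H @ Xdot)     # rank 1, and H @ Xdot is (numerically) zero

# Reparametrisation invariance (7): the discretised length of one curve computed
# with two different parametrisations agrees to discretisation error.
def length(path, N=20001):
    s = np.linspace(0.0, 1.0, N)
    X = path(s)
    dX = np.gradient(X, s, axis=0)
    speed = np.sqrt(np.einsum('ni,ij,nj->n', dX, G, dX))
    return np.sum(0.5 * (speed[1:] + speed[:-1]) * np.diff(s))

curve = lambda s: np.stack([np.cos(s), np.sin(s)], axis=-1)   # an arbitrary test curve
zeta = lambda s: s**3                                         # zeta(0) = 0, zeta(1) = 1, monotonic
print(length(curve), length(lambda s: curve(zeta(s))))
```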
The Good
Let's look at the solution to this problem in semi Riemannian geometry. Here we cheat a bit and extremize the action integral:
$$\int_0^1 g(\dot{X},\,\dot{X})\,\mathrm{d}\tau\tag{8}$$
i.e. we simply forget about the square root! (I suspect this crazy idea was originally tried through sheer desperation). Now we look at the Cauchy-Schwarz inequality for a relationship between the "real" and "cheat" actions:
$$\int_0^1\,\sqrt{g(\dot{X},\,\dot{X})}\,\cdot 1\,\mathrm{d}\tau \leq \sqrt{\int_0^1\,g(\dot{X},\,\dot{X})\,\mathrm{d}\tau}\,\sqrt{\int_0^1\,1\,\mathrm{d}\tau} = \sqrt{\int_0^1\,g(\dot{X},\,\dot{X})\,\mathrm{d}\tau}\tag{9}$$
with equality if and only if $\sqrt{g(\dot{X},\,\dot{X})}$ is constant. We have already seen that if $X(\tau)$ minimizes the leftmost integral in (9), then so does $X(\zeta(\tau))$ for any smooth, monotonic $\zeta$ with $\zeta(0)=0,\,\zeta(1)=1$. So, for the minimizing $X$, we choose the $\zeta$ that makes $\sqrt{g(\dot{Y},\,\dot{Y})}$ constant and equal to the average speed of $X$. Cauchy-Schwarz saturates in this case, so the constant-speed path that minimizes (8) also minimizes the leftmost integral in (9) (the two minimum values being related by a square), given that the integrand is positive.

On the other hand, if we are seeking to maximize the original action, which is the case for timelike geodesics in a Lorentzian manifold, then we simply go ahead and maximize (8). By "coincidence", we find that the maximization happens when $g(\dot{X},\,\dot{X})$ is constant, so that the bound represented by (9) is saturated in this case, and we have again found one of the paths that extremizes $\int_0^1\,\sqrt{g(\dot{X},\,\dot{X})}\,\mathrm{d}\tau$. Having found the one solution that extremizes the left hand side of (9), we can characterize all the others through a transformation $\tau=\zeta(\sigma)$ with $\zeta(0)=0,\,\zeta(1)=1$. Or, in General Relativity, we ignore all the other solutions, because we postulate that the physical one is the one along which an observer's proper time $\tau$ advances uniformly: the four-velocity then has constant norm, the acceleration is Minkowski-orthogonal to the velocity and $\tau$ is thus affine. So we actually get more than the shape of the geodesic path with this approach; we also get an affine path parameterization.
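As a concrete check of the Cauchy-Schwarz argument (again my own illustration, not from the original derivation; a flat metric, an arbitrary circular arc and arbitrary choices of $\zeta$ are used for simplicity), the sketch below evaluates the length (the leftmost integral in (9)) and the "cheat" action (8) for one fixed path under several reparametrisations:

```python
import numpy as np

# One fixed path (a circular arc in the plane, flat metric) traversed with several
# different parametrisations zeta: the length is the same every time, while the
# "cheat" action (8) is smallest, and equal to length squared, only for the
# constant-speed (affine) parametrisation - the Cauchy-Schwarz saturation in (9).
def length_and_energy(zeta, N=20001):
    s = np.linspace(0.0, 1.0, N)
    X = np.stack([np.cos(zeta(s)), np.sin(zeta(s))], axis=-1)
    dX = np.gradient(X, s, axis=0)
    speed2 = np.einsum('ni,ni->n', dX, dX)          # flat metric: g = identity
    trap = lambda f: np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(s))
    return trap(np.sqrt(speed2)), trap(speed2)

for zeta in (lambda s: s, lambda s: s**2, lambda s: np.sin(0.5 * np.pi * s)):
    L, E = length_and_energy(zeta)
    print(f"length = {L:.5f}  cheat action = {E:.5f}  length^2 = {L*L:.5f}")
```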
So we now easily get our Hamiltonian formulation: if we put $\mathcal{L}=\frac{1}{2}\,g(\dot{X},\,\dot{X})$, then the conjugate momentum $P=\dot{X}_\flat$, i.e. $p_k = g_{kj}\,\dot{x}^j$, is simply the covector found by lowering the index of $\dot{X}$, and so we have:
$$\mathcal{L} = \mathcal{H} = \frac{1}{2} g(\dot{X},\,\dot{X}) = \frac{1}{2}\,g_{ij}\dot{x}^i\,\dot{x}^j = \frac{1}{2} g^{\sharp\kern+1.4pt\sharp}(P,\,P)=\frac{1}{2}\,g^{ij}\,p_i\,p_j\tag{11}$$
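Continuing the toy symbolic example from above (an illustration only; the diagonal metric is an arbitrary stand-in), one can check that the velocity-momentum map is now invertible and that the Legendre transform indeed lands on (11):

```python
import sympy as sp

# Same toy diagonal metric as before: with L = (1/2) g(Xdot, Xdot) the velocity-to-
# momentum map is now invertible, and the Legendre transform reproduces (11).
x, y, xd, yd = sp.symbols('x y xdot ydot', positive=True)
f = 1 + x**2 + y**2
G = sp.diag(f, f)
Xd = sp.Matrix([xd, yd])

L = sp.Rational(1, 2) * (Xd.T * G * Xd)[0, 0]
P = sp.Matrix([sp.diff(L, v) for v in (xd, yd)])        # p_k = g_{kj} xdot^j

Xd_of_P = G.inv() * P                                    # a unique Xdot for each P
H = sp.simplify((P.T * Xd)[0, 0] - L)                    # Legendre transform
print(sp.simplify(Xd_of_P - Xd))                         # zero: the inversion is consistent
print(sp.simplify(H - sp.Rational(1, 2) * (P.T * G.inv() * P)[0, 0]))   # zero: H = (1/2) g^{ij} p_i p_j
```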
We now see another reason why this crazy solution is beloved of physicists: the Lagrangian and Hamiltonian in (11) are those of the corresponding formulations of Newtonian mechanics for a free particle. This is therefore a most pleasing, natural analogy when we are thinking about a particle that is "coasting" in an inertial frame. The Euler-Lagrange equation for (11) is readily shown to be $\ddot{x}^k + \Gamma^k_{ij}\,\dot{x}^i\,\dot{x}^j=0$; the analogy with a free particle makes it very satisfying to witness that Newton's second law becomes $-g^{kj}\,\partial_{x^j} V = F^k = m\left(\ddot{x}^k + \Gamma^k_{ij}\,\dot{x}^i\,\dot{x}^j\right)$ when one puts a potential $V(x)$ into the mix. It's a thoroughly beautiful physical analogy. Hamilton's equations for the geodesic are:
$$\dot{x}^k = g^{kj}\,p_j;\quad \dot{p}_k = -\frac{1}{2}\,\left(\partial_{x^k}\,g^{ij}\right)\,p_i\,p_j\tag{12}$$
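To see (12) in action, here is a minimal numerical sketch (my own illustration, not from the original answer) that integrates the geodesic flow for the conformally flat, isotropic-optics metric $g_{ij}=n(x)^2\,\delta_{ij}$ mentioned earlier; the refractive-index profile and launch conditions are made up, and scipy's `solve_ivp` does the integration. The conserved Hamiltonian is a useful sanity check:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Geodesic flow via Hamilton's equations (12) for a conformally flat metric
# g_ij = n(x)^2 delta_ij (the isotropic ray-optics case); n below is a made-up
# smooth refractive-index profile, chosen only for illustration.
def n2(x, y):
    return (1.5 - 0.4 * np.exp(-(x**2 + y**2)))**2

def grad_n2(x, y, h=1e-6):
    return np.array([(n2(x + h, y) - n2(x - h, y)) / (2 * h),
                     (n2(x, y + h) - n2(x, y - h)) / (2 * h)])

def hamilton_rhs(tau, state):
    x, y, px, py = state
    xdot = np.array([px, py]) / n2(x, y)            # xdot^k = g^{kj} p_j, with g^{ij} = delta_ij / n^2
    # pdot_k = -(1/2) (d_k g^{ij}) p_i p_j, and d_k g^{ij} = -delta_ij d_k(n^2) / n^4
    pdot = 0.5 * (px**2 + py**2) * grad_n2(x, y) / n2(x, y)**2
    return np.concatenate([xdot, pdot])

def hamiltonian(state):
    x, y, px, py = state
    return 0.5 * (px**2 + py**2) / n2(x, y)

state0 = np.array([-2.0, 0.3, 1.4, 0.1])            # arbitrary launch point and momentum
sol = solve_ivp(hamilton_rhs, (0.0, 5.0), state0, rtol=1e-9, atol=1e-12)
print(hamiltonian(sol.y[:, 0]), hamiltonian(sol.y[:, -1]))   # H is conserved along the flow
```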
Much of the time, this solution is perfectly acceptable in optics, too. Naturally, it will handle all the calculations of rays in smoothly inhomogeneous mediums. In optics, the affine parameter corresponding to proper time in GR is the optical path length, or the total phase delay along the path.
What seems like a cheat at first leads to a solution that is very elegant, smooth and easy and, for General Relativity and indeed most geometry, thoroughly complete.
However, this elegant solution has an awkward property in optics when we hit abrupt interfaces between dielectric mediums, which is an essential situation to analyze when we're talking about lenses and mirrors, for example. The Hamiltonian approach needs a Lagrangian that is at least a $C^2$ function of the state, and this assumption breaks down at such an abrupt interface. OK, so we use the Hamiltonian approach away from the interface, and we work out what transformation the interface imparts to the ray state $(X,\,P)$. But if we do this, then Snell's law shows us that:
The transverse components of the optical momentums are continuous across the interface, whereas the normal component necessarily is not.
That is, the transformation wrought on the optical state $(X,\,P)$ by the ray's passage across the abrupt interface is not a symplectomorphism. The same is true of mirrors with this approach: $X$ is continuous across the interface, whereas $P\mapsto -P$, so the determinant of this linear transformation is $-1$ in three dimensions. The easiest way to understand all this is to note that the Hamiltonian in (11) equals half the constant squared speed of the point in $(\text{optical phase per unit time})^2$; we can arbitrarily set this to $1/2$ units - we can choose any constant as long as we are consistent (scaled and shifted affine parameters are still affine). If we adopt this convention, and if we use co-ordinates which are locally Cartesian at the interface, with the $x$-$y$ plane parallel to and the $z$ direction normal to the interface, then the optical momentums can be shown to be $p_k = n\,\gamma_k$, where $n$ is the refractive index at the point where the ray meets the interface and $\gamma_k$ are the direction cosines that the ray's direction makes with the orthonormal axes. From here one can readily prove the above assertion about Snell's law.
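A quick numerical illustration of that assertion (my own sketch; the two indices and the incidence angle are arbitrary values): writing the momentum as $p_k = n\,\gamma_k$ and applying Snell's law shows that the tangential components match across the interface while the normal component does not.

```python
import numpy as np

# With p = n * (direction cosines), the components of p tangential to a flat
# interface are continuous across it, while the normal component is not.
# n1, n2 and the incidence angle are arbitrary illustrative values.
n1, n2 = 1.0, 1.5
theta1 = np.radians(35.0)                         # angle of incidence from the interface normal (z)
theta2 = np.arcsin(n1 * np.sin(theta1) / n2)      # Snell's law

p_in  = n1 * np.array([np.sin(theta1), 0.0, np.cos(theta1)])   # p_k = n * gamma_k
p_out = n2 * np.array([np.sin(theta2), 0.0, np.cos(theta2)])

print(p_in, p_out)
# Tangential (x, y) components agree: n1*sin(theta1) == n2*sin(theta2) is Snell's law itself.
# The normal (z) component jumps from n1*cos(theta1) to n2*cos(theta2).
```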
This situation brings us to the commoner method of handling singular Legendre transformations - the use of constraints to get rid of "The Bad and The Ugly" redundancy that we discussed above. The use of $\mathcal{L} =\frac{1}{2}\,g(\dot{X},\,\dot{X})$ can be thought of as belonging to this idea if we think of it as finding the geodesic together with the constraint that our path parameter should be affine, so that $\mathcal{L}=\mathrm{const}$. In optics, when there are lenses and mirrors involved, the common solution is to constrain the speed along the path so that one of the co-ordinates $x^i$, say $x^3$, is itself the path parameter, i.e. $\dot{x}^3=1$. The most obvious illustration of this idea is where the optical system has an optical axis: we measure the $z$ co-ordinate along this axis, and the $z$ co-ordinate is the path parameter. This culls $x^3$ and $p_3$ from the Lagrangian and Hamiltonian, and now the phase space is four rather than six dimensional.

More generally, we use generalized co-ordinates such that $\partial_1$ and $\partial_2$ are orthogonal to $\partial_3$ and surfaces of constant $x^3$ are aligned with the dielectric interfaces. Clearly we can do this: we can use "voltage" co-ordinates (my name, not used in the literature) wherein the lens surfaces are equipotential surfaces, i.e. surfaces of constant $x^3$, in an electrostatics problem, the directions of increasing $x^1$ and $x^2$ lie within the equipotential surfaces, and the third co-ordinate is the voltage at any point. If we do this, then the transverse components of the optical momentum are still continuous across each interface. Since the $x^1$ and $x^2$ co-ordinates are also continuous, the dielectric interface now imparts a symplectomorphism - indeed the identity operator - on the optical state in this four-dimensional approach. In Cartesian co-ordinates, with $z$ as the path parameter, this approach looks like:
$$\begin{array}{ll}\mathcal{L} = n\,\sqrt{1+\dot{x}^2+\dot{y}^2} & \mathcal{H}=-\sqrt{n^2-p_x^2-p_y^2}\\
p_x = \dfrac{n\,\dot{x}}{\sqrt{1+\dot{x}^2+\dot{y}^2}} & p_y = \dfrac{n\,\dot{y}}{\sqrt{1+\dot{x}^2+\dot{y}^2}}\end{array}\tag{13}$$
and the graphs of both $\mathcal{L}$ and $\mathcal{H}$ are perfectly well behaved: $\mathcal{L}$ is now strictly convex in $(\dot{x},\,\dot{y})$, so the Legendre transformation is single valued. The path parameter is not affine, though, so you can't easily use this approach to calculate where the phasefronts are.
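Here is a minimal ray-tracing sketch using (13) (an illustration under assumptions of my own: a made-up gradient-index profile, numerical differentiation of $n$, and scipy's `solve_ivp`); it shows the reduced, four-dimensional Hamiltonian in use with $z$ as the path parameter.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Ray tracing with z as the path parameter using the reduced Hamiltonian of (13),
# H = -sqrt(n^2 - px^2 - py^2); the parabolic index profile is a made-up example
# (roughly a gradient-index fibre), not taken from the discussion above.
def n(x, y, z):
    return 1.6 - 0.1 * (x**2 + y**2)

def rhs(z, state, h=1e-6):
    x, y, px, py = state
    root = np.sqrt(n(x, y, z)**2 - px**2 - py**2)
    xdot, ydot = px / root, py / root                 # dx/dz = dH/dpx, dy/dz = dH/dpy
    dndx = (n(x + h, y, z) - n(x - h, y, z)) / (2 * h)
    dndy = (n(x, y + h, z) - n(x, y - h, z)) / (2 * h)
    # dpx/dz = -dH/dx = n * (dn/dx) / root, and similarly for y
    return [xdot, ydot, n(x, y, z) * dndx / root, n(x, y, z) * dndy / root]

state0 = [0.2, 0.0, 0.0, 0.05]                        # launch position and transverse momenta
sol = solve_ivp(rhs, (0.0, 20.0), state0, max_step=0.01)
print(sol.y[0, ::500])                                # x oscillates about the axis, as expected of a GRIN ray
```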
Sometimes in optics we use both approaches: if you want to calculate where the wavefronts of a field diverging from a source are, then you clearly need the affine path parameter - and thus the six-dimensional, $\mathcal{L} = \frac{1}{2}\,g(\dot{X},\,\dot{X})$ approach - to know where the surfaces of constant phase cross the geodesics. To do Ray Transfer Analysis, or if you need to avail yourself of the powerful optical invariant or étendue notions (both invariant differential forms on symplectic optical phase space), then you need all the interfaces in the system to impart symplectomorphisms on the optical state, and so one uses the four-dimensional approach.