Hikikomori


204. inequality

Inequality


Queried: Who invented inequalities; Google responded: The signs for greater than and less than were introduced by Thomas Harriot (1560-1621). He initially wrote triangular symbols, which the editor of his book altered into what resemble the modern symbols. Inequalities are essential for the subject, as they let us adapt mathematics that originated in geometry and linear algebra.

I


Suppose $X:\Omega \to \mathbb{R}$ is a random variable and $g:\mathbb{R} \to [0,\infty]$ is a measurable function. If we let $g_{\ast}(A) = \inf \lbrace g(y): y \in A \rbrace$ for any $A \in \mathcal{B}_{\mathbb{R}}$, then Markov’s inequality guarantees that $g_{\ast}(A)P(X \in A) \leq \operatorname{E}[g(X)I_{X \in A}] \leq \operatorname{E}g(X)$, with $P(X \in A) = \operatorname{E}I_{X \in A}$. Moreover, if $g(x) = x^+$ and $A = [a,\infty)$ for some $a > 0$, then $P(X \geq a) \leq a^{-1}\operatorname{E}X^+$; the result is particularly useful when $X \geq 0$, since then $\operatorname{E}X^+ = \operatorname{E}X$. Likewise, if $g(x) = {\vert x \vert}^q$ and $A = (-\infty, -a] \cup [a, \infty)$ for some $a > 0$, then $P(\vert X \vert \geq a) \leq a^{-q}\operatorname{E}{\vert X \vert}^q$. Applying the case $q=2$ to $Y - \operatorname{E}Y$ gives the tighter bound $P(\vert Y - \operatorname{E}Y \vert \geq a) \leq a^{-2}\operatorname{Var}Y$, the so-called Chebyshev’s inequality (#1).
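As a quick numeric illustration, here is a minimal Monte Carlo sanity check of Markov’s and Chebyshev’s inequalities, assuming numpy is available; the exponential distribution and the levels $a$ are arbitrary choices of mine, not part of the argument above.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=100_000)  # X >= 0 with E X = 2, Var X = 4

for a in (4.0, 8.0):
    markov_lhs = np.mean(X >= a)                   # P(X >= a)
    markov_rhs = X.mean() / a                      # a^{-1} E X  (here X^+ = X since X >= 0)
    cheb_lhs = np.mean(np.abs(X - X.mean()) >= a)  # P(|X - E X| >= a)
    cheb_rhs = X.var() / a**2                      # a^{-2} Var X
    print(f"a={a}: Markov {markov_lhs:.4f} <= {markov_rhs:.4f}, "
          f"Chebyshev {cheb_lhs:.4f} <= {cheb_rhs:.4f}")
```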

The second instance also works for a sequence $(X_n)_{n\in\mathbb{N}}$ of independent $X_n$ such that $\operatorname{E}X_n = 0$ and $\operatorname{Var}X_n < \infty$. Specifically, $P(\vert S_n - \operatorname{E}S_n \vert \geq a) \leq a^{-2}\operatorname{Var}S_n$ for $S_n = \Sigma_{k=1}^n X_k$, and if the $(X_n)_{n\in\mathbb{N}}$ are i.i.d., then $P(\vert S_n - \operatorname{E}S_n \vert \geq a) \leq a^{-2} n\operatorname{Var}X_1$. Without finite variances we may use a truncated Chebyshev’s inequality (#2). Given a sequence $(b_k)_{k\in\mathbb{N}}$ of positive real numbers, define the truncated variables $\tilde{X}_k = X_k$ whenever $\vert X_k \vert \leq b_k$ and $\tilde{X}_k = 0$ otherwise, so that each $\tilde{X}_k \in L^2$ and Chebyshev’s inequality applies to $\tilde{S}_n = \Sigma_{k=1}^n \tilde{X}_k$. Since $\lbrace S_n \neq \tilde{S}_n \rbrace \subseteq \bigcup_{k=1}^{n} \lbrace \vert X_k \vert > b_k \rbrace$, we obtain $P(\vert S_n - \operatorname{E}\tilde{S}_n \vert \geq a) \leq a^{-2} \Sigma_{k=1}^n\operatorname{Var}\tilde{X}_k + \Sigma_{k=1}^{n} P(\vert X_k \vert > b_k)$.
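A minimal simulation sketch of the truncated bound, assuming numpy; the Pareto tail index $1.5$ (finite mean, infinite variance), the constant truncation level $b$, and the choices of $n$ and $a$ are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, b, reps = 100, 200.0, 100.0, 50_000
X = 1.0 + rng.pareto(1.5, size=(reps, n))    # i.i.d. Pareto(1.5): E X = 3, Var X = infinity
X_trunc = np.where(np.abs(X) <= b, X, 0.0)   # X~_k = X_k if |X_k| <= b, else 0

S = X.sum(axis=1)                            # S_n
E_S_trunc = n * X_trunc.mean()               # E S~_n, estimated by Monte Carlo
lhs = np.mean(np.abs(S - E_S_trunc) >= a)    # P(|S_n - E S~_n| >= a)
bound = n * X_trunc.var() / a**2 + n * np.mean(np.abs(X) > b)
print(f"{lhs:.4f} <= {bound:.4f}")           # the truncated Chebyshev bound holds
```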

Chebyshev’s inequality extends to Kolmogorov’s inequality (a maximal inequality) $P(\max_{1 \leq k\leq n} \vert S_k \vert \geq a) \leq a^{-2}\operatorname{Var}S_n$, which we may restate over a block of indices as $P(\max_{n \leq k\leq m} \vert S_k - S_n \vert \geq a) \leq a^{-2}\operatorname{Var}(S_m - S_n)$. Its convenience arises when we need to bound the worst-case deviation (i.e. one that can occur at any $k$) of a partial-sum process $(S_n)_{n\in\mathbb{N}}$ with $S_{n} = \Sigma_{k=1}^{n}X_k$, where the $X_k$ are independent with $\operatorname{E}X_k = 0$. It generalises to the Hájek–Rényi inequality, which assures that $P(\max_{1 \leq k \leq n} d_k \vert S_k \vert \geq a) \leq a^{-2}\Sigma_{k=1}^{n} d_{k}^{2}\operatorname{Var}X_k$, where $(d_k)_{1 \leq k\leq n}$ are non-increasing positive real numbers. Whenever $d_k =1$ for all $k$, the Hájek–Rényi inequality reduces to Kolmogorov’s inequality.
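A minimal Monte Carlo check of Kolmogorov’s inequality, assuming numpy; the Rademacher steps and the choices of $n$ and $a$ are mine. The last line shows that plain Chebyshev only controls the final partial sum, not the running maximum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, a, reps = 200, 25.0, 50_000
X = rng.choice([-1.0, 1.0], size=(reps, n))    # independent, E X_k = 0, Var X_k = 1
S = np.cumsum(X, axis=1)                       # partial sums S_1, ..., S_n

lhs = np.mean(np.max(np.abs(S), axis=1) >= a)  # P(max_k |S_k| >= a)
bound = n / a**2                               # a^{-2} Var S_n
cheb = np.mean(np.abs(S[:, -1]) >= a)          # Chebyshev bounds only P(|S_n| >= a)
print(f"max: {lhs:.4f} <= {bound:.4f}; final sum only: {cheb:.4f}")
```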

II


Some inequalities instead relate moments of sums to sums of moments. Suppose $X$ and $Y$ lie on the same probability space and $\operatorname{E}{\vert X \vert}^r, \operatorname{E}{\vert Y \vert}^r < \infty$ for some $r > 0$. The $c_r$-inequality assures that $\operatorname{E}{\vert X+Y \vert}^r \leq c_r(\operatorname{E}{\vert X \vert}^r + \operatorname{E}{\vert Y \vert}^r)$, where $c_r = 1$ if $r \leq 1$ and $c_r = 2^{r-1}$ if $r \geq 1$. In fact, for all $r \in (0,1]$ the pointwise bound ${\vert x+y \vert}^r \leq {\vert x \vert}^r + {\vert y \vert}^r$ gives a triangle inequality of the $r$-th order, $\operatorname{E}{\vert X+Y \vert}^r \leq \operatorname{E}({\vert X \vert} + {\vert Y \vert})^r \leq \operatorname{E} {\vert X \vert}^r + \operatorname{E}{\vert Y \vert}^r$. These show that a sum of random variables is $r$-integrable whenever all summands are. The converse holds for all $r > 0$ whenever $X$ and $Y$ are independent.
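A minimal numeric check of the $c_r$-inequality, assuming numpy; the (dependent) lognormal variables and the exponents $r$ are illustrative choices. Since the inequality holds pointwise, it also holds exactly for the sample averages.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.lognormal(size=100_000)
Y = X**2 + rng.lognormal(size=100_000)   # X and Y may be dependent

for r in (0.5, 1.0, 2.0, 3.0):
    c_r = 1.0 if r <= 1 else 2.0 ** (r - 1)
    lhs = np.mean(np.abs(X + Y) ** r)                                # E|X + Y|^r
    rhs = c_r * (np.mean(np.abs(X) ** r) + np.mean(np.abs(Y) ** r))  # c_r (E|X|^r + E|Y|^r)
    print(f"r={r}: {lhs:.3f} <= {rhs:.3f}")
```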

Minkowski’s inequality instead concerns sums of random variables in an $L^p$-space using the $L^p$-norm ${\left\lVert \cdot \right\rVert}_p = (\operatorname{E} {\vert \cdot \vert}^p)^{1/p}$. Namely, if $\operatorname{E}{\vert X \vert}^p, \operatorname{E}{\vert Y \vert}^p < \infty$ for some $p \geq 1$, then ${\left\lVert X+Y \right\rVert}_p \leq {\left\lVert X \right\rVert}_p + {\left\lVert Y \right\rVert}_p$. For $p=2$ this reduces to the usual Euclidean-type norm, and in general it extends the triangle inequality to the $L^{p}$-norm for $p \geq 1$. Also, since the operator ${\left\lVert \cdot \right\rVert}_p: L^p \to [0,\infty)$ is absolutely homogeneous and subadditive, ${\left\lVert \alpha{X} + \beta{Y} \right\rVert}_p \leq \vert \alpha \vert {\left\lVert X \right\rVert}_p + \vert \beta \vert {\left\lVert Y \right\rVert}_p$ for all $\alpha, \beta \in \mathbb{R}$ and $X, Y \in L^p$. Note that an $L^p$-space is indeed a vector space (#3); it is moreover a Banach space, since every Cauchy sequence in $L^p$ converges in $L^p$, and $L^2$ is a Hilbert space once we equip it with the inner product $\langle X,Y \rangle = \operatorname{E}XY$ (closely related to the covariance).
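A minimal check of Minkowski’s inequality through an empirical $L^p$-norm, assuming numpy; the gamma/normal pair and the exponents $p$ are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.gamma(2.0, size=100_000)
Y = -0.5 * X + rng.normal(size=100_000)   # X and Y may be dependent

def lp_norm(Z, p):
    """Monte Carlo estimate of ||Z||_p = (E|Z|^p)^(1/p)."""
    return np.mean(np.abs(Z) ** p) ** (1.0 / p)

for p in (1.0, 2.0, 4.0):
    print(f"p={p}: {lp_norm(X + Y, p):.3f} <= {lp_norm(X, p) + lp_norm(Y, p):.3f}")
```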

Hölder’s inequality, on the contrary, concerns products of random variables in $L^p$-spaces. That is, if $p, q > 1$ and $p^{-1} + q^{-1} = 1$ (#4), then $\vert\langle X,Y \rangle\vert := {\vert \operatorname{E}XY \vert} \leq {\operatorname{E}\vert XY \vert} \leq {\left\lVert X \right\rVert}_p \cdot {\left\lVert Y \right\rVert}_q$. The second inequality makes sense once we apply Young’s inequality $ab \leq p^{-1}a^p + q^{-1}b^q$ and observe that $\operatorname{E}({\vert X \vert}/{\left\lVert X \right\rVert}_p \cdot {\vert Y \vert}/{\left\lVert Y \right\rVert}_q) \leq p^{-1}\operatorname{E}({\vert X \vert}^p/{\left\lVert X \right\rVert}_p^p) + q^{-1}\operatorname{E}({\vert Y \vert}^q/{\left\lVert Y \right\rVert}_q^q) = p^{-1} + q^{-1} = 1$. If we take $p=q=2$, then the Cauchy-Schwarz inequality bounds $\operatorname{E}{\vert XY \vert} \leq {\left\lVert X \right\rVert}_2 \cdot {\left\lVert Y \right\rVert}_2$. Consequently, ${\vert \operatorname{Cov}(X,Y) \vert} \leq \sqrt{\operatorname{Var}X \cdot \operatorname{Var}Y}$ and also ${\vert \rho_{X,Y} \vert} \leq 1$, where $\rho_{X,Y} = \operatorname{Cov}(X,Y) / \sqrt{\operatorname{Var}X \cdot \operatorname{Var}Y}$ is the statistical measure known as the Pearson correlation coefficient.
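A minimal check of Hölder’s inequality and the correlation bound, assuming numpy; the correlated normal pair and the dual exponents $(p, q) = (3, 3/2)$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100_000)
Y = 0.7 * X + rng.normal(size=100_000)   # correlated with X

p, q = 3.0, 1.5                          # dual exponents: 1/3 + 2/3 = 1
holder_lhs = np.mean(np.abs(X * Y))      # E|XY|
holder_rhs = np.mean(np.abs(X) ** p) ** (1 / p) * np.mean(np.abs(Y) ** q) ** (1 / q)

rho = np.cov(X, Y)[0, 1] / np.sqrt(X.var(ddof=1) * Y.var(ddof=1))   # Pearson rho
print(f"Holder: {holder_lhs:.3f} <= {holder_rhs:.3f}; |rho| = {abs(rho):.3f} <= 1")
```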

III


Jensen’s inequality in general states that a secant line of a convex function $\varphi$ (a weighted mean of values of $\varphi$, with total weight $1$) lies above the graph of $\varphi$. In probabilistic terms, given a probability space, an $\mathbb{S}$-valued integrable random variable $X$, and a measurable convex function $\varphi:\mathbb{R}\to\mathbb{R}$, it assures that $\varphi(\operatorname{E}(X\vert\mathcal{G})) \leq \operatorname{E}(\varphi(X)\vert\mathcal{G})$, where $\mathcal{G} \subset \mathcal{F}$ is a sub-$\sigma$-algebra. If the set $\mathbb{S}$ is the real line $\mathbb{R}$ and $\mathcal{G}$ is the trivial $\sigma$-algebra $\lbrace \Omega, \emptyset \rbrace$, then $\varphi(\operatorname{E}X) \leq \operatorname{E}\varphi(X)$, and equality holds if and only if $\varphi$ is linear on a convex set $A \subseteq \mathbb{R}$ such that $P(X \in A) = 1$ (#5). Jensen’s inequality and the Cauchy-Schwarz inequality have a wide range of uses.

A direct use of Jensen’s inequality on a density $f$ is $\varphi(\int_{\mathbb{R}} g(x)f(x)\,\mathrm{d}x) \leq \int_{\mathbb{R}} \varphi(g(x))f(x)\,\mathrm{d}x$, where $g$ is Borel measurable (e.g. $g(x) = x$). If $\varphi(x) = x^{2n}$ for some $n \in \mathbb{N}$, then $\varphi^{\prime\prime}(x) \geq 0$ for all $x\in\mathbb{R}$ and $\varphi(\operatorname{E}X) = (\operatorname{E}X)^{2n} \leq \operatorname{E}X^{2n}$, which for $n=1$ yields $\operatorname{Var}X = \operatorname{E}X^2 - (\operatorname{E}X)^2 \geq 0$. Moreover, if $p$ and $q$ are the true density of $X$ and another candidate density, and we set $Y(x) = q(x)/p(x)$ with $\varphi(y) = -\log(y)$ (#6), then the construction yields Gibbs’ inequality $-\Sigma_{x}p(x)\log{p(x)} \leq -\Sigma_{x}p(x)\log{q(x)}$, which in turn gives the non-negativity of the KL-divergence $D_{KL}(P \vert\vert Q) = -\Sigma_{x}p(x) \log{[q(x)/p(x)]} = \Sigma_{x}p(x) \log{[p(x)/q(x)]} \geq 0$.
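A minimal sketch of Gibbs’ inequality and the non-negativity of the KL-divergence for discrete densities, assuming numpy; the probability vectors p and q below are arbitrary illustrative choices.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])            # "true" pmf of X
q = np.array([0.2, 0.5, 0.3])            # another candidate pmf

cross_entropy = -np.sum(p * np.log(q))   # -sum_x p(x) log q(x)
entropy = -np.sum(p * np.log(p))         # -sum_x p(x) log p(x)
kl = np.sum(p * np.log(p / q))           # D_KL(P || Q) = cross_entropy - entropy
print(entropy <= cross_entropy, kl >= 0, np.isclose(kl, cross_entropy - entropy))
```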

We can work geometrically with $X \in H$, as if they were vectors in Euclidean space, because proper notions of length and orthogonality hold in the set $H = \lbrace X: \operatorname{E}X^2 < \infty \rbrace$. For example, given $X,Y \in H$, the least-squares prediction of $Y$ is obtained by finding the best linear function of $X$. Namely, $L(Y \vert X) = \operatorname{E}Y + (X-\operatorname{E}X) \cdot {\operatorname{Cov}(X,Y) \over \operatorname{Var}X}$ is the unique linear function satisfying both (i) $\operatorname{E}[L(Y \vert X)] = \operatorname{E}Y$ and (ii) $\operatorname{Cov}[X, L(Y \vert X)] = \operatorname{Cov}(X,Y)$. If $U \in H$ is any linear function of $X$, then $\operatorname{E}[Y - L(Y \vert X)]^2 \leq \operatorname{E}(Y-U)^2$, with equality if and only if $U = L(Y \vert X)$. The minimal mean squared error is $\operatorname{E}[Y - L(Y \vert X)]^2 = \operatorname{Var}(Y)[1-\operatorname{Cor}^2(X,Y)]$.
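A minimal sketch of the best linear predictor and its mean squared error, assuming numpy; the linear-plus-noise model for $(X, Y)$ is an illustrative choice, not part of the argument.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200_000)
Y = 1.0 + 2.0 * X + rng.normal(size=200_000)   # Y = linear function of X plus noise

beta = np.cov(X, Y)[0, 1] / X.var(ddof=1)      # Cov(X, Y) / Var X
L = Y.mean() + beta * (X - X.mean())           # L(Y|X) = E Y + (X - E X) Cov(X,Y)/Var X

mse = np.mean((Y - L) ** 2)
rho = np.corrcoef(X, Y)[0, 1]
print(f"MSE {mse:.4f} vs Var(Y)(1 - Cor^2) = {Y.var() * (1 - rho**2):.4f}")
```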

IV


A median $\operatorname{Med}X$ is any real number such that $P(X \leq \operatorname{Med}X) \geq 1/2$ and $P(X \geq \operatorname{Med}X) \geq 1/2$, while a mode of $X$ is any value $x \in \mathbb{R}$ that maximises $f_X$. Thus, if $P(\vert X \vert > a) < 1/2$ for some $a \in \mathbb{R}^{+}$, then ${\vert \operatorname{Med}X \vert} \leq a$. Also, if $\operatorname{Var}X < \infty$, then ${\vert \operatorname{Med}X - \operatorname{E}X \vert} = {\vert \operatorname{E}(\operatorname{Med}X - X) \vert} \leq \operatorname{E}{\vert X - \operatorname{Med}X \vert} \leq \operatorname{E}{\vert X - \operatorname{E}X \vert} \leq \sqrt{\operatorname{E}{(X - \operatorname{E}X)}^2} = \sqrt{\operatorname{Var}X}$. The first and the third inequalities follow from Jensen’s inequality applied to the convex functions $\varphi(x) = \vert x \vert$ and $\varphi(x) = x^2$, respectively, while the second holds because the median minimises $c \mapsto \operatorname{E}{\vert X - c \vert}$.
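A minimal numeric check of ${\vert \operatorname{Med}X - \operatorname{E}X \vert} \leq \sqrt{\operatorname{Var}X}$, assuming numpy; the skewed exponential distribution is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)   # E X = 1, Med X = log 2, sd(X) = 1
gap = abs(np.median(X) - X.mean())
print(f"{gap:.4f} <= {X.std():.4f}")             # roughly 0.31 <= 1.00
```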

We say $X$ is symmetric around $a$ if the two shifted random variables $X-a$ and $a-X$ have the same distribution, i.e. $F_{X-a}(x) = F_{a-X}(x)$ for all $x\in\mathbb{R}$. Equivalently, $X$ is symmetric around $a$ if $P(X \geq a+x) = P(X \leq a-x)$ for all $x \in \mathbb{R}$; when $X$ has a density, this amounts to $f_{X}(a+x) = f_{X}(a-x)$ for all $x\in\mathbb{R}$. We simply say $X$ is symmetric if $F_X(x) = F_{-X}(x)$ for all $x\in\mathbb{R}$, or, equivalently, $f_{X}(x) = f_{X}(-x)$ for all $x\in\mathbb{R}$. If $X$ is symmetric and $\operatorname{E}X$ is finite, then $\operatorname{E}X = \operatorname{E}(-X) = -\operatorname{E}X$, so $\operatorname{E}X = 0$. If, in addition, $X$ is unimodal, then $\operatorname{E}X = \operatorname{Med}X = \operatorname{Mode}X$; even for a bimodal symmetric $X$, we still have $\operatorname{E}X = \operatorname{Med}X$. It is often easier to work with symmetric $X$.

For a non-symmetric $X$, we can define a symmetrised random variable $X^s = X - X^\prime$, where $X^\prime \stackrel{d}{=} X$ is independent of $X$; then $X^s$ is symmetric and $\operatorname{E}X^s = 0$ whenever $\operatorname{E}X$ is finite. The weak symmetrisation inequalities assure that, for all $a \in \mathbb{R}$ and $b > 0$, $P({\vert X - \operatorname{Med}X \vert} \geq b) \leq 2P({\vert X^s \vert} \geq b) \leq 4P({\vert X - a \vert} \geq b/2)$. Likewise, for any real numbers $(a_{k})_{1 \leq k \leq n}$ and $b > 0$, the strong symmetrisation inequalities assure $P(\max_{1 \leq k \leq n}{\vert X_k - \operatorname{Med}X_k \vert} \geq b) \leq 2P(\max_{1 \leq k \leq n}{\vert X_k^s \vert} \geq b) \leq 4P(\max_{1 \leq k \leq n}{\vert X_k - a_k \vert} \geq b/2)$. The relation is akin to that between Chebyshev’s and Kolmogorov’s inequalities. Together, medians and symmetrisation let us bound sums of random variables.
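A minimal Monte Carlo check of the weak symmetrisation inequalities, assuming numpy; the skewed exponential $X$, the shift $a$, and the level $b$ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
reps = 500_000
X = rng.exponential(scale=1.0, size=reps)        # not symmetric
X_prime = rng.exponential(scale=1.0, size=reps)  # independent copy with X' =d X
Xs = X - X_prime                                 # symmetrised X^s, symmetric with E X^s = 0

med = np.median(X)
a, b = 0.0, 2.0                                  # arbitrary shift a and level b > 0
left = np.mean(np.abs(X - med) >= b)             # P(|X - Med X| >= b)
mid = 2 * np.mean(np.abs(Xs) >= b)               # 2 P(|X^s| >= b)
right = 4 * np.mean(np.abs(X - a) >= b / 2)      # 4 P(|X - a| >= b/2)
print(f"{left:.4f} <= {mid:.4f} <= {right:.4f}")
```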

**


(#1) For sums of random variables, Bernstein’s inequalities offer an alternative; they were rediscovered in the forms of Chernoff’s inequality, Hoeffding’s inequality, and Azuma’s inequality. (#2) We define new random variables that are asymptotically equivalent to the originals and easier to deal with. (#3) If we define a distance within a vector space $V$ by $d(X,Y) = {\left\lVert X-Y \right\rVert}_p$, then $d(X,Z) \leq d(X,Y) + d(Y,Z)$ implies that $V$ is a metric space. (#4) For any $p,q \in [1, \infty]$, we say $q$ is a dual exponent of $p$ if and only if $p^{-1} + q^{-1} = 1$. (#5) If $\theta x_1 + (1-\theta)x_2 \in C$ for all $x_1, x_2 \in C$ with $0 \leq \theta \leq 1$, where $C \subseteq \mathbb{R}$, then $C$ is called a convex set. (#6) Dividing and taking logs always helps.




I gathered these words solely for my own purposes, without any intention to break the rigour of the subjects.
Well, I prefer eating corn in a spiral.