Expected Value
The primitive idea arose in the mid-17th century from the study of the so-called problem of points. Later, in the mid-19th century, Pafnuty Chebyshev formulated the expected value in terms of random variables. This page collects some definitions, properties, and theorems; finiteness (i.e. integrability) will not be the main interest.
I
Let $X: \Omega \to \mathbb{R}$ be a random variable. The expected value of $X$ is the Lebesgue integral $\operatorname{E}X = \int_{\Omega} X\, \mathrm{d}P$, which we denote by $\mu_X$ or $\mu$ (#1). Writing $X = X^+ - X^-$ with $X^+ = \max(X, 0)$ and $X^- = \max(-X, 0)$, at least one of $\operatorname{E}X^+$ and $\operatorname{E}X^-$ must be finite so that $\operatorname{E}X = \operatorname{E}X^+ - \operatorname{E}X^-$ avoids the undefined form $\infty - \infty$. We can express $\operatorname{E}X = \int_{\mathbb{R}} x\, \mathrm{d}F_X(x)$. If $X$ is continuous, then $\operatorname{E}X = \int_{\mathbb{R}} xf_X(x)\, \mathrm{d}x$; if $X$ is discrete, then $\operatorname{E}X = \sum_{k} x_{k}p_{x_{k}}$. Moreover, a random variable $Y = g(X)$ obtained from a Borel function $g: \mathbb{R} \to \mathbb{R}$ satisfies $\operatorname{E}Y = \int_{\Omega} Y\,\mathrm{d}P = \int_{\mathbb{R}} g(x)\,\mathrm{d}F_X(x)$, and a random vector $\boldsymbol{X} = [X_1, \dots, X_n]^{\top}$ has the mean vector $\operatorname{E}\boldsymbol{X} = [\operatorname{E}X_1, \dots, \operatorname{E}X_n]^{\top}$.
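As a minimal numerical sketch (the fair die and the standard exponential are assumed examples, not from the text), the discrete sum and the continuous integral forms of $\operatorname{E}X$ can be checked directly.

```python
# Sketch: E[X] as a weighted sum (discrete) and as an integral (continuous).
import numpy as np
from scipy import integrate, stats

# Discrete: E[X] = sum_k x_k * p_{x_k}, here for a fair six-sided die.
x_k = np.arange(1, 7)
p_k = np.full(6, 1 / 6)
print(np.sum(x_k * p_k))                                   # 3.5

# Continuous: E[X] = \int x f_X(x) dx, here for the standard exponential (mean 1).
mean, _ = integrate.quad(lambda x: x * stats.expon.pdf(x), 0, np.inf)
print(mean)                                                # ~1.0
```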
An expectation is the weighted sum $\sum_{i=1}^{n}x_{i}p_{i} = \mu$ with $\sum_{i=1}^{n}p_i = 1$. The operator $\operatorname{E}$ inherits useful properties of $\int$, including (i) non-degeneracy: if $\operatorname{E}\vert X \vert = 0$, then $X=0$ almost surely; (ii) non-negativity: if $X \geq 0$, then $\operatorname{E}X \geq 0$; (iii) monotonicity: if $X \leq Y$, then $\operatorname{E}X \leq \operatorname{E}Y$; (iv) domination: if $\operatorname{E}\vert X \vert^\beta < \infty$, then $\operatorname{E}\vert X \vert^\alpha < \infty$ for any $0 < \alpha < \beta$; (v) linearity: $\operatorname{E}(aX + Y) = a\operatorname{E}X + \operatorname{E}Y$ for all $a \in \mathbb{R}$; (vi) multiplicativity: in general $\operatorname{E}XY \neq \operatorname{E}X\operatorname{E}Y$, but independence of $X$ and $Y$ guarantees $\operatorname{E}XY = \operatorname{E}X \operatorname{E}Y$. Linearity is convenient, but Simpson's paradox warns against imprudently aggregating groups of variables.
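A quick Monte Carlo check (an assumed simulation setup, for illustration only) of linearity and of the multiplicativity granted by independence:

```python
# Sketch: linearity holds for any X, Y; E[XY] = E[X]E[Y] holds here because the
# samples of X and Y are generated independently.
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
X = rng.normal(2.0, 1.0, n)          # E[X] = 2
Y = rng.exponential(3.0, n)          # E[Y] = 3, independent of X by construction
a = 5.0

print(np.mean(a * X + Y), a * np.mean(X) + np.mean(Y))     # both ~ 13 (linearity)
print(np.mean(X * Y), np.mean(X) * np.mean(Y))             # both ~ 6 (independence)
```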
Assume that $(X_n)_{n\in\mathbb{N}}$ is a sequence of non-negative random variables, i.e. $P(X_n \geq 0) = 1$ for each $n$. Even if $\lim_{n\to\infty} X_{n} = X$ pointwise, $\lim_{n\to\infty} \operatorname{E}X_{n} = \operatorname{E}X$ is not necessarily true. Although we cannot interchange limits and expectations at will, the following provide sufficient conditions: (i) monotone convergence theorem: the pointwise convergent sequence is monotone, i.e. $0 \leq X_{n} \nearrow X$ as $n \to \infty$ (#2); (ii) dominated convergence theorem: there exists a random variable $Y$ that dominates $X_{n}$, i.e. $\vert X_{n} \vert \leq Y$ for all $n \in \mathbb{N}$ and $\operatorname{E}\vert Y \vert < \infty$. We call the latter the bounded convergence theorem when $Y$ is a finite constant.
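To see why some extra condition is needed, here is a standard counterexample (assumed, not from the text): with $U \sim \mathrm{Uniform}(0,1)$ and $X_n = n\,\boldsymbol{1}_{\lbrace U < 1/n \rbrace}$, the sequence converges to $0$ pointwise, yet $\operatorname{E}X_n = 1$ for every $n$.

```python
# Sketch: lim E[X_n] = 1 but E[lim X_n] = E[0] = 0, so the interchange fails;
# no integrable Y dominates every X_n and the sequence is not monotone increasing.
import numpy as np

rng = np.random.default_rng(1)
U = rng.uniform(size=10**6)

for n in (10, 100, 1000):
    X_n = n * (U < 1 / n)            # X_n = n on {U < 1/n}, else 0
    print(n, X_n.mean())             # each estimate stays near 1
```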
II
Moments of a distribution are fundamental descriptive statistics. The $r$-th moment and the $r$-th central moment are given by $\operatorname{E}X^r = \int_{\mathbb{R}} x^r\, \mathrm{d}F_X(x)$ and $\operatorname{E}(X - \operatorname{E}X)^r = \int_{\mathbb{R}} (x - \mu)^r\, \mathrm{d}F_X(x)$, respectively. The 1st moment equals the mean $\mu$, and if we let $\mu_r$ be the $r$-th central moment, then $\mu_2 = \sigma^2$ is the variance, $\mu_3 / \sigma^3$ is the skewness, and $\mu_4 / \sigma^4$ is the kurtosis. Descriptive statistics are crucial for understanding $X$. If these four moments are insufficient, we may further compute the $r$-th absolute moment $\operatorname{E}{\vert X \vert}^r = \int_{\mathbb{R}} {\vert x \vert}^r\, \mathrm{d}F_X(x)$ and the $r$-th absolute central moment $\operatorname{E}{\vert X - \operatorname{E}X \vert}^r = \int_{\mathbb{R}} {\vert x - \mu \vert}^r\, \mathrm{d}F_X(x)$.
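As a sample-based sketch (the exponential distribution is an assumed example), the first four descriptive moments can be estimated directly; for the standard exponential they are mean $1$, variance $1$, skewness $2$, and kurtosis $9$.

```python
# Sketch: estimate mu, mu_2 = sigma^2, mu_3/sigma^3, and mu_4/sigma^4 from a sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.exponential(1.0, 10**6)

print(x.mean())                          # 1st moment ~ 1
print(x.var())                           # 2nd central moment ~ 1
print(stats.skew(x))                     # mu_3 / sigma^3 ~ 2
print(stats.kurtosis(x, fisher=False))   # mu_4 / sigma^4 ~ 9
```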
Let $m: \mathbb{R} \to \mathbb{R}$ be the moment generating function (mgf) defined by $m(t) = \operatorname{E}e^{tX}$; then its $r$-th derivative with respect to $t$ evaluated at zero, denoted by $m^{(r)}(0)$, is the $r$-th moment of $X$. While the existence of $m$ is not guaranteed, if the mgfs exist on a neighborhood of zero, then $F_{X_1} = F_{X_2}$ if and only if $m_{X_1}(t) = m_{X_2}(t)$ for all $t \in (-\varepsilon, \varepsilon)$ for some $\varepsilon > 0$; moreover, $m$ is finite near zero if and only if the tails of $F_X$ are exponentially bounded. For example, the collection of all moments (i.e. $r \geq 1$) uniquely determines $F_X$ when the support is a bounded interval (the Hausdorff moment problem), but may fail to determine the cdf on an unbounded interval (the Hamburger moment problem) when the mgf does not exist near zero (#3).
If $m$ exists, then $m$ is analytic (#4), and so it can be approximated by the Taylor expansion $f(x) = \sum_{k=0}^{\infty} {f^{(k)}(x_0)\over{k!}}(x-x_0)^k$. If we let $x_0 = 0$ for $f(x) = e^x = e^{x_0} + e^{x_0}(x-x_0) + {1\over{2!}}e^{x_0}(x-x_0)^2 + \dots$, then $f(x) = 1 + x + {1\over{2}}x^2 + \dots$. Likewise, expanding around $t_{0}=0$ gives $m(t) = \operatorname{E}[1 + tX + {1\over{2}}t^2X^2 + \dots]$, and the linearity of $\operatorname{E}$ yields $m(t) = 1 + t\operatorname{E}X + {1\over{2}}t^2\operatorname{E}X^2 + \dots$, whose $r$-th derivative at $t=0$ returns the $r$-th moment. If $f$ fails to have derivatives of all orders at a point, or its Taylor series diverges (or converges to the wrong value) in every neighborhood of that point, then $f$ is not analytic there. When the centering point $x_0$ equals zero, the series $f(x) = \sum_{k=0}^{\infty}{f^{(k)}(0)\over{k!}}x^k$ is known as the Maclaurin expansion.
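A symbolic sketch with SymPy (the standard normal is an assumed example): its mgf is $m(t) = e^{t^2/2}$, so differentiating at $t = 0$, or reading off the Maclaurin coefficients, recovers the moments $0, 1, 0, 3, \dots$

```python
# Sketch: moments of N(0,1) from its mgf, by differentiation and by series expansion.
import sympy as sp

t = sp.symbols('t')
m = sp.exp(t**2 / 2)                               # mgf of the standard normal

print([sp.diff(m, t, r).subs(t, 0) for r in range(1, 5)])   # [0, 1, 0, 3]
print(sp.series(m, t, 0, 5))                       # 1 + t**2/2 + t**4/8 + O(t**5)
```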
III
Suppose $X:\Omega\to\mathbb{R}$ is an integrable ($L^1$) random variable and $\mathcal{G} \subset \mathcal{F}$ is a sub-$\sigma$-algebra containing events representing only partial information. A conditional expectation of $X$ given $\mathcal{G}$ is any $\mathcal{G}$-measurable function $\operatorname{E}(X \vert \mathcal{G})$ such that $\int_{A}\operatorname{E}(X \vert \mathcal{G})\,\mathrm{d}P = \int_{A}X \,\mathrm{d}P$ for all $A\in\mathcal{G}$. It differs from $X$ itself, which need not be $\mathcal{G}$-measurable, so that $\int_{A} X\,\mathrm{d}P\vert_{\mathcal{G}}$ need not be well-defined. Also, if $X \in L^2(\Omega, \mathcal{F}, P)$, then $\operatorname{E}(X \vert \mathcal{G})$ is the orthogonal projection of $X$ onto the subspace $L^2(\Omega, \mathcal{G}, P)$ and is the best predictor of $X$ among all $\mathcal{G}$-measurable $Y \in L^2(\Omega, \mathcal{G}, P)$, in the sense that $X-\operatorname{E}(X \vert \mathcal{G}) \perp L^2(\Omega, \mathcal{G}, P)$. Note that $\operatorname{E}(Y \vert X = x)$ and $\operatorname{E}(Y \vert X) := \operatorname{E}[Y \vert \sigma(X)]$ are a scalar and a random variable, respectively.
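As a finite toy sketch (the twelve outcomes and the three-cell partition are assumed, for illustration), when $\mathcal{G}$ is generated by a finite partition, $\operatorname{E}(X\vert\mathcal{G})$ is simply the average of $X$ over each cell, and the defining identity $\int_A \operatorname{E}(X\vert\mathcal{G})\,\mathrm{d}P = \int_A X\,\mathrm{d}P$ can be verified cell by cell.

```python
# Sketch: conditional expectation given a finite partition, with uniform P.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=12)                     # values of X on 12 equally likely outcomes
cells = np.repeat([0, 1, 2], 4)             # partition generating G

cond_exp = np.empty_like(X)
for c in np.unique(cells):
    cond_exp[cells == c] = X[cells == c].mean()   # constant on each cell

A = cells == 1                              # an event A in G
print(cond_exp[A].sum() / 12, X[A].sum() / 12)    # \int_A E(X|G) dP = \int_A X dP
```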
The Radon-Nikodym theorem, which also produces the pdf of a continuous random variable $X$, yields the existence of $\operatorname{E}(X\vert\mathcal{G})$. Let $\mu:A \mapsto \int_{A} X\,\mathrm{d}P$ be a finite measure on $(\Omega, \mathcal{F})$ (taking $X \geq 0$; otherwise split $X = X^+ - X^-$); then $\mu$ is absolutely continuous with respect to $P$. If $g$ is the inclusion map embedding $\mathcal{G}$ into $\mathcal{F}$, then $\mu \circ g = \mu\vert_{\mathcal{G}}$ and $P \circ g = P\vert_{\mathcal{G}}$ are the restrictions of $\mu$ and $P$ to $\mathcal{G}$, respectively, and $\mu\vert_{\mathcal{G}}$ remains absolutely continuous with respect to $P\vert_{\mathcal{G}}$: whenever $P\vert_{\mathcal{G}}(A) = 0$ for $A \in \mathcal{G}$, we also have $\mu\vert_{\mathcal{G}}(A) = 0$. The Radon-Nikodym theorem then provides a $\mathcal{G}$-measurable density $\operatorname{E}(X\vert\mathcal{G}) = \mathrm{d}\mu\vert_{\mathcal{G}} / \mathrm{d}P\vert_{\mathcal{G}}$. Uniqueness holds in the sense that any two versions differ only on a set of zero probability.
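A toy check of the Radon-Nikodym view (an assumed finite example): restricted to a sub-$\sigma$-algebra generated by a partition, the derivative $\mathrm{d}\mu\vert_{\mathcal{G}}/\mathrm{d}P\vert_{\mathcal{G}}$ on each cell is the ratio of $\mu(\text{cell})$ to $P(\text{cell})$, i.e. the cell average of $X$.

```python
# Sketch: mu(A) = \int_A X dP, and the RN derivative on each cell of the partition
# equals mu(cell) / P(cell).
import numpy as np

X = np.array([1.0, 3.0, 2.0, 6.0])              # X on four equally likely outcomes
P = np.full(4, 0.25)
cells = [np.array([0, 1]), np.array([2, 3])]    # partition generating G

for cell in cells:
    mu_cell = np.sum(X[cell] * P[cell])         # mu(cell)
    print(mu_cell / np.sum(P[cell]))            # 2.0, then 4.0
```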
The conditional expectation also inherits properties of the unconditional expectation, and its own properties are fundamental to the study of martingales. These include (i) stability: if $X$ is $\mathcal{G}$-measurable, then $\operatorname{E}(X\,\vert\,\mathcal{G}) = X$ (#5); (ii) law of total expectation: $\operatorname{E}[\operatorname{E}(X\vert\mathcal{G})] = \operatorname{E}X$; (iii) tower property: if $\mathcal{G}_1 \subset \mathcal{G}_2 \subset \mathcal{F}$, then $\operatorname{E}[\operatorname{E}(X\vert\mathcal{G}_2)\vert\mathcal{G}_1] = \operatorname{E}(X\vert\mathcal{G}_1)$, and also $\operatorname{E}(X\vert\mathcal{G}_1) = \operatorname{E}[\operatorname{E}(X\vert\mathcal{G}_1)\vert\mathcal{G}_2]$ by stability. One can deduce $\operatorname{Var}(X\vert\mathcal{H}) = \operatorname{E}[(X-\operatorname{E}(X\vert\mathcal{H}))^2\vert\mathcal{H}] = \operatorname{E}(X^2\vert\mathcal{H}) - [\operatorname{E}(X\vert\mathcal{H})]^2$ and obtain the law of total variance $\operatorname{Var}X = \operatorname{E}[\operatorname{Var}(X\vert\mathcal{H})] + \operatorname{Var}[\operatorname{E}(X\vert\mathcal{H})]$, which underlies many popular hypothesis tests in statistics such as ANOVA.
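A Monte Carlo sanity check of the law of total variance (an assumed hierarchical example): with $M \sim N(0, 2^2)$ and $X \mid M \sim N(M, 1)$, we have $\operatorname{E}[\operatorname{Var}(X\vert M)] = 1$ and $\operatorname{Var}[\operatorname{E}(X\vert M)] = \operatorname{Var}M = 4$, so $\operatorname{Var}X = 5$.

```python
# Sketch: Var X = E[Var(X|M)] + Var[E(X|M)] for a normal mixture.
import numpy as np

rng = np.random.default_rng(4)
n = 10**6
M = rng.normal(0.0, 2.0, n)          # conditioning variable, Var M = 4
X = rng.normal(M, 1.0)               # X | M ~ N(M, 1), so Var(X|M) = 1

print(X.var())                       # ~ 5
print(1.0 + M.var())                 # E[Var(X|M)] + Var[E(X|M)] ~ 5
```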
**
(#1) $\operatorname{E}(\boldsymbol{1}_{A}) = P(A)$ for $A \in \mathcal{F}$. (#2) Recall Dini's theorem. (#3) Let $f(x) = x^{-2}$ for $x>1$; then $m(t) = \int_{1}^{\infty} e^{tx}x^{-2}\,\mathrm{d}x$ diverges for every $t > 0$. (#4) In $\mathbb{R}$, we require $f \in C^{\omega} := \lbrace \text{analytic functions} \rbrace$, whereas in $\mathbb{C}$, because $C^1 = C^\infty = C^\omega$, we only require $f \in C^1 := \lbrace \text{holomorphic functions} \rbrace$. (#5) If $X$ is independent of $\mathcal{G}$, then simply $\operatorname{E}(X \vert \mathcal{G})=\operatorname{E}X$.
I gathered these notes solely for my own purposes, without any intention of compromising the rigor of the subjects.
Well, I prefer eating corn in a spiral.