Hikikomori


205. characteristic function

Characteristic Function


Let $\operatorname{FT}$ denote the Fourier transform and let $(f \ast g)(x) = \int_{\mathbb{R}} f(\tau)g(x-\tau)\,\mathrm{d}\tau$ denote the convolution operator. The convolution theorem yields $\operatorname{FT} \lbrace f \ast g \rbrace = \operatorname{FT} \lbrace f \rbrace \cdot \operatorname{FT} \lbrace g \rbrace$. One may view a characteristic function $\varphi$ as the probabilistic counterpart of $\operatorname{FT}$ applied to a density function such as $f_{Z}(z) = \int_{\mathbb{R}} f_{X}(x) f_{Y}(z-x)\,\mathrm{d}x$ of the sum $Z = X+Y$ of independent random variables $X$ and $Y$.
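A minimal NumPy sketch (my own addition; the arrays and sizes are arbitrary): the discrete analogue of the convolution theorem holds exactly for linear convolution once both signals are zero-padded to the output length.

```python
import numpy as np

# Sketch: FT{f * g} = FT{f} . FT{g} for zero-padded FFTs of a linear convolution.
rng = np.random.default_rng(0)
f = rng.random(64)
g = rng.random(64)

conv = np.convolve(f, g)                       # linear convolution, length 127
n = conv.size
lhs = np.fft.fft(conv)                         # FT of the convolution
rhs = np.fft.fft(f, n) * np.fft.fft(g, n)      # product of zero-padded FTs
print(np.allclose(lhs, rhs))                   # True
```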

I


The characteristic function (cf) $\varphi_X: \mathbb{R} \to \mathbb{C}$ of a random variable $X:\Omega\to\mathbb{R}$ is given by $\varphi_X(t) = \operatorname{E}e^{itX}$, where $i$ is the imaginary unit and $t \in \mathbb{R}$ is the argument. If $X$ admits a density $f_X$, then $\varphi_X(t) = \int_{\mathbb{R}} e^{itx}f_X(x)\,\mathrm{d}x$, which equals the Fourier transform of $f_X$ with the sign of the exponent reversed (#1). If $Y = aX + b$, then $\varphi_Y(t)= e^{itb}\varphi_{X}(at)$. Furthermore, given a sequence $(X_n)_{n\in\mathbb{N}}$ of independent random variables, if $S_n = \sum_{k=1}^{n} a_k X_k$, then $\varphi_{S_n}(t) = \prod_{k=1}^{n} \varphi_{X_k}(a_{k}t)$, due to the linearity $\operatorname{E}e^{it(aX+b)} = e^{itb}\operatorname{E}e^{itaX}$ and the multiplicativity $\operatorname{E}e^{it(X_1 + X_2 + \dots + X_n)} = \prod_{k=1}^{n}\operatorname{E}e^{itX_k}$. Similar definitions and properties also hold for a random vector or a random matrix $X \in \mathbb{R}^{m \times n}$.
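Both rules can be checked empirically. A small sketch (my own addition, assuming standard normal variables, whose cf $e^{-t^2/2}$ is known in closed form, and arbitrary values of $t$, $a$, $b$):

```python
import numpy as np

# Empirical cf vs the closed-form cf of normals: affine rule and product rule.
rng = np.random.default_rng(1)
t, a, b = 0.7, 2.0, -1.0
x = rng.standard_normal(200_000)
y = rng.standard_normal(200_000)               # independent of x

ecf = lambda t, s: np.mean(np.exp(1j * t * s))  # empirical characteristic function

print(ecf(t, a * x + b), np.exp(1j * t * b) * np.exp(-(a * t) ** 2 / 2))  # phi_{aX+b}(t)
print(ecf(t, x + y), np.exp(-t ** 2 / 2) ** 2)                            # phi_{X+Y}(t)
```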

Recall Jensen’s inequality ${\vert \varphi_X(t) \vert} = {\vert \operatorname{E}e^{itX} \vert} \leq \operatorname{E}{\vert e^{itX} \vert}$. Since ${\vert e^{itX} \vert} = {\vert \cos(tX)+i\sin(tX) \vert} = [\cos^2(tX)+\sin^2(tX)]^{1/2}=1$ and $P(\Omega) = 1$, we can tell that ${\vert \varphi_X(t) \vert} \leq 1$, and so the integral defining $\varphi_X$ is always well-defined on a space of finite measure. Also, $\varphi_X$ does not vanish around zero, since $\varphi_X(0) = 1$ and $\varphi_X$ is continuous. If $X$ admits an $L^1$ density, the Riemann–Lebesgue lemma states that the $\operatorname{FT}$ of an $L^1$-function vanishes at $\pm \infty$, and so $\lim_{t \to \pm \infty} {\vert \varphi_X(t) \vert} = 0$. To prove the uniform continuity of $\varphi_X$, note that ${\vert \varphi_X(t+h)-\varphi_X(t) \vert} = {\vert \operatorname{E}[e^{itX}(e^{ihX}-1)] \vert} \leq \operatorname{E}{\vert e^{ihX}-1 \vert}$; this bound does not depend on the input argument $t\in\mathbb{R}$ and, by dominated convergence, can be made smaller than any $\varepsilon > 0$ by choosing $h$ sufficiently small.
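As a small illustration (my own sketch; the exponential samples and the grid of $t$ values are arbitrary choices), the increment of an empirical cf indeed stays below the $t$-free bound $\operatorname{E}{\vert e^{ihX}-1 \vert}$:

```python
import numpy as np

# The worst-case cf increment over a grid of t is bounded by E|e^{ihX} - 1|.
rng = np.random.default_rng(2)
x = rng.exponential(size=100_000)
ecf = lambda t: np.mean(np.exp(1j * t * x))

h = 0.05
ts = np.linspace(-20, 20, 401)
worst = max(abs(ecf(t + h) - ecf(t)) for t in ts)   # largest observed increment over t
bound = np.mean(np.abs(np.exp(1j * h * x) - 1))     # the t-free upper bound
print(worst <= bound + 1e-12, worst, bound)         # True; both shrink as h -> 0
```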

If the moment generating function $m_{X}$ exists for a random variable $X$, then $m_{X}(t) = \varphi_{X}(-it)$. Moreover, $\varphi_X$ is Hermitian, i.e. $\varphi^{\ast}_X(t) = \varphi_{X}(-t)$, where $\ast$ denotes the complex conjugate, because $f_X = \operatorname{FT}^{-1}\lbrace \varphi_X \rbrace$ (more generally, the law of $X$) is real-valued. In particular, if $X$ is symmetric, then $\varphi_{X}(t) = \varphi_{-X}(t) = \varphi_{X}(-t)$, so $\varphi_X$ is a real-valued even function. In terms of uniqueness, $F_X=F_Y$ if and only if $\varphi_X = \varphi_Y$. Lévy’s continuity theorem states that this bijection between $F_X$ and $\varphi_X$ is sequentially continuous, and thus, given any sequence $(X_n)_{n\in\mathbb{N}}$ (#2), if $\varphi_{X_n} \to \varphi_X$ pointwise as $n \to \infty$ (i.e. for all $t\in\mathbb{R}$), then $F_{X_n} \to F_X$ at every continuity point of $F_X$ as $n\to\infty$.
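This is the mechanism behind cf-based proofs of the CLT. A tiny sketch (my own example, using Bernoulli(1/2) summands): the cf of the standardized sum of $n$ of them is $[\cos(t/\sqrt{n})]^{n}$, which converges pointwise to $e^{-t^2/2}$, the cf of $N(0,1)$.

```python
import numpy as np

# Pointwise convergence of the cf of a standardized Bernoulli(1/2) sum to exp(-t^2/2).
t = np.linspace(-3, 3, 7)
for n in (10, 100, 10_000):
    print(n, np.max(np.abs(np.cos(t / np.sqrt(n)) ** n - np.exp(-t ** 2 / 2))))
```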

II


Due to the continuity theorem, $\varphi_X$ massively eases proofs of theorems related to the distributional convergence of random variables and of sums of random variables (i.e. the LLNs and the CLTs). Beyond that, a characteristic function inherits the advantages of the Fourier transform. The reciprocal relationship between the Fourier transform and the inverse Fourier transform is a traditional example. Suppose $\varphi_X$ is integrable and $F_X$ is continuous. The inversion formula $f_X(x) = {1 \over 2\pi} \int_{\mathbb{R}} e^{-itx} \varphi_X(t)\,\mathrm{d}t$ recovers the density function of $X$ from $\varphi_X$, or, equivalently, it returns the Radon–Nikodym derivative $f_X = \mathrm{d}P_X/\mathrm{d}\lambda$ of the law $P_X = P \circ X^{-1}$ with respect to the Lebesgue measure $\lambda$.
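A quick numerical check of the inversion formula (my own sketch; the standard normal cf $e^{-t^2/2}$, the truncation at $\pm 40$ and the grid size are assumed choices):

```python
import numpy as np

# Recover f_X(x) = (1/2pi) * int e^{-itx} phi(t) dt by a plain Riemann sum.
t = np.linspace(-40, 40, 8001)
dt = t[1] - t[0]
x = 1.3
f_hat = (np.exp(-1j * t * x) * np.exp(-t ** 2 / 2)).sum().real * dt / (2 * np.pi)
print(f_hat, np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi))   # both ~ 0.1714
```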

Paul Lévy showed that $F_X(b) - F_X(a) + {1\over2}P(X=a)-{1\over2}P(X=b) = \lim_{T\to\infty} {1\over2\pi} \int_{-T}^{T} {e^{-ita} - e^{-itb} \over it} \varphi_X(t)\, \mathrm{d}t$, where $a,b\in\mathbb{R}$ and $a < b$. The point masses in the equation vanish if $a$ and $b$ belong to the continuity set $C(F_X) = \lbrace x: F_X \text{ is continuous at } x \rbrace$, and then $F_X(b) - F_X(a) = \lim_{T\to\infty} {1\over2\pi} \int_{-T}^{T} {e^{-ita} - e^{-itb} \over it} \varphi_X(t)\, \mathrm{d}t$. We may also rewrite the formula as ${F_X(x+h)-F_X(x-h)\over{2h}} = {1\over{2\pi}}\int_{\mathbb{R}} {\sin{ht}\over{ht}}e^{-itx}\varphi_X(t)\, \mathrm{d}t$, a smoothed form that is convenient for numerical computation. While these equations showcase the univariate case, their extension to the multivariate case is achievable through iterated integrals, and $\operatorname{FT}^{-1}$ can likewise be read off from them.
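A sketch of the smoothed form in practice (my own example; the standard normal cf, $x = 0$, $h = 1$, and the truncation at $\pm 50$ are assumed choices):

```python
import numpy as np

# (F(x+h) - F(x-h)) / (2h) = (1/2pi) * int (sin(ht)/(ht)) e^{-itx} phi(t) dt, x = 0, h = 1.
t = np.linspace(-50, 50, 100_001)
dt = t[1] - t[0]
kernel = np.sinc(t / np.pi)                       # sin(t)/t with the t = 0 limit handled
approx = (kernel * np.exp(-t ** 2 / 2)).sum() * dt / (2 * np.pi)
print(approx)                                     # ~ 0.3413 = (Phi(1) - Phi(-1)) / 2
```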

// Furthermore, J. Gil-Pelaez showed that $F_X(x) = {1\over{2}} + {1\over{2\pi}} \int_{0}^{\infty} {e^{itx}\varphi_X(-t) - e^{-itx}\varphi_X(t) \over it}\, \mathrm{d}t = {1\over{2}} - {1\over{\pi}} \int_{0}^{\infty} {\operatorname{Im}[e^{-itx}\varphi_X(t)] \over t}\, \mathrm{d}t$, where $x \in \mathbb{R}$ is a continuity point of $F_X$ (#3). ____NotReadThePaperYet
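// A quick numerical reading of the Gil-Pelaez formula (my own sketch; the standard normal cf, $x = 1$, and the integration grid are assumed choices):

```python
import numpy as np

# F(x) = 1/2 - (1/pi) * int_0^inf Im[e^{-itx} phi(t)] / t dt, phi(t) = exp(-t^2/2).
x = 1.0
t = np.linspace(1e-6, 50, 500_000)
dt = t[1] - t[0]
integrand = np.imag(np.exp(-1j * t * x) * np.exp(-t ** 2 / 2)) / t
print(0.5 - integrand.sum() * dt / np.pi)   # ~ 0.8413 = Phi(1)
```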

III


Characteristic functions find applications in modern machine learning, exemplified by Google’s development of the Performer (2020), which utilises Random Features (2007) to build linearly scalable attention matrices, thus facilitating faster training on longer sequences. The expensive inner-product operations become cheaper by applying Bochner’s theorem to $k(\mathbf{x},\mathbf{y}) = \phi(\mathbf{x})^{\top} \phi(\mathbf{y})$, where $k:\mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is a kernel, $\phi:\mathbb{R}^n\to\mathbb{R}^m$ is a feature map, and $m \geq n$ (#4). That is, the authors approximate an attention matrix that would otherwise require a quadratic complexity of $\mathcal{O}(n^2)$, by using a randomised feature map $z: \mathbb{R}^n \to \mathbb{R}^m$, where $m \leq n$.

Specifically, if the matrix $A_{n \times n} = (a_{qr})_{q,r=1}^{n}$ with $a_{qr}=g(x_q - x_r)$ is positive-semidefinite for every choice of points $x_1,\dots,x_n \in \mathbb{R}$, we call $g:\mathbb{R} \to \mathbb{C}$ a positive-definite function, and positive-definiteness of $g$ arises whenever there exists $f \geq 0$ such that $g = \operatorname{FT}\lbrace f \rbrace$. Conversely, if one has a continuous positive-definite function $\phi: \mathbb{R} \to \mathbb{C}$ with $\phi(0) = 1$, then there always exists a Borel probability measure $\mu$ on $\mathbb{R}$ such that $\phi = \operatorname{FT}\lbrace \mu \rbrace$. This result, called Bochner’s theorem, asserts that $p(\omega) = {1\over{2\pi}}\int_{\mathbb{R}^n} e^{-i\omega^{\top}\mathbf{\triangle}}\, k(\mathbf{\triangle})\, \mathrm{d}\mathbf{\triangle}$, with $\mathbf{\triangle} = \mathbf{x}-\mathbf{y}$, is a proper probability density, provided the shift-invariant kernel $k(\mathbf{x},\mathbf{y})=k(\mathbf{x}-\mathbf{y})$ is scaled so that $k(0) = 1$.
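For the RBF kernel $k(\mathbf{\triangle}) = e^{-\Vert\mathbf{\triangle}\Vert^2/2}$, the spectral density given by Bochner’s theorem is the standard Gaussian. A Monte Carlo sketch (my own addition; the dimension, scaling, and sample size are arbitrary) checks that sampling from $p(\omega)$ recovers the kernel:

```python
import numpy as np

# E_{omega ~ N(0, I_d)}[exp(i omega^T delta)] should recover k(delta) = exp(-||delta||^2/2).
rng = np.random.default_rng(3)
d = 5
delta = 0.5 * rng.normal(size=d)
omega = rng.standard_normal((100_000, d))        # draws from p(omega) = N(0, I_d)
mc = np.mean(np.exp(1j * (omega @ delta))).real
print(mc, np.exp(-delta @ delta / 2))            # the two numbers should be close
```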

The inversion formula reverts $k(\mathbf{x} - \mathbf{y}) = \int_{\mathbb{R}^n} e^{i\omega^{\top}(\mathbf{x}-\mathbf{y})}p(\omega)\, \mathrm{d}\omega$, and we rewrite $k(\mathbf{x} - \mathbf{y}) = \operatorname{E}[\zeta_{\omega}(\mathbf{x})\zeta_{\omega}(\mathbf{y})^{\ast}]$ with $\zeta_{\omega}(\mathbf{x}) = e^{i\omega^{\top}\mathbf{x}}$, where $\zeta_{\omega}(\mathbf{x})\zeta_{\omega}(\mathbf{y})^{\ast}$ is an unbiased estimate of $k(\mathbf{x},\mathbf{y})$ when $\omega$ is drawn from $p(\omega)$. Since $p(\omega)$ and $k(\triangle)$ are real, the complex exponential $e^{i\omega^{\top}\triangle} = \cos(\omega^{\top}\triangle) + i\sin(\omega^{\top}\triangle)$ may be replaced by $\cos(\omega^{\top}\triangle)$, so $k(\mathbf{x},\mathbf{y})=\operatorname{E}[z_{\omega}(\mathbf{x})z_{\omega}(\mathbf{y})]$, where $z_{\omega}(\mathbf{x}) = \sqrt{2}\cos(\omega^{\top}\mathbf{x} + b)$ and $b$ is drawn uniformly from $[0,2\pi]$. Averaging over $m$ draws $\omega_1,\dots,\omega_m$ gives $k(\mathbf{x}-\mathbf{y}) \approx z(\mathbf{x})^{\top}z(\mathbf{y})$, which maps samples onto a lower-dimensional space with small variance (#5), leading to faster inner-product operations crucial for many machine learning algorithms (e.g. OLS, SVM, PCA).
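A sketch in the spirit of Rahimi and Recht’s random Fourier features (my own code; the RBF kernel, the dimensions, and the pair of nearby points are assumed choices, with $z(\mathbf{v}) = \sqrt{2/m}\,\cos(W\mathbf{v} + \mathbf{b})$):

```python
import numpy as np

# Random Fourier features approximating k(x, y) = exp(-||x - y||^2 / 2).
rng = np.random.default_rng(4)
d, m = 10, 2_000                          # input dimension, number of random features
x = rng.normal(size=d)
y = x + 0.5 * rng.normal(size=d)          # a nearby point, so the kernel is not ~0

W = rng.standard_normal((m, d))           # spectral samples omega_1, ..., omega_m
b = rng.uniform(0, 2 * np.pi, size=m)     # random phases
z = lambda v: np.sqrt(2.0 / m) * np.cos(W @ v + b)

print(z(x) @ z(y), np.exp(-np.sum((x - y) ** 2) / 2))   # approximation vs exact kernel
```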

**


(#1) The Fourier transform defined with either $e^{-itx}$ or $e^{itx}$ in the exponent is essentially the same, since either $i$ or $-i$ may be called the imaginary unit. (#2) The $X_n$ are not necessarily defined on a common probability space. (#3) The 2nd equality follows from $\operatorname{Im}(z) = (z-z^{\ast}) / (2i)$. (#4) $n$ denotes the input sequence size. (#5) Hoeffding’s inequality yields fast convergence in $m$: $P[{\vert z(\mathbf{x})^{\top}z(\mathbf{y}) - k(\mathbf{x},\mathbf{y}) \vert} \geq \varepsilon] \leq 2\exp(-m\varepsilon^{2}/4)$, where $z(\mathbf{x})^{\top}z(\mathbf{y}) = {1\over m}\sum_{j=1}^{m}z_{\omega_j}(\mathbf{x})z_{\omega_j}(\mathbf{y})$.




I gathered these words solely for my own purposes, with no intention of breaking the rigour of the subjects.
Well, I prefer eating corn in a spiral.