Stein's lemma, named in honor of Charles Stein, is a theorem of probability theory that is of interest primarily because of its applications to statistical inference — in particular, to James–Stein estimation and empirical Bayes methods — and its applications to portfolio choice theory.[1] The theorem gives a formula for the covariance of one random variable with the value of a function of another, when the two random variables are jointly normally distributed.
Note that the name "Stein's lemma" is also commonly used[2] to refer to a different result in the area of statistical hypothesis testing, which connects the error exponents in hypothesis testing with the Kullback–Leibler divergence. This result is also known as the Chernoff–Stein lemma[3] and is not related to the lemma discussed in this article.
Statement
Suppose X is a normally distributed random variable with expectation μ and variance σ².
Further suppose g is a differentiable function for which the two expectations E[g(X)(X − μ)] and E[g′(X)] both exist.
(The existence of the expectation of any random variable is equivalent to the finiteness of the expectation of its absolute value.)
Then

\operatorname{E}\bigl[g(X)(X-\mu)\bigr]=\sigma^{2}\operatorname{E}\bigl[g'(X)\bigr].
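For example, taking g(x) = x² in this identity gives

\operatorname{E}\bigl[X^{2}(X-\mu)\bigr]=\sigma^{2}\operatorname{E}[2X]=2\mu\sigma^{2},

which can be checked directly from the raw moments of the normal distribution, since \operatorname{E}[X^{3}]-\mu\operatorname{E}[X^{2}]=(\mu^{3}+3\mu\sigma^{2})-\mu(\mu^{2}+\sigma^{2})=2\mu\sigma^{2}.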
Multidimensional
In general, suppose X and Y are jointly normally distributed. Then

\operatorname{Cov}\bigl(g(X),Y\bigr)=\operatorname{Cov}(X,Y)\,\operatorname{E}\bigl[g'(X)\bigr].
For a general multivariate Gaussian random vector X ∼ N(μ, Σ) it follows that

\operatorname{E}\bigl[g(X)(X-\mu)\bigr]=\Sigma\,\operatorname{E}\bigl[\nabla g(X)\bigr].
Similarly, when μ = 0,

\operatorname{E}[\partial_{i}g(X)]=\operatorname{E}\bigl[g(X)(\Sigma^{-1}X)_{i}\bigr],\qquad \operatorname{E}[\partial_{i}\partial_{j}g(X)]=\operatorname{E}\bigl[g(X)\bigl((\Sigma^{-1}X)_{i}(\Sigma^{-1}X)_{j}-\Sigma_{ij}^{-1}\bigr)\bigr].
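The identity for a general multivariate Gaussian lends itself to a quick Monte Carlo sanity check. The sketch below uses NumPy with an arbitrary test function g and arbitrary illustrative values of μ and Σ; none of these choices are prescribed by the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary example parameters (illustrative assumptions, not from the lemma).
mu = np.array([1.0, -0.5])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])

def g(x):
    # A smooth test function g: R^2 -> R.
    return np.sin(x[..., 0]) * x[..., 1] ** 2

def grad_g(x):
    # Its gradient, computed analytically.
    return np.stack([np.cos(x[..., 0]) * x[..., 1] ** 2,
                     2.0 * np.sin(x[..., 0]) * x[..., 1]], axis=-1)

X = rng.multivariate_normal(mu, Sigma, size=1_000_000)

lhs = np.mean(g(X)[:, None] * (X - mu), axis=0)   # E[g(X)(X - mu)]
rhs = Sigma @ np.mean(grad_g(X), axis=0)          # Sigma E[grad g(X)]
print(lhs, rhs)  # the two vectors should agree up to Monte Carlo error
```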
Gradient descent
Stein's lemma can be used to stochastically estimate gradients:

\nabla_{\mu}\operatorname{E}_{x\sim N(\mu,\sigma^{2}I)}[f(x)]=\operatorname{E}_{x\sim N(\mu,\sigma^{2}I)}\!\left[f(x)\,\frac{x-\mu}{\sigma^{2}}\right]\approx\frac{1}{n\sigma}\sum_{i=1}^{n}f(\mu+\sigma\varepsilon_{i})\,\varepsilon_{i},

where ε₁, …, εₙ are IID samples from the standard normal distribution N(0, I). This form has applications in Stein variational gradient descent[4] and Stein variational policy gradient.[5]
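A minimal sketch of this estimator in NumPy, assuming an arbitrary black-box objective f and illustrative values of μ and σ (none of which are prescribed by the sources above):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Arbitrary black-box objective for illustration; the estimator only
    # needs function evaluations, not derivatives of f.
    return np.sum(np.cos(x), axis=-1)

def stein_gradient_estimate(mu, sigma, n_samples=1_000_000):
    """Estimate grad_mu E_{x ~ N(mu, sigma^2 I)}[f(x)] via Stein's lemma."""
    eps = rng.standard_normal((n_samples, mu.shape[0]))  # IID N(0, I) samples
    fx = f(mu + sigma * eps)                             # evaluate f at perturbed points
    return (fx[:, None] * eps).mean(axis=0) / sigma      # (1/(n*sigma)) sum_i f(mu + sigma*eps_i) eps_i

mu = np.array([0.3, -1.2])
print(stein_gradient_estimate(mu, sigma=0.1))
# For small sigma this is close to -sin(mu) ≈ [-0.296, 0.932],
# up to Monte Carlo error and the smoothing induced by sigma.
```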
Proof
The probability density function of the univariate normal distribution with expectation 0 and variance 1 is

\varphi(x)={\frac{1}{\sqrt{2\pi}}}e^{-x^{2}/2}.

Since \int x\exp(-x^{2}/2)\,dx=-\exp(-x^{2}/2), we get from integration by parts:

\operatorname{E}[g(X)X]={\frac{1}{\sqrt{2\pi}}}\int g(x)x\exp(-x^{2}/2)\,dx={\frac{1}{\sqrt{2\pi}}}\int g'(x)\exp(-x^{2}/2)\,dx=\operatorname{E}[g'(X)].
The case of general expectation μ and variance σ² follows by substitution.
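Concretely, writing X = μ + σZ with Z ∼ N(0, 1) and applying the unit-variance case to the function z ↦ g(μ + σz) gives

\operatorname{E}\bigl[g(X)(X-\mu)\bigr]=\sigma\operatorname{E}\bigl[g(\mu+\sigma Z)Z\bigr]=\sigma\operatorname{E}\bigl[\sigma g'(\mu+\sigma Z)\bigr]=\sigma^{2}\operatorname{E}\bigl[g'(X)\bigr].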
Generalizations
Isserlis' theorem is equivalently stated as

\operatorname{E}\bigl[X_{1}f(X_{1},\ldots,X_{n})\bigr]=\sum_{i=1}^{n}\operatorname{Cov}(X_{1},X_{i})\,\operatorname{E}\bigl[\partial_{X_{i}}f(X_{1},\ldots,X_{n})\bigr],

where (X₁, …, Xₙ) is a zero-mean multivariate normal random vector.
Suppose X is in an exponential family, that is, X has a density of the form

f_{\eta}(x)\propto h(x)\exp\Bigl(\sum_{i}\eta_{i}T_{i}(x)\Bigr).
Suppose this density has support (a, b), where a and b may be −∞ and ∞, and suppose g is a differentiable function with E|g′(X)| < ∞ such that exp(Σᵢ ηᵢTᵢ(x)) h(x) g(x) → 0 as x → a and as x → b. Then

\operatorname{E}\left[\left({\frac{h'(X)}{h(X)}}+\sum_{i}\eta_{i}T_{i}'(X)\right)g(X)\right]=-\operatorname{E}[g'(X)].
The derivation is the same as in the special case, namely integration by parts.
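As a sanity check, the standard normal distribution fits this framework with h(x) = 1, T(x) = x² and η = −1/2, so that h′(X)/h(X) + ηT′(X) = −X and the identity reduces to the univariate case above:

\operatorname{E}[-Xg(X)]=-\operatorname{E}[g'(X)],\qquad\text{i.e.}\qquad\operatorname{E}[Xg(X)]=\operatorname{E}[g'(X)].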
If we only know that X has support ℝ, then it could be the case that E|g(X)| and E|g′(X)| are both finite and yet f(x)g(x) does not tend to 0 as x → ∞, so the boundary condition above cannot be dropped. To see this, simply put g(x) = 1 and let f be an integrable density with infinitely many spikes towards infinity; such an f can moreover be chosen to be smooth.
Extensions to elliptically-contoured distributions also exist.[6][7][8]
References
- Ingersoll, J., Theory of Financial Decision Making, Rowman and Littlefield, 1987: 13–14.
- Csiszár, Imre; Körner, János (2011). Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press. p. 14. ISBN 9781139499989.
- Thomas M. Cover, Joy A. Thomas (2006). Elements of Information Theory. John Wiley & Sons, New York. ISBN 9781118585771.
- Liu, Qiang; Wang, Dilin (2019-09-09). "Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm". arXiv:1608.04471 [stat.ML].
- Liu, Yang; Ramachandran, Prajit; Liu, Qiang; Peng, Jian (2017-04-07). "Stein Variational Policy Gradient". arXiv:1704.02399 [cs.LG].
- Cellier, Dominique; Fourdrinier, Dominique; Robert, Christian (1989). "Robust shrinkage estimators of the location parameter for elliptically symmetric distributions". Journal of Multivariate Analysis. 29 (1): 39–52. doi:10.1016/0047-259X(89)90075-4.
- Hamada, Mahmoud; Valdez, Emiliano A. (2008). "CAPM and option pricing with elliptically contoured distributions". The Journal of Risk & Insurance. 75 (2): 387–409. CiteSeerX 10.1.1.573.4715. doi:10.1111/j.1539-6975.2008.00265.x.
- Landsman, Zinoviy; Nešlehová, Johanna (2008). "Stein's Lemma for elliptical random vectors". Journal of Multivariate Analysis. 99 (5): 912–927. doi:10.1016/j.jmva.2007.05.006.