Statistical Inference I: Descriptive Statistics
1. Summary
| Statistic | Formula | Python | R | Excel |
|---|---|---|---|---|
| **Relative Standing** | | | | |
| minimum | \({x_{\min} = \min X = \min\limits_{i=1,\cdots,n} x_i}\) | `numpy.min()` | | |
| maximum | \({x_{\max} = \max X = \max\limits_{i=1,\cdots,n} x_i}\) | `numpy.max()` | | |
| percentile | \({x_p,\ p=\dfrac{(n+1)P}{100}}\) [a] | `numpy.percentile()`, `numpy.quantile()` [A] | | |
| **Central Tendency** | | | | |
| mean | \({\bar X = \dfrac 1 n \sum_{i=1}^{n}X_i}\) (sample); \({\mu = \dfrac 1 N \sum_{i=1}^N X_i}\) (population) | | | |
| median | | | | |
| mode [1] | | | | |
| **Variability** | | | | |
| range | \({x_{\text{range}}=x_{\max} - x_{\min}}\) | `numpy.ptp()` | | |
| interquartile range | \({\mathrm{Q}3-\mathrm{Q}1}\) | | | |
| variance | \({\mathrm{Var}(X)=s^2=\dfrac{1}{n-1}\sum_{i=1}^n(X_i-\bar X)^2}\) [b] (sample); \({\sigma^2=\dfrac{1}{N}\sum_{i=1}^N (X_i-\mu)^2}\) (population) | | | |
| standard deviation [2] | \({\mathrm{SD}(X)=s=\sqrt{s^2}}\) (sample); \({\sigma=\sqrt{\sigma^2}}\) (population) | | | |
| coefficient of variation (CV) [2] | \({\mathrm{CV}=\dfrac{s}{\bar X}}\) | | | |
| **Skewness and Kurtosis** | | | | |
| skewness [3] | \({\mathrm{g}_1=\dfrac{\kappa_3}{\kappa_2^{3/2}},\ \kappa_3=\dfrac{1}{n} \sum_{i=1}^{n}(x_i-\bar X)^3,\ \kappa_2=\dfrac{1}{n} \sum_{i=1}^{n}(x_i-\bar X)^2}\) (sample); \({\gamma_1=\mathrm E\left[ \dfrac{(X-\mu)^3}{\sigma^3}\right]=\dfrac{\mu_3}{\sigma^3}}\) (population) | | | |
| kurtosis [4] | \({\mathrm g_2=\dfrac{\kappa_4}{\kappa_2^2}-3,\ \kappa_4 = \dfrac{1}{n}\sum_{i=1}^{n}(x_i-\bar X)^4}\) (sample); \({\gamma_2=\dfrac{\mu_4}{\sigma^4}-3}\) [c] (population) | | | |
| **Association** | | | | |
| covariance [5] | \({\mathrm{COV}_s(X,Y)=\dfrac{1}{n-1}\sum_{i=1}^n(x_i-\bar X)(y_i-\bar Y)}\) (sample); \({\mathrm{COV}_p(X,Y)=\dfrac{1}{N}\sum_{i=1}^N(x_i-\mu_X)(y_i-\mu_Y)}\) (population) | `numpy.cov()` | | |
| Pearson product-moment correlation coefficient [5] | \({r=\dfrac{\mathrm{COV}_s(X,Y)}{s_X s_Y}}\) (sample); \({\rho=\dfrac{\mathrm{COV}_p(X,Y)}{\sigma_X \sigma_Y}}\) (population) | `numpy.corrcoef()`, `scipy.stats.pearsonr()` | | |
| Spearman rank correlation coefficient | \({r_s = 1-\dfrac{6\sum_{i=1}^{n}d_i^2}{n(n^2-1)},\ d_i=R(x_i)-R(y_i)}\) [d] | `scipy.stats.spearmanr()` | | |
[1] The mode strictly applies only to discrete variables.
[2] Standard deviation vs. CV:
- The standard deviation is an absolute measure of dispersion.
- The CV is a relative measure of dispersion: it takes the magnitude of the values in the population or sample into account.
[3] Skewness measures the degree of asymmetry of a frequency distribution. It is based on the third moment about the mean (the third central moment), just as the variance is the second central moment.
- \({\mathrm{g}_1>0}\) or \({\gamma_1>0}\): right-skewed (positively skewed); the tail of the distribution is longer on the right than on the left, and the mean lies to the right of the median, which in turn lies to the right of the mode.
- \({\mathrm{g}_1<0}\) or \({\gamma_1<0}\): left-skewed (negatively skewed).
- \({\mathrm{g}_1=0}\) or \({\gamma_1=0}\): the distribution is symmetric.
[4]
[5] Covariance vs. correlation:
[a] This approximation is intended for sufficiently large samples.
[b] Why \({n-1}\) appears in the denominator of the sample variance:
- Since the calculation uses the sample mean, one degree of freedom has been "lost"; only \({n-1}\) independent deviations remain for computing the variance.
- With \({n}\) in the denominator, the standard deviation of a small sample tends to be underestimated.
- With \({n-1}\) in the denominator, the sample variance is an unbiased estimator of the population variance. This will be discussed in Section 1.4.
[c]:
[d]:
[A] Use the parameter `q` to specify the desired quantile: `numpy.percentile()` takes `q` on a 0–100 scale, while `numpy.quantile()` takes `q` on a 0–1 scale.
Notes:
- The lowercase letter \({n}\) denotes the sample size, while the uppercase letter \({N}\) denotes the population size.
- Uppercase letters such as \({X_i}\) usually denote random variables, while lowercase letters such as \({x_i}\) denote concrete observed values. The distinction is not always maintained consistently, so the intended meaning should be read from context.
- Sample statistics (point estimators) such as \({\bar X}\) and \({s}\) are random variables, since they vary from sample to sample. Population parameters such as \({\mu}\) and \({\sigma}\) are constants.
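As a minimal sketch of the Python column in the summary table (assuming NumPy and SciPy are installed), the statistics can be computed as follows; the data values are invented for illustration:

```python
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # toy sample

x_min, x_max = np.min(x), np.max(x)   # relative standing
p25 = np.percentile(x, 25)            # percentile: q on a 0-100 scale
q25 = np.quantile(x, 0.25)            # quantile:  q on a 0-1 scale
mean = np.mean(x)                     # central tendency
x_range = np.ptp(x)                   # range = max - min ("peak to peak")
s2 = np.var(x, ddof=1)                # sample variance, n - 1 denominator
sigma2 = np.var(x, ddof=0)            # population variance, N denominator
cv = np.std(x, ddof=1) / mean         # coefficient of variation
g1 = stats.skew(x)                    # skewness g1
g2 = stats.kurtosis(x)                # excess kurtosis g2 (normal -> 0)
```

The `ddof` argument selects the denominator, so the same NumPy call covers both the sample and the population columns of the table.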
2. Mean, Variance, Quartiles
2.1. Mean
- Sample Mean
- Population Mean
2.2. Variance
- Sample Variance
- Population Variance
2.3. Quartiles
- Quartiles
- first quartile: 25th percentile, 25%, Q1
- second quartile: 50th percentile, 50%, Q2, median, middle quartile
- third quartile: 75th percentile, 75%, Q3
- interquartile range (IQR): \({\text{IQR} = \text{Q3} - \text{Q1}}\)
- outliers (mild):
- \({\text{Q}1 - 3 \times \text{IQR} < x < \text{Q}1 - 1.5 \times \text{IQR}}\)
- \({\text{Q}3 + 1.5 \times \text{IQR} < x < \text{Q}3 + 3 \times \text{IQR}}\)
- extreme outliers
- \({x < \text{Q}1 - 3 \times \text{IQR}}\)
- \({x > \text{Q}3 + 3 \times \text{IQR}}\)
- whiskers:
- the left (or lower) edge: minimum value besides outliers; smallest data point in the range of \({\left[ \text{Q}1 - 1.5 \times \text{IQR}, \text{Q}1 \right]}\)
- the right (or upper) edge: maximum value besides outliers; largest data point in the range of \({\left[ \text{Q}3, \text{Q}3 + 1.5 \times \text{IQR} \right]}\)
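A short sketch of the fences and whiskers above, using NumPy (the data set is invented, with one deliberately extreme value):

```python
import numpy as np

data = np.array([1.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 40.0])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                     # IQR = Q3 - Q1
lower_fence = q1 - 1.5 * iqr                      # below this: outlier
upper_fence = q3 + 1.5 * iqr                      # above this: outlier
extreme_lo, extreme_hi = q1 - 3 * iqr, q3 + 3 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
extreme = data[(data < extreme_lo) | (data > extreme_hi)]

# whisker edges: most extreme data points still inside the fences
lower_whisker = data[data >= lower_fence].min()
upper_whisker = data[data <= upper_fence].max()
```

These are the same quantities a box plot draws: the box spans Q1 to Q3, the whiskers end at the last points inside the 1.5 × IQR fences, and anything beyond is plotted individually.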
3. Central Moments and Standardized Moments
3.1. Central Moment
For a one-dimensional random variable \({X}\), the \({k}\)-th central moment \({\mu_k}\) is the \({k}\)-th moment about the expected value of \({X}\): \({\mu_k = \mathrm E\left[(X - \mathrm E(X))^k\right]}\).
- The 0th central moment \({\mu _{0}}\) is always 1.
- The 1st central moment \({\mu _{1}}\) is always 0.
- The 2nd central moment \({\mu _{2}}\) is the variance \({\mathrm{Var}(X)}\) of \({X}\).
- The 3rd central moment \({\mu _{3}}\) is used to define the skewness of \({X}\).
- The 4th central moment \({\mu _{4}}\) is used to define the kurtosis of \({X}\).
3.2. Standardized Moment
The \({k}\)-th standardized moment of a one-dimensional random variable \({X}\) is the ratio of its \({k}\)-th central moment \({\mu_k}\) to the \({k}\)-th power of its standard deviation, \({\sigma^k}\): \({\hat \mu_k = \mu_k / \sigma^k}\).
- \({\hat \mu_1 = 0}\)
- \({\hat \mu_2=1}\)
- The 3rd standardized moment \({\hat \mu_3}\) is used to define skewness.
- The 4th standardized moment \({\hat \mu_4}\) is used to define kurtosis.
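The relationships between central moments, skewness, and kurtosis can be checked numerically; a sketch using `scipy.stats.moment` on simulated normal data (seed and sample size are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)  # sigma^2 = 9

mu2 = stats.moment(x, 2)   # 2nd central moment = variance
mu3 = stats.moment(x, 3)   # 3rd central moment
mu4 = stats.moment(x, 4)   # 4th central moment

skew = mu3 / mu2 ** 1.5    # 3rd standardized moment
kurt = mu4 / mu2 ** 2 - 3  # 4th standardized moment, minus 3 (excess)
```

For a normal population both `skew` and `kurt` should be near zero, and the ratios agree with `stats.skew(x)` and `stats.kurtosis(x)` computed directly.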
4. Properties of Estimators
- Unbiasedness: An estimator is said to be unbiased if its expected value equals the true population parameter it is meant to estimate.
- Efficiency: Among unbiased estimators, the one with minimum variance is regarded as more efficient than the alternatives.
- Consistency: An estimator \({\hat \Theta}\) is said to be consistent if the probability of it being close to the true value of the parameter it estimates (\({\theta}\)) increases with increasing sample size. A sufficient condition for consistency is that the estimator is asymptotically unbiased and that its variance tends to zero as \({n\rightarrow \infty}\).
- Sufficiency: An estimator is said to be sufficient if it contains all the information in the data about the parameter it estimates.
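Unbiasedness of the \({n-1}\) sample variance can be illustrated by simulation; a sketch (seed, sample size, and replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma2 = 4.0            # true population variance
n = 5                   # deliberately small samples
reps = 200_000          # number of repeated samples

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2_unbiased = samples.var(axis=1, ddof=1)  # n - 1 denominator
s2_biased = samples.var(axis=1, ddof=0)    # n denominator

# E[S^2] with ddof=1 is sigma^2; with ddof=0 it is (n-1)/n * sigma^2,
# i.e. systematically too small for small n
```

Averaging over many repeated samples, the `ddof=1` estimator centers on 4.0 while the `ddof=0` version centers on \({(n-1)/n \cdot \sigma^2 = 3.2}\).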
5. Methods of Displaying Data
| Diagram Type | Python | R |
|---|---|---|
| Histograms | ||
| Ogives | ||
| Box Plots | ||
| Scatter Diagrams | ||
| Bar Charts | ||
| Line Charts | ||
| Pie Charts | ||
QQ-Plot
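Setting plot rendering aside, the frequencies behind a histogram and an ogive can be computed with `numpy.histogram`; the data here are invented:

```python
import numpy as np

data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 6], dtype=float)

# histogram: class frequencies over five equal-width bins on [1, 6]
counts, edges = np.histogram(data, bins=5, range=(1, 6))

# ogive: cumulative relative frequency at each upper bin edge
cum_rel = np.cumsum(counts) / counts.sum()
```

These arrays are what a plotting call would draw, e.g. `matplotlib.pyplot.bar` for the histogram and `matplotlib.pyplot.plot` for the ogive in Python, or `hist()` and `plot()` in R.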
2. Statistical Inference (1)
2.1. Probability Basics
2.1.1. Probability
-
Discrete Random Variable
-
Continuous Random Variables
2.1.2. Probability Density Function
The probability density function (PDF) \({f(x)}\) of a continuous random variable \({X}\) is used to determine probabilities as follows:
\({\mathbb P[a \le X \le b] = \int_a^b f(x)\,\text{d}x}\)
The properties of the PDF are:
- \({f(x) \ge 0}\)
- \({\int_{-\infty}^{+\infty} f(x) \,\text{d}x = 1}\)
- PDF and Histogram
- PDF
- quantify relative likelihood of the values of \({x}\)
- describe a population
- Histogram
- quantify relative frequency of the values of \({x}\)
- describe a sample
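The two defining properties of a PDF can be verified numerically for a concrete density; a sketch using the standard normal from `scipy.stats` (the grid and integration limits are arbitrary choices):

```python
import numpy as np
from scipy import stats

x = np.linspace(-10.0, 10.0, 200_001)  # fine grid; mass beyond +-10 is negligible
f = stats.norm.pdf(x)                  # standard normal PDF
dx = x[1] - x[0]

assert np.all(f >= 0)                  # property 1: f(x) >= 0 everywhere

# property 2: total area under the PDF is 1 (trapezoidal rule)
total = np.sum((f[:-1] + f[1:]) / 2) * dx

# probabilities are areas: P(a <= X <= b) = integral of f from a to b
fa = f[(x >= -1.96) & (x <= 1.96)]
p = np.sum((fa[:-1] + fa[1:]) / 2) * dx  # ~0.95 for the standard normal
```

The same area-under-the-curve reading connects the PDF to the histogram: a density-normalized histogram of a large sample approximates the population PDF.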
2.1.3. Central Limit Theorem
- Normal Distribution
  - Definition: A normal random variable with \({\mu = 0}\) and \({\sigma^2=1}\) is called a standard normal random variable, denoted by \({Z}\).
  - Its cumulative distribution function is \({\Phi(z)=\mathbb{P}[Z \leq z]}\).
- Central Limit Theorem
  Let \({X_1,X_2,\ldots,X_n}\) be a random sample of size \({n}\) taken from a population with mean \({\mu}\) and variance \({\sigma^2}\), and let \({\bar{X}}\) be the sample mean. Then the limiting form of the distribution of
  \({Z = \dfrac{\bar X - \mu}{\sigma/\sqrt{n}}}\)
  is the standard normal distribution.
  The sampling distribution of \({\bar{X}}\) is approximately normal with mean \({\mu}\) and variance \({\sigma^2 /n}\).
  Note: this says the sample mean is approximately normally distributed; see the animation: Sampling Distribution.
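The theorem can be illustrated by simulation; a sketch drawing sample means from a decidedly non-normal uniform population (seed, \({n}\), and replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n = 0.5, 1.0 / 12.0, 30   # Uniform(0,1): mu = 1/2, sigma^2 = 1/12

# 100,000 sample means, each computed from a sample of size n
xbar = rng.uniform(0.0, 1.0, size=(100_000, n)).mean(axis=1)

# standardized sample mean: Z = (Xbar - mu) / (sigma / sqrt(n))
z = (xbar - mu) / np.sqrt(sigma2 / n)
```

Even though the population is uniform, `z` behaves like a standard normal: mean near 0, standard deviation near 1, and about 95% of its values inside ±1.96.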
2.2. Point Estimation
2.2.1. Introduction
- The goal is to estimate population parameters, such as the population mean and population variance, usually denoted by \({\theta}\).
- A point estimate is the most plausible value for \({\theta}\).
2.2.2. Random Sample and Statistic
Let \({X_1,...,X_n}\) be independent, identically distributed random variables taken from a population.
- The set \({X_1,\cdots,X_n}\) is called a random sample
- A function of the random variables \({\hat{\Theta} = h(X_1,\cdots,X_n)}\) is called a statistic
- The probability distribution of a statistic is called its sampling distribution
- The statistic \({\hat{\Theta} = h(X_1,\cdots,X_n)}\) is called a point estimator of a population parameter \({\theta}\) when it is used to estimate \({\theta}\)
- A point estimate of \({\theta}\) is a single numerical value \({\hat{\theta}}\) of \({\hat{\Theta}}\)
2.2.3. Typical Point Estimates
| Unknown Parameter | Statistic / Point Estimator | Point Estimate |
|---|---|---|
| \({\theta}\) | \({\hat \Theta}\) | \({\hat \theta}\) |
| \({\mu}\) | \({\bar{X}=\dfrac{\sum X_i}{n}}\) | \({\bar x}\) |
| \({\sigma^2}\) | \({S^2 = \dfrac{\sum(X_i-\bar X)^2}{n-1}}\) | \({s^2}\) |
| \({p}\) | \({\hat{P} = \dfrac{X}{n}}\) | \({\hat p}\) |
| \({\mu_1-\mu_2}\) | \({\bar X_1-\bar X_2=\dfrac{\sum X_{1i}}{n_1}-\dfrac{\sum X_{2i}}{n_2}}\) | \({\bar x_1-\bar x_2}\) |
| \({p_1-p_2}\) | \({\hat P_1 - \hat P_2 = \dfrac{X_1}{n_1} - \dfrac{X_2}{n_2}}\) | \({\hat p_1 - \hat p_2}\) |
| unobservable quantity (population) | function of observable random variables (sample); the capital letter \({X}\) denotes a random variable | single numerical value once a sample has been selected; the lowercase letter \({x}\) denotes an observed value |
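A sketch of the last two estimator rows of the table; all sample sizes, true parameter values, and the seed are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# point estimate of mu_1 - mu_2 from two independent samples
x1 = rng.normal(10.0, 2.0, size=50)   # population 1: mu_1 = 10
x2 = rng.normal(8.0, 2.0, size=60)    # population 2: mu_2 = 8
mean_diff = x1.mean() - x2.mean()     # estimates mu_1 - mu_2 = 2

# point estimate of p_1 - p_2 from success counts X_1, X_2
n1, n2 = 400, 500
x_succ1 = rng.binomial(n1, 0.30)      # successes in n1 trials, p_1 = 0.30
x_succ2 = rng.binomial(n2, 0.20)      # successes in n2 trials, p_2 = 0.20
p_diff = x_succ1 / n1 - x_succ2 / n2  # estimates p_1 - p_2 = 0.10
```

Each run yields a different numerical value (`mean_diff`, `p_diff`): the estimators are random variables, while the quantities they target are fixed constants.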

浙公网安备 33010602011771号