Statistical Inference I: Descriptive Statistics
1. Summary
| Statistic | Formula | Python | R | Excel |
|---|---|---|---|---|
| **Relative Standing** | | | | |
| minimum | \({x_{\min} = \min X = \min\limits_{i=1,\cdots,n} x_i}\) | `numpy.min()` | | |
| maximum | \({x_{\max} = \max X = \max\limits_{i=1,\cdots,n} x_i}\) | `numpy.max()` | | |
| percentile | \({x_p,\ p=\dfrac{(n+1)P}{100}}\) [a] | `numpy.percentile()`, `numpy.quantile()` [A] | | |
| **Central Tendency** | | | | |
| mean | \({\bar X = \dfrac 1 n \sum_{i=1}^{n}X_i}\) (sample); \({\mu = \dfrac 1 N \sum_{i=1}^N X_i}\) (population) | | | |
| median | | | | |
| mode [1] | | | | |
| **Variability** | | | | |
| range | \({x_{\text{range}}=x_{\max} - x_{\min}}\) | `numpy.ptp()` | | |
| interquartile range | \({\mathrm{Q}3-\mathrm{Q}1}\) | | | |
| variance | \({\mathrm{Var}(X)=s^2=\dfrac{1}{n-1}\sum_{i=1}^n(X_i-\bar X)^2}\) [b] (sample); \({\sigma^2=\dfrac{1}{N}\sum_{i=1}^N (X_i-\mu)^2}\) (population) | | | |
| standard deviation [2] | \({\mathrm{SD}(X)=s=\sqrt{s^2}}\) (sample); \({\sigma=\sqrt{\sigma^2}}\) (population) | | | |
| coefficient of variation (CV) [2] | \({\mathrm{CV}=\dfrac{s}{\bar X}}\) | | | |
| **Skewness and Kurtosis** | | | | |
| skewness [3] | \({\mathrm{g}_1=\dfrac{\kappa_3}{\kappa_2^{3/2}},\ \kappa_3=\dfrac{1}{n} \sum_{i=1}^{n}(x_i-\bar X)^3,\ \kappa_2=\dfrac{1}{n} \sum_{i=1}^{n}(x_i-\bar X)^2}\) (sample); \({\gamma_1=\mathrm E\left[ \dfrac{(X-\mu)^3}{\sigma^3}\right]=\dfrac{\mu_3}{\sigma^3}}\) (population) | | | |
| kurtosis [4] | \({\mathrm g_2=\dfrac{\kappa_4}{\kappa_2^2}-3,\ \kappa_4 = \dfrac{1}{n}\sum_{i=1}^{n}(x_i-\bar X)^4}\) (sample); \({\gamma_2=\dfrac{\mu_4}{\sigma^4}-3}\) [c] (population) | | | |
| **Association** | | | | |
| covariance [5] | \({\mathrm{COV}_s(X,Y)=\dfrac{1}{n-1}\sum_{i=1}^n(x_i-\bar X)(y_i-\bar Y)}\) (sample); \({\mathrm{COV}_p(X,Y)=\dfrac{1}{N}\sum_{i=1}^N(x_i-\mu_X)(y_i-\mu_Y)}\) (population) | `numpy.cov()` | | |
| Pearson product-moment correlation coefficient [5] | \({r=\dfrac{\mathrm{COV}_s(X,Y)}{s_X s_Y}}\) (sample); \({\rho=\dfrac{\mathrm{COV}_p(X,Y)}{\sigma_X \sigma_Y}}\) (population) | `numpy.corrcoef()`, `scipy.stats.pearsonr()` | | |
| Spearman rank correlation coefficient | \({r_s = 1-\dfrac{6\sum_{i=1}^{n}d_i^2}{n(n^2-1)},\ d_i=R(x_i)-R(y_i)}\) [d] | `scipy.stats.spearmanr()` | | |
[1] The mode strictly applies only to discrete variables.
[2] Standard deviation vs. CV:
- The standard deviation is an absolute measure of dispersion.
- The CV is a relative measure of dispersion: it takes the magnitude of the values in the population or sample into account.
[3] Skewness measures the degree of asymmetry of a frequency distribution. It is based on the third moment about the mean (the third central moment), just as the variance is the second central moment.
- \({\mathrm{g}_1>0}\) or \({\gamma_1>0}\): right-skewed (positively skewed); the tail of the distribution is longer on the right than on the left, and the mean lies to the right of the median, which in turn lies to the right of the mode.
- \({\mathrm{g}_1<0}\) or \({\gamma_1<0}\): left-skewed (negatively skewed).
- \({\mathrm{g}_1=0}\) or \({\gamma_1=0}\): the distribution is symmetric.
[4]
[5] Covariance vs. correlation:
[a] This approximation is intended for sufficiently large samples.
[b] Why \({n-1}\) appears in the denominator of the sample variance:
- Since the calculation uses the sample mean, one degree of freedom has been "lost"; only \({n-1}\) independent deviations remain for computing the variance.
- With \({n}\) in the denominator, the standard deviation of a small sample tends to be underestimated.
- With \({n-1}\) in the denominator, the sample variance is an unbiased estimator of the population variance. This will be discussed in Section 1.4.
[c]:
[d]:
[A] Use the parameter `q` to specify the desired quantile: `numpy.percentile()` takes `q` on a 0–100 scale, while `numpy.quantile()` takes `q` on a 0–1 scale.
Notes:
- The lowercase letter \({n}\) denotes the sample size, while the uppercase letter \({N}\) denotes the population size.
- Uppercase letters such as \({X_i}\) usually denote random variables, while lowercase letters such as \({x_i}\) denote concrete observed values. The distinction is not always maintained consistently, so the intended meaning should be read from context.
- Sample statistics (point estimators) such as \({\bar X}\) and \({s}\) are random variables, since they vary from sample to sample. Population parameters such as \({\mu}\) and \({\sigma}\) are constants.
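As a minimal sketch of the Python column in the summary table (assuming NumPy and SciPy are installed), the statistics can be computed as follows; the data values are invented for illustration:

```python
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # toy sample

x_min, x_max = np.min(x), np.max(x)   # relative standing
p25 = np.percentile(x, 25)            # percentile: q on a 0-100 scale
q25 = np.quantile(x, 0.25)            # quantile:  q on a 0-1 scale
mean = np.mean(x)                     # central tendency
x_range = np.ptp(x)                   # range = max - min ("peak to peak")
s2 = np.var(x, ddof=1)                # sample variance, n - 1 denominator
sigma2 = np.var(x, ddof=0)            # population variance, N denominator
cv = np.std(x, ddof=1) / mean         # coefficient of variation
g1 = stats.skew(x)                    # skewness g1
g2 = stats.kurtosis(x)                # excess kurtosis g2 (normal -> 0)
```

The `ddof` argument selects the denominator, so the same NumPy call covers both the sample and the population columns of the table.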
2. Mean, Variance, Quartiles
2.1. Mean
- Sample Mean
- Population Mean
2.2. Variance
- Sample Variance
- Population Variance
2.3. Quartiles
- Quartiles
- first quartile: 25th percentile, 25%, Q1
- second quartile: 50th percentile, 50%, Q2, median, middle quartile
- third quartile: 75th percentile, 75%, Q3
- interquartile range (IQR): \({\text{IQR} = \text{Q3} - \text{Q1}}\)
- outliers (mild):
- \({\text{Q}1 - 3 \times \text{IQR} < x < \text{Q}1 - 1.5 \times \text{IQR}}\)
- \({\text{Q}3 + 1.5 \times \text{IQR} < x < \text{Q}3 + 3 \times \text{IQR}}\)
- extreme outliers
- \({x < \text{Q}1 - 3 \times \text{IQR}}\)
- \({x > \text{Q}3 + 3 \times \text{IQR}}\)
- whiskers:
- the left (or lower) edge: minimum value besides outliers; smallest data point in the range of \({\left[ \text{Q}1 - 1.5 \times \text{IQR}, \text{Q}1 \right]}\)
- the right (or upper) edge: maximum value besides outliers; largest data point in the range of \({\left[ \text{Q}3, \text{Q}3 + 1.5 \times \text{IQR} \right]}\)
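A short sketch of the fences and whiskers above, using NumPy (the data set is invented, with one deliberately extreme value):

```python
import numpy as np

data = np.array([1.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 40.0])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                     # IQR = Q3 - Q1
lower_fence = q1 - 1.5 * iqr                      # below this: outlier
upper_fence = q3 + 1.5 * iqr                      # above this: outlier
extreme_lo, extreme_hi = q1 - 3 * iqr, q3 + 3 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
extreme = data[(data < extreme_lo) | (data > extreme_hi)]

# whisker edges: most extreme data points still inside the fences
lower_whisker = data[data >= lower_fence].min()
upper_whisker = data[data <= upper_fence].max()
```

These are the same quantities a box plot draws: the box spans Q1 to Q3, the whiskers end at the last points inside the 1.5 × IQR fences, and anything beyond is plotted individually.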
3. Central Moments and Standardized Moments
3.1. Central Moment
For a one-dimensional random variable \({X}\), the \({k}\)-th central moment \({\mu_k}\) is the \({k}\)-th moment about the expected value of \({X}\): \({\mu_k = \mathrm E\left[(X - \mathrm E(X))^k\right]}\).
- The 0th central moment \({\mu _{0}}\) is always 1.
- The 1st central moment \({\mu _{1}}\) is always 0.
- The 2nd central moment \({\mu _{2}}\) is the variance \({\mathrm{Var}(X)}\) of \({X}\).
- The 3rd central moment \({\mu _{3}}\) is used to define the skewness of \({X}\).
- The 4th central moment \({\mu _{4}}\) is used to define the kurtosis of \({X}\).
3.2. Standardized Moment
The \({k}\)-th standardized moment of a one-dimensional random variable \({X}\) is the ratio of its \({k}\)-th central moment \({\mu_k}\) to the \({k}\)-th power of its standard deviation, \({\sigma^k}\): \({\hat \mu_k = \mu_k / \sigma^k}\).
- \({\hat \mu_1 = 0}\)
- \({\hat \mu_2=1}\)
- The 3rd standardized moment \({\hat \mu_3}\) is used to define skewness.
- The 4th standardized moment \({\hat \mu_4}\) is used to define kurtosis.
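The relationships between central moments, skewness, and kurtosis can be checked numerically; a sketch using `scipy.stats.moment` on simulated normal data (seed and sample size are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)  # sigma^2 = 9

mu2 = stats.moment(x, 2)   # 2nd central moment = variance
mu3 = stats.moment(x, 3)   # 3rd central moment
mu4 = stats.moment(x, 4)   # 4th central moment

skew = mu3 / mu2 ** 1.5    # 3rd standardized moment
kurt = mu4 / mu2 ** 2 - 3  # 4th standardized moment, minus 3 (excess)
```

For a normal population both `skew` and `kurt` should be near zero, and the ratios agree with `stats.skew(x)` and `stats.kurtosis(x)` computed directly.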
4. Properties of Estimators
- Unbiasedness: An estimator is said to be unbiased if its expected value equals the true population parameter it is meant to estimate.
- Efficiency: Among unbiased estimators, the one with minimum variance is regarded as more efficient than the alternatives.
- Consistency: An estimator \({\hat \Theta}\) is said to be consistent if the probability of it being close to the true value of the parameter it estimates (\({\theta}\)) increases with increasing sample size. A sufficient condition for consistency is that the estimator is asymptotically unbiased and that its variance tends to zero as \({n\rightarrow \infty}\).
- Sufficiency: An estimator is said to be sufficient if it contains all the information in the data about the parameter it estimates.
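Unbiasedness of the \({n-1}\) sample variance can be illustrated by simulation; a sketch (seed, sample size, and replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma2 = 4.0            # true population variance
n = 5                   # deliberately small samples
reps = 200_000          # number of repeated samples

samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2_unbiased = samples.var(axis=1, ddof=1)  # n - 1 denominator
s2_biased = samples.var(axis=1, ddof=0)    # n denominator

# E[S^2] with ddof=1 is sigma^2; with ddof=0 it is (n-1)/n * sigma^2,
# i.e. systematically too small for small n
```

Averaging over many repeated samples, the `ddof=1` estimator centers on 4.0 while the `ddof=0` version centers on \({(n-1)/n \cdot \sigma^2 = 3.2}\).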
5. Methods of Displaying Data
| Diagram Type | Python | R |
|---|---|---|
| Histograms | ||
| Ogives | ||
| Box Plots | ||
| Scatter Diagrams | ||
| Bar Charts | ||
| Line Charts | ||
| Pie Charts | ||
QQ-Plot
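Setting plot rendering aside, the frequencies behind a histogram and an ogive can be computed with `numpy.histogram`; the data here are invented:

```python
import numpy as np

data = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 6], dtype=float)

# histogram: class frequencies over five equal-width bins on [1, 6]
counts, edges = np.histogram(data, bins=5, range=(1, 6))

# ogive: cumulative relative frequency at each upper bin edge
cum_rel = np.cumsum(counts) / counts.sum()
```

These arrays are what a plotting call would draw, e.g. `matplotlib.pyplot.bar` for the histogram and `matplotlib.pyplot.plot` for the ogive in Python, or `hist()` and `plot()` in R.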
2. Statistical Inference (1)
2.1. Probability Basics
2.1.1. Probability
-
Discrete Random Variable
-
Continuous Random Variables
2.1.2. Probability Density Function
The probability density function (PDF) \({f(x)}\) of a continuous random variable \({X}\) is used to determine probabilities as follows:
\({\mathbb P[a \le X \le b] = \int_a^b f(x)\,\text{d}x}\)
The properties of the PDF are:
- \({f(x) \ge 0}\)
- \({\int_{-\infty}^{+\infty} f(x) \,\text{d}x = 1}\)
- PDF and Histogram
- PDF
- quantify relative likelihood of the values of \({x}\)
- describe a population
- Histogram
- quantify relative frequency of the values of \({x}\)
- describe a sample
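The two defining properties of a PDF can be verified numerically for a concrete density; a sketch using the standard normal from `scipy.stats` (the grid and integration limits are arbitrary choices):

```python
import numpy as np
from scipy import stats

x = np.linspace(-10.0, 10.0, 200_001)  # fine grid; mass beyond +-10 is negligible
f = stats.norm.pdf(x)                  # standard normal PDF
dx = x[1] - x[0]

assert np.all(f >= 0)                  # property 1: f(x) >= 0 everywhere

# property 2: total area under the PDF is 1 (trapezoidal rule)
total = np.sum((f[:-1] + f[1:]) / 2) * dx

# probabilities are areas: P(a <= X <= b) = integral of f from a to b
fa = f[(x >= -1.96) & (x <= 1.96)]
p = np.sum((fa[:-1] + fa[1:]) / 2) * dx  # ~0.95 for the standard normal
```

The same area-under-the-curve reading connects the PDF to the histogram: a density-normalized histogram of a large sample approximates the population PDF.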
2.1.3. Central Limit Theorem
- Normal Distribution
  - Definition: A normal random variable with \({\mu = 0}\) and \({\sigma^2=1}\) is called a standard normal random variable, denoted by \({Z}\).
  - Its cumulative distribution function is \({\Phi(z)=\mathbb{P}[Z \leq z]}\).
- Central Limit Theorem
  Let \({X_1,X_2,\ldots,X_n}\) be a random sample of size \({n}\) taken from a population with mean \({\mu}\) and variance \({\sigma^2}\), and let \({\bar{X}}\) be the sample mean. Then the limiting form of the distribution of
  \({Z = \dfrac{\bar X - \mu}{\sigma/\sqrt{n}}}\)
  is the standard normal distribution.
  The sampling distribution of \({\bar{X}}\) is approximately normal with mean \({\mu}\) and variance \({\sigma^2 /n}\).
  Note: this says the sample mean is approximately normally distributed; see the animation: Sampling Distribution.
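The theorem can be illustrated by simulation; a sketch drawing sample means from a decidedly non-normal uniform population (seed, \({n}\), and replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n = 0.5, 1.0 / 12.0, 30   # Uniform(0,1): mu = 1/2, sigma^2 = 1/12

# 100,000 sample means, each computed from a sample of size n
xbar = rng.uniform(0.0, 1.0, size=(100_000, n)).mean(axis=1)

# standardized sample mean: Z = (Xbar - mu) / (sigma / sqrt(n))
z = (xbar - mu) / np.sqrt(sigma2 / n)
```

Even though the population is uniform, `z` behaves like a standard normal: mean near 0, standard deviation near 1, and about 95% of its values inside ±1.96.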
2.2. Point Estimation
2.2.1. Introduction
- The goal is to estimate population parameters, such as the population mean and population variance, usually denoted by \({\theta}\).
- A point estimate is the most plausible value for \({\theta}\).
2.2.2. Random Sample and Statistic
Let \({X_1,...,X_n}\) be independent, identically distributed random variables taken from a population.
- The set \({X_1,\cdots,X_n}\) is called a random sample
- A function of the random variables \({\hat{\Theta} = h(X_1,\cdots,X_n)}\) is called a statistic
- The probability distribution of a statistic is called its sampling distribution
- The statistic \({\hat{\Theta} = h(X_1,\cdots,X_n)}\) is called a point estimator of a population parameter \({\theta}\) when it is used to estimate \({\theta}\)
- A point estimate of \({\theta}\) is a single numerical value \({\hat{\theta}}\) of \({\hat{\Theta}}\)
2.2.3. Typical Point Estimates
| Unknown Parameter | Statistic / Point Estimator | Point Estimate |
|---|---|---|
| \({\theta}\) | \({\hat \Theta}\) | \({\hat \theta}\) |
| \({\mu}\) | \({\bar{X}=\dfrac{\sum X_i}{n}}\) | \({\bar x}\) |
| \({\sigma^2}\) | \({S^2 = \dfrac{\sum(X_i-\bar X)^2}{n-1}}\) | \({s^2}\) |
| \({p}\) | \({\hat{P} = \dfrac{X}{n}}\) | \({\hat p}\) |
| \({\mu_1-\mu_2}\) | \({\bar X_1-\bar X_2=\dfrac{\sum X_{1i}}{n_1}-\dfrac{\sum X_{2i}}{n_2}}\) | \({\bar x_1-\bar x_2}\) |
| \({p_1-p_2}\) | \({\hat P_1 - \hat P_2 = \dfrac{X_1}{n_1} - \dfrac{X_2}{n_2}}\) | \({\hat p_1 - \hat p_2}\) |
| unobservable quantity (population) | function of observable random variables (sample); the capital letter \({X}\) denotes a random variable | single numerical value once a sample has been selected; the lowercase letter \({x}\) denotes an observed value |
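A sketch of the last two estimator rows of the table; all sample sizes, true parameter values, and the seed are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# point estimate of mu_1 - mu_2 from two independent samples
x1 = rng.normal(10.0, 2.0, size=50)   # population 1: mu_1 = 10
x2 = rng.normal(8.0, 2.0, size=60)    # population 2: mu_2 = 8
mean_diff = x1.mean() - x2.mean()     # estimates mu_1 - mu_2 = 2

# point estimate of p_1 - p_2 from success counts X_1, X_2
n1, n2 = 400, 500
x_succ1 = rng.binomial(n1, 0.30)      # successes in n1 trials, p_1 = 0.30
x_succ2 = rng.binomial(n2, 0.20)      # successes in n2 trials, p_2 = 0.20
p_diff = x_succ1 / n1 - x_succ2 / n2  # estimates p_1 - p_2 = 0.10
```

Each run yields a different numerical value (`mean_diff`, `p_diff`): the estimators are random variables, while the quantities they target are fixed constants.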

浙公网安备 33010602011771号