# Probability

## Law of Large Numbers (the cornerstone of probability)

If we observe more and more repetitions of any chance process, the proportion of times that a specific outcome occurs approaches a single value.
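The statement above can be illustrated with a small simulation. This is a minimal sketch, assuming a fair coin (`p_heads = 0.5`) and a fixed seed; the function name `running_proportion` is made up for this example.

```python
import random

def running_proportion(num_flips, p_heads=0.5, seed=0):
    """Return the proportion of heads after each of num_flips simulated flips."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = 0
    proportions = []
    for flip in range(1, num_flips + 1):
        if rng.random() < p_heads:
            heads += 1
        proportions.append(heads / flip)
    return proportions

props = running_proportion(100_000)
# The proportion of heads settles near 0.5 as the number of flips grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(props[n - 1], 4))
```

The early proportions can be far from $0.5$; only the long-run proportion is stable.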

## Probability

$$P(A) = \frac {\text {number of outcomes corresponding to event }A}{\text{total number of outcomes in sample space}}$$

## Probability Rules

• Legitimate values:
$$0 \leq P(A) \leq 1$$
• Probability for sample space:
$$\sum_i P_i = 1$$
• Complement Rule:
$$P(A^C) = 1 - P(A)$$

## Conditional Probability

Definition: $P(B|A)$ represents the probability that event $B$ happens given that event $A$ has happened.

Computation:
$$P(A|B) = \frac {P(A \cap B)} {P(B)}$$
$$P(B|A) = \frac {P(A \cap B)} {P(A)}$$
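As a worked illustration, the sketch below enumerates the $36$ equally likely outcomes of two dice and divides the probability of the intersection by the probability of the conditioning event. The events chosen here ($A$: the sum is $8$; $B$: the first die is even) are assumptions made for the example.

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event given as a predicate over outcomes."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def is_a(o):  # A: the sum of the dice is 8
    return o[0] + o[1] == 8

def is_b(o):  # B: the first die is even
    return o[0] % 2 == 0

p_a_and_b = prob(lambda o: is_a(o) and is_b(o))
p_a_given_b = p_a_and_b / prob(is_b)
p_b_given_a = p_a_and_b / prob(is_a)
print(p_a_given_b, p_b_given_a)
```

Here $P(A|B) \neq P(A)$, so these two events are not independent.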

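## Independence

Events $A$ and $B$ are independent when the following equivalent conditions hold (the conditional forms assume $P(A) > 0$ and $P(B) > 0$):

$$P(A\cap B) = P(A)P(B)$$
$$P(A|B) = P(A)$$
$$P(B|A) = P(B)$$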

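## Mutually Exclusive

Two events are mutually exclusive if, whenever one happens, the other cannot happen.

## Mutually Exclusive vs. Independent

Mutually exclusive events satisfy

$$P(A\cap B) = 0$$

so two mutually exclusive events with nonzero probabilities are never independent: independence would require $P(A \cap B) = P(A)P(B) > 0$.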

## An important conclusion

$$P(A \cup B) = P(A) + P(B) - P(A\cap B)$$

P.S. When $A$ and $B$ are sets,

$$\operatorname{card}(A \cup B) = \operatorname{card}(A) + \operatorname{card}(B) - \operatorname{card}(A \cap B)$$

## Two-way Tables

A two-way table is used to study the relationship between two categorical variables.

## Combination

Definition: the number of ways to choose $m$ items out of $n$.

$$C_n^m = \binom {n}{m} = \displaystyle\frac {n!}{m!(n-m)!}$$

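## Repeated Independent Trials

For example, the probability of exactly $3$ successes in $5$ independent trials, each with success probability $0.7$, is

$$\binom{5}{3} \times 0.7^3 \times 0.3 ^ 2$$

In general, the probability of exactly $m$ successes in $n$ independent trials with success probability $p$ is

$$\binom{n}{m} \times p^m \times (1 - p)^{n - m}$$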
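The two formulas above can be checked numerically. A minimal sketch using only the standard library; the helper name `binomial_probability` is an assumption of this example.

```python
from math import comb

def binomial_probability(n, m, p):
    """Probability of exactly m successes in n independent trials."""
    return comb(n, m) * p**m * (1 - p) ** (n - m)

# The concrete case from the text: 3 successes in 5 trials with p = 0.7.
example = binomial_probability(5, 3, 0.7)
print(round(example, 4))  # 10 * 0.343 * 0.09 = 0.3087
```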

# Random Variable

Definition: A numerical variable that describes the outcomes of a chance process is called a random variable.

# Probability Distribution

Definition: The probability model for a random variable is its probability distribution.

# Discrete Random Variable

Definition: A discrete random variable $X$ takes a fixed set of possible values. The probability distribution of discrete random variable $X$ lists the values $x_i$ and their probabilities $p_i$.

## Mean of a Discrete Random Variable

Definition: The mean of any discrete random variable is an average of the possible outcomes, with each weighted by its probability.

$$\mu_x = E[X] = \sum_i x_ip_i$$

P.S. The mean is also called the expected value.

## Variance of a Discrete Random Variable

$$\operatorname{Var}[X] = \sum_i (E[X] - x_i) ^ 2 p_i$$

## Standard Deviation of a Discrete Random Variable

$$\sigma_X = \sqrt {\operatorname{Var}[X]} = \sqrt {\sum _ i (E[X] - x_i) ^ 2 p_i}$$

## An Important Formula

$$\operatorname{Var}[X] = E[X^2] - E^2[X]$$

$$
\begin{aligned}
\textit{Proof. }\quad \operatorname{Var}[X] &= \sum_i(E[X] - x_i) ^ 2 p_i \\
&= E[(E[X] - X)^2] \\
&= E[E^2[X] - 2XE[X] + X^2] \\
&= E^2[X] - 2E^2[X] + E[X^2] \\
&= E[X^2] - E^2[X]
\end{aligned}
$$

P.S. By definition, $E[X^2]$ is

$$E[X^2] = \sum_i x_i^2p_i$$
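The definitions of the mean and variance, and the shortcut $\operatorname{Var}[X] = E[X^2] - E^2[X]$, can be verified on a concrete distribution. The sketch below uses a fair six-sided die as an assumed example and exact fractions to avoid rounding.

```python
from fractions import Fraction

# Assumed example distribution: a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

mean = sum(x * p for x, p in zip(values, probs))                   # E[X]
var_def = sum((mean - x) ** 2 * p for x, p in zip(values, probs))  # definition
mean_sq = sum(x**2 * p for x, p in zip(values, probs))             # E[X^2]
var_shortcut = mean_sq - mean**2                                   # E[X^2] - E^2[X]

print(mean, var_def, var_shortcut)
```

Both routes give the same variance, as the proof predicts.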

## Linear Transformation of Random Variables

$$
\begin{aligned}
X &\rightarrow aX + b \\
E[X] &\rightarrow aE[X] + b \\
\operatorname{Var}[X] &\rightarrow a^2\operatorname{Var}[X] \\
\sigma_X &\rightarrow |a|\sigma_X
\end{aligned}
$$

## Combining Random Variables

The variance and standard-deviation rules below require $X$ and $Y$ to be independent. The rules for the mean hold for any $X$ and $Y$, although the proofs given here use independence to factor the joint probabilities.

$$E[X+Y] = E[X] + E[Y]$$

$$
\begin{aligned}
\textit{Proof. }\quad E[X+Y] &= \sum_i\sum_j(x_i + y_j) p_{xi} p_{yj} && (\text{since }X\text{ and }Y\text{ are independent}) \\
&= \sum_i\sum_j x_ip_{xi} p_{yj} + \sum_i\sum_j y_j p_{xi} p_{yj} \\
&= \sum_ix_ip_{xi}\sum_j p_{yj} + \sum_j y_j p_{yj} \sum_ip_{xi} \\
&= \sum_ix_ip_{xi} + \sum_j y_j p_{yj} = E[X] + E[Y]
\end{aligned}
$$

$$\operatorname{Var}[X+Y] = \operatorname{Var}[X] + \operatorname{Var}[Y]$$

$$
\begin{aligned}
\textit{Proof. }\quad \operatorname{Var}[X+Y] &= E[(X+Y)^2] - E^2[X+Y] \\
&= E[X^2 + 2XY + Y^2] - (E[X] + E[Y])^2 \\
&= E[X^2] + 2E[XY] + E[Y^2] - E^2[X] - 2E[X]E[Y] - E^2[Y] \\
&= (E[X^2] - E^2[X]) + (E[Y^2] - E^2[Y]) + 2E[XY] - 2E[X]E[Y] \\
&= \operatorname{Var}[X] + \operatorname{Var}[Y] + 2E[XY] - 2E[X]E[Y] \\
&= \operatorname{Var}[X] + \operatorname{Var}[Y] + 2\sum_i\sum_j(x_iy_j)p_{xi}p_{yj} - 2\sum_ix_ip_{xi} \sum_jy_jp_{yj} && (\text{since }X\text{ and }Y\text{ are independent}) \\
&= \operatorname{Var}[X] + \operatorname{Var}[Y]
\end{aligned}
$$

$$\sigma_{X+Y} = \sqrt {\sigma_X^2 + \sigma_Y^2}$$

$$
\begin{aligned}
\textit{Proof. }\quad \sigma_{X+Y} &= \sqrt{\operatorname{Var}[X+Y]} \\
&= \sqrt{\operatorname{Var}[X] + \operatorname{Var}[Y]} \\
&= \sqrt{\sigma_X^2 + \sigma_Y^2}
\end{aligned}
$$

$$E[X-Y] = E[X] - E[Y]$$

$$
\begin{aligned}
\textit{Proof. }\quad E[X-Y] &= \sum_i\sum_j(x_i - y_j) p_{xi} p_{yj} && (\text{since }X\text{ and }Y\text{ are independent}) \\
&= \sum_i\sum_j x_ip_{xi} p_{yj} - \sum_i\sum_j y_j p_{xi} p_{yj} \\
&= \sum_ix_ip_{xi}\sum_j p_{yj} - \sum_j y_j p_{yj} \sum_ip_{xi} \\
&= \sum_ix_ip_{xi} - \sum_j y_j p_{yj} = E[X] - E[Y]
\end{aligned}
$$

$$\operatorname{Var}[X-Y] = \operatorname{Var}[X] + \operatorname{Var}[Y]$$

$$
\begin{aligned}
\textit{Proof. }\quad \operatorname{Var}[X-Y] &= \operatorname{Var}[X] + \operatorname{Var}[-Y] \\
&= \operatorname{Var}[X] + (-1)^2\operatorname{Var}[Y] \\
&= \operatorname{Var}[X] + \operatorname{Var}[Y]
\end{aligned}
$$

$$\sigma_{X-Y} = \sqrt {\sigma_X^2 + \sigma_Y^2}$$

$$
\begin{aligned}
\textit{Proof. }\quad \sigma_{X-Y} &= \sqrt{\operatorname{Var}[X-Y]} \\
&= \sqrt{\operatorname{Var}[X] + \operatorname{Var}[Y]} \\
&= \sqrt{\sigma_X^2 + \sigma_Y^2}
\end{aligned}
$$

## Binomial Settings

Definition: A binomial setting arises when we perform several independent trials of the same chance process and record the number of times that a particular outcome occurs. The four conditions for a binomial setting are:

• Binary: The possible outcomes of each trial can be classified as “success” or “failure”.
• Independent: Trials must be independent; that is, knowing the result of one trial must not have any effect on the result of any other trial.
• Number: The number of trials $n$ of the chance process must be fixed in advance.
• Success: On each trial, the probability $p$ of success must be the same.

In short:

• There are only two outcomes, success or failure.
• The trials are independent.
• The number of trials is fixed.
• The probability of success is the same on every trial.

## Binomial Random Variable

Definition: The count $X$ of successes in a binomial setting is a binomial random variable.

## Binomial Distribution

Definition: The probability distribution of $X$ (a binomial random variable) is a binomial distribution with parameters $n$ and $p$.

## Binomial Probability

$X\sim B(n,p)$

$$P(X=k) = \binom{n}{k}p^k(1-p)^{n-k }$$

## Binomial Distribution on the Calculator

menu -> Statistics -> Distributions -> Binomial PDF -> computes $P(X=k)$

menu -> Statistics -> Distributions -> Binomial CDF -> computes $P(a \leq X \leq b)$

## Mean and Standard Deviation of a Binomial Distribution

If $X \sim B(n,p)$, write $X = X_1 + X_2 + \cdots + X_n$, where $X_i$ equals $1$ if trial $i$ is a success and $0$ otherwise, so that $E[X_i] = p$. Then

$$\mu_x = E[X] = np$$

$$E[X] = E[X_1 + X_2 + \cdots + X_n] = E[X_1] + E[X_2] + \cdots + E[X_n] = np$$

$$\operatorname{Var}[X] = np(1-p)$$

$$
\begin{aligned}
\operatorname{Var}[X] &= \operatorname{Var}[X_1 + X_2 + \cdots + X_n] \\
&= \operatorname{Var}[X_1] + \operatorname{Var}[X_2] + \cdots + \operatorname{Var}[X_n] \\
&= n\operatorname{Var}[X_i] \\
&= n\big((0 - E[X_i]) ^ 2 (1-p) + (1 - E[X_i])^2p\big) \\
&= n(p^2 (1 - p) + (1 - p) ^ 2 p) \\
&= np(1-p)
\end{aligned}
$$

$$\sigma_X = \sqrt {\operatorname{Var}[X]} = \sqrt{np(1 - p)}$$
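The closed forms $np$ and $np(1-p)$ can be compared with the mean and variance computed directly from the binomial probabilities. A minimal sketch with assumed parameters $n = 10$, $p = 0.3$:

```python
from math import comb, sqrt

n, p = 10, 0.3  # assumed example parameters

# Binomial probabilities P(X = k) for k = 0..n.
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
sd = sqrt(var)

print(mean, n * p)            # both approximately 3.0
print(var, n * p * (1 - p))   # both approximately 2.1
```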

## Geometric Settings

Definition: A geometric setting arises when we perform independent trials of the same chance process and record the number of trials until a particular outcome occurs. The four conditions for a geometric setting are:

• Binary: The possible outcomes of each trial can be classified as “success” or “failure”.
• Independent: Trials must be independent; that is, knowing the result of one trial must not have any effect on the result of any other trial.
• Trials: The goal is to count the number of trials until the first success occurs.
• Success: On each trial, the probability $p$ of success must be the same.

In short:

• There are only two outcomes, success or failure.
• The trials are independent.
• The goal is to count the number of trials needed to get the first success.
• The probability of success is the same on every trial.

## Geometric Random Variable

Definition: The number of trials $Y$ that it takes to get a success in a geometric setting is a geometric random variable.

## Geometric Distribution

Definition: The probability distribution of $Y$ (which is a geometric random variable) is a geometric distribution with parameter $p$, the probability of success on any trial.

## Geometric Probability

$Y \sim G(p)$

$$P(Y = k) = p(1 - p)^{k - 1}$$

## Mean and Standard Deviation of a Geometric Distribution

$$E[Y] = \mu_Y = \frac {1}{p}$$

$$\sigma_Y = \frac {\sqrt {1 - p}} p$$
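These two formulas can be checked by summing the geometric probabilities $P(Y = k) = p(1-p)^{k-1}$ directly. The sketch below assumes $p = 0.2$ and truncates the infinite sum at $k = 2000$, where the remaining tail is negligible.

```python
import math

def geometric_pmf(k, p):
    """P(Y = k): first success occurs on trial k."""
    return p * (1 - p) ** (k - 1)

p = 0.2               # assumed example parameter
ks = range(1, 2001)   # truncation point; the tail beyond it is negligible

mean = sum(k * geometric_pmf(k, p) for k in ks)
var = sum((k - mean) ** 2 * geometric_pmf(k, p) for k in ks)
sd = math.sqrt(var)

print(mean, 1 / p)                # both approximately 5.0
print(sd, math.sqrt(1 - p) / p)   # both approximately 4.472
```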

## Geometric Distribution on the Calculator

menu -> Statistics -> Distributions -> Geometric PDF -> computes $P(Y=k)$

menu -> Statistics -> Distributions -> Geometric CDF -> computes $P(a \leq Y \leq b)$

## Template for a Written Answer

Let the random variable $Y$ denote …; it follows … with ….

# Continuous Random Variable

## Density Curve

A density curve:

• is always on or above the horizontal axis:
$$f(x) \geq 0$$
• has area exactly 1 underneath it:
$$\int_{-\infty}^{\infty} f(t)\mathrm dt = 1$$

$$P(a < X < b) = P(a \leq X < b) = P(a < X \leq b) = P(a \leq X \leq b)$$

(Extension) Cumulative distribution function:

$$F(x) = \int_{-\infty} ^ {x} f(t) \mathrm dt$$

$$f(x) = F'(x)$$

$$P(a < X \leq b) = F(b) - F(a) = \int_a^b f(x)\mathrm dx$$

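## Normal Distribution

• Many naturally occurring random quantities are approximately Normally distributed.
• A random variable $X$ that follows a Normal distribution with mean $\mu$ and standard deviation $\sigma$ is written
$$X \sim \mathcal N(\mu, \sigma)$$

## Shape of the Normal Distribution

• Unimodal
• Symmetric
• Bell-shaped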

## The Empirical Rule

• About $68\%$ of the data lie within $\mu \pm \sigma$
• About $95.4\%$ of the data lie within $\mu \pm 2\sigma$
• About $99.7\%$ of the data lie within $\mu \pm 3\sigma$

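## Standard Normal Distribution

• Definition: The standard Normal distribution is the Normal distribution with mean $0$ and standard deviation $1$.
• It can be viewed as the distribution of the $z$-scores of a Normally distributed variable.
• If the original data are Normally distributed, the $z$-score $z = \displaystyle\frac {x - \mu}{\sigma}$ is only a linear transformation of the data, so it still follows a Normal distribution.
• A $z$-score measures how many standard deviations a value lies from the mean, so the $z$-scores have mean $0$ and standard deviation $1$ (this follows from the linear-transformation rules, and it also makes intuitive sense).

## Standard Normal Table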

$$P(X < x_i) = P(z < z_i)$$
$$P(X > x_i) = 1 - P(X < x_i) = 1 - P(z < z_i)$$
$$P(x_1 < X < x_2) = P(X < x_2) - P(X < x_1)$$
$$\text{Inverse: }z_i = \frac {x_i - \mu}{\sigma} \Rightarrow x_i = z_i\sigma + \mu$$

## Normal Distribution Calculation Example

Women’s heights are $\mathcal N(64, 2.5)$. What percentage of women are shorter than $62$?

$\textit{ Sol. }$ Let the random variable $X$ denote a woman’s height. $X$ follows a Normal distribution, $X \sim \mathcal N(64, 2.5)$. The proportion of women’s heights under $62$ is the area under the density curve to the left of $62$. (A sketch of the curve with this area shaded should accompany the answer.)
$$P(X < 62) = P\Big(z < \frac {62 - 64} {2.5}\Big) = P(z < -0.8) \approx 21.19 \%$$
Approximately $21.19\%$ of women are shorter than $62$.
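The same calculation can be reproduced in software instead of a table. A minimal sketch using `statistics.NormalDist` from the Python standard library; the variable names are chosen for this example.

```python
from statistics import NormalDist

heights = NormalDist(mu=64, sigma=2.5)

# Direct route: P(X < 62) from the N(64, 2.5) distribution.
p_direct = heights.cdf(62)

# Standardised route: convert to a z-score, then use the standard Normal.
z = (62 - 64) / 2.5
p_via_z = NormalDist().cdf(z)

print(round(z, 2), round(p_direct, 4), round(p_via_z, 4))

# Inverse direction: the height below which 21.19% of women fall.
cutoff = heights.inv_cdf(p_direct)
```

Both routes give the same probability, which is the point of standardising.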

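## Normal Distribution on the Calculator

menu -> Statistics -> Distributions -> Normal CDF

• Lower Bound: the lower limit $a$
• Upper Bound: the upper limit $b$
• $\mu$: the mean of the distribution
• $\sigma$: the standard deviation of the distribution
• Result: $P(a < X < b)$

menu -> Statistics -> Distributions -> Inverse Normal

• Area: $P(X < k)$
• $\mu$: the mean of the distribution
• $\sigma$: the standard deviation of the distribution
• Result: $k$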

# Sampling Distributions

## Population VS. Sample

$$
\begin{array}{lcc}
 & \text{Population} & \text{Sample} \\
\text{mean} & \mu & \bar x \\
\text{variance} & \sigma ^ 2 & s^2 \\
\text{standard deviation} & \sigma & s \\
\text{proportion} & p & \hat p
\end{array}
$$

Definition: The fact that the value of a statistic varies in repeated random sampling is called sampling variability.

## Parameter VS. Statistic

Definition:

• Parameter: (unknown) number that describes a characteristic of a population.
• Statistic: (obtained) number that describes a characteristic of a sample drawn from the population.

A parameter describes the population; a statistic describes the sample.

## Sampling Distribution

Definition: The Sampling Distribution of a statistic is the distribution of values taken by the statistic in all possible samples of the same size from the same population.

## Sample Distribution, Population Distribution, Sampling Distribution

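• Sample Distribution: the distribution of the data obtained in one single sample.
• Population Distribution: the distribution of the whole population.
• Sampling Distribution: the distribution formed by collecting the statistic obtained from every sample (it is a distribution of a statistic).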

## Biased and unbiased estimators

Definition: a statistic used to estimate a parameter is an unbiased estimator if the mean of sampling statistic equals the true value of population parameter.

$$E[\bar x] = \frac {\bar x_1 + \bar x_2 + \cdots + \bar x_n} n = \mu_x$$
$$E[\hat p] = \frac {\hat p_1 + \hat p_2 + \cdots + \hat p_n} n = p$$

## Variability of a Statistic

Definition: The variability of a statistic is described by the spread of its sampling distribution. This spread is determined primarily by the size of the random sample. Larger samples give smaller spread. Variability of sampling statistics should be smaller.

• unbiased
• small variability

## The Central Limit Theorem

Draw an SRS of size $n$ from any population with mean $\mu$ and finite standard deviation $\sigma$. The CLT says that when $n$ is large ($n \geq 30$), the sampling distribution of the sample mean $\bar x$ is approximately Normal.

CLT: As the size of each sample grows, the distribution of the sample mean moves from an irregular shape toward a Normal distribution.

## Sampling Distribution of a Sample Mean

1. The population follows a Normal distribution: $X \sim \mathcal N(\mu, \sigma) \Rightarrow \bar x \sim\mathcal N(\mu, \displaystyle\frac {\sigma}{\sqrt n})$
$$
\begin{aligned}
\textit{Proof. }\quad \mu_{\bar x} &= E\Big[\frac{x_1 + x_2 + \cdots + x_n}{n}\Big] \\
&= \frac {1}{n} E[x_1 + x_2 + \cdots + x_n] \\
&= \frac 1 n (E[x_1] + E[x_2] + \cdots + E[x_n]) \\
&= \frac 1 n n \mu = \mu \\
\operatorname{Var}[\bar x] &= \operatorname{Var}\Big [ \frac {x_1 + x_2 + \cdots + x_n}{ n} \Big] \\
&= \frac 1 {n ^2} [\operatorname{Var}[x_1] + \operatorname{Var}[x_2] + \cdots + \operatorname{Var}[x_n]] \\
&= \frac {1 }{n^2} [\sigma^2 + \sigma^2 + \cdots + \sigma ^ 2] = \frac {\sigma^2} n \\
\sigma_{\bar x} &= \sqrt {\operatorname{Var}[\bar x]} = \frac \sigma {\sqrt n}
\end{aligned}
$$
2. The population distribution is skewed or unknown: with $n \geq 30$, $\bar x$ is approximately $\mathcal N(\mu, \displaystyle\frac {\sigma}{\sqrt n})$ (according to the CLT).
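The second case can be illustrated by simulation. The sketch below assumes a strongly right-skewed population (exponential with mean $1$ and standard deviation $1$), draws many samples of size $n = 30$, and compares the mean and spread of the sample means with $\mu$ and $\sigma / \sqrt n$. The population, the seed, and the number of samples are all choices made for this example.

```python
import math
import random
import statistics

def sample_means(num_samples, n, seed=0):
    """Means of num_samples random samples of size n from an Exp(1) population."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    means = []
    for _ in range(num_samples):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

n = 30
means = sample_means(5000, n)

center = statistics.fmean(means)   # should be close to mu = 1
spread = statistics.stdev(means)   # should be close to sigma / sqrt(n)
print(round(center, 3), round(spread, 3), round(1 / math.sqrt(n), 3))
```

A histogram of `means` would look roughly bell-shaped even though the population itself is far from Normal.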

## The Sampling Distribution of a Difference Between Two Means

$$\bar {x_1} - \bar {x_2} \sim \mathcal N(\mu_1 - \mu_2, \sqrt {\frac {\sigma_1^2}{n_1} + \frac {\sigma_2^2}{n_2}})$$

## Sampling Distribution of $\hat p$

• Large Counts condition: $np \geq 10$ and $n(1-p)\geq 10$ (ensures that the sampling distribution is approximately Normal)
• $10\%$ condition: $n \leq 10\% N$ (ensures that the count of successes is approximately binomial)

$$X \sim \mathcal B(n, p) \Rightarrow \hat p \sim \mathcal N(\mu_{\hat p} = p, \sigma_{\hat p} = \sqrt {\frac{p(1 - p)}{n}})$$

$$E[\hat p] = E\Big [ \frac X n \Big ] = \frac 1 n E[X] = \frac 1 n np = p$$

$$\operatorname{Var}[\hat p] = \operatorname{Var}\Big [ \frac X n \Big] = \frac {1} {n^2} \operatorname{Var}[X] = \frac {1}{n^2} np(1-p) = \frac{p(1-p)}{n}$$

$$\sigma_{\hat p} = \sqrt {\operatorname{Var}[\hat p]} = \sqrt {\frac{p(1-p)}{n}}$$

## The Sampling Distribution of a Difference Between Two Proportions

If

$$n_1 p_1 \geq 10, ~n_1(1 - p_1) \geq 10$$
$$n_2 p_2 \geq 10, ~n_2(1 - p_2) \geq 10$$
$$n_1 \leq 10\% N_1, ~n_2 \leq 10\% N_2$$
Then
$$\hat p_1 - \hat p_2 \sim \mathcal N(p_1 - p_2, \sqrt {\frac {p_1(1 - p_1)}{n_1} + \frac {p_2(1 - p_2)}{n_2}})$$

## Point Estimator

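• Unbiased: the mean of the statistic's sampling distribution equals the true value of the parameter.
• Low Variability: a better estimator has a smaller variance (which can be achieved by increasing the sample size).

• Mention both Unbiased and Low Variability in an answer.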

# Confidence Interval / Interval Estimation

$$\text {estimate} \pm \text {margin of error}$$

The size of the $\text {margin of error}$ is a trade-off.

Definition: To interpret a $C\%$ confidence interval for an unknown parameter, say, “We are $C\%$ confident that in the long run the interval from $\text {estimate} - \text {margin of error}$ to $\text {estimate} + \text {margin of error}$ would succeed in $\color{red}{capturing}$ the actual value of the [population parameter in context].”

If the estimator is unbiased, then $E[\hat x] = x$

• The confidence level is not a probability.
• The subject of the sentence is the interval, and the verb is “capture”.
• What is estimated is the population parameter, not the sample statistic.

## Critical Value

$$\text{confidence interval} = \text {sample statistic} \pm z^* \times \text{standard deviation of statistic}$$

$$\text{confidence interval of mean} = \text {sample statistic} \pm z^* \times \sigma_{\bar x} = \text {sample statistic} \pm z^* \times \frac {\sigma}{\sqrt n}$$

• $\text {standard error} = \text{standard deviation of statistic}$
• $\text {margin of error} = z^* \times \text{standard deviation of statistic}$

$$\text{confidence interval} = \text {sample statistic} \pm \text {critical value} \times \text{standard error}$$

To reduce the margin of error:

• Lower the confidence level: $C\% \downarrow$
• Increase the sample size: $n \uparrow$

$$-z^* < \frac {x - \mu}{\sigma_x} < z^*$$

$$x - \sigma_x z^* < \mu < x + \sigma_x z^*$$

## $t$ Distribution

$$t = \frac {x - \mu}{s_x}$$

$$s = \sqrt{\frac {\sum _{i = 1} ^ n (x_i - \bar x)^2}{n - 1}}$$

## Properties of the $t$ Distribution

• It is symmetric with a single peak at $0$.
• It has more area in the tails than the standard Normal distribution.
• Its shape is similar to that of the Normal distribution, but with a greater spread and a lower peak.

When $n \rightarrow \infty$, $t \rightarrow z$.

## Confidence Intervals Using the Sample Standard Deviation

$$\text {critical value }t^* = t_{\frac{\alpha}{2},\, n - 1}$$

$$-t^* < \frac {x - \mu}{s_x} < t^*$$

$$x - s_x t^* < \mu < x + s_x t^*$$

$$\text{confidence interval} = \text {sample statistic} \pm t^* \times \text{standard error of statistic}$$

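## Finding $t^*$ on the Calculator

menu -> Statistics -> Distributions -> Inverse t

Area is the cumulative area from the left. To find the critical value, enter $\displaystyle\frac{\alpha}{2}$ and take the absolute value of the result. For example, use $0.05$ for a $90\%$ confidence interval and $0.025$ for a $95\%$ confidence interval.
df ($\text{degree of freedom}$): the degrees of freedom

## Confidence Intervals on the Calculator

menu -> Statistics -> Confidence Intervals -> z Interval -> Stats

menu -> Statistics -> Confidence Intervals -> t Interval -> Stats

menu -> Statistics -> Confidence Intervals -> 2-Sample z Interval -> Stats

menu -> Statistics -> Confidence Intervals -> 2-Sample t Interval -> Stats

menu -> Statistics -> Confidence Intervals -> 1-Prop z Interval -> Stats

menu -> Statistics -> Confidence Intervals -> 2-Prop z Interval -> Stats

## Conditions for Confidence Intervals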

• Random: The data should come from a well-designed random sample or randomized experiment.
• Normal: The sampling distribution is exactly Normal if the population distribution is Normal. If the population distribution is not Normal, the Central Limit Theorem (CLT) tells us that the sampling distribution of the sample mean will be approximately Normal if $n$ is large ($n \geq 30$). (For proportions, $np \geq 10, n(1 - p) \geq 10$)
• Independent: Individual observations are independent if $n \leq \frac 1 {10} N$. This is called the $10\%$ condition.

## Confidence Interval for One Mean

If the population distribution is Normal or the sample size is large ($n \geq 30$), with $n \leq 10\% N$,

$$\bar x \sim \mathcal N(\mu, \frac{\sigma}{\sqrt n})$$

When $\sigma$ is known

$$\text{Confidence Interval: } \bar x \pm z^*\frac{\sigma}{\sqrt n}$$

When $\sigma$ is unknown, estimate it with the sample standard deviation $s$

$$\text{Confidence Interval: } \bar x \pm t^*_{n - 1}\frac{s}{\sqrt n}$$
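For the known-$\sigma$ case, the interval can be computed with the standard library alone. A minimal sketch: the function name `z_interval` and the numbers in the example call ($\bar x = 50$, $\sigma = 10$, $n = 25$) are hypothetical. The $t$ case is not shown because the Python standard library has no $t$ distribution; $t^*$ would come from a table or the calculator.

```python
import math
from statistics import NormalDist

def z_interval(sample_mean, sigma, n, confidence):
    """One-sample z interval for a mean when sigma is known."""
    # z* leaves (1 - confidence) / 2 of the area in each tail.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# Hypothetical data: x-bar = 50, sigma = 10, n = 25, 95% confidence.
low, high = z_interval(50.0, 10.0, 25, 0.95)
print(round(low, 2), round(high, 2))
```

Raising the confidence level or shrinking $n$ widens the interval, which is the trade-off in the margin of error.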

## Confidence Interval for the Difference Between Two Means

If both population distributions are Normal or both sample sizes are large ($n_1 \geq 30, n_2 \geq 30$), with $n_1 \leq 10\% N_1, n_2 \leq 10\% N_2$,

$$\bar {x_1} - \bar {x_2} \sim \mathcal N(\mu_1 - \mu_2, \sqrt {\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}})$$

When $\sigma_1$ and $\sigma_2$ are known

$$\text {Confidence Interval: }(\bar{x_1} - \bar{x_2}) \pm z^*\sqrt {\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

When $\sigma_1$ and $\sigma_2$ are unknown, estimate them with the sample standard deviations $s_1$ and $s_2$

$$\text {Confidence Interval: } (\bar{x_1} - \bar{x_2}) \pm t^*_{df}\sqrt {\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

in which

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{1}{n_1 - 1}\left(\frac{s_1^2}{n_1}\right)^2 + \frac{1}{n_2 - 1}\left(\frac{s_2^2}{n_2}\right)^2}$$

## Example

FRQ One of the two fire stations in a certain town responds to calls in the northern half of the town, and the other fire station responds to calls in the southern half of the town. One of the town council members believes that the two fire stations have different mean response times. Response time is measured by the difference between the time an emergency call comes into the fire station and the time the first fire truck arrives at the scene of the fire. Data were collected to investigate whether the council member’s belief is correct. A random sample of $50$ calls selected from the northern fire station had a mean response time of $4.3$ minutes with a standard deviation of $3.7$ minutes. A random sample of $50$ calls selected from the southern fire station had a mean response time of $5.3$ minutes with a standard deviation of $3.2$ minutes. Construct and interpret a $95$ percent confidence interval for the difference in mean response times between the two fire stations.

Step 1: State The two-sample $t$ interval for $\mu_N - \mu_S$, the difference in population mean response times, is $(\bar x_N - \bar x_S) \pm t^* \sqrt {\displaystyle\frac{s_N^2}{n_N} + \displaystyle\frac{s_S^2}{n_S}}$, where $\mu_N$ denotes the mean response time for calls from the northern fire station and $\mu_S$ denotes the mean response time for calls from the southern fire station.

Step 2: Plan Conditions

• Random: A random sample of 50 calls was selected from the northern fire station, and a random sample of 50 calls was selected from the southern station.
• Independent: The calls of the northern station are independent of the calls of the southern station.
• Normal (write this condition out in a little more detail): The use of the two-sample $t$ interval is reasonable because both sample sizes are large ($n_N = 50 > 30$ and $n_S = 50 > 30$), and by the Central Limit Theorem, the sampling distributions for the two sample means are approximately Normal. Therefore the sampling distribution of the difference of the sample means $\bar x_N - \bar x_S$ is approximately Normal.

Step 3: Do

$\text{Degrees of freedom}= 96$
$$
\begin{aligned}
&(4.3 - 5.3) \pm 1.985 \sqrt {\frac{3.7^2}{50} + \frac{3.2^2}{50}} \\
&= -1.0 \pm 1.985 \times 0.6918 \\
&= (-2.37,\ 0.37)
\end{aligned}
$$

Step 4: Interpretation Based on these samples, one can be $95$ percent confident that the difference in the population mean response times (northern - southern) is between $-2.37$ minutes and $0.37$ minutes.
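The arithmetic in Step 3 can be reproduced as follows. This sketch computes the degrees of freedom from the two-sample (Welch) formula and the standard error from the summary statistics in the problem. The critical value $t^* = 1.985$ is taken from the worked solution rather than computed, since the Python standard library has no inverse $t$ function.

```python
import math

# Summary statistics from the problem statement.
mean_n, sd_n, n_n = 4.3, 3.7, 50   # northern station
mean_s, sd_s, n_s = 5.3, 3.2, 50   # southern station

v_n = sd_n**2 / n_n
v_s = sd_s**2 / n_s

# Degrees of freedom for the two-sample t interval.
df = (v_n + v_s) ** 2 / (v_n**2 / (n_n - 1) + v_s**2 / (n_s - 1))

standard_error = math.sqrt(v_n + v_s)
t_star = 1.985  # taken from the worked solution above (df of about 96, 95% confidence)

diff = mean_n - mean_s
lower = diff - t_star * standard_error
upper = diff + t_star * standard_error
print(round(df, 1), round(standard_error, 4), round(lower, 2), round(upper, 2))
```

Because the interval contains $0$, these data do not give convincing evidence of a difference in mean response times.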

## The One-Sample $t$ Interval for Matched Pairs

$$\text{Confidence Interval: } \bar {x_d} \pm t^*_{n - 1} \frac{s_d}{\sqrt n}$$

## Confidence Interval for One Proportion

Use the Normal approximation as long as $n\hat p \geq 10$ and $n(1 - \hat p) \geq 10$, and the observations are independent ($n \leq 10\% N$):

$$\hat p \sim \mathcal N\Big(p, \sqrt {\frac{p(1 - p)}{n}}\Big)$$

$$\text{Confidence Interval: }\hat p \pm z^* \sqrt {\frac {\hat p(1 - \hat p)}{n}}$$

## Confidence Interval for the Difference Between Two Proportions

Use the Normal approximation as long as $n_1\hat p_1 \geq 10, n_1(1 - \hat p_1) \geq 10, n_2\hat p_2 \geq 10, n_2(1 - \hat p_2) \geq 10$, and the observations are independent ($n_1 \leq 10\% N_1, n_2 \leq 10\% N_2$):

$$\hat p_1 - \hat p_2 \sim \mathcal N(p_1 - p_2, \sqrt {\frac {p_1(1 - p_1)}{n_1} + \frac {p_2(1 - p_2)}{n_2}})$$

$$\text {Confidence Interval: } (\hat p_1 - \hat p_2) \pm z^* \sqrt {\frac {\hat p_1(1 - \hat p_1)}{n_1} + \frac{\hat p_2(1 - \hat p_2)}{n_2}}$$

# Hypothesis Test

## Introduction

The statistical idea behind hypothesis testing: falsifiability.

## Significance Test

Definition: A significance test is a formal procedure for comparing observed data with a claim whose truth we want to assess. This claim is a statement about a parameter, like the population proportion $p$ or the population mean $\mu$. We express the results of a significance test in terms of a probability that measures how well the data and the claim agree.

## Test Statistic

Definition: A test statistic measures how far a sample statistic diverges from what we would expect if the null hypothesis $H_0$ were true, in standardized units. That is

$$\text{test statistic} = \frac{\text{statistic} - \text{parameter}}{\text{standard deviation of statistic}}$$

## Significance Level $\alpha$

The significance level ($\alpha$) defined for a study is the probability that the study rejects the null hypothesis, given that the null hypothesis is true.

## Stating Hypothesis (left-tailed test)

When

$$H_0 : \text {parameter = claimed value}$$

$$H_a : \text {parameter} < \text {claimed value}$$

If

$$\text {test statistic} < -z_{\alpha}$$

then $H_0$ can be rejected.

## Stating Hypothesis (right-tailed test)

When

$$H_0 : \text {parameter = claimed value}$$

$$H_a : \text {parameter} > \text {claimed value}$$

If

$$\text {test statistic} > z_\alpha$$

then $H_0$ can be rejected.

## Stating Hypothesis (two-tailed test)

When

$$H_0 : \text {parameter = claimed value}$$

$$H_a : \text {parameter} \neq \text {claimed value}$$

If

$$|\text {test statistic}| > z_{\frac {\alpha}{2}}$$

then $H_0$ can be rejected.

## $P$-Values

Comparing $t$ and $z$ statistics with critical values is inconvenient and not very intuitive, which is why $P$-values are introduced.

Definition: The $P$-value is the probability of observing a test statistic value at least as extreme as the one computed from the sample, if $H_0$ were true.

### Computation: use the CDF of the $t$ or $z$ distribution

• left-tailed: $P(z < \text {test statistic})$ or $P(t < \text {test statistic})$
• right-tailed: $P(z > \text {test statistic})$ or $P(t > \text {test statistic})$
• two-tailed: $P(|z| > |\text {test statistic}|)$ or $P(|t| > |\text {test statistic}|)$

### Conclusions from the $P$-Value

$$P\text{-value} < \alpha \rightarrow \text {reject }H_0 \rightarrow \text {conclude }H_a \text {(in context)}$$
$$P\text{-value} \geq \alpha \rightarrow \text {fail to reject }H_0 \rightarrow \text {cannot conclude }H_a \text {(in context)}$$
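For a $z$ statistic, the three tail computations and the decision rule can be sketched with the standard library. The function names `z_p_value` and `decide` and the example statistics are assumptions of this sketch; a $t$ statistic would need a $t$ distribution CDF, which the standard library does not provide.

```python
from statistics import NormalDist

def z_p_value(z, tail):
    """P-value of a z test statistic for a 'left', 'right' or 'two' tailed test."""
    cdf = NormalDist().cdf
    if tail == "left":
        return cdf(z)
    if tail == "right":
        return 1 - cdf(z)
    if tail == "two":
        return 2 * (1 - cdf(abs(z)))
    raise ValueError("tail must be 'left', 'right' or 'two'")

def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 exactly when P-value < alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# Hypothetical test statistics.
print(round(z_p_value(-1.5, "left"), 4), decide(z_p_value(-1.5, "left")))
print(round(z_p_value(2.5, "two"), 4), decide(z_p_value(2.5, "two")))
```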

## Type I and Type II Errors

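### Type I Error

Definition: $H_0$ is true, but we reject $H_0$.

### Type II Error

Definition: $H_0$ is false, but we fail to reject $H_0$.

### Power - $1 - \beta$

Definition: Power is the probability of correctly rejecting $H_0$ when $H_0$ is false. It equals $1 - \beta$, where $\beta$ is the probability of a Type II error.

Ways to increase power:

• Increase $\alpha$
• Increase the sample size

## Conditions for Hypothesis Tests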

• Random: The data should come from a well-designed random sample or randomized experiment.
• Independent: Individual observations are independent if $n \leq \displaystyle\frac{1}{10}N$.
• Normal: the sampling distribution is (approximately) Normal in any of the following cases:
• The population distribution is Normal.
• By the Central Limit Theorem, when $n \geq 30$.
• For proportions (Normal approximation to the binomial distribution), when $np \geq 10$ and $n(1 - p) \geq 10$.
• If the problem does not state the distribution and the sample is small:
1. Draw a dot plot.
2. State: “Since the graph reveals no obvious skewness or outliers, we assume that the distribution is approximately Normal.”

## The One-Sample $z$ Test

When conditions are met, we can test a claim about a population mean $\mu$ using a one-sample $z$ test.

one-sample $z$ statistic:

$$z = \frac {\bar x - \mu_0}{\frac {\sigma}{\sqrt n}}$$

## The One-Sample $t$ Test

When conditions are met, we can test a claim about a population mean $\mu$ using a one-sample $t$ test.

one-sample $t$ statistic:

$$t = \frac {\bar x - \mu_0}{\displaystyle\frac {s_x}{\sqrt n}}$$

$df = n - 1$

## Two-Sample $z$ Test for The Difference Between Two Means

Suppose the Random, Normal and Independent conditions are met. To test the hypothesis $H_0: \mu_1 - \mu_2 = \text {hypothesis value}$, compute the $z$ statistic

$$z = \frac {(\bar x_1 - \bar x_2) - (\mu_1 - \mu_2)}{\displaystyle\sqrt {\frac{\sigma_1 ^2}{n_1}+\frac{\sigma_2 ^2}{n_2}}}$$

## Two-Sample $t$ Test for The Difference Between Two Means

Suppose the Random, Normal and Independent conditions are met. To test the hypothesis $H_0: \mu_1 - \mu_2 = \text {hypothesis value}$, compute the $t$ statistic

$$t = \frac {(\bar x_1 - \bar x_2) - (\mu_1 - \mu_2)}{\displaystyle\sqrt {\frac{s_1 ^2}{n_1}+\frac{s_2 ^2}{n_2}}}$$

## The One-Sample $z$ Test for a Proportion

Choose an SRS of size $n$ from a large population that contains an unknown proportion $p$ of successes. To test the hypothesis $H_0 : p = p_0$, compute the $z$ statistic

$$z = \frac {\hat p - p_0}{\displaystyle\sqrt\frac{p_0(1 - p_0)}{n}}$$

## Two-Sample $z$ Test for the Difference Between Two Proportions

Suppose the Random, Normal, and Independent conditions are met. To test the hypothesis $H_0 : p_1 - p_2 = 0$, first find the pooled proportion $\hat p_C$ of successes in both samples combined

$$\hat p_C = \frac {\hat p_1 n_1 + \hat p_2 n_2}{n _ 1 + n _ 2}$$

Then compute the $z$ statistic

$$z = \frac {(\hat p_1 - \hat p_2) - 0}{\sqrt {\displaystyle\frac{\hat p_C(1 - \hat p_C)}{n_1} + \displaystyle\frac{\hat p_C(1 - \hat p_C)}{n_2}}}$$
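The pooled procedure can be sketched in a few lines. The function name `two_proportion_z` and the counts in the example call ($45$ successes out of $100$ versus $30$ out of $100$) are hypothetical; the two-sided $P$-value uses the standard Normal CDF.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample z statistic and two-sided P-value for H0: p1 - p2 = 0."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Pooled proportion: total successes over total sample size,
    # which equals (p1_hat * n1 + p2_hat * n2) / (n1 + n2).
    p_pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pooled * (1 - p_pooled) / n1 + p_pooled * (1 - p_pooled) / n2)
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 45 successes in 100 trials vs. 30 successes in 100 trials.
z, p_value = two_proportion_z(45, 100, 30, 100)
print(round(z, 3), round(p_value, 4))
```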

## Example

FRQ Investigators at the U.S. Department of Agriculture wished to compare methods of determining the level of $E$. coli bacteria contamination in beef. Two different methods (A and B) of determining the level of contamination were used on each of ten randomly selected specimens of a certain type of beef. The data obtained, in millimicrobes/liter of ground beef, for each of the methods are shown in the table below.

Is there a significant difference in the mean amount of $E$. coli bacteria detected by the two methods for this type of beef? Provide a statistical justification to support your answer.

Step 1: State

$$H_0 : \mu_d = 0$$

$$H_a : \mu_d \neq 0$$

where $\mu_d$ is the mean difference (method A - method B) in the level of E. coli bacteria contamination in beef detected by the two methods.

Thus, we use the paired $t$ test, where

$$t = \frac {\bar x_d - 0}{\displaystyle\frac{s_d}{\sqrt{n_d}}}$$

Step 2: Plan

Conditions:

• Since the observations are obtained on 10 randomly selected specimens, it is reasonable to assume that the 10 data pairs are independent of one another.
• The population distribution of differences is Normal. (Draw a plot of the differences and explain why Normality is reasonable.)

The computed differences are:
$$-0.3, 0.5, 0.3, 0.6, 0.8, 0.7, 1.2, 0.2, -0.1, -1.0$$

Step 3: Do

$$\bar x_d = 0.29$$

$$s_d = 0.629727$$

$$t = \frac{0.29}{0.629727 / \sqrt {10}} = 1.46$$

$$d.f. = 9$$

$$P-\text {value} = 0.179$$

Step 4: Interpretation

Since the $P$ -value is greater than $0.05,$ we cannot reject $H_{0} .$ We do NOT have statistically significant evidence to conclude that there is a difference in the mean amount of $E$. coli bacteria detected by the two methods for this type of beef. In other words, there does not appear to be a significant difference in these two methods for measuring the level of $E$. coli contamination in beef.
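The summary statistics in Step 3 can be reproduced from the listed differences. This sketch covers $\bar x_d$, $s_d$, and the $t$ statistic; the $P$-value is left to the calculator because the Python standard library has no $t$ distribution.

```python
import math
import statistics

# Differences (method A - method B) listed in Step 2.
differences = [-0.3, 0.5, 0.3, 0.6, 0.8, 0.7, 1.2, 0.2, -0.1, -1.0]

n_d = len(differences)
mean_d = statistics.mean(differences)
sd_d = statistics.stdev(differences)      # sample standard deviation (divides by n - 1)
t_stat = (mean_d - 0) / (sd_d / math.sqrt(n_d))
df = n_d - 1

print(round(mean_d, 2), round(sd_d, 6), round(t_stat, 2), df)
```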

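## Hypothesis Tests on the Calculator

menu -> Statistics -> Stat Tests -> z Test

menu -> Statistics -> Stat Tests -> t Test

menu -> Statistics -> Stat Tests -> 2-Sample z Test

menu -> Statistics -> Stat Tests -> 2-Sample t Test

menu -> Statistics -> Stat Tests -> 1-Prop z Test

menu -> Statistics -> Stat Tests -> 2-Prop z Test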

# Inference for Categorical Data: Chi-Square

## The Chi-Square Statistic

Definition: The Chi-Square Statistic is a measure of how far the observed counts are from the expected counts. The formula for the statistic is

$$\chi^2 = \sum \frac{(\text {observed} - \text {expected})^2}{\text {expected}}$$

## The Chi-Square Distributions and P-Values

The chi-square distributions are a family of distributions that take only positive values and are skewed to the right.

A particular chi-square distribution is specified by giving its degrees of freedom.

The chi-square goodness-of-fit test uses the chi-square distribution with degrees of freedom = $k - 1$, where $k$ is the number of categories.

## The Chi-Square Goodness-of-Fit Test

### Null Hypothesis & Alternative Hypothesis

$$H_0 : \text {The specified distribution of the categorical variable is correct.}$$

$$H_a : \text {The specified distribution of the categorical variable is not correct.}$$

### Conditions

• Random: The data come from a random sample or a randomized experiment
• Independent: Individual observations are independent if $n \leq \displaystyle\frac {1}{10}N$. This is called the $10\%$ condition.
• Large Sample Size: All expected counts are at least $5$.

### The Chi-Square GOF Test

Suppose the conditions are met. Start by finding the expected count for each category assuming that $H_0$ is true. Then calculate the chi-square statistic. The P-value is the area to the right of $\chi^2$ under the chi-square density curve with $k - 1$ degrees of freedom.

On the calculator:

• Observed List: a list, which can be written in TI-nspire as $\{n1, n2, n3, ...\}$
• Expected List: a list, in the same format
• Degree of Freedom: the degrees of freedom, $k - 1$

## The Chi-Square Test for Association / Independence

### Null Hypothesis & Alternative Hypothesis

$$H_0 : \text{There is no association between two categorical variables in the population of interest.}$$

$$H_a : \text{There is an association between two categorical variables in the population of interest.}$$

### Conditions

• Random: The data come from a random sample or a randomized experiment
• Large Sample Size: All expected counts are at least $5$.

### Calculations

$$\chi^2 = \sum \frac{(\text {observed} - \text {expected})^2}{\text {expected}}$$

in which the expected value is given by

$$P(row_i \cap col_j) = P(row_i) \times P(col_j) \text { since assumed independent}$$

$$\Rightarrow \text {total} \times \text {total} \times P(row_i \cap col_j) = [P(row_i) \times \text {total}] \times [P(col_j) \times \text {total} ]$$

$$\Rightarrow \text {expected}_{ij} = \frac {\text {row-total}_i \times \text {col-total}_j}{\text {total}}$$

The degrees of freedom are

$$\text {df} = (\#\text{rows} - 1)(\#\text{columns} - 1)$$
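The expected counts, the chi-square statistic, and the degrees of freedom for a two-way table can be computed as below. The $2 \times 2$ table of observed counts is a hypothetical example, and the function name `chi_square_independence` is an assumption of this sketch. The $P$-value is not computed because the Python standard library has no chi-square distribution.

```python
def chi_square_independence(observed):
    """Expected counts, chi-square statistic and df for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    # expected_ij = row-total_i * col-total_j / total
    expected = [[r * c / total for c in col_totals] for r in row_totals]

    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(observed, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return expected, chi2, df

# Hypothetical observed counts.
table = [[30, 20],
         [20, 30]]
expected, chi2, df = chi_square_independence(table)
print(expected, chi2, df)
```

All expected counts here are at least $5$, so the Large Sample Size condition would be met for this table.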

Matrix: enter the observed table as a matrix (menu -> Matrix & Vector -> Create -> Matrix)

# Inference for Quantitative Data: Slope

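The slope $b$ of the least-squares regression line is also a statistic, so it can be the subject of confidence interval (CI) and hypothesis test (HT) problems. The main formulas are listed below.

## Least-Squares Regression Line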

Line Predicted from the sample:

$$\hat y = a + bx$$

Line for the population:

$$\mu_y = \alpha + \beta x$$

## Sampling Distribution of a Slope

• The Mean of the sampling distribution of $b$ is $\mu_b = \beta$
• The Standard Deviation of the sampling distribution of $b$ is
$$ \sigma_b = \frac {\sigma}{\sigma_x \sqrt n}$$
• The Standard Error of $b$ is
$$ SE_b = \frac {s}{s_x \sqrt {n - 1}}$$
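Here $s$ is the standard deviation of the residuals and $s_x$ is the sample standard deviation of the $x$-values in the original data.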

## Confidence Interval

$$b \pm t^* SE_b$$

## Hypothesis Test

Test Statistic

$$t = \frac {b - \beta _ 0}{SE_b}$$
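
To see how these formulas fit together, here is a minimal Python sketch that computes $b$, $SE_b$ and the test statistic from raw data. The function name `slope_inference` and the five data points are hypothetical. The sketch assumes $s$ is computed from the residuals with $n - 2$ in the denominator, and that the $t$ distribution used for $t^*$ and the P-value has $n - 2$ degrees of freedom, which is the usual convention for slope inference.

```python
import math


def slope_inference(xs, ys, beta0=0.0):
    """Return (a, b, SE_b, t, df) for the least-squares line y-hat = a + b x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    # s: standard deviation of the residuals (n - 2 in the denominator).
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    # s_x: sample standard deviation of the x-values.
    s_x = math.sqrt(sxx / (n - 1))
    se_b = s / (s_x * math.sqrt(n - 1))
    t = (b - beta0) / se_b
    return a, b, se_b, t, n - 2


# Hypothetical sample of five (x, y) pairs.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, se_b, t, df = slope_inference(xs, ys)
print(a, b, se_b, t, df)
```

The confidence interval is then `b - t_star * se_b` to `b + t_star * se_b`, with `t_star` taken from the calculator's inverse $t$ function.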

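# Calculator Usage Summary

• Define a variable: define xxx = xxx
• Matrix: menu -> Matrix & Vector -> Create -> Matrix
• Statistics of a list: menu -> Statistics -> List Math -> the statistic you need -> enter the list name
  These calculations work column by column: for $n$ data values, create a matrix with $n$ rows and $1$ column.
• Combinations: menu -> Probability -> Combinations, e.g. nCr(5, 3) = 10.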
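• Binomial PDF: menu -> Statistics -> Distributions -> Binomial Pdf -> computes $P(X=k)$
• Binomial CDF: menu -> Statistics -> Distributions -> Binomial Cdf -> computes $P(a \leq X \leq b)$
• Geometric PDF: menu -> Statistics -> Distributions -> Geometric Pdf -> computes $P(Y=k)$
• Geometric CDF: menu -> Statistics -> Distributions -> Geometric Cdf -> computes $P(a \leq Y \leq b)$
• Normal probability: menu -> Statistics -> Distributions -> Normal Cdf
  • Lower Bound: the lower bound $a$
  • Upper Bound: the upper bound $b$
  • $\mu$: mean of the distribution
  • $\sigma$: standard deviation of the distribution
  • Result: $P(a < X < b)$
• Inverse normal: menu -> Statistics -> Distributions -> Inverse Normal
  • Area: $P(X < k)$
  • $\mu$: mean of the distribution
  • $\sigma$: standard deviation of the distribution
  • Result: $k$
• Inverse $t$: menu -> Statistics -> Distributions -> Inverse t
  • Area: the cumulative area to the left. For a critical value, enter $\displaystyle\frac{\alpha}{2}$: for a $90\%$ confidence interval that is $0.05$, and for a $95\%$ confidence interval it is $0.025$. The result is $-t^*$.
  • df: degrees of freedom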
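• menu -> Statistics -> Confidence Intervals -> z Interval -> Stats
• menu -> Statistics -> Confidence Intervals -> t Interval -> Stats
• menu -> Statistics -> Confidence Intervals -> 2-Sample z Interval -> Stats
• menu -> Statistics -> Confidence Intervals -> 2-Sample t Interval -> Stats
• menu -> Statistics -> Confidence Intervals -> 1-Prop z Interval -> Stats
• menu -> Statistics -> Confidence Intervals -> 2-Prop z Interval -> Stats
• menu -> Statistics -> Stat Tests -> z Test
• menu -> Statistics -> Stat Tests -> t Test
• menu -> Statistics -> Stat Tests -> 2-Sample z Test
• menu -> Statistics -> Stat Tests -> 2-Sample t Test
• menu -> Statistics -> Stat Tests -> 1-Prop z Test
• menu -> Statistics -> Stat Tests -> 2-Prop z Test
  In the proportion procedures, $x$ is the number of successes: $x = \hat pn$, where $\hat p$ is the sample proportion and $n$ is the sample size.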
• menu -> Statistics -> Stat Tests -> $\chi^2$ GOF
  • Observed List: a list of observed counts; on the TI-Nspire it can be entered as $\{n_1, n_2, n_3, \ldots\}$
  • Expected List: a list of expected counts, entered the same way
  • Degree of Freedom: $k - 1$
• menu -> Statistics -> Stat Tests -> $\chi^2$ 2-way Test
  • Matrix: create a matrix holding the two-way table (menu -> Matrix & Vector -> Create -> Matrix)
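
For checking calculator answers away from the calculator, several of the distribution functions above can be reproduced with the Python standard library. This is only a sketch: the helper names (`binom_pdf`, `geom_cdf`, and so on) are hypothetical and merely echo the menu entries, and the geometric helpers assume $Y$ counts the trial on which the first success occurs ($k \geq 1$).

```python
from math import comb
from statistics import NormalDist


def binom_pdf(n, p, k):
    """P(X = k) for X binomial with n trials and success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binom_cdf(n, p, lower, upper):
    """P(lower <= X <= upper) for a binomial X."""
    return sum(binom_pdf(n, p, k) for k in range(lower, upper + 1))


def geom_pdf(p, k):
    """P(Y = k): the first success happens on trial k."""
    return (1 - p) ** (k - 1) * p


def geom_cdf(p, lower, upper):
    """P(lower <= Y <= upper) for a geometric Y."""
    return sum(geom_pdf(p, k) for k in range(lower, upper + 1))


def normal_cdf(lower, upper, mu=0.0, sigma=1.0):
    """P(lower < X < upper) for X normal with mean mu and std dev sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)


def inverse_normal(area, mu=0.0, sigma=1.0):
    """The value k with P(X < k) = area."""
    return NormalDist(mu, sigma).inv_cdf(area)


print(comb(5, 3))                 # same calculation as nCr(5, 3)
print(binom_pdf(5, 0.7, 3))       # 3 successes in 5 trials with p = 0.7
print(normal_cdf(-1.96, 1.96))    # central area of the standard normal
print(inverse_normal(0.975))      # upper critical value for 95% confidence
```

The standard library has no $t$ or chi-square distribution, so the inverse $t$ and $\chi^2$ entries have no counterpart in this sketch.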

# AP 2020 Special

## Unit 1 - One-Variable Data

### Describing a Graph

• Shape
  • Symmetric Distribution
    • Median = Mean
  • Left-Skewed Distribution
    • Most of the data are on the right, with a tail to the left
    • Mean < Median
  • Right-Skewed Distribution
    • Most of the data are on the left, with a tail to the right
    • Mean > Median
  • Uniform Distribution
    • Very even: the graph looks like a horizontal line
  • Bimodal Distribution
    • Two peaks
• Outliers
  • Present / absent / where?
• Center
  • Median / Mean
• Spread
  • IQR / Range

### Drawing a Graph

• Always check for outliers
  • Upper fence: Q3 + 1.5 × IQR
  • Lower fence: Q1 - 1.5 × IQR
  • Alternative method: $\text {Mean} \pm 2 \text { Std Dev}$
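
Both outlier rules can be sketched in a few lines of Python. The helper names and the data set are hypothetical, and the quartiles are passed in by hand (here found as the medians of the lower and upper halves) because software packages compute quartiles in slightly different ways.

```python
from statistics import mean, stdev


def iqr_fences(q1, q3):
    """Lower and upper fences from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def two_sd_bounds(data):
    """Bounds from the mean +/- 2 standard deviations rule."""
    m, s = mean(data), stdev(data)
    return m - 2 * s, m + 2 * s


def find_outliers(data, low, high):
    """Values that fall outside the interval [low, high]."""
    return [x for x in data if x < low or x > high]


# Hypothetical data set; Q1 = 4.5 and Q3 = 7.5 from the halves method.
data = [2, 4, 5, 5, 6, 7, 8, 30]
fences = iqr_fences(q1=4.5, q3=7.5)
iqr_outliers = find_outliers(data, *fences)
sd_outliers = find_outliers(data, *two_sd_bounds(data))
print(fences, iqr_outliers, sd_outliers)
```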

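### Choosing a Measure

• Why use the median rather than the mean?
  • Because the median is resistant to outliers
• Why use the IQR rather than the standard deviation?
  • Because the IQR is resistant to outliers

### Simple Calculations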

• IQR = Q3 - Q1

## Unit 2 - Two-Variable Data

### Interpreting the Slope and y-Intercept of a Regression Line

• Slope: the amount by which y is predicted to change when x increases by 1 unit.
• y-Intercept: predicted value of y when x = 0.

### Describing a Scatter Plot

• Direction
  • Positive / Negative
  • Positive: xxxx increases as xxxx increases
  • Negative: xxxx decreases as xxxx increases
• Form
  • Linear / Nonlinear / Curve / …
  • Linear: When xxxx increases by 1 unit, xxxx changes by a constant amount on average.
• Strength
  • Strong / Moderate / Weak / …
  • Strong: Points are close to a line.
  • Weak: Points are far from a line.
• Outlier
  • Present / absent / where?

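### Interpreting $R^2$ (Coefficient of Determination)

About $R^2$ (stated as a percentage) of the variability in xxxxxx can be accounted for by the linear relationship between xxxxxx and xxxxxx.

### Interpreting $r$ (Correlation)

• Definition: $r$ is a measure of the direction and strength of the linear relationship between two quantitative variables.
• $-1 \leq r \leq 1$
• $r > 0$: Positive Association
• $r < 0$: Negative Association
• The stronger the linear relationship, the larger $|r|$
• However, a large $|r|$ alone does not guarantee a strong relationship of the right kind, because the form may not be linear; always check the scatter plot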
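
As an illustration of what $r$ does and does not measure, the sketch below computes the correlation from its definition for two hypothetical data sets: one roughly linear, and one that lies exactly on a parabola. The function name `correlation` is an assumption of this sketch.

```python
import math


def correlation(xs, ys):
    """Correlation r between two quantitative variables."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical, roughly linear data.
r_linear = correlation([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
r_squared = r_linear ** 2

# Hypothetical data on the curve y = x^2: a perfect but nonlinear pattern.
r_curve = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_linear, r_squared, r_curve)
```

In the second data set $y$ is completely determined by $x$, yet $r$ is zero, because $r$ only describes linear association.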

### Extrapolation

Answer template for explaining why a prediction cannot be extended (that is, why it is extrapolation):
No. This is extrapolation beyond the …. data. xxxxxxx data were not investigated.

### Identifying Outliers and High-Leverage Points and Their Influence

Outlier: a point with a large residual.
High-Leverage Point: a point that lies far from the others in the horizontal direction. (A high-leverage point in regression has a substantially larger or smaller x-value than the other observations have.)

### Two-Variable Calculations

$$\text {residual} = y - \hat y$$
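
A short Python sketch of the residual calculation follows. The fitted line $\hat y = 2.2 + 0.6x$ and the data points are hypothetical, chosen only to show the arithmetic, and the function name `residuals` is an assumption of this sketch.

```python
def residuals(xs, ys, a, b):
    """residual = y - y-hat for each point, with y-hat = a + b x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical data and fitted line y-hat = 2.2 + 0.6 x.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
res = residuals(xs, ys, a=2.2, b=0.6)

# The point with the largest |residual| is the first candidate for an outlier.
largest = max(range(len(res)), key=lambda i: abs(res[i]))
print(res, xs[largest], ys[largest])
```

A positive residual means the point lies above the line (the line underpredicts), and a negative residual means it lies below.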

## Unit 3 - Collecting Data

### Describing Sampling Methods

1. SRS - Simple Random Sample: Label the subjects (students, patients, etc.) from 1 to N (N = population size). Use the calculator's random number generator randInt(1, N) to generate n (n = sample size) different numbers and select the n corresponding subjects (students, patients, etc.) for the sample.
2. Stratified Random Sample: In each stratum (gender, plots, etc.), label the individuals (students, trees, etc.) in that stratum from 1 to n (n is the number of individuals in the stratum). Use randInt(1, n) to generate m different numbers and select the m corresponding individuals for the sample. Repeat this procedure for every stratum (genders, plots, etc.).
• Note: individuals within a stratum should be as similar as possible and different strata should be as different as possible; the sample is drawn from within every stratum.
3. Cluster Random Sample: Label all the clusters (plots, gender, etc.) from 1 to n (n = total number of clusters). Use the calculator's random number generator randInt(1, n) to generate m different numbers. Select all the individuals in the m corresponding clusters for the sample.
• Note: individuals within a cluster should be as varied as possible and different clusters should be as similar as possible; the sample consists of a few whole clusters.
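
The three procedures can be mimicked in Python, with `random.sample` standing in for repeated calls to randInt until enough different numbers appear. This is a sketch under assumptions: the function names, the labels 1 to 100, and the strata and clusters are all hypothetical.

```python
import random


def simple_random_sample(population, n, rng):
    """SRS: choose n different individuals from the whole population."""
    return rng.sample(population, n)


def stratified_random_sample(strata, m, rng):
    """Choose m different individuals from within every stratum."""
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, m))
    return sample


def cluster_random_sample(clusters, m, rng):
    """Choose m whole clusters and keep every individual in them."""
    chosen = rng.sample(sorted(clusters), m)
    return [person for name in chosen for person in clusters[name]]


rng = random.Random(2020)  # fixed seed so the sketch is reproducible
population = list(range(1, 101))  # individuals labeled 1 to 100
strata = {"boys": list(range(1, 51)), "girls": list(range(51, 101))}
clusters = {f"class {c}": list(range(20 * c + 1, 20 * c + 21)) for c in range(5)}

srs = simple_random_sample(population, 10, rng)
stratified = stratified_random_sample(strata, 5, rng)
clustered = cluster_random_sample(clusters, 2, rng)
print(len(srs), len(stratified), len(clustered))
```

Note the structural difference: the stratified sample takes a few individuals from every group, while the cluster sample takes every individual from a few groups.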

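### Comparing Sampling Methods

• Advantage of cluster sampling: convenience. The sample is easier to obtain.
• Advantage of stratified sampling: lower variability. The sample is more representative of the population.

### Explaining the Problem Caused by a Source of Bias

• Selection Bias
  • Undercoverage: part of the population is left out of the sampling process
• Non-selection Bias:
  • Non-response Bias: sampled individuals do not respond, for example because they cannot be reached or refuse to answer
  • Response Bias: respondents give untruthful answers
  • Wording of Questions: bias caused by poorly worded questions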

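### The Difference Between an Experiment and an Observational Study

• Experiment: the researchers deliberately impose a treatment.
• Observational Study: the researchers only observe, without imposing any treatment.

### Language of Experiment

• Treatment: what is done to the experimental units
• Experimental Units: The smallest collection of individuals to which treatments are applied. For example, if pesticide is sprayed into different bottles to see whether the insects inside survive, the experimental units are the bottles.
• Response Variable: the outcome that is measured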

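### How to Do Random Assignment

1. Label the experimental units
2. Generate random numbers with the calculator
3. Assign treatments according to the generated numbers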
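
The three steps can be sketched in Python. Instead of drawing random labels one at a time, this sketch shuffles the labeled units and deals them out to the treatments in turn, which gives an equivalent random assignment with equal group sizes. The function name `random_assignment`, the 20 units, and the two treatment names are hypothetical.

```python
import random


def random_assignment(units, treatments, rng):
    """Randomly split the labeled units into one group per treatment."""
    shuffled = list(units)
    rng.shuffle(shuffled)  # random order of the labels
    groups = {t: [] for t in treatments}
    # Deal the shuffled units out to the treatments in turn.
    for position, unit in enumerate(shuffled):
        groups[treatments[position % len(treatments)]].append(unit)
    return groups


rng = random.Random(2020)  # fixed seed so the sketch is reproducible
units = list(range(1, 21))  # step 1: units labeled 1 to 20
groups = random_assignment(units, ["treatment", "control"], rng)
print(groups)
```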

### Control Group

A group that is not assigned any treatment.

### Benefit of a Control Group

A control group gives the researchers a baseline for comparison: the effect of the treatments (medication, therapy, etc.) on the response variable (context) is evaluated against what happens in the control group (context).

### Benefit of a Randomized Block Design

Blocking lowers the variability within each block, which makes differences between treatments easier to detect.

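### Whether Single-Blind / Double-Blind Is Possible

#### Single-Blind

• Only the researchers know the group assignments; the subjects do not know whether they are in the treatment group or the control group.
• The advantage of this kind of blinding is that the researchers can observe and understand the subjects better and, when necessary, respond promptly and appropriately to unexpected problems, which protects the subjects' safety.
• The disadvantage is that subjective bias on the researchers' side cannot be avoided, which can easily lead to the treatment group and the control group being handled unequally.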

## Unit 4 - 6

### Format for Calculations

posted @ 2021-01-14 14:32  gyro永不抽风