Statistical Inference II

1. Hypothesis Testing

1.1 Mechanics of Hypothesis Testing

  • Test of a Hypothesis
    • The truth or falsity of a hypothesis can never be known with certainty unless we examine the entire population
    • A hypothesis-testing procedure should therefore be developed with the probability of reaching a wrong conclusion in mind.

1.2 Null Hypothesis and Alternative Hypothesis

  • Null Hypothesis \({H_0}\): is an assertion about one or more population parameters that are assumed to be true until there is sufficient statistical evidence to conclude otherwise.
  • Alternative Hypothesis \({H_1}\): is the assertion of all situations not covered by the null hypothesis.

Together the null and the alternative constitute a set of hypotheses that covers all possible values of the parameter or parameters in question.

1.3. Type I error and type II error

  • Results of a Test of Hypothesis

\[\begin{array}{l|cc} \text{Test Result} \,\backslash\, \text{Reality} & {H_0} \text{ is True} & H_0 \text{ is False} \\ \hline \text{Reject } H_0 & \text{Type I error: } {\Pr (\text{Type I error}) = \alpha} & \text{Correct decision} \\ \text{Do not reject } H_0 & \text{Correct decision} & \text{Type II error: } {\Pr (\text{Type II error}) = \beta} \end{array} \]

Type I error: Rejecting the null hypothesis \(H_0\) when it is true

Type II error: Failing to reject the null hypothesis \(H_0\) when it is false

  • significance level \(\alpha\): the probability of making a type I error:

    \[\alpha = \Pr(\text{reject } H_0|H_0 \text{ is true}) \]

  • Strong and weak conclusion

    • Because we can usually control the significance level \(\alpha\) (the probability of wrongly rejecting \(H_0\)), rejecting the null hypothesis is a strong conclusion

    • Failing to reject the null hypothesis does not necessarily mean a high probability that \(H_0\) is true. Failing to reject \(H_0\) is a weak conclusion.

1.4. Three approaches to hypothesis testing

  • fixed-significance-level \(\alpha\) approach:

    Given a fixed significance level \(\alpha\), all we have to do is determine where to place the critical region.

  • \(p\)-value approach:

    The \(p\)-value provides a measure of the credibility of the null hypothesis.

    Given the test statistic, the \(p\)-value is the smallest significance level \(\alpha\) that would lead to rejection of the null hypothesis:

    \[p < \alpha \quad \Rightarrow \quad \text{reject } H_0 \]

  • confidence interval (CI) approach:

    To estimate an unknown parameter \(\theta\), we would prefer an interval estimate \(L \leq \theta \leq U\), where \(L\) and \(U\) are determined by the sample.

    \[\Pr[L \leq \theta \leq U] = 1 - \alpha \quad \text{or} \quad \Pr[L \leq \theta] = 1 - \alpha \quad \text{or} \quad \Pr[\theta \leq U] = 1 - \alpha \]

The fixed-significance-level approach, the \(p\)-value approach, and the CI approach all lead to the same hypothesis-testing conclusion.
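As a sanity check, the equivalence of the three approaches can be sketched for a one-sample \(z\)-test. All numbers below are hypothetical (\(n = 25\), \(\bar x = 52.3\), \(\sigma = 5\), \(H_0: \mu = 50\)):

```python
# Hypothetical summary statistics; scipy's norm supplies the critical values.
import math
from scipy.stats import norm

n, xbar, sigma, mu0, alpha = 25, 52.3, 5.0, 50.0, 0.05

z = (xbar - mu0) / (sigma / math.sqrt(n))       # test statistic

# (1) fixed-significance-level: reject if |z| exceeds the critical value
reject_fixed = abs(z) > norm.ppf(1 - alpha / 2)

# (2) p-value: reject if p < alpha (two-sided test)
p_value = 2 * norm.sf(abs(z))
reject_p = p_value < alpha

# (3) confidence interval: reject if mu0 falls outside the (1 - alpha) CI
half = norm.ppf(1 - alpha / 2) * sigma / math.sqrt(n)
reject_ci = not (xbar - half <= mu0 <= xbar + half)

assert reject_fixed == reject_p == reject_ci    # the three approaches agree
```

With these numbers \(z = 2.3\), so all three approaches reject \(H_0\) at \(\alpha = 0.05\).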

2 Inferences Regarding a Single Population

  • Note that statistical inference is always with respect to the population.

2.1 Inference on one population mean \(\mu\)

2.1.1 Inference on population mean \(\mu\) with known variance \(\sigma^2\): one-sample \(z\)-test

(1) Theorem (Central Limit Theorem). Assumptions:
  • Let \({X_1}, \cdots, X_n\) be a random sample of size \(n\) taken with replacement from a population with mean \({\mu}\) and variance \({\sigma^2}\). (No population distribution assumption)
  • The population variance \({\sigma^2}\) is known
  • Let \({\bar X}\) be the sample mean.

Then, the limiting form of the distribution of statistic \(Z\)

\[Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim \mathcal N(0,1) \quad n \rightarrow \infty \]

follows the standard normal distribution as \(n \rightarrow \infty\).
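The theorem can be illustrated numerically: standardized sample means drawn from a decidedly non-normal (exponential) population behave like \(\mathcal N(0,1)\) once \(n\) is moderately large. A minimal simulation sketch (the sample size and replication count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0                 # Exp(1) has mean 1 and variance 1
n, reps = 200, 20_000                # arbitrary simulation settings

# draw `reps` samples of size n and standardize each sample mean
xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = (xbar - mu) / (sigma / np.sqrt(n))

# the standardized means should be approximately standard normal
assert abs(z.mean()) < 0.05 and abs(z.std() - 1.0) < 0.05
```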

(2) Hypothesis test

The statistical test for inference on a population mean \(\mu\) with known variance \(\sigma^2\) is called the one-sample \(z\)-test.

\[H_0: \mu = \mu_0 \qquad H_1: \mu \neq \mu_0 \]

(3) Estimation of confidence interval for \(\mu\)

The confidence interval of \({\mu}\):

\[\mu = \bar X \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \]

where \({\alpha}\) is the significance level and \({1-\alpha}\) is called the confidence level. \({z_{\alpha/2}}\) is the value of \(Z\) such that the area in each tail under the standard normal curve is \({\alpha/2}\).

The higher the confidence level, the lower the risk that the confidence interval fails to include the actual value of \({\mu}\).

2.1.2 Inference on population mean \(\mu\) with unknown variance: one-sample \(t\)-test

(1) Theorem: Assumptions:
  • Let \({X_1, X_2, ..., X_n}\) be a series of independent and identically distributed (i.i.d.) random variables following a normal distribution with mean \({\mu}\) and unknown variance.

  • The variance of the normal distribution is unknown

  • Let \({\bar X}\) be the sample mean

  • Usually this test is used when \(n < 40\). When \(n \geq 40\), \(T\) can be regarded as approximately standard normal.

Then, the statistic \({T}\)

\[T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1) \]

follows a t-distribution with \({n-1}\) degrees of freedom.

(2) Hypothesis test

The statistical test for inference on a population mean \(\mu\) with unknown variance is called the one-sample \(t\)-test.

\[H_0: \mu = \mu_0 \qquad H_1: \mu \neq \mu_0 \]

(3) Estimation of confidence interval for \(\mu\)

The confidence interval estimator of the population mean \({\mu}\) is:

\[\mu = \bar X \pm t_{\alpha/2} \frac{S}{\sqrt{n}} \]

where \({S = \sqrt{ \dfrac{1}{(n-1)}\sum_{i=1}^{n}(X_i-\bar X)^2}}\) is the sample standard deviation.
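A minimal sketch with made-up data, testing \(H_0: \mu = 5\); `scipy.stats.ttest_1samp` implements exactly this one-sample \(t\)-test:

```python
import numpy as np
from scipy import stats

x = np.array([4.8, 5.3, 5.1, 4.7, 5.6, 5.0, 4.9, 5.4])  # hypothetical sample
mu0 = 5.0

t_stat, p_value = stats.ttest_1samp(x, popmean=mu0)      # two-sided by default

# the same statistic by hand: T = (Xbar - mu0) / (S / sqrt(n))
n = len(x)
t_manual = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
assert np.isclose(t_stat, t_manual)

# t-based 95% confidence interval for mu
half = stats.t.ppf(0.975, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
ci = (x.mean() - half, x.mean() + half)
```

Note `ddof=1` in `x.std`, which gives the \(n-1\) divisor that the definition of \(S\) requires.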

2.2 Inference on one population variance \(\sigma^2\)

2.2.1 Inference on variance \(\sigma^2\) of a normal population

(1) Theorem: Assumption:
  • Let \({X_1, X_2, ..., X_n}\) be a series of independent and identically distributed (i.i.d.) random variables following a normal distribution with unknown mean \(\mu\) and variance \(\sigma^2\).

The statistic \(X^2\):

\[X^2 = \frac{(n-1) S^2}{\sigma^2} \sim \chi^2(n-1) \]

follows a \({\chi^2}\)-distribution with \({n-1}\) degrees of freedom.

(2) Hypothesis test

\[H_0: \sigma^2 = \sigma^2_0 \qquad H_1: \sigma^2 \neq \sigma^2_0 \]

(3) Estimation of confidence interval for \(\sigma^2\)

The confidence interval estimator of the population variance \({\sigma^2}\) is:

\[\frac{(n-1)S^2}{\chi^2_{\alpha/2}} \leq \sigma^2 \leq \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}} \]

Here \({\chi^2_{\alpha/2}}\) is the critical value with area \({\alpha/2}\) in the right-hand tail of the distribution, while \({\chi^2_{1-\alpha/2}}\) is the critical value with area \({\alpha/2}\) in the left-hand tail.
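A sketch of this interval with hypothetical data; note that scipy's `chi2.ppf` is a lower-tail quantile, so the upper-tail critical value \(\chi^2_{\alpha/2}\) corresponds to `chi2.ppf(1 - alpha/2, ...)`:

```python
import numpy as np
from scipy import stats

# hypothetical sample from a normal population
x = np.array([10.2, 9.8, 10.5, 10.1, 9.6, 10.4, 10.0, 9.9, 10.3, 9.7])
n, alpha = len(x), 0.05
s2 = x.var(ddof=1)                                   # sample variance S^2

# chi2.ppf(1 - alpha/2) is the upper-tail critical value chi^2_{alpha/2}
lower = (n - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1)
upper = (n - 1) * s2 / stats.chi2.ppf(alpha / 2, df=n - 1)

assert lower < s2 < upper     # the point estimate lies inside the interval
```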

2.3 Inference on one population proportion \(p\)

2.3.1 Inference on population proportion \(p\)

(1) Assumption:
  • Let \({X_1, X_2, ..., X_n}\) be a series of independent and identically distributed (i.i.d.) random variables following a Bernoulli distribution with success probability \({p}\). Thus,

    • \(X_i \sim \text{Bernoulli}(p)\), \(\mu = \mathrm{E} (X_i) = p\), \(\sigma^2 = \mathrm{Var} (X_i) = p(1-p)\)

    • \(Y=\sum_{i=1}^nX_i \sim B(n,p)\), \(\mu = \mathrm{E} (Y) = np\), \(\sigma^2 = \mathrm{Var} (Y) = np(1-p)\)

By the central limit theorem, when \(n\) is large, the statistic \(Z\):

\[Z = \frac{\bar X - p}{\sqrt{p(1-p)}/\sqrt{n}} = \frac{X - np}{\sqrt{np(1-p)}} \sim \mathcal N (0,1) \quad n \rightarrow \infty \]

where \(\hat p = \bar X = Y/n\) is the sample proportion.

(2) Hypothesis test

\[H_0:p=p_0 \qquad H_1:p \neq p_0 \]

(3) Estimation of confidence interval for \(p\)

The confidence interval for the population proportion \({p}\) is calculated as follows. The formula assumes that \({n}\) is sufficiently large: \({np \geq 5}\) and \({n(1-p) \geq 5}\).

\[p = \hat p \pm z_{\alpha / 2} \sqrt{\frac{\hat p (1-\hat p)}{n}} \]
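A sketch with hypothetical counts (56 successes in 100 trials, \(H_0: p = 0.5\)); the test statistic uses \(p_0\) under the null, while the CI uses \(\hat p\):

```python
import math
from scipy.stats import norm

n, successes, p0, alpha = 100, 56, 0.5, 0.05    # hypothetical counts
p_hat = successes / n

# z statistic evaluated under H0: p = p0
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 2 * norm.sf(abs(z))                    # two-sided

# CI around p_hat (n*p_hat = 56 and n*(1 - p_hat) = 44, both >= 5)
half = norm.ppf(1 - alpha / 2) * math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - half, p_hat + half)
```

Here \(z = 1.2\) and the interval contains \(0.5\), so \(H_0\) is not rejected at \(\alpha = 0.05\).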

3. Inferences Regarding Comparing Two Populations

3.1 Inference on two populations' means \(\mu_1-\mu_2\)

3.1.1 Inference on two populations' means with known variances: two-sample \(z\)-test

(1) Assumption:
  • Let \({X_{11}, X_{12}, ..., X_{1n_1}}\) be a random sample of size \(n_1\) taken with replacement from a population with mean \({\mu_1}\) and known variance \({\sigma^2_1}\). (No population distribution assumption)

  • Let \({X_{21}, X_{22}, ..., X_{2n_2}}\) be a random sample of size \(n_2\) taken with replacement from another population with mean \({\mu_2}\) and known variance \({\sigma^2_2}\).

  • The two populations are independent.

  • The variances \(\sigma^2_1\) and \(\sigma^2_2\) of the two populations are known.

The statistic \(Z\)

\[Z = \frac{(\bar X_1 - \bar X_2)-(\mu_1 - \mu_2)}{\large{\sqrt{\dfrac{\sigma_1^2}{n_1}+\dfrac{\sigma_2^2}{n_2}}}} \sim \mathcal N (0,1) \]

follows the standard normal distribution.

(2) Hypothesis test

The statistical test for inference on the difference of two population means \(\mu_1-\mu_2\) with known variances \(\sigma_1^2\) and \(\sigma_2^2\) is called the two-sample \(z\)-test.

\[H_0: \mu_1-\mu_2 = 0 \qquad H_1: \mu_1-\mu_2 \neq 0 \]

When \(\sigma^2_1\) and \(\sigma^2_2\) are unknown but \(n_1 \geq 40\) and \(n_2 \geq 40\), \(Z\) (with the sample variances in place of \(\sigma^2_1\) and \(\sigma^2_2\)) can still be regarded as approximately standard normal.

(3) Estimation of confidence interval for \(\mu_1 - \mu_2\)

A confidence interval for the difference between two population means \({\mu_1 - \mu_2}\):

\[\mu_1 - \mu_2 = (\bar X_1 - \bar X_2) \pm z_{\alpha/2} \sqrt{\dfrac{\sigma_1^2}{n_1}+\dfrac{\sigma_2^2}{n_2}} \]
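A sketch with made-up summary statistics (all numbers hypothetical):

```python
import math
from scipy.stats import norm

# hypothetical summary statistics for the two samples
xbar1, sigma1, n1 = 121.0, 8.0, 50
xbar2, sigma2, n2 = 112.0, 10.0, 60
alpha = 0.05

se = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
z = (xbar1 - xbar2) / se                      # H0: mu1 - mu2 = 0

half = norm.ppf(1 - alpha / 2) * se
ci = (xbar1 - xbar2 - half, xbar1 - xbar2 + half)
# with these numbers z exceeds the critical value and the CI excludes 0
```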

3.1.2 Inference on two populations' means with unknown but equal variances: pooled \(t\)-test

(1) Assumption:
  • Let \({X_{11}, X_{12}, ..., X_{1n_1}}\) be a series of independent and identically distributed (i.i.d.) random variables following a normal distribution with mean \({\mu_1}\) and unknown variance \(\sigma^2_1\).

  • Let \({X_{21}, X_{22}, ..., X_{2n_2}}\) be a series of independent and identically distributed (i.i.d.) random variables following another normal distribution with mean \({\mu_2}\) and unknown variance \(\sigma^2_2\).

  • The two normal populations are independent.

  • The variances \(\sigma^2_1\) and \(\sigma^2_2\) of the two populations are unknown.

  • The variances are equal \(\sigma^2_1=\sigma^2_2\)

  • Usually this test is used when \(n_1 < 40\) or \(n_2 < 40\). When \(n_1 \geq 40\) and \(n_2 \geq 40\), \(T\) can be regarded as approximately standard normal.

The statistic \(T\)

\[T = \frac{(\bar X_1 - \bar X_2)-(\mu_1 - \mu_2)}{S_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}} \sim t (n_1+n_2-2) \]

follows a \(t\)-distribution with \({n_1+n_2-2}\) degrees of freedom, where \(S_p^2\) is called the pooled estimator of \(\sigma^2\): the weighted average of the two sample variances, calculated by:

\[S_p^2 = \dfrac{(n_1-1)S_1^2+(n_2-1)S_2^2}{n_1+n_2-2} \]

(2) Hypothesis test

\[H_0: \mu_1-\mu_2 = 0 \qquad H_1: \mu_1-\mu_2 \neq 0 \]

(3) Estimation of confidence interval for \(\mu_1 - \mu_2\)

A confidence interval for the difference between two population means \({\mu_1 - \mu_2}\):

\[\mu_1 - \mu_2 = (\bar X_1 - \bar X_2) \pm t_{\alpha/2,n_1+n_2-2} S_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}} \]
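`scipy.stats.ttest_ind` with `equal_var=True` performs this pooled \(t\)-test; a sketch on made-up data, cross-checked against the \(S_p^2\) formula above:

```python
import numpy as np
from scipy import stats

x1 = np.array([18.2, 17.9, 18.5, 18.1, 17.7, 18.3])   # hypothetical samples
x2 = np.array([17.4, 17.8, 17.1, 17.6, 17.3, 17.9])

t_stat, p_value = stats.ttest_ind(x1, x2, equal_var=True)   # pooled t-test

# cross-check with the pooled-variance formula
n1, n2 = len(x1), len(x2)
sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (x1.mean() - x2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
assert np.isclose(t_stat, t_manual)
```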

3.1.3 Inference on two populations' means with unknown and unequal variances: Welch's \(t\)-test

(1) Assumption:
  • Let \({X_{11}, X_{12}, ..., X_{1n_1}}\) be a series of independent and identically distributed (i.i.d.) random variables following a normal distribution with mean \({\mu_1}\) and unknown variance \(\sigma^2_1\).
  • Let \({X_{21}, X_{22}, ..., X_{2n_2}}\) be a series of independent and identically distributed (i.i.d.) random variables following another normal distribution with mean \({\mu_2}\) and unknown variance \(\sigma^2_2\).
  • The two normal populations are independent.
  • The variances \(\sigma^2_1\) and \(\sigma^2_2\) of the two populations are unknown.
  • The variances are unequal: \(\sigma^2_1 \neq \sigma^2_2\)

The statistic \(T\) is:

\[T = \frac{(\bar X_1 - \bar X_2)-(\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1}+\dfrac{S_2^2}{n_2}}} \sim t (\nu) \]

\(T\) approximately follows a \(t\)-distribution with degrees of freedom \(\nu\), which is given by:

\[\nu = \left\lfloor \dfrac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{(S_1^2/n_1)^2}{n_1-1}+\dfrac{(S_2^2/n_2)^2}{n_2-1}} \right\rfloor \]

where \(\lfloor \cdot \rfloor\) is the floor function, and \(S_1\) and \(S_2\) are the standard deviations of the two samples.

(2) Hypothesis test

\[H_0: \mu_1-\mu_2 = 0 \qquad H_1: \mu_1-\mu_2 \neq 0 \]

(3) Estimation of confidence interval for \(\mu_1 - \mu_2\)

The confidence interval for the difference between two population means \({\mu_1 - \mu_2}\):

\[\mu_1 - \mu_2 = (\bar X_1 - \bar X_2) \pm t_{\alpha/2,\nu} \sqrt{\dfrac{S_1^2}{n_1}+\dfrac{S_2^2}{n_2}} \]
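`scipy.stats.ttest_ind` with `equal_var=False` performs Welch's test; note that scipy uses the un-floored \(\nu\) internally, while printed tables floor it. A sketch on made-up data:

```python
import numpy as np
from scipy import stats

x1 = np.array([21.0, 22.5, 20.8, 23.1, 21.7, 22.0, 21.4])  # hypothetical
x2 = np.array([19.2, 24.8, 18.5, 25.3, 20.1, 23.9])        # more spread out

t_stat, p_value = stats.ttest_ind(x1, x2, equal_var=False)  # Welch's t-test

# cross-check the statistic and the Welch degrees of freedom
v1, v2 = x1.var(ddof=1) / len(x1), x2.var(ddof=1) / len(x2)
t_manual = (x1.mean() - x2.mean()) / np.sqrt(v1 + v2)
nu = (v1 + v2) ** 2 / (v1**2 / (len(x1) - 1) + v2**2 / (len(x2) - 1))

assert np.isclose(t_stat, t_manual)
# Welch df always lies between min(n1, n2) - 1 and n1 + n2 - 2
assert min(len(x1), len(x2)) - 1 <= nu <= len(x1) + len(x2) - 2
```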

3.1.4 Inference on two populations' means with paired samples: paired \(t\)-test

(1) Assumption:
  • Samples from the two populations are collected in pairs. Each pair of samples \((X_{1i},X_{2i})\) is taken under homogeneous conditions, but these conditions may change from one pair to another. (No population distribution assumption)
  • The two random samples may not be independent. (No population independence assumption)
  • The variances \(\sigma^2_1\) and \(\sigma^2_2\) of the two populations are unknown.
  • Whether the variances are equal or not are also unknown.

The statistic \(T\):

\[T = \dfrac{\bar D - \mu_D}{S_D/\sqrt{n}} \sim t(n-1) \]

follows a \(t\)-distribution with \(n-1\) degrees of freedom, where:

  • \(D_i=X_{1i} - X_{2i}, i=1,\cdots,n\)
  • \(\mu_D=\mathrm{E}[X_1-X_2] = \mu_1-\mu_2\)
  • \(\bar{D} = \dfrac{1}{n}\sum_{i=1}^{n}D_i\)
  • \(S_D=\sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(D_i-\bar{D})^2}\)

(2) Hypothesis test

\[H_0: \mu_1-\mu_2 = 0 \qquad H_1: \mu_1-\mu_2 \neq 0 \]

(3) Estimation of confidence interval for \(\mu_1 - \mu_2\)

The confidence interval for the difference between two population means \({\mu_1 - \mu_2}\):

\[\mu_1-\mu_2 = \bar D \pm t_{\alpha/2,n-1} \frac{S_D}{\sqrt{n}} \]
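`scipy.stats.ttest_rel` performs the paired test, which is equivalent to a one-sample \(t\)-test on the differences \(D_i\); a sketch with made-up before/after measurements:

```python
import numpy as np
from scipy import stats

# hypothetical paired measurements taken on the same units
before = np.array([72.0, 68.5, 75.2, 70.1, 69.8, 74.3])
after = np.array([70.4, 67.9, 73.0, 69.5, 68.2, 72.8])

t_stat, p_value = stats.ttest_rel(before, after)

# equivalent one-sample t-test on the pairwise differences D_i
d = before - after
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
assert np.isclose(t_stat, t_manual)
```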

3.1.5 Summary: Inference on the Means of Two Populations

\[\begin{array}{l|ccc} & \text{Variance known, or unknown with } n \geq 40 & \text{Variance unknown, } n < 40 & \text{Paired samples} \\ \hline \text{Population assumption} & \text{two independent distributions} & \text{two independent normal distributions} & \text{not required} \\ \text{Statistic} & z\text{-statistic} & t\text{-statistic} & t\text{-statistic} \\ \text{Procedure} & \text{two-sample } z\text{-test} & \text{pooled } t\text{-test (equal variances) or Welch's } t\text{-test (unequal variances)} & \text{paired } t\text{-test} \\ \text{Sampling distribution of statistic} & \text{standard normal distribution} & t\text{-distribution} & t\text{-distribution} \end{array} \]

3.2 Inference on the ratio of the variances of two populations

3.2.1 Inference on the ratio of variance of two normal populations: \(F\)-test

(1) Assumption:
  • Let \({X_{11}, X_{12}, ..., X_{1n_1}}\) be a series of independent and identically distributed (i.i.d.) random variables following a normal distribution with unknown mean \({\mu_1}\) and variance \(\sigma^2_1\).
  • Let \({X_{21}, X_{22}, ..., X_{2n_2}}\) be a series of independent and identically distributed (i.i.d.) random variables following another normal distribution with unknown mean \({\mu_2}\) and variance \(\sigma^2_2\).
  • The two normal populations are independent.
  • The means \(\mu_1\) and \(\mu_2\) of the two population are unknown.

The statistic \(F\):

\[F = \dfrac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F(n_1-1,n_2-1) \]

follows an \(F\)-distribution with \(n_1-1\) degrees of freedom in the numerator and \(n_2-1\) degrees of freedom in the denominator. Under \(H_0: \sigma_1^2 = \sigma_2^2\), the statistic reduces to \(F = S_1^2/S_2^2\).

(2) Hypothesis test

\[H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1 \qquad H_1: \frac{\sigma_1^2}{\sigma_2^2} \neq 1 \]

(3) Estimation of confidence interval for \(\sigma_1^2 / \sigma_2^2\)

The confidence interval for the ratio of two population variances \({\sigma_1^2/\sigma_2^2}\):

\[\dfrac{S_1^2}{S_2^2}f_{1-\alpha/2,n_2-1,n_1-1} \leq \dfrac{\sigma_1^2}{\sigma_2^2} \leq \dfrac{S_1^2}{S_2^2}f_{\alpha/2,n_2-1,n_1-1} \]

For the \(F\)-distribution, we have

\[f_{1-\alpha,n_1-1,n_2-1}=\dfrac{1}{f_{\alpha,n_2-1,n_1-1}} \]
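scipy has no dedicated two-sample variance \(F\)-test, so the statistic and interval can be computed directly (data below are made up; note scipy's `f.ppf` is a lower-tail quantile, whereas \(f_{\alpha}\) denotes an upper-tail critical value):

```python
import numpy as np
from scipy import stats

x1 = np.array([3.1, 2.8, 3.4, 3.0, 2.9, 3.3, 3.2])  # hypothetical samples
x2 = np.array([2.6, 3.7, 2.2, 3.9, 2.4, 3.5])
n1, n2, alpha = len(x1), len(x2), 0.05

f = x1.var(ddof=1) / x2.var(ddof=1)                  # F statistic under H0
p_value = 2 * min(stats.f.cdf(f, n1 - 1, n2 - 1),
                  stats.f.sf(f, n1 - 1, n2 - 1))     # two-sided p-value

# CI for sigma1^2/sigma2^2; f.ppf(q, ...) is the lower-tail quantile
lower = f * stats.f.ppf(alpha / 2, n2 - 1, n1 - 1)
upper = f * stats.f.ppf(1 - alpha / 2, n2 - 1, n1 - 1)
assert lower < f < upper
```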

3.3 The Chi-Square Goodness-of-Fit Test

If the data are grouped into \({k}\) cells, let the observed count in cell \({i}\) be \({O_i}\) and the expected count (expected under \({H_0}\)) be \({E_i}\). The summation is over all cells \({i = 1, 2,\dots, k}\). The test statistic is

\[X^2=\sum_{i=1}^{k}\frac{(O_i-E_i)^2}{E_i} \sim \chi^2(k-1) \]

The test requires that the expected counts be \({5}\) or more for all cells. When the sample size is small or cells are defined such that expected frequencies are small, the \({\chi^2}\) test is inappropriate.
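`scipy.stats.chisquare` implements this statistic; a sketch with hypothetical die-roll counts tested against a fair-die null (all expected counts are 60, comfortably above 5):

```python
import numpy as np
from scipy import stats

observed = np.array([58, 64, 55, 61, 66, 56])    # 360 hypothetical rolls
expected = np.full(6, observed.sum() / 6)        # 60 per face under H0

chi2_stat, p_value = stats.chisquare(observed, expected)

# the same statistic by hand: sum of (O_i - E_i)^2 / E_i
manual = ((observed - expected) ** 2 / expected).sum()
assert np.isclose(chi2_stat, manual)
```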

Contingency tables are used to determine whether two classification criteria are independent of each other.

  • General Layout of a Contingency Table

\[\begin{array}{c|ccccc|c} \text{Second} \downarrow \;\backslash\; \text{First} \rightarrow & 1 & \cdots & j & \cdots & c & \text{Total} \\ \hline 1 & O_{11} & \cdots & O_{1j} & \cdots & O_{1c} & R_1 \\ \vdots & \vdots & \ddots & \vdots & & \vdots & \vdots \\ i & O_{i1} & \cdots & O_{ij} & \cdots & O_{ic} & R_i \\ \vdots & \vdots & & \vdots & \ddots & \vdots & \vdots \\ r & O_{r1} & \cdots & O_{rj} & \cdots & O_{rc} & R_r \\ \hline \text{Total} & C_1 & \cdots & C_j & \cdots & C_c & n \end{array} \]

The rows index the second classification category (\(1, \dots, r\)) and the columns index the first classification category (\(1, \dots, c\)).

The test statistic \({X^2}\), summing the differences between observed and expected frequencies over all rows and columns of a two-way contingency table, is written as follows:

\[X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij}-E_{ij})^2}{E_{ij}} \sim \chi^2 \left((r-1)(c-1) \right) \]

The test statistic is approximately \({\chi^2}\) distributed with degrees of freedom \({df = (r − 1)(c − 1)}\). \({R_i}\) and \({C_j}\) are the row and column totals, \({E_{ij}}\) is the expected count in \({\text{cell}(i, j)}\) expressed as:

\[E_{ij} = \frac{R_i C_j}{n} \]

Note that when contingency tables are based on small sample sizes, or when expected cell frequencies are small, the \({\chi^2}\) test statistic is unreliable.
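`scipy.stats.chi2_contingency` carries out the whole procedure, returning the statistic, the \(p\)-value, the degrees of freedom \((r-1)(c-1)\), and the table of expected counts \(E_{ij}\); a sketch on a made-up \(2 \times 3\) table:

```python
import numpy as np
from scipy import stats

# hypothetical 2x3 contingency table of observed counts O_ij
table = np.array([[30, 45, 25],
                  [20, 35, 45]])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)
assert dof == (table.shape[0] - 1) * (table.shape[1] - 1)

# expected counts follow E_ij = R_i * C_j / n
n = table.sum()
e00 = table.sum(axis=1)[0] * table.sum(axis=0)[0] / n
assert np.isclose(expected[0, 0], e00)
```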

posted @ 2022-02-23 10:46 veager