Statistical Inference II

1. Hypothesis Testing

1.1 Mechanics of Hypothesis Testing

  • Test of a Hypothesis
    • The truth or falsity of a hypothesis can never be known with certainty unless we examine the entire population
    • A hypothesis-testing procedure should therefore be developed with the probability of reaching a wrong conclusion in mind.

1.2 Null Hypothesis and Alternative Hypothesis

  • Null Hypothesis \({H_0}\): is an assertion about one or more population parameters that are assumed to be true until there is sufficient statistical evidence to conclude otherwise.
  • Alternative Hypothesis \({H_1}\): is the assertion of all situations not covered by the null hypothesis.

Together the null and the alternative constitute a set of hypotheses that covers all possible values of the parameter or parameters in question.

1.3. Type I error and type II error

  • Results of a Test of Hypothesis

\[\begin{array}{l|cc} \text{Test Result} \,\backslash\, \text{Reality} & {H_0} \text{ is True} & H_0 \text{ is False} \\ \hline \text{Reject } H_0 & \text{Type I error: } {\Pr (\text{Type I error}) = \alpha} & \text{Correct decision} \\ \text{Do not reject } H_0 & \text{Correct decision} & \text{Type II error: } {\Pr (\text{Type II error}) = \beta} \end{array} \]

Type I error: Rejecting the null hypothesis \(H_0\) when it is true

Type II error: Failing to reject the null hypothesis \(H_0\) when it is false

  • significance level \(\alpha\): the probability of making a type I error:

    \[\alpha = \Pr(\text{reject } H_0|H_0 \text{ is true}) \]

  • Strong and weak conclusion

    • Because we can usually control the significance level \(\alpha\) (the probability of wrongly rejecting \(H_0\)), rejecting the null hypothesis is a strong conclusion

    • Failing to reject the null hypothesis does not necessarily mean a high probability that \(H_0\) is true. Failing to reject \(H_0\) is a weak conclusion.

1.4. Three approaches to hypothesis testing

  • fixed-significance-level \(\alpha\) approach:

    Given a fixed significance level \(\alpha\), all we have to do is determine where to place the critical region.

  • \(p\)-value approach:

    The \(p\)-value provides a measure of the credibility of the null hypothesis.

    Given the test statistic, the \(p\)-value is the smallest significance level \(\alpha\) that would lead to rejection of the null hypothesis:

    \[p < \alpha \quad \Rightarrow \quad \text{reject } H_0 \]

  • confidence interval (CI) approach:

    To estimate an unknown parameter \(\theta\), we would prefer an interval estimate \(L \leq \theta \leq U\), where \(L\) and \(U\) are determined by the sample.

    \[\Pr[L \leq \theta \leq U] = 1 - \alpha \quad \text{or} \quad \Pr[L \leq \theta] = 1 - \alpha \quad \text{or} \quad \Pr[\theta \leq U] = 1 - \alpha \]

The fixed-significance-level approach, the \(p\)-value approach, and the CI approach all lead to the same hypothesis-testing conclusion.
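As a sanity check, the equivalence of the three approaches can be sketched for a one-sample \(z\)-test. All numbers below are hypothetical (\(n = 25\), \(\bar x = 52.3\), \(\sigma = 5\), \(H_0: \mu = 50\)):

```python
# Hypothetical summary statistics; scipy's norm supplies the critical values.
import math
from scipy.stats import norm

n, xbar, sigma, mu0, alpha = 25, 52.3, 5.0, 50.0, 0.05

z = (xbar - mu0) / (sigma / math.sqrt(n))       # test statistic

# (1) fixed-significance-level: reject if |z| exceeds the critical value
reject_fixed = abs(z) > norm.ppf(1 - alpha / 2)

# (2) p-value: reject if p < alpha (two-sided test)
p_value = 2 * norm.sf(abs(z))
reject_p = p_value < alpha

# (3) confidence interval: reject if mu0 falls outside the (1 - alpha) CI
half = norm.ppf(1 - alpha / 2) * sigma / math.sqrt(n)
reject_ci = not (xbar - half <= mu0 <= xbar + half)

assert reject_fixed == reject_p == reject_ci    # the three approaches agree
```

With these numbers \(z = 2.3\), so all three approaches reject \(H_0\) at \(\alpha = 0.05\).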

2 Inferences Regarding a Single Population

  • Note that statistical inference is always with respect to the population.

2.1 Inference on one population mean \(\mu\)

2.1.1 Inference on population mean \(\mu\) with known variance \(\sigma^2\): one-sample \(z\)-test

(1) Theorem (Central Limit Theorem). Assumptions:
  • Let \({X_1}, \cdots, X_n\) be a random sample of size \(n\) taken with replacement from a population with mean \({\mu}\) and variance \({\sigma^2}\). (No population distribution assumption)
  • The population variance \({\sigma^2}\) is known
  • Let \({\bar X}\) be the sample mean.

Then, the limiting form of the distribution of statistic \(Z\)

\[Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim \mathcal N(0,1) \quad n \rightarrow \infty \]

follows the standard normal distribution as \(n \rightarrow \infty\).
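The theorem can be illustrated numerically: standardized sample means drawn from a decidedly non-normal (exponential) population behave like \(\mathcal N(0,1)\) once \(n\) is moderately large. A minimal simulation sketch (the sample size and replication count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0                 # Exp(1) has mean 1 and variance 1
n, reps = 200, 20_000                # arbitrary simulation settings

# draw `reps` samples of size n and standardize each sample mean
xbar = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = (xbar - mu) / (sigma / np.sqrt(n))

# the standardized means should be approximately standard normal
assert abs(z.mean()) < 0.05 and abs(z.std() - 1.0) < 0.05
```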

(2) Hypothesis test

The statistical test for inference on a population mean \(\mu\) with known variance \(\sigma^2\) is called the one-sample \(z\)-test.

\[H_0: \mu = \mu_0 \qquad H_1: \mu \neq \mu_0 \]

(3) Estimation of confidence interval for \(\mu\)

The confidence interval of \({\mu}\):

\[\mu = \bar X \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \]

where \({\alpha}\) is the significance level and \({1-\alpha}\) is called the confidence level. \({z_{\alpha/2}}\) is the value of \(Z\) such that the area in each tail under the standard normal curve is \({\alpha/2}\).

The higher the confidence level, the lower the risk that the confidence interval fails to include the actual value of \({\mu}\).

2.1.2 Inference on population mean \(\mu\) with unknown variance: one-sample \(t\)-test

(1) Theorem: Assumptions:
  • Let \({X_1, X_2, ..., X_n}\) be a series of independent and identically distributed (i.i.d.) random variables following a normal distribution with mean \({\mu}\) and unknown variance.

  • The variance of the normal distribution is unknown

  • Let \({\bar X}\) be the sample mean

  • Usually this test is used when \(n < 40\). When \(n \geq 40\), \(T\) can be regarded as approximately standard normal.

Then, the statistic \({T}\)

\[T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t(n-1) \]

follows a t-distribution with \({n-1}\) degrees of freedom.

(2) Hypothesis test

The statistical test for inference on a population mean \(\mu\) with unknown variance is called the one-sample \(t\)-test.

\[H_0: \mu = \mu_0 \qquad H_1: \mu \neq \mu_0 \]

(3) Estimation of confidence interval for \(\mu\)

The confidence interval estimator of the population mean \({\mu}\) is:

\[\mu = \bar X \pm t_{\alpha/2} \frac{S}{\sqrt{n}} \]

where \({S = \sqrt{ \dfrac{1}{(n-1)}\sum_{i=1}^{n}(X_i-\bar X)^2}}\) is the sample standard deviation.
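A minimal sketch with made-up data, testing \(H_0: \mu = 5\); `scipy.stats.ttest_1samp` implements exactly this one-sample \(t\)-test:

```python
import numpy as np
from scipy import stats

x = np.array([4.8, 5.3, 5.1, 4.7, 5.6, 5.0, 4.9, 5.4])  # hypothetical sample
mu0 = 5.0

t_stat, p_value = stats.ttest_1samp(x, popmean=mu0)      # two-sided by default

# the same statistic by hand: T = (Xbar - mu0) / (S / sqrt(n))
n = len(x)
t_manual = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
assert np.isclose(t_stat, t_manual)

# t-based 95% confidence interval for mu
half = stats.t.ppf(0.975, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
ci = (x.mean() - half, x.mean() + half)
```

Note `ddof=1` in `x.std`, which gives the \(n-1\) divisor that the definition of \(S\) requires.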

2.2 Inference on one population variance \(\sigma^2\)

2.2.1 Inference on variance \(\sigma^2\) of a normal population

(1) Theorem: Assumption:
  • Let \({X_1, X_2, ..., X_n}\) be a series of independent and identically distributed (i.i.d.) random variables following a normal distribution with unknown mean \(\mu\) and variance \(\sigma^2\).

The statistic \(X^2\):

\[X^2 = \frac{(n-1) S^2}{\sigma^2} \sim \chi^2(n-1) \]

follows a \({\chi^2}\)-distribution with \({n-1}\) degrees of freedom.

(2) Hypothesis test

\[H_0: \sigma^2 = \sigma^2_0 \qquad H_1: \sigma^2 \neq \sigma^2_0 \]

(3) Estimation of confidence interval for \(\sigma^2\)

The confidence interval estimator of the population variance \({\sigma^2}\) is:

\[\frac{(n-1)S^2}{\chi^2_{\alpha/2}} \leq \sigma^2 \leq \frac{(n-1)S^2}{\chi^2_{1-\alpha/2}} \]

Here \({\chi^2_{\alpha/2}}\) is the critical value with area \({\alpha/2}\) in the right-hand tail of the distribution, while \({\chi^2_{1-\alpha/2}}\) is the critical value with area \({\alpha/2}\) in the left-hand tail.
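A sketch of this interval with hypothetical data; note that scipy's `chi2.ppf` is a lower-tail quantile, so the upper-tail critical value \(\chi^2_{\alpha/2}\) corresponds to `chi2.ppf(1 - alpha/2, ...)`:

```python
import numpy as np
from scipy import stats

# hypothetical sample from a normal population
x = np.array([10.2, 9.8, 10.5, 10.1, 9.6, 10.4, 10.0, 9.9, 10.3, 9.7])
n, alpha = len(x), 0.05
s2 = x.var(ddof=1)                                   # sample variance S^2

# chi2.ppf(1 - alpha/2) is the upper-tail critical value chi^2_{alpha/2}
lower = (n - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1)
upper = (n - 1) * s2 / stats.chi2.ppf(alpha / 2, df=n - 1)

assert lower < s2 < upper     # the point estimate lies inside the interval
```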

2.3 Inference on one population proportion \(p\)

2.3.1 Inference on population proportion \(p\)

(1) Assumption:
  • Let \({X_1, X_2, ..., X_n}\) be a series of independent and identically distributed (i.i.d.) random variables following a Bernoulli distribution with success probability \({p}\). Thus,

    • \(X_i \sim \text{Bernoulli}(p)\), \(\mu = \mathrm{E} (X_i) = p\), \(\sigma^2 = \mathrm{Var} (X_i) = p(1-p)\)

    • \(Y=\sum_{i=1}^nX_i \sim B(n,p)\), \(\mu = \mathrm{E} (Y) = np\), \(\sigma^2 = \mathrm{Var} (Y) = np(1-p)\)

By the central limit theorem, when \(n\) is large, the statistic \(Z\):

\[Z = \frac{\bar X - p}{\sqrt{p(1-p)}/\sqrt{n}} = \frac{X - np}{\sqrt{np(1-p)}} \sim \mathcal N (0,1) \quad n \rightarrow \infty \]

where \(\hat p = \bar X = Y/n\) is the sample proportion.

(2) Hypothesis test

\[H_0:p=p_0 \qquad H_1:p \neq p_0 \]

(3) Estimation of confidence interval for \(p\)

The confidence interval for the population proportion \({p}\) is calculated as follows. The formula assumes that \({n}\) is sufficiently large: \({np \geq 5}\) and \({n(1-p) \geq 5}\).

\[p = \hat p \pm z_{\alpha / 2} \sqrt{\frac{\hat p (1-\hat p)}{n}} \]
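A sketch with hypothetical counts (56 successes in 100 trials, \(H_0: p = 0.5\)); the test statistic uses \(p_0\) under the null, while the CI uses \(\hat p\):

```python
import math
from scipy.stats import norm

n, successes, p0, alpha = 100, 56, 0.5, 0.05    # hypothetical counts
p_hat = successes / n

# z statistic evaluated under H0: p = p0
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 2 * norm.sf(abs(z))                    # two-sided

# CI around p_hat (n*p_hat = 56 and n*(1 - p_hat) = 44, both >= 5)
half = norm.ppf(1 - alpha / 2) * math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - half, p_hat + half)
```

Here \(z = 1.2\) and the interval contains \(0.5\), so \(H_0\) is not rejected at \(\alpha = 0.05\).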

3. Inferences Regarding Comparing Two Populations

3.1 Inference on two populations' means \(\mu_1-\mu_2\)

3.1.1 Inference on two populations' means with known variances: two-sample \(z\)-test

(1) Assumption:
  • Let \({X_{11}, X_{12}, ..., X_{1n_1}}\) be a random sample of size \(n_1\) taken with replacement from a population with mean \({\mu_1}\) and known variance \({\sigma^2_1}\). (No population distribution assumption)

  • Let \({X_{21}, X_{22}, ..., X_{2n_2}}\) be a random sample of size \(n_2\) taken with replacement from another population with mean \({\mu_2}\) and known variance \({\sigma^2_2}\).

  • The two populations are independent.

  • The variances \(\sigma^2_1\) and \(\sigma^2_2\) of the two populations are known.

The statistic \(Z\)

\[Z = \frac{(\bar X_1 - \bar X_2)-(\mu_1 - \mu_2)}{\large{\sqrt{\dfrac{\sigma_1^2}{n_1}+\dfrac{\sigma_2^2}{n_2}}}} \sim \mathcal N (0,1) \]

follows the standard normal distribution.

(2) Hypothesis test

The statistical test for inference on the difference of two population means \(\mu_1-\mu_2\) with known variances \(\sigma_1^2\) and \(\sigma_2^2\) is called the two-sample \(z\)-test.

\[H_0: \mu_1-\mu_2 = 0 \qquad H_1: \mu_1-\mu_2 \neq 0 \]

When \(\sigma^2_1\) and \(\sigma^2_2\) are unknown but \(n_1 \geq 40\) and \(n_2 \geq 40\), \(Z\) (with the sample variances in place of \(\sigma^2_1\) and \(\sigma^2_2\)) can still be regarded as approximately standard normal.

(3) Estimation of confidence interval for \(\mu_1 - \mu_2\)

A confidence interval for the difference between two population means \({\mu_1 - \mu_2}\):

\[\mu_1 - \mu_2 = (\bar X_1 - \bar X_2) \pm z_{\alpha/2} \sqrt{\dfrac{\sigma_1^2}{n_1}+\dfrac{\sigma_2^2}{n_2}} \]
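A sketch with made-up summary statistics (all numbers hypothetical):

```python
import math
from scipy.stats import norm

# hypothetical summary statistics for the two samples
xbar1, sigma1, n1 = 121.0, 8.0, 50
xbar2, sigma2, n2 = 112.0, 10.0, 60
alpha = 0.05

se = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)
z = (xbar1 - xbar2) / se                      # H0: mu1 - mu2 = 0

half = norm.ppf(1 - alpha / 2) * se
ci = (xbar1 - xbar2 - half, xbar1 - xbar2 + half)
# with these numbers z exceeds the critical value and the CI excludes 0
```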

3.1.2 Inference on two populations' means with unknown but equal variances: pooled \(t\)-test

(1) Assumption:
  • Let \({X_{11}, X_{12}, ..., X_{1n_1}}\) be a series of independent and identically distributed (i.i.d.) random variables following a normal distribution with mean \({\mu_1}\) and unknown variance \(\sigma^2_1\).

  • Let \({X_{21}, X_{22}, ..., X_{2n_2}}\) be a series of independent and identically distributed (i.i.d.) random variables following another normal distribution with mean \({\mu_2}\) and unknown variance \(\sigma^2_2\).

  • The two normal populations are independent.

  • The variances \(\sigma^2_1\) and \(\sigma^2_2\) of the two populations are unknown.

  • The variances are equal \(\sigma^2_1=\sigma^2_2\)

  • Usually this test is used when \(n_1 < 40\) or \(n_2 < 40\). When \(n_1 \geq 40\) and \(n_2 \geq 40\), \(T\) can be regarded as approximately standard normal.

The statistic \(T\)

\[T = \frac{(\bar X_1 - \bar X_2)-(\mu_1 - \mu_2)}{S_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}}} \sim t (n_1+n_2-2) \]

follows a \(t\)-distribution with \({n_1+n_2-2}\) degrees of freedom, where \(S_p^2\) is called the pooled estimator of \(\sigma^2\): the weighted average of the two sample variances, calculated by:

\[S_p^2 = \dfrac{(n_1-1)S_1^2+(n_2-1)S_2^2}{n_1+n_2-2} \]

(2) Hypothesis test

\[H_0: \mu_1-\mu_2 = 0 \qquad H_1: \mu_1-\mu_2 \neq 0 \]

(3) Estimation of confidence interval for \(\mu_1 - \mu_2\)

A confidence interval for the difference between two population means \({\mu_1 - \mu_2}\):

\[\mu_1 - \mu_2 = (\bar X_1 - \bar X_2) \pm t_{\alpha/2,n_1+n_2-2} S_p\sqrt{\dfrac{1}{n_1}+\dfrac{1}{n_2}} \]
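`scipy.stats.ttest_ind` with `equal_var=True` performs this pooled \(t\)-test; a sketch on made-up data, cross-checked against the \(S_p^2\) formula above:

```python
import numpy as np
from scipy import stats

x1 = np.array([18.2, 17.9, 18.5, 18.1, 17.7, 18.3])   # hypothetical samples
x2 = np.array([17.4, 17.8, 17.1, 17.6, 17.3, 17.9])

t_stat, p_value = stats.ttest_ind(x1, x2, equal_var=True)   # pooled t-test

# cross-check with the pooled-variance formula
n1, n2 = len(x1), len(x2)
sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (x1.mean() - x2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
assert np.isclose(t_stat, t_manual)
```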

3.1.3 Inference on two populations' means with unknown and unequal variances: Welch's \(t\)-test

(1) Assumption:
  • Let \({X_{11}, X_{12}, ..., X_{1n_1}}\) be a series of independent and identically distributed (i.i.d.) random variables following a normal distribution with mean \({\mu_1}\) and unknown variance \(\sigma^2_1\).
  • Let \({X_{21}, X_{22}, ..., X_{2n_2}}\) be a series of independent and identically distributed (i.i.d.) random variables following another normal distribution with mean \({\mu_2}\) and unknown variance \(\sigma^2_2\).
  • The two normal populations are independent.
  • The variances \(\sigma^2_1\) and \(\sigma^2_2\) of the two populations are unknown.
  • The variances are unequal: \(\sigma^2_1 \neq \sigma^2_2\)

The statistic \(T\) is:

\[T = \frac{(\bar X_1 - \bar X_2)-(\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1}+\dfrac{S_2^2}{n_2}}} \sim t (\nu) \]

\(T\) approximately follows a \(t\)-distribution with degrees of freedom \(\nu\), which is given by:

\[\nu = \left\lfloor \dfrac{\left(\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}\right)^2}{\dfrac{(S_1^2/n_1)^2}{n_1-1}+\dfrac{(S_2^2/n_2)^2}{n_2-1}} \right\rfloor \]

where \(\lfloor \cdot \rfloor\) is the floor function, and \(S_1\) and \(S_2\) are the standard deviations of the two samples.

(2) Hypothesis test

\[H_0: \mu_1-\mu_2 = 0 \qquad H_1: \mu_1-\mu_2 \neq 0 \]

(3) Estimation of confidence interval for \(\mu_1 - \mu_2\)

The confidence interval for the difference between two population means \({\mu_1 - \mu_2}\):

\[\mu_1 - \mu_2 = (\bar X_1 - \bar X_2) \pm t_{\alpha/2,\nu} \sqrt{\dfrac{S_1^2}{n_1}+\dfrac{S_2^2}{n_2}} \]
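`scipy.stats.ttest_ind` with `equal_var=False` performs Welch's test; note that scipy uses the un-floored \(\nu\) internally, while printed tables floor it. A sketch on made-up data:

```python
import numpy as np
from scipy import stats

x1 = np.array([21.0, 22.5, 20.8, 23.1, 21.7, 22.0, 21.4])  # hypothetical
x2 = np.array([19.2, 24.8, 18.5, 25.3, 20.1, 23.9])        # more spread out

t_stat, p_value = stats.ttest_ind(x1, x2, equal_var=False)  # Welch's t-test

# cross-check the statistic and the Welch degrees of freedom
v1, v2 = x1.var(ddof=1) / len(x1), x2.var(ddof=1) / len(x2)
t_manual = (x1.mean() - x2.mean()) / np.sqrt(v1 + v2)
nu = (v1 + v2) ** 2 / (v1**2 / (len(x1) - 1) + v2**2 / (len(x2) - 1))

assert np.isclose(t_stat, t_manual)
# Welch df always lies between min(n1, n2) - 1 and n1 + n2 - 2
assert min(len(x1), len(x2)) - 1 <= nu <= len(x1) + len(x2) - 2
```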

3.1.4 Inference on two populations' means with paired samples: paired \(t\)-test

(1) Assumption:
  • Samples from the two populations are collected in pairs. Each pair of samples \((X_{1i},X_{2i})\) is taken under homogeneous conditions, but these conditions may change from one pair to another. (No population distribution assumption)
  • The two random samples may not be independent. (No population independence assumption)
  • The variances \(\sigma^2_1\) and \(\sigma^2_2\) of the two populations are unknown.
  • Whether the variances are equal or not are also unknown.

The statistic \(T\):

\[T = \dfrac{\bar D - \mu_D}{S_D/\sqrt{n}} \sim t(n-1) \]

follows a \(t\)-distribution with \(n-1\) degrees of freedom, where:

  • \(D_i=X_{1i} - X_{2i}, i=1,\cdots,n\)
  • \(\mu_D=\mathrm{E}[X_1-X_2] = \mu_1-\mu_2\)
  • \(\bar{D} = \dfrac{1}{n}\sum_{i=1}^{n}D_i\)
  • \(S_D=\sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(D_i-\bar{D})^2}\)

(2) Hypothesis test

\[H_0: \mu_1-\mu_2 = 0 \qquad H_1: \mu_1-\mu_2 \neq 0 \]

(3) Estimation of confidence interval for \(\mu_1 - \mu_2\)

The confidence interval for the difference between two population means \({\mu_1 - \mu_2}\):

\[\mu_1-\mu_2 = \bar D \pm t_{\alpha/2,n-1} \frac{S_D}{\sqrt{n}} \]
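`scipy.stats.ttest_rel` performs the paired test, which is equivalent to a one-sample \(t\)-test on the differences \(D_i\); a sketch with made-up before/after measurements:

```python
import numpy as np
from scipy import stats

# hypothetical paired measurements taken on the same units
before = np.array([72.0, 68.5, 75.2, 70.1, 69.8, 74.3])
after = np.array([70.4, 67.9, 73.0, 69.5, 68.2, 72.8])

t_stat, p_value = stats.ttest_rel(before, after)

# equivalent one-sample t-test on the pairwise differences D_i
d = before - after
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
assert np.isclose(t_stat, t_manual)
```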

3.1.5 Summary: Inference on the Means of Two Populations

\[\begin{array}{l|ccc} & \text{Variance known, or unknown with } n \geq 40 & \text{Variance unknown, } n < 40 & \text{Paired samples} \\ \hline \text{Population assumption} & \text{two independent distributions} & \text{two independent normal distributions} & \text{not required} \\ \text{Statistic} & z\text{-statistic} & t\text{-statistic} & t\text{-statistic} \\ \text{Procedure} & \text{two-sample } z\text{-test} & \text{pooled } t\text{-test (equal variances) or Welch's } t\text{-test (unequal variances)} & \text{paired } t\text{-test} \\ \text{Sampling distribution of statistic} & \text{standard normal distribution} & t\text{-distribution} & t\text{-distribution} \end{array} \]

3.2 Inference on the ratio of the variances of two populations

3.2.1 Inference on the ratio of variance of two normal populations: \(F\)-test

(1) Assumption:
  • Let \({X_{11}, X_{12}, ..., X_{1n_1}}\) be a series of independent and identically distributed (i.i.d.) random variables following a normal distribution with unknown mean \({\mu_1}\) and variance \(\sigma^2_1\).
  • Let \({X_{21}, X_{22}, ..., X_{2n_2}}\) be a series of independent and identically distributed (i.i.d.) random variables following another normal distribution with unknown mean \({\mu_2}\) and variance \(\sigma^2_2\).
  • The two normal populations are independent.
  • The means \(\mu_1\) and \(\mu_2\) of the two population are unknown.

The statistic \(F\):

\[F = \dfrac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F(n_1-1,n_2-1) \]

follows an \(F\)-distribution with \(n_1-1\) degrees of freedom in the numerator and \(n_2-1\) degrees of freedom in the denominator. Under \(H_0: \sigma_1^2 = \sigma_2^2\), the statistic reduces to \(F = S_1^2/S_2^2\).

(2) Hypothesis test

\[H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1 \qquad H_1: \frac{\sigma_1^2}{\sigma_2^2} \neq 1 \]

(3) Estimation of confidence interval for \(\sigma_1^2 / \sigma_2^2\)

The confidence interval for the ratio of two population variances \({\sigma_1^2/\sigma_2^2}\):

\[\dfrac{S_1^2}{S_2^2}f_{1-\alpha/2,n_2-1,n_1-1} \leq \dfrac{\sigma_1^2}{\sigma_2^2} \leq \dfrac{S_1^2}{S_2^2}f_{\alpha/2,n_2-1,n_1-1} \]

For the \(F\)-distribution, we have

\[f_{1-\alpha,n_1-1,n_2-1}=\dfrac{1}{f_{\alpha,n_2-1,n_1-1}} \]
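scipy has no dedicated two-sample variance \(F\)-test, so the statistic and interval can be computed directly (data below are made up; note scipy's `f.ppf` is a lower-tail quantile, whereas \(f_{\alpha}\) denotes an upper-tail critical value):

```python
import numpy as np
from scipy import stats

x1 = np.array([3.1, 2.8, 3.4, 3.0, 2.9, 3.3, 3.2])  # hypothetical samples
x2 = np.array([2.6, 3.7, 2.2, 3.9, 2.4, 3.5])
n1, n2, alpha = len(x1), len(x2), 0.05

f = x1.var(ddof=1) / x2.var(ddof=1)                  # F statistic under H0
p_value = 2 * min(stats.f.cdf(f, n1 - 1, n2 - 1),
                  stats.f.sf(f, n1 - 1, n2 - 1))     # two-sided p-value

# CI for sigma1^2/sigma2^2; f.ppf(q, ...) is the lower-tail quantile
lower = f * stats.f.ppf(alpha / 2, n2 - 1, n1 - 1)
upper = f * stats.f.ppf(1 - alpha / 2, n2 - 1, n1 - 1)
assert lower < f < upper
```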

3.3 The Chi-Square Goodness-of-Fit Test

If the data are grouped into \({k}\) cells, let the observed count in cell \({i}\) be \({O_i}\) and the expected count (expected under \({H_0}\)) be \({E_i}\). The summation is over all cells \({i = 1, 2,\dots, k}\). The test statistic is

\[X^2=\sum_{i=1}^{k}\frac{(O_i-E_i)^2}{E_i} \sim \chi^2(k-1) \]

The test requires that the expected counts be \({5}\) or more for all cells. When the sample size is small or cells are defined such that expected frequencies are small, the \({\chi^2}\) test is inappropriate.
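`scipy.stats.chisquare` implements this statistic; a sketch with hypothetical die-roll counts tested against a fair-die null (all expected counts are 60, comfortably above 5):

```python
import numpy as np
from scipy import stats

observed = np.array([58, 64, 55, 61, 66, 56])    # 360 hypothetical rolls
expected = np.full(6, observed.sum() / 6)        # 60 per face under H0

chi2_stat, p_value = stats.chisquare(observed, expected)

# the same statistic by hand: sum of (O_i - E_i)^2 / E_i
manual = ((observed - expected) ** 2 / expected).sum()
assert np.isclose(chi2_stat, manual)
```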

Contingency tables are used to determine whether two classification criteria are independent of each other.

  • General Layout of a Contingency Table

\[\begin{array}{c|ccccc|c} \text{Second} \downarrow \;\backslash\; \text{First} \rightarrow & 1 & \cdots & j & \cdots & c & \text{Total} \\ \hline 1 & O_{11} & \cdots & O_{1j} & \cdots & O_{1c} & R_1 \\ \vdots & \vdots & \ddots & \vdots & & \vdots & \vdots \\ i & O_{i1} & \cdots & O_{ij} & \cdots & O_{ic} & R_i \\ \vdots & \vdots & & \vdots & \ddots & \vdots & \vdots \\ r & O_{r1} & \cdots & O_{rj} & \cdots & O_{rc} & R_r \\ \hline \text{Total} & C_1 & \cdots & C_j & \cdots & C_c & n \end{array} \]

The rows index the second classification category (\(1, \dots, r\)) and the columns index the first classification category (\(1, \dots, c\)).

The test statistic \({X^2}\), summing the differences between observed and expected frequencies over all rows and columns of a two-way contingency table, is written as follows:

\[X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij}-E_{ij})^2}{E_{ij}} \sim \chi^2 \left((r-1)(c-1) \right) \]

The test statistic is approximately \({\chi^2}\) distributed with degrees of freedom \({df = (r − 1)(c − 1)}\). \({R_i}\) and \({C_j}\) are the row and column totals, \({E_{ij}}\) is the expected count in \({\text{cell}(i, j)}\) expressed as:

\[E_{ij} = \frac{R_i C_j}{n} \]

Note that when contingency tables are based on small sample sizes, or when expected cell frequencies are small, the \({\chi^2}\) test statistic is unreliable.
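`scipy.stats.chi2_contingency` carries out the whole procedure, returning the statistic, the \(p\)-value, the degrees of freedom \((r-1)(c-1)\), and the table of expected counts \(E_{ij}\); a sketch on a made-up \(2 \times 3\) table:

```python
import numpy as np
from scipy import stats

# hypothetical 2x3 contingency table of observed counts O_ij
table = np.array([[30, 45, 25],
                  [20, 35, 45]])

chi2_stat, p_value, dof, expected = stats.chi2_contingency(table)
assert dof == (table.shape[0] - 1) * (table.shape[1] - 1)

# expected counts follow E_ij = R_i * C_j / n
n = table.sum()
e00 = table.sum(axis=1)[0] * table.sum(axis=0)[0] / n
assert np.isclose(expected[0, 0], e00)
```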

posted @ 2022-02-23 10:46 veager