Sampling Methods


1. Overview

Three types of sampling methods:

1.1 Simple Random Sampling Method

  • Each unit in the population is first assigned an identifier (a number); the sample is then obtained by selecting these numbers at random.

  • Advantages: Simple random sampling works best if you have a lot of time and resources to conduct your study, or if you are studying a limited population that can easily be sampled.

  • Disadvantages: Very large samples may be required to obtain sufficient data on minority options of particular interest.
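
The procedure described above can be sketched in Python. This is a minimal illustration, not a survey-grade implementation: the function name `simple_random_sample`, the population size and the seed are all assumptions made for the example.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw unit identifiers uniformly at random, without replacement."""
    rng = random.Random(seed)
    identifiers = range(population_size)  # one identifier (number) per unit
    return rng.sample(identifiers, sample_size)


# Hypothetical population of 1000 units, from which 50 are drawn.
sample_ids = simple_random_sample(population_size=1000, sample_size=50, seed=0)
print(sorted(sample_ids)[:5])
```

Every unit has the same chance of selection, which is why a rare subgroup may contribute only a handful of observations unless the sample is very large.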

1.2 Stratified Random Sampling Method

  • A priori information is first used to subdivide the population into homogeneous strata with respect to the stratifying variable (e.g., race, gender identity, location), and then simple random sampling is conducted inside each stratum using the same sampling rate.

  • Advantages: Allows for the correct proportions of each stratum in the sample to be obtained.

  • Disadvantages: Issues arise when the population contains relatively small subgroups, which may lack representation just as they would in a simple random sample.
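
A minimal Python sketch of this procedure follows. It assumes the stratum of every unit is known in advance; the function name `stratified_sample`, the 650/350 split of the hypothetical population and the 10% sampling rate are illustrative choices, not values from the source.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, rate, seed=None):
    """Group units into strata, then draw a simple random sample in each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    sample = []
    for label in sorted(strata):
        members = strata[label]
        n = round(rate * len(members))  # same sampling rate in every stratum
        sample.extend(rng.sample(members, n))
    return sample


# Hypothetical population: 650 low-income (LI) and 350 high-income (HI) units.
population = [("LI", i) for i in range(650)] + [("HI", i) for i in range(350)]
sample = stratified_sample(population, stratum_of=lambda u: u[0], rate=0.1, seed=0)
print(sum(1 for u in sample if u[0] == "LI"), sum(1 for u in sample if u[0] == "HI"))
```

Because the rate is identical across strata, the sample reproduces the population proportions of the stratifying variable exactly, up to rounding of the stratum sample sizes.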

1.3 Choice-based Sampling Method

  • The population is stratified according to the outcome of the choice process under consideration. This method is fairly common in transport studies.

  • Advantage: Data may be produced at a much lower cost than with the other sampling methods.

  • Disadvantages: Sample formed may not be random, and the risk of bias in expected values is greater.

2. Conceptualisation of the Sampling Problem

Assume that each sampled observation \(i\) may be described on the basis of the following two variables:

  • \(Y \in \{1, 2, \cdots, N\}\) is the observed choice of the sampled individual (e.g., took a bus)

  • \(\boldsymbol{X}\) is a vector of characteristics (attributes) of the individual (e.g., income) and of the alternatives in their choice set (e.g., travel time)

Assume that the underlying travel mode choice process in the population may be represented by a model with parameters \(\boldsymbol{\theta}\) (e.g., a logit-based mode choice model).

  • The joint probability distribution of \(Y\) and \(\boldsymbol{X}\) is given by: \(\Pr(Y, \boldsymbol{X}| \boldsymbol{\theta})\)

  • The probability of choosing alternative \(Y\) among a set of options with attributes \(\boldsymbol{X}\) is \(\Pr(Y|\boldsymbol{X}, \boldsymbol{\theta})\).

2.1 Simple random sampling method

For the simple random sampling method, the joint distribution of \(Y\) and \(\boldsymbol{X}\) in the sample is identical to that in the population, i.e.

\[f(Y, \boldsymbol{X}| \boldsymbol{\theta}) = \Pr(Y, \boldsymbol{X} | \boldsymbol{\theta}) \]

It is a special case of the stratified sampling method in which \(f(\boldsymbol{X}) = \Pr(\boldsymbol{X})\), because:

\[f(Y, \boldsymbol{X}| \boldsymbol{\theta}) = f(\boldsymbol{X}) f(Y| \boldsymbol{X}, \boldsymbol{\theta})= \Pr(\boldsymbol{X}) \Pr(Y| \boldsymbol{X}, \boldsymbol{\theta}) = \Pr(Y, \boldsymbol{X} | \boldsymbol{\theta}) \]

2.2 Stratified or exogenous sampling method

For the stratified sampling method, the sample is not completely random w.r.t. certain independent variables of the travel choice model (e.g. high income and low income).

  • The sampling process is defined by function \(f(\boldsymbol{X})\), giving the probability of finding an observation with characteristics \(\boldsymbol{X}\). In the population this probability is \(\Pr(\boldsymbol{X})\).

  • The distribution of \(Y\) and \(\boldsymbol{X}\) in the sample is thus given by:

    \[f(Y, \boldsymbol{X}| \boldsymbol{\theta}) = f(\boldsymbol{X}) \Pr(Y | \boldsymbol{X}, \boldsymbol{\theta}) \]

  • Thus, we obtain a property of the stratified sampling method:

    \[f(Y|\boldsymbol{X}, \boldsymbol{\theta}) = \Pr(Y|\boldsymbol{X}, \boldsymbol{\theta}) \]

2.3 Choice-based sampling method

For the choice-based sampling method:

  • The sampling procedure is defined by a function \(f(Y)\), giving the probability of finding an observation that chooses option \(Y\) (i.e. it is stratified according to the choice).

  • The distribution of \(Y\) and \(\boldsymbol{X}\) in the sample is given by:

    \[f(Y, \boldsymbol{X}| \boldsymbol{\theta}) = f(Y) \Pr(\boldsymbol{X} | Y, \boldsymbol{\theta}) \]

  • Thus, we obtain a property of the choice-based sampling method:

    \[f(\boldsymbol{X}|Y, \boldsymbol{\theta}) = \Pr(\boldsymbol{X}|Y, \boldsymbol{\theta}) \]

  • Applying Bayes' theorem on conditional probability:

    \[\Pr(\boldsymbol{X}|Y, \boldsymbol{\theta}) = \frac{\Pr(\boldsymbol{X},Y|\boldsymbol{\theta})}{\Pr(Y|\boldsymbol{\theta})} = \frac{\Pr(Y|\boldsymbol{X}, \boldsymbol{\theta}) \Pr(\boldsymbol{X})}{\Pr(Y|\boldsymbol{\theta})} \]

    The expression in the denominator may be obtained assuming discrete \(\boldsymbol{X}\) from:

    \[\Pr(Y|\boldsymbol{\theta}) = \sum_{\boldsymbol{X}} \Pr(Y, \boldsymbol{X} | \boldsymbol{\theta}) = \sum_{\boldsymbol{X}} \Pr(Y|\boldsymbol{X}, \boldsymbol{\theta}) \Pr(\boldsymbol{X}) \]

  • The final expression for the joint probability of \(Y\) and \(\boldsymbol{X}\) for a choice-based sample is clearly more complex:

    \[f(Y, \boldsymbol{X}| \boldsymbol{\theta}) = \frac{f(Y) \Pr(Y|\boldsymbol{X}, \boldsymbol{\theta}) \Pr(\boldsymbol{X})}{\displaystyle \sum_{\boldsymbol{X}} \Pr(Y|\boldsymbol{X}, \boldsymbol{\theta}) \Pr(\boldsymbol{X})} \]
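
The final expression can be evaluated directly when \(\boldsymbol{X}\) is discrete. The sketch below is one way to do so, assuming the choice model is supplied as a table of \(\Pr(Y|\boldsymbol{X}, \boldsymbol{\theta})\); the function name `choice_based_joint` is hypothetical, and the input numbers are borrowed from the bus/car example in Section 3 purely for illustration.

```python
def choice_based_joint(f_y, pr_y_given_x, pr_x):
    """Return f(Y, X) = f(Y) Pr(Y|X) Pr(X) / sum_X Pr(Y|X) Pr(X)."""
    joint = {}
    for y, share in f_y.items():
        # Denominator: the population market share Pr(Y) of alternative y.
        pr_y = sum(pr_y_given_x[x][y] * pr_x[x] for x in pr_x)
        for x in pr_x:
            joint[(y, x)] = share * pr_y_given_x[x][y] * pr_x[x] / pr_y
    return joint


# Illustrative inputs (B = bus, C = car, LI/HI = low/high income).
pr_x = {"LI": 0.65, "HI": 0.35}
pr_y_given_x = {
    "LI": {"B": 0.45 / 0.65, "C": 0.20 / 0.65},
    "HI": {"B": 0.15 / 0.35, "C": 0.20 / 0.35},
}
f_y = {"B": 0.75, "C": 0.25}  # shares of each choice imposed by the sample design

joint = choice_based_joint(f_y, pr_y_given_x, pr_x)
for cell, value in joint.items():
    print(cell, round(value, 4))
```

The denominator is what makes this case harder than stratified sampling: it depends on the model parameters through \(\Pr(Y|\boldsymbol{X}, \boldsymbol{\theta})\), so it does not drop out of the sample likelihood.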

2.4 Corollary

2.4.1 For stratified sampling

For a stratified sample, we have

\[f(Y| \boldsymbol{X}) = \frac{f(Y, \boldsymbol{X})}{f(\boldsymbol{X})} = \frac{\Pr(Y,\boldsymbol{X})}{\Pr(\boldsymbol{X})} = \Pr(Y| \boldsymbol{X}) \]

2.4.2 For choice-based sampling

For a choice-based sample, we have

\[f(\boldsymbol{X} | Y) = \frac{f(Y, \boldsymbol{X})}{f(Y)} = \frac{\Pr(Y,\boldsymbol{X})}{\Pr(Y)} = \Pr(\boldsymbol{X} | Y) \]

We also have

\[\begin{align*} \Pr(Y|\boldsymbol{X}) &= \frac{\Pr(Y, \boldsymbol{X})}{\Pr(\boldsymbol{X})} = \frac{\Pr(\boldsymbol{X}, Y)}{\displaystyle \sum_{Y} \Pr(\boldsymbol{X}, Y)} \\ \\ &= \frac{\Pr(\boldsymbol{X} | Y) \Pr(Y)}{\sum_{Y} \Pr(\boldsymbol{X} | Y ) \Pr(Y)} = \frac{ f(\boldsymbol{X} | Y) \Pr(Y)}{\sum_{Y} f(\boldsymbol{X} | Y) \Pr(Y)} \\ \\ &= \frac{\Pr(Y) \dfrac{f(\boldsymbol{X}, Y)}{f(Y)}}{\displaystyle \sum_{Y} \left[ \Pr(Y) \dfrac{f(\boldsymbol{X}, Y)}{f(Y)} \right]} = \frac{w_Y f(\boldsymbol{X}, Y)}{\displaystyle \sum_{Y} w_Y f(\boldsymbol{X}, Y)} \end{align*} \]

where \(w_Y = \Pr(Y) / f(Y)\).

3. Example

The joint distribution of mode choice and income in the population is given below. A simple random sample has the same distribution.

\[\begin{array}{c|cc|c} & \text{Low income}& \text{High income} & \text{Total} \\ \hline \text{Bus user} & 0.45 & 0.15 & 0.60 \\ \text{Car user} & 0.20 & 0.20 & 0.40 \\ \hline \text{Total} & 0.65 & 0.35 & 1.00 \\ \end{array} \]

For the stratified sampling, consider a sample with 75% low income (LI) and 25% high income (HI), that is to say: \(f(\boldsymbol{X}=\text{LI}) = 0.75\) and \(f(\boldsymbol{X} = \text{HI}) = 0.25\). Then, the joint probability distribution of the sample \(f(Y, \boldsymbol{X})\), taking \(f(Y=\text{B}, \boldsymbol{X}=\text{LI})\) as an example, can be calculated as:

\[\begin{align*} f(Y=\text{B}, \boldsymbol{X}=\text{LI}) &= f(\boldsymbol{X} = \text{LI}) \Pr(Y=\text{B}| \boldsymbol{X}=\text{LI}) \\ &= f(\boldsymbol{X} = \text{LI}) \frac{\Pr(Y=\text{B},\boldsymbol{X}=\text{LI})}{\Pr(\boldsymbol{X}=\text{LI})} \\ &= 0.75 \times \frac{0.45}{0.65} = 0.519 \end{align*} \]

Repeating the calculation for every cell gives the joint probability distribution of the sample \(f(Y, \boldsymbol{X})\) as follows:

\[\begin{array}{c|cc|c} & \text{Low income}& \text{High income} & \text{Total} \\ \hline \text{Bus user} & 0.519 & 0.107 & 0.626 \\ \text{Car user} & 0.231 & 0.143 & 0.374 \\ \hline \text{Total} & 0.750 & 0.250 & 1.000 \\ \end{array} \]
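
The whole table can be reproduced with a few lines of Python. This is a sketch under the assumptions of the example; the dictionary layout and the function name `stratified_joint` are illustrative.

```python
# Population joint distribution Pr(Y, X) (B = bus, C = car, LI/HI = income).
population = {
    ("B", "LI"): 0.45, ("B", "HI"): 0.15,
    ("C", "LI"): 0.20, ("C", "HI"): 0.20,
}
f_x = {"LI": 0.75, "HI": 0.25}  # income shares imposed by the sample design


def stratified_joint(population, f_x):
    """f(Y, X) = f(X) * Pr(Y, X) / Pr(X) for every cell."""
    pr_x = {
        x: sum(p for (_, xx), p in population.items() if xx == x) for x in f_x
    }
    return {(y, x): f_x[x] * p / pr_x[x] for (y, x), p in population.items()}


sample = stratified_joint(population, f_x)
for cell, value in sample.items():
    print(cell, round(value, 3))
```

Only the income margin is changed by the design; within each income column the split between bus and car is the same as in the population.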

For the choice-based sampling, consider a sample with 75% bus users (B) and 25% car users (C), i.e., \(f(Y=\text{B}) = 0.75\) and \(f(Y = \text{C}) = 0.25\). Then, the joint probability distribution of the sample \(f(Y, \boldsymbol{X})\), taking \(f(Y=\text{B}, \boldsymbol{X}=\text{LI})\) as an example, can be calculated as:

\[\begin{align*} f(Y=\text{B}, \boldsymbol{X}=\text{LI}) &= f(Y = \text{B}) \Pr(\boldsymbol{X}=\text{LI}|Y=\text{B}) \\ &= f(Y = \text{B}) \frac{\Pr(Y=\text{B},\boldsymbol{X}=\text{LI})}{\Pr(Y=\text{B})} \\ &= 0.75 \times \frac{0.45}{0.60} = 0.563 \end{align*} \]

Repeating the calculation for every cell gives the joint probability distribution of the sample \(f(Y, \boldsymbol{X})\) for the choice-based sampling method as follows:

\[\begin{array}{c|cc|c} & \text{Low income}& \text{High income} & \text{Total} \\ \hline \text{Bus user} & 0.563 & 0.187 & 0.750 \\ \text{Car user} & 0.125 & 0.125 & 0.250 \\ \hline \text{Total} & 0.688 & 0.312 & 1.000 \\ \end{array} \]
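
As a sanity check, the table can be approximated by simulation. The sketch below builds a synthetic population from the population distribution and then draws fixed quotas of bus and car users. The population size, the quotas and the seed are arbitrary assumptions, and the simulated frequencies only approximate the analytic values.

```python
import random
from collections import Counter

rng = random.Random(0)

# Synthetic population drawn from Pr(Y, X); each unit is a (mode, income) pair.
cells = [("B", "LI"), ("B", "HI"), ("C", "LI"), ("C", "HI")]
probabilities = [0.45, 0.15, 0.20, 0.20]
population = rng.choices(cells, weights=probabilities, k=100_000)


def draw_choice_based(population, quotas, rng):
    """Sample a fixed number of units per chosen mode (stratify on the choice)."""
    sample = []
    for mode, n in quotas.items():
        users = [unit for unit in population if unit[0] == mode]
        sample.extend(rng.sample(users, n))
    return sample


# 75% bus users and 25% car users, as in the example.
sample = draw_choice_based(population, {"B": 7500, "C": 2500}, rng)
freq = {cell: count / len(sample) for cell, count in Counter(sample).items()}
for cell in cells:
    print(cell, round(freq[cell], 3))
```

The unrounded analytic cells are 0.5625, 0.1875, 0.125 and 0.125, so the third decimal shown in the table depends on the rounding convention.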

4. Calibrating a choice-based sampling model

For the stratified sampling method, within a specific stratum \(\boldsymbol{X}\), the mode-choice probabilities are identical to those obtained with the simple random sampling method, i.e.:

\[f(Y| \boldsymbol{X}) = \frac{f(Y, \boldsymbol{X})}{f(\boldsymbol{X})} = \Pr(Y| \boldsymbol{X}) \]

For example, for travelers in the high-income stratum (\(\boldsymbol{X}=\text{HI}\)), we obtain:

\[\begin{align*} & f(Y=\text{B} | \boldsymbol{X}=\text{HI}) = \frac{0.107}{0.250} = 0.428,\ && f(Y=\text{C} | \boldsymbol{X}=\text{HI}) = \frac{0.143}{0.250} = 0.572 && \text{for stratified sampling} \\ \\ & \Pr(Y=\text{B} | \boldsymbol{X}=\text{HI}) = \frac{0.15}{0.35} = 0.429,\ && \Pr(Y=\text{C} | \boldsymbol{X}=\text{HI}) = \frac{0.20}{0.35} = 0.571 && \text{for simple random sampling} \end{align*} \]

The last decimal places differ only because of rounding error.

For the choice-based sampling method, however, the conditional probabilities are quite different:

\[\begin{align*} & f(Y=\text{B} | \boldsymbol{X}=\text{HI}) = \frac{0.187}{0.312} = 0.599,\ && f(Y=\text{C} | \boldsymbol{X}=\text{HI}) = \frac{0.125}{0.312} = 0.401 && \text{for choice-based sampling} \end{align*} \]
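
The comparison for the high-income stratum can be checked numerically without intermediate rounding. The helper names `rescale_margin` and `choice_given_income` are hypothetical; the sketch rebuilds both sample distributions from the population table and then conditions on income.

```python
# Population joint distribution Pr(Y, X); cells are (mode, income).
population = {
    ("B", "LI"): 0.45, ("B", "HI"): 0.15,
    ("C", "LI"): 0.20, ("C", "HI"): 0.20,
}


def rescale_margin(joint, axis, target):
    """Rescale `joint` so its margin on `axis` (0 = Y, 1 = X) equals `target`."""
    margin = {}
    for cell, p in joint.items():
        margin[cell[axis]] = margin.get(cell[axis], 0.0) + p
    return {
        cell: p * target[cell[axis]] / margin[cell[axis]]
        for cell, p in joint.items()
    }


def choice_given_income(joint, income):
    """Conditional distribution of the mode given the income group."""
    total = sum(p for (_, x), p in joint.items() if x == income)
    return {y: p / total for (y, x), p in joint.items() if x == income}


stratified = rescale_margin(population, 1, {"LI": 0.75, "HI": 0.25})
choice_based = rescale_margin(population, 0, {"B": 0.75, "C": 0.25})

for name, joint in [("population", population),
                    ("stratified", stratified),
                    ("choice-based", choice_based)]:
    print(name, choice_given_income(joint, "HI"))
```

With unrounded cells the choice-based value for bus is exactly 0.1875 / 0.3125 = 0.600; the 0.599 in the text comes from using the rounded table entries.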

Data from choice-based samples can still be used in model estimation without bias, provided that the actual market shares \(\Pr(Y)\) are known. Each observation is weighted by a factor that depends on its chosen alternative:

\[w = \frac{\Pr(Y)}{f(Y)} \]

Thus, we can calculate the weight for each mode:

\[w_{\text{B}} = \frac{0.60}{0.750} = 0.8, \qquad w_{\text{C}} = \frac{0.40}{0.250} = 1.6 \]

Then the unbiased mode-choice probabilities of a high-income traveler are:

\[\begin{align*} & \Pr(Y=\text{B} | \boldsymbol{X}=\text{HI}) = \frac{0.187 w_{\text{B}}}{0.187 w_{\text{B}} + 0.125 w_{\text{C}}} = \frac{0.1496}{0.3496} = 0.428 \\ \\ & \Pr(Y=\text{C} | \boldsymbol{X}=\text{HI}) = \frac{0.125 w_{\text{C}}}{0.187 w_{\text{B}} + 0.125 w_{\text{C}}} = \frac{0.2000}{0.3496} = 0.572 \end{align*} \]
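
The weighting step can be sketched as follows. The function name `weighted_choice_probabilities` is hypothetical, and the sample cells are the unrounded choice-based values (0.5625, 0.1875, 0.125, 0.125), so the result matches the population value exactly rather than to three decimals.

```python
# Choice-based sample distribution f(Y, X), unrounded; cells are (mode, income).
sample = {
    ("B", "LI"): 0.5625, ("B", "HI"): 0.1875,
    ("C", "LI"): 0.125, ("C", "HI"): 0.125,
}
market_share = {"B": 0.60, "C": 0.40}  # known population shares Pr(Y)


def weighted_choice_probabilities(sample, market_share, income):
    """Recover Pr(Y | X = income) using the weights w_Y = Pr(Y) / f(Y)."""
    f_y = {}
    for (y, _), p in sample.items():
        f_y[y] = f_y.get(y, 0.0) + p  # sample share of each mode
    weights = {y: market_share[y] / f_y[y] for y in f_y}

    weighted = {
        y: weights[y] * p for (y, x), p in sample.items() if x == income
    }
    total = sum(weighted.values())
    return {y: value / total for y, value in weighted.items()}


corrected = weighted_choice_probabilities(sample, market_share, "HI")
print({mode: round(p, 3) for mode, p in corrected.items()})
```

The only external information required is the vector of market shares; the income distribution of the population is never used.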

Reference

[1] Juan de Dios Ortúzar and Luis G. Willumsen, "3.1 Basic Sampling Theory", in Modelling Transport, Fourth Edition, John Wiley & Sons, 2011, pp. 55-64.

posted @ 2022-02-13 19:58  veager