Descriptive statistics
Descriptive statistics are summary analyses of data.
Unlike inferential statistics, descriptive statistics are not developed on the basis of probability theory, and they are frequently non-parametric.
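Central tendency
Mean
The mean reflects the general level of the objects being measured (a typical value). For example, if the average height of a fifth-grade class is 155 cm, that figure tells us neither each pupil's height nor how the heights are distributed, but it does at least tell us that a typical height is around 155 cm rather than 165 cm or 145 cm.
The mean is easily affected by outliers, whereas the median is largely unaffected.
When to use the mean:
- There are no outliers, or the outliers can be removed (always check for outliers when using the mean).
- Outliers do not matter, because the question being asked is about the mean itself.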
Median
Disadvantage: the data must be sorted.
Advantage: not affected by outliers; more robust (robust statistics).
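The contrast between the mean and the median can be seen in a minimal sketch. The heights below are made up for illustration (they are not data from this post), and the outlier is imagined as a typing mistake.

```python
from statistics import mean, median

# Hypothetical heights (cm) of five pupils; illustrative values only.
heights = [150, 152, 155, 156, 157]

# The same data with one value mistyped (1550 instead of 155): an outlier.
heights_with_outlier = [150, 152, 1550, 156, 157]

mean_clean = mean(heights)
median_clean = median(heights)
mean_outlier = mean(heights_with_outlier)
median_outlier = median(heights_with_outlier)

# The mean jumps far away from any real height,
# while the median moves by only one position in the sorted data.
print("mean:  ", mean_clean, "->", mean_outlier)
print("median:", median_clean, "->", median_outlier)
```

A single bad value is enough to make the mean useless as a "typical value" here, which is why checking for outliers comes first whenever the mean is reported.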
Mode
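Dispersion (variability)
- Variance and standard deviation
- The 1st raw moment: the mean (expectation). (The 1st central moment is always 0.)
- The 2nd central moment: the variance, or more intuitively its square root, the standard deviation. It reflects the average spread of the data around the mean.
- The mean and variance alone still do not determine a distribution: two distributions with the same mean and variance can differ. A particular example is a left-skewed distribution and its right-skewed mirror image about the mean, which share the same mean and variance but have opposite skewness (see the figure below). This motivates the 3rd standardized central moment: skewness.
- The 4th standardized central moment: kurtosis (heaviness of the tails of the distribution).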
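The moments listed above can be computed directly from their definitions. The following is a minimal sketch, assuming the population form (dividing by n) and made-up data; the helper names `central_moment`, `skewness` and `kurtosis` are introduced here for illustration only. It also checks the mirror-image example: same mean, same variance, opposite skewness.

```python
def central_moment(xs, k):
    """k-th central moment, population form (divides by n)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)


def skewness(xs):
    """3rd central moment divided by sd**3 (standardized)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def kurtosis(xs):
    """4th central moment divided by variance**2 (standardized, not excess)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2


# Hypothetical right-skewed sample: one large value forms the right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 12]
m = sum(right_skewed) / len(right_skewed)

# Mirror image about the mean: the tail now points to the left.
left_skewed = [2 * m - x for x in right_skewed]

for name, xs in [("right", right_skewed), ("left", left_skewed)]:
    print(name,
          "mean =", sum(xs) / len(xs),
          "variance =", central_moment(xs, 2),
          "skewness =", skewness(xs),
          "kurtosis =", kurtosis(xs))
```

Both samples report the same mean, variance and kurtosis; only the sign of the skewness tells them apart.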

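Whether a distribution is left-skewed or right-skewed, the mean tends to be pulled toward the tail, again because the mean is strongly affected by extreme values.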
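A quick way to see the mean being pulled toward the tail is to compare it with the median on a skewed sample. This is a small sketch on made-up numbers, not a general proof.

```python
from statistics import mean, median

# Hypothetical sample with a long right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 12]

# Negating every value flips the sample, so the tail points left.
left_skewed = [-x for x in right_skewed]

right_mean, right_median = mean(right_skewed), median(right_skewed)
left_mean, left_median = mean(left_skewed), median(left_skewed)

# Right tail: the mean sits to the right of the median.
print("right-skewed: mean", right_mean, "median", right_median)
# Left tail: the mean sits to the left of the median.
print("left-skewed:  mean", left_mean, "median", left_median)
```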
Covariance and correlation
cov(X, Y) = E[(X − μ)(Y − ν)], where μ = E(X) and ν = E(Y)
The correlation coefficient is the covariance with the units removed, which makes it a dimensionless quantity: cov(X, Y) / (sd(X) * sd(Y)).
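Note that when sd(X) or sd(Y) is 0, the correlation coefficient is undefined. In that case it is reasonable to treat the variables as having no correlation, because (at least) one of them is just a constant value.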
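The relationship between covariance and correlation can be sketched in a few lines. The code below assumes the population form (dividing by n) and uses invented height and weight values; `covariance` and `correlation` are illustrative helper names, and returning `None` for a zero standard deviation is one possible convention for the undefined case, not the only one.

```python
from math import sqrt


def covariance(xs, ys):
    """Population covariance: the mean of (x - mu) * (y - nu)."""
    mu = sum(xs) / len(xs)
    nu = sum(ys) / len(ys)
    return sum((x - mu) * (y - nu) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Pearson correlation; None when either sd is 0 (undefined)."""
    sd_x = sqrt(covariance(xs, xs))  # cov(X, X) is the variance of X
    sd_y = sqrt(covariance(ys, ys))
    if sd_x == 0 or sd_y == 0:
        return None
    return covariance(xs, ys) / (sd_x * sd_y)


# Hypothetical data: the same heights in two units, plus weights in kg.
heights_cm = [150, 160, 170, 180]
heights_m = [1.5, 1.6, 1.7, 1.8]
weights_kg = [45, 55, 60, 72]

cov_cm = covariance(heights_cm, weights_kg)
cov_m = covariance(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)

print("covariance depends on units:", cov_cm, "vs", cov_m)
print("correlation does not:       ", corr_cm, "vs", corr_m)
print("constant variable:", correlation([1, 2, 3], [5, 5, 5]))
```

Changing centimetres to metres rescales the covariance by a factor of 100 but leaves the correlation unchanged, which is what "dimensionless" means in practice.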
Assumptions of the Pearson correlation coefficient
- Both variables are continuous. If a variable is ordinal, the Spearman correlation can be used instead.
- Related pairs: each participant or observation should have a pair of values, one for each variable. (The pairs are independent of one another [1].)
- Absence of outliers: neither variable should contain outliers. An outlier can skew the result of the correlation by pulling the line of best fit too far in one direction or the other. Typically, an outlier is defined as a value more than 3.29 standard deviations away from the mean.
- Linearity: the points in the scatterplot of the two variables should follow a roughly straight line.
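The outlier rule of thumb above can be checked mechanically before computing a correlation. This is a minimal sketch, assuming the population standard deviation and invented measurements; `flag_outliers` is a hypothetical helper name, and the threshold of 3.29 is simply the value quoted in the list.

```python
from statistics import mean, pstdev


def flag_outliers(values, threshold=3.29):
    """Return the values lying more than `threshold` sds from the mean."""
    m = mean(values)
    sd = pstdev(values)  # population sd; the sample sd is an alternative
    if sd == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs(v - m) / sd > threshold]


# Hypothetical measurements: 19 values near 10 and one extreme value.
sample = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
          10, 9, 12, 10, 11, 10, 9, 11, 10, 60]

outliers = flag_outliers(sample)
print("flagged:", outliers)
```

In a very small sample no value can lie that many standard deviations from the mean, so a check like this is only informative once there are enough observations.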
