多元统计分析讨论四

多元统计分析讨论四

一、问题

UC Berkeley在研究生录取时是否存在性别差异?
以下表格展示了1973年UC Berkeley六个主要研究生项目中男性和女性的申请人数及录取人数,请分析当时是否存在性别差异?

img

二、代码分析

2.1 总体情况(不分专业)

data <- data.frame(
  Major = c("A", "B", "C", "D", "E", "F"),
  MaleApplicants = c(825, 560, 325, 417, 191, 373),
  MaleAdmits = c(512, 353, 120, 138, 53, 22),
  FemaleApplicants = c(108, 25, 593, 375, 393, 341),
  FemaleAdmits = c(89, 17, 202, 202, 94, 24)
)
# --------------------------总体(部分专业)---------------------
## 构建总表
total_male_applicants <- sum(data$MaleApplicants)
total_male_admits <- sum(data$MaleAdmits)
total_female_applicants <- sum(data$FemaleApplicants)
total_female_admits <- sum(data$FemaleAdmits)

overall_table <- matrix(c(
  total_male_admits, total_male_applicants - total_male_admits,
  total_female_admits, total_female_applicants - total_female_admits
), nrow = 2, byrow = TRUE)

colnames(overall_table) <- c("Admitted", "Rejected")
rownames(overall_table) <- c("Male", "Female")

# 卡方检验
chisq.test(overall_table)
# fisher 检验
fisher.test(overall_table)
2.1.1 结果分析
检验方法 p 值 显著性 odds ratio 95% 置信区间 结论
卡方检验 5.17e-12 ✅ 显著 - - 存在总体性别差异
Fisher 精确检验 3.70e-12 ✅ 显著 1.54 [1.36, 1.75] 男性的录取优势明显高于女性
2.1.2 图示

img

2.2 分专业情况

# -----------------分专业分析---------------------------
for (i in 1:nrow(data)) {
  cat("\n--- 专业", data$Major[i], "---\n")
  
  admits <- c(data$MaleAdmits[i], data$MaleApplicants[i] - data$MaleAdmits[i],
              data$FemaleAdmits[i], data$FemaleApplicants[i] - data$FemaleAdmits[i])
  
  major_table <- matrix(admits, nrow = 2, byrow = TRUE)
  colnames(major_table) <- c("Admitted", "Rejected")
  rownames(major_table) <- c("Male", "Female")
  
  print(major_table)
  
  cat("卡方检验:\n")
  print(chisq.test(major_table, simulate.p.value = TRUE))  # 避免样本太小
  
  cat("Fisher精确检验:\n")
  print(fisher.test(major_table))
}

2.2.1 结果分析
专业 卡方 p值 Fisher p值 显著性判断 (p < 0.05) 倾向性别(Fisher OR) 解读
A 0.0005 1.67e-05 ✅ 显著 男性录取比例更高 (OR = 0.35) 存在性别差异,男性录取机会明显低于女性
B 0.6682 0.6771 ❌ 不显著 无明显性别倾向 无差异
C 0.3758 0.3866 ❌ 不显著 略倾向女性 (OR = 1.13) 差异不显著
D 0.0005 5.05e-09 ✅ 显著 女性录取比例更高 (OR = 0.42) 存在性别差异,男性录取机会明显低于女性
E 0.3458 0.3604 ❌ 不显著 略倾向女性 (OR = 1.22) 差异不显著
F 0.5482 0.5458 ❌ 不显著 无明显性别倾向 无差异
2.2.2 图示

img

img

三、结论

从总体看,是存在显著的性别差异。但在分专业分析中,只有A和D两个专业的卡方检验和Fisher精确检验都显示出了显著的性别差异,而其余专业则没有显著的性别差异,或者说是在统计意义上不显著。那基于此,我们在考虑研究生录取的性别差异时,需要具体分析各个专业的具体情况。否则,可能造成辛普森悖论。

四、附录

4.1 检验方式适用情况

  • 卡方检验:k-ways
  • Fisher精确检验:two-ways

4.2 OR解释规则

OR 解释规则(通用)[男性处于分子]:
OR = 1:男女录取机会相等

OR < 1:男性录取机会 低于 女性

OR > 1:男性录取机会 高于 女性

4.3 代码

# --------------------------------------------------------------------------

data <- data.frame(
  Major = c("A", "B", "C", "D", "E", "F"),
  MaleApplicants = c(825, 560, 325, 417, 191, 373),
  MaleAdmits = c(512, 353, 120, 138, 53, 22),
  FemaleApplicants = c(108, 25, 593, 375, 393, 341),
  FemaleAdmits = c(89, 17, 202, 202, 94, 24)
)
# --------------------------总体(部分专业)---------------------
## 构建总表
total_male_applicants <- sum(data$MaleApplicants)
total_male_admits <- sum(data$MaleAdmits)
total_female_applicants <- sum(data$FemaleApplicants)
total_female_admits <- sum(data$FemaleAdmits)

overall_table <- matrix(c(
  total_male_admits, total_male_applicants - total_male_admits,
  total_female_admits, total_female_applicants - total_female_admits
), nrow = 2, byrow = TRUE)

colnames(overall_table) <- c("Admitted", "Rejected")
rownames(overall_table) <- c("Male", "Female")

# 卡方检验
chisq.test(overall_table)

# fisher 检验
fisher.test(overall_table)



# -----------------分专业分析---------------------------
for (i in 1:nrow(data)) {
  cat("\n--- 专业", data$Major[i], "---\n")
  
  admits <- c(data$MaleAdmits[i], data$MaleApplicants[i] - data$MaleAdmits[i],
              data$FemaleAdmits[i], data$FemaleApplicants[i] - data$FemaleAdmits[i])
  
  major_table <- matrix(admits, nrow = 2, byrow = TRUE)
  colnames(major_table) <- c("Admitted", "Rejected")
  rownames(major_table) <- c("Male", "Female")
  
  print(major_table)
  
  cat("卡方检验:\n")
  print(chisq.test(major_table, simulate.p.value = TRUE))  # 避免样本太小
  
  cat("Fisher精确检验:\n")
  print(fisher.test(major_table))
}


# 各专业按性别的录取率条形图
library(ggplot2)
library(dplyr)

# 构建长格式数据
data_long <- data.frame(
  Major = rep(data$Major, each = 2),
  Gender = rep(c("Male", "Female"), times = nrow(data)),
  Applicants = c(data$MaleApplicants, data$FemaleApplicants),
  Admits = c(data$MaleAdmits, data$FemaleAdmits)
)

data_long <- data_long %>%
  mutate(AdmitRate = Admits / Applicants)

# 绘制条形图
ggplot(data_long, aes(x = Major, y = AdmitRate, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "各专业中按性别的录取率",
    x = "专业",
    y = "录取率"
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1")

# 每个专业的录取人数与申请人数气泡图(比例视角)
data_long <- data_long %>%
  mutate(Rejection = Applicants - Admits)

ggplot(data_long, aes(x = Admits, y = Rejection, color = Gender, size = Applicants)) +
  geom_point(alpha = 0.7) +
  facet_wrap(~ Major) +
  labs(
    title = "各专业按性别的申请/录取人数分布",
    x = "录取人数",
    y = "未录取人数"
  ) +
  theme_light() +
  scale_color_brewer(palette = "Set1")



# 热图:各专业的 Fisher 检验 p 值热图 + odds ratio
library(ggplot2)
library(tibble)

# 汇总统计结果
summary_df <- tribble(
  ~Major, ~FisherP, ~OR,
  "A", 1.67e-05, 0.35,
  "B", 0.6771, 0.80,
  "C", 0.3866, 1.13,
  "D", 5.05e-09, 0.42,
  "E", 0.3604, 1.22,
  "F", 0.5458, 0.83
)

# 绘图
ggplot(summary_df, aes(x = Major, y = -log10(FisherP), fill = OR)) +
  geom_bar(stat = "identity") +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "red") +
  scale_fill_gradient2(low = "blue", mid = "white", high = "darkred", midpoint = 1) +
  labs(
    title = "各专业性别差异显著性(Fisher检验)",
    x = "专业",
    y = "-log10(p值)(越高越显著)",
    fill = "OR(女性/男性)"
  ) +
  theme_minimal()



# 总体录取率按性别条形图
overall_df <- data.frame(
  Gender = c("Male", "Female"),
  Admits = c(sum(data$MaleAdmits), sum(data$FemaleAdmits)),
  Applicants = c(sum(data$MaleApplicants), sum(data$FemaleApplicants))
)
overall_df$Rejects <- overall_df$Applicants - overall_df$Admits
overall_df$Rate <- overall_df$Admits / overall_df$Applicants

library(ggplot2)
ggplot(overall_df, aes(x = Gender, y = Rate, fill = Gender)) +
  geom_bar(stat = "identity", width = 0.5) +
  labs(
    title = "UC Berkeley 1973年整体录取率对比(按性别)",
    y = "录取率", x = "性别"
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1") +
  geom_text(aes(label = scales::percent(Rate)), vjust = -0.5)

# 分专业录取率对比 + 总体线标识
# 使用 data_long 数据
ggplot(data_long, aes(x = Major, y = AdmitRate, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_hline(yintercept = overall_df$Rate[1], linetype = "dashed", color = "#1f78b4", size = 0.5) +
  geom_hline(yintercept = overall_df$Rate[2], linetype = "dashed", color = "#e31a1c", size = 0.5) +
  labs(
    title = "各专业性别录取率对比(含总体录取率)",
    y = "录取率",
    x = "专业"
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set1")


# 整体显著但分专业不显著”的辛普森悖论现象
library(dplyr)
library(tidyr)

# 每专业录取率差异(男 - 女)
delta_rate <- data_long %>%
  select(Major, Gender, AdmitRate) %>%
  pivot_wider(names_from = Gender, values_from = AdmitRate) %>%
  mutate(Diff = Male - Female)

ggplot(delta_rate, aes(x = Major, y = Diff)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "各专业男性与女性录取率差(男 - 女)", y = "录取率差", x = "专业") +
  theme_minimal()

posted @ 2025-05-11 21:29  xia0ya  阅读(25)  评论(0)    收藏  举报