ADS R知识总结

Knit skills

eval=false has no output shown

hist(rnorm(10000), col = 'tomato') # eval

echo=false has no code shown

hist(rnorm(10000), col = 'tomato') # echo

include=false has no output and code shown

hist(rnorm(10000), col = 'tomato') # include

warning=false has no warning shown

library(ggplot2)

error=false has no error shown

#num <- "ADS" + 100

library:: maybe useful when knit fail

#knitr::kable(summary(data))

convert to xelatex:

output: 
  pdf_document:
    latex_engine: xelatex

Data process

import data

data <- read.csv("path", header = TRUE, sep = ",")
# data <- read.table("path) # when process txt
head(data)

basic examination

summary(data)
unique(data$col)
str(data)

examine NA

is.na(data)
# summary(is.na(data))
# sum(is.na(data))
# anyNA(data)
# sum(!complete.cases(data)) # library DMwR

data <- na.omit(data)
# data.dropna(subset=['case_id', 'date_onset', 'age'])

examine duplication

duplicated(data)

data <- distinct(data) # dplyr
# data <- unique(data)

to factor

teeth <- teeth %>% 
	mutate(dose = factor(dose, levels = c(0.5, 1, 2), ordered = T), 
           supp = as.factor(supp)) %>% 
	relocate(supp, dose) str(teeth)

wide to long

library(tidyr)
data <- data.frame( 
	ID = 1:10, 
	Before = c(5, 3, 6, 7, 4, 6, 8, 7, 5, 6), 
	After = c(7, 4, 6, 9, 5, 8, 9, 8, 7, 7))
data_long <- data %>% 
	pivot_longer(cols = c(Before, After), names_to = "Time", values_to = "Value")

long to wide

data_wide <- data_long %>% 
	pivot_wider(names_from = Time, values_from = Value)

Visualization

三分组变量箱线图：

ggplot(data, aes(x = factor_variable, 
				 y = numeric_variable, 
				 fill = group_variable)) + 
	geom_boxplot() + 
	labs(title = "Boxplot of Numeric Variable by Factor and Group Variables", x = "Factor Variable", y = "Numeric Variable") + 
	scale_fill_brewer(palette = "Set3") + 
	theme_minimal() + 
	theme(plot.title = element_text(hjust = 0.5))

Statistics

t-test:

t.test(variable1, variable2, var.equal = TRUE)

u-test:

H0: 两组分布相同
HA: 两组分布不同

wilcox.test(variable1, variable2)

ANOVA:

Choose: 先说有几组，有几个factor，确定用什么ANOVA（或Kruskal-Wallis test）

anova_result <- aov(variable1 ~ group1 * group2, data = data)

Justify: Assumptions for a 2-way ANOVA test are:
1. Independence of observations.
  1. We can assume it at once.
2. Normality of residuals:
  1. Can be checked visually (plot(anova_model, 2)) or by running a suitable hypothesis test
3. Equality of variance:
  1. Can be checked visually (plot(anova_model, 1)) or by running a suitable hypothesis test
4. Equal group size (have to use different types of SS calculation for the ANOVA table if this requirement is violated):
  1. The group size can be noticed in the data diagnosis step
5. Finally, we can use parameter ANOVA.
statistical hypotheses:
1. H0: means of different supp groups are the same
2. H1: means of different supp groups are NOT the same
Carry test:

summary(anova_result)
TukeyHSD(anova_result)

Suggestions:
1. 缺乏未治疗组（对照）
2. 其他生物学意义

Chi-square:

Justify:

Chi-square goodness-of-fit test
Chi-square test for homogeneity
Chi-square test for independence
Assumptions:
• The variables must be categorical.
– Fits.
• Observations must be independent.
– Can assume from the task. Fits.
• Cells in the contingency table are mutually exclusive.
– Fits.
• The expected value of cells should be 5 or greater in at least 80% of cells.
– See Table. Fits.

chisq.test(x = table, p = predict) # goodness-of-fit

Bootstrapping

if data are categorical but seriously lacking in independence

first_satisfied <- 864 
first_unsatisfied <- 714 
second_satisfied <- 980 
second_unsatisfied <- 473 
first_bootstraps <- vector() 
second_bootstraps <- vector() 
first_results <- c(rep(1, first_satisfied), rep(0, first_unsatisfied))
second_results <- c(rep(1, second_satisfied), rep(0, second_unsatisfied)) for (a in 1:100) { 
	first_sample <-
		mean(sample(first_results, length(first_results),replace=T))
	second_sample <- 
		mean(sample(second_results,length(second_results),replace=T))

	first_bootstraps<-c(first_bootstraps,first_sample) 
	second_bootstraps<-c(second_bootstraps,second_sample)
}
first_upper<-quantile(first_bootstraps,probs= c(0.975)) 
second_lower<-quantile(second_bootstraps,probs=c(0.025)) 
boxplot( 
	first_bootstraps, 
	second_bootstraps, 
	notch= T, 
	names= c('early', 'late'), 
	ylab='Prop.ofsatisfiedbuttonpresses' )

然后看是否一组的上interval和一组的下interval相交

first_upper < second_lower

Bayes

P(A|B) = P(B|A)*P(A)/P(B)

# compare two hypotheses:
P(H1|D)/P(H2|D) = P(D|H1)/P(D|H2) * P(H1)/P(H2)

# Bayes Factor:
P(D|H1)/P(D|H2)

K-means

visualization

ggplot(data, aes(x = ln_hap1, y = hap2, color = type)) +
# if cluster result then type -> cluster
  geom_point() +
  theme_minimal() +
  labs(title = "xxx") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

clustering

features <- data[, c("hap1", "hap2")]
set.seed(123)
k <- 5
kmeans_result <- kmeans(features, centers = k, nstart = 1)

Finally:

Sys.time()

posted @ 2024-05-30 13:15 CarrotCo 阅读(35) 评论(0) 收藏举报

刷新页面返回顶部