ADS R知识总结

Knit skills

eval=false has no output shown

hist(rnorm(10000), col = 'tomato') # eval

echo=false has no code shown

hist(rnorm(10000), col = 'tomato') # echo

include=false has no output and code shown

hist(rnorm(10000), col = 'tomato') # include

warning=false has no warning shown

library(ggplot2)

error=false has no error shown

#num <- "ADS" + 100

library:: maybe useful when knit fail

#knitr::kable(summary(data))

convert to xelatex:

output: 
  pdf_document:
    latex_engine: xelatex

Data process

import data

data <- read.csv("path", header = TRUE, sep = ",")
# data <- read.table("path) # when process txt
head(data)

basic examination

summary(data)
unique(data$col)
str(data)

examine NA

is.na(data)
# summary(is.na(data))
# sum(is.na(data))
# anyNA(data)
# sum(!complete.cases(data)) # library DMwR

data <- na.omit(data)
# data.dropna(subset=['case_id', 'date_onset', 'age'])

examine duplication

duplicated(data)

data <- distinct(data) # dplyr
# data <- unique(data)

to factor

teeth <- teeth %>% 
	mutate(dose = factor(dose, levels = c(0.5, 1, 2), ordered = T), 
           supp = as.factor(supp)) %>% 
	relocate(supp, dose) str(teeth)

wide to long

library(tidyr)
data <- data.frame( 
	ID = 1:10, 
	Before = c(5, 3, 6, 7, 4, 6, 8, 7, 5, 6), 
	After = c(7, 4, 6, 9, 5, 8, 9, 8, 7, 7))
data_long <- data %>% 
	pivot_longer(cols = c(Before, After), names_to = "Time", values_to = "Value")

long to wide

data_wide <- data_long %>% 
	pivot_wider(names_from = Time, values_from = Value)

Visualization

三分组变量箱线图:

ggplot(data, aes(x = factor_variable, 
				 y = numeric_variable, 
				 fill = group_variable)) + 
	geom_boxplot() + 
	labs(title = "Boxplot of Numeric Variable by Factor and Group Variables", x = "Factor Variable", y = "Numeric Variable") + 
	scale_fill_brewer(palette = "Set3") + 
	theme_minimal() + 
	theme(plot.title = element_text(hjust = 0.5))

Statistics

t-test:

t.test(variable1, variable2, var.equal = TRUE)

u-test:

  • H0: 两组分布相同
  • HA: 两组分布不同
wilcox.test(variable1, variable2)

ANOVA:

  1. Choose: 先说有几组,有几个factor,确定用什么ANOVA(或Kruskal-Wallis test)
anova_result <- aov(variable1 ~ group1 * group2, data = data) 
  1. Justify: Assumptions for a 2-way ANOVA test are:
    1. Independence of observations.
      1. We can assume it at once.
    2. Normality of residuals:
      1. Can be checked visually (plot(anova_model, 2)) or by running a suitable hypothesis test
    3. Equality of variance:
      1. Can be checked visually (plot(anova_model, 1)) or by running a suitable hypothesis test
    4. Equal group size (have to use different types of SS calculation for the ANOVA table if this requirement is violated):
      1. The group size can be noticed in the data diagnosis step
    5. Finally, we can use parameter ANOVA.
  2. statistical hypotheses:
    1. H0: means of different supp groups are the same
    2. H1: means of different supp groups are NOT the same
  3. Carry test:
summary(anova_result)
TukeyHSD(anova_result) 
  1. Suggestions:
    1. 缺乏未治疗组(对照)
    2. 其他生物学意义

Chi-square:

Justify:

  • Chi-square goodness-of-fit test
  • Chi-square test for homogeneity
  • Chi-square test for independence
    Assumptions:
    • The variables must be categorical.
    – Fits.
    • Observations must be independent.
    – Can assume from the task. Fits.
    • Cells in the contingency table are mutually exclusive.
    – Fits.
    • The expected value of cells should be 5 or greater in at least 80% of cells.
    – See Table. Fits.
chisq.test(x = table, p = predict) # goodness-of-fit

Bootstrapping

  • if data are categorical but seriously lacking in independence
first_satisfied <- 864 
first_unsatisfied <- 714 
second_satisfied <- 980 
second_unsatisfied <- 473 
first_bootstraps <- vector() 
second_bootstraps <- vector() 
first_results <- c(rep(1, first_satisfied), rep(0, first_unsatisfied))
second_results <- c(rep(1, second_satisfied), rep(0, second_unsatisfied)) for (a in 1:100) { 
	first_sample <-
		mean(sample(first_results, length(first_results),replace=T))
	second_sample <- 
		mean(sample(second_results,length(second_results),replace=T))

	first_bootstraps<-c(first_bootstraps,first_sample) 
	second_bootstraps<-c(second_bootstraps,second_sample)
}
first_upper<-quantile(first_bootstraps,probs= c(0.975)) 
second_lower<-quantile(second_bootstraps,probs=c(0.025)) 
boxplot( 
	first_bootstraps, 
	second_bootstraps, 
	notch= T, 
	names= c('early', 'late'), 
	ylab='Prop.ofsatisfiedbuttonpresses' )

然后看是否一组的上interval和一组的下interval相交

first_upper < second_lower

Bayes

P(A|B) = P(B|A)*P(A)/P(B)

# compare two hypotheses:
P(H1|D)/P(H2|D) = P(D|H1)/P(D|H2) * P(H1)/P(H2)

# Bayes Factor:
P(D|H1)/P(D|H2)

K-means

visualization

ggplot(data, aes(x = ln_hap1, y = hap2, color = type)) +
# if cluster result then type -> cluster
  geom_point() +
  theme_minimal() +
  labs(title = "xxx") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

clustering

features <- data[, c("hap1", "hap2")]
set.seed(123)
k <- 5
kmeans_result <- kmeans(features, centers = k, nstart = 1)

Finally:

Sys.time()
posted @ 2024-05-30 13:15  CarrotCo  阅读(35)  评论(0)    收藏  举报