K-means clustering workflow

1. Turtles

Introduction

In this report, we will analyze a problem related to turtle populations on a small island with two beaches: West Beach and East Beach. The goal is to determine the probability of being on East Beach given that a Loggerhead Turtle is found. We will use Bayes' theorem and R programming to solve this problem.

Problem Statement

An ecologist studying turtles on a small island with two beaches knows the following information about the turtle population:

  • On West Beach, 90% of turtles are Green Sea Turtles, and the remaining 10% are Loggerhead Sea Turtles.
  • On East Beach, 60% of turtles are Green Sea Turtles, and the remaining 40% are Loggerhead Sea Turtles.

On a foggy day, the ecologist gets lost on the island. After hours of walking, they reach a beach but cannot determine which one it is due to the dense fog. The ecologist finds a turtle and examines it, discovering that it is a Loggerhead Turtle.

The question is: What is the probability that the ecologist is on East Beach? Additionally, we need to state the assumptions made to arrive at this probability.

Assumptions

To solve this problem, we make the following assumptions:

  1. The ecologist is either on West Beach or East Beach; there are no other possibilities.
  2. In the foggy weather, the probability of reaching West Beach or East Beach is equal, i.e., 50% each.

Solution

We will use Bayes' theorem to calculate the probability of being on East Beach given that a Loggerhead Turtle is found.
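
Written out by hand, the calculation looks as follows. The symbols are introduced here only for illustration: E and W stand for being on East or West Beach, and L for finding a Loggerhead Turtle. The numbers are the ones given in the problem statement and the assumptions above.

```latex
\begin{aligned}
P(E \mid L) &= \frac{P(L \mid E)\,P(E)}{P(L \mid W)\,P(W) + P(L \mid E)\,P(E)} \\
            &= \frac{0.4 \times 0.5}{0.1 \times 0.5 + 0.4 \times 0.5}
             = \frac{0.20}{0.25} = 0.8
\end{aligned}
```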

# Define known conditions
p_west <- 0.5  # Probability of reaching West Beach
p_east <- 0.5  # Probability of reaching East Beach
p_loggerhead_given_west <- 0.1  # Probability of finding a Loggerhead Turtle on West Beach
p_loggerhead_given_east <- 0.4  # Probability of finding a Loggerhead Turtle on East Beach

# Apply Bayes' theorem to calculate the probability of being on East Beach given a Loggerhead Turtle is found
p_east_given_loggerhead <- (p_loggerhead_given_east * p_east) / 
  (p_loggerhead_given_west * p_west + p_loggerhead_given_east * p_east)

# Print the result
cat("The probability of being on East Beach given that a Loggerhead Turtle is found is:", p_east_given_loggerhead, "\n")

Conclusion

Based on the given information and assumptions, the probability of being on East Beach given that a Loggerhead Turtle is found is 0.8, or 80%.

This result relies on the assumptions that the ecologist is either on West Beach or East Beach and that the probability of reaching each beach in the foggy weather is equal. If these assumptions do not hold, the calculated probability may differ. For example, the two beaches are equally likely after the Loggerhead is found when the prior probability of reaching West Beach is 80%, so if that prior exceeds 80%, the ecologist is still more likely to be on West Beach even after finding a Loggerhead Turtle.
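
How strongly the answer depends on the equal-prior assumption can be checked with a small sketch. The function name posterior_east and the grid of priors are hypothetical, chosen only for illustration; the likelihoods 0.4 and 0.1 come from the problem statement.

```r
# Posterior probability of East Beach given a Loggerhead, as a function of
# the prior probability of East Beach (hypothetical helper for illustration).
posterior_east <- function(prior_east,
                           p_loggerhead_given_east = 0.4,
                           p_loggerhead_given_west = 0.1) {
  prior_west <- 1 - prior_east
  (p_loggerhead_given_east * prior_east) /
    (p_loggerhead_given_west * prior_west + p_loggerhead_given_east * prior_east)
}

# Illustrative grid of priors; 0.5 is the value assumed in this report
priors <- c(0.1, 0.2, 0.5, 0.8)
print(data.frame(prior_east = priors, posterior_east = posterior_east(priors)))
```

The row for a prior of 0.5 reproduces the result above, and the other rows show how the posterior shifts as the prior moves away from 50%.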

2. Classifying neuron types from electrophysiological recordings

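library(tidyverse)  # loads ggplot2, which is used for the plots below
# Read the recordings (the path is specific to the author's machine)
vmndata <- read.csv("/Users/chen_yiru/Desktop/Desk/Projects/incourse/大二下/ADS_files/vmndata.csv")
head(vmndata)            # preview the first rows
colSums(is.na(vmndata))  # count missing values per column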

There are no NA values.

vmndata_duplicate <- duplicated(vmndata)

sum(vmndata_duplicate)

No duplicates.

ggplot(data = vmndata, aes(x = hap1, y = hap2, color= type)) + geom_point() + ggtitle("Original classification")

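Finding the optimal number of clusters

Here we already know that there are five classes, but this step is still needed when the number of classes is unknown.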

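dots <- vmndata[, c(2, 3)]  # columns 2 and 3: the fit parameters hap1 and hap2
library(factoextra)
set.seed(123)  # k-means uses random starting centres; fix the seed for reproducibility
# Elbow plot: total within-cluster sum of squares against the number of clusters
fviz_nbclust(dots, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2)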
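
What the elbow plot measures can be sketched in base R without factoextra: run kmeans for a range of k and record the total within-cluster sum of squares (tot.withinss). The toy data below is a hypothetical stand-in for the hap1/hap2 columns, not the real recordings, and nstart = 20 is an illustrative choice.

```r
set.seed(1)
# Hypothetical toy data: three well-separated 2-D blobs of 20 points each
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
ks <- 1:6
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 20)$tot.withinss)
print(data.frame(k = ks, wss = wss))
```

The "elbow" is the k after which wss stops dropping sharply; for this toy data that should be around k = 3, the number of blobs that were generated.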

model <- kmeans(dots, 5)
model
vmndata$cluster <- as.factor(model$cluster)
ggplot(data = vmndata, aes(x = hap1, y = hap2, color= cluster)) + geom_point() + ggtitle("Cluster classification")
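
Because k-means numbers its clusters arbitrarily, the colours in the two plots need not correspond. A cross-tabulation of the known labels against the cluster numbers makes the mapping explicit. The sketch below uses hypothetical toy data with a label column standing in for type; it is not the real data set.

```r
set.seed(42)
# Hypothetical toy data: two features and a known label, three groups of 10
toy <- data.frame(
  x = c(rnorm(10, 0, 0.2), rnorm(10, 5, 0.2), rnorm(10, 10, 0.2)),
  y = c(rnorm(10, 0, 0.2), rnorm(10, 5, 0.2), rnorm(10, 10, 0.2)),
  label = rep(c("A", "B", "C"), each = 10)
)

# Cluster on the two features only, then cross-tabulate against the labels
fit <- kmeans(toy[, c("x", "y")], centers = 3, nstart = 25)
confusion <- table(label = toy$label, cluster = fit$cluster)
print(confusion)
```

With the real data, the analogous call would be table(vmndata$type, vmndata$cluster); rows that spread over several columns indicate types that the clustering splits or merges.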

Test the clustering using different subsets of the fit parameters

# Cluster on hap1 alone
sub_hap1 <- vmndata[,2]
hap1_result <- kmeans(sub_hap1,5)
vmndata$hap1_cluster <- as.factor(hap1_result$cluster)
ggplot(data = vmndata, aes(x = hap1, y = hap2, color= hap1_cluster)) + geom_point() + ggtitle("Hap1 classification")

# Cluster on hap2 alone
sub_hap2 <- vmndata[,3]
hap2_result <- kmeans(sub_hap2,5)
vmndata$hap2_cluster <- as.factor(hap2_result$cluster)
ggplot(data = vmndata, aes(x = hap1, y = hap2, color= hap2_cluster)) + geom_point() + ggtitle("Hap2 classification")

Comparison between the original classification and the cluster classification: evaluating the results

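The Adjusted Rand Index (ARI) is a statistic for evaluating the agreement between two partitions of the same data (for example, the true labels and the labels produced by a clustering algorithm). Its maximum value is 1, and it can be negative:

1 means the two partitions are identical.
0 means chance-level agreement, i.e. the agreement expected between random labelings.
Negative values mean the agreement is worse than expected by chance.

There is no absolute threshold for a "good" or "bad" ARI, but a common rule of thumb is:

Close to 1: very good
0.7 to 0.9: good
0.4 to 0.69: fair
0.2 to 0.39: poor
Close to 0: random
Negative: worse than random

# Example of calculating ARI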
library(CommKern)
ari_score <- adj_RI(vmndata$type, vmndata$cluster)
print(ari_score)
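
To see what the index measures, the ARI can also be computed directly from the contingency table of the two labelings using the standard pair-counting formula. This is a minimal base-R sketch for illustration; the function name ari_from_table is hypothetical and is not part of CommKern.

```r
# Adjusted Rand Index from the contingency table of two labelings
# (hypothetical helper; standard pair-counting formula).
ari_from_table <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))           # pairs grouped together in both
  sum_rows  <- sum(choose(rowSums(tab), 2))  # pairs grouped together in a
  sum_cols  <- sum(choose(colSums(tab), 2))  # pairs grouped together in b
  expected  <- sum_rows * sum_cols / choose(n, 2)  # value expected by chance
  maximum   <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (maximum - expected)
}

# Same grouping under different label names: perfect agreement
same <- ari_from_table(c(1, 1, 2, 2, 3, 3), c("x", "x", "y", "y", "z", "z"))
# Groupings that cut across each other: worse than chance
crossed <- ari_from_table(c(1, 1, 2, 2), c(1, 2, 1, 2))
print(c(same = same, crossed = crossed))
```

The first example shows that the ARI ignores the label names themselves, which is why it can compare type with the arbitrary cluster numbers from kmeans. The sketch does not handle the degenerate case where both labelings put every point in a single group (the denominator is then zero).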

Report

Report on the Analysis of Neuronal Electrical Activity Classification

Introduction:
This report presents an analysis of electrical activity recordings from 25 neurons, classified into 5 types by a colleague based on their activity patterns. We aim to independently classify these recordings using a model with two fit parameters, hap1 (half-life or time constant) and hap2 (magnitude), and compare the results with the original classifications.

Data Import and Preliminary Checks:
We began by loading the tidyverse package in R and importing the original data from the file vmndata.csv with read.csv. Preliminary checks for missing values and duplicates confirmed that the data set contains no NA values and no duplicated rows.

Original Classification Visualization:
To visualize the original classifications, a scatter plot was created with hap1 on the x-axis, hap2 on the y-axis, and colors representing the type of neuron classification. This plot provides a clear visual representation of how the neurons were originally classified based on their electrical activity patterns.

Clustering Analysis:
Using the kmeans function in R, we performed clustering on the model fit data to create our own classification of the neuron recordings. The choice of 5 clusters was informed by the original analysis. The resulting clusters were added to the data set for further visualization and comparison.

Clustering Visualization:
A scatter plot similar to the original classification plot was created, but this time with colors representing the clusters identified by the kmeans algorithm. This visual comparison allows us to assess the differences and similarities between the original classifications and the clusters derived from our model.

Subset Clustering Analysis:
To test the robustness of our clustering, we performed two separate kmeans analyses, one using hap1 alone and one using hap2 alone. This allowed us to see how sensitive the clustering results are to each parameter individually.

Comparison and Evaluation:
To quantitatively compare the original classifications with our cluster classifications, we employed the Adjusted Rand Index (ARI), a statistical measure that assesses the agreement between two data partitions. An ARI of 1 indicates perfect agreement, while a score around 0 indicates agreement no better than chance.

Conclusion:
The ARI score obtained from our analysis provides a quantitative measure of how well our clustering aligns with the original classifications. A high ARI score would suggest that our model-based classification is consistent with the colleague's analysis, while a lower score would indicate discrepancies.

Recommendations:
Based on the ARI score and visual comparisons, we can make recommendations for further refinement of the model or suggest areas where additional data collection or analysis might be beneficial.


posted @ 2024-05-25 18:12  chen生信