MS ML
https://www.saedsayad.com/categorical_categorical.htm
- Kernels are used all the time in learning; how do you determine whether a kernel is valid?
A valid kernel needs to satisfy the Mercer condition. The linear kernel works fine if your dataset is linearly separable; however, if your dataset isn't linearly separable, we should use a nonlinear kernel such as the Radial Basis Function (RBF) kernel.
In practice, we would set up a hyperparameter search (grid search, for example) and compare different kernels to each other. Based on the loss function (or a performance metric such as accuracy, F1, MCC, ROC AUC, etc.) we could determine which kernel is "appropriate" for the given task.
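As an illustration, a minimal scikit-learn sketch of such a search (the toy dataset and the grid values are assumptions, not part of the original question):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy binary-classification data standing in for the real problem (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Compare kernels (and their hyperparameters) with cross-validated grid search.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
]
search = GridSearchCV(SVC(), param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)  # kernel chosen by the metric, not by intuition
```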
- Compare first-order methods with second-order methods
Difference between first-order and second-order algorithms:
Any algorithm that requires at least one first derivative/gradient is a first-order algorithm. In the case of a finite-sum optimization problem, you may use only the gradient of a single sample, but this is still first order because you need at least one gradient.
A second-order algorithm is any algorithm that uses any second derivative: the second derivative in the scalar case, or the Hessian in the multivariate case.
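A tiny numerical sketch of the distinction, using the assumed toy objective f(w) = w^2 + exp(w): the first-order method needs only the gradient, while the second-order method also uses the curvature and converges in far fewer iterations.

```python
import math

# Minimize f(w) = w**2 + exp(w), a simple strictly convex toy function (assumption).
f_grad = lambda w: 2 * w + math.exp(w)   # first derivative: all a first-order method needs
f_hess = lambda w: 2 + math.exp(w)       # second derivative: what a second-order method adds

w_gd, w_newton = 3.0, 3.0
for _ in range(20):
    w_gd -= 0.1 * f_grad(w_gd)                        # gradient descent: fixed step size
    w_newton -= f_grad(w_newton) / f_hess(w_newton)   # Newton's method: curvature-scaled step

print(w_gd, w_newton)  # both approach the minimizer (about -0.3517); Newton gets there much faster
```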
- I have many ads, and each ad has some keywords. I also have the revenue earned for each keyword. How would you predict future revenue? (He specifically emphasized that this is a very high-dimensional problem, so you can't use methods that are too naive.)
- Anomaly Detection
Given a table of system logs with fields like Latency, Filebytes, User, Account, Timestamp, etc., design an alert system to report anomalies.
How do we define an anomaly?
Simple Statistical Methods
The simplest approach to identifying irregularities in data is to flag the data points that deviate from common statistical properties of a distribution, including the mean, median, mode, and quantiles (e.g., 1.5 IQR). Let's say the definition of an anomalous data point is one that deviates by a certain number of standard deviations from the mean. We would need a rolling window to compute the average across the data points. Technically, this is called a rolling average or a moving average, and it's intended to smooth out short-term fluctuations and highlight long-term trends.
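A minimal pandas sketch of this rolling-average rule, assuming the logs sit in a DataFrame with a Timestamp index and a Latency column (the window size and the 3-sigma threshold are assumptions to tune):

```python
import pandas as pd

# Toy stand-in for the real log table (assumption): one numeric field, Latency.
logs = pd.DataFrame({"Latency": [100, 102, 98, 101, 400, 99, 97, 103]},
                    index=pd.date_range("2023-01-01", periods=8, freq="min"))

window = 3  # rolling window size; tune for the real log volume
# Rolling statistics over *previous* points only, so a spike does not mask itself.
baseline = logs["Latency"].shift(1).rolling(window)
rolling_mean, rolling_std = baseline.mean(), baseline.std()

# Flag points that deviate from the rolling mean by more than 3 rolling standard deviations.
logs["anomaly"] = (logs["Latency"] - rolling_mean).abs() > 3 * rolling_std
print(logs)  # the 400 ms spike is flagged
```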
Challenges
- The data contains noise which might be similar to abnormal behavior, because the boundary between normal and abnormal behavior is often not precise.
- The definition of abnormal or normal may frequently change, as malicious adversaries constantly adapt themselves. Therefore, the threshold based on moving average may not always apply.
- The pattern may be seasonal. Handling this requires more sophisticated methods, such as decomposing the data into trend, seasonal, and residual components in order to identify changes in seasonality (a brief decomposition sketch follows this list).
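For the seasonal case, one common approach is to decompose the series into trend, seasonal, and residual components and threshold the residual; a minimal sketch with statsmodels' seasonal_decompose (the hourly toy series and the injected spike are assumptions):

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Assumed toy series: hourly latency with a daily (24-hour) pattern and one injected spike.
idx = pd.date_range("2023-01-01", periods=24 * 14, freq="h")
series = pd.Series([100 + 10 * (t.hour in range(9, 18)) for t in idx], index=idx, dtype=float)
series.iloc[100] += 200  # stand-in for an anomalous log entry

# Decompose into trend + seasonal + residual; anomalies show up in the residual.
result = seasonal_decompose(series, model="additive", period=24)
residual = result.resid.dropna()
anomalies = residual[residual.abs() > 3 * residual.std()]
print(anomalies)  # only the injected spike should be reported
```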
K-means creates k clusters of similar data points. Data instances that are farthest from their cluster centers, or far away from all clusters, can potentially be marked as anomalies.
- Window your data: split the series into many smaller segments of length n and normalize the windows so that every segment begins and ends with a value of 0.
- Cluster the waveform segments in n-dimensional space.
- Reconstruction: First, we make an array of 0's that is as long as our anomalous dataset. We will eventually replace the 0's in our reconstruction array with the predicted centroids. Next, we split our anomalous dataset into overlapping segments and make predictions based on these segments. Finally, we determine whether the reconstruction error exceeds a chosen level (e.g., 2%) and plot it.
- Alert: Now if we want to alert on an anomaly, all we have to do is set a threshold for the reconstruction error. Any time it exceeds the threshold, we have detected an anomaly (see the sketch after this list).
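A condensed sketch of this window-cluster-reconstruct-alert pipeline (the toy signal, segment length, number of clusters, the simplified subtract-the-start normalization, and the 98th-percentile alert threshold are all assumptions to tune):

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed toy signal: a clean waveform plus noise, with an anomalous burst injected.
rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 40 * np.pi, 2000)) + 0.05 * rng.standard_normal(2000)
series[1200:1210] += 3.0

segment_len, step = 32, 2
# Window: split into overlapping segments and normalize each one (here: subtract its start value).
segments = np.array([series[i:i + segment_len]
                     for i in range(0, len(series) - segment_len, step)])
segments = segments - segments[:, :1]

# Cluster segments from the known-normal part of the series; centroids = library of normal shapes.
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(segments[:400])

# Reconstruction error: distance from every segment to its nearest centroid.
nearest = kmeans.cluster_centers_[kmeans.predict(segments)]
errors = np.linalg.norm(segments - nearest, axis=1)

# Alert whenever the reconstruction error exceeds the threshold.
threshold = np.percentile(errors, 98)
print(np.where(errors > threshold)[0])  # segments overlapping the injected burst
```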
k-nearest neighbors algorithm: normal data points occur in a dense neighborhood, and abnormalities are far away. In KNN, outliers are those data points which predictive algorithms consistently classify into incorrect categories. While anomalies could be caused by missing predictors, they could also arise due to insufficient data for training the predictive model. Hence it is important to ensure a sufficient sample size.
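A minimal distance-based version of this idea with scikit-learn's NearestNeighbors, scoring each point by its mean distance to its k nearest neighbors (the toy data, k, and the 99th-percentile cutoff are assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Assumed numeric feature matrix built from the log fields (Latency, Filebytes, ...):
# a dense "normal" cloud plus a few far-away points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 3)),
               rng.normal(8, 1, size=(5, 3))])

k = 10  # number of neighbors; tune per dataset
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)

# Score = mean distance to the k nearest neighbors (column 0 is the point itself, distance 0).
scores = distances[:, 1:].mean(axis=1)
outliers = np.where(scores > np.percentile(scores, 99))[0]
print(outliers)  # indices of the sparse, far-away points
```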
- How do you choose the training-set percentage: is 70% or 80% better? If the trained model performs well on the data it has seen but performs poorly once new data comes in, what are the possible reasons?
There are two competing concerns: with less training data, your parameter estimates have greater variance; with less test data, your performance statistic will have greater variance. Broadly speaking, you should be concerned with dividing data such that neither variance is too high, which has more to do with the absolute number of instances in each category than with the percentage.
If you have a total of 100 instances, you're probably stuck with cross validation as no single split is going to give you satisfactory variance in your estimates. If you have 100,000 instances, it doesn't really matter whether you choose an 80:20 split or a 90:10 split (indeed you may choose to use less training data if your method is particularly computationally intensive).
The training data set may not be representative of the whole population; if the train/test/validation split is not randomly shuffled, the model may overfit and the evaluation will be misleading.
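A short sketch of a split that avoids these pitfalls, assuming a large, imbalanced toy dataset: shuffle and stratify so both sides stay representative of the population.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Assumed toy data: 100k instances, roughly 10% positive class.
X, y = make_classification(n_samples=100_000, weights=[0.9, 0.1], random_state=0)

# With this many instances an 80:20 split is fine; shuffling and stratifying keep
# both sets representative and the class ratio identical on each side.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=0)
print(y_tr.mean(), y_te.mean())  # roughly equal positive rates
```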
- What is the relationship between significance level and power?
When we increase the significance level, it becomes more likely that we reject the null hypothesis; the rejection region grows, so the type 2 error decreases and the power increases.
Several factors affect the power of a statistical test. Some of the factors are under the control of the experimenter, whereas others are not.
- Sample size: the larger the sample size, the higher the power. Since sample size is typically under an experimenter's control, increasing sample size is one way to increase power. However, it is sometimes difficult and/or expensive to use a large sample size.
- Standard deviation: power is higher when the standard deviation is small than when it is large. Experimenters can sometimes control the standard deviation by sampling from a homogeneous population of subjects, by reducing random measurement error, and/or by making sure the experimental procedures are applied very consistently.
- Effect size: effect size is the difference in means between the two groups divided by the standard deviation of the control group. The larger the effect size, the more likely it is that an experiment will find a significant effect, so a larger effect size requires a smaller sample to reach the same power. For the smallest effect (30% vs. 40%) we would need a sample of 356 per group to yield power of 80%. For the intermediate effect (30% vs. 50%) we would need a sample of 93 per group, and for the largest effect (30% vs. 60%) a sample of 42 per group, to yield the same 80% power (the sketch after this list reproduces these figures).
- Significance level: The lower the significance level, the lower the power. Naturally, the stronger the evidence needed to reject the null hypothesis, the lower the chance that the null hypothesis will be rejected.
- One- versus two-tailed tests: Power is higher with a one-tailed test than with a two-tailed test, as long as the hypothesized direction is correct.
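A small statsmodels sketch that reproduces the sample sizes quoted above, using the normal-approximation (Cohen's h) power calculation for two independent proportions:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

analysis = NormalIndPower()
for p1 in (0.40, 0.50, 0.60):
    h = proportion_effectsize(p1, 0.30)  # Cohen's h for p1 vs. the 30% control rate
    n = analysis.solve_power(effect_size=h, alpha=0.05, power=0.80,
                             ratio=1.0, alternative='two-sided')
    print(f"30% vs {p1:.0%}: about {round(n)} per group")
# Prints roughly 356, 93, and 42 per group, matching the figures above.
```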
- For example, the rough end-to-end steps for building a logistic regression model (a minimal sketch follows the list)
- Collect data
- Exploratory data analysis
Explore data to check assumptions
Dependent variable should be binary.
Independence of errors
No perfect multicollinearity: VIF, scatterplot, correlation coefficient matrix
Linearity between independent variable and log odds: visualization
Explore data to check data quality: missing values, outliers
- Preprocess the data and select features
Check the distributions of the variables you intend to use, as well as bivariate relationships among all variables that might go into the model.
Only meaningful variables should be included. Examine the relationship between each feature and the target, e.g., with grouped boxplots or stacked bar charts.
Data transformation: date to day of week, days since last action; create dummy variables for categorical variables.
- Run an initial model and evaluate it
- Refine predictors and check model fit
You can use some sort of stepwise approach to determine the best predictors.
Drop nonsignificant control variables
Test, and possibly drop, interactions and quadratic terms, or explore other types of non-linearity.
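A minimal end-to-end sketch of these steps with scikit-learn (the toy DataFrame, column names, and settings are assumptions; stepwise refinement and checks such as VIF would be layered on top of the same skeleton):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed toy data: a binary target, one numeric and one categorical predictor.
df = pd.DataFrame({
    "days_since_last_action": [1, 30, 2, 45, 3, 60, 5, 90],
    "day_of_week": ["Mon", "Sat", "Tue", "Sun", "Wed", "Sat", "Fri", "Sun"],
    "target": [1, 0, 1, 0, 1, 0, 1, 0],
})
X, y = df.drop(columns="target"), df["target"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

# Preprocess (scale numeric features, dummy-code categorical ones), then fit and evaluate.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["days_since_last_action"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["day_of_week"]),
])
model = Pipeline([("prep", preprocess), ("logreg", LogisticRegression(max_iter=1000))])
model.fit(X_tr, y_tr)
print(classification_report(y_te, model.predict(X_te), zero_division=0))
```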