Notes on "Dos and Don'ts of Machine Learning in Computer Security"
Source: USENIX Security 2022 (summer accepted paper)
Authors: Daniel Arp, Erwin Quiring, Pendlebury, et al. (Europe-based groups)
Topic: subtle pitfalls when applying machine learning to security applications, and how to do ML for security "safely"
Paper link: https://www.usenix.org/system/files/sec22summer_arp.pdf
Abstract & Introduction
- Problem:
- ML in security is prone to subtle pitfalls
- Impact: pitfalls undermine performance and make methods unsuitable for security tasks and practical deployment
- Contribution:
- Pitfall Identification: Identify common pitfalls(design & implement & evaluate)
- Prevalence Analysis: a review of recent top-tier security papers showing how widespread the pitfalls are
- Impact Analysis: experiments in four case studies demonstrating the pitfalls' practical impact, plus actionable recommendations to avoid or mitigate them
Pitfalls in Machine Learning

Pitfalls in Data Collection and Labeling
Sampling Bias
- data doesn't represent the true data distribution
- bias origin: relying on synthetic data or combining data from different sources
- recommendation: avoid mixing data from incompatible sources; the limitations of the dataset should be openly discussed.
Label Inaccuracy
- ground-truth labels are inaccurate, unstable or erroneous, affecting the overall performance
- inaccuracy origin: reliable labels are typically not available for relevant security problems, resulting in a chicken-and-egg problem; labels may also change over time (label shift)
- recommendation: verify labels whenever possible; if labels are noisy, use robust models or loss functions, model label noise in the learning process, or cleanse noisy labels from the training data (see the sketch below)
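A minimal sketch of one cleansing heuristic (all names and thresholds hypothetical): in malware labeling, samples whose anti-virus detection counts fall into an ambiguous range are often discarded rather than forced into a class.

```python
# Hypothetical cleansing of noisy malware labels based on anti-virus detection counts.
def cleanse_labels(samples, benign_max=0, malicious_min=4):
    """Keep only samples whose detection count yields a confident label."""
    cleaned = []
    for sample in samples:
        n = sample["detections"]                    # number of AV engines flagging the sample
        if n <= benign_max:
            cleaned.append((sample["sha256"], 0))   # confidently benign
        elif n >= malicious_min:
            cleaned.append((sample["sha256"], 1))   # confidently malicious
        # samples in the grey zone (1-3 detections here) are dropped as too uncertain
    return cleaned

apps = [
    {"sha256": "app_a", "detections": 0},
    {"sha256": "app_b", "detections": 2},           # grey zone -> dropped
    {"sha256": "app_c", "detections": 12},
]
print(cleanse_labels(apps))                         # [('app_a', 0), ('app_c', 1)]
```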
Pitfalls in System Design and Learning
Data Snooping
- data snooping leads to over-optimistic results
- test snooping: the test set is used during training or model development
- temporal snooping: common pitfall, occurs if time dependencies within the data are ignored
- selective snooping: cleansing of data based on information not available in practice
- recommendation: split off the test data early and store it separately until the final evaluation; account for temporal dependencies, e.g., by additionally experimenting on more recent data (see the sketch below)
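A minimal sketch of a temporally consistent split (the sample tuples are made up): sort by timestamp, train only on older samples, and set the newer portion aside untouched until the final evaluation.

```python
from datetime import datetime

# Hypothetical samples: (feature_vector, label, first_seen_timestamp)
samples = [
    ([0.1, 1.0], 0, datetime(2020, 1, 5)),
    ([0.7, 0.2], 1, datetime(2020, 6, 1)),
    ([0.3, 0.9], 0, datetime(2021, 2, 14)),
    ([0.8, 0.1], 1, datetime(2021, 9, 30)),
]

def temporal_split(samples, train_ratio=0.75):
    """Split chronologically: train on older samples, evaluate on newer ones."""
    ordered = sorted(samples, key=lambda s: s[2])     # sort by first-seen time
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]

train_set, test_set = temporal_split(samples)
# The test set is stored separately now and only touched once, at evaluation time.
print(len(train_set), "training samples,", len(test_set), "test samples")
```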
Spurious Correlations
- artifacts unrelated to the security task create shortcut patterns, which carries a high risk of overestimating the capabilities of the method and misjudging its practical limitations
- recommendation: apply explanation techniques for ML to inspect what the model has learned; since whether a correlation is spurious depends on the task, define the security objective clearly in advance and validate the learned features against it (see the sketch below)
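A minimal sketch of such an inspection on a linear model (toy data; the DREBIN-style feature names are made up): rank features by absolute weight and check whether the top ones are plausible security features or dataset artifacts such as a market URL.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy binary feature matrix; feature names are hypothetical DREBIN-style features.
feature_names = ["perm::SEND_SMS", "api::getDeviceId", "url::play.google.com"]
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1],
              [0, 1, 1]], dtype=float)
y = np.array([1, 1, 0, 0])        # 1 = malicious, 0 = benign

clf = LinearSVC(C=1.0).fit(X, y)

# Rank features by absolute weight and eyeball the top ones for dataset artifacts.
for idx in np.argsort(-np.abs(clf.coef_[0])):
    print(f"{feature_names[idx]:>22}  weight = {clf.coef_[0][idx]:+.3f}")
```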
Biased Parameter Selection
- final parameters (e.g., hyperparameters, thresholds) are tuned on the test set rather than on a separate validation set
- the seemingly best-performing model may thus suffer from biased parameter selection
- recommendation: tune on a dedicated validation set and enforce strict isolation between training, validation, and test data (see the sketch below)
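A minimal sketch of strict data isolation on synthetic data: hyperparameters are chosen on a validation split only, and the test split is used exactly once at the end.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

# 60/20/20 split: the test set is isolated and never used for tuning.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

best_C, best_acc = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):
    acc = LinearSVC(C=C).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc

final = LinearSVC(C=best_C).fit(X_train, y_train)
print("C selected on validation:", best_C, "| test accuracy:", round(final.score(X_test, y_test), 3))
```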
Pitfalls in Performance Evaluation
Inappropriate Baseline
- there exists no universal learning algorithm that outperforms all other approaches in general.
- an overly complex learning method increases the chance of overfitting
- recommendation: include simple models throughout the evaluation, use AutoML to find reasonable baselines, and check whether non-learning approaches are also suitable (see the sketch below)
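A minimal sketch on synthetic data: compare a complex learner against a simple model and a non-learning rule; if the trivial rule already matches the complex model, the added complexity is hard to justify.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = (X[:, 0] > 0).astype(int)                 # the true signal is a single threshold

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Non-learning baseline: a hand-written rule on the first feature.
rule_acc = np.mean((X_te[:, 0] > 0).astype(int) == y_te)
print("rule x0>0     :", round(rule_acc, 3))
print("logistic reg. :", round(LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te), 3))
print("random forest :", round(RandomForestClassifier(random_state=1).fit(X_tr, y_tr).score(X_te, y_te), 3))
```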
Inappropriate Performance Measures
- not all performance measures are suitable in the security context; this pitfall refers to an inappropriate description of performance
- recommendation: consider the practical deployment of the model and choose measures that reflect it
Base Rate Fallacy
- a large class imbalance is ignored when interpreting results, leading to misleading conclusions
- recommendation: prefer precision and recall, the Matthews Correlation Coefficient (MCC), ROC curves and AUC; discuss false positives explicitly (a worked example follows)
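A worked toy example of the base-rate fallacy (numbers purely illustrative): a detector with a 99% true-positive rate and a 1% false-positive rate looks excellent on a balanced dataset, but yields only about 9% precision when just 0.1% of the traffic is malicious.

```python
def precision(tpr, fpr, base_rate):
    """Precision implied by a detector's TPR/FPR at a given base rate of positives."""
    tp = tpr * base_rate                  # expected fraction of true positives
    fp = fpr * (1.0 - base_rate)          # expected fraction of false positives
    return tp / (tp + fp)

# Balanced lab setting: precision looks great.
print(round(precision(tpr=0.99, fpr=0.01, base_rate=0.5), 3))    # ~0.99
# Realistic base rate: most alarms are false positives.
print(round(precision(tpr=0.99, fpr=0.01, base_rate=0.001), 3))  # ~0.09
```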
Pitfalls in Deployment and Operation
Lab-Only Evaluation
- A learning-based system is solely evaluated in a laboratory setting, without discussing its practical limitations.
- Recommendation: move away from a purely laboratory setting and approximate a real-world setting as accurately as possible.
Inappropriate Threat Model
- the security of the ML system itself is not considered, exposing it to a variety of attacks, such as poisoning and evasion attacks.
- Recommendation: define threat models precisely and evaluate systems with respect to them; focus on white-box attacks where possible (a toy sketch follows).
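A toy white-box evasion sketch against a linear classifier (entirely synthetic data, not an attack from the paper): the attacker greedily adds features with benign-leaning weights to push a malicious sample's score toward the benign side.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic binary features; the last features carry benign-leaning (negative) weight.
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(200, 8)).astype(float)
w_true = np.array([2.0, 1.5, 1.0, 0.5, -0.5, -1.0, -1.5, -2.0])
y = (X @ w_true > 0).astype(int)

clf = LinearSVC(C=1.0).fit(X, y)
w = clf.coef_[0]

x = X[clf.predict(X) == 1][0].copy()             # a sample currently classified as malicious
print("score before:", round(clf.decision_function([x])[0], 3))

for i in np.argsort(w):                          # most benign-leaning features first
    if w[i] < 0 and x[i] == 0:
        x[i] = 1.0                               # attacker adds a benign-looking feature
        if clf.decision_function([x])[0] <= 0:   # crossed the decision boundary
            break

print("score after :", round(clf.decision_function([x])[0], 3))
print("still classified as malicious:", bool(clf.predict([x])[0]))
```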
Prevalence Analysis
Three central observations:
- there is a lack of awareness of the identified pitfalls (even though most surveyed authors agree they matter)
- the pitfalls are widespread in security literature and there is a need for mitigating them
- a consistent understanding of the identified pitfalls is still lacking
Impact Analysis
- mobile malware detection (P1, P4, and P7)
- vulnerability discovery (P2, P4, and P6)
- source code authorship attribution (P1 and P4)
- network intrusion detection (P6 and P9)
Mobile Malware Detection
Data collection: AndroZoo
Dataset analysis: distribution of apps across origins (markets) and number of anti-virus detections per sample

Experimental Setup:
- Data: \(D_1\) (10,000 benign apps from GooglePlay + 1,000 malicious apps from Chinese markets), \(D_2\) (10,000 benign apps from GooglePlay + 1,000 malicious apps from GooglePlay) [Pitfall 1]
- Features: two feature sets taken from the DREBIN and OPSEQS classifiers; the URL play.google.com turns out to be a highly discriminative artifact [Pitfall 4]
- Model: Linear SVM
- Result: the reported performance differs considerably depending on which performance measure is chosen [Pitfall 7]

Vulnerability Discovery
Data Collection: CWE-119
Dataset Analysis: vulnerabilities related to buffers (CWE-119); 39,757 source-code snippets, of which 10,444 (26%) carry a vulnerability label [Pitfall 2]
Experiment Setup:
- Feature: classify a random subset to spot possible artifacts; extract the buffer sizes of char arrays (because certain sizes appear in only one class throughout the samples) [Pitfall 4]
- Model:
  - VulDeePecker (classifies code snippets)
  - SVM with bag-of-words features on n-grams (baseline for VulDeePecker)
  - Layerwise Relevance Propagation (LRP, to explain predictions and score token relevance)
- Result: extract the tokens that occur most often among the top-10 most relevant tokens; INT* tokens dominate (spurious correlations), and the SVM with 3-grams achieves the best performance. VulDeePecker is thus not exploiting relations in the sequence but merely combining special tokens [Pitfall 6] (a baseline sketch follows)
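A minimal sketch of such a bag-of-words baseline over token n-grams (toy snippets standing in for the lexed code gadgets; labels are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy code snippets standing in for lexed source-code gadgets.
snippets = [
    "char buf [ 128 ] ; strcpy ( buf , input ) ;",
    "char buf [ 64 ] ; memcpy ( buf , src , n ) ;",
    "int len = strlen ( s ) ; if ( len < 64 ) copy ( s ) ;",
    "size_t n = sizeof ( dst ) ; strncpy ( dst , src , n ) ;",
]
labels = [1, 1, 0, 0]          # 1 = vulnerable, 0 = not vulnerable (illustrative)

# Token 1- to 3-grams over whitespace-separated tokens, then a linear SVM.
vec = CountVectorizer(analyzer="word", token_pattern=r"\S+", ngram_range=(1, 3), lowercase=False)
X = vec.fit_transform(snippets)
clf = LinearSVC(C=1.0).fit(X, labels)

print(clf.predict(vec.transform(["char buf [ 32 ] ; strcpy ( buf , arg ) ;"])))
```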

Truncating the code snippets discards important information and can cause label inaccuracy [Pitfall 2].
Source Code Author Attribution
Programming habits are characterized by a variety of stylistic patterns.
In combination with sampling bias [Pitfall 1], this expressiveness may give rise to spurious correlations [Pitfall 4] in current attribution methods, leading to an overestimation of accuracy.
(In essence, they compared a dataset containing unused code with one where the unused code was removed; the underlying issue is redundant, unused code in the samples.)
Network Intrusion Detection
data collection: the IoT intrusion dataset from Mirsky et al. (Kitsune)
Dataset analysis: all benign activity seems to halt as the attack commences (after about 74 minutes), despite the many devices on the network, which suggests an artificial capture.

Experimental Setup:
method: a simple boxplot-based baseline (shown as the gray line in the paper's figure)
result:
- an experiment without an appropriate baseline is insufficient to justify the complexity and overhead of the ensemble [Pitfall 6]
- a simple method (the boxplot) can also reveal issues with data generated for lab-only evaluations [Pitfall 9] (see the sketch below)
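A minimal sketch of such a boxplot baseline on a synthetic packet-rate series: compute the quartiles of benign traffic and flag anything beyond 1.5 × IQR from them.

```python
import numpy as np

rng = np.random.default_rng(3)
benign = rng.normal(loc=100.0, scale=5.0, size=500)    # packets/sec during normal operation
attack = rng.normal(loc=400.0, scale=20.0, size=20)    # a burst caused by the attack
traffic = np.concatenate([benign, attack])

# Classic boxplot rule: anything beyond 1.5 * IQR from the quartiles is an outlier.
q1, q3 = np.percentile(benign, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

alerts = (traffic < lower) | (traffic > upper)
print(f"flagged {alerts.sum()} of {len(traffic)} observations as anomalous")
```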

Limitations and Threats to Validity
Pitfalls: some may seem obvious at first glance, and the case studies do not cover all ten pitfalls in detail.
Prevalence analysis: the paper selection process is not entirely free from bias.
Impact analysis: the choice of case studies is itself a potentially biased selection.
