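Notes on Cisco's papers on detecting malicious encrypted TLS flows: because the samples are imbalanced, the results are actually not that good, and the 99.9% accuracy figures mean little. A key assumption behind Cisco's use of DNS and HTTP is DGA and HTTP C&C (normal HTTP traffic carries images and so on). Cisco initially used logistic regression; the later 2017 paper uses random forest. - bonelee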

0x00

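This series of notes is an informal record of the questions and thoughts that came up while reading papers. The structure may be loose and cannot fully capture the essence of the original paper; it is only meant for my own later review.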

Title: Identifying Encrypted Malware Traffic with Contextual Flow Data
Authors: Blake Anderson (Cisco), David McGrew (Cisco)
Venue: AISec '16, Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security
Keywords: Malware; Machine Learning; Transport Layer Security; Network Monitoring

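0x01 Problem statement

Detecting malware from the encrypted traffic it sends and receives is necessary.
Traditional feature extraction mostly focuses on packet sizes and a few timing-related parameters. This paper widens the feature set to include the complete TLS handshake packets, the DNS flows from the same source as the TLS handshake, and the HTTP flows within a 5-minute window (the latter two are called contextual flows). Feeding the features extracted from this data into a supervised machine learning algorithm yields very high classification accuracy.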

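0x02 Approach

Feature extraction: for the contextual flows, from the DNS flow we mainly analyze the responses returned by the DNS server that carry an address, together with the TTL value associated with that address; from the HTTP flow we mainly analyze the various fields in the HTTP headers. For the TLS stream, we mainly analyze the information provided in the handshake packets. For the other packets (ordinary TCP, UDP, and ICMP packets, i.e. observable metadata), we extract their "side-channel information".
Classification: normalize the features and feed them into a supervised learning algorithm.
Testing: evaluate on packets captured in a real network environment.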
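
The normalization step can be sketched as follows. This is a minimal illustration, assuming "normalize" means scaling every feature column to zero mean and unit variance; the function name `standardize` and the sample rows are made up for the example and are not taken from the paper.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Scale each feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical rows: [binary flag, TTL value]
flows = [[1.0, 300.0], [0.0, 60.0], [1.0, 3600.0]]
scaled = standardize(flows)
```

The scaled rows would then be passed to whichever supervised learner is used; without this step, large raw values such as TTLs would dominate the binary indicators.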

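0x03 Feature sources

TLS flows

A TLS flow is not encrypted at the start of the exchange, because the client first has to complete a handshake with the remote server. The unencrypted TLS metadata we can observe includes the clientHello and clientKeyExchange messages. From these messages we can infer, among other things, which TLS library the client uses. They show that benign traffic behaves quite differently from malicious traffic.

On the client side, we first look at two TLS features: Offered Ciphersuites and Advertised TLS Extensions. For the former, malicious traffic tends to offer 0x0004 (TLS_RSA_WITH_RC4_128_MD5) in the clientHello, while benign traffic more often offers 0x002f (TLS_RSA_WITH_AES_128_CBC_SHA). For the latter, most TLS traffic advertises 0x000d (signature_algorithms), but benign traffic also uses the following extensions, which are rarely seen in malicious traffic: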

0x0005 (status request)
0x3374 (next protocol negotiation)
0xff01 (renegotiation info)

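Next, we look at the difference in client public keys between benign and malicious traffic. Benign traffic usually uses 256-bit elliptic curve public keys, while malicious traffic usually uses 2048-bit RSA public keys.

On the server side, the serverHello message gives us the ciphersuite the server selected and the TLS extensions it supports. The benign traffic shows more variety, while malicious traffic tends to select outdated options. The certificate message gives us the server's certificate chain. The number of certificates in the chain is about the same for malicious and benign traffic, but if we look only at chains of length 1, about 70% of them are self-signed in the malicious traffic, compared with about 0.1% in the benign traffic.

In addition, the SubjectAltName X.509 extension and the certificate validity period also help separate some benign and malicious traffic.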

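DNS flows

Many malware families use domain generation algorithms (DGAs) to generate random domain names, which clearly differs from the behavior of normal traffic. This gives us an entry point for identifying malicious traffic.

Comparing domain name lengths, benign domains roughly follow a Gaussian distribution that peaks at 6 or 7, while the distribution for malicious domains has a very sharp peak at 6. Looking at the character classes used in domain names, benign domains use numeric characters more often than malicious ones.

Comparing the number of IP addresses carried in DNS responses, benign traffic more often returns 2 or 8 addresses, while malicious traffic more often returns 4 or 11. Comparing TTL values in the responses, the most common values in benign traffic are 60, 300, 20, and 30. In malicious traffic, 300 is also common, but 20 and 30 are not. Malicious traffic also often shows a TTL of 100, a value that almost never appears in benign traffic.

Beyond these indicators, the Alexa ranking also separates benign and malicious domains. We split domains into 6 categories: top-100, top-1,000, top-10,000, top-100,000, top-1,000,000, and unranked. We find that 86% of the malicious domains are unranked.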
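
A minimal sketch of the Alexa bucketing, to make the categories concrete. It assumes six buckets (top-100 through top-1,000,000 plus unranked) and a rank lookup that returns `None` for unranked domains; the names `ALEXA_BUCKETS`, `alexa_bucket`, and `alexa_one_hot` are hypothetical and not from the paper.

```python
# Upper bounds of the ranked buckets (assumed); anything beyond is "none".
ALEXA_BUCKETS = [100, 1_000, 10_000, 100_000, 1_000_000]


def alexa_bucket(rank):
    """Return the bucket label for an Alexa rank, or 'none' if unranked."""
    if rank is None:
        return "none"
    for limit in ALEXA_BUCKETS:
        if rank <= limit:
            return f"top-{limit}"
    return "none"


def alexa_one_hot(rank):
    """Encode the bucket as a 6-element 0/1 vector (5 ranked buckets + none)."""
    labels = [f"top-{limit}" for limit in ALEXA_BUCKETS] + ["none"]
    bucket = alexa_bucket(rank)
    return [1 if label == bucket else 0 for label in labels]
```

For a DGA-style domain the lookup would typically yield `None`, so only the last element of the vector is set.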

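HTTP flows

In HTTP response headers, the fields most used by malicious traffic are Server, Set-Cookie, and Location, while those most used by benign traffic are Connection, Expires, and Last-Modified. In HTTP request headers, the fields most used by benign traffic are User-Agent, Accept-Encoding, and Accept-Language.

Looking at field values, the most common Content-Type in benign traffic is image/*, while in malicious traffic it is text/*. Other MIME values common in malicious traffic are text/html;charset=UTF-8 and text/html;charset=utf-8.

Malicious traffic tends to report a low version of Nginx as its server, while benign traffic tends to report a low version of Apache or Nginx.

In the User-Agent field of malicious traffic, the most common value is Opera/9.50 (Windows NT 6.0; U; en), followed by some versions of Mozilla/5.0 or Mozilla/4.0. Benign traffic usually reports a Windows or OS X variant of Mozilla/5.0.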

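0x04 Feature extraction details

Side-channel information

(I did not follow this part; it is related to Markov chains.)
Create a length-256 array that keeps a count for each byte value seen in the packet payloads.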
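
One reading of the counting array, as a rough sketch: it assumes the array has 256 slots and counts how often each byte value (0 to 255) occurs across the payloads of a flow, which is how I understand the BD (byte distribution) feature. The function names are made up for illustration.

```python
def byte_distribution(payloads):
    """Count occurrences of each byte value across all payloads of a flow."""
    counts = [0] * 256
    for payload in payloads:
        for byte in payload:  # iterating over bytes yields ints in 0..255
            counts[byte] += 1
    return counts


def normalized(counts):
    """Turn raw counts into relative frequencies (all zeros for an empty flow)."""
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


bd = byte_distribution([b"\x00\x00\x01", b"\xff"])
```

Encrypted payloads should give a nearly flat distribution, so this feature mostly carries information about the unencrypted parts of a flow.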

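TLS data

Client-based features: list the 176 ciphersuite types, the TLS extensions, and the public key length, and use a binary array (only 0s and 1s) to mark which of them apply to the flow;
Server-based features: same as above.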
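
The binary encoding can be sketched like this. It is only an illustration: the vocabularies below are a tiny hypothetical subset (the notes mention 176 ciphersuites), and appending the key length as a plain integer rather than as bits is my assumption.

```python
# Hypothetical, heavily shortened vocabularies (hex codes as seen on the wire).
CIPHERSUITES = [0x0004, 0x0005, 0x002F, 0x0035, 0xC02F]
EXTENSIONS = [0x0000, 0x0005, 0x000D, 0x3374, 0xFF01]


def tls_client_features(offered_suites, extensions, key_bits):
    """Build a 0/1 vector over known ciphersuites and extensions,
    followed by the client public key length in bits."""
    suite_bits = [1 if s in offered_suites else 0 for s in CIPHERSUITES]
    ext_bits = [1 if e in extensions else 0 for e in EXTENSIONS]
    return suite_bits + ext_bits + [key_bits]


# A malware-like client: offers RC4-MD5, advertises only signature_algorithms,
# and uses a 2048-bit RSA key.
vec = tls_client_features({0x0004}, {0x000D}, 2048)
```

The server-based vector would be built the same way from the selected ciphersuite and the extensions in the serverHello.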

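DNS data

As above, the domain-related features are: 32 possible TTL values plus an "other" option, the number of numeric characters, the number of non-alphanumeric characters, the number of IP addresses returned in the DNS response, and 6 categories for the domain's position in the Alexa ranking.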
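
A minimal sketch of the TTL and character-count part of this vector. The list `COMMON_TTLS` is a hypothetical stand-in for the 32 TTL values (the notes do not list them), and the Alexa categories are left out here to keep the example short.

```python
# Hypothetical stand-in for the 32 most common TTL values.
COMMON_TTLS = [20, 30, 60, 100, 300, 3600]


def dns_features(domain, ttl, num_ips):
    """One-hot TTL (plus an 'other' slot), character counts, and IP count."""
    ttl_bits = [1 if ttl == t else 0 for t in COMMON_TTLS]
    ttl_bits.append(0 if ttl in COMMON_TTLS else 1)  # the "other" option
    digits = sum(ch.isdigit() for ch in domain)
    non_alnum = sum(not ch.isalnum() for ch in domain)
    return ttl_bits + [digits, non_alnum, num_ips]


features = dns_features("a1b2-c.org", 300, 4)
```

Here the dot and the hyphen both count as non-alphanumeric characters; whether the paper counts the dots is something I have not checked.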

HTTP data

As above, pick 6 fields that appear frequently in HTTP headers, plus an "other" option.
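
A rough sketch of this presence encoding. The six tracked fields below are my own pick from the response headers mentioned earlier in these notes, not necessarily the six the paper uses.

```python
# Hypothetical choice of six tracked header fields (lower-cased).
TRACKED_FIELDS = [
    "server", "set-cookie", "location",
    "connection", "expires", "last-modified",
]


def http_header_features(headers):
    """Return a 0/1 flag per tracked field, plus a final 'other' flag that is
    set when the message carries any field outside the tracked list."""
    names = {name.lower() for name in headers}
    bits = [1 if field in names else 0 for field in TRACKED_FIELDS]
    bits.append(1 if names - set(TRACKED_FIELDS) else 0)
    return bits


resp = http_header_features({"Server": "nginx", "Location": "/x", "X-Foo": "1"})
```

Header names are compared case-insensitively because HTTP field names are case-insensitive.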

0x05 Test results

SPLT + BD + TLS + HTTP + DNS: 99.933%
SPLT + BD + TLS + HTTP: 99.983%
SPLT + BD + TLS + DNS: 99.968%
TLS + HTTP + DNS: 99.988%
SPLT + BD + TLS: 99.933%
HTTP + DNS: 99.985%
TLS + HTTP: 99.955%
TLS + DNS: 99.883%
HTTP: 99.945%
DNS: 99.496%
TLS: 96.335%

Addendum:

Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity, a KDD 2017 paper by the same authors.

It describes the TLS handshake:

Figure 1 provides a graphical representation of a simple TLS session. The client initially sends a ClientHello message that provides the server with, among other fields, a list of cipher suites and a set of TLS extensions that the client supports. The cipher suite list is ordered by preference of the client, and each cipher suite defines a set of cryptographic algorithms needed for TLS to operate. The set of extensions provides additional information to the server that facilitates extended functionality, e.g., the Server Name Indication extension indicates the hostname of the server that the client is trying to connect to, which is important for virtual hosting. As explained in Section 4, all of the TLS data features used in this paper are taken from the unencrypted ClientHello message. After the ClientHello, the server sends a ServerHello message that contains the selected cipher suite, selected from the client's offer list, which defines the set of cryptographic algorithms that will be used to secure the exchanged application data. The ServerHello message also contains a list of extensions that the server supports, where this list is a subset of what the client supports. At this time, the server also sends a Certificate message containing the server's certificate chain, which can be used to authenticate the server.
The client then sends a ClientKeyExchange message that establishes the premaster secret of the TLS session. Then the client and server exchange ChangeCipherSpec messages indicating that future messages will be encrypted with the negotiated cryptographic parameters. Finally, the client and server begin to exchange application data. In Figure 1, red text represents unencrypted messages, and blue text represents encrypted messages. The current TLS 1.2 handshake protocol provides a lot of interesting, unencrypted information. To enhance privacy, TLS 1.3 will be encrypting more of the handshake, e.g., the Certificate message will be encrypted, but the data features used in this paper will still be available. Many important details were omitted for the sake of brevity, but the associated RFC's provide the full specification [18, 34]. Because TLS encrypts many of the application-specific features, therefore making traditional deep packet inspection infeasible, many researchers have utilized side-channel information to make useful inferences on the TLS traffic [38]. These data features are typically constructed from the individual packet lengths and packet inter-arrival times of the encrypted session. Commonly used features include the mean of the packet lengths, n-gram or Markov chain based features derived from the sequence of packet lengths, or similarly constructed features for the timing information.
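
To make the Markov chain based packet-length features in the quote more concrete, here is a minimal sketch. The bin width of 150 bytes and the 10 bins are assumptions chosen for illustration, and the function name is hypothetical.

```python
def length_transition_matrix(lengths, bin_size=150, num_bins=10):
    """Estimate a first-order Markov transition matrix over binned packet lengths."""
    # Map each packet length to a bin; oversized packets share the last bin.
    bins = [min(length // bin_size, num_bins - 1) for length in lengths]
    counts = [[0] * num_bins for _ in range(num_bins)]
    for prev, curr in zip(bins, bins[1:]):
        counts[prev][curr] += 1
    # Normalize each row so it holds transition probabilities.
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# A flow alternating between small and full-size packets.
m = length_transition_matrix([100, 1500, 100, 1500])
```

Flattening the matrix gives a fixed-length feature vector (100 values with these settings) regardless of how many packets the flow contains, and the same construction could be applied to inter-arrival times.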


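I keep feeling that packet size should not be a key feature, but the paper says it is.

Finally, look at the accuracy of the algorithms.

Sample counts: 4,287,892 (Enterprise) and 285,895 (Malware) in total, so the ratio of malicious to benign samples is roughly 7:100.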

Algorithm     Enterprise              Malware
              Standard    Enhanced    Standard    Enhanced
LinReg        99.92%      99.28%       0.00%      58.65%
l2-LogReg     93.35%      98.36%      16.86%      76.13%
l1-LogReg     92.75%      98.97%      19.71%      75.08%
DecTree       97.55%      97.02%      40.98%      83.33%
RandForest    99.53%      99.99%      33.54%      76.79%
SVM           11.94%      99.78%      77.98%      72.62%
MLP           95.90%      99.54%      20.61%      72.53%

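Because the samples are imbalanced, the classification results are actually not that good; the detection rate and accuracy on the malware samples make this clear. The best figure is only about 83%.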
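
A quick worked calculation of why overall accuracy says little here. It assumes the two sample counts above are benign enterprise flows and malware flows respectively; the function `metrics` and its parameters are made up for this sketch.

```python
BENIGN = 4_287_892     # enterprise (benign) flows, from the sample counts above
MALICIOUS = 285_895    # malware flows


def metrics(detect_rate, false_positive_rate):
    """Return (overall accuracy, precision) for a classifier that catches
    detect_rate of the malware and misflags false_positive_rate of benign flows."""
    tp = detect_rate * MALICIOUS
    fp = false_positive_rate * BENIGN
    tn = BENIGN - fp
    accuracy = (tp + tn) / (BENIGN + MALICIOUS)
    precision = tp / (tp + fp) if tp + fp else 0.0
    return accuracy, precision


# A useless classifier that labels every flow as benign.
all_benign_accuracy, _ = metrics(0.0, 0.0)
```

With these counts, labeling everything as benign already gives an overall accuracy of roughly 93.7%, so a headline accuracy figure needs to be read together with the per-class numbers in the table.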

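Some key points from Identifying Encrypted Malware Traffic with Contextual Flow Data. The paper contains many figures on feature extraction that are worth a careful look. Cisco initially used logistic regression, and that is the classifier in this paper.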

We can see that malware usually offers a set of three obsolete ciphersuites in the clientHello message including 0x0004 (TLS_RSA_WITH_RC4_128_MD5). In the benign traffic we collected, the 0x002f (TLS_RSA_WITH_AES_128_CBC_SHA) ciphersuite was the most offered. Malware also seems to have comparatively little diversity in the client-supported TLS extensions. 0x000d (signature_algorithms) was the only TLS extension supported in the majority of TLS flows. ∼50% of the DMZ traffic also advertised the following extensions, which were rarely seen in the malware dataset:
• 0x0005 (status request)
• 0x3374 (next protocol negotiation)
• 0xff01 (renegotiation info)
Although not shown, the client’s public key length was another client-based data feature that had significant differences. Most of the DMZ traffic used 256-bit elliptic curve cryptography for the public keys, but most of the malicious traffic used 2048-bit RSA public keys. The serverHello and certificate messages can be used to gain information about the server. The serverHello message contains the selected ciphersuite and supported extensions. As one would expect given the type and diversity of the offered ciphersuites and the advertised extensions, the malicious traffic most often selected obsolete ciphersuites. The DMZ traffic contained a wider variety of supported TLS extensions by the servers.


The certificate message passes the server's certificate chain to the client. We observed that the number of certificates in the chain for the malware and DMZ data were roughly the same. But, if we restrict our focus on the length-1 chains, ∼70% were self-signed for malware and ∼0.1% were self-signed for the DMZ traffic. The number of names in the SubjectAltName (SAN) X.509 extension also differed in the two datasets. For the DMZ traffic, the length of the list was 1 ∼45% of the time. This is in part because a number of Content Distribution Network (CDN) providers, e.g., Akamai, only have one entry. Length-10/12 lists were also common in the DMZ traffic due to some ad services.
Figure 2 also shows the distribution of the validity of the certificates rounded to the nearest day. Similar to the other data features, the period of validity for a server certificate has notable differences in the malicious and DMZ traffic.


Features and their weights:

Weight   Feature
 3.38    DNS Suffix org
 2.99    DNS TTL 3600
 2.62    TLS Ciphersuite TLS_RSA_WITH_RC4_128_SHA
 2.28    HTTP Field accept-encoding
 1.95    TLS Ciphersuite SSL_RSA_FIPS_WITH_3DES_EDE_CBC_SHA
 1.78    HTTP Field location
 1.38    DNS Alexa: None
 1.21    TLS Ciphersuite TLS_RSA_WITH_RC4_128_MD5
 1.12    HTTP Server nginx
 1.11    HTTP Code 404
-2.16    TLS Extension extended_master_secret
-1.65    HTTP Content Type application/octet-stream
-1.61    HTTP Accept Language en-US,en;q=0.5
-1.35    TLS Ciphersuite TLS_DHE_RSA_WITH_DES_CBC_SHA
-1.10    HTTP Content Type text/plain;charset=UTF-8
-0.97    HTTP Server Microsoft-IIS/8.5
-0.95    DNS Alexa: top-1,000,000
-0.91    HTTP User-Agent Microsoft-CryptoAPI/6.1
-0.88    TLS Ciphersuite TLS_ECDHE_ECDSA_WITH_RC4_128_SHA
-0.85    HTTP Content Type application/x-gzip

Table 2: The data features most relevant to the TLS/DNS/HTTP classifier.
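
To show how weights like these act in a logistic regression classifier, here is a minimal sketch using a few of the weights from Table 2. It assumes positive weights push toward the malware class, treats each feature as a plain 0/1 indicator, and uses an intercept of 0 because the table does not give one, so the resulting numbers are illustrative only.

```python
import math

# A few weights copied from Table 2; everything else is treated as weight 0.
WEIGHTS = {
    "DNS Suffix org": 3.38,
    "DNS TTL 3600": 2.99,
    "HTTP Field location": 1.78,
    "DNS Alexa: None": 1.38,
    "TLS Extension extended_master_secret": -2.16,
    "DNS Alexa: top-1,000,000": -0.95,
}


def malware_probability(active_features, bias=0.0):
    """Sum the weights of the features present in a flow and squash the
    score with the logistic function. The bias of 0.0 is an assumption."""
    score = bias + sum(WEIGHTS.get(name, 0.0) for name in active_features)
    return 1.0 / (1.0 + math.exp(-score))


suspicious = malware_probability(["DNS Alexa: None", "HTTP Field location"])
benign_like = malware_probability(["TLS Extension extended_master_secret"])
```

Since the features are normalized before training, the real model's inputs are not raw 0/1 flags; the sketch only shows the direction in which each feature moves the score.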

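7.2 DNS
The Domain Name System (DNS) [28] is a hierarchical, decentralized means of providing additional information about domain names, in particular the mapping from domain names to IP addresses. Recently, malware has used DNS together with domain generation algorithms (DGAs) [8] to provide a robust way of running its command and control channels. There are many prior results on classifying DNS data as malicious or benign [7, 9, 24]. None of that work uses DNS data to make inferences about encrypted traffic. Our work also differs in that we illustrate the distributions of different DNS data features, such as TTL values.
7.3 HTTP
The Hypertext Transfer Protocol (HTTP) [17] is an application-level protocol for transferring data on the World Wide Web. As with DNS, threat actors also use HTTP as a command and control channel [29, 33]. Some work has focused specifically on features present in HTTP data. In [33], the authors cluster malware using statistics (for example, the average URL length) and string matching on URLs. Again, the specific differences between malware and benign HTTP sessions are not highlighted. [22] analyzes User-Agent field values specifically. We give a detailed account of more HTTP fields and use this information to build machine learning classifiers for encrypted traffic.

A key assumption behind Cisco's use of DNS and HTTP is DGA and HTTP C&C.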


Hi, I'm Li Zhihua, a security AI algorithm expert at Huawei. Welcome to the interesting world of security attack and defense.
