How To Set Up A Linux Active Defense System Or Intrusion Detection System On Linux

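0. Active defense base platform
1. Survey of open-source HIDS
2. Process-control active defense
3. File change monitoring (FileMonitor)
4. Auxiliary interfaces exported by kernel modules on Linux
5. Self-protection of the active defense system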

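0. Active defense base platform

Fundamentally, active defense and intrusion detection systems are different things, and active defense places higher demands on architecture design

1. Active defense needs a richer set of Hook points on the critical paths of system behavior. An intrusion detection system mostly cares that a certain "class" of event occurred, whereas active defense often has to reach into the internal logic of that event, even its entire lifecycle, and split that lifecycle into finer-grained stages

2. Active defense has to perform real-time, serialized (inline), policy-based control, rather than merely capturing logs for specific events
Following from this idea
    1) Active defense needs a more efficient RING0~RING3 communication mechanism
    2) Policy evaluation and message handling in both RING3 and RING0 must be fast. If the serialized processing takes too long, user-space workloads are delayed and may even go DOWN

3. Active defense is more demanding in its choice of LSM Hook Points. By design, active defense inserts a user-space decision module into the code path of critical system calls in the KERNEL, which is achieved through efficient RING0~RING3 communication. This requires the RING0 LSM Hook Point to be able to both read (GET) and modify (SET) parameters

For host-level intrusion detection and active defense, it is not always possible to defend perfectly at every stage of an attacker's intrusion flow, although that would of course be ideal. At a minimum, every branch of the first half of the intrusion flow should contain at least one stage where the attack can be detected and blocked in real time; if that holds, the goal of active defense has been met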

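1. MAC: configures which resources a given subject (for example a process) may operate on; a typical example is SELINUX
2. DAC: configures which subjects may operate on a given resource (for example a file); a typical example is ACL
//Together these form a two-dimensional rule matrix, the Cartesian product of subjects and resources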
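
The rule matrix can be pictured as a table keyed by (subject, object) pairs. The following is only a minimal user-space sketch: the PERM_* bits, the struct rule layout, and the sample paths are hypothetical and not taken from SELINUX or any ACL implementation, and matching is by exact path rather than by label or prefix.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical permission bits, for illustration only. */
enum { PERM_READ = 1, PERM_WRITE = 2, PERM_EXEC = 4 };

/* One populated cell of the subject x object matrix. */
struct rule {
    const char *subject;   /* e.g. path of the executing program */
    const char *object;    /* e.g. path of the resource */
    unsigned allowed;      /* bitmask of PERM_* granted to the pair */
};

static const struct rule rules[] = {
    { "/usr/sbin/httpd", "/var/www",       PERM_READ },
    { "/usr/sbin/httpd", "/var/log/httpd", PERM_READ | PERM_WRITE },
    { "/bin/login",      "/etc/shadow",    PERM_READ },
};

/* Default deny: a pair with no cell in the matrix is granted nothing. */
bool rule_allows(const char *subject, const char *object, unsigned wanted)
{
    for (size_t i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
        if (strcmp(rules[i].subject, subject) == 0 &&
            strcmp(rules[i].object, object) == 0)
            return (rules[i].allowed & wanted) == wanted;
    }
    return false;
}
```

Reading the table by row gives the MAC view (what one subject may touch), and reading it by column gives the DAC view (who may touch one object).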

Relevant Link:

http://sourceforge.net/projects/lids/
http://liuyuanljy.blog.51cto.com/607434/126873
http://www.symantec.com/connect/articles/focus-linux-intrusion-detection-linux
https://books.google.com.hk/books?id=jQE-iUCjUKAC&pg=PA547&lpg=PA547&dq=linux+lids&source=bl&ots=iD6Xqbc8wx&sig=XZ4E7Ix6s36HWH7E-oizuStyrJE&hl=zh-CN&sa=X&ei=1X4aVf2jBZWPuASxkIGoDw&ved=0CGcQ6AEwCQ#v=onepage&q=linux%20lids&f=false

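0x1: KERNEL - USER SPACE communication mechanisms

Existing communication schemes can be roughly classified as follows

1. FileSystem-based communication
    1) Kernel boot parameters
    2) Debugfs
    3) ioctl
    4) procfs
    5) Character Devices
2. Communication based on VFS Kernel Memory
    1) Module parameters and sysfs, for sharing and transferring simple data
    2) sysctl

3. Socket-based communication
    1) Netlink
    2) UDP Socket

4. Shared Memory-based communication
    1) mmap and the POSIX (Portable Operating System Interface for UNIX) shared memory mechanism (shmem)

5. System Call-based communication
    1) Upcall

6. SIGNAL-based communication
    1) Sending Signals from the Kernel to the User Space

Among these techniques, the only practical candidates for combining high data volume, low latency, and efficient synchronization are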

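1. netlink: socket-based data transfer. Data is copied across the RING3-RING0 boundary, so it is best suited to small control messages
2. shared memory: memory-based data transfer. No cross-boundary copy, the highest communication efficiency, suited to bulk data
3. Kernel-mode ring queue: used together with shared memory
For background on ring queues, see this other article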
http://www.cnblogs.com/LittleHann/p/4392061.html
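
To make the ring queue idea concrete, here is a minimal single-producer, single-consumer ring buffer sketch. It is ordinary user-space C11, not kernel code: in the design described above the struct would live in memory shared between the kernel module (producer) and the user-space decision process (consumer), which is an assumption this sketch does not implement. The struct event layout and the RING_SLOTS size are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8u              /* must be a power of two */

struct event {                     /* hypothetical fixed-size record */
    uint32_t pid;
    uint32_t type;
};

struct ring {
    _Atomic uint32_t head;         /* next slot the producer writes */
    _Atomic uint32_t tail;         /* next slot the consumer reads */
    struct event slots[RING_SLOTS];
};

/* Producer side. Returns false when the ring is full. */
bool ring_push(struct ring *r, const struct event *ev)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SLOTS)
        return false;              /* full: caller must drop or retry */
    r->slots[head & (RING_SLOTS - 1)] = *ev;
    /* Publish the slot only after its contents are written. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the ring is empty. */
bool ring_pop(struct ring *r, struct event *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;              /* empty */
    *out = r->slots[tail & (RING_SLOTS - 1)];
    /* Hand the slot back to the producer. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

Because each index is written by only one side, no lock is needed. With several producers (for example hooks firing on multiple CPUs) this scheme would need additional synchronization.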

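In a real project, a combination of netlink and shared memory can be used for communication
Both rely on named objects for KERNEL - USER SPACE communication, for example

1. Netlink: a named, visible Socket fd
2. Physical memory shared with user space through a device node: a named, visible /dev node

Named objects can be attacked, for example

1. Forcibly locking the dev node used for communication, which may HANG the KERNEL
2. Using /dev to steal data from the KERNEL - USER SPACE channel
3. Using /dev to inject large amounts of junk data into the KERNEL - USER SPACE channel, drowning out legitimate data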

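0x2: Hook point selection and backward compatibility

The base framework of active defense is essentially a Hook framework. To maximize compatibility across system versions, LSM Hook is a good choice. One issue needs attention, though: backward compatibility, or more precisely compatibility with what the user has already installed. Linux is a highly configurable operating system; the Hook framework available to security researchers is equally open to the ROOT user, and therefore also to an attacker who has obtained ROOT privileges. A mature LSM Hook framework has to take the following into account

1. The user has enabled SELINUX: the LSM Hook framework must remain backward compatible with the originally mounted LSM functions (compatibility between multiple LSM Hook points)
A security module must be registered when it is loaded. The user (SELINUX) can call register_security() to register a security module with LSM; once the module is loaded successfully it can perform access control. If another security module then tries to load with register_security(), an error is returned, and the next module can only be loaded after the first is unregistered from the framework with unregister_security()

2. LSM module stacking
LSM provides mod_reg_security() and mod_unreg_security() so that several security modules can be registered in succession. A module that arrives later can register with the first module through mod_reg_security(), forming a module stack that supports different policies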
/*
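Note that the availability of mod_reg_security differs between KERNEL versions: before release 25 (presumably of the 2.6 series) the function existed and was exported in the kernel symbol table, but afterwards it no longer exists. To use this functionality you have to implement it yourself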
https://download.openswan.org/openswan/openswan-ocf/klips-fsm/security/security.c
//mod_reg_security - allows security modules to be "stacked"
//This function allows security modules to be stacked if the currently loaded
int mod_reg_security(const char *name, struct security_operations *ops)
{
    if (verify(ops)) 
    {
        printk(KERN_INFO "%s could not verify security operations.\n", __FUNCTION__);
        return -EINVAL;
    }

    if (ops == security_ops) 
    {
        printk(KERN_INFO "%s security operations already registered.\n", __FUNCTION__);
        return -EINVAL;
    } 
    return security_ops->register_security(name, ops);
}

//mod_unreg_security - allows a security module registered with mod_reg_security() to be unloaded
//This function allows security modules that have been successfully registered with a call to mod_reg_security() to be unloaded from the system.
int mod_unreg_security(const char *name, struct security_operations *ops)
{
    if (ops == security_ops) 
    {
        printk(KERN_INFO "%s invalid attempt to unregister primary security ops.\n", __FUNCTION__);
        return -EINVAL;
    } 
    return security_ops->unregister_security(name, ops);
}
 
EXPORT_SYMBOL_GPL(mod_reg_security);
EXPORT_SYMBOL_GPL(mod_unreg_security);
*/

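3. Compatibility with the user's own LSM Hook modules
Besides mod_reg_security there is another approach: check whether the corresponding pointer in the security_ops table has been set to a non-default function. If so, another LSM module was installed before our LSM Hook module. In that case the original LSM function is executed first, and the next step depends on its result
    1) Original LSM allows: continue with our LSM Hook decision logic
    2) Original LSM denies: keep denying

4. Self-protection of the LSM Hook module
An attacker who has obtained ROOT has the same privileges as the ROOT administrator and can modify the KERNEL arbitrarily, including the LSM security_ops table. Our LSM Hook framework must protect the Module it has mounted so that it cannot be hijacked

The whole LSM decision proceeds as a chain: active defense LSM Hook -> if (SELINUX Hook != NULL) then SELINUX Hook -> if (default LSM Hook != NULL) then default LSM Hook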
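
The chained decision can be sketched as a loop over an ordered array of hook pointers. This is a user-space illustration only: the hook_fn signature, run_hook_chain, and the two sample hooks are hypothetical, and a real implementation would call the saved security_ops entries from inside the kernel. The array order decides which module is consulted first.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical hook signature: 0 means allow, a negative errno means deny. */
typedef int (*hook_fn)(const char *path);

/* Run the hooks in order; a NULL slot means no module is installed there. */
int run_hook_chain(const hook_fn *chain, size_t n, const char *path)
{
    for (size_t i = 0; i < n; i++) {
        if (chain[i] == NULL)
            continue;              /* skip empty slots */
        int rc = chain[i](path);
        if (rc != 0)
            return rc;             /* the first denial ends the chain */
    }
    return 0;                      /* every installed hook allowed it */
}

/* Sample hooks standing in for the layers of the chain. */
int defense_hook(const char *path)
{
    return path[0] == '/' ? 0 : -EPERM;   /* toy rule: absolute paths only */
}

int deny_all_hook(const char *path)
{
    (void)path;
    return -EACCES;
}
```

A denial from any layer is final, which matches the rule that a request blocked by one module stays blocked.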

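0x3: Registering and unregistering the LSM Hook module

For security reasons, the Kernel does not export the LSM register_security function and does not implement unregister_security. A kernel active defense module therefore cannot use the native registration and unregistration mechanism and has to implement this itself

1. Use kprobe to obtain, at run time, the kernel memory address of security_ops, the LSM function pointer table
2. On registration, atomically replace the table entries of the functions to be hooked
3. On unregistration, restore the hooked entries to their original values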
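
Steps 2 and 3 amount to an atomic swap of a function pointer and a later restore. The sketch below shows that pattern in user-space C11 on a single stand-in slot; it does not locate or touch the real security_ops table, and check_fn, slot, and the sample functions are all hypothetical names.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef int (*check_fn)(const char *path);

/* Stand-in for one entry of the kernel's hook table. */
static _Atomic(check_fn) slot;
/* Original entry, remembered so it can be restored on unregistration. */
static check_fn saved_original;

/* Registration: swap in the replacement and keep the old pointer. */
void hook_install(check_fn replacement)
{
    saved_original = atomic_exchange(&slot, replacement);
}

/* Unregistration: put the original pointer back. */
void hook_restore(void)
{
    atomic_store(&slot, saved_original);
}

/* What a caller of the table entry does. */
int call_slot(const char *path)
{
    check_fn fn = atomic_load(&slot);
    return fn ? fn(path) : 0;
}

/* Sample implementations used for illustration. */
int original_check(const char *path) { (void)path; return 0; }
int strict_check(const char *path)   { (void)path; return -1; }
```

The exchange guarantees that a concurrent caller sees either the old or the new pointer, never a torn value. It does not by itself tell you when the old or new function has no callers left.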

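When unloading the LSM Hook module, check whether the reference count is 0. Only when it is 0, that is, when no system call is currently executing inside a Hook function, may unloading proceed and the kernel memory of the Hook functions be released

As long as the reference count has not dropped to zero, rmmod cannot complete, so the kernel memory of the Hook functions is not freed and in-flight system calls can return safely. This is what makes unloading safe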
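
The reference-count check can be sketched as follows. This is a user-space model with hypothetical names (hook_enter, hook_exit, hook_disable, hook_can_unload); it ignores the short window in a real kernel between a caller reading the hook pointer and incrementing the counter, which a production module would have to close by other means.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int hook_active = 1;   /* 1 while the hook is installed */
static atomic_int hook_refs   = 0;   /* calls currently inside the hook */

/* Called on entry to the hook body; false means the hook is being removed. */
bool hook_enter(void)
{
    atomic_fetch_add(&hook_refs, 1);
    if (!atomic_load(&hook_active)) {
        atomic_fetch_sub(&hook_refs, 1);
        return false;
    }
    return true;
}

/* Called on every exit path of the hook body. */
void hook_exit(void)
{
    atomic_fetch_sub(&hook_refs, 1);
}

/* Unload step 1: stop new callers from entering. */
void hook_disable(void)
{
    atomic_store(&hook_active, 0);
}

/* Unload step 2: memory may be freed only when nobody is still inside. */
bool hook_can_unload(void)
{
    return !atomic_load(&hook_active) && atomic_load(&hook_refs) == 0;
}
```

The module exit path would call hook_disable() and then wait until hook_can_unload() returns true before releasing anything.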

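0x4: Modifying parameters at LSM Hook points

The core technique of active defense is hooking, and what can be done at a Hook Point determines how effective the product ultimately is. Broadly, it needs to support the following actions

1. DENY: block the current system action
2. ALLOW: let the current system action proceed
3. IGNORE: ignore the current system action without recording it
4. LOG: record the current system action
5. SET: modify the parameters of the current system action

Reading information at a Hook point is easy, but modifying parameters (SET) at an LSM Hook point involves a number of technical difficulties

1. The LSM Hook point must sit as early as possible in the execution flow, so that the modified parameters affect the rest of the current system call
2. Every pointer, reference, and temporary copy of the parameters on the heap or stack (which may be assigned and used later) must be covered, so that the data stays consistent before and after the modification
3. A single LSM Hook point may not be able to satisfy every SET requirement, so several Hook points have to cooperate. For process creation, for example, this means dividing the whole process into smaller sub-steps and chaining them in sequence
4. The data structures touched by the individual LSM Hooks are not necessarily thread-safe, so modifications must be atomic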

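0x5: Kernel-mode whitelist

During normal operation the LSM Hook points report a large amount of "redundant data", behavior that will almost never match a real-time blocking policy. A filtering whitelist should therefore be implemented in kernel mode

1. Following Smack's approach, use a virtual filesystem as the RING3-RING0 communication channel and load the whitelist directly into that virtual filesystem
2. The kernel module can then read the virtual filesystem directly and filter redundant data in kernel mode

example

/*
For example:
1. If the path of the new process is /bin/ping, it is ignored in the Kernel as redundant data
2. If the path of the module being loaded is /lib64/libc.so.6, it is ignored in the Kernel as redundant data
*/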
/ActiveDefense
    /ActiveDefense/ProcessControl/
        /ActiveDefense/ProcessControl/newProcess/
            /ActiveDefense/ProcessControl/newProcess/1: /bin/ping
        /ActiveDefense/ProcessControl/newThread/
        /ActiveDefense/ProcessControl/loadMod/
            /ActiveDefense/ProcessControl/loadMod/1: /lib64/libc.so.6
            /ActiveDefense/ProcessControl/loadMod/2: /etc/ld.so.preload
            /ActiveDefense/ProcessControl/loadMod/3: /usr/lib/locale/locale-archive
        /ActiveDefense/ProcessControl/exitProcess/
    /ActiveDefense/FilesystemControl/

Lightweight Ring3-Ring0 configuration data can likewise be exchanged through a virtual filesystem such as /fs or /dev
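
The whitelist lookup itself is simple. The sketch below models the tree above as one small list of paths per event category and checks for an exact match; it is user-space C with hypothetical names (wl_category, wl_add, wl_is_redundant) and a fixed capacity, and it leaves out how entries would actually arrive through the virtual filesystem.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Categories mirror two branches of the example tree. */
enum wl_category { WL_NEW_PROCESS, WL_LOAD_MOD, WL_CATEGORY_COUNT };

#define WL_MAX_ENTRIES 16

struct whitelist {
    const char *paths[WL_CATEGORY_COUNT][WL_MAX_ENTRIES];
    size_t count[WL_CATEGORY_COUNT];
};

/* Add one path to a category; false when the category is full. */
bool wl_add(struct whitelist *wl, enum wl_category cat, const char *path)
{
    if (wl->count[cat] == WL_MAX_ENTRIES)
        return false;
    wl->paths[cat][wl->count[cat]++] = path;
    return true;
}

/* True when the event can be dropped without reporting it to user space. */
bool wl_is_redundant(const struct whitelist *wl, enum wl_category cat,
                     const char *path)
{
    for (size_t i = 0; i < wl->count[cat]; i++)
        if (strcmp(wl->paths[cat][i], path) == 0)
            return true;
    return false;
}
```

A linear scan is adequate for a handful of entries; a larger list on a hot hook path would probably call for a hash table.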

1. Survey of open-source HIDS

0x1: OSSEC

Key Benefits

1. Multi platform
OSSEC lets customers implement a comprehensive host based intrusion detection system with fine grained application/server specific policies across multiple platforms such as Linux, Solaris, AIX, HP-UX, BSD, Windows, Mac and Vmware ESX.

2. Compiled Agent for Windows
3. More functionality than just FIM
4. Rigid but simple installation process

5. Compliance Requirements
OSSEC helps customers meet specific compliance requirements such as PCI, HIPAA etc. It lets customers detect and alert on unauthorized file system modifications and malicious behavior embedded in the log files of COTS products as well as custom applications. For PCI, it covers the sections of file integrity monitoring (PCI 11.5, 10.5), log inspection and monitoring (section 10) and policy enforcement/checking.

6. Real-time and Configurable Alerts
OSSEC lets customers configure incidents they want to be alerted on which lets them focus on raising the priority of critical incidents over the regular noise on any system. Integration with smtp, sms and syslog allows customers to be on top of alerts by sending these on to e-mail and handheld devices such as cell phones and pagers. Active response options to block an attack immediately are also available.

7. Integration with current infrastructure
OSSEC will integrate with current investments from customers such as SIM/SEM (Security Incident Management/Security Events Management) products for centralized reporting and correlation of events.

8. Centralized management
OSSEC provides a simplified centralized management server to manage policies across multiple operating systems. Additionally, it also lets customers define server specific overrides for finer grained policies.

9. Agent and agentless monitoring
OSSEC offers the flexibility of agent based and agentless monitoring of systems and networking components such as routers and firewalls. It lets customers who have restrictions on software being installed on systems (such as FDA approved systems or appliances) meet security and compliance needs.

OSSEC's main features are as follows

1. log analysis
Your operating system wants to speak to you, but do you know how to listen? Every operating system, application, and device on your network generate logs (events) to let you know what is happening. OSSEC collects, analyzes and correlates these logs to let you know if something wrong is going on (attack, misuse, errors, etc). Do you want to know when an application is installed on your client box? Or when someone changes a rule in your firewall? By monitoring your logs, OSSEC will let you know of that.

2. file integrity checking(File Integrity Monitoring)
There is one thing in common to any attack to your networks and computers: they change your systems in some way. The goal of file integrity checking (or FIM – file integrity monitoring) is to detect these changes and alert you when they happen. It can be an attack, or a misuse by an employee or even a typo by an admin, any file, directory or registry change will be alerted to you

3. policy monitoring
4. rootkit detection
Criminals (also known as hackers) want to hide their actions, but using rootkit detection you can be notified when they (or trojans, viruses, etc) change your system in this way.

5. real-time alerting and active response 

6. Active response
Take immediate and automatic responses when something happens. Why wait for hours when you can alert your admin and block an attack right away?

OSSEC is a full platform to monitor and control your systems. It mixes together all the aspects of HIDS

1. host-based intrusion detection
2. log monitoring and SIM/SIEM

together in a simple, powerful and open source solution. It is also backed and fully supported by Trend Micro.
OSSEC is composed of multiple pieces. It has a central manager monitoring everything and receiving information from agents, syslog, databases and from agentless devices.

1. Manager
The manager is the central piece of the OSSEC deployment. It stores the file integrity checking databases, the logs, events and system auditing entries. All the rules, decoders and major configuration options are stored centrally in the manager, making it easy to administer even a large number of agents.

2. Agents
The agent is a small program installed on the systems you desire to monitor. It collects information in real time and forwards it to the manager for analysis and correlation. It has a very small memory and CPU footprint by default and does not interfere with the system’s usage.
Agent security: It runs with a low privilege user (created during the installation) and inside a chroot jail isolated from the system. Most of the agent configuration is pushed from the manager, with only some options stored locally on each agent. In case these local options are changed, the manager will receive the information and will generate an alert.

3. Agentless
For systems that you can’t install an agent, OSSEC allows you to perform file integrity monitoring on them without the agent installed. It can be very useful to monitor firewalls, routers and even Unix systems where you are not allowed to install the agent.

4. Virtualization/Vmware
OSSEC allows you to install the agent on the guest operating systems or inside the host (Vmware ESX). With the agent installed inside the VMware ESX you can get alerts about when a VM guest is being installed, removed, started, etc. It also monitors logins, logouts and errors inside the ESX server. In addition to that, OSSEC performs the CIS checks for Vmware, alerting if there is any insecure configuration option enabled or any other issue.

5. Firewalls, switches and routers
OSSEC can receive and analyze syslog events from a large variety of firewalls, switches and routers. It supports all Cisco routers, Cisco PIX, Cisco FWSM, Cisco ASA, Juniper Routers, Netscreen firewall, Checkpoint and many others.

Relevant Link:

http://www.ossec.net/?page_id=19
http://www.ossec.net/files/ossec-vm-2.8.1.README
http://www.ossec.net/?page_id=165
http://www.ossec.net/?page_id=169

0x2: Security Onion

Relevant Link:

http://blog.securityonion.net/
https://www.alienvault.com/blogs/security-essentials/open-source-intrusion-detection-tools-a-quick-overview

0x3: Linux Intrusion Detection System(LIDS)

In computer security, the Linux Intrusion Detection System (LIDS) is a patch to the Linux kernel and associated administrative tools that enhances the kernel's security by implementing Mandatory Access Control (MAC).
It uses and extends the system capabilities bounding set to control the whole system and adds some network and filesystem security features to the kernel to enhance the security.

1. LIDS is a patch to the Linux kernel
2. it implements access control and a reference monitor
3. LIDS is configured with its two admin tools
    1) lidsconf
    lidsconf is the tool that allows you to add and remove access control to certain files, which can be binaries or any other files. LIDS refers to these files as objects, and the capabilities we allow or disallow are referred to as subjects. LIDS overrides things like filesystem permissions. You can literally use LIDS to make it impossible to access virtually any object, whether it's a file, raw device, memory or I/O, even if you're trying to access the object as the root user.

    2) lidsadm 
    lidsadm is the utility that allows you to disable LIDS in a terminal, so that you can set various settings, which LIDS, when enabled, won't allow you to do, and you can view the current status of your LIDS installation with this tool 
//In short, LIDS is a complete security model implementation for the Linux kernel

Setting Access Controls

lidsconf -A [-s subject] -o object [-d] [-t from-to] [-i level] -j ACTION

1. subject
subject is a program upon which a capability is added

2. object
object can be a binary, directory, socket name or a capability.

3. -d
The -d switch tells LIDS that the domain is an exec domain

4. -t
The -t lets you set a specific time dependency for the capability

5. -i
-i defines the inheritance level

6. -j
The -j switch is an action that can be one of the following:
    1) DENY: denies access to the object.
    2) READONLY: sets the object to read-only.
    3) APPEND: mostly used for logs, this allows a certain program to append only that file, not remove it.
    4) WRITE: allows other binaries to write on the file.
    5) GRANT: used in conjunction with a capability, used to grant the subject a capability.
    6) IGNORE and DISABLE: two options that allow you to disable the setting of any permission on a certain object and disable some extension features, respectively.

The capabilities LIDS supports are the following, as can be seen by typing:
lidsadm -h | grep CAP

CAP_CHOWN: chown/chgrp.
CAP_DAC_OVERRIDE: DAC access.
CAP_DAC_READ_SEARCH: DAC read.
CAP_FOWNER: owner ID, not equal user ID.
CAP_FSETID: effective user ID, not equal owner ID.
CAP_KILL: real/effective ID, not equal process ID.
CAP_SETGID: set*gid(2).
CAP_SETUID: set*uid(2).
CAP_SETPCAP: transfer capability.
CAP_LINUX_IMMUTABLE: immutable and append file attributes.
CAP_NET_BIND_SERVICE: binding to ports below 1024.
CAP_NET_BROADCAST: broadcasting/listening to multicast.
CAP_NET_ADMIN: interface/firewall/routing changes.
CAP_NET_RAW: raw sockets.
CAP_IPC_LOCK: locking of shared memory segments.
CAP_IPC_OWNER: IPC-ownership checks.
CAP_SYS_MODULE: insertion and removal of kernel modules.
CAP_SYS_RAWIO: ioperm(2)/iopl(2) access.
CAP_SYS_CHROOT: chroot(2).
CAP_SYS_PTRACE: ptrace(2).
CAP_SYS_PACCT: configuration of process accounting.
CAP_SYS_ADMIN: tons of admin stuff.
CAP_SYS_BOOT: reboot(2).
CAP_SYS_NICE: nice(2).
CAP_SYS_RESOURCE: sets resource limits.
CAP_SYS_TIME: sets system time.
CAP_SYS_TTY_CONFIG: tty configuration.
CAP_MKNOD: mknod operation.
CAP_LEASE: taking leases on files.
CAP_HIDDEN: hidden process.
CAP_KILL_PROTECTED: kill protected programs.
CAP_PROTECTED: protect the process from signals.

The following shows an example of LIDS usage

1. Setting Up a System
//After you restart LIDS, you can begin adding access controls to various system binaries and libraries. The following sets the /sbin, /bin, /usr/bin and /lib to read-only:
lidsconf -A -o /sbin -j READONLY
lidsconf -A -o /bin -j READONLY
lidsconf -A -o /usr/bin -j READONLY
lidsconf -A -o /lib -j READONLY

//Next, we define some additional access controls for /opt, /etc and /usr/local/etc, which should be read-only, and we deny all access to /etc/shadow and the boot manager file:
lidsconf -A -o /etc -j READONLY
lidsconf -A -o /usr/local/etc -j READONLY
lidsconf -A -o /etc/shadow -j DENY
lidsconf -A -o /etc/lilo.conf -j DENY

//Because we have denied all access to /etc/shadow, the system will not be able to authenticate logins, thus we need to allow login and vlock to have read-only access to the file. Additionally, su also should have read-only access to the /etc/shadow file:
lidsconf -A -s /bin/login -o /etc/shadow -j READONLY
lidsconf -A -s /usr/bin/vlock -o /etc/shadow -j READONLY
lidsconf -A -s /bin/su -o /etc/shadow -j READONLY

//We need to set some other access controls for su, in order for it to work with UIDs and GIDs, and access the /etc/shadow file:
lidsconf -A -s /bin/su -o CAP_SETUID -j GRANT
lidsconf -A -s /bin/su -o CAP_SETGID -j GRANT
lidsconf -A -s /bin/su -o /etc/shadow -j READONLY

//Now, we need to allow init, login and associated applications to have write access to log files:
lidsconf -A -o /var/log -j APPEND
lidsconf -A -s /bin/login -o /var/log/wtmp -j WRITE
lidsconf -A -s /bin/login -o /var/log/lastlog -j WRITE
lidsconf -A -s /sbin/init -o /var/log/wtmp -j WRITE
lidsconf -A -s /sbin/init -o /var/log/lastlog -j WRITE
lidsconf -A -s /sbin/halt -o /var/log/wtmp -j WRITE
lidsconf -A -s /sbin/halt -o /var/log/lastlog -j WRITE
lidsconf -A -s /etc/rc.d/rc.sysinit -o /var/log/wtmp -i 1 -j WRITE
lidsconf -A -s /etc/rc.d/rc.sysinit -o /var/log/lastlog -i 1 -j WRITE

//Now, we set up access control for root's home folder. We allow only the bash history file to be appended:
lidsconf -A -o /root -j READONLY
lidsconf -A -s /bin/bash -o /root/.bash_history -j APPEND

//Finally, we allow the init program to kill processes on shutdown:
lidsconf -A -s /sbin/init -o CAP_INIT_KILL -j GRANT
lidsconf -A -s /sbin/init -o CAP_KILL -j GRANT

//Now, we allow fstab and init scripts to mount filesystems, kill processes and unmount filesystems:
lidsconf -A -s /etc/fstab -o CAP_SYS_ADMIN -i 1 -j GRANT
lidsconf -A -s /etc/rc.d/init.d/halt -o CAP_INIT_KILL -i 1 -j GRANT
lidsconf -A -s /etc/rc.d/init.d/halt -o CAP_KILL -i 1 -j GRANT
lidsconf -A -s /etc/rc.d/init.d/halt -o CAP_NET_ADMIN -i 1 -j GRANT
lidsconf -A -s /etc/rc.d/init.d/halt -o CAP_SYS_ADMIN -i 1 -j GRANT


2. Setting Access Controls for the Apache Web Server
/*
The sample commands below assume that the Apache installation resides in /usr/local/apache, with a log directory of /var/log/httpd, and also assumes your Apache configuration directory is /etc/httpd. MySQL is assumed to be installed in /usr/local/mysql.
*/
//Apache needs to have setuid and setgid capabilities. We also need to allow Apache to access log files and deny other applications from accessing the httpd binary:
lidsconf -A -s /usr/local/apache/bin/httpd -o CAP_SETUID -j GRANT
lidsconf -A -s /usr/local/apache/bin/httpd -o CAP_SETGID -j GRANT
lidsconf -A -o /etc/httpd -j DENY
lidsconf -A -s /usr/local/apache/bin/httpd -o /etc/httpd -j READONLY
lidsconf -A -o /usr/local/apache -j DENY
lidsconf -A -s /usr/local/apache/bin/httpd -o /usr/local/apache -j READONLY
lidsconf -A -o /var/log/httpd -j DENY
lidsconf -A -s /usr/local/apache/bin/httpd -o /var/log/httpd -j APPEND
lidsconf -A -s /usr/local/apache/bin/httpd -o /usr/local/apache/logs -j WRITE


3. MySQL
//For MySQL, we need to deny other applications' access to the mysql binary. We also need to restrict access to the mysql/var directory so that it's append-only, and allow read-only access for the mysqld dæmon to the mysql directory:
lidsconf -A -o /usr/local/mysql/var -j APPEND
lidsconf -A -o /usr/local/mysql -j DENY
lidsconf -A -s /usr/local/mysql/libexec/mysqld -o /usr/local/mysql -j READONLY
lidsconf -A -s /usr/local/mysql/libexec/mysqld -o /usr/local/mysql/var -j WRITE


4. Login
//Login is the program that allows a user to log in to a GNU/Linux system:
lidsconf -A -s /bin/login -o /etc/shadow -j READONLY
lidsconf -A -s /bin/login -o CAP_SETUID -j GRANT
lidsconf -A -s /bin/login -o CAP_SETGID -j GRANT
lidsconf -A -s /bin/login -o CAP_CHOWN -j GRANT
lidsconf -A -s /bin/login -o CAP_FSETID -j GRANT

LIDS focuses mainly on the following areas

1. File System
    1) file access
    2) Protect important files and directories irrespective of the file system
    3) CAP_LINUX_IMMUTABLE: immutable and append file attributes (prevents changing the attributes of specified files)
    4) CAP_CHOWN: chown/chgrp
    5) CAP_FOWNER: owner ID, not equal user ID
    6) CAP_FSETID: effective user ID, not equal owner ID
    7) CAP_SYS_TTY_CONFIG: tty configuration
    8) CAP_MKNOD: mknod operation: control over device nodes

2. system/network administration
    1) System administration 
    2) CAP_NET_ADMIN: interface/firewall/routing changes
    3) CAP_SYS_BOOT: reboot(2)
    4) CAP_SYS_RESOURCE: sets resource limits
    5) CAP_SYS_TIME: sets system time

3. any capability use
    1) Super user may abuse his rights
    2) CAP_DAC_OVERRIDE: DAC access 
    3) CAP_DAC_READ_SEARCH: DAC read 
    4) CAP_SETGID: set*gid(2)
    5) CAP_SETUID: set*uid(2)
    6) CAP_SYS_CHROOT: chroot(2)

4. raw device
    1) Prevent raw i/o operations by any unauthorized program: on Linux everything is a file, so protecting critical devices still comes down to controlling what processes may do to files (the MAC and DAC models)

5. mem and I/O access  
    1) CAP_IPC_LOCK: locking of shared memory segments
    2) CAP_IPC_OWNER: IPC-ownership checks
    3) CAP_SYS_RAWIO: ioperm(2)/iopl(2) access

6. Processes  
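    1) Protect important processes: protects a given process by protecting its on-disk binary and its configuration files
    2) CAP_KILL: real/effective ID, not equal process ID
    3) CAP_SYS_MODULE: insertion and removal of kernel modules: inserting a module with insmod is a high-risk action
    4) CAP_SYS_PTRACE: ptrace(2): commonly used for process injection
    5) CAP_SYS_NICE: nice(2)
    6) CAP_HIDDEN: hidden process
    7) CAP_KILL_PROTECTED: kill protected programs
    8) CAP_PROTECTED: protect the process from signals: process self-protection

7. Network Manager
    1) CAP_NET_BIND_SERVICE: binding to ports below 1024: ordinary server applications typically listen on well-known ports below 1024, so a listener on a port above 1024 is "quite likely" an arbitrary program deployed by an attacker
    2) CAP_NET_RAW: raw sockets: raw sockets are often used by attackers for packet capture and network sniffing, a high-risk action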
 
8. Account Protect
    1) CAP_SYS_ADMIN: tons of admin stuff

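//LIDS takes file read/write control as its entry point: protecting files and controlling the file reads and writes of processes are its main goals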

Relevant Link:

http://www.linuxjournal.com/article/8690?page=0,2
http://freecode.com/projects/lids
http://www.securityfocus.com/tools/1966 
http://liuyuanljy.blog.51cto.com/607434/126873
http://infocenter.guardiandigital.com/manuals/SecureCommunity/node228.html
http://www.tjscott.net/ids/lids.faq.htm

0x4: OpenWall

Relevant Link:

http://www.openwall.com/Owl/
http://www.symantec.com/connect/articles/focus-linux-intrusion-detection-linux

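0x5: SELINUX(Security-Enhanced Linux)

SELinux (Security-Enhanced Linux) is an implementation of MAC (Mandatory Access Control). Its purpose is to state explicitly which resources (files, network ports, and so on) a given process may access
Mandatory access control is meant to strengthen a system's resistance to 0-Day attacks. It is therefore not a replacement for a network firewall or ACLs, nor does it duplicate them. For example
Suppose a vulnerability is found in Apache on a system that lets a remote user read sensitive files (such as /etc/passwd, to enumerate existing users), and the Apache update that fixes it has not been released yet. SELinux can serve as a mitigation in the meantime: because /etc/passwd does not carry a label that Apache may access, SELinux blocks Apache's access to /etc/passwd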

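SELinux has the following characteristics

1. Policies can be queried rather than being invisible to programs
2. Policies can be changed on the fly without rebooting or stopping services (hot reload)
3. Policy can control a process at three points: initialization, inheritance, and program execution (the process lifecycle)
4. Its scope of control covers
    1) Filesystems
    2) Directories
    3) Files
    4) File descriptors
    5) Ports
    6) Message interfaces and network interfaces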

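SELinux policy

As noted, SELinux follows the principle of least privilege: by default everything is denied, and each part of the system has a policy that allows it to access what it needs. This description fits the strict policy best
One goal is for every process that is installed and started at boot to run in a confined domain by default. The targeted policy is designed to protect as many key processes as possible without degrading the user experience, so most users should not even notice that SELinux is running

SELinux access control

SELinux has three forms of access control:
1. Type Enforcement (TE): the primary access control mechanism used by the targeted policy
2. Role-Based Access Control (RBAC): based on SELinux users (not necessarily the same as Linux users), but not used by the default targeted policy
3. Multi-Level Security (MLS): not commonly used, and often hidden in the default targeted policy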

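SELINUX runtime configuration

Changing the SELinux run state
setenforce [ Enforcing | Permissive | 1 | 0 ]

vim /etc/sysconfig/selinux 
//Holds the SELinux policy settings. The policy is chosen through the SELINUXTYPE variable, which has two possible values
targeted: only a predefined set of network services and access requests are protected by SELinux
strict: all network services and access requests go through SELinux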

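Apache SELinux configuration examples

1. Letting Apache access site files in a non-default directory
First, use semanage fcontext -l | grep '/var/www' to find the SELinux context of the default /var/www directory
/var/www(/.*)? all files system_u:object_r:httpd_sys_content_t:s0
//This shows that Apache can only access files labeled httpd_sys_content_t
To make Apache use /srv/www as its site directory, the files under that directory need the httpd_sys_content_t label. This takes two steps
    1) First add a default label type for files under /srv/www: semanage fcontext -a -t httpd_sys_content_t '/srv/www(/.*)?' 
    2) Then apply the new label type to the existing files: restorecon -Rv /srv/www. After that, Apache can serve the site from this directory
restorecon is very common in SELinux administration; it restores a file's default label. For example, when a file is copied from a user's home directory into the Apache site directory, Apache cannot access it by default because files under the home directory are labeled user_home_t. restorecon is then needed to reset it to the httpd_sys_content_t type that Apache can access: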
restorecon -v /srv/www/foo.com/html/file.html
restorecon reset /srv/www/foo.com/html/file.html context unconfined_u:object_r:user_home_t:s0->system_u:object_r:httpd_sys_content_t:s0

2. Letting Apache listen on a non-standard port
By default Apache listens only on ports 80 and 443. If it is configured to listen on port 888 directly, service httpd restart fails with:
Starting httpd: (13)Permission denied: make_sock: could not bind to address [::]:888
(13)Permission denied: make_sock: could not bind to address 0.0.0.0:888
no listening sockets available, shutting down
Unable to open logs
SELinux offers a fix for each of three possible situations. The first is the one we want here, so following its suggestion we run:
semanage port -a -t http_port_t -p tcp 888
After that, Apache starts without problems

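3. Allowing Apache to serve users' personal sites
To let users create personal sites by placing files in ~/public_html/, the Apache policy must permit it. Use:
setsebool httpd_enable_homedirs 1
setsebool toggles SELinux policies controlled by booleans; the current state of a boolean can be read with getsebool.
By default a setsebool change lasts only until the next reboot. To make it permanent, add the -P option, for example:
setsebool -P httpd_enable_homedirs 1

The traditional DAC model (typically ACLs) offers only a limited set of privilege levels (root, user, group, other), which makes least privilege hard to enforce. Many processes started by root drop their privileges later and run as a restricted user, and some run inside a chroot, but these security measures are all discretionary
SELinux adheres to least privilege much more closely. In the default enforcing mode everything is denied, and then a series of exception policies grants each element of the system (service, program, user) the access it needs to function. When a service, program, or user tries to access or modify a file or resource it does not need, the request is denied and the action is logged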

Relevant Link:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Security-Enhanced_Linux/sect-Security-Enhanced_Linux-Working_with_SELinux-Enabling_and_Disabling_SELinux.html
http://zh.wikipedia.org/wiki/%E5%AE%89%E5%85%A8%E5%A2%9E%E5%BC%BA%E5%BC%8FLinux
https://linuxtoy.org/archives/selinux-introduction.html
http://wiki.centos.org/zh/HowTos/SELinux
http://fedoraproject.org/wiki/SELinux

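0x6: Smack (Simplified Mandatory Access Control Kernel), a mandatory access control kernel module

Smack (Simplified Mandatory Access Control Kernel) is a Linux mandatory access control security module implemented on top of LSM by Casey Schaufler in 2007. It exists in Linux as a kernel security patch. Its design uses the LSM security fields to attach a security label to every subject and object in the Linux kernel and to define security policy, so that only accesses that conform to the policy are allowed. Compared with SELinux and DTE, Smack's security policy is much simpler, yet it provides similar mandatory access control, and its kernel performance overhead is relatively low

1. Smack basic concepts

1. Subject
A Smack subject is a Linux kernel process

2. Object
A Smack object is a Linux kernel object such as a file, message queue, socket, shared memory segment, or semaphore. An object can also be a Linux process or IPC 

3. Security label
A Smack security label is a C string of at most 24 characters (including the terminating '\0'). Smack modifies the security field of the process task_struct; when a process is first created its security label is "_". Likewise, Smack modifies the security fields of the virtual filesystem's inode and super_block, so that when a filesystem is created all of its files carry the security label "_"

4. Access modes
The initial version of Smack has only four access modes: read (r or R), write (w or W), execute (x or X), and append (a or A). In inter-process communication, one process sending a message or packet to another counts as a write 

5. Security policy
Smack's security policy consists of
    1) Built-in policy, meaning rules already fixed in Smack's access control code, namely
        1.1) Any access requested by a process labeled "*" is denied 
        1.2) A read or execute request from a process labeled "^" is permitted 
        1.3) A read or execute request from any process on an object labeled "_" is permitted 
        1.4) Any request from any process on an object labeled "*" is permitted 
        1.5) If subject and object have the same label, any access by the subject to the object is permitted
    2) User-defined policy 
    In addition, users can write their own security rules with a tool according to their needs, provided the subject and object labels already exist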
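
The five built-in rules translate almost directly into code. The sketch below applies them in the order listed above and denies anything else; it is a user-space illustration, not Smack's actual source. The SK_* bit values and the function name smack_builtin_allows are hypothetical, and user-defined rules (which Smack would consult next) are left out.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical access-mode bits for the four modes named in the text. */
enum { SK_READ = 1, SK_WRITE = 2, SK_EXEC = 4, SK_APPEND = 8 };

/* Apply only the built-in label rules, in the order listed in the text. */
bool smack_builtin_allows(const char *subject, const char *object,
                          unsigned request)
{
    /* True when the request asks for nothing beyond read and execute. */
    bool read_or_exec_only =
        (request & ~(unsigned)(SK_READ | SK_EXEC)) == 0;

    if (strcmp(subject, "*") == 0)
        return false;                          /* rule 1.1 */
    if (strcmp(subject, "^") == 0 && read_or_exec_only)
        return true;                           /* rule 1.2 */
    if (strcmp(object, "_") == 0 && read_or_exec_only)
        return true;                           /* rule 1.3 */
    if (strcmp(object, "*") == 0)
        return true;                           /* rule 1.4 */
    if (strcmp(subject, object) == 0)
        return true;                           /* rule 1.5 */
    return false;   /* user-defined rules would be consulted here */
}
```

For instance, a process labeled "Web" may read an object labeled "_" but may not write to it unless a user-defined rule grants that.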

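2. Smack's policy rule virtual filesystem

Mandatory access control needs a security policy database. Smack uses a virtual filesystem as its policy store, which keeps decision time short but also means the policy must be set up again manually after every reboot. The Smack filesystem lives under the "/smack" directory and contains

1. load: holds Smack's security policy
2. cipso: holds the cipso values, including security level and security category; cipso and netlabel are used for Smack network communication control
3. doi
4. direct
5. ambient
6. netlabel: holds host IP addresses and their associated Smack labels; cipso and netlabel are used for Smack network communication control
7. onlycap
8. logging

These are organized as Linux kernel linked lists; the most commonly used are load, cipso, and netlabel
Smack overrides the filesystem operation functions and uses the kernel APIs for registering a filesystem, mounting a filesystem, and initialization calls to implement the virtual filesystem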

Relevant Link:

http://www.xuebuyuan.com/865515.html

0x7: AppArmor(Application Armor)

apparmor_status

apparmor module is loaded.
6 profiles are loaded.
6 profiles are in enforce mode.
   /sbin/dhclient
   /usr/lib/NetworkManager/nm-dhcp-client.action
   /usr/lib/connman/scripts/dhclient-script
   /usr/sbin/named
   /usr/sbin/ntpd
   /usr/sbin/tcpdump
0 profiles are in complain mode.
2 processes have profiles defined.
2 processes are in enforce mode.
   /usr/sbin/named (5522) 
   /usr/sbin/ntpd (956) 
0 processes are in complain mode.
0 processes are unconfined but have a profile defined

AppArmor is designed to provide easy-to-use application security for both servers and workstations. Novell AppArmor is an access control system that lets you specify per program which files the program may read, write, and execute. AppArmor secures applications by enforcing good application behavior without relying on attack signatures, so it can prevent attacks even if they are exploiting previously unknown vulnerabilities.

AppArmor is an effective and easy-to-use Linux application security system. AppArmor proactively protects the operating system and applications from external or internal threats, even zero-day attacks, by enforcing good behavior and preventing even unknown application flaws from being exploited. AppArmor security policies completely define what system resources individual applications can access, and with what privileges. A number of default policies are included with AppArmor, and using a combination of advanced static analysis and learning-based tools, AppArmor policies for even very complex applications can be deployed successfully in a matter of hours.

AppArmor is a Mandatory Access Control (MAC) system which is a kernel (LSM) enhancement to confine programs to a limited set of resources. AppArmor's security model is to bind access control attributes to programs rather than to users
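In essence, AppArmor's defensive approach is an evolution of traditional ACLs (the DAC access control model). Traditional ACLs operate at user granularity, so once an attacker obtains the root account the defense is completely ineffective. The most direct fix is to divide privileges at a finer granularity, which is why defense modules with subject/object access control policies such as SELINUX and AppArmor came into being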

cat /etc/apparmor.d/sbin.dhclient 

# vim:syntax=apparmor
# Last Modified: Fri Jul 17 11:46:19 2009
# Author: Jamie Strandboge <jamie@canonical.com>
#include <tunables/global>

/sbin/dhclient {
  #include <abstractions/base>
  #include <abstractions/nameservice>

  capability net_bind_service,
  capability net_raw,
  capability sys_module,
  capability dac_override,

  network packet,
  network raw,

  @{PROC}/[0-9]*/net/ r,
  @{PROC}/[0-9]*/net/** r,

  /sbin/dhclient mr,

  /etc/dhclient.conf r,
  /etc/dhcp/ r,
  /etc/dhcp/** r,

  /var/lib/dhcp{,3}/dhclient* lrw,
  /{,var/}run/dhclient*.pid lrw,
  /{,var/}run/dhclient*.lease* lrw,

  # NetworkManager
  /{,var/}run/nm*conf r,
  /{,var/}run/sendsigs.omit.d/network-manager.dhclient*.pid lrw,

  # connman
  /{,var/}run/connman/dhclient*.pid lrw,
  /{,var/}run/connman/dhclient*.leases lrw,

  # synce-hal
  /usr/share/synce-hal/dhclient.conf r,

  # if there is a custom script, let it run unconfined
  /etc/dhcp/dhclient-script Uxr,

  # The dhclient-script shell script sources other shell scripts rather than
  # executing them, so we can't just use a separate profile for dhclient-script
  # with 'Uxr' on the hook scripts. However, for the long-running dhclient3
  # daemon to run arbitrary code via /sbin/dhclient-script, it would need to be
  # able to subvert dhclient-script or write to the hooks.d directories. As
  # such, if the dhclient3 daemon is subverted, this effectively limits it to
  # only being able to run the hooks scripts.
  /sbin/dhclient-script                           Uxr,

  # Run the ELF executables under their own unrestricted profiles
  /usr/lib/NetworkManager/nm-dhcp-client.action   Pxr,
  /usr/lib/connman/scripts/dhclient-script        Pxr,

  # Site-specific additions and overrides. See local/README for details.
  #include <local/sbin.dhclient>
}

/usr/lib/NetworkManager/nm-dhcp-client.action {
  #include <abstractions/base>
  #include <abstractions/dbus>
  /usr/lib/NetworkManager/nm-dhcp-client.action mr,
}

/usr/lib/connman/scripts/dhclient-script {
  #include <abstractions/base>
  #include <abstractions/dbus>
  /usr/lib/connman/scripts/dhclient-script      mr,
}

Relevant Link:

http://blog.csdn.net/ustc_dylan/article/details/7944955
http://wiki.apparmor.net/index.php/Main_Page
https://wiki.ubuntu.com/AppArmor
http://en.wikipedia.org/wiki/AppArmor

0x8: osquery

Relevant Link:

http://www.cnblogs.com/LittleHann/p/4549739.html

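2. Process Control Active Defense

0x1: Hook Architecture

On both Linux and Windows, a process goes through a complete life cycle from creation through scheduling to termination (including abnormal exit). Active defense needs a hook at every step of that cycle

1. New process creation, main thread starts executing
On Linux, creating a new process and starting its main thread is equivalent to calling the execve() system call. Reading the kernel source in exec.c shows that, of all the LSM hook points related to execve, only "do_execve->search_binary_handler->security_bprm_check" satisfies all of the following design requirements
    1) It captures all process-related parameters
    2) Modifying the bprm pointer changes the startup parameters of the current process, which makes hijacking and redirection possible
    3) After a multi-dimensional policy evaluation of the captured parameters, it can return a DENY/ALLOW/LOG/SET result to the LSM hook point, deciding whether this system call is allowed through or modified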

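2. Process creates a thread, process duplicates itself
On Linux, every new process is in essence a child duplicated from its parent (every user-space process descends from init through fork). Linux also draws no strict line between processes and threads the way Windows does: the difference lies in the clone_flags passed at fork time, that is, the degree of resource sharing between parent and child decides whether the fork yields a child process or a child thread
"do_fork->copy_process->security_task_create" satisfies all of the following design requirements
    1) It captures all process-related parameters
    2) clone_flags shows whether this is a thread creation
    3) The hook point can decide by rule whether creation of the new "execution flow" is allowed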
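
The clone_flags test in point 2) can be sketched in plain user-space C. This is a minimal illustration, not the author's module code: the enum flow_kind and the function classify_clone are hypothetical names, and the flag values are copied from the kernel headers so the sketch builds on its own.

```c
#include <assert.h>

/* Values match the kernel's clone flag definitions; redefined here so the
 * sketch builds without <sched.h> feature-test macros. */
#ifndef CLONE_VM
#define CLONE_VM     0x00000100UL
#endif
#ifndef CLONE_THREAD
#define CLONE_THREAD 0x00010000UL
#endif

/* Hypothetical classification handed to the policy layer. */
enum flow_kind {
    FLOW_PROCESS,    /* plain fork(): separate address space, new tgid */
    FLOW_SHARED_VM,  /* vfork()-style child: new tgid, shared address space */
    FLOW_THREAD      /* pthread-style clone(): joins the caller's thread group */
};

/* Decide what kind of "execution flow" a task_create hook is looking at. */
enum flow_kind classify_clone(unsigned long clone_flags)
{
    if (clone_flags & CLONE_THREAD)
        return FLOW_THREAD;
    if (clone_flags & CLONE_VM)
        return FLOW_SHARED_VM;
    return FLOW_PROCESS;
}
```

A hook such as security_task_create only receives clone_flags, so a check of this kind is about all it can decide before the child's pid exists.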

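3. Monitoring of modules a process depends on: third-party libraries, SO modules
security_dentry_open

4. Process exit
security_ops->task_kill
    1) It captures all process-related parameters
    2) It captures the signal number that causes the process to exit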

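0x2: New Process Creation

1. New process creation monitoring
    1) Attackers often use the command channel of a web container or of a MySQL/SQL Server database process to issue a download request and launch a malicious program
    2) It can work together with findMalware and MAA: newly created suspicious processes are reported to the server for malware analysis
    3) Forbid a parent process with a given path signature from executing system commands with specific names/arguments. For example, the typical signature of "CVE-2015-0240 Unexpected Code Execution In SMBD" is that samba spawns a new bash process after being hit by the overflow

2. System command execution monitoring
    1) Attackers exploit weak passwords in system software such as web containers, MySQL/SQL Server, and FTP, then run sensitive system commands through the command channel offered by the SQL management console, for example adding a system account, fetching the password file, or writing a WEBSHELL file to disk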

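0x3: New Thread Creation && New "Execution Flow" Creation

On Linux, creating a thread or a new "execution flow" is in essence a fork() system call that duplicates the process itself. The problem is that the duplication by itself reveals little malice: it only means the current process has begun "preparing" to create a new process, or simply wants to copy itself
Real-time blocking policy and malware detection still have to wait until the execve system call has been made

This hook point has the following problems

1. fork on Linux only decides whether to duplicate a heavyweight new process or a lightweight thread
2. This point does not fill in any information about the new process, so that information cannot be obtained here; it only becomes available at exec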

\linux-2.6.32.63\kernel\fork.c

static struct task_struct *copy_process(unsigned long clone_flags,
                    unsigned long stack_start,
                    struct pt_regs *regs,
                    unsigned long stack_size,
                    int __user *child_tidptr,
                    struct pid *pid,
                    int trace)
{
    ..
    retval = security_task_create(clone_flags);
    if (retval)
        goto fork_out;
    ..
    p->tgid = p->pid;
    if (clone_flags & CLONE_THREAD)
        p->tgid = current->tgid;

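As shown above, the kernel assigns pid and tid only after the LSM hook. This matches the LSM design: if the LSM rule decides to block, there is no point continuing the code path and doing needless work such as memory allocation. It does, however, make parameter collection harder for us. One way around this is a "copy kernel code" emulation: selectively copy the code that copy_process later uses to generate the child thread's tid, keep only the parts that do not affect current kernel state, and execute them in emulation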

if (pid != &init_struct_pid) 
{
    retval = -ENOMEM;
    pid = alloc_pid(p->nsproxy->pid_ns);
    if (!pid)
        goto bad_fork_cleanup_io;

    if (clone_flags & CLONE_NEWPID) 
    {
        retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
        if (retval < 0)
            goto bad_fork_free_pid;
    }
}

p->pid = pid_nr(pid);

/source/kernel/pid.c
struct pid *alloc_pid(struct pid_namespace *ns)
{
    struct pid *pid;
    enum pid_type type;
    int i, nr;
    struct pid_namespace *tmp;
    struct upid *upid;

    pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
    if (!pid)
        goto out;

    tmp = ns;
    for (i = ns->level; i >= 0; i--) {
        nr = alloc_pidmap(tmp);
        if (nr < 0)
            goto out_free;

        pid->numbers[i].nr = nr;
        pid->numbers[i].ns = tmp;
        tmp = tmp->parent;
    }

    get_pid_ns(ns);
    pid->level = ns->level;
    atomic_set(&pid->count, 1);
    for (type = 0; type < PIDTYPE_MAX; ++type)
        INIT_HLIST_HEAD(&pid->tasks[type]);

    spin_lock_irq(&pidmap_lock);
    for (i = ns->level; i >= 0; i--) {
        upid = &pid->numbers[i];
        hlist_add_head_rcu(&upid->pid_chain,
                &pid_hash[pid_hashfn(upid->nr, upid->ns)]);
    }
    spin_unlock_irq(&pidmap_lock);

out:
    return pid;

out_free:
    while (++i <= ns->level)
        free_pidmap(pid->numbers + i);

    kmem_cache_free(ns->pid_cachep, pid);
    pid = NULL;
    goto out;
}

/source/kernel/pid.c
Obtain the child thread's tid by following the lookup rules of the namespace hierarchy

Another feasible approach is the Linux netlink process connector

1. LSM operation replace hook
//ops->task_create = alise_task_create;

2. static int alise_task_create(unsigned long clone_flags)
/*
if(!msg_ops[MSG_PROSESS_CREATE].handle_message((void *)&clone_flags))
        ret = -EPERM;
..
static struct message_ops msg_ops[MSG_TOTAL] = 
{
    ..
    {
        .handle_message = process_create_msg,
    },
    ..
};
*/

3. static int process_create_msg(void *data)
4. The Linux kernel sends a netlink notification when a process forks
static void find_and_instead(int sk)
/*
write_lock(&sock->sk->sk_callback_lock);
sock->sk->sk_data_ready = alise_newpid_report;
write_unlock(&sock->sk->sk_callback_lock);
*/

5. static void alise_newpid_report(struct sock *sk, int len)
/*
skb = skb_dequeue(&sk->sk_receive_queue);
    atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
    sk_mem_uncharge(sk,skb->truesize);

    if(!atomic_read(&op_install) || !alise_id || !enabled)
        goto out;

    if(skb->len >= sizeof(struct nlmsghdr))
    {
        nlh = (struct nlmsghdr *)skb->data;
        if((nlh->nlmsg_len >= sizeof(struct nlmsghdr))
                    && (skb->len >= nlh->nlmsg_len)){
            cnmsg = (struct cn_msg *)NLMSG_DATA(nlh);
            ev = (struct proc_event *)cnmsg->data;        
            if(ev->what == PROC_EVENT_FORK)
                msg_ops[MSG_PROSESS_PIDREP].handle_message((void *)ev);
        }
    }
*/

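This hook approach depends on how the kernel socket layer handles its events: a netlink socket notifies the corresponding process of state changes through callbacks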
/source/net/core/sock.c

void sock_init_data(struct socket *sock, struct sock *sk)
{
    ..
    sk->sk_state_change     =       sock_def_wakeup;
    sk->sk_data_ready       =       sock_def_readable;
    sk->sk_write_space      =       sock_def_write_space;
    sk->sk_error_report     =       sock_def_error_report;
    sk->sk_destruct         =       sock_def_destruct;
    ..
}

/source/include/net/sock.h

struct sock 
{
    ..
    //sk_sleep is a wait queue holding every process blocked on this sock; user processes are notified through this queue
    wait_queue_head_t    *sk_sleep;
    ..
    //sk_state_change is the callback invoked when the sock changes state (for example from ESTABLISHED to CLOSE_WAIT)
    void        (*sk_state_change)(struct sock *sk);
    //sk_data_ready is called when the sock has data available to read
    void        (*sk_data_ready)(struct sock *sk, int bytes);
    //sk_write_space is called when the sock has space available for writing
    void        (*sk_write_space)(struct sock *sk);
    //sk_error_report is called when the sock hits an error (for example on receiving an RST)
    void        (*sk_error_report)(struct sock *sk);
    int    (*sk_backlog_rcv)(struct sock *sk,struct sk_buff *skb);  
};

sk_data_ready is the callback that raises the readable event

static void sock_def_readable(struct sock *sk, int len)
{
    read_lock(&sk->sk_callback_lock);
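    //First check whether any process is sleeping on the sock. If so, synchronously wake all blocked processes. Note that the argument passed is POLLIN, which is what lets epoll and similar interfaces catch the event
    if (sk_has_sleeper(sk))
        wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN | POLLRDNORM | POLLRDBAND);
    //Handle asynchronous wakeups here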
    sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
    read_unlock(&sk->sk_callback_lock);
}

Relevant Link

http://www.codelifeliwan.com/?p=722
http://blog.csdn.net/dog250/article/details/5303430
http://blog.csdn.net/zhangskd/article/details/45787989
http://www.fsl.cs.sunysb.edu/kernel-api/re539.html
http://simohayha.iteye.com/blog/559506

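Linux process monitoring based on netlink proc events

The proc connector allows you to receive notification of process events such as fork and exec calls, as well as changes to a process's uid, gid or sid (session id). These are provided through a socket-based interface by reading instances of struct proc_event defined in the kernel header

1. The proc event notification is an asynchronous event notification mechanism provided by the Linux kernel. To receive it, user space has to subscribe
2. Linux wraps this in the "proc connector" (built on top of the generic connector, which is itself built on top of netlink). In code it is simply a socket bind, send, recv sequence; the difference is that the socket is created with the special NETLINK_CONNECTOR protocol identifier
3. When a process is created (fork), exits (exit), or changes state (uid, ..), the Linux kernel delivers the event data to the netlink sockets subscribed to the proc connector group
4. The subscribing user-space process (the one blocked in recv after bind) then receives the asynchronous event messages

The following code shows how to subscribe to proc events from user space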

/* exec-notify, so you can watch your acrobat reader or vim executing "bash -c"
 * commands ;-)
 * Requires some 2.6.x Linux kernel with proc connector enabled.
 *
 * $  gcc -Wall -ansi -pedantic -std=c99 exec-notify.c
 *
 * (C) 2007-2010 Sebastian Krahmer <krahmer@suse.de> original netlink handling
 * stolen from an proc-connector example, copyright folows:
 */
/*
 *
 * Copyright (C) Matt Helsley, IBM Corp. 2005
 * Derived from fcctl.c by Guillaume Thouvenin
 * Original copyright notice follows:
 *
 * Copyright (C) 2005 BULL SA.
 * Written by Guillaume Thouvenin <guillaume.thouvenin@bull.net>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <signal.h>

#include <sys/socket.h>
#include <sys/types.h>

#include <linux/connector.h>
#include <linux/netlink.h>
#include <linux/cn_proc.h>

#define SEND_MESSAGE_LEN (NLMSG_LENGTH(sizeof(struct cn_msg) + \
                       sizeof(enum proc_cn_mcast_op)))
#define RECV_MESSAGE_LEN (NLMSG_LENGTH(sizeof(struct cn_msg) + \
                       sizeof(struct proc_event)))

#define SEND_MESSAGE_SIZE    (NLMSG_SPACE(SEND_MESSAGE_LEN))
#define RECV_MESSAGE_SIZE    (NLMSG_SPACE(RECV_MESSAGE_LEN))

#define max(x,y) ((y)<(x)?(x):(y))
#define min(x,y) ((y)>(x)?(x):(y))

#define BUFF_SIZE (max(max(SEND_MESSAGE_SIZE, RECV_MESSAGE_SIZE), 1024))
#define MIN_RECV_SIZE (min(SEND_MESSAGE_SIZE, RECV_MESSAGE_SIZE))

#define PROC_CN_MCAST_LISTEN (1)
#define PROC_CN_MCAST_IGNORE (2)


void handle_msg (struct cn_msg *cn_hdr)
{
    char cmdline[1024], fname1[1024], ids[1024], fname2[1024], buf[1024];
    int r = 0, fd, i;
    FILE *f = NULL;
    struct proc_event *ev = (struct proc_event *)cn_hdr->data;

    snprintf(fname1, sizeof(fname1), "/proc/%d/status", ev->event_data.exec.process_pid);
    snprintf(fname2, sizeof(fname2), "/proc/%d/cmdline", ev->event_data.exec.process_pid);

    f = fopen(fname1, "r");
    fd = open(fname2, O_RDONLY);

    memset(&cmdline, 0, sizeof(cmdline));
    memset(&ids, 0, sizeof(ids));

    while (f && fgets(buf, sizeof(buf), f) != NULL) {
        if (strstr(buf, "Uid")) {
            strtok(buf, "\n");
            snprintf(ids, sizeof(ids), "%s", buf);
        }
    }
    if (f)
        fclose(f);

    if (fd > 0) {
        r = read(fd, cmdline, sizeof(cmdline));
        close(fd);

        for (i = 0; r > 0 && i < r; ++i) {
            if (cmdline[i] == 0)
                cmdline[i] = ' ';
        }
    }

    switch(ev->what){
    case PROC_EVENT_FORK:
        printf("FORK:parent(pid,tgid)=%d,%d\tchild(pid,tgid)=%d,%d\t[%s]\n",
               ev->event_data.fork.parent_pid,
               ev->event_data.fork.parent_tgid,
               ev->event_data.fork.child_pid,
               ev->event_data.fork.child_tgid, cmdline);
        break;
    case PROC_EVENT_EXEC:
        printf("EXEC:pid=%d,tgid=%d\t[%s]\t[%s]\n",
               ev->event_data.exec.process_pid,
               ev->event_data.exec.process_tgid, ids, cmdline);
        break;
    case PROC_EVENT_EXIT:
        printf("EXIT:pid=%d,%d\texit code=%d\n",
               ev->event_data.exit.process_pid,
               ev->event_data.exit.process_tgid,
               ev->event_data.exit.exit_code);
        break;
    case PROC_EVENT_UID:
        printf("UID:pid=%d,%d ruid=%d,euid=%d\n",
            ev->event_data.id.process_pid, ev->event_data.id.process_tgid,
            ev->event_data.id.r.ruid, ev->event_data.id.e.euid);
        break;
    default:
        break;
    }
}


int main(int argc, char **argv)
{
    int sk_nl;
    int err;
    struct sockaddr_nl my_nla, kern_nla, from_nla;
    socklen_t from_nla_len;
    char buff[BUFF_SIZE];
    int rc = -1;
    struct nlmsghdr *nl_hdr;
    struct cn_msg *cn_hdr;
    enum proc_cn_mcast_op *mcop_msg;
    size_t recv_len = 0;
    if (getuid() != 0) {
        printf("Only root can start/stop the fork connector\n");
        return 0;
    }
    if (argc != 1)
        return 0;

    setvbuf(stdout, NULL, _IONBF, 0);

    /*
     * Create an endpoint for communication. Use the kernel user
     * interface device (PF_NETLINK) which is a datagram oriented
     * service (SOCK_DGRAM). The protocol used is the connector
     * protocol (NETLINK_CONNECTOR)
     */
    sk_nl = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
    if (sk_nl == -1) {
        printf("socket sk_nl error");
        return rc;
    }
    my_nla.nl_family = AF_NETLINK;
    my_nla.nl_groups = CN_IDX_PROC;
    my_nla.nl_pid = getpid();

    kern_nla.nl_family = AF_NETLINK;
    kern_nla.nl_groups = CN_IDX_PROC;
    kern_nla.nl_pid = 1;

    err = bind(sk_nl, (struct sockaddr *)&my_nla, sizeof(my_nla));
    if (err == -1) {
        printf("binding sk_nl error");
        goto close_and_exit;
    }
    nl_hdr = (struct nlmsghdr *)buff;
    cn_hdr = (struct cn_msg *)NLMSG_DATA(nl_hdr);
    mcop_msg = (enum proc_cn_mcast_op*)&cn_hdr->data[0];

    printf("sending proc connector: PROC_CN_MCAST_LISTEN... ");
    memset(buff, 0, sizeof(buff));
    *mcop_msg = PROC_CN_MCAST_LISTEN;

    /* fill the netlink header */
    nl_hdr->nlmsg_len = SEND_MESSAGE_LEN;
    nl_hdr->nlmsg_type = NLMSG_DONE;
    nl_hdr->nlmsg_flags = 0;
    nl_hdr->nlmsg_seq = 0;
    nl_hdr->nlmsg_pid = getpid();
    /* fill the connector header */
    cn_hdr->id.idx = CN_IDX_PROC;
    cn_hdr->id.val = CN_VAL_PROC;
    cn_hdr->seq = 0;
    cn_hdr->ack = 0;
    cn_hdr->len = sizeof(enum proc_cn_mcast_op);
    if (send(sk_nl, nl_hdr, nl_hdr->nlmsg_len, 0) != nl_hdr->nlmsg_len) {
        printf("failed to send proc connector mcast ctl op!\n");
        goto close_and_exit;
    }

    printf("sent\n");
    if (*mcop_msg == PROC_CN_MCAST_IGNORE) {
        rc = 0;
        goto close_and_exit;
    }
    printf("Reading process events from proc connector.\n"
        "Hit Ctrl-C to exit\n");
    for(memset(buff, 0, sizeof(buff)), from_nla_len = sizeof(from_nla);
      ; memset(buff, 0, sizeof(buff)), from_nla_len = sizeof(from_nla)) {
        struct nlmsghdr *nlh = (struct nlmsghdr*)buff;
        memcpy(&from_nla, &kern_nla, sizeof(from_nla));
        recv_len = recvfrom(sk_nl, buff, BUFF_SIZE, 0,
                (struct sockaddr*)&from_nla, &from_nla_len);
        if (from_nla.nl_pid != 0)
            continue;
        if (recv_len < 1)
            continue;
        while (NLMSG_OK(nlh, recv_len)) {
            cn_hdr = NLMSG_DATA(nlh);
            if (nlh->nlmsg_type == NLMSG_NOOP) {
                /* advance first, otherwise this loop never terminates */
                nlh = NLMSG_NEXT(nlh, recv_len);
                continue;
            }
            if ((nlh->nlmsg_type == NLMSG_ERROR) ||
                (nlh->nlmsg_type == NLMSG_OVERRUN))
                break;
            handle_msg(cn_hdr);
            if (nlh->nlmsg_type == NLMSG_DONE)
                break;
            nlh = NLMSG_NEXT(nlh, recv_len);
        }
    }
close_and_exit:
    close(sk_nl);
    exit(rc);

    return 0;
}

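The goal of active defense is to hook thread creation in kernel space, that is, to obtain complete parent and child information when the fork system call occurs. The approach works as follows

1. The Linux kernel provides asynchronous event notification for process start, exit, and state changes. When such an event occurs, it fills the netlink sk buffer of each subscribed netlink socket
2. The netlink socket's sock_def_readable callback checks whether any process is blocked waiting on the socket (that is, whether a process has subscribed to the message) and wakes the subscriber
3. The Linux kernel provides a "kernel socket filter interface" for filtering a socket inside the kernel, which gives us the chance to hook and capture the "PROC_EVENT_FORK" event
4. Filter the socket so that we only receive fork notifications, discarding the other types of proc connector event type and most importantly discarding the messages that indicate new threads being created (those where the pid and tgid fields differ)
5. Hijack the data_ready event of the chosen socket with sock->sk->sk_data_ready = alise_newpid_report; to intercept the asynchronous proc event. Note that this only hooks the data_ready callback of that one socket and does not affect communication on the system's other netlink sockets
//When filtering the socket, filter by event type and keep only the "PROC_EVENT_FORK" events relevant to thread creation
6. In the PROC_EVENT_FORK callback we get complete information about the parent at fork time (including its PID), which achieves the goal of hooking thread creation, because on Linux both thread creation and new process creation go through fork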

Relevant Link

http://bewareofgeek.livejournal.com/2945.html
http://users.suse.com/~krahmer/exec-notify.c
http://lxr.free-electrons.com/source/Documentation/connector/connector.txt 
http://netsplit.com/the-proc-connector-and-socket-filters
http://lxr.free-electrons.com/source/include/uapi/linux/cn_proc.h
https://godoc.org/github.com/cloudfoundry/gosigar/psnotify#pkg-constants

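0x4: Loading of Dependent Modules, SOs, and Third-Party Libraries (Excluding the Process's File Operations)

When process control works together with malware detection, reporting the ELF of the new process to the cloud is not enough: malware often places its malicious code in .so files or static library modules and pulls it in through external loading. Monitoring the external modules a process loads, and reporting suspicious ones, is therefore just as important

From an intrusion standpoint, external module loading leads to compromise in the following cases

1. An attacker logs in to the mysql console through a weak password and runs dangerous system commands by loading a udf (.so module)
2. An attacker injects a .so module into the target process through an overflow/injection vulnerability (in essence still making the target process load a new .so)
3. Gcc compiles a source program
4. A running program calls dlopen to load a .so dynamically

Note that module-load monitoring needs to determine the type of the module being loaded; the implementation of modinfo is a useful reference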

http://code.metager.de/source/xref/linux/utils/kmod/tools/modinfo.c

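One issue deserves attention here: as the LSM hook point for module loading, dentry_open fires for every open, whether the target is an executable, .so, .a, .ko, directory, or device file, including

1. Loading an ELF file from disk to start a new process
2. Loading a shell script file from disk and invoking it
3. Gcc pulling in dynamic and static library modules while compiling a source program
4. A text editor opening a text file
5. A running program loading a dynamic link library
6. A running program calling dlopen to open a shared library
7. Opening a directory
8. Opening a socket device
9. Opening a block device
//In fact, opening "any" file on Linux passes through dentry_open, which is a kernel function and LSM hook rather than a system call

Reporting every event would produce far too much data, so redundant events have to be filtered out. Judging from the hook's parameters, the following can be used to determine the type of the file being opened

1. struct file *file
Copy from kernel code: imitate the way the kernel itself identifies the file type in order to pick the binary format handler that loads the binary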
    1) int do_execve(char * filename, char __user *__user *argv, char __user *__user *envp, struct pt_regs * regs) ->
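    /*
    Read the first 128 bytes of the executable into the buffer of the linux_binprm structure bprm; the kernel later uses these 128 bytes to decide which format handler should process the file. prepare_binprm also supplies some values related to the parent process
    */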
    2) retval = prepare_binprm(bprm) ->
    3) retval = search_binary_handler(bprm, regs) -> 
    4) request_module("binfmt-%04x", *(unsigned short *)(&bprm->buf[2]));
    /*
    The safety check before loading an ELF file confirms once more that the file being loaded is an executable ELF file
    */
    5) static int load_elf_library(struct file *file)
    {
        ..
        struct elfhdr elf_ex; 
        retval = kernel_read(file, 0, (char *)&elf_ex, sizeof(elf_ex));
        if (retval != sizeof(elf_ex))
        goto out; 
        if (memcmp(elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
        goto out;

        /* First of all, some simple consistency checks */
        if (elf_ex.e_type != ET_EXEC || elf_ex.e_phnum > 2 || !elf_check_arch(&elf_ex) || !file->f_op || !file->f_op->mmap)
        goto out;
        ..
    }

2. S_ISREG
Filter out files that are not regular files

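0x5: Process Exit

Process exit is the last stage of the process life cycle. Functions that come to mind for this hook point are

1. After exploiting a web vulnerability or a weak database/SSH password, an attacker uses the command channel of the web container or database process to run the kill command and terminate the active defense agent or other antivirus software
2. Decide from the information about the target process and the SIGNAL whether the process is allowed to exit, thereby providing "process protection"

Several issues need attention when hooking process exit

1. The LSM hook point chosen is int security_task_kill(struct task_struct *p, struct siginfo *info, int sig, u32 secid)
2. The LSM hook point must come after Linux's own permission check. Linux already forbids users who are neither root nor the owner of the target process from killing it, and the blocking policy of active defense should not intrude on the system's own security mechanism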

\linux-2.6.32.63\kernel\signal.c

/*
 * Bad permissions for sending the signal
 * - the caller must hold at least the RCU read lock
 */
static int check_kill_permission(int sig, struct siginfo *info, struct task_struct *t)
{
    const struct cred *cred, *tcred;
    struct pid *sid;
    int error;

    if (!valid_signal(sig))
        return -EINVAL;

    if (info != SEND_SIG_NOINFO && (is_si_special(info) || SI_FROMKERNEL(info)))
        return 0;

    error = audit_signal_info(sig, t); /* Let audit system see the signal */
    if (error)
        return error;

    cred = current_cred();
    tcred = __task_cred(t);
    if (!same_thread_group(current, t) &&
        (cred->euid ^ tcred->suid) &&
        (cred->euid ^ tcred->uid) &&
        (cred->uid  ^ tcred->suid) &&
        (cred->uid  ^ tcred->uid) &&
    !capable(CAP_KILL)) 
    {
        switch (sig) 
        {
            case SIGCONT:
                sid = task_session(t);
                /*
                 * We don't return the error if sid == NULL. The
                 * task was unhashed, the caller must notice this.
                 */
                if (!sid || sid == task_session(current))
                    break;
            default:
                return -EPERM;
        }
    }
    //The LSM hook security check is called only after the native Linux permission check has completed
    return security_task_kill(t, info, sig, 0);
}

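A few more issues need attention when monitoring process exit through hooks

1. SIGKILL must not be blocked: it is the last resort the system has to guarantee for forcing a process to exit
2. A process can exit in several ways
    1) It receives a SIGNAL
    2) It returns normally
    3) It calls exit
The LSM task_kill hook only captures the act of sending a signal and cannot capture a normal exit
3. Life-cycle monitoring for process control, however, needs the process exit as the marker that ends the cycle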
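
The SIGKILL rule in point 1 translates into a very small decision function. This is a user-space sketch of the policy only: decide_task_kill and the target_protected flag are hypothetical, and the return convention (0 to allow, -EPERM to deny) is assumed to follow the usual LSM hook convention.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>

/* Decide whether a signal aimed at a task may be delivered.
 * Returns 0 to allow and -EPERM to deny. */
int decide_task_kill(int sig, bool target_protected)
{
    if (!target_protected)
        return 0;           /* not our concern: leave it to the kernel */
    if (sig == 0)
        return 0;           /* existence probe, nothing is delivered */
    if (sig == SIGKILL)
        return 0;           /* per the rule above, never block SIGKILL */
    return -EPERM;          /* e.g. SIGTERM, SIGINT against a protected task */
}
```

Because SIGKILL is always let through, a protection scheme of this shape can only stop polite termination requests; protecting against SIGKILL would need a different mechanism than the one described here.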

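Solving this requires a hook point for process exit that sits lower than task_kill. notifier_call_chain fits: do_exit calls profile_task_exit, which fires the task_exit_notifier chain, so the kernel already provides a notifier chain head there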

/source/kernel/exit.c
NORET_TYPE void do_exit(long code)
{
    struct task_struct *tsk = current;
    int group_dead;

    profile_task_exit(tsk);
    ...
}

/source/kernel/profile.c
void profile_task_exit(struct task_struct *task)
{
    blocking_notifier_call_chain(&task_exit_notifier, 0, task);
}

/source/kernel/notifier.c
int blocking_notifier_call_chain(struct blocking_notifier_head *nh, unsigned long val, void *v)
{
    return __blocking_notifier_call_chain(nh, val, v, -1, NULL);
}
EXPORT_SYMBOL_GPL(blocking_notifier_call_chain);

int __blocking_notifier_call_chain(struct blocking_notifier_head *nh, unsigned long val, void *v, int nr_to_call, int *nr_calls)
{
    int ret = NOTIFY_DONE;

    /*
     * We check the head outside the lock, but if this access is
     * racy then it does not matter what the result of the test
     * is, we re-check the list after having taken the lock anyway:
     */
    if (rcu_dereference(nh->head)) {
        down_read(&nh->rwsem);
        ret = notifier_call_chain(&nh->head, val, v, nr_to_call,
                    nr_calls);
        up_read(&nh->rwsem);
    }
    return ret;
}
EXPORT_SYMBOL_GPL(__blocking_notifier_call_chain);

static int __kprobes notifier_call_chain(struct notifier_block **nl, unsigned long val, void *v, int nr_to_call,    int *nr_calls)
{
    int ret = NOTIFY_DONE;
    struct notifier_block *nb, *next_nb;

    nb = rcu_dereference(*nl);

    while (nb && nr_to_call) {
        next_nb = rcu_dereference(nb->next);

#ifdef CONFIG_DEBUG_NOTIFIERS
        if (unlikely(!func_ptr_is_kernel_text(nb->notifier_call))) {
            WARN(1, "Invalid notifier called!");
            nb = next_nb;
            continue;
        }
#endif
        ret = nb->notifier_call(nb, val, v);

        if (nr_calls)
            (*nr_calls)++;

        if ((ret & NOTIFY_STOP_MASK) == NOTIFY_STOP_MASK)
            break;
        nb = next_nb;
        nr_to_call--;
    }
    return ret;
}

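1. Introduction to Linux kernel notifier chains

Most kernel subsystems are independent of each other, yet one subsystem may be interested in events produced by another. To meet this need, that is, to let a subsystem notify others when an event occurs, the Linux kernel provides the notifier chain mechanism. Notifier chains can only be used between kernel subsystems, not for event notification between the kernel and user space.
A notifier chain is a linked list of functions; every node on the list has a function registered. When an event occurs, the functions of all nodes on the list are executed. A notifier chain therefore has a notifying side and a notified side

1. Notifier
When an event occurs, it fires the notifier chain, that is, it walks the chain and executes the registered functions

2. Notified party
The function run when the event is announced is chosen by the notified party: it registers a function, and that function is executed when the event occurs. The idea is similar to the signal system call

2. Notifier chain data structures

The node type of a notifier chain is notifier_block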
/source/include/linux/notifier.h

struct notifier_block 
{
    //the function to run for this node
    notifier_fn_t notifier_call;

    //next points to the following node, that is, the nodes still to be run for the current event
    struct notifier_block __rcu *next;
    int priority;
};

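3. Registering on a notifier chain

Registering on a notifier chain requires a list head that points to the first element of the chain. Later event notifications use this head to find every element on the list
/source/kernel/notifier.c

static int notifier_chain_register(struct notifier_block **nl, struct notifier_block *n)
//Add the new node n to the list that nl points to

The unregister function is

static int notifier_chain_unregister(struct notifier_block **nl, struct notifier_block *n)
//Remove node n from the list that nl points to

4. Notifying the chain

When an event occurs, notifier_call_chain sends the message to a notifier chain
/source/kernel/notifier.c

static int notifier_call_chain(struct notifier_block **nl, unsigned long val, void *v, int nr_to_call, int *nr_calls)
//Run, in order, the functions registered on every node of the list that nl points to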
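
The register / unregister / call trio above can be tried out in user space. The following is a simplified model, not the kernel implementation: it drops RCU and locking, the chain_* function names are invented for this sketch, and count_call / stop_call are example callbacks that exist only to exercise it. The NOTIFY_* values are the ones used by the kernel.

```c
#include <assert.h>
#include <stddef.h>

#define NOTIFY_DONE      0x0000
#define NOTIFY_OK        0x0001
#define NOTIFY_STOP_MASK 0x8000

struct notifier_block {
    int (*notifier_call)(struct notifier_block *nb, unsigned long val, void *v);
    struct notifier_block *next;
    int priority;
};

/* Insert n so that the list stays sorted by descending priority. */
int chain_register(struct notifier_block **nl, struct notifier_block *n)
{
    while (*nl != NULL && n->priority <= (*nl)->priority)
        nl = &(*nl)->next;
    n->next = *nl;
    *nl = n;
    return 0;
}

/* Unlink n; returns -1 if it is not on the list. */
int chain_unregister(struct notifier_block **nl, struct notifier_block *n)
{
    for (; *nl != NULL; nl = &(*nl)->next) {
        if (*nl == n) {
            *nl = n->next;
            return 0;
        }
    }
    return -1;
}

/* Call every node in order until one sets NOTIFY_STOP_MASK. */
int chain_call(struct notifier_block **nl, unsigned long val, void *v)
{
    int ret = NOTIFY_DONE;
    struct notifier_block *nb = *nl;

    while (nb != NULL) {
        struct notifier_block *next = nb->next;
        ret = nb->notifier_call(nb, val, v);
        if ((ret & NOTIFY_STOP_MASK) == NOTIFY_STOP_MASK)
            break;
        nb = next;
    }
    return ret;
}

/* Example callbacks: both count invocations through v. */
int count_call(struct notifier_block *nb, unsigned long val, void *v)
{
    (void)nb; (void)val;
    (*(int *)v)++;
    return NOTIFY_OK;
}

int stop_call(struct notifier_block *nb, unsigned long val, void *v)
{
    (void)nb; (void)val;
    (*(int *)v)++;
    return NOTIFY_STOP_MASK | NOTIFY_OK;   /* later nodes are skipped */
}
```

The priority ordering is the reason a security module would register with a high priority: its callback then runs before the others and can stop the walk.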

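5. Sample code

1. Initialize the head node of the notifier chain and register the functions to run on our own chain
    1) buildchain.c: defines the head node and some custom registration functions (specific to that head)
    2) regchain.c: registers some chain nodes using the custom registration functions

2. Another subsystem notifies the chain so that the previously registered functions run
    1) notify.c: the code that sends the notification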
buildchain.c

#include <asm/uaccess.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/notifier.h>
#include <linux/init.h>
#include <linux/types.h>
#include <linux/module.h>

MODULE_LICENSE("GPL");

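/*
 * Define our own notifier chain head plus wrapper functions for registering and unregistering on it
 */
 
/*
RAW_NOTIFIER_HEAD defines the head node of a notifier chain;
every other notifier_block on the chain can be reached through this head
*/
static RAW_NOTIFIER_HEAD(test_chain);

/*
Custom registration function: adds a notifier_block node to the test_chain list defined above
raw_notifier_chain_register calls notifier_chain_register
*/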
int register_test_notifier(struct notifier_block *nb)
{
        return raw_notifier_chain_register(&test_chain, nb);
}
EXPORT_SYMBOL(register_test_notifier);

int unregister_test_notifier(struct notifier_block *nb)
{
        return raw_notifier_chain_unregister(&test_chain, nb);
}
EXPORT_SYMBOL(unregister_test_notifier);


//Custom notification function: makes every node on the list that test_chain points to run its function
int test_notifier_call_chain(unsigned long val, void *v)
{
        return raw_notifier_call_chain(&test_chain, val, v);
}
EXPORT_SYMBOL(test_notifier_call_chain);


//init and exit 
static int __init init_notifier(void)
{
        printk("init_notifier\n");
        return 0;
}

static void __exit exit_notifier(void)
{
        printk("exit_notifier\n");
}
module_init(init_notifier);
module_exit(exit_notifier);

regchain.c

#include <asm/uaccess.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/notifier.h>
#include <linux/init.h>
#include <linux/types.h>
#include <linux/module.h>

MODULE_LICENSE("GPL");

//Registration functions exported by buildchain
extern int register_test_notifier(struct notifier_block*);
extern int unregister_test_notifier(struct notifier_block*);


static int test_event1(struct notifier_block *this, unsigned long event, void *ptr)
{
        /* event is unsigned long, so print it with %lu */
        printk("In Event 1: Event Number is %lu\n", event);
        return 0; 
}

static int test_event2(struct notifier_block *this, unsigned long event, void *ptr)
{
        printk("In Event 2: Event Number is %lu\n", event);
        return 0; 
}

static int test_event3(struct notifier_block *this, unsigned long event, void *ptr)
{
        printk("In Event 3: Event Number is %lu\n", event);
        return 0; 
}


//Event 1: this node runs test_event1
static struct notifier_block test_notifier1 =
{
        .notifier_call = test_event1,
};


//Event 2: this node runs test_event2
static struct notifier_block test_notifier2 =
{
        .notifier_call = test_event2,
};


//Event 3: this node runs test_event3
static struct notifier_block test_notifier3 =
{
        .notifier_call = test_event3,
};


//Register these events
static int __init reg_notifier(void)
{
        int err;
        printk("Begin to register:\n");
        
        err = register_test_notifier(&test_notifier1);
        if (err)
        {
                printk("register test_notifier1 error\n");
                return -1; 
        }
        printk("register test_notifier1 completed\n");

        err = register_test_notifier(&test_notifier2);
        if (err)
        {
                printk("register test_notifier2 error\n");
                return -1; 
        }
        printk("register test_notifier2 completed\n");

        err = register_test_notifier(&test_notifier3);
        if (err)
        {
                printk("register test_notifier3 error\n");
                return -1; 
        }
        printk("register test_notifier3 completed\n");
        return err;
}


//Unregister the notifiers registered above
static void __exit unreg_notifier(void)
{
        printk("Begin to unregister\n");
        unregister_test_notifier(&test_notifier1);
        unregister_test_notifier(&test_notifier2);
        unregister_test_notifier(&test_notifier3);
        printk("Unregister finished\n");
}
module_init(reg_notifier);
module_exit(unreg_notifier);

notify.c

#include <asm/uaccess.h>
#include <linux/types.h>
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/notifier.h>
#include <linux/init.h>
#include <linux/types.h>
#include <linux/module.h>

MODULE_LICENSE("GPL");

extern int test_notifier_call_chain(unsigned long val, void *v);


//Send a message to the notifier chain to trigger the registered functions
static int __init call_notifier(void)
{
        int err;
        printk("Begin to notify:\n");

    /*
     * Call the custom function to send a message to the test_chain chain
    */
 
        printk("==============================\n");
        err = test_notifier_call_chain(1, NULL);
        printk("==============================\n");
        if (err)
                printk("notifier_call_chain error\n");
        return err;
}


static void __exit uncall_notifier(void)
{
        printk("End notify\n");
}
module_init(call_notifier);
module_exit(uncall_notifier);

Makefile

obj-m:=buildchain.o regchain.o notify.o

KERNELDIR:=/lib/modules/$(shell uname -r)/build

default:
        make -C $(KERNELDIR) M=$(shell pwd) modules

Run

make

insmod buildchain.ko
insmod regchain.ko
insmod notify.ko

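Back to our goal of capturing process exit events through the notifier chain in do_exit

1. The kernel implements the basic notifier chain in do_exit, that is, it initializes the list head
2. Kernel developers can register further notification functions on this chain, that is, add list elements (todo)
3. In do_exit, when a process exit occurs, the kernel uses notifier_call_chain to send a message to the notifier chain for process exit
4. This gives real-time monitoring of process exit

To summarize the hook approach

1. Call profile_event_register() (a kernel API, not a system call) to add a notify listener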
/*
static struct notifier_block task_exit_notify = 
{
    .notifier_call  = alise_task_exit,
};
*/
2. static int alise_task_exit(struct notifier_block *self, unsigned long val, void *data)
..

Relevant Link:

http://bbs.chinaunix.net/thread-2011776-1-1.html

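0x6: Hook Method

To maximize hook performance and reduce the delay that the hook module's own processing adds to the system, a good hook method is

1. When the kernel reaches a hook function, record information about the hooked process at each hook point, for example its PID
2. Suspend the process so that it gives up the CPU and is placed on a wait queue
3. When the user-space business logic returns a result, or the timer expires, use the PID to take the process off the wait queue and put it back on the run queue so that it is scheduled again

This keeps the latency the hook module adds to the system's execution flow as small as possible and essentially yields an asynchronous message-response hook model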
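
The bookkeeping behind these three steps can be modeled without any kernel machinery. The sketch below is a single-threaded user-space model under several assumptions that are not in the original: the names park, post_verdict and resume_check are hypothetical, time is passed in as plain milliseconds, and a request that times out is allowed (fail-open) so that a slow decision engine cannot stall business processes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PENDING 64

enum verdict { V_PENDING, V_ALLOW, V_DENY };

struct pending {
    bool         used;
    int          pid;
    long         deadline_ms;
    enum verdict verdict;
};

struct pending_table {
    struct pending slot[MAX_PENDING];
};

static struct pending *find_pending(struct pending_table *t, int pid)
{
    for (size_t i = 0; i < MAX_PENDING; i++)
        if (t->slot[i].used && t->slot[i].pid == pid)
            return &t->slot[i];
    return NULL;
}

/* Steps 1 and 2: remember the hooked PID and mark it as waiting. */
bool park(struct pending_table *t, int pid, long now_ms, long timeout_ms)
{
    assert(t != NULL);
    if (find_pending(t, pid))
        return false;                       /* already parked */
    for (size_t i = 0; i < MAX_PENDING; i++) {
        if (!t->slot[i].used) {
            t->slot[i] = (struct pending){ true, pid, now_ms + timeout_ms, V_PENDING };
            return true;
        }
    }
    return false;                           /* table full */
}

/* User space answers for a parked PID. */
bool post_verdict(struct pending_table *t, int pid, enum verdict v)
{
    struct pending *p = find_pending(t, pid);
    if (p == NULL)
        return false;
    p->verdict = v;
    return true;
}

/* Step 3: returns V_PENDING while the PID must keep waiting; otherwise
 * releases the slot and returns the final verdict (V_ALLOW on timeout). */
enum verdict resume_check(struct pending_table *t, int pid, long now_ms)
{
    struct pending *p = find_pending(t, pid);
    enum verdict v;

    if (p == NULL)
        return V_ALLOW;                     /* nothing parked: fail-open */
    if (p->verdict == V_PENDING && now_ms < p->deadline_ms)
        return V_PENDING;
    v = (p->verdict == V_PENDING) ? V_ALLOW : p->verdict;
    p->used = false;
    return v;
}
```

In a real module the table would need locking and the waiting itself would be done with a kernel wait queue; only the state transitions are modeled here.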

0x7: Policy Rules

Rule	Description
Rule ID (unique global) Subject[PPID/PID/PGID/ProgramPath/Argv/UID...] Object[PID/ProgramPath/Argv] Operation[NewProgram/NewThread/LoadMod/ExitProgram] Policy[DENY/ALLOW/IGNORE] Policy parameter[ProgramPath:/usr/local/safe ]	Uses the MAC control model to define the constraint between subject and object
rule1 [ProgramPath:/usr/local/mysql Argv:-u(regex) UID:-0] [ProgramPath:/bin/bash] [NewProgram] DENY	Forbid a mysql started by a non-root user from starting a new bash process
rule2 [ProgramPath:/usr/local/mysql ProgramFileAtime: <10] [ProgramPath:/tmp/virus] [NewProgram] DENY	Forbid the mysql process from starting a program whose file was created less than 10 s ago
rule3 [ProgramPath:/usr/local/mysql Argv:net user add(regex)] [ProgramPath:/www/html/dedecms/] [NewProgram] DENY	Forbid mysql from executing system commands whose arguments match "net user add", which adds a system account
rule4 [ProgramPath:/usr/local/apache ] [FilePathName:/tmp/virus.so ModType:ELF] [LoadMod] DENY	Forbid the apache server from loading .so modules from non-standard paths outside its startup phase; attackers often use this to load malicious programs
rule5 [ProgramPath:/usr/local/samba ] [ProgramPath:/bin/bash Argv:cat /etc/passwd(regex) ] [NewProgram] SET	Redirect a new bash process started by samba to a safe process, achieving a sandbox effect
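
A first-match evaluator for rules of this shape could look like the sketch below. It is deliberately reduced: only the ProgramPath fields and the operation type are compared (by path prefix), the Argv/UID/regex conditions from the table are left out, and every identifier (struct rule, evaluate, the OP_ and ACT_ constants) is hypothetical. The default of allowing unmatched operations is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum op     { OP_NEW_PROGRAM, OP_NEW_THREAD, OP_LOAD_MOD, OP_EXIT_PROGRAM };
enum action { ACT_ALLOW, ACT_DENY, ACT_IGNORE, ACT_SET };

struct rule {
    const char *subject_prefix;  /* path prefix of the acting program */
    const char *object_prefix;   /* path prefix of the target program or module */
    enum op     op;
    enum action action;
};

static int has_prefix(const char *s, const char *prefix)
{
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Return the action of the first rule matching (subject, object, op);
 * operations that match no rule are allowed. */
enum action evaluate(const struct rule *rules, size_t n,
                     const char *subject, const char *object, enum op op)
{
    assert(rules != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct rule *r = &rules[i];
        if (r->op != op)
            continue;
        if (!has_prefix(subject, r->subject_prefix))
            continue;
        if (!has_prefix(object, r->object_prefix))
            continue;
        return r->action;
    }
    return ACT_ALLOW;
}
```

With first-match semantics the order of the rule array matters, so more specific rules would have to be listed before broader ones.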

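3. File Change Monitoring (FileMonitor)

The file change monitoring (FileMonitor) of active defense mainly addresses

1. Serving as an auxiliary module for antivirus and process control, monitoring file changes on the operating system asynchronously
2. Improving real-time discovery of malicious files and WEBSHELLs

0x1: Survey of Linux File Change Monitoring Approaches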

http://www.cnblogs.com/LittleHann/p/4500160.html

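We do not use the inotify mechanism natively supported by the Linux kernel, and instead build on LSM a file change monitoring engine with the same functionality as inotify, for the following reasons

1. The overall architecture of inotify means it cannot provide seamless, real-time file monitoring
    1) inotify works by adding inotify-related fields to the inode data structure
    2) The inotify monitoring model adds a watch to each individual directory or file. Monitoring a whole disk requires the programmer to walk directories recursively in user space and add a watch to every file/directory
    3) When a new file or directory is created on disk, user space has to catch that message explicitly and add a watch for it. This makes the user-space monitoring logic more complex and also rules out real-time monitoring
2. inotify's resource usage is capped by system configuration limits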
cat /proc/sys/fs/inotify/max_user_instances
cat /proc/sys/fs/inotify/max_user_watches
/*
[root@iZ23smv648mZ fanotify]# cat /proc/sys/fs/inotify/max_user_instances
128
[root@iZ23smv648mZ fanotify]# cat /proc/sys/fs/inotify/max_user_watches
8192
*/

3. inotify delivers events through the select/poll/epoll I/O model; when many directories are watched, I/O efficiency can become poor

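In essence, a file change monitoring engine built on the file-system-related LSM hook points is the same thing as the kernel part of inotify. The only differences are

1. inotify is designed to report only what it was asked to watch. Internally every event still reaches it, but at the entry point it checks whether the inode is one being watched and ignores the event if it is not
2. The LSM Hook FileNotify engine reports everything and then filters out certain directories through a whitelist
3. inotify therefore has an inherent weakness: it does not pick up newly created files/directories in real time

From the perspective of improving the kernel source, inotify could be improved in the following ways

1. Make the inode check in inotify switchable and offer two modes, single file and whole disk, one for each design approach
2. Accept a recursive option when a watch is set up, like -r in grep; when it is set, subdirectories are monitored recursively, implemented by comparing against parent nodes recursively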

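0x2: Choosing LSM hook points

1. File creation (FILE_CREATE)
    1) security_inode_create: create a new file
    2) security_inode_symlink: create a symbolic link (the link source file is of interest)
    3) security_inode_link: create a hard link (the link source file is of interest)
2. File metadata change (FILE_ATTRIBUTE_CHANGE)
    1) security_inode_rename: file move (mv)
    2) security_inode_rename: directory rename, file rename
3. File write (FILE_WRITE)
    1) security_file_permission: permission check before a write
4. File close (IN_CLOSE_WRITE)
    1) file reference count == 0 && the open flags include WRITE (File opened for writing was closed)
    2) security_file_free: a process closes an open file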

0x3: Monitoring new file creation: security_inode_create

The entry points for file creation on Linux are

1. \linux-2.6.32.63\fs\namei.c
2. \linux-2.6.32.63\fs\cachefiles\namei.c
3. \linux-2.6.32.63\fs\ecryptfs\inode.c
4. \linux-2.6.32.63\fs\nfsd\vfs.c
5. \linux-2.6.32.63\ipc\mqueue.c

Because VFS abstracts the underlying file systems, all file creation goes through vfs_create
/source/fs/namei.c

int vfs_create(struct inode *dir, struct dentry *dentry, int mode, struct nameidata *nd)
{
    int error = may_create(dir, dentry);

    if (error)
        return error;

    if (!dir->i_op->create)
        return -EACCES;    /* shouldn't it be ENOSYS? */
    mode &= S_IALLUGO;
    mode |= S_IFREG;
    //LSM hook point
    error = security_inode_create(dir, dentry, mode);
    if (error)
        return error;
    vfs_dq_init(dir);
    error = dir->i_op->create(dir, dentry, mode, nd);
    if (!error)
        //inotify notification point
        fsnotify_create(dir, dentry);
    return error;
}

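0x4: Monitoring symbolic link creation: security_inode_symlink

The entry points for symbolic link creation on Linux are

1. \linux-2.6.32.63\fs\namei.c
2. \linux-2.6.32.63\fs\ecryptfs\inode.c
3. \linux-2.6.32.63\fs\nfsd\vfs.c

Because VFS abstracts the underlying file systems, all symbolic link creation goes through vfs_symlink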
/source/fs/namei.c

int vfs_symlink(struct inode *dir, struct dentry *dentry, const char *oldname)
{
    int error = may_create(dir, dentry);

    if (error)
        return error;

    if (!dir->i_op->symlink)
        return -EPERM;

    //LSM hook point
    error = security_inode_symlink(dir, dentry, oldname);
    if (error)
        return error;

    vfs_dq_init(dir);
    error = dir->i_op->symlink(dir, dentry, oldname);
    if (!error)
        //inotify notification point
        fsnotify_create(dir, dentry);
    return error;
}

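0x5: Monitoring hard link creation: security_inode_link

The entry points for hard link creation on Linux are

1. \linux-2.6.32.63\fs\namei.c
2. \linux-2.6.32.63\fs\ecryptfs\inode.c
3. \linux-2.6.32.63\fs\nfsd\vfs.c

Because VFS abstracts the underlying file systems, all hard link creation goes through vfs_link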
/source/fs/namei.c

int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
{
    struct inode *inode = old_dentry->d_inode;
    int error;

    if (!inode)
        return -ENOENT;

    error = may_create(dir, new_dentry);
    if (error)
        return error;

    if (dir->i_sb != inode->i_sb)
        return -EXDEV;

    /*
     * A link to an append-only or immutable file cannot be created.
     */
    if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
        return -EPERM;
    if (!dir->i_op->link)
        return -EPERM;
    if (S_ISDIR(inode->i_mode))
        return -EPERM;

    //LSM hook point
    error = security_inode_link(old_dentry, dir, new_dentry);
    if (error)
        return error;

    mutex_lock(&inode->i_mutex);
    vfs_dq_init(dir);
    error = dir->i_op->link(old_dentry, dir, new_dentry);
    mutex_unlock(&inode->i_mutex);
    if (!error)
        //inotify notification point
        fsnotify_link(dir, inode, new_dentry);
    return error;
}

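0x6: Monitoring file moves (mv), directory renames and file renames: security_inode_rename

Moving a file/directory with mv is implemented in the kernel VFS layer by calling vfs_rename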
/source/fs/namei.c

int vfs_rename(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry)
{
    int error;
    int is_dir = S_ISDIR(old_dentry->d_inode->i_mode);
    const char *old_name;

    if (old_dentry->d_inode == new_dentry->d_inode)
         return 0;
 
    error = may_delete(old_dir, old_dentry, is_dir);
    if (error)
        return error;

    if (!new_dentry->d_inode)
        error = may_create(new_dir, new_dentry);
    else
        error = may_delete(new_dir, new_dentry, is_dir);
    if (error)
        return error;

    if (!old_dir->i_op->rename)
        return -EPERM;

    vfs_dq_init(old_dir);
    vfs_dq_init(new_dir);

    old_name = fsnotify_oldname_init(old_dentry->d_name.name);

    //call the VFS rename helper
    if (is_dir)
        error = vfs_rename_dir(old_dir,old_dentry,new_dir,new_dentry);
    else
        error = vfs_rename_other(old_dir,old_dentry,new_dir,new_dentry);

    if (!error) 
    {
        const char *new_name = old_dentry->d_name.name;
        /*
        inotify IN_MOVED event notification point
        fsnotify_move - file old_name at old_dir was moved to new_name at new_dir
        */
        fsnotify_move(old_dir, new_dir, old_name, new_name, is_dir, new_dentry->d_inode, old_dentry);
    }
    fsnotify_oldname_free(old_name);

    return error;
}

error = vfs_rename_dir(old_dir,old_dentry,new_dir,new_dentry);
error = vfs_rename_other(old_dir,old_dentry,new_dir,new_dentry);

static int vfs_rename_dir(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry)
{
    int error = 0;
    struct inode *target;

    /*
     * If we are going to change the parent - check write permissions,
     * we'll need to flip '..'.
     */
    if (new_dir != old_dir) {
        error = inode_permission(old_dentry->d_inode, MAY_WRITE);
        if (error)
            return error;
    }

    //LSM hook point
    error = security_inode_rename(old_dir, old_dentry, new_dir, new_dentry);
    if (error)
        return error;
    ..
}

static int vfs_rename_other(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry)
{
    struct inode *target;
    int error;

    //LSM hook point
    error = security_inode_rename(old_dir, old_dentry, new_dir, new_dentry);
    if (error)
        return error;
    ..
}

0x7: Monitoring file writes: security_file_permission (permission check before a write)

File writes are implemented through the VFS wrappers
/source/fs/read_write.c

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
    ssize_t ret;

    if (!(file->f_mode & FMODE_WRITE))
        return -EBADF;
    if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write))
        return -EINVAL;
    if (unlikely(!access_ok(VERIFY_READ, buf, count)))
        return -EFAULT;

    //verify permissions before writing
    ret = rw_verify_area(WRITE, file, pos, count);
    if (ret >= 0) 
    {
        count = ret;
        if (file->f_op->write)
            ret = file->f_op->write(file, buf, count, pos);
        else
            ret = do_sync_write(file, buf, count, pos);
        if (ret > 0) 
        {
            //inotify notification point
            fsnotify_modify(file->f_path.dentry);
            add_wchar(current, ret);
        }
        inc_syscw(current);
    }

    return ret;
}
EXPORT_SYMBOL(vfs_write);

ret = rw_verify_area(WRITE, file, pos, count);

int rw_verify_area(int read_write, struct file *file, loff_t *ppos, size_t count)
{
    struct inode *inode;
    loff_t pos;
    int retval = -EINVAL;

    inode = file->f_path.dentry->d_inode;
    if (unlikely((ssize_t) count < 0))
        return retval;
    pos = *ppos;
    if (unlikely((pos < 0) || (loff_t) (pos + count) < 0))
        return retval;

    if (unlikely(inode->i_flock && mandatory_lock(inode))) {
        retval = locks_mandatory_area(
            read_write == READ ? FLOCK_VERIFY_READ : FLOCK_VERIFY_WRITE,
            inode, file, pos, count);
        if (retval < 0)
            return retval;
    }
    //LSM hook point
    retval = security_file_permission(file,
                read_write == READ ? MAY_READ : MAY_WRITE);
    if (retval)
        return retval;
    return count > MAX_RW_COUNT ? MAX_RW_COUNT : count;
}

The other VFS write API: vfs_writev

ssize_t vfs_writev(struct file *file, const struct iovec __user *vec, unsigned long vlen, loff_t *pos)
{
    if (!(file->f_mode & FMODE_WRITE))
        return -EBADF;
    if (!file->f_op || (!file->f_op->aio_write && !file->f_op->write))
        return -EINVAL;

    return do_readv_writev(WRITE, file, vec, vlen, pos);
}

return do_readv_writev(WRITE, file, vec, vlen, pos);

static ssize_t do_readv_writev(int type, struct file *file, const struct iovec __user * uvector, unsigned long nr_segs, loff_t *pos)
{
    size_t tot_len;
    struct iovec iovstack[UIO_FASTIOV];
    struct iovec *iov = iovstack;
    ssize_t ret;
    io_fn_t fn;
    iov_fn_t fnv;

    if (!file->f_op) {
        ret = -EINVAL;
        goto out;
    }

    ret = rw_copy_check_uvector(type, uvector, nr_segs,
            ARRAY_SIZE(iovstack), iovstack, &iov);
    if (ret <= 0)
        goto out;

    tot_len = ret;
    //rw_verify_area calls the security_file_permission LSM hook point
    ret = rw_verify_area(type, file, pos, tot_len);
    if (ret < 0)
        goto out;

    fnv = NULL;
    if (type == READ) {
        fn = file->f_op->read;
        fnv = file->f_op->aio_read;
    } else {
        fn = (io_fn_t)file->f_op->write;
        fnv = file->f_op->aio_write;
    }

    if (fnv)
        ret = do_sync_readv_writev(file, iov, nr_segs, tot_len,
                        pos, fnv);
    else
        ret = do_loop_readv_writev(file, iov, nr_segs, pos, fn);

out:
    if (iov != iovstack)
        kfree(iov);
    if ((ret + (type == READ)) > 0) {
        if (type == READ)
            fsnotify_access(file->f_path.dentry);
        else
            fsnotify_modify(file->f_path.dentry);
    }
    return ret;
}

The LSM hook point security_file_permission covers both VFS write paths shown above (vfs_write and vfs_writev), and it is passed MAY_WRITE or MAY_READ according to the operation type

retval = security_file_permission(file, read_write == READ ? MAY_READ : MAY_WRITE);

0x8: Monitoring a process closing an open file: security_file_free

\linux-2.6.32.63\fs\open.c

SYSCALL_DEFINE1(close, unsigned int, fd)
{
    struct file * filp;
    struct files_struct *files = current->files;
    struct fdtable *fdt;
    int retval;

    spin_lock(&files->file_lock);
    /*
    get a pointer to the struct fdtable
    \linux-2.6.32.63\include\linux\fdtable.h
    #define files_fdtable(files) (rcu_dereference((files)->fdt))
    */
    fdt = files_fdtable(files);
    if (fd >= fdt->max_fds)
    {
        goto out_unlock;
    } 
    //get the struct file that the descriptor being closed refers to
    filp = fdt->fd[fd];
    if (!filp)
    {
        goto out_unlock;
    } 
    /*
    set the corresponding element of fd_array[] to NULL
    */
    rcu_assign_pointer(fdt->fd[fd], NULL);
    FD_CLR(fd, fdt->close_on_exec); 
    /*
    call __put_unused_fd to recycle this fd so that the next open can reuse it
    static void __put_unused_fd(struct files_struct *files, unsigned int fd)
    {
        struct fdtable *fdt = files_fdtable(files);
        __FD_CLR(fd, fdt->open_fds);
        if (fd < files->next_fd)
        {
            files->next_fd = fd;
        } 
    }
    */
    __put_unused_fd(files, fd);
    spin_unlock(&files->file_lock);
    retval = filp_close(filp, files);

    /* can't restart close syscall because file table entry was cleared */
    if (unlikely(retval == -ERESTARTSYS || retval == -ERESTARTNOINTR || retval == -ERESTARTNOHAND || retval == -ERESTART_RESTARTBLOCK))
    {
        retval = -EINTR;
    } 

    return retval;

out_unlock:
    spin_unlock(&files->file_lock);
    return -EBADF;
}
EXPORT_SYMBOL(sys_close);

retval = filp_close(filp, files);

int filp_close(struct file *filp, fl_owner_t id)
{
    int retval = 0;

    if (!file_count(filp)) 
    {
        printk(KERN_ERR "VFS: Close: file count is 0\n");
        return 0;
    }

    if (filp->f_op && filp->f_op->flush)
    {
        retval = filp->f_op->flush(filp, id);
    } 

    dnotify_flush(filp, id);
    locks_remove_posix(filp, id);
    fput(filp);
    return retval;
}
EXPORT_SYMBOL(filp_close);

fput(filp);
\linux-2.6.32.63\fs\file_table.c

void fput(struct file *file)
{
    if (atomic_long_dec_and_test(&file->f_count))
        __fput(file);
}
EXPORT_SYMBOL(fput);

__fput(file);

void __fput(struct file *file)
{
    struct dentry *dentry = file->f_path.dentry;
    struct vfsmount *mnt = file->f_path.mnt;
    struct inode *inode = dentry->d_inode;

    might_sleep();

    //inotify kernel notification point
    fsnotify_close(file);
    /*
     * The function eventpoll_release() should be the first called
     * in the file cleanup chain.
     */
    eventpoll_release(file);
    locks_remove_flock(file);

    if (unlikely(file->f_flags & FASYNC)) {
        if (file->f_op && file->f_op->fasync)
            file->f_op->fasync(-1, file, 0);
    }
    if (file->f_op && file->f_op->release)
        file->f_op->release(inode, file);
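    //by this point release operations have already been performed on the struct file instance; many of its fields may be in a "bad address" state and cannot be relied on

    //LSM hook point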
    security_file_free(file);

    ima_file_free(file);
    if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
        cdev_put(inode->i_cdev);
    fops_put(file->f_op);
    put_pid(file->f_owner.pid);
    file_kill(file);
    if (file->f_mode & FMODE_WRITE)
        drop_file_write_access(file);
    file->f_path.dentry = NULL;
    file->f_path.mnt = NULL;
    file_free(file);
    dput(dentry);
    mntput(mnt);
}

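The problem here is that at the security_file_free LSM hook point several release operations have already been performed on the struct file instance, and many member addresses are no longer usable, so we cannot reliably obtain the parameters we need from struct file
A workable approach is to reuse the native Linux inotify mechanism: insert a hook group at the head of the inotify group chain whose handler processes events for every inode, which works around inotify's inherent inability to monitor all directories. After processing, the handler always returns false so that the group chain continues to run
We walk through this idea step by step with the inotify source code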
\linux-2.6.32.63\fs\notify\fsnotify.c

void fsnotify(struct inode *to_tell, __u32 mask, void *data, int data_is, const char *file_name, u32 cookie)
{
    struct fsnotify_group *group;
    struct fsnotify_event *event = NULL;
    int idx;
    /* global tests shouldn't care about events on child only the specific event */
    __u32 test_mask = (mask & ~FS_EVENT_ON_CHILD);

    //check whether the list of registered fsnotify_groups is empty; the inotify group chain must contain at least one group
    if (list_empty(&fsnotify_groups))
        return;

    if (!(test_mask & fsnotify_mask))
        return;

    if (!(test_mask & to_tell->i_fsnotify_mask))
        return;
    /*
     * SRCU!!  the groups list is very very much read only and the path is
     * very hot.  The VAST majority of events are not going to need to do
     * anything other than walk the list so it's crazy to pre-allocate.
     */
    idx = srcu_read_lock(&fsnotify_grp_srcu);
    list_for_each_entry_rcu(group, &fsnotify_groups, group_list) 
    {
        if (test_mask & group->mask) 
        {    
            //call should_send_event from the group's ops table and check the result
            if (!group->ops->should_send_event(group, to_tell, mask))
                continue;
            if (!event) 
            {
                event = fsnotify_create_event(to_tell, mask, data, data_is, file_name, cookie, GFP_KERNEL);
                /* shit, we OOM'd and now we can't tell, maybe
                 * someday someone else will want to do something
                 * here */
                if (!event)
                    break;
            }
            group->ops->handle_event(group, event);
        }
    }
    srcu_read_unlock(&fsnotify_grp_srcu, idx);
    /*
     * fsnotify_create_event() took a reference so the event can't be cleaned
     * up while we are still trying to add it to lists, drop that one.
     */
    if (event)
        fsnotify_put_event(event);
}
EXPORT_SYMBOL_GPL(fsnotify);

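To summarize what we want to achieve

1. Insert a new hook group at the head of group_list to intercept the inotify execution flow; this can be done with the kernel's native list_add function (readers walk the list with list_for_each_entry_rcu, so the RCU variant list_add_rcu is the safer choice)
2. The hook group logic must not affect inotify behavior configured by users, i.e. it must not disturb the native inotify processing flow
3. Inserting into group_list needs special care: on SMP, writes to group_list must be done under the lock to avoid concurrency problems

These goals lead to the following technical scheme

1. Use kprobe to obtain the addresses of fsnotify_obtain_group() and fsnotify_put_group() at run time
2. Create a new fsnotify_group instance, set the monitored event to FS_CLOSE_WRITE, and set the event callback functions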
/*
group = kp_fsnotify_obtain_group(grp_num, FS_CLOSE_WRITE, &cwnotify_ops);
const struct fsnotify_ops cwnotify_ops = 
{
    .should_send_event = cwnotify_should_send_event,
};
*/
3. cwnotify_should_send_event
/*
static bool cwnotify_should_send_event(struct fsnotify_group *group, struct inode *inode, __u32 mask)
{
    alise_file_close(inode); 
    return false;
}
*/
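4. Call idr_init() and add the new hook struct fsnotify_group to the head of the group_list list
5. Point the hook struct fsnotify_group->ops->should_send_event at our hook function, which collects the parameters and reports the event
6. After processing, return false so that the loop executes continue, skips event delivery for the hook group, and moves on to the native inotify handling of the remaining groups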
/*
if (!group->ops->should_send_event(group, to_tell, mask))
                continue;
*/
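
The effect of returning false from should_send_event can be checked with a user-space model of the dispatch loop. This is only a sketch: struct group, dispatch and the callbacks are hypothetical stand-ins for fsnotify_group and fsnotify(), inode numbers replace struct inode, and the hook simply counts events where the real module would call alise_file_close().

```c
#include <stdbool.h>
#include <stddef.h>

struct group;

struct group_ops {
    bool (*should_send_event)(struct group *g, unsigned long inode_no);
    void (*handle_event)(struct group *g, unsigned long inode_no);
};

struct group {
    const struct group_ops *ops;
    unsigned long watched_inode;    /* used by the ordinary group only */
    int seen;                       /* events that reached should_send_event */
    int handled;                    /* events delivered to handle_event */
    struct group *next;
};

/* Hook group: observe every event, then decline it so the walk continues. */
static bool hook_should_send(struct group *g, unsigned long inode_no)
{
    (void)inode_no;
    g->seen++;          /* stands in for alise_file_close(inode) */
    return false;
}

/* Ordinary group: only interested in the inode it watches. */
static bool watch_should_send(struct group *g, unsigned long inode_no)
{
    g->seen++;
    return inode_no == g->watched_inode;
}

static void count_handle(struct group *g, unsigned long inode_no)
{
    (void)inode_no;
    g->handled++;
}

const struct group_ops hook_ops = {
    .should_send_event = hook_should_send,
    .handle_event = count_handle,
};

const struct group_ops watch_ops = {
    .should_send_event = watch_should_send,
    .handle_event = count_handle,
};

/* Same control flow as the loop in fsnotify(): skip a group whose
 * should_send_event returns false, otherwise deliver the event to it. */
void dispatch(struct group *head, unsigned long inode_no)
{
    for (struct group *g = head; g != NULL; g = g->next) {
        if (!g->ops->should_send_event(g, inode_no))
            continue;
        g->ops->handle_event(g, inode_no);
    }
}
```

With the hook group at the head of the chain, it sees each dispatched event while the user's own group still receives exactly the events it asked for.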

Going further, let us look at how fanotify implements its three monitoring modes: directed, per-mount and global
\linux-3.15.5\fs\notify\fanotify\fanotify_user.c

/* fanotify syscalls */
SYSCALL_DEFINE2(fanotify_init, unsigned int, flags, unsigned int, event_f_flags)
{
    ..
    /* fsnotify_alloc_group takes a ref.  Dropped in fanotify_release */
    group = fsnotify_alloc_group(&fanotify_fsnotify_ops);
    if (IS_ERR(group)) {
        free_uid(user);
        return PTR_ERR(group);
    }
    ..
}

The operations table of the group is declared at initialization
\linux-3.15.5\fs\notify\fanotify\fanotify.c

const struct fsnotify_ops fanotify_fsnotify_ops = 
{
    .handle_event = fanotify_handle_event,
    .free_group_priv = fanotify_free_group_priv,
    .free_event = fanotify_free_event,
};

The monitoring mode and the monitored directory are specified by calling fanotify_mark

SYSCALL_DEFINE5(fanotify_mark, int, fanotify_fd, unsigned int, flags,
                  __u64, mask, int, dfd,
                  const char  __user *, pathname)
{
    ..
    case FAN_MARK_ADD:
        if (flags & FAN_MARK_MOUNT)
            ret = fanotify_add_vfsmount_mark(group, mnt, mask, flags);
        else
            ret = fanotify_add_inode_mark(group, inode, mask, flags);
        break;
    ..
}

0x9: Filtering frequent repeated events with a time window

Consider the scenario shown in the following code

#include <stdio.h>
#include <stdlib.h> 
#include <sys/time.h> 

int main(void)
{
    FILE *pFile;
    struct timeval start;
    struct timeval end;
    double diff;
    int i;

    pFile = fopen("myfile.txt", "w");
    if (pFile == NULL)
    {
        perror("fopen");
        return EXIT_FAILURE;
    }

    for (i = 0; i < 99; ++i)
    {
        //start timing
        gettimeofday(&start, NULL);

        fputs("fopen example", pFile);
        //flush so that every iteration really issues a write() system call
        fflush(pFile);

        gettimeofday(&end, NULL);
        //stop timing

        diff = 1000000.0 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec);
        printf("elapsed time is %f (s)\n", diff / 1000000);
    }

    fclose(pFile);
    return 0;
}

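The program issues a large number of writes in a short time, so the active defense captures a burst of write events within a very short period and the reporting channel is congested for no benefit. To solve this, repeated write events (FILE_WRITE) on the same file are deduplicated with a time window (50ms) between two events, so that frequent reads and writes of one file within a short time do not put pointless pressure on the upload channel. If no further event arrives within 50ms, the last event is reported with its time as lasttime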

if (open && write) || ( write exceed > 50ms) || ((read || exec ) and (write log exceed > 50ms ))
    report
else if (( no write log ) and write)
    write log
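
The window check can be sketched as a small per-file table. This is a minimal model under stated assumptions: should_report_write is a hypothetical name, files are identified by a non-zero inode number, timestamps are passed in as milliseconds, and the trailing flush of the last suppressed event (which would need a timer) is left out.

```c
#include <stdbool.h>
#include <stddef.h>

#define WINDOW_MS 50
#define MAX_TRACKED 128

struct write_log {
    unsigned long inode_no;     /* 0 marks a free slot */
    long last_report_ms;        /* time of the last reported write */
};

static struct write_log logs[MAX_TRACKED];

/* Returns true when a FILE_WRITE event on inode_no at now_ms should be
 * reported, false when it falls inside the window of the previous report. */
bool should_report_write(unsigned long inode_no, long now_ms)
{
    struct write_log *free_slot = NULL;

    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (logs[i].inode_no == inode_no) {
            if (now_ms - logs[i].last_report_ms < WINDOW_MS)
                return false;           /* duplicate inside the window */
            logs[i].last_report_ms = now_ms;
            return true;
        }
        if (logs[i].inode_no == 0 && free_slot == NULL)
            free_slot = &logs[i];
    }

    /* First write seen for this file: remember it if there is room. */
    if (free_slot != NULL) {
        free_slot->inode_no = inode_no;
        free_slot->last_report_ms = now_ms;
    }
    return true;    /* report; when the table is full nothing is suppressed */
}
```

Measured this way, a burst of writes to one file produces at most one report per 50ms, while writes to other files are unaffected.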

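0x10: Filtering redundant non-executable files

Use S_ISREG to filter the files that changed. File change monitoring only cares about the following kinds of files

1. Library files (.so, .a)
2. ELF files
3. Symbolic links, hard links
4. Text files (webshell, .py, .python..)

Many scenarios in file change monitoring generate redundant file change events, for example

1. Files produced while compiling source code
    1) ASCII source files produced by #include preprocessing
    2) Intermediate .obj files produced by the compiler
2. Cache files produced while a website is running
    1) .php
    2) .asp
3. Temporary files produced by editors
    1) xxx~

Therefore the kernel side reads the ELF header of the current file and uses its type field to filter out redundant files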
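
The header check can be illustrated in user-space C with the definitions from <elf.h>. This is a sketch, not the module's code: is_interesting_elf is a hypothetical name, the choice to keep ET_EXEC and ET_DYN and drop everything else (including relocatable ET_REL objects) is an assumption, and the header is assumed to be in host byte order.

```c
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true when buf starts with an ELF header whose type is an
 * executable (ET_EXEC) or a shared object / PIE (ET_DYN). */
bool is_interesting_elf(const unsigned char *buf, size_t len)
{
    Elf64_Half type;

    /* e_ident (EI_NIDENT bytes) is followed directly by the 2-byte e_type
     * field in both the 32-bit and the 64-bit header layout. */
    if (len < EI_NIDENT + sizeof(type))
        return false;

    /* Magic number: 0x7f 'E' 'L' 'F' */
    if (memcmp(buf, ELFMAG, SELFMAG) != 0)
        return false;

    memcpy(&type, buf + EI_NIDENT, sizeof(type));
    return type == ET_EXEC || type == ET_DYN;
}
```

Only the first 18 bytes of the file are needed, so the check is cheap enough to run on every reported change.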

Relevant Link:

http://stackoverflow.com/questions/1835947/how-do-i-program-for-linuxs-new-fanotify-file-system-monitoring-feature
https://www.ibm.com/developerworks/cn/linux/l-cn-fanotify/

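4. Auxiliary interfaces exported by a kernel module on Linux

As a kernel driver (LKM), the active defense takes over all operations of the system at the kernel level. Besides auditing and real-time blocking, it also needs to export auxiliary interfaces that give user space APIs with the highest privileges of the operating system, for example

1. After a trusted handshake, a user-space application can call an auxiliary interface to read and write any file on disk regardless of file system ACLs
    1) Open up permissions in the LSM hook security_file_permission
    2) User space calls the auxiliary interface to enable privileged mode for a given process PID
2. After a trusted handshake, a user-space application can read and write the virtual memory of any process in the system regardless of the operating system's process memory protection

Wrapping kernel module functionality in user-space interfaces essentially extends kernel function calls into user space through Ring3~Ring0 communication. With this wrapper, user space can use the kernel module's features as if it were calling C library APIs
Functionally, the auxiliary interfaces have to bypass Linux permission control and ACL control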

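0x1: Ring3 ~ Ring0 exchange of call parameters and results through ioctl on a procfs node

The core of exporting kernel module functionality to user space is procfs; the most important parts of the node created for interface communication are its .write and .read handler functions

1. .read handler
    1) After issuing an auxiliary interface call, the user-space application waits for some time and then tries to read the node buf
    2) If the kernel module has finished processing, user space can now read the data from the node buf
    3) After the call completes, the user-space application clears the node buf

2. .write handler
    1) User space issues an auxiliary interface call and passes parameters to the interface node, which triggers the .write handler; the kernel module reads the node buf to obtain the call type and call parameters
    2) The kernel module calls the actual handler function to do the work
    3) When processing is done, the kernel module writes the result to the node buf

0x2: Ring3 ~ Ring0 exchange of call parameters and results through prctl on a procfs node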

prctl - operations on a process

#include <sys/prctl.h>

int prctl(int option, unsigned long arg2, unsigned long arg3, unsigned long arg4, unsigned long arg5);

prctl() is called with a first argument describing what to do (with values defined in <linux/prctl.h>), and further arguments with a significance depending on the first one. The first argument can be:

..
1. PR_SET_PDEATHSIG
(since Linux 2.1.57) Set the parent process death signal of the current process to arg2 (either a signal value in the range 1..maxsig, or 0 to clear). This is the signal that the current process will get when its parent dies. This value is cleared upon a fork().

2. PR_GET_PDEATHSIG
(since Linux 2.3.15) Read the current value of the parent process death signal into the (int *) arg2.

3. PR_SET_DUMPABLE
(Since Linux 2.4) Set the state of the flag determining whether core dumps are produced for this process upon delivery of a signal whose default behaviour is to produce a core dump. (Normally this flag is set for a process by default, but it is cleared when a set-user-ID or set-group-ID program is executed and also by various system calls that manipulate process UIDs and GIDs). In kernels up to and including 2.6.12, arg2 must be either 0 (process is not dumpable) or 1 (process is dumpable). Since kernel 2.6.13, the value 2 is also permitted; this causes any binary which normally would not be dumped to be dumped readable by root only. (See also the description of /proc/sys/fs/suid_dumpable in proc(5).)

4. PR_GET_DUMPABLE
(Since Linux 2.4) Return (as the function result) the current state of the calling process’s dumpable flag.

5. PR_SET_KEEPCAPS
Set the state of the process’s "keep capabilities" flag, which determines whether the process’s effective and permitted capability sets are cleared when a change is made to the process’s user IDs such that the process’s real UID, effective UID, and saved set-user-ID all become non-zero when at least one of them previously had the value 0. (By default, these credential sets are cleared). arg2 must be either 0 (capabilities are cleared) or 1 (capabilities are kept).

6. PR_GET_KEEPCAPS
..

The following example uses prctl to name the calling thread

#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/prctl.h>

void *tmain(void *arg)
{
        char name[32];

        (void)arg;
        //set the thread name, then read it back
        prctl(PR_SET_NAME, (unsigned long)"LittleHann");
        prctl(PR_GET_NAME, (unsigned long)name);
        printf("%s\n", name);
        //keep the thread alive so that it can be inspected under /proc
        while (1)
                sleep(1);
        return NULL;
}

int main(void)
{
        pthread_t tid;

        pthread_create(&tid, NULL, tmain, NULL);
        pthread_join(tid, NULL);

        return 0;
}
//gcc tmp.c -o tmp -lpthread

Open another terminal and inspect the process

ps aux | grep tmp
root     13259  0.0  0.0  16212   520 pts/0    Sl+  15:14   0:00 ./tmp

cd /proc/13259/task/
ll
//choose the task that corresponds to the thread
cd /proc/13259/task/13260
cat status
Name:    LittleHann
State:    S (sleeping)

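0x3: Ring3 ~ Ring0 exchange of call parameters and results through Netlink

Since a function call is essentially a form of communication, Netlink is also a good basis for the auxiliary interface call mechanism: kernel space and user space each maintain a Netlink socket, and an API call is turned into an asynchronous data transfer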

Relevant Link:

http://man7.org/linux/man-pages/man2/prctl.2.html
http://blog.csdn.net/dahailantian1/article/details/5950824

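5. Self-protection of the active defense

Any security product basically has to solve the following problems

1. How to make the protection take effect correctly
2. Whether the assumption that the system is in a "safe state" when the protection module is loaded actually holds; this decides whether the protection module itself can be hijacked
3. How to protect the "protection module itself", i.e. keep it from being unloaded

Among these, self-protection of the protection module is the most critical and important problem. In essence, self-protection means guaranteeing that the module (program) stays in a "normal working state" (not killed, not disabled, still providing effective defense, still scheduled normally, etc.) even under abnormal attack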

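0x1: Load verification and secure loading of KERNEL modules

Linux can load kernel modules dynamically, and in many situations the modules that get loaded must be trustworthy. If an attacker loads a malicious kernel module, the kernel becomes extremely dangerous. There are several approaches to this problem

1. Sign kernel modules; the kernel only loads modules whose signature verifies correctly
Linux has implemented this since kernel 3.7, under the name linux-modsign

2. Use an LSM module to protect module loading (insmod, i.e. the init_module system call)
    1) insmod cannot be replaced (add verification logic in init_module)
    2) The lsm module must be loaded correctly (load the protection module at the very beginning of boot, when the system is in a completely clean state)
    3) The lsm module must not be unloadable; unloading means the security mechanism stops working
        3.1) Compile the lsm module directly into the kernel
        3.2) hook unload_module
        3.3) Build the kernel with CONFIG_MODULE_UNLOAD disabled so that kernel modules cannot be unloaded

3. The KO verification mechanisms provided by the kernel itself
    1) Module CRC check: applies to symbols exported by modules (the kernel); a simple ABI (Application Binary Interface) consistency check
    2) vermagic (Version Magic String) check: records the kernel version, SMP and other configuration at module build time; if the module's vermagic does not match the host, loading is aborted
//When loading a module the Linux kernel first checks the module CRC values and then compares the vermagic string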

Relevant Link:

http://www.ibm.com/developerworks/cn/linux/l-cn-kernelmodules/
http://lwn.net/Articles/470435
http://www.4byte.cn/learning/43273.html

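0x2: Trusted RING0 ~ RING3 communication

RING3 ~ RING0 communication on Linux is basically a form of inter-process communication. For an attacker who has also obtained system privileges, the channel has no identity verification and can be eavesdropped on, injected into, tampered with or redirected. To address this, the channel needs a special design

1. Before its first handshake with the kernel-side Netlink, the user-space Netlink endpoint generates a random value and passes it to the kernel-side Netlink as its pid
2. During the handshake the kernel-side Netlink records this pid and stores it locally
3. If any other user-space process later tries to handshake with the kernel-side Netlink, the kernel checks whether the pid matches
    1) If it matches, the peer is trusted
    2) If it does not match, the peer is an unauthorized process and the handshake is refused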
/*
if(alise_id && (NETLINK_CB(in_skb).pid != alise_id)){
        printk(KERN_INFO "Alise: some other process try connect.\n");
        return;
    }
*/

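This scheme still carries a risk: an attacker can enumerate the key space of the random key and keep trying, in a user-space while loop, to receive the active defense's netlink messages. If one guess succeeds, the attacker receives the messages reported by the kernel side directly and, through this race condition, bypasses the message hook so that the active defense effectively loses its hooking ability. The following scheme can be used

1. The trusted user-space main program generates a new random authentication key at regular intervals (for example every 120s)
2. It passes the key to the kernel-side Netlink as its pid
3. During the handshake the kernel-side Netlink records this pid and stores it locally
4. If any other user-space process later tries to handshake or exchange data with the kernel-side Netlink, the kernel checks whether the pid matches
    1) If it matches, the peer is trusted
    2) If it does not match, the peer is an unauthorized process; handshake and data communication are refused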
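
On the user-space side, passing a random value "as the pid" amounts to binding the Netlink socket with that value in nl_pid. A minimal sketch, assuming the key is read from /dev/urandom; random_key and open_keyed_netlink are hypothetical names, and the protocol number of the module's own Netlink family is left as a parameter because the text does not give it.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>

/* Read a non-zero random 32-bit key from /dev/urandom; returns 0 on failure. */
uint32_t random_key(void)
{
    uint32_t key = 0;
    int fd = open("/dev/urandom", O_RDONLY);

    if (fd < 0)
        return 0;
    while (key == 0) {
        if (read(fd, &key, sizeof(key)) != (ssize_t)sizeof(key)) {
            key = 0;
            break;
        }
    }
    close(fd);
    return key;
}

/* Open a netlink socket whose port id (nl_pid) is the given key.
 * Returns the socket fd, or -1 on error (for example if the id is taken). */
int open_keyed_netlink(int protocol, uint32_t key)
{
    struct sockaddr_nl addr;
    int fd = socket(AF_NETLINK, SOCK_RAW, protocol);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_pid = key;      /* the kernel side sees this value as the sender's pid */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Because a bound socket keeps its nl_pid, rotating the key every 120s would presumably mean opening a new socket with a fresh key and repeating the handshake.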

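0x3: Securing the .so wrapper interface exposed by the kernel module

The active defense kernel module does not allow arbitrary user-space programs to talk to it directly. Instead it provides a wrapper .so that exposes the API; control messages, hook data traffic and configuration file read/reload all go through this .so
Loading and using the .so must also be brought under the "trusted channel" control: only the trusted main program may load it. The solution is

1. In the "module load" hook of process control, restrict who may load the .so used by the active defense so that only the trusted main process can load it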

posted @ 2015-04-07 20:00 郑瀚Andrew


Han Zheng, Practitioners and Theoretical Researcher, Now working in Alibaba Cloud Corp, China

Welcome to contact me. Wechat：LittleHann，My email, 306211321@qq.com，Job mail：zhenghan.zh@alibaba-inc.com
