The idea in this paper is similar to a paper [http://www-users.cselabs.umn.edu/classes/Spring-2019/csci8980/papers/tuning.pdf] we have discussed in class before. In that paper [Auto DBMS Tuning...], the authors use an ML model to automatically tune database knobs, while in this paper the authors use an ML model to optimize the combination of components in a complex software system to achieve better performance.
The whole framework can be divided into 3 parts:
- Dane: A lightweight programming language for implementing small components. These components can be assembled to build a large-scale software system. Some components provide the same function but implement it in different ways. The main contribution of Dane is that it is fast, so different components can be dynamically rewired at very low cost.
- A perception, assembly and learning framework (PAL): It contains 2 modules: 1) Assembly Module, which assembles different components to implement a feature. 2) Perception Module, which monitors the performance of the software system as well as the runtime environment.
- The online learning part: It uses the reinforcement learning model “multi-armed bandit” to optimize the selection of components. Also, as in the [Auto DBMS Tuning...] paper, some component selections are related, so they can share information. Thus, the authors use a regression model to reduce the search space.
Application of ML method:
The configuration is updated continuously. First, the software runs with a random combination of components. After several seconds, PAL collects perception data, and the online learning part updates its estimates and its model. Then the software runs under the new configuration for several seconds, and the model is updated again. In this way the software system is gradually shaped to fit its environment.
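The loop described above can be sketched as a small multi-armed bandit. This is only an illustration under stated assumptions: it uses a plain epsilon-greedy rule (not necessarily the rule used in the paper), each arm stands for one hypothetical combination of components, and the rewards are simulated numbers rather than real perception data. The names `EpsilonGreedyBandit` and `simulate` are made up for this sketch.

```python
import random


class EpsilonGreedyBandit:
    """Each arm is one candidate combination of components (hypothetical)."""

    def __init__(self, n_arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.counts = [0] * n_arms   # how often each combination was run
        self.means = [0.0] * n_arms  # running mean reward per combination
        self.rng = random.Random(seed)

    def select(self):
        # Run every combination once before trusting any estimate.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.counts)), key=lambda a: self.means[a])

    def update(self, arm, reward):
        # Incremental mean update after one observation window.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def simulate(true_rewards, rounds=500, noise=0.05, seed=0):
    """Simulated stand-in for 'run a few seconds, then collect perception data'."""
    bandit = EpsilonGreedyBandit(len(true_rewards), seed=seed)
    env = random.Random(seed + 1)
    for _ in range(rounds):
        arm = bandit.select()
        reward = true_rewards[arm] + env.gauss(0.0, noise)
        bandit.update(arm, reward)
    return bandit


if __name__ == "__main__":
    result = simulate([0.2, 0.9, 0.5])
    print(result.counts, [round(m, 2) for m in result.means])
```

In this sketch the arms are treated as independent. The regression model mentioned in the review would instead let related combinations share information, which this simple version does not attempt.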
- The idea is interesting. It uses an automatic ML model to replace the manually written rules of previous work. This idea could also apply to other computer system problems where configuration and performance are linked.
- The regression factor model reduces the search space, which helps balance exploration and exploitation.
- Compared to the [Auto DBMS Tuning...] paper, this paper considers how to adapt to changes in the deployment environment.
- The performance of the model relies heavily on the components implemented in Dane. Firstly, this means the model cannot be applied to existing software systems written in more common languages, such as C++ and Java. Secondly, it requires that the whole software can be decomposed into multiple independent, small, dynamic components. In some situations this may not be possible.
- For a specific task (such as a web server), the authors could have built a configuration dataset so that, when the model first runs, the application starts with a suboptimal but reasonable combination of components. This could be better than choosing randomly at first.
This paper focuses on detecting malicious Android applications using machine learning methods. The authors define 4 types of features to be extracted from applications:
- 1. well-received features
- 1.1 Permission: Some malicious applications need some specific types of permissions.
- 1.2 Sensitive API Calls: They extracted sensitive API calls from smali files and identified the types of sensitive API calls that best distinguish malicious from benign applications.
- 2. newly-defined features
- 2.1 Sequence: Malicious apps tend to make drastically different sensitive API calls. They defined 3 metrics to quantify the number of sensitive API calls.
- 2.2 Dynamic Behavior: The activities triggered by each application are monitored through its log files.
For both of these feature types, they removed the features shared by malicious and benign apps and kept the most distinguishing ones.
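The filtering step above can be sketched as follows. This is a minimal illustration, assuming each app is represented as a set of feature names and that "distinguishing" means a large gap between how often a feature appears in malicious and in benign apps. The function name, the `min_gap` threshold, and the toy data are all assumptions made for this sketch and are not taken from the paper.

```python
def select_distinguishing_features(malicious_apps, benign_apps, min_gap=0.3):
    """Keep features whose frequency differs strongly between the two classes.

    Each app is a set of feature names (permissions, API calls, ...).
    Returns feature names sorted by decreasing frequency gap.
    """
    all_features = set().union(*malicious_apps, *benign_apps)

    def frequency(apps, feature):
        # Fraction of apps in this class that contain the feature.
        return sum(feature in app for app in apps) / len(apps)

    gaps = {}
    for feature in all_features:
        gap = abs(frequency(malicious_apps, feature) - frequency(benign_apps, feature))
        if gap >= min_gap:  # drop features common to both classes
            gaps[feature] = gap
    return sorted(gaps, key=lambda f: (-gaps[f], f))


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    malicious = [{"SEND_SMS", "INTERNET"}, {"SEND_SMS", "INTERNET", "READ_CONTACTS"}]
    benign = [{"INTERNET"}, {"INTERNET", "CAMERA"}]
    print(select_distinguishing_features(malicious, benign))
```

In the toy data, `INTERNET` appears in every app of both classes, so it carries no signal and is dropped.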
Application of ML methods:
They compared several ML methods: SVM, decision tree, Multi-Layer Perceptron, Naïve Bayes, KNN, and Bagging predictor. They also compared two feature sets (well-received features only / well-received plus newly-defined features). They found that the KNN classifier with well-received plus newly-defined features achieved the best performance.
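To make the best-performing setup more concrete, here is a minimal KNN sketch. It assumes each app has already been turned into a binary feature vector (1 if a selected feature is present), uses Hamming distance, and takes a majority vote among the k nearest training apps. The distance measure, the value of k, and the toy vectors are assumptions for this sketch, not details reported in the paper.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions where two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest neighbours."""
    ranked = sorted(
        range(len(train_vectors)),
        key=lambda i: hamming(train_vectors[i], query),
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy feature vectors, invented for illustration only.
    vectors = [(1, 1, 0), (1, 0, 0), (1, 1, 1), (0, 0, 1), (0, 0, 0), (0, 1, 1)]
    labels = ["malicious"] * 3 + ["benign"] * 3
    print(knn_predict(vectors, labels, (1, 1, 0)))
    print(knn_predict(vectors, labels, (0, 0, 1)))
```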
- They contribute new features for monitoring the dynamic behavior of malicious applications, which achieve better performance than static analysis alone.
- When analyzing features, they used the same app dataset that was also used to test the ML classifiers. That is, they removed the common features and selected the most distinguishing ones based on this analysis before testing the classifiers. I think this introduces a kind of “over-fitting”. They should use different app datasets for analysis and testing.
- Decompiling and analyzing applications on an Android device may use too many system resources, since mobile CPUs and memory are comparatively limited. I think this method is more suitable for detecting malicious applications on a PC.
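One way to address the “over-fitting” concern raised above is to set aside a test set before any feature analysis is done. The sketch below shows such a split under simple assumptions: the function name, the 30% test fraction, and the fixed seed are illustrative choices, not something taken from the paper.

```python
import random


def split_for_analysis_and_test(apps, test_fraction=0.3, seed=0):
    """Shuffle the apps and return (analysis_set, test_set) with no overlap.

    Feature analysis and selection should only look at analysis_set;
    test_set is touched only when evaluating the final classifier.
    """
    shuffled = list(apps)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    analysis, test = split_for_analysis_and_test(range(10))
    print(len(analysis), len(test))
```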