Distributed Log Analysis for a Large Website

Big Data Web Log Analysis

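Having worked through several basic big data frameworks and components, I put together a first version of a log analysis system for an e-commerce website. It has plenty of shortcomings and is probably not fit for production, but it serves both as my own summary and as practice in applying what I learned.
I will not cover how to install or use each component here; I plan to write those up in later posts. This post only reviews the rough implementation path of the project. For the code below, following the commented steps is enough: it relies on a number of utility classes I wrote myself, which are too numerous to include here. See GitHub for the details: https://github.com/ongbo/LogProject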
**The setup for this project** is:

OS: three CentOS 7 virtual machines running in VMware on Windows
jdk-1.8.0_201
hadoop-2.8.5
flume-1.9.0
hbase-2.1.3
nginx-1.12.2
zookeeper-3.4.13
mysql-8.0.15
IDEA 2019

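Let me start with the project structure. The project was built in several steps.

1: Setting up the web server

I set up an nginx server on Linux. I will not go into installing nginx; there are two ways to do it, and I can write a separate post if needed.
The full nginx configuration file is as follows: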

# For more information on configuration, see:
#   * Official English Documentation: http://nginx.org/en/docs/
#   * Official Russian Documentation: http://nginx.org/ru/docs/
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;
# Load dynamic modules. See /usr/share/nginx/README.dynamic.
include /usr/share/nginx/modules/*.conf;
events {
    worker_connections 1024;
}
http {
   # log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
   #                   '$status $body_bytes_sent "$http_referer" '
   #                   '"$http_user_agent" "$http_x_forwarded_for"';
    log_format main '$remote_addr^A$remote_user^A$msec^A$http_host^A$http_user_agent^A$request_uri';  # format of the access log
    access_log  /home/hadoop/app/nginx/access.log  main;  # path of the access log file access.log
    sendfile            on;
    tcp_nopush          on;
    tcp_nodelay         on;
    keepalive_timeout   65;
    types_hash_max_size 2048;
    include             /etc/nginx/mime.types;
    default_type        application/octet-stream;
    # Load modular configuration files from the /etc/nginx/conf.d directory.
    # See http://nginx.org/en/docs/ngx_core_module.html#include
    # for more information.
    include /etc/nginx/conf.d/*.conf;
    server {
        listen       80 default_server;
        listen       [::]:80 default_server;
        server_name  _;
        root         /usr/share/nginx/html;
        # Load configuration files for the default server block.
        include /etc/nginx/default.d/*.conf;

        location / {
        }
        error_page 404 /404.html;
        location = /40x.html {
        }

        error_page 500 502 503 504 /50x.html;
        location = /50x.html {
        }
    }
}

The nginx configuration sets the path and the format of the access log. Whenever a client visits the site, the details of that request are written to access.log in real time.

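2: Extracting data with Flume

I use flume-1.9.0 to monitor access.log and write its contents to HDFS in real time.
This requires a Hadoop cluster that is already set up and fully working.
In the dirflume directory under the flume-1.9.0 installation directory, I wrote a configuration file named option.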

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source (the data source)
a1.sources.r1.type =exec
a1.sources.r1.command =tail -F /home/hadoop/app/nginx/access.log
# Describe the sink (where the data is written)
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /log/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 10240000
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.callTimeout = 60000
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.idleTimeout = 10
# Use a channel which buffers events in memory (the channel between source and sink)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

This configures the data flow source -> channel -> sink.
Then I run the command:

flume-ng agent --conf-file dirflume/option --name a1 -Dflume.root.logger=INFO,console

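Flume now monitors access.log in real time, because the configuration contains a1.sources.r1.command =tail -F /home/hadoop/app/nginx/access.log. Whenever nginx appends a log entry to access.log, Flume picks it up, prints its own log output to the console, and writes the entry to HDFS, where the corresponding data can then be seen.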
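To make the sink path /log/%Y%m%d more concrete, here is a minimal sketch of how a timestamp maps to a daily directory. With hdfs.useLocalTimeStamp = true the HDFS sink fills the escape sequences from the agent's local clock rather than from an event header. The class HdfsPathSketch and the method dailyPath are hypothetical names used only for illustration; this is not Flume code.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

public class HdfsPathSketch {
    // yyyyMMdd corresponds to the Flume escape sequence %Y%m%d
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");

    // Hypothetical helper: returns the daily directory an event written at this time would land in
    static String dailyPath(long epochMillis, ZoneId zone) {
        return "/log/" + DAY.format(Instant.ofEpochMilli(epochMillis).atZone(zone));
    }

    public static void main(String[] args) {
        // Timestamp taken from the sample log entry in this post (1558089924.870 seconds)
        System.out.println(dailyPath(1558089924870L, ZoneId.systemDefault()));
    }
}
```

Because every day gets its own directory, a later MapReduce job can take a single day's directory as its input path.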

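3: ETL data processing

This is the real core of the data processing; everything before was only about getting hold of the data. This is where module development and data cleaning begin.
This step reads the log data stored in HDFS, processes and analyzes it with MapReduce, and writes the results to HBase.
Recall the log format we defined in nginx: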

 log_format main '$remote_addr^A$remote_user^A$msec^A$http_host^A$http_user_agent^A$request_uri';
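 remote_addr: the client (remote) address
 remote_user: the user name supplied with the request
 msec: the time in seconds, with millisecond resolution
 http_host: the host address
 http_user_agent: the browser and platform information
 request_uri: the request URI, including its arguments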
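A minimal sketch of splitting one such log line into its six fields could look like this. It assumes the separator in the file is the literal two-character text ^A, as it appears in the sample; if the file actually contains the control character \u0001, only the SEPARATOR constant would change. LogLineSplitter, split and msecToMillis are hypothetical names, not the utility classes of the project.

```java
import java.math.BigDecimal;
import java.util.regex.Pattern;

public class LogLineSplitter {
    // Assumption: the fields are separated by the literal text "^A"
    private static final String SEPARATOR = "^A";

    // Splits a line into fields; Pattern.quote stops '^' from acting as a regex anchor,
    // and the limit -1 keeps trailing empty fields
    static String[] split(String line) {
        return line.split(Pattern.quote(SEPARATOR), -1);
    }

    // Converts the $msec value ("seconds.millis") to epoch milliseconds without floating-point rounding
    static long msecToMillis(String msec) {
        return new BigDecimal(msec).movePointRight(3).longValue();
    }

    public static void main(String[] args) {
        String line = "192.168.40.1^A-^A1558089924.870^A192.168.40.135^AMozilla/5.0^A/x.JPG?en=e_pv";
        String[] fields = split(line);
        System.out.println(fields.length + " fields, server time in ms: " + msecToMillis(fields[2]));
    }
}
```

Using BigDecimal instead of double avoids losing a millisecond to rounding when the value is multiplied by 1000.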

Now let's take apart one concrete log entry:

remote_addr:
192.168.40.1^A
remote_user:
-^A
msec:
1558089924.870^A
http_host:
192.168.40.135^A
http_user_agent:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36^A
request_uri:
/x.JPG?
en=e_pv&
p_url=http%3A%2F%2Flocalhost%3A8080%2FWebLog%2Fdemo.jsp&
tt=%E6%B5%8B%E8%AF%95%E9%A1%B5%E9%9D%A21&
ver=1&
pl=website&
sdk=js&
u_ud=69547D15-FB1F-4B3B-BEDC-399FD4071FD3&
u_mid=zhangsan&
u_sd=0538D363-4F8C-4F8A-89C8-0CF0B7F0A199&
c_time=1558089923049&
l=zh-CN&
b_iev=Mozilla%2F5.0%20(Windows%20NT%2010.0%3B%20Win64%3B%20x64)%20AppleWebKit%2F537.36%20(KHTML%2C%20like%20Gecko)%20Chrome%2F73.0.3683.103%20Safari%2F537.36&
b_rst=1920*1080

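This is how every field is split apart. The request URI in particular carries many log fields, which are parsed into key-value pairs. During parsing, useless log entries also have to be dropped, such as incomplete records, records without a timestamp, or outdated records.
Finally, the key-value pairs are written to HBase.
Designing the HBase table is quite difficult and depends on the specific business requirements.
Here the row key is the timestamp, the column family is log, and the column values hold the individual key-value pairs. With that, the data is in HBase.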
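The parsing and filtering step can be sketched roughly as follows. This is not the code from the project: RequestUriParser, parseQuery and isUsable are hypothetical names, and treating en and c_time as the required fields is my assumption, based on the rule that records without a time are discarded.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class RequestUriParser {
    // Parses the part after '?' into URL-decoded key-value pairs, keeping the original order
    static Map<String, String> parseQuery(String requestUri) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = requestUri.indexOf('?');
        if (q < 0 || q == requestUri.length() - 1) {
            return params; // no query string at all
        }
        for (String pair : requestUri.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip pairs without a key or without '='
            }
            params.put(pair.substring(0, eq), decode(pair.substring(eq + 1)));
        }
        return params;
    }

    private static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Assumption: a record is only usable if it has an event name and a non-empty client time
    static boolean isUsable(Map<String, String> params) {
        String time = params.get("c_time");
        return params.containsKey("en") && time != null && !time.isEmpty();
    }

    public static void main(String[] args) {
        String uri = "/x.JPG?en=e_pv&p_url=http%3A%2F%2Flocalhost%3A8080%2FWebLog%2Fdemo.jsp&c_time=1558089923049";
        Map<String, String> params = parseQuery(uri);
        System.out.println(params + " usable=" + isUsable(params));
    }
}
```

In the real job each surviving pair would become one column in the log column family, with the key as the column name.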

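4: Extracting data into MySQL

For this stage I wrote another MapReduce job that reads the data from HBase with a scan; the MySQL tables are created in advance.
The runner class sets up the configuration that the MapReduce job needs.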

public class NewInstallUserRunner implements Tool {
    private Configuration conf = null;

    public static void main(String[] args) {
        // Backslashes in a Java string literal must be escaped
        System.setProperty("hadoop.home.dir", "D:\\IDEA\\hadoop-common-2.2.0-bin-master");
        try {
            ToolRunner.run(new Configuration(), new NewInstallUserRunner(), args);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public int run(String[] strings) throws Exception {
        Configuration conf = this.getConf();
        // Parse the arguments before creating the Job, because Job.getInstance copies the configuration
        this.processArgs(conf, strings);
        Job job = Job.getInstance(conf, "new member");
        job.setJarByClass(NewInstallUserRunner.class);
        TableMapReduceUtil.initTableMapperJob(getScans(conf), NewInstallUserMapper.class, StatsUserDimension.class, TimeOutputValue.class, job, false);
        job.setReducerClass(NewInstallUserReducer.class);
        job.setOutputKeyClass(StatsUserDimension.class);
        job.setOutputValueClass(MapWritableValue.class);
        // Output format class that writes the results to MySQL
        job.setOutputFormatClass(TransformerOutputFormat.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    // Reads the running date from the "-d <date>" argument
    private void processArgs(Configuration conf, String[] args) {
        String date = null;
        for (int i = 0; i < args.length; i++) {
            if ("-d".equals(args[i])) {
                if (i + 1 < args.length) {
                    date = args[++i];
                }
            }
        }
        // Only set the running date when it is present and valid
        if (StringUtils.isNotBlank(date) && TimeUtil.isValidateRunningDate(date)) {
            conf.set(GlobalConstants.RUNNING_DATE_PARAMES, date);
        }
    }

    /*
    Reads the data from HBase.
    Conditions:
    1: time range
    2: event type
    3: only some of the columns
    */
    private List<Scan> getScans(Configuration conf) {
        String date = conf.get(GlobalConstants.RUNNING_DATE_PARAMES);
        Scan scan = new Scan();
        long time = TimeUtil.parseString2Long(date);
        // Row keys are timestamps, so one day is the range [time, time + one day)
        String startRow = String.valueOf(time);
        String stopRow = String.valueOf(time + GlobalConstants.DAY_OF_MILLISECONDS);
        scan.setStartRow(startRow.getBytes());
        scan.setStopRow(stopRow.getBytes());
        // Add filters to the scan
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        SingleColumnValueFilter filter1 = new SingleColumnValueFilter(EventLogConstants.EVENT_LOGS_FAMILY_NAME.getBytes(), EventLogConstants.LOG_COLUMN_NAME_EVENT_NAME.getBytes(), CompareOp.EQUAL, EventEnum.LAUNCH.alias.getBytes());
        filters.addFilter(filter1);
        String[] columns = new String[]{EventLogConstants.LOG_COLUMN_NAME_SERVER_TIME, EventLogConstants.LOG_COLUMN_NAME_PLATFORM, EventLogConstants.LOG_COLUMN_NAME_BROWSER_NAME, EventLogConstants.LOG_COLUMN_NAME_BROWSER_VERSION, EventLogConstants.LOG_COLUMN_NAME_UUID};
        filters.addFilter(getFilter(columns));
        scan.setFilter(filters);
        // Specify which HBase table to read from
        scan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, EventLogConstants.HBASE_NAME_EVENT_LOGS.getBytes());
        return Arrays.asList(scan);
    }

    // Builds a filter that only lets the given columns through
    private Filter getFilter(String[] columns) {
        int length = columns.length;
        byte[][] b = new byte[length][];
        for (int i = 0; i < length; i++) {
            b[i] = columns[i].getBytes();
        }
        return new MultipleColumnPrefixFilter(b);
    }

    @Override
    public void setConf(Configuration conf) {
        conf.set("hbase.zookeeper.quorum", "192.168.40.135:2181,192.168.40.134:2181,192.168.40.137:2181");
        conf.addResource("output-collector.xml");
        conf.addResource("query-mapping.xml");
        conf.addResource("transformer-env.xml");
        this.conf = HBaseConfiguration.create(conf);
    }

    @Override
    public Configuration getConf() {
        return this.conf;
    }
}
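The time-range condition of the scan relies on the row keys being millisecond timestamps written as strings. Here is a small stand-alone sketch of that idea, without any HBase dependency. DayRangeSketch, rowRange and inRange are hypothetical names, and the yyyy-MM-dd date format is an assumption, since the format accepted by TimeUtil.parseString2Long is not shown in this post.

```java
import java.time.LocalDate;
import java.time.ZoneId;

public class DayRangeSketch {
    static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

    // Returns {startRow, stopRow} for one day; the date is assumed to be in yyyy-MM-dd form
    static String[] rowRange(String date, ZoneId zone) {
        long start = LocalDate.parse(date).atStartOfDay(zone).toInstant().toEpochMilli();
        return new String[] { String.valueOf(start), String.valueOf(start + DAY_MILLIS) };
    }

    // Mimics the scan semantics: start row inclusive, stop row exclusive, compared as text
    static boolean inRange(String rowKey, String[] range) {
        return rowKey.compareTo(range[0]) >= 0 && rowKey.compareTo(range[1]) < 0;
    }

    public static void main(String[] args) {
        String[] range = rowRange("2019-05-17", ZoneId.of("UTC"));
        System.out.println(range[0] + " .. " + range[1]);
        System.out.println(inRange("1558089924870", range));
    }
}
```

Comparing the keys as text gives the same order as comparing them as numbers only because current millisecond timestamps all have 13 digits; keys of different lengths would need zero padding.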

Map side: extracts the fields read from HBase and defines the KPI modules, so that the reduce stage can compute statistics per module.

// Each analysis condition (a combination of dimensions) is the key; the uuid is carried in the value
public class NewInstallUserMapper extends TableMapper<StatsUserDimension, TimeOutputValue> {
    // Column family
    private final byte[] family = Bytes.toBytes(EventLogConstants.EVENT_LOGS_FAMILY_NAME);
    // Output key and value objects of the map side
    private final StatsUserDimension statsUserDimension = new StatsUserDimension();
    private final TimeOutputValue timeOutputValue = new TimeOutputValue();
    // KPI module dimensions
    // For now, a module for basic user information
    private final KpiDimension newInstallUser = new KpiDimension(KpiType.NEW_INSTALL_USER.name);
    // Browser module
    private final KpiDimension installUserBrowser = new KpiDimension(KpiType.BROWSER_NEW_INSTALL_USER.name);

    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context) throws IOException, InterruptedException {
        // Read the server time from HBase
        String date = Bytes.toString(CellUtil.cloneValue(value.getColumnLatestCell(family, Bytes.toBytes(EventLogConstants.LOG_COLUMN_NAME_SERVER_TIME))));
        // Read the platform
        String platform = Bytes.toString(CellUtil.cloneValue(value.getColumnLatestCell(family, Bytes.toBytes(EventLogConstants.LOG_COLUMN_NAME_PLATFORM))));
        // Read the user id
        String uuid = Bytes.toString(CellUtil.cloneValue(value.getColumnLatestCell(family, Bytes.toBytes(EventLogConstants.LOG_COLUMN_NAME_UUID))));
        // Read the browser name and version
        String browserName = Bytes.toString(CellUtil.cloneValue(value.getColumnLatestCell(family, Bytes.toBytes(EventLogConstants.LOG_COLUMN_NAME_BROWSER_NAME))));
        String browserVersion = Bytes.toString(CellUtil.cloneValue(value.getColumnLatestCell(family, Bytes.toBytes(EventLogConstants.LOG_COLUMN_NAME_BROWSER_VERSION))));
        // Build the date dimension
        long time = Long.parseLong(date);
        DateDimension dateDimension = DateDimension.buildDate(time, DateEnum.DAY);
        // Build the platform dimensions
        List<PlatformDimension> platformDimensions = PlatformDimension.buildList(platform);
        // Build the browser dimensions (name and version)
        List<BrowserDimension> browserDimensions = BrowserDimension.buildList(browserName, browserVersion);
        // Fill the output value object
        timeOutputValue.setId(uuid);
        timeOutputValue.setTime(time);
        /*
         * Combine the dimensions
         */
        // Set the date dimension
        StatsCommonDimension statsCommonDimension = statsUserDimension.getStatsCommon();
        statsCommonDimension.setDate(dateDimension);
        // Platform and browser dimensions are combined in a nested loop, because the platform
        // is either "all" or a specific platform, and the browser version is either "all" or a specific version.
        // Empty browser for the user module
        BrowserDimension defaultBrowser = new BrowserDimension("", "");
        for (PlatformDimension pl : platformDimensions) {
            statsCommonDimension.setKpi(newInstallUser); // user module
            statsCommonDimension.setPlatform(pl); // platform
            statsUserDimension.setBrowser(defaultBrowser); // the user module has no browser information, so use the empty one
            context.write(statsUserDimension, timeOutputValue);
            for (BrowserDimension bd : browserDimensions) {
                statsCommonDimension.setKpi(installUserBrowser); // browser module
                statsUserDimension.setBrowser(bd); // browser information
                context.write(statsUserDimension, timeOutputValue);
            }
        }
    }
}
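To see which keys the nested loop produces for a single log record, here is a simplified sketch that uses plain strings instead of the dimension classes. DimensionComboSketch and keysFor are hypothetical names, the KPI names are placeholders, and the assumption that each buildList returns an "all" entry plus the specific one is taken from the comment in the mapper rather than from the utility code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DimensionComboSketch {
    // Returns one key per (kpi, day, platform, browser, version) combination emitted for a record
    static List<String> keysFor(String day, String platform, String browser, String version) {
        // Assumption: every dimension exists as "all" and as the specific value
        List<String> platforms = Arrays.asList("all", platform);
        List<String> versions = Arrays.asList("all", version);
        List<String> keys = new ArrayList<>();
        for (String pl : platforms) {
            // User module: the browser part stays empty
            keys.add(String.join("|", "new_install_user", day, pl, "", ""));
            for (String v : versions) {
                // Browser module: one key per browser version entry
                keys.add(String.join("|", "browser_new_install_user", day, pl, browser, v));
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        for (String key : keysFor("2019-05-17", "website", "Chrome", "73.0.3683.103")) {
            System.out.println(key);
        }
    }
}
```

Under these assumptions one record turns into six map outputs: for each of the two platform entries, one for the user module and two for the browser module.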

Reduce side: computes the statistics for each module and writes them to MySQL.

public class NewInstallUserReducer extends Reducer<StatsUserDimension, TimeOutputValue, StatsUserDimension, MapWritableValue> {
    // Output value object of the reduce side
    private final MapWritableValue mapWritableValue = new MapWritableValue();
    // Set used for de-duplication
    private final Set<String> unique = new HashSet<>();

    @Override
    protected void reduce(StatsUserDimension key, Iterable<TimeOutputValue> values, Context context) throws IOException, InterruptedException {
        // Clear the set so that values left over from the previous key have no effect
        this.unique.clear();
        // Walk through the values; the size of the set is the final result
        for (TimeOutputValue timeOutputValue : values) {
            this.unique.add(timeOutputValue.getId());
        }
        // Holds the final result: the key is a fixed identifier that makes lookup easy,
        // the value is the size of the set
        MapWritable map = new MapWritable();
        map.put(new IntWritable(-1), new IntWritable(this.unique.size()));
        // Put the map into the reduce output value
        mapWritableValue.setValue(map);
        // Get the module name
        String kpiname = key.getStatsCommon().getKpi().getKpiName();
        // Set the KpiType on the reduce output value
        if (kpiname.equals(KpiType.NEW_INSTALL_USER.name)) {
            mapWritableValue.setKpi(KpiType.NEW_INSTALL_USER);
        } else if (kpiname.equals(KpiType.BROWSER_NEW_INSTALL_USER.name)) {
            mapWritableValue.setKpi(KpiType.BROWSER_NEW_INSTALL_USER);
        }
        context.write(key, mapWritableValue);
    }
}
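The statistic computed by the reducer is a distinct count of user ids per dimension key. Stripped of the Hadoop types, the idea can be sketched like this; DistinctCountSketch and countDistinct are hypothetical names, and the input is a plain array of {key, uuid} pairs instead of grouped reducer input.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class DistinctCountSketch {
    // Counts the distinct uuids seen for each key; each input row is {key, uuid}
    static Map<String, Integer> countDistinct(String[][] pairs) {
        Map<String, Set<String>> seen = new TreeMap<>();
        for (String[] pair : pairs) {
            // A set per key drops repeated uuids, like the HashSet in the reducer
            seen.computeIfAbsent(pair[0], k -> new HashSet<>()).add(pair[1]);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : seen.entrySet()) {
            counts.put(entry.getKey(), entry.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"website|Chrome", "u1"},
            {"website|Chrome", "u1"},
            {"website|Chrome", "u2"},
            {"website|Firefox", "u1"}
        };
        System.out.println(countDistinct(pairs));
    }
}
```

In the MapReduce job the grouping by key is done by the shuffle, so the reducer only needs one set that it clears at the start of every reduce call.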

[Figure: the database tables]
These are the database tables for the data we want to analyze, with their various fields.

For example, analyzing the browser new-user module shows how much each browser version is used.

See GitHub for the details of this step; I will describe it in more detail here later.

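This article was written by 混迹零柒叁壹堕落街 and is licensed under the Creative Commons Attribution 4.0 International License.
Unless marked as reposted or attributed to another source, articles on this site are original or translated by this site; please credit the source when reposting.
Last edited: February 13, 2020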

posted @ 2020-02-01 16:18 ongbo
