An Introduction to SkyWalking for Full-Link Tracing

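This article covers the following topics:

  1. An introduction to SkyWalking
  2. Using SkyWalking, which supports many kinds of middleware calls (httpclient, springmvc, dubbo, mysql, and so on)
  3. Integrating SkyWalking's traceId with logging components (log4j, logback, ELK, and so on)
  4. Using SkyWalking's alarm module
  5. How SkyWalking works
  6. Limitations of SkyWalking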

 

1. Introduction to SkyWalking:

 

            Overview:

SkyWalking: an open source observability platform to collect, analyze, aggregate and visualize data from services and cloud native infrastructures.
SkyWalking provides an easy way to keep you have a clear view of your distributed system, even across Cloud.
It is more like a modern APM, specially designed for cloud native, container based and distributed system.

-------

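In other words, SkyWalking is an open source observability platform that collects, analyzes, aggregates, and visualizes data from services and cloud native infrastructure.
It offers a simple way to get a clear picture of a distributed system, even one that spans clouds.
It is best thought of as a modern APM (application performance management) tool designed for cloud native, container based, distributed systems.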

          Why use SkyWalking?

 

SkyWalking provides solutions for observing and monitoring distributed system, in many different scenarios. 
First of all, like traditional ways, SkyWalking provides auto instrument agents for service, such as Java, C# and Node.js.
At the same time, it provides manual instrument SDKs for Go(Not yet), C++(Not yet).
Also with more languages required, risks in manipulating codes at runtime, cloud native infrastructures grow more powerful,
SkyWalking could use Service Mesher infra probes to collect data for understanding the whole distributed system.
In general, it provides observability capabilities for service(s), service instance(s), endpoint(s).

----------
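Put plainly, SkyWalking offers ways to observe and monitor distributed systems in many different scenarios.
First, like traditional tools, it provides auto-instrumentation agents for Java, C#, Node.js, and others.
It also plans manual instrumentation SDKs for Go and C++, which were not yet available at the time of writing.
As more languages need to be covered, as the risk of manipulating code at runtime grows, and as cloud native infrastructure becomes more capable,
SkyWalking can also use service mesh probes to collect data and build an understanding of the whole distributed system.
In general, SkyWalking provides observability for services, service instances, and endpoints.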

Service: a service
Service Instance: one instance of a service (a single service may run on several nodes)
Endpoint: one of the interfaces exposed by a service
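
The three levels nest inside one another, which a small data model can make concrete. The sketch below is only an illustration: `ServiceModel`, its methods, and the sample names used with it are hypothetical and are not SkyWalking classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model, not a SkyWalking class: service -> instances -> endpoints. */
final class ServiceModel {
    private final String name;
    // instance name -> endpoints observed on that instance
    private final Map<String, List<String>> instances = new LinkedHashMap<>();

    ServiceModel(String name) {
        this.name = name;
    }

    /** Records that an endpoint was called on a given instance (node) of this service. */
    void observe(String instance, String endpoint) {
        List<String> endpoints = instances.computeIfAbsent(instance, k -> new ArrayList<>());
        if (!endpoints.contains(endpoint)) {
            endpoints.add(endpoint);
        }
    }

    String name() {
        return name;
    }

    int instanceCount() {
        return instances.size();
    }

    List<String> endpointsOf(String instance) {
        return instances.getOrDefault(instance, new ArrayList<>());
    }
}
```

A service started on two nodes that both serve /hello/start1 would show up here as one service, two instances, and one endpoint per instance.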

 

          

   Architecture:

 

 

2. Using SkyWalking:

        Step 1: Download the package from the official SkyWalking site, http://skywalking.apache.org/downloads/. The layout of the package is shown in the figure.

             

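     Step 2: Start the SkyWalking collector service. The startup script is E:\apache-skywalking-apm-bin\bin\startup.sh (on Windows, the same directory also contains startup.bat). Once it is running, open http://localhost:8080/ to see the SkyWalking UI.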

          

 

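     Step 3: Start your project. Copy the skywalking-agent directory to wherever you need it. The agent consists of the whole directory, so do not change its structure. In agent.config, change agent.application_code=xxl-job to your own application name (newer releases name this key agent.service_name, so check the agent.config shipped with your version).

              Then add the JVM startup argument -javaagent:/path/to/skywalking-agent/skywalking-agent.jar. The value must be the absolute path of skywalking-agent.jar.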
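
If it is unclear whether the argument actually reached the JVM, the running process can inspect its own startup arguments. The following is a minimal sketch using only the JDK's management API; the class and method names (`AgentCheck`, `findSkyWalkingAgent`) are made up for this example, and it checks only that the argument is present, not that the agent initialized successfully.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: looks for a -javaagent argument that names skywalking-agent.jar. */
final class AgentCheck {
    private static final String PREFIX = "-javaagent:";

    /** Returns the agent jar path from the first matching -javaagent argument, if any. */
    static Optional<String> findSkyWalkingAgent(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith(PREFIX) && arg.contains("skywalking-agent.jar")) {
                String value = arg.substring(PREFIX.length());
                // The JVM syntax is -javaagent:<jarpath>[=<options>]; drop the options part.
                int eq = value.indexOf('=');
                return Optional.of(eq >= 0 ? value.substring(0, eq) : value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Arguments passed to this JVM, excluding the arguments to main().
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println(findSkyWalkingAgent(jvmArgs)
            .map(path -> "SkyWalking agent argument found: " + path)
            .orElse("No -javaagent argument for skywalking-agent.jar found"));
    }
}
```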

   After these steps, call the project's endpoints directly and check whether the SkyWalking UI has collected the call information.

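The figure below shows the SkyWalking home page, which mainly displays global performance information.

    To verify that SkyWalking can discover the system topology (service dependencies), start four services whose endpoints are hello/start1, hello/start2, hello/start3, and hello/start4 respectively.

      The dependencies between the services are: start1 depends on start2, and start2 depends on start3 and start4.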
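
The dependency chain just described is a small directed graph, and a topology view is essentially a rendering of the caller-to-callee edges observed in traces. The sketch below rebuilds that graph by hand; `Topology` and its methods are hypothetical names for illustration and say nothing about how SkyWalking stores topology internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical dependency graph built from observed caller -> callee pairs. */
final class Topology {
    private final Map<String, Set<String>> edges = new LinkedHashMap<>();

    /** Records one observed call between two services. */
    void addCall(String caller, String callee) {
        edges.computeIfAbsent(caller, k -> new LinkedHashSet<>()).add(callee);
    }

    /** Services reached, directly or transitively, when `entry` is called (depth-first order). */
    List<String> reachableFrom(String entry) {
        List<String> order = new ArrayList<>();
        visit(entry, new HashSet<>(), order);
        return order;
    }

    private void visit(String node, Set<String> seen, List<String> order) {
        if (!seen.add(node)) {
            return; // already visited, which also guards against cycles
        }
        order.add(node);
        for (String next : edges.getOrDefault(node, Collections.<String>emptySet())) {
            visit(next, seen, order);
        }
    }
}
```

Feeding it the three edges start1 to start2, start2 to start3, and start2 to start4 and asking for everything reachable from start1 returns all four services.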

       After calling the start1 endpoint, SkyWalking displays the following topology for the project:

      

       The full-link performance trace page:

         

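     By default, SkyWalking classifies the calls it monitors into the layers DB(1), RPC_FRAMEWORK(2), HTTP(3), MQ(4), and CACHE(5). Custom plugins can also be written to monitor components that are not yet supported.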
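
The name(code) pairs above read naturally as an enum. The sketch below mirrors only the five values listed in this article; `Layer` and `fromCode` are hypothetical names for illustration, not the agent's own type.

```java
/** Hypothetical mirror of the layer codes listed above; not the agent's own enum. */
enum Layer {
    DB(1), RPC_FRAMEWORK(2), HTTP(3), MQ(4), CACHE(5);

    private final int code;

    Layer(int code) {
        this.code = code;
    }

    int code() {
        return code;
    }

    /** Looks a layer up by its numeric code, failing loudly on unknown codes. */
    static Layer fromCode(int code) {
        for (Layer layer : values()) {
            if (layer.code == code) {
                return layer;
            }
        }
        throw new IllegalArgumentException("Unknown layer code: " + code);
    }
}
```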

      Here is what calls to Dubbo and a database look like (service start2 calls a database and the Dubbo service of service 4):

 

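      3. Integrating SkyWalking's traceId with logging components (log4j, logback, ELK, and so on):

     Taking logback as an example: add the following to the logging XML configuration, and the traceId of the current context is added automatically to every log line. This assumes the apm-toolkit-logback-1.x toolkit dependency is on the application's classpath.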

        

    <appender name="console" class="ch.qos.logback.core.ConsoleAppender">
        <layout class="org.apache.skywalking.apm.toolkit.log.logback.v1.x.TraceIdPatternLogbackLayout">
             <pattern>
                 %d{yyyy-MM-dd HH:mm:ss} [%thread] %-5level %logger{36} - %tid - %msg%n
             </pattern>
        </layout>
    </appender>

         The result is shown below: every node in the call chain logs the same traceId. Once SkyWalking reveals a slow trace, you can take its traceId to your logging system and check whether any error logs were written.

       Log printed by service 1:

       2019-08-14 16:46:22 [http-nio-9091-exec-1] INFO  c.z.s.controller.HelloController - TID:47.34.15657723821280001 - service1 logger with traceId

       Log printed by service 2:

       2019-08-14 16:46:24 [http-nio-9092-exec-9] INFO  c.z.s.controller.HelloController - TID:47.34.15657723821280001 - service2 logger with traceId

       Log printed by service 3:

       2019-08-14 16:46:24 [http-nio-9093-exec-1] INFO  c.z.s.controller.HelloController - TID:47.34.15657723821280001 - service3 logger with traceId

       Log printed by service 4:

        2019-08-14 16:46:24 [http-nio-9094-exec-1] INFO  c.z.s.controller.HelloController - TID:47.34.15657723821280001 - service4 logger with traceId    
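
Correlating logs by traceId then comes down to pulling the `TID:` token out of each line. The following sketch does that with a regular expression over lines shaped like the samples above; `TraceIdGrep` and its methods are hypothetical helpers, and in practice a log platform such as ELK would do this filtering for you.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that extracts the "TID:<id>" token seen in the sample log lines. */
final class TraceIdGrep {
    // The trace id is everything after "TID:" up to the next whitespace character.
    private static final Pattern TID = Pattern.compile("TID:(\\S+)");

    /** Returns the trace id contained in a log line, if the line has one. */
    static Optional<String> traceIdOf(String logLine) {
        Matcher matcher = TID.matcher(logLine);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    /** Keeps only the lines that belong to the given trace. */
    static List<String> linesForTrace(List<String> lines, String traceId) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (traceIdOf(line).filter(traceId::equals).isPresent()) {
                hits.add(line);
            }
        }
        return hits;
    }
}
```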

   

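    4. Using SkyWalking's alarm module:

     The figure below shows the alarm page in the UI. Alarms can be monitored along three dimensions: service, service instance, and endpoint (interface).

        Alarm rules can be freely defined in the configuration file that ships with the package (apache-skywalking-apm-bin/config/alarm-settings.yml).

        By default, rules are configured for services and service instances but not for endpoints. As the comment in the configuration file explains: "Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm. Because the number of endpoint is much more than service and instance."

      

  The configuration below defines the alarm rules. SkyWalking also lets you configure alarm callback endpoints so that notifications (SMS, email, and so on) can be sent promptly; these are listed under webhooks in the configuration file.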

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Sample alarm rules.
rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_p90_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_p90
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: 90% response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
#  Because the number of endpoint is much more than service and instance.
#
  endpoint_avg_rule:
    metrics-name: endpoint_avg
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

#webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/
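
To make the interplay of `threshold`, `period`, `count`, and `silence-period` concrete, here is a minimal sketch of how such a rule could be evaluated once per minute. It is an interpretation of the rule comments and messages above, not SkyWalking's alarm implementation: the class `AlarmRuleSketch`, the one-value-per-minute assumption, and the restriction to the `>` and `<` operators are all assumptions of this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical evaluator for one rule; assumes one metric value arrives per minute. */
final class AlarmRuleSketch {
    private final boolean greaterThan; // true for op ">", false for op "<"
    private final double threshold;
    private final int period;        // size of the sliding window, in minutes
    private final int count;         // matches needed inside the window to fire
    private final int silencePeriod; // minutes to stay quiet after firing
    private final Deque<Boolean> window = new ArrayDeque<>();
    private int silenceLeft = 0;

    AlarmRuleSketch(String op, double threshold, int period, int count, int silencePeriod) {
        if (!">".equals(op) && !"<".equals(op)) {
            throw new IllegalArgumentException("This sketch only handles > and <, got: " + op);
        }
        this.greaterThan = ">".equals(op);
        this.threshold = threshold;
        this.period = period;
        this.count = count;
        this.silencePeriod = silencePeriod;
    }

    /** Feeds the metric value for one minute; returns true when the alarm should fire. */
    boolean onMinute(double value) {
        boolean match = greaterThan ? value > threshold : value < threshold;
        window.addLast(match);
        if (window.size() > period) {
            window.removeFirst(); // keep only the last `period` minutes
        }
        if (silenceLeft > 0) {
            silenceLeft--;        // still silenced after a previous alarm
            return false;
        }
        int matches = 0;
        for (boolean matched : window) {
            if (matched) {
                matches++;
            }
        }
        if (matches >= count) {
            silenceLeft = silencePeriod;
            return true;
        }
        return false;
    }
}
```

With the values of service_resp_time_rule (threshold 1000, period 10, count 3, silence-period 5), three slow minutes inside the window fire the alarm, and the following five minutes stay silent even if the service remains slow.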

 

 

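5. How SkyWalking works:

       The overall SkyWalking architecture has three parts:

  1.    skywalking-collector: the trace data collector. Data can be persisted to ElasticSearch. A standalone deployment can also persist to H2, but this is not recommended; H2 is meant only for temporary demos.
  2.    skywalking-web: the web visualization platform, which displays the persisted data
  3.    skywalking-agent: the probe, which collects data and sends it to the collector

The core of SkyWalking is the agent. The figure below shows in detail how agents behave during a single call that crosses several processes:

 

The agent supports many client and server frameworks. For the full plugin list, see https://github.com/apache/skywalking/blob/master/docs/en/setup/service-agent/java-agent/Supported-list.md

Taking the interception of Dubbo requests as an example, SkyWalking's Dubbo plugin is implemented as follows:

The plugin intercepts the invoke method of Dubbo's MonitorFilter class. As DubboInterceptor shows, on the consumer side it obtains Dubbo's RpcContext and, before the call is made, adds SkyWalking's cross-process protocol header (sw:traceId) to the attachments; the provider side then reads it back out.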

 

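package org.apache.skywalking.apm.plugin.dubbo;

import net.bytebuddy.description.method.MethodDescription;
import net.bytebuddy.matcher.ElementMatcher;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.ConstructorInterceptPoint;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.InstanceMethodsInterceptPoint;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.ClassInstanceMethodsEnhancePluginDefine;
import org.apache.skywalking.apm.agent.core.plugin.match.ClassMatch;
import org.apache.skywalking.apm.agent.core.plugin.match.NameMatch;

import static net.bytebuddy.matcher.ElementMatchers.named;
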
public class DubboInstrumentation extends ClassInstanceMethodsEnhancePluginDefine {

    private static final String ENHANCE_CLASS = "com.alibaba.dubbo.monitor.support.MonitorFilter";
    private static final String INTERCEPT_CLASS = "org.apache.skywalking.apm.plugin.dubbo.DubboInterceptor";

    @Override
    protected ClassMatch enhanceClass() {
        return NameMatch.byName(ENHANCE_CLASS);
    }

    @Override
    public ConstructorInterceptPoint[] getConstructorsInterceptPoints() {
        return null;
    }

    @Override
    public InstanceMethodsInterceptPoint[] getInstanceMethodsInterceptPoints() {
        return new InstanceMethodsInterceptPoint[] {
            new InstanceMethodsInterceptPoint() {
                @Override
                public ElementMatcher<MethodDescription> getMethodsMatcher() {
                    return named("invoke");
                }

                @Override
                public String getMethodsInterceptor() {
                    return INTERCEPT_CLASS;
                }

                @Override
                public boolean isOverrideArgs() {
                    return false;
                }
            }
        };
    }
}
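
The definition above only says which class and method to intercept and which interceptor to call. What the agent then does at runtime is wrap the matched method with before, after, and exception callbacks. The sketch below imitates that around-interception with a JDK dynamic proxy purely for illustration: the real agent rewrites bytecode rather than using proxies, and `AroundSketch`, `Greeter`, and `RecordingInterceptor` are hypothetical names.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

/** Illustration of around-interception using a JDK proxy (not how the agent enhances classes). */
final class AroundSketch {
    /** Hypothetical target interface with a method named like the one matched above. */
    interface Greeter {
        String invoke(String name);
    }

    /** Hypothetical interceptor that simply records the callbacks it receives. */
    static final class RecordingInterceptor {
        final List<String> events = new ArrayList<>();

        void before(Method method) {
            events.add("before " + method.getName());
        }

        void after(Method method, Object ret) {
            events.add("after " + method.getName() + " -> " + ret);
        }

        void onException(Method method, Throwable t) {
            events.add("error " + method.getName() + ": " + t.getMessage());
        }
    }

    /** Wraps target so that only the method called methodName goes through the interceptor. */
    static Greeter enhance(Greeter target, String methodName, RecordingInterceptor interceptor) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (!method.getName().equals(methodName)) {
                return method.invoke(target, args); // not matched: pass straight through
            }
            interceptor.before(method);
            try {
                Object ret = method.invoke(target, args);
                interceptor.after(method, ret);
                return ret;
            } catch (InvocationTargetException e) {
                interceptor.onException(method, e.getCause());
                throw e.getCause(); // rethrow what the target actually threw
            }
        };
        return (Greeter) Proxy.newProxyInstance(
            Greeter.class.getClassLoader(), new Class<?>[] {Greeter.class}, handler);
    }
}
```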

The Dubbo interceptor itself is implemented as follows:

package org.apache.skywalking.apm.plugin.dubbo;

import com.alibaba.dubbo.common.URL;
import com.alibaba.dubbo.rpc.Invocation;
import com.alibaba.dubbo.rpc.Invoker;
import com.alibaba.dubbo.rpc.Result;
import com.alibaba.dubbo.rpc.RpcContext;
import java.lang.reflect.Method;
import org.apache.skywalking.apm.agent.core.context.ContextCarrier;
import org.apache.skywalking.apm.agent.core.context.tag.Tags;
import org.apache.skywalking.apm.agent.core.context.CarrierItem;
import org.apache.skywalking.apm.agent.core.context.ContextManager;
import org.apache.skywalking.apm.agent.core.context.trace.AbstractSpan;
import org.apache.skywalking.apm.agent.core.context.trace.SpanLayer;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.EnhancedInstance;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.InstanceMethodsAroundInterceptor;
import org.apache.skywalking.apm.agent.core.plugin.interceptor.enhance.MethodInterceptResult;
import org.apache.skywalking.apm.network.trace.component.ComponentsDefine;

/**
 * {@link DubboInterceptor} define how to enhance class {@link com.alibaba.dubbo.monitor.support.MonitorFilter#invoke(Invoker,
 * Invocation)}. the trace context transport to the provider side by {@link RpcContext#attachments}.but all the version
 * of dubbo framework below 2.8.3 don't support {@link RpcContext#attachments}, we support another way to support it.
 *
 * @author zhangxin
 */
public class DubboInterceptor implements InstanceMethodsAroundInterceptor {
    /**
     * <h2>Consumer:</h2> The serialized trace context data will
     * inject to the {@link RpcContext#attachments} for transport to provider side.
     * <p>
     * <h2>Provider:</h2> The serialized trace context data will extract from
     * {@link RpcContext#attachments}. current trace segment will ref if the serialize context data is not null.
     */
    @Override
    public void beforeMethod(EnhancedInstance objInst, Method method, Object[] allArguments,
        Class<?>[] argumentsTypes, MethodInterceptResult result) throws Throwable {
        Invoker invoker = (Invoker)allArguments[0];
        Invocation invocation = (Invocation)allArguments[1];
        RpcContext rpcContext = RpcContext.getContext();
        boolean isConsumer = rpcContext.isConsumerSide();
        URL requestURL = invoker.getUrl();

        AbstractSpan span;

        final String host = requestURL.getHost();
        final int port = requestURL.getPort();
        if (isConsumer) {
            final ContextCarrier contextCarrier = new ContextCarrier();
            span = ContextManager.createExitSpan(generateOperationName(requestURL, invocation), contextCarrier, host + ":" + port);
            //invocation.getAttachments().put("contextData", contextDataStr);
            //@see https://github.com/alibaba/dubbo/blob/dubbo-2.5.3/dubbo-rpc/dubbo-rpc-api/src/main/java/com/alibaba/dubbo/rpc/RpcInvocation.java#L154-L161
            CarrierItem next = contextCarrier.items();
            while (next.hasNext()) {
                next = next.next();
                rpcContext.getAttachments().put(next.getHeadKey(), next.getHeadValue());
            }
        } else {
            ContextCarrier contextCarrier = new ContextCarrier();
            CarrierItem next = contextCarrier.items();
            while (next.hasNext()) {
                next = next.next();
                next.setHeadValue(rpcContext.getAttachment(next.getHeadKey()));
            }

            span = ContextManager.createEntrySpan(generateOperationName(requestURL, invocation), contextCarrier);
        }

        Tags.URL.set(span, generateRequestURL(requestURL, invocation));
        span.setComponent(ComponentsDefine.DUBBO);
        SpanLayer.asRPCFramework(span);
    }

    @Override
    public Object afterMethod(EnhancedInstance objInst, Method method, Object[] allArguments,
        Class<?>[] argumentsTypes, Object ret) throws Throwable {
        Result result = (Result)ret;
        if (result != null && result.getException() != null) {
            dealException(result.getException());
        }

        ContextManager.stopSpan();
        return ret;
    }

    @Override
    public void handleMethodException(EnhancedInstance objInst, Method method, Object[] allArguments,
        Class<?>[] argumentsTypes, Throwable t) {
        dealException(t);
    }

    /**
     * Log the throwable, which occurs in Dubbo RPC service.
     */
    private void dealException(Throwable throwable) {
        AbstractSpan span = ContextManager.activeSpan();
        span.errorOccurred();
        span.log(throwable);
    }

    /**
     * Format operation name. e.g. org.apache.skywalking.apm.plugin.test.Test.test(String)
     *
     * @return operation name.
     */
    private String generateOperationName(URL requestURL, Invocation invocation) {
        StringBuilder operationName = new StringBuilder();
        operationName.append(requestURL.getPath());
        operationName.append("." + invocation.getMethodName() + "(");
        for (Class<?> classes : invocation.getParameterTypes()) {
            operationName.append(classes.getSimpleName() + ",");
        }

        if (invocation.getParameterTypes().length > 0) {
            operationName.delete(operationName.length() - 1, operationName.length());
        }

        operationName.append(")");

        return operationName.toString();
    }

    /**
     * Format request url.
     * e.g. dubbo://127.0.0.1:20880/org.apache.skywalking.apm.plugin.test.Test.test(String).
     *
     * @return request url.
     */
    private String generateRequestURL(URL url, Invocation invocation) {
        StringBuilder requestURL = new StringBuilder();
        requestURL.append(url.getProtocol() + "://");
        requestURL.append(url.getHost());
        requestURL.append(":" + url.getPort() + "/");
        requestURL.append(generateOperationName(url, invocation));
        return requestURL.toString();
    }
}
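
Stripped of the SkyWalking and Dubbo types, the consumer and provider branches of beforeMethod perform a simple handoff: the consumer writes context into the RPC attachments, and the provider reads it back or starts a fresh trace. The sketch below shows only that handoff with a plain map standing in for the attachments; `CarrierSketch`, `HEADER_KEY`, and its value are hypothetical, and the real carrier holds more than a bare trace id.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical reduction of the inject/extract handoff to a single header. */
final class CarrierSketch {
    // Made-up header key for this example; the real key is defined by the agent's protocol.
    static final String HEADER_KEY = "x-demo-trace";

    /** Consumer side: write the current trace id into the outgoing attachments. */
    static void inject(String traceId, Map<String, String> attachments) {
        attachments.put(HEADER_KEY, traceId);
    }

    /** Provider side: reuse the caller's trace id, or create one when none was sent. */
    static String extractOrCreate(Map<String, String> attachments, Supplier<String> newId) {
        String id = attachments.get(HEADER_KEY);
        return (id == null || id.isEmpty()) ? newId.get() : id;
    }

    public static void main(String[] args) {
        Map<String, String> attachments = new HashMap<>();
        inject("trace-1", attachments);                                   // consumer process
        System.out.println(extractOrCreate(attachments, () -> "fresh"));  // provider process
    }
}
```

Because both sides end up with the same id, the spans they create can later be stitched into one trace.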

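 When the call completes, the span is stopped and its details are sent to the collector. This is implemented in the stopSpan(AbstractSpan span) method of the class org.apache.skywalking.apm.agent.core.context.TracingContext.

The implementation of stopSpan is shown below: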

@Override
    public boolean stopSpan(AbstractSpan span) {
        AbstractSpan lastSpan = peek();
        if (lastSpan == span) {
            if (lastSpan instanceof AbstractTracingSpan) {
                AbstractTracingSpan toFinishSpan = (AbstractTracingSpan)lastSpan;
                if (toFinishSpan.finish(segment)) {
                    pop();
                }
            } else {
                pop();
            }
        } else {
            throw new IllegalStateException("Stopping the unexpected span = " + span);
        }

        finish();

        return activeSpanStack.isEmpty();
    }

The actual sending logic lives in the finish method:

/**
     * Finish this context, and notify all {@link TracingContextListener}s, managed by {@link
     * TracingContext.ListenerManager}
     */
    private void finish() {
        if (isRunningInAsyncMode) {
            asyncFinishLock.lock();
        }
        try {
            if (activeSpanStack.isEmpty() && running && (!isRunningInAsyncMode || asyncSpanCounter.get() == 0)) {
                TraceSegment finishedSegment = segment.finish(isLimitMechanismWorking());
                /*
                 * Recheck the segment if the segment contains only one span.
                 * Because in the runtime, can't sure this segment is part of distributed trace.
                 *
                 * @see {@link #createSpan(String, long, boolean)}
                 */
                if (!segment.hasRef() && segment.isSingleSpanSegment()) {
                    if (!samplingService.trySampling()) {
                        finishedSegment.setIgnore(true);
                    }
                }

                /*
                 * Check that the segment is created after the agent (re-)registered to backend,
                 * otherwise the segment may be created when the agent is still rebooting and should
                 * be ignored
                 */
                if (segment.createTime() < RemoteDownstreamConfig.Agent.INSTANCE_REGISTERED_TIME) {
                    finishedSegment.setIgnore(true);
                }

                TracingContext.ListenerManager.notifyFinish(finishedSegment); // Notify the listeners of the tracing context; a listener sends the data to the collector.

                running = false;
            }
        } finally {
            if (isRunningInAsyncMode) {
                asyncFinishLock.unlock();
            }
        }
    }
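
Taken together, stopSpan and finish implement a stack discipline: spans are stopped innermost first, and the segment is handed to the listeners only once the stack is empty. The sketch below keeps just that idea, leaving out sampling, async mode, and the registration-time check; `ContextSketch` and its string-based spans are hypothetical simplifications.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical, simplified tracing context: a span stack plus one finish listener. */
final class ContextSketch {
    private final Deque<String> activeSpans = new ArrayDeque<>();
    private final List<String> finishedSpans = new ArrayList<>();
    private final Consumer<List<String>> listener;
    private boolean running = true;

    ContextSketch(Consumer<List<String>> listener) {
        this.listener = listener;
    }

    void createSpan(String operationName) {
        activeSpans.push(operationName);
    }

    /** Stops the innermost span; returns true once no span is active any more. */
    boolean stopSpan(String operationName) {
        if (!operationName.equals(activeSpans.peek())) {
            throw new IllegalStateException("Stopping the unexpected span = " + operationName);
        }
        finishedSpans.add(activeSpans.pop());
        if (activeSpans.isEmpty() && running) {
            // The whole segment is done: hand it over (a real listener would report it).
            listener.accept(new ArrayList<>(finishedSpans));
            running = false;
        }
        return activeSpans.isEmpty();
    }
}
```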

 6. Limitations of SkyWalking:

Just effect frameworks or libraries. 
Because of the changing codes by agents, it also means the codes are already known by agent plugin developers.
So, there is always a supported list in this kind of probes. Like SkyWalking Java agent supported list. Across thread can't be supported all the time.
Like we said about in process propagation, most codes run in a single thread per request, especially business codes.
But in some other scenarios, they do things in different threads, such as job assignment, task pool or batch process.
Or some languages provide coroutine or similar thing like Goroutine, then developer could run async process with low payload, even been encouraged. In those cases, auto instrument will face problems.

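1. Only known frameworks and libraries are supported. If the middleware you use is not yet supported, you have to write a plugin yourself.

2. Automatic instrumentation does not cover cross-thread scenarios such as job assignment, task pools, and batch processing.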
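
The second limitation follows from context being bound to a thread: work handed to a pool runs on a thread that never saw the caller's context. The sketch below reproduces the problem and one common workaround, capturing the context at submission time and restoring it in the worker. It uses a plain ThreadLocal as a stand-in; `CrossThreadSketch`, `TRACE_ID`, and `wrap` are hypothetical, and for real applications you should consult the agent's toolkit documentation for its own cross-thread helpers.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical demonstration of losing and manually propagating per-thread trace context. */
final class CrossThreadSketch {
    // Stand-in for the tracing context, which is bound to the current thread.
    static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    /** Captures the caller's trace id now and restores it while the task runs. */
    static <T> Callable<T> wrap(Callable<T> task) {
        final String captured = TRACE_ID.get();
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(captured);
            try {
                return task.call();
            } finally {
                if (previous == null) {
                    TRACE_ID.remove();
                } else {
                    TRACE_ID.set(previous);
                }
            }
        };
    }

    /** Returns {trace id seen by a plain task, trace id seen by a wrapped task}. */
    static String[] demo() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            TRACE_ID.set("trace-1");
            Callable<String> read = () -> TRACE_ID.get();
            String plain = pool.submit(read).get();         // worker thread has no context
            String wrapped = pool.submit(wrap(read)).get(); // context restored by wrap()
            return new String[] {plain, wrapped};
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            TRACE_ID.remove();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        String[] result = demo();
        System.out.println(result[0] + " / " + result[1]); // prints "null / trace-1"
    }
}
```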

 

 

 

posted on 2019-08-13 18:28  swave