Background

A statistics and analysis job went live in production. It aggregates the CDR (call detail record) data stored in ES by city and account, and extracts, for each city, the list of the top 100,000 users by traffic together with their traffic volumes and other related fields. The query is as follows:

{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "days": "2025-01-19"
          }
        },
        {
          "term": {
            "bigcitycode": "1"
          }
        }
      ]
    }
  },
  "track_total_hits": true,
  "size": 0,
  "aggs": {
    "top_data": {
      "multi_terms": {
        "terms": [
          {
            "field": "loginname"
          },
          {
            "field": "class"
          },
          {
            "field": "nasipaddress"
          },
          {
            "field": "ipattribute"
          }
        ],
        "size": "100000",
        "order": {
          "upFlow": "desc"
        }
      },
      "aggs": {
        "upFlow": {
          "sum": {
            "field": "acctinputoctets"
          }
        },
        "downFlow": {
          "sum": {
            "field": "acctoutputoctets"
          }
        },
        "sessiontimes": {
          "sum": {
            "field": "sessiontime"
          }
        }
      }
    }
  }
}

Problem

After running for a while, the job started failing: executing it tripped the ES circuit breaker, and other aggregation queries then tripped the breaker as well, disrupting the business. As a temporary workaround, the ES nodes were restarted.

The application-side error is as follows:

======2024-07-31 11:06:12.691 [RuoyiScheduler_Worker-1] ERROR com.ruoyi.quartz.util.AbstractQuartzJob - [execute,50] - 任务执行异常  - :

java.lang.reflect.InvocationTargetException: null
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.ruoyi.quartz.util.JobInvokeUtil.invokeMethod(JobInvokeUtil.java:61)
	at com.ruoyi.quartz.util.JobInvokeUtil.invokeMethod(JobInvokeUtil.java:33)
	at com.ruoyi.quartz.util.QuartzDisallowConcurrentExecution.doExecute(QuartzDisallowConcurrentExecution.java:19)
	at com.ruoyi.quartz.util.AbstractQuartzJob.execute(AbstractQuartzJob.java:44)
	at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: org.elasticsearch.client.ResponseException: method [GET], host [http://192.168.100.100:9092], URI [/dailylog/_search?pretty], status line [HTTP/1.1 500 Internal Server Error]
{
  "error" : {
    "root_cause" : [
      {
        "type" : "task_cancelled_exception",
        "reason" : "cancelled"
      },
      {
        "type" : "task_cancelled_exception",
        "reason" : "The parent task was cancelled, shouldn't start any child tasks"
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "",
    "phase" : "fetch",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : ".ds-dailylog-2024.07.30-000642",
        "node" : "NNrViOwRQ868_LJEodD5bg",
        "reason" : {
          "type" : "task_cancelled_exception",
          "reason" : "cancelled"
        }
      },
      {
        "shard" : 1,
        "index" : ".ds-dailylog-2024.07.30-000642",
        "node" : "uwaJwxb9RoS1uCKjmubGtQ",
        "reason" : {
          "type" : "transport_exception",
          "reason" : "failure to send",
          "caused_by" : {
            "type" : "task_cancelled_exception",
            "reason" : "The parent task was cancelled, shouldn't start any child tasks"
          }
        }
      }
    ],
    "caused_by" : {
      "type" : "circuit_breaking_exception",
      "reason" : "[request] Data too large, data for [<reduce_aggs>] would be [9021524103/8.4gb], which is larger than the limit of [9019431321/8.3gb]",
      "bytes_wanted" : 9021524103,
      "bytes_limit" : 9019431321,
      "durability" : "TRANSIENT"
    }
  },
  "status" : 500
}

	at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:326)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:296)
	at org.elasticsearch.client.RestClient.performRequest(RestClient.java:270)
	at com.ruoyi.quartz.task.analysisTask.topFlowAnalyze(analysisTask.java:242)
	at com.ruoyi.quartz.task.analysisTask$$FastClassBySpringCGLIB$$ae8c1db7.invoke(<generated>)
	at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:771)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749)
	at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:88)
	at com.ruoyi.framework.aspectj.DataSourceAspect.around(DataSourceAspect.java:49)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:644)
	at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:633)
	at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:70)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749)
	at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:95)
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
	at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749)
	at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:691)
	at com.ruoyi.quartz.task.analysisTask$$EnhancerBySpringCGLIB$$fc8fd480.topFlowAnalyze(<generated>)
	... 10 common frames omitted

The ES-side error log is as follows:

[2024-07-31T11:06:08,908][WARN ][o.e.i.b.request          ] [es-node-1] [request] New used memory 9046598667 [8.4gb] for data of [<reused_arrays>] would be larger than configured breaker: 9019431321 [8.3gb], breaking
[2024-07-31T11:06:09,010][WARN ][o.e.i.b.request          ] [es-node-1] [request] New used memory 9021524103 [8.4gb] for data of [<reduce_aggs>] would be larger than configured breaker: 9019431321 [8.3gb], breaking
[2024-07-31T11:06:12,656][WARN ][r.suppressed             ] [es-node-1] path: /dailylog/_search, params: {pretty=, index=dailylog}
org.elasticsearch.action.search.SearchPhaseExecutionException: 
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:661) [elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.FetchSearchPhase$1.onFailure(FetchSearchPhase.java:89) [elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28) [elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33) [elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.13.0.jar:7.13.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [request] Data too large, data for [<reduce_aggs>] would be [9021524103/8.4gb], which is larger than the limit of [9019431321/8.3gb]
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.circuitBreak(ChildMemoryCircuitBreaker.java:65) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.limit(ChildMemoryCircuitBreaker.java:137) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:92) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.QueryPhaseResultConsumer$PendingMerges.addEstimateAndMaybeBreak(QueryPhaseResultConsumer.java:272) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.QueryPhaseResultConsumer$PendingMerges.consume(QueryPhaseResultConsumer.java:311) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.QueryPhaseResultConsumer.consumeResult(QueryPhaseResultConsumer.java:110) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardResult(AbstractSearchAsyncAction.java:551) [elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.onShardResult(SearchQueryThenFetchAsyncAction.java:99) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.innerOnResponse(AbstractSearchAsyncAction.java:305) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:34) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:18) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:56) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:25) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:43) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleResponse(SearchTransportService.java:391) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.TransportService$5.handleResponse(TransportService.java:732) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1273) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:291) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.InboundHandler.handleResponse(InboundHandler.java:275) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:128) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:84) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:693) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:129) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:104) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:69) ~[elasticsearch-7.13.0.jar:7.13.0]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:63) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1518) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1267) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1314) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]

Analysis and Handling

Judging from the errors above, the statistics run needed more memory than was available, tripping the circuit breaker and failing the query. This is plausible on its face: the multi_terms aggregation builds one bucket per unique combination of the four fields, and ordering by the upFlow sub-aggregation forces all of those buckets to be materialized and merged in memory during the reduce phase.

First Attempt

Measure: split the statistics query into smaller requests, changing the original single query into per-city, batched queries to reduce the data volume of each request.

Result: after the update went live, the circuit breaker errors reappeared after a period of operation.

Second Attempt

Measure: add server memory and increase the heap configured for the ES process.

Result: about two weeks after the expansion, the circuit breaker errors recurred — only a few days later than before.
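
For reference, whether an enlarged heap actually took effect can be confirmed with the cat nodes API (heap.current, heap.max and heap.percent are standard _cat/nodes column aliases):

GET /_cat/nodes?v&h=name,heap.current,heap.max,heap.percent

The usual guidance is to keep the heap at or below roughly 50% of physical RAM, and under the compressed-oops threshold of about 31 GB per node.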

Root Cause Analysis

Since neither of the first two attempts resolved the problem, a root cause analysis was undertaken. Checking with the following request showed that cache usage on every ES node was small, with no upward trend.

GET /_cat/nodes?v&h=name,queryCacheMemory,fielddataMemory,requestCacheMemory

name        queryCacheMemory fielddataMemory requestCacheMemory
es-node-1            2.3mb              0b             10.9kb
es-node-2          991.3mb         233.1mb            241.8mb
es-node-3          557.3mb         155.6mb              181mb

During the analysis, we noticed that each circuit breaker exposes an estimated-usage metric, which can be inspected with the following request:

GET /_nodes/stats/breaker?human

"breakers" : {
        "request" : {
          "limit_size_in_bytes" : 15461882265,
          "limit_size" : "14.3gb",
          "estimated_size_in_bytes" : 6426516720,
          "estimated_size" : "5.9gb",
          "overhead" : 1.0,
          "tripped" : 0
        },
        "fielddata" : {
          "limit_size_in_bytes" : 10307921510,
          "limit_size" : "9.5gb",
          "estimated_size_in_bytes" : 244509384,
          "estimated_size" : "233.1mb",
          "overhead" : 1.03,
          "tripped" : 0
        },
        "in_flight_requests" : {
          "limit_size_in_bytes" : 25769803776,
          "limit_size" : "24gb",
          "estimated_size_in_bytes" : 68322,
          "estimated_size" : "66.7kb",
          "overhead" : 2.0,
          "tripped" : 0
        },
        "model_inference" : {
          "limit_size_in_bytes" : 12884901888,
          "limit_size" : "12gb",
          "estimated_size_in_bytes" : 0,
          "estimated_size" : "0b",
          "overhead" : 1.0,
          "tripped" : 0
        },
        "accounting" : {
          "limit_size_in_bytes" : 25769803776,
          "limit_size" : "24gb",
          "estimated_size_in_bytes" : 3447342,
          "estimated_size" : "3.2mb",
          "overhead" : 1.0,
          "tripped" : 0
        },
        "parent" : {
          "limit_size_in_bytes" : 24481313587,
          "limit_size" : "22.7gb",
          "estimated_size_in_bytes" : 14780020624,
          "estimated_size" : "13.7gb",
          "overhead" : 1.0,
          "tripped" : 0
        }
      }

Here estimated_size is the estimated memory usage of each breaker type. There is normally some headroom between it and the configured limit; according to the available material, the breaker may trip once the estimate gets close to the limit, so monitoring was added for this value.

Subsequent continuous monitoring showed that the request breaker's estimated memory rose after every execution of the statistics job and never fell back, eventually triggering the breaker errors.
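
For anyone reproducing this kind of monitoring, a lightweight polling request is sketched below; filter_path is a standard response-filtering parameter and trims the output to just the request breaker on each node (collection interval and storage are left to the monitoring system):

GET /_nodes/stats/breaker?human&filter_path=nodes.*.name,nodes.*.breakers.request

Charting estimated_size_in_bytes from this output over time makes the pattern described above visible: the estimate climbs a step after each run without falling back.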

Solution

At this point it was established that the statistics job drives up the estimated memory of the ES request breaker and causes the errors, though it remained unclear whether some ES mechanism or a bug keeps the estimate from dropping back after the task completes.

Option 1

  • Measure: use composite paged queries to pull the complete data for a given city out of ES, then do the full sort locally to extract the TOP entries (see the sketch after this list).
  • Pros: reduces the data volume of each individual query, lowering memory pressure on ES.
  • Cons: a large amount of data must be processed locally, which may increase local compute load.
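
A minimal sketch of the composite variant, assuming the same index, filter, and fields as the original query (the aggregation name all_users and the page size of 10,000 are illustrative). Note that a composite aggregation cannot be ordered by a sub-aggregation, which is exactly why the TOP-N sort has to happen locally:

{
  "query": {
    "bool": {
      "must": [
        { "term": { "days": "2025-01-19" } },
        { "term": { "bigcitycode": "1" } }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "all_users": {
      "composite": {
        "size": 10000,
        "sources": [
          { "loginname": { "terms": { "field": "loginname" } } },
          { "class": { "terms": { "field": "class" } } },
          { "nasipaddress": { "terms": { "field": "nasipaddress" } } },
          { "ipattribute": { "terms": { "field": "ipattribute" } } }
        ]
      },
      "aggs": {
        "upFlow": { "sum": { "field": "acctinputoctets" } },
        "downFlow": { "sum": { "field": "acctoutputoctets" } },
        "sessiontimes": { "sum": { "field": "sessiontime" } }
      }
    }
  }
}

Each response carries an after_key; passing it back as "after" in the next request fetches the next page, and paging ends once a response returns fewer than size buckets. All pages are then merged and sorted by upFlow locally to take the top 100,000.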

Option 2

  • Measure: replace the multi_terms aggregation with a single terms aggregation plus sub-aggregations (a top_hits sub-aggregation carries the extra fields), as follows:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "days": "2024-11-04"
          }
        },
        {
          "term": {
            "bigcitycode": "1"
          }
        }
      ]
    }
  },
  "track_total_hits": false,
  "size": 0,
  "aggs": {
    "top_data": {
      "terms": {
        "field": "loginname",
        "size": 100000,
        "order": {
          "upFlow": "desc"
        }
      },
      "aggs": {
        "upFlow": {
          "sum": {
            "field": "acctinputoctets"
          }
        },
        "downFlow": {
          "sum": {
            "field": "acctoutputoctets"
          }
        },
        "sessiontimes": {
          "sum": {
            "field": "sessiontime"
          }
        },
        "basip":{
          "top_hits": {
            "size": 1,
            "_source": ["nasipaddress"]
          }
        }
      }
    }
  }
}
  • Pros: although 100,000 entries are still pulled in a single shot, the estimated memory value falls back to a normal level once execution completes.
  • Result: after adopting this option, the problem was effectively resolved.

Further Thoughts

When a production scenario calls for aggregating a large volume of data but does not require a global sort, composite paged queries should still be the first choice.
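
To make the paging in Option 1 concrete, below is a hypothetical follow-up request for the second page; the values under "after" are placeholders standing in for the after_key returned by the previous response:

{
  "query": {
    "bool": {
      "must": [
        { "term": { "days": "2025-01-19" } },
        { "term": { "bigcitycode": "1" } }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "all_users": {
      "composite": {
        "size": 10000,
        "after": {
          "loginname": "placeholder-user",
          "class": "placeholder-class",
          "nasipaddress": "192.0.2.10",
          "ipattribute": "placeholder-attr"
        },
        "sources": [
          { "loginname": { "terms": { "field": "loginname" } } },
          { "class": { "terms": { "field": "class" } } },
          { "nasipaddress": { "terms": { "field": "nasipaddress" } } },
          { "ipattribute": { "terms": { "field": "ipattribute" } } }
        ]
      },
      "aggs": {
        "upFlow": { "sum": { "field": "acctinputoctets" } },
        "downFlow": { "sum": { "field": "acctoutputoctets" } },
        "sessiontimes": { "sum": { "field": "sessiontime" } }
      }
    }
  }
}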
