Background
A statistical analysis job went live in production. It aggregates CDR (call detail record) data stored in ES by city and account, and extracts, for each city, the list of the top 100,000 users by traffic volume along with their traffic and other related fields. The query is as follows:
{
"query": {
"bool": {
"must": [
{
"term": {
"days": "2025-01-19"
}
},
{
"term": {
"bigcitycode": "1"
}
}
]
}
},
"track_total_hits": true,
"size": 0,
"aggs": {
"top_data": {
"multi_terms": {
"terms": [
{
"field": "loginname"
},
{
"field": "class"
},
{
"field": "nasipaddress"
},
{
"field": "ipattribute"
}
],
"size": "100000",
"order": {
"upFlow": "desc"
}
},
"aggs": {
"upFlow": {
"sum": {
"field": "acctinputoctets"
}
},
"downFlow": {
"sum": {
"field": "acctoutputoctets"
}
},
"sessiontimes": {
"sum": {
"field": "sessiontime"
}
}
}
}
}
}
Problem
After running for a while, the statistics task began failing with errors that tripped the ES circuit breaker. Other aggregation queries also triggered circuit breaker errors, affecting business operations. As a temporary workaround, the ES nodes were restarted.
The application-side error is as follows:
======2024-07-31 11:06:12.691 [RuoyiScheduler_Worker-1] ERROR com.ruoyi.quartz.util.AbstractQuartzJob - [execute,50] - 任务执行异常 - :
java.lang.reflect.InvocationTargetException: null
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.ruoyi.quartz.util.JobInvokeUtil.invokeMethod(JobInvokeUtil.java:61)
at com.ruoyi.quartz.util.JobInvokeUtil.invokeMethod(JobInvokeUtil.java:33)
at com.ruoyi.quartz.util.QuartzDisallowConcurrentExecution.doExecute(QuartzDisallowConcurrentExecution.java:19)
at com.ruoyi.quartz.util.AbstractQuartzJob.execute(AbstractQuartzJob.java:44)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: org.elasticsearch.client.ResponseException: method [GET], host [http://192.168.100.100:9092], URI [/dailylog/_search?pretty], status line [HTTP/1.1 500 Internal Server Error]
{
"error" : {
"root_cause" : [
{
"type" : "task_cancelled_exception",
"reason" : "cancelled"
},
{
"type" : "task_cancelled_exception",
"reason" : "The parent task was cancelled, shouldn't start any child tasks"
}
],
"type" : "search_phase_execution_exception",
"reason" : "",
"phase" : "fetch",
"grouped" : true,
"failed_shards" : [
{
"shard" : 0,
"index" : ".ds-dailylog-2024.07.30-000642",
"node" : "NNrViOwRQ868_LJEodD5bg",
"reason" : {
"type" : "task_cancelled_exception",
"reason" : "cancelled"
}
},
{
"shard" : 1,
"index" : ".ds-dailylog-2024.07.30-000642",
"node" : "uwaJwxb9RoS1uCKjmubGtQ",
"reason" : {
"type" : "transport_exception",
"reason" : "failure to send",
"caused_by" : {
"type" : "task_cancelled_exception",
"reason" : "The parent task was cancelled, shouldn't start any child tasks"
}
}
}
],
"caused_by" : {
"type" : "circuit_breaking_exception",
"reason" : "[request] Data too large, data for [<reduce_aggs>] would be [9021524103/8.4gb], which is larger than the limit of [9019431321/8.3gb]",
"bytes_wanted" : 9021524103,
"bytes_limit" : 9019431321,
"durability" : "TRANSIENT"
}
},
"status" : 500
}
at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:326)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:296)
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:270)
at com.ruoyi.quartz.task.analysisTask.topFlowAnalyze(analysisTask.java:242)
at com.ruoyi.quartz.task.analysisTask$$FastClassBySpringCGLIB$$ae8c1db7.invoke(<generated>)
at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:771)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749)
at org.springframework.aop.aspectj.MethodInvocationProceedingJoinPoint.proceed(MethodInvocationProceedingJoinPoint.java:88)
at com.ruoyi.framework.aspectj.DataSourceAspect.around(DataSourceAspect.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethodWithGivenArgs(AbstractAspectJAdvice.java:644)
at org.springframework.aop.aspectj.AbstractAspectJAdvice.invokeAdviceMethod(AbstractAspectJAdvice.java:633)
at org.springframework.aop.aspectj.AspectJAroundAdvice.invoke(AspectJAroundAdvice.java:70)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:95)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:749)
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:691)
at com.ruoyi.quartz.task.analysisTask$$EnhancerBySpringCGLIB$$fc8fd480.topFlowAnalyze(<generated>)
... 10 common frames omitted
The ES-side error log is as follows:
[2024-07-31T11:06:08,908][WARN ][o.e.i.b.request ] [es-node-1] [request] New used memory 9046598667 [8.4gb] for data of [<reused_arrays>] would be larger than configured breaker: 9019431321 [8.3gb], breaking
[2024-07-31T11:06:09,010][WARN ][o.e.i.b.request ] [es-node-1] [request] New used memory 9021524103 [8.4gb] for data of [<reduce_aggs>] would be larger than configured breaker: 9019431321 [8.3gb], breaking
[2024-07-31T11:06:12,656][WARN ][r.suppressed ] [es-node-1] path: /dailylog/_search, params: {pretty=, index=dailylog}
org.elasticsearch.action.search.SearchPhaseExecutionException:
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:661) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.search.FetchSearchPhase$1.onFailure(FetchSearchPhase.java:89) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.13.0.jar:7.13.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:831) [?:?]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [request] Data too large, data for [<reduce_aggs>] would be [9021524103/8.4gb], which is larger than the limit of [9019431321/8.3gb]
at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.circuitBreak(ChildMemoryCircuitBreaker.java:65) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.limit(ChildMemoryCircuitBreaker.java:137) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:92) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.search.QueryPhaseResultConsumer$PendingMerges.addEstimateAndMaybeBreak(QueryPhaseResultConsumer.java:272) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.search.QueryPhaseResultConsumer$PendingMerges.consume(QueryPhaseResultConsumer.java:311) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.search.QueryPhaseResultConsumer.consumeResult(QueryPhaseResultConsumer.java:110) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardResult(AbstractSearchAsyncAction.java:551) [elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.onShardResult(SearchQueryThenFetchAsyncAction.java:99) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.innerOnResponse(AbstractSearchAsyncAction.java:305) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:34) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:18) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:56) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.search.SearchExecutionStatsCollector.onResponse(SearchExecutionStatsCollector.java:25) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:43) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleResponse(SearchTransportService.java:391) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.TransportService$5.handleResponse(TransportService.java:732) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1273) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:291) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.InboundHandler.handleResponse(InboundHandler.java:275) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:128) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:84) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:693) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:129) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:104) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:69) ~[elasticsearch-7.13.0.jar:7.13.0]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:63) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1518) ~[?:?]
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1267) ~[?:?]
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1314) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440) ~[?:?]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
Analysis and Handling
Judging from the errors above, it appears that the statistics job requires more memory than is available, which trips the circuit breaker and causes the query to fail.
First Attempt
Measure: split the statistics query into smaller requests, turning the original single query into per-city, batched queries so that each request handles less data.
Result: after the update went live, the circuit breaker errors reappeared after running for a while.
Second Attempt
Measure: add more server memory and increase the ES process memory configuration (see the sketch below).
Result: about two weeks after the capacity increase, the circuit breaker errors occurred again; compared with before, the failure was merely delayed by a few days.
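For reference, this kind of memory increase typically means raising the JVM heap via -Xms/-Xmx in jvm.options and, optionally, adjusting the request breaker ceiling, which is a dynamic cluster setting. A minimal sketch (the 65% value is illustrative; the ES default is 60% of the heap, and raising it only postpones the trip point rather than fixing the root cause):
PUT /_cluster/settings
{
  "transient": {
    "indices.breaker.request.limit": "65%"
  }
}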
Root Cause Analysis
Since neither of the first two attempts effectively solved the problem, a root cause analysis was carried out. The following request shows that cache usage on the ES nodes was not large and showed no upward trend.
GET /_cat/nodes?v&h=name,queryCacheMemory,fielddataMemory,requestCacheMemory,requestCacheHitC
name queryCacheMemory fielddataMemory requestCacheMemory
es-node-1 2.3mb 0b 10.9kb
es-node-2 991.3mb 233.1mb 241.8mb
es-node-3 557.3mb 155.6mb 181mb
During the analysis, it was found that each circuit breaker exposes an estimated-usage metric, which can be viewed with the following request:
GET /_nodes/stats/breaker?human
"breakers" : {
"request" : {
"limit_size_in_bytes" : 15461882265,
"limit_size" : "14.3gb",
"estimated_size_in_bytes" : 6426516720,
"estimated_size" : "5.9gb",
"overhead" : 1.0,
"tripped" : 0
},
"fielddata" : {
"limit_size_in_bytes" : 10307921510,
"limit_size" : "9.5gb",
"estimated_size_in_bytes" : 244509384,
"estimated_size" : "233.1mb",
"overhead" : 1.03,
"tripped" : 0
},
"in_flight_requests" : {
"limit_size_in_bytes" : 25769803776,
"limit_size" : "24gb",
"estimated_size_in_bytes" : 68322,
"estimated_size" : "66.7kb",
"overhead" : 2.0,
"tripped" : 0
},
"model_inference" : {
"limit_size_in_bytes" : 12884901888,
"limit_size" : "12gb",
"estimated_size_in_bytes" : 0,
"estimated_size" : "0b",
"overhead" : 1.0,
"tripped" : 0
},
"accounting" : {
"limit_size_in_bytes" : 25769803776,
"limit_size" : "24gb",
"estimated_size_in_bytes" : 3447342,
"estimated_size" : "3.2mb",
"overhead" : 1.0,
"tripped" : 0
},
"parent" : {
"limit_size_in_bytes" : 24481313587,
"limit_size" : "22.7gb",
"estimated_size_in_bytes" : 14780020624,
"estimated_size" : "13.7gb",
"overhead" : 1.0,
"tripped" : 0
}
}
Here, estimated_size is the estimated usage of each type of circuit breaker. There is normally a gap between this value and the configured limit; according to the documentation, the breaker may trip once the estimate gets close to the limit, so monitoring of this value was added.
Continuous monitoring then showed that the estimated memory of the request breaker rose every time the statistics job ran and never fell back, eventually tripping the breaker.
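A lightweight way to collect only this metric is to narrow the node stats response with filter_path; a sketch using standard node-stats fields:
GET /_nodes/stats/breaker?human&filter_path=nodes.*.name,nodes.*.breakers.request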
Solution
It is now confirmed that the errors are caused by the statistics job driving up the estimated memory of the ES request circuit breaker, though it is unclear whether some ES mechanism or a bug is responsible for the estimate not dropping back after the task completes.
Option 1
- Measure: use composite aggregation with paginated queries to pull the complete data for a city out of ES, then sort the full set locally and take the TOP entries (see the sketch after this list).
- Advantage: reduces the amount of data per query and lowers ES memory pressure.
- Drawback: a large amount of data has to be processed locally, which may increase local compute load.
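A minimal sketch of what such a composite paging query might look like, reusing the field names from the original query (the page size of 10000 is illustrative). Composite aggregations cannot be ordered by a metric sub-aggregation, so pages come back in key order; the TOP 100,000 ranking by upFlow is computed on the client, and each subsequent page is requested by passing the previous response's after_key back in an "after" clause:
GET /dailylog/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "term": { "days": "2025-01-19" } },
        { "term": { "bigcitycode": "1" } }
      ]
    }
  },
  "aggs": {
    "by_user": {
      "composite": {
        "size": 10000,
        "sources": [
          { "loginname": { "terms": { "field": "loginname" } } }
        ]
      },
      "aggs": {
        "upFlow": { "sum": { "field": "acctinputoctets" } },
        "downFlow": { "sum": { "field": "acctoutputoctets" } },
        "sessiontimes": { "sum": { "field": "sessiontime" } }
      }
    }
  }
}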
Option 2
- Measure: change the multi_terms aggregation to a single terms aggregation with nested sub-aggregations, as follows:
{
"query": {
"bool": {
"must": [
{
"term": {
"days": "2024-11-04"
}
},
{
"term": {
"bigcitycode": "1"
}
}
]
}
},
"track_total_hits": false,
"size": 0,
"aggs": {
"top_data": {
"terms": {
"field": "loginname",
"size": 100000,
"order": {
"upFlow": "desc"
}
},
"aggs": {
"upFlow": {
"sum": {
"field": "acctinputoctets"
}
},
"downFlow": {
"sum": {
"field": "acctoutputoctets"
}
},
"sessiontimes": {
"sum": {
"field": "sessiontime"
}
},
"basip":{
"top_hits": {
"size": 1,
"_source": ["nasipaddress"]
}
}
}
}
}
}
- Advantage: although 100,000 entries are still fetched in one request, the breaker's estimated memory returns to a normal level after the query completes.
- Result: after adopting this option, the problem was effectively resolved.
Further Thoughts
In production, for scenarios that require aggregating large volumes of data but do not need a global sort, composite paginated queries should still be the preferred approach.