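ELK cluster optimization: approach and draft plan

ELK background and problems

Elasticsearch cluster:
Total documents: 1,240,218,602
Number of indices: 369 (cleaned up periodically by a script)
Cluster nodes: 3 master nodes, 3 data nodes

Largest indices: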
biz-info-accs-yyyy.mm.dd

Total: 4.5 GB
Primaries: 2.3 GB
Documents: 7.7m
Total Shards: 10
Unassigned Shards: 0

biz-info-credit-yyyy.mm.dd
Total: 5.6 GB
Primaries: 2.8 GB
Documents: 14.9m
Total Shards: 10
Unassigned Shards: 0

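Optimizations already applied to the indices:
1. Replicas.
2. Data and master nodes are separated: 3 data nodes and 3 master nodes.
3. Indices are created per day, which lightens the load on any single node and keeps each index smaller than the memory of one server (16 GB).
4. Data is written continuously, and business queries use wildcards such as biz-info-accs-* and biz-info-credit-*. Spreading the data across time-based indices is easy to maintain and lets it grow without any single index becoming large enough to degrade performance.
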
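The daily naming scheme also turns the periodic cleanup into plain date arithmetic. The following is a minimal sketch, assuming index names end in yyyy.mm.dd as shown above. The function names and the 30-day retention in the usage lines are hypothetical, because the retention used by the actual cleanup script is not stated.

```python
from datetime import date, datetime, timedelta


def index_day(index_name):
    """Return the date encoded in a trailing yyyy.mm.dd suffix, or None."""
    suffix = index_name.rsplit("-", 1)[-1]
    try:
        return datetime.strptime(suffix, "%Y.%m.%d").date()
    except ValueError:
        # Not a daily index (for example .kibana), so never select it.
        return None


def expired_indices(index_names, today, keep_days):
    """Select daily indices whose date is older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in index_names:
        day = index_day(name)
        if day is not None and day < cutoff:
            expired.append(name)
    return expired


# Usage with a hypothetical 30-day retention.
names = ["biz-info-accs-2019.10.01", "biz-info-credit-2019.11.19", ".kibana"]
to_delete = expired_indices(names, date(2019, 11, 20), keep_days=30)
```

Only the names in to_delete would then be passed to the index delete call; indices without a date suffix are left alone.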
Abnormal issues:
1) A log field stored as keyword exceeds the maximum UTF-8 term length (32766 bytes), so indexing the event fails:
[elk@xd-credit-app02 config]$ [2019-11-19T18:02:24,395][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>nil, :_index=>"biz-info-xxx-2019.11.19", :_type=>"doc", :_routing=>nil}, #<LogStash::Event:0x2c04e0a6>], :response=>{"index"=>{"_index"=>"biz-info-xxx-2019.11.19", "_type"=>"doc", "_id"=>"YtMcg24Bz3_C8YTY-MD4", "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"Document contains at least one immense term in field=\"threadid\" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[104, 116, 116, 112, 45, 110, 105, 111, 45, 55, 48, 48, 49, 45, 101, 120, 101, 99, 45, 49, 49, 93, 67, 114, 101, 100, 105, 116, 82, 101]...', original message: bytes can be at most 32766 in length; got 35024", "caused_by"=>{"type"=>"max_bytes_length_exceeded_exception", "reason"=>"max_bytes_length_exceeded_exception: bytes can be at most 32766 in length; got 35024"}
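One common way to avoid this rejection is the ignore_above parameter on the keyword field: values longer than the limit are skipped for indexing while the document itself is still accepted. The sketch below is an assumption about how this could be applied here, not the fix the cluster actually uses. ignore_above counts characters while the 32766 limit counts bytes, so the sketch divides by 4, the maximum size of one UTF-8 character. The helper names are made up for illustration.

```python
MAX_TERM_BYTES = 32766          # Lucene term limit quoted in the error above
MAX_UTF8_BYTES_PER_CHAR = 4     # worst case for a single UTF-8 character


def safe_ignore_above():
    """Character count that can never exceed the byte limit in UTF-8."""
    return MAX_TERM_BYTES // MAX_UTF8_BYTES_PER_CHAR


def exceeds_term_limit(value):
    """True if the UTF-8 encoding of value is longer than the term limit."""
    return len(value.encode("utf-8")) > MAX_TERM_BYTES


def keyword_mapping(field, limit=None):
    """Build a mapping fragment for one keyword field with ignore_above."""
    if limit is None:
        limit = safe_ignore_above()
    return {"properties": {field: {"type": "keyword", "ignore_above": limit}}}


# The field name comes from the error message; the rest is illustrative.
mapping = keyword_mapping("threadid")
too_long = exceeds_term_limit("a" * 35024)   # the size reported in the log
```

The fragment would go into the index template for the biz-info-* indices, so that each new daily index picks it up.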

2) The search thread pool queue is too small

Alert source:
{"path":"/.kibana/_search","query":{"size":20,"from":0,"_source":"index-pattern.title,index-pattern.fields,type,title,fields"},"body":"{\"version\":true,\"query\":{\"bool\":{\"filter\":[{\"term\":{\"type\":\"index-pattern\"}}],\"must\":[{\"simple_query_string\":{\"query\":\"\\\"biz-info-credit-*\\\"\",\"all_fields\":true}}]}}}","statusCode":503,"response":"{\"error\":{\"root_cause\":[],\"type\":\"search_phase_execution_exception\",\"reason\":\"\",\"phase\":\"fetch\",\"grouped\":true,\"failed_shards\":[],\"caused_by\":{\"type\":\"es_rejected_execution_exception\",\"reason\":\"rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@5218f476 on QueueResizingEsThreadPoolExecutor[search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 239.9micros, adjustment amount = 50, QueueResizingEsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@190a4040[Running, pool size = 7, active threads = 0, queued tasks = 1194, completed tasks = 659786700]]]\"}},\"status\":503}"}: [search_phase_execution_exception] :: {"path":"/.kibana/_search","query":{"size":20,"from":0,"_source":"index-pattern.title,index-pattern.fields,type,title,fields"},"body":"{\"version\":true,\"query\":{\"bool\":{\"filter\":[{\"term\":{\"type\":\"index-pattern\"}}],\"must\":[{\"simple_query_string\":{\"query\":\"\\\"biz-info-credit-*\\\"\",\"all_fields\":true}}]}}}","statusCode":503,"response":"{\"error\":{\"root_cause\":[],\"type\":\"search_phase_execution_exception\",\"reason\":\"\",\"phase\":\"fetch\",\"grouped\":true,\"failed_shards\":[],\"caused_by\":{\"type\":\"es_rejected_execution_exception\",\"reason\":\"rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@5218f476 on QueueResizingEsThreadPoolExecutor[search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame 
size = 2000, targeted response rate = 1s, task execution EWMA = 239.9micros, adjustment amount = 50, QueueResizingEsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@190a4040[Running, pool size = 7, active threads = 0, queued tasks = 1194, completed tasks = 659786700]]]\"}},\"status\":503}"}
at respond (/elk/kibana/node_modules/elasticsearch/src/lib/transport.js:295:15)
at checkRespForFailure (/elk/kibana/node_modules/elasticsearch/src/lib/transport.js:254:7)
at HttpConnector.<anonymous> (/elk/kibana/node_modules/elasticsearch/src/lib/connectors/http.js:159:7)
at IncomingMessage.bound (/elk/kibana/node_modules/elasticsearch/node_modules/lodash/dist/lodash.js:729:21)
at emitNone (events.js:91:20)
at IncomingMessage.emit (events.js:185:7)
at endReadableNT (_stream_readable.js:974:12)
at _combinedTickCallback (internal/process/next_tick.js:80:11)
at process._tickDomainCallback (internal/process/next_tick.js:128:9)
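The figures that matter in this alert are buried in the rejection reason. As a small sketch, the function below pulls them out with regular expressions. The sample string is copied from the alert above; the function name and the choice of fields are assumptions made for illustration.

```python
import re

# Tail of the "reason" string, copied from the alert above.
REASON = (
    "QueueResizingEsThreadPoolExecutor[search, queue capacity = 1000, "
    "min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, "
    "targeted response rate = 1s, task execution EWMA = 239.9micros, "
    "adjustment amount = 50, QueueResizingEsThreadPoolExecutor[search, "
    "queue capacity = 1000, org.elasticsearch.common.util.concurrent."
    "QueueResizingEsThreadPoolExecutor@190a4040[Running, pool size = 7, "
    "active threads = 0, queued tasks = 1194, completed tasks = 659786700]]]"
)

FIELDS = ("queue capacity", "pool size", "active threads", "queued tasks")


def parse_rejection(reason):
    """Extract the pool name and the main counters from a rejection reason."""
    info = {"pool": re.search(r"EsThreadPoolExecutor\[(\w+),", reason).group(1)}
    for key in FIELDS:
        # The lookbehinds skip "min queue capacity" and "max queue capacity".
        pattern = r"(?<!min )(?<!max )" + re.escape(key) + r" = (\d+)"
        info[key] = int(re.search(pattern, reason).group(1))
    return info


info = parse_rejection(REASON)
```

Here the rejected pool is search, not index, and the queued task count (1194) is above the queue capacity (1000).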

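Optimization ideas:

I. Index queue size

All systems currently ship their logs into one Elasticsearch cluster, and the volume collected is large. As the business and the collection scope grow, dedicated Elasticsearch clusters will be split out.
1. Long-term goal: dedicated Elasticsearch clusters by purpose:
dashboards and real-time charts;
aggregation queries;
error-log-triggered notifications.
Collection agents send their output to the matching cluster according to that cluster's role.

Short-term optimizations:

Data nodes:
1) Increase the thread pool queue size: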
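# Assuming Elasticsearch 5.0 or later (the logs above suggest 6.x): the settings are
# named thread_pool.*, not threadpool.*, and are static node settings. Set them in
# elasticsearch.yml on each data node and restart the node; they are no longer
# accepted by PUT _cluster/settings. The pool type cannot be changed and size is
# capped by the number of processors, so only the queue size is raised here.
thread_pool.index.queue_size: 1500
# The rejection in the alert above came from the search pool, which has its own
# queue settings: thread_pool.search.queue_size and thread_pool.search.max_queue_size.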

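2) Disable HTTP on the data nodes.
HTTP requests are handled only by the master nodes or by client nodes.
3) Add client nodes and send query requests to them, reducing the I/O requests that reach the master and data nodes.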
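Items 2) and 3) come down to a few node-level settings per role. The sketch below renders them as elasticsearch.yml lines, assuming a 6.x cluster where node.master, node.data, node.ingest and http.enabled are still valid (http.enabled was removed in 7.0). The role names and the decision to disable ingest everywhere are assumptions, not part of the original plan.

```python
# Settings per role; a "client" node is a coordinating-only node.
ROLE_SETTINGS = {
    "master": {"node.master": True, "node.data": False, "node.ingest": False},
    "data": {
        "node.master": False,
        "node.data": True,
        "node.ingest": False,
        "http.enabled": False,   # item 2): no HTTP on data nodes
    },
    "client": {"node.master": False, "node.data": False, "node.ingest": False},
}


def render_yaml(role):
    """Render the settings of one role as elasticsearch.yml lines."""
    lines = []
    for key, value in ROLE_SETTINGS[role].items():
        lines.append("{}: {}".format(key, str(value).lower()))
    return "\n".join(lines)


client_yaml = render_yaml("client")
data_yaml = render_yaml("data")
```

Kibana and other query clients would then point at the client nodes, while the data nodes talk to the rest of the cluster over the transport port only.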
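4) Clear caches regularly
# Set the cache type to soft references, which are reclaimed only when memory runs short.
# These index.cache.field.* settings belong to very old Elasticsearch releases and are
# rejected by current ones; there the fielddata cache is bounded with
# indices.fielddata.cache.size in elasticsearch.yml and cleared with POST /<index>/_cache/clear.
index.cache.field.max_size: 50000
index.cache.field.expire: 10m
index.cache.field.type: soft
During the early-morning window when log collection is idle, merge the indices on a schedule.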
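A nightly job for this step can be sketched as a list of REST calls against the previous day's indices, which are no longer being written. The index prefixes come from this document; the function name, the choice of max_num_segments=1, and running both calls per index are assumptions.

```python
from datetime import date, timedelta

PREFIXES = ("biz-info-accs", "biz-info-credit")


def nightly_requests(today, prefixes=PREFIXES):
    """Build (method, path) pairs for yesterday's daily indices."""
    yesterday = today - timedelta(days=1)
    suffix = yesterday.strftime("%Y.%m.%d")
    requests = []
    for prefix in prefixes:
        index = "{}-{}".format(prefix, suffix)
        # Drop the caches held for the index that just stopped receiving writes.
        requests.append(("POST", "/{}/_cache/clear".format(index)))
        # Merge its segments; only sensible once the index is read-only in practice.
        requests.append(
            ("POST", "/{}/_forcemerge?max_num_segments=1".format(index))
        )
    return requests


plan = nightly_requests(date(2019, 11, 20))
```

A scheduler such as cron would run this in the idle window and send each pair to a node that accepts HTTP.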

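5) Enable adaptive replica selection, which routes each search to the shard copy expected to respond fastest.
Pros: higher query throughput and lower latency for search-heavy applications.
Cons: more I/O pressure.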

PUT /_cluster/settings
{
"transient": {
"cluster.routing.use_adaptive_replica_selection": true
}
}

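6) Rebalance shards across the nodes
Set cluster.routing.allocation.balance.shard (default 0.45f).
The larger the value, the more strongly the allocator tries to equalize the number of shards on each node.
Value applied here, cited as official practice: 0.6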
PUT _cluster/settings
{
"transient" : {
"cluster.routing.allocation.balance.shard" : 0.60
}
}
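To see what moving the value from 0.45 to 0.60 does, the sketch below models how the balancer combines the shard factor with the index factor (cluster.routing.allocation.balance.index, default 0.55f). It is a simplified reading of the allocator's weight function, not a copy of it; the real allocator takes further inputs, and the shard counts in the usage lines are hypothetical.

```python
def balance_factors(shard_balance, index_balance=0.55):
    """Normalize the two settings so that they sum to 1."""
    total = shard_balance + index_balance
    return shard_balance / total, index_balance / total


def node_weight(node_shards, avg_shards, node_index_shards, avg_index_shards,
                shard_balance, index_balance=0.55):
    """Weight of a node: positive means it holds more than its share."""
    theta_shard, theta_index = balance_factors(shard_balance, index_balance)
    return (theta_shard * (node_shards - avg_shards)
            + theta_index * (node_index_shards - avg_index_shards))


# Hypothetical node holding 12 shards (average 10), 4 of them from one
# index (average 3 per node for that index).
before = node_weight(12, 10, 4, 3, shard_balance=0.45)
after = node_weight(12, 10, 4, 3, shard_balance=0.60)
```

With the index factor left at its default, raising the shard factor to 0.60 lifts its normalized share from 0.45 to roughly 0.52, so the total shard count per node counts for somewhat more in rebalancing decisions.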
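7) Until the aggregation-query cluster is split out, use the fielddata circuit breaker on queries:
Reason: queries themselves have a major impact on response latency. The breaker rejects a query that would load too much fielddata, so that it cannot exhaust the heap and leave the Elasticsearch cluster unstable.
Official practice: 60%, a relatively conservative value. (60% is also the 6.x default for indices.breaker.fielddata.limit; the 70% default belongs to the parent breaker, indices.breaker.total.limit.)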
PUT /_cluster/settings
{
"persistent" : {
"indices.breaker.fielddata.limit" : "60%"
}
}

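II. Reduce the load on the data nodes

Current distribution of log storage across the data nodes:
data1 (node 66): credit, core, credit-reporting sal, xgateway-sal, funds, ERROR logs, and others
data2 (node 166): data-node distribution, the data source queried by Kibana (130), logs collected by the cloud Logstash
data3 (node 134): data-node distribution

1) Route query and dashboard requests to the master nodes, whose resource utilization is low. Some Kibana dashboard pages will need to be reconfigured, or have their index hash values updated during the migration.

2) Set the cluster-wide breaker to 40% to keep the cluster from running out of memory (OOM)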
PUT _cluster/settings
{
"persistent": {
"indices.breaker.total.limit": "40%"
}
}
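The parent breaker bounds the sum of all child breakers, so a total limit of 40% takes effect before a fielddata limit of 60% can be reached. The sketch below works through that arithmetic. The 8 GiB heap is an assumption (half of the 16 GB server memory mentioned earlier); the actual heap size is not given in this document.

```python
GIB = 1024 ** 3


def breaker_limit_bytes(heap_bytes, percent):
    """Bytes allowed by a breaker configured as a percentage of the heap."""
    return heap_bytes * percent // 100


def effective_fielddata_limit(heap_bytes, fielddata_percent, total_percent):
    """Fielddata can never use more than the parent breaker allows."""
    return min(
        breaker_limit_bytes(heap_bytes, fielddata_percent),
        breaker_limit_bytes(heap_bytes, total_percent),
    )


heap = 8 * GIB  # assumed heap size, not stated in the source
fielddata_cap = breaker_limit_bytes(heap, 60)
total_cap = breaker_limit_bytes(heap, 40)
effective = effective_fielddata_limit(heap, 60, 40)
```

With these two settings together, the 40% parent limit is the one that actually trips, so the fielddata limit would need to sit below it to have any effect of its own.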

posted @ 2020-08-12 10:41  dereklok