ClickHouse关于参数receive_timeout探究
ClickHouse关于参数receive_timeout探究
仿伪链接:https://www.cnblogs.com/PiscesCanon/p/16898032.html
这几天研究给这个参数搞懵逼了,完全不明意义。
一个查询可能出现好几种报错,人麻了。。
辣鸡CK只想说。
connect_timeout,receive_timeout,send_timeout
用于与客户端通信的套接字上的超时以秒为单位。
默认值:10,300,300。
测试表信息:default.zkm_local
dev-app76 :) SELECT host AS `主机名`, sum(rows) AS `总行数`, formatReadableSize(sum(data_uncompressed_bytes)) AS `原始大小`, formatReadableSize(sum(data_compressed_bytes)) AS `压缩大小`, round((sum(data_compressed_bytes) / sum(data_uncompressed_bytes)) * 100, 0) AS `压缩率` FROM `gv$parts` WHERE database='default' AND table='zkm_local' GROUP BY host ORDER BY host ASC; SELECT host AS `主机名`, sum(rows) AS `总行数`, formatReadableSize(sum(data_uncompressed_bytes)) AS `原始大小`, formatReadableSize(sum(data_compressed_bytes)) AS `压缩大小`, round((sum(data_compressed_bytes) / sum(data_uncompressed_bytes)) * 100, 0) AS `压缩率` FROM `gv$parts` WHERE (database = 'default') AND (table = 'zkm_local') GROUP BY host ORDER BY host ASC Query id: 729ba8e9-4b90-44be-9f74-1f2b487774bd ┌─主机名───-─┬──总行数─-┬─原始大小───-┬─压缩大小───-┬─压缩率-─┐ │ dev-app76 │ 2595814 │ 846.24 MiB │ 234.81 MiB │ 28 │ │ dev-app77 │ 2595814 │ 846.24 MiB │ 234.81 MiB │ 28 │ │ dev-app78 │ 2597249 │ 846.73 MiB │ 234.50 MiB │ 28 │ │ dev-app79 │ 2597249 │ 846.73 MiB │ 234.50 MiB │ 28 │ └───────────┴─────────┴────────────┴────────────┴────────┘ 4 rows in set. Elapsed: 0.193 sec.
receive_timeout=0的情况,远程连接同时设置该参:
[root@dev-app80 daily]# clickhouse-client -h 192.168.1.172 -u default --password defaultsys --receive_timeout=0 ClickHouse client version 22.3.2.1. Connecting to 192.168.1.172:9000 as user default. Connected to ClickHouse server version 22.3.2 revision 54455. dev-app77 :) select * from zkm_local; SELECT * FROM zkm_local Query id: 1e66f2b7-fddc-4947-a21f-1ed699972915 Timeout exceeded while receiving data from server. Waited for 0 seconds, timeout is 0 seconds. Cancelling query. Ok. Query was cancelled. 0 rows in set. Elapsed: 0.137 sec. Processed 158.84 thousand rows, 81.89 MB (1.16 million rows/s., 596.64 MB/s.)
对于查询SQL快的可能成功,也可能失败,如:
dev-app77 :) select * from system.settings where name='receive_timeout'; SELECT * FROM system.settings WHERE name = 'receive_timeout' Query id: e937b13c-37d6-43c0-ba66-45e07ebd1b30 ┌─name────────────┬─value─┬─changed─┬─description─┬─min──┬─max──┬─readonly─┬─type────┐ │ receive_timeout │ 0 │ 1 │ │ ᴺᵁᴸᴸ │ ᴺᵁᴸᴸ │ 0 │ Seconds │ └─────────────────┴───────┴─────────┴─────────────┴──────┴──────┴──────────┴─────────┘ 1 rows in set. Elapsed: 0.005 sec.
测试receive_timeout=1.
测试语句:
select t.* from zkm_local t;
结果1:
能够正常跑完
结果2:
远程连接第二个节点,同时设置receive_timeout=1,执行SQL,输出部分结果后报错如下:
↘ Progress: 117.88 thousand rows, 60.67 MB (18.12 thousand rows/s., 9.33 MB/s.) ████▌ (1.2 CPU, 152.95 MB RAM) 4%Exception on client: Code: 33. DB::Exception: Cannot read all data. Bytes read: 171757. Bytes expected: 604103.: while receiving packet from 192.168.1.172:9000. (CANNOT_READ_ALL_DATA) Connecting to 192.168.1.172:9000 as user default. Connected to ClickHouse server version 22.3.2 revision 54455.
同时,在输出结果(client端还未抛出报错)的某个时刻,错误日志报错信息:

2022.11.16 17:30:49.354948 [ 6507 ] {6b5a5fde-a832-4c9a-af93-dceda3c12127} <Error> executeQuery: Code: 209. DB::NetException: Timeout exceeded while writing to socket (192.168.1.123:46930, 1000 ms). (SOCKET_TIMEOUT) (version 22.3.2.1) (from 192.168.1.123:46930) (in query: select t.* from zkm_local t;), Stack trace (when copying this message, always include the lines below): 0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xa4dde1a in /usr/bin/clickhouse 1. DB::WriteBufferFromPocoSocket::nextImpl() @ 0x14187daf in /usr/bin/clickhouse 2. DB::TCPHandler::sendData(DB::Block const&) @ 0x1544a014 in /usr/bin/clickhouse 3. DB::TCPHandler::processOrdinaryQueryWithProcessors() @ 0x15448626 in /usr/bin/clickhouse 4. DB::TCPHandler::runImpl() @ 0x154417a7 in /usr/bin/clickhouse 5. DB::TCPHandler::run() @ 0x15450e79 in /usr/bin/clickhouse 6. Poco::Net::TCPServerConnection::start() @ 0x164b264f in /usr/bin/clickhouse 7. Poco::Net::TCPServerDispatcher::run() @ 0x164b4aa1 in /usr/bin/clickhouse 8. Poco::PooledThread::run() @ 0x16671e49 in /usr/bin/clickhouse 9. Poco::ThreadImpl::runnableEntry(void*) @ 0x1666f1a0 in /usr/bin/clickhouse 10. start_thread @ 0x7ea5 in /usr/lib64/libpthread-2.17.so 11. __clone @ 0xfe96d in /usr/lib64/libc-2.17.so 2022.11.16 17:30:49.362383 [ 6507 ] {6b5a5fde-a832-4c9a-af93-dceda3c12127} <Error> TCPHandler: Code: 209. DB::Exception: Timeout exceeded while writing to socket (192.168.1.123:46930, 1000 ms). (SOCKET_TIMEOUT), Stack trace (when copying this message, always include the lines below): 0. DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool) @ 0xa4dde1a in /usr/bin/clickhouse 1. DB::WriteBufferFromPocoSocket::nextImpl() @ 0x14187daf in /usr/bin/clickhouse 2. DB::TCPHandler::sendData(DB::Block const&) @ 0x1544a014 in /usr/bin/clickhouse 3. DB::TCPHandler::processOrdinaryQueryWithProcessors() @ 0x15448626 in /usr/bin/clickhouse 4. DB::TCPHandler::runImpl() @ 0x154417a7 in /usr/bin/clickhouse 5. DB::TCPHandler::run() @ 0x15450e79 in /usr/bin/clickhouse 6. Poco::Net::TCPServerConnection::start() @ 0x164b264f in /usr/bin/clickhouse 7. Poco::Net::TCPServerDispatcher::run() @ 0x164b4aa1 in /usr/bin/clickhouse 8. Poco::PooledThread::run() @ 0x16671e49 in /usr/bin/clickhouse 9. Poco::ThreadImpl::runnableEntry(void*) @ 0x1666f1a0 in /usr/bin/clickhouse 10. start_thread @ 0x7ea5 in /usr/lib64/libpthread-2.17.so 11. __clone @ 0xfe96d in /usr/lib64/libc-2.17.so
同时,服务器上netstat -anputwco | grep 192.168.1.123的结果
tcp 0 341728 192.168.1.172:9000 192.168.1.123:46942 ESTABLISHED 1435/clickhouse-ser on (0.04/0/0) tcp 0 361629 192.168.1.172:9000 192.168.1.123:46942 ESTABLISHED 1435/clickhouse-ser probe (0.14/0/0) tcp 0 361629 192.168.1.172:9000 192.168.1.123:46942 ESTABLISHED 1435/clickhouse-ser probe (0.06/0/1) tcp 0 361629 192.168.1.172:9000 192.168.1.123:46942 ESTABLISHED 1435/clickhouse-ser probe (0.91/0/0) tcp 0 361629 192.168.1.172:9000 192.168.1.123:46942 ESTABLISHED 1435/clickhouse-ser probe (3.63/0/0) tcp 0 361629 192.168.1.172:9000 192.168.1.123:46942 ESTABLISHED 1435/clickhouse-ser probe (2.62/0/0) tcp 0 361629 192.168.1.172:9000 192.168.1.123:46942 ESTABLISHED 1435/clickhouse-ser probe (1.61/0/0) tcp 0 0 192.168.1.172:9000 192.168.1.123:46942 FIN_WAIT2 - timewait (59.37/0/0) tcp 0 0 192.168.1.172:9000 192.168.1.123:46942 FIN_WAIT2 - timewait (58.36/0/0) tcp 0 0 192.168.1.172:9000 192.168.1.123:46942 FIN_WAIT2 - timewait (57.35/0/0) tcp 0 0 192.168.1.172:9000 192.168.1.123:46942 FIN_WAIT2 - timewait (56.34/0/0)
同时,使用dstat的观察结果:
17-11 09:37:12| 44k 1373k| 2 3 95 0 0 0|1195M 104M 924M 5759M 17-11 09:37:13| 18k 324k| 1 0 99 0 0 0|1201M 104M 924M 5753M 17-11 09:37:14|1300B 3997B| 1 0 98 0 0 0|1189M 104M 926M 5763M 17-11 09:37:15| 16k 10k| 0 0 99 0 0 0|1182M 104M 925M 5771M 17-11 09:37:16| 10k 3814B| 1 0 99 0 0 0|1182M 104M 925M 5771M 17-11 09:37:17| 38k 26k| 1 0 99 0 0 0|1185M 104M 925M 5768M 17-11 09:37:18| 17k 10k| 0 0 99 0 0 0|1184M 104M 925M 5770M 17-11 09:37:19|7390B 420k| 2 1 98 0 0 0|1198M 104M 924M 5756M 17-11 09:37:20|7923B 7543B| 1 1 99 0 0 0|1186M 104M 923M 5769M 17-11 09:37:21|3287B 2255B| 1 0 98 0 0 0|1190M 104M 925M 5763M 17-11 09:37:22| 31k 25k| 0 0 99 0 0 0|1173M 104M 925M 5780M 17-11 09:37:23| 17k 11k| 0 0 99 0 0 0|1163M 104M 925M 5790M
结果3:
结果4:
结果5:
........
其他结果一样也是报错,只是报错信息略有区别,有的没保留到。
太乱了就不全部贴出来。
换一种方式。
将输出内容重定向到文件中,则基本不会报错(偶尔出现):
clickhouse-client -h 192.168.1.172 -u default --password defaultsys --receive_timeout=1 --send_logs_level='debug' --query="select * from default.zkm_local" > zkm_local.csv

[root@dev-app80 daily]# clickhouse-client -h 192.168.1.172 -u default --password defaultsys --receive_timeout=1 --send_logs_level='debug' --query="select * from default.zkm_local" > zkm_local.csv [dev-app77] 2022.11.17 09:28:11.124199 [ 22489 ] {049d188f-b305-4cc6-89d0-a828dde1009c} <Debug> executeQuery: (from 192.168.1.123:47040) select * from default.zkm_local [dev-app77] 2022.11.17 09:28:11.126679 [ 22489 ] {049d188f-b305-4cc6-89d0-a828dde1009c} <Debug> default.zkm_local (23cd08a1-1ba8-45a4-9d34-01b0a640d047) (SelectExecutor): Key condition: unknown [dev-app77] 2022.11.17 09:28:11.126926 [ 22489 ] {049d188f-b305-4cc6-89d0-a828dde1009c} <Debug> default.zkm_local (23cd08a1-1ba8-45a4-9d34-01b0a640d047) (SelectExecutor): MinMax index condition: unknown [dev-app77] 2022.11.17 09:28:11.128069 [ 22489 ] {049d188f-b305-4cc6-89d0-a828dde1009c} <Debug> default.zkm_local (23cd08a1-1ba8-45a4-9d34-01b0a640d047) (SelectExecutor): Selected 14/14 parts by partition key, 14 parts by primary key, 327/327 marks by primary key, 327 marks to read from 14 ranges [dev-app77] 2022.11.17 09:28:11.130267 [ 22489 ] {049d188f-b305-4cc6-89d0-a828dde1009c} <Debug> default.zkm_local (23cd08a1-1ba8-45a4-9d34-01b0a640d047) (SelectExecutor): Reading approx. 2595814 rows with 8 streams [dev-app77] 2022.11.17 09:28:23.481457 [ 22489 ] {049d188f-b305-4cc6-89d0-a828dde1009c} <Information> executeQuery: Read 2595814 rows, 1.25 GiB in 12.357118953 sec., 210066 rows/sec., 103.74 MiB/sec. [dev-app77] 2022.11.17 09:28:23.494248 [ 22489 ] {049d188f-b305-4cc6-89d0-a828dde1009c} <Debug> MemoryTracker: Peak memory usage (for query): 169.81 MiB. [root@dev-app80 daily]# [root@dev-app80 daily]# [root@dev-app80 daily]# clickhouse-client -h 192.168.1.172 -u default --password defaultsys --receive_timeout=1 --send_logs_level='debug' --query="select * from default.zkm_local" > zkm_local.csv [dev-app77] 2022.11.17 09:29:05.007509 [ 22489 ] {af63e99a-0c61-4b84-8288-475b94c0b619} <Debug> executeQuery: (from 192.168.1.123:47042) select * from default.zkm_local [dev-app77] 2022.11.17 09:29:05.009623 [ 22489 ] {af63e99a-0c61-4b84-8288-475b94c0b619} <Debug> default.zkm_local (23cd08a1-1ba8-45a4-9d34-01b0a640d047) (SelectExecutor): Key condition: unknown [dev-app77] 2022.11.17 09:29:05.009747 [ 22489 ] {af63e99a-0c61-4b84-8288-475b94c0b619} <Debug> default.zkm_local (23cd08a1-1ba8-45a4-9d34-01b0a640d047) (SelectExecutor): MinMax index condition: unknown [dev-app77] 2022.11.17 09:29:05.010641 [ 22489 ] {af63e99a-0c61-4b84-8288-475b94c0b619} <Debug> default.zkm_local (23cd08a1-1ba8-45a4-9d34-01b0a640d047) (SelectExecutor): Selected 14/14 parts by partition key, 14 parts by primary key, 327/327 marks by primary key, 327 marks to read from 14 ranges [dev-app77] 2022.11.17 09:29:05.011670 [ 22489 ] {af63e99a-0c61-4b84-8288-475b94c0b619} <Debug> default.zkm_local (23cd08a1-1ba8-45a4-9d34-01b0a640d047) (SelectExecutor): Reading approx. 2595814 rows with 8 streams [dev-app77] 2022.11.17 09:29:17.401748 [ 22489 ] {af63e99a-0c61-4b84-8288-475b94c0b619} <Information> executeQuery: Read 2595814 rows, 1.25 GiB in 12.394144753 sec., 209438 rows/sec., 103.43 MiB/sec. [dev-app77] 2022.11.17 09:29:17.405031 [ 22489 ] {af63e99a-0c61-4b84-8288-475b94c0b619} <Debug> MemoryTracker: Peak memory usage (for query): 169.69 MiB.
同时,服务器上netstat -anputwco | grep 192.168.1.123的结果:
tcp 0 2028857 192.168.1.172:9000 192.168.1.123:47042 ESTABLISHED 1435/clickhouse-ser on (0.19/0/0) tcp 0 2884482 192.168.1.172:9000 192.168.1.123:47042 ESTABLISHED 1435/clickhouse-ser on (0.20/0/0) tcp 0 2599228 192.168.1.172:9000 192.168.1.123:47042 ESTABLISHED 1435/clickhouse-ser on (0.22/0/0) tcp 0 2597150 192.168.1.172:9000 192.168.1.123:47042 ESTABLISHED 1435/clickhouse-ser on (0.03/0/0) tcp 0 3081369 192.168.1.172:9000 192.168.1.123:47042 ESTABLISHED 1435/clickhouse-ser on (0.21/0/0) tcp 0 2537205 192.168.1.172:9000 192.168.1.123:47042 ESTABLISHED 1435/clickhouse-ser on (0.21/0/0) tcp 0 2734426 192.168.1.172:9000 192.168.1.123:47042 ESTABLISHED 1435/clickhouse-ser on (0.19/0/0) tcp 0 2711009 192.168.1.172:9000 192.168.1.123:47042 ESTABLISHED 1435/clickhouse-ser on (0.19/0/0) tcp 0 3360854 192.168.1.172:9000 192.168.1.123:47042 ESTABLISHED 1435/clickhouse-ser on (0.22/0/0) tcp 0 2708708 192.168.1.172:9000 192.168.1.123:47042 ESTABLISHED 1435/clickhouse-ser on (0.21/0/0) tcp 0 3148247 192.168.1.172:9000 192.168.1.123:47042 ESTABLISHED 1435/clickhouse-ser on (0.20/0/0) tcp 0 3248253 192.168.1.172:9000 192.168.1.123:47042 ESTABLISHED 1435/clickhouse-ser probe (0.22/0/0) tcp 0 0 192.168.1.172:9000 192.168.1.123:47042 ESTABLISHED 1435/clickhouse-ser off (0.00/0/0) tcp 0 0 192.168.1.172:9000 192.168.1.123:47042 ESTABLISHED 1435/clickhouse-ser off (0.00/0/0) tcp 0 0 192.168.1.172:9000 192.168.1.123:47042 ESTABLISHED 1435/clickhouse-ser off (0.00/0/0)
同时,使用dstat的观察结果:
[root@dev-app77 ~]# dstat -tncm --nocolor ----system---- -net/total- ----total-cpu-usage---- ------memory-usage----- 17-11 09:29:05| 71k 29M| 9 4 87 0 0 0|1267M 103M 913M 5699M 17-11 09:29:06| 72k 49M| 12 1 87 0 0 0|1298M 103M 913M 5667M 17-11 09:29:07| 10k 17M| 4 0 96 0 0 0|1304M 103M 913M 5661M 17-11 09:29:08| 16k 18M| 5 1 92 3 0 0|1298M 103M 913M 5668M 17-11 09:29:09| 10k 15M| 5 1 94 0 0 0|1304M 103M 912M 5663M 17-11 09:29:10| 24k 13M| 4 1 94 0 0 0|1307M 103M 913M 5658M 17-11 09:29:11| 26k 16M| 6 2 92 0 0 0|1339M 103M 915M 5625M 17-11 09:29:12|4661B 17M| 4 1 95 0 0 0|1327M 103M 915M 5637M 17-11 09:29:13| 18k 15M| 4 1 91 4 0 0|1335M 103M 915M 5629M ----system---- -net/total- ----total-cpu-usage---- ------memory-usage----- time | recv send|usr sys idl wai hiq siq| used buff cach free 17-11 09:29:14| 14k 16M| 4 1 95 0 0 0|1322M 103M 915M 5641M 17-11 09:29:15|5627B 10M| 3 1 97 0 0 0|1305M 103M 915M 5658M 17-11 09:29:16| 31k 8571k| 3 1 96 0 0 0|1293M 103M 913M 5673M 17-11 09:29:17|6984B 9263k| 3 1 96 0 0 0|1260M 103M 914M 5704M 17-11 09:29:18| 13k 222k| 1 0 97 2 0 0|1257M 103M 915M 5707M 17-11 09:29:19| 16k 8609B| 2 1 93 4 0 0|1242M 103M 916M 5720M 17-11 09:29:20|9057B 3024B| 1 0 99 0 0 0|1240M 103M 916M 5722M
一开始以为receive_time控制客户端接受数据时间,
比如receive_time=1时看,执行一条sql返回数据超过1s立刻抛出错误。
(PS:控制SQL执行时间是由max_execution_time控制,默认0表示无限制)
从实验结果看,有时候返回数据(屏幕滚动多达几秒十几秒)超过1s然后最终不报错的都有。
而且若是重定向到文件,大部分时候是成功的。
根据报错试验期间服务器日志的某几次报错信息(类似信息,但有区别,可和前边的报错对比):
210. DB::NetException: Connection reset by peer, while writing to socket (127.0.0.1:33484). (NETWORK_ERROR)
Code: 209. DB::NetException: Timeout exceeded while writing to socket (127.0.0.1:33410, 1000 ms). (SOCKET_TIMEOUT)
关于Connection reset by peer,猜测原因是客户端主动断开重连,当然也可能是网络导致,不是很确定。只能确定连接没有正常维持而被reset了。
关于socket,读这两篇了解一下:学C语言会用到socket网络编程,那它是什么意思 socket的概念和原理
while writing to socket,可以知道是当对socket做写操作,即服务器对客户端发送数据的时候报错。
Timeout exceeded while writing to socket,即超过设定的1s超时报错,
结合实验过程,这个超时也就是当服务器对socket发送数据的时候,socket无响应或者来不及响应的时间超过1s则报错。
socket无响应或者来不及响应可能原因是网络拥堵导致的,也可能是客户端接受数据过慢导致的。
接受数据过慢是什么说法,
比如从上边看,通过clickhouse-client交互式执行SQL有时候在某段时间内几乎次次报错,同时dstat观察网络流量都是几十K并远远没有达到峰值,说明是clickhouse-client来不及接受数据导致的,估计是刷屏过慢导致的。
刷屏问题在Oracle中也有,sqlplus数据刷屏的时间是计算在statistics_level=all的时间内的,即使sql可能0.1s执行完成但是由于数据在sqlplus刷屏时间过长最终倒是sql执行时间是10s都有可能。
clickhouse-client由于数据刷屏慢来不及响应服务器对socket的写操作,所以报错了。
对比将数据用CSV格式重定向到文件无刷屏过程,几乎就次次成功无报错,并且dstat的网络流量从10几M到40几M都有,网络流量上去了说明socket的接收能力没有被限制,所以没有报错。
使用tc限制网络流量,再次观察重定向方式是否报错。
限制 ens160 网卡的带宽为 500kbit:
tc qdisc add dev ens160 root tbf rate 500kbit latency 50ms burst 15kb
再试执行重定向操作,
[root@dev-app80 daily]# clickhouse-client -h 192.168.1.172 -u default --password defaultsys --receive_timeout=1 --query="select * from default.zkm_local" > zkm_local.csv Code: 33. DB::Exception: Cannot read all data. Bytes read: 152964. Bytes expected: 585803.: while receiving packet from 192.168.1.172:9000: (in query: select * from default.zkm_local). (CANNOT_READ_ALL_DATA)
可以看出,网络流量限制之后,出现client接收数据超时也报错了。
删除网卡 ens160 网卡限制。
tc qdisc del dev ens160 root
关于receive_timeout到这里。
太乱了,搞了有4天......
还有个send_timeout......