[bug]记一次同步数据问题排查

记一次线上事故排查

具体场景简介:

目前有一个数据同步的场景(Mongodb -> ES)

架构模型如下:

遇到的问题:

任务消费到某一时刻,就会hang住,无法继续消费,channel队列满了以后,导致整个同步任务全面停止

可以很明显的看到已经hang住了,并且最后一条日志明确的显示是阻塞在发送任务上

已经预先开启了pprof,我们看一下当前goroutine阻塞的内容

goroutine profile: total 56
10 @ 0x103d0b2 0x104d4b2 0x1e3ef6c 0x1e443d5 0x1fef5a5 0x1feeaa5 0x106ed81
#    0x1e3ef6b    github.com/elastic/go-elasticsearch/v7/esutil.(*bulkIndexer).Add+0x12b    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:229
#    0x1e443d4    aqgs/sync_assistant/internal/toes.ToES+0x394                /Users/songzhibin/go/src/aqgs/sync_assistant/internal/toes/toes.go:30
#    0x1fef5a4    aqgs/sync_assistant/internal/task/risk.StorageSoftware+0x604        /Users/songzhibin/go/src/aqgs/sync_assistant/internal/task/risk/software.go:141
#    0x1feeaa4    aqgs/sync_assistant/internal/task/risk.PoolStorageSoftware+0x1c4    /Users/songzhibin/go/src/aqgs/sync_assistant/internal/task/risk/software.go:106

8 @ 0x103d0b2 0x103d14a 0x106aafe 0x107ac78 0x1c1ada5 0x1c1a7ac 0x106ed81
#    0x106aafd    sync.runtime_notifyListWait+0x13d                                /usr/local/Cellar/go/1.17.5/libexec/src/runtime/sema.go:513
#    0x107ac77    sync.(*Cond).Wait+0x77                                        /usr/local/Cellar/go/1.17.5/libexec/src/sync/cond.go:56
#    0x1c1ada4    go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*pool).createConnections.func2+0x1a4    /Users/songzhibin/go/pkg/mod/go.mongodb.org/mongo-driver@v1.8.3/x/mongo/driver/topology/pool.go:616
#    0x1c1a7ab    go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*pool).createConnections+0x22b        /Users/songzhibin/go/pkg/mod/go.mongodb.org/mongo-driver@v1.8.3/x/mongo/driver/topology/pool.go:643

4 @ 0x103d0b2 0x104d4b2 0x11a4007 0x106ed81
#    0x11a4006    database/sql.(*DB).connectionOpener+0xc6    /usr/local/Cellar/go/1.17.5/libexec/src/database/sql/sql.go:1196

4 @ 0x103d0b2 0x104d4b2 0x1c1b8eb 0x106ed81
#    0x1c1b8ea    go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*pool).maintain+0x54a    /Users/songzhibin/go/pkg/mod/go.mongodb.org/mongo-driver@v1.8.3/x/mongo/driver/topology/pool.go:750

4 @ 0x103d0b2 0x104d4b2 0x1c1e3d5 0x106ed81
#    0x1c1e3d4    go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*rttMonitor).start+0x2b4    /Users/songzhibin/go/pkg/mod/go.mongodb.org/mongo-driver@v1.8.3/x/mongo/driver/topology/rtt_monitor.go:95

4 @ 0x103d0b2 0x104d4b2 0x1c22b66 0x1c22a2c 0x106ed81
#    0x1c22b65    go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*Server).update.func3+0x105    /Users/songzhibin/go/pkg/mod/go.mongodb.org/mongo-driver@v1.8.3/x/mongo/driver/topology/server.go:493
#    0x1c22a2b    go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*Server).update+0x78b        /Users/songzhibin/go/pkg/mod/go.mongodb.org/mongo-driver@v1.8.3/x/mongo/driver/topology/server.go:556

4 @ 0x103d0b2 0x104d4b2 0x200da5e 0x106ed81
#    0x200da5d    github.com/go-sql-driver/mysql.(*mysqlConn).startWatcher.func1+0xfd    /Users/songzhibin/go/pkg/mod/github.com/go-sql-driver/mysql@v1.6.0/connection.go:614

3 @ 0x103d0b2 0x104d4b2 0x1e3ef6c 0x1e44c7d 0x1e42aba 0x1e4064a 0x106ed81
#    0x1e3ef6b    github.com/elastic/go-elasticsearch/v7/esutil.(*bulkIndexer).Add+0x12b    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:229
#    0x1e44c7c    aqgs/sync_assistant/internal/toes.ToES.func1+0x27c            /Users/songzhibin/go/src/aqgs/sync_assistant/internal/toes/toes.go:41
#    0x1e42ab9    github.com/elastic/go-elasticsearch/v7/esutil.(*worker).flush+0x1639    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:590
#    0x1e40649    github.com/elastic/go-elasticsearch/v7/esutil.(*worker).run.func1+0x7a9    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:386

2 @ 0x103d0b2 0x103d14a 0x104e185 0x106a805 0x108019c 0x1debab2 0x106ed81
#    0x106a804    sync.runtime_Semacquire+0x24                /usr/local/Cellar/go/1.17.5/libexec/src/runtime/sema.go:56
#    0x108019b    sync.(*WaitGroup).Wait+0xbb                /usr/local/Cellar/go/1.17.5/libexec/src/sync/waitgroup.go:130
#    0x1debab1    github.com/spf13/viper.(*Viper).WatchConfig.func1+0x611    /Users/songzhibin/go/pkg/mod/github.com/spf13/viper@v1.10.1/viper.go:452

2 @ 0x103d0b2 0x104d4b2 0x1454007 0x106ed81
#    0x1454006    github.com/go-redis/redis/v8/internal/pool.(*ConnPool).reaper+0x166    /Users/songzhibin/go/pkg/mod/github.com/go-redis/redis/v8@v8.11.4/internal/pool/pool.go:485

2 @ 0x103d0b2 0x104d4b2 0x1debc4e 0x106ed81
#    0x1debc4d    github.com/spf13/viper.(*Viper).WatchConfig.func1.1+0x14d    /Users/songzhibin/go/pkg/mod/github.com/spf13/viper@v1.10.1/viper.go:413

2 @ 0x106b79b 0x1d2987a 0x1d29405 0x1d2d965 0x1d2bbcd 0x106ed81
#    0x106b79a    syscall.syscall6+0x3a                        /usr/local/Cellar/go/1.17.5/libexec/src/runtime/sys_darwin.go:44
#    0x1d29879    golang.org/x/sys/unix.kevent+0xb9                /Users/songzhibin/go/pkg/mod/golang.org/x/sys@v0.0.0-20211210111614-af8b64212486/unix/zsyscall_darwin_amd64.go:276
#    0x1d29404    golang.org/x/sys/unix.Kevent+0x144                /Users/songzhibin/go/pkg/mod/golang.org/x/sys@v0.0.0-20211210111614-af8b64212486/unix/syscall_bsd.go:417
#    0x1d2d964    github.com/fsnotify/fsnotify.read+0xa4                /Users/songzhibin/go/pkg/mod/github.com/fsnotify/fsnotify@v1.5.1/kqueue.go:512
#    0x1d2bbcc    github.com/fsnotify/fsnotify.(*Watcher).readEvents+0x10c    /Users/songzhibin/go/pkg/mod/github.com/fsnotify/fsnotify@v1.5.1/kqueue.go:275

1 @ 0x1036b3e 0x106af68 0x1e3e75d 0x106ed81
#    0x106af67    os/signal.signal_recv+0x27    /usr/local/Cellar/go/1.17.5/libexec/src/runtime/sigqueue.go:166
#    0x1e3e75c    os/signal.loop+0x1c        /usr/local/Cellar/go/1.17.5/libexec/src/os/signal/signal_unix.go:24

1 @ 0x103d0b2 0x1006e7c 0x1006bfd 0x1fee4b3 0x20adcf8 0x1e54f67 0x1e55d93 0x1e5536f 0x20ad31f 0x20ae037 0x103cc93 0x106ed81
#    0x1fee4b2    aqgs/sync_assistant/internal/task/risk.Software+0xcf2    /Users/songzhibin/go/src/aqgs/sync_assistant/internal/task/risk/software.go:80
#    0x20adcf7    aqgs/sync_assistant/cmd.glob..func4+0x277        /Users/songzhibin/go/src/aqgs/sync_assistant/cmd/risk.go:53
#    0x1e54f66    github.com/spf13/cobra.(*Command).execute+0xb66        /Users/songzhibin/go/pkg/mod/github.com/spf13/cobra@v1.3.0/command.go:860
#    0x1e55d92    github.com/spf13/cobra.(*Command).ExecuteC+0x8b2    /Users/songzhibin/go/pkg/mod/github.com/spf13/cobra@v1.3.0/command.go:974
#    0x1e5536e    github.com/spf13/cobra.(*Command).Execute+0x2e        /Users/songzhibin/go/pkg/mod/github.com/spf13/cobra@v1.3.0/command.go:902
#    0x20ad31e    aqgs/sync_assistant/cmd.Execute+0x3e            /Users/songzhibin/go/src/aqgs/sync_assistant/cmd/root.go:55
#    0x20ae036    main.main+0x16                        /Users/songzhibin/go/src/aqgs/sync_assistant/main.go:21
#    0x103cc92    runtime.main+0x1f2                    /usr/local/Cellar/go/1.17.5/libexec/src/runtime/proc.go:255

1 @ 0x103d0b2 0x100796d 0x10075f8 0x1e3eb56 0x106ed81
#    0x1e3eb55    aqgs/sync_assistant/internal/signal.Register.func1+0x35    /Users/songzhibin/go/src/aqgs/sync_assistant/internal/signal/signal.go:14

1 @ 0x103d0b2 0x1035aae 0x1068fa5 0x1118fa8 0x1119057 0x111db37 0x1374cd2 0x139edf5 0x139d0c7 0x15b6e70 0x15b6825 0x15b8416 0x20ae009 0x106ed81
#    0x1068fa4    internal/poll.runtime_pollWait+0x44        /usr/local/Cellar/go/1.17.5/libexec/src/runtime/netpoll.go:234
#    0x1118fa7    internal/poll.(*pollDesc).wait+0x87        /usr/local/Cellar/go/1.17.5/libexec/src/internal/poll/fd_poll_runtime.go:84
#    0x1119056    internal/poll.(*pollDesc).waitRead+0x36        /usr/local/Cellar/go/1.17.5/libexec/src/internal/poll/fd_poll_runtime.go:89
#    0x111db36    internal/poll.(*FD).Accept+0x4b6        /usr/local/Cellar/go/1.17.5/libexec/src/internal/poll/fd_unix.go:402
#    0x1374cd1    net.(*netFD).accept+0x71            /usr/local/Cellar/go/1.17.5/libexec/src/net/fd_unix.go:173
#    0x139edf4    net.(*TCPListener).accept+0x54            /usr/local/Cellar/go/1.17.5/libexec/src/net/tcpsock_posix.go:140
#    0x139d0c6    net.(*TCPListener).Accept+0x66            /usr/local/Cellar/go/1.17.5/libexec/src/net/tcpsock.go:262
#    0x15b6e6f    net/http.(*Server).Serve+0x52f            /usr/local/Cellar/go/1.17.5/libexec/src/net/http/server.go:3002
#    0x15b6824    net/http.(*Server).ListenAndServe+0x164        /usr/local/Cellar/go/1.17.5/libexec/src/net/http/server.go:2931
#    0x15b8415    net/http.ListenAndServe+0xf5            /usr/local/Cellar/go/1.17.5/libexec/src/net/http/server.go:3185
#    0x20ae008    aqgs/sync_assistant/cmd.Execute.func1+0x28    /Users/songzhibin/go/src/aqgs/sync_assistant/cmd/root.go:53

1 @ 0x103d0b2 0x103d14a 0x104e185 0x106a925 0x107d691 0x107d3f1 0x1e3fcea 0x106ed81
#    0x106a924    sync.runtime_SemacquireMutex+0x24                        /usr/local/Cellar/go/1.17.5/libexec/src/runtime/sema.go:71
#    0x107d690    sync.(*Mutex).lockSlow+0x270                            /usr/local/Cellar/go/1.17.5/libexec/src/sync/mutex.go:138
#    0x107d3f0    sync.(*Mutex).Lock+0x50                                /usr/local/Cellar/go/1.17.5/libexec/src/sync/mutex.go:81
#    0x1e3fce9    github.com/elastic/go-elasticsearch/v7/esutil.(*bulkIndexer).init.func1+0x269    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:319

1 @ 0x1068c05 0x209e6f9 0x209e350 0x20995e7 0x20aa46d 0x20ab2c9 0x15b1783 0x15b4155 0x15b655a 0x15b0618 0x106ed81
#    0x1068c04    runtime/pprof.runtime_goroutineProfileWithLabels+0x24    /usr/local/Cellar/go/1.17.5/libexec/src/runtime/mprof.go:746
#    0x209e6f8    runtime/pprof.writeRuntimeProfile+0x178            /usr/local/Cellar/go/1.17.5/libexec/src/runtime/pprof/pprof.go:724
#    0x209e34f    runtime/pprof.writeGoroutine+0x8f            /usr/local/Cellar/go/1.17.5/libexec/src/runtime/pprof/pprof.go:684
#    0x20995e6    runtime/pprof.(*Profile).WriteTo+0xa6            /usr/local/Cellar/go/1.17.5/libexec/src/runtime/pprof/pprof.go:331
#    0x20aa46c    net/http/pprof.handler.ServeHTTP+0x42c            /usr/local/Cellar/go/1.17.5/libexec/src/net/http/pprof/pprof.go:253
#    0x20ab2c8    net/http/pprof.Index+0xe8                /usr/local/Cellar/go/1.17.5/libexec/src/net/http/pprof/pprof.go:371
#    0x15b1782    net/http.HandlerFunc.ServeHTTP+0x42            /usr/local/Cellar/go/1.17.5/libexec/src/net/http/server.go:2047
#    0x15b4154    net/http.(*ServeMux).ServeHTTP+0x134            /usr/local/Cellar/go/1.17.5/libexec/src/net/http/server.go:2425
#    0x15b6559    net/http.serverHandler.ServeHTTP+0x479            /usr/local/Cellar/go/1.17.5/libexec/src/net/http/server.go:2879
#    0x15b0617    net/http.(*conn).serve+0x18b7                /usr/local/Cellar/go/1.17.5/libexec/src/net/http/server.go:1930

1 @ 0x106ed81

我这个程序开了10个groutine去同步,我们可以看到10个worker都hang在了这里

goroutine profile: total 56
10 @ 0x103d0b2 0x104d4b2 0x1e3ef6c 0x1e443d5 0x1fef5a5 0x1feeaa5 0x106ed81
#    0x1e3ef6b    github.com/elastic/go-elasticsearch/v7/esutil.(*bulkIndexer).Add+0x12b    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:229
#    0x1e443d4    aqgs/sync_assistant/internal/toes.ToES+0x394                /Users/songzhibin/go/src/aqgs/sync_assistant/internal/toes/toes.go:30
#    0x1fef5a4    aqgs/sync_assistant/internal/task/risk.StorageSoftware+0x604        /Users/songzhibin/go/src/aqgs/sync_assistant/internal/task/risk/software.go:141
#    0x1feeaa4    aqgs/sync_assistant/internal/task/risk.PoolStorageSoftware+0x1c4    /Users/songzhibin/go/src/aqgs/sync_assistant/internal/task/risk/software.go:106

主程卡在了发送消息到channel中

1 @ 0x103d0b2 0x1006e7c 0x1006bfd 0x1fee4b3 0x20adcf8 0x1e54f67 0x1e55d93 0x1e5536f 0x20ad31f 0x20ae037 0x103cc93 0x106ed81
#    0x1fee4b2    aqgs/sync_assistant/internal/task/risk.Software+0xcf2    /Users/songzhibin/go/src/aqgs/sync_assistant/internal/task/risk/software.go:80
#    0x20adcf7    aqgs/sync_assistant/cmd.glob..func4+0x277        /Users/songzhibin/go/src/aqgs/sync_assistant/cmd/risk.go:53
#    0x1e54f66    github.com/spf13/cobra.(*Command).execute+0xb66        /Users/songzhibin/go/pkg/mod/github.com/spf13/cobra@v1.3.0/command.go:860
#    0x1e55d92    github.com/spf13/cobra.(*Command).ExecuteC+0x8b2    /Users/songzhibin/go/pkg/mod/github.com/spf13/cobra@v1.3.0/command.go:974
#    0x1e5536e    github.com/spf13/cobra.(*Command).Execute+0x2e        /Users/songzhibin/go/pkg/mod/github.com/spf13/cobra@v1.3.0/command.go:902
#    0x20ad31e    aqgs/sync_assistant/cmd.Execute+0x3e            /Users/songzhibin/go/src/aqgs/sync_assistant/cmd/root.go:55
#    0x20ae036    main.main+0x16                        /Users/songzhibin/go/src/aqgs/sync_assistant/main.go:21
#    0x103cc92    runtime.main+0x1f2                    /usr/local/Cellar/go/1.17.5/libexec/src/runtime/proc.go:255

点进去源码可以看到,阻塞到了select这里

定位到sdk代码

我们切回到 pprof中full groutine stack dump中详细看一下

goroutine 175 [select, 20 minutes]:
github.com/elastic/go-elasticsearch/v7/esutil.(*bulkIndexer).Add(0xc00034a000, {0x2431250, 0xc00003e0d0}, {{0x0, 0x0}, {0x2353093, 0x5}, {0xc002dceea0, 0x2e}, {0x0, ...}, ...})
    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:229 +0x12c
aqgs/sync_assistant/internal/toes.ToES.func1({_, _}, {{0x0, 0x0}, {0x2355caa, 0x6}, {0xc002dceea0, 0x2e}, {0x0, 0x0}, ...}, ...)
    /Users/songzhibin/go/src/aqgs/sync_assistant/internal/toes/toes.go:41 +0x27d
github.com/elastic/go-elasticsearch/v7/esutil.(*worker).flush(0xc00069e7e0, {0x2431250, 0xc00003e0d0})
    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:590 +0x163a
github.com/elastic/go-elasticsearch/v7/esutil.(*worker).run.func1()
    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:386 +0x7aa
created by github.com/elastic/go-elasticsearch/v7/esutil.(*worker).run
    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:351 +0x68

goroutine 176 [select, 20 minutes]:
github.com/elastic/go-elasticsearch/v7/esutil.(*bulkIndexer).Add(0xc00034a000, {0x2431250, 0xc00003e0d0}, {{0x0, 0x0}, {0x2353093, 0x5}, {0xc000c1a450, 0x2e}, {0x0, ...}, ...})
    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:229 +0x12c
aqgs/sync_assistant/internal/toes.ToES.func1({_, _}, {{0x0, 0x0}, {0x2355caa, 0x6}, {0xc000c1a450, 0x2e}, {0x0, 0x0}, ...}, ...)
    /Users/songzhibin/go/src/aqgs/sync_assistant/internal/toes/toes.go:41 +0x27d
github.com/elastic/go-elasticsearch/v7/esutil.(*worker).flush(0xc00069e840, {0x2431250, 0xc00003e0d0})
    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:590 +0x163a
github.com/elastic/go-elasticsearch/v7/esutil.(*worker).run.func1()
    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:386 +0x7aa
created by github.com/elastic/go-elasticsearch/v7/esutil.(*worker).run
    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:351 +0x68

goroutine 177 [select, 20 minutes]:
github.com/elastic/go-elasticsearch/v7/esutil.(*bulkIndexer).Add(0xc00034a000, {0x2431250, 0xc00003e0d0}, {{0x0, 0x0}, {0x2353093, 0x5}, {0xc0034cc360, 0x2e}, {0x0, ...}, ...})
    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:229 +0x12c
aqgs/sync_assistant/internal/toes.ToES.func1({_, _}, {{0x0, 0x0}, {0x2355caa, 0x6}, {0xc0034cc360, 0x2e}, {0x0, 0x0}, ...}, ...)
    /Users/songzhibin/go/src/aqgs/sync_assistant/internal/toes/toes.go:41 +0x27d
github.com/elastic/go-elasticsearch/v7/esutil.(*worker).flush(0xc00069e8a0, {0x2431250, 0xc00003e0d0})
    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:590 +0x163a
github.com/elastic/go-elasticsearch/v7/esutil.(*worker).run.func1()
    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:386 +0x7aa
created by github.com/elastic/go-elasticsearch/v7/esutil.(*worker).run
    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:351 +0x68

我挑出重要的栈信息

goroutine 176 [select, 20 minutes]:
github.com/elastic/go-elasticsearch/v7/esutil.(*worker).flush(0xc00069e840, {0x2431250, 0xc00003e0d0})
    /Users/songzhibin/go/pkg/mod/github.com/elastic/go-elasticsearch/v7@v7.17.0/esutil/bulk_indexer.go:590 +0x163a

再结合我的代码

阿偶,问题找到了,在buffer满的时候刷新调用onfailure,flush时执行失败后调用onfailure,我的onfailure又调用了一个add方法导致消费者死锁!!!

解决方法当然也很简单,onfailure的任务直接go出去,或者推到另外一个channel中异步处理,齐活。。。

posted @ 2022-02-21 16:46  Binb  阅读(345)  评论(0编辑  收藏  举报