我最近遇到 docker push 推送镜像时,某些层卡住,导致推送失败的问题。
只在通过 5g 无线信号连接路由器时会遇到,通过 2.4g 信号 或者 网线连接路由器的方式不会遇到该故障。
环境信息:
MacOS: 12.2.1 (21D62)
Surge:Version 4.4.1 (1532)
路由器:ASUS AC-86u
docker:4.4.2 (73305)
docker 镜像仓库:私有的,需要通过一个 http 代理的方式访问。(Surge 中配置了该代理)
网络流量走向是:
- com.docker.vpnkit 进程与 Surge 建立一个【本地连接】,127.0.0.1:xxx -> 127.0.0.1:6152(Surge 的代理端口)
- Surge 建立一个【外部连接】192.168.1.xxx:xxx -> 外部代理服务器
使用了 netstat -ant 和 tcpdump 观察发现:
出现推送失败时,netstat 显示,【外部连接】出问题的连接在 Send-Q 还很大时(20w+),tcp 的连接状态就变成了 FIN_WAIT_1,该状态表示主动关闭了连接。
然后【本地连接】就立即关闭,docker就感知到了推送失败。
而【外部连接】可能还需要30s+才能把 Send-Q 减小到0才正式关闭该 TCP 连接。
通过打开 Surge 的调试日志发现,【外部连接】先是出现了一个错误:“Write operation timed out“,然后就主动关闭了【内部连接】(6152 端口主动发送了 RST ACK 的包)。具体错误如下:
# TCP 建立连接时
2022-02-12 17:10:47.564357 <INFO> [SGTunnelConnection-2645] Start CONNECT tunnel to host: <xxx domain>:443, SGSocket-6880
2022-02-12 17:10:47.564588 <VERBOSE> [SGTunnelConnection-2645] <initWithSocket:CONNECTHeader:dispatchQueue: (SGTunnelConnection.m:77)> Record ID: 4107
...
2022-02-12 17:10:47.565234 <VERBOSE> [SGConnectorPool] <requestConnectorWithPolicy:targetHostname:targetPort:earlyData:ipVersionDescriptor:dispatchQueue:logger:completionHandler:failureHandler: (SGConnectorPool.m:119)> Create connector request pool: <xxx domain> SGConnectionSetupContext-3972
2022-02-12 17:10:47.565313 <VERBOSE> [SGConnectionSetupContext-3972] <start (SGConnectorPool.m:271)> New connector: SGHTTPProxyConnector-705
2022-02-12 17:10:47.565368 <VERBOSE> [SGHTTPProxyConnector-705] <createSocketWithCompletionHandler: (SGConnector.m:401)> Socket created: SGSocket-6881
2022-02-12 17:10:47.565392 <INFO> [SGHTTPProxyConnector-705] Start connection to proxy server: SGHTTPProxyConnector-705 -> <http-proxy-ip>:52975
...
2022-02-12 17:10:47.660637 <VERBOSE> [SGConnectionSetupContext-3972] <connectorDidBecomeAvailable: (SGConnectorPool.m:401)> connectorDidBecomeAvailable: SGHTTPProxyConnector-705
2022-02-12 17:10:47.660671 <VERBOSE> [SGTunnelConnection-2645] <connectorDidBecomeAvailable: (SGTunnelConnection.m:244)> connectorDidBecomeAvailable
# TCP 断开连接时
2022-02-12 17:11:18.380143 <VERBOSE> [SGHTTPProxyConnector-705] <socketDidDisconnect:withError: (SGConnector.m:233)> Connection disconnect: Error Domain=SGSocketErrorDomain Code=4 "Write operation timed out" UserInfo={NSLocalizedDescription=Write operation timed out}
2022-02-12 17:11:18.380190 <VERBOSE> [SGHTTPProxyConnector-705] <disconnectWithError: (SGConnector.m:350)> disconnectWithError: Error Domain=SGSocketErrorDomain Code=4 "Write operation timed out" UserInfo={NSLocalizedDescription=Write operation timed out}
2022-02-12 17:11:18.380225 <VERBOSE> [SGHTTPProxyConnector-705] <disconnectWithError: (SGConnector.m:364)> disconnectAfterWritingHoldRef
2022-02-12 17:11:18.380272 <VERBOSE> [SGTunnelConnection-2645] <connectorDidDisconnect:withError: (SGTunnelConnection.m:307)> Connector closed: Error Domain=SGSocketErrorDomain Code=4 "Write operation timed out" UserInfo={NSLocalizedDescription=Write operation timed out}
2022-02-12 17:11:18.380318 <VERBOSE> [SGTunnelConnection-2645] <disconnectWithReason: (SGTunnelConnection.m:173)> disconnectWithReason: Closed by remote (Write operation timed out)
2022-02-12 17:11:18.380383 <VERBOSE> [SGLogRequestRecord-4107] <markRecordInactive (SGLogRecord.m:256)> markRecordInactive
2022-02-12 17:11:18.380487 <VERBOSE> [SGConnectionManager] <releaseTunnelConnection: (SGConnectionManager.m:286)> Release tunnel connection: SGTunnelConnection-2645 <xxx domain>:443 (Last activity seconds: 30)
2022-02-12 17:11:18.380522 <VERBOSE> [SGTunnelConnection-2645] <dealloc (SGBaseObjectWithDispatchQueue.m:180)> Dealloc
2022-02-12 17:11:18.380559 <VERBOSE> [SGHTTPProxyConnector-705] <dealloc (SGBaseObjectWithDispatchQueue.m:180)> Dealloc
...
2022-02-12 17:11:18.380850 <VERBOSE> [SGSocket-6880] <dealloc (SGBaseObjectWithDispatchQueue.m:180)> Dealloc
我也试过换成 ClashX Pro,依旧使用相同的代理。
在 2.4g 网络环境下,推送镜像很快(2MB/s),而切换到 5g 网络环境下,推送镜像很慢(20-200KB/s),最终也能推送成功。
不清楚 Surge 这里的错误日志:’Connection disconnect: Error Domain=SGSocketErrorDomain Code=4 "Write operation timed out"‘ 是不是一个导火索,导致了立即断开【内部连接】。
后面还有一个错误日志: (Last activity seconds: 30)
这些个错误提示在 google 上也没搜索到有用的日志。
该错误也不是一开始就有,而是传输了一段时间后才出现的。
不清楚 Surge 在这块是怎么处理的。
(我一直使用 Surge,平常也没有遇到这种稀奇古怪的问题。)