This week we finally got to the bottom of a perplexing issue that we have been examining for a number of months now. We had been seeing intermittent HTTP timeouts and other unusual disconnection problems affecting LUSID, our investment data management platform. These were most obvious in our build pipeline, where they were causing test failures.
Tests, tests and more tests
In order to respond to our clients’ needs quickly, we have always followed a continuous integration and continuous delivery methodology, and we routinely release multiple new versions of LUSID each day. To do this safely, every change is subjected to an extensive suite of automated tests, designed to detect defects in the code, before they reach our users.
As such, the efficient operation of our release pipeline is essential to the rate at which we can deliver new features and enhancements to our customers. While doing some housekeeping to improve the stability of the pipeline (fixing ‘flaky tests’ and speeding up long-running tasks), we found that there was still a low level of seemingly random failures across our test suites.
These failures would always manifest in a similar way: either as a client-side HTTP timeout waiting for an API request to complete, or as TCP connection-reset errors.
We gather extensive telemetry about every API request processed by the platform, and we began to analyse the data to try to understand what was going wrong. It quickly became apparent to our engineering team that the timeouts were not related to long-running work in the system. Our logs showed that the offending requests were initiated promptly, and then proceeded to complete in a typical timeframe. But for some reason, the client-side code running the tests appeared never to receive a response and would ultimately time out. What could cause us to randomly ‘lose’ API responses?
We correlated our application server logs with our Nginx and AWS traffic logs to see if anything unusual showed up, but there was nothing obviously wrong. Just a few Nginx “499” responses and some TCP connection-reset events on the AWS load balancers, but no consistent failure pattern. The data indicated an intermittent network issue, so we embarked on the onerous task of getting packet-capture traces to try and spot where the issue was…
No single points of failure
LUSID is a mission-critical system for our clients, so we deploy to multiple Availability Zones (“AZs”) in every AWS region we operate in. An AZ is effectively an independent data centre, so by distributing ourselves across them, we can carry on serving requests even if an AZ fails. Which they do – the most recent example was when AZ4 in US-East-1 had problems on December 22nd last year (which, thankfully, we weathered with no client impact).
During a similar episode a year or so ago, we detected problems serving requests from another of our AZs. This was due to a failed cooling system in one of the AWS data centres, which flooded some of the racks with water! On that occasion, we elected to fail away from the damaged AZ and spun up additional capacity elsewhere. However, we were unable to safely remove the Network Load Balancer from that AZ, so we enabled “Cross Zone Load Balancing”, which permitted the NLB in the degraded AZ to route requests to one of the other, healthy AZs.
Following that incident post-mortem, we elected to leave this feature turned on, given it seemed to make the solution more robust to these kinds of failures. What we hadn’t appreciated was that this exposed us to a different kind of problem…
Where’s that book on TCP/IP?
Before we get to the nuts and bolts of the issue, let’s first recap on a few TCP/IP basics…
A TCP connection is uniquely identified by a combination of IP-Address and Port pairs, each pair representing one end of the connection. This 4-part key is used by devices all along the connection path, to uniquely identify each connection being handled.
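As a minimal sketch of that idea, the snippet below models the 4-part key as a Python tuple type and uses it to index a connection table. The `ConnKey` name, the table and the addresses (taken from the reserved documentation ranges) are all illustrative assumptions, not anything from our real systems.

```python
from typing import NamedTuple


class ConnKey(NamedTuple):
    """The 4-part key: one (IP, port) pair for each end of the connection."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


# A hypothetical connection table, as a server or network device might keep.
connections: dict[ConnKey, str] = {}

# Same source address and port, but two different destinations.
a = ConnKey("203.0.113.10", 40000, "198.51.100.1", 443)
b = ConnKey("203.0.113.10", 40000, "198.51.100.2", 443)

connections[a] = "conversation A"
connections[b] = "conversation B"

# The keys differ in one of their four parts, so both entries coexist.
print(len(connections))
```

Two connections only collide in such a table if all four parts match.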
However, intermediate network devices are allowed to re-write those IP addresses and/or ports if required. A common example would be a router with Network Address Translation (NAT) enabled. NAT can be used to hide the private IP addresses of machines from devices on the public internet, by substituting them with an alternative address owned by the NAT router. To the remote end of the connection, it appears as if the connection is established with the NAT router, but in practice the NAT router is forwarding traffic to another machine on the private network.
Now, imagine we have client machines that are trying to make HTTP requests to a remote web service (i.e. LUSID). If these client devices sit behind a NAT device, it is possible for two different clients to present the same IP address and port combination to LUSID – the NAT router is perfectly allowed to re-use its own ports to proxy multiple connections, provided the connections are going to different destinations (because the 4-part IP + port combination is still unique). Clients will often connect to different IP addresses when calling LUSID’s APIs: we have a DNS record which resolves to 3 different IP addresses, each representing the ingress point in each of the 3 AZs we operate in. Specifically, each IP relates to the AWS Network Load Balancer (NLB) in each AZ.
By default, AWS NLBs have ‘Client IP Preservation’ enabled. This means they do not do NAT on connections which they intermediate. Instead, they propagate the IP-Address and Port of the upstream device. This is useful, as it means our servers have visibility of where the traffic actually originated from, rather than simply seeing the IP address of our NLBs, at the AWS network periphery.
If you also enable Cross-AZ load balancing on the NLBs (as we had done), the NLBs are able to route traffic to any of the target machines in any of the AZs, not just targets in their ‘local’ availability zone.
Now we have a potential problem… With this configuration, it’s possible for two different end-clients (with different private IP addresses) to independently connect to two different NLBs (with different public IPs, in different AZs), and for those NLBs to route both connections to the same individual web server in our cluster, which has a single IP address. If the NAT device on the client side also happens to map those two clients to the same IP-Address and Port on the NAT router, then we end up with two logically separate TCP connections which present exactly the same source Address + Port pair (i.e. the NAT router) to our web server, using the same destination IP address and port (i.e. HTTPS on 443).
Needless to say, this confuses the machines involved, as we’ve managed to end up with two different ‘conversations’ happening on the same ‘line’! The TCP packets observed by the servers often don’t conform to the protocol spec, and this results in the server aborting the connection by sending a TCP RST (reset) packet to the client.
It was these RST packets that were manifesting as HTTP 499 response codes in our intermediate Nginx proxy logs, or as ‘Connection Reset By Peer’ errors, at the HTTP client. However, we also observed plenty of examples where responses simply went missing – the client’s request made it OK to the server, the server processed the request and returned a response, but the response never made it back to the caller (which ultimately then timed out waiting). We’re still not 100% sure what the failure mode is here.
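The collision described above can be sketched as a toy model. This is a simplification under stated assumptions rather than a trace of real traffic: every address, port and function name below is hypothetical, the NAT step only rewrites the source pair, and the NLB step only swaps the destination for the chosen target while leaving the source untouched (Client IP Preservation).

```python
from typing import NamedTuple


class FourTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


# Hypothetical addresses from documentation and private ranges.
NAT_PUBLIC_IP = "203.0.113.10"
NLB_AZ_A = "198.51.100.1"
NLB_AZ_B = "198.51.100.2"
WEB_SERVER = "10.0.1.5"


def nat_translate(conn: FourTuple, nat_port: int) -> FourTuple:
    """Replace the private source pair with the NAT router's address and port."""
    return conn._replace(src_ip=NAT_PUBLIC_IP, src_port=nat_port)


def nlb_forward(conn: FourTuple, target_ip: str) -> FourTuple:
    """Client IP Preservation: keep the source pair, change only the destination."""
    return conn._replace(dst_ip=target_ip)


# Two different clients, each resolving the DNS name to a different NLB.
client_1 = FourTuple("192.168.0.11", 51000, NLB_AZ_A, 443)
client_2 = FourTuple("192.168.0.12", 52000, NLB_AZ_B, 443)

# The NAT router reuses port 40000, which is allowed because the
# destinations differ, so the two 4-part keys are still unique here.
on_wire_1 = nat_translate(client_1, 40000)
on_wire_2 = nat_translate(client_2, 40000)
assert on_wire_1 != on_wire_2

# With cross-zone load balancing, both NLBs may pick the same web server.
at_server_1 = nlb_forward(on_wire_1, WEB_SERVER)
at_server_2 = nlb_forward(on_wire_2, WEB_SERVER)

# The server now sees two connections with an identical 4-part key.
collision = at_server_1 == at_server_2
print(collision)
```

In this model the keys stay distinct if each NLB can only reach targets in its own AZ, or if the NLB substitutes its own address as the source.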
The Fix
The solution was to disable the Cross-AZ load balancing, at which point the intermittent failures immediately disappeared. We also found that disabling Client IP Preservation had the same result.
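For readers who manage their load balancers with boto3, a sketch of the two settings might look like the following. This is an assumption about tooling, not a description of how we applied the change: the helper names and ARNs are hypothetical, while `modify_load_balancer_attributes`, `modify_target_group_attributes` and the two attribute keys belong to the AWS Elastic Load Balancing v2 API. The client is passed in as a parameter, so the functions themselves do not import boto3.

```python
def disable_cross_zone(elbv2_client, load_balancer_arn: str):
    """Turn off cross-zone load balancing on a Network Load Balancer."""
    return elbv2_client.modify_load_balancer_attributes(
        LoadBalancerArn=load_balancer_arn,
        Attributes=[
            {"Key": "load_balancing.cross_zone.enabled", "Value": "false"},
        ],
    )


def disable_client_ip_preservation(elbv2_client, target_group_arn: str):
    """Turn off Client IP Preservation, which is a target group attribute."""
    return elbv2_client.modify_target_group_attributes(
        TargetGroupArn=target_group_arn,
        Attributes=[
            {"Key": "preserve_client_ip.enabled", "Value": "false"},
        ],
    )
```

A caller would typically create the client with `boto3.client("elbv2")` and pass in their own ARNs. Note that with Client IP Preservation disabled, the servers see the NLB's address rather than the original client's, so only one of the two changes may suit a given setup.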
Interestingly, the issue seemed to be much more pronounced for some sources of traffic than others. For example, our GitHub Actions pipelines – where we build our open-source SDKs – were affected much more than some of our internal test runners. We presume this was due to different NAT behaviour on the GitHub side.
Continuous evolvement
This experience was a reminder that network issues can be universally difficult to pin down, especially if they extend into clients’ infrastructure.
To help with this, we are looking to introduce additional telemetry within our LUSID SDKs, so we can automatically receive reports of faults via an out-of-band channel, making it much more obvious when these kinds of problems occur.
We want LUSID to be the technology backbone for the global investment community, so our customers need to be able to depend on us 24 hours a day, 7 days a week. Continuous evolvement and a focus on technical quality are part of what drives us, particularly for my colleagues and me in the Engineering team.
This one was definitely high up on our leader-board of ‘weird failures’ though!