For past few months we’ve been working on improving gRPC-Go performance. This includes improving network utilization, optimizing CPU usage and memory allocations. Most of our recent effort has been focused around revamping gRPC-Go flow control. After several optimizations and new features we’ve been able to improve quite significantly, especially on high-latency networks. We expect users that are working with high-latency networks and large messages to see an order of magnitude performance gain. Benchmark results at the end.
This blog summarizes the work we have done so far (in chronological order) to improve performance and lays out our near-future plans.
This is an optimization used by gRPC-C to achieve performance benefits for large messages. The idea is that when there’s an active read by the application on the receive side, we can effectively bypass stream-level flow control to request the whole message. This proves to be very helpful with large messages. Since the application is already committed to reading and has allocated enough memory for it, it makes sense that we send a proactive large window update (if necessary) to get the whole message rather than receiving it in chunks and sending window updates when we run low on window.
This optimization alone provided a 10x improvement for large messages on high-latency networks.
After having several discussions with gRPC-Java and gRPC-C core team we realized that gRPC-Go’s connection-level flow control was overly restrictive in the sense that window updates on the connection depended upon if the application had read data from it or not. It must be noted that it makes perfect sense to have stream-level flow control depended on application read but not so much for connection-level flow control. The rationale is as follows: A connection is shared by several streams (RPCs). If there were at least one stream that read slowly or didn’t read at all, it would hamper the performance or completely stall other streams on that connection. This happens because we won’t send out window updates on the connection until that slow or inactive stream read data. Therefore, it makes sense to decouple the connection’s flow control from application reads.
However, this begs at least two questions:
Won’t a client be able to send as much data as it wants to the server by creating new streams when one runs out?
Why even have connection-level flow control if the stream-level flow control is enough?
The answer to the first question is short and simple: no. A server has an option to limit the number of streams that it intends to serve concurrently. Therefore, although at first it may seem like a problem, it really is not.
The need for connection-level flow control:
It is true that stream-level flow control is sufficient to throttle a sender from sending too much data. But not having connection-level flow control (or using an unlimited connection-level window) makes it so that when things get slower on a stream, opening a new one will appear to make things faster. This will only take one so far since the number of streams are limited. However, having a connection-level flow control window set to the Bandwidth Delay Product (BDP) of the network puts an upper-bound on how much performance can realistically be squeezed out of the network.
Sending a window update itself has a cost associated to it; a flush operation is necessary, which results in a syscall. Syscalls are blocking and slow. Therefore, when sending out a stream-level window update, it makes sense to also check if a connection-level window update can be sent using the same flush syscall.
This feature is the latest and in some ways the most awaited optimization feature that has helped us close the final gap between gRPC and HTTP/1.1 performance on high latency networks.
Bandwidth Delay Product (BDP) is the bandwidth of a network connection times its round-trip latency. This effectively tells us how many bytes can be “on the wire” at a given moment, if full utilization is achieved.
The algorithm to compute BDP and adapt accordingly was first proposed by @ejona and later implemented by both gRPC-C core and gRPC-Java (note that it isn’t enabled in Java yet). The idea is simple and powerful: every time a receiver gets a data frame it sends out a BDP ping (a ping with unique data only used by BDP estimator). After this, the receiver starts counting the number of bytes it receives (including the ones that triggered the BDP ping) until it receives the ack for that ping. This total sum of all bytes received in about 1.5 RTT (Round-Trip Time) is an approximation of the effective BDP * 1.5. If this is close to our current window size (say, more than 2/3rd of it) we must increase the window size. We put our window sizes (both streaming and connection) to be twice the BDP we sampled(total sum of all bytes received).
This algorithm by itself could cause the BDP estimation to increase indefinitely; an increase in window will result in sampling more bytes which in turn will cause the window to be increased further. This phenomenon is called buffer-bloat and was discovered by earlier implementations in gRPC-C core and gRPC-Java. The solution to this is to calculate the bandwidth for every sample and check if it is greater than the maximum bandwidth noted so far. If so, only then increase our window sizes. The bandwidth, as we know, can be calculated by dividing the sample by RTT * 1.5 (remember the sample was for one and a half round trips). If the bandwidth doesn’t increase with an increase in sampled bytes that’s indicative that this change is because of an increased window size and doesn’t really reflect the nature of the network itself.
While running experiments on VMs in different continents we realized that every once in awhile a rogue, unnaturally fast ping-ack at the right time (really the wrong time) would cause our window sizes to go up. This happens because such a ping-ack would cause us to notice a decreased RTT and calculate a high bandwidth value. Now if that sample of bytes was greater than 2/3rd of our window then we would increase the window sizes. However, this ping ack was an aberration and shouldn’t have changed our perception of the network RTT al together. Therefore, we keep a running average of the RTTs we note weighted by a constant rather than the total number of samples to heed more to recent RTTs and less to the ones in past. It is important because networks might change over time.
During implementation, we experimented with several tuning parameters, such as the multiplier to compute the window size from the sample size to select the best settings, that balanced between growth and accuracy.
Given that we’re always bound by the flow control of TCP which for most cases is upper bounded at 4MB, we bound the growth of our window sizes by the same number: 4MB.
BDP estimation and dynamically adjusting window sizes is turned-on by default and can be turned off by setting values manually for connection and/or stream window sizes.
We are now looking into improving our throughput by better CPU utilization, the following efforts are in-line with that.
We noticed a bug in our transport layer which causes us to make a flush syscall for every data frame we write, even if the same goroutine has more data to send. We can batch a lot of these writes to use only one flush. This in fact will not be a big change to the code itself.
In our efforts to get rid of unnecessary flushes we recently combined the headers and data write for unary and server streaming RPCs to one flush on the client-side. Link to code
Another related idea proposed by one of our users @petermattic in this PR was to combine a server response to a unary RPC into one flush. We are currently looking into that as well.
For every data frame read from the wire a new memory allocation takes place. The same holds true at the gRPC layer for every new message for decompressing and decoding. These allocations result in excessive garbage collection cycles, which are expensive. Reusing memory buffers can reduce this GC pressure, and we are prototyping approaches to do so. As requests need buffers of differing sizes, one approach would be to maintain individual memory pools of fixed sizes (powers of two). So now when reading x bytes from the wire we can find the nearest power of 2 greater than x and reuse a buffer from our cache if available or allocate a new one if need be. We will be using golang sync Pools so we don’t have to worry about garbage collection. However, we will need to run sufficient tests before committing to this.
Benchmark on a real network:
|Message Size||GRPC||HTTP 1.1|
|1 KB||~152 ms||~152 ms|
|10 KB||~152 ms||~152 ms|
|10 KB||~152 ms||~152 ms|
|1 MB||~152 ms||~152 ms|
|10 MB||~622 ms||~630 ms|
|100 MB||~5 sec||~5 sec|
|GRPC||HTTP 2.0||HTTP 1.1|
|GRPC||HTTP 2.0||HTTP 1.1|
|GRPC||HTTP 2.0||HTTP 1.1|