In December 2019, we published the 2020 Cloud Report and an accompanying blog post, which summarized original research we conducted benchmarking the performance of Amazon Web Services (AWS), Microsoft Azure (Azure), and the Google Cloud Platform (GCP).
The response from our customers and the cloud providers was tremendous. Before publishing, we shared our findings with each of the benchmarked clouds and allowed time for their review and feedback. After publishing, we followed up to hear further feedback and to improve upon next year’s Cloud Report. We also met with a number of customers to answer follow-up questions about what our learnings meant for their mission-critical workloads.
Much of the feedback we heard from the cloud providers concerned how each cloud was tuned for performance. In our conversations with AWS, Azure, and GCP, we heard directly how we could have better tuned their products and services for optimal performance, both on benchmarks and on real-world workloads. We also discussed how we might revise the testing plan for the 2021 Cloud Report to gain an even more thorough picture of OLTP workload performance across AWS, Azure, and GCP.
The following blog post highlights some of our key learnings about how to get the best performance from each cloud, as well as changes we are contemplating to the testing suite for the 2021 Cloud Report.
CockroachDB used each cloud’s default configuration for running the benchmarks included in the 2020 Cloud Report. We opted to use the defaults because we wanted to avoid biasing the results with claims that we better configured one cloud over another, and we thought it would be representative of performance as many cloud customers would not know how to configure machine types differently from the base defaults.
Deciding how much tuning one should do versus selecting the default values is a delicate balance to strike. For example, one cloud may offer a feature as a tuning parameter, while another lists it as a separate machine type. Additionally, some defaults may be constrained, and in these cases there should be sufficient guidance in the documentation. While we stand by our decision to use each cloud’s default configuration, we learned that some of these default configuration settings can result in unintended consequences.
AWS showed well in both of the previous cloud reports. As a result, the majority of their feedback has been about which additional machine types they recommend we test in the future, as well as which tests are best in class for measuring OLTP database performance.
AWS is the only cloud where we tested AMD machines. Specifically, we tested four different AMD EPYC 7000 series instance types, which each contain an “a” specifier. The CPU benchmarks do not show a significant difference one way or the other when comparing these against the Intel processors. Since there’s so much variance in processor types available on AWS, we think it's worth paying extra close attention to the processor type used with AWS.
AWS recently announced support for new machine types which were not available when we ran the numbers for the 2020 Cloud Report. AWS claims that the new “general-purpose (M6g), compute-optimized (C6g), and memory-optimized (R6g) Amazon EC2 instances deliver up to 40% improved price/performance over current-generation M5, C5, and R5 instances for a broad spectrum of workloads including application servers, open-source databases, etc.” We’re excited to test these machine types out and to see AWS invest in even more improvements.
Azure provided a large amount of feedback on aspects of the configurations we ran in the Cloud Report, in large part because they offer so many configuration options that it can be hard to get yours right out-of-the-box.
As we noted in the report, testing the number of vCPUs allocated to the VM instances was out of scope for the 2020 Report. We focused only on 16 vCPU VMs, which offers some advantages, such as a smaller price discrepancy between clouds. Although we did not test this, Azure noted that, according to their documentation, their larger VMs “perform and scale linearly, compared to than most competitive SKUs which allocate disproportionate resources to smaller sizes.” Azure went on to claim that “AWS and GCP skew network allocation of network and remote storage on their smaller VMs whereas Azure allocates proportionately to size. As a result, as Azure scales up to larger VM sizes (32 vCPUs+), their performance typically improves much more than AWS and GCP.” We didn’t test this effect, but it's worth noting that varying the number of vCPUs could have a large impact on performance, particularly with Azure. It’s also important to note that these scaling claims may be dependent on the workload.
CPU performance is mostly influenced by the processor type, and (at the time of writing) most Azure systems use older Haswell/Broadwell types. Like AWS, Azure will start using new AMD-based Da/Ea v4 systems soon, which they note “perform comparably to the newer Intel systems” tested in this report on the other clouds. Azure will additionally offer some Skylake Dv3/Ev3 as well as newer Intel CPUs in 2020. It’s worth noting that some of Azure’s older systems (GS4, DS14v2) did well “because they’re full core vCPUs while the newer systems (Dv3/Ev3) will be hyperthreaded (like AWS and GCP).”
Azure offers a large number of configuration options that can be tricky to get right. In the 2020 Cloud Report, we provisioned Azure VMs using the REST API from the command line. Accessing the REST API directly is an advanced deployment method with precise controls, and it currently requires an explicit flag to enable Accelerated Networking. If we had provisioned these nodes directly from Azure’s UI, however, they would have been configured with Accelerated Networking. Azure indicated to us that Accelerated Networking dramatically improves performance, and that the CLI enables it by default when the NIC is chosen at VM creation time rather than created before the VM. This intricacy can be confusing to end users (and to us) and can make it tricky to determine how best to provision Azure nodes.
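As an illustration only (this is not the provisioning code used for the report), a network interface request body that asks for Accelerated Networking explicitly might be assembled like this. We believe `enableAcceleratedNetworking` is the relevant NIC property in Azure's network interface API, but treat that as an assumption to verify; the function name, location, and subnet ID are placeholders, and a real request needs more fields than shown.

```python
import json


def nic_request_body(location, subnet_id, accelerated=True):
    """Return a dict shaped like an Azure network interface create request."""
    return {
        "location": location,
        "properties": {
            # Assumed property name; must be set explicitly when the
            # REST API is called directly instead of the portal UI.
            "enableAcceleratedNetworking": accelerated,
            "ipConfigurations": [
                {
                    "name": "ipconfig1",
                    "properties": {"subnet": {"id": subnet_id}},
                }
            ],
        },
    }


# Placeholder values, not taken from the report.
body = nic_request_body("eastus", "<subnet-resource-id>")
print(json.dumps(body, indent=2))
```

Keeping the setting as an explicit argument makes it visible in code review, which is the kind of default that is easy to miss otherwise.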
Additionally, we configured both “regions” and “zones” for AWS and GCP, while for Azure we configured only “locations” (i.e., regions) and not “proximity placement groups” or zones. As a result, Azure pointed out that “this leads to longer physical paths (through the host and across greater distances) in the region and hence greater latency.” Further muddying the water, AWS subnets are restricted to a single zone (impacting placement) while Azure subnets are not. Since AWS restricts placement to a single zone, the machines will be physically closer together. The primary effect of this is that latency will be reduced between the machines (since the signal has to travel less physical distance).
In our report, we mentioned that Microsoft did not publish network throughput expectations for their VMs. This was inaccurate and those expectations can be found in their documentation, although they were somewhat tricky for us to track down.
Azure went on to share with us that Azure HB120rs (v2) can use a local Infiniband connection to hit 200 Gb/sec while HC44rs and HB60rs can do 100 Gb/sec on Infiniband, and 40 Gb/sec on local network. They again referred to their documentation here, indicating that “most of our full node VMs (e.g. D/E[a]64[s] v3) can do 30-32 Gb/sec, or 4x faster than the 16 vCPU VMs. AWS, by contrast, is only ~2.5x faster (16 -> 64 vCPUs).”
Azure offers a number of differentiated storage types. Again, we chose the defaults for each VM type where possible. Azure doesn’t offer default storage types so we chose the Premium SSD which appears to be equivalent to AWS and GCP. While we think this is defensible for our test, there are many options available. Azure highlighted that they expect their various permutations to behave in the following manner:
Azure also provides powerful host caching options which can greatly improve persistent storage read (and write) performance. Again, like many of Azure’s capabilities, this requires custom configuration.
Azure Lsv2 was not tested in the 2020 Cloud Report due to accidental omission. This machine type is Azure’s offering in the storage-optimized class. Although not included in the report this year, it will surely be considered as a candidate in 2021.
Azure will be offering newer VM SKUs in 2020 with “higher CPU performance” such as the Da_v4/Ea_v4 (AMD based) and additional Intel machines. They also plan to make improvements in the read/write latency across all storage types in 2020. We’re excited to see their improvements in 2020.
GCP’s new n2 series and c2 series make a big difference across many tests and they have future planned improvements as well. GCP took a big step forward with their network and shared some insight into their storage configurations.
GCP introduced both the n2 series and the c2 series earlier this year, both of which use the new Intel Cascade Lake Processor. We see from the benchmarks these result in significantly higher performance than the corresponding n1 series instances, which use last generation Intel Skylake processors.
GCP suggested using --min-cpu-platform=skylake for the n1 family of machines because they have higher network performance caps. [Important note: In our report, we mentioned that GCP did not publish network throughput expectations for their VMs. This was inaccurate; the expectations can be found in GCP's documentation.] As we see with other clouds, newer processors perform better than older processors. It’s critical to stay on top of the newest processor types offered by GCP and to explicitly request them when provisioning nodes, since they’re not on by default. It's also worth noting that this CPU platform is only available in some regions/zones, so you’ll need to consider availability in addition to performance.
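A minimal sketch of requesting a CPU platform explicitly when provisioning, written as a helper that assembles a gcloud command without running it. The helper name, instance name, zone, and machine type are placeholders. The platform value is passed through exactly as written above, and the spelling gcloud accepts should be checked against GCP's documentation.

```python
import shlex


def gcloud_create_args(name, zone, machine_type, min_cpu_platform=None):
    """Build (but do not run) a `gcloud compute instances create` command."""
    args = [
        "gcloud", "compute", "instances", "create", name,
        f"--zone={zone}",
        f"--machine-type={machine_type}",
    ]
    # The platform is not requested by default, so add it only when asked.
    if min_cpu_platform is not None:
        args.append(f"--min-cpu-platform={min_cpu_platform}")
    return args


# Placeholder instance name and zone; platform value copied from the text.
cmd = gcloud_create_args("bench-node-1", "us-central1-a", "n1-standard-16", "skylake")
print(shlex.join(cmd))
```

Because the requested platform is not available in every zone, a provisioning script would likely want to surface a failed request rather than silently fall back to the default.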
After the publication of the first Cloud Report in 2018, GCP confirmed that they were planning on investing heavily in network performance in the coming year. As promised, they rolled out increased network throughput limits over the past year, and we see those performance improvements reflected in the 2020 Cloud Report numbers. This type of increase benefits customers not just because it increases network throughput, but because it requires no direct intervention from GCP customers.
As we noted in the 2020 Cloud Report, the n1 series offers approximately the same network latency as we observed in our previous report. However, unlike in our CPU experiment, GCP’s n2 series dramatically improved its network latency. In addition, the c2 series offers even better network latency. GCP notes that customers would benefit from these improvements without even needing to reboot their VMs.
GCP prefers using ping with the -i=0 setting, which is only available if users run the command with sudo. We ran -i=.2 as our default across all three clouds. GCP claims that -i=0 might be more representative, and it is something we will consider for next year’s report. They also suggest investigating other benchmarking tools, such as netperf with the TCP_RR option. GCP noted that “it will make everyone’s numbers better and more consistent.” They attribute this smoothness to fewer wake-ups from CPU C-states, something we attempted to avoid this year by switching from the less frequent default periodicity we used in the 2018 Cloud Report.
GCP doesn't have a storage-optimized instance, but local SSD can be attached to most VMs with either NVMe or SCSI interfaces. SCSI is the default from the CLI, so users should be careful to specify NVMe if that’s what they’re after. GCP intentionally limits the number of different machine types available to reduce complexity for its users. Further, rather than requiring a complete switch of machine types, GCP gives customers the flexibility to start with one configuration and evolve it over time, since many aspects of the machine can be tweaked to suit the workload. GCP noted that SCSI is the default because it supports more operating systems, but refers users to their documentation for guidance in choosing a local SSD interface.
We like the flexibility GCP provides in allowing users the ability to configure the number of SSDs attached to a single host. Other clouds are not as accommodating.
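As a small sketch of the two points above (choosing the interface explicitly and choosing how many local SSDs to attach), the helper below produces one `--local-ssd` flag per device, which is how we understand gcloud expects multiple local SSDs to be requested. The helper name is hypothetical and the flags are only built, not executed.

```python
def local_ssd_flags(count, interface="NVME"):
    """Return one --local-ssd flag per device with an explicit interface."""
    if interface not in ("NVME", "SCSI"):
        raise ValueError(f"unknown interface: {interface}")
    # Naming the interface avoids silently getting the SCSI default.
    return [f"--local-ssd=interface={interface}"] * count


flags = local_ssd_flags(2)
print(" ".join(flags))
```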
We expect GCP to focus heavily on rolling out their n2 and c2 series machines in 2020 given the advantages over the older generation series. In addition, GCP will be rolling out new AMD machine types in the n2d series, and they expect to continue to improve performance across all lines of VMs in 2020.
We stand by our decision to use default configurations in this experiment, and welcome continued feedback from the clouds. Choosing how to configure your cloud is not intuitive, and obviously, many developers don’t have the benefit of speaking directly to the cloud providers to learn best in class performance configurations. With that in mind, we think it’s hugely valuable to keep benchmarking cloud performance, and will keep these suggestions in mind as we begin work on the 2021 Cloud Report.
Per the feedback we received while working with AWS, Azure, and GCP, we’ve decided to make a couple of tweaks to the way we run our microbenchmark tests for the next issue of the Cloud Report.
For the 2020 Cloud Report, we tested all machine types using stress-ng’s matrix stressor. This stressor provides a good mix of memory, cache, and floating point operations. We found its behavior to be representative of real workloads like CockroachDB. Because of this, the results we have seen on stress-ng have a strong correlation with the results we have seen in TPC-C.
This is in contrast with the cpu stressor, which steps through its 68 methods in a round-robin fashion, allowing it to be disproportionately affected by changes in some of its slower, less-representative methods, such as stressing deeply recursive call stacks. GCP and AWS both shared concerns about the cpu stressor with us ahead of time, and our own experimentation validated those concerns. The results we found using the cpu stressor were difficult to explain across CPU platforms and were not useful predictors of TPC-C performance, indicating that it is not a representative benchmark. We therefore decided only to present results using the matrix stressor.
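To see why a round-robin stressor can be dominated by a slow method, consider a toy model in which every method runs once per cycle, so the score is the number of methods divided by the total cycle time. The costs below are made up for illustration and are not stress-ng measurements.

```python
def round_robin_score(method_costs):
    """Operations per unit time when every method runs once per cycle."""
    return len(method_costs) / sum(method_costs)


# Hypothetical costs in arbitrary units: 67 fast methods, one slow outlier.
platform_a = [1.0] * 67 + [100.0]
# Identical fast methods; only the outlier is twice as slow.
platform_b = [1.0] * 67 + [200.0]

score_a = round_robin_score(platform_a)
score_b = round_robin_score(platform_b)
print(f"platform B scores {score_b / score_a:.0%} of platform A")
```

In this model the two platforms are identical on 67 of 68 methods, yet the overall score drops to roughly 63% because the single slow method accounts for most of each cycle.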
In our open source machine provisioning tool Roachprod, we used Ubuntu 16.04 (and therefore stress-ng 0.05.23) with AWS and GCP but Ubuntu 18.04 (stress-ng 0.09.25) with Azure. We didn’t do this maliciously; it happened because we added Azure to Roachprod this year, whereas we added AWS and GCP last year. From a scientific standpoint, this is a mistake because we should hold as many variables constant as possible. From a pragmatic standpoint, it is an even bigger mistake because different versions of Ubuntu install different versions of stress-ng by default through the Debian package manager. And, unsurprisingly, different versions of stress-ng stress the CPU differently, resulting in incomparable data. We re-ran the CPU tests after accounting for these changes and included the new data in the 2020 report. The updated results still put Azure in the lead, but reduced their margin by approximately 50%. Moving forward, we will hold both the operating system and tooling versions consistent across each cloud.
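One way to guard against this kind of drift is a pre-flight check that refuses to run when tool versions differ between clouds. The sketch below is hypothetical and is not part of Roachprod; it assumes the version strings have already been collected from each node, and it uses the stress-ng versions mentioned above as sample input.

```python
def check_consistent(versions_by_cloud):
    """Raise ValueError if the clouds do not all report the same version."""
    distinct = set(versions_by_cloud.values())
    if len(distinct) > 1:
        raise ValueError(f"tool versions differ across clouds: {versions_by_cloud}")
    return distinct.pop()


# stress-ng versions as described in the text above.
versions = {"aws": "0.05.23", "gcp": "0.05.23", "azure": "0.09.25"}
try:
    check_consistent(versions)
except ValueError as err:
    print(err)
```

The same check could cover the operating system release and any other tool whose version affects results.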
We are considering other measures of CPU performance for next year’s Cloud Report. The current measure, stress-ng, is sensitive to underlying microarchitectures, and other tools may be better suited for benchmarking. Both GCP and AWS recommend SPEC. We get the sense that they test against it internally on a regular basis, but we are hesitant to switch to it because SPEC has all sorts of legal guidelines for reporting. Further, SPEC is not free and would therefore make it harder for the public to verify our benchmarking claims. As we’ve written previously, a benchmark that can’t be reproduced isn’t really a benchmark. We’ve also received recommendations to explore a combination of Linux Bench, the Embedded Microprocessor Benchmark Consortium’s CoreMark, and Baidu’s DeepBench to provide a more complete picture of CPU performance.
In the 2020 Cloud Report, we used iPerf 2 to test each cloud’s network throughput. Another version of iPerf, iPerf 3, also exists. Somewhat confusingly, iPerf 3 isn't really the "new version" of iPerf 2. iPerf 2 and iPerf 3 have been maintained in parallel for several years and their feature sets have diverged. We will re-evaluate the selection of iPerf 2 to ensure that it is still the version that provides the most representative results.
In the 2020 Cloud Report we used ping to measure network latency. Azure claims that ping is not representative of database IO as ping (ICMP, connectionless) latency is very different from (TCP/IP, persisted connection) latency. ICMP is a non-performance critical protocol and is not “accelerated” by Azure while TCP is accelerated. As a result, ping takes a longer path through the host than TCP data packets. In theory, ping could be responded to by a load balancer which considerably cuts down the RTT, but Azure doesn’t do this.
Azure prefers the tool Sockperf, configured in either under-load (ul) or ping-pong (pp) mode, with the corresponding parameters --tcp (use TCP), -m MSG_SIZE (message size; the default is 14 bytes), and -t TIME_SEC (duration in seconds; Azure normally runs for 5 minutes, or 300 seconds). We will explore this test next year.
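Putting those parameters together, a client invocation could be assembled roughly as follows. The mode names and the --tcp, -m, and -t options come from Azure's description above; the `-i` option for the server address is our assumption about sockperf's client usage, and the helper name and IP address are placeholders. The command is only built and printed, not executed.

```python
import shlex


def sockperf_args(mode, server_ip, msg_size=14, seconds=300):
    """Build (but do not run) a sockperf client command line."""
    if mode not in ("ul", "pp"):  # under-load or ping-pong
        raise ValueError(f"unsupported mode: {mode}")
    return [
        "sockperf", mode,
        "-i", server_ip,      # server address (assumed option)
        "--tcp",              # measure over TCP instead of the default
        "-m", str(msg_size),  # message size in bytes
        "-t", str(seconds),   # test duration in seconds
    ]


cmd = sockperf_args("pp", "10.0.0.2")  # placeholder address
print(shlex.join(cmd))
```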
GCP similarly doesn’t care for ping. Like Azure, GCP prefers TCP. GCP recommends Netperf for testing network latency via TCP_RR. GCP is also familiar with iPerf (already used by the Cloud Report for Network Throughput).
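To make concrete what a TCP request/response test measures, here is a loopback-only sketch: a client sends a small message over one persistent connection, waits for the echo, and records the round-trip time. It is not Netperf or Sockperf and says nothing about any cloud's network; all names are ours and the numbers it prints reflect only the local machine.

```python
import socket
import threading
import time


def recv_exact(sock, n):
    """Read exactly n bytes, or return None if the peer closed the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            return None
        buf += chunk
    return buf


def echo_server(listener, msg_size):
    """Accept one connection and echo fixed-size messages until it closes."""
    conn, _ = listener.accept()
    with conn:
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        while True:
            data = recv_exact(conn, msg_size)
            if data is None:
                break
            conn.sendall(data)


def measure_rtts(host, port, rounds, msg_size=1):
    """Return per-request round-trip times in seconds over one connection."""
    rtts = []
    payload = b"x" * msg_size
    with socket.create_connection((host, port)) as sock:
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(rounds):
            start = time.perf_counter()
            sock.sendall(payload)
            if recv_exact(sock, msg_size) is None:
                raise ConnectionError("server closed the connection")
            rtts.append(time.perf_counter() - start)
    return rtts


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]
server = threading.Thread(target=echo_server, args=(listener, 1), daemon=True)
server.start()

rtts = measure_rtts("127.0.0.1", port, rounds=200)
server.join()
listener.close()
print(f"mean RTT over {len(rtts)} requests: {sum(rtts) / len(rtts) * 1e6:.1f} us")
```

Unlike ping, every measurement here travels through the TCP stack on an established connection, which is closer to the path a database client's traffic takes.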
We provided results from local and network-attached storage in intermingled charts, which can make it hard to differentiate between the two. Performance can differ by as much as 100x, which makes it hard for our readers to observe smaller differences. Next year, we’re going to separate the storage classes into separate sections, making it easier to differentiate between them, and provide numbers in the charts for easy comparison.
We use Sysbench as the microbenchmark to measure storage. Azure believes that Sysbench isn’t optimally configured for the filesystem testing we did and may not be the best tool for the job. They stated that benchmarks that fsync frequently are not effective because they “serialize[s] writing (and some reading) and therefore limit[s] parallelism based on the latency of the device. In other configs like Direct I/O, other I/O platforms could exhibit greater parallelism and yield higher throughput at higher latency.” However, as an OLTP database with a strong focus on durability and resilience, we think tests that stress fsync performance are critical. In next year’s Cloud Report, we’ll definitely spend more time fine-tuning these configurations and explaining more comprehensively how we arrived at them.
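The serialization Azure describes can be seen in a small experiment: writing the same blocks with an fsync after every write versus a single fsync at the end. This is a sketch for illustration, not the Sysbench configuration used in the report; the block count and size are arbitrary, and the timings depend entirely on the local filesystem and device.

```python
import os
import tempfile
import time


def timed_writes(path, blocks, block_size, fsync_every_write):
    """Write `blocks` blocks of `block_size` bytes; return elapsed seconds."""
    payload = b"\0" * block_size
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(blocks):
            os.write(fd, payload)
            if fsync_every_write:
                # Block until the device reports this write as durable,
                # so the next write cannot start any earlier.
                os.fsync(fd)
        os.fsync(fd)  # both variants end fully flushed
    finally:
        os.close(fd)
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    synced = timed_writes(path, 200, 4096, fsync_every_write=True)
    buffered = timed_writes(path, 200, 4096, fsync_every_write=False)
    size = os.path.getsize(path)

print(f"fsync per write: {synced:.4f}s, single fsync: {buffered:.4f}s")
```

On persistent storage the per-write variant is usually much slower because each write waits on device latency, which is exactly the behavior a durability-focused database has to live with.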
Additionally, it was pointed out that the current block size is too large to maximize IOPS performance and too small to maximize transfer rate performance. We’ll re-evaluate the block sizes that we use in the I/O tests in the 2021 Cloud Report to more closely match expected workloads.
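The trade-off behind that feedback can be sketched with a toy model of serialized I/O, where each request costs a fixed overhead plus a transfer time proportional to its size. The overhead and bandwidth figures below are hypothetical and do not describe any cloud's disks.

```python
def serialized_io(block_size_bytes, overhead_s, bandwidth_bytes_per_s):
    """Return (IOPS, bytes per second) for one-at-a-time requests."""
    seconds_per_io = overhead_s + block_size_bytes / bandwidth_bytes_per_s
    iops = 1.0 / seconds_per_io
    return iops, iops * block_size_bytes


OVERHEAD_S = 100e-6       # hypothetical fixed cost per request
BANDWIDTH = 200 * 10**6   # hypothetical device bandwidth in bytes/s

results = {}
for kib in (4, 64, 1024):
    iops, throughput = serialized_io(kib * 1024, OVERHEAD_S, BANDWIDTH)
    results[kib] = (iops, throughput)
    print(f"{kib:>5} KiB blocks: {iops:8.0f} IOPS, {throughput / 1e6:6.1f} MB/s")
```

In this model small blocks give the highest IOPS but little throughput, large blocks approach the bandwidth limit at low IOPS, and a mid-sized block sits below the best case on both measures.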
After we released the report we realized that we didn’t provide clear TPC-C reproduction steps. We have since updated the reproduction repo to include a link to TPC-C reproduction steps.
In an early version of the report (which has since been updated), we didn’t make it clear that these results were obtained using the nobarrier ext4 mount option, which matched the previous Cloud Report. Some clouds believe that this is not appropriate because nobarrier can leave machine types vulnerable to disaster events. In fact, these disaster events are part of the very reason we made CockroachDB highly available by default. Unfortunately, none of the big three cloud providers share their expectations for surviving disaster events with or without nobarrier, which makes it challenging to compare their survivability. Some clouds show a larger performance gap between running with and without nobarrier, which may have influenced this feedback. For example, in 2018 we saw a narrower gap in performance when using AWS with and without nobarrier as compared to GCP.
We believe TPC-C performance benchmarking has stood the test of time, something we’ve written about extensively. Azure shared with us that they “typically prefer TPC-E” as they believe it to be “designed to be a more realistic OLTP benchmark than TPC-C” citing this study from Carnegie Mellon. We haven’t yet explored TPC-E, but will plan to in the future. For now, it’s worth noting that “TPC-E has a 9.7:1 Read/Write ratio and more varied transactions” while “TPC-C has a less typical 1.9:1 Read/Write ratio and a simpler IO model.”
AWS, Azure, and GCP all plan to continue to work with Cockroach Labs as we produce future versions of this report as they all want to put forth the most accurate numbers possible in the community. We plan to continue to reach out to them during the research phase of each report so that we can collect the best information to share with our customers.