As an SRE on the CockroachCloud team, we have the unique challenge of monitoring and managing a fleet of CockroachDB clusters around the globe. Perhaps needless to say, as a distributed database, security is an utmost priority for us. To address some of the needs related to security and monitoring (for example intrusion detection audit logging), we’ve invested in our next generation of Security Information and Event Management (SIEM) infrastructure.
In this post, we’ll cover how we aggregate logs for our SIEM needs and how that same system helps level up the observability for all members of the CockroachCloud team.
Some background first. CockroachCloud hosts CockroachDB clusters across two clouds: AWS and GCP. Each Cockroach cluster is made up of many pods running across one or more Kubernetes clusters and all Kubernetes Clusters for a single Cockroach cluster are in a single cloud account.
Prior to our Centralized Logging project, logs for each Cockroach cluster were pulled via a series of shell scripts and cloud provider tools. These methods were very effective for our size at the time, but with CockroachDB’s usage growing rapidly, we knew they wouldn’t scale.
A major bottleneck to the development of our security infrastructure was not having all logs in a single place. Thus we focused on solutions that would ship all logs to a central secure location for analysis. A fair amount of research was done on what our solution should look like and that started with finding out how much data we were working with. This included checking the amount of logs produced for an average workload on one CockroachDB cluster and then extrapolating out to the size of the current fleet. Given our growth rate, we made approximations with hefty padding for future clusters. Once we had our estimates, we went shopping.
Our primary choices were ELK (Elasticsearch, Logstash, and Kibana), Splunk, or something built in-house. Planning for a larger amount of data helped us in negotiating with vendors as we were able to take advantage of price breaks. Though ultimately, we wanted to design for low maintenance.
A wise person once said:
First buy it, then build it, then forget it.
Meaning, when building a product, first try to buy what you need to sell the product, then try to build it. And if both of those options are possible, you should forget it.
This quote derives from the fact that most of what’s needed to deliver a product to market is not in your wheelhouse. The more time you spend maintaining that software the less time you can spend on your own competitive advantage. And the time needed to maintain the product is always more than expected. These not-so-hidden, but often forgotten costs, can quickly eat into profits. Thus our eyes were on cloud offerings rather than enterprise offerings that would require us to host and run the software ourselves.
There was another human element in this. What do people want? As much as I want to build a great product, I don’t believe there is a single way to achieve that goal. Most often there are a lot of good options in front of you with varying tradeoffs. This is one of the reasons humans still have jobs. If there were a clear answer, we’d automate it.
Elasticsearch is a great tool and we’ve enjoyed using it both for personal and professional projects. However, in the end we chose Splunk.
After our testing and cost analysis, it was close but clear that Elasticsearch would cost us more in the long run, with the added costs mostly due to the operational and maintenance needs of Elasticsearch, rather than the licensing fees. Also, there was a strong drive inside the Security team for Splunk which we knew would help smooth the path towards completion of the project. Given that this was a security motivated project, we needed to confirm that SplunkCloud could give us the flexibility we needed to hit our security goals. Once that was confirmed, my eyes were locked on Splunk. Now it was time to build.
Each CockroachDB cluster lives in one cloud at the moment (either AWS or GCP) over one or more regions. There are two major sources of logs and both needed to be shipped to Splunk and a data lake (ie. Google Cloud Storage, S3). Those two sources are:
Shipping the application logs to Splunk was relatively easy. Simply adding an extra DaemonSet to each kubernetes cluster allowed us to ship the logs produced by CockroachDB and friends to a Splunk HTTP Event Collector (HEC) which is effectively an ingestion endpoint. Shipping the cloud provider logs was much more involved.
One snag that we ran into was compatibility with our older versions of CockroachDB which did not have the option to send logs over TCP. Instead logs were written to disk in several files. To remedy this compatibility issue, we decided to run a second instance of FluentBit as a sidecar to the CockroachDB process. This sidecar would be responsible for reading logs from the files produced by the CockroachDB process and forwarding them to the FluentBit DaemonSet. The DaemonSet would then be responsible for shipping logs to Splunk. Thankfully, newer versions of CockroachDB support logging over TCP and we can gradually phase out the sidecar pattern.
The trickiest part of sending application logs to Splunk was updating our brownfield (or preexisting) clusters. It’s quite easy to tell kubernetes to spin up a new StatefulSet with one more container. It’s another several weeks of work to tell hundreds of kubernetes clusters to edit currently running StatefulSets to include a new container and not incur downtime. Perhaps another blog post is due just for that topic!
For both cloud providers (AWS and GCP), we wanted to ship logs like VPC flow logs, cloud audit logs, and kubernetes infrastructure logs. Since we use Pulumi to manage our infrastructure, we needed to create two separate stacks of cloud resources; one for each cloud.
The general flow of logs from AWS to Splunk starts by sending logs from a source (VPC for VPC flow logs) to a CloudWatch Log group. The log group acts as a Pub/Sub topic and a subscription relays these logs to a Kinesis Firehose Stream. The Kinesis Firehose Stream is responsible for transforming the log line with a lambda function (which is necessary given the log format that Splunk ingestors expect). Kineses then ships logs to Splunk and S3 for long term storage.
The reader might ask why we created three kinesis streams rather than one. Each stream has its own transformation function and given the diversity of the log types, this appeared to be the most maintainable solution. Also, given that all resources are managed via code, we were able to quickly create a second and third stream once one was working properly.
GCP logs worked in a similar way as AWS. Logs are first sent to Cloud Logging by all sources (eg. VPC flow logs). Cloud Logging Sinks then filter the main Cloud Logging stream and dump the filtered logs to a particular location (eg. Pub/Sub topic, Google Cloud Storage). For those logs sent to the Pub/Sub topic, a cloud function is responsible for transforming the logs and forwarding them to Splunk.
With these two solutions in place, we are now able to ship thousands of CockroachDB clusters’ logs to longterm and searchable storage. While we are still in the process of rolling these changes out to all clusters, we are planning for numbers like the following:
Most importantly, we can now alert on a myriad of new event types that stream from each cloud provider. These include AWS API authentication attempts, infrastructure changes, and privilege escalations. By alerting on specific authentication attempt profiles and network traffic, our security team can hold a pager (rather than scouring logs) that yields actionable alerts.
While there are a number of paths that would have proved adequate for our needs, this path was chosen for its simplicity, maintainability, and scalability. And most importantly for this project, with Splunk in place our security team can operate faster and more efficiently. I’m happy to say that we are ever increasing our ability to provide our customers with a scalable and secure distributed SQL database.