Distributed Glossary

CockroachDB

What is CockroachDB?

At the highest level, CockroachDB is software for storing data. More specifically, CockroachDB is a distributed SQL database technology that enables users to enjoy many of the benefits of the traditional relational database (such as the familiar SQL language, reliable schema, etc.) while also offering the key features required for a modern, cloud-native database, such as high availability, bulletproof resilience, elastic scale, and geographic scale.

Learn more about CockroachDB.

More about CockroachDB

Cost-Based Optimizer

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is a cost-based optimizer?

The Cost-Based Optimizer is a feature of CockroachDB that looks at all possible ways in which a query can be executed and assigns each a “cost” that indicates how efficiently the query can be run. Then the optimizer chooses the way that has the lowest cost, and is therefore most efficient. This feature only works with databases that speak SQL, so it’s an added benefit obtained from having a SQL layer.

More about Cost-Based Optimizer

Encoding

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is encoding?

Encoding is the process by which CockroachDB converts SQL statements into bytes (because the lower layers of CockroachDB deal with bytes).

More about Encoding

Gateway Node

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is a Gateway Node?

When a request comes into CockroachDB, the load balancer routes the request to the node it thinks can best handle the request at that time. This node is called the Gateway Node. The Gateway Node receives the request and then responds to the request. It identifies which nodes in the cluster are the leaseholders for the specific requests that came in, and sends the requests to those nodes.

More about Gateway Node

Key Value (KV) Layer

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is the Key Value (KV) Layer?

The key value layer is a figurative layer of CockroachDB. One helpful way to think about the architecture of CockroachDB is as a SQL system built on top of a key value store. When the data is up in the top layers, it follows the rules of a SQL system and is structured in a table format. When the data travels deeper into the database, table format no longer works, due to the distributed nature of the database. So instead of being stored in tables, the data is stored in a different way: in key-value pairs. It’s important to note that this combination of a SQL upper layer with a key-value store underneath is a relatively unusual setup, because translating structured table data into key-value pairs is a difficult task.

Key Value (KV) Pair

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is a Key Value (KV) Pair?

A key-value pair is a way of storing data. In CockroachDB, individual rows from the tables are mapped into key-value pairs. One column becomes the index, meaning that each piece of data in that column is the “key” in its own key-value pair. One or more other columns become the “value.”

For example, in a table called “Customer Data” the first column might be “Name,” the second “Hometown” and the third “Email Address.” You might decide that the “Name” should be the key because ultimately you want all your data to be sorted by name. Then you might decide that the “Hometown” and “Email Address” columns should be the values.

This information gets mapped, row-by-row, into key-value pairs, and the ultimate format of a single row might end up reading as a string: “Customer Data Table/John Smith/New York City/Johnsmith@gmail.com.” Then, once all the data in the table is translated into the monolithic sorted key-value map, these pieces of data are all sorted by the name value.

More about Key Value (KV) Layer

Load Balance

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is the Load Balancer?

The Load Balancer is a piece of software added to the front of CockroachDB that helps balance the requests coming into the database. All the nodes in CockroachDB can process requests (they’re all symmetrical), so it makes the most sense to spread the work around between nodes so that the requests can be completed more quickly. The load balancer accomplishes that.

More about Load Balance

Meta Range

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is a Meta Range?

A meta range is information stored in the cluster that tells the database where to find ranges – i.e., which nodes have certain ranges – so that the database can access the correct range. Sometimes also referred to as an index, but a different definition of an index than above.

More about Meta Range

Black Swan Event

A black swan event is an event that has the following three attributes:

  1. It was unexpected.

  2. It had significant, wide-ranging consequences.

  3. After it happens, people will suggest that it was predictable, despite the fact that it was not widely predicted before it happened.

This definition comes from mathematical statistician Nassim Taleb, who coined and popularized the term in his books Fooled by Randomness and The Black Swan.

However, in everyday language when people talk about a “black swan event,” they’re generally thinking just about unexpected events with wide-ranging consequences. The third criteria – that people will rationalize the event as predictable after the fact – isn’t typically discussed.

Officially the term doesn’t have a positive or negative connotation. Black swan events can theoretically be good or bad or neutral. In the real world, however, the term is often used to describe events with negative impacts, such as financial crashes, widespread service outages, and even natural disasters and terrorism.

Long story short: while the official definition of a black swan event is quite a bit more nuanced, in everyday life the term is often used to mean something pretty simple: an unexpected event with significant negative consequences.

(The name, in case you’re wondering, comes from the once-widespread perception that all swans were white, and black swans either didn’t exist or were incredibly rare. In reality, black swans do exist. But they’re only native to Australia, so they’re quite rare everywhere else – including in Europe, where the idea of a “black swan” as a symbol for something unpredictable, unexpected, or unlikely first came to be used.)

Black swan events in technology

In the tech industry, most discussion of black swan events is typically related to infrastructure, and infrastructure failures leading to service outages.

On a global scale, a black swan event in technology could be something like a coronal mass ejection from the sun causing a kind of natural EMP that knocks out electronic systems over a large area. (Whether this would be a true black swan event is debatable, considering that it has happened before, but it happens only rarely and is considered unlikely).

More commonly, though, discussions of black swan events in technology are company-specific, meaning that a “black swan event” is an event that significantly and negatively impacts the company’s services, generally due to some kind of infrastructure failure or outage.

Examples of potential company-level black swan events in technology include:

* Cloud provider outages.

* Power outages.

* System failures.

* Human mistakes.

Real-world example of a black swan event in tech

More about Black Swan Event

ACID or A.C.I.D

What is ACID?

ACID stands for Atomicity, Consistency, Isolation, and Durability. These are a set of characteristics that are desirable for database transactions. Together, they guarantee that transactions and all data stored in the database remains valid even in the event of errors or power failures. These characteristics are extremely important for any OLTP database.

Specifically:

Atomicity: Each transaction must be treated as a single, indivisible unit (even if processing the transaction requires multiple steps). In other words, every step of the transaction must complete successfully or the entire transaction will fail and no change will be written to the database. This is desirable because without atomicity, a transaction that’s interrupted while processing may only write a portion of the changes it’s supposed to make, which could leave the database in an inconsistent state.

Consistency: No transaction can violate the integrity of the database – all transactions must leave the database in a valid state.

Isolation: Any concurrently-processed transactions (i.e. transactions happening at the same time) leave the database in the same state as if they were executed sequentially (i.e. one after another).

Durability: Once committed, the transaction remains committed even in the case of hardware failure, power outage, etc.

More about ACID or A.C.I.D

Unstructured Data

What is Unstructured Data?

Unstructured Data is essentially the opposite of structured data: data that is not arranged with any kind of data model or schema and that can’t be easily adapted to a table format.

For example, datasets consisting of video or audio files are good examples of unstructured data. A traditional table structure would work for storing metadata about videos (such as title, description, Youtube link, etc.), but it is not a good fit for storing the videos themselves.

CockroachDB does support some types of unstructured data via its JSON support, but it was designed with primarily structured data in mind.

More about Unstructured Data

Transaction Layer

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is the transaction layer?

The Transaction Layer is the layer of CockroachDB that receives requests from the SQL Layer and coordinates concurrent operations.

More about Transaction Layer

TPC-C

What is TPC-C?

TPC-C, short for Transaction Processing Performance Council Benchmark C, is a benchmark used to measure how well a database holds up when it’s trying to run certain workloads. TPC-C is specifically an OLTP benchmark, and it’s widely recognized and standardized. TPC-C simulates an environment where a lot of users are making requests of a database, and then it measures how well the database holds up – i.e., how fast it can complete the transactions, what the cost of completing the transactions is, and so on.

More about TPC-C

Structured Data

What is Structured Data?

Structured Data (also called relational data) is data that lives best in a structured format, i.e., the kind of data you would enter into a table in a spreadsheet. The best way to organize it is in columns and rows.

Two examples of structured data are an inventory of products for an eCommerce site, or a list of customers and their information. CockroachDB was designed primarily to support this kind of data (although it does also include support for unstructured data).

More about Structured Data

Storage Layer

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is the Storage Layer?

The storage Layer is the layer of CockroachDB that writes data to the disk (the Physical Storage component), and reads data from the disk. It’s still part of the software, but it’s the piece that communicates with the hardware.

More about Storage Layer

SQL Layer

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is the SQL Layer?

The SQL Layer is the layer of CockroachDB that speaks to the application or client using the programming language SQL, adhering to the PostgreSQL Wire Protocol. After performing various tasks, the SQL layer sends the relevant requests to the Transaction Layer.

More about SQL Layer

SQL API

What is a SQL API?

A SQL API is an API for interacting and communicating with a database. In the case of CockroachDB, CockroachDB offers a SQL API to users and apps. This means that users and apps can send SQL commands into the API and receive the results of their query in return. The commands must follow the PostgreSQL Wire Protocol, because CockroachDB adheres to it.

More about SQL API

Serializable Isolation

What is Serializable Isolation?

Serializable Isolation is the highest level of isolation possible under the guidelines provided by the “ANSI SQL Standard” (an official standard of best SQL practices). It means that transactions committed to a database appear as if they were executed one after another, even if they were processed in parallel. Most distributed databases only achieve a lower level of isolation, called “snapshot isolation,” but CockroachDB is able to achieve serializable isolation.

More about Serializable Isolation

Secondary Index

What is a Secondary Index?

A Secondary Index is a secondary column by which you can sort data. CockroachDB supports both primary and secondary indexes. This just adds another level of organization to data sorting. For example, in a table of data about employees, you might first sort alphabetically by “name,” (the primary index) and then within that, sort alphabetically by “hometown” (the secondary index).

More about Secondary Index

RTO (Recovery Time Objective)

What is RTO (Recovery Time Objective)?

RTO (Recovery Time Objective) is the amount of time a system’s data is unavailable due to a failure, before the system returns to service. The goal is zero RTO, and CockroachDB achieves this through its multi-active availability model.

More about RTO (Recovery Time Objective)

RocksDB

What is RocksDB?

RocksDB is a piece of software that was embedded in CockroachDB to store key-value data. CockroachDB used RocksDB to communicate with the disk in order to actually store data. CockroachDB now uses Pebble, a RocksDB-inspired and compatible key-value store that is specifically designed for distributed SQL databases. Many other tech companies still use RocksDB.

More about RocksDB

Replication Layer

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is the Replication Layer?

The Replication Layer is the layer of distributed database software that copies data between nodes and ensures consistency between these copies. In CockroachDB, this is accomplished by implementing the Raft consensus protocol.

More about Replication Layer

Region

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is a Region?

A region is a specific geographical location where you can host your resources. Each region has one or more zones; most regions have three or more zones. For example, US-West is the West Coast of the US. US-East is the East Coast. For example, public cloud providers often allow you to choose the region or regions you’d like to deploy to.

More about Region

Range / Shard

What is a Range / Shard?

A range in CockroachDB (called a “shard” in other databases) is a chunk of data, stored as key-value pairs. The Distribution Layer breaks tables apart into these chunks so the data can be distributed across different nodes.

In CockroachDB, a range is 512 MiB or smaller. This default range size is a sweet spot – small enough to move quickly between nodes, but large enough to store a meaningfully contiguous set of data whose keys are more likely to be accessed together. Once a range gets bigger than 512 MiB, it’s split into two smaller chunks in order to keep the ranges from getting too big.

More about Range / Shard

Range Replication

What is Range Replication?

Range replication is the duplication of ranges on multiple nodes so that if one node fails, the range’s data is still accessible via another node. In CockroachDB, each range is replicated three times by default. This replication allows CockroachDB to be highly available and remain online even when a node goes down, because the data lives in two other places (and two nodes are required to achieve quorum). When a failure happens, CockroachDB automatically redistributes data to the surviving nodes.

More about Range Replication

Raft Consensus Protocol

What is the Raft Consensus Protocol?

The Raft Consensus Protocol is the set of quorum guidelines that CockroachDB follows to ensure each range is consistent. It’s an algorithm that makes sure all copies agree on changes to data. A “leader” is elected for each range (leaders are also known as Leaseholders), to coordinate changes for that range, and the two other ranges are called “followers.” Changes get entered in the leader’s log, then the leader sends out the changes to the followers, and the changes get entered into their logs. Once this happens, the change is committed to the leader’s data, and then committed to the followers' data.

More about Raft Consensus Protocol

Quorum

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is Quorum?

Quorum is the consensus required to commit changes in a distributed database such as CockroachDB. Different types of distributed databases may use different systems for quorum. CockroachDB uses the Raft Consensus Protocol, in which a minimum of two nodes are required to achieve a consensus (i.e. quorum) in a three-node system.

More about Quorum

Public Cloud

What is a Public Cloud?

Public Cloud is the term for the most common kind of cloud computing vendor. Data lives in “public,” and hardware, storage, and network devices are shared with other organizations or cloud “tenants.” The services are accessed and managed via a web browser.

More about Public Cloud

Private Cloud

What is a Private Cloud?

In a private cloud, a company’s data is stored in a dedicated cloud environment, rather than being stored on shared machines. The services and infrastructure are always maintained on a private network and the hardware and software are dedicated solely to that single organization. Benefits include increased security and more customization. Vendors include Amazon Virtual Private Cloud (VPC), Dell Cloud Solutions, and Microsoft Private Cloud.

More about Private Cloud

Primary Index

What is a Primary Index?

An index is a column whose data is designated to the “key” location in a Key-Value pair. Think of sorting data in a spreadsheet; the column by which you’re sorting the data is called the index. A primary index is the main column by which the data is sorted.

More about Primary Index

PostgreSQL Wire Protocol

What is the PostgreSQL Wire Protocol?

PostgreSQL Wire Protocol is a dialect of the SQL language that’s spoken by the database called PostgreSQL. Many people have used PostgreSQL before and are familiar with its language, and many apps and ORMs already use the dialect. The fact that CockroachDB adheres to the PostgreSQL Wire Protocol makes it easy for customers to plug into the database and use it.

More about PostgreSQL Wire Protocol

Physical Storage

What is Physical Storage?

Physical Storage is the hardware on which the data is stored. CockroachDB’s Storage Layer (software) communicates with this hardware to physically write data onto the disk and read data from the disk.

More about Physical Storage

ORM (Object-Relational Mapper)

What is an ORM (Object-Relational Mapper)?

An ORM is a software intermediary between the application and the database. It allows developers to speak to a SQL database using a language other than SQL. Some developers may not be experienced with SQL, or simply prefer to write in languages like Python, C++, Javascript etc. When writing the parts of their application that communicate with a database, they can use an ORM as a go-between, to translate their code into SQL and send it to the database to make requests.

More about ORM (Object-Relational Mapper)

On-Prem (On-Premises)

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is On-Prem (On-Premises)?

In the context of databases, on-prem describes a database deployment in which a company’s data is stored on a machine that they own, rather than stored with a public cloud provider such as GCP or AWS. Companies often choose to store proprietary or high-security data on-prem because they can have more control over the safety of it.

More about On-Prem (On-Premises)

OLTP (OnLine Transaction Processing)

What is OLTP (OnLine Transaction Processing)?

OLTP (Online Transaction Processing) describes the kind of data processing that deals with heavy transactional workloads. In OLTP workloads, in other words, there are many relatively simple, transactional operations (many reads and writes) constantly coming into the database. The data typically relates to standard business tasks like keeping track of inventory, hotel reservations, or online banking functions.

The other main kind of data processing is OLAP (Online Analytics Processing). Compared to OLTP workloads, the workloads run on OLAP databases are usually much more complicated but much less frequent.

Often, database technologies are designed with one or the other in mind. OLTP databases such as CockroachDB are focused on transactional use cases such as payment processing, logistics, metadata management – basically any use case that involves frequent reads and writes. OLAP databases, in contrast, are typically used for data analytics.

Often a company may use both, with the OLTP database handling the transactions coming from the application, and with relevant data periodically offloaded to an OLAP database for analysis. This approach ensures that even very complex analytics work will not interfere with the performance of the OLTP database.

More about OLTP (OnLine Transaction Processing)

Node

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is a Node?

A node is a single instance of a distributed database such as CockroachDB, or one individual machine of many that are running the same distributed database. Many nodes join together to create the full cluster.

More about Node

MVCC (Multiversion Concurrency Control)

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is MVCC (Multiversion Concurrency Control)?

MVCC (Multiversion Concurrency Control) is a protocol that CockroachDB follows to ensure isolation of transactions when concurrent transactions are happening. Without MVCC, if a database is being used in multiple ways at the same time, then someone might see half-written or inconsistent data. MVCC keeps multiple copies of data, so each user sees a snapshot of the database at a particular instant in time, and they won’t see changes until the transaction has been committed.

More about MVCC (Multiversion Concurrency Control)

Multi-Cloud

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is Multi-Cloud?

Multi-Cloud refers to a strategy where multiple cloud providers are being used, rather than just one. Typically, these are multiple public cloud providers (i.e. AWS and GCP). CockroachDB is one of the few distributed SQL databases that supports multi-cloud deployments.

More about Multi-Cloud

Multi-Active Availability

What is Multi-Active Availability?

Multi-Active Availability is CockroachDB’s high availability model. All replicas in the cluster can handle traffic, including both reads and writes, but the Raft consensus is used to ensure data remains consistent. This model also prevents RTO if a node goes down.

More about Multi-Active Availability

Monolithic Sorted Key Value Map

Note: This term is specific to CockroachDB, a Distributed SQL database. In other contexts, it may be used differently.

What is a Monolithic Sorted Key Value Map?

When all of the data from all of the tables is translated into key-value pairs in CockroachDB, it’s called a monolithic sorted key-value map. This just means a giant, list of key-value pairs that correspond to rows in tables, organized in a way that allows you to easily insert and find data.

More about Monolithic Sorted Key Value Map

Mainframe

What is a mainframe?

A mainframe is a gigantic machine (typically made by IBM) that has a lot of storage abilities and high computing power. Mainframes are almost exclusively on-prem and privately owned by the businesses that use them. However, recently IBM has released a version of its mainframe to be used in private cloud settings.

More about Mainframe

Machine / Server

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is a machine or server?

A machine is just a computer. It can live in a data warehouse or on-prem. It can vary in computing power and storage abilities. Typically, the machines that live in data warehouses to be used for cloud purposes are much smaller with less power than mainframes.

More about Machine / Server

JSON

Note: Here, we are defining JSON in the context of a distributed database.

What is JSON?

JSON, or Javascript Object Notation, is a way of formatting “semi-structured” data, and CockroachDB supports it by default. A small amount of data in CockroachDB is typically JSON data.

To understand what JSON is, follow this example: Imagine a table that stores information about blog posts. Much of the data is structured; every blog post needs data on title, author, and number of words (and these become the columns in the table). But there might be additional data that’s not applicable to every post, such as if a user comments on a post or likes a post. Instead of having to make a separate column for each of these data pieces (which would be inefficient since many of the cells would be empty), you can make a single column that supports JSON – and then you format the unique data pieces in JSON and put any item at all into that column, without having a predefined label attached to it.

More about JSON

Isolation

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is Isolation?

Isolation describes a desirable database characteristic in which concurrently-processed transactions (i.e. transactions happening at the same time) leave the database in the same state as if they were executed sequentially (i.e. one after another).

It is one of the four ACID properties that are desirable for databases dealing with transactional workloads.

More about Isolation

Hybrid Logical Clock (HLC) Timestamps

What are Hybrid Logical Clock (HLC) Timestamps?

Accurate time is very important in distributed systems, because events often occur at different nodes at the same time, and these events need to be ordered accurately. Google Spanner uses atomic clocks to accomplish this, but because CockroachDB is open source and can run on any public/private cloud/on-prem, it’d be impossible to use atomic clocks. Instead CockroachDB used HLC, which is a clock method that has logical and physical components. CockroachDB applies HLC timestamps to all transactions so the system knows when they occured and can order them correctly.

More about Hybrid Logical Clock (HLC) Timestamps

Hybrid Cloud

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is Hybrid Cloud?

Hybrid Cloud refers to a strategy where cloud storage and on-prem storage are both being used. A typical example of this would be a single company storing sensitive data on-prem and less sensitive data in the cloud.

More about Hybrid Cloud

High availability

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is high availability?

High Availability is a desirable characteristic for databases; it describes the ability of a database to survive (and thus remain available) even when parts of the system fail. For example, a CockroachDB database with five nodes could survive and remain available even if several nodes failed.

Different databases use different models to achieve high availability. The two most common high availability models are Active-Passive and Active-Active. Meanwhile, CockroachDB uses a Multi-Active Availability model.

More about High availability

Gossip Protocol

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is a gossip protocol?

In the context of a distributed database, the gossip protocol is the form of communication used between nodes, to allow each node to locate data across the cluster.

More about Gossip Protocol

Durability

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is durability?

In the context of a database, durability is a desirable property that describes a system where, once data has been committed to the database, it will remain even in the event of machine failures, power outages, etc.

It is one of the four ACID properties that are desirable for databases dealing with transactional workloads.

More about Durability

Driver

Note: This term has other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is a driver?

In the context of databases, a driver is a piece of code that you add to your app to help it speak the language necessary for communicating with the database, such as SQL. For example, if you’re building a Python app, you might use the psycopg2 driver to enable it to communicate with CockroachDB using SQL.

More about Driver

Distribution Layer

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is the distribution layer?

In a distributed database such as CockroachDB, the distribution layer takes data from the SQL layer and breaks it down into chunks called ranges to be stored in a distributed way (i.e., it is stored in multiple locations instead of just a single location).

More about Distribution Layer

Data Warehouse / Datacenter

What is a data warehouse or datacenter?

A data warehouse or datacenter is a giant warehouse where thousands of machines live, arranged in racks. Cloud providers own many of these warehouses, and when you run or store something on the cloud, it lives in one of these warehouses.

More about Data Warehouse / Datacenter

CPU

What is a CPU?

A CPU (Central Processing Unit) is a chip that sits on top of the motherboard of a computer and carries out instructions from the software. Usually CPUs are multi-core, meaning that there is more than one CPU on a single chip. In the context of databases, the power of the CPUs on the machines running your database will impact the performance of the database.

More about CPU

Core

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is a core?

A core is a component of a CPU that carries out instructions. A multi-core CPU has multiple CPUs together on a single chip, increasing the chip’s overall computing power.

More about Core

Consistency

What is database consistency?

Most database systems have rules for what kinds of data can and cannot be stored (among other rules). Consistency is the term used to describe a database in which those rules are always enforced. A database is said to have consistency when no transaction can violate the integrity of the database – all transactions must leave the database in a valid state.

It is one of the four ACID properties that are desirable for databases dealing with transactional workloads.

More about Consistency

Cluster

Note: This term may have other meanings in other contexts. Here, we are defining it in the context of a distributed database.

What is a cluster?

A cluster is the full collection of nodes associated with a distributed database. For example, a company might spin up a CockroachDB cluster with five nodes for its database.

More about Cluster

Cloud

What is the cloud (in the context of databases)?

In the context of databases, “cloud” refers to storing your data on machines that belong to a third-party cloud provider. This is more cost efficient than on-prem storage, as it eliminates the need for a company to maintain its own machines, and typically increases the availability and scalability of the company’s services. This is because with cloud storage, data is stored across many smaller computers inside a data warehouse and across data warehouses (when storing data on-prem, it’s usually stored in a single massive computer or mainframe).

More about Cloud

Byte

What is a byte?

In the context of data storage, a byte is the most basic format for coding a character on a computer. A byte is a group of eight 0’s and 1’s in a specific order that represents their character. For example, the letter “A,” when translated into bytes, reads as: “01000001”

More about Byte

AZ (Availability Zone, or Zone)

What is an AZ?

An AZ, availability zone, or just “zone”, typically refers to a single data warehouse within a region. Multiple warehouses/zones make up a single region. Sometimes, an availability zone isn’t at the warehouse level, but instead at the rack level - i.e., a single rack within a warehouse.

More about AZ (Availability Zone, or Zone)

Atomicity

What is atomicity?

Atomicity is a desirable characteristic for database transactions. The name comes from the idea of an indivisible “atomic unit”, and it describes a method for processing transactions that treats each transaction as a single, indivisible unit (even if processing the transaction requires multiple steps). In other words, every step of the transaction must complete successfully or the entire transaction will fail and no change will be written to the database.

This is desirable because without atomicity, a transaction that’s interrupted while processing may only write a portion of the changes it’s supposed to make, which could leave the database in an inconsistent state.

It is one of the four ACID properties that are desirable for databases dealing with transactional workloads.

More about Atomicity

Application / Client

What is an application or client?

An application or client is a software program. In the context of a database, the client is what sends data to a database, and/or gets data from it. An example of an application is a phone app – a piece of software that runs on the phones of users, and may re quest and send data to and from a database as the user takes different actions in the app.

More about Application / Client

API

What is an API?

API stands for Application Programming Interface. Simply put, an API is a way for developers/apps/clients to communicate with applications. It’s an interface that allows the developers to send requests to an application and receive simple responses from it.

More about API

Active-Passive Availability

What is Active-Passive Availability?

Active-passive availability is one way to configure a distributed database to offer high availability.

In an Active-Passive Availability setup, all traffic is routed through a single active replica, and changes are copied to a backup passive replica. If the active replica fails, the passive one takes over. However, the active one might fail before the passive one copies all the data over, leading to data loss. Plus, it can take a while for the passive replica to boot up, causing some RTO.

Other configurations include active-active availability and multi-active availability.

More about Active-Passive Availability

Active-Active Availability

What is Active-Active Availability?

Active-active availability is one way to configure a distributed database to offer high availability.

In an active-active availability setup, multiple replicas run identical services and traffic is routed to all of them. If any replica fails, the others handle the extra traffic. This model runs into consistency issues in database contexts because multiple replicas might be trying to edit the same data at the same time.

Other configurations include active-passive availability and multi-active availability.

More about Active-Active Availability

Edge Computing

What is edge computing?

Edge computing is a somewhat overloaded term to describe locality-sensitive distributed computing architecture. Wikipedia defines it as “a distributed computing paradigm that brings computation and data storage closer to the sources of data.” We can think of the “sources of data” being users or even sensors making requests to our system.

The main aim of edge computing or multi-access edge computing is to reduce location-related latency in applications to enable high-performance, real-time use cases in widespread geographies. Edge computing systems are faster when computation and data are closer to the devices.

Developers today are beginning to realize that to get computation data closer to devices, there are better choices than centralized databases and even distributed databases with limited distribution to a single region.

Operating a database in a single region leads to high latency for users or edge devices in areas outside of the database region. Even if you distribute your application across multiple regions, users or devices outside the database region may experience unacceptable response times. And unexpectedly high latency can translate into dissatisfied users.

Edge computing uses data that has life cycles or life spans. In the same application, you can have ephemeral data, in-memory data, short-term persistent data, and long-term persistent data. Typically, long-term persistent data is stored in databases. Unfortunately, when LTP data is far from the edge, it’s slower, and this slow data effect tends to give databases a lousy reputation.

Choosing the right database solution can improve your edge computing architecture and user experience.

More about Edge Computing

Bit

What is a bit?

In the context of data storage, a bit is a single 0 or 1 (also called a binary digit). There are 8 bits in a byte.

More about Bit