Adam Szymański

Founder & CTO

Published
August 28, 2024

Advantages of using columnar storage for logs

Columnar Storage
Logs

Traditional solutions for logs are row-based and rely on an inverted index that allows fast retrieval of records based on search terms. Recently, it has become more and more common for analytical databases with columnar storage to be used for storing and processing logs. But what has caused such a shift?

Why do we store logs at all?

To get a better understanding of why OLAP databases with columnar storage are becoming more and more popular, especially for storing logs, one needs to understand why we store logs in the first place. Typical reasons are as follows:

  • Analyzing system behavior
  • Calculating metrics 
  • Debugging
  • Security audits

Each of those use cases has different requirements. When we analyze system behavior, we perform specific actions in the tool and then interpret the logs it produces to learn how the system worked. Here, we usually want to retrieve log entries for a given time range and service.

When there’s a need to calculate metrics, we usually want to compute aggregations over certain log types within a specified time range. Usually, just a fraction of logs is used for the calculation, as the rest of them don’t provide any actionable value. In the case of structured logs, in most situations only part of the record is required. For example, for HTTP logs we might want to retrieve only the response code, timestamp and service name, which is far less than the whole row. Generating metrics can also cause a significant query workload, as we might want to monitor our system in real time, which requires running queries on a regular basis to look for any kind of anomalies.
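
As a rough sketch of what such a metric query touches (the field names and records below are invented for illustration), note that only three fields out of each record are actually needed:

    from collections import Counter
    from datetime import datetime, timedelta

    # Hypothetical structured HTTP log records; most fields are irrelevant
    # for the metric we want to compute.
    logs = [
        {"timestamp": datetime(2024, 8, 28, 12, 0, 5), "service": "checkout",
         "status": 200, "path": "/cart", "user_agent": "curl/8.0"},
        {"timestamp": datetime(2024, 8, 28, 12, 1, 10), "service": "checkout",
         "status": 500, "path": "/pay", "user_agent": "curl/8.0"},
        {"timestamp": datetime(2024, 8, 28, 12, 2, 42), "service": "auth",
         "status": 200, "path": "/login", "user_agent": "curl/8.0"},
    ]

    window_start = datetime(2024, 8, 28, 12, 0, 0)
    window_end = window_start + timedelta(minutes=5)

    # Aggregate response codes per service within the time range, reading only
    # the timestamp, service and status fields.
    per_service = Counter(
        (r["service"], r["status"] >= 500)
        for r in logs
        if window_start <= r["timestamp"] < window_end
    )

    for (service, is_error), count in sorted(per_service.items()):
        print(service, "5xx" if is_error else "non-5xx", count)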

Debugging is another use case, one that might call for an astounding amount of verbose-level logs. Here, not that many queries are run (and the ones that are tend to be very selective), but it is important to store a lot of data in a cost-efficient manner. Typically, we first search for an entry with a specific error message or error ID and then look for all entries from around that point in time.
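
The two-step access pattern described above might look roughly like this (the error ID and log format are made up for the example):

    from datetime import datetime, timedelta

    # Hypothetical verbose log entries: (timestamp, service, message).
    entries = [
        (datetime(2024, 8, 28, 3, 14, 5), "billing", "opening connection pool"),
        (datetime(2024, 8, 28, 3, 14, 7), "billing", "ERR-4711 payment gateway timeout"),
        (datetime(2024, 8, 28, 3, 14, 9), "checkout", "retrying request"),
    ]

    # Step 1: a very selective search for one specific error ID.
    hit_time = next(ts for ts, _, msg in entries if "ERR-4711" in msg)

    # Step 2: pull everything from around that point in time, across all services.
    window = timedelta(minutes=2)
    context = [e for e in entries if hit_time - window <= e[0] <= hit_time + window]
    print(context)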

Logs are also useful for the purpose of performing security audits.

According to IBM’s report, it statistically takes about nine months for a data breach to be detected.

What does that mean in practice? To analyze what has happened, what the scope of the breach is and how it was performed, it is extremely important to store data for a long period of time. A data breach is not something that happens on a regular basis, so while this use case does not generate a significant number of queries on an everyday basis, the long retention period it requires translates into a notable data storage cost.

Crucial database characteristics for storing logs

Based on the above-mentioned use cases, a database designed for storing logs should:

  • Have a low cost of storing data, allowing long retention periods and high verbosity
  • Allow running queries efficiently over a specified time range and aggregating the results

Use cases that require efficient retrieval of log entries based on text search are actually less common than aggregation and filtering by time range, service name or type.

What makes OLAP databases with columnar storage so attractive?

The most important requirement for log databases is to store data cheaply. That’s where columnar storage steps in, as it allows for much heavier compression than row storage: all entries for a given column are stored together, which makes them much easier to compress. Oxla users typically see a compression ratio between 3x and 12x compared to their previous row-based solution. What is more, solutions with decoupled storage and compute, like Oxla, can use much cheaper storage than local hard drives.
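
A toy way to see why a column-wise layout compresses better: take the same synthetic records, serialize them row by row and column by column, and compress both with the same general-purpose codec (the numbers this prints are illustrative only, not a benchmark of any particular database):

    import json
    import random
    import zlib

    random.seed(0)

    # Synthetic structured log records with a handful of repetitive fields.
    services = ["checkout", "billing", "auth"]
    rows = [
        {"service": random.choice(services),
         "status": random.choice([200, 200, 200, 404, 500]),
         "latency_ms": random.randint(1, 500)}
        for _ in range(10_000)
    ]

    # Row layout: whole records stored one after another.
    row_bytes = json.dumps(rows).encode()

    # Column layout: all values of each column stored together.
    columns = {key: [r[key] for r in rows] for key in rows[0]}
    col_bytes = json.dumps(columns).encode()

    print("row layout, compressed bytes:   ", len(zlib.compress(row_bytes)))
    print("column layout, compressed bytes:", len(zlib.compress(col_bytes)))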

Let’s take a look at how the cost of storing data might compare between a row-based solution and a column-based one with decoupled storage and compute. Assume we have 10 TB of raw log data stored in a row-based solution that uses local hard drives, like ElasticSearch. To get data redundancy that protects us from irreversible data loss in case of a drive failure, we need to replicate the data 3 times over 3 nodes. Apart from the data already stored in the database, we also need free space on our drives for new data. Assuming it is sufficient to keep one third of the disk space free for that (a dangerously low margin in such a case), storing 10 TB requires a drive with 15 TB of capacity. With threefold replication, that forces us to provision three drives with a total capacity of 45 TB.
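
Spelled out as arithmetic, with the assumptions above:

    raw_logs_tb = 10          # data already stored
    headroom_factor = 1.5     # keep one third of each drive free for new data
    replication_factor = 3    # three copies on three nodes

    per_node_tb = raw_logs_tb * headroom_factor              # 15 TB per node
    total_provisioned_tb = per_node_tb * replication_factor  # 45 TB in total
    print(per_node_tb, total_provisioned_tb)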

On AWS, the price of a gp3 EBS (general-purpose SSD Elastic Block Store) volume is $0.08/GB/month in the US East region, so storing the above-mentioned 45 TB would cost $3,600/month. If we use a solution with columnar storage instead, 10 TB of raw logs requires between 0.8 TB and 3.3 TB of storage space, which costs between $300 and $1,200 per month with threefold replication and free space for insertions.

What’s more, if we use a solution with decoupled storage and compute, we can store data on distributed object storage like AWS S3, Google Cloud Storage (GCS) or Azure Blob Storage. Using this kind of database eliminates the need for such replication (typical data durability of a local drive is up to 99.99%, while the durability of S3 is 99.999999999%). Additionally, one does not have to pay for provisioned capacity, as pricing is based on the data actually stored. On top of that, the cost per gigabyte is lower: on S3 in the US East region the price is at most $0.023/GB/month, which means that storing the compressed data there would cost between 19 and 76 USD per month, roughly 50 times cheaper than the row-based solution on local drives even at the lower compression ratio!
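
Putting the quoted prices together (decimal units, 1 TB = 1,000 GB; the exact figures depend on rounding and on current AWS pricing):

    GB_PER_TB = 1_000
    EBS_GP3_PRICE = 0.08   # $/GB/month, US East
    S3_PRICE = 0.023       # $/GB/month, US East, standard tier

    # Row-based solution on EBS volumes: 45 TB provisioned.
    ebs_row_cost = 45 * GB_PER_TB * EBS_GP3_PRICE

    # Columnar solution: 10 TB of raw logs compressed roughly 12x or 3x.
    compressed_tb = (0.83, 3.3)

    # Columnar on EBS, still with threefold replication and 1.5x headroom.
    ebs_col_cost = [round(tb * 4.5 * GB_PER_TB * EBS_GP3_PRICE) for tb in compressed_tb]

    # Columnar on S3: pay only for the data actually stored.
    s3_cost = [round(tb * GB_PER_TB * S3_PRICE) for tb in compressed_tb]

    print(ebs_row_cost)   # 3600.0
    print(ebs_col_cost)   # [299, 1188] -> roughly $300-$1,200/month
    print(s3_cost)        # [19, 76]    -> roughly $19-$76/month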

What is also important is that scanning compressed data is much faster than performing the same operation on raw data: with a 4x compression ratio we can scan 4 times more rows within the same period of time. Column-based storage also allows skipping the columns that are not needed, which in use cases like metrics calculation can speed up scanning several times. This superior scanning speed often removes the need to maintain an inverted index, which makes the database easier to operate and provides faster data insertion.

When is an analytical database not the way to go?

If you do not need a long retention period, your logs are concise, and you run a lot of very selective queries, typically based on text search, then an analytical database with columnar storage would be a bad choice. In that scenario the cost of storage is negligible compared to the cost of the server, and query performance depends on the ability to retrieve a handful of rows based on text-search criteria. Columnar storage is efficient at scanning over a range of rows, but not at retrieving small subsets of data.

Summary

If you do not care about storage cost and you are mostly running text-based search queries, then you should stick to solutions like ElasticSearch.

Analytical databases with decoupled storage and compute like Oxla can reduce your storage bill by over 98% and significantly reduce the need for computing power, which translates into further cost reduction.  

Want to learn more about Oxla and its costs? Head over to our pricing page!