Adam Szymański

Founder & CTO

Published
June 19, 2024

Why OLAP database can be your secret weapon for massive data sets

OLAP
OLTP
Performance
Scalability

The most commonly encountered type of database is a transactional database, commonly known as OLTP (Online Transaction Processing). PostgreSQL, MySQL, or MariaDB are the most popular examples of such database systems. They can run nearly any type of query you request. In general, they can be used for:

  • Structured / semi-structured / unstructured data
  • Logs
  • Time Series Data
  • Vectors

All of them can run transactions and some of them can even scale. What else can anybody ever dream about?

The answer is pretty simple: performance and scalability. There are areas in which databases of another type shine more brightly. In this blog post, I will focus on OLAP databases (Online Analytical Processing), which – like Oxla – are focused on running queries on large amounts of data.

Performance

To illustrate differences in performance I will use the Star Schema Benchmark dataset with a scale 100. In this benchmark, the biggest table (lineorder) has about 600 mln rows. For the purpose of this article, I will compare the performance of Oxla and PostgreSQL.

When can you expect OLAP to perform better than OLTP? Most commonly, when a query is running over a large amount of data or at least when we can efficiently use a scan reduction mechanism built into a given OLAP (like partitions or natural order of table). Let’s check this example to calculate an average revenue per order:

SELECT AVG(order_revenue)
FROM (
  SELECT lo_orderkey, SUM(lo_revenue) as order_revenue
  FROM lineorder
  GROUP BY 1
);
On my desktop, this query takes 2.75s on Oxla and 182.23s on PostgreSQL. It is a great example of what kind of use cases OLAP databases are good at.

Now let’s check what happens when we select all the columns and apply a highly selective filter in a query.

SELECT * FROM lineorder WHERE LO_ORDERKEY = 5;

This takes 416ms on Oxla and 11ms on PostgreSQL after adding an index on the LO_ORDERKEY column. Most OLAP databases have very strict limitations on indexes. Some of them do not support indexes at all, some of them allow data partitioning, and some of them allow storing data sorted using a set of columns selected at table creation stage. There are just a few databases that actually allow adding additional indexes but it always includes some drawbacks when we compare them to OLTP databases.

Apart from the query speed, there is another thing to consider – the cost of storage. In the given example, lineorder table requires 16.8GB of disk space in Oxla vs 94.7GB in PostgreSQL.

Scalability

In OLTP databases everything revolves around transactions. Unfortunately, transactions are not easy to scale. This usually results in strict limitations of what and how can be scaled. In contrast, a typical OLAP database has none or very limited transaction support – either in feature set, parallelism, or performance. However, this makes them much easier to scale.

Underlying differences

The starkest difference between OLAP and OLTP is that the first class entity in OLTP is a row and in OLAP it is a column. While OLTP allows you to easily access any single record and modify it, OLAP databases do not offer such an option. As a result, many index types become less effective or even impossible to use. This is at the cost of using columnar storage, which allows for scanning columns very fast and easily but also reduces storage size. This way data is usually stored in a compressed format and each compressed chunk contains data for more than one row. 

What’s more, many modern OLAP databases are built to use distributed object storage to reduce storage costs, improve storage fault tolerance, and allow independent storage and compute scaling. On the other hand, distributed object storage solutions do not allow modification of existing objects. This makes efficient modifications of a single row impossible, as it requires rewriting the whole file containing a given row.

That’s why OLAP databases are not well suited for UPDATE/DELETE statements touching only several rows, fine-grained MVCC (multiversion concurrency control), or transactions.

Conclusion

If you have a huge amount of data that you do not modify on a regular basis and you run queries that require no more than one index or are not very selective, then an OLAP database like Oxla should be your weapon of choice. If you want to give us a try, share your feedback, or simply say hello – reach out to us at hello@oxla.com!