Adam Szymański

Founder & CTO

Published
July 17, 2024

Our road to improving Oxla's results on ClickBench

Performance
ClickBench

At Oxla, we strive to be the fastest distributed analytical database, which is why optimizations are extremely important to us. They allow our customers to achieve more with their budget, or even unlock completely new use cases that would not make sense with less efficient solutions due to budget constraints. Achieving such performance requires a lot of engineering effort in many different areas of our system. It also requires constant verification to make sure that no change causes a performance regression.

Optimization types

There are several different types of optimizations:

  • Single operation optimizations (e.g. adding one column to another, extracting day of week from date, etc.).
  • Data organization (e.g. indexes, compression algorithms, partitioning, ordering).
  • Planner optimizations (e.g. reusing common subexpressions, join ordering).
  • Query engine architecture optimizations (e.g. Vectorized Query Execution, Morsel-Driven Parallelism, Massively Parallel Processing).

Single operation optimizations are the most isolated ones. They usually do not affect the rest of the system, and they can usually be moved fairly easily from one database to another. While they are relatively easy to implement, there is a huge number of operations that databases can perform, which makes it hard to provide top-of-the-line optimization for all of them. Let's imagine we want to speed up the ABS() function. We might start with a fairly generic implementation that works for all numeric types:

for each value V in column C:
  V := abs(V)


Unfortunately, this will usually cause the compiler to generate code that performs the following action:

for each value V in column C:
  if V < 0:
    V := -V


"If" statements are extremely costly. Luckily for floating point numbers, this can be simplified to the following form:

for each value V in column C:
  V := V & 0x7FFFFFFF


Obviously, this can be optimized further using vector instruction sets. When approaching it holistically, one also needs to implement it for all supported CPU architectures (usually x86_64 and ARM).
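To make this concrete, here is a minimal sketch of the branchless trick for a column of 32-bit floats; the function name is illustrative and this is not Oxla's actual code. A plain loop like this is exactly the shape compilers auto-vectorize into SSE/AVX or NEON instructions:

#include <cstddef>
#include <cstdint>
#include <cstring>

// Branchless absolute value for a column of 32-bit floats (illustrative
// sketch, not Oxla's actual code). Clearing the sign bit avoids the
// unpredictable branch; the simple loop lets the compiler auto-vectorize.
void abs_f32_column(float* values, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        std::uint32_t bits;
        std::memcpy(&bits, &values[i], sizeof(bits)); // reinterpret the float's bits
        bits &= 0x7FFFFFFFu;                          // clear the sign bit
        std::memcpy(&values[i], &bits, sizeof(bits));
    }
}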

Data organization optimizations are usually much more complex, and some of them can't be combined. There are obvious choices typical for every OLAP database, like columnar storage, but there are also several less typical ones. For example, ClickHouse, Oxla, and StarRocks store data within tables in a user-defined order, which reduces not only the number of files scanned during each query but even the amount of data read within each file. Other solutions use partitioning, which merely shortens the list of files to scan and does not operate at a very granular level.

Planner optimizations are trickier. Some of them, like cost-based optimization, are well known and transferable between databases. Others depend on a given query engine's implementation. For example, in Oxla we run operations in an order that allows us to reuse memory blocks previously occupied by the input data of another operation. This is not as easily achievable in query engines that use abstract column types.
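As a simplified illustration of that block reuse (not Oxla's actual implementation), when an input column is no longer needed after an operation and the output has the same fixed-width type, the result can be written in place instead of into a freshly allocated block:

#include <cstddef>
#include <vector>

// Simplified illustration of memory block reuse (not Oxla's actual code):
// column a is dead after this operation, so the sum is written into the
// block that already holds it, avoiding an allocation and extra traffic
// between the CPU and main memory.
void add_columns_in_place(std::vector<long>& a, const std::vector<long>& b) {
    for (std::size_t i = 0; i < a.size(); ++i) {
        a[i] += b[i]; // output reuses the input block of column a
    }
}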

And finally, we can move on to query engine architecture optimizations. Here the differences between solutions can be fundamental, and optimizations might not be transferable without rewriting the whole engine from scratch. In the case of Oxla, we have focused on minimizing I/O (including networking) and then minimizing the amount of data transferred between the CPU and main memory.

Challenges in ClickBench

ClickBench is a database benchmark that focuses on measuring the performance of aggregations (GROUP BY or aggregation over the whole table) with optional WHERE, ORDER BY, and LIMIT clauses. It does not contain JOINs or sub-queries. It measures performance on tasks typical for real-time analytical databases, the ones for which ClickHouse was initially designed. The dataset itself contains real data from the Yandex search engine. The distribution of values is typical for such data: either exponential or normal.

While most queries test raw performance on operations like GROUP BY over lots of unique keys, some of them are much trickier and require some changes in the query planner. Let’s take a look at query number 30:

SELECT SUM(ResolutionWidth), SUM(ResolutionWidth + 1), SUM(ResolutionWidth + 2), SUM(ResolutionWidth + 3), SUM(ResolutionWidth + 4), SUM(ResolutionWidth + 5), SUM(ResolutionWidth + 6), SUM(ResolutionWidth + 7), SUM(ResolutionWidth + 8), SUM(ResolutionWidth + 9), SUM(ResolutionWidth + 10), SUM(ResolutionWidth + 11), SUM(ResolutionWidth + 12), SUM(ResolutionWidth + 13), SUM(ResolutionWidth + 14), SUM(ResolutionWidth + 15), SUM(ResolutionWidth + 16), SUM(ResolutionWidth + 17), SUM(ResolutionWidth + 18), SUM(ResolutionWidth + 19), SUM(ResolutionWidth + 20), SUM(ResolutionWidth + 21), SUM(ResolutionWidth + 22), SUM(ResolutionWidth + 23), SUM(ResolutionWidth + 24), SUM(ResolutionWidth + 25), SUM(ResolutionWidth + 26), SUM(ResolutionWidth + 27), SUM(ResolutionWidth + 28), SUM(ResolutionWidth + 29), SUM(ResolutionWidth + 30), SUM(ResolutionWidth + 31), SUM(ResolutionWidth + 32), SUM(ResolutionWidth + 33), SUM(ResolutionWidth + 34), SUM(ResolutionWidth + 35), SUM(ResolutionWidth + 36), SUM(ResolutionWidth + 37), SUM(ResolutionWidth + 38), SUM(ResolutionWidth + 39), SUM(ResolutionWidth + 40), SUM(ResolutionWidth + 41), SUM(ResolutionWidth + 42), SUM(ResolutionWidth + 43), SUM(ResolutionWidth + 44), SUM(ResolutionWidth + 45), SUM(ResolutionWidth + 46), SUM(ResolutionWidth + 47), SUM(ResolutionWidth + 48), SUM(ResolutionWidth + 49), SUM(ResolutionWidth + 50), SUM(ResolutionWidth + 51), SUM(ResolutionWidth + 52), SUM(ResolutionWidth + 53), SUM(ResolutionWidth + 54), SUM(ResolutionWidth + 55), SUM(ResolutionWidth + 56), SUM(ResolutionWidth + 57), SUM(ResolutionWidth + 58), SUM(ResolutionWidth + 59), SUM(ResolutionWidth + 60), SUM(ResolutionWidth + 61), SUM(ResolutionWidth + 62), SUM(ResolutionWidth + 63), SUM(ResolutionWidth + 64), SUM(ResolutionWidth + 65), SUM(ResolutionWidth + 66), SUM(ResolutionWidth + 67), SUM(ResolutionWidth + 68), SUM(ResolutionWidth + 69), SUM(ResolutionWidth + 70), SUM(ResolutionWidth + 71), SUM(ResolutionWidth + 72), SUM(ResolutionWidth + 73), SUM(ResolutionWidth + 74), SUM(ResolutionWidth + 75), SUM(ResolutionWidth + 76), SUM(ResolutionWidth + 77), SUM(ResolutionWidth + 78), SUM(ResolutionWidth + 79), SUM(ResolutionWidth + 80), SUM(ResolutionWidth + 81), SUM(ResolutionWidth + 82), SUM(ResolutionWidth + 83), SUM(ResolutionWidth + 84), SUM(ResolutionWidth + 85), SUM(ResolutionWidth + 86), SUM(ResolutionWidth + 87), SUM(ResolutionWidth + 88), SUM(ResolutionWidth + 89) FROM hits;


SUM(ResolutionWidth + K) can be replaced with SUM(ResolutionWidth) + COUNT(*) * K. This way, the engine calculates only two aggregations (SUM(ResolutionWidth) and COUNT(*)) and then produces the final results. Otherwise, for each input row, we would need to update intermediate aggregation results for 90 separate expressions.
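The rewrite is just distributivity of summation. For a column x with N rows and no NULLs (which holds for ClickBench's hits table, so SUM and COUNT(*) see the same rows):

sum over i=1..N of (x_i + K)  =  (sum over i=1..N of x_i) + N * K  =  SUM(x) + COUNT(*) * K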

Another example of a planner optimization required by ClickBench is query number 35:

SELECT ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3, COUNT(*) AS c FROM hits GROUP BY ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3 ORDER BY c DESC LIMIT 10;


Since ClientIP - 1, ClientIP - 2, and ClientIP - 3 are uniquely determined by ClientIP, the extra grouping keys add no new groups, and the query can be simplified to the following form:

SELECT ClientIP, ClientIP - 1, ClientIP - 2, ClientIP - 3, COUNT(*) AS c FROM hits GROUP BY ClientIP ORDER BY c DESC LIMIT 10;


Running an aggregation keyed on a single integer column is much faster than one keyed on several such columns.

One might wonder why there's a need to optimize such peculiar queries. While it seems unlikely that anyone would write something so unusual, these types of queries can be generated by data visualization tools or ORM libraries.

Other ClickBench queries check optimizations like:

  • LIKE operator speed,
  • regexp speed,
  • finding top N values (see the sketch after this list).
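Finding top N values is typically implemented with a partial sort or a bounded heap rather than a full sort. A minimal sketch of the idea (illustrative, not Oxla's implementation):

#include <algorithm>
#include <cstddef>
#include <vector>

// Top-N sketch (illustrative): std::partial_sort orders only the n largest
// values, doing roughly O(count * log n) work instead of a full
// O(count * log count) sort - the same idea behind ORDER BY x LIMIT n.
std::vector<long> top_n(std::vector<long> values, std::size_t n) {
    n = std::min(n, values.size());
    std::partial_sort(values.begin(), values.begin() + n, values.end(),
                      [](long a, long b) { return a > b; }); // descending order
    values.resize(n); // keep only the n largest values
    return values;
}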

What We Had To Optimize

The biggest challenge for us was improving the GROUP BY operation when the key has extremely high cardinality. This required us to improve our hashmap implementation and change the way we exchange partial results between threads. We hope to release this change soon.
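As a simplified sketch of what exchanging partial results between threads means (not Oxla's actual implementation), each thread can pre-aggregate its slice of the input into a thread-local hash map, and the partial maps are merged at the end. With high-cardinality keys, both the hash map itself and this merge step become the bottleneck:

#include <algorithm>
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <vector>

// Parallel GROUP BY key, COUNT(*) sketch (illustrative, not Oxla's code):
// every thread aggregates its chunk into a private hash map, then the
// partial maps are merged single-threaded.
std::unordered_map<std::uint64_t, std::uint64_t>
parallel_count_by_key(const std::vector<std::uint64_t>& keys, unsigned num_threads) {
    std::vector<std::unordered_map<std::uint64_t, std::uint64_t>> partials(num_threads);
    std::vector<std::thread> workers;
    const std::size_t chunk = (keys.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(keys.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                ++partials[t][keys[i]]; // thread-local pre-aggregation
        });
    }
    for (auto& w : workers) w.join();
    auto result = std::move(partials[0]); // merge the partial maps
    for (unsigned t = 1; t < num_threads; ++t)
        for (const auto& [key, cnt] : partials[t])
            result[key] += cnt;
    return result;
}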

Apart from that, we have implemented lots of single operation optimizations:

  • We added in-memory varint compression of integers to the GROUP BY key representation in cases where GROUP BY uses multiple columns (see the sketch after this list). It speeds up GROUP BY when we aggregate over integer or text columns. Example: SELECT … FROM … GROUP BY my_int_col, some_other_col. This was a minor improvement that reduced the amount of data transferred between the CPU and main memory.
  • Added an optimized implementation of the ‘<>’ operator when one of the operands is a literal. This resulted in a 2.5x speedup of this operator.
  • Implemented the LENGTH function using SIMD on x86. This provided a whopping 8x speed increase.
  • More balanced workload distribution between CPU cores. This was especially important on servers with a very large number of cores, where some of them were not receiving work at all.
  • Faster implementation of the ORDER BY x LIMIT k expression, where x is a single scalar column. This reduced query execution time by ~15%.
  • Faster thread creation, increasing the speed of WHERE/HAVING expressions when fewer than ~1.3% or more than ~98.7% of rows pass (no speedup otherwise). This was a minor improvement, optimistically reducing execution time by ~3%.
  • Vectorized MIN, MAX, and SUM aggregations when aggregating non-nullable columns over a whole table (over 2x faster).
  • Faster LIKE operator (3x faster on ClickBench queries compared to our previous implementation).
  • Optimized aggregations over non-nullable columns.
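Here is a minimal sketch of the varint (LEB128-style) encoding mentioned in the first item above; it is illustrative and not Oxla's exact key format:

#include <cstdint>
#include <vector>

// LEB128-style varint encoder (illustrative sketch, not Oxla's exact
// format). Each byte stores 7 bits of payload; the high bit signals that
// more bytes follow. Small integers, which dominate real data, shrink
// from 8 bytes to 1-2, reducing CPU-to-memory traffic for GROUP BY keys.
void encode_varint(std::uint64_t value, std::vector<std::uint8_t>& out) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>(value) | 0x80); // continuation bit set
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value)); // final byte, high bit clear
}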

We also have added two planner optimizations:

  • Normalize expressions in GROUP BY (e.g. transform GROUP BY colA + 2 into GROUP BY colA).
  • Transform aggregations SUM(A + k) into SUM(A) + k * COUNT(*).

Additionally, we implemented one change in data storage: we switched the default compression algorithm for text from zstd to LZ4. It turned out that the impact on the compression ratio was negligible, while decompression became over 30% faster.
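For reference, here is a minimal sketch of compressing one block with the LZ4 C API (liblz4); it is illustrative and not how Oxla integrates the codec internally:

#include <lz4.h>

#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Compress a single text block with LZ4 (illustrative sketch, not Oxla's
// integration). LZ4_compressBound() gives the worst-case output size;
// LZ4_compress_default() returns the number of bytes written, or <= 0
// on failure.
std::vector<char> lz4_compress_block(const std::string& input) {
    const int max_size = LZ4_compressBound(static_cast<int>(input.size()));
    std::vector<char> compressed(static_cast<std::size_t>(max_size));
    const int written = LZ4_compress_default(input.data(), compressed.data(),
                                             static_cast<int>(input.size()), max_size);
    if (written <= 0) throw std::runtime_error("LZ4 compression failed");
    compressed.resize(static_cast<std::size_t>(written));
    return compressed;
}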

Summary

While very few of those optimizations required significant effort on their own, there were plenty of them, and their cumulative effect was tremendous: our combined result on ClickBench improved almost 3x. Some of the above-mentioned optimizations are not yet released but will be available in Oxla soon.

Want to get up to speed and run Oxla within a few minutes? We’ve got you covered! Head to our docs and learn how to start using our product!