Adam Szymański

Founder & CTO

Published
August 7, 2023

Oxla efficiency on Star Schema Benchmark

Star Schema Benchmark
Performance
OLAP

Hi there! At Oxla, we've been working really hard to achieve exceptional performance. We have dedicated a lot of effort towards this goal, and we're thrilled to present our progress in the latest tech preview. To accomplish this goal, we've chosen to harness the widely acknowledged Star Schema Benchmark dataset for assessing the analytical database's performance. This dataset is known for its ability to accurately evaluate the efficiency of GROUP BY and JOIN operations.

Star Schema Benchmark

The original dataset and queries are described here. For generating the dataset, we used code from GitHub - vadimtk/ssb-dbgen (as advised by Clickhouse: Star Schema Benchmark | ClickHouse Docs). The main table in the dataset was created on a scale of 100 and has around 600 million rows.

The definition of tables and queries for Oxla is available here: GitHub - OxlaCom/StarSchemaBenchmark: Queries used to run SSB on Oxla. As you can notice, clauses in WHERE are manually pushed down: it’s an inconvenience that will be fixed in the next Oxla release.

We used ONLY normalized version: that’s an unmodified version of this dataset. It requires using JOIN operation.

Methodology

Our objective was to conduct a benchmark test that could be directly compared to the results presented by Altinity in their blog post about Clickhouse – "ClickHouse Nails Cost Efficiency Challenge Against Druid & Rockset".

We decided to use the same methodology as Clickhouse: perform a warm-up run before executing every query three times and obtaining an average result for each query.

To get results for Clickhouse, we used table definitions and queries from the Altinity blog post and the latest version of Clickhouse: version 23.6.2 revision 54464. Interestingly, our findings differ from those reported in the Altinity blog post. They claimed a total running time of 18.5 seconds on the normalized dataset, while we observed it to be 25 seconds. The reason for this difference is currently unknown. It could be due to differences in database configuration (we have used default settings, installing it from the DEB package using this instruction: Install ClickHouse | ClickHouse Docs) or a performance regression between different versions of ClickHouse.

Hardware

Oxla benchmark was run on m5.8xlarge server on AWS, exactly as Clickhouse.

In analyzed queries, Oxla shows a significant speed advantage. This difference is especially noticeable in queries that require scanning a large number of rows and performing multiple joins. Let's take a closer look into this matter.

Queries 1.1–1.3 are queries containing only GROUP BY over the whole table with where clause. The queries only contain integer columns, and they select rows based on date ranges. This helps both Oxla and Clickhouse limit the amount of data scanned from the table. Oxla’s advantage over those queries ranges from mild (+18% faster) to large (+273% faster). When processing a smaller amount of data, Oxla's advantage becomes smaller as well, possibly because of the longer query startup time (~8ms for a simple query with GROUP BY + WHERE) on Oxla. That’s an issue we will work on in the future.

Queries 2.1-2.3 are performing 2 JOINs without WHERE condition that’d allow for easy restriction of rows to be scanned. Those queries result in a full range scan over the main table (lineorder). Queries here use both integer and string columns. Here, Oxla is usually 6x faster than Clickhouse.

Queries 3.1–3.4 are performing 3 JOINs and GROUP BY over 3 columns. All those queries select a range of rows by data, resulting in a partial scan over the main table by both Oxla and Clickhouse. Query 3.4 is accessing a very small subset of rows. The problem highlighted in Query 3.4 is similar to that of Q1.2. Although the query is small, the pipeline required for data processing is quite complex, resulting in similar processing speeds. That is causing significant query startup overhead. Queries 3.1–3.3 are running between ~6.5–9x faster on Oxla.

Queries 4.1-4.3 are the most varied in their execution, involving 3 to 4 JOINs. Queries Q4.2 and Q4.3 select a range of rows based on data, resulting in a partial scan of the main table by both Oxla and Clickhouse. Q4.1 requires a full table scan. Oxla’s advantage on those queries varies between 4x-8x faster.

Summary

Oxla’s implementation of JOIN operation is 6x+ faster than Clickhouse, and the GROUP BY implementation is ~3x faster when aggregated data is relatively large (tens of millions of records).

Clickhouse performance is still better on denormalized SSB dataset due to their amazing LowCardinality column type modifier. We plan to showcase our solution for that particular use case in the near future through one of our upcoming releases.