Adam Szymański

Founder & CTO

Published
September 26, 2024

Challenges of working with large JSONs in your database system

JSON
Challenges

In today’s article, you’re going to learn more about out-of-memory errors and the user experience around them. We’ll also mention the industries that run into this problem, explain the associated challenges, and finally show how Oxla handles these scenarios. If you’re interested already, keep on reading!

Not enough memory

While working with your database, you have probably encountered errors like:

  • “Spark out of memory error”
  • “The query could not be executed in the allotted memory”
  • “Out of memory: Killed process”

If you store a large amount of data in a single field, e.g. a very large JSON (which is pretty common in cybersecurity) or a very long text (basically a whole document in one field), then you have probably seen these kinds of messages. Such troublesome large fields might also be the result of operations like string_agg, array_agg or json_agg.
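
To make this concrete, here is a minimal, self-contained sketch (plain Python, no database involved; the event shape and sizes are made up) of how aggregating many small JSON records, the way json_agg does, can produce a single multi-megabyte field:

```python
import json

# Hypothetical example: 10,000 small security events, a few hundred bytes each.
events = [
    {
        "ts": f"2024-09-26T12:00:{i % 60:02d}",
        "src_ip": "10.0.0.1",
        "dst_ip": "10.0.0.2",
        "signature": "ET SCAN Suspicious inbound",
        "payload": "A" * 120,
    }
    for i in range(10_000)
]

# Roughly what json_agg does: collapse all rows into one JSON array field.
aggregated_field = json.dumps(events)
print(f"size of the single aggregated field: {len(aggregated_field) / 1_000_000:.1f} MB")
```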

Why does it happen?

Issues like this are daily bread in database systems with vectorized query engines. Why do they happen, though? Vectorized query engines are fast thanks to processing data in batches: many rows at once. The default batch size is usually an internal parameter selected by a given query engine’s developers, and it is typically larger than a thousand rows. Modern query engines are also usually parallelized, which often means there is at least one thread per CPU core and each of them is processing at least one batch of data at a time. Now let’s assume that we process 4,000 rows per batch and we have a JSON column whose values are 2MB in size on average. That would mean each core requires at least 8GB of RAM. Depending on the query you run, additional data might need to be stored as well: operations like GROUP BY, JOIN or ORDER BY typically require keeping extra data in memory for efficient processing.

So with a 16-core CPU, you would need at least 128GB of RAM just to store the batches currently being processed: more than a typical server has at its disposal!
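
As a quick back-of-the-envelope check, here is the same arithmetic as a small Python sketch (the numbers are the ones assumed above; real engines vary in batch size and per-operator overhead):

```python
# Rough memory estimate for batch processing of a wide JSON column.
rows_per_batch = 4_000        # typical vectorized batch size
avg_json_size_mb = 2          # average size of the JSON value in one row
cpu_cores = 16                # at least one in-flight batch per core

per_core_mb = rows_per_batch * avg_json_size_mb   # 8,000 MB = 8 GB per core
total_gb = per_core_mb * cpu_cores / 1_000        # 128 GB just for the batches

print(f"per core: {per_core_mb / 1_000:.0f} GB, total: {total_gb:.0f} GB")
# Operators like GROUP BY, JOIN or ORDER BY need extra memory on top of this.
```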

Solutions

Some databases do allow adjusting the block size, at least at the storage level, which minimizes these issues. Other solutions run a series of queries on small subsets of the data, store the intermediate results, and then combine them with a manually written query. Others recommend using servers with a higher RAM-to-CPU-core ratio. At Oxla, we have a different solution.
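
Before getting to that, here is roughly what the manual “query small subsets, then combine” workaround looks like in practice. This is only a sketch: the events table, its columns and the run_query helper are hypothetical stand-ins for your own schema and database client.

```python
# Hypothetical sketch of the "query small subsets, then combine" workaround.
def run_query(sql: str) -> list[tuple]:
    # Placeholder: replace with your actual database client (e.g. a psycopg connection).
    print("would run:", sql)
    return []

day_ranges = [
    ("2024-09-01", "2024-09-08"),
    ("2024-09-08", "2024-09-15"),
    ("2024-09-15", "2024-09-22"),
    ("2024-09-22", "2024-09-29"),
]

partial_results = []
for start, end in day_ranges:
    # Each query touches only a slice of the data, keeping per-query memory small.
    partial_results.extend(run_query(
        f"SELECT device_id, json_agg(event) FROM events "
        f"WHERE ts >= '{start}' AND ts < '{end}' GROUP BY device_id"
    ))

# Combining the partial results is then left to the application,
# or to yet another manually written query over stored intermediate tables.
```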

Forcing developers to understand database internals in order to run their queries is always bad practice and makes the developer experience very unwelcoming. At Oxla, we always strive to automate these kinds of chores to make our users’ lives easier. Our query engine dynamically adjusts the batch size. We ensure that batches written to a file do not exceed a byte-size limit, so we know the upper limit on the size of any single batch during reading, which gives us better control over resources.

During pipeline execution, the batch is modified by executing functions and performing operations like GROUP BY or JOIN, which can change its byte size. Each time an operation is about to change the batch’s size, we check the size of the result and split the batch if necessary, so the next operator in the pipeline receives an input whose size is within the limits.
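
Conceptually, the mechanism can be sketched like this. This is a simplified Python sketch, not the actual Oxla implementation; BATCH_BYTE_LIMIT and the byte-size accounting are purely illustrative:

```python
from typing import Iterator, List

BATCH_BYTE_LIMIT = 64 * 1024 * 1024  # illustrative threshold, not a real Oxla value

def byte_size(rows: List[dict]) -> int:
    """Very rough accounting: sum of the string length of every value."""
    return sum(len(str(value)) for row in rows for value in row.values())

def split_if_needed(rows: List[dict]) -> Iterator[List[dict]]:
    """Yield sub-batches whose byte size stays within the limit.

    This is what a pipeline conceptually does after an operator (a function
    call, GROUP BY, JOIN, ...) may have grown the rows: check the result size
    and hand the next operator only batches that fit.
    """
    current: List[dict] = []
    current_bytes = 0
    for row in rows:
        row_bytes = byte_size([row])
        if current and current_bytes + row_bytes > BATCH_BYTE_LIMIT:
            yield current
            current, current_bytes = [], 0
        current.append(row)
        current_bytes += row_bytes
    if current:
        yield current
```

The key property is that no operator ever receives a batch above the limit, so memory usage stays bounded no matter how large individual JSON values get.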

Thanks to this approach, we can execute queries over JSONs and strings of up to 32MB in size. So, if you are struggling with support for large JSONs or strings in your database, then choosing Oxla is the right path for you! Head to our Product page to learn more!