Adam Szymański

Founder at Oxla

Published
February 19, 2025

Why we’re building for on-prem


How we got here

In 2019, I started Oxla as an R&D project with a simple but ambitious goal: to build an ultra-efficient OLAP engine from scratch. By 2020 Oxla had grown into a company and a team. Building on a vectorized MPP architecture, we focused on low-level optimizations throughout the system, minimizing data transfer between CPU and RAM to make queries not just fast, but highly resource-efficient, no matter the dataset size.

Fast forward to summer 2023: Oxla came out of stealth. Soon after, we published performance results from the Star Schema Benchmark, where Oxla outperformed ClickHouse, especially on queries that scan large datasets and perform multiple joins. Since then, we’ve been continuously improving Oxla’s efficiency. At the time of this publication, we rank sixth on ClickBench (c6a.4xlarge, 500 GB gp2) among popular self-hosted analytical databases and engines.

However, as MotherDuck’s Jordan Tigani rightly pointed out, performance alone isn’t enough. The former BigQuery founding engineer argues that focusing solely on query response time is an anti-pattern because, from a user’s perspective, the real speed metric is the time it takes from idea to insight. More recently, Andy Pavlo stated that "vectorized execution engines for analytical queries are a commodity now." He similarly argues that ease of use and tooling compatibility should be the industry's primary focus.

While our work at Oxla shows that MPP query engines can and should continue to evolve, I couldn't agree more that developer experience for data engineers is paramount. That said, through many conversations with engineers and technical leaders over the past few months, we've realized that real-world data engineering challenges go far beyond usability. How a solution fits operational constraints, what it costs over the long term, and whether it meets industry-specific requirements all shape the impact of infrastructure decisions on an organization; these challenges are just as pressing, yet often overlooked in the database world.

So what’s the deal with cloud?

Ever since the early 2010s, data warehousing has been focused on the cloud. That decade saw a steady stream of cloud-native OLAP launches with BigQuery (2012), Redshift (2013), Presto (2013), Snowflake (2014), ClickHouse (2016), and Databricks (2017). By the early 2020s, the old guard, including Teradata, Vertica, and Oracle, had jumped on the cloud bandwagon too.

Of course, this was part of the great cloud compute platform shift that began with AWS in 2006. By 2020, the public cloud had become mainstream, even in the enterprise, as the pandemic pushed non-software laggards to accelerate their digital transformation.

This process continues today, but in the SaaS world, long-term cloud costs have been a concern for years. So much so that a16z wrote a whole piece on the cloud paradox and how companies can address it. The TL;DR of their findings goes something like this:

  • Cloud is great at first. It provides flexibility, speed, and eliminates upfront infrastructure costs, which is crucial for startups.
  • But costs scale badly. As companies grow, cloud spending becomes a major burden, eating into profit margins.
  • Public companies are hit hardest. High cloud expenses shrink gross margins, reducing market valuations by billions.
  • VC-backed startups face a paradox. Cloud helps them scale fast but creates long-term financial inefficiencies.

The recommended fix? Plan for cloud cost management early and balance cloud with owned infrastructure for efficiency. Smart enterprises realized this upfront, which is why hybrid environments became popular. But in data analytics, this whole cloud conundrum is even more nuanced.

Cloudception

First of all, OLAP has always been resource-hungry by nature, which only amplifies the scaling issue mentioned above. Fortunately, hardware performance has improved tremendously: over the last decade we went from just a couple of cores per CPU to over a hundred (though RAM bandwidth hasn't kept up, which is the problem our team solves at the engine level).

It’s not just that CPUs have gotten more powerful, but they’ve also become nominally cheaper. Looking at this through a performance-to-cost lens, compute power is way more affordable than it was a decade ago.

Yet AWS cost memes are still going strong, and for good reason. While hardware has improved, public cloud compute pricing has remained relatively stable. Rather than passing on the cost savings from hardware advancements, AWS keeps its profit margins steady and reinvests the difference into R&D.

This is of course smart on their part. Price wars with competitors would most likely shrink margins. And it’s worth noting that the whole cloud revolution started when Amazon turned its infrastructure cost center into a profit center.

Before you start thinking I’m actually the old man yelling at cloud, I do realize the convenience of on-demand computing. My point is just that the public cloud is and always will be on the expensive side.

No wonder modern cloud infra startups build their own data centers in co-location facilities. Prisma’s CEO recently discussed how they’re able to pass savings on to their customers that way. Railway’s CEO wrote about how they built their own data center to move away from GCP. Just a few days ago, the CTO of 37signals shared that it’s been 10 years since they launched Basecamp on a fleet of 61 servers, roughly double the lifespan they initially expected that hardware to have.

Unfortunately, what these companies are doing would never work for data warehousing because of the sheer scale required to run analytics for customers. And believe me, building a cloud warehouse product is already difficult and expensive, even when leveraging an open-source DBMS. Scaling out physical infrastructure on top of that makes no sense when you can use the public cloud.

Can you really own data in the cloud?

Then there’s the fact that the public cloud isn’t exactly great for data sovereignty. For many companies, especially in regulated industries or certain geographies, keeping data within specific borders isn’t just a preference but a legal requirement. Public cloud providers offer region-based hosting, but that doesn’t always mean full control over where data is processed or who can access it.

Even with encryption and dedicated regions, companies still have to trust a third party to enforce sovereignty. And when it comes to privacy, relying on someone else’s infrastructure always introduces some level of risk, whether from compliance gaps, jurisdictional conflicts, or plain old vendor lock-in.

This is where VPC (Virtual Private Cloud) and BYOC (Bring Your Own Cloud) deployments come in. Companies get the flexibility of the cloud while keeping full control over where and how their data is stored and processed. VPC setups, whether on self-managed infrastructure or in a dedicated region of a public provider, give organizations more security and control, though they also require a lot more internal management to stay compliant and secure. BYOC lets businesses deploy a managed data warehouse inside their own cloud account, helping them meet strict regulations. In theory, these approaches should solve the sovereignty and privacy concerns of public cloud. In practice, it’s not that simple.

For one, the CLOUD Act of 2018 explicitly confirmed the U.S. government’s ability to request access to data stored on servers owned by AWS, GCP, and Azure, even if they’re located outside the U.S. That sucks big time for data sovereignty and has only heightened concerns around public cloud use ever since.

Regulations aside, the challenge with VPC and BYOC self-hosting is that data warehouses aren’t just any software. They’re massive, distributed systems designed to scale across hundreds or thousands of nodes. Running them efficiently takes deep expertise in infrastructure, orchestration, and tuning. Public cloud providers hide most of that complexity, but in a VPC or BYOC setup, the burden falls entirely on internal teams.

That means handling everything from provisioning and scaling to network optimization and security. For a lot of organizations, that’s a level of operational overhead they just aren’t equipped to deal with.

So why not just go on-prem? Instead of managing cloud infrastructure or fine-tuning a BYOC setup, teams can run their warehouse on dedicated hardware they fully control. There’s no cloud networking overhead, no surprise egress fees, and no dependencies on a third-party provider.

Of course, on-prem comes with its own challenges, like upfront hardware costs and capacity planning. But if you already own or rent infrastructure, it can be a predictable and cost-effective alternative that lets you keep your data fully under your control.

What on-prem warehousing looks like today

So what are the options for building a data warehouse on-prem in 2025? From talking to organizations that still do this, I’d say there are two common approaches.

The first is sticking with a legacy platform like Oracle or Db2 as the centerpiece, with specialized open-source databases glued to it. This is usually done with database gateways, which offer some convenience by providing access to multiple sources from a single environment. Still, these stacks tend to be costly, bloated, and inefficient.
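For readers unfamiliar with the gateway pattern, here’s a minimal sketch of the convenience it buys, assuming an Oracle instance with a database link pointing at a PostgreSQL source through an Oracle Database Gateway. The DSN, credentials, and link name (pg_link) are hypothetical placeholders.

    # Minimal sketch of the gateway pattern: from a single Oracle session,
    # query a table that actually lives in PostgreSQL via a database link
    # routed through an Oracle Database Gateway. All names and credentials
    # are hypothetical placeholders.
    import oracledb

    conn = oracledb.connect(user="etl", password="secret", dsn="oracle-host/ORCL")
    with conn.cursor() as cur:
        # The @pg_link suffix routes the query through the gateway to the
        # external PostgreSQL database; quoting preserves the lowercase name.
        cur.execute('SELECT COUNT(*) FROM "orders"@pg_link')
        print(cur.fetchone()[0])
    conn.close()

One environment, many sources; the cost is that every query still funnels through the legacy centerpiece.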

Some companies try to go hybrid by adding a cloud warehouse alongside their legacy system. While this can work in some cases, cloud data warehouses don’t always integrate well with older platforms, making data migration a massive headache. And at enterprise scale, they often turn out to be way too expensive.

The second approach is going fully open-source, typically built around Apache Spark, sometimes with Apache Flink for streaming. While powerful and robust, this setup is extremely complex, and Spark is not optimized for OLAP workloads. It’s great for large-scale batch processing, but it struggles with interactive analytical queries compared to specialized OLAP engines. Probably the biggest challenge here is the lack of official support beyond the open-source community and third-party services, which puts a huge burden on teams.

It’s no surprise then that many companies end up migrating their Spark environments to Databricks. Sure, this makes things easier and might even save money in the short term. In the long run though, companies end up locked into Databricks’ pricing and spend a fortune on cloud processing.

But what about more modern OLAP engines, you might ask. There are certainly some high-performance open-source analytical DBMSs out there, like Apache Doris, ClickHouse, or StarRocks. However, these were built for the cloud-native era and optimized for cloud deployments. Running them on-prem is possible, but it introduces implementation and maintenance challenges that make things far more complicated than they should be.

What’s worth noting is that even when you can self-host a modern OLAP system (usually because it’s open-source), the company behind it is focused on monetizing its cloud offering. From a business perspective, it’s in their best interest not to make self-hosting easy for anyone. To these vendors, on-prem warehousing is a negligible edge case, and the ideals of open source are sometimes sacrificed out of fear that someone will build a competing cloud product. Hence the growing popularity of source-available licensing.

What’s next

The cloud-native era is a fact of life, but teams building data warehouses on-prem deserve the right tools. That’s why at Oxla, we’ve refocused our mission on empowering on-prem data engineers.

Today, Oxla is an OLAP database and query engine purpose-built for compute and memory efficiency. It handles versatile workloads at the hundreds-of-terabytes scale with low-latency performance across batch, real-time, time-series, and ad hoc queries. Right now, it’s being used by on-prem customers in POCs across multiple industries and use cases that demand efficiency at scale and complete control over data.

In the spirit of building in public, we’re sharing our product roadmap for the first half of 2025. Beyond ongoing improvements in performance, scalability, and other areas, two things in particular are worth mentioning:

  1. We’re working on enabling queries on external sources, including open table catalogs, external databases, and raw files (e.g. Parquet on S3); see the sketch after this list. Once we launch this capability, Oxla will effectively become a data lakehouse solution, as data engineers would expect in 2025.
  2. Support for ACID (Atomicity, Consistency, Isolation, Durability) transactions. This is non-negotiable in the enterprise, and full ACID support will also enable time-travel capabilities.
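To make the first roadmap item concrete, here’s a minimal sketch of what querying an external Parquet file could look like. Oxla speaks the PostgreSQL wire protocol, so a standard Postgres client such as psycopg2 works; the read_parquet() table function, the bucket path, and the connection details below are hypothetical placeholders, not Oxla’s final syntax.

    # Hypothetical sketch: aggregating over an external Parquet file through
    # Oxla's PostgreSQL-compatible interface, without loading the data first.
    # read_parquet() and all connection details are illustrative placeholders.
    import psycopg2

    conn = psycopg2.connect(host="localhost", port=5432, user="oxla", dbname="oxla")
    with conn.cursor() as cur:
        cur.execute("""
            SELECT region, SUM(revenue) AS total_revenue
            FROM read_parquet('s3://example-bucket/sales/2025/*.parquet')
            GROUP BY region
            ORDER BY total_revenue DESC;
        """)
        for region, total in cur.fetchall():
            print(region, total)
    conn.close()

The appeal of this pattern is that the warehouse becomes a query layer over data that stays where it already lives, which is exactly what the lakehouse label implies.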

Looking further into the future, my vision is to make on-prem data warehousing as easy as possible. That means bringing some of the benefits of cloud solutions to on-prem deployments, like sharing a single database instance across multiple projects while ensuring resource separation per project or team and maintaining full visibility into resource usage. Beyond that, I want to introduce true elasticity to on-prem environments by enabling autoscaling within data centers. This would allow resources to scale dynamically based on workload demands while minimizing maintenance overhead.

Oxla is already easy to deploy with a Docker image, but I want to take that simplicity even further. Our architecture is homogeneous by design, meaning each server is deployed the same way, while the database and query engine decide how workloads are distributed. This eliminates manual resource allocation and ensures the system optimally assigns tasks without user intervention. The goal is to extend this effortless scalability by combining cloud-like elasticity with the control and predictability of on-prem, all without the complexity of Kubernetes.
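To illustrate what that homogeneity means in practice, here’s a rough sketch of rolling out identical nodes across a small fleet, assuming SSH access to each host. The image name, the OXLA_PEERS discovery variable, and the addresses are hypothetical placeholders, not Oxla’s documented procedure.

    # Rough sketch (not Oxla's documented procedure): because every node runs
    # the same image the same way, cluster rollout is one step repeated per
    # host. Image name, OXLA_PEERS, and addresses are hypothetical.
    import subprocess

    HOSTS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical node addresses

    for host in HOSTS:
        # Identical command on every host; the engine decides how work is
        # distributed, so no per-node roles need to be configured.
        subprocess.run(
            ["ssh", host,
             "docker run -d --name oxla --network host "
             f"-e OXLA_PEERS={','.join(HOSTS)} "  # hypothetical discovery setting
             "oxla/oxla:latest"],  # hypothetical image name
            check=True,
        )

The tooling here is beside the point; what matters is the shape of the operation: one uniform step per server, with no role-specific configuration to manage.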

If you’re a data engineer or technical leader working with on-prem analytics, I’d love to hear about your challenges. We’ve learned so much from our early customers over the last few months and I couldn't be more excited to keep building for on-prem while pushing the boundaries of what’s possible in database engineering.

Looking forward to speaking with you soon!

Let’s discuss your use case

Start modernizing your on-prem analytics stack—chat with our team to see if Oxla is the right fit.