Wojciech Chlapek

Software Engineer

Published
February 1, 2024

Using Huge Pages in Linux Applications Part 2: Transparent HugePage

Huge Pages
HugeTLB
Linux
Memory Management
Virtual Memory

Transparent HugePage (THP)

Transparent HugePage (THP in short) is a mechanism that may be used in applications to utilize huge page support offered by the Linux kernel. It is basically an alternative to HugeTLB with roughly the same purpose: to allow the programs to get performance benefits from using huge pages. When created, THP’s main goal was to simplify the usage of huge pages, minimizing the amount of work needed by system administrators and developers.

It was a kind of trade-off from the very beginning. It was never intended to give as much performance gain and control as HugeTLB. However, THP’s authors believed that it would be much more widely adopted than HugeTLB thanks to being easier to use. Even today, using HugeTLB is rather unwieldy, but the situation was much worse at the time THP was created. In those days, HugeTLB was used mainly by some database engines where the performance gain was worth all the extra development and administrative effort. Nobody else really cared about it.

The word transparent in THP suggests that, from the application's perspective, it makes no difference whether huge pages are used under the hood or not. To some extent, this is true. However, there are some quirks one must know to use THP properly. Explicit hints are often required, and sometimes the system-wide configuration has to be changed as well.

The core assumptions in THP are (at least so far) as follows:

  1. THP tries to use huge pages if possible. What happens when they are unavailable depends on the current configuration. However, as long as standard-size pages are available, no failures will be observed from the application’s perspective.
  2. Memory pages can be automatically promoted to huge pages and demoted to standard pages.
  3. A huge page in THP always means a page that skips the last level in the page walk, the so-called PTE level. Bigger sizes are not supported. For example, on x86-64 architecture, only 2MB huge pages are used.
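The 2MB figure follows from simple arithmetic: skipping the PTE level means one higher-level entry covers everything a full page table would have covered. A minimal sketch of that calculation, assuming 8-byte page-table entries (the x86-64 value); the function name is made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Size of the area mapped when the last (PTE) level of the page walk is
 * skipped.
 * base_page_size: size of a standard page in bytes (4096 on x86-64).
 * entry_size:     size of one page-table entry in bytes (8 on x86-64). */
static size_t pmd_huge_page_size(size_t base_page_size, size_t entry_size)
{
    assert(entry_size != 0 && base_page_size % entry_size == 0);
    /* One page table occupies one standard page and holds this many entries. */
    size_t entries_per_table = base_page_size / entry_size;
    /* Each of those entries would have pointed to one standard page. */
    return entries_per_table * base_page_size;
}
```

With 4096-byte pages this gives 512 entries of 4 KB each, that is 2 MB. The number 512 comes back later when looking at khugepaged settings.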

Enable / disable THP system-wide

THP is configurable via system-wide values (changing them requires root privileges). The most basic system-wide setting is the value of /sys/kernel/mm/transparent_hugepage/enabled file:

  • always means using THP mechanism always
  • never – self-explanatory
  • madvise – use THP only on memory regions that have been appropriately marked by madvise syscall (with MADV_HUGEPAGE flag).

madvise setting requires attention, as it is the default value in many Linux distributions (including Ubuntu). Its effect is that THP will be used only on memory regions that have been appropriately marked using madvise syscall. All other memory regions will use pages of standard size.
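To check which value is active, an application can read the file itself. The kernel lists all possible values on one line and puts the active one in square brackets, for example "always [madvise] never". A minimal sketch of reading it follows; the helper names are hypothetical and error handling is reduced to a single return code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy the bracketed (active) token of a line such as
 * "always [madvise] never" into out. Returns 0 on success, -1 otherwise. */
static int active_setting(const char *line, char *out, size_t out_size)
{
    assert(line != NULL && out != NULL);
    const char *left = strchr(line, '[');
    const char *right = left ? strchr(left, ']') : NULL;
    if (left == NULL || right == NULL)
        return -1;
    size_t len = (size_t)(right - left - 1);
    if (len + 1 > out_size)
        return -1;                 /* token does not fit into out */
    memcpy(out, left + 1, len);
    out[len] = '\0';
    return 0;
}

/* Read the first line of the file at path and extract its active setting. */
static int read_active_setting(const char *path, char *out, size_t out_size)
{
    char line[256];
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;                 /* e.g. the kernel was built without THP */
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    return ok ? active_setting(line, out, out_size) : -1;
}
```

Calling read_active_setting("/sys/kernel/mm/transparent_hugepage/enabled", buf, sizeof buf) should leave always, madvise or never in buf. Reading the file does not require root privileges, only changing it does.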

madvise as the default value makes sense for at least two reasons:

  1. Memory consumption always increases when using huge pages, but only some applications will run faster. Increased memory consumption is sometimes unacceptable.
  2. Some applications are fine-tuned for memory pages of standard size, and enabling huge pages will degrade their performance.

The official docs contain this rule of thumb:

Applications that gets a lot of benefit from hugepages and that don't risk to lose memory by using hugepages, should use madvise(MADV_HUGEPAGE) on their critical mmapped regions.

The rationale is simple: such regions may get allocated huge pages provided by THP if either always or madvise is set in /sys/kernel/mm/transparent_hugepage/enabled. Other memory regions may use huge pages provided by THP only if always is set.

From the application developer’s view, the basic usage of THP is to follow this rule and, well, hope that huge pages will be used on marked memory regions. We may take greater control over the process, but it won’t be transparent anymore. However, if transparency is unimportant to us, we should at least consider switching to HugeTLB.

Huge pages depleted scenario

What happens if THP is enabled, but no huge page is available during page allocation? It’s not an uncommon scenario. Remember that in THP huge pages are not reserved in advance. Long-running, highly utilized Linux systems tend to have highly fragmented physical memory. Finding a contiguous, unused piece of physical memory of the size of a huge page may be impossible, even if RAM is not full yet.

Remember that the authors of THP wanted to hide huge page allocation errors from the user's perspective. The simplest approach one may think about when the huge page cannot be allocated is to simply use normal-sized pages and possibly replace them in the future with huge pages.

A more complicated technique looks as follows:

  1. Swap out unused memory to free up some space in physical memory.
  2. Move existing physical pages of normal size into contiguous areas in RAM. This process is called compaction. Moving used pages together creates new, bigger areas of unused memory that may eventually be used to allocate new huge pages. This process is time-consuming because moved pages must be copied into a new location.
  3. Create new huge pages that will be used by THP.

Looking at the description, you may guess that executing all these steps may cause a significant delay if we want to use a huge page instantly, synchronously waiting until it is available. In the past, under very specific conditions, an application could stall for several minutes (sic!), waiting for all this work to be done.

Because of such performance implications, there is a system-wide setting that determines what THP should do when huge pages are not immediately available. This setting is the value stored in /sys/kernel/mm/transparent_hugepage/defrag file. Possible values are:

  • always: reclaim memory pages and run compaction to make huge pages available instantly. It will wait for huge pages availability synchronously, so a big delay is possible.
  • defer: run memory reclaim and compaction asynchronously without waiting for them to be available. Use standard pages and hope that khugepaged will replace them with huge pages in the near future.
  • never: just use standard pages. They may be replaced with huge pages in the future, but no extra action will be triggered at this point.
  • madvise (the default value in Ubuntu 23.10 and 22.04 LTS) behaves like:
      ⚬ always for memory regions that used madvise(MADV_HUGEPAGE)
      ⚬ never for other memory regions.
  • defer+madvise behaves like:
      ⚬ always for memory regions that used madvise(MADV_HUGEPAGE)
      ⚬ defer for other memory regions.

Again, madvise-marked memory regions are handled differently in two of the five possible setting values. With the madvise and defer+madvise settings, THP tries to install huge pages more quickly and more aggressively on memory regions marked with the MADV_HUGEPAGE flag.
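The five values can be summarized as a small decision table. The sketch below only restates the list above in code. It is a model for reasoning about the settings, not kernel code, and the enum and function names are invented for this example:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum thp_fault_action {
    THP_STALL_AND_RECLAIM,    /* reclaim and compact synchronously */
    THP_DEFER_TO_BACKGROUND,  /* use standard pages, reclaim asynchronously */
    THP_USE_STANDARD_PAGES,   /* no extra action at this point */
    THP_UNKNOWN_SETTING
};

/* What happens when a huge page is not immediately available, given the
 * defrag value and whether the region was marked with MADV_HUGEPAGE. */
static enum thp_fault_action defrag_action(const char *defrag,
                                           bool region_madvised)
{
    assert(defrag != NULL);
    if (strcmp(defrag, "always") == 0)
        return THP_STALL_AND_RECLAIM;
    if (strcmp(defrag, "defer") == 0)
        return THP_DEFER_TO_BACKGROUND;
    if (strcmp(defrag, "never") == 0)
        return THP_USE_STANDARD_PAGES;
    if (strcmp(defrag, "madvise") == 0)
        return region_madvised ? THP_STALL_AND_RECLAIM
                               : THP_USE_STANDARD_PAGES;
    if (strcmp(defrag, "defer+madvise") == 0)
        return region_madvised ? THP_STALL_AND_RECLAIM
                               : THP_DEFER_TO_BACKGROUND;
    return THP_UNKNOWN_SETTING;
}
```

Reading the table this way makes one thing visible: with the Ubuntu default (madvise), a region marked with MADV_HUGEPAGE accepts the risk of a synchronous stall in exchange for a better chance of getting a huge page right away.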

Automatic promotion / demotion of huge pages

Promotion is a process of converting standard pages that are contiguous in virtual memory into huge pages. This is achieved by cooperating services running in the background: kswapd (responsible for swapping out unused pages), kcompactd (responsible for memory compaction), and khugepaged (responsible for installing huge pages in place of standard pages).

Typically khugepaged will run at low frequency, thus limiting the amount of extra work done to make huge pages available. That may or may not be desired, so there exist some steering knobs to alter this behavior.

Demotion means converting huge pages back into pages of standard size. This possibility was necessary in the past, as many parts of the existing Linux kernel’s code could operate on only pages of standard size. Nowadays, the number of scenarios when demotion is required is decreasing, which makes sense from the performance perspective. We want to pay some extra cost at the beginning and use huge pages instead of paying extra cost twice and using pages of standard size in the end.

System-wide configuration

Many more configuration knobs may be adjusted in /sys/kernel/mm/transparent_hugepage directory. I won’t describe them here, but it’s a good idea to check them out in the official docs and experiment with them if you want to use THP most efficiently.

How to use THP

Prerequisites to make it work

It should work out-of-the-box on Ubuntu without any need for changing system-wide settings. That should be true in most cases. However, it is good to check the content of already mentioned files:

  1. /sys/kernel/mm/transparent_hugepage/enabled - its value must be either always or madvise.
  2. /sys/kernel/mm/transparent_hugepage/defrag - its value should be either always, madvise or defer+madvise. Otherwise, we risk using 4 KB pages unintentionally, but at least we avoid potentially big latency spikes due to synchronous huge page creation.
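Both checks are easy to automate at application startup. A minimal sketch, assuming the active values have already been extracted from the two files as plain strings (for example "madvise"); the function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static bool is_one_of(const char *value, const char *const *allowed,
                      size_t count)
{
    assert(value != NULL);
    for (size_t i = 0; i < count; i++)
        if (strcmp(value, allowed[i]) == 0)
            return true;
    return false;
}

/* Prerequisite 1: THP can be used at all (possibly only after madvise). */
static bool enabled_allows_thp(const char *enabled)
{
    static const char *const ok[] = { "always", "madvise" };
    return is_one_of(enabled, ok, sizeof ok / sizeof ok[0]);
}

/* Prerequisite 2: regions marked with MADV_HUGEPAGE get huge pages eagerly,
 * at the price of possible latency spikes. */
static bool defrag_is_eager_for_marked_regions(const char *defrag)
{
    static const char *const ok[] = { "always", "madvise", "defer+madvise" };
    return is_one_of(defrag, ok, sizeof ok / sizeof ok[0]);
}
```

If the first check fails, there is no point in calling madvise() at all. If only the second one fails, the application still works, but it may run on 4 KB pages for a while.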

Allocation

The usage of THP in practice is a bit tricky. It should follow this pattern:

1. Map memory into the virtual address space of the current process using either mmap() or one of posix_memalign() / aligned_alloc().

2. Mark the memory as a candidate for conversion to huge page using:

madvise(address,
        size_in_bytes,
        MADV_HUGEPAGE);

address should be aligned to a huge page size. That’s problematic when using mmap, which by default returns an address aligned to standard page size.

3. A huge page will be allocated on the first write to the mapped memory region, and only one page will be allocated. If a mapped memory region covers several huge pages, they will be allocated in the future in a similar fashion - on the first writes to mapped areas for which physical memory pages have not been allocated yet.

⚠️ Caveat: neither a read nor a write should be done in the area intended for conversion to a huge page between mapping the memory and calling madvise(). Otherwise, a standard-size page will be allocated.

A straightforward way of dealing with the huge page alignment requirement with mmap() is to allocate a bigger size than needed, find the first address inside the allocated memory that is aligned to the huge page size, and use it as the beginning of the first huge page. Extraneous space at the beginning and end may be deallocated instantly. Deallocation may also be postponed, but it’s important not to forget about it. In my opinion, postponed deallocation of these unused regions does not make much sense, as 1. it is difficult to use this space for anything and 2. it increases memory fragmentation.
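The over-allocate-and-trim approach can be sketched as follows. This is a minimal illustration, not a hardened implementation: the function name is hypothetical, the huge page size is passed in by the caller (2 MB on x86-64), and the result of madvise() is only reported back, because the call fails on kernels built without THP.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS, madvise(), MADV_HUGEPAGE */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Map size bytes (a multiple of huge_size) starting at an address aligned
 * to huge_size, and mark the range with MADV_HUGEPAGE before any access.
 * Returns NULL if mmap() fails; *madvise_rc receives madvise()'s result. */
static void *map_huge_aligned(size_t size, size_t huge_size, int *madvise_rc)
{
    assert(huge_size != 0 && (huge_size & (huge_size - 1)) == 0);
    assert(size != 0 && size % huge_size == 0);

    /* Over-allocate by one huge page so an aligned start always exists. */
    size_t padded = size + huge_size;
    uint8_t *raw = mmap(NULL, padded, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    uintptr_t start = (uintptr_t)raw;
    uintptr_t aligned = (start + huge_size - 1) & ~(uintptr_t)(huge_size - 1);
    size_t head = (size_t)(aligned - start);
    size_t tail = padded - head - size;

    /* Give the unused space before and after the aligned range back now. */
    if (head != 0)
        munmap(raw, head);
    if (tail != 0)
        munmap((uint8_t *)aligned + size, tail);

    /* Nothing has been read or written yet, so the first write can fault
     * in a huge page directly. */
    *madvise_rc = madvise((void *)aligned, size, MADV_HUGEPAGE);
    return (void *)aligned;
}
```

The returned range is released later with munmap(ptr, size), using the same size that was requested.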

You may now think that using plain mmap() is too much hassle, but unfortunately, using posix_memalign() and aligned_alloc() comes with other issues.

posix_memalign() / aligned_alloc() problems

As I checked using strace, under the hood, these functions call mmap without the MAP_HUGETLB flag, reserving a bigger memory area than requested because:

  1. it must be possible to return a piece of memory of the requested size that is at the same time aligned to the huge page size
  2. some extra space is needed to store information used afterward by free() to deallocate the memory.

These functions work, but we don’t have any fine-grained control over them. We cannot deallocate the extra memory earlier; we just have to accept the fact that part of the memory is wasted and that memory is fragmented more than needed.
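For comparison, the posix_memalign() route might look like the sketch below. It is shorter than the mmap() variant, but everything that happens around the returned block is up to the allocator. The function name is hypothetical, and the huge page size is again supplied by the caller.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes posix_memalign(), madvise(), MADV_HUGEPAGE */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Allocate size bytes aligned to huge_size and mark them as a THP candidate.
 * Returns NULL on allocation failure; *madvise_rc receives madvise()'s
 * result. */
static void *alloc_thp_candidate(size_t size, size_t huge_size,
                                 int *madvise_rc)
{
    assert(huge_size != 0 && (huge_size & (huge_size - 1)) == 0);
    void *p = NULL;
    if (posix_memalign(&p, huge_size, size) != 0)
        return NULL;
    /* The allocator keeps its bookkeeping outside the returned block, but
     * we cannot tell what else it may have touched. */
    *madvise_rc = madvise(p, size, MADV_HUGEPAGE);
    return p; /* release with free(), not munmap() */
}
```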

Another problem with posix_memalign() and aligned_alloc(), according to a tweet by Travis Downs on Twitter / X, is that these functions can theoretically touch the part of the allocated memory intended for use as a huge page. We have no way to control that.

Easy way to satisfy page alignment in mmap()

From my experiments, it looks like mmap() returns a huge page-aligned address if the requested length is a multiple of the huge page size. It looks like a perfect solution: no unmapping, no wasted memory. Simple and efficient. However, it’s not officially documented behavior, so you should verify that it works on the target machine.
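Since the behavior is undocumented, the safest way to rely on it is to check the result at run time and fall back to another strategy when the address is not aligned. A minimal sketch with hypothetical helper names:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Round len up to the nearest multiple of huge_size. */
static size_t round_up_to_multiple(size_t len, size_t huge_size)
{
    assert(huge_size != 0);
    return (len + huge_size - 1) / huge_size * huge_size;
}

/* Map at least len bytes, with the length rounded up to a multiple of
 * huge_size. *mapped_len receives the real mapping length and *aligned
 * tells whether the kernel happened to return a huge page-aligned address.
 * Returns NULL if mmap() fails. */
static void *map_rounded(size_t len, size_t huge_size,
                         size_t *mapped_len, bool *aligned)
{
    *mapped_len = round_up_to_multiple(len, huge_size);
    void *p = mmap(NULL, *mapped_len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *aligned = ((uintptr_t)p % huge_size) == 0;
    return p;
}
```

When aligned comes back false, the caller can unmap the region and use the over-allocation technique instead.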

Edge cases

From my experiments, it looks like if THP’s enabled flag has the value madvise, a huge page will never be used unless the whole area in virtual memory that could be backed by this huge page is marked by madvise() with the MADV_HUGEPAGE flag. This has one interesting consequence. If the size of the memory region we have mapped is not a multiple of the huge page size, the region ends somewhere in the middle of an area that could potentially be backed by a huge page. Not the whole of that area is marked with MADV_HUGEPAGE, so it won’t be replaced with a huge page.
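To see how much of a region can actually end up on huge pages, it helps to count the huge page-sized, huge page-aligned slots that lie entirely inside it. A small sketch of that arithmetic (the function name is made up for this example):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of huge_size-aligned slots fully contained in [start, start + len).
 * Only these slots can be backed by a huge page when every byte of a slot
 * has to be covered by the MADV_HUGEPAGE-marked region. */
static size_t full_huge_slots(uintptr_t start, size_t len, size_t huge_size)
{
    assert(huge_size != 0 && (huge_size & (huge_size - 1)) == 0);
    uintptr_t mask = ~(uintptr_t)(huge_size - 1);
    uintptr_t end = start + len;
    uintptr_t first = (start + huge_size - 1) & mask; /* round start up */
    uintptr_t last = end & mask;                      /* round end down */
    return last > first ? (size_t)((last - first) / huge_size) : 0;
}
```

For example, a 3 MB region starting at an aligned address contains only one full 2 MB slot, so the trailing 1 MB stays on standard pages. A 2 MB region that starts 4 KB past an aligned address contains no full slot at all.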

Let’s consider another scenario. What happens if the whole area that could be backed by a huge page is marked by MADV_HUGEPAGE, but the first write access lands further from the beginning of the area than the size of a standard page? Remember that if THP were disabled entirely, such a write would allocate a standard-size page, but not the first one in the area. It turns out that if THP is enabled, a huge page will be installed immediately if available, so fortunately, the first write access may happen anywhere, and it will be fine.

Finally, you may wonder what will happen if madvise() is used after the first write. As you may expect, a standard-size page will be used at the beginning, and that makes perfect sense. While handling the page fault, the OS has no information that a huge page should be used instead. Later, after marking the whole area that could be backed by a single huge page with MADV_HUGEPAGE, it is highly probable that, eventually, khugepaged will replace this page with a huge page. To make this possible, one more condition must be met: the number of non-allocated standard-size pages in the area that could be replaced by a huge page must be lower than or equal to the /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none value. In Ubuntu, it is equal to 511 by default. As one huge page replaces 512 standard-size pages, this effectively means that the replacement may always be done, even if only a single standard-size page was allocated.
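The max_ptes_none condition boils down to one comparison. The sketch below models only this single condition for 2 MB huge pages built from 4 KB pages. The real khugepaged checks more than that, and the names used here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGES_PER_HUGE_PAGE 512u /* 2 MB / 4 KB on x86-64 */

/* present_pages: standard-size pages already allocated in the area.
 * max_ptes_none: value of khugepaged/max_ptes_none.
 * Returns true if the number of non-allocated pages is within the limit. */
static bool collapse_allowed(unsigned present_pages, unsigned max_ptes_none)
{
    assert(present_pages <= PAGES_PER_HUGE_PAGE);
    unsigned none = PAGES_PER_HUGE_PAGE - present_pages;
    return none <= max_ptes_none;
}
```

With the default of 511, a single allocated page is enough (511 missing pages, limit 511), while a completely untouched area is not (512 missing pages). Lowering the value, for example to 255, would require at least 257 allocated pages before a replacement is considered.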

Deallocation

Deallocation should match the way the allocation was made: free() for areas allocated with posix_memalign() / aligned_alloc(), and munmap() for mmap-allocated ones.

Monitoring Transparent HugePage

The first thing you may check out is /proc/meminfo. The Baeldung article listed in the references provides some help on how to analyze its content. You may also find the description of interfaces that may be used to monitor huge page usage in the official docs.
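For a quick system-wide look, the AnonHugePages line of /proc/meminfo reports how much anonymous memory is currently backed by transparent huge pages. A minimal sketch of extracting such a field from text that has already been read from the file; the helper name is hypothetical:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Find the line starting with "key:" in text (the content of /proc/meminfo)
 * and store its numeric value, given in kB, in *kb.
 * Returns 0 on success, -1 if the key is missing or malformed. */
static int meminfo_kb(const char *text, const char *key, long *kb)
{
    assert(text != NULL && key != NULL && kb != NULL);
    size_t key_len = strlen(key);
    for (const char *line = text; line != NULL && *line != '\0'; ) {
        if (strncmp(line, key, key_len) == 0 && line[key_len] == ':')
            return sscanf(line + key_len + 1, "%ld", kb) == 1 ? 0 : -1;
        line = strchr(line, '\n');  /* jump to the next line, if any */
        if (line != NULL)
            line++;
    }
    return -1;
}
```

Keep in mind that this is a system-wide counter. It shows that THP is being used somewhere, not that your particular mapping got a huge page.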

Unfortunately, verifying that huge pages are used in your application (and not regular-size) is tricky. It requires root permissions. If you are interested, please take a look at this post: Reliably allocating huge pages in Linux.

Using THP via glibc tunable

Similarly to HugeTLB, instead of mapping and unmapping memory on your own, you may set the glibc.malloc.hugetlb tunable to 1. It works this way: every time malloc() needs to call mmap() internally, it will also mark the mapped memory using madvise() with the MADV_HUGEPAGE flag, nothing more, nothing less.

This approach has some drawbacks. malloc() does not enforce alignment to the huge page size when using mmap() internally. Because of that, it is very likely that at least the beginning of the returned memory will never be backed by a huge page. Moreover, the end of the returned memory typically won’t be aligned to the huge page size either. Thus, only part of the virtual memory area that could be backed by a huge page will be marked with MADV_HUGEPAGE, and THP won’t use a huge page there if THP is enabled only for MADV_HUGEPAGE-marked memory regions.

Also remember that malloc() allocates more memory than requested (as part of it is used for malloc’s internal purposes). Thus it's not that trivial to exploit the mmap() behavior I mentioned earlier: returning huge page aligned memory if requested allocation size is a multiple of huge page size.
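glibc tunables are passed through the GLIBC_TUNABLES environment variable as a colon-separated list of name=value pairs, and they are read when the process starts, so the variable has to be set before launching the program (for example GLIBC_TUNABLES=glibc.malloc.hugetlb=1 ./app, where ./app is a placeholder). An application that wants to log whether it was started this way could check the variable as in the sketch below. The function name is hypothetical, and the check is simplified: it reports a match if any entry in the list equals glibc.malloc.hugetlb=1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if the colon-separated tunables string contains the entry
 * "glibc.malloc.hugetlb=1". tunables may be NULL (variable not set). */
static bool tunables_request_thp(const char *tunables)
{
    static const char wanted[] = "glibc.malloc.hugetlb=1";
    const size_t wanted_len = sizeof wanted - 1;

    if (tunables == NULL)
        return false;
    for (const char *p = tunables; *p != '\0'; ) {
        const char *sep = strchr(p, ':');
        size_t len = sep ? (size_t)(sep - p) : strlen(p);
        if (len == wanted_len && strncmp(p, wanted, wanted_len) == 0)
            return true;
        p += len;       /* move to the separator or the end of the string */
        if (*p == ':')
            p++;
    }
    return false;
}
```

A typical call would be tunables_request_thp(getenv("GLIBC_TUNABLES")), with getenv() coming from <stdlib.h>.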

THP vs. HugeTLB comparison

As you probably already noticed, THP is a much more complex mechanism than HugeTLB. This complexity is caused mainly by the assumption of transparent promotion and demotion of huge pages. Most of the details are invisible to someone who just wants to use it. At the same time, fine-tuning is possible using several steering knobs and usage statistics.

HugeTLB is much simpler. However, this simplicity also limits possible usage scenarios.

📄 Being a developer, as a rule of thumb:
  • use HugeTLB if performance and full control over memory management is your top priority
  • otherwise, use THP.
Below, I have prepared a more detailed comparison.

HugeTLB advantages

Different huge page sizes: only HugeTLB gives the possibility to request huge pages of different sizes. THP uses the default size only.

Huge pages are always used: when using HugeTLB, successful allocation means that huge pages are always used. THP can use normal pages instead and convert them into huge pages in the future (if possible).

Allocated memory is always aligned to page size: only HugeTLB guarantees that.

madvise syscall is never required: on Ubuntu, the madvise() syscall is needed for THP to work unless the system-wide configuration files are changed.

Transparent Hugepage advantages

It does not require changing system-wide files: only THP will work out-of-the-box on most Linux distributions.

Tunable behavior when huge pages are not instantly available: for example, THP can fall back to pages of normal size when allocating huge pages fails.

Can convert normal pages into huge pages in the background: normal-size pages used by THP may get converted into huge pages at some point, thanks to the background worker (khugepaged). Its behavior is tunable via system-wide configuration files.

Can convert huge pages into normal pages under memory pressure: when only part of a huge page is mapped by a process, and there is high memory pressure in the system, the page may get split into smaller pages to reclaim unused memory.

Evolution of huge pages support in Linux

Interestingly, both HugeTLB and THP are still evolving. The Transparent Hugepages idea was mentioned for the first time in 2009 and introduced in the Linux kernel two years later, so it’s about 15 years of development now. HugeTLB is even older.

Looking carefully at their history, you’ll see that their functionality has slowly evolved since they were introduced. This process is time-consuming because huge pages belong to memory management, which is a critical component. Most other kernel mechanisms depend on the correctness and performance of memory management, so Linus Torvalds took a very conservative approach and accepted new features after very careful consideration.

However, the evolution is not over. For example, if you look at the official kernel docs about THP matching linux-next branch (that holds patches that are expected to land soon in the main branch), you may notice it describes multi-size THP feature that is not yet available in the 6.7 version (the most recent at the date of writing this article).

Huge pages in Oxla

Finally, you may wonder if we are using huge pages in Oxla. That would be a logical step, as it can speed up memory-intensive applications, and we are obsessed with optimizing performance. However, when writing this post, huge pages are not used (yet). We did some initial experiments with enabling THP, but the performance was actually worse. That was caused by an unoptimized pattern of memory allocations that couldn’t take any significant advantage of huge pages. At the same time, the increased cost of handling page faults on the OS side caused a visible slowdown.

This is, by the way, a good example that simply enabling huge pages is not a straightforward, universal way to speed up applications. The effect will be better if the code is written with the assumption that huge pages will be used. We will enable huge pages in Oxla in the future alongside necessary optimizations of memory allocation patterns.

In the meantime, if you're interested in trying Oxla, you can run it for free in just two minutes. Additionally, if you enjoyed reading this article and want to share your thoughts about our product or simply say hi, please do not hesitate to contact us via hello@oxla.com.

References

Transparent Hugepage Support — The Linux Kernel documentation Official kernel docs for Linux 6.7. It contains a detailed description of knobs that alter THP’s behavior.

Transparent Hugepage Support (linux-next branch) Official kernel docs about THP matching linux-next branch.

Concepts overview — The Linux Kernel documentation Short description of the concept of huge pages in official Linux kernel docs.

Reliably allocating huge pages in Linux Describes how one may verify if huge pages are used by THP.

AMD64 Architecture Programming Manual Part 2 You may find there how huge pages are supported by hardware in x86-64 architecture.

Huge pages related articles on LWN A collection of very in-depth articles about huge pages support in Linux kernel (throughout years).

Large allocations related articles on LWN A bunch of articles about large allocations issues in Linux. Many of them are, to some degree, related to huge pages because huge pages rely on the possibility of allocating big chunks of memory.

mmap(2) - Linux manual page Detailed description of mmap() and munmap() functions.

The /proc/meminfo File in Linux | Baeldung on Linux Shows how to interpret /proc/meminfo output.

Memory Allocation Tunables (The GNU C Library) Description of available memory allocation tunables in glibc.

Transparent HugePages presentation Valuable presentation from courses on MIMUW. It contains information on how THP evolved over the years (and other information about huge pages, too).
