Investigating a suggestion to pin VMs to a single NUMA zone

Update: this article has been revised with more information.

This is the second post in a series on NUMA VMs on Threadripper. The first post is here, and I highly recommend checking out the follow-up post, where I make some changes to this single-NUMA-zone configuration to really nail down DPC latency.

Threadripper “Game Mode”?

It’s more than just “disabling some cores”.

AMD Threadripper includes a Game Mode that does two things:

  • Disables an entire CCX (essentially an entire CPU die) containing half the cores, while leaving SMT enabled
  • Ensures NUMA is enabled (CBS -> DF -> Memory Interleave = Channel)

In theory, this ensures that all CPU access occurs on a single die - not to enhance performance, per se, but to enhance compatibility with certain games.

I just wanted to put to rest the idea that Game Mode is magically faster or somehow required for 3D games (though it may very well lead to increased IPC on high-core-count CPUs - read on for more).

“Game Mode” on Linux

Game Mode is set through a Windows utility called Ryzen Master, but since we don’t have this utility on Linux where we’re running QEMU, we’ll have to do it the old-fashioned way:

  • Enable NUMA in the host kernel
  • Enable NUMA support in libvirt
  • Add NUMA XML to the libvirt domain config, ensuring static placement of hugepages on a single zone
    • Use numastat -n to monitor NUMA misses and ensure that Other_Node and Numa_Miss are not growing
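As a concrete sketch of that last step, a small script along these lines can watch for growth in the miss counters. It assumes the standard Linux sysfs NUMA counters; numastat -n from the numactl package reports the same values in table form:

```shell
#!/bin/sh
# Watch for growth in the numa_miss counter while the guest runs.
# Assumes the standard Linux sysfs NUMA counters; `numastat -n`
# (numactl package) reports the same numbers in table form.
get_miss() {
    # Sum numa_miss across all nodes; prints 0 if the files are absent.
    cat /sys/devices/system/node/node*/numastat 2>/dev/null |
        awk '$1 == "numa_miss" { t += $2 } END { print t + 0 }'
}

before=$(get_miss)
sleep 2
after=$(get_miss)
echo "numa_miss grew by $((after - before)) in 2s (should stay near 0)"
```

If the delta keeps climbing while the guest is under load, the memory placement isn’t sticking to the node you intended.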

For my Threadripper 1900X, this ended up looking like:

  <vcpu placement='static'>8</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <vcpupin vcpu='4' cpuset='8'/>
    <vcpupin vcpu='5' cpuset='9'/>
    <vcpupin vcpu='6' cpuset='10'/>
    <vcpupin vcpu='7' cpuset='11'/>
    <emulatorpin cpuset='4-7,12-15'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
  <cpu mode='host-passthrough' check='partial'>
    <topology sockets='1' cores='4' threads='2'/>
    <numa>
      <cell id='0' cpus='0-3,8-11' memory='8388608' unit='KiB'/>
    </numa>
  </cpu>

This will give us 4 cores with 8 threads and 8GB of memory from the primary CPU die.
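Strict numatune placement works best when the guest memory is backed by hugepages reserved on the host. The config above doesn’t show it, but a minimal memoryBacking fragment might look like the following; the 2MiB page size is an assumption, so match whatever you actually reserved:

```xml
<memoryBacking>
  <hugepages>
    <!-- 2048 KiB (2 MiB) pages are assumed here; match the size you
         reserved via vm.nr_hugepages or hugepagesz= on the host, and
         keep nodeset in line with the numatune node above. -->
    <page size='2048' unit='KiB' nodeset='0'/>
  </hugepages>
</memoryBacking>
```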

Use lstopo to determine which CPU die your GPU is connected to, and use that for the best performance.
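If lstopo isn’t handy, sysfs can answer the same question. The PCI address below is a made-up example; substitute your GPU’s, as reported by lspci -D:

```shell
#!/bin/sh
# Look up which NUMA node a PCI device hangs off of via sysfs.
# 0000:0a:00.0 is a hypothetical example address; substitute your
# GPU's, as reported by `lspci -D | grep -i vga`.
gpu="0000:0a:00.0"
node=$(cat "/sys/bus/pci/devices/$gpu/numa_node" 2>/dev/null || echo "unknown")
# A value of -1 means the platform didn't report a node for the device.
echo "GPU $gpu reports NUMA node: $node"
```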

Verifying latency

Using Geekbench 4 on the guest, I was able to observe that the memory latency was a decent 89ns. I’m using 2400MHz quad-channel RAM; with slower (2133MHz) or faster RAM your numbers may be slightly higher or lower.

Such a low latency value means that we’re certainly not suffering from “long access”: memory reads and copies that traverse the CPU’s Infinity Fabric.

Benchmarks

I decided to use a few CPU and GPU heavy benchmarks this time to get an idea of the actual game performance on this system. I did try to run some real games for benchmarking, but there were issues gathering statistics, and this approach demonstrates the difference well enough.

  • Unigine Heaven
    • 1080p, High, Tessellation disabled, AA disabled
    • Using “high” to determine if the GPU bottleneck changes with NUMA mode
  • Unigine Valley
    • 1080p, Low, all disabled
    • Using “low” to determine if the CPU bottleneck changes with NUMA mode
  • Geekbench 4
    • Uses CPU and memory to compare performance
  • Cinebench R15
    • Benefits from raw throughput and peak multi-core performance
  • 3DMark Sky Diver (Combined test)
    • Emulates gameplay with physics simulating CPU use and graphics simulating GPU use

Results

Cinebench R15

Cinebench R15

Cinebench tells an obvious tale: more cores = more power. The lines across the bottom indicate the “Score per Core”: how many points were generated by each thread available to the instance. Both single and dual NUMA configurations result in nearly identical score per core, with the single-zone instance pulling slightly ahead.

Basically, if you’re encoding video in a virtual machine (and dear god, why?), you’ll want as many threads as you can reasonably manage - unless it’s a heavily single-threaded workload (but few really are nowadays).

Geekbench 4

Geekbench 4

Full details:

Geekbench results are a little bit less obvious up front.

Single thread performance is just ever so slightly better in a single NUMA zone. It’s more consistent, at least.

One thing to note is that while certain operations are indeed faster in a single NUMA zone, they’re not common tasks for a gaming VM.

It looks like memory latency could be just a tiny bit lower on a single NUMA zone, but most applications and games shouldn’t care about such a minute difference - we’re talking just a few ns.

Unigine Heaven

Unigine Heaven Scores

Unigine Heaven FPS

Unigine Valley

Unigine Valley Scores

Unigine Valley FPS

Using both NUMA zones in either Unigine benchmark gives us a moderate performance edge. Performance drops over successive runs because the 1050 Ti’s cooler is just really bad.

Summary

Disclaimer

This investigation was done with the intention of improving performance in a single user machine. When running games, the machine typically isn’t doing anything else, so I didn’t go into core isolation (isolcpus) or nohz_full kernel options. It’s likely that in a future post I will look into this and test any benefits for games.

Power consumption is a factor, too

Though I didn’t discuss it above, I monitored and recorded the CPU and GPU temperatures during each of the tests. Obviously, during Unigine testing the CPU was not terribly busy and remained at about 34C, but the 1050 Ti climbed to 82C because of its miserable cooling situation.

With all 16 threads active, during certain parts of Geekbench and especially during Cinebench, the CPU use would pin at 100% and temps rose to 59C, even with liquid cooling.

With just 8 threads, the system would hover around a more reasonable 45C with only brief spikes to 50C during certain tests.

So what is isolation good for?

If you were running more than one VM at a time from the same machine, for example, one VM to play games in while another streams, you could easily split each onto their own NUMA zone and avoid competition.
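For example, a second guest could mirror the earlier configuration but land entirely on the other die. The cpuset and nodeset values below assume the same 1900X layout as above, with node 1 holding host CPUs 4-7 and 12-15:

```xml
<vcpu placement='static'>8</vcpu>
<cputune>
  <!-- Pin the second guest's vCPUs to the other die's threads. -->
  <vcpupin vcpu='0' cpuset='4'/>
  <vcpupin vcpu='1' cpuset='5'/>
  <vcpupin vcpu='2' cpuset='6'/>
  <vcpupin vcpu='3' cpuset='7'/>
  <vcpupin vcpu='4' cpuset='12'/>
  <vcpupin vcpu='5' cpuset='13'/>
  <vcpupin vcpu='6' cpuset='14'/>
  <vcpupin vcpu='7' cpuset='15'/>
</cputune>
<numatune>
  <!-- Keep the second guest's memory strictly on node 1. -->
  <memory mode='strict' nodeset='1'/>
</numatune>
```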

For most single VM workloads it makes sense to provide the VM with peak power, allocating all threads.

For heat sensitive environments where power consumption is a concern, pinning to a single NUMA zone may be best.

Whatever you do, always configure NUMA in libvirt and UEFI

When I was running these benchmarks to test what would happen if we went with an unoptimised CPU topology and just let QEMU do whatever it wanted, the results were spectacularly awful.

The guest would consume a minimum of 100% host CPU when completely idle. Once NUMA was exposed to the guest, this dropped to a manageable 1%. If your system has an unstable TSC, it may not be possible to get it this low.
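To check whether your host’s TSC was deemed stable, you can ask the kernel which clocksource it actually selected (this assumes the standard Linux clocksource sysfs interface; “tsc” is the answer you want):

```shell
#!/bin/sh
# Report the clocksource the host kernel actually selected.
# "tsc" is the good answer; hpet or acpi_pm suggest the TSC was
# deemed unstable, making guest timekeeping more expensive.
cs_file=/sys/devices/system/clocksource/clocksource0/current_clocksource
cs=$(cat "$cs_file" 2>/dev/null || echo "unknown")
echo "Current clocksource: $cs"
```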

To profile a VM’s host CPU use, try perf - it’s got a ton of overhead, but it’s quick and easy (and available on most systems):

sudo perf kvm --host top -p `pidof qemu-system-x86_64`

You’ll want to substitute the pidof command with the actual PID of your intended virtual machine if you’re running more than one on the system.

Raw Data

If for whatever reason you’re interested in the raw data I used to generate these graphs (it includes the aforementioned temperatures), it’s available here (it opens in LibreOffice).