Making Far Cry 5 run on VFIO with NUMA awareness

Specifications

  • Motherboard: ASUS PRIME X399-A
    • Enable ACS, IOMMU+IVRS, SEV
    • CBS -> DF -> Memory Interleave = Channel (enable NUMA)
  • CPU: Threadripper 1900X OCed to 4.0GHz, water cooled
  • Memory: 64GB - 4x 16GB Kingston KVR24E17D8
  • GPUs:
    • AMD RX460 4GB (Host)
    • NVIDIA 1060 6GB (Guest)
  • Storage: zfs-9999, native encryption, allocation classes
    • 2x RD400 NVMe special mirror vdev
    • 2x WD Gold 4TB ashift=9 data vdev
  • Kernel: 4.20.4-gentoo
    • No patches required, other than the it87 driver for sensor / PWM control
  • QEMU: 3.1.0
    • aw’s PCIe link speed patches to force GPU as PCIe 3.0 16x
    • OVMF (Win10+Q35), virt-manager configured PCIe hierarchy
    • libvirt xml used to set vendor id for NVIDIA tomfoolery
    • Guest runs with a single NUMA node, 4 cores / 8 threads - not isolated using isolcpus/nohz_full/rcu_nocbs.
    • host-passthrough CPU

Patches

QEMU emulator

  • PulseAudio driver improvements: I like to share my headphones between the guest and its host OS.
  • Link speed negotiation patch: parts of the NVIDIA driver fail to initialise when the guest appears to support PCIe 1.0 only.

See “Related Articles” for a link to the last article where I distributed these patches.

Background

As I mentioned in earlier posts, my original VFIO setup exposed two NUMA nodes to the host Linux kernel with proper CPU mappings for each:

16GiB (8GiB per node)
NUMA node0 CPU(s):   0-3,8-11
NUMA node1 CPU(s):   4-7,12-15

When investigating stutters in GTA V, I confined Windows to NUMA zone 1 and decided to stick with that configuration; NUMA zone 0 seemed to contend more frequently with host processes. I didn’t bother with isolcpus because I need the full CPU in the host, and giving up cores is rough on a 1900X. Maybe with a 2950X/1950X I’d feel better about it, because I need those 16 threads.

8GiB (8GiB per node)
NUMA node1 CPU(s):   4-7,12-15
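
For anyone wanting to reproduce the single-zone layout, here is a minimal sketch that turns the lscpu lines above into libvirt <vcpupin> entries for zone 1. It is not the exact configuration from my XML. The helper names are made up for illustration, and the sibling pairing (4 with 12, 5 with 13, and so on) is an assumption about how the kernel numbers SMT threads on this CPU; check /sys/devices/system/cpu/cpuN/topology/thread_siblings_list before trusting it.

```python
import re


def expand_cpu_list(spec):
    """Expand a kernel-style CPU list such as '4-7,12-15' into a list of ints."""
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def parse_numa_nodes(lscpu_text):
    """Map node number -> host CPU list from 'NUMA nodeN CPU(s): ...' lines."""
    nodes = {}
    for match in re.finditer(r"NUMA node(\d+) CPU\(s\):\s*(\S+)", lscpu_text):
        nodes[int(match.group(1))] = expand_cpu_list(match.group(2))
    return nodes


def vcpupin_lines(host_cpus):
    """Build <vcpupin> lines so guest vCPUs 2k and 2k+1 land on one host core.

    Assumption: the first half of host_cpus holds the first thread of each
    core and the second half holds the SMT siblings in the same order.
    """
    half = len(host_cpus) // 2
    order = []
    for core, sibling in zip(host_cpus[:half], host_cpus[half:]):
        order += [core, sibling]
    return ["<vcpupin vcpu='%d' cpuset='%d'/>" % (v, c) for v, c in enumerate(order)]


LSCPU = """\
NUMA node0 CPU(s):   0-3,8-11
NUMA node1 CPU(s):   4-7,12-15
"""

nodes = parse_numa_nodes(LSCPU)
for line in vcpupin_lines(nodes[1]):
    print(line)
```

The printed lines go inside the <cputune> element of the domain XML. Interleaving core and sibling matters because the guest is told it has 4 cores / 8 threads, so adjacent vCPUs should be real SMT siblings on the host.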

A few months later, I bought two copies of Far Cry 5 (FC5) for co-op and was determined to make it work on my system. I had repeated crashes before the game even loaded - sometimes I’d get to the epilepsy warning, but the cursor was slow and laggy. One time only, I made it as far as the ‘FC5 is not Real Life’ warning before the game crashed.

Each time the game crashed, it gave no reason - no EAC error, no ‘bad file’ warning or ‘cannot run in a VM’ nonsense. GTA V runs fine in the same VM but FC5 can’t even launch.

Back to two NUMA zones?

When I started the old libvirt profile with both NUMA zones exposed to the guest, the game launched! Performance was miserable, with a ton of stuttering, and yet the game would not launch at all in a single-zone VM.

I tried using Process Lasso to confine FC5 to a single NUMA zone within a two-zone guest, but the game refuses to run this way.

host-passthrough & L3 cache performance

It was suggested that I run the AIDA64 L3 cache benchmark, because host-passthrough is known to have major performance problems on Threadripper. L3 topped out at 40-ish GiB/s, which is barely faster than DDR4, while the L1 & L2 caches were blazin’ along at 400-500GiB/s.

I modified my XML from this:

  <qemu:commandline>
    <qemu:arg value='-cpu'/>
    <qemu:arg value='host-passthrough,+topoext,hv_time,kvm=off,hv_vendor_id=nvidiaistheworst'/>

...

To the following:

  <qemu:commandline>
    <qemu:arg value='-cpu'/>
    <qemu:arg value='EPYC-IBPB,+topoext,hv_time,kvm=off,hv_vendor_id=nvidiaistheworst'/>

...

This ‘emulates’ an AMD EPYC CPU with IBPB. I can’t say I understand why, but AIDA64 now reports 300-420GiB/s for L3 cache performance, and much lower latency in the range of 7-10ns or so.

Back in Far Cry 5, with everything on High at 1080p, the game now runs fantastically with this change. There are an order of magnitude fewer stutters, though presumably they won’t disappear entirely until I can isolate the game. I have a theory, though, and will investigate using an AMD RX Vega instead of NVIDIA once I manage to acquire one.

Summary

I haven’t had any crashes with this configuration, though my friend, who runs my hand-me-down X79 system, occasionally has EasyAntiCheat kill his client, claiming he has “Bad Files”.

Upon further investigation, FC5 seems to have NUMA awareness internally: watching the host with atop, it limits its activity to zone 0, at least when emulating EPYC. When using host-passthrough, FC5 uses zone 1. Perhaps Windows sees the topology differently in the two cases, despite +topoext being set for both.
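
If you don’t have atop handy, a rough way to see which zone the guest is loading is to compare two snapshots of /proc/stat on the host and sum the busy time per zone. This is only a sketch of the idea, not how atop does it; the function names are hypothetical, and the demo below uses made-up numbers on a two-CPU layout rather than real measurements.

```python
import time


def busy_ticks(proc_stat_text):
    """Return {cpu: (busy, total)} from the per-CPU lines of /proc/stat."""
    ticks = {}
    for line in proc_stat_text.splitlines():
        fields = line.split()
        # Keep 'cpuN ...' lines; skip the aggregate 'cpu' line and everything else.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        cpu = int(fields[0][3:])
        # user nice system idle iowait irq softirq steal.
        # Guest time is already counted inside user, so it is not added again.
        values = [int(v) for v in fields[1:9]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        total = sum(values)
        ticks[cpu] = (total - idle, total)
    return ticks


def node_utilisation(before, after, node_cpus):
    """Busy fraction per NUMA zone between two /proc/stat snapshots."""
    b, a = busy_ticks(before), busy_ticks(after)
    result = {}
    for node, cpus in node_cpus.items():
        busy = sum(a[c][0] - b[c][0] for c in cpus)
        total = sum(a[c][1] - b[c][1] for c in cpus)
        result[node] = busy / total if total else 0.0
    return result


def sample_live(node_cpus, interval=1.0):
    """Take two real snapshots on a Linux host (not called in the demo)."""
    with open("/proc/stat") as f:
        before = f.read()
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = f.read()
    return node_utilisation(before, after, node_cpus)


# Synthetic snapshots: cpu0 mostly idle, cpu1 fully busy over the interval.
BEFORE = "cpu  200 0 100 1700 0 0 0 0 0 0\ncpu0 100 0 50 850 0 0 0 0 0 0\ncpu1 100 0 50 850 0 0 0 0 0 0\n"
AFTER = "cpu  300 0 110 1790 0 0 0 0 0 0\ncpu0 110 0 50 940 0 0 0 0 0 0\ncpu1 190 0 60 850 0 0 0 0 0 0\n"

util = node_utilisation(BEFORE, AFTER, {0: [0], 1: [1]})
print(util)
```

On the real host, you would call sample_live({0: [0, 1, 2, 3, 8, 9, 10, 11], 1: [4, 5, 6, 7, 12, 13, 14, 15]}) while the game is running. It counts all host activity on those CPUs, not just the QEMU threads, so treat the result as a hint rather than proof.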

Related Articles

After reading this, be sure to check out this article where I do some benchmarks in a single NUMA zone, and a subsequent post with some changes and QEMU 3.1 patches in December 2018.