Page Table Isolation and Large Pages: (January 7, 2018)
If you follow tech news, you are probably well aware of the Meltdown and Spectre side-channel attacks, which affect nearly all processors with speculative execution. The patches for them come with performance penalties that range anywhere from negligible to a ridiculous 50%, depending on the application and hardware.
Is y-cruncher affected? Yes. But the penalty may be avoidable under certain circumstances.
Hardware:
- Core i7 4770K (Haswell) @ 4.0 GHz (4 cores/8 threads)
- 32 GB DDR3 @ 2133 MT/s
- Asus Z87-Plus
- Windows 10 Anniversary Update
The following table compares performance with and without Kernel Page Table Isolation (KPTI), the patch for Meltdown. Unfortunately, no BIOS/microcode update for the Spectre patch was available to test. Given the age of the system, it seems unlikely that the manufacturer will provide one.
[TABLE="class: grid, width: 500, align: center"]
[TR]
[TD]1 billion digits of Pi[/TD]
[TD="colspan: 2"]y-cruncher v0.7.4[/TD]
[/TR]
[TR]
[TD][/TD]
[TD]Normal Pages (4 KB)[/TD]
[TD]Large Pages (2 MB)[/TD]
[/TR]
[TR]
[TD]No Patches[/TD]
[TD]107.448[/TD]
[TD]106.803[/TD]
[/TR]
[TR]
[TD]Kernel Page Table Isolation (KPTI)[/TD]
[TD]110.418[/TD]
[TD]106.388[/TD]
[/TR]
[/TABLE]
Notes:
- All times are in seconds.
- Each benchmark was done multiple times to ensure consistency and the fastest time was chosen. Run-to-run variation is on the order of +/- 0.5%.
- When KPTI was enabled, the PCID (Process-Context Identifiers) optimization was enabled along with it.
y-cruncher spends very little time in the kernel, so one would expect the effect of KPTI to be negligible. However, there are a lot of small system calls from all the multi-threading related constructs (mutexes, condition variables, signals, etc.).
In the end, we see a roughly 3% performance impact (107.448 vs. 110.418 seconds) when using normal (4 KB) pages. But when switching to large (2 MB) pages, that penalty disappears: the two times differ by less than the run-to-run variation.
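For anyone who wants to check the arithmetic, here is a small sketch that derives the percentages from the v0.7.4 table. The times are copied from the table; the dictionary layout and the function name are just my own illustration.

```python
# Wall-clock times in seconds, copied from the v0.7.4 table.
times = {
    "Normal Pages (4 KB)": {"no_patches": 107.448, "kpti": 110.418},
    "Large Pages (2 MB)": {"no_patches": 106.803, "kpti": 106.388},
}


def kpti_overhead_percent(no_patches, kpti):
    """Relative slowdown of the KPTI run versus the unpatched run, in percent."""
    return (kpti - no_patches) / no_patches * 100.0


overheads = {
    pages: kpti_overhead_percent(t["no_patches"], t["kpti"])
    for pages, t in times.items()
}

for pages, pct in overheads.items():
    print(f"{pages}: {pct:+.2f}%")
```

A negative value means the KPTI run happened to be slightly faster. For large pages the difference is smaller than the +/- 0.5% run-to-run variation, so it is best read as noise.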
A possible explanation is that each system call that enters kernel mode causes a TLB flush upon its return. So even if the system call is short, it leads to a flood of TLB misses as the computation resumes and has to re-populate the TLB. Since y-cruncher has a massive memory footprint, there are a lot of pages to re-populate. With large pages, there are far fewer of them, which drastically reduces the performance penalty. This explanation has a hole, though: PCID should (theoretically) eliminate those TLB flushes, at least as far as I understand it, and I admit that I don't fully understand it.
Regardless of exactly why large pages help so much, let's not get too excited. This is just a single benchmark on a single platform. Things may look different on other systems. Furthermore, there are requirements to enable large pages, some of which may be inconvenient.
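To put a number on "far fewer", here is a rough sketch that counts the pages needed to map a working set at each page size. The 4 GiB footprint is a made-up figure for illustration, not a measured y-cruncher number, and the helper name is my own.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# ASSUMPTION: hypothetical working set, not a measured y-cruncher figure.
footprint_bytes = 4 * GIB


def pages_needed(footprint, page_size):
    """Number of pages required to map `footprint` bytes, rounded up."""
    return -(-footprint // page_size)


normal_pages = pages_needed(footprint_bytes, 4 * KIB)
large_pages = pages_needed(footprint_bytes, 2 * MIB)

print(f"4 KB pages: {normal_pages:,}")
print(f"2 MB pages: {large_pages:,}")
print(f"ratio: {normal_pages // large_pages}x")
```

Whatever the real footprint is, the ratio is the same: one 2 MB page covers what would otherwise take 512 separate 4 KB pages, so there are 512x fewer translations to re-populate.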
Those interested in testing out large pages for y-cruncher can refer to the memory allocation guide.
Looking forward, the current development version, v0.7.5, is showing a significantly smaller penalty from KPTI:
[TABLE="class: grid, width: 500, align: center"]
[TR]
[TD]1 billion digits of Pi[/TD]
[TD="colspan: 2"]y-cruncher v0.7.5 (trunk)[/TD]
[/TR]
[TR]
[TD][/TD]
[TD]Normal Pages (4 KB)[/TD]
[TD]Large Pages (2 MB)[/TD]
[/TR]
[TR]
[TD]No Patches[/TD]
[TD]104.337[/TD]
[TD]102.995[/TD]
[/TR]
[TR]
[TD]Kernel Page Table Isolation (KPTI)[/TD]
[TD]104.581[/TD]
[TD]103.142[/TD]
[/TR]
[/TABLE]
It's unclear why this is the case. But it could be a side-effect of the new bandwidth optimizations.
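As a sanity check on how small the v0.7.5 penalty is, this sketch compares the KPTI slowdown against the +/- 0.5% run-to-run variation quoted in the notes for the first table. Using that figure as a noise threshold here is my own assumption, as are the variable names.

```python
# ASSUMPTION: reuse the +/- 0.5% run-to-run variation as the noise threshold.
NOISE_PERCENT = 0.5

# (no patches, KPTI) times in seconds, copied from the v0.7.5 table.
times_v075 = {
    "Normal Pages (4 KB)": (104.337, 104.581),
    "Large Pages (2 MB)": (102.995, 103.142),
}

results = {}
for pages, (no_patches, kpti) in times_v075.items():
    pct = (kpti - no_patches) / no_patches * 100.0
    within_noise = abs(pct) <= NOISE_PERCENT
    results[pages] = (pct, within_noise)
    print(f"{pages}: {pct:+.2f}% (within noise: {within_noise})")
```

Under that assumption, both differences fall inside the noise band, so this benchmark cannot distinguish the v0.7.5 KPTI penalty from zero at either page size.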
Version v0.7.5 is currently not ready for release. However, it is in feature freeze.
Spectre Mitigations:
I have yet to test the impact of the Spectre mitigations.
- Retpoline should be irrelevant to y-cruncher's own code as long as compilers make it optional. That leaves its usage in the kernel. But y-cruncher spends so little time in the kernel that any retpoline overhead there should have little effect.
- The impact of the microcode updates for branch target injection is still unclear. If we assume the worst case, where they disable branch target prediction, y-cruncher is expected to be affected, but only minimally so. While y-cruncher makes fairly heavy use of virtual calls, they are never used anywhere that is particularly performance-critical.