Debunking Zswap and Zram Myths

(chrisdown.name)

143 points | by javierhonduco 8 hours ago

17 comments

  • patrakov 6 hours ago
    User here, who also acts as Level 2 support for storage.

    The article contains some solid logic plus an assumption that I disagree with.

    Solid logic: you should prefer zswap if you have a device that can be used for swap.

    Solid logic: zram + other swap = bad due to LRU inversion (zram becomes a dead weight in memory).

    Advice that matches my observations: zram works best when paired with a user-space OOM killer.

    Bold assumption: everybody who has an SSD has a device that can be used for swap.

    The assumption is simply false, and not due to the "SSD wear" argument. Many consumer SSDs, especially DRAMless ones (e.g., Apacer AS350 1TB, but also seen on Crucial SSDs), under synchronous writes, will regularly produce latency spikes of 10 seconds or more, due to the way they need to manage their cells. This is much worse than any HDD. If a DRAMless consumer SSD is all that you have, better use zram.

    • cdown 6 hours ago
      Thank you for reading and your critique! What you're describing is definitely a real problem, but I'd challenge slightly and suggest the outcome is usually the inverse of what you might expect.

      One of the counterintuitive things here is that _having_ disk swap can actually _decrease_ disk I/O. In fact this is so important to us on some storage tiers that it is essential to how we operate. Now, that sounds like patent nonsense, but hear me out :-)

      With a zram-only setup, once zram is full, there is nowhere for anonymous pages to go. The kernel can't evict them to disk because there is no disk swap, so when it needs to free memory it has no choice but to reclaim file cache instead. If you don't allow the kernel to choose which page is colder across both anonymous and file-backed memory, and instead force it to only reclaim file caches, it is inevitable that you will eventually reclaim file caches that you actually needed to be resident to avoid disk activity, and those reads and writes hit the same slow DRAMless SSD you were trying to protect.
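
      One way to see this happening in practice (purely illustrative; the counters are cumulative, so compare two samples taken a while apart, and the anon/file refault split needs a reasonably recent kernel):

        grep -E 'workingset_refault_(anon|file)|pswpin|pswpout' /proc/vmstat   # refaults = reclaimed cache that had to be read back from disk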

      In the article I mentioned that in some cases enabling zswap reduced disk writes by up to 25% compared to having no swap at all. Of course, the exact numbers will vary across workloads, but the direction holds across most workloads that accumulate cold anonymous pages over time, and we've seen it hold in constrained environments like BMCs, servers, desktops, VR headsets, etc.

      So, counter-intuitively, for your case it may well be that zswap with an appropriately sized swap device reduces disk I/O rather than increasing it. If that's not the case, that's exactly the kind of real-world data that helps us improve things on the mm side, and we'd love to hear about it :-)

      • patrakov 5 hours ago
        1. Thanks for partially (in paragraph 4 but not paragraph 5) preempting the obvious objection. Distinguishing between disk reads and writes is very important for consumer SSDs, and you quoted exactly the right metric in paragraph 4: reduction of writes, almost regardless of the total I/O. Reads without writes are tolerable. Writes stall everything badly.

        2. The comparison in paragraph 4 is between no-swap and zswap, and the results are plausible. But the relevant comparison here is a three-way one, between no-swap, zram, and zswap.

        3. It's important to tune earlyoom "properly" when using zram as the only swap. Setting the "-m" argument too low causes earlyoom to miss obvious overloads that thrash the disk through page cache and memory-mapped files. On the other hand, I could not find the right balance between unexpected OOM kills and missed brownouts, simply because the usage levels of RAM and zram-based swap are the only signals earlyoom has available for its decision. Perhaps systemd-oomd will fare better. The article does mention the need for tuning the userspace OOM killer to an uncomfortable degree.

        I have already tried zswap with a swap file on a bad SSD, but, admittedly, not together with earlyoom. With an SSD that cannot sustain even 10 MB/s of synchronous writes, it browns out, while zram + earlyoom can be tuned not to brown out (at the expense of OOM kills on a subjectively perfectly well-performing system). I will try backing-store-less zswap when it's ready.

        And I agree that, on an enterprise SSD like Micron 7450 PRO, zswap is the way to go - and I doubt that Meta uses consumer SSDs.

      • nextaccountic 3 hours ago
        It's very rare for disk reads to hang your UI (you would need to be running blocking operations in the UI thread).

        But swap with high latency will occasionally hang the interface, and with it any means to free memory manually.

    • bilegeek 4 hours ago
      Counterargument: you can mostly disable zswap writeback, so it will only use the swap partition when hibernating[1].

      [1]https://wiki.archlinux.org/title/Power_management/Suspend_an...
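
      The per-cgroup version of that, roughly (a sketch assuming cgroup v2 and a kernel new enough to expose memory.zswap.writeback; user.slice is just an example cgroup):

        # 0 = keep this cgroup's swapped pages in the compressed pool only,
        # never writing them back to the swap partition.
        echo 0 > /sys/fs/cgroup/user.slice/memory.zswap.writeback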

    • zozbot234 2 hours ago
      > Many consumer SSDs ... under synchronous writes, will regularly produce latency spikes of 10 seconds or more

      Surely "regularly" is a significant overstatement. Most people have practically never seen this failure mode. And if it only occurs under a heavy write workload, that's not something that's supposed to happen purely as a result of swapping.

      • seba_dos1 8 minutes ago
        Upgrading a rolling distro online is enough to see it happen regularly.
    • scottlamb 3 hours ago
      > Many consumer SSDs, especially DRAMless ones (e.g., Apacer AS350 1TB, but also seen on Crucial SSDs), under synchronous writes, will regularly produce latency spikes of 10 seconds or more, due to the way they need to manage their cells.

      Is there an experiment you'd recommend to reliably show this behavior on such an SSD (or ideally to become confident a given SSD is unaffected)? Is it as simple as writing flat-out for, say, 10 minutes, with O_DIRECT so you can easily measure the latency of individual writes? Do you need a certain level of concurrency, or a mixed read/write load? Repeated writes to a small region vs. writes to a large region (or maybe, given remapping, that doesn't matter)? Is this like a one-liner with `fio`? Does it depend on longer-term state such as how much of the SSD's capacity has been written and not TRIMed?

      Also, what could one do in advance to know if they're about to purchase such an SSD? You mentioned one affected model. You mentioned DRAMless too, but do consumer SSD spec sheets generally say how much DRAM (if any) the devices have? Maybe some known unaffected consumer models? It'd be a shame to jump to enterprise prices to avoid this if that's not necessary.

      I have a few consumer SSDs around that I've never really pushed; it'd be interesting to see if they have this behavior.
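
      E.g., is it something as naive as this (an illustrative fio invocation, not something I've validated as the right methodology)?

        fio --name=synclat --filename=/mnt/test/fio.dat --size=8G \
            --ioengine=psync --sync=1 --direct=1 --rw=randwrite --bs=4k \
            --time_based --runtime=600 \
            --lat_percentiles=1 --percentile_list=50:99:99.9:99.99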

      • magicalhippo 2 hours ago
        > Also, what could one do in advance to know if they're about to purchase such an SSD? You mentioned one affected model.

        Typically QLC is significantly worse at this than TLC, since the "real" write speed is very low. In my experience any QLC drive is very susceptible to long pauses in write-heavy scenarios.

        It does depend on the controller, though. As an example, check out the sustained write benchmark graph here[1]: you can see that a number of models start this oscillating pattern after exhausting the pseudo-SLC buffer, indicating the controller is taking a time-out to rearrange things in the background. Others do it too, but more irregularly.

        > You mentioned DRAMless too, but do consumer SSD spec sheets generally say how much DRAM (if any) the devices have?

        I rely on TechPowerUp; as an example, compare the Samsung 970 Evo[2] to the 990 Evo[3] under the DRAM cache section.

        [1]: https://www.tomshardware.com/pc-components/ssds/samsung-990-... (second image in IOMeter graph)

        [2]: https://www.techpowerup.com/ssd-specs/samsung-970-evo-1-tb.d...

        [3]: https://www.techpowerup.com/ssd-specs/samsung-990-evo-plus-1...

    • justsomehnguy 4 hours ago
      At this point just throw your shitty SSD in the garbage bin^W^W USB box and buy a proper one. OOMing would always cost you more.

      And if you still need to use a shitty SSD, then just increase your swap size dramatically, giving the drive some breathing room and implicitly overprovisioning it.

  • seba_dos1 13 minutes ago
    A simpler alternative to OOM daemons could be enabling MGLRU's thrashing prevention: https://www.kernel.org/doc/html/next/admin-guide/mm/multigen...

    I'm using it together with zram sized to 200% of RAM on a low-RAM phone with no disk swap (plus some tuning, like the mentioned clustering knob), and it works pretty well if you don't mind some otherwise preventable kills, but I will happily switch to diskless zswap once it's ready.
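
    The knob itself is tiny; roughly (the value is illustrative, and MGLRU has to be enabled first):

      echo y    > /sys/kernel/mm/lru_gen/enabled     # make sure MGLRU is on
      echo 1000 > /sys/kernel/mm/lru_gen/min_ttl_ms  # protect the last second's working set; OOM kill instead of thrashing past it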

  • lproven 3 hours ago
    I wrote about this recently too:

    https://www.theregister.com/2026/03/13/zram_vs_zswap/

    I prefer zswap to zram and as I linked at the end of the piece, it's not just me:

    https://linuxblog.io/zswap-better-than-zram/

    Maybe I am overthinking but I am wondering if this piece about myths is in any way a response to my article?

  • CoolGuySteve 4 hours ago
    Would be nice if zswap could be configured to have no backing cache so it could completely replace zram. Having two slightly different systems is weird.

    There's not really any difference between swap on disk being full and swap in RAM being full; either way something needs to get OOM killed.

    Simplifying the configuration would probably also make it easier to enable by default in most distros. It's kind of backwards that the most common Linux distros other than ChromeOS are behind Mac and Windows in this regard.

    • cdown 3 hours ago
      This is actually something we're actively working on! Nhat Pham is working on a patch series called "virtual swap space" (https://lwn.net/Articles/1059201/) which decouples zswap from its backing store entirely. The goal is to consolidate on a single implementation with proper MM integration rather than maintaining two systems with very different failure modes. It should be out in the next few months, hopefully.
    • jcalvinowens 3 hours ago
      > Would be nice if zswap could be configured to have no backing cache

      You can technically get this behavior today using /dev/ram0 as a swap device, but it's very awkward and almost certainly a bad idea.
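
      For the curious, the trick looks roughly like this (a sketch, and, as said, almost certainly a bad idea):

        modprobe brd rd_nr=1 rd_size=1048576   # one 1 GiB /dev/ram0 (rd_size is in KiB)
        mkswap /dev/ram0
        swapon --priority 100 /dev/ram0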

      • seba_dos1 4 minutes ago
        And you can use a zram-backed ram0 if you're still undecided :D
    • adgjlsfhk1 4 hours ago
      Very much agreed. I feel like distros still regularly get this wrong (as evidence, Ubuntu, PopOS and Fedora all have fairly different swap configs from each other).
  • astrobe_ 31 minutes ago
    > It only really makes sense for extremely memory-constrained embedded systems

    Even "mildly" memory-constrained embedded systems don't use swap, because their resources are tailored for their function. And they are typically not fans [1] of compression either, because the compression ratio is often unpredictable.

    [1] Yes, they typically don't need fans because overheating and using a motor for cooling is a double waste of energy.

  • garaetjjte 1 hour ago
    >They size the zram device to 100% of your physical RAM, capped at 8GB. You may be wondering how that makes any sense at all – how can one have a swap device that's potentially the entire size of one's RAM?

    The zram size applies to uncompressed data; real memory usage grows dynamically as pages are stored (plus some static bookkeeping). Most memory compresses well, so you probably want the zram device size to be even larger than physical RAM.
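
    For example, a device sized at 150% of RAM would look something like this (a sketch; algorithm availability depends on the kernel, and the actual RAM consumed is roughly the stored data divided by the compression ratio):

      mem_kib=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
      dev=$(zramctl --find --size "$((mem_kib * 3 / 2))KiB" --algorithm zstd)
      mkswap "$dev" && swapon --priority 100 "$dev"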

  • prussian 4 hours ago
    With zram, I can just use zram-generator[0] and it does everything for me; I don't even need to set anything up, other than installing the systemd generator, which on some distros is installed by default. Is there anything equivalent for zswap? If not, I'm not surprised most people just use zram, even if it's sub-optimal.

    [0]: https://crates.io/crates/zram-generator

    • seba_dos1 1 minute ago
      It's a handy tool, but it doesn't even give you a reasonable zram size by default and doesn't touch other things like page-cluster, so "I don't even need to set anything up" applies only if you don't mind it being quite far from optimal.
    • bilegeek 1 hour ago
      Kernel arguments are the primary method: https://wiki.archlinux.org/title/Zswap#Using_kernel_boot_par...

      Snag: I had issues getting it to use zstd at boot. Not sure if it's a bug or some peculiarity with Debian. Ended up compiling my own kernel for other reasons, and was finally able to get zstd by default, but otherwise I'd have to make/add it to a startup script.
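
      For reference, the boot-parameter form looks roughly like this (values illustrative; zstd has to be available that early, i.e. built into the kernel or present in the initramfs, otherwise zswap can fall back to the default compressor, which may be the snag above):

        zswap.enabled=1 zswap.compressor=zstd zswap.zpool=zsmalloc zswap.max_pool_percent=25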

    • melvyn2 3 hours ago

        echo 1 > /sys/module/zswap/parameters/enabled

      It's in TFA.

      • prussian 3 hours ago
        Enabling != configuring. Are you saying this is all that's necessary, assuming a swap device already exists? That should be made clearer.

        Edit: To be extra clear. When I was researching this, I ended up going with zram only because:

        * It is the default for Fedora.

        * zramctl gives me live statistics of used and compressed size.

        * The zswap doc didn't help my confusion on how backing devices work (I guess they're any swapon'd device?)

        • stdbrouw 15 minutes ago
          It doesn't really need any config on most distros, no.

          That said, if you want it to behave at its best when you're close to OOM, it does help to tweak vm.swappiness, vm.watermark_scale_factor, vm.min_free_kbytes, vm.page-cluster, and a couple of other parameters.

          See e.g.

          https://makedebianfunagainandlearnhowtodoothercoolstufftoo.c...

          https://documentation.suse.com/sles/15-SP7/html/SLES-all/cha...
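
          As a rough illustration of the kind of values involved (illustrative only, not recommendations; the file name is made up and the right numbers depend on RAM size and workload):

            # /etc/sysctl.d/99-zswap.conf
            vm.swappiness = 180              # prefer pushing anon pages into the compressed pool
            vm.watermark_scale_factor = 125  # wake kswapd earlier, before pressure gets acute
            vm.page-cluster = 0              # fault back single pages instead of 8-page clusters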

          I don't know of any good statistics script for zswap; I use the script below as a custom waybar module:

            #!/bin/bash
            # stored_pages = number of uncompressed pages held by zswap,
            # pool_total_size = compressed pool size in bytes (debugfs, needs root).
            stored_pages="$(cat /sys/kernel/debug/zswap/stored_pages)"
            pool_total_size="$(cat /sys/kernel/debug/zswap/pool_total_size)"
            # Compressed pool size as whole GiB plus one decimal digit.
            compressed_size_mib="$((pool_total_size / 1024 / 1024))"
            compressed_size_gib="$((pool_total_size / 1024 / 1024 / 1024))"
            compressed_size_mib_remainder="$((compressed_size_mib * 10 / 1024 - compressed_size_gib * 10))"
            # Uncompressed size of what is stored (4 KiB pages assumed).
            uncompressed_size="$((stored_pages * 4096))"
            uncompressed_size_mib="$((uncompressed_size / 1024 / 1024))"
            uncompressed_size_gib="$((uncompressed_size / 1024 / 1024 / 1024))"
            uncompressed_size_mib_remainder="$((uncompressed_size_mib * 10 / 1024 - uncompressed_size_gib * 10))"
            # Compression ratio as a percentage (+1 avoids division by zero).
            ratio="$((100 * uncompressed_size / (pool_total_size + 1)))"
            echo "$compressed_size_gib.$compressed_size_mib_remainder GiB ($uncompressed_size_gib.$uncompressed_size_mib_remainder GiB uncompressed, $ratio%)"
    • alfanick 3 hours ago
      This should've been a bash script...
  • Szpadel 32 minutes ago
    There is one more feature zram offers: multiple compression levels. I use a simple bash script to first use fast compression and then, after 1h, recompress with a much stronger compression algorithm.

    Unfortunately, you cannot chain it with any additional layer or offload to disk later on, because recompression breaks idle tracking by setting the timestamp to 0 (so it's 1970 again).

    https://gist.github.com/Szpadel/9a1960e52121e798a240a9b320ec...
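
    For reference, the sysfs interface the script drives looks roughly like this (a sketch, not the gist itself, assuming zram0 and a kernel built with the recompression support):

      echo "algo=zstd priority=1" > /sys/block/zram0/recomp_algorithm  # register a stronger secondary algorithm
      echo all > /sys/block/zram0/idle                                 # mark everything currently stored as idle
      sleep 3600                                                       # pages touched in the meantime lose the idle flag
      echo "type=idle" > /sys/block/zram0/recompress                   # recompress whatever is still idle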

  • MaxCompression 3 hours ago
    One underappreciated aspect of zswap vs zram is the compression algorithm choice and its interaction with the data being compressed.

    LZ4 (default in both) is optimized for speed at the expense of ratio — typically 2-2.5x on memory pages. zstd can push that to 3-3.5x but at significantly higher CPU cost per page fault.

    The interesting tradeoff: memory pages are fundamentally different from files. They contain lots of pointer-sized values, stack frames, and heap metadata — data patterns where simple LZ variants actually perform surprisingly well relative to more complex algorithms. Going beyond zstd (e.g., BWT-based or context mixing) would give diminishing returns on memory pages while destroying latency.

    So the real question isn't just "zswap vs zram" but "how much CPU are you willing to spend per compressed page, given your workload's memory access patterns?" For latency-sensitive workloads, LZ4 with zswap writeback is hard to beat.
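
    The trade-off is easy to probe at runtime, since zswap's compressor can be switched on the fly (a sketch; the new algorithm only applies to pages stored from that point on):

      echo zstd > /sys/module/zswap/parameters/compressor
      grep -H . /sys/kernel/debug/zswap/stored_pages /sys/kernel/debug/zswap/pool_total_size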

  • guenthert 5 hours ago
    So much polemic and no numbers? If it is a performance issue, show me the numbers!
    • cdown 5 hours ago
      There are quite a few numbers in the article, although of course I'm happy to hear any more you'd like presented.

      * A counterintuitive 25% reduction in disk writes at Instagram after enabling zswap

      * Eventual ~5:1 compression ratio on Django workloads with zswap + zstd

      * 20-30 minute OOM stalls at Cloudflare with the OOM killer never once firing under zram

      The LRU inversion argument follows plainly from the code presented and is a logical consequence of how swap priority and zram's block device architecture interact; I'm not sure numbers would add much there.

      • guenthert 4 hours ago
        > The LRU inversion argument follows plainly from the code presented and is a logical consequence of how swap priority and zram's block device architecture interact; I'm not sure numbers would add much there.

        Yes, it is all very plausible, but run times for a given workload (on a given, documented system) known to cause memory pressure to the point of swapping, compared across vanilla Linux (default or some appropriate swappiness), zram, and zswap, would be appreciated.

        https://linuxblog.io/zswap-better-than-zram/ at least qualifies that zswap performs better when using a fast NVMe device as the swap device, and that zram remains superior on systems with a slow swap device or none at all.

  • Mashimo 4 hours ago
    Is this advice also applicable to desktop installations?
    • stdbrouw 1 minute ago
      I get the impression that most desktop users enable zram or zswap to get a little bit more out of their RAM, but there is never any real worry about OOM (not regularly, anyway), so (according to the principles laid out in the article) it shouldn't matter much.

      On my workstation, I run statistical simulations in R which can be wasteful with memory and cause a lot of transient memory pressure, and for that scenario I do like that zswap works alongside regular swap. Especially when combined with the advice from https://makedebianfunagainandlearnhowtodoothercoolstufftoo.c... to wake up kswapd early, it really does make a difference.

    • therealmarv 2 hours ago
      The better distros have it (ZRAM) enabled by default for desktops (I think PopOS and Fedora). In my personal experience every desktop Linux should use memory compression (unless you have an absurd amount of RAM), because it helps so much, especially with everything related to browser and/or Electron usage!

      Windows and macOS have had it enabled by default for many years (even if it works a little differently there).

      • captn3m0 1 hour ago
        I did an Archinstall setup this weekend, and that also suggested zram.
  • adgjlsfhk1 5 hours ago
    Can you make a follow-up here on the best way to set up swap to support full disk encryption + hibernation?
    • tmtvl 59 minutes ago
      If you want hibernation then you just encrypt your swap partition like you encrypt your root partition.
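
      Roughly, for one common layout (a sketch; the partition and mapping name are illustrative, the initramfs must be able to unlock the swap before resume, and the key must be persistent, since a throwaway random key won't survive the reboot):

        # /etc/crypttab: unlock the swap partition at boot (passphrase or keyfile)
        cryptswap  /dev/nvme0n1p3  none  luks
        # /etc/fstab: use the unlocked mapping as swap
        /dev/mapper/cryptswap  none  swap  defaults  0 0
        # kernel command line: resume from the same mapping after hibernation
        resume=/dev/mapper/cryptswap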
  • tonnydourado 3 hours ago
    That's a banger article; I don't even like low-level stuff and yet read the whole thing. Hopefully I will have the opportunity to use some of it if I ever get around to switching my personal notebook back to Linux.
  • nephanth 6 hours ago
    I used to put swap on zram when my laptop had one of those early SSDs that people would tell you not to put swap on, for fear of wearing them out.

    Setup was tedious

  • jitl 6 hours ago
    thank goodness Kubernetes got support for swap; zswap has been a great boon for one of my workloads
  • quapster 3 hours ago
    The interesting meta-point here is how a kernel mechanism turned into cargo-cult tuning advice.

    "Use zram, save your SSD" made sense in the era of tiny eMMC, no TRIM, and mystery flash controllers. It also fit a very human bias: disk I/O feels scary and finite, CPU cycles feel free and infinite. So zram became a kind of talisman you enable once and never think about again.

    But the kernel isn't optimizing for your feelings about SSD wear, it's optimizing for global memory pressure. zswap fits into that feedback loop, zram mostly sits outside it. Once you see that, the behavior people complain about ("my system thrashes and then dies mysteriously") stops being mysterious: they effectively built a second, opaque memory pool that the MM subsystem can't reason about or reclaim from cleanly.

    What's funny is that on modern desktops and servers, the alleged downside of zswap (writing to disk sometimes) is the one thing the hardware is extremely good at, while the downside of zram (locking cold garbage in RAM and confusing reclaim/oom) is exactly what you don't want when the machine is under stress. The folk wisdom never updated, but the hardware and the kernel did.