Dan Magenheimer
(This is a bit incomplete, I’ll update it Soon.)
OS Overview
CMM Overview
“Balloon overview”
CMM shortcomings
- Guests are still hogs who will expand to take the available space.
- Leaves no hypervisor memory for live migration.
- Not instant.
- Can cripple page cache.
- Can cause thrashing or OOM.
What is Transcendent Memory
- Collect all spare memory across the hypervisor and disk.
- Allow the hypervisor to make decisions about memory access via an API.
- Small OS changes to the client for it to work.
- Narrowed, well-specified
- Operations are synchronous, page-oriented, copy-based
- Multi-faceted and extensible.
- Transcendent memory is divided into four pool types, marked with appropriate flags:
- Ephemeral: can disappear at any time at the whim of the hypervisor for tasks such as live migration.
- Persistent: stays there as long as the guest exists and is live and unmigrated.
- Private: for one guest.
- Public: can be shared across multiple guests.
- 64-bit only on the hypervisor. 32-bit clients OK.
- Workload: needs some memory pressure, but at different times, for optimum value.
- dom0 should be configured with a fixed memory size and the guests need swap disks.
- Complementary to self-ballooning (cmm) and ksmd.
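
One way to read the flag list above is as two independent choices (ephemeral or persistent, private or public) giving four pool types. A minimal sketch of that reading follows; the names `PoolFlag` and `describe` and the bit values are made up for illustration and are not the real tmem ABI.

```python
from enum import Flag
from itertools import product


class PoolFlag(Flag):
    """Hypothetical flag bits; the real values were not given in the talk."""
    PERSISTENT = 1  # unset means ephemeral
    SHARED = 2      # unset means private to one guest


def describe(flags: PoolFlag) -> str:
    """Name the pool type selected by a combination of flags."""
    lifetime = "persistent" if PoolFlag.PERSISTENT in flags else "ephemeral"
    scope = "shared" if PoolFlag.SHARED in flags else "private"
    return f"{lifetime}/{scope}"


# Enumerate every combination of the two flags.
ALL_POOL_TYPES = [
    describe(lifetime | scope)
    for lifetime, scope in product(
        (PoolFlag(0), PoolFlag.PERSISTENT),
        (PoolFlag(0), PoolFlag.SHARED),
    )
]

for pool_type in ALL_POOL_TYPES:
    print(pool_type)
```

In these terms, the notes describe frontswap as a persistent/private pool and shared cleancache as an ephemeral/shared one.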
Transcendent Memory in Action
cleancache
- A second-chance clean page cache for a guest. If a guest needs to evict pages, the hypervisor may take ownership of copies of them if there’s spare memory. If there’s a subsequent cache miss for those pages, the hypervisor can satisfy it. Note that the guest is responsible for cache coherence, and the memory is private to each guest.
- cleancache can be compressed. Slow, but faster than going to swap.
- A pool is created for each guest - or more accurately, for every filesystem in every guest.
- Needs a memory scheduler.
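
The cleancache behaviour described above can be sketched as a toy model. Everything here is an assumption made for illustration: the class `EphemeralPool`, the key shape, the oldest-first eviction and the `reclaim` method are invented, and the real interface lives in the guest kernel and the hypervisor.

```python
from collections import OrderedDict
from typing import Callable, Optional, Tuple

# Assumed key shape: (pool id, inode number, page index).
Key = Tuple[int, int, int]


class EphemeralPool:
    """Toy hypervisor-side store for copies of clean pages."""

    def __init__(self, capacity_pages: int) -> None:
        self.capacity = capacity_pages
        self.pages: "OrderedDict[Key, bytes]" = OrderedDict()

    def put(self, key: Key, data: bytes) -> bool:
        """Offer an evicted clean page; the pool may refuse it."""
        if self.capacity <= 0:
            return False
        self.pages.pop(key, None)
        while len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)  # drop the oldest copy
        self.pages[key] = bytes(data)  # copy-based, not mapped
        return True

    def get(self, key: Key) -> Optional[bytes]:
        """Return the page if it survived, else None (a miss)."""
        return self.pages.get(key)

    def flush(self, key: Key) -> None:
        """Guest-driven invalidation: the guest keeps the cache coherent."""
        self.pages.pop(key, None)

    def reclaim(self, n: int) -> None:
        """Hypervisor takes memory back, e.g. for live migration."""
        for _ in range(min(n, len(self.pages))):
            self.pages.popitem(last=False)


def read_page(pool: EphemeralPool, key: Key,
              read_from_disk: Callable[[Key], bytes]) -> Tuple[bytes, str]:
    """Guest page-cache miss path: try the pool first, then storage."""
    data = pool.get(key)
    if data is not None:
        return data, "tmem"
    return read_from_disk(key), "disk"
```

The point of the sketch is the contract rather than the data structure: a `put` is only a hint, a later `get` may miss at any time, and the guest must `flush` a page whenever the copy on disk changes.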
Shared cleancache
- Can be shared between multiple guests sharing a clustered filesystem.
- Becomes a shared ephemeral pool.
- Previous security issues have been resolved.
- Optional compression.
- A fairly small, discrete patch.
- Handles only specific filesystems: ext3, ext4, ocfs2, btrfs.
frontswap
- The most important use.
- Private, persistent pool.
- Memory-based safety valve, providing a memory swap disk, in effect.
- Intended to reduce the likelihood of being forced into swap and swap thrashing or OOM visits.
- Memory can be given to guests under serious memory pressure.
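
The "memory swap disk" idea above can be sketched in the same toy style. The names `PersistentPool` and `Swap`, the per-guest quota and the dictionary standing in for the swap disk are all assumptions for illustration, not the real frontswap code.

```python
from typing import Dict, Optional


class PersistentPool:
    """Toy private, persistent pool: once a page is accepted it stays."""

    def __init__(self, quota_pages: int) -> None:
        self.quota = quota_pages
        self.pages: Dict[int, bytes] = {}

    def put(self, offset: int, data: bytes) -> bool:
        """Accept the page only if the (assumed) quota allows it."""
        if offset not in self.pages and len(self.pages) >= self.quota:
            return False
        self.pages[offset] = bytes(data)
        return True

    def get(self, offset: int) -> Optional[bytes]:
        return self.pages.get(offset)

    def flush(self, offset: int) -> None:
        self.pages.pop(offset, None)


class Swap:
    """Guest swap path that tries the pool before the real swap disk."""

    def __init__(self, pool: PersistentPool) -> None:
        self.pool = pool
        self.disk: Dict[int, bytes] = {}  # stand-in for the swap device
        self.disk_writes = 0

    def swap_out(self, offset: int, data: bytes) -> str:
        if self.pool.put(offset, data):
            self.disk.pop(offset, None)  # any older disk copy is stale
            return "tmem"
        self.disk[offset] = bytes(data)
        self.disk_writes += 1
        return "disk"

    def swap_in(self, offset: int) -> bytes:
        data = self.pool.get(offset)
        if data is not None:
            return data
        return self.disk[offset]
```

Unlike the ephemeral case, the refusal happens at `put` time: a page the pool accepted is not dropped behind the guest's back, which is why the guest still needs a real swap disk to fall back on.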
Performance Analysis
- Aggressive self-ballooning.
- Pull real use from Committed_AS in /proc/meminfo.
- cleancache reduces page-ins from storage by 30-40%, or 200-250 IO/second.
- frontswap reduces IO to external storage by 70 IO/second.
- Keeps track of its own cycle consumption via performance counters (roughly).
- Cost is around 0.08 - 0.15%.
- Trading IO for CPU. 0.1% of one core on an 8 guest system saves ~300 IO/s.
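
As a rough illustration of the Committed_AS point above, a self-ballooning service could read that field and clamp it into a target size. This is only a sketch under assumptions: the sample text and its numbers are invented, and `balloon_target_kb` with its floor and ceiling is a hypothetical policy, not the one used in the talk.

```python
import re

# Invented sample in the format of /proc/meminfo; the numbers are made up.
SAMPLE_MEMINFO = """\
MemTotal:        2048000 kB
MemFree:          512000 kB
Committed_AS:     786432 kB
"""


def committed_as_kb(meminfo_text: str) -> int:
    """Extract the Committed_AS value (in kB) from meminfo-style text."""
    match = re.search(r"^Committed_AS:\s+(\d+)\s+kB$", meminfo_text,
                      re.MULTILINE)
    if match is None:
        raise ValueError("Committed_AS not found")
    return int(match.group(1))


def balloon_target_kb(meminfo_text: str, floor_kb: int,
                      ceiling_kb: int) -> int:
    """Hypothetical policy: aim for Committed_AS, clamped to sane bounds."""
    return max(floor_kb, min(committed_as_kb(meminfo_text), ceiling_kb))


print(balloon_target_kb(SAMPLE_MEMINFO, floor_kb=262144, ceiling_kb=2048000))
```

On a real guest the text would come from reading /proc/meminfo rather than from a sample string.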
TODO
- cleancache, frontswap, shared cleancache patches posted and work.
- Xen code finished and in Xen 4.0; support for save/restore/live migration.
- tmem available in Oracle VM.
- self-ballooning is currently a service.
- Integration with ramzswap & FS-cache.
- Combine with page-sharing.
- real-world analysis.
- Other virtualisation: containers, LXC, KVM, etc.
Singing happened here.
Q&A
Can it be extended to apps, allowing apps that cache (like Firefox) to use it? Yes, we’ve thought about it, and there are obvious benefits for J2EE, Databases, and so on. There are some challenges, since the app needs to be aware of the need to release memory on demand, as well as taking it.
Prioritisation on the queues? Well, we can put weights and queueing on the machines, but there’s been no effort on allowing machines to mark pages as high or low priority except in clustering.
Could you talk about the security assumptions, especially one VM poisoning other VMs? It’s assumed that VMs are fairly trustworthy, so OCFS2 would allow you to pass bad/dangerous data around via the sharing mechanisms. Admins can lock the guests down by key mechanisms, but explicit clustering is open to poisoning.
How do you track between nodes that you’re looking at the same memory? The filesystem is providing the index for the pages, since this is focused on the page cache.
How does it compare to CMM? CMM, specifically CMM2, is arguably less collaborative, since the hypervisor is king in that context. CMM2 is hard, opaque, and IBM have largely given up on it.
Hierarchies of flash and disk? Let the hypervisor own the flash, and have the TM swap to flash before disk.