x86 virtualization has become a mainstream technology, but it was not always that way. The Intel CPU architecture was not designed for virtualization, and only relatively recently were clever software methods devised to implement it successfully.
In this post, I want to share some interesting technologies I've discovered while writing a proprietary virtualization solution.
When a new piece of hardware is designed, the software drivers for it are often on the critical path, since the engineers need working hardware to write and debug them. Frequently, slow functional software models are used instead, and sometimes also very expensive Quickturn emulation boxes.
Around the late '90s (the last century!), VMware came out with its virtualization product. I recognized the value that such software could have in device simulation, so a couple of us at 3dfx Interactive, the company I worked for at that time, went to visit them in their offices in Palo Alto to try to make a deal where they would provide hooks for our simulated graphics processor device.
The deal was never made, mainly because they were a growing company that lacked the resources to dedicate to a task not aligned with their primary goal.
However, with the 3dfx management's blessing, I started this project: to research and write internal proprietary virtualization software (at that time, other alternatives were very limited or nonexistent). By the time 3dfx went under (December 2000), vmSim could boot DOS and get far into the Linux boot; it had a number of basic PC virtual devices and could load our GPU simulation model, which mapped into the PCI aperture space, IO, and interrupts. The virtualization code implemented scan-before-execute to detect and translate instructions that can't be virtualized. It all ran on Windows NT.
This document contains a technical overview of the software. This one describes the paging implementation.
The most interesting pieces of that software are the bridge and the monitor.
The bridge is a very small but delightfully complex piece of code that resides on a page shared by the host and a guest VM. Its task is to swap page tables so that the CPU "sees" an entirely new context, and then to perform the actual transition to and from a VM.
You can see the source code for the bridge here.
Surprisingly, it only has 3 functions:
TaskSwitch() performs the context swap by reloading CPU control registers and table pointers. It "injects" the bridge page and "calls" it, which causes the guest context to run until it is interrupted by an external interrupt or a CPU trap. Control then returns to the VM monitor code. Since performance was not critical, my monitor code resided in the host address space, which made it easier to debug.
TaskBridgeToVM() and TaskBridgeToHost() are two functions that are copied into the bridge page. They execute in both the host and guest address contexts. All the addressing is relative so that the code may run at any address (it needs to be mapped to an unused section of a guest address space and that can vary per guest OS). This code is cognizant of the guest CPU mode and will arm it appropriately (real/V86 mode, protected mode).
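The shape of that round trip can be modeled in a few lines of Python. This is only a toy sketch, not the vmSim code: `CpuContext`, `ToyCpu`, `task_switch` and the `run_guest` callback are hypothetical names, and a real bridge reloads actual CPU registers rather than swapping an object.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass
class CpuContext:
    """A simplified stand-in for the registers that define a context."""
    cr3: int       # page-directory base
    gdt_base: int
    idt_base: int
    mode: str      # "real", "v86" or "protected"


class ToyCpu:
    def __init__(self, context: CpuContext):
        self.context = context


def task_switch(cpu: ToyCpu, guest: CpuContext,
                run_guest: Callable[[ToyCpu], str]) -> str:
    """Swap to the guest context, run until it exits, then swap back."""
    host = replace(cpu.context)       # save a copy of the host context
    cpu.context = guest               # plays the role of the bridge-to-VM step
    try:
        exit_reason = run_guest(cpu)  # guest runs until an interrupt or trap
    finally:
        cpu.context = host            # plays the role of the bridge-to-host step
    return exit_reason
```

The `try`/`finally` mirrors the one property that matters most here: whatever happens while the guest runs, the host context is restored before the monitor sees the exit reason.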
The effective interrupt table of a running guest is not the one the guest has set up; it resides in the bridge code, so that the bridge can take control and route interrupts (and traps) to the resident monitor code. In the same way, all other tables are also shadowed: page tables, LDT, GDT, TSS.
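A minimal Python model of that routing idea follows. It is a sketch under my own assumptions, not the vmSim code: the `Monitor` class, its string "handlers" and the decision policy in `dispatch` are hypothetical, and only the fact that vector 3 is the breakpoint (INT3) trap is taken from the x86 architecture.

```python
from typing import Callable, Dict, List

NUM_VECTORS = 256
BREAKPOINT_VECTOR = 3  # INT3 lands here on x86


class Monitor:
    def __init__(self, guest_idt: Dict[int, str]):
        # What the guest believes its table contains (vector -> handler name).
        self.guest_idt = guest_idt
        self.log: List[str] = []
        # The effective table: every vector enters the monitor first.
        self.shadow_idt: List[Callable[[], str]] = [
            self._make_stub(vector) for vector in range(NUM_VECTORS)
        ]

    def _make_stub(self, vector: int) -> Callable[[], str]:
        def stub() -> str:
            return self.dispatch(vector)
        return stub

    def dispatch(self, vector: int) -> str:
        """Decide what to do with an interrupt or trap taken by the guest."""
        self.log.append(f"monitor saw vector {vector:#04x}")
        if vector == BREAKPOINT_VECTOR:
            return "emulate patched instruction"
        handler = self.guest_idt.get(vector)
        if handler is None:
            return "no guest handler"
        return f"reflect to guest handler {handler}"
```

For example, `Monitor({0x08: "timer_isr"}).shadow_idt[0x08]()` goes through the monitor and is then reflected to the handler the guest registered, while the guest's own table is never installed on the CPU.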
The monitor code controls guest execution. It receives guest interrupts and faults and emulates the desired behavior. Several x86 instructions could reveal to the guest that it is being virtualized, or could dangerously change the system state or CPU tables. To handle them, the monitor uses scan-before-execute (or binary translation): before a guest page is run, it is quickly scanned, and every problematic instruction is replaced with INT3, which traps back into the monitor where the instruction is evaluated and dealt with appropriately. That process generates a set of shadow pages that are mapped and run instead of the "real" guest pages. The shadow pages themselves are marked as not present for the guest OS, so any attempt to read them triggers a trap and the original page is presented instead, further ensuring that the guest can't figure out it is being virtualized.
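Here is a small Python sketch of the patching step only. It assumes the offsets of the problematic instructions are already known, and the helper names `make_shadow_page` and `guest_read` are made up for this illustration. The only hard facts in it are the 4 KB page size and the one-byte INT3 opcode, 0xCC.

```python
PAGE_SIZE = 4096
INT3 = 0xCC  # one-byte x86 breakpoint opcode


def make_shadow_page(guest_page: bytes, sensitive_offsets):
    """Return (shadow, originals): a patched copy and the bytes it replaced."""
    if len(guest_page) != PAGE_SIZE:
        raise ValueError("expected exactly one page")
    shadow = bytearray(guest_page)
    originals = {}
    for offset in sensitive_offsets:
        originals[offset] = guest_page[offset]  # remembered for emulation
        shadow[offset] = INT3                   # executing it traps to the monitor
    return bytes(shadow), originals


def guest_read(guest_page: bytes, offset: int) -> int:
    """Data reads are served from the original page, never from the shadow."""
    return guest_page[offset]
```

The guest page itself is never modified; only the copy that gets executed carries the breakpoints, which is why a read by the guest still returns the original byte.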
The instruction scanner (which is based on a trimmed-down disassembler) scans until:
- the target instruction crosses a page boundary
- it hits a terminal instruction, one that needs to be virtualized
- it reaches an instruction that has already been scanned
- it encounters an invalid opcode
In addition, certain decoded instructions (calls and conditional jumps) will cause a recursive scan on both code paths.
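The stop conditions and the two-path scan can be sketched in Python over a deliberately tiny decode table. This is an illustration, not the vmSim scanner: it knows only four real x86 encodings (NOP, CLI, JMP rel8, JZ rel8), treats CLI as the one "terminal" instruction, stops when a branch target leaves the page, and uses an explicit worklist where the original is described as recursive. The name `scan_page` is hypothetical.

```python
PAGE_SIZE = 4096

# opcode -> (length in bytes, kind); a tiny subset of real x86 encodings
OPCODES = {
    0x90: (1, "plain"),     # NOP
    0xFA: (1, "terminal"),  # CLI, has to be virtualized
    0xEB: (2, "jmp"),       # JMP rel8
    0x74: (2, "jcc"),       # JZ rel8
}


def scan_page(page: bytes, entry: int):
    """Return (scanned, to_patch): decoded offsets and terminal offsets."""
    scanned, to_patch = set(), set()
    pending = [entry]
    while pending:
        pc = pending.pop()
        # Stop at the page edge or at an already scanned instruction.
        while 0 <= pc < PAGE_SIZE and pc not in scanned:
            decoded = OPCODES.get(page[pc])
            if decoded is None:
                break                         # invalid (unknown) opcode
            length, kind = decoded
            if pc + length > PAGE_SIZE:
                break                         # instruction crosses the page boundary
            scanned.add(pc)
            if kind == "terminal":
                to_patch.add(pc)              # candidate for an INT3 patch
                break
            fall_through = pc + length
            if kind == "plain":
                pc = fall_through
                continue
            rel8 = page[pc + 1]
            if rel8 >= 0x80:
                rel8 -= 0x100                 # sign-extend the 8-bit displacement
            if kind == "jcc":
                pending.append(fall_through)  # scan the not-taken path later
            pc = fall_through + rel8          # follow the branch target now
    return scanned, to_patch
```

On a page that starts with `90 74 02 90 FA 90 FA`, scanning from offset 0 follows both sides of the JZ and reports the two CLI instructions at offsets 4 and 6 as the ones to patch.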
Overall, the monitor code is not large, but it's fairly complicated and your test PC will tend to crash until everything is just right.
You can see part of the scanner source code here, and the portion of the monitor code that executes virtualized instructions here.
On the other hand, there was a lot of code implementing the virtual devices.
Since the monitor virtualizes the CPU only, the app has to provide a set of standard memory-mapped and IO devices that would "answer" to the guest and provide services as expected by all basic hardware devices that are part of a generic PC architecture. Those (virtual) devices include:
- PIIX4 chipset (I picked that one since it was very generic and widely used)
- CMOS and RTC
- DMA chip and
- Floppy and HDD controller which maps sector operations to the host file system
- Keyboard that maps to the host keyboard when it is "connected" by the application
- Monochrome and VGA adapters in text mode and selected graphic modes that are displayed in the application pane
- And last but not least, all add-on virtual devices, such as the GPU simulation code, that are to be hooked up to this virtualization environment
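To show what "answering" the guest means for an IO device, here is a minimal Python sketch of a port dispatcher with a toy CMOS attached. The class names `IoBus` and `CmosRtc` and their methods are hypothetical, not the vmSim interfaces. The sketch relies on the standard PC convention of a CMOS index register at port 0x70 and a data register at port 0x71; returning 0xFF for unclaimed ports, and for reads of the index port, is a simplifying assumption.

```python
class CmosRtc:
    """Toy CMOS: index register at port 0x70, data register at port 0x71."""
    PORTS = (0x70, 0x71)

    def __init__(self):
        self.index = 0
        self.ram = bytearray(128)

    def io_write(self, port: int, value: int) -> None:
        if port == 0x70:
            self.index = value & 0x7F  # bit 7 is the NMI mask on a real PC
        else:
            self.ram[self.index] = value & 0xFF

    def io_read(self, port: int) -> int:
        if port == 0x70:
            return 0xFF                # simplification: index port not readable
        return self.ram[self.index]


class IoBus:
    """Routes trapped guest IN/OUT accesses to the device that owns the port."""

    def __init__(self):
        self.handlers = {}

    def register(self, device) -> None:
        for port in device.PORTS:
            if port in self.handlers:
                raise ValueError(f"port {port:#x} already claimed")
            self.handlers[port] = device

    def outb(self, port: int, value: int) -> None:
        device = self.handlers.get(port)
        if device is not None:
            device.io_write(port, value)  # writes to unclaimed ports are dropped

    def inb(self, port: int) -> int:
        device = self.handlers.get(port)
        return device.io_read(port) if device is not None else 0xFF
```

An add-on device, such as a simulated GPU, would plug in the same way: it declares the ports it owns and the bus calls it whenever the monitor traps a guest access to one of them.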
For example, the source code for the virtual floppy drive controller is here.
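Mapping sector operations to a host file comes down to turning a cylinder/head/sector address into a byte offset within the image. The Python sketch below assumes a standard 1.44 MB floppy geometry (80 cylinders, 2 heads, 18 sectors per track, 512-byte sectors) and a flat image file. The function names are hypothetical and this is not the vmSim controller code.

```python
SECTOR_SIZE = 512
CYLINDERS = 80
HEADS = 2
SECTORS_PER_TRACK = 18


def chs_to_offset(cylinder: int, head: int, sector: int) -> int:
    """Translate a CHS address (sectors are 1-based) to an image byte offset."""
    if not (0 <= cylinder < CYLINDERS and 0 <= head < HEADS
            and 1 <= sector <= SECTORS_PER_TRACK):
        raise ValueError("CHS address outside the disk geometry")
    lba = (cylinder * HEADS + head) * SECTORS_PER_TRACK + (sector - 1)
    return lba * SECTOR_SIZE


def read_sector(image, cylinder: int, head: int, sector: int) -> bytes:
    """Read one sector from a seekable binary file object."""
    image.seek(chs_to_offset(cylinder, head, sector))
    data = image.read(SECTOR_SIZE)
    if len(data) != SECTOR_SIZE:
        raise IOError("image file is shorter than the disk geometry")
    return data


def write_sector(image, cylinder: int, head: int, sector: int,
                 data: bytes) -> None:
    """Write one sector to a seekable binary file object."""
    if len(data) != SECTOR_SIZE:
        raise ValueError("a sector is exactly 512 bytes")
    image.seek(chs_to_offset(cylinder, head, sector))
    image.write(data)
```

With a host file opened as `open("floppy.img", "r+b")` (the file name is just an example), the boot sector is `read_sector(image, 0, 0, 1)`.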
The Main App
Finally, there is the main application that hosts all those components. It also includes a set of tools to create and manage the resources connected to a guest VM, such as floppy and hard disk images.
This is a virtual machine control program with one of its settings tabs opened:
This is a snapshot of a plain DOS session running in the virtual machine. The code page used by my VGA model was different so some characters look wrong:
The following image shows Linux booting within vmSim; on the right, there is a debug window dumping the logs from various virtual devices as they process requests:
The project was never finished, since 3dfx went bust and I had other, more existential problems to take care of, but virtualization remained a very exciting area for me, especially after learning just how tricky and complex that kind of software is.
VMware, Parallels, and a host of other vendors now provide very sophisticated and robust solutions, and virtualization applications have become a free commodity while the companies race to provide related services instead.
New versions of Windows even have it built into the OS.