Introduction to Bhyve#

In this short article, I introduce what Bhyve is, how it works, and how to modify it.

What is Bhyve?123#

  • Bhyve, pronounced "beehive", is a hypervisor for FreeBSD.
  • Bhyve runs on x86_64 hosts and supports i386 and x86_64 guests.
  • Bhyve requires a CPU with VT-x and EPT support (Intel Core i*).
  • Bhyve consists of a kernel module (vmm.ko), a library (libvmmapi), and the utilities bhyve, bhyveload, and bhyvectl. Yes, it works as a kernel module, like KVM.
  • The source code is in the FreeBSD SVN source repository: sys/amd64/vmm/, usr.sbin/bhyve/, usr.sbin/bhyveload/, usr.sbin/bhyvectl/, and lib/libvmmapi/
  • Variants: xhyve4, Pluribus Netvisor, bhyve in Illumos-based distributions.
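
As a rough illustration of the VT-x requirement above, the following C sketch checks the VMX feature bit that CPUID leaf 1 reports in ECX bit 5. The function names are made up for this example, and the sketch assumes GCC or Clang (for <cpuid.h>). It does not check EPT, which is advertised through VMX capability MSRs that user space cannot read directly.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>      /* GCC/Clang wrapper around the CPUID instruction */
#endif

/* CPUID leaf 1 reports VMX (Intel VT-x) support in ECX bit 5. */
#define CPUID1_ECX_VMX  (UINT32_C(1) << 5)

/* Hypothetical helper: decode the VMX bit from a CPUID.1:ECX value. */
bool
ecx_has_vmx(uint32_t ecx)
{
    return ((ecx & CPUID1_ECX_VMX) != 0);
}

/* Hypothetical helper: query the running CPU; false on non-x86 builds. */
bool
cpu_has_vmx(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 when the requested leaf is unsupported. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) == 0)
        return (false);
    return (ecx_has_vmx(ecx));
#else
    return (false);
#endif
}
```

Note that a set VMX bit only says the CPU implements VT-x; firmware can still disable it, so this is a necessary check rather than a sufficient one.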

Components5#

component | functionality
vmm.ko | VT-x, local APIC, VT-d for PCI pass-through, guest physical memory management, user-space cdev interface
bhyveload | user-space bootloader (userboot library + bhyve API); creates the VM, lays out kernel + metadata, sets up the initial VM register state
bhyve | user-space run loop, PCI bus/device emulation, device backends, threads for vCPUs and I/O devices, kqueue loop
bhyvectl | dump/modify VM state, dump VM stats, delete VMs
libvmmapi | userland API

Demystification#

  • bhyveload only supports FreeBSD guests; grub2-bhyve can load Linux and OpenBSD.
  • Each /dev/vmm/${vmname} device node holds the state of one VM instance.
  • VMX root operation runs the hypervisor; VMX non-root operation runs the VM.
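
Because every VM instance shows up as a node under /dev/vmm, a user-space tool can tell whether a VM exists just by looking for that node. The sketch below is a minimal illustration of that idea. The helper names are hypothetical and are not part of libvmmapi.

```c
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: build "/dev/vmm/<vmname>" into buf.
 * Returns 0 on success, -1 if the path does not fit.
 */
int
vmm_device_path(const char *vmname, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/dev/vmm/%s", vmname);

    return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}

/* Hypothetical helper: 1 if the VM's device node exists, 0 otherwise. */
int
vm_exists(const char *vmname)
{
    char path[256];
    struct stat sb;

    if (vmm_device_path(vmname, path, sizeof(path)) != 0)
        return (0);
    return (stat(path, &sb) == 0);
}
```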

VMX (VMCS-maintenance) instructions#

  • VMPTRLD: makes the referenced VMCS active and current.
  • VMPTRST: stores the current-VMCS pointer into the destination operand.
  • VMCLEAR: sets the launch state of the VMCS referenced by the operand to “clear”, renders that VMCS inactive, and ensures that data for the VMCS have been written to the VMCS-data area in the referenced VMCS region. If the operand is the same as the current-VMCS pointer, that pointer is made invalid.
  • VMREAD: reads a component from a VMCS.
  • VMWRITE: writes a component to a VMCS.
  • VMLAUNCH: launches a virtual machine managed by the current VMCS, whose launch state must be “clear”. A VM entry occurs, transferring control to the VM.
  • VMRESUME: resumes a virtual machine managed by the current VMCS, whose launch state must be “launched”. A VM entry occurs, transferring control to the VM.
  • VMXOFF: causes the processor to leave VMX operation.
  • VMXON: causes a logical processor to enter VMX root operation. (vmm_init -> vmx_init -> vmx_enable -> vmxon)
  • INVEPT: invalidates entries in the TLBs and paging-structure caches that were derived from extended page tables (EPT).
  • INVVPID: invalidates entries in the TLBs and paging-structure caches based on a virtual-processor identifier (VPID).
  • VMCALL: allows software in VMX non-root operation to call the VMM for service. A VM exit occurs, transferring control to the VMM.
  • VMFUNC: allows software in VMX non-root operation to invoke a VM function (processor functionality enabled and configured by software in VMX root operation) without a VM exit.
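
VMLAUNCH and VMRESUME differ only in the launch state they expect from the VMCS. The following C sketch models that state machine in software. It is a toy model with made-up names, not Bhyve code, and it tracks a single VMCS only.

```c
#include <stdbool.h>

/* Launch state of a VMCS, as tracked by the processor. */
enum launch_state { VMCS_CLEAR, VMCS_LAUNCHED };

/* Toy model of one VMCS (hypothetical type, for illustration only). */
struct vmcs_model {
    enum launch_state state;
    bool              current;  /* made current by VMPTRLD */
};

/* VMCLEAR: launch state becomes "clear"; the VMCS is no longer current. */
void
model_vmclear(struct vmcs_model *v)
{
    v->state = VMCS_CLEAR;
    v->current = false;
}

/* VMPTRLD: the VMCS becomes active and current. */
void
model_vmptrld(struct vmcs_model *v)
{
    v->current = true;
}

/* VMLAUNCH: only valid on a current VMCS in the "clear" state. */
bool
model_vmlaunch(struct vmcs_model *v)
{
    if (!v->current || v->state != VMCS_CLEAR)
        return (false);
    v->state = VMCS_LAUNCHED;
    return (true);
}

/* VMRESUME: only valid on a current VMCS in the "launched" state. */
bool
model_vmresume(const struct vmcs_model *v)
{
    return (v->current && v->state == VMCS_LAUNCHED);
}
```

In this model, a hypervisor enters a vCPU for the first time with VMCLEAR, VMPTRLD, VMLAUNCH, and uses VMRESUME for every later entry until the VMCS is cleared again.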

Some data structs#

#define VM_MAXCPU   8
struct vm_exit {
    enum vm_exitcode    exitcode;
    int                 inst_length;    /* 0 means unknown */
    uint64_t            rip;
    union {
        struct {} inout; // direction, how many bytes, port number, eax for out
        struct {} paging;
        struct {} vmx;
        struct {} msr;
    } u;
};
struct vm_exit vmexit[VM_MAXCPU];

The above code defines the exit information recorded for each virtual CPU, up to VM_MAXCPU of them.
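
The exit record is a tagged union: exitcode says which member of u is valid. The sketch below fills in two members to show how a consumer reads it. The field names inside inout and msr are assumptions based on the comment in the listing (direction, byte count, port number, eax), and describe_exit is a hypothetical helper.

```c
#include <stdint.h>
#include <stdio.h>

enum vm_exitcode { VM_EXITCODE_INOUT, VM_EXITCODE_RDMSR };

struct vm_exit {
    enum vm_exitcode exitcode;      /* selects the valid member of u */
    int              inst_length;   /* 0 means unknown */
    uint64_t         rip;
    union {
        struct {                    /* assumed field layout */
            int      in;            /* 1 = in, 0 = out */
            int      bytes;         /* access width */
            uint16_t port;
            uint32_t eax;           /* value written by out */
        } inout;
        struct {                    /* assumed field layout */
            uint32_t code;          /* MSR number */
        } msr;
    } u;
};

/* Format a one-line description; returns snprintf's result, or -1. */
int
describe_exit(const struct vm_exit *vme, char *buf, size_t len)
{
    switch (vme->exitcode) {
    case VM_EXITCODE_INOUT:
        return (snprintf(buf, len, "%s port 0x%x, %d byte(s)",
            vme->u.inout.in ? "in" : "out",
            (unsigned int)vme->u.inout.port, vme->u.inout.bytes));
    case VM_EXITCODE_RDMSR:
        return (snprintf(buf, len, "rdmsr 0x%x",
            (unsigned int)vme->u.msr.code));
    }
    return (-1);
}
```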

Bhyve VM exit codes#

code | handler | description | next
VM_EXITCODE_INOUT | vmexit_inout | in and out instructions | VMEXIT_CONTINUE or VMEXIT_ABORT
VM_EXITCODE_VMX | vmexit_vmx | VM exit | VMEXIT_ABORT
VM_EXITCODE_BOGUS | vmexit_bogus | - | VMEXIT_RESTART or VMEXIT_SWITCH
VM_EXITCODE_RDMSR | vmexit_rdmsr | Local APIC | VMEXIT_ABORT
VM_EXITCODE_WRMSR | vmexit_wrmsr -> emulate_wrmsr | Local APIC | VMEXIT_CONTINUE or VMEXIT_SWITCH
VM_EXITCODE_MTRAP | vmexit_mtrap | - | VMEXIT_RESTART
VM_EXITCODE_PAGING | vmexit_paging -> emulate_instruction | - | VMEXIT_CONTINUE or VMEXIT_ABORT
Example: vmexit_inout#

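static int
vmexit_inout(struct vmctx *ctx, struct vm_exit *vme, int *pvcpu)
{
    int error, vcpu, in, out, port, bytes;
    uint32_t eax;

    // unpack the exit information recorded for this vCPU
    vcpu = *pvcpu;
    port = vme->u.inout.port;
    bytes = vme->u.inout.bytes;
    eax = vme->u.inout.eax;
    in = vme->u.inout.in;
    out = !in;

    // string (ins/outs) and rep-prefixed I/O are not emulated
    if (vme->u.inout.string || vme->u.inout.rep)
        return (VMEXIT_ABORT);
    // reset: out 0x64, 0xFE -> vmexit_catch_reset -> VMEXIT_RESET
    if (out && port == 0x64 && (uint8_t)eax == 0xFE)
        return (vmexit_catch_reset());
    // host notification: out 0x488, {0, 1, 5} -> VMEXIT_CONTINUE
    if (out && port == GUEST_NIO_PORT)
        return (vmexit_handle_notify(ctx, vme, pvcpu, eax));
    // handle other in/out (strictio is a global option)
    error = emulate_inout(ctx, vcpu, in, port, bytes, &eax, strictio);
    // for in, hand the value that was read back to the guest in RAX
    if (error == 0 && in)
        error = vm_set_register(ctx, vcpu, VM_REG_GUEST_RAX, eax);
    if (error == 0)
        return (VMEXIT_CONTINUE);
    else
        return (vmexit_catch_inout()); // VMEXIT_ABORT
}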
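
The last step of the handler, emulating the access, amounts to looking up a per-port callback. The sketch below shows one way such a registry could look. All names (register_port, emulate_port_io, latch_handler) are hypothetical, and the non-strict fallback of returning all ones for unclaimed reads is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_IOPORTS 0x10000

/* Callback type: 'in' is 1 for reads, 0 for writes; *eax carries the data. */
typedef int (*inout_func_t)(int in, int port, int bytes, uint32_t *eax,
    void *arg);

static struct {
    inout_func_t handler;
    void        *arg;
} ioport_handlers[MAX_IOPORTS];

/* Claim a port; fails if it is out of range or already claimed. */
int
register_port(int port, inout_func_t handler, void *arg)
{
    if (port < 0 || port >= MAX_IOPORTS ||
        ioport_handlers[port].handler != NULL)
        return (-1);
    ioport_handlers[port].handler = handler;
    ioport_handlers[port].arg = arg;
    return (0);
}

/* Route one access; in strict mode an unclaimed port is an error. */
int
emulate_port_io(int in, int port, int bytes, uint32_t *eax, int strict)
{
    if (port < 0 || port >= MAX_IOPORTS)
        return (-1);
    if (ioport_handlers[port].handler == NULL) {
        if (strict)
            return (-1);
        if (in)
            *eax = 0xffffffff;  /* assumed: unclaimed reads return all ones */
        return (0);
    }
    return (ioport_handlers[port].handler(in, port, bytes, eax,
        ioport_handlers[port].arg));
}

/* Example device: a one-byte latch that returns the last value written. */
int
latch_handler(int in, int port, int bytes, uint32_t *eax, void *arg)
{
    uint8_t *latch = arg;

    (void)port;
    if (bytes != 1)
        return (-1);
    if (in)
        *eax = *latch;
    else
        *latch = (uint8_t)*eax;
    return (0);
}
```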

Some data structs: linker sets and pci_devemu#

#define SET_DECLARE(set, ptype)                 \
    extern ptype *__CONCAT(__start_set_,set);   \
    extern ptype *__CONCAT(__stop_set_,set)
#define SET_BEGIN(set)                          \
    (&__CONCAT(__start_set_,set))
#define SET_LIMIT(set)                          \
    (&__CONCAT(__stop_set_,set))
#define SET_FOREACH(pvar, set)                  \
    for (pvar = SET_BEGIN(set); pvar < SET_LIMIT(set); pvar++)
#define SET_ITEM(set, i)                        \
    ((SET_BEGIN(set))[i])
#define SET_COUNT(set)                          \
    (SET_LIMIT(set) - SET_BEGIN(set))

SET_DECLARE(pci_devemu_set, struct pci_devemu);

struct pci_devemu {
    char      *pe_emu;      /* Name of device emulation */
    /* instance creation */
    int       (*pe_init)(struct vmctx *, struct pci_devinst *, char *opts);
    /* config space read/write callbacks */
    int       (*pe_cfgwrite)(...);
    int       (*pe_cfgread)(...);
    /* I/O space read/write callbacks */
    void      (*pe_iow)(...);
    uint32_t  (*pe_ior)(...);
};

#ifdef __GNUCLIKE___SECTION
#define __MAKE_SET(set, sym)                        \
    __GLOBL(__CONCAT(__start_set_,set));                \
    __GLOBL(__CONCAT(__stop_set_,set));             \
    static void const * const __set_##set##_sym_##sym       \
    __section("set_" #set) __used = &sym
#else /* !__GNUCLIKE___SECTION */
#ifndef lint
#error this file needs to be ported to your compiler
#endif /* lint */
#define __MAKE_SET(set, sym)    extern void const * const (__set_##set##_sym_##sym)
#endif /* __GNUCLIKE___SECTION */

#define TEXT_SET(set, sym)  __MAKE_SET(set, sym)
#define DATA_SET(set, sym)  __MAKE_SET(set, sym)
#define BSS_SET(set, sym)   __MAKE_SET(set, sym)
#define ABS_SET(set, sym)   __MAKE_SET(set, sym)
#define SET_ENTRY(set, sym) __MAKE_SET(set, sym)

/* PCI_EMUL_SET(x) wraps DATA_SET(pci_devemu_set, x): it adds x to the set */
PCI_EMUL_SET(pci_xxx);

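The above code defines the name and callbacks of each PCI device emulation and registers it in the pci_devemu_set linker set, so the device list can be walked at run time with SET_FOREACH.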
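
To make the linker-set trick concrete, here is a stripped-down sketch of the same technique outside the FreeBSD headers. It assumes GCC or Clang on an ELF platform linked with GNU ld or lld, which define __start_<name> and __stop_<name> symbols for any section whose name is a valid C identifier. The struct, macro, and function names are invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct pci_devemu. */
struct devemu {
    const char *name;
    int       (*init)(void);
};

/* Store a pointer to sym in the ELF section "set_devemu". */
#define DEVEMU_SET(sym)                                         \
    static const struct devemu *const set_devemu_##sym          \
    __attribute__((section("set_devemu"), used)) = &sym

/* Section bounds, provided by the linker. */
extern const struct devemu *const __start_set_devemu[];
extern const struct devemu *const __stop_set_devemu[];

static int hostbridge_init(void) { return (0); }
static int virtio_net_init(void) { return (0); }

static const struct devemu hostbridge = { "hostbridge", hostbridge_init };
static const struct devemu virtio_net = { "virtio-net", virtio_net_init };

/* Each device registers itself; no central list needs editing. */
DEVEMU_SET(hostbridge);
DEVEMU_SET(virtio_net);

/* Number of registered devices (the section is an array of pointers). */
size_t
devemu_count(void)
{
    return ((size_t)(__stop_set_devemu - __start_set_devemu));
}

/* Walk the set and return the entry with a matching name, or NULL. */
const struct devemu *
devemu_find(const char *name)
{
    const struct devemu *const *pp;

    for (pp = __start_set_devemu; pp < __stop_set_devemu; pp++)
        if (strcmp((*pp)->name, name) == 0)
            return (*pp);
    return (NULL);
}
```

The appeal of this design is that adding a device is a one-file change: the new source file declares its struct and registers it, and the lookup code finds it without modification.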

Virtual devices (dummy means very low fidelity) (Old BHyve)#

peripheral | file | description
atpic | usr.sbin/bhyve/atpic.c | dummy
console | usr.sbin/bhyve/consport.c | ttyread|ttywrite
gdbport | usr.sbin/bhyve/dbgport.c | bind, listen, accept, read|write
elcr | usr.sbin/bhyve/elcr.c | dummy
pit 8254 | usr.sbin/bhyve/pit_8254.c | pit_8254_handler
post | usr.sbin/bhyve/post.c | dummy
rtc | usr.sbin/bhyve/rtc.c | rtc_addr_handler|rtc_data_handler
uart | usr.sbin/bhyve/uart.c | dummy
pci-dummy | usr.sbin/bhyve/pci_emul.c | pci_emul_dinit|pci_emul_diow|pci_emul_dior
pci-hostbridge | usr.sbin/bhyve/pci_hostbridge.c | pci_hostbridge_init
pci-passthru | usr.sbin/bhyve/pci_passthru.c | /dev/pci, /dev/io
pci-uart | usr.sbin/bhyve/pci_uart.c | pci_uart_init|pci_uart_write|pci_uart_read
pci-virtio-blk | usr.sbin/bhyve/pci_virtio_block.c | pci_vtblk_init|pci_vtblk_write|pci_vtblk_read
pci-virtio-net | usr.sbin/bhyve/pci_virtio_net.c | pci_vtnet_init|pci_vtnet_write|pci_vtnet_read
Virtual device initialization#

// main: parse the PCI slot options
while ((c = getopt(argc, argv, ...)) != -1) {
    switch (c) {
    case 's':
        pci_parse_slot(optarg, 0); // not legacy
        break;
    case 'S':
        pci_parse_slot(optarg, 1); // legacy
        break;
    }
}

// pci_parse_slot: the option is "snum,emul[,config]", snum is from 0 to 31
//     0,hostbridge
//     1,virtio-net,tap0
//     pci_slotinfo[snum].si_name = emul;     // hostbridge, virtio-net
//     pci_slotinfo[snum].si_param = config;  // null, tap0
//     pci_slotinfo[snum].si_legacy = legacy; // 0, 1

// init_inout: install a handler for each port:
//     atpic, console, gdbport, elcr, pci_emu, pit 8254, post, rtc, and uart.

// init_pci: for each slot i configured by the options above
pde = pci_emul_finddev(si->si_name);
if (pde != NULL) {
    pci_emul_init(ctx, pde, i, si->si_param); // invokes pde->pe_init
}
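
The slot-parsing step described in the comments above can be sketched as a small stand-alone function. This is an illustrative re-implementation with hypothetical names (parse_slot, struct slotinfo), not the pci_parse_slot from usr.sbin/bhyve. It accepts "snum,emul[,config]" and fills in the three fields listed above.

```c
#include <stdlib.h>
#include <string.h>

#define MAXSLOTS 32

/* Per-slot configuration, mirroring the three fields named above. */
struct slotinfo {
    char *si_name;      /* emulation name, e.g. "virtio-net" */
    char *si_param;     /* backend config, e.g. "tap0", or NULL */
    int   si_legacy;
};

struct slotinfo pci_slotinfo[MAXSLOTS];

/* Parse "snum,emul[,config]"; returns 0 on success, -1 on bad input. */
int
parse_slot(const char *opt, int legacy)
{
    char *copy, *emul, *config, *end;
    size_t len = strlen(opt) + 1;
    long snum;

    /* Work on a private copy so the option string can be split in place. */
    if ((copy = malloc(len)) == NULL)
        return (-1);
    memcpy(copy, opt, len);

    emul = strchr(copy, ',');
    if (emul == NULL)
        goto fail;
    *emul++ = '\0';
    config = strchr(emul, ',');
    if (config != NULL)
        *config++ = '\0';

    snum = strtol(copy, &end, 10);
    if (end == copy || *end != '\0' || snum < 0 || snum >= MAXSLOTS ||
        *emul == '\0')
        goto fail;

    /* The stored strings point into copy, which is kept on purpose. */
    pci_slotinfo[snum].si_name = emul;
    pci_slotinfo[snum].si_param = config;
    pci_slotinfo[snum].si_legacy = legacy;
    return (0);
fail:
    free(copy);
    return (-1);
}
```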

Extend BTest (like QTest) to BHyve#

The idea of BTest is to implement the same primitives as QTest. Because we are doing in-memory fuzzing rather than trapping guest accesses, we can drive the devices directly through the PCI emulation interfaces.
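
As a rough sketch of what QTest-like primitives could look like, the following C code interprets two text commands, outb and inb, against an in-memory port array. The command names mirror QTest's; the function name btest_exec, the reply strings, and the fake_ports backing store are assumptions of this sketch. A real BTest would call the device emulation callbacks instead of touching an array.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the device callbacks: one byte of state per I/O port. */
static uint8_t fake_ports[0x10000];

/*
 * Execute one text command and write the reply into out.
 *   "outb ADDR VALUE" -> "OK"
 *   "inb ADDR"        -> "OK 0xNN"
 * Numbers are hexadecimal, with or without a 0x prefix.
 * Returns 0 on success, -1 on a malformed or unknown command.
 */
int
btest_exec(const char *line, char *out, size_t outlen)
{
    char cmd[8];
    unsigned int addr = 0, val = 0;
    int n = sscanf(line, "%7s %x %x", cmd, &addr, &val);

    if (n >= 3 && strcmp(cmd, "outb") == 0 &&
        addr <= 0xffff && val <= 0xff) {
        fake_ports[addr] = (uint8_t)val;
        snprintf(out, outlen, "OK");
        return (0);
    }
    if (n >= 2 && strcmp(cmd, "inb") == 0 && addr <= 0xffff) {
        snprintf(out, outlen, "OK 0x%02x", (unsigned int)fake_ports[addr]);
        return (0);
    }
    snprintf(out, outlen, "FAIL");
    return (-1);
}
```

A fuzzer can then feed generated command lines to btest_exec and never has to boot a guest, which is the point of doing the fuzzing in memory.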