Virtualization

How to decide VMware vCPU to physical CPU ratio


Introduction: 

Defining the proper ratio of physical CPUs to virtual CPUs in a virtualized environment is a perpetual debate, and no vendor offers a rule of thumb for deriving it. Many times we have asked ourselves, or our fellow architects, a candid question from a commercial point of view: why has the workload-optimization trend (i.e. the number of workloads running on a host, which is ultimately a question of over-commitment) not increased, even though the processing efficiency of the underlying hardware has improved tremendously, along with its cost? One reason could be that the compute-resource consumption techniques in new-generation operating systems and cloud-aware applications have been developed to cater to extraordinary business demands and to offer multi-tenant solutions, so the OS and applications have been enhanced in parallel with the processing efficiency of the underlying hardware. For example, some kernels program the virtual system timer to request clock interrupts at 100 Hz (100 interrupts per second), while others use 250 Hz. Nevertheless, best practices based on market research and broad acceptance are always available to help define the ratio for your needs.

We have witnessed in most environments that VMware ESXi allows substantial levels of CPU over-commitment, i.e. running more vCPUs on a host than the total number of physical processor cores in that host, without impacting virtual machine performance. We have always been of the mindset that when provisioning new VMs it is best to start with fewer vCPUs and add more as they are required, unless we specifically know that the application will be craving more resources. Many have pointed out that the single-vCPU mindset is obsolete, and we can always debate this: older operating systems were uniprocessor (i.e. single core) based, but the trend has changed significantly in newer operating systems and applications.

 

Sizing Factors:

  • The exact amount of CPU over-commitment a VMware host can accommodate depends entirely on the VMs and the applications they are running. A general performance guide for [allocated vCPUs] vs. [total pCPUs], from our best-practices recommendations, is listed below. You may differ on these figures, because you know your applications and environment needs far better than any best practice will dictate. But please don't over-provision just to leverage the over-provisioning feature. The numbers below are based only on our understanding.
* 1:1 - Will not cause any performance problems. It won't always make sense to provision like this, but it is still recommended for business-critical workloads.
* 2:1 - May still perform well, depending on the environment. Recommended for compute-intensive workloads.
* 3:1 - May cause slight performance degradation depending on the type of workload. Might be recommended for regular production workloads.
* 4:1 - May cause resource scarcity. Might be recommended for low-priority mixed-workload environments.
* 5:1 or greater - Will often cause performance problems, but might be acceptable for VDI environments.

  • In the past, every virtualization vendor's promotional approach focused on resource over-commitment, i.e. running more workloads to optimize resources for data-center transformations; now the paradigm is shifting toward performance because of automation.
  • In reality, many factors go into defining what the vCPU-to-pCPU ratio should be in any given environment, and as workloads change and are utilized throughout the day, the "magical" ratio changes too. There is no rule of thumb for this number, and business-critical workloads cannot be managed the same way as a typical IT environment with mixed workloads.
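As an illustration, the tiers above can be folded into a small helper. The function name and tier boundaries below are just a sketch of the guideline numbers from this post, not a vendor rule:

```python
# Hypothetical helper that maps a host's allocated-vCPU to physical-core
# ratio onto the guideline tiers described above.
def overcommit_tier(allocated_vcpus: int, physical_cores: int) -> str:
    """Classify a vCPU:pCPU over-commitment ratio into a guideline tier."""
    ratio = allocated_vcpus / physical_cores
    if ratio <= 1:
        return "1:1 - safe; suited to business-critical workloads"
    if ratio <= 2:
        return "2:1 - compute-intensive workloads"
    if ratio <= 3:
        return "3:1 - regular production workloads"
    if ratio <= 4:
        return "4:1 - low-priority mixed workloads"
    return "5:1+ - typically VDI only; monitor closely"

# Example: 96 vCPUs allocated on a 24-core host is a 4:1 ratio.
print(overcommit_tier(96, 24))
```

Remember that these tiers are a starting point; your own monitoring of CPU Ready and co-stop figures should drive the final number.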


Basic Calculations:

The number of physical cores (pCPU) available on a host:
         (# of processor sockets) X (# of cores per socket) = # of physical processors (pCPU)
The number of logical cores if hyper-threading is enabled on the host:
         (# of physical processors, i.e. pCPU) X (2 threads per core) = # of logical processors (vCPU capacity)
Total CPU resources required for virtual machines at peak:
         (# of virtual machines) X (average peak CPU utilization per VM)
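The arithmetic above can be sketched in a few lines. The host specifications and per-VM figures below are hypothetical example values, not recommendations:

```python
# Sketch of the basic sizing calculations; all figures are illustrative.
sockets = 2            # number of processor sockets in the host
cores_per_socket = 12  # cores per socket

pcpu = sockets * cores_per_socket  # physical processors (pCPU)
logical = pcpu * 2                 # logical processors with hyper-threading enabled

vms = 40
avg_peak_util = 1.5                # average peak demand per VM, in vCPU-equivalents
peak_demand = vms * avg_peak_util  # total CPU required at peak

print(f"pCPU: {pcpu}, logical CPUs (HT): {logical}, peak demand: {peak_demand} vCPUs")
```

With these example numbers the host exposes 24 pCPUs (48 logical with HT) against a peak demand of 60 vCPU-equivalents, which is where the over-commitment discussion above begins.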

Kindly note that hyper-threading does not actually double the available physical CPU capacity. It works by providing a second execution thread to a processor core, so a processor with 4 physical cores with hyper-threading appears as 8 logical cores for scheduling purposes only. When one thread is idle or waiting, the other thread can execute instructions. This increases efficiency only if there is enough CPU idle time to allow scheduling two threads; in practice the performance gain is at most around 25 to 30%, because hyper-threading duplicates only a small portion of a core, not the entire core, so it cannot deliver fully parallel multi-threading.

The VMkernel uses a relaxed co-scheduling algorithm to schedule processors. With this algorithm, not every vCPU has to be scheduled on a physical processor at the same time when the virtual machine is scheduled to run; the number of vCPUs that run at once depends on the operation being performed at that moment.

Never reserve VM memory unless it is needed from an application-performance perspective, because overhead memory includes space reserved for the virtual machine frame buffer and various virtualization data structures, such as shadow page tables. There is non-trivial computational overhead in maintaining the coherency of the shadow page tables.

The overhead is more pronounced as the number of vCPUs increases; overhead memory also depends on the number of vCPUs and the memory configured for the guest operating system. In SMP virtual machines the guest operating system can migrate processes from one vCPU to another, and this migration can incur a small CPU overhead. If the migration is very frequent, it might be helpful to pin guest threads or processes to specific vCPUs. (Note that this is another reason not to configure virtual machines with more vCPUs than they need.) Many operating systems keep time by counting timer interrupts, and timer interrupt rates vary between operating systems and versions. In addition to the timer interrupt rate, the total number of timer interrupts delivered to a virtual machine depends on a number of other factors: in particular, the more vCPUs a virtual machine has, the more interrupts it requires. Delivering many virtual timer interrupts negatively impacts virtual machine performance and increases host CPU consumption. If you have a choice, use guest operating systems that require fewer timer interrupts.


Top 10 Best Practices Recommendations:

  1. If possible, start with one vCPU per VM and increase as needed. Try not to assign more vCPUs than a VM needs, as this can needlessly limit resource availability for other VMs and increase CPU ready wait time. CPU virtualization adds varying amounts of overhead depending on the workload and the type of virtualization used.
  2. VMware schedules all the vCPUs in a VM at the same time. If all the allocated vCPUs are not available simultaneously, the VM will be in a "co-stop" state until the host can co-schedule all of them. In its simplest form, co-stop indicates the amount of time after the first vCPU becomes available until the remaining vCPUs are available for the VM to run. Sizing VMs with the fewest vCPUs possible minimizes time spent in co-stop waits. A co-stop percentage persistently above 3 to 4% is a critical sign that the VM's compute configuration needs immediate attention and resizing. The co-stop value can be identified by running the "esxtop" command and referring to the %CSTP column.
  3. Monitor CPU utilization in the VM to determine whether additional vCPUs are required or too many have been allocated. CPU use can be monitored through VMware or through the VM's guest operating system. Utilization should generally average below 80%, and above 85% should trigger a critical alert, though this will vary depending on the type of applications running in the VM. Virtual machine CPU Ready measures the time a VM must wait for CPU resources from the host; VMware recommends that CPU Ready stay below 5%.
  4. Monitor CPU utilization on the VMware host to determine whether CPU use by the VMs is approaching the host's maximum CPU capacity. As with CPU usage on VMs, utilization above 80 to 85% should be considered a warning level, and 90% or more indicates the CPUs are approaching an overloaded state.
  5. Configuring a virtual machine with more vCPUs than its workload can actually use might cause slightly increased resource usage, potentially impacting performance on very heavily loaded systems. Common examples include a single-threaded workload running in a multi-vCPU virtual machine, or a multithreaded workload in a virtual machine with more vCPUs than the workload can effectively use. Even if the guest operating system doesn't use some of its vCPUs, configuring those vCPUs still imposes small resource requirements on VMware ESXi that translate into real CPU consumption on the host, and unused vCPUs still consume timer interrupts in some guest operating systems. Most guest operating systems execute an idle loop during periods of inactivity, within which most of them halt by executing the HLT or MWAIT instructions.
  6. Identify the applications that are processor bound (i.e. most of the application's time is spent executing instructions rather than waiting for external events such as user interaction, device input, or data retrieval); for these, any processor virtualization overhead translates into a reduction in overall performance.
  7. Don't expect a resource increase or a reboot to be all that is required to improve a VM's CPU performance. Doing only this can be like applying a Band-Aid and moving on until the same performance issue reappears.
  8. Overcommitting CPU will certainly allow us to maximize the use of host CPU resources and increase workload optimization, but we must also rigorously monitor overcommitted hosts for CPU use, CPU Ready, and co-stop percentages. Avoid oversizing VMs with more vCPUs than needed, and consider the NUMA architecture and the effect of co-stop waits when creating VMs with multiple vCPUs.
  9. Memory over-commitment is a completely different case from CPU. When memory contention arises, the host server uses four memory-management mechanisms (transparent page sharing, ballooning, memory page compression, and swapping) to dynamically reclaim memory from virtual machines and adjust what is allocated to each.
  10. Always use the average of all observed peak utilizations to calculate accurate vCPU requirements, and do not plan on fully utilizing all host CPU resources. This method helps ensure that all systems can run at their observed peak resource levels simultaneously. Do not forget to add overhead requirements such as hypervisor usage, high availability, failover, and isolation; for business-critical environments it is suggested to keep hypervisors loaded at 70 to 75% on average for better performance. When calculating the number of hosts required to support your design, consider the higher of the two values (either CPU or RAM). The higher value must be used because it is the limiting factor.
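As a sketch of recommendation #10, a host-count calculation that sizes from the higher of the CPU and RAM requirements while keeping hosts at roughly 70 to 75% load might look like this. All function names and figures below are illustrative assumptions, not a formal sizing tool:

```python
import math

# Hypothetical sizing sketch: take the higher of the CPU-driven and
# RAM-driven host counts, then add spare capacity for HA/failover.
def hosts_needed(peak_vcpus, peak_ram_gb, cores_per_host, ram_per_host_gb,
                 target_load=0.75, ha_spare_hosts=1):
    usable_cores = cores_per_host * target_load   # keep hosts at ~75% load
    usable_ram = ram_per_host_gb * target_load
    by_cpu = math.ceil(peak_vcpus / usable_cores)
    by_ram = math.ceil(peak_ram_gb / usable_ram)
    # The higher value is the limiting factor; add spares for failover.
    return max(by_cpu, by_ram) + ha_spare_hosts

print(hosts_needed(peak_vcpus=240, peak_ram_gb=3200,
                   cores_per_host=48, ram_per_host_gb=768))
```

With these example figures the CPU requirement (7 hosts) exceeds the RAM requirement (6 hosts), so CPU is the limiting factor and one HA spare brings the total to 8.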


CPU Scheduler in Detail: 

  • Keep in mind that a CPU socket, or CPU package, refers to a physical CPU unit plugged into a system board. For example, a 4-way or 4-socket system can contain up to four CPU packages. Within a CPU package there can be multiple processor cores, each of which contains dedicated compute resources and may share memory resources with other cores; such an architecture is often referred to as chip multiprocessor (CMP) or multicore. Different levels of cache are available to a core: typically, L1 cache is private to a core, while L2 or L3 cache may be shared between cores. Last-level cache (LLC) refers to the slowest layer of on-chip cache, beyond which a request is served by memory (that is, the last level of cache available before the request must go to RAM, which takes more time than accessing L1-L3 cache). SRAM is different from system RAM and is used only on processors; it stores data immediately before and after it is processed, and it is extremely expensive.
  • A processor core may have multiple logical processors that share the compute resources of the core. "Physical processor" may be used to refer to a core, in contrast to a logical processor. In this context, pCPU refers to a logical processor on a system with Hyper-Threading (HT) enabled; otherwise, it refers to a processor core.
  • A virtual machine is a collection of virtualized hardware resources that would constitute a physical machine on a native environment. Like a physical machine, a virtual machine is assigned a number of virtual CPUs, or vCPUs. For example, a 4-way virtual machine has four vCPUs. A vCPU and a world are interchangeably used to refer to a schedulable CPU context that corresponds to a process in conventional operating systems. A host refers to a vSphere server that hosts virtual machines; a guest refers to a virtual machine on a host.
  • A world is associated with a run state. When first added, a world is either in RUN or in READY state depending on the availability of a pCPU. A world in READY state is dispatched by the CPU scheduler and enters RUN state. It can later be de-scheduled and enter either READY or COSTOP state. A co-stopped world is co-started later and enters READY state. A world in RUN state enters WAIT state by blocking on a resource and is later woken up once the resource becomes available. Note that a world becoming idle enters WAIT_IDLE, a special type of WAIT state, although it is not explicitly blocking on a resource. An idle world is woken up whenever it is interrupted. The image below illustrates the state transitions in detail.
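A minimal sketch of the transitions described above, encoded as a lookup table. The transition set is our reading of the text, not an official VMkernel specification:

```python
# Illustrative state-transition table for a scheduler "world"; state names
# follow the text above, and the allowed moves are our interpretation.
TRANSITIONS = {
    "READY":     {"RUN"},                                   # dispatched by the scheduler
    "RUN":       {"READY", "COSTOP", "WAIT", "WAIT_IDLE"},  # de-scheduled, blocked, or idle
    "COSTOP":    {"READY"},                                 # co-started later
    "WAIT":      {"READY"},                                 # woken when the resource is free
    "WAIT_IDLE": {"READY"},                                 # idle world woken by an interrupt
}

def can_transition(src: str, dst: str) -> bool:
    """Return True if a world may move directly from src to dst."""
    return dst in TRANSITIONS.get(src, set())

print(can_transition("RUN", "COSTOP"))  # a running world can be co-stopped
print(can_transition("COSTOP", "RUN"))  # a co-stopped world must pass through READY
```

Note in particular that a co-stopped or waiting world never runs directly; it always re-enters READY first and competes for a pCPU again.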

VMware Document for CPU Performance (Page #5, Figure 1)


Juggler’s Theory   

Juggling is the manipulation of one or many objects at the same time, using one or many props. Recall the circus acts you must have seen in childhood. We can associate this art with memory and CPU virtualization to understand them easily: it is all about the objects, the requirements, the allocation, the timing, and of course the demand versus supply of resources.


Overhead Memory on VMware Virtual Machines:

VMware Document for CPU Overhead (Page #34, Table 6-1)


  • We can use the software MMU (memory management unit) when a virtual machine runs heavy workloads, such as Translation Lookaside Buffer (TLB)-intensive workloads, that have a significant impact on overall system performance. However, the software MMU has a higher overhead-memory requirement than the hardware MMU, so to support it, the maximum overhead supported for the virtual machine limit in the VMkernel needs to be increased. You can configure a virtual machine with up to 128 CPUs if the host has ESXi 6.0 or later compatibility (hardware version 11).

Procedure:

  • Right-click a virtual machine in the inventory and select Edit Settings.
  • On the Virtual Hardware tab, expand CPU, and select an instruction set from the CPU/MMU Virtualization drop-down menu.
  • Click OK.