Now that you’ve moved to a virtual infrastructure and embraced the concept of software-defined everything, you can focus less on architecture and more on service delivery. Unfortunately, as you begin to place workloads on your virtual infrastructure, you and your customers notice significant performance degradation in your applications and services. What architects and system administrators must realize is that everything works on paper; real workloads and real users often change your perspective and send you hustling back to the drawing board to devise a new plan of action. Even in a software-defined world, the underlying hardware that supports the abstraction often has a mind of its own when it comes to performance. There are five key areas where a virtual infrastructure can suffer performance problems: storage, compute, network, workload balance and sprawl.
Storage Lag
One of the most complained-about performance problems with virtualization is storage lag. The obvious solution to storage lag is SSDs and all-flash arrays. The problem with that solution is cost. Virtualization’s promise, however, is to lower computing costs, not increase them. So, for a moment, set all-flash arrays aside as a “nice to have” option and look at some practical and reasonably priced ways to resolve storage lag.
VMware offers the following suggestions for resolving storage performance problems:
- Separate ESXi LUNs from non-ESXi LUNs
- Enable read caching and write caching
- Load balance disk I/O (a simple placement sketch follows this list)
- Continually monitor, redesign and tune
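Load balancing disk I/O is, at heart, a placement problem: spread the busiest virtual disks across LUNs so that no single array path saturates. The following Python sketch shows one greedy approach, assuming you can pull projected per-disk IOPS from your monitoring tools; the disk names, LUNs and capacity figures are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Datastore:
    name: str
    iops_capacity: int               # rough sustained IOPS the backing LUN can serve
    load: int = 0                    # projected IOPS already placed
    placed: list = field(default_factory=list)

def place_disks(disks, datastores):
    """Greedy placement: biggest consumers first, onto the datastore with the most headroom."""
    for vm_disk, iops in sorted(disks, key=lambda d: d[1], reverse=True):
        target = max(datastores, key=lambda ds: ds.iops_capacity - ds.load)
        target.placed.append(vm_disk)
        target.load += iops
    return datastores

# Hypothetical inventory: (virtual disk, projected IOPS from monitoring)
disks = [("db01-data", 4000), ("mail01", 1500), ("web01", 300),
         ("web02", 300), ("db02-data", 3500), ("file01", 800)]

for ds in place_disks(disks, [Datastore("LUN-A", 6000), Datastore("LUN-B", 6000)]):
    print(f"{ds.name}: {ds.placed} ({ds.load}/{ds.iops_capacity} IOPS)")
```

Sorting the heaviest consumers first is what keeps the two database disks from landing on the same LUN, which is exactly the contention scenario the list above warns against.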
System administrators should also consider separating application storage from operating system storage to alleviate disk I/O contention, and they should configure certain workloads to stay apart rather than giving them the freedom to move around via VMware’s automated Distributed Resource Scheduler (DRS). (For more on virtualization and storage, see Keeping Up With the Data Explosion by Virtualizing Storage.)
Constrained Compute Resources
Compute resources (CPU and memory) generally face constraint in a virtualized environment when administrators oversubscribe them. Oversubscription is easy to avoid by creating standards for virtual machines (VMs) and sticking to them; it occurs when administrators create VMs without standards, or when they attempt to mimic the physical environment from which a system has migrated. Remember that underutilized capacity is one of the major drivers toward a virtualized environment. Mostly idle physical systems, purchased with multiple multicore CPUs, dozens of gigabytes of memory and terabytes of storage, brought on the shift from physical to virtual in the first place.
Physical-to-virtual (P2V) conversion is often the culprit: migrations performed via P2V rarely reflect the optimal sizing recommendations set forth by vendors such as Microsoft and VMware, so the oversized footprints of the original hardware carry straight over into the software-defined infrastructure.
Correctly sizing VMs and preventing wasted capacity requires standards, monitoring, balancing and updating to ensure that infrastructure utilization remains optimal.
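A sizing standard only helps if it is enforced, and the arithmetic is simple enough to automate. Here is a minimal Python sketch that checks a host’s vCPU and memory commitments against a policy; the inventory, the 4:1 vCPU ratio and the no-memory-oversubscription limit are illustrative assumptions, not vendor figures.

```python
# Hypothetical host and VM inventory; thresholds are illustrative policy choices.
HOST_PCPUS = 32          # physical cores on the host
HOST_MEM_GB = 256        # physical memory on the host

vms = [
    {"name": "web01", "vcpus": 2, "mem_gb": 4},
    {"name": "db01",  "vcpus": 8, "mem_gb": 64},
    {"name": "app01", "vcpus": 4, "mem_gb": 16},
]

vcpu_ratio = sum(vm["vcpus"] for vm in vms) / HOST_PCPUS
mem_ratio = sum(vm["mem_gb"] for vm in vms) / HOST_MEM_GB

# Example conservative standard: keep vCPU:pCPU under 4:1 and
# avoid memory oversubscription entirely. Pick ratios that fit your workloads.
MAX_VCPU_RATIO, MAX_MEM_RATIO = 4.0, 1.0

print(f"vCPU:pCPU = {vcpu_ratio:.2f}:1  (limit {MAX_VCPU_RATIO}:1)")
print(f"memory    = {mem_ratio:.2%} of host (limit {MAX_MEM_RATIO:.0%})")
if vcpu_ratio > MAX_VCPU_RATIO or mem_ratio > MAX_MEM_RATIO:
    print("WARNING: host is oversubscribed beyond the standard")
```

Run against every host on a schedule, a check like this turns the “monitoring and updating” requirement into a routine report rather than an after-the-incident discovery.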
Network Latency
Logical and physical separation of network traffic is a best practice for standard physical systems that often goes ignored in virtual systems. For example, on standard physical hardware, enterprises generally configure a management network, a backup network, and possibly various private networks that link to database servers, and of course to storage networks. Administrators must carry over this practice to virtual systems by using VLANs and separate physical networks to carry different traffic streams.
Live migration (vMotion), backup, data, management and storage traffic should travel on separate physical and logical networks. Any overlap results in network constraints and unhappy users who expect fast network response from their applications and desktops.
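Traffic separation is easy to state and easy to erode one change at a time, so it is worth checking mechanically. The Python sketch below walks a hypothetical traffic plan (the VLAN IDs and the ESXi-style uplink names are made up) and flags any two streams that share a VLAN or a physical uplink.

```python
from itertools import combinations

# Hypothetical traffic plan: VLAN and physical uplinks carrying each stream.
traffic_plan = {
    "management": {"vlan": 10, "uplinks": {"vmnic0", "vmnic1"}},
    "vmotion":    {"vlan": 20, "uplinks": {"vmnic2", "vmnic3"}},
    "storage":    {"vlan": 30, "uplinks": {"vmnic4", "vmnic5"}},
    "vm-data":    {"vlan": 40, "uplinks": {"vmnic6", "vmnic7"}},
    "backup":     {"vlan": 50, "uplinks": {"vmnic6", "vmnic7"}},  # overlaps vm-data
}

# Compare every pair of streams and report any shared VLAN or uplink.
for (a, pa), (b, pb) in combinations(traffic_plan.items(), 2):
    if pa["vlan"] == pb["vlan"]:
        print(f"WARNING: {a} and {b} share VLAN {pa['vlan']}")
    shared = pa["uplinks"] & pb["uplinks"]
    if shared:
        print(f"WARNING: {a} and {b} share uplinks {sorted(shared)}")
```

In this sample plan the check catches backup riding on the same uplinks as VM data, the classic overlap that turns a nightly backup window into a user-visible slowdown.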
VMware suggests the following to optimize network throughput:
- Assign each physical NIC to a port group and to a vSwitch.
- Use separate NICs to handle different traffic streams.
- If dropped packets are a problem, increase the vNetwork driver ring buffers.
- Verify that all NICs operate in full duplex mode.
- Set all NICs to maximum capacity.
- Use VMXNET 3 NIC drivers.
- Balance VMs on vSwitches.
- Add more physical NICs to the host.
- Install VMware Tools on each VM.
Network latency is relatively uncommon on most networks, especially when administrators separate networks into isolated traffic streams; many VMs can share a single network interface without saturating it. But network latency does happen, so the question is, “Why?” It occurs most often on live migration (vMotion) networks, where VMs with large memory allocations move from one physical host to another. Correct VM memory allocation and dedicated migration networks lower the odds of bottlenecks during migrations, and NIC teaming, installing the latest drivers, and optimizing physical NIC settings all help keep bottlenecks at bay.
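A back-of-envelope estimate shows why large-memory VMs strain migration networks. The Python sketch below approximates a live migration’s transfer time from memory size and link speed; the 80% usable-throughput figure and the 15% allowance for re-copied dirty pages are rough assumptions for illustration, not vendor numbers.

```python
def migration_estimate(mem_gb: float, link_gbps: float, dirty_fraction: float = 0.15) -> float:
    """Rough lower bound, in seconds, on a live migration's transfer time.

    Live migration copies the VM's memory while it runs, then re-copies
    pages dirtied during the copy; dirty_fraction approximates that overhead.
    """
    payload_gbits = mem_gb * 8 * (1 + dirty_fraction)   # memory plus re-copied pages
    usable_gbps = link_gbps * 0.8                       # assume ~80% usable throughput
    return payload_gbits / usable_gbps

for mem_gb in (8, 64, 256):
    shared_1g = migration_estimate(mem_gb, 1.0)        # shared 1 GbE link
    dedicated_10g = migration_estimate(mem_gb, 10.0)   # dedicated 10 GbE link
    print(f"{mem_gb:>4} GB VM: ~{shared_1g:6.0f} s on 1 GbE, ~{dedicated_10g:5.0f} s on 10 GbE")
```

Under these assumptions a 256 GB VM monopolizes a 1 GbE link for the better part of an hour, which is exactly why the migration stream deserves its own fast network.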
Microsoft offers the following guidelines for Hyper-V and VMs:
- Enable VMQ on VMQ-capable physical NICs.
- Do not use Automatic Private IP Addressing.
- Install the latest NIC drivers.
- Turn on Jumbo Frames for all networking equipment.
- Remove unused protocols.
- Set the Management NIC first in the binding order.
- Create NIC teams prior to assigning to a network.
- Do not share virtual switch NICs with the host OS.
- Create redundant network paths, especially for the Live Migration network.
Network interfaces are not all created equal. There are known problems with certain NICs that administrators can avoid either by selecting different hardware or by studying the documentation thoroughly and making the appropriate changes to settings. Administrators should also take extra care to ensure that network settings match across all network devices. For example, set all 1 Gb NICs to 1 Gb, full duplex, and mirror those settings on the connected switch ports.
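Mismatches of this kind hide well in large environments, so an automated comparison pays off. The Python sketch below checks both ends of each link for matching speed and duplex; the device names and the inventory format are hypothetical, standing in for whatever your monitoring system or switch export actually provides.

```python
# Hypothetical inventory of link endpoints gathered from hosts and switches.
# Each entry pairs (device, port, speed_mbps, duplex) for the two ends of a link.
links = [
    (("esxi01", "vmnic0", 1000, "full"), ("sw01", "Gi1/0/1", 1000, "full")),
    (("esxi01", "vmnic1", 1000, "full"), ("sw01", "Gi1/0/2", 100,  "half")),
]

for nic, port in links:
    mismatches = []
    if nic[2] != port[2]:
        mismatches.append(f"speed {nic[2]} != {port[2]}")
    if nic[3] != port[3]:
        mismatches.append(f"duplex {nic[3]} != {port[3]}")
    if mismatches:
        print(f"{nic[0]}/{nic[1]} <-> {port[0]}/{port[1]}: {'; '.join(mismatches)}")
```

The second link in the sample, a NIC at 1 Gb full duplex facing a switch port at 100 Mb half duplex, is the kind of silent misconfiguration that shows up only as mysterious application slowness.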
Workload Imbalance
Workload imbalance is neither a hardware problem nor a software problem, but it is a real problem in virtual infrastructures. It occurs when too many workloads of a particular type collect on a single host system. For example, VMs with high CPU utilization can adversely affect each other’s performance if too many congregate on the same host. The same effect appears with memory-intensive workloads and with high disk I/O applications.
Workload balance requires extra thought and planning on the administrator’s part. Default DRS algorithms don’t always know best; the administrator has to study workloads, distribute and load balance accordingly, and put rules in place that prevent VMs from congregating. For example, in a cluster of ten host systems, the administrator can set rules for self-balancing workloads. Conservative settings ensure that VMs don’t move unless a host’s resources remain constrained over a period of time, while overly aggressive migration settings carry their own performance hit: the constant moves add latency to the migration network. Additionally, administrators can, and should, set anti-affinity rules that prevent certain VMs from ever landing on the same host.
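To make the idea concrete, here is a minimal Python sketch of placement that balances CPU demand across hosts while honoring anti-affinity groups. The cluster, the demand figures and the single “db-tier” rule are hypothetical, and a real scheduler such as DRS weighs far more signals; this only illustrates the constraint.

```python
def balance(vms, hosts, anti_affinity):
    """Place VMs on the least-loaded host; VMs in the same anti-affinity
    group must never share a host. Assumes a valid placement exists."""
    placement = {h: [] for h in hosts}
    load = {h: 0 for h in hosts}
    group = {vm: g for g, members in anti_affinity.items() for vm in members}

    for vm, demand in sorted(vms.items(), key=lambda kv: kv[1], reverse=True):
        candidates = [
            h for h in hosts
            if vm not in group
            or all(group.get(p) != group[vm] for p in placement[h])
        ]
        target = min(candidates, key=lambda h: load[h])
        placement[target].append(vm)
        load[target] += demand
    return placement

# Hypothetical cluster: CPU demand per VM, three hosts, one anti-affinity rule.
vms = {"db01": 70, "db02": 65, "web01": 20, "web02": 20, "app01": 40}
hosts = ["esx01", "esx02", "esx03"]
anti_affinity = {"db-tier": {"db01", "db02"}}   # keep the databases apart

for host, placed in balance(vms, hosts, anti_affinity).items():
    print(host, placed)
```

Even this toy version shows the two effects the paragraph describes: heavy consumers spread out first, and the rule guarantees the two database VMs never share a host regardless of load.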
Virtual Machine Sprawl
Virtual machine sprawl causes problems in virtual environments, including licensing issues, space constraints, wasted resources, and outages for legitimate VMs. Sprawl occurs because provisioning a new VM requires little more than a VM template and a few mouse clicks. Physical machine sprawl had built-in controls in the form of purchase approvals and hardware costs; virtual machines often do not. Underutilized physical systems led IT professionals to embrace virtualization, but those same professionals contribute to the wasteful practice of overprovisioning VMs, creating VMs for every purpose, and never removing unused VMs from disk and from inventory.
VMware proposes a reduce, reuse and recycle policy to prevent sprawl. First, reduce the number of unauthorized and over-provisioned VMs with governance similar to what managers used when provisioning physical machines: put VM requests through a request-and-approval process so that every VM is vetted for rightsizing, license tracking and lifecycle management.
Second, rather than putting VMs through a normal, and lengthy, decommissioning process, administrators can redeploy VMs for new workloads. This is the reuse part of the plan. Often those short-term VMs have sufficient resources assigned to serve either as another short-term system or to continue life as a full production workload. The VM already has CPU, memory, disk, network and licensing assigned, so it makes sense to simply reuse an existing VM rather than repeat the effort.
Finally, recycling inactive and abandoned virtual machines helps enterprises recover wasted capacity for test or production workloads. Regaining this capacity is a manual process: recycling requires administrators to audit currently used systems, reclaim abandoned resources, and manually remove orphaned disks. Administrators must spend a great deal of time determining the value of abandoned VMs, and it can prove a better return on time and resources to simply remove the VMs, return the licenses to the license pool, and reallocate the space to live workloads. (For more on VM implementation, see 5 Best Practices for Server Virtualization.)
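The audit itself is easy to script once utilization data is exported. Below is a minimal Python sketch that flags recycling candidates from a hypothetical inventory; the idle threshold, the 180-day staleness window and the VM records are made-up illustrations of the kind of policy an enterprise might set.

```python
from datetime import datetime, timedelta

# Hypothetical inventory export: last power-on date and 30-day average CPU use.
inventory = [
    {"name": "test-old",  "last_power_on": "2015-01-10", "avg_cpu_pct": 0.0},
    {"name": "web01",     "last_power_on": "2016-03-01", "avg_cpu_pct": 35.2},
    {"name": "proj-demo", "last_power_on": "2015-06-20", "avg_cpu_pct": 0.4},
]

IDLE_CPU_PCT = 1.0                     # below this, treat the VM as idle
STALE_AFTER = timedelta(days=180)      # powered on too long ago to trust
audit_date = datetime(2016, 4, 1)      # fixed date so the run is reproducible

for vm in inventory:
    age = audit_date - datetime.strptime(vm["last_power_on"], "%Y-%m-%d")
    if vm["avg_cpu_pct"] < IDLE_CPU_PCT or age > STALE_AFTER:
        print(f"recycling candidate: {vm['name']} "
              f"(last power-on {age.days} days ago, {vm['avg_cpu_pct']}% avg CPU)")
```

A report like this does not replace the judgment call on each VM’s value, but it shrinks the manual review to the handful of machines actually worth a decision.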
Conclusion
Embracing virtualization’s many benefits also means that enterprises have to embrace virtualization’s shortcomings. Often thought of as a panacea for wasted capacity, virtualization lends itself to more waste unless properly controlled with governance, audits and policies. Balancing network bandwidth, compute capacity and storage performance requires vigilance and adherence to the best practices recommended by hypervisor manufacturers. Although virtualization’s goals are to decrease hardware spending, lower power costs, use capacity more efficiently, and better leverage computing resources, problems still exist. While virtualization certainly provides solutions to those issues, it hasn’t removed the underlying problem with hardware itself: managing capacity.