PowerCLI script to check/set nested hypervisor support in VMware virtual machines

Our recent post entitled Windows bugchecks on VMware ESXi with Xeon E5-2670 CPUs has been getting a lot of attention lately. I wanted to post a few scripts I've created to help gather information about our guests and work around the problem.

Long story short, if your guest has nested hypervisor support enabled, then trying to force the use of software MMU virtualization will not work; the host will automatically use hardware MMU virtualization. If you're SSH'd into the host on which the guest is running, you can run the following command from the guest's datastore directory to determine what MMU settings the guest is currently using:


/vmfs/volumes/datastore1/vm1 # grep "HV Settings" vmware.log
2014-03-31T17:34:58.409Z| vmx| I120: HV Settings: virtual exec = 'hardware'; virtual mmu = 'software'

I've also written a pair of quick PowerCLI scripts: one checks the NestedHVEnabled property of a PowerCLI VM object, and the other sets the property on powered-off machines. Both scripts are attached to this post as text files. Download them, rename them to .ps1 files, change YOUR_VCENTER_SERVER to the name or IP address of your vCenter Server, and you should be able to run them in your PowerCLI environment.

In order to set the NestedHVEnabled parameter, the guest MUST be powered off.
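If you just want the gist without downloading the attachments, here is a minimal sketch of the same idea. The vCenter and VM names are placeholders, and it uses the vSphere API (ExtensionData/ReconfigVM) to flip the flag, which may not be exactly how the attached scripts do it:


# Sketch only - check nested HV support across all VMs
Connect-VIServer -Server YOUR_VCENTER_SERVER

Get-VM | Select-Object Name, PowerState,
    @{Name = "NestedHVEnabled"; Expression = { $_.ExtensionData.Config.NestedHVEnabled }} |
    Sort-Object Name | Format-Table -AutoSize

# Sketch only - enable nested HV on a single, powered-off VM ("vm1" is a placeholder)
$vm = Get-VM -Name "vm1"
if ($vm.PowerState -eq "PoweredOff") {
    $spec = New-Object VMware.Vim.VirtualMachineConfigSpec
    $spec.NestedHVEnabled = $true
    $vm.ExtensionData.ReconfigVM($spec)
}
else {
    Write-Warning "$($vm.Name) must be powered off before NestedHVEnabled can be changed."
}

Disconnect-VIServer -Confirm:$false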

These scripts have only been tested with PowerCLI version 5.5 Release 1 build 1295336.

I hope these scripts help some folks out there.

Enjoy,
Andy aka Flux.

VMotion - A general system error occurred

We recently updated some of our ESXi hardware and software, and when we did we started having issues VMotion'ing guests from newer hardware to older hardware. If a guest was powered on on our older hardware it could freely be VMotion'd from one host to another. If the same guest was powered on on newer host hardware, it would only VMotion to hardware of the same type, not to the older hardware. This was despite the fact that Enhanced VMotion Compatibility (EVC) was enabled on the cluster and all hosts validated for the selected mode.

The VMotion would always fail at 65-67% with the following error:


A general system error occurred. The source detected that the destination failed to resume. The VM failed to resume on the destination during early power on.

We had an open case with VMware for a few months, and yesterday they found a work-around for the problem. It appears that, for some reason, the older host didn't think it had enough VM overhead memory reservation to power on the guest before the VMotion completed, so the transfer failed.

Our very helpful VMware tech had us run the following command on the older host hardware:


esxcfg-advcfg -s -1 /Mem/VMOverheadGrowthLimit

This command sets the VMOverheadGrowthLimit advanced configuration setting to its maximum value (4294967295). Once this setting is in place the guest will happily migrate from newer host hardware to older host hardware.

After running that command our support tech asked us to run


auto-backup.sh

which is supposed to save the setting across reboots. We haven't rebooted the hosts yet, so we don't know whether it actually persists the value.
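If you'd rather make the same change from PowerCLI instead of the ESXi shell, something like this should work. It's only a sketch: the host name is a placeholder, and the Mem.VMOverheadGrowthLimit setting name is my assumption based on the /Mem/VMOverheadGrowthLimit path that esxcfg-advcfg uses above.


# Sketch only - set the same advanced option on an older host via vCenter
Connect-VIServer -Server YOUR_VCENTER_SERVER

$vmhost = Get-VMHost -Name "olderhost.example.com"
Get-AdvancedSetting -Entity $vmhost -Name "Mem.VMOverheadGrowthLimit" |
    Set-AdvancedSetting -Value 4294967295 -Confirm:$false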

Our tech said that VMware would be writing a KB article on this issue but I decided to write it up here until they do.

Hope this helps.

Cheers,
Flux.

Windows Bugchecks on VMware ESXi with Xeon E5-2670 CPUs

UPDATE: Vendors have released BIOS updates to resolve this issue

Dell and HP have released BIOS updates that seem to resolve this issue. We have not validated that the new Dell R720 BIOS 2.2.3 actually resolves it, but many in the comments below, and other articles I've read, say it appears to fix the BSOD problem.

The Dell BIOS can be obtained here: http://flux.dj/1mDoCqq

The HP firmware can be acquired here: http://flux.dj/1mDorLU

Thanks to everyone who emailed me and commented below. Let's hope it's fixed.

UPDATE: Workaround available

Since writing the original post we have received a workaround from VMware. Their current suggestion is to force VMs to use software MMU virtualization. This change can be made per VM or at the host level; VMware recommended that we change it at the host level.

To force all guests on a host to use software MMU virtualization add the following line to /etc/vmware/config on all your hosts:


monitor.virtual_mmu = "software"

A reboot of the host may be necessary for this setting to take effect. Once that's done, as long as the nested hypervisor setting isn't enabled on the guest, a VMotion to the host should switch the MMU to software.
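For the per-VM variant mentioned above, a PowerCLI sketch along these lines should add the equivalent key to a single guest's .vmx (the VM name is a placeholder, the key name mirrors the host-level one, and the guest needs a full power cycle to pick it up):


# Sketch only - force software MMU on one guest by writing the key into its .vmx
$vm = Get-VM -Name "vm1"
New-AdvancedSetting -Entity $vm -Name "monitor.virtual_mmu" -Value "software" -Confirm:$false -Force:$true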

To check what MMU setting is in effect, SSH to the host on which the guest is running, change to the datastore and directory for the guest, and run the following command:


/vmfs/volumes/datastoreGUID/vm1 # grep "HV Settings" vmware.log
2014-03-31T17:34:58.409Z| vmx| I120: HV Settings: virtual exec = 'hardware'; virtual mmu = 'software'

virtual mmu = 'software' will let you know that you are running in software MMU virtualization.

If the nested hypervisor setting is enabled on a particular guest, then the host-level setting will not work and neither will forcing it at the guest VM level. The nested hypervisor setting is enabled/disabled via the vSphere Web Client as a check box under the CPU section of the guest.

As general information, changing to software MMU can change baseline performance on certain workloads; see this VMware whitepaper: http://www.vmware.com/pdf/Perf_ESX_Intel-EPT-eval.pdf

I'll post any more updates as we receive them.

**** ORIGINAL ARTICLE BELOW ****

We recently upgraded most of the physical host hardware in our production VMware ESXi cluster. Since that time we've been having random bugchecks, or Blue Screens of Death (BSODs), in some of our guests. Usually these are 0x000000FC (ATTEMPTED_EXECUTE_OF_NOEXECUTE_MEMORY) or 0x0000000A (IRQL_NOT_LESS_OR_EQUAL) bugcheck codes. Everything we've found in our research suggests that these bugchecks are the result of hardware (bad memory, etc.) or driver problems.

All of the guests that have bugchecked were running Windows Server 2008 R2 with all current Microsoft and software updates as of December 2013. All of the hosts are running ESXi 5.1 Build 1312873 with all VMware updates as of December 2013 as well.

We opened cases with both Microsoft and VMware on these issues. Obviously there was some finger pointing, each saying "it's the other guy's problem." This morning (8 Jan 2014) we received an email from the VMware support tech indicating that these issues may be the result of an Intel Xeon E5 CPU bug. They reference these errata in the following Intel document:


BT39. An Unexpected Page Fault or EPT Violation May Occur After Another
Logical Processor Creates a Valid Translation for a Page


BT78. An Unexpected Page Fault or EPT Violation May Occur After Another
Logical Processor Creates a Valid Translation for a Page

http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/xeon-e5-family-spec-update.pdf

It may be a combination of the hardware we're using and the CPU bug. For reference, our hosts have the following specs:


Dell R720
BIOS 2.0.19
(2) Intel Xeon E5-2670 10 core CPUs
384GB RAM
(2) Onboard Intel 1Gb I350 NICs
(2) Onboard Intel 10Gb 82599 NICs

We don't have a resolution yet, but I will update this article when we find the root cause and a solution.

Until then I would advise caution when using E5-2670 CPUs in your virtual infrastructure.
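If you want a quick inventory of which of your hosts are running these CPUs, a short PowerCLI sketch like this should do it (ProcessorType is the CPU model string PowerCLI reports for each host):


# Sketch only - list hosts whose CPU model string mentions E5-2670
Get-VMHost | Where-Object { $_.ProcessorType -match "E5-2670" } |
    Select-Object Name, ProcessorType, Version, Build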

Cheers,
Flux.

Received missed errors with esxcli network command and Intel 10Gb NICs

We recently started upgrading the hosts in our production vSphere cluster. We replaced some hosts with new ones, and we upgraded some existing hosts with more RAM. All servers in the cluster will eventually have at least two Intel 10Gb network adapters, which will carry both VM traffic and VMotion traffic.

The addition of the new hosts went fine and the new 10Gb uplinks were speedy. When we started adding the new 10Gb NICs to the older hosts everything seemed fine, but when I checked the NIC stats via esxcli on those hosts, the Receive missed errors counter had started to increment. I searched all over Google and VMware's sites and forums and didn't find anything on this counter or its meaning.

For reference, the new hosts have onboard Intel 82599EB 10Gb SFI/SFP+ NICs, and the add-in cards for the older hosts are Intel 10G 2P X520 adapters.

The only things that changed on the older hosts were the added memory and the new NICs. I checked VMware's site for ESXi 5.1 and there was a newer driver VIB for the Intel 10Gb NICs (they use the ixgbe vmkernel module), so I decided to give that a try.

The version of ixgbe that ships in ESXi 5.1U1 with all patches applied via vSphere Update Manager, as shown on our other hosts, is net-ixgbe 3.7.13.6iov-10vmw.510.1.20.1312873. You can see what VIBs are installed on a host by enabling SSH access or the ESXi shell, logging in, and running:


esxcli software vib list
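If you'd rather not enable SSH, the same check can be done from PowerCLI via Get-EsxCli; a rough sketch (the host name is a placeholder):


# Sketch only - list installed VIBs and filter for the ixgbe driver
$esxcli = Get-EsxCli -VMHost (Get-VMHost -Name "esxhost01.example.com")
$esxcli.software.vib.list() | Where-Object { $_.Name -match "ixgbe" }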

I grabbed the VMware ESXi 5.0/5.1 Driver CD for Intel X540 and 82599 10 Gigabit Ethernet Controllers at: https://my.vmware.com/web/vmware/details?downloadGroup=DT-ESXI5X-INTEL-IXGBE-3187&productId=285

I placed the host in maintenance mode, unzipped the offline bundle onto one of our web servers, and manually updated one of the hosts with the following command line:


esxcli software vib update -v http://vibhostingserver/path/to/expandedoffline/

NOTE: This update requires a reboot, so make sure you migrate all of your guests to another host and place the host in maintenance mode first.

Once rebooted, the host came up fine, and after migrating machines to/from it several times we saw no errors. After the update your hosts should show the following version:


net-ixgbe 3.7.13.6iov-10vmw.510.1.20.1312873

I've also added this driver package to our VUM repository so any new hosts get the new driver. Check out this fine post for information on patching hosts via esxcli and adding patches to vSphere Update Manager: http://blog.mwpreston.net/2012/01/16/installing-offline-bundles-in-esxi-5/

I'm going to try to contact VMware to get a better idea of what the Receive missed errors counter actually tells us, and I'll update this post at a later date.

Hope this helps.

Cheers,
Flux.

DHCP on System Center Virtual Machine Manager 2012 SP1 Logical Switches

I have the privilege of actually receiving college credit while working at my current employer. I have to work extra, but my employer benefits by getting a project completed that otherwise would not have been done and I get experience and college credit.

The project that I've chosen this semester is to provide a self-service portal for our application developers so that they may spin up their own virtual machine instances for testing. Due to our current infrastructure I've chosen to base the self-service portal on Microsoft's System Center Virtual Machine Manager 2012 SP1 (SCVMM) and System Center App Controller (SCAC).

I had fits trying to get a new VM to receive a DHCP address from our current network DHCP servers. I found a TechNet article that told me why. Read on.

If you have everything configured in SCVMM and your VMs are not getting DHCP addresses from your normal DHCP servers, this line in a MS TechNet article will tell you why:


If you want to use Dynamic Host Configuration Protocol (DHCP) that is already available on the network to assign IP addresses to virtual devices in a specified VLAN, create network sites with only VLANs assigned to them.

I found this sentence in this TechNet article: http://technet.microsoft.com/en-us/library/jj721568.aspx

Anyway, if you already have your Logical Networks, Network Sites, Uplink Port Profiles, Logical Switches, and VM Networks set up, you'll have to remove your VM Network and disassociate the Logical Network from the Uplink Port Profile. Once that is complete, edit your Network Site and remove the Subnet definition. Keep the VLAN ID if you have multiple physical VLANs already defined and want DHCP on a specific VLAN ID.
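The same change can also be scripted with the VMM PowerShell cmdlets. This is only a hedged, untested sketch: the network site name and VLAN ID are placeholders, and you should verify the parameter names against your VMM version:


# Sketch only - rewrite a network site so it carries just a VLAN (no subnet),
# letting guests on that VLAN pick up addresses from the existing DHCP servers
$site = Get-SCLogicalNetworkDefinition -Name "MyNetworkSite"
$vlanOnly = New-SCSubnetVLan -VLanID 100
Set-SCLogicalNetworkDefinition -LogicalNetworkDefinition $site -SubnetVLan $vlanOnly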

Needless to say, SCVMM is a complicated beast, as is the rest of the System Center suite. Coming from a VMware background, it seems to me that the network configuration is overly complex. Maybe I'll change my mind after we finish this project, but for now I'm having fits trying to set up both DHCP and static IP pools in SCVMM.

For those interested, I am enrolled in an Experiential Learning program at the University of Illinois at Springfield as part of my Computer Science undergraduate degree. For more details about the Applied Study Term at UIS check out http://www.uis.edu/appliedstudy/students/infoforstudents/. Also check out UIS for great online degree programs: http://www.uis.edu

Have fun with SCVMM,
Flux.

SCOM 2012 Agent for CentOS/RHEL/Fedora and SSL errors

We recently started testing the SCOM 2012 Management Agent for Linux on a test server running CentOS 6.4 x86_64. Since we had only installed a minimal set of packages via an automated kickstart, the hostname of the machine wasn't set to the DNS-resolved name.

After the SCOM agent was installed, and even after we changed the hostname and set up proper DNS, the SCOM console was still barking about the SSL certificate common name not matching the hostname of the server.

Cannot drag and drop in Windows 7

Every once in a while I'll try to drag and drop something, like an email in my Outlook Inbox to a sub-folder, or files in my GUI FTP program, and it just won't work.

My fix was just to hit the Escape key on my keyboard a few times. Poof! I can now drag and drop again.

If this doesn't work for you, make sure that one of the modifier keys like Alt or Ctrl isn't stuck.

Drag and drop away.
