Performance monitoring problem
My new R710 running ESXi 4.0 has one problem: when the CPU load comes close to maximum, all performance monitoring for the host fails. Only a few random monitoring requests come through, making the resource graphs in vCenter look pretty ugly.
As you can see from the attached image, a lot of data is missing. This only happens when the CPU usage is close to max, and only on my R710. My three other hosts, the two PE200s and the PE2850, do not exhibit these problems even though their CPU usage is also very close to max.
I have tried to give VMware ESXi more CPU by giving it more shares. The virtual machine responsible for most of the CPU usage is placed in a resource pool with a very low number of shares (500). The most important machines are placed in a different resource pool with 8000 shares. I have given the VMware ESXi processes the same number of shares as the important machines: 8000.
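To illustrate what those share values mean, here is a minimal sketch of how proportional shares split CPU time when everything wants CPU at once. It assumes the two resource pools are siblings and both fully loaded, which is a simplification; the pool names and the `contended_split` helper are made up for the example, and the host's own processes are left out because I do not know exactly where they sit in the hierarchy.

```python
def contended_split(shares):
    """Return each pool's fraction of CPU under full contention.

    Assumption: all pools are siblings and all of them demand more CPU
    than is available, so CPU is divided in proportion to shares.
    """
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


# Share values from the post; the pool names are hypothetical.
pools = {"low_priority": 500, "important": 8000}
split = contended_split(pools)

for name, fraction in split.items():
    print(f"{name}: {fraction:.1%} of contended CPU")
```

Shares only matter under contention. As long as the important machines are mostly idle, the low-priority pool is free to use almost the whole host.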
My virtual machines have no problems operating; they respond quickly. Reconfiguring the host also goes without a hitch. I do not think that the problem is related to a lack of CPU resources, since everything else works smoothly. The problem is not related to vCenter either, since I have tried two different vCenter installations (one running as a VM on the host itself, one running as a VM on a different host).
Any ideas out there on how to solve this problem? I would really like to see how my virtual machines are utilizing resources…
Update 2009-06-20 03:31 CEST: After doing some thinking, I suspect that the problem is related to the Intel Turbo Boost technology added to the Intel Xeon 5500 series. It allows some of the cores to operate at a higher clock frequency when the machine is under heavy load. Once the clock frequency has been increased on some of the cores, the reported usage is higher than the maximum expected value calculated during boot. This might confuse the performance monitor and make it reject the reported values. This makes sense since my R200s, which use Intel X3320 Quad-Core processors without Turbo Boost, do not have this particular problem. It also explains why the summary page for that resource pool shows that 18.35GHz is consumed even though I only have 18.08GHz available (8 x 2.26GHz). The 18.35GHz comes in addition to the resources used in the other resource pool, which is usually around 200MHz.
As the server is currently in production, I have no way of verifying my idea.
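The arithmetic behind this suspicion can be sketched as follows. The clock and usage figures are the ones quoted above; the `accept_sample` rule is purely my guess at what a performance collector might do with an out-of-range value, not documented VMware behaviour.

```python
# Figures from the post.
CORES = 8
NOMINAL_GHZ_PER_CORE = 2.26
BUSY_POOL_GHZ = 18.35    # shown on the resource pool summary page
OTHER_POOL_GHZ = 0.2     # "usually around 200MHz"


def nominal_capacity_ghz(cores, ghz_per_core):
    """Capacity as it would be computed at boot, without Turbo Boost."""
    return cores * ghz_per_core


def usage_percent(used_ghz, capacity_ghz):
    return 100.0 * used_ghz / capacity_ghz


def accept_sample(percent):
    """Hypothetical sanity check: discard samples outside 0..100 %."""
    return 0.0 <= percent <= 100.0


capacity = nominal_capacity_ghz(CORES, NOMINAL_GHZ_PER_CORE)
total_used = BUSY_POOL_GHZ + OTHER_POOL_GHZ
percent = usage_percent(total_used, capacity)

print(f"capacity: {capacity:.2f} GHz, used: {total_used:.2f} GHz")
print(f"usage: {percent:.1f} %, accepted: {accept_sample(percent)}")
```

With these numbers the host is reported as using roughly 102.6 % of its nominal capacity. If the collector throws away anything above 100 %, that would explain why samples go missing only when the load is close to max.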
Update 2009-06-20 04:04 CEST: Reserving 1GHz for the host does not solve the problem.
Posted: June 20th, 2009 under VMware by Frode.