Posted by: markachtemichuk | September 21, 2010

Transparent Page Sharing & Large Memory Pages

Do you feel like your vSphere 4.x servers are consuming more memory than they used to?

Does vCenter indicate, or alarm, that host memory utilization is consistently high?

Do you have a funny feeling that Transparent Page Sharing (TPS) isn’t reclaiming memory?

Turns out there is some confusion about how and when TPS works with Large Memory Pages.  Ever since ESX 3.5, if your CPU has a hardware MMU (i.e., AMD RVI or Intel EPT), large memory pages (2MB each) are used instead of 4KB pages for the performance benefit.  But since TPS only works with 4KB pages, TPS does not come into play until the host is under memory pressure and begins breaking 2MB pages into 4KB pages for TPS reclamation.  If you’re not looking at the right combination of counters, you would wrongly assume the host is out of memory and stop adding workloads to it.
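To make the mechanics concrete, here is a minimal Python sketch of the idea, not of the VMkernel's actual implementation.  It breaks a 2MB page into 4KB pages and counts how many physical pages remain once identical pages are shared.  The function names are hypothetical, and a hash alone stands in for the comparison of page contents.

```python
import hashlib

SMALL_PAGE = 4 * 1024            # 4KB page
LARGE_PAGE = 2 * 1024 * 1024     # 2MB large page


def split_large_page(large_page):
    """Break one 2MB page into its constituent 4KB pages."""
    if len(large_page) != LARGE_PAGE:
        raise ValueError("expected exactly one 2MB page")
    return [large_page[i:i + SMALL_PAGE]
            for i in range(0, LARGE_PAGE, SMALL_PAGE)]


def pages_after_sharing(small_pages):
    """Count the physical 4KB pages needed once identical pages are shared.

    Illustration only: pages are treated as identical when their hashes match.
    """
    return len({hashlib.sha1(page).digest() for page in small_pages})


if __name__ == "__main__":
    # Two hypothetical guests, each backed by one mostly-zero 2MB page.
    guest_a = bytes(LARGE_PAGE)
    guest_b = bytes(LARGE_PAGE - SMALL_PAGE) + b"\x01" * SMALL_PAGE
    small = split_large_page(guest_a) + split_large_page(guest_b)
    print(len(small), "small pages before sharing")
    print(pages_after_sharing(small), "small pages after sharing")
```

While the two guests are backed by intact 2MB pages nothing is shared, because the large pages differ as a whole.  Only after the split do the identical 4KB pages become candidates for sharing.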

Background: Transparent Page Sharing (TPS) in hardware MMU systems

The reality is that you can continue to add workloads, overcommitting the physical memory, until TPS, Ballooning and Memory Compression can no longer manage the memory pressure and you start to swap to disk.  Remember, swapping is very bad.  In the past, many administrators have used only memory utilization to predict when their hosts are out of memory and performance may suffer.  Since memory is consumed first and shared second, this counter alone is not a good measure of memory capacity.

Some other counters to consider in combination:

  • Active – Amount of memory that is actively used, as estimated by VMkernel based on recently touched memory pages.
  • Swapped – Current amount of guest physical memory swapped out to the virtual machine’s swap file by the VMkernel.
  • Swap In Rate – Rate at which memory is swapped from disk back into memory – if you see this consistently, it’s too late.
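One way to read these counters together is sketched below in Python.  The class, the field names and the 90% / 60% thresholds are my own assumptions for illustration; they are not vCenter or esxtop names, and you should tune the thresholds to your environment.

```python
from dataclasses import dataclass


@dataclass
class HostMemory:
    """Hypothetical snapshot of host memory counters (names are illustrative)."""
    total_mb: float
    consumed_mb: float
    active_mb: float
    swapped_mb: float
    swap_in_kbps: float


def assess(host, high=0.90, busy=0.60):
    """Classify a host by looking at the counters in combination.

    `high` and `busy` are assumed thresholds, not VMware recommendations.
    """
    if host.swap_in_kbps > 0:
        return "swapping-in"               # already too late
    if host.swapped_mb > 0:
        return "swapped-earlier"           # pressure occurred in the past
    consumed = host.consumed_mb / host.total_mb
    active = host.active_mb / host.total_mb
    if consumed >= high and active < busy:
        return "consumed-high-active-low"  # likely large pages not yet shared
    if active >= busy:
        return "actively-busy"
    return "ok"


if __name__ == "__main__":
    print(assess(HostMemory(65536, 61000, 12000, 0, 0)))
```

A host that lands in the "consumed-high-active-low" bucket is the case described above: utilization looks alarming, but little of that memory is actually being touched.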

Also remember that use of reservations will have an effect on these counters.  Example:  If you have a host with a large percentage of memory reserved by guests and those guests are idle, the active memory counter alone would suggest you have more capacity left, but reserved memory is not subject to memory reclamation techniques like TPS.  So you might add workloads only to find swapping occurs.
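The reservation effect can be shown with some rough arithmetic.  The Python sketch below assumes, as a simplification, that the memory a guest cannot give back is the larger of its active memory and its reservation.  The function name and the sample figures are hypothetical.

```python
def headroom_mb(host_mb, vms):
    """Estimate remaining capacity on a host.

    `vms` is an iterable of (active_mb, reservation_mb) pairs.  Simplifying
    assumption: each guest pins max(active, reservation) of host memory.
    """
    pinned = sum(max(active, reserved) for active, reserved in vms)
    return host_mb - pinned


if __name__ == "__main__":
    host = 65536                        # 64GB host
    idle_vms = [(1024, 12288)] * 4      # four idle guests, 12GB reserved each
    active_only = host - sum(active for active, _ in idle_vms)
    print("headroom judged by active memory only:", active_only, "MB")
    print("headroom once reservations are counted:", headroom_mb(host, idle_vms), "MB")
```

With these sample numbers the active counter suggests about 60GB is free, while only 16GB is available once the reservations are honored.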

Capacity Planning and Management is more complex than reviewing a couple of host counters because things like clusters and reservations must be considered.  If you’re interested in a purpose-built tool I’d suggest evaluating VMware’s CapacityIQ.  Proper capacity planning is important because otherwise performance suffers.

Posted by: markachtemichuk | September 14, 2010

VMworld Performance Lab & Sessions

Performance is King!

At the VMworld labs, I’m proud to say that the “vSphere Performance and Tuning” lab was the second most popular lab with 1229 sittings.  In my mind this means that performance and its related features are front of mind for customers.

Side note (since I’ve been asked many times): Lab manuals and session presentations will be available to attendees via the vmworld.com portal after VMworld Europe (Oct 12-14) has taken place.

While I was deeply involved fielding client questions and supporting the VMworld labs (yes I was one of those guys in the bright orange shirts), I was able to sneak away to a couple of sessions including “Performance Best Practices for vSphere” with Scott Drummonds and Kaushik Banerjee.  This session was PACKED! and very interactive.  Again, highlighting the fact that customers are craving this information.

So what if you couldn’t attend the conference and have access to all this cool information?

If you have a performance question, comment or concern, be sure to reach out as I’m confident there’s an answer for you!

P.S.  Kudos to all the VMworld Lab Staff

Posted by: markachtemichuk | September 14, 2010

VCDX #50

I’m very excited and proud to announce I’ve received my VCDX certification (lucky #50) !

I’d like to congratulate all my peers who were also recently certified, and give special thanks to everyone who gave their guidance and challenged me during my preparation and defense.  To those who tried but were unsuccessful this time: don’t give up, stand tall for making it this far and try again.  It’s worth it and we need you!

Why is VCDX certification important?

While VCDX is a VMware certification, I want to highlight that in my mind it was more about general enterprise infrastructure design, business processes, clear communication, industry methodologies with an eye towards successful virtualization adoption than it was vendor technical commands and low-level configuration (that said, you need to know that too).  This is not a technical certification.

As the industry continues to mature its use of virtualization and push toward the hybrid cloud, regardless of vendor, businesses need people who can solve complex problems and think beyond technical details.  People are needed who can bridge technology and business.  I feel VCDX could be considered an industry certification as it recognizes those people and skills and will hopefully encourage others to grow a more rounded skill set.  Though VMware designed the certification, the skills assessed by the program reach far beyond those of any one technology or vendor.

I hope we continue to grow in numbers and I urge everyone that’s up for a rewarding challenge to step up and apply.

Good luck to all future candidates!

Posted by: markachtemichuk | July 29, 2010

8 Way Virtual Machines

I get a lot of performance related questions around 4-way and 8-way virtual machine configurations that I wanted to blog about.  Typically I get involved when performance isn’t meeting expectations.  I find that during the course of my investigations, I have uncovered some common themes I wanted to make you aware of.

1) Consider your application.

Before you configure a large VM, an important question to ask is: “Can my application use extra processors?”  Many applications, unless specifically designed to, won’t scale past 2-way or 4-way configurations.  So even though you can give them 8-way configuration, there is no performance advantage – in fact performance might even drop due to the overhead of managing the extra threads and their migrations.  As an example, this will manifest itself as a 4-way VM performing the same as an 8-way VM.  Adding vCPUs does not mean an application will automatically scale.  Unfortunately, to decide this sometimes requires load testing the virtual machine with the real workload.  This is always a good investment.

2) Consider the host hardware platform.

The larger the platform, by count of sockets and cores, the more scheduling opportunities the hypervisor has for large SMP configurations.  My rule of thumb for hardware platforms is this:

  • If you want to use 4-way VMs, you should have at least a dual socket quad-core host.
  • If you want to use 8-way VMs, you should have at least a quad socket, quad-core host with at least Nehalem/Shanghai processors.
  • Do not oversubscribe vCPUs to the actual physical core count for 8-way configurations (i.e.: start with only two 8-way VMs on a quad socket, quad-core host and adapt as utilization permits)

Note:  Hyper-threading doesn’t count – only sockets and cores.
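As a rough illustration, the rule of thumb above can be written as a small Python check.  This is only a sketch: the function name is hypothetical, the socket/core guidance is reduced to total core counts (8 cores for 4-way, 16 cores for 8-way), and the processor generation is not checked.

```python
def check_smp_fit(sockets, cores_per_socket, vm_vcpus):
    """Return a list of warnings for a planned set of VMs on one host.

    `vm_vcpus` is a list with the vCPU count of each VM.  Hyper-threading is
    deliberately ignored: only sockets and cores are counted.
    """
    cores = sockets * cores_per_socket
    warnings = []
    largest = max(vm_vcpus, default=0)
    if largest >= 8 and cores < 16:
        warnings.append("8-way VMs: use at least a quad socket, quad-core host")
    elif largest >= 4 and cores < 8:
        warnings.append("4-way VMs: use at least a dual socket, quad-core host")
    eight_way_vcpus = sum(v for v in vm_vcpus if v >= 8)
    if eight_way_vcpus > cores:
        warnings.append("8-way vCPUs exceed the physical core count")
    return warnings


if __name__ == "__main__":
    print(check_smp_fit(4, 4, [8, 8]))      # the suggested starting point
    print(check_smp_fit(2, 4, [8]))         # host too small for 8-way
    print(check_smp_fit(4, 4, [8, 8, 8]))   # oversubscribed 8-way VMs
```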

3) But doesn’t Hyper-threading double my capacity?

An important distinction I’d like to make about Hyper-threading is that it has everything to do with logical threads, but does not create twice the compute capacity on a host.  A common misconception is that when Hyper-threading is enabled, you get twice as many logical cores and can provision twice the number of vCPUs.  This is not the case and leads to many of the performance issues I see.  Hyper-threading enables two threads to be managed while executing on a single core.  That is not the same as having two physically separate cores.  Each core still has a finite execution capacity, but Hyper-threading, as a form of “parallelism,” increases efficiency by executing instructions from two threads on the assumption that each thread has minute periods of downtime.  See the Hyper-threading wiki for more details.

Hyper-threading does create more scheduling opportunities for the hypervisor.  In addition, the vSphere hypervisor is aware of the difference between threads, cores and sockets and schedules them appropriately.

Incidentally, I always recommend enabling Hyper-threading when using vSphere and the newer hardware platforms.

4) Don’t forget about 3-way, 4-way or 6-way configurations.

We are all still in the habit of over-provisioning.  I see many people creating 8-way VMs because they can, not because they have assessed that an application needs that level of compute.  Remember that 3-way, 4-way and even 6-way configurations are valid options and help you scale as required.  That’s the beauty of modern operating systems and changing vCPU counts from a dialog box.

So, “Thank You!” to all those administrators who have fought the good fight around SMP over-provisioning.  Be assured though that you can use 2-way, 3-way, 4-way, 6-way and 8-way configurations with confidence.

Pick the scale you need and virtualize away.

Posted by: markachtemichuk | July 28, 2010

cpuid.coresPerSocket

Here’s a somewhat hidden gem inside of vSphere 4.1: cpuid.coresPerSocket

Detailed in this KB article.

This new feature allows you to configure an SMP virtual machine, with the appropriate VMware license of course, and present it to the guest OS not as a flat 4, 6 or 8 socket configuration, but as sockets with multiple cores.

Example:  By configuring an 8 vCPU virtual machine, then setting cpuid.coresPerSocket=4, the guest OS will see a 2 way quad core set of processors.  This unlocks more compute for those operating systems which could make use of extra cores on a physical platform but wasn’t an option on the virtual platforms because they were restricted to 2 sockets and therefore only 2 vCPUs.
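The arithmetic behind this example is simple enough to sketch in Python.  The helper name is hypothetical, and the requirement that the vCPU count divide evenly by cpuid.coresPerSocket reflects my understanding of the setting, so check the KB article before relying on it.

```python
def guest_topology(num_vcpus, cores_per_socket):
    """Return (sockets, cores_per_socket) as the guest OS would see them.

    Assumes the vCPU count must be an exact multiple of cores_per_socket.
    """
    if cores_per_socket < 1 or num_vcpus % cores_per_socket != 0:
        raise ValueError("vCPU count must be a multiple of coresPerSocket")
    return num_vcpus // cores_per_socket, cores_per_socket


if __name__ == "__main__":
    sockets, cores = guest_topology(8, 4)
    print(f"guest sees {sockets} sockets with {cores} cores each")
```

For an OS limited to 2 sockets, any setting that keeps the first number at 2 or below leaves all of the configured vCPUs usable.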

So is this a performance feature?  No, not really.  This feature has more to do with OS restrictions and being able to use 8 cores when an operating system will only recognize 2 sockets (e.g., XP or Windows 2003 Standard Edition).  That said, it does make more compute available to these virtual configurations.

This feature should be used responsibly and you should ensure you are in compliance with your operating system’s EULA.  It’s not meant to ‘trick’ anything to save license dollars.

Posted by: markachtemichuk | July 16, 2010

New vSphere 4.1 Feature: Scalable vMotion

We’ve all come to love vMotion.  Who here hasn’t used it and had that warm fuzzy feeling when the VM suddenly appears on another host and you want to yell out “Hey!  Check that out!”  Yes, there are some other ‘similar’ technologies available today but none as easy or reliable in my opinion.

So how can vMotion be made better?  Well there have always been some minor limitations:

  • Only two simultaneous vMotions were supported (though some people figured out how to make an unsupported change).
  • Some pathological workloads didn’t vMotion very well, or at all sometimes (large VMs with large memory configurations).
  • Despite throwing more bandwidth at the vMotion network, it never seemed to use it, especially 10GbE which is a common infrastructure today.

Enter Scalable vMotion

  • One can now perform up to 8 simultaneous vMotions and they’re fast.
  • 10GbE network testing shows utilization up to 8Gb/sec (increased 3x from 2.6Gb/sec).
  • Significantly reduced stun times (the time a VM is frozen while execution is started on the target host).

This was accomplished through:

  • Restructuring how vMotion used the network.
  • Optimized VM memory handling.
  • Optimized vMotion convergence logic.
  • A new feature “quick resume.”
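To put the throughput figures quoted above in perspective, here is a back-of-the-envelope Python calculation.  It is a deliberately naive sketch: it assumes the whole memory image is copied exactly once at a constant rate and ignores pages that are dirtied and re-sent during the migration.  The function name and the 64GB VM are hypothetical.

```python
def transfer_seconds(memory_gb, throughput_gbps):
    """Seconds to copy `memory_gb` gigabytes at `throughput_gbps` gigabits/sec."""
    return memory_gb * 8 / throughput_gbps


if __name__ == "__main__":
    vm_memory_gb = 64   # hypothetical large VM
    for label, gbps in (("before", 2.6), ("with Scalable vMotion", 8.0)):
        secs = transfer_seconds(vm_memory_gb, gbps)
        print(f"{label}: about {secs:.0f} seconds at {gbps} Gb/sec")
```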

I think you’ll be very happy with the performance of vMotion and this scalability only helps promote 100% virtualization in datacenters.  Next vMotion stop – the cloud!

Posted by: markachtemichuk | July 14, 2010

New vSphere 4.1 Feature: Memory Compression

Originally hitting the blogosphere via Scott Drummonds here, and presented as a future technology by Kit Colbert and Fei Guo at VMworld 2009, this cool new performance feature is for those over-committing memory and walking the thin line between paging – evil – and virtual happiness.

We should all know that when we over-commit memory on an ESX host, that is to say allocate more virtual memory than we have physical memory to back it, the risk is that the memory scheduler will eventually need to page some memory out to disk, more commonly known as swapping.  Memory access times for those swapped pages go from microseconds to milliseconds when retrieved from disk, and this is bad.

But doesn’t the balloon driver save us?

Yes, but only for a while.  The balloon driver is another mechanism to manage memory pressure before pages must be swapped to disk.  It is better for the guest OS to decide what should be swapped out because it has visibility into which memory pages are more frequently used.  It can then ‘intelligently’ swap something else that will have less of an impact on the guest.  This works because the balloon driver increases memory pressure inside the guest, and the memory the guest frees in response can then be used elsewhere.  But eventually there is nothing else to intelligently swap and the ESX host will stop using the balloon drivers and start to randomly swap memory pages to disk.  This ‘unintelligent’ process will cause performance issues.

So how does Memory Compression help?

This mechanism is one more tool that fits between Ballooning and Swapping.  Before memory pages are swapped to disk as a last resort, they are first compressed (think zipped) and stored in a special new memory cache, in an attempt to keep them off disk a little longer at the expense of some processor cycles.  Decompressing a page from the cache takes approximately 20 microseconds, compared to milliseconds if it has to be fetched from disk.  So in times of high memory pressure, this mechanism transparently begins compressing pages so as to maximize performance as long as possible.  Eventually though, if the memory over-commitment is too large or not transient, paging to disk will still occur.
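The decision being made for each page can be sketched in a few lines of Python.  This is a toy model rather than the ESX implementation: zlib stands in for whatever compressor the VMkernel uses, the cache is a plain dictionary, and the rule that a page is only kept when it shrinks to half its size or less is an assumption made for the example.

```python
import os
import zlib

PAGE_SIZE = 4096
COMPRESSED_LIMIT = PAGE_SIZE // 2   # assumed threshold for keeping a page in cache


def reclaim_page(page_id, page, cache):
    """Compress a page into `cache` if it shrinks enough, otherwise swap it.

    Returns "compressed" or "swapped" so the caller can see which path was taken.
    """
    compressed = zlib.compress(page)
    if len(compressed) <= COMPRESSED_LIMIT:
        cache[page_id] = compressed
        return "compressed"
    return "swapped"   # in a real system the page would be written to disk here


def read_page(page_id, cache):
    """Bring a compressed page back: a decompress instead of a disk read."""
    return zlib.decompress(cache[page_id])


if __name__ == "__main__":
    cache = {}
    print(reclaim_page(1, bytes(PAGE_SIZE), cache))        # highly compressible
    print(reclaim_page(2, os.urandom(PAGE_SIZE), cache))   # effectively incompressible
    print(read_page(1, cache) == bytes(PAGE_SIZE))
```

Pages that do not compress well gain nothing from the cache, which is one reason compression delays swapping rather than eliminating it.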

Memory Compression is a great new feature but doesn’t remove an administrator’s responsibility to properly size a host and its guests.  Think of it as a safety measure, not a license to blindly over-commit.

Posted by: markachtemichuk | July 13, 2010

New vSphere 4.1 Feature: Wide NUMA

So what is regular NUMA?

Frank Denneman, a well-respected VMware peer, posted an excellent blog article that clearly describes NUMA.  So I’m going to just give a shout out to him and point you his way.

How is Wide NUMA different?

When a virtual machine doesn’t fit into a NUMA node (e.g., an 8 vCPU VM on a dual socket, quad-core system), all the benefit of memory locality is lost as the scheduler takes the resources it needs across nodes and remote memory access occurs.  With Wide NUMA, the scheduler accommodates such a large virtual machine by splitting the workload into multiple NUMA clients, each of which is assigned to a node and then managed by the scheduler as a normal, non-spanning client.  This can improve the performance of certain memory-intensive workloads with low locality (testing shows up to 7% performance improvement).
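The splitting described here can be pictured with a small Python sketch.  It only shows the basic idea of carving a wide VM into node-sized NUMA clients.  The function name is hypothetical and the real scheduler's placement policy is more involved than this.

```python
def split_into_numa_clients(num_vcpus, cores_per_node, num_nodes):
    """Split a VM's vCPUs into NUMA clients that each fit inside one node.

    Returns a list with the vCPU count of each client.
    """
    if num_vcpus <= cores_per_node:
        return [num_vcpus]            # fits in one node: a regular NUMA client
    clients = []
    remaining = num_vcpus
    while remaining > 0:
        size = min(cores_per_node, remaining)
        clients.append(size)
        remaining -= size
    if len(clients) > num_nodes:
        raise ValueError("VM is wider than the whole host")
    return clients


if __name__ == "__main__":
    # 8 vCPU VM on a dual socket, quad-core host (two NUMA nodes of 4 cores)
    print(split_into_numa_clients(8, cores_per_node=4, num_nodes=2))
```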

Another benefit – you don’t need to do anything.  This function is provided automatically by vSphere 4.1 and is enabled by default.  This technology will further enhance the ability to virtualize much larger workloads with confidence.

Posted by: markachtemichuk | July 13, 2010

vSphere 4.1 – Announced and Released Today!

Today, July 13, 2010, VMware announced their latest release of the vSphere platform, v4.1.  In an even better surprise, it’s available for download!

Many new features are available as part of this ‘minor’ release (personally I’d suggest it could have been a major release) that I’m excited about including:

My Personal Favorite (even had some input):

Some very cool stats about the scale of this job:

  • 4,000 development weeks were spent to get to FC
  • 5,100 QA weeks were spent to get to FC
  • 35,419 bugs were triaged and 24,000 bugs were resolved
  • 80% of features committed in June 2009 were finished on time
  • 872 beta customers downloaded and tried it out
  • 2,012 servers, 2,277 storage arrays, and 2,170 IO devices are already on the HCL

Download Code Here

I’d like to congratulate the VMware team on their creativity, hard work and dedication in bringing about this virtual advancement.  Now that their hard work is complete, let’s begin ours by digging into some of these features …

Posted by: markachtemichuk | June 11, 2010

esxtop/resxtop NFS Datastore Counters – Now Available

Very excited to see that within yesterday’s release of vSphere 4 Update 2, a feature I’ve been waiting for has finally made it.  You can now see NFS datastore counters within esxtop/resxtop.  Counters include:

  • reads / s
  • writes / s
  • MBreads / s
  • MBwrtn / s
  • cmds / s
  • gavg / s – IO latency as seen from the guest perspective!

These counters will be very helpful in troubleshooting performance issues, which in the past were more difficult because when using NFS all you had to work with was network counters.  Screenshots coming soon.

Thanks VMware Team!
