Virtualization is (not) the problem – A real life debunking

By Jase | September 21, 2011

One thing is always true: as new technology comes along, some people are going to criticize it, ridicule it, or simply not believe it does what it says it does.

Many years ago, I was approached by some developers who were having issues with an application that had a SQL database backend. This is the story of how they questioned VMware, and what transpired.

The environment
The development environment consisted of several application servers and SQL servers that were specific to different builds of our in-house software.  Some of the servers were physical, some were virtual machines running on VMware GSX (now VMware Server), and some were running on VMware ESX 2.5.

Yes, this is an old story that I have told many times, but this is my first time posting it. I decided to post it because I have lately had some conversations questioning whether servers and applications should run in VMware. Yes, even in 2011.  Now back to the story.

The problem
The developers were working on some stored procedures and noticed that some of them were not executing properly.  After troubleshooting the problem, they decided to blame VMware.  The stored procedures worked properly on all of the physical SQL servers, but failed on the virtual SQL servers on VMware GSX/ESX.

When they stepped through the stored procedures manually, they worked on both the physical and the virtual servers.  When they ran them normally through the application, they failed on the virtual machines, but worked on the physical systems.

Result: The developers looked toward IT for a solution, because obviously VMware was the problem, or so it appeared.

The test
I proposed a challenge to determine whether VMware was in fact the problem.

I suggested the following troubleshooting steps:

  • I would give them an appropriate workstation for them to install Windows Server & SQL on.
  • They would run their tests against that physical machine for 2 weeks, to validate the configuration was correct.
  • After 2 weeks, I would work the weekend, perform a P2V of the physical system & present it to them as a VM.
  • They would then run their tests against the VM to validate the configuration still behaved properly as a VM.

The wager:

  • I told them that not only would they not have any issues, but the server would be faster.
    If I was right, they (6+ people) could buy me a nice lunch.
  • I also said that if they had any issues, I would buy them a nice lunch (6+ people) out of my pocket.

Initial setup
The following week, I gave them an appropriate workstation and the MSDN media for Windows Server 2003 and SQL 2000.  I asked them to install Windows and SQL, with all the approved patches (relative to our software).  The only things I had control of, with their agreement, were the IP address of the server, the appropriate membership in the Active Directory Domain, and the physical location of it.  Nothing else.  They had full control of the system.  No input/configuration of the system was going to be done by me or anyone else in IT.  They felt very comfortable with that.

When they were done, I moved the desktop to our staging area, and connected it where they could get to it.

Having completed the installation by the end of the week, they were set to run their tests the following Monday.

Week One
On Monday morning, I dropped by the developers’ desks and checked to see how everything was going.  “Good” was the response I got from all of them.

Each day, I dropped by, several times a day, checking on how things were going.  Again, “Good” was the only response I received.

Weekend Work
That Friday night we had a maintenance window.  I figured… Why not go ahead and P2V this box and get this over with. Remember from earlier that I was using VMware GSX/ESX? Yep, VMware Converter hadn’t come out yet, and it wasn’t as easy to P2V a system.  I had to use the older P2V Importer application that was a little quirky in comparison.

Regardless of the steps, after some effort, it was virtualized.

Week Two
I failed to share with the developers that I had virtualized the server over the weekend.

Just as I had the week before, I stopped by each of the developers’ desks, asking how the test was going.  Most of them said “Good.”  I would ask “Any issues?” “Nope, nope, and nope.” A few, however, said words to the effect of “This is faster this week.” The most engaged developer even used the term “flying” to describe the performance.  I simply responded with “We had a maintenance window, and now everything in the staging room is on a Gigabit switch.”  They took that as the reason it was running faster.  Never mind that, as a VM, it now had more disks behind it from a storage perspective.

In the middle of the day on Friday, I asked them to meet with me regarding virtualizing this server, so we could talk about the P2V process/etc.  We started to talk about the process, and they brought up the normal concerns about downtime, connectivity, and so on.  In responding to their question about downtime…

I responded “There won’t be any.”

They responded with “What? No downtime? How can that be?”

I said “I virtualized it last weekend.”

Every jaw in the room dropped.  “You did what?”

I said “No problems, issues, etc? Running faster?”

They said “Oh, the ESX box has Gigabit connections, and that’s why it is faster.”

I hated to break it to them, but I did: the workstation had already been connected to a Gigabit switch before I virtualized it.

Looking deeper
To them, the test proved that VMware was not the root of the issue in any shape or fashion.

It made them look deeper into their issue from a coding/configuration standpoint.  The root of the problem was a misconfiguration of that initial development SQL VM Template.  Because the template was misconfigured, any/all VMs deployed from it were destined to fail. That’s why VMs on both GSX & ESX had the issue.
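
A comparison along these lines is the kind of check that can surface a template problem like this sooner. This is only a minimal sketch: it assumes the settings of a known-good server and of a VM deployed from the template have already been exported as key/value pairs, and the setting names and values below are hypothetical placeholders, not the actual misconfiguration from this story.

```python
def diff_settings(baseline, candidate):
    """Return {name: (baseline_value, candidate_value)} for each setting that differs.

    A setting that is missing on one side is reported with None for that side.
    """
    names = sorted(set(baseline) | set(candidate))
    return {
        name: (baseline.get(name), candidate.get(name))
        for name in names
        if baseline.get(name) != candidate.get(name)
    }


# Hypothetical exports, for illustration only.
physical = {
    "collation": "SQL_Latin1_General_CP1_CI_AS",
    "max_memory_mb": 2048,
    "service_pack": "SP4",
}
template_vm = {
    "collation": "Latin1_General_BIN",
    "max_memory_mb": 2048,
    "service_pack": "SP4",
}

# Print only the settings where the template-based VM departs from the baseline.
for name, (expected, actual) in diff_settings(physical, template_vm).items():
    print(f"{name}: physical={expected!r} template={actual!r}")
```

The idea is simply to compare the guest configuration first, before suspecting the hypervisor underneath it.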

The point
The point of my story is that it is easy to blame technology when you don’t understand it.

As more and more companies get to the point of virtualizing servers and applications, it is important to troubleshoot further than just blaming virtualization.

After that, I can’t remember a developer implicating VMware as the cause of an issue again.

And I’m still waiting for that lunch…

8 thoughts on “Virtualization is (not) the problem – A real life debunking”

  1. Matt Liebowitz

    Great story Jase and if it were me I definitely would have bought you a nice lunch.

    Your points are all well taken and even though your story is most likely from the 2005ish time frame the lesson holds true today. Too many times the virtual infrastructure is blamed for problems when folks try to virtualize mission critical applications, and most of the time the issues are either misconfigurations or improper sizing.

    Perfect example – last night a friend asked for some help with some unexpectedly poor Jetstress results he was seeing when testing a brand new Exchange 2010 environment on vSphere. After talking through a few things (and explaining that Jetstress is known for reporting improper results when virtualized), I said “Let’s just do a quick screen share to see if there is anything obviously wrong.”

    It took about 2 minutes to see that someone had (accidentally) set a limit of 4GB of RAM on the Exchange Mailbox server. The server was configured with 24GB of RAM and had ballooned/swapped/compressed almost 20GB of RAM. Not only that but that same limit was incorrectly configured on many other servers as well. I had him remove the limit and rerun the test and everything ran great.

    Thanks for sharing the story. It’s just as relevant now as it was back then.

    Matt

    1. Jase Post author

      Matt,

      Good points as well. In your case improper configuration of the virtualization stack was the culprit. It could have easily been blamed on virtualization, but close inspection proved otherwise.

      Your example shows that having a complete picture from top to bottom is the only way to properly troubleshoot issues.

      It is to everyone’s benefit to gather ALL the facts before making any assumptions.

      Thanks,
      Jase

  2. juan

    That is an awesome story, one worth passing along to those who still believe that some day people will have more faith in the IT group. I do have a question. Did you allocate multiple processors to the VM?

    Cheers,

    1. Jase Post author

      Hey Juan,

      Our SQL VMs typically had 1 vCPU. I didn’t keep up with the development side (as far as number of SQL VMs), but when I left, we had at least 150 Production & Infrastructure SQL VMs (of about 280). I can only remember one SQL VM having more than 1 vCPU (it had 4 vCPUs), for political reasons alone.

      How many vCPUs you allocate is really up to the capabilities of your environment and the type of workload. I didn’t really have a need to allocate more than 1 vCPU.

      Thanks,
      Jase

  3. Marcus Smith

    Cool Story. I’ve virtualized almost our entire server workload. People wouldn’t even know the difference if I didn’t tell them.

    Marcus

    1. Jase Post author

      Hey Marcus,

      You are right, many wouldn’t know if you didn’t tell them. These developers had access to their development systems, and therefore knew about them being VMs.

      Before this “exercise” we went through, they weren’t really keen on the idea of VMs. Once this myth was put to rest, and they realized they didn’t have to wait as long for a VM as for a physical system, they opened up.

      Thanks,
      Jase

  4. Pingback: Virtualization Is Not The Problem (Part 2) | Blue Shift

  5. Pingback: Is Virtualization the Problem? | vMind.ru
