A look at solving the VDI IOPS problem

One of the great things about working for a VAR is that I get to evaluate best-of-breed solutions in a vendor-agnostic manner. Lately, I’ve been spending more and more time speaking with, and doing VDI designs for, both our EMC and Netapp customers.

I thought I’d take some time to discuss how each of these companies is helping to solve the “VDI IOPS problem.” For those not familiar with this particular problem space, to put it very simply: with physical desktop deployments, each desktop/user has their own hard drive. When moving to a virtual desktop, the storage is consolidated on a SAN/NAS array, so there is no longer a 1:1 desktop-to-disk-drive relationship. This is not to say that you *cannot* have a 1:1 desktop-to-disk relationship, but doing so would be extremely cost prohibitive and ruin any practical method of justifying VDI, regardless of any opex savings. The issue is not one of disk space but one of performance, namely IOPS. Each disk drive can only support so many IOPS, so the question becomes: how do we design cost-effective storage solutions for VDI while providing enough performance to avoid impacting user experience?

EMC and Netapp both have different solutions to this problem. To illustrate, let’s look at a simple use case:

  • 5000 users
  • an average requirement of 10 IOPS per user

Simple math gives us 5000 * 10 = 50,000 IOPS required from the storage array. Assuming a standard deployment of 15K Fibre Channel disks (at roughly 200 random IOPS per disk with decent latency), and NOT counting RAID overhead, we would need ~250 spindles to support this workload. This (relatively) massive spindle count will completely skew any TCO when building a business case for VDI, so we must find a way to lower it, and do so in a cost-effective manner WITHOUT sacrificing the performance of the overall system. This is not a $/GB problem; this is a $/IOPS problem. Applying deduplication techniques and reducing the overall spindle requirement from a CAPACITY perspective will do nothing to solve the $/IOPS problem. Or will it? Storage is one of the biggest costs of any VDI deployment, so this is a very important place to look for efficient designs.
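
To make the arithmetic easy to play with, here is a minimal sketch of that back-of-the-envelope sizing (the user count, per-user IOPS, and per-disk IOPS figures are just the assumptions from this example, not vendor sizing output):

    import math

    def spindles_required(users, iops_per_user, iops_per_disk):
        # Raw spindle count needed to serve the aggregate IOPS; RAID overhead ignored.
        total_iops = users * iops_per_user
        return total_iops, math.ceil(total_iops / iops_per_disk)

    # The example above: 5000 users at 10 IOPS each on 15K FC disks (~200 IOPS/disk)
    total, spindles = spindles_required(5000, 10, 200)
    print(total, "IOPS ->", spindles, "x 15K FC spindles before RAID overhead")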

Let’s first talk about Netapp’s architecture and how one would go about designing a storage solution for VDI. For starters, Netapp employs what is called WAFL, a file system that sits on top of the raw storage. WAFL stands for “Write Anywhere File Layout.” When doing write I/O, Netapp always appends to, and never over-writes, existing blocks. This boosts write performance because all write I/O becomes sequential: random writes are “grouped” in cache and written to disk in a sequential manner, while a metadata/mapping table is maintained in cache to track the actual blocks. The drawback shows up on read I/O. Because blocks were not written in the order the host specified, reads become random in nature and lose their “sequentiality” even if the host is requesting sequential blocks on its filesystem. Read cache does help this problem somewhat, since Netapp has knowledge of the metadata/mappings and can use algorithms to pre-fetch during host sequential I/O, but there is undoubtedly overhead due to the extra seeks on disk. How much, and its exact impact on read performance, is something that has been wildly debated (and tested by some). Let’s just say that if it were a REAL big issue in the real world, that is, a major competitive disadvantage, Netapp would not be as successful as it is at selling its kit.

The other optimization Netapp brings to the table is what they call “intelligent caching.” Essentially this is “deduplication” of the cache: Netapp will store only one actual block of data in cache for every 255 duplicate physical blocks. This allows the storage system to hold more actual blocks of data in cache since it is not storing duplicates. More data in cache, fewer seeks to disk, faster I/O with fewer spindles. Based on read/write workload, 95th-percentile I/O, and other factors, the Netapp sizing tools can tell you how much spindle count can be saved with this feature versus the RAW IOPS spindle calculation. This allows for a reduction in spindle count in the VDI use case because of the proliferation of duplicate blocks. Think about the OS image on all 5000 desktops: there will undoubtedly be many duplicate blocks referenced again and again, and to some degree these can all be served from the same block in cache, freeing other areas of cache for other blocks of data.

The more significant reduction in spindle count comes by way of what Netapp calls PAM/PAM-II cards. These cards are essentially large read caches for the Netapp array. Since Netapp is already very efficient at write I/O due to WAFL, with PAM/PAM-II cards much of the read I/O can be served from cache, drastically reducing the spindle count needed to support the I/O. Again, the exact spindle reduction must be run through Netapp sizing while taking into consideration all aspects of the I/O profile for the particular desktop load; i.e., it will have a bigger impact on READ-heavy workloads than on WRITE-heavy workloads because the PAM/PAM-II cards themselves are pure read caches. So in this design, utilizing View Linked Clones/Netapp dedupe, the OS drive can be placed on FC, letting PAM/PAM-II do the work of serving read I/O from cache, while the data drives can be placed on a SATA tier.
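
To get a feel for why a large read cache translates so directly into fewer spindles, here is a rough sketch assuming a hypothetical read/write split and cache hit ratio (these numbers are purely illustrative; Netapp sizing would also account for RAID, latency targets, and the measured workload):

    import math

    def spindles_with_read_cache(total_iops, read_fraction, cache_hit_ratio, iops_per_disk=200):
        # Writes still land on disk; a fraction of reads is absorbed by cache (PAM-style).
        # RAID overhead is ignored for simplicity.
        read_iops = total_iops * read_fraction
        write_iops = total_iops - read_iops
        disk_iops = write_iops + read_iops * (1 - cache_hit_ratio)
        return math.ceil(disk_iops / iops_per_disk)

    # 50,000 IOPS with a hypothetical 50/50 read/write split
    print(spindles_with_read_cache(50_000, 0.5, 0.0))   # no read cache hits -> 250 spindles
    print(spindles_with_read_cache(50_000, 0.5, 0.8))   # 80% of reads served from cache -> 150 spindles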

What is EMC’s approach to this IOPS problem? Enterprise SSD drives. The idea is that because SSD drives can perform at many times the speed of traditional 15K spinning disk (I am intentionally leaving out an exact multiple, as that varies with I/O workload, but we can assume a safe 5-8x increase in I/O density for desktop workloads), we need far fewer of them to support the I/O requirements. Enterprise SSDs are a perfect fit where a LOT of IOPS are required but very little capacity. Replacing hundreds of FC drives with a much smaller number of SSD drives can save cost and provide the required I/O in a much smaller footprint. EMC does perform write coalescing and cache pre-fetch, but it does not do cache “dedupe” like Netapp; however, it’s not needed, since the underlying drives themselves can serve the I/O very quickly compared to traditional FC. Using something like View Linked Clones, the master replica can be placed on I/O-dense SSD drives, and the data drives can be placed on a capacity-dense SATA tier. Because the SSD drives offer a massive increase in performance on a per-drive basis, massive amounts of read cache are not needed as they are in the Netapp solution. While not as involved as Netapp’s WAFL + intelligent caching + PAM/PAM-II design, I can assure you that it does a very good job of reducing the spindle count, lowering costs, and satisfying performance.
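
As a back-of-the-envelope comparison of the two approaches to the boot/OS tier, here is a sketch using the assumed 5-8x I/O-density figure from above (again purely illustrative; real sizing on either platform would factor in RAID, cache behavior, and latency targets):

    import math

    FC_IOPS_PER_DISK = 200        # the 15K FC assumption from earlier
    SSD_DENSITY_MULTIPLIER = 6    # middle of the assumed 5-8x range for desktop workloads

    def drives_needed(total_iops, iops_per_drive):
        return math.ceil(total_iops / iops_per_drive)

    total_iops = 50_000
    print("FC only: ", drives_needed(total_iops, FC_IOPS_PER_DISK), "drives")
    print("SSD tier:", drives_needed(total_iops, FC_IOPS_PER_DISK * SSD_DENSITY_MULTIPLIER), "drives")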

So to summarize…

Netapp:

  • intelligent cache dedupe to increase cache efficiency
  • WAFL to optimize write I/Os
  • PAM/PAM-II to optimize the read I/O
  • Netapp dedupe, or Linked Clones, to solve the “capacity problem”

EMC:

  • Enterprise SSD storage tier to serve the OS/boot drive image
  • Linked Clone (or potentially Celerra dedupe) to solve the “capacity problem”

I have specifically left out EXACT spindle count savings with PAM/PAM-II and SSD, as that is something which has to be done very carefully with BOTH systems by anticipating (or actually monitoring) I/O workloads, since those workloads will affect the raw spindle reduction accordingly. I have also left out things such as RAID-DP versus RAID 5, as I believe they don’t play as big a role as what is highlighted above.

Both vendors have different approaches to the problem, but both are effective as demonstrated in real customer deployments.

Comments always welcome.



Categories: VDI, vmware

16 replies

  1. Nice article, clearly explained!

    One point you might be interested in (WAFL is seriously cool!): we call the feature Tetris.

    “Because blocks were not written in the order the host specified, all reads become random in nature and lose their “sequentiality” even if the host is requesting sequential blocks on its filesystem.”

    WAFL reorders IO from arrival order to host order to avoid this problem; so data from one host is written sequentially to disk, where possible. The designers of the feature used the Tetris game as inspiration; if you think about how it works, I think you’ll get the idea of what we’re doing to avoid randomizing the data. Plus, as you note, readsets (see my blog) help with predicting which blocks can be pre-fetched.

  2. Just to build on the mental Tetris model, I often suggest the following to describe how WAFL works:

    If you really wanted to be successful at Tetris you wouldn’t play the game as intended, you would change the rules. WAFL changes the rules. Rather than attempting to find best fits for odd shapes and do it in time to handle the next set of odd shapes, you would cheat a little. The first thing you would do is stop the game. The next thing you would do is organize the shapes differently. And, what’s the best shape you could hope for in Tetris? The long sequential set of blocks, of course!

    WAFL essentially stops the game by collecting I/O for 10 seconds (or when NVRAM is half full) – dog years in terms of compute I/O. It then uses a little extra intelligence by taking a look at the space available on disk and then organizes the blocks into a long sequential stripe that best fits in the space available. It then calculates parity in memory for all of those coalesced writes and then drops it down to disk in one smooth motion.

    There are some other cool things going on to reduce head movement, but whoever figures out how to reduce disk activity and go from MB/sec to MPH best, wins. WAFL does this extremely well and does so by paying a substantially lower disk-access tax in the process. So, I suggest that Tetris is a great model to start with, but then think of it more as WAFL cheating at Tetris.
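
    To make the coalescing idea concrete, here is a toy sketch of the general pattern (buffer random writes, then lay them down as one sequential stripe); the buffer size and trigger condition are placeholders, not real WAFL internals:

        # Toy write coalescing: buffer incoming random writes, then flush them
        # to disk as one long sequential stripe. Thresholds here are made up.
        BUFFER_LIMIT = 8   # stand-in for "NVRAM half full" / the 10-second timer

        class CoalescingWriter:
            def __init__(self):
                self.buffer = []       # pending (logical_block, data) writes, arrival order
                self.disk = []         # physical layout: data is appended, never overwritten
                self.block_map = {}    # logical block -> physical location metadata

            def write(self, logical_block, data):
                self.buffer.append((logical_block, data))
                if len(self.buffer) >= BUFFER_LIMIT:
                    self.flush()

            def flush(self):
                # Re-order the batch so related blocks land contiguously, then lay the
                # whole stripe down sequentially at the tail of the physical layout.
                for logical_block, data in sorted(self.buffer):
                    self.block_map[logical_block] = len(self.disk)
                    self.disk.append(data)
                self.buffer.clear()

        w = CoalescingWriter()
        for blk in (42, 7, 13, 99, 3, 57, 21, 8):   # random-looking host writes
            w.write(blk, f"data-{blk}")
        print(w.block_map)   # logical blocks now map to one sequential on-disk stripe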

  3. As a member of Chad’s vSpecialist team whose focus area is VMware View, I’ve got to tip my hat to you. This is one of the best even-handed breakdowns I’ve seen on this issue. Great work. I could certainly add color here and there, but I fear that the responses it would draw would only detract from the tone of this excellent post. What I would like to see you follow up with now are posts that address the 2nd and 3rd pillars of virtual desktop deployments.

    Pillar #1: Deliver a desktop image that is inexpensive, available, and performs well … you addressed that here.

    Pillar #2: How do you address the user data challenges and properly design the architecture and layout? In the case of user data, that is actually, in the end, the business asset. The data IS the intellectual property being generated, so why is so much attention being paid to the “portal” portion of the equation, which is in essence replaceable?

    Pillar #3: How do you address the most transformational aspect of this whole thing, which is how to handle applications? Something as simple as a software update being pushed to thousands of desktops can crush the back-end storage no matter how efficient it is at ingesting writes (one debate point of the WAFL vs. EFD discussion). So, why debate over who ingests faster when you can simply stop ingesting anything at all by using ThinApp?

    I think that everyone agrees that capacity reduction while maintaining performance does little more than bring the OPEX cost of VDI into parity with physical desktops (more or less). Points 2 and 3 are what make VDI actually cost effective, by transforming OPEX and increasing business agility, while also increasing a company’s control over the real business asset … the information that their employees generate.

  4. Fantastic article breaking down where the real problem with VDI costs lie.

    As many of you know, SSDs (even the best from EMC) are typically very bad at performing small random writes (ironically, exactly the type of writes you find most in VDI) because flash is not bit-addressable: every write must commit a full erase block regardless of its size. This performance issue also has the nasty side effect of manifesting as a wear issue, since each erase block can only be written a certain number of times before it becomes impossible to change the data in the flash cell.

    Here at WhipTail Tech, our storage solution relies on some of the same principles that WAFL uses to virtually eliminate the performance penalty of random writes to the flash media, while retaining the massive advantage in random-access performance that SSDs bring to bear.

    Our solutions can provide > 150,000 IOPS per appliance at the cost of less than a single shelf of FC storage from the traditional vendors, slashing VDI storage costs.

    James Candelaria
    CTO
    WhipTail Tech
    http://www.whiptailtech.com

  5. um … Mr. Candelaria … I had hoped my prediction that someone would use this excellent post to go negative would be proven wrong, but alas, you couldn’t hold off … sad … BTW, you said “SSDs (even the best from EMC) are typically very bad at performing small random writes”

    You might want to learn how EMC EFD (SSD) works before you make a fool out of yourself again … next time in front of a customer

  6. @alex @mike — thanks for the detail.

    @aaron: I agree there are definitely more challenges than just this particular problem. However, it’s important to note that if you don’t figure out pillar #1, you never make it to #2 & #3 for “VDI,” which is why, as you noted, a lot of time is being devoted to the “portal” / back-end architecture aspect of VDI. What I find interesting is that #2 and #3 aren’t really specific to full-blown VDI; they also apply to physical desktop deployments, it’s just not something customers think about in general. For example, in pillar #3: using something like ThinApp is a great fit for customers looking to solve application delivery / maintenance issues even in a PHYSICAL desktop environment. The same goes for pillar #2: there should have been provisions to secure user data even before this VDI wave.

    VDI is totally disruptive in the sense that it is forcing customers to actually THINK about how to optimize desktops beyond just delivering them via VMs across a network. That’s just the first piece of the puzzle. There is lots of room for improvement and innovation in this space, which is why I find it so interesting. Desktops have typically been nothing more than an afterthought for most companies … not anymore!

  7. Great article. On estimating requirements: we have found that Liquidware Labs tools do an excellent job of auditing detailed I/O requirements at the guest level, hypervisor level, and disk level.

    http://www.liquidwarelabs.com/

    Robert

  8. And if you all want to measure these things in the real world….ping us..happy to donate software (stratusphere™) to any benchmarking tests or pilots, etc. [/shameless plug end]

  9. Great article. I’d be interested to know what read/write ratio you are assuming for this desktop with 10 IOPS and how you feel that would impact the approaches laid out by EMC and NetApp.

    The read/write ratio is key to understand for your storage design, particularly when you are looking at the read/write performance of each solution (see the rough sketch at the end of this comment). The R/W ratio can be managed by putting effort into optimising the desktop image; however, you need to measure the ratios on a per-app/process basis to do this (that’s why we love Liquidware and Lakeside, right?). Optimising the desktop image is something many people don’t want to do. How many times have we heard customers say “we don’t want to manage another image”? Time to change that attitude.

    As before, customers didn’t want to change their server images for virtualisation; now it’s a given that you have optimised server builds for virtualisation. Desktops must go the same way.

    What about the impact of App Virt on disk performance? Doesn’t this now offer the ability to move that disk workload to another storage device/tier (i.e. streamed apps sat on NAS)? How does that now impact your storage (and networking) design/sizing?

    What’s the management impact of having desktops with an OS disk and User data disk? Isn’t a Linked Clone (with delta disk) a better approach?

    VDI has always been more than just delivering desktop VMs via a remote protocol; it’s just that the market has been dominated by certain companies that compete well in the protocol space 😉
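
    Coming back to the R/W ratio point above, here is a rough sketch of how the read/write split (combined with a RAID write penalty, which the original post deliberately set aside) changes the back-end IOPS the disks actually have to absorb; the splits and penalties below are generic textbook values, not measurements:

        def backend_iops(front_end_iops, read_fraction, raid_write_penalty):
            # Back-end IOPS the disks must service, given a RAID write penalty
            # (commonly cited values: ~2 for RAID 10, ~4 for RAID 5, ~6 for RAID 6).
            reads = front_end_iops * read_fraction
            writes = front_end_iops - reads
            return reads + writes * raid_write_penalty

        # 50,000 front-end IOPS under two hypothetical desktop profiles, RAID 5 penalty
        print(backend_iops(50_000, 0.8, 4))   # read-heavy profile  -> 80,000 back-end IOPS
        print(backend_iops(50_000, 0.2, 4))   # write-heavy profile -> 170,000 back-end IOPS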

  10. Aaron,

    Do you know where I can find detailed documentation on WAFL and write IOPS?

    Thanks,

    Robert

  11. I found this Netapp blog that, if I’m reading it correctly, says WAFL gets the most write IOPS out of the available disks by eliminating the RAID requirement.

    http://blogs.netapp.com/extensible_netapp/2009/03/understanding-wafl-performance-how-raid-changes-the-performance-game.html

  12. I keep running into the following problem when working with customers who want to scale up to thousands of users: disk seek times when VMDKs are striped across many spindles seem to slow access, yet there is no way of measuring it.

    Any suggestions?

  13. Great article
    I’m assuming you’ve also reviewed this fantastic article by rspruijt: http://www.brianmadden.com/blogs/rubenspruijt/archive/2010/05/01/vdi-and-storage-deep-impact.aspx

    I can see why you presented the comparison between Netapp and EMC the way you did, but I found it slightly sneaky that you avoided price points. In all fairness this article wasn’t about the cost differences between the two companies, but isn’t that what comes into play when making a choice between one or the other?

    A Netapp with a PAM card is considerably cheaper than an EMC SSD solution. I’m not trying to take away from either team, as they both have their areas of strength and weakness, but SSD shelves have a considerably shorter lifespan and higher price than FC-AL shelves, and the PAM cards have had their prices cut in half since Insight 2010…

    Just my 2 cents

Trackbacks

  1. Virtualization Short Take #37 | Free Techie Blog
  2. Nice Post
