EMC Storage Pool Deep Dive: Design Considerations & Caveats

This has been a common topic of discussion with my customers and peers for some time. Proper design information has been scarce at best, and some of these details appear to not be well known or understood, so I thought I would conduct my own research and share.

Some time ago, EMC introduced the concept of Virtual Provisioning and Storage Pools in their Clariion line of arrays. The main idea is to simplify management for the storage admin. The traditional method of managing storage is to take an array full of disks, create discrete RAID groups from sets of disks, and then carve LUNs out of those RAID groups and assign them to hosts. An array could have dozens to hundreds of RAID groups depending on its size, and often this results in stranded islands of storage within those RAID groups. Some of this could be alleviated by carefully planning the layout of the storage array, but the problem is that most customers' storage requirements change, and they can very rarely plan the layout of an entire array on day 1. There was a need for flexible, easy storage management, and hence the concept of Storage Pools was born.

Storage pools, as the name implies, allow the storage admin to create “pools” of storage. In some cases you could even create one big pool with all of the disks in the array, which greatly simplifies management. No more stranded space, no more deep architectural design around RAID group size, layout, and so on. Along with this comes a complementary technology called FAST VP, which allows you to place multiple disk tiers into a storage pool and lets the array move data blocks to the appropriate tier based on performance needs. Simply assign storage from the pool as needed, in a dynamic, flexible fashion, and let the array handle the rest via auto-tiering. Sounds great, right? Well, that’s what the marketing says anyway. 🙂

First let’s take a brief look at the difference between the traditional RAID group based architecture and Storage Pools.

[Figure: traditional RAID group based architecture (left) vs. storage pool based architecture (right)]

On the left is the traditional RAID group based architecture. You assign disks to RAID groups, then carve one or more LUNs out of each RAID group and assign them to hosts. You would have multiple RAID groups throughout the array based on protection level, capacity, performance, and so on. On the right is the pool based approach, shown here as a homogeneous pool to keep things simple. You simply assign disks to the pool and assign LUNs from that pool. When you need more capacity, you just expand the pool. Contrast this with having to build another RAID group and assign LUNs from it while trying to fill the existing RAID groups with properly sized LUNs. Management complexity is greatly reduced. But what are the trade-offs and design considerations?

Let’s take a deeper look at a storage pool…

Depicted in the above figure is what a storage pool looks like under the covers. In this example, it is a RAID5-protected storage pool created with 5 disks. When you create this 5-disk storage pool, FLARE creates a Private RAID5 4+1 RAID group under the covers. From there it creates 10 Private LUNs of equal size. In my test case, I was using 143GB (133GB usable) disks, and the array created 10 Private LUNs of 53.5GB each, giving me a pool size of ~530GB. This is what you would expect from a RAID5 4+1 RG (133*4 = 532GB).
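
As a quick sanity check of that math, here is a minimal sketch assuming the 10 Private LUNs and the ~133GB formatted capacity observed on my test array (approximate numbers; FLARE also reserves a small amount of pool space for metadata):

```python
# Rough capacity math for a 5-disk RAID5 (4+1) storage pool, using the
# ~133GB formatted capacity of my 143GB test drives. Numbers are approximate.
DISKS_IN_POOL = 5
DATA_DISKS = DISKS_IN_POOL - 1            # 4+1: one disk's worth of parity
FORMATTED_GB = 133
PRIVATE_LUNS_PER_RG = 10                  # observed in my testing

usable_gb = DATA_DISKS * FORMATTED_GB                 # ~532GB pool capacity
private_lun_gb = usable_gb / PRIVATE_LUNS_PER_RG      # ~53GB per Private LUN

print(f"Pool capacity: ~{usable_gb}GB, "
      f"Private LUNs: {PRIVATE_LUNS_PER_RG} x ~{private_lun_gb:.1f}GB")
```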

When you create a LUN from this pool and assign it to a host, the I/O is processed differently than with a traditional FLARE LUN. For a traditional FLARE LUN (ignoring MetaLUNs for simplicity), the I/O goes to one LUN on the array and is written directly to the set of disks in its RAID group.

However, as new host writes come into a Pool LUN, space is allocated in 1GB slices. For Thick LUNs, this space is contiguous and completely pre-allocated. So, if one were to create a 10GB Thick Pool LUN, there would be a 1GB slice allocated on each of the 10 Private LUNs, for a total of 10x 1GB slices. As host writes come into the Pool LUN, the LBA (Logical Block Address) of the host write has a 1:1 relationship with the Pool LUN; meaning the LBAs corresponding to 0-1GB on the host land on Private LUN0 since it contains the first 1GB slice, LBAs for 1-2GB land on Private LUN1, LBAs for 2-3GB land on Private LUN2… and so on as shown below:
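
As a minimal sketch of that mapping (my own illustration, assuming simple round-robin placement of 1GB slices across the 10 Private LUNs, not EMC's actual allocator):

```python
# Minimal sketch of how a Thick Pool LUN's address space maps onto 1GB slices
# laid out round-robin across the 10 Private LUNs of a single 4+1 Private RG.
SLICE_GB = 1
PRIVATE_LUNS = 10   # per Private RAID group, as observed in my testing

def private_lun_for_offset(offset_gb: int) -> int:
    """Return the Private LUN index holding the slice for a host offset (GB)."""
    slice_index = offset_gb // SLICE_GB
    return slice_index % PRIVATE_LUNS

# The 10GB Thick LUN example from above: each 1GB range lands on its own Private LUN.
for gb in range(10):
    print(f"Host LBA range {gb}-{gb + 1}GB -> Private LUN {private_lun_for_offset(gb)}")
```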

These LUNs are all hitting the same Private RAID group underneath and hence the same disks. I assume EMC creates these multiple Private LUNs for device queuing/performance related reasons.

Caveat/Design Consideration #1: One very important aspect to understand is EMC’s recommendation to create R5 based pools in multiples of 5 disks. This is VERY important because it can lead to unexpected results if you don’t fully understand it and proceed to create pools from non-multiples of 5. The pool algorithm in FLARE tries to create the Private RAID5 groups as 4+1 whenever possible. As an example, if you ignored the 5-disk-multiple recommendation and created a pool with 14 disks, you will NOT get the capacity you might expect. FLARE will create 2x 4+1 R5 Private RGs and 1x 3+1 Private RG, NOT the single 13+1 Private RG you might expect. So you end up with less capacity than you were counting on.

In my case, using 143GB disks (133GB usable), a 14-disk R5 pool gives me (4*133)+(4*133)+(3*133) = ~1460GB, not the expected (13*133) = ~1730GB. A difference of almost 300GB; quite significant! The best option in this case is to add another drive and create a 15-disk R5 pool, achieving 3x 4+1 RGs under the covers. This is important to consider when configuring the array, as you could end up with one irate customer if multiple 300GB chunks of expected capacity go missing over the span of the array!
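
Here is a rough sketch of that carving behavior (an assumption extrapolated from my 14-disk observation, not a documented EMC algorithm): FLARE peels off as many 4+1 groups as it can, then builds one smaller RAID5 group from whatever is left.

```python
# Sketch of the Private RAID group carving described above (an assumption based
# on my 14-disk test; very small remainders are not handled here).
FORMATTED_GB = 133   # usable capacity of my 143GB test drives

def private_rgs(disk_count):
    """Return the disk count of each Private RAID5 group created for the pool."""
    groups, remaining = [], disk_count
    while remaining >= 5:
        groups.append(5)              # a 4+1 Private RG
        remaining -= 5
    if remaining:
        groups.append(remaining)      # leftover disks become a smaller RAID5 RG
    return groups

def pool_capacity_gb(disk_count):
    # each RAID5 group loses one disk's worth of capacity to parity
    return sum((g - 1) * FORMATTED_GB for g in private_rgs(disk_count))

print(private_rgs(14), pool_capacity_gb(14))   # [5, 5, 4] -> 1463GB (~1460GB)
print(private_rgs(15), pool_capacity_gb(15))   # [5, 5, 5] -> 1596GB
```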

Next, let’s take a look at some aspects of I/O performance, and some things to consider when expanding the pool.

With a pool composed of 5 disks, things are pretty simple to understand because there is 1x 4+1 Private RG underneath handling the I/O requests, but what happens when we expand the pool? Keeping in mind that we need to expand this pool by a multiple of 5, let’s add another 5 disks to it, bringing the total capacity to 530*2 = ~1060GB. Under the covers, the pool now looks like this:

After adding the 2nd set of 5 disks, FLARE has created another 4+1 Private RAID group and 10 more Private LUNs from that RAID group. The new Private LUNs currently have no data on them.

Design Consideration / Caveat #2: Note that when the Storage Pool is expanded, the existing data is NOT re-striped across the new disks. Reads from the original Pool LUN will still hit only the first 5 disks, and so will writes to the LBAs within the existing 10GB that were previously written. So do not expect a sudden increase in performance on an existing LUN by expanding the pool with additional disks.

In my testing, I brought my Pool LUN into VMware and put a single VM on it, then expanded the pool and put another VM on it. Before placing the 2nd VM on the LUN, the data layout looked exactly as depicted above: there was data spread across the Private LUNs associated with the first Private RAID group, and no data on the Private LUNs of the second RAID group. When I cloned another VM onto the LUN, this is what it looked like:

VM1’s data is still spread across the first Private RG and the first 10 Private LUNs as expected, but VM2’s data is spread across BOTH Private RAID groups and all 20 Private LUNs! Think about that for a second: 2 VMs, on the SAME VMFS, in the SAME Storage Pool, and one gets the I/O of 5-disk striping while the other gets the I/O of 10-disk striping; talk about non-deterministic performance! The second VM will get great performance as it is wide-striped across 10 disks, but the first VM is still using only the first 5 disks. These are both 100GB VMs (in my testing), so not all the slices are depicted, but it still illustrates the point. The actual allocation would show 100 slices (1 slice = 1GB as previously mentioned) across Private LUNs 0-9 for VM1, and 50 slices across Private LUNs 0-9 plus 50 slices across Private LUNs 10-19 for VM2. If I keep placing VMs on this Pool LUN, they will continue to get 10-disk striping, UNTIL the first Private RG fills up, at which point any subsequent VMs will get only 5-disk striping.
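
To make the slice counts concrete, here is a small sketch of the allocation pattern I observed (my own illustration of round-robin slice placement, not EMC code):

```python
# Slice distribution observed in my test: VM1's 100 slices were allocated
# before the expansion (Private LUNs 0-9 only), while VM2's 100 slices were
# allocated round-robin across all 20 Private LUNs afterwards.
from collections import Counter

def allocate(slice_count, private_luns):
    """Round-robin the given number of 1GB slices across the Private LUN indices."""
    return Counter(private_luns[i % len(private_luns)] for i in range(slice_count))

vm1 = allocate(100, list(range(10)))   # pool still had only Private LUNs 0-9
vm2 = allocate(100, list(range(20)))   # pool now has Private LUNs 0-19

print("VM1:", dict(vm1))   # 10 slices on each of Private LUNs 0-9 (5-disk striping)
print("VM2:", dict(vm2))   # 5 slices on each of Private LUNs 0-19 (10-disk striping)
```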

Now, this imbalance occurred because there was still free space in the first RG, so the algorithm allocated slices there for the 2nd VM as well, since it allocates slices in a round-robin fashion. If the pool had been at capacity before being expanded, we would likely get something like this (not tested, extrapolating based on the previous behavior):

[Figure: pool at capacity (blue = other data), then expanded; the 2nd VM’s slices land only on the second Private RAID group]

In this diagram, the blue simply represents “other” data filling the pool. If the pool was at capacity, then expanded, and then my 2nd VM placed on it, the 2nd VM could not get slices from the first Private RAID Group (because it’s full), so its slices would come ONLY from the 2nd Private RAID group, spreading its data across only 5 disks instead of 10 like last time. Now imagine a VM that was created just before the first Private RG filled up: some of that VM’s I/O could be striped across 10 disks, and the rest across only 5 disks once the first Private RG fills.

Design Consideration / Caveat #3: As illustrated above, if you expand a storage pool before it is full or close to full, you may get unpredictable I/O performance, because depending on the conditions under which you expand the pool, different data sets can end up with different levels of striping. Things get even hairier if you add disks outside the 5-disk-multiple recommendation. If, for example, you only need enough space for 4 more disks and expand the pool by 4 disks, you end up with 2x 4+1 RGs and 1x 3+1 RG underneath. At some point, some of the I/O could be restricted to just 3-disk striping, instead of 5 or 10.

From this, it seems the best way to utilize storage pools is to allocate as many disks as you can up front. By this I mean, if you have a tray of disks on a Clariion or VNX, allocate all 15 disks when creating the pool. This gives you 3x 4+1 RGs underneath, and any data placed in the pool will be striped across all 15 disks consistently. It is best to avoid creating small pools and expanding them frequently, 5 disks at a time, as you could easily run into issues like the above without realizing it.

There is one other issue to consider with pool expansion. Let’s say you create a pool with 15 disks and start placing data on it. All of your I/O is being wide-striped across the 15 disks and all is well, but now you need more space and need to expand the pool. Going by the 5-disk-multiple rule, you should be safe adding 5 disks, right? While this is something you can do, and it will work, it may again give unexpected results.

Before expansion, your 15 disk R5 pool looks like this:

All data is spread across the 15 disks, but the pool is at capacity (imagine it is full). If the pool is expanded at this point, here is what it would look like once new VMs (or any new data) are placed on it:

After the pool is expanded, the new data is only striped across the 5 new disks, instead of the original 15! So if you placed a new VM on this device expecting very wide striping, you could be sorely disappointed, as it is only getting 5 disks’ worth of striping.

Design Consideration / Caveat #4: From this, the recommendation for expanding storage pools would be to expand by the number of disks the pool was initially created with. So if you have a 15-disk storage pool, expand it by another 15 disks so the new data can take advantage of the wide striping. I have also heard people recommend doubling the storage pool size, but this may be overkill. As an example, if you have a 15-disk storage pool and add another 15 disks to it, you could theoretically have some hosts striping their I/O over 30 disks; should you now expand the pool by 30 disks instead of 15? And then 60 disks the next time? As always, understand the impact of your design choices and your performance requirements before making any decisions, as there is no blanket right/wrong approach here.

Hopefully EMC will introduce a re-balance feature for pools, like what exists in the latest VMAX code, to alleviate most of these issues. Until then, these are some things to be aware of when designing and deploying a Storage Pool based configuration.

Another thing to watch out for is changing the default SP owner of a Pool LUN. Because the LUN is made up of Private LUNs underneath, changing the owner can introduce performance problems, since I/O has to go through the redirector driver to reach the Private LUNs owned by the other SP; so make sure to balance the Pool LUNs across SPs when they are first created.

Utilizing Thin LUNs introduces a whole new level of considerations, as they do not pre-allocate the 1GB slices but instead write in 8KB extents. This can cause even more unpredictable behavior under the circumstances outlined above, but that is something to be aware of when using Thin provisioning in general. Then there is the variable of utilizing Thin Provisioning on the host side, adding another level of complexity to how the data is allocated and written. I may write a follow-up post illustrating some of these scenarios in a Thin provisioned environment on both the host and array sides. I also did not touch on some of the considerations for RAID10 pools, and I will probably follow up on that later as well.

Generally speaking, if ultra-deterministic performance is required, it is still best to use traditional RAID groups. Customers may have certain workloads that simply need dedicated disks, and I see no reason not to use RAID groups for those use cases. Again, it’s about understanding the requirements and translating them into a proper design; the good news is that EMC arrays give you that flexibility. There is no question that storage pool based approaches take the management headache out of storage administration, but architects should always be aware of the considerations and caveats of any design. Layering FAST VP on top of storage pools is an excellent solution for the majority of customers, and it is important to note that the ONLY way to get automated storage tiering is to use Pool based LUNs.

As always, comments/questions/corrections are welcome!



Categories: EMC, storage

40 replies

  1. Hi, D from NetApp here.

    I read in an EMC forum that each 1GB chunk is not striped beyond the confines of a RAID group.

    So, this means that, in a simplistic 15-drive example and a 6GB LUN, you’ll get a 1GB chunk on each 4+1 RG, then it will wrap around again and do another pass, so you’ll end up with 2 slices per RG.

    That’s not quite striping, that’s concatenation but I/O is still spread among platters since a 1GB slice is 1/6th of the LUN size in this example.

    Which would make sense and would explain why, with Autotiering, rebalancing would effectively be accomplished, since the addition of more drives would mean that, in a few days, some slices would start living on them.

    But without Autotiering it would be an issue as you described.

    Could you get confirmation?

    Thx!

    D

    • Hi D, if you were to have a 15-disk pool, you would have 3x 4+1 RGs underneath, but each RG would have 10x Private LUNs, for a total of 30x Private LUNs. From my testing, the 1G chunks are distributed across the Private LUNs, not across the RGs. So you would have a 1G chunk on Private LUNs 0-5, which means 6x 1G chunks on RG1, and no 1G chunks on RG2 and RG3. This is ignoring the single 1GB slice that is allocated up front for metadata purposes; there is actually an extra slice because of that. I was able to test this in my lab to verify. I created an 8GB LUN and placed a ~6GB VM on it. It allocated everything from the 1st Private RG’s Private LUNs.

      The auto-tiering is something I have been thinking about as well. Keeping in mind that FAST will move blocks between tiers but not WITHIN the same tier, you could still have some distribution issues, but perhaps over a long period of time blocks would get promoted/demoted enough to balance things out; and in the end, the I/O distribution of cold blocks doesn’t matter much anyway.

  2. Interesting – BTW, how are you able to figure out where the space is allocated from?

    It’s kinda interesting that there’s no definitive answer.

    Check here:

    https://community.emc.com/thread/110313?tstart=30

    This is what prompted my initial comment.

    I think that even if there isn’t migration within the same tier, it would be fairly easy algorithmically to evacuate busy RGs. Of course I don’t know if this is how it works, it all seems very secretive.

    Ultimately, do we agree then that a single slice isn’t itself striped?

    Thx

    D

    • The 1G slices are NOT striped across multiple RGs. Each 1G slice lives on a single Private LUN, which of course has a mapping to a single RG on the back end. So in effect, each 1G slice will live in a single RG, in this example a 4+1 RG. So the 1G slice is striped across 5 disks, but not across multiple RGs.

      The allocations aren’t viewable via Unisphere, I had to look at some cryptic SP Collect txt files to figure all this out; I just could not live with the “it just works” mentality.

      • I am interested in finding out how my pool LUNs are actually spread across the private LUNs and raid groups. Can you share some more info on what files in the SP Collect that you looked at?

  3. Thanks for confirming. To me, this means that I/O that hits a slice hard can potentially be slow, which explains the FAST cache and Autotiering approaches.

    I guess success stories will show whether the EMC approach works, or whether it’s a short-term fix until the next-gen gear.

    D

  4. Great article Vijay. I just found your blog and have been really impressed with the clarity of your articles, the VDI IOPS one being another. This is one of the best non-EMC-employee write-ups I’ve seen about how pools work under the covers. Would love to see results of any testing you do involving FAST or FAST Cache if you’re looking for suggestions 🙂

  5. Excellent article! Read it in two minutes (and my English is not so good) and understood it all. Thank you

  6. An excellent post that does a very good job of explaining what’s happening under the covers. Thanks!

  7. hi,

    awesome article, you gave me a lot of Storage Pool questions. There is one last thing that I am still not sure about: what if I add 10 SAS and 5 FLASH drives to a RAID5 pool where FAST VP is enabled? Then I would end up with 3 RAID5s; the performance would be much better on the FLASH RAID, but I wouldn’t care because the blocks that don’t need high IOPS would reside on the slower SAS RAIDs, or am I missing something?

    thanks

    Gernot

  8. Your understanding of how pools are initially created is not quite right. You are correct on the 1GB slices; you’re incorrect in saying you get 10x LUNs of 53.5GB. (Also note, the real disk size is F146-15K or 10K, not 143GB. Your formatted size is correct, 133GB.) Next, you incorrectly stated how data is spread upon expansion in your Caveat #2 section. Assuming you did exactly what you said, and wrote no other data to the POOL, the second/cloned VM would have landed entirely on the (5) disks you expanded with, making the 1st & 2nd private RGs equally consumed. Once that occurs, future 1GB slices would be allocated across all 10 disks. EMC is well aware of the caveats, and there are best practice white papers with recommendations on how to best use pools and what to look out for. Agreed, because of the way expansion works you might not always get what you think. Keep in mind POOLs were created and thrust forward for “ease of use”, and not necessarily top-line performance, although similarly sized traditional RGs and POOLs will perform nearly the same, until expansion, which could alter returns as you’ve noted. Follow best practices to avoid such pitfalls.

    -spindle77

  9. Great Article, Vijay! I have been looking for this kind of info for a while! Further, I’m curious what default private raid group allocation is used for RAID 1/0 and RAID6 pool configurations. If RAID5 pools prefer allocating chunks of 4+1’s, what do RAID 1/0 and 6 prefer?

  10. This really isn’t much different than NetApp’s Aggregate approach or SVC’s Managed Disk Groups/Pool approach. Both of those vendors have come out with rebalance commands, but they weren’t there initially either.

    EMC has taken a first pass on FAST VP, I’m sure it will improve going forward with: rebalance, asymmetrical RAID types and other features.

    Urban

  11. The non-deterministic performance issue is indeed a cause of concern. Imagine a scenario where you are using VP as well as FAST. Imagine a LUN that is initially wide striped across 10 drives. Then for tiering reasons imagine that the entire LUN is moved to SSD tier. Now when that LUN becomes cold and needs to be moved to a lower tier it is conceivable that it may not get its original location back and be moved to a new location which is only striped across 3 disks instead of 10 disks. That would mean a significant performance drop and you could go crazy trying to figure out the cause.

  12. Brilliant guide, as I am struggling to get this sort of info from EMC. Is there a way of determining the I/Os a pool will produce?

  13. Simply superb article.. explained very simply… have you written a deep dive on adaptive optimization on 3PAR or other successful auto-tiering? I would love to read it.. thanks

  14. Storage pool thick LUNs pre-allocate 1GB slices and move to the next slice only when that 1GB is filled. But a striped metaLUN writes 1MB to each meta member, starting with the meta head. That sounds like, unless the application fills up space fast enough, a traditional metaLUN will give better performance than storage pool thick LUNs.

    Also, regarding the comment that “1GB slices are not striped across multiple RGs”: a metaLUN behaves the same way, but since it writes in 1MB chunks, the load spreads to multiple different RAID groups much more quickly.

  15. Hello,

    Can we have two different RAID pools (RAID 1/0 & RAID5) created on a VNX5500 and FAST VP enabled?

    We have a situation where we need to separate two environments, to give better performance to each department individually.

    I am seeing that this is wonderful, and a lot of knowledge sharing.

    Appreciate your help.

  16. hi Guys,

    suppose I have a RAID5 pool with 250 drives (including different makes); please tell me how this pool works in the EMC array after one or two drives fail….

  17. A 1GB chunk size is too much; it should be right-sized in line with the application block size, otherwise the chunk size will not be optimized.

  18. Hi, first let me thank you for the nice blog posts I’ve read on your blog. EMC storage is what I work on, but despite putting a lot of effort into staying up to date with the information, there are a few things I found here for the first time 🙂 . The UCS posts are especially informative.

    Can you please provide pointers to how you found out about the number of private RAID groups/LUNs created in a pool? E.g., are there always 10 private LUNs created for allocating slices for pool LUNs, or does this number vary? Also, how did you check the number of private LUNs while running your independent tests?

  19. So if you want to utilize the extra performance of the extra disks in the storage pool, you would have to move the virtual machine off the LUN and then copy it back. That way the data should be spread over the entire storage pool?

  20. Is the “re-balance” feature still missing or has EMC implemented it since this article was written?

  21. rebalance and multi-raid pool support were announced at EMC World 2012

  22. Thoughts on one large pool for a vSphere environment? The customer is comfortable with their IO requirements; they have 25x 600GB SAS drives, 15x NL-SAS drives, and 2x 100GB FAST Cache drives, using FAST VP.

    • Can one large pool for vSphere hold up, or is there any limitation to that?
      We are planning to build the same kind of large FAST VP pool for VMware.
      So my question is: how many VMs can we create in a pool, or do we need to create multiple pools? Multiple VMs are going to be placed on multiple ESX hosts, so what is the best method?
      ~Nar

  23. Nice work. You should post something on how you got the text files and how to interpret them to get the private RG and private LUN data.

  24. The re-balance feature will be included in the next major release coming “soon”. Until then I will stick to traditional RAID Groups instead of Pools wherever I can.

  25. Great collection of information; thank you.

    Do you know why the requirement is to have 5 LUNs on each SP when using Storage Pools for File? I’m not finding much documentation or a deep dive on ECN or in any documents (best practices, white papers, etc.).

    • I believe the 5 LUNs per SP is simply to have multiple threads/concurrency across the disks. The AVM on data movers stripes across all LUNs presented to it. So if you have 5 on SPA and 5 on SPB, you have load balancing as well as concurrency. There is a whitepaper on how AVM works (a really old one) that describes this.

