VMAX on a Clariion Planet, Part1: A look at architecture and IO flows
Posted by Vijay Swami on April 25, 2011
This article focuses on understanding VMAX from the perspective of users who are familiar with Clariion arrays, terminology and architecture. Put another way, it is a guide to VMAX for Clariion users. We’ll take a look at the architectural similarities and differences, the terminology, and basic storage administration tasks. When Clariion is mentioned in this article, the point applies equally to VNX arrays, as they are similar for the purposes of this article.
Part1 will focus on architecture and IO flows, and Part2 will discuss some storage design and provisioning concepts.
With that said, let’s examine the I/O flow from the host to a back-end disk for each array type.
The above is a representation of the I/O flow from the host to a Clariion or VNX array. The diagram shows two storage processors (SPs), cache within the SPs, a back-end disk enclosure with some disks, and a sample LUN. The SPs provide front-end connectivity to the hosts (perhaps through switches), cache for data, and back-end connectivity to the physical disks. In this dual-controller architecture, a LUN is “owned” by either SPA or SPB. That means the host can only access the LUN through the front-end ports on the owning SP, unless there is a failover and ownership of the LUN is transferred (“trespassed”) to the other SP. In this example, the LUN is owned by SPA, so the I/O flows as follows:
1- The host makes an I/O request to the active owner, SPA.
2- SPA checks its cache. If this is a read request with a cache hit, or a write, the I/O is served directly from or to the cache. If this is a read cache miss, the I/O proceeds to step 3.
3- Access is made to the physical disk (or disks) that actually contain the requested blocks. Note that all of the physical disks are accessible through either SP, but access to the data on a particular LUN goes through the owning SP. This is what makes it an active/passive system.
Also worth mentioning is a mode of access referred to as ALUA (Asymmetric Logical Unit Access). This allows a host to send requests to SPB for a LUN owned by SPA. On the surface this may seem to transform the array into an active/active system; however, access to the LUN through SPB is considered a “suboptimal” path. This is because the I/O is not actually served by SPB; rather, it is sent to SPA over a CMI link (the dual arrows connecting the SPs in the diagram), and the I/O flow then remains the same as above. Once the data is fetched, it is transferred back over the CMI link to SPB, and then to the host. Because of the performance penalty incurred, ALUA is meant to mitigate certain failure conditions, not to provide true active/active access.
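To make the owner path and the ALUA path concrete, here is a minimal Python sketch of a toy dual-SP array. The class and method names (`ToyLun`, `ToyDualSpArray`, `read`) are invented for illustration and are not part of any Clariion/VNX tool or API; the model only counts the hops described above and ignores writes, trespass and timing.

```python
class ToyLun:
    """A LUN with a single owning SP (hypothetical model, not a real API)."""

    def __init__(self, name, owner):
        self.name = name
        self.owner = owner  # "SPA" or "SPB"


class ToyDualSpArray:
    """Toy dual-controller array: each SP has its own read cache."""

    def __init__(self):
        self.cache = {"SPA": set(), "SPB": set()}

    def read(self, lun, block, via_sp):
        """Return the list of hops a read takes when sent to `via_sp`."""
        hops = [f"host->{via_sp}"]
        redirected = via_sp != lun.owner
        if redirected:
            # ALUA non-optimized path: forward the request to the owner over CMI.
            hops.append(f"{via_sp}->CMI->{lun.owner}")
        key = (lun.name, block)
        if key in self.cache[lun.owner]:
            hops.append(f"{lun.owner} cache hit")
        else:
            hops.append(f"{lun.owner} cache miss -> disk")
            self.cache[lun.owner].add(key)
        if redirected:
            # The data travels back over CMI before it reaches the host.
            hops.append(f"{lun.owner}->CMI->{via_sp}")
        hops.append(f"{via_sp}->host")
        return hops


array = ToyDualSpArray()
lun0 = ToyLun("LUN0", owner="SPA")
owner_path = array.read(lun0, block=7, via_sp="SPA")  # 3 hops, cache miss
alua_path = array.read(lun0, block=7, via_sp="SPB")   # 5 hops, two CMI crossings
```

Even with a cache hit on the owner, the request sent through SPB crosses the CMI link twice, which is the penalty the paragraph above describes.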
Observation #1a: In order to scale this system for more (non-spindle-bound) performance, the SPs have to be upgraded to larger, more powerful units. You cannot, for example, add a third SP to gain additional processing power. An SP upgrade is a data-in-place upgrade, but it requires downtime. This is OK for some environments, but not for others.
Observation #2a: If you need to add more paths to a device beyond what is available from the front-end port count on a single SP, you are out of luck. You cannot spread the I/O for a particular LUN across both SPs. Your only choices are to upgrade the SPs, or to create smaller devices, assigning some to SPA and some to SPB. The latter assumes you have a logical volume manager running on the host that can combine these back into a larger device. In the case of VMware, you have the option of utilizing “extents”, but for many reasons it won’t be an optimal solution; those reasons are beyond the scope of this article.
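As a rough illustration of the "smaller devices" workaround, the sketch below shows how a host-side volume manager might stripe one logical device across component LUNs that are owned alternately by SPA and SPB. The function name, the stripe size and the two-LUN layout are assumptions made for the example; real volume managers differ in their details.

```python
def map_striped_block(lba, stripe_blocks, component_luns):
    """Map a logical block address to (component LUN, block offset).

    Simple round-robin striping, as a generic volume manager might do it.
    `component_luns` is a list of (lun_name, owning_sp) tuples.
    """
    stripe_index = lba // stripe_blocks
    lun = component_luns[stripe_index % len(component_luns)]
    # Each component LUN holds every Nth stripe, packed contiguously.
    offset = (stripe_index // len(component_luns)) * stripe_blocks + lba % stripe_blocks
    return lun, offset


# Hypothetical layout: two component LUNs, one owned by each SP.
luns = [("LUN_A", "SPA"), ("LUN_B", "SPB")]
first = map_striped_block(0, 128, luns)     # lands on LUN_A (SPA)
second = map_striped_block(128, 128, luns)  # lands on LUN_B (SPB)
third = map_striped_block(300, 128, luns)   # back on LUN_A, second stripe
```

Consecutive stripes alternate between the two SPs, so a large sequential or well-distributed workload ends up using the front-end ports and cache of both controllers even though each individual LUN is still served by one SP.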
Next up, let’s have a look at the architecture of a VMAX.
Unlike the dual-controller architecture of a Clariion or VNX, the VMAX is made up of building blocks referred to as “engines”, and can scale from 1 to 8 engines. The above illustrates the internals of a single engine, with each engine containing two directors. The left half of the engine represents one director, and the right half another; a director is analogous to an SP in the Clariion world. Contained within a director are some components that will be familiar to a Clariion user: front-end ports, back-end ports, and cache. However, the cache in this case is a global cache, meaning it is shared among all the engines (and directors) as one big pool of (mirrored) memory. Another new component is the Virtual Matrix Interface. This is the interconnect by which VMAX engines communicate with each other. Because this is an active/active system with a shared global cache, any engine (and director) can access any LUN simultaneously, which is different from the Clariion architecture, in which a LUN can only be accessed by a single controller at a time.
Here is a look at an example I/O flow for a single-engine VMAX.
This is a very simple example where a single engine is directly connected to the physical disks that contain the blocks being requested on a LUN. The assumption is that the host has connectivity to front-end ports on both directors through the SAN. Steps 1-3 are very similar to those of the Clariion example, with the notable exception that I/O for the LUN can be serviced by both directors (analogous to controllers in the Clariion world) simultaneously. The red broken line dividing the global cache indicates that there are separate cache modules in each director; however, they are shared as one global pool. In single-engine configurations, the cache is mirrored between directors; in multi-engine configurations, the cache is mirrored between engines.
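The practical effect of a shared cache can be sketched in a few lines of Python. This is a toy model with invented names (`ToySingleEngine`, `read`); it deliberately ignores mirroring, writes and slot placement, and only shows that a block cached by a read through one director is a hit when read through the other.

```python
class ToySingleEngine:
    """Toy single-engine model: two directors, one logical cache pool."""

    DIRECTORS = ("DIR1", "DIR2")

    def __init__(self):
        # One shared set stands in for global memory; physically it is
        # spread across both directors, logically it is a single pool.
        self.global_cache = set()

    def read(self, lun, block, via_director):
        """Serve a read through either director; no LUN ownership check."""
        if via_director not in self.DIRECTORS:
            raise ValueError(f"unknown director: {via_director}")
        key = (lun, block)
        if key in self.global_cache:
            return f"{via_director}: cache hit"
        self.global_cache.add(key)
        return f"{via_director}: cache miss -> disk"


engine = ToySingleEngine()
first_read = engine.read("LUN0", 42, via_director="DIR1")
second_read = engine.read("LUN0", 42, via_director="DIR2")
```

Compare this with the dual-SP case: there is no owner to forward the request to, and no penalty for choosing the "other" director.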
Observation #1b: The ability to access any LUN from any engine, and to add engines instead of upgrading them as you would a Clariion/VNX controller, is what makes this system a scale-out architecture for block storage. It provides enormous opportunities for scaling, as you can simply add engines when you need more performance, in addition to adding disk. It also provides higher resiliency for the same reason.
Observation #2b: Because any LUN can be accessed by any engine simultaneously, one can scale the performance of a single LUN to incredible levels. It would be theoretically possible to connect a host such that it had paths to a LUN from every single engine and, by utilizing multipathing software like PowerPath, drive I/O down all those paths simultaneously!
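The difference in usable paths can be illustrated with a simple round-robin distribution. This is only a sketch: the path names are made up, and plain round-robin is used as a stand-in, not as a description of how PowerPath actually balances load.

```python
from collections import Counter
from itertools import cycle


def distribute_round_robin(paths, io_count):
    """Spread `io_count` I/Os over `paths` in strict rotation."""
    if not paths:
        raise ValueError("at least one path is required")
    counts = Counter()
    rotation = cycle(paths)
    for _ in range(io_count):
        counts[next(rotation)] += 1
    return counts


# Hypothetical VMAX host: one path per director on two engines.
vmax_paths = ["E1/D1", "E1/D2", "E2/D3", "E2/D4"]
vmax_load = distribute_round_robin(vmax_paths, 8)

# Hypothetical Clariion host: only the owning SP's ports are optimal paths.
clariion_paths = ["SPA-0", "SPA-1"]
clariion_load = distribute_round_robin(clariion_paths, 8)
```

With the same eight I/Os, each VMAX path carries two while each Clariion owner path carries four; adding engines adds usable paths for the same LUN, which is the point of the observation above.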
The first VMAX I/O flow example was a simplistic case because it was a single engine with all the disks attached to it. How about when there are multiple engines, and you are driving I/O to a LUN through all of them and the physical disk containing the data in question is in a remote engine? How does that work?
The above is a pretty busy diagram, so it warrants some explaining.
The Virtual Matrix fabric is the interconnect used by all the VMAX engines; it is how they communicate. The actual interconnect technology is RapidIO.
You will also notice two sub-parts in the cache component: GM is global memory, and SF is the store & forward buffer. Their uses will become clear as the IO sequencing is explained.
To frame the discussion for the IO flow depicted above: it is an example of a read cache miss with the host connected to Engine 3/Director 4, the physical disks containing the data connected to Engine 1/Director 1, and the cache slot for that particular data on Engine 2/Director 3.
The host does a read for some data on Engine 3/Director 4; the VMAX cannot serve the data from its local cache because it is a CACHE MISS, and the following occurs…
1- Since this is a read cache miss, the data has to be retrieved from the disk. The data is read from the disk into the SF (store & forward) buffer of Engine1/Director1’s cache. The SF buffer is used for situations where data needs to be temporarily stored and moved to another director/engine, such as in this example. It is a separate region of the cache, not shared with the GM (global memory) which is used for general purpose cache storage.
2- Through the use of the Virtual Matrix, the data is moved from the SF buffer of Engine1/Director1 to the GM of Engine2/Director3 because, in this example, it is where the cache slot for this data resides. Subsequent reads of this data can then be served from the cache.
3- The data is then moved to the SF buffer of Engine3/Dir4 where the host connectivity resides.
4- The data is moved from the SF buffer of Engine3/Dir4 to the front-end ports and finally to the host.
This is probably the most complicated read example there is: the host connectivity, cache slot and physical disk access are all on separate directors, and it is a read miss. Although there are individual cache components in each director, through the use of the Virtual Matrix the cache is treated as one big pool of memory, allowing the data to be accessed from any director and engine. You can imagine how complicated the IO flows must be with the host performing IO to multiple engines, directors and disks; you develop a respect for how difficult good caching algorithms can be to design and implement.
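The four steps of this read miss can be written out as a small trace function. The function name and the string format are invented for illustration, and the sketch only covers the case described here, where the host, cache slot and disk are on three different directors; what the array does when two of them coincide is not covered in this article, so the function refuses those inputs rather than guessing.

```python
def read_miss_trace(host_dir, slot_dir, disk_dir):
    """Return the data movements for a read miss across three directors.

    SF = store & forward buffer, GM = global memory (cache slot).
    """
    if len({host_dir, slot_dir, disk_dir}) != 3:
        raise ValueError("sketch only models three distinct directors")
    return [
        # Step 1: back-end read into the disk-side director's SF buffer.
        f"disk -> SF[{disk_dir}]",
        # Step 2: across the Virtual Matrix into the director holding the slot.
        f"SF[{disk_dir}] -> GM[{slot_dir}]",
        # Step 3: on to the host-side director's SF buffer
        # (assumed here to be sourced from the GM slot filled in step 2).
        f"GM[{slot_dir}] -> SF[{host_dir}]",
        # Step 4: out through the front-end port to the host.
        f"SF[{host_dir}] -> front-end port -> host",
    ]


trace = read_miss_trace(host_dir="E3/D4", slot_dir="E2/D3", disk_dir="E1/D1")
for step_number, movement in enumerate(trace, start=1):
    print(f"{step_number}- {movement}")
```

Laying the steps out this way makes it easy to see that the data touches three directors' memory before it reaches the host, and that only the GM copy in step 2 is retained for later cache hits.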
In Part2, we will look at some storage design / provisioning concepts, and how the VMAX compares to the Clariion in that regard.