XtremIO is EMC’s foray into the all flash array market. It is a ground up array design built exclusively for flash. I’ll start by giving an overview of the product, its features, and then do a deep dive of the design principles and architecture.
Note: Before reading this article, it would be helpful to read my previous article “A primer on flash and a look into the challenges of designing an all flash array”
The building block of XtremIO is an X-Brick.
An X-Brick is composed of two storage controllers, a DAE holding 25x SSDs, and two battery backup units. Each X-Brick at GA will utilize 25x 400GB eMLC SSDs and provide 10TB of raw capacity. Sometime in early CY2014 we will see a 20TB X-Brick utilizing 25x 800GB eMLC SSDs.
A quick note on the BBUs: there are 2 BBUs when you buy the first X-Brick (for redundancy). Each additional X-Brick requires adding only 1 BBU.
The array scales by adding X-Bricks in a scale-out manner (max of 4 X-Bricks at GA):
The interconnect between all X-Bricks is 40Gbps Infiniband (similar to Isilon, except Isilon uses 10Gbps Infiniband). If we think of Isilon as scale out NAS, XtremIO represents the scale out all flash block category.
Pictured above is an XtremIO node, showing all the connectivity points. The 40Gbps IB ports are for back-end connectivity. From a front-end host protocol perspective, XtremIO supports both 8Gb Fibre Channel and 10Gb iSCSI. Connectivity to the disks is via 6Gbps SAS connections (very similar to a VNX). The ports are standard issue, meaning all units ship with all port types pictured above. As one would expect, all connections, power, etc. are redundant to provide a highly available unit. This is all very par for the course, with no surprises. A few things worth mentioning on the nodes themselves: each node contains 2x SSDs to serve as a dump area for metadata if the node were to lose power. Each node also contains 2x SAS drives to house the operating system. In this way, the disks in the DAE are decoupled from the controllers since they only hold data, and this should facilitate easy controller upgrades in the future when better/faster hardware becomes available. At GA, each node is essentially a dual-socket 1U whitebox server utilizing 2x Intel 8-core Sandy Bridge CPUs and 256GB of RAM, with the aforementioned peripherals.
Speeds & Feeds
From a capacity standpoint, a 10TB X-Brick equates to roughly 7.5TB of usable capacity with no de-duplication savings, and it's not uncommon to see 3:1 (21TB), 5:1 (35TB) and 10:1 (70TB) dedupe rates depending on the data set. VDI would be at the far end of the spectrum (highly de-dupable), while a database would be at the near end (minimal de-dupability in most cases).
From a performance perspective, each X-Brick is rated at 100K 100% 4KB write IOPS, 150K 50/50 read/write 4KB IOPS and 250K 100% read 4KB IOPS. And those numbers scale linearly: with a 2 X-Brick config, just multiply the numbers by 2, and so on. True linear scalability is a hallmark of this platform. It's important to note that these are NOT hero numbers. Rather, these numbers are what customers can expect in production, measured end-to-end on a host backed by an XtremIO system after several write/erase cycles and with the array filled to 80-85%+ capacity. This is in stark contrast to some numbers quoted by all flash vendors that require the array to be brand new with almost no data written on it. Recall from the previous "Flash Primer" article the penalty of overwrites on SSD. The numbers XtremIO quotes are at 80%+ full, which forces overwrite situations to occur, and it is critical that all flash arrays be tested under these conditions as they most accurately represent real world use cases. All of these IOPS numbers are delivered at < 1ms latency. Here is an example of the scaling:
Here is the high-level list of software features, which will be discussed in greater detail in the architecture section.
- Inline Deduplication. Part of the system architecture and lowers effective cost while increasing performance and reliability by reducing write amplification.
- Thin Provisioning. Also part of the system architecture in how writes are managed and carry no performance penalty.
- Snapshots. No capacity or performance penalty due to the data management and snapshot architecture.
- XDP Data Protection. “RAID6” designed for all flash arrays. Low overhead with better than RAID1 performance. No hot spares.
- Full VAAI Integration.
Architecture Deep Dive
XtremIO Core Design Principles and Philosophies:
- Optimize everything for random access (I/O). Accessing any data segment on any node should carry no extra cost compared to any other. This is a critical design criterion to ensure linear and predictable performance and scalability at all times, regardless of the number of nodes in the system.
- Minimize write amplification. As we know from the previous flash primer article, write amplification is an SSD array's worst enemy from both a performance and reliability standpoint. The goal of the XtremIO system is to minimize back-end writes, hence providing write attenuation (fewer back-end writes than front-end writes).
- Today's SSDs increasingly do a superior job of efficient garbage collection, thus there is NO system-wide garbage collection on the XtremIO platform; this is a significant advantage, further minimizing write amplification. I cannot underscore enough how important this is for performance and scalability. By allowing the SSD controllers themselves to manage garbage collection, more engineering time can be spent on the software architecture, data services and other advanced features such as scale out, VAAI, etc.
- Content Based Data Placement. The content IS the address of the data. By decoupling the logical address of the data from the placement of the data, data blocks can be placed anywhere. This further optimizes the system for random access and allows data management techniques which optimize for the peculiar requirements of SSDs themselves.
- A byproduct of the above is even data distribution across the entire system.
- True Active/Active data access. Similar to a VMAX or other true active/active arrays, there is no concept of LUN ownership and all nodes can serve data for any volume without penalty.
- Scalability. Everything scales in a linear and predictable fashion. I.E. Adding X-Bricks scales both performance and capacity.
XtremIO Software Architecture
Under the covers, the system runs on top of a standard Linux kernel, and the XtremIO software, XIOS, executes 100% in userspace. Running entirely in userspace avoids expensive context switching operations, provides for ease of development, and does not require the code to be published under the GPL. An XIOS implemented in the Linux kernel would require the code to be published under the GPL, which poses a problem for a company like EMC that desires to protect its IP.
An XIOS instance called the X-ENV runs on each CPU socket. The CPU and memory are monopolized by XIOS, and running a "top" command in Linux would reveal a single process per CPU socket taking 100% of the resources. This allows XIOS to manage the hardware resources directly, giving it the ability to provide 100% predictable and guaranteed performance, leaving nothing to chance (i.e. an outside process or kernel scheduler impacting the environment unbeknownst to XIOS). An interesting side effect of the software architecture being 100% in userspace is that it COULD allow for movement from Linux to another OS, or from x86 to another CPU, if required. This isn't a likely scenario without some serious mitigating factor, but in fact, it is possible due to the design.
The first thing to note in the architecture is that it is software defined in the sense that it is independent of the hardware itself. This is evidenced by the fact that it took the XtremIO team a very short period of time to transition from their pre-acquisition hardware to EMC's 'whitebox standard' hardware. Things which are NOT in the X-Brick include: FPGAs, custom ASICs, custom flash modules, custom firmware, and so on. This will allow the XtremIO team to take advantage of any x86 hardware enhancements, including speeds/feeds improvements in the HW, density improvements, new interconnect technologies, etc., without much hassle. XtremIO really is a software product delivered in an appliance form factor. While there is nothing preventing XtremIO from delivering a software-only product, it would be encumbered with the same challenges that all the other software-only storage distributions on the market face, namely the difficulty of guaranteeing predictable performance and reliability when unknown hardware is utilized. That being said, if enough customers demand it, who knows what could happen; but today, XtremIO is delivered as HW+SW+EMC Support.
There are 6 software modules responsible for various functions in the system. The first 3 (R,C,D) are data plane modules and the last 3 (P,M,L) are control plane modules.
P – Platform Module. This module is responsible for monitoring the hardware of the system. Each node runs a P-module.
M – Management Module. This module is responsible for system wide configurations. It communicates with the XMS management server to perform actions such as volume creation, host LUN masking, etc from the GUI and CLI. There is one active M-module running on a single node, and the other nodes run a stand-by M-module for HA purposes.
L – Clustering Module. This module is responsible for managing cluster membership state, joining the cluster, and other typical cluster functions. Each node runs an L-module.
R – Routing Module. This module is the SCSI command parser and translates all host SCSI commands into internal XtremIO commands/addresses. It is responsible for the 2 FC and 2 iSCSI ports on the node and functions as the ingress/egress point for all I/O on the node. It is also responsible for breaking all I/O into 4KB chunks and calculating the data hash values via SHA-1. Each node runs an R-module.
C – Control Module. This module contains the address-to-hash mapping table (A2H), which is the first layer of indirection that allows much of the "magic" to happen. Many of the advanced data services such as snapshots, de-duplication, thin provisioning, etc. are all handled in this module.
D – Data Module. The data module contains the hash-to-physical (H2P) SSD address mapping. It is also responsible for doing all of the I/O to the SSDs themselves, as well as managing the data protection scheme, called XDP (XtremIO Data Protection).
The function of these modules along with the mapping tables will be clearer after reviewing how I/O flows through the system.
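Before walking through the flows, the R-module's chunk-and-hash step can be sketched in a few lines of Python. This is purely illustrative (the function name and data layout are my own, not XIOS code); the only facts taken from the text are the 4KB chunk size and the use of SHA-1.

```python
import hashlib

CHUNK_SIZE = 4096  # XtremIO operates on 4KB chunks

def chunk_and_hash(data: bytes):
    """Split an incoming write into 4KB chunks and fingerprint each
    one with SHA-1, roughly what the R-module does on ingest."""
    chunks = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        chunks.append((offset, hashlib.sha1(chunk).hexdigest()))
    return chunks

# Identical chunks yield identical fingerprints, which is what makes
# content-based placement and inline dedupe possible.
hashes = chunk_and_hash(b"A" * 4096 + b"B" * 4096 + b"A" * 4096)
assert hashes[0][1] == hashes[2][1]   # same content, same hash
assert hashes[0][1] != hashes[1][1]   # different content, different hash
```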
Read I/O flow
Stepping through this diagram, a host first issues a read command for a logical block address via the FC or iSCSI ports. It is received by the R-module, which breaks the requested address range into 4KB chunks and passes them along to the C-module. To read the 4KB chunk at address 3 (address 3 was just picked as an example 4KB address to read), the C-module does a lookup and sees that the hash value for the data is H4. It then passes this to the D-module, which looks up the hash value H4 in its hash-to-physical address lookup table and reads physical address D from the SSD.
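That two-table lookup amounts to a pair of dictionary reads. The table contents below mirror the example in the text (address 3 maps to hash H4, which maps to physical address D); everything else is an illustrative stand-in, not actual XIOS code.

```python
a2h = {3: "H4"}                # C-module: logical address -> content hash
h2p = {"H4": "D"}              # D-module: content hash -> physical address
ssd = {"D": b"\x00" * 4096}    # stand-in for the physical media

def read_4k(address):
    """Follow a read through both layers of indirection."""
    content_hash = a2h[address]    # C-module: A2H lookup
    physical = h2p[content_hash]   # D-module: H2P lookup
    return ssd[physical]           # read the 4KB block from SSD

block = read_4k(3)
assert len(block) == 4096
```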
Write I/O of unique data:
This is an example flow for writing a unique 4KB data segment to the array. The host issues a write I/O via FC or iSCSI. This is picked up by the R-module, which breaks the I/O into 4KB chunks and calculates the hash for each 4KB chunk. For the purposes of this illustration we are following just a single 4KB chunk through the system. The R-module hashes this 4KB of data, produces a hash value of H5 and passes this to the C-module. The hash H5 represents unique data, so the C-module places it in its address mapping table at address 1. It then passes the I/O to the D-module, which assigns H5 the physical address D and writes the 4KB of data to the SSD at that physical address.
Write I/O of Duplicate Data:
As with the previous write example, the host issues a write via FC or iSCSI. This is picked up by the R-module, which breaks the I/O into 4KB chunks and calculates the hash of each 4KB chunk. As before, we will follow a single 4KB chunk through the system for the sake of simplicity. In this case the R-module calculates the hash of the 4KB chunk to be H2 and passes it to the C-module. The C-module sees that this data already exists at address 4 and passes this to the D-module. Since the data already exists, the D-module simply increments the reference count for this 4KB of data from 1 to 2. No I/O is done to the SSD.
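Both write flows can be modeled together in a small toy sketch (the class and table names are mine, not XIOS internals): a unique chunk allocates a physical address and costs one back-end write, while a duplicate chunk only increments a reference count.

```python
import hashlib

class ToyArray:
    """Toy model of the XtremIO write path: A2H and H2P tables plus a
    per-hash reference count. Illustrative only."""
    def __init__(self):
        self.a2h = {}        # logical address -> content hash (C-module)
        self.h2p = {}        # content hash -> physical address (D-module)
        self.refcount = {}   # content hash -> number of logical references
        self.ssd = {}        # physical address -> 4KB payload
        self.ssd_writes = 0  # back-end I/Os, counted for illustration

    def write_4k(self, address, chunk):
        h = hashlib.sha1(chunk).hexdigest()
        self.a2h[address] = h
        if h in self.h2p:
            # Duplicate data: metadata-only update, no SSD I/O.
            self.refcount[h] += 1
        else:
            # Unique data: allocate a physical address and write once.
            physical = len(self.ssd)
            self.h2p[h] = physical
            self.ssd[physical] = chunk
            self.refcount[h] = 1
            self.ssd_writes += 1

array = ToyArray()
array.write_4k(1, b"X" * 4096)   # unique: one back-end write
array.write_4k(4, b"X" * 4096)   # duplicate: refcount 1 -> 2, no SSD I/O
assert array.ssd_writes == 1
assert array.refcount[array.a2h[4]] == 2
```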
We can see from the write I/O flows that both thin provisioning and de-duplication aren't really features, but rather byproducts of the system architecture, because we only write unique 4KB segments as a design principle. The system was truly designed from the ground up with data reduction for SSDs in mind. The de-duplication happens inline, 100% in memory, with zero back-end I/O. There is no "turning off de-duplication" (since it's a function of how the system does writes), and it carries no penalties; in fact it boosts performance by providing write attenuation.
How about a simple use case such as copying a VM?
Stepping through the copy operation, the ESXi host issues a VM copy utilizing VAAI. The R-module receives the command via the FC or iSCSI ports and selects a C-module to perform the copy. Address range 0-6 represents the VM. The C-module recognizes the copy operation and simply does a metadata copy of address range 0-6 (the original VM) to a new address range 7-D (the new VM), then passes this along to the D-module. The D-module recognizes that the hashes are duplicates and simply increments the reference counts for each hash in the table. No back-end SSD I/O is required. In this way, the new VM (represented by address range 7-D) references the same 4K blocks as the old VM (represented by address range 0-6).
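A sketch of that metadata-only clone, using a tiny 2-block "VM" for brevity (the table and hash names here are illustrative, not XtremIO's real structures):

```python
# Metadata tables before the copy: a 2-block "VM" at addresses 0-1.
a2h = {0: "H1", 1: "H2"}          # logical address -> content hash
refcount = {"H1": 1, "H2": 1}     # content hash -> reference count

def vaai_copy(src_range, dst_range):
    """Clone by copying metadata only: each destination address points
    at the same hash as its source, and the hash's reference count is
    incremented. No SSD read or write happens anywhere in this loop."""
    for src, dst in zip(src_range, dst_range):
        h = a2h[src]
        a2h[dst] = h
        refcount[h] += 1

vaai_copy([0, 1], [2, 3])          # "copy the VM" to addresses 2-3
assert a2h[2] == a2h[0]
assert refcount == {"H1": 2, "H2": 2}
```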
The key thing to note throughout these I/O flows is that all metadata operations are done in memory. To protect the metadata, a sophisticated journaling mechanism uses RDMA to transfer metadata changes to remote controller nodes and hardens the journal updates to SSDs in the drive shelves using XDP.
The magic behind XtremIO is all in the metadata management and manipulation. It’s also worth noting that the data structures utilized for the A2H and H2P tables are much more complicated than depicted above and have been simplified in the illustrations for the purposes of understanding the I/O flows.
The second thing to note is that the D-module is free to write the data anywhere it sees fit since there is no coupling between a host disk address and back-end SSD address thanks to the A2H and H2P tables. This is further optimized since the content of the data becomes the “address” for lookups since ultimately the Hash Value is what determines physical disk location via the H2P table. This gives XtremIO tremendous flexibility in performing data management and optimizing writes for SSD.
With an understanding of the relationship between the R, C and D modules and their functions, the next thing to look at is how exactly they communicate with each other.
The first thing to understand is how the modules are laid out on the system. As discussed previously, each node has 2 CPU sockets and an XIOS instance runs on each socket in usermode. We can see from the above that each node is configured very specifically, with R and C running on one socket and D running on the other. The reason for this has to do with the Intel Sandy Bridge architecture, which has an integrated PCIe controller tying every PCIe adapter directly to a CPU socket. Thus, on a system with multiple CPU sockets, performance will be better when utilizing the local CPU socket to which the PCIe adapter is connected. The R,C,D module distribution was optimized based on field testing. For example, the SAS card is connected to a PCIe slot attached to CPU socket 2, so the D-module runs on socket 2 to optimize SSD I/O performance. This is a great example of how, while a software storage stack like XtremIO is hardware independent and could be delivered as a software-only product, there are optimizations for the underlying hardware which must be taken into consideration. The value of understanding the underlying hardware applies not only to XtremIO but to all storage stacks. These are the types of things you do NOT want to leave to chance or for an end-user to make decisions on. Never confuse hardware independence with hardware knowledge and optimization; there is great value in the latter. The great thing about the XIOS architecture is that since it is hardware independent and modular, XIOS can easily take advantage of hardware improvements as they arrive.
Moving on to the communication mechanism between the modules, we can see that no preference is given to locality of modules. Meaning, when the R-module selects a C-module, it does not prefer the C-module local to itself. All communication between the modules is done via RDMA or RPC (depending on whether it is a data path or control path communication) over Infiniband. The total latency budget for I/O in an XtremIO system is 600-700µs, and the overhead of the Infiniband communication is 7-16µs. The result of this design is that as the system scales, latency does NOT increase. Whether there is 1 X-Brick or 4 X-Bricks (or more in the future), the latency for I/O remains the same since the communication path is identical. The C-module selection by the R-module is done utilizing the same calculated data hashes, which ensures a completely random distribution of module selection across the system, done for each 4K block. For example, if there are 8 controllers in the cluster with 8x R,C,D modules, communication happens between all of them evenly. In this way, every corner of the XtremIO box is exercised evenly and uniformly with no hot spots. Everything is very linear, deterministic and predictable. If a node fails, the performance degradation can be predicted, the same as the performance gain when adding node(s) to the system.
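One way to picture hash-based module selection: derive the target module from the chunk's own content hash. The actual XIOS mapping is internal and undocumented, so this is only a sketch of why that scheme spreads 4K blocks uniformly across the cluster.

```python
import hashlib
import os
from collections import Counter

NUM_C_MODULES = 8  # e.g. 4 X-Bricks with 8 controllers

def select_c_module(chunk: bytes) -> int:
    """Pick a C-module from the chunk's content hash. Because SHA-1
    output is uniform, blocks spread evenly over the modules."""
    return hashlib.sha1(chunk).digest()[0] % NUM_C_MODULES

# With random 4KB chunks, each module receives roughly 1/8 of the load.
counts = Counter(select_c_module(os.urandom(4096)) for _ in range(8000))
assert all(800 < counts[m] < 1200 for m in range(NUM_C_MODULES))
```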
XDP (XtremIO Data Protection)
A critical component of the XtremIO system is how it does data protection. RAID5? RAID6? RAID 10? None of the above. It uses a data protection scheme called XDP which can be broadly thought of as “RAID6 for all flash arrays” meaning it provides double parity protection but without any of the penalties associated with typical RAID6.
The issue with traditional RAID6 applied to SSDs is that as random I/O comes into the array, forcing updates/overwrites, the 4K block(s) need to be updated in place on the RAID stripe, and this causes massive amounts of write amplification. This is exactly the situation we want to avoid. For example: in a RAID6 stripe, if we want to update a single 4K block we have to read that 4K block plus two 4K parity blocks (3 reads), then calculate new parity and write the new 4K block and two new 4K parity blocks (3 writes). Hence for every 1 front-end write I/O we have 3 back-end write I/Os, giving us a write amplification of 300%, or said another way, a 3x overhead per front-end write. The solution to this problem is to never do in-place updates of 4K blocks, and this is the foundation of XDP. Because there is an additional layer of indirection via the A2H and H2P tables, XtremIO has complete freedom (within reason) on where to place the physical block despite the application updating the same address. If an application updates the same address with different 4K content, a new hash will be calculated and thus the 4K block will be put in a different location. In this way, XtremIO can avoid any update-in-place operations. This is the power of content-aware addressing, where the data is the address. It should also be noted that being able to write data anywhere is not enough by itself; it is this coupled with flash that makes the architecture feasible, since flash is a random access medium with no latency penalty for random I/O, unlike an HDD with physical heads. The previously described process is illustrated below.
The basic principle of XDP is that it follows the above write I/O flow and then waits for multiple writes to come into the system, "bundling" those writes together and writing a full stripe to the SSDs, thus amortizing the cost of a 4K update over multiple updates to gain efficiencies and lower the write overhead. The I/O flow is exactly the same as the "Write I/O of Unique Data" illustrated in the previous section, except that XDP simply waits for multiple I/Os to come into the system to amortize the write overhead cost. One thing to note is that in the example a 2+2 stripe was used. In practice, the stripe size is dynamic and XtremIO looks for the "emptiest" stripe when writing data. The 23+2 stripes will run out quickly (due to the holes created by "old" blocks, denoted by the white space in the figure; these holes will be overwritten by XDP to facilitate large stripe writes, but the math behind this is complex and beyond the scope of this article). However, even if only a 10+2 stripe is found and used, the write amplification/overhead is reduced from 300% (3 back-end writes for every 1 front-end write) to 20% (12 back-end writes for every 10 front-end writes), and this is what XtremIO conservatively advertises as overhead:
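The overhead arithmetic from the last two sections can be checked in a few lines (a worked calculation of my own, not vendor code):

```python
def write_overhead(data_blocks: int, parity_blocks: int = 2) -> float:
    """Back-end writes per front-end write when dual parity is
    amortized over a full stripe of k fresh data blocks."""
    return (data_blocks + parity_blocks) / data_blocks

# Traditional RAID6 update-in-place: 3 back-end writes (data + 2 parity)
# for every single front-end write, i.e. 3x amplification.
raid6_update = 3 / 1

assert write_overhead(10) == 1.2   # 10+2 stripe: the advertised 20% overhead
assert write_overhead(23) < 1.1    # full 23+2 stripe: under 9% overhead
assert abs(raid6_update / write_overhead(10) - 2.5) < 1e-9  # 2.5x fewer writes
```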
However, in practice, even on an 80% full system, it is likely that stripe sizes much larger than 10+2 will be found, leading to even less than a 20% write overhead. Even at 20%, the write overhead is not only less than RAID6 but also much less than RAID1, all while providing dual parity protection. So: better than RAID1 performance with the protection of RAID6, which in a nutshell sums up XDP.
XtremIO as it exists today is not a general-purpose array intended to replace non-trivial amounts of capacity. The product is currently focused on workloads which have relatively low capacity requirements but high performance requirements: VDI, databases, and SAP business applications that need low response times and high IOPS. For the database use case, the fast performance and the ability to make database copies at no cost give customers huge flexibility to make as many copies as they want and leverage the dedupe. Architectures already exist to back 2,500-3,500 VDI users on a single X-Brick with sub-1ms latency. Customers that purchased this product during the DA phase are already using it to accelerate their critical applications, and it is making a significant difference to their business. Essentially, if you can fit the workload on the box given the current capacities, performance is essentially ungated and will speed up any application.
Hopefully this provides some insight into not only the hardware and software features of the XtremIO box, but also a look into the architecture. In my role I see many, many new technologies come across my desk, and what I can say about XtremIO is that I am extremely impressed with the elegant design of the system and the team building the product. It's a technology I am genuinely excited to talk to customers about. It is a well thought out architecture with some great ground-up thinking, and the team has really considered not only the current requirements but future requirements as well.
I have been reading a lot of "chatter" on the interwebs criticizing the product for being late or lacking certain features. The reality is that creating a storage product like this is hard. EMC acquired this company in May 2012 with no GA product, and that is relatively little time in storage development land to bring something quality to market. Keeping that in mind, make no mistake: this is an initial product release, and EMC is laser focused on delivering a reliable, enterprise-ready product. Their first and foremost priority is successful customer implementations, because that's what matters at the end of the day. Given that choice versus rushing a product out the door with features that haven't been fully QA'd, EMC made the right decision in my opinion. It's also important to note that there are already a significant number of systems out there in customer environments. I think this platform is going to be very successful for EMC and provide massive customer benefit.