A primer on flash and a look into the challenges of designing an all flash array

It is no secret that flash is changing the storage landscape dramatically. While all-flash arrays are only beginning to become common in the marketplace, customer interest in the technology is ramping up rapidly, due in no small part to the plethora of all-flash array products from both startups and incumbents.

The performance benefits of flash are fairly well understood. It is a random access media type that can read any address range with no latency penalty. An HDD, by contrast, has physical moving components that must be repositioned when reading from different address ranges, which significantly increases latency for random I/O. The first obvious question is: why haven’t existing array manufacturers simply removed all the drives and replaced them with flash? The answer is that while flash as a medium provides tremendous benefits, it also has challenges that are unique to the physics of the device. I commonly see it touted that an array is “designed for flash”, along with statements that “legacy arrays cannot be adequately retrofitted with flash”, but no one seems to discuss why this is the case. Before delving into this any further, it’s important to understand the basics of flash memory.

Flash Memory Operation

The basic building block of all SSDs is the single flash memory cell, pictured below:

Flash Memory Cell

The basic principle behind using a flash memory cell as a storage device is that the floating gate transistor (pictured above) can store electric charge. If there is electric charge in the floating gate, we say the cell is programmed, and if there is no charge we say it is reset or erased. We assign a programmed cell a binary value of 0 and a non-programmed cell a binary value of 1. In this way, the flash cell can be used as a storage device.

Flash Cell States (SLC)

From each flash cell, we can read a 0 or 1, and we can tunnel charge into or out of the floating gate, thereby changing its assigned bit value from 0 to 1 and vice versa. Each flash cell stores exactly 1 bit of data (0 or 1), and if we package and wire together many flash cells we end up with a high capacity storage device such as a 256GB SSD.

***Note: for the purposes of this article we are discussing SLC flash. MLC flash operates on the same basic principle, except that there are multiple charge states rather than just charge/no charge. MLC flash allows for different “levels” of charge, and this is what allows it to store multiple bits in a single cell. Charge Level 1 could represent “01”, Charge Level 2 could represent “10”, Charge Level 3 could represent “11”, and so on. The most common MLC flash today stores 2 bits per cell, which requires 2^2 = 4 distinguishable charge levels.
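The relationship between bits per cell and charge levels can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name `charge_levels` is made up for this example.

```python
def charge_levels(bits_per_cell: int) -> int:
    """Distinct charge states a cell must hold to encode the given number of bits."""
    if bits_per_cell < 1:
        raise ValueError("a cell stores at least one bit")
    return 2 ** bits_per_cell


# SLC needs only "charge" and "no charge"; 2-bit MLC needs four levels.
for name, bits in [("SLC", 1), ("2-bit MLC", 2)]:
    print(f"{name}: {bits} bit(s) per cell, {charge_levels(bits)} charge levels")
```

The more levels a cell has to distinguish, the narrower the margin between them, which is one reason this article sticks to SLC for the explanation.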

Since the basic premise behind SSDs requires electrical charge, how is it a non-volatile device that is able to store data with no power? The answer has to do with the construction of the flash cell: in the picture above, the yellow sections above and below the floating gate are insulators. This means there is no way for charge to “escape” into or out of the floating gate on its own. Essentially, the yellow insulators keep the charge states of the floating gates intact, allowing an SSD to lose power and still retain data.

How to Read & Write to Flash Memory

A read is performed by applying a voltage vREAD to the control gate and checking whether current flows through the transistor from the source (S) to the drain (D). If there is charge in the floating gate, no current will flow, and if there is no charge in the floating gate, current will flow. Thus, by determining whether current can flow through the transistor, we can determine if the flash cell is programmed or unprogrammed and thus “read” the bit value of the cell as being 0 or 1:

Flash Memory Read

Writing to the flash cell is done by applying a high voltage, vWRITE (much higher than vREAD), to the control gate. This causes the electrons to tunnel from the silicon substrate (red) to the floating gate (blue), hence giving the floating gate charge. This changes the bit value from 1 to 0.

Erasing the flash cell is done by applying a high voltage, vERASE (again, much higher than vREAD), to the silicon substrate (red). This causes electrons to tunnel from the floating gate (blue), back to the substrate (red). This changes the bit value from 0 to 1.

Flash Memory Write/Erase

The tunneling of the electrons to/from the floating gate using high voltages is what is responsible for flash “wear”. The insulating oxide gradually wears over time and after a number of program/erase cycles, fails to do its job and it becomes impossible to determine the state of a cell.
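The read, write and erase behaviour described above can be captured in a toy model. This is a sketch for intuition only, not a model of real device physics: the class name `FlashCell` and the tiny `endurance` limit are assumptions made up for this example.

```python
class FlashCell:
    """Toy SLC cell: charge in the floating gate means programmed, read as 0."""

    def __init__(self, endurance: int = 3):
        self.charged = False        # a fresh cell is erased (no charge), so it reads 1
        self.pe_cycles = 0          # completed program/erase cycles
        self.endurance = endurance  # hypothetical cycle limit, far smaller than real parts

    def read(self) -> int:
        # Applying vREAD: current flows only when the floating gate holds no charge.
        current_flows = not self.charged
        return 1 if current_flows else 0

    def program(self) -> None:
        # Applying vWRITE tunnels electrons into the floating gate (bit 1 -> 0).
        self.charged = True

    def erase(self) -> None:
        # Applying vERASE tunnels electrons back to the substrate (bit 0 -> 1).
        self.charged = False
        self.pe_cycles += 1

    @property
    def worn_out(self) -> bool:
        return self.pe_cycles >= self.endurance


cell = FlashCell()
print(cell.read())    # erased cell
cell.program()
print(cell.read())    # programmed cell
cell.erase()
print(cell.pe_cycles, cell.worn_out)
```

Counting a cycle on each erase is a simplification; the point is only that every program/erase pair consumes part of a finite budget.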

Flash Memory Wiring & Packaging

Now that we understand how to read, write and erase a flash cell, let’s see how multiple flash cells are packaged and wired together to ultimately deliver what we find in today’s SSDs. Today’s SSDs are based on NAND flash packaging, and that will be the focus of the section below.

NAND Flash String

Pictured above is a NAND flash string. It is essentially a group of 32 flash cells wired together in series. Thus each flash string represents 32 bits of data (0s and 1s).

A group of flash strings is then wired together on the same substrate to form a flash block. What we end up with is the matrix-like configuration pictured below:

NAND Flash Block

Each row contains 32768 flash cells and represents a flash page, thus each flash page is 32768 bits = 4096 bytes = 4KB. The important thing to note is that the control gates of all the flash cells in a row are tied together, and since we need to apply voltage to the control gate to perform read/write operations, we do not operate on one cell at a time, but rather on a row (4KB page) of cells at a time. This is the reason why SSDs can only be read from and written to in 4KB pages and not cell by cell: since applying voltage to a control gate affects the entire row of cells, it makes sense to perform read/write operations on that entire row at a time for efficiency purposes. While there is nothing inherent in the NAND flash design that prevents cell by cell operation, the circuitry would be more complex and cumbersome in an already constrained packaging environment.

As previously discussed, the method to erase is to apply a very high voltage to the silicon substrate. However, as we see from the green in the NAND Flash Block diagram, the substrate is shared by all of the cells in a flash block. This means we cannot erase just one 4KB page; rather, an entire block of flash cells must be erased. In this case a block is 32 pages (the number of rows in the diagram, since each row is a page), and since each page is 4KB in size, 32 * 4KB = 128KB of flash memory needs to be erased at once.
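The page and block sizes follow directly from the example geometry used in this article (32768 SLC cells per row, 32 rows per block). A short sketch of the arithmetic, with constant names chosen just for this illustration:

```python
CELLS_PER_PAGE = 32768   # cells in one row (one page)
PAGES_PER_BLOCK = 32     # rows in one block
BITS_PER_CELL = 1        # SLC, as assumed throughout this article

PAGE_BYTES = CELLS_PER_PAGE * BITS_PER_CELL // 8
BLOCK_BYTES = PAGE_BYTES * PAGES_PER_BLOCK

print(f"page size:  {PAGE_BYTES} bytes ({PAGE_BYTES // 1024}KB)")
print(f"block size: {BLOCK_BYTES} bytes ({BLOCK_BYTES // 1024}KB)")
```

Other NAND parts use different values, so treat these constants as the article's running example rather than a fixed property of flash.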

***Note: I have omitted certain details in order to keep the NAND flash principles easy to understand. As an example, there is much more complexity being managed in how and where voltages are applied in certain situations for read/program/erase operations. One such detail worth mentioning is how the charges actually move into and out of the floating gate, since the floating gate itself has no direct electrical connection and is surrounded by an insulator. This is actually a quantum physics phenomenon called Fowler-Nordheim tunneling. It is also worth noting that page size, block size and other elements are dependent upon the NAND flash manufacturer. However, the numbers used above are commonly found in the field.

Considerations & Challenges for All Flash Arrays

While it’s true that flash is a random access media device, we know based on the above that the granularity for reading/writing to flash is at the 4KB page level, not at the bit level. This typically does not pose any problems since it’s a friendly size (or multiple) for most application and operating system I/O. However, because we cannot change any single cell in a 4KB page in flash (since the control gates of every cell in a 4KB page are tied together), all writes must be to a clean/empty 4KB page, and thus we cannot overwrite a 4KB page without first erasing it. This challenge is further exacerbated by the fact that if we do have to erase, we can only do so at 128KB granularity, since we have to erase an entire flash block, which comprises 32 4KB pages. This is a major difference from HDD I/O, where we can overwrite any portion of the HDD, regardless of whether it holds data, without an “erase penalty.”
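These two constraints (program only a clean page, erase only a whole block) can be expressed as a small model. It is a sketch under the article's example geometry; the class `FlashBlock` and its method names are invented for illustration and do not correspond to any real SSD interface.

```python
class FlashBlock:
    """Toy NAND block: pages are programmed individually but erased together."""

    def __init__(self, pages: int = 32):
        self.pages = [None] * pages  # None marks a clean (erased) page

    def program(self, index: int, data: bytes) -> None:
        # A page can only be written while it is clean; there is no overwrite.
        if self.pages[index] is not None:
            raise ValueError("page must be erased before it can be programmed again")
        self.pages[index] = data

    def erase(self) -> None:
        # The shared substrate means every page in the block is wiped at once.
        self.pages = [None] * len(self.pages)


block = FlashBlock()
block.program(0, b"first version")
try:
    block.program(0, b"second version")
except ValueError as err:
    print("overwrite rejected:", err)
```

An HDD model would need neither the check in `program` nor the block-wide `erase`, which is the difference the paragraph above describes.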

To “update” a 4KB page in place, 32 4KB pages must be read into memory, 32 4KB pages must be erased on the NAND device, and then 32 4KB pages must be re-written to the NAND device, including the updated 4KB page. This is obviously an expensive operation to perform whenever a 4KB page needs to be updated and is an example of write amplification. Write amplification is any behavior that causes multiple back-end I/Os to be done for a single front-end/host write I/O. Write amplification can slow down I/O as well as accelerate wear on the SSDs. In an all flash array, one of the primary objectives is to minimize write amplification, although sometimes it is unavoidable. Ideally, the goal should be write attenuation: not only to maximize performance (by minimizing back-end I/O) but also to minimize the number of program/erase cycles (which are limited in an SSD).
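The cost of the read-erase-write cycle can be made concrete with a short sketch. It models a block as a plain Python list and counts how many pages are physically programmed for one host write; the function name `update_in_place` is made up for this example.

```python
PAGES_PER_BLOCK = 32


def update_in_place(block: list, page_index: int, new_data: str) -> int:
    """Read-erase-write one block; return the number of pages programmed."""
    buffer = list(block)              # read all 32 pages into memory
    buffer[page_index] = new_data     # apply the host's single-page update
    for i in range(len(block)):       # erase the whole block
        block[i] = None
    for i, data in enumerate(buffer):  # re-write every page
        block[i] = data
    return len(buffer)


block = [f"page-{i}" for i in range(PAGES_PER_BLOCK)]
host_pages_written = 1
flash_pages_written = update_in_place(block, 5, "updated")
print("write amplification:", flash_pages_written / host_pages_written)
```

With this geometry a single 4KB host write turns into 32 page programs plus a block erase, which is why avoiding in-place updates matters so much.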

Many SSD controllers and flash arrays recognize this and do everything they can to perform all writes to clean 4KB pages to avoid the read-erase-write cycle. This, however, creates a problem of 4KB page fragmentation, which necessitates garbage collection processes to rearrange data and provide contiguous space for new writes. The garbage collection process itself causes I/O and, if not carefully managed, can interfere with host I/O to the SSDs, leading to performance issues.
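A minimal sketch of this idea is shown below: every write goes to the next clean page, a map tracks where each logical page currently lives, and a greedy garbage collector reclaims the block with the fewest valid pages when clean space runs out. The class `TinyFTL`, its tiny geometry and the greedy victim policy are all assumptions for illustration; real controllers are far more sophisticated.

```python
import random


class TinyFTL:
    """Toy translation layer: out-of-place writes plus greedy garbage collection."""

    def __init__(self, num_blocks: int = 4, pages_per_block: int = 4):
        self.ppb = pages_per_block
        self.blocks = [[] for _ in range(num_blocks)]  # programmed pages per block
        self.map = {}      # logical page -> (block, slot) of the current copy
        self.active = 0    # block currently receiving writes
        self.host_writes = self.flash_writes = self.erases = 0

    def _valid(self, b: int) -> list:
        # A stored page is valid only if the map still points at this exact slot.
        return [lpn for slot, lpn in enumerate(self.blocks[b])
                if self.map.get(lpn) == (b, slot)]

    def _program(self, lpn: int) -> None:
        blk = self.blocks[self.active]
        blk.append(lpn)
        self.map[lpn] = (self.active, len(blk) - 1)
        self.flash_writes += 1

    def _advance(self) -> None:
        free = [b for b, blk in enumerate(self.blocks) if not blk]
        self.active = free[0]
        if len(free) > 1:
            return
        # We just took the last clean block, so reclaim one before continuing.
        others = [b for b in range(len(self.blocks)) if b != self.active]
        victim = min(others, key=lambda b: len(self._valid(b)))
        survivors = self._valid(victim)
        if len(survivors) == self.ppb:
            raise RuntimeError("no reclaimable space: every page is still valid")
        for lpn in survivors:          # relocation is extra back-end I/O
            self._program(lpn)
        self.blocks[victim] = []       # erase the whole victim block
        self.erases += 1

    def write(self, lpn: int) -> None:
        if len(self.blocks[self.active]) == self.ppb:
            self._advance()
        self.host_writes += 1
        self._program(lpn)


ftl = TinyFTL()
rng = random.Random(0)
for _ in range(200):
    ftl.write(rng.randrange(8))   # 8 logical pages on 16 physical pages
print("erases:", ftl.erases)
print("write amplification:", ftl.flash_writes / ftl.host_writes)
```

The relocation loop inside `_advance` is the garbage collection I/O mentioned above: it competes with host writes and pushes write amplification above 1 whenever the victim block still holds valid data.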

Another challenge is that standard parity RAID data protection schemes are unsuitable because they do the exact thing we want to avoid — update data in place, which causes write amplification. Ideally we would like to have the protection offered by RAID-5/RAID-6 (single & double drive failure protection) without any of the write amplification penalties.
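To see why an in-place parity update is costly, consider the classic small-write path for single-parity RAID: read the old data, read the old parity, then write the new data and the new parity. The sketch below models one stripe in memory; the helper names `parity` and `small_write` are invented for this example, and the I/O counts describe the textbook read-modify-write case rather than any particular array.

```python
from functools import reduce


def parity(chunks: list) -> bytes:
    """XOR all chunks of a stripe together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def small_write(stripe: list, old_parity: bytes, index: int, new_data: bytes):
    """Update one chunk in place and return (new_parity, io_counts)."""
    old_data = stripe[index]                     # read 1: old data
    # read 2 is the old parity, passed in here as old_parity
    new_parity = bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))
    stripe[index] = new_data                     # write 1: new data
    # write 2 is the new parity, returned to the caller
    return new_parity, {"reads": 2, "writes": 2}


stripe = [bytes([1, 2]), bytes([3, 4]), bytes([5, 6])]
p = parity(stripe)
p, io = small_write(stripe, p, 1, bytes([9, 9]))
print(io, p == parity(stripe))
```

Each host write becomes two back-end writes that overwrite existing locations, and on flash each of those overwrites is subject to the erase-before-write constraint described earlier.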

Cost is of course another challenge given SSD pricing compared to HDD.

All of these things, combined with reliability concerns due to the limited P/E (program & erase) cycles of NAND flash, pose serious challenges for building reliable, robust and cost-effective all flash arrays for enterprise use. These are the reasons why you simply cannot take an existing legacy array architecture and replace HDDs with SSDs. Although SSDs are simpler devices due to a lack of moving components, data management is actually much more complicated on an SSD than on an HDD. Careful thinking around data and flash management is needed to properly utilize SSDs for long term reliable use while extracting the maximum performance from the system.

Categories: flash, storage

9 replies

  1. Excellent information about Flash internals for a common man to understand. Thanks

  2. Clear and easy to consume. Thanks!

  3. What a wonderful post! Had been looking for something like this for quite some time. I appreciate the effort you have put to make people understand the basics, without over complicating it. Congratulations mate. Looking forward to your next post. Cheers.

  4. A very well written post; giving a good understanding of how Flash works. Congratulations!

  5. Easy and very simple you explained.. Good job .. Keep it up..

  6. Pretty good article explaining in as much simple way as possible. Thanks!

  7. micro explanation on Flash. Thanks VJswami

