NVMe for DBAs
What is NVMe?
Solid state technology came into the enterprise storage market many years ago and massively disrupted database performance and design. Storage technology went from being optimized for large, sequential workloads to small, random workloads. This was great for transactional databases, but less effective for data-crunching workloads like analytics, which prefer to process large sequential chunks.
After an amazing introduction into storage appliances, solid state quickly hit a performance wall. The chips were getting faster and denser, but the controllers and access systems, designed for disk speeds, were holding them back. This is where NVMe comes in.
NVMe is not a change to the solid state media (chips). NVMe is a change to the drivers and controllers which access those chips. The original (SATA/SAS) access methods are effectively single-queue, carry a thick software stack and do not provide methods for managing garbage collection (a primary challenge with solid state performance). So, NVMe is about changing the access to the existing solid state media (chips) in order to unleash solid state’s true potential. NVMe is a fully re-written I/O stack built around many parallel command queues, with advanced APIs to handle new solid state considerations like garbage collection.
With NVMe comes a whole new level of performance from the same chips. Additionally, due to the thinner stack, NVMe can lead to up to a 50% decrease in CPU consumption for I/O processing. That’s a lot of CPU cycles going back to the database. And, as the bandwidth potential has now been unleashed, solid state can now be deployed effectively for analytics.
Of note, NVMe is a DAS (or local) technology. This means that it can only be used with local drives. To get HA, larger capacities, simplicity of scale, data features like snapshots, etc., you’ll need a SAN that has been designed for this new NVMe technology. But be aware that not all SAN vendors are implementing NVMe in the same fashion. So, get to know how your vendor is handling the important pieces, like writes.
Focus on the writes
While solid state technology is an amazing addition to the database design toolset, it is important to understand its characteristics and thus, how to deploy it.
First, a solid state chip can only do a read or a write at any one time. So, writes block reads and reads block writes. This makes management of the read/write activity crucial. This is one of the main reasons why you see reads slow down (higher latency and/or lower IOPS/bandwidth) as writes are introduced to a solid state system. This can also lead to unmanaged DAS (local) NVMe not being as fast as managed SAN-based NVMe. The SAN provides a layer that manages the mixing of reads and writes, whereas a database sends I/O however and whenever it wants to local NVMe drives, with nothing catching the I/Os and sorting them for optimal performance. Most databases have writes, so make sure any testing that’s being done includes the full write workload you expect to see in production.
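The effect of writes blocking reads can be illustrated with a toy queueing model. This is only a sketch under stated assumptions: a single flash die that serves one operation at a time in arrival order, with made-up service times (100 µs per read, 1,000 µs per write) and a made-up arrival rate. None of these numbers come from a real device, and the function name is hypothetical.

```python
import random

def mean_read_latency_us(write_fraction, n_ops=20000, read_us=100.0,
                         write_us=1000.0, arrival_gap_us=1500.0, seed=1):
    """Mean read latency on one die that serves a single operation at a time.

    All timings are illustrative assumptions, not measurements.
    """
    if not 0.0 <= write_fraction < 1.0:
        raise ValueError("write_fraction must be in [0, 1)")
    rng = random.Random(seed)
    clock = 0.0         # arrival time of the current request
    die_free_at = 0.0   # time at which the die finishes its queued work
    read_latencies = []
    for _ in range(n_ops):
        # Random arrivals with an average gap of arrival_gap_us.
        clock += rng.expovariate(1.0 / arrival_gap_us)
        is_write = rng.random() < write_fraction
        start = max(clock, die_free_at)   # wait if the die is still busy
        die_free_at = start + (write_us if is_write else read_us)
        if not is_write:
            read_latencies.append(die_free_at - clock)
    return sum(read_latencies) / len(read_latencies)

# Same read service time in both runs; only the write mix changes.
print("read-only :", round(mean_read_latency_us(0.0), 1), "us")
print("30% writes:", round(mean_read_latency_us(0.3), 1), "us")
```

In this model the reads themselves never get slower. The extra latency comes entirely from reads waiting behind writes, which is why a read-only benchmark can look much better than production.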
Second, with hard disk technology, a database page update will over-write the same sector on the spindle. On solid-state (NAND flash) technology there is no update feature available, only reads, writes and erases. Also, erases are at the block level while reads and writes are at the page level (many pages per block). So, any data change (insert, update, delete) will write the full database page to a whole new chip page location while the old chip page location will be marked as “garbage”. This means that the blocks (made up of many pages) will grow more and more dirty over time and eventually need to be “garbage collected”. This is the process in which the clean (still valid) pages are grouped up and re-written to a new, clean block so that the dirty block can be erased. During this process, the chip is locked. So, solid state drives need extra “garbage space”, or work space, allocated to handle this scrubbing and re-allocating of changes. How much garbage space is allocated helps determine how fast writes can come into the device. So, choose your poison: allocate more to the scrub space and write faster, or allocate more to the usable space and have a larger usable capacity. Most vendors allow some sort of tuning here. So, know your write/update workload and know your flash formatting levels to make sure your production workloads will always work. When working with a DAS (local) implementation, a trick is to get a drive that is larger than the space you’ll be using. This ensures there is plenty of working space and reduces the likelihood of a collision without having to do an actual re-format, which is likely not possible with a local drive.
For the above reasons, advanced flash management is crucial. In fact, Vexata has observed in real-world testing and production use cases that a SAN implementation of NVMe SSDs can produce net faster database results than deploying the drives locally inside the database servers.
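The trade-off between garbage space and write speed can be sketched with a toy simulation. The model below is an assumption-heavy illustration, not how any particular drive works: a tiny device of 64 blocks with 32 pages each, uniformly random overwrites, and a "greedy" collector that always cleans the block with the fewest live pages. It reports write amplification, meaning flash page writes per host page write. All names and sizes are hypothetical.

```python
import random

def write_amplification(spare_fraction, num_blocks=64, pages_per_block=32,
                        host_writes=20000, seed=7):
    """Toy flash layer with greedy garbage collection (illustrative only).

    Returns flash page writes divided by host page writes.
    """
    total_pages = num_blocks * pages_per_block
    logical_pages = int(total_pages * (1.0 - spare_fraction))
    if total_pages - logical_pages <= 3 * pages_per_block:
        raise ValueError("toy model needs more than three blocks of spare space")

    rng = random.Random(seed)
    valid = [set() for _ in range(num_blocks)]  # live logical pages per block
    used = [0] * num_blocks                     # pages written since last erase
    where = {}                                  # logical page -> block holding it
    free_blocks = list(range(1, num_blocks))
    state = {"open": 0, "flash_writes": 0}

    def program(lpn):
        # No update in place: write a new copy, the old copy becomes garbage.
        if used[state["open"]] == pages_per_block:
            state["open"] = free_blocks.pop()
        old = where.get(lpn)
        if old is not None:
            valid[old].discard(lpn)
        blk = state["open"]
        valid[blk].add(lpn)
        where[lpn] = blk
        used[blk] += 1
        state["flash_writes"] += 1

    def collect():
        # Greedy choice: the full block with the fewest live pages.
        full = [b for b in range(num_blocks)
                if b != state["open"] and used[b] == pages_per_block]
        victim = min(full, key=lambda b: len(valid[b]))
        for lpn in list(valid[victim]):
            program(lpn)           # relocating live pages costs extra writes
        used[victim] = 0           # erase happens at the block level
        free_blocks.append(victim)

    for lpn in range(logical_pages):   # initial sequential fill
        program(lpn)
    state["flash_writes"] = 0          # count only the overwrite phase

    for _ in range(host_writes):
        while len(free_blocks) < 2:    # keep a free block in reserve
            collect()
        program(rng.randrange(logical_pages))
    return state["flash_writes"] / host_writes

for spare in (0.07, 0.28):
    print(f"spare space {spare:.0%}: write amplification "
          f"{write_amplification(spare):.2f}")
```

With less spare space, each cleaned block still holds many live pages that must be copied, so every host write costs more flash writes. That is the mechanism behind buying a local drive larger than the space you plan to use.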
NVMe for scale-up databases
NVMe can be applied to several different aspects of scale-up relational databases:
- Memory extension: DRAM can be very expensive, especially at large capacities. So, if your working data set is relatively small and your SAN is overburdened, you can place an NVMe-based drive in your database server and extend your database cache. This new space will be read-only, so it can only be used to remove additional reads from your existing SAN. But, that may be enough to keep the SAN working well a bit longer. Note that while this helps with the read performance of your database, it can take a lot of writes to load and maintain the cache, so write performance at the drive level is also a consideration.
- Temp space: Temp space is where a lot of a relational database’s work is done. All of those sub joins, temp tables, filters and groupings are processed in temp space. If you’re seeing a lot of temp space utilization, it may be of value to place that on an NVMe-based solid state drive. With the new media-access drivers and controllers, the lower latency of NVMe can help speed up temp activity. When doing tens or hundreds of thousands of I/Os per second, all those round trips to storage can add up. The read/write ratio can be near 50/50 so plan ahead for a large write workload to the drives.
- Logging: This is a mostly sequential, write-heavy workload that can be a primary bottleneck for high-transactional systems. By lowering the latency of the writes to the log file, transactions can proceed more quickly. With solid state media, write management can be a concern. This is amplified further when running an in-memory database, in which the log needs to handle an even faster stream of changes.
- Data: Formerly, solid state was mostly an OLTP solution for workloads requiring small, random data access. NVMe, through lower latencies and higher IOPS per device, can accelerate that even further. But, through enabling higher bandwidths, solid state can now accelerate analytics systems with faster run times, higher parallelization and lower rack densities. NVMe is taking “All-flash” appliances from about 1 GB/s per rack unit to about 10 GB/s per rack unit. Not bad for using the same chips.
- In-memory: There are a few use cases here. A) There are times when a full data set cannot fit entirely in-memory. An in-memory technology can still be deployed as long as it can cycle needed data in and out of the database server quickly enough. NVMe can help with these large bandwidths, high IOPS and heavy write scenarios. B) Re-boot times. Power-cycling an in-memory system can be a very long process when it has to read all of the base data into memory. With the higher bandwidths available with NVMe-based SSDs, this loading time can be greatly reduced. C) There are times when the total “wall clock” (full-stack execution duration) time is the ultimate goal. Migrating large amounts of data into an in-memory application for each run may delay the processing start time, thus delaying the overall run time. For this situation, it could be faster to run the workload entirely off of NVMe. As an example, if it takes 5 minutes to load data into memory and then 3 minutes to run a workload out of memory, the total “wall clock” time is 8 minutes. If instead it takes 6 minutes to just run the workload entirely off of NVMe media, then the total “wall clock” time is 6 minutes. In this example, the processing step took twice as long but the total run time (wall clock) was 2 minutes faster.
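The “wall clock” comparison in the in-memory example can be written out as a small calculation. This is a minimal sketch using the minute values from the example above. The function name is hypothetical, and real load and run times would have to be measured on your own platform.

```python
def wall_clock_minutes(run_minutes, load_minutes=0.0):
    """Total elapsed time: optional data load, then the processing run."""
    return load_minutes + run_minutes

# Option 1: load into memory (5 min), then run out of memory (3 min).
in_memory = wall_clock_minutes(run_minutes=3, load_minutes=5)

# Option 2: skip the load and run directly off NVMe (6 min).
from_nvme = wall_clock_minutes(run_minutes=6)

print("in-memory total:", in_memory, "minutes")
print("NVMe total:     ", from_nvme, "minutes")
print("minutes saved:  ", in_memory - from_nvme)
```

The general rule that falls out of this is that running off NVMe wins whenever its run time is shorter than the in-memory load time plus the in-memory run time.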
NVMe for scale-out databases
NVMe is also providing value in the scale-out database world. As in the previous section, there are several different types of workloads and use cases. Those scale-up use cases apply equally here. Additionally, the scale-out architects also have to deal with designing for many, many instances.
For scale-out platforms applying transactional or management type workloads, when the number of nodes is rather small, choosing the node characteristics (CPU, memory and storage) is fairly straightforward. But, as the number of nodes increases, solving for an X, Y and Z axis in a cost-effective manner can be more difficult. Stranding many terabytes of NVMe-based storage may not be in the budget. For this reason, it is common for the larger-node deployments to go back to a SAN model with thin-provisioning. This way the storage capacities, scalability and costs can be optimized, and the challenges around replacing failed drives are reduced.
For scale-out platforms processing large quantities of data, NVMe can be used two ways. One, loaded directly into the database nodes in a DAS (local) model. This will deliver a lower I/O latency, higher IOPS and higher bandwidth platform per node than using traditional SAS SSDs. But, as discussed above, solving for storage capacities as well as processing power can be difficult, potentially stranding hardware and lowering ROIs. That leads to option number two: deploying NVMe in a SAN with thin-provisioning. This will allow for optimal storage costs as well as provide the needed bandwidths to process at scale. In the use cases where there is a large data load prior to the run, NVMe can also help to ingest that data far faster than traditional SSDs, or skip the pre-load step and allow the task to execute off of the SAN instead of local storage media. Yet another “wall clock” type decision.
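The stranded-capacity problem can be made concrete with some simple arithmetic. The sketch below assumes a hypothetical four-node cluster with uneven data sizes, a DAS design where every node gets the same drive size, and a SAN pool sized at used capacity plus 20% headroom. The node sizes, the headroom figure and the function names are all illustrative assumptions.

```python
def das_stranded_tb(node_needs_tb, drive_tb):
    """Unused capacity when every node carries the same local drive size."""
    if max(node_needs_tb) > drive_tb:
        raise ValueError("drive_tb must cover the largest node")
    return sum(drive_tb - need for need in node_needs_tb)

def san_stranded_tb(node_needs_tb, headroom_pct=20):
    """Unused capacity when nodes draw thin-provisioned volumes from one pool."""
    used = sum(node_needs_tb)
    return used * headroom_pct / 100

needs = [2, 3, 8, 1]   # TB actually used per node (made-up values)
print("DAS stranded TB:", das_stranded_tb(needs, drive_tb=8))
print("SAN stranded TB:", san_stranded_tb(needs))
```

In the DAS case every node has to be sized for the largest one, so the unused space grows with the node count. In the pooled case the unused space is only the headroom you choose to keep.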
NVMe’s general advantages for database workloads
NVMe takes the same NAND flash chips and brings a whole new level of performance to them. For DBAs and database architects, this means:
- Massive bandwidths (10x higher than traditional SSD based AFAs)
- Heavy write ingests
- Lower I/O latencies
- Unlock full usage of solid state capacities
- More stable latencies via advanced flash management
- Higher density systems (you can use all of the device because you can now use all of its performance)
- Better mixed-use cases #1: reads and writes (writes not hurting read performance)
- Better mixed-use cases #2: reduction in noisy neighbor issues
- Better ability to run production maintenance in 24×7 operations without hurting user performance
- Faster backup, restore and sharding times
- Quick fixes for temp, log and memory extensions
- Support for scaling up into higher processor count systems or upgrading to Skylake processors
- Reducing overall CPU core counts via reduction in CPU cycles used for I/O processing
- Increase of parallelization of workloads or concurrency of users
- Avoid re-platforming via scaling existing systems
- Reduce complexity via collapsing/eliminating tiers, caches, silos or dedicated appliances.
NVMe is the new protocol for solid state media. Expect the next technology change to be in the solid state itself. With 3D XPoint/Optane already out and other technologies in development, NVMe will be the standard for media access for the foreseeable future. Get ready for storage media technology to start changing at an even faster pace. For DBAs and database architects, make sure you are deploying these new technologies in the optimal manner and that your vendors have the correct management technologies (garbage collection) to fully unleash what solid state can do.
Vexata for databases
Vexata set out to design not just an NVMe storage system but a transformational architecture that unleashes what is possible with current and future solid state technologies. To start, Vexata developed an architecture that could fully utilize next generation media and implemented the architecture through the Vexata Operating System (VX-OS), which is a patented design that maximizes data path processing and management of NVMe and Memory Class Storage media (e.g. 3D XPoint™). This allows full flow-through of solid state performance and capacity with minimal overhead, maximum use of pooled solid state and a scale-out architecture that continues to deliver as capacities grow. When VX-OS is deployed in a high availability platform that combines enterprise grade SSD media with commercially available hardware, the result is a cost-effective storage system that is optimized for scale, I/O performance and operational simplicity, at a much lower TCO.
Because Vexata utilizes standard Fibre Channel interfaces, data platform architects can deploy Vexata for any of the SAN-based use cases for scale-up or scale-out databases. With performance improvements of at least 10x over traditional SAS-based SSD all-flash arrays, DBAs can push their databases to new levels of performance or consolidate platforms for space or cost reductions. Why not go 10x faster with the same chips?
To hear more about NVMe for databases, check out the NVMe for DBAs webinar on BrightTalk.
To speak with a Vexata database specialist about what Vexata’s Scalable Storage Systems can do for your data platform, contact us at email@example.com.