Facebook continued to pursue ways of making data storage more efficient with its release of a new version of Flashcache, with Production Database Engineer Domas Mituzas saying in a note on the Facebook Engineering page that the 3.x series boosted average hit rates from 60 percent to 80 percent and cut the social network’s disk operation rate nearly in half.
The note by Mituzas offers a detailed look at Flashcache, but here are some highlights:
Facebook first started using Flashcache in 2010. At the time, we were trying to find a way around having to choose between an SAS or SATA disk-based solution and a purely flash-based solution. None of the options was ideal: The SAS and SATA options in 2010 were slow (SATA) or required many disks (SAS), and the cost per gigabyte for flash in 2010 was high.
Looking at this in 2010, we saw an opportunity to try to solve this problem at the software layer. We evaluated adding support for an L2 cache directly into InnoDB, but concluded that it was better to make it transparent to applications like MySQL. So we instead implemented Flashcache as a Linux kernel device mapper target and deployed it into production on a large scale.
Our analysis showed that a few regions on disk accounted for the majority of writes, and the distribution of reads was very uneven. We added more instrumentation in Flashcache to learn about our workload patterns and get better metrics on caching behavior.
Before this change, 50 percent of the cache accounted for 80 percent of the disk operations. After this change, 50 percent of the cache accounted for 50 percent of the disk operations.
Our database servers use small logical block sizes — 4 kilobytes or 8 KB for compressed InnoDB tables, and 16 KB for uncompressed. When we were using 2-megabyte cache sets, we did not see big differences between various cache eviction algorithms and chose FIFO instead of LRU. The workload changed after we increased the Flashcache set sizes, so we began evaluating alternatives to FIFO.
What we found was that blocks that are referenced multiple times are promoted and moved from the midpoint to the head of the LRU. If we put them at the head of the LRU when they’re first read, then many blocks that are read once will push more frequently read pages from the LRU. If we put them in the middle of the LRU (halfway from the head), then they’d be at the 50th percentile. If we put them at the head, then they’d be at the 0th percentile.
We run multiple database instances per machine, and the one that’s been running longest gets preferential treatment in the well-established old pages area, while new ones never get out of the nursery.
Another issue we wanted to address was disk write efficiency. Flashcache can act as a reliable write-behind cache, and many writes to disk can be merged beforehand on flash.
Previously, we just tried to enforce a fixed percentage of dirty pages per cache set. Since different cache sets have different behaviors, under this model, we ended up either underallocating or overallocating cache to modified pages. Some segments were written out constantly, while others cached dirty pages for a week, which penalized read caching.
To address this, we implemented a straightforward dirty data-eviction method that did not segregate write and read operations. In the new method, all pages are treated equally, and if cache wants to reclaim a page, it just looks at the oldest entries in the LRU. If the oldest entry is dirty, cache would schedule a background eviction of that entry and then reclaim the next clean one and use it for new data.
This smoothed out our write operations while preserving maximum write-merging efficiency and instant write capabilities of write-behind cache. It also increased amount of space that can be used for reads, increasing our overall cache efficiency.
With these three changes implemented, we are now turning our attention to future work. We’ve already spent some time restructuring metadata structures to allow for more efficient data access, but we may still look at some changes to support our next-generation systems built on top of multi-terabyte cache devices and spanning tens of TB of disk storage. We’re also working on fine-grained locking to support parallel data access by multiple CPU cores.
Also, while the cost per GB for flash is coming down, it’s still not where it needs to be, and this introduces more complexities into capacity planning. SSDs have limited write cycles, so we have to make sure that we’re not writing too much. Every cache miss results in a flash data write, so having flash devices that are too small may be problematic. In this case, it’s nice to have hard disks that aren’t too fast, as any cache hierarchy has to depend on steep performance differences between multiple levels.