TL;DR: It is hard to achieve perfect ordering of block I/O under Linux. The most promising fix for the data corruption issue is to upgrade QEMU to 2.0 or higher.
1. Overview
From the perspective of block integrity, Linux block I/O has always been a source of controversy. It is an inherently sensitive topic, because a tiny malfunction can result in massive data corruption. In the era of simple storage, such as a filesystem on top of local hard disks or SCSI drives, it was relatively easy to imagine what kind of corner cases could exist. Now that a storage stack tends to consist of multiple layers, however, it has become much more likely to run into unexpected data integrity issues.
In this document I summarize possible ways to achieve block integrity throughout the storage stack, in particular for QEMU/KVM + multipath + MD/RAID and similar combinations.
2. Concepts
Traditionally, Linux has had two major abstractions for accomplishing block integrity (see the note by Christoph Hellwig [1]):
- cache flush: forces all write requests held in volatile caches out to stable storage
- barrier request: prevents requests issued before and after it from being reordered while they pass through the storage stack, all the way down to the physical storage device
Cache flush is already well implemented by the individual filesystems and the other layers; in the Linux kernel it corresponds to the REQ_FLUSH flag set on a BIO. Barrier requests, on the other hand, are normally expressed in terms of “Force Unit Access” (FUA), which corresponds to the REQ_FUA flag in the kernel.
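As an illustration of how these mechanisms are reached from user space, here is a minimal sketch (the file path and record contents are only examples, not anything from the stack discussed here): fdatasync() makes the filesystem issue a cache flush for the data written so far, and opening with O_DSYNC requests per-write durability, which the kernel can map to FUA writes on devices that support it.

/* Minimal userspace sketch: requesting a cache flush and per-write
 * durability.  The path "/tmp/integrity-demo" is illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "critical record\n";

    /* O_DSYNC: each write() returns only once the data is durable;
     * on capable devices the kernel can use FUA writes for this. */
    int fd = open("/tmp/integrity-demo", O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf)) {
        perror("write");
        close(fd);
        return EXIT_FAILURE;
    }

    /* Even without O_DSYNC, an explicit fdatasync() forces a cache
     * flush down the storage stack before it returns. */
    if (fdatasync(fd) != 0) {
        perror("fdatasync");
        close(fd);
        return EXIT_FAILURE;
    }

    close(fd);
    return EXIT_SUCCESS;
}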
3. Reality
However, the general consensus from several discussions among kernel developers is that Linux no longer has a strict barrier. The only way to guarantee request ordering is to “not submit the other until the one has completed.” [2] That is where the concept of “draining the queue” comes in; it is implemented either as a loop that waits for all outstanding requests to complete or by polling the queue. For third-party out-of-tree drivers such as DRBD, the best practice for achieving barrier semantics is to drain the queue as well as flush it. Quite cumbersome, but that is how it works right now.
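To make the “do not submit the other until the one has completed” rule concrete, here is a small userspace sketch; the file name, record layout, and offsets are hypothetical. A record that depends on an earlier one is only submitted after the earlier one has completed and been flushed, which is the drain-and-flush pattern expressed at the application level.

/* Sketch of the ordering rule: record B (e.g. a commit block) must not
 * reach the disk before record A (e.g. the journal entry it commits).
 * Without a barrier, the only safe pattern is: submit A, wait until it
 * has completed and been flushed, and only then submit B. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static int write_and_wait(int fd, const void *buf, size_t len, off_t off)
{
    if (pwrite(fd, buf, len, off) != (ssize_t)len) {
        perror("pwrite");
        return -1;
    }
    /* "Drain": do not proceed until the request has completed and the
     * device cache has been flushed. */
    if (fdatasync(fd) != 0) {
        perror("fdatasync");
        return -1;
    }
    return 0;
}

int main(void)
{
    int fd = open("/tmp/ordering-demo", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    const char journal[] = "journal entry";
    const char commit[]  = "commit block";

    /* A must be durable before B is even submitted. */
    if (write_and_wait(fd, journal, sizeof journal, 0) != 0 ||
        write_and_wait(fd, commit, sizeof commit, 4096) != 0) {
        close(fd);
        return EXIT_FAILURE;
    }

    close(fd);
    return EXIT_SUCCESS;
}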
In QEMU/KVM and its virtio drivers, the story becomes a little more complicated. Assuming we use virtio-blk with QEMU 2.0, cache flush is already supported; barrier requests, however, are not sufficiently handled. Most of the progress has happened in QEMU over the last three years, so that such a barrier can now be realized by draining and flushing the queue. Even that was not correctly supported in QEMU 1.2 and earlier: in those versions the bdrv_drain_all() API did not even exist; it was first introduced in QEMU 1.3. [5]
Moreover, the barrier flag is not advertised by the guest kernel’s virtio-blk driver. See the note on this issue written by Christoph Hellwig [1], although parts of that document are outdated, as well as the discussion about the virtio spec. [3]
Having relied on QEMU 1.2 or lower for years, we have always been exposed to these risks. A libguestfs developer even tried to work around the issue by implementing an extra fsync call in libguestfs. [4]
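For illustration, here is a minimal sketch of that kind of workaround, not the actual libguestfs code: after a plain sync(), the block device itself is fsync()ed, under the assumption that fsync() on a block device flushes its volatile write cache. The device path is only an example; a real tool would iterate over all relevant block devices.

/* Sketch of a sync-plus-device-fsync workaround in the spirit of [4].
 * "/dev/sda" is illustrative only. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    /* Flush dirty pages of all filesystems first. */
    sync();

    /* Then flush the write cache of the device itself. */
    int fd = open("/dev/sda", O_RDONLY);
    if (fd < 0) {
        perror("open /dev/sda");
        return EXIT_FAILURE;
    }
    if (fsync(fd) != 0)
        perror("fsync /dev/sda");
    close(fd);

    return EXIT_SUCCESS;
}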
So it is not astonishing that, under particular circumstances, block I/O requests from a guest can end up reordered, especially under high I/O load, when the storage target does not complete requests within the expected time.
4. What to do?
I think there is no perfect solution that cleans up all of this mess. All we can do is minimize the risks in the particular places where barriers can be guaranteed as far as possible.
- Upgrade QEMU from 1.2 to 2.0
: QEMU 2.0 already supports queue draining and flushing nearly completely. For example, the live migration code already drains the queue before starting the migration. This project is already under way, and it looks like the most feasible solution at the moment.
- Tuning / Debugging dm-multipath
: Of course, multipath itself could have bugs that cause block I/O reordering. Changing the kthread to a single-threaded one could also help. However, given that kernel developers do not care much about block barriers right now, it is doubtful that dm-multipath can be made to guarantee a block barrier.
Apart from that, several bugfixes have gone into dm-multipath over the last years. For example, a fix that avoids hanging while switching paths could help us a little.
- Use IDE/SCSI drives instead of virtio-blk
: Possible, but the obvious downside is a performance hit.
- Make use of a QEMU interface for triggering queue draining
: Such an interface is already available as aio_flush (in recent QEMU versions the corresponding call is bdrv_drain_all() [5]). Userspace tools can trigger queue draining whenever they are about to suspend a VM for live migration or similar operations.
5. References
[1] “Notes on block I/O data integrity”
<https://lists.gnu.org/archive/html/qemu-devel/2009-08/msg01385.html>
[2] “FLUSH/FUA documentation & code discrepancy”
<https://lkml.org/lkml/2012/9/4/142>
[3] “virtio-spec: document block CMD and FLUSH”
<http://lists.gnu.org/archive/html/qemu-devel/2010-05/msg00119.html>
[4] “daemon: Run fsync on block devices after sync”
<https://www.redhat.com/archives/libguestfs/2012-July/msg00009.html>
[5] “block: convert qemu_aio_flush() calls to bdrv_drain_all()”
<http://git.qemu.org/?p=qemu.git;a=commitdiff;h=922453bca6>