Last month, a blog post was published about a lengthy investigation into mysterious data corruption. The author has recently updated the article with the root cause: a bug in MD RAID 0/10/linear in the Linux kernel.
The author had contacted Samsung Electronics, as he originally thought it was a firmware issue in Samsung SSDs. It turned out to be a general Linux kernel bug, though. Consequently, Linux developer Seunguk Shin posted a patch for RAID0 to fix it, which Martin Petersen then reworked.
A closer look at the issue shows that this bug has existed from the very beginning, ever since MD RAID 0/10/linear started to support TRIM. When the SCSI layer initializes a command for REQ_DISCARD, it allocates a single page, which is stored in bio->bi_io_vec->bv_page.
static int sd_setup_discard_cmnd(struct scsi_cmnd *cmd)
{
	struct request *rq = cmd->request;
	struct scsi_device *sdp = cmd->device;
	struct scsi_disk *sdkp = scsi_disk(rq->rq_disk);
	...
	page = alloc_page(GFP_ATOMIC | __GFP_ZERO);
	if (!page)
		return BLKPREP_DEFER;
	...
	blk_add_request_payload(rq, page, len);
Then rq->bio->bi_io_vec->bv_page gets a pointer to the single page that was just allocated above.
In the next phase, it gets passed to the MD RAID layer, where its bio gets split by bio_split().
struct bio *bio_split(struct bio *bio, int sectors,
		      gfp_t gfp, struct bio_set *bs)
{
	...
	split = bio_clone_fast(bio, gfp, bs);
bio_clone_fast() is effectively just a wrapper around __bio_clone_fast(), which looks like this:
void __bio_clone_fast(struct bio *bio, struct bio *bio_src)
{
	...
	bio->bi_bdev = bio_src->bi_bdev;
	bio->bi_flags |= 1 << BIO_CLONED;
	bio->bi_rw = bio_src->bi_rw;
	bio->bi_iter = bio_src->bi_iter;
	bio->bi_io_vec = bio_src->bi_io_vec;
The cloned bio shares its bi_io_vec[] array, in which the bio's pages are enumerated, with the original bio, bio_src. As a result, bio->bi_io_vec->bv_page, which should point to the single page allocated above in the SCSI layer, can be overwritten with a completely different address through the other bio. TRIM requests can then result in data corruption.
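The aliasing described above can be reproduced in a small userspace model. This is only a sketch: the toy_* types and functions are hypothetical stand-ins that keep just the fields discussed here, not kernel code. The full-clone variant is included purely for contrast, as one way such sharing can be avoided; it is not taken from the patches mentioned earlier.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy stand-ins for the kernel types; only the fields needed here. */
struct toy_page { int id; };

struct toy_bio_vec {
	struct toy_page *bv_page;
	unsigned int bv_len;
};

struct toy_bio {
	struct toy_bio_vec *bi_io_vec;	/* may end up shared between clones */
};

/* Shallow clone: copies only the pointer, so both bios share one bio_vec. */
static void toy_clone_fast(struct toy_bio *dst, const struct toy_bio *src)
{
	dst->bi_io_vec = src->bi_io_vec;
}

/* Full clone: the new bio gets its own copy of the bio_vec. Returns 0 on success. */
static int toy_clone_full(struct toy_bio *dst, const struct toy_bio *src)
{
	dst->bi_io_vec = malloc(sizeof(*dst->bi_io_vec));
	if (!dst->bi_io_vec)
		return -1;
	memcpy(dst->bi_io_vec, src->bi_io_vec, sizeof(*dst->bi_io_vec));
	return 0;
}

/* Models attaching a single discard payload page to a bio. */
static void toy_add_payload(struct toy_bio *bio, struct toy_page *page,
			    unsigned int len)
{
	bio->bi_io_vec->bv_page = page;
	bio->bi_io_vec->bv_len = len;
}

/*
 * Returns 1 if attaching a payload to the split bio clobbered the page
 * pointer of the original bio, 0 if not, -1 on allocation failure.
 */
static int toy_payload_clobbered(int full_clone)
{
	struct toy_page page_a = { 1 }, page_b = { 2 };
	struct toy_bio_vec vec = { NULL, 0 };
	struct toy_bio orig = { &vec }, split = { NULL };
	int clobbered;

	toy_add_payload(&orig, &page_a, 512);

	if (full_clone) {
		if (toy_clone_full(&split, &orig) != 0)
			return -1;
	} else {
		toy_clone_fast(&split, &orig);
	}

	/* With a shallow clone this write also lands in orig's bio_vec. */
	toy_add_payload(&split, &page_b, 512);

	clobbered = (orig.bi_io_vec->bv_page != &page_a);

	if (full_clone)
		free(split.bi_io_vec);
	return clobbered;
}
```

Calling toy_payload_clobbered(0) from a small main() reports that the original bio lost its page pointer, while toy_payload_clobbered(1) leaves it intact, because each bio then owns a separate bio_vec.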
In practice, this data corruption was hard to reproduce. Even so, it is surprising that the bug went undiscovered until now.