Fixing Data Corruption: Repairing iSCSI Disks Exposed by Ceph for Windows Servers & Virtualization Hosts

Data corruption with the iSCSI in-kernel backend and how to prevent it

iSCSI is a popular way to provide access to RADOS Block Devices (RBD) to systems that don’t support them natively, such as Windows machines or virtualization hosts.

ceph-iscsi is a frontend for configuring iSCSI targets backed by RBD images. It uses targetcli, a command-line interface for LIO, to create iSCSI targets in Linux.

There are two ways for targetcli to use RBD images:

  • Running tcmu-runner in userspace. This method is what croit used in the distant past, when croit OS images were based on Debian, and it has also been the default since the 2312.0 release of croit.

  • Using the target_core_rbd backstore, implemented in the kernel. This method is the default in SUSE-based images managed by croit versions prior to 2312.0. It is based on a custom patch that SUSE adds to their Linux kernel; upstream kernel developers rejected it for inclusion in the mainline kernel.
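
Which of the two paths a given LUN uses can be checked on the gateway node by listing the configured backstores. A quick sketch (assuming targetcli is available on the node; the gwcli-based check described later in this article shows the same information):

# targetcli ls /backstores

LUNs listed under user:rbd go through tcmu-runner in userspace; LUNs under a plain rbd backstore use the in-kernel module.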

The purported benefits of SUSE’s in-kernel solution are increased performance and the ability to do parallel multi-path I/O, thus removing the need for the “Fail Over Only” load-balancing policy on Windows clients or its equivalent on other systems. There was also an attempt to support iSCSI persistent reservations, but it has been abandoned.

It turned out that the custom SUSE patch leads to data corruption, which affects the default croit setup. This article explains how to identify and repair the affected iSCSI disks.

Details of the corruption

By default, the iSCSI LUNs made available by the rbd backstore (i.e., through the target_core_rbd kernel module) incorrectly advertise the underlying RBD image size. In particular, one extra 512-byte sector is exposed at the end but is not backed by any storage. Any data “written” into this sector is lost.
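
The mismatch can be observed from a Linux initiator by comparing the size the kernel reports for the iSCSI disk with the size of the backing image. A rough sketch, assuming the LUN shows up as /dev/sdb and is backed by a hypothetical image rbd/bad1 (run the second command on a host with Ceph client access):

# blockdev --getsize64 /dev/sdb
# rbd info rbd/bad1 --format json | jq .size

On an affected 10 GB disk, the first command reports 512 bytes more than the second (10737418752 vs. 10737418240 bytes).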

We encountered this issue at a customer installation. The customer used the iSCSI LUNs on a Windows Server. The server used a GPT partition table on the iSCSI disks. According to the GPT specification, the backup copy of the GPT is written to the last block – i.e., to that 512-byte block that isn’t real. After a crash, partitions on those disks weren’t recognized by Windows anymore.

Is my setup affected?

If you used the croit GUI prior to version 2312.0 to expose your RBD images as iSCSI LUNs, then yes, your setup is affected, except for the case where you can prove that the last sector is, in fact, unused.

Examples of setups that are at risk:

  • In an RBD image exposed as an iSCSI LUN and then partitioned using GPT (GUID Partition Table), the backup copy of the GPT is located in the last sector and is therefore not persisted. Linux tolerates this and uses the primary copy, but in some cases, Windows does not recognize the disk’s partitions at all. Only partition metadata, not user data, is corrupted in this case, but this still can lead to the unavailability of user data if Windows stops recognizing the disk.

  • If the iSCSI disk is used to store data without partitioning, then the filesystem may try to write data anywhere on the disk, including the last (fake and broken) sector. So, in this case, the user data is at risk.

Examples of setups not at risk:

  • If the RBD image is partitioned using the MBR scheme and the last sector is not covered by any partition, there is no danger of data loss.

  • If the RBD image is exposed through tcmu-runner (manually using the command line, or by using the default settings of croit v2312.0 or later), the buggy code path is not hit.

  • If the RBD image exposed through the in-kernel backend has the emulate_legacy_capacity attribute set to 0, its size is advertised correctly, and there is no bug.
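
If you are unsure which of these cases applies to a particular disk, the partition table type can be checked from a Linux host that has the iSCSI disk attached. A sketch, assuming the disk shows up as /dev/sdb:

# lsblk -o NAME,PTTYPE /dev/sdb

PTTYPE reads gpt for GPT, dos for MBR, and stays empty for an unpartitioned disk.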

Identifying the affected disk images

croit release 2402.0 will come with a task advisor warning that identifies all iSCSI disks that have the dangerous configuration: the in-kernel backend with legacy capacity emulation not disabled. It does not try to analyze the content of the disk images and will therefore warn even in the “MBR with the last sector unused” situation, which is actually safe. But even in this situation, it is better to convert the disk to a setup that does not require content analysis to prove that it is, in fact, safe.

On older croit releases, you can follow the steps below.

Using the gwcli console (Services > iSCSI > Terminal), run the following commands:

/> cd /disks
/disks> ls

An example output would be:

o- disks ............................................... [30G, Disks: 3]
  o- rbd ................................................... [rbd (30G)]
    o- bad1 .................................. [rbd/bad1 (Unknown, 10G)]
    o- ok1 .................................... [rbd/ok1 (Unknown, 10G)]
    o- ok2 .................................... [rbd/ok2 (Unknown, 10G)]

For each disk listed, run the “info” command as follows:

/disks> info rbd/bad1
Image                     .. bad1
Ceph Cluster              .. ceph
Pool                      .. rbd
Wwn                       .. 387c88ba-c9e3-400b-b00e-8d9b60b210e0
Size H                    .. 10G
...
Backstore                 .. rbd
Backstore Object Name     .. rbd.bad1
Control Values
...
- emulate_legacy_capacity .. 1
...

If the backstore is user:rbd, then the disk is unaffected.

If the backstore is rbd, pay attention to the emulate_legacy_capacity control value. If it is 0 (which is only possible to set through the command line) and has always been 0, then the disk is unaffected. If it is 1 (which was the case by default), the disk cannot store data in its last sector and needs to be repaired.
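
If you have more than a handful of disks, the check can be scripted from the gateway’s shell. A rough sketch (assuming a ceph-iscsi build whose gwcli accepts one-shot commands on its command line, like targetcli does; the disk names are hypothetical and should be replaced with those from your own ls output):

# for disk in rbd/bad1 rbd/ok1 rbd/ok2; do
>   echo "== $disk =="
>   gwcli /disks info "$disk" | grep -E 'Backstore|emulate_legacy_capacity'
> done

Any disk that reports Backstore .. rbd together with emulate_legacy_capacity .. 1 needs the repair procedure below.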

Preventing further data corruption

All operations in this section require that no iSCSI initiators are using the disk. So, stop the Windows machines, hypervisors, or other systems that have the disk connected.

As the disk may already contain partition or filesystem metadata claiming that it is bigger than it actually is, the first step is to increase its size. You can skip this step if the disk is partitioned using MBR and the last sector is not used by any partition.

Theoretically, it is sufficient to increase the disk size by 512 bytes; however, neither the rbd command nor gwcli accepts the size in bytes. The procedure below will, therefore, extend the disk by 1 MB. It needs to be repeated for each disk identified as being at risk.

Figure out the new size of the disk by adding 1 MB to the existing size reported by gwcli. Consider that, according to gwcli, 1 GB = 1024 MB and 1 TB = 1048576 MB. Then, tell gwcli to resize the disk to the new size. Given that the rbd/bad1 disk was reported as having a 10 GB size, the correct new size is 10241 MB:

/disks> resize rbd/bad1 10241M
ok
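
For other disk sizes, the same arithmetic can be done quickly in the shell, using gwcli’s units of 1 GB = 1024 MB and 1 TB = 1048576 MB:

# echo $(( 10 * 1024 + 1 ))      # 10 GB disk -> resize to 10241M
# echo $(( 2 * 1048576 + 1 ))    # 2 TB disk  -> resize to 2097153M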

The next step is to enable faithful disk size reporting. This only works if the image is assigned to an iSCSI target. If it is not, add it now:

/disks> cd /iscsi-targets/iqn.2017-01.io.croit.iscsi:ceph-gateway/disks
/iscsi-target...gateway/disks> add rbd/bad1
ok
/iscsi-target...gateway/disks> cd /disks 

To enable the faithful reporting of the disk size, issue the following command:

/disks> reconfigure rbd/bad1 emulate_legacy_capacity 0
ok

This concludes the repair of the rbd/bad1 disk. From now on, it can be safely used by the iSCSI initiator. However, as its size has changed, further steps might be needed to remove the discrepancy between the new disk size and the metadata that might still be on it. Such steps are necessarily specific to each operating system and the use case of the disk. For example, on Linux, if the disk uses GPT, it is sufficient to rewrite the partition table without any changes using fdisk or gdisk, as shown below:

# fdisk /dev/sdb
Welcome to fdisk (util-linux 2.39.2).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

GPT PMBR size mismatch (20971520 != 20973567) will be corrected by write.
The backup GPT table is corrupt, but the primary appears OK, so that will be used.
The backup GPT table is not on the end of the device. This problem will be corrected by write.

Command (m for help): w 
The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
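
Alternatively, sgdisk (from the gdisk package) can relocate the backup GPT data structures to the new end of the disk in a single step; here /dev/sdb again stands for your iSCSI disk:

# sgdisk -e /dev/sdb

Afterwards, fdisk -l or gdisk -l should no longer report a size mismatch.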

Further plans

The procedure above involves manual steps, and the repaired disks still use the in-kernel rbd backend, which may not be the best idea. croit GmbH is brainstorming further improvements to its product, aimed at simplifying and automating these steps and at switching away from the in-kernel iSCSI rbd backend. Stay tuned for our new blog posts.

On the other hand, the Ceph project has switched ceph-iscsi into maintenance-only mode and is considering NVMe-oF as a replacement technology; at the time of this writing, that support is still experimental. Furthermore, the project has released native Windows RBD drivers, and we already have customers who have migrated from iSCSI to these drivers. We do not recommend using iSCSI at all and suggest investigating alternative solutions for making Ceph storage available to applications that do not support Ceph natively.

Too complicated and time-consuming? Contact us to learn more about our software solution for Ceph with 24/7 support at your disposal!