Trying to code on LF OS
A few days ago, I backported quite a lot of commits from the 9p feature branch in LF OS back to the main branch, including some important bugfixes. Everything worked fine on my machine, but it broke in CI because a command line flag that worked great locally didn't work in the docker image (this turned out to be a difference in mtools versions - I have a random version on my laptop, not from the debian repositories). Since this isn't the first time mtools has annoyed me with such things, I decided to finally ditch it in favor of a small tool that creates a fresh FAT32 filesystem from a bunch of files and directories - think of it like tar, but with FAT32 as the output format. After not a lot of time, I had a working fatcreate tool and integrated it into LF OS' build system.
LF OS not booting
But it wouldn't boot. Sure, fatcreate didn't yet create a perfect FAT32 filesystem. The FS info block wasn't correct (it showed more used clusters than there are), it didn't implement long file names, and there was no backup superblock … but nothing that should prevent it from booting. I fixed some problems in the dd command that adds the FAT32 image to the sgdisk-prepared hard disk image, but it still wouldn't boot. Curious. Frustrating.
I first tried starting qemu manually, which worked fine. I then tried a debug build of OVMF (which I already had lying around from debugging LF OS' bootloader over PXE); that also worked fine and didn't give any hints in the debug output. I then thought everything was alright and moved on - except it didn't work after I reset my build directory for a fresh build. It was broken again, using the debian-provided OVMF.
At some point, I realized the block device wasn't even showing up in the UEFI shell - only BLK0, the CD-ROM drive, was there. I also confirmed that the debug build worked and the debian-provided one was faulty. And I figured out why it had worked when I started QEMU manually: I hadn't added the hard disk as an NVMe disk but via the default interface, ATA. So in my head the problem changed from "fatcreate does something really wrong" to "QEMU or OVMF have broken NVMe support!". I searched the internet but only found posts from years ago, mostly about booting Windows from a PCI-passthrough'd NVMe SSD… nothing relevant at all. No matching bug reports in the debian bugtracker for either QEMU or OVMF. Nothing. Damn, a problem I get to experience first.
Trying to figure it out
I knew it was working not long ago, so the change had to be recent. I looked at the changelog for QEMU (nothing relevant there), then dug through the history of QEMU's emulated NVMe controller - still nothing relevant. I then looked into OVMF, i.e. edk2, going through the commits to OvmfPkg, but again found nothing relevant. To pin the problem down to either QEMU or OVMF, I updated my local edk2 repository, built it and used that debug build to get some output for the broken state (I should have done this way earlier). Now I had a line of code that gives an error!
So it thinks QEMU's NVMe controller isn't an NVMe controller? Strange. (I later learned that NVMe controllers can support different command sets, the NVM command set being the one you'd expect from a controller with usable storage attached.)
As said before, I didn't see any recent changes to either QEMU or OVMF - but maybe there was one in the NVMe driver itself? Could it be UEFI that's broken?
Finding the problem and fixing it
NvmExpressDxe in MdeModulePkg really did have a recent change: back in January, someone added a new feature, support for boot partitions. NVMe boot partitions allow a system to boot with only a subset of NVMe implemented, e.g. to boot without requiring separate storage for the firmware (like a SPI flash).
Adding this required some changes to the existing structs for the NVMe registers, notably widening the field Css (CSS: Command Sets Supported, a bit mask of which command sets this controller supports) in Cap (Capabilities, the way NVMe controllers communicate their supported features) from 4 bits (with 7 reserved bits following it) to 8 bits (with the new boot partition support bit and two reserved bits following it).
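As a rough sketch of that change (only Css, Cap and the bit widths come from the actual commit; the other names are placeholders and the real struct in edk2's headers will differ):

```c
#include <stdint.h>

/* Illustrative slice of the Cap (Capabilities) bitfield -- not copied from
 * edk2, just showing how the Css field was widened. */
typedef struct {
  /* ... lower Cap fields omitted ... */

  /* before the boot partition patch:
   *   uint16_t Css  : 4;    Command Sets Supported
   *   uint16_t Rsvd : 7;    reserved
   */

  /* after the boot partition patch: */
  uint16_t Css  : 8;   /* Command Sets Supported, now the full 8 bits */
  uint16_t Bps  : 1;   /* new: boot partition support (placeholder name) */
  uint16_t Rsvd : 2;   /* reserved */

  /* ... upper Cap fields omitted ... */
} NVME_CAP_SKETCH;
```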
Per the NVMe spec, the CSS field is 8 bits wide, so this part of the change is correct. Prior to NVMe v1.4, all of these bits were reserved except the LSB, which indicates support for the NVM command set. Since NVMe v1.4, the MSB is defined as indicating "No I/O Command Set is supported (i.e., only the Admin Command Set is supported)".
QEMU's NVMe controller returns 0xC1 - LSB, MSB and MSB-1 set. This violates the NVMe v1.4 spec, because
- the MSB cannot be set when any I/O command set is supported, and the LSB indicates support for the NVM command set
- MSB-1 is reserved
MSB-1 is, however, defined in NVMe v2.0, which QEMU seems to implement. The bit roughly means "I support I/O command sets other than the NVM command set, use the new feature to check which ones", or at least "I support the host selecting the command set to use with each command". Setting the MSB while the LSB is set still isn't allowed - but setting MSB-1 and the LSB is allowed, and would trigger this problem in UEFI, too.
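To make the bit positions concrete, here is a small standalone sketch that decodes the 0xC1 QEMU reports (the macro names are made up for this example and don't come from any NVMe header):

```c
#include <stdint.h>
#include <stdio.h>

/* Bit positions within the 8-bit CSS field; names invented for this sketch. */
#define CSS_NVM_CMD_SET  (1u << 0)  /* LSB: NVM command set supported                */
#define CSS_IO_CMD_SETS  (1u << 6)  /* MSB-1: selectable I/O command sets (NVMe 2.0) */
#define CSS_ADMIN_ONLY   (1u << 7)  /* MSB: no I/O command set supported (NVMe 1.4)  */

int main(void) {
  uint8_t css = 0xC1;  /* value reported by QEMU's emulated NVMe controller */

  printf("NVM command set:          %s\n", (css & CSS_NVM_CMD_SET) ? "yes" : "no");
  printf("selectable I/O cmd sets:  %s\n", (css & CSS_IO_CMD_SETS) ? "yes" : "no");
  printf("admin command set only:   %s\n", (css & CSS_ADMIN_ONLY)  ? "yes" : "no");
  return 0;
}
```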
What is the actual problem? The condition in the if checks for equality with 1, not whether the lowest bit (the LSB) is set. That means the same thing as long as the NVMe controller sets only that one bit (which made enough sense, as all other bits were reserved when this driver was added to UEFI), and it accidentally kept working until now, because LSB+1, LSB+2 and LSB+3 weren't used by any controller - but with the full 8 bits now mapped in the struct, it fails for NVMe 2.0 controllers setting MSB-1 (and also for controllers setting MSB and LSB, though that still seems out of spec).
The fix is simple: change the Private->Cap.Css != 0x01 to !(Private->Cap.Css & 0x01), an actual check for the LSB being set. This is the shortest and easiest fix and matches the error message, so I created the patch, tested it a few more times to make sure, and opened a pull request on GitHub.
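As a minimal standalone sketch of why the old condition rejects QEMU's controller while the fixed one accepts it - only the two conditions are taken from the driver, everything around them (function names, main) is illustrative:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Old check: compares the whole CSS field against 1, so any additional
 * (now legitimately used) bit makes the controller look unsupported. */
static bool nvm_supported_old(uint8_t css) { return css == 0x01; }

/* Fixed check: only asks whether the NVM command set bit (the LSB) is set. */
static bool nvm_supported_new(uint8_t css) { return (css & 0x01) != 0; }

int main(void) {
  uint8_t qemu_css = 0xC1;  /* LSB, MSB-1 and MSB set, as QEMU reports */
  printf("old check:   %s\n", nvm_supported_old(qemu_css) ? "supported" : "rejected");
  printf("fixed check: %s\n", nvm_supported_new(qemu_css) ? "supported" : "rejected");
  return 0;
}
```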
I also sent the summary and a link to edk2-devel, but so far haven't received any response - maybe mailing lists being broken by DMARC strikes again… For good measure, and being a good debian testing user, I also reported the bug to debian.