Phil on Stuff: Big ESX in a Tiny Box - What's up with the delay here?!

Okay, I owe everyone an explanation, so here goes.

I've run into a problem that was not present on the prior build. This is an extremely severe problem that makes the system completely unusable. Understand this is through no fault of design or implementation here, but rather, due to a very severe bug in ESX/ESXi's Intel EtherExpress driver, specifically in the MSIX Vector section of the e1000 driver. Please understand, this bug did not present previously in any situations I tested. Remember, every component in the system is on the HCL. The problem here is with an Intel i82574L ethernet controller; you can find it under IO Devices, Networking, partner name Intel.

At this point, I'm trying to get a 12x5 Basic contract via VMware so that this bug can be escalated properly. The exact issue, going into the technical side of things, has to deal with how ESX/ESXi handles the Interrupt Vector Address Routing or IVAR for PCI-Express MSIX. If this part goes way over your head, don't worry; it's supposed to. This requires that you have prior experience developing drivers and doing kernel programming to understand. It also requires knowledge of the Intel EtherExpress family and PCI-Express bus.

So, the i82574L/LA has a 5 entry IVAR. Typical drivers will use only the first three IVARs and ignore activities on entries 4 and 5. (Technically, 3 and 4, as it starts at 0. So I'm going to be starting from 0 here.) ESX/ESXi uses or touches all of the IVAR table, 0-4. The i82574 can operate in a number of modes, which are identified by the Function registry entry. In normal operation on most systems, Function will be 0, which indicates the following list of items:
- Operating as LAN0
- Operating as LAN1
- Operating as LAN0 shared with IPMI/BMC
- Operating as LAN1 shared with IPMI/BMC
Yes. A single Function mode indicator, indicates FOUR functions. So how do we control whether we're doing operations strictly for the host, or we're doing operations for the IPMI? By asserting via MSIX and the IVAR table.
Here's where ESX/ESXi's e1000 driver breaks in a predictable and reproducible fashion. I can't explain why it's breaking, only exactly HOW it's breaking.
When operating normally, the e1000 driver will LOSE the MSIX vector completely. This results in the Interrupt Status Register being lost, causing the driver to lose awareness of the controller state as well as halting all network traffic. This is not enough to crash ESX/ESXi, and the driver continues operation without asserting an error, even though interrupts are doing nothing. This also means that the driver is unaware of link state changes, so any HA/FT features will be rendered useless as far as that host is concerned. (Remote hosts will have to assert failure condition on network going unreachable and yank control.) If you attempt to use ethtool to diagnose the i82574L/LA at any point during operation in either online or offline mode, you will get this failure:

PCPU 0 locked up. Failed to ack TLB invalidate (0 others locked
up).
cr2=0x0 cr3=0x400ed000 cr4=0x16c
0:8353/ethtool 1:6231/sfcbd *2:4109/helper1-0 3:5287/sfcbd
4:5101/openwsman 5:5001/hostd 6:5100/openwsman 7:8450/prop_of_i
--
Saved backtrace from: pcpu 0 TLB NMI
Sanitizing and rewriting to make it make sense, here's the actual code path of the failure.
FindIRQInfo+0x69
RemoveIRQInfo+0x41
vmklnx_request_irq+0x32f
e1000_diag_test+0xc5f
ethtool_self_test+0xfc
__ethtool_ioctl+0xe62
vmklnx_ethtool_ioctl+0x7a
netdev_ioctl+0x101
NicCharOpsIoctl+0x65
VmkApiCharDevIoctl+0xe6
DevFSIoctl+0x3e5
FSS_Ioctl+0x17d
UserFile_PassthroughIoctl+0x44
LinuxFileDesc_Ioct+0x7e
User_LinuxSyscallHandler+0xa3

So what's the exact problem? When it allocates the interrupt vector, it promptly loses it. It appears to be IVAR entry 5, as ESXi reports looking for 0x5a5a5a5a and instead getting 0xffffffff.

Exposure of the problem appears to be tied to a change in board settings which most users will make, meaning it's very easy to trigger. (The initial build was slightly different.) Until I get this issue escalated within VMware to the point where they're actually guaranteeing further investigation as well as a fix for this problem, I can't tell you what parts to buy because obviously, they no longer work right now, even though they are on the VMware HCL.

Sorry folks. I'm working my tail off here to get VMware to look at this. I have the crash dumps available on request. I just don't have a support contract.

Phil on Stuff

Thursday, February 11, 2010

Big ESX in a Tiny Box - What's up with the delay here?!

1 comment:

Blog Archive

About Me

Followers