Errors in errata suck!

Posted: Sun Jan 26, 2014 2:24 pm
by just brew it!
A co-worker and I just spent a decent chunk of the weekend chasing a bug in one of Microchip's PIC32 microcontrollers. Hardware bugs are the last thing a software engineer suspects, because bugs are usually due to software!

The nature of the bug is that reading or writing a hardware device register that has side effects beyond the register access itself can get executed twice if a hardware interrupt occurs while the machine instruction that accesses the register is executing. A good example of this would be a UART with a hardware FIFO -- writing to the Tx data register stuffs a byte into the FIFO *and* increments the FIFO pointer.

There's an existing Microchip errata that says writes to device registers may be repeated if an interrupt occurs. We now have convincing evidence that reads are affected as well, so it would seem the errata itself needs an errata.

So now we get to vet all of the code for accesses to registers with side effects, and add code to disable/enable interrupts around all such accesses.

D'oh! :roll:
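
For anyone wondering what "disable/enable interrupts around all such accesses" looks like in practice, here is a minimal host-buildable sketch. Everything in it is a hypothetical stand-in: `irq_enabled`, `irq_save_and_disable()`, `irq_restore()`, and the `tx_fifo` model are made up for illustration. On the real part the toolchain's own interrupt disable/restore primitives and the actual UART Tx register would take their place.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins so the sketch builds on a host machine. */
bool irq_enabled = true;   /* models the global interrupt-enable bit */
uint8_t tx_fifo[8];        /* models the UART's hardware Tx FIFO */
unsigned tx_count;         /* models the FIFO write pointer */

/* Disable interrupts and report whether they were enabled beforehand. */
bool irq_save_and_disable(void)
{
    bool was_enabled = irq_enabled;
    irq_enabled = false;
    return was_enabled;
}

/* Put the interrupt-enable bit back the way the caller found it. */
void irq_restore(bool was_enabled)
{
    irq_enabled = was_enabled;
}

/* Models a store to the Tx data register: the side effect is that the
 * byte lands in the FIFO *and* the FIFO pointer advances. */
void uart_tx_reg_write(uint8_t byte)
{
    if (tx_count < sizeof tx_fifo)
        tx_fifo[tx_count++] = byte;
}

/* Guarded access: no interrupt can land on the instruction that touches
 * the register, so the access cannot be replayed by the erratum. */
void uart_put_byte(uint8_t byte)
{
    bool was_enabled = irq_save_and_disable();
    uart_tx_reg_write(byte);
    irq_restore(was_enabled);
}
```

Saving and restoring the previous state, rather than blindly re-enabling, keeps the wrapper safe to call from code that already runs with interrupts off.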

Re: Errors in errata suck!

Posted: Sun Jan 26, 2014 3:42 pm
by Deanjo
just brew it! wrote:
A co-worker and I just spent a decent chunk of the weekend chasing a bug in one of Microchip's PIC32 microcontrollers. Hardware bugs are the last thing a software engineer suspects, because bugs are usually due to software!

The nature of the bug is that reading or writing a hardware device register that has side effects beyond the register access itself can get executed twice if a hardware interrupt occurs while the machine instruction that accesses the register is executing. A good example of this would be a UART with a hardware FIFO -- writing to the Tx data register stuffs a byte into the FIFO *and* increments the FIFO pointer.

There's an existing Microchip errata that says writes to device registers may be repeated if an interrupt occurs. We now have convincing evidence that reads are affected as well, so it would seem the errata itself needs an errata.

So now we get to vet all of the code for accesses to registers with side effects, and add code to disable/enable interrupts around all such accesses.

D'oh! :roll:


I feel your pain. I've been bitten a few times by errors in AMD and Intel chipset errata.

Re: Errors in errata suck!

Posted: Sun Jan 26, 2014 4:44 pm
by UberGerbil
My first software job we had one guy who wasn't using the "official" NOP instruction for the 6502 but one of the others... which had undocumented side effects. Which we'd never heard of, and had no internet to use to find out about. That was fun.

Re: Errors in errata suck!

Posted: Sun Jan 26, 2014 5:02 pm
by notfred
Ran into something like that whilst doing the CRS-3 Ethernet interfaces. There was an errata on the Intel processor that had a read do a double access and take the second value. The problem was that we were trying to read from read/clear statistic registers and couldn't work out why they were all zero. That errata was actually documented but we were using library code and were not aware of it being hit.
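
A toy model of why the counters always came back zero, assuming the register simply clears itself on every read. The names `rx_packet_stat`, `stat_read_once()`, and `stat_read_with_erratum()` are hypothetical; this only simulates the behaviour described above and is not the actual Intel erratum or our library code.

```c
#include <stdint.h>

/* Hypothetical model of a read-to-clear statistics register. */
uint32_t rx_packet_stat = 0;

/* One bus read: returns the current count and clears it as a side effect. */
uint32_t stat_read_once(void)
{
    uint32_t value = rx_packet_stat;
    rx_packet_stat = 0;
    return value;
}

/* Models the erratum as described: the access is performed twice and the
 * second value is the one the software gets back. */
uint32_t stat_read_with_erratum(void)
{
    (void)stat_read_once();   /* first access consumes the real count */
    return stat_read_once();  /* second access sees the cleared register */
}
```

With the register holding any non-zero count, the doubled read hands back zero, which matches what we saw.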

Just last year I found an issue in an SOC's Ethernet implementation that caused them to respin the chip after having already taped out the supposedly final silicon. That must have cost them a bit!

A colleague is currently tracking down an issue in some PPC SPE code. Build it without SPE and it is fine; start adding debug and the failure point moves. Feels like a processor or compiler issue to me.

Re: Errors in errata suck!

Posted: Sun Jan 26, 2014 5:16 pm
by JustAnEngineer
notfred wrote:
I caused them to respin the chip after having already taped out the supposedly final silicon. That must have cost them a bit!
Up front, you might have cost them a month and a half. In the longer run, you saved them a bunch of time, money and headaches.

Re: Errors in errata suck!

Posted: Sun Jan 26, 2014 5:27 pm
by just brew it!
Back in the day (1990s) I found a bug in a Motorola chip where the DMA engine in the integrated NIC could lock up under certain conditions. Unfortunately the trigger conditions for the lockup depended on the network traffic (packet size and rate) that was being passed and was not under control of the firmware developer. The device using the chip was already fielded (first gen DSL modems in this case). My workaround (such as it was) actually made it into the official Motorola errata. The workaround consisted of periodically checking in software whether the DMA engine had stopped moving data, and forcing a hardware reset of the NIC if so. Yes, this caused some packet loss; but by that point you were already dropping packets on the floor because the DMA engine was wedged, so resetting the NIC was the lesser of two evils!
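
Roughly, the shape of that workaround was something like the sketch below. This is a reconstruction for illustration, not the original firmware: the progress counter `dma_bytes_moved`, the `dma_work_pending` flag, the `STALL_LIMIT` threshold, and `nic_hardware_reset()` are all hypothetical stand-ins for whatever the real chip exposed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for hardware state. */
uint32_t dma_bytes_moved;   /* progress counter the DMA engine would advance */
bool dma_work_pending;      /* true while data is waiting to be transferred */
unsigned nic_reset_count;   /* how many forced resets have happened */

/* Models the forced hardware reset of the NIC. */
void nic_hardware_reset(void)
{
    nic_reset_count++;
    dma_work_pending = false;   /* whatever was queued is lost */
}

/* Call periodically (e.g. from a timer tick). Returns true if the DMA
 * engine looked wedged and a reset was forced. */
bool dma_watchdog_poll(void)
{
    static uint32_t last_seen;      /* counter value at the previous poll */
    static unsigned stalled_polls;  /* consecutive polls with no progress */
    enum { STALL_LIMIT = 3 };       /* assumed threshold, tune per device */

    /* Idle, or data moved since the last poll: nothing to do. */
    if (!dma_work_pending || dma_bytes_moved != last_seen) {
        last_seen = dma_bytes_moved;
        stalled_polls = 0;
        return false;
    }

    /* Work is pending but nothing moved; tolerate a few polls first. */
    if (++stalled_polls < STALL_LIMIT)
        return false;

    stalled_polls = 0;
    nic_hardware_reset();
    return true;
}
```

Requiring several stalled polls in a row is a guess at a sensible design: it avoids resetting the NIC over a momentary pause while still recovering quickly from a real lockup.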

Stuff like this makes up the "dark corners" that end users never see, at least not directly. All complex chips have bugs, and a lot of those buggy chips make it into shipping products.

Re: Errors in errata suck!

Posted: Mon Jan 27, 2014 1:20 am
by NovusBogus
Ah man, that sucks. We use PICs and several other flavors of microcontroller plus an SBC that *ahem* the vendor doesn't always configure properly so I definitely feel your pain.

Moving from .NET to embedded development for a 10+ year old hodgepodge platform has been a strange experience; in the old days any bugs found were 100% guaranteed to be obviously bad code, now it's maybe 40% dev derp and 60% some kind of weird errata or technology compatibility thing that only manifests itself when the code is logically fine but not exactly what the hardware wants to see.

Re: Errors in errata suck!

Posted: Mon Jan 27, 2014 4:49 am
by just brew it!
NovusBogus wrote:
Ah man, that sucks. We use PICs and several other flavors of microcontroller plus an SBC that *ahem* the vendor doesn't always configure properly so I definitely feel your pain.

Heh, I've seen the flaky SBC show too. In fact, we initially suspected the SBC that was sending the data as the culprit, not the PIC that was receiving it.

NovusBogus wrote:
Moving from .NET to embedded development for a 10+ year old hodgepodge platform has been a strange experience; in the old days any bugs found were 100% guaranteed to be obviously bad code, now it's maybe 40% dev derp and 60% some kind of weird errata or technology compatibility thing that only manifests itself when the code is logically fine but not exactly what the hardware wants to see.

In this case, it is a custom board that was designed in-house. If the symptoms had been consistent with something wrong with the board I would've suspected hardware right off the bat, since the board design is relatively new. But it was a device internal to the PIC that was mis-behaving; I really did not expect the PIC itself to have what I consider to be a pretty blatant bug like this.

This board has been a bitch to debug. The PIC also controls the VRMs for the SBC and other devices in the system, so bugs in the PIC code can cause random flakiness like SBC reboots (by glitching the power rails). Fun times.

Re: Errors in errata suck!

Posted: Mon Jan 27, 2014 8:25 am
by Glorious
UberGerbil wrote:
My first software job we had one guy who wasn't using the "official" NOP instruction for the 6502 but one of the others... which had undocumented side effects. Which we'd never heard of, and had no internet to use to find out about. That was fun.


Sssh, don't encourage Shining Arcanine to come back and start ranting about what we've lost with the advent of microcode.

Because, you know, it's such a shame that later chips (i.e. any microprocessor designed since the late 70s) got enough of a transistor budget to properly implement a microcoded instruction decoder.

It's *WAY* cooler when each bit of a machine instruction directly diddles with the execution logic, as you discovered.

Right?

Re: Errors in errata suck!

Posted: Mon Jan 27, 2014 10:10 am
by morphine
Glorious wrote:
Sssh, don't encourage Shining Arcanine to come back and start ranting about what we've lost with the advent of microcode.


Do not invoke He Who Must Not Be Named, lest he be summoned :P

Re: Errors in errata suck!

Posted: Mon Jan 27, 2014 12:53 pm
by liquidsquid
Errata from hell: DragonBall VZ from Freescale. It was the next gen part to go in the Palm Pilots we were using for embedded applications. Had an LCD controller built in, but a terrible bootloader (I had to custom-write one). It was a nightmare. The errata sheet ran to almost 10 pages, covering a variety of issues my compiler knew nothing about working around.
I chose this part due to its seemingly widespread adoption, which turned out not to be true. Its low power consumption at the time, plus the graphics core, also played a part.

Lessons learned:
Use vendor compiler, they usually have a work-around in place soon after discovery of errata.
Don't use consumer level processors (they **** this part after Palm lost momentum).
Always choose a part with a robust well-supported community. This part had almost none.