 
just brew it!
Administrator
Topic Author
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Errors in errata suck!

Sun Jan 26, 2014 2:24 pm

A co-worker and I just spent a decent chunk of the weekend chasing a bug in one of Microchip's PIC32 microcontrollers. Hardware bugs are the last thing a software engineer suspects, because bugs are usually due to software!

The nature of the bug is that reading or writing a hardware device register that has side effects beyond the register access itself can get executed twice if a hardware interrupt occurs while the machine instruction that accesses the register is executing. A good example of this would be a UART with a hardware FIFO -- writing to the Tx data register stuffs a byte into the FIFO *and* increments the FIFO pointer.

There's an existing Microchip errata that says writes to device registers may be repeated if an interrupt occurs. We now have convincing evidence that reads are affected as well, so it would seem the errata itself needs an errata.

So now we get to vet all of the code for accesses to registers with side effects, and add code to disable/enable interrupts around all such accesses.
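
A minimal sketch of that guard pattern in C, for anyone following along. Everything here is a host-side simulation: `irq_save_and_disable()`, `irq_restore()`, `uart_tx_reg_write()` and the FIFO variables are hypothetical stand-ins, not Microchip's API. On the real part you would use whatever interrupt-control primitive the toolchain provides and the actual memory-mapped register.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated hardware state (hypothetical, for illustration only). */
static int irq_enabled = 1;      /* global interrupt enable flag */
static uint8_t tx_fifo[16];      /* UART transmit FIFO */
static unsigned tx_count = 0;    /* FIFO write pointer */

/* Disable interrupts and return the previous enable state. */
static int irq_save_and_disable(void)
{
    int prev = irq_enabled;
    irq_enabled = 0;
    return prev;
}

/* Restore the interrupt enable state saved earlier. */
static void irq_restore(int prev)
{
    irq_enabled = prev;
}

/* Stand-in for the Tx data register: a write pushes a byte AND
   advances the FIFO pointer, so a repeated write sends the byte twice. */
static void uart_tx_reg_write(uint8_t b)
{
    assert(!irq_enabled);        /* simulation-only check for unguarded access */
    if (tx_count < sizeof tx_fifo)
        tx_fifo[tx_count++] = b;
}

/* Guarded access: no interrupt can land while the register is touched. */
void uart_put_byte(uint8_t b)
{
    int prev = irq_save_and_disable();
    uart_tx_reg_write(b);
    irq_restore(prev);
}
```

Saving and restoring the previous state, rather than unconditionally re-enabling, keeps the wrapper safe to call from code that already has interrupts off.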

D'oh! :roll:
Nostalgia isn't what it used to be.
 
Deanjo
Graphmaster Gerbil
Posts: 1212
Joined: Tue Mar 03, 2009 11:31 am

Re: Errors in errata suck!

Sun Jan 26, 2014 3:42 pm

just brew it! wrote:
A co-worker and I just spent a decent chunk of the weekend chasing a bug in one of Microchip's PIC32 microcontrollers. Hardware bugs are the last thing a software engineer suspects, because bugs are usually due to software!

The nature of the bug is that reading or writing a hardware device register that has side effects beyond the register access itself can get executed twice if a hardware interrupt occurs while the machine instruction that accesses the register is executing. A good example of this would be a UART with a hardware FIFO -- writing to the Tx data register stuffs a byte into the FIFO *and* increments the FIFO pointer.

There's an existing Microchip errata that says writes to device registers may be repeated if an interrupt occurs. We now have convincing evidence that reads are affected as well, so it would seem the errata itself needs an errata.

So now we get to vet all of the code for accesses to registers with side effects, and add code to disable/enable interrupts around all such accesses.

D'oh! :roll:


I feel your pain. I've been bitten a few times by AMD and intel chipset errors in errata.
 
UberGerbil
Grand Admiral Gerbil
Posts: 10368
Joined: Thu Jun 19, 2003 3:11 pm

Re: Errors in errata suck!

Sun Jan 26, 2014 4:44 pm

My first software job we had one guy who wasn't using the "official" NOP instruction for the 6502 but one of the others... which had undocumented side effects. Which we'd never heard of, and had no internet to use to find out about. That was fun.
 
notfred
Maximum Gerbil
Posts: 4610
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Re: Errors in errata suck!

Sun Jan 26, 2014 5:02 pm

Ran into something like that whilst doing the CRS-3 Ethernet interfaces. There was an errata on the Intel processor that had a read do a double access and take the second value. The problem was that we were trying to read from read/clear statistic registers and couldn't work out why they were all zero. That errata was actually documented but we were using library code and were not aware of it being hit.
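
To illustrate why the counters came back as zero, here is a small host-side simulation in C. The names (`rx_frames_hw`, `stat_reg_access`, and the two read functions) are made up for the example and do not correspond to any real Intel or Ethernet MAC interface.

```c
#include <stdint.h>

/* Simulated read/clear statistics counter (hypothetical). */
static uint32_t rx_frames_hw = 0;

/* One bus access to the register: returns the count and clears it. */
static uint32_t stat_reg_access(void)
{
    uint32_t value = rx_frames_hw;
    rx_frames_hw = 0;
    return value;
}

/* What the software expects: a single access per read. */
uint32_t read_stat_normal(void)
{
    return stat_reg_access();
}

/* What the erratum does: the read is issued twice and the second
   value is kept, so the real count is thrown away by the first access. */
uint32_t read_stat_with_erratum(void)
{
    (void)stat_reg_access();
    return stat_reg_access();
}
```

With the counter at any nonzero value, the normal read returns it, while the doubled read returns zero because the first access has already cleared the register.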

Just last year I found an issue in an SOC's Ethernet implementation that caused them to respin the chip after having already taped out the supposedly final silicon. That must have cost them a bit!

A colleague is currently tracking down an issue in some PPC SPE code: build it without SPE and it is fine; start adding debug and the failure point moves. Feels like a processor or compiler issue to me.
 
JustAnEngineer
Gerbil God
Posts: 19673
Joined: Sat Jan 26, 2002 7:00 pm
Location: The Heart of Dixie

Re: Errors in errata suck!

Sun Jan 26, 2014 5:16 pm

notfred wrote:
I caused them to respin the chip after having already taped out the supposedly final silicon. That must have cost them a bit!
Up front, you might have cost them a month and a half. In the longer run, you saved them a bunch of time, money and headaches.
· R7-5800X, Liquid Freezer II 280, RoG Strix X570-E, 64GiB PC4-28800, Suprim Liquid RTX4090, 2TB SX8200Pro +4TB S860 +NAS, Define 7 Compact, Super Flower SF-1000F14TP, S3220DGF +32UD99, FC900R OE, DeathAdder2
 
just brew it!
Administrator
Topic Author
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Errors in errata suck!

Sun Jan 26, 2014 5:27 pm

Back in the day (1990s) I found a bug in a Motorola chip where the DMA engine in the integrated NIC could lock up under certain conditions. Unfortunately the trigger conditions for the lockup depended on the network traffic (packet size and rate) that was being passed and was not under control of the firmware developer. The device using the chip was already fielded (first gen DSL modems in this case). My workaround (such as it was) actually made it into the official Motorola errata. The workaround consisted of periodically checking in software whether the DMA engine had stopped moving data, and forcing a hardware reset of the NIC if so. Yes, this caused some packet loss; but by that point you were already dropping packets on the floor because the DMA engine was wedged, so resetting the NIC was the lesser of two evils!
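
A rough sketch of that kind of software watchdog in C. This is not the code that went into the Motorola errata; the struct, the function name, and the idea of a `progress` counter plus a `work_pending` flag are assumptions made for illustration. The caller is assumed to have some way to read how much data the DMA engine has moved and whether anything is queued for it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bookkeeping for the periodic check (hypothetical layout). */
typedef struct {
    uint32_t last_progress;   /* progress counter seen on the previous poll */
    unsigned stalled_polls;   /* consecutive polls with work but no progress */
    unsigned stall_limit;     /* polls to tolerate before declaring a lockup */
    unsigned resets;          /* how many times a reset was requested */
} dma_watchdog;

/* Call periodically. Returns true when the caller should force a
   hardware reset of the NIC because the DMA engine looks wedged. */
bool dma_watchdog_poll(dma_watchdog *w, uint32_t progress, bool work_pending)
{
    /* Data moved, or nothing is queued: the engine is not stuck. */
    if (progress != w->last_progress || !work_pending) {
        w->last_progress = progress;
        w->stalled_polls = 0;
        return false;
    }

    /* Work is pending but nothing moved since the last poll. */
    if (++w->stalled_polls < w->stall_limit)
        return false;

    w->stalled_polls = 0;
    w->resets++;
    return true;
}
```

The `stall_limit` threshold is there so that a single slow poll does not trigger a reset; the right value depends on the poll period and the traffic, which this sketch does not try to guess.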

Stuff like this makes up the "dark corners" that end users never see, at least not directly. All complex chips have bugs, and a lot of those buggy chips make it into shipping products.
Nostalgia isn't what it used to be.
 
NovusBogus
Graphmaster Gerbil
Posts: 1408
Joined: Sun Jan 06, 2013 12:37 am

Re: Errors in errata suck!

Mon Jan 27, 2014 1:20 am

Ah man, that sucks. We use PICs and several other flavors of microcontroller plus an SBC that *ahem* the vendor doesn't always configure properly so I definitely feel your pain.

Moving from .NET to embedded development for a 10+ year old hodgepodge platform has been a strange experience; in the old days any bugs found were 100% guaranteed to be obviously bad code, now it's maybe 40% dev derp and 60% some kind of weird errata or technology compatibility thing that only manifests itself when the code is logically fine but not exactly what the hardware wants to see.
 
just brew it!
Administrator
Topic Author
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Errors in errata suck!

Mon Jan 27, 2014 4:49 am

NovusBogus wrote:
Ah man, that sucks. We use PICs and several other flavors of microcontroller plus an SBC that *ahem* the vendor doesn't always configure properly so I definitely feel your pain.

Heh, I've seen the flaky SBC show too. In fact, we initially suspected the SBC that was sending the data as the culprit, not the PIC that was receiving it.

NovusBogus wrote:
Moving from .NET to embedded development for a 10+ year old hodgepodge platform has been a strange experience; in the old days any bugs found were 100% guaranteed to be obviously bad code, now it's maybe 40% dev derp and 60% some kind of weird errata or technology compatibility thing that only manifests itself when the code is logically fine but not exactly what the hardware wants to see.

In this case, it is a custom board that was designed in-house. If the symptoms had been consistent with something wrong with the board I would've suspected hardware right off the bat, since the board design is relatively new. But it was a device internal to the PIC that was mis-behaving; I really did not expect the PIC itself to have what I consider to be a pretty blatant bug like this.

This board has been a bitch to debug. The PIC also controls the VRMs for the SBC and other devices in the system, so bugs in the PIC code can cause random flakiness like SBC reboots (by glitching the power rails). Fun times.
Nostalgia isn't what it used to be.
 
Glorious
Gerbilus Supremus
Posts: 12343
Joined: Tue Aug 27, 2002 6:35 pm

Re: Errors in errata suck!

Mon Jan 27, 2014 8:25 am

UberGerbil wrote:
My first software job we had one guy who wasn't using the "official" NOP instruction for the 6502 but one of the others... which had undocumented side effects. Which we'd never heard of, and had no internet to use to find out about. That was fun.


Sssh, don't encourage Shining Arcanine to come back and start ranting about what we've lost with the advent of microcode.

Because, you know, since it's such a shame that later chips (i.e. any microprocessor designed since the late 70s) got enough of a transistor budget to properly implement a microcoded instruction decoder.

It's *WAY* cooler when each bit of a machine instruction directly diddles with the execution logic, as you discovered.

Right?
 
morphine
TR Staff
Posts: 11600
Joined: Fri Dec 27, 2002 8:51 pm
Location: Portugal (that's next to Spain)

Re: Errors in errata suck!

Mon Jan 27, 2014 10:10 am

Glorious wrote:
Sssh, don't encourage Shining Arcanine to come back and start ranting about what we've lost with the advent of microcode.


Do not invoke He Who Must Not Be Named, lest he be summoned :P
There is a fixed amount of intelligence on the planet, and the population keeps growing :(
 
liquidsquid
Minister of Gerbil Affairs
Posts: 2661
Joined: Wed May 29, 2002 10:49 am
Location: New York

Re: Errors in errata suck!

Mon Jan 27, 2014 12:53 pm

Erratas from hell: DragonBall VZ from Freescale. It was the next-gen part to go in the Palm Pilots we were using for embedded applications. It had an LCD controller built in, but a terrible bootloader (I had to custom-write one). It was a nightmare. The errata sheet ran to almost 10 pages, covering a variety of problems my compiler knew nothing about working around.
I chose this part because of its seemingly widespread adoption, which turned out not to be true, and also because of its low power consumption at the time plus the graphics core.

Lessons learned:
Use the vendor's compiler; they usually have a workaround in place soon after an erratum is discovered.
Don't use consumer level processors (they **** this part after Palm lost momentum).
Always choose a part with a robust well-supported community. This part had almost none.
