Battery EV meets OCZ

Thank you for contacting support. Have you tried turning it off and back on again?

My EV has a Flash drive; its name is Vertex 2…

A side effect of the computing revolution is that problems once confined to old-timey desktop PC users now appear everywhere. Unexpected smoke from the smart toaster? Better check the manual. Smartphone antivirus apps? They exist, sure enough. Devious Russian election hackers living in your refrigerator? Done and done, you unlucky soul. The past week’s tech news circuit dredged up another old nemesis: it seems a number of Tesla automobiles are being bricked by Flash memory wear-out, due to bad controller behavior.

The news popped up on numerous tech and automotive sites, and the always-interesting TTAC offered a representative take: the vehicles are logging far too much data to the MCU module (i.e. the onboard control display and its circuit board), and vehicles equipped with MCUv1 are hardest hit. There’s a spread of reported failure times but four years seems to be the tipping point, and there isn’t a graceful failure mode for this scenario. The car’s response ranges from “won’t respond to various inputs” to “won’t anything.”

Sound familiar, fellow computer enthusiasts? Dial up the way-back machine aaaannndd…

Basics of Flash (the media, not the mud brick)

Hold that thought for a moment. Quick review: besides being the now-reviled name of an increasingly-deprecated Adobe product, “Flash” refers to a form of non-volatile memory storage. Nearly every device featuring any kind of software operating system uses it these days for base storage and temporary workspace. Smartphones, tablets, portable media players — all exist in their present form because of it. Flash is also the main component in the solid-state drives (SSDs) found in most laptops and many desktop PCs.

A common Flash module memorius nonvolatilem,
newly emergent from bulk packaging.

For details on what Flash is and does, check out the Wikipedia article. At a high level, each storage location (block) holds a small electric charge representing stored information, and can do so for a very long time after power is lost. When the device is running, the charge can be modified by the device controller. However, the block takes wear and tear in trade: the maximum charge it can store decreases until, at end of life, its charge state cannot be reliably identified.
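As a crude illustration of that charge-window idea (all numbers here are invented for the sketch), a cell’s state is read by comparing its charge against a threshold, and a worn cell can no longer clear that threshold with enough margin to be trusted:

```python
# Toy picture of charge-window degradation. Threshold and margin values
# are made up for illustration; real NAND sensing is far more involved.

READ_THRESHOLD = 0.5

def readable(max_charge, margin=0.1):
    # A programmed "1" stores max_charge; it must clear the read
    # threshold by at least `margin` to be distinguished from a "0".
    return max_charge - READ_THRESHOLD >= margin

print(readable(1.0))    # fresh cell: True
print(readable(0.55))   # worn cell whose charge window has sagged: False
```

The real mechanism is oxide damage accumulating with each program/erase cycle, but the end result is the same as in the toy model: the stored level drifts too close to the decision threshold to read back reliably.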

Not all blocks take wear at the same rate, so a Flash device’s controller will attempt a wear-leveling scheme. A reserve pool of spare blocks is also commonly provided. When the controller detects a block is nearing end of life, the block is locked out and its duties reprovisioned to a spare from the reserve pool. To the outside world, nothing has changed. But that can’t continue forever: once the reserve pool is exhausted, the Flash device fails. If another device is expecting to find a critical component (such as its own operating system) in the Flash device, and doesn’t, the user may find themselves owning an expensive doorstop.
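That remapping dance can be sketched in a few lines of Python. This is a toy model, not any real controller’s algorithm; the block counts and the 3,000-cycle endurance figure are invented for illustration:

```python
# Toy model of a Flash controller retiring worn blocks into a spare pool.
# Numbers are illustrative only; real controllers also wear-level across
# all blocks rather than hammering one physical location.

MAX_ERASES = 3000   # assumed per-block endurance for this sketch

class ToyFlashController:
    def __init__(self, data_blocks=8, spare_blocks=2):
        # erase counters for every physical block, data + reserve pool
        self.wear = [0] * (data_blocks + spare_blocks)
        # logical block -> physical block mapping table
        self.mapping = list(range(data_blocks))
        # physical blocks held in reserve
        self.spares = list(range(data_blocks, data_blocks + spare_blocks))

    def write(self, logical_block):
        phys = self.mapping[logical_block]
        self.wear[phys] += 1
        # block nearing end of life: retire it, remap to a spare
        if self.wear[phys] >= MAX_ERASES:
            if not self.spares:
                raise IOError("reserve pool exhausted -- device failed")
            self.mapping[logical_block] = self.spares.pop(0)

ctrl = ToyFlashController()
for _ in range(3000):
    ctrl.write(0)            # hammer one logical block
print(ctrl.mapping[0])       # logical block 0 now lives on spare block 8
```

To the host, logical block 0 never moved; keep hammering, though, and the `IOError` at the bottom of `write()` is the toy version of the doorstop scenario above.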

In days long past, the Tech Report ran a long-term SSD endurance experiment, concluding that with good design, a Flash-based SSD can have a usable service life far exceeding the equipment in which it lives. However, many in the TR community, and elsewhere, had previously experienced the dark side of the Flash. Just under ten years ago, when SSDs were first becoming available at mainstream prices, there were more than a few unexpected failures, and they were brutal. Traditional hard drives would sometimes give warning that they were dying and still spit out useful, if garbled, information, rather like actors who die during key transitions of a film plot. SSDs, on the other hand, tended to go out like the proverbial 24-year-old athlete with a previously undetected heart murmur.

We all know what a deep lime green means.

Chief among the culprits were a couple of 2011/2012-ish runs of the Agility, Octane, and Vertex series SSDs from now long-dead vendor OCZ Technology. OCZ, as a brand, isn’t completely gone, but the past company lives in infamy with many enthusiasts (and its former CEO). Failure rates on some of those products allegedly reached 50%. OCZ’s problems seem to have been due to buggy firmware, possible silicon-level design flaws in its suppliers’ hardware, or both, particularly with one notorious generation of SandForce controllers.

It was never entirely clear how many units died because the Flash itself was damaged by bad controller behavior, and how many were bricked because of an Ouroboros effect in the firmware. The result was the same: one day you had a fast, reliable, high-tech, Flash storage device. The next day your PC would hang while displaying “no bootable media found.”

The data-logging of doom

When SSDs were still a new technology, a popular debate on enthusiast forums was whether swap-file activity would lead to the premature death of an SSD. PC operating systems, and Windows in particular, would keep a running file of temporary data on the local storage drive, to be “swapped” in and out of active system memory on demand. The situation was presumed worse if the storage drive was relatively full, since a smaller range of free space would be ‘hit’ by the ongoing swap activity.

According to Tom’s Hardware, which has looked at the computing side of the Tesla situation with a bit more depth, a similar scenario has occurred here. When the vehicle designs were new, the firmware installed in the MCUv1’s Flash module was relatively small, and the datalogging software had a large playground. However, one of Tesla’s selling points has been over-the-air (OTA) updates to enable new features or improve existing ones. Likewise, the company has become famous for performing detailed data acquisition and analysis to optimize those updates. In time, the size of the installed firmware grew large, leaving less empty space for the aggressive datalogger to chew up.
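A back-of-the-envelope sketch shows why shrinking free space matters so much. All of the figures below are invented for illustration (an 8 GB module, 3,000-cycle block endurance, 1 GB/day of logging) and assume perfect wear-leveling spread across only the free blocks:

```python
# Rough illustration: per-block wear rate scales inversely with free
# space when logging is confined to the free area. Numbers are invented.

TOTAL_GB = 8                  # assumed module capacity
BLOCK_ENDURANCE = 3000        # assumed program/erase cycles per block
LOG_RATE_GB_PER_DAY = 1.0     # assumed datalogger write volume

def years_to_wearout(firmware_gb):
    """Estimate wear-out time when logs are spread over free space only."""
    free_gb = TOTAL_GB - firmware_gb
    # each free gigabyte absorbs an equal share of the daily log writes
    cycles_per_day = LOG_RATE_GB_PER_DAY / free_gb
    return BLOCK_ENDURANCE / cycles_per_day / 365

print(round(years_to_wearout(2), 1))  # early, small firmware: ~49 years
print(round(years_to_wearout(7), 1))  # grown firmware: ~8 years
```

The absolute numbers are fiction, but the inverse relationship is the point: squeeze the same datalogger into one-sixth the free space and the wear-out date arrives six times sooner, which is roughly the shape of the four-year failure cluster described above.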

The solution?

In the SSD space, things improved in several ways. SSD capacity became much larger at affordable prices. The average drive wasn’t being used as close to capacity as in early models, and overprovisioning of spare media inside the drive was also more practical. Controllers became smarter about their wear-leveling schemes, while operating systems gained awareness of SSDs and modified their activities to suit.

What Tesla will do here remains to be seen. Other consumer devices like smartphones, tablets, and even many laptops may have an expected life shorter than a car finance term, and are screwed and glued together with unrepairable abandon. Flash memory wear-out is something like 54th down the list of design liabilities, and even then it can be handled relatively easily in the controller and OS design. When a substantial computer is fully integrated into a car, long-term reliability and repairability must be factored into the design.

Obviously, failed MCUv1 unit assemblies will need replacement, although who bears the cost of the failure is an interesting question. As long as the vehicles aren’t prone to in-flight shutdowns with associated liability, Tesla may not have a legal incentive to issue a recall. A number of techs who understand the problem have asked for a design change to calm the car’s data-logging behavior. In future vehicles, the amount of onboard Flash memory will doubtless be made much larger, too. Apparently Tesla vehicles with MCUv2, and the Model 3, already have a larger Flash reserve.

For future designs, the Flash reserve could be increased further at minimal cost, and even partitioned into separate banks: one for the mission-critical firmware that is touched only by updates and critical I/O, and another for real-time use and non-critical accessory operations. Perhaps, in a future version, the Flash itself will even be moved to a separate module that can be replaced without trashing the entire display assembly.

Regardless, it would be desirable to avoid killing off yet another MCU, since Infinity War and Endgame were pretty thorough in that regard.

Aaron Vienot

Engineer by day, hobbyist by night, occasional contributor, and full-time wise guy.

Wirko

I choose to send people to Mars a couple years from now, not because it’s easy, but because it’s hard to do it when you put cheap consumer parts in mission control computers, har, har.

Dragontamer5788

Likewise, the company has become famous for performing detailed data acquisition and analysis to optimize those updates. In time, the size of the installed firmware grew large, leaving less empty space for the aggressive datalogger to chew up. Hmmm… I had a discussion on Reddit, and while the above theory is written about all over, it may not be correct. Let’s say we have a flash chip with 30 blocks. Firmware was written to block #1 through block #25. And let’s assume that firmware is rarely written: maybe once a month or less. Let’s say we have a log file. Assume it remains small… Read more »

ludi

There are several possibilities, of course; that’s just the one Tom’s Hardware suspected.

FWIW, Micron refers to the two methods by opposite names from what you’ve got:

https://www.micron.com/-/media/client/global/documents/products/technical-note/nand-flash/tn2942_nand_wear_leveling.pdf

…presumably because it brings the static data blocks into the reallocation pool?

It could be that the best wear-leveling algorithm was in place and the volume of data the logger was dumping was just so massive that even the wear-leveled flash chip couldn’t cope with it.

Dragontamer5788

FWIW, Micron refers to the two methods by opposite names from what you’ve got:

That’s probably because I herp-derped and messed up. Ah well…

It could be that the best wear-leveling algorithm was in place and the volume of data the logger was dumping was just so massive that even the wear-leveled flash chip couldn’t cope with it.

Yeah, it’s a possibility. I don’t think there’s evidence pointing one way or the other quite yet, but unless someone knows for sure which wear-leveling algorithm was used (static vs. dynamic), we’ll have to keep the different possibilities in mind.

tweak

Wear has to do with write operations. You’d have to update the firmware thousands of times a day for it to affect the block lifespan. Since you already stated it might be updated about once a month, well, you get the idea. Now, data collection could affect wear, but I think the issue has more to do with the available space. Take for instance a smartphone with, say, 8 GB of storage. At first it would seem it’s more than enough for a few apps. But in time it’s just not enough, because WhatsApp was say 20 MB when it was first… Read more »

Krogoth

Pfft, I prefer brain bleach for permanent data removal.
