Zigbee -- IKEA and other EFR32MG1P based products occasionally crashing

There appears to be a SiLabs bug that affects most EFR32MG1P based products on the market, where the device will occasionally (if rarely) crash. After the crash, the device no longer responds to unicast APS frames, but still MAC ACKs anything. Group/broadcast frames, including OnOff (0x6), still work fine. The device also still broadcasts link status frames, but reports zero links.
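
To make that symptom pattern easier to check against a capture, here is a minimal sketch in Python. The `Observation` fields and the `looks_crashed` helper are hypothetical names for illustration, not part of any Zigbee stack or sniffer API; you would fill them in by hand from what you see in a trace.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """What was seen for one router in a capture (hypothetical fields)."""
    mac_ack_seen: bool        # device still MAC ACKs unicast frames
    aps_reply_seen: bool      # device answers unicast APS frames
    groupcast_works: bool     # group/broadcast commands such as OnOff still act
    link_status_entries: int  # number of links in its latest NWK link status


def looks_crashed(obs: Observation) -> bool:
    """Return True when all four symptoms from the description are present."""
    return (
        obs.mac_ack_seen
        and not obs.aps_reply_seen
        and obs.groupcast_works
        and obs.link_status_entries == 0
    )


healthy = Observation(True, True, True, 4)
crashed = Observation(True, False, True, 0)
print(looks_crashed(healthy), looks_crashed(crashed))
```

The check is deliberately strict (all four symptoms must match), so a router that is merely out of range or busy should not be flagged.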

End device children joined to the crashed device will not look for a new parent; they seem to think the parent is still working fine. The crashed router will not forward all frames from its end devices. For example, an Aqara motion sensor will send an occupancy attribute report that is lost, while its lux attribute report is delivered successfully.

This bug has been confirmed by manup, one of the deconz developers, and also by some of the Home Assistant ZHA developers. It seems more prevalent in large (100+ device) networks, but even then devices can run for anywhere from a few days to a few months before crashing.

I’m posting this here just to help spread the word. I’ve tried to send a few messages to the IKEA Embedded System Engineering group to notify them of the issue. This bug may be fixed in EmberZNet 6.7.3, as the symptoms in the release notes match perfectly, but the triggering conditions do not, since ECC/SmartEnergy is not being used.

I know Chris is an excellent EZSP resource (thanks for the amazing sniffer, I use it daily) and didn’t know if he had any other thoughts he could share.

Thank you!

Thanks for pointing this out. If I understand this correctly it will not be a problem for any users here since CBKE is not used, and I would expect that the ECC libraries are not compiled into any normal user code.

The ECC libraries are not provided by Silabs as standard - they require additional registration, and are only normally used for SmartMeter (SEP) systems. We use this with one customer who is an energy supplier in the USA, but no other systems that I know of that are being used with OH have the ECC libs compiled into their firmware.

Thanks @chris – It does actually affect all users of a normal HA 1.2 / ZB 3.0 network. I have confirmed cases on deconz, zigbee2mqtt, home assistant ZHA and hue bridges so far – none of which are using CBKE/ECC or SmartEnergy. I also have some packet captures using your tool showing the issue.

The bug I picked out is just the closest match so far in the EmberZNet release notes for the symptoms.

Things we have tried so far to “rescue” a crashed IKEA device:

  • ZDP leave request (with rejoin) unicast to the IEEE address
  • ZDP leave request (with rejoin) broadcast to all mains routers (0xFFFE) for the IEEE
  • NWK leave request (with rejoin)
  • Spoofing NWK link status with a valid entry for the device, to hopefully get the coordinator into its neighbor table
  • Artificially increasing frame counters
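
For anyone wanting to repeat the ZDP leave attempt, this is a rough sketch of how the Mgmt_Leave_req payload could be assembled by hand. It assumes the usual ZDO layout (sequence number, 64-bit device address in little-endian order, one options byte with Remove Children in bit 6 and Rejoin in bit 7). The function name and the example IEEE address are made up for illustration, and actually sending the frame is left to whatever stack or tool you use.

```python
import struct

MGMT_LEAVE_REQ_CLUSTER = 0x0034  # ZDO Mgmt_Leave_req cluster ID
LEAVE_REMOVE_CHILDREN = 0x40     # options byte, bit 6
LEAVE_REJOIN = 0x80              # options byte, bit 7


def mgmt_leave_req_payload(tsn: int, ieee: str, rejoin: bool = True,
                           remove_children: bool = False) -> bytes:
    """Build the ZDO payload: sequence number, IEEE (little-endian), options."""
    ieee_bytes = bytes.fromhex(ieee.replace(":", ""))
    if len(ieee_bytes) != 8:
        raise ValueError("IEEE address must be exactly 8 bytes")
    options = 0
    if rejoin:
        options |= LEAVE_REJOIN
    if remove_children:
        options |= LEAVE_REMOVE_CHILDREN
    # The address is written most-significant byte first, so reverse it
    # to get the over-the-air little-endian order.
    return struct.pack("<B", tsn) + ieee_bytes[::-1] + struct.pack("<B", options)


# Hypothetical address, leave with rejoin requested.
payload = mgmt_leave_req_payload(1, "00:0d:6f:ff:fe:12:34:56")
print(payload.hex())
```

As listed above, none of these leave variants brought a crashed device back, so treat this as a way to reproduce the attempt rather than as a fix.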

If interested the deconz issues chasing these down are here:


Ok, I was simply going on what I read in the image you highlighted above which distinctly talks about the ECC libraries which should not be incorporated in most devices as far as I know (certainly they aren’t included in any of the dongles).

I obviously can’t confirm this, but I would doubt that the Ikea bulbs include the ECC libraries - unless you know differently?

It sounds like the issue you’re reporting isn’t therefore related to the CBKE/ECC fix that you highlighted, which is a little confusing, and makes it a little hard to understand the issue :wink:

I’m happy to take a look - it might help me understand the issue at least :slight_smile:

Thanks.

Went back and I’ve trashed the cleanest examples of one falling off the network. I’ll try to recapture and post a PCAP file later.

The issue is definitely exacerbated when a large number of devices rejoin the network (for example, if you power off a room full of bulbs and then power them back on).

Finally identified the bug - there is a low level stack bug in the SiLabs EFR32 used by IKEA devices that causes them to crash.

This bug is fixed in EmberZNet 6.7.7 and higher from SiLabs, but a majority of IKEA devices are running earlier 6.0 to 6.5 releases, all of which are affected by this bug.
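
If you know which EmberZNet release a device's firmware is built on, a small helper like the following can sort it into those buckets. This is only a sketch: the function names are made up, and the version ranges are simply the ones stated above (6.0 to 6.5 affected, 6.7.7 and higher fixed). Anything outside those ranges is reported as unknown rather than guessed.

```python
from typing import Tuple

AFFECTED_FROM = (6, 0, 0)   # earliest release reported as affected
AFFECTED_BELOW = (6, 6, 0)  # i.e. everything up to and including 6.5.x
FIXED_FROM = (6, 7, 7)      # first release reported as fixed


def parse_version(text: str) -> Tuple[int, int, int]:
    """Turn '6.3.1' or '6.5' into a comparable (major, minor, patch) tuple."""
    parts = [int(p) for p in text.strip().split(".")]
    parts += [0] * (3 - len(parts))  # pad short versions such as '6.5'
    return (parts[0], parts[1], parts[2])


def classify(version: str) -> str:
    """Return 'affected', 'fixed' or 'unknown' for an EmberZNet version."""
    v = parse_version(version)
    if v >= FIXED_FROM:
        return "fixed"
    if AFFECTED_FROM <= v < AFFECTED_BELOW:
        return "affected"
    return "unknown"


for example in ("6.3.1", "6.5", "6.7.7", "6.6.2"):
    print(example, classify(example))
```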

The bug occurs when a bulb is processing a ZDO parent announcement and other traffic arrives. It seems to trigger more often on large networks.

Have some test code that does a ZDO parent flood and can now reproduce on demand.

Anyone know which is the latest version of EmberZNet FW for IKEA Trådfri Signal Repeater E1746?

That is, which version of EmberZNet is the latest official OTA update from IKEA for the E1746 based on?

I’m not sure whether CBKE is the culprit, but #444929 does sound like the exact behavior we’re seeing with Ikea:

Fixed a race condition that could result in network retry queue overflow. The race condition could occur on a router or coordinator device, causing the NWK Retry Queue to lock up and overflow, such that no more NWK or APS layer frames would be transmitted. This could occur when all of the following conditions were met: (1) An APS broadcast frame addressed to 0xFFFF is pending on the parent for relaying to at least one sleepy child on next poll. (2) Parent queues this broadcast data as a MAC unicast for transmission to its sleepy child after receiving a Data Request and begins transmission of the data. (3) A second broadcast to 0xFFFF is queued for relaying to sleepy children before the in-flight transmission of the first broadcast completes its handoff to the end device.

CBKE is not normally included in any product except Smart Energy. Silicon Labs do not provide CBKE libraries in their standard distributions - it is licensed separately and needs to be requested separately and manually added to the build - I would be surprised if it is included in Ikea devices.

That’s what I meant. But take a look at issue #444929 from the release notes. IIRC there were some reports of the Ikea lockup happening (or at least being noticed) shortly after a power outage, when a bunch of devices come back online at once and send a bunch of broadcasts, e.g. parent announcements.

I think the latest FW for E1746 is 2.2.005 and is based on EmberZNet 6.3.1