"HELP! My UPnP device isn't working right!" and other short stories

Does that sound familiar to you? Then you’ve come to the right place! With the release of 3.1GA coming in the next few weeks, I wanted to document some of the updates made to the underlying UPnP framework (known as jupnp) that were recently incorporated into the 3.1 snapshot (#2414 for anyone keeping track). In particular, we are now running jupnp 2.6.0. To give some backstory, over the past few years there have been reports of UPnP-based devices (e.g. Sonos, Panasonic TVs, etc.) that “act weird” or have stability issues. The recent changes aim to relieve those issues and give the end user additional configuration options to tune their system. I’ll note up front that this is an advanced-to-expert-level topic, so if you’re new here, please be careful and ask questions if you’re not sure what all this means!

First things first: “how do I know what I did worked?” That’s an easy one! In your karaf console, run:

log:set INFO org.jupnp.OSGiUpnpServiceConfiguration

Be careful about setting the whole “org.jupnp” logger to anything more verbose than FATAL; it will generate a massive amount of logs. In theory, you should only see about 5-8 lines of output in openhab.log at startup and that’s it. It will look something like this (the values shown here are tuned ones, not the defaults listed further down):

2021-06-05 02:39:53.428 [INFO ] [g.jupnp.OSGiUpnpServiceConfiguration] - OSGiUpnpServiceConfiguration createConfiguration threadPoolSize = -1 false
2021-06-05 02:39:53.429 [INFO ] [g.jupnp.OSGiUpnpServiceConfiguration] - OSGiUpnpServiceConfiguration createConfiguration asyncThreadPoolSize = -1 false
2021-06-05 02:39:53.430 [INFO ] [g.jupnp.OSGiUpnpServiceConfiguration] - OSGiUpnpServiceConfiguration retryAfterSeconds = 10
2021-06-05 02:39:53.430 [INFO ] [g.jupnp.OSGiUpnpServiceConfiguration] - OSGiUpnpServiceConfiguration retryIterations = 5
2021-06-05 02:39:53.430 [INFO ] [g.jupnp.OSGiUpnpServiceConfiguration] - OSGiUpnpServiceConfiguration timeoutSeconds = 10

Second, and very important: DO NOT MAKE THE CHANGES BELOW WHILE OH IS STARTED! Yes, it will accept them, but I’ve had a bunch of weird, crazy things happen to my system stability when applying them while it was running. There are only a few options here, so just stop, configure, start.

Third, “where do I make these changes?” There are a variety of answers, but my personal preference is services/runtime.cfg (at the bottom). Just paste the lines in there and they will be read at start (they are also picked up if you edit the file while OH is up, but see the warning above). You can also do it through the karaf console, but that’s more complicated than adding a few characters to a text file. Your choice.

Ok, now that that’s out of the way: two groups of changes were made.

  1. Threading. Ok, so “why do I need to mess with the threading?” The most likely answer is stability. In my case, I had Sonos speakers that were going offline randomly and for no good reason. The reality was that I was running out of threads during times of high UPnP load, and traffic was being unintentionally dropped on the floor instead of being processed. Figuring this out was not trivial, and it is likely something 99.95% of OH users will not attempt to do. Something VERY important to understand about the UPnP framework is that what you configure in OH in terms of bindings, things, items, etc. is irrelevant to it. The jupnp framework is added to the system the moment one UPnP-based binding is added to the mix (even if you don’t have a single thing configured yet). The system will identify and report every UPnP device on your network to OH for processing. It will also process every UPnP message it receives. There is no way to stop that. So while you may only have one TV or one speaker configured, if you have 100 different UPnP devices on your network, OH will track 100 devices, and that requires multithreading! Understanding what’s on your network is therefore very important to overall system stability.

So, now for “what are my options”. (Defaults depicted below for reference)

org.jupnp:threadPoolSize=20
org.jupnp:asyncThreadPoolSize=20

The configuration above creates two thread pools for use. By default each is capped at 20. DO NOT GO SMALLER THAN 20 UNLESS YOU HAVE A GOOD REASON TO! There is actually a third thread pool that is created, sized as the sum of those two, so be aware that whatever you do, the total is actually double. There is not a 1-to-1 correlation of threads to UPnP devices. This is a pool; each device uses it when it needs it. First step: dial this up (slowly) to see if you get added stability. You may just need a small bump.
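
As a rough illustration of that first step, a modest bump in services/runtime.cfg might look like the fragment below. The value 30 is only an assumed starting point chosen for illustration, not a recommendation; pick what suits your hardware and raise it gradually.

```properties
# services/runtime.cfg (edit while OH is stopped)
# Illustrative values only: a small step up from the default of 20.
org.jupnp:threadPoolSize=30
org.jupnp:asyncThreadPoolSize=30
```

After the next start, the INFO lines from OSGiUpnpServiceConfiguration should report the new sizes, which is a quick way to confirm the file was read.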

Ok, so now what happens when you still have stability issues because your UPnP environment is just massive (which was my problem, and I didn’t realize it for almost 2 years)? If (key word here) you have enough memory/CPU on your OH instance, you can just let the system freely create threads without a limit. You do this by setting either (or both) of the configurations above to -1. Two things happen when you do that. First, obviously, that pool is allowed to create an unlimited number of threads so that you never have contention. Second, the third pool I mentioned above is also set to -1 behind the curtain (this is required, otherwise you can have a contention issue).
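
For reference, the unlimited variant would look something like this in services/runtime.cfg. This is a sketch that simply applies -1 to both options as described above; whether it is appropriate depends on the memory and CPU you have to spare.

```properties
# services/runtime.cfg (edit while OH is stopped)
# -1 removes the upper limit on the pool; only sensible with spare memory/CPU.
org.jupnp:threadPoolSize=-1
org.jupnp:asyncThreadPoolSize=-1
```

With this in place, the startup log should show entries like the "threadPoolSize = -1" lines in the sample output near the top of this post.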

If you get to the point that both are set to -1 and you still have thread issues, please open a bug and post here so we can go look at it!

  2. Retry timers and delays. By default, jupnp waits 10 minutes after a device falls off the network before it retries talking to it. So assume you have a thread issue like the one above and a device drops off because a reply was missed: that device is then offline for 10 minutes. For most people, that’s not acceptable. Now that option is configurable! Two other options were added recently that help stability as well. There are cases (Panasonic TVs, for example) where device manufacturers do a really bad job with their startup order. The Panasonic TVs begin to broadcast UPnP messages before the rest of the TV is up. If OH is fast enough, it tries to connect to the TV, that fails, and then the 10-minute timer starts. This chicken-before-the-egg scenario is obviously suboptimal. To help fix that, a few extra configuration options were added to give those devices time to get their act together and behave properly.

So, again, “what are those configuration options?” (defaults shown below for reference)

org.jupnp:retryAfterSeconds=600
org.jupnp:retryIterations=5
org.jupnp:timeoutSeconds=10

“What does each one of these options do to help me?”

retryAfterSeconds - As mentioned above, this is how long the system waits after a device goes offline before it tries to reconnect.
retryIterations - This is how many times the system attempts to make the connection before marking the device as offline.
timeoutSeconds - This is how long we wait at each attempt/iteration before timing the attempt out.

Each of these options builds on the previous. If a device has an issue and “fails”, the first thing jupnp does is attempt to reconnect. It waits timeoutSeconds to hear back from the device, and retries up to retryIterations times before marking the device as offline. Once retryIterations is reached, the system waits retryAfterSeconds before starting the timeoutSeconds/retryIterations/retryAfterSeconds loop again (which is endless). In most cases, people will just need to lower retryAfterSeconds to accomplish their goals. In the case of the Panasonic TVs, retryIterations will likely need to be turned up if 5 isn’t sufficient.
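
To make the loop concrete, here is a sketch of a tuned retry block for services/runtime.cfg. The numbers are assumptions chosen for illustration, and the timing in the comments is one reading of the description above (worst case, with every attempt running to its full timeout), not a measured figure.

```properties
# services/runtime.cfg (edit while OH is stopped)
# Illustrative values, not recommendations.
# Worst case before a device is marked offline, if every attempt times out:
#   retryIterations x timeoutSeconds = 8 x 10 s = roughly 80 s
org.jupnp:retryIterations=8
org.jupnp:timeoutSeconds=10
# After that, wait 60 s (instead of the default 600 s) before trying again.
org.jupnp:retryAfterSeconds=60
```

A slow-starting device would mostly benefit from the higher retryIterations here, while the shorter retryAfterSeconds limits how long a device stays offline after a missed reply.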

I hope this helps. For those wanting to know specifics, below are the PRs that contributed to this:

(129 was not implemented; 138 replaced it, but there are a bunch of notes and discussion on the topic that were rolled into 138)

Also, as a side note, 131 was committed to fix an issue where a bunch of Samsung devices (my fridge, for example) were sending out UPnP messages with https URLs in the field. This was causing unnecessary errors and load; we ignore those now.

Bravo. It is clear and will certainly help users a lot.

So, I have a question: how high can one set retryIterations? I am disconnecting my Sonoses with wall plugs, since they (esp. the older ones) draw so much power, so I’d like the reconnect to try forever (well…).
Can I set e.g. 100K retryIterations? Then if the Sonoses are powered off for a couple of months, I don’t need to restart when I come home from vacation and so on.

There would be no reason to set it that high. In fact, you definitely don’t want to do that, as it will cause other issues. If the device is truly offline because you unplugged it, then let the system mark it as offline correctly. Setting retryIterations that high would cause it to think that the device is still online when it’s not. That’s just going to generate additional load. Once the device goes offline, as long as the thing is still configured, it will come back the moment jupnp detects it on the network again. That can be seconds, minutes, hours, days, or months later. The purpose of retryIterations is to handle devices that aren’t behaving nicely. For example, the Panasonic TVs have a tendency to start broadcasting UPnP messages before they are actually ready for things to connect to them. This can cause problems for a variety of reasons. For the Sonos speakers, there are times when packets can be lost under high traffic. retryIterations lets the system retransmit a few times before the device is marked as offline.

I made the changes to the runtime.cfg file, though now I see a log entry like this:

2021-12-07 16:49:32.823 [WARN ] [g.dispatch.internal.ConfigDispatcher] - Could not process config file '._runtime.cfg': Input length = 1

Any clue what this means?
Did my changes get picked up?

A guess: your editor has made a working copy of the file, possibly just an interim one during the save process?

Did not see the hidden file, my fault, thanks

I want to clarify something that I’ve seen across different threads that reference this post. The following are the DEFAULT settings that are already in the system. Setting them to these values doesn’t actually change anything:

org.jupnp:threadPoolSize=20
org.jupnp:asyncThreadPoolSize=20
org.jupnp:retryAfterSeconds=600
org.jupnp:retryIterations=5
org.jupnp:timeoutSeconds=10

If you are having issues, you should start to tune them beyond those numbers. Increase the thread pool sizes (or set them to -1 to let them fly free if you have the resources and just want to let it go for it). If you are having issues with things falling off and not coming back for a while, modify the bottom three. Either way, setting the defaults isn’t going to help resolve your situation.
