OH Core Possible Thread Pool Exhaustion Issue

In another thread the OP has been complaining that status updates from the Hue binding are processed by OH Core with a delay (and eventually not processed at all).

In this post you can see log entries with three timestamps: (1) when the Hue bridge created the event, (2) when the Hue binding forwarded the event to OH Core, and (3) when OH Core finally processed the event. You can see there is minimal delay between (1) and (2), so I think the binding is OK, but at the end of the log there is a delay of almost one minute between (2) and (3).

My suspicion is that this is a Thread Pool Exhaustion issue in OH Core. => WDYT?

Some users link each and every channel to items and set refresh delays so low that their system is overloaded by events to handle. Then they either need to stop so many events from being generated or increase the number of threads that handle them so the system can keep up.

In the Hue binding, the events come in (as is) via SSE push, so there is no refresh delay to set.

Most likely the user uses more than one binding. :slight_smile:

The OP uses either the Visual Studio OH plugin, or the OH Visual Studio server (it is not so clear from what he is saying). And there is some suspicion that this (whichever it may be) could be the culprit. => WDYT?

Can confirm. Experienced this most noticeably with the old/v1 hue binding. A button click could take seconds before changing the light. Tracked it down to threadpool not executing in time/being exhausted. Increasing the thingHandler threadpool solved the issue for me.

Had a few users on the ESPHome binding reporting lost connections. Increasing the threadpool size solved the problem (but revealed other issues on the user side).

IMHO if OH continues with a shared thread pool on which bindings do I/O over flaky connections, there should at least be some sort of monitoring and logging so users can figure out that this is happening, preferably also telling them which bindings are causing the issue.

It reminds me of

I don’t know if the metrics already expose these event queues. If not, then I still want to add that.

Unfortunately, this statement does not explain why the system runs stable and without delay for days and then suddenly, and as far as I have been able to determine so far only with the HUE events, gets a delay of several minutes.
Meanwhile, interactions via the dashboard or Alexa are processed immediately. Other items are also updated.

It is probably due to the difference in architecture between bindings that poll for updates and those that receive push updates via SSE or Websocket:

  • polling model: the poller is a task that is scheduled on the OH scheduler to run at periodic intervals, and it updates the channels on the OH event bus when that task is called or when it returns. The polling interval can be many seconds or minutes, so delays are less likely to be noticed, and the scheduler may catch up in the meantime anyway. Also, all polling is done on the single polling task thread.

  • push model: the events come in real-time, and the binding creates a task on the OH scheduler that updates the channels on the OH event bus; there will be a new update task created for each incoming event.
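To make the effect concrete, here is a minimal sketch in plain `java.util.concurrent` (not OH Core code). It assumes a fixed-size scheduler pool, a few tasks that block for a while (standing in for slow network I/O), and one "push event" task submitted afterwards. The class and method names (`PoolExhaustionDemo`, `eventDelayMillis`) are made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PoolExhaustionDemo {

    /**
     * Submits blockingTasks tasks that each sleep for blockMillis, then one
     * "push event" task, and returns how long the event task sat in the queue
     * before it started running, in milliseconds.
     */
    public static long eventDelayMillis(int poolSize, int blockingTasks, long blockMillis)
            throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(poolSize);
        try {
            for (int i = 0; i < blockingTasks; i++) {
                scheduler.execute(() -> {
                    try {
                        Thread.sleep(blockMillis); // simulated blocking read on a slow connection
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            long submitted = System.nanoTime();
            long[] started = new long[1];
            CountDownLatch done = new CountDownLatch(1);
            // The "push event" update task: it does almost no work itself.
            scheduler.execute(() -> {
                started[0] = System.nanoTime();
                done.countDown();
            });
            done.await();
            return TimeUnit.NANOSECONDS.toMillis(started[0] - submitted);
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // All 5 threads are busy, so the event waits until a blocker finishes.
        System.out.println("pool=5,  5 blockers: " + eventDelayMillis(5, 5, 500) + " ms");
        // With spare threads the same event starts almost immediately.
        System.out.println("pool=50, 5 blockers: " + eventDelayMillis(50, 5, 500) + " ms");
    }
}
```

Note that CPU load stays near zero in both runs, since the blocked threads are only sleeping. That matches the observation that exhaustion does not show up as processor load.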

Hello everyone,
I implemented a runtime evaluation of the HUE sensors before Christmas to be able to better analyze the behavior of the overall system.
What I can see is that at night or when we are not at home, the response times remain constantly below 100 ms. Perfect!
When I tinker with the system, change item files or rule files, the response times sometimes jump to over 2 minutes. Possibly OK, although I can’t see any significant increase in processor load in HTOP.
When we are at home, the response times of the HUE sensors sometimes rise to 2 seconds. Not so good if you want to go downstairs in the dark.

The only thing that is different when we are there is that the motion sensors or window contacts are triggered. These are essentially Fibaro ZWave modules.
So it could be that a fault in my Zwave network is causing the delays. Also because I can observe that in the stairwell, where a HUE strip and a ZWave strip are connected in parallel, the ZWave strip sometimes only comes on after a 2 second delay.

Question for the experts: does OH4 still have a thread pool limit of 5? I believe I have read that this is no longer the case. But something is blocking the system (rarely even at night).
Since I can also sometimes observe that a handful of rules are in running status, although they are actually only supposed to update an item, this indicates a restriction to me. Unfortunately, I cannot determine which rule or rules are at the beginning of the queue. I can only see in HTOP that the cores are bored and only two of the 600 threads are running.
Next I will try to find out if there is a delay in the ZWave logs that I can interpret.
Btw, after I not only adopted my two AndroidTv Things but also completed the pairing, the delays seem(!?) to have decreased. The SSL errors that occurred there no longer appear in the log.

Regards Thomas

You wouldn’t … thread pool exhaustion is an issue quite unrelated to CPU load…

Is there something you can add to logging so that we can see the number of queued tasks in the OH core thing handler executor service thread pool?
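As a rough idea of what such logging could report, here is a sketch that reads the standard counters from a `java.util.concurrent.ThreadPoolExecutor` (which `ScheduledThreadPoolExecutor` extends). How OH Core would expose its pool instance is not shown here; `PoolStats` and `describe` are hypothetical names, and the pool is simply passed in.

```java
import java.util.concurrent.ThreadPoolExecutor;

public class PoolStats {

    /**
     * Formats a one-line snapshot of a pool: configured core size, current
     * threads, threads actively running tasks, tasks waiting in the queue,
     * and tasks completed so far. The values are approximate while the pool
     * is busy, which is good enough for a periodic log line.
     */
    public static String describe(String name, ThreadPoolExecutor pool) {
        return String.format("%s: core=%d poolSize=%d active=%d queued=%d completed=%d",
                name,
                pool.getCorePoolSize(),
                pool.getPoolSize(),
                pool.getActiveCount(),
                pool.getQueue().size(),
                pool.getCompletedTaskCount());
    }
}
```

A line where `active` equals `core` and `queued` keeps growing would be a strong hint that the pool is exhausted.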

That is a LOT of time. If the OH core thing handler executor service had a pool size of 5, a 2-minute delay (120 seconds) would mean each task blocking for 24 seconds. So in any case there is something wrong in your system. I think the only reason a task could block for such a long time is that it is blocking on an I/O operation.

So we should examine all addons in your system which might be using the OH core thing handler executor service thread pool for I/O operations. I am guessing that you have something doing network I/O which is hitting a read or write timeout. You might need to try disabling addons one by one to see if there is a particular culprit.

Good morning and a happy new year!

I have now added three new items to the diagram.
The watchdog time of two ESP32s that I have connected via MQTT. And the date item, which is updated via the NTP channel.
I wanted to continue with the analyses today:
You can see that the mqtt sensors, which report every minute, are also affected. A few high outliers at bedtime, but a higher average value during the day.

At some point around 12:20 I started VS-Code, which initially led to dispatcher warnings (at least I suspect so) and then to an OH restart. Phew, I hadn’t consciously noticed that before.

I have also noticed that the number of threads in HTOP grows from about 470 right after a restart to about 770 just before the next restart.

As already suggested, I will have to check binding by binding, rule by rule. I have no idea (yet) if there is a way to assign the threads to a binding or a rule/timer.
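One generic JVM-level way to get a first overview is to group live threads by name prefix. The sketch below uses only `Thread.getAllStackTraces()` from the standard library and would have to run inside the same JVM as OH. It assumes that pool threads are named with a common prefix followed by a number (for example something like `OH-thingHandler-3`, which is an assumption here, not checked against OH Core). `ThreadGroups`, `prefixOf` and `countByPrefix` are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

public class ThreadGroups {

    /** Strips a trailing number (and an optional '-', '_' or '#' before it) from a thread name. */
    public static String prefixOf(String threadName) {
        return threadName.replaceAll("[-_#]?\\d+$", "");
    }

    /** Counts the live threads of this JVM per name prefix, sorted by prefix. */
    public static Map<String, Integer> countByPrefix() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(prefixOf(t.getName()), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        countByPrefix().forEach((prefix, n) -> System.out.println(n + "\t" + prefix));
    }
}
```

Comparing the counts shortly after a restart with the counts a few days later would show which prefixes account for the growth, as long as the responsible addon names its threads.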

My main suspect is still the Android binding, which fortunately I can disable without too much loss of functionality. Tapo and Onecta are also at the top of the list.

Alternatively, to cause less disruption in the house, I will try to recreate the behavior in a virtual machine… happy wife, happy life :wink:

I won’t be bored in 2025 either :slight_smile:

Regards
Thomas

Back in OH 2.x I changed my values to these, and they have worked well through 3.x and now 4.x too. I have 340 things, 2722 items and 327 rules.

Two HUE bridges completely full with lots of motion detectors.

\runtime\services.cfg

threadpool:thingHandler=50
threadpool:discovery=20
threadpool:safeCall=50
threadpool:ruleEngine=50

Best, Jay

Where do I make these changes?

Do I need to add

org.eclipse.smarthome.threadpool:thingHandler=50
org.eclipse.smarthome.threadpool:discovery=20
org.eclipse.smarthome.threadpool:safeCall=50
org.eclipse.smarthome.threadpool:ruleEngine=50

to runtime.cfg?

\runtime\services.cfg

I modified my posting to add the path.

Best, Jay

I guess org.eclipse.smarthome has probably not been correct for many years.