Rule engine suddenly stops working

wwebers · May 14, 2019, 1:12pm

Hi there,

Lately I have been experiencing the oddest thing ever with OH 2.4 and its rule engine. I have a simple setup where I installed an IoT gateway based on an OrangePi with the single task of exposing my ZWave controller and my RFLink controller via ser2net.

Additionally, I run an OH2.4 setup inside a Linux container on Proxmox, connecting to the controllers via socat. I even tried pure MQTT, with no difference.

Now for the fun part: I also have an elderly alarm system based on 433 MHz sensors. I configured sort of “virtual” sensors within OH2.4 just to be able to send a command to trigger an alarm. The physical sensors are ZWave sensors. I.e. I created a bunch of rules which are triggered by the ZWave sensors and which in turn send a command via RFLink.

The problem: This works perfectly for a while, and then suddenly the rules aren’t triggered anymore. The funny thing is that OH2.4 can still send commands via RFLink and still react to ZWave events. It’s just the rule engine totally ignoring those triggers.

I checked the logs, no errors in sight. All I see is all the triggers, but no “send command” triggered by the rule engine. As this setup worked for months and suddenly stopped working, I’m really puzzled. However, as my home alarm relies on this setup, I’m seriously thinking about ditching OH and testing other rule engines to see if they’re more reliable. A rock stable rule engine is sort of a kill criterion for me.

rossko57 · May 14, 2019, 1:35pm

Which one? text xxx.rules (DSL) files, perhaps

How do you know? e.g. using UI perhaps

Okay. Amongst other possible causes, there are a limited number of rule threads, and ways to build rules that occasionally grab one for ever. When you run out of threads, no new rules can start.
A more detailed look
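
To make the "running out of threads" idea concrete, here is a minimal Python sketch. It is not openHAB code: the pool size of 2 and the names `stuck_rule`, `quick_rule` and `run_demo` are made up for illustration, and a Python thread pool only stands in for the rule thread pool. Once every worker is held by a rule that never returns, a perfectly healthy rule just sits in the queue, and nothing is logged as an error.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait

POOL_SIZE = 2  # hypothetical; stands in for a small rule thread pool


def run_demo(pool_size=POOL_SIZE):
    release = threading.Event()  # stands in for whatever a stuck rule waits on
    order = []                   # rule names in the order they actually started

    def stuck_rule(name):
        order.append(name)
        release.wait()           # holds its worker thread until released

    def quick_rule(name):
        order.append(name)

    pool = ThreadPoolExecutor(max_workers=pool_size)
    for i in range(pool_size):
        pool.submit(stuck_rule, f"stuck-{i}")
    quick = pool.submit(quick_rule, "quick")

    # Every worker is held by a stuck rule, so the quick rule only sits in the queue.
    _, not_done = wait([quick], timeout=0.5)
    starved = quick in not_done

    release.set()                # un-stick the rules; the queued rule can run now
    pool.shutdown(wait=True)
    return starved, order


if __name__ == "__main__":
    starved, order = run_demo()
    print("quick rule starved while pool was full:", starved)
    print("start order:", order)
```

The queued rule only runs after a stuck one lets go of its thread, which would match a backlog of triggers being worked off once whatever was blocking goes away.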

wwebers · May 14, 2019, 1:55pm

Yes, text based rules.

Yes, I use the UI to test that, what else should I use?

I knew that one already, thus I do not use external scripts, timers or locks. I adjusted the RFLink binding so that a configurable delay lives inside the binding (and removed the original fixed delay).

Here’s a tiny excerpt of the rules.


rule "Entre OPEN"
when
   Item Magnet_Entre changed to OPEN
then
   Entre_Command.sendCommand(ON)
end

rule "Entre CLOSE"
when
   Item Magnet_Entre changed to CLOSED
then
   Entre_Command.sendCommand(OFF)
end

wwebers · May 28, 2019, 2:19pm

Some further investigation (setting the log to DEBUG level) shows the following results:

BTW: I’m running 2 types of persistence: influxdb and MQTT.

No errors or warnings inside the log
It seems that the persistence layer just dies

First the influxdb persistence actions stop; some seconds after that even the MQTT actions stop sending data. The rest of the system seems to work perfectly, even though some parts of the rule system have died as well.

A bit creepy.

rossko57 · May 28, 2019, 2:29pm

Makes a kind of sense. I can imagine a queue of other activities building up, causing a creeping death effect on other services.
I expect there are a bunch of ways for influxdb to hang up, e.g. waiting for a read or write to mass storage. I have no useful suggestion about how to follow through, but I would be looking hard at the file system.
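
A toy calculation of that creeping effect, as a Python sketch. It assumes a single worker that handles jobs strictly in order, which is an assumption for illustration and not a claim about how openHAB queues persistence or rule work; `completion_times` and all the numbers are made up. One job that hangs for 30 seconds delays everything queued behind it, even though the later jobs are quick and the overall load is tiny.

```python
def completion_times(arrivals_ms, durations_ms):
    """Finish time of each job on one worker that handles jobs strictly in order."""
    finished = []
    worker_free_at = 0
    for arrival, duration in zip(arrivals_ms, durations_ms):
        start = max(arrival, worker_free_at)  # wait until the worker is free
        worker_free_at = start + duration
        finished.append(worker_free_at)
    return finished


if __name__ == "__main__":
    arrivals = [0, 2000, 4000, 6000]    # one event every 2 seconds
    durations = [100, 30000, 100, 100]  # the second job hangs for 30 seconds
    for arrival, done in zip(arrivals, completion_times(arrivals, durations)):
        print(f"arrived {arrival} ms, finished {done} ms, latency {done - arrival} ms")
```

The third and fourth events need only 100 ms of work each but finish more than 26 seconds late, purely because of the one slow job ahead of them.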

wwebers · May 29, 2019, 4:16am

Nja, my influxdb server’s running just fine. It’s running on another machine and is pretty bored, no heavy load. Same for my activemq server (artemis, btw.). It runs on its own server and is pretty bored as well.

So, if there’s “load”, then it’s on my openHAB server (running inside a Linux container in a Proxmox cluster). However, I wouldn’t call it load: around 5-8 events every 10 seconds.

For me it looks like there’re some serious problems inside the persistence layer. It’s a pity I cannot see anything in my logs.

rossko57 · May 29, 2019, 9:19am

I’m not thinking “load”, I’m thinking something hangs up and causes a traffic jam.

The majority of OH users won’t be doing that, a remote database. There’s obvious scope for the communication to mess up in ways not often encountered.
Let’s say something doesn’t handle dropped packets very well, just as an example.

I don’t know how you could pry into that area of persistence. Bear in mind, you’re still not sure what is cause and what is effect.

wwebers · May 29, 2019, 2:02pm

Well, it’s my ambition to create a home automation installation based on a microservice architecture on top of a hyper-converged infrastructure. However, it seems that pretty much all of the solutions I tested are based on a monolithic approach: even though they’re modular in some sense, they expect every part to be installed on the same machine.

And to at least keep the persistence layer away from a Raspberry Pi installation (if one wants that) is quite natural: SD cards are not really made for this kind of use case.

However, I guess I have to ditch the openHAB approach (as well as Home Assistant) and build up something from scratch. I originally liked the idea of using OSGi as a foundation. But if it means I have to hard bake together all needed modules before deploying to a specific device, I’ll skip that approach.

rossko57 · May 29, 2019, 3:13pm

No-one said having a remote db was a bad idea, nor that it was unsupported. Not many users do it, meaning little exposure. Likely no one at all has the same combination of containers.

There’s no solid indication yet that your problem has anything to do with persistence.

It’s your choice how to proceed of course; throwing up your hands and walking away is as valid as anything else.

wwebers · May 29, 2019, 3:44pm

I’m not just throwing up my hands. Before I ask the community, I usually try a lot on my own. However, as my home automation includes my home security system, it’s imperative that all parts work together in a reliable way.

BrianD · October 12, 2020, 10:36am

I have a very similar setup, Proxmox with a separate influxdb / grafana server. I am seeing very similar behavior. I moved my restore-on-startup persistence to mapdb, to try and isolate some of the load, based on suggestions here. If I only have mapdb configured, there is no issue at all with the rules engine. Even with a clean new influx database, as soon as I add in the influx persistence service, I start getting rule execution issues.

When this first started, restarting the influxdb service on the database server would reactivate the rule engine, including the backlog of rule triggers that happened while the communication was faulty.

The only error I see is the communication error, with a warning that the batched data will be lost.

I am just in the process of getting the newly created influx database bedded in, but I already have rule issues again and hence rather low WAF.

Seaside · October 12, 2020, 10:42am

Make sure you are not running out of threads.
See this thread for more info:

BrianD · October 13, 2020, 10:02am

Thanks @Seaside,

I worked my way through the solutions in that thread, although with little success. However, I am not quite sure what I am doing, so if you and the others here could confirm or correct my understanding, that would be much appreciated.

I edited /services/runtime.cfg to add
############### Thread pool sizes #######################
# Configuration of thread pool sizes
org.eclipse.smarthome.threadpool:thingHandler=25
org.eclipse.smarthome.threadpool:discovery=15
org.eclipse.smarthome.threadpool:safeCall=20
org.eclipse.smarthome.threadpool:ruleEngine=30

I tried just the last two lines active, as per one of the posts. I also tried all lines active, with a variety of thread counts.
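
For keeping track of which lines were actually active in each attempt, a small Python sketch like the following can list the uncommented pool sizes. `active_pool_sizes` is a made-up helper, not part of openHAB, and it assumes the plain `key=value` format shown above with `#` marking a commented-out line. It only reads text; it says nothing about whether openHAB has picked the values up.

```python
PREFIX = "org.eclipse.smarthome.threadpool:"


def active_pool_sizes(cfg_text):
    """Return {pool_name: size} for every uncommented thread pool line."""
    sizes = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or commented-out setting
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key.startswith(PREFIX):
            sizes[key[len(PREFIX):]] = int(value.strip())
    return sizes


if __name__ == "__main__":
    sample = """
    # Configuration of thread pool sizes
    #org.eclipse.smarthome.threadpool:thingHandler=25
    #org.eclipse.smarthome.threadpool:discovery=15
    org.eclipse.smarthome.threadpool:safeCall=20
    org.eclipse.smarthome.threadpool:ruleEngine=30
    """
    print(active_pool_sizes(sample))
```

In practice the text would come from reading the edited runtime.cfg rather than from an inline sample.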

If I have just mapdb active, openHAB rules work perfectly without a glitch. After a few minutes with the influxdb persistence service active, the rules no longer activate when the triggers fire. mapdb is the default persistence service.

My questions are:
Is this the right place to make the edits?
Will the edits go live once the file is saved, or does openhab need a restart, or a system reboot?
What are the implications of increasing the thread count? i.e. what is stopping a change to something like 200?

BrianD · October 22, 2020, 10:35am

Update: I ended up going for broke and putting

org.eclipse.smarthome.threadpool:safeCall=50
org.eclipse.smarthome.threadpool:ruleEngine=100

openHAB rules were fine and the influxdb persistence service started storing sensor data. I am not sure yet what the implications of jumping that high are, but for now it is working.
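
One possible implication, as back-of-envelope arithmetic in Python. This assumes the underlying cause is something that permanently ties up a thread every so often, which has not been confirmed anywhere in this thread; `hours_until_exhausted` and the rate of 2 threads lost per hour are made-up illustration values. Under that assumption a bigger pool does not remove the problem, it only pushes the point of exhaustion further out.

```python
def hours_until_exhausted(pool_size, threads_lost_per_hour):
    """Hours until every thread is tied up, if threads are lost at a steady rate."""
    if threads_lost_per_hour <= 0:
        return float("inf")  # nothing is leaking, so the pool never runs dry
    return pool_size / threads_lost_per_hour


if __name__ == "__main__":
    for size in (30, 100):
        hours = hours_until_exhausted(size, 2)
        print(f"ruleEngine={size}: exhausted after about {hours} hours")
```

If the rules stay healthy well beyond what such an estimate would suggest, that points towards the pool simply having been too small for peak demand rather than threads being lost for good.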