Rule triggers twice on received update via mqtt

j.hoekstra · September 13, 2020, 5:34pm

Alright I don’t think it’s a case of double logging, then it would happen all over my system with other rules.
The weirdness would be in the eventbus over mqtt where zwave devices are shared from an instance without rules as an remote zwave controller, with logic in the main instance.

To test it out however I tried 3 scenario’s:
1 with triggering this device on ‘changed’, this only logged once(this is what I have been using for sensor-updates so far)
1 with virtual switch and triggering on ‘received update’, this only logged once.
1 with a zwave sensor connected in the same RPI and triggering on ‘received update’, this only logged once.
I retried above tests using another action(sendNotification) with the same results

I’ve been pushing the button more often now and I notice that every now and again it does run twice(the rulecount goes up with 2 instead of 1 at 1 buttonpress.)

RPI:
2020-09-13 11:42:19.071 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 85
2020-09-13 11:42:19.081 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:19.092 [INFO ] [.eclipse.smarthome.model.script.FB01] - FB01 1x pushed
2020-09-13 11:42:25.583 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 86
2020-09-13 11:42:25.595 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:25.606 [INFO ] [.eclipse.smarthome.model.script.FB01] - FB01 1x pushed
2020-09-13 11:42:30.669 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 87
2020-09-13 11:42:30.680 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:30.693 [INFO ] [.eclipse.smarthome.model.script.FB01] - FB01 1x pushed
2020-09-13 11:42:42.937 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 88
2020-09-13 11:42:42.949 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:42.961 [INFO ] [.eclipse.smarthome.model.script.FB01] - FB01 1x pushed
2020-09-13 11:42:46.898 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 89
2020-09-13 11:42:46.910 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:46.922 [INFO ] [.eclipse.smarthome.model.script.FB01] - FB01 1x pushed
2020-09-13 11:42:49.797 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 90
2020-09-13 11:42:49.807 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:49.818 [INFO ] [.eclipse.smarthome.model.script.FB01] - FB01 1x pushed
2020-09-13 11:42:51.831 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 91
2020-09-13 11:42:51.843 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:51.854 [INFO ] [.eclipse.smarthome.model.script.FB01] - FB01 1x pushed
2020-09-13 11:42:53.949 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 92
2020-09-13 11:42:53.961 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:53.971 [INFO ] [.eclipse.smarthome.model.script.FB01] - FB01 1x pushed

Main:
2020-09-13 11:42:19.090 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 17
2020-09-13 11:42:19.091 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:19.092 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 18
2020-09-13 11:42:19.093 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:25.599 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 19
2020-09-13 11:42:25.600 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:25.603 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 20
2020-09-13 11:42:25.604 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:30.693 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 21
2020-09-13 11:42:30.694 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:30.694 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 22
2020-09-13 11:42:30.696 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:42.960 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 23
2020-09-13 11:42:42.960 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 23
2020-09-13 11:42:42.961 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:42.961 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:46.918 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 24
2020-09-13 11:42:46.918 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 24
2020-09-13 11:42:46.919 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:46.919 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:49.819 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 25
2020-09-13 11:42:49.819 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 25
2020-09-13 11:42:49.820 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:49.820 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:51.853 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 26
2020-09-13 11:42:51.853 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 26
2020-09-13 11:42:51.853 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:51.853 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:53.971 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 27
2020-09-13 11:42:53.971 [INFO ] [.eclipse.smarthome.model.script.FB01] - entry 27
2020-09-13 11:42:53.971 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>
2020-09-13 11:42:53.971 [INFO ] [.eclipse.smarthome.model.script.FB01] - <1.0>

This (now) leads me to believe this is device specific, and there are 2 updates triggering the rule at the same time.
I’m now looking into where to find the updates, the itemchanged is logged by default in eventlog, but not the updates(which makes sense due to info overload).

In the meantime I implemented a lock which takes care of the effect of multiple updates.
I’ll write this down, since I want to migrate to the v2 binding which is likely more robust in the department of triggering.

rossko57 · September 13, 2020, 6:35pm

Sure, if we rattle things enough there’s usually hiccups to be exposed. An important follow up would be to see dupe MQTT messaging on those occasions. I don’t think this is what you’re looking for.

But you don’t generally see that on your “source” system, which I assume is a less powerful Pi, so 'doubled" incidents would be more spread out in time.

Nor do you see that in your MQTT monitor, which absolutely would have to have a few mS between messages simply because it takes real time to send messages.
And of course that offset in time would continue to be reflected in your updates at the target system.

This is a massive clue, the consistent duplicate timestamps.
Two real updates could kick off two runs, two "copies’ of the rule. A sufficiently powerful multicore server could conceivably run those pretty much in parallel.
But there’s still only one counter variable, outside of the rule, and it’s very hard to believe that part won’t work differently.

A possibility is that there is somehow a doubled link between your single MQTT Thing and your single Item.
Items are allowed to be linked to more than one channel. The effects of linking to the same channel twice might look like this, on a server box powerful enough to run two copies of the same rule simultaneously.
I think you certainly should explore this - REST API links should show it up.
EDIT - SCRATCH THAT - it won’t help for a v1.x defined link,no Thing involved. but it remainstrue that Itemscan be linked to more than one v1.x binding path. I’ve no idea how you can verify the link there.

j.hoekstra · September 13, 2020, 8:26pm

By setting logging in karaf to: log:set TRACE smarthome.event.ItemStateEvent I was able to pinpoint the main instance does get 2 updates as opposed to the RPI only one. Also the mqtt message is going out only once(QOS2).
The item was just created connecting to 1 topic which isn’t used elsewhere and isn’t part of a group yet.
Restarted after creating the items.

I’ve been using this setup for 2 years or so without issues and with the use of a locking mechanism makes sure rules are only executed once, so for me the issue is not pressing anymore.
And since mqtt1 addon wil become obsolete anyways with Openhab v3 on the horizon, I’ll start reading up on moving config to the v2 addon.

rossko57 · September 13, 2020, 9:40pm

Doesn’t happen by magic, My best guess is still that Item is double linked.

If you wanted to prove the point, the simplest way would be to create a wholly new Item, but I expect that’s a fag with it being involved in the bus thing.

CrazyIvan359 · September 13, 2020, 11:12pm

Maybe this has something to do with your issue? QOS 2 may result in resending the message if the receiver does not respond fast enough. While I don’t think that your system is slow enough to do this, maybe the mqtt binding is not handling the QOS 2 handshake correctly. Or possibly it is putting the message on the channel twice while handling the back and forth with the broker.

Try QOS 0 and see if it still happens.

rossko57 · September 13, 2020, 11:42pm

Good point, MQTT broker’s idea of “publish once” can be several actual messages. Bit curious why it would then be Item (topic) specific though.

I’m also still curious if this is a fairly potent host (relative to the usual plodding openHAB box). I have every sympathy for users discovering buried problems most others never see.

CrazyIvan359 · September 13, 2020, 11:59pm

Sorry to burst your bubble, but my openhab has been running on an x64 desktop running debian with 4GB RAM for over 2 years now. I was running the mqtt1 binding when I first switched off the RPi and now I’m on mqtt2 and I’ve never experienced this problem. I’ve always run QOS 0 though and never had communication problems, it seemed like the right choice based on LAN/WI-FI being very high reliability in terms of networks.

I think this is either device specific or some strange thing with QOS 2 or the binding doing something weird.

rlkoshak · September 14, 2020, 4:35pm

That’d be true for QOS 1 (at least once) but QOS 2 means deliver exactly once. There is an ack protocol set up with QOS 2 to ensure the message is delivered once and only once. QOS 0 means at most once so with QOS 0 if the message is missed by the client it’s missed and there is no retry. Only QOS 1 would permit delivery of the same message more than once.

rossko57 · September 14, 2020, 4:40pm

I think the suggestion is that there’s some problem in the loop.
QoS2 - send message, await ack.
Oh no, there’s a send failure or no ack!
But I must send it, so send again.

I find it hard to believe that could all happen within 1mS but there is something screwy.

rlkoshak · September 14, 2020, 4:58pm

There is a four part handshake with QOS 2 that prevents this and ensures that the message can only be received/processed once by the client. See What is MQTT Quality of Service (QoS) 0,1, & 2? – MQTT Essentials: Part 6 for a good explanation for how that handshake works to prevent it.

I only bring it up to avoid heading down the wrong path in trying to figure out this problem.

CrazyIvan359 · September 14, 2020, 7:17pm

While I agree with you, this is the part that makes me think something weird is happening. The 3rd step involves the receiver getting a second packet related to the same message. I’m wondering if that is what is triggering the binding to put the message on the channel again. 4 small packets on a 100M network would be pretty damn fast, on 1000M network probably in the sub millisecond range.

rlkoshak · September 14, 2020, 7:51pm

That fourth packet, all the handshake really, is handled by the MQTT library and is transparent to the binding. I believe the MQTT 2.5 binding is using the HiveMQ library. But the old MQTT 1.x binding uses Paho. In either case though, the binding wouldn’t receive the message until the handshake is complete. The whole process is set up such that no matter how many times the library receives the message, it only delivers it the one time. The handshake is defined to ensure this.

The only way that the message could be received twice by the binding is:

it’s really using QOS 1, not QOS 2
there is a bug in the MQTT library the binding is using

But there are some points worth considering. QOS is defined between the client and the MQTT broker. Furthermore, the QOS can be set at the subscription and the for each published message independently. So it could publish one message at QOS 1 and another at QOS 0 and subscribe using QOS 2.

So the question is what exactly the QOS setting in the MQTT 1.x binding config applies to. Published messages only? Published messages and subscriptions? Just subscriptions? Does it apply to subscriptions at all?

By default, if the subscription doesn’t define a QOS, the QOS specified by the publisher is used I think. So, if we assume that the QOS setting only applies to the messages published by OH, one needs to set the QOS to 2 at the device that is publishing the messages in order for OH to receive messages at QOS 2.

It’s possible that the binding defaults the QOS on subscriptions and it cannot be changed.

To answer the question one would need to look at the code.

Note, this ambiguity is still present in MQTT 2.5. It’s possible to set the QOS on a per Channel basis, but it’s unclear whether that setting applies to just the publisher or both the publisher and subscriber.

I think if all other possibilities are eliminated, the code needs to be reviewed to see if the QOS 2 is being used in the subscription, the QOS for the subscription isn’t defined, or it’s hard coded to 1.

rossko57 · September 14, 2020, 10:57pm

Can we see this fairly crucial log?

However …

Okay, a single message gets published at the source.

And when that message arrives, the event bus handler of MQTTv1 binding automatically updates the given Item.

Meantime …

This Item has also been individually configured to listen to a topic that happens to be identical to the event bus topic. So of course when the event bus message arrives, the “ordinary” part of the MQTTv1 binding swings into action and updates the Item.

Voila, two updates from one message as quickly as the binding and internal events bus can handle them (and I think on this multicore box that can be quite fast).
It’s doing exactly what it has been configured to do.

You could confirm that by examining subscriptions at the broker - there’ll be a wildcard sub from events bus, a specific sub from Item

I’m still quite impressed this box can run two ‘copies’ of the rule in such close lockstep.

I’m further surprised that a global variable is not thread safe, that’s spiteful.
EDIT - I wonder if
rulecounter = rulecounter + 1
is the problem because there are separate variable ‘read’ and ‘write’
rulecounter++
might have been a better choice in this very close rule-race. Or might be just the same internally!

CrazyIvan359 · September 15, 2020, 12:08am

I’ve had this happen before, it’s really night and day when you come off an RPi to an x86/64 platform. When I switched off the RPi my start to fully loaded time for openHAB was 5-7 minutes, now with a much much larger setup openhab is fully loaded in under a minute (all non-autogen items are still in files).

Time to switch to Jython rules
With Jython rules you could easily have this scenario of two rules running essentially in parallel on an RPi, they are lightning fast compared to Xtend.

rossko57 · September 15, 2020, 12:11am

I do run on a multicore x86, and have had to learn a lot about race conditions

I understood they were queued.

CrazyIvan359 · September 15, 2020, 12:15am

Interesting, I assumed they were threaded but maybe they are just that fast. I’ve never tested it and I have ported over thread/concurrency safeties where needed from my Xtend rules.

rossko57 · September 15, 2020, 12:25am

I expect Rich will enlighten us in a minute, but I am specifically talking multiple starts of the same rule, I thought that got queued one at a time in NGRE.

5iver · September 15, 2020, 1:08am

You are correct… only one instance of a rule can run at a time. This eliminates the need for locks. If this causes an issue, have the rule spawn up a new thread. In Jython, I recommend thread.start_new_thread. Here is some discussion…

rlkoshak · September 15, 2020, 4:47pm

That’s why global lambdas in rules DSL are so dangerous. When two rules call a global lambda at the same time, they are not running two separate copies of the function, as is usually the case. They are actually running the exact same function at the same time. That means, for example, if you set a variable at the top of the lambda it might become overwritten by the second concurrent call by the time you get down to the parts of the lambda that use the variable, causing all sorts of havoc.

Also note, when OH 3 comes out the same rule is prevented from running more than one instance at a time I think even for Rules DSL as I understand it. So this problem will become somewhat ameliorated, at least for those cases where the global variable is only accessed from one rule. It doesn’t help if you have two or more rules accessing the same global variable though.

rossko57 · September 15, 2020, 9:09pm

Yeh, it’s all an aside really, but with further thought this -
rulecounter = rulecounter + 1
is problematic in this case of highly parallel rules. Because it’s not a single operation. Rule A reads the variable value, and before it has time to do its sums and write the new value back, along comes Rule B and reads the same old value.
‘rulecounter++’
might work better, looking like a single operation … but I wouldn’t be surprised if it had exactly the same problem behind the scenes.