Group:Number:SUM stopped summing and Member-of rule stopped firing

Folks,

I added some items this morning -

Group:Number:SUM                g_Utility_Proxy_Usage "Group for Proxy items"
Number Power_Solar_prxy                                 <energy> (g_Utility_Proxy_Usage)                 { lightwaverf="serial=FFC0FE,type=ENERGY_CURRENT_USAGE"}
Number Power_House_prxy                                 <energy> (g_Utility_Proxy_Usage)                 { lightwaverf="serial=179DFE,type=ENERGY_CURRENT_USAGE"}

Number Power_Solar  "Power - Solar - Watts [%.1f]"      <energy> (g_persist_change)
Number Power_House  "Power - House - Watts [%.1f]"      <energy> (g_persist_change)     
Number Power_Grid   "Power - Grid -  Watts [%.1f]"      <energy> (g_persist_change)

I then have a rule


rule "Utilit Proxy change"
when Member of g_Utility_Proxy_Usage changed
then 
    var Number pHouse = 0.00   //(Power_House.state as Number)
    var Number pSolar = 0.00   //(Power_Solar.state as Number)
    var Number pNet =  0.00    //pHouse - pSolar

//    logWarn("KKK", "message")

    pSolar = (Power_Solar_prxy.state as DecimalType) - 69;
    pHouse = (Power_House_prxy.state as DecimalType) - 69;
    pNet = pHouse - pSolar;

    Power_Grid.postUpdate(pNet)
    Power_Solar.postUpdate(pSolar)
    Power_House.postUpdate(pHouse)

end


All this worked well this morning. 5ish hours later I came back to see that the rule had stopped firing.
I added in the logWarn and changed the values of the two items myself and the group remained at NULL.

I touched the item file and the group file and saw that it said “reloading file bla bla” (I can’t remember the exact message)

Nothing I could do would get this group to SUM the values. Even min/max/avg had nothing. It remained NULL.

Another SUM group I have remained working as expected.

I stopped the OH service and restarted it and it sprang into life.

You can see in the screen grab below where I update the item and check the group - it remains NULL.
The other item in the group unfortunately is not in the print.

Anywhere I can look for why this might have happened?
This is all running on a rpi4 2GB

root@homeauto:~# java -version
openjdk version "1.8.0_152"
OpenJDK Runtime Environment (Zulu Embedded 8.25.0.76-linux-aarch32hf) (build 1.8.0_152-b76)
OpenJDK Client VM (Zulu Embedded 8.25.0.76-linux-aarch32hf) (build 25.152-b76, mixed mode, Evaluation)

Any clues I might be able to dig up after the fact?

I’m not sure what you should expect to get for SUM if any one of the members is NULL or UNDEF.

Any problems with that should not stop any Member of triggering, though.

Make sure your Item names (including Groups) are unique both across xxx.items files and any tinkering you may have made in PaperUI.

Neither of the items were NULL. See the last 3 commands before I bounced the service. I set one to 222 and the other to 666. I then check the group and its NULL.

after bouncing I rerun the same commands and it works as expected.

It’s all getting a little depressing :frowning:

Do something about it. This does work for other people. What base OH version are you using? Have you checked for duplicate Items?

Look in openhab.log at boot time, for any oddities when your xxx.items files load.
Look in events.log. Please show us when it works and when it doesn’t.

This is all running on the RPi 4 with OH 2.4 (Version is in the screen print)

In terms of logs and events - here is the abridged detail.

The NULL > UNDEF is because when I first had it I did not have it as SUM or AVG.
No further changes to the group in the log as it remains, rightly so, at UNDEF.
However, you can see that the rule is firing as the only way for Power_Grid to change is from the rule.

Events.log

2019-09-07 13:08:08.183 [vent.ItemStateChangedEvent] - Power_House_prxy changed from NULL to 1016
2019-09-07 13:08:08.187 [GroupItemStateChangedEvent] - g_Utility_Proxy_Usage changed from NULL to UNDEF through Power_House_prxy
....... (loads of events in between this time as the binding updates every second or so)
2019-09-07 13:57:35.744 [vent.ItemStateChangedEvent] - Power_Solar_prxy changed from 446 to 462
2019-09-07 13:57:35.747 [vent.ItemStateChangedEvent] - Power_Grid changed from 365 to 349
2019-09-07 13:57:35.750 [vent.ItemStateChangedEvent] - Power_Solar changed from 377 to 393

13:57 is the last log entry for the Power_Grid events. I then changed the group to be SUM/AVG/MIN/MAX trying to coax some life out of it but to no avail. The group remained NULL (which it would because I kept changing the aggregate type) but the rule did not fire which is highlighted below.

events.log

2019-09-07 16:50:01.504 [vent.ItemStateChangedEvent] - Power_Solar_prxy changed from 604 to 499
2019-09-07 16:50:04.254 [vent.ItemStateChangedEvent] - Power_House_prxy changed from 791 to 788
--no entry for grid power which gets updated by the rule.

After a restart of the service I get the following (which is me poking the items manually as shown in the screenshot.

2019-09-07 18:17:15.361 [ome.event.ItemCommandEvent] - Item 'Power_House_prxy' received command 666
2019-09-07 18:17:15.369 [vent.ItemStateChangedEvent] - Power_House_prxy changed from NULL to 666
2019-09-07 18:17:15.372 [GroupItemStateChangedEvent] - g_Utility_Proxy_Usage changed from 222 to 888 through Power_House_prxy
2019-09-07 18:17:15.393 [vent.ItemStateChangedEvent] - Power_Grid changed from NULL to 444
2019-09-07 18:17:15.397 [vent.ItemStateChangedEvent] - Power_Solar changed from NULL to 153
2019-09-07 18:17:15.399 [vent.ItemStateChangedEvent] - Power_House changed from NULL to 597

What I am seeing, which is the last time the rule actually fired is this:

2019-09-07 13:54:31.839 [WARN ] [clipse.smarthome.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@113736f' takes more than 5000ms.
2019-09-07 13:56:31.843 [WARN ] [clipse.smarthome.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@113736f' takes more than 5000ms.
2019-09-07 13:57:31.842 [WARN ] [clipse.smarthome.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@113736f' takes more than 5000ms.

This is related to our previous conversation / suspicion that the system is to busy.
I’ve been stripping back on all the rules which are not “critical” in an attempt to clean up. Items are next.

I see there are sporadic (high?) writes to the SD card (captured in the screen print) which I can’t track down.

I’m putting the logs onto a USB3 key today / separate again from the SD card and disabling a lot of the nice-to-have metrics from various power sensors…

I am wondering about that high CPU usage. I did an htop just before restarting again and see that one core was pegged at 100%. There is something internal that is causing this.

After a restart and waiting for it to settle it’s now running at roughly 15-25% which is more understandable.

Anyway to determine the internal process which might be causing this?

By way of explanation only - all Items are initially created NULL. A Group would stay that way until a member updates, and then the aggregation function (SUM) runs. Of course other members might still be NULL etc.and that aggregation fail, for now. The resulting state would be UNDEF.

Writing that out is a useful reminder that group NULL state carries a significance - “I’ve never even tried to update” - different from UNDEF - “I messed up”.

In general, this item+group stuff is all reliable, barring config mistakes like mixing contact types into a number group and messing up any maths.

I would agree with your analysis that this is about a general system logjam, or more specifically an openHAB event bus logjam. The group updating stuff is likely just not getting a look-in.

I think Rich’s general advice here still holds


I do not expect there to be a magic solution, “fix that rule”. We suspect your system is built on an approach that exposes it to problems.

As part of the cleanup there are / were no cron jobs and, at the time this problem was happening, no (that I remember) frequent rules running. The only thing which was still very busy was the number of items receiving updates from mqtt (1x) updates.

I’ve now disabled a number of chatty ones (chatty being once every 3ish seconds) which are not functional (where I measure appliance power usage I was also measuring PF and V)

Is there some metric / log setting I can put on which will show how busy the bus actually is? A time-to-process-each-message type of thing? That would help confirm the suspicion.

I think you would be looking at a symptom, not a cause. I’d bet it’s I/O bound.

2-3 hours after another restart it happens again. :frowning:

Logs and event.log now on USB3 key. Removed a lot of the items which are not that important.
IO is “bursty” onto both the SD (Samba reading) and USB key (OH writing I assume)

This is a brand new SD card - might try another one just to remove it from the list.

“It happens again” If you think this has something to do with the Group function, it would be helpful to show the events.log with the Item updates that you would expect to cause a group update.

Indeed. If it happens again.

Like you say though, I think I’m looking at symptoms of another problem. I’m refactoring a lot so will see off the back of that.

C