Group:Number:SUM stopped summing and Member-of rule stopped firing

Folks,

I added some items this morning -

Group:Number:SUM                g_Utility_Proxy_Usage "Group for Proxy items"
Number Power_Solar_prxy                                 <energy> (g_Utility_Proxy_Usage)                 { lightwaverf="serial=FFC0FE,type=ENERGY_CURRENT_USAGE"}
Number Power_House_prxy                                 <energy> (g_Utility_Proxy_Usage)                 { lightwaverf="serial=179DFE,type=ENERGY_CURRENT_USAGE"}

Number Power_Solar  "Power - Solar - Watts [%.1f]"      <energy> (g_persist_change)
Number Power_House  "Power - House - Watts [%.1f]"      <energy> (g_persist_change)     
Number Power_Grid   "Power - Grid -  Watts [%.1f]"      <energy> (g_persist_change)

I then have a rule


rule "Utilit Proxy change"
when Member of g_Utility_Proxy_Usage changed
then 
    var Number pHouse = 0.00   //(Power_House.state as Number)
    var Number pSolar = 0.00   //(Power_Solar.state as Number)
    var Number pNet =  0.00    //pHouse - pSolar

//    logWarn("KKK", "message")

    pSolar = (Power_Solar_prxy.state as DecimalType) - 69;
    pHouse = (Power_House_prxy.state as DecimalType) - 69;
    pNet = pHouse - pSolar;

    Power_Grid.postUpdate(pNet)
    Power_Solar.postUpdate(pSolar)
    Power_House.postUpdate(pHouse)

end


All this worked well this morning. 5ish hours later I came back to see that the rule had stopped firing.
I added in the logWarn and changed the values of the two items myself and the group remained at NULL.

I touched the item file and the group file and saw that it said “reloading file bla bla” (I can’t remember the exact message)

Nothing I could do would get this group to SUM the values. Even min/max/avg had nothing. It remained NULL.

Another SUM group I have remained working as expected.

I stopped the OH service and restarted it and it sprang into life.

You can see in the screen grab below where I update the item and check the group - it remains NULL.
The other item in the group unfortunately is not in the print.

Anywhere I can look for why this might have happened?
This is all running on a rpi4 2GB

root@homeauto:~# java -version
openjdk version "1.8.0_152"
OpenJDK Runtime Environment (Zulu Embedded 8.25.0.76-linux-aarch32hf) (build 1.8.0_152-b76)
OpenJDK Client VM (Zulu Embedded 8.25.0.76-linux-aarch32hf) (build 25.152-b76, mixed mode, Evaluation)

Any clues I might be able to dig up after the fact?

I’m not sure what you should expect to get for SUM if any one of the members is NULL or UNDEF.

Any problems with that should not stop any Member of triggering, though.

Make sure your Item names (including Groups) are unique both across xxx.items files and any tinkering you may have made in PaperUI.

Neither of the items were NULL. See the last 3 commands before I bounced the service. I set one to 222 and the other to 666. I then check the group and its NULL.

after bouncing I rerun the same commands and it works as expected.

It’s all getting a little depressing :frowning:

Do something about it. This does work for other people. What base OH version are you using? Have you checked for duplicate Items?

Look in openhab.log at boot time, for any oddities when your xxx.items files load.
Look in events.log. Please show us when it works and when it doesn’t.

This is all running on the RPi 4 with OH 2.4 (Version is in the screen print)

In terms of logs and events - here is the abridged detail.

The NULL > UNDEF is because when I first had it I did not have it as SUM or AVG.
No further changes to the group in the log as it remains, rightly so, at UNDEF.
However, you can see that the rule is firing as the only way for Power_Grid to change is from the rule.

Events.log

2019-09-07 13:08:08.183 [vent.ItemStateChangedEvent] - Power_House_prxy changed from NULL to 1016
2019-09-07 13:08:08.187 [GroupItemStateChangedEvent] - g_Utility_Proxy_Usage changed from NULL to UNDEF through Power_House_prxy
....... (loads of events in between this time as the binding updates every second or so)
2019-09-07 13:57:35.744 [vent.ItemStateChangedEvent] - Power_Solar_prxy changed from 446 to 462
2019-09-07 13:57:35.747 [vent.ItemStateChangedEvent] - Power_Grid changed from 365 to 349
2019-09-07 13:57:35.750 [vent.ItemStateChangedEvent] - Power_Solar changed from 377 to 393

13:57 is the last log entry for the Power_Grid events. I then changed the group to be SUM/AVG/MIN/MAX trying to coax some life out of it but to no avail. The group remained NULL (which it would because I kept changing the aggregate type) but the rule did not fire which is highlighted below.

events.log

2019-09-07 16:50:01.504 [vent.ItemStateChangedEvent] - Power_Solar_prxy changed from 604 to 499
2019-09-07 16:50:04.254 [vent.ItemStateChangedEvent] - Power_House_prxy changed from 791 to 788
--no entry for grid power which gets updated by the rule.

After a restart of the service I get the following (which is me poking the items manually as shown in the screenshot.

2019-09-07 18:17:15.361 [ome.event.ItemCommandEvent] - Item 'Power_House_prxy' received command 666
2019-09-07 18:17:15.369 [vent.ItemStateChangedEvent] - Power_House_prxy changed from NULL to 666
2019-09-07 18:17:15.372 [GroupItemStateChangedEvent] - g_Utility_Proxy_Usage changed from 222 to 888 through Power_House_prxy
2019-09-07 18:17:15.393 [vent.ItemStateChangedEvent] - Power_Grid changed from NULL to 444
2019-09-07 18:17:15.397 [vent.ItemStateChangedEvent] - Power_Solar changed from NULL to 153
2019-09-07 18:17:15.399 [vent.ItemStateChangedEvent] - Power_House changed from NULL to 597

What I am seeing, which is the last time the rule actually fired is this:

2019-09-07 13:54:31.839 [WARN ] [clipse.smarthome.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@113736f' takes more than 5000ms.
2019-09-07 13:56:31.843 [WARN ] [clipse.smarthome.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@113736f' takes more than 5000ms.
2019-09-07 13:57:31.842 [WARN ] [clipse.smarthome.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.eclipse.smarthome.io.monitor.internal.EventLogger@113736f' takes more than 5000ms.

This is related to our previous conversation / suspicion that the system is to busy.
I’ve been stripping back on all the rules which are not “critical” in an attempt to clean up. Items are next.

I see there are sporadic (high?) writes to the SD card (captured in the screen print) which I can’t track down.

I’m putting the logs onto a USB3 key today / separate again from the SD card and disabling a lot of the nice-to-have metrics from various power sensors…

I am wondering about that high CPU usage. I did an htop just before restarting again and see that one core was pegged at 100%. There is something internal that is causing this.

After a restart and waiting for it to settle it’s now running at roughly 15-25% which is more understandable.

Anyway to determine the internal process which might be causing this?

By way of explanation only - all Items are initially created NULL. A Group would stay that way until a member updates, and then the aggregation function (SUM) runs. Of course other members might still be NULL etc.and that aggregation fail, for now. The resulting state would be UNDEF.

Writing that out is a useful reminder that group NULL state carries a significance - “I’ve never even tried to update” - different from UNDEF - “I messed up”.

In general, this item+group stuff is all reliable, barring config mistakes like mixing contact types into a number group and messing up any maths.

I would agree with your analysis that this is about a general system logjam, or more specifically an openHAB event bus logjam. The group updating stuff is likely just not getting a look-in.

I think Rich’s general advice here still holds

I do not expect there to be a magic solution, “fix that rule”. We suspect your system is built on an approach that exposes it to problems.

As part of the cleanup there are / were no cron jobs and, at the time this problem was happening, no (that I remember) frequent rules running. The only thing which was still very busy was the number of items receiving updates from mqtt (1x) updates.

I’ve now disabled a number of chatty ones (chatty being once every 3ish seconds) which are not functional (where I measure appliance power usage I was also measuring PF and V)

Is there some metric / log setting I can put on which will show how busy the bus actually is? A time-to-process-each-message type of thing? That would help confirm the suspicion.

I think you would be looking at a symptom, not a cause. I’d bet it’s I/O bound.

2-3 hours after another restart it happens again. :frowning:

Logs and event.log now on USB3 key. Removed a lot of the items which are not that important.
IO is “bursty” onto both the SD (Samba reading) and USB key (OH writing I assume)

This is a brand new SD card - might try another one just to remove it from the list.

“It happens again” If you think this has something to do with the Group function, it would be helpful to show the events.log with the Item updates that you would expect to cause a group update.

Indeed. If it happens again.

Like you say though, I think I’m looking at symptoms of another problem. I’m refactoring a lot so will see off the back of that.

C

Did you have this happen again? Any indication of what the cause could be? It looks like I am running into the exact same problem. My system (OH 2.5.1) has been running stable for a very long time. And suddenly I am losing SUM groups being updated.

Nope.
It would be helpful if you provided Item/Group definitions and events.log showing “missed” update.

Sure.

All updates of the base items happen in rules. Here is such (Jython) a rule:

@rule("PV uitlezen")
@when("Time cron 0 0/5 5-23 * * ?")
def pvUitlezen(event):
	log.info("PV uitlezen")
	# Connect to inverter using SBFspot
	# Read last retrieved data from SQlite DB
	# Allow 120s for operation to complete
	pvdata = executeCommandLine("/home/openhabian/smadata/scripts/loadSolarDataOH.sh", 120000)
	log.debug("pvdata: {}".format(pvdata))

	pvbuffer = pvdata.split("|")
	pvLastUpdateGS2 = pvbuffer[0].replace(" ","T")
	pvPac1GS2 = pvbuffer[1]
	pvIac1GS2 = pvbuffer[2]
	pvUac1GS2 = pvbuffer[3]
	pvFreqGS2 = pvbuffer[4]
	pvPdc1GS2 = pvbuffer[5]
	pvIdc1GS2 = pvbuffer[6]
	pvUdc1GS2 = pvbuffer[7]
	pvPdc2GS2 = pvbuffer[8]
	pvIdc2GS2 = pvbuffer[9]
	pvUdc2GS2 = pvbuffer[10]
	pvConditiesGS2 = pvbuffer[11]
	pvGridrelayGS2 = pvbuffer[12]
	pvTempGS2 = pvbuffer[13]
	pvEDayGS2 = pvbuffer[14]
	pvETotGS2 = pvbuffer[15]

	events.postUpdate("PVLastUpdateGS2", pvLastUpdateGS2)

	pveTodayGS2 = float(pvEDayGS2)/1000
	events.postUpdate("PVEtodayGS2", str(pveTodayGS2))
	events.postUpdate("PVEtodayGS1", str(pveTodayGS2 * ratio))

	pvPowerGS2 = float(pvPac1GS2)
	events.postUpdate("PVPowerGS2", str(pvPowerGS2))
	events.postUpdate("PVPowerGS1", str(pvPowerGS2 * ratio))

	prevTotalGS1 = items["PVEtotalGS1"].floatValue()
	prevMeter1 = items["PVEmeter1"].floatValue()
	prevMeter2 = items["PVEmeter2"].floatValue()

	# Take 1 minute later than real update time to make sure the right values are captured from the persistence service
	totalGS1Update = DateTime(PersistenceExtensions.lastUpdate(ir.getItem("PVEtotalGS1"), "influxdb")).plusMinutes(1)
	meter1Update = DateTime(PersistenceExtensions.lastUpdate(ir.getItem("PVEmeter1"), "influxdb")).plusMinutes(1)
	meter2Update = DateTime(PersistenceExtensions.lastUpdate(ir.getItem("PVEmeter2"), "influxdb")).plusMinutes(1)
	histTotalGS2GS = PersistenceExtensions.historicState(ir.getItem("PVEtotalGS2"), totalGS1Update, "influxdb").state.floatValue()
	histTotalGS1M = PersistenceExtensions.historicState(ir.getItem("PVEtotalGS1"), meter1Update, "influxdb").state.floatValue()
	histTotalGS2M = PersistenceExtensions.historicState(ir.getItem("PVEtotalGS2"), meter2Update, "influxdb").state.floatValue()
	totalGS2 = float(pvETotGS2)/1000
	totalGS1 = prevTotalGS1 + (totalGS2 - histTotalGS2GS)*ratio
	meter1 = prevMeter1 + (totalGS1 - histTotalGS1M)
	meter2 = prevMeter2 + (totalGS2 - histTotalGS2M)
	
	events.postUpdate("PVEtotalGS1", str(totalGS1))
	events.postUpdate("PVEtotalGS2", str(totalGS2))
	events.postUpdate("PVEmeter1", str(meter1))
	events.postUpdate("PVEmeter2", str(meter2))

I sum most of the values (1 & 2) using a sum group. I defined these in PaperUI. As an example, you can see the definition of PVPower, PVPowerGS1 and PVPowerGS2 from the REST API.
PVPower:

{
  "members": [
    {
      "link": "http://openhabianpi:8080/rest/items/PVPowerGS2",
      "state": "0.0",
      "type": "Number",
      "name": "PVPowerGS2",
      "label": "PV Power GS2",
      "tags": [],
      "groupNames": [
        "gEMeting",
        "PVPower"
      ]
    },
    {
      "link": "http://openhabianpi:8080/rest/items/PVPowerGS1",
      "state": "0.0",
      "type": "Number",
      "name": "PVPowerGS1",
      "label": "PV Power GS1",
      "tags": [],
      "groupNames": [
        "gEMeting",
        "PVPower"
      ]
    }
  ],
  "groupType": "Number",
  "function": {
    "name": "SUM"
  },
  "link": "http://openhabianpi:8080/rest/items/PVPower",
  "state": "0.0",
  "editable": true,
  "type": "Group",
  "name": "PVPower",
  "label": "PV Power",
  "tags": [],
  "groupNames": [
    "gEMeting"
  ]
}

PVPowerGS1:

{
  "link": "http://openhabianpi:8080/rest/items/PVPowerGS1",
  "state": "0.0",
  "editable": true,
  "type": "Number",
  "name": "PVPowerGS1",
  "label": "PV Power GS1",
  "tags": [],
  "groupNames": [
    "gEMeting",
    "PVPower"
  ]
}

PVPowerGS2:

{
  "link": "http://openhabianpi:8080/rest/items/PVPowerGS2",
  "state": "0.0",
  "editable": true,
  "type": "Number",
  "name": "PVPowerGS2",
  "label": "PV Power GS2",
  "tags": [],
  "groupNames": [
    "gEMeting",
    "PVPower"
  ]
}

And here is my influxdb persistence file (I also use mapdb all for RestoreOnStartup):

Strategies {
	everyMin  : "0 * * * * ? *"
	everyHour : "0 0 * * * ? *"
	default = everyChange
}

Items {
	gEMeting*													: 	strategy = everyChange
	Emeter, PVPower, PVEmeter, PVEtotal, PVEtoday, TotVerbruik	:	strategy = everyChange
}

In the event.log, I have:

2020-02-24 17:15:08.156 [vent.ItemStateChangedEvent] - PVPowerGS2 changed from 18.0 to 3.0
2020-02-24 17:15:08.167 [vent.ItemStateChangedEvent] - PVPowerGS1 changed from 9.93744 to 1.65624
2020-02-24 17:15:08.171 [GroupItemStateChangedEvent] - PVPower changed from 27.93744 to 4.65624 through PVPowerGS1
...
2020-02-24 17:40:07.265 [vent.ItemStateChangedEvent] - PVPowerGS2 changed from 15.0 to 6.0
2020-02-24 17:40:07.274 [GroupItemStateChangedEvent] - PVPower changed from 14.2812 to 9.31248 through PVPowerGS1
2020-02-24 17:40:07.279 [vent.ItemStateChangedEvent] - PVPowerGS1 changed from 8.2812 to 3.31248
...
2020-02-24 17:45:06.831 [vent.ItemStateChangedEvent] - PVPowerGS2 changed from 6.0 to 0.0
2020-02-24 17:45:06.837 [GroupItemStateChangedEvent] - PVPower changed from 9.31248 to 3.31248 through PVPowerGS2
2020-02-24 17:45:06.841 [GroupItemStateChangedEvent] - PVPower changed from 3.31248 to 0.0 through PVPowerGS1
2020-02-24 17:45:06.844 [vent.ItemStateChangedEvent] - PVPowerGS1 changed from 3.31248 to 0.0

And later, after a restart of openHAB, restoreOnStartup from mapdb kicking in:

2020-02-24 23:10:40.474 [vent.ItemStateChangedEvent] - PVPowerGS2 changed from NULL to 0.0
2020-02-24 23:10:40.515 [vent.ItemStateChangedEvent] - PVPowerGS1 changed from NULL to 0.0

And later in the morning, first readings above 0, but no sum being made anymore from then on:

2020-02-25 08:00:07.143 [vent.ItemStateChangedEvent] - PVPowerGS2 changed from 0.0 to 30.0
2020-02-25 08:00:07.149 [vent.ItemStateChangedEvent] - PVPowerGS1 changed from 0.0 to 16.5624

Just before the restart, I did some maintenance in other areas of openHAB. I did not touch the items above or the persistence definitions.

So in summary, the Group SUM is not working after a reboot.

It sounds rather like one of those horrid race conditions OH seems prone to when initially loading.

Assuming the REST API output is taken during the “fault” situation, I can’t think of any other info to be gleaned,

If you had Items files I would bet a touch would fix … as you’re working with PaperUI, I would try a harmless edit of Group Item label or suchlike to see if it will relink.

I think the same. I hate having these unpredictable and difficult to analyze issues.
I just did an update to OH 2.5.2, a reboot of my Raspberry, and the issue is gone. It is adding up correctly again.