Struggles with stability

rch1708 · February 26, 2020, 5:18pm

Platform information:
- Hardware: RPI3
- OS: Raspbian 4.19
- Java Runtime Environment: No idea
- openHAB version: 2.5.2
Issue of the topic: Various stability issues, such as loss of Evohome connection and loss of connection between openhab2/InfluxDB/Grafana

Dear Experts,

I’ve been struggling to learn how to apply OH for the last week or so. Although I’m a newbie to OH, I’m not a stranger to software engineering. But the last code I wrote of any kind was Python about 5 years ago and the last time I worked on a serious (professional) project was probably more than 20 years ago. So I can tell you all about C and nothing about Java. Thus, some things with OH are no problem at all and others are a complete mystery to me.

To be clear, I have a straightforward OH installation set up and - in theory - running. It has Evohome, Plugwise, Smart Meter bindings and things. I’ve also installed InfluxDB and Grafana and have been producing beautiful graphics of power consumption and temperatures. I also made a full backup at this point. However, I then started experimenting with rules and, as you can expect, made mistake after mistake. At some point, after a restart, the Evohome binding, which was previously happily doing its thing, failed to initialise, saying that it failed to log in to my TCC account. After much exploration & reading, I used openhab-cli to delete all links and items (I define items in a file, things in Paper UI), emptied the cache and tmp directories, used openhab-cli to clean-the-cache and then started and… yes, indeed Evohome started working again.

However, something else screwed up badly enough that I ended up with duplicate things that I couldn’t remove in Paper UI. So I restored my backup and cleared the cache/tmp dirs. Etc. Today, firstly Evohome stopped working again. So I went through the process of cleaning up links and items and restarted. But then Grafana started progressively showing “No Data” as time went by. It turns out for some reason that there’s no data flowing from openhab to Grafana. Further, there’s something odd going on with rules - some rules are complaining that the Items that they reference are not present, while as far as I can tell from Paper UI, they’re alive and kicking. I’ve not YET gone all the way back to my backup to restore the situation, but I have deleted items, links, cache and tmp again, and I still have the problem(s) when I restart.

To be honest, I’m getting a bit fed up with this seeming fragility. It’s natural that I’m going to screw up with rule scripts and the system should be resilient enough not to be damaged by what must be common mistakes - if that’s what’s causing all these problems. But maybe it’s something else?

I would appreciate your help and advice.

Bruce_Osborne · February 26, 2020, 5:26pm

You need Java 8 not Java 11

rch1708 · February 26, 2020, 5:47pm

So yeah, my RPI reports Java 11.0.6. Doesn’t OH flag such an inconsistency? I have to say I’d be mighty pi**ed if I’ve just spent a huge amount of time struggling with problems caused by such a fundamental discrepancy.

Bruce_Osborne · February 26, 2020, 5:49pm

No. OH3 will be Java 11 but the current stable version requires Java 8

rch1708 · February 26, 2020, 5:51pm

Amazing.

rch1708 · February 26, 2020, 5:57pm

Java 8 installed. Will OH automatically pickup Java 8 over 11 or do I need to set a default somewhere?

Bruce_Osborne · February 26, 2020, 6:05pm

Not sure. I think there is a command update-alternatives ?
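
For reference, a minimal sketch of how one might check which Java is the default and then switch it on Raspbian. It assumes openHAB is started with whatever `java` resolves to on the PATH, which is the usual case when JAVA_HOME is not set. The `java_major` helper is hypothetical, written only for this illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of openHAB): extract the major version
# from a Java version string. Handles both the old "1.8.0_242" scheme
# and the newer "11.0.6" scheme.
java_major() {
    case "$1" in
        1.*) printf '%s\n' "$1" | cut -d. -f2 ;;
        *)   printf '%s\n' "$1" | cut -d. -f1 ;;
    esac
}

# Report what the shell currently resolves as "java", if anything.
if command -v java >/dev/null 2>&1; then
    version=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
    echo "Default java is major version $(java_major "$version")"
fi

# To switch the default on Debian/Raspbian, pick the Java 8 entry interactively:
#   sudo update-alternatives --config java
```

After switching, openHAB has to be restarted before it picks up the new runtime.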

rch1708 · February 26, 2020, 6:23pm

Got it. Now let’s see if there’s any improvement. I hope I don’t have to go back to the restore though…

rch1708 · February 26, 2020, 6:44pm

Pfff… better but no cigar. Looks like I’ll have to restore the backup.

rlkoshak · February 26, 2020, 6:57pm

I know there is a lot written here but not a whole lot of details. But I see at least four different and unrelated issues here.

First, I recommend using openHABian since you are on an RPi anyway. This is a known good configuration which will eliminate a number of sources of error such as having the wrong JRE installed.

Issue 1: Evohome

Clearly this binding is not happy for some reason. Put that binding into debug logging, gather the logs around the time that it loses its connection, and submit an Issue on the binding if it seems warranted. It’s impossible to say what is going on and why it’s losing the connection without some logs. But it does seem to be a binding issue.
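
As a rough sketch, debug logging for a single binding can be switched on from the Karaf console (opened with `openhab-cli console`). The logger name below assumes the binding follows the usual `org.openhab.binding.<name>` package convention, so check it against the names that appear in openhab.log.

```
openhab> log:set DEBUG org.openhab.binding.evohome
openhab> log:tail
openhab> log:set INFO org.openhab.binding.evohome
```

The last line turns the extra logging off again once the failure has been captured. The same pattern should work for other add-ons under their own logger names.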

I didn’t know you could delete links using openhab-cli. But that would only work if you are defining the Links in PaperUI. Are you doing so or are your Links defined in your .items files (e.g. { channel="evo:...)?

Also note that openhab-cli clean-the-cache just deletes the cache and tmp directories. There is no need to do both.

When you clear the cache, it causes OH to download and reinstall the bindings again. That’s probably why it started to work again. But I’m willing to bet that restarting OH or just restarting the bundle from the Karaf console would have been sufficient.

Issue 2: Duplicate Things

This is an odd one but it could be caused by the deleting of the Links or having a mix of Links defined by PaperUI and others defined in .items files. When you have stray Links in JSONDB that refer to Things or Items that no longer exist OH gets confused and you end up with zombie Items or Things. But without more details it’s impossible to tell what is going on.

Issue 3: Grafana

Well, the data really flows from OH to InfluxDB. Grafana pulls the data from there. So you need to focus your looking on InfluxDB. Is it running? Are new values being written to it? Is the add-on still installed?
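
One way to answer "are new values being written?" is to ask InfluxDB for the newest points of a single Item. The sketch below assumes an InfluxDB 1.x install with the `influx` command-line client, and assumes the add-on stores one measurement per Item, named after the Item. The helper function, the database name and the Item name are placeholders for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build an InfluxQL query for the newest points of one Item.
# Assumes the add-on writes one measurement per Item, named after the Item.
latest_points_query() {
    printf 'SELECT * FROM "%s" ORDER BY time DESC LIMIT %s' "$1" "${2:-5}"
}

# Example use with the InfluxDB 1.x command-line client
# (database and Item names are placeholders):
#   influx -precision rfc3339 -database 'openhab_db' \
#          -execute "$(latest_points_query PW_Kitchen_Power)"
```

If the newest timestamps stop advancing while events.log still shows the Item updating, the gap is between openHAB and InfluxDB rather than in Grafana.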

Again, putting the add-on into debug or trace level logging would be helpful.

Issue 4: Items don’t exist in Rules

This is a known issue that hits only a minority of users some of the time. There is an issue open for it but a fix will probably have to wait for OH 3.

The workaround is to wait for OH to finish starting after clearing the cache and then simply restart OH. A very few have reported needing to restart OH twice or more but I’ve only ever had to restart once. The issue is related to the indeterminate boot ordering: stuff starts running before its dependencies are loaded.

You will potentially see this error every time you clear the cache (which includes updates).

Honestly, I think Evohome may have some sort of bug and the rest is caused by your efforts to fix that bug.

You really should not have to delete the Links and I’m not sure where you got that idea. It’s unclear how you are defining your Links and how you do will greatly impact what openhab-cli will do when it removes the Links.

The fact that the binding works for a while but then fails, and the fact that it continues to do this after you clear the cache, is an indicator that you should not clear the cache to address that original problem. So stop doing that. Try a simple restart of OH. If that doesn’t kick start Evohome, uninstall and then reinstall just that one binding. Don’t clear the cache.

When you see the no such Item errors, all you need to do is wait and reboot. But if you stop clearing the cache, you shouldn’t see those errors any more.

I can’t say what’s going on with InfluxDB/Grafana without more details and getting a more stable system to debug from.

rch1708 · February 26, 2020, 7:13pm

Hi Rich,

Thanks for your very helpful answer. Indeed there’s a mish-mash of things going on here, some related, some not. Your remark about Issue 4 is very helpful. On Issue 1, I have a .items file with all my items defined with their channel mappings. So indeed, openhab-cli strips out the items and the links, but I leave the things definitions alone. On boot the items and links/channels are restored. But yeah, this does rather seem like overkill. However a simple reboot (of the whole RPI even) didn’t fix the Evo problem. If it reoccurs I guess I’ll have to read up on the diagnostics you mention.

Issue 2, I’m not sure what happened. I think the Plugwise binding ‘things’ check (re)discovered PW circles that were already present as things and in my ignorance I clicked on them in the inbox. Certainly I’ve seen the plugwise binding consistently rediscovering existing things. Whatever, it ended up in a big mess.

Issue 3, sure I know that the data flows through Influx. I checked that Influx was running, entered the Influx console and checked what little I know, but nothing jumped out. I’ll have to figure out how to check if new items are flowing into it.

Issue 4. Good to know. My sw background is in concurrency. My experiments with rules the last days have taught me that they don’t have run-to-completion semantics, which leads to interesting (!) cascades of events running through the system when mass updating a group of items.

My backup should now be restored and ready for me to check out. I’ll go and see whether things are any better…

rossko57 · February 26, 2020, 7:26pm

You might find this thread helpful, or not

rch1708 · February 26, 2020, 7:37pm

Interesting example. As an educational example I set up a Group of Switches and a rule that when one switch was turned on, all others that were off would also be turned on. It was interesting to see how the Info logs from the resultant cascade of update events to the group were interleaved. Thus I learnt that rules do not have run-to-completion.

This would all be very enjoyable if it weren’t for the problems. Anyway, after restoring the backup and saved config files I seem to be back where I was with what seems to be a working system.

rch1708 · February 26, 2020, 7:50pm

Sigh… still problems. Only some data is making it through to Grafana. Evohome data is flowing but the Plugwise data stopped - see graph. There are also some seriously corrupted looking messages from Plugwise in the logs. I guess it’s time to start from scratch. Don’t think I have the enthusiasm to start that this evening.

The PW log looks like this:

2020-02-26 20:44:51.541 [WARN ] [gwise.internal.PlugwiseMessageSender] - Error sending: No ACK received after 1 second: 0023000D6F000072970B47AE
2020-02-26 20:44:52.694 [WARN ] [gwise.internal.PlugwiseMessageSender] - Error sending: No ACK received after 1 second: 0012000D6F000072970B573E
2020-02-26 20:44:53.035 [WARN ] [se.internal.PlugwiseMessageProcessor] - Plugwise protocol message error: 0000153101E9
00131531000D6F000027B000C010500000151
001315320D600A61010D40011F4700000000000EEDDB
2020-02-26 20:44:53.848 [WARN ] [gwise.internal.PlugwiseMessageSender] - Error sending: No ACK received after 1 second: 0012000D6F0000AF671FAC98
#ASeusNdIf:Suc A:006007C8#ASeusNdIf:DsiainMC PlugwiseMessageProcessor] - Plugwise protocol message error: 000015330C4F
0DF00F7F
00241533000D6F0000AF671F140291400004487001856539070140234E0844C202B23B

rlkoshak · February 26, 2020, 8:21pm

Do you see Items updating in events.log but not showing up in the chart? If not, then the problem has to do with the Plugwise binding. If so, then we need debug logging from the InfluxDB add-on.

rch1708 · February 27, 2020, 7:24am

Yes. Plugwise events are appearing in the events log and current status is appearing in the Paper UI. Data from Evohome items appears in Grafana, but no data from Plugwise items.

csi_oh · February 27, 2020, 9:08am

Hi,

Just wanted to say that I’m running OH2 stable on Buster/Java 11 without any issues. But yes, as the docs say, Java 8 is required.

Can you please verify in Grafana’s data-source administration that everything is up to date?

Just wanted to tell you that there’s no direct data-flow from OH to Grafana. Grafana “pulls” data from a data-base (in most cases a time-series DB is preferred, e.g. influxDB).

If you need a GUI for data exploration, I strongly recommend Chronograf. After installation, you can easily browse your influxdb and see what happens.

One thing you could also double-check is whether persistence is set up correctly. There was an issue last week, maybe this helps!

rlkoshak · February 27, 2020, 3:57pm

OK, if the events are coming in then the issue is with the InfluxDB configuration. Double check your influxdb.persist file and make sure everything that should be persisted is listed and the strategy makes sense. Remember that * is not a wild card in .persist files. “Foo*” means all members of the Group “Foo”, not all Items whose names start with “Foo”.
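
To make the Group-versus-wildcard point concrete, here is a sketch of an influxdb.persist file. The Group names `PWP` and `PWE` are only examples; substitute whatever Groups exist in your own .items files.

```
// influxdb.persist (illustrative sketch; PWP and PWE are example Group names)
Strategies {
    default = everyUpdate
}

Items {
    // "PWP*" means every member of the Group PWP.
    // Plain "PWP" would refer only to the Group Item itself.
    PWP*, PWE* : strategy = everyUpdate
}
```

An Item named PWP_Kitchen that is not a member of the Group PWP would not be persisted by this file, even though its name starts with "PWP".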

Put the InfluxDB addon into debug or trace logging to see if there is any activity when the Items update/change state from the InfluxDB addon.

I’ve never seen it selectively fail like what is described so I’m going to guess somewhere along the way the configuration got messed up.

rch1708 · February 28, 2020, 8:58am

Hi Rich, you hit the nail on the head. It is indeed an influx .persist problem. All my plugwise items have names beginning with “PW”. So originally my Influx persist file had an entry “PW* : strategy : everyUpdate”. This worked and I assumed that PW* expands to all items with names starting with PW. (Yes, I have to get my naming conventions standardised at some point)

But it actually doesn’t make sense to log anything but power and energy measurements to Influx. While trying to track down the cause of the various problems I’ve been seeing, I decided to reduce what was being logged to Influx. I have a group defined for all plugwise energy measurements “PWE” and another group for all plugwise power measurements “PWP”. I have a main group “PW”, that includes these (sub) groups. So I changed the .persist entries to “PWP : strategy everyUpdate” and “PWE : strategy everyUpdate”, expecting that now only plugwise power & energy items would be logged. But yeah, this is the problem.

If I understand you correctly, PW* works because it expands the MAIN “PW” group and not because it’s a wildcard for all PW items? Hmmmm… I don’t think I would have discovered that on my own. I suppose it’s an argument for better naming conventions in my definitions.

At the moment everything seems to be running smoothly. I’ll make a new backup and set about the hard task of figuring out how to do arithmetic with items and the harder task of how to accumulate energy consumption by period. The penny is yet far from dropping on how item methods work and why/how one needs to cast .states into different types to do simple sums. E.g., I can make a rule work that does simple arithmetic with a Number item. But if I change that item to Number:Energy, I can’t figure out how to change the local var needed to perform arithmetic into the equivalent of Number:Energy. I lack proper understanding of what’s going on, I guess.

Anyway, many thanks for your help.

Bruce_Osborne · February 28, 2020, 9:54am

You can create the other groups too. An Item can belong to multiple groups.