Rule execution lags, possibly Java GC

Id like to bolt on my experience with slow running rules, which is an intermittent problem. I think I have good indications in my example, that my own problem is related to rule processing in some way.

For the record, Im on an Armv8 processor with 4GB of ram, running off NAND flash and the system is only there for Openhab, so its not doing any other tasks. System is Ubuntu based. A couple of basic facts:

  • This has occurred across multiple revisions of Openhab.
  • I have rebuilt the system from scratch to see that its not related to a conflict somewhere.
  • No OS patches/updates seem to have resolved it.
  • I was on MQTT v1, but have updated to MQTT v2 and the problem remains.
  • The problem is intermittent, but repeatable.
  • There are no errors in my Openhab logs and I have a clean system startup.
  • Everything generally runs fine and as expected.

To give a simple example, I have some Sonoff basics that run lights in the house. I also have some buttons that run 433Mhz RF. When you press the button, it sends the RF command to the RF bridge, which sends an MQTT command to Mosquito, after which Openhab has a rule that says “if this command appears from the RF bridge, turn the light on or off, depending on its current state”

rule "Bedside Lights button1"
when
    Item RFDataLights received update "811301"
then 
	if ( gBedroom1.state == OFF ) { gBedroom1.sendCommand("ON") }
	else gBedroom1.sendCommand("OFF")
	end

If you have used one of these rules, then this rule and all rules like it, process absolutely fine for anything up to the next 5-8 hours (best estimate). However, if the rules havnt been processed for that 5-8 hours, you get a delay of around 20 seconds before the rule runs.

To show this occurring, below is a capture direct from Mosqitto where you can see that the MQTT command arrives at 22:58 and 58 seconds, but Openhab’s logs, shows that the On/Off command that the rule doesnt run until 22:49 and 20 seconds (22 seconds later).

Obviously, because I have to wait 5-8 hours for this issue to occur, its hard for me to get diagnostic data together, but I managed to do another test today, as follows… Can I turn the light on/off in the Openhab interface, while Im waiting for this rule to be processed? e.g. is Mosquitto, MQTT or Openhab (other than rule processing) an issue?

So I opened the Openehab Android app on my phone, and pressed my RF button… The lights didnt turn On at this point. I then turned the lights on in the Openhab Android app, and they came on… then I turned them off again… all of this within the 20 seconds time. After another 10 seconds or so, the lights turned on, which would be the rule being processed.

Ive always noticed that Openhab rules act this way on a reboot of the system (or restart of the Openhab services)… which I assume its something to do with loading of the rules into memory for the first time. As such, I assume that after my 5-8 hours, something (Java or Openhab) has garbage collected the rules out of memory and its performing the same load up process as if you had restarted the system.

Ive attempted in the past to give Openhab more threads to work with as well as more memory for Java, which had no effect, and to be fair, my Openhab isnt doing much in the background anyway (its 98% dealing with presses from commands in the web interface), though its taken me a couple of months to really have a good chance to look at trends occurring and try different things.

Im open to trying any suggestions, though it may be slow going because of the intermittency of the issue. Im also open to being told this is just the way it is, deal with it.

I was wondering if there may be a rule I could create, in that specific rules file, that would force the rules file to be used every 60 minutes or so, to see if that kept the whole rules file in memory… Though I dont know Openhab internal core well enough to know if this kind of keep alive would keep the entire rules file in memory… but Im happy to try it if someone has a suggestion for a rule.

Sorry for the long post, but hopefully its detailed enough to explain the situation. Any thoughts are welcome!

Thanks

Wondering why you use Ubuntu on ARM. It’s 64bit presumably which is consuming a lot more RAM than 32bits do. Consider using openHABian. It’ll work with RPi4 and eventually with your machine, too (your description reads as if you don’t dare to name that as that’s not a RPi).
OpenHABian comes with a serious set of optimizations on memory such as ZRAM and Java parameters.

You say you gave Java more mem but your problem description reads as if it’s Java garbage collection.
Did you increase -Xms ? -Xmx is the wrong one. And did you account for the additional mem 64bit requires?

1 Like

Hi Markus

Thanks for the reply!!

I am on a non-raspberry pi system… just a generic ARM platform (as you suspected). I did try Openhabian about 1-2 years ago and there were issues (at the time) with certain file paths not existing during install, so the installer didn’t successfully complete. As such I just performed a standard install of Openhab and have gone that route ever since. I can try Openhabian again to see if it now if it gets through the install, though Im guessing you’re indicating that it wont do at this moment.

Not sure Ive been through the ZRAM stuff before… I should just follow your guide you link above?? ZRAM

As for the memory increase, I did use -Xms & -Xmx , though I also recall there was another parameter or so I used…though I cant recall immediately what they are. My system is running Azul zulu-embedded-8, which I believe is a 32 bit version of Java (but running on a 64 bit OS) so I allocated for a 32 bit system version of Java. Do you know of any articles I should look at or keywords I might look for on the community forum?

My system has about 2+GB free to play with, so plenty of overhead to adjust things.

Real memory 816.68 MiB used / 1.09 GiB cached / 3.27 GiB total

Thanks

No, that’s for openHABian only. I guess ZRAM would work on your box, too, but you’ll have to risk, try and do the work yourself without support unless you migrate to openHABian.

1 Like

Have you actually tried the cron-rule “keep active” workaround?
Curious to see if it works here.

1 Like

@rossko57 Ive not tried that… I assume its an actual bash file that you run on a frequency? I quickly searched the forum here for a match on that. Have you a code snippet or link to what I need to do/setup?

More than happy to give it a go (will take a few days of course)… but happy to try!

@mstormi Ill give @rossko57’s suggestion a go over a couple of days… see how that works and if it solves the problem. If not, Im happy to take a shot at trying Java/Zram or happy to see if OpenHabian now installs without problem. Ill keep feeding back here as I go!

Nope. In each rules file that seems to be affected by “swapping out” of rules, you add another rule with a periodic cron trigger, that does nothing much. The theory is that keeping that rule active also stops other rules being swapped out of memory.

Be careful if applying to multiple rule files, to keep every rule “name” unique.

1 Like

@rossko57 Ok… Ill add it in.

Ill try this one out from the list of whats on that link.

    rule "keep this thing in memory"
    //use the compiled code in this rule file every minute
    when
        Time cron "0 0/1 * 1/1 * ? *"
    then
        return;
    end

As only my lights are time sensitive really (I dont mind if other things take 10-20 seconds), Ill only apply it to my lighting rules file for now. I guess 1 minute is a bit overkill really… though ill try it with that frequency initially over a period of 4-5 days and report back here when I have some results.

Thanks

Some have reported the rule needs to do something, maybe a logInfo().
Maybe just making an unused var fred = null would do, don’t know.

You would think so, but I expect all this is very system dependent.

If leaving this kind of rule in service, I’d tune the cron to go off at second 45 or suchlike, to avoid everything else that fires on whole minutes.

1 Like

@rossko57 Fair point. Ive gone with a switch that has no purpose other than to change state every 1 minute. I can watch it change to check the rule is running and dont have to fill my logs up with anything… so an easy way to check at any point in time that things are kept in memory and interacting with variables in my items list.

Item

Switch 	keepalive

Rule

    rule "keep this thing in memory"
    //use the compiled code in this rule file every minute
    when
        Time cron "0 0/1 * 1/1 * ? *"
    then
    	if (keepalive.state == OFF) {keepalive.postUpdate("ON")}
    	else keepalive.postUpdate("OFF")
    end

Sitemap (Hidden in a sub-menu)

Switch item=keepalive			mappings=["ON"="On","OFF"="Off"]

You should look in your events.log :wink:
You could halve the traffic by using postUpdate instead of sendCommand

1 Like

@rossko57 Hah… ok… Have edited the code snippet on on my system and the above email in case anyone uses it in future. :slight_smile:

Ill let you know how things are going in a few days time!!

Thanks for the help.

Could you please help with finding optimum Java parameters?
Your system exhibits the “wanted” behavior we need to have to test if to change Java options makes a difference. We don’t have that chance very often that’s why I’d be thankful if you help here.

Could you please start your java with these extra options. Cumulative, i.e. try #1 first, if that does not make a difference try #1 + #2, finally #1 + #2 + #3
I don’t know your install method but there’s an environment variable EXTRA_JAVA_OPTS usually set in /etc/default/openhab2 where you can add options.

  1. -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:InitiatingHeapOccupancyPercent=95
  2. -XX:MinHeapFreeRatio=15 -XX:MaxHeapFreeRatio=30
  3. -XX:+UseCodeCacheFlushing

I’m not sure but I think you should remove the -Xms parameter before you test these new ones.
Try if removing Xms (and not adding anything else in the first place) results in “reliable” occurence of the delay problem, if so leave Xms out in further tests.

1 Like

@rossko57 The on/off switch rule
Whilst that rule has remained running fine; this morning, first thing I tried was pressing one of my RF buttons and sadly, I encountered the 20+ second delay. My rule for the on/off switch is in my lighting.rules file, where the rules for my RF button presses are…so… had it just kept the one rule in memory rather than the whole rules file? (Im not expecting an answer to that question, it was just my thought on how java/rules interact). More than happy to try doing something different with that rule if you have any other suggestions?

@mstormi Changing the Java options
More than happy to give these a go. It may take me anything up to 3-7 days to fully test them through, as theres the very occasional time where things do work flawlessly without any delays, so in that circumstance and with 3 x sets of things to try, having to wait for the issue to occur etc. Well, you get the idea.

What I will do is:

  • add/change 1 setting at a time as you request
  • perform a full restart of the system after each change
  • when its started up ill press an RF button so we know the rules are working/loaded,
  • test the change for 24-48 hours to see if they delay problem occurs, with any 20+ second delay classing as a failure (RF working normally is usually 0.5 a second to 1.5 seconds experience from button press to light on, so obviously ill note any 2 seconds upwards experiences).
  • If I encounter the delay issue, Ill move though setting 1, 2 until I reach 3.

Ill also turn off the on/off switch rule for now.

Ill feed back here when I can.

Thanks

That’s very long. The usual report is 5-8 seconds on a Pi (maybe others get same effect with shorter delay but it gets accepted/ignored).
We must not forget there may be multiple causes for similar symptoms.
Can we see the events.log, there should be a clear delay between rule trigger event and expected action in this case.

I don’t know how the “keep alive” rule is supposed to work, some report success and some like you don’t :smiley:
I don’t see the problem - but I run on fat Windows boxes, different resources, different java, etc. I do see the very-first-rule-run delay of a couple of seconds, but not delay after long rest.

1 Like

@rossko57 I will up my logging on the events.log (as I think I have it turned down) and gather that info.

@mstormi Further to my earlier post from about 1 hour ago, it looks as though -XX:+UseParallelGC is set on by default on Azul Zulu Java 8 and it looks as though if you send the command in the EXTRA_JAVA_OPTS telling it the same command again, Java doesnt like that and Openhab wont start. Though in a nutshell, as my 1st settings option Im going to use:

EXTRA_JAVA_OPTS="-XX:-UseAdaptiveSizePolicy -XX:InitiatingHeapOccupancyPercent=95"

Though if you want to know what I tested/how I reached this conclusion, my tests/reseach is below:

1st try (Openhab wont start, no http jetty message and nothing at all in the logs, but fresh log files are created if I delete them and start Openhab)

EXTRA_JAVA_OPTS="-XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:InitiatingHeapOccupancyPercent=95"

2nd try (Same result as 1st try)

EXTRA_JAVA_OPTS="-XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC"

3rd try (Openhab loads up and works)

EXTRA_JAVA_OPTS="-XX:-UseAdaptiveSizePolicy"

4th try (Openhab loads up and works)

EXTRA_JAVA_OPTS="-XX:-UseAdaptiveSizePolicy -XX:InitiatingHeapOccupancyPercent=95"

As such, I guess it doesnt like the -XX:+UseParallelGC

As mentioned Im on Zulu JDK8 from Azul. I found this site which allows you to search the list of options within Java builds/versions and that I think suggests UseParallelGC should be a valid option(?) and that it is a default on setting for Zulu JDK8(?).

https://chriswhocodes.com/zulu_jdk8_options.html

I noted there was a UseParallelOldGC, so thought id try

5th Try (OLD GC version)

EXTRA_JAVA_OPTS="-XX:-UseAdaptiveSizePolicy -XX:+UseParallelOldGC -XX:InitiatingHeapOccupancyPercent=95"

But Openhab wouldnt start with that either.

So finally, I tried a minus sign “-” against UseParallelGC and Openhab Started up…

6th Try (Used a - minus against Parallel GC)

EXTRA_JAVA_OPTS="-XX:-UseAdaptiveSizePolicy -XX:-UseParallelGC -XX:InitiatingHeapOccupancyPercent=95"

So Im assuming in some way that UseParallelGC is turned on as default within Azul Zulu 8 Java, and it doesnt like being told to use it twice on startup.

As such, Im going to omit it from the EXTRA_JAVA_OPTS and go with my command line from my 4th try (if you agree).

Thanks

You can try combinations of Java options offline like this.
This also shows the available -XX options.
Also note the output unlike what you say UseParallelGC = false on my box (Azul 8 on RPi 3)

[13:35:48] openhabian@openhabianpi:~$ java -XX:+PrintFlagsFinal -version|grep ParallelGC
    uintx ParallelGCBufferWastePct                  = 10                                  {product}
    uintx ParallelGCThreads                         = 0                                   {product}
     bool ParallelGCVerbose                         = false                               {product}
     bool UseParallelGC                             = false                               {product}
1 Like

@mstormi Ok, well I guess it may be turned off then!

java -XX:+PrintFlagsFinal -XX:+PrintFlagsInitial -version|grep ParallelGC

    uintx ParallelGCBufferWastePct                  = 10                                  {product}
    uintx ParallelGCThreads                         = 0                                   {product}
     bool ParallelGCVerbose                         = false                               {product}
     bool UseParallelGC                             = false                               {product}

And this following command line doesnt throw any errors or complain, though I appreciate its not actually running a java file. It does at least confirm my Java version and that -XX:+UseParallelGC doesnt throw any errors.

java -XX:-UseAdaptiveSizePolicy -XX:+UseParallelGC -XX:InitiatingHeapOccupancyPercent=95 -version

openjdk version "1.8.0_152"
OpenJDK Runtime Environment (Zulu Embedded 8.25.0.76-linux-aarch32hf) (build 1.8.0_152-b76)
OpenJDK Client VM (Zulu Embedded 8.25.0.76-linux-aarch32hf) (build 25.152-b76, mixed mode, Evaluation)

Java; I have to admit is not an vast area of expertise for me… and you’re welcome to shoot me down :slight_smile: Though Im trying to understand. I read the below article and tried -XX:+UseParNewGC which does work on the EXTRA_JAVA_OPTIONS… (as in, Openhab does start up).

is that any use for me to try/use for the purposes of these tests?

@rossko57 @mstormi

Im slowly working through the Java options, Im on step 2 currently, though I am using the -XX:+UseParNewGC

Rossko57, you asked me for an event log when the delay happened, which Ive managed to get one here for you:

2020-04-29 20:09:57.137 [vent.ItemStateChangedEvent] - harmonyhub_device_HarmonyHub_51535350_buttonPress changed from UNDEF to PowerOff
2020-04-29 20:09:57.567 [vent.ItemStateChangedEvent] - harmonyhub_device_HarmonyHub_51535350_buttonPress changed from PowerOff to UNDEF
2020-04-29 20:10:43.743 [vent.ItemStateChangedEvent] - RFDataLights changed from NULLFORTESTING to 3D1301
2020-04-29 20:10:56.670 [vent.ItemStateChangedEvent] - Current_DateTime changed from 2020-04-29T20:09:56.642+0100 to 2020-04-29T20:10:56.649+0100
2020-04-29 20:10:56.913 [ome.event.ItemCommandEvent] - Item 'gBedroom1' received command ON
2020-04-29 20:10:56.940 [ome.event.ItemCommandEvent] - Item 'lsBed1' received command ON
2020-04-29 20:10:56.954 [ome.event.ItemCommandEvent] - Item 'lsBed1a' received command ON
2020-04-29 20:10:56.961 [vent.ItemStateChangedEvent] - RFDataLights changed from 3D1301 to NULLFORTESTING
2020-04-29 20:10:56.964 [nt.ItemStatePredictedEvent] - lsBed1 predicted to become ON
2020-04-29 20:10:56.977 [nt.ItemStatePredictedEvent] - lsBed1a predicted to become ON
2020-04-29 20:10:56.992 [vent.ItemStateChangedEvent] - lsBed1 changed from OFF to ON
2020-04-29 20:10:56.994 [vent.ItemStateChangedEvent] - lsBed1a changed from OFF to ON

This one shows a 13 second delay. The 3rd line is me pressing the button.

On the line that says “RFDataLights changed from 3D1301 to NULLFORTESTING” I put this into my rules to reset the RFdataLights variable so that the event log would always show a change event when I pressed a button.

Ill keep on with the Java settings, but hope this provides the event.log detail you wanted.