Memory Leak in OH 4.1.1 and/or InfluxDB Persistence?

Platform information: Raspberry Pi 4
Hardware: Arm / 4GB RAM / 64GB SD
OS: Raspbian 11
Java Runtime Environment: openjdk version “17.0.9” 2023-10-17
openHAB version: 4.1.1 on Openhabian

Since setting up a fresh OH 4.1.1 openHABian installation, the system seems to be leaking memory.
The only thing I added on the new installation is the InfluxDB persistence add-on, which I use to store a few values in an external InfluxDB.

It worked for weeks, up to a point where InfluxDB did not receive any values. Checking the openHAB log, I saw lots of errors about “Out of Memory” related to InfluxDB, as shown below.

Checking OH memory, consumption was up to ~1.5GB (still lots of free memory available??).
Restarting InfluxDB did not fix the issue, hence I decided to restart OH.
After the restart last Saturday, memory consumption was at 450MB. I decided to monitor memory consumption on an OH graph, and I see it going up and up.
2 days later, consumption is at almost 700MB from the initial 450MB and still growing → see attached screenshot.

I wonder if there’s a memory leak related to the InfluxDB persistence service?

I don’t have a lot of bundles loaded, and I did not have this issue with the old installation, which ran without InfluxDB persistence.

openhab> bundle:list | grep Add-ons
249 x Active x  80 x 4.1.1                  x openHAB Add-ons :: Bundles :: Automation :: JavaScript Scripting
250 x Active x  80 x 4.1.1                  x openHAB Add-ons :: Bundles :: Persistence Service :: RRD4j
251 x Active x  80 x 4.1.1                  x openHAB Add-ons :: Bundles :: Exec Binding
253 x Active x  80 x 4.1.1                  x openHAB Add-ons :: Bundles :: NTP Binding
254 x Active x  80 x 4.1.1                  x openHAB Add-ons :: Bundles :: Network UPS Tools Binding
255 x Active x  80 x 4.1.1                  x openHAB Add-ons :: Bundles :: iCalendar Binding
258 x Active x  80 x 4.1.1                  x openHAB Add-ons :: Bundles :: KNX Binding
282 x Active x  80 x 4.1.1                  x openHAB Add-ons :: Bundles :: MQTT Broker Binding
283 x Active x  81 x 4.1.1                  x openHAB Add-ons :: Bundles :: MQTT EspMilightHub
284 x Active x  81 x 4.1.1                  x openHAB Add-ons :: Bundles :: MQTT Things and Channels
285 x Active x  82 x 4.1.1                  x openHAB Add-ons :: Bundles :: MQTT HomeAssistant Convention
286 x Active x  82 x 4.1.1                  x openHAB Add-ons :: Bundles :: MQTT Homie Convention
287 x Active x  82 x 4.1.1                  x openHAB Add-ons :: Bundles :: MQTT Ruuvi Gateway
290 x Active x  80 x 4.1.1                  x openHAB Add-ons :: Bundles :: Netatmo Binding
291 x Active x  80 x 4.1.1                  x openHAB Add-ons :: Bundles :: Network Binding
292 x Active x  75 x 4.1.1                  x openHAB Add-ons :: Bundles :: Transformation Service :: Map
300 x Active x  80 x 4.1.1                  x openHAB Add-ons :: Bundles :: IO :: openHAB Cloud Connector
301 x Active x  80 x 4.1.1                  x openHAB Add-ons :: Bundles :: Persistence Service :: InfluxDB
openhab>

openhab.log error output

2024-02-10 10:16:49.754 [ERROR] [.influxdb.InfluxDBPersistenceService] - bundle org.openhab.persistence.influxdb:4.1.1 (301)[org.openhab.persistence.influxdb.InfluxDBPersistenceService] : Cannot register component
java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
	at java.lang.Thread.start0(Native Method) ~[?:?]
	at java.lang.Thread.start(Thread.java:809) ~[?:?]
	at java.util.Timer.<init>(Timer.java:188) ~[?:?]
	at org.apache.felix.scr.impl.ComponentRegistry.updateChangeCount(ComponentRegistry.java:735) ~[bundleFile:?]
	at org.apache.felix.scr.impl.ComponentRegistry.registerComponentHolder(ComponentRegistry.java:295) ~[bundleFile:?]
	at org.apache.felix.scr.impl.BundleComponentActivator.validateAndRegister(BundleComponentActivator.java:455) [bundleFile:?]
	at org.apache.felix.scr.impl.BundleComponentActivator.initialize(BundleComponentActivator.java:244) [bundleFile:?]
	at org.apache.felix.scr.impl.BundleComponentActivator.<init>(BundleComponentActivator.java:218) [bundleFile:?]
	at org.apache.felix.scr.impl.Activator.loadComponents(Activator.java:592) [bundleFile:?]
	at org.apache.felix.scr.impl.Activator.access$200(Activator.java:74) [bundleFile:?]
	at org.apache.felix.scr.impl.Activator$ScrExtension.start(Activator.java:460) [bundleFile:?]
	at org.apache.felix.scr.impl.AbstractExtender.createExtension(AbstractExtender.java:196) [bundleFile:?]
	at org.apache.felix.scr.impl.AbstractExtender.modifiedBundle(AbstractExtender.java:169) [bundleFile:?]
	at org.apache.felix.scr.impl.AbstractExtender.modifiedBundle(AbstractExtender.java:49) [bundleFile:?]
	at org.osgi.util.tracker.BundleTracker$Tracked.customizerModified(BundleTracker.java:488) [osgi.core-8.0.0.jar:?]
	at org.osgi.util.tracker.BundleTracker$Tracked.customizerModified(BundleTracker.java:420) [osgi.core-8.0.0.jar:?]
	at org.osgi.util.tracker.AbstractTracked.track(AbstractTracked.java:232) [osgi.core-8.0.0.jar:?]
	at org.osgi.util.tracker.BundleTracker$Tracked.bundleChanged(BundleTracker.java:450) [osgi.core-8.0.0.jar:?]
	at org.eclipse.osgi.internal.framework.BundleContextImpl.dispatchEvent(BundleContextImpl.java:949) [org.eclipse.osgi-3.18.0.jar:?]
	at org.eclipse.osgi.framework.eventmgr.EventManager.dispatchEvent(EventManager.java:234) [org.eclipse.osgi-3.18.0.jar:?]
	at org.eclipse.osgi.framework.eventmgr.ListenerQueue.dispatchEventSynchronous(ListenerQueue.java:151) [org.eclipse.osgi-3.18.0.jar:?]
	at org.eclipse.osgi.internal.framework.EquinoxEventPublisher.publishBundleEventPrivileged(EquinoxEventPublisher.java:229) [org.eclipse.osgi-3.18.0.jar:?]
	at org.eclipse.osgi.internal.framework.EquinoxEventPublisher.publishBundleEvent(EquinoxEventPublisher.java:138) [org.eclipse.osgi-3.18.0.jar:?]
	at org.eclipse.osgi.internal.framework.EquinoxEventPublisher.publishBundleEvent(EquinoxEventPublisher.java:130) [org.eclipse.osgi-3.18.0.jar:?]
	at org.eclipse.osgi.internal.framework.EquinoxContainerAdaptor.publishModuleEvent(EquinoxContainerAdaptor.java:217) [org.eclipse.osgi-3.18.0.jar:?]
	at org.eclipse.osgi.container.Module.publishEvent(Module.java:499) [org.eclipse.osgi-3.18.0.jar:?]
	at org.eclipse.osgi.container.Module.start(Module.java:486) [org.eclipse.osgi-3.18.0.jar:?]
	at org.eclipse.osgi.internal.framework.EquinoxBundle.start(EquinoxBundle.java:445) [org.eclipse.osgi-3.18.0.jar:?]
	at org.apache.karaf.bundle.command.Restart.doExecute(Restart.java:51) [bundleFile:?]
	at org.apache.karaf.bundle.command.BundlesCommand.execute(BundlesCommand.java:55) [bundleFile:?]
	at org.apache.karaf.shell.impl.action.command.ActionCommand.execute(ActionCommand.java:84) [bundleFile:4.4.4]
	at org.apache.karaf.shell.impl.console.osgi.secured.SecuredCommand.execute(SecuredCommand.java:68) [bundleFile:4.4.4]
	at org.apache.karaf.shell.impl.console.osgi.secured.SecuredCommand.execute(SecuredCommand.java:86) [bundleFile:4.4.4]
	at org.apache.felix.gogo.runtime.Closure.executeCmd(Closure.java:599) [bundleFile:4.4.4]
	at org.apache.felix.gogo.runtime.Closure.executeStatement(Closure.java:526) [bundleFile:4.4.4]
	at org.apache.felix.gogo.runtime.Closure.execute(Closure.java:415) [bundleFile:4.4.4]
	at org.apache.felix.gogo.runtime.Pipe.doCall(Pipe.java:416) [bundleFile:4.4.4]
	at org.apache.felix.gogo.runtime.Pipe.call(Pipe.java:229) [bundleFile:4.4.4]
	at org.apache.felix.gogo.runtime.Pipe.call(Pipe.java:59) [bundleFile:4.4.4]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:840) [?:?]

When you see OOM errors, the part of the code complaining is not always the root source of the problem. If Object A is leaking, you’ll get an error when Object B tries to acquire some memory. Consequently, no conclusion can be drawn as to the source of the memory leak based on the source of the OOM errors.

Java works as follows (simplified for clarity; it’s actually a bit more complicated in practice). It acquires a certain amount of memory when it starts. As it runs, it continues to acquire more memory until a maximum amount is reached. At that point it performs “garbage collection”, where it internally frees up some memory to be used elsewhere in the program. But it does not give that memory back to the OS.

So the typical memory usage chart for a Java program is going to show it start up with a minimum amount of memory, grow to the maximum, and then stay there.

OOM errors occur when the max memory is reached but there is nothing left to garbage collect.

You can control the minimum and maximum through command line arguments to the java command that starts OH, e.g. "-Xms400m -Xmx650m".

That’s why there is lots of free memory available. A Java program doesn’t grow to consume all available memory. It only grows up to a certain point then works with what it has. You can still get OOM errors even if there is tons of free memory.
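On an openHABian install, these flags typically live in /etc/default/openhab under EXTRA_JAVA_OPTS (the values below are illustrative examples, not recommendations; adjust to your system):

```shell
# /etc/default/openhab -- JVM memory flags for the openHAB service.
# -Xms sets the initial heap, -Xmx caps it; values here are examples only.
EXTRA_JAVA_OPTS="-Xms400m -Xmx650m"

# After editing, restart the service so the new limits take effect:
# sudo systemctl restart openhab.service
```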

Could be, but I doubt it, as there are lots of users of this binding and I’d expect more problems to be reported if that were the source. Though you might have hit some edge case no one has seen yet, so it cannot be ruled out.

It’s easy enough to test though. Disable or remove it and see if the memory leak goes away. If not, you know that’s not it and you can move on to the next binding. If so, you’ve found the culprit, narrowed down the source, and can open an issue.

openHAB, like all Java processes, allocates memory from the JVM heap rather than consuming RAM directly like a normal program would.

You can graph this usage with the usedHeapPercent channel in the systeminfo binding.
Because of garbage collection, you need to graph it very often to get a true picture of what it is doing. This places an extra load on your system, so it’s best not to leave it running full time like I do (I only do that to catch bugs in my bindings quicker).

Hi @rlkoshak,
thanks for the good explanation, really appreciate it.
I monitored further and I’m still “losing” memory every day. I did some further investigation and figured out that the anonymous mappings (anon) under the main openHAB Java process are increasing day by day, taking up all the memory.
Also interesting: the RES memory in top/htop is meanwhile higher than the Xmx in the /etc/default/openhab file, which is set to 768MB:

EXTRA_JAVA_OPTS="-Xms192m -Xmx768m -XX:+ExitOnOutOfMemoryError"

As of today, htop shows 785M RES and 1796M VIRT, both increasing day by day.

Using pmap <OH java process id>, I see the number of “anon” mappings increasing every day. After 1 week+, I count ~5000 anon mappings.
Example:

b9f10000     12K -----   [ anon ]
b9f13000    308K rw---   [ anon ]

I’m not an expert in this area, but from searching on Google I understood this is somehow related to a problem with threading?!
Any idea what could be wrong? My OH installation is really simple: it’s openHABian-based, with very few bindings (as per the previous bundle:list output) and ~5 very simple rules.
Thanks.
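For reference, each Java thread’s stack shows up as anonymous mappings in pmap (typically a small guard region plus the stack itself, much like the 12K/308K pair above), so growth in anon regions can be cross-checked against the JVM’s thread count. A quick sketch (the PID defaults to the current shell only so the commands run as-is; substitute the openHAB java process id in practice):

```shell
#!/bin/sh
# Cross-check anon mapping growth against the thread count of a process.
# PID defaults to the current shell so this sketch is runnable;
# replace it with the openHAB java process id in practice.
PID=$$

# nlwp = number of lightweight processes (threads) in the process
echo "threads: $(ps -o nlwp= -p "$PID")"

# count the [ anon ] regions pmap reports for the same process
echo "anon mappings: $(pmap "$PID" | grep -c '\[ anon \]')"
```

If the thread count grows roughly in step with the anon mappings, a thread leak is the likely cause.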

You are facing a thread leak. Over time your OH installation generates new threads without suspending/removing old ones, loses the ability to allocate new threads, and ends up with this error:

This is an out-of-memory error, however the message clearly indicates that the process cannot create more threads. Yet, to make things harder, it doesn’t mean that the InfluxDB add-on is guilty for sure. It is affected because it attempts to create a new thread, but this error might occur in any other place which attempts to start a new thread.
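To see which component is actually spawning the threads, a thread dump usually names the leaking pool. A sketch (the thread names below are fabricated purely for illustration; in practice dump the real process with `jstack <pid>`):

```shell
#!/bin/sh
# Group thread names from a jstack dump to spot a runaway pool.
# A real dump would come from: jstack <pid> > threads.txt
# The dump below is fabricated sample data so the pipeline runs as-is.
cat > threads.txt <<'EOF'
"influxdb-writer-1" #101 daemon prio=5
"influxdb-writer-2" #102 daemon prio=5
"influxdb-writer-3" #103 daemon prio=5
"main" #1 prio=5
EOF

# Extract each quoted thread name, strip numeric suffixes,
# and count how many threads share each base name.
grep '^"' threads.txt \
  | sed -E 's/^"([^"]+)".*/\1/' \
  | sed -E 's/-[0-9]+$//' \
  | sort | uniq -c | sort -rn
```

The base name with the highest count (here the fabricated "influxdb-writer") points at the pool that keeps growing between dumps.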

One thing which puzzles me: do you restart any of these add-ons periodically? Because the error stack trace you posted tells me you restarted the add-on.

Hi @splatch, no, I don’t. Usually OH runs for months without any intervention. When InfluxDB stopped receiving records, the first thing I tried was to restart the add-on; that’s probably why you see this in the log. But since it did not fix the issue, I finally restarted OH.

Any idea what could cause this behavior? This is a completely new OH 4.1 installation. I did not have the issue with the old installation, which had been running for a long time. The reason for the new install was that the old instance was based on an older OS, hence I decided to set up a completely new instance.

What is different between the two? Is it a pure reload of the backup into the new OS, or did you change other stuff? You need to work out what has changed, then eliminate each change one by one till you work out the cause. If it really is only InfluxDB, then stop using it to see if the problem goes away; then someone can look into why, once it is narrowed down and doubly confirmed.

I’ve been using Influx with OH 4.1.1 since I moved from 3.4.5 and have seen no symptoms of a memory leak.

I run only OH on my Pi and run Influx on a separate machine.

Hi @rlkoshak @matt1
It took me quite a while to isolate the issue, but I’m now confident it’s the InfluxDB package causing it. Please see the attached screenshot, with the red line showing the 2nd OH system.
I finally built a 2nd OH server from scratch (new install), which is now basically a 1:1 replica of my primary OH server (same hardware, software [now OH 4.1.2] and configuration). I started to add add-on by add-on, thing by thing, item by item, always monitoring how memory changes and behaves over time.
After a while, the configuration was 1:1 the same as my primary node, except for some rules files and the influxdb.persist file, which I had not added yet. The InfluxDB add-on was, however, already loaded, configured and running.
On Thursday the 25th, in the afternoon, I finally copied the influxdb.persist file from my primary node to the new node. Since then, memory consumption has increased continuously, exactly like on the primary node.
On Sunday the 29th, in the morning, I removed the influxdb.persist file again, and the memory increase stopped immediately.
The InfluxDB database (version 2) is hosted on a separate server on the same LAN, using regular TCP and an API key to authenticate. The same DB is used by other systems without any issue.
All changed/regular values sent by OH appear properly in InfluxDB, so no data is lost.

The persistence file contains only a few group items (all Netatmo), which point to a limited number of real items (~20). See details attached.

Any idea what’s wrong?

// persistence strategies have a name and a definition and are referred to in the "Items" section
Strategies {
        everyMinute : "0 * * * * ?"
        every5Minute : "0 0/5 * * * ?"
        everyHour : "0 0 * * * ?"
        everyDay  : "0 0 0 * * ?"

        // if no strategy is specified for an item entry below, the default list will be used
        default = everyChange
}

/*
 * Each line in this section defines for which item(s) which strategy(ies) should be applied.
 * You can list single items, use "*" for all items or "groupitem*" for all members of a group
 * item (excl. the group item itself).
 */
Items {
        // persist the group members below on every change and every 5 minutes
        Netatmo* : strategy = everyChange, every5Minute
        CO2* :  strategy = everyChange, every5Minute
        Humidity* : strategy = everyChange, every5Minute
        pressure* : strategy = everyChange, every5Minute
        noise* : strategy = everyChange, every5Minute
}


Number                  Netatmo_Indoor_Temperature      "Innentemperatur [%.1f %unit%]"         <temperature>   (Netatmo)       { channel = "netatmo:weather-station:*:temperature#value"}
Number                  Netatmo_Indoor_CO2              "Innen Home [%d ppm]"                   <co2>           (CO2)           { channel = "netatmo:weather-station:*:airquality#co2"}
Number                  Netatmo_whz_humidity_lbst69     "Humidity LBST69 Whz [%d %unit%]"       <humidity>      (Humidity)      { channel = "netatmo:weather-station:*:humidity#value" }
Number                  Netatmo_wzh_airpressure_lbst69  "Airpressure LBST69 Whz [%d mb]"        <barometer>     (pressure)      { channel = "netatmo:weather-station:*:pressure#value"}
Number                  Netatmo_wzh_noise_lbst69        "Noise LBST69 Whz [%d dB]"              <lautstaerke>   (noise)         { channel = "netatmo:weather-station:*:noise#value"}
Number                  Netatmo_Indoor_Temperature2     "Innen Jenny [%.1f %unit%]"             <temperature>   (Netatmo)       { channel = "netatmo:indoor:*:*:temperature#value"}
Number                  Netatmo_Indoor_CO22             "Innen Home 2[%d ppm]"                  <co2>           (CO2)           { channel = "netatmo:indoor:*:*:airquality#co2"}
Number                  Netatmo_jenny_humidity_lbst69   "Humidity LBST69 Jenny [%d %unit%]"     <humidity>      (Humidity)      { channel = "netatmo:indoor:*:*:humidity#value" }
Number                  Netatmo_Outdoor_Batt1           "Batterie Indoor Jenny [%d %%]"         <batterylevel>                  { channel = "netatmo:indoor:*:*:battery#value"}
Number                  Netatmo_Outdoor_Temperature     "Aussentemperatur [%.1f %unit%]"        <temperature>   (Netatmo)       { channel = "netatmo:outdoor:*:*:temperature#value", channel="knx:device:bridge:Suntracer:NAOutsideTemp" [profile="follow"]}
Number                  Netatmo_outdoor_humidity_lbst69 "Humidity LBST69 Outdoor [%d %unit%]"   <humidity>      (Humidity)      { channel = "netatmo:outdoor:*:*:humidity#value" }
Number                  Netatmo_Outdoor_Batt2           "Batterie Outdoor Home [%d %%]"         <batterylevel>                  { channel = "netatmo:outdoor:*:*:battery#value"}
Number:Speed            Netatmo_Wind_Strength           "Wind Geschwindigkeit [%.0f kmh]"       <wind>          (wind)          { channel = "netatmo:wind:*:*:wind#strength"}
Number:Speed            Netatmo_Wind_Gust               "Wind Peak [%.0f kmh]"                  <wind>          (wind)          { channel = "netatmo:wind:*:*:wind#gust-strength"}
Number                  Netatmo_Wind_Batt               "Batterie Windmesser [%d %%]"           <batterylevel>                  { channel = "netatmo:wind:*:*:battery#value"}
Number                  Netatmo_Indoor_Temperature3     "Innen Chalet [%.1f %unit%]"            <temperature>   (Netatmo)       { channel = "netatmo:weather-station:*:temperature#value"}
Number                  Netatmo_Indoor_CO23             "Innen Chalet [%d ppm]"                 <co2>           (CO2)           { channel = "netatmo:weather-station:*:airquality#co2"}
Number                  Netatmo_indoor_humidity_chalet  "Humidity Chalet Unten [%d %unit%]"     <humidity>      (Humidity)      { channel = "netatmo:weather-station:*:humidity#value" }
Number                  Netatmo_Indoor_Temperature4     "Innen Chalet 2 [%.1f %unit%]"          <temperature>   (Netatmo)       { channel = "netatmo:indoor:*:*:temperature#value"}
Number                  Netatmo_Indoor_CO24             "Innen Chalet 2 [%d ppm]"               <co2>           (CO2)           { channel = "netatmo:indoor:*:*:airquality#co2"}
Number                  Netatmo_indoor2_humidity_chalet "Humidity Chalet Oben [%d %unit%]"      <humidity>      (Humidity)      { channel = "netatmo:indoor:*:*:humidity#value" }
Number                  Netatmo_Outdoor_Batt4           "Batterie Indoor Chalet 2 [%d %%]"      <batterylevel>                  { channel = "netatmo:indoor:*:*:battery#value"}
Number                  Netatmo_Outdoor_Temperature2    "Aussen Chalet [%.1f %unit%]"           <temperature>   (Netatmo)       { channel = "netatmo:outdoor:*:*:temperature#value"}
Number                  Netatmo_outdoor_humidity_chalet "Humidity Chalet Outdoor [%d %unit%]"   <humidity>      (Humidity)      { channel = "netatmo:outdoor:*:*:humidity#value" }
Number                  Netatmo_Outdoor_Batt3           "Batterie Outdoor Chalet [%d %%]"       <batterylevel>                  { channel = "netatmo:outdoor:*:*:battery#value"}

Now that you have clearly identified the culprit of the memory leak, it would be best to open a GitHub issue to give the problem more visibility among OH maintainers and contributors.

What version of openHAB? There was a big update of the libraries that the InfluxDB binding uses for the latest milestone. You can download the jar from the JFrog server and try both the latest and the previous jars to see if there is a difference. If you uninstall the merged binding, it’s simple to drop different jar files into your addons folder to see if there is a difference.

Hi @matt1. I’m on release 4.1.2.
I see lots of packages on JFrog, but since I’m not really familiar with this, can you advise which package version I should try? I’m not sure I can install a 4.1.3 or 4.2 version on my 4.1.2 install.
For the install, can I just follow the instructions here (Installation of Add-ons | openHAB) to manually install it? Or is there a better way via the openHAB console?

You can download a jar that will work here:
https://www.pcmus.com/openhab/
Any of the 4.x jars will work, as it’s rare for compatibility issues to be introduced except when openHAB moves between major versions (e.g. from v3 to v4, or v4 to v5).

Up to you, but I feel the milestone versions are a great way to stay up to date if you’re actively working on your system. Up to you which way works best.

Edit: I should also state that your description of the issue does not make sense to me. Your InfluxDB is installed on a separate server on the network, yet you’re saying the free RAM goes down on the openHAB server. Java uses the heap and not RAM directly, so it’s not clear to me what your issue is, what’s happening, etc.

Free RAM always goes down in Linux; this is normal. Google what free vs. available RAM means under Linux. I would change the jar as a first step to see the effect.
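To illustrate the free-vs-available point: Linux deliberately uses idle RAM for disk cache and releases it on demand, so the “available” column is the meaningful one. A sketch (the column position assumes a recent procps `free`; older versions lack the available column):

```shell
#!/bin/sh
# "free" counts truly unused RAM; "available" estimates what could be
# reclaimed on demand (free + reclaimable cache) -- the number that matters.
free -m

# Pull out just the available figure from the Mem: row
# (field 7 on a recent procps "free").
free -m | awk '/^Mem:/ {print "available MiB:", $7}'
```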

I believe I am also seeing this memory leak in the InfluxDB plugin since upgrading to openHAB 4.1.2 from 3.4. It’s a slow leak under normal use, but if you, for example, ask openHAB to render a long-term graph with data from InfluxDB, the resident size just balloons instantly:

After startup, running normally:

openhab 14461 41.5 9.1 1064392 360916 ? Ssl 13:23 2:09 /usr/bin/java

After clicking on my solar feed-in graphs in the sitemap:

openhab 14461 49.9 23.5 1070736 924040 ? Ssl 13:23 3:19 /usr/bin/java

These graphs are just weekly, monthly and yearly sitemap graphs of the solar feed-in power obtained via InfluxDB.

The yearly graph never finishes rendering. Shortly after this, openHAB runs out of heap space and dies.

This will eventually happen slowly over time even if I don’t open the graph, but the graph seems to heavily accelerate the leak.

-PC

Don’t know if it’s related, but I have been having memory leak issues too.

I think mine is related to calling minimumSince too often in a rule that counts energy meters. But my issue seems to happen with either InfluxDB or rrd4j as the database. My temporary fix was to limit the rule to run at most every 10 seconds, instead of the ~5 times per second it would run otherwise.

Could be. I also have similar rules that update my solar consumption and generation numbers in InfluxDB every 5-10 seconds. Right now I feel v4 has some fairly critical issues that make it not ready for production use.