openHAB 3 runs out of memory / java heap space errors, CPU 100%+ after a few hours

mhilbush · February 15, 2021, 12:53pm

@morph166955 After installing your bundle, the problem still occurs.

Installing your bundle freed up the memory that was being held somewhere. But enabling the bad rule still resulted in a continuous memory increase.

morph166955 · February 15, 2021, 1:02pm

That’s very interesting. It also gives me some direction: restarting org.openhab.core.model.script.runtime appears to have freed the memory, which I hope means the leak is somewhere in there (or in something associated with it). Thank you!

EDIT: For anyone who is following all of this, if you get to a point where you see the memory increasing but the system has not had an OOM crash, what happens when you do bundle:restart on org.openhab.core.model.script.runtime inside the karaf console? Does the memory release?

mhilbush · February 15, 2021, 3:37pm

BTW, I see why you get the stack trace and I don’t. You must have debug enabled on the bundle.

github.com

openhab/openhab-core/blob/127724c0e31ad8abf0855d87fbf96204267a39be/bundles/org.openhab.core.automation.module.script/src/main/java/org/openhab/core/automation/module/script/internal/handler/ScriptActionHandler.java#L66


    public @Nullable Map<String, Object> execute(final Map<String, Object> context) {
        Map<String, Object> resultMap = new HashMap<>();
        getScriptEngine().ifPresent(scriptEngine -> {
            setExecutionContext(scriptEngine, context);
            try {
                Object result = scriptEngine.eval(script);
                resultMap.put("result", result);
            } catch (ScriptException e) {
                logger.error("Script execution of rule with UID '{}' failed: {}", ruleUID, e.getMessage(),
                        logger.isDebugEnabled() ? e : null);
            }
        });
        return resultMap;
    }
}
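To illustrate why the stack trace only shows up with debug enabled, here is a minimal stand-alone sketch of the same pattern. It is not the openHAB code: it uses java.util.logging as a stand-in for the SLF4J logger, and the names ConditionalStackTrace, CapturingHandler and logFailure are made up for this example. The point is only that the Throwable is passed to the logger when debug is on and null otherwise, so without debug the log call receives the message alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

class ConditionalStackTrace {

    /** Hypothetical helper: collects log records in memory so they can be inspected. */
    static final class CapturingHandler extends Handler {
        final List<LogRecord> records = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            records.add(record);
        }

        @Override
        public void flush() {
            // nothing is buffered
        }

        @Override
        public void close() {
            // no resources to release
        }
    }

    /**
     * Logs a failure and returns the resulting record. The Throwable is attached
     * only when debugEnabled is true, mirroring the
     * "logger.isDebugEnabled() ? e : null" argument in the excerpt above.
     */
    static LogRecord logFailure(String ruleUID, Exception e, boolean debugEnabled) {
        Logger logger = Logger.getAnonymousLogger();
        logger.setUseParentHandlers(false); // keep the demo output off the console
        CapturingHandler handler = new CapturingHandler();
        logger.addHandler(handler);

        String message = "Script execution of rule with UID '" + ruleUID + "' failed: " + e.getMessage();
        logger.log(Level.SEVERE, message, debugEnabled ? e : null);

        return handler.records.get(0);
    }

    public static void main(String[] args) {
        Exception failure = new IllegalStateException("boom");
        // Without debug the record carries no Throwable, so no stack trace can be printed.
        System.out.println("debug off, throwable attached: " + (logFailure("demo", failure, false).getThrown() != null));
        // With debug the Throwable travels with the record and a formatter can print its stack trace.
        System.out.println("debug on, throwable attached: " + (logFailure("demo", failure, true).getThrown() != null));
    }
}
```

So two installations can log the same error line, and only the one with debug enabled on that bundle gets the stack trace along with it.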

Seaside · February 16, 2021, 8:16pm

Good work by @mhilbush and @morph166955 trying to fix this issue. I would like to see a comment from one of the maintainers on this problem; it seems to be a big issue for a lot of people.

Regards, S

mhilbush · February 16, 2021, 9:00pm

You can see the latest status on the issue here.

mhilbush · February 20, 2021, 5:36pm

I submitted a fix for the invalid UI DSL rule memory leak. I’m aware of one other memory leak, but it occurs much less frequently, so should not be as critical. I’ll open a separate bug report for that issue shortly.

DanielMalmgren · February 20, 2021, 7:06pm

Looks like there’s work ongoing for most of these problems. Is there any more info needed?

My memory usage for the last week looks like this:

(I started measuring this on Monday and restarted OH shortly thereafter.)
I’m planning to restart OH again soon to get more free RAM. Is there anything I could check before doing that to provide more information about what’s leaking? I’m on 3.0.1.

morph166955 · February 20, 2021, 8:10pm

Looks like it was committed and @Kai has cherry picked it for the next 3.0.x release.

Sascha_ · February 21, 2021, 11:54am

I deployed snapshot S2217 to my production system today at 10:33am, and memory still looks good to me; the graph is only slightly increasing:

Thank you all very much for this

mhilbush · February 21, 2021, 1:31pm

I opened a separate issue for another memory leak. This one is much less severe than the one caused by an invalid rule. I’m not clear yet on how this one should be fixed.

morph166955 · February 21, 2021, 9:11pm

For anyone looking to test this, 3.1.0-S2216 and beyond should include the fix committed yesterday. I’d be very curious to know if someone wants to upgrade to the snapshot and see if it fixes the issues.

Schnuecks · February 22, 2021, 7:45am

I set up my Pi 4 with the latest snapshot and it seems to solve the issue. I ran it for a few hours with the rules that break the other one, and it runs at normal memory usage with no errors so far. Thanks to all.

johannesbonn · February 22, 2021, 9:04am

Is the latest snapshot stable enough to use on my production system? Currently I am using 3.1m1 without problems, apart from the heap space issue. Do you have any problems with the snapshot build?

Schnuecks · February 22, 2021, 9:23am

I set it up as a test system. I personally wouldn’t use a snapshot in production, but if it solves a more or less critical memory problem, it’s worth a shot.

mhilbush · February 22, 2021, 11:43am

I would never recommend using snapshot releases, but I will share my own personal experience.

FWIW, I’ve been using snapshot versions for the past 5 years (2.x and 3.x) on two production systems. In fact, I’ve never used anything other than a snapshot release. I don’t like waiting for bug fixes. And I always take a backup before upgrading, so there’s always a fallback. You do have to be a bit careful about which snapshots you use, as some of them (in practice, a very small percentage) have severe problems.

morph166955 · February 24, 2021, 10:42pm

I agree with @mhilbush on normally never running snapshots but making exceptions here. I’ve been running the OH3 snapshot since it became “mostly stable” months ago (pre-release). On the whole I really don’t have any issues with it. I have had one or two cases where a commit caused problems that no one caught that day: I loaded the snapshot, had a horrible result, and just backed it out. That said, that’s rare. One thing to definitely note though: this isn’t a 3.0.x-SNAPSHOT, this is a 3.1-SNAPSHOT. So you are pulling in a whole bunch of new stuff, and you need to be aware of any conflicts that may not have been formally documented yet. Just back up your system first.

Andrew_Rowe · February 25, 2021, 1:10am

Thanks Mark and Morph for running this down and finding a fix. I followed the progress on GitHub, and this was a nasty bug.

DanielMalmgren · February 25, 2021, 6:02am

I’d like to hold off on this, so I’m still on 3.0.1. From what I understand, the plan is to release a 3.0.2 with these critical fixes? Until then, my system works fine if I just keep an eye on the free memory and restart OH when it’s getting full, which has been around once a week for a while now.
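For anyone who wants to do the same kind of watching with numbers rather than a graph, the check boils down to used heap divided by maximum heap. Below is a rough sketch using the standard java.lang.management API. Big assumption: as written it reports the heap of the JVM it runs in, so to watch openHAB itself it would have to run inside that JVM or read the same values over a JMX connection. The class name HeapWatch, the helper names and the 90% threshold are made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

class HeapWatch {

    /** Returns used heap as a percentage of the maximum heap, or -1 if no maximum is defined. */
    static double usedHeapPercent(MemoryUsage heap) {
        long max = heap.getMax(); // -1 when the JVM reports no defined maximum
        if (max <= 0) {
            return -1.0;
        }
        return 100.0 * heap.getUsed() / max;
    }

    /** True when usage is at or above the given threshold percentage. */
    static boolean shouldRestart(double usedPercent, double thresholdPercent) {
        return usedPercent >= thresholdPercent;
    }

    public static void main(String[] args) {
        // Reads the heap of THIS JVM, not of a separately running openHAB process.
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        double percent = usedHeapPercent(heap);

        System.out.printf("heap used: %d MiB, %.1f%% of max%n", heap.getUsed() / (1024 * 1024), percent);
        if (shouldRestart(percent, 90.0)) { // 90% is an arbitrary example threshold
            System.out.println("heap is nearly full; a restart may be due");
        }
    }
}
```

A single reading says little because the garbage collector makes the value jump around; it is the trend over hours or days that shows a leak.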

dikay1969 · March 12, 2021, 3:35pm

Raspbian GNU/Linux 10 (buster)
Linux 5.10.17-v7l+ x86
Raspberry Pi 4 Model B Rev 1.4 8GB
Openhab 3.1.0-2253

For what feels like the last 50 snapshots, openHAB 3.1.x has become more and more sluggish after running for approx. 8 hours, and then stops working completely.
Logging is still active.
That means HABPanel, HappApp and also the DSL rules are no longer processed.
After a restart, OH runs perfectly again for about 6-8 hours.

The first warning messages in the logbook are:
2021-03-08 06:52:41.939 [WARN] [ab.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.openhab.core.internal.items.ItemUpdater@a0ac64' takes more than 5000ms.
2021-03-08 06:55:39.494 [WARN] [ab.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.openhab.core.internal.items.ItemUpdater@a0ac64' takes more than 5000ms.
2021-03-08 06:58:23.832 [WARN] [ab.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.openhab.core.internal.items.ItemUpdater@a0ac64' takes more than 5000ms.
2021-03-08 06:58:46.834 [WARN] [ab.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.openhab.core.internal.items.ItemUpdater@a0ac64' takes more than 5000ms.
2021-03-08 06:59:20.055 [WARN] [ab.core.internal.events.EventHandler] - Dispatching event to subscriber 'org.openhab.core.internal.items.ItemUpdater@a0ac64' takes more than 5000ms.

Then followed by timeouts of all bindings and rules failures, etc.
Then “OutOfMemoryError: Java heap space” can also be found in the log!

I haven’t made any major changes to the system in weeks.

I only switched from 3.0.0/1 to the 3.1.x snapshots because of the “Java Heap Space” problem with DSL rules via MainUI in OH3.0, since unfortunately the problem was only fixed there.

I can’t find the problem; it must be due to a significant change in the OH3.1.x snapshots.

I hope this bug will not make it into the 3.1 final release.

Unfortunately, I can no longer say exactly which snapshot version this problem started with.

Can someone help me or confirm a similar problem?

dikay1969 · March 14, 2021, 2:49pm

To everyone who has problems that no one else has, that cannot be understood, and that no developer can explain:

The error described was eliminated by a complete reinstallation (new image installed and backup imported).

This means that the error was not due to the snapshots mentioned above, but to the surrounding environment (Linux, Java, etc.), possibly as a result of file errors.

→ Just as a tip: if in doubt, always reinstall the entire openHAB environment first to rule out such errors!

A wish to the openHAB developers: couldn’t openHAB check its environment at startup to see whether everything around it is OK?