openHAB filling up memory and swap

Actually I was going to respond yesterday, but refrained from doing so because I wasn’t sure if it was just me.

That’s what they are doing so badly in their architecture. I had problems with the hypervisor as well, and I’m really wondering why they didn’t go with systemd’s process management capabilities.

They are also using AppArmor, probably because they are using Docker instead of Podman or others and need to start containers as the root user. AppArmor takes its time on startup to compile all application profiles.

I’m sure you can do better than hass.io, but as I said, I think it’s brave that they are trying. openHAB itself is first of all a monolith, but at the same time it is developed very modularly thanks to OSGi. Going from monolith to microservices is actually only one specification away: Tutorial: Building your first OSGi Remote Service - Eclipsepedia.
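
For illustration, here is a minimal sketch of what exporting a service via OSGi Remote Services looks like, roughly along the lines of that tutorial. The HelloService interface and the ECF provider config are illustrative assumptions, not actual openHAB code:

```java
// Hypothetical example: exporting an OSGi service for remote consumption.
// The two properties are standard OSGi Remote Services properties; the
// "ecf.generic.server" config assumes the Eclipse ECF generic provider.
import java.util.Hashtable;
import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;

public class Activator implements BundleActivator {

    public interface HelloService {   // made-up service interface
        String hello(String name);
    }

    @Override
    public void start(BundleContext context) {
        Hashtable<String, Object> props = new Hashtable<>();
        props.put("service.exported.interfaces", "*");               // export everything the service implements
        props.put("service.exported.configs", "ecf.generic.server"); // choose a distribution provider
        context.registerService(HelloService.class, name -> "Hello " + name, props);
    }

    @Override
    public void stop(BundleContext context) {
        // services registered through this context are unregistered automatically
    }
}
```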

Cheers, David

Actually, they need to use the unit-testing features of Python and a CI infrastructure so they actually test before release!

Java has a history of huge overhead. Other languages are much easier to use and just as powerful.

Java is old. The heads behind Java know that they need to change and adapt to today’s use cases, which is why Java 9 introduced the module system. More and more is being pulled out of the runtime into libraries. Java 11 simplified the language syntax, and that will continue.
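
As a small illustration of that module system, here is a minimal module-info.java; the module and package names are made up:

```java
// module-info.java — hypothetical module descriptor (Java 9+).
// Only declared dependencies are resolved, and only the exported
// package is visible to other modules; the rest stays encapsulated.
module org.example.binding {
    requires java.logging;            // depend only on the platform pieces you need
    exports org.example.binding.api;  // everything else is hidden
}
```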

I don’t think that a modern JVM will be any worse in a few years than the V8 JS engine or the Python runtime. JVM languages can be optimized far better than untyped JavaScript. Python as well as Java can be compiled ahead of time, providing native performance (still garbage collected, of course).

My conclusion is that the language is not the problem. It is the architecture that has been chosen here.

I do know that the UIs, both Paper & Habmin, are noticeably slow on my Pi 3B+. The HA Lovelace UI was much more responsive on the same hardware & basically the same configuration.

Not even that, because while there is a lot of overhead, you can still run it on (almost) any SBC, and it does NOT have the requirements (e.g. on scalability or redundancy) of today’s cloud services that microservice-like architectures are built for.
Memory leaks or other bugs, on the other hand, affect microservice architectures as well as monolithic ones.
And I prefer to have all bugs bundled inside a single monolith rather than scattered across many libs, containers, or daemons.

There are some crazy people running Kubernetes on RPis. But that’s not necessary for the absolute maximum of maybe 50 container processes (“addons”) that are targeted here. No container orchestration is necessary; a simple supervisor like systemd is enough.
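
For example, a single add-on container could be supervised by a plain systemd unit along these lines (unit and image names are made up):

```ini
# /etc/systemd/system/addon-demo.service — hypothetical sketch
[Unit]
Description=Demo add-on container
After=docker.service
Requires=docker.service

[Service]
ExecStartPre=-/usr/bin/docker rm -f addon-demo
ExecStart=/usr/bin/docker run --name addon-demo --rm example/addon-demo:latest
Restart=on-failure

[Install]
WantedBy=multi-user.target
```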

And why exactly? A simple tool like “top” tells you exactly which process is leaking in a “micro”-service world. Try that with the monolithic openHAB. The usual heap memory analyzers are almost worthless, because openHAB proxies calls (“safeCall”) all the time, especially when calling into bindings. That unfortunately removes valuable stack information.
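
To illustrate the effect (a toy, not openHAB’s actual SafeCaller implementation): once a call is dispatched indirectly onto a thread pool, a stack dump shows only executor frames, and the original caller is gone:

```java
// Toy demonstration of why indirected calls lose stack context.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SafeCallDemo {
    static void handleCommand() {
        // Prints the current stack: only thread-pool frames appear here,
        // not the code in main() that actually triggered the call.
        Thread.dumpStack();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.submit(SafeCallDemo::handleCommand).get(); // the indirection swallows the caller's frames
        pool.shutdown();
    }
}
```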

Granted, it’s easier to find memory leaks that way (although top just tells you there is a problem, not where exactly), but it adds inter-process communication overhead and work on proper bundling/orchestration for the end user. The required IPC is overhead and in turn may introduce new problems and may be part of existing ones. YMMV.

I think David’s point is that if each binding were running in its own process, then top would indeed tell us exactly where, at least to a higher degree of specificity than we have now. Right now we need to do a whole bunch of trial-and-error removal of bindings until the leak stops to discover which binding has the leak. If each binding ran in its own process, then top would tell you at least which binding has the leak. That’s a whole lot better than trial and error.

The IPC overhead would be greater, but would it be unacceptable? MQTT involves IPC (over the network, no less) and it seems perfectly performant in a home automation context. And I totally agree, the IPC will introduce new and interesting problems. TANSTAAFL.
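
For a sense of how small that kind of IPC is in code, here is a minimal sketch using the Eclipse Paho MQTT client (it assumes a broker such as Mosquitto on localhost; the topic name is made up):

```java
// Hypothetical example: one process publishing state over MQTT;
// any other local process subscribed to the topic receives it.
import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttException;

public class MqttIpcDemo {
    public static void main(String[] args) throws MqttException {
        MqttClient client = new MqttClient("tcp://localhost:1883", "binding-demo");
        client.connect();
        client.publish("binding/demo/state", "ON".getBytes(), 0, false); // QoS 0, not retained
        client.disconnect();
    }
}
```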

My main point above is that any change that will drastically and fundamentally change the overall architecture of OH is going to require:

  • convincing a super majority of the current maintainers
  • convincing a good portion of OH users, at least those who exist on this forum
  • being implemented in cooperation with both sets of people

Short of that, it could be the best idea in the world but the only way it will be implemented is via a fork, which I hope no one wants.

Sometimes the UI slowness of OH reminds me of trying the HotJava browser several years ago. It was (coincidentally?) written in Java too.

But that’s because the OH UIs are written in the old, abandoned AngularJS (Angular 1) JavaScript framework. Google’s engineers fundamentally changed Angular 2+ for a reason. New interfaces are already written, but a new core release is required first, i.e. a 2.5 release.

Addon developers most of the time do not produce efficient code, with issues ranging from wrong data-structure or algorithm choices to excessive and careless debugging statements, useless proxy methods, and over-engineered abstraction. At least a lot of the code that I have reviewed is not efficient.
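
As a made-up example of the data-structure point: testing membership against a List inside a loop is O(n·m), while indexing the values into a HashSet first makes each lookup O(1):

```java
// Illustrative only — not code from any actual binding.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LookupDemo {
    static long countKnown(List<String> known, List<String> incoming) {
        Set<String> index = new HashSet<>(known);   // build once: O(n)
        return incoming.stream()
                       .filter(index::contains)    // each lookup: O(1)
                       .count();
        // the slow variant: incoming.stream().filter(known::contains).count()
    }

    public static void main(String[] args) {
        System.out.println(countKnown(List.of("a", "b"), List.of("a", "c", "b")));
    }
}
```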

IPC is probably not even noticeable with such code, but it would add a layer of security and robustness.

I’m not so sure about that. I guess it’s up to an addon developer to opt in to “remote OSGi services”, and only the OH2 addon developers would need to agree on and enforce such a new way of deployment.

HotJava was never intended to be used like Chrome or Firefox. It was created to embed web pages in Java programs. I once used a wrench to hammer in a nail and it didn’t work very well; all wrenches must be really bad tools. I’ve used some slow-as-f@$& Python programs. Are all Python programs slow?

Java is not inherently slow. It is not inherently buggy. For a certain class of problems it is second to none in both performance and ease of writing code. There is a reason why it is the most used language in existence (if you believe TIOBE). Will it remain so? Probably not, but that doesn’t mean it’s slow and unusable.

At the time that OH was started, Java was IMHO the best choice among the alternatives for a project like this.

And as David and I pointed out, the slowness with the UIs had to do with JavaScript, not the underlying Java.

That’s fantastic to hear. I can’t wait to try them.

You mean “could”? I don’t think that it’s a guarantee.

So which of those binding developers will create the unified interface that hides all the details of the containers and such? Which ones will provide a marketplace where we can click on a binding and have it installed and ready to use? Which ones will provide the standardized IPC? Without all that, it’s not OH any more. It’s some other system that (maybe?) can use bindings written for OH. Without support from the core developers to make this happen, you are creating a new core.

So, like I said, it’d be a brand new project at best or a fork at worst.

I also had frequent crashes due to java running out of memory. I ended up assigning 8GB of RAM to the Ubuntu VM running OH (since I had some spare RAM on that server) and assigning 6GB to java. Even then OH still crashed.

What I did notice though is that the crashes occurred a few hours after I accessed PaperUI via Chrome. The longer I spent using the PaperUI the sooner the crash would occur.

Has anyone else experienced this?

There are periodic reports of memory leaks in OH, but rarely is there enough information to pinpoint the root cause of the problem. You have to go by process of elimination, removing add-ons and waiting to see if the memory problem goes away. Then you will have identified the culprit and can file an issue.
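
A rough sketch of what that elimination can look like from the openHAB (Karaf) console; the binding name is just an example, and add-ons managed through addons.cfg or PaperUI may need to be removed there instead:

```
# open the console with: openhab-cli console
feature:list | grep openhab-binding       # see which bindings are installed
feature:uninstall openhab-binding-zwave   # remove one suspect, then watch memory
```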

Personally, I have seen cases where Grafana, running on the same machine as openHAB and configured to generate static versions of charts, will run amok and consume all available CPU and memory until eventually the OS kills openHAB.

The command top, with the output sorted by memory usage (press Shift+M in most Linux versions of top), should show what the top memory-using process is (press q or Ctrl-C to exit).

The problem I see with that approach is that removing utility from the home automation system here causes a fair amount of consternation in the other members of the household and detracts from the OH acceptance factor. I have a stable system here thanks to a scheduled OH restart every night. If I now remove a load of addons, the attitude to OH in the house will change to, “that useless system dad’s always tinkering with that doesn’t actually do anything useful.”

Understood, this is why it is difficult to track down. Any morsels, like your PaperUI suspicion, are always useful.

In the Grafana example, the solution might be to run Grafana on a separate server.

Maybe providing a set of steps for users to go through when their OH system crashes would be useful. Something like: add a post to a GitHub issue with the following data:

  1. Attach a capture of the Java stack dump (see the commands sketched below)
  2. List the addons that were installed at the time of the crash
  3. (Other stuff that developers would find useful)
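
For step 1, something like the following should work with a stock JDK; <pid> is the openHAB Java process id (e.g. from pgrep -f openhab):

```
jstack <pid> > threads.txt                       # thread/stack dump
jmap -dump:live,format=b,file=heap.hprof <pid>   # heap dump, for memory analysis
```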

Maybe then we’ll see some commonality amongst the addons or any of the other characteristics that people post in the GitHub issue?

I’ve tried that; I moved MySQL and Grafana to a separate VM, but it didn’t help. Well, it didn’t help with the crashes. However, it did help when I rebuilt my OH server. I did an openhab-cli backup on the old VM and an openhab-cli restore on the new one, and all my existing persistence data and Grafana data was still available. Magic.
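
For reference, that round-trip is just the following (the file path is an example):

```
sudo openhab-cli backup /srv/backup/oh-backup.zip    # on the old VM
sudo openhab-cli restore /srv/backup/oh-backup.zip   # on the new VM
```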

A separate VM on the same host might not help since all the VMs share the same physical memory.