I have been thinking of building this binding for a while. It is designed to help diagnose issues on the system of someone who does not know Linux or Java and has no idea what free vs. available RAM means, let alone the Java heap. Install the binding and get a system check-up. The System Info binding requires you to know what to look at and how to interpret the results; this binding will hopefully get better over time at finding issues in an automated way. If you have ideas on what it can check for, feel free to post.
If you enjoy the binding, please consider sponsoring or leaving a one-off tip as a thank you via the links. This allows me to purchase software and hardware so I can contribute more bindings. Also, some coffee to keep me coding faster never hurts.
Paypal can also be used via matt A-T pcmus D-O-T C-O-M
Features
Binding already helps to warn in your logs when:
CPU overheats
Heap is wrongly sized
Heap is growing and not shrinking back when garbage collections are done. Out-of-memory errors (OOME) and memory leaks should get detected early.
RAM is full or getting close to 100% full, to warn you that something needs to be looked at.
Detects when Raspberry Pi power supply or cable is not good enough.
Allows you to graph the heap after it is first cleaned by the garbage collector.
Not yet implemented, but planned as possible additions:
Raspberry Pi power supply is not good enough (added)
Swap file is getting used a lot or runs out of space
Zram checks
Watch for continually growing number of Processes and Threads
Check addon jar files are all the same version as openHAB
Example log output in DEBUG
2024-03-16 06:04:40.650 [INFO ] [.thedoctor.internal.TheDoctorHandler] - Will include health checks for your:Raspberry Pi 3 Model B Plus Rev 1.3
2024-03-16 06:04:40.654 [DEBUG] [.thedoctor.internal.TheDoctorHandler] - GOOD: Pi is not reporting any current throttle conditions.
2024-03-16 06:04:40.708 [DEBUG] [.thedoctor.internal.TheDoctorHandler] - GOOD: Heap is only 24% full, and ranges from 0% to 24%
2024-03-16 06:04:40.710 [DEBUG] [.thedoctor.internal.TheDoctorHandler] - GOOD: RAM is 46% full
2024-03-16 06:04:40.733 [DEBUG] [.thedoctor.internal.TheDoctorHandler] - GOOD: CPU temperature is 52.078c
2024-03-16 06:05:56.668 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : Full Pi throttle code is 80008
2024-03-16 06:05:56.670 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : Pi, Soft temperature limit active
2024-03-16 06:05:56.671 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : Your Pi's power supply or cable was not good enough to supply power without an under-voltage event occuring.
2024-03-16 06:05:56.672 [DEBUG] [.thedoctor.internal.TheDoctorHandler] - GOOD: Heap is only 18% full, and ranges from 18% to 28%
2024-03-16 06:05:56.676 [DEBUG] [.thedoctor.internal.TheDoctorHandler] - GOOD: RAM is 67% full
2024-03-16 06:05:56.679 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : CPU temperature is 61.762c and may cause instability. Do you have a heatsink and fan?
2024-03-16 06:06:11.683 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : Full Pi throttle code is 80000
2024-03-16 06:06:11.684 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : Your Pi's power supply or cable was not good enough to supply power without an under-voltage event occuring.
Changelog
Version 0.3
Fixed bug in reading the raspberry Pi throttle codes.
Changes to logging to give less output.
Version 0.2
Added Raspberry Pi power supply and throttle code checks
Adjusted heap detection a little based on real-world testing of a new Pi setup with defaults.
I really like the idea of this. Would everything work in a container or is this only going to work in a bare metal install? I imagine the Java stuff should work but wonder about swap/CPU temp/etc.
EDIT: It appears that CPU temp is not available, but I shouldn't be surprised by that since I'm running in a Docker container running on a VM. I don't even know if the VM has access to CPU temp. RAM is available though.
I'm sure it's already on the roadmap, but having Channels for each stat that represent the GOOD/BAD status of the stat would be great! Then it can be used to generate an alert using the mechanism of the user's choice.
Would it be allowed for an add-on to report internal OH stuff like orphaned links, orphaned Item metadata, and stuff like that? Of course, if you don't want to make it an official add-on, who cares, but if that is your end goal I don't know if that sort of thing would be allowed. But they can be quite useful things to report. A lot of orphaned stuff like that can be an indication of a problem.
One minor bug to report: the add-on settings do not let me change the logging level.
I lack experience with containers, so I had no idea; thanks for confirming. My thinking was that if you run containers you have a little more knowledge. However, this is probably wrong, as I am looking to purchase a newer router/firewall that has tons of power and could run openHAB in its built-in container support. New users could take this route to use spare power in a device they already pay money to run.
I was thinking to keep it simple and to have a CSV error code channel, then an example rule to read the channel's state and send/push a message.
Now that I know about the built-in health check feature, it kind of makes sense to take an approach like this:
System Info binding if you want to graph and set up gauges.
Main UI health check for as much as possible that is low stress on the system. @seime, perhaps a SCAN button that a user can press to do more stressful testing on demand?
This binding that can do more aggressive checks that may not get approved to be merged into the core.
To explain what I doubt would get put into a built in health checker would be the graph this binding has that triggers a Garbage Collection every 15 minutes to give a cleaned up heap value. This will halt everything running in Java whilst it does a clean up which is very fast, but it may have side effects for time sensitive code. The binding should be low impact, but I would probably limit its use to maybe a week after you install a new openHAB version and then disable it with the pause button when you make a change you wish to check for memory leaks. I may add a channel to turn this on and off, or a config.
The idea behind this is that you can get a baseline heap value, install a binding, use it for half an hour and then uninstall it, and the heap should go back to the original value if there are no leaks. I may look at improving it to state in MBytes how much it has grown.
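To make the baseline idea concrete, here is a minimal sketch of how a cleaned heap percentage could be read with the standard Java management API. It is not the binding's actual code; the class and method names (`CleanedHeap`, `cleanedHeapPercent`, `growthInMegabytes`) are made up for illustration, and `System.gc()` is only a request that the JVM is free to ignore.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

class CleanedHeap {
    /** Requests a garbage collection, then returns the used heap as a percentage. */
    static int cleanedHeapPercent() {
        System.gc(); // only a hint; the JVM may decide not to collect
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when no maximum is defined
        if (max <= 0) {
            max = heap.getCommitted(); // fall back to the committed size
        }
        return (int) (heap.getUsed() * 100 / max);
    }

    /** Difference between two used-heap readings, in whole megabytes. */
    static long growthInMegabytes(long baselineBytes, long currentBytes) {
        return (currentBytes - baselineBytes) / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Heap after GC request: " + cleanedHeapPercent() + "%");
    }
}
```

Taking one reading before installing a binding and another after uninstalling it gives the two values to compare.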
Sadly I help far too many people on this forum who have no business using containers because it's what's supported or someone on the Internet told them it was the best idea since sliced bread.
I wonder if it's possible to detect when a stat like CPU temp is not available and keep the Item NULL in that case.
I'm generally not a fan of forcing the user to have to write a rule like that. If the user wants to show this on their UI (MainUI or Sitemaps), send an alert, or really do anything else with it, that would require a rule or at least some transformation profiles.
If you don't break it up though, at least use JSON or XML so that it's easier to parse out the desired value through a standard transform profile. It will be really hard for a lot of users if they have to use REGEX or a script transform to pull the data out, even with examples.
I believe they should then use the System info binding and set it up to work how they want. They then get the choice of setting the threshold of exactly when the CPU temp is too high. The idea of this binding is for someone that does not know what to measure, nor what an acceptable threshold is, they just want to know what to ask for help on in the forum after first searching for the ERROR the logs are giving.
I was thinking along the lines of a very basic single String channel called something like Fault, which would probably only contain one fault in normal use, but would grow if more than one error occurs. You can just send the String and it is clear what the error is.
Overheating,MemoryLeak,PowerSupply
I probably used the wrong term with "error code", as it was more of a fault condition. Hopefully the above string makes it clearer.
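As a rough sketch of the single Fault string idea, the snippet below joins active faults into a comma-separated state and checks it for one fault. The three fault names come from the example string above; the `FaultChannel` class, the enum, and the method names are hypothetical and not part of the binding.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Set;
import java.util.stream.Collectors;

class FaultChannel {
    // Fault names taken from the example string; the enum itself is illustrative.
    enum Fault { Overheating, MemoryLeak, PowerSupply }

    /** Joins the active faults into one comma-separated channel state. */
    static String toChannelState(Set<Fault> faults) {
        return faults.stream().map(Fault::name).collect(Collectors.joining(","));
    }

    /** The simple "if contains XXXX" check a rule could apply to the state. */
    static boolean contains(String state, Fault fault) {
        return Arrays.asList(state.split(",")).contains(fault.name());
    }

    public static void main(String[] args) {
        String state = toChannelState(EnumSet.of(Fault.Overheating, Fault.PowerSupply));
        System.out.println(state); // Overheating,PowerSupply
        System.out.println(contains(state, Fault.MemoryLeak)); // false
    }
}
```

Splitting on the comma before comparing avoids false matches if one fault name ever becomes a prefix of another.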
Great, do you mind giving an example of what would be best so I can adopt it? I prefer JSON as I am used to using the gson lib. Using transforms is not my strong point, hence the wish to use a simple plain text fault that can be sent as is, or a simple "if contains XXXX" then...
This would be fine to add if it was just an ON/OFF switch representing bad/good. I just don't love the idea of having 30+ Switch items that grow over time as new features get added, when a single channel as described above is more SIMPLE. I will have to consider what makes the most sense, as a pure LOG output is not enough: people will leave it running and then never check the logs, so a way to send a notification that they need to look at the logs in more detail is what I am after.
I really do not care whether it gets merged. This is more about making it useful and not having to give a user a lecture on what free and available RAM is, and that a memory leak has nothing to do with graphing used RAM with the System Info binding. It's for someone who complains they have crashes weekly, and then this binding tells them that their Raspberry Pi power supply is not capable and is causing stability issues. The goal is always to create something that can be merged, but if the binding can be more useful by breaking rules, then that makes more sense if more people get a working system and stop blaming openHAB as being unstable for some reason they cannot diagnose.
Nice one, thanks.
Given the number of people who use Raspis, openHABian and, by default, ZRAM with it, would you consider double-checking that, too?
disksize vs. filling level, zram mem-used vs. zram mem-limit
You can use the exec binding, or check the sources to possibly find some Java-level means of access.
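For the two ZRAM comparisons mentioned, a sketch like the following could work from Java without the exec binding. It assumes the kernel's documented zram sysfs layout, where `/sys/block/zram0/mm_stat` is a single whitespace-separated line whose first, third and fourth fields are `orig_data_size`, `mem_used_total` and `mem_limit`; please verify this against your kernel before relying on it. The `ZramStats` class is hypothetical and takes the text as a parameter so it can be tried without a zram device.

```java
class ZramStats {
    final long origDataSize; // uncompressed bytes stored in the device
    final long memUsedTotal; // bytes of RAM used by zram
    final long memLimit;     // 0 means no limit is set

    ZramStats(long origDataSize, long memUsedTotal, long memLimit) {
        this.origDataSize = origDataSize;
        this.memUsedTotal = memUsedTotal;
        this.memLimit = memLimit;
    }

    /** Parses one mm_stat line (field order is an assumption, see above). */
    static ZramStats parse(String mmStat) {
        String[] fields = mmStat.trim().split("\\s+");
        return new ZramStats(Long.parseLong(fields[0]), Long.parseLong(fields[2]),
                Long.parseLong(fields[3]));
    }

    /** Returns used/limit as a percentage, or -1 when there is no limit. */
    static int percent(long used, long limit) {
        return limit <= 0 ? -1 : (int) (used * 100 / limit);
    }

    public static void main(String[] args) {
        // Sample values for illustration only, not taken from a real device.
        ZramStats stats = parse("1048576 262144 524288 1048576 524288 0 0 0");
        long diskSize = 4194304L; // would come from /sys/block/zram0/disksize
        System.out.println("Fill level: " + percent(stats.origDataSize, diskSize) + "%");
        System.out.println("Memory used vs limit: " + percent(stats.memUsedTotal, stats.memLimit) + "%");
    }
}
```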
I'd vote for JSON, as JSONPATH is easier to work with in OH than XPath is. I'm not sure what kind of example I can give though. If it's a relatively flat JSON:
{ "prop1": "one",
  "prop2": "two",
  "prop3": "three" }
The JSONPATH for the second property would be a simple JSONPATH('$.prop2').
Simplest for the developer for sure. Simplest for the users? I'm not so sure. But you are the developer so all I can do is make suggestions. From the end user, there can be a hundred Channels but they can choose which one(s) they care about and ignore the rest. Putting everything in one Channel means they have to deal with every possible value, even if they only care about one.
That's what I assumed the ultimate goal was. But what if a user only cares about CPU temp and nothing else? Do they need to deal with alerts for everything else? Those are the sorts of usability things that come to my mind.
Unless of course there is more than just a summary of stats. If there is more, like a summary of findings (e.g. "Doctor binding sees your system load is high, RAM utilization is high, and SWAP is in use. Your machine doesn't have enough RAM!") maybe separating each stat isn't all that important. If the binding can process all the stats and come up with a recommended course of action then it doesn't necessarily need to report most of the stats at all.
You can use the log, but I am currently trying to implement it using the Linux command below, which returns a hex code:
vcgencmd get_throttled
It should return throttled=0x0
There is also another method cat "/sys/devices/platform/soc/soc:firmware/get_throttled"
| Bit | Meaning |
|:---:|---------|
| 0 | Under-voltage detected |
| 1 | Arm frequency capped |
| 2 | Currently throttled |
| 3 | Soft temperature limit active |
| 16 | Under-voltage has occurred |
| 17 | Arm frequency capped has occurred |
| 18 | Throttling has occurred |
| 19 | Soft temperature limit has occurred |
I get throttled=0x80000 returned.
Translated to binary, this is 1000 0000 0000 0000 0000, so only bit 19 is set.
This means that my Pi 3 has hit the soft temperature limit at some point, as it booted up with no heat sink attached.
Thanks for the suggestion, I'll look into it as I find time.
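The bit table can be turned into a small decoder. This is only a sketch of the technique, with the bit meanings copied from the table above; the `ThrottleDecoder` class is a made-up name and not the binding's implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ThrottleDecoder {
    // Bit positions and meanings as listed in the table above.
    private static final Map<Integer, String> FLAGS = new LinkedHashMap<>();
    static {
        FLAGS.put(0, "Under-voltage detected");
        FLAGS.put(1, "Arm frequency capped");
        FLAGS.put(2, "Currently throttled");
        FLAGS.put(3, "Soft temperature limit active");
        FLAGS.put(16, "Under-voltage has occurred");
        FLAGS.put(17, "Arm frequency capped has occurred");
        FLAGS.put(18, "Throttling has occurred");
        FLAGS.put(19, "Soft temperature limit has occurred");
    }

    /** Returns the meaning of every flag bit that is set in the code. */
    static List<String> decode(int code) {
        List<String> active = new ArrayList<>();
        for (Map.Entry<Integer, String> flag : FLAGS.entrySet()) {
            if ((code & (1 << flag.getKey())) != 0) {
                active.add(flag.getValue());
            }
        }
        return active;
    }

    public static void main(String[] args) {
        System.out.println(decode(0x80000)); // [Soft temperature limit has occurred]
        System.out.println(decode(0x0));     // []
    }
}
```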
Just added a new version that will check the Raspberry Pi range for power supplies that are not handling the voltage/current requirements and looks for freq and heat throttling from not having sufficient cooling on the CPU.
I have considered it, and feel that is the wrong place to add what I am wanting to achieve. I have addressed why in the posts above.
Is it normal to see this message repeating? This seems like it is worth 1 warning, but I'm not convinced it needs to come out every half hour.
OH 4.1.2
2024-04-18 10:53:35.125 [WARN ] [.thedoctor.internal.TheDoctorHandler] - Heap has increased from 20 to 25 and may indicate a memory leak if this number keeps growing. This binding has a channel you can use to watch the heap with.
2024-04-18 11:23:35.648 [WARN ] [.thedoctor.internal.TheDoctorHandler] - Heap has increased from 20 to 26 and may indicate a memory leak if this number keeps growing. This binding has a channel you can use to watch the heap with.
2024-04-18 11:53:36.174 [WARN ] [.thedoctor.internal.TheDoctorHandler] - Heap has increased from 20 to 26 and may indicate a memory leak if this number keeps growing. This binding has a channel you can use to watch the heap with.
2024-04-18 12:23:36.680 [WARN ] [.thedoctor.internal.TheDoctorHandler] - Heap has increased from 20 to 26 and may indicate a memory leak if this number keeps growing. This binding has a channel you can use to watch the heap with.
2024-04-18 12:53:37.196 [WARN ] [.thedoctor.internal.TheDoctorHandler] - Heap has increased from 20 to 26 and may indicate a memory leak if this number keeps growing. This binding has a channel you can use to watch the heap with.
2024-04-18 13:23:37.710 [WARN ] [.thedoctor.internal.TheDoctorHandler] - Heap has increased from 20 to 26 and may indicate a memory leak if this number keeps growing. This binding has a channel you can use to watch the heap with.
2024-04-18 13:53:38.228 [WARN ] [.thedoctor.internal.TheDoctorHandler] - Heap has increased from 20 to 26 and may indicate a memory leak if this number keeps growing. This binding has a channel you can use to watch the heap with.
2024-04-18 14:23:38.749 [WARN ] [.thedoctor.internal.TheDoctorHandler] - Heap has increased from 20 to 26 and may indicate a memory leak if this number keeps growing. This binding has a channel you can use to watch the heap with.
I see the same in my logs. As the heap is garbage collected it shrinks. Then it grows and if it grows too much the binding generates the warning.
I wonder if it would help if we could set the threshold at which it reports. For example, for you maybe a value of 8 or 10 would make more sense. It is useful to get the warning if the heap continues to grow. It helped me identify a leak I had in one of my rules. Yes, they can happen.
I agree and have just fixed this, so thank you for reporting it. Now it will only warn you once for each 1% it increases.
That is a good idea. In the newer build it is based on 6% increase. If you pause and un-pause the thing, it will re-calibrate the reference size it uses to what the system has in use when you un-pause the thing.
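One way the behaviour described in these two replies could fit together is sketched below: stay quiet until the heap is a threshold above the reference, then warn only when a new high is reached, and reset on un-pause. This is my reading of the description, not the binding's code; `HeapGrowthMonitor` and its methods are hypothetical names.

```java
class HeapGrowthMonitor {
    private final int thresholdPercent; // e.g. 6, as mentioned for the newer build
    private int referencePercent;
    private int lastWarnedPercent = Integer.MIN_VALUE;

    HeapGrowthMonitor(int referencePercent, int thresholdPercent) {
        this.referencePercent = referencePercent;
        this.thresholdPercent = thresholdPercent;
    }

    /** Returns true only the first time a given heap percentage is reached. */
    boolean shouldWarn(int currentPercent) {
        if (currentPercent - referencePercent < thresholdPercent) {
            return false; // growth is still below the threshold
        }
        if (currentPercent <= lastWarnedPercent) {
            return false; // this level has already been reported
        }
        lastWarnedPercent = currentPercent;
        return true;
    }

    /** Called on un-pause: the current value becomes the new reference. */
    void recalibrate(int currentPercent) {
        referencePercent = currentPercent;
        lastWarnedPercent = Integer.MIN_VALUE;
    }

    public static void main(String[] args) {
        HeapGrowthMonitor monitor = new HeapGrowthMonitor(20, 6);
        System.out.println(monitor.shouldWarn(25)); // false
        System.out.println(monitor.shouldWarn(26)); // true
        System.out.println(monitor.shouldWarn(26)); // false
    }
}
```

With a reference of 20 and a threshold of 6, the repeated "20 to 26" readings from the log above would produce a single warning instead of one every half hour.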
Could there be a problem with the under-voltage alarm? The throttle code is 80008, which corresponds to an active temperature alarm and one that occurred previously.
2024-04-26 11:48:07.813 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : Full Pi throttle code is 80008
2024-04-26 11:48:07.814 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : Pi, Soft temperature limit active
2024-04-26 11:48:07.815 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : Your Pi's power supply or cable was not good enough to supply power without an under-voltage event occuring.
2024-04-26 11:48:07.817 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : CPU temperature is 61.224c and may cause instability. Do you have a heatsink and fan?
On my Raspberry pi this file contains a number 80000. This is not preceded by 0x as it should be if you want to consider this number as hexadecimal.
So I think that the line String content = Files.readString(filePath);
should be adapted to something like String content = "0x" + Files.readString(filePath);
I may be looking at an old version of the code, as I have difficulties finding my way in GitHub...
The binding was installed from the addon community marketplace (14 april 2024 7:02)
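To illustrate why the missing 0x matters, the sketch below parses the same text once as decimal and once as hexadecimal. Whether the installed version actually parsed the value as decimal is an assumption on my part, but it would be consistent with the log above: decimal 80008 is 0x13888, which has bit 3 and bit 16 set. The sample content and the `ThrottleFileParsing` class are illustrative; note that text read from a file may end with a newline, so it is trimmed before parsing.

```java
class ThrottleFileParsing {
    static final int UNDER_VOLTAGE_HAS_OCCURRED = 1 << 16;

    /** Interprets the text as hexadecimal, with or without a leading "0x". */
    static int parseHex(String content) {
        String value = content.trim(); // drop any trailing newline
        if (value.startsWith("0x") || value.startsWith("0X")) {
            value = value.substring(2);
        }
        return Integer.parseInt(value, 16);
    }

    public static void main(String[] args) {
        String content = "80008\n"; // sample content, not read from a real Pi
        int asDecimal = Integer.parseInt(content.trim());
        int asHex = parseHex(content);
        // Decimal 80008 is 0x13888, so bit 16 looks set by mistake.
        System.out.println((asDecimal & UNDER_VOLTAGE_HAS_OCCURRED) != 0); // true
        // Hex 0x80008 only has bits 3 and 19 set.
        System.out.println((asHex & UNDER_VOLTAGE_HAS_OCCURRED) != 0); // false
    }
}
```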
Thank you, I have updated and uploaded a fixed version with a similar change so thanks for reporting and pointing this out. To upgrade, just uninstall and then install it again from the marketplace.
Hi,
I'm trying out your binding, which I find interesting. However, I can only find one channel popping up in the Binding Thing, the cleaned-heap-percent. I downloaded the binding today, May 10 (latest update today). I'm on RPi4, openHABian, Debian Bullseye, openHAB 4.1.2.
/Erik