The Doctor Binding helps find issues with your system [4.0.0;4.9.9]

This binding is one I have been thinking of building for a while. It is designed to help diagnose issues for someone who does not know Linux/Java, has no idea what free vs. available RAM means, let alone the Java heap. Install the binding and get a system check-up. The System Info binding requires you to know what to look at and how to interpret the results; this binding will hopefully get better over time at finding issues in an automated way. If you have ideas on what it could check for, feel free to post them.

If you enjoy the binding, please consider sponsoring or a one-off tip as a thank you via the links. This allows me to purchase software and hardware to contribute more bindings. Also, some coffee to keep me coding faster never hurts :slight_smile:

Sponsor @Skinah on GitHub Sponsors

PayPal can also be used via
matt A-T pcmus D-O-T C-O-M

Features

The binding already warns in your logs when:

  • The CPU overheats.
  • The heap is wrongly sized.
  • The heap is growing and not shrinking back after garbage collections run, so out-of-memory errors (OOME) and memory leaks should be detected and picked up early (see the heap sketch after this list).
  • RAM is full or getting close to 100% full, so you get a warning that something needs to be looked at.
  • A Raspberry Pi power supply or cable is not good enough.
  • It also allows you to graph the heap after the garbage collector has first cleaned it.
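For readers curious how the heap percentages in the logs below can be obtained from inside the JVM, here is a minimal sketch using the standard MemoryMXBean. It illustrates the general approach only, not the binding's actual code:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapCheck {
    public static void main(String[] args) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // the -Xmx ceiling; can be -1 if undefined, so guard it
        if (max > 0) {
            long percentFull = heap.getUsed() * 100 / max;
            System.out.println("Heap is " + percentFull + "% full ("
                    + heap.getUsed() / (1024 * 1024) + " MB of "
                    + max / (1024 * 1024) + " MB)");
        }
    }
}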

Not yet implemented, but planned for possible addition:

  • Raspberry Pi power supply is not good enough (Added)
  • Swap file is getting used a lot or running out of space
  • Zram checks
  • Watch for a continually growing number of processes and threads
  • Check that add-on JAR files are all the same version as openHAB

Example log output at DEBUG level

2024-03-16 06:04:40.650 [INFO ] [.thedoctor.internal.TheDoctorHandler] - Will include health checks for your:Raspberry Pi 3 Model B Plus Rev 1.3
2024-03-16 06:04:40.654 [DEBUG] [.thedoctor.internal.TheDoctorHandler] - GOOD: Pi is not reporting any current throttle conditions.
2024-03-16 06:04:40.708 [DEBUG] [.thedoctor.internal.TheDoctorHandler] - GOOD: Heap is only 24% full, and ranges from 0% to 24%
2024-03-16 06:04:40.710 [DEBUG] [.thedoctor.internal.TheDoctorHandler] - GOOD: RAM is 46% full
2024-03-16 06:04:40.733 [DEBUG] [.thedoctor.internal.TheDoctorHandler] - GOOD: CPU temperature is 52.078c
2024-03-16 06:05:56.668 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : Full Pi throttle code is 80008
2024-03-16 06:05:56.670 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : Pi, Soft temperature limit active
2024-03-16 06:05:56.671 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : Your Pi's power supply or cable was not good enough to supply power without an under-voltage event occuring.
2024-03-16 06:05:56.672 [DEBUG] [.thedoctor.internal.TheDoctorHandler] - GOOD: Heap is only 18% full, and ranges from 18% to 28%
2024-03-16 06:05:56.676 [DEBUG] [.thedoctor.internal.TheDoctorHandler] - GOOD: RAM is 67% full
2024-03-16 06:05:56.679 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : CPU temperature is 61.762c and may cause instability. Do you have a heatsink and fan?
2024-03-16 06:06:11.683 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : Full Pi throttle code is 80000
2024-03-16 06:06:11.684 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : Your Pi's power supply or cable was not good enough to supply power without an under-voltage event occuring.

Changelog

Version 0.2

  • Added Raspberry Pi power supply and throttle code checks
  • Adjusted heap detection slightly based on real-world testing of a new Pi setup with defaults.

Version 0.1

  • Initial release

Resources

http://pcmus.com/openhab/TheDoctorBinding/org.openhab.binding.thedoctor-4.2.0-SNAPSHOT.jar

Source Code


I really like the idea of this. Would everything work in a container, or is this only going to work in a bare-metal install? I imagine the Java stuff should work, but I wonder about swap/CPU temp/etc.

EDIT: It appears that CPU temp is not available, but I shouldn't be surprised by that since I'm running in a Docker container on a VM. I don't even know if the VM has access to the CPU temp. RAM is available though.

I'm sure it's already on the roadmap, but having Channels for each stat that represent the GOOD/BAD status of the stat would be great! Then it can be used to generate an alert using the mechanism of the user's choice.

Would it be allowed for an add-on to report internal OH stuff like orphaned links, orphaned Item metadata, and so on? Of course, if you don't want to make it an official add-on, who cares; but if that is your end goal, I don't know if that sort of thing would be allowed. They can be quite useful things to report, though. A lot of orphaned stuff like that can be an indication of a problem.

One minor bug to report: the add-on settings do not let me change the logging level.

I look forward to seeing how this develops.


I lack experience with containers, so I had no idea; thanks for confirming. My thought was that if you run containers you have a little more knowledge, but this is probably wrong, as I am looking to purchase a newer router/firewall that has tons of power and could run openHAB in its built-in container support. New users could take this route to use spare power in a device they already pay money to run.

I was thinking of keeping it simple and having a CSV error-code channel, then an example rule to read the channel's state and send/push a message.

Now that I know about the built-in health check feature, it kind of makes sense to take an approach like this:

  • The System Info binding if you want to graph and set up gauges.
  • A Main UI health check for as much as possible that is low stress on the system; @seime, perhaps a SCAN button that a user can press to do more stressful testing on demand?
  • This binding for more aggressive checks that may not get approved for merging into the core.

To explain what I doubt would get put into a built-in health checker: the graph this binding has triggers a garbage collection every 15 minutes to give a cleaned-up heap value. This halts everything running in Java whilst the clean-up is done, which is very fast, but it may have side effects for time-sensitive code. The binding should be low impact, but I would probably limit its use to maybe a week after you install a new openHAB version, then disable it with the pause button, un-pausing when you make a change you wish to check for memory leaks. I may add a channel to turn this on and off, or a config option.

The idea behind this is: you can get a baseline heap value, install a binding, use it for half an hour, and then uninstall it; the heap should go back to the original value if there are no leaks. I may look at improving it to state in MB how much it has grown.
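As a rough sketch of that baseline technique (my own illustration with made-up names, not the binding's code; note that System.gc() is only a request the JVM may ignore or defer):

import java.lang.management.ManagementFactory;

public class LeakProbe {
    private long baselineBytes = -1;

    // Request a garbage collection, then sample the cleaned-up heap size.
    public long sampleCleanedHeap() {
        System.gc(); // a hint only; the JVM decides when to actually collect
        long used = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
        if (baselineBytes < 0) {
            baselineBytes = used; // the first sample becomes the reference value
        }
        return used;
    }

    // How far the cleaned heap has grown past the baseline, in MB.
    public long growthMegabytes() {
        return (sampleCleanedHeap() - baselineBytes) / (1024 * 1024);
    }
}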

Sadly I help far too many people on this forum who have no business using containers, because it's what's supported or someone on the Internet told them it was the best idea since sliced bread.

I wonder if it's possible to detect when a stat like CPU temp is not available and keep the Item NULL in that case.

I'm generally not a fan of forcing the user to write a rule like that. If the user wants to show this on their UI (MainUI or Sitemaps), send an alert, or really do anything else with it, that would require a rule or at least some transformation profiles.

If you don't break it up, though, at least use JSON or XML so that it's easier to parse out the desired value through a standard transform profile. It will be really hard for a lot of users if they have to use REGEX or a script transform to pull the data out, even with examples.


I believe they should then use the System Info binding and set it up to work how they want; they then get the choice of setting the threshold for exactly when the CPU temp is too high. The idea of this binding is for someone who does not know what to measure, nor what an acceptable threshold is; they just want to know what to ask for help with on the forum, after first searching for the ERROR the logs are giving.

I was thinking along the lines of a very basic single String channel called something like Fault, which would probably contain only one fault in normal use but would grow if more than one error occurs. You can just send the String and it is clear what the error is:

Overheating,MemoryLeak,PowerSupply

I probably used the wrong term, error code, as it is more of a fault condition. Hopefully the string above makes it clearer.
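To show how simple such a channel would be to consume, a hypothetical snippet (plain Java with made-up names, standing in for a rule body) only needs a contains check or a split:

public class FaultCheck {
    public static void main(String[] args) {
        // Hypothetical state received from the binding's Fault channel.
        String faults = "Overheating,MemoryLeak,PowerSupply";

        // Either watch for one specific condition...
        if (faults.contains("PowerSupply")) {
            System.out.println("Alert: check the power supply!");
        }

        // ...or split the CSV into individual fault conditions.
        for (String fault : faults.split(",")) {
            System.out.println("Active fault: " + fault);
        }
    }
}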

Great, do you mind giving an example of what would be best so I can adopt it? I prefer JSON, as I am used to using the gson lib. Using transforms is not my strong point, hence the wish to use a simple plain-text fault that can be sent as-is, or a simple "if contains XXXX" then…

This would be fine to add if it were just an ON/OFF switch representing bad/good. I just don't love the idea of having 30+ Switch items that grow over time as new features get added, when a single channel as described above is more SIMPLE. I will have to consider what makes the most sense, as a pure LOG output is not enough; people will leave it running and never check the logs, so a way to send a notification that they need to look at the logs in more detail is what I am after.

I really do not care whether it gets merged; this is more about making it useful and not having to give a user a lecture on what free and available RAM are, or explain that a memory leak has nothing to do with graphing used RAM with the System Info binding. It's for someone who complains they have crashes weekly, and then this binding tells them that their Raspberry Pi power supply is not capable and is causing stability issues. The goal is always to create something that can be merged, but if the binding can be more useful by breaking rules, then that makes more sense, if more people get a working system and stop blaming openHAB as being unstable for some reason they cannot diagnose.

Nice one, thanks.
Given the number of people who use Raspberry Pis, and openHABian with ZRAM by default, would you consider double-checking that, too?
disksize vs. filling level, and zram mem-used vs. zram mem-limit.
You can use the Exec binding, or check the sources to eventually find some Java-level means of access (see the sketch after the output below).

admin@mysmarthouse:~ $ zramctl --output-all
NAME       DISKSIZE   DATA  COMPR ALGORITHM STREAMS ZERO-PAGES  TOTAL MEM-LIMIT MEM-USED MIGRATED MOUNTPOINT
/dev/zram2       2G 341,5M  76,6M zstd            4       5462  81,6M      600M    81,6M     1,9K /opt/zram/zram2
/dev/zram1       3G   1,2G  43,6M zstd            4     205569    49M      300M    49,1M     1,7K /opt/zram/zram1
/dev/zram0       1G 436,4M 170,4M lzo-rle         4       7036 186,2M      300M   250,4M     7,5K [SWAP]
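For the Java-level route, one candidate source for those numbers is /sys/block/zram0/mm_stat (an assumption about the kernel interface the binding could read; the field order below is taken from the kernel's zram documentation, so treat it as a hedged sketch rather than tested code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ZramCheck {
    public static void main(String[] args) throws IOException {
        // mm_stat fields: orig_data_size compr_data_size mem_used_total
        // mem_limit mem_used_max same_pages pages_compacted [huge_pages]
        String[] f = Files.readString(Path.of("/sys/block/zram0/mm_stat")).trim().split("\\s+");
        long origData = Long.parseLong(f[0]);
        long memUsed = Long.parseLong(f[2]);
        long memLimit = Long.parseLong(f[3]); // 0 means no limit is set
        long diskSize = Long.parseLong(
                Files.readString(Path.of("/sys/block/zram0/disksize")).trim());

        System.out.println("zram0 disksize filled: " + (origData * 100 / diskSize) + "%");
        if (memLimit > 0) {
            System.out.println("zram0 mem-used vs mem-limit: " + (memUsed * 100 / memLimit) + "%");
        }
    }
}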

How do you determine that? Do 'undervoltage' messages have to show up in syslog?

I'd vote for JSON, as JSONPATH is easier to work with in OH than XPath is. I'm not sure what kind of example I can give, though. If it's a relatively flat JSON:

{ "prop1": "one",
  "prop2": "two",
  "prop3": "three" }

The JSONPATH for the second property would be a simple JSONPATH('$.prop2').
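On the producing side, since gson was mentioned above, a sketch of emitting such a flat object (property names are placeholders carried over from the example; the gson library is assumed to be on the classpath):

import com.google.gson.JsonObject;

public class FaultJson {
    public static void main(String[] args) {
        JsonObject status = new JsonObject();
        status.addProperty("prop1", "one");
        status.addProperty("prop2", "two");
        status.addProperty("prop3", "three");
        // Prints: {"prop1":"one","prop2":"two","prop3":"three"}
        System.out.println(status);
    }
}

A JSONPATH('$.prop2') transform would then pull out "two".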

Simplest for the developer, for sure. Simplest for the users? I'm not so sure. But you are the developer, so all I can do is make suggestions. For the end user, there can be a hundred Channels, but they can choose which one(s) they care about and ignore the rest. Putting everything in one Channel means they have to deal with every possible value, even if they only care about one.

That's what I assumed the ultimate goal was. But what if a user only cares about CPU temp and nothing else? Do they need to deal with alerts for everything else? Those are the sorts of usability things that come to my mind.

Unless of course there is more than just a summary of stats. If there is more, like a summary of findings (e.g. "Doctor binding sees your system load is high, RAM utilization is high, and SWAP is in use. Your machine doesn't have enough RAM!"), maybe separating each stat isn't all that important. If the binding can process all the stats and come up with a recommended course of action, then it doesn't necessarily need to report most of the stats at all.

You can use the log, but I am currently implementing it using the Linux command below, which returns a hex code:

vcgencmd get_throttled

It should return throttled=0x0

There is also another method:
cat "/sys/devices/platform/soc/soc:firmware/get_throttled"

| Bit | Meaning |
|:---:|---------|
| 0 | Under-voltage detected |
| 1 | Arm frequency capped |
| 2 | Currently throttled |
| 3 | Soft temperature limit active |
| 16 | Under-voltage has occurred |
| 17 | Arm frequency capped has occurred |
| 18 | Throttling has occurred |
| 19 | Soft temperature limit has occurred |

I get throttled=0x80000 returned.
Translated to binary, that is 1000 0000 0000 0000 0000, which means my Pi 3 has hit the soft temperature limit, as it booted up with no heatsink attached.
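A minimal Java sketch of decoding those bits (my own illustration of the bit table above, not the binding's implementation):

public class ThrottleDecode {
    public static void main(String[] args) {
        // The hex value from `vcgencmd get_throttled`, e.g. throttled=0x80008
        long code = Long.parseLong("80008", 16);

        // Bit positions from the table above.
        if ((code & (1L << 0)) != 0)  System.out.println("Under-voltage detected");
        if ((code & (1L << 1)) != 0)  System.out.println("Arm frequency capped");
        if ((code & (1L << 2)) != 0)  System.out.println("Currently throttled");
        if ((code & (1L << 3)) != 0)  System.out.println("Soft temperature limit active");
        if ((code & (1L << 16)) != 0) System.out.println("Under-voltage has occurred");
        if ((code & (1L << 17)) != 0) System.out.println("Arm frequency capped has occurred");
        if ((code & (1L << 18)) != 0) System.out.println("Throttling has occurred");
        if ((code & (1L << 19)) != 0) System.out.println("Soft temperature limit has occurred");
        // For 0x80008 this prints the two soft-temperature-limit lines (bits 3 and 19).
    }
}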

Thanks for the suggestion, I'll look into it as I find time.

This is a really good idea. Would you not consider merging it into the System Info binding?

Just added a new version that checks the Raspberry Pi range for power supplies that are not handling the voltage/current requirements, and looks for frequency and heat throttling from not having sufficient cooling on the CPU.

I have considered it, and feel that is the wrong place to add what I am wanting to achieve; I have addressed why in the posts above.

Is it normal to see this message repeating? This seems like it is worth one warning, but I'm not convinced it needs to come out every half hour.

OH 4.1.2

2024-04-18 10:53:35.125 [WARN ] [.thedoctor.internal.TheDoctorHandler] - Heap has increased from 20 to 25 and may indicate a memory leak if this number keeps growing. This binding has a channel you can use to watch the heap with.
2024-04-18 11:23:35.648 [WARN ] [.thedoctor.internal.TheDoctorHandler] - Heap has increased from 20 to 26 and may indicate a memory leak if this number keeps growing. This binding has a channel you can use to watch the heap with.
2024-04-18 11:53:36.174 [WARN ] [.thedoctor.internal.TheDoctorHandler] - Heap has increased from 20 to 26 and may indicate a memory leak if this number keeps growing. This binding has a channel you can use to watch the heap with.
2024-04-18 12:23:36.680 [WARN ] [.thedoctor.internal.TheDoctorHandler] - Heap has increased from 20 to 26 and may indicate a memory leak if this number keeps growing. This binding has a channel you can use to watch the heap with.
2024-04-18 12:53:37.196 [WARN ] [.thedoctor.internal.TheDoctorHandler] - Heap has increased from 20 to 26 and may indicate a memory leak if this number keeps growing. This binding has a channel you can use to watch the heap with.
2024-04-18 13:23:37.710 [WARN ] [.thedoctor.internal.TheDoctorHandler] - Heap has increased from 20 to 26 and may indicate a memory leak if this number keeps growing. This binding has a channel you can use to watch the heap with.
2024-04-18 13:53:38.228 [WARN ] [.thedoctor.internal.TheDoctorHandler] - Heap has increased from 20 to 26 and may indicate a memory leak if this number keeps growing. This binding has a channel you can use to watch the heap with.
2024-04-18 14:23:38.749 [WARN ] [.thedoctor.internal.TheDoctorHandler] - Heap has increased from 20 to 26 and may indicate a memory leak if this number keeps growing. This binding has a channel you can use to watch the heap with.

I see the same in my logs. As the heap is garbage collected it shrinks; then it grows, and if it grows too much the binding generates the warning.

I wonder if it would help if we could set the threshold at which it reports. For example, for you maybe a value of 8 or 10 would make more sense. It is useful to get the warning if the heap continues to grow. It helped me identify a leak I had in one of my rules. :wink: Yes, they can happen.

I agree and have just fixed this, so thank you for reporting it. Now it will only warn you once for each 1% it increases.

That is a good idea. In the newer build it is based on a 6% increase. If you pause and un-pause the Thing, it will re-calibrate the reference size it uses to whatever the system has in use when you un-pause it.
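As a sketch of the warn-once pattern described here (my own illustration, not the binding's exact code):

public class HeapGrowthWarner {
    private static final int THRESHOLD_PERCENT = 6; // growth allowed before warning
    private int baselinePercent = -1;  // re-calibrated when the Thing is un-paused
    private int lastWarnedPercent = -1;

    // Warn at most once for each whole percent the heap grows past the threshold.
    public void check(int currentPercent) {
        if (baselinePercent < 0) {
            baselinePercent = currentPercent;
        }
        if (currentPercent - baselinePercent >= THRESHOLD_PERCENT
                && currentPercent > lastWarnedPercent) {
            System.out.println("WARN: Heap has increased from " + baselinePercent
                    + " to " + currentPercent);
            lastWarnedPercent = currentPercent; // suppress repeats at the same level
        }
    }
}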


Could there be a problem with the undervoltage alarm? The throttle code is 80008, which corresponds to an active temperature alarm and one that occurred previously.

2024-04-26 11:48:07.813 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : Full Pi throttle code is 80008
2024-04-26 11:48:07.814 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : Pi, Soft temperature limit active
2024-04-26 11:48:07.815 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : Your Pi's power supply or cable was not good enough to supply power without an under-voltage event occuring.
2024-04-26 11:48:07.817 [WARN ] [.thedoctor.internal.TheDoctorHandler] - BAD : CPU temperature is 61.224c and may cause instability. Do you have a heatsink and fan?