New add-on bundle for Prometheus health metrics


Hello everyone,

We are developing an add-on bundle for Eclipse SmartHome/openHAB, are experiencing some difficulties with it, and are looking for help here. We have tried a lot of different approaches, but all of them failed. I think the bundle will be helpful for others too, so we are releasing it under the EPL, and maybe it is interesting enough to be incorporated directly into upstream later.

The bundle exposes internal metrics of ESH/openHAB in Prometheus’ format (https://prometheus.io/docs/concepts/data_model/) so they can be scraped/ingested there and fed into a Grafana dashboard, giving a good understanding of the system’s health. This is especially useful for multiple systems deployed in the field that are not physically accessible to whoever monitors them.

The metrics currently exposed are:

  • openhab_bundle_state{bundle="[BUNDLENAME]"} [BUNDLESTATE]
  • openhab_thing_state{thing="[THINGID]"} [THINGSTATE]
  • openhab_inbox_count [INBOXCOUNT]
  • smarthome_event_count{source="[BUNDLENAME]"} [EVENTCOUNT]
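For illustration, here is roughly what these metrics look like in the Prometheus text exposition format; the label and sample values below are invented:

```text
# TYPE openhab_bundle_state gauge
openhab_bundle_state{bundle="org.openhab.binding.zwave"} 32.0
openhab_thing_state{thing="zwave:device:1:node5"} 1.0
openhab_inbox_count 2.0
smarthome_event_count{source="org.eclipse.smarthome.core"} 1520.0
```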

We would now like to add internal state from several bindings (most importantly Z-Wave), which is hard to acquire because it is internal. However, the bindings regularly output quite a lot of interesting information (e.g. the size of the send queue or the number of sending attempts) to the logs at DEBUG level.
The general approach will add more metrics to the output, like:

  • openhab_logmessages_total{level="[LOGLEVEL]"} [MESSAGECOUNT]
  • openhab_logmessages_error{type="[LOGGERNAME]"} [MESSAGECOUNT]
    Afterwards we will probably want to filter, process and parse a custom set of messages to get e.g. Z-Wave’s internal message queue etc.
    The Prometheus Java client library which we use, and which comes as OSGi bundles, has logging collectors for log4j, log4j2 and logback (https://github.com/prometheus/client_java#logging), so the “only” thing we need to do is set up our bundle (or a class in it) as an appender on the openHAB logging facility, to receive the information in parallel to the usual file logging.
    And here comes the problem: we were not able to figure out how to hook it up to Karaf/Felix/Pax in the distribution runtime (we managed to do it for Logback/Equinox inside the development IDE).
    Can anybody here point us in the right direction as to how or where we need to register our bundle/fragment so that all logging messages can be configured to end up there? We googled a lot, but it seems nobody has had this exact requirement before; all articles and hints we found are outdated, provide only bits and pieces, or are based on different requirements.
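To make the registration question concrete: in the openHAB distribution, appenders are declared in userdata/etc/org.ops4j.pax.logging.cfg using log4j2-style keys. A sketch of what hooking in a custom appender would look like there (the appender type and name are hypothetical placeholders for whatever our bundle would provide, not a working configuration):

```properties
# org.ops4j.pax.logging.cfg (log4j2 syntax as used by pax-logging)

# Hypothetical custom appender provided by our bundle
log4j2.appender.prometheus.type = PrometheusCollector
log4j2.appender.prometheus.name = PROMETHEUS

# Attach it to the root logger next to the existing appender refs
log4j2.rootLogger.appenderRef.prometheus.ref = PROMETHEUS
```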

You can see the current state here: https://github.com/KuguHome/openhab-prometheus-metrics , build instructions etc. are included in the README.md, a working jar is here: https://github.com/KuguHome/openhab-prometheus-metrics/releases/download/v0.9/com.kuguhome.openhab.prometheusmetrics-2.4.0-SNAPSHOT.1.jar. To make it run in the addons/ folder, you will need to add two more files:

The master branch has the working version, which is missing the openhab_logmessages_ metrics; the eclipse-workable-logback-metrics branch has those metrics, but only works in the Eclipse IDE, not inside the standalone distribution.

An example scrape looks like this: https://github.com/KuguHome/openhab-prometheus-metrics/wiki/Example-scrape

We have set it up on a couple of running instances, and this is how it looks when scraped by Prometheus and visualized by Grafana:

The Grafana Dashboard code is here: https://github.com/KuguHome/openhab-prometheus-metrics/wiki/Grafana-source

What we have done so far to get it running inside the openHAB distribution runtime:

  • added the appender to org.ops4j.pax.logging.cfg
  • put the jar bundle into the Maven repo according to org.ops4j.pax.url.mvn.cfg
    (so the bundle should be picked up at startup)
  • added the appender line to startup.properties:
    mvn:io.prometheus/simpleclient_log4j2/0.4.0 = 8
    mvn:org.ops4j.pax.logging/pax-logging-api/1.10.1 = 8 <---- was already there
    Got that:
    2018-07-04 17:34:59,960 CM Configuration Updater (Update: pid=org.ops4j.pax.logging) ERROR Unable to invoke factory method in class class org.apache.logging.log4j.core.config.AppendersPlugin for element Appenders. java.lang.NullPointerException

Please let me know if you need any more information.
Also we are very interested in your opinion of the bundle or any other kind of feedback.

Some more insights about our process of thoughts:

  • We don’t want to read the log from the files, because it seems very inefficient to parse things which existed as objects before and have then been serialized/formatted
  • We don’t want to use a Syslog output and pipe it into a solution like Logstash/Elasticsearch etc., because that is just as inefficient (object -> format -> parse) and needs much more setup overhead and additional things running on the openHAB machine
  • We looked into the openHAB Log Viewer (which runs independently on a different port, 9001) and the Log Reader binding, but they do the same (read the files) and also do not help our goal of getting internal state onto a time-based, visually appealing graph for good diagnostics
  • Rather, we want to hook into the stream of raw object data directly inside openHAB, using the APIs that exist in Logback/Log4j(2)/Pax for this exact purpose, to build our own collector
  • There was some talk about Prometheus-style metrics earlier here: Overview of expectable command latency on different hardware? But we think the hardware/OS data (memory, CPU etc.) is easier to acquire by running a node_exporter next to openHAB and scraping it from there. But maybe @fish is interested anyway? (He has said he has a soft spot for metrics.)

Looking forward to your responses,
Cheers,
Hanno - Felix Wagner (Product Manager for Kugu Home GmbH)

PS: I am crossposting this to the ESH forum, because it seems to be a very deep technical problem (https://www.eclipse.org/forums/index.php/m/1792637/#msg_1792637)


I can’t really help with your problem, but I do have a recommended metric that would be useful: the status of the thread pools, in particular the Rules thread pool. This is a resource that, when it runs out, will cause all your Rules to have latency or stop completely. Consequently, it’s a pretty important health and status indicator.

How/whether you can do that from the Karaf or some other way I can’t say. I look forward to trying this out in the coming weeks. I’ll need some time to sit down and work it out. I’m exploring health and status options as I have way more than just OH to monitor. Prometheus is on the short list of options to look at though.

Hey, thank you for your interest!
The recommendation sounds very promising. Indeed, we have had some problems with stopping rules lately, so this might very well have something to do with those threads. Do you have any pointers (documentation, code) for better understanding the implementation of the pools in OH?
If they are regular Java threads, then it should be possible to get information about them via the JVM API or JMX, maybe?
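To the JMX question: thread-level data is indeed reachable through the standard JVM management API, independent of openHAB. A minimal, self-contained sketch using only the plain JDK (no openHAB or Prometheus classes):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadStats {
    // Returns the number of live threads reported by the JVM's ThreadMXBean.
    public static int liveThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getThreadCount();
    }

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        System.out.println("live threads:   " + threads.getThreadCount());
        System.out.println("peak threads:   " + threads.getPeakThreadCount());
        System.out.println("daemon threads: " + threads.getDaemonThreadCount());
    }
}
```

This only gives JVM-wide totals, though; per-pool numbers need access to the pool objects themselves (see below in the thread).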

Look here for help with that: Why have my Rules stopped running? Why Thread::sleep is a bad idea.

I just wrote this up last week. It is more focused on the Rules pool. I’ve not seen the other pools be much of a problem.

They are regular threads, but I know OH/ESH is using a pool of pre-created threads. I’m not sure what library they are using for the pools, though. I’d assume Apache, but perhaps that is more an indication of the last time I coded in Java than of what is used most these days.

This post from Scott on that same thread might be useful to you.

I would expect the number of threads running from each pool to be available through JMX. What I don’t know is how available the total number of threads in the pool is, and you really need both to know whether there is a problem. The total number of threads can be modified through parameters in runtime.cfg, except for Timers and cron-triggered Rules, which are run by Quartz; the pool size for Quartz is buried in the OH runtime folders (/usr/share/openhab2/somewhere/quartz.properties). But the point is we can’t rely on an unchangeable default for the number of threads in the pool.
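For reference, the pool-size parameters in question live in conf/services/runtime.cfg; a later post in this thread shows the ruleEngine entry, and other ThreadPoolManager pools follow the same key pattern (the second line here is an illustrative example, not a required setting):

```properties
# conf/services/runtime.cfg
org.eclipse.smarthome.threadpool:ruleEngine=5
org.eclipse.smarthome.threadpool:discovery=5
```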

Hi @friesenkiwi,

as mentioned in the Eclipse forum, let me try to help you with the Karaf logging configuration here. Nonetheless, seeing @rlkoshak’s feedback, I wonder whether a dedicated health & metrics API in ESH might make sense in the long run.

In general, I would expect that adding a log4j2 appender should work out of the box. AFAIK, neither openHAB nor Karaf does anything special, so this is probably a plain log4j2/pax-logging configuration issue.

You reported:

2018-07-04 17:34:59,960 CM Configuration Updater (Update: pid=org.ops4j.pax.logging) ERROR Unable to
invoke factory method in class class org.apache.logging.log4j.core.config.AppendersPlugin for element 
Appenders. java.lang.NullPointerException

That sounds to me like an issue in the logging configuration. Unfortunately, I am not at all an expert in pax-logging and log4j2. Did you try searching the web or checking their discussion forums for what might be the reason for this problem? As I said, I don’t think anything openHAB-specific is involved here, so it should be a question for anyone using Karaf/pax-logging/log4j2.

Regards,
Kai


Hi,

@rlkoshak thank you again for the suggestion and directions. We have now added these metrics to the bundle output:

  • openhab_pool_size (this corresponds to the org.eclipse.smarthome.threadpool:ruleEngine=5 setting in conf/services/runtime.cfg; in Java it is called corePoolSize)
  • openhab_pool_size_max
  • openhab_pool_threads_count_current
  • openhab_pool_threads_count_active (most of the current threads may be sleeping/blocked etc., these are the ones actually running right now)
  • openhab_pool_threads_count_largest
  • openhab_pool_tasks_count_completed
  • openhab_pool_tasks_count_total (completed+not yet started or blocked, running, etc.)
  • openhab_pool_queue_count
  • openhab_pool_keepalive_time_seconds (the time an idle thread beyond the core pool size waits for new work before being terminated)

All of the above metrics are per-pool, so they are tagged via {pool="[POOLNAME]"}, and this works for all pools launched through org.eclipse.smarthome.core.common.ThreadPoolManager.
E.g. on one of our systems, these pools (and sizes) are present:

  • thingLinkManager 5.0
  • ConfigStatusService 1.0
  • items 1.0
  • RuleEngine 5.0
  • lsp 1.0
  • discovery 5.0
  • safeCall 1.0
  • upnp-io 5.0
  • thingHandler 5.0
  • astro 5.0
  • usb-serial-discovery-linux-sysfs 5.0
  • thingManager 5.0
  • persist 5.0
  • file-processing 5.0

This should let you take a pretty close look at the pools, tasks and threads and analyse problems.
The official Java documentation on ThreadPoolExecutor is quite helpful for understanding how all those metrics work together.
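For readers who want to see where these numbers come from: a minimal sketch that reads the same figures off a plain java.util.concurrent.ThreadPoolExecutor and formats them like the metrics above. This mirrors the metric names from the list, but it is not the bundle’s actual implementation:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolMetrics {
    // One Prometheus-style sample line, e.g. openhab_pool_size{pool="ruleEngine"} 5
    private static String line(String metric, String pool, long value) {
        return metric + "{pool=\"" + pool + "\"} " + value + "\n";
    }

    // Reads the executor statistics that back the openhab_pool_* metrics.
    public static String scrape(String pool, ThreadPoolExecutor ex) {
        return line("openhab_pool_size", pool, ex.getCorePoolSize())
             + line("openhab_pool_size_max", pool, ex.getMaximumPoolSize())
             + line("openhab_pool_threads_count_current", pool, ex.getPoolSize())
             + line("openhab_pool_threads_count_active", pool, ex.getActiveCount())
             + line("openhab_pool_threads_count_largest", pool, ex.getLargestPoolSize())
             + line("openhab_pool_tasks_count_completed", pool, ex.getCompletedTaskCount())
             + line("openhab_pool_tasks_count_total", pool, ex.getTaskCount())
             + line("openhab_pool_queue_count", pool, ex.getQueue().size())
             + line("openhab_pool_keepalive_time_seconds", pool, ex.getKeepAliveTime(TimeUnit.SECONDS));
    }

    public static void main(String[] args) {
        // A pool shaped like the default ruleEngine pool (core size 5)
        ThreadPoolExecutor ex = new ThreadPoolExecutor(5, 10, 60, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());
        System.out.print(scrape("ruleEngine", ex));
        ex.shutdown();
    }
}
```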

We also added some standard JVM metrics about memory, class loading and garbage collection, essentially covering what the JMX agent would otherwise provide.
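Those JVM figures can be read from the standard platform MXBeans; a minimal sketch of reading them directly (the metric names printed here are illustrative, not the bundle’s actual output):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class JvmStats {
    // Heap bytes currently in use, as a JMX agent would also report it.
    public static long heapUsedBytes() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        return mem.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        System.out.println("jvm_memory_heap_used_bytes " + heapUsedBytes());
        System.out.println("jvm_classes_loaded "
                + ManagementFactory.getClassLoadingMXBean().getLoadedClassCount());
        // One counter per garbage collector present in this JVM
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println("jvm_gc_collections_total{gc=\"" + gc.getName() + "\"} "
                    + gc.getCollectionCount());
        }
    }
}
```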

The latest version is available for download (you’ll need an additional file) and on GitHub (new branch); the example scrape has been updated to reflect this, and so has the Grafana board. New image:

@Kai glad you find it interesting :slight_smile:
I think the technical discussion about the log appender problem can stay in the Eclipse forum, as the people familiar with Karaf, Pax and Log4j2 are more likely to be found there.
Also, in the latest version of the bundle, we made some effort to make it universally usable for and from other bundles as well.
So, for example, if the Z-Wave binding would like to announce a certain metric like its queue size, it can do so with a simple call, much like the familiar logger.error("Error during metrics startup");. And if this makes its way into the ESH core, we are more than happy :slight_smile: I think all of the above metrics are not openHAB-specific, but applicable to ESH in general.
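To illustrate what such a “simple call” could look like from a binding’s point of view, here is a tiny hypothetical sketch; all names (MetricAnnouncer, announce) are invented for illustration and are not the bundle’s real API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical facade a binding could use, analogous to a logger call.
public class MetricAnnouncer {
    private static final Map<String, Double> REGISTRY = new ConcurrentHashMap<>();

    // A binding would announce a gauge value much like it logs a message, e.g.
    //   MetricAnnouncer.announce("zwave_send_queue_size", queue.size());
    public static void announce(String name, double value) {
        REGISTRY.put(name, value);
    }

    // The metrics endpoint would read the registry when scraped.
    public static Double value(String name) {
        return REGISTRY.get(name);
    }
}
```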

Cheers,
Hanno


We were able to solve the problem with the logger; details can be found over at the Eclipse board.
The latest version of the bundle has some additional metrics:

# HELP openhab_logmessage_count logmessage count at various log levels
# TYPE openhab_logmessage_count counter
openhab_logmessage_count{level="debug",} 64.0
openhab_logmessage_count{level="warn",} 0.0
openhab_logmessage_count{level="trace",} 0.0
openhab_logmessage_count{level="error",} 0.0
openhab_logmessage_count{level="fatal",} 0.0
openhab_logmessage_count{level="info",} 375.0

For now this is only a count of messages, but we want to extract certain values from the log messages soon.

Has anybody tested it out yet, maybe @rlkoshak @fish?

Sadly not. It will actually be quite a configuration challenge for me. I’ll need to spin up a new VM, and I have a lot of other stuff I want to host on it for monitoring as well, which means developing a lot of new Ansible playbooks and the like. It will take some time before I get that far.

Ah, yes, I can see that it is challenging and takes time to set up.
Since you’re talking about Ansible, and I have read something about it somewhere in the forums or in the openHABian bug tracker: we are also using Ansible for the configuration and orchestration of our whole infrastructure (including Prometheus and Grafana) as well as openHAB and the Raspberry Pis themselves. Maybe I can help a bit with pulling together a nice playbook for this?

That would be pretty awesome!

What do you already have and what would you like to achieve? I mean, which software and configuration?

My end goal is quite a bit bigger than this. Ultimately I’d like to migrate all of my system and services monitoring off of openHAB and onto a dedicated system. I’ve one ESXi server running five VMs, a firewall running pfSense, three RPis, some ESP8266s, and around 10 services mostly running in Docker containers (I’m not using an orchestration engine like Kubernetes) that I’d like to monitor, mainly so that I can generate alerts when a problem occurs (e.g. Plex Media Server is no longer responding, OH Rules have stopped running, RAM usage on one of the VMs is too high, etc.).

I’m in the process of jockeying my VMs around a bit (I ran out of the number of CPUs I can assign with the free ESXi license), and when I’m done with that I plan on building a monitoring VM (I know monitoring a computer from itself isn’t the best idea, but it’s good enough for now) where I intend to host Nessus, some sort of health and status server (Prometheus + Grafana is definitely one of my top picks, though I might also go old school with Nagios), and stuff like that.

Playbook roles to deploy the Prometheus server and any client software needed on the machines would be handy. I’ve a role to deploy and configure Grafana for OH but it isn’t working for me right now and I’ve not had time to debug it.

I like the idea of Prometheus as it seems a bit more flexible and customizable compared to Nagios.

We are using Prometheus for monitoring of our main infrastructure as well as deployed field systems, on the hardware/OS level but also on the application level. It is a very well-thought-through, lightweight, versatile and modern tool with a big community, active development and many integrations in many places, so I definitely recommend it.

The roles (that seem to fit here) we use are:

We are also internally running a couple of playbooks for configuration of openHAB which we might be able to turn into a proper role in the future.


Hi @friesenkiwi, I’m trying to use your add-on but am getting the following error:

2018-08-14 10:39:00.539 [ERROR] [m.kuguhome.openhab.prometheusmetrics] - 
[com.kuguhome.openhab.prometheusmetrics.exposable.LoggerMetric(3)] Could not load implementation 
object class com.kuguhome.openhab.prometheusmetrics.exposable.LoggerMetric
java.lang.ClassNotFoundException: com.kuguhome.openhab.prometheusmetrics.exposable.LoggerMetric 
cannot be found by com.kuguhome.openhab.prometheusmetrics_2.4.0.201808071840 

I’m running the OH 2.3 release and your v0.9.2beta jar along with the 3 required modules.

Hi @sjcliffe, thanks for your interest and sorry for the error. I just released a new version, 0.9.3, which gets rid of the error message. The bundle should have worked fine before as well. You can check with bundle:list in the console or by opening the following in a browser:
http://127.0.0.1/rest/metrics/prometheus

With 0.9.3 I now get:

2018-08-15 14:55:28.053 [ERROR] [m.kuguhome.openhab.prometheusmetrics] - 
[com.kuguhome.openhab.prometheusmetrics.exposable.OpenHABThingStateMetric(291)] Error during 
instantiation of the implementation object
java.lang.NoClassDefFoundError: Could not initialize class 
com.kuguhome.openhab.prometheusmetrics.exposable.OpenHABThingStateMetric

bundle:list shows that it’s running but trying to access http://127.0.0.1:8080/rest/metrics/prometheus results in a 404 not found.

You probably hot-swapped the file inside the addons/ folder, i.e. deleted the old one and dropped in the new one? For some reason unknown to us, that sadly doesn’t currently work with our bundle. You will need to restart openHAB after removing the old bundle.
I just tried again on my test system with the 0.9.3 jar from GitHub and it works well without errors.
Can you try to restart?

I’m pretty sure I did: I replaced the file, shut down, cleared the cache and restarted.

Noticing the “2.4.0-SNAPSHOT” in the filename, I’ve now updated to the latest OH snapshot (I was running 2.3) and have a different problem:

2018-08-16 13:26:09.030 [WARN ] [org.apache.felix.fileinstall        ] - Error while starting bundle: 
file:/packages/OpenHAB2/addons/com.kuguhome.openhab.prometheusmetrics-2.4.0-SNAPSHOT.jar
org.osgi.framework.BundleException: Could not resolve module: 
com.kuguhome.openhab.prometheusmetrics [225]
Unresolved requirement: Import-Package: io.swagger.annotations; version="1.5.7.SNAPSHOT"

The runtime bundled version is 1.5.8

I am running the OH 2.3 release as well; the bundle is not tested with the latest openHAB snapshot, sorry. :frowning: Probably the io.swagger.annotations version changed recently.

@sjcliffe does it work for you now with the 2.3 stable release?