Programmatically check OH config files have loaded

idkpmiller · April 26, 2019, 11:07pm

Hi,
I am looking for a way to check if items, rules and sitemap files have been loaded and identify those that have been ignored due to errors and not loaded.
As this is typically a type in my case and it happens way too often, I would like to have one of my diagnostic rules look after this error condition. Once it has stabilised it should need zero maintenance to stop it being prone to the same typo issue

Anyone know if this can be achieved and how?

Thanks

Paul

RolfV · April 27, 2019, 12:22am

Just thinking out loud here basically:
Suppose that you have 3 main rules files, and a 4th diagnostic rules file.
For each rule file you could make an item that holds a date time like

DateTime MainRulesLastRun (RulesLastRun)
DateTime TimerRulesLastRun (RulesLastRun)
DateTime GardenRulesLastRun (RulesLastRun)

You might want to add them in a Group that I for now just called RulesLastRun.

Then at the bottom of each main rules file, you could add a rule that runs every hour/every 12 hours/every day, whatever you prefer basically, that sets the DateTime item of that file to the current date & time.

In your diagnostic rule file you can then check every same interval but let’s say a minute later, if all DateTimes have been updated (for example by looping through the group I suggested. That could in turn result in a notification if a DateTime item hasn’t been updated according to the prescribed interval, because something might be wrong then.

For items, you would have to check at least if you can get for example the state of 1 item in each of your item files. (using a try{} catch(){} statement for example).

For sitemaps, I’m not sure actually.

idkpmiller · April 27, 2019, 12:39am

Thanks @RolfV
I like the ideas a lot.

I will start to implement and see how it behaves, I have way more rule and item files so the scaling will slow me down initially, however its rare nowadays that I introduce a new one so once I catch up it should be easy to manage.

I need to ensure I write the diagnostic rule is a way to make it unlikely to require updating often.

Thanks
Paul

rlkoshak · April 27, 2019, 3:41pm

A big challenge is that typically you want something external to monitor a system. For example, if you implement rules to monitor whether or not rules get loaded, what happens when those rules fail to load? No alert.

This is an important principal in fault tolerant engineering, which is what you are after.

So what u would do is write a script that watches openhab.log for the 'loaded model x" lines and compared the list of expected and when one is missing generate an alert. You could also look for known errors like lines with “ERROR” and “syntax” in them.

If you were to do this from OH Rules, you can use the Log Reader binding to watch OH’s logs.

Rolf’s idea is a good one but it will only really tell you something about .rules files. For all the rest I think you need to watch the log files.

idkpmiller · April 27, 2019, 9:49pm

As I am a glutton for punishment I think I will try both ideas and advise how I get on. If nothing else it will provide feedback into the community.

I recognise the two main issues with Rolfs suggestion

It relies on an openhab rule to test openhab rules.
it really only works well for rules.

I can see the advantage of writing an external script, bash or python etc to monitor HO log files for specific strings, the issue I see here is I often catch the typo errors and then go back and correct them so it will need to keep a state for each of the errors to see if its corrected which is starting to make this complex.

Thanks

Paul

Virus-free. www.avast.com

idkpmiller · April 28, 2019, 8:29am

I have spent a lot of time on the Rules idea, and if I could make the check below a little more robust for NULL it would be usable while I investigate the log parsing, which to be honest is sounding like the far better solution, although I am consider a number of implementation types to see which I favour.

Can someone advise the best way to capture an item that has a state of NULL (according to REST) and not to let it fall through to the error throw

rule "Scehuled 60 minute diagnostics check"
when   
	Time cron "30 1 * ? * *"
then
	logInfo("diags-auto.rules", "Scheduled 1 hour diagnostic check.")

    RulesLastRun.members.forEach[ profile |
		try {
			var _rule = profile.name.split("_").get(0)
			var _LastSeen = (now.millis - (new DateTime(profile.state.toString).millis)) /1000
			if (_LastSeen > 3600 || _LastSeen==NULL || _LastSeen==UNDEF) {
				logInfo("diags-auto.rules","somethings wrong with "+_rule)
				sendMail("me@example.com", "Rule Diag Check.", "The rule "+_rule+" has not ran successfully in the last hour.")
			} else {
				logInfo("diags-auto.rules","rule running successfully: "+_rule)
			}
			logInfo("diags-auto.rules", "   Rule: "+profile.name+" Timer: "+_LastSeen)
		}
		catch(Throwable t) {
			logError("ReadPodID", "Error was caught: {}", t)
			sendMail("me@example.com", "Rule Diag Check error thrown.", "Error: "+t)
		}	
    ]
end

The body of the email I am getting is

Error: java.lang.IllegalArgumentException: Invalid format: "NULL"

Thanks

Paul

rlkoshak · April 29, 2019, 5:51pm

Just check if(profile.state == NULL) and take the appropriate action. I’m not sure what the appropriate action would be. NULL means that the profile Item has never received an update since OH last loaded or OH loaded the .items file those Items are defined in.

You should probably process the NULL ones separately.

idkpmiller · April 30, 2019, 5:51am

Thanks Rich,
I was indeed checking for NULL but not the profile.state
The code is now functioning as I expected and provides an hourly check that rules are being run.

DIAG RULE

///////////////////////////////////////////////////
// Actions scheduled ever 60 minutes
rule "Scehuled 60 minute diagnostics check"
when   
	Time cron "30 1 * ? * *"
then
	logInfo("diags-auto.rules", "Scheduled 1 hour diagnostic check.")

    RulesLastRun.members.forEach[ profile |
		var _rule = profile.name.split("_").get(0)
	
		if (profile.state == NULL) {
			logInfo("diags-auto.rules","somethings wrong with "+profile.name)
			sendMail("me@example.com", "Rule Diag Check.", "The rule "+_rule+" rule has not ran successfully in the last hour.")
			return;
		}
		
		try {
			var _LastSeen = (now.millis - (new DateTime(profile.state.toString).millis)) /1000
			if (_LastSeen > 3600 || _LastSeen==NULL || _LastSeen==UNDEF ) {
				logInfo("diags-auto.rules","somethings wrong with "+_rule)
				sendMail("me@example.com", "Rule Diag Check.", "The rule "+_rule+" has not ran successfully in the last hour.")
			} else {
				logInfo("diags-auto.rules","rule running successfully: "+_rule)
			}
			logInfo("diags-auto.rules", "   Rule: "+profile.name+" Timer: "+_LastSeen)
		}
		catch(Throwable t) {
			logError("ReadPodID", "Error was caught: {}", t)
			sendMail("me@example.com", "Rule Diag Check error thrown.", "Error: "+t)
		}	
    ]
end

The rule is in a file on its own so less likely to be caught out with a typo, and is setup to use a group and an item file as suggested by @RolfV

ITEMS

// Group definition
Group RulesLastRun


// files to check rules are loaded ok
DateTime alexa_Rules_LastRun 			(RulesLastRun)
DateTime appliances_Rules_LastRun 		(RulesLastRun)
DateTime asterisk_Rules_LastRun 		(RulesLastRun)
DateTime astro_Rules_LastRun 			(RulesLastRun)

As @rikoshak quite rightly pointed out this is not the ideal solution, it was not that easy to setup and requires ongoing maintenance AND only looks after rule files.
I selected the check to be made in each Rule file like the example one below every hour.

Individual RULE file entry

//  perform diag check every 60 minutes.
rule "Check 60 minute diags alexa"
when
    Time cron "20 0 * * * ?"
then
	alexa_Rules_LastRun.postUpdate(new DateTimeType)
end```

Now the alternative solution proposed by Rich is to use an external application or script to parse the openhab log file, I am sure there are some great utilities out there to do this with style, but in the end I chose a simple approach.
I run the bash script below three times a day using a crontab entry.

**CRONTAB**
```csv
0 6,16,23 * * * /etc/openhab2/scripts/chk_log

And the script chk_log

#!/bin/bash
# Checks openhablog for errors by Paul Miller 2019 
# Revisions: 0.1 - 29/04/2019 Initial Release 0.1 -

logger "***********************************************"
logger "Script $0 has been executed."
logger "***********************************************"

##set your email address 
EMAIL="me@example.com"

## set the openhab log filename and full path
LOG=/var/log/openhab2/openhab.log


#### DO NOT CHANGE anything BELOW ####

    MESSAGE="$(grep "error\|Refreshing" $LOG)"
    SUBJECT="Log check $(hostname) $(date) "
    echo "   $MESSAGE "  | mail -s "$SUBJECT" "$EMAIL"

Remembering to set the file to be executable
and then check your emails for the interesting log snippets.

The patterns to be matched can be increased I only have two showing “error|Refreshing”.

I also have this running.
As it really takes some time to get these setup exactly as you like I suggest a couple weeks trial before I make drastic changes.

I hope this helps some one else and once again thanks for the suggestions.

Regards

Paul

rlkoshak · April 30, 2019, 2:47pm

It occurs to me this might be a good use of the Expire binding. Then the timing out of the profile would actually be implemented outside of Rules. It would require a rethinking of your Rule as now instead of looping through all your RulesLastRun Items you would trigger a Rule when one of those members goes to UNDEF or OFF. See Design Pattern: Expire Binding Based Timers.

It also occurs to me, too late to be of much use to you, that this Rule is a special case of [Deprecated] Design Patterns: Generic Is Alive.

Either or both of these might give you some ideas for refactoring or refinement.

Thanks for posting the script solution too.