Modbus error management in openHAB

rossko57 · April 25, 2019, 10:52pm

These notes apply to modbus binding version 2.x
This is not a Modbus how-to, but rather a tuning guide for advanced users.
Related - Modbus performance management in openHAB 2

Fault detection

Later we’ll look at ways to handle faults and errors. But to begin, we’ll look at how you can detect a Modbus problem using openHAB, so that you can build rules to take some action: send a warning message, put control systems into a fallback mode, or whatever.

A serious fault with your Modbus slave or network will most often show up in openHAB as a persistent failure to communicate. The binding will log error messages into openhab.log of course, but that won’t trigger a rule.

Detecting UNDEF state

If you use the ‘stale data’ technique described later to set Items to UNDEF, it is of course easy to trigger a rule from state change. As this is Item based, you may gather many into a Group to “share” a rule.

rule "modbus fault"
when
   Member of gTellTales changed to UNDEF
then
   sendMail("you@email.net", "Fault report", "Modbus Item faulty " + triggeringItem.name)
end

Note that any fault is likely to affect all of the Things defined under one TCP/serial Bridge Thing (representing a physical slave); potentially several poller and many data Things.

You need only check on one Item from each slave for simple reporting of in-service faults.

Be aware that Items linked with a slave already broken at openHAB boot time will stay at NULL state, and not change. The binding would normally complete its first read poll before rules become available, and update “good” Items; so you could trigger a rule at System started and check for Items remaining NULL at that time.

Detecting Thing offline

Note that the TCP or serial bridge Thing representing your Modbus slave would NOT normally go OFFLINE in response to an error. That may seem odd, but the TCP or serial pathway to attempt communications is in fact still available.

A faulty slave will instead be detected by a poller Thing, this being the Thing that actually fails to read. Your configuration could have several poller Things for one slave.

We can detect a change using Thing based rule triggers.

rule "modbus thing fault"
when
   Thing "modbus:poller:slave22:s22_coils" changed
then
   var Tstatus = getThingStatusInfo("modbus:poller:slave22:s22_coils").getStatus()
   if (Tstatus.toString == "OFFLINE") {
      sendMail("you@email.net", "Fault report", "Modbus slave22 offline")
   }
end

Unfortunately there is not at the moment (OH 2.4) any elegant way to share rule code by using “group” style Thing triggers and finding out which Thing it was.

Write errors

None of the preceding methods will detect a problem writing over Modbus. In most cases, you will be reading from the same slave so a problem will get noticed eventually.

However, write-only slaves are not impossible, or perhaps you have some Very Important data that you need to be sure of, like a fire alarm.

The binding supplies extra channels for each data Thing, as described in the documentation, providing datestamps for last read and write error.

You could create an extra Item linked to one of those, as well as your “real” Item.

Switch my_coil_switch "my output" {channel= "modbus:data:slave22:s22_coils:d03:switch"}
Datetime MB_write_stamp "Last write error" {channel= "modbus:data:slave22:s22_coils:d03:lastWriteError"}

If there is a failure to write my_coil_switch, the timestamp Item will be changed.

rule "modbus write fault"
when
   Item MB_write_stamp changed
then
   sendMail("you@email.net", "Fault report", "Modbus slave22 important write error")
end

Stale data

If your Modbus slave has an error or fault, note what it says in the binding documentation:

In case of read errors, all data channels are left unchanged

So, your openHAB Items will contain the last valid data (from the last good read poll) indefinitely.

Often, that is desirable. Perhaps the fault is transient, perhaps your slave is rebooting, perhaps that cable will be plugged back in with an “Oops”, and another poll attempt will be successful soon.

But at some point the frozen data will become ‘stale’, and you will not want to continue to use it in your rules. For example, a “stuck” room temperature could result in heating never turning off.

How to avoid stale data

The easiest way is to make use of the expire binding. You may choose for each Item how long to tolerate stale data. To be consistent with other bindings, update the Item to UNDEF.

Number KitchenT "Kitchen Temp [%d]" {expire="2m, state=UNDEF", channel="modbus:data:sl22:temp:number"}

Regular updates from your Modbus poller keep restarting the expire timeout.

But beware – if you have used the updateUnchangedValuesEveryMillis parameter to reduce the frequency of updates to your Item, you must take that into account and set expire for some longer time.
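As a rough way to pick an expire period, the sketch below works out the longest gap you would normally see between Item updates and adds headroom for a few missed updates. It is plain JavaScript for illustration, not something the binding does for you. The function names and the missedUpdatesTolerated margin are made up, and it assumes an unchanged value is re-sent on the first poll at or after updateUnchangedValuesEveryMillis has elapsed.

```javascript
// Longest normal gap between updates for a value that is not changing.
// Assumption: the binding only looks at the data when a poll happens, so the
// gap is a whole number of poll periods (at least one).
function normalUpdateGapMs(refreshMs, updateUnchangedMs) {
    var pollsPerUpdate = Math.max(1, Math.ceil(updateUnchangedMs / refreshMs));
    return pollsPerUpdate * refreshMs;
}

// Suggested expire period: tolerate a few missed updates before going UNDEF.
// Returns null for a poller that never polls (refresh of 0), where expire
// would always fire.
function suggestedExpireMs(refreshMs, updateUnchangedMs, missedUpdatesTolerated) {
    if (refreshMs <= 0) {
        return null;
    }
    return normalUpdateGapMs(refreshMs, updateUnchangedMs) * (missedUpdatesTolerated + 1);
}

// 1 s poll, unchanged values passed on every poll, tolerate 2 misses
var fastItem = suggestedExpireMs(1000, 1000, 2);     // 3000 ms
// 1 s poll, unchanged values passed on once a minute, tolerate 2 misses
var slowItem = suggestedExpireMs(1000, 60000, 2);    // 180000 ms, i.e. expire="3m,..."
```

The second case is the trap described above: the poller still runs every second, but the Item only hears about it once a minute, so a 2m expire would be cutting it fine.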

Remember to configure your rules to handle UNDEF states without error. Also, should a slave be broken at system start, Item states will remain as NULL. Deal with both cases.

rossko57 · April 25, 2019, 11:05pm

Error performance overview

A Modbus network may have many slave devices, of varying types, located far apart. Modbus is generally designed to be robust, and recover from transient errors.

But continuous failure of a single slave or part of a network may have an unexpected impact on the performance of the surviving devices. This note is a guide to help you understand and minimize that impact. (If you have only one Modbus slave, this part is not really applicable.)

Working example

Let’s imagine a Modbus-RTU setup with SlaveA and SlaveB, each with ‘coil’ type registers controlling lighting relays, and ‘input’ type registers reporting wall switches.

Our Modbus binding configuration would have two serial bridge Things – describing the serial port for each slave – and four poller Things – one each for CoilsA, InputsA, CoilsB, InputsB – and we’ve chosen to read poll at 1s intervals.

To analyse a fault situation, first let’s understand a few key features of this openHAB binding.

Queued transactions

The binding processes read and write requests in a serial fashion. That’s to say, one query-response transaction at a time. Each transaction must complete (or fail) before we move to the next one in the queue.

This is a legacy of the original Modbus concept of many devices sharing one wired bus, where only one can “talk” at a time. But it can still apply even to TCP/IP Modbus devices - a TCP gateway may lead to a real serial bus with the same limitations.

In reality the binding separately manages queuing for each “endpoint”, where an endpoint is a single serial bus (ttyS0, COM1) or a TCP target IP address. Transactions with “real” TCP slaves at different IP addresses may therefore take place in parallel, but note that the pollers for each individual slave are still queued.

For slaves reached via a TCP gateway, they will all have the same endpoint and so will queue as though on a serial port.
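To picture which slaves end up waiting on each other, here is a small sketch that groups slaves by endpoint. It only models the rule described above and is not the binding's code. The slave records, names and addresses are hypothetical, and it assumes an endpoint is identified by the serial port name or the TCP host address.

```javascript
// Endpoint key: serial port name, or TCP host address (assumed, per the text).
function endpointKey(slave) {
    return slave.transport === 'serial' ? slave.port : slave.host;
}

// Group slave names by endpoint; each group shares one transaction queue.
function queuesByEndpoint(slaves) {
    var queues = {};
    for (var i = 0; i < slaves.length; i++) {
        var key = endpointKey(slaves[i]);
        if (!queues[key]) {
            queues[key] = [];
        }
        queues[key].push(slaves[i].name);
    }
    return queues;
}

var queues = queuesByEndpoint([
    { name: 'rtuA', transport: 'serial', port: '/dev/ttyS0' },
    { name: 'rtuB', transport: 'serial', port: '/dev/ttyS0' },
    { name: 'tcpC', transport: 'tcp', host: '192.168.0.10' },
    { name: 'tcpD', transport: 'tcp', host: '192.168.0.11' },
    { name: 'gwE',  transport: 'tcp', host: '192.168.0.20' },  // behind a gateway
    { name: 'gwF',  transport: 'tcp', host: '192.168.0.20' }   // same gateway
]);
```

Here tcpC and tcpD each get their own queue and can be polled in parallel, while the two serial slaves share one queue and the two gateway slaves share another. A dead gwE would therefore hold up gwF, but not tcpC.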

Timeout

Slaves are allowed to be “busy” when queried, and data rates may be low over wired or TCP networks, so we must allow some time for a response to get back to the binding. We do want to detect “broken” slaves, or else we could wait forever, so the binding allows a Timeout period before abandoning this transaction.

Retry

No network is perfect: serial buses may corrupt messages due to electrical interference, TCP networks may drop packets due to overload, and slaves may be too busy to respond in time. The Modbus binding includes a basic automatic error recovery mechanism. It can retry a failed or rejected transaction a number of times, and this often works in the case of a transient error.

What can go wrong?

In our imagined setup, let’s power off the box containing SlaveB for some lighting maintenance.

The modbus binding has a queue of scheduled tasks – polling at one second intervals.

CoilsA, InputsA, would read-poll in the usual way.

Then we get to CoilsB, which isn’t going to respond since it is powered off. Binding sends a query, and the timeout setting takes effect. Binding will wait 1.5s (default timeout period) before deciding that this transaction has failed.

Now the retry mechanism kicks in – the binding tries again. Once more we will wait for the timeout period, for a response that never comes.

The default number of tries is 3, so altogether we will wait 3 x 1.5s for the broken slave, before giving up on this one transaction.

Binding will move on to the next transaction in its queue, InputsB. As slaveB is still broken, we will spend another 4.5s trying this poll.

Altogether then, we have spent 9s trying our two polls. Next in the binding’s queue is a poll for CoilsA, but we are now eight seconds late actioning this. Then it’ll be another 9s before we get around to it the next time. The poll rate for SlaveA is effectively now 9s instead of 1s, even though there’s nothing wrong with SlaveA. This is probably not acceptable.

As you can see, a single failure can have an unexpectedly large knock-on effect.
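The arithmetic in this walkthrough can be reproduced with a few lines of JavaScript. This is only a model of the numbers above, not binding code. The function names and the shape of the pollers array are made up for illustration, and the time taken by a healthy poll is assumed to be negligible.

```javascript
// Worst-case time one poll spends on a dead slave:
// every try waits out the full timeout before failing.
function failedPollMs(timeoutMs, maxTries) {
    return timeoutMs * maxTries;
}

// Time for one pass through an endpoint's queue.
// Each poller is { name, healthy }; healthy polls are assumed to cost ~0 ms.
function queuePassMs(pollers, timeoutMs, maxTries) {
    var total = 0;
    for (var i = 0; i < pollers.length; i++) {
        if (!pollers[i].healthy) {
            total += failedPollMs(timeoutMs, maxTries);
        }
    }
    return total;
}

var pollers = [
    { name: 'CoilsA',  healthy: true },
    { name: 'InputsA', healthy: true },
    { name: 'CoilsB',  healthy: false },   // SlaveB is powered off
    { name: 'InputsB', healthy: false }
];

// Defaults used in the example: 1.5 s timeout, 3 tries
var defaultPass = queuePassMs(pollers, 1500, 3);   // 9000 ms
// A tighter setting for comparison: 1 s timeout, 2 tries
var tunedPass = queuePassMs(pollers, 1000, 2);     // 4000 ms
```

The stall scales with timeout x tries x number of failing pollers, which is why each of those three gets its own look in the sections that follow.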

What can be done?

It’s worth pointing out that you can accidentally produce this effect by misconfiguring your Modbus Things to (try to) use invalid devices or registers. Make sure your Things are all correct!

The root of the problem is the poll/timeout/retry loop. We can modify this by altering settings from default.

Timeout

Transactions always take a real amount of time, so we must have some kind of timeout setting which allows for that. You have to take into account several factors.

How fast does your slave turn around a transaction? In the worst case it could wait for a motor to move or suchlike before acknowledging a command, a long time in computer terms.

How long are the inherent delays in your network? A TCP/IP network may have bridges and WAPs and gateways, each adding delay. A serial bus may run at low data rates.

Timeout parameters

Setting the timeout for a serial bridge Thing is straightforward, using parameter receiveTimeoutMillis.

Setting for a TCP bridge Thing is more complex. The host system (Unix, Windows etc.) will have its own TCP timeout defined, and we cannot exceed that, although we may go shorter. Be careful about altering that system setting: it is shared with other connections (e.g. internet traffic) and is generally best left unaltered.

Thing parameter connectTimeoutMillis does act like a simple timeout, whether you have configured to reconnect TCP for every Modbus transaction (the default), or have chosen to hold a TCP connection open for longer periods. If a Modbus-TCP slave or gateway fails, it soon becomes evident as an error reconnecting TCP.

If you use TCP-to-serial gateways, it gets complex. TCP settings for the openHAB-to-gateway path are as above, but timeout for the gateway-to-serial-slave(s) path is something you may need to set within the gateway itself.

In practice, I would recommend trying a timeout of one second. You could try shortening to a half-second, but I would not advise less than that. (See also retries below.)

Sometimes you just need a more generous setting; but you can have different settings for different slaves.

Polling rate

Many people instinctively want to poll Modbus slaves as frequently as possible, but it is worth taking a step back and thinking about this.

For a pushbutton input, for example, you do want a frequent poll, since a thumb press takes less than a second.

A room temperature sensor however – there’s no need to read that more than once a minute, or even 5 minutes…

Polling coils to see what state a relay is in is something worth considering in more detail. If you always control the relays from openHAB, do you need to read this back often? Or even at all? The binding does allow you to configure poller Things that don’t actually poll, and data Things (registers) that are write-only.

With use of no-poll pollers, you can implement a read-on-demand scheme. Issuing a REFRESH command to a Modbus Item from a rule will cause a one-time read poll (along with any other channels/Items also belonging to that poller). That would be a rare case, but has uses. For example, initializing an otherwise write-only Item at openHAB start up, by reading the slave.

Often you will reach a compromise here, because you want some of the registers in a poller’s ‘block’ more often than others, but the poll rate will have to be set by the most frequent. There’s not much benefit in splitting a poller Thing into two or more blocks, as you would then just be increasing the total number of polls.

Indeed it is worth looking at the opposite: can you combine poller Things? Sometimes you will have a slave with, say, registers 15 and 19 of interest, but registers 16-18 are undefined by the manufacturer. Some slaves, but not all, will let you read a block of registers including undefined ones. You could try to set up one poller Thing for all registers 15-19, but with data Things for just 15 and 19. (Caution: the Modbus protocol limits a single read request to 125 registers.)
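If you have a scattered list of registers, working out the single block that covers them is simple enough to sketch. This is illustrative JavaScript only. The function name is made up, the result is meant to map onto the poller Thing's start and length parameters, and the 125 limit applies to reads of 16-bit registers (coils and discrete inputs have a much higher limit, which is not modelled here).

```javascript
// Protocol limit for reading 16-bit registers in one request.
var MAX_REGISTERS_PER_READ = 125;

// Smallest single block covering every address of interest.
// Returns { start, length }, or null if one read cannot cover them all.
function coveringBlock(addresses, maxLength) {
    if (addresses.length === 0) {
        return null;
    }
    var first = Math.min.apply(null, addresses);
    var last = Math.max.apply(null, addresses);
    var length = last - first + 1;
    if (length > maxLength) {
        return null;   // too wide: needs more than one poller
    }
    return { start: first, length: length };
}

// Registers 15 and 19 from the example: one poll of 5 registers
var block = coveringBlock([15, 19], MAX_REGISTERS_PER_READ);      // { start: 15, length: 5 }
// Registers too far apart for a single read
var tooWide = coveringBlock([0, 200], MAX_REGISTERS_PER_READ);    // null
```

Whether the slave will actually answer a read that spans undefined registers is something only a trial will tell you.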

Reducing unnecessary polling has other benefits in reducing updating overhead in openHAB events, persistence, etc. as well as our main job here: to minimize how often a faulty device causes a “pause”.

Retries

The retry count acts as a multiplier for long term problems, so we should give it careful thought.

We DO want retries of some kind – no network is perfect, remember. But for frequently polled reads, another poll will be due soon anyway. Is that good enough?

You might consider reducing poller maxTries to 1, which means no retries.

For TCP slaves, bridge Thing connectMaxTries (default 1) will act as the effective retry count for a “broken” slave. But for a transient error, poller maxTries at 2 or more will allow recovery.

For TCP gateways, we get a complication. The TCP side may be working fine, but we can get a transient error on the serial side. Possibly your gateway can handle that itself, but more likely you’d need to consider setting poller maxTries to 2 or more for recovery.

For myself I set poller Thing read maxTries to 2, so allowing for one retry to overcome any genuine transient error.

The timeout setting comes into play here as well, as we do want to allow a sensible time between retries for the transient problem to go away, and so do not want a very short timeout.

For writing, each data Thing has an independent writeMaxTries parameter, defaulting to 3.

You are thus able to prioritise how hard you will try to write each register – if unimportant, could try just once. For something more important (e.g. a STOP function) try a few times.

In most cases though, writing is infrequent enough just to leave this at default.

Summary

With thought, we can optimize the Modbus binding settings to reduce the impact of external faults on performance of surviving devices, without greatly injuring the capability to recover from transient errors.

These are “passive” measures, and for many users this will be sufficient. After all, we hope not to experience many faults in our supposedly “robust” Modbus technology!

rossko57 · April 25, 2019, 11:14pm

Positive action

Having seen that the error/timeout/retry loop can cause performance loss for survivors, and done what we can to optimize the loop, there is one more trick we could use: simply not to enter the loop, or at least not so often. We’ve optimized our polling rate for normal use of course, but if we greatly reduce polling further after we’ve encountered a fault, we will also greatly reduce those time-wasting error loops. We should still retry once in a while, hoping that the fault has been repaired, but every five minutes is probably enough.
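To put a rough number on the benefit, the sketch below estimates what share of the time a shared queue is tied up by a dead slave. It is a back-of-envelope model in JavaScript with a made-up function name. The 9 s stall is the figure from the earlier SlaveA/SlaveB example, used here as an assumed upper estimate for one recovery attempt.

```javascript
// Fraction of time the queue is blocked by failing polls, given how long one
// round of failures takes and how often we let that round happen.
// Capped at 1: the queue cannot be more than permanently busy.
function stalledFraction(stallPerAttemptMs, attemptIntervalMs) {
    return Math.min(1, stallPerAttemptMs / attemptIntervalMs);
}

// Normal 1 s polling of a dead slave: the queue is blocked all the time
var normalPolling = stalledFraction(9000, 1000);
// One recovery attempt every five minutes: about 3% of the time
var slowRetry = stalledFraction(9000, 5 * 60 * 1000);
```

So the surviving slave sees one hiccup of a few seconds every five minutes, rather than being slowed down continuously.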

Slave Thing disabling

Earlier we saw how to detect a failing poller Thing and use that to trigger a rule. Within that rule, we can use the openHAB REST API to disable the TCP/serial bridge Thing that “owns” the poller. That will also disable any other pollers on this Modbus slave.

The rule can also start an independent timer that will enable the Thing again in a few minutes. At that time, the binding will attempt to recover the connection to the slave and restart polling. Should this fail again, our rule will be triggered once more.

rule "modbus slave22 error"
when
   // any poller Thing for this slave
   Thing "modbus:poller:slave22:s22_coils" changed or
   System started      // in case faulty at boot up
then
   var Tstatus = getThingStatusInfo("modbus:poller:slave22:s22_coils").getStatus()
   if (Tstatus.toString == "OFFLINE") {
      // disable the TCP/serial bridge Thing
      // this will stop new polling requests on all associated pollers
      sendHttpPutRequest("http://localhost:8080/rest/things/modbus:tcp:slave22/enable", "application/json", "false")
      // now schedule a future recovery attempt
      createTimer(now.plusMinutes(5), [ |
         sendHttpPutRequest("http://localhost:8080/rest/things/modbus:tcp:slave22/enable", "application/json", "true")
      ] )
   }
end

Note that you will likely see a couple of further errors logged after the disabling, as already scheduled polls work through the queue.

There is a potential snag – if we reboot with a faulty slave then the rule will not be triggered by poller status change and we would never reduce polling or attempt recovery. The workaround is addition of System started to the rule trigger; normally the poller will not be offline at boot, but if it is we will start our periodic attempts instead of regular polling.

You could instead use the REST API to alter the refresh parameter of the poller Thing. You would also need to arrange for restoration of “normal” refresh parameter. There are drawbacks to that method:

(a) It won’t work if you use xxx.things text files to configure modbus, because Things defined in text files cannot be edited later through the REST API.

(b) You may have several pollers to manage for each slave.

Summary

Some simple rules can further reduce performance impact of Modbus faults, while still allowing for automatic recovery after “repair”.

gitMiguel · April 26, 2019, 4:16am

Really good and useful writeup even for experienced modbus users. IMO, this post should be linked in the official modbus 2.x docs troubleshooting section

rossko57 · August 13, 2019, 10:07pm

This is a more developed solution to automatically disable failed Modbus Things, and re-enable periodically to attempt recovery.
There are two parts to it.

But first, create a dummy Item to use as a communication tool between the rules

// catcher for Modbus Poller Thing status changes
String MBpollerChanged "Failed Modbus UID [%s]" <tools> {autoupdate="false", expire="20m,state= "}

I send commands here, rather than state updates, so that a flurry of incidents in quick succession will be handled in a nice queue and not overwrite each other.

That means we can use the unused Item state to hold any message we like, if we stop autoupdate working on the command. We can also use expire to auto clear the message if updates stop (i.e. errors stop).

The Thing status catcher

I stole this from here, thank you Yannick @ysc

As mentioned earlier, ordinary DSL rules do not offer neat ways to deal with status changes from several Things. NGRE rules do - but the PaperUI Experimental rules editor still hides the useful methods we want to use here.
What we can do is use the NGRE rules importer feature to get round that. We write the rule in JSON in a file, and then import into openHAB rules.

So make a file mbthingstatus.json and put it where you like. We only need to import it once, but you might want it again for a later reinstall or something. I chose to put mine in conf/scripts/

[
    {
        "uid": "MBpollerStatusChanged",
        "name": "Modbus poller thing status changes",
        "description": "",
        "visibility": "VISIBLE",
        "enabled": true,
        "triggers": [
            {
                "id": 1,
                "label": "When a ThingStatusInfoChangedEvent is received",
                "configuration": {
                    "eventTopic": "smarthome/*",
                    "eventSource": "",
                    "eventTypes": "ThingStatusInfoChangedEvent"
                },
                "type": "core.GenericEventTrigger"
            }
        ],
        "conditions": [],
        "actions": [
            {
                "id": 2,
                "label": "execute a given script",
                "inputs": {},
                "configuration": {
                    "type": "application/javascript",
                    "script": "// js poller thing status change reporter
                        var thingid = event.topic.split('/')[2];
                        // examine events for modbus:poller:xxx things
                        if (thingid.split(':')[0]=='modbus' && thingid.split(':')[1]=='poller') {
                            var myjson = JSON.parse(event.payload);
                            if (myjson[0].status=='OFFLINE') {
                                // pass to DSL rule via designated Item
                                events.sendCommand('MBpollerChanged', thingid.toString());
                                print('Modbus failure ' + thingid + ' ' + myjson[0].status);
                            };
                        };"
                },
                "type": "script.ScriptAction"
            }
        ]
    }
]

Clever folk could probably do more work in this rule, but I have chosen to pass the info to an ordinary DSL rule.
I would have thought we could filter for modbus things in the "eventTopic": "smarthome/*", line but that doesn’t work for me, so the filtering is done in the JavaScript code.
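If you want to try the filtering logic outside openHAB before importing the rule, it can be pulled out into a standalone function like the sketch below. The function name is made up, and the example topic strings are my assumption of what OH 2.x Thing status events look like (smarthome/things/<thingUID>/statuschanged). The split-and-compare logic is the same as in the rule script above.

```javascript
// Return the Thing UID if the event topic belongs to a modbus poller Thing,
// otherwise null.
function pollerUidFromTopic(topic) {
    var parts = topic.split('/');
    if (parts.length < 3) {
        return null;
    }
    var uid = parts[2];                 // third topic segment holds the Thing UID
    var segments = uid.split(':');      // binding : type : bridge : id
    if (segments[0] === 'modbus' && segments[1] === 'poller') {
        return uid;
    }
    return null;
}

// A poller Thing: passed on to the DSL rule
var hit = pollerUidFromTopic('smarthome/things/modbus:poller:slave22:s22_coils/statuschanged');
// The bridge Thing itself: ignored
var miss = pollerUidFromTopic('smarthome/things/modbus:tcp:slave22/statuschanged');
```

This runs in any JavaScript engine, so you can check your own Thing UIDs against it without waiting for a real fault.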

The file needs importing by typing in the karaf console

smarthome:automation importRules /openhab/conf/scripts/mbthingstatus.json

Disable and re-enable Bridge Thing

So I did this in an ordinary DSL rule

var Integer slowRetry = 10  // minutes for Modbus recover attempt

rule "Modbus poller failure"
when
	Item MBpollerChanged received command
	// command contains poller Thing UID
	// issued by NGRE rule when Thing status changes to OFFLINE
	// note that can happen multiple times for one incident
then
	logInfo("MBerrmgr", "poller went offline " + receivedCommand)
	// fetch poller Thing details from the REST API (3 second timeout)
	val pollerjson = sendHttpGetRequest("http://localhost:8080/rest/things/" + receivedCommand, 3000)
	if (pollerjson !== null) {
		// extract owning tcp/serial bridge uid
		val slaveuid = transform("JSONPATH", "$.bridgeUID", pollerjson)
		// record it in the dummy Item state, for display
		MBpollerChanged.postUpdate(slaveuid)
		// check if bridge is online still
		val slavestatus = getThingStatusInfo(slaveuid).getStatus()
		logInfo("MBerrmgr", "slave " + slaveuid + " status " + slavestatus)
		if (slavestatus.toString == "ONLINE") {
			// disable the TCP/serial bridge thing, not the poller
			logInfo("MBerrmgr", "Disabling slave " + slaveuid + " for " + slowRetry.toString + "mins")
			sendHttpPutRequest("http://localhost:8080/rest/things/" + slaveuid + "/enable", "application/json", "false")
			// now schedule a future retry
			createTimer(now.plusMinutes(slowRetry), [ |
				logInfo("MBerrmgr", "Setting slave ONLINE for retry " + slaveuid)
				sendHttpPutRequest("http://localhost:8080/rest/things/" + slaveuid + "/enable", "application/json", "true")
			] )
		}
	} else {
		logInfo("MBerrmgr", "OOPS disaster fetching REST for " + receivedCommand)
	}
end

This will disable the Modbus slave (TCP/serial thing) that owns a failing poller, and then try to re-enable again in ten minutes. If it fails again, we will go around the two-rule loop again.

You can display a line on your sitemap

Text item=MBpollerChanged visibility=[MBpollerChanged!=""]

Note that there are a couple of limitations to this method.

If the device is broken at system boot time, the binding will likely have already polled and failed before your rules are available. There won’t be a Thing status change to trigger management when the rules are ready.

If the system is closed down while a device is broken, the Thing OFFLINE (DISABLED) setting will be “remembered”, and poll attempts will not resume upon system reboot because the timer to do that was lost.

hobsA34 · January 26, 2021, 2:46pm

Just migrated OH2 to OH3. Works like a charm! Bindings: mqtt, modbus, astro, rfxcom.
In ‘Modbus error management’ it says: the …json file needs importing by typing in the karaf console. When I do the following:

openhab-cli console
openhab>smarthome:automation importRules /openhab/…/mbthingstatus.json
It gives: command not found: smarthome:automation.
How can I install that or what can I do? Thanks for help.


rossko57 · January 26, 2021, 2:57pm

That I cannot help with, but you can use the rules UI to directly create rules.
It’s probably possible to do all the work in one rule in the new environment.

hobsA34 · January 27, 2021, 6:15pm

Thanks. The current rule editor has indeed those capabilities. Again OH3 seems a great improvement.