How the application talks to z-wave
Writing a host application/binding for z-wave is a big and complex job. The z-wave serial interface that allows commands to pass to and from the network to the host application/binding relies on careful coding and transaction management.
As I write this, Chris is posting tantalising updates indicating that he is making changes that will improve the performance of the binding further, so it is an exciting time for binding users.
The openHAB binding and many other host applications are already very good, but in the end it is not simple to send and receive messages that require a response across a complex web of devices and make sure every response gets back to the correct place.
The API uses a callback mechanism to make sure the response from the z-wave controller is matched to the command that was sent by the application/binding. If you have never written a computer program, think of this as being similar to when you shout down the stairs and wait for an answer to your question. You know whether the answer relates to your question. In the same way, the callback lets the binding know which question an answer belongs to.
The application/binding waits for this callback but for how long does it wait before the code gives up and assumes it is not coming back? Just like you standing at the top of the stairs there is a point when you are going to shout again or give up. The amount of time that the binding should wait for each type of command is defined in a list of standard wait times after which the application/binding should consider that no response is coming.
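For those who do write code, here is a minimal sketch of the idea, not how the openHAB binding actually does it. Every name in it (`CallbackTracker`, `register`, `resolve`, `expire`) is made up for illustration, the timeouts are placeholders rather than the values from the Z-Wave documentation, and I have assumed a one-byte callback id where 0 is not used.

```python
import time


class CallbackTracker:
    """Matches controller callbacks to the commands that asked for them.

    Hypothetical sketch: a real host takes the wait time for each function
    from the Z-Wave documentation; the values passed in here are placeholders.
    """

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._pending = {}  # callback_id -> (command description, deadline)
        self._next_id = 1

    def register(self, command, timeout_s):
        """Remember a sent command and return the callback id to send with it."""
        callback_id = self._next_id
        # Assumption: ids fit in one byte and 0 is not used, so cycle 1..255.
        self._next_id = self._next_id % 255 + 1
        self._pending[callback_id] = (command, self._clock() + timeout_s)
        return callback_id

    def resolve(self, callback_id):
        """Called when a callback arrives; returns the matching command or None."""
        entry = self._pending.pop(callback_id, None)
        return entry[0] if entry else None

    def expire(self):
        """Return (and forget) commands whose callback did not arrive in time."""
        now = self._clock()
        late = [cid for cid, (_, deadline) in self._pending.items() if deadline <= now]
        return [self._pending.pop(cid)[0] for cid in late]
```

The host would call `expire()` regularly and decide for each returned command whether to shout again or give up.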
There are lots of reasons that the callback may not come back in a timely fashion. The node may be broken or have no power, the route may be bad, or transient conditions may have disrupted the network, for starters. These failures are inevitable over time but do not happen every day.
On top of this, all of those devices can send reports to the controller updating their status. These have to take priority across the serial interface, or the network, controller and devices will start to clog up, as they are low-power devices with very little memory.
There is one small paragraph in the developer's guide that tells you what will happen in the worst case if too many incoming reports hit the controller.
6.4.3 Missing callbacks
In some situations a serial API callback may be lost due to an overflow in the UART transmit buffer. This condition may occur if a lot of unsolicited traffic comes in from the Z-Wave side. For this reason a SerialAPI based host application SHOULD guard all its callbacks with a timer. The timer values are given in references , , ,  or  for each of the Z-Wave API functions which use callbacks.
I have tested and I can absolutely confirm that the MAY could be changed to WILL BE if there are too many reports and commands flowing.
Another misconception is that the controller and devices have queues and can buffer a lot of messages. They do not.
In z-wave all queuing is in the application/binding, and a big queue depth is not good if commands need to be executed in a timely fashion. If a callback fails and the command needs to be retried, that is bad enough for you as a user, but then inevitably anything queued up behind it is going to wait longer, so it is really bad for you. It is no different from a trip to the doctor's, where you can end up with a long wait.
Now that you are thinking about the challenges that Chris is managing in the binding, we can start to think about how we as users can help ourselves and get the best out of z-wave and the excellent and soon-to-be-improved z-wave binding or, if you are not an openHAB user, make the best of your platform.
I have talked to many network owners about their networks, and in theirs and certainly in mine there are a lot of reporting sensors.
My home is packed full of capability to measure power, energy, luminance, temperature, voltage, power factor, reactive energy, humidity, movement, door and window open or close, and valve position, and as I am a geek I still have CO2, pressure and a few others on my list.
All of these sensors can be configured, and by default are configured to send a lot of reports very regularly.
I had a network when the first modules with energy measurement came out. The network was stable and fast but as I added a few of these modules I started to notice lags and hiccups. This was before zniffer was available as a free download.
When I started investigating by testing if one or another module was the culprit, I discovered the symptoms were markedly worse when I turned my coffee machine on.
I assumed a faulty module so I swapped it for another. Sadly, no difference. I was starting to think there was an issue with my coffee machine, which I am very dependent on, so I was not feeling great.
As the coffee machine was hot I thought a cup of coffee would help.
This is not the machine I had at the time, but it gives you an idea of how strong I like my coffee.
And as I sipped that double shot of coffee I noticed that the little light that indicates if the coffee machine’s boiler was heating or not was pulsing and the power of coffee to fuel the mind was proven. The machine has a PID controller on the element. The way this controller was working was changing the power at high frequency to keep the coffee boiler at the ideal temperature.
The module was of course detecting these changes and doing what it was designed to do. Sadly, in the process it was flooding the network with reports.
Now if I had been interested in mapping how well the PID was working the reports may have been vaguely useful but I was not and I certainly valued my automation performance more than having lots of reports.
And from that point on I took great care of how I set reporting modules up.
A couple of years later I was in communication with Peter (@petergebruers), with whom I had discussed the welding of relays due to inrush current in the past. I had read and was trying to decode the z-wave documentation that had been released, much of which I had interpreted badly the first time I read it.
Luckily Peter is a patient and generous person and eventually I untangled my understanding. During the conversations we both realised that we had observed the same issues with too much reporting and that many people were suffering from the impact of their device configuration. They thought they had bad devices or z-wave did not work or there were issues with routing or the controller was rubbish.
And a few of them did have bad devices and other issues like faulty controllers but for the vast majority it was just too many reports.
In his generosity Peter wrote a simple script to analyse the volumes of reports from devices and make some observations how the configuration could be improved. A similar script could be written for the openHAB z-wave binding logs. He called it the Sanity Check Script and it deserves that name because it has saved a few people’s sanity and possibly a few marriages.
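To give a flavour of what such an analysis involves, here is a rough sketch of counting reports per node from log lines. This is not Peter's script and it does not use the real openHAB log format: the line shape in the comment and the names `count_reports` and `noisiest` are my own assumptions, so the pattern would need adapting to whatever your platform actually writes.

```python
import re
from collections import Counter

# Assumed line shape, NOT the binding's real log format, for example:
#   "12:00:01.250 NODE 5: METER_REPORT"
LINE = re.compile(r"NODE (\d+): (\w+_REPORT)")


def count_reports(lines):
    """Count report lines per (node, report type) pair."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


def noisiest(counts, top=5):
    """Return the `top` busiest (node, report type) pairs, busiest first."""
    return counts.most_common(top)
```

Feeding it an hour of log lines and looking at the top few entries is usually enough to spot a module that is chattering far more than the rest.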
I am not claiming all of these networks got to the point where the callbacks did not make it back, as warned in the documentation, but the symptoms reported included:
- lags and irregular execution of commands
- not able to add devices using NWI
- very slow to add devices
- lost reports
and many more.
As everyone has a different number of nodes and a different mix and need it is hard to say what the exact target is for a particular device. The general rule that I am sure @petergebruers was the key person behind, that has worked for many is:
No more than a base level of 10 commands and reports a second.
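If you can get timestamps for the commands and reports on your network (from logs or a zniffer trace), checking against that rule is a small job. The sketch below is only an illustration: the function names are made up, the timestamps are assumed to be in seconds and in ascending order, and the budget of 10 is simply the rule of thumb above.

```python
from collections import deque


def peak_per_second(timestamps, window_s=1.0):
    """Largest number of messages seen in any sliding window of `window_s` seconds.

    `timestamps` are message times in seconds, in ascending order.
    """
    window = deque()
    peak = 0
    for t in timestamps:
        window.append(t)
        # Drop messages that are a full window or more older than the newest.
        while t - window[0] >= window_s:
            window.popleft()
        peak = max(peak, len(window))
    return peak


def over_budget(timestamps, budget=10):
    """True if traffic ever exceeds `budget` messages within one second."""
    return peak_per_second(timestamps) > budget
```

A burst that trips `over_budget` now and then is less of a worry than a network that sits above the budget all day, so look at how often it happens as well as whether it happens.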
Now before you go and turn all of your reporting off as you have some of these symptoms, there is a further fly in the ointment. Some devices have bugs so that when you try to turn the reporting down, the device turns the reporting up so you need to change them gradually and measure the success or not of the result.
You will see these reports in the openHAB logs but you will see them a lot faster in zniffer, so again I recommend that if you are serious about getting your network into good shape you get yourself a zniffer.
There is also an interesting issue with switch modules that report power and energy.
If you want to switch a few on in fast succession, most of them immediately issue a power and an energy report on switching. Now if there were, say, 10 modules that control all of your open plan room's lighting, even if you get your background reporting down there will probably be a noticeable delay from first light to last. After all, that is 30-odd commands and reports (one command plus a power and an energy report for each module) in addition to whatever else is going on. The interesting thing is that if you experiment with leaving a pause, the end result may be better than if you send them all rapid fire and let the queue in the application send them through. It is worth the experiment of putting a 100ms delay between commands.
If you switch 20 modules as a background task, why would you want that to cause a delay because it happened just as you walked into a dark room? Again, you may want to experiment with a pause between each command to allow interactive commands to execute fast. I do this, for example, when all of my heating valves move from heat to economy heat. It makes little difference that I space the changes out with a minute gap, but it is surprising how small a delay I notice if my coffee machine does not switch on immediately when I want it, even though it takes 5 minutes to heat up. Perception of time is such an odd thing.
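As a rough illustration of that experiment, the helper below sends a list of commands with a pause between each one. It is only a sketch: `send_paced` is a made-up name, and `send` stands in for whatever your platform uses to issue one command, not a real binding call.

```python
import time


def send_paced(commands, send, gap_s=0.1, sleep=time.sleep):
    """Send each command, pausing `gap_s` seconds between consecutive sends.

    `send` is a placeholder for the function that hands one command to
    your platform; `sleep` can be swapped out when testing.
    """
    for index, command in enumerate(commands):
        if index:
            sleep(gap_s)  # pause before every command except the first
        send(command)
```

With 10 lights and the 100ms gap the whole scene takes just under a second to send, so try a few values of `gap_s` and judge the result by eye.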
Personally, I have turned off a lot of what I consider unnecessary reporting. Any reports that are delivered on a regular period I have disabled if possible, or set to the maximum interval. If I do not need power reports I turn them off, and so on.
I still have plenty of good information though. Possibly my total energy report is a kWh or two ahead compared to adding up all the individual devices, but does that impact my life like a 2s delay when I walk onto my landing at night? I only have power from my HEM, as I realised I really did not care if my washing machine was using 1kW or 2kW as long as I had the total kWh.
I hope this helps a few people get more out of their network.
Signing out now and heading for a coffee.