MODBUS binding intermittent failures

Ahh, I do not know. Experience offers starting probabilities: (a) wiring faults are more likely than equipment faults; (b) earth faults give you the weirdest, most elusive symptoms.
(I do think you’ve eliminated earth troubles here btw)

But that incident with the small turnaround gap or even overlap is really suggestive of a design issue at the HVAC end. It is hard to conjure some external event that would make it do that.

We don’t yet understand that mystery switch you found that may affect error rate.

Yep. We must remember, though, that RS485 is a very old spec; modern chips can work reliably with weaker bias or none at all. I may be obsessing over nothing here. The dongle may be exactly as intended.
(Termination remains as important as ever btw)

Suggestion: leave the onboard stuff on the dongle alone and temporarily rig 1K resistors on wires to pull up A and pull down B. Let’s find out if it makes a difference as a temporary job.

What they’re talking about here is the “earth”: the C terminal, the Common line. For example, you’d connect the shield of a shielded cable to C. But if you connected it to C at both ends you’d risk an earth loop (like I was fretting about earlier).
The general practice is to connect the shield at one end only. It’s still earthed and does its job, but cannot provide a loop path.
I don’t think you need to get involved in any of that because you’ve already shown your devices are not earthed.

I’ll give this a try and see how it goes (Conveniently there are jumpers to temporarily attach THT resistors to :slight_smile:)

Shield I’ve heard of, but a few pages I’ve read suggested twisting one pair together to act as the common that is only connected at one end, and using another pair for the differential data pair. I tried that earlier today and it made no difference, which is what I was expecting. The link I provided before is one of many that suggest just that: “YES—please use a signal common wire, grounded at one end only”.

So a 1K pulldown and pullup on A/B (I’m assuming it doesn’t matter which is pulled which way?) resulted in a constant stream of 0-byte frames being received back and the slave in a constant offline state. Leaving only the pullup on B seems to have about the same error rate as before, and it seems just as intermittent.

Answered my own question out of impatience and swapped them around; a constant offline state resulted again.

Mm, that’s odd. It certainly does matter which way round. We’re trying to simulate a continuous stop bit (1). No spurious data can be received until someone comes along and superimposes a start bit (0), which is opposite polarity of course.

So yep, wrong polarity bias looks like a continuous start bit followed by zeroes.
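To illustrate the point (a minimal sketch, not taken from your captures): a UART receiver sitting on a line biased to the “start” polarity frames a start bit, clocks in eight zero data bits, and then finds a zero where the stop bit should be, over and over - which is exactly a stream of zero-byte frames.

```python
# Minimal sketch: what an 8N1 receiver makes of a line stuck at the "start" (0) level.
line = [0] * 30                            # the receiver only ever samples 0
for i in range(0, len(line) - 9, 10):      # frame as start + 8 data bits + stop
    data, stop = line[i + 1:i + 9], line[i + 9]
    value = sum(bit << n for n, bit in enumerate(data))   # LSB first
    print(hex(value), "ok" if stop else "framing error")  # prints "0x0 framing error", repeatedly
```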

This does not explain how you could get the same “data” for opposite cases.

I think I’m leading you up the garden path here. I suggest putting it all back to “normal” and using your analyzer to capture more data about the times of failure.

So having a pullup didn’t make a difference to the end results (though you can see in the captures that it has as far as voltage goes), nor did having a common connected at one end only (still confused as to why multiple places suggest this). One thing I did notice in analysing the captures is that what I thought was a latch-up is, I believe, actually just a very short no-transmission period where neither controller was transmitting, followed by the slave responding prematurely.
What I did find out is that I can easily see where the error occurs, as it has a repeated request after a brief pause (the minimum time between transactions set in the serial “thing”). What I can also see is that on every occurrence of this error, the transaction from the slave has ended abnormally short.
This is the full 10 second capture with a transmission error and a 1k pull up on B


Next is the full view of an earlier capture (the one that I thought was latching up), with no additional bias resistors

Lastly here is a capture with NO transmission errors (the logs were checked for each capture to confirm whether an error occurred during the time period) and a 1k pull up

As you can see in all of them, the poller timings are more or less the same at around 500ms, but looking closer at the repeated transactions, the slave response is abnormally short.
You can see the repeated response here is ~88ms (timings info on the right), which seems about standard from every measurement I took, but the slave response just before 7s was incredibly short.

This is the same thing occurring in all the other captures too. Here you can see it is much longer than the failed one above, but still comes in far shorter than normal at 59ms.

There doesn’t seem to be an apparent reason for the termination, such as additional noise on the line or a ramped pulse; it just ends (below is the end of the 59ms transmission)

Based on my testing and troubleshooting thus far, would it be fair to call this an HVAC controller protocol fault, given the short intervals between request and response and the shorter-than-expected transmissions?
I’d spend the effort and money on getting another adaptor if I knew that would fix the issue, but from what I’m thinking so far, none of that would resolve it.

A “short” response from the slave could well be a “nack”. Let’s suppose the slave detects an incoming CRC error. It can just remain silent, but the Modbus spec allows it to respond to the master with a fault code, which is the nicer thing to do.
EDIT - rubbish, the slave should not respond to an incoming CRC error.

This might be evidence of CRC errors in both directions.

I’ll look at the pictures properly later on a better screen.

openHAB should detect and report such fault codes, assuming they too are not corrupted, of course.

Slaves can also respond that way to nonsense commands, e.g. reading an imaginary register. In that case it would be a matter of looking at your poller config.

Screenshot 6 - “extra poll” we might call that, maybe.
How many poller things have you got at this time, just to be sure? The picture suggests only one, of course.

The double poll looks like things are working as they should, from the openHAB viewpoint.
The first poll of the pair is measurably on the same time schedule as the preceding polls.
Assuming it’s a poll for the same data, the response does indeed look either short or like it started very early.
Assuming also that the response is truncated at one end or the other, it’ll give a CRC error, as evidenced in your log, I think.
openHAB will not run the timeout - it’s got a response of some kind - so it will retry quickly, at soonest after the minimum gap configured. That would be the second poll of the doubled pair.
We can further assume the response was good this time around, because openHAB resumes its regular schedule i.e. no further retries.

I don’t believe ‘slave started responding early’, because it can’t respond until it’s finished receiving, CRC checking, and decoding the poll query.

Also evident in this picture is variable time between query and response; no biggie in itself, slaves are allowed to be about other business and answer when convenient.

Screenshot 7 - “short response”
What I said before about a slave ‘nack’ response to detecting a CRC error was rubbish. It is supposed to remain silent in that case (after all, if the message is corrupt, the slave cannot be sure it was the addressed target).
It can give a brief 5-byte exception response for “too busy” or “invalid command” type responses.
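For reference, a minimal sketch of what such a 5-byte exception frame looks like on the wire (standard Modbus RTU framing and CRC; the slave ID and exception code below are only illustrative):

```python
def crc16_modbus(frame: bytes) -> bytes:
    """Modbus RTU CRC-16 (poly 0xA001, init 0xFFFF); low byte is transmitted first."""
    crc = 0xFFFF
    for byte in frame:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc.to_bytes(2, "little")

# Illustrative exception response: slave 1, function 3 with the high bit set (0x83),
# exception code 0x02 ("illegal data address"), then the 2-byte CRC.
body = bytes([0x01, 0x03 | 0x80, 0x02])
print((body + crc16_modbus(body)).hex(" "))   # 5 bytes: addr, 0x83, code, CRC lo, CRC hi
```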
The question here then is what openHAB logged as going on, either CRC or exception, but I’m guessing you have seen no exceptions.
OH has definitely seen something bad, because we soon retry. That retry is successful and normal business resumed.

Screenshot 9 - detailed short response
hmm, is that possibly a correctly constructed 5-byte exception? Maybe, but looks too long to me.
What really catches my attention is the very beginning of that response. A nasty little spike, not a full swing. We cannot tell if that is too narrow timewise to register properly, or if it really didn’t swing the full voltage.

Screenshot 10 - a longer but still short response.
Definitely not an exception response. I think we can take that one as good solid evidence of truncation, because you know it should be a consistent length with the good responses.
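As a rough cross-check on “consistent length” (a back-of-envelope sketch only, assuming a function 03 Read Holding Registers poll at 9600 baud with 11 bits per character on the wire; the register count is merely inferred from the ~88ms good responses, so treat it as illustrative):

```python
BAUD, BITS_PER_CHAR = 9600, 11                 # 11 bits/char is an assumption (start + 8 data + parity/stop)
MS_PER_CHAR = 1000 * BITS_PER_CHAR / BAUD      # ~1.15 ms per byte on the wire

def fc03_response_bytes(register_count: int) -> int:
    # addr + function + byte count + 2 bytes per register + 2-byte CRC
    return 1 + 1 + 1 + 2 * register_count + 2

good_ms = 88                                   # duration of a normal response, as measured
chars = round(good_ms / MS_PER_CHAR)           # ~77 characters on the wire
registers = (chars - 5) // 2                   # ~36 registers, if the framing assumption holds
print(chars, registers, f"{fc03_response_bytes(registers) * MS_PER_CHAR:.0f} ms expected")
```

The point being that, for a fixed poll, every good response should come out at the same length and duration to within a character or so; anything noticeably shorter has been cut off.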

Screenshot 11 - detail of that truncated response.
Ooh look there’s a nasty little spike at the end of this truncated response, like it’s been chopped off midway through a bit.

Now, looking at those detail pictures again. I’m not sure of polarity here, but I think the query transmission always starts and ends with a “high”? Have a look and check.
I’d expect that; it’d usually enable the TX at the “stop” level, pause briefly (maybe just one bit time), and then begin the real serial data with a “start” bit. Of course it’d always end the same way, with a “stop” level. That might also extend for a time before the TX is disabled.
The point here is that begin/end levels will always be same for that device.
Conceivably some devices could open with a “start”, but again it would always be the same begin/end levels.
I’m expecting that comparing good with bad slave responses, you will find that first and last levels are sometimes inconsistent.

You should also look to see if I am overreacting to the begin/end spikes, and you see anything like that in good responses.

I’ll make a wild leap - the slave RS485 transmitter chip enable is wonky, turning on or off part way through the data stream. Might be sinister causes like software control glitches or power supply spikes, but I would bet it’s just chip failure.
Weighing against that, why doesn’t it stop and restart part way through a data package? I wouldn’t be that surprised if it simply doesn’t work that way, but you might eventually find evidence of this if you look.

Why don’t I think it’s the master dongle TX randomly turning on/off out of turn? Because if it did, it would drive the line high or low. Since we’ve got no bias (which I’ve now accepted :wink:), we would see that as different from the idling no-volts condition that we can see.
Our little poll queries don’t offer much opportunity to get corrupted if that did happen, but surely they would once in a while. The result would be no response from the slave, and a timeout log in openHAB, which we have not seen so far.

There is one more test to do before condemning it though - in the really unlikely event the slave takes around 500ms to respond to a poll, it could transmit a response just in time to cross paths with the next poll query, sometimes working, sometimes trampling. I don’t believe that for a second!
But let’s rule that out and change your polling period to 1500 or so and look to see if anything changes.

Yes, just the one poller set up for these tests. I removed the extras as they were just going offline and it would make it harder to fault-find. A master CRC or transmission error can be ruled out, as these images are the pulses from a known good scan stacked with a ‘bad’ occurrence and the retry. You can see the timings are identical, so from my standpoint the issue is only in the slave’s response.

Yes, that was pretty much the assumption I had based on my analysis, which is what is making me point the finger at the HVAC controller.

I don’t believe that the slave responded too early either, as the one with the 55ms response (as opposed to the expected 88ms) had a 5ms delay between the end of the master’s request and the start of the slave’s response, which is what should be there. That also eliminates data collisions as a possible cause, as there is the expected delay.

I’m not sure about screenshot 7, as that was from a few days ago, but the majority, if not all, of the other errors are CRC errors. The ones that occur in the other screenshots are definitely CRC errors. The retry is what I’d expect, and as per the signals side by side just above, you can see the request is exactly the same.

That is longer than 5 bytes. I’m not sure the spike is of concern either, as that should be seen as an error on a full-length response, not a truncated one. In addition, it appears that every single transmission that starts with a low pulse has that little spike, so my guess is it’s a little noise or blip as the slave goes from tri-state to low. The spike goes from 0 > 200mV > -400mV (-400mV being the beginning of the low pulse) in 2µs, which I think is far too quick to be registered as data. Even if it were hypothetically measured as a data pulse, the result would be a full-length transmission that registers as a CRC error. We can see that the slave is not interrupted by the master at all, so there’s no logical reason I can see for the transmissions being short.
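For scale, a quick sanity check on that 2µs figure (assuming 9600 baud):

```python
BAUD = 9600
bit_time_us = 1_000_000 / BAUD      # ~104 µs per bit at 9600 baud
spike_us = 2                        # measured width of the start/end spike
print(f"{bit_time_us:.0f} µs per bit; the spike is ~{100 * spike_us / bit_time_us:.0f}% of a bit")
```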

Every transmission from the slave seems to begin and end with a spike in the opposing direction of the first/last pulse. I’m guessing it’s something of a flyback as the slave comes out of or goes into its idle tri-state. Again, as per the starting spike, that spike measures at around 2µs.

It looks as though the master always begins and ends on the same level pulse, while the slave always appears to begin and end on opposing pulses. This appears true for both truncated and correct responses. That being said, I’ve only analysed half a dozen faults and 30 or so good transmissions, so the sample size isn’t huge, but I feel it’s enough to conclude that the begin and end states are consistent most or all of the time.

Noted earlier in this response

I am going to go with a software/hardware design flaw, as I have a ‘spare parts’ controller (an earlier revision of the device that alarms on every error condition) that I tried last night, and it also generated CRC errors. Unfortunately, when it generated those errors the controller alarmed and I thought it was going to be a 1AM wake-up for my kids :shushing_face:. Interesting to note, the older model controller DID NOT work with the pullup resistor in place. After removing the added bias the controller began responding to polls, and it too generated CRC errors after a short while. I was unable to repeat this to capture errors due to the noise the controller generated in its fault condition, but analysis of the good transmissions yields identical details as noted above: approx 88ms, beginning and ending on opposing pulses, and a spike in the opposing direction of the beginning or ending pulse as it enters or comes out of tri-state. A design flaw would also explain why the transmission isn’t ending suddenly.
Also, after analysing the bad responses of each capture, it is worth noting that they begin differently to every other good response too, which, given the data at the beginning should not change, obviously should not occur. It’s a little messy, but here is the start of the long bad response, the entire short bad response, and the start of two good responses stacked.

If that were the case, I’d expect to see the stronger master signal showing in part of the transmission (seeing as it is at a substantially higher voltage than the slave’s), which it is not. I have, however, previously tried a longer polling time (and a longer time-between-transactions) which gave no improvement in results. It is also the reason I changed the baud lower and higher.

signals2
You can read the bits here; we know we’re expecting x01 x03 …
And it’s LSB first, start=0, stop=1, so the good bits are
start 1 0 0 0 0 0 0 0 stop start 1 1 0 0 0 0 0 0 stop …
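A minimal sketch of that decode (treating each character as start + 8 data bits + stop, as listed above, LSB first), just to show how the sampled levels map back to bytes:

```python
def decode_chars(bits):
    """Decode a flat list of sampled levels (start + 8 data bits LSB-first + stop) into bytes."""
    out, i = [], 0
    while i + 10 <= len(bits):
        if bits[i] != 0:                # hunt for a start bit (0)
            i += 1
            continue
        data, stop = bits[i + 1:i + 9], bits[i + 9]
        out.append((sum(b << n for n, b in enumerate(data)), stop == 1))
        i += 10
    return out

# The two good characters read off the trace: start 10000000 stop, start 11000000 stop
sampled = [0,1,0,0,0,0,0,0,0,1,  0,1,1,0,0,0,0,0,0,1]
print([(hex(value), ok) for value, ok in decode_chars(sampled)])   # [('0x1', True), ('0x3', True)]
```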

The bad traces are nothing like we expect. I suppose it’s possible the slave firmware makes a terrible mess of setting up the data packet; maybe there’s a big buffer that messes up on wraparound or something.

Are all these traces to the same timescale / baud? Because the very short bad one has different bit times in it!

Still think this may be a hardware control kind of flaw, rather than a data structure build problem. You had a thought that there may be cross-contamination with another serial service on this board? It may be more about misrouting data than noise.
But who cares - they’re not going to fix it for you, so we’d best come up with a way to live with it as best we can.

I would stick to 9600, as the panel’s default. I cannot see any value in messing with other low level stuff e.g. stop bits.
Little point in messing with timeouts for CRC errors.

The OH retry mechanism works, but I will suggest you abandon it. That’s counterintuitive, but there’ll be another poll along shortly. Your OH data is preserved, not set to UNDEF, so at most your data will only be a poll or two out of date. If you set the poller thing maxTries to zero, the poll itself becomes the retry. The slave gets a rest.

It costs nothing to set the serial Thing timeBetweenTransactionsMillis more generously, say 80ms, to let the slave have a breathing space. (Obviously that doesn’t influence the slave’s turnaround time.) This might be of benefit when you have multiple pollers.

Try to review your fetch size. Do you really need a hundred registers? Of course, the ones you want may be scattered across the address range. One big poller is less overhead than several small ones. Against that we weigh that whatever the slave’s sickness is, it’s probably more likely to show up in a big transfer. This is play-it-by-ear country.

All are at the same baud, but the timescale might be slightly off: there’s no option to pick a % to zoom to, so as they are from different captures I’ve had to zoom in to the closest approximation of it. With the ones that had identical patterns it was a lot easier, as I could ALT+TAB between screenshot and capture to get the zoom correct. I put it there mostly to show that there is a significant difference at the beginning, when one would expect it to be the same.

Probably not. They were good the first time around about sending out a replacement, due to repeated alarms when the ‘system temp low’ alarm triggered overnight (when the temperature dropped and the HVAC was off), but I can’t imagine them being too willing to do that again. I’ll contact them nonetheless with what I have here and see what they have to say. The MCU is a PIC, so if it is a firmware bug they might even be able to send me a compiled image of a newer version to flash to it.

Stop bits and parity are hard-coded in the HVAC controller; the only options I can change are baud and ID.

Wouldn’t that cause the poller to be flagged as offline and a 10s delay before attempting a reconnect?

I did try this as well, but not with multiple pollers, just with the single poller to create a bigger delay between retries.

Yes and no. There are more things to go wrong when you have multiple pollers, and I need to poll about 45 registers in 8 blocks over a range of about 150 registers. The other data in between is things like zone names, but realistically they won’t change, so they don’t have to be read. The 8 blocks are 6 x zone status, temps, settings / 1 x system status, temp, settings / 1 x outputs.
My theory was that it would be more efficient to poll all registers at once and then pull the data from the large poller rather than having it update from multiple pollers or individual items.
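For what it’s worth, a rough back-of-envelope on that trade-off (a sketch only: standard function 03 frame sizes, timings assuming 9600 baud and 11 bits per character, and the 45 registers split across the 8 blocks purely for illustration):

```python
MS_PER_CHAR = 1000 * 11 / 9600                  # ~1.15 ms per character on the wire at 9600 baud

def poll_chars(register_count: int) -> int:
    # 8-byte request + (addr + func + byte count + 2*N + 2-byte CRC) response
    return 8 + (5 + 2 * register_count)

one_big = poll_chars(150)                                  # one poller spanning the whole range
eight_small = sum(poll_chars(n) for n in [6] * 7 + [3])    # 45 registers in 8 blocks (illustrative split)
print(f"one big poll : ~{one_big * MS_PER_CHAR:.0f} ms on the wire per cycle")
print(f"eight polls  : ~{eight_small * MS_PER_CHAR:.0f} ms on the wire, plus 8 turnaround/idle gaps")
```

So the big poll pays for the ~105 registers it doesn’t need, while the split pays per-transaction overhead, turnaround gaps and more things to configure and go wrong; which of those the slave tolerates better is the play-it-by-ear part.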

I didn’t think it would … I’ve not currently got any serial on binding v2 to play with.
But, is that a disaster anyway?

Maybe set maxTries to 2 or 3 then.