Zigbee binding

Hi @chris,

Normally yes, but if we suddenly try to send hundreds of messages to different nodes, I would expect it to grow accordingly and go back to zero after a while (as it is doing). I do not think this is necessarily broken (and this is happening with a standard OH and ZSS).

In 1.1.10, I see this as normal on a big network when initializing (as long as it goes back to 0). With the transaction manager, I agree this should not happen.

As I see it in the logs, the dongle is responding correctly. After every UCAST, it is replying with a SEQ:xx and an OK (or an error when there are too many unicasts on air, later on, but not at the beginning).

Looking at new logs, I think the issue might come from the TransactionID overflowing (it is only 0-255, of course), with a response for one command “killing” the transactions of more than one sent command (with the same TID)?

Looking at the logs, I do not see issues in the SEQ/OK/ACK/NACK sequences. Looking further into the “rolling over” TIDs.


PS: I am building my own libraries

@puzzle-star I note in your log that it seems you’re also trying to run the ZSmartSystems console -:

2019-02-16 20:09:50.331 [WARN ] [org.apache.felix.fileinstall        ] - Error while starting bundle: file:/usr/share/openhab2/addons/com.zsmartsystems.zigbee.console.main-1.1.11-SNAPSHOT.jar

I don’t think this is likely to cause a major problem, but it is causing a couple of exceptions in the log.

Maybe this is downloaded via @5iver’s magic upgrade script? If so, maybe one to remove Scott?

Nope, not from the script!

@chris, @5iver,

Not from the script. I am just copying all the built jars from the MVN target directories to the addons folder. I will remove both the console and serial jars, which are causing the exceptions.

I disagree - at least with the Transaction Manager that should not be the case. It should always be close to 0 - this queue is only there to buffer messages to the dongle to take up the slack in the communication with the dongle (which is quite slow with the Telegesis dongle).

All transactions should be paced by the dongle.

Agreed - my initial comment was based on your statement that the timer was going off, which should only happen if the dongle doesn’t respond.

Why do you think this? I’m not quite sure how it can “overflow” so can you explain your theory? I do think something is not right in this area. If I search for “-> TX_ACK ==” there are only the first 10 transactions shown - after this they all seem to be NAK.

Hmmm - I don’t think this is the issue. I think it’s as I mentioned earlier with the “Telegesis interrupted in sendRequest” messages. The transactions are being interrupted, so there will be NO transaction Id in the returned packet from the frameHandler.sendRequest call, and therefore the correlation fails.

If you look at the log, the transactions work ok until the “interrupted in sendRequest” messages start - then nothing works properly.

Can you try increasing TelegesisFrameHandler.transactionTimeout - maybe try a couple of seconds. I wonder if 1 second is too short on some systems (you didn’t answer my question about what hardware you have?)

We agree to disagree :slight_smile: With the transaction manager this should not happen. Without it, at least in my installation it has been happening for months


Well, I understand that only up to the ACK/NACK. But the transaction will not be completed until we receive the response we expect (e.g. when reading an attribute). The transaction manager will not send more transactions (if the max is reached) even if there are no packets “on air”. Anyway, that is not the issue here.

The transaction counter is an “integer & 0xff”. As new transactions come through, we assign a TID and increment. After a while, I have seen in my logs a new transaction being assigned a TID that is already in use by an outstanding transaction. When the response reaches the transaction listener, it matches both. Not usual, but the probability is greater than 1/256 once the network has been running for a while and there are some retried transactions in the queues, which is not negligible.
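To illustrate the theory, here is a minimal sketch (hypothetical names, not the library’s actual transaction manager code) of how a counter that wraps at 0xFF can hand out a TID that is still attached to an older queued or retried transaction, so one response then matches two transactions:

import java.util.HashSet;
import java.util.Set;

// Minimal sketch (hypothetical, not the real ZigBeeTransactionManager): a TID
// counter that wraps at 0xFF can collide with a TID still held by an old
// queued or retried transaction, so a single response may then match both.
public class TidWrapSketch {
    private int counter = 0;
    private final Set<Integer> outstandingTids = new HashSet<>();

    public int allocateTid() {
        int tid = counter;
        counter = (counter + 1) & 0xFF;
        if (!outstandingTids.add(tid)) {
            // Collision: an earlier transaction with this TID has not completed yet
            System.out.println("TID collision on " + tid);
        }
        return tid;
    }

    public void completeTid(int tid) {
        outstandingTids.remove(tid);
    }
}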


rPi3. Remember the issue when writing the network state file :slight_smile:

I will increase the timeout, but I do not see too many of these in the logs, as you said, so it is definitely not causing the issue.

Yes, without the transaction manager there is no other queue, so it will happen. The purpose of the transaction manager is to better manage the queue and to handle the pacing and timing of transactions.


True, but that should be the same thing. If there is a response, it gets completed - if there is no response, then the dongle will send the NAK. The transaction manager will wait for something like 12 seconds for the response, but the dongle should time out within 8 seconds max (the ZigBee MAC polling rate), and normally less if it’s not an ED, so in general the dongle should always pace the transactions.

Well - that should not happen unless there are more than 256 transactions outstanding. Since you have a lot more than this, I suspect that what you are seeing is an artefact of the problem, and probably not the cause (??).

Yep - I do now :wink: . I get so many logs and my memory doesn’t have enough room.

It’s a different timeout. There is a 5 second timeout and a 1 second timeout - I got them mixed up earlier; it is the 1 second timeout within which the response is expected from the device. It is used in the future.get() call.

    public TelegesisStatusCode sendRequest(final TelegesisCommand command) {
        Future<TelegesisCommand> future = sendRequestAsync(command);
        try {
            // Wait up to transactionTimeout milliseconds for the response to arrive
            future.get(transactionTimeout, TimeUnit.MILLISECONDS);
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            // Timed out or interrupted - cancel the request and report failure
            logger.debug("Telegesis interrupted in sendRequest {}", command);
            future.cancel(true);
            return null;
        }

        return command.getStatus();
    }

Not really. By “outstanding” I mean we can have up to 10 in the transaction manager (I am speaking of transaction IDs, not frame SEQs from the Telegesis). In the per-device queues there is no limit (other than the number of transactions that can get there within the timeout period).

Commands are assigned a TID when they are queued. If there is a burst (e.g. network discovery, or startup) and/or retries, these will end up in the queues with old TIDs assigned; after several retries and some time, even with only 10 outstanding transactions, a new one will end up matching the TID of an old queued one (I have seen it in today’s logs).

In a (probably abnormal) situation with lots of queued transactions and retries, it is definitely happening. I am looking into it in case it has something to do with this.

Definitely, the frame handler queue should not grow with the transaction manager, but it is growing, so there must be “something somewhere” :slight_smile:

Aware :slight_smile:

Ok, by outstanding, I meant in the dongle queue, which would “overrun” the translation map used to link the TID to the dongle message Id.

I thought I’d changed this so they are allocated when they are sent, but it seems you’re right. I will change this.
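For reference, this minimal sketch (hypothetical names, not the actual binding code) shows the idea - transactions are queued without a TID and only get one when they are released to the dongle:

// Hypothetical sketch: transactions are queued without a TID and only receive
// one when released to the dongle, so a TID can no longer sit unused in a
// per-node queue or retry list long enough for the counter to wrap onto it.
class SketchTransaction {
    Integer tid = null;          // stays null while the transaction is only queued
    final Runnable send;

    SketchTransaction(Runnable send) {
        this.send = send;
    }
}

class SketchTransactionManager {
    private int counter = 0;
    private final java.util.Deque<SketchTransaction> queue = new java.util.ArrayDeque<>();

    void queueTransaction(SketchTransaction transaction) {
        queue.add(transaction);      // no TID assigned here
    }

    void releaseNext() {
        SketchTransaction transaction = queue.poll();
        if (transaction == null) {
            return;
        }
        transaction.tid = counter;   // TID assigned at send time
        counter = (counter + 1) & 0xFF;
        transaction.send.run();
    }
}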

Yes, I’m pretty sure it is this timer I mention above


The mechanism is as follows -:

The transaction is added to the dongle queue, the timer is started, and the dongle transaction waits to complete for up to 1 second. This fails, and we see the “Telegesis interrupted in sendRequest” message showing that the future failed. This ultimately results in a NAK completion being sent to the transaction manager, which then releases the next transaction.

The code that sends the NAK is this -:

            public void run() {
                frameHandler.sendRequest(command);

                // Let the stack know the frame is sent
                zigbeeTransportReceive.receiveCommandState(msgTag,
                        command.getStatus() == TelegesisStatusCode.SUCCESS ? ZigBeeTransportProgressState.TX_ACK
                                : ZigBeeTransportProgressState.TX_NAK);

                // Multicast doesn't have a sequence returned, so nothing more to do here
                if (command instanceof TelegesisSendMulticastCommand) {
                    return;
                }

                messageIdMap.put(((TelegesisSendUnicastCommand) command).getMessageId(), msgTag);
            }

In the above, the sendRequest fails due to the interrupted exception with this 1 second timer. The transaction is then not SUCCESS, so the TX_NAK is sent back to the manager.

We see that in the logs as well -:

2019-02-16 20:09:51.947 [DEBUG] [gesis.internal.TelegesisFrameHandler] - Telegesis interrupted in sendRequest TelegesisSendUnicastCommand [address=48557, sourceEp=0, destEp=1, profileId=260, clusterId=0, messageData=00 07 00 04 00, messageId=null]
2019-02-16 20:09:51.949 [DEBUG] [tsystems.zigbee.ZigBeeNetworkManager] - RX STA: msgTag=07 state=TX_NAK

Got it! BTW, the dongle handler transaction is completed when we get the SEQ + OK/ERROR responses, so no need to wait for >8 seconds. But, as you say, the 1sec is not enough:

2019-02-17 19:09:10.393 [DEBUG] [transaction.ZigBeeTransactionManager] - transactionListenerAdded: 6 outstanding
2019-02-17 19:09:10.399 [DEBUG] [tsystems.zigbee.ZigBeeNetworkManager] - TX CMD: BindRequest [0/0 -> 20935/0, cluster=0021, TID=06, srcAddress=90FD9FFFFE8365FF, srcEndpoint=1, bindCluster=768, dstAddrMode=3, dstAddress=000D6F001113A331, dstEndpoint=1]
2019-02-17 19:09:10.401 [DEBUG] [.zsmartsystems.zigbee.zcl.ZclCluster] - readSync request: ZclAttribute [cluster=ELECTRICAL_MEASUREMENT, id=1285, name=RMSVoltage, dataType=UNSIGNED_16_BIT_INTEGER, lastValue=null]
2019-02-17 19:09:10.405 [DEBUG] [tsystems.zigbee.ZigBeeNetworkManager] - TX APS: ZigBeeApsFrame [sourceAddress=0/0, destinationAddress=20935/0, profile=0000, cluster=0021, addressMode=DEVICE, radius=31, apsSecurity=false, apsCounter=06, payload=00 FF 65 83 FE FF 9F FD 90 01 00 03 03 31 A3 13 11 00 6F 0D 00 01]
2019-02-17 19:09:10.417 [DEBUG] [ngle.telegesis.ZigBeeDongleTelegesis] - Telegesis send: TelegesisSendUnicastCommand [address=20935, sourceEp=0, destEp=0, profileId=0, clusterId=33, messageData=00 FF 65 83 FE FF 9F FD 90 01 00 03 03 31 A3 13 11 00 6F 0D 00 01, messageId=null]
2019-02-17 19:09:10.431 [DEBUG] [gesis.internal.TelegesisFrameHandler] - TX Telegesis queue: 2
2019-02-17 19:09:10.434 [DEBUG] [transaction.ZigBeeTransactionManager] - 90FD9FFFFE6F5EB0: Creating new Transaction Queue
2019-02-17 19:09:10.438 [DEBUG] [e.transaction.ZigBeeTransactionQueue] - 90FD9FFFFE6F5EB0: Added transaction to queue, len=1, transaction=ZigBeeTransaction [queueTime=0, state=WAITING, sendCnt=0, command=BindRequest [0/0 -> 40076/0, cluster=0021, TID=--, srcAddress=90FD9FFFFE6F5EB0, srcEndpoint=1, bindCluster=1, dstAddrMode=3, dstAddress=000D6F001113A331, dstEndpoint=1]]
2019-02-17 19:09:10.441 [DEBUG] [e.transaction.ZigBeeTransactionQueue] - 90FD9FFFFE6F5EB0: Timer 50 >> 0 -- 1550426950440 == -1550426950440, outstanding=0/2
2019-02-17 19:09:10.443 [DEBUG] [transaction.ZigBeeTransactionManager] - 000B57FFFEAC1F8D: Creating new Transaction Queue
2019-02-17 19:09:11.542 [DEBUG] [gesis.internal.TelegesisFrameHandler] - Telegesis interrupted in sendRequest TelegesisSendUnicastCommand [address=52581, sourceEp=0, destEp=1, profileId=260, clusterId=0, messageData=00 05 00 04 00, messageId=null]
2019-02-17 19:09:10.447 [DEBUG] [e.transaction.ZigBeeTransactionQueue] - 000B57FFFEAC1F8D: Added transaction to queue, len=1, transaction=ZigBeeTransaction [queueTime=1, state=WAITING, sendCnt=0, command=BindRequest [0/0 -> 33280/0, cluster=0021, TID=--, srcAddress=000B57FFFEAC1F8D, srcEndpoint=1, bindCluster=768, dstAddrMode=3, dstAddress=000D6F001113A331, dstEndpoint=1]]
2019-02-17 19:09:11.542 [DEBUG] [gesis.internal.TelegesisFrameHandler] - Telegesis interrupted in sendRequest TelegesisSendUnicastCommand [address=48557, sourceEp=0, destEp=1, profileId=260, clusterId=0, messageData=00 04 00 04 00, messageId=null]
2019-02-17 19:09:11.545 [DEBUG] [e.transaction.ZigBeeTransactionQueue] - 90FD9FFFFE6F5EB0: Timer 50 >> 0 -- 1550426951545 == -1550426951545, outstanding=0/2
2019-02-17 19:09:11.541 [DEBUG] [gesis.internal.TelegesisFrameHandler] - Telegesis interrupted in sendRequest TelegesisSendUnicastCommand [address=20935, sourceEp=0, destEp=0, profileId=0, clusterId=33, messageData=00 FF 65 83 FE FF 9F FD 90 01 00 03 03 31 A3 13 11 00 6F 0D 00 01, messageId=null]
2019-02-17 19:09:11.547 [DEBUG] [tsystems.zigbee.ZigBeeNetworkManager] - RX STA: msgTag=05 state=TX_NAK
2019-02-17 19:09:11.547 [DEBUG] [tsystems.zigbee.ZigBeeNetworkManager] - RX STA: msgTag=04 state=TX_NAK
2019-02-17 19:09:11.548 [DEBUG] [e.transaction.ZigBeeTransactionQueue] - 000B57FFFEAC1F8D: Timer 50 >> 0 -- 1550426951547 == -1550426951547, outstanding=0/2
2019-02-17 19:09:11.549 [DEBUG] [tsystems.zigbee.ZigBeeNetworkManager] - RX STA: msgTag=06 state=TX_NAK
2019-02-17 19:09:11.549 [DEBUG] [transaction.ZigBeeTransactionManager] - notifyTransactionProgress: TID 4 -> TX_NAK == 6
2019-02-17 19:09:11.551 [DEBUG] [transaction.ZigBeeTransactionManager] - notifyTransactionProgress: TID 6 -> TX_NAK == 6
2019-02-17 19:09:11.550 [DEBUG] [transaction.ZigBeeTransactionManager] - notifyTransactionProgress: TID 5 -> TX_NAK == 6
2019-02-17 19:09:11.549 [DEBUG] [e.transaction.ZigBeeTransactionQueue] - 000B57FFFEAC1F8DTimer set 1550426951599
2019-02-17 19:09:11.553 [DEBUG] [transaction.ZigBeeTransactionManager] - notifyTransactionProgress: TID 5 synced
2019-02-17 19:09:10.448 [DEBUG] [gesis.internal.TelegesisFrameHandler] - RX Telegesis Data:SEQ:A3
2019-02-17 19:09:11.558 [DEBUG] [gesis.internal.TelegesisFrameHandler] - RX Telegesis Data:OK

In the log above, the Telegesis send happens at “19:09:10.417”, is aborted at “19:09:11.541”, and the SEQ / OK arrive at “19:09:10.448” / “19:09:11.558”

Don’t ask me why the entries in the log are not in order! (???)

I am increasing it to 3 seconds, and will see what happens.


Hi @chris,

Seems to be working mostly OK now. Thanks!

But only “mostly”, because I have still seen a few TX_NAKs. Not too many, but I will investigate a bit, in case there is some contention somewhere (maybe some synchronized block?)

I have also tested that without a “fast track” for the final outstanding transaction queue, my network is “choppy”: half of the time I have to wait several seconds for lights to turn on/off, and startup is veeeery slow.

The reason: the 10 outstanding transactions get filled very quickly with transactions for several sleepy devices. The transaction manager then waits for some of them to finish, which takes on average half of 8-12 seconds.


Can we revisit the need for a fast queue? (It is not really a queue - the implementation I showed above is working OK for me, and the complexity it adds is that of “less than 15 lines of code” - but any solution that keeps a minimum “bandwidth” for non-sleepy devices would do :slight_smile: )

Pedro

Great stuff - thanks for your help. As a matter of interest, what does the Telegesis queue size get to now?

Ok, it would be good if you can take a look over these to see if there’s a reason


Well, I’m always open to being persuaded, but I think it will take some doing :wink: . Currently I don’t think that bypassing the queue is a good idea - maybe there is some tuning of the queues that could be done (the inter-message delays for example) - these were largely arbitrary anyway.

I also don’t see why we should bump the priority of transactions where the receiver is on at the expense of those that sleep. Again, the queue management ought to account for that by only releasing one transaction for sleeping nodes at a time - that is one of the principal requirements here since parent routers in a ZigBee network are only required to buffer a single packet per child.

What I have just realised is that I don’t have different queue profiles for sleepy devices (and as above, this is needed). I had planned that this would be done from the NetworkManager, and the default profile would only be used if there was no specific profile defined, but clearly I’ve missed this. I would expect that this will solve the issue you are seeing as it will do a similar thing, but smarter.
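If it helps to make that concrete, here is a rough sketch (hypothetical names, not the actual queue profile API) of what such a per-queue profile might look like, with sleepy nodes limited to one released transaction at a time:

// Hypothetical sketch of per-queue profiles: a sleepy end device gets at most
// one transaction released at a time (its parent router is only required to
// buffer a single packet for it), while RxOnWhenIdle devices may have more.
class SketchQueueProfile {
    final int maxOutstanding;            // transactions released at once from this queue
    final int interTransactionDelayMs;   // pacing between releases

    SketchQueueProfile(int maxOutstanding, int interTransactionDelayMs) {
        this.maxOutstanding = maxOutstanding;
        this.interTransactionDelayMs = interTransactionDelayMs;
    }

    static SketchQueueProfile forNode(boolean rxOnWhenIdle) {
        return rxOnWhenIdle
                ? new SketchQueueProfile(2, 50)   // mains powered: a couple in flight
                : new SketchQueueProfile(1, 0);   // sleepy: strictly one at a time
    }
}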

Hello Chris,

I am on nightly build 1524 (and zsmartsystems 1.1.10), and hoped to enjoy the new configurable polling. I have added

, zigbee_reporting_polling="15" ]

to a thing in my .thing file. But so far, it doesn’t seem to work.
Is it supposed to work like that?
Why a minimum of 15 seconds? I hoped to be able to set two seconds, as I want a switch to control lights.

It should be all that’s needed.

Because it is very easy to swamp the system and cause huge delays to commands. Polling is not a particularly good option (I know in your case there may not be any better option).

Ok, then it might be because of the startup errors the binding is throwing. Operation of zigbee lights does work though.

2019-02-18 16:06:10.359 [ERROR] [org.openhab.binding.zigbee          ] - bundle org.openhab.binding.zigbee:2.5.0.201902161217 (228)[org.openhab.binding.zigbee.internal.converter.ZigBeeChannelConverterFactory(34)] : Could not load implementation object class org.openhab.binding.zigbee.internal.converter.ZigBeeChannelConverterFactory
2019-02-18 16:06:10.370 [ERROR] [org.openhab.binding.zigbee          ] - bundle org.openhab.binding.zigbee:2.5.0.201902161217 (228)BundleComponentActivator : Unexpected failure enabling component holder org.openhab.binding.zigbee.internal.converter.ZigBeeChannelConverterFactory
2019-02-18 16:06:10.416 [ERROR] [org.openhab.binding.zigbee          ] - bundle org.openhab.binding.zigbee:2.5.0.201902161217 (228)[org.openhab.binding.zigbee.internal.ZigBeeThingTypeProviderImpl(38)] : Could not load implementation object class org.openhab.binding.zigbee.internal.ZigBeeThingTypeProviderImpl
2019-02-18 16:06:10.422 [ERROR] [org.openhab.binding.zigbee          ] - bundle org.openhab.binding.zigbee:2.5.0.201902161217 (228)BundleComponentActivator : Unexpected failure enabling component holder org.openhab.binding.zigbee.internal.ZigBeeThingTypeProviderImpl
2019-02-18 16:06:11.104 [ERROR] [org.openhab.binding.zigbee          ] - bundle org.openhab.binding.zigbee:2.5.0.201902161217 (228)[org.openhab.binding.zigbee.internal.converter.ZigBeeChannelConverterFactory(34)] : Could not load implementation object class org.openhab.binding.zigbee.internal.converter.ZigBeeChannelConverterFactory
2019-02-18 16:06:11.112 [ERROR] [org.openhab.binding.zigbee          ] - bundle org.openhab.binding.zigbee:2.5.0.201902161217 (228)[org.openhab.binding.zigbee.internal.ZigBeeThingTypeProviderImpl(38)] : Could not load implementation object class org.openhab.binding.zigbee.internal.ZigBeeThingTypeProviderImpl
2019-02-18 16:06:11.131 [ERROR] [org.openhab.binding.zigbee          ] - FrameworkEvent ERROR - org.openhab.binding.zigbee
2019-02-18 16:06:11.175 [ERROR] [org.openhab.binding.zigbee          ] - FrameworkEvent ERROR - org.openhab.binding.zigbee

Hi @chris,

Now it stays stable. At some point during network initialization it went up to 13 (with some timeouts), but it recovered quite well from that, back to 1. Nothing like the ~400 without the transaction manager, and no TOO_MANY_UNICAST errors.

I have seen that my OH’s java process is consuming much more than it used to, even when “doing nothing”. This may explain why I still get some timeouts. One question, in ZigBeeDongleTelegesis.sendCommand():

// We need to get the Telegesis SEQ number for the transaction so we can correlate the transaction ID
// This is done by spawning a new thread to send the transaction so we get the response.
new Thread() {
    @Override
    public void run() {
        frameHandler.sendRequest(command);

        // Let the stack know the frame is sent
        zigbeeTransportReceive.receiveCommandState(msgTag,
                command.getStatus() == TelegesisStatusCode.SUCCESS ? ZigBeeTransportProgressState.TX_ACK
                        : ZigBeeTransportProgressState.TX_NAK);

        // Multicast doesn't have a sequence returned, so nothing more to do here
        if (command instanceof TelegesisSendMulticastCommand) {
            return;
        }

        messageIdMap.put(((TelegesisSendUnicastCommand) command).getMessageId(), msgTag);
    }
}.start();

We are spawning a new thread for every single sent command, which is a costly operation. Might this have something to do with it? Would it make sense to use a persistent thread instead?
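Just to illustrate what I mean (a hedged sketch with made-up names, not the binding’s code): a shared thread pool would avoid constructing a new Thread per command while still letting several sends block concurrently waiting for their SEQ + OK/ERROR responses:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: one shared pool replaces "new Thread().start()" per
// command. A cached pool reuses idle worker threads but still allows several
// sendRequest() calls to block concurrently while waiting for SEQ + OK/ERROR.
class SketchCommandSender {
    private final ExecutorService commandExecutor = Executors.newCachedThreadPool();

    void sendCommandAsync(Runnable sendTask) {
        // sendTask would contain the frameHandler.sendRequest(...) body quoted above
        commandExecutor.execute(sendTask);
    }

    void shutdown() {
        commandExecutor.shutdown();
    }
}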

Neither do I. What I propose is that, from the maxOutstandingTransactions (10 by default), we reserve a certain number (I tested with 5) for fast transactions, so that once we have reached this threshold, only “fast” transactions will be fetched from the node queues and sent to the dongle.

Not really bumping their priority, but just reserving a minimum number of “send slots” for them.

As a user, when I turn on a light, I would prefer it to happen immediately, and I do not care if a network discovery packet, or anything going to a device that will anyway take 4 seconds on average, is delayed a few ms because of this. Sorry for using Agile wording here :slight_smile:

Let me explain why I think this is needed: e.g. the Telegesis dongle will only handle 10 transactions “on air”. This is consistent with the 10 transactions enforced by the Transaction Manager.

The running transactions may be a mix of:

  • Commands that go to RxOnIdle devices (fast transactions, “negligible” time).
  • Commands sent to sleepy devices (will take 0 to 8 seconds, 4 on average).
  • Commands to offline devices (will time out in about 8 seconds).

There is a good probability that at a given moment all transactions in the Outstanding Transactions list are either for slow devices or for devices that are offline, simply because they take so much longer than the fast ones.

When this happens, we will have the fast transactions competing for a slot with all other queued ones (fast and slow), and in any case waiting for a slow transaction to complete.

A fair queueing practice (vs a random one) would give inverse weight to the time each transaction takes, to better shape the traffic. A low latency queueing practice would reserve some slots for fast transactions (or keep separate queues).

This is an improvement in terms of user experience, especially considering that the commands to non-sleepy devices (lights, plugs, etc.) are the ones the user would expect to be acted on most immediately.

This is normally solved by having fast and slow queues, or any equivalent solution (such as the one proposed, but again, that is just one possible implementation).

This cannot be solved in the “per node” queues, as we may still have more queues than available outstanding transactions.

A solution requires the transaction manager being aware of this and making a decision to pick a packet from the queue of a fast device when the outstanding transactions list reaches a given threshold (which could be as little as a single free slot, even if I used 5 for testing).
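To make the proposal concrete, here is a hedged sketch of that decision (hypothetical names; just one of many possible implementations): once the outstanding count reaches maxOutstanding minus the reserved slots, only transactions for fast (RxOnWhenIdle) devices are released:

// Hypothetical sketch of reserving outstanding-transaction slots for "fast"
// (RxOnWhenIdle) devices: slow transactions may not occupy the last
// reservedFastSlots slots, so a light switch command never has to wait for a
// sleepy or offline device transaction to time out first.
class SketchSlotReservation {
    private final int maxOutstanding;     // e.g. 10, the dongle limit
    private final int reservedFastSlots;  // e.g. 1..5 slots kept free for fast devices

    SketchSlotReservation(int maxOutstanding, int reservedFastSlots) {
        this.maxOutstanding = maxOutstanding;
        this.reservedFastSlots = reservedFastSlots;
    }

    boolean mayRelease(int currentlyOutstanding, boolean transactionIsFast) {
        if (currentlyOutstanding >= maxOutstanding) {
            return false;                 // dongle already full
        }
        if (transactionIsFast) {
            return true;                  // fast transactions may use any free slot
        }
        // Slow transactions must leave the reserved slots free
        return currentlyOutstanding < maxOutstanding - reservedFastSlots;
    }
}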

I hope this clarifies a bit why I think this is necessary: it is a user experience improvement, not a fix for a “non-working” scenario.

Pedro

Yes - I will change this to use a scheduler.

After thinking about this more last night, I agree. I will likely implement it differently than you did as I think that your implementation doesn’t guarantee this. The situation that it needs to protect against is all 10 transactions being for sleepy devices.