Zigbee binding

zolakk · February 12, 2019, 9:45pm

So that’s for a future binding version then? My confusion is that I only have a color channel and no other channels, yet my color channel is still going to UNDEF at times

puzzle-star · February 15, 2019, 10:40am

Hi @chris,

Regarding this CT / Color topic, could you please have a look at Fix color control channel updates #386.

This might be addressing part of the CT vs non-CT issues on the Color channel.

This also fixes an issue from an old commit I saw while checking it (I can’t recall from whom) that replaced float point inequality comparisons on the channel with < .00001 or something like that, but the sign of the comparison was inverted, so it is hardly possible that the color channel was working propertly after that commit.

Pedro

chris · February 15, 2019, 2:23pm

Hi Pedro,
Apologies for not looking over this. I did look at it a few days back, but as you said you don’t have any Hue bulbs I wanted to give this a test before I merge it. I will do that over the weekend - it’s been a busy week (again).

Chris

puzzle-star · February 16, 2019, 11:40am

No worries, I just thought you might have missed the PR

dbadia · February 16, 2019, 12:41pm

Been a while since I’ve done anything with Zigbee, which dongle(s) are compatible these days?
Is the Telegesis ETRX3 dongle still a good option?

chris · February 16, 2019, 12:48pm

Yes, it’s not a bad option, but just to note that I don’t believe it will be updated for ZB3.

My preferred dongle, due to the fact that it has the most use (including commercial users) is the Ember dongles, but there aren’t so many USB dongles available (The ETRX3 can be programmed with the Ember firmware as the HW is the same, but the programmer isn’t cheap, and firmware is not readily available without buying the devkit).

puzzle-star · February 16, 2019, 8:08pm

Hi @chris,

I have tested with the transaction manager branch, and the queue has gone to over 14000. I am looking into it, but my network is not functional with this branch…

Pedro

chris · February 16, 2019, 8:17pm

Strange - I’ve been testing with a few of other people and it’s been working well. Please try and work out what is wrong or at least provide logs that I can take a look at.

Also, what queue do you mean has gone to this size? There are queues in different places for different reasons. I guess logs would really be useful if you can provide them.

puzzle-star · February 16, 2019, 9:38pm

Hi @chris,

2019-02-16 20:38:08.761 [DEBUG] [gesis.internal.TelegesisFrameHandler] - TX Telegesis queue: 6727

The log (OH startup): https://goo.gl/TVKY4r

With 1.1.10 (and before) my queue went up to ~400 while starting up and then went down back to 1. I normally have TOO_MANY_UNICASTS and TOO_MANY_BROADCASTS errors, but I lived with it waiting for the transaction manager

With the transaction manager, I do not see any TOO_MANY_BROADCASTS errors, but I do have TOO_MANY_UNICASTS since the very beginning, which I understand should not happen as the transaction manager would not let more than 10 live messages “on the air”.

I am looing into it (it might be something specific to my setup), but attached the log in case you are curious.

Pedro

chris · February 16, 2019, 10:23pm

The queue in the dongle handler should not be affected at all by these changes - the transaction manager is in the main code and operates on transactions before they get sent through the transport layer.

I’ll take a look at the log…

puzzle-star · February 17, 2019, 12:03am

Agree, but I can hardly imagine how the queue in the dongle handler could grow up that much if the transaction manager only allows for 10 concurrent transactions.

Looking at the logs, it looks like some commands time out after 5 seconds in the telegesis frame handler (I have several “limited function devices” and sleepy ones). But this timeout is not one of the command within the dongle, but the handler aborting it: it is still “on air” according to the dongle.

This makes the transaction manager think it can send more transactions, and the dongle ends up with TOO_MANY_UNICAST errors. Now, the transaction manager retries failed commands, and probably the queue will never stabilize. The frame handler queue in the dongle grows without limit, and so does the “time in the queue”, while the transaction manager thinks it can still send transactions.

This raises to me a couple of questions:

Should the timeout in the dongle frame handler be less by default than the time for which a command could still be sent? This causes the frame handler to abort transactions that could still work.
If we abort a transaction that was sent to the dongle, should not we (is there a way to) tell the dongle to abort that transaction? Otherwise, we will send the dongle more transactions when it cannot handle them.

Also, the way the queues are polled could make the total amount of concurrent allowed transactions be “waiting” for slow / sleepy end devices, making all other transactions wait.

Would it make sense, to fix this scenario, to reserve a minimum of “fast” transactions from the maximum outstanding transactions? By fast meaning:

Transactions that go to non-sleepy devices (known from the node descriptor), and
That have not failed before (so not prioritizing transactions for devices that could be offline)

This could be done with something like (pseudocode).

ZigBeeTransactionManager:

private final int OUTSTANDING_TRANSACTIONS = 10;
private final int RESERVED_FAST_TRANSACTIONS = 5;

private int reservedFastTransactions = RESERVED_FAST_TRANSACTIONS;

// ...

private void sendNextTransaction() {
    // ...
    transaction = queue.getTransaction(outstandingTransactions.size() >= maxOutstandingTransactions - reservedFastTransactions); // tell the queue whether we only want "fast" transactions
    // ...
}

public boolean isFast(ZigBeeTransaction transaction) {
    if (transaction.getSendCnt() != 0) return false;

    ZigBeeAddress address = transaction.getDestinationAddress();
    if (!address instanceof ZigBeeEndpointAddress) return true;
    if (ZigBeeBroadcastDestination.isBroadcast(address.getAddress())) return true;

    ZigBeeNode node = networkManager.getNode(address.getAddress());
    return node == null || node.getNodeDescriptor().getMacCapabilities().contains(MacCapabilitiesType.RECEIVER_ON_WHEN_IDLE);
}

ZigBeeTransactionQueue:

protected ZigBeeTransaction getTransaction(final boolean fastOnly) { // Added parameter
    // ...
    if (queue.isEmpty() || nextReleaseTime > System.currentTimeMillis()
            || outstandingTransactions >= profile.getMaxOutstandingTransactions()
            || fastOnly && !queue.peek().isFast()) {
        return null;
}

ZigBeeTransaction.java:

public boolean isFast() {
    return transactionManager.isFast(this);
}

Would this make sense, as an approach to prevent sleepy devices from affecting the whole transaction manager network latency? (it could also be that all the above is nonsense… )

Pedro

puzzle-star · February 17, 2019, 12:50am

Hi @chris,

Answering myself:

I have applied the above changes, and now my network works like a charm…

Even considering I still get a few TOO_MANY_UNICAST errors while initializing (I have not touched the dongle frame handler timeout), everything works and now the queues “behave” as never.

Still thinking anyway the timeout on the dongle frame handler topic needs to be revisited…

Pedro

chris · February 17, 2019, 9:43am

I’m not sure that I like this approach - it shouldn’t I think be necessary and adds what
I think is unnecessary complexity. Everything should be paced by the dongle itself - not these timers, and the TransactionManager shouldn’t have to worry about fast or slow transactions - the dongle sorts this out.

Probably the transaction timer in the Telegesis driver needs extending. The transactions should normally always be cancelled/failed/completed by the dongle itself - the timer should really only be there to handle and “bad stuff” and stop the whole system choking.

I suspect that 5 seconds in the Telegesis dongle is too short since transactions can take up to ~7.5 seconds on sleepy devices.

chris · February 17, 2019, 10:31am

This 5 second timer should only be the time it takes to SEND a command, and should not (if I remember correctly) be the time it takes to receive the ACK from the device. So, it’s a little strange it’s taking so long - possibly it’s caused by overloading the number of frames the Telegesis can handle at once, so the dongle then starts to queue frames, and then things start to fall over?

I don’t really understand how the latest code can cause this though, as none of this has changed, and if anything, the new code will delay commands more than the old code! In the existing code, all transactions get sent directly to the dongle handler, where now this is not the case.

My guess is that this problem was always there - this queue should not get to 400 in the first place (which is what you said was happening in the current code) and my guess is that something else has changed to cause this (eg a few more polling requests, another device added…), and not the transaction manager. I suspect that if you have a queue of 400 messages in the dongle level, it would not take too much for this to keep increasing.

puzzle-star · February 17, 2019, 2:25pm

Hi @chris

I have double checked: reverting back to 1.1.10, the queue grows up to ~400 during initialization, but it goes back to ~1 after some time. When initializing, the network is way too chatty, and for every node we send a lot of discovery messages, attribute update requests, binding requests, etc: on a >40 device network, the amount of messages to be sent in a very short period of time goes easilly over 1000.

Both on 1.1.10 and with the transaction manager, very soon I start getting TOO_MANY_UNICAST errors from the dongle, but there is a difference here: the transaction manager will retry errors, so it will not allow the queue to empty. That is probably why I do not see the queue go down with the transaction manager (indeed, after several hours, it seems to have gone down…)

The issue I need to find is why the dongle frame handler queue grows with the transaction manager, as it should be holding the transactions on the node transaction queues instead (am I right?)

If the transaction manager does not take care of this, it is too late for the dongle, unless I am missing something, to do anything about fast or slow transactions: the transaction manager may send the maximium allowed transactions and will necessarily have to wait if these happen to go to a sleepy device, which is not difficult.

Fast and slow queues are a common approach in traffic shaping, and the transaction manager is doing a traffic shaping functionallity.

Being this a prioritization between transactions, why do you think the transaction manager should not take care of this? ZigBee does have “fast” and “slow” messages, from the point of view of the time these may be queued in the network. If the transaction manager is not aware of it, it can negatively affect network latency. And the transaction timer is already taking a bit of care of priorities and shaping, by putting, i.e. retries, at the beginning of the queue, and by keeping a maximimum count of transactions per node and in general.

Will look deepter into it, then. Somewhere in the whole transaction path the transaction manager ends up sending more transactions when the running ones are still “on air”. This has two effects: the frame handler queue growing, and TOO_MANY_UNICAST errors happening.

Pedro

chris · February 17, 2019, 3:09pm

I would say that this is definately already broken. It’s certainly not correct operation as this queue should always be close to 0.

Why? The pacing of commands comes from the dongle, right? The issue I see is that the dongle is (for reasons I don’t currently understand) not replying immediately to the queue commands, and then the driver is timing out and telling the manager to send another transaction.

Sure, but it should already be tuned for this - that is the point of the transaction manager (or more specifically, the transaction queue). It should not need a fast and slow queue since the queues themselves should ultimately be configured for the device. Each queue should be tuned to only release frames based on the destination.

The queue should only release commands when the dongle is able to send them. The problem I think (if I understand you correctly) is that the Telegesis dongle is not responding quickly to say it has (or hasn’t) queued a frame. The 5 second timer triggers instead, thus allowing more frames into the dongle handler.

My understanding of the Telegesis protocol is as follows -:

The binding sends a frame. The dongle should respond with OK. This transaction is “protected” with a 5 second timer to stop things stalling. Some time later, when the transaction is complete (either failed or acked) the dongle sends another message.

Both of these responses from the dongle (the OK and ACK in this diagram) get sent to the transaction manager, so the transaction manager is able to know what is happening.

Instead, what we get is the 5 second timer timing out -:

This results in the transaction manager sending down another message before the previous one is complete (in fact, before the dongle has even sent it!).

I don’t really know why the dongle is not responding immediately as it is meant to. These are fundamentally atomic transactions in that no other communications can take place with the dongle until it responds.

I should also add that this doesn’t happen with the Ember dongle - even on a large, very heavily loaded system.

chris · February 17, 2019, 3:28pm

I just did a search for this timeout message (“TELEGESIS Timer: Timeout”) and it only happens 3 times in this 170MB logfile, so I’m not convinced that’s really the problem?

The log is quite hard to read as the log entries are really out of order for some reason

chris · February 17, 2019, 3:45pm

Looking through your log for issues…

2019-02-16 20:20:32.573 [DEBUG] [gesis.internal.TelegesisFrameHandler] - Telegesis interrupted in sendRequest

This shows up a lot - is this why you think the timer is triggering? I’m not sure what causes the interrupt off the top of my head.

…

2019-02-16 20:20:44.140 [DEBUG] [gesis.internal.TelegesisFrameHandler] - RX Telegesis Data:NACK:D1
2019-02-16 20:20:44.140 [DEBUG] [gesis.internal.TelegesisFrameHandler] - RX Telegesis: TelegesisNackMessageEvent [messageId=209]
2019-02-16 20:20:44.141 [DEBUG] [ngle.telegesis.ZigBeeDongleTelegesis] - No sequence correlated for NAK messageId 209

The correlation of the responses from the dongle to the transaction IDs is not working with the Telegesis dongle. This will stop the transaction manager from sending commands in a well paced way, although it probably should result in them being sent slower.

…

2019-02-16 20:20:44.147 [DEBUG] [transaction.ZigBeeTransactionManager] - transactionComplete: ZigBeeTransaction [queueTime=653375, state=FAILED, sendCnt=165, command=ReadAttributesCommand [Basic: 0/0 -> 42100/1, cluster=0000, TID=0B, identifiers=[4]]]  FAILED

I don’t know why the sendCnt here is a rather large number . I think it should be limited to 3 retries, but maybe I’ve missed this check somewhere, or maybe under some situation it’s not checked.

chris · February 17, 2019, 4:09pm

@puzzle-star what computer do you use? I wonder if there is a synchronisation issue?

When sending telegesis transactions, the following is used to get the completed transaction, and record the message ID -:

        new Thread() {
            @Override
            public void run() {
                frameHandler.sendRequest(command);

                // Let the stack know the frame is sent
                zigbeeTransportReceive.receiveCommandState(msgTag,
                        command.getStatus() == TelegesisStatusCode.SUCCESS ? ZigBeeTransportProgressState.TX_ACK
                                : ZigBeeTransportProgressState.TX_NAK);

                // Multicast doesn't have a sequence returned, so nothing more to do here
                if (command instanceof TelegesisSendMulticastCommand) {
                    return;
                }

                messageIdMap.put(((TelegesisSendUnicastCommand) command).getMessageId(), msgTag);
            }
        }.start();

This is then correlated when we get the response -:

        if (event instanceof TelegesisAckMessageEvent) {
            TelegesisAckMessageEvent ackEvent = (TelegesisAckMessageEvent) event;
            if (messageIdMap.get(ackEvent.getMessageId()) == null) {
                logger.debug("No sequence correlated for ACK messageId {}", ackEvent.getMessageId());
                return;
            }

            zigbeeTransportReceive.receiveCommandState(messageIdMap.remove(ackEvent.getMessageId()),
                    ZigBeeTransportProgressState.RX_ACK);
            return;
        }

However, we see in the following that everything is working ok with the dongle, but it’s not correlating…

2019-02-16 20:20:44.452 [DEBUG] [gesis.internal.TelegesisFrameHandler] - RX Telegesis Data:SEQ:ED
2019-02-16 20:20:44.453 [DEBUG] [gesis.internal.TelegesisFrameHandler] - RX Telegesis Data:OK
2019-02-16 20:20:44.667 [DEBUG] [gesis.internal.TelegesisFrameHandler] - RX Telegesis Data:ACK:ED
2019-02-16 20:20:44.669 [DEBUG] [gesis.internal.TelegesisFrameHandler] - RX Telegesis: TelegesisAckMessageEvent [messageId=237]
2019-02-16 20:20:44.670 [DEBUG] [ngle.telegesis.ZigBeeDongleTelegesis] - No sequence correlated for ACK messageId 237

I wonder if the response to the ACK is being passed in before the SEQ. There are 200ms between them, so it should be ok, but clearly it’s not working here. I will add another log entry here to see if we can catch this.

On further inspection, it looks like this is probably linked to the “interrupted exception” messages. The Telegesis transactions are not completing I think, and this means the frameHandler.sendRequest(command) is not getting the message ID.

chris · February 17, 2019, 4:27pm

@puzzle-star I’ve found the reason for this at least. Failed transactions were being added back into the queue in two places - one checked the maxRetries, but one didn’t so it was bypassing this check.

I’ll push an update shortly - are you compiling your own libraries, or are you using Maven snapshots?