Zigbee binding

I am testing a fix already. Will post as soon as I see if it is working (together with a couple more of findings)

1 Like

Hi @chris

Mostly working OK, but my findings (and possible fixes) so far:

ZIgBeeNetworkManager:

  • Is it possible that ā€œtransactionManager.shutdown()ā€ is never called from the NetworkManager? I have not found a call to it, and according to my logs, the transaction manager kept running after reloading the bundles (with the corresponding mayhem both in the logs and the networkā€¦).

Possible fix:

diff --git a/com.zsmartsystems.zigbee/src/main/java/com/zsmartsystems/zigbee/ZigBeeNetworkManager.java b/com.zsmartsystems.zigbee/src/main/java/com/zsmartsystems/zigbee/ZigBeeNetworkManager.java
index 4a57a250..a1b8e713 100644
--- a/com.zsmartsystems.zigbee/src/main/java/com/zsmartsystems/zigbee/ZigBeeNetworkManager.java
+++ b/com.zsmartsystems.zigbee/src/main/java/com/zsmartsystems/zigbee/ZigBeeNetworkManager.java
@@ -531,6 +531,8 @@ public class ZigBeeNetworkManager implements ZigBeeNetwork, ZigBeeTransportRecei
             }
         }

+        transactionManager.shutdown();
+
         transport.shutdown();
     }

TelegesisFrameHandler:

  • In the Telegesis frame handler, the comments for the meaning of the variables ā€œtransactionTimeoutā€, ā€œcommandTimeoutā€, and their corresponding constant defaults, seem reversed vs the meanining in the setter functions, and indeed vs the code. This means that their values seem also reversed:
    • DEFAULT_TRANSACTION_TIMEOUT: should be 8000 instead of 3000. It accounts, looking at the code, for the full time from the moment the transaction goes to into the queue.
    • DEFAULT_COMMAND_TIMEOUT: should be 3000 instead of 8000. It accounts, looking at the code, for the time allowed for the dongle to respond once a comand is sent.
    • Both should probably anyway be increased on ā€œslowā€ systems (Telegesis under heavy load, I do not think this is related to the rpi3 here), I have seen transactions for >2 seconds in the queue before being sent to the dongleā€¦

Possible fix:

diff --git a/com.zsmartsystems.zigbee.dongle.telegesis/src/main/java/com/zsmartsystems/zigbee/dongle/telegesis/internal/TelegesisFrameHandler.java b/com.zsmartsystems.zigbee.dongle.telegesis
/src/main/java/com/zsmartsystems/zigbee/dongle/telegesis/internal/TelegesisFrameHandler.java
index 770897fb..8355b47d 100644
--- a/com.zsmartsystems.zigbee.dongle.telegesis/src/main/java/com/zsmartsystems/zigbee/dongle/telegesis/internal/TelegesisFrameHandler.java
+++ b/com.zsmartsystems.zigbee.dongle.telegesis/src/main/java/com/zsmartsystems/zigbee/dongle/telegesis/internal/TelegesisFrameHandler.java
@@ -98,23 +98,23 @@ public class TelegesisFrameHandler {
     private ScheduledFuture<?> timeoutTimer = null;

     /**
-     * The maximum number of milliseconds to wait for the response from the stick once the request was sent.
+     * The maximum number of milliseconds to wait for the completion of the transaction after it's queued
      * This must account for any delays in the host.
      */
-    private final int DEFAULT_TRANSACTION_TIMEOUT = 3000;
+    private final int DEFAULT_TRANSACTION_TIMEOUT = 8000;

     /**
-     * The maximum number of milliseconds to wait for the completion of the transaction after it's queued
+     * The maximum number of milliseconds to wait for the response from the stick once the request was sent.
      */
-    private final int DEFAULT_COMMAND_TIMEOUT = 8000;
+    private final int DEFAULT_COMMAND_TIMEOUT = 3000;

     /**
-     * The maximum number of milliseconds to wait for the response from the stick once the request was sent
+     * The maximum number of milliseconds to wait for the completion of the transaction after it's queued
      */
     private int transactionTimeout = DEFAULT_TRANSACTION_TIMEOUT;

     /**
-     * The maximum number of milliseconds to wait for the completion of the transaction after it's queued
+     * The maximum number of milliseconds to wait for the response from the stick once the request was sent
      */
     private int commandTimeout = DEFAULT_COMMAND_TIMEOUT;

ZigBeeTransactionManager:

  • Global sleepy transaction counter has to be kept consistent when a queue type changes. I think I understand why you might prefer not to having to reset it, as both the queue and the transaction manager are receiving notifications of the transaction having finished asynchronously, and as these may be processed in any order:
    • Incrementing the counter is safe, as we just got the transaction from the queue and we know whether it is a sleepy queue
    • Decrementing the counter is not safe, as when the transaction finishes, the queue might have changed status, and synchronization is not trivial: not easy to know how many of the transactions already sent by the queue were counted as sleepy and which not. Trying to synchronize everything seems sort of a nightmare
    • The solution is to replace the sleepyTransaction counter by a list of outstanding sleepy transactions (similar to outstandingTransactions). Instead of incrementing and decrementing a counter, we add and remove the transactions to the list. We always know how many outstanding transactions were sleepy when sent and this is accounted for accordingly.
  • Moved ā€œremoveTransactionListener(transaction)ā€ inside the ā€œsynchronized(this)ā€ block just following it, as in this method we update the outstandingTransactions, and accessing from other places in the code the ā€œoutstandingTransaction.size()ā€ is not done synchronized on the list but on ā€œthisā€ instead, so concurrent access to size() while modifing the list could lead (with very low probability, anyway) to an undefined behaviour.

Possible fix:

diff --git a/com.zsmartsystems.zigbee/src/main/java/com/zsmartsystems/zigbee/transaction/ZigBeeTransactionManager.java b/com.zsmartsystems.zigbee/src/main/java/com/zsmartsystems/zigbee/transaction/ZigBeeTransactionManager.java
index 5d88a53a..7b8414a9 100644
--- a/com.zsmartsystems.zigbee/src/main/java/com/zsmartsystems/zigbee/transaction/ZigBeeTransactionManager.java
+++ b/com.zsmartsystems.zigbee/src/main/java/com/zsmartsystems/zigbee/transaction/ZigBeeTransactionManager.java
@@ -114,9 +114,9 @@ public class ZigBeeTransactionManager implements ZigBeeNetworkNodeListener {
     private int maxSleepyTransactions = MAX_SLEEPY_TRANSACTIONS;

     /**
-     * A counter holding the number of sleepy transactions
+     * A list of outstanding sleepy transactions
      */
-    private int sleepyTransactions = 0;
+    private Set<ZigBeeTransaction> sleepyTransactions = new HashSet<>();

     /**
      * Executor service to execute update threads for discovery or mesh updates etc.
@@ -488,15 +488,14 @@ public class ZigBeeTransactionManager implements ZigBeeNetworkNodeListener {
      */
     protected void transactionComplete(ZigBeeTransaction transaction, TransactionState state) {
         logger.debug("transactionComplete: {}  {}", transaction, state);
-        removeTransactionListener(transaction);

         synchronized (this) {
+            removeTransactionListener(transaction);
             ZigBeeTransactionQueue queue = getTransactionQueue(transaction);
             queue.transactionComplete(transaction, state);
-
-            if (queue.isSleepy()) {
-                sleepyTransactions--;
-            }
+            sleepyTransactions.remove(transaction);
+            logger.debug("transactionComplete: {} outstanding transactions ({} sleepy)", outstandingTransactions.size(),
+                    sleepyTransactions.size());
         }

         sendNextTransaction();
@@ -614,7 +613,7 @@ public class ZigBeeTransactionManager implements ZigBeeNetworkNodeListener {
                     }

                     // If this is a sleepy queue, and we've exceeded the sleepy transmissions, then ignore the queue
-                    if (queue.isSleepy() && sleepyTransactions >= maxSleepyTransactions) {
+                    if (queue.isSleepy() && sleepyTransactions.size() >= maxSleepyTransactions) {
                         continue;
                     }

@@ -622,12 +621,15 @@ public class ZigBeeTransactionManager implements ZigBeeNetworkNodeListener {
                     transaction = queue.getTransaction();
                     if (transaction != null) {
                         if (queue.isSleepy()) {
-                            sleepyTransactions++;
+                            sleepyTransactions.add(transaction);
                         }

                         // Send the transaction.
                         send(transaction);
                         sendDone = false;
+
+                        logger.debug("sendNextTransaction: {} outstanding transactions ({} sleepy)",
+                                outstandingTransactions.size(), sleepyTransactions.size());
                     }

                     // If the queue has no more transactions,

With the above fixes, my network is running quite smoothly.

BTW, everything works increadibly better than before having the Transaction Manager (so thanks! :slight_smile: ).

Probably for small networks, or networks without too many sleepy devices, things might go well without it, but in the last months, with my network growing >40 devices, my Telegesis was mostly launching TOO_MANY_UNICAST errors during intitalization, so network discovery and setup (i.e. reporting) was a mess (not really working).

But I still see at some moments in my log the transaction manager trying to send a transaction when the telegesis is still processing 10 concurrent ones (so ā€œError 72: TOO_MANY_UNICASTā€). It is not happening too much, and the side effects are limited, as the transaction manager will retry the offending transaction, but there must be some race condition, so looking for it: this time I do not see timeouts, so I need to dig deeper.

Pedro

1 Like

Thanks @puzzle-star for the review :slight_smile: . I will try and take a look at some of this during my flightā€¦

This is good news - it wasnā€™t wasted effort then :slight_smile: .

Yes, I completely agree - this enhancement is targetted to make things work for larger networks - well - itā€™s to make things work well for ANY size network. One of the drivers was @TheNetStriker who has always had problems - his network isnā€™t as big as yours but it has some unique features (lots of old Hue bulbs that have no/limited reporting configuration.

Iā€™m certainly happy to look at any instances. It might be that the queue profiles need some tuning - some of the values were set a little arbitrarily during developmentā€¦ Any thoughts/feedback is welcome.

Well, letā€™s consider a trivial optionā€¦

As you say, the issue really only occurs when decrementing the counter as the queue may have changed from a non-sleepy to a sleepy queue after the transaction was sent. This could mean that as the queue isSleep() is now true when the transaction completes, and we get out of sync by making sleepyTransactions negative.

So, in my mind, all we need to do is to ensure that sleepyTransactions doesnā€™t go negative. I donā€™t really think thereā€™s a need for adding the transactions into a separate Set is there (this seems overly complicated - ok - maybe not complicated, but unnecessary?).

Why not just the addition of sleepyTransactions > 0 -:

            if (queue.isSleepy() && sleepyTransactions > 0) {
                sleepyTransactions--;
            }

The opposite way should never happen (ie the queue changes from sleepy to non-sleepy) so we shouldnā€™t have to worry about sleepyTransactions never going back to 0.

WDYT? Iā€™d prefer the really simple option if possible :wink:

The issue on the counter going down is the opposite: when the transaction finishes, the queue may have changed from sleepy to non sleepy, so it will not be decremented even when it should.

It will not go negative: it will think there are sleepy transactions live when three arenā€™t, and these will never be discounted, so we may end up not allowing any more sleeping transactions as there will live ā€œleakedā€ as not being discounted.

I agree we should not add the extra list if not needed, but it is probably needed to correctly account for queue type changes.

Pedro

Yes, youā€™re right - itā€™s the other way around in the Descriptor.

Iā€™ll take a better look at this tomorrow when Iā€™m not sleepy! (Iā€™ve been travelling for the last 14 hours!). Iā€™m sure we can find a way that will work acceptably for the very few times when this is an issue, without resorting to the complexity of tracking these transactions separately.

@chris, I just noticed the Concentrator Type setting (currently on Low RAM), but I couldnā€™t find any documentation (and havenā€™t looked in the code). Everything has been working really well lately, so I didnā€™t want to play with it without asking first!

Hi @chris

The other way I found is adding a transaction.setSleepy(boolean) and transaction.isSleepy(), and then checking on the transacion when complete. This avoids the list, but requires the extra setter / getter methods (and the variable itself) on the transaction.

Pedro

I did update the docs, but itā€™s not flowed through into the online documentation yet -:

Concentrator Type (zigbee_concentrator)

The Concentrator is used to improve routing within a ZigBee network, and is especially useful in a network where much of the traffic is sent to or from a central coordinator. If the coordinator has sufficient memory, it can store routing information, thus reducing network traffic.

If supported, the High RAM concentrator should be used.

I thought of another way that might work - and that is to default the descriptor to include RECEIVER_ON_WHEN_IDLE. This would make my earlier suggestion work, as the only change would then be from not-sleepy to sleepy.

Arguably, this has another advantage in that when devices are first included into the network the transaction manager will treat them as if they are awake (which they should be!) and send transactions to them rather than letting them go to sleep which may cause initialisation issuesā€¦

But on the other side we are assuming a non-sleepy device is. If a full network discovery is needed, we will struggleā€¦

Is it realy that bad to correctly manage the sleepy transactions count? Adding a HashSet should not be a big deal, or even if not, transaction.setSleepy() / transaction.getSleepy() would do to correctly manage the counterā€¦

Pedro

What do you mean? Is what?

No, but if itā€™s not needed, then why add it? The other day you were complaining that the new code was running slower - adding things like this will run even slower (ok, not much, but if we can avoid it, then all the better - right?).

Sorry, I meant the other way round: we would be assuming a sleepy transaction might not be, and fill the outstanding transaction queue with sleepy transactions, and in a moment where we have heavy traffic, which is what we wanted to avoid.

Ok itā€™s only if we need to fully rediscover the network (i.e. if the XML is lost). I think the transaction manager may be ā€œheavyā€ in terms of CPU, but needed anyway. Adding a hashset, or setting and checking a Boolean in the transactions, is not a big deal computationally speaking (even better then the second option).

Pedro

Well, currently we assume ALL devices are sleepy on initialisation which to me seems a bad idea as wellā€¦

True, but in any case, under heavy load, having them in a ā€œsleepyā€ queue till node descriptor will slow down three or four transactions maximum of it was not a sleepy node, and will make sure it does not impact other transactions if it was are sleepy.

Without load it makes no difference.

I would prefer the conservative approach, but up to you. :slight_smile:

Pedro

Do you know if the HUSBZB-1 supports it :slightly_smiling_face:? And if you donā€™t, could there be issues if I try it but it is not supported?

Remember, this flag is only half of the story. The queue management is completely different as well. It may also adversely impact a devices ability to initialise, and may make it go to sleep quicker if it doesnā€™t receive any requests.

Thatā€™s not actually true - again, the queue management is different for sleepy and non-sleepy devices.

Both options have positives and negatives. I think that defaulting to send frames immediately will likely help sleepy devices initialise when they first join, but if the case is a reinitialisation due to lost persistence file, then it may be a hindrance.

Iā€™ll have a bit more of a think about itā€¦

Iā€™m not sure, but I would expect soā€¦

I donā€™t think it will hurt to try it. I think the worst that will happen is you will get an error and it will continue to do what itā€™s already doingā€¦

1 Like

So I think for now I donā€™t want to change the default state of the NodeDescriptor - this might be something to look at later, but Iā€™d prefer not to loose such a change in this PR.

However, it is easy with the current information to simply count the number of sleepy transactions in the queue if the state changes, so I will do that. This is something that should happen nearly never, and only ever once per sleepy device, so this avoids any additional load in the processing of transactions.

For those testing the latest transaction code, I have just deployed an updated version of the 1.1.11-SNAPSHOT version.

1 Like