Normally yes, but if we suddenly try to send hundreds of messages to different nodes, I would expect it to grow accordingly and go back to zero after a while (as it is doing). I do not think this is necessarily broken (and this is happening with a standard OH and ZSS).
In 1.1.10, I see this as normal on a big network when initialising (as long as it goes back to 0). With the transaction manager, I agree this should not happen.
As I see it in the logs, the dongle is responding correctly. After every UCAST, it replies with a SEQ:xx and an OK (or an error when there are too many unicasts on air, later on, but not at the beginning).
Looking at new logs, I think the issue might come from the TransactionID overflowing (of course, it is just 0-255), and a single response then "killing" the transactions of more than one sent command (with the same TID)?
Looking at the logs, I do not see issues in the SEQ/OK/ACK/NACK sequences. Looking further into the "rolling over" TIDs…
Not from the script. I am just copying all built jars from the MVN target directories to the addons folder. I will remove both the console and serial jars, which were causing exceptions.
I disagree - at least with the Transaction Manager that should not be the case. It should always be close to 0 - this queue is only there to buffer messages to the dongle to take up the slack in the communication with the dongle (which is quite slow with the Telegesis dongle).
All transactions should be paced by the dongle.
Agreed - my initial comment was based on your statement that the timer was going off, which should only happen if the dongle doesn't respond.
Why do you think this? I'm not quite sure how it can "overflow" so can you explain your theory? I do think something is not right in this area. If I search for "-> TX_ACK ==" there are only the first 10 transactions shown - after this they all seem to be NAK.
Hmmm - I don't think this is the issue. I think it's as I mentioned earlier with the "Telegesis interrupted in sendRequest" messages. The transactions are being interrupted, so there will be NO transaction ID in the returned packet from the frameHandler.sendRequest call, and therefore the correlation fails.
If you look at the log, the transactions work ok until the "interrupted in sendRequest" messages start - then nothing works properly.
Can you try increasing TelegesisFrameHandler.transactionTimeout - maybe try a couple of seconds? I wonder if 1 second is too short on some systems (you didn't answer my question about what hardware you have?).
We agree to disagree. With the transaction manager this should not happen. Without it, at least in my installation, it has been happening for months…
Well, I understand that only up to the ACK/NACK. But the transaction will not be completed until we receive the response we expect (i.e. reading an attribute). The transaction manager will not send more transactions (if the max is reached) even if there are no packets "on air". Anyway, that is not the issue here.
The transaction counter is an "integer & 0xff". As new transactions come through, we assign a TID and increment. After a while, I have seen in my logs a new transaction being assigned a TID that is already in use by an outstanding transaction. When the response goes to the transaction listener, it matches both transactions. Not usual, but the probability is > 1/256 once the network has been running a while and there are some retried transactions in the queues, which is not negligible…
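The wraparound can be demonstrated in a few lines (a hypothetical sketch - the counter, allocation style and collision check here are illustrative, not taken from the actual codebase):

```java
// Sketch of how an 8-bit transaction ID counter can be reassigned while an
// older transaction with the same TID is still outstanding, so one response
// would match both. Names are made up for illustration.
import java.util.HashSet;
import java.util.Set;

public class TidWrapDemo {
    private static int counter = 0;

    // Mirrors a "counter++ & 0xFF" style allocation
    static int nextTid() {
        return counter++ & 0xFF;
    }

    public static void main(String[] args) {
        Set<Integer> outstanding = new HashSet<>();
        // TID 0 stays queued (e.g. a retried sleepy-device command)
        outstanding.add(nextTid());

        // 255 more allocations wrap the counter back around
        for (int i = 0; i < 255; i++) {
            nextTid();
        }

        int fresh = nextTid(); // counter is now 256 -> 256 & 0xFF == 0
        System.out.println("fresh TID = " + fresh
                + ", collides with outstanding: " + outstanding.contains(fresh));
        // prints: fresh TID = 0, collides with outstanding: true
    }
}
```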
Yes, without the transaction manager there is no other queue, so it will happen. The purpose of the transaction manager is to better manage the queue, and handle the pacing and timing of transactions…
True, but that should be the same thing. If there is a response, it gets completed - if there is no response, then the dongle will send the NAK. The transaction manager will wait for something like 12 seconds for the response, but the dongle should time out within 8 seconds max (the ZigBee MAC polling rate), but normally less if it's not an ED, so in general the dongle should always pace the transactions.
Well - that should not happen unless there are more than 256 transactions outstanding. Since you have a lot more than this, I suspect that what you are seeing is an artefact of the problem, and probably not the cause (??).
Yep - I do now. I get so many logs and my memory doesn't have enough room.
It's a different timeout… There is a 5 second timeout, and a 1 second timeout - I got them mixed up earlier, and it is the 1 second timeout in which the response is expected from the device. It is used in the future.get() call.
Not really. By "outstanding" we can have up to 10 in the transaction manager (I am speaking of transaction IDs, not frame SEQs from the Telegesis). In the per-device queues there is no limit (other than the number of transactions that can get there in the timeout period).
Commands are assigned a TID when they are queued. If there is a burst (i.e. network discovery, or startup) and/or retries, these will end up in the queues with old TIDs assigned. After several retries and some time, even with only 10 outstanding transactions, a new one will end up matching the TID of an old queued one (I have seen it in today's logs).
In a (probably abnormal) situation with lots of queued transactions and retries, it is definitely happening. I am looking into it in case this has something to do with it.
Definitely, the frame handler queue should not grow with the transaction manager, but it is doing so, so there must be "something somewhere".
Ok, by outstanding, I meant in the dongle queue, which would "overrun" the translation map used to link the TID to the dongle message ID.
I thought I'd changed this so they are allocated when they are sent, but it seems you're right. I will change this.
Yes, I'm pretty sure it is this timer I mention above…
The mechanism is as follows -:
The transaction is added to the dongle queue, the timer is started, and the dongle transaction waits to complete for up to 1 second. This fails, and we see the "Telegesis interrupted in sendRequest" message showing that the future failed. This ultimately results in a NAK completion being sent to the transaction manager, which then releases the next transaction.
The code that sends the NAK is this -:
public void run() {
frameHandler.sendRequest(command);
// Let the stack know the frame is sent
zigbeeTransportReceive.receiveCommandState(msgTag,
command.getStatus() == TelegesisStatusCode.SUCCESS ? ZigBeeTransportProgressState.TX_ACK
: ZigBeeTransportProgressState.TX_NAK);
// Multicast doesn't have a sequence returned, so nothing more to do here
if (command instanceof TelegesisSendMulticastCommand) {
return;
}
messageIdMap.put(((TelegesisSendUnicastCommand) command).getMessageId(), msgTag);
}
In the above, the sendRequest fails due to the interrupted exception with this 1 second timer. The transaction is then not SUCCESS, so the TX_NAK is sent back to the manager.
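The failure mode described above can be reproduced with a plain Future (an illustrative sketch - the real frame handler's internals may differ, and all names and delays here are assumptions):

```java
// Shows how a too-short get() timeout abandons a dongle transaction that is
// still in flight: no SEQ is available, the TID/messageId correlation fails,
// and a TX_NAK would be reported even though the dongle later answers OK.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        // Simulated dongle: replies SEQ/OK after 2 seconds
        Future<String> response = exec.submit(() -> {
            Thread.sleep(2000);
            return "SEQ:42 OK";
        });
        String result;
        try {
            // Corresponds to the 1 second transactionTimeout
            result = response.get(1, TimeUnit.SECONDS);
        } catch (TimeoutException | InterruptedException e) {
            // The "Telegesis interrupted in sendRequest" case:
            // the caller gives up before the SEQ arrives
            result = null;
        }
        System.out.println("result = " + result); // prints: result = null
        exec.shutdownNow();
    }
}
```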
Got it! BTW, the dongle handler transaction is completed when we get the SEQ + OK/ERROR responses, so no need to wait for >8 seconds. But, as you say, the 1sec is not enough:
In the log above, the Telegesis send happens at "19:09:10.417", is aborted at "19:09:11.541", and the SEQ / OK arrive at "19:09:10.448" / "19:09:11.558".
Don't ask me why the entries in the log are not in order! (???)
I am increasing it to 3 seconds to see what happens…
But only "mostly", because I have still seen a few TX_NAKs. Not too many, but I will investigate a bit, in case there is some contention somewhere (maybe some synchronized block?).
I have also tested that without a "fast track" for the final outstanding transaction queue, my network is "choppy": half of the time I have to wait several seconds when turning lights on/off, and startup is veeeery slow.
The reason: the 10 outstanding transaction slots get filled very quickly with transactions for several sleepy devices. The transaction manager then waits for some to finish, which takes on average half of 8-12 seconds…
Can we revisit the need for a fast queue? (Not really a queue - the implementation I showed above is working OK for me, and the complexity it adds is "less than 15 lines of code" - but any solution that keeps a minimum "bandwidth" for non-sleepy devices would do.)
Great stuff - thanks for your help. As a matter of interest, what does the Telegesis queue size get to now?
Ok, it would be good if you can take a look over these to see if there's a reason…
Well, I'm always open to being persuaded, but I think it will take some doing. Currently I don't think that bypassing the queue is a good idea - maybe there is some tuning of the queues that could be done (the inter-message delays for example) - these were largely arbitrary anyway.
I also don't see why we should bump the priority of transactions where the receiver is on at the expense of those that sleep. Again, the queue management ought to account for that by only releasing one transaction for sleeping nodes at a time - that is one of the principal requirements here, since parent routers in a ZigBee network are only required to buffer a single packet per child.
What I have just realised is that I don't have different queue profiles for sleepy devices (and as above, this is needed). I had planned that this would be done from the NetworkManager, and the default profile would only be used if there was no specific profile defined, but clearly I've missed this. I would expect that this will solve the issue you are seeing, as it will do a similar thing, but smarter.
I am on nightly build 1524 (and zsmartsystems 1.1.10), and hoped to enjoy the new configurable polling. I have added
, zigbee_reporting_polling="15" ]
to a thing in my .thing file. But so far, it doesn't seem to work.
Is it supposed to work like that?
Why a minimum of 15 seconds? I hoped to be able to set two seconds, as I want a switch to control lights.
Because it is very easy to swamp the system and cause huge delays to commands. Polling is not a particularly good option (I know in your case there may not be any better option).
Now it stays stable. At some point during network initialization it went up to 13 (with some timeouts), but it recovered from that quite well, back to 1. Nothing like the ~400 without the transaction manager, and no TOO_MANY_UNICAST errors.
I have seen that my OH's java process is consuming much more than it used to, even when "doing nothing". This may explain why I still get some timeouts. One question, in ZigBeeDongleTelegesis.sendCommand():
// We need to get the Telegesis SEQ number for the transaction so we can correlate the transaction ID
// This is done by spawning a new thread to send the transaction so we get the response.
new Thread() {
@Override
public void run() {
frameHandler.sendRequest(command);
// Let the stack know the frame is sent
zigbeeTransportReceive.receiveCommandState(msgTag,
command.getStatus() == TelegesisStatusCode.SUCCESS ? ZigBeeTransportProgressState.TX_ACK
: ZigBeeTransportProgressState.TX_NAK);
// Multicast doesn't have a sequence returned, so nothing more to do here
if (command instanceof TelegesisSendMulticastCommand) {
return;
}
messageIdMap.put(((TelegesisSendUnicastCommand) command).getMessageId(), msgTag);
}
}.start();
We are spawning a new thread for every single sent command, which is a costly operation. Might this have something to do with it? Would it make sense to use a persistent thread instead?
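If thread creation turns out to matter, one possible alternative (just a sketch, not the project's actual code - `CommandSender` and `sendCommand` are illustrative names) is a single persistent executor:

```java
// Replaces the per-command "new Thread().start()" with one long-lived worker
// thread, so every sendCommand call reuses the same thread instead of paying
// thread-creation cost for each frame.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CommandSender {
    // One persistent worker thread for all outgoing commands
    private final ExecutorService commandExecutor = Executors.newSingleThreadExecutor();

    public void sendCommand(Runnable sendTask) {
        // Tasks are queued and executed in submission order
        commandExecutor.execute(sendTask);
    }

    public void shutdown() {
        commandExecutor.shutdown();
    }
}
```

Note that serialising all sends onto one worker changes the timing slightly (a send blocked inside frameHandler.sendRequest would delay the next one), so a small fixed-size pool might be a better fit if sends are meant to overlap.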
Neither do I. What I propose is that, from the maxOutstandingTransactions (10 by default), we reserve a certain amount (I tested with 5) for fast transactions, so that once we have reached this threshold, only "fast" transactions will be fetched from the node queues and sent to the dongle.
Not really bumping their priority, but just reserving a minimum number of "send slots" for these.
As a user, when I turn on a light, I would prefer it to happen immediately, and I do not care if a network discovery packet, or anything that goes to a device that will anyway take 4 seconds on average, is delayed a few ms because of this. Sorry for using Agile wording here.
Let me explain why I think this is needed: e.g. the Telegesis dongle will only handle 10 transactions "on air". This is consistent with the 10 transactions enforced by the Transaction Manager.
The running transactions may be a mix of:
Commands that go to RxOnIdle devices (fast transactions, "negligible" time).
Commands sent to sleepy devices (will take 0 to 8 seconds, 4 on average).
Commands to offline devices (will time out in about 8 seconds).
There is a good probability that at a given moment all transactions in the outstanding transactions list are either to slow devices or to devices that are offline, simply because these take so much longer than the fast ones.
When this happens, we will have the fast transactions competing for a slot with all other queued ones (fast and slow), and in any case waiting for a slow transaction to complete.
A fair queueing practice (vs a random one) would give inverse weight to the time each transaction takes, to better shape traffic. A low latency queueing practice would reserve some slots for fast transactions (or keep separate queues).
This is an improvement in terms of user experience, especially considering that the commands to non-sleepy devices (lights, plugs, etc.) are the ones the user would expect to be acted on most immediately.
This is normally solved by having fast and slow queues, or any equivalent solution (such as the proposed one, but again, this is just one possible implementation).
This cannot be solved in the "per node" queues, as we may still have more queues than available outstanding transactions.
A solution requires the transaction manager being aware of this, and making a decision to pick a packet from the queue of a fast device when the outstanding transactions list reaches a given threshold (which could be as little as just 1 free slot, even if I used 5 for testing).
I hope this clarifies a bit why I think this is necessary: it is a user experience improvement, not a fix for a "non working" scenario.
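The reserved-slot idea could be sketched roughly like this (illustrative only - the class, method names and numbers are assumptions, not the transaction manager's actual API):

```java
// Gate on the outstanding-transaction count that keeps a few slots reserved
// for "fast" (RxOnIdle) destinations: slow (sleepy-device) transactions may
// only occupy maxOutstanding - reservedForFast slots, so a light switch
// command can always find a free slot.
public class SlotGate {
    private final int maxOutstanding;
    private final int reservedForFast;
    private int outstanding = 0;

    public SlotGate(int maxOutstanding, int reservedForFast) {
        this.maxOutstanding = maxOutstanding;
        this.reservedForFast = reservedForFast;
    }

    // fast == true for RxOnIdle (non-sleepy) destinations
    public synchronized boolean tryAcquire(boolean fast) {
        int limit = fast ? maxOutstanding : maxOutstanding - reservedForFast;
        if (outstanding < limit) {
            outstanding++;
            return true;
        }
        return false;
    }

    // Called when a transaction completes (ACK, NAK or timeout)
    public synchronized void release() {
        outstanding--;
    }
}
```

With `new SlotGate(10, 2)`, after 8 slow acquisitions `tryAcquire(false)` fails while `tryAcquire(true)` still succeeds, so a command to a non-sleepy device never waits behind a full set of sleepy-device transactions.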
After thinking about this more last night, I agree. I will likely implement it differently than you did, as I think that your implementation doesn't guarantee this. The situation that it needs to protect against is all 10 transactions being for sleepy devices.