Adventures implementing reliable network detection using arping

Hi!
For the last 2 days I’ve been experimenting with reliable network detection for our iphones using arping. This thread documents my findings and will result in some pull requests and/or requests for @David_Graeff probably :-). It’s not easy to just open an issue for this, I think it needs some discussion. I found a way changing the code that should make arping 100% reliable, using arping as follows:

arping -I eth0 -C 1 -c 100 -W 0.1 192.168.1.60

This will basically spam the IP for arp requests and immediately exits when the iphone responds, or exits with exit code 1 after 10 seconds. When my iphone is in deep sleep, this reliably works 100% of the time.

OK great, on to changing the actual binding code. Note that I’m running everything in Docker. The first thing I found was that ping isn’t installed in the Openhab container. I’ll make a PR afterwards to get this added, as I found absolutely no issues to work with ping. The second thing I found was that auto-discovery for the network binding doesn’t work. Looking at the code, this doesn’t even seem Docker related. Looking inside the class NetworkDiscoveryService in the startScan method, there is

final ExecutorService service = executorService;
if (service == null) {
      return;
}
...
executorService = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors() * 2);

This code never allows service to become non-null, so “startScan” would never work. Simply flipping the statement into

service != null

makes auto-discovery suddenly work perfectly, also inside my docker container.

Next up, changing the arping call. As inside the Docker container ARPing 2.14 is installed, the binding uses the following arping cmd:

arping -c 1 -i interfaceName ip

So I changed that to

arping -c 100 -C 1 -W 0.1 -i interfaceName ip

Then, performed a mvn clean install, uninstalled the existing binding from my setup, pasted the .jar in the /addons folder, this works perfect. For a while…

Thanks to my additional logging, I suddenly see a lot of InterruptedException logs from performARPping. Thinking this was due to my change, I reverted everything but only added a logging statement in the catch (InterruptedException e) statement. This also shows that the normal implementation is constantly throwing InterruptedExceptions.

Let’s try to beat this thing together, we could have perfect presence detection and the forum would be less bloated with arping configuration issues. Why is ProcessBuilder.waitFor() throwing these exceptions?. I don’t have enough OpenHAB knowledge to reason about the threading constraints or when threads in specific bundles are interrupted.

It would be great if others could confirm their network binding is also throwing these InterruptedExceptions. I made a new network binding jar that is exactly the same as the normal 2.4.0 one, but with added error logging:

After placing this .jar in your /addons folder, and you start seeing things like

it confirms that there’s currently a threading issue with this binding, also preventing my fix for reliable arping :frowning:

6 Likes

Great analysis. Thanks for looking into this and running with it.

Auto discovery is disabled on purpose. We are leaking resources somewhere and I haven’t found yet where. The binding itself leaks as well but not as fast as the discovery service.

I guess it has to do with how we exec external processes, could also be because of those mentioned interrupt exceptions.

1 Like

Hi David, thanks for joining the conversation this quick :slight_smile: nice to see confirmation of some issue going on

Suspecting there is some serious scheduling-thread-in-thread-process-forking going on which could be the cause of issues, I threw away all the threading stuff, something like (shortened)

    boolean active = false;    
    if (interfaceNames != null) {
        for (final String interfaceName : interfaceNames) {
            active = active || performARPping(interfaceName);
        }
    }
    if (pingMethod != null) {
        if (pingMethod != IpPingMethodEnum.JAVA_PING) {
            active = active || performSystemPing();
        }
    }

    submitFinalResult(active);

The behaviour is now completely deterministic and blocking, additionally it will simply skip even trying to use ping if the arping went fine. There’s no guesswork in which thread finished already and which didn’t, no intermittent checking of whether final results have to be established.

So far, this now runs 100% stable. The moment I turn wifi off my iphone is shown as OFFLINE, and immediately turns and stays ONLINE when switching it on again. I’ll keep this running for the night to see just how stable this is. If this all looks like solid stuff, do you want me to create an extensive PR containing:

  • re-enabling auto-discovery, or disabling the automatic auto-discovery menu
  • arping the new way, where we mash arp packets for 10 seconds and instantly exit if success
  • no more parallellism, deterministic behaviour of final state feedback
  • only try on the interface you’re actually using, in my case the binding was redundantly checking the “docker0” interface as well

… or will you incorporate this in the stuff you’re already working on? I don’t want to create a mess :slight_smile:

Are you using a software tool like ‘visual vm’? It’s a handy app to see the threads, stack and memory usage in real-time. If you have a leak it can help track it down.

https://visualvm.github.io

Nice tool!

Single threaded is always easier for the mind. But you have to think about a broader usage scenario. You are suspecting that arping is the tool of choice anyway, but think of network segment boundaries. If someone wants to check a different-subnet IP, arping will always fail, but block the entire status check.

I suggest a different approach.

  • Use CompletableFuture (the async methods). They already care about creating and shutting down threads so that burden is taken from us.
  • They can be chained together. No wait-for-many logic anymore.
  • We don’t interrupt them. A tool runs as long as it wants to run. (We are probably leaking while forcing the interrupt).
  • We store the last chained CompletableFuture. Next ping period, we check if it has finished:
    • If so: Start again
    • If not: Report to the user that something is misbehaving (in the Thing Status -> not the logs) and skip that ping period.

Parallelism is too important here to get rid of it. So far we have arping, ping, dhcp. We need ipv6 solicitation messages and DHCPv6 to account for IPv6 enabled devices and maybe even more methods in the future and I do not want to run them one after the other.

Yes, agreed. Iphone presence has stayed stable all night, so the deterministic approach not showing any interrupts might have been good just as a POC for leakage/instability. The thing I care about most is the arping command specifics :slight_smile: it might be a good idea to allow expert usage in which not only the path, but the full command is defined by the end-user. Also, I do think it’s in 95% of cases overkill to query all network interfaces, this should also be an opt-in/out, as whatever the final solution will be, unnecessarily forking twice the amount of processes every detection cycle isn’t great.

Let me know if you want me to implement/test/verify anything. I don’t want to walk in the way here. I’ll focus on fixing the Nikobus binding next :stuck_out_tongue:

You would help already but adding configuration options to the binding to change the arping command line. I wouldn’t expose the full command line, but the options that are useful for you.

The correct network interface could be cached to avoid running on all interfaces. True.

OK - I’ll add more arping parameters control once the more stable thread handling is in place. My arping approach really needs to run as long as it has to without interrupting, so it’s no point adding it in to the binding as-is.
At that point, I’ll also do some more rigorous testing, and involve some people from the community with other setups, so we can have stable iphone detection once and for all and stop bloating the forum with arping issues :slight_smile:

I will not be able to work on the network binding for the next months :confused:
Could you please submit your changes to the binding, before that gets lost.
I can still rework parts of it another time.

Just like we would cache interfaces per device, we should also cache whether or not it ever responded via arping or not. If it does, only arping is fine. The end-goal should be to spawn as few processes as possible.

I’ll set up a separate branch (still have to get all this stuff in place) and bake off a marketplace .zip from that. Then afterwards, you can take some parts and work them into the official binding. I’d rather not wait months to get this tested by others.

OMG please don’t do that if possible. If you submit a PR, I will be the reviewer. Your change is in before you finished saying abracadabra backwards. But please never use the marketplace for “beta” versions. openHAB will be highly confused with the next update etc.

Yeah I completely understand, but you’re saying it will take months before the fix is in that will allow my change to be in as well. I’ll make a PR but it’ll be unmergeable and just sit there?

Why would it be unmergable. I will use your approach for now. If arpings repeats are configurable, a default value of 1 (like it is now) will not block the process for long, so a blocing arping->ping chain is alright for me for now.

OK understood, I thought you wanted to keep the blocking approach out. Expect a PR in the coming days - I need some time to figure out the whole contribution thing. My development setup isn’t ideal and it’s pretty hard to do proper testing (I just bake .jar’s and put them in my running system and look at logs, dunno if anyone figured out remote debugging?)

Having issues with iphone detection, too. So I stumbled over this post and wanted to try out your arping command.

I am using latest docker container (2.4.0), net=host and Arch Linux Host. When I execute the arping command, the following error occurs and I am getting not the correct results. Any ideas why this happens?

:/openhab# arping -C 1 -c 100 -W 0.1 -I enp0s25 10.83.86.231
ARPING 10.83.86.231
arping: libnet_write(): libnet_write_link(): only -1 bytes written (Invalid argument)


--- 10.83.86.231 statistics ---
1 packets transmitted, 0 packets received, 100% unanswered (0 extra)

I am having the same issues with an iphone.
wouldn’t be more easy to change the arping path in the binding config to a little script instead that just runs arping -I eth0 -C 1 -c 100 -W 0.1 and returns the last value?
this way no changes in network binding will be necessary

Some additional configuration would be good though. After some experimenting I noticed that the binding uses the wrong network card for my system (Ubuntu 18.04 with network card names based on systemd). I got phone detection working by adding configuration options to the command line:

/usr/sbin/arping -I enp4s0

in the binding configuration.

I tweaked the network binding: retry=10, timeout=59000, refreshInterval=600000
it seems to work ok so far with iphone.
obviously there will be some delay when leaving the house, but I can live with that

Practical values are very welcome for the readme :slight_smile:
@interface: The binding tries to ping on all network interfaces. Except if the OH network interface has been fixed.