OH1.8.3 uses 7900 Threads, 12 GB RAM

When does OH1.8.3 fork new threads?
With an installation of ~4000 items and ~30 Jython scripts, I am running into the thread limit (ulimit -u): 7900 threads, using ~12 GB of RAM.
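For reference, a rough sketch of how the JVM's thread count can be compared against ulimit -u on Linux; the pgrep pattern and the use of /proc are assumptions about a typical Linux install, not part of the setup described here:

#!/usr/bin/env python
# Rough illustration: compare the openHAB JVM's thread count with the
# per-user process/thread limit that `ulimit -u` reports (RLIMIT_NPROC).
import resource
import subprocess

# Assumption: `pgrep -f openhab` finds the openHAB java process.
pid = subprocess.check_output(["pgrep", "-f", "openhab"]).decode().split()[0]

threads = 0
with open("/proc/%s/status" % pid) as f:
    for line in f:
        if line.startswith("Threads:"):
            threads = int(line.split()[1])

soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print("openHAB threads: %d, ulimit -u (soft limit): %s" % (threads, soft))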

Please let me know what debug information is most helpful.

What are those threads doing?

How many items? How many updates per second?

This feels like you are either reaching a scale beyond what OH was designed to handle, or hitting some weird bug that causes a proliferation of threads. 7900 threads is really high.

Without more details I have no specific suggestions, only some high-level ideas, assuming you really are generating that many events:

  • split out separate functions into separate OH instances and run them on separate machines with minimal yet controlled interaction between them
  • move some of the processing out of OH into helper scripts or move some of the processing closer to the source of the data
  • reduce the refresh rates where possible so there are fewer events
  • scrub rule triggers to make sure that you are only triggering when absolutely necessary (see the trigger sketch after this list)
  • reduce the number of rules by making sure there is no more than one rule triggered when an event occurs
  • consider a mix of Jython and Rules DSL; this may make things worse but it is possible that the Rules DSL is better able to manage threads than the JSR223 binding (e.g. thread pooling); review the code to determine if this would work (maybe @steve1 knows whether the JSR223 or Rules DSL engine uses thread pooling)
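To illustrate the point about scrubbing rule triggers, here is a minimal sketch of a narrowly scoped trigger using the OH 1.x JSR223 binding's Jython conventions (Rule, RuleSet, ChangedEventTrigger, OnOffType and the oh helper are provided by that binding's script scope; the rule itself and the item name are invented examples):

# Sketch only: fire on the OFF -> ON transition instead of on every state
# update, so a frequently polled item does not trigger the rule constantly.
class PumpStartRule(Rule):
    def getEventTrigger(self):
        return [ChangedEventTrigger("DomesticWater_Pump", OnOffType.OFF, OnOffType.ON)]

    def execute(self, event):
        oh.logInfo("PumpStartRule", "Pump switched on, new state: " + str(event.newState))

def getRules():
    return RuleSet([PumpStartRule()])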

Sorry, 4000 items. The number of updates should be around 200 per second.

I assume that it is some sort of bug. The threads never seem to terminate; otherwise I would still be able to log in via ssh.

(Direct access is difficult, as the Ubuntu instance is running on a Hyper-V host system which I cannot access at the moment. I’ll update you once I regain access.)

Is there any way I can kill OH remotely?

OK, that is freaking huge. I suspect it is at least an order of magnitude larger than what OH is designed to handle. OH is designed for HOME automation, not industrial SCADA applications. What on earth are you trying to use OH to do?

I suspect what is happening is that OH is unable to complete processing for all 200 events that occur each second, so it gets behind and the number of threads just continues to grow and grow until the machine runs out of RAM and is brought to its knees. As the processing threads pile up, OH becomes less and less able to process each event in a timely manner, and the problem gets worse and worse.

Sadly, diagnosing the specific problem is not going to be easy. And even once diagnosed, there is no guarantee it can be corrected without completely rearchitecting OH.

To start, I would do whatever you need to do to drop the number of updates per second by half and see if it runs stably at that rate. Keep tuning the number of events per second as best you can until you find the sweet spot where OH can keep up. Armed with that information, you then need to rearchitect your automation solution so that no single instance of OH running on similar hardware receives more events than that.
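If it helps, a rough way to estimate the event rate is to count entries per second in the openHAB 1.x events.log; the path and the assumption that every event is logged on one line with a leading "YYYY-MM-DD HH:MM:SS" timestamp are guesses about a default installation:

# Sketch only: count how many events openHAB writes to events.log per second,
# to get a feel for the rate OH has to keep up with. Path and timestamp
# format are assumptions about a default setup.
def events_per_second(path="/opt/openhab/logs/events.log"):
    counts = {}
    with open(path) as f:
        for line in f:
            second = line[:19]                     # e.g. "2016-06-09 20:14:52"
            counts[second] = counts.get(second, 0) + 1
    rates = sorted(counts.values())
    print("peak events/s: %d, median events/s: %d" % (rates[-1], rates[len(rates) // 2]))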

Alternatively, if you are good at Java optimization, hook a profiler up to it and find where in the code the problems are occurring, then try to optimize them. I bet the devs would love to receive some major optimization improvements in a PR.

Not that I know of, short of pressing the ON/OFF button on the machine.

Try unplugging the network and see if OH is able to catch up enough to close out some of those threads. Otherwise, a hard reset may be in order.

Thank you! Unplugging the network cable helped.

The logs also give some insights:

2016-06-09 21:17:26.009 [ERROR] [org.quartz.core.ErrorLogger   ] - Job (DEFAULT.VentilationHerget.py#org.python.proxies.__builtin__$Ventilation_IncrementWriteCounter$133@9925b8f#*/1 * * * * ? threw an exception.
org.quartz.SchedulerException: Job threw an unhandled exception.
        at org.quartz.core.JobRunShell.run(JobRunShell.java:224) ~[quartz-all-2.1.7.jar:na]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz-all-2.1.7.jar:na]
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method) ~[na:1.8.0_72]
        at java.lang.Thread.start(Thread.java:714) ~[na:1.8.0_72]
        at org.openhab.core.jsr223.internal.engine.scriptmanager.Script.executeRule(Script.java:276) ~[na:na]
        at org.openhab.core.jsr223.internal.engine.scriptmanager.ScriptManager.executeRules(ScriptManager.java:137) ~[na:na]
        at org.openhab.core.jsr223.internal.engine.TimeTriggerJob.execute(TimeTriggerJob.java:39) ~[na:na]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) ~[quartz-all-2.1.7.jar:na]
        ... 1 common frames omitted

I’ll give you more details tomorrow.

Oh, this seems to be caused by an unreachable Modbus device. Here are the last few messages from while the system was still responsive (the German errors translate to “Keine Route zum Zielrechner” = “No route to host” and “Verbindungsaufbau abgelehnt” = “Connection refused”):

2016-06-09 20:14:43.318 [WARN ] [.b.modbus.internal.ModbusSlave] - ModbusSlave (vsr5): not connected -- aborting request net.wimpi.modbus.msg.WriteSingleRegisterRequest@1f1e4e15
2016-06-09 20:14:46.230 [ERROR] [dbusSlaveConnectionFactoryImpl] - connect try 1/3 error: Keine Route zum Zielrechner. Connection TCPMasterConnection@71a1598c[socket=<null>]. Endpoint ModbusTCPSlaveEndpoint@16f8b0aa[address=192.168.135.16,port=502]
2016-06-09 20:14:46.318 [ERROR] [dbusSlaveConnectionFactoryImpl] - connect try 1/3 error: Keine Route zum Zielrechner. Connection TCPMasterConnection@5b381d99[socket=<null>]. Endpoint ModbusTCPSlaveEndpoint@2db306c6[address=192.168.135.60,port=502]
2016-06-09 20:14:49.230 [ERROR] [dbusSlaveConnectionFactoryImpl] - connect try 2/3 error: Keine Route zum Zielrechner. Connection TCPMasterConnection@71a1598c[socket=<null>]. Endpoint ModbusTCPSlaveEndpoint@16f8b0aa[address=192.168.135.16,port=502]
2016-06-09 20:14:49.318 [ERROR] [dbusSlaveConnectionFactoryImpl] - connect try 2/3 error: Keine Route zum Zielrechner. Connection TCPMasterConnection@5b381d99[socket=<null>]. Endpoint ModbusTCPSlaveEndpoint@2db306c6[address=192.168.135.60,port=502]
2016-06-09 20:14:49.526 [ERROR] [g.openhab.io.net.http.HttpUtil] - Fatal transport error: java.net.ConnectException: Verbindungsaufbau abgelehnt
2016-06-09 20:14:49.526 [ERROR] [.o.b.http.internal.HttpBinding] - No response received from 'http://192.168.135.14/momval.htm'
2016-06-09 20:14:52.284 [ERROR] [dbusSlaveConnectionFactoryImpl] - connect try 3/3 error: Keine Route zum Zielrechner. Connection TCPMasterConnection@71a1598c[socket=<null>]. Endpoint ModbusTCPSlaveEndpoint@16f8b0aa[address=192.168.135.16,port=502]
2016-06-09 20:14:52.284 [ERROR] [dbusSlaveConnectionFactoryImpl] - re-connect reached max tries 3, throwing last error: Keine Route zum Zielrechner. Connection TCPMasterConnection@71a1598c[socket=<null>]. Endpoint ModbusTCPSlaveEndpoint@16f8b0aa[address=192.168.135.16,port=502]
2016-06-09 20:14:52.285 [ERROR] [dbusSlaveConnectionFactoryImpl] - Error connecting connection TCPMasterConnection@71a1598c[socket=<null>] for endpoint ModbusTCPSlaveEndpoint@16f8b0aa[address=192.168.135.16,port=502]: Keine Route zum Zielrechner
2016-06-09 20:14:52.293 [WARN ] [.b.modbus.internal.ModbusSlave] - ModbusSlave (ventilationHouse): Error getting a new connection for endpoint ModbusTCPSlaveEndpoint@16f8b0aa[address=192.168.135.16,port=502]. Error was: Unable to validate object
2016-06-09 20:14:52.293 [WARN ] [.b.modbus.internal.ModbusSlave] - ModbusSlave (ventilationHouse): not connected -- aborting request net.wimpi.modbus.msg.WriteSingleRegisterRequest@759ec360
2016-06-09 20:14:52.318 [ERROR] [dbusSlaveConnectionFactoryImpl] - connect try 3/3 error: Keine Route zum Zielrechner. Connection TCPMasterConnection@5b381d99[socket=<null>]. Endpoint ModbusTCPSlaveEndpoint@2db306c6[address=192.168.135.60,port=502]
2016-06-09 20:14:52.318 [ERROR] [dbusSlaveConnectionFactoryImpl] - re-connect reached max tries 3, throwing last error: Keine Route zum Zielrechner. Connection TCPMasterConnection@5b381d99[socket=<null>]. Endpoint ModbusTCPSlaveEndpoint@2db306c6[address=192.168.135.60,port=502]
2016-06-09 20:14:52.318 [ERROR] [dbusSlaveConnectionFactoryImpl] - Error connecting connection TCPMasterConnection@5b381d99[socket=<null>] for endpoint ModbusTCPSlaveEndpoint@2db306c6[address=192.168.135.60,port=502]: Keine Route zum Zielrechner
2016-06-09 20:14:52.319 [WARN ] [.b.modbus.internal.ModbusSlave] - ModbusSlave (vsr6): Error getting a new connection for endpoint ModbusTCPSlaveEndpoint@2db306c6[address=192.168.135.60,port=502]. Error was: Unable to validate object
2016-06-09 20:14:52.319 [WARN ] [.b.modbus.internal.ModbusSlave] - ModbusSlave (vsr6): not connected -- aborting request net.wimpi.modbus.msg.WriteSingleRegisterRequest@35285827
2016-06-09 20:14:55.230 [ERROR] [dbusSlaveConnectionFactoryImpl] - connect try 1/3 error: Keine Route zum Zielrechner. Connection TCPMasterConnection@5270460e[socket=<null>]. Endpoint ModbusTCPSlaveEndpoint@16f8b0aa[address=192.168.135.16,port=502]
2016-06-09 20:14:55.318 [ERROR] [dbusSlaveConnectionFactoryImpl] - connect try 1/3 error: Keine Route zum Zielrechner. Connection TCPMasterConnection@2494c714[socket=<null>]. Endpoint ModbusTCPSlaveEndpoint@2db306c6[address=192.168.135.60,port=502]
2016-06-09 20:14:57.424 [WARN ] [.DomesticWater_ImmersionHeater] - DomesticWater_Pump (Type=SwitchItem, State=OFF)
2016-06-09 20:14:58.230 [ERROR] [dbusSlaveConnectionFactoryImpl] - connect try 2/3 error: Keine Route zum Zielrechner. Connection TCPMasterConnection@5270460e[socket=<null>]. Endpoint ModbusTCPSlaveEndpoint@16f8b0aa[address=192.168.135.16,port=502]
2016-06-09 20:14:58.322 [ERROR] [dbusSlaveConnectionFactoryImpl] - connect try 2/3 error: Keine Route zum Zielrechner. Connection TCPMasterConnection@2494c714[socket=<null>]. Endpoint ModbusTCPSlaveEndpoint@2db306c6[address=192.168.135.60,port=502]
2016-06-09 20:20:14.005 [ERROR] [org.quartz.core.JobRunShell ] - Job DEFAULT.Ventilation_Control.py#org.python.proxies.__builtin__$Ventilation_HergetActuationSignal$86@550ed456#* * * * * ? threw an unhandled Exception:
2016-06-09 20:20:14.008 [ERROR] [org.quartz.core.ErrorLogger ] - Job (DEFAULT.Ventilation_Control.py#org.python.proxies.__builtin__$Ventilation_HergetActuationSignal$86@550ed456#* * * * * ? threw an exception.

Although you didn’t respond to my question about the threads, I’ll add (in case it’s useful for others) that an effective way to diagnose these problems is to get a memory dump of the openHAB Java process, including the stack trace for each thread. That information can be used to determine what the threads are doing (whether they are blocked or actively running code).
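For illustration, a rough Jython sketch of pulling per-thread stack traces from inside the JVM via the standard ThreadMXBean; the output path is made up:

# Sketch only: write the name, state and stack trace of every JVM thread to a
# file so you can see what the thousands of threads are doing.
from java.lang.management import ManagementFactory

def dump_all_threads(path="/tmp/openhab-threads.txt"):
    bean = ManagementFactory.getThreadMXBean()
    # True, True also records the monitors/synchronizers threads hold or wait on
    infos = bean.dumpAllThreads(True, True)
    out = open(path, "w")
    for info in infos:
        out.write("%s (%s)\n" % (info.threadName, info.threadState))
        for frame in info.stackTrace:
            out.write("    at %s\n" % frame)
        out.write("\n")
    out.close()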

I used this technique to diagnose issues with the Sonos binding. I have Jython rules that use the JMX ThreadMXBean to monitor the process thread count. Another rule triggers when the count crosses a specified threshold and uses the HotSpotDiagnosticMXBean to trigger a memory dump and send me a notification.
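Roughly along these lines, as a simplified sketch (the threshold and dump path are placeholders, and the actual rule wiring and notification are omitted):

# Sketch only: check the JVM thread count and trigger a heap dump when it
# crosses a threshold, using the standard JMX beans mentioned above.
from java.lang.management import ManagementFactory
from com.sun.management import HotSpotDiagnosticMXBean

THREAD_THRESHOLD = 2000  # placeholder value

def check_thread_count():
    count = ManagementFactory.getThreadMXBean().getThreadCount()
    if count > THREAD_THRESHOLD:
        hotspot = ManagementFactory.newPlatformMXBeanProxy(
            ManagementFactory.getPlatformMBeanServer(),
            "com.sun.management:type=HotSpotDiagnostic",
            HotSpotDiagnosticMXBean)
        # live=False also dumps unreachable objects, which can matter for leaks
        hotspot.dumpHeap("/tmp/openhab-%d-threads.hprof" % count, False)
    return count

A cron-triggered rule can call a check like this periodically and send the notification from the same rule.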

I then used the Eclipse Memory Analyzer to study the memory dump and determine what the thousands of threads were actually doing. Typically, most of them are blocked in a similar way. If they are actively running, the stack traces will show what code they were executing at the time of the dump.

Sorry for not responding to this earlier.
At that moment, I was not able to connect to the machine in any way.
Now I’m making periodic memory dumps, so if it happens again, I’ll be able to debug it and provide you with the information.