InfluxDB stopped working all of a sudden

bechte · May 28, 2020, 9:04pm

Hey guys,

I have a Raspi4 up and running well with OpenHab 2.5. Everything is configured via configuration files, and for a couple of months I haven't had any issues.

I run an influxdb on the Raspi4 as well, configured by the OpenHab-Config tool. Three weeks ago I added my photovoltaic power control, which sends a lot of status updates to the influxdb. Even then, it ran smoothly until Monday.

Now, I see a lot of errors in the log file, all with the same message:

2020-05-28 23:00:24.357 [ERROR] [org.influxdb.impl.BatchProcessor    ] - Batch could not be sent. Data will be lost
retrofit.RetrofitError: 4096
	at retrofit.RetrofitError.unexpectedError(RetrofitError.java:44) ~[retrofit-1.9.0.jar:?]
	at retrofit.RestAdapter$RestHandler.invokeRequest(RestAdapter.java:400) ~[retrofit-1.9.0.jar:?]
	at retrofit.RestAdapter$RestHandler.invoke(RestAdapter.java:240) ~[retrofit-1.9.0.jar:?]
	at org.influxdb.impl.$Proxy205.writePoints(Unknown Source) ~[?:?]
	at org.influxdb.impl.InfluxDBImpl.write(InfluxDBImpl.java:151) ~[influxdb-java-2.2.jar:?]
	at org.influxdb.impl.BatchProcessor.write(BatchProcessor.java:171) [influxdb-java-2.2.jar:?]
	at org.influxdb.impl.BatchProcessor$1.run(BatchProcessor.java:144) [influxdb-java-2.2.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_222]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_222]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_222]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_222]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_222]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_222]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
Caused by: java.lang.ArrayIndexOutOfBoundsException: 4096
	at okio.Buffer.writeUtf8(Buffer.java:818) ~[okio-1.4.0.jar:?]
	at okio.Buffer.writeUtf8(Buffer.java:793) ~[okio-1.4.0.jar:?]
	at okio.RealBufferedSink.writeUtf8(RealBufferedSink.java:57) ~[okio-1.4.0.jar:?]
	at com.squareup.okhttp.internal.http.HttpConnection.writeRequest(HttpConnection.java:172) ~[okhttp-2.4.0.jar:?]
	at com.squareup.okhttp.internal.http.HttpTransport.writeRequestHeaders(HttpTransport.java:76) ~[okhttp-2.4.0.jar:?]
	at com.squareup.okhttp.internal.http.HttpEngine$NetworkInterceptorChain.proceed(HttpEngine.java:890) ~[okhttp-2.4.0.jar:?]
	at com.squareup.okhttp.internal.http.HttpEngine.readResponse(HttpEngine.java:760) ~[okhttp-2.4.0.jar:?]
	at com.squareup.okhttp.Call.getResponse(Call.java:274) ~[okhttp-2.4.0.jar:?]
	at com.squareup.okhttp.Call$ApplicationInterceptorChain.proceed(Call.java:230) ~[okhttp-2.4.0.jar:?]
	at com.squareup.okhttp.Call.getResponseWithInterceptorChain(Call.java:201) ~[okhttp-2.4.0.jar:?]
	at com.squareup.okhttp.Call.execute(Call.java:81) ~[okhttp-2.4.0.jar:?]
	at retrofit.client.OkClient.execute(OkClient.java:53) ~[retrofit-1.9.0.jar:?]
	at retrofit.RestAdapter$RestHandler.invokeRequest(RestAdapter.java:326) ~[retrofit-1.9.0.jar:?]
	... 12 more

I am wondering why this happens now all of a sudden. I didn't change a thing since I set everything up a couple of weeks ago.

Any ideas how I could try to resolve this?

Thanks, Bechte

bechte · May 28, 2020, 9:06pm

My idea:

I think too many values are being written to the influxdb in too short a time, so the underlying buffer seems to overflow.

Could be a bug or a configuration issue.

Any ideas?

Wolfgang_S · May 29, 2020, 1:18am

Might not be the same root cause but the same error message, have you checked this:

Here e.g. SD card was reported to be the problem:

bechte · May 29, 2020, 6:15am

Hi, thanks for the pointers.

The retentionPolicy is valid, set to: retentionPolicy=autogen in influxdb.cfg

In the timeout Post above there is one comment:

Edit: I found a few items that reported a lot of values to the database. (Probably every few milliseconds) I’ve now changed those items from everyChange to everyMinute. With this configuration the error is gone. I guess there were just too many values every second to be sent to the server.

This is close to my observation. I will change the corresponding items and see if that solves the issue for me as well.

bechte · May 29, 2020, 6:24am

Fun thing, the change did not solve the issue, but a restart afterwards did.

rossko57 · May 29, 2020, 10:00am

Yes, that’s expected. Persistence services are still version 1.x really and not expected to pick up dynamic configuration changes.

bechte · June 12, 2020, 6:23pm

So, after a couple of tests I set the items to store their values every 5 seconds, but I still get errors after a couple of hours. This is a bit awkward. Storing the values every 30 seconds or so does not make much sense in this case… Are there other solutions? Would it help to move the influxdb server to another node to gain more performance on the influxdb side, or is it an openhab-related issue?

rossko57 · June 12, 2020, 6:50pm

Why not? What are you recording in a home automation environment that requires this? everyMinute is common enough for e.g. temperatures or power consumption for charting purposes, but most other persist needs are met with everyChange.

bechte · June 12, 2020, 7:08pm

I record the AC Power Supply of my PV. These values change more than 3 times a second.

I want them to be as accurate as possible. Maybe I'll accumulate them locally and write the total sum every minute into a virtual item…
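The idea of accumulating readings locally and persisting only one value per minute could look roughly like the sketch below. It is not openHAB code: the function names and the `(seconds, watts)` sample format are assumptions made for illustration, and it uses simple trapezoidal integration to turn fast power readings into an energy figure and a time-weighted mean.

```python
def energy_wh(samples):
    """Integrate (seconds, watts) samples with the trapezoidal rule.

    `samples` is a hypothetical list of (timestamp_in_seconds, power_in_watts)
    tuples sorted by timestamp. The result is in watt-hours.
    """
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Area of one trapezoid: average power times elapsed time.
        joules += (p0 + p1) / 2.0 * (t1 - t0)
    return joules / 3600.0


def mean_power_w(samples):
    """Time-weighted mean power over the span covered by the samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    duration = samples[-1][0] - samples[0][0]
    if duration <= 0:
        raise ValueError("timestamps must increase")
    return energy_wh(samples) * 3600.0 / duration


# One minute in which a cloud ramps the output down during the second half.
minute = [(0.0, 1200.0), (30.0, 1200.0), (60.0, 600.0)]
per_minute_energy = energy_wh(minute)
per_minute_mean = mean_power_w(minute)
```

Writing only `per_minute_energy` or `per_minute_mean` to the persisted item keeps the effect of short dips in the stored value while reducing the write rate to one point per item per minute.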

rossko57 · June 12, 2020, 7:14pm

Okay. They’re still ‘accurate’ if you store them every five minutes, it’s just a coarser time slice.
But presuming this is only a dozen or so Items, and you’re using different policies for other stuff, it shouldn’t really be a problem.

bechte · June 12, 2020, 7:23pm

They are only accurate for that specific moment in time. If clouds pass by, they do not show the real generation. Therefore, I would sum them up before storing.

I only have 10 of those items. And yes: Still confused why this is an issue.
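To put the numbers from this thread side by side, here is a small back-of-the-envelope calculation. It assumes exactly 10 items and exactly 3 changes per second per item (the posts say "more than 3", so the first figure is a lower bound); the function name is made up for this sketch.

```python
def points_per_hour(items, updates_per_second):
    """Number of data points written per hour for a group of items."""
    return items * updates_per_second * 3600


ITEMS = 10  # assumption taken from the post above

rates = {
    "everyChange at 3 changes/s": points_per_hour(ITEMS, 3),
    "every 5 seconds": points_per_hour(ITEMS, 1 / 5),
    "everyMinute": points_per_hour(ITEMS, 1 / 60),
}

for label, rate in rates.items():
    print(f"{label}: about {rate:,.0f} points per hour")
```

Under these assumptions that is roughly 108,000 points per hour at everyChange, 7,200 at every 5 seconds, and 600 at everyMinute. The calculation only shows the relative load; it does not say where the bottleneck actually is.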

rossko57 · June 12, 2020, 7:47pm

Alright, I assumed the PV system had cumulative metering e.g. kWh so minute by minute variations don’t matter. But you can build that yourself before storing. I’d go for everyMinute in the end, just because it’s so easy for charting.

I’m guessing you are working off an SD card, which can be an I/O bottleneck. Don’t forget your frequent updates are probably also competing to write logfiles to the same card.
You might consider tuning that a bit - zram?

detond09 · June 13, 2020, 9:17am

I am facing the same problem @bechte. Changing the persistence frequency changes the error rate as well. But there seems to be something wrong in general with our setup.

Kim_Andersen · June 13, 2020, 9:52am

An interval of 1/3 of a second is imho way too short for a PV… No cloud makes it worth persisting at that rate.
I'm running a PV system as well (through modbus though) and I do persist these values too. But my persistence is set up as everyChange. I get updates approx. every 2-3 seconds, which is good enough for measuring the PV generation…
I use SMA inverters. The Sunny Portal (SMA's web cloud for data measurements) measures every 5 minutes.

I think you should reconsider if a VERY small cloud (1/3 second) really is worth it.

If you're using an Rpi 3B+ or worse and an SD card, this is very much the wrong way of doing it. Switch to better hardware and a lot faster storage, like an SSD. Perhaps you could then change to every second or maybe more.

detond09 · June 13, 2020, 11:09am

@bechte: I just replaced my SD Card and I got rid of the influxdb “batch could not be sent” error. Will keep you posted if it returns.

bechte · June 13, 2020, 12:48pm

Cool. Great to hear. This might be the same issue with me. I get a huge amount of item state change logs since I connected the heating and the PV.

I have now disabled this logging by setting the log level of the group 'smarthome.event' to WARN.

Will report if that helps.