Influx db error: Re-queuing x elements, failed to write batch, cache-max-memory-size exceeded

I am on openHAB 4.1.0.M1

For the past few days I have been getting the following error:

[DEBUG] [rnal.influx1.InfluxDB1RepositoryImpl] - Writing to database failed
org.influxdb.InfluxDBException$CacheMaxMemorySizeExceededException: engine: cache-max-memory-size exceeded: (1074461556/1073741824)
	at org.influxdb.InfluxDBException.buildExceptionFromErrorMessage(InfluxDBException.java:153) ~[?:?]

....

[WARN ] [.influxdb.InfluxDBPersistenceService] - Re-queuing 2118 elements, failed to write batch.

The first time I encountered this, I stopped InfluxDB (stopped and restarted zram) and restarted openHAB, and the error went away … for a while.

Now it has reappeared. In a German forum I read a post from August mentioning a type issue whose fix was not yet available in a snapshot at the time, but I cannot find anything further on it.

I then stopped influxdb and tried to restart it, but now it won’t even start again:

Sep 15 21:43:53 openhabian systemd[1]: Stopped InfluxDB is an open-source, distributed, time series database.
░░ Subject: A stop job for unit influxdb.service has finished
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A stop job for unit influxdb.service has finished.
░░
░░ The job identifier is 25587 and the job result is done.
Sep 15 21:43:53 openhabian systemd[1]: influxdb.service: Start request repeated too quickly.
Sep 15 21:43:53 openhabian systemd[1]: influxdb.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ The unit influxdb.service has entered the 'failed' state with result 'exit-code'.
Sep 15 21:43:53 openhabian systemd[1]: Failed to start InfluxDB is an open-source, distributed, time series database.
░░ Subject: A start job for unit influxdb.service has failed
░░ Defined-By: systemd
░░ Support: https://www.debian.org/support
░░
░░ A start job for unit influxdb.service has finished with a failure.
░░
░░ The job identifier is 25587 and the job result is failed.
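When systemd reports “Start request repeated too quickly”, the unit’s failed state usually has to be cleared before another start attempt is accepted. A minimal sketch, assuming the standard `influxdb.service` unit name:

```shell
# Clear the start-rate counter and the 'failed' state of the unit
sudo systemctl reset-failed influxdb.service

# Check the unit's own journal for the real startup error
sudo journalctl -u influxdb.service -n 50 --no-pager

# Try starting again and confirm the resulting state
sudo systemctl start influxdb.service
sudo systemctl status influxdb.service
```

The `journalctl` output is the important part: the “repeated too quickly” message only says the unit crashed several times in a row, not why.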

I have stopped OH and rebooted; it seems to have recovered, but I doubt it will last long.

Does anyone know what the root cause might be for:

cache-max-memory-size exceeded: (1074461556/1073741824)
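For context, the two numbers are the current in-memory cache size and the configured limit, in bytes: 1073741824 bytes is InfluxDB 1.x’s default `cache-max-memory-size` of 1 GiB. If the write volume is legitimate, the limit can be raised in `/etc/influxdb/influxdb.conf`; a sketch, assuming an otherwise default config (the underlying write pattern is still worth investigating first):

```toml
# /etc/influxdb/influxdb.conf
[data]
  # Default is "1g"; raise only if the host has memory to spare
  cache-max-memory-size = "2g"
```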

It might be that there is a limit on the size of a write (i.e. some sort of “maximum batch size”). The question is why there are 2118 elements that need to be written to the database; that sounds like an awful lot. How many items do you persist?

I have 603 items and, to be honest, I haven’t filtered them yet, which probably isn’t a good idea :worried:

I can at least filter out the weather data, but beyond that I will have to work harder to persist only the items that are really needed.

I also think I should reduce how often values are saved:

Strategies {
	everyMinute : "0 * * * * ?"
	default = everyChange, restoreOnStartup
}

Items {
	SysteminfoRP4_Used, SysteminfoRP4_UsedMemory, SysteminfoRP4_CPUTemperature, SysteminfoRP4_Available : strategy = everyHour
	* : strategy = everyUpdate, everyMinute
}

to

Strategies {
	everyMinute : "0 0/1 * * * ? *"
	everyHour : "0 0 * * * ?"
	default = everyChange, restoreOnStartup
}

Items {
	SysteminfoRP4_Used, SysteminfoRP4_UsedMemory, SysteminfoRP4_CPUTemperature, SysteminfoRP4_Available : strategy = everyHour
	OneCallAPIWeather* : strategy = everyHour
	* : strategy = everyUpdate, everyMinute
}

Now it should work. But even with 600 items, 2100 datapoints to persist is about 3 points per item, which sounds like a lot, because the add-on tries to commit at an interval of 3 s.

So the question is: why are there so many datapoints? Is there any error in the log before? Or in the influxdb log at that time?

Now the problem has changed :worried:

org.influxdb.InfluxDBException: engine: error rolling WAL segment: error opening new segment file for wal (2): close /var/lib/influxdb/wal/openhab/autogen/2/_00025.wal: file already closed

which points to what was discussed here:

which, however, looks fine to me:

[19:32:22] root@openhabian:/var/lib/influxdb/wal/openhab/autogen/2# ls -l
total 10492
-rw-r--r-- 1 influxdb influxdb 10741791 Sep 16 19:22 _00025.wal

I now have the same issue with openHAB 4.1.

Same issue here with OH 4.1 for the past 3–4 days:

2023-12-25 19:23:33.781 [WARN ] [.influxdb.InfluxDBPersistenceService] - Re-queuing 14 elements, failed to write batch.

with

2023-12-25 19:29:43.963 [WARN ] [rnal.influx1.InfluxDB1RepositoryImpl] - database error: java.net.ConnectException: Failed to connect to localhost/[0:0:0:0:0:0:0:1]:8086

It is intermittent.
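For an intermittent connection refusal, a quick way to check whether InfluxDB is actually listening at the moment of failure is its health endpoint (a sketch, assuming the default port 8086):

```shell
# A healthy InfluxDB 1.x responds to /ping with HTTP 204 No Content
curl -i http://localhost:8086/ping

# Confirm something is listening on port 8086 at all
ss -tlnp | grep 8086
```

If `/ping` fails only sometimes, the server is probably restarting or blocking under load rather than being misconfigured.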

I just upgraded from 3.4 to 4.1
I have the same problem.

I’ve upgraded from 4.0.3 to 4.1 Release and also have this issue :frowning:

In the past few days I upgraded from openHAB 2 to 4.1.0-1 with InfluxDB 1.6.7, keeping the config files from the previous versions of both packages with only the automatic migration changes applied. I persist everything into InfluxDB and previously had no issues or errors. Now I get the same errors others are seeing:

[.influxdb.InfluxDBPersistenceService] - Re-queuing 40150 elements, failed to write batch.

The above number of elements starts off low, e.g. 15, then rises nearly every second, but data is still being persisted. It’s hard to tell whether all of it is, of course.

Grafana is not showing any missing data that I can see, though. This is my current persistence file; I am going to remove the * line to see if that changes anything.

Strategies {
    everyMinute : "0 * * * * ?"
    everyHour   : "0 0 * * * ?"
    everyDay    : "0 0 0 * * ?"
    default = everyChange
}

Items {
    gRadiatorSetpoints*, gRadiatorTemps*, gRadiatorValves*, gbatteries*, gMotion* : strategy = everyChange, everyHour, restoreOnStartup
    BathTemp_Temperature1 : strategy = everyChange, everyHour, restoreOnStartup
    NetatmoOutdoor_Temperature, gTemperatures* : strategy = everyChange, everyHour, restoreOnStartup
    gSwitches*, gLights*, gLux*, gHumidity* : strategy = everyChange, everyHour, restoreOnStartup
    MyState, MyStateNumeric, NightMode, HotWater, Heating : strategy = everyChange, everyHour, restoreOnStartup

    * : strategy = everyChange, restoreOnStartup

}

After removing the * line from the persistence configuration, I’m still getting the same:

2023-12-31 07:50:15.406 [WARN ] [.influxdb.InfluxDBPersistenceService] - Re-queuing 80 elements, failed to write batch.

Is there anything in the InfluxDB log? Set the add-on to DEBUG logging and show the stack trace that is logged at DEBUG level. For some reason the influx driver reports a failed write operation.
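For reference, DEBUG logging for the add-on can be enabled from the openHAB (Karaf) console; a sketch:

```shell
# Connect to the openHAB console first, e.g.:
#   ssh -p 8101 openhab@localhost
# Then, inside the console:
log:set DEBUG org.openhab.persistence.influxdb

# Follow the openHAB log to capture the DEBUG-level stack trace
log:tail
```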

Just upgraded to 4.1 having the same issue

Same question.

I changed the binding to DEBUG and that immediately made the problem go away (no restart of openHAB or InfluxDB, I just ticked the box and saved), which is a bit of a puzzle. There is no significant logging from Influx itself, just lines along the lines of “snapshot started” and “snapshot ended”. I’ll keep an eye on it for a while, then try turning debugging off and restarting to see if anything changes.

That sounds like a timing issue and may be hard to fix, because I suspect it’ll also not occur in the debugger.

You are using InfluxDB 1, correct?

That’s what the binding is set to, and it has been set that way since openHAB 2, but I’m not sure what version Influx itself is; presumably the same. I don’t really know much about Influx, as it’s just a supporting piece and not something I know beyond what I have to. The package details are:

influxdb       1.6.7~rc0-1+b13

Also, I disabled debug yesterday and so far there has been no recurrence of the errors; in fact, no errors from Influx at all since debug was first enabled a few days ago. I’ll reboot openHAB later today, when I don’t need it so much, and see if it comes up in a good state.

I put the binding into debug mode and found the issue: in my case there was a data type mismatch in Influx. I simply dropped the measurement and let the item recreate it.

2024-01-03 23:25:09.494 [DEBUG] [rnal.influx1.InfluxDB1RepositoryImpl] - Writing to database failed
org.influxdb.InfluxDBException$FieldTypeConflictException: partial write: field type conflict: input field "value" on measurement "HouseEnergyUsageWeek" is type string, already exists as type float dropped=25
        at org.influxdb.InfluxDBException.buildExceptionFromErrorMessage(InfluxDBException.java:144) ~[?:?]
        at org.influxdb.InfluxDBException.buildExceptionForErrorState(InfluxDBException.java:173) ~[?:?]
        at org.influxdb.impl.InfluxDBImpl.execute(InfluxDBImpl.java:837) ~[?:?]
        at org.influxdb.impl.InfluxDBImpl.write(InfluxDBImpl.java:470) ~[?:?]
        at org.openhab.persistence.influxdb.internal.influx1.InfluxDB1RepositoryImpl.write(InfluxDB1RepositoryImpl.java:131) ~[?:?]
        at org.openhab.persistence.influxdb.InfluxDBPersistenceService.commit(InfluxDBPersistenceService.java:286) ~[?:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:840) [?:?]
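A field type conflict like this can be confirmed and cleared from the `influx` CLI. A sketch, with the measurement name taken from the error above (note that dropping a measurement deletes all of its history, so export it first if the data matters):

```shell
# Open the CLI against the openhab database
influx -database openhab

# Inside the CLI, show the stored type of each field:
#   SHOW FIELD KEYS FROM "HouseEnergyUsageWeek"
# If the stored type conflicts with what openHAB now sends,
# drop the measurement and let openHAB recreate it:
#   DROP MEASUREMENT "HouseEnergyUsageWeek"
```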

Very interesting. Do you know how I can force this issue to occur?
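One way to reproduce a field type conflict deliberately is to write two points with different field types to the same measurement via the HTTP API. A sketch, using a hypothetical measurement name `typetest` and assuming the default port and a database named `openhab`:

```shell
# First write stores the field "value" as a float
curl -i -XPOST 'http://localhost:8086/write?db=openhab' \
  --data-binary 'typetest value=1.0'

# Second write sends "value" as a string; InfluxDB 1.x rejects it
# with a "field type conflict" error like the one in the log above
curl -i -XPOST 'http://localhost:8086/write?db=openhab' \
  --data-binary 'typetest value="oops"'
```

In openHAB terms, this corresponds to an item changing its state type, e.g. a Number item briefly reporting a string state, while the measurement already holds floats.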