Netatmo Outside Temp stops writing to persistence DB (influxdB) after a month

xsherlock · March 4, 2019, 10:12am

I have a very curious bug in my current home setup. That runs about a year on 2.3 with about 80 sensors. January got myself a Netatmo , Mainly for resonably built rain sensor. I log every change of every sensor and after a month noticed that my grafana temp graph of the Netatmo Outside Temperature got stuck on a single value. Suspected a bug in a binding or cloud service. But to my suprise the sensor is still updating fine.

2019-03-03 22:58:32.393 [vent.ItemStateChangedEvent] - Netatmo_Temperature_Outside_new changed from 8.8999996185302734375 to 8.5

I see that in a log and Habpanels displaying that item are geting up to date values over SSE.
It just does not write to InfluxDB anymore.
Restarting OH does not fix the problem, the item is dead and will not update in influxDB.
Creatively did another item with _new at the end exactly the same setup

Number Netatmo_Temperature_Outside_new	"Outside Temp Netatmo [%.1f °C]" <temperature>  ["CurrentTemperature"] { channel="netatmo:NAModule1:8b7b2739:020000331546:Temperature" }

And that started working right away. To my suprize after 50 days i have the problem again the new item that holds Netatmo Outside Temp stops writing values to the persistance DB despite it updates just fine.

Any one seen that? any clues?
TIA

rlkoshak · March 4, 2019, 3:36pm

Does restarting InfluxDB change anything?

Do you see any errors in InfluxDB’s logs? I think it logs to the syslog by default.

xsherlock · March 4, 2019, 7:15pm

restarting influx does not fix the problem.

Unfortunately the log does not show exact measurement written by the post query but with some detective skills I see. 3 netatmo items updated this very second + one additional update from other sensor

2019-03-04 19:48:15.363 [vent.ItemStateChangedEvent] - B_Power_Grid_P2_Voltage_raw changed from 2397 to 2404
2019-03-04 19:48:16.557 [vent.ItemStateChangedEvent] - Netatmo_Pressure changed from 1015.0999755859375 to 1015
2019-03-04 19:48:16.558 [vent.ItemStateChangedEvent] - Netatmo_Temperature_Outside_new changed from 11.69999980926513671875 to 11.6000003814697265625
2019-03-04 19:48:16.558 [vent.ItemStateChangedEvent] - Netatmo_Humidity_Outside changed from 75 to 76
2019-03-04 19:48:19.679 [vent.ItemStateChangedEvent] - G_pmsensor_pm25 changed from 9.2 to 10.1

but the log shows only 3 post updates

Mar  4 19:48:14 influx-db influxd[1020]: [httpd] 172.16.10.4 - openhab [04/Mar/2019:19:48:14 +0100] "POST /write?consistency=one&db=openhab_db&p=%5BREDACTED%5D&precision=n&rp=autogen&u=openhab HTTP/1.1" 204 0 "-" "okhttp/2.4.0" 111d3aca-3eae-11e9-b3eb-5254005c6abc 25278
Mar  4 19:48:15 influx-db influxd[1020]: [httpd] 172.16.10.4 - openhab [04/Mar/2019:19:48:15 +0100] "POST /write?consistency=one&db=openhab_db&p=%5BREDACTED%5D&precision=n&rp=autogen&u=openhab HTTP/1.1" 204 0 "-" "okhttp/2.4.0" 11b5cfd2-3eae-11e9-b3ec-5254005c6abc 20351
Mar  4 19:48:16 influx-db influxd[1020]: [httpd] 172.16.10.4 - openhab [04/Mar/2019:19:48:16 +0100] "POST /write?consistency=one&db=openhab_db&p=%5BREDACTED%5D&precision=n&rp=autogen&u=openhab HTTP/1.1" 204 0 "-" "okhttp/2.4.0" 126cec14-3eae-11e9-b3ed-5254005c6abc 28832

If my logic is right and every post corresponds to the sensor update it looks there is not 3 consecutive updates from netatmo sensors
No other errors in influx log

rlkoshak · March 4, 2019, 9:17pm

It is possible that the add-on caches up events and posts them in a batch which could explain the logs, but I don’t know.

It does look like the add-on is not posting the data. The best I can recommend is filing an issue on the openhab-addons repo.

jswim788 · March 4, 2019, 9:31pm

What’s in your persistence file for InfluxDB?

Also, you can double check to see if the data is really not in InfluxDB from the command line:

$ influx (you might need to install the influxdb-client to get this, but you needed it to set up the database originally)
> auth - give the username and password if needed
> use openhab_db (or whatever database name you used)
> select * from Netatmo_Temperature_Outside_new

that should should you all the measurements in the database. Use an epoch converter to find out when it was last updated. Also, watch out for the retention policy. You need to be consistent. If you are writing to a different one than you are reading (through Grafana), you might see odd results.

xsherlock · March 4, 2019, 10:13pm

Note that this was working but twice stopped after about a month.

Strategies {
everyHour :   "0 0 * * * ?"
everyMinute : "0 * * * * ?"
default = everyChange
	
}
Items {
*   : strategy = everyChange, restoreOnStartup
gHeatingValve*, HeatingSetpoint*, B_Power_Battery_Current_Sum*, B_Power_Battery_Voltage_fixed*, B_Provent_temp_gwc, B_Provent_gear_selected, B_Provent_bypass_active, B_Power_Battery_Current_Discharge, B_Power_Battery_Current_Charge  : strategy = everyChange, everyMinute, restoreOnStartup        
}

And the data is not there in the influxdb. thats Why I thing something is wrong with OH and Netatmo combo

> select last(value) from Netatmo_Temperature_Outside_new
name: Netatmo_Temperature_Outside_new
time                last
----                ----
1550744572622000000 13.600000381469727

That is exactly the day it stopped working (21 feb) and last point on the grafana graph.
I have not touched retention policy, for all 213 Items I have in house the default and they all archive fine.

jswim788 · March 4, 2019, 11:25pm

But you are seeing the item update in the event log, so it doesn’t seem to be a Netatmo/OH issue - seems like it is the influxdb persistence at fault? Out of curiosity, have you tried putting in Netatmo_Temperature_Outside_new directly in the persistence file instead of relying on the ‘*’? Perhaps you could set up a different persistence service and see if it will write there to determine if this is just influxdb failing or some combination of Netatmo and influxdb that causes the issue.

rossko57 · March 4, 2019, 11:38pm

If you still have the event logs for that date, it would be interesting to see the Item update that recorded that value, and the following one (in case it has something odd stuck in its throat)

opus · March 4, 2019, 11:53pm

Although you state to have not touched retention policies it sounds to me as something related to that. This Thread seams to talk about a related problem, I would think it has the same unresolved cause.

xsherlock · March 5, 2019, 12:05pm

So following your leads:
Around the time of last recorded value from 11:22:52 I see that item link has been removed and updated. That also happens for other Netatmo items that are persisting fine. BUT there are some additional updates to the Netatmo_Temperature_Outside_new item in the log (11:32:36.895 and 11:36:40) and after that a sensor update at 11:37:54.155 is first that did not go into influxDB.

2019-02-21 11:19:32.403 [temChannelLinkRemovedEvent] - Link 'Netatmo_Temperature_Outside_new => netatmo:NAModule1:8b7b2739:020000331546:Temperature' has been removed.
2019-02-21 11:19:32.407 [temChannelLinkRemovedEvent] - Link 'Netatmo_Humidity_Inside => netatmo:NAMain:8b7b2739:70ee50330fda:Humidity' has been removed.
2019-02-21 11:19:32.408 [temChannelLinkRemovedEvent] - Link 'Netatmo_Humidity_Outside => netatmo:NAModule1:8b7b2739:020000331546:Humidity' has been removed.
2019-02-21 11:19:32.408 [temChannelLinkRemovedEvent] - Link 'Netatmo_Pressure => netatmo:NAMain:8b7b2739:70ee50330fda:Pressure' has been removed.
2019-02-21 11:22:12.772 [vent.ItemStateChangedEvent] - Netatmo_Temperature_Outside_new changed from NULL to 11.800000190734863
2019-02-21 11:22:12.777 [vent.ItemStateChangedEvent] - Netatmo_Humidity_Outside changed from NULL to 65.0
2019-02-21 11:22:12.783 [vent.ItemStateChangedEvent] - Netatmo_Humidity_Inside changed from NULL to 40.0
2019-02-21 11:22:12.789 [vent.ItemStateChangedEvent] - Netatmo_Pressure changed from NULL to 1033.699951171875
2019-02-21 11:22:52.622 [.ItemChannelLinkAddedEvent] - Link 'Netatmo_Temperature_Outside_new-netatmo:NAModule1:8b7b2739:020000331546:Temperature' has been added.
2019-02-21 11:22:52.624 [vent.ItemStateChangedEvent] - Netatmo_Temperature_Outside_new changed from 11.800000190734863 to 13.6000003814697265625
2019-02-21 11:22:52.628 [.ItemChannelLinkAddedEvent] - Link 'Netatmo_Humidity_Outside-netatmo:NAModule1:8b7b2739:020000331546:Humidity' has been added.
2019-02-21 11:22:52.630 [.ItemChannelLinkAddedEvent] - Link 'Netatmo_Humidity_Inside-netatmo:NAMain:8b7b2739:70ee50330fda:Humidity' has been added.
2019-02-21 11:22:52.637 [.ItemChannelLinkAddedEvent] - Link 'Netatmo_Pressure-netatmo:NAMain:8b7b2739:70ee50330fda:Pressure' has been added.
2019-02-21 11:32:36.895 [ome.event.ItemUpdatedEvent] - Item 'Netatmo_Temperature_Outside_new' has been updated.
2019-02-21 11:36:40.436 [ome.event.ItemUpdatedEvent] - Item 'Netatmo_Temperature_Outside_new' has been updated.
2019-02-21 11:37:54.155 [vent.ItemStateChangedEvent] - Netatmo_Temperature_Outside_new changed from NULL to 13.8999996185302734375
2019-02-21 11:37:54.155 [vent.ItemStateChangedEvent] - Netatmo_Humidity_Outside changed from 65 to 62
2019-02-21 11:57:57.632 [vent.ItemStateChangedEvent] - Netatmo_Pressure changed from 1033.699951171875 to 1033.5999755859375

I do not think the value recorded has anything to do with it as I have some other instance of the OH that does use the same Netatmo sensor from the cloud and it keeps persisting fine.

@jswim788 Surprisingly adding a specific line to the influxdb.persist for this sensor did restarted archivization of the states. So now at least I have a way to fix that if the problem returns.

Netatmo_Temperature_Outside_new : strategy = everyChange, everyMinute, restoreOnStartup

Still have no idea what was the cause.

rossko57 · March 5, 2019, 12:42pm

That does seem like more than a coincidence.

Unlogged, but I notice the ‘failed’ Items all get state NULL at some time. That shouldn’t matter to influxDB of course, but curious what it does store for that.

Seems to be the last valid record, by the value.

These two reports stick out, they don’t seem to appear for the other channels/Items.

I’m guessing the temp channel is somehow the “first” of the channels from the device, and maybe triggers the link recovery process.

Something odd happens at 11:22, when the links are recovered. All the Items get updated NULL->something. Might be stale values, but who cares.
Only temp gets another update value->value - that’s the 13.6 stored in influx.
Some minutes after that, temp gets a mystery update (which perhaps means no value change) - the other Items don’t.
Again, some minutes after that, temp gets another mystery update (no change) - again, the other Items don’t.
Next thing we know - temp is changing from NULL->value (which doesn’t get into influx). When did it get to NULL again??

There’s a mystery temp change to NULL hidden in there. I couldn’t guess why, but something has gone awry with that link recovery process and somehow generated a state or event that broke influx.

jswim788 · March 5, 2019, 6:57pm

It’s interesting that the restart of OH didn’t fix the problem. Did the list of Items that needs to be monitored get stored in the cache, so it is necessary to clear the cache as well as restart OH to get the influx binding to save this Item again?

If this happens regularly at 50 days, maybe the debug log level for the influxdb add on could be turned on at day 49 to see if anything more is logged?