Heavy Disk I/O on OH3 (solved)

Platform:

  • PC with Debian 9
  • packaged installation
  • 32GB RAM
  • OH3 3.0.1-2
  • Java: openjdk version “11.0.8” 2020-07-14 LTS

On my installation there is heavy I/O on the hard drive, most of the time more than 95% usage.

Screenshot of iotop

I have no idea how this happens.

OH3 includes default persistence services for every Item.

It is true that rrd4j is installed by default and is configured to persist every compatible Item, but it would take thousands of Items to chew up 95% of the I/O on a hard drive on a full sized CPU. That screenshot is showing more than half a MB per second. If we assume 64 bits for the record (the size of a Java double) and another 64 bits for the timestamp (the size of a Java long), and then double it for overhead, we are looking at 256 bits per Item update. That means there would have to be almost 20,000 Items being saved every second. Even if we take just a third of that and blame the rest on logging, we are looking at over 6,500 Items updating every second.

That doesn’t pass the sniff test as the source of the problem.

@Dirk_Neupert, try using the lsof command to see which files the Java process is writing to. Then watch those files over a brief period of time to see which one(s) are growing the fastest. From there we can start to narrow the problem down. It very well may be rrd4j that is causing the problem. But I at least will need a bit more evidence before I blame it.
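A minimal starting point could look like this (assuming openHAB runs as the openhab user and keeps its data under /var/lib/openhab):

lsof -u openhab            # everything the openhab user has open
lsof +D /var/lib/openhab   # only files under the openHAB data directory (recursive)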

Hello Rich,

thanks for your answer. When setting up my OH3 installation I switched to InfluxDB as the default persistence service, like in my old OH2 installation.
I have done something with lsof and attached the results to this post.

The main binding of my installation is knx, but the weather binding produces a lot of Items.

File lsof_openhab.txt (42.1 KB) is the output of lsof -u openhab
File lsof_pid.txt (42.1 KB) is the output of lsof -p XXX
File lsof_var.txt (18.1 KB) is the output of lsof -D /var/lib/openhab/

The highest value seems to come from /var/lib/openhab/cache/org.eclipse.osgi/236/0/bundleFile.

Thanks a lot

lsof will just show you the list of open files, it won’t show you how the files are changing. That’s only half of the problem. As I mentioned above, you next need to watch those files over a brief period of time to see which one(s) are growing the fastest.
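One rough way to do that, for example (the path, file names and the 60-second window are just placeholders):

du -ab /var/lib/openhab > before.txt   # per-file sizes in bytes
sleep 60
du -ab /var/lib/openhab > after.txt
diff before.txt after.txt              # files whose size changed during that minute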

Hello Rich,

sorry for the late answer. I think I have found a way to find the growing files.
I used

find /var/lib/openhab -exec du {} \; > 08_25.txt
find /var/lib/openhab -exec du {} \; > 08_30.txt

So I get two files with file sizes, and a diff of both points me to some files. But the differences in size are very small.

Since that only showed minimal differences, I searched for other tools and found fatrace. It reports file access for the whole system. With fatrace | grep openhab I found two kinds of access. Most accesses go to /var/lib/openhab/persistence/mapdb/storage.mapdb.
The other main kind goes to the files under /var/lib/openhab/persistence/rrd4j/.
In rrd4j there are 1500 files.
Some of my Items, for instance those of the Shelly 3EM, produce a lot of traffic due to a high update rate.

I logged the output of

fatrace | grep openhab

for a minute and there were over 53,000 entries. 50,000 of them came from /var/lib/openhab/persistence/mapdb/storage.mapdb alone.
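One way to capture and count this could be (the one-minute window and log file name are just examples; fatrace needs root):

timeout 60 fatrace > fatrace.log   # record one minute of file access events
grep -c openhab fatrace.log        # total openHAB-related entries
grep openhab fatrace.log | awk '{print $NF}' | sort | uniq -c | sort -rn | head   # busiest files first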

I think that could be the problem.

Thanks a lot for your work.

Well, that means rossko was right: the culprit is persistence. In your case, both MapDB and rrd4j.

I didn’t consider MapDB in my estimate above, which is probably handled by dividing by 2, but even then that’s a far cry from 1,500 Items, so one or more of my assumptions must have been wrong.

Anyway we’ve found the culprit.

To deal with this you can stop using MapDB, or you can create a custom mapdb.persist config to limit which Items get saved under what conditions. Because you have so many Items changing so quickly, you might constrain rrd4j too.

I doubt you need all 1,500 to be charted, nor do you need them all restored on startup.

You can even make it so that MapDB is only saving and restoring those Items rrd4j can’t handle (e.g. String Items). That way you aren’t saving them twice.
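A minimal mapdb.persist along those lines could look like this (gRestoreOnStartup is a hypothetical Group you would create and put those String Items into):

Strategies {
    default = everyChange
}

Items {
    // hypothetical Group holding only the String Items that rrd4j cannot store
    gRestoreOnStartup* : strategy = everyChange, restoreOnStartup
}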

With some careful tuning I’m sure you can retain the capability you need with both and bring the writes way down.

Finally, mapdb may have some configuration options to let it cache changes and make fewer writes, at the risk of losing some data in a crash.

That did the trick.

I created a mapdb.persist with only one Item. The String Items which I need to restore are fewer than 10. Now it’s much better and my other services can work properly again.

Thanks a lot. Great job.

I’m a bit concerned you may have only fixed a symptom.

Related story - we had a heavy Modbus user with performance issues. That’s a fast-polling binding that can be configured with hundreds of Items and update each of them every second.
That of course walloped persistence set up to record “every update”, and a huge improvement could be made by simply recording “every change” instead.
But … the real problem was that the binding was still issuing hundreds of Item update events onto the openHAB bus, and each of those actually has a processing overhead cost. And most of them were completely pointless update-to-same-value events. Because that’s what you get with polling: you poll fast to catch any change quickly, but only one in a thousand polls actually has a change.
The real fix was to develop screening built into the binding to filter out most pointless updates, a cunning mechanism out of scope here.

The point of the story is, I think you should go back and work out why your persistence was getting hammered. What are you doing to generate lots of events in the first place?

Yes! … or changing to almost the same value (just bouncing by an irrelevant amount).

We had a discussion here about how to debounce number channels to keep the overall processing overhead low (see: [profiles] Proposing debouncing profile · Issue #2172 · openhab/openhab-core · GitHub).

So I’m after some way to tame the event generation quite early in the processing.

I guess the link profile concept is a valid place to tackle such issues, if the binding itself or the actual device (e.g. a sensor generating quick updates) does not provide a mechanism to reduce the event rate …

You’re both right.
I think it’s a structural problem if Items receive updates that do not change their values.
At the moment it works.

In the next days I will roll back my changes (rename the persist file) and switch bindings on/off one by one to find out which binding produces this huge amount of updates. For a part of the Items, for instance the ones for weather, I think there is no need for persistence, because they can fetch their values fresh at a restart of openHAB. With the example for the weather widget here in the forum there are very many Items for one location, and I have two locations.

I think I have only a few Items that get more updates than normal, for instance for power consumption (Shelly 3PM).

If I learn something new I will post it here.

Sure, but updating once per hour or whatever is negligible. Nor would I worry about persisting selected Items if you wanted to chart e.g. external temp & humid.

It is possible to create rules that get into update loops, not just bindings passing along traffic. Even if it were only one Item involved, that could still hammer persistence when configured for “everyUpdate”.
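For example, a contrived Rules DSL rule like this would loop forever (MyTemp is just a placeholder Item name):

rule "Accidental update loop"
when
    Item MyTemp received update
then
    // posting an update back to the same Item immediately retriggers this rule
    MyTemp.postUpdate(MyTemp.state)
end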

It’s worth noting here that the default rrd4j config does in fact use everyChange and not everyUpdate. I’m not sure what MapDB’s default is. I imagine it could be everyUpdate. I’m not sure how to find out short of looking in the code. The MapDB add-on doc appears to be missing in action in the official docs.

It might be less disruptive to use the developer sidebar’s stream. Unlike events.log it will show you the Item updates in addition to changes and commands. From there you should be able to figure out which bindings are a problem and you don’t need to change your .persist files nor disable bindings.

alt-shift-D brings up the sidebar and the event stream is the second tab in MainUI.

After moving mapdb.persist away, the high I/O returned.
I had openhab-cli showlogs running at the time, and at the moment of moving the persistence file, all Items of the weather binding showed up in the log and iotop showed heavy I/O close to 100%.
I paused the OpenWeatherMap service and it all went away and looks really nice. So it must be this binding.
All other bindings do not produce that much traffic. In the options of the binding the default refresh interval of 60 minutes is set.
For information, these bindings are installed:

  • amazon echo control
  • astro
  • avm fritzbox
  • chromecast
  • exec
  • harmony
  • http
  • knx
  • mqtt
  • ntp
  • openweathermap
  • shelly
  • smartmeter
  • tr-064
  • tradfri
  • wifi-led

knx and shelly produce the most updates, but that works very well and only happens for real changes on Items.

The really stupid thing about all of this is that you only see some of the updates in events.log, and it looks absolutely normal compared to OH2.

Only changes are shown in the events.log, by design, same as OH2.

If you have a suspect Item in mind, you can monitor for updates with a rule. Or as Rich says use the developer tools.
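A minimal file-based rule for that could look like this (MySuspectItem is a placeholder for the Item you want to watch):

rule "Log every update of a suspect Item"
when
    Item MySuspectItem received update
then
    logInfo("update-watch", "MySuspectItem updated to " + MySuspectItem.state.toString)
end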

You’re right. Now I have opened the developer tools.
I think what I said earlier was wrong: the weather binding actually works fine.

The problem is the Shelly 3EM. It seems this is the problem with tiny updates described in Issue #2172. The Items do change for real, but for instance only from 61.9 Watt to 62.0 Watt.

How can I reduce these updates? Is there a way to apply some hysteresis to smooth these values? They are also persisted with InfluxDB to get nice diagrams. With the default config this means these values are written twice.

The first thing to ask is whether you really need this Item and its data. If not, eliminate the Item. Don’t keep around Items and data just because you can.

Assuming you do want to keep it, the most effective way is going to involve some rules. There are two approaches:

  1. Create a proxy Item and configure only the proxy Item to be persisted. Then create a rule that triggers when the raw Item changes. In this rule, test whether the new state has changed enough, and only if it has, update the proxy Item (see the sketch after this list).

  2. Keep your sensor Item from being persisted by a strategy. Create a rule that triggers when the Item changes. If the change is big enough, call MyItem.persist(“mapdb”) to cause the current state to be persisted on demand.
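A rough sketch of the first approach in Rules DSL (RawPower, PowerProxy and the 0.5 W threshold are all hypothetical, pick whatever fits your setup):

rule "Update power proxy only on significant change"
when
    Item RawPower changed
then
    val newValue = (RawPower.state as Number).doubleValue
    // fall back to an impossible value so the very first update always passes
    val oldValue = if (PowerProxy.state instanceof Number) (PowerProxy.state as Number).doubleValue else -1.0e12
    // only forward the value when it moved by at least the threshold
    if (Math.abs(newValue - oldValue) >= 0.5) {
        PowerProxy.postUpdate(newValue)
    }
end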

In addition, if you are using InfluxDB, why keep rrd4j around? OH 3’s MainUI has some really nice diagrams now too. Since moving to OH 3 I’ve eliminated InfluxDB and Grafana and just use rrd4j and MainUI for charting. So I recommend choosing one or the other (either rrd4j or InfluxDB/Grafana) and not both. It’s adding a lot of extra load on your system for no benefit.

I’m a little bit confused.
When I started with OH3 I selected InfluxDB as the default persistence service. I set up nothing with MapDB and rrd4j, but the system does this by default. I thought that, if I am using another persistence service (for me InfluxDB), that one would be used and not one selected by the system.

So, for now, I have switched the default persistence to rrd4j. That’s not perfect, but it is much better than before.
You’re right that diagrams could be done using the new MainUI. At the moment I will try it without a proxy Item. The I/O is under 50% most of the time, only peaks reach 99%, but that should be no problem.

By default rrd4j gets installed if it’s a new install. MapDB has to be installed by you. All persistence databases now come with a default strategy.

So at some point you had to install at least MapDB.

Setting the default persistence only changes which database is used when a specific database engine is not specified. It doesn’t change nor does it disable what the other engines are doing.

If you are not using a database engine for anything, uninstall it. Right now you are having an I/O problem, and part of the reason is that you are saving the same value three times to three different databases.

Can we go back a step on this … unless you are getting many changes per second, this is not a problem in itself, just a small nuisance (and yes, there are ways to round it down).