Performance influxdb vs. mysql

hannibal29 · January 28, 2018, 1:39pm

Hi everybody,

I know there are a lot of threads discussing the different persistence engines with its respective (dis)advantages and especially the combination of influxdb with Grafana. This is also my setup up to now.

But I’m wondering if somebody made a performance comparison between influxdb and mysql?

So, to be more precise, imagine the following scenario:

SOC hardware running openhab2 and persistence engine (quite common rPi with openHABian)
both database engines are persisting with the exact some strategy = identical *.persist files
Grafana not included

And know the question:

Which of both DB engines uses more resources (CPU, RAM)?
Which of both DB engines uses more space to store the data?

And in addition: Is there a performance difference between native mysql and JDBC persistence service?

Any experience regarding performance of the 2 DB’s is appreciated

Cheers
Sebastian

bulletprooffool · January 28, 2018, 11:38pm

Moved from MySQL on an old laptop to RPi 3 and influxdb / grafana.
It’s was worth the shift, for the graph enhancements for sure.
Can’t say I’ve had too much performance issues, but I’ve only got 14 bindings running and about 60 managed things (of course many more “items”)

If you don’t care about your historic data, the shift is super easy - you could always roll back if you see issues?
If you do, have a browse - there’s an article somewhere on here, where so one coded a data migration for MySQL to influxdb

splatch · January 29, 2018, 12:56pm

Primary difference between traditional databases and modern ones is how data is physically stored on the disk. Main reason why influx doesn’t have notion of table is because it always stores data sorted by time. Of course SQL table can be sorted by time too, but this quite often requires additional work from software to do such - first read according to criteria and then rearrange data. Retrieving data from influx for a week month ago will take less i/o cycles because there will be continous read and no lookups between index and actual row.

I think it would be possible to configure or create table and index definition within SQL database to achieve similar results however for most of users it is too difficult and error prone. I could summarise this as “tweaking SQL queries is an art”.

My current openhab measurements take 2.1 GB on disk. You might be curious how much data is in there - I’ve ran a basic query to influxdb:

$ curl -G "http://my-lovely-influxdb:8086/query" \
  --data-urlencode "db=openhab" \
  --data-urlencode "q=SELECT COUNT(*) FROM /.*/ WHERE time > now() - 1d AND time < now()" \
 |  jq "[.results[].series[].values[0][1]] | add"
706672

Total number is 706672 data points in 370 items. For past 30 days I collected 26 718 464 data points. 26 million entries is not a lot for typical server environment but usualy requires fine tunning of queries and proper data partitioning. I run it on nas, so I do have a typical PC, with 4 cores. This allows me to be a little bit less concerned about hardware limits, however I’ve ran influx for quite long time on RPI and it did work fine. Beside killing SD card.

PS. For past year I have 268 773 147 data points and influx was able to give me results within 90 seconds.

Best regards,
Lukasz

rlkoshak · January 29, 2018, 4:37pm

Do you already know that performance is going to be a problem or are you just solving a problem you don’t know you even have yet? A typical OH environment simply does not generate enough data to make the performance of the DBs make all that much of a difference.

I’ve no comparison data to offer but based on my knowledge of how the two work, I would expect InfluxDB to be more efficient in both RAM and CPU usage.

The problems you should focus on instead of which is more efficient:

developing a backup and recovery strategy
how to detect and deal with wearing out the SD card which WILL happen if you run persistence on the SD card?
what external tools do you plan to use to analyze and display this information and which DBs do they support?

hannibal29 · January 29, 2018, 5:29pm

Thx for the feedback so far.

Yes and no

I currently use influxdb (without grafana) on a Pi2 and the influxdb process is almost always position 1 when monitoring system performance with top (10-50% CPU, 5-15% MEM). Overall the performance of the Pi is sometimes quite slow as it is doing a lot of stuff for such a small SOC.

That’s the reason I consider moving over to MySQL at the moment. I want to do some analysis with KNIME which doesn’t support influxdb directly. However I can use csv export from influxdb.

So this is the question: Is it worth the effort migrating to MySQL to have better integration for analysis or will the performance become so bad that is it better to stick with influxdb and use the workaround with csv exports?

rlkoshak · January 29, 2018, 6:54pm

With the context, the question is much better.

All things considered, OH is flexible enough I’d recommend making the switch and trying it out. It isn’t that much effort. You can even run them both in parallel.

Though I highly doubt you will ever see MySQL consuming less resources that what you see with InfluxDB right now.

Whether the ability to use the external tools without converting back and forth is something only you can answer.

lfs_alp5 · January 29, 2018, 7:54pm

I converted from mysql to influxdb and back because I didn’t realize influxdb doesn’t have the same “join” possibilities that mysql has. I need the flexibility and speed of “joins” to graph some kWh aggregation graphs.

I never retrieve a whole year worth of data points anyways, I use the sql/db to aggregate any huge amounts of data (ie. monthly average etc.) because it’s fast and it saves me the trouble of doing it in the GUI (php).
Just my experience…

splatch · January 29, 2018, 10:19pm

These days tools are starting from very simple functionality and keep enriching it. I still remember Influx 0.11 which was able to store just time and value without any additional “tags”. What influx can do these days is fairly advanced in comparison to 2014.
Retrieving so many datapoints as I enlisted in previous post is just an example - that it is possible. Sometimes I just change time period in my grafana dashboard to past 6 months and see if there are any interesting trends.

Major advantage of being able to write and retain a lot of data in fairly compact form gives you advantage of building many aggregations from very granular source. Measurements which you write can be used to generate minute, hour or daily averages without loosing ability to re-compute. With traditional databases above certain number of rows, depending on hardware configuration, you might have to aggregate data in order to keep overall solution performing well. While it keeps database free of troubles it causes lose of precision and blocks possibility to filter out invalid values later on (ie. reads which are out of range for some reason).

Cheers,
Lukasz