Get a value from an HTML site with the HTTP binding and regex

Asti · February 10, 2021, 12:02pm

Hello, I am running openHAB 2.5.11 on Debian 9.13 with Zulu Java version 8.0.275-3
and am trying to fetch some data from a website (a small logger for my solar panels).
I started by installing the HTTP binding and the RegEx transformation add-on.

I must say that I am a total beginner in openHAB, programming, and regex alike. I mostly get my information from examples here.

I have created the following item:
String weblogtest "Weblogtest" { http="<[http://192.168.50.10/html/de/onlineAdmain.html:60000:REGEX((.*?))]" }

When I hover over the item in Visual Studio Code, I can see that the whole website is in there.
Here is a small part of it, which is also where my desired information is located.

I am currently interested in the value beneath Gesamtenergie (here: 382988.548).

With some help from the Regex101 website I figured out that this gets me what I want:
(?m)(?<=Gesamtenergie)<.td>\n…(.* )(?=<.td>)
This may not be an elegant method, but it works on that website, and only there. openHAB does not accept it and I get the error below
(note: the error may not be for exactly the regex above, but it is the same for every attempt I have made):
Caused by: org.openhab.model.item.binding.BindingConfigParseException: bindingConfig ‘<[http://192.168.50.10/html/de/onlineAdmain.html:60000:REGEX((?<=Gesamtenergie)<.td> …(.*)(?=<.td>))]’ doesn’t contain a valid binding configuration

I found here on the forum that regex in openHAB behaves a bit differently than on the website.
So I started over again with this regex:
.* (Gesamtenergie).*
It selects the correct word, but this is where I get lost, because all my attempts to get the value on the next line fail.

Regards, Asti
[edit] Removed the HTML code from the website because it was not displayed as intended. Took a picture instead.
[edit] The forum strips some of the formatting from my regex. * is not shown, so I have added some “space” to correct that.

rlkoshak · February 10, 2021, 4:50pm

You would probably be wise to move to the HTTP 2 binding as the HTTP 1 binding is not compatible with OH 3 and will not be available when you decide to upgrade. Better to start out with it now instead of needing to change later.

In OH, REGEX works a little differently from normal. Your expression needs to match the entire string, and then the first matching group (the first set of parens) is what gets returned. That is what your second attempt did: the expression does indeed match the full string, and it returned the part of the pattern inside the first set of parens.
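To make the "match everything, return the first group" rule concrete, here is a minimal Python sketch. It is only a stand-in and not openHAB code: `oh_regex` is a hypothetical helper, the two-line `page` is made up from the value quoted earlier in the thread, and it assumes that `.` is allowed to match newlines.

```python
import re
from typing import Optional


def oh_regex(pattern: str, text: str) -> Optional[str]:
    """Hypothetical stand-in for the REGEX transformation (not openHAB code):
    the pattern must match the whole input and capture group 1 is returned."""
    # Assumption: '.' may match newlines, so '.*' can span several lines.
    match = re.fullmatch(pattern, text.strip(), flags=re.DOTALL)
    if match is None or match.re.groups == 0:
        return None
    return match.group(1)


# Made-up two-line input for illustration.
page = "<td>Gesamtenergie</td>\n<td>382988.548</td>"

# Matches only part of the input, so nothing comes back.
partial = oh_regex(r"(Gesamtenergie)", page)

# Matches the whole input, but group 1 is the literal word, not the value.
word = oh_regex(r".*(Gesamtenergie).*", page)

# Parens moved to the part that should be returned.
value = oh_regex(r".*Gesamtenergie</td>\n<td>(.*)</td>.*", page)

print(partial, word, value)  # None Gesamtenergie 382988.548
```

The third call only works here because the made-up input has nothing but a single newline between the two cells.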

You need to use code fences.

```
code goes here
```

Or inline

Some text `code` some more text.

I think the following will return what you are after.

.*Gesamtenergie</td>\\n<td>(.*)</td>.*

Notice the escaping for the newline (extra `\`). Also notice we put the parens around the part we want returned.

It is possible that the page uses Windows two-character newlines (`\r\n`), so if the above doesn’t work you might be able to use something like

.*Gesamtenergie</td>.+<td>(.*)</td>.*

That will match one or more characters after the </td>.

In general you want to find unique markers for the start and end of the string you want. Put those two markers on either side of the parens.
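The newline caveat can be checked offline with a small Python sketch. The two inputs are made up (same content, different line endings), `first_group` is a hypothetical helper, and the patterns are written with a single backslash because Python raw strings do not need the extra escaping an items file does. It again assumes `.` matches newlines.

```python
import re


def first_group(pattern, text):
    """Return capture group 1 if the pattern matches the whole text, else None."""
    # Assumption: '.' may match newline characters (DOTALL).
    match = re.fullmatch(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None


# Same made-up content with Unix and Windows line endings.
unix_page = "<td>Gesamtenergie</td>\n<td>382988.548</td>"
windows_page = "<td>Gesamtenergie</td>\r\n<td>382988.548</td>"

strict = r".*Gesamtenergie</td>\n<td>(.*)</td>.*"     # expects exactly \n
either = r".*Gesamtenergie</td>\r?\n<td>(.*)</td>.*"  # optional \r before \n
loose = r".*Gesamtenergie</td>.+<td>(.*)</td>.*"      # anything between the cells

print(first_group(strict, unix_page))     # 382988.548
print(first_group(strict, windows_page))  # None
print(first_group(either, windows_page))  # 382988.548
print(first_group(loose, windows_page))   # 382988.548
```

The `\r?\n` variant is a stricter alternative to `.+`; which one suits the real page depends on what actually sits between the two cells.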

hafniumzinc · February 10, 2021, 5:22pm

Note that this doesn’t really exist as an official binding for OH2. A couple of people have shared work-in-progress JARs on the original development thread.

rlkoshak · February 10, 2021, 5:23pm

I thought it was released in 2.5.10 or 2.5.11.

hafniumzinc · February 10, 2021, 5:27pm

I mistakenly thought I was on 2.5.11, but turns out I was only on 2.5.9 - I’ll update now and check, unless someone else gets in and confirms before me!

(Though I can’t see any new bindings mentioned in the release notes for .10 or .11)

Confirmed not in the 2.5.12 release. Only V1 is available.

Asti · February 11, 2021, 9:00am

Thanks for the heads up, I will note that for my migration plan.

I used the <code> variant, but I will use the other instead now, thanks.

rlkoshak:

I think the following will return what you are after.
.*Gesamtenergie</td>\\n<td>(.*)</td>.*
Notice the escaping for the newline (extra `\`). Also notice we put the parens around the part we want returned.

I think I am getting closer to understanding the syntax. I tried your code and there is no error in the log, but my String item shows a “space or tab” or something similar. On a sitemap it is just white. It is not NULL.

rlkoshak:

It could be possible that the page is using Windows two character newlines so if the above doesn’t work you might be able to use something like
.*Gesamtenergie</td>.+<td>(.*)</td>.*
That will match one or more characters after the </td>.

I tried that one too and this time my stringitem shows  

My assumption is that it is matching whitespace in both cases. Am I correct?
Here is part of the HTML code; this time I copied it from the website directly and not from the String item:

				<tr>
					<td class="tablebody">
						<table width="100%" border="0">
							<tr class="tablehead">
								<td colspan="7"><strong>Summe ausgewählter Digitaleingänge</strong></td>
							</tr>
							<tr>
								<td class="tablehead">Bezeichnung</td>
								<td class="tablehead">Wert</td>
								<td class="tablehead">Einheit</td>
								<td>&nbsp;</td>
								<td class="tablehead">Bezeichnung</td>
								<td class="tablehead">Wert</td>
								<td class="tablehead">Einheit</td>
							</tr>
							<tr>
								<td>Aktuelle Leistung</td>
								<td><b id="pow">0.951</b></td>
								<td>kW</td>
								<td>&nbsp;</td>
								<td>Aktuelle Monatsenergie</td>
								<td>220.601</td>
								<td>kWh</td>
							</tr>
							<tr>
								<td>Aktuelle Tagesenergie</td>
								<td><b id="enDay">0.910</b></td>
								<td>kWh</td>
								<td>&nbsp;</td>
								<td>Aktuelle Jahresenergie</td>
								<td>640.815</td>
								<td>kWh</td>
							</tr>
							<tr>
								<td>Tagesenergie Vortag</td>
								<td>21.390</td>
								<td>kWh</td>
								<td>&nbsp;</td>
								<td>Gesamtenergie</td>
								<td>383009.911</td>
								<td>kWh</td>
							</tr>
						</table>
					</td>
				</tr>
				<tr><td>&nbsp;</td></tr>
				<tr><td>&nbsp;</td></tr>

This explanation helped me understand the structure of regex better. Thanks a lot.
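One way to see what the `.+` expression quoted above captures is to run it against the tail of the posted HTML outside openHAB. The Python sketch below simplifies the indentation to one tab and assumes that `.` matches newlines, which may or may not mirror the transformation exactly.

```python
import re

# Tail of the posted table, indentation simplified to a single tab.
page = """<tr>
\t<td>Tagesenergie Vortag</td>
\t<td>21.390</td>
\t<td>kWh</td>
\t<td>&nbsp;</td>
\t<td>Gesamtenergie</td>
\t<td>383009.911</td>
\t<td>kWh</td>
</tr>
</table>
<tr><td>&nbsp;</td></tr>
<tr><td>&nbsp;</td></tr>"""

pattern = r".*Gesamtenergie</td>.+<td>(.*)</td>.*"

# Assumption: '.' may match newlines, so the pattern can span the whole input.
match = re.fullmatch(pattern, page, flags=re.DOTALL)
captured = match.group(1)

# repr() makes otherwise invisible content easy to read.
print(repr(captured))  # '&nbsp;'
```

On this snippet the greedy `.+` runs on to the last `<td>` in the input, so the captured text is the `&nbsp;` entity from a spacer row rather than literal spaces or tabs. That would likely look blank wherever it is rendered as HTML.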

Asti · February 11, 2021, 2:39pm

I have been tinkering for a while now, trying to understand more. Through trial and error I got as far as this regex, which gets me very close to what I want:

.*<td>Gesamtenergie</td>(.*)<td>kWh</td>.*

Which results in this string:

                                <td>383009.911</td>

I am a bit confused by the whitespace ahead of <td> and cannot get past it or remove it.

I have tried a few combinations, but they always fail and I get a string which I guess is empty.
It looks like this in Visual Studio Code.

[Screenshot: webloggesamt_item_empty]

Asti · February 11, 2021, 3:27pm

I finally got it. This Regex works for me.

.*<td>Gesamtenergie</td>.+<td>(.*)</td>.+<td>kWh</td>.*

[Screenshot: webloggesamt_item_working]

Thanks a lot @rlkoshak for pointing me in the right direction and giving me a good explanation of how regex in openHAB works.

Regards Asti

rlkoshak · February 11, 2021, 3:28pm

Yes, it seems there is a bunch of white space, and there clearly is. But the `.+` should handle that.

Well, the original is indented, and spaces are used to implement that, so we need to add that to the “marker” before the value you want to return. `.*` should match zero or more of any character; `.+` should match one or more.

.*<td>Gesamtenergie</td>.*<td>(.*)</td>.*

That should consume the white space and the open and close tags around the number, returning just the number.

Asti · February 11, 2021, 3:36pm

`.*<td>Gesamtenergie</td>.*<td>(.*)</td>.*`

I don't know why, but without the

</td>.*<td>kWh</td>.*

at the end it does not work. But * works fine instead of +, just as you said.

Thanks again.
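A possible explanation for why the trailing `<td>kWh</td>` is needed: `.*` is greedy, so without a marker after the value the pattern settles on the last `<td>…</td>` pair it can find. The Python sketch below tries this on the tail of the posted HTML (indentation simplified). It assumes `.` matches newlines, and the lazy `.*?` variant is an untested alternative, not something confirmed to work in openHAB.

```python
import re

# Tail of the posted table, indentation simplified to a single tab.
page = """<tr>
\t<td>Tagesenergie Vortag</td>
\t<td>21.390</td>
\t<td>kWh</td>
\t<td>&nbsp;</td>
\t<td>Gesamtenergie</td>
\t<td>383009.911</td>
\t<td>kWh</td>
</tr>
</table>
<tr><td>&nbsp;</td></tr>
<tr><td>&nbsp;</td></tr>"""


def first_group(pattern, text):
    """Return capture group 1 if the pattern matches the whole text, else None."""
    # Assumption: '.' may match newlines (DOTALL).
    match = re.fullmatch(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None


# No marker after the value: greedy '.*' jumps ahead to the last cell.
no_end_marker = r".*<td>Gesamtenergie</td>.*<td>(.*)</td>.*"

# Marker after the value: only the cell directly before 'kWh' fits.
end_marker = r".*<td>Gesamtenergie</td>.*<td>(.*)</td>.*<td>kWh</td>.*"

# Lazy quantifiers stop at the first cell after the label instead.
lazy = r".*<td>Gesamtenergie</td>.*?<td>(.*?)</td>.*"

print(first_group(no_end_marker, page))  # &nbsp;
print(first_group(end_marker, page))     # 383009.911
print(first_group(lazy, page))           # 383009.911
```

The end marker works because it rules out every later cell; the lazy form gets the same result on this snippet without needing to know what follows the value.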

Asti · February 14, 2021, 12:33pm

I just wanted to add something. As I went further and created the second and third items, I noticed that I have to be more specific with my regex. I had to go as far as the next unique entry.
So one of my next items looks like this:

String weblogtagstr "Solar Tages Energie" { http="<[http://192.168.50.10/html/de/onlineAdmain.html:60000:REGEX(.*<td>Aktuelle Tagesenergie</td>.*<td><b id=.enDay.>(.*)</b></td>.*<td>kWh</td>.*<td>&nbsp;</td>.*<td>Aktuelle Jahresenergie.*)]" }

Otherwise I will not get the correct value.
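As a possible alternative to extending each pattern up to the next unique entry, the value cell can be pinned down with `\s*` (whitespace only) and `[^<]*` (anything up to the next tag). The Python sketch below builds such a pattern per label. `cell_pattern` and `extract` are hypothetical helpers, the HTML is the posted table with simplified indentation, and none of this has been tested in openHAB. In an items file the backslash in `\s` would presumably need doubling, in line with the escaping note earlier in the thread.

```python
import re
from typing import Optional

# Rows from the posted table, indentation simplified to a single tab.
page = """<tr>
\t<td>Aktuelle Tagesenergie</td>
\t<td><b id="enDay">0.910</b></td>
\t<td>kWh</td>
\t<td>&nbsp;</td>
\t<td>Aktuelle Jahresenergie</td>
\t<td>640.815</td>
\t<td>kWh</td>
</tr>
<tr>
\t<td>Tagesenergie Vortag</td>
\t<td>21.390</td>
\t<td>kWh</td>
\t<td>&nbsp;</td>
\t<td>Gesamtenergie</td>
\t<td>383009.911</td>
\t<td>kWh</td>
</tr>
<tr><td>&nbsp;</td></tr>"""


def cell_pattern(label: str) -> str:
    """Build a pattern capturing the cell right after the one holding `label`."""
    return (
        r".*<td>" + re.escape(label) + r"</td>"
        r"\s*<td>(?:<b[^>]*>)?"  # only whitespace, then an optional <b ...> tag
        r"([^<]*)"               # the value: everything up to the next tag
        r"(?:</b>)?</td>.*"
    )


def extract(label: str, text: str) -> Optional[str]:
    """Return the value next to `label`, or None if the pattern does not match."""
    # Assumption: '.' may match newlines (DOTALL).
    match = re.fullmatch(cell_pattern(label), text, flags=re.DOTALL)
    return match.group(1) if match else None


for name in ("Aktuelle Tagesenergie", "Aktuelle Jahresenergie",
             "Tagesenergie Vortag", "Gesamtenergie"):
    print(name, "=", extract(name, page))
```

Because `[^<]*` cannot cross a tag, the capture stays inside one cell no matter what follows it on the page.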