Textual configuration: HTTP binding and RegEx

I am running OH3 on an RPi4.
I already use the http binding, which works well with JSON and RegEx. A simple float RegEx worked against a page inside my local network, but that page contained just the float string, with no additional HTML or anything else.

This is the channel that works:

Type number : Channel_Vito_mittl_Ausgangstemp_Luft     "Vito_mittl_Ausgangstemp_Luft"     [ stateExtension="read?DP=0x16b1&Type=TempL", stateTransformation="REGEX:(^[-+]?[0-9]+\\.[0-9]+)" ]

So RegEx in general seems to work.
Also, my RPi can ping external sites and fetch JSON from them, so I assume network access is OK too.
But as soon as I try to fetch data from an HTML page using RegEx, it fails. Here is what I have:

HTML page source of http://www.n-tv.de:

<!doctype html> <html class="no-js" lang="de"> <head> <title>Nachrichten, aktuelle Schlagzeilen und Videos - n-tv.de</title> <base href="https://www.n-tv.de/"> <meta charset="utf-8"> <meta http-equiv="x-ua-compatible" content="ie=edge"> <meta name="viewport" content="width=device-width, initial-scale=1"> <meta name="author" content="n-tv NACHRICHTEN" />
...

Things file:

Thing http:url:ntv "ntv"   [baseURL="https://www.n-tv.de",   refresh="100", timeout="3000"]   {
    Channels:
        Type string : Channel_ntv    "ntv [%s]"         [stateTransformation="REGEX:.*html*.>(.*)<title.*"]
}

Items file:


String    ntv      "ntv [%s]"       <line>     (gSomeGroup)    [ "Measurement" ]    {channel="http:url:ntv:Channel_ntv", expire="12h"}

For now I am just trying to fetch anything from the n-tv.de site, e.g. the string

<html class="no-js" lang="de"> <head>

I even tried "…REGEX:(.*)"] to get the whole HTML, but that failed too. And yes, I tried increasing the buffer size to 100k :wink:
There are no errors in the log.

Any help is appreciated.

At the moment, you don’t seem sure whether your webpage fetch is working at all, so forget regex for now. Just fetch the whole page into a string and have a look at it. There might be a rejection message.

tl;dr
To match a float between the strings “start” and “end”, use

[stateTransformation="REGEX:.*?start([+-]?([0-9]*[\\.])?[0-9]+)?end.*"]

Explanation
The problem is that RegEx in openHAB does not behave like e.g. regex101.com. In openHAB the expression has to match the whole string, and because quantifiers such as .* are greedy by default, that requirement is easy to get wrong. I'll give an example.
Greedy means that the RegEx algorithm tries to return a string that is as long as possible. Think of it as if it starts from the end of the string and works its way backwards until it finds a match.
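
Greediness itself can be seen outside openHAB. The following is a minimal Python sketch; the sample string is made up for illustration, and Python's `re` module is only a stand-in for the regex engine openHAB uses.

```python
import re

text = "a1b2b"  # illustrative sample, not from the original post

# Greedy: .* grabs as much as possible, so the match runs to the LAST b.
greedy = re.search(r"a.*b", text).group(0)

# Lazy: .*? grabs as little as possible, so the match stops at the FIRST b.
lazy = re.search(r"a.*?b", text).group(0)

print(greedy)  # a1b2b
print(lazy)    # a1b
```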

If you have the string

<html lang='en'><head><meta charset='utf-8'/>

and you want to match just <head>, on regex101.com you would use

>(.*)<

as the regular expression. It starts at the first >, captures everything inside the parentheses and stops at the last <.
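
The two matching modes can be compared in a short Python sketch. It assumes that a tester like regex101.com searches for the pattern anywhere in the text, and that openHAB requires the pattern to cover the whole input; `re.search` and `re.fullmatch` (with `re.DOTALL`) are used as stand-ins for those two behaviours, not as openHAB's actual implementation.

```python
import re

text = "<html lang='en'><head><meta charset='utf-8'/>"
pattern = r">(.*)<"

# Search mode: the pattern may start anywhere in the text.
found = re.search(pattern, text)
print(found.group(1))  # <head>

# Whole-string mode (assumed openHAB behaviour): the pattern must
# cover the text from the first character to the last one.
whole = re.fullmatch(pattern, text, flags=re.DOTALL)
print(whole)  # None, because the text does not begin with >
```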

If you use the same RegEx in openHAB, it won't match unless your string begins with a >. Because your string starts with something else, your expression needs to account for that by matching everything before and including the >. The expression for that is .*>
The catch is that, being greedy, this expression would match everything up to the last >, not the first one. Therefore you need to make it non-greedy by appending a ?. The RegEx that matches everything up to and including the first > is

.*?>
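
As a quick check of the greedy and the non-greedy prefix, here is a small Python sketch on the example string. Python's `re.match` (anchored at the start of the text) is used purely for illustration.

```python
import re

text = "<html lang='en'><head><meta charset='utf-8'/>"

# Greedy prefix: runs up to the LAST > (here the very end of the text).
greedy_prefix = re.match(r".*>", text).group(0)

# Non-greedy prefix: stops at the FIRST >.
lazy_prefix = re.match(r".*?>", text).group(0)

print(greedy_prefix)  # <html lang='en'><head><meta charset='utf-8'/>
print(lazy_prefix)    # <html lang='en'>
```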

Next comes the capturing group, which holds the text you want. In RegEx that's again .* but in parentheses. Make it non-greedy as well, so that it captures as little as possible, by appending a ?. The RegEx for the capturing group is

(.*?)

It stops matching at the next character, which is <, so we add that to the expression.

Now we need to match the rest of the string, just to satisfy openHAB. That's again .* but you can leave it greedy. The whole RegEx in this example would be

.*?>(.*?)<.*

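It also works with multi-line strings. Note that on the example string above the capturing group comes back empty, because the first > is immediately followed by a <; the pattern is meant for text that sits between a > and the next <.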
In general, to match the first expression between “start” and “end” of your string, do:

.*?start(.*?)end.*
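
The general pattern can be tried on a multi-line page with a short Python sketch. The helper `between` and the sample page are hypothetical, and the sketch assumes that openHAB matches the whole input with the dot also matching line breaks; `re.fullmatch` with `re.DOTALL` imitates that here.

```python
import re

def between(start, end, text):
    """Hypothetical helper: apply .*?start(.*?)end.* to the whole text."""
    pattern = ".*?" + re.escape(start) + "(.*?)" + re.escape(end) + ".*"
    m = re.fullmatch(pattern, text, flags=re.DOTALL)
    return m.group(1) if m else None

# Made-up multi-line page for illustration.
page = "<!doctype html>\n<html>\n<head>\n<title>Nachrichten</title>\n</head>\n</html>"

title = between("<title>", "</title>", page)
print(title)  # Nachrichten

# Without DOTALL the dot cannot cross a line break, so nothing matches.
no_dotall = re.fullmatch(r".*?<title>(.*?)</title>.*", page)
print(no_dotall)  # None
```

If a pattern works on a single-line test string but not on a real page, line breaks are one thing worth checking.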

To match a float number between “start” and “end” of your string, do:

.*?start([+-]?([0-9]*[\\.])?[0-9]+)?end.*

The textual config of the things file would then look e.g. like this:

Type string : Channel_WebsiteWithFloatNumber "my Number as String: [%s]" [ stateExtension="number.html", stateTransformation="REGEX:.*?start([+-]?([0-9]*[\\.])?[0-9]+)?end.*"]
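
Before putting the float pattern into a Things file, it can be tested in a few lines of Python. This is only a sketch: the function `extract` and the sample inputs are hypothetical, `re.fullmatch` with `re.DOTALL` is an assumed approximation of openHAB's whole-string matching, and the doubled backslash from the Things file is written as a single one in the raw string.

```python
import re

# Same pattern as above; \\. in the Things file becomes \. here.
FLOAT_BETWEEN = r".*?start([+-]?([0-9]*[\.])?[0-9]+)?end.*"

def extract(text):
    """Hypothetical helper: return the captured number as a string, or None."""
    m = re.fullmatch(FLOAT_BETWEEN, text, flags=re.DOTALL)
    return m.group(1) if m else None

print(extract("<p>start21.5end</p>"))  # 21.5
print(extract("x start-3end y"))       # -3
print(extract("startend"))             # None
print(extract("no markers"))           # None
```

The trailing ? after the capturing group makes the number optional, so "startend" still matches but captures nothing; in this sketch that case is indistinguishable from no match at all.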

Hope that helps someone.
