Textual configuration: http binding and RegEx

s0170071 · April 21, 2021, 7:37am

I am running OH3 on a RPi4.
I already use the http binding, which works well for JSON and RegEx. I was able to use a simple float RegEx with a page inside my local network, but that page contained just the float string, with no additional HTML code or anything else.

This is the channel that works:

Type number : Channel_Vito_mittl_Ausgangstemp_Luft     "Vito_mittl_Ausgangstemp_Luft"     [ stateExtension="read?DP=0x16b1&Type=TempL", stateTransformation="REGEX:(^[-+]?[0-9]+\\.[0-9]+)" ]

So RegEx in general seems to work.
Also, my RPi can ping external sites and fetch JSON from there, so I assume network access is OK too.
As soon as I try to fetch data from an HTML page using RegEx, it fails. Here is what I have:

html page source http://www.n-tv.de

<!doctype html> <html class="no-js" lang="de"> <head> <title>Nachrichten, aktuelle Schlagzeilen und Videos - n-tv.de</title> <base href="https://www.n-tv.de/"> <meta charset="utf-8"> <meta http-equiv="x-ua-compatible" content="ie=edge"> <meta name="viewport" content="width=device-width, initial-scale=1"> <meta name="author" content="n-tv NACHRICHTEN" />
...

Things file:

Thing http:url:ntv "ntv"   [baseURL="https://www.n-tv.de",   refresh="100", timeout="3000"]   {
    Channels:
        Type string : Channel_ntv    "ntv [%s]"         [stateTransformation="REGEX:.*html*.>(.*)<title.*"]
}

Items file:


String    ntv      "ntv [%s]"       <line>     (gSomeGroup)    [ "Measurement" ]    {channel="http:url:ntv:Channel_ntv", expire="12h"}

For now I am just trying to fetch anything from the n-tv.de site, e.g. the string

<html class="no-js" lang="de"> <head>

I even tried “…RegEx:(.*)”] to get the whole HTML, but that failed too. And yes, I tried increasing the buffer size to 100k.
There are no errors in the log.

Any help is appreciated.

rossko57 · April 21, 2021, 10:19am

At the moment, you don’t seem sure if your webpage fetch is working, so forget regex for now. Just get the whole page to string, and have a look at it. There might be a rejection message.

s0170071 · April 22, 2021, 8:03am

tl;dr:
To match a float between the strings “start” and “end”, use

[stateTransformation="REGEX:.*?start([+-]?([0-9]*[\\.])?[0-9]+)?end.*"]

Explanation
The problem is that RegEx in openHAB does not behave like e.g. regex101.com. An openHAB RegEx needs to match the whole string, and that makes the greedy behaviour of .* much more noticeable. Here is an example.
Greedy means that a quantifier such as .* tries to match a string that is as long as possible. Think of it as if it starts from the end of the string and works its way backwards until the rest of the expression matches.

If you have the string

<html lang='en'><head><meta charset='utf-8'/>

and you want to match just <head>, on regex101.com you would use

>(.*)<

as the regular expression. It starts at the first >, captures everything inside the parentheses and stops at the last < it can find, so the group contains <head>.
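To see the difference outside openHAB, here is a minimal Python sketch. It assumes that re.search stands in for the "find it anywhere" behaviour of regex101.com and that re.fullmatch stands in for a transformation that has to cover the whole input. Both are stand-ins for illustration, not openHAB code.

```python
import re

text = "<html lang='en'><head><meta charset='utf-8'/>"
pattern = r">(.*)<"

# Search semantics: the pattern may match anywhere inside the string.
found = re.search(pattern, text)
print(found.group(1))  # <head>

# Whole-string semantics: the pattern must cover the input from start to end.
# The text begins with '<', not '>', so there is no match at all.
whole = re.fullmatch(pattern, text)
print(whole)  # None
```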

If you use the same RegEx in openHAB, it won't match at all unless your string begins with a >. Because your string starts with something else, your RegEx needs to account for that by matching everything before and including the >. The RegEx for that is .*>
Now the greedy part is that this expression would match everything up to the last >, not the first one. Therefore you need to make it non-greedy by appending a ?. Your RegEx to match everything up to and including the first > would be

.*?>
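A quick way to check the greedy versus non-greedy behaviour is this Python sketch. It only illustrates the two quantifiers on the sample string. re.match anchors at the start of the string, which is an assumption chosen to mimic the "must begin at the first character" part.

```python
import re

text = "<html lang='en'><head><meta charset='utf-8'/>"

# Greedy: .* takes as much as possible, so the match runs to the last '>'.
# The sample string ends with '>', so the greedy match is the whole string.
greedy = re.match(r".*>", text).group(0)

# Non-greedy: .*? takes as little as possible and stops at the first '>'.
lazy = re.match(r".*?>", text).group(0)

print(greedy == text)  # True
print(lazy)            # <html lang='en'>
```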

Now comes the capturing group, which is some characters. In RegEx that's again .* but in parentheses. Make that non-greedy as well, so it captures as little as possible, by appending a ?. The RegEx for the capturing group is

(.*?)

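It stops matching at the next < character, so we need to add a < to the expression.

Now we need to match the rest of the string, just to satisfy openHAB. That's again .* but you can leave it greedy. The whole RegEx in this example would be

.*?>(.*?)<.*

One caveat: in the sample string the first > is directly followed by a <, so the non-greedy group captures an empty string here. To capture <head> as in the regex101.com example, require at least one character by using (.+?) instead of (.*?).
It also works with multi-line strings.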
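The final expression can be tried out with a small Python sketch. The helper transform is hypothetical. It assumes the transformation behaves like a whole-string match where . also matches line breaks (re.DOTALL here), which fits the multi-line remark but is not taken from the openHAB source. The (.+?) variant is not from the post and is included only for comparison.

```python
import re

text = "<html lang='en'><head><meta charset='utf-8'/>"

def transform(pattern, source):
    """Return group 1 if the pattern covers the whole input, else None."""
    match = re.fullmatch(pattern, source, flags=re.DOTALL)
    return match.group(1) if match else None

# The first '>' is directly followed by '<', so the lazy group stays empty.
print(repr(transform(r".*?>(.*?)<.*", text)))  # ''

# Requiring at least one character makes the group run up to the next '<'.
print(transform(r".*?>(.+?)<.*", text))        # <head>

# With DOTALL the leading and trailing .* can span several lines.
multi_line = "first line\n<p>value</p>\nlast line"
print(transform(r".*?>(.*?)<.*", multi_line))  # value
```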
In general, to match the first expression between “start” and “end” of your string, do:

.*?start(.*?)end.*
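As a rough check of the "first expression" claim, this Python sketch builds the same pattern from two marker strings. The function between and its parameters are made up for illustration, and re.fullmatch with re.DOTALL is again only an assumed stand-in for the whole-string matching described above.

```python
import re

def between(source, start="start", end="end"):
    """Return the first text found between the two markers, else None."""
    # re.escape keeps markers with special characters from acting as regex syntax.
    pattern = ".*?" + re.escape(start) + "(.*?)" + re.escape(end) + ".*"
    match = re.fullmatch(pattern, source, flags=re.DOTALL)
    return match.group(1) if match else None

print(between("a start1end b start2end"))         # 1
print(between("no markers here"))                 # None
print(between("x <b>bold</b> y", "<b>", "</b>"))  # bold
```

The leading .*? is what makes the first start/end pair win. With a greedy .* in front, the last pair would be captured instead.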

To match a float number between “start” and “end” of your string, do:

.*?start([+-]?([0-9]*[\\.])?[0-9]+)?end.*

The textual config in the things file would then look like this, for example:

Type string : Channel_WebsiteWithFloatNumber "my Number as String: [%s]" [ stateExtension="number.html", stateTransformation="REGEX:.*?start([+-]?([0-9]*[\\.])?[0-9]+)?end.*"]
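The float expression can be tested offline with a Python sketch like the one below. It assumes the doubled backslash in the things file reaches the regex engine as a single backslash, and it uses re.fullmatch as a stand-in for the whole-string matching. The name extract_float and the sample inputs are hypothetical.

```python
import re
from typing import Optional

# Same expression as in the channel config, with the backslash written once.
FLOAT_BETWEEN = r".*?start([+-]?([0-9]*[\.])?[0-9]+)?end.*"

def extract_float(source: str) -> Optional[float]:
    """Return the number between 'start' and 'end', or None if there is none."""
    match = re.fullmatch(FLOAT_BETWEEN, source, flags=re.DOTALL)
    # The outer group is optional, so a match can succeed with an empty group.
    if match is None or match.group(1) is None:
        return None
    return float(match.group(1))

print(extract_float("<p>start-12.5end</p>"))  # -12.5
print(extract_float("start42end"))            # 42.0
print(extract_float("start.5end"))            # 0.5
print(extract_float("startend"))              # None
print(extract_float("start abc end"))         # None
```

Because of the trailing ? after the capturing group, an input like "startend" still matches the expression but yields no number, so a check for an empty group is worth keeping.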

Hope that helps someone.