Http Binding / Regex / Scrape Webpage

Dear all,

I have an inverter with no accessible API, so I want to scrape its web page to get the energy production data.
The HTML contains something like this:

        <!-- 历史发电量 -->
        <th scope="row">Lifetime generation</th>
        <td>64.2 kWh </td>
        <!-- 最近一次系统功率 -->
        <th scope="row">Last System Power</th>
        <td>847 W </td>
        <!-- 系统当天累计发电量 -->
        <th scope="row">Generation of Current Day</th>
        <td>0.69 kWh </td>

So, I want to extract the value of the current day.

This is my first time using regular expressions, so I tried this:

.*Generation of Current Day<\/th>\n\s*<td>[+-]?\d+((\.|\,)\d+)? kWh <\/td>.*

If I test that on

        <th scope="row">Generation of Current Day</th>
        <td>0.69 kWh </td>

it matches. Of course, I still have to reduce the output to the numeric value only. Any hints?

I tried to use that with OH3, to get a first result there:

  • Added an HTTP Thing and a channel that already receives the complete HTML code as a string.
  • Installed the REGEX transformation
  • Configured the channel:

The result is that the item is empty.
My questions:

  • How do I configure the channel so that I receive the right value?
  • Is there any optimization for the regular expression?

One important thing to realize about openHAB and REGEX, which you have already handled so that’s OK, is that the pattern has to match the full string. The .* at the beginning and end take care of that.

What gets returned is the first matching group, so whatever matches inside the first set of parens is what gets returned. You actually have two sets of parens, one nested inside the other. Groups are numbered by their opening paren, so the first group is the outer one, which covers only the decimal part of the number rather than the whole value. It also means an or operation needs care in OH’s REGEX Transformation, because the parens around it count as a group too.
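To make the group numbering concrete, here is a small sketch in plain JavaScript (not openHAB itself). JavaScript numbers capture groups by their opening parenthesis, the same rule Java's regex engine uses. The variable names and the second pattern are only illustrations, not something from the original post.

```javascript
// One table cell from the sample page.
var cell = "<td>0.69 kWh </td>";

// Pattern from the question: the only groups wrap the decimal part.
var nested = cell.match(/<td>[+-]?\d+((\.|\,)\d+)? kWh <\/td>/);
var group1 = nested[1]; // outer group: separator plus the digits after it
var group2 = nested[2]; // inner group: the separator only

// Illustrative variant: a single capturing group around the whole number,
// with a non-capturing group (?:...) for the optional decimal part.
var whole = cell.match(/<td>([+-]?\d+(?:[.,]\d+)?) kWh <\/td>/);
var value = whole[1];

console.log(group1, group2, value);
```

So with the original pattern, "first group" means only the part after the leading digits. Putting the parens around the whole number is what makes the first group the value you want.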

First of all, will the web page give you numbers with both . and ,? If not, avoid that complication and just match the one you know will be there. I suspect that is the root of your problem, so see if you can avoid it. Also be less strict in your matches. Maybe something like this (assuming no negative numbers):

.*Generation of Current Day.*td\>(\d*\.?\d+) kWh.*

Avoid needing to mess with all those spaces and HTML tags and such as much as possible.
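For anyone who wants to try the looser pattern outside openHAB, here is a rough sketch in plain JavaScript. It assumes the transformation behaves like a whole-input match where . also matches line breaks, which is imitated here with ^, $ and the s flag. The sample text is taken from the question and the variable names are made up.

```javascript
// Part of the sample page from the question, joined with line breaks.
var html = [
  '        <th scope="row">Last System Power</th>',
  '        <td>847 W </td>',
  '        <th scope="row">Generation of Current Day</th>',
  '        <td>0.69 kWh </td>'
].join("\n");

// The suggested pattern, anchored so it must cover the whole input.
// Assumption: "s" (dotAll) stands in for how openHAB treats "." here.
var suggested = ".*Generation of Current Day.*td\\>(\\d*\\.?\\d+) kWh.*";
var pattern = new RegExp("^" + suggested + "$", "s");

var match = html.match(pattern);
var dailyKwh = match ? parseFloat(match[1]) : NaN;

console.log(dailyKwh);
```

One thing to keep in mind: the .* in the middle is greedy, so if more kWh rows followed the "Generation of Current Day" row, the group would pick up the last of them. In the sample that row is the last one, so it does not matter there.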


Okay, I think the best approach is to start with a very simple example and make it more complex once the simple one works.
I tried a simple pattern (including REGEX: as the prefix).
And I get the whole HTML string as the result in my item. That’s not what I expected.
Is it correct that I entered this in the channel config and not in the item? That way I don’t use the item profile “REGEX”.

You don’t have a matching group. Remember, the REGEX must match the full String but the first matching Group is what gets returned. If you don’t define a matching group the full String is returned.
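The rule can be imitated with a few lines of JavaScript. This is only a stand-in written from the description in this thread, not openHAB's actual code: regexTransform is a made-up name, and returning null when the pattern does not cover the whole input is an assumption.

```javascript
// Hypothetical stand-in for the REGEX transformation as described above.
function regexTransform(pattern, input) {
  // Anchor the pattern so it has to match the full string;
  // the "s" flag lets "." match line breaks (assumption).
  var re = new RegExp("^(?:" + pattern + ")$", "s");
  var m = re.exec(input);
  if (m === null) return null;    // assumption: no full match, no result
  if (m.length < 2) return input; // no group defined: full string comes back
  return m[1];                    // otherwise: the first matching group
}

var sample =
  '<th scope="row">Generation of Current Day</th>\n<td>0.69 kWh </td>';

var noGroup = regexTransform(".*Current Day.*", sample);
var withGroup = regexTransform(".*<td>(\\d+\\.\\d+) kWh.*", sample);
var notFull = regexTransform("<td>(\\d+\\.\\d+) kWh", sample);

console.log(noGroup === sample, withGroup, notFull);
```

The first call has no parens and gives back the whole input, which is what you saw in your item. The second has one group and gives back only the number. The third has a group but does not cover the full string, so nothing is extracted.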

If you want just one of the td tags use


The parens tell the transformation what to return.

You could do it in either place, but usually the Channel config is preferable.

I tried your expression and it works great!
Thank you a lot.
I think I have to do a little bit more homework on regular expression to get a better understanding.
Unfortunately there are so many topics you have to dive into if you have individual requirements (YAML, CSS, APIs, HTTP, REGEX, Python …)

I want to switch to the JS transformation, because I need to do a small calculation on top.
But the regex isn’t working. I think the issue is the line break between </th> and <td>. Is there a difference between JS and REGEX?

As I mentioned above, openHAB’s REGEX works differently from other systems. If you are doing a regex inside JavaScript, you are using JavaScript’s regex and need to follow how that works. See the “Regular expressions” guide for JavaScript on MDN.
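As a rough illustration of the difference: in JavaScript the pattern does not have to cover the whole string, and . does not match line breaks unless the s flag is set. A simple way around the line break is \s*, which matches spaces and line breaks alike. The sketch below is deliberately written in old-style JavaScript (var, no s flag). The function name and the kWh to Wh conversion are made up for the example, since the actual calculation was not posted, and how the script receives its input from openHAB is not shown.

```javascript
// Hypothetical helper: pull the daily generation out of the page and
// convert it from kWh to Wh (the conversion is only an example calculation).
function dailyGenerationWh(html) {
  // \s* between </th> and <td> absorbs the line break and the indentation.
  var re = /Generation of Current Day<\/th>\s*<td>\s*(\d+(?:[.,]\d+)?)\s*kWh/;
  var m = re.exec(html);
  if (m === null) {
    return null; // the row was not found
  }
  // Accept both "0.69" and "0,69" by normalising the decimal separator.
  var kwh = parseFloat(m[1].replace(",", "."));
  return Math.round(kwh * 1000);
}

var page =
  '        <th scope="row">Generation of Current Day</th>\r\n' +
  '        <td>0.69 kWh </td>';

console.log(dailyGenerationWh(page));
```

Because exec only needs to find the pattern somewhere in the input, the leading and trailing .* from the REGEX transformation are not needed here.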