[SOLVED] Web page scraping with REGEX

I am trying to get the air quality of my city from a web page using the http binding and a REGEX transformation.

I have a REGEX that works on the REGEX tester pages and seems compatible with openHAB (no \ or " in the transform or result).
My first REGEX was data-index="(.*)">Aujourd, but this caused problems with displayed items, so I had to replace the " and > characters with the . wildcard.

Finally, my item is the following:

String Air_quality_today "Today [%s]" {http="<[https://www.atmo-auvergnerhonealpes.fr/monair/commune/38185:3600000:REGEX(data-index=.(.*)..Aujourd)]"}

But it doesn't seem to access the web page: no number is displayed on the sitemap. I also tried replacing (.*) with (\d*.\d), without success.
Could it be because the page is served over https?
Any tips are welcome.

You should try enabling debug logging for the http binding: https://www.openhab.org/docs/administration/logging.html

Hello,
Try this one:

REGEX(data-index=\"(.*)\">Aujourd)

Thanks for your inputs @lfs_alp5 and @vzorglub

  • Logging only shows that the item changed from NULL to null. No error appears.

  • I am having no more luck with the alternative ways of writing the REGEX. Using square brackets instead of parentheses leads to “Error REGEX does not follow the expected pattern”.

Hm, what if you try:
REGEX(data-index=\"(.*?)\".*?Aujourd.*?)]

No better results…

16:45:43.491 [INFO ] [smarthome.event.ItemStateChangedEvent] - Air_quality_today changed from NULL to null

I did use regex101 to check the REGEX, and it seems to correctly pick up the numbers I am looking for. I am not sure where to look next.

I tried:

rule "test"
when
    Item Test_Switch changed to ON
then
    var String myString = sendHttpGetRequest("https://www.atmo-auvergnerhonealpes.fr/monair/commune/38185", 5000)
    myString = transform("REGEX", myString, "data-index=\"(.*)\">Aujourd")
    logInfo("TEST", myString)
end

And I get:

2018-12-12 16:06:08.022 [ERROR] [ntime.internal.engine.RuleEngineImpl] - Rule 'test': Illegal character range near index 1521
^<!DOCTYPE html>
<html class="no-js pc " lang="fr">
<head>
  <!--[if IE]><![endif]-->
<meta charset="utf-8" />
<link rel="shortcut icon" href="https://www.atmo-auvergnerhonealpes.fr/sites/all/themes/custom/theme_airra/favicon.ico" type="image/vnd.microsoft.icon" />
<meta property="og:image" content="https://www.atmo-auvergnerhonealpes.fr/sites/all/themes/custom/theme_airra/logo230x200.png" />
<meta name="generator" content="Drupal 7 (http://drupal.org)" />
<link rel="canonical" href="https://www.atmo-auvergnerhonealpes.fr/monair/commune/38185" />
<link rel="shortlink" href="https://www.atmo-auvergnerhonealpes.fr/monair/commune/38185" />
...

Is that page too large to be contained in a string?

One thing to note about how REGEX works in OH is that the expression must match the entire input, which here is the full web page. Try adding .* at the start and end of the expression, something like .*data-index=\"(.*)\"\>Aujourd.*. You might have to double escape: .*data-index=\\"(.*)\\"\\>Aujourd.*.
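
To illustrate the full-match requirement, here is a minimal Java sketch. It assumes the REGEX transformation behaves like Java's Matcher.matches() with DOTALL enabled, which fits the behaviour described in this thread but is not confirmed here. The PAGE string is a made-up stand-in for the real page, and FullMatchDemo and extractFullMatch are illustrative names only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FullMatchDemo {
    // Hypothetical stand-in for the page; the real markup is not shown in this thread.
    static final String PAGE =
        "<html>\n<body>\n<li data-index=\"42\">Aujourd'hui</li>\n</body>\n</html>";

    // Returns capture group 1 only if the pattern matches the WHOLE input, else null.
    // DOTALL lets '.' also match line breaks (an assumption about the transform).
    static String extractFullMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Pattern covers only a fragment of the page: no full match, prints null
        System.out.println(extractFullMatch("data-index=.(.*)..Aujourd", PAGE));
        // Leading and trailing .* let the pattern span the whole page: prints 42
        System.out.println(extractFullMatch(".*data-index=.(.*)..Aujourd.*", PAGE));
    }
}
```

Online testers usually search for the pattern anywhere in the text, which would explain why an expression can look fine there and still return nothing in a full-match setting.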

I get the same error in my rule.
It has nothing to do with the regex itself.
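
A possible explanation, offered as a guess: the error message prints the page itself (prefixed with ^) where the pattern would be expected, which suggests the page content was compiled as the regular expression. The transform action takes its arguments in the order transform(type, function, value), so in the test rule the pattern and the page appear to be swapped. The Java sketch below reproduces the same kind of error under that assumption. The HTML fragment, the class name, and the helper are hypothetical, and the ^...$ wrapping only mirrors what the log output shows.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SwappedArgsDemo {
    // Hypothetical HTML fragment; "[z-a]" happens to be an invalid (reversed) regex range.
    static final String HTML = "<div class=\"x\">[z-a]</div>";
    // The working pattern from this thread.
    static final String REGEX = ".*data-index=.(.*)..Aujourd.*";

    // Tries to compile the candidate as a regex; returns the error description, or null if valid.
    static String compileError(String candidatePattern) {
        try {
            Pattern.compile("^" + candidatePattern + "$", Pattern.DOTALL);
            return null;
        } catch (PatternSyntaxException e) {
            return e.getDescription();
        }
    }

    public static void main(String[] args) {
        // HTML used where the pattern belongs: compilation fails
        System.out.println(compileError(HTML));
        // The real pattern compiles without error: prints null
        System.out.println(compileError(REGEX));
    }
}
```

If that reading is right, transform("REGEX", ".*data-index=.(.*)..Aujourd.*", myString) would be the call to try in the rule, with the pattern second and the page content third.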

Matching the whole page did the trick. But using the double escape resulted in this error:

mismatched input ‘(’ expecting ‘}’

Final working code is:
REGEX(.*data-index=.(.*)..Aujourd.*)

Thank you @rlkoshak and @vzorglub for the help!