[SOLVED] Internet page scraping with REGEX

Fa_Bien · December 12, 2018, 6:18am

Trying to get the air quality of my city from a webpage using the http binding and REGEX transformation.

I have a REGEX transform that works on regex tester pages and seems compatible with openHAB (no \ or " in the transform or result).
My first REGEX was data-index="(.*)">Aujourd, but this caused problems with displayed items, so I replaced the " and > with . wildcards.

Finally, my item is the following:

String Air_quality_today "Today [%s]" {http="<[https://www.atmo-auvergnerhonealpes.fr/monair/commune/38185:3600000:REGEX(data-index=.(.*)..Aujourd)]"}

But it doesn’t seem to access the webpage: no number is displayed on the sitemap. I also tried replacing (.*) with (\d*.\d), without success.
Could it be because the page is served over HTTPS?
Any tips welcome

lfs_alp5 · December 12, 2018, 6:45am

You should try enabling debug logging for the http binding: https://www.openhab.org/docs/administration/logging.html

vzorglub · December 12, 2018, 9:31am

Bonjour,
Try that one…

REGEX(data-index=\"(.*)\">Aujourd)

Fa_Bien · December 12, 2018, 12:29pm

Thanks for your inputs @lfs_alp5 and @vzorglub

Logging only shows that the item changed from NULL to null… No error appears.
I am not having any more luck with the different ways of writing the REGEX. And using square brackets instead of parentheses leads to “Error REGEX does not follow the expected pattern”

lfs_alp5 · December 12, 2018, 1:30pm

Hm, what if you try:
REGEX(data-index=\"(.*?)\".*?Aujourd.*?)]

Fa_Bien · December 12, 2018, 3:46pm

No better results…

16:45:43.491 [INFO ] [smarthome.event.ItemStateChangedEvent] - Air_quality_today changed from NULL to null

I did use regex101 to check the REGEX, and it seems to correctly pick up the numbers I am looking for… Not sure where to look.

vzorglub · December 12, 2018, 4:09pm

I tried:

rule "test"
when
    Item Test_Switch changed to ON
then
    var String myString = sendHttpGetRequest("https://www.atmo-auvergnerhonealpes.fr/monair/commune/38185", 5000)
    myString = transform("REGEX", myString, "data-index=\"(.*)\">Aujourd")
    logInfo("TEST", myString)
end

And I get:

2018-12-12 16:06:08.022 [ERROR] [ntime.internal.engine.RuleEngineImpl] - Rule 'test': Illegal character range near index 1521
^<!DOCTYPE html>
<html class="no-js pc " lang="fr">
<head>
  <!--[if IE]><![endif]-->
<meta charset="utf-8" />
<link rel="shortcut icon" href="https://www.atmo-auvergnerhonealpes.fr/sites/all/themes/custom/theme_airra/favicon.ico" type="image/vnd.microsoft.icon" />
<meta property="og:image" content="https://www.atmo-auvergnerhonealpes.fr/sites/all/themes/custom/theme_airra/logo230x200.png" />
<meta name="generator" content="Drupal 7 (http://drupal.org)" />
<link rel="canonical" href="https://www.atmo-auvergnerhonealpes.fr/monair/commune/38185" />
<link rel="shortlink" href="https://www.atmo-auvergnerhonealpes.fr/monair/commune/38185" />
...

Is that page too large to be contained in a string?
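A possible reading of that error: the message echoes the page itself, prefixed with ^, which suggests the page was compiled as the pattern rather than used as the input. In the rules DSL, transform is called as transform(type, function, value), so the pattern goes second and the text to search goes third. The Java sketch below only imitates that behaviour: regexTransform is a hypothetical stand-in (not the openHAB API), the page fragment is made up, and "[z-a]" stands in for whatever bracket text in the real page is not a valid character range.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class ArgumentOrderDemo {
    // Hypothetical page fragment, not the real markup of the site.
    static final String PAGE =
        "<!DOCTYPE html>\n<script>var x = \"[z-a]\";</script>\n"
        + "<li data-index=\"42\">Aujourd'hui</li>\n";

    static final String REGEX = ".*data-index=.(.*)..Aujourd.*";

    // Assumed behaviour of the REGEX transformation: anchor the pattern,
    // let '.' match newlines, and return the first capture group.
    static String regexTransform(String regex, String input) {
        Matcher m = Pattern.compile("^" + regex + "$", Pattern.DOTALL)
                           .matcher(input.trim());
        return m.matches() ? m.group(1) : null;
    }

    // Runs the transform and reports either the result or the compile error.
    static String describe(String regex, String input) {
        try {
            return "result: " + regexTransform(regex, input);
        } catch (PatternSyntaxException e) {
            return "error: " + e.getDescription();
        }
    }

    public static void main(String[] args) {
        // Swapped order: the page is compiled as a regex and fails.
        System.out.println(describe(PAGE, REGEX));
        // Pattern first, page second: the capture group is returned.
        System.out.println(describe(REGEX, PAGE));
    }
}
```

If this reading is right, the size of the page is not the issue; in the rule above, the equivalent change would be to pass the pattern as the second argument and myString as the third.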

rlkoshak · December 12, 2018, 4:30pm

One thing to note about how REGEX works in OH is that it must match the full webpage. Try adding .* at the start and end of the expression, something like:
.*data-index=\"(.*)\"\>Aujourd.*
You might have to double escape:
.*data-index=\\"(.*)\\"\\>Aujourd.*
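The full-match requirement can be illustrated outside openHAB with plain Java regex. This is a minimal sketch, assuming the transformation behaves like Matcher.matches() on an anchored pattern with '.' also matching newlines; regexTransform is a hypothetical helper, and the page content (including the values 42 and 57) is invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FullMatchDemo {
    // Hypothetical stand-in for the fetched page; the real markup may differ.
    static final String PAGE =
        "<html>\n<body>\n"
        + "<li data-index=\"42\">Aujourd'hui</li>\n"
        + "<li data-index=\"57\">Demain</li>\n"
        + "</body>\n</html>";

    // Assumed behaviour: the whole input must match the pattern,
    // and the first capture group is the transformation result.
    static String regexTransform(String regex, String input) {
        Matcher m = Pattern.compile("^" + regex + "$", Pattern.DOTALL)
                           .matcher(input.trim());
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Fragment only: fine in a regex tester that searches,
        // but it does not cover the whole page, so there is no result.
        System.out.println(regexTransform("data-index=.(.*)..Aujourd", PAGE));      // prints null
        // Wrapped in .* on both sides: the whole page matches.
        System.out.println(regexTransform(".*data-index=.(.*)..Aujourd.*", PAGE));  // prints 42
    }
}
```

This would also explain why a pattern can look correct on a regex tester, which searches for the fragment anywhere in the text, and still yield nothing in the item.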

vzorglub · December 12, 2018, 5:02pm

Same error in my rule
Nothing to do with regex

Fa_Bien · December 12, 2018, 5:27pm

Matching the whole page did the trick. But using the double escape resulted in the error:

mismatched input ‘(’ expecting ‘}’

Final working code is:
REGEX(.*data-index=.(.*)..Aujourd.*)

Thank you @rlkoshak and @vzorglub for the help!