HTTP Binding problem with umlauts

I want to read a date from a website that I fetch via the HTTP binding. The problem is that the binding does not interpret the encoding correctly, so the umlauts are not displayed properly. The Basic UI just shows a ? instead of the “ä” in März. How can I fix this?

Website code:

<tr><td width="22%" height="47"><font color="Gray"><b><nobr>Datum</nobr></b></font></td>
<td align="center" width="7%" height="47"><img border="0" src="cal.gif" width="51" height="40"></td>
<td align="center" width="52%" height="47" colspan="7"><b><font size="4">Mittwoch, 7. März 2018</font></b></td></tr>
<tr><td width="22%" height="46"><font color="Gray"><b><nobr>Zeit</nobr></b></font></td>
<td align="center" width="7%" height="46"><img border="0" src="clock.gif" width="41" height="40"></td>
<td align="center" width="52%" height="47" colspan="7"><b><font size="5">19:30</font></b></td></tr>
<tr><td width="22%" height="46"><font color="Gray"><b><nobr>Temperatur Innen</nobr></b></font></td>

Item definition:

String Temp_Date    "Datum [%s]"   <calendar>   { http="<[datenlogger:3000:REGEX((?s).*Datum.*([1-3 ][0-9]. [A-zöäü]+ [0-9]{4})..font.*)]" }

What encoding is specified in the returned HTML and its HTTP headers?

This is the charset provided via the HTML-meta-tag:

<!doctype html public "-//w3c//dtd html 3.2//en">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">

ISO-8859-1 supports umlauts.
In the HTTP headers I cannot find anything about the charset:

Accept-Ranges:bytes
Age:0
Connection:Keep-Alive
Content-Length:20318
Content-Type:text/html
Date:Thu, 08 Mar 2018 07:03:15 GMT
ETag:"4f5e-566e14141352b"
Last-Modified:Thu, 08 Mar 2018 07:02:12 GMT
Proxy-Connection:Keep-Alive
Server:Apache/2.4.29 (Unix)
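
For illustration, here is a minimal Java sketch of what is probably going on. It assumes the server sends ISO-8859-1 bytes and that they end up being decoded as UTF-8 (for example, because the header names no charset and UTF-8 is the platform default). CharsetGuessDemo and decode are hypothetical names, not part of the HTTP binding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetGuessDemo {
    // Decode raw response bytes with an explicitly chosen charset.
    static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // The month name as ISO-8859-1 bytes: a-umlaut is the single byte 0xE4.
        byte[] body = {'M', (byte) 0xE4, 'r', 'z'};

        // Right charset: the umlaut survives as U+00E4.
        System.out.println(decode(body, StandardCharsets.ISO_8859_1));

        // Wrong charset: 0xE4 followed by 'r' is not valid UTF-8,
        // so the decoder substitutes U+FFFD (the replacement character).
        System.out.println(decode(body, StandardCharsets.UTF_8));
    }
}
```

How the second line is displayed depends on the console and the UI. A client that cannot render U+FFFD may well show a plain ? instead.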

You could use the XSLT transformation to change the encoding and also retrieve the value.

Something like this, which only changes the encoding.
Adjust the select expression inside the result tag to the right XPath to get your date.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
    <xsl:output encoding="UTF-8" />
    <xsl:template match="/">
        <result>
            <xsl:value-of select="." />
        </result>
    </xsl:template>
</xsl:stylesheet>

https://docs.openhab.org/addons/transformations/xslt/readme.html

So I tried the following:

item:

String Temp_Date  "Datum [%s]"  <calendar>  { http="<[server:3000:XSLT(temp-date.xsl)]" }

xslt:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >
    <xsl:output encoding="UTF-8" />
    <xsl:template match="/">
        <result>
            <xsl:value-of select="/html/body/table[1]/tbody/tr[3]/td[3]/b/font" />
        </result>
    </xsl:template>
</xsl:stylesheet>

But I get the following error:

org.openhab.core.transform.TransformationException: transformation throws exception
	at org.openhab.core.transform.TransformationHelper$TransformationServiceDelegate.transform(TransformationHelper.java:67) [227:org.openhab.core.compat1x:2.2.0]
	at org.openhab.binding.http.internal.HttpBinding.execute(HttpBinding.java:194) [221:org.openhab.binding.http:1.11.0]
	at org.openhab.core.binding.AbstractActiveBinding$BindingActiveService.execute(AbstractActiveBinding.java:144) [227:org.openhab.core.compat1x:2.2.0]
	at org.openhab.core.service.AbstractActiveService$RefreshThread.run(AbstractActiveService.java:166) [227:org.openhab.core.compat1x:2.2.0]
2018-03-09 20:43:13.890 [ERROR] [t.internal.XsltTransformationService] - transformation throws exception
javax.xml.transform.TransformerException: javax.xml.transform.TransformerException: com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Markup im Dokument vor dem Root-Element muss ordnungsgemäß formatiert sein.
	at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:740) [?:?]
	at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:343) [?:?]
	at org.eclipse.smarthome.transform.xslt.internal.XsltTransformationService.transform(XsltTransformationService.java:83) [213:org.eclipse.smarthome.transform.xslt:0.10.0.b1]
	at org.openhab.core.transform.TransformationHelper$TransformationServiceDelegate.transform(TransformationHelper.java:65) [227:org.openhab.core.compat1x:2.2.0]
	at org.openhab.binding.http.internal.HttpBinding.execute(HttpBinding.java:194) [221:org.openhab.binding.http:1.11.0]
	at org.openhab.core.binding.AbstractActiveBinding$BindingActiveService.execute(AbstractActiveBinding.java:144) [227:org.openhab.core.compat1x:2.2.0]
	at org.openhab.core.service.AbstractActiveService$RefreshThread.run(AbstractActiveService.java:166) [227:org.openhab.core.compat1x:2.2.0]
Caused by: javax.xml.transform.TransformerException: com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Markup im Dokument vor dem Root-Element muss ordnungsgemäß formatiert sein.
	at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.getDOM(TransformerImpl.java:570) ~[?:?]
	at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:730) ~[?:?]
	... 6 more
Caused by: com.sun.org.apache.xml.internal.utils.WrappedRuntimeException: Markup im Dokument vor dem Root-Element muss ordnungsgemäß formatiert sein.
	at com.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager.getDTM(XSLTCDTMManager.java:427) ~[?:?]
	at com.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager.getDTM(XSLTCDTMManager.java:215) ~[?:?]
	at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.getDOM(TransformerImpl.java:548) ~[?:?]
	at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:730) ~[?:?]
	... 6 more
2018-03-09 20:43:13.893 [WARN ] [ab.binding.http.internal.HttpBinding] - Transformation 'XSLT(temp-date.xsl)' threw an exception. [response=<!doctype html public "-//w3c//dtd html 3.2//en">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">

...

It seems that it does not work with an HTML page.

Of course not. XSLT requires XML.
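
To see why the transformation gives up so early, here is a rough Java sketch that feeds a string to the JDK’s standard XML parser and reports whether it is well-formed. WellFormedCheck and isWellFormedXml are hypothetical names for this illustration. The XSLT service itself goes through a Transformer, but it hits the same kind of parse error.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Returns true if the text parses as XML, false if the parser rejects it.
    static boolean isWellFormedXml(String text) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed XML
        } catch (Exception e) {
            throw new IllegalStateException("XML parser unavailable or I/O error", e);
        }
    }

    public static void main(String[] args) {
        // Start of the page as shown in the log: lowercase doctype, unclosed meta tag.
        String html = "<!doctype html public \"-//w3c//dtd html 3.2//en\">\n"
            + "<html>\n<head>\n"
            + "<meta http-equiv=\"content-type\" content=\"text/html; charset=iso-8859-1\">\n"
            + "</head></html>";
        System.out.println(isWellFormedXml(html));
        System.out.println(isWellFormedXml("<result>ok</result>"));
    }
}
```

XML only accepts an uppercase DOCTYPE keyword, so the parser already stops at the first line of the page. That matches the message in the log about markup before the root element.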

Ok, so how can I solve my problem?

You could write a rule to convert the string’s encoding.

Sorry, my bad:

  1. you would need the whole page
  2. not all HTML is well-formed.

Maybe something like this works for you.

import java.nio.charset.Charset
import java.nio.ByteBuffer
import java.nio.CharBuffer

rule "Test Encoding"
when
    Item Test changed
then
    var String input = new String("Bär".getBytes, "ISO-8859-1")

    var Charset utf8charset = Charset.forName("UTF-8");
    var ByteBuffer inputBuffer = ByteBuffer.wrap(input.getBytes("ISO-8859-1") );
    var CharBuffer fixed = utf8charset.decode(inputBuffer)

    logInfo("Test Encoding", input +" "+ fixed.toString )
end

That looks good, but do you know how to get the state of the item into the “input” string?

That doesn’t work:

var String input = new String(Temp_Date.state, "ISO-8859-1")
var String input = new String(Temp_Date.state.getBytes, "ISO-8859-1")

    var String input = Temp_Date.state
    var byte[] bytes = input.getBytes("ISO-8859-1")
    var String converted = new String(bytes, "UTF-8")
    sendCommand(FinalDate, converted)

With that rule, I get the following error:

2018-03-10 22:50:42.409 [ERROR] [ntime.internal.engine.RuleEngineImpl] - Rule 'Test Encoding': An error occurred during the script execution: Could not invoke method: java.lang.String.getBytes(java.lang.String) on instance: 10. M�rz 2018

Try changing the first line to:

var input = Temp_Date.state.toString

So that rule just changed the unknown character to a ?

11. M�rz 2018 11. M?rz 2018
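
The ? is most likely produced by the getBytes step. As a sketch in plain Java (ReencodeDemo and reencode are hypothetical names; the logic mirrors the two conversion lines of the rule), assuming the item state already contains U+FFFD in place of the umlaut:

```java
import java.nio.charset.StandardCharsets;

class ReencodeDemo {
    // Same two steps as the rule: encode as ISO-8859-1, then decode as UTF-8.
    static String reencode(String input) {
        // U+FFFD has no ISO-8859-1 representation, so getBytes substitutes
        // the charset's default replacement byte, which is '?' (0x3F).
        byte[] bytes = input.getBytes(StandardCharsets.ISO_8859_1);
        // The remaining bytes are plain ASCII, so decoding them as UTF-8 changes nothing.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\uFFFD" stands for the replacement character already stored in the item.
        System.out.println(reencode("11. M\uFFFDrz 2018"));
    }
}
```

So the rule does not recover anything. It only swaps one placeholder for another.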

So I added the following line to the rule:

Temp_Date.postUpdate(converted.replace('?','ä'))

and now I get this output:

11. M�rz 2018 11. März 2018

  1. Question:

var String input = new String(Temp_Date.state.toString.getBytes, "ISO-8859-1")

  1. Which rule did you take?

  1. So with this I get the following output for input:

11. M�rz 2018

When I use your rule:

import java.nio.charset.Charset
import java.nio.ByteBuffer
import java.nio.CharBuffer

rule "Change encoding of weatherstation date"
when
    Item Temp_Date received update
then
    var String input = new String(Temp_Date.state.toString.getBytes, "ISO-8859-1")
    var Charset utf8charset = Charset.forName("UTF-8");
    var ByteBuffer inputBuffer = ByteBuffer.wrap(input.getBytes("ISO-8859-1") );
    var CharBuffer fixed = utf8charset.decode(inputBuffer)
    logInfo("Test Encoding", input +" "+ fixed )
end

I get this output:

11. M�rz 2018 11. M�rz 2018

With that rule:

rule "Change encoding of weatherstation date"
when
    Item Temp_Date_str received update
then
    var input = Temp_Date_str.state.toString
    var byte[] bytes = input.getBytes("ISO-8859-1")
    var String converted = new String(bytes, "UTF-8")
    Temp_Date.postUpdate(converted.replace('?','ä'))
    logInfo("Test Encoding", input +" "+ Temp_Date.state )
end

I get this output:

11. M�rz 2018 11. März 2018

So after thinking about this, I think it is a bug.
The HTTP binding does not take the encoding into account and puts the result straight into a Java string, which is UTF-16 internally. At that point the encoding information is already lost and cannot be restored.

The “�” is the replacement character for every byte sequence that cannot be decoded, which means there is no valid way to restore the original text: “äüöß” and so on all get replaced with the same character.
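
A small Java sketch of that information loss, assuming ISO-8859-1 bytes that were decoded as UTF-8. LossyDecode, misdecode and guessFix are hypothetical names, and “Grün” is just a made-up second sample value.

```java
import java.nio.charset.StandardCharsets;

class LossyDecode {
    // Simulates the faulty path: ISO-8859-1 bytes decoded as if they were UTF-8.
    static String misdecode(String original) {
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }

    // The workaround from this thread, reduced to its core:
    // assume every replacement character used to be an a-umlaut.
    static String guessFix(String garbled) {
        return garbled.replace('\uFFFD', '\u00e4');
    }

    public static void main(String[] args) {
        // a-umlaut, o-umlaut and u-umlaut all collapse into the same U+FFFD.
        System.out.println(misdecode("\u00e4").equals(misdecode("\u00f6")));
        System.out.println(misdecode("\u00f6").equals(misdecode("\u00fc")));

        // The guess happens to be right for the month name ...
        System.out.println(guessFix(misdecode("M\u00e4rz")));
        // ... but turns a u-umlaut into an a-umlaut.
        System.out.println(guessFix(misdecode("Gr\u00fcn")));
    }
}
```

For the date string the guess happens to work, but it is not a general fix.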

Maybe file a bug report!

OK, I’ve now opened an issue on GitHub:

I updated this ticket with my findings. Please review at your convenience.

Maybe the page just claims to be ISO-8859-1 but actually is not?
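
One way to check that guess, sketched in Java. It assumes you can get at the raw response bytes, for example by saving the page to a file outside openHAB. Utf8Check and isValidUtf8 are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Strict decode: report malformed input instead of silently replacing it.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Month name with a-umlaut encoded as UTF-8 (0xC3 0xA4) ...
        byte[] asUtf8 = {'M', (byte) 0xC3, (byte) 0xA4, 'r', 'z'};
        // ... and as ISO-8859-1 (0xE4).
        byte[] asLatin1 = {'M', (byte) 0xE4, 'r', 'z'};
        System.out.println(isValidUtf8(asUtf8));
        System.out.println(isValidUtf8(asLatin1));
    }
}
```

If the bytes around the umlaut fail this check, the page is not UTF-8, which would at least be consistent with the ISO-8859-1 declaration. Pure ASCII text passes either way, so the check only says something about the non-ASCII bytes.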