No international characters (Unicode) with jsr223 / Jython?

  • Platform information:
    • Hardware: Mac mini
    • OS: 10.13.6
    • Java Runtime Environment:
      openjdk version “1.8.0_252”
      OpenJDK Runtime Environment (Zulu 8.46.0.19-CA-macosx) (build 1.8.0_252-b14)
      OpenJDK 64-Bit Server VM (Zulu 8.46.0.19-CA-macosx) (build 25.252-b14, mixed mode)
    • openHAB version: 2.5.5 release
  • Problem summary:
    Scripts do not compile / cannot contain special characters / log outputs garbage, or nothing.

I have now googled for many hours and found all kinds of more or less puzzling material, but not much related to openHAB / jsr223 and Unicode. Very brief summary:

  • Python 2.7 is not Unicode by default.
  • I was able to make Python 2.7 Unicode-aware in Terminal.app, and in VS Code.

Unicode Tests A (not ideal approach).
Here I tried to mark strings as Unicode in the code, which works for string literals passed directly to the logger, but not inside a function. It’s a very clumsy approach anyway, to ‘mark’ strings with ‘u’…


from core.rules import rule
from core.triggers import when
from org.slf4j import LoggerFactory

gRuleName = "Unicode Tests A"
gRuleLogLevel = 3  # 0=Off 1=Basic 2=Detail 3=All
@rule(gRuleName, description="", tags=[""])
@when("Time cron 0/10 * * * * ?")
def module_cris(event):
    lg("Fired.", 2)

    LoggerFactory.getLogger("jsr223.jython").info(u"Hällo Unicode Wörld, declared unicode string.") # ok
    # --> 2020-05-26 12:00:30.881 [INFO ] [jsr223.jython                       ] - Hällo Unicode Declared String Wörld!

    LoggerFactory.getLogger("jsr223.jython").info("Hällo Unicode Wörld, non-declared string.") # garbage
    # --> 2020-05-26 12:00:20.876 [INFO ] [jsr223.jython                       ] - Hällo Unicode Wörld! 2

    lg("Hällo Unicode Wörld, non-declared string, Function.")
    # --> See Function.

    lg("Ended.", 2)
# Rule End





##### Functions

# Log fast
def lg(msg, logLevel = 3, prefix = gRuleName):
    if gRuleLogLevel < logLevel:
        return

    #LoggerFactory.getLogger("jsr223.jython").info( (prefix + ": " + msg))
    """ -->
    2020-05-26 12:04:50.899 [ERROR] [jsr223.jython.Module cris Python    ] - Traceback (most recent call last):
    File "/Applications/openhab/conf/automation/lib/python/core/log.py", line 51, in wrapper
    return fn(*args, **kwargs)
    File "<script>", line 19, in module_cris
    File "<script>", line 36, in lg
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
    """
    
    # https://docs.python.org/2/howto/unicode.html
    # LoggerFactory.getLogger("jsr223.jython").info( unicode(prefix + ": " + msg))
    # --> Same error message as above.

    # LoggerFactory.getLogger("jsr223.jython").info( unicode((prefix + ": " + msg), errors='ignore'))
    # --> Same error message as above.

    # LoggerFactory.getLogger("jsr223.jython").info( unicode("äää", errors='ignore'))
    # --> produces an empty log line:
    # 2020-05-26 12:22:00.957 [INFO ] [jsr223.jython                       ] - 

    m = sr(msg, ["ä", "ü", "ö"], ["ae", "ue", "oe"]) # Workaround: umlauts can't be logged.
    LoggerFactory.getLogger("jsr223.jython").info(prefix + ": " + m)
    # -->
    # 2020-05-26 12:24:00.939 [INFO ] [jsr223.jython                       ] - Module cris Python: Haello Unicode Woerld, non-declared string, Function.



# String replace
def sr(theString, theSearchList, theReplaceList):
    # Replace each search term with the replacement at the same index.
    for theSearch, theReplace in zip(theSearchList, theReplaceList):
        theString = theString.replace(theSearch, theReplace)
    return theString

Unicode Tests B (better, but still not ideal approach).
Here I’m trying to declare the encoding in the header of the .py file (the declaration must be within the first 2 lines).

#!/path/doesn't/matter/at/all,/it/seems?/
# encoding: utf-8
# --> As long as the "encoding: utf-8" is in the header:
"""
2020-05-26 12:33:59.942 [ERROR] [ipt.internal.ScriptEngineManagerImpl] - Error during evaluation of script 'file:/Applications/openhab/conf/automation/jsr223/python/personal/python_unicode_b.py': org.python.antlr.ParseException: org.python.antlr.ParseException: encoding declaration in Unicode string
"""
# Variations I tried:
#!/Applications/openhab/conf/automation/jsr223
#!/Applications/openhab/conf/automation/jython
#!/Applications/openhab/conf/automation/lib/python/
#!/Applications/openhab/conf/automation/jython/jython-standalone-2.7.0.jar
#!/Applications/openhab/conf/automation/jython/jython-standalone-2.7.0
# coding: utf-8
# ...


from core.rules import rule
from core.triggers import when
from org.slf4j import LoggerFactory
#import sys
#reload(sys)  
#sys.setdefaultencoding('utf8') # no

gRuleName = "Unicode Tests B"
@rule(gRuleName, description="", tags=[""])
@when("Time cron 0/10 * * * * ?")
def module_cris(event):

    LoggerFactory.getLogger("jsr223.jython").info("Hällo Unicode Wörld")

# Rule End


I saw a few other posts where people also had the same problem, without any solution.
It seems this must be configured somewhere deeper, but I do not understand where that could be.
(Original plan: double click an app, add z-wave hardware, write rules. Far far away from that :smile: )

Questions:

  • Does anyone know how to make it work?
  • Could this work out of the box when the jsr223 stuff is installed? How long has Unicode been the standard now? :wink:

(TL;DR) To my knowledge there is no way to have Jython default to using unicode strings: it is based on Python 2.

(long story) Jython is only aware of “vanilla Python 2.7” hence you must explicitly declare a string as unicode by prefixing it with u as in your 1st example.
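To make the distinction concrete, here is a minimal sketch (not taken from the thread) of how the same word behaves as a text string and as UTF-8 bytes. The variable names are only illustrative, and the umlaut is written as an escape so the snippet itself stays pure ASCII.

```python
# A text (unicode) string: one item per character.
text = u"H\u00e4llo"            # "Hallo" with a-umlaut, written as an escape

# The same word as UTF-8 bytes: the umlaut takes two bytes (0xC3 0xA4).
raw = text.encode("utf-8")

print(len(text))   # number of characters
print(len(raw))    # number of bytes

# Decoding the bytes with the matching codec gives the text back.
assert raw.decode("utf-8") == text
```

A plain, unprefixed literal in Python 2.7 is the byte form, which would explain why byte 0xc3 appears in the UnicodeDecodeError shown in the first post.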

According to this Jython 3 repository info, Jython3 development seems to have been inactive since July 2017. So it seems unlikely that there will be a Jython 3 release anytime soon.

Another Python3 Java port is also worth mentioning: graalPython. As far as I can tell it is not supported in the current openHAB scripters.

Thank you @shutterfreak. Python 2 can be set to use Unicode by default, so I guess it’s possible for Jython 2.

As a workaround I now use the decode function:

msg = msg.decode(encoding="utf-8",errors="ignore")
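A small sketch of what this workaround does, using byte literals as stand-ins for msg (the sample values are made up for illustration). One thing to be aware of: errors="ignore" silently drops anything that is not valid UTF-8, so errors="replace" may be easier to debug.

```python
# Stand-in for a Python 2 str that holds UTF-8 bytes.
msg = b"H\xc3\xa4llo W\xc3\xb6rld"

decoded = msg.decode("utf-8")          # strict mode: raises on invalid input
assert decoded == u"H\u00e4llo W\u00f6rld"

# A damaged sequence: 0xC3 without its continuation byte.
damaged = b"H\xc3llo"
ignored = damaged.decode("utf-8", errors="ignore")    # bad byte is dropped
replaced = damaged.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD

print(repr(ignored))
print(repr(replaced))
```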

How about
from __future__ import unicode_literals

Check out https://python-future.org/unicode_literals.html

I just got modules running. Interesting thing regarding Unicode:

My .py rule file is in

/conf/automation/jsr223/python/personal/

My .py module file is in

/conf/automation/lib/python/

If there is a special charäcter in the module file, I get this error:

2020-05-27 10:57:48.033 [ERROR] [ipt.internal.ScriptEngineManagerImpl] - Error during evaluation of script 'file:/Applications/openhab/conf/automation/jsr223/python/personal/python_tests.py': Traceback (most recent call last):
  File "<script>", line 23, in <module>
  File "<string>", line None
SyntaxError: Non-ASCII character in file '/Applications/openhab/conf/automation/lib/python/z_module_01.py', but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

However, in the module file I can fix it by giving the file this header:

#!/Applications/openhab/conf/automation/lib/python
# encoding: utf-8

No complaints at loading, but special characters still do not work (they produce garbage).

But the same thing doesn’t work if I put the header into the rule file.
Header with path to /personal:

#!/Applications/openhab/conf/automation/jsr223/python/personal
# encoding: utf-8

Result:

2020-05-27 11:05:28.024 [ERROR] [ipt.internal.ScriptEngineManagerImpl] - Error during evaluation of script 'file:/Applications/openhab/conf/automation/jsr223/python/personal/python_tests.py': org.python.antlr.ParseException: org.python.antlr.ParseException: encoding declaration in Unicode string

Desperate, header with path to /lib/python:

#!/Applications/openhab/conf/automation/lib/python
# encoding: utf-8

Result, same:

2020-05-27 11:08:18.028 [ERROR] [ipt.internal.ScriptEngineManagerImpl] - Error during evaluation of script 'file:/Applications/openhab/conf/automation/jsr223/python/personal/python_tests.py': org.python.antlr.ParseException: org.python.antlr.ParseException: encoding declaration in Unicode string

Does that make sense? :crazy_face:

OK, I just realized that the path doesn’t matter: the module file works (loads without error) with just this as the header:

# encoding: utf-8

But not the rule file.

To summarize:

  • if py files contain special characters:
    • The module file loads only if it has the utf-8 header.
      • The path definition does not matter, it seems.
    • The rule file loads only if it has no utf-8 header.
    • In both files, special treatment is necessary to avoid garbage characters.
  • In Terminal.app (the system also has Python 2.7), a one-time setting was necessary to avoid garbage characters. Something with “export xxx”, I don’t recall exactly.
    :face_with_monocle:

Thank you, I tried, but same error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 12: ordinal not in range(128)

It looks like we have to wait until Python 3 makes it into OH. Until then I can live with

decode(encoding="utf-8",errors="ignore")

Can you share full code?

Note: that import must be stated in all files

I tried with a single py file, without my module.

#from __future__ import unicode_literals


#!/Applicati  ons/openhab/conf/automation/lib/python
# enco ding: utf-8

from core.rules import rule
from core.triggers import when
from org.slf4j import LoggerFactory

gRuleName = "Unicode Tests"
gRuleLogLevel = 3  # 0=Off 1=Basic 2=Detail 3=All

@rule(gRuleName)
@when("Time cron 0/10 * * * * ?")
def module_cris(event):
    lg("Fired.", 2)

    lg(["äh, sind das hier etwa ", 6, " eier im karton?"])

    lg("Ended.", 2)



##### Functions
# Log. msg may be a list.
def lg(msg, logLevel = 3, prefix = gRuleName):
    if gRuleLogLevel < logLevel:
        return
    if type(msg) == list:
        for i in range(len(msg)):
            if type(msg[i]) != str:
                msg[i] = str(msg[i])
        msg = "".join(msg)

    msg = msg.decode(encoding="utf-8",errors="ignore")
    LoggerFactory.getLogger("jsr223.jython").info(prefix + ": " + msg)
  • Error “…can’t decode byte…”, if
    • the first line is active. Or
    • the line “msg.decode…” is inactive.

Side note:
In macOS Terminal.app, the command to make Python 2.7 work without garbage characters was:

export PYTHONIOENCODING=utf-8

At least I believe - I may have tried others also.

For ME things are working well enough now. My py files load successfully in openHAB, special characters included, and OH logs them without garbage when I use the decode command.

I guess this is a thing that could be set for OH-Jython 2.7 (like for the macOS-pre-installed Python 2.7 in Terminal.app), and if so, that would be a good thing, because it would save other people many hours of trying.

The use of special characters and unicode is one of the trickiest parts of using scripted automation… and programming in general! I have a commit to push where I have converted all of the Jython helper libraries to use unicode strings. I’m not sure if this might help in your case, since I’m not exactly sure what your issue is. It seems like you are just not happy about having to use unicode strings when they have special characters in them. If there is an actual issue, could you please provide a few lines of script that recreates it? There’s no need for rules and functions. And please don’t put your logs inline with the code… it makes it really hard to read. Just leave them separate.

The default encoding for Python 2.7, which is what Jython 2.7.x implements, is ASCII. This means that, in Jython, you need to specify whether a string contains unicode or not. For example…

test_string_2 = u"The high temperature is 88 °F today"
test_string_3 = unicode("The high temperature is 88 °F today", "utf-8")  # name the codec; the ASCII default cannot decode °

Starting with Python 3.0, the default encoding is UTF-8.

As for the source encoding of the file, PEP 263 has some more information, but this will not work in scripts used in OH, which executes the Jython scripts through javax.script.ScriptEngine.eval(). My understanding is that we can’t specify the encoding for the parser in the script, because it has already been set. This is what the error is telling you. As you’ve noticed, this does not cause an error in modules, but they are loaded differently than scripts. This all gets very complicated and it’s really not an area that I want to dig into right now, especially without a clear understanding of the issue that you are reporting.
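The idea that an encoding declaration only means something while the source is still bytes can be tried in plain Python. This is a rough analogy using the built-in compile(), not openHAB's or Jython's actual loading code, and the two sample sources are invented for the demonstration.

```python
# Case 1: the source arrives as BYTES. The parser has to decode it,
# so it honours the declaration (Latin-1 here, where byte 0xE4 is a-umlaut).
byte_source = b"# coding: latin-1\ns = u'\xe4'\n"
ns_bytes = {}
exec(compile(byte_source, "<bytes>", "exec"), ns_bytes)

# Case 2: the source arrives as already-decoded TEXT.
# Python 2 rejects the declaration with a SyntaxError;
# Python 3 accepts the text and ignores the declaration.
text_source = u"# coding: latin-1\ns = u'\u00e4'\n"
try:
    ns_text = {}
    exec(compile(text_source, "<text>", "exec"), ns_text)
    text_result = ns_text["s"]
except SyntaxError as err:
    text_result = "SyntaxError: %s" % err

print(repr(ns_bytes["s"]))
print(repr(text_result))
```

Under CPython 2.7 the second case fails with "encoding declaration in Unicode string", the same message as in the rule-file errors above, which fits the explanation that the script text has already been decoded before the parser sees it.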

On a side note… this PR for OH 3.0 specifies the Charset that Java uses to read the scripts as UTF-8. In testing, I have not seen any difference in behavior. I think this may be due to the default already being UTF-8, so the PR really doesn’t change anything.

No, a shebang is not needed, since we are not running the script from the command line as a standalone executable. The shebang just tells the shell what interpreter to use. OH does not need this.

No commits does not mean that no work has been done!

I was hopeful for GraalVM, but from what I have seen, it’s been very slow to get off the ground. The only functionality that has really gotten attention is graal-js, which is the closest thing to a replacement for Nashorn (removed in JDK15). If Jython3 is being developed, it will IMO be the better option… but things change.

So if I write umlauts in the py rule files, like “öüä”, they become garbage in the log output. The special treatment is to use:

Yes, but I could set my “mac OS Python” 2.7 to use Unicode,

and then the garbage ended, without the use of “u” or unicode().
print("Hä?") in the Terminal has worked just fine since then.

Set by what?

Regarding https://github.com/openhab/openhab-core/pull/1484
This is only for OH3, right? Otherwise I would try the latest 2.5.x snapshot…

@chris You have commented out the import statement so it has no effect.

Working example:

from __future__ import unicode_literals

import logging 
from org.slf4j import LoggerFactory


LoggerFactory.getLogger("jsr223.jython").warn('test ascii')
LoggerFactory.getLogger("jsr223.jython").warn('test unicode äh')

# output
#18:35:03.228 [WARN ] [jsr223.jython                        ] - test ascii
#18:35:03.229 [WARN ] [jsr223.jython                        ] - test unicode äh

In addition, your helper function mixes up the str type (used for bytes) and the unicode type (used for unicode strings), which causes issues. Calling str(msg[i]) is bound to fail when msg[i] is of type unicode and contains non-ASCII characters.

In multilingual python2, str is best used for sequences of bytes (not necessarily text! not necessarily ASCII!), and the unicode type for unicode strings (possibly containing characters outside ASCII, i.e. outside a…z etc.). The unicode type is not suitable for general byte strings (not all byte sequences represent text).

In python2, all string literals (e.g. "foobar") are interpreted by default as str type, unless you change this behaviour with the unicode_literals import, in which case they are treated as unicode type.

Naturally you can convert unicode strings (unicode) to sequences of bytes (str), given some character encoding, but that’s not really needed in your use case. All printing and logging commands work just fine with unicode strings directly. By default, str(some_unicode_object) tries to convert the unicode string to a string of bytes using the default encoding, ASCII. str(some_unicode_object) is the same as calling some_unicode_object.encode('ascii').
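The ASCII conversion described above can be checked in isolation. A minimal sketch that relies only on the standard codecs; the sample string is arbitrary.

```python
text = u"\u00e4h"   # a-umlaut followed by "h", as a text string

try:
    text.encode("ascii")       # what str(text) attempts in Python 2
    outcome = "encoded"
except UnicodeEncodeError as err:
    outcome = "failed: %s" % err.reason

print(outcome)

# Encoding with a codec that knows the character works.
assert text.encode("utf-8") == b"\xc3\xa4h"
```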

Here’s a correct version (note that you no longer need str.decode function):

# Use unicode literals in this file.
# This means that all string literals, e.g. "test ascii äö", are interpreted
# as u"test ascii äö" (without the future import they would be str).

from __future__ import unicode_literals

import logging 
from org.slf4j import LoggerFactory
gRuleName = "Unicode Tests"
gRuleLogLevel = 3  # 0=Off 1=Basic 2=Detail 3=All


LoggerFactory.getLogger("jsr223.jython").warn('test ascii')
LoggerFactory.getLogger("jsr223.jython").warn('test unicode äh')

def lg(msg, logLevel = 3, prefix = gRuleName):
    if gRuleLogLevel < logLevel:
        return
    if type(msg) == list:
        for i in range(len(msg)):
            # If we do not have unicode string (e.g. integer or float),
            # convert it to unicode string
            if type(msg[i]) != unicode:
                msg[i] = unicode(msg[i])
        msg = "".join(msg)
    LoggerFactory.getLogger("jsr223.jython").info(prefix + ": " + msg)
    
lg("Fired.", 2) 

lg(["äh, sind das hier etwa ", 6, " eier im karton?"])

lg("Ended.", 2)

Output:

18:39:37.009 [WARN ] [jsr223.jython                        ] - test ascii
18:39:37.011 [WARN ] [jsr223.jython                        ] - test unicode äh
18:39:37.012 [INFO ] [jsr223.jython                        ] - Unicode Tests: Fired.
18:39:37.013 [INFO ] [jsr223.jython                        ] - Unicode Tests: äh, sind das hier etwa 6 eier im karton?
18:39:37.014 [INFO ] [jsr223.jython                        ] - Unicode Tests: Ended.

P.S. You should save the .py files with UTF-8 encoding. This is fortunately quite often the default in text editors. Note that PEP 3120, which makes UTF-8 the default source encoding, applies to Python 3; under Python 2 rules (PEP 263 from 2001) the default is still ASCII, which is why a module file with non-ASCII characters needs the # encoding: utf-8 line, as the SyntaxError earlier in this thread shows. Rule files appear to be handled differently because openHAB reads them itself before handing the text to Jython.
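The garbage characters reported earlier in the thread look like the typical result of UTF-8 bytes being read one byte per character. That reading is an assumption about what happens between Jython and the logger, not something confirmed in this thread; the sketch below only shows the general effect.

```python
text = u"H\u00e4llo"                 # "Hallo" with a-umlaut
utf8_bytes = text.encode("utf-8")    # what a UTF-8 file contains

# Reading each byte as its own character (Latin-1 style) turns the
# two-byte umlaut into two unrelated characters.
garbled = utf8_bytes.decode("latin-1")
print(repr(garbled))

# Decoding with the codec the file was saved in restores the text.
assert utf8_bytes.decode("utf-8") == text
```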

In VS Code, you can see the encoding in the status bar at the bottom of the window.


“active” meant “not commented out”.
Whatever I did wrong, I did something wrong…
I even see eggs now - every logfile should have some. :wink:

2020-05-28 19:20:20.339 [INFO ] [jsr223.jython                       ] - Unicode Tests: 🥚🥚🥚🥚🥚🥚 äh, sind das hier etwa 6 eier im karton? 😃

The files in VSC are UTF-8. I think for maaany years there has been no app on OS X that does not save in UTF-8. I believe the Unicode transition was 15(?) years ago, which is why I was (and am) surprised that it doesn’t work in OH-Jython out of the box.

To summarize, for everyone with the same problem:

  • Put this at the top of every .py rule file:
from __future__ import unicode_literals

Rule files are located at:

.../YourOpenHabFolder/conf/automation/jsr223/python/personal/
  • Put this at the top of every .py module file:
# encoding: utf-8
from __future__ import unicode_literals

Module files are located at:

.../YourOpenHabFolder/conf/automation/lib/python/

Thanks @ssalonen, also for the unicode() tip. :slight_smile:

My assumption was correct then… there is no issue, you just refuse to specify unicode strings when using special characters. If you’d just use u"öüä", everything would be fine.

Using this will cause more problems than it solves, since a person who refuses to specify unicode strings would be equally opposed to specifying byte string literals. Not to mention the other issues that this could cause. It would be best to stay with the default implementation so that all of the documentation here, in the helper libraries, and on the Internet will still be applicable. It is definitely not something that would go into the helper libraries!

The future imports are there for easing the transition to python3. Actually, this is one of many imports that are offered by the python standard library for the user.

In this case, with suitable imports, you can write python3-ready code in python2. Once migrated to python3, you can drop the future imports altogether, and ideally no changes are needed elsewhere. In addition, you can take online python3 tutorials and examples and adapt them more easily to the jython environment. I would not be surprised if python2 references vanish over time; python3 is already the new default.

In general, many think that new code should be python3-only, or python2 code that is forward-compatible with python3. This has been the case for many years now, as the python3 transition has been picking up speed.

Naturally some have to stick with the old version for resource reasons (no resources to invest in the migration) or, as in this case of jython, for technical reasons.

For those who want to write python3-ready code, a good resource to check is python-future.org. The topic of unicode_literals remains divided. As you pointed out, there are also downsides involved.

https://python-future.org/compatible_idioms.html#unicode-text-string-literals
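As one possible illustration of code that runs unchanged on python2 and python3, here is a sketch of a small conversion helper. The names text_type and to_text are hypothetical and not part of the helper libraries or of python-future; the sample list mirrors the one used earlier in the thread.

```python
import sys

# `unicode` only exists in Python 2; on Python 3 the text type is `str`.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821


def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding byte strings if needed."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):   # bytes is an alias of str in Python 2
        return value.decode(encoding)
    return text_type(value)        # numbers and other objects


parts = [b"\xc3\xa4h, sind das hier etwa ", 6, u" eier im karton?"]
line = u"".join(to_text(p) for p in parts)
print(repr(line))
```

The helper converts everything to text once, at the boundary, so the rest of the code never mixes byte strings and text strings.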

No matter which way you choose, I think you need to understand the language you work with. In the case of python, unicode and strings are one of the most common stumbling blocks for new developers, so it’s quite understandable to get confused. Nevertheless, in my opinion the difference is crucial to understand to write correct code.