Kotlin prints non-English characters as question marks - kotlin

I am trying to print Hebrew characters from a Kotlin program (running on the console).
All the Hebrew characters are being output as question marks.
I created the following simple test.kts script file for testing:
println("שלום מקוטלין")
// Try to print a simple non-Hebrew character too
println("\u0394") // Greek Delta
The file is properly saved in UTF-8 format.
It prints:
???? ???????
?
I tried running it in Command Prompt, PowerShell (both in its native window and in Windows Terminal), and Git Bash, all of which give the same result. I also tried redirecting the output to a file to rule out display issues in the shells.
To make sure the problem isn't the console itself, I also made simple test.bat, test.ps1, and test.sh files with the following content:
echo "שלום מקוטלין"
All three shells correctly displayed the Hebrew text here, indicating that the problem is in Kotlin's output, not in the shell display. (Though PowerShell requires the file to be saved "UTF-8 with BOM" to display properly, this can't be the issue with Kotlin since Kotlin won't even run a script that is saved with a BOM.)
As far as I can tell, Kotlin should support UTF-8 output by default with no configuration needed.
How can I get the proper output?
Updates:
If I write the output to a file using java.io.File("out.txt").writeText("שלום מקוטלין"), it works properly.
Also, if I open a new PrintStream using val out = java.io.PrintStream(System.out, true, "UTF-8") and then write to it using out.println("שלום מקוטלין"), that works properly too.
Only writing to the console with println is broken.
System info:
Windows 10 2004 (Build 19041.450)
Kotlin 1.4.0 (downloaded from GitHub Releases)
Tested with JAVA_HOME pointing to both JRE 1.8.0_261 (Oracle) and 11.0.2 (Oracle OpenJDK).

(Update at bottom)
Partial answer, but was able to get some Hebrew characters in the console in both Kotlin and Java. Was verry painful. Included some commented out stuff to show you some other things I may have tried if you run into any other hurdles.
Saved Tester.kt as UTF-8 with Notepad.
fun main(args : Array<String>) {
System.setProperty("file.encoding", "UTF8")
//val charset = Charsets.UTF_8
//val byteArray = "שלום מקוטלין".toByteArray(charset)
//System.out.printf("%c",byteArray.toString(charset))
//System.out.println(Charset.defaultCharset())
System.out.println("ל")
}
kotlinc.bat .\Tester.kt -include-runtime -d Tester.jar
Now, this leads to another mess, which I discovered by trying to copy and paste Hebrew characters to Powershell/Cmd. When copying, the ? marks showed right off the bat. Dug around a little bit, seems Powershell ISE is better suited for this (reference below). Without any plugins, copy and pasted successfully. Then had to run this:
PS> [Console]::OutputEncoding = [System.Text.Encoding]::UTF8
Because on my system, running the following showed:
PS> [Console]::OutputEncoding
IsSingleByte : True
BodyName : iso-8859-1
EncodingName : Western European (Windows)
HeaderName : Windows-1252
WebName : Windows-1252
WindowsCodePage : 1252
IsBrowserDisplay : True
IsBrowserSave : True
IsMailNewsDisplay : True
IsMailNewsSave : True
EncoderFallback : System.Text.InternalEncoderBestFitFallback
DecoderFallback : System.Text.InternalDecoderBestFitFallback
IsReadOnly : True
CodePage : 1252
Then,
java -jar -D"file.encoding=UTF-8" tester.jar
and voila, a single Lamedh
ל
Also, the Java route, which may or may not bring more insights:
Tester.java saved as UTF-8 with Notepad, imports redundant, yes, but shows some standout imports
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import static java.nio.charset.StandardCharsets.*;
import java.nio.*;
public class Tester{
public static void main(String[] args){
String str1 = "שלום מקוטלין";
byte[] ptext = str1.getBytes(UTF_8);
String value = new String(ptext, UTF_8);
ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode("ש");
System.out.println(Charset.defaultCharset());
System.out.println("שלום מקוטלין");
System.out.println(value);
System.out.print(byteBuffer.getChar());
System.out.printf("Value: %s",value);
}
}
javac would give:
javac .\Tester.java
.\Tester.java:8: error: unmappable character (0x9D) for encoding windows-1252
System.out.println("╫⌐╫£╫ò╫? ╫₧╫º╫ò╫ÿ╫£╫Ö╫ƒ");
So
javac -encoding UTF-8 .\Tester.java
and voila again, PS ISE only:
PS> java -D"file.encoding=UFT-8" Tester
UTF-8
שלום מקוטלין
שלום מקוטלין
힩Value: שלום מקוטלין
I think this shows there are several hurdles, but it can work with Kotlin, and with println after making sure the file is correct, running the file the right way, and the output is correct. Hebrew may be particularly difficult due to the right-to-left nature, other characters like Greek were easier I think.
No matter what, I feel your pain, good luck. From what I read, there may be other bottlenecks like sending Hebrew over a network. This opened my eyes to several things, will continue to learn about this myself.
(Update)
Using the second link in the reference actually provided before, you can make two small changes and get Hebrew in Powershell (not just ISE)!!
PS> $OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
Then,
Font: Courier New
References:
https://markw.dev/unicode_powershell/
Displaying Unicode in Powershell
https://community.idera.com/database-tools/powershell/ask_the_experts/f/learn_powershell_from_don_jones-24/11793/add-hebrew-to-powershell
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html
I want to display Greek unicode characters but i get "?" instead on ouput
Encode String to UTF-8

Related

How can I run a bat file with spaces in the filepath?

I want to run a bat file used to compile sass to css from within a Kotlin program, on a Windows machine. I had everything working using Runtime.exec until I switched to a Windows account that had a space in the username. From what I read, I read that using ProcessBuilder would make this easier. It seems that even with ProcessBuilder I still can't get it to work, no matter what I try.
Here is my code so far
val commands = mutableListOf(
"cmd",
"/c",
"C:\\Users\\John Doe\\VCS\\test\\tools\\sass\\windows\\dart-sass\\sass.bat",
"--no-source-map",
"C:\\Users\\John Doe\\VCS\\test\\src\\main\\sass\\global.scss",
"global.css"
)
val processBuilder = ProcessBuilder(commands)
val process = processBuilder.start()
...
The error I get is 'C:\Users\John' is not recognized as an internal or external command, operable program or batch file. It doesn't help if I surround the strings that have spaces with \".
If I remember correctly, all windows files and folders that have a space in the name have a matching short name in the old 8.3 format replacing additional space and other characters with a tilde (~) and a number.
So whatever is returning you the path for the .bat and .sscs files could return the full filename in that format?
Doesn't solve the problem but avoids it instead, I admit.
Also means you won't get busted when someone puts a space in the filename (OK, unlikely, but still something better to deal with from the start).
Consider something along the lines of the top 2 answers on this superuser thread
This is actually a Windows cmd issue. The question here shows that in cmd, in addition to quoting the file paths, you also have to quote the entire part of the command line text after the /c switch.
I don't know if this is the best way to do it in ProcessBuilder, but I was able to get it to work with the following code.
"cmd.exe",
"/c",
"\"\"C:/Users/John Doe/VCS/test/tools/sass/windows/dart-sass/sass.bat\" "
+ "--no-source-map "
+ "\"C:/Users/John Doe/VCS/test/src/main/sass/global.scss\" "
+ "\"global.css\"\""

Create file with UTF-16LE encoding

I am trying to create a tab separate CSV via kotlin.
For the requirement we need to have UTF-16LE for the produced file encoding.
My stripped down code is something like this:
import java.io.File
import java.io.FileOutputStream
import java.io.OutputStreamWriter
fun main(args: Array<String>) {
val fileOutputStream = FileOutputStream(File("bla.csv"))
val writer = OutputStreamWriter(fileOutputStream, Charsets.UTF_16LE)
writer.write("bla\tbla\tbla")
writer.write("\n")
writer.write("lab\tlab\tlab")
writer.flush()
writer.close()
}
So after executing this program the file generated has this information:
(I am running file on the actual file)
file -I bla.csv
bla.csv: application/octet-stream; charset=binary
This is what I get when I go for
Charsets.UTF_16LE
I have tried using other UTF-16 variation which made me even more confuse!
So if I use Charsets.UTF_16 it will result in:
file -I bla.csv
bla.csv: text/plain; charset=utf-16be
And if I use Charsets.UTF_16BE it will result in:
file -I bla.csv
bla.csv: application/octet-stream; charset=binary
So after a lot of self doubt and being sure that I am doing something wrong I have give up and come here.
Any guidance will be appreciated. Thanks in advance
I suspect this is a limitation of the file command, and not a problem with your code (which is fine*).
If you write a Byte Order Mark (\uFEFF) as the first character of the file, then file recognises it fine:
> file bla.csv
bla.csv: Little-endian UTF-16 Unicode text
> file -I bla.csv
bla.csv: text/plain; charset=utf-16le
The file should be perfectly valid without a BOM, though.  So I'm not sure why file isn't recognising it.  It may be that it's not always possible to safely identify UTF16-LE without a BOM, though you'd think a case like this (where every other byte is 0) would be a safe bet!
(* Well, there are always potential improvements…  For example, it'd be safer to wrap the output in a call to writer.use() instead of closing the file manually.  You could wrap the OutputStreamWriter in a BufferedWriter for efficiency.  And in production code, you'd want some error handling, of course.  But none of that's related to the question!)

doxygen latex make fails for input encoding error

I have a git repo project in eclipse which I have been documenting using doxygen (v1.8.4).
If I run the latex make ion a fresh clone of the project it runs fine and the PDF is made.
However, if I then run a doxy build, which completes OK, then attempt to run the latex make, it fails for
! Package inputenc Error: Keyboard character used is undefined
(inputenc) in inputencoding `utf8'.
See the inputenc package documentation for explanation.
Type H <return> for immediate help.
...
I have tried switching the encoding of the doxyfile by setting DOXYFILE_ENCODING to ISO-8859-1 with no change in the result... How can I fix this?? Thanks.
EDIT: I have used no non-UTF-8 chars as far as I know in my files, the file referenced before the error is very short and definitely doesn't have non-UTF-8 chars in it. I've even tried clearing my latex output dir and building from scratch with no luck...
EDIT: Irealised that the doxy build only appears to run correctly. It doesnt show any errors, but it should, for example run DOT and build about 10 graphs. The console output says Running dot, but it doesn't say generating graph (n/x) like it should when it actually makes the graphs...
Short answer: So by a slow process of elimination I found that this was caused by a single apostrophe in a file that had appeared to be already built and made without error!!
Long answer: Firstly I used used the project properties to flip the encoding from the default Cp1252 to UTF-8. Then I started removing files one-by-one until rebuilding and remaking after each removal, until the make ran successfully. I re-added all files, but deleted the content in the most recently removed file and tested the make - to confirm it was this file and only this file that caused the issue. the make ran fine. So I pasted the content back into the empty file, and started deleting smaller and smaller sections of the file, again rebuilding and remaking each time until I was left with a good make without the apostrophe and a bad one with it... I simply retyped the apostrophe (as this would then force it to be a UTF-8 char) and success!! Such an annoying bug!
Dude you made it a hard way. Why not use python to do the work for you:
f = open(fn,"rb")
data = f.read()
f.close()
for i in range(len(data)):
ch = data[i]
if(ch > 0x7F): # non ASCII character
print("char: %c, idx: %d, file: %s"%(ch,i,fn))
str2 = str(data[i-30:i+30])#.decode("utf-8")
print("txt: %s" % (str2))

What is the most portable/cross-platform way to represent a newline in go/golang?

Currently, to represent a newline in go programs, I use \n. For example:
package main
import "fmt"
func main() {
fmt.Printf("%d is %s \n", 'U', string(85))
}
... will yield 85 is U followed by a newline.
However, this doesn't seem all that cross-platform. Looking at other languages, PHP represents this with a global constant ( PHP_EOL ). Is \n the right way to represent newlines in a cross-platform specific manner in go / golang?
I got curious about this so decided to see what exactly is done by fmt.Println. http://golang.org/src/pkg/fmt/print.go
If you scroll to the very bottom, you'll see an if addnewline where \n is always used. I can't hardly speak for if this is the most "cross-platform" way of doing it, and go was originally tied to linux in the early days, but that's where it is for the std lib.
I was originally going to suggest just using fmt.Fprintln and this might still be valid as if the current functionality isn't appropriate, a bug could be filed and then the code would simply need to be compiled with the latest Go toolchain.
You can always use an OS specific file to declare certain constants. Just like _test.go files are only used when doing go test, the _[os].go are only included when building to that target platform.
Basically you'll need to add the following files:
- main.go
- main_darwin.go // Mac OSX
- main_windows.go // Windows
- main_linux.go // Linux
You can declare a LineBreak constant in each of the main_[os].go files and have your logic in main.go.
The contents of you files would look something like this:
main_darwin.go
package somepkg
const LineBreak = "\n"
main_linux.go
package somepkg
const LineBreak = "\n"
main_windows.go
package somepkg
const LineBreak = "\r\n"
and simply in your main.go file, write the code and refer to LineBreak
main.go
package main
import "fmt"
func main() {
fmt.Printf("%d is %s %s", 'U', string(85), LineBreak)
}
Having the OS determine what the newline character is happens in many contexts to be wrong. What you really want to know is what the "record" separator is and Go assumes that you as the programmer should know that.
Even if the binary runs on Windows, it may be consuming a file from a Unix OS.
Line endings are determined by what the source of the file or document said was a line ending, not the OS the binary is running in.
You can use os.PathSeparator
func main() {
var PS = fmt.Sprintf("%v", os.PathSeparator)
var LineBreak = "\n"
if PS != "/" {
LineBreak = "\r\n"
}
fmt.Printf("Line Break %v", LineBreak)
}
https://play.golang.com/p/UTnBbTJyL9c

MSI DB, Visual Basic and cp1252 encoded string problems

I have such line of code (generated by MakeMSI)
oRec.StringData(2) = "A publicitar a aplicação"
oRec is record from Msi database, opened with:
oInstaller = MkObject("WindowsInstaller.Installer")
oMsi = oInstaller.OpenDatabase(MsiName, msiOpenDatabaseModeDirect)
oMsi.OpenView(selectQuery)
After executing and commiting string "A publicitar a aplicação" is converted to "A publicitar a aplicaçao" (ã is converted to a) in the database. I'm 100% sure database is cp1252 encoded, as when I edit field manualy and insert ã it displays well. Any ideas how to workaround this?
EDIT:
When building installer on Portugese Windows everything is fine
What is the codepage of the computer where you edit the property?
I don't know if VBA uses Unicode internally to store strings or not. If it does, then it should work on any computer; if it does not, then it should work correctly only where the system code page supports ‘ã’.
So another part of the problem is the source file itself: to work as expected it should be Unicode-enabled (UTF-8 or UTF-16), and the interpreter should handle it this was. Otherwise, you'll get unexpected results where the current code page is not compatible with cp1252.
Check the setting for Language for non-Unicode programs in the Regional settings in Windows. It should be set to Portuguese.