ANTLR4 Unicode Parse Not Recognized by grun

Given the following:
grammar Lang;
start: CHAR;
CHAR: [\uE001];
WS: [ \t\r\n]+ -> skip;
When this batch file runs:
@echo off
setlocal
call antlr4 -o .\javatarget LangFile.g4 -encoding UTF-8
cd .\javatarget
call javac LangFile*.java
call grun LangFile Lang -gui -diagnostics -trace -encoding UTF-8
endlocal
@echo on
This error happens when I paste in the Unicode character:

^Z
line 1:0 token recognition error at: '?'
enter Lang, LT(1)=<EOF>
consume [@0,3:2='<EOF>',<-1>,2:0] rule Lang
exit Lang, LT(1)=<EOF>
Despite searching the other answers (which suggest, for example, the -encoding option), I cannot get this kind of Unicode (the Private Use Areas) to parse.
Edit: I have version 4.8.
The problem seems to be with the grun tool. Running the parser manually with Python works fine, and so does specifying an input file, but pasting the content directly into the console fails. Falling back to an input file is good enough for me, but I would consider this question answered only once grun's direct input mode works.

Could be an issue with how your grun script handles the input, because when I generate a lexer and parser and run this:
LangLexer lexer = new LangLexer(CharStreams.fromString("\uE001"));
LangParser parser = new LangParser(new CommonTokenStream(lexer));
parser.start();
it parses without any warnings or errors.
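
One possible explanation for the '?' in the error message is that the pasted character is already lost before the lexer sees it: if the console delivers bytes in a legacy single-byte code page, a Private Use Area character has no representation and is replaced by '?'. The sketch below illustrates that effect with the JDK only. ISO-8859-1 is an assumed stand-in for whatever code page the console actually uses, and ConsoleEncodingDemo and roundTrip are hypothetical names, not part of ANTLR or grun.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of ANTLR or grun.
final class ConsoleEncodingDemo {

    // Encode the text in the given charset and decode it again, roughly
    // what happens to text that passes through a console using that charset.
    static String roundTrip(String text, Charset consoleCharset) {
        // String.getBytes replaces unmappable characters with '?'.
        byte[] bytes = text.getBytes(consoleCharset);
        return new String(bytes, consoleCharset);
    }

    public static void main(String[] args) {
        String pua = "\uE001";
        // Assumption: ISO-8859-1 stands in for a legacy console code page.
        String legacy = roundTrip(pua, StandardCharsets.ISO_8859_1);
        String utf8 = roundTrip(pua, StandardCharsets.UTF_8);
        System.out.println("legacy code page keeps U+E001: " + legacy.equals(pua));
        System.out.println("UTF-8 keeps U+E001: " + utf8.equals(pua));
    }
}
```

If this is what happens, the -encoding option cannot help, because it only tells the reader how to decode bytes that have already been damaged by the console.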

Related

TestRig / grun gets stuck parsing the example file

Following this question, I'm trying to learn how to use the TestRig / grun tool. Consider the grammar file in this repo. I ran the commands below:
export CLASSPATH=".:/usr/local/Cellar/antlr/<version>/antlr-<version>-complete.jar:$CLASSPATH"
antlr <grammarName>.g4
javac <grammarName>*.java
but when I run
grun <grammarName> <inputFile>
it gets stuck without returning any error messages. I have tested this with other examples as well, with the same result. I would appreciate help understanding what the problem is and how I can resolve it.

The normal grun alias takes the grammarName and startRule as parameters and expects the input on stdin:
grun <grammarName> <startRule> < <inputFile>
Example:
grun ElmerSolver sections -tree < examples/ex001.sif
If you want to run just the Lexer, you can use the "pseudo-startrule" "tokens":
grun ElmerSolver tokens -tokens < examples/ex001.sif
With your sample, this gives me:
[@0,0:9='Simulation',<'Simulation'>,1:0]
[@1,11:13='End',<'End'>,2:0]
[@2,16:24='Equation ',<'Equation '>,4:0]
[@3,25:25='1',<Integer>,4:9]
[@4,27:29='End',<'End'>,5:0]
[@5,30:29='<EOF>',<EOF>,5:3]
(That's using the grammar changes I made in the previous answer, but it should demonstrate the results.)

Kotlin prints non-English characters as question marks

I am trying to print Hebrew characters from a Kotlin program (running on the console).
All the Hebrew characters are being output as question marks.
I created the following simple test.kts script file for testing:
println("שלום מקוטלין")
// Try to print a simple non-Hebrew character too
println("\u0394") // Greek Delta
The file is properly saved in UTF-8 format.
It prints:
???? ???????
?
I tried running it in Command Prompt, PowerShell (both in its native window and in Windows Terminal), and Git Bash, all of which give the same result. I also tried redirecting the output to a file to rule out display issues in the shells.
To make sure the problem isn't the console itself, I also made simple test.bat, test.ps1, and test.sh files with the following content:
echo "שלום מקוטלין"
All three shells correctly displayed the Hebrew text here, indicating that the problem is in Kotlin's output, not in the shell display. (Though PowerShell requires the file to be saved "UTF-8 with BOM" to display properly, this can't be the issue with Kotlin since Kotlin won't even run a script that is saved with a BOM.)
As far as I can tell, Kotlin should support UTF-8 output by default with no configuration needed.
How can I get the proper output?
Updates:
If I write the output to a file using java.io.File("out.txt").writeText("שלום מקוטלין"), it works properly.
Also, if I open a new PrintStream using val out = java.io.PrintStream(System.out, true, "UTF-8") and then write to it using out.println("שלום מקוטלין"), that works properly too.
Only writing to the console with println is broken.
System info:
Windows 10 2004 (Build 19041.450)
Kotlin 1.4.0 (downloaded from GitHub Releases)
Tested with JAVA_HOME pointing to both JRE 1.8.0_261 (Oracle) and 11.0.2 (Oracle OpenJDK).
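
The PrintStream workaround mentioned in the updates can be sketched in Java as follows. It wraps a stream so that text is always encoded as UTF-8, whatever the JVM default charset is, and optionally installs that wrapper as System.out. Utf8Stdout, utf8 and encodeUtf8 are hypothetical names for illustration, and the Hebrew text is written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper; a sketch of the workaround, not a fix for Kotlin itself.
final class Utf8Stdout {

    // Wrap a stream so that print/println encode as UTF-8 with autoflush on.
    static PrintStream utf8(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Return the bytes the wrapper would send to the console for this text.
    static byte[] encodeUtf8(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream out = utf8(buffer);
        out.print(text);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Replace System.out so that later println calls also use UTF-8.
        System.setOut(utf8(System.out));
        System.out.println("\u05E9\u05DC\u05D5\u05DD"); // Hebrew "shalom"
        System.out.println("\u0394");                   // Greek capital delta
    }
}
```

Whether the characters then display correctly still depends on the console's own output encoding and font.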

(Update at bottom)
This is a partial answer, but I was able to get some Hebrew characters in the console in both Kotlin and Java. It was very painful. I included some commented-out code to show other things you could try if you run into further hurdles.
Saved Tester.kt as UTF-8 with Notepad.
fun main(args : Array<String>) {
System.setProperty("file.encoding", "UTF8")
//val charset = Charsets.UTF_8
//val byteArray = "שלום מקוטלין".toByteArray(charset)
//System.out.printf("%c",byteArray.toString(charset))
//System.out.println(Charset.defaultCharset())
System.out.println("ל")
}
kotlinc.bat .\Tester.kt -include-runtime -d Tester.jar
Now, this leads to another mess, which I discovered by trying to copy and paste Hebrew characters into PowerShell/Cmd. When pasting, the ? marks showed up right away. I dug around a little, and it seems PowerShell ISE is better suited for this (see the references below); without any plugins, copy and paste worked there. Then I had to run this:
PS> [Console]::OutputEncoding = [System.Text.Encoding]::UTF8
Because on my system, running the following showed:
PS> [Console]::OutputEncoding
IsSingleByte : True
BodyName : iso-8859-1
EncodingName : Western European (Windows)
HeaderName : Windows-1252
WebName : Windows-1252
WindowsCodePage : 1252
IsBrowserDisplay : True
IsBrowserSave : True
IsMailNewsDisplay : True
IsMailNewsSave : True
EncoderFallback : System.Text.InternalEncoderBestFitFallback
DecoderFallback : System.Text.InternalDecoderBestFitFallback
IsReadOnly : True
CodePage : 1252
Then,
java -jar -D"file.encoding=UTF-8" tester.jar
and voila, a single Lamedh
ל
Also, the Java route, which may or may not bring more insights:
Tester.java saved as UTF-8 with Notepad, imports redundant, yes, but shows some standout imports
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import static java.nio.charset.StandardCharsets.*;
import java.nio.*;
public class Tester{
public static void main(String[] args){
String str1 = "שלום מקוטלין";
byte[] ptext = str1.getBytes(UTF_8);
String value = new String(ptext, UTF_8);
ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode("ש");
System.out.println(Charset.defaultCharset());
System.out.println("שלום מקוטלין");
System.out.println(value);
System.out.print(byteBuffer.getChar());
System.out.printf("Value: %s",value);
}
}
javac would give:
javac .\Tester.java
.\Tester.java:8: error: unmappable character (0x9D) for encoding windows-1252
System.out.println("╫⌐╫£╫ò╫? ╫₧╫º╫ò╫ÿ╫£╫Ö╫ƒ");
So
javac -encoding UTF-8 .\Tester.java
and voila again, PS ISE only:
PS> java -D"file.encoding=UTF-8" Tester
UTF-8
שלום מקוטלין
שלום מקוטלין
힩Value: שלום מקוטלין
I think this shows there are several hurdles, but it can work with Kotlin and with println, once you make sure that the source file is saved correctly, that it is compiled and run with the right options, and that the console output encoding is correct. Hebrew may be particularly difficult because it is written right to left; other characters, like Greek, were easier, I think.
No matter what, I feel your pain; good luck. From what I read, there may be other bottlenecks, such as sending Hebrew over a network. This opened my eyes to several things, and I will continue to learn about this myself.
(Update)
Using the second link in the references below, you can make two small changes and get Hebrew in PowerShell (not just ISE)!
PS> $OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
Then set the console font:
Font: Courier New
References:
https://markw.dev/unicode_powershell/
Displaying Unicode in Powershell
https://community.idera.com/database-tools/powershell/ask_the_experts/f/learn_powershell_from_don_jones-24/11793/add-hebrew-to-powershell
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html
I want to display Greek unicode characters but i get "?" instead on ouput
Encode String to UTF-8

How to preserve single quotes in a CMake cached variable?

I have a variable
SET(CODE_COVERAGE_EXCLUSION_LIST
""
CACHE STRING "List of resources to exclude from code coverage analysis")
It must contain a list of expressions such as: 'tests/*' '/usr/*'
When I try to set the default value to the above expressions, the single quotes are removed.
How can I preserve them?
Moreover, when I try to pass the exclusion list like this
cmake -DCODE_COVERAGE_EXCLUSION_LIST="'tests/*' '/usr/*'" ..
the initial and final single quotes are lost. How can I preserve them as well?
Finally, the same question applies when using cmake-gui.
EDIT: I tried to use a backslash to escape the quotes:
SET(CODE_COVERAGE_EXCLUSION_LIST
" \'tests/*\' \'/usr/*\'"
CACHE STRING "List of resources to exclude from code coverage analysis : ")
It gave me the following error :
Syntax error in cmake code at
xxx.cmake:106
when parsing string
\'tests/*\' \'/usr/*\'
Invalid escape sequence \'
EDIT 2: the code of the add_custom_target (not add_custom_command, my mistake):
ADD_CUSTOM_TARGET(${_targetname}
# Cleanup lcov
${LCOV_PATH} --directory . --zerocounters
# Run tests
COMMAND ${_testrunner} ${ARGV3}
# Capturing lcov counters and generating report
COMMAND ${LCOV_PATH} --directory . --capture --output-file ${_outputname}.info
COMMAND ${LCOV_PATH} --remove ${_outputname}.info 'tests/*' '/usr/*' ${CODE_COVERAGE_EXCLUSION_LIST} --output-file ${_outputname}.info.cleaned
COMMAND ${GENHTML_PATH} -o ${_outputname} ${_outputname}.info.cleaned
COMMAND ${CMAKE_COMMAND} -E remove ${_outputname}.info ${_outputname}.info.cleaned
WORKING_DIRECTORY ${CMAKE_BINARY_DIR}
COMMENT "Resetting code coverage counters to zero.\nProcessing code coverage counters and generating report."
)

Turning my comments into an answer
First, setting aside the question of why you need those quotes, I could reproduce your problem and found several possible solutions:
Adding spaces at the beginning and end of your cached variable
cmake -DCODE_COVERAGE_EXCLUSION_LIST=" 'tests/*' '/usr/*' " ..
Using "escaped" double quotes instead of single quotes
cmake -DCODE_COVERAGE_EXCLUSION_LIST:STRING="\"tests/*\" \"/usr/*\"" ..
Using your set(... CACHE ...) example after setting policy CMP0053 (introduced with CMake 3.1) to NEW:
cmake_policy(SET CMP0053 NEW)
set(CODE_COVERAGE_EXCLUSION_LIST
"\'tests/*\' \'/usr/*\'"
CACHE STRING "List of resources to exclude from code coverage analysis : ")
But when setting this in the code I could also just do
set(CODE_COVERAGE_EXCLUSION_LIST
"'tests/*' '/usr/*'"
CACHE STRING "List of resources to exclude from code coverage analysis : ")
The quoting issue seems to be a problem only when the variable is set from the command line.
Then, assuming you may not need the quotes at all, you could pass the paths as a list:
A semicolon-separated CMake list is expanded into separate parameters again (with spaces as the delimiter) when used in a COMMAND
cmake -DCODE_COVERAGE_EXCLUSION_LIST:STRING="tests/*;/usr/*" ..
with
add_custom_target(
...
COMMAND ${LCOV_PATH} --remove ${_outputname}.info ${CODE_COVERAGE_EXCLUSION_LIST} --output-file ${_outputname}.info.cleaned
)
would give something like
.../lcov --remove output.info tests/* /usr/* --output-file output.info.cleaned
I also tried to add the VERBATIM option, because "all arguments to the commands will be escaped properly for the build tool so that the invoked command receives each argument unchanged". But in this case, it didn't change anything.
References
add_custom_target()
CMake Language: Escape Sequences
0015200: Odd quoting issue when mixing single, double and escaped quotes to COMMAND

How to extract the strings in double quotes for localization

I'm trying to extract strings for localization. There are many files in which some of the strings are wrapped in NSLocalizedString and some are not.
I'm able to grab the NSLocalizedString strings using ibtool and genstrings, but I'm unable to extract the plain strings that are not wrapped in NSLocalizedString.
I'm not good at regex, but I came up with this: "[^(]@\""
and with the help of grep:
grep -i -r -I "[^(]@\"" * > out.txt
It worked, and all the strings were grabbed into a text file, but the problem is this:
if my code contains the line
..... initWithTitle:@"New Sketch".....
I expect grep to grab only the @"New Sketch" part, but it grabs the whole line.
So in out.txt I see initWithTitle:@"New Sketch", along with some unwanted lines.
How can I write the regex to grab only the strings in double quotes?
I tried the grep command with the regex mentioned here, but it gave me a syntax error.
For example, I tried:
grep -i -r -I (["'])(?:(?=(\\?))\2.)*?\1 * > out.txt
and it gave me
-bash: syntax error near unexpected token `('

In Xcode, open your project and go to Editor -> Export For Localization... It will create a folder of files containing everything that was marked for localization, in XML format, so there is no need to parse it yourself.
If you want to go the hard way, you can then parse those files the way you're trying to now. Storyboard strings will be in there too, by the way.
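
For the regex part of the question, a minimal sketch of pulling only the string literals out of a line (rather than the whole line, as grep prints by default) might look like this in Java. It assumes Objective-C literals of the form @"..." with backslash escapes; StringLiteralExtractor and extract are hypothetical names, and the pattern does not try to skip literals that are already wrapped in NSLocalizedString.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class StringLiteralExtractor {

    // Matches @"..." where the body is any run of characters that are
    // neither a quote nor a backslash, or a backslash followed by any character.
    private static final Pattern OBJC_LITERAL =
            Pattern.compile("@\"(?:[^\"\\\\]|\\\\.)*\"");

    // Return every @"..." literal found in the line, in order of appearance.
    static List<String> extract(String line) {
        List<String> found = new ArrayList<>();
        Matcher matcher = OBJC_LITERAL.matcher(line);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "[[UIAlertView alloc] initWithTitle:@\"New Sketch\" message:@\"Say \\\"hi\\\"\"]";
        for (String literal : extract(line)) {
            System.out.println(literal);
        }
    }
}
```

With grep itself, the -o option prints only the matching part of each line, which may be enough once the pattern is quoted for the shell.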

How to use Doxygen with Xcode?

I'm trying to use Doxygen with Xcode. I followed the Apple tutorial. After several mistakes, I built the project and generated the docs. Along the way I discovered, among other things, that you will have problems if you save the doxygen.config from Doxygen in a directory whose name contains a space " ".
But there is one last problem:
./search/search.png
./tab_b.gif
./tab_l.gif
./tab_r.gif
./tabs.css
/Developer/usr/bin/docsetutil index com.mycompany.DoxygenExample.docset
2010-03-31 12:30:53.847 docsetutil[46338:807] Error converting XML to CoreData: Error Domain=NSXMLParserErrorDomain Code=76 UserInfo=0x1247d0 "Line 8: Opening and ending tag mismatch: Subnodes line 0 and Node
"
Failed to create docset indexer object
make: *** [docset] Error 1
load documentation set with path "/Users/WB/Library/Developer/Shared/Documentation/DocSets/"
I don't know what the problem is. Any ideas?
I'm using Core Data - sqlite.

The parser is telling you the XML is not well formed, but that error usually shows up because nothing was generated BEFORE docsetutil ran.
The first thing to do is go over the many lines of console output and look for warnings; the cause is probably there. Also find the docset you generated and right click > Show Contents. If you don't see a lot of HTML files with the documentation, it is the same problem: generating the documentation failed and docsetutil has nothing to work on. And by the way, it is docsetutil that uses Core Data; it doesn't matter whether you use it in your project or not.
I don't get why Apple doesn't provide a more tightly integrated doxygen-like tool, or a better code formatter than Uncrustify. Just take the damn tools and improve them a little bit. Argh!

There is a known bug in the generation of Nodes.xml by Doxygen. It is referenced here https://bugzilla.gnome.org/show_bug.cgi?id=671591 and should be corrected in the next Doxygen version (after 1.8.0):
At the end of the Nodes.xml there is an additional
The -silence option is a workaround that suppresses the error, but it does not allow docset generation to work properly.
$DOXYGEN_PATH $TEMP_DIR/doxygen.config
make -C $TEMP_DIR/DoxygenDocs.docset/html install
Insert the following code.
Note: the script works in $TEMP_DIR and not in SOURCE_ROOT as the Apple script does.
$DOXYGEN_PATH $TEMP_DIR/doxygen.config
# make will invoke docsetutil. Take a look at the Makefile to see how this is done.
LINE=`xmllint --c14n $TEMP_DIR/DoxygenDocs.docset/html/Nodes.xml 2>&1 | awk 'NR == 1 {print $1}' | cut -d':' -f 2`
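echo "$LINE"
# Only clean up when xmllint reported a line number for a parse error.
if [ -n "$LINE" ] && [ "$LINE" -gt 0 ]
then
echo "XML Cleaning "
# Delete the offending line in place, keeping a .bak copy of the original.
sed -i.bak "${LINE}d" "$TEMP_DIR/DoxygenDocs.docset/html/Nodes.xml"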
fi
make -C $TEMP_DIR/DoxygenDocs.docset/html install
NB: awk and sed may certainly be combined in one line.
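
The xmllint step above boils down to "find the line of the first parse error in Nodes.xml". As a rough illustration of that check, here is a sketch using the JDK's built-in SAX parser. WellFormedCheck and firstErrorLine are hypothetical names, the sample XML is made up for the demo and is not real Doxygen output, and the code only reports the line; it does not modify any file.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only.
final class WellFormedCheck {

    // Returns -1 if the XML is well formed, otherwise the line number
    // the parser reports for the first fatal error.
    static int firstErrorLine(String xml) {
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return -1;
        } catch (SAXParseException e) {
            return e.getLineNumber();
        } catch (Exception e) {
            // Parser configuration or I/O problems are not expected here.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample with a stray closing tag after the root element.
        String broken = "<Nodes>\n<Node></Node>\n</Nodes>\n</Node>\n";
        System.out.println("first error line: " + firstErrorLine(broken));
        System.out.println("well-formed result: " + firstErrorLine("<Nodes><Node/></Nodes>"));
    }
}
```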

Long story short, the script creates a Doxyfile on the fly, and it does not recursively scan all subdirectories.
Take a look at this post:
http://www.duckrowing.com/2010/03/18/documenting-objective-c-with-doxygen-part-ii/
There's a script included on the second post that is based on Apple's script that shouldn't have this issue.

I use an extended version of the above script, based on the same principles. Although everything works fine on another project, this time my script fails.
The generation of the docset works fine but the make command produces the following error.
x ./search/search_r.png
2010-07-26 17:36:01.815 docsetutil[8441:903]
Error converting XML to CoreData:
Error Domain=NSXMLParserErrorDomain
Code=76
UserInfo=0x1006105e0
"Line 8: Opening and ending tag mismatch: Subnodes line 0 and Node"
Failed to create docset indexer object
make: *** [docset] Error 1
The make command I use is: make --silent -C "$DOCSET_OUTPUT/html" install.
I added line breaks to the error message for readability.
