ANTLR4 not processing UTF-16 input correctly

I'm using ANTLR 4.10.1 and C++. I'm using ANTLRInputStream as the input to my lexer
antlr4::ANTLRInputStream inputStream(....);
This works fine until the input contains UTF-16 text, which causes problems later on.
Since ANTLRInputStream is deprecated in 4.10.1, it seems CharStreams has to be used in order to specify a Charset such as "UTF-16LE", but I could only find documentation for Java. Is there a way to use CharStreams with UTF-16 to make this work in C++?

The input stream in the C++ runtime always expects UTF-8; see the source code of ANTLRInputStream::load. Internally it converts the input to UTF-32. The 16-bit transformation format is never used.
Because of this convention there's no need to deprecate the C++ version of ANTLRInputStream. That was only necessary for older language targets that have no UTF-32 (like JS and Java). With that in mind you can ignore the new CharStreams class in the C++ target.
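For illustration, here is a minimal sketch of handling UTF-16 input in the C++ target: convert it to UTF-8 first and hand that to ANTLRInputStream. The sample string and the use of std::wstring_convert (deprecated since C++17 but still available) are assumptions for the example.
#include <codecvt>
#include <locale>
#include <string>

#include "antlr4-runtime.h"

int main() {
  // Assume the UTF-16 text has already been read into a std::u16string.
  std::u16string utf16Input = u"some input with non-ASCII text";

  // Convert UTF-16 -> UTF-8; the runtime then converts UTF-8 -> UTF-32 internally.
  std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
  std::string utf8Input = converter.to_bytes(utf16Input);

  antlr4::ANTLRInputStream inputStream(utf8Input);
  // Hand inputStream to your generated lexer as usual.
  return 0;
}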

Related

Antlr for multiple language generation

This post about the ANTLR simple example shows how to create and use a grammar for Java.
However, this intermixes the grammar and the Java source code in the Exp.g source.
My question is: is it possible to decouple the grammar file from the target language, so that one grammar file can be used to generate lexers/parsers for multiple languages (Java, Scala, C++, etc.)?
It depends mostly on the reason why target code is used in the grammar. If it is only action code that does something with the matched tokens (e.g. building a symbol table or an alternative tree representation), then it is indeed no problem to remove such native code and do the processing afterwards (using a parse tree walker or visitor).
However, predicates are a different matter. They are used to guide the parser and also require native code. What you can do is move all the native code into a base class from which your generated parser derives. You then only need to rewrite this base class in each target language and keep the grammar mostly free of native code (except for a single function call, which invokes the native code).
This approach has the additional advantage that no library reference (#include in C/C++, import in other languages) is needed in the grammar; such a reference is also native code and would prevent using the grammar for multiple targets.
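As a rough sketch for the C++ target (the names MyParserBase and isVersionOk are invented for the example): the grammar would declare options { superClass = MyParserBase; } and call the method from a predicate such as { isVersionOk() }?, while all native logic lives in a hand-written base class that is re-implemented per target.
#include "antlr4-runtime.h"

// Hand-written base class; the generated parser derives from it via the
// superClass option, so the grammar itself stays free of native code.
class MyParserBase : public antlr4::Parser {
public:
  explicit MyParserBase(antlr4::TokenStream *input) : antlr4::Parser(input) {}

  // Target-specific logic referenced from a predicate in the grammar.
  bool isVersionOk() const { return languageVersion >= 2; }

  void setLanguageVersion(int version) { languageVersion = version; }

private:
  int languageVersion = 1;
};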

How to implement branching in bison-based interpreter?

I'm developing a virtual machine. The bytecode interpreter uses flex and bison.
Here is some example code:
some:
add r0 4 4
jmp some
My question is: how to handle jmp instruction?
Can I ask bison to go back to a label and continue the analysis from there?
I'm developing a bytecode interpreter, not a compiler...
No, you can't make bison go back. You usually use Bison to parse the code and generate some kind of intermediate representation, like an AST or bytecode. Then you execute that in a separate step.
So in your case, since you're parsing an assembly language for a bytecode format, it makes sense to translate it into actual bytecode. That is, when your parser sees "add r0 4 4", all it should do is append the corresponding sequence of bytes to an array containing your bytecode. Once the parser has built this array, you can pass it to a function that actually executes the bytecode.
It would probably also make sense to split these two steps into two separate programs: an assembler that turns a source file into a binary bytecode file, and a bytecode interpreter that reads a bytecode file and executes it. The latter would not need to use Bison at all, just read the bytes and switch on them.
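A minimal sketch of that split (hand-rolled, not generated by Bison; the opcode values and the three-operand add are assumptions modeled on the example above): the parser actions would only append bytes like these to a code buffer, and a separate loop executes them, handling jmp by simply changing the instruction pointer.
#include <cstdint>
#include <iostream>
#include <vector>

enum Op : uint8_t { OP_ADD = 0, OP_JMP = 1, OP_HALT = 2 };

int main() {
  // What the parser action for "add r0 4 4" might emit, plus a halt so the
  // sketch terminates instead of looping on a jmp.
  std::vector<uint8_t> code = {
      OP_ADD, 0, 4, 4,   // r0 = 4 + 4
      OP_HALT
  };

  std::vector<int> regs(4, 0);
  std::size_t ip = 0;   // instruction pointer
  bool running = true;
  while (running) {
    switch (code[ip]) {
      case OP_ADD:
        regs[code[ip + 1]] = code[ip + 2] + code[ip + 3];
        ip += 4;
        break;
      case OP_JMP:
        ip = code[ip + 1];   // a jump just moves the instruction pointer
        break;
      case OP_HALT:
        running = false;
        break;
    }
  }
  std::cout << "r0 = " << regs[0] << "\n";   // prints r0 = 8
  return 0;
}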

can hard coded strings in a compiled exe be changed?

Let's say you have some code in your app with a hard-coded string.
If somevalue = "test123" Then
End If
Once the application is compiled, is it possible for someone to modify the .exe file and change 'test123' to something else? If so, would it only work if the string contained the same number of characters?
It's possible but not necessarily straightforward. For example, if your string is loaded into memory, someone could use a memory manager tool to modify the value at the string's address directly.
Alternatively, they could decompile your app, change the string, and recompile it to create a new assembly with the new string. However, whether this is likely to happen depends on your app and how important it is for that string to be changed.
You could use an obfuscator to make it a bit harder to do but, ultimately, a determined cracker would be able to do it. The question is whether that string is important enough to worry about and, if so, maybe consider an alternative approach such as using a web service to provide the string.
Strings hard-coded without any obfuscation techniques can easily be found inside compiled executables by opening them up in any hex editor. Once found, the string can be replaced in two ways:
1. Easy way (*conditions apply)
If the following conditions apply in your case, this is a very quick-fire way of modifying the hard-coded strings in the executable binary.
length(new-string) <= length(old-string)
No logic in the code to check for executable modification using CRC.
This is a viable option ONLY if the new string is the same length as or shorter than the old string. Use a hex editor to find occurrences of the old string and replace them with the new string, padding any leftover space with NUL bytes, i.e. 0x00.
For example, an old-long-string in the binary is modified to a shorter new-string and padded with NUL characters to the same length as the original string in the binary executable file.
Note that such modifications to the executable file are detected by any code that verifies the checksum of the binary against the pre-calculated checksum of the original binary executable file.
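A small sketch of the easy way in code, assuming the offset of the old string has already been located with a hex editor (the file name, offset, and string values are made up for the example):
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

int main() {
  const std::string oldValue = "test123";
  const std::string newValue = "abc";     // must not be longer than oldValue
  const std::streamoff offset = 0x1A40;   // offset found with a hex editor

  // Build the patch: the new string followed by 0x00 padding up to the old length.
  std::vector<char> patch(oldValue.size(), '\0');
  std::copy(newValue.begin(), newValue.end(), patch.begin());

  std::fstream exe("app.exe", std::ios::in | std::ios::out | std::ios::binary);
  exe.seekp(offset);
  exe.write(patch.data(), static_cast<std::streamsize>(patch.size()));
  return 0;
}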
2. Harder way (applicable in almost all cases)
Decompiling the binary back to source code opens up the possibility of modifying any strings (and even code) and rebuilding it to obtain a new binary executable.
Dozens of such decompiler tools exist for VB.NET (.NET in general). A detailed comparison of the most popular ones (ILSpy, JustDecompile, dotPeek, and .NET Reflector, to name a few) can be found here.
There are scenarios in which even the harder way will NOT be successful: when the original developer has used obfuscation techniques to prevent the strings from being detected and modified in the executable binary. One such obfuscation technique is storing encrypted strings.
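As an illustration of that last point, here is a sketch of a simple XOR-based scheme (the key and encoded bytes are made up for the example; real obfuscators are more sophisticated). The bytes stored in the binary look like gibberish in a hex editor, and "test123" only exists in memory at runtime.
#include <iostream>
#include <string>

// Reverses the XOR encoding applied offline.
static std::string decode(const std::string &encoded, char key) {
  std::string result = encoded;
  for (char &c : result) c ^= key;
  return result;
}

int main() {
  // "test123" XOR'd with 0x5A ahead of time.
  const std::string encoded = {0x2E, 0x3F, 0x29, 0x2E, 0x6B, 0x68, 0x69};
  std::string secret = decode(encoded, 0x5A);
  std::cout << (secret == "test123") << "\n";   // prints 1
  return 0;
}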

"Human-readable" ANTLR-generated code?

I've been learning ANTLR for a few days now. My goal in learning it was to be able to generate parsers and lexers and then hand-translate them from Java into my target language (not C/C++/Java/C#/Python; no tool supports it). I chose ANTLR because of this claim on its About page: ANTLR is widely used because it's easy to understand, powerful, flexible, generates human-readable output[...]
In learning this tool, I decided to start with a simple lexer for a simple grammar: JSON. However, once I generated the .java file for this lexer using ANTLR4, I was caught completely off guard: I got a huge mess of far-from-human-readable serialized code, followed by:
public static final ATN _ATN =
    ATNSimulator.deserialize(_serializedATN.toCharArray());
static {
    _decisionToDFA = new DFA[_ATN.getNumberOfDecisions()];
}
A few Google searches were unable to provide me a way to disable this behavior.
Is there a way to disable this behavior and produce human-readable code, or am I going to have to hand-write my lexers and parsers for this target programming language?
ANTLR 4 uses a new algorithm for prediction. Terence Parr is currently working on a tech report describing the algorithm in detail. The human-readable output refers to the generated parsers.
ANTLR 4 lexers use a DFA recognizer for a massive speed and memory usage improvement over previous releases of ANTLR. For parsers, the _ATN field is a data structure used within calls to adaptivePredict (you'll notice lines in the generated code calling that method).
You won't be able to manually translate the generated Java code of an ANTLR 4 lexer to another programming language. You might be able to manually translate the code of a generated parser provided the grammar is strictly LL(1) (i.e. the generated code does not contain any calls to adaptivePredict). However, you will lose the error recovery ability that draws from information encoded in the serialized ATN.

In which language is the proto compiler (of google protocol buffers) written?

I would like to know in which language the "proto compiler" (the compiler used to generate Java, Python, or C++ source files from .proto files) is written. Is it maybe a mix of languages?
Any help would be appreciated.
Thanks in Advance
Horace
It appears to be written in C++. There's also documentation on Java and Python APIs, but those don't appear to contain the compiler itself (at least I don't see anything that's obviously the compiler in either case, though I didn't spend a whole lot of time looking for it either).
That said, I'm almost tempted to vote to close -- for most practical purposes, the language used to implement the compiler is basically a trivia question, irrelevant to actual use. There is, however, an entirely legitimate exception: if you're going to download and modify the compiler, knowing the language you'd need to work with could be quite useful.
The protoc compiler is written in C or C++ (it's a native program, anyway).
When I want to process proto files in Java, I:
1. Use the protoc command to convert them to a protocol buffer descriptor file, i.e.
protoc protofile.proto --descriptor_set_out=OutputFile
2. Read the new protocol buffer file (it's a FileDescriptorSet) and use it.
An over-complicated example is the compileProto method in
http://code.google.com/p/protobufeditor/source/browse/trunk/%20protobufeditor/Source/ProtoBufEditor/src/net/sf/RecordEditor/ProtoBuf/re/display/ProtoLayoutSelection.java
It's complicated because the protoc command and options can be stored in a properties file.
Note: the getFileDescriptor method reads the newly created protocol buffer file.
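For step 2, here is a minimal sketch in C++ (rather than Java) of reading the descriptor file produced above; "OutputFile" matches the protoc invocation, and error handling is kept minimal.
#include <fstream>
#include <iostream>

#include <google/protobuf/descriptor.pb.h>

int main() {
  std::ifstream in("OutputFile", std::ios::binary);

  // The file written by --descriptor_set_out is a serialized FileDescriptorSet.
  google::protobuf::FileDescriptorSet fileSet;
  if (!fileSet.ParseFromIstream(&in)) {
    std::cerr << "Could not parse the descriptor set\n";
    return 1;
  }

  // Each entry describes one .proto file, including its message types.
  for (const auto &file : fileSet.file()) {
    std::cout << file.name() << "\n";
    for (const auto &message : file.message_type()) {
      std::cout << "  message " << message.name() << "\n";
    }
  }
  return 0;
}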