Implementing Java serialization-independent encoding

I want to use several encodings in the presentation layer to encode an object/structure from the application layer independently of the encoding scheme (such as binary, XML, etc.) and programming language (Java, JavaScript, PHP, C).
An example would be to transfer an object from a producer to a consumer in a byte stream. The Java client would encode it using something like this:
Object var = new Dog();
output.writeObject(var);
The server would share the Dog class definitions and could regenerate the object by doing something like this:
Object var = input.readObject();
assertTrue(var instanceof Dog); // passes
It is important to note that the producer and consumer would not share the type of var, and, therefore, the consumer would not need the type to decode var. They would only share data type definitions, if anything:
public interface Pojo {}
public class Dog implements Pojo { int i; String s; } // Generated by framework from a spec
What I found:
Java Serialization: it is language dependent. It cannot be used with, for example, JavaScript.
Protobuf library: it is limited to a specific binary format. It is not possible to support additional binary formats. It needs the name of the class (the "class" of the message).
XStream, Simple, etc.: they are rather limited to text/XML and require the name of the class.
ASN.1: the standards are there and could be used with OBJECT IDENTIFIER and type definitions, but they are short on documentation and tutorials.
I prefer the 4th option because, among other things, it is a standard. Is there any active project that supports such requirements (especially something based on ASN.1)? Any usage examples? Does the project include codecs (DER, BER, XER, etc.) that can be selected at runtime?
Thanks

You can find several open source and commercial implementations of ASN.1 tools. These usually include:
a compiler for the schema, which will generate code in your desired programming language
a runtime library which is used together with the generated code for encoding and decoding
ASN.1 is mainly used with the standardized communication protocols of the telecom industry, so the commercial tools have very good support for the ASN.1 standard and the various encoding rules.
Here are some starter tutorials and even free e-books:
http://www.oss.com/asn1/resources/asn1-made-simple/introduction.html
http://www.oss.com/asn1/resources/reference/asn1-reference-card.html
http://www.oss.com/asn1/resources/books-whitepapers-pubs/asn1-books.html
I know that the OSS ASN.1 commercial tools (http://www.oss.com/asn1/products/asn1-products.html) will support switching the encoding rules at runtime.
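I don't have the exact OSS API at hand, so here is a purely hypothetical sketch in Java of what runtime-selectable encoding rules tend to look like (the names Asn1Coder, coderFor, and EncodingRules are made up for illustration; real toolkits ship their own coder classes):

// Purely hypothetical sketch; not the real OSS ASN.1 Java API.
public class Asn1CodecSketch {
    enum EncodingRules { BER, DER, PER, XER }

    interface Asn1Coder {
        byte[] encode(Object value);  // hypothetical
        Object decode(byte[] data);   // hypothetical
    }

    static Asn1Coder coderFor(EncodingRules rules) {
        // a real toolkit would return its generated BER/DER/PER/XER coder here
        throw new UnsupportedOperationException("supplied by the ASN.1 toolkit");
    }

    public static void main(String[] args) {
        Asn1Coder coder = coderFor(EncodingRules.PER); // encoding rules chosen at runtime
        // byte[] wire = coder.encode(dog);
        // Object dogAgain = coder.decode(wire);
    }
}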

To add to bosonix's answer, there are also Objective Systems' tools at http://www.obj-sys.com/. The documentation from both OSS and Objective Systems includes many usage examples.
ASN.1 is pretty much perfect for what you're looking for. I know of no other serialisation system that does this quite so thoroughly.
As well as a whole array of different binary encodings (ranging from the comprehensively tagged BER all the way down to the very packed-together PER), it does XML and now also JSON encodings too. These are well standardised by the ITU, so, in theory, it is fully interoperable between tool vendors, programming languages, OSes, etc.
There are other significant benefits to ASN.1. The schema language lets you define constraints on the value of message fields, or the sizes of arrays. These then get checked for you by the generated code. This is far more complete than many other serialisations. For instance, Google Protocol Buffers doesn't let you do this, meaning that you have to check the range of message fields (where applicable) in hand written code. That's tedious, error prone, and hard to maintain.
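For example, here is a small illustrative schema of my own (not taken from any standard) giving the Dog type from the question with constraints attached; the compiler-generated code would reject out-of-range values at encode/decode time:

Dog ::= SEQUENCE {
    i INTEGER (0..255),          -- value range constraint
    s UTF8String (SIZE (1..64))  -- size constraint
}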
The only other ones that do this are XSD and JSON schemas. However with those you're at the mercy of the varying quality of tools used to turn those into source code - I've not yet seen any decent ones for JSON schemas. I'm not aware of whether or not Microsoft's xsd.exe honours such constraints either.

Related

How to create a refactor tool?

I want to learn how to create refactoring tools.
As an example: I want to create migration scripts for when a library removes a deprecated function and we want to transform code to use the newer, equivalent functionality.
My idea was to use ANTLR to parse the code into an AST, use some pattern matching on this tree to modify its contents, and output the modified contents. However, from what I read, ANTLR doesn't preserve formatting in the AST, so it would be hard to get unbroken content back.
Do you have a solution that would comply with:
allows me to modify code while preserving formatting
(optionally) allows me to use AST transformations for code transformation
(optionally) can transform a variety of languages, like ANTLR
The question is not limited to one particular language; I'd be happy to hear about solutions created for different languages.
If you want a general-purpose tool to
parse source code from arbitrary languages, producing ASTs
apply procedural or, preferably, source-to-source pattern-directed rewrite rules to manipulate the ASTs
regenerate valid source code, retaining formatting and comments
I know of only two systems at present that can do this robustly.
RASCAL Metaprogramming language, a research platform
Semantic Designs' (my company) DMS Software Reengineering Toolkit (DMS)
You probably don't want to try building frameworks like this yourself; these tools both have decades of PhD level investment to make them practical.
One issue that occurs repeatedly is the mistake of thinking that having a parser (e.g., ANTLR) solves most of the problem. See my essay on Life After Parsing. A key insight is that you cannot transform "just the syntax (ASTs)" without context; you have to take into account the language semantics, and whatever framework you choose had better help you do semantic analysis to support the rewrite rules.
There are other (general purpose) program transformation systems out there. Many are research projects. Few have been used to do serious software reengineering in practice. RASCAL has been applied to some quite interesting tasks, but not in any commercial context that I know of. DMS has been used in production for over 20 years to carry out massive code base changes, including refactoring, API revision, and fully automated language migrations.
ANTLR has a TokenStreamRewriter class that is very good at preserving your source input.
It has very robust capabilities. It allows you to delete, insert, or replace text in the input stream. It actually stores up a series of pending changes, and then applies them when you ask for the modified input stream (it even allows for rolling back changes, as well as multiple sets of changes).
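To give a feel for the API, here is a minimal sketch (CalcLexer stands in for whatever lexer ANTLR generated from your grammar; it assumes whitespace goes to the hidden channel, which is also what lets the rewriter preserve formatting):

// Minimal sketch; imports from org.antlr.v4.runtime assumed.
// CalcLexer is a stand-in for your generated lexer, with whitespace
// sent to the hidden channel so it survives rewriting.
CharStream input = CharStreams.fromString("1 + 2");
CommonTokenStream tokens = new CommonTokenStream(new CalcLexer(input));
tokens.fill(); // load all tokens: "1", " ", "+", " ", "2"

TokenStreamRewriter rewriter = new TokenStreamRewriter(tokens);
rewriter.insertBefore(tokens.get(0), "(");  // queue an insertion
rewriter.replace(tokens.get(2), "-");       // queue replacing the '+'
rewriter.insertAfter(tokens.get(4), ")");   // queue another insertion

// The token stream itself is untouched; changes are applied on demand:
System.out.println(rewriter.getText());     // prints "(1 - 2)"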
A couple of examples from a recent presentation I did that touched on the Rewriter:
private void plus0(RefactorUtilContext ctx, String pName) {
    for (var match : plus0PatternA.findAll(ctx, ANY_EXPR_XPATH)) {
        var matchCtx = (AddSubExprContext) (match.getTree());
        rewriter.delete(pName, matchCtx.op, matchCtx.rhs.getStop());
    }
    for (var match : plus0PatternB.findAll(ctx, ANY_EXPR_XPATH)) {
        var matchCtx = (AddSubExprContext) (match.getTree());
        rewriter.delete(pName, matchCtx.lhs.getStart(), matchCtx.op);
    }
}

private void times1(RefactorUtilContext ctx, String pName) {
    for (var match : times1PatternA.findAll(ctx, ANY_EXPR_XPATH)) {
        var matchCtx = (MulDivExprContext) (match.getTree());
        // rewriter.delete(pName, matchCtx.op, matchCtx.rhs.getStop());
        rewriter.insertBefore(pName, matchCtx.op, "/* ");
        rewriter.insertAfter(pName, matchCtx.rhs.getStart(), " */");
    }
    for (var match : times1PatternB.findAll(ctx, ANY_EXPR_XPATH)) {
        var matchCtx = (MulDivExprContext) (match.getTree());
        // rewriter.delete(pName, matchCtx.lhs.getStart(), matchCtx.op);
        rewriter.insertBefore(pName, matchCtx.lhs.getStart(), "/* ");
        rewriter.insertAfter(pName, matchCtx.op, " */");
    }
}
TokenStreamRewriter, basically, just stores a set of instructions about how to modify your input stream, so everything about your input stream that you don't modify is, ummmm, unmodified :).
You may also wish to look into the XPath capabilities that ANTLR has. These allow you to find very specific patterns in the parse tree to locate the portions you would like to refactor. As the name suggests, the syntax is very similar to XPath for XML documents, but works on the parse tree instead of an XML DOM.
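A small sketch of that (the entry rule prog and the rule name expr are hypothetical; XPath.findAll is the actual runtime entry point):

// Sketch; imports from org.antlr.v4.runtime.tree and
// org.antlr.v4.runtime.tree.xpath assumed. 'prog' and 'expr'
// are hypothetical rule names from your grammar.
ParseTree tree = parser.prog();
for (ParseTree sub : XPath.findAll(tree, "//expr", parser)) {
    System.out.println(sub.getText()); // each subtree matching the pattern
}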
Note: all of these operate on the parse tree, not an AST (which would necessarily be of your own design, so ANTLR wouldn't know of its structure). Also, it's in the nature of ASTs to drop irrelevant information (like comments, whitespace, etc.), so they're just not a good starting point for anything where you want to preserve formatting.
I put together quite a small project for this presentation; it's on GitHub at LittleCalc. The most relevant file is LittleCalcExecutionVisitor.
You can start a REPL and test things out by running LittleCalcRepl.java

How can I extend JVM bytecode?

I want to create a programming language that compiles to its own bytecode format and a VM that interprets it. But I want the bytecode to be compatible with the JVM. I've searched for a way to insert comments into JVM bytecode so that I could parse them with my own VM, but I couldn't find any. I also tried to insert some bytes at the start and the end of the byte array, but that produced a ClassFormatError. Is there any workaround?
"compiles to its own bytecode format" and "compatible with JVM" are mutually exclusive requirements.
If you want the JVM to be able to parse your classfiles, the classfiles must strictly comply with the Java Virtual Machine Specification Chapter 4. The class File Format.
The standard way of extending the class file format is Attributes. You may invent your own attributes and include them in one or more of the following structures (see the layout sketch after this list):
ClassFile
field_info
method_info
Code
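Every attribute, standard or user-defined, shares the same layout, given in JVMS §4.7; the JVM is required to silently ignore attributes it does not recognize:

attribute_info {
    u2 attribute_name_index;    // constant pool index of the attribute's name
    u4 attribute_length;        // number of bytes in info[]
    u1 info[attribute_length];  // arbitrary payload
}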
Scala does this, so you should check that out. As mentioned in the previous answer, the standard way to insert "comments" into a classfile is with attributes. The classfile format defines certain standard attributes that are used by the JVM. However, you can also include arbitrary user-defined attributes which can contain arbitrary data. This is what Scala uses to include the metadata required by the Scala runtime.

In what way is GObject facilitating binding?

On the official website of GObject, we can read:
GObject, and its lower-level type system, GType, are used by GTK+ and most GNOME libraries to provide:
object-oriented C-based APIs and
automatic transparent API bindings to other compiled or interpreted languages
The first part seems clear to me but not the second one.
Indeed, when talking about GObject and bindings, the concept introduced is often gobject-introspection, but as far as I understand, gobject-introspection can be used to create .gir and .typelib files for any documented C library, not only for GObject-based libraries.
Therefore I wonder what makes gobject particularly binding-friendly.
as far as I understand, gobject-introspection can be used to create .gir and .typelib for any documented C library, not only for GObject-based libraries.
That's not really true in practice. You can do some very basic stuff, but you have to write the GIR by hand (instead of just running a program which scans the source code). The only ones I'm aware of are those distributed with gobject-introspection (the *.gir files; the *.c files there exist to avoid circular dependencies), and even those generally cover only a fairly small subset of the C API.
As for other features, almost everything in GObject helps… the basic idea is that bindings often need RTTI. There are types like GValue (a simple box to store a value + type information), GClosure (for callbacks), properties and signals describe themselves with GTypes, etc. If you use GObjects (instead of creating a new fundamental type) you get run-time data about inheritance and interfaces, and GObject's odd construction scheme even allows other languages to subclass types declared in C.
The reason g-ir-scanner can't really do much on non-GObject libraries is that all that information is missing. After scanning the source code looking for annotations, g-ir-scanner will actually load the compiled module and use GObject's API to grab this information (which makes cross-compiling painful). In other words, GObject-Introspection is a much smaller project than you think… a huge percentage of the data it needs it gets from the GObject API.

What is the need of JVM when you can pass the source code?

I am new to Java, and I wanted to know this: what is the need to create the .class file in Java? Can't we just pass the source code to every machine so that each machine can compile it according to its OS and hardware?
I believe it's mostly for efficiency reasons.
From Wikipedia (http://en.wikipedia.org/wiki/Bytecode):
Bytecode, also known as p-code (portable code), is a form of instruction set designed for efficient execution by a software interpreter. Unlike human-readable source code, bytecodes are compact numeric codes, constants, and references (normally numeric addresses) which encode the result of parsing and semantic analysis of things like type, scope, and nesting depths of program objects. They therefore allow much better performance than direct interpretation of source code.
(my emphasis)
And, as others have mentioned, it provides (weak) obfuscation of the source code.
The main reason for the compilation is that the virtual machines which host Java classes and run them only understand bytecode.
And since compiling a class each time into the language the virtual machine understands would be expensive, the source code is compiled into bytecode once, up front.
But we can also use compilers which compile source code directly into machine code. That's a different story, which I don't know much about.

What language is to binary, as Perl is to text?

I am looking for a scripting (or higher-level programming) language (or, e.g., modules for Python or similar languages) for effortlessly analyzing and manipulating binary data in files (e.g. core dumps), much like Perl allows manipulating text files very smoothly.
Things I want to do include presenting arbitrary chunks of the data in various forms (binary, decimal, hex), converting data from one endianness to another, etc. That is, things you would normally use C or assembly for, but I'm looking for a language which allows for writing tiny pieces of code for highly specific, one-time purposes very quickly.
Any suggestions?
Things I want to do include presenting arbitrary chunks of the data in various forms (binary, decimal, hex), converting data from one endianness to another, etc. That is, things you would normally use C or assembly for, but I'm looking for a language which allows for writing tiny pieces of code for highly specific, one-time purposes very quickly.
Well, while it may seem counter-intuitive, I found Erlang extremely well-suited for this, namely due to its powerful support for pattern matching, even on bytes and bits (called "Erlang Bit Syntax"), which makes it very easy to create even very advanced programs that deal with inspecting and manipulating data on a byte or even bit level:
Since 2001, the functional language Erlang has come with a byte-oriented datatype (called binary) and with constructs to do pattern matching on a binary.
And to quote informIT.com:
(Erlang) Pattern matching really starts to get fun when combined with the binary type. Consider an application that receives packets from a network and then processes them. The four bytes in a packet might be a network byte-order packet type identifier. In Erlang, you would just need a single processPacket function that could convert this into a data structure for internal processing. It would look something like this:

processPacket(<<1:32/big,RestOfPacket>>) ->
    % Process type one packets
    ...;
processPacket(<<2:32/big,RestOfPacket>>) ->
    % Process type two packets
    ...
So Erlang, with its built-in support for pattern matching and its functional nature, is pretty expressive; see for example the implementation of uuencode in Erlang:
uuencode(BitStr) ->
    << (X+32):8 || <<X:6>> <= BitStr >>.
uudecode(Text) ->
    << (X-32):6 || <<X:8>> <= Text >>.
For an introduction, see Bit-level Binaries and Generalized Comprehensions in Erlang. You may also want to check out some of the following pointers:
Parsing Binaries with erlang, lamers inside
More File Processing with Erlang
Learning Erlang and Adobe Flash format same time
Large Binary Data is (not) a Weakness of Erlang
Programming Efficiently with Binaries and Bit Strings
Erlang bit syntax and network programming
erlang, the language for network programming (1)
Erlang, the language for network programming Issue 2: binary pattern matching
An Erlang MIDI File Reader/Writer
Erlang Bit Syntax
Comprehending endianness
Playing with Erlang
Erlang: Pattern Matching Declarations vs Case Statements/Other
A Stream Library using Erlang Binaries
Bit-level Binaries and Generalized Comprehensions in Erlang
Applications, Implementation and Performance Evaluation of Bit Stream Programming in Erlang
Perl's pack and unpack?
Take a look at the Python bitstring module; it looks like exactly what you want :)
The Python bitstring module was written for this purpose. It lets you take arbitrary slices of binary data and offers a number of different interpretations through Python properties. It also gives you plenty of tools for constructing and modifying binary data.
For example:
>>> from bitstring import BitArray, ConstBitStream
>>> s = BitArray('0x00cf') # 16 bits long
>>> print(s.hex, s.bin, s.int) # Some different views
00cf 0000000011001111 207
>>> s[2:5] = '0b001100001' # slice assignment
>>> s.replace('0b110', '0x345') # find and replace
2 # 2 replacements made
>>> s.prepend([1]) # Add 1 bit to the start
>>> s.byteswap() # Byte reversal
>>> ordinary_string = s.bytes # Back to Python string
There are also functions for bit-wise reading and navigation in the bitstring, much like in files; in fact this can be done straight from a file without reading it into memory:
>>> s = ConstBitStream(filename='somefile.ext')
>>> hex_code, a, b = s.readlist('hex:32, uint:7, uint:13')
>>> s.find('0x0001') # Seek to next occurrence, if found
True
There are also views with different endiannesses as well as the ability to swap endianness and much more - take a look at the manual.
I'm using 010 Editor all the time to view binary files; it's especially geared to work with them.
It has an easy-to-use C-like scripting language to parse binary files and present them in a very readable way (as a tree, fields coded by color, stuff like that).
There are some example scripts to parse zipfiles and bmpfiles.
Whenever I create a binary file format, I always make a little script for 010 Editor to view the files. If you've got some header files with some structs, making a reader for binary files is a matter of minutes.
Any high-level programming language with pack/unpack functions will do. All three of Perl, Python, and Ruby can do it; it's a matter of personal preference. I wrote a bit of binary parsing in each of these and felt that Ruby was the easiest/most elegant for this task.
Why not use a C interpreter? I always used them to experiment with snippets, but you could use one to script something like you describe without too much trouble.
I have always liked EiC. It was dead, but the project has been resurrected lately. EiC is surprisingly capable and reasonably quick. There is also CINT. Both can be compiled for different platforms, though I think CINT needs Cygwin on Windows.
Python's standard library has some of what you require: the array module in particular lets you easily read parts of binary files, swap endianness, and so on, while the struct module allows for finer-grained interpretation of binary strings. However, neither is quite as rich as you require. For example, to present the same data as bytes or halfwords, you need to copy it between two arrays (the numpy third-party add-on is much more powerful for interpreting the same area of memory in several different ways), and to display some bytes in hex there's nothing much "bundled" beyond a simple loop or list comprehension such as [hex(b) for b in thebytes[start:stop]]. I suspect there are reusable third-party modules to facilitate such tasks further, but I can't point you to one...
Forth can also be pretty good at this, but it's a bit arcane.
Well, if speed is not a consideration, and you want Perl, then translate each line of binary into a line of chars: 0's and 1's. Yes, I know there are no linefeeds in binary :) but presumably you have some fixed size, e.g. a byte or some other unit, with which you can break up the binary blob.
Then just use the Perl string processing on that data :)
If you're doing binary-level processing, it is very low-level work and likely needs to be very efficient and have minimal dependencies/install requirements.
So I would go with C; it handles bytes well, and you can probably google for library packages that handle bytes.
Going with something like Erlang introduces inefficiencies, dependencies, and other baggage you probably don't want with a low-level library.