What's your data serialisation format of choice?

I'm interested to see who favours more obscure data serialisation formats over the more obvious ones (JSON, XML and YAML). What do you tend to use? What syntax do you prefer?

It really depends on the requirements:
Do you need portability? If so, between which platforms?
Is speed more important than size, or vice versa?
Is it important to use some sort of international standard container format (such as XML, even if the details aren't standardised)?
What sort of backward/forward compatibility do you need?
Personally I'm a fan of Protocol Buffers, but then I'm biased as not only a Google employee, but one who's ported PB to C#...

Some are very fond of ASN.1 (you know who you are).
It is less human readable than XML, but more compact.
An example, after bit-encoding:
30 13 02 01 05 16 0e 41 6e 79 62 6f 64 79 20 74 68 65 72 65 3f
This assumes the sender and receiver already know the structure of the data.
(Before bit-encoding it is:
myQuestion FooQuestion ::= {
trackingNumber 5,
question "Anybody there?"
}
)
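For the curious, the tag-length-value (TLV) structure of those bytes can be picked apart by hand. Here is a small Java sketch that decodes just this one example (hand-rolled for illustration, not a general BER decoder):

public class BerExample {
    public static void main(String[] args) {
        // The encoded FooQuestion from above.
        int[] der = {
            0x30, 0x13,                                            // SEQUENCE, 19 content bytes
            0x02, 0x01, 0x05,                                      // INTEGER, 1 byte, value 5
            0x16, 0x0e, 0x41, 0x6e, 0x79, 0x62, 0x6f, 0x64, 0x79,  // IA5String, 14 bytes, "Anybody
            0x20, 0x74, 0x68, 0x65, 0x72, 0x65, 0x3f               //  there?"
        };

        // Skip the outer SEQUENCE header (tag + length), then read the two inner TLVs.
        int pos = 2;
        int trackingNumber = der[pos + 2];                         // tag 0x02, length 1, then the value
        pos += 2 + der[pos + 1];

        StringBuilder question = new StringBuilder();
        int strLen = der[pos + 1];                                 // tag 0x16, then the length
        for (int i = 0; i < strLen; i++) question.append((char) der[pos + 2 + i]);

        System.out.println(trackingNumber + " / " + question);     // prints: 5 / Anybody there?
    }
}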

Related

Choosing serialization frameworks

I was reading about the downsides of using Java serialization and the need to move to a serialization framework. There are many frameworks, such as Avro, Parquet, Thrift and Protobuf.
The question is: which framework addresses what, and what are the parameters to consider when choosing a serialization framework?
I would like to get hands-on with a practical use case and compare/choose serialization frameworks based on the requirements.
Can somebody please assist on this topic?
There are a lot of factors to consider. I'll go through some of the important ones.
0) Schema First or Code First
If you have a project that will involve different languages, code-first approaches are likely to be problematic. It's all very well having a Java class that can be serialised, but it can be a nuisance if that has to be deserialised in C.
Generally I favour schema-first approaches, just in case.
1) Inter-object Demarcation
Some serialisations produce a byte stream that makes it possible to see where one object stops and another begins. Others do not.
So, if you have a message transport / data store that will separate out batches of bytes for you, e.g. ZeroMQ or a database field, then you can use a serialisation that doesn't demarcate messages. Examples include Google Protocol Buffers. With demarcation done by the transport / store, the reader can get a batch of bytes knowing for sure that it encompasses one object, and only one object.
If your message transport / data store doesn't demarcate between batches of bytes, e.g. a network stream or a file, then you either invent your own demarcation markers or use a serialisation that demarcates for you. Examples include ASN.1 BER and XML.
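As an illustration of rolling your own demarcation when the transport is a plain byte stream, here is a minimal Java sketch (the framing scheme and class name are my own, not from any particular library): each message is written as a 4-byte length prefix followed by the payload, so the reader always knows where one object ends and the next begins.

import java.io.*;

public class LengthPrefixFraming {
    // Write one message: 4-byte big-endian length, then the payload bytes.
    static void writeFrame(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Read one message back by honouring the length prefix.
    static byte[] readFrame(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] payload = new byte[length];
        in.readFully(payload);
        return payload;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        writeFrame(out, "first object".getBytes("UTF-8"));
        writeFrame(out, "second object".getBytes("UTF-8"));

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buffer.toByteArray()));
        System.out.println(new String(readFrame(in), "UTF-8")); // first object
        System.out.println(new String(readFrame(in), "UTF-8")); // second object
    }
}

(For what it's worth, the Java Protocol Buffers library provides writeDelimitedTo / parseDelimitedFrom to do essentially this for you.)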
2) Canonical
This is a property of a serialisation which means that the serialised data describes its own structure. In principle the reader of a canonical message doesn't have to know up front what the message structure is; it can simply work that out as it reads the bytes (even if it doesn't know the field names). This can be useful in circumstances where you're not entirely sure where the data is coming from. If the data is not canonical, the reader has to know in advance what the object structure was, otherwise the deserialisation is ambiguous.
Examples of canonical serialisations include ASN.1 BER, ASN.1 canonical PER and XML. Ones that aren't include ASN.1 uPER and possibly Google Protocol Buffers (I may have that wrong).
AVRO does something different - the data schema is itself part of the serialised data, so it is always possible to reconstruct the object from arbitrary data. As you can imagine the libraries for this are somewhat clunky in languages like C, but rather better in dynamic languages.
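To make that concrete, here is a hedged Java sketch using the Apache Avro library's container-file reader. Because the writer's schema is stored in the file itself, the reader can recover it without any prior agreement (the file name records.avro is just a placeholder for this illustration):

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroSelfDescribingRead {
    public static void main(String[] args) throws Exception {
        // The Avro container file embeds the writer's schema alongside the data.
        DataFileReader<GenericRecord> reader = new DataFileReader<>(
                new File("records.avro"), new GenericDatumReader<GenericRecord>());

        // No schema was supplied up front; it is read straight out of the file.
        Schema writerSchema = reader.getSchema();
        System.out.println("Schema recovered from the data: " + writerSchema);

        while (reader.hasNext()) {
            GenericRecord record = reader.next();   // generic, schema-driven access
            System.out.println(record);
        }
        reader.close();
    }
}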
3) Size and Value Constraints
Some serialisation technologies allow the developer to set constraints on the values of fields and the sizes of arrays. The intention is that code generated from a schema file incorporating such constraints will automatically validate objects on serialisation and on deserialisation.
This can be extremely useful - free, schema-driven content inspection done automatically. It's very easy to spot out-of-specification data.
This is extremely useful in large, heterogeneous projects (lots of different languages in use), as the single source of truth about what's valid and what's not is the schema, and only the schema, and it is enforced automatically by the auto-generated code. Developers can't ignore or get round the constraints, and when the constraints change everyone can't help but notice.
Examples include ASN.1 (usually done pretty well by tool sets), XML (not often done properly by free / cheap toolsets; MS's xsd.exe purposefully ignores any such constraints) and JSON (down to object validators). Of these three, ASN.1 has by far the most elaborate constraints syntax; it's really very powerful.
An example that doesn't: Google Protocol Buffers. In this regard GPB is extremely irritating, because it doesn't have constraints at all. The only way of having value and size constraints is either to write them as comments in the .proto file and hope developers read them and pay attention, or to take some other sort of non-source-code approach. With GPB being aimed very heavily at heterogeneous systems (literally every language under the sun is supported), I consider this a very serious omission, because value / size validation code has to be written by hand for each language used in a project. That's a waste of time. Google could add syntactical elements to .proto and the code generators to support this without changing wire formats at all (it's all in the auto-generated code).
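To illustrate the kind of boilerplate that has to be written by hand when the schema language can't express constraints, here is a small Java sketch (the Reading message and its limits are invented for this example, not taken from any real .proto):

import java.util.List;

public class ReadingValidator {
    // Invented example message: in a real project this would be a generated GPB class.
    static class Reading {
        int sensorId;            // supposed to be 0..15
        List<Integer> samples;   // supposed to hold 1..16 values, each 0..1023
    }

    // Hand-written checks that a constraint-aware schema compiler could have generated for us.
    static void validate(Reading r) {
        if (r.sensorId < 0 || r.sensorId > 15)
            throw new IllegalArgumentException("sensorId out of range: " + r.sensorId);
        if (r.samples == null || r.samples.isEmpty() || r.samples.size() > 16)
            throw new IllegalArgumentException("samples must hold 1..16 values");
        for (int sample : r.samples)
            if (sample < 0 || sample > 1023)
                throw new IllegalArgumentException("sample out of range: " + sample);
    }
}

Every language in the project needs its own copy of checks like these, and nothing stops them drifting out of step with the comments in the .proto file.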
4) Binary / Text
Binary serialisations will be smaller, and probably a bit quicker to serialise / deserialise. Text serialisations are more debuggable. But it's amazing what can be done with binary serialisations. For example, one can easily add ASN.1 decoders to Wireshark (you compile them up from your .asn schema file using your ASN.1 tools), et voila - on the wire decoding of programme data. The same is possible with GPB I should think.
ASN.1 uPER is extremely useful in bandwidth constrained situations; it automatically uses the size / value constraints to economise in bits on the wire. For example, a field valid between 0 and 15 needs only 4 bits, and that's what uPER will use. It's no coincidence that uPER features heavily in protocols like 3G, 4G, and 5G too I should think. This "minimum bits" approach is a whole lot more elegant than compressing a text wireformat (which is what's done a lot with JSON and XML to make them less bloaty).
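As a rough illustration of the "minimum bits" idea (my own sketch of the principle, not uPER itself): the number of bits a constrained integer needs is just the ceiling of log2 of the number of allowed values, and packing such fields back to back gives the savings uPER gets automatically.

public class MinimumBits {
    // Bits needed to represent a value constrained to lo..hi inclusive.
    static int bitsNeeded(long lo, long hi) {
        long range = hi - lo + 1;
        return 64 - Long.numberOfLeadingZeros(range - 1);   // ceil(log2(range)); 0 for a single allowed value
    }

    public static void main(String[] args) {
        System.out.println(bitsNeeded(0, 15));    // 4 bits, as in the 0..15 example above
        System.out.println(bitsNeeded(16, 31));   // 4 bits: only the offset from 16 is encoded
        System.out.println(bitsNeeded(0, 1023));  // 10 bits
    }
}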
5) Values
This is a bit of an oddity. In ASN.1 a schema file can define both the structure of objects and also values of objects. With the better tools you end up with (in your C++, Java, etc. source code) classes, plus pre-defined objects of those classes already filled in with values.
Why is that useful? Well, I use it a lot for defining project constants, and to give access to the limits on constraints. For example, suppose you'd got an array field with a valid length of 16 in a message. You could have a literal 16 in the field constraint, or you could cite the value of an integer value object in the constraint, with that integer also being available to developers.
--ASN.1 value that, in good tools, is built into the
--generated source code
arraySize INTEGER ::= 16
--A SET that has an array of integers that size
MyMessage ::= SET
{
field [0] SEQUENCE (SIZE(arraySize)) OF INTEGER
}
This is really handy in circumstances where you want to loop over that constraint, because the loop can be
// arraySize is an integer in the auto-generated code built from the .asn schema file
for (int i = 0; i < arraySize; i++) {
    // do things with MyMessage.field[i]
}
Clearly this is fantastic if the constraint ever needs to be changed, because the only place it has to be changed is the schema, followed by a project recompile (where every place it's used will pick up the new value). Better still, if it's renamed in the schema file, a recompile identifies everywhere in the project it was used (because the developer-written source code that uses it is still using the old name, which is now an undefined symbol --> compiler errors).
ASN.1 constraints can get very elaborate. Here's a tiny taste of what can be done. This is fantastic for system developers, but is pretty complicated for the tool developers to implement.
arraySize INTEGER ::= 16
minSize INTEGER ::= 4
maxVal INTEGER ::= 31
minVal INTEGER ::= 16
oddVal INTEGER ::= 63
MyMessage2 ::= SET
{
field_1 [0] SEQUENCE (SIZE(arraySize)) OF INTEGER, -- 16 elements
field_2 [1] SEQUENCE (SIZE(0..arraySize)) OF INTEGER, -- 0 to 16 elements
field_3 [2] SEQUENCE (SIZE(minSize..arraySize)) OF INTEGER, -- 4 to 16 elements
field_4 [3] SEQUENCE (SIZE(minSize<..<arraySize)) OF INTEGER, -- 5 to 15 elements
field_5 [4] SEQUENCE (SIZE(minSize<..<arraySize)) OF INTEGER(0..maxVal), -- 5 to 15 elements valued 0..31
field_6 [5] SEQUENCE (SIZE(minSize<..<arraySize)) OF INTEGER(minVal..maxVal), -- 5 to 15 elements valued 16..31
field_7 [6] SEQUENCE (SIZE(minSize<..<arraySize)) OF INTEGER(minVal<..maxVal), -- 5 to 15 elements valued 17..31
field_8 [7] SEQUENCE (SIZE(arraySize)) OF INTEGER(minVal<..<maxVal), -- 16 elements valued 17..30
field_9 [8] INTEGER (minVal..maxVal AND oddVal), -- valued 16 to 31, and also 63
f8_indx [10] INTEGER (0..<arraySize) -- index into field 8, constrained to be within the bounds of field 8
}
So far as I know, only ASN.1 does this. And then it's only the more expensive tools that actually pick these elements out of a schema file. This makes it tremendously useful in a large project, because literally everything related to data, its constraints and how to handle it is defined in the .asn schema, and nowhere else.
As I said, I use this a lot, for the right type of project. Once one has got it pervading an entire project, the amount of time and risk saved is fantastic. It changes the dynamics of a project too; one can make late changes to a schema knowing that the entire project picks those up with nothing more than a recompile. So, protocol changes late in a project go from being high risk to something you might be content to do every day.
6) Wireformat Object Type
Some serialisation wireformats will identify the type of an object in the wireformat byte stream. This helps the reader in situations where objects of many different types may arrive from one or more sources. Other serialisations won't.
ASN.1 varies from wireformat to wireformat (it has several, including a few binary ones as well as XML and JSON). ASN.1 BER uses type, length and value fields in its wireformat, so it is possible for the reader to peek at an object's tag up front and decode the byte stream accordingly. This is very useful.
Google Protocol Buffers doesn't quite do the same thing, but if all the message types in a .proto are bundled up into one final oneof, and it's only ever that which is serialised, then you can achieve the same thing.
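As a sketch of the general idea (my own hand-rolled envelope, not the GPB oneof mechanism itself): prefix every serialised object with a type tag, and the reader can peek at the tag and dispatch accordingly.

import java.io.*;

public class TypeTaggedEnvelope {
    // Invented type tags for this illustration.
    static final byte TYPE_ORDER = 1;
    static final byte TYPE_INVOICE = 2;

    static void write(DataOutputStream out, byte type, byte[] payload) throws IOException {
        out.writeByte(type);          // the reader can peek at this before decoding
        out.writeInt(payload.length);
        out.write(payload);
    }

    static void readAndDispatch(DataInputStream in) throws IOException {
        byte type = in.readByte();
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);
        switch (type) {
            case TYPE_ORDER:   System.out.println("decode as Order: " + new String(payload, "UTF-8")); break;
            case TYPE_INVOICE: System.out.println("decode as Invoice: " + new String(payload, "UTF-8")); break;
            default:           System.out.println("unknown type " + type);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        write(out, TYPE_ORDER, "order #42".getBytes("UTF-8"));
        write(out, TYPE_INVOICE, "invoice #7".getBytes("UTF-8"));

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        readAndDispatch(in);
        readAndDispatch(in);
    }
}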
7) Tools Cost
ASN.1 tools range from very, very expensive (and really good), to free (and less good). A lot of others are free, though I've found that the best XML tools (paying proper attention to value / size constraints) are quite expensive too.
8) Language Coverage
If you've heard of it, it's likely covered by tools for lots of different languages. If not, less so.
The good commercial ASN.1 tools cover C/C++/Java/C#. There are some free C/C++ ones of varying completeness.
9) Quality
It's no good picking up a serialisation technology if the quality of the tools is poor.
In my experience, GPB is good (it generally does what it says it will). The commercial ASN.1 tools are very good, eclipsing GPB's toolset comprehensively. Avro works. I've heard of some occasional problems with Cap'n Proto, but having not used it myself you'd have to check that out. XML works with good tools.
10) Summary
In case you can't tell, I'm quite a fan of ASN.1.
GPB is incredibly useful too for its widespread support and familiarity, but I do wish Google would add value / size constraints to fields and arrays, and also incorporate a value notation. If they did this it'd be possible to have the same project workflow as can be achieved with ASN.1. If Google added just these two features, I'd consider GPB to be pretty well nigh on "complete", needing only an equivalent of ASN.1's uPER to finish it off for those people with little storage space or bandwidth.
Note that quite a lot of it is focused on what a project's circumstances are, as well as how good / fast / mature the technology actually is.

Generalizing work orders

Hello stackoverflowians,
I am working on designing tables for work orders.
The problem:
There are different work order models (from now on called WOMs).
The WOMs share some attributes (Num, Date, Description, etc.).
The WOMs have details such as:
Sectors on which the work is done.
Some WOMs use storage tanks instead of sectors (products are prepared in storage tanks).
Products and their quantities (plus, optionally, some info on the product), and to which sector they are applied.
Human resources which worked on the WO.
Materials used on the work order.
... etc.
What is needed
Design tables for the work orders and their details, of course.
They want to know how resources were spent.
Design queries to retrieve all kinds of info.
Constraints
Simple presentation for the end users.
Generalizing the work order models.
What has been done
Designed all work orders and their details as a hierarchy, starting from the work order Num as the root node.
WorkOrderTable (ID, ParentID, Type, Value)
Example of a work order (see: Transform hierarchical data into flat table):
ID  ParentID  Type      Value
38  0         Num       327
39  38        Sector    21
40  38        Sector    22
43  40        Product   NS
44  40        Product   MS
50  40        Temp      RAS
48  44        Quantity  60
47  43        Quantity  25
41  39        Product   ARF
42  39        Product   BRF
49  39        Temp      RAS
51  39        Cible     Acarien A.
46  42        Quantity  30
52  42        Cible     Acarien B.
45  41        Quantity  20
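As a rough sketch of how this adjacency-list (ID, ParentID, Type, Value) layout can be turned back into a hierarchy in application code (class and field names here are mine, purely for illustration, not taken from the question's actual system):

import java.util.*;

public class WorkOrderTree {
    // One row of WorkOrderTable (ID, ParentID, Type, Value).
    record Row(int id, int parentId, String type, String value) {}

    // Group rows by ParentID, then print the tree starting from the root rows (ParentID = 0).
    static void printTree(List<Row> rows) {
        Map<Integer, List<Row>> byParent = new HashMap<>();
        for (Row r : rows) byParent.computeIfAbsent(r.parentId(), k -> new ArrayList<>()).add(r);
        printChildren(byParent, 0, "");
    }

    static void printChildren(Map<Integer, List<Row>> byParent, int parentId, String indent) {
        for (Row r : byParent.getOrDefault(parentId, List.of())) {
            System.out.println(indent + r.type() + " = " + r.value());
            printChildren(byParent, r.id(), indent + "  ");
        }
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row(38, 0, "Num", "327"),
            new Row(39, 38, "Sector", "21"),
            new Row(41, 39, "Product", "ARF"),
            new Row(45, 41, "Quantity", "20"));
        printTree(rows);   // prints Num = 327, its Sector, that Sector's Product, and its Quantity
    }
}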
The Question
Is what I am doing good/efficient and easy to maintain and work with, or are there other ideas?
UPDATE I: More details
Products aren't changing much; about 50 active ones [products change over time, need to keep track of versions].
Sectors: about 40 (fixed land areas).
People: normal HR table.
How big is a typical WOM?
About 15 attributes (3 of them important and shared by all WOMs, the others a little less so).
About 5 or more details, sharing: Product, Sector, People and other describing info like the quantity of the product.
WOMs are fixed for now, but I am worried about them changing in the future (or new ones appearing).
Versioning isn't a requirement right now, but adding it is a plus.
I am planning on using different tables for the participants (sectors, products ...).
The meta-data / data conflict is what this design dilemma is about.
Consider that any WOM is defined by 3 parts:
The Work Order General Info (Num, Date, ...)
The Sectors [other WOMs use tank storage] in which the jobs are done.
The Resources to complete the job: products, people, machines ...
State of the design
Specific tables for participants: sectors, people, machines...
Meta-data table (ID, meta-data, lvl). Example:
Sector, 1 (directly on the WO)
Tank Storage, 1
Product, 2 (can be part of a sector job, not directly on the WO)
Work Order table (ID, parentID, metadataID, valueID); the valueID is taken from the participants table.
Concerning XML, I have little to no information about how to store or manipulate it.
Without knowing any numbers or further details about your needs, no good advice is possible. Here are some questions that come to mind:
How many users?
How many products/locations/sectors/people...?
Is this changing data?
How many WOMs?
How big is one typical WOM?
Is it a plain tree hierarchy?
If not: might there be alternative routes, circles, islands?
Are these WOMs fixed or do they change?
If changing: Do you need versioning?
It looks like you are trying to re-invent a professional ERP system. As Bostwick told you already, you should rather think about using an existing one...
Just some general hints:
Do not use the WOM-storage for (meta) data (only IDs / foreign key)
Try to draw a sharp border between working data and meta data
Use a dedicated table for each of your participants (sectors, products...)
A WOM might be better placed within XML
Read about (Finite) State Machines
Read about state pattern
Read about Business-Process-Modelling and Workflows
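Picking up the state machine hint above: a work order's lifecycle can be modelled as a small finite state machine. A minimal Java sketch (the states and transitions are invented for illustration, not taken from the question):

public class WorkOrderLifecycle {
    // Invented states for illustration; a real WOM would define its own.
    enum State { DRAFT, SCHEDULED, IN_PROGRESS, DONE, CANCELLED }

    // Allowed transitions form a simple finite state machine.
    static boolean canMove(State from, State to) {
        switch (from) {
            case DRAFT:       return to == State.SCHEDULED || to == State.CANCELLED;
            case SCHEDULED:   return to == State.IN_PROGRESS || to == State.CANCELLED;
            case IN_PROGRESS: return to == State.DONE || to == State.CANCELLED;
            default:          return false;   // DONE and CANCELLED are terminal
        }
    }

    public static void main(String[] args) {
        System.out.println(canMove(State.DRAFT, State.SCHEDULED));   // true
        System.out.println(canMove(State.DONE, State.IN_PROGRESS));  // false
    }
}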
I think if you're looking for design advice, you should go to a meta Stack site, i.e. Code Review Stack Exchange.
That being said, you're asking for advice on design, but only giving abstract information. The scale of a system, the amount of CRUD expected, and several other factors need to be considered in the design. Without the details and targets it's really hard to answer your question. There are trade-offs with different approaches. It may even be advisable to use a NoSQL solution.
That being said, I would suggest not building your own ERP system, and instead looking to buy one from a vendor that is industry-specific, and applying your customization to it.
It's very expensive to write your own system; keeping it updated, adding security, and a lot of other features make it a worthwhile business decision to purchase from a software vendor.
If you're just looking to gain more experience by writing this, I would suggest browsing GitHub, and the previously mentioned Stack Exchange.

What is the typical alphabet size of Finite State Machines?

Not quite sure if this is the correct forum, but it was suggested at Theoretical Computer Science that I move it here...
What is the typical alphabet size of Finite State Machines?
I am currently busy implementing a high-performance FA library and need to make some design considerations before continuing. My state space will be in the order of 2 147 483 647 (Integer.MAX_VALUE), which I feel is more than enough, even for non-general use. Now, all that remains is the alphabet space.
Is there any merit in assuming that the alphabet would usually only consist of all displayable characters (in which case a symbol can be stored as a byte, which would result in really good performance)? Or should alphabet symbols rather be translated into Strings, so that you have alphabet labels instead? In that case I would need to keep a Map that translates a String into either an int, short or byte, depending on how large I want to make it.
Really, the alphabet of a finite state machine is a mathematical 'set' of any type. There is nothing restricting the content of the set; it could be 1s and 0s, A-Z, or apples-oranges. There is no 'typical' FSM alphabet size per se. Do you have a user in mind for your library?
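One way to handle the "alphabet labels" idea from the question while keeping transition tables compact is to intern each string label into a small integer code. A minimal Java sketch of that idea (names are mine, purely illustrative):

import java.util.*;

public class Alphabet {
    private final Map<String, Integer> codes = new HashMap<>();
    private final List<String> labels = new ArrayList<>();

    // Return the existing code for a symbol label, or assign the next free one.
    int intern(String label) {
        return codes.computeIfAbsent(label, l -> { labels.add(l); return labels.size() - 1; });
    }

    String labelOf(int code) { return labels.get(code); }
    int size() { return labels.size(); }

    public static void main(String[] args) {
        Alphabet a = new Alphabet();
        int apple = a.intern("apple");
        int orange = a.intern("orange");
        System.out.println(apple + " " + orange + " " + a.intern("apple"));  // 0 1 0
        // Transition tables can now be indexed by these dense int codes instead of by String.
    }
}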

Saving Double.MinValue in SQLServer

Using a TSQL update command against a SQLServer database, how can I update a column of type FLOAT with the smallest possible double value?
The smallest possible double value in hex notation being 3ff0 0000 0000 0001
(http://en.wikipedia.org/wiki/Double_precision)
Whatever it is you need this for, I suggest you consider alternatives that don't require assumptions about SQL Server's FLOAT type. Unfortunately, SQL Server is rather flaky about IEEE 754 compliance. For example, see these newsgroup threads. Also note that SQL Server's behavior in this regard has changed between versions and may well change again without warning. Without giving all the gory details, the smallest value you can assign directly to a FLOAT is not necessarily the smallest value a FLOAT can contain (without complaining). Some of SQL Server's flakiness also revolves around what IEEE calls "denormalized" floating point numbers, consideration of which is important if you want "smallest" to have a precise meaning.
Sorry not to answer your question, but I don't think much good can come from answers that help you head further down the rocky path you're on.
Don't. Use DECIMAL to avoid losing the value, or BINARY(8) if you want to store it in hex, as per the article.
Seriously: depending on the CPU, the wind direction and the moon phase, it's highly likely you won't get the same value back when you manage to convert it.
And what do you mean by "smallest"? 3ff0 0000 0000 0001 = 1.0000000000000002
And, is it hex in the client but you want float in the db?
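For what it's worth, the bit pattern the question cites can be checked in a couple of lines on the client side; a small Java sketch, just to illustrate the point made above about what that hex value actually is:

public class DoubleBits {
    public static void main(String[] args) {
        // The hex pattern from the question is not the smallest double; it is the
        // smallest double strictly greater than 1.0.
        double fromQuestion = Double.longBitsToDouble(0x3FF0000000000001L);
        System.out.println(fromQuestion);          // 1.0000000000000002

        // The actual extremes of the double type, for comparison.
        System.out.println(Double.MIN_VALUE);      // 4.9E-324  (smallest positive, denormalized)
        System.out.println(-Double.MAX_VALUE);     // -1.7976931348623157E308 (most negative)
    }
}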

[My]SQL VARCHAR Size and Null-Termination

Disclaimer: I'm very new to SQL and databases in general.
I need to create a field that will store a maximum of 32 characters of text data. Does "VARCHAR(32)" mean that I have exactly 32 characters for my data? Do I need to reserve an extra character for null-termination?
I conducted a simple test and it seems that this is a WYSIWYG buffer. However, I wanted to get a concrete answer from people who actually know what they're doing.
I have a C[++] background, so this question is raising alarm bells in my head.
Yes, you have 32 characters at your disposal. SQL does not concern itself with null-terminated strings the way some programming languages do.
Your VARCHAR specification is the maximum size of your data, so in this case 32 characters. However, VARCHAR is a variable-length type, so the actual physical storage used is only the size of your data, plus one or two bytes.
If you put a 10-character string into a VARCHAR(32), the physical storage will be 11 or 12 bytes (the manual will tell you the exact formula).
However, when MySQL is dealing with result sets (i.e. after a SELECT), 32 bytes will be allocated in memory for that field for every record.