Will protobuf generate bitwise perfect copy if ran on the same input on different languages/architectures? - cross-platform

If I use the same .proto file, across several machine (arm, x86, amd64 etc.) with implementations written in different languages (c++, python, java, etc.), will the same message result in the exact same byte sequence when serialized across those different configurations?
I would like to use these bytes for hashing to ensure that the same message, when generated on a different platform, would end up with the exact same hash.

"Often, but not quite always"
The reasons you might get variance include:
it is only a "should", not a "must" that fields are written in numerical sequential order - citation, emphasis mine:
when a message is serialized its known fields should be written sequentially by field number
and it is not demanded that fields are thus ordered (it is a "must" that deserializers be able to handle out-of-order fields); this can apply especially when discussing unexpected/extension fields; if two serializations choose different field orders, the bytes will be different
protobuf can be constructed by merging two partial messages, which will by necessity cause out-of-order fields, but when re-serializing an object deserialized from a merged message, it may become normalized (sequential)
the "varint" encoding allows some small subtle ambiguity... the number 1 would usually be encoded as 0x01, but it could also be encoded as 0x8100 or 0x818000 or 0x81808080808000 - the specification doesn't actually demand (AFAIK) that the shortest version be used; I am not aware of any implementation that actually outputs this kind of subnormal form, though :)
some options are designed to be forward- and backward- compatible; in particular, the [packed=true] option on repeated primitive values can be safely toggled at any time, and libraries are expected to cope with it; if you originally serialized it in one way, and now you're serializing it with the other option: the result can be different; a side-effect of this is that a specific library could also simply choose to use the alternate representation, especially if it knows it will be smaller; if two libraries make different decisions here - different bytes
In most common cases, yes: it'll be reliable and repeatable. But this is not an actual guarantee.
The bytes should be compatible, though - it'll still have the same semantics - just not the same bytes. It shouldn't matter what language, framework, library, runtime, or processor you use. If it does: that's a bug.

Related

I want to load a YAML file, possibly edit the data, and then dump it again. How can I preserve formatting?

This question tries to collect information spread over questions about different languages and YAML implementations in a mostly language-agnostic manner.
Suppose I have a YAML file like this:
first:
- foo: {a: "b"}
- "bar": [1, 2, 3]
second: | # some comment
some long block scalar value
I want to load this file into an native data structure, possibly change or add some values, and dump it again. However, when I dump it, the original formatting is not preserved:
The scalars are formatted differently, e.g. "b" loses its quotation marks, the value of second is not a literal block scalar anymore, etc.
The collections are formatted differently, e.g. the mapping value of foo is written in block style instead of the given flow style, similarly the sequence value of "bar" is written in block style
The order of mapping keys (e.g. first/second) changes
The comment is gone
The indentation level differs, e.g. the items in first are not indented anymore.
How can I preserve the formatting of the original file?
Preface: Throughout this answer, I mention some popular YAML implementations. Those mentions are never exhaustive since I do not know all YAML implementations out there.
I will use YAML terms for data structures: Atomic text content (even numbers) is a scalar. Item sequences, known elsewhere as arrays or lists, are sequences. A collection of key-value pairs, known elsewhere as dictionary or hash, is a mapping.
If you are using Python, using ruamel will help you preserve quite some formatting since it implements round-tripping up to native structures. However, it isn't perfect and cannot preserve all formatting.
Background
The process of loading YAML is also a process of losing information. Let's have a look at the process of loading/dumping YAML, as given in the spec:
When you are loading a YAML file, you are executing some or all of the steps in the Load direction, starting at the Presentation (Character Stream). YAML implementations usually promote their most high-level APIs, which load the YAML file all the way to Native (Data Structure). This is true for most common YAML implementations, e.g. PyYAML/ruamel, SnakeYAML, go-yaml, and Ruby's YAML module. Other implementations, such as libyaml and yaml-cpp, only provide deserialization up to the Representation (Node Graph), possibly due to restrictions of their implementation languages (loading into native data structures requires either compile-time or runtime reflection on types).
The important information for us is what is contained in those boxes. Each box mentions information which is not available anymore in the box left to it. So this means that styles and comments, according to the YAML specification, are only present in the actual YAML file content, but are discarded as soon as the YAML file is parsed. For you, this means that once you have loaded a YAML file to a native data structure, all information about how it originally looked in the input file is gone. Which means that when you dump the data, the YAML implementation chooses a representation it deems useful for your data. Some implementations let you give general hints/options, e.g. that all scalars should be quoted, but that doesn't help you restore the original formatting.
Thankfully, this diagram only describes the logical process of loading YAML; a conforming YAML implementation does not need to slavishly conform to it. Most implementations actually preserve data longer than they need to. This is true for PyYAML/ruamel, SnakeYAML, go-yaml, yaml-cpp, libyaml and others. In all these implementations, the style of scalars, sequences and mappings is remembered up until the Representation (Node Graph) level.
On the other hand, comments are discarded rather early since they do not belong to an event or node (the exceptions here is ruamel which links comments to the following event, and go-yaml which remembers comments before, at and after the line that created a node). Some YAML implementations (libyaml, SnakeYAML) provide access to a token stream which is even more low-level than the Event Tree. This token stream does contain comments, however it is only usable for doing things like syntax highlighting, since the APIs do not contain methods for consuming the token stream again.
So what to do?
Loading & Dumping
If you need to only load your YAML file and then dump it again, use one of the lower-level APIs of your implementation to only load the YAML up until the Representation (Node Graph) or Serialization (Event Tree) level. The API functions to search for are compose/parse and serialize/present respectively.
It is preferable to use the Event Tree instead of the Node Graph as some implementations already forget the original order of mapping keys (due to internally using hashmaps) when composing. This question, for example, details loading / dumping events with SnakeYAML.
Information that is already lost in the event stream of your implementation, for example comments in most implementations, is impossible to preserve. Also impossible to preserve is scalar layout, like in this example:
"1 \x2B 1"
This loads as string "1 + 1" after resolving the escape sequence. Even in the event stream, the information about the escape sequence has already been lost in all implementations I know. The event only remembers that it was a double-quoted scalar, so writing it back will result in:
"1 + 1"
Similarly, a folded block scalar (starting with >) will usually not remember where line breaks in the original input have been folded into space characters.
To sum up, loading to the Event Tree and dumping again will usually preserve:
Style: unquoted/quoted/block scalars, flow/block collections (sequences & mappings)
Order of keys in mappings
YAML tags and anchors
You will usually lose:
Information about escape sequences and line breaks in flow scalars
Indentation and non-content spacing
Comments – unless the implementation specifically supports putting them in events and/or nodes
If you use the Node Graph instead of the Event Tree, you will likely lose anchor representations (i.e. that &foo may be written out as &a later with all aliases referring to it using *a instead of *foo). You might also lose key order in mappings. Some APIs, like go-yaml, don't provide access to the Event Tree, so you have no choice but to use the Node Graph instead.
Modifying Data
If you want to modify data and still preserve what you can of the original formatting, you need to manipulate your data without loading it to a native structure. This usually means that you operate on YAML scalars, sequences and mappings, instead of strings, numbers, lists or whatever structures the target programming language provides.
You have the option to either process the Event Tree or the Node Graph (assuming your API gives you access to it). Which one is better usually depends on what you want to do:
The Event Tree is usually provided as stream of events. It may be better for large data since you do not need to load the complete data in memory; instead you inspect each event, track your position in the input structure, and place your modifications accordingly. The answer to this question shows how to append items giving a path and a value to a given YAML file with PyYAML's event API.
The Node Graph is better for highly structured data. If you use anchors and aliases, they will be resolved there but you will probably lose information about their names (as explained above). Unlike with events, where you need to track the current position yourself, the data is presented as complete graph here, and you can just descend into the relevant sections.
In any case, you need to know a bit about YAML type resolution to work with the given data correctly. When you load a YAML file into a declared native structure (typical in languages with a static type system, e.g. Java or Go), the YAML processor will map the YAML structure to the target type if that's possible. However, if no target type is given (typical in scripting languages like Python or Ruby, but also possible in Java), types are deduced from node content and style.
Since we are not working with native loading because we need to preserve formatting information, this type resolution will not be executed. However, you need to know how it works in two cases:
When you need to decide on the type of a scalar node or event, e.g. you have a scalar with content 42 and need to know whether that is a string or integer.
When you need to create a new event or node that should later be loaded as a specific type. E.g. if you create a scalar containing 42, you might want to control whether that it is loaded as integer 42 or string "42" later.
I won't discuss all the details here; in most cases, it suffices to know that if a string is encoded as a scalar but looks like something else (e.g. a number), you should use a quoted scalar.
Depending on your implementation, you may come in touch with YAML tags. Seldom used in YAML files (they look like e.g. !!str, !!map, !!int and so on), they contain type information about a node which can be used in collections with heterogeneous data. More importantly, YAML defines that all nodes without an explicit tag will be assigned one as part of type resolution. This may or may not have already happened at the Node Graph level. So in your node data, you may see a node's tag even when the original node does not have one.
Tags starting with two exclamation marks are actually shorthands, e.g. !!str is a shorthand for tag:yaml.org,2002:str. You may see either in your data, since implementations handle them quite differently.
Important for you is that when you create a node or event, you may be able and may also need to assign a tag. If you don't want the output to contain an explicit tag, use the non-specific tags ! for non-plain scalars and ? for everything else on event level. On node level, consult your implementation's documentation about whether you need to supply resolved tags. If not, same rule for the non-specific tags applies. If the documentation does not mention it (few do), try it out.
So to sum up: You modify data by loading either the Event Tree or the Node Graph, you add, delete or modify events or nodes in the data you get, and then you present the modified data as YAML again. Depending on what you want to do, it may help you to create the data you want to add to your YAML file as native structure, serialize it to YAML and then load it again as Node Graph or Event Tree. From there, you can include it in the structure of the YAML file you want to modify.
Conclusion / TL;DR
YAML has not been designed for this task. In fact, it has been defined as a serialization language, assuming that your data is authored as native data structures in some programming language and from there dumped to YAML. However, in reality, YAML is used a lot for configuration, meaning that you typically write YAML by hand and then load it into native data structures.
This contrast is the reason why it is so difficult to modify YAML files while preserving formatting: The YAML format has been designed as transient data format, to be written by one application, and then to be loaded by another (or the same) application. In that process, preserving formatting does not matter. It does, however, for data that is checked-in to version control (you want your diff to only contain the line(s) with data you actually changed), and other situations where you write your YAML by hand, because you want to keep style consistent.
There is no perfect solution for changing exactly one data item in a given YAML file and leaving everything else intact. Loading a YAML file does not give you a view of the YAML file, it gives you the content it describes. Therefore, everything that is not part of the described content – most importantly, comments and whitespace – is extremely hard to preserve.
If format preservation is important to you and you can't live with the compromises made by the suggestions in this answer, YAML is not the right tool for you.
I would like to challenge the accepted answer. Whether you can preserve comments, the order of map keys, or other features depends on the YAML parsing library that you use. For starters, the library needs to give you access to the parsed YAML as a YAML Document, which is a collection of YAML nodes. These nodes can contain metadata besides the actual key/value pairs. The kinds of metadata that your library chooses to store will determine how much of the initial YAML document you can preserve. I will not speak for all languages and all libraries, but Golang's most popular YAML parsing library, go-yaml supports parsing YAML into a YAML document tree and serializing YAML document back, and preserves:
comments
the order of keys
anchors and aliases
scalar blocks
However, it does not preserve indentation, insignificant whitespace, and some other minor things. On the plus side, it allows modifying the YAML document and there's another library,
yaml-jsonpath that simplifies browsing the YAML node tree. Example:
import (
"github.com/stretchr/testify/assert"
"gopkg.in/yaml.v3"
"testing"
)
func Test1(t *testing.T) {
var n yaml.Node
y := []byte(`# Comment
t: &t
- x: 1 # anchor
a:
b: *t # alias
b: |
cccc
dddd
`)
err := yaml.Unmarshal(y, &n)
assert.NoError(t, err)
y2, _ := yaml.Marshal(&n)
assert.Equal(t, y, y2)
}

What is the best way to represent a form with hundreds of questions in a model

I am trying to design an income return tax software.
What is the best way to represent/store a form with hundreds of questions in a model?
Just for this example, I need at least 6 models (T4, T4A(OAS), T4A(P), T1032, UCCB, T4E) which possibly contain hundreds of fields.
Is it by creating hundred of fields? Storing values in a map? An Array?
One very generic approach could be XML
XML allows you to
nest your data to any degree
combine values and meta information (attributes and elements)
describe your data in detail with XSD
store it externally
maintain it easily
even combine it with additional information (look at processing instructions)
and (last but not least) store the real data in almost the same format as the modell...
and (laster but even not leaster :-) ) there is XSLT to transform your XML data into any other format (such as HTML for nice presentation)
There is high support for XML in all major languages and database systems.
Another way could be a typical parts list (or bill of materials/BOM)
This tree structure is - typically - implemented as a table with a self-referenced parentID. Working with such a table needs a lot of recursion...
It is very highly recommended to store your data type-safe. Either use a character storage format and a type identifier (that means you have to cast all your values here and there), or you use different type-safe side tables via reference.
Further more - if your data is to be filled from lists - you should define a datasource to load a selection list dynamically.
Conclusio
What is best for you mainly depends on your needs: How often will the modell change? How many rules are there to guarantee data's integrity? Are you using a RDBMS? Which language/tools are you using?
With a case like this, the monolithic aggregate is probably unavoidable (unless you can deduce common fields). I'm going to exclude RDBMS since the topic seems to focus more on lower-level data structures and a more proprietary-style solution, though that could be a very valid option that can manage all these fields.
In this case, I think it ceases to become so much about formalities as just daily practicalities.
Probably worst from that standpoint in this case is a formal object aggregating fields, like a class or struct with a boatload of data members. Those tend to be the most awkward and the most unattractive as monoliths, since they tend to have a static nature about them. Depending on the language, declaration/definition/initialization could be separate which means 2-3 lines of code to maintain per field. If you want to read/write these fields from a file, you have to write a separate line of code for each and every field, and maintain and update all that code if new fields added or existing ones removed. If you start approaching anything resembling polymorphic needs in this case, you might have to write a boatload of branching code for each and every field, and that too has to be maintained.
So I'd say hundreds of fields in a static kind of aggregate is, by far, the most unmaintainable.
Arrays and maps are effectively the same thing to me here in a very language-agnostic sense provided that you need those key/value pairs, with only potential differences in where you store the keys and what kind of algorithmic complexity is involved. Whatever you do, probably a key search in this monolith should be logarithmic time or better. 'Maps/associative arrays' in most languages tend to inherently have this quality.
Those can be far more suitable, and you can achieve the kind of runtime flexibility that you like on top of those (like being able to manage these from a file and add the fields on the fly with no pre-existing knowledge). They'll be far more forgiving here.
So if the choice is between a bunch of fields in a class and something resembling a map, I'd suggest going for a map. The dynamic nature of it will be far more forgiving for these kinds of cases and will typically far outweigh the compile-time benefits of, say, checking to make sure a field actually exists and producing a syntax error otherwise. That kind of checking is easy to add back in and more if we just accept that it will occur at runtime.
An exception that might make the field solution more appealing is if you involve reflection and more dynamic techniques to generate an object with the appropriate fields on the fly. Then you get back those dynamic benefits and flexibility at runtime. But that might be more unwieldy to initialize the structure, could involve leaning a lot more heavily on heavy-duty (and possibly very computationally-expensive) introspection and type manipulation and code generation mechanisms, and also end up with more funky code that's hard to maintain.
So I think the safest bet is the map or associative array, and a language that lets you easily add new fields, inspect existing ones, etc. with very fast turnaround. If the language doesn't inherently have that quality, you could look to an external file to dynamically add fields, and just maintain the file.

Are long variable names a waste of memory?

If I have an variable int d; // some comment.
Will that be better than int daysElapsedSinceBlahBlahBlahBlah with no comment.
It is more reabale, but will it waste memory?
You marked this as language-agnostic but the example corresponds to C-languages family. In C-like languages the name of the variable shouldn't waste memory, it's just a label for the compiler. In the resulting binary code it will be replaced by a memory address.
In general, there is no benefit in storing the name of variable inside resulting binary, the only usages I can think of is some extreme debug, reverse-engineering or some weird form of reflection. None of these are normal use-cases.
It depends entirely on the language and its implementation, but I can give you some examples based on a few languages that I know.
In C and C++, variable names are symbolic identifiers for the programmer and compiler. The compiler replaces them with memory addresses, CPU registers, or otherwise inlines their accesses to eliminate them entirely. The names do not appear in the generated code. They can appear in generated debug information, but you can omit that for release builds of a program when you don't need to do interactive step-debugging any more.
In Java, the compiler eliminates function-local variable names and rewrites the code to use stack-relative offsets. Field names (i.e., variables at class level) do persist in the compiled bytecode. This is required mainly because of the way classes are compiled individually and can be linked dynamically at run time, so the compiler cannot optimize the state of the entire program at once. The other reason field names are kept is because they are available through reflection. At run time, the virtual machine can generate native code which largely just uses memory addresses and native CPU registers in the manner of C & C++. Field names are kept in memory anyway, for reflection, and so that any additional loaded classes can be linked in. Mid-stage whole-program optimizers & obfuscators like ProGuard can make all symbolic names much shorter though.
In languages with an eval function, such as JavaScript and PHP, even local variable names have to be kept around, in theory, as they can all be accessed by name via run-time strings. A good interpreter can optimize this to use fast memory addresses in cases when it can prove that a particular variable isn't accessed by name.
In true line-by-line interpreters, like very old implementations of BASIC, variable names have to be kept around always, because the interpreter operates directly from the source code. It forgets about each line when it moves on to the next, so it has no way to track variables except by their names in the source. Since the full source code had to be kept in memory at run time, and it was often limited to 64 kB or less, variable names really did matter! Code from this era often used (and re-used) cryptically short names because of that (but also for other reasons, such as the way BASIC code was sometimes printed in magazines to be typed in, and these platforms did not have particularly nice keyboards or editors.)
Unless you're programming for an interpreter from the 1980s or earlier, identifier names are so cheap in any case, that you should not worry about it. Chose names that are long enough to be understandable, and short enough to be quick to read. Let the compiler worry about the rest, or run an optimizer or minifier on the code after it's written.
Variable names never take up memory. At least not nearly enough to even start worrying about it. While some language implementation will store the variable names somewhere (sometimes the language even requires this), the space they take up is absolutely tiny compared to everything else that's flying around. Just use whatever is best by other metrics (readability, conventions, etc.).

is there an ocaml library store/use data structure on disk

like bdb. However, I looked at the ocaml-bdb, seems like it's made to store only string. My problem is I have arrays that store giant data. Sure, I can serialize them into many files, or encode/decode my data and put them on database or those key-value db things, which is my last resort. I'm wondering if there's a better way.
The HDF4 / HDF5 file format might suit your needs. See http://forge.ocamlcore.org/projects/ocaml-hdf/
In addition to the HDF4 bindings mentioned by jrouquie there are HDF5 bindings available (http://opam.ocaml.org/packages/hdf5/). Depending on the type of data you're storing there are bindings to GDAL (http://opam.ocaml.org/packages/gdal/).
For data which can fit in a bigarray you also have the option of memory mapping a large file on disk. See https://caml.inria.fr/pub/docs/manual-ocaml/libref/Bigarray.Genarray.html#VALmap_file for example. While it ties you to a rather strict on-disk format, it does make it relatively simple to manipulate arrays which are larger than the available RAM.
there was an ocaml BerkeleyDB wrapper in the past:
OCamlDB
Apparently someone looked into it recently:
recent patch for OCamlDB
However, the GDAL bindings from hcarty are probably production ready and in intensive usage somewhere.
Also, there are bindings for dbm in opam: dbm and cryptodbm
HDF5 is prolly the answer, but given the question is somewhat vague, another solution is possible.
Disclaimer: I don't know ocaml (but I knew caml-light) and I know berkeley database (AKA. bsddb (AKA bdb)).
However, I looked at the ocaml-bdb, seems like it's made to store only string.
That maybe true in ocaml-bdb but in reality it stores bytes. I am not sure about your case, because in Python2 there was no difference between bytes and strings of unicode chars. It's until recently that Python 3 got a proper byte type and the bdb bindings take and spit bytes. That said, the difference is subtile but you'd rather work with bytes because that what bdb understand and use.
My problem is I have arrays that store giant data. Sure, I can serialize them into many files, or encode/decode my data and put them on database
or use those key-value db things, which is my last resort.
I'm wondering if there's a better way.
It depends on you need and how the data looks.
If the data can all stay in memory, you'd rather dump memory to a file and load it back.
If you need to share than data among several architectures or Operating system you'd rather use a serialisation framework like HDF5. Remember is that HDF5 doesn't handle circular references.
If the data can not stay all in memory, then you need to use something like bdb (or wiredtiger).
Why bdb (or wiredtiger)
Simply said, several decades of work have gone into:
splitting data
storing it on disk
retrieve data
As fast as possible.
wiredtiger is the successor of bdb.
So yes you could split the files yourself et al. but that will require a lot of work. Only specialized compagnies do that (bloomberg included...), among people that manage themself all the above there is the famous postgresql, mariadb, google and algolia.
ordered key value stores like wiredtiger and bdb use similar algorithm to higher level databases like postgresql and mysql or specialized one like lucene/solr or sphinx ie. mvcc, btree, lsm, PSSI etc...
MongoDB since 3.2 use wiredtiger backend for storing all the data.
Some people argue that key-value store are not good at storing relational data, that said several project started doing distributed databases on top of key value stores. This is a clue that it's useful. E.g. FoundationDB or CockroachDB.
The idea behind key-value stores is to deliver a generic framework for:
splitting data
storing it on disk
retrieve data
As fast as possible, giving some guarantees (like ACID) and other nice to haves (like compression or cryptography).
To take advantage of the power offer by those libraries. You need to learn about key-value composition.

Making a file format extensible

I'm writing a particular serialisation system. The first version works well. It's a hierarchial string-key, data-value system. So to get a particular value, you navigate to a particular node and say getInt("some key") etc. etc.
My issue with the current system is that the file size gets quite large very quickly.
I'm going to combat this by adding a string table. The issue with this is that I can't think of a way to support the old system. All I have is a file identifier which is 32 bits long.
I can change the file identifier, but everytime I make another change to the format, I'll need to change the identifier again.
What's an elegant way to implement new features while still supporting the old features?
I've studied the PNG format and creating chunks seems like a good way to go.
Is there any other advice you can give me on chunk dependencies and so forth?
If you need a binary format, look at Protocol Buffers, which Google uses internally for RPCs as well as long-term serialization of records. Each field of a protocol buffer is identified by an integer ID. Old applications ignore (and pass through) the fields that they don't understand, so you can safely add new fields. You never reuse deprecated field IDs or change the type of a field.
Protocol buffers support primitive types (bool, int32, int64, string, byte arrays) as well as repeated and even recursively nested messages. Unfortunately they don't support maps, so you have to turn a map into a list of (key, value).
Don't spend all your time fretting about serialization and deserialization. It's not as fun as designing protobufs.