I'm hoping someone could illustrate a common use case for the Microsoft Bond runtime schemas (SchemaDef). I understand these are used when schema definitions are not known at compile time, but if the shape of an object is fluid and changes frequently, what benefits might a runtime-generated schema provide?
My use case is that the business user is in control of the shape of an object (via a rules engine). They could conceivably do all sorts of things that could break our backward compatibility (for example, invert the order of fields on the object). If we plan on persisting all the object versions that the user created, is there any way to manage backward/forward compatibility using Bond runtime schemas? I presume not, since if they invert from this:
0: int64 myInt;
1: string myString;
to this:
0: string myString;
1: int64 myInt;
I'd expect a runtime error, which implies that managing the object with runtime schemas wouldn't provide much help to me.
What would be a use case where a runtime schema would in fact be useful?
Thank you!
Some of the uses for runtime schemas are:
with the Simple Binary protocol to handle schema changes
schema validation/evolution
rendering a struct in a GUI
custom mapping from one struct to another
Your case feels like schema validation: you could proactively reject a schema that would not be compatible. I worked on a system that used Bond under the hood and took this approach. There was an explicit "change the schema of this entity" operation that validated whether the two schemas were compatible with each other.
I don't know the data flow in your system, so such validation might not be possible. In that case, you could use the runtime schemas, along with some rules provided by the business users, to convert between different shapes.
Simple Binary
When deserializing from Simple Binary, the reader must know the exact schema that the writer used, otherwise it has no way to interpret the bytes, resulting in potentially silent data corruption.
Such corruption can happen if the schema undergoes the following change:
// starting struct
struct Foo
{
0: uint8 f1;
1: uint16 f2;
}
The Simple Binary serialized representation of Foo { f1: 1, f2: 2 } is 0x01 0x02 0x00.
Let's now change the schema to this:
// changed struct
struct Foo
{
0: uint8 f1;
// It's OK to remove an optional field.
// 1: uint16 f2;
2: uint8 f3;
3: uint8 f4;
}
If we deserialize 0x01 0x02 0x00 with this schema, we'll get Foo { f1: 1, f3: 2, f4: 0 }. Notice that f3 is 2, which is not correct: it should be 0. With the runtime schema for the old Foo, the reader will know that the second and third bytes correspond to a field that has since been deleted and can skip them, resulting in the expected Foo { f1: 1, f3: 0, f4: 0 }.
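To make this concrete, here is a minimal C# sketch of that recovery path, assuming the writer's SchemaDef was persisted alongside the payload. FooOld and FooNew are hypothetical generated classes standing in for the two schema versions; in a real system you'd have one generated Foo plus a stored SchemaDef per historical version.
using Bond;
using Bond.IO.Unsafe;
using Bond.Protocols;

// Writer side: serialize with the old schema and persist its SchemaDef.
var output = new OutputBuffer();
var writer = new SimpleBinaryWriter<OutputBuffer>(output);
Serialize.To(writer, new FooOld { f1 = 1, f2 = 2 });
SchemaDef writerSchema = Schema<FooOld>.RuntimeSchema.SchemaDef;

// Reader side: deserialize into the new shape, passing the writer's
// runtime schema so the bytes of deleted fields can be skipped.
var reader = new SimpleBinaryReader<InputBuffer>(new InputBuffer(output.Data));
var deserializer = new Deserializer<SimpleBinaryReader<InputBuffer>>(
    typeof(FooNew), new RuntimeSchema(writerSchema));
FooNew value = deserializer.Deserialize<FooNew>(reader);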
Schema Validation and Evolution
Some systems that use Bond have different rules for schema evolution than the normal Bond rules. Runtime schemas can be used to enforce such rules (e.g., checking a type to enforce a rule that no collections are used) before accepting structs of a given type or before registering such a schema in, say, a repository of known schemas.
You could also walk two schemas to determine whether they are compatible with each other. It would be nice if Bond provided such an API itself, so that it doesn't have to be reimplemented again and again. I've opened a GitHub issue for such an API.
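As a rough illustration, such a walk could compare the wire types of fields that share an ordinal. This is a sketch only; the real evolution rules cover more (e.g., modifiers and default values):
using System.Linq;
using Bond;

// Fields present in both versions must keep their ordinal and wire type;
// fields present on only one side are additions or removals, which are fine.
static bool AreCompatible(StructDef oldDef, StructDef newDef)
{
    return oldDef.fields.All(oldField =>
    {
        var match = newDef.fields.FirstOrDefault(f => f.id == oldField.id);
        return match == null || match.type.id == oldField.type.id;
    });
}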
GUI
With a runtime schema, you have extra information about the struct, including things like the names of the fields. (The binary encoding protocols omit field names, relying, instead, on field IDs.) You can use this additional information to do things like create GUI controls specific to each field.
There's an example showing inspection of a runtime schema in both C# and C++.
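In the meantime, here is a minimal C# sketch of such inspection (Foo stands for any generated Bond struct):
using System;
using Bond;

// Walk the top-level fields of the runtime schema; a GUI could build
// one control per field from this metadata.
SchemaDef schema = Schema<Foo>.RuntimeSchema.SchemaDef;
StructDef root = schema.structs[schema.root.struct_def];
foreach (FieldDef field in root.fields)
{
    Console.WriteLine($"{field.id}: {field.metadata.name} ({field.type.id})");
}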
Custom Mapping
In C++, the MapTo transform can be used to convert one struct to another with an incompatible shape, given a set of rules. There's an example of this that makes use of a runtime schema to derive the rules.
Extensible records were one of Elm's most amazing features, but since v0.16 adding and removing fields is no longer possible. And this puts me in an awkward position.
Consider an example. I want to give a name to a random thing t, and extensible records provide me a perfect tool for this:
type alias Named t = { t | name: String }
"Okay," says the compiler. Now I need a constructor, i.e. a function that equips a thing with a specified name:
equip : String -> t -> Named t
equip name thing = { thing | name = name } -- Oops! Type mismatch
Compilation fails, because the { thing | name = ... } syntax assumes thing to be a record that already has a name field, but the type system can't guarantee this. In fact, with Named t I was trying to express the opposite: t should be a record type without its own name field, and the function adds that field. Either way, field addition is necessary to implement the equip function.
So it seems impossible to write equip in a polymorphic manner, but that's probably not such a big deal. After all, any time I want to give a name to some concrete thing, I can do it by hand. Much worse, the inverse function extract : Named t -> t (which erases the name of a named thing) requires a field-removal mechanism, and thus is not implementable either:
extract : Named t -> t
extract thing = thing -- Error: No implicit upcast
This would be an extremely important function, because I have tons of routines that accept old-fashioned unnamed things, and I need a way to use them with named things. Of course, massive refactoring of those functions is not a viable solution.
At last, after this long introduction, let me state my questions:
Does modern Elm provide some substitute for the old, deprecated field addition/removal syntax?
If not, is there some built-in function like equip and extract above? For every custom extensible record type, I would like to have a polymorphic analyzer (a function that extracts its base part) and a polymorphic constructor (a function that combines the base part with the additional fields and produces the record).
Negative answers for both (1) and (2) would force me to implement Named t in a more traditional way:
type Named t = Named String t
In this case, I can't see the purpose of extensible records. Is there a positive use case, a scenario in which extensible records play a critical role?
Type { t | name : String } means a record that has a name field. It does not extend the t type but, rather, extends the compiler’s knowledge about t itself.
So in fact the type of equip is String -> { t | name : String } -> { t | name : String }.
What is more, as you noticed, Elm no longer supports adding fields to records, so even if the type system allowed what you want, you still could not do it. The { thing | name = name } syntax only supports updating records of type { t | name : String }.
Similarly, there is no support for deleting fields from a record.
If you really need types that you can add fields to and remove fields from, you can use Dict. The other options are either writing the transformers manually, or creating and using a code generator (this was the recommended solution for JSON decoding boilerplate for a while).
And regarding extensible records: Elm does not really support the “extensible” part much any more – the only remaining part is the { t | name : u } -> u projection, so perhaps they should be called just scoped records. The Elm docs themselves acknowledge that the extensibility is not very useful at the moment.
You could just wrap the t type with name but it wouldn't make a big difference compared to approach with custom type:
type alias Named t = { val: t, name: String }
equip : String -> t -> Named t
equip name thing = { val = thing, name = name }
extract : Named t -> t
extract thing = thing.val
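For example (the record shape here is invented for illustration):
box : Named { weight : Int }
box = equip "box" { weight = 2 }

unwrapped : { weight : Int }
unwrapped = extract box -- == { weight = 2 }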
Is there a positive use case, a scenario in which extensible records play critical role?
Yes, they are useful when your application Model grows too large and you face the question of how to scale out your application. Extensible records let you slice up the model in arbitrary ways, without committing to particular slices long term. If you sliced it up by splitting it into several smaller nested records, you would be committed to that particular arrangement, which tends to lead to nested TEA and the 'out message' pattern, usually a bad design choice.
Instead, use extensible records to describe slices of the model, and group functions that operate over particular slices into their own modules. If you later need to work across different areas of the model, you can create a new extensible record for that, as sketched below.
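A minimal sketch of the idea (all names invented for illustration):
type alias Model = { count : Int, query : String, results : List String }

-- A slice naming only what the counter code needs:
type alias Countable r = { r | count : Int }

increment : Countable r -> Countable r
increment model = { model | count = model.count + 1 }

-- increment accepts the full Model, or any other record with a count
-- field, without committing to a particular nesting of the model.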
It's described by Richard Feldman in his Scaling Elm Apps talk:
https://www.youtube.com/watch?v=DoA4Txr4GUs&ab_channel=ElmEurope
I agree that extensible records can seem a bit useless in Elm, but it is a very good thing that they are there to solve the scaling issue.
Suppose I serialized a given Bond struct with a single field:
struct NameBond
{
1: string name;
}
And then I renamed the field in the .bond file (without changing its ordinal):
struct NameBond
{
1: string displayName;
}
Would I still be able to deserialize it?
What about the name of the struct? (NameBond in the example.)
Would changing that prevent me from deserializing?
This depends on which protocol you are using.
Your change will cause no problems in the CompactBinary serializer.
It may cause trouble with other protocols.
You may want to consult the Bond schema evolution guide, where it says:
Caution should be used when changing or reusing field names as this could break text-based protocols like SimpleJsonProtocol
See also this related SO question.
Using Microsoft Bond (the C# library in particular), I see that whenever a Bond struct is defined, it looks like this:
struct Name
{
0: type name;
5: type name;
...
}
What do these numbers (0, 5, ...) mean?
Do they require special treatment in inheritance? (Do I need to make sure that I do not override members with the same number defined in my ancestor?)
The field ordinals are the unique identity of each field. When serializing to tagged binary protocols, these numbers are used to indicate which fields are in the payload. The names of the fields are not used. (Renaming a field in the .bond file does not break serialized binary data compatibility [though, see caveat below about text protocols].) Numbers are smaller than strings, which helps reduce the payload size, but also ends up improving serialization/deserialization time.
You cannot re-use the same field ordinal within the same struct.
There's no special treatment needed when you inherit from a struct (or if you have a struct field inside your struct). Bond keeps the ordinals for the structs separate. Concretely, the following is legal and will work:
namespace inherit_use_same_ordinal;
struct Base {
0: string field;
}
struct Derived : Base {
0: bool field;
}
A caveat about text serialization protocols like Simple JSON and Simple XML: these protocols use the field name as the field identifier. So, in these protocols renaming a field breaks serialized data compatibility.
Also, Simple JSON and Simple XML flatten the inheritance hierarchy, so re-using names across Base and Derived will result in clashes. Both have ways to work around this. For Simple XML, the SimpleXml.Settings.UseNamespaces parameter can be set to true to emit fully qualified names.
For Simple JSON, the Bond attribute JsonName can be used to change the name used for Simple JSON serialization, to avoid the conflict:
struct Derived : Base {
[JsonName("derived_field")]
0: bool field;
}
I'm reading through Paulson's ML For the Working Programmer and am a bit confused about the distinction between datatypes and structures.
On p. 142, he defines a type for binary trees as follows:
datatype 'a tree = Lf
| Br of 'a * 'a tree * 'a tree;
This seems to be a recursive definition where 'a denotes some fixed type. So any time I see 'a, it must refer to the same type throughout.
On p. 148, he discusses a structure for binary trees:
"...we have been following an imaginary ML session in which we typed in the tree functions one at a time. Now we ought to collect the most important of those functions into a structure, called Tree. We really must do so, because one of our functions (size) clashes with a built-in function. One reason for using structures is to prevent such name clashes.
We shall, however, leave the datatype declaration of tree outside of the structure. If it were inside, we should be forced to refer to the constructors by Tree.Lf and Tree.Br, which would make our patterns unreadable. Thus, in the sequel, imagine that we have made the following declarations:
datatype 'a tree = Lf
| Br of 'a * 'a tree * 'a tree;
structure Tree =
struct
fun size Lf = 0
| size (Br (v, t1, t2)) = 1 + size t1 + size t2;
fun depth...
etc...
end;
I'm a little confused.
1) What is the relationship between a datatype and a structure?
2) What is the role of "struct" within the structure definition?
3) Later on, Paulson discusses a structure for dictionaries as binary search trees. He does the following:
structure Dict : DICTIONARY =
struct
type key = string;
type 'a t = (key * 'a) tree;
val empty = Lf;
<a bunch of functions for dictionaries>
end;
This makes me think struct specifies the different primitive or compound types involved in the definition of a Dict.
That's a really fuzzy definition though. Anyone like to clarify?
Thanks for the help,
bclayman
A structure is a module. Everything between the struct and end keywords forms the body of this module. Similarly, you can view a signature as the description of an abstract module interface. Ascribing a signature to a structure (like the : DICTIONARY syntax does in your example) limits the exports of the module to what is specified in that signature (by default, everything would be accessible). That allows you to hide implementation details of a module.
However, ML modules are much richer than that. They can be arbitrarily nested. There are also functors, which are effectively functions from modules to modules ("parameterised modules", if you want). Altogether, the module language in ML forms a full functional language on its own, with structures as the basic entities, functors over them, and signatures describing the "types" of such modules. This little language is a layer on top of the so-called core language, where ordinary values and types live.
So, to answer your individual questions:
1) There is no specific relationship between the datatype and the structure. The latter simply uses the former.
2) struct-end is simply a keyword pair to delimit the structure body (languages in the C tradition would probably use curly braces there).
3) As explained above, a structure is a basic module. It can contain (and export) arbitrary other language entities, including other modules. By grouping definitions together, and potentially hiding some of them through a signature ascription, you can express namespacing and encapsulation (in particular, abstract data types); a sketch of such a signature follows.
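For illustration, here is what such a signature might look like (the operations here are assumed; Paulson's actual DICTIONARY differs in detail):
signature DICTIONARY =
sig
  type key
  type 'a t
  val empty  : 'a t
  val lookup : 'a t * key -> 'a
  val insert : 'a t * key * 'a -> 'a t
end;
Ascribing Dict : DICTIONARY then exports exactly these names and nothing else.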
I should also note that Paulson's book is outdated in its description of modules, as it predates the current version of the language. In particular, it does not describe how to express abstract data types through modules, but instead introduces the obsolete abstype declaration, which nobody has used in almost 20 years. A more extensive and up-to-date introduction to modular programming in ML can be found in Harper's Programming in Standard ML.
In this example, the datatype 'a tree describes a binary tree (https://en.wikipedia.org/wiki/Binary_tree) capable of storing values of any single type. The 'a in the definition is a type variable, which is later constrained to a concrete type wherever tree is used. This allows you to define the structure of a tree once and then use it with any type later on.
The Tree structure is separate from the datatype definition. It is being used to group functions together that operate on the 'a tree datatype. It is being used right now as a way to modularize the code and, as it points out, to prevent namespace clashes.
struct is just a keyword to let the compiler know where your structure definition starts, while the end keyword marks where the definition ends.
The dictionary structure is defining a dictionary (a key -> value data structure) that uses a tree as the internal data structure. Once again, the structure is a collection of functions that will be used to create and operate on dictionaries. The types within the dictionary structure compose the type of the internal data structure that makes up the dictionary. The following functions define the public interface that you're exposing to allow clients to work with dictionaries.
I need to serialize a (possibly complex *) object so that I can calculate the object's MAC**.
If your messages are strings you can simply do tag := MAC(key, string) and with very high probability if s1 != s2 then MAC(key, s1) != MAC(key, s2), moreover it is computationally hard to find s1,s2 such that MAC(k,s1) == MAC(k,s2).
Now my question is: what happens if, instead of strings, you need to MAC a very complex object that can contain arrays of objects and nested objects?
JSON
Initially I thought that just using JSON serialization could do the trick, but it turns out that JSON serializers do not guarantee key order, so for example {b:2,a:1} can be serialized to either {"b":2,"a":1} or {"a":1,"b":2}.
URL Params
You can convert the object to a list of URL query params after sorting the keys, so for example {c:1,b:2} can be serialized to b=2&c=1. The problem is that as the object gets more complex, the serialization becomes difficult to understand. Example: {c:1, b:{d:2}}
1. First we serialize the nested object: {c:1, b:{d=2}}
2. Then we URL-encode the = sign: {c:1, b:{d%3D2}}
3. The final serialization is: b=d%3D2&c=1
As you can see, the serialization quickly becomes unreadable, and though I have not proved it yet, I also have the feeling that it is not very secure (i.e. it may be possible to find two messages that MAC to the same value).
Can anyone show me a good secure*** algorithm for serializing objects?
[*]: The object can have nested objects and nested arrays of objects. No circular references allowed. Example:
{a:'a', b:'b', c:{d:{e:{f:[1,2,3,4,5]}}, g:[{h:'h'},{i:'i'}]}}
[**]: This MAC will then be sent over the wire. I cannot know what languages/frameworks are supported by the servers so language specific solutions like Java Object Serialization are not possible.
[***]: Secure in this context means that, given messages a and b, serialize(a) = serialize(b) implies that a = b.
EDIT: I just found out about the SignedObject through this link. Is there a language agnostic equivalent?
What you are looking for is a canonical representation, either for the data storage itself or for pre-processing before applying the MAC algorithm. One rather well-known format is the canonicalization used for XML Signature. It seems the draft 2.0 version of XML Signature also includes HMAC. Be warned that creating a secure verification of XML signatures is fraught with danger - don't let yourself be tricked into trusting the signed document itself.
As for JSON, there seems to be a canonical JSON draft, but I cannot see its status or whether there are any compliant implementations. Here is a Q/A where the same issue comes up (for hashing instead of a MAC). So there does not seem to be a fully standardized way of doing it.
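The common workaround is to canonicalize yourself: sort keys at every nesting level, drop insignificant whitespace, and fix the character encoding. A Python sketch of the idea (illustrative only; unlike a real canonicalization standard, it does not pin down details such as number formatting):
import hashlib
import hmac
import json

def canonical_bytes(obj):
    # Sort keys at every nesting level and drop insignificant whitespace
    # so semantically equal objects yield identical byte strings.
    s = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return s.encode("utf-8")

def mac(key, obj):
    return hmac.new(key, canonical_bytes(obj), hashlib.sha256).hexdigest()

# {b:2,a:1} and {a:1,b:2} now MAC to the same value:
assert mac(b"secret", {"b": 2, "a": 1}) == mac(b"secret", {"a": 1, "b": 2})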
In binary there's ASN.1 DER encoding, but you may not want to go into that as it is highly complex.
Of course you can always define your own binary or textual representation, as long as there is one representation for data sets that are semantically identical. In the case of an textual representation, you will still need to define a specific character encoding (UTF-8 is recommended) to convert the representation to bytes, as HMAC takes binary input only.