General algorithm to securely serialize objects - serialization

I need to serialize a (possibly complex *) object so that I can calculate the object's MAC**.
If your messages are strings you can simply do tag := MAC(key, string) and with very high probability if s1 != s2 then MAC(key, s1) != MAC(key, s2), moreover it is computationally hard to find s1,s2 such that MAC(k,s1) == MAC(k,s2).
Now my question is, what happens if instead of strings you need do MAC a very complex object that can contain arrays of objects and nested objects:
JSON
Initially I though that just using JSON serialization could do the trick but it turns out that JSON serializers do not care about order so for example: {b:2,a:1} can be serialized to both {"b":2,"a":1} or {"a":2,"b":1}.
URL Params
You can convert the object to a list of url query params after sorting the keys, so for example {c:1,b:2} can be serialized to b=2&c=1. The problem is that as the object gets more complex the serialization becomes difficult to understand. Example: {c:1, b:{d:2}}
1. First we serialize the nested object:{c:1, b:{d=2}}
2. Then url encode the = sign: {c:1, b:{d%3D2}}
3. Final serialization is: b=d%3D2&c=1
As you can see, the serialization quickly becomes unreadable and though I have not proved it yet I also have the feeling that it is not very secure (i.e. it is possible to find two messages that MAC to the same value)
Can anyone show me a good secure*** algorithm for serializing objects?
[*]: The object can have nested objects and nested arrays of objects. No circular references allowed. Example:
{a:'a', b:'b', c:{d:{e:{f:[1,2,3,4,5]}}, g:[{h:'h'},{i:'i'}]}}
[**]: This MAC will then be sent over the wire. I cannot know what languages/frameworks are supported by the servers so language specific solutions like Java Object Serialization are not possible.
[***]: Secure in this context means that given messages a,b: serialize(a) = serialize(b) implies that a = b
EDIT: I just found out about the SignedObject through this link. Is there a language agnostic equivalent?

What you are looking for is a canonical representation, either for the data storage itself, or for pre-processing before applying the MAC algorithm. One rather known format is the canonicalization used for XML-signature. It seems like the draft 2.0 version of XML signature is also including HMAC. Be warned that creating a secure verification of XML signatures is fraught with dangers - don't let yourself be tricked into trusting the signed document itself.
As for JSON, there seems to be a canonical JSON draft, but I cannot see the status of it or if there are any compliant implementations. Here is a Q/A where the same issue comes up (for hashing instead of a MAC). So there does not seem to be a fully standardized way of doing it.
In binary there's ASN.1 DER encoding, but you may not want to go into that as it is highly complex.
Of course you can always define your own binary or textual representation, as long as there is one representation for data sets that are semantically identical. In the case of an textual representation, you will still need to define a specific character encoding (UTF-8 is recommended) to convert the representation to bytes, as HMAC takes binary input only.

Related

Microsoft Bond runtime schemaDef

I'm hoping someone could illustrate a common use case for the Microsoft Bond runtime schemas (SchemaDef). I understand these are used when schema definitions are not known at compile time, but if the shape of an object is fluid and changes frequently, what benefits might a runtime generated schema provide?
My use case is that the business user is in control of the shape of an object (via a rules engine). They could conceivably do all sorts of things that could break our backward compatibility (for example, invert the order of fields on the object). If we plan on persisting all the object versions that the user created, is there any way to manage backward/forward compatibility using Bond runtime schemas? I presume no, as if they invert from this:
0: int64 myInt;
1: string myString;
to this
0: string myString;
1: int64 myInt;
I'd expect a runtime error. Which implies managing the object with runtime schemas wouldn't provide much help to me.
What would be a usecase where a runtime schema would in fact be useful?
Thank you!
Some of the uses for runtime schemas are:
with the Simple Binary protocol to handle schema changes
schema validation/evoluton
rendering a struct in a GUI
custom mapping from one struct to another
Your case feels like schema validation, if you can pro-actively reject a schema that would no be compatible. I worked on a system that used Bond under the hood and took this approach. There was an explicit "change the schema of this entity" operation that validated whether the two schemas were compatible with each other.
I don't know the data flow in your system, so such validation might not be possible. In that case, you could use the runtime schemas, along with some rules provided by the business users, to convert between different shapes.
Simple Binary
When deserializing from Simple Binary, the reader must know the exact schema that the writer used, otherwise it has no way to interpret the bytes, resulting in potentially silent data corruption.
Such corruption can happen if the schema undergoes the following change:
// starting struct
struct Foo
{
0: uint8 f1;
1: uint16 f2;
}
The Simple Binary serialized representation of Foo { f1: 1, f2: 2} is 0x01 0x02 0x00.
Let's now change the schema to this:
// changed struct
struct Foo
{
0: uint8 f1;
// It's OK to remove an optional field.
// 1: uint16 f2;
2: uint8 f3;
3: uint8 f4;
}
If we deserialize 0x01 0x02 0x00 with this schema, we'll get Foo { f1: 1, f3: 2, f4: 0}. Notice that f3 is 2, which is not correct: it should be 0. With the runtime schema for the old Foo, the reader will know that the second and third bytes correspond to a field that has since been deleted and can skip them, resulting in the expected Foo { f1:1, f3: 0, f4: 0 }.
Schema Validation and Evolution
Some systems that use Bond have different rules for schema evolution that the normal Bond rules. Runtime schemas can be used to enforce such rules (e.g., checking a type to enforce a rule that no collections are used) before accepting structs of a given type or before registering such a schema in, say, a repository of known schemas.
You could also walk two schemas to determine with they are compatible with each other. It would be nice if Bond provided such an API itself, so that it doesn't have to be reimplemented again and again. I've opened a GitHub issue for such an API.
GUI
With a runtime schema, you have extra information about the struct, including things like the names of the fields. (The binary encoding protocols omit field names, relying, instead, on field IDs.) You can use this additional information to do things like create GUI controls specific to each field.
There's an example showing inspection of a runtime schema in both C# and C++.
Custom Mapping
In C++, the MapTo transform can be used to convert one struct to another, which incompatible shapes, given a set of rules. There's an example of this, that makes use of a runtime schema to derive the rules.

Can I have a string object store its data within the structure?

I'm looking for a quick way to serialize custom structures consisting of basic value types and strings.
Using C++CLI to pin the pointer of the structure instance and destination array and then memcpy the data over is working quite well for all the value types. However, if I include any reference types such as string then all I get is the reference address.
Expected as much since otherwise it would be impossible for the structure to have a fixed.. structure. I figured that maybe, if I make the string fixed size, it might place it inside the structure though. Adding < VBFixedString(256) > to the string declaration did not achieve that.
Is there anything else that would place the actual data inside the structure?
Pinning a managed object and memcpy-ing the content will never give you what you want. Any managed object, be it String, a character array, or anything else will show up as a reference, and you'll just get a memory location.
If I read between the lines, it sounds like you need to call some C or C++ (not C++/CLI) code, and pass it a C struct that looks similar to this:
struct UnmanagedFoo
{
int a_number;
char a_string[256];
};
If that's the case, then I'd solve this by setting up the automatic marshaling to handle this for you. Here's how you'd define that struct so that it marshals properly. (I'm using C# syntax here, but it should be an easy conversion to VB.net syntax.)
[StructLayout(LayoutKind.Sequential, CharSet=CharSet.Ansi)]
public struct ManagedFoo
{
public int a_number;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst=256)]
public string a_string;
}
Explanation:
StructLayout(LayoutKind.Sequential) specifies that the fields should be in the declared order. The default LayoutKind, Auto, allows the fields to be re-ordered if the compiler wants.
CharSet=CharSet.Ansi specifies the type of strings to marshal. You can specify CharSet.Ansi to get char strings on the C++ side, or CharSet.Unicode to get wchar_t strings in C++.
MarshalAs(UnmanagedType.ByValTStr) specifies a string inline to the struct, which is what you were asking about. There are several other string types, with different semantics, see the UnmanagedType page on MSDN for descriptions.
SizeConst=256 specifies the size of the character array. Note that this specifies the number of characters (or when doing arrays, number of array elements), not the number of bytes.
Now, these marshal attributes are an instruction to the built-in marshaler in .Net, which you can call directly from your VB.Net code. To use it, call Marshal.StructureToPtr to go from the .Net object to unmanaged memory, and Marshal.PtrToStructure to go from unmanaged memory to a .Net object. MSDN has some good examples of calling those two methods, take a look at the linked pages.
Wait, what about C++/CLI? Yes, you could use C++/CLI to marshal from the .Net object to a C struct. If your structs get too complex to represent with the MarshalAs attribute, it's highly appropriate to do that. In that case, here's what you do: Declare your .Net struct like I listed above, without the MarshalAs or StructLayout. Also declare the C struct, plain and ordinary, also as listed above. When you need to switch from one to the other, copy things field by field, not a big memcpy. Yes, all the fields that are basic types (integers, doubles, etc.) will be a repetitive output.a_number = input.a_number, but that's the proper way to do it.

Standard ML: Datatype vs. Structure

I'm reading through Paulson's ML For the Working Programmer and am a bit confused about the distinction between datatypes and structures.
On p. 142, he defines a type for binary trees as follows:
datatype 'a tree = Lf
| Br of 'a * 'a tree * 'a tree;
This seems to be a recursive definition where 'a denotes some fixed type. So any time I see 'a, it must refer to the same type throughout.
On p. 148, he discusses a structure for binary trees:
"...we have been following an imaginary ML session in which we typed in the tree functions one at a time. Now we ought to collect the most important of those functions into a structure, called Tree. We really must do so, because one of our functions (size) clashes with a built-in function. One reason for using structures is to prevent such name clashes.
We shall, however, leave the datatype declaration of tree outside of the structure. If it were inside, we should be forced to refer to the constructors by Tree.Lf and Tree.Br, which would make our patters unreadable. Thus, in the sequel, imagine that we have made the following declarations:
datatype 'a tree = Lf
| Br of 'a * 'a tree * 'a tree;
structure Tree =
struct
fun size Lf = 0
| size (Br( v, t1, t2)) = 1 + size t1 + size t2;
fun depth...
etc...
end;
I'm a little confused.
1) What is the relationship between a datatype and a structure?
2) What is the role of "struct" within the structure definition?
3) Later on, Paulson discusses a structure for dictionaries as binary search trees. He does the following:
structure Dict : DICTIONARY =
struct
type key = string;
type 'a t = (key * 'a) tree;
val empty = Lf;
<a bunch of functions for dictionaries>
This makes me think struct specifies the different primitive or compound types involved int he definition of a Dict.
That's a really fuzzy definition though. Anyone like to clarify?
Thanks for the help,
bclayman
A structure is a module. Everything between the struct and end keywords forms the body of this module. Similarly, you can view a signature as the description of an abstract module interface. Ascribing a signature to a structure (like the : DICTIONARY syntax does in your example) limits the exports of the module to what is specified in that signature (by default, everything would be accessible). That allows you to hide implementation details of a module.
However, ML modules are much richer than that. They can be arbitrarily nested. There are also functors, which are effectively functions from modules to modules ("parameterised modules", if you want). Altogether, the module language in ML forms a full functional language on its own, with structures as the basic entities, functors over them, and signatures describing the "types" of such modules. This little language is a layer on top of the so-called core language, where ordinary values and types live.
So, to answer your individual questions:
1) There is no specific relationship between the datatype and the structure. The latter simply uses the former.
2) struct-end is simply a keyword pair to delimit the structure body (languages in C tradition would probably use curly braces there).
3) As explained above, a structure is a basic module. It can contain (and export) arbitrary other language entities, including other modules. By grouping definitions together, and potentially hiding some of them through a signature ascription, you can express namespacing and encapsulation (in particular, abstract data types).
I should also note that Paulson's book is outdated regarding its description of modules, as it predates the current language version. In particular, it does not describe how to express abstract data types through modules, but instead introduces the obsolete abstype declaration which nobody has been using in almost 20 years. A more extensive and up-to-date introduction to modular programming in ML can be found in Harper's Programming in Standard ML.
In this example, the datatype 'a tree is describing a binary tree (https://en.wikipedia.org/wiki/Binary_tree) that is capable of storing any value of a single type. The 'a in the definition is a variant type which will later be constrained down to a concrete type wherever tree is used with a different type. This allows you to define the structure of a tree once and then use it with any type later on.
The Tree structure is separate from the datatype definition. It is being used to group functions together that operate on the 'a tree datatype. It is being used right now as a way to modularize the code and, as it points out, to prevent namespace clashes.
struct is just an identifier keyword to let the compiler know where your structure definition starts while the end keyword is used to let the compiler know where the definition ends.
The dictionary structure is defining a dictionary (a key -> value data structure) that uses a tree as the internal data structure. Once again, the structure is a collection of functions that will be used to create and operate on dictionaries. The types within the dictionary structure compose the type of the internal data structure that makes up the dictionary. The following functions define the public interface that you're exposing to allow clients to work with dictionaries.

What is the proper way to encode an AMF0 StrictArray

After overviewing the AMF0 specification I find that I cannot understand the proper way to encode the StrictArray type.
Here is the most immediate section of the specification:
array-count = U32
strict-array-type = array-count *(value-type)
which describes the StrictArray type with Augmented Backus-Naur Form (ABNF) syntax (See RFC2234)
Does the StrictArray type have ordinal indices or simply encoded objects (without ordinal keys) in order of their appearance in the StrictArray object graph?
Also, as an additional question, does the serialization table (from which object reference IDs are generated) contain all objects in the object graph, or only objects which can be potentially encoded via reference (ECMAArray,StrictArray,TypedObject,AnonymousObject)?
See https://github.com/silexlabs/amfphp-2.0/blob/master/Amfphp/Core/Amf/Serializer.php line 329 to 336.
you write the number of objects, then each object.
additional question: same code, look for Amf0StoredObjects.
references ids are only for referencable objects. These vary for AMF0 and AMF3 though.

WCF REST and JSON - Serialization and empty elements

I am returning a list of items (user defined class) in a REST service using WCF. I am returning the items as JSON and it is used in some client side javascript (so the 'schema' of the class was derived from what the javascript library required). The class is fairly basic, strings and a bool. The bool is optional, so if it is absent the javascript library uses a default value, and if it is present (true or false) the value is used.
The problem is if I use a bool, the value is defaulted to false when serialized, and if i use a bool?, the member is still sent accross in the JSON and defaulted to null which causes problems with the library (it wont fall back to the default value).
I know I can probably mess around the the javascript library, but I would like to find a way to just not send any members which are null so they dont show up in the serialized JSON at all.
Any ideas?
You could do a little bit of packing and unpacking before and after the serialization. E.g.:
You could make two different versions
of the class, one with and one
without the bool, and convert as
appropriate before and after
transmitting. (sends the least amount
of data, if # of bytes is a greater
consideration than code complexity)
You could add another bool that tells
whether the first bool is supposed to
be null.