I have a complex chain of operators on a Kotlin Flow, and many of them are ran in groups in different contexts using flowOn like this:
flowOf(1, 2, 3)
.map { /*do some stuff*/ }
.flowOn(context1)
.map { /*do some different stuff*/ }
.flowOn(context2)
According to documentation, each flowOn introduces a channel buffer with default size 64 (configurable).
In addition to this, I have a MutableSharedFlow with a fixed buffer size configured by the extraBufferCapacity parameter to which I'm emitting items.
I would like to monitor the current buffer sizes, however, the buffers are private property and there seems to be no method to retrieve the buffer reference or its current size. Is there any way to retrieve it, or is it intended solely for internal Flow purposes?
I'm hoping someone could illustrate a common use case for the Microsoft Bond runtime schemas (SchemaDef). I understand these are used when schema definitions are not known at compile time, but if the shape of an object is fluid and changes frequently, what benefits might a runtime generated schema provide?
My use case is that the business user is in control of the shape of an object (via a rules engine). They could conceivably do all sorts of things that could break our backward compatibility (for example, invert the order of fields on the object). If we plan on persisting all the object versions that the user created, is there any way to manage backward/forward compatibility using Bond runtime schemas? I presume no, as if they invert from this:
0: int64 myInt;
1: string myString;
to this
0: string myString;
1: int64 myInt;
I'd expect a runtime error. Which implies managing the object with runtime schemas wouldn't provide much help to me.
What would be a usecase where a runtime schema would in fact be useful?
Thank you!
Some of the uses for runtime schemas are:
with the Simple Binary protocol to handle schema changes
schema validation/evoluton
rendering a struct in a GUI
custom mapping from one struct to another
Your case feels like schema validation, if you can pro-actively reject a schema that would no be compatible. I worked on a system that used Bond under the hood and took this approach. There was an explicit "change the schema of this entity" operation that validated whether the two schemas were compatible with each other.
I don't know the data flow in your system, so such validation might not be possible. In that case, you could use the runtime schemas, along with some rules provided by the business users, to convert between different shapes.
Simple Binary
When deserializing from Simple Binary, the reader must know the exact schema that the writer used, otherwise it has no way to interpret the bytes, resulting in potentially silent data corruption.
Such corruption can happen if the schema undergoes the following change:
// starting struct
struct Foo
{
0: uint8 f1;
1: uint16 f2;
}
The Simple Binary serialized representation of Foo { f1: 1, f2: 2} is 0x01 0x02 0x00.
Let's now change the schema to this:
// changed struct
struct Foo
{
0: uint8 f1;
// It's OK to remove an optional field.
// 1: uint16 f2;
2: uint8 f3;
3: uint8 f4;
}
If we deserialize 0x01 0x02 0x00 with this schema, we'll get Foo { f1: 1, f3: 2, f4: 0}. Notice that f3 is 2, which is not correct: it should be 0. With the runtime schema for the old Foo, the reader will know that the second and third bytes correspond to a field that has since been deleted and can skip them, resulting in the expected Foo { f1:1, f3: 0, f4: 0 }.
Schema Validation and Evolution
Some systems that use Bond have different rules for schema evolution that the normal Bond rules. Runtime schemas can be used to enforce such rules (e.g., checking a type to enforce a rule that no collections are used) before accepting structs of a given type or before registering such a schema in, say, a repository of known schemas.
You could also walk two schemas to determine with they are compatible with each other. It would be nice if Bond provided such an API itself, so that it doesn't have to be reimplemented again and again. I've opened a GitHub issue for such an API.
GUI
With a runtime schema, you have extra information about the struct, including things like the names of the fields. (The binary encoding protocols omit field names, relying, instead, on field IDs.) You can use this additional information to do things like create GUI controls specific to each field.
There's an example showing inspection of a runtime schema in both C# and C++.
Custom Mapping
In C++, the MapTo transform can be used to convert one struct to another, which incompatible shapes, given a set of rules. There's an example of this, that makes use of a runtime schema to derive the rules.
I have some serialization in google protobuf in a series of files, but wonder if there is a shortcut way of concatenating these smaller files into one larger protobuf without worrying about reading each and every protobuf and then grouping these objects and outputting.
Is there a cheap way to join files together? I.e. do I have serialize each individual file?
You can combine protocol buffers messages by simple concatenation. It appears that you want the result to form an array, so you'll need to serialize each individual file as an array itself:
message MyItem {
...
}
message MyCollection {
repeated MyItem items = 1;
}
Now if you serialize each file as a MyCollection and then concatenate them (just put the raw binary data together), the resulting file can be read as a one large collection itself.
In addition to jpas answer, it might be relevant to say that the data does not need to be in the exact same container, when being serialized, for it being compatible on deserialisation.
Consider the following messages:
message FileData{
required uint32 versionNumber = 1;
repeated Data initialData = 2;
}
message MoreData{
repeated Data data = 2;
}
It is possible to serialize those different messages into one single data container and deserialize it as one single FileData message, as long as the FileData is serialized before zero or more MoreData and both, the FileData and MoreData have the same index for the repeated field.
I have a graph data structure that I'd like to encode with protocol buffers. There are cyclic connections between the graph vertices. Is there a standard/common way to encode such structures in protobuf? One approach that comes to mind is to add an "id" field to each vertex, and use those ids instead of pointers. E.g.:
message Vertex {
required int32 id = 1;
required string label = 2;
repeated int32 outgoing_edges = 3; // values should be id's of other nodes
}
message Graph {
repeated Vertex vertices = 1;
}
Then I could write classes that wrap the protobuf-generated classes, and automatically convert these identifiers to real pointers on deserialization (and back to ids on serialization). Is this the best approach? If so, then does anyone know of existing projects that use/document this approach? If not, then what approach would you recommend?
If you need cross platform support, then using a DTO as you propose in the question, then mapping that to/from a separate graph-based model in your own code is probably your best approach.
As a side note, in protobuf-net (c# / .net) I've added support for this which adds a layer of abstraction silently. Basically, the following works:
[ProtoContract]
class Vertex {
...
[ProtoMember(3, AsReference = true)]
public List<Vertex> OutgoingEdges {get;set;}
}
I know that in terms of several distributed techniques (such as RPC), the term "Marshaling" is used but don't understand how it differs from Serialization. Aren't they both transforming objects into series of bits?
Related:
What is Serialization?
What is Object Marshalling?
Marshaling and serialization are loosely synonymous in the context of remote procedure call, but semantically different as a matter of intent.
In particular, marshaling is about getting parameters from here to there, while serialization is about copying structured data to or from a primitive form such as a byte stream. In this sense, serialization is one means to perform marshaling, usually implementing pass-by-value semantics.
It is also possible for an object to be marshaled by reference, in which case the data "on the wire" is simply location information for the original object. However, such an object may still be amenable to value serialization.
As #Bill mentions, there may be additional metadata such as code base location or even object implementation code.
Both do one thing in common - that is serializing an Object. Serialization is used to transfer objects or to store them. But:
Serialization: When you serialize an object, only the member data within that object is written to the byte stream; not the code that
actually implements the object.
Marshalling: Term Marshalling is used when we talk about passing Object to remote objects(RMI). In Marshalling Object is serialized(member data is serialized) + Codebase is attached.
So Serialization is a part of Marshalling.
CodeBase is information that tells the receiver of Object where the implementation of this object can be found. Any program that thinks it might ever pass an object to another program that may not have seen it before must set the codebase, so that the receiver can know where to download the code from, if it doesn't have the code available locally. The receiver will, upon deserializing the object, fetch the codebase from it and load the code from that location.
From the Marshalling (computer science) Wikipedia article:
The term "marshal" is considered to be synonymous with "serialize" in the Python standard library1, but the terms are not synonymous in the Java-related RFC 2713:
To "marshal" an object means to record its state and codebase(s) in such a way that when the marshalled object is "unmarshalled", a copy of the original object is obtained, possibly by automatically loading the class definitions of the object. You can marshal any object that is serializable or remote. Marshalling is like serialization, except marshalling also records codebases. Marshalling is different from serialization in that marshalling treats remote objects specially. (RFC 2713)
To "serialize" an object means to convert its state into a byte stream in such a way that the byte stream can be converted back into a copy of the object.
So, marshalling also saves the codebase of an object in the byte stream in addition to its state.
Basics First
Byte Stream - Stream is a sequence of data. Input stream - reads data from source. Output stream - writes data to destination.
Java Byte Streams are used to perform input/output byte by byte (8 bits at a time). A byte stream is suitable for processing raw data like binary files.
Java Character Streams are used to perform input/output 2 bytes at a time, because Characters are stored using Unicode conventions in Java with 2 bytes for each character. Character stream is useful when we process (read/write) text files.
RMI (Remote Method Invocation) - an API that provides a mechanism to create distributed application in java. The RMI allows an object to invoke methods on an object running in another JVM.
Both Serialization and Marshalling are loosely used as synonyms. Here are few differences.
Serialization - Data members of an object is written to binary form or Byte Stream (and then can be written in file/memory/database etc). No information about data-types can be retained once object data members are written to binary form.
Marshalling - Object is serialized (to byte stream in binary format) with data-type + Codebase attached and then passed Remote Object (RMI). Marshalling will transform the data-type into a predetermined naming convention so that it can be reconstructed with respect to the initial data-type.
So Serialization is a part of Marshalling.
CodeBase is information that tells the receiver of Object where the implementation of this object can be found. Any program that thinks it might ever pass an object to another program that may not have seen it before must set the codebase, so that the receiver can know where to download the code from, if it doesn't have the code available locally. The receiver will, upon deserializing the object, fetch the codebase from it and load the code from that location. (Copied from #Nasir answer)
Serialization is almost like a stupid memory-dump of the memory used by the object(s), while Marshalling stores information about custom data-types.
In a way, Serialization performs marshalling with implementation of pass-by-value because no information of data-type is passed, just the primitive form is passed to byte stream.
Serialization may have some issues related to big-endian, small-endian if the stream is going from one OS to another if the different OS have different means of representing the same data. On the other hand, marshalling is perfectly fine to migrate between OS because the result is a higher-level representation.
Marshaling refers to converting the signature and parameters of a function into a single byte array.
Specifically for the purpose of RPC.
Serialization more often refers to converting an entire object / object tree into a byte array
Marshaling will serialize object parameters in order to add them to the message and pass it across the network.
*Serialization can also be used for storage to disk.*
I think that the main difference is that Marshalling supposedly also involves the codebase. In other words, you would not be able to marshal and unmarshal an object into a state-equivalent instance of a different class.
Serialization just means that you can store the object and reobtain an equivalent state, even if it is an instance of another class.
That being said, they are typically synonyms.
Marshalling is the rule to tell compiler how the data will be represented on another environment/system;
For example;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
public string cFileName;
[MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
public string cAlternateFileName;
as you can see two different string values represented as different value types.
Serialization will only convert object content, not representation (will stay same) and obey rules of serialization, (what to export or no). For example, private values will not be serialized, public values yes and object structure will stay same.
Here's more specific examples of both:
Serialization Example:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
typedef struct {
char value[11];
} SerializedInt32;
SerializedInt32 SerializeInt32(int32_t x)
{
SerializedInt32 result;
itoa(x, result.value, 10);
return result;
}
int32_t DeserializeInt32(SerializedInt32 x)
{
int32_t result;
result = atoi(x.value);
return result;
}
int main(int argc, char **argv)
{
int x;
SerializedInt32 data;
int32_t result;
x = -268435455;
data = SerializeInt32(x);
result = DeserializeInt32(data);
printf("x = %s.\n", data.value);
return result;
}
In serialization, data is flattened in a way that can be stored and unflattened later.
Marshalling Demo:
(MarshalDemoLib.cpp)
#include <iostream>
#include <string>
extern "C"
__declspec(dllexport)
void *StdCoutStdString(void *s)
{
std::string *str = (std::string *)s;
std::cout << *str;
}
extern "C"
__declspec(dllexport)
void *MarshalCStringToStdString(char *s)
{
std::string *str(new std::string(s));
std::cout << "string was successfully constructed.\n";
return str;
}
extern "C"
__declspec(dllexport)
void DestroyStdString(void *s)
{
std::string *str((std::string *)s);
delete str;
std::cout << "string was successfully destroyed.\n";
}
(MarshalDemo.c)
#include <Windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
int main(int argc, char **argv)
{
void *myStdString;
LoadLibrary("MarshalDemoLib");
myStdString = ((void *(*)(char *))GetProcAddress (
GetModuleHandleA("MarshalDemoLib"),
"MarshalCStringToStdString"
))("Hello, World!\n");
((void (*)(void *))GetProcAddress (
GetModuleHandleA("MarshalDemoLib"),
"StdCoutStdString"
))(myStdString);
((void (*)(void *))GetProcAddress (
GetModuleHandleA("MarshalDemoLib"),
"DestroyStdString"
))(myStdString);
}
In marshaling, data does not necessarily need to be flattened, but it needs to be transformed to another alternative representation. all casting is marshaling, but not all marshaling is casting.
Marshaling doesn't require dynamic allocation to be involved, it can also just be transformation between structs. For example, you might have a pair, but the function expects the pair's first and second elements to be other way around; you casting/memcpy one pair to another won't do the job because fst and snd will get flipped.
#include <stdio.h>
typedef struct {
int fst;
int snd;
} pair1;
typedef struct {
int snd;
int fst;
} pair2;
void pair2_dump(pair2 p)
{
printf("%d %d\n", p.fst, p.snd);
}
pair2 marshal_pair1_to_pair2(pair1 p)
{
pair2 result;
result.fst = p.fst;
result.snd = p.snd;
return result;
}
pair1 given = {3, 7};
int main(int argc, char **argv)
{
pair2_dump(marshal_pair1_to_pair2(given));
return 0;
}
The concept of marshaling becomes especially important when you start dealing with tagged unions of many types. For example, you might find it difficult to get a JavaScript engine to print a "c string" for you, but you can ask it to print a wrapped c string for you. Or if you want to print a string from JavaScript runtime in a Lua or Python runtime. They are all strings, but often won't get along without marshaling.
An annoyance I had recently was that JScript arrays marshal to C# as "__ComObject", and has no documented way to play with this object. I can find the address of where it is, but I really don't know anything else about it, so the only way to really figure it out is to poke at it in any way possible and hopefully find useful information about it. So it becomes easier to create a new object with a friendlier interface like Scripting.Dictionary, copy the data from the JScript array object into it, and pass that object to C# instead of JScript's default array.
(test.js)
var x = new ActiveXObject('Dmitry.YetAnotherTestObject.YetAnotherTestObject');
x.send([1, 2, 3, 4]);
(YetAnotherTestObject.cs)
using System;
using System.Runtime.InteropServices;
namespace Dmitry.YetAnotherTestObject
{
[Guid("C612BD9B-74E0-4176-AAB8-C53EB24C2B29"), ComVisible(true)]
public class YetAnotherTestObject
{
public void send(object x)
{
System.Console.WriteLine(x.GetType().Name);
}
}
}
above prints "__ComObject", which is somewhat of a black box from the point of view of C#.
Another interesting concept is that you might have the understanding how to write code, and a computer that knows how to execute instructions, so as a programmer, you are effectively marshaling the concept of what you want the computer to do from your brain to the program image. If we had good enough marshallers, we could just think of what we want to do/change, and the program would change that way without typing on the keyboard. So, if you could have a way to store all the physical changes in your brain for the few seconds where you really want to write a semicolon, you could marshal that data into a signal to print a semicolon, but that's an extreme.
Marshalling is usually between relatively closely associated processes; serialization does not necessarily have that expectation. So when marshalling data between processes, for example, you may wish to merely send a REFERENCE to potentially expensive data to recover, whereas with serialization, you would wish to save it all, to properly recreate the object(s) when deserialized.
My understanding of marshalling is different to the other answers.
Serialization:
To Produce or rehydrate a wire-format version of an object graph utilizing a convention.
Marshalling:
To Produce or rehydrate a wire-format version of an object graph by utilizing a mapping file, so that the results can be customized. The tool may start by adhering to a convention, but the important difference is the ability to customize results.
Contract First Development:
Marshalling is important within the context of contract first development.
Its possible to make changes to an internal object graph, while keeping the external interface stable over time. This way all of the service subscribers won't have to be modified for every trivial change.
Its possible to map the results across different languages. For example from the property name convention of one language ('property_name') to another ('propertyName').
Marshaling uses Serialization process actually but the major difference is that it in Serialization only data members and object itself get serialized not signatures but in Marshalling Object + code base(its implementation) will also get transformed into bytes.
Marshalling is the process to convert java object to xml objects using JAXB so that it can be used in web services.
Serialisation vs Marshalling
Problem: Object belongs to some process(VM) and it's lifetime is the same
Serialisation - transform object state into stream of bytes(JSON, XML...) for saving, sharing, transforming...
Marshalling - contains Serialisation + codebase. Usually it used by Remote procedure call(RPC) -> Java Remote Method Invocation(Java RMI) where you are able to invoke a object's method which is hosted on remote Java processes.
codebase - is a place or URL to class definition where it can be downloaded by ClassLoader. CLASSPATH[About] is as a local codebase
JVM -> Class Loader -> load class definition
java -Djava.rmi.server.codebase="<some_URL>" -jar <some.jar>
Very simple diagram for RMI
Serialisation - state
Marshalling - state + class definition
Official doc
Think of them as synonyms, both have a producer that sends stuff over to a consumer... In the end fields of instances are written into a byte stream and the other end foes the reverse ands up with the same instances.
NB - java RMI also contains support for transporting classes that are missing from the recipient...