Does FlatBuffers guarantee the same output for identical data? - flatbuffers

We just ran into a strange problem: with the same code and input data, the generated FlatBuffers can differ between two runs. It is most probably caused by a bug on our side, but we cannot help asking the question: does FlatBuffers guarantee the same output for identical input data?
Thanks very much

FlatBuffers certainly is deterministic, i.e. if you call it in exactly the same order with the same data, it should generate the same buffer, bit for bit. I'm guessing either something was different about the data or about the order of creation.
Also, different implementations for different languages may generate slightly different buffers.
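One way to convince yourself of this is to build the same buffer twice and compare the bytes. Below is a minimal sketch using the Python API; it assumes the generated code for the Monster schema from the FlatBuffers tutorial (MyGame.Sample.Monster), so substitute your own schema and fields.

    import hashlib
    import flatbuffers
    # Generated by flatc for the tutorial's Monster schema; replace with your own module.
    import MyGame.Sample.Monster as Monster

    def build_buffer():
        builder = flatbuffers.Builder(0)
        name = builder.CreateString("Orc")      # strings/vectors are created first
        Monster.MonsterStart(builder)
        Monster.MonsterAddName(builder, name)
        Monster.MonsterAddHp(builder, 300)
        orc = Monster.MonsterEnd(builder)
        builder.Finish(orc)
        return bytes(builder.Output())

    # Same data, same order of creation -> the digests should match bit for bit.
    print(hashlib.md5(build_buffer()).hexdigest())
    print(hashlib.md5(build_buffer()).hexdigest())

If the two digests differ on your side, something about the data or the order of the builder calls is changing between runs.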

Related

Julia: how stable are serialize() / deserialize()

I am considering the use of serialize() and deserialize() for all of my data I/O due to their convenience. I do not, however, want to be stuck with unreadable files on a Julia update.
How stable are serialize() and deserialize()? Should they work between updates of 0.3? Can I expect safe behavior if I stick to basic types like arrays of Float64?
Thank you.
If you want to store data that you might depend on being able to read in the future, you should not use a format that will incorporate breaking changes whenever its developers find that useful. As far as I understand, the default serialization format is meant for network communication, so it is designed for maximum performance.
There is also the HDF5.jl package, which uses a documented format and a common library that has wrappers for different languages.
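To illustrate the cross-language point, here is a minimal sketch of the same round trip in Python with h5py (the file and dataset names are just examples); a file written this way, or with HDF5.jl from Julia, can be read back from either language.

    import h5py
    import numpy as np

    # Write an array of Float64 to a documented, stable format.
    with h5py.File("data.h5", "w") as f:
        f.create_dataset("x", data=np.random.rand(1000))

    # Read it back; the same file opens fine from Julia, C, etc.
    with h5py.File("data.h5", "r") as f:
        x = f["x"][:]
        print(x.shape, x.dtype)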
I believe the official answer here is, "people will try not to break the serialization format, but you shouldn't depend on it."

Does it make sense to allow retrieval of data from OpenGL's context

I am trying to abstract some of OpenGL's concepts into an object-oriented style, wrapping elements like buffers, arrays, and vertices into objects that store their access id, data types, buffer sizes, used indices, etc. and provide further simplifications of their usage.
But now I wonder: does anyone actually want to re-access data that was once pushed to the GPU? Are functions like glGetBufferSubData ever used for anything other than debugging? The documentation of these functions on the official wiki isn't very elaborate, and I have never seen them in any tutorial.
The general concept in GL is that everything can be queried. Reading back data that you put there yourself should be avoided and is usually more expensive than keeping a local copy. However, there is also data generated by the GPU which you might want to read back. Examples are framebuffer contents, textures you rendered into, or vertex data which you stored into a buffer via transform feedback. So yes, there are real use cases for things like glGetBufferSubData() (although I prefer buffer mappings in most situations).
Whether you need to support such operations is another matter entirely, and one which I think is off-topic here and primarily opinion-based. The problem with abstractions built without the intended use case in mind is that one tends to over-abstract things. YMMV.
I wrote a program to generate meshes using transform feedback, and needed to read the data in the buffers to save the resulting mesh.
The transform feedback generated the data. It wasn't data that I originally pushed there.
So, yes.

Is there an OCaml library to store/use a data structure on disk?

Something like bdb. However, I looked at ocaml-bdb, and it seems to be made to store only strings. My problem is that I have arrays holding giant data. Sure, I can serialize them into many files, or encode/decode my data and put it in a database or one of those key-value stores, which is my last resort. I'm wondering if there's a better way.
The HDF4 / HDF5 file format might suit your needs. See http://forge.ocamlcore.org/projects/ocaml-hdf/
In addition to the HDF4 bindings mentioned by jrouquie there are HDF5 bindings available (http://opam.ocaml.org/packages/hdf5/). Depending on the type of data you're storing there are bindings to GDAL (http://opam.ocaml.org/packages/gdal/).
For data which can fit in a bigarray you also have the option of memory mapping a large file on disk. See https://caml.inria.fr/pub/docs/manual-ocaml/libref/Bigarray.Genarray.html#VALmap_file for example. While it ties you to a rather strict on-disk format, it does make it relatively simple to manipulate arrays which are larger than the available RAM.
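For illustration only, here is what the same memory-mapping technique looks like in Python with numpy.memmap (the file name and shape are placeholders; Bigarray.Genarray.map_file gives you the equivalent handle in OCaml). The OS pulls in only the pages you touch, so the array can be far larger than RAM.

    import numpy as np

    # Create (or open) a file-backed array of float64; scale the shape up
    # well past RAM size if needed.
    big = np.memmap("big_array.dat", dtype=np.float64, mode="w+",
                    shape=(10_000_000,))

    big[0:1000] = np.arange(1000, dtype=np.float64)
    big.flush()

    # Reopen read-only later, without loading the whole file into memory.
    again = np.memmap("big_array.dat", dtype=np.float64, mode="r",
                      shape=(10_000_000,))
    print(again[500])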
There was an OCaml BerkeleyDB wrapper in the past:
OCamlDB
Apparently someone looked into it recently:
recent patch for OCamlDB
However, the GDAL bindings from hcarty are probably production-ready and in intensive use somewhere.
Also, there are bindings for dbm in opam: dbm and cryptodbm
HDF5 is probably the answer, but since the question is somewhat vague, another solution is possible.
Disclaimer: I don't know OCaml (though I knew Caml Light), but I do know Berkeley DB (a.k.a. bsddb, a.k.a. bdb).
However, I looked at ocaml-bdb, and it seems to be made to store only strings.
That may be true of ocaml-bdb, but underneath, bdb really stores bytes. I am not sure how it looks on your side; in Python 2, for instance, str was really a byte string, and it's only with Python 3 that there is a distinct bytes type, which the bdb bindings take and return. The difference is subtle, but you'd rather work with bytes, because that is what bdb understands and uses.
My problem is that I have arrays holding giant data. Sure, I can serialize them into many files, or encode/decode my data and put it in a database
or one of those key-value stores, which is my last resort.
I'm wondering if there's a better way.
It depends on your needs and what the data looks like.
If the data can all stay in memory, you'd rather dump memory to a file and load it back.
If you need to share that data among several architectures or operating systems, you'd rather use a serialization framework like HDF5. Keep in mind that HDF5 doesn't handle circular references.
If the data cannot all stay in memory, then you need to use something like bdb (or wiredtiger).
Why bdb (or wiredtiger)?
Simply put, several decades of work have gone into:
splitting data
storing it on disk
retrieving it
as fast as possible.
wiredtiger is the successor of bdb.
So yes, you could split the files yourself and so on, but that would require a lot of work. Only specialized companies do that (Bloomberg included); among those that manage all of the above themselves are PostgreSQL, MariaDB, Google, and Algolia.
Ordered key-value stores like wiredtiger and bdb use algorithms similar to those of higher-level databases like PostgreSQL and MySQL, or of specialized ones like Lucene/Solr or Sphinx: MVCC, B-trees, LSM trees, PSSI, etc.
Since version 3.2, MongoDB uses the wiredtiger backend for storing all its data.
Some people argue that key-value stores are not good at storing relational data; that said, several projects have started building distributed databases on top of key-value stores, e.g. FoundationDB or CockroachDB. That is a clue that the approach is useful.
The idea behind key-value stores is to deliver a generic framework for:
splitting data
storing it on disk
retrieving it
as fast as possible, with some guarantees (like ACID) and other nice-to-haves (like compression or cryptography).
To take advantage of the power offered by those libraries, you need to learn about key-value composition.
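Since the Python 3 bdb bindings came up above, here is a minimal sketch of what key composition looks like with an ordered store (bsddb3's B-tree here; wiredtiger works along the same lines). The file name and key layout are just examples: keys and values are plain bytes, and a range scan over a composed key prefix replaces what a higher-level database would do with an index.

    from bsddb3 import db  # pip install bsddb3 (or the newer 'berkeleydb' package)

    store = db.DB()
    store.open("example.db", dbtype=db.DB_BTREE, flags=db.DB_CREATE)

    # Compose keys out of a prefix plus an identifier; bdb only sees bytes.
    store.put(b"user:1:name", b"alice")
    store.put(b"user:1:age", b"42")
    store.put(b"user:2:name", b"bob")

    # An ordered store lets you walk every key sharing a prefix.
    cursor = store.cursor()
    rec = cursor.set_range(b"user:1:")
    while rec is not None and rec[0].startswith(b"user:1:"):
        print(rec)
        rec = cursor.next()

    cursor.close()
    store.close()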

How to decode SAP text from STXL.CLUSTD?

I know ! The "proper" way to read STXL.CLUSTD is through SAP ABAP function. But I'm sorry, we are suffering badly from performance problem. We have already make our decision to go directly to the database (Oracle), and we don't have any plan to revert our decision yet since everything goes so much better so far.
However, we've came across this issue. The text in STXL.CLUSTD field was stored in an incomprehensible format. We cannot find any information about its encoding format via google. Anybody can hint me how to decode text from STXL.CLUSTD ?
Thanks
Short version: You don't. Use the function module READ_TEXT.
Long version: You're looking at a so-called cluster table. See http://help.sap.com/saphelp_47x200/helpdata/en/fc/eb3bf8358411d1829f0000e829fbfe/frameset.htm for the details. The data you see is an internal representation of the text, somehow related to the way the ABAP kernel handles the data internally. This data does not make any sense without the metadata. If you change the original structure in an incompatible way, the data can no longer be read. Oh, and did I mention that the data does not contain a reference to the metadata? When reading the contents of these tables, even in ABAP, you need to know the original source data structure, otherwise you're doomed. Without the metadata and the knowledge of how the kernel handles these data types at runtime, you'll have a hard time deciphering the contents.
Personal opinion: Direct access to the database below the SAP R/3 system is a really bad idea since this not only bypasses all safety measures, but it also makes you very vulnerable to all structural changes of the database. The only real reason for accessing the database directly is not lack of performance, but lack of (ABAP) knowledge, and that should be curable :-)
You can definitely read clusters and pools without running any ABAP code or invoking RFCs, BAPIs, etc. It is a very good approach, highly performant, and easy to use.
I don't like people flogging their products on Stack Overflow, but the information that you must use ABAP to access SAP data has been outdated for over 7 years now.
Thanks,
Bill MacLean
I just noticed this thread, and I work for Simplement. Snow_FFFF is correct (BTW, that user is not me, and AFAIK is not anyone in our company). The Data Liberator product has been de-clustering and de-pooling tables (and doing many other things) for our customers since 2009.

Functional testing of output files, when output is non-deterministic (or with low control)

A long time ago, I had to test a program that generated a PostScript image file. One quick way to figure out whether the program was producing the correct, expected output was to take an md5 of the result and compare it against the md5 of a "known good" output I had checked beforehand.
Unfortunately, PostScript embeds the current time within the file. This time is, of course, different depending on when the test runs, so it changes the md5 of the result even when the expected output is obtained. As a fix, I just stripped the date out with sed.
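In Python terms, that workaround amounts to something like the following sketch (the file names are placeholders; the real fix was a one-line sed):

    import hashlib
    import re

    def normalized_md5(path):
        """md5 of a PostScript file with the volatile %%CreationDate line blanked out."""
        with open(path, "rb") as f:
            data = f.read()
        # %%CreationDate is the DSC comment that changes on every run.
        data = re.sub(rb"%%CreationDate:.*", b"%%CreationDate: MASKED", data)
        return hashlib.md5(data).hexdigest()

    assert normalized_md5("output.ps") == normalized_md5("known_good.ps")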
This is a nice and simple scenario. We are not always so lucky. For example, I am now writing a program that creates a big, fat RDF file containing a bunch of anonymous nodes and UUIDs. It is basically impossible to check the functionality of the whole program with a simple md5; the only way would be to read the file back with a reader and validate the output through it. As you probably realize, this opens a new can of worms: first, you have to write a reader (which can be time-consuming); second, you are assuming the reader is functionally correct and in sync with the writer. If both the reader and the writer are in sync but built on incorrect assumptions, the reader will say "no problem" while the file format is actually wrong.
This is a general issue when you have to perform functional testing of a file format, and the file format is not completely reproducible through the input you provide. How do you deal with this case?
In the past I have used a third party application to validate such output (preferably converting it into some other format which can be mechanically verified). The use of a third party ensures that my assumptions are at least shared by others, if not strictly correct. At the very least this approach can be used to verify syntax. Semantic correctness will probably require the creation of a consumer for the test data which will likely always be prone to the "incorrect assumptions" pitfall you mention.
Is the randomness always in the same places? I.e. is most of the file fixed but there are some parts that always change? If so, you might be able to take several outputs and use a programmatic diff to determine the nondeterministic parts. Once those are known, you could use the information to derive a mask and then do a comparison (md5 or just a straight compare). Think about pre-processing the file to remove (or overwrite with deterministic data) the parts that are non-deterministic.
If the whole file is non-deterministic then you'll have to come up with a different solution. I did testing of MPEG-2 decoders, which are non-deterministic. In that case we were able to compute a PSNR against a reference and fail if it dropped below some threshold. That may or may not work for your data, but something similar might be possible.
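To make the masking idea concrete, here is a rough sketch in Python (not production code, and it assumes the non-determinism is line-aligned, i.e. the same lines vary from run to run while the rest stays put). Two reference runs are diffed to find the volatile lines, which are then overwritten with a fixed placeholder before hashing.

    import hashlib

    def read_lines(path):
        with open(path, "rb") as f:
            return f.read().splitlines()

    def masked_md5(sample_a, sample_b, candidate):
        """md5 of 'candidate' with the lines that differ between two reference runs blanked out."""
        lines_a, lines_b = read_lines(sample_a), read_lines(sample_b)
        # Lines that differ between two runs on the same input are treated
        # as non-deterministic.
        volatile = {i for i, (x, y) in enumerate(zip(lines_a, lines_b)) if x != y}
        masked = [b"<MASKED>" if i in volatile else line
                  for i, line in enumerate(read_lines(candidate))]
        return hashlib.md5(b"\n".join(masked)).hexdigest()

    # Pass/fail: compare against the masked digest of a known-good output.
    ok = masked_md5("run1.out", "run2.out", "new.out") == masked_md5(
        "run1.out", "run2.out", "known_good.out")
    print("PASS" if ok else "FAIL")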