Identify compression method used on blob/binary data

I have some binary data (blobs) from a database, and I need to know what compression method was used on them to decompress them.
How do I determine which compression method was used?

Actually it is easier than that. Assuming one of the standard methods was used, there are probably some magic bytes at the beginning. I suggest taking the hex values of the first 3-4 bytes and asking Google.
It makes no sense to develop your own compression, so unless the case was special, or the programmer foolish, one of the well-known compression methods was used. You could also take libraries for the most popular ones and just try what they say.
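To illustrate the magic-bytes idea, here is a minimal C++ sketch: it reads the first few bytes of the blob (exported to a file) and compares them against the magic numbers of common formats. The set of formats checked and the command-line usage are my assumptions, not part of the original answer.

// sniff_magic.cpp - guess the compression format from leading magic bytes.
// Build: g++ -std=c++17 sniff_magic.cpp -o sniff_magic
#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: sniff_magic <blobfile>\n"; return 1; }

    std::ifstream in(argv[1], std::ios::binary);
    std::vector<unsigned char> head(6, 0);
    in.read(reinterpret_cast<char*>(head.data()), head.size());

    auto starts_with = [&](std::initializer_list<unsigned char> magic) {
        size_t i = 0;
        for (unsigned char b : magic) {
            if (i >= head.size() || head[i++] != b) return false;
        }
        return true;
    };

    // Well-known magic numbers of popular compression containers.
    if (starts_with({0x1F, 0x8B}))                         std::cout << "gzip\n";
    else if (starts_with({0x42, 0x5A, 0x68}))              std::cout << "bzip2\n";
    else if (starts_with({0x50, 0x4B, 0x03, 0x04}))        std::cout << "zip\n";
    else if (starts_with({0xFD, 0x37, 0x7A, 0x58, 0x5A}))  std::cout << "xz\n";
    else if (starts_with({0x28, 0xB5, 0x2F, 0xFD}))        std::cout << "zstd\n";
    else if (head[0] == 0x78 &&
             (head[1] == 0x01 || head[1] == 0x9C || head[1] == 0xDA))
        std::cout << "zlib (deflate), common header bytes\n";
    else
        std::cout << "unknown - search the hex bytes: " << std::hex
                  << (int)head[0] << " " << (int)head[1] << " "
                  << (int)head[2] << " " << (int)head[3] << "\n";
    return 0;
}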

The only way to do this, in general, would be to store which compression method was used when you store the BLOB.

Starting from the blob in the DB you can do the following:
Store the blob in a file. For my use case I used DBeaver to export multiple blobs to separate files.
Find out more about the magic numbers in the file by running
file -i filename
In my case the files are application/zlib; charset=binary.
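Since file reports application/zlib, the blob can most likely be inflated with zlib directly. A hedged C++ sketch of that step (it assumes zlib is installed, the whole blob fits in memory, and the file name comes from the command line):

// inflate_blob.cpp - decompress a zlib-wrapped blob that was exported to a file.
// Build: g++ -std=c++17 inflate_blob.cpp -o inflate_blob -lz
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>
#include <zlib.h>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: inflate_blob <blobfile>\n"; return 1; }

    // Read the whole compressed blob into memory.
    std::ifstream in(argv[1], std::ios::binary);
    std::vector<unsigned char> src((std::istreambuf_iterator<char>(in)),
                                   std::istreambuf_iterator<char>());

    // We don't know the decompressed size, so grow the buffer until it fits.
    std::vector<unsigned char> dst(src.size() * 4 + 64);
    for (;;) {
        uLongf dstLen = dst.size();
        int rc = uncompress(dst.data(), &dstLen, src.data(), src.size());
        if (rc == Z_OK)        { dst.resize(dstLen); break; }
        if (rc == Z_BUF_ERROR) { dst.resize(dst.size() * 2); continue; }
        std::cerr << "not zlib data (error " << rc << ")\n";
        return 1;
    }

    std::cout.write(reinterpret_cast<const char*>(dst.data()), dst.size());
    return 0;
}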

Related

Is it possible to run the query COPY TO STDOUT WITH BINARY and stream results with node-postgres?

I'm worried about data-type coercion: will I get a nice Buffer or Uint8Array? Can I get it in chunks (streaming)?
Delving into npm I found: https://www.npmjs.com/package/pg-copy-streams -- this is the answer I was looking for.
Here is a bit more information (copied from the README) so you can avoid traversing the link:
pg-copy-streams
COPY FROM / COPY TO for node-postgres. Stream from one database to
another, and stuff.
how? what? huh?
Did you know the all powerful PostgreSQL supports streaming binary
data directly into and out of a table? This means you can take your
favorite CSV or TSV or whatever format file and pipe it directly into
an existing PostgreSQL table. You can also take a table and pipe it
directly to a file, another database, stdout, even to /dev/null if
you're crazy!
What this module gives you is a Readable or Writable stream directly
into/out of a table in your database. This mode of interfacing with
your table is very fast and very brittle. You are responsible for
properly encoding and ordering all your columns. If anything is out of
place PostgreSQL will send you back an error. The stream works within
a transaction so you wont leave things in a 1/2 borked state, but it's
still good to be aware of.
If you're not familiar with the feature (I wasn't either) you can read
this for some good helps:
http://www.postgresql.org/docs/9.3/static/sql-copy.html
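pg-copy-streams is the node-postgres answer above. Purely to illustrate the underlying server feature (COPY ... TO STDOUT WITH BINARY hands the client raw bytes in chunks), here is a rough C++ sketch using libpq; the connection string and table name are placeholders, and this is not the node-postgres API.

// copy_binary_out.cpp - stream a table out of PostgreSQL with COPY ... TO STDOUT (binary).
// Build: g++ -std=c++17 copy_binary_out.cpp -o copy_binary_out -lpq
#include <cstdio>
#include <libpq-fe.h>

int main() {
    // Placeholder connection string and table name.
    PGconn* conn = PQconnectdb("dbname=test");
    if (PQstatus(conn) != CONNECTION_OK) { std::fprintf(stderr, "%s", PQerrorMessage(conn)); return 1; }

    PGresult* res = PQexec(conn, "COPY my_table TO STDOUT WITH BINARY");
    if (PQresultStatus(res) != PGRES_COPY_OUT) { std::fprintf(stderr, "%s", PQerrorMessage(conn)); PQfinish(conn); return 1; }
    PQclear(res);

    // Each call hands back one raw chunk of the binary COPY stream.
    char* buf = nullptr;
    int len;
    while ((len = PQgetCopyData(conn, &buf, 0)) > 0) {
        std::fwrite(buf, 1, len, stdout);   // here: just forward the raw bytes
        PQfreemem(buf);
    }

    // len == -1 means the stream ended; fetch the final command status.
    res = PQgetResult(conn);
    PQclear(res);
    PQfinish(conn);
    return 0;
}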

Is it safer to put data in implementation files instead of in header files or other data files?

The purpose is to increase the cost for users to cheat in games by hacking local game data, and security is the main concern. You don't need to consider workflow issues between designers and programmers.
Situation: iOS game development, Objective-C.
To save some game settings with a simple structure, such as the max HP value for a boss, I have three plans:
Using plist files (or XML/SQLite etc.; base64 encoding is optional);
Using #define macros and putting the data in a specific header file, say constants.h;
Writing them in Objective-C code in an implementation file. For example, using a singleton instance GameData, put the data in GameData.m and get it by calling its methods.
My questions are:
Is plan 3 the safest one here?
Are there other better plans that are not too complicated?
When you use the 1st and 2nd plan to save data, is it right to write code with the mindset that "all data, even the code here, is visible to users"? For example, is #define kABC 100.0f a little bit safer (more confusing to hackers) than #define kEnemy01_HP_Max 100.0f?
Neither method is safe, nor is any of them safer than another, unless you encrypt the data. You are confusing data security/integrity with private encapsulation. They are not related: a hacker won't be kind enough to use your pre-defined setter/getter functions; they will inspect the binary executable, which is your program. Anyone with a basic hex editor for your platform will be able to see that data, if they know where to look.
EDIT:
Also, please note that variable/function/macro names etc. are only present in your source code; they are not present in your executable. So giving them cryptic names will serve one purpose, and that is to confuse you, the programmer.
Use the GameData singleton you mentioned. Add 3 methods:
1. Make GameData capable of reading its data from an unencrypted data file.
2. Make GameData capable of writing its data encrypted to a data file.
3. Make GameData capable of reading its data from an encrypted data file.
Refer to: http://www.codeproject.com/Articles/831481/File-Encryption-Decryption-Tutorial-in-Cplusplus
For development use an unencrypted data file and use GameData to encrypt the data (methods 1 and 2).
Ship the encrypted data file and use GameData to decrypt it (method 3).
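The linked tutorial is C++, so here is a rough C++ sketch of those three operations. The XOR step is only a placeholder for a real cipher (for example AES via a crypto library), and the key and file-handling details are my assumptions.

// gamedata_crypt.cpp - sketch of GameData-style load/save with an encrypted file.
// NOTE: the XOR "cipher" below is a placeholder only; use a real cipher (e.g. AES) in production.
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

static const unsigned char kKey[] = {0x5A, 0x17, 0xC3, 0x99};   // placeholder key

static void xorInPlace(std::vector<char>& data) {
    for (size_t i = 0; i < data.size(); ++i)
        data[i] ^= kKey[i % sizeof(kKey)];
}

static std::vector<char> readFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return {std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>()};
}

static void writeFile(const std::string& path, const std::vector<char>& data) {
    std::ofstream out(path, std::ios::binary);
    out.write(data.data(), data.size());
}

// 1. Read plain game data (used during development).
std::vector<char> loadPlain(const std::string& path) { return readFile(path); }

// 2. Write the encrypted data file that ships with the game.
void saveEncrypted(const std::string& path, std::vector<char> data) {
    xorInPlace(data);
    writeFile(path, data);
}

// 3. Read the encrypted data file at runtime.
std::vector<char> loadEncrypted(const std::string& path) {
    std::vector<char> data = readFile(path);
    xorInPlace(data);
    return data;
}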

Is there an OCaml library to store/use data structures on disk?

Something like bdb. However, I looked at ocaml-bdb, and it seems it's made to store only strings. My problem is that I have arrays that store giant data. Sure, I can serialize them into many files, or encode/decode my data and put it in a database or one of those key-value DB things, which is my last resort. I'm wondering if there's a better way.
The HDF4 / HDF5 file format might suit your needs. See http://forge.ocamlcore.org/projects/ocaml-hdf/
In addition to the HDF4 bindings mentioned by jrouquie there are HDF5 bindings available (http://opam.ocaml.org/packages/hdf5/). Depending on the type of data you're storing there are bindings to GDAL (http://opam.ocaml.org/packages/gdal/).
For data which can fit in a bigarray you also have the option of memory mapping a large file on disk. See https://caml.inria.fr/pub/docs/manual-ocaml/libref/Bigarray.Genarray.html#VALmap_file for example. While it ties you to a rather strict on-disk format, it does make it relatively simple to manipulate arrays which are larger than the available RAM.
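For comparison, the same memory-mapping idea outside of OCaml: a hedged C++/POSIX sketch that maps a large file of raw doubles and reads it as an ordinary array. The file name and element type are assumptions; in OCaml you would use Bigarray's map_file as the answer above says.

// map_array.cpp - treat a large on-disk file of doubles as an in-memory array via mmap.
// Build: g++ -std=c++17 map_array.cpp -o map_array   (POSIX only)
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <iostream>

int main() {
    const char* path = "big_array.bin";          // placeholder file of raw doubles
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    size_t n = st.st_size / sizeof(double);

    // The kernel pages data in on demand, so the file can be larger than RAM.
    void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    const double* a = static_cast<const double*>(p);
    if (n > 0) std::cout << "first element: " << a[0] << ", count: " << n << "\n";

    munmap(p, st.st_size);
    close(fd);
    return 0;
}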
There was an OCaml BerkeleyDB wrapper in the past:
OCamlDB
Apparently someone looked into it recently:
recent patch for OCamlDB
However, the GDAL bindings from hcarty are probably production ready and in intensive usage somewhere.
Also, there are bindings for dbm in opam: dbm and cryptodbm
HDF5 is probably the answer, but given the question is somewhat vague, another solution is possible.
Disclaimer: I don't know OCaml (but I knew Caml Light) and I do know Berkeley DB (a.k.a. bsddb, a.k.a. bdb).
However, I looked at the ocaml-bdb, seems like it's made to store only string.
That may be true of ocaml-bdb, but in reality it stores bytes. I am not sure about your case, because in Python 2 there was no difference between bytes and strings of unicode chars. It's only recently that Python 3 got a proper bytes type, and the bdb bindings take and return bytes. That said, the difference is subtle, but you'd rather work with bytes because that's what bdb understands and uses.
My problem is I have arrays that store giant data. Sure, I can serialize them into many files, or encode/decode my data and put them on database
or use those key-value db things, which is my last resort.
I'm wondering if there's a better way.
It depends on your needs and how the data looks.
If the data can all stay in memory, you'd rather dump memory to a file and load it back.
If you need to share that data among several architectures or operating systems, you'd rather use a serialisation framework like HDF5. Remember that HDF5 doesn't handle circular references.
If the data cannot all stay in memory, then you need to use something like bdb (or wiredtiger).
Why bdb (or wiredtiger)?
Simply said, several decades of work have gone into:
splitting data
storing it on disk
retrieving data
as fast as possible.
wiredtiger is the successor of bdb.
So yes, you could split the files yourself, but that will require a lot of work. Only specialized companies do that (Bloomberg included...). Among the people who manage all of the above themselves are the famous PostgreSQL, MariaDB, Google and Algolia.
Ordered key-value stores like wiredtiger and bdb use algorithms similar to higher-level databases like PostgreSQL and MySQL, or specialized ones like Lucene/Solr or Sphinx: MVCC, B-trees, LSM trees, PSSI, etc.
MongoDB has used the wiredtiger backend for storing all its data since 3.2.
Some people argue that key-value stores are not good at storing relational data; that said, several projects have started building distributed databases on top of key-value stores, which is a clue that they are useful, e.g. FoundationDB or CockroachDB.
The idea behind key-value stores is to deliver a generic framework for:
splitting data
storing it on disk
retrieving data
as fast as possible, while giving some guarantees (like ACID) and other nice-to-haves (like compression or encryption).
To take advantage of the power offered by those libraries, you need to learn about key-value composition.

Binary file & saved game formatting

I am working on a small roguelike game, and need some help with creating save games. I have tried several ways of saving games, but the load always fails, because I am not exactly sure what is a good way to mark the beginning of different sections for the player, entities, and the map.
What would be a good way of marking the beginning of each section, so that the data can be read back reliably without knowing the length of each section?
Edit: The language is C++. It looks like a readable format would be a better shot. Thanks for all the quick replies.
The easiest solution is usually to use a library to write the data as XML or INI, then compress it. This will be easier for you to parse, and result in smaller files than a custom binary format.
Of course, it will take slightly longer to load (though not much, unless your data files are hundreds of MBs).
If you're determined to use a binary format, take a look at BER.
Are you really sure you need binary format?
Why not store in some text format so that it can be easily parseable, be it plain text, XML or YAML.
Since you're saving binary data you can't use markers without length information.
Simply write the number of records of each type followed by the structured data; then it will be easy to read back. If you have variable-length elements like strings, they also need length information (see the sketch after the example below).
2
player record
player record
3
entities record
entities record
entities record
1
map
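A minimal C++ sketch of that count-then-records layout. The record fields and section shown (entities only) are made up for illustration; the player and map sections would follow the same pattern.

// save_load.cpp - write/read a section as a count followed by that many records.
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct Entity { int32_t x = 0, y = 0; std::string name; };

// Strings are variable length, so they get their own length prefix.
static void writeString(std::ostream& out, const std::string& s) {
    uint32_t len = static_cast<uint32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof(len));
    out.write(s.data(), len);
}

static std::string readString(std::istream& in) {
    uint32_t len = 0;
    in.read(reinterpret_cast<char*>(&len), sizeof(len));
    std::string s(len, '\0');
    in.read(&s[0], len);
    return s;
}

static void saveEntities(std::ostream& out, const std::vector<Entity>& es) {
    uint32_t count = static_cast<uint32_t>(es.size());
    out.write(reinterpret_cast<const char*>(&count), sizeof(count));  // the "3" in the example
    for (const Entity& e : es) {                                      // then the records
        out.write(reinterpret_cast<const char*>(&e.x), sizeof(e.x));
        out.write(reinterpret_cast<const char*>(&e.y), sizeof(e.y));
        writeString(out, e.name);
    }
}

static std::vector<Entity> loadEntities(std::istream& in) {
    uint32_t count = 0;
    in.read(reinterpret_cast<char*>(&count), sizeof(count));
    std::vector<Entity> es(count);
    for (Entity& e : es) {
        in.read(reinterpret_cast<char*>(&e.x), sizeof(e.x));
        in.read(reinterpret_cast<char*>(&e.y), sizeof(e.y));
        e.name = readString(in);
    }
    return es;
}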
If you have a marker, you have to guarantee that the pattern doesn't exist elsewhere in your binary stream. If it does exist, you must use a special escape sequence to differentiate it. The Telnet protocol uses 0xFF to mark special commands that aren't part of the data stream. Whenever the data stream contains a naturally occurring 0xFF, then it must be replaced by 0xFFFF.
So you'd use a 2-byte marker to start a new section, like 0xFF01. If your reader sees 0xFF01, it's a new section. If it sees 0xFFFF, you'd collapse it into a single 0xFF. Naturally you can expand this approach to use any length marker you want.
(Although my personal preference is a text format (optionally compressed) or a binary format with length bytes instead of markers. I don't understand how you're serializing it without knowing when you're done reading a data structure.)
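A hedged C++ sketch of the 0xFF escaping described in the answer above: 0xFF 0x01 starts a new section, and a literal 0xFF in the data is doubled to 0xFF 0xFF. The function names and container choices are my own.

// ff_escape.cpp - sketch of Telnet-style escaping for section markers in a byte stream.
#include <cstdint>
#include <vector>

// Append a section marker: 0xFF 0x01 means "a new section starts here".
void beginSection(std::vector<uint8_t>& out) {
    out.push_back(0xFF);
    out.push_back(0x01);
}

// Append data, doubling any naturally occurring 0xFF so it can't be mistaken for a marker.
void appendEscaped(std::vector<uint8_t>& out, const std::vector<uint8_t>& data) {
    for (uint8_t b : data) {
        out.push_back(b);
        if (b == 0xFF) out.push_back(0xFF);   // 0xFF -> 0xFF 0xFF
    }
}

// Decode: split at 0xFF 0x01 markers and collapse 0xFF 0xFF back to a single 0xFF.
std::vector<std::vector<uint8_t>> decodeSections(const std::vector<uint8_t>& in) {
    std::vector<std::vector<uint8_t>> sections;
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] == 0xFF && i + 1 < in.size()) {
            if (in[i + 1] == 0x01) { sections.emplace_back(); ++i; continue; }  // new section
            if (in[i + 1] == 0xFF) {                                            // literal 0xFF
                if (sections.empty()) sections.emplace_back();
                sections.back().push_back(0xFF);
                ++i;
                continue;
            }
        }
        if (sections.empty()) sections.emplace_back();   // tolerate data before the first marker
        sections.back().push_back(in[i]);
    }
    return sections;
}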

What's the canonical way to store arbitrary (possibly marked up) text in SQL?

What do wikis/stackoverflow/etc. do when it comes to storing text? Is the text broken at newlines? Is it broken into fixed-length chunks? How do you best store arbitrarily long chunks of text?
nvarchar(max) FTW, because overcomplicating simple things is bad, mmkay?
I guess if you need to offer the ability to store large chunks of text and you don't mind not being able to look into their content too much when querying, you can use CLOBs.
This all depends on the RDBMS you are using as well as the types of text you are going to store. If the text is formatted into sizable chunks of data that mean something in and of themselves, like, say, header/body, then you might want to break the data up into columns of those types. It may take multiple tables to use this method, depending on the content you are dealing with.
I don't know how other RDBMSs handle it, but I know that it's not a good idea to have more than one open-ended column in each table (text or varchar(max)). So you will want to make sure that only one column has unlimited characters.
Regarding PostgreSQL - use type TEXT or BYTEA. If you need to read random chunks you may consider large objects.
If you need to worry about keeping things like formatting strings, quotes, and other "cruft" in the text, as code would likely have, then the special characters need to be completely escaped first; otherwise, on submission to the db, they might end up causing an invalid command to be issued.
Most scripting languages have tools to do this built-in natively.
I guess it depends on where you want to store the text and whether you need things like transactions, etc.
Databases like SQL Server have a type that can store long text fields. In SQL Server 2005 this would primarily be nvarchar(max) for long unicode text strings. By using a database you can benefit from transactions and easy backup/restore assuming you are using the database for other things like StackOverflow.com does.
The alternative is to store text in files on disk. This may be fairly simple to implement and can work in environments where a database is not available or overkill.
Regarding the format of the text that is stored in a database or file, it is probably very close to the input. If it's HTML then you would just push it through a function that would correctly escape it.
Something to remember is that you probably want to be using unicode or UTF-8 from creation to storage and vice-versa. This will allow you to support additional languages. Any problem with this encoding mechanism will corrupt your text. Historically people may have defaulted to ASCII based on the assumption they were saving disk space etc.
For SQL Server:
Use a varchar(max) to store. I think the upper limit is 2 GB.
Don't try to escape the text yourself. Pass the text through a parameterizing structure that will do the escapes properly for you. In .Net you'd add a parameter to a SqlCommand, or just use LinqToSQL (which then manages the SqlCommand for you).
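The same principle outside of .NET, as a hedged sketch: SQLite's C API is used here only as one example of a parameterized statement, and the table and column names are made up. The point is the same as the answer above: let the driver do the escaping.

// insert_text.cpp - let the driver handle quoting by binding the text as a parameter.
// Build: g++ -std=c++17 insert_text.cpp -o insert_text -lsqlite3
#include <sqlite3.h>
#include <string>

bool insertPost(sqlite3* db, const std::string& body) {
    sqlite3_stmt* stmt = nullptr;
    // The ?1 placeholder means we never splice user text into the SQL string ourselves.
    if (sqlite3_prepare_v2(db, "INSERT INTO posts(body) VALUES (?1)", -1, &stmt, nullptr) != SQLITE_OK)
        return false;
    sqlite3_bind_text(stmt, 1, body.c_str(), -1, SQLITE_TRANSIENT);
    bool ok = (sqlite3_step(stmt) == SQLITE_DONE);
    sqlite3_finalize(stmt);
    return ok;
}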
I suspect StackOverflow is storing text in markdown format in arbitrarily-sized 'text' column. Maybe as UTF8 (but it might be UTF16 or something. I'm guessing it's SQL Server, which I don't know much about).
As a general rule you want to store stuff in your database in the 'rawest' form possible. That is, do all your decoding, and possibly cleaning, but don't do anything else with it (for example, if it's Markdown, don't encode it to HTML, leave it in its original 'raw' format)