Is there an OCaml library to store and use data structures on disk (serialization)?

Something like bdb. However, I looked at ocaml-bdb, and it seems to be made to store only strings. My problem is that I have arrays that hold giant amounts of data. Sure, I can serialize them into many files, or encode/decode my data and put it in a database or one of those key-value stores, which is my last resort. I'm wondering if there's a better way.

The HDF4 / HDF5 file format might suit your needs. See http://forge.ocamlcore.org/projects/ocaml-hdf/

In addition to the HDF4 bindings mentioned by jrouquie, there are HDF5 bindings available (http://opam.ocaml.org/packages/hdf5/). Depending on the type of data you're storing, there are also bindings to GDAL (http://opam.ocaml.org/packages/gdal/).
For data which can fit in a bigarray you also have the option of memory mapping a large file on disk. See https://caml.inria.fr/pub/docs/manual-ocaml/libref/Bigarray.Genarray.html#VALmap_file for example. While it ties you to a rather strict on-disk format, it does make it relatively simple to manipulate arrays which are larger than the available RAM.
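
The memory-mapping idea can be sketched in a few lines. This is only an illustration in Python using the standard `mmap` and `struct` modules, not the OCaml `Bigarray` API from the link; the file name, the element count and the little-endian float64 layout are assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

# Assumed on-disk layout: a flat sequence of little-endian float64 values.
ITEM = struct.Struct("<d")


def create_array_file(path, length):
    """Create a zero-filled file large enough for `length` float64 values."""
    with open(path, "wb") as f:
        f.truncate(length * ITEM.size)


def set_item(mm, index, value):
    """Write one element in place; the OS decides when the page hits the disk."""
    ITEM.pack_into(mm, index * ITEM.size, value)


def get_item(mm, index):
    """Read one element; only the pages that are touched get loaded."""
    return ITEM.unpack_from(mm, index * ITEM.size)[0]


path = os.path.join(tempfile.mkdtemp(), "big_array.bin")  # hypothetical file
create_array_file(path, 1_000_000)

# Map the file read-write and update a single element.
with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
    set_item(mm, 123_456, 3.5)
    mm.flush()

# Map it again read-only and read the element back.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    value = get_item(mm, 123_456)
```

The strict on-disk format mentioned above shows up here as the fixed `ITEM` layout: every reader has to agree on the element type and byte order.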

There was an OCaml BerkeleyDB wrapper in the past:
OCamlDB
Apparently someone looked into it recently:
recent patch for OCamlDB
However, the GDAL bindings from hcarty are probably production ready and in intensive usage somewhere.
Also, there are bindings for dbm in opam: dbm and cryptodbm

HDF5 is probably the answer, but given that the question is somewhat vague, other solutions are possible.
Disclaimer: I don't know OCaml (but I knew Caml Light) and I know Berkeley DB (a.k.a. bsddb, a.k.a. bdb).
However, I looked at the ocaml-bdb, seems like it's made to store only string.
That may be true in ocaml-bdb, but in reality bdb stores bytes. I am not sure about your case: in Python 2 the default string type doubled as a byte string, and it was only with Python 3 that a proper bytes type arrived, which the bdb bindings now take and return. The difference is subtle, but you'd rather work with bytes because that is what bdb understands and uses.
My problem is I have arrays that store giant data. Sure, I can serialize them into many files, or encode/decode my data and put them on database
or use those key-value db things, which is my last resort.
I'm wondering if there's a better way.
It depends on your needs and on what the data looks like.
If the data can all stay in memory, you'd rather dump memory to a file and load it back.
If you need to share that data among several architectures or operating systems, you'd rather use a serialisation framework like HDF5. Remember that HDF5 doesn't handle circular references.
If the data cannot all stay in memory, then you need to use something like bdb (or wiredtiger).
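
As a minimal sketch of the first option (dump memory to a file and load it back), here is the idea in Python with the standard `pickle` module. It stands in for whatever native marshalling the language offers; the file name and the sample data are made up, and the resulting file is only readable from Python, which is exactly the portability limit mentioned above.

```python
import os
import pickle
import tempfile

# Sample in-memory structure, including a circular reference.
data = {"name": "samples", "values": list(range(5))}
data["self"] = data

path = os.path.join(tempfile.mkdtemp(), "dump.pickle")  # hypothetical file

# Dump the whole structure in one call...
with open(path, "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

# ...and load it back later, possibly in another process.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

The circular reference survives the round trip here, which is something a cross-language format may not give you.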
Why bdb (or wiredtiger)
Simply said, several decades of work have gone into:
splitting data
storing it on disk
retrieving data
As fast as possible.
wiredtiger is the successor of bdb.
So yes, you could split the files yourself and so on, but that would require a lot of work. Only specialized companies do that (Bloomberg included...); among those who manage all of the above themselves are the famous postgresql, mariadb, google and algolia.
Ordered key-value stores like wiredtiger and bdb use algorithms similar to those of higher-level databases like postgresql and mysql, or specialized ones like lucene/solr or sphinx, i.e. mvcc, btree, lsm, PSSI etc.
MongoDB has used the wiredtiger backend by default since 3.2 to store all its data.
Some people argue that key-value stores are not good at storing relational data. That said, several projects have built distributed databases on top of key-value stores, e.g. FoundationDB or CockroachDB, which is a clue that the approach is useful.
The idea behind key-value stores is to deliver a generic framework for:
splitting data
storing it on disk
retrieving data
As fast as possible, giving some guarantees (like ACID) and other nice to haves (like compression or cryptography).
To take advantage of the power offered by those libraries, you need to learn about key-value composition.
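
To give a rough idea of what key composition means for a big array split into chunks, here is a sketch in Python. A plain dict plus `sorted()` stands in for the ordered store (a real bdb or wiredtiger cursor would do the range scan for you), and the `(series_id, chunk_index)` key scheme is an assumption made for the example.

```python
import struct


def make_key(series_id, chunk_index):
    """Compose a key so that byte order matches (series_id, chunk_index) order.

    Big-endian unsigned integers compare bytewise the same way they compare
    numerically, which is what an ordered key-value store relies on.
    """
    return struct.pack(">IQ", series_id, chunk_index)


def parse_key(key):
    """Split a composed key back into its parts."""
    return struct.unpack(">IQ", key)


# Stand-in for an ordered key-value store: bytes keys, bytes values.
store = {}
for series_id, chunk_index, payload in [(2, 0, b"c"), (1, 10, b"b"), (1, 2, b"a")]:
    store[make_key(series_id, chunk_index)] = payload


def scan_prefix(store, series_id):
    """Return all chunks of one series, in chunk order (emulated range scan)."""
    prefix = struct.pack(">I", series_id)
    return [(parse_key(k), v) for k, v in sorted(store.items()) if k.startswith(prefix)]


chunks = scan_prefix(store, 1)
```

With this layout, all chunks of one array sit next to each other in key order, so reading a slice of the array becomes a short range scan instead of many unrelated lookups.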

Related

messagepack with redis where size of data is not big

Redis is a data structure store, but it is still recommended to use MessagePack (or protobuf) to serialize/deserialize data. I am kind of confused about using MessagePack on top of Redis when the data chunks written to Redis are not very big.
MessagePack needs to pack and unpack data according to its own protocol, which surely incurs some cost, and the packed data would only be stored as the "string" data type in Redis.
To leverage Redis as a data structure server, a thin layer can be written to read/write directly to/from Redis data structures, let's say between C++ and Python. Where exactly does MessagePack fit in, then?
Can somebody shed some light on message-pack in context of redis?
Regards,
Rahul
Disclaimer - No offence to MessagePack's capability, I know it's really awesome :-)
There is no single answer, but I can offer a few guidelines.
Redis' basic data type is the string - it is binary-safe and can hold up to 0.5GB (probably more in an upcoming version).
Key names are strings, but you usually want to a) keep them short and b) keep them legible and reconstructable, since they are the only way to access your data.
Values can be strings. If the payload is already a string - no need to serialize/deserialize, just store as is. Common example: jpg or png files.
If your app is already using msgpack (or json or protobuf or...), you can store the serialized form.
Redis' Lua has built-in libs for dealing with json and msgpack.
There are modules (e.g. http://rejson.io) that can extend that.
I hope that helps.
Disclaimer: author of mentioned module, Redis geek and has a black belt working w/ Redis' Lua ;)

How to store huge amount of NSStrings for comparison purpose

I am writing a (linguistic) morphology application for the Mac. I often have to check whether the words in a given text are in a huge list of words (~1,000,000).
My question is: how do I store these lists?
I use a .txt file to store the words and create an NSSet from this file, which lives as long as the application is running.
I use a database like SQLite.
Some points:
I think the focus should be on speed, because the analysis is triggered by the user and these comparisons make up the largest part of the computation.
The lists may change via updates.
I have used Core Data and MySQL before, so (I think) I could implement either.
I have read a lot about the pros and cons of database vs. file, but I never thought it applied to my use case.
I don't know whether it matters which technique I use, because the files are relatively small (~20 MB) and, even with a lot of supported languages, only 3-4 of these files will be loaded into memory at the same time.
Thanks! Danke!
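
The two options in the question can be put side by side in a short sketch. This is Python rather than Objective-C, with a built-in `set` playing the role of NSSet and the standard `sqlite3` module for the database variant; the tiny word list and the file name are placeholders for the real ~1,000,000-word files.

```python
import os
import sqlite3
import tempfile

# Placeholder word list written to a text file, one word per line.
words = ["haus", "häuser", "maus", "mäuse"]
path = os.path.join(tempfile.mkdtemp(), "words.txt")  # hypothetical file
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(words))

# Option 1: load the file once into an in-memory set (hash lookup per word).
with open(path, encoding="utf-8") as f:
    word_set = {line.strip() for line in f if line.strip()}


def in_set(word):
    return word in word_set


# Option 2: keep the words in SQLite; the PRIMARY KEY gives an index lookup.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")
db.executemany("INSERT INTO words VALUES (?)", [(w,) for w in word_set])


def in_db(word):
    row = db.execute("SELECT 1 FROM words WHERE word = ?", (word,)).fetchone()
    return row is not None


text = "maus katze häuser".split()
found_in_set = [w for w in text if in_set(w)]
found_in_db = [w for w in text if in_db(w)]
```

Both variants answer the same membership question; the set keeps everything in memory for the lifetime of the process, while the database variant trades a query per word for easier updates.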

Migrating RMS to RDB

We're approaching the migration of legacy OpenVMS RMS files into a relational database (both MS SQL 2012 and Oracle 10g are available).
I wonder if there are:
Tools to retrieve schema of indexed files
Tools to parse indexed files
Tools to deal with custom RMS data formats (zoned decimals etc)
as a bundle/API/Library
Perhaps I should change the approach?
There are several tools available, notably through ODBC vendors (I work for one: Attunity).
1 >> Tools to retrieve schema of indexed files
Please clarify. Are you looking for just the record/column layout and indexes within the files, or also for relationships between files?
1a) How are the files currently being used? Cobol, Basic, Fortran programs? Datatrieve?
They will be using some data definition method, so you want a tool which can exploit that.
Connx and Attunity Connect can 'import' CDD definitions, BASIC MAP files and Cobol copybooks. Variants are typically covered as well. I have written many a (perl/awk) script to convert special definitions to XML.
1b ) Analyze/RMS, or a program calling RMS XABs, can get the available index information. Attunity Connect will know how to map those onto the fields from 1a)
1c ) There is no formal, stored relationship between (indexed) files on OpenVMS. That's all in the program logic. However, some modestly smart Perl/Awk/DCL script can often generate a table of likely foreign/primary keys by looking at field names and datatype matches.
How many files / layouts / gigabytes are we talking about?
2 >> Tools to parse indexed files
Please clarify? Once the structure is known (question 1), the parsing is done by reading using that structure right? You never ever want to understand the indexed file internals. Just tell RMS to fetch records.
3 >> Tools to deal with custom RMS data formats (zoned decimals etc) as a bundle/API/Library
Again, please clarify. Once the structure is known just use the 'right' tool to read using that structure and surely it will honor the detailed data definitions.
(I know it is quite simple to write one yourself, just thought there would be something in the industry)
Famous last words... 'quite simple'. Entire companies have been built and thrive doing just that for general cases. I admit that for specific cases it can be relatively straightforward, but 'the devil is in the details'.
In the Attunity Connect case we have a UDT (User Defined data Type) to handle the 'odd' cases, often involving DATES. Dates in integers, in strings, as units since xxx are all available out of the box, but for example some have -1 meaning 'some high date' which needs some help to be stored in a DB.
All the databases have some bulk load tool (BCP, SQL*Loader).
As long as you can deliver data conforming to what those expect (tabular, comma-separated, quoted-or-not, escapes-or-not) you should be in good shape.
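
To make the "deliver what the loader expects" step concrete, here is a small Python sketch that turns fixed-layout binary records into quoted, comma-separated text. The record layout (integer id, 10-byte padded name, integer date as YYYYMMDD with -1 as a sentinel) is entirely hypothetical, it does not read real RMS files, and how an empty field is interpreted depends on the loader's configuration.

```python
import csv
import io
import struct

# Hypothetical record layout: 4-byte id, 10-byte space-padded ASCII name,
# 4-byte signed date stored as YYYYMMDD, with -1 meaning "no end date".
RECORD = struct.Struct("<i10si")


def decode(raw):
    """Decode one fixed-length record into a list of CSV-ready fields."""
    rec_id, name, date = RECORD.unpack(raw)
    if date == -1:
        date_text = ""  # sentinel mapped to an empty field
    else:
        date_text = f"{date // 10000:04d}-{date // 100 % 100:02d}-{date % 100:02d}"
    return [rec_id, name.decode("ascii").rstrip(), date_text]


# Two sample records; the second name contains a comma and quotes on purpose.
raw_records = [
    RECORD.pack(1, b"ALPHA".ljust(10), 20120131),
    RECORD.pack(2, b'B,"ETA"'.ljust(10), -1),
]

out = io.StringIO()
writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(["id", "name", "valid_until"])
for raw in raw_records:
    writer.writerow(decode(raw))

csv_text = out.getvalue()
```

Letting a CSV library handle quoting and escaping avoids the classic failure where a stray comma or quote in a text field shifts every following column.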
The EGH tool Vselect may be a handy, high-performance way to bulk read indexed files, filter and format some and spit out sequential files for the DB loaders. It can read RMS indexed files faster than RMS can! (It has its own metadata language though!)
Attunity offers full access and replication services.
They include CDC (change data capture) to not only load the data, but also keep it up to date in near-real-time. That's useful for 'evolution' versus 'revolution'.
Check out Attunity 'Replicate'. Once you have a data dictionary, just point to the tables desired (include, exclude filters), point to a target DB and click to replicate. Of course there are options for (global or per-table) transformations (like an AREA-CODE+EXCHANGE+NUMBER to a single phone number, or adding a modified-date column).
Will this be a single big switch conversion, or is there desire to migrate the data and keep the old systems alive for days, months, years perhaps, all along keeping the data in close sync?
Hope this helps some,
Hein van den Heuvel.
OP: Perhaps I should change the approach? Probably.
You might consider finding data migration vendors, some of which likely have off-the-shelf solutions, if not as a COTS tool, more likely packaged as a service (I don't think this is a big market).
What this won't help you with is what I think of as a much bigger problem with the application code: who is going to change all the code that is making RMS calls into the corresponding code that makes relational DB calls? How will the entity ("Joe Programmer", or some tool) know where the data migrated to, so that it can write the correct call? What are you going to do about the fact that the data representation is likely to change?
Ideally you'd like an automated migration tool that will move the data itself (and therefore knows the data layouts and representation changes), and will make the code changes that correspond. You can look for these kinds of vendors, too.

Provide example for why it is not advisable to store images in CoreData?

This question has been asked many times, and I have read many users saying that it is not advisable to store images in a DB, in particular within Core Data. But they all seem to omit the reason. Even the Apple documentation states this, everybody points in that direction, and every discussion ends like this: "well, you can, but storing the path is better".
Apart from opinions, I would like to have a concrete example of why it is not a good solution.
Let me explain: I have a strong background in building web applications. A concrete example I would give from my point of view could be: do not store images in a DB, but rather the path to them, because you can then have them served by the web server, which can apply all of its caching mechanisms.
But in a desktop environment, and especially in an iOS application, what are the downsides of storing them in Core Data using SQLite, provided that:
There's a separate entity holding the images; it is not an attribute of the main entity.
There also seems to be a limit of 100 KB for images. Why? What happens with images of 110, 120... 200 KB, etc.?
thanks
There's nothing special about what Core Data normally does here. It's just using an SQLite database. You can put large blobs of data into it, but it just doesn't scale all that well. You can read more about it here: Internal Versus External BLOBs in SQLite.
That said, Core Data has support for external blobs which in Core Data terminology is called stored in external record (iOS 5.0 and later). Again, there's nothing magic about it, it's just storing the large pieces of data in the file system separately from the SQLite db itself. The benefit is that Core Data updates all this for you.
When you're in Xcode, there'll be a checkbox called Allows External Storage that you can check for Binary Data properties.
The filesystem and the APIs surrounding it are (just like a web server) optimized to serve files of any size, and to apply caching where appropriate.
CoreData is optimized for handling an object graph with tiny pieces of data, like integers and short strings.
Also, there are a number of other issues that tend to creep up on you, like periodically vacuuming the SQLite database CoreData uses, or it won't be able to shrink, just grow.
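
The vacuuming point can be observed directly on a plain SQLite file. The sketch below uses Python's standard `sqlite3` module against a throwaway database, not Core Data itself; the table name, blob size and row count are arbitrary, and `auto_vacuum` is set to `NONE` explicitly so the behaviour does not depend on how the library was built.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "blobs.sqlite")  # hypothetical file

# isolation_level=None puts the connection in autocommit mode, so every
# statement is written out immediately and VACUUM can run without a commit.
db = sqlite3.connect(path, isolation_level=None)
db.execute("PRAGMA auto_vacuum = NONE")
db.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, data BLOB)")

# Store twenty 200 KB blobs inside the database file.
blob = os.urandom(200 * 1024)
for _ in range(20):
    db.execute("INSERT INTO images (data) VALUES (?)", (blob,))
size_full = os.path.getsize(path)

# Deleting the rows frees pages inside the file but does not shrink it.
db.execute("DELETE FROM images")
size_after_delete = os.path.getsize(path)

# VACUUM rebuilds the file and gives the space back to the filesystem.
db.execute("VACUUM")
size_after_vacuum = os.path.getsize(path)
db.close()
```

Comparing the three sizes shows why a database that regularly stores and deletes large images keeps growing unless something vacuums it.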
Leonardo,
With Lion/iOS 5, Core Data started handling file system storage of large BLOBs for you.
The choice is really determined by how many images you are going to have open. If you have many, then you should keep them in the DB. Why? Because you only have a modest number of file descriptors, one of which is used for each open image stored in the file system.
That said, there is still a reason to manage the files yourself. If your BLOBs are really big, say 2+ MB, you will want to map them into memory and not just read them in. (When the memory warnings come, this lets the OS automatically purge them from your resident memory. This is a very good thing.) Even so, you still have the limited number of file descriptors problem.
Andrew

Good reasons NOT to use a relational database?

Can you please point to alternative data storage tools and give good reasons to use them instead of good-old relational databases? In my opinion, most applications rarely use the full power of SQL--it would be interesting to see how to build an SQL-free application.
Plain text files in a filesystem
* Very simple to create and edit
* Easy for users to manipulate with simple tools (i.e. text editors, grep etc)
* Efficient storage of binary documents
XML or JSON files on disk
* As above, but with a bit more ability to validate the structure.
Spreadsheet / CSV file
* Very easy model for business users to understand
Subversion (or similar disk based version control system)
* Very good support for versioning of data
Berkeley DB (Basically, a disk based hashtable)
* Very simple conceptually (just un-typed key/value)
* Quite fast
* No administration overhead
* Supports transactions I believe
Amazon's Simple DB
* Much like Berkeley DB I believe, but hosted
Google's App Engine Datastore
* Hosted and highly scalable
* Per document key-value storage (i.e. flexible data model)
CouchDB
* Document focus
* Simple storage of semi-structured / document based data
Native language collections (stored in memory or serialised on disk)
* Very tight language integration
Custom (hand-written) storage engine
* Potentially very high performance in required use cases
I can't claim to know anything much about them, but you might also like to look into object database systems.
Matt Sheppard's answer is great (mod up), but I would take these factors into account when thinking about a spindle:
Structure : does it obviously break into pieces, or are you making tradeoffs?
Usage : how will the data be analyzed/retrieved/grokked?
Lifetime : how long is the data useful?
Size : how much data is there?
One particular advantage of CSV files over RDBMSes is that they can be easy to condense and move around to practically any other machine. We do large data transfers, and everything is simple enough that we just use one big CSV file, which is easy to script using tools like rsync. To reduce repetition in big CSV files, you could use something like YAML. I'm not sure I'd store anything as JSON or XML unless you had significant relationship requirements.
As far as not-mentioned alternatives, don't discount Hadoop, which is an open source implementation of MapReduce. This should work well if you have a TON of loosely structured data that needs to be analyzed, and you want to be in a scenario where you can just add 10 more machines to handle data processing.
For example, I started trying to analyze performance that was essentially all timing numbers of different functions logged across around 20 machines. After trying to stick everything in an RDBMS, I realized that I really don't need to query the data again once I've aggregated it. And it's only useful to me in its aggregated format. So I keep the log files around, compressed, and then leave the aggregated data in a DB.
Note I'm more used to thinking with "big" sizes.
The filesystem's pretty handy for storing binary data, which never works amazingly well in relational databases.
Try Prevayler:
http://www.prevayler.org/wiki/
Prevayler is an alternative to an RDBMS. The site has more info.
If you don't need ACID, you probably don't need the overhead of an RDBMS. So, determine whether you need that first. Most of the non-RDBMS answers provided here do not provide ACID.
Custom (hand-written) storage engine / Potentially very high performance in required use cases
http://www.hdfgroup.org/
If you have enormous data sets, instead of rolling your own, you might use HDF, the Hierarchical Data Format.
http://en.wikipedia.org/wiki/Hierarchical_Data_Format:
HDF supports several different data models, including multidimensional arrays, raster images, and tables.
It's also hierarchical like a file system, but the data is stored in one magic binary file.
HDF5 is a suite that makes possible the management of extremely large and complex data collections.
Think petabytes of NASA/JPL remote sensing data.
G'day,
One case that I can think of is when the data you are modelling cannot be easily represented in a relational database.
One such example is the database used by mobile phone operators to monitor and control base stations for mobile telephone networks.
In almost all of these cases, an OO DB is used, either a commercial product or a self-rolled system that allows hierarchies of objects.
I've worked on a 3G monitoring application for a large company who will remain nameless, but whose logo is a red wine stain (-: , and they used such an OO DB to keep track of all the various attributes for individual cells within the network.
Interrogation of such DBs is done using proprietary techniques that are, usually, completely free from SQL.
HTH.
cheers,
Rob
Object databases are not relational databases. They can be really handy if you just want to stuff some objects in a database. They also support versioning and modifying classes for objects that already exist in the database. db4o is the first one that comes to mind.
In some cases (financial market data and process control for example) you might need to use a real-time database rather than a RDBMS. See wiki link
There was a RAD tool called JADE written a few years ago that has a built-in OODBMS. Earlier incarnations of the DB engine also supported Digitalk Smalltalk. If you want to sample application building using a non-RDBMS paradigm this might be a start.
Other OODBMS products include Objectivity and GemStone (you will need to get VisualWorks Smalltalk to run the Smalltalk version, but there is also a Java version). There were also some open-source research projects in this space - EXODUS and its descendant SHORE come to mind.
Sadly, the concept seemed to die a death, probably due to the lack of a clearly visible standard and relatively poor ad-hoc query capability compared with SQL-based RDBMS systems.
An OODBMS is most suitable for applications with core data structures that are best represented as a graph of interconnected nodes. I used to say that the quintessential OODBMS application was a Multi-User Dungeon (MUD) where rooms would contain players' avatars and other objects.
You can go a long way just using files stored in the file system. RDBMSs are getting better at handling blobs, but this can be a natural way to handle image data and the like, particularly if the queries are simple (enumerating and selecting individual items.)
Other things that don't fit very well in a RDBMS are hierarchical data structures and I'm guessing geospatial data and 3D models aren't that easy to work with either.
Services like Amazon S3 provide simpler storage models (key->value) that don't support SQL. Scalability is the key there.
Excel files can be useful too, particularly if users need to be able to manipulate the data in a familiar environment and building a full application to do that isn't feasible.
There are a large number of ways to store data. Even "relational database" covers a range of alternatives, from a simple library of code that manipulates a local file (or files) as if it were a relational database on a single-user basis, through file-based systems that can handle multiple users, to a generous selection of serious "server"-based systems.
We use XML files a lot - you get well-structured data, nice tools for querying it, the ability to do edits if appropriate, and something that's human readable, and you don't then have to worry about the db engine working (or the workings of the db engine). This works well for stuff that's essentially read only (in our case more often than not generated from a db elsewhere) and also for single-user systems where you can just load the data in and save it out as required - but you're creating opportunities for problems if you want multi-user editing - at least of a single file.
For us that's about it - we're either going to use something that will do SQL (MS offer a set of tools that run from a .DLL to do single user stuff all the way through to enterprise server and they all speak the same SQL (with limitations at the lower end)) or we're going to use XML as a format because (for us) the verbosity is seldom an issue.
We don't currently have to manipulate binary data in our apps so that question doesn't arise.
Murph
One might want to consider the use of an LDAP server in the place of a traditional SQL database if the application data is heavily key/value oriented and hierarchical in nature.
BTree files are often much faster than relational databases. SQLite contains within it a BTree library which is in the public domain (as in genuinely 'public domain', not using the term loosely).
Frankly though, if I wanted a multi-user system I would need a lot of persuading not to use a decent server relational database.
Full-text databases, which can be queried with proximity operators such as "within 10 words of," etc.
Relational databases are an ideal business tool for many purposes - easy enough to understand and design, fast enough, adequate even when they aren't designed and optimized by a genius who could "use the full power," etc.
But some business purposes require full-text indexing, which relational engines either don't provide or tack on as an afterthought. In particular, the legal and medical fields have large swaths of unstructured text to store and wade through.
Also:
* Embedded scenarios - where it is usually required to use something smaller than a full-fledged RDBMS. Db4o is an ODB that can easily be used in such cases.
* Rapid or proof-of-concept development - where you wish to focus on the business and not worry about persistence layer
CAP theorem explains it succinctly. SQL mainly provides "Strong Consistency: all clients see the same view, even in presence of updates".
K.I.S.S: Keep It Small and Simple
I would offer RDBMS :)
If you do not want trouble with setup and administration, go for SQLite.
It is a built-in (embedded) RDBMS with full SQL support. It even allows you to store any type of data in any column.
The main advantage over, for example, a log file: if you have a huge one, how are you going to search it? With an SQL engine you just create an index and speed up the operation dramatically.
About full-text search: SQLite has modules for full-text search too.
Just enjoy nice standard interface to your data :)
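
Here is a small sketch of the "just create an index" point, using Python's standard `sqlite3` module. The `log` table, its columns and the generated rows are invented for the example; `EXPLAIN QUERY PLAN` is used only to show whether SQLite scans the whole table or goes through the index.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE log (ts INTEGER, level TEXT, message TEXT)")

# Generate sample log rows: every 100th entry is an error.
db.executemany(
    "INSERT INTO log VALUES (?, ?, ?)",
    [(i, "ERROR" if i % 100 == 0 else "INFO", f"event {i}") for i in range(10_000)],
)

query = "SELECT COUNT(*) FROM log WHERE level = ?"


def plan(sql, args):
    """Return SQLite's query plan as text (last column of each plan row)."""
    return " ".join(row[-1] for row in db.execute("EXPLAIN QUERY PLAN " + sql, args))


plan_before = plan(query, ("ERROR",))  # full table scan
db.execute("CREATE INDEX idx_log_level ON log (level)")
plan_after = plan(query, ("ERROR",))   # lookup through idx_log_level

error_count = db.execute(query, ("ERROR",)).fetchone()[0]
```

The query text does not change at all; only the plan does, which is the convenience a flat log file cannot offer.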
One good reason not to use a relational database would be when you have a massive data set and want to do massively parallel and distributed processing on the data. The Google web index would be a perfect example of such a case.
Hadoop also has an implementation of the Google File System called the Hadoop Distributed File System.
I would strongly recommend Lua as an alternative to SQLite-kind of data storage.
Because:
The language was designed as a data description language to begin with
The syntax is human readable (XML is not)
One can compile Lua chunks to binary, for added performance
This is the "native language collection" option of the accepted answer. If you're using C/C++ as the application level, it is perfectly reasonable to throw in the Lua engine (100kB of binary) just for the sake of reading configs/data or writing them out.