I found this link where the RavenDB author explains that, although RavenDB has a very large document size limit (around 2 GB), it is unhealthy to manage documents that are too big (among other reasons, it makes indexing too slow). What is the maximum size that is still "healthy" for the system? If it is not documented, is there a good way to determine this size?
Up to a few MB is fine. Beyond that, sending and retrieving documents becomes awkward.
I am trying to better understand how RavenDB uses disk space.
My application has many small documents (approximately 140 bytes each). Presently there are around 81,000 documents, which would give a total raw data size of around 11 MB. However, the database is just over 70 MB on disk.
Is most of the actual space being used by indexes?
I had read somewhere else that there may be a minimum overhead of around 600 bytes per document. This would consume around 49MB, which is more in the ballpark of the actual use I am seeing.
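Here is that math spelled out; the 600-byte overhead is just the figure I read, not a documented number:

```python
# Back-of-envelope estimate; the 600 bytes/doc overhead is an assumption
# I picked up elsewhere, not a documented RavenDB figure.
doc_count = 81_000
doc_bytes = 140          # raw data per document
overhead_bytes = 600     # assumed minimum per-document overhead

raw_mb = doc_count * doc_bytes / 1e6            # ~11 MB of actual data
overhead_mb = doc_count * overhead_bytes / 1e6  # ~49 MB of overhead
print(f"raw data:  {raw_mb:.1f} MB")
print(f"overhead:  {overhead_mb:.1f} MB")
print(f"estimate:  {raw_mb + overhead_mb:.1f} MB vs ~70 MB actually used")
```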
Would using the compression bundle provide much benefit in this scenario (many small documents), or is it targeted towards helping reduce the size of databases with very large documents?
I have done some further testing on my own and determined, in answer to my own question, that:
Indexes are not the main consumer of disk space in my scenario. In this case, indexes represent < 25% of the disk space used.
Adding the compression bundle to a database with a large number of small documents does not really reduce the total amount of disk space used. This is likely due to a minimum per-document overhead that compression cannot remove; compression would presumably benefit very large documents much more.
Is most of the actual space being used by indexes?
Yes, that's likely. Remember that Raven creates indexes for the different queries you make. You can fire up Raven Studio to see which indexes it has created for you.
Would using the compression bundle provide much benefit in this scenario (many small documents), or is it targeted towards helping reduce the size of databases with very large documents?
Probably wouldn't benefit your scenario of small documents. The compression bundle works on individual documents, not on indexes. But it might be worth trying to see what results you get.
Bigger question: since hard drive space is cheap and only getting cheaper, and 70 MB is a speck on the map, why are you concerned about hard drive space? Databases often trade disk space for speed (e.g. multiple indexes, as Raven does), and this is usually a good trade-off for most apps.
What the system should do: store and manage centralized large (100-400 MB) text files.
What to store: lines from the text files (for some files the lines must be unique), metadata about each file (filename, comment, last update, etc.), and positions within a file (the same file may have different positions for different applications).
Operations: concurrent reads of lines from a file (100-400 lines per query) and adding lines (also 100-400 lines); exporting is not critical and can be scheduled.
So which storage should I use? An SQL DBMS seems too slow, I think; maybe a NoSQL solution?
NoSQL: Cassandra is an option (you can store it line by line or in groups of lines, I guess), Voldemort is not too bad, and you might even get away with using MongoDB, though I am not sure it fits the "large files" requirement.
400 MiB will be served entirely from the caches on any non-ridiculous database server. Given that, the choice of database does not really matter too much; any database will be able to deliver quickly (though there are different kinds of "fast"; it depends on what you need).
If you are really desperate for raw speed, you can go with something like redis. Again, 400 MiB is no challenge for that.
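For illustration, here is a minimal sketch of the redis route using redis-py; the key layout (one list per file plus a metadata hash) is just one way to do it, not a prescribed design:

```python
import redis  # pip install redis

r = redis.Redis()  # assumes a local redis server on the default port

def add_lines(file_id, lines):
    # Append a batch of lines to the list backing this file.
    r.rpush(f"file:{file_id}:lines", *lines)

def get_lines(file_id, start, count):
    # Fetch `count` lines starting at `start` (LRANGE bounds are inclusive).
    raw = r.lrange(f"file:{file_id}:lines", start, start + count - 1)
    return [line.decode() for line in raw]

def set_metadata(file_id, **meta):
    # Filename, comment, last update, etc. go into a hash.
    r.hset(f"file:{file_id}:meta", mapping=meta)

# Usage: add 400 lines, then read a 100-line window back.
add_lines(42, [f"line {i}" for i in range(400)])
set_metadata(42, filename="data.txt", comment="demo", last_update="2011-06-01")
print(get_lines(42, 100, 100)[:3])
```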
SQL might be slightly slower (but not by much), but it has the huge advantage of being flexible. Flexibility, generality, and the presence of a "built-in programming language" are not free, but they should not have too bad an impact, because either way returning data from the buffer cache works more or less at the speed of RAM.
If you ever decide that you need a different database later on, SQL will let you do it with a few commands, and if you ever want something else you have not planned for, SQL will cope. There is no guarantee that doing something different will be feasible with a simple key-value store.
Personally, I wouldn't worry about performance for such rather "small" datasets. Really, every kind of DB will serve that well, worry not. Come again when your datasets are several dozens of gigabytes in size.
If you are 100% sure that you will definitely never need the extras that a full-blown SQL database system offers, go with NoSQL to shave off a few microseconds. Otherwise, just stick with SQL to be on the safe side.
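To make the SQL route concrete, here is a minimal sketch using SQLite; the table and column names are made up for illustration and are not meant as a finished design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # any SQL engine would do; SQLite keeps the sketch self-contained
conn.executescript("""
CREATE TABLE files (
    id       INTEGER PRIMARY KEY,
    filename TEXT NOT NULL,
    comment  TEXT,
    updated  TEXT
);
CREATE TABLE lines (
    file_id  INTEGER REFERENCES files(id),
    position INTEGER,          -- line number within the file
    content  TEXT,
    UNIQUE (file_id, content)  -- per-file uniqueness (here enforced for every file, for simplicity)
);
CREATE TABLE app_positions (
    file_id     INTEGER REFERENCES files(id),
    application TEXT,
    position    INTEGER        -- per-application position within the same file
);
""")

# Fetch a 100-400 line window from one file.
rows = conn.execute(
    "SELECT content FROM lines WHERE file_id = ? AND position BETWEEN ? AND ? ORDER BY position",
    (1, 100, 499),
).fetchall()
```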
EDIT:
To elaborate, consider that a "somewhat lower class" desktop has upwards of 2 GiB (usually more like 4 GiB) of RAM nowadays, and a typical "no big deal" server has something like 32 GiB. In that light, 400 MiB is nothing. The typical network uplink on a server (unless you are willing to pay extra) is 100 Mbit/s.
A 400 MiB text file might have somewhere around a million lines. That boils down to 6-7 memory accesses for a "typical SQL server", and to 2 memory accesses plus the time needed to calculate a hash for a "typical NoSQL server". Which is, give or take a few dozen cycles, the same in either case -- something around half a microsecond on a relatively slow system.
If you use SQL, add a few dozen microseconds the first time a query is executed, because it must be parsed, validated, and optimized.
Network latency is somewhere around 2 to 3 milliseconds if you're lucky. That's 3 to 4 orders of magnitude more for establishing a connection, sending a request to the server, and receiving an answer. Compared to that, it seems ridiculous to worry whether the query takes 517 or 519 microseconds. If there are 1-2 routers in between, it becomes even more pronounced.
The same is true for bandwidth. In theory you can push around 119 MiB/s over a 1 Gbit/s link, assuming maximum-sized frames, no ACKs, absolutely no other traffic, and zero packet loss. RAM delivers tens of GiB per second without trouble.
I'm doing some testing with RavenDB to store data from an iPhone application. The application is going to send up a string of 5 GPS coordinates with a GUID for the key. I'm seeing in RavenDB that each document is around 664-668 bytes. That's HUGE for 10 decimals and a GUID. Can someone help me understand what I'm doing wrong?

I noticed the size was extraordinarily large when a million records took over a gig on disk. By my calculations it should be much smaller. Purely based on the data sizes, shouldn't the document be around 100 bytes? And given that a document database stores the object schema in each document, let's say we double that to 200 bytes. By that calculation the database should be about two hundred megs with 1 million records, but it's ten times larger. Can someone help me see where I've gone wrong with the math here?
(Got a friend to check my math and I was off by a bit - numbers updated)
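For reference, this is the math I am working from; the per-value byte counts are my own rough assumptions:

```python
# My rough sizing math; the byte counts per value are assumptions.
doc_count = 1_000_000
guid_bytes = 36              # GUID stored as a string (16 if binary)
coord_bytes = 10 * 8         # 10 decimals (5 lat/lon pairs) at ~8 bytes each

raw_doc = guid_bytes + coord_bytes   # ~100-120 bytes of actual data
with_schema = raw_doc * 2            # doubled to allow for field names / document structure

print(f"raw per doc:     ~{raw_doc} bytes")
print(f"with schema:     ~{with_schema} bytes")
print(f"expected total:  ~{with_schema * doc_count / 1e6:.0f} MB")
print("observed:        664-668 bytes per doc, over 1 GB for a million records")
```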
As a general principle, NoSQL databases aren't optimized for disk space. That's more of a traditional RDBMS requirement. With NoSQL you will often choose to store the data in duplicate or triplicate for various reasons.
Specifically with RavenDB, each document is in JSON format, so you have some overhead there. However, it is actually persisted on disk in BSON format, saving you some bytes. This implementation detail is hidden from the client. Also, every document has two streams: the main document content and the associated metadata. This is very powerful, but it does take up additional disk space. Both the document and the metadata are kept in BSON format in the ESENT-backed document store.
Then you need to consider how you will access the data. Any static indexes you create, and any dynamic indexes you ask Raven to create for you via its LINQ API, will have the data copied into the index store. This is a separate store implemented with Lucene.net, using its own index file format. You need to take this into consideration if you are estimating disk space requirements. (BTW, you would also have this concern with indexes in an RDBMS solution.)
If you are super concerned about optimizing every byte of disk space, perhaps NoSQL solutions aren't for you. Just about every product on the market has these types of overhead. But keep in mind that disk space is cheap today. Relational databases optimized for disk space because storage was very expensive when they were invented. The world has changed, and NoSQL solutions embrace that.
Is Lucene capable of indexing 500M text documents of 50K each?
What performance can be expected from such an index, both for single-term searches and for 10-term searches?
Should I be worried and move directly to a distributed index environment?
Saar
Yes, Lucene should be able to handle this, according to the following article:
http://www.lucidimagination.com/content/scaling-lucene-and-solr
Here's a quote:
Depending on a multitude of factors, a single machine can easily host a Lucene/Solr index of 5 – 80+ million documents, while a distributed solution can provide subsecond search response times across billions of documents.
The article goes into great depth about scaling to multiple servers. So you can start small and scale if needed.
A great resource about Lucene's performance is the blog of Mike McCandless, who is actively involved in the development of Lucene: http://blog.mikemccandless.com/
He often uses Wikipedia's content (25 GB) as test input for Lucene.
Also, it might be interesting that Twitter's real-time search is now implemented with Lucene (see http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html).
However, I am wondering if the numbers you provided are correct: 500 million documents x 50 KB = ~23 TB -- Do you really have that much data?
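To put those figures side by side (a rough calculation only; the 5-80 million documents per machine range is the one quoted from the article above):

```python
# Rough capacity check using the numbers from the question and the article.
docs = 500_000_000
doc_kib = 50                                        # ~50 KB per document

total_bytes = docs * doc_kib * 1024
print(f"raw text: ~{total_bytes / 2**40:.0f} TB")   # ~23 TB, before any index overhead

# The article puts a single machine at roughly 5-80+ million documents,
# so 500 million documents almost certainly means a distributed setup.
for per_machine in (5e6, 80e6):
    print(f"at {per_machine / 1e6:.0f}M docs/machine: ~{docs / per_machine:.0f} machines")
```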
I am really interested in GLASS, but the 4 GB limit for the free version has me concerned, especially when I consider the price of the next level ($7,000/year).
I know this can be subjective and variable, but can someone describe for me in everyday terms what 4 GB of GLASS will get you? Maybe a business example. 4 GB may get me more storage than I realize, and then I don't have to worry about it.
In my app, some messages have file attachments up to 5 MB in size. Can I conserve the 4 GB of Gemstone space by saving these attachments directly to files on the operating system, instead of inside Gemstone? I'm thinking yes.
I'm aware of one GLASS system that is ~944 MB and has 8.3 million objects, or ~118 bytes per object. At this rate, it can grow to over 36 million objects and stay under 4 GB.
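The arithmetic behind that, roughly (this is a single data point, so treat the per-object figure as indicative only):

```python
# One observed GLASS repository: ~944 MB holding ~8.3 million objects.
repo_bytes = 944 * 1024**2
objects = 8_300_000

per_object = repo_bytes / objects
print(f"~{per_object:.0f} bytes per object")   # roughly 118-119 bytes

limit_bytes = 4 * 1024**3                      # the 4 GB free-version limit
print(f"~{limit_bytes / per_object / 1e6:.0f} million objects under 4 GB")  # ~36 million
```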
As to "attachments", I'd suggest that even in an RDBMS you should consider storing larger, static data in the file system and referencing it from the database. If you are building a web-based application, serving static content (JPG, CSS, etc.) should be done by your web server (e.g., Apache) rather than through the primary application.
By comparison, Oracle and Microsoft SQL Server have no-cost licenses for a 4-GB database.
What do you think would be a good price for the next level?
The 4 GB limit was removed a while ago. The free version is now limited to two cores and 2 GB of RAM.
4 GB is quite a decent size for a database. Not having used GemStone before, I can only speculate as to how efficient it is at storing objects, but having played with a few other similar object databases (MongoDB, db4o), I know that you will be able to fit several (5-10) million records before you even get close to that limit. In reality, how many records depends heavily on the type of data you are storing.
As an example, I was storing ~2 million listings and ~1 million transactions in a MySQL database, and the space used was < 1 GB. Serializing a whole object adds a small overhead, but not that much.
Files can definitely be stored on the file system.
4 GB an issue... I guess you think you're building the next eBay!
Nowadays there is no limit on the size of the repository. See the latest specs for GemStone.
If you have multiple simultaneous users with 5 MB attachments, you need a separate strategy for them anyway, as each attachment takes about a twentieth of a second of bandwidth on a gigabit Ethernet network.