Does the RavenDB compression bundle provide benefits with many small documents?

I am trying to better understand how RavenDB uses disk space.
My application has many small documents (approximately 140 bytes each). Presently, there are around 81,000 documents which would give a total data size of around 11MB. However, the size of the database is just over 70MB.
Is most of the actual space being used by indexes?
I had read somewhere else that there may be a minimum overhead of around 600 bytes per document. This would consume around 49MB, which is more in the ballpark of the actual use I am seeing.
Would using the compression bundle provide much benefit in this scenario (many small documents), or is it targeted towards helping reduce the size of databases with very large documents?
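The estimates in the question can be checked with a few lines of arithmetic. This is only a sketch: the 600-byte per-document overhead is the unverified figure mentioned above, and the constant names are made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the question.
DOC_COUNT = 81_000               # documents currently stored
PAYLOAD_BYTES = 140              # approximate size of one document
OVERHEAD_BYTES = 600             # assumed per-document overhead (unverified figure)
OBSERVED_DB_BYTES = 70_000_000   # "just over 70MB" on disk

payload_total = DOC_COUNT * PAYLOAD_BYTES
overhead_total = DOC_COUNT * OVERHEAD_BYTES
unexplained = OBSERVED_DB_BYTES - payload_total - overhead_total

print(f"raw payload:      {payload_total / 1e6:.1f} MB")
print(f"assumed overhead: {overhead_total / 1e6:.1f} MB")
print(f"left unexplained: {unexplained / 1e6:.1f} MB")
```

Under these assumptions the payload comes to about 11MB and the overhead to about 49MB, which matches the numbers above and leaves a remainder for indexes and other storage structures.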

I have done some further testing on my own and determined, in answer to my own question, that:
Indexes are not the main consumer of disk space in my scenario. In this case, indexes represent < 25% of the disk space used.
Adding the compression bundle for a database with a large number of small documents does not really reduce the total amount of disk space used. This is likely due to some minimum data overhead that each document requires. Compression would benefit documents that are very large.

Is most of the actual space being used by indexes?
Yes, that's likely. Remember that Raven creates indexes for the different queries you make. You can fire up Raven Studio to see which indexes it has created for you.
Would using the compression bundle provide much benefit in this scenario (many small documents), or is it targeted towards helping reduce the size of databases with very large documents?
Probably wouldn't benefit your scenario of small documents. The compression bundle works on individual documents, not on indexes. But it might be worth trying to see what results you get.
Bigger question: since hard drive space is cheap and only getting cheaper, and 70MB is a speck on the map, why are you concerned about hard drive space? Databases often trade disk space for speed (e.g. multiple indexes, like Raven), and this is usually a good trade-off for most apps.


What about performance of cursors, reindexing, and shrinking?

I recently learned that in SQL Server, if I delete or modify a column, the space is still held at the backend, so I need to reindex and shrink the database. I did that, and my database size went down from 2.82 GB to 1.62 GB.
That is good, but now I am confused and have several questions on this subject. Please help me with them:
1. Is it necessary to recreate (refresh) indexes after a particular interval?
2. Is it necessary to shrink the database after a particular time so that performance stays up to date?
3. If the answer to the above is yes, at what interval should I refresh (shrink) my database?
I have no idea what should be done about the disk space problem. I have 77,000 records that take 2.82 GB of data space, which is not acceptable. I have two tables, and only one of them has an nvarchar(max) column, so the database should need very little space. Can anyone help me with this? Thanks in advance.
I am going to simplify things a little for you, so you might want to read up on the things I talk about in my answer.
There are two concepts you must understand: allocated space and free space. A database might be 2GB in size but use only 1GB, so it has 2GB allocated with 1GB free. When you shrink a database, the free space is removed, so free space ends up at about 0. Don't assume a smaller file is faster. As your database grows, it has to allocate space again. If you shrink the file and it then grows every so often, it cannot allocate space contiguously. This fragments the files, which slows you down even more.
With data files (.mdf) this is not so bad, but shrinking the transaction log can lead to virtual log file fragmentation, which can slow you down. So, in a nutshell, there is very little reason to shrink your database on a schedule. Go read about Virtual Log Files in SQL Server; there are a lot of articles about it. This is a good article about shrinking log files and why it is bad. Use it as a starting point.
Secondly, indexes get fragmented over time. This mainly hurts the performance of SELECT queries, but it affects other queries as well. Thus you need to perform some index maintenance on the database. See this answer on how to defragment your indexes.
There is no clear-cut rule for when to rebuild indexes. An index rebuild locks the index for its duration, so the index is essentially offline while it runs. In your case it would be fast; 77,000 rows is nothing for SQL Server. Still, rebuilding indexes consumes server resources. If you have Enterprise Edition, you can do online index rebuilds, which will NOT lock the indexes but will consume more space.
So what you need to do is find a maintenance window. For example, if your system is used from 8:00 till 17:00, you can schedule index rebuilds after hours. Schedule this with SQL Server Agent. The script in the link can be automated to run.
Your database is not big. I have seen SQL Server handle tables of 750GB without strain when the IO is split over several disks. The slowest part of any database server is not the CPU or the RAM but the IO pathways to the disks. This is a huge topic, though. Back to your point: you are storing data in NVARCHAR(MAX) fields, which I assume hold large text. After you shrank the database, its size was 1.62GB, which means each row is about 1.62GB / 77,000, or roughly 22KB. This seems reasonable. Export the table to a text file and check the size; you may be surprised to find it larger than 1.62GB.
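For reference, here is the per-row estimate worked out in Python. It is a rough sketch that treats the whole file as row data, which overstates the true row size because indexes and page overhead are included; the variable names are only illustrative.

```python
ROWS = 77_000
SIZE_GB = 1.62  # database size after shrinking, as reported in the question

# "GB" is ambiguous, so compute both the decimal and the binary interpretation.
bytes_decimal = SIZE_GB * 1000**3
bytes_binary = SIZE_GB * 1024**3

per_row_decimal_kb = bytes_decimal / ROWS / 1000
per_row_binary_kib = bytes_binary / ROWS / 1024

print(f"{per_row_decimal_kb:.1f} KB per row (decimal units)")
print(f"{per_row_binary_kib:.1f} KiB per row (binary units)")
```

Either way the result lands in the low twenties of kilobytes per row, which is plausible for rows that carry large text.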
Feel free to ask for more detail if required.

RavenDB: Storage Size Problems

I'm doing some testing with RavenDB to store data from an iPhone application. The application is going to send up a string of 5 GPS coordinates with a GUID for the key. I'm seeing in RavenDB that each document is around 664-668 bytes. That's HUGE for 10 decimals and a GUID. Can someone help me understand what I'm doing wrong? I noticed the size was extraordinarily large when a million records took up over a gig on disk. By my calculations it should be much smaller. Purely based on the data sizes, shouldn't the document be around 100 bytes? And given that the document database has the object schema built in, let's say double that to 200 bytes. Given that calculation, the database should be about two hundred megs with 1 million records. But it's ten times larger. Can someone help me see where I've gone wrong with the math here?
(Got a friend to check my math and I was off by a bit - numbers updated)
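The "around 100 bytes" estimate can be reproduced by packing the data into its tightest binary form and comparing that with a JSON rendering. This is only an illustration: the field names (`Id`, `Points`, `Lat`, `Lon`) and the sample coordinates are made up, and the JSON shown is not necessarily how the application or RavenDB lays out the document.

```python
import json
import struct
import uuid

key = uuid.uuid4()
# Five made-up (latitude, longitude) pairs standing in for the GPS data.
coords = [(40.7128 + i * 0.001, -74.0060 - i * 0.001) for i in range(5)]

# Tightest binary form: 16-byte GUID followed by ten 8-byte doubles.
flat = [value for pair in coords for value in pair]
raw = key.bytes + struct.pack("<10d", *flat)

# The same data as a JSON document, with hypothetical property names.
as_json = json.dumps(
    {"Id": str(key), "Points": [{"Lat": lat, "Lon": lon} for lat, lon in coords]}
)

print(len(raw))                      # 96
print(len(as_json.encode("utf-8")))  # noticeably larger than the raw form
```

The raw payload is 96 bytes, so the estimate in the question is right for the data itself. Property names, text-encoded numbers, and the string form of the GUID already inflate it before any storage-engine overhead is added.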
As a general principle, NoSQL databases aren't optimized for disk space. That's the kind of traditional requirement of an RDBMS. Often with NoSQL, you will choose to store the data in duplicate or triplicate for various reasons.
Specifically with RavenDB, each document is in JSON format, so you have some overhead there. However, it is actually persisted on disk in BSON format, saving you some bytes. This implementation detail is hidden from the client. Also, every document has two streams: the main document content and the associated metadata. This is very powerful, but it does take up additional disk space. Both the document and the metadata are kept in BSON format in the ESENT-backed document store.
Then you need to consider how you will access the data. Any static indexes you create, and any dynamic indexes you ask Raven to create for you via its LINQ API, will have the data copied into the index store. This is a separate store implemented with Lucene.NET, using its own index file format. You need to take this into consideration if you are estimating disk space requirements. (BTW, you would also have this concern with indexes in an RDBMS solution.)
If you are super concerned about optimizing every byte of disk space, perhaps NoSQL solutions aren't for you. Just about every product on the market has these types of overhead. But keep in mind that disk space is cheap today. Relational databases optimized for disk space because storage was very expensive when they were invented. The world has changed, and NoSQL solutions embrace that.

How to store 15 x 100 million 32-byte records for sequential access?

I have 15 x 100 million 32-byte records. Only sequential access and appends are needed. The key is a Long. The value is a tuple: (Date, Double, Double). Is there something in this universe which can do this? I am willing to have 15 separate databases (SQL/NoSQL) or files, one for each set of 100 million records. I only have a Core i7, 8 GB of RAM, and a 2 TB hard disk.
I have tried PostgreSQL, MySQL, Kyoto Cabinet (with fine tuning) with Protostuff encoding.
SQL DBs (with indices) take forever to do the silliest query.
Kyoto Cabinet's B-Tree can handle up to 15-18 million records, beyond which appends take forever.
I am fed up so much that I am thinking of falling back on awk + CSV which I remember used to work for this type of data.
If your scenario means always going through all records in sequence, then it may be overkill to use a database. If you start to need random lookups, replacing/deleting records, or checking whether a new record duplicates an older one, a database engine would make more sense.
For sequential access, a couple of text files or hand-crafted binary files will be easier to handle. You sound like a developer, so I would probably go for a custom binary format and access it through memory-mapped files to improve sequential read/append speed. No caching, just a sliding window to read the data. I think it would perform better than any DB, even on ordinary hardware; I did such data analysis once. It would also be faster than awking CSV files; however, I am not sure by how much, or whether the gain would justify the effort of developing the binary storage in the first place.
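A minimal sketch of such a format in Python, assuming the date is stored as a 64-bit epoch-milliseconds integer so that key, date, and the two doubles add up to exactly 32 bytes. The layout and the names `RECORD`, `append_records`, and `scan` are illustrative choices, not something given in the question.

```python
import mmap
import os
import struct

# Assumed layout: int64 key, int64 timestamp (epoch ms), two float64 values.
RECORD = struct.Struct("<qqdd")  # 8 + 8 + 8 + 8 = 32 bytes, little-endian


def append_records(path, records):
    """Append (key, timestamp_ms, value_a, value_b) tuples to the file."""
    with open(path, "ab") as f:
        for rec in records:
            f.write(RECORD.pack(*rec))


def scan(path):
    """Yield every record in file order through a read-only memory map."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return  # an empty file cannot be memory-mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            usable = len(mm) - len(mm) % RECORD.size  # ignore a torn last record
            for offset in range(0, usable, RECORD.size):
                yield RECORD.unpack_from(mm, offset)
```

Mapping the whole file leaves the windowing to the operating system's page cache, which on a 64-bit system is usually enough for a file of a few gigabytes. An explicit sliding window would map fixed-size chunks using the `offset` and `length` arguments of `mmap.mmap` instead.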
Once a database becomes interesting, you can have a look at MongoDB and CouchDB. They are used for storing and serving very large amounts of data. (There is a flattering evaluation that compares one of them to traditional DBs.) Databases usually need reasonably powerful hardware to perform well; maybe you could check how those two would do with your data.
--- Ferda
Ferdinand Prantl's answer is very good. Two points:
By your requirements I recommend that you create a very tight binary format. This will be easy to do because your records are fixed size.
If you understand your data well, you might be able to compress it. For example, if your key is an increasing long value, you don't need to store it in full. Instead, store the difference from the previous value (which is almost always going to be one). Then use a standard compression algorithm/library to save big on data size.
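The delta idea can be sketched as follows, assuming the keys are consecutive 64-bit integers and using `zlib` as the standard compression library. The helper names are made up, and the real savings depend entirely on how regular the actual keys are.

```python
import struct
import zlib


def delta_encode(keys):
    """Replace each key with its difference from the previous key."""
    previous = 0
    deltas = []
    for key in keys:
        deltas.append(key - previous)
        previous = key
    return deltas


def delta_decode(deltas):
    """Rebuild the original keys by summing the differences."""
    total = 0
    keys = []
    for delta in deltas:
        total += delta
        keys.append(total)
    return keys


keys = list(range(1_000_000, 1_100_000))  # sample data: strictly consecutive keys
deltas = delta_encode(keys)

raw_bytes = struct.pack(f"<{len(keys)}q", *keys)
delta_bytes = struct.pack(f"<{len(deltas)}q", *deltas)

compressed_raw = zlib.compress(raw_bytes)
compressed_delta = zlib.compress(delta_bytes)

print(len(raw_bytes), len(compressed_raw), len(compressed_delta))
```

Because almost every delta is 1, the delta-encoded buffer is one long repetition and compresses far better than the raw keys do.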
For sequential reads and writes, LevelDB will handle your dataset pretty well.
I think that's about 48 gigs of data in one table.
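That figure follows directly from the numbers in the question. The short calculation below covers the raw payload only; whatever per-row and index overhead a particular DBMS adds comes on top.

```python
RECORD_BYTES = 32
RECORDS_PER_SERIES = 100_000_000
SERIES = 15

per_series_bytes = RECORD_BYTES * RECORDS_PER_SERIES
total_bytes = per_series_bytes * SERIES

print(f"{per_series_bytes / 1e9:.1f} GB per series")
print(f"{total_bytes / 1e9:.0f} GB in total")
```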
When you get into large databases, you have to look at things a little differently. With an ordinary database (say, tables less than a couple million rows), you can do just about anything as a proof of concept. Even if you're stone ignorant about SQL databases, server tuning, and hardware tuning, the answer you come up with will probably be right. (Although sometimes you might be right for the wrong reason.)
That's not usually the case for large databases.
Unfortunately, you can't just throw 1.5 billion rows straight at an untuned PostgreSQL server, run a couple of queries, and say, "PostgreSQL can't handle this." Most SQL dbms have ways of dealing with lots of data, and most people don't know that much about them.
Here are some of the things that I have to think about when I have to process a lot of data over the long term. (For short-term or one-off processing, it's usually not worth caring a lot about speed. A lot of companies won't invest in more RAM or a dozen high-speed disks--or even a couple of SSDs--for even a long-term solution, let alone a one-time job.)
Server CPU.
Server RAM.
Server disks.
RAID configuration. (RAID 3 might be worth looking at for you.)
Choice of operating system. (64-bit vs 32-bit, BSD v. AT&T derivatives)
Choice of DBMS. (Oracle will usually outperform PostgreSQL, but it costs.)
DBMS tuning. (Shared buffers, sort memory, cache size, etc.)
Choice of index and clustering. (Lots of different kinds nowadays.)
Normalization. (You'd be surprised how often 5NF outperforms lower NFs. Ditto for natural keys.)
Tablespaces. (Maybe putting an index on its own SSD.)
I'm sure there are others, but I haven't had coffee yet.
But the point is that you can't determine whether, say, PostgreSQL can handle a 48 gig table unless you've accounted for the effect of all those optimizations. With large databases, you come to rely on the cumulative effect of small improvements. You have to do a lot of testing before you can defensibly conclude that a given dbms can't handle a 48 gig table.
Now, whether you can implement those optimizations is a different question--most companies won't invest in a new 64-bit server running Oracle and a dozen of the newest "I'm the fastest hard disk" hard drives to solve your problem.
But someone is going to pay either for optimal hardware and software, for dba tuning expertise, or for programmer time and waiting on suboptimal hardware. I've seen problems like this take months to solve. If it's going to take months, money on hardware is probably a wise investment.

Best cache size for iOS apps

I'm currently developing an application that loads lots of images from the internet and saves them locally (I'm using SDURLCache). However, old images have to be removed from the disk again, so I was wondering what the best cache size is.
The advantage of a big cache is obviously that more images get saved which leads to better UX.
The disadvantage is that images need a lot of space and the user will run out of disk space faster. The size I am thinking of is 20MB. That seems big to me, though, so I'm asking what your opinion is.
The best way to decide on an appropriate cache size is to test. Run the app under Instruments to measure both performance and battery usage. Keep increasing the cache size until you can't discern a difference in performance. That's the largest size you'd need, at least under the test conditions. Once you've established that size, reduce the size until performance is just barely acceptable to determine the smallest acceptable size.
The right size is somewhere between those two sizes, depending on what you think is important. If you can't determine a right size, then either pick a size or add a slider to the app's settings to let the user decide. (I'd avoid making it user-adjustable if you can -- users shouldn't have to think about such things.)
Considering that the smallest iDevices have 8GB of storage, I don't think a 20MB cache is too big, especially if it significantly improves the performance of the app. Also, keep in mind the huge advantage a network cache can have for battery life, since network usage is very expensive in battery time.
Determining the ideal size, however, is hard without some more information. How often is the same picture accessed? How large is each picture (i.e., how many pictures can 20MB hold)? How often will images need to be removed from the cache to add new ones?
If you are constantly changing the images in the cache, it could actually have an adverse effect on the battery life due to the increased disk usage.

When should we store images in database?

I have a productList table with 4 columns. Now I have to store an image for each row, so I have two options:
Store the image in the database.
Save the images in a folder and store only the path in the table.
So my question is: which one is better in this situation, and why?
Microsoft Research published quite an extensive paper on the subject, called To Blob Or Not To Blob.
Their synopsis is:
Application designers often face the question of whether to store large objects in a filesystem or in a database. Often this decision is made for application design simplicity. Sometimes, performance measurements are also used. This paper looks at the question of fragmentation – one of the operational issues that can affect the performance and/or manageability of the system as deployed long term. As expected from the common wisdom, objects smaller than 256K are best stored in a database while objects larger than 1M are best stored in the filesystem. Between 256K and 1M, the read:write ratio and rate of object overwrite or replacement are important factors. We used the notion of “storage age” or number of object overwrites as way of normalizing wall clock time. Storage age allows our results or similar such results to be applied across a number of read:write ratios and object replacement rates.
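The thresholds quoted in the synopsis can be turned into a simple rule of thumb. The function below is only a sketch: it assumes "K" and "M" mean 1024 and 1024² bytes, and the function name and return strings are made up for illustration.

```python
KB = 1024
MB = 1024 * KB


def suggest_storage(object_size_bytes):
    """Map an object size to the recommendation from the quoted synopsis."""
    if object_size_bytes < 256 * KB:
        return "database"
    if object_size_bytes > 1 * MB:
        return "filesystem"
    # Between 256K and 1M the synopsis says it depends on the workload.
    return "depends on read:write ratio and overwrite rate"


for size in (40 * KB, 600 * KB, 5 * MB):
    print(size, suggest_storage(size))
```

For typical product thumbnails of a few tens of kilobytes, this rule points to the database. Large originals point to the filesystem with a path stored in the table.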
It depends -
You can store images in the DB if you know that they won't grow in size very often. This has an advantage when you are deploying your systems or migrating to new servers: you don't have to worry about copying the images separately.
If the number of rows increases very frequently on that system and the images get bulkier, then it's better to store them on the file system and keep a path in the database for later retrieval. This will also keep you on your toes when migrating servers, because you have to take care of copying the images from the file path separately.