RavenDB: Storage Size Problems

I'm doing some testing with RavenDB to store data based on an iPhone application. The application is going to send up a string of 5 GPS coordinates with a GUID for the key. I'm seeing in RavenDB that each document is around 664-668 bytes. That's HUGE for 10 decimals and a GUID. Can someone help me understand what I'm doing wrong? I noticed the size was extraordinarily large when a million records came to over a gig on disk. By my calculations it should be much smaller. Purely based on the data sizes, shouldn't the document be around 100 bytes? And given that the document database has the object schema built in, let's say double that to 200 bytes. By that calculation the database should be about two hundred megs with 1 million records, but it's ten times larger. Can someone help me see where I've gone wrong with the math here?
(Got a friend to check my math and I was off by a bit - numbers updated)

As a general principle, NoSQL databases aren't optimized for disk space. That's the kind of traditional requirement of an RDBMS. Often with NoSQL, you will choose to store the data in duplicate or triplicate for various reasons.
Specifically with RavenDB, each document is in JSON format, so you have some overhead there. However, it is actually persisted on disk in BSON format, saving you some bytes. This implementation detail is obscured from the client. Also, every document has two streams - the main document content, and the associated metadata. This is very powerful, but does take up additional disk space. Both the document and the metadata are kept in BSON format in the ESENT backed document store.
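To make that overhead concrete, here is a rough back-of-the-envelope sketch in plain Python. The field names, metadata keys and values are purely illustrative (they are not RavenDB's exact persisted layout), but they show how much of a document like the one in the question is property names, the document key and metadata rather than coordinate values:

```python
import json
import uuid

# Hypothetical shape of one of the documents described in the question:
# five GPS coordinates keyed by a GUID.  Field names are illustrative.
doc = {
    "Coordinates": [
        {"Latitude": 47.620309, "Longitude": -122.349289},
        {"Latitude": 47.620412, "Longitude": -122.349301},
        {"Latitude": 47.620518, "Longitude": -122.349317},
        {"Latitude": 47.620611, "Longitude": -122.349330},
        {"Latitude": 47.620705, "Longitude": -122.349346},
    ],
}

# RavenDB also keeps a second stream of metadata per document; a few
# typical-looking entries are sketched here (exact keys vary by version).
metadata = {
    "@id": f"locations/{uuid.uuid4()}",
    "Raven-Entity-Name": "Locations",
    "Raven-Clr-Type": "MyApp.Models.Location, MyApp",
    "Last-Modified": "2012-01-01T00:00:00.0000000Z",
}

body = json.dumps(doc)
meta = json.dumps(metadata)
print(len(body), len(meta), len(body) + len(meta))
# Property names, the document key, type information and timestamps are all
# stored per document, which is why a raw-value estimate of ~100 bytes undershoots.
```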
Then you need to consider how you will access the data. Any static indexes you create, and any dynamic indexes you ask Raven to create for you via its LINQ API, will have the data copied into the index store. This is a separate store implemented with Lucene.net using its own index file format. You need to take this into consideration if you are estimating disk space requirements. (BTW - you would also have this concern with indexes in an RDBMS solution)
If you are super concerned about optimizing every byte of disk space, perhaps NoSQL solutions aren't for you. Just about every product on the market has these types of overhead. But keep in mind that disk space is cheap today. Relational databases optimized for disk space because storage was very expensive when they were invented. The world has changed, and NoSQL solutions embrace that.

Related

Is it possible to store PDF files in a CQL blob type in Cassandra?

To head off questions about why we use Cassandra in favour of another database: we have to, because our customer decided to, which in my opinion is a completely wrong decision.
In our application we have to deal with PDF documents, i.e. read them and populate them with data.
So my intention was to hold the documents (templates) in the database, read them, and then do what we need to do with them.
I noticed that Cassandra provides a blob column type.
However, it seems to me that this type has nothing to do with a BLOB in an Oracle or other relational database.
From what I understand, Cassandra is not meant for storing documents, and therefore this is not possible?
Or is the only way to make a byte array out of the document?
What is the intention of the blob column type?
The blob type in Cassandra is used to store raw bytes, so in theory it could be used to store PDF files as well (as bytes). But one thing should be taken into consideration: Cassandra doesn't work well with big payloads. The usual recommendation is to store tens or hundreds of kilobytes, and no more than 1 MB. With bigger payloads, operations such as repair or the addition/removal of nodes can lead to increased overhead and performance degradation. On older versions of Cassandra (2.x/3.0) I have seen situations where people couldn't add new nodes because the join operation failed. The situation is a bit better with newer versions, but it should still be evaluated before jumping into implementation. It's recommended to do performance testing plus some maintenance operations at scale to understand if it will work for your load. NoSQLBench is a great tool for such things.
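If you do decide to go ahead, a minimal sketch with the Python DataStax driver (cassandra-driver) could look like the following; the keyspace, table and file names are assumptions, and the blob column simply holds the raw bytes of the PDF:

```python
import uuid
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Assumed schema:
#   CREATE TABLE docs.pdf_templates (id uuid PRIMARY KEY, name text, content blob);
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("docs")

with open("template.pdf", "rb") as f:
    pdf_bytes = f.read()  # keep this well under ~1 MB, per the guidance above

doc_id = uuid.uuid4()
session.execute(
    "INSERT INTO pdf_templates (id, name, content) VALUES (%s, %s, %s)",
    (doc_id, "template.pdf", pdf_bytes),
)

# Reading it back: the blob column comes back as raw bytes.
row = session.execute(
    "SELECT content FROM pdf_templates WHERE id = %s", (doc_id,)
).one()
pdf_back = bytes(row.content)
```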
It is possible to store binary files in a CQL blob column however the general recommendation is to only store a small amount of data in blobs, preferably 1MB or less for optimum performance.
For larger files, it is better to place them in an object store and only save the metadata in Cassandra.
Most large enterprises whose applications hold large amounts of media files (music, video, photos, etc.) typically store them in Amazon S3, Google Cloud Storage or Azure Blob Storage, then store the metadata (such as URLs) of the files in Cassandra. These enterprises are household names in streaming services and social media apps. Cheers!
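As a rough sketch of that pattern (the bucket, keyspace and table names are assumptions), using boto3 for the object store and the Python Cassandra driver for the metadata:

```python
import os
import uuid
import boto3                              # pip install boto3
from cassandra.cluster import Cluster     # pip install cassandra-driver

# Assumed names: an S3 bucket "my-media-bucket" and a metadata table
#   CREATE TABLE media.files (id uuid PRIMARY KEY, filename text, url text, size_bytes bigint);
s3 = boto3.client("s3")
session = Cluster(["127.0.0.1"]).connect("media")

file_id = uuid.uuid4()
key = f"pdfs/{file_id}.pdf"

# 1. The large payload goes to the object store...
s3.upload_file("big-file.pdf", "my-media-bucket", key)

# 2. ...and only the metadata (including a URL/key) goes to Cassandra.
url = f"https://my-media-bucket.s3.amazonaws.com/{key}"
session.execute(
    "INSERT INTO files (id, filename, url, size_bytes) VALUES (%s, %s, %s, %s)",
    (file_id, "big-file.pdf", url, os.path.getsize("big-file.pdf")),
)
```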

Does the RavenDB compression bundle provide benefits with many small documents?

I am trying to better understand how RavenDB uses disk space.
My application has many small documents (approximately 140 bytes each). Presently, there are around 81,000 documents which would give a total data size of around 11MB. However, the size of the database is just over 70MB.
Is most of the actual space being used by indexes?
I had read somewhere else that there may be a minimum overhead of around 600 bytes per document. This would consume around 49MB, which is more in the ballpark of the actual use I am seeing.
Would using the compression bundle provide much benefit in this scenario (many small documents), or is it targeted towards helping reduce the size of databases with very large documents?
I have done some further testing on my own and determined, in answer to my own question, that:
Indexes are not the main consumer of disk space in my scenario. In this case, indexes represent < 25% of the disk space used.
Adding the compression bundle for a database with a large number of small documents does not really reduce the total amount of disk space used. This is likely due to some minimum data overhead that each document requires. Compression would benefit documents that are very large.
Is most of the actual space being used by indexes?
Yes, that's likely. Remember that Raven creates indexes for the different queries you make. You can fire up Raven Studio to see what indexes it has created for you.
Would using the compression bundle provide much benefit in this scenario (many small documents), or is it targeted towards helping reduce the size of databases with very large documents?
Probably wouldn't benefit your scenario of small documents. The compression bundle works on individual documents, not on indexes. But it might be worth trying to see what results you get.
Bigger question: since hard drive space is cheap and only getting cheaper, and 70MB is a speck on the map, why are you concerned about hard drive space? Databases often trade disk space for speed (e.g. multiple indexes, like Raven), and this is usually a good trade-off for most apps.

Database or other method of storing and dynamically accessing HUGE binary objects

I have some large (200 GB is normal) flat files of data that I would like to store in some kind of database so that it can be accessed quickly and in the intuitive way that the data is logically organized. Think of it as large sets of very long audio recordings, where each recording is the same length (samples) and can be thought of as a row. One of these files normally has about 100,000 recordings of 2,000,000 samples each in length.
It would be easy enough to store these recordings as rows of BLOB data in a relational database, but there are many instances where I want to load into memory only certain columns of the entire data set (say, samples 1,000-2,000). What's the most memory- and time-efficient way to do this?
Please don't hesitate to ask if you need more clarification on the particulars of my data in order to make a recommendation.
EDIT: To clarify the data dimensions... One file consists of: 100,000 rows (recordings) by 2,000,000 columns (samples). Most relational databases I've researched will allow a maximum of a few hundred to a couple thousand rows in a table. Then again, I don't know much about object-oriented databases, so I'm kind of wondering if something like that might help here. Of course, any good solution is very welcome. Thanks.
EDIT: To clarify the usage of the data... The data will be accessed only by a custom desktop/distributed-server application, which I will write. There is metadata (collection date, filters, sample rate, owner, etc.) for each data "set" (which I've referred to as a 200 GB file up to now). There is also metadata associated with each recording (which I had hoped would be a row in a table so I could just add columns for each piece of recording metadata). All of the metadata is consistent. I.e. if a particular piece of metadata exists for one recording, it also exists for all recordings in that file. The samples themselves do not have metadata. Each sample is 8 bits of plain-ol' binary data.
DB storage may not be ideal for large files. Yes, it can be done. Yes, it can work. But what about DB backups? The file contents likely will not change often - once they're added, they will remain the same.
My recommendation would be to store the files on disk, but create a DB-based index. Most filesystems get cranky or slow when you have > 10k files in a folder/directory/etc. Your application can generate the filename and store the metadata in the DB, then organize files by the generated name on disk. The downside is that the file contents may not be directly apparent from the name. However, you can easily back up changed files without specialized DB backup plugins and a sophisticated partitioning/incremental backup scheme. Also, seeks within a file become much simpler operations (skip ahead, rewind, etc.). There is generally better support for these operations in a file system than in a DB.
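A minimal sketch of that scheme in Python, using sqlite3 for the index only to keep the example short; the storage root, sharding rule and metadata columns are placeholders:

```python
import os
import sqlite3
import uuid

STORAGE_ROOT = "/data/recordings"           # assumed location
db = sqlite3.connect("file_index.sqlite")   # any RDBMS works; sqlite keeps the sketch short
db.execute(
    """CREATE TABLE IF NOT EXISTS files (
           id              TEXT PRIMARY KEY,   -- generated name used on disk
           original_name   TEXT,
           collection_date TEXT,               -- whatever metadata you query on
           sample_rate     INTEGER,
           path            TEXT
       )"""
)

def store_file(src_path, original_name, collection_date, sample_rate):
    """Move the payload to a generated location and index it in the DB."""
    file_id = uuid.uuid4().hex
    # Shard by the first two hex chars so no directory ends up with >10k files.
    subdir = os.path.join(STORAGE_ROOT, file_id[:2])
    os.makedirs(subdir, exist_ok=True)
    dest = os.path.join(subdir, file_id)
    os.replace(src_path, dest)  # or shutil.copy2 if the source must be preserved
    db.execute(
        "INSERT INTO files VALUES (?, ?, ?, ?, ?)",
        (file_id, original_name, collection_date, sample_rate, dest),
    )
    db.commit()
    return file_id
```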
I wonder what makes you think that RDBMS would be limited to mere thousands of rows; there's no reason this would be the case.
Also, at least some databases (Oracle as an example) do allow direct access to parts of LOB data, without loading the full LOB, if you just know the offset and length you want. So, you could have a table with some searchable metadata and then the LOB column, and if needed, an additional metadata table containing metadata on the LOB contents so that you'd have some kind of keyword->(offset,length) relation available for partial loading of LOBs.
Somewhat echoing another post here, incremental backups (which you might wish to have here) are not quite feasible with databases (ok, can be possible, but at least in my experience tend to have a nasty price tag attached).
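For the offset/length LOB access described above, a sketch with the python-oracledb driver might look like this; the table, column and connection details are assumptions:

```python
import oracledb  # pip install oracledb (the successor to cx_Oracle)

conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb1")
cur = conn.cursor()

cur.execute("SELECT samples FROM recordings WHERE recording_id = :1", [42])
lob = cur.fetchone()[0]   # a LOB locator, not the full payload

# LOB.read(offset, amount) uses 1-based byte offsets for BLOBs, so with
# one byte per sample this pulls roughly samples 1,000-2,000 only.
chunk = lob.read(1000, 1001)
```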
How big is each sample, and how big is each recording?
Are you saying each recording is 2,000,000 samples, or each file is? (it can be read either way)
If it is 2 million samples making up 200 GB, then each sample is ~100 KB, and each recording is ~2 MB (to have 100,000 recordings per file, which is 20 samples per recording)?
That seems like a very reasonable size to put in a row in a DB rather than a file on disk.
As for loading into memory only a certain range, if you have indexed the sample ids, then you could very quickly query for only the subset you want, loading only that range into memory from the DB query result.
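To illustrate the range-query idea (not necessarily a layout that scales to billions of rows), here is a sketch with sqlite3; the one-row-per-sample schema is an assumption made purely for the example:

```python
import sqlite3

db = sqlite3.connect("recordings.sqlite")
db.execute(
    """CREATE TABLE IF NOT EXISTS samples (
           recording_id INTEGER,
           sample_id    INTEGER,
           value        INTEGER,     -- one 8-bit sample, per the question
           PRIMARY KEY (recording_id, sample_id)
       )"""
)

# Load only samples 1,000-2,000 of a single recording into memory.
rows = db.execute(
    "SELECT sample_id, value FROM samples "
    "WHERE recording_id = ? AND sample_id BETWEEN ? AND ?",
    (7, 1000, 2000),
).fetchall()
```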
I think that Microsoft SQL Server does what you need with the varbinary(MAX) field type when used in conjunction with FILESTREAM storage.
Have a read on TechNet for more depth: (http://technet.microsoft.com/en-us/library/bb933993.aspx).
Basically, you can enter any descriptive fields normally into your database, but the actual BLOB is stored in NTFS, governed by the SQL engine and limited in size only by your NTFS file system.
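From client code, a FILESTREAM-backed column behaves like an ordinary varbinary(MAX) column. A hedged pyodbc sketch (the table definition, connection string and file name are assumptions) might look like:

```python
# Assumed table (FILESTREAM requires a ROWGUIDCOL unique identifier column):
#   CREATE TABLE Recordings (
#       Id      UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE,
#       Name    NVARCHAR(260),
#       Payload VARBINARY(MAX) FILESTREAM
#   );
import uuid
import pyodbc  # pip install pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dbhost;DATABASE=Media;Trusted_Connection=yes;"
)
cur = conn.cursor()

with open("recording.bin", "rb") as f:
    payload = f.read()

# The BLOB lands in NTFS via FILESTREAM, but client code just sees varbinary(MAX).
cur.execute(
    "INSERT INTO Recordings (Id, Name, Payload) VALUES (?, ?, ?)",
    str(uuid.uuid4()), "recording.bin", pyodbc.Binary(payload),
)
conn.commit()
```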
Hope this helps - I know it raises all kinds of possibilities in my mind. ;-)

Options for storing large text blobs in/with an SQL database?

I have some large volumes of text (log files) which may be very large (up to gigabytes). They are associated with entities which I'm storing in a database, and I'm trying to figure out whether I should store them within the SQL database, or in external files.
It seems like in-database storage may be limited to 4GB for LONGTEXT fields in MySQL, and presumably other DBs have similar limits. Also, storing in the database presumably precludes any kind of seeking when viewing this data -- I'd have to load the full length of the data to render any part of it, right?
So it seems like I'm leaning towards storing this data out-of-DB: are my misgivings about storing large blobs in the database valid, and if I'm going to store them out of the database then are there any frameworks/libraries to help with that?
(I'm working in python but am interested in technologies in other languages too)
Your misgivings are valid.
DBs gained the ability to handle large binary and text fields some years ago, and after everybody tried it, we gave up.
The problem stems from the fact that your operations on large objects tend to be very different from your operations on the atomic values. So the code gets difficult and inconsistent.
So most veterans just go with storing them on the filesystem with a pointer in the db.
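Since you mentioned Python, a minimal sketch of that "file on the filesystem, pointer in the DB" pattern could look like the following (sqlite3, the directory layout and column names are just placeholders):

```python
import os
import sqlite3
import uuid

LOG_DIR = "/var/data/logs"                 # placeholder location
db = sqlite3.connect("entities.sqlite")
db.execute(
    "CREATE TABLE IF NOT EXISTS entity_logs ("
    "entity_id INTEGER, log_path TEXT, size_bytes INTEGER)"
)

def attach_log(entity_id, log_bytes):
    """Write the big payload to disk; keep only a pointer (path) in the DB."""
    os.makedirs(LOG_DIR, exist_ok=True)
    path = os.path.join(LOG_DIR, f"{uuid.uuid4().hex}.log")
    with open(path, "wb") as f:
        f.write(log_bytes)
    db.execute(
        "INSERT INTO entity_logs VALUES (?, ?, ?)",
        (entity_id, path, len(log_bytes)),
    )
    db.commit()
    return path

def read_log_slice(path, offset, length):
    """Seeking is trivial on the filesystem, which addresses the rendering concern."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```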
I know PHP/MySQL/Oracle (and probably others) let you work with large database objects as if you have a file pointer, which gets around memory issues.

When should we store images in database?

I have a productList table with 4 columns, and now I have to store an image for each row, so I have two options:
Store the image in the database.
Save images in a folder and store only the path in the table.
So my question is: which one is better in this situation, and why?
Microsoft Research published quite an extensive paper on the subject, called To Blob Or Not To Blob.
Their synopsis is:
Application designers often face the question of whether to store large objects in a filesystem or in a database. Often this decision is made for application design simplicity. Sometimes, performance measurements are also used. This paper looks at the question of fragmentation – one of the operational issues that can affect the performance and/or manageability of the system as deployed long term. As expected from the common wisdom, objects smaller than 256K are best stored in a database while objects larger than 1M are best stored in the filesystem. Between 256K and 1M, the read:write ratio and rate of object overwrite or replacement are important factors. We used the notion of “storage age” or number of object overwrites as way of normalizing wall clock time. Storage age allows our results or similar such results to be applied across a number of read:write ratios and object replacement rates.
It depends -
You can store images in the DB if you know that they won't increase in size very often. This has its advantages when you are deploying your system or migrating to new servers: you don't have to worry about copying images separately.
If the number of rows increases very frequently on that system and the images get bulkier, then it's better to store them on the file system and keep a path in the database for later retrieval. This will also keep you on your toes when migrating your servers, where you have to take care of copying the images from the file path separately.
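As a sketch of option 2 (image in a folder, only the path in the table), assuming Python with sqlite3 and made-up column names for the productList table:

```python
import os
import shutil
import sqlite3
import uuid

IMAGE_DIR = "product_images"            # assumed folder
db = sqlite3.connect("shop.sqlite")
db.execute(
    "CREATE TABLE IF NOT EXISTS productList ("
    "id INTEGER PRIMARY KEY, name TEXT, price REAL, sku TEXT, image_path TEXT)"
)

def add_product(name, price, sku, source_image):
    os.makedirs(IMAGE_DIR, exist_ok=True)
    ext = os.path.splitext(source_image)[1]
    image_path = os.path.join(IMAGE_DIR, f"{uuid.uuid4().hex}{ext}")
    shutil.copy2(source_image, image_path)   # the image itself lives on disk...
    db.execute(
        "INSERT INTO productList (name, price, sku, image_path) VALUES (?, ?, ?, ?)",
        (name, price, sku, image_path),      # ...only the path goes in the table
    )
    db.commit()
```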