Is it possible to store PDF files in a CQL blob type in Cassandra? - pdf

To avoid questions about. Why do you use casandra in favour of another database. we have to because our custoner decided that Im my option a completely wrong decision.
In our Applikation we have to deal with PDF documents, i.e. Reader them and populate them with data.
So my intention was to hold the documents (templates) in the database read them and then do what we need to do with them.
I noticed that cassandra provieds a blob column type.
However for me it seems that this type has nothing to with a blob in qn Oracle or other relational database.
From what I understand is that cassandra is not for storing documnents and therefore it is not possible?
Or is the only way to make byte-array out of the document?
what is the intention of the blob column type?

The blob type in Cassandra is used to store raw bytes, so it's "theoretically" could be used to store PDF files as well (as bytes). But there is one thing that should be taken into consideration - Cassandra doesn't work well with big payloads - usual recommendation is to store 10s or 100s of Kb, not more than 1Mb. With bigger payloads, operations, such as repair, addition/removal of nodes, etc. could lead to increased overhead and performance degradation. On older versions of Cassandra (2.x/3.0) I have seen the situations when people couldn't add new nodes because join operation failed. It's a bit better situation with newer versions, but still it should be evaluated before jumping into implementation. It's recommended to do performance testing + some maintenance operations at scale to understand if it will work for your load. NoSQLBench is a great tool for such things.

It is possible to store binary files in a CQL blob column however the general recommendation is to only store a small amount of data in blobs, preferably 1MB or less for optimum performance.
For larger files, it is better to place them in an object store and only save the metadata in Cassandra.
Most large enterprises whose applications hold large amount of media files (music, video, photos, etc) typically store them in Amazon S3, Google Cloud Store or Azure Blob then store the metadata (such as URLs) of the files in Cassandra. These enterprises are household names in streaming services and social media apps. Cheers!

Related

Storing large objects in Couchbase - best practice?

In my system, a user can upload very large files, which I need to store in Couchbase. I don't need such very large objects to persist in memory, but I want them to be always read/written from/to disk. These files are read-only (never modified). The user can upload them, delete them, download them, but never update them. For some technical constraints, my system cannot store those files in the file system, so they have to be stored into the database.
I've done some research and found an article[1] saying that storing large objects in a database is generally a bad idea, especially with Couchbase, but at the same time provides some advice: create a secondary bucket with a low RAM quota, tune up the value/full eviction policy. My concern is the limit of 20Mb mentioned by the author. My files would be much larger than that.
What's the best approach to follow to store large files into Couchbase without having them persist in memory? Is it possible to raise the limit of 20Mb in case? Shall I create a secondary bucket with a very low RAM quota and a full eviction policy?
[1]http://blog.couchbase.com/2016/january/large-objects-in-a-database
Generally, Couchbase engineers recommend that you not store large files in Couchbase. Instead, you can store the files on some file server (like AWS or Azure Blob or something) and instead store the meta-data about the files in Couchbase.
There's a couchbase blog posting that gives a pretty detailed breakdown of how to do what you want to do in Couchbase.
This is Java API specific but the general approach can work with any of the Couchbase SDKs, I'm actually in the midst of doing something pretty similar right now with the node SDK.
I can't speak for what couchbase engineers recommend but they've posted this blog entry detailing how to do it.
For large files, you'll certainly want to split into chunks. Do not attempt to store a big file all in one document. The approach I'm looking at is to chunk the data, and insert it under the file sha1 hash. So file "Foo.docx" would get split into say 4 chunks, which would be "sha1|0", "sha1|1" and so on, where sha1 is the hash of the document. This would also enable a setup where you can store the same file under many different names.
Tradeoffs -- if integration with Amazon S3 is an option for you, you might be better off with that. In general chunking data in a DB like what I describe is going to be more complicated to implement, and much slower, than using something like Amazon S3. But that has to be traded off other requirements, like whether or not you can keep sensitive files in S3, or whether you want to deal with maintaining a filesystem and the associated scaling of that.
So it depends on what your requirements are. If you want speed/performance, don't put your files in Couchbase -- but can you do it? Sure. I've done it myself, and the blog post above describes a separate way to do it.
There are all kinds of interesting extensions you might wish to implement, depending on your needs. For example, if you commonly store many different files with similar content, you might implement a blocking strategy that would allow single-store of many common segments, to save space. Other solutions like S3 will happily store copies of copies of copies of copies, and gleefully charge you huge amounts of money to do so.
EDIT as a follow-up, there's this other Couchbase post talking about why storing in the DB might not be a good idea. Reasonable things to consider - but again it depends on your application-specific requirements. "Use S3" I think would be generally good advice, but won't work for everyone.
MongoDB has an option to do this sort of thing, and it's supported in almost all drivers: GridFS. You could do something like GridFS in Couchbase, which is to make a metadata collection (bucket) and a chunk collection with fixed size blobs. GridFS allows you to change the blob size per file, but all blobs must be the same size. The filesize is stored in the metadata. A typical chunk size is 2048, and are restricted to powers of 2.
You don't need memory cache for files, you can queue up the chunks for download in your app server. You may want to try GridFS on Mongo first, and then see if you can adapt it to Couchbase, but there is always this: https://github.com/couchbaselabs/cbfs
This is the best practice: do not take couchbase database as the main database consider it as sync database because no matter how you chunk data into small pieces it will go above 20MB size which will hit you in long run, so having a strong database like MySQL in a middle will help to save those large data then use couchbase for realtime and sync only.

RavenDB : Storage Size Problems

I'm doing some testing with RavenDB to store data based on an iphone application. The application is going to send up a string of 5 GPS coordinates with a GUID for the key. I'm seeing in RavenDB that each document is around 664-668 bytes. That's HUGE for 10 decimals and a guid. Can someone help me understand what I'm doing wrong? I noticed the size was extraordinarily large when a million records was over a gig on disk. By my calculations it should be much smaller. Purely based on the data sizes shouldn't the document be around 100 bytes? And given that the document database has the object schema built in let's say double that to 200 bytes. Given that calculation the database should be about two hundred megs with 1 million records. But it's ten times larger. Can someone help me where I've gone wrong with the math here?
(Got a friend to check my math and I was off by a bit - numbers updated)
As a general principal, NoSQL databases aren't optimized for disk space. That's the kind of traditional requirement of an RDBMS. Often with NoSQL, you will choose to store the data in duplicate or triplicate for various reasons.
Specifically with RavenDB, each document is in JSON format, so you have some overhead there. However, it is actually persisted on disk in BSON format, saving you some bytes. This implementation detail is obscured from the client. Also, every document has two streams - the main document content, and the associated metadata. This is very powerful, but does take up additional disk space. Both the document and the metadata are kept in BSON format in the ESENT backed document store.
Then you need to consider how you will access the data. Any static indexes you create, and any dynamic indexes you ask Raven to create for you via its LINQ API will have the data copied into the index store. This is a separate store implemented with Lucene.net using their proprietary index file format. You need to take this into consideration if you are estimating disk space requirements. (BTW - you would also have this concern with indexes in an RDBMS solution)
If you are super concerned about optimizing every byte of disk space, perhaps NoSQL solutions aren't for you. Just about every product on the market has these types of overhead. But keep in mind that disk space is cheap today. Relational databases optimized for disk space because storage was very expensive when they were invented. The world has changed, and NoSQL solutions embrace that.

Options for storing large text blobs in/with an SQL database?

I have some large volumes of text (log files) which may be very large (up to gigabytes). They are associated with entities which I'm storing in a database, and I'm trying to figure out whether I should store them within the SQL database, or in external files.
It seems like in-database storage may be limited to 4GB for LONGTEXT fields in MySQL, and presumably other DBs have similar limits. Also, storing in the database presumably precludes any kind of seeking when viewing this data -- I'd have to load the full length of the data to render any part of it, right?
So it seems like I'm leaning towards storing this data out-of-DB: are my misgivings about storing large blobs in the database valid, and if I'm going to store them out of the database then are there any frameworks/libraries to help with that?
(I'm working in python but am interested in technologies in other languages too)
Your misgivings are valid.
DB's gained the ability to handle large binary and text fields some years ago, and after everybody tried we gave up.
The problem stems from the fact that your operations on large objects tend to be very different from your operations on the atomic values. So the code gets difficult and inconsistent.
So most veterans just go with storing them on the filesystem with a pointer in the db.
I know php/mysql/oracle/prob more lets you work with large database objects as if you have a file pointer, which gets around memory issues.

When should we store images in database?

I have a table of productList in which i have 4 column, now i have to store image for each row so i have two option for this..
Store image in data base.
Save images in a folder and store only path on table.
So my question is which one is better in this situation and why ?
Microsoft Research published quite an extensive paper on the subject, called To Blob Or Not To Blob.
Their synopsis is:
Application designers often face the question of whether to store large objects in a filesystem or in a database. Often this decision is made for application design simplicity. Sometimes, performance measurements are also used. This paper looks at the question of fragmentation – one of the operational issues that can affect the performance and/or manageability of the system as deployed long term. As expected from the common wisdom, objects smaller than 256K are best stored in a database while objects larger than 1M are best stored in the filesystem. Between 256K and 1M, the read:write ratio and rate of object overwrite or replacement are important factors. We used the notion of “storage age” or number of object overwrites as way of normalizing wall clock time. Storage age allows our results or similar such results to be applied across a number of read:write ratios and object replacement rates.
It depends -
You can store images in DB if you know that they wont increase in size very often. This has its advantage when you are deploying your systems or migrating to new servers. you dont have to worry about copying images seperately.
If the no. of rows increase very frequently on that system, and the images get bulkier, then its good to store on the file system and have a path stored in database for later retrieval. This also will keep you on toes when migrating your servers where you have to take care of copying the images from filepath seperately.

Combining Relational and Document based "Databases"

I am developing a system that is all about media archiving, searching, uploading, distributing and thus about handling BLOBs.
I am currently trying to find out the best way how to handle the BLOB's. I have limited resources for high end servers with a lot of memory and huge disks, but I can access a large array of medium performance off-the-shelf computers and hook them to the Internet.
Therefore I decided to not store the BLOBs in a central Relational Database, because I would then have, in the worst case, one very heavy Database Instance, possibly on a single average machine. Not an option.
Storing the BLOBs as files directly on the filesystem and storing their path in the database is also somewhat ugly and distribution would have to be managed manually, keeping track of the different copies myself. I don't even want to get close to that.
I looked at CouchDB and I really like their peer-to-peer based design. This would allow me to run a distributed cluster of machines across the Internet, implies:
Low cost Hardware
Distribution for Redundancy and Failover out of the box
Lightweight REST Interface
So if I got it right, one could summarize it like this: Cloud like API and self managed, distributed, replicated system
The rest of the system does the normal stuff any average web application does: handling session, security, users, searching and the like. For this part I still want to use a relational datamodel. (CouchDB claims not to be a replacement for relational databases).
So I would have all the standard data, including the BLOB's meta data in the relational database but the BLOBs themselves in CouchDB.
Do you see a problem with this approach? Am I missing something important? Can you think of better solutions?
Thank you!
You could try Amazon's relational database SimpleDB and S3 toghether with SimpleJPA. SimpleJPA is a JPA-implementation on top of SimpleDB. SimpleJPA uses SimpleDB for the relational structure and S3 to store BLOBs.
Take a look at MongoDB, it supports storing binary data in an efficient format and is incredibly fast
No problem. I have done a design very similar to that one. You may also want to take a peek to HBase as an alternative to CouchDB and to the Adaptive Object-Model architectural pattern, as a way to manage your data and meta-data.