I have a table of productList in which i have 4 column, now i have to store image for each row so i have two option for this..
Store image in data base.
Save images in a folder and store only path on table.
So my question is which one is better in this situation and why ?
Microsoft Research published quite an extensive paper on the subject, called To Blob Or Not To Blob.
Their synopsis is:
Application designers often face the question of whether to store large objects in a filesystem or in a database. Often this decision is made for application design simplicity. Sometimes, performance measurements are also used. This paper looks at the question of fragmentation – one of the operational issues that can affect the performance and/or manageability of the system as deployed long term. As expected from the common wisdom, objects smaller than 256K are best stored in a database while objects larger than 1M are best stored in the filesystem. Between 256K and 1M, the read:write ratio and rate of object overwrite or replacement are important factors. We used the notion of “storage age” or number of object overwrites as way of normalizing wall clock time. Storage age allows our results or similar such results to be applied across a number of read:write ratios and object replacement rates.
It depends -
You can store images in DB if you know that they wont increase in size very often. This has its advantage when you are deploying your systems or migrating to new servers. you dont have to worry about copying images seperately.
If the no. of rows increase very frequently on that system, and the images get bulkier, then its good to store on the file system and have a path stored in database for later retrieval. This also will keep you on toes when migrating your servers where you have to take care of copying the images from filepath seperately.
Related
When should I make this direct recording at the bank?
What are the situations?
I know I can record the path of the image in the bank.
In addition to the cost being higher as mentioned, one must take into account several factors:
Data Volume: For a low volume of data there may be no problem. On the other hand, for mass storage of data the database is practically unfeasible.
Clustering: One advantage of the database is if your system runs on multiple servers, everyone will have uniform access to the files.
Scalability: If demand for volume or availability increases, can you add more capacity to the system? It is much easier to split files between different servers than to distribute records from one table to more servers.
Flexibility: Backing up, moving files from one server to another, doing some processing on the stored files, all this is easier if the files are in a directory.
There are several strategies for scaling a system in terms of both availability and volume. Basically these strategies consist of distributing them on several different servers and redirecting the user to each of them according to some criteria. The details vary of implementation, such as: data update strategy, redundancy, distribution criteria, etc.
One of the great difficulties in managing files outside BD is that we now have two distinct data sources that need to be always in sync.
From the safety point of view, there is actually little difference. If a hacker can compromise a server, it can read both the files written to disk of your system and the files of the database system. If this question is critical, an alternative is to store the encrypted data.
I also convert my images into byte array and store them in an sql server database but in the long run, I am sure that someone will ask you and tell you that you should only save the (server) path of the image.
The biggest disadvantage of storing as binary I think is
Retrieving images from database is significantly more expensive compared to using the file system
I'm doing some testing with RavenDB to store data based on an iphone application. The application is going to send up a string of 5 GPS coordinates with a GUID for the key. I'm seeing in RavenDB that each document is around 664-668 bytes. That's HUGE for 10 decimals and a guid. Can someone help me understand what I'm doing wrong? I noticed the size was extraordinarily large when a million records was over a gig on disk. By my calculations it should be much smaller. Purely based on the data sizes shouldn't the document be around 100 bytes? And given that the document database has the object schema built in let's say double that to 200 bytes. Given that calculation the database should be about two hundred megs with 1 million records. But it's ten times larger. Can someone help me where I've gone wrong with the math here?
(Got a friend to check my math and I was off by a bit - numbers updated)
As a general principal, NoSQL databases aren't optimized for disk space. That's the kind of traditional requirement of an RDBMS. Often with NoSQL, you will choose to store the data in duplicate or triplicate for various reasons.
Specifically with RavenDB, each document is in JSON format, so you have some overhead there. However, it is actually persisted on disk in BSON format, saving you some bytes. This implementation detail is obscured from the client. Also, every document has two streams - the main document content, and the associated metadata. This is very powerful, but does take up additional disk space. Both the document and the metadata are kept in BSON format in the ESENT backed document store.
Then you need to consider how you will access the data. Any static indexes you create, and any dynamic indexes you ask Raven to create for you via its LINQ API will have the data copied into the index store. This is a separate store implemented with Lucene.net using their proprietary index file format. You need to take this into consideration if you are estimating disk space requirements. (BTW - you would also have this concern with indexes in an RDBMS solution)
If you are super concerned about optimizing every byte of disk space, perhaps NoSQL solutions aren't for you. Just about every product on the market has these types of overhead. But keep in mind that disk space is cheap today. Relational databases optimized for disk space because storage was very expensive when they were invented. The world has changed, and NoSQL solutions embrace that.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
Improve this question
I am new to NoSQL world and thinking of replacing my MS Sql Server database to MongoDB. My application (written in .Net C#) interacts with IP Cameras and records meta data for each image coming from Camera, into MS SQL Database. On average, i am inserting about 86400 records per day for each camera and in current database schema I have created separate table for separate Camera images, e.g. Camera_1_Images, Camera_2_Images ... Camera_N_Images. Single image record consists of simple metadata info. like AutoId, FilePath, CreationDate. To add more details to this, my application initiates separate process (.exe) for each camera and each process inserts 1 record per second in relative table in database.
I need suggestions from (MongoDB) experts on following concerns:
to tell if MongoDB is good for holding such data, which eventually will be queried against time ranges (e.g. retrieve all images of a particular camera between a specified hour)? Any suggestions about Document Based schema design for my case?
What should be the specs of server (CPU, RAM, Disk)? any suggestion?
Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?
Are there any benefits of using multiple databases on same machine, so that one database will hold images of current day for all cameras, and the second one will be used to archive previous day images? I am thinking on this with respect to splitting reads and writes on separate databases. Because all read requests might be served by second database and writes to first one. Will it benefit or not? If yes then any idea to ensure that both databases are synced always.
Any other suggestions are welcomed please.
I am myself a starter on NoSQL databases. So I am answering this at the expense of potential down votes but it will be a great learning experience for me.
Before trying my best to answer your questions I should say that if MS
SQL Server is working well for you then stick with it. You have not
mentioned any valid reason WHY you want to use MongoDB except the fact
that you learnt about it as a document oriented db. Moreover I see
that you have almost the same set of meta-data you are capturing for
each camera i.e. your schema is dynamic.
to tell if MongoDB is good for holding such data, which eventually will be queried against time ranges (e.g. retrieve all images of a particular camera between a specified hour)? Any suggestions about Document Based schema design for my case?
MongoDB being a document oriented db, is good at querying within an aggregate (you call it document). Since you already are storing each camera's data in its own table, in MongoDB you will have a separate collection created for each camera. Here is how you perform date range queries.
What should be the specs of server (CPU, RAM, Disk)? any suggestion?
All NoSQL data bases are built to scale-out on commodity hardware. But by the way you have asked the question, you might be thinking of improving performance by scaling-up. You can start with a reasonable machine and as the load increases, you can keep adding more servers (scaling-out). You no need to plan and buy a high end server.
Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?
MongoDB locks the entire db for a single write (but yields for other operations) and is meant for systems which have more reads than writes. So this depends upon how your system is. There are multiple ways of sharding and should be domain specific. A generic answer is not possible. However some examples can be given like sharding by geography, by branches etc.
Also read A plain english introduction to CAP Theorem
Updated with answer to the comment on sharding
According to their documentation, You should consider deploying a sharded cluster, if:
your data set approaches or exceeds the storage capacity of a single node in your system.
the size of your system’s active working set will soon exceed the capacity of the maximum amount of RAM for your system.
your system has a large amount of write activity, a single MongoDB instance cannot write data fast enough to meet demand, and all other
approaches have not reduced contention.
So based upon the last point yes. The auto-sharding feature is built to scale writes. In that case, you have a write lock per shard, not per database. But mine is a theoretical answer. I suggest you take consultation from 10gen.com group.
to tell if MongoDB is good for holding such data, which eventually
will be queried against time ranges (e.g. retrieve all images of a
particular camera between a specified hour)?
This quiestion is too subjective for me to answer. From personal experience with numerous SQL solutions (ironically not MS SQL) I would say they are both equally as good, if done right.
Also:
What should be the specs of server (CPU, RAM, Disk)? any suggestion?
Depends on too many variables that only you know, however a small cluster of commodity hardware works quite well. I cannot really give a factual response to this question and it will come down to your testing.
As for a schema I would go for a document of the structure:
{
_id: {},
camera_name: "my awesome camera",
images: [
{
url: "http://I_like_S3_here.amazons3.com/my_image.png" ,
// All your other fields per image
}
]
}
This should be quite easy to mantain and update so long as you are not embedding much deeper since then it could become a bit of pain, however, that depends upon your queries.
Not only that but this should be good for sharding since you have all the data you need in one document, if you were to shard on _id you could probably get the perfect setup here.
Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?
Possibly, many people assume they need to shard when in reality they just need to be more intelligent in how they design the database. MongoDB is very free form so there are a lot of ways to do it wrong, but that being said, there are also a lot of ways of dong it right. I personally would keep sharding in mind. Replication can be very useful too.
Are there any benefits of using multiple databases on same machine, so that one database will hold images of current day for all cameras, and the second one will be used to archive previous day images?
Even though MongoDBs write lock is on DB level (currently) I would say: No. The right document structure and the right sharding/replication (if needed) should be able to handle this in a single document based collection(s) under a single DB. Not only that but you can direct writes and reads within a cluster to certain servers so as to create a concurrency situation between certain machines in your cluster. I would promote the correct usage of MongoDBs concurrency features over DB separation.
Edit
After reading the question again I omitted from my solution that you are inserting 80k+ images for each camera a day. As such instead of the embedded option I would actually make a row per image in a collection called images and then a camera collection and query the two like you would in SQL.
Sharding the images collection should be just as easy on camera_id.
Also make sure you take you working set into consideration with your server.
to tell if MongoDB is good for holding such data, which eventually
will be queried against time ranges (e.g. retrieve all images of a
particular camera between a specified hour)? Any suggestions about
Document Based schema design for my case?
MongoDB can do this. For better performance, you can set an index on your time field.
What should be the specs of server (CPU, RAM, Disk)? any suggestion?
I think RAM and Disk would be important.
If you don't want to do sharding to scale out, you should consider a larger size of disk so you can store all your data in it.
Your hot data should can fit into your RAM. If not, then you should consider a larger RAM because the performance of MongoDB mainly depends on RAM.
Should i consider Sharding/Replication for this scenario (while
considering the performance in writing to synch replica sets)?
I don't know many cameras do you have, even 1000 inserts/second with total 1000 cameras should still be easy to MongoDB. If you are concerning insert performance, I don't think you need to do sharding(Except the data size are too big that you have to separate them into several machines).
Another problem is the read frequency of your application. It it is very high, then you can consider sharding or replication here.
And you can use (timestamp + camera_id) as your sharding key if your query only on one camera in a time range.
Are there any benefits of using multiple databases on same machine, so
that one database will hold images of current day for all cameras, and
the second one will be used to archive previous day images?
You can separate the table into two collections(archive and current). And set index only on archive if you only query date on archive. Without the overhead of index creation, the current collection should benefit with insert.
And you can write a daily program to dump the current data into archive.
I have some large (200 GB is normal) flat files of data that I would like to store in some kind of database so that it can be accessed quickly and in the intuitive way that the data is logically organized. Think of it as large sets of very long audio recordings, where each recording is the same length (samples) and can be thought of as a row. One of these files normally has about 100,000 recordings of 2,000,000 samples each in length.
It would be easy enough to store these recordings as rows of BLOB data in a relational database, but there are many instances where I want to load into memory only certain columns of the entire data set (say, samples 1,000-2,000). What's the most memory- and time-efficient way to do this?
Please don't hesitate to ask if you need more clarification on the particulars of my data in order to make a recommendation.
EDIT: To clarify the data dimensions... One file consists of: 100,000 rows (recordings) by 2,000,000 columns (samples). Most relational databases I've researched will allow a maximum of a few hundred to a couple thousand rows in a table. Then again, I don't know much about object-oriented databases, so I'm kind of wondering if something like that might help here. Of course, any good solution is very welcome. Thanks.
EDIT: To clarify the usage of the data... The data will be accessed only by a custom desktop/distributed-server application, which I will write. There is metadata (collection date, filters, sample rate, owner, etc.) for each data "set" (which I've referred to as a 200 GB file up to now). There is also metadata associated with each recording (which I had hoped would be a row in a table so I could just add columns for each piece of recording metadata). All of the metadata is consistent. I.e. if a particular piece of metadata exists for one recording, it also exists for all recordings in that file. The samples themselves do not have metadata. Each sample is 8 bits of plain-ol' binary data.
DB storage may not be ideal for large files. Yes, it can be done. Yes, it can work. But what about DB backups? The file contents likely will not change often - once they're added, they will remain the same.
My recommendation would be store the file on disk, but create a DB-based index. Most filesystems get cranky or slow when you have > 10k files in a folder/directory/etc. Your application can generate the filename and store metadata in the DB, then organize by the generated name on disk. Downsides are file contents may not be directly apparent from the name. However, you can easily backup changed files without specialized DB backup plugins and a sophisticated partitioning, incremental backup scheme. Also, seeks within the file become much simpler operations (skip ahead, rewind, etc.). There is generally better support for these operations in a file system than in a DB.
I wonder what makes you think that RDBMS would be limited to mere thousands of rows; there's no reason this would be the case.
Also, at least some databases (Oracle as an example) do allow direct access to parts of LOB data, without loading the full LOB, if you just know the offset and length you want to have. So, you could have a table with some searchable metadata and then the LOB column, and if needed, an additional metadata table containing metadata on the LOB contents so that you'd have some kind of keyword->(offset,length) relation available for partal loading of LOBs.
Somewhat echoing another post here, incremental backups (which you might wish to have here) are not quite feasible with databases (ok, can be possible, but at least in my experience tend to have a nasty price tag attached).
How big is each sample, and how big is each recording?
Are you saying each recording is 2,000,000 samples, or each file is? (it can be read either way)
If it is 2 million samples to make up 200 GB, then each sample is ~10 K, and each recording is 200K (to have 100,000 per file, which is 20 samples per recording)?
That seems like a very reasonable size to put in a row in a DB rather than a file on disk.
As for loading into memory only a certain range, if you have indexed the sample ids, then you could very quickly query for only the subset you want, loading only that range into memory from the DB query result.
I think that Microsoft SQL does what you need with the varbinary(MAX) field type WHEN used in conjnction with filestream storage.
Have a read on TechNet for more depth: (http://technet.microsoft.com/en-us/library/bb933993.aspx).
Basically, you can enter any descriptive fields normally into your database, but the actual BLOB is stored in NTFS, governed by the SQL engine and limited in size only by your NTFS file system.
Hope this helps - I know it raises all kinds of possibilities in my mind. ;-)
I have some large volumes of text (log files) which may be very large (up to gigabytes). They are associated with entities which I'm storing in a database, and I'm trying to figure out whether I should store them within the SQL database, or in external files.
It seems like in-database storage may be limited to 4GB for LONGTEXT fields in MySQL, and presumably other DBs have similar limits. Also, storing in the database presumably precludes any kind of seeking when viewing this data -- I'd have to load the full length of the data to render any part of it, right?
So it seems like I'm leaning towards storing this data out-of-DB: are my misgivings about storing large blobs in the database valid, and if I'm going to store them out of the database then are there any frameworks/libraries to help with that?
(I'm working in python but am interested in technologies in other languages too)
Your misgivings are valid.
DB's gained the ability to handle large binary and text fields some years ago, and after everybody tried we gave up.
The problem stems from the fact that your operations on large objects tend to be very different from your operations on the atomic values. So the code gets difficult and inconsistent.
So most veterans just go with storing them on the filesystem with a pointer in the db.
I know php/mysql/oracle/prob more lets you work with large database objects as if you have a file pointer, which gets around memory issues.