SQL or filesystem for FAST storing files/BLOBs? [closed]

I have an app that stores quite a lot of publications as files on the filesystem, using nested dirs like "6/0/3/6/....". The files are not huge (.jpg, .pdf, similar documents); there are "just" a lot of them, running into hundreds of GB. Once stored in the fs, they are typically never rewritten, just served over HTTP.
Searching and versioning through such files is painfully slow. Copying such dirs is also rather cumbersome.
This got me thinking: would it not be better to store such data as BLOBs in the db? (My app is using Postgres for various purposes anyway.)
Which one -- fs or a scalable SQL db -- could perform better all around? Or would PG collapse under so much weight?

Incremental backup is much easier with the filesystem. So is recovering from partial damage. Versioning is pretty easy to do on top of the file system so long as you don't need atomic change sets, only individual file versioning.
On the other hand, using the DB gets you transactional behaviour - atomic commit, multiple concurrent consistent snapshots, etc. It costs you disk storage efficiency and it adds overhead to access. It means you can't just sendfile() the data directly from the file system, you must do several memory copies and some encoding just to get and send the file.
For a high performance server the file system is almost certainly going to win unless you really need atomic commit and the simultaneous consistent visibility of multiple versions.
There are lots of related past questions you should probably read too, most concerning whether it's better to store images in the DB or on the file system.

The filesystem has other big disadvantages:
You can get problems with user rights
Not atomic
Slow
When dealing with BLOBs < 1 GB, I would 100% store them in the database, since all good database systems can handle BLOBs properly. (They store them in a different manner than structured data, but that is not visible to you.)
By the way, from http://www.postgresql.org/about/ :
Maximum Database Size => Unlimited
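If you go the database route, here is a minimal sketch of storing and reading such a document as a bytea column, assuming Python with psycopg2; the DSN, table, column names and file path are placeholders, not anything from the question:

# Hypothetical sketch: store and fetch a document as a bytea BLOB in PostgreSQL.
# Assumes psycopg2 is installed and the connection details are adjusted to your setup.
import psycopg2

conn = psycopg2.connect("dbname=pubs user=app")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS publication (
            id   serial PRIMARY KEY,
            name text NOT NULL,
            data bytea NOT NULL
        )
    """)
    # Store: read the file once and insert it as a BLOB.
    with open("6/0/3/6/example.pdf", "rb") as f:   # made-up path
        cur.execute(
            "INSERT INTO publication (name, data) VALUES (%s, %s) RETURNING id",
            ("example.pdf", psycopg2.Binary(f.read())),
        )
        pub_id = cur.fetchone()[0]
    # Fetch: the whole BLOB is copied into memory before you can serve it,
    # which is the extra overhead compared to sendfile() mentioned in the answer above.
    cur.execute("SELECT data FROM publication WHERE id = %s", (pub_id,))
    blob = cur.fetchone()[0]
conn.close()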

Related

Using SQLite database as backend, MS Access for Frontend, more than 2 gigs of data [closed]

I've been using Microsoft Access for years, but inevitably I need to move on to other database systems. Right now, SQLite seems perfect for my current work environment. I understand it's pretty easy to have a SQLite backend with a Microsoft Access frontend. However, I also know Microsoft Access databases tend to have issues when they exceed 2 gigs. I realize that SQLite is NOT limited to 2 gigs, but if I had, say, a 10 gig dataset in a SQLite database and used MS Access for the frontend, would I have performance issues? Could it handle it, or would it not matter since the backend is in a SQLite database? I apologize for my ignorance here, but it would be ten times more ignorant to keep aimlessly searching for an answer and continue not finding a solution. Thanks!
Well, the data engine is quite fast. In fact, on a local machine, you tend to get better performance using JET/ACE than with, say, SQL Server running as a local instance.
It's not clear how good or optimized the ODBC drivers for SQLite are, but performance with such a setup should be about as fast as anything else, and likely even faster than running, say, a local instance of some type of SQL server. To be fair, because computers often have additional processors (cores), running a server database even on your local machine can yield better performance, since you are using "more" processors to do the same job.
(That ignores the threading issue: JET/ACE and SQLite are NOT threaded to my knowledge, and thus you can't really take advantage of multiple CPU cores.)
However, from a raw performance point of view, I suspect that SQLite is slower than JET/ACE, but I never really looked closely.
Tables of several million rows tend to be nothing for ACE/JET and Access, and I would suggest that SQLite would produce similar results, but allow you to get around the 2 gig limitation. I think if the files are pushing the 2 gig limit, then I would consider using a server-based system for the database. However, if the database is not to be multi-user, then again, using a "file"-based in-process data engine like JET or SQLite should not pose any particular performance penalty compared with using a local server-based system.
If a network or multiple users come into play, then hands down a server system that runs as a separate process (and thus on separate CPU cores) is a better choice.
I have tested SQLite with Access, but not for large files, so I don't know how well it works for large tables. I mean, 5 or 10 million rows will easily fit in a JET database, so it's not clear how large your datasets are, but they must be rather large if you are exceeding JET's limit. SQL Express is free and allows up to 10 gigs, but you do of course have to set up and run a "server" database on your standalone computer, and often that's not worth the setup time.
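If you want to sanity-check SQLite itself on a multi-million-row table before wiring it up to Access, here is a rough sketch using Python's built-in sqlite3 module; the schema, row count and file name are made up purely for illustration:

# Rough benchmark sketch: bulk-insert rows into SQLite and time an indexed lookup.
import sqlite3, time

conn = sqlite3.connect("big_test.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_customer ON orders(customer)")

t0 = time.perf_counter()
with conn:  # one transaction for the whole batch, far faster than per-row commits
    conn.executemany(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)",
        ((f"cust{i % 1000}", i * 0.01) for i in range(1_000_000)),
    )
print("insert:", time.perf_counter() - t0, "seconds")

t0 = time.perf_counter()
row = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders WHERE customer = ?", ("cust42",)).fetchone()
print("indexed query:", row, time.perf_counter() - t0, "seconds")
conn.close()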
Switching the back end (BE) for Access is pretty easy, but unless you know how to handle the new BE's pros and cons you won't get very far, and you will just add issues to your current ones.
If your main concern is size, you can split your data into more than one BE (.mdb, .accdb) and still stay in the Access ecosystem.
Also, you have to take into account that SQLite is more single-user oriented, so if you are going to use it on a network this will be problematic. A good way to use it, if you are the only user of your application, is to keep the "real" data in MS Access and store the "extra" data (like documents, photos, etc.) in a SQLite DB.
I guess my answer is very similar to John's answer above. Additionally, you can use the analyze-database tool and double-check that your tables are normalized. Split the back-end database. Don't store images and photos inside the .accdb file; just create links to them outside the database. There are numerous things you can do to reduce size.

Does faster code use fewer system resources? [closed]

Generally speaking, does faster code use fewer system resources?
If yes, should I assume that reading a file from the file system in 0.02 sec is lighter (better?) than querying the database, which took 0.03 sec?
Please note:
Speed is not the concern here; I'm just talking about system resources like memory, etc.
Also, it's a general question; I'm not comparing file system vs. database. That's just an example.
I'm aware that I need to do different benchmarks or profiling in my code to find the precise answer, but as I said above, I'm curious whether it's generally true or not.
I used to do speed benchmarks in my projects to decide on the better solution; however, I never thought that I might need to do benchmarks on memory usage, for example. I did it a few times, but it wasn't serious. That's why I'm asking this question. I hope it makes sense.
That depends on why the code is faster.
One common way to optimise for speed, is to use a lot of some other resource.
With your example, both the database and the file system use RAM to cache data. It's more likely that the database would actually be faster, because it uses a lot more RAM to cache data.
So, often faster code uses more resources.
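If you want to check this for your own case rather than guess, here is a minimal sketch of measuring both wall time and peak memory for a piece of code, using only Python's standard library; the functions being measured are placeholders for your own code paths:

# Minimal sketch: measure wall time and peak Python allocation of a call.
import time, tracemalloc

def measure(fn, *args):
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()   # peak bytes allocated during the call
    tracemalloc.stop()
    return result, elapsed, peak

def read_from_file(path):
    # placeholder for one of the approaches being compared
    with open(path, "rb") as f:
        return f.read()

# Example usage (the file name is made up):
# _, secs, peak_bytes = measure(read_from_file, "report.pdf")
# print(f"{secs:.4f}s, peak {peak_bytes / 1024:.1f} KiB")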
It's a very broad topic for discussion.
What does "faster code" mean? If code is entirely statically bound at compile time, you naturally get a faster program; that is the baseline of structured programming in a language like C. But when you move to object-oriented programming, static binding alone doesn't give you object-oriented behaviour, so you need classes and objects, which naturally use more system resources: more CPU cycles and more memory for run-time binding. Comparing C and Java: yes, C is definitely faster than Java to some extent. If you run a single hello-world program in C and in Java, you can see that C takes fewer resources than Java, meaning fewer CPU cycles and less memory. But at that cost we may lose reusability, maintainability, and extensibility.

Pros and Cons of using MongoDB instead of MS SQL Server [closed]

I am new to the NoSQL world and I'm thinking of replacing my MS SQL Server database with MongoDB. My application (written in .NET C#) interacts with IP cameras and records metadata for each image coming from a camera into the MS SQL database. On average, I am inserting about 86,400 records per day for each camera, and in the current database schema I have created a separate table for each camera's images, e.g. Camera_1_Images, Camera_2_Images ... Camera_N_Images. A single image record consists of simple metadata like AutoId, FilePath, CreationDate. To add more detail, my application starts a separate process (.exe) for each camera, and each process inserts 1 record per second into the corresponding table in the database.
I need suggestions from (MongoDB) experts on the following concerns:
to tell if MongoDB is good for holding such data, which eventually will be queried against time ranges (e.g. retrieve all images of a particular camera between a specified hour)? Any suggestions about Document Based schema design for my case?
What should be the specs of server (CPU, RAM, Disk)? any suggestion?
Should I consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?
Are there any benefits of using multiple databases on the same machine, so that one database will hold images of the current day for all cameras, and the second one will be used to archive previous days' images? I am thinking of this with respect to splitting reads and writes onto separate databases, because all read requests might be served by the second database and writes by the first one. Will it benefit or not? If yes, then any idea how to ensure that both databases are always in sync?
Any other suggestions are welcome.
I am myself a starter on NoSQL databases. So I am answering this at the expense of potential down votes but it will be a great learning experience for me.
Before trying my best to answer your questions, I should say that if MS SQL Server is working well for you, then stick with it. You have not mentioned any valid reason WHY you want to use MongoDB, except the fact that you learnt about it as a document-oriented db. Moreover, I see that you capture almost the same set of meta-data for each camera, i.e. your schema is not really dynamic.
to tell if MongoDB is good for holding such data, which eventually will be queried against time ranges (e.g. retrieve all images of a particular camera between a specified hour)? Any suggestions about Document Based schema design for my case?
MongoDB, being a document-oriented db, is good at querying within an aggregate (what you call a document). Since you are already storing each camera's data in its own table, in MongoDB you will have a separate collection created for each camera. Here is how you perform date range queries.
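As a concrete illustration, here is a short pymongo sketch of such a date-range query; the database, collection and field names are assumptions, not anything from the question:

# Hypothetical sketch: time-range query on a per-camera collection with pymongo.
from datetime import datetime
from pymongo import MongoClient, ASCENDING

db = MongoClient()["cameras"]
images = db["camera_1_images"]          # one collection per camera, as described above

# Index the timestamp field so range scans don't walk the whole collection.
images.create_index([("creation_date", ASCENDING)])

start = datetime(2013, 5, 1, 14, 0)
end = datetime(2013, 5, 1, 15, 0)
for doc in images.find({"creation_date": {"$gte": start, "$lt": end}}):
    print(doc["file_path"])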
What should be the specs of server (CPU, RAM, Disk)? any suggestion?
All NoSQL databases are built to scale out on commodity hardware. But from the way you have asked the question, you might be thinking of improving performance by scaling up. You can start with a reasonable machine and, as the load increases, keep adding more servers (scaling out). You don't need to plan for and buy a high-end server.
Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?
MongoDB locks the entire db for a single write (but yields for other operations) and is meant for systems which have more reads than writes. So this depends upon how your system behaves. There are multiple ways of sharding, and the choice should be domain-specific; a generic answer is not possible. However, some examples can be given, like sharding by geography, by branch, etc.
Also read A plain english introduction to CAP Theorem
Updated with an answer to the comment on sharding
According to their documentation, you should consider deploying a sharded cluster if:
your data set approaches or exceeds the storage capacity of a single node in your system.
the size of your system's active working set will soon exceed the capacity of the maximum amount of RAM for your system.
your system has a large amount of write activity, a single MongoDB instance cannot write data fast enough to meet demand, and all other approaches have not reduced contention.
So based upon the last point, yes. The auto-sharding feature is built to scale writes. In that case, you have a write lock per shard, not per database. But mine is a theoretical answer. I suggest you take consultation from the 10gen.com group.
to tell if MongoDB is good for holding such data, which eventually will be queried against time ranges (e.g. retrieve all images of a particular camera between a specified hour)?
This question is too subjective for me to answer. From personal experience with numerous SQL solutions (ironically not MS SQL), I would say they are both equally good, if done right.
Also:
What should be the specs of server (CPU, RAM, Disk)? any suggestion?
Depends on too many variables that only you know, however a small cluster of commodity hardware works quite well. I cannot really give a factual response to this question and it will come down to your testing.
As for a schema I would go for a document of the structure:
{
    _id: {},
    camera_name: "my awesome camera",
    images: [
        {
            url: "http://I_like_S3_here.amazons3.com/my_image.png",
            // All your other fields per image
        }
    ]
}
This should be quite easy to maintain and update, so long as you are not embedding much deeper, since then it could become a bit of a pain; however, that depends upon your queries.
Not only that, but this should be good for sharding, since you have all the data you need in one document; if you were to shard on _id you could probably get the perfect setup here.
Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?
Possibly. Many people assume they need to shard when in reality they just need to be more intelligent in how they design the database. MongoDB is very free-form, so there are a lot of ways to do it wrong, but that being said, there are also a lot of ways of doing it right. I personally would keep sharding in mind. Replication can be very useful too.
Are there any benefits of using multiple databases on same machine, so that one database will hold images of current day for all cameras, and the second one will be used to archive previous day images?
Even though MongoDB's write lock is at DB level (currently), I would say: No. The right document structure and the right sharding/replication (if needed) should be able to handle this in a single document-based collection (or collections) under a single DB. Not only that, but you can direct writes and reads within a cluster to certain servers so as to create a concurrency situation between certain machines in your cluster. I would promote the correct usage of MongoDB's concurrency features over DB separation.
Edit
After reading the question again, I realize I omitted from my solution that you are inserting 80k+ images per camera per day. As such, instead of the embedded option I would actually make a document per image in a collection called images, plus a camera collection, and query the two like you would in SQL.
Sharding the images collection should be just as easy on camera_id.
Also make sure you take your working set into consideration with your server.
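Here is a rough pymongo sketch of that per-image layout; the field names, compound index and the hand-made "join" are assumptions for illustration, not prescribed by the answer:

# Hypothetical per-image document in an 'images' collection, one document per image,
# with a compound index so "one camera within a time range" queries stay cheap.
from datetime import datetime
from pymongo import MongoClient, ASCENDING

db = MongoClient()["cameras"]
db.images.create_index([("camera_id", ASCENDING), ("creation_date", ASCENDING)])

db.images.insert_one({
    "camera_id": 1,                               # reference to a document in a 'cameras' collection
    "file_path": "/images/cam1/2013/05/01/img_000001.jpg",
    "creation_date": datetime.utcnow(),
})

# "Join" by hand, as you would between two SQL tables:
camera = db.cameras.find_one({"_id": 1})
recent = db.images.find({"camera_id": camera["_id"]}).sort("creation_date", -1).limit(10)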
to tell if MongoDB is good for holding such data, which eventually will be queried against time ranges (e.g. retrieve all images of a particular camera between a specified hour)? Any suggestions about Document Based schema design for my case?
MongoDB can do this. For better performance, you can set an index on your time field.
What should be the specs of server (CPU, RAM, Disk)? any suggestion?
I think RAM and disk would be important.
If you don't want to do sharding to scale out, you should consider a larger disk so you can store all your data on it.
Your hot data should fit into your RAM. If not, then you should consider more RAM, because the performance of MongoDB mainly depends on RAM.
Should i consider Sharding/Replication for this scenario (while considering the performance in writing to synch replica sets)?
I don't know how many cameras you have, but even 1000 inserts/second with a total of 1000 cameras should still be easy for MongoDB. If you are concerned about insert performance, I don't think you need to do sharding (unless the data size is so big that you have to separate it across several machines).
Another consideration is the read frequency of your application. If it is very high, then you can consider sharding or replication here.
And you can use (timestamp + camera_id) as your sharding key if you query only one camera in a time range.
Are there any benefits of using multiple databases on same machine, so that one database will hold images of current day for all cameras, and the second one will be used to archive previous day images?
You can separate the table into two collections (archive and current), and set an index only on archive if you only run date queries on archive. Without the overhead of index maintenance, the current collection should benefit on inserts.
And you can write a daily program to dump the current data into archive.
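A minimal sketch of such a daily dump job with pymongo follows; the collection names, field name and midnight cut-off are assumptions:

# Hypothetical daily job: move yesterday's documents from 'current' to 'archive'.
from datetime import datetime
from pymongo import MongoClient

db = MongoClient()["cameras"]
cutoff = datetime.combine(datetime.utcnow().date(), datetime.min.time())  # midnight today (UTC)

batch = list(db.current.find({"creation_date": {"$lt": cutoff}}))
if batch:
    db.archive.insert_many(batch)                               # copy into the indexed archive
    db.current.delete_many({"creation_date": {"$lt": cutoff}})  # then trim the write-optimized collection
# Note: the two steps are not atomic; run this during a quiet period.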

What database do online games like Farmville use? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Apart from graphical features, online games should have a simple relational database structure. I am curious what database online games like Farmville and Mafia Wars use.
Is it practical to use SQL-based databases for such programs with such frequent writes?
If not, how could one store the relational dependence of users in these games?
EDIT: As pointed out, they use NoSQL databases like Couchbase. NoSQL is fast with good concurrency (which is really needed here), but the storage size is much larger (due to the key/value structure).
1. Doesn't it slow down the system (as we need to read large database files from the disk)?
2. We will be very limited as we do not have SQL's JOIN to connect different sets of data.
These databases scale to about 500,000 operations per second, and they're massively distributed. Zynga still uses SQL for logs, but for game data, they presently use code that is substantially the same as Couchbase.
“Zynga’s objective was simple: we needed a database that could keep up with the challenging demands of our games while minimizing our average, fully-loaded cost per database operation – including capital equipment, management costs and developer productivity. We evaluated many NoSQL database technologies but all fell short of our stringent requirements. Our membase development efforts dovetailed with work being done at NorthScale and NHN and we’re delighted to contribute our code to the open source community and to sponsor continuing efforts to maintain and enhance the software.” - Cadir Lee, Chief Technology Officer, Zynga
To answer your edit:
You can decrease storage size by using a non key=>value store like MongoDB. This does still have some overhead, but less than trying to maintain a key=>value store.
It does not slow down the system terribly, since quite a few NoSQL products are memory mapped, which means that unlike SQL it doesn't go directly to disk but instead to an fsync queue that then writes to disk when it is convenient to. Those NoSQL solutions which are not memory mapped have extremely fast read/write speeds, and it is normally a case of a trade-off between the two.
As for JOINs, it is a case of arranging your schema in such a manner that you can avoid huge joins. Small joins, to join say a user with his score record, are fine, but aggregated joins will be a problem and you will need to find other ways around this. There are numerous solutions provided by the many user groups of various NoSQL products.
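As a small illustration of that kind of schema arrangement, here is a hypothetical pymongo sketch where the score record is embedded in the user document, so reading a user's score is a single-document fetch instead of a SQL-style join; all names are made up:

# Hypothetical: embed the score record in the user document so no join is needed.
from pymongo import MongoClient

db = MongoClient()["game"]
db.users.insert_one({
    "name": "player_1",
    "farm": {"level": 12, "coins": 5300},
    "score": {"total": 98650, "weekly": 1200},   # would be a separate joined table in SQL
})

doc = db.users.find_one({"name": "player_1"}, {"score": 1})
print(doc["score"]["total"])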
The database they use has been reported to be Membase. It's open source, and one of the many NoSQL databases.
In January 2012, Membase became Couchbase, and you can download it here if you want to give it a try.

Given these expectations, what language or system would you choose to implement the solution? [closed]

Here are the estimates the system should handle:
3000+ end users
150+ offices around the world
1500+ concurrent users at peak times
10,000+ daily updates
4-5 commits per second
50-70 transactions per second (reads/searches/updates)
This will be an internal-only business application, dedicated to helping a shipping company with worldwide shipment management.
What would be your technology choice, why that choice and roughly how long would it take to implement it? Thanks.
Note: I'm not recruiting. :-)
So, you asked how I would tackle such a project. In the Smalltalk world, people seem to agree that Gemstone makes things scale somewhat magically.
So, what I'd really do is this: I'd start developing in a simple Squeak image, using SandstoneDB. Then the moment would come when a single image begins to be too slow; that's when I'd move to GemStone.
GemStone then takes care of copying your public objects (those visible from a certain root) back and forth between all instances. You get sessions and enhanced query functionality, plus quite a fast VM.
It shares data with C, Java and Ruby.
In fact, they have their own VM for ruby, which is also worth a look.
Wikipedia manages much more demanding requirements with MySQL.
Your volumes are significant but not likely to strain any credible RDBMS if programmed efficiently. If your team is sloppy (i.e., casually putting SQL queries directly into components which are then composed into larger components), you face the likelihood of a "multiplier" effect where one logical requirement (get the data necessary for this page) turns into a high number of physical database queries.
So, rather than focussing on the capacity of your RDBMS, you should focus on the capacity of your programmers and the degree to which your implementation language and environment facilitate profiling and refactoring.
The scenario you propose is clearly a 24x7x365 one, too, so you should also consider the need for monitoring / dashboard requirements.
There's no way to estimate development effort based on the needs you've presented; it's great that you've analyzed your transactions to this level of granularity, but the main determinant of development effort will be the domain and UI requirements.
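To make the "multiplier" effect mentioned above concrete, here is a hypothetical sketch of the same page-level requirement done as N+1 queries versus a single joined query; it is plain DB-API-style Python, and the table and column names are invented:

# Hypothetical illustration of the "multiplier" effect: fetching 50 shipments
# plus their office names as 51 round-trips vs. a single joined query.
def load_page_naive(cur):
    # one query for the page, then one query per row: 1 + 50 round-trips
    cur.execute("SELECT id, office_id, ref FROM shipments ORDER BY id DESC LIMIT 50")
    result = []
    for shipment_id, office_id, ref in cur.fetchall():
        cur.execute("SELECT name FROM offices WHERE id = %s", (office_id,))
        result.append((shipment_id, ref, cur.fetchone()[0]))
    return result

def load_page_joined(cur):
    # the same logical requirement as a single round-trip
    cur.execute("""
        SELECT s.id, s.ref, o.name
        FROM shipments s JOIN offices o ON o.id = s.office_id
        ORDER BY s.id DESC LIMIT 50
    """)
    return cur.fetchall()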
Choose the technology your developers know and are familiar with. All major technologies out there will handle such requirements with ease.
Your daily update numbers vs. commits do not add up: four commits per second = 14,400 per hour, or roughly 345,000 per day, far more than 10,000+ daily updates.
You did not mention anything about expected database size.
In any case, I would concentrate my efforts on choosing a robust back end like Oracle, Sybase, Microsoft, etc. This choice will make the most difference in performance. The front end could be either a desktop app or a web app depending on needs. Since this will be used in many offices around the world, a web app might make the most sense.
I'd go with MySQL or PostgreSQL. Not likely to have problems with either one for your requirements.
I love object-databases. In terms of commits-per-second and database-roundtrip, no relational database can hold up. Check out db4o. It's dead easy to learn, check out the examples!
As for the programming language and UI framework: well, take what your team is good at. Dynamic languages, with less time wasted on meta-work, will probably save time.
There is not enough information provided here to give a proper recommendation. A little more due diligence is in order.
What is the IT culture like? Do they prefer lots of little servers or fewer bigger servers or big iron? What is their position on virtualization?
What is the corporate culture like? What is the political climate like? The open source offerings may very well handle the load but you may need to go with a proprietary vendor just because they are already used to navigating the political winds of a large company. Perception is important.
What is the maturity level of the organization? Do they already have an Enterprise Architecture team in place? Do they even know what EA is?
You've described the operational side but what about the analytical side? What OLAP technology are they expecting to use or already have in place?
Speaking of integration, what other systems will you need to integrate with?