NoSQL or SQL for Data Structure [closed]

I'm building an app, and this is my first time working with databases. I went with MongoDB because I originally thought my data structure would be a good fit for it. After more research, I've become a bit lost in all the possible ways I could structure my data, and in which of those would be best for performance versus best suited to my DB type (currently MongoDB, but it could change to PostgreSQL). Here are all my data structures and iterations:
Note: I understand that the "Payrolls" collection is somewhat redundant in the below example. It's just there as something to represent the data hierarchy in this hypothetical.
Original Data Structure
The structure here plays to what NoSQL is good at: quickly fetching everything in a single document. However, I intend for my employee objects to hold lots of data, and I don't want that to push against the document size limit as a user keeps adding employees and data to those employees, so I split them into a separate collection and tied the two together using reference (ObjectId) fields:
Second Data Structure
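A minimal PyMongo sketch of what such a referenced layout might look like (the collection and field names here are illustrative stand-ins, not the post's actual schema):

from pymongo import MongoClient

db = MongoClient()["payroll_app"]  # hypothetical database name

# Employees live in their own collection, so a payroll document stays small.
emp_id = db.employees.insert_one(
    {"name": "Jane Doe", "salary_history": []}
).inserted_id

# The payroll document references employees by ObjectId instead of embedding them.
db.payrolls.insert_one({"client": "Acme Co", "employee_ids": [emp_id]})

# Fetching is one query plus a follow-up "population" of the references.
payroll = db.payrolls.find_one({"client": "Acme Co"})
employees = list(db.employees.find({"_id": {"$in": payroll["employee_ids"]}}))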
It wasn't long after that I wanted to be able to manipulate the clients, locations, departments, and employees independently of one another, while still maintaining their relationships, and I arrived at this iteration of my data structure:
Third and Current Data Structure
It was at this point that I began to realize I had been drifting away from the NoSQL philosophy. Instead of executing one query against one collection in a database (1st iteration), or one query with a follow-up population (2nd iteration), I was now running 4 queries in parallel when grabbing my data, despite all the data being related to each other.
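Concretely, the third structure's read path might look something like this (hypothetical collection and field names; shown sequentially here, though the app would issue the queries in parallel):

from pymongo import MongoClient

db = MongoClient()["payroll_app"]  # hypothetical database name

# Four separate reads, one per collection, joined in application code
# by a shared client id.
client_doc = db.clients.find_one({"name": "Acme Co"})
cid = client_doc["_id"]
locations = list(db.locations.find({"client_id": cid}))
departments = list(db.departments.find({"client_id": cid}))
employees = list(db.employees.find({"client_id": cid}))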
My Questions
Is my first data structure suitable to continue with MongoDB? If so, how do I compensate for the document size limit in the event the employees field grows too large?
Is my second data structure more suitable to continue with MongoDB? If so, how can I manipulate the fields independently? Do I create document schemas/models for each field and query them by model?
Is my third data structure still suitable for MongoDB, or should I consider a move to a relational database with this level of decentralized structure? Does this structure allow me any more freedom or ease of access to manipulate my data than the others?

Your question is a bit broad, but I will answer by saying that MongoDB should be able to handle your current data structure without too much trouble. The maximum size for a BSON document in MongoDB is 16MB (q.v. the documentation). That is quite a lot of text, and it is unlikely that, e.g., a single employee would need 16MB of storage.
In the event that a single object does need to exceed the 16MB BSON maximum, you may use GridFS. GridFS uses a pair of special collections (fs.files and fs.chunks) which have no storage limit other than the maximum database size. With GridFS, you may write objects of any size, and MongoDB will accommodate the operations.
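A minimal GridFS sketch using PyMongo's gridfs module (the database name and payload are hypothetical):

import gridfs
from pymongo import MongoClient

db = MongoClient()["payroll_app"]  # hypothetical database name
fs = gridfs.GridFS(db)             # backed by the fs.files and fs.chunks collections

# put() splits the payload into chunks and returns an ObjectId for later retrieval.
file_id = fs.put(b"...an arbitrarily large serialized employee record...",
                 filename="employee_1234.bin")

# get() streams the chunks back together.
data = fs.get(file_id).read()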

Related

Should I create a new database or use existing databases? [closed]

I have multiple databases that sometimes interact with each other but are mostly independent. I now need to build a new application that allows users to search through the data of the other applications (sort of searching through their history).
So I'm going to need a dozen or so stored procedures/views that will access data from various databases.
Should I put each stored procedure/view on the database being queried? Or should I create a brand-new database for this part of the application that gathers data from all the other databases in views/SPs, and just query that?
I think it should be the first option, but then where do I put the Login table that tracks user logins into this new report application? It doesn't belong in any other database. (Each database has its own login table; it's just the way it was set up.)
What you are asking about here falls under the wide umbrella of business intelligence.
The problem you are going to hit quickly is that reporting queries tend to be few in number but relatively resource-intensive (from a hardware point of view): low volume, high intensity.
The databases you are hitting are most likely high-transaction databases, i.e. they handle a large number of smaller queries, either many single (or multi-row) inserts or quick selects: high volume, low intensity.
Of course, these two models conflict heavily when you try to optimize for both. Running a reporting query that joins multiple tables and runs for several minutes will often lock tables or consume resources in ways that prevent (or severely inhibit) the database from performing its day-to-day job. Conversely, if the system is tuned for a high number of small transactions, your reporting query simply won't get the resources it requires, and the turnaround time on reporting results will be horribly long.
The answer here is a centralized data warehouse that collects the data from the several sources and brings it together so it can be reported on. It usually has three components: a centralized data model, an ETL platform to load that data model from the various sources, and a reporting platform that reads from it. There are third-party products that cover some or all of these, or you can build them separately.
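For illustration only, here is a toy ETL step sketched in Python, with sqlite3 standing in for the real source and warehouse servers (table and column names are made up):

import sqlite3

src = sqlite3.connect("orders_app.db")   # one of the transactional source databases
wh = sqlite3.connect("warehouse.db")     # the centralized warehouse

# Stand-in source table so the sketch runs end to end.
src.execute("CREATE TABLE IF NOT EXISTS orders "
            "(id INTEGER, customer_id INTEGER, amount REAL, created_at TEXT)")

# The warehouse has its own reporting-friendly model, separate from the source schema.
wh.execute("CREATE TABLE IF NOT EXISTS fact_orders "
           "(order_id INTEGER, customer_id INTEGER, amount REAL, order_date TEXT)")

# Extract from the source, load into the warehouse model.
rows = src.execute("SELECT id, customer_id, amount, created_at FROM orders").fetchall()
wh.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
wh.commit()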
There are a few scenarios (usually due to an abundance of resources or a lack of traffic) where reporting directly from the production data of multiple sources works, but those scenarios are pretty few and far between (and almost never hold up in an actual production environment).

LINQ or SQL queries: what is more efficient? [closed]

I have a program with a database of information about people that contains millions of records.
One of the tasks was to filter the results by birth date, then group them by city, and finally compare the population of each city with the given numbers.
I started to write everything as a single SQL query, but then I began to wonder whether that might make the server too busy, and whether it would be better to do some of the calculations in the application itself.
I would like to know if there are any rules/recommendations for:
when to use the server to make calculations?
when to use tools like LINQ in the application?
For such requirements there's no fixed rule or strategy; it is driven by application/business requirements. A couple of suggestions that may help:
Normally a SQL query does a good job of churning through lots of data to deliver a smaller result set after filtering/grouping/sorting. However, it needs correct table design and indexing to stay optimized, and as the data size increases SQL may underperform.
Transferring data over the network, from the hosted database to the application, is what kills performance, since the network can be a big bottleneck, especially once the data grows beyond a certain size.
In-memory processing using LINQ to Objects can be very fast for repetitive calls that apply filters, sort data, and do further processing.
If the UI is a rich client, you can afford to bring lots of data into memory and keep working on it with LINQ as part of your in-memory data structures; if the UI is web-based, you need to cache the data.
To get the same operations as SQL over in-memory data for multiple types, you need custom code, preferably using expression trees along with LINQ; for a known, fixed type, simple LINQ will do.
I have a similar design in one of my web applications; in most practical scenarios a combination of the two works best, as in the sketch below.
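To make the trade-off concrete, here is a small sketch in Python standing in for C#/LINQ (the people table and its columns are hypothetical). The first form pushes filter/group/aggregate to the database; the second pulls rows into memory and does the LINQ-to-Objects-style work in the application:

import sqlite3
from collections import Counter

conn = sqlite3.connect("people.db")  # hypothetical database with a people table

# 1) Server-side: the database filters, groups, and counts, returning only
#    the small per-city result set.
per_city = conn.execute(
    "SELECT city, COUNT(*) FROM people WHERE birth_date >= ? GROUP BY city",
    ("1990-01-01",),
).fetchall()

# 2) Application-side: pull every row, then filter and group in memory.
#    Only sensible when the data is small or already cached locally.
rows = conn.execute("SELECT city, birth_date FROM people").fetchall()
per_city_in_app = Counter(
    city for city, birth_date in rows if birth_date >= "1990-01-01"
)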

Metadata reporting database - what type of DB schema should I use? [closed]

I've been tasked with a project to collect server configuration metadata from Windows servers and store it in a database for reporting purposes. I will be collecting over 100 configuration fields for each server.
One of the things the client wants to be able to do is compare config data, either for the same server at different points in time, or for two different servers that have the same function (e.g. Exchange servers), to see whether there are any differences and what those differences are.
As for DB design, I would normally just normalize all of the data into an OLTP-type schema, where similar config items would be persisted to a table for their specific area (e.g. hardware info). But I'm thinking this may be a bad move, and that I should be looking at some kind of OLAP-type data warehouse instead.
I'm just not sure which way to go with the DB design, so I could do with some direction. Should I normalize the data and create lots of tables, go with one massive unnormalized table of over 100 fields, or look into a star schema or something completely different (EAV)?
I am limited to using .NET and MS SQL Server 2005.
Edit: The tool that collects and stores the data will be run on an as-required basis, rather than grabbing the config data every day/week. I would be looking to keep the data for at least a couple of years.
A star schema is best for reporting purposes, in my experience. It is not necessary to use a star schema for storage, because the reporting layer can be a set of views (indexed for performance), and you can design those star-schema views later. The storage model should be a set of event tables that record configuration changes. You can start from a flat log-file structure and normalize it iteratively to find good structures for storage and queries. A storage model is good if you can define model constraints on it; a reporting model is good if it supports fast ad-hoc queries. Focus on the storage model, because the reporting model is a denormalization of the storage model, and it is easier to denormalize later. EAV structures are useless for both models: you cannot define any constraints, yet the queries are complex anyway.
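As an illustration of that split (not the answerer's actual design; table and column names are invented), here is an event table for one configuration area plus a reporting view derived from it, sketched with sqlite3:

import sqlite3

db = sqlite3.connect("config_metadata.db")
db.executescript("""
-- Storage model: an append-only event table recording configuration changes.
CREATE TABLE IF NOT EXISTS hardware_events (
    server_id   INTEGER NOT NULL,
    captured_at TEXT    NOT NULL,   -- when this snapshot was collected
    cpu_model   TEXT,
    ram_gb      INTEGER,
    disk_gb     INTEGER
);
-- Reporting model: a view exposing the latest snapshot per server,
-- which can later be reshaped into star-schema views.
CREATE VIEW IF NOT EXISTS current_hardware AS
SELECT server_id, cpu_model, ram_gb, disk_gb
FROM hardware_events AS e
WHERE captured_at = (
    SELECT MAX(captured_at) FROM hardware_events
    WHERE server_id = e.server_id
);
""")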

When to use a query or code [closed]

I am asking about a concrete case of Java + JPA/Hibernate + MySQL, but I think this question applies to a great number of languages.
Sometimes I have to query a database to get some entities, such as employees. Let's say you need some specific employees (the ones with 'John' as their first name): would you rather do a query returning exactly that set of employees, or would you fetch all the employees and then use the programming language to pick out the ones you are interested in? And why (ease, efficiency)?
Which is (in general) more efficient?
Is one approach better than the other depending on the table size?
Considering:
Same complexity, reusability in both cases.
Always do the query on the database. If you do not, you have to copy more data over to the client, and databases are written to filter data efficiently, almost certainly more efficiently than your code.
The only exception I can think of is if the filter condition is computationally complex and you can spread the calculation over more CPU power than the database has.
In the cases where I have had a database, the server has had more CPU power than the clients, so unless it is overloaded it will simply run the query more quickly for the same amount of code.
You also have to write less code to do the query on the database using Hibernate's query language than you would to manipulate the data on the client. Hibernate queries will also make use of any client-side caching in the configuration without you having to write more code.
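The question is about Java/JPA, but the contrast is easy to sketch in Python with SQLAlchemy standing in for Hibernate (the Employee mapping and database URL are hypothetical):

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Employee(Base):
    __tablename__ = "employees"
    id = Column(Integer, primary_key=True)
    first_name = Column(String)

engine = create_engine("sqlite:///hr.db")  # hypothetical database
Base.metadata.create_all(engine)           # ensure the table exists for this sketch

with Session(engine) as session:
    # Filter in the database: only the matching rows cross the wire.
    johns = session.query(Employee).filter(Employee.first_name == "John").all()

    # Filter in the application: every row is loaded first. Usually worse.
    all_emps = session.query(Employee).all()
    johns_in_app = [e for e in all_emps if e.first_name == "John"]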
There is a general trick often used in programming: trading memory for speed. If you have lots of employees, and you are going to query a significant portion of them one by one (say, 75% will be queried at one time or another), then query everything, cache it (very important!), and complete the lookups in memory. The next time you query, skip the trip to the RDBMS, go straight to the cache, and do a fast lookup: a round trip to a database is very expensive compared to an in-memory hash lookup.
On the other hand, if you are accessing a small portion of employees, query just the one employee: transferring data from the RDBMS to your program takes a lot of time, a lot of network bandwidth, and a lot of memory on both sides. Querying lots of rows only to throw away all but one never makes sense.
In general, I would let the database do what databases are good at. Filtering data is something databases are really good at, so it would be best left there.
That said, there are some situations where you might want to grab all of them and do the filtering in code. One I can think of is when the number of rows is relatively small and you plan to cache them in your app. In that case you would look up all the rows once, cache them, and do subsequent filtering against the cache.
It's situational. I think in general it's better to use SQL to get the exact result set.
The problem with loading all the entities and then searching programmatically is that you have to load all the entities, which could take a lot of memory, and then you have to search through all of them. Why do that when you can leverage your RDBMS and get exactly the results you want? In other words, why load a large dataset that could use too much memory, then process it, when your RDBMS can do the work for you?
On the other hand, if you know the size of your dataset is not too large, you can load it into memory and then query it. This has the advantage that you don't need to go to the RDBMS, which might or might not require going over the network, depending on your system architecture.
However, even then, you can use various caching utilities so that the common query results are cached, which removes the advantage of caching the data yourself.
Remember that your approach should scale over time. What starts as a small data set can turn into a huge one. We had an issue with a programmer who coded the application to query the entire table and then run manipulations on it. The approach worked fine when there were only 100 rows with two subselects, but as the data grew over the years the performance problems became apparent. Even inserting a date filter to query only the last 365 days can help your application scale better.
(If you are looking for an answer specific to Hibernate, check Mark's answer above.)
Given the Employee example, and assuming the number of employees can grow over time, it is better to query the database for the exact data.
However, for something like Department, where the chances of the data growing rapidly are lower, it can be useful to query all of them once and hold them in memory; that way you don't have to reach out to the external resource (the database) every time, which can be costly.
So the general parameters are these,
scaling of data
criticality to business
volume of data
frequency of usage
To put this concretely: when the data is not going to grow quickly, is not mission-critical, is of a volume manageable in memory on the application server, and is used frequently, bring it all over and filter it programmatically as needed.
Otherwise, fetch only the specific data.
What is better: to store a lot of food at home, or to buy it little by little? When you travel a lot? When hosting a party? It depends, doesn't it? Similarly, the best approach here is a matter of performance optimization, which involves a lot of variables. The art is to avoid painting yourself into a corner when designing your solution, and to optimize later, when you know your real bottlenecks. A good starting point is here: en.wikipedia.org/wiki/Performance_tuning. One thing is more or less universally helpful: encapsulate your data access well.

Pros and Cons of using MongoDB instead of MS SQL Server [closed]

I am new to the NoSQL world and thinking of replacing my MS SQL Server database with MongoDB. My application (written in .NET C#) interacts with IP cameras and records metadata for each image coming from a camera into the MS SQL database. On average, I am inserting about 86,400 records per day per camera, and in the current database schema I have created a separate table for each camera's images, e.g. Camera_1_Images, Camera_2_Images ... Camera_N_Images. A single image record consists of simple metadata like AutoId, FilePath, and CreationDate. To add more detail: my application starts a separate process (.exe) for each camera, and each process inserts one record per second into its corresponding table.
I need suggestions from (MongoDB) experts on the following concerns:
Is MongoDB good for holding such data, which will eventually be queried against time ranges (e.g. retrieve all images of a particular camera within a specified hour)? Any suggestions about document-based schema design for my case?
What should the server specs be (CPU, RAM, disk)? Any suggestions?
Should I consider sharding/replication for this scenario (keeping in mind the write performance of syncing replica sets)?
Are there any benefits to using multiple databases on the same machine, so that one database holds the current day's images for all cameras and a second one archives the previous days' images? I am thinking of this as a way to split reads and writes across separate databases: all read requests would be served by the second database and writes would go to the first. Would this help or not? If yes, how do I ensure that both databases always stay in sync?
Any other suggestions are welcomed please.
I am myself a beginner with NoSQL databases, so I am answering this at the risk of some downvotes, but it will be a great learning experience for me.
Before trying my best to answer your questions, I should say that if MS SQL Server is working well for you, then stick with it. You have not mentioned any valid reason WHY you want to use MongoDB, except the fact that you learnt about it as a document-oriented db. Moreover, I see that you capture almost the same set of metadata for each camera, i.e. your schema is not actually dynamic.
Is MongoDB good for holding such data, which will eventually be queried against time ranges (e.g. retrieve all images of a particular camera within a specified hour)? Any suggestions about document-based schema design for my case?
MongoDB, being a document-oriented db, is good at querying within an aggregate (what you call a document). Since you already store each camera's data in its own table, in MongoDB you would have a separate collection for each camera, and date-range queries are a matter of the comparison operators ($gte, $lt) on an indexed time field.
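A minimal PyMongo sketch of such a range query (the database, collection, and field names are made up):

from datetime import datetime
from pymongo import ASCENDING, MongoClient

images = MongoClient()["cctv"]["camera_1_images"]
images.create_index([("creation_date", ASCENDING)])  # supports the range scan

# All of this camera's images captured during one particular hour.
start = datetime(2013, 5, 1, 14, 0)
end = datetime(2013, 5, 1, 15, 0)
hour_of_images = list(
    images.find({"creation_date": {"$gte": start, "$lt": end}})
)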
What should the server specs be (CPU, RAM, disk)? Any suggestions?
All NoSQL databases are built to scale out on commodity hardware. But from the way you have asked the question, you might be thinking of improving performance by scaling up. Start with a reasonable machine, and as the load increases you can keep adding servers (scaling out). You do not need to plan for and buy a high-end server.
Should I consider sharding/replication for this scenario (keeping in mind the write performance of syncing replica sets)?
MongoDB locks the entire db for a single write (but yields to other operations) and is meant for systems that have more reads than writes, so this depends on how your system behaves. There are multiple ways of sharding, and the choice should be domain-specific; a generic answer is not possible. Some examples: sharding by geography, by branch, etc.
Also read "A Plain English Introduction to CAP Theorem".
Updated with answer to the comment on sharding
According to their documentation, you should consider deploying a sharded cluster if:
your data set approaches or exceeds the storage capacity of a single node in your system.
the size of your system's active working set will soon exceed the capacity of the maximum amount of RAM for your system.
your system has a large amount of write activity, a single MongoDB instance cannot write data fast enough to meet demand, and all other approaches have not reduced contention.
So based on the last point: yes. The auto-sharding feature is built to scale writes, and in that case you have a write lock per shard, not per database. But mine is a theoretical answer; I suggest you get consultation from the 10gen.com group.
Is MongoDB good for holding such data, which will eventually be queried against time ranges (e.g. retrieve all images of a particular camera within a specified hour)?
This question is too subjective for me to answer. From personal experience with numerous SQL solutions (ironically not MS SQL), I would say they are both equally good, if done right.
Also:
What should the server specs be (CPU, RAM, disk)? Any suggestions?
It depends on too many variables that only you know; however, a small cluster of commodity hardware works quite well. I cannot really give a factual response to this question, and it will come down to your own testing.
As for a schema, I would go for a document with this structure:
{
    _id: {},
    camera_name: "my awesome camera",
    images: [
        {
            url: "http://I_like_S3_here.amazons3.com/my_image.png",
            // All your other fields per image
        }
    ]
}
This should be quite easy to maintain and update, so long as you do not embed much deeper, since then it could become a bit of a pain; however, that depends on your queries.
Not only that, but this should be good for sharding, since you have all the data you need in one document; if you were to shard on _id you could probably get the perfect setup here.
Should I consider sharding/replication for this scenario (keeping in mind the write performance of syncing replica sets)?
Possibly. Many people assume they need to shard when in reality they just need to be more intelligent about how they design the database. MongoDB is very free-form, so there are a lot of ways to do it wrong, but there are also a lot of ways of doing it right. Personally, I would keep sharding in mind. Replication can be very useful too.
Are there any benefits to using multiple databases on the same machine, so that one database holds the current day's images for all cameras and a second one archives the previous days' images?
Even though MongoDB's write lock is at the DB level (currently), I would say: no. The right document structure and the right sharding/replication (if needed) should be able to handle this in a single document-based collection (or collections) under a single DB. Not only that, but you can direct writes and reads within a cluster to certain servers, creating concurrency between certain machines in your cluster. I would promote correct use of MongoDB's concurrency features over DB separation.
Edit
After reading the question again, I see I omitted from my solution that you are inserting 80k+ images per camera per day. As such, instead of the embedded option, I would actually make a document per image in a collection called images, plus a camera collection, and query the two much as you would in SQL.
Sharding the images collection should be just as easy, on camera_id.
Also make sure you take your working set into consideration when sizing your server.
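A rough PyMongo sketch of that revised, per-image layout (names are illustrative, not from the answer):

from datetime import datetime, timezone
from pymongo import ASCENDING, MongoClient

db = MongoClient()["cctv"]  # hypothetical database name

# One small document per image, referencing its camera by id.
db.cameras.insert_one({"_id": 1, "name": "lobby cam"})
db.images.insert_one({
    "camera_id": 1,
    "file_path": "/data/cam_1/000001.jpg",
    "creation_date": datetime.now(timezone.utc),
})

# A compound index serves the per-camera time-range queries, and camera_id
# doubles as a natural shard key.
db.images.create_index([("camera_id", ASCENDING), ("creation_date", ASCENDING)])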
Is MongoDB good for holding such data, which will eventually be queried against time ranges (e.g. retrieve all images of a particular camera within a specified hour)? Any suggestions about document-based schema design for my case?
MongoDB can do this. For better performance, you can set an index on your time field.
What should the server specs be (CPU, RAM, disk)? Any suggestions?
I think RAM and disk are the important ones.
If you don't want to shard to scale out, you should consider a larger disk so you can store all your data on it.
Your hot data should fit into your RAM. If it doesn't, then consider more RAM, because MongoDB's performance mainly depends on RAM.
Should I consider sharding/replication for this scenario (keeping in mind the write performance of syncing replica sets)?
I don't know how many cameras you have, but even 1,000 inserts/second from a total of 1,000 cameras should still be easy for MongoDB. If you are concerned about insert performance, I don't think you need sharding (unless the data size is so big that you have to split it across several machines).
Another consideration is the read frequency of your application. If it is very high, then you can consider sharding or replication here.
And you can use (timestamp + camera_id) as your shard key if you only ever query one camera over a time range.
Are there any benefits to using multiple databases on the same machine, so that one database holds the current day's images for all cameras and a second one archives the previous days' images?
You can split the table into two collections (archive and current), and set the date index only on archive if you only run date queries against the archive. Without the overhead of maintaining the index, the current collection should see faster inserts.
And you can write a daily job to move the current data into the archive.
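A hedged sketch of such a daily job in PyMongo (the collection names, and the assumption that everything in current belongs to the finished day, are mine):

from pymongo import MongoClient

db = MongoClient()["cctv"]  # hypothetical database name
current, archive = db.current_images, db.archive_images

# Move yesterday's records wholesale: copy them into the indexed archive,
# then clear them out of the write-optimized current collection.
docs = list(current.find({}))
if docs:
    archive.insert_many(docs)
    current.delete_many({"_id": {"$in": [d["_id"] for d in docs]}})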