I have a program with a database of information about people; it contains millions of records.
One of the tasks was to filter the results by birth date, then group them by city, and finally compare the population of each city with given numbers.
I started to write everything as a single SQL query, but then I began to wonder whether that might make the server too busy, and whether it would be better to do some of the calculations in the application itself.
I would like to know if there are any rules/recommendations on:
when to use the server for calculations?
when to use tools like LINQ in the application?
For such requirements there is no fixed rule or strategy; it is driven by the application / business requirements. A couple of suggestions that may help:
Normally a SQL query does a good job of churning through lots of data to deliver a smaller result set after filtering / grouping / sorting. However, it needs correct table design and indexing to perform well, and as the data size grows SQL may underperform (see the sketch at the end of this answer).
Transferring data over the network, from the hosted database to the application, is what kills performance; the network can be a big bottleneck, especially once the data exceeds a certain size.
In-memory processing using LINQ to Objects can be very fast for repetitive calls that need to apply filters, sort data, and do further processing.
If the UI is a rich client, you can afford to bring lots of data into memory and keep working on it with LINQ as part of in-memory data structures; if the UI is web-based, you need to cache the data.
To get the same operations as SQL over in-memory data for multiple types, you need custom code, preferably using expression trees along with LINQ; for a known fixed type, plain LINQ will do.
I have a similar design in one of my web applications; in practice a combination of the two works best in most scenarios.
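As a rough illustration of the first suggestion, the task in the question (filter by birth date, group by city, compare the count against a threshold) can usually be pushed entirely to the server in a single statement. The table and column names below are hypothetical:

    -- Hypothetical schema: people(id, name, city, birth_date)
    -- Filter by birth date, group by city, and keep only the cities whose
    -- population exceeds a given number -- all computed server-side, so only
    -- the small grouped result set travels over the network.
    SELECT city,
           COUNT(*) AS population
    FROM   people
    WHERE  birth_date >= '1980-01-01'
    GROUP  BY city
    HAVING COUNT(*) > 100000;

With an index on birth_date (ideally covering city as well), the server does the heavy lifting and the application receives one row per qualifying city instead of millions of person records.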
I'm building an app, and this is my first time working with databases. I went with MongoDB because I originally thought my data structure would be a good fit for it. After more research, I've become a bit lost in all the possible ways I could structure my data, and in which of those would be best for performance versus best for my DB type (currently MongoDB, but that could change to PostgreSQL). Here are all my data structures and iterations:
Note: I understand that the "Payrolls" collection is somewhat redundant in the below example. It's just there as something to represent the data hierarchy in this hypothetical.
Original Data Structure
The structure here is consistent with what NoSQL is good at, quickly fetching everything in a single document. However, I intend for my employee object to hold lots of data, and I don't want that to encroach on the document size limit as a user continues to add employees and data to those employees, so I split them into a separate collection and tied them together using reference (object) IDs:
Second Data Structure
It wasn't long after that I wanted to be able to manipulate the clients, locations, departments, and employees all independent of one another, but still maintain their relationships, and I arrived at this iteration of my data structure:
Third and Current Data Structure
It was at this point that I began to realize I had been shifting away from the NoSQL philosophy. Now, instead of executing one query against one collection in a database (1st iteration), or executing one query with a follow-up population (2nd iteration), I was doing 4 queries in parallel to grab my data, despite all the data being tied to each other.
My Questions
Is my first data structure suitable to continue with MongoDB? If so, how do I compensate for the document size limit in the event the employees field grows too large?
Is my second data structure more suitable to continue with MongoDB? If so, how can I manipulate the fields independently? Do I create document schemas/models for each field and query them by model?
Is my third data structure still suitable for MongoDB, or should I consider a move to a relational database with this level of decentralized structure? Does this structure allow me any more freedom or ease of access to manipulate my data than the others?
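For reference, my rough understanding of what a relational version of this hierarchy could look like is below (the nesting order, table names, and column names are just hypothetical placeholders, not my real fields):

    -- Hypothetical relational sketch: each level references its parent, so
    -- every entity can be queried and updated independently while the
    -- relationships are kept by foreign keys.
    CREATE TABLE clients (
        id   SERIAL PRIMARY KEY,
        name TEXT NOT NULL
    );

    CREATE TABLE locations (
        id        SERIAL PRIMARY KEY,
        client_id INTEGER NOT NULL REFERENCES clients(id),
        name      TEXT NOT NULL
    );

    CREATE TABLE departments (
        id          SERIAL PRIMARY KEY,
        location_id INTEGER NOT NULL REFERENCES locations(id),
        name        TEXT NOT NULL
    );

    CREATE TABLE employees (
        id            SERIAL PRIMARY KEY,
        department_id INTEGER NOT NULL REFERENCES departments(id),
        name          TEXT NOT NULL
        -- ...the rest of the employee fields would go here
    );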
Your question is a bit broad, but I will answer by saying that MongoDB should be able to handle your current data structure without too much trouble. The maximum document size for a BSON Mongo document is 16MB (q.v. the documentation). This is quite a lot of text, and it is probably unlikely that, e.g., an employee would need 16MB of storage.
In the event that a single object needs to occupy more than the 16MB BSON maximum, you may use GridFS. GridFS uses special collections (files and chunks) that do not have any storage limit (other than the maximum database size). With GridFS you can write objects of any size, and MongoDB will accommodate the operations.
I have multiple databases that sometimes interact with each other but are mostly independent. I now need to build a new application that allows users to search through the data of the rest of the applications (a sort of search through the history of the other applications).
So I'm going to need a dozen or so stored procedures/views that will access data from various databases.
Should I have each stored procedure/view on the database that is being queried? Or do I have a brand new database for this part of the application that gathers data from all other databases in views/SPs and just query that?
I think it should be the first option, but then where do I put the Login table that tracks user logins into this new reporting application? It doesn't belong in any other database (each database has its own login table; it's just the way it was set up).
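To make the second option concrete, what I imagine is a central reporting database whose views reach into the other databases, something like the following (assuming SQL Server-style three-part names; all object names here are hypothetical):

    -- A view in the central reporting database that pulls from two source databases.
    CREATE VIEW dbo.vw_OrderHistory AS
    SELECT o.OrderId,
           o.CreatedAt,
           c.CustomerName
    FROM   SalesDb.dbo.Orders  AS o
    JOIN   CrmDb.dbo.Customers AS c
           ON c.CustomerId = o.CustomerId;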
What you are asking here fits into the wide umbrella of business intelligence.
The problem you are going to hit quickly: reporting queries tend to be low in number but relatively resource-intensive (from a hardware point of view). If you will, low volume, high intensity.
The databases you are hitting are most likely high-transaction databases, i.e. they deal with a large number of smaller queries, either single (or multiple) inserts or quick selects. If you will, high volume, low intensity.
Of course, these two models conflict heavily when you try to optimize for them. Running a reporting query that joins multiple tables and runs for several minutes will often lock tables or consume resources that prevent (or severely inhibit) the database from performing its day-to-day job. If the system is configured for a high number of small transactions, then your reporting query simply isn't going to get the resources it requires, and the timelines on reporting results will be horribly long.
The answer here is a centralized data warehouse that collects the data from the several sources and brings it together so it can be reported on. It usually has three components: a centralized data model, an ETL platform to load that data model from the various data sources, and a reporting platform that interacts with that data. There are several third-party products (listed in the comments) that mimic the functionality of all three to some degree, or you can build these pieces separately.
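As a minimal sketch of the ETL piece (all object names are hypothetical, shown in SQL Server-style three-part naming), the load step is often little more than a scheduled insert-select from each source system into the central model:

    -- Scheduled (e.g. nightly) incremental load into the warehouse model.
    INSERT INTO WarehouseDb.dbo.OrderHistory (SourceSystem, OrderId, CustomerName, CreatedAt)
    SELECT 'SalesDb',
           o.OrderId,
           c.CustomerName,
           o.CreatedAt
    FROM   SalesDb.dbo.Orders  AS o
    JOIN   CrmDb.dbo.Customers AS c
           ON c.CustomerId = o.CustomerId
    WHERE  o.CreatedAt >= DATEADD(day, -1, CAST(GETDATE() AS date)); -- only recent rows

Reporting queries then run against the warehouse database, so they never compete with the transactional workload on the source databases.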
There are a few scenarios (usually due to an abundance of resources or a lack of traffic) where reporting directly from the production data of multiple sources works, but those scenarios are few and far between (and usually not in an actual production environment).
I am optimizing an Oracle query. I found that the cost of my new query is lower than that of the original one, so I think the new query may perform better. But lots of people on the web say "the COST column cannot be used to compare execution plans". So my question is: how do you know one query performs better than another if you don't check the cost in the explain plan? Are there any other ways? Thanks!
Query plans are extremely useful animals. In general, however, they are not a useful way to determine which of two queries is actually going to be more efficient. There are exceptionally few instances where a human can take a look at two reasonable query plans and immediately know that one will be more efficient than the other. If you can, that generally implies that something is deeply flawed with one of the plans (e.g. you see a table scan of a billion-row table rather than an index access to grab the 10 rows you know you're interested in).
When it comes down to comparing two different plans, you need to focus on execution statistics. Most of the time, measuring the actual logical I/O is the simplest yardstick. You can get that by running SET AUTOTRACE ON in SQL*Plus or by using the autotrace option in SQL Developer (F6 rather than F5 to run the query). You can also measure elapsed time, but that often requires a bit more effort to produce reasonable benchmarks, since you have to account for what fraction of the blocks will be in the various caches (database, operating system, SAN, etc.) when the query runs. CPU time is likely to be a bit more stable across executions regardless of caching. Occasionally, you'll want to measure some other statistic (e.g. the amount of data sent over the network if you're looking at optimizing queries involving database links).
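For example, a SQL*Plus session for comparing two candidate queries might look roughly like this (the query below is just a placeholder; run each candidate in turn and compare the reported statistics):

    -- Suppress result rows and show execution statistics only.
    SET AUTOTRACE TRACEONLY STATISTICS

    -- Placeholder query: run candidate 1, note the statistics, then run
    -- candidate 2 the same way and compare.
    SELECT COUNT(*)
    FROM   employees
    WHERE  hire_date >= DATE '2010-01-01';

    SET AUTOTRACE OFF

In the statistics section, "consistent gets" is the logical I/O figure; for the same result set, the candidate with fewer consistent gets is usually the more efficient query, and "physical reads" tells you how much of that I/O missed the buffer cache.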
There's a data set with around 6 million records. Each record has the same number of fields; there are 8 fields in total:
ID Title Color Date1 Date2 Date3 Date4...
There should be a way to filter these records by title and all date fields (or, 'columns' in RDBMS terms).
The size of the data is not huge, around a few gigabytes. We don't have long text fields, etc. (we got rid of them during the architecture design, so now we have only the really important fields in the data set).
The backend reads and writes the data quite intensively. We would really like to speed up both reads and writes (and filtering by fields) as much as possible. Currently we're using Postgres and we like its reliability, but it seems it's not really fast. Yes, we did some tweaking and optimization, added indexes, installed it on a 32 GB RAM machine, and set all the necessary settings. In other words, it works, but I still believe it could be better. What we need is speed: filtering records by dates and titles should be fast, really fast. Data insertion can be slower. The backend filters all records that have not been processed yet, processes them, and sets the date flag (the datetime when they were processed). There are around 50 backend 'workers' executed every 5-10 seconds, so the DB should be able to perform really fast. Also, we do some DB iterations (kind of map/reduce jobs), so the DB solution should be able to execute this kind of task (RDBMSs are not really good here).
We don't have joins there, the data is already optimized for big data solutions. Only one 'big table'.
And we would like to run it on a single node, or on many small instances. The data is not really important. But we would like to avoid expensive solutions so we're looking for a SQL or NoSQL solution that will perform faster than Postgres on the same cheap hardware.
I remember trying MongoDB about a year or two ago. From what I remember, filtering was not that quick at the time. Cassandra was better, but I remember it was able to perform only a small subset of filtering queries. Riak is good, but only for a big cluster with many machines. This is just my very basic experience; if you know that one of these solutions performs great, please do write that. Or suggest another solution.
Thanks!
I agree with Ryan above. Stick with PostgreSQL.
You haven't described what your write load is actually like (are you updating a few records here and there, but with a lot of parallel queries? Updating with fewer parallel queries but a lot of rows updated at once? etc.). So I can't tell you what you need to do to get more speed.
However, based on your question and the things you say you have tried so far, I would recommend that you consider hiring a consultant to look at your db, look at your environment, etc. with fresh eyes and suggest improvements. My guess is that you have a lot of stuff going on that could be optimized quite a bit and you will spend a lot less on such optimizations than you will switching to a new environment.
I agree with Denis that you should stick with Postgres. In my experience, relational databases, when tuned correctly, give incredibly fast results. Or put another way: I've found it much harder to tune Mongo to get complex queries returning in 10 ms or less than I have tuning SQL Server and MySQL.
Read this website http://use-the-index-luke.com/ for ideas on how to further tune. The guy also wrote a book that will likely be useful to you.
Like Denis said, the data size is not so big that it would be worth the price to start from scratch with a NoSQL solution.
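To make that concrete for the workload described (workers repeatedly picking up unprocessed rows, stamping them, and filtering by title and dates), a couple of targeted Postgres indexes often help a great deal. The column names below are guesses based on the description:

    -- Hypothetical table: records(id, title, color, date1, date2, ..., processed_at)

    -- Partial index: the workers' "give me unprocessed rows" scan hits a small
    -- index that stays small no matter how large the table grows.
    CREATE INDEX records_unprocessed_idx
        ON records (id)
        WHERE processed_at IS NULL;

    -- Composite index for the "filter by title and a date column" pattern.
    CREATE INDEX records_title_date1_idx
        ON records (title, date1);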
I am asking about a concrete case for Java + JPA / Hibernate + MySQL, but I think you can apply this question to a great number of languages.
Sometimes I have to perform a query on a database to get some entities, such as employees. Let's say you need some specific employees (the ones with 'John' as their first name): would you rather run a query returning exactly this set of employees, or would you prefer to fetch all the employees and then use the programming language to pick out the ones you are interested in? Why (ease, efficiency)?
Which is (in general) more efficient?
Is one approach better than the other depending on the table size?
Considering:
Same complexity, reusability in both cases.
Always do the query on the database. If you do not, you have to copy more data over to the client, and databases are written to filter data efficiently, so they will almost certainly be more efficient than your code.
The only exception I can think of is if the filter condition is computationally complex and you can spread the calculation over more CPU power than the database has.
In the cases I have dealt with, the database server has had more CPU power than the clients, so unless it is overloaded it will simply run the query more quickly for the same amount of code.
Also, you have to write less code to do the query on the database using Hibernate's query language than you would writing code to manipulate the data on the client. Hibernate queries will also make use of any client-side caching in the configuration without you having to write more code.
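To make the difference concrete: whether you write it as JPQL/HQL or let Hibernate generate it, the two approaches ultimately boil down to SQL along these lines (table and column names are illustrative):

    -- Filter on the database: only the matching rows travel over the network.
    SELECT * FROM employee WHERE first_name = 'John';

    -- Filter in application code: every row is transferred and held in memory,
    -- and most of it is then thrown away.
    SELECT * FROM employee;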
There is a general trick often used in programming: trading memory for speed. If you have lots of employees, and you are going to query a significant portion of them one by one (say, 75% will be queried at one time or another), then query everything, cache it (very important!), and complete the lookups in memory. The next time you query, skip the trip to the RDBMS, go straight to the cache, and do a fast look-up: a round trip to a database is very expensive compared to an in-memory hash lookup.
On the other hand, if you are accessing a small portion of employees, you should query just one employee: data transfer from the RDBMS to your program takes a lot of time, a lot of network bandwidth, a lot of memory on your side, and a lot of memory on the RDBMS side. Querying lots of rows to throw away all but one never makes sense.
In general, I would let the database do what databases are good at. Filtering data is something databases are really good at, so it would be best left there.
That said, there are some situations where you might just want to grab all of them and do the filtering in code though. One I can think of would be if the number of rows is relatively small and you plan to cache them in your app. In that case you would just look up all the rows, cache them, and do subsequent filtering against what you have in the cache.
It's situational. I think in general, it's better to use sql to get the exact result set.
The problem with loading all the entities and then searching programmatically is that you have to load all the entities, which could take a lot of memory. Additionally, you then have to search through all of them. Why do that when you can leverage your RDBMS and get exactly the results you want? In other words, why load a large data set that could use too much memory and then process it, when you can let your RDBMS do the work for you?
On the other hand, if you know the size of your data set is not too large, you can load it into memory and then query it. This has the advantage that you don't need to go to the RDBMS, which may or may not require going over the network, depending on your system architecture.
However, even then, you can use various caching utilities so that the common query results are cached, which removes the advantage of caching the data yourself.
Remember that your approach should scale over time. What is a small data set now could turn into a huge data set over the years. We had an issue with a programmer who coded the application to query the entire table and then run manipulations on it. The approach worked fine when there were only 100 rows with two subselects, but as the data grew over the years, the performance issues became apparent. Adding even a date filter to query only the last 365 days could help your application scale better.
If you are looking for an answer specific to Hibernate, check @Mark's answer.
Given the Employee example, and assuming the number of employees can grow over time, it is better to query the database for the exact data.
However, if you are considering something like Department (for example), where the chances of the data growing rapidly are low, it is useful to query all of them and keep them in memory; this way you don't have to reach out to the external resource (the database) every time, which could be costly.
So the general parameters are these:
scaling of data
criticality to business
volume of data
frequency of usage
To put it simply: when the data is not going to grow much, is not mission-critical, has a volume manageable in memory on the application server, and is used frequently, bring it all over and filter it programmatically if needed.
Otherwise, fetch only the specific data.
What is better: to store a lot of food at home or to buy it little by little? When you travel a lot? Just when hosting a party? It depends, doesn't it? Similarly, the best approach here is a matter of performance optimization, and that involves a lot of variables. The art is to avoid painting yourself into a corner when designing your solution and to optimize later, when you know your real bottlenecks. A good starting point is here: en.wikipedia.org/wiki/Performance_tuning One thing is more or less universally helpful: encapsulate your data access well.