I am asking about a concrete case for Java + JPA/Hibernate + MySQL, but I think this question applies to a great number of languages.
Sometimes I have to perform a query on a database to get some entities, such as employees. Let's say you need some specific employees (the ones with 'John' as their first name): would you rather run a query returning exactly this set of employees, or would you prefer to fetch all the employees and then use the programming language to pick out the ones you are interested in? Why (ease, efficiency)?
Which is (in general) more efficient?
Is one approach better than the other depending on the table size?
Considering:
Same complexity, reusability in both cases.
Always do the query on the database. If you don't, you have to copy more data over to the client, and databases are written to filter data efficiently - almost certainly more efficiently than your code.
The only exception I can think of is if the filter condition is computationally complex and you can spread the calculation over more CPU power than the database has.
In the cases where I have had a database, the server has had more CPU power than the clients, so unless it is overloaded it will simply run the query more quickly for the same amount of code.
Also, you have to write less code to do the query on the database using Hibernate's query language than you would writing code to manipulate the data on the client. Hibernate queries will also make use of any client caching in the configuration without you having to write more code.
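As a minimal sketch of the difference, assuming a hypothetical employees table with a firstname column (names invented for illustration; the JPQL/HQL equivalent is analogous):

-- filter on the database: only matching rows cross the network
SELECT * FROM employees WHERE firstname = 'John';

-- filter in application code: every row is transferred, most of it discarded client-side
SELECT * FROM employees;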
There is a general trick often used in programming: trading memory for speed. If you have lots of employees, and you are going to query a significant portion of them one by one (say, 75% will be queried at one time or another), then query everything, cache it (very important!), and complete the lookups in memory. The next time you query, skip the trip to the RDBMS, go straight to the cache, and do a fast look-up: a round trip to a database is very expensive compared to an in-memory hash lookup.
On the other hand, if you are accessing a small portion of employees, you should query just one employee: data transfer from the RDBMS to your program takes a lot of time, a lot of network bandwidth, a lot of memory on your side, and a lot of memory on the RDBMS side. Querying lots of rows to throw away all but one never makes sense.
In general, I would let the database do what databases are good at. Filtering data is something databases are really good at, so it would be best left there.
That said, there are some situations where you might just want to grab all of them and do the filtering in code though. One I can think of would be if the number of rows is relatively small and you plan to cache them in your app. In that case you would just look up all the rows, cache them, and do subsequent filtering against what you have in the cache.
It's situational. I think in general, it's better to use sql to get the exact result set.
The problem with loading all the entities and then searching programmatically is that you have to load all the entities, which could take a lot of memory. Additionally, you then have to search through all of them. Why do that when you can leverage your RDBMS and get exactly the results you want? In other words, why load a large dataset that could use too much memory, and then process it, when you can let your RDBMS do the work for you?
On the other hand, if you know the size of your dataset is not too large, you can load it into memory and then query it - this has the advantage that you don't need to go to the RDBMS, which might or might not require going over your network, depending on your system architecture.
However, even then, you can use various caching utilities so that the common query results are cached, which removes the advantage of caching the data yourself.
Remember that your approach should scale over time. What is a small data set now could turn into a huge data set later. We had an issue with a programmer who coded the application to query the entire table and then run manipulations on it. The approach worked fine when there were only 100 rows with two subselects, but as the data grew over the years the performance issues became apparent. Adding even a date filter to query only the last 365 days could help your application scale better.
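As a hedged sketch of such a date filter (table and column names invented; interval syntax varies by RDBMS, MySQL-style shown):

-- restrict the work to the last 365 days instead of reading the whole table
SELECT *
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL 365 DAY;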
If you are looking for an answer specific to Hibernate, check @Mark's answer.
Given the Employee example - assuming the number of employees can grow over time - it is better to query the database for the exact data.
However, if you are considering something like Department (for example), where the chances of the data growing rapidly are low, it is useful to query all of them and keep them in memory - this way you don't have to reach out to the external resource (the database) every time, which could be costly.
So the general parameters are these:
scaling of data
criticality to business
volume of data
frequency of usage
To put it plainly: when the data is not going to grow frequently, is not mission critical, has a volume that is manageable in memory on the application server, and is used frequently - bring it all and filter it programmatically if needed.
Otherwise, fetch only the specific data.
What is better: to store a lot of food at home, or to buy it little by little? When you travel a lot? When hosting a party? It depends, doesn't it? Similarly, the best approach here is a matter of performance optimization, which involves a lot of variables. The art is to avoid painting yourself into a corner when designing your solution and to optimize later, when you know your real bottlenecks. A good starting point is here: en.wikipedia.org/wiki/Performance_tuning One thing is more or less universally helpful: encapsulate your data access well.
Related
Given a live table in SQL with some non-trivial number of columns/entries, with one or more applications actively querying it, what would be the effect of introducing a new index on some column of this table? What takes priority? Serving the query, or constructing the index? Put another way, would setting up the index be experienced by the querying applications as a delay in getting their responses?
It is possible to use the database while indexing is taking place, but its effect on performance is nearly impossible for us to predict. A great deal about the optimizer is magic to anyone who hasn't worked on it themselves, and the answer could change greatly depending on which RDBMS you're using. On top of that, your hardware will play a huge part in the answer.
That being said, if you're primarily reading from the table, there's a good chance you won't see a major performance hit, provided your system has the I/O and CPU capacity to handle both tasks at the same time. Inserting, however, will be slowed down considerably.
Whether this impact is problematic will depend on your current system load, size of your tables, and what exactly it is you're indexing. Generally speaking, if you have a decent server, a lowish load, and a table with only a few million rows or less, I wouldn't expect to see a performance hit at all.
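As a hedged aside, some databases can build an index without blocking writers at all. In PostgreSQL, for example, a plain CREATE INDEX blocks writes (but not reads) for the duration of the build, while CREATE INDEX CONCURRENTLY keeps the table writable at the cost of a slower build. Table and column names below are invented:

-- blocks concurrent writes, allows reads, builds fastest
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- or, alternatively, in PostgreSQL: keeps the table fully writable, but the build takes longer
CREATE INDEX CONCURRENTLY idx_orders_customer ON orders (customer_id);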
There's a data set with around 6 million records. Each record has the same number of fields; there are 8 fields in total:
ID Title Color Date1 Date2 Date3 Date4...
There should be a way to filter these records by title and by all of the date fields (or 'columns' in RDBMS terms).
The size of the data is not huge, around a few gigabytes. We don't have long text fields etc. (we got rid of them during architecture design, so now we have only the really important fields in the data set).
The backend reads and writes the data quite intensively. We would really like to speed up both reads and writes (and filtering by fields) as much as possible. Currently we're using Postgres and we like its reliability, but it seems it's not really fast. Yes, we did some tweaking and optimization, added indexes, installed it on a 32 GB RAM machine and set all the necessary settings. In other words, it works, but I still believe it could do better. What we need is speed: filtering records by dates and titles should be fast, really fast. Data insertion may be slower. The backend filters all records that have not been processed, processes them, and sets a date flag (the datetime when they were processed). There are around 50 backend 'workers' executed every 5-10 seconds, so the DB should be able to perform really fast. We also do some DB iterations (a kind of map/reduce job), so the DB solution should be able to handle this kind of task (RDBMSs are not really good here).
We don't have joins there, the data is already optimized for big data solutions. Only one 'big table'.
And we would like to run it on a single node, or on many small instances. The data is not really important. But we would like to avoid expensive solutions so we're looking for a SQL or NoSQL solution that will perform faster than Postgres on the same cheap hardware.
I remember I tried MongoDB about a year or two ago. From what I remember, filtering was not so quick at that time. Cassandra was better, but I remember it was able to perform only a small subset of filtering queries. Riak is good, but only for a big cluster with many machines. This is my very basic experience; if you know that one of these solutions performs great, please do write that. Or suggest another solution.
Thanks!
I agree with Ryan above. Stick with PostgreSQL.
You haven't described what your write load is actually like (are you updating a few records here and there but with a lot of parallel queries? Updating a lot of rows at once with fewer parallel queries? etc.). So I can't tell you what you need to do to get more speed.
However, based on your question and the things you say you have tried so far, I would recommend that you consider hiring a consultant to look at your db, look at your environment, etc. with fresh eyes and suggest improvements. My guess is that you have a lot of stuff going on that could be optimized quite a bit and you will spend a lot less on such optimizations than you will switching to a new environment.
I agree with Denis that you should stick with Postgres. In my experience, relational databases, when tuned correctly, give incredibly fast results. Or put another way: I've found it much harder to tune Mongo to get complex queries returning in 10 ms or less than I have tuning SQL Server and MySQL.
Read this website http://use-the-index-luke.com/ for ideas on how to further tune. The guy also wrote a book that will likely be useful to you.
Like Denis said, the data size is not so big that it would be worth the price to start from scratch with a NoSQL solution.
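As a sketch of the kind of index that tends to help the filtering pattern described in the question (filtering by title and dates, and picking up unprocessed rows). The table and column names here are assumptions, including treating one date column as the "processed at" flag:

-- composite index matching the common filter: title equality plus a date range
CREATE INDEX idx_records_title_date ON records (title, date1);

-- PostgreSQL partial index, useful if workers only ever look for unprocessed rows
CREATE INDEX idx_records_unprocessed ON records (date1) WHERE date4 IS NULL;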
I have developed a website which provides very generic data storage. Currently it works just fine but I am thinking about optimizing the speed.
The INSERT/SELECT ratio is hard to predict and changes from case to case, but usually SELECTs are more frequent. INSERTs are fast enough. SELECTs are what worry me. There are a lot of LEFT JOINs. E.g. each object can have an image which is stored in a separate table (as it can be shared across multiple objects), and that table stores additional information about the image as well.
Up to 8 joins are made on every select, and it can take up to 1 second to process - the mean value is around 0.3 s. There can be multiple such selects for every request. The SQL side has already been optimized multiple times and there is not much more that can be done there.
Other than buying a more powerful machine for the DB, what can be done (if anything)?
Django is not a speed demon here either, but we still have some optimizations left there; we can switch to PyPy if we must. On the DB side I had a few ideas, but they seem to be uncommon - I couldn't find any real-world case studies.
Use a different, faster storage engine for this part. We need transactions and consistency checks, so it may not be preferable.
A searchable cache? Does it make any sense here? E.g. maintain a flat copy of all the tables combined in NoSQL or something. Inserts would be more expensive - multiple records in NoSQL would need updating whenever a common table changes. Tough to maintain as well.
Or is this already as fast as it gets, and should we just get more RAM, increase the cache size in the RDBMS, get SSDs and leave it at that, and focus on optimizing other parts, like pooling database connections, since they are expensive as well?
Technologies used: PostgreSQL 9.1 and Django (python).
To summarize, the question is: after optimizing the whole SQL side - indexes, clustering, etc. - what can be done to optimize further when a static, timeout-based cache of results is not an option (different request arguments mean different results anyway)?
---EDIT 30-08-2012---
We are already checking slow queries on a daily basis. This IS our bottleneck. We only order and filter on indexed columns. Also, sorry for not being clear about this - we don't store the actual images in the db, just file paths.
JOINs and ORDER BY are killing our performance here. E.g. one complex query that spits out 20,000 results takes 1800 ms (measured with EXPLAIN ANALYZE). And this assumes we are not doing any filtering based on the JOINed tables.
If we skip all the JOINs we are down to 110 ms. That's insane... That's why we are thinking of some kind of searchable cache or flat NoSQL copy.
Without ordering we get 60 ms, which is great - but what's with the JOIN performance in PostgreSQL?
Is there some different DB that can do better for us? Preferably a free one.
First, although I think there are times and places to store image files in the database, in general you are going to have extra I/O and memory associated with that sort of operation. If I were optimizing this, I would store a path with every image and bulk-save the files to the filesystem. That way they are still referenced in your db for backup purposes, but you can just pull the relative path out and generate links, saving you a bunch of SQL queries and reducing overhead. With a web-based backend you aren't going to get transactions working really well between generating the HTML and retrieving the image anyway, since these come in under different HTTP requests.
As for speed, I can't tell whether you are looking at total HTTP request time or db time. But the first thing you need to do is break everything apart and find where most of your time is being spent. This may surprise you. The next thing is to get query plans for the queries that are slow:
http://heatware.net/databases/how-to-find-log-slow-queries-postgresql/
Then from there, start using EXPLAIN ANALYZE to find out what the problem is.
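A minimal sketch of that step, with a made-up query in the shape the question describes (an object table LEFT JOINed to an image table):

-- shows the actual plan, per-node timings, and row counts for a slow query
EXPLAIN ANALYZE
SELECT o.*, i.path
FROM objects o
LEFT JOIN images i ON i.id = o.image_id
ORDER BY o.created_at DESC
LIMIT 50;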
Also, in deciding whether to upgrade hardware, you want a good idea of where you are currently hitting limits. More RAM generally helps (and it is helpful if your db can fit comfortably in RAM), but beyond that it makes no sense to put faster storage in a CPU-bound server or to switch to a server with faster CPUs in an I/O-bound server. top is your friend there. Similarly, depending on the concurrency issues, it might (or might not!) make sense to use a hot standby for your select statements.
But without a lot more information I can't tell you what the best way to go about further optimizing your db is. PostgreSQL is capable of running really fast under the right conditions and scaling very well.
I have a database containing a single huge table. At the moment a query can take anything from 10 to 20 minutes and I need that to go down to 10 seconds. I have spent months trying different products like GridSQL. GridSQL works fine, but is using its own parser which does not have all the needed features. I have also optimized my database in various ways without getting the speedup I need.
I have a theory on how one could scale out queries, meaning that I utilize several nodes to run a single query in parallel. A precondition is that the data is partitioned (vertically), one partition placed on each node. The idea is to take an incoming SQL query and simply run it exactly as it is on all the nodes. When the results are returned to a coordinator node, the same query is run on the union of the result sets. I realize that an aggregate function like average needs to be rewritten into a count and a sum for the nodes, and that the coordinator divides the sum of the sums by the sum of the counts to get the average.
What kinds of problems could not easily be solved using this model? I believe one issue would be the count distinct function.
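To make the average rewrite concrete, here is a hedged sketch (table and column names invented). Each node returns partial aggregates and the coordinator combines them; COUNT(DISTINCT ...) does not decompose this way, because the same value may appear on several nodes and would be double-counted.

-- run on every node: return partial aggregates, not the final average
SELECT SUM(salary) AS sum_salary, COUNT(*) AS cnt
FROM employees;

-- run on the coordinator over the union of the node results
SELECT SUM(sum_salary) / SUM(cnt) AS avg_salary
FROM node_results;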
Edit: I am getting so many nice suggestions, but none have addressed the method.
It's a data volume problem, not necessarily an architecture problem.
Whether on 1 machine or 1000 machines, if you end up summarizing 1,000,000 rows, you're going to have problems.
Rather than normalizing your data, you need to de-normalize it.
You mention in a comment that your data base is "perfect for your purpose", when, obviously, it's not. It's too slow.
So, something has to give. Your perfect model isn't working, as you need to process too much data in too short of a time. Sounds like you need some higher level data sets than your raw data. Perhaps a data warehousing solution. Who knows, not enough information to really say.
But there are a lot of things you can do to satisfy a specific subset of queries with a good response time, while still allowing ad hoc queries that respond in "10-20 minutes".
Edit regarding comment:
I am not familiar with "GridSQL", or what it does.
If you send several identical SQL queries to individual "shard" databases, each containing a subset of the data, then the simple selection query will scale to the network (i.e. you will eventually become network bound at the controller), as this is a truly parallel, stateless process.
The problem becomes, as you mentioned, the secondary processing, notably sorting and aggregates, as this can only be done on the final, "raw" result set.
That means that your controller ends up, inevitably, becoming your bottleneck and, in the end, regardless of how "scaled out" you are, you still have to contend with a data volume issue. If you send your query out to 1000 nodes and inevitably have to summarize or sort the 1000-row result set from each node, resulting in 1M rows, you still have a long result time and a large data-processing demand on a single machine.
I don't know what database you are using, and I don't know the specifics about individual databases, but you can see how if you actually partition your data across several disk spindles, and have a decent, modern, multi-core processor, the database implementation itself can handle much of this scaling in terms of parallel disk spindle requests for you. Which implementations actually DO do this, I can't say. I'm just suggesting that it's possible for them to (and some may well do this).
But, my general point, is if you are running, specifically, aggregates, then you are likely processing too much data if you're hitting the raw sources each time. If you analyze your queries, you may well be able to "pre-summarize" your data at various levels of granularity to help avoid the data saturation problem.
For example, if you are storing individual web hits, but are more interested in activity based on each hour of the day (rather than the subsecond data you may be logging), summarizing to the hour of the day alone can reduce your data demand dramatically.
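A hedged sketch of that kind of pre-summarization (table and column names invented; date_trunc is PostgreSQL-style):

-- roll raw hits up to one row per page per hour; reports scan the summary, not the raw log
CREATE TABLE hits_by_hour AS
SELECT date_trunc('hour', hit_time) AS hit_hour,
       page,
       COUNT(*) AS hit_count
FROM raw_hits
GROUP BY date_trunc('hour', hit_time), page;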
So, scaling out can certainly help, but it may well not be the only solution to the problem, rather it would be a component. Data warehousing is designed to address these kinds of problems, but does not work well with "ad hoc" queries. Rather you need to have a reasonable idea of what kinds of queries you want to support and design it accordingly.
One huge table - can this be normalised at all?
If you are doing mostly select queries, have you considered either normalising to a data warehouse that you then query, or running analysis services and a cube to do your pre-processing for you?
From your question, what you are doing sounds like the sort of thing a cube is optimised for, and could be done without you having to write all the plumbing.
By trying a custom solution (grid) you introduce a lot of complexity. Maybe it's your only option, but first, did you try partitioning the table (the native solution)?
I'd seriously be looking into an OLAP solution. The trick with a cube is that, once built, it can be queried in lots of ways you may not have considered. And as @HLGEM mentioned, have you addressed indexing?
Even at millions of rows, a good search should be logarithmic, not linear. If you have even one query that results in a scan, your performance will be destroyed. We might need an example of your structure to see if we can help more.
I also agree fully with @Mason: have you profiled your query and investigated the query plan to see where your bottlenecks are? The fact that adding nodes improves speed makes me think that your query might be CPU bound.
David,
Are you using all of the features of GridSQL? You can also use constraint exclusion partitioning, effectively breaking out your big table into several smaller tables. Depending on your WHERE clause, when the query is processed it may look at a lot less data and return results much faster.
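For readers unfamiliar with the technique, here is a hedged sketch of constraint exclusion partitioning in the classic PostgreSQL inheritance style (table, column, and range choices are invented):

-- child tables carry CHECK constraints; with constraint_exclusion enabled,
-- the planner skips partitions that cannot match the WHERE clause
CREATE TABLE facts_2010 (CHECK (event_date >= DATE '2010-01-01'
                            AND event_date <  DATE '2011-01-01')) INHERITS (facts);
CREATE TABLE facts_2011 (CHECK (event_date >= DATE '2011-01-01'
                            AND event_date <  DATE '2012-01-01')) INHERITS (facts);

-- a query with a matching predicate only touches facts_2011
SELECT SUM(amount) FROM facts WHERE event_date >= DATE '2011-03-01';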
Also, are you using multiple logical nodes per physical server? Configuring it that way can take advantage of otherwise idle cores.
If you monitor the servers during execution, is the bottleneck IO or CPU?
Also alluded to here is that you may want to roll up rows in your fact table into summary tables/cubes. I don't know enough about Tableau - will it automatically use the appropriate cube and drill down only when necessary? If so, it seems like you would get big gains doing something like this.
My guess (based on nothing but my gut) is that any gains you might see from parallelization will be eaten up by reaggregation and subsequent queries of the results. Further, I would think that writing might get more complicated with pk/fk/constraints. If this were my world, I would probably create many indexed views on top of my table (and other views) that optimized for the particular queries I need to execute (which I have worked with successfully on 10million+ row tables.)
If you run the incoming query, unpartitioned, on each node, why will any node finish before a single node running the same query would finish? Am I misunderstanding your execution plan?
I think this is, in part, going to depend on the nature of the queries you're executing and, in particular, how many rows contribute to the final result set. But surely you'll need to partition the query somehow among the nodes.
Your method to scale out queries works fine.
In fact, I've implemented such a method in:
http://code.google.com/p/shard-query
It uses a parser, but it supports most SQL constructs.
It doesn't yet support count(distinct expr) but this is doable and I plan to add support in the future.
I also have a tool called Flexviews (google for flexviews materialized views)
This tool lets you create materialized views (summary tables) which include various aggregate functions and joins.
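Flexviews has its own API for defining views, so the following is only a generic, hedged illustration of the summary-table idea (PostgreSQL-style materialized view syntax, names invented - not Flexviews syntax):

-- a pre-aggregated summary that queries hit instead of the raw fact table
CREATE MATERIALIZED VIEW sales_by_day AS
SELECT sale_date, product_id, SUM(amount) AS total_amount, COUNT(*) AS order_count
FROM sales
GROUP BY sale_date, product_id;

-- refreshed on whatever schedule the staleness requirements allow
REFRESH MATERIALIZED VIEW sales_by_day;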
Those tools combined together can yield massive scalability improvements for OLAP type queries.
I have these tables:
Projects(projectID, CreatedByID)
Employees(empID,depID)
Departments(depID,OfficeID)
Offices(officeID)
CreatedByID is a foreign key for Employees. I have a query that runs for almost every page load.
Is it bad practice to just add a redundant OfficeID column to Projects to eliminate the three joins? Or should I do the following:
SELECT *
FROM Projects P
JOIN Employees E ON P.CreatedByID = E.EmpID
JOIN Departments D ON E.DepID = D.DepID
JOIN Offices O ON D.officeID = O.officeID
WHERE O.officeID = #SomeOfficeID
In application programming I "Write with best practices first and optimize afterwards", but database administrators are always warning about the cost of joins.
Normalize till it hurts, then denormalize till it works
Denormalization has the advantage of fast SELECTs on large queries.
Disadvantages are:
It takes more coding and time to ensure integrity (which is most important in your case)
It's slower on DML (INSERT/UPDATE/DELETE)
It takes more space
As for optimization, you may optimize either for faster querying or for faster DML (as a rule, these two are antagonists).
Optimizing for faster querying often implies duplicating data, be it through denormalization, indices, extra tables, or whatever.
In case of indices, the RDBMS does it for you, but in case of denormalization, you'll need to code it yourself. What if Department moves to another Office? You'll need to fix it in three tables instead of one.
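To make that maintenance burden concrete, here is a hedged sketch using the tables from the question (placeholders written in the #Something style of the original query): with a redundant OfficeID copied onto Projects, moving a department to a new office means repairing the copies yourself.

-- normalized: a single row changes
UPDATE Departments SET OfficeID = #NewOfficeID WHERE depID = #SomeDepID;

-- denormalized: you must also repair every project created by that department's employees
UPDATE Projects
SET OfficeID = #NewOfficeID
WHERE CreatedByID IN (SELECT empID FROM Employees WHERE depID = #SomeDepID);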
So, as far as I can see from the names of your tables, there won't be millions of records there. So you'd better normalize your data; it will be simpler to manage.
Always normalize as far as necessary to remove database integrity issues (i.e. potential duplicated or missing data).
Even if there were performance gains from denormalizing (which is usually not the case), the price of losing data integrity is too high to justify.
Just ask anyone who has had to work on fixing all the obscure issues from a legacy database whether they would prefer good data or insignificant (if any) speed increases.
Also, as mentioned by John - if you do end up needing denormalised data (for speed/reporting/etc) then create it in a separate table, preserving the raw data.
The cost of joins shouldn't worry you too much per se (unless you're trying to scale to millions of users, in which case you absolutely should worry).
I'd be more concerned about the effect on the code that's calling this. Normalized databases are much easier to program against, and almost always lead to better efficiency within the application itself.
That said, don't normalize beyond the bounds of reason. I've seen normalization for normalization's sake, which usually ends up in a database that has one or two tables of actual data, and 20 tables filled with nothing but foreign keys. That's clearly overkill. The rule I normally use is: If the data in a column would otherwise be duplicated, it should be normalized.
It is better to keep that schema in Third Normal Form and let your DBA complain about the cost of joins.
DBAs should be concerned if your db is not properly normalized to begin with. After you have carefully measured performance and determined you have bottlenecks, you may start denormalizing, but I would be extremely cautious.
I'd be most concerned about DBAs who are warning you about the cost of joins, unless you're in a highly pathological situation.
You shouldn't look at denormalizing before you've tried everything else.
Is the performance of this really an issue?
Do your database have any features you can use to speed things up without compromising integrity?
Can you increase your performance by caching?
Normalize to model the concepts in your design, and their relationship. Think of what relationships can change, and what a change like that will mean in terms of your design.
In the schema you posted, there is what looks to me like a glaring error (which may not be an error if you have a special case in terms of how your organization works) -- there is an implicit assumption that every department is in exactly one office, and that all the employees who are in the same department work at that office.
What if the department occupies two offices?
What if an employee nominally belongs to one department, but works out of a different office (assuming you are referring to physical offices)?
Don't denormalize.
Design your tables according to simple and sound design principles that will make it easy to implement the rest of your system. Easy to build, populate, use, and administer the database. Easy and fast to run queries and updates against. Easy to revise and extend the table design when the situation calls for it, and unnecessary to do so for light and transient reasons.
One set of design principles is normalization. Normalization leads to tables that are easy and fast to update (including inserts and deletes). Normalization obviates update anomalies, and obviates the possibility of a database that contradicts itself. This prevents a whole lot of bugs by making them impossible. It also prevents a whole lot of update bottlenecks by making them unnecessary. This is good.
There are other sets of design principles. They lead to table designs that are less than fully normalized. But that isn't "denormalization". It's just a different design, somewhat incompatible with normalization.
One set of design principles that leads to a radically different design from normalization is star schema design. Star schema is very fast for queries. Even large scale joins and aggregations can be done in a reasonable time, given a good DBMS, good physical design, and enough hardware to get the job done. As you might expect, a star schema suffers from update anomalies. You have to program around these anomalies when you keep the database up to date. You will generally need a tightly controlled and carefully built ETL process that updates the star schema from other (perhaps normalized) data sources.
Using data stored in a star schema is dramatically easy. It's so easy that using some kind of OLAP and reporting engine, you can get all the information needed without writing any code, and without sacrificing performance too much.
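As a hedged illustration of the shape of a star schema (all tables and columns invented): one narrow fact table of measures, keyed to a few denormalized dimension tables that queries join out to and aggregate over.

-- dimension tables: small, denormalized, descriptive
CREATE TABLE dim_date   (date_key INT PRIMARY KEY, full_date DATE, year_num INT, month_num INT);
CREATE TABLE dim_office (office_key INT PRIMARY KEY, office_name VARCHAR(100), region VARCHAR(100));

-- fact table: foreign keys plus measures, one row per event
CREATE TABLE fact_project_hours (
    date_key   INT REFERENCES dim_date (date_key),
    office_key INT REFERENCES dim_office (office_key),
    hours      NUMERIC(8,2)
);

-- a typical star query: join out to the dimensions, aggregate the measures
SELECT d.year_num, o.region, SUM(f.hours) AS total_hours
FROM fact_project_hours f
JOIN dim_date d   ON d.date_key = f.date_key
JOIN dim_office o ON o.office_key = f.office_key
GROUP BY d.year_num, o.region;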
It takes good and somewhat deep data analysis to design a good normalized schema. Errors and omissions in data analysis may result in undiscovered functional dependencies. These undiscovered FDs will result in unwitting departures from normalization.
It also takes good and somewhat deep data analysis to design and build a good star schema. Errors and omissions in data analysis may result in unfortunate choices in dimensions and granularity. This will make ETL almost impossible to build, and/or make the information-carrying capacity of the star inadequate for the emerging needs.
Good and somewhat deep data analysis should not be an excuse for analysis paralysis. The analysis has to be right and reasonably complete in a short amount of time. Shorter for smaller projects. The design and implementation should be able to survive some late additions and corrections to the data analysis and to the requirements, but not a steady torrent of requirements revisions.
This response expands on your original question, but I think it's relevant for the would be database designer.
Normalization is a quality decision.
Denormalization is a performance decision.
That's why -
Normalize till it hurts; De-normalize till it works.
Quality decisions tell which is the least Normal Form that you can live with:
How much non-redundancy is important for your tables?
How fast data management do you want?
How clear do you want the relation between your tables?
Performance decisions tell what is the highest Normal Form acceptable:
Is my database's response fast enough?
Are too many joins causing a slowdown?
When you have fixed the least and the highest Normal Form acceptable in your case, pick the Normal Form anywhere in between.
If you're using integers (or BIGINT) as the IDs and they are the clustered primary key, you should be fine.
Although it seems like it would always be faster to find an office from a project, since you are always looking up primary keys, the use of indexes on the foreign keys will make the difference minimal, as the indexes will cover the primary keys too.
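A hedged sketch of those foreign-key indexes for the schema in the question (index names invented; the primary keys are assumed to be indexed already):

-- support each join step and the final office filter
CREATE INDEX idx_projects_createdby ON Projects (CreatedByID);
CREATE INDEX idx_employees_dep      ON Employees (depID);
CREATE INDEX idx_departments_office ON Departments (OfficeID);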
If you ever find a need later on to denormalise the data, you can create a cache table on a schedule or trigger.
In the example given, indexes set up properly on the tables should allow the joins to occur extremely fast and should scale well to hundreds of thousands of rows. This is usually the approach I take to get around the issue.
There are times, though, when the data is written once and then selected for the rest of its life, where it really doesn't make sense to do a dozen joins each time.