How to speed up joins - SQL

I'm using SQL Server 2008 R2. I have a problem returning data to the user because of massive joins; for example, one query needs 5 inner and 6 left joins (usually to table-valued functions, sometimes to tables), and it takes far too long.
What are the workarounds for this problem?
Should I denormalize my database?
What are the best practices to avoid huge number of joins?

I'd have to see the SQL to troubleshoot specifics, but here are a few things I do when pulling results that have extremely high demand:
Use your tools. Display Estimated Execution Plan can expose some obvious vagaries in your logic.
Learn to love 'where exists' and 'having'. You can sometimes narrow the focus and scope by qualifying in creative ways that don't require hard I/O. This is more true for sub-queries than for joins, but I add a clause for every outer join I need.
Most importantly IMO, don't be afraid of staging your results. You sometimes need to process billions/trillions of transactions against millions of records, and what takes hours with joins can be accomplished in minutes or seconds by staging. If you only need x% of your top 2 or 3 tables, why join every record top to bottom? Sometimes it's just too much overhead.
Pull your simplest result-set down to a stage table (or temp, whatever you need), index it and then go after the next chunk. That usually saves me a fortune in memory.
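A minimal sketch of that staging pattern, with hypothetical table and column names:

-- Stage 1: pull only the slice you actually need into a temp table
SELECT o.OrderID, o.CustomerID, o.OrderDate
INTO #StageOrders
FROM dbo.Orders AS o
WHERE o.OrderDate >= '2012-01-01';

-- index the staged set so the next join can seek instead of scan
CREATE CLUSTERED INDEX IX_StageOrders_OrderID ON #StageOrders (OrderID);

-- Stage 2: join the small staged set to the next chunk instead of the full table
SELECT s.OrderID, s.OrderDate, d.ProductID, d.Quantity
FROM #StageOrders AS s
INNER JOIN dbo.OrderDetails AS d ON d.OrderID = s.OrderID;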
Use CTEs when you can. However, my experience has been they degrade beyond a certain point. Nice for ancillary tables but not for serious volume.
Be creative in your combinations. I'll use those exists clauses in Stage 1 (reading Tables a, b and c) to only bring back the records that also exist in tables d, e and f.
A lot of the expert SQL advice is not based on VLDBs - it's based on Customer, Orders, Demographic type schemas.
Are these stored procs run natively?

Here's a good (over-simplified) example of staging:
Let's say you wanted to find all of the high-risk individuals in your city (might as well be interesting about it). You have a phone company DB (national) indexed by state, city, last name, first name, address, and an FBI DB (global) indexed by last name, first name, country, region, address. Let's say the FBI DB has multiple records for each individual due to multiple past addresses.
You could join the two DBs on the common elements and then qualify your criteria. Or...
Select RecordID from Phone as P1
Where State = 'MyState' and City = 'MyCity' and
exists (Select 1
From TheMan as M1
Where M1.Last = P1.Last and M1.First = P1.First and M1.Risk > 80)
Now I have a small record-set to qualify and a small result-set to work from. From there I can go get details. That's a good candidate for a CTE, and I could shoot a dozen holes in the logic, but it illustrates the concept. If you bring M1.Risk (a non-indexed field) into the equation with a full join, you're forcing SQL Server to plan against it in certain situations. Not necessarily here, but it becomes a problem as your logic gets more complex and subsequent non-indexed criteria come into play.
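For instance, the same idea wrapped in a CTE, still using the made-up Phone/TheMan tables and columns, might look like:

-- Stage 1 as a CTE: a small set of qualifying RecordIDs
;WITH HighRisk AS (
    SELECT P1.RecordID
    FROM Phone AS P1
    WHERE P1.State = 'MyState' AND P1.City = 'MyCity'
      AND EXISTS (SELECT 1
                  FROM TheMan AS M1
                  WHERE M1.Last = P1.Last AND M1.First = P1.First AND M1.Risk > 80)
)
-- Stage 2: go get the details only for that small record-set
SELECT P2.*
FROM HighRisk AS H
INNER JOIN Phone AS P2 ON P2.RecordID = H.RecordID;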

Related

100 Join SQL query

I'm looking to essentially have a centralized table with a number of lookup tables that surround it. The central table is going to be used to store 'Users' and the lookup tables will be user attributes, like 'Religion'. The central table will store an Id, like ReligionId, and the lookup table would contain a list of religions.
Now, I've done a lot of digging into this and I've seen many people comment saying that a UserAttribute table might be the best way to go, essentially using an EAV pattern. I'm not looking to do this. I realize that my strategy will be join-heavy and that's why I ask this question here. I'm looking for a way to optimize those joins.
If the table has 100 lookup tables, how could it be optimized to be faster than just doing a massive 100 table inner join? Some ideas come to mind like using many smaller joins, sub-selects and views. I'm open to anything, including a combination of these strategies. Again, just to note, I'm not looking to do anything that's EAV-related. I need the lookup tables for other reasons and I like normalized data.
All suggestions considered!
Optimization techniques will likely depend on the size of the center table and intended query patterns. This is very similar to what you get in data warehousing star schemas, so approaches from that paradigm may help.
For one, ensure the size of each row is absolutely as small as possible. Disk space may be cheap, but disk throughput, memory, and CPU resources are potential bottlenecks. You want small rows so the server can read them quickly and cache as much as possible in memory.
A materialized/indexed view with the joins already performed allows the joins to essentially be precomputed. This may not work well if the center table is written to a lot or is very large.
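A rough sketch of an indexed view over a couple of the joins (table and column names are hypothetical, and indexed views carry restrictions such as SCHEMABINDING, two-part names, and inner joins only):

-- the view must be schema-bound and use two-part names
CREATE VIEW dbo.vUserReligion
WITH SCHEMABINDING
AS
SELECT u.UserId, u.UserName, r.ReligionName
FROM dbo.Users AS u
INNER JOIN dbo.Religion AS r ON r.ReligionId = u.ReligionId;
GO
-- the unique clustered index is what actually materializes the view
CREATE UNIQUE CLUSTERED INDEX IX_vUserReligion ON dbo.vUserReligion (UserId);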
Anything you can do to optimize a single join should be done for all 100. Appropriate indexes based on the selectivity of the column, etc.
Depending on what kind of queries you are performing, then other techniques from data warehousing or OLAP may apply. If you are doing lots of group by's then this is likely an area to look in to. Data warehousing techniques can be applied within SQL Server with no additional tooling.
Ask yourself why so many attributes are being queried and how they are being presented? For most analysis it is not necessary to join with the lookup tables until the final step where you materialize a report, at which time you may have grouped on only a subset of columns and thus only need some of the lookup tables.
Group By's should generally be able to group on the lookup IDs without needing the text/description from the lookup table, so a join is not necessary. If your lookups have other information relevant to the query at hand, then consider denormalizing it into the central table to eliminate the join, and/or make that discrete value its own lookup, essentially splitting the existing lookup ID into another ID.
You could implement a master code table that combines the code tables into a single table with a CodeType column. This is not the same as EAV because you'd still have a column in the center table for each code type and a join for each, whereas EAV is usually used to normalize out an arbitrary number of attributes. (Note: I personally hate master code tables.)
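For illustration only, such a master code table (hypothetical names) might look like this:

CREATE TABLE dbo.Code (
    CodeId   int          NOT NULL PRIMARY KEY,
    CodeType varchar(50)  NOT NULL,   -- e.g. 'Religion', 'Department'
    CodeText varchar(100) NOT NULL
);

-- the center table still has one column, and one join, per code type
SELECT u.UserId, rel.CodeText AS Religion, dep.CodeText AS Department
FROM dbo.Users AS u
INNER JOIN dbo.Code AS rel ON rel.CodeId = u.ReligionId   AND rel.CodeType = 'Religion'
INNER JOIN dbo.Code AS dep ON dep.CodeId = u.DepartmentId AND dep.CodeType = 'Department';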
Lastly, consider normalizing the center table if you are not doing data warehousing.
Are there lots of null values in certain lookup ID columns? Is the table sparse? That is an indication that you can pull some columns out into a 1-to-0..1 relationship to reduce the size of the center table. For example, a Person table that includes address information could have a PersonAddress table pulled out of it.
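A sketch of that kind of split, with hypothetical names:

-- sparse address columns pulled out of Person into an optional 1-to-0..1 table
CREATE TABLE dbo.PersonAddress (
    PersonId int          NOT NULL PRIMARY KEY
        REFERENCES dbo.Person (PersonId),
    Street   varchar(100) NULL,
    City     varchar(50)  NULL,
    PostCode varchar(10)  NULL
);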
Partitioning the table may improve performance if there is a large number of rows and you can determine that certain rows, perhaps those with an old datetime from a couple of years in the past, would rarely be queried.
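One possible shape for that, assuming a hypothetical Sales table partitioned by date (note that table partitioning is an Enterprise Edition feature in older SQL Server versions):

-- boundary values split the data by year
CREATE PARTITION FUNCTION pfSaleYear (datetime)
AS RANGE RIGHT FOR VALUES ('2010-01-01', '2011-01-01', '2012-01-01');

CREATE PARTITION SCHEME psSaleYear
AS PARTITION pfSaleYear ALL TO ([PRIMARY]);

-- rows land in a partition based on SaleDate, so queries on recent data
-- rarely touch the older partitions
CREATE TABLE dbo.Sales (
    SaleId       int      NOT NULL,
    SaleDate     datetime NOT NULL,
    DepartmentId int      NOT NULL,
    ProductId    int      NOT NULL,
    Amount       money    NOT NULL
) ON psSaleYear (SaleDate);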
Update: See "Ask yourself why so many attributes are being queried and how they are being presented?" above. Consider a user who wants to know the number of sales grouped by year, department, and product. You should have IDs for each of these, so you can group by those IDs on the center table and, in an outer query, join the lookups for only the columns that remain. This ensures the aggregation doesn't pull in unnecessary information from lookups that aren't needed anyway.
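In rough terms, simplified to two of the dimensions and using the hypothetical Sales table from the sketch above:

-- inner query aggregates on the small integer IDs only
SELECT agg.DepartmentId, agg.ProductId, agg.SalesCount,
       d.DepartmentName, p.ProductName
FROM (
    SELECT s.DepartmentId, s.ProductId, COUNT(*) AS SalesCount
    FROM dbo.Sales AS s
    GROUP BY s.DepartmentId, s.ProductId
) AS agg
-- lookups are joined only after the aggregation, against the much smaller result
INNER JOIN dbo.Department AS d ON d.DepartmentId = agg.DepartmentId
INNER JOIN dbo.Product    AS p ON p.ProductId    = agg.ProductId;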
If you aren't doing aggregations, then you probably aren't querying large numbers of records at a time, so join performance is less of a concern and should be taken care of with appropriate indexes.
If you're querying large numbers of records at a time and pulling in all of this information, I'd look hard at the business case. No one sits down at their desk, opens a report with a million rows and 100 columns in it, and does anything meaningful with all of that data that couldn't be accomplished in a better way.
The only case for such a query would be a dump of all data intended for export to another system, in which case performance is less of a concern since the job can be scheduled overnight.
Since you are set on your approach, you could consider duplicating data in order to join fewer times, similar to what is done in an OLAP database.
http://en.wikipedia.org/wiki/OLAP_cube
With that said I don't think this is the best way to do it if you have 100 properties.
Have you tried exporting it to Microsoft Excel PowerPivot with Power Query? You can do fast data analysis, with some pretty awesome ways to present it using Power View.

Multiple single field indexes vs multiple-fields indexes

I know there are similar questions on StackOverflow, but after testing different indexes on my tables, I think I don't quite understand how indexes work and I'd like it if someone could explain the behavior I'm experiencing on my queries' performance.
I'm using this query as an example, I'm going to try to explain it in detail:
SELECT ss1.PlayerID, ss1.Name, ss1.Series, ss1.LanesNum, ss1.Date, ss1.LeagueName, ss1.Season FROM SeriesScores ss1
JOIN (SELECT Series, Gender, LanesNum, Bowlout, Season FROM SeriesScores
WHERE Gender = ? AND LanesNum = ? AND Series > -1 AND Bowlout = 'No' AND Season = '2011-2012'
ORDER BY Series DESC LIMIT 0,?) as ss2
USING(series, gender, lanesNum, bowlout, season)
ORDER BY ss1.Series DESC
This query is used to get the highest series bowled in a given season for each pair of lanes in a bowling center for both male and female players.
I'm joining the table on itself instead of using the MAX aggregate function because if there's a tie on a given pair of lanes, I want all the names to come up.
Basically, I join all the fields that match what the inner SELECT returns. That inner SELECT returns the top X players for a given gender and a given pair of lanes.
The USING part makes sure only the players that haven't bowled out, with the same gender, series, lanesNum and season as I'm looking for get selected. I then order them by highest series to lowest series.
This query is in a for loop, which gets run 12 times for men and 12 times for women (12 pairs of lanes in the bowling center), with only the lanesNum and gender parameters changing.
I then put all the results in two different vectors in Java to display the results in an application (one vector for men, one for women).
Without any indexes whatsoever, it takes around 11 seconds to run everything including putting the results in a vector and all of that. (5.5 seconds for the 12 queries for men, same for women).
With an index on (gender, lanesNum, series), it takes 0.04 seconds for the whole thing, which is amazing, since that's a more than acceptable speed for my needs.
I used that index because those are all the most important fields I'm using in my WHERE clause, but I don't get why it speeds things up that much, because I tried other things and using some other indexes actually made my queries SLOWER by more than 100%. Also, I'm wondering if I would get an even faster query if I added "bowlout" and "season" to that index.
I wanted to try a single column index on series first and test performance. That's the index that made all of those queries take a total of 22 seconds.
I came to the conclusion that I don't understand where I should be using my indexes and when I should be using them on multiple fields, or using multiple indexes on single fields, etc. Also, I don't understand how using (the wrong) indexes can actually make performance worse.
Optimizing an index too aggressively for just one query runs the risk of slowing down other queries (and thus a real-world application, or the next version of it). However, let us do exactly that as an exercise in analysing index performance.
Indexes influence query performance in multiple ways; their existence can actually completely change the algorithm that the database server will use to get to the data. A nice overview is here, but as your query is simple, and you actually have very few relevant indexes in your database (the one you see, and also automatically created indexes to support the primary keys of your tables) we can simplify the story greatly.
A good index makes it faster to cross reference the data between the tables. Ideally it contains columns in your USING and WHERE clauses, and enough of them to reference a unique row in its table most of the time. If it contains less, it may still be used by the database server, but the remaining rows will have to be visited one by one.
A great index not only does all that, but it also contains all the data you will be selecting from the table (yes, this makes sense even when the two tables are actually the same physical table due to the self-join; the database server still processes them as if they were two different tables, incidentally with the same data). The benefit of such a "fully covering index" is that the database server does not have to visit its table at all; all the columns are available in the index.
Order of columns in the index matters. It is especially essential that the leftmost column in the index appears in the USING clause, or WHERE clause; otherwise the index is pretty much unusable as matching data for a single lookup can appear in many locations in that index. It should also be highly selective (have many different values in the table). Do a few experiments now to see this first hand.
For this reason, the first choice index I'd suggest to you would be series, gender, lanesNum, bowlout; but yours is also a very good one for this query.
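For example, that first-choice index could be created as follows (table and column names taken from the query above; the index name is arbitrary):

-- the suggested composite index, leading with the ordered, selective column
CREATE INDEX idx_series_gender_lanes_bowlout
    ON SeriesScores (Series, Gender, LanesNum, Bowlout);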
There is not much use in creating more than one index explicitly. There is basically no use for more than one of them during query execution, because your query is so simple. So the most useful one will supposedly win and all the others will be ignored.
To your last question: some people believe that superfluous indexes only slow down UPDATE, INSERT and DELETE statements (because these carry the overhead to update the indexes), but it is not that simple. As the database server considers multiple algorithms to compute your query (there are two logical tables to start from and automatic and explicit indexes to use, or not to use), it may choose the wrong plan: an index may look seductive without knowing the data distribution in the table, but be very counterproductive given the distribution.
There is actually a way to let the database server analyze the data and record some statistics that will greatly help it optimize your subsequent queries reasonably, and probably avoid any 22-second executions of your query (until you change your data so much that the statistics no longer hold true). That is the ANALYZE command. Issue it every time after you change your indexes to see the subsequent SQLite performance at its best. In a production database, schedule ANALYZE to execute every night, so that your database does not gradually slow down over time, or abruptly after adding a harmless, useless index.
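For example, after changing indexes:

-- recompute statistics so the query planner can judge the indexes properly
ANALYZE;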

SQL: multiple queries vs joins (specific case)

This question seems to have been asked a lot, and the answer seems to be "it depends on the details". So I am asking for my specific case: is it better for me to have multiple queries or to use joins?
The details are as follows:
"products" table -- probably around 2000 rows, 15 or so columns
"tags" table -- probably around 10 rows, 3 columns
"types" table -- probably around 10 rows, 3 columns
I need the "tags" and "types" tables to get the tag/type ID that is in the products table.
My gut says that if I join the tables I end up searching a much, much larger set, so it's better to do multiple queries, but I am not really sure...
Thoughts?
No, a join will probably outperform multiple queries. Your tables are extremely small.
Ask yourself what extra work would be involved in doing multiple queries... I don't know what you need this data for, but I assume you would, at some point, need to correlate the results - match Tags and Types to Products, wouldn't you? If you don't do that with a join, you just have to do it elsewhere with some other mechanism.
Further, your conception of this overlooks the fact that databases are designed for join scenarios. If you perform three isolated queries, the database has no opportunity to optimize its querying behavior across the results you're looking for. If you do it in one query with a join, it does have that opportunity.
In my opinion, leave the problem of producing and filtering a result set over ~2000 * 10 * 10 records to the database - that's what it's good at. :)
The amount of data here is too small to demonstrate one approach over the other, but multiple separate queries will cost more in terms of transfer over the wire than a single query. There is packet overhead, and separate result sets risk being inconsistent if the data changes between queries and they are not run in the same transaction.
JOINs specifically might not be necessary; EXISTS or IN can be used if the supporting tables don't expose columns in the result set. A JOIN between parent and child tables, where a parent can have more than one child, will inflate the rows searched, though not necessarily the rows returned.
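For example, using the question's tables (column names assumed), if a tag is only needed as a filter and not in the result set:

-- no join needed: tags is used purely to qualify rows
SELECT p.*
FROM products AS p
WHERE EXISTS (SELECT 1
              FROM tags AS t
              WHERE t.id = p.tag_id
                AND t.name = 'clearance');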
Assuming that everything has indexes on the primary keys (you should be doing that), then joins will be very efficient. The only case where joins would be worse is if you had some kind of external caching of query results (as some ORMs will do for you), your products table was much bigger, and you were querying at a sufficient rate to keep the results of the two smaller queries (but not the third) in cache. In that scenario multiple queries becomes faster because you're only making one of the three queries. But the difference is going to be hard to measure.
If the database is not on localhost but accessed over a network it's better to send one request, let the database do the work and retrieve the data at once. This will give you less network delay. So joins are preferred.

Which is more efficient : 2 single table queries or 1 join query

Say tableA has 1 row to be returned but will have 100 columns returned while tableB has 100 rows to be returned but only one column from each. TableB has a foreign key for table A.
Will a left join of tableA to tableB return 100*100 cells of data, while 2 separate queries return 100 + 100 cells of data (50 times less), or is that a misunderstanding of how it works?
Is it ever more efficient to use many simple queries rather than fewer more complex ones?
First and foremost, I would question a table with 100 columns, and suggest that there is possibly a better design for your schema. In the real world, this number of columns is less common, so typically the difference in the amount of data returned with one query vs. two becomes less significant. 100 columns in a table is not necessarily bad, just a flag that it should be considered.
However, assuming your numbers are what they are to make clear the question, there are a few important variables to consider:
1 - What is the speed of the link between the DB server and the application server? If it is very slow, then you are probably better off minimizing the amount of data returned vs. the number of queries you run. If it is not slow, then you will likely spend more time executing two queries than you would returning the increased payload. Which is better can only be determined by testing in your own environment.
2 - How efficient is the transport protocol itself? Perhaps there is some kind of compression of the data, or an even more clever algorithm that knows columns 2 through 101 are duplicated for every row, so it only passes them once. Strategies like this in the transport protocol would mitigate any of your concerns. Again, this is why you need to test in your own environment to know for sure.
As others have pointed out, you also need to consider what will be done with the data once you get it (e.g., JOINs, GROUPing, etc), but I am limiting my response to the specifics of your question around query count vs. payload size.
What is best at joining? A database engine or client code? That said, I use both techniques: it depends on the client and how the data will be used.
Where the data requires some processing to, say, render on a web page, I'd probably split header and detail recordsets. We use this approach because we have some business logic between the DB and the HTML.
Where the data is consumed simply and linearly, I'd join in the database to avoid unnecessary processing: for example, simple reports or exports.
It depends. If you only take into account SQL efficiency, obviously several simpler and smaller queries will be more efficient.
But you need to take into account the whole process: if the join would otherwise be done on the client, or you need to filter the results after the join, then the DBMS will probably be more efficient than doing it in your code.
Coding is always a tradeoff between different systems - DB vs. client, RAM vs. CPU... You need to be conscious of this and try to find the best solution.
In this case 2 queries probably outperform 1, but that is not a general rule.
I think that your question basically is about database normalization. In general, it is advisable to normalize a database into multiple tables (using primary and foreign keys) and to join them as needed upon queries. This is better for insert/update performance and for keeping the data consistent, and usually results in smaller database sizes as well.
As for the row numbers returned, only a cross join would actually return 100*100 rows; any inner or outer join will not create all combinations, but rather tie together rows on the given conditions, and for outer joins preserve rows which could not be matched. Wikipedia has some samples in its JOIN article.
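To make that concrete for the scenario in the question (the foreign key and column names here are assumed):

-- with 1 row in tableA and 100 matching rows in tableB, this returns 100 rows,
-- not 100 * 100; tableA's columns are simply repeated on each of those rows
SELECT a.*, b.value
FROM tableA AS a
LEFT JOIN tableB AS b ON b.tableA_id = a.id;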
For very query-intense applications, performance may be better with less normalized tables. However, as always with optimizations, I'd only consider going in that direction after seeing real, measurable problems (e.g. with a profiling tool).
In general, try to keep the number of roundtrips to the database low; a large number of single simple queries will suffer from the overhead of talking to the DB engine (network etc.). If you need to execute complex series of statements, consider using stored procedures.
Generally, fewer queries make for better performance, as long as the queries return data that is actually related. There is no point in trying to put unrelated data into the same query just to reduce the number of queries.
There are of course exceptions, and your example may be one of them. However, it depends on more than the number of fields returned, such as what the fields actually contain, i.e. the actual amount of data.
As an example of how the number of queries affects performance, I can mention a solution that I have (sadly enough) seen many times. In that solution the programmer would first get a number of records from one table, then loop through the records and run another query for each record to get the related records from another table. This clearly results in a lot of queries, and a solution having either one or two queries would be much more efficient.
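A sketch of that anti-pattern versus a single joined query, with hypothetical tables:

-- the loop-per-record pattern: one query, then one more query per row returned
SELECT OrderId, CustomerId FROM Orders;
-- ...and then, for each OrderId in the loop:
SELECT ProductId, Quantity FROM OrderLines WHERE OrderId = @OrderId;

-- the equivalent single query
SELECT o.OrderId, o.CustomerId, l.ProductId, l.Quantity
FROM Orders AS o
INNER JOIN OrderLines AS l ON l.OrderId = o.OrderId;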
"Is it ever more efficient to use many simple queries rather than fewer more complex ones?"
The query that traverses the least amount of data and gives you no more than what you need is the more efficient one. Beyond this, there can be RDBMS-specific conditions that make one approach more efficient on one system than on another. At a very low level, when you deal with less data, your results can be retrieved much more quickly, so efficient queries are the ones that work with the least amount of data needed to get the result you are looking for.

TSQL Join efficiency

I'm developing an ASP.NET/C#/SQL application. I've created a query for a specific grid-view that involves a lot of joins to get the data needed. On the hosted server, the query has randomly started taking up to 20 seconds to process. I'm sure it's partly an overloaded host-server (because sometimes the query takes <1s), but I don't think the query (which is actually a view reference via a stored procedure) is at all optimal regardless.
I'm unsure how to improve the efficiency of the below query:
(There are about 1500 matching records to those joins, currently)
SELECT dbo.ca_Connections.ID,
dbo.ca_Connections.Date,
dbo.ca_Connections.ElectricityID,
dbo.ca_Connections.NaturalGasID,
dbo.ca_Connections.LPGID,
dbo.ca_Connections.EndUserID,
dbo.ca_Addrs.LotNumber,
dbo.ca_Addrs.UnitNumber,
dbo.ca_Addrs.StreetNumber,
dbo.ca_Addrs.Street1,
dbo.ca_Addrs.Street2,
dbo.ca_Addrs.Suburb,
dbo.ca_Addrs.Postcode,
dbo.ca_Addrs.LevelNumber,
dbo.ca_CompanyConnectors.ConnectorID,
dbo.ca_CompanyConnectors.CompanyID,
dbo.ca_Connections.HandOverDate,
dbo.ca_Companies.Name,
dbo.ca_States.State,
CONVERT(nchar, dbo.ca_Connections.Date, 103) AS DateView,
CONVERT(nchar, dbo.ca_Connections.HandOverDate, 103) AS HandOverDateView
FROM dbo.ca_CompanyConnections
INNER JOIN dbo.ca_CompanyConnectors ON dbo.ca_CompanyConnections.CompanyID = dbo.ca_CompanyConnectors.CompanyID
INNER JOIN dbo.ca_Connections ON dbo.ca_CompanyConnections.ConnectionID = dbo.ca_Connections.ID
INNER JOIN dbo.ca_Addrs ON dbo.ca_Connections.AddressID = dbo.ca_Addrs.ID
INNER JOIN dbo.ca_Companies ON dbo.ca_CompanyConnectors.CompanyID = dbo.ca_Companies.ID
INNER JOIN dbo.ca_States ON dbo.ca_Addrs.StateID = dbo.ca_States.ID
It may have nothing to do with your query and everything to do with the data transfer.
How fast does the query run in query analyzer?
How does this compare to the web page?
If you are bringing back the entire data set you may want to introduce paging, say 100 records per page.
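SQL Server only gained OFFSET/FETCH in 2012, so on earlier versions paging is usually done with ROW_NUMBER(); a rough sketch over the query above (column list shortened for brevity):

SELECT *
FROM (
    SELECT ROW_NUMBER() OVER (ORDER BY dbo.ca_Connections.Date DESC) AS RowNum,
           dbo.ca_Connections.ID,
           dbo.ca_Connections.Date,
           dbo.ca_Addrs.Street1,
           dbo.ca_Addrs.Suburb
    FROM dbo.ca_CompanyConnections
    INNER JOIN dbo.ca_CompanyConnectors ON dbo.ca_CompanyConnections.CompanyID = dbo.ca_CompanyConnectors.CompanyID
    INNER JOIN dbo.ca_Connections ON dbo.ca_CompanyConnections.ConnectionID = dbo.ca_Connections.ID
    INNER JOIN dbo.ca_Addrs ON dbo.ca_Connections.AddressID = dbo.ca_Addrs.ID
) AS paged
WHERE paged.RowNum BETWEEN 1 AND 100;   -- page 1, 100 records per page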
The first thing I normally suggest is profiling to look for potential indexes that could help. But when the problem is sporadic like this, and the normal case is for the query to run in under a second, it's more likely due to lock contention than a missing index. That means the cause is something else in the system making this query take longer. Perhaps an insert or update. Perhaps another select query, one that you would normally expect to take a little longer, so the extra time on its end isn't noticed.
I would start with indexing, but I have a database that is a third-party application. Creating my own indexes is not an option. I read an article (sorry, can't find the reference) recommending breaking up the query into table variables or temp tables (depending on number of records) when you have multiple tables in your query (not sure what the magic number is).
Start with dbo.ca_CompanyConnections, dbo.ca_CompanyConnectors, and dbo.ca_Connections. Include the fields you need, and then substitute the temp table for those three joined tables.
Not sure what the exact issue is (I'd like to hear recommendations), but it seems like performance starts to drop once you get over about 5 tables.
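A rough sketch of that approach against the query above (keeping only a few of the columns for brevity):

-- stage the first three joined tables into a temp table
SELECT dbo.ca_Connections.ID,
       dbo.ca_Connections.Date,
       dbo.ca_Connections.AddressID,
       dbo.ca_CompanyConnectors.ConnectorID,
       dbo.ca_CompanyConnectors.CompanyID
INTO #ConnStage
FROM dbo.ca_CompanyConnections
INNER JOIN dbo.ca_CompanyConnectors ON dbo.ca_CompanyConnections.CompanyID = dbo.ca_CompanyConnectors.CompanyID
INNER JOIN dbo.ca_Connections ON dbo.ca_CompanyConnections.ConnectionID = dbo.ca_Connections.ID;

-- then join the small staged set to the remaining tables
SELECT s.ID, s.Date, s.ConnectorID, a.Street1, a.Suburb, st.State, c.Name
FROM #ConnStage AS s
INNER JOIN dbo.ca_Addrs AS a ON s.AddressID = a.ID
INNER JOIN dbo.ca_Companies AS c ON s.CompanyID = c.ID
INNER JOIN dbo.ca_States AS st ON a.StateID = st.ID;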