I came across this very odd situation, and I thought I would throw it out to the crowd to find out the WHY.
I have a query that was joining a table on a linked server:
select a.*, b.phone
from table_a a
join remote.table_b b on b.id = a.id
(lots of data on A, but very few on B)
This query was taking forever (I never even found out the actual run time), and that is when I noticed B had no index on id, so I added one, but that didn't fix the issue. Finally, out of desperation, I tried:
select a.*, b.phone
from table_a a
join (select id, phone from remote.table_b) as b on b.id = a.id
This version of the query, in my mind at least, should return the same results, but lo and behold, it's responding immediately!
Any ideas why one would hang and the other process quickly? And yes, I did wait to make sure the index had been built before running both.
It's because sometimes (very often, in fact) the execution plans automatically generated by the SQL Server engine are not as good as we would like. Look at the execution plan in both situations. I suggest using a join hint in the first query, something like INNER MERGE JOIN.
Here is some more information about that:
http://msdn.microsoft.com/en-us/library/ms181714.aspx
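For illustration, a hedged sketch of that hint applied to the first query (table names are taken from the question; in practice the linked-server table usually needs its full four-part name):
-- forces a merge join instead of whatever plan the optimizer chose
select a.*, b.phone
from table_a a
inner merge join remote.table_b b on b.id = a.id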
For linked servers, the 2nd variant prefetches all the data locally and does the join there, whereas the 1st variant may do a nested-loop join with a round trip to the linked server for every row in A.
Remote table, as in not on that server? Is it possible that the join is actually making multiple calls out to the remote table while the subquery is making a single request for a copy of the table data, thus resulting in less time spent waiting on the network?
I'm just going to have a guess here. When you access remote.b, is it a table on another server?
If it is, the reason the second query is faster is that you do one query to the other server and get all the fields you need from b before processing the data. In the first query, you are processing data and at the same time making several requests to the other server.
Hope this helps you.
This is my first question here, please be gentle.
At work, I inherited responsibility for a MS Access database, which is crucial for my department.
That database has grown over 20 years, with things added, removed and changed. In short, it's a convoluted mess. The VBA code contains great stuff like this, I kid you not:
Dim p, strText, A, B, C, d, E, F, G, H, i, j, K, L, M, N, O, Z, R, Q, kd, AfGb, T, LN, DC, EntBez, TP, pack, Press, Fehler, ksoll, Y, zeileninhalt, dateipfad, auslesezeile As String
I'm slowly cleaning it all up, but... anyways:
The Problem
It is slow when opening some forms (7-10 seconds loading time). I was able to narrow it down to the recordsource of these forms, which all use basically the same query or a variation of it.
The user enters a job number in the Main form and hits enter. The underlying query then pulls data from two tables based on the unique key JobNr. The result is a single row containing all the info for this job. This info is displayed in an Editor form, which uses the query as its recordsource.
The database is split into frontend and backend, t1 and t2 are backend tables each with about 20k entries. Backend sits somewhere on the company servers, frontend is saved locally on each user computer.
This is the query:
SELECT *
FROM t1
INNER JOIN t2 ON t1.JobNr = t2.JobNr
WHERE t1.JobNr = [Forms]![Main]![JobNr];
t1 has JobNr as its primary key; t2 has an ID as its primary key, and its JobNr column is not indexed. I want to try indexing it in the hope of better performance, but I currently can't make changes to the backend during busy work days...
This simple query is stupidly slow for what it is. The problem seems to be the order of execution. Instead of getting the single entries from t1 and t2 and joining them into a single record, Access seems to first join both friggin tables in their entirety and only after that look up the single record the user is interested in.
I was not able to find a solution to dictate the execution order. I tried different ways, like rewriting the SQL code with nested Selects, something like:
SELECT *
FROM
(SELECT * FROM t1
WHERE t1.JobNr = [Forms]![Main]![JobNr]) AS q1
INNER JOIN
(SELECT * FROM t2
WHERE t2.JobNr = [Forms]![Main]![JobNr]) AS q2 ON q1.JobNr = q2.JobNr;
Still slow...
I wanted to try WITH to partition the SQL code, but that's apparently not supported by MS Access SQL.
I tried splitting the query into two queries q1 and q2 in Access that pull the data from t1 and t2 respectively, with a third query q3 that joins these supposed subsets... to no avail. q1 and q2 individually run blazingly fast with the expected results, but q3 takes the usual 7-10 seconds.
The current approach I'm working on is running q1 and q2 and saving the acquired data to two temp tables tq1 and tq2, then joining those in a last query. This works very well in that it rapidly loads the data and displays it in the editor (< 0.5 seconds, hurray!). The problem I'm facing now is getting any changes the user makes in the Editor form back into the backend tables t1 and t2... Right now, user changes don't stick and are lost when closing and reopening the job/editor.
Soooo, what am I missing/doing wrong? Is there any way to make this INNER JOIN query fast without the whole temp table workaround?
If not, how would I go about updating the backend tables from the local temp tables? Changes in the Editor are saved in the temp tables until overwritten by reopening the editor.
I already added intermediary queries that add the respective primary keys to the temp tables (this cannot be done directly in the Create Table queries...) but...
I also tried using an Update query when closing the Editor, which doesn't seem to work either, but I might have to debug that one; I'm not sure it even does anything right now...
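For reference, the kind of update query I mean is roughly this (SomeField is just a stand-in for the actual editable fields):
UPDATE t1 INNER JOIN tq1 ON t1.JobNr = tq1.JobNr
SET t1.SomeField = tq1.SomeField;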
Sorry for the long text!
Kind regards and thanks for any help in advance!
The most obvious rework is to move the filter into the join:
SELECT *
FROM t1
INNER JOIN t2 ON (t1.JobNr = t2.JobNr AND t2.JobNr = [Forms]![Main]![JobNr])
My guess is that it's irrelevant whether you filter on t1 or t2, but then my guess would also have been that Access is smart enough to filter while joining, and that appears to be untrue, so check both.
For more detailed performance analysis, a query plan tends to help. See How to get query plans (showplan.out) from Access 2010?
Of course, adjust 14 to your version number.
You need to add a unique index to t2.JobNr, even better make it the primary key.
Everything else is just a waste of time at this point.
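If it helps, the DDL is roughly this (the index name is just an example):
CREATE UNIQUE INDEX idxJobNr ON t2 (JobNr);
Run it once against the backend file when you have exclusive access.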
Set a date and time for the users to quit their frontends, kick them out if necessary: Force all users to disconnect from 2010 Access backend database
In the long run, moving from an Access backend to a server backend (like the free SQL Server Express) will be a good idea.
Edit: have you tried what happens if you don't do JOIN at all?
SELECT *
FROM t1, t2
WHERE t1.JobNr = [Forms]![Main]![JobNr]
AND t2.JobNr = [Forms]![Main]![JobNr]
Normally you want to avoid this, but it might help in this case.
There is a stored procedure that needs to be modified to eliminate a call to another server.
What is the easiest and most feasible way to do this so that the final SP executes faster? Preference goes to solutions that do not involve much change to the application.
Eg:
select *
from dbo.table1 a
inner join server2.dbo.table2 b on a.id = b.id
Cross-server JOINs can be problematic, as the optimiser doesn't always pick the most effective plan, which may even result in the entire remote table being dragged over your network to be queried for a single row.
Replication is by far the best option, if you can justify it. This will mean you need to have a primary key on the table you want to replicate, which seems a reasonable constraint (ha!), but might become an issue with a third-party system.
If the remote table is small then it might be better to take a temporary local copy, e.g. SELECT * INTO #temp FROM server2.<database>.dbo.table2;. Then you can change your query to something like this: select * from dbo.table1 a inner join #temp b on a.id = b.id;. The temporary table will be dropped automatically when your session ends, so there's no need to tidy up after yourself.
If the table is larger then you might want to do the above, but also add an index to your temporary table, e.g. CREATE INDEX ix$temp ON #temp (id);. Note that if you use a named index then you will have issues if you run the same procedure twice simultaneously, as the index name won't be unique. This isn't a problem if the execution is always in series.
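Putting those two steps together, a rough sketch using the names above (the <database> placeholder still needs to be filled in):
-- pull the remote table across once
SELECT *
INTO #temp
FROM server2.<database>.dbo.table2;

-- index the local copy
CREATE INDEX ix$temp ON #temp (id);

-- then join locally instead of across the linked server
SELECT *
FROM dbo.table1 a
INNER JOIN #temp b ON a.id = b.id;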
If you have a small number of ids that you want to include, then OPENQUERY might be the way to go, e.g. SELECT * FROM OPENQUERY(server2, 'SELECT * FROM table2 WHERE id IN (''1'', ''2'')');. The advantage here is that you are now running the query on the remote server, so it's more likely to use an efficient query plan.
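As a hedged example of plugging that back into the original join (the id list is purely illustrative):
SELECT a.*
FROM dbo.table1 a
INNER JOIN OPENQUERY(server2, 'SELECT id FROM table2 WHERE id IN (''1'', ''2'')') b
    ON a.id = b.id;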
The bottom line is that if you expect to be able to JOIN a remote and local table then you will always have some level of uncertainty; even if the query runs well one day, it might suddenly decide to run a LOT slower the following day. Small things, like adding a single row of data to the remote table, can completely change the way the query is executed.
My boss wants me to do a join on three tables, let's call them tableA, tableB, tableC, which have respectively 74M, 3M and 75M rows.
In case it's useful, the query looks like this:
SELECT A.*,
C."needed_field"
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
ON B."field_join_BC" = C."field_join_BC"
When trying the query on Dataiku Data Science Studio + Vertica, it seems to create temporary data to produce the output, which fills up the 1 TB of space on the server, bloating it.
My boss doesn't know much about SQL, so he doesn't understand that in the worst-case scenario it can produce a table with 74M * 3M * 75M ≈ 1.7 * 10^22 rows, which may be the problem here (and I'm brand new and don't know the data yet, so I don't know whether the query is likely to produce that many rows or not).
Therefore I would like to know whether there is a way of knowing beforehand how many rows will be produced. For instance, if I did a COUNT() such as this:
SELECT COUNT(*)
FROM "tableA" A
INNER JOIN (SELECT "field_join_AB", "field_join_BC" FROM "tableB") B
ON A."field_join_AB" = B."field_join_AB"
INNER JOIN (SELECT "field_join_BC", "needed_field" FROM "tableC") C
ON B."field_join_BC" = C."field_join_BC"
Does the underlying engine produce the whole dataset and then count it? (That would mean I can't count it beforehand, at least not that way.)
Or is it possible that a COUNT() gives me a result quickly because it doesn't build the dataset but works it out some other way?
(NB: I am currently testing it, but the count has been running for 35 minutes now.)
Vertica is a columnar database. Any query you do only needs to look at the columns required to resolve output, joins, predicates, etc.
Vertica also is able to query against encoded data in many cases, avoiding full materialization until it is actually needed.
Counts like that can be very fast in Vertica. You don't really need to jump through hoops; Vertica will only include columns that are actually used. The optimizer won't try to reconstitute the entire row, only the columns it needs.
What's probably happening here is that you have hash joins with rebroadcasting. If your underlying projections do not line up, your sorts are different, and you are joining multiple large tables together, just the join itself can be expensive, because it has to load everything into hash tables and do a lot of network rebroadcasting of the data to get the joins to happen on the initiator node.
I would consider running the Database Designer (DBD) using these queries as input, especially if these are common query patterns. If you haven't run DBD at all yet and are not using custom projections, then your default projections will likely not perform well and will cause the situation I mention above.
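If you do end up creating custom projections, a rough, hedged illustration of a join-friendly projection might look like this (the projection name and column list are placeholders, not your actual schema):
CREATE PROJECTION tableA_join_p AS
SELECT "field_join_AB"  -- plus whatever other columns the query needs
FROM "tableA"
ORDER BY "field_join_AB"
SEGMENTED BY HASH("field_join_AB") ALL NODES;
Sorting and segmenting both sides of a join on the join key is what lets Vertica do a local merge join instead of a resegmented hash join.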
You can do an explain to see what's going on.
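For example (this is just the count query from the question, written with direct joins for brevity and prefixed with EXPLAIN, so nothing is executed):
EXPLAIN
SELECT COUNT(*)
FROM "tableA" A
INNER JOIN "tableB" B ON A."field_join_AB" = B."field_join_AB"
INNER JOIN "tableC" C ON B."field_join_BC" = C."field_join_BC";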
I have a case where using a JOIN or an IN will give me the correct results... Which typically has better performance and why? How much does it depend on what database server you are running? (FYI I am using MSSQL)
Generally speaking, IN and JOIN are different queries that can yield different results.
SELECT a.*
FROM a
JOIN b
ON a.col = b.col
is not the same as
SELECT a.*
FROM a
WHERE col IN
(
SELECT col
FROM b
)
, unless b.col is unique.
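To make that concrete, here is a tiny made-up example (tables and data are invented purely for illustration) where b.col is not unique:
CREATE TABLE a (col INT);
CREATE TABLE b (col INT);
INSERT INTO a VALUES (1);
INSERT INTO b VALUES (1), (1);

SELECT a.* FROM a JOIN b ON a.col = b.col;            -- returns the single row of a twice
SELECT a.* FROM a WHERE col IN (SELECT col FROM b);   -- returns it once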
However, this is equivalent to the second query (the IN form):
SELECT a.*
FROM a
JOIN (
SELECT DISTINCT col
FROM b
) AS b
ON b.col = a.col
If the joining column is UNIQUE and marked as such, both these queries yield the same plan in SQL Server.
If it's not, then IN is faster than JOIN on DISTINCT.
See this article in my blog for performance details:
IN vs. JOIN vs. EXISTS
This thread is pretty old but is still mentioned often. For my personal taste it is a bit incomplete, because there is another way to ask the database, using the EXISTS keyword, which I have found to be faster more often than not.
So if you are only interested in values from table a you can use this query:
SELECT a.*
FROM a
WHERE EXISTS (
SELECT *
FROM b
WHERE b.col = a.col
)
The difference might be huge if col is not indexed, because the db does not have to find all records in b which have the same value in col; it only has to find the very first one. If there is no index on b.col and there are a lot of records in b, a table scan might be the consequence. With IN or a JOIN this would be a full table scan; with EXISTS it would be only a partial table scan (until the first matching record is found).
If there are lots of records in b with the same col value, you will also waste a lot of memory reading all these records into a temporary space just to find that your condition is satisfied. With EXISTS this can usually be avoided.
I have often found EXISTS faster than IN even if there is an index. It depends on the database system (the optimizer), the data and, last but not least, on the type of index which is used.
That's rather hard to say - in order to really find out which one works better, you'd need to actually profile the execution times.
As a general rule of thumb, I think if you have indices on your foreign key columns, and if you're using only (or mostly) INNER JOIN conditions, then the JOIN will be slightly faster.
But as soon as you start using OUTER JOIN, or if you're lacking foreign key indexes, the IN might be quicker.
Marc
An interesting writeup on the logical differences: SQL Server: JOIN vs IN vs EXISTS - the logical difference
I am pretty sure that, assuming the relations and indexes are maintained, a JOIN will perform better overall (more effort goes into optimizing that operation than others). If you think about it conceptually, it's the difference between 2 queries and 1 query.
You need to hook it up to the Query Analyzer and try it and see the difference. Also look at the Query Execution Plan and try to minimize steps.
This depends on each database's implementation, but you can probably guess that they all solve common problems in more or less the same way. If you are using MSSQL, have a look at the execution plan that is generated. You can do this by turning on the profiler and execution plans. This will give you a text version when you run the command (see the example at the end of this answer).
I am not sure what version of MSSQL you are using, but you can get a graphical one in SQL Server 2000 in Query Analyzer. I am sure that this functionality is lurking somewhere in SQL Server Management Studio in later versions.
Have a look at the execution plan. As far as possible, avoid table scans, unless of course your table is small, in which case a table scan is faster than using an index. Read up on the different join operations that each scenario produces.
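If it helps, one hedged way to get that text version of the plan in Query Analyzer or Management Studio is to wrap the query like this (the query itself is not executed while the option is on):
SET SHOWPLAN_TEXT ON;
GO
SELECT a.* FROM a JOIN b ON a.col = b.col;
GO
SET SHOWPLAN_TEXT OFF;
GO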
The optimizer should be smart enough to give you the same result either way for normal queries. Check the execution plan and they should give you the same thing. If they don't, I would normally consider the JOIN to be faster. All systems are different, though, so you should profile the code on your system to be sure.
I'm developing an ASP.NET/C#/SQL application. I've created a query for a specific grid-view that involves a lot of joins to get the data needed. On the hosted server, the query has randomly started taking up to 20 seconds to process. I'm sure it's partly an overloaded host server (because sometimes the query takes < 1 s), but I don't think the query (which is actually a view referenced via a stored procedure) is optimal regardless.
I'm unsure how to improve the efficiency of the below query:
(There are currently about 1,500 records matching those joins.)
SELECT dbo.ca_Connections.ID,
dbo.ca_Connections.Date,
dbo.ca_Connections.ElectricityID,
dbo.ca_Connections.NaturalGasID,
dbo.ca_Connections.LPGID,
dbo.ca_Connections.EndUserID,
dbo.ca_Addrs.LotNumber,
dbo.ca_Addrs.UnitNumber,
dbo.ca_Addrs.StreetNumber,
dbo.ca_Addrs.Street1,
dbo.ca_Addrs.Street2,
dbo.ca_Addrs.Suburb,
dbo.ca_Addrs.Postcode,
dbo.ca_Addrs.LevelNumber,
dbo.ca_CompanyConnectors.ConnectorID,
dbo.ca_CompanyConnectors.CompanyID,
dbo.ca_Connections.HandOverDate,
dbo.ca_Companies.Name,
dbo.ca_States.State,
CONVERT(nchar, dbo.ca_Connections.Date, 103) AS DateView,
CONVERT(nchar, dbo.ca_Connections.HandOverDate, 103) AS HandOverDateView
FROM dbo.ca_CompanyConnections
INNER JOIN dbo.ca_CompanyConnectors ON dbo.ca_CompanyConnections.CompanyID = dbo.ca_CompanyConnectors.CompanyID
INNER JOIN dbo.ca_Connections ON dbo.ca_CompanyConnections.ConnectionID = dbo.ca_Connections.ID
INNER JOIN dbo.ca_Addrs ON dbo.ca_Connections.AddressID = dbo.ca_Addrs.ID
INNER JOIN dbo.ca_Companies ON dbo.ca_CompanyConnectors.CompanyID = dbo.ca_Companies.ID
INNER JOIN dbo.ca_States ON dbo.ca_Addrs.StateID = dbo.ca_States.ID
It may have nothing to do with your query and everything to do with the data transfer.
How fast does the query run in query analyzer?
How does this compare to the web page?
If you are bringing back the entire data set you may want to introduce paging, say 100 records per page.
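For example, a hedged sketch of server-side paging with ROW_NUMBER (assuming SQL Server 2005 or later; the ordering column, page size and single-table form are just placeholders, and the same pattern wraps the full view):
WITH numbered AS (
    SELECT c.ID,
           c.Date,
           c.HandOverDate,
           ROW_NUMBER() OVER (ORDER BY c.Date DESC) AS rn
    FROM dbo.ca_Connections AS c
)
SELECT *
FROM numbered
WHERE rn BETWEEN 1 AND 100;   -- first page of 100 records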
The first thing I normally suggest is to profile to look for potential indexes to help out. But when the problem is sporadic like this and the normal case is for the query to run in < 1 sec, it's more likely due to lock contention rather than a missing index. That means the cause is something else in the system causing this query to take longer. Perhaps an insert or update. Perhaps another select query, one that you would normally expect to take a little longer, so the extra time on its end isn't noticed.
I would start with indexing, but I have a database that is a third-party application. Creating my own indexes is not an option. I read an article (sorry, can't find the reference) recommending breaking up the query into table variables or temp tables (depending on number of records) when you have multiple tables in your query (not sure what the magic number is).
Start with dbo.ca_CompanyConnections, dbo.ca_CompanyConnectors, dbo.ca_Connections. Include the fields you need. And then substitute these three joined tables with just the temp table.
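A rough sketch of that first step (the column list is illustrative and the temp table name is made up):
SELECT cc.ConnectionID,
       ctr.ConnectorID,
       ctr.CompanyID,
       con.ID,
       con.Date,
       con.AddressID,
       con.HandOverDate
INTO #ConnectionCore
FROM dbo.ca_CompanyConnections cc
INNER JOIN dbo.ca_CompanyConnectors ctr ON cc.CompanyID = ctr.CompanyID
INNER JOIN dbo.ca_Connections con ON cc.ConnectionID = con.ID;

-- then join #ConnectionCore to dbo.ca_Addrs, dbo.ca_Companies and dbo.ca_States in place of the three original tables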
Not sure what the issue is (I would like to hear recommendations), but it seems like performance drops once you get over 5 tables.