I'm trying to optimize a query. Basically, there are 3 parts to a transaction that can be repeated. I log all communications, but want to get the "freshest" of the 3 parts. The 3 parts are all linked through a single intermediate table (unfortunately) which is what is slowing this whole thing down (too much normalization?).
There is the center of the "star", "Transactions"; then the inner spokes (all represented by "TransactionDetails"), which refer to the hub using the primary key of "Transactions"; then the outer spokes (PPGDetails, TicketDetails and CompletionDetails), all of which refer to "TransactionDetails" by its primary key.
Each of "PPGDetails", "TicketDetails" and "CompletionDetails" will have exactly one row in "TransactionDetails" that they link to, by primary key. There can be many of each of these pairs of objects per transaction.
So, in order to get the most recent TicketDetails for a transaction, I use this view:
CREATE VIEW [dbo].[TicketTransDetails] AS
select *
from TicketDetails tkd
join (select MAX(TicketDetail_ID) as TicketDetail_ID
      from TicketDetails temp1
      join TransactionDetails temp2
        on temp1.TransactionDetail_ID = temp2.TransactionDetail_ID
      group by temp2.Transaction_ID) qq
  on tkd.TicketDetail_ID = qq.TicketDetail_ID
join TransactionDetails td
  on tkd.TransactionDetail_ID = td.TransactionDetail_ID
GO
The other 2 detail types have similar views.
Then, to get all of the transaction details I want, one row per transaction, I use:
select *
from Transactions t
join CompletionTransDetails cpd
  on t.Transaction_ID = cpd.Transaction_ID
left outer join TicketTransDetails tkd
  on t.Transaction_ID = tkd.Transaction_ID
left outer join PPGTransDetails ppd
  on t.Transaction_ID = ppd.Transaction_ID
where cpd.DateAndTime between '2/1/2017' and '3/1/2017'
It is by design that I want ONLY transactions that have at least 1 "CompletionDetail", but 0 or more "PPGDetail" or "TicketDetail".
This query returns the correct results, but takes 40 seconds to execute on decent server hardware, and a "Merge Join (Left Outer Join)" immediately before the SELECT returns is attributed 100% of the execution plan time.
If I take out the join to either PPGTransDetails or TicketTransDetails in the final query, execution time drops to ~20 seconds - a marked improvement, but it is still doing a Merge Join over a significant number of records (many of them extraneous, I assume).
When just a single transaction is selected (via the where clause), the query takes only about 4 seconds, and the plan then has a final "Nested Loops" step which also takes a large portion of the time (96%). I would like this query to take less than a second.
Since the views don't have a primary key, I assume that is what causes the optimizer to choose the Merge Join. That said, I am having trouble writing a query that emulates this functionality - much less one that is more efficient.
Can anyone help me recognize what I may be missing?
Thanks!
--mobrien118
Edit: Adding more info -
Here is the effective data model:
Essentially, for a single transaction, there can be MANY PPGDetails, TicketDetails and CompletionDetails, but each one will have its own TransactionDetails row (they are one-to-one, but this is enforced only in software, not in the model).
There are currently:
1,619,307 "Transactions"
3,564,518 "TransactionDetails"
512,644 "PPGDetails"
1,471,826 "TicketDetails"
1,580,043 "CompletionDetails"
There are currently no foreign key constraints or indexes set up on these items.
First a quick remark:
which also takes a large portion of the time (96%).
This is a bit of a (common) misconception. The 96% there is an estimate of how many resources that 'block' will need. It by no means indicates that 96% of the time inside the query was spent on it. I've had situations where something that took over half of the query time-wise was attributed virtually no cost.
Additionally, you seem to be assuming that when you query/join to the view, the system will first prepare the data from the view and then use that result to further 'work out the query'. This is not the case: the system will 'expand' the view and optimize a 'combined' query, taking everything into account.
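For example (purely illustrative), a query like

select *
from TicketTransDetails
where Transaction_ID = 12345

is not executed as 'build the entire view, then filter'; the optimizer inlines the view's definition and pushes the filter down into the combined plan wherever it can.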
For us to understand what's going on, you'll need to provide us with the query plan (.sqlplan if you use SqlSentry Plan Explorer); it's that, or a full explanation of the table layout, indexes, foreign keys, etc., plus a bit of explanation of the data (total rows, expected matches between tables, etc.).
PS: even though everybody seems to be touting 'hash joins' as the solution to everything, nested loops and merge joins often are more efficient.
(trying to understand your queries, is this view equivalent to your view?)
[edit: incorrect view removed to avoid confusion]
Second try: (think I have it right this time)
CREATE VIEW [dbo].[TicketTransDetails] AS
SELECT td.Transaction_ID, tkd.*
FROM TicketDetails tkd
JOIN TransactionDetails td
  ON td.TransactionDetail_ID = tkd.TransactionDetail_ID
JOIN (SELECT MAX(TicketDetail_ID) AS max_TicketDetail_ID, temp2.Transaction_ID
      FROM TicketDetails temp1
      JOIN TransactionDetails temp2
        ON temp1.TransactionDetail_ID = temp2.TransactionDetail_ID
      GROUP BY temp2.Transaction_ID) qq
  ON qq.max_TicketDetail_ID = tkd.TicketDetail_ID
 AND qq.Transaction_ID = td.Transaction_ID
It might not be any faster when querying the entire table, but it should be when fetching specific records from the Transactions table.
Indexing-wise you probably want a unique index on TicketDetails (TransactionDetail_ID, TicketDetail_ID)
You'll need similar constructs for the other tables, of course.
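For instance (the index name is just illustrative):

CREATE UNIQUE INDEX UQ_TicketDetails_TransDetail
    ON TicketDetails (TransactionDetail_ID, TicketDetail_ID);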
Thinking it through a bit further I think this would work too:
CREATE VIEW [dbo].[TicketTransDetails]
AS
SELECT *
FROM (
    SELECT td.Transaction_ID,
           TicketDetail_ID_rownr = ROW_NUMBER() OVER (PARTITION BY td.Transaction_ID ORDER BY tkd.TicketDetail_ID DESC),
           tkd.*
    FROM TicketDetails tkd
    JOIN TransactionDetails td
      ON td.TransactionDetail_ID = tkd.TransactionDetail_ID
) xx
WHERE TicketDetail_ID_rownr = 1 -- we want the "first one from the end" only
It looks quite a bit more readable, but I'm not sure whether it would be faster or not... you'll have to compare timings and query plans.
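One quick way to compare (a sketch - the Transaction_ID value is made up; run each candidate with these settings on and read the Messages tab):

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT *
FROM TicketTransDetails
WHERE Transaction_ID = 12345; -- hypothetical test ID

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;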
I've just been debugging a slow SQL query.
It's a join between 2 tables, with a WHERE clause that filters on a property of either one table OR the other.
If I re-write it as a UNION then it's suddenly 2 orders of magnitude faster, even though those 2 queries produce identical outputs:
DECLARE @UserId UNIQUEIDENTIFIER = '0019813D-4379-400D-9423-56E1B98002CB'
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (@UserId) OR Bookings.MixedDealBroker in (@UserId))
--Execution time: ~4000ms
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (@UserId))
UNION
SELECT *
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (Bookings.MixedDealBroker in (@UserId))
--Execution time: ~70ms
This seems rather surprising to me! I would have expected the SQL compiler to be entirely capable of identifying that the 2nd form was equivalent and would have used that compilation approach if it were available.
Some context notes:
I've checked and IN (@UserId) vs = @UserId makes no difference.
Nor does JOIN vs LEFT JOIN.
Those tables each have 100,000s of records, and the filter cuts it down to ~100.
In the slow version it seems to be reading every row of both tables.
So:
Does anyone have any ideas for how this comes about?
What (if anything) can I do to fix the performance without re-writing the query as a series of UNIONs (not viable for a variety of reasons)?
=-=-=-=-=-=-=
Execution Plans:
This is a common limitation of SQL engines - not just SQL Server, but other database systems as well. The OR complicates the predicate enough that the execution plan selected isn't always ideal. This probably relates to the fact that, for the most part, only one index can be seeked into per instance of a table object, and in your specific case the OR predicate spans two different tables, among other factors in how SQL engines are designed.
By using a UNION clause, you now have two instances of the Bookings table referenced, which can individually be seeked on separately in the most efficient way possible. That allows the SQL engine to pick a better execution plan to serve your query.
This is pretty much just one of those things that is the way it is, and you'll want to remember the UNION workaround for future encounters with this kind of performance issue.
Also, in response to your comment:
I don't understand how the difference can affect the EP, given that the 2 different "phrasings" of the query are identical?
A new execution plan is generated whenever one doesn't already exist in the plan cache for a given query. The way the engine determines whether a plan for a query is already cached is based on an exact hash of the query statement, so even an extra space character at the end of the query can result in a new plan being generated, and theoretically that plan can be different. So a differently written query (despite being logically the same) can certainly result in a different execution plan.
There are other reasons a plan can change on re-generation too, such as different data and statistics of that data, in the tables referenced in the query between executions. But these reasons don't really apply to your question above.
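You can see this for yourself by peeking at the plan cache (a sketch; needs VIEW SERVER STATE permission, and the LIKE filter is just to narrow the output to your queries) - the two phrasings will appear as separate entries with their own hashes and plans:

SELECT qs.query_hash, qs.query_plan_hash, qs.execution_count, st.text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE st.text LIKE '%Bookings%';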
As already stated, the OR condition prevents the database engine from efficiently using the indexes in a single query. Because the OR condition spans tables, I doubt that the Tuning Advisor will come up with anything useful.
If you have a case where the query you have posted is part of a larger query, or the results are complex and you do not want to repeat code, you can wrap your initial query in a Common Table Expression (CTE) or a subquery and then feed the combined results into the remainder of your query. Sometimes just selecting one or more PKs in your initial query will be sufficient.
Something like:
SELECT <complex select list>
FROM (
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (BookingPricings.[Owner] in (@UserId))
UNION
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE (Bookings.MixedDealBroker in (@UserId))
) PRE
JOIN Bookings B ON B.ID = PRE.BookingsID
JOIN BookingPricings BP ON BP.ID = PRE.BookingPricingsID
<more joins>
WHERE <more conditions>
Having just the IDs in your initial select makes the UNION more efficient. The UNION can also be changed to a yet more efficient UNION ALL with careful use of additional conditions, such as AND Bookings.MixedDealBroker <> @UserId in the second part, to avoid overlapping results.
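A sketch of that UNION ALL variant (the IS NULL branch accounts for Bookings rows where the LEFT JOIN found no BookingPricings match):

SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE BookingPricings.[Owner] = @UserId
UNION ALL
SELECT Bookings.ID AS BookingsID, BookingPricings.ID AS BookingPricingsID
FROM Bookings
LEFT JOIN BookingPricings ON Booking = Bookings.ID
WHERE Bookings.MixedDealBroker = @UserId
  AND (BookingPricings.[Owner] <> @UserId OR BookingPricings.[Owner] IS NULL)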
I have a rather complex query that takes some 5 seconds to run.
EDITED: complete code for better understanding:
SELECT
AGEID, AGECLIID, CLINombre, CLIEstado,AGEOnline, AGEOnlineCancelacion,
IF(CANID>0,'Y','N') AS AGECancelado, ACTIOrden, ACTIColor, ACTINombre,
ACTICapacidad, 'Y' AS slotactivo
FROM
( agenda JOIN cliente on AGECLIID = CLIID AND CLICentro = 'Madrid'
JOIN actividad on ACTIID = AGEACTIID
LEFT JOIN agcambio on CAMAGEID = AGEID
LEFT JOIN agcancelacion on CANAGEID = AGEID
) RIGHT JOIN horario on DIAOrden = COALESCE( CAMDia, AGEDia )
ORDER BY HORHora, MINMinuto, DIAOrden
NOTE: Originally, it was "SELECT FROM horario LEFT JOIN all_the_rest of the query".
This produces the following explain plan:
NOTE: I tried to upload a picture here, but I cannot do it, it just creates a link to the picture.
explain plan with right join
It takes like 5 seconds to execute.
The key here is that the part of the query that retrieves the busy slots takes almost no time when executed alone. The timeslots (horario) table has 336 rows (all possible time slots), so the RIGHT JOIN should not take that long.
The explain plan says that first it does an ALL access to timeslots, and after that all the other tables are accessed via INDEX.
So I wanted to change the order in which the tables are accessed, forcing timeslots to be accessed last. I changed the order in the FROM clause, changing the JOIN to a RIGHT JOIN, but the explain plan says the same (and the query still takes 5 seconds).
Testing things, I tried a STRAIGHT_JOIN to force the table access order, and it works: it took less than a second to return the rows (probably partly because it is then an inner join, but it is still significantly faster). But I cannot use STRAIGHT_JOIN because I need the RIGHT JOIN, and I've read that there is no way to do a STRAIGHT join as an OUTER JOIN.
The explain plan of the execution with straight join is the following:
explain plan with straight_join
I have tried to create a view but the result was the same.
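Another rewrite I am experimenting with (a sketch using the same table and column names as above; not yet validated against the real data) is to collapse the busy-slot part into a derived table, so it has to be resolved before the outer join to horario:

SELECT hor.*, busy.*
FROM horario hor
LEFT JOIN (
    SELECT AGEID, AGECLIID, CLINombre, CLIEstado, AGEOnline, AGEOnlineCancelacion,
           IF(CANID > 0, 'Y', 'N') AS AGECancelado,
           ACTIOrden, ACTIColor, ACTINombre, ACTICapacidad,
           COALESCE(CAMDia, AGEDia) AS BusyDia
    FROM agenda
    JOIN cliente ON AGECLIID = CLIID AND CLICentro = 'Madrid'
    JOIN actividad ON ACTIID = AGEACTIID
    LEFT JOIN agcambio ON CAMAGEID = AGEID
    LEFT JOIN agcancelacion ON CANAGEID = AGEID
) busy ON busy.BusyDia = hor.DIAOrden
ORDER BY HORHora, MINMinuto, DIAOrden;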
So,
a) Can I somehow force the optimizer to access the timeslots table last?
b) But the real question is why it takes so long when there are only 336 rows to RIGHT JOIN against (150 rows of busy slots)... Maybe this could help me rewrite the query.
Many thanks in advance for any hint... Meanwhile I'm rewriting the query again and again.... :)
Xavi.
I have something like this:
SELECT CompanyId
FROM Company
WHERE CompanyId not in
(SELECT CompanyId
FROM Company
WHERE (IsPublic = 0) and CompanyId NOT IN
(SELECT ShoppingLike.WhichId
FROM Company
INNER JOIN
ShoppingLike ON Company.CompanyId = ShoppingLike.UserId
WHERE (ShoppingLike.IsWaiting = 0) AND
(ShoppingLike.ShoppingScoreTypeId = 2) AND
(ShoppingLike.UserId = 75)
)
)
It has 3 SELECTs; I want to know how I could write it without making 3 SELECTs, and which one is faster for 1 million records: "select in select" or "left join"?
My experience is from Oracle. There is never one correct answer to optimising tricky queries; it's a collaboration between you and the optimiser. You need to check explain plans and sometimes traces, often at each stage of writing the query, to find out what the optimiser is thinking. Having said that:
You could remove the outer SELECT by putting the entire contents of its subquery's WHERE clause in a NOT(...). On the face of it, this will prevent the outer full scan of Company (or of its index on CompanyId). Try it, check the output is the same and get timings, then remove it temporarily before trying the suggestions below. The NOT() may well cause the optimiser to stop considering an ANTI-JOIN against the ShoppingLike subquery due to an implicit OR being created.
Ensure that CompanyId and WhichId are defined as NOT NULL columns. Without this (or the likes of an explicit CompanyId IS NOT NULL), ANTI-JOIN options are often discarded.
The innermost subquery is not correlated (it does not reference anything from its outer query), so it can be extracted and tuned separately. As a matter of style I'd swap the table names around in the INNER JOIN, as you want ShoppingLike scanned first since it has all the filters against it. It won't make any difference, but it reads more easily and makes it possible to use a hint to scan the tables in the order specified. I would even question the need for the Company table in this subquery.
You've used NOT IN, whereas the very similar NOT EXISTS sometimes gives the optimiser more, or alternative, options.
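For example, a NOT EXISTS version might look like this (a sketch reusing the question's columns, and dropping the Company table from the innermost query as suggested above; verify the output matches before timing it):

SELECT c.CompanyId
FROM Company c
WHERE NOT EXISTS
      (SELECT 1
       FROM Company c2
       WHERE c2.CompanyId = c.CompanyId
         AND c2.IsPublic = 0
         AND NOT EXISTS
             (SELECT 1
              FROM ShoppingLike sl
              WHERE sl.IsWaiting = 0
                AND sl.ShoppingScoreTypeId = 2
                AND sl.UserId = 75
                AND sl.WhichId = c2.CompanyId))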
All the above is just trial and error unless you start studying the explain plan. Oracle can, with a following wind, convert between LEFT JOIN and IN SELECT forms itself. With 1M+ rows there is enough at stake to justify the time invested.
Which of these queries is more efficient, and would a modern DBMS (like SQL Server) make the changes under the hood to make them equal?
SELECT DISTINCT S#
FROM shipments
WHERE P# IN (SELECT P#
FROM parts
WHERE color = 'Red')
vs.
SELECT DISTINCT S#
FROM shipments, parts
WHERE shipments.P# = parts.P#
AND parts.color = 'Red'
The best way to satiate your curiosity about this kind of thing is to fire up Management Studio and look at the execution plan; you'll also want to look at SQL Profiler. As one of my professors said: "the compiler is the final authority." A similar ethos holds when you want to know the performance profile of your queries in SQL Server - just look.
Starting here, this answer has been updated
The actual comparison might be very revealing. For example, in testing that I just did, I found that either approach might yield the fastest time depending on the nature of the query. For example, a query of the form:
Select F1, F2, F3 From Table1 Where F4='X' And UID in (Select UID From Table2)
yielded a table scan on Table1 and a mere index scan on Table2, followed by a right semi join.
A query of the form:
Select A.F1, A.F2, A.F3 From Table1 A inner join Table2 B on (A.UID=B.UID)
Where A.Gender='M'
yielded the same execution plan with one caveat: the hash match was a simple right join this time. So that is the first thing to note: the execution plans were not dramatically different.
These are not duplicate queries though since the second one may return multiple, identical records (one for each record in table 2). The surprising thing here was the performance: the subquery was far faster than the inner join. With datasets in the low thousands (thank you Red Gate SQL Data Generator) the inner join was 40 times slower. I was fairly stunned.
Ok, how about a real apples to apples? This is the matching inner join - note the extra step to winnow out the duplicates:
Select Distinct A.F1, A.F2, A.F3 From Table1 A inner join Table2 B
on (A.UID=B.UID)
Where A.Gender='M'
The execution plan does change in that there is an extra step - a sort after the inner join. Oddly enough, though, the time drops dramatically such that the two queries are almost identical (on two out of five trials the inner join is very slightly faster). Now, I can imagine the first inner join (without the "distinct") being somewhat longer just due to the fact that more data is being forwarded to the query window - but it was only twice as much (two Table2 records for every Table1 record). I have no good explanation why the first inner join was so much slower.
When you add a predicate to the search on table 2 using a subquery:
Select F1, F2, F3 From Table1 Where F4='X' And UID in
(Select UID From Table2 Where F1='Y')
then the Index Scan is changed to a Clustered Index Scan (which makes sense since the UID field has its own index in the tables I am using) and the percentage of time it takes goes up. A Stream Aggregate operation is also added. Sure enough, this does slow the query down. However, plan caching obviously kicks in as the first run of the query shows a much greater effect than subsequent runs.
When you add a predicate using the inner join, the entire plan changes pretty dramatically (left as an exercise to the reader - this post is long enough). The performance, again, is pretty much the same as that of the subquery - as long as the "Distinct" is included. Similar to the first example, omitting distinct led to a significant increase in time to completion.
One last thing: someone suggested (and your question now includes) a query of the form:
Select Distinct F1, F2, F3 From table1, table2
Where (table1.UID=table2.UID) AND table1.F4='X' And table2.F1='Y'
The execution plan for this query is similar to that of the inner join (there is a sort after the original table scan on table2 and a merge join rather than a hash join of the two tables). The performance of the two is comparable as well. I may need a larger dataset to tease out differences but, so far, I'm not seeing any advantage to this construct or the "Exists" construct.
With all of this being said - your results may vary. I came nowhere near covering the full range of queries that you may run into when I was doing the above tests. As I said at the beginning, the tools included with SQL Server are your friends: use them.
So: why choose one over the other? It really comes down to personal preference, since there appears to be no time-complexity advantage of an inner join over a subquery across the range of examples I tested.
In most classic query cases I use an inner join just because I "grew up" with them. I do use subqueries, however, in two situations. First, some queries are simply easier to understand using a subquery: the relationship between the tables is manifest. The second and most important reason, though, is that I am often in a position of dynamically generating SQL from within my application and subqueries are almost always easier to generate automatically from within code.
So, the takeaway is simply that the best solution is the one that makes your development the most efficient.
Using IN is more readable, and I recommend using ANSI-92 over ANSI-89 join syntax:
SELECT DISTINCT S#
FROM SHIPMENTS s
JOIN PARTS p ON p.p# = s.p#
AND p.color = 'Red'
Check your explain plans to see which is better, because it depends on data and table setup.
If you aren't selecting anything from the table I would use an EXISTS clause.
SELECT DISTINCT S#
FROM shipments a
WHERE EXISTS (SELECT 1
FROM parts b
WHERE b.color = 'Red'
AND a.P# = b.P#)
This will optimize out to be the same as the second one you posted.
SELECT DISTINCT S#
FROM shipments,parts
WHERE shipments.P# = parts.P# and parts.color = 'Red';
Using IN with a subquery can prevent SQL Server from using an index on that column effectively, and such subqueries are often slower.
I wonder if anyone can help improve my understanding of JOINs in SQL. [If it is significant to the problem, I am thinking MS SQL Server specifically.]
Take 3 tables A, B [A related to B by some A.AId], and C [B related to C by some B.BId]
If I compose a query e.g
SELECT *
FROM A JOIN B
ON A.AId = B.AId
All good - I'm sweet with how this works.
What happens when table C (or some other D, E, ...) gets added?
In the situation
SELECT *
FROM A JOIN B
ON A.AId = B.AId
JOIN C ON C.BId = B.BId
What is C joining to - is it the B table (and the values therein)?
Or is it some other temporary result set that is the result of the A+B Join that the C table is joined to?
[The implication being not all values that are in the B table will necessarily be in the temporary result set A+B based on the join condition for A,B]
A specific (and fairly contrived) example of why I am asking is because I am trying to understand behaviour I am seeing in the following:
Tables
Account (AccountId, AccountBalanceDate, OpeningBalanceId, ClosingBalanceId)
Balance (BalanceId)
BalanceToken (BalanceId, TokenAmount)
Where:
Account->Opening, and Closing Balances are NULLABLE
(may have opening balance, closing balance, or none)
Balance->BalanceToken is 1:m - a balance could consist of many tokens
Conceptually, the Closing Balance of a date would be tomorrow's Opening Balance
If I was trying to find a list of all the opening and closing balances for an account
I might do something like
SELECT AccountId
, AccountBalanceDate
, Sum (openingBalanceAmounts.TokenAmount) AS OpeningBalance
, Sum (closingBalanceAmounts.TokenAmount) AS ClosingBalance
FROM Account A
LEFT JOIN BALANCE OpeningBal
ON A.OpeningBalanceId = OpeningBal.BalanceId
LEFT JOIN BALANCE ClosingBal
ON A.ClosingBalanceId = ClosingBal.BalanceId
LEFT JOIN BalanceToken openingBalanceAmounts
ON openingBalanceAmounts.BalanceId = OpeningBal.BalanceId
LEFT JOIN BalanceToken closingBalanceAmounts
ON closingBalanceAmounts.BalanceId = ClosingBal.BalanceId
GROUP BY AccountId, AccountBalanceDate
Things work as I would expect until the last JOIN brings in the closing balance tokens - where I end up with duplicates in the result.
[I can fix with a DISTINCT - but I am trying to understand why what is happening is happening]
I have been told the problem is that the relationship between Balance and BalanceToken is 1:M, and that when I bring in the last JOIN I get duplicates because the 3rd JOIN has already brought BalanceIds into the (I assume) temporary result set multiple times.
I know that the example tables do not conform to good DB design
Apologies for the essay, and thanks for any enlightenment :)
Edit in response to question by Marc
Conceptually, for an account there should not be duplicates in BalanceToken (per AccountingDate). I think the problem comes about because one account/accounting date's closing balance is that account's opening balance for the next day - so when self-joining to Balance and BalanceToken multiple times to get opening and closing balances, I think balances (BalanceIds) are being brought into the 'result mix' multiple times. If it helps to clarify the second example, think of it as a daily reconciliation - hence the left joins - an opening (and/or) closing balance may not have been calculated for a given account/accounting date combination.
Conceptually here is what happens when you join three tables together.
The optimizer comes up with a plan, which includes a join order. It could be A, B, C, or C, B, A or any of the combinations
The query execution engine applies any predicates (WHERE clause) to the first table that don't involve any of the other tables. It selects out the columns mentioned in the JOIN conditions, the SELECT list, or the ORDER BY list. Call this intermediate result set R1.
It then joins this result set to the second table, row by row, applying any predicates that may apply to the second table. This produces another temporary result set.
Then it joins in the final table and applies the ORDER BY.
This is conceptually what happens. In fact, there are many possible optimizations along the way. The advantage of the relational model is that its sound mathematical basis makes various transformations of the plan possible without changing correctness.
For example, there is really no need to generate the full result sets along the way. The ORDER BY may instead be done via accessing the data using an index in the first place. There are lots of types of joins that can be done as well.
We know that the data from B is going to be filtered by the (inner) join to A (and the data in A is filtered likewise). So if we then (inner) join from B to C, the set C is also filtered by the relationship to A. Note also that any duplicates from the join will be included.
However; what order this happens in is up to the optimizer; it could decide to do the B/C join first then introduce A, or any other sequence (probably based on the estimated number of rows from each join and the appropriate indexes).
HOWEVER, in your later example you use a LEFT OUTER join, so Account is not filtered at all - and its rows may well be duplicated if any of the other tables have multiple matches.
Are there duplicates (per account) in BalanceToken?
I often find it helps to view the actual execution plan. In Query Analyser/Management Studio, you can turn this on for queries from the Query menu, or use Ctrl+M. After running the query, the plan that was executed is shown in another result tab. From this you'll see that C and B are joined first, and then that result is joined with A. Because both joins are inner, the result will be the same regardless of which is joined first (the plan might vary depending on the information the DBMS has), but the time it takes might differ greatly, and this is where the optimiser and hints come into play.
Joins can be tricky, and much of the behavior is of course dictated by how the data is stored in the actual tables.
Without seeing the tables it's hard to give a clear answer in your particular case but I think the basic issue is that you are summing over multiple result sets that are being combined into one.
Perhaps instead of multiple joins you should build two separate intermediate result sets in your query: one with the AccountID, date, and sum of opening balances, and a second one with the AccountID, date, and sum of closing balances, then join those two on AccountID and date - see the sketch below.
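Something like this (a sketch against the tables described in the question, joining BalanceToken directly on the Opening/ClosingBalanceId columns; untested, so treat it as a starting point):

WITH OpeningSums AS (
    SELECT a.AccountId, a.AccountBalanceDate, SUM(bt.TokenAmount) AS OpeningBalance
    FROM Account a
    JOIN BalanceToken bt ON bt.BalanceId = a.OpeningBalanceId
    GROUP BY a.AccountId, a.AccountBalanceDate
),
ClosingSums AS (
    SELECT a.AccountId, a.AccountBalanceDate, SUM(bt.TokenAmount) AS ClosingBalance
    FROM Account a
    JOIN BalanceToken bt ON bt.BalanceId = a.ClosingBalanceId
    GROUP BY a.AccountId, a.AccountBalanceDate
)
SELECT a.AccountId, a.AccountBalanceDate, o.OpeningBalance, c.ClosingBalance
FROM Account a
LEFT JOIN OpeningSums o
    ON o.AccountId = a.AccountId AND o.AccountBalanceDate = a.AccountBalanceDate
LEFT JOIN ClosingSums c
    ON c.AccountId = a.AccountId AND c.AccountBalanceDate = a.AccountBalanceDate;

Because each sum is computed separately before the final join, the opening and closing token rows never multiply against each other, so no DISTINCT is needed.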
In order to find out exactly what is happening with joins, also in your specific case, I would do the following:
Change the initial part
SELECT AccountId, AccountBalanceDate, sum(...) as OpeningBalance,
       sum(...) as ClosingBalance FROM
to simply
SELECT * FROM
(and drop the GROUP BY at the end, since it no longer applies).
Study the resulting table, and you will see exactly what data is being duplicated. Remove the joins one by one and see what happens. This should give you a clue as to what it is about your particular data that is causing the dupes.
If you open the query in SQL Server Management Studio (a free version exists), you can edit the query in the designer. The visual view of how the tables are being joined might also help you realize what's going on.