I have a serious performance issue when I execute a SQL statement which involves 3 tables, as follows:
TableA<----TableB---->TableC
In particular, these tables are in a data warehouse and the table in the middle is a dimension table while the others are fact tables. TableA has about 9 million records, TableC about 3 million. The dimension table (TableB) has only 74 records.
The syntax of the query is very simple, as you can see, where TableA is called _PG, TableB is _MDT and TableC is called _FM:
SELECT _MDT.codiceMandato AS Customer,
       SUM(_FM.Totale) AS Revenue,
       SUM(_PG.ErogatoTotale) AS Paid
FROM _PG
INNER JOIN _MDT ON _PG.idMandato = _MDT.idMandato
INNER JOIN _FM ON _FM.idMandato = _MDT.idMandato
GROUP BY _MDT.codiceMandato
Actually, I have never seen this query finish :-(
_PG has a non-clustered index on idMandato, and so does _FM; _MDT has a clustered index on idMandato.
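For reference, a minimal sketch of those index definitions (the index names are illustrative):

-- index names below are illustrative, not the real ones
CREATE NONCLUSTERED INDEX IX_PG_idMandato ON _PG (idMandato);
CREATE NONCLUSTERED INDEX IX_FM_idMandato ON _FM (idMandato);
CREATE CLUSTERED INDEX CIX_MDT_idMandato ON _MDT (idMandato);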
The execution plan is the following:
As you can see, the bottleneck is the Stream Aggregate (33% of the cost) and the Merge Join (66% of the cost). In particular, the Stream Aggregate shows about 400 billion estimated rows!
I don't know the reason and I don't know how to proceed in order to solve this issue.
I use SQL Server 2016 SP1 installed on a virtual server running Windows Server 2012 Standard, with 4 CPU cores, 32 GB of RAM, and 1.5 TB on a dedicated volume made up of SAS disks with an SSD cache.
I hope somebody can help me understand.
Thanks in advance
The most likely cause is that you are getting a Cartesian product along two dimensions, which multiplies the rows unnecessarily: with only 74 values of idMandato, each mandate's roughly 120,000 _PG rows get paired with its roughly 40,000 _FM rows before the aggregation, which is about where your ~400 billion estimated rows come from. The solution is to aggregate before doing the join.
You haven't provided sample data, but this is the idea:
SELECT m.codiceMandato AS Customer, f.Revenue, p.Paid
FROM _MDT m INNER JOIN
     (SELECT p.idMandato, SUM(p.ErogatoTotale) AS Paid
      FROM _PG p
      GROUP BY p.idMandato
     ) p
     ON p.idMandato = m.idMandato INNER JOIN
     (SELECT f.idMandato, SUM(f.Totale) AS Revenue
      FROM _FM f
      GROUP BY f.idMandato
     ) f
     ON f.idMandato = m.idMandato;
I'm not 100% sure this will fix the problem, because your data structure is not clear.
You can try doing a subquery between TableA and TableC without aggregation, then join this subquery with TableB and apply the GROUP BY:
SELECT _MDT.codiceMandato, SUM(A.Totale) AS Revenue, SUM(A.ErogatoTotale) AS Paid
FROM ( SELECT _PG.idMandato, _FM.Totale, _PG.ErogatoTotale
       FROM _PG
       INNER JOIN _FM ON _FM.idMandato = _PG.idMandato ) A
INNER JOIN _MDT ON A.idMandato = _MDT.idMandato
GROUP BY _MDT.codiceMandato
We were querying the DB to populate some logged tickets; however, the query that was formed causes a merge Cartesian join with a high cost, as reported by our performance team.
I mainly do Java programming and don't have much experience with these joins. How can I reframe the piece of query below to avoid the merge Cartesian join with the high cost?
FROM
SERVICE_REQ SR,
SR_COBRAND_DATA SR_COB_DATA,
REPOSITORY rep,
SR_ASSIGNEE_INFO ASSIGNEE_INFO
WHERE
SR.SR_COBRAND_ID=rep.COBRAND_ID
AND SR.SERVICE_REQ_ID=SR_COB_DATA.SERVICE_REQ_ID (+)
AND SR.SERVICE_REQ_ID = ASSIGNEE_INFO.SERVICE_REQ_ID (+)
AND SR.SR_COBRAND_ID = 99
Create a composite index on columns SR_COBRAND_ID and SERVICE_REQ_ID of table SERVICE_REQ:
CREATE INDEX [indexname] ON SERVICE_REQ (SR_COBRAND_ID, SERVICE_REQ_ID);
Just a suggestion: you should not use the old implicit join syntax but rather the explicit join syntax:
SELECT *
FROM SERVICE_REQ SR
LEFT JOIN SR_COBRAND_DATA SR_COB_DATA ON SR.SERVICE_REQ_ID=SR_COB_DATA.SERVICE_REQ_ID
INNER JOIN REPOSITORY rep ON SR.SR_COBRAND_ID=rep.COBRAND_ID
LEFT JOIN SR_ASSIGNEE_INFO ASSIGNEE_INFO ON SR.SERVICE_REQ_ID = ASSIGNEE_INFO.SERVICE_REQ_ID
WHERE SR.SR_COBRAND_ID = 99
Anyway, based on these conditions you do not have a Cartesian product between the tables but a left join of SERVICE_REQ with SR_COBRAND_DATA and SR_ASSIGNEE_INFO, reduced by an inner join with REPOSITORY.
To explain your goal, perhaps you should add proper sample data, the expected result, and your actual result.
I have to solve a problem in my class about query optimization in PostgreSQL.
I have to optimize the following query.
"The query determines the yearly loss in revenue if orders just with a quantity of more than the average quantity of all orders in the system would be taken and shipped to customers."
select sum(ol_amount) / 2.0 as avg_yearly
from orderline, (select i_id, avg(ol_quantity) as a
from item, orderline
where i_data like '%b'
and ol_i_id = i_id
group by i_id) t
where ol_i_id = t.i_id
and ol_quantity < t.a
Is it possible to optimize that query through indexes or something else (a materialized view is possible as well)?
Execution plan can be found here. Thanks.
First, if you have to search from the back of the data, simply create an index on the reverse of the data:
create index on item(reverse(i_data));
Then query it like so:
select sum(ol_amount) / 2.0 as avg_yearly
from orderline, (select i_id, avg(ol_quantity) as a
from item, orderline
where reverse(i_data) like 'b%'
and ol_i_id = i_id
group by i_id) t
where ol_i_id = t.i_id
and ol_quantity < t.a
Remember that making indexes may not speed up the query when you have to retrieve something like 30% of the table. In this case a bitmap index might help you, but as far as I remember it is not available in Postgres. So think about which table to index; maybe it would be worth indexing the big table by ol_i_id, as the join you are making only needs to match less than 10% of the big table, and the small table is loaded into RAM (I might be mistaken here, but at least in SAS a hash join means that you load the smaller table into RAM).
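If you do decide to index the big table by ol_i_id, a minimal sketch would be (Postgres generates an index name if you omit one):

create index on orderline (ol_i_id);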
You may try aggregating data before doing any joins and reusing the grouped data. I assume that you need to do everything in one query without explicitly creating any staging tables by hand. Also, recently I have been working a lot on SQL Server, so I may mix the syntax, but give it a try. There are many assumptions I have made about the data and the structure of the table, but hopefully it will work.
;WITH GrOrderline AS (
    SELECT ol_i_id, ol_quantity, SUM(ol_amount) AS Yearly, COUNT(*) AS cnt
    FROM orderline
    GROUP BY ol_i_id, ol_quantity
),
AvgOrderline AS (
    SELECT o.ol_i_id,
           SUM(o.ol_quantity * o.cnt) * 1.0 / SUM(o.cnt) AS AvgQ  -- weighted average; * 1.0 avoids integer division
    FROM GrOrderline AS o
    INNER JOIN item AS i ON (o.ol_i_id = i.i_id AND RIGHT(i.i_data, 1) = 'b')
    GROUP BY o.ol_i_id
)
SELECT SUM(o.Yearly) / 2.0 AS avg_yearly
FROM GrOrderline o
INNER JOIN AvgOrderline a ON (o.ol_i_id = a.ol_i_id AND o.ol_quantity < a.AvgQ)
I have a situation where I have to join a table multiple times. Most of them need to be left joins, since some of the values may not be available. How can I overcome the poor query performance when joining multiple times?
The Scenario
Tables
[Project]: ProjectId Guid, Name VARCHAR(MAX).
[UDF]: EntityId Guid, EntityType Char(1), UDFCode Guid, UDFName varchar(20)
[UDFDetail]: UDFCode Guid, Description VARCHAR(MAX)
Relationship:
[Project].ProjectId - [UDF].EntityId
[UDFDetail].UDFCode - [UDF].UDFCode
The UDF table holds custom fields for projects, based on the UDFName column. The value for these fields, however, is stored in UDFDetail, in the Description column.
I have lots of custom columns for Project, and they are stored in the UDF table.
So for example, to get two fields for the project I do the following select:
SELECT
p.Name ProjectName,
ud1.Description Field1,
ud1.UDFCode Field1Id,
ud2.Description Field2,
ud2.UDFCode Field2Id
FROM
Project p
LEFT JOIN UDF u1 ON
u1.EntityId = p.ProjectId AND u1.UDFName='Field1'
LEFT JOIN UDFDetail ud1 ON
ud1.UDFCode = u1.UDFCode
LEFT JOIN UDF u2 ON
u2.EntityId = p.ProjectId AND u2.UDFName='Field2'
LEFT JOIN UDFDetail ud2 ON
ud2.UDFCode = u2.UDFCode
The Problem
Imagine the above select but joining with something like 15 fields. In my query I have around 10 fields already and the performance is not very good; it is taking about 20 seconds to run. I have good indexes on these tables, so looking at the execution plan, it is doing only index seeks without any lookups. Regarding the joins, they need to be left joins, because Field1 might not exist for that specific project.
The Question
Is there a more performant way to retrieve the data?
How would you do the query to retrieve 10 different fields for one project in a schema like this?
Your choices are pivot, explicit aggregation (with conditional functions), or the joins. If you have the appropriate indexes set up, the joins may be the fastest method.
The correct index would be UDF(EntityId, UDFName, UDFCode).
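A minimal sketch of that index (the index name is illustrative):

-- illustrative name
CREATE NONCLUSTERED INDEX IX_UDF_EntityId_UDFName ON UDF (EntityId, UDFName, UDFCode);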
You can test if the group by is faster by running a query such as:
SELECT count(*)
FROM Project p LEFT JOIN
UDF u1
ON u1.EntityId = p.ProjectId LEFT JOIN
UDFDetail ud1
ON ud1.UDFCode = u1.UDFCode;
If this runs fast enough, then you can consider the group by approach.
You can try this very weird contraption (it does not look pretty, but it does a single set of outer joins). The intermediate result is a very "wide" and "long" dataset, which we can then "compact" with aggregation (for example, for each ProjectName, the Field1 column will have N results, N-1 NULLs and 1 non-null value, which is then selected with a simple MAX aggregation) [N is the number of fields].
select ProjectName, max(Field1) as Field1, max(Field1Id) as Field1Id, max(Field2) as Field2, max(Field2Id) as Field2Id
from (
select
p.Name as ProjectName,
case when u.UDFName='Field1' then ud.Description else NULL end as Field1,
case when u.UDFName='Field1' then ud.UDFCode else NULL end as Field1Id,
case when u.UDFName='Field2' then ud.Description else NULL end as Field2,
case when u.UDFName='Field2' then ud.UDFCode else NULL end as Field2Id
from Project p
left join UDF u on p.ProjectId=u.EntityId
left join UDFDetail ud on u.UDFCode=ud.UDFCode
) tmp
group by ProjectName
The query can actually be rewritten without the inner query, but that should not make a big difference :). Looking at Gordon Linoff's suggestion and your answer, it might actually take just about 20 seconds as well, but it is still worth a try.
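For reference, a sketch of that flattened version (the same logic, with the CASE expressions moved directly inside the aggregates):

select p.Name as ProjectName,
       max(case when u.UDFName='Field1' then ud.Description end) as Field1,
       max(case when u.UDFName='Field1' then ud.UDFCode end) as Field1Id,
       max(case when u.UDFName='Field2' then ud.Description end) as Field2,
       max(case when u.UDFName='Field2' then ud.UDFCode end) as Field2Id
from Project p
left join UDF u on p.ProjectId=u.EntityId
left join UDFDetail ud on u.UDFCode=ud.UDFCode
group by p.Name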
I'm currently having performance problems with an expensive SQL query, and I'd like to improve it.
This is what the query looks like:
SELECT TOP 50 MovieID
FROM (SELECT [MovieID], COUNT(*) AS c
FROM [tblMovieTags]
WHERE [TagID] IN (SELECT TOP 7 [TagID]
FROM [tblMovieTags]
WHERE [MovieID]=12345
ORDER BY Relevance ASC)
GROUP BY [MovieID]
HAVING COUNT(*) > 1) a
INNER JOIN [tblMovies] m ON m.MovieID=a.MovieID
WHERE (Hidden=0) AND m.Active=1 AND m.Processed=1
ORDER BY c DESC, m.IMDB DESC
What I'm trying to do is find movies that have at least 2 matching tags with MovieID 12345.
The basic database schema looks like this:
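A sketch of the relevant columns, as they are used in the queries (reconstructed, not the full schema):

[tblMovies]: MovieID, IMDB, Hidden, Active, Processed
[tblMovieTags]: MovieTagID, MovieID, TagID, Relevance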
Each movie has 4 to 5 tags. I want a list of movies similar to any movie based on the tags. A minimum of 2 tags must match.
This query is causing my server problems as I have hundreds of concurrent users at any given time.
I have already created indexes based on execution plan suggestions, and that has made it quicker, but it's still not enough.
Is there anything I could do to make this faster?
I like to use temp tables because they can speed up your queries (if used correctly) and make them easier to read. Try the query below and see if it speeds things up. There were a few fields (Hidden, IMDB) that weren't in your schema, so I left them out.
This query may, or may not, be exactly what you are looking for. The point of it is to show you how to use temp tables to increase the performance and improve readability. Some minor tweaks may be necessary.
SELECT TOP 7 [TagID],[MovieTagID],[MovieID]
INTO #MovieTags
FROM [tblMovieTags]
WHERE [MovieID]=12345
ORDER BY Relevance ASC

SELECT mt.MovieID, COUNT(mt.MovieTagID) AS TagCount
INTO #Movies
FROM #MovieTags mt
INNER JOIN tblMovies m ON m.MovieID=mt.MovieID AND m.Active=1 AND m.Processed=1
GROUP BY mt.MovieID
HAVING COUNT(mt.MovieTagID) > 1

SELECT TOP 50 * FROM #Movies ORDER BY TagCount DESC
DROP TABLE #MovieTags
DROP TABLE #Movies
Edit
Parameterized Queries
You will also want to use parameterized queries rather than concatenating your values into your SQL string. Check out this short, to-the-point blog post that explains why you should use parameterized queries. This, combined with the temp table method, should improve your performance significantly.
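For example, a minimal sketch of a parameterized call from T-SQL using sp_executesql (the parameter name is illustrative; from application code you would use your driver's parameter mechanism instead):

-- the movie id is passed as a parameter instead of being concatenated into the SQL text
EXEC sp_executesql
    N'SELECT TOP 7 [TagID] FROM [tblMovieTags] WHERE [MovieID] = @MovieID ORDER BY Relevance ASC',
    N'@MovieID int',
    @MovieID = 12345;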
I want to see if there is some unnecessary processing happening in the query you wrote. Try the following query and let us know if it's faster or slower, and whether it even returns the same data.
I just threw this together so no guarantees on perfect syntax
SELECT TOP 7 [TagID]
INTO #MovieTags
FROM [tblMovieTags]
WHERE [MovieID]=12345
ORDER BY TagID
;WITH cte_movies AS
(
SELECT
mt.MovieID
,mt.TagID
FROM
tblMovieTags mt
INNER JOIN #MovieTags t ON mt.TagId = t.TagId
INNER JOIN tblMovies m ON mt.MovieID = m.MovieID
WHERE
(Hidden=0) AND m.Active=1 AND m.Processed=1
),
cte_movietags AS
(
SELECT
MovieId
,COUNT(MovieId) AS TagCount
FROM
cte_movies
GROUP BY MovieId
)
SELECT
MovieId
FROM
cte_movietags
WHERE
TagCount > 1
ORDER BY
MovieId
GO
DROP TABLE #MovieTags
I have a very large table with around 50 million rows and 15 columns. Whenever I read, I always need all the columns, so I can't split them. I have a clustered index on the table with 4 keys and I always read data using those keys.
But the performance is still slow. My queries are fairly simple, like this:
select
CountryId, RetailerID, FY,
sum(col1), sum(col2),.....sum(col15)
from mytable a
join product p on a.productid = p.id
join ......
join .....
join ......
join .....
Where .......
group by CountryId, RetailerID, FY
I'm not using any IN operator, sub-queries or inline functions here... which I know would obviously make it slow. I've looked at partitioning but I'm not sure about it; can I get some performance improvement by partitioning?
Or is there anything else I can do?
I'm using SQL Server 2012 Enterprise Edition
Please help!
Thanks guys, I have solved this using aggregated tables. I run a job overnight that aggregates the data, which has limited the number of rows, and the reports are running fine now.
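For anyone curious, a minimal sketch of that kind of overnight aggregation job (the summary table name is illustrative and only a few of the 15 columns are shown):

-- mytable_agg is an illustrative name for the pre-aggregated summary table
TRUNCATE TABLE mytable_agg;

INSERT INTO mytable_agg (CountryId, RetailerID, FY, col1_sum, col2_sum, col15_sum)
SELECT CountryId, RetailerID, FY, SUM(col1), SUM(col2), SUM(col15)
FROM mytable
GROUP BY CountryId, RetailerID, FY;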