Postgres query with left joins takes too much time - sql

I have a problem with this query. It seems to hang: after 15 minutes it still has not finished.
But if I remove one of the left joins, it works.
What am I doing wrong?
Select distinct a.sito,
Count(distinct a.id_us) as us,
Count (distinct b.id_invmat) as materiali,
Count (distinct c.id_struttura) as Struttura,
Count(distinct d.id_tafonomia) as tafonomia
From us_table as a
Left join invetario_materiali as b on a.sito=b.sito
Left join struttura_table as c on a.sito=c.sito
Left join tafonomia_table as d on a.sito=d.sito
Group by a.sito
Order by us
thanks
E

This is a case where correlated subqueries might be the simplest approach:
select s.sito, s.us,
(select count(*) from invetario_materiali m where s.sito = m.sito) as materiali,
(select count(*) from struttura_table st where s.sito = st.sito) as Struttura,
(select count(*) from tafonomia_table t where s.sito = t.sito) as tafonomia
from (select sito, count(*) as us
from us_table
group by sito
) s
order by us;
This should be much, much faster than your version for two reasons. First, it avoids the outer aggregation. Second, it avoids the Cartesian products among the tables.
You can make this even faster by creating indexes on each of the secondary tables on sito.
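To make the Cartesian-product point concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Postgres. The table and column names come from the question; the data (one site with 20 rows per table) is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical miniature versions of the four tables in the question.
cur.executescript("""
CREATE TABLE us_table (id_us INTEGER PRIMARY KEY, sito TEXT);
CREATE TABLE invetario_materiali (id_invmat INTEGER PRIMARY KEY, sito TEXT);
CREATE TABLE struttura_table (id_struttura INTEGER PRIMARY KEY, sito TEXT);
CREATE TABLE tafonomia_table (id_tafonomia INTEGER PRIMARY KEY, sito TEXT);
""")

# Made-up data: a single site 'A' with 20 rows in each table.
for table in ("us_table", "invetario_materiali", "struttura_table", "tafonomia_table"):
    cur.executemany(f"INSERT INTO {table} (sito) VALUES (?)", [("A",)] * 20)

# Joining everything on sito multiplies the rows: 20 * 20 * 20 * 20.
joined_rows = cur.execute("""
    SELECT count(*)
    FROM us_table a
    LEFT JOIN invetario_materiali b ON a.sito = b.sito
    LEFT JOIN struttura_table c ON a.sito = c.sito
    LEFT JOIN tafonomia_table d ON a.sito = d.sito
""").fetchone()[0]

# Counting each table separately per site never builds that intermediate result.
per_site = cur.execute("""
    SELECT s.sito, s.us,
           (SELECT count(*) FROM invetario_materiali m WHERE m.sito = s.sito) AS materiali,
           (SELECT count(*) FROM struttura_table st WHERE st.sito = s.sito) AS struttura,
           (SELECT count(*) FROM tafonomia_table t WHERE t.sito = s.sito) AS tafonomia
    FROM (SELECT sito, count(*) AS us FROM us_table GROUP BY sito) s
    ORDER BY s.us
""").fetchall()

print(joined_rows)  # 160000
print(per_site)     # [('A', 20, 20, 20, 20)]
```

With only 20 rows per table the join already produces 160,000 intermediate rows for one site, which is why the original query degrades so quickly on real data.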

Assuming that id_us, id_invmat, id_struttura and id_tafonomia are all primary keys, you should add indexes on the join columns. Note that index names must be unique within a schema in Postgres, so each index needs its own name:
CREATE INDEX ix_us_table_sito ON us_table (sito ASC);
CREATE INDEX ix_invetario_materiali_sito ON invetario_materiali (sito ASC);
CREATE INDEX ix_struttura_table_sito ON struttura_table (sito ASC);
CREATE INDEX ix_tafonomia_table_sito ON tafonomia_table (sito ASC);
Then you can reduce the complexity this way:
with
_us_table as (
select sito, count(distinct a.id_us) us
from us_table a
group by sito
),
_invetario_materiali as (
select sito, count(distinct b.id_invmat) materiali
from invetario_materiali b
group by sito
),
_struttura_table as (
select sito, count(distinct c.id_struttura) Struttura
from struttura_table c
group by sito
),
_tafonomia_table as (
select sito, count(distinct d.id_tafonomia) tafonomia
from tafonomia_table d
group by sito
)
Select a.sito, a.us, b.materiali, c.Struttura, d.tafonomia
From _us_table as a
Left join _invetario_materiali as b on a.sito=b.sito
Left join _struttura_table as c on a.sito=c.sito
Left join _tafonomia_table as d on a.sito=d.sito
Order by a.us;
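This should be much faster, because each table is aggregated once per sito before the joins, so the joins only see one row per site.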

Unfortunately COUNT(DISTINCT ...) is difficult to improve upon using an index. However, we can at least try adding indices which cover all the joins in your query:
CREATE INDEX inv_mat_idx ON invetario_materiali (sito, id_invmat);
CREATE INDEX strut_tbl_idx ON struttura_table (sito, id_struttura);
CREATE INDEX taf_tbl_idx ON tafonomia_table (sito, id_tafonomia);
Note that the above indices would only help the joins; they would not speed up the aggregation by sito or the distinct counts per group. As @jarlh has noted in the comments, SELECT DISTINCT is superfluous since you are already using GROUP BY, so just use a plain SELECT.

Related

How to get a result set containing the absence of a value?

Scenario: Have a table with four columns. District_Number, District_name, Data_Collection_Week, enrollments. Each week we get data, BUT sometimes we do not.
Task: My supervisor wants me to produce a query that will let us know, which districts did not submit a given week.
What I have tried is below, but I cannot get a NULL value on those that did not submit a week.
SELECT DISTINCT DistrictNumber, DistrictName, DataCollectionWeek
into #test4
FROM EDW_REQUESTS.INSTRUCTION_DELIVERY_ENROLLMENT_2021
order by DistrictNumber, DataCollectionWeek asc
select DISTINCT DataCollectionWeek
into #test5
from EDW_REQUESTS.INSTRUCTION_DELIVERY_ENROLLMENT_2021
order by DataCollectionWeek
select b.DistrictNumber, b.DistrictName, b.DataCollectionWeek
from #test5 a left outer join #test4 b on (a.DataCollectionWeek = b.DataCollectionWeek)
order by b.DistrictNumber, b.DataCollectionWeek asc
One option uses a cross join of two select distinct subqueries to generate all possible combinations of districts and weeks, and then not exists to identify those that are not available in the table:
select d.districtnumber, w.datacollectionweek
from (select distinct districtnumber from edw_requests.instruction_delivery_enrollment_2021) d
cross join (select distinct datacollectionweek from edw_requests.instruction_delivery_enrollment_2021) w
where not exists (
select 1
from edw_requests.instruction_delivery_enrollment_2021 i
where i.districtnumber = d.districtnumber and i.datacollectionweek = w.datacollectionweek
)
This would be simpler (and much more efficient) if you had reference tables that store the districts and weeks: you would then use them directly instead of the select distinct subqueries.
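As a quick check of the cross join plus not exists idea, here is a small sketch in Python's sqlite3 module. The table name `enrollment` and the sample rows are hypothetical stand-ins for EDW_REQUESTS.INSTRUCTION_DELIVERY_ENROLLMENT_2021; only the query shape is taken from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical stand-in table: district 2 skipped week 2, district 3 skipped week 1.
conn.executescript("""
CREATE TABLE enrollment (
    districtnumber INTEGER,
    districtname TEXT,
    datacollectionweek INTEGER,
    enrollments INTEGER
);
INSERT INTO enrollment VALUES
    (1, 'North', 1, 100),
    (1, 'North', 2, 110),
    (2, 'South', 1, 200),
    (3, 'East', 2, 300);
""")

# Build every (district, week) pair, then keep the pairs with no matching row.
missing = conn.execute("""
    SELECT d.districtnumber, w.datacollectionweek
    FROM (SELECT DISTINCT districtnumber FROM enrollment) d
    CROSS JOIN (SELECT DISTINCT datacollectionweek FROM enrollment) w
    WHERE NOT EXISTS (
        SELECT 1
        FROM enrollment i
        WHERE i.districtnumber = d.districtnumber
          AND i.datacollectionweek = w.datacollectionweek
    )
    ORDER BY d.districtnumber, w.datacollectionweek
""").fetchall()

print(missing)  # [(2, 2), (3, 1)]
```

One limitation worth keeping in mind: a week in which no district submitted at all never appears in the table, so it cannot show up here. That is another reason a separate weeks table helps.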

How many types of SQL subqueries are there?

In an effort to understand which types of subqueries can be correlated, I wrote the SQL query shown below. It shows all the types of subqueries I can think of that a SQL select statement can include.
Though the example shown below runs in Oracle 12c, I would prefer to keep this question database agnostic. In the example below I included all 7 types of subqueries I can think of:
with
daily_login as ( -- 1. Independent CTE [XN]
select user_id, trunc(login_time) as day, count(*) from shopper_login
group by user_id, trunc(login_time)
),
frequent_user as ( -- 2. Dependent CTE [XN]
select user_id, count(*) as days from daily_login group by user_id
having count(*) >= 2
),
referrer (frequent_id, id, rid, ref_level) as ( -- 3. Recursive CTE [XN]
select fu.user_id, s.id, s.ref_id, 1 from frequent_user fu
join shopper s on fu.user_id = s.id
union all
select r.frequent_id, s.id, s.ref_id, r.ref_level + 1 from referrer r
join shopper s on s.id = r.rid
)
select s.id, s.name, r.id as original_referrer,
( -- 4. Scalar Subquery [CN]
select max(login_time) from shopper_login l
where l.user_id = s.id and l.success = 1
) as last_login,
m.first_login
from shopper s
join referrer r on r.frequent_id = s.id
join ( -- 5. Table Expression / Inline View / Derived Table [XN]
select user_id, min(login_time) first_login from shopper_login
where success = 1 group by user_id
) m on m.user_id = s.id
where r.rid is null
and s.id not in ( -- 6. Traditional Subquery [CN]
select user_id from persona
where description = 'Fashionista'
and id in ( -- 7. Nested subquery [CN]
select user_id from users where region = 'NORTH')
);
Legend:
[C]: Can be correlated
[X]: Cannot be correlated
[N]: Can include nested subqueries
My questions are:
Did I get all possible types? Are there alternative names for these types of subqueries?
Am I correct thinking that only Scalar (#4), Traditional (#6), and Nested (#7) subqueries can be correlated?
Am I correct thinking Table Expressions and CTEs (#1, #2, and #3) cannot be correlated? (however, they can include Nested subqueries that can be correlated)
Correlated subquery:
FROM shopper s
...
AND EXISTS (SELECT *
FROM otherTable t
WHERE t.id = s.id)
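For illustration, here is a small sketch (Python's sqlite3 module, not Oracle) of two correlated forms side by side: a scalar subquery in the select list and an EXISTS predicate in the where clause. The `shopper` and `shopper_login` names follow the question; the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Invented sample data; Cy has never logged in, Bob has only a failed login.
conn.executescript("""
CREATE TABLE shopper (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE shopper_login (user_id INTEGER, login_time TEXT, success INTEGER);
INSERT INTO shopper VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO shopper_login VALUES
    (1, '2021-01-01', 1),
    (1, '2021-01-05', 1),
    (2, '2021-01-03', 0);
""")

# Scalar subquery correlated on s.id: conceptually evaluated once per outer row.
last_logins = conn.execute("""
    SELECT s.name,
           (SELECT max(l.login_time)
            FROM shopper_login l
            WHERE l.user_id = s.id AND l.success = 1) AS last_login
    FROM shopper s
    ORDER BY s.id
""").fetchall()

# EXISTS predicate correlated on s.id: keeps shoppers with at least one login row.
with_login = conn.execute("""
    SELECT s.name
    FROM shopper s
    WHERE EXISTS (SELECT 1 FROM shopper_login l WHERE l.user_id = s.id)
    ORDER BY s.id
""").fetchall()

print(last_logins)  # [('Ann', '2021-01-05'), ('Bob', None), ('Cy', None)]
print(with_login)   # [('Ann',), ('Bob',)]
```

In both queries the inner select refers to `s.id` from the outer query, which is what makes it correlated.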

Full text search with CONTAINS is very slow

We are trying to use full-text search on an Azure database and have run into performance problems with CONTAINS.
Our data has a star schema. The Fact table has a clustered columnstore index and around 40 million rows. Below is how we use CONTAINS on a dimension table and aggregate the Fact table in different queries:
Query 1 using EXISTS:
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f
WHERE EXISTS (
SELECT * FROM [SPENDBY].[DimCompanyCode] d
WHERE f.[FK_DimCompanyCodeId] = d.Id
AND CONTAINS(d.*, 'Comcast'))
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC
This query seems to run forever and never returns a result.
There is a non-clustered index on the foreign key [FK_DimCompanyCodeId], and searching for Comcast returns only one row:
SELECT id FROM [SPENDBY].[DimCompanyCode] d
WHERE CONTAINS(d.*, 'Comcast');
-- will return id = 5
Around 27 million rows of the Fact table have FK_DimCompanyCodeId = 5.
Query 2 using INNER JOIN:
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f
INNER JOIN [SPENDBY].[DimCompanyCode] d ON (f.[FK_DimCompanyCodeId] = d.Id)
WHERE CONTAINS(d.*, 'Comcast')
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC
This query also seems to run forever and never returns a result.
Query 3 using #temp table:
SELECT id INTO #temp FROM [SPENDBY].[DimCompanyCode] d
WHERE CONTAINS(d.*, 'Comcast');
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f
WHERE EXISTS (
SELECT * FROM #temp
WHERE f.[FK_DimCompanyCodeId] = #temp.Id)
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC
Very fast, returns the result after 5 seconds.
Why is full-text search so slow in case 1 and case 2?
The problem is competing indexes -- one for the JOIN and one for the filter. Perhaps a subquery would convince SQL Server to use the text index first:
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f JOIN
(SELECT id
FROM [SPENDBY].[DimCompanyCode] cc
WHERE CONTAINS(cc.*, 'Comcast')
) cc
ON cc.id = f.FK_DimCompanyCodeId
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC
It would probably also help if you have an index on FactInvoiceDetail(FK_DimCompanyCodeId).
Eventually, I figured out that CONTAINS works well when it targets a specific column (Description, for example):
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f
WHERE f.[FK_DimCompanyCodeId] IN (
SELECT d.Id FROM [SPENDBY].[DimCompanyCode] d
WHERE CONTAINS(d.[Description], 'Comcast')
)
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC
To search across all columns of the table, CONTAINSTABLE gave the best performance and avoids the #temp table:
SELECT f.[FK_DimCompanyCodeId], SUM(f.NetValueInUSD)
FROM [SPENDBY].[FactInvoiceDetail] f
LEFT OUTER JOIN CONTAINSTABLE([SPENDBY].[DimCompanyCode], *, '"Comcast"') ct
ON f.[FK_DimCompanyCodeId] = ct.[Key]
WHERE ct.[Key] IS NOT NULL
GROUP BY f.[FK_DimCompanyCodeId]
ORDER BY SUM(f.NetValueInUSD) DESC

Performance of a left join with MAX that must include NULL values

The table holds more than 7 billion rows.
I want to display the max EntryDate of the affiliations for each participant, and I want to include participants with NULL values, so I used a left join. The query gives me the expected results, but it takes many minutes.
Does anyone have a better idea or another solution to improve the performance?
Select ParticipantID,MaxDate
From dbo.Participant Par
LEFT JOIN dbo.Affiliation Aff
ON AFF.ParticipantID=Par.ParticipantID
LEFT JOIN (
SELECT AFF.AffiliationID,
MAX(EntryDate) as MaxDate
FROM dbo.Affiliation
GROUP BY AFF.AffiliationID
)AS AFF1
ON AFF1.AffiliationID = AFF.AffiliationID
AND AFF1.MaxDate = AFF.EntryDate
I think the first join is unnecessary:
SELECT ParticipantID, MaxDate
FROM dbo.Participant Par
OUTER APPLY (
SELECT MAX(EntryDate) as MaxDate
FROM dbo.Affiliation Aff
WHERE Aff.ParticipantID = Par.ParticipantID
) A
You also need an index on Affiliation:
CREATE INDEX IX_Affiliation_ParticipantID_EntryDate ON dbo.Affiliation(ParticipantID, EntryDate)
@llyas There could be a lot more to consider if you turn on the XML showplan or include the actual execution plan and check the subtree cost.
Anyway, you can also write this query with the ROW_NUMBER function:
WITH par
AS (
SELECT Par.ParticipantID
,Aff.EntryDate AS MaxDate
,ROW_NUMBER() OVER (
PARTITION BY Par.ParticipantID ORDER BY Aff.EntryDate DESC
) rn
FROM dbo.Participant Par
LEFT JOIN dbo.Affiliation Aff
ON Aff.ParticipantID = Par.ParticipantID
)
SELECT ParticipantID
,MaxDate
FROM par
WHERE rn = 1
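Here is a small sanity check of the "max per participant, keeping participants without affiliations" logic, sketched with Python's sqlite3 module. SQLite has no OUTER APPLY, so a correlated scalar subquery stands in for it here (an assumption of this sketch, not the answer's exact syntax), alongside a plain LEFT JOIN with GROUP BY. The sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Invented sample data: participant 3 has no affiliation at all.
conn.executescript("""
CREATE TABLE Participant (ParticipantID INTEGER PRIMARY KEY);
CREATE TABLE Affiliation (
    AffiliationID INTEGER PRIMARY KEY,
    ParticipantID INTEGER,
    EntryDate TEXT
);
CREATE INDEX IX_Affiliation_ParticipantID_EntryDate
    ON Affiliation (ParticipantID, EntryDate);
INSERT INTO Participant VALUES (1), (2), (3);
INSERT INTO Affiliation VALUES
    (10, 1, '2020-01-01'),
    (11, 1, '2021-06-15'),
    (12, 2, '2019-03-09');
""")

# Correlated scalar subquery: one MAX lookup per participant, NULL when none exist.
by_subquery = conn.execute("""
    SELECT Par.ParticipantID,
           (SELECT MAX(Aff.EntryDate)
            FROM Affiliation Aff
            WHERE Aff.ParticipantID = Par.ParticipantID) AS MaxDate
    FROM Participant Par
    ORDER BY Par.ParticipantID
""").fetchall()

# Equivalent result with a single LEFT JOIN and GROUP BY.
by_group = conn.execute("""
    SELECT Par.ParticipantID, MAX(Aff.EntryDate) AS MaxDate
    FROM Participant Par
    LEFT JOIN Affiliation Aff ON Aff.ParticipantID = Par.ParticipantID
    GROUP BY Par.ParticipantID
    ORDER BY Par.ParticipantID
""").fetchall()

print(by_subquery)  # [(1, '2021-06-15'), (2, '2019-03-09'), (3, None)]
```

Both forms return one row per participant, with NULL for participant 3. Which one is faster on a table of this size depends on the actual execution plan, so it is worth comparing plans on the real data.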

How to fetch all rows from table with count of specific group?

I have a simple table like this
spatialite> select id, group_id, object_id, object, param from controlled_object;
1|1|150|nodes|0.5
2|1|186|nodes|0.5
3|2|372|nodes|1.0
The second column is group_id. I want to retrieve all entries from the table, plus the count of the group.
1|1|150|nodes|0.5|2
2|1|186|nodes|0.5|2
3|2|372|nodes|1.0|1
I thought a cross join would be the way to go
SELECT
*
, cj.cnt
FROM
controlled_object
CROSS JOIN (
SELECT
COUNT(DISTINCT group_id) AS cnt
FROM
controlled_object
) AS cj
But that gives me
1|1|150|nodes|0.5|2|2
2|1|186|nodes|0.5|2|2
3|2|372|nodes|1.0|2|2
How do I fetch all rows from table including the count of a specific group?
Join the source data with the per-group counters, grouped by group_id:
select c.id, c.group_id, c.object_id, c.object, c.param,cnt from controlled_object c join
(select group_id,count(*) cnt from controlled_object group by group_id) p on c.group_id =p.group_id ;
Not a very good idea for big tables.
SQLite is not a very good idea for big tables at all :-)
You can compute the count with a correlated subquery:
SELECT id,
group_id,
object_id,
object,
param,
(SELECT count(*)
FROM controlled_object AS co2
WHERE group_id = controlled_object.group_id)
FROM controlled_object;
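To confirm that both suggestions give the output asked for, here is a short sketch with Python's sqlite3 module using the three sample rows from the question. The `cnt` alias on the subquery is added here for readability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The three sample rows from the question.
conn.executescript("""
CREATE TABLE controlled_object (
    id INTEGER PRIMARY KEY,
    group_id INTEGER,
    object_id INTEGER,
    object TEXT,
    param REAL
);
INSERT INTO controlled_object VALUES
    (1, 1, 150, 'nodes', 0.5),
    (2, 1, 186, 'nodes', 0.5),
    (3, 2, 372, 'nodes', 1.0);
""")

# Variant 1: join each row to a derived table of per-group counts.
joined = conn.execute("""
    SELECT c.id, c.group_id, c.object_id, c.object, c.param, p.cnt
    FROM controlled_object c
    JOIN (SELECT group_id, count(*) AS cnt
          FROM controlled_object
          GROUP BY group_id) p ON c.group_id = p.group_id
    ORDER BY c.id
""").fetchall()

# Variant 2: correlated subquery that counts the rows of the current row's group.
correlated = conn.execute("""
    SELECT id, group_id, object_id, object, param,
           (SELECT count(*)
            FROM controlled_object AS co2
            WHERE co2.group_id = controlled_object.group_id) AS cnt
    FROM controlled_object
    ORDER BY id
""").fetchall()

for row in joined:
    print("|".join(str(v) for v in row))
# 1|1|150|nodes|0.5|2
# 2|1|186|nodes|0.5|2
# 3|2|372|nodes|1.0|1
```

The cross join in the question returned 2 on every row because COUNT(DISTINCT group_id) counts how many groups exist in the whole table, not how many rows each group has.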