SQL: Counting and Numbering Duplicates - Optimising Correlated Subquery

SQL: Counting and Numbering Duplicates - Optimising Correlated Subquery - sql

In an SQLite database I have one table where I need to count the duplicates across certain columns (i.e. rows where 3 particular columns are the same) and then also number each of these cases (i.e. if there are 2 occurrences of a particular duplicate, they need to be numbered as 1 and 2). I'm finding it a bit difficult to explain in words so I'll use a simplified example below.
The data I have is similar to the following (first line is header row, table is referenced in following as "idcountdata"):
id match1 match2 match3 data
1 AbCde BC 0 data01
2 AbCde BC 0 data02
3 AbCde BC 1 data03
4 AbCde AB 0 data04
5 FGhiJ BC 0 data05
6 FGhiJ AB 0 data06
7 FGhiJ BC 1 data07
8 FGhiJ BC 1 data08
9 FGhiJ BC 2 data09
10 HkLMop BC 1 data10
11 HkLMop BC 1 data11
12 HkLMop BC 1 data12
13 HkLMop DE 1 data13
14 HkLMop DE 2 data14
15 HkLMop DE 2 data15
16 HkLMop DE 2 data16
17 HkLMop DE 2 data17
And the output I need to generate for the above would be:
id match1 match2 match3 data matchid matchcount
1 AbCde BC 0 data01 1 2
2 AbCde BC 0 data02 2 2
3 AbCde BC 1 data03 1 1
4 AbCde AB 0 data04 1 1
5 FGhiJ BC 0 data05 1 1
6 FGhiJ AB 0 data06 1 1
7 FGhiJ BC 1 data07 1 2
8 FGhiJ BC 1 data08 2 2
9 FGhiJ BC 2 data09 1 1
10 HkLMop BC 1 data10 1 3
11 HkLMop BC 1 data11 2 3
12 HkLMop BC 1 data12 3 3
13 HkLMop DE 1 data13 1 1
14 HkLMop DE 2 data14 1 4
15 HkLMop DE 2 data15 2 4
16 HkLMop DE 2 data16 3 4
17 HkLMop DE 2 data17 4 4
Previously I was using a couple of correlated subqueries to achieve this as follows:
SELECT id, match1, match2, match3, data,
(SELECT count(*) FROM idcountdata d2
WHERE d1.match1=d2.match1 AND d1.match2=d2.match2 AND d1.match3=d2.match3
AND d2.id<=d1.id)
AS matchid,
(SELECT count(*) FROM idcountdata d2
WHERE d1.match1=d2.match1 AND d1.match2=d2.match2 AND d1.match3=d2.match3)
AS matchcount
FROM idcountdata d1;
But the table has over 200,000 rows (and the data can be variable in length/content) and hence this takes hours to run. (Strangely, when I first used the same query on the same data back in mid-to-late 2013 it took minutes rather than hours, but that is beside the point - even back then I thought it was inelegant and inefficient.)
I've already converted the correlated subquery for "matchcount" in the above to an uncorrelated subquery with a JOIN as follows:
SELECT d1.id, d1.match1, d1.match2, d1.match3, d1.data,
matchcount
FROM idcountdata d1
JOIN
(SELECT id,match1,match2,match3,count(*) matchcount
FROM idcountdata
GROUP BY match1,match2,match3) d2
ON (d1.match1=d2.match1 and d1.match2=d2.match2 and d1.match3=d2.match3);
So it's just the subquery for "matchid" that I would like some help to optimise.
In short, the following query runs too slowly for larger datasets:
SELECT id, match1, match2, match3, data,
(SELECT count(*) FROM idcountdata d2
WHERE d1.match1=d2.match1 AND d1.match2=d2.match2 AND d1.match3=d2.match3
AND d2.id<=d1.id)
matchid
FROM idcountdata d1;
How can I improve the performance of the above query?
It doesn't have to run in seconds, but it needs to be minutes rather than hours (for around 200,000 rows).

A self join may be faster than a correlated subquery
SELECT d1.id, d1.match1, d1.match2, d1.match3, d1.data, count(*) matchid
FROM idcountdata d1
JOIN idcountdata d2 on d1.match1 = d2.match1
and d1.match2 = d2.match2
and d1.match3 = d2.match3
and d1.id >= d2.id
GROUP BY d1.id, d1.match1, d1.match2, d1.match3, d1.data
This query can take advantage of a composite index on (match1,match2,match3,id)

Related

Select data based on few consideration

I wanted to select data based on the below considerations -
Now we have two tables -
TABLE 1 -
DLR_ID
CALL_ID
VEHICLE
PLAN_NO
CALL_STATUS
1
11
AA
5
Generated
2
12
AA
5
Generated
1
13
AA
10
Generated
2
14
AA
10
Not Generated
1
15
BB
5
Generated
1
16
BB
10
Generated
2
17
CC
5
Not Generated
3
18
CC
5
Generated
1
19
DD
5
Not Generated
4
20
DD
5
Not Generated
3
21
EE
5
Generated
2
22
FF
10
Generated
4
23
FF
10
Generated
5
24
GG
20
Generated
6
25
GG
20
Generated
TABLE 2 -
DLR_ID
CALL_ID
CALL_COUNT
CALL_RESULT_STATUS
CALL_DATE(DD/MM/YYYY)
1
11
1
Continue
16/03/2021
1
11
2
Give-up
20/03/2021
2
12
1
Completed
15/03/2021
1
13
1
Continue
01/04/2021
1
15
1
Completed
21/02/2021
1
16
1
Give-up
20/03/2021
3
18
1
Continue
21/05/2021
3
21
1
Give-up
24/04/2021
2
22
1
Completed
19/03/2021
4
23
1
Completed
03/05/2021
5
24
1
Continue
11/02/2021
5
24
2
Completed
11/05/2021
6
25
1
Continue
10/02/2021
6
25
2
Continue
21/02/2021
6
25
3
Continue
21/04/2021
OUTPUT -
DLR_ID
VEHICLE
PLAN_NO
CALL_STATUS
CALL_ID
CALL_DATE
CALL_RESULT_STATUS
1
AA
5
Generated
12
15/03/2021
Completed
2
AA
5
Generated
12
15/03/2021
Completed
1
AA
10
Generated
13
01/04/2021
Continue
2
AA
10
Not Generated
13
01/04/2021
Continue
1
BB
5
Generated
15
21/02/2021
Completed
1
BB
10
Generated
15
21/02/2021
Completed
2
CC
5
Not Generated
18
21/05/2021
Continue
3
CC
5
Generated
18
21/05/2021
Continue
1
DD
5
Not Generated
4
DD
5
Not Generated
3
EE
5
Generated
21
21/04/2021
Give-up
2
FF
10
Generated
23
03/05/2021
Completed
4
FF
10
Generated
23
03/05/2021
Completed
5
GG
20
Generated
24
11/05/2021
Completed
6
GG
20
Generated
24
11/05/2021
Completed
Kindly help me out in extracting the building oracle query to extract the data like mentioned in OUTPUT table.
Code which I was trying is -
SELECT t1.DLR_id, t1.VEHICLE,t1.PLAN_NO,t1.CALL_STATUS,
NVL(MAX(CASE WHEN t1.CALL_STATUS='Generated' and t2.CALL_RESULT_STATUS = 'Completed' THEN t2.CALL_ID END),
MAX(CASE WHEN t1.CALL_STATUS!='Generated' and t2.CALL_RESULT_STATUS != 'Completed' THEN t2.CALL_ID END)) as CALL_ID
FROM Table1 t1
left JOIN Table2 t2
ON t1.DLR_ID=t2.DLR_ID
and t2.call_id = t1.call_id
group by T1.DLR_ID,t1.VEHICLE,t1.PLAN_NO,
T1.CALL_STATUS
order by t1.VEHICLE,t1.plan_no,t1.dlr_id

Conditional sum in SQL (SAS) (SUMIFS equivalent) - Part 2

Let say I have to table:
Table1:
ID Item
1 A
1 B
1 A
2 B
2 B
3 A
3 B
3 B
3 A
Table2:
ID A B C
1 91 94 90
2 100 97 93
1 97 94 96
2 97 95 90
3 99 100 93
1 90 97 97
Now I would like to take the sum conditional for my table1 from table2 (when the ID by row and the Item match by COLUMN):
ID Item Want
1 A 278
1 B 285
2 A 197
2 B
2 B
3 A
3 B
3 B
3 A
So that I have 278 is the sum of all item 1 in column A, 285 is the sum of all itme 1 in column B, 197 is the sum of all item 2 in column A.
So what am I supposed to do in SQL?
Thanks in advance.

You can use join and conditional aggregation:
select t1.id, t1.item,
sum(case when t1.item = 'A' then t2.A
when t1.item = 'B' then t2.B
when t1.item = 'C' then t2.C
end) as want
from table1 t1 left join
table2 t2
on t1.id = t2.id
group by t1.id, t1.item

Proc MEANS is built from the ground up for the sole purpose of computing statistics for aggregates.
Consider this example:
data have; input
ID $ A B C; datalines;
1 91 94 90
2 100 97 93
1 97 94 96
2 97 95 90
3 99 100 93
1 90 97 97
;
ods _all_ close;
proc means data=have stackodsoutput sum;
class id;
var a b c;
ods output summary=want;
run;
that produces data set

return the count of row even if null sql server

I trying to do a sql query to get the count for shift for each user
I used this query :
SELECT
COUNT(s.id) AS count, s.user_id
FROM
sarcuser AS u
INNER JOIN
sarcshiftpointuser AS s ON s.user_id = u.id
INNER JOIN
sarcalllevel AS l ON l.id = u.levelid
INNER JOIN
sarcshiftpointtable AS t ON t.shift_id = s.shift_id AND s.table_id = t.table_id
WHERE
(s.shift_id + '' LIKE '2')
AND (CAST(s.xdate AS DATE) BETWEEN CAST(N'2014-01-01' AS DATE) AND CAST(N'2015-01-01' AS DATE))
AND (u.gender + '' LIKE N'%')
AND (u.levelid + '' LIKE N'%')
AND (s.point_id + '' LIKE '2')
GROUP BY
s.user_id
ORDER BY
count
It works very well ... but there is a logic problem :
when the user didn't appear in the shift didn't return the count and I need it to return 0
For example :
user1 user2
shift1 2 2
shift2 5 0
shift3 6 10
but actually the code returns :
user1 user2
shift1 2 2
shift2 5 10
shift3 6
and that's wrong ... how to return the count even if it zero with this condition and this inner join ?
Sample for data in table :
sarcuser :
id firstname lastname gender levelid
52 samy sammour male 1
62 ibrahim jackob male 1
71 rebeca janson female 3
sarcalllevel :
id name
1 field leader
2 leader
3 paramdic
sarcshiftpointtable :
id shift_id table_id name_of_shift point_id
1 1 1 shift1 2
2 2 1 shift2 2
3 3 1 shift3 2
4 1 2 shift1 7
5 2 2 shift2 7
6 3 2 shift3 7
sarcshiftpointuser :
id point_id shift_id table_id user_id xdate
1 2 1 1 62 2014-01-05
2 2 1 1 0 2014-01-05
3 2 1 1 71 2014-01-05
4 2 2 1 0 2014-01-05
5 2 2 1 0 2014-01-05
6 2 2 1 52 2014-01-05
7 2 3 1 52 2014-01-05
8 2 3 1 62 2014-01-05
9 2 3 1 71 2014-01-05
10 2 1 1 71 2014-01-06
11 2 1 1 52 2014-01-06
12 2 1 1 0 2014-01-06
13 2 2 1 62 2014-01-06
14 2 2 1 0 2014-01-06
15 2 2 1 52 2014-01-06
16 2 3 1 62 2014-01-06
17 2 3 1 52 2014-01-06
18 2 3 1 71 2014-01-06
if i apply this query 3 times by changing the shift should return :
52 62 71
shift1 1 2 2
shift2 2 1 0
shift3 2 2 2
in shift2 in sarcshiftpointuser the user 71 is not appear
so when I do the code it will return just to field not three ? the count 0 is not returned
52 62 71
shift2 2 1
to be more specific :
I need to export this table into excel so when the 0 is not return it give me a wrong order and wrong value (logically )

You will need to use a nested query using IFNULL
Take a look to this
http://www.w3schools.com/sql/sql_isnull.asp
Something like,
IFNULL(user,0)

I think you are referring a crosstab query. you can use PIVOT to return your result set. Please refer below link.
Sql Server 2008 Cross Tab Query.
If you give few sample data for sarcuser , sarcshiftpointuser, sarcalllevel & sarcshiftpointtable tables, then we can give you a better answer.

SQL terminology to combine a NOT EXIST query with latest value

I am a beginner with basic knowledge.
I have a single table that I am trying to pull all UID's that have not had a particular code in the table within the past year.
My table looks like this: (but much larger of course)
FACID DPID EID DID UID DT Code Units Charge ET Ord
1 1 6 2 1002 15-Mar-07 99204 1 180 09:36.7 1
1 1 7 5 10004 15-Mar-07 99213 1 68 02:36.9 1
1 1 24 55 25887 15-Mar-07 99213 1 68 43:55.3 1
1 1 25 2 355688 15-Mar-07 99213 1 68 53:20.2 1
1 1 26 5 555654 15-Mar-07 99213 1 68 42:22.6 1
1 1 27 44 135514 15-Mar-07 99213 1 68 00:36.8 1
1 1 28 2 3244522 15-Mar-07 99214 1 98 34:59.4 1
1 1 29 5 235445 15-Mar-07 99213 1 68 56:42.1 1
1 1 30 3 3214444 15-Mar-07 99213 1 68 54:56.5 1
1 1 33 1 221444 15-Mar-07 99204 1 180 37:44.5 1
I am attempting to use the following, but this is not working for my time frame limits.
select distinct UID from PtProcTbl
where DT<'20120101'
and NOT EXISTS (Select Distinct UID
where Code in ('99203','99204','99205','99213',
'99214','99215','99244','99245'))
I need to know how to make sure the UID's that I am pulling are the ones don't have a DT after the 1/1/2012 cut off date that contains one of the NOT Exists codes.
The above query returned UID's that actually dates after 1/1/2012 that does contain one of the above codes...
Not sure what I am doing wrong or if I am totally off base on this..
Thanks in advance.

Are you sure you need the NOT EXISTS? How about instead:
AND Code NOT IN ('99203','99204','99205','99213','99214','99215','99244','99245')

Using Sum() with multiple where clauses

I'm pretty new to this, so forgive if this has been posted (I had no idea what to even search on).
I have 2 tables, Accounts and Usage
AccountID AccountStartDate AccountEndDate
-------------------------------------------
1 12/1/2012 12/1/2013
2 1/1/2013 1/1/2014
UsageId AccountID EstimatedUsage StartDate EndDate
------------------------------------------------------
1 1 10 1/1 1/31
2 1 11 2/1 2/29
3 1 23 3/1 3/31
4 1 23 4/1 4/30
5 1 15 5/1 5/31
6 1 20 6/1 6/30
7 1 15 7/1 7/31
8 1 12 8/1 8/31
9 1 14 9/1 9/30
10 1 21 10/1 10/31
11 1 27 11/1 11/30
12 1 34 12/1 12/31
13 2 13 1/1 1/31
14 2 13 2/1 2/29
15 2 28 3/1 3/31
16 2 29 4/1 4/30
17 2 31 5/1 5/31
18 2 26 6/1 6/30
19 2 43 7/1 7/31
20 2 32 8/1 8/31
21 2 18 9/1 9/30
22 2 20 10/1 10/31
23 2 47 11/1 11/30
24 2 33 12/1 12/31
I'd like to write one query that gives me estimated usage for each month (starting now until the last month that we serve an account) for all accounts being served during that month.
The results would be as follows:
Month-Year Total Est Usage
------------------------------
Oct-12 0 (none being served)
Nov-12 0 (none being served)
Dec-12 34 (only accountid 1 being served)
Jan-13 23 (accountid 1 & 2 being served)
Feb-13 24 (accountid 1 & 2 being served)
Mar-13 51 (accountid 1 & 2 being served)
...
Dec-13 33 (only accountid 2 being served)
Jan-14 0 (none being served)
Feb-14 0 (none being served)
I'm assuming I need to sum and then do a Group By...but not really sure logically how I'd lay this out.

Revised Answer:
I've created a Months table with columns MonthID, Month with values like (201212, 12), (201301, 1), ...
I've also reorganised the usage table to have a month column rather than the start date and end date, as it makes the idea clearer.
See http://sqlfiddle.com/#!3/f57d84/6 for details
The query is now:
Select
m.MonthID,
Sum(u.EstimatedUsage) TotalEstimatedUsage
From
Accounts a
Inner Join
Usage u
On a.AccountID = u.AccountID
Inner Join
Months m
On m.MonthID Between
Year(a.AccountStartDate) * 100 + Month(a.AccountStartDate) And
Year(a.AccountEndDate) * 100 + Month(a.AccountEndDate) And
m.Month = u.Month
Group By
m.MonthID
Order By
1
Previous answer, for reference which assumed usages ranges were full dates rather than just months.
Select
Year(u.StartDate),
Month(u.StartDate),
Sum(Case When a.AccountStartDate <= u.StartDate And a.AccountEndDate >= u.EndDate Then u.EstimatedUsage Else 0 End) TotalEstimatedUsage
From
Accounts a
Inner Join
Usage u
On a.AccountID = u.AccountID
Group By
Year(u.StartDate),
Month(u.StartDate)
Order By
1, 2

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

SQL: Counting and Numbering Duplicates - Optimising Correlated Subquery - sql

Related

Select data based on few consideration

Conditional sum in SQL (SAS) (SUMIFS equivalent) - Part 2

return the count of row even if null sql server

SQL terminology to combine a NOT EXIST query with latest value

Using Sum() with multiple where clauses

Categories

Resources