Overlapping in SQL - sql

I have a table with the following data:
User# App
1 A
1 B
2 A
2 B
3 A
I want to know the overlap between Apps by distinct Users, so my end result will look like this:
App1 App2 DistinctUseroverlapped
A A 3
A B 2
B B 2
The result means that 3 distinct users use App A, 2 users use both App A and App B, and 2 users use App B.
Keep in mind that there are a lot of apps and users. How can I do this in SQL?

My solution starts by generating all possible pairs of applications that are of interest. This is the driver subquery.
It then joins in the original data for each of the apps.
Finally, it uses count(distinct) to count the distinct users that match between the two lists.
select pairs.app1, pairs.app2,
COUNT(distinct case when tleft.user = tright.user then tleft.user end) as NumCommonUsers
from (select t1.app as app1, t2.app as app2
from (select distinct app
from t
) t1 cross join
(select distinct app
from t
) t2
where t1.app <= t2.app
) pairs left outer join
t tleft
on tleft.app = pairs.app1 left outer join
t tright
on tright.app = pairs.app2
group by pairs.app1, pairs.app2
You could move the conditional comparison in the count to the joins and just use count(distinct):
select pairs.app1, pairs.app2,
COUNT(distinct tright.user) as NumCommonUsers
from (select t1.app as app1, t2.app as app2
from (select distinct app
from t
) t1 cross join
(select distinct app
from t
) t2
where t1.app <= t2.app
) pairs left outer join
t tleft
on tleft.app = pairs.app1 left outer join
t tright
on tright.app = pairs.app2 and
tright.user = tleft.user
group by pairs.app1, pairs.app2
I prefer the first method because it is more explicit about what is being counted.
This is standard SQL, so it should work on Vertica.
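As a quick sanity check of the first query, here is a minimal sketch that runs it against the sample rows from the question using Python's built-in sqlite3 module. The table name `t` comes from the answer; the column name `user_id` (instead of `user`) and the use of SQLite rather than Vertica are assumptions made only so the example is self-contained.

```python
import sqlite3

# Sample rows from the question: (user, app).
ROWS = [(1, "A"), (1, "B"), (2, "A"), (2, "B"), (3, "A")]

# Same shape as the first query above; `user_id` is a hypothetical column name.
OVERLAP_SQL = """
select pairs.app1, pairs.app2,
       count(distinct case when tleft.user_id = tright.user_id
                           then tleft.user_id end) as NumCommonUsers
from (select t1.app as app1, t2.app as app2
      from (select distinct app from t) t1
      cross join (select distinct app from t) t2
      where t1.app <= t2.app
     ) pairs
left outer join t tleft  on tleft.app  = pairs.app1
left outer join t tright on tright.app = pairs.app2
group by pairs.app1, pairs.app2
order by pairs.app1, pairs.app2
"""


def overlap_counts(rows):
    """Load rows into an in-memory table and return (app1, app2, count) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (user_id integer, app text)")
    conn.executemany("insert into t values (?, ?)", rows)
    result = conn.execute(OVERLAP_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for app1, app2, n in overlap_counts(ROWS):
        print(app1, app2, n)
```

Adding a user who only has App B is a useful extra test: the (A, B) count should stay the same while (B, B) grows by one.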

This works in Vertica 6:
with tab as
( select 1 as user,'A' as App
union select 1 as user,'B' as App
union select 2 as user,'A' as App
union select 2 as user,'B' as App
union select 3 as user,'A' as App
)
-- self-join on user so that only users present in both apps are counted
select t1.app as APP1,t2.app as APP2 ,count(distinct t1.user)
from tab t1 join tab t2 on t2.user = t1.user
where t2.app>=t1.app
group by 1,2
order by 1,2

Related

Select minimal count of grouping by result without window functions

As always, I want to do this with one SQL request. I have a table of send attempts:
ID TIMESTAMP TASK_ID
1 2019-01-30 15:29:38 1
2 2019-01-30 15:29:39 1
3 2019-01-30 15:29:40 2
4 2019-01-30 15:29:41 3
Task table:
ID EMAIL
1 boxOne#test.com
2 boxOne#test.com
3 boxTwo#test.com
The purpose is to get the task ids, per unique email, that have the minimal count of attempts (in our case, 2 and 3). Problem number one is that I want to run some tests using H2, which does not support window functions. Problem two is that several tasks can have the same email.
I tried this:
SELECT TASK.id, TASK.EMAIL, count(att.TASK_ID)
FROM TASK
JOIN ATTEMPTS att on TASK.id = att.TASK_ID
GROUP BY att.TASK_ID
and got this result:
TASK.id EMAIL count(TASK.id)
1 boxOne#test.com 2
2 boxOne#test.com 1
3 boxTwo#test.com 1
But I need the minimal count for each unique email, like this:
TASK.id EMAIL count(TASK.id)
2 boxOne#test.com 1
3 boxTwo#test.com 1
min(count(TASK.id)) didn't work for me; the result is always zero.
Can this be done without window functions, or should I accept the intermediate result and process it in my code?
Try using a correlated subquery with HAVING and ALL:
SELECT t.id, t.email, count(a.task_ID) cnt
FROM task t
JOIN attempts a on t.id = a.task_ID
GROUP BY t.id, t.email
HAVING count(a.task_ID) <= ALL
(
SELECT count(a.task_ID)
FROM task t2
JOIN attempts a on t2.id = a.task_ID
WHERE t2.email = t.email
GROUP BY t2.id
)
You can try using a correlated subquery:
select distinct t1.* from
(
SELECT TASK.id, TASK.EMAIL, count(ATTEMPTS.TASK_ID) cnt
FROM TASK
JOIN ATTEMPTS on TASK.id = ATTEMPTS.TASK_ID
group by TASK.id, TASK.EMAIL
) t1 where t1.cnt= (select min(cnt) from
(SELECT TASK.id, TASK.EMAIL, count(ATTEMPTS.TASK_ID) cnt
FROM TASK
JOIN ATTEMPTS on TASK.id = ATTEMPTS.TASK_ID
group by TASK.id, TASK.EMAIL
) t2 where t2.EMAIL=t1.EMAIL)
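Here is a minimal sketch of the "count per task, then keep the per-email minimum" idea, run on the question's sample rows with Python's built-in sqlite3 module. SQLite stands in for H2 here, which is an assumption; the query uses only grouping and a correlated scalar subquery, no window functions.

```python
import sqlite3

# Sample rows from the question.
TASKS = [(1, "boxOne#test.com"), (2, "boxOne#test.com"), (3, "boxTwo#test.com")]
ATTEMPTS = [
    (1, "2019-01-30 15:29:38", 1),
    (2, "2019-01-30 15:29:39", 1),
    (3, "2019-01-30 15:29:40", 2),
    (4, "2019-01-30 15:29:41", 3),
]

# Attempts per task, filtered to rows whose count equals the minimum for that email.
MIN_ATTEMPTS_SQL = """
select t1.id, t1.email, t1.cnt
from (select task.id, task.email, count(attempts.task_id) as cnt
      from task
      join attempts on task.id = attempts.task_id
      group by task.id, task.email) t1
where t1.cnt = (select min(t2.cnt)
                from (select task.id, task.email, count(attempts.task_id) as cnt
                      from task
                      join attempts on task.id = attempts.task_id
                      group by task.id, task.email) t2
                where t2.email = t1.email)
order by t1.id
"""


def tasks_with_min_attempts(tasks, attempts):
    """Return (task id, email, attempt count) for the least-attempted tasks per email."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table task (id integer primary key, email text)")
    conn.execute(
        "create table attempts (id integer primary key, timestamp text, task_id integer)"
    )
    conn.executemany("insert into task values (?, ?)", tasks)
    conn.executemany("insert into attempts values (?, ?, ?)", attempts)
    result = conn.execute(MIN_ATTEMPTS_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in tasks_with_min_attempts(TASKS, ATTEMPTS):
        print(row)
```

Note that tasks with no attempts at all drop out because of the inner join, and that ties within an email return every tied task.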

How can I count and show the number of times a row appears within multiple joined tables?

I wrote the query below, which shows me the ApplicationIDs associated with two specific tables. I need the results to return the number of times each Applications.AppID appears in those tables, next to the row with the application name. I've used distinct because I only want each name to appear once in my results, with a number next to it indicating how many times it has been used. Examples are below. I've written count conditions before, but only for single tables.
SELECT 0 AppId ,
'Select an Application' ApplicationName
union all
select .1 ,
'--All--'
union all
SELECT DISTINCT
Applications.AppId ,
Applications.ApplicationName
FROM ImpactedApplications ,
SupportingApplications
JOIN applications ON SupportingApplications.Appid = applications.appid
JOIN ImpactedApplications Apps on SupportingApplications.AppId = Applications.AppId
Returns something like this:
0.0 Select an Application
0.1 --All--
12.0 APP A
59.0 APP B
60.0 APP C
71.0 APP D
74.0 APP E
121.0 APP F
124.0 APP G
130.0 APP H
I want it to return something like this:
0.0 Select an Application
0.1 --All--
12.0 APP A 1
59.0 APP B 2
60.0 APP C 1
71.0 APP D 4
74.0 APP E 3
121.0 APP F 1
124.0 APP G 2
130.0 APP H 2
Any help is appreciated, thank you.
Adding Results from Help Query
12 APP A 17161
59 APP B 51483
60 APP C 85805
71 APP D 17161
DISTINCT is logically equivalent to a GROUP BY:
SELECT Applications.AppId, Applications.ApplicationName
,COUNT(*)
FROM SupportingApplications
INNER JOIN applications ON SupportingApplications.Appid = applications.appid
INNER JOIN ImpactedApplications as Apps on SupportingApplications.AppId = Applications.AppId
GROUP BY Applications.AppId, Applications.ApplicationName
First, you do realize that order is unspecified unless you order the result set using order by? That means there is no guarantee that the first two selects in your union all will come first.
So, let's strip those two out as they are really extraneous to the actual problem. Let us consider the core select:
select distinct
Applications.AppId ,
Applications.ApplicationName
from ImpactedApplications ,
SupportingApplications
JOIN applications ON SupportingApplications.Appid = applications.appid
JOIN ImpactedApplications Apps on SupportingApplications.AppId = Applications.AppId
and dissect it.
Problem #1.
select distinct is often a code smell indicating that you don't have the correct join criteria or you don't correctly understand the cardinality of the relationships involved.
Problem #2. Indeed, this is the case. You are mixing old-school, pre-ISO/ANSI joins with ISO/ANSI joins. Since the first two tables in the FROM clause are joined pre-ISO/ANSI style, and you have no where clause with criteria to join them, the above select statement is identical to
select distinct
a.AppId ,
a.ApplicationName
from ImpactedApplications ia
cross join SupportingApplications sa
join applications a on sa.Appid = a.appid
join ImpactedApplications Apps on sa.AppId = a.AppId
I'm pretty sure you didn't intend to generate the cartesian product of the 2 tables. You haven't described the table schema, but my suspicion, from your problem statement
I need the results to return the number of times each Applications.AppID
appears in those tables next to the row with the application name.
is that you want something more along these lines:
select AppId = a.AppId ,
AppName = a.ApplicationName ,
ImpactedCount = coalesce( ia.Cnt , 0 ) ,
SupportingCount = coalesce( sa.Cnt , 0 ) ,
Total = coalesce( ia.Cnt , 0 )
+ coalesce( sa.Cnt , 0 )
from Applications a
left join ( select AppId = t.AppId ,
Cnt = count(*)
from ImpactedApplications t
group by t.AppId
) ia on ia.AppId = a.AppId
left join ( select AppId = t.AppId ,
Cnt = count(*)
from SupportingApplications t
group by t.AppId
) sa on sa.AppId = a.AppId
If you want to restrict the results to just those rows with non-zero values, you could change the left join clauses to join, but that would mean you would only get those rows that have a non zero value for both. Instead, add a where clause to restrict the result set:
where sa.Cnt > 0
OR ia.Cnt > 0
In addition to filtering out any rows where both counts are zero, it also removes rows where both counts are null, indicating that no match occurred in either left join.
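To illustrate the pre-aggregate-then-left-join pattern, here is a minimal sketch using Python's built-in sqlite3 module. The three table names come from the question; the sample rows, the single `AppId` column in the two usage tables, and the `AS`-style aliases (instead of the T-SQL `alias = expr` form) are assumptions for the sake of a runnable example.

```python
import sqlite3

# Hypothetical sample data: APP C (60) is used by neither table.
APPLICATIONS = [(12, "APP A"), (59, "APP B"), (60, "APP C")]
IMPACTED = [(12,), (59,), (59,)]
SUPPORTING = [(59,)]

# Each usage table is counted on its own before being joined to Applications,
# so the counts cannot multiply each other.
COUNTS_SQL = """
select a.AppId,
       a.ApplicationName,
       coalesce(ia.Cnt, 0) as ImpactedCount,
       coalesce(sa.Cnt, 0) as SupportingCount,
       coalesce(ia.Cnt, 0) + coalesce(sa.Cnt, 0) as Total
from Applications a
left join (select AppId, count(*) as Cnt
           from ImpactedApplications
           group by AppId) ia on ia.AppId = a.AppId
left join (select AppId, count(*) as Cnt
           from SupportingApplications
           group by AppId) sa on sa.AppId = a.AppId
where sa.Cnt > 0
   or ia.Cnt > 0
order by a.AppId
"""


def counts_per_app():
    """Return (AppId, name, impacted, supporting, total) for apps used at least once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table Applications (AppId integer, ApplicationName text)")
    conn.execute("create table ImpactedApplications (AppId integer)")
    conn.execute("create table SupportingApplications (AppId integer)")
    conn.executemany("insert into Applications values (?, ?)", APPLICATIONS)
    conn.executemany("insert into ImpactedApplications values (?)", IMPACTED)
    conn.executemany("insert into SupportingApplications values (?)", SUPPORTING)
    result = conn.execute(COUNTS_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in counts_per_app():
        print(row)
```

With this data, APP C has no rows in either table, so both `Cnt` values are NULL and the where clause filters it out.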
SELECT 0 AppId, 'Select an Application' ApplicationName, NULL union all
select .1,'--All--', NULL union all
SELECT app.AppId, app.ApplicationName,count(app.AppId)
FROM SupportingApplications sa
INNER JOIN applications app ON sa.Appid = app.appid
--INNER JOIN ImpactedApplications as Apps on sa.AppId = app.AppId
group by app.AppId, app.ApplicationName
It doesn't look like you're doing anything with the ImpactedApplications table, so maybe remove that line.

Is there a way to make this query more efficient performance wise?

This query takes a long time to run on an MS SQL 2008 DB with 70GB of data.
If I run the 2 where clauses separately it takes a lot less time.
EDIT - I need to change the 'select *' to 'delete' afterwards, so please keep that in mind when answering. Thanks :)
select *
From computers
Where Name in
(
select T2.Name
from
(
select Name
from computers
group by Name
having COUNT(*) > 1
) T3
join computers T2 on T3.Name = T2.Name
left join policyassociations PA on T2.PK = PA.EntityId
where (T2.EncryptionStatus = 0 or T2.EncryptionStatus is NULL) and
(PA.EntityType <> 1 or PA.EntityType is NULL)
)
OR
ClientId in
(
select substring(ClientID,11,100)
from computers
)
Swapping IN for EXISTS will help.
Also, as per Gordon's answer: UNION can out-perform OR.
SELECT computers.*
FROM computers
LEFT
JOIN policyassociations
ON policyassociations.entityid = computers.pk
WHERE (
computers.encryptionstatus = 0
OR computers.encryptionstatus IS NULL
)
AND (
policyassociations.entitytype <> 1
OR policyassociations.entitytype IS NULL
)
AND EXISTS (
SELECT name
FROM (
SELECT name
FROM computers
GROUP
BY name
HAVING Count(*) > 1
) As duplicate_computers
WHERE name = computers.name
)
UNION
SELECT *
FROM computers As c
WHERE EXISTS (
SELECT SubString(clientid, 11, 100)
FROM computers
WHERE SubString(clientid, 11, 100) = c.clientid
)
You've now updated your question asking to make this a delete.
Well the good news is that instead of the "OR" you just make two DELETE statements:
DELETE computers
FROM computers
LEFT
JOIN policyassociations
ON policyassociations.entityid = computers.pk
WHERE (
computers.encryptionstatus = 0
OR computers.encryptionstatus IS NULL
)
AND (
policyassociations.entitytype <> 1
OR policyassociations.entitytype IS NULL
)
AND EXISTS (
SELECT name
FROM (
SELECT name
FROM computers
GROUP
BY name
HAVING Count(*) > 1
) As duplicate_computers
WHERE name = computers.name
)
;
DELETE c
FROM computers As c
WHERE EXISTS (
SELECT SubString(clientid, 11, 100)
FROM computers
WHERE SubString(clientid, 11, 100) = c.clientid
)
;
Some things I would look at:
1. Are indexes in place?
2. 'IN' can slow your query; try replacing it with joins.
3. Use a column name ('Name', I guess, in this case) rather than * in count(*).
4. Select only the data you need by listing particular columns.
Hope this helps!
OR can sometimes be poorly optimized. In this case, you can just split the query into two subqueries, and combine them using union:
select *
From computers
Where Name in
(
select T2.Name
from
(
select Name
from computers
group by Name
having COUNT(*) > 1
) T3
join computers T2 on T3.Name = T2.Name
left join policyassociations PA on T2.PK = PA.EntityId
where (T2.EncryptionStatus = 0 or T2.EncryptionStatus is NULL) and
(PA.EntityType <> 1 or PA.EntityType is NULL)
)
UNION
select *
From computers
WHERE ClientId in
(
select substring(ClientID,11,100)
from computers
);
You might also be able to improve performance by replacing the subqueries with explicit joins. However, this seems like the shortest route to better performance.
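Before relying on the rewrite, it is worth confirming that the UNION form returns the same rows as the OR form. Below is a minimal sketch of that check using Python's built-in sqlite3 module. The sample rows are hypothetical, SQLite's `substr` stands in for SQL Server's `substring`, and only the `PK` column is selected to keep the comparison simple. It checks equivalence only and says nothing about performance.

```python
import sqlite3

# Hypothetical rows: (PK, Name, ClientId, EncryptionStatus).
COMPUTERS = [
    (1, "alpha", "0123456789abc", 0),
    (2, "alpha", "id-two", 1),
    (3, "beta", "abc", 1),
    (4, "gamma", "id-four", None),
]
# Hypothetical rows: (EntityId, EntityType).
POLICY_ASSOCIATIONS = [(2, 1)]

DUPLICATE_NAMES = """
    select T2.Name
    from (select Name from computers group by Name having count(*) > 1) T3
    join computers T2 on T3.Name = T2.Name
    left join policyassociations PA on T2.PK = PA.EntityId
    where (T2.EncryptionStatus = 0 or T2.EncryptionStatus is null)
      and (PA.EntityType <> 1 or PA.EntityType is null)
"""
CLIENT_TAILS = "select substr(ClientId, 11, 100) from computers"

OR_SQL = f"""
select PK from computers
where Name in ({DUPLICATE_NAMES}) or ClientId in ({CLIENT_TAILS})
order by PK
"""
UNION_SQL = f"""
select PK from computers where Name in ({DUPLICATE_NAMES})
union
select PK from computers where ClientId in ({CLIENT_TAILS})
order by PK
"""


def matching_pks(sql):
    """Run one of the two query forms on the sample data and return the PKs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table computers "
        "(PK integer, Name text, ClientId text, EncryptionStatus integer)"
    )
    conn.execute("create table policyassociations (EntityId integer, EntityType integer)")
    conn.executemany("insert into computers values (?, ?, ?, ?)", COMPUTERS)
    conn.executemany("insert into policyassociations values (?, ?)", POLICY_ASSOCIATIONS)
    result = [pk for (pk,) in conn.execute(sql)]
    conn.close()
    return result


if __name__ == "__main__":
    print(matching_pks(OR_SQL), matching_pks(UNION_SQL))
```

Rows 1 and 2 match through the duplicate name 'alpha', and row 3 matches because its ClientId equals the tail of row 1's ClientId.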
EDIT:
I think the version with joins is:
select c.*
From computers c left outer join
(select c.Name
from (select c.*, count(*) over (partition by Name) as cnt
from computers c
) c left join
policyassociations PA
on c.PK = PA.EntityId and PA.EntityType <> 1
where (c.EncryptionStatus = 0 or c.EncryptionStatus is NULL) and
c.cnt > 1
) cpa
on c.Name = cpa.Name left outer join
(select substring(ClientID, 11, 100) as ClientId
from computers
) csub
on c.ClientId = csub.ClientId
Where cpa.Name is not null or csub.ClientId is not null;

Produce result table from multiple tables

SQL Server 2008 R2
I have 3 tables containing data for 3 different types of events,
Type1, Type2, Type3, each with two columns:
DatePoint ValuePoint
I want to produce a result table that looks like this:
DatePoint TotalType1 TotalType2 TotalType3
I started with this:
SELECT [DatePoint]
,SUM(ValuePoint) as TotalType1
FROM [dbo].[Type1]
GROUP BY [DatePoint]
ORDER BY [DatePoint]
SELECT [DatePoint]
,SUM(ValuePoint) as TotalType2
FROM [dbo].[Type2]
GROUP BY [DatePoint]
ORDER BY [DatePoint]
SELECT [DatePoint]
,SUM(ValuePoint) as TotalType3
FROM [dbo].[Type3]
GROUP BY [DatePoint]
ORDER BY [DatePoint]
So I have three results, but I need to produce one (Date TotalType1 TotalType2 TotalType3). What do I need to do next to achieve my goal?
UPDATE
I forgot to mention that a DatePoint which exists in one type may or may not exist in another.
Here's my take. I assume that you don't have the same datetime values in every table (certainly, the stuff I get to work with is never so consistent). There should be an easier way to do this, but once you're past two outer joins things can get pretty tricky.
SELECT
dp.DatePoint
,isnull(t1.TotalType1, 0) TotalType1
,isnull(t2.TotalType2, 0) TotalType2
,isnull(t3.TotalType3, 0) TotalType3
from (-- Without "ALL", UNION will filter out duplicates
select DatePoint
from Type1
union select DatePoint
from Type2
union select DatePoint
from Type3) dp
left outer join (select DatePoint, sum(ValuePoint) TotalType1
from Type1
group by DatePoint) t1
on t1.DatePoint = dp.DatePoint
left outer join (select DatePoint, sum(ValuePoint) TotalType2
from Type2
group by DatePoint) t2
on t2.DatePoint = dp.DatePoint
left outer join (select DatePoint, sum(ValuePoint) TotalType3
from Type3
group by DatePoint) t3
on t3.DatePoint = dp.DatePoint
order by dp.DatePoint
I suppose some DISTINCT could help, but the general idea should be the following:
SELECT
t.[DatePoint],
SUM(t1.ValuePoint) as TotalType1,
SUM(t2.ValuePoint) as TotalType2,
SUM(t3.ValuePoint) as TotalType3
FROM
(
SELECT [DatePoint] FROM [dbo].[Type1]
UNION
SELECT [DatePoint] FROM [dbo].[Type2]
UNION
SELECT [DatePoint] FROM [dbo].[Type3]
) as t
LEFT JOIN
[dbo].[Type1] t1
ON
t1.[DatePoint] = t.[DatePoint]
LEFT JOIN
[dbo].[Type2] t2
ON
t2.[DatePoint] = t.[DatePoint]
LEFT JOIN
[dbo].[Type3] t3
ON
t3.[DatePoint] = t.[DatePoint]
GROUP BY
t.[DatePoint]
ORDER BY
t.[DatePoint]
To avoid all of the JOINs:
SELECT
SQ.DatePoint,
SUM(CASE WHEN SQ.type = 1 THEN SQ.ValuePoint ELSE 0 END) AS TotalType1,
SUM(CASE WHEN SQ.type = 2 THEN SQ.ValuePoint ELSE 0 END) AS TotalType2,
SUM(CASE WHEN SQ.type = 3 THEN SQ.ValuePoint ELSE 0 END) AS TotalType3
FROM (
SELECT
1 AS type,
DatePoint,
ValuePoint
FROM
dbo.Type1
UNION ALL
SELECT
2 AS type,
DatePoint,
ValuePoint
FROM
dbo.Type2
UNION ALL
SELECT
3 AS type,
DatePoint,
ValuePoint
FROM
dbo.Type3
) AS SQ
GROUP BY
DatePoint
ORDER BY
DatePoint
From the little information provided though, it seems like there are some flaws in the database design, which is probably part of the reason that querying the data is so difficult.
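To see how the UNION ALL approach behaves when a date has several rows in more than one table, here is a minimal sketch using Python's built-in sqlite3 module. The sample dates and values are hypothetical, the tag column is called `src` rather than `type`, and SQLite stands in for SQL Server. For comparison it also runs a variant that joins the raw tables directly, which multiplies rows that share a date.

```python
import sqlite3

# Hypothetical rows: (DatePoint, ValuePoint).
TYPE1 = [("2024-01-01", 10), ("2024-01-01", 5), ("2024-01-02", 1)]
TYPE2 = [("2024-01-01", 2), ("2024-01-01", 3)]
TYPE3 = [("2024-01-03", 7)]

# Stack the three tables, tag each row with its source, then pivot with CASE.
UNION_ALL_SQL = """
select sq.DatePoint,
       sum(case when sq.src = 1 then sq.ValuePoint else 0 end) as TotalType1,
       sum(case when sq.src = 2 then sq.ValuePoint else 0 end) as TotalType2,
       sum(case when sq.src = 3 then sq.ValuePoint else 0 end) as TotalType3
from (select 1 as src, DatePoint, ValuePoint from Type1
      union all
      select 2, DatePoint, ValuePoint from Type2
      union all
      select 3, DatePoint, ValuePoint from Type3) sq
group by sq.DatePoint
order by sq.DatePoint
"""

# Joining the raw (unaggregated) tables: rows sharing a date multiply each other.
DIRECT_JOIN_SQL = """
select t.DatePoint, sum(t1.ValuePoint), sum(t2.ValuePoint), sum(t3.ValuePoint)
from (select DatePoint from Type1
      union select DatePoint from Type2
      union select DatePoint from Type3) t
left join Type1 t1 on t1.DatePoint = t.DatePoint
left join Type2 t2 on t2.DatePoint = t.DatePoint
left join Type3 t3 on t3.DatePoint = t.DatePoint
group by t.DatePoint
order by t.DatePoint
"""


def totals(sql):
    """Create the three tables in memory, load the sample rows, and run sql."""
    conn = sqlite3.connect(":memory:")
    for name, rows in (("Type1", TYPE1), ("Type2", TYPE2), ("Type3", TYPE3)):
        conn.execute(f"create table {name} (DatePoint text, ValuePoint integer)")
        conn.executemany(f"insert into {name} values (?, ?)", rows)
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(totals(UNION_ALL_SQL))
    print(totals(DIRECT_JOIN_SQL))
```

On 2024-01-01 the direct join pairs two Type1 rows with two Type2 rows, so each sum is doubled, while the UNION ALL version keeps every source row exactly once.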

SELECT Data from multiple tables?

I have 3 tables, each with the same 3 fields. I basically want to select information from each table.
For example:
userid = 1
I want to select data from all 3 tables, where userid = 1
I am currently using:
SELECT r.*,
p.*,
l.*
FROM random r
LEFT JOIN pandom p ON r.userid = p.userid
LEFT JOIN landom l ON l.userid = r.userid
WHERE r.userid = '1'
LIMIT 0, 30
But it doesn't seem to work.
with 3 fields all the same
So you mean that you want the same 3 fields from all 3 tables?
SELECT r.col1, r.col2, r.col3
FROM random r
WHERE r.userid = '1'
UNION ALL
SELECT p.pcol1, p.pcol_2, p.p3
FROM pandom p
WHERE p.userid = '1'
UNION ALL
SELECT l.l1, l.l2, l.l3
FROM landom l
WHERE l.userid = '1'
LIMIT 0, 30
The fields don't have to be named the same, but the same types need to line up in position 1, 2 and 3.
The way the limit works is:
It will attempt to get 30 rows from random.
If it has 30 already, it won't even look at the other 2 tables.
If it has fewer than 30 from random, it will try to fill up to 30 from pandom, and only finally from landom.
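Here is a minimal sketch of the single-LIMIT UNION ALL, run with Python's built-in sqlite3 module. The column names `col1`, `col2`, `col3`, the sample rows, and the extra `src` tag column are assumptions. The `ORDER BY src` is added so that the fill order (random first, then pandom, then landom) is explicit rather than left to the engine.

```python
import sqlite3

# Hypothetical rows: (userid, col1, col2, col3).
RANDOM_ROWS = [(1, "r1", "x", 1), (1, "r2", "x", 2), (2, "r3", "x", 3)]
PANDOM_ROWS = [(1, "p1", "y", 4), (1, "p2", "y", 5)]
LANDOM_ROWS = [(1, "l1", "z", 6)]

# One LIMIT at the end applies to the whole UNION ALL result.
UNION_SQL = """
select 1 as src, col1, col2, col3 from "random" where userid = :uid
union all
select 2, col1, col2, col3 from pandom where userid = :uid
union all
select 3, col1, col2, col3 from landom where userid = :uid
order by src
limit :n
"""


def rows_for_user(uid, n):
    """Return up to n rows for one user, filled from random, pandom, landom in turn."""
    conn = sqlite3.connect(":memory:")
    tables = (("random", RANDOM_ROWS), ("pandom", PANDOM_ROWS), ("landom", LANDOM_ROWS))
    for name, rows in tables:
        conn.execute(
            f'create table "{name}" (userid integer, col1 text, col2 text, col3 integer)'
        )
        conn.executemany(f'insert into "{name}" values (?, ?, ?, ?)', rows)
    result = conn.execute(UNION_SQL, {"uid": uid, "n": n}).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in rows_for_user(1, 3):
        print(row)
```

With a limit of 3, the two matching rows from random are taken first and the third comes from pandom; landom is only reached when the limit is larger.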
SELECT t1.*, t2.*, t3.*
FROM `random` as t1, `pandom` as t2, `landom` as t3
WHERE t1.`userid`='1' AND t2.`userid`='1' AND t3.`userid`='1'
SELECT * FROM `random`
JOIN `pandom` USING (`userid`)
JOIN `landom` USING (`userid`)
WHERE `userid`='1'