Find only common column items in a table WITHIN the same group - sql

I have a table with 3 columns: Plate_Id, Prod_id and Location.
You can say that plate_id is the "header" column.
The prod_id groups the locations together for a particular plate_id.
Given a particular set of values, I only want to pick locations that are COMMON amongst all prod_ids for a particular plate_id.
NOTE: My table can have multiple plate_ids.
I am close, but it's not perfect.
I tried to isolate the smallest group for a given plate_id and then inner join it with the original list, but that fails when I have 3 prod_ids and a location is shared with only one other group (I need only the locations that appear in every prod_id).
This is the result I want, followed by what I have so far:
-- DESIRED RESULT:
-- plate_id location
-- 100 1
-- 100 2
-- 200 3
-- 200 4
-- 300 1
-- 300 2
-- 300 5
create table #AllTab
(
plate_id int,
prod_id int,
location int
)
insert into #AllTab
values
(100,10, 1),
(100,10, 2),
(100,10, 3),
(100,10, 4),
(100,20, 1),
(100,20, 2),
(100,20, 3),
(100,20, 4),
(100,20, 5),
(100,20, 6),
(100,20, 9),
(100,30, 1),
(100,30, 2),
(100,30, 9),
(100,40, 1),
(100,40, 2),
(100,40, 12),
(100,40, 14),
(100,40, 1),
(100,40, 2),
(100,40, 25),
(100,40, 30),
-----------------
(200,10, 1),
(200,10, 2),
(200,10, 3),
(200,10, 4),
(200,20, 1),
(200,20, 2),
(200,20, 3),
(200,20, 4),
(200,20, 5),
(200,20, 6),
(200,20, 7),
(200,30, 3),
(200,30, 4),
(200,30, 9),
-----------------
(300,10, 1),
(300,10, 2),
(300,10, 3),
(300,10, 5),
(300,20, 1),
(300,20, 2),
(300,20, 3),
(300,20, 4),
(300,20, 5),
(300,20, 6),
(300,20, 7),
(300,20, 9),
(300,30, 1),
(300,30, 2),
(300,30, 5)
-- The #SubTab table isolates the smallest group from the above table
-- for a particular plate_id
create table #SubTab
(
plate_id int,
prod_id int,
location int
)
insert into #SubTab
values
(100,30, 1),
(100,30, 2),
(100,30, 9),
------------
(200,30, 3),
(200,30, 4),
(200,30, 9),
------------
(300,30, 1),
(300,30, 2),
(300,30, 5)
select distinct pr.plate_id, pr.prod_id, pr.location from #SubTab pr
inner join #AllTab pl on pr.plate_id = pl.plate_id
and pr.location = pl.location
where pr.Prod_Id <> pl.prod_id
group by pr.plate_id, pr.prod_id, pr.location

This query returns the locations that are in all the products for a given plate:
select plate_id, location
from #alltab a
group by plate_id, location
having count(distinct prod_id) = (select count(distinct prod_id) from #alltab a2 where a2.plate_id = a.plate_id);
Because both counts use count(distinct prod_id), duplicate rows in the table (such as the repeated (100,40,1) row in your sample data) do not affect the result.
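To try the idea without a SQL Server instance, here is a minimal sketch that runs the same query shape through Python's built-in sqlite3 module. The table name alltab (without the #), the trimmed subset of rows, and the use of SQLite instead of SQL Server are all assumptions made for illustration.

```python
import sqlite3

# Trimmed subset of the question's sample data: (plate_id, prod_id, location).
rows = [
    (100, 10, 1), (100, 10, 2), (100, 10, 3),
    (100, 20, 1), (100, 20, 2), (100, 20, 9),
    (100, 30, 1), (100, 30, 2), (100, 30, 9),
    (200, 10, 3), (200, 10, 4),
    (200, 20, 3), (200, 20, 4), (200, 20, 5),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table alltab (plate_id int, prod_id int, location int)")
conn.executemany("insert into alltab values (?, ?, ?)", rows)

# Keep a (plate_id, location) pair only when the number of distinct products
# containing that location equals the number of distinct products on the plate.
common = conn.execute("""
    select plate_id, location
    from alltab a
    group by plate_id, location
    having count(distinct prod_id) =
           (select count(distinct prod_id) from alltab a2
            where a2.plate_id = a.plate_id)
    order by plate_id, location
""").fetchall()

print(common)
```

On this subset, locations 3 and 9 of plate 100 and location 5 of plate 200 drop out because at least one product on the plate lacks them.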

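Try this:
;with cte1
AS
(
-- number of distinct products per plate
Select Plate_Id,Count(DISTINCT prod_id) as ProdCount
From #AllTab
Group by Plate_Id
)
,cte2
AS
(
-- number of distinct products in which each location appears
-- (counting distinct prod_id rather than rows keeps duplicate rows from inflating the count)
Select Plate_Id,Location,Count(DISTINCT prod_id) As LocCount
from #AllTab
Group by Plate_Id,Location
)
SELECT t1.plate_id ,t2.location
FROM cte1 t1 JOIN cte2 t2
ON t1.Plate_Id =t2.Plate_Id
Where LocCount=ProdCount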

I have a solution; it is a tad lengthy, but it works:
SELECT
SubGroupCounts.plate_id,
LocationSubGroupCounts.location
FROM
(-- Number of sub-groups (prod_ids) per main group (plate_id)
SELECT
plate_id,
count(distinct prod_id) as num
FROM
#AllTab
GROUP BY
plate_id) SubGroupCounts
INNER JOIN
(-- Count the number of sub-groups each location appears in
SELECT
plate_id,
Location,
COUNT(distinct prod_id) AS num
FROM
#AllTab
GROUP BY
Location, plate_id) LocationSubGroupCounts ON LocationSubGroupCounts.plate_id = SubGroupCounts.plate_id
AND LocationSubGroupCounts.num = SubGroupCounts.num

Related

Count distinct values with multiple group by in SQL Server

I am getting problems when I try to count distinct orders with multiple group by statements
Please recommend a solution.
Let me give an example with 4 unique orders
SELECT COUNT(Distinct Sale.id) as Ordr
FROM (VALUES('One1', 1), ('Two2', 2), ('Three3', 3), ('Four4', 4)) Sale(orderName, id)
left join (VALUES
('p1', 1, 1), ('p2', 2, 1), ('p3', 3, 1), ('p4', 4, 1),
('p2', 5, 2), ('p4', 6, 2), ('p1', 7, 3), ('p4', 8, 3))
SaleItem(productName, id, orderId) on Sale.id = SaleItem.orderId
If you run the above query, it gives a total order count of 4, which is correct. Now I add a group by on productName and sum the per-group counts, and the output is incorrect:
Select SUM(Ordr) from (
SELECT COUNT(Distinct Sale.id) as Ordr
FROM (VALUES('One1', 1), ('Two2', 2), ('Three3', 3), ('Four4', 4)) Sale(orderName, id)
left join (VALUES
('p1', 1, 1), ('p2', 2, 1), ('p3', 3, 1), ('p4', 4, 1),
('p2', 5, 2), ('p4', 6, 2), ('p1', 7, 3), ('p4', 8, 3))
SaleItem(productName, id, orderId) on Sale.id = SaleItem.orderId
GROUP BY SaleItem.productName
) data
As far as I understand, here we have duplicate orders in each group and I do not see any way to just get distinct count.
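The over-count described above can be reproduced with a minimal sketch. It loads the same sample rows into SQLite through Python's sqlite3 module (an assumption for illustration; the question targets SQL Server, and the table names sale and saleitem are hypothetical) and compares the per-product distinct counts with the overall distinct count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table sale (orderName text, id int);
    create table saleitem (productName text, id int, orderId int);
    insert into sale values ('One1', 1), ('Two2', 2), ('Three3', 3), ('Four4', 4);
    insert into saleitem values
        ('p1', 1, 1), ('p2', 2, 1), ('p3', 3, 1), ('p4', 4, 1),
        ('p2', 5, 2), ('p4', 6, 2), ('p1', 7, 3), ('p4', 8, 3);
""")

# Distinct orders per product; order 4 has no items, so it lands in the NULL group.
per_product = dict(conn.execute("""
    select saleitem.productName, count(distinct sale.id)
    from sale left join saleitem on sale.id = saleitem.orderId
    group by saleitem.productName
""").fetchall())

# Summing the groups counts an order once for every product it contains.
total_of_groups = sum(per_product.values())

# Counting distinct orders without the grouping gives the real total.
distinct_orders = conn.execute("""
    select count(distinct sale.id)
    from sale left join saleitem on sale.id = saleitem.orderId
""").fetchone()[0]

print(per_product, total_of_groups, distinct_orders)
```

Order 1 contains four products, so it is counted in four groups; that is why the sum of the per-group counts exceeds the number of distinct orders.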

Forward fill since (possibly non existent) date in BigQuery

I have data from two different sources. On one hand I have user data from our app, with a primary key of ID and UTC date. There are rows only for the UTC dates on which a user used the app. On the other hand I have advertisement campaign attribution data for the users (there can be multiple advertisement campaigns per user). This table has a primary key of ID and campaign, plus a column containing the advertisement attribution timestamp. I want to combine the two data sources so that I can compute, among other campaign statistics, whether a campaign generates more revenue than it costs.
App data example:
SELECT
*
FROM UNNEST(ARRAY<STRUCT<ID INT64, UTC_Date DATE, Revenue FLOAT64>>
[(1, DATE('2021-01-01'), 0),
(1, DATE('2021-01-05'), 5),
(1, DATE('2021-01-10'), 0),
(2, DATE('2021-01-03'), 10),
(2, DATE('2021-01-08'), 0),
(2, DATE('2021-01-09'), 0)])
Advertisement campaign attribution data example:
SELECT
*
FROM UNNEST(ARRAY<STRUCT<ID INT64, Attribution_Timestamp Timestamp, campaign_name STRING>>
[(1, TIMESTAMP('2021-01-01 09:54:31'), "A"),
(1, TIMESTAMP('2021-01-09 22:32:51'), "B"),
(2, TIMESTAMP('2021-01-03 19:12:11'), "A")])
The end result I would like to get is:
SELECT
*
FROM UNNEST(ARRAY<STRUCT<ID INT64, UTC_Date DATE, Revenue FLOAT64, campaign_name STRING>>
[(1, DATE('2021-01-01'), 0, "A"),
(1, DATE('2021-01-05'), 5, "A"),
(1, DATE('2021-01-10'), 0, "B"),
(2, DATE('2021-01-03'), 10, "A"),
(2, DATE('2021-01-08'), 0, "A"),
(2, DATE('2021-01-09'), 0, "A")])
This can be achieved by somehow joining the campaign attribution data to the app data and then forward filling.
The problem is that the advertisement attribution timestamp may not match any UTC date in the app data table. This means I cannot use a plain left join, as it would not assign campaign_name B to ID 1. Does anyone know an elegant way to solve this problem?
Found a solution! Here is what I did (and a little bit more sample data):
WITH app_data AS
(
SELECT
*
FROM UNNEST(ARRAY<STRUCT<adid INT64, utc_date DATE, Revenue FLOAT64>>
[(1, DATE('2021-01-01'), 0),
(1, DATE('2021-01-05'), 5),
(1, DATE('2021-01-10'), 0),
(1, DATE('2021-01-12'), 0),
(1, DATE('2021-01-15'), 0),
(1, DATE('2021-01-16'), 15),
(1, DATE('2021-01-18'), 0),
(2, DATE('2021-01-03'), 10),
(2, DATE('2021-01-08'), 0),
(2, DATE('2021-01-09'), 0),
(2, DATE('2021-01-15'), 4),
(2, DATE('2021-02-01'), 0),
(2, DATE('2021-02-08'), 8),
(2, DATE('2021-02-15'), 0),
(2, DATE('2021-03-04'), 0),
(2, DATE('2021-03-06'), 12),
(3, DATE('2021-02-15'), 10),
(3, DATE('2021-02-23'), 5),
(3, DATE('2021-03-25'), 0),
(3, DATE('2021-03-30'), 0)])
),
advertisment_attribution_data AS
(
SELECT
*
FROM UNNEST(ARRAY<STRUCT<adid INT64, utc_date DATE, campaign_name STRING>>
[(1, DATE(TIMESTAMP('2021-01-01 09:54:31')), "A"),
(1, DATE(TIMESTAMP('2021-01-09 22:32:51')), "B"),
(1, DATE(TIMESTAMP('2021-01-17 14:30:05')), "C"),
(2, DATE(TIMESTAMP('2021-01-03 19:12:11')), "A"),
(1, DATE(TIMESTAMP('2021-01-15 18:17:57')), "B"),
(3, DATE(TIMESTAMP('2021-03-14 22:32:51')), "C")])
)
SELECT
t1.*,
IFNULL(LAST_VALUE(t2.campaign_name IGNORE NULLS) OVER (PARTITION BY t1.adid ORDER BY t1.utc_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), "Organic") as campaign_name
FROM
app_data t1
LEFT JOIN
advertisment_attribution_data t2
ON t1.adid = t2.adid
AND t1.utc_date = (SELECT MIN(t3.utc_date) FROM app_data t3 WHERE t2.adid=t3.adid AND t2.utc_date <= t3.utc_date)
EDIT
It doesn't work when app_data selects from a real table. BigQuery reports: Unsupported subquery with table in join predicate.
EDIT 2
Found a way around the restriction on subqueries in join predicates (apparently they are only accepted when the CTE does not select from an existing table). Computing the join date inside the CTE works in either case:
WITH app_data AS
(
SELECT
*
FROM UNNEST(ARRAY<STRUCT<adid INT64, utc_date DATE, Revenue FLOAT64>>
[(1, DATE('2021-01-01'), 0),
(1, DATE('2021-01-05'), 5),
(1, DATE('2021-01-10'), 0),
(1, DATE('2021-01-12'), 0),
(1, DATE('2021-01-15'), 0),
(1, DATE('2021-01-16'), 15),
(1, DATE('2021-01-18'), 0),
(2, DATE('2021-01-03'), 10),
(2, DATE('2021-01-08'), 0),
(2, DATE('2021-01-09'), 0),
(2, DATE('2021-01-15'), 4),
(2, DATE('2021-02-01'), 0),
(2, DATE('2021-02-08'), 8),
(2, DATE('2021-02-15'), 0),
(2, DATE('2021-03-04'), 0),
(2, DATE('2021-03-06'), 12),
(3, DATE('2021-02-15'), 10),
(3, DATE('2021-02-23'), 5),
(3, DATE('2021-03-25'), 0),
(3, DATE('2021-03-30'), 0)])
),
advertisment_attribution_data AS
(
SELECT
*,
(
SELECT
MIN(t2.utc_date)
FROM app_data t2
WHERE t1.adid=t2.adid
AND t1.utc_date <= t2.utc_date
) as attribution_join_date -- the closest app_data date on or after the attribution date for this adid, so that the later join finds a match
FROM UNNEST(ARRAY<STRUCT<adid INT64, utc_date DATE, campaign_name STRING>>
[(1, DATE(TIMESTAMP('2021-01-01 09:54:31')), "A"),
(1, DATE(TIMESTAMP('2021-01-09 22:32:51')), "B"),
(1, DATE(TIMESTAMP('2021-01-17 14:30:05')), "C"),
(2, DATE(TIMESTAMP('2021-01-03 19:12:11')), "A"),
(1, DATE(TIMESTAMP('2021-01-15 18:17:57')), "B"),
(3, DATE(TIMESTAMP('2021-03-14 22:32:51')), "C")]) t1
)
SELECT
t1.*,
IFNULL(LAST_VALUE(t2.campaign_name IGNORE NULLS) OVER (PARTITION BY t1.adid ORDER BY t1.utc_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), 'Organic') as campaign_name
FROM
app_data t1
LEFT JOIN
advertisment_attribution_data t2
ON t1.adid = t2.adid
AND t1.utc_date = t2.attribution_join_date
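The two steps of the query above (snap each attribution to the next app date, then carry the last campaign forward) can be sketched in plain Python to check the logic on the small sample from the question. The function name forward_fill_campaigns is hypothetical, and the sketch assumes that when two attributions snap to the same app date, the later one wins.

```python
from bisect import bisect_left
from datetime import date

app_rows = [  # (adid, utc_date, revenue), sample from the question
    (1, date(2021, 1, 1), 0.0), (1, date(2021, 1, 5), 5.0), (1, date(2021, 1, 10), 0.0),
    (2, date(2021, 1, 3), 10.0), (2, date(2021, 1, 8), 0.0), (2, date(2021, 1, 9), 0.0),
]
attributions = [  # (adid, attribution date, campaign_name)
    (1, date(2021, 1, 1), "A"), (1, date(2021, 1, 9), "B"), (2, date(2021, 1, 3), "A"),
]

def forward_fill_campaigns(app_rows, attributions, default="Organic"):
    # Sorted app dates per adid, used to find the attribution_join_date.
    dates_by_id = {}
    for adid, day, _ in sorted(app_rows):
        dates_by_id.setdefault(adid, []).append(day)

    # Step 1: snap each attribution to the first app date on or after it.
    snapped = {}
    for adid, day, campaign in sorted(attributions):
        days = dates_by_id.get(adid, [])
        i = bisect_left(days, day)
        if i < len(days):
            # Assumption: a later attribution overwrites an earlier one on the same date.
            snapped[(adid, days[i])] = campaign

    # Step 2: walk each adid in date order and carry the last campaign forward.
    result, current_id, current = [], None, default
    for adid, day, revenue in sorted(app_rows):
        if adid != current_id:
            current_id, current = adid, default
        current = snapped.get((adid, day), current)
        result.append((adid, day, revenue, current))
    return result

filled = forward_fill_campaigns(app_rows, attributions)
for row in filled:
    print(row)
```

Campaign B, attributed on 2021-01-09, is snapped to the app row of 2021-01-10 for ID 1; rows before any attribution would get the default "Organic".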

SQL query. Get similar rows ordered by desc

I have a simple table with two fields.
1) book_id - int
2) tag_id - int
One book can have multiple tags, such as
book_id:1 - tag_id: 2, 3, 5, 9
Here is the question: how can I get the books similar to a specific book? They should be ordered descending by something like a "likeness" count.
Example: I want to get all book_ids that share tags with book_id = 1, ordered by the number of shared tags.
Specific book: book_id: 1 - tag_id: 2, 3, 5 , 9
Result:
book_id: 54 - tag_id: 2, 3, 5, 14
book_id: 104 - tag_id: 2, 3, 10
You can order the other books by the number of tags they have in common with your given book:
select bt2.book_id, count(*) as tags_in_common
from book_tags bt join
book_tags bt2
on bt.tag_id = bt2.tag_id
and bt2.book_id <> bt.book_id -- leave out the given book itself
where bt.book_id = ?
group by bt2.book_id
order by tags_in_common desc;
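The same ranking can be checked by hand with a short Python sketch over the three books from the question. The function name similar_books is hypothetical, and skipping the given book itself is an assumption about what "similar" should return.

```python
from collections import Counter

book_tags = [  # (book_id, tag_id), taken from the question's example
    (1, 2), (1, 3), (1, 5), (1, 9),
    (54, 2), (54, 3), (54, 5), (54, 14),
    (104, 2), (104, 3), (104, 10),
]

def similar_books(book_tags, book_id):
    # Tags of the given book.
    target = {tag for b, tag in book_tags if b == book_id}
    # For every other book, count the rows whose tag is also in the target set.
    shared = Counter(b for b, tag in book_tags if b != book_id and tag in target)
    # Most shared tags first; ties broken by book_id for a stable order.
    return sorted(shared.items(), key=lambda item: (-item[1], item[0]))

ranking = similar_books(book_tags, 1)
print(ranking)
```

Book 54 shares tags 2, 3 and 5 with book 1, while book 104 shares only 2 and 3, so 54 ranks first.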
Let's say your table and data look like this:
create table #books_tags(
book_id int,
tag_id int
)
insert into #books_tags values(1, 2), (1, 3), (1, 5), (1, 9) -- book_id:1
insert into #books_tags values(54, 2), (54, 3), (54, 5), (54, 14) -- book_id:54
insert into #books_tags values(104, 2), (104, 3), (104, 10) -- book_id:104
insert into #books_tags values(2, 3), (2, 5), (2, 11), (2, 14) -- book_id:2
insert into #books_tags values(3, 3), (3, 9), (3, 10), (3, 11) -- book_id:3
Then your query is this:
select a.book_id,
b.book_id similar_book_id,
count(*) matching_tags,
string_agg(a.tag_id, ',') tag_ids
from #books_tags a
left join #books_tags b on b.tag_id = a.tag_id and b.book_id <> a.book_id
group by a.book_id, b.book_id
order by matching_tags desc, a.book_id
(string_agg requires SQL Server 2017 or later.)

Create range of ID without cursor

I am not sure whether this or a similar question has been asked already, but I could not find one.
The requirement is to create ranges of IDs over which the Value does not change. This schema can be used:
create table #mytable (ID int, Val int)
insert into #mytable values
(1, 1),
(2, 1),
(3, 1),
(4, 2),
(5, 2),
(6, 2),
(7, 2),
(8, 1),
(9, 1),
(10, 1),
(11, 4),
(12, 4),
(13, 4),
(14, 4),
(15, 4),
(16, 5);
And the expected result would be
StartID EndID Val
1 3 1
4 7 2
8 10 1
11 15 4
16 16 5
I can achieve this with a cursor, but if the number of records runs into the millions, I think a cursor will be slow. I hope it can be written as a single set-based query, but I could not figure out how.
So I need help writing that kind of query. Needless to say, it is not a school/college project or assignment.
This is a gaps-and-islands scenario where you're trying to group records together based on the change in Val.
The query below uses window functions: lag flags each row where Val changes, and a running sum of those flags assigns the island_nbr:
select min(b.ID) as StartID
, max(b.ID) as EndID
, max(b.Val) as Val
from (
select a.ID
, a.Val
, sum(a.is_chng_flg) over (order by a.ID asc) as island_nbr
from (
select m.ID
, m.Val
, case lag(m.Val, 1, m.Val) over (order by m.ID asc) when m.Val then 0 else 1 end is_chng_flg
from #mytable as m
) as a
) as b
group by b.island_nbr --forces the right records to show up
order by 1
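To see how the change flag and its running sum produce the islands, here is a minimal Python sketch of the same two steps on the question's data. The function name islands_by_change_flag is hypothetical; the variable names mirror the column aliases in the query.

```python
rows = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 2), (7, 2), (8, 1),
        (9, 1), (10, 1), (11, 4), (12, 4), (13, 4), (14, 4), (15, 4), (16, 5)]

def islands_by_change_flag(rows):
    islands = {}          # island_nbr -> [StartID, EndID, Val]
    island_nbr = 0
    prev_val = None
    for i, (row_id, val) in enumerate(sorted(rows)):
        # is_chng_flg: 1 when Val differs from the previous row, 0 otherwise.
        # The first row counts as "no change", like lag(Val, 1, Val).
        is_chng_flg = 0 if i == 0 or val == prev_val else 1
        island_nbr += is_chng_flg          # running sum of the flags
        if island_nbr not in islands:
            islands[island_nbr] = [row_id, row_id, val]
        else:
            islands[island_nbr][1] = row_id   # rows arrive in ID order, so this is the max
        prev_val = val
    return [tuple(v) for _, v in sorted(islands.items())]

ranges = islands_by_change_flag(rows)
print(ranges)
```

Val 1 appears in two separate islands (IDs 1 to 3 and 8 to 10) because the running sum increases every time the value changes, not once per distinct value.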
This is a gaps-and-islands problem. But the simplest method is the difference of row numbers:
select min(id) as startId, max(id) as endId, val
from (select t.*,
row_number() over (order by id) as seqnum,
row_number() over (partition by val order by id) as seqnum_v
from #mytable t
) t
group by (seqnum - seqnum_v), val
order by startId;
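A small Python sketch may help show why the difference of the two row numbers is constant within an island. It recomputes seqnum and seqnum_v by hand on the question's data; the function name islands_by_row_number_difference is hypothetical.

```python
from collections import defaultdict

rows = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 2), (7, 2), (8, 1),
        (9, 1), (10, 1), (11, 4), (12, 4), (13, 4), (14, 4), (15, 4), (16, 5)]

def islands_by_row_number_difference(rows):
    seqnum_v = defaultdict(int)     # row_number() over (partition by val order by id)
    groups = defaultdict(list)      # (seqnum - seqnum_v, val) -> list of IDs
    # seqnum is row_number() over (order by id).
    for seqnum, (row_id, val) in enumerate(sorted(rows), start=1):
        seqnum_v[val] += 1
        # Within one island both counters advance together, so the difference
        # stays fixed; it jumps once rows with another Val come in between.
        groups[(seqnum - seqnum_v[val], val)].append(row_id)
    return sorted((min(ids), max(ids), val) for (_, val), ids in groups.items())

ranges = islands_by_row_number_difference(rows)
print(ranges)
```

For Val 1 the difference is 0 for IDs 1 to 3 and 4 for IDs 8 to 10, because four rows with Val 2 sit between the two runs.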

T-Sql (Complex?) query

I'm trying to make a query, but I can't find a way to do it.
I have 3 tables:
Table Card (card_id)
Table Level(leve_id, leve_desc)
Table CardDetails(cade_id, card_id, leve_id)
So here is the problem: each card has a list of details.
For each card, I want a query to count the number of other cards that have exactly the same details, that is, the same list of leve_id, excluding the card itself.
Is it possible to achieve this in plain T-SQL?
I hope I have been clear enough; if not, I'll try to explain better what I need.
Edit:
I don't really need to know which cards they are for the moment, but it would certainly earn bonus points if the query returned that too.
Edit #2:
So lets say
table Card (card_id)
1,2,3,4,5,6
Table level (leve_id, leve_desc)
(1, Level 1), (2,Level 2), (3,Level 3), (4,Level 4), (5, Level5), (6, Level6)
Table CardDetails (card_id, leve_id)
(1, 1), (1, 3), (1, 4), (2, 1), (2, 2), (3, 1)
(3, 3), (3, 4), (4, 5), (5, 1), (5, 2), (5, 3)
(5, 4), (5, 5), (5, 6), (6, 1), (6, 3), (6, 4)
So, the result should be :
Card_id Nbr_Cards
1 .. 2
2 .. 0
3 .. 2
4 .. 0
5 .. 0
6 .. 2
If I understand you correctly you want something like this
SELECT *
FROM cards c
INNER JOIN carddetails cd
ON c.card_id = cd.card_id
INNER JOIN (SELECT cade_id,
leve_id
FROM carddetails
GROUP BY cade_id,
leve_id
HAVING COUNT (card_id) > 1)dups
ON cd.cade_id = dups.cade_id
AND cd.leve_id = dups.leve_id
Or if you like COUNT OVER
with dups as (
SELECT
COUNT(CARD_ID) OVER (PARTITION BY cade_id, leve_id) cardCount,
cade_id,
leve_id
FROM carddetails
)
SELECT *
FROM cards c
INNER JOIN carddetails cd
ON c.card_id = cd.card_id
INNER JOIN dups
ON cd.cade_id = dups.cade_id
AND cd.leve_id = dups.leve_id
WHERE cardCount > 1
If I understood your question, this counts, for each card, the number of other cards with exactly the same details:
create table #CardDetails (card_id int, leve_id int)
insert into #CardDetails values
(1, 1), (1, 3), (1, 4),
(2, 1), (2, 2),
(3, 1), (3, 3), (3, 4),
(4, 5),
(5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6),
(6, 1), (6, 3), (6, 4)
select card_id,
count(*) over(partition by leve_ids) - 1 as EqualCount
from
(
select card_id,
(select ','+cast(leve_id as varchar(10))
from #CardDetails as C2
where C1.card_id = C2.card_id
order by C2.leve_id
for xml path('')) as leve_ids
from #CardDetails as C1
group by card_id
) T
order by card_id
Result:
card_id EqualCount
----------- -----------
1 2
2 0
3 2
4 0
5 0
6 2
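The idea behind the query above is to reduce each card to one comparable "signature" of its levels and then count how many cards share that signature. A minimal Python sketch of the same idea follows; the function name equal_detail_counts is hypothetical, and a sorted tuple stands in for the concatenated leve_ids string.

```python
from collections import Counter, defaultdict

card_details = [  # (card_id, leve_id), sample data from the question
    (1, 1), (1, 3), (1, 4), (2, 1), (2, 2), (3, 1), (3, 3), (3, 4), (4, 5),
    (5, 1), (5, 2), (5, 3), (5, 4), (5, 5), (5, 6), (6, 1), (6, 3), (6, 4),
]

def equal_detail_counts(card_details):
    # Collect the set of levels for every card.
    levels = defaultdict(set)
    for card_id, leve_id in card_details:
        levels[card_id].add(leve_id)
    # A sorted tuple plays the role of the concatenated leve_ids string.
    signature = {card: tuple(sorted(ids)) for card, ids in levels.items()}
    # Number of cards per signature, minus the card itself.
    sizes = Counter(signature.values())
    return {card: sizes[sig] - 1 for card, sig in sorted(signature.items())}

equal_counts = equal_detail_counts(card_details)
print(equal_counts)
```

As in the query, a card without any rows in the details table does not appear in the result at all.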