Select rows based on 2 column combination - hive

I've a table with the following (partial) structure:
id_1 score_1 id_2 score_2
77 10 88 50
77 10 88 30
77 25 88 50
77 25 88 30
meaning an id can have more than one score.
What I want is to keep only the rows in which each id of the combination appears with its maximum score.
In the above-mentioned example, I'd like to keep only the following row:
id_1 score_1 id_2 score_2
77 25 88 50
I tried to use self-join methods, but with no success.
Any help will be appreciated.

If you want all the combinations, you will need to perform a full join. Depending on your Hive configuration, you may need to set the following property:
set hive.mapred.mode=nonstrict
This query will work for your case. DEMO:
create table tmp3
(
id_1 string, score int, id_2 string, score_2 int
);
INSERT INTO TABLE tmp3
VALUES (77, 10, 88, 50),(77 , 10 , 88 , 30),(77 , 25 , 88 , 50),(77 , 25 , 88 , 30);
select a.id_1, a.score, b.id_2,b.score_2 from
(
select id_1, max(score) as score from tmp3 group by id_1
) a
full join
(
select id_2, max(score_2) as score_2 from tmp3 group by id_2
) b;
result
a.id_1,a.score,b.id_2,b.score_2
77,25,88,50
By the way, depending on your data size and the distribution of your IDs, the full join could take some time.
UPDATE:
Updating the answer to use a window function, which allows selecting multiple columns for the max scores:
select a.id_1, a.score, b.id_2,b.score_2 from
(
select id_1, score from (
select id_1, score ,
ROW_NUMBER() OVER (partition by id_1 order by score desc) AS row_num
from tmp3
) x1 where row_num = 1
) a
full join
(
select id_2, score_2 from (
select id_2, score_2 ,
ROW_NUMBER() OVER (partition by id_2 order by score_2 desc) AS row_num
from tmp3
) x2 where row_num = 1
) b;
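To try the aggregate-then-join idea end to end without a Hive cluster, here is a minimal sketch that assumes Python's built-in sqlite3 module as a stand-in for Hive. The helper name best_combination is hypothetical. CROSS JOIN is swapped in for the condition-less full join, on the assumption that both yield every pairing of the two subquery results when neither side is empty.

```python
import sqlite3

def best_combination(rows):
    """Pair each id_1 (with its max score) with each id_2 (with its max score_2)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tmp3 (id_1 TEXT, score INTEGER, id_2 TEXT, score_2 INTEGER)"
    )
    conn.executemany("INSERT INTO tmp3 VALUES (?, ?, ?, ?)", rows)
    result = conn.execute("""
        SELECT a.id_1, a.score, b.id_2, b.score_2
        FROM (SELECT id_1, MAX(score) AS score FROM tmp3 GROUP BY id_1) AS a
        CROSS JOIN  -- assumption: stands in for Hive's FULL JOIN without ON
             (SELECT id_2, MAX(score_2) AS score_2 FROM tmp3 GROUP BY id_2) AS b
        ORDER BY a.id_1, b.id_2
    """).fetchall()
    conn.close()
    return result

sample = [("77", 10, "88", 50), ("77", 10, "88", 30),
          ("77", 25, "88", 50), ("77", 25, "88", 30)]
print(best_combination(sample))  # [('77', 25, '88', 50)]
```

Note that with several distinct ids this pairs every id_1 with every id_2, including pairs that never occur together in the table, so an extra join condition may be needed for real data.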

Related

T SQL Cte delete where group by is greater than 1

I'm using SQL Server 2016. I have the below table:
SKU Mkt Week Cost Code
ABC 05 1 10 100
ABC 05 2 12 100
DEF 05 3 20 100
DEF 05 3 25 125
XYZ 08 1 10 100
XYZ 08 2 12 100
XZY 08 2 14 125
This is the desired result:
SKU Mkt Week Cost Code
ABC 05 1 10 100
ABC 05 2 12 100
DEF 05 3 25 125
XYZ 08 1 10 100
XZY 08 2 14 125
So if a SKU\Mkt\Week\Cost exists more than once, I want to keep the record where the code is 125 and delete the row where the code is 100.
I'm using the below Cte:
;WITH CTE AS
(
SELECT *,
RN = ROW_NUMBER() OVER( PARTITION BY SKU, Mkt, Week
ORDER BY SKU, Mkt, Week)
FROM [table]
WHERE code = 100
)
DELETE FROM CTE
WHERE RN > 1
However, the CTE does not delete anything. What am I missing?
Based on the query and sample data you have provided, note this part of the CTE's inner query:
WHERE code = 100
When this filter is applied, you are left with the following data:
SKU Mkt Week Cost Code
ABC 05 1 10 100
ABC 05 2 12 100
DEF 05 3 20 100
XYZ 08 1 10 100
XYZ 08 2 12 100
Every one of these rows is alone in its SKU, Mkt, Week partition, so ROW_NUMBER() returns 1 for each of them, and running the following statement will not affect any rows:
DELETE FROM CTE
WHERE RN > 1
To achieve the desired result you need to remove the WHERE section in CTE's inner query.
;WITH CTE AS
(
SELECT *,
RN = ROW_NUMBER() OVER( PARTITION BY SKU, Mkt, Week
ORDER BY SKU, Mkt, Week, Cost DESC) --Code/Cost DESC <==== Note this too
FROM [table]
--WHERE code = 100 <========== HERE, I've commented it
)
DELETE FROM CTE
WHERE RN > 1
You also need to add Cost DESC or Code DESC to the ORDER BY of ROW_NUMBER(), so that the row you want to keep is numbered 1.
Ranking functions are evaluated in the SELECT list, which means the clause WHERE code = 100 is evaluated before ROW_NUMBER() and has already removed the rows with code 125. Add Code to the ORDER BY as well, and then apply the code = 100 check when deleting from the CTE:
;WITH CTE AS
(
SELECT *,
RN = ROW_NUMBER() OVER( PARTITION BY SKU, Mkt, Week
ORDER BY SKU, Mkt, Week,Code DESC)
FROM tt1
)
DELETE FROM CTE
WHERE RN > 1
AND CODE = 100
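The evaluation-order point can be checked directly. Below is a small sketch, assuming Python's sqlite3 module with a SQLite build that supports window functions (3.25 or later) in place of SQL Server. The table name t and the helper max_row_number are hypothetical; the rows are the question's sample data.

```python
import sqlite3

ROWS = [
    ("ABC", "05", 1, 10, 100), ("ABC", "05", 2, 12, 100),
    ("DEF", "05", 3, 20, 100), ("DEF", "05", 3, 25, 125),
    ("XYZ", "08", 1, 10, 100), ("XYZ", "08", 2, 12, 100),
    ("XZY", "08", 2, 14, 125),
]

def max_row_number(filter_inside_cte):
    """Return the highest ROW_NUMBER() the CTE produces, with or without the filter."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (SKU TEXT, Mkt TEXT, Week INTEGER, Cost INTEGER, Code INTEGER)"
    )
    conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", ROWS)
    # The WHERE clause runs before the window function numbers the rows.
    where = "WHERE Code = 100" if filter_inside_cte else ""
    (highest,) = conn.execute(f"""
        WITH cte AS (
            SELECT ROW_NUMBER() OVER (
                       PARTITION BY SKU, Mkt, Week ORDER BY Code DESC) AS rn
            FROM t
            {where}
        )
        SELECT MAX(rn) FROM cte
    """).fetchone()
    conn.close()
    return highest

print(max_row_number(True))   # 1: nothing would match RN > 1
print(max_row_number(False))  # 2: the DEF/05/3 pair is now visible
```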
Try the query below to get the desired result.
Sample data and query
-- Table variables are declared with @ (a # prefix is only valid for temp tables)
DECLARE @Table TABLE
(SKU varchar(20), Mkt int, [Week] int, Cost int, Code int)
INSERT INTO @Table
VALUES
('ABC', 05, 1, 10, 100),
('ABC', 05, 2, 12, 100),
('DEF', 05, 3, 20, 100),
('DEF', 05, 3, 25, 125),
('XYZ', 08, 1, 10, 100),
('XYZ', 08, 2, 12, 100),
('XYZ', 08, 2, 14, 125)
;WITH CTE AS
(
SELECT *,
RN = ROW_NUMBER() OVER( PARTITION BY SKU, Mkt, Week
ORDER BY SKU, Mkt, Week, Code DESC)
FROM @Table
)
DELETE FROM CTE WHERE RN > 1
Along with moving your Where statement, I believe you also want a second cte to work with the records you are identifying... In the following your first cte identifies the duplicate records while the second cte isolates them so you can perform your delete against those SKUs
Table
Create Table #tbl
(
SKU VarChar(10),
Mkt VarChar(10),
Week Int,
Cost Int,
Code Int
)
Insert Into #tbl Values
('ABC','05',1,10,100),
('ABC','05',2,12,100),
('DEF','05',3,20,100),
('DEF','05',3,25,125),
('XYZ','08',1,10,100),
('XYZ','08',2,12,100),
('XYZ','08',2,14,125)
Query
;WITH CTE AS
(
SELECT *,
RN = ROW_NUMBER() OVER( PARTITION BY SKU, Mkt, Week
ORDER BY SKU, Mkt, Week)
FROM #tbl
--WHERE code = 100
)
, cte1 As
(
Select sku from cte where rn > 1
)
DELETE c FROM CTE c inner join cte1 c1 On c.SKU = c1.SKU
WHERE c.Code = 100
Select * From #tbl
Result (Your 'desired result' example removed an XYZ record where the week was not duplicated?)
SKU Mkt Week Cost Code
ABC 05 1 10 100
ABC 05 2 12 100
DEF 05 3 25 125
XYZ 08 1 10 100
XYZ 08 2 12 100
XZY 08 2 14 125
Your CTE only considers rows with code = 100. If you remove that filter, the CTE will rank all rows in the table. With that in mind, first find out which combinations of SKU, Mkt, and Week have multiple rows. Then, among these combinations, identify the rows with code = 100 and delete them.
create table #e1
(
SKU varchar(50)
,Mkt varchar(50)
,_Week int
,Cost int
,_code int
)
insert into #e1(SKU, Mkt, _Week, Cost, _code)
select 'ABC', '05', 1, 10, 100 UNION
SELECT 'ABC', '05', 2, 12, 100 union
SELECT 'DEF', '05', 3, 20, 100 UNION
SELECT 'DEF', '05', 3, 25, 125 UNION
SELECT 'XYZ', '08', 1, 10, 100 UNION
SELECT 'XYZ', '08', 2, 12, 100 UNION
SELECT 'XZY', '08', 2, 14, 125
delete s
from
#e1 s
JOIN
(
SELECT SKU, Mkt, _Week
FROM #e1
group by
SKU, Mkt, _Week
having count(1) > 1
) m
ON
s.SKU = m.sku and s.mkt = m.mkt and s._Week = m._Week
WHERE s._code = 100
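For anyone who wants to run the GROUP BY ... HAVING approach locally, here is a rough sketch assuming Python's sqlite3 module instead of SQL Server. SQLite has no DELETE ... JOIN, so a row-value IN subquery is swapped in for the join; the helper name delete_code_100_duplicates is hypothetical.

```python
import sqlite3

def delete_code_100_duplicates(rows):
    """Delete code-100 rows whose (SKU, Mkt, _Week) occurs more than once."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE e1 (SKU TEXT, Mkt TEXT, _Week INTEGER, Cost INTEGER, _code INTEGER)"
    )
    conn.executemany("INSERT INTO e1 VALUES (?, ?, ?, ?, ?)", rows)
    conn.execute("""
        DELETE FROM e1
        WHERE _code = 100
          AND (SKU, Mkt, _Week) IN (      -- replaces the JOIN on the grouped subquery
              SELECT SKU, Mkt, _Week
              FROM e1
              GROUP BY SKU, Mkt, _Week
              HAVING COUNT(*) > 1)
    """)
    remaining = conn.execute(
        "SELECT SKU, Mkt, _Week, Cost, _code FROM e1 ORDER BY SKU, _Week, Cost"
    ).fetchall()
    conn.close()
    return remaining

sample = [
    ("ABC", "05", 1, 10, 100), ("ABC", "05", 2, 12, 100),
    ("DEF", "05", 3, 20, 100), ("DEF", "05", 3, 25, 125),
    ("XYZ", "08", 1, 10, 100), ("XYZ", "08", 2, 12, 100),
    ("XZY", "08", 2, 14, 125),
]
for row in delete_code_100_duplicates(sample):
    print(row)
```

With the question's data as typed (including the XZY row), only the DEF row with code 100 is removed.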
Create table #tab1 (SKU varchar(50),Mkt varchar(50),[Week] varchar(50),Cost varchar(50),Code varchar(50))
insert into #tab1
select 'ABC','05','1','10','100'
union
select 'ABC','05','2','12','100'
union
select 'DEF','05','3','20','100'
union
select 'DEF','05','3','25','125'
union
select 'XYZ','08','1','10','100'
union
select 'XYZ','08','2','12','100'
union
select 'XYZ','08','2','14','125'
delete t from #tab1 t
inner join (select t1.SKU,t1.Mkt,t1.[Week],t1.Cost as Cost,t1.Code as Code,ROW_NUMBER()over(partition by t1.SKU,t1.Mkt,t1.[Week] order by t1.Cost desc,t1.Code desc ) as rno
from #tab1 t1
) c on c.SKU = t.SKU and c.Mkt = t.Mkt and c.Cost = t.Cost and c.[Week] = t.[Week] and c.Code = t.Code
where c.rno = 2
select * from #tab1
Output:
SKU Mkt Week Cost Code
ABC 05 1 10 100
ABC 05 2 12 100
DEF 05 3 25 125
XYZ 08 1 10 100
XYZ 08 2 14 125

How can I select distinct by one column?

I have a table with the columns below. If a COD is duplicated, I need to get the row with the non-NULL VALUE. If it is not duplicated, the VALUE can be NULL, as in the example:
I'm using SQL SERVER.
This is what I get:
COD ID VALUE
28 1 NULL
28 2 Supermarket
29 1 NULL
29 2 School
29 3 NULL
30 1 NULL
This is what I want:
COD ID VALUE
28 2 Supermarket
29 2 School
30 1 NULL
What I'm tryin' to do:
;with A as (
(select DISTINCT COD,ID,VALUE from CodId where ID = 2)
UNION
(select DISTINCT COD,ID,NULL from CodId where ID != 2)
)select * from A order by COD
You can try this.
-- Table variables are declared with @
DECLARE @T TABLE (COD INT, ID INT, VALUE VARCHAR(20))
INSERT INTO @T
VALUES(28, 1, NULL),
(28, 2 ,'Supermarket'),
(29, 1 ,NULL),
(29, 2 ,'School'),
(29, 3 ,NULL),
(30, 1 ,NULL)
;WITH CTE AS (
SELECT *, RN= ROW_NUMBER() OVER (PARTITION BY COD ORDER BY VALUE DESC) FROM @T
)
SELECT COD, ID ,VALUE FROM CTE
WHERE RN = 1
Result:
COD ID VALUE
----------- ----------- --------------------
28 2 Supermarket
29 2 School
30 1 NULL
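The trick relies on NULL sorting last under ORDER BY VALUE DESC, so a non-NULL row gets row number 1 whenever one exists. Here is a minimal sketch of that, assuming Python's sqlite3 module (SQLite 3.25 or later, which also sorts NULL last in descending order) rather than SQL Server. The helper name pick_one_per_cod and the extra ID tiebreaker are my additions.

```python
import sqlite3

def pick_one_per_cod(rows):
    """Keep one row per COD, preferring a non-NULL VALUE."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE CodId (COD INTEGER, ID INTEGER, VALUE TEXT)")
    conn.executemany("INSERT INTO CodId VALUES (?, ?, ?)", rows)
    picked = conn.execute("""
        WITH cte AS (
            SELECT COD, ID, VALUE,
                   -- NULL sorts last under DESC; ID breaks ties deterministically
                   ROW_NUMBER() OVER (PARTITION BY COD
                                      ORDER BY VALUE DESC, ID) AS rn
            FROM CodId
        )
        SELECT COD, ID, VALUE FROM cte WHERE rn = 1 ORDER BY COD
    """).fetchall()
    conn.close()
    return picked

sample = [(28, 1, None), (28, 2, "Supermarket"), (29, 1, None),
          (29, 2, "School"), (29, 3, None), (30, 1, None)]
print(pick_one_per_cod(sample))
# [(28, 2, 'Supermarket'), (29, 2, 'School'), (30, 1, None)]
```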
Another option is to use the WITH TIES clause in concert with Row_Number()
Example
Select top 1 with ties *
from YourTable
Order By Row_Number() over (Partition By [COD] order by Value Desc)
Returns
COD ID VALUE
28 2 Supermarket
29 2 School
30 1 NULL
I would use GROUP BY and JOIN. If there is no non-NULL value for a COD, then MAX(value) is NULL and the row is matched by the OR in the JOIN clause.
SELECT your_table.*
FROM your_table
JOIN (
SELECT COD, MAX(value) value
FROM your_table
GROUP BY COD
) gt ON your_table.COD = gt.COD and (your_table.value = gt.value OR gt.value IS NULL)
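A quick way to see how the OR branch behaves is to run the same join on a small data set. This sketch assumes Python's sqlite3 module in place of SQL Server; the table name CodId is taken from the question, and the helper name rows_with_max_value is hypothetical.

```python
import sqlite3

def rows_with_max_value(rows):
    """Join each row to its COD's MAX(VALUE); all-NULL CODs match via IS NULL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE CodId (COD INTEGER, ID INTEGER, VALUE TEXT)")
    conn.executemany("INSERT INTO CodId VALUES (?, ?, ?)", rows)
    result = conn.execute("""
        SELECT t.COD, t.ID, t.VALUE
        FROM CodId AS t
        JOIN (SELECT COD, MAX(VALUE) AS max_value   -- MAX ignores NULLs
              FROM CodId
              GROUP BY COD) AS gt
          ON t.COD = gt.COD
         AND (t.VALUE = gt.max_value OR gt.max_value IS NULL)
        ORDER BY t.COD, t.ID
    """).fetchall()
    conn.close()
    return result

sample = [(28, 1, None), (28, 2, "Supermarket"), (29, 1, None),
          (29, 2, "School"), (29, 3, None), (30, 1, None)]
print(rows_with_max_value(sample))
# [(28, 2, 'Supermarket'), (29, 2, 'School'), (30, 1, None)]
```

One thing to watch: if a COD has several rows and all of them are NULL, every one of those rows is returned, so this is not strictly one row per COD.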
If you may have more than one non null value for a COD this will work
drop table MyTable
CREATE TABLE MyTable
(
COD INT,
ID INT,
VALUE VARCHAR(20)
)
INSERT INTO MyTable
VALUES (28,1, NULL),
(28,2,'Supermarket'),
(28,3,'School'),
(29,1,NULL),
(29,2,'School'),
(29,3,NULL),
(30,1,NULL);
WITH Dups AS
(SELECT COD FROM MyTable GROUP BY COD HAVING count (*) > 1 )
SELECT MyTable.COD,MyTable.ID,MyTable.VALUE FROM MyTable
INNER JOIN dups ON MyTable.COD = Dups.COD
WHERE value IS NOT NULL
UNION
SELECT MyTable.COD,MyTable.ID,MyTable.VALUE FROM MyTable
LEFT JOIN dups ON MyTable.COD = Dups.COD
WHERE dups.cod IS NULL

select based on specific values

I have this table:
ID NO.
111 6
222 7
333 9
111 8
333 4
222 3
111 7
222 5
333 2
I want to select only 2 ID numbers from the table where the NO. column equals specific values.
For example, I tried this query but I didn't get the expected result:
SELECT top 2 * FROM mytable where NO. in
(select NO. from mytable )
Expected result:
111 6
111 8
222 7
222 3
333 9
333 3
You seem to want to select two rows in the table for each id, based on a condition on the No column. For this, one method uses row_number():
select t.*
from (select t.*, row_number() over (partition by id order by id) as seqnum
from mytable t
where <condition goes here>
) t
where seqnum <= 2;
I'm guessing (333,3) is a mistake and you expect (333,2). If not I have no idea.
SELECT
ua.ID
, ua.[NO.]
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY t.[NO.] ASC) AS RowNum
, t.ID
, t.[NO.]
FROM dbo.t1 AS t
UNION ALL
SELECT
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY t.[NO.] DESC)
, ID
, t.[NO.]
FROM dbo.t1 AS t
) ua
WHERE ua.RowNum = 1
ORDER BY ID, ua.[NO.] DESC
If you're just trying to get top 2 values for each group, you need something to define the order, ie. a third column. Then you don't need UNION ALL, just use WHERE ua.RowNum < 3.
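As a sketch of the "top 2 per group" variant, the following assumes Python's sqlite3 module (SQLite 3.25 or later) in place of SQL Server. Since the sample table has no third column, ordering by NO. descending is purely an assumption made for illustration, and the helper name top_n_per_id is hypothetical.

```python
import sqlite3

def top_n_per_id(rows, n=2):
    """Return the n rows with the highest NO. for each ID."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE mytable (ID INTEGER, "NO." INTEGER)')
    conn.executemany("INSERT INTO mytable VALUES (?, ?)", rows)
    result = conn.execute("""
        SELECT ID, "NO."
        FROM (SELECT ID, "NO.",
                     -- assumption: "NO." itself defines the order within an ID
                     ROW_NUMBER() OVER (PARTITION BY ID
                                        ORDER BY "NO." DESC) AS rn
              FROM mytable) AS ranked
        WHERE rn <= ?
        ORDER BY ID, "NO." DESC
    """, (n,)).fetchall()
    conn.close()
    return result

sample = [(111, 6), (222, 7), (333, 9), (111, 8), (333, 4),
          (222, 3), (111, 7), (222, 5), (333, 2)]
print(top_n_per_id(sample))
# [(111, 8), (111, 7), (222, 7), (222, 5), (333, 9), (333, 4)]
```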
/*Select 2 random rows per id where the number of rows per id can vary between 1 and infinity
A good article for this:-*/
--https://www.mssqltips.com/sqlservertip/3157/different-ways-to-get-random-data-for-sql-server-data-sampling/
-- Table variables are declared with @
DECLARE @TABLE TABLE(ID INT,NO INT)
INSERT INTO @TABLE
VALUES
(111, 6),
(222, 7),
(333 , 9),
(111 , 8),
(333 , 4),
(222 , 3),
(111 , 7),
(222 , 5),
(333 , 2)
select t.* from
(
Select s.* ,ROW_NUMBER() OVER(PARTITION BY ID ORDER BY randomnumber) ROWNUMBER
from
(
SELECT ID,NO,
(ABS(CHECKSUM(NEWID())) % 100001) + ((ABS(CHECKSUM(NEWID())) % 100001) * 0.00001) [randomnumber]
FROM @TABLE
) s
) t
where t.rownumber < 3

Return results from one of two tables, based on max sum of costs

I have two tables with costs. One set is what was actually recorded, and one is estimates based on the brand.
What I want to do is report on whichever is higher.
Sample data is:
ParentTable:
GroupId, TransactionId, Otherinfo.....
123, 4444, ...
530, 2311, ...
201, 1111, ...
ActualData
TransactionId, Product, Cost
4444, 3039, 100
4444, 3002, 4000
2311, 3004, 693
EstimateData
GroupId, Brand, Cost
123, 33, 80
123, 42, 3000
530, 222, 1200
201, 121, 4040
In this situation, what I want to return is a table that contains
GroupId, Code, Cost
123, 3039, 100 <- Actual data
123, 3002, 4000 <- Actual data
530, 222, 1200 <- Estimate data
201, 121, 4040 <- Estimate data
Currently I am looking at first doing a select from both tables, returning GroupId with Max(cost). I'm struggling on how to use this to return the results I want.
Can anyone help me?
EDIT Added in the parent table.. It doesn't really change things, but might give more insight as to the data
If I understand correctly, you want to select every row of a group from the table where its total is the largest. The cte contains all groupId's and which table the largest total comes from. The union then uses the cte to only select rows belonging to the largest groups for each table.
-- ActualData has no GroupId column, so it is reached through ParentTable
with cte as (
select * from (
select source, GroupId,
row_number() over (partition by GroupId order by total_cost desc) rn
from (
select 'ad' source, p.GroupId, sum(ad.Cost) total_cost
from ActualData ad
join ParentTable p on p.TransactionId = ad.TransactionId
group by p.GroupId
union all
select 'ed' source, GroupId, sum(Cost) total_cost
from EstimateData ed
group by GroupId
) t1
) t2 where rn = 1
)
select p.GroupId, ad.Product Code, ad.Cost from ActualData ad
join ParentTable p on p.TransactionId = ad.TransactionId
where p.GroupId in (select GroupId from cte where source = 'ad')
union all
select GroupId, Brand Code, Cost from EstimateData ed
where GroupId in (select GroupId from cte where source = 'ed')
The relationship is definitely fuzzy; however, the expected result
| GROUPID | CODE | COST |
|---------|------|------|
| 123 | 3039 | 100 |
| 123 | 3002 | 4000 |
| 201 | 121 | 4040 |
| 530 | 222 | 1200 |
was produced by this query:
WITH
acte AS (
SELECT p.GroupId, ad.Product, ad.Cost
, SUM(ad.cost) OVER (PARTITION BY Groupid) AS grp_cost
FROM ActualData AS ad
INNER JOIN parenttable p ON ad.TransactionId = p.TransactionId
),
ecte AS (
SELECT GroupId, Brand, SUM(Cost) AS cost
FROM EstimateData
GROUP BY
GroupId
, Brand
)
SELECT acte.GroupId, acte.Product AS Code, acte.Cost
FROM acte
WHERE NOT EXISTS (
SELECT
NULL
FROM ecte
WHERE ecte.GroupId = acte.GroupId
AND ecte.cost > acte.grp_cost
)
UNION ALL
SELECT ecte.GroupId, ecte.Brand AS Code, ecte.Cost
FROM ecte
WHERE NOT EXISTS (
SELECT
NULL
FROM acte
WHERE acte.GroupId = ecte.GroupId
AND acte.grp_cost > ecte.cost
)
;
See this SQLfiddle demo
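To check the "whichever total is higher" logic on the question's sample data, here is a minimal sketch assuming Python's sqlite3 module rather than the original database. It compares per-group totals from both sources and avoids window functions. The helper name higher_cost_rows is hypothetical, and sending ties to the actual data is an assumption the question does not settle.

```python
import sqlite3

def higher_cost_rows(parent, actual, estimate):
    """Return detail rows from whichever source has the larger total per group."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ParentTable (GroupId INTEGER, TransactionId INTEGER)")
    conn.execute("CREATE TABLE ActualData (TransactionId INTEGER, Product INTEGER, Cost INTEGER)")
    conn.execute("CREATE TABLE EstimateData (GroupId INTEGER, Brand INTEGER, Cost INTEGER)")
    conn.executemany("INSERT INTO ParentTable VALUES (?, ?)", parent)
    conn.executemany("INSERT INTO ActualData VALUES (?, ?, ?)", actual)
    conn.executemany("INSERT INTO EstimateData VALUES (?, ?, ?)", estimate)
    result = conn.execute("""
        WITH actual_totals AS (
            SELECT p.GroupId, SUM(a.Cost) AS total
            FROM ActualData a
            JOIN ParentTable p ON p.TransactionId = a.TransactionId
            GROUP BY p.GroupId
        ),
        estimate_totals AS (
            SELECT GroupId, SUM(Cost) AS total FROM EstimateData GROUP BY GroupId
        )
        SELECT p.GroupId, a.Product AS Code, a.Cost
        FROM ActualData a
        JOIN ParentTable p ON p.TransactionId = a.TransactionId
        JOIN actual_totals atot ON atot.GroupId = p.GroupId
        LEFT JOIN estimate_totals etot ON etot.GroupId = p.GroupId
        WHERE etot.total IS NULL OR atot.total >= etot.total  -- ties go to actual (assumption)
        UNION ALL
        SELECT e.GroupId, e.Brand, e.Cost
        FROM EstimateData e
        JOIN estimate_totals etot ON etot.GroupId = e.GroupId
        LEFT JOIN actual_totals atot ON atot.GroupId = e.GroupId
        WHERE atot.total IS NULL OR etot.total > atot.total
        ORDER BY 1, 2
    """).fetchall()
    conn.close()
    return result

parent = [(123, 4444), (530, 2311), (201, 1111)]
actual = [(4444, 3039, 100), (4444, 3002, 4000), (2311, 3004, 693)]
estimate = [(123, 33, 80), (123, 42, 3000), (530, 222, 1200), (201, 121, 4040)]
for row in higher_cost_rows(parent, actual, estimate):
    print(row)
```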

In Oracle, how do I get a page of distinct values from sorted results?

I have 2 columns in a one-to-many relationship. I want to sort on the "many" and return the first occurrence of the "one". I need to page through the data so, for example, I need to be able to get the 3rd group of 10 unique "one" values.
I have a query like this:
SELECT id, name
FROM table1
INNER JOIN table2 ON table2.fkid = table1.id
ORDER BY name, id;
There can be multiple rows in table2 for each row in table1.
The results of my query look like this:
id | name
----------------
2 | apple
23 | banana
77 | cranberry
23 | dark chocolate
8 | egg
2 | yak
19 | zebra
I need to page through the result set with each page containing n unique ids. For example, if start=1 and n=4 I want to get back
2
23
77
8
in the order they were sorted on (i.e., name), where id is returned in the position of its first occurrence. Likewise if start=3 and n=4 and order = desc I want
8
23
77
2
I tried this:
SELECT * FROM (
SELECT id, ROWNUM rnum FROM (
SELECT DISTINCT id FROM (
SELECT id, name
FROM table1
INNER JOIN table2 ON table2.fkid = table1.id
ORDER BY name, id)
WHERE ROWNUM <= 4)
WHERE rnum >=1)
which gave me the ids in numerical order, instead of being ordered as the names would be.
I also tried:
SELECT * FROM (
SELECT DISTINCT id, ROWNUM rnum FROM (
SELECT id FROM (
SELECT id, name
FROM table1
INNER JOIN table2 ON table2.fkid = table1.id
ORDER BY name, id)
WHERE ROWNUM <= 4)
WHERE rnum >=1)
but that gave me duplicate values.
How can I page through the results of this data? I just need the ids, nothing from the "many" table.
update
I suppose I'm getting closer with changing my inner query to
SELECT id, name, rank() over (order by name, id)
FROM table1
INNER JOIN table2 ON table2.fkid = table1.id
...but I'm still getting duplicate ids.
You may need to debug it a little, but it will be something like this:
SELECT * FROM (
SELECT * FROM (
SELECT id, name, rn FROM (
SELECT id, name, row_number() over (partition by id order by name) rn
FROM table1
INNER JOIN table2 ON table2.fkid = table1.id
)
) WHERE rn=1 ORDER BY name, id
) WHERE rownum>=1 and rownum<=4;
It's a bit convoluted (and I suspect it could be simplified), but it should work. You can put whatever start and end positions you want in the WHERE clause. Here start=2 and n=4 are pulled from a separate subquery (x), but you could simplify things by using a couple of parameters instead.
SQL> ed
Wrote file afiedt.buf
with t as (
select 2 id, 'apple' name from dual union all
select 23, 'banana' from dual union all
select 77, 'cranberry' from dual union all
select 23, 'dark chocolate' from dual union all
select 8, 'egg' from dual union all
select 2, 'yak' from dual union all
select 19, 'zebra' from dual
),
x as (
select 2 start_pos, 4 n from dual
)
select *
from (
select distinct
id,
dense_rank() over (order by min_id_rnk) outer_rnk
from (
select id,
min(rnk) over (partition by id) min_id_rnk
from (
select id,
name,
rank() over (order by name) rnk
from t
)
)
)
where outer_rnk between (select start_pos from x) and (select start_pos+n-1 from x)
order by outer_rnk
SQL> /
ID OUTER_RNK
---------- ----------
23 2
77 3
8 4
19 5
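The same "order ids by their first occurrence, then take a page" idea can be sketched more compactly outside Oracle. The following assumes Python's sqlite3 module and swaps in GROUP BY with MIN(name) plus LIMIT/OFFSET for the ranking functions and ROWNUM, so it is not a drop-in Oracle query. The table name joined and the helper name page_of_ids are hypothetical, and only ascending order is covered.

```python
import sqlite3

def page_of_ids(rows, start, n):
    """Return n distinct ids beginning at 1-based position start, ordered by first name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE joined (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO joined VALUES (?, ?)", rows)
    page = conn.execute("""
        SELECT id
        FROM joined
        GROUP BY id
        ORDER BY MIN(name), id      -- position of each id's first occurrence
        LIMIT ? OFFSET ?
    """, (n, start - 1)).fetchall()
    conn.close()
    return [row[0] for row in page]

sample = [(2, "apple"), (23, "banana"), (77, "cranberry"),
          (23, "dark chocolate"), (8, "egg"), (2, "yak"), (19, "zebra")]
print(page_of_ids(sample, 1, 4))  # [2, 23, 77, 8]
print(page_of_ids(sample, 2, 4))  # [23, 77, 8, 19]
```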