Hive query optimization - sql

My requirement is to get the id and name of the students who have more than one email ID and type=1.
I am using a query like this:
select distinct b.id, b.name, b.email, b.type,a.cnt
from (
select id, count(email) as cnt
from (
select distinct id, email
from table1
) c
group by id
) a
join table1 b on a.id = b.id
where b.type=1
order by b.id
Please let me know whether this is fine or whether a simpler version is available.
Sample data is like:
id name email type
123 AAA abc#xyz.com 1
123 AAA acd#xyz.com 1
123 AAA ayx#xyz.com 3
345 BBB nch#xyz.com 1
345 BBB nch#xyz.com 1
678 CCC iuy#xyz.com 1
Expected Output:
123 AAA abc#xyz.com 1 2
123 AAA acd#xyz.com 1 2
345 BBB nch#xyz.com 1 1
678 CCC iuy#xyz.com 1 1

You can use GROUP BY with HAVING COUNT() for this requirement:
select distinct b.id
, b.name
, b.email
, b.type
from table1 b
where b.id in
(select id from table1 group by id having count(distinct email) > 1)
and b.type=1
order by b.id

You can try the analytic form of the count() function:
SELECT sub.ID, sub.NAME
FROM (SELECT ID, NAME, TYPE, COUNT (*) OVER (PARTITION BY ID, EMAIL) cnt
FROM raw.crddacia_raw) sub
WHERE sub.cnt > 1 AND sub.TYPE = 1

I strongly recommend using window functions. However, Hive does not support count(distinct) as a window function. There are different methods to solve this. One is to add an ascending and a descending dense_rank() and subtract 1:
select id, name, email, type, cnt
from (select t1.*,
(dense_rank() over (partition by id order by email) +
dense_rank() over (partition by id order by email desc) - 1
) as cnt
from table1 t1
) t
where type = 1;
I would expect this to have better performance than your version. However, it is worth testing different versions to see which has the better performance (and feel free to come back to let others know which is better).
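To see why the two dense ranks count distinct values: within one id, the ascending and descending ranks of any email add up to one more than the number of distinct emails, so subtracting 1 gives the distinct count. Below is a minimal sketch that checks this against a plain COUNT(DISTINCT). It assumes SQLite 3.25 or later (through Python's sqlite3 module) as a stand-in for Hive, so it only illustrates the logic and says nothing about Hive performance.

```python
import sqlite3

# Assumption: SQLite (3.25+ for window functions) stands in for Hive here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, name TEXT, email TEXT, type INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        (123, "AAA", "abc#xyz.com", 1),
        (123, "AAA", "acd#xyz.com", 1),
        (123, "AAA", "ayx#xyz.com", 3),
        (345, "BBB", "nch#xyz.com", 1),
        (345, "BBB", "nch#xyz.com", 1),
        (678, "CCC", "iuy#xyz.com", 1),
    ],
)

# Distinct emails per id via the two dense ranks (ascending + descending - 1).
ranked = dict(conn.execute("""
    SELECT DISTINCT id, cnt
    FROM (SELECT id,
                 DENSE_RANK() OVER (PARTITION BY id ORDER BY email)
               + DENSE_RANK() OVER (PARTITION BY id ORDER BY email DESC)
               - 1 AS cnt
          FROM table1)
""").fetchall())

# Reference answer using an ordinary aggregate.
grouped = dict(conn.execute(
    "SELECT id, COUNT(DISTINCT email) FROM table1 GROUP BY id"
).fetchall())

print(ranked)
print(grouped)
```

Note that the count here covers all types, because the type = 1 filter in the query above is applied after the window functions are evaluated.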

One more method uses collect_set and takes the size of the returned array to count the distinct emails.
Demo:
--your data example
with table1 as ( --use your table instead of this
select stack(6,
123, 'AAA', 'abc#xyz.com', 1,
123, 'AAA', 'acd#xyz.com', 1,
123, 'AAA', 'ayx#xyz.com', 3,
345, 'BBB', 'nch#xyz.com', 1,
345, 'BBB', 'nch#xyz.com', 1,
678, 'CCC', 'iuy#xyz.com', 1
) as (id, name, email, type )
)
--query
select distinct id, name, email, type,
size(collect_set(email) over(partition by id)) cnt
from table1
where type=1
Result:
id name email type cnt
123 AAA abc#xyz.com 1 2
123 AAA acd#xyz.com 1 2
345 BBB nch#xyz.com 1 1
678 CCC iuy#xyz.com 1 1
We still need DISTINCT here because an analytic function does not remove duplicate rows, as in the case of 345 BBB nch#xyz.com.

This is very similar to your query, but here I filter the data in the first step (the inner query) so that the join runs on less data:
select distinct orig_table.id, orig_table.name, orig_table.email, orig_table.type, intr_table.cnt
from table1 orig_table
join (
select a.id, a.type, count(distinct a.email) as cnt from table1 a where a.type=1 group by a.id, a.type
) intr_table on intr_table.id=orig_table.id and intr_table.type=orig_table.type

Related

Count on UNION in Oracle

I have 3 tables and I need to get the details from 2 of them where the count of the UNION is greater than 1. But I need to apply certain conditions as well.
Table A
id entity_id name category
1 45 abcd win_1
2 46 efgh win_2
3 47 efgh1 win_2
4 48 dfgh win_5
5 49 adfgh win_4
Table B
id product_id name parent_id
1 P123 asdf win_1
2 P234 adfgh win_4
Table 3 category_list
id cat_id name
1 win_1 Households
2 win_2 Outdoors
3 win_3 Mixed
4 win_4 Omni
Now I need to have the count of UNION from Table A and Table B where they have count of cat_id greater than 1 and Table A.name != Table B.name
The result which I require is
p_id name cat_id
45 abcd win_1
P123 asdf win_1
46 efgh win_2
47 efgh1 win_2
win_5 is excluded as its count is one, and win_4 should be excluded as the name in Table A and Table B is the same.
I have run out of ideas as I am relatively new to Oracle and databases. Any help is appreciated.
I think you can use exists to ensure that the cat_id is present in both tables:
select entity_id as p_id, name, category as cat_id
from table_a a
where exists (select null from table_b where a.category = table_b.parent_id)
union
select product_id, name, parent_id
from table_b b
where exists (select null from table_a where b.parent_id = table_a.category)
I believe you are looking for something like this -
Select T2.*
from
(Select category
from
(Select name, category from TableA
Union all
Select name, parent_id as category from TableB) t
group by category
having count(distinct name) > 1) T1
Join
(Select entity_id as Pid, name, category from TableA
Union
Select product_id as Pid, name, parent_id as category from TableB) T2
ON T1.category = T2.category;
You could try this code.
The first CTE (common table expression), "list_union", takes from each table the records whose names do not appear in the other table and unions them. The second CTE, "list_cnt", counts the rows per category, and the final SELECT keeps the rows with cnt > 1, as in your expected result.
With
list_union AS (
SELECT
id,
----------
TO_CHAR(entity_id) entity_id,
----------
name,
category
FROM table_A a
WHERE NOT EXISTS(SELECT 1 FROM table_B b WHERE a.name=b.name)
----------
UNION ALL
----------
SELECT
id,
product_id,
name,
parent_id
FROM table_B b
WHERE NOT EXISTS(SELECT 1 FROM table_A a WHERE a.name=b.name)
)
,list_cnt AS (
SELECT
l.*,
----------
COUNT(*) over (PARTITION BY category) cnt
----------
FROM list_union l
)
SELECT
entity_id AS p_id,
name,
category AS cat_id
FROM list_cnt
WHERE cnt>1
ORDER BY cat_id ASC, p_id ASC
;
Just use a union all and window functions:
select ab.*
from (select ab.*,
count(distinct name) over (partition by category) as cnt
from ((select a.* from a
) union all
(select b.* from b
)
) ab
) ab
where cnt > 1;
Although you describe the problem as:
Now I need to have the count of UNION from Table A and Table B where they have count of cat_id greater than 1 and Table A.name != Table B.name
You seem to just want cat_ids that have different names across the two tables. Your sample data includes cat_id = 'win_2', which is not even in the second table.
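As a quick check of that reading (keep the categories that have more than one distinct name across the two tables), here is a minimal sketch on the sample data. It assumes SQLite through Python's sqlite3 module rather than Oracle, the table names table_a and table_b, and text-typed id columns so that the UNION ALL needs no conversion. It uses a GROUP BY subquery instead of a window function.

```python
import sqlite3

# Assumption: SQLite stands in for Oracle; entity_id is stored as text so it
# can share a column with product_id without TO_CHAR.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER, entity_id TEXT, name TEXT, category TEXT);
    CREATE TABLE table_b (id INTEGER, product_id TEXT, name TEXT, parent_id TEXT);
    INSERT INTO table_a VALUES
        (1, '45', 'abcd',  'win_1'),
        (2, '46', 'efgh',  'win_2'),
        (3, '47', 'efgh1', 'win_2'),
        (4, '48', 'dfgh',  'win_5'),
        (5, '49', 'adfgh', 'win_4');
    INSERT INTO table_b VALUES
        (1, 'P123', 'asdf',  'win_1'),
        (2, 'P234', 'adfgh', 'win_4');
""")

rows = conn.execute("""
    WITH ab AS (
        SELECT entity_id AS p_id, name, category AS cat_id FROM table_a
        UNION ALL
        SELECT product_id, name, parent_id FROM table_b
    )
    SELECT p_id, name, cat_id
    FROM ab
    WHERE cat_id IN (SELECT cat_id
                     FROM ab
                     GROUP BY cat_id
                     HAVING COUNT(DISTINCT name) > 1)
    ORDER BY cat_id, p_id
""").fetchall()

for row in rows:
    print(row)
```

win_5 drops out because it has a single row, and win_4 drops out because both of its rows carry the same name.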

SQL select youngest record

I have a table. I want to run a SQL query that selects the youngest record per ID, and I also need to output all the other columns associated with the youngest row. In the real table, there are more than 500 columns.
Please note, I am using AWS Athena. The table has no indexes.
ID COL1 COL2 LAST_UPDATED
1 yyy ddd 01/01/2020
1 ccc eee 12/01/2020
2 xxx ddd 02/01/2020
2 vvv eee 19/01/2020
Desired result:
ID COL1 COL2 LAST_UPDATED
1 ccc eee 12/01/2020
2 vvv eee 19/01/2020
I found a solution that uses ROW_NUMBER() OVER (PARTITION BY ...):
SELECT *
FROM (
SELECT id, updated_at, ROW_NUMBER() OVER(PARTITION BY id ORDER BY updated_at desc) rn
from table t
)
where rn = 1
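Try using the query below. The maximum has to be matched per id; otherwise a row could be returned just because its date equals the maximum of a different id:
select a.*
from aws a
join (select id, max(last_updated) as last_updated from aws group by id) m
on a.id = m.id and a.last_updated = m.last_updated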
A typical and efficient way in most databases is to use a correlated subquery:
select t.*
from t
where t.LAST_UPDATED = (select max(t2.LAST_UPDATED)
from t t2
where t2.id = t.id
);
For performance, you want an index on (id, LAST_UPDATED).
In a database that doesn't have indexes, use row_number():
select t.*
from (select t.*, row_number() over (partition by id order by last_updated desc) as seqnum
from t
) t
where seqnum = 1;
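A minimal sketch of the row_number() approach on the sample rows, assuming SQLite 3.25 or later through Python's sqlite3 module as a stand-in for Athena. The dates are stored as ISO strings (yyyy-mm-dd) here, which is an assumption made so that ordering the text also orders the dates; dd/mm/yyyy strings would not sort chronologically.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for Athena; dates are ISO-formatted text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, col1 TEXT, col2 TEXT, last_updated TEXT);
    INSERT INTO t VALUES
        (1, 'yyy', 'ddd', '2020-01-01'),
        (1, 'ccc', 'eee', '2020-01-12'),
        (2, 'xxx', 'ddd', '2020-01-02'),
        (2, 'vvv', 'eee', '2020-01-19');
""")

# Number the rows of each id from newest to oldest, then keep the newest one.
rows = conn.execute("""
    SELECT id, col1, col2, last_updated
    FROM (SELECT t.*,
                 ROW_NUMBER() OVER (PARTITION BY id
                                    ORDER BY last_updated DESC) AS seqnum
          FROM t) ranked
    WHERE seqnum = 1
    ORDER BY id
""").fetchall()

for row in rows:
    print(row)
```

With 500+ columns, selecting t.* in the inner query carries every column through; the outer query then only has to drop the helper column seqnum if it is not wanted.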

Sum analytical function or any other easy way

I have the data below and need to select all columns along with the sum of one column:
id size desc1 desc2
1 13 xxx yyy
1 13 xxx yyy
1 10 mmm kkk
1 10 mmm kkk
I need the output below:
id total_size desc1 desc2
1 23 xxx yyy
1 23 xxx yyy
1 23 mmm kkk
1 23 mmm kkk
total_size should be sum (distinct size)
select a.id
,a.size
,sum(b.size) as 'total_size'
,a.desc1
,a.desc2
from (
select *, row_number() over (order by id, size, desc1, desc2) as 'RowNumber'
from #tmp
) a
left join (
select *, row_number() over(partition by id, size order by id) as 'dupe'
from #tmp
) b
on a.id = b.id
and b.dupe=1
group by a.RowNumber
,a.id
,a.size
,a.desc1
,a.desc2
Not here to argue, but you should really consider reviewing the data structure you're working with.
1. Select your data, adding a column to number the rows
2. Join a copy of your data (with distinct records only)
3. Sum the size column from the list of distinct records
You just need to add sum(distinct "size") over (partition by id) to compute the total_size column for each row in your SQL:
with tab(id,"size","desc1","desc2") as
(
select 1 ,13,'xxx','yyy' from dual union all
select 1 ,13,'xxx','yyy' from dual union all
select 1 ,10,'mmm','kkk' from dual union all
select 1 ,10,'mmm','kkk' from dual
)
select t.id,
sum(distinct t."size") over (partition by id) as "total_size",
t."desc1",t."desc2"
from tab t;
P.S. size is a reserved keyword, so it cannot be used as a column name unless it is quoted as "size".
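Not every engine accepts DISTINCT inside a window aggregate. As a hedged alternative, the same total can be computed once per id with an ordinary SUM(DISTINCT ...) and joined back to every row. The sketch below assumes SQLite through Python's sqlite3 module (where size needs no quoting) and a table named tab; it is a swapped-in technique, not the Oracle query above.

```python
import sqlite3

# Assumption: SQLite stands in for the real database; "tab" is a made-up name.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tab (id INTEGER, size INTEGER, desc1 TEXT, desc2 TEXT);
    INSERT INTO tab VALUES
        (1, 13, 'xxx', 'yyy'),
        (1, 13, 'xxx', 'yyy'),
        (1, 10, 'mmm', 'kkk'),
        (1, 10, 'mmm', 'kkk');
""")

# Compute the distinct sum per id once, then attach it to each original row.
rows = conn.execute("""
    SELECT t.id, s.total_size, t.desc1, t.desc2
    FROM tab t
    JOIN (SELECT id, SUM(DISTINCT size) AS total_size
          FROM tab
          GROUP BY id) s
      ON s.id = t.id
    ORDER BY t.rowid
""").fetchall()

for row in rows:
    print(row)
```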

SQL Server query - get items that exist in more than one column

I have a simple table which contains barcode ids of tools and associated room location in which the tool should belong to.
Unfortunately, I've noticed that some users have entered the same barcode id for another room location.
For example, I have these 2 columns:
barcodeNumber | RoomLocation
--------------+-------------
123456 | 400
654321 | 300
875421 | 200
654321 | 400
999999 | 250
878787 | 300
777777 | 400
999999 | 200
Note that barcodeNumber "654321" is stored in room locations 300 & 400 and "999999" is stored in room locations 200 & 250.
How do I write the SQL query to list the duplicate barcode numbers and the room locations they are in, and not just the count of duplicates?
For example, the end result I wish to see is:
654321 | 300
654321 | 400
999999 | 200
999999 | 250
Using window functions (SQL:1999) you would get the result like this:
with c as (
select barcodeNumber, RoomLocation,
count(*) over(partition by barcodeNumber) cnt
from t)
select barcodeNumber, RoomLocation
from c where cnt > 1
order by 1,2
You can also use SQL-92 syntax:
select barcodeNumber, RoomLocation
from t
where barcodeNumber IN (
select barcodeNumber from t
group by barcodeNumber
having count(*) > 1)
order by 1,2
You can try this also. Use count(*) over (partition by barcodenumber) to determine the duplicate values.
create table #sample (barcodenumber nvarchar(30),roomlocation int)
insert into #sample (barcodenumber,roomlocation)
select '123456',400 union all
select '654321',300 union all
select '875421',200 union all
select '654321',400 union all
select '999999',250 union all
select '878787',300 union all
select '777777',400 union all
select '999999',200
select barcodenumber,roomlocation from (
select *, count(*) over (partition by barcodenumber) as rnk
from #sample
)t
group by barcodenumber,roomlocation,rnk
having rnk >1
Hope this could help.
Do you want to find the duplicate barcode?
;WITH tb(barcodenumber,roomlocation)AS(
SELECT '123456',400 UNION ALL
SELECT '654321',300 UNION ALL
SELECT '875421',200 UNION ALL
SELECT '654321',400 UNION ALL
SELECT '999999',250 UNION ALL
SELECT '878787',300 UNION ALL
SELECT '777777',400 UNION ALL
SELECT '999999',200
)
SELECT * FROM (
SELECT *,COUNT(0)OVER(PARTITION BY tb.barcodenumber) AS cnt FROM tb
) AS t WHERE t.cnt>1
barcodenumber roomlocation cnt
------------- ------------ -----------
654321 400 2
654321 300 2
999999 200 2
999999 250 2
Here is another way to achieve your result:
SELECT barcodenumber, roomlocation
FROM table_name
WHERE barcodenumber IN (
SELECT barcodenumber
FROM table_name
GROUP BY barcodenumber
HAVING COUNT(DISTINCT roomlocation) > 1);
--If you don't have duplicate rows then just use COUNT(*)
Use a JOIN and a HAVING clause:
SELECT A.barcodenumber,roomlocation
FROM #sample
JOIN
(
SELECT barcodenumber
FROM #sample
GROUP BY barcodenumber
HAVING COUNT(*) > 1
) A ON A.barcodenumber = #sample.barcodenumber
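The answers above all rest on the same idea: count the rows per barcode and keep the rows whose barcode occurs more than once. A minimal sketch of the window-function variant, assuming SQLite 3.25 or later through Python's sqlite3 module in place of SQL Server and a table named sample:

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for SQL Server; "sample" is a made-up name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (barcodenumber TEXT, roomlocation INTEGER)")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?)",
    [
        ("123456", 400), ("654321", 300), ("875421", 200), ("654321", 400),
        ("999999", 250), ("878787", 300), ("777777", 400), ("999999", 200),
    ],
)

# Attach the per-barcode row count to every row, then keep counts above 1.
rows = conn.execute("""
    SELECT barcodenumber, roomlocation
    FROM (SELECT barcodenumber, roomlocation,
                 COUNT(*) OVER (PARTITION BY barcodenumber) AS cnt
          FROM sample) counted
    WHERE cnt > 1
    ORDER BY barcodenumber, roomlocation
""").fetchall()

for row in rows:
    print(row)
```

If the same barcode can be entered twice for the same room, COUNT(*) would flag it too; the COUNT(DISTINCT roomlocation) form shown earlier avoids that.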

SQL Server group by first then ungroup?

I have a list of data that needs to be grouped, but we only want to group values that occur at least 3 times.
AA
AA
BB
CCC
CCC
CCC
return
AA 1
AA 1
BB 1
CCC 3
Thank you for your help
select data, case when total < 3 then 1 else total end total
from
(
select data, Count(Data) Total
from tbl
group by data
) g
join (select 1 union all select 2) a(b)
on a.b <= case when total < 3 then Total else 1 end
order by data
This should perform faster than LittleBobbyTables's answer most of the time.
Off the top of my head, you could get a count of everything with a count greater than 2, and then use UNION ALL to get any records not in the first query:
SELECT 'AA' AS Data
INTO #Temp
UNION ALL SELECT 'AA'
UNION ALL SELECT 'BB'
UNION ALL SELECT 'CCC'
UNION ALL SELECT 'CCC'
UNION ALL SELECT 'CCC'
SELECT Data, COUNT(Data) AS MyCount
FROM #Temp
GROUP BY Data
HAVING COUNT(Data) > 2
UNION ALL
SELECT Data, 1
FROM #Temp
WHERE Data NOT IN (
SELECT Data
FROM #Temp
GROUP BY Data
HAVING COUNT(Data) > 2
)
ORDER BY Data
DROP TABLE #Temp
Use the window functions for this:
select col, count(*) as cnt
from (select col, count(*) over (partition by col) as colcnt,
row_number() over (order by (select NULL)) as seqnum
from t
) t
group by col, (case when colcnt < 3 then seqnum else NULL end)
This calculates the total count over the column and a unique identifier for each row. The group by clause then tests for the condition. If the count is less than 3, it uses the identifier so that each row stays in its own group. Otherwise, it uses a constant value (NULL in this case), so all rows with that value fall into one group.
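A minimal sketch of that grouping trick on the sample values, assuming SQLite 3.25 or later through Python's sqlite3 module instead of SQL Server. ROW_NUMBER() OVER () is used here in place of order by (select NULL), since only a unique number per row is needed.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for SQL Server; table t has one column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?)",
    [("AA",), ("AA",), ("BB",), ("CCC",), ("CCC",), ("CCC",)],
)

# Rare values (fewer than 3 rows) are grouped by their own row number, so they
# stay separate; frequent values share the NULL key and collapse to one row.
rows = conn.execute("""
    SELECT col, COUNT(*) AS cnt
    FROM (SELECT col,
                 COUNT(*) OVER (PARTITION BY col) AS colcnt,
                 ROW_NUMBER() OVER () AS seqnum
          FROM t) numbered
    GROUP BY col, CASE WHEN colcnt < 3 THEN seqnum ELSE NULL END
    ORDER BY col
""").fetchall()

for row in rows:
    print(row)
```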