How to avoid duplicates in a Hive query

I have two tables:
table1
the_date, my_id
02/03/2021, 123
02/03/2021, 1234
02/03/2021, 12345
table2
the_date, my_id, seq, txt
02/03/2021, 1234, 1, 'OK'
02/03/2021, 12345, 1, 'OK'
02/03/2021, 12345, 2, 'HELLO HI THERE'
02/03/2021, 123456, 1, 'Ok'
Here is my code:
WITH AB AS (
    SELECT A1.my_id
    FROM DB1.table1 A1, DB1.MSG_REC A2
    WHERE A1.my_id = A2.my_id
),
BC AS (
    SELECT AB.the_date,
           COUNT(DISTINCT (CASE WHEN (TXT like '%OK%') THEN AB.my_id ELSE NULL END)) AS CASE1,
           COUNT(DISTINCT (CASE WHEN (TXT like '%HELLO HI THERE%') THEN AB.my_id ELSE NULL END)) AS CASE2
    FROM AB left JOIN DB1.my_id BC ON AB.my_id = BC.my_id
The issue with the above is that the value '12345' is processed twice, because it satisfies both CASE expressions.
That duplicates data in the count metrics. Is there a way to evaluate the first case and then the second, while excluding from the second any "my_id" records already matched by the first?
For example, when the script runs and the first case executes, it should pick up the records below, for a count of 3:
02/03/2021, 1234, 1, 'OK'
02/03/2021, 12345, 1, 'OK'
02/03/2021, 123456, 1, 'Ok'
The second case should then only process the record below, for a count of 1:
02/03/2021, 12345, 2, 'HELLO HI THERE'
Without a condition to circumvent this, CASE1 would be 4 and CASE2 would be 2. Any tips or suggestions?

Assign a single case to each ID before the DISTINCT aggregation, then aggregate. That way the same ID is never counted in more than one case. See the comments in the code:
select --do final distinct aggregation
       count(distinct (case when assigned_case='CASE1' then my_id else null end)) as CASE1,
       count(distinct (case when assigned_case='CASE2' then my_id else null end)) as CASE2
from
(
 select my_id,
        --assign single CASE to all rows with the same id based on some logic:
        case when case1_flag = 1 then 'CASE1'
             when case2_flag = 1 then 'CASE2'
             else NULL
        end as assigned_case
 from
 (--calculate all CASE flags for each ID
  select AB.my_id,
         max(CASE WHEN (TXT like '%OK%') THEN 1 ELSE NULL END) over (partition by AB.my_id) as case1_flag,
         max(CASE WHEN (TXT like '%HELLO HI THERE%') THEN 1 ELSE NULL END) over (partition by AB.my_id) as case2_flag
  from ...
 ) s
) s
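To try the idea without a Hive cluster, here is a minimal sketch that runs the same "assign one case per ID, then count" logic against an in-memory SQLite database from Python. Several things here are assumptions and not part of the original answer: SQLite stands in for Hive, only the table2 rows from the question are loaded, the per-ID flags are computed with GROUP BY instead of window functions, and UPPER() is added so that 'Ok' matches as the question expects (Hive's LIKE is case-sensitive).

```python
import sqlite3

# In-memory SQLite database standing in for Hive (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table2 (the_date TEXT, my_id INTEGER, seq INTEGER, txt TEXT)")
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?, ?)",
    [
        ("02/03/2021", 1234, 1, "OK"),
        ("02/03/2021", 12345, 1, "OK"),
        ("02/03/2021", 12345, 2, "HELLO HI THERE"),
        ("02/03/2021", 123456, 1, "Ok"),
    ],
)

query = """
SELECT COUNT(DISTINCT CASE WHEN assigned_case = 'CASE1' THEN my_id END) AS case1,
       COUNT(DISTINCT CASE WHEN assigned_case = 'CASE2' THEN my_id END) AS case2
FROM (
    -- give every ID exactly one case; CASE1 takes precedence over CASE2
    SELECT my_id,
           CASE WHEN case1_flag = 1 THEN 'CASE1'
                WHEN case2_flag = 1 THEN 'CASE2'
           END AS assigned_case
    FROM (
        -- one row per ID carrying a flag for each case it satisfies
        SELECT my_id,
               MAX(CASE WHEN UPPER(txt) LIKE '%OK%' THEN 1 END) AS case1_flag,
               MAX(CASE WHEN UPPER(txt) LIKE '%HELLO HI THERE%' THEN 1 END) AS case2_flag
        FROM table2
        GROUP BY my_id
    ) AS flags
) AS assigned
"""
case1, case2 = conn.execute(query).fetchone()
print(case1, case2)  # prints: 3 0
```

With this precedence, ID 12345 satisfies both conditions but is counted only under CASE1, so CASE2 is 0 for this sample. If CASE2 should win instead, swap the order of the two WHEN branches.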

Related

Pulling data while pivoting at the same time

ID | Type | Code
1 Purchase A1
1 Return B1
1 Exchange C1
2 Purchase D1
2 Return NULL
2 Exchange F1
3 Purchase G1
3 Return H1
3 Exchange I1
4 Purchase J1
4 Exchange K1
Above is sample data. What I want to return is:
ID | Type | Code
1 Purchase A1
1 Return B1
1 Exchange C1
3 Purchase G1
3 Return H1
3 Exchange I1
So if a field is null in code or the values of Purchase, Return and Exchange are not all present for that ID, ignore that ID completely. However there is one last step. I want this data to then be pivoted this way:
ID | Purchase | Return | Exchange
1 A1 B1 C1
3 G1 H1 I1
I asked this yesterday without the pivot portion which you can see here:
SQL query to return data only if ALL necessary columns are present and not NULL
However I forgot to note the last part. I tried to play around with excel but had no luck. I tried to make a temp table but the data is too large to do that so I was wondering if this could all be done in 1 sql statement?
I personally used this query with success:
select t.*
from t
where 3 = (select count(distinct t2.type)
from t t2
where t2.id = t.id and
t2.type in ('Purchase', 'Exchange', 'Return') and
t2.Code is not null
);
So how can we adjust that to include the pivot part. Is that possible?

Quite easily. Just use conditional aggregation:
select t.id,
max(case when type = 'Purchase' then code end) as Purchase,
max(case when type = 'Exchange' then code end) as Exchange,
max(case when type = 'Return' then code end) as Return
from t
where 3 = (select count(distinct t2.type)
from t t2
where t2.id = t.id and
t2.type in ('Purchase', 'Exchange', 'Return') and
t2.Code is not null
)
group by t.id;
This is actually simpler to express (in my opinion) using having without the subquery:
select t.id,
max(case when type = 'Purchase' then code end) as Purchase,
max(case when type = 'Exchange' then code end) as Exchange,
max(case when type = 'Return' then code end) as Return
from t
group by t.id
having max(case when type = 'Purchase' then code end) is not null and
max(case when type = 'Exchange' then code end) is not null and
max(case when type = 'Return' then code end) is not null;
Many databases would allow:
having Purchase is not null and Exchange is not null and Return is not null
But Oracle doesn't allow the use of column aliases in the having clause.

UPDATE - Based on discussion in the question comments, my previous query had a faulty assumption (which I carried over from what I thought I saw in the original query in the question); I've eliminated the bad assumption.
select id
, max(case when type='Purchase' then Code end) Purchase
, max(case when type='Return' then Code end) Return
, max(case when type='Exchange' then Code end) Exchange
from t
where code is not null
and type in ('Purchase', 'Return', 'Exchange')
group by id
having count(distinct type) = 3
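For anyone who wants to check the filter-then-pivot query against the sample data, here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for Oracle here, and the column aliases purchase_code, return_code and exchange_code are names chosen for this sketch, not names from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, type TEXT, code TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "Purchase", "A1"), (1, "Return", "B1"), (1, "Exchange", "C1"),
        (2, "Purchase", "D1"), (2, "Return", None), (2, "Exchange", "F1"),
        (3, "Purchase", "G1"), (3, "Return", "H1"), (3, "Exchange", "I1"),
        (4, "Purchase", "J1"), (4, "Exchange", "K1"),
    ],
)

pivot_rows = conn.execute("""
    SELECT id,
           MAX(CASE WHEN type = 'Purchase' THEN code END) AS purchase_code,
           MAX(CASE WHEN type = 'Return'   THEN code END) AS return_code,
           MAX(CASE WHEN type = 'Exchange' THEN code END) AS exchange_code
    FROM t
    WHERE code IS NOT NULL                              -- drop rows with no code
      AND type IN ('Purchase', 'Return', 'Exchange')
    GROUP BY id
    HAVING COUNT(DISTINCT type) = 3                     -- keep IDs with all three types
    ORDER BY id
""").fetchall()

for row in pivot_rows:
    print(row)
```

ID 2 drops out because its Return row has a NULL code, and ID 4 drops out because it has no Return row at all, which leaves IDs 1 and 3.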

I will point out again (as I did in your other thread) that analytic functions will do the job much faster - they need the base table to be read just once, and there are no explicit or implicit joins.
with
test_data ( id, type, code ) as (
select 1, 'Purchase', 'A1' from dual union all
select 1, 'Return' , 'B1' from dual union all
select 1, 'Exchange', 'C1' from dual union all
select 2, 'Purchase', 'D1' from dual union all
select 2, 'Return' , null from dual union all
select 2, 'Exchange', 'F1' from dual union all
select 3, 'Purchase', 'G1' from dual union all
select 3, 'Return' , 'H1' from dual union all
select 3, 'Exchange', 'I1' from dual union all
select 4, 'Purchase', 'J1' from dual union all
select 4, 'Exchange', 'K1' from dual
)
-- end of test data; actual solution (SQL query) begins below this line
select id, purchase, return, exchange
from ( select id, type, code
from ( select id, type, code,
count( distinct case when type in ('Purchase', 'Return', 'Exchange')
then type end
) over (partition by id) as ct_type,
count( case when code is null then 1 end
) over (partition by id) as ct_code
from test_data
)
where ct_type = 3 and ct_code = 0
)
pivot ( min(code) for type in ('Purchase' as purchase, 'Return' as return,
'Exchange' as exchange)
)
;
Output:
ID PURCHASE RETURN EXCHANGE
--- -------- -------- --------
1 A1 B1 C1
3 G1 H1 I1
2 rows selected.

Oracle SQL: Joining same table and getting desired output

I have a table like this
FILEID | FILENAME | STATUS
100 |Employee_06102016.txt |PASS
100 |Employee_06092016.txt |FAIL
100 |Employee_06092016.txt |MISS
101 |ABC_06092016.txt |PASS
I am reading a filename from a file and passing it to SQL. Let's say I have only the file name 'Employee_06102016.txt', which has PASS status. With this, I need to join the same table and count the PASS and FAIL filenames that have the same file ID, excluding the MISS status.
I am trying something like the query below, but it gives a count of 3, including all rows. I should get only 2.
SELECT COUNT (T.FILEID) FROM TABLE_NAME T, TABLE_NAME S
WHERE T.FILEID=S.FILEID
AND T.FILENAME = 'Employee_06102016.txt' AND T.STATUS IN ('PASS', 'FAIL');

Oracle Setup:
CREATE TABLE table_name ( FILEID, FILENAME, STATUS ) AS
SELECT 100, 'Employee_06102016.txt', 'PASS' FROM DUAL UNION ALL
SELECT 100, 'Employee_06092016.txt', 'FAIL' FROM DUAL UNION ALL
SELECT 100, 'Employee_06092016.txt', 'MISS' FROM DUAL UNION ALL
SELECT 101, 'ABC_06092016.txt', 'PASS' FROM DUAL;
Query:
SELECT *
FROM (
SELECT t.*,
COUNT(1) OVER ( PARTITION BY FileID ) AS num_pass_fail
FROM table_name t
WHERE status IN ( 'PASS', 'FAIL' )
)
WHERE filename = 'Employee_06102016.txt';
Output:
FILEID FILENAME STATUS NUM_PASS_FAIL
---------- --------------------- ------ -------------
100 Employee_06102016.txt PASS 2
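The same windowed count can be reproduced outside Oracle. This is a rough sketch with Python's sqlite3 module, assuming the bundled SQLite is version 3.25 or later (window functions are not available before that); the table and rows are the ones from the setup above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (fileid INTEGER, filename TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?, ?)",
    [
        (100, "Employee_06102016.txt", "PASS"),
        (100, "Employee_06092016.txt", "FAIL"),
        (100, "Employee_06092016.txt", "MISS"),
        (101, "ABC_06092016.txt", "PASS"),
    ],
)

pass_fail_rows = conn.execute("""
    SELECT *
    FROM (
        -- MISS rows are filtered out first, then rows are counted per file ID
        SELECT t.*,
               COUNT(*) OVER (PARTITION BY fileid) AS num_pass_fail
        FROM table_name t
        WHERE status IN ('PASS', 'FAIL')
    )
    WHERE filename = 'Employee_06102016.txt'
""").fetchall()

print(pass_fail_rows)
```

The filename filter is applied only in the outer query, so the count still covers every PASS or FAIL row that shares the file ID.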

SELECT
FILENAME,
SUM(CASE status WHEN 'PASS' THEN 1 ELSE 0 END) as "Pass Count",
SUM(CASE status WHEN 'FAIL' THEN 1 ELSE 0 END) as "Fail Count",
SUM(CASE status WHEN 'MISS' THEN 1 ELSE 0 END) as "Miss Count"
FROM
TableName
WHERE
FILENAME = 'Employee_06102016.txt'
GROUP BY
FILENAME

It seems that you simply need:
SELECT COUNT (1)
FROM TABLE_NAME
WHERE FILENAME = 'Employee_06102016.txt'
AND STATUS IN ('PASS', 'FAIL');

Try this one
select cnt from (
select count(*) as cnt,
listagg(filename, ',') within group(order by filename) as filename_list from table_name
where status in ('PASS', 'FAIL') group by fileid
) where instr(filename_list, 'Employee_06102016.txt')>0;

Using CASE to Mark No If No Results From SELECT Statement

Is it possible to print "No" if no result is found?
SELECT mobileno,
CASE
WHEN region = '1234'
THEN 'Yes'
ELSE 'NO'
END
FROM subscriber
WHERE region = '1234'
and status = 1
and mobileno in (77777,88888)
Currently it prints only one row, like
77777,yes
but I want the following:
77777,yes
88888,no
Update: A mobileno such as 77777 may belong to two regions, so if we remove the region condition, 77777 is printed in two rows, one with NO and one with YES.
Sample Data
sr.No, Name, mobileno, region, status
1, abc, 77777, 1234, 1
2, xyz, 88888, 1222, 1
3, tyu, 22342, 9898, 1
4, abc, 77777, 8787, 1
Sample OutPut
77777, Yes
88888, No

You can 'create' a table by selecting from dual, and then left join to it:
SELECT t.dummy_num,
CASE WHEN s.mobileno is null then 'No' else 'Yes' end
FROM (SELECT 77777 as dummy_num from dual
UNION select 88888 from dual) t
LEFT JOIN subscriber s
ON(t.dummy_num = s.mobileno and s.region = '1234' and s.status = 1 )
Edit: you can also do it dynamically like this:
SELECT t.mobileno,
CASE WHEN s.mobileno is null then 'No' else 'Yes' end
FROM (select distinct mobileno from subscriber) t
LEFT JOIN subscriber s
ON(t.mobileno= s.mobileno and s.region = '1234' and s.status = 1 )
WHERE t.mobileno IN(777,888,.....)
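As a quick way to see the left join against an inline list of numbers in action, here is a sketch using Python's sqlite3 module with the sample data from the question. SQLite has no dual table, so the inline list is written as a bare SELECT ... UNION SELECT; the names wanted, in_region and sr_no are made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscriber "
    "(sr_no INTEGER, name TEXT, mobileno INTEGER, region TEXT, status INTEGER)"
)
conn.executemany(
    "INSERT INTO subscriber VALUES (?, ?, ?, ?, ?)",
    [
        (1, "abc", 77777, "1234", 1),
        (2, "xyz", 88888, "1222", 1),
        (3, "tyu", 22342, "9898", 1),
        (4, "abc", 77777, "8787", 1),
    ],
)

yes_no_rows = conn.execute("""
    SELECT wanted.mobileno,
           CASE WHEN s.mobileno IS NULL THEN 'No' ELSE 'Yes' END AS in_region
    FROM (SELECT 77777 AS mobileno UNION SELECT 88888) AS wanted
    LEFT JOIN subscriber s
           ON wanted.mobileno = s.mobileno
          AND s.region = '1234'   -- region and status filters stay in the ON clause
          AND s.status = 1        -- so that unmatched numbers are kept as 'No'
    ORDER BY wanted.mobileno
""").fetchall()

print(yes_no_rows)
```

Keeping the region and status conditions in the ON clause rather than in WHERE is what preserves the row for 88888; moving them to WHERE would turn the left join back into an inner join.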

How to select two rows into a single row output

The select statement currently looks like this.
Here I get the output on two lines for the same stg_edi835_id; I want to select the results on a single line for that stg_edi835_id.
The output should look like this.
Can someone please help me with this?
Thanks in advance.

SELECT STG_EDI835_PLB_ID, STG_EDI835_ID, ADJUSTMENTREASONCODE1, ADJUSTMENTIDENTIFIER1, SUM(NTVE_ADJUSTMENTAMOUNT1_T1+NTVE_ADJUSTMENTAMOUNT1_T2) AS ADJUSTMENTAMOUNT1,
SUM(PTVE_ADJUSTMENTAMOUNT1_T1+PTVE_ADJUSTMENTAMOUNT1_T2) AS ADJUSTMENTAMOUNT2, ADJUSTMENTREASONCODE2, ADJUSTMENTIDENTIFIER2
FROM
(
SELECT T1.STG_EDI835_PLB_ID , T2.STG_EDI835_ID, T1.ADJUSTMENTREASONCODE1, T1.ADJUSTMENTIDENTIFIER1,
(CASE WHEN T1.ADJUSTMENTAMOUNT1 < 0 THEN T1.ADJUSTMENTAMOUNT1 ELSE 0 END) AS NTVE_ADJUSTMENTAMOUNT1_T1,
(CASE WHEN T2.ADJUSTMENTAMOUNT1 < 0 THEN T2.ADJUSTMENTAMOUNT1 ELSE 0 END) AS NTVE_ADJUSTMENTAMOUNT1_T2,
(CASE WHEN T1.ADJUSTMENTAMOUNT1 >= 0 THEN T1.ADJUSTMENTAMOUNT1 ELSE 0 END) AS PTVE_ADJUSTMENTAMOUNT1_T1,
(CASE WHEN T2.ADJUSTMENTAMOUNT1 >= 0 THEN T2.ADJUSTMENTAMOUNT1 ELSE 0 END) AS PTVE_ADJUSTMENTAMOUNT1_T2,
COALESCE(T2.ADJUSTMENTREASONCODE1, 'NULL') AS ADJUSTMENTREASONCODE2, COALESCE(T2.ADJUSTMENTIDENTIFIER1, NULL) AS ADJUSTMENTIDENTIFIER2
FROM TABLE1 AS T1
INNER JOIN TABLES T2
ON T2.STG_EDI835_ID = T1.STG_EDI835_ID
AND T2.STG_EDI835_PLB_ID = T1.STG_EDI835_PLB_ID
) A
GROUP BY STG_EDI835_PLB_ID, STG_EDI835_ID, ADJUSTMENTREASONCODE1, ADJUSTMENTIDENTIFIER1, ADJUSTMENTREASONCODE2, ADJUSTMENTIDENTIFIER2

Your question is somewhat incomplete (you should show your desired output); however, here is a sample of what you could do:
Get rid of all the unique values you don't need (column 1, containing IDs, etc.).
Use aggregate functions on the rest.
Example:
Select
--column 1 removed
MAX(column2) as ID,
MAX(column3) as RefID,
--column 4 removed
--column 5 removed
--column 6 removed
SUM(column7) as Ad1,
--column 8 removed
--column 9 removed
SUM(column10) as Ad2
From
table

Try something like
WITH DATA As (
select 697 as Stg_EDI835_Id, -87.75 as AdjustmentAmount1 union
select 697, -4.64 union
select 612, -6.39 union
select 612, 60.75
)
select SUM(AdjustmentAmount1) AS AdjustmentAmount1, 0 AS adjustmentamount2
FROM DATA GROUP BY Stg_EDI835_Id HAVING SUM(AdjustmentAmount1) <= 0
UNION
select 0, SUM(AdjustmentAmount1)
FROM DATA GROUP BY Stg_EDI835_Id HAVING SUM(AdjustmentAmount1) > 0
Output is
AdjustmentAmount1 | Adjustmentamount2
-92.39 | 0.00
0.00 | 54.36

one sql (oracle) query for getting unique information that has two different (null and not null) values per column

Table foobar is, for clarity, structured and has data as follows:
id, action_dt, status_id
1, '02-JUL-10', 'x'
1, '02-JUL-10', '2'
1, '02-JUL-10', NULL
2, '02-JUL-10', 'a'
2, '02-JUL-10', 'b'
3, '02-JUL-10', 'k'
3, '02-JUL-10', NULL
3, '03-JUL-10', 'k'
3, '03-JUL-10', NULL
I need a query that gets IDs such that for each ID a NULL value and a NOT NULL value exists per day. So, in the example dataset above, the query needs to return:
'02-JUL-10', 1
'02-JUL-10', 3
'03-JUL-10', 3
Yes, it can be done using something like:
SELECT
nulls.action_dt
, nulls.id
FROM (SELECT
action_dt
, id
FROM foobar
WHERE status_id IS NULL
GROUP BY action_dt) nulls
INNER JOIN (SELECT
action_dt
, id
FROM foobar
WHERE status_id IS NOT NULL
GROUP BY action_dt) non_nulls ON nulls.action_dt = non_nulls.action_dt
AND nulls.id = non_nulls.id
but as you can see, among other things, two subqueries and another iteration for the join...
The query I've been working on and have hopes for is of the form:
SELECT
action_dt
, id
FROM
foobar
GROUP BY
action_dt
, id
, CASE WHEN status_id IS NOT NULL THEN 1 ELSE 0 END
HAVING
COUNT(prim_card_nb) > 1
but it doesn't quite return what I need (as you know, the HAVING clause applies to the underlying data that is being queried). Any ideas?
After all this, it seems a solution would be to have the above query in a subquery and filter it down that way, such as:
SELECT
action_dt
, id
FROM (SELECT
action_dt
, id
FROM
foobar
GROUP BY
action_dt
, id
, CASE WHEN status_id IS NOT NULL THEN 1 ELSE 0 END
) repeat_ids_per_day
GROUP BY
action_dt
, id
HAVING
COUNT(id) > 1
but I feel it can be better...

Your idea is sound: in such a case you don't need a subquery, an aggregate is sufficient and should be more efficient. This should work:
SELECT action_dt, id
FROM foobar
GROUP BY action_dt, id
HAVING COUNT(DISTINCT CASE WHEN status_id IS NULL THEN 1 ELSE 0 END) > 1;
ACTION_DT ID
--------- ----------
02-JUL-10 1
02-JUL-10 3
03-JUL-10 3
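The single-pass aggregate can also be checked against the sample data from the question with Python's sqlite3 module. This is only a sketch: SQLite stands in for Oracle, and action_dt is stored as plain text here, which is an assumption made to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foobar (id INTEGER, action_dt TEXT, status_id TEXT)")
conn.executemany(
    "INSERT INTO foobar VALUES (?, ?, ?)",
    [
        (1, "02-JUL-10", "x"), (1, "02-JUL-10", "2"), (1, "02-JUL-10", None),
        (2, "02-JUL-10", "a"), (2, "02-JUL-10", "b"),
        (3, "02-JUL-10", "k"), (3, "02-JUL-10", None),
        (3, "03-JUL-10", "k"), (3, "03-JUL-10", None),
    ],
)

mixed_rows = conn.execute("""
    SELECT action_dt, id
    FROM foobar
    GROUP BY action_dt, id
    -- the CASE yields 1 for NULL and 0 for NOT NULL; two distinct values
    -- means the group contains both kinds of row
    HAVING COUNT(DISTINCT CASE WHEN status_id IS NULL THEN 1 ELSE 0 END) > 1
    ORDER BY action_dt, id
""").fetchall()

print(mixed_rows)
```

ID 2 is excluded because both of its rows have a non-NULL status_id, so the CASE expression produces only one distinct value for that group.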

I think you need to make some minor changes to the first query you posted,
as below:
SELECT
nulls.action_dt, nulls.id
FROM
(SELECT
action_dt
, id
FROM foobar
WHERE status_id IS NULL
GROUP BY action_dt,id
union all
SELECT
action_dt
, id
FROM foobar
WHERE status_id IS NOT NULL
GROUP BY action_dt,id)
group by action_dt, id
having count(*) >1
What you posted is not correct for an Oracle database:
you can't select a column that is not in the GROUP BY clause.
Please check that; it could be your mistake, and it may be the cause of the problem.
