Athena: compare columns consisting of array<string> in two different tables - sql

I have 2 external tables (parquet files in S3) in Athena, each with a column that is an array of strings. One table holds the subset arrays, and I need to compare them against the other table, which holds the superset arrays. The illustration below should make the problem clearer. Neither table has any duplicate records.
Table 1 (Sample Subset table)
+---+-----------+---------------------------+
|no | prod_name | article_list |
+---+-----------+---------------------------+
| 1 |sofa | ['ABC','PQR'] |
| 2 |cupboard | ['LMN','DEF','XYZ'] |
| 3 |table | ['DEF'] |
| 4 |chair | ['DEF','PQR','ABC'] |
| 5 |dresser | ['LMN','IJK','WXY','STU'] |
+---+-----------+---------------------------+
Table 2 (Sample Superset table)
+---+---------+--------------+---------------------------------------------------+
|no | wh_code | restock_date | article_list |
+---+---------+--------------+---------------------------------------------------+
| 1 |WH0001 | 2020-01-12 | ['ABC','BCE','CDE','DEF','JKL','PQR','QRS','STU'] |
| 2 |WH0001 | 2020-04-15 | ['ABC','CDE','DEF','IJK','LMN','PQR','STU','XYZ'] |
| 3 |WH0002 | 2021-03-17 | ['BCE','DEF','IJK','LMN','PQR','RST','STU','WXY'] |
| 4 |WH0003 | 2021-08-20 | ['ABC','IJK','LMN','NOP','PQR','RST','STU','WXY'] |
| 5 |WH0003 | 2022-03-26 | ['DEF','IJK','LMN','NOP','PQR','RST','STU','XYZ'] |
+---+---------+--------------+---------------------------------------------------+
Required result
+------------------------+---------+-----------------+
|article_list (table 1) | wh_code | restock_date |
+------------------------+---------+-----------------+
| ['ABC','PQR'] | WH0001 | 2020-01-12 |
| ['ABC','PQR'] | WH0001 | 2020-04-15 |
| ['ABC','PQR'] | WH0003 | 2021-08-20 |
| ['LMN','DEF','XYZ'] | WH0001 | 2020-04-15 |
| ['LMN','DEF','XYZ'] | WH0003 | 2022-03-26 |
| ['DEF'] | WH0001 | 2020-01-12 |
| ['DEF'] | WH0001 | 2020-04-15 |
| ['DEF'] | WH0002 | 2021-03-17 |
| ['DEF'] | WH0003 | 2022-03-26 |
| . | . | . |
| . | . | . |
| . | . | . |
+------------------------+---------+-----------------+
The following query in Athena works to find a particular combination (['ABC', 'PQR']) in table 2 consisting of the superset array. It results in the first 3 rows of the required result.
SELECT ARRAY['ABC', 'PQR'] as article_list,
wh_code,
restock_date
FROM "table_2"
WHERE filter(ARRAY ['ABC', 'PQR'], x -> NOT CONTAINS(article_list, x)) = ARRAY[]
group by wh_code, restock_date
I would appreciate help writing a generic query (covering all the combinations from table 1) to get the desired result.

Join the two tables on the required condition. You should also consider using array_except to simplify the query (I use cardinality to count the number of elements left over):
-- sample data
with table1(no, prod_name, article_list ) as (
values ( 1, 'sofa', array['ABC','PQR']),
( 2, 'cupboard', array['LMN','DEF','XYZ'] )
),
table2 (no, wh_code, restock_date, article_list) as (
values (1, 'WH0001', date '2020-01-12', array['ABC','BCE','CDE','DEF','JKL','PQR','QRS','STU']),
(2, 'WH0001', date '2020-04-15', array['ABC','CDE','DEF','IJK','LMN','PQR','STU','XYZ']),
(3, 'WH0002', date '2021-03-17', array['BCE','DEF','IJK','LMN','PQR','RST','STU','WXY']),
(4, 'WH0003', date '2021-08-20', array['ABC','IJK','LMN','NOP','PQR','RST','STU','WXY'])
)
-- query
select t1.article_list, t2.wh_code, t2.restock_date
from table1 t1
join table2 t2 on cardinality(array_except(t1.article_list, t2.article_list)) = 0;
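Output:
+-----------------+---------+--------------+
| article_list    | wh_code | restock_date |
+-----------------+---------+--------------+
| [ABC, PQR]      | WH0001  | 2020-01-12   |
| [ABC, PQR]      | WH0001  | 2020-04-15   |
| [ABC, PQR]      | WH0003  | 2021-08-20   |
| [LMN, DEF, XYZ] | WH0001  | 2020-04-15   |
+-----------------+---------+--------------+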
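To make the join condition easier to reason about, here is a minimal Python sketch of what cardinality(array_except(t1.article_list, t2.article_list)) = 0 checks for each pair of rows. It is only a model of the logic, not Athena code; the function name subset_join and the variable names are hypothetical, and the data is the sample data from the query above.

```python
# Sample rows copied from the inline data of the query above.
table1 = [
    (1, "sofa", ["ABC", "PQR"]),
    (2, "cupboard", ["LMN", "DEF", "XYZ"]),
]
table2 = [
    (1, "WH0001", "2020-01-12", ["ABC", "BCE", "CDE", "DEF", "JKL", "PQR", "QRS", "STU"]),
    (2, "WH0001", "2020-04-15", ["ABC", "CDE", "DEF", "IJK", "LMN", "PQR", "STU", "XYZ"]),
    (3, "WH0002", "2021-03-17", ["BCE", "DEF", "IJK", "LMN", "PQR", "RST", "STU", "WXY"]),
    (4, "WH0003", "2021-08-20", ["ABC", "IJK", "LMN", "NOP", "PQR", "RST", "STU", "WXY"]),
]


def subset_join(left, right):
    """Pair each left row with every right row whose array contains all left elements."""
    result = []
    for _, _, wanted in left:
        for _, wh_code, restock_date, stocked in right:
            # Models cardinality(array_except(wanted, stocked)) = 0:
            # nothing from the wanted list is missing in the stocked list.
            if not set(wanted) - set(stocked):
                result.append((wanted, wh_code, restock_date))
    return result


rows = subset_join(table1, table2)
for row in rows:
    print(row)
```

Every row of table1 is compared with every row of table2, which is why this kind of join can get expensive on large tables.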
UPD
Try the following query instead. Depending on the size of the data, you may still need to partition the queries:
-- query
select arbitrary(article_list), wh_code, restock_date
from (select no, article_list, article
from table1, unnest (article_list) as t(article)) as t1
join (select no, wh_code, restock_date, article
from table2, unnest (article_list) as t(article)) as t2 on t1.article = t2.article
group by t1.no, wh_code, restock_date
having count(t1.article) = cardinality(arbitrary(article_list));
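The unnest-based query can be read as: explode both arrays into one row per article, join on the article, and keep only the groups where the number of matches equals the length of the table1 array. The Python sketch below models those steps under the same assumption the query makes, namely that an array does not contain the same article twice. The function name exploded_subset_join is hypothetical.

```python
from collections import defaultdict

table1 = [
    (1, "sofa", ["ABC", "PQR"]),
    (2, "cupboard", ["LMN", "DEF", "XYZ"]),
]
table2 = [
    (1, "WH0001", "2020-01-12", ["ABC", "BCE", "CDE", "DEF", "JKL", "PQR", "QRS", "STU"]),
    (2, "WH0001", "2020-04-15", ["ABC", "CDE", "DEF", "IJK", "LMN", "PQR", "STU", "XYZ"]),
    (3, "WH0002", "2021-03-17", ["BCE", "DEF", "IJK", "LMN", "PQR", "RST", "STU", "WXY"]),
    (4, "WH0003", "2021-08-20", ["ABC", "IJK", "LMN", "NOP", "PQR", "RST", "STU", "WXY"]),
]


def exploded_subset_join(left, right):
    # "unnest": one row per array element on each side.
    left_items = [(no, article) for no, _, articles in left for article in articles]
    right_items = [(wh, date, article) for _, wh, date, articles in right for article in articles]
    lists = {no: articles for no, _, articles in left}

    # Equi-join on article, then count matches per (t1.no, wh_code, restock_date).
    matches = defaultdict(int)
    for no, article in left_items:
        for wh, date, other in right_items:
            if article == other:
                matches[(no, wh, date)] += 1

    # "having": keep groups where every article of the table1 list was matched.
    return [
        (lists[no], wh, date)
        for (no, wh, date), count in matches.items()
        if count == len(lists[no])
    ]


result = exploded_subset_join(table1, table2)
for row in result:
    print(row)
```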

Related

How to split these multiple rows in SQL?

I am currently studying SQL and am still a newbie. I have a task where I need to split some rows whose entries hold multiple values, such as dates and user IDs. I really need help.
+-------+------------------------------+---------------------------+
| TYPE | DATES | USER_ID |
+-------+------------------------------+---------------------------+
| WORK | ["2022-06-02", "2022-06-03"] | {74042,88357,83902,88348} |
| LEAVE | ["2022-05-16", "2022-05-26"] | {83902,74042,88357,88348} |
+-------+------------------------------+---------------------------+
The end result should look like this. Each user ID should appear on the same row as its respective date.
+-------+------------+---------+
| TYPE | DATES | USER_ID |
+-------+------------+---------+
| LEAVE | 05/16/2022 | 74042 |
| LEAVE | 05/16/2022 | 88357 |
| LEAVE | 05/16/2022 | 88348 |
| LEAVE | 05/16/2022 | 83902 |
| LEAVE | 05/26/2022 | 74042 |
| LEAVE | 05/26/2022 | 88357 |
| LEAVE | 05/26/2022 | 88348 |
| LEAVE | 05/26/2022 | 83902 |
| WORK | 06/2/2022 | 74042 |
| WORK | 06/2/2022 | 88357 |
| WORK | 06/2/2022 | 88348 |
| WORK | 06/2/2022 | 83902 |
| WORK | 06/3/2022 | 74042 |
| WORK | 06/3/2022 | 88357 |
| WORK | 06/3/2022 | 88348 |
| WORK | 06/3/2022 | 83902 |
+-------+------------+---------+
Create table:
CREATE TABLE work_leave (
TYPE varchar,
DATES date,
USER_ID integer
);
INSERT INTO work_leave
VALUES ('LEAVE', '05/16/2022', 74042),
('LEAVE', '05/16/2022', 88357),
('LEAVE', '05/16/2022', 88348),
('LEAVE', '05/16/2022', 83902),
('LEAVE', '05/26/2022', 74042),
('LEAVE', '05/26/2022', 88357),
('LEAVE', '05/26/2022', 88348),
('LEAVE', '05/26/2022', 83902),
('WORK', '06/2/2022', 74042),
('WORK', '06/2/2022', 88357),
('WORK', '06/2/2022', 88348),
('WORK', '06/2/2022', 83902),
('WORK', '06/3/2022', 74042),
('WORK', '06/3/2022', 88357),
('WORK', '06/3/2022', 88348),
('WORK', '06/3/2022', 83902);
WITH date_ends AS (
SELECT
type,
ARRAY[min(dates),
max(dates)] AS dates
FROM
work_leave
GROUP BY
type
),
users AS (
SELECT
type,
array_agg(DISTINCT (user_id)
ORDER BY user_id) AS user_ids
FROM
work_leave
GROUP BY
type
)
SELECT
de.type,
de.dates,
u.user_ids
FROM
date_ends AS de
JOIN
users as u
ON de.type = u.type;
type | dates | user_ids
-------+-------------------------+---------------------------
LEAVE | {05/16/2022,05/26/2022} | {74042,83902,88348,88357}
WORK | {06/02/2022,06/03/2022} | {74042,83902,88348,88357}
I adjusted the data slightly for simplicity. Here's one idea:
WITH rows (type, dates, user_id) AS (
VALUES ('WORK', array['2022-06-02', '2022-06-03'], array[74042,88357,83902,88348])
, ('LEAVE', array['2022-05-16', '2022-05-26'], array[83902,74042,88357,88348])
)
SELECT r1.type, x.*
FROM rows AS r1
CROSS JOIN LATERAL (
SELECT r2.dates, r3.user_id
FROM unnest(r1.dates) AS r2(dates)
, unnest(r1.user_id) AS r3(user_id)
) AS x
;
The fiddle
The result:
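+-------+------------+---------+
| type  | dates      | user_id |
+-------+------------+---------+
| WORK  | 2022-06-02 | 74042   |
| WORK  | 2022-06-02 | 88357   |
| WORK  | 2022-06-02 | 83902   |
| WORK  | 2022-06-02 | 88348   |
| WORK  | 2022-06-03 | 74042   |
| WORK  | 2022-06-03 | 88357   |
| WORK  | 2022-06-03 | 83902   |
| WORK  | 2022-06-03 | 88348   |
| LEAVE | 2022-05-16 | 83902   |
| LEAVE | 2022-05-16 | 74042   |
| LEAVE | 2022-05-16 | 88357   |
| LEAVE | 2022-05-16 | 88348   |
| LEAVE | 2022-05-26 | 83902   |
| LEAVE | 2022-05-26 | 74042   |
| LEAVE | 2022-05-26 | 88357   |
| LEAVE | 2022-05-26 | 88348   |
+-------+------------+---------+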
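For readers who want to see what the two unnest() calls inside the lateral subquery amount to, the following Python sketch models it as a per-row Cartesian product of the dates and the user IDs. It is only an illustration of the logic; the function name expand is hypothetical and the data is the adjusted sample from the query above.

```python
from itertools import product

rows = [
    ("WORK", ["2022-06-02", "2022-06-03"], [74042, 88357, 83902, 88348]),
    ("LEAVE", ["2022-05-16", "2022-05-26"], [83902, 74042, 88357, 88348]),
]


def expand(rows):
    """Emit one (type, date, user_id) row for every date/user pair of each input row."""
    return [
        (row_type, date, user_id)
        for row_type, dates, user_ids in rows
        for date, user_id in product(dates, user_ids)
    ]


expanded = expand(rows)
for row in expanded:
    print(row)
```

Each input row with 2 dates and 4 user IDs yields 8 output rows, so the two sample rows produce 16 rows in total.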

How to use wm_concat on a column that already exists in the query?

So... I am currently using Oracle 11.1g and I need to create a query that uses the ID and CusCODE from Table_with_value and checks Table_with_status using the ID to find active CO_status but on different CusCODE.
This is what I have so far - obviously does not work as it should unless CusCODE and ID are provided manually:
SELECT wm_concat(CoID) as active_CO_Status_for_same_ID_but_different_CusCODE
FROM Table_with_status
WHERE
CoID IN (SELECT CoID FROM Table_with_status WHERE ID = Table_with_value.ID AND CusCODE != Table_with_value.CusCODE)) AND Co_status = 'active';
Table_with_value:
|CoID | CusCODE | ID | Value |
|--------|---------|----------|----|
|354223 | 1.432 | 0784296L | 99 |
|321232 | 4.212321.22 | 0432296L | 32 |
|938421 | 3.213 | 0021321L | 93 |
Table_with_status:
|CoID | CusCODE | ID | Co_status|
|--------|--------------|----------|--------|
|354223 | 1.432 | 0784296L | active|
|354232 | 1.432 | 0784296L | inactive |
|666698 | 1.47621 | 0784296L | active |
|666700 | 1.5217 | 0784296L | active |
|938421 | 3.213 | 0021321L | active |
|938422 | 3.213 | 0021321L | active |
|938423 | 3.213 | 0021321L | active |
|321232 | 4.212321.22 | 0432296L | active |
|321232 | 4.212321.22 | 0432296L | active |
|321232 | 1.689 | 0432296L | inactive |
Expected output:
|CoID | active_CO_Status_for_same_ID_but_different_CusCODE | CusCODE | ID | Value |
|--------|---------|---------|----------|----|
|354223 | 666698,666700 | 1.432 | 0784296L | 99 |
|321232 | N/A | 4.212321.22 | 0432296L | 32 |
|938421 | N/A | 3.213 | 0021321L | 93 |
Any idea how this can be implemented, ideally without any PL/SQL for loops? Loops would be acceptable as well, since the output dataset is expected to contain fewer than 300 IDs.
I apologize in advance for the cryptic nature in which I structured the question :) Let me know if something is not clear.
From your description and expected output, it looks like you need a left outer join, something like:
SELECT v.CoID,
wm_concat(s.CoID) as other_active_CusCODE, -- active_CO_Status_for_same_ID_but_different_CusCODE
v.CusCODE,
v.ID,
v.value
FROM Table_with_value v
LEFT JOIN Table_with_status s
ON s.ID = v.ID
AND s.CusCODE != v.CusCODE
AND s.Co_status = 'active'
GROUP BY v.CoID, v.CusCODE, v.ID, v.value;
SQL Fiddle using listagg() instead of the never-supported and now-removed wm_concat(); with a couple of different approaches if the logic isn't quite what I interpreted. With your sample data they all get:
COID OTHER_ACTIVE_CUSCODE CUSCODE ID VALUE
------ -------------------- ----------- -------- -----
321232 (null) 4.212321.22 0432296L 32
354223 666698,666700 1.432 0784296L 99
938421 (null) 3.213 0021321L 93
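The left-join logic can be tried out without an Oracle instance. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 module and SQLite's group_concat() in place of wm_concat()/listagg(), with the sample rows from the question. The element order inside group_concat() is not guaranteed, so the sketch does not rely on it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table_with_value (CoID INTEGER, CusCODE TEXT, ID TEXT, Value INTEGER);
CREATE TABLE Table_with_status (CoID INTEGER, CusCODE TEXT, ID TEXT, Co_status TEXT);
INSERT INTO Table_with_value VALUES
  (354223, '1.432', '0784296L', 99),
  (321232, '4.212321.22', '0432296L', 32),
  (938421, '3.213', '0021321L', 93);
INSERT INTO Table_with_status VALUES
  (354223, '1.432', '0784296L', 'active'),
  (354232, '1.432', '0784296L', 'inactive'),
  (666698, '1.47621', '0784296L', 'active'),
  (666700, '1.5217', '0784296L', 'active'),
  (938421, '3.213', '0021321L', 'active'),
  (938422, '3.213', '0021321L', 'active'),
  (938423, '3.213', '0021321L', 'active'),
  (321232, '4.212321.22', '0432296L', 'active'),
  (321232, '4.212321.22', '0432296L', 'active'),
  (321232, '1.689', '0432296L', 'inactive');
""")

# group_concat() is SQLite's aggregate, used here instead of wm_concat()/listagg().
query = """
SELECT v.CoID, group_concat(s.CoID) AS other_active, v.CusCODE, v.ID, v.Value
FROM Table_with_value v
LEFT JOIN Table_with_status s
  ON s.ID = v.ID
 AND s.CusCODE != v.CusCODE
 AND s.Co_status = 'active'
GROUP BY v.CoID, v.CusCODE, v.ID, v.Value
ORDER BY v.CoID
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Because the status filter sits in the ON clause rather than in WHERE, rows of Table_with_value without any match are kept and simply get a NULL in the concatenated column.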
Your code looks like it should work, assuming you are referring to the correct tables:
SELECT wm_concat(s.CoID) as active_CO_Status_for_same_ID_but_different_CusCODE
FROM Table_with_status s
WHERE s.CoID IN (SELECT v.CoID
FROM Table_with_value v
WHERE v.ID = s.ID AND
v.CusCODE <> s.CusCODE
) AND
s.Co_status = 'active';

PostgreSQL query: subtract counts within one table

I have one table in PostgreSQL and cannot figure out how to build a query.
The table contains the columns nr_serii and deleting_time. I am trying to count the rows per nr_serii and subtract from that count the rows that have a deleting_time.
My query:
select nr_serii , count(nr_serii ) as ilosc,count(deleting_time) as ilosc_delete
from MyTable
group by nr_serii, deleting_time
output is:
+--------------------+
| "666666";1;1 |
| "456456";1;0 |
| "333333";3;0 |
| "333333";1;1 |
| "111111";1;1 |
| "111111";3;0 |
+--------------------+
The part of table with raw data:
+--------------------------------+
| "666666";"2020-11-20 14:08:13" |
| "456456";"" |
| "333333";"" |
| "333333";"" |
| "333333";"" |
| "333333";"2020-11-20 14:02:23" |
| "111111";"" |
| "111111";"" |
| "111111";"2020-11-20 14:08:04" |
| "111111";"" |
+--------------------------------+
I need to subtract column ilosc_delete from column ilosc.
Example:
nr_serii:333333 ilosc:3-1=2
Expected output:
+-------------+
| "666666";-1 |
| "456456";1 |
| "333333";2 |
| "111111";2 |
| ... |
+-------------+
I think there is a very simple solution for this, but my mind is blank.
I see what you want now. You want to subtract the number where deleting_time is not null from the ones where it is null:
select nr_serii,
count(*) filter (where deleting_time is null) - count(deleting_time) as ilosc_delete
from MyTable
group by nr_serii;
Here is a db<>fiddle.
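As a quick way to check the arithmetic against the sample data, here is a small sketch using Python's built-in sqlite3 module rather than PostgreSQL. Two assumptions are made: the empty strings shown in the raw data are really NULLs, and the FILTER clause is spelled as an equivalent SUM(CASE ...) so the sketch also runs on older SQLite versions. The alias remaining is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (nr_serii TEXT, deleting_time TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?)",
    [
        ("666666", "2020-11-20 14:08:13"),
        ("456456", None),
        ("333333", None),
        ("333333", None),
        ("333333", None),
        ("333333", "2020-11-20 14:02:23"),
        ("111111", None),
        ("111111", None),
        ("111111", "2020-11-20 14:08:04"),
        ("111111", None),
    ],
)

# Rows without a deleting_time minus rows with one, per nr_serii.
# count(deleting_time) ignores NULLs, so it counts only the deleted rows.
query = """
SELECT nr_serii,
       SUM(CASE WHEN deleting_time IS NULL THEN 1 ELSE 0 END)
         - COUNT(deleting_time) AS remaining
FROM MyTable
GROUP BY nr_serii
ORDER BY nr_serii DESC
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Note that deleting_time is no longer part of the GROUP BY, which is what collapses the output to one row per nr_serii.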

SQL Right Join on Non Unique

I'm hoping that I'm overthinking this, but I need to sum a column where I have no unique key to join on, and when I add the join the rows get duplicated.
This is my current SQL. It works until I add the join on vwBatchInData; then it doubles up every record. What is the best way to achieve this?
select b.fldBatchID as 'ID',SUM(bIn.fldBatchDetailsWeight) as 'Batch In', sum(t.fldTransactionNetWeight) as 'Batch Out' , format((sum(t.fldTransactionNetWeight) / sum(bIn.fldBatchDetailsWeight)),'P2' ) as 'Yield'
from [TRANSACTION] t
right join vwBatchInData bIn on bIn.fldBatchID = t.fldBatchID
inner join Batch b on b.fldBatchID = t.fldBatchID
where CAST(b.fldBatchDate as date) = '2020-03-04'
group by b.fldBatchID
vwBatchInData Table
+------------+---------------+-----------------------+
| fldBatchID | fldKillNumber | fldBatchDetailsWeight |
+------------+---------------+-----------------------+
| 2862 | 601598 | 164.40 |
| 2862 | 601599 | 190.80 |
| 2862 | 601596 | 195.00 |
| 2862 | 601597 | 200.20 |
| 2862 | 601594 | 176.60 |
+------------+---------------+-----------------------+
Transaction Table
+------------+------------------+-------------------------+
| fldBatchID | fldTransactionID | fldTransactionNetWeight |
+------------+------------------+-------------------------+
| 2862 | 10242352 | 16.26 |
| 2862 | 10242353 | 22.82 |
| 2862 | 10242362 | 18.52 |
| 2862 | 10242363 | 21.44 |
| 2862 | 10242364 | 20.32 |
+------------+------------------+-------------------------+
Batch Table
+------------+-------------------------+
| fldBatchID | fldBatchDate |
+------------+-------------------------+
| 2862 | 2020-03-04 00:00:00.000 |
+------------+-------------------------+
Desired output with the above snippets
+------+----------+-----------+---------+
| ID | Batch In | Batch Out | Yield |
+------+----------+-----------+---------+
| 2862 | 927.00 | 90.36 | 10.76 % |
+------+----------+-----------+---------+
I think you just want to aggregate before joining:
select b.fldBatchID as ID,
(bIn.fldBatchDetailsWeight) as batch_in,
(t.fldTransactionNetWeight) as batch_out,
format(t.fldTransactionNetWeight / bIn.fldBatchDetailsWeight, 'P2' ) as Yield
from batch b left join
(select bin.fldBatchID, sum(fldBatchDetailsWeight) as fldBatchDetailsWeight
from vwBatchInData bin
group by bin.fldBatchID
) bin
on bIn.fldBatchID = b.fldBatchID left join
(select t.fldBatchID, sum(fldTransactionNetWeight) as fldTransactionNetWeight
from [TRANSACTION] t
group by t.fldBatchID
) t
on t.fldBatchID = b.fldBatchID
where CAST(b.fldBatchDate as date) = '2020-03-04';
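To illustrate why aggregating before joining matters, the sketch below reproduces the fan-out with Python's built-in sqlite3 module instead of SQL Server. The table names batch_in and trans and the aliases total_in and total_out are hypothetical stand-ins for vwBatchInData and [TRANSACTION]; the weights are the sample rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE batch_in (fldBatchID INTEGER, fldBatchDetailsWeight REAL);
CREATE TABLE trans (fldBatchID INTEGER, fldTransactionNetWeight REAL);
INSERT INTO batch_in VALUES
  (2862, 164.40), (2862, 190.80), (2862, 195.00), (2862, 200.20), (2862, 176.60);
INSERT INTO trans VALUES
  (2862, 16.26), (2862, 22.82), (2862, 18.52), (2862, 21.44), (2862, 20.32);
""")

# Joining first pairs each of the 5 batch_in rows with each of the 5 trans rows
# (25 rows), so every weight is summed 5 times.
naive = conn.execute("""
SELECT SUM(b.fldBatchDetailsWeight), SUM(t.fldTransactionNetWeight)
FROM batch_in b
JOIN trans t ON t.fldBatchID = b.fldBatchID
""").fetchone()

# Aggregating each side to one row per batch first avoids the fan-out.
fixed = conn.execute("""
SELECT b.total_in, t.total_out
FROM (SELECT fldBatchID, SUM(fldBatchDetailsWeight) AS total_in
      FROM batch_in GROUP BY fldBatchID) b
JOIN (SELECT fldBatchID, SUM(fldTransactionNetWeight) AS total_out
      FROM trans GROUP BY fldBatchID) t
  ON t.fldBatchID = b.fldBatchID
""").fetchone()

print("joined first:    ", naive)
print("aggregated first:", fixed)
```

With one row per batch on each side, the join can no longer multiply rows, so the sums stay correct regardless of how many detail rows each table holds.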

In Hive, what is the difference between explode() and lateral view explode()

Assume there is a table employee:
+-----------+------------------+
| col_name | data_type |
+-----------+------------------+
| id | string |
| perf | map<string,int> |
+-----------+------------------+
and the data inside this table:
+-----+------------------------------------+--+
| id | perf |
+-----+------------------------------------+--+
| 1 | {"job":80,"person":70,"team":60} |
| 2 | {"job":60,"team":80} |
| 3 | {"job":90,"person":100,"team":70} |
+-----+------------------------------------+--+
I tried the following two queries, but they both return the same result:
1. select explode(perf) from employee;
2. select key,value from employee lateral view explode(perf) as key,value;
The result:
+---------+--------+--+
| key | value |
+---------+--------+--+
| job | 80 |
| team | 60 |
| person | 70 |
| job | 60 |
| team | 80 |
| job | 90 |
| team | 70 |
| person | 100 |
+---------+--------+--+
So, what is the difference between them? I did not find suitable examples. Any help is appreciated.
For your particular case both queries are OK. But you can't use multiple explode() functions without lateral view. So, the query below will fail:
select explode(array(1,2)), explode(array(3, 4))
You'll need to write something like:
select
a_exp.a,
b_exp.b
from (select array(1, 2) as a, array(3, 4) as b) t
lateral view explode(t.a) a_exp as a
lateral view explode(t.b) b_exp as b
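One further practical difference, stated here as general Hive behaviour rather than something from the answer above: a bare explode() in the select list cannot be combined with other columns such as id, whereas lateral view joins the generated rows back to the source row so those columns stay available. The Python sketch below only models that difference on the employee data from the question; both function names are hypothetical.

```python
employee = [
    ("1", {"job": 80, "person": 70, "team": 60}),
    ("2", {"job": 60, "team": 80}),
    ("3", {"job": 90, "person": 100, "team": 70}),
]


def explode_only(rows):
    """Model of SELECT explode(perf): only the generated key/value columns survive."""
    return [(key, value) for _, perf in rows for key, value in perf.items()]


def lateral_view_explode(rows):
    """Model of SELECT id, key, value ... LATERAL VIEW explode(perf): id stays available."""
    return [(emp_id, key, value) for emp_id, perf in rows for key, value in perf.items()]


print(explode_only(employee))
print(lateral_view_explode(employee))
```

Both functions produce the same number of rows; the lateral-view variant simply carries the id of the originating row along with each key/value pair.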