Spark dataframe add Missing Values


I have a dataframe of the following format. I want to add empty rows for missing time stamps for each customer.
+-------------+----------+------+----+----+
| Customer_ID | TimeSlot | A1 | A2 | An |
+-------------+----------+------+----+----+
| c1 | 1 | 10.0 | 2 | 3 |
| c1 | 2 | 11 | 2 | 4 |
| c1 | 4 | 12 | 3 | 5 |
| c2 | 2 | 13 | 2 | 7 |
| c2 | 3 | 11 | 2 | 2 |
+-------------+----------+------+----+----+
The resulting table should be of the format
+-------------+----------+------+------+------+
| Customer_ID | TimeSlot | A1 | A2 | An |
+-------------+----------+------+------+------+
| c1 | 1 | 10.0 | 2 | 3 |
| c1 | 2 | 11 | 2 | 4 |
| c1 | 3 | null | null | null |
| c1 | 4 | 12 | 3 | 5 |
| c2 | 1 | null | null | null |
| c2 | 2 | 13 | 2 | 7 |
| c2 | 3 | 11 | 2 | 2 |
| c2 | 4 | null | null | null |
+-------------+----------+------+------+------+
I have 1 million customers and 360 time slots (the example above shows only 4).
I figured out one way to do it: create a DataFrame with two columns (Customer_ID, TimeSlot) holding 1 M x 360 rows, and left outer join it with the original DataFrame.
Is there a better way to do this?

You can express this as a SQL query:
select c.Customer_ID, t.TimeSlot,
       df.A1, df.A2, df.An
from (select distinct Customer_ID from df) c cross join
     (select distinct TimeSlot from df) t left join
     df
     on df.Customer_ID = c.Customer_ID and df.TimeSlot = t.TimeSlot;
Notes:
You should probably put this into another dataframe.
You might have tables with the available customers and/or timeslots. Use those instead of the subqueries.
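As a quick check of the cross join plus left join shape, here is a minimal sketch that runs the query against the question's sample rows. It assumes an in-memory SQLite database through Python's sqlite3 module purely for illustration; the table name df and the column names are taken from the question, and the customer and time-slot keys come from the cross join so that the filled-in rows keep their keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE df (Customer_ID TEXT, TimeSlot INTEGER, A1 REAL, A2 INTEGER, An INTEGER)"
)
conn.executemany(
    "INSERT INTO df VALUES (?, ?, ?, ?, ?)",
    [("c1", 1, 10.0, 2, 3), ("c1", 2, 11, 2, 4), ("c1", 4, 12, 3, 5),
     ("c2", 2, 13, 2, 7), ("c2", 3, 11, 2, 2)],
)

# Every (customer, slot) pair comes from the cross join; the left join
# attaches the measurements where they exist and NULL where they do not.
filled = conn.execute("""
    SELECT c.Customer_ID, t.TimeSlot, df.A1, df.A2, df.An
    FROM (SELECT DISTINCT Customer_ID FROM df) AS c
    CROSS JOIN (SELECT DISTINCT TimeSlot FROM df) AS t
    LEFT JOIN df
      ON df.Customer_ID = c.Customer_ID AND df.TimeSlot = t.TimeSlot
    ORDER BY c.Customer_ID, t.TimeSlot
""").fetchall()

for row in filled:
    print(row)
```

The same SELECT should carry over to Spark SQL once the DataFrame is registered as a view, although that is not tested here.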

I think you can use Gordon Linoff's answer, but since you said there are millions of customers and you are joining across them, you could add the following.
Use a tally table for TimeSlot, because it might give better performance.
For more on how to use one, please refer to the following link:
http://www.sqlservercentral.com/articles/T-SQL/62867/
I also think you should use a partition or row-number function to split the Customer_ID column and select customers based on some partition value. For example, select a range of row-number values and then cross join those customers with the tally table. It can improve your performance.
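A minimal sketch of the tally-table idea, assuming SQLite through Python's sqlite3 module for illustration only. The tally is generated with a recursive CTE instead of being read from the data, so a slot that no customer has at all still appears. The name N_SLOTS and the trimmed-down df table (one measurement column) are assumptions made for the example; in the question N_SLOTS would be 360.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (Customer_ID TEXT, TimeSlot INTEGER, A1 REAL)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?, ?)",
    [("c1", 1, 10.0), ("c1", 2, 11.0), ("c1", 4, 12.0),
     ("c2", 2, 13.0), ("c2", 3, 11.0)],
)

N_SLOTS = 4  # hypothetical; 360 in the question

# tally(n) yields 1..N_SLOTS without touching the data table.
tally_rows = conn.execute("""
    WITH RECURSIVE tally(n) AS (
        SELECT 1
        UNION ALL
        SELECT n + 1 FROM tally WHERE n < ?
    )
    SELECT c.Customer_ID, tally.n AS TimeSlot, df.A1
    FROM (SELECT DISTINCT Customer_ID FROM df) AS c
    CROSS JOIN tally
    LEFT JOIN df
      ON df.Customer_ID = c.Customer_ID AND df.TimeSlot = tally.n
    ORDER BY c.Customer_ID, tally.n
""", (N_SLOTS,)).fetchall()

for row in tally_rows:
    print(row)
```

Processing customers in chunks, as suggested above, would amount to adding a filter on the customer subquery and running the statement once per chunk.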

Related

Merge groups if they contain the same value

I have the following table:
+-----+----+---------+
| grp | id | sub_grp |
+-----+----+---------+
| 10 | A2 | 1 |
| 10 | B4 | 2 |
| 10 | F1 | 2 |
| 10 | B3 | 3 |
| 10 | C2 | 4 |
| 10 | A2 | 4 |
| 10 | H4 | 5 |
| 10 | K0 | 5 |
| 10 | Z3 | 5 |
| 10 | F1 | 5 |
| 10 | A1 | 5 |
| 10 | A | 6 |
| 10 | B | 6 |
| 10 | B | 7 |
| 10 | C | 7 |
| 10 | C | 8 |
| 10 | D | 8 |
| 20 | A | 1 |
| 20 | B | 1 |
| 20 | B | 2 |
| 20 | C | 2 |
| 20 | C | 3 |
| 20 | D | 3 |
+-----+----+---------+
Within every grp, my goal is to merge all the sub_grp sharing at least one id.
More than 2 sub_grp can be merged together.
The expected result should be:
+-----+----+---------+
| grp | id | sub_grp |
+-----+----+---------+
| 10 | A2 | 1 |
| 10 | B4 | 2 |
| 10 | F1 | 2 |
| 10 | B3 | 3 |
| 10 | C2 | 1 |
| 10 | A2 | 1 |
| 10 | H4 | 2 |
| 10 | K0 | 2 |
| 10 | Z3 | 2 |
| 10 | F1 | 2 |
| 10 | A1 | 2 |
| 10 | A | 6 |
| 10 | B | 6 |
| 10 | B | 6 |
| 10 | C | 6 |
| 10 | C | 6 |
| 10 | D | 6 |
| 20 | A | 1 |
| 20 | B | 1 |
| 20 | B | 1 |
| 20 | C | 1 |
| 20 | C | 1 |
| 20 | D | 1 |
+-----+----+---------+
Here is a SQL Fiddle with the test values: http://sqlfiddle.com/#!9/13666c/2
I am trying to solve this either with a stored procedure or queries.
This is an evolution from my previous problem: Merge rows containing same values

My understanding of the problem
Merge sub_grp values (for a given grp) if any one of the IDs in one sub_grp matches any one of the IDs in another sub_grp. A given sub_grp can be merged with only one other sub_grp (the earliest in ascending order).
Disclaimer
This code may work. Not tested as OP did not provide DDLs and data scripts.
Solution
UPDATE t
SET t.sub_grp = final.new_sub_grp
FROM tbl AS t
INNER JOIN
    -- For each grp, sub_grp combination return a matching new_sub_grp
    ( SELECT a.grp, a.sub_grp, MIN(MatchGrp.sub_grp) AS new_sub_grp
      FROM tbl AS a
      -- CROSS APPLY drops rows with no matching sub_grp, so there is nothing to update for them.
      CROSS APPLY
          -- Find the earliest (if more than one sub-group is a match) matching sub-group where one of the IDs matches
          ( SELECT TOP 1 b.sub_grp
            FROM tbl AS b
            -- b.sub_grp < a.sub_grp - this will only look at the earlier sub-groups, avoiding the "double linking"
            WHERE b.grp = a.grp AND b.sub_grp < a.sub_grp AND b.id = a.id
            ORDER BY b.sub_grp ) AS MatchGrp
      -- Only return one record per grp, sub_grp combo
      GROUP BY a.grp, a.sub_grp ) AS final
    ON final.grp = t.grp AND final.sub_grp = t.sub_grp
Because each pass links a sub-group to only one earlier sub-group, chains such as sub-groups 6, 7 and 8 in the sample need the statement to be re-run until no rows are updated.
You can re-number the sub-groups afterwards in a separate update statement with the help of the DENSE_RANK window function.
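For comparison, the merge can also be treated as a connected-components problem outside the database. The sketch below is a different technique from the UPDATE above, not a translation of it: it uses a small union-find in Python, and the function name merge_sub_groups and the tuple layout (grp, id, sub_grp) are assumptions made for illustration. Each merged set is labelled with its smallest original sub_grp, which is what the expected result in the question shows.

```python
def merge_sub_groups(rows):
    """rows: list of (grp, id, sub_grp). Returns the rows with merged sub_grp labels."""
    parent = {}

    def find(node):
        # Walk up to the representative, shortening the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # The smaller sub_grp number stays the representative.
            low, high = (ra, rb) if ra < rb else (rb, ra)
            parent[high] = low

    first_seen = {}  # (grp, id) -> first (grp, sub_grp) that contained that id
    for grp, id_, sub in rows:
        node = (grp, sub)
        parent.setdefault(node, node)
        key = (grp, id_)
        if key in first_seen:
            union(first_seen[key], node)  # shared id: same merged group
        else:
            first_seen[key] = node

    return [(grp, id_, find((grp, sub))[1]) for grp, id_, sub in rows]


sample = [(20, "A", 1), (20, "B", 1), (20, "B", 2),
          (20, "C", 2), (20, "C", 3), (20, "D", 3)]
print(merge_sub_groups(sample))
```

Unlike a single pass of the UPDATE, this resolves chains of any length in one go, at the cost of pulling the rows out of the database.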

Sum with 3 tables to join

I have 3 tables. The link between the first and the second table is REQ_ID and the link between the second and the third table is ENC_ID. There is no direct link between the first and the third table.
INS_RCPT
+----+--------+------+----------+
| ID | REQ_ID | CURR | RCPT_AMT |
+----+--------+------+----------+
| 1 | 1 | USD | 100 |
| 2 | 2 | USD | 200 |
| 3 | 3 | USD | 300 |
+----+--------+------+----------+
ENC_LOG
+----+--------+--------+-------------+
| ID | REQ_ID | ENC_ID | ENC_LOG_AMT |
+----+--------+--------+-------------+
| 1 | 1 | 1 | 20 |
| 2 | 1 | 2 | 50 |
| 3 | 1 | 3 | 30 |
| 4 | 2 | 4 | 20 |
+----+--------+--------+-------------+
ENC_RCPT
+----+--------+--------------+
| ID | ENC_ID | ENC_RCPT_AMT |
+----+--------+--------------+
| 1 | 1 | 10 |
| 2 | 1 | 10 |
| 3 | 2 | 15 |
| 4 | 2 | 25 |
| 5 | 2 | 10 |
| 6 | 3 | 12 |
| 7 | 3 | 18 |
| 8 | 4 | 10 |
+----+--------+--------------+
I would like to have output as follows:
+----+--------+------+----------+-------------+--------------+
| ID | REQ_ID | CURR | RCPT_AMT | ENC_LOG_AMT | ENC_RCPT_AMT |
+----+--------+------+----------+-------------+--------------+
| 1 | 1 | USD | 100 | 100 | 100 |
| 2 | 2 | USD | 200 | 20 | 10 |
| 3 | 3 | USD | 300 | 0 | 0 |
+----+--------+------+----------+-------------+--------------+
I am using SQL Server to write this query. Any help is appreciated.

One approach would be to join the first table to two subqueries which compute the sums separately:
SELECT
    ir.ID,
    ir.REQ_ID,
    ir.CURR,
    ir.RCPT_AMT,
    COALESCE(el.ENC_LOG_AMT, 0) AS ENC_LOG_AMT,   -- 0 instead of NULL when a request has no log rows
    COALESCE(er.ENC_RCPT_AMT, 0) AS ENC_RCPT_AMT
FROM INS_RCPT ir
LEFT JOIN
(
    SELECT REQ_ID, SUM(ENC_LOG_AMT) AS ENC_LOG_AMT
    FROM ENC_LOG
    GROUP BY REQ_ID
) el
    ON ir.REQ_ID = el.REQ_ID
LEFT JOIN
(
    SELECT t1.REQ_ID, SUM(t2.ENC_RCPT_AMT) AS ENC_RCPT_AMT
    FROM ENC_LOG t1
    INNER JOIN ENC_RCPT t2 ON t1.ENC_ID = t2.ENC_ID
    GROUP BY t1.REQ_ID
) er
    ON ir.REQ_ID = er.REQ_ID
Note that your question includes a curve ball. The second subquery needs to return aggregates of the receipt table by REQ_ID, even though this field does not appear in that table. As a result, we actually need to join ENC_LOG to ENC_RCPT in that subquery, and then aggregate by REQ_ID.
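To see the two pre-aggregated subqueries at work on the question's data, here is a minimal sketch that assumes an in-memory SQLite database via Python's sqlite3 module (the question itself targets SQL Server). It wraps both sums in COALESCE so that request 3, which has no ENC_LOG rows, shows 0 as in the desired output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE INS_RCPT (ID INTEGER, REQ_ID INTEGER, CURR TEXT, RCPT_AMT INTEGER);
    CREATE TABLE ENC_LOG  (ID INTEGER, REQ_ID INTEGER, ENC_ID INTEGER, ENC_LOG_AMT INTEGER);
    CREATE TABLE ENC_RCPT (ID INTEGER, ENC_ID INTEGER, ENC_RCPT_AMT INTEGER);
    INSERT INTO INS_RCPT VALUES (1,1,'USD',100),(2,2,'USD',200),(3,3,'USD',300);
    INSERT INTO ENC_LOG  VALUES (1,1,1,20),(2,1,2,50),(3,1,3,30),(4,2,4,20);
    INSERT INTO ENC_RCPT VALUES (1,1,10),(2,1,10),(3,2,15),(4,2,25),(5,2,10),
                                (6,3,12),(7,3,18),(8,4,10);
""")

# Each subquery collapses to one row per REQ_ID before joining,
# so neither sum is inflated by the other table's row count.
totals = conn.execute("""
    SELECT ir.ID, ir.REQ_ID, ir.CURR, ir.RCPT_AMT,
           COALESCE(el.ENC_LOG_AMT, 0)  AS ENC_LOG_AMT,
           COALESCE(er.ENC_RCPT_AMT, 0) AS ENC_RCPT_AMT
    FROM INS_RCPT ir
    LEFT JOIN (SELECT REQ_ID, SUM(ENC_LOG_AMT) AS ENC_LOG_AMT
               FROM ENC_LOG
               GROUP BY REQ_ID) el
           ON ir.REQ_ID = el.REQ_ID
    LEFT JOIN (SELECT t1.REQ_ID, SUM(t2.ENC_RCPT_AMT) AS ENC_RCPT_AMT
               FROM ENC_LOG t1
               INNER JOIN ENC_RCPT t2 ON t1.ENC_ID = t2.ENC_ID
               GROUP BY t1.REQ_ID) er
           ON ir.REQ_ID = er.REQ_ID
    ORDER BY ir.ID
""").fetchall()

for row in totals:
    print(row)
```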

You can try the query below. Change the joins from left to inner if that suits your requirement.
select a.id, a.req_id, a.curr, a.rcpt_amt,
       coalesce(sum(b.enc_log_amt), 0) as enc_log_amt,
       coalesce(sum(c.enc_rcpt_amt), 0) as enc_rcpt_amt
from ins_rcpt a
left join enc_log b
    on a.req_id = b.req_id
left join
(
    -- total the receipts per enc_id first, so this join cannot multiply the enc_log rows
    select enc_id, sum(enc_rcpt_amt) as enc_rcpt_amt
    from enc_rcpt
    group by enc_id
) c
    on b.enc_id = c.enc_id
group by a.id, a.req_id, a.curr, a.rcpt_amt;

TSQL - Referencing a changed value from previous row

I am trying to do a row calculation whereby the larger value carries forward to the subsequent rows until an even larger value is encountered. I do this by comparing the current value to the previous row using the LAG() function.
Code
DECLARE @TAB TABLE (id varchar(1), d1 INT, d2 INT)
INSERT INTO @TAB (id,d1,d2)
VALUES ('A',0,5)
      ,('A',1,2)
      ,('A',2,4)
      ,('A',3,6)
      ,('B',0,4)
      ,('B',2,3)
      ,('B',3,2)
      ,('B',4,5)
SELECT id
      ,d1
      ,d2 = CASE WHEN id <> (LAG(id,1,0) OVER (ORDER BY id,d1)) THEN d2
                 WHEN d2 < (LAG(d2,1,0) OVER (ORDER BY id,d1)) THEN (LAG(d2,1,0) OVER (ORDER BY id,d1))
                 ELSE d2 END
FROM @TAB
Output (added column od2 for clarity)
+----+----+----+ +----+
| id | d1 | d2 | | od2|
+----+----+----+ +----+
| A | 0 | 5 | | 5 |
| A | 1 | 5 | | 2 |
| A | 2 | 4 | | 4 |
| A | 3 | 6 | | 6 |
| B | 0 | 4 | | 4 |
| B | 2 | 4 | | 3 |
| B | 3 | 3 | | 2 |
| B | 4 | 5 | | 5 |
+----+----+----+ +----+
As you can see from the output, the LAG function references the original value of the previous row rather than the new value. Is there any way to achieve this?
Desired Output
+----+----+----+ +----+
| id | d1 | d2 | | od2|
+----+----+----+ +----+
| A | 0 | 5 | | 5 |
| A | 1 | 5 | | 2 |
| A | 2 | 5 | | 4 |
| A | 3 | 6 | | 6 |
| B | 0 | 4 | | 4 |
| B | 2 | 4 | | 3 |
| B | 3 | 4 | | 2 |
| B | 4 | 5 | | 5 |
+----+----+----+ +----+

Try this:
SELECT id
      ,d1
      ,d2
      ,MAX(d2) OVER (PARTITION BY id ORDER BY d1) AS max_d2
FROM @TAB
The idea is to use MAX as a window function: with an ORDER BY inside OVER, it returns the maximum value from the beginning of each partition up to the current row.
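A minimal sketch of the running maximum on the question's rows, assuming SQLite 3.25 or later (the first version with window functions) through Python's sqlite3 module. The table name tab stands in for the table variable, and the frame is spelled out explicitly so that the start-of-partition-to-current-row behaviour is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (id TEXT, d1 INTEGER, d2 INTEGER)")
conn.executemany("INSERT INTO tab VALUES (?, ?, ?)", [
    ("A", 0, 5), ("A", 1, 2), ("A", 2, 4), ("A", 3, 6),
    ("B", 0, 4), ("B", 2, 3), ("B", 3, 2), ("B", 4, 5),
])

# od2 is the original value; d2 is the largest value seen so far within the id.
running = conn.execute("""
    SELECT id, d1, d2 AS od2,
           MAX(d2) OVER (PARTITION BY id ORDER BY d1
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS d2
    FROM tab
    ORDER BY id, d1
""").fetchall()

for row in running:
    print(row)
```

Because the window restarts for every id, the first row of B begins again from its own value instead of inheriting the last maximum of A.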

Thanks for providing the DDL scripts and the DML.
One way of doing it would be to use a recursive CTE, as follows.
1. First number all the records within each id, ordered by d1 and d2 (the cte block).
2. Start the recursion from the first row of each id (rnk = 1).
3. The field compared_val carries the running value forward: if the value from the previous rnk is greater than the current d2 it is kept, otherwise the current d2 replaces it.
DECLARE @TAB TABLE (id varchar(1), d1 INT, d2 INT)
INSERT INTO @TAB (id,d1,d2)
VALUES ('A',0,5)
      ,('A',1,2)
      ,('A',2,4)
      ,('A',3,6)
      ,('B',0,4)
      ,('B',2,3)
      ,('B',3,2)
      ,('B',4,5)
;with cte
as (select row_number() over(partition by id order by d1,d2) as rnk
          ,id,d1,d2
    from @TAB
   )
,data(rnk,id,d1,d2,compared_val)
as (select rnk,id,d1,d2,d2 as compared_val
    from cte
    where rnk=1
    union all
    select a.rnk,a.id,a.d1,a.d2,case when b.compared_val > a.d2 then
                                          b.compared_val
                                     else a.d2
                                end
    from cte a
    join data b
      on a.id=b.id
     and a.rnk=b.rnk+1
   )
select * from data order by id,d1,d2

Query Performance Netezza SQL

What is the best way to express the logic below in a single Netezza SQL statement? I implemented it in a for loop, but for my data set it takes a long time in Netezza (around 47 minutes to complete the loop). I have two tables: TABLE A (Sector_ID | Value), and TABLE B, which records which sector_id intersects with which other sector_id.
TABLE A is sorted descending on Value. I need to take each sector_id from TABLE A in that order, highest first, and eliminate all the sector_ids that TABLE B lists as intersecting with it.
For example:
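TABLE A (after sorting by VALUE, descending)
+-----------+-------+--------------+
| SECTOR_ID | VALUE | DELETED ROWS |
+-----------+-------+--------------+
| 6 | 150 | |
| 1 | 140 | DELETED |
| 4 | 50 | |
| 2 | 45 | DELETED |
| 3 | 15 | |
+-----------+-------+--------------+
TABLE B
+-----------+----------------+--------------+
| SECTOR_ID | INTERSECTED_ID | DELETED ROWS |
+-----------+----------------+--------------+
| 6 | 6 | |
| 6 | 1 | DELETED |
| 6 | 2 | DELETED |
| 1 | 1 | DELETED |
| 1 | 4 | DELETED |
| 1 | 2 | DELETED |
| 4 | 4 | |
| 4 | 1 | DELETED |
| 2 | 6 | DELETED |
| 2 | 1 | DELETED |
| 2 | 2 | DELETED |
| 3 | 3 | |
+-----------+----------------+--------------+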
Now the remaining values in TABLE – A will be the desired output. Please suggest. The DB I am using is Netezza.

I'm going to attempt to restate your problem, so if it is not accurate let me know in the comments so that we can reformulate it (and this answer).
You need to remove records in table_a when table_a.sector_id appears in the list of previous table_b.intersected_ids given that table_b is sorted by table_a.value.
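For reference, the row-by-row procedure described in the question can be sketched in a few lines of Python, which is handy for checking a set-based rewrite on small inputs. This is one reading of the question's loop, not the answer's method: the function name surviving_sectors and the tuple layouts are assumptions made for illustration.

```python
def surviving_sectors(table_a, table_b):
    """table_a: list of (sector_id, value); table_b: list of (sector_id, intersected_id)."""
    intersects = {}
    for sector_id, intersected_id in table_b:
        intersects.setdefault(sector_id, set()).add(intersected_id)

    eliminated = set()
    kept = []
    # Visit sectors from highest to lowest value, as in the question.
    for sector_id, _value in sorted(table_a, key=lambda row: row[1], reverse=True):
        if sector_id in eliminated:
            continue  # already removed by a higher-valued sector
        kept.append(sector_id)
        # Everything this sector intersects, other than itself, is removed.
        eliminated |= intersects.get(sector_id, set()) - {sector_id}
    return kept


table_a = [(6, 150), (1, 140), (4, 50), (2, 45), (3, 15)]
table_b = [(6, 6), (6, 1), (6, 2), (1, 1), (1, 4), (1, 2),
           (4, 4), (4, 1), (2, 6), (2, 1), (2, 2), (3, 3)]
print(surviving_sectors(table_a, table_b))
```

On the question's sample this keeps sectors 6, 4 and 3, which are the rows of TABLE A not marked DELETED.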
Solution
Note that this solution is not specific to Netezza; it relies only on relational operations. Also, as far as I know, a set-based approach like this will be faster than any cursor or loop-based approach on any RDBMS.
The biggest chore is setting up the list of sector_ids that need to be deleted from table_a. See the in-line comments for descriptions.
create temporary table table_b_extended as
with tba as ( --Enhance table_a to include a row number.
    select
        row_number() over (order by sector_value desc) rwn
        ,*
    from
        table_a
), tab as ( --Join tables A and B together to attach the sorting key.
    select
        tba.rwn table_a_rwn
        ,row_number() over (order by tba.rwn) table_b_rwn
        ,tbb.*
        ,case
            when tbb.sector_id = tbb.intersected_id
                then 1
            else 0
        end sector_is_intersected
    from
        tba
        join table_b tbb on
            tba.sector_id = tbb.sector_id
)
select * from tab
distribute on (sector_id);
-- Find out the row where the intersected id first appears.
create temporary table table_b_first_appearance as
select
    intersected_id sector_id
    ,min(table_b_rwn) first_appearance
from
    table_b_extended
where
    sector_id <> intersected_id
group by 1
distribute on random;
create temporary table table_a_deletes as
with pid as ( --Get all previous intersected_ids.
    select distinct
        tab.*
        ,case --See if this row is after the intersected_id's first appearance.
            when app.first_appearance < tab.table_b_rwn then 1
            else 0
        end sector_in_previous
    from
        table_b_extended tab
        left outer join table_b_first_appearance app using (sector_id)
), vld as ( --Select records that qualify to delete from table_a.
    select distinct
        intersected_id
    from
        pid
    where
        --If it hasn't been seen and isn't equal to the intersected_id, delete it.
        sector_is_intersected + sector_in_previous = 0
)
select
    *
from
    vld
distribute on random;
Given your initial input for table_b:
+------------+----------------+
| sector_id | intersected_id |
+------------+----------------+
| 6 | 6 |
| 1 | 2 |
| 1 | 1 |
| 1 | 4 |
| 4 | 4 |
| 4 | 1 |
| 2 | 6 |
| 2 | 1 |
| 2 | 2 |
| 3 | 3 |
| 6 | 1 |
| 6 | 2 |
+------------+----------------+
This generates a table, table_a_deletes, with two values: 1 and 2. Then deleting from table_a is simple.
delete from table_a tbl
where tbl.sector_id in (select intersected_id from table_a_deletes); --table_a_deletes only has the intersected_id column
And I'm not sure if those DELETED flags in table_b need to be replicated or not, but if so:
delete from table_b tbl
where
    tbl.sector_id in (select intersected_id from table_a_deletes)
    or tbl.sector_id <> tbl.intersected_id;
Performance
On an 8 SPU test system, including the delete steps:
Test 1
table_a size 14976
table_b size 179427095
Runtime 2:50
Test 2
table_a size 14976
table_b size 196240063
Runtime 3:16
Test 3
table_a size 19919
table_b size 317428924
Runtime 5:28
By far the longest time is creating the _extended table. So if you can use some other comparator rather than establishing a new id field, that would be best.
Extra Cases
Here are some different cases to show that it works in a variety of situations. In all cases, I have just modified table_b, since changing table_a is always the trivial case.
Case 1
table_b
+------------+----------------+
| sector_id | intersected_id |
+------------+----------------+
| 6 | 6 |
| 1 | 1 |
| 1 | 2 |
| 4 | 4 |
| 4 | 1 |
| 2 | 6 |
| 2 | 1 |
| 2 | 2 |
| 3 | 3 |
+------------+----------------+
Deletes 1 and 2.
Case 2
table_b
+------------+----------------+
| sector_id | intersected_id |
+------------+----------------+
| 6 | 6 |
| 6 | 1 |
| 1 | 1 |
| 1 | 2 |
| 4 | 4 |
| 4 | 1 |
| 2 | 6 |
| 2 | 1 |
| 2 | 2 |
| 3 | 3 |
+------------+----------------+
Deletes 1.
Case 3
table_b
+------------+----------------+
| sector_id | intersected_id |
+------------+----------------+
| 6 | 6 |
| 1 | 1 |
| 1 | 4 |
| 4 | 4 |
| 4 | 1 |
| 2 | 6 |
| 2 | 1 |
| 2 | 2 |
| 3 | 3 |
+------------+----------------+
Deletes 1, 4, and 6.

find other columns value based on maximum of one column using groupby particular column

I have data like below
+-------+---------+--------+
| Count | Mindif | Device |
+-------+---------+--------+
| 45 | 3 | A |
| 78 | 4 | A |
| 52 | 5 | A |
| 24 | 6 | A |
| 22 | 1 | B |
| 22 | 2 | B |
| 34 | 3 | B |
| 37 | 4 | B |
| 52 | 5 | B |
| 34 | 6 | B |
| 13 | 1 | C |
| 30 | 2 | C |
| 57 | 3 | C |
| 111 | 4 | C |
| 35 | 5 | C |
+-------+---------+--------+
I want to find the Mindif and Device for the maximum value of Count within each Device.
The output should look like:
+-------+---------+--------+
| Count | Mindif | Device |
+-------+---------+--------+
| 78 | 4 | A |
| 52 | 5 | B |
| 111 | 4 | C |
+-------+---------+--------+

You can use a query like this:
SELECT t1.Count, t1.Mindif, t1.Device
FROM mytable AS t1
JOIN (
SELECT Device, MAX(Count) AS Count
FROM mytable
GROUP BY Device
) AS t2 ON t1.Device = t2.Device AND t1.Count = t2.Count
The query uses a derived table that returns the max Count value per Device. Joining back to the original table we can get the desired result.
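The join back to the per-device maximum can be tried on the question's rows with the minimal sketch below, which assumes an in-memory SQLite database via Python's sqlite3 module. The table name mytable comes from the answer; the alias max_count and the double quotes around Count (to keep it clearly separate from the COUNT function) are choices made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE mytable ("Count" INTEGER, Mindif INTEGER, Device TEXT)')
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", [
    (45, 3, "A"), (78, 4, "A"), (52, 5, "A"), (24, 6, "A"),
    (22, 1, "B"), (22, 2, "B"), (34, 3, "B"), (37, 4, "B"), (52, 5, "B"), (34, 6, "B"),
    (13, 1, "C"), (30, 2, "C"), (57, 3, "C"), (111, 4, "C"), (35, 5, "C"),
])

# t2 holds one row per Device with its maximum Count;
# joining back on both columns recovers the full matching rows.
top_rows = conn.execute("""
    SELECT t1."Count", t1.Mindif, t1.Device
    FROM mytable AS t1
    JOIN (
        SELECT Device, MAX("Count") AS max_count
        FROM mytable
        GROUP BY Device
    ) AS t2 ON t1.Device = t2.Device AND t1."Count" = t2.max_count
    ORDER BY t1.Device
""").fetchall()

for row in top_rows:
    print(row)
```

If two rows of the same device tie for the maximum, this approach returns both of them.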

Using a window function:
SELECT Count, Mindif, Device
FROM
    (SELECT Count, Mindif, Device,
            rank() OVER (PARTITION BY Device ORDER BY Count DESC) AS r
     FROM mytable) S
WHERE S.r = 1;
Or a simple join with MAX:
SELECT a.* FROM mytable a
LEFT SEMI JOIN
    (SELECT Device, MAX(Count) AS Cnt
     FROM mytable
     GROUP BY Device) b ON (a.Device = b.Device AND a.Count = b.Cnt)
