Populating values from earlier records based on column value in Spark query


I have a requirement to write a Spark SQL (2.4) query to determine and populate column value from earlier records based on a (srvc_typ) column/condition -
Source:
id | policy | prsn | prv | srvc_typ
1 | 1234G | xyz | prv1 | proc
2 | 1234G | xyz | prv1 | drg
3 | 1234G | xyz | prv1 | drg
4 | 1234G | xyz | prv1 | proc
5 | 1234G | xyz | prv1 | drg
Requirement:
Within a particular group of records (key: policy, prsn, prv), populate the id column as -
Within a particular group if the srvc_typ of any record is 'drg' then populate its id column with the id value of its last encountered non-drg record - i.e. srvc_typ <> 'drg'.
If the srvc_typ of a record (within that same group) is 'proc', then it retains its own id.
Expected output: (Note the changes in the id column)
id | policy | prsn | prv | srvc_typ
1 | 1234G | xyz | prv1 | proc
1 | 1234G | xyz | prv1 | drg
1 | 1234G | xyz | prv1 | drg
4 | 1234G | xyz | prv1 | proc
4 | 1234G | xyz | prv1 | drg
Explanation:
Since the first record is of srvc_typ = 'proc', its original id
value is retained i.e. 1
Now, for the second and third rows,
their srvc_typ value is 'drg', hence the id column of these rows
are to be changed to their last encountered non-drg (proc) record
i.e. 1 (record number 1)
The fourth record has srvc_typ = 'proc' hence it retains its original id value i.e 4
Now, the fifth and final record has a type of 'drg' hence its id value should change
and hence should be equal to its last encountered 'proc' record i.e. 4 (record number 4)
There could be multiple occurrences of proc records in the same group as shown above.
Can someone please help me write the query in Spark SQL using the %sql% API?
Happy to provide additional information if required.
Thanks

You need a window function (e.g. last_value) to look at previous rows, and a CASE expression so that only rows with srvc_typ = 'proc' are taken into account. The second argument, TRUE, tells last_value to ignore NULLs. Here is a working query:
WITH input (id, policy, prsn, prv, srvc_typ) AS (
SELECT 1, '1234G', 'xyz', 'prv1', 'proc'
UNION ALL
SELECT 2, '1234G', 'xyz', 'prv1', 'drg'
UNION ALL
SELECT 3, '1234G', 'xyz', 'prv1', 'drg'
UNION ALL
SELECT 4, '1234G', 'xyz', 'prv1', 'proc'
UNION ALL
SELECT 5, '1234G', 'xyz', 'prv1', 'drg'
)
SELECT last_value(CASE WHEN srvc_typ = 'proc' THEN id END, TRUE)
OVER (
PARTITION BY policy, prsn, prv
ORDER BY id
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS id,
policy,
prsn,
prv,
srvc_typ
FROM input
id | policy | prsn | prv | srvc_typ
1 | 1234G | xyz | prv1 | proc
1 | 1234G | xyz | prv1 | drg
1 | 1234G | xyz | prv1 | drg
4 | 1234G | xyz | prv1 | proc
4 | 1234G | xyz | prv1 | drg
Note: I ran this with Spark 3.0.1; hopefully version 2.4 supports the same SQL. Even if it does not, you get the idea, and I am sure you can also pull it off in the imperative style.
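The "imperative style" mentioned above could look roughly like the following plain-Python sketch. It is not Spark code: the function name fill_ids and the dict-based rows are assumptions made for illustration. Like the SQL, it leaves the id empty (None) for rows that come before the first 'proc' record of a group.

```python
from itertools import groupby
from operator import itemgetter

def fill_ids(rows):
    """Give every non-'proc' row the id of the last 'proc' row seen in its group."""
    key = itemgetter("policy", "prsn", "prv")
    result = []
    # Sort by group key, then by id, so each group is contiguous and ordered.
    for _, group in groupby(sorted(rows, key=lambda r: (key(r), r["id"])), key=key):
        last_proc_id = None  # stays None until the first 'proc' row of the group
        for row in group:
            if row["srvc_typ"] == "proc":
                last_proc_id = row["id"]
            result.append({**row, "id": last_proc_id})
    return result

source = [
    {"id": i, "policy": "1234G", "prsn": "xyz", "prv": "prv1", "srvc_typ": t}
    for i, t in enumerate(["proc", "drg", "drg", "proc", "drg"], start=1)
]
filled_ids = [r["id"] for r in fill_ids(source)]
print(filled_ids)  # [1, 1, 1, 4, 4]
```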

Related

SQL SELECT most recently created row WHERE something is true

I am trying to SELECT the most recently created row, WHERE the ID field in the row is a certain number, so I don't want the most recently created row in the WHOLE table, but the most recently created one WHERE the ID field is a specific number.
My Table:
Table:
| name | value | num |SecondName| Date |
| James | HEX124 | 1 | Carl | 11022020 |
| Jack | JEU836 | 4 | Smith | 19042020 |
| Mandy | GER234 | 33 | Jones | 09042020 |
| Mandy | HER575 | 7 | Jones | 10052020 |
| Jack | JEU836 | 4 | Smith | 14022020 |
| Ryan | GER631 | 33 | Jacque | 12042020 |
| Sarah | HER575 | 7 | Barlow | 01022019 |
| Jack | JEU836 | 4 | Smith | 14042020 |
| Ryan | HUH233 | 33 | Jacque | 15042020 |
| Sarah | HER575 | 7 | Barlow | 02022019 |
My SQL:
SELECT name, value, num, SecondName, Date
FROM MyTable
INNER JOIN (SELECT NAME, MAX(DATE) AS MaxTime FROM MyTable GROUP BY NAME) grouped ON grouped.NAME = NAME
WHERE NUM = 33
AND grouped.MaxTime = Date
What I'm doing here, is selecting the table, and creating an INNER JOIN where I'm taking the MAX Date value (the biggest/newest value), and grouping by the Name, so this will return the newest created row, for each person (Name), WHERE the NUM field is equal to 33.
Results:
| Ryan | HUH233 | 33 | Jacque | 15042020 |
As you can see, it is returning one row, as there are 3 rows with the NUM value of 33, two of which are with the Name 'Ryan', so it is grouping by the Name, and returning the latest entry for Ryan (This works fine).
But, Mandy is missing, as you can see in my first table, she has two entries, one under the NUM value of 33, and the other with the NUM value of 7. Because the entry with the NUM value of 7 was created most recently, my query where I say 'grouped.MaxTime = Date' is taking that row, and it is not being displayed, as the NUM value is not 33.
What I want to do, is read every row WHERE the NUM field is 33, THEN select the Maximum Time inside of the rows with the value of 33.
I believe it is prioritising the maximum Date value first, and only then filtering the selected rows by the NUM value of 33.
Desired Results:
| Ryan | HUH233 | 33 | Jacque | 15042020 |
| Mandy | GER234 | 33 | Jones | 09042020 |
Any help would be appreciated, thank you.

If I follow you correctly, you can filter with a subquery:
select t.*
from mytable t
where t.num = 33 and t.date = (
select max(t1.date) from mytable t1 where t1.name = t.name and t1.num = t.num
)

Look at your subquery. You want the maximum dates for num 33, but you are selecting the maximum dates independent from num.
I think you want:
select *
from mytable
where (name, date) in
(
select name, max(date)
from mytable
where num = 33
group by name
);
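To check both answers against the sample data, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in database (an assumption; the question does not name its DBMS). The dates are stored as the text values shown in the question. These ddmmyyyy values only compare correctly here because the competing rows fall in the same month and year; a real DATE column would be safer.

```python
import sqlite3

rows = [
    ("James", "HEX124", 1, "Carl", "11022020"),
    ("Jack", "JEU836", 4, "Smith", "19042020"),
    ("Mandy", "GER234", 33, "Jones", "09042020"),
    ("Mandy", "HER575", 7, "Jones", "10052020"),
    ("Jack", "JEU836", 4, "Smith", "14022020"),
    ("Ryan", "GER631", 33, "Jacque", "12042020"),
    ("Sarah", "HER575", 7, "Barlow", "01022019"),
    ("Jack", "JEU836", 4, "Smith", "14042020"),
    ("Ryan", "HUH233", 33, "Jacque", "15042020"),
    ("Sarah", "HER575", 7, "Barlow", "02022019"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (name TEXT, value TEXT, num INTEGER, SecondName TEXT, date TEXT)"
)
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?, ?)", rows)

# First answer: correlated subquery restricted to the same name and num.
correlated = conn.execute("""
    SELECT t.name, t.value, t.num, t.SecondName, t.date
    FROM mytable t
    WHERE t.num = 33
      AND t.date = (SELECT MAX(t1.date) FROM mytable t1
                    WHERE t1.name = t.name AND t1.num = t.num)
    ORDER BY t.name
""").fetchall()

# Second answer: filter on num = 33 before taking the maximum per name.
row_value_in = conn.execute("""
    SELECT name, value, num, SecondName, date
    FROM mytable
    WHERE (name, date) IN (SELECT name, MAX(date) FROM mytable
                           WHERE num = 33 GROUP BY name)
    ORDER BY name
""").fetchall()

print(correlated)
```

On this data both queries return the Mandy and Ryan rows from the desired results.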

pl sql update rows with value from the same table

In the agent table I have a field bossId, which is the Id of another agent.
I have generated data for all the other fields in the agent table, and now I need to set bossId for all the rows.
So each row needs to pick an agentId from the same table for its bossId field.
Several agents can have the same boss, but an agent can't be his own boss.
This is the table I have:
+---------+------+--------+
| agentId | name | bossId |
+---------+------+--------+
| 123 | aaa | |
| 124 | bbb | |
| 125 | ccc | |
| 126 | ddd | |
+---------+------+--------+
Wanted result:
+---------+------+--------+
| agentId | name | bossId |
+---------+------+--------+
| 123 | aaa | 124 |
| 124 | bbb | 123 |
| 125 | ccc | 126 |
| 126 | ddd | 123 |
+---------+------+--------+
So the empty bossId column needs to be filled with an agentId from the same table.
How can I do this in PL/SQL?
UPDATE:
I tried the following code, which looks OK to me, but I get errors:
begin
for i in 1..17 loop
update policeman p
set bossid = (select p2.officerid
from policeman p2
order by dbms_random.value
where rownum = 1)
where rownum =i;
end loop;
end ;
error:
ORA-06550: line 6, column 18:
PL/SQL: ORA-00907: missing right parenthesis
ORA-06550: line 3, column 5:
PL/SQL: SQL Statement ignored

Does this do what you want?
update t
set bossid = (select max(t2.agentid) keep (dense_rank first order by dbms_random.random)
from t t2
where t2.agentid <> t.agentid
);
Note: This may not be very efficient if your data is not small.

This should work provided agentid is unique.
update t tgt
set bossid = (select max(agentid)
from t src
where tgt.agentid<>src.agentid);
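Note that this last update is deterministic rather than random: every agent gets the highest agentid other than its own. A quick way to see the effect is the sketch below, which runs the same statement through Python's sqlite3 module as a stand-in for Oracle (an assumption made only so the example is self-contained).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (agentid INTEGER PRIMARY KEY, name TEXT, bossid INTEGER)")
conn.executemany(
    "INSERT INTO t (agentid, name) VALUES (?, ?)",
    [(123, "aaa"), (124, "bbb"), (125, "ccc"), (126, "ddd")],
)

# Each row picks the largest agentid that is not its own.
conn.execute("""
    UPDATE t
    SET bossid = (SELECT MAX(src.agentid)
                  FROM t AS src
                  WHERE src.agentid <> t.agentid)
""")

bosses = dict(conn.execute("SELECT agentid, bossid FROM t ORDER BY agentid"))
print(bosses)  # {123: 126, 124: 126, 125: 126, 126: 125}
```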

SQL SUM hours by week across number of years

I have a table in a SQL database that holds information about the hours worked by employees across a number of years. Each employee can have more than one record for a specific date and each employee's start date can be different.
I am trying to sum the weekly hours of each employee based on their first week.
So if the employee started on the 17/04/2018 any hours logged in this week would be considered week 1 for this employee and the following week would be week two etc.
For another employee week one could start in a different day/month/year etc.
My data includes the following fields:
Sequence_ID: relates to an individual employee
Date_European: relates to each date an employee has logged hours with the minimum of this being the first date the employee started in the company
Hours: The amount of hours logged
I also have a year field in the data which is the year of the Date_European column.
The below is what I have attempted but I know it isn't even close to the format I need.
select
Sequence_ID
,DATEPART(week,Date_European) AS Week
,DATEPART(year,Date_European) AS Year
,SUM([Hours]) AS Weekly_Hours
from [AB_DCU_IP_2018].[dbo].[mytable]
group by
Sequence_ID
,DATEPART(week,Date_European)
,DATEPART(year,Date_European)
order by
Sequence_ID
,DATEPART(week,Date_European)
,DATEPART(year,Date_European)
I tried to create the 'Week' field. From the above code it just gives me what week of a particular year a date relates to. I then added the 'Year' column to distinguish between different years, but again this only gives me what particular year that is.
Is there any way to create a 'Week' field in the format I am looking for? (Week of earliest date and surrounding dates would be week 1).
I was attempting to use the rank and partition by functions but couldn't get this to work properly.
Any help would be greatly appreciated as I have been searching for a solution for hours.
Thanks in advance.
EDIT:
How to create the initial table
CREATE TABLE mytable(Sequence_ID VARCHAR(6) NOT NULL ,Date_European DATE NOT NULL ,Hours NUMERIC(5,1) NOT NULL);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/05/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/06/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/07/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/08/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/09/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/12/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/13/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/14/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/15/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/16/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/19/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/20/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/21/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/22/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/23/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/26/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/27/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/28/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/29/2016',7.3);
INSERT INTO mytable(Sequence_ID,Date_European,Hours) VALUES ('da6Wrw','09/30/2016',7.3);
What I want as the desired outcome:
| Sequence_ID | Date_European | DATEPART(week,Date_European) | Hours | Desired_OutCome_Week |
| da6Wrw | 05/09/2016 | 37 | 7.3 | 1 |
| da6Wrw | 06/09/2016 | 37 | 7.3 | 1 |
| da6Wrw | 07/09/2016 | 37 | 7.3 | 1 |
| da6Wrw | 08/09/2016 | 37 | 7.3 | 1 |
| da6Wrw | 09/09/2016 | 37 | 7.3 | 1 |
| da6Wrw | 12/09/2016 | 38 | 7.3 | 2 |
| da6Wrw | 13/09/2016 | 38 | 7.3 | 2 |
| da6Wrw | 14/09/2016 | 38 | 7.3 | 2 |
| da6Wrw | 15/09/2016 | 38 | 7.3 | 2 |
| da6Wrw | 16/09/2016 | 38 | 7.3 | 2 |
| da6Wrw | 19/09/2016 | 39 | 7.3 | 3 |
| da6Wrw | 20/09/2016 | 39 | 7.3 | 3 |
| da6Wrw | 21/09/2016 | 39 | 7.3 | 3 |
| da6Wrw | 22/09/2016 | 39 | 7.3 | 3 |
| da6Wrw | 23/09/2016 | 39 | 7.3 | 3 |
| da6Wrw | 26/09/2016 | 40 | 7.3 | 4 |
| da6Wrw | 27/09/2016 | 40 | 7.3 | 4 |
| da6Wrw | 28/09/2016 | 40 | 7.3 | 4 |
| da6Wrw | 29/09/2016 | 40 | 7.3 | 4 |
| da6Wrw | 30/09/2016 | 40 | 7.3 | 4 |

Set DateFirst 1
select
Sequence_ID,
(datediff(day , DQ.WeekStarted, Date_European) / 7 + 1) EmployeeWeekNumber
,SUM([Hours]) AS Weekly_Hours
--into [AB_DCU_IP_2018].[dbo].[Weekly_Work_Hours_Employee]
from [AB_DCU_IP_2018].[dbo].[All_IPower_HR_Assurance_4]
CROSS APPLY (SELECT DATEADD(day, -1 * (datepart(weekday,start_date) % 7), start_date) AS WeekStarted
FROM YourTable
WHERE <condition to get the start_date you need>
) DQ
group by
Sequence_ID,
(datediff(day , DQ.WeekStarted, Date_European) / 7 + 1)
order by
Sequence_ID
,EmployeeWeekNumber
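The idea behind the query above is to count whole weeks between the start of the employee's first week and each date. A minimal Python sketch of that arithmetic, assuming weeks start on Monday (as SET DATEFIRST 1 suggests) and using the sample rows from the question; the names employee_week and weekly_hours are made up for illustration:

```python
from collections import defaultdict
from datetime import date, timedelta
from decimal import Decimal

def employee_week(first_day, day):
    """Week number of `day`, counting the week that contains `first_day` as week 1."""
    week_start = first_day - timedelta(days=first_day.weekday())  # Monday of the first week
    return (day - week_start).days // 7 + 1

# Sample data: 7.3 hours on each weekday from 5 to 30 September 2016.
days = [date(2016, 9, d) for d in range(5, 31) if date(2016, 9, d).weekday() < 5]
rows = [("da6Wrw", d, Decimal("7.3")) for d in days]

# Earliest logged date per employee.
first_day = {}
for emp, d, _ in rows:
    first_day[emp] = min(first_day.get(emp, d), d)

# Sum the hours per (employee, employee week number).
weekly_hours = defaultdict(Decimal)
for emp, d, hours in rows:
    weekly_hours[(emp, employee_week(first_day[emp], d))] += hours

print(dict(weekly_hours))
```

Because the week number comes from date arithmetic, a week without any logged hours still counts towards the numbering.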

Here is another approach using the sample data you posted.
select mt.Sequence_ID
, mt.Date_European
, DATEPART(week, mt.Date_European)
, mt.Hours
, MyRow.GroupNum
from mytable mt
join
(
select WeekNum = DATEPART(week,Date_European)
, GroupNum = ROW_NUMBER() over(order by DATEPART(week,Date_European))
from mytable
group by DATEPART(week,Date_European)
) MyRow on MyRow.WeekNum = DATEPART(week, mt.Date_European)

Try this:
select *,rn-1 [Employee_week] from (
select *,dense_RANK() over(Partition by Sequence_ID order by iif(weekly_hours=0,0,week) ) [rn] from (
select
Sequence_ID
,DATEPART(week,Date_European) AS Week
,DATEPART(year,Date_European) AS Year
,SUM([Hours]) AS Weekly_Hours
--into [AB_DCU_IP_2018].[dbo].[Weekly_Work_Hours_Employee]
from [AB_DCU_IP_2018].[dbo].[All_IPower_HR_Assurance_4]
group by
Sequence_ID
,DATEPART(week,Date_European)
,DATEPART(year,Date_European)
)a)a
where rn = 2
This gives you the hours each employee worked in their first week; use rn > 2 to get the remaining weeks.

I actually found an easier way to calculate the employee's week number, using the DENSE_RANK function.
I have included it below in case anyone has similar issues. I have commented out the DATEPART columns, as I was only using them as a check to ensure it was working correctly:
select
Sequence_ID
,Date_European
--,DATEPART(week,Date_European) AS Week
--,DATEPART(year,Date_European) AS Year
,DENSE_RANK() OVER (PARTITION BY Sequence_ID ORDER BY DATEPART(year,Date_European), DATEPART(week,Date_European) asc) AS EmployeeWeekNumber
,Hours
from [AB_DCU_IP_2018].[dbo].[All_IPower_HR_Assurance_4]
order by
Sequence_ID
,Date_European
--,DATEPART(week,Date_European)
--,DATEPART(year,Date_European)
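One thing to be aware of with DENSE_RANK: it numbers only the weeks in which the employee actually logged hours, so a week without any records does not advance the counter. A small Python sketch of that behaviour, using ISO week numbers as a stand-in for DATEPART(week, ...) (an assumption; T-SQL week numbering depends on the server settings) and a hypothetical helper dense_week_rank:

```python
from datetime import date

def dense_week_rank(days):
    """Map each date to the dense rank of its (ISO year, ISO week) among all given dates."""
    week_of = {d: tuple(d.isocalendar())[:2] for d in days}
    rank = {wk: i for i, wk in enumerate(sorted(set(week_of.values())), start=1)}
    return {d: rank[week_of[d]] for d in days}

# Hours logged in three different weeks, with the week of 19 September skipped.
logged = [date(2016, 9, 5), date(2016, 9, 12), date(2016, 9, 26)]
ranks = dense_week_rank(logged)
print([ranks[d] for d in logged])  # [1, 2, 3]
```

If calendar weeks without hours should still count, 26 September would be week 4 rather than 3, which calls for date arithmetic instead of a rank.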

Subtract the value of a row from grouped result

I have a table supplier_account which has five columns: supplier_account_id (pk), supplier_id (fk), voucher_no, debit and credit. I want to get the sum of debit grouped by supplier_id and then subtract the value of credit of the rows in which voucher_no is not null, so that for each subsequent row the sum of debit is reduced further. I have tried using a 'with' clause.
with debitdetails as(
select supplier_id,sum(debit) as amt
from supplier_account group by supplier_id
)
select acs.supplier_id,s.supplier_name,acs.purchase_voucher_no,acs.purchase_voucher_date,dd.amt-acs.credit as amount
from supplier_account acs
left join supplier s on acs.supplier_id=s.supplier_id
left join debitdetails dd on acs.supplier_id=dd.supplier_id
where voucher_no is not null
But here the debit value will be same for all rows. After subtraction in the first row I want to get the result in second row and subtract the next credit value from that.
I know it is possible by using temporary tables. The problem is I cannot use temporary tables because the procedure is used to generate reports using Jasper Reports.

What you need is an implementation of a running total. The easiest way to do it is with the help of a window function:
with debitdetails as(
select id,sum(debit) as amt
from suppliers group by id
)
select s.id, purchase_voucher_no, dd.amt, s.credit,
dd.amt - sum(s.credit) over (partition by s.id order by purchase_voucher_no asc)
from suppliers s
left join debitdetails dd on s.id=dd.id
order by s.id, purchase_voucher_no
Results:
| id | purchase_voucher_no | amt | credit | ?column? |
|----|---------------------|-----|--------|----------|
| 1 | 1 | 43 | 5 | 38 |
| 1 | 2 | 43 | 18 | 20 |
| 1 | 3 | 43 | 8 | 12 |
| 2 | 4 | 60 | 5 | 55 |
| 2 | 5 | 60 | 15 | 40 |
| 2 | 6 | 60 | 30 | 10 |
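The running-total logic can be checked outside the database as well. A short Python sketch using the amt and credit values from the result table above; the function name remaining_balances is hypothetical:

```python
from itertools import accumulate

def remaining_balances(total_debit, credits):
    """Subtract a running total of the credits from the supplier's total debit."""
    return [total_debit - running for running in accumulate(credits)]

print(remaining_balances(43, [5, 18, 8]))   # [38, 20, 12]
print(remaining_balances(60, [5, 15, 30]))  # [55, 40, 10]
```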

Selecting the most recent entry subject to a condition

I have a table with a 'date' field (written in integer form) and a 'grocery_item' field. The date specifies when a certain grocery item was ordered.
I am trying to write a query that selects the most recent entry for every grocery item that occurred before a given date:
Example:
id | date | grocery_item
1 | 201101 | a
2 | 201101 | b
3 | 201102 | a
4 | 201103 | b
5 | 201104 | c
Get the most recent entries that occurred before 201103:
id | date | grocery_item
2 | 201101 | b
3 | 201102 | a
Any help will be more than appreciated!! -- I am blanking out on this ...

SELECT id, MAX(date) AS date, grocery_item
FROM table
WHERE date < 201103
GROUP BY grocery_item
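A caveat with the query above: id is neither aggregated nor listed in GROUP BY, so many databases reject it, and those that accept it may return the id of an arbitrary row in each group. One way around this is to join the per-item maximum back to the table. A sketch using Python's sqlite3 module, with a hypothetical table name orders (the question does not give one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, date INTEGER, grocery_item TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 201101, "a"), (2, 201101, "b"), (3, 201102, "a"),
     (4, 201103, "b"), (5, 201104, "c")],
)

# Find the latest date per item before the cutoff, then fetch the matching rows.
query = """
SELECT o.id, o.date, o.grocery_item
FROM orders AS o
JOIN (SELECT grocery_item, MAX(date) AS max_date
      FROM orders
      WHERE date < ?
      GROUP BY grocery_item) AS m
  ON m.grocery_item = o.grocery_item AND m.max_date = o.date
ORDER BY o.id
"""
most_recent = conn.execute(query, (201103,)).fetchall()
print(most_recent)  # [(2, 201101, 'b'), (3, 201102, 'a')]
```

If two rows for the same item share the latest date, this version returns both of them.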
