Find last job change date with JOB_TITLE and EVENT_DATE - sql

Hi, I am working in Azure Databricks and I am looking for a SQL query solution.
Assume that my table has four columns:
ID     EVENT_DATE  JOB_TITLE  PAY
12345  2021-01-01  VP1        100,000
12345  2020-01-10  VP1        90,000
12345  2019-01-20  Analyst1   80,000
12346  2021-02-01  VP2        200,000
12346  2020-02-10  Analyst2   150,000
12346  2020-01-20  Analyst2   110,000
Basically I want the EVENT_DATE when JOB_TITLE changed the last time. This is my desired output:
ID     JOB_TITLE  PAY      LAST_JOB_CHANGE_DATE
12345  VP1        90,000   2020-01-10
12346  VP2        200,000  2021-02-01
For the last column, LAST_JOB_CHANGE_DATE, we are pulling from the 2nd and 4th rows of the table because those are the dates when each employee last changed jobs.
Thank you!

You can just use INNER JOIN to accomplish that, ie
%sql
SELECT a.*
FROM yourTable a
INNER JOIN
(
SELECT id, MAX(event_date) event_date
FROM yourTable b
GROUP BY id
) b ON a.id = b.id
AND a.event_date = b.event_date
The ROW_NUMBER approach would also work well:
%sql
WITH cte AS
(
SELECT
ROW_NUMBER() OVER( PARTITION BY id ORDER BY event_date DESC ) AS rn,
*
FROM yourTable a
)
SELECT *
FROM cte
WHERE rn = 1

There's probably a simpler solution for this but the following should work.
I'm assuming you wanted the most recent job change for each employee. To illustrate this, I added an extra row for an Engineer1. The ROW_NUMBER() window function helps us with this.
ID     EVENT_DATE  JOB_TITLE  PAY
12345  2021-01-01  VP1        100,000
12345  2020-01-10  VP1        90,000
12345  2019-01-20  Analyst1   80,000
12345  2018-01-04  Engineer1  75,000
12346  2021-02-01  VP2        200,000
12346  2020-02-10  Analyst2   150,000
12346  2020-01-20  Analyst2   110,000
Here is the query:
SELECT -- (4)
    c.ID,
    c.JOB_TITLE,
    c.PAY,
    c.last_job_change_date
FROM
(
    SELECT -- (3)
        b.ID,
        ROW_NUMBER() OVER (PARTITION BY b.ID ORDER BY b.last_job_change_date DESC) AS row_id,
        b.JOB_TITLE,
        b.PAY,
        b.last_job_change_date
    FROM
    (
        SELECT -- (2)
            a.ID,
            a.JOB_TITLE,
            a.PAY,
            a.EVENT_DATE AS last_job_change_date
        FROM
        (
            SELECT -- (1)
                ID,
                EVENT_DATE,
                PAY,
                JOB_TITLE,
                LEAD(JOB_TITLE, 1) OVER (
                    PARTITION BY ID ORDER BY EVENT_DATE DESC) job_change
            FROM yourtable
        ) a
        WHERE JOB_TITLE <> job_change
    ) b
) c
WHERE row_id = 1
I used a 4-step process and annotated the query with each step:
(1) Returns a table with an extra column holding the job title from the next row when ordered newest first, that is, the employee's previous job title.
(2) Returns the table from (1) but removes rows where the employee did not change their job.
(3) Adds row numbers so we can get the most recent job change of each employee.
(4) Returns the most recent job change for each employee.
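To try the steps above without a Databricks cluster, here is a minimal sketch that runs the same logic against an in-memory SQLite database from Python. It assumes SQLite 3.25 or newer (for window functions), stores PAY as an integer without thousands separators, and rewrites the nested subqueries as CTEs named with_prev and changes; those names are illustrative and not from the original answer. Steps (2) and (3) are merged into one CTE, which works because WHERE is applied before ROW_NUMBER() is evaluated.

```python
import sqlite3

# In-memory database holding the answer's sample data (including the Engineer1 row).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourtable (ID INTEGER, EVENT_DATE TEXT, JOB_TITLE TEXT, PAY INTEGER)"
)
conn.executemany(
    "INSERT INTO yourtable VALUES (?, ?, ?, ?)",
    [
        (12345, "2021-01-01", "VP1", 100000),
        (12345, "2020-01-10", "VP1", 90000),
        (12345, "2019-01-20", "Analyst1", 80000),
        (12345, "2018-01-04", "Engineer1", 75000),
        (12346, "2021-02-01", "VP2", 200000),
        (12346, "2020-02-10", "Analyst2", 150000),
        (12346, "2020-01-20", "Analyst2", 110000),
    ],
)

QUERY = """
WITH with_prev AS (      -- step (1): attach the previous job title to each row
    SELECT ID, EVENT_DATE, JOB_TITLE, PAY,
           LEAD(JOB_TITLE) OVER (PARTITION BY ID ORDER BY EVENT_DATE DESC) AS prev_title
    FROM yourtable
),
changes AS (             -- steps (2) and (3): keep job changes, number them newest first
    SELECT ID, JOB_TITLE, PAY, EVENT_DATE AS last_job_change_date,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY EVENT_DATE DESC) AS row_id
    FROM with_prev
    WHERE JOB_TITLE <> prev_title
)
SELECT ID, JOB_TITLE, PAY, last_job_change_date   -- step (4): newest change per ID
FROM changes
WHERE row_id = 1
ORDER BY ID
"""

last_changes = conn.execute(QUERY).fetchall()
for row in last_changes:
    print(row)
```

Note that the earliest row of each employee has no previous title, so the comparison with NULL drops it; an employee who never changed jobs would therefore not appear in the result at all.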

Related

Select earliest date and count rows in table with duplicate IDs

I have a table called table1:
id created_date
1001 2020-06-01
1001 2020-01-01
1001 2020-07-01
1002 2020-02-01
1002 2020-04-01
1003 2020-09-01
I'm trying to write a query that provides me a list of distinct IDs with the earliest created_date they have, along with the count of rows each id has:
id created_date count
1001 2020-01-01 3
1002 2020-02-01 2
1003 2020-09-01 1
I managed to write a window function to grab the earliest date, but I'm having trouble figuring out where to fit the count statement in one:
SELECT
    id,
    created_date
FROM ( SELECT
        id,
        created_date,
        row_number() OVER(PARTITION BY id ORDER BY created_date) as row_num
    FROM table1
) AS a
WHERE row_num = 1
You would use aggregation:
select id, min(created_date), count(*)
from table1
group by id;
I find it amusing that you want to use window functions -- which are considered more advanced -- when lowly aggregation suffices.
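For comparison, here is a small sketch that runs both forms on the question's sample rows using Python's built-in sqlite3 module: the plain aggregation, and the question's row_number() query with COUNT(*) OVER (PARTITION BY id) added next to it. The alias row_count and the SQLite setup are assumptions made for the demo (window functions need SQLite 3.25 or newer); the SQL itself is standard enough that it should carry over to other engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, created_date TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        (1001, "2020-06-01"),
        (1001, "2020-01-01"),
        (1001, "2020-07-01"),
        (1002, "2020-02-01"),
        (1002, "2020-04-01"),
        (1003, "2020-09-01"),
    ],
)

# Aggregation: one row per id with the earliest date and the number of rows.
aggregated = conn.execute(
    """
    SELECT id, MIN(created_date) AS created_date, COUNT(*) AS row_count
    FROM table1
    GROUP BY id
    ORDER BY id
    """
).fetchall()

# Window-function variant: the count is computed per partition and carried
# on every row, then only the earliest row of each id is kept.
windowed = conn.execute(
    """
    SELECT id, created_date, row_count
    FROM (SELECT id,
                 created_date,
                 ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_date) AS row_num,
                 COUNT(*) OVER (PARTITION BY id) AS row_count
          FROM table1) AS a
    WHERE row_num = 1
    ORDER BY id
    """
).fetchall()

print(aggregated)
print(windowed)
```

The window form only earns its extra complexity when other columns from the earliest row are needed as well.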

How to get ' COUNT DISTINCT' over moving window

I'm working on a query to compute the distinct users of particular features of an app within a moving window. So, if there's a range from 15-20th October, I want a query to go from 8-15 Oct, 9-16 Oct etc and get the count of distinct users per feature. So for each date, it should have x rows where x is the number of features.
I have the following query so far:
WITH V1(edate, code, total) AS
(
SELECT date, featurecode,
DENSE_RANK() OVER ( PARTITION BY featurecode ORDER BY accountid ASC) + DENSE_RANK() OVER ( PARTITION BY featurecode ORDER BY accountid DESC) - 1
FROM....
GROUP BY edate, featurecode, appcode, accountid
HAVING appcode='sample' AND eventdate BETWEEN '15-10-2018' And '20-10-2018'
)
Select distinct date, code, total
from V1
WHERE date between '2018-10-15' AND '2018-10-20'
This returns the same set of values for all the dates. Is there any way to do this efficiently?? It's a DB2 database by the way but I'm looking for insight from postgresql users too.
Present result- All the totals are being repeated.
date code total
10/15/2018 appname-feature1 123
10/15/2018 appname-feature2 234
10/15/2018 appname-feature3 321
10/16/2018 appname-feature1 123
10/16/2018 appname-feature2 234
10/16/2018 appname-feature3 321
Desired result.
date code total
10/15/2018 appname-feature1 123
10/15/2018 appname-feature2 234
10/15/2018 appname-feature3 321
10/16/2018 appname-feature1 212
10/16/2018 appname-feature2 577
10/16/2018 appname-feature3 2345
This is not easy to do efficiently. DISTINCT counts aren't incrementally maintainable (unless you go down the route of inexact DISTINCT counts such as HyperLogLog).
It is easy to code in SQL, though, and you can try the usual indexing etc. to help.
It is (possibly) not possible, however, to code with OLAP functions, not least because you can only use RANGE BETWEEN for SUM(), COUNT(), MAX() etc., but not RANK() or DENSE_RANK(). So just use a traditional correlated sub-select.
First some data
CREATE TABLE T(D DATE,F CHAR(1),A CHAR(1));
INSERT INTO T (VALUES
('2018-10-10','X','A')
, ('2018-10-11','X','A')
, ('2018-10-15','X','A')
, ('2018-10-15','X','A')
, ('2018-10-15','X','B')
, ('2018-10-15','Y','A')
, ('2018-10-16','X','C')
, ('2018-10-18','X','A')
, ('2018-10-21','X','B')
)
;
Now a simple select
WITH B AS (
SELECT DISTINCT D, F FROM T
)
SELECT D,F
, (SELECT COUNT(DISTINCT A)
FROM T
WHERE T.F = B.F
AND T.D BETWEEN B.D - 3 DAYS AND B.D + 4 DAYS
) AS DISTINCT_A_MOVING_WEEK
FROM
B
ORDER BY F,D
;
giving, e.g.
D F DISTINCT_A_MOVING_WEEK
---------- - ----------------------
2018-10-10 X 1
2018-10-11 X 2
2018-10-15 X 3
2018-10-16 X 3
2018-10-18 X 3
2018-10-21 X 2
2018-10-15 Y 1
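For readers without DB2 at hand, the same correlated sub-select can be sketched against SQLite from Python. This is an adaptation, not the answer's exact SQL: DB2's `B.D - 3 DAYS` date arithmetic is replaced with SQLite's date(B.D, '-3 days'), and dates are stored as ISO text so that BETWEEN compares them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (D TEXT, F TEXT, A TEXT)")
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [
        ("2018-10-10", "X", "A"),
        ("2018-10-11", "X", "A"),
        ("2018-10-15", "X", "A"),
        ("2018-10-15", "X", "A"),
        ("2018-10-15", "X", "B"),
        ("2018-10-15", "Y", "A"),
        ("2018-10-16", "X", "C"),
        ("2018-10-18", "X", "A"),
        ("2018-10-21", "X", "B"),
    ],
)

QUERY = """
WITH B AS (
    SELECT DISTINCT D, F FROM T           -- one row per (date, feature)
)
SELECT D, F,
       (SELECT COUNT(DISTINCT A)          -- distinct accounts inside the window
        FROM T
        WHERE T.F = B.F
          AND T.D BETWEEN date(B.D, '-3 days') AND date(B.D, '+4 days')
       ) AS distinct_a_moving_week
FROM B
ORDER BY F, D
"""

moving_counts = conn.execute(QUERY).fetchall()
for row in moving_counts:
    print(row)
```

On the sample data this produces the same counts as the DB2 output listed above. The sub-select is re-evaluated for every (date, feature) pair, so an index on (F, D) is the first thing to try on a larger table.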

SQL select specific group from table

I have a table named trades like this:
id trade_date trade_price trade_status seller_name
1 2015-01-02 150 open Alex
2 2015-03-04 500 close John
3 2015-04-02 850 close Otabek
4 2015-05-02 150 close Alex
5 2015-06-02 100 open Otabek
6 2015-07-02 200 open John
I want to sum up trade_price grouped by seller_name when last (by trade_date) trade_status was 'open'. That is:
sum_trade_price seller_name
700 John
950 Otabek
The rows where seller_name is Alex are skipped because the last trade_status was 'close'.
I can get the desired output with the help of a nested select:
SELECT SUM(t1.trade_price), t1.seller_name
FROM trades t1
WHERE t1.seller_name NOT IN
(SELECT t2.seller_name FROM trades t2
WHERE t2.seller_name = t1.seller_name AND t2.trade_status = 'close'
ORDER BY t2.trade_date DESC LIMIT 1)
GROUP BY t1.seller_name
But it takes more than 1 minute to execute the above query (I have approximately 100K rows).
Is there another way to handle it?
I am using PostgreSQL.
I would approach this with window functions:
SELECT SUM(t.trade_price), t.seller_name
FROM (SELECT t.*,
FIRST_VALUE(trade_status) OVER (PARTITION BY seller_name ORDER BY trade_date desc) as last_trade_status
FROM trades t
) t
WHERE last_trade_status <> 'close'
GROUP BY t.seller_name;
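As a quick check of the window-function approach, the sketch below loads the question's six rows into an in-memory SQLite database (an assumption for the demo; the question targets PostgreSQL, and SQLite 3.25 or newer is needed for FIRST_VALUE) and runs the query with the string literal properly closed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE trades (
           id INTEGER, trade_date TEXT, trade_price INTEGER,
           trade_status TEXT, seller_name TEXT)"""
)
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?, ?, ?)",
    [
        (1, "2015-01-02", 150, "open", "Alex"),
        (2, "2015-03-04", 500, "close", "John"),
        (3, "2015-04-02", 850, "close", "Otabek"),
        (4, "2015-05-02", 150, "close", "Alex"),
        (5, "2015-06-02", 100, "open", "Otabek"),
        (6, "2015-07-02", 200, "open", "John"),
    ],
)

QUERY = """
SELECT SUM(t.trade_price) AS sum_trade_price, t.seller_name
FROM (SELECT trades.*,
             -- status of each seller's most recent trade, repeated on every row
             FIRST_VALUE(trade_status) OVER (
                 PARTITION BY seller_name ORDER BY trade_date DESC
             ) AS last_trade_status
      FROM trades) AS t
WHERE t.last_trade_status <> 'close'
GROUP BY t.seller_name
ORDER BY t.seller_name
"""

open_sellers = conn.execute(QUERY).fetchall()
print(open_sellers)
```

Alex is excluded because the most recent trade (2015-05-02) is 'close'; John and Otabek keep all of their rows, including the earlier 'close' ones, in the sum.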
This should perform reasonably with an index on seller_name
select
sum(trade_price) as sum_trade_price,
seller_name
from
trades
inner join
(
select distinct on (seller_name) seller_name, trade_status
from trades
order by seller_name, trade_date desc
) s using (seller_name)
where s.trade_status = 'open'
group by seller_name

Firebird Query- Return first row each group

In a firebird database with a table "Sales", I need to select the first sale of all customers. See below a sample that show the table and desired result of query.
---------------------------------------
SALES
---------------------------------------
ID CUSTOMERID DTHRSALE
1 25 01/04/16 09:32
2 30 02/04/16 11:22
3 25 05/04/16 08:10
4 31 07/03/16 10:22
5 22 01/02/16 12:30
6 22 10/01/16 08:45
Result: only first sale, based on sale date.
ID CUSTOMERID DTHRSALE
1 25 01/04/16 09:32
2 30 02/04/16 11:22
4 31 07/03/16 10:22
6 22 10/01/16 08:45
I've already tried the code from "Select first row in each GROUP BY group?", but it did not work.
In Firebird 2.5 you can do this with the following query; this is a minor modification of the second part of the accepted answer of the question you linked to tailored to your schema and requirements:
select x.id,
x.customerid,
x.dthrsale
from sales x
join (select customerid,
min(dthrsale) as first_sale
from sales
group by customerid) p on p.customerid = x.customerid
and p.first_sale = x.dthrsale
order by x.id
The order by is not necessary, I just added it to make it give the order as shown in your question.
With Firebird 3 you can use the window function ROW_NUMBER which is also described in the linked answer. The linked answer incorrectly said the first solution would work on Firebird 2.1 and higher. I have now edited it.
Search for the sales with no earlier sales:
SELECT S1.*
FROM SALES S1
LEFT JOIN SALES S2 ON S2.CUSTOMERID = S1.CUSTOMERID AND S2.DTHRSALE < S1.DTHRSALE
WHERE S2.ID IS NULL
Define an index over (customerid, dthrsale) to make it fast.
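Here is a minimal sketch of this anti-join, run against SQLite from Python rather than Firebird. It assumes the question's dates are in dd/mm/yy format and stores them as ISO text so that `<` orders them chronologically; the index name idx_sales_customer_date is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALES (ID INTEGER, CUSTOMERID INTEGER, DTHRSALE TEXT)")
conn.executemany(
    "INSERT INTO SALES VALUES (?, ?, ?)",
    [
        (1, 25, "2016-04-01 09:32"),
        (2, 30, "2016-04-02 11:22"),
        (3, 25, "2016-04-05 08:10"),
        (4, 31, "2016-03-07 10:22"),
        (5, 22, "2016-02-01 12:30"),
        (6, 22, "2016-01-10 08:45"),
    ],
)
# Composite index so the self-join can look up earlier sales per customer.
conn.execute("CREATE INDEX idx_sales_customer_date ON SALES (CUSTOMERID, DTHRSALE)")

QUERY = """
SELECT S1.ID, S1.CUSTOMERID, S1.DTHRSALE
FROM SALES S1
LEFT JOIN SALES S2
       ON S2.CUSTOMERID = S1.CUSTOMERID
      AND S2.DTHRSALE < S1.DTHRSALE   -- any earlier sale by the same customer
WHERE S2.ID IS NULL                   -- keep rows that have no earlier sale
ORDER BY S1.ID
"""

first_sales = conn.execute(QUERY).fetchall()
for row in first_sales:
    print(row)
```

If two sales of one customer share the exact same timestamp, both are returned, since neither is strictly earlier than the other.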
In Firebird 3, get the first row for each customer by minimum sales_date:
SELECT id, customer_id, total, sales_date
FROM (
SELECT id, customer_id, total, sales_date
, row_number() OVER(PARTITION BY customer_id ORDER BY sales_date ASC ) AS rn
FROM SALES
) sub
WHERE rn = 1;
If you want other related columns, this is where your self-answer fails:
select customer_id , min(sales_date)
, id, total --what about other columns
from SALES
group by customer_id
As simple as:
select CUSTOMERID, min(DTHRSALE) from SALES group by CUSTOMERID

fill in a null cell with cell from previous record

Hi I am using DB2 sql to fill in some missing data in the following table:
Person House From To
------ ----- ---- --
1 586 2000-04-16 2010-12-03
2 123 2001-01-01 2012-09-27
2 NULL NULL NULL
2 104 2004-01-01 2012-11-24
3 987 1999-12-31 2009-08-01
3 NULL NULL NULL
Person 2 has lived in 3 houses, but for the middle address neither the house nor the dates are known. I can't do anything about which house they were in, but I would like to replace the NULL From date with the previous record's To date, and replace the NULL To date with the next record's From date, i.e.:
Person House From To
------ ----- ---- --
1 586 2000-04-16 2010-12-03
2 123 2001-01-01 2012-09-27
2 NULL 2012-09-27 2004-01-01
2 104 2004-01-01 2012-11-24
3 987 1999-12-31 2009-08-01
3 NULL 2009-08-01 9999-01-01
I understand that if there is no previous address before a null address, that will have to stay null, but if a null address is the last known address I would like to change the To date to 9999-01-01, as for person 3.
This seems to me like the type of problem where set theory is no longer a good solution; however, I am required to find a DB2 solution because that's what my boss uses!
Any pointers/suggestions welcome.
Thanks.
It might look something like this:
select
person,
house,
coalesce(from_date, prev_to_date) from_date,
case when rn = 1 then coalesce (to_date, '9999-01-01')
else coalesce(to_date, next_from_date) end to_date
from
(select person, house, from_date, to_date,
lag(to_date) over (partition by person order by from_date nulls last) prev_to_date,
lead(from_date) over (partition by person order by from_date nulls last) next_from_date,
row_number() over (partition by person order by from_date desc nulls last) rn
from temp
) t
The above is not tested but it might give you an idea.
I hope in your actual table you have a column other than to_date and from_date that allows you to order rows for each person, otherwise you'll have trouble sorting NULL dates, as you have no way of knowing the actual sequence.
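To make the ordering caveat concrete, here is a hedged sketch that assumes such a column exists. The table name addresses and the column seq (the position of each row in the question's listing) are hypothetical and not part of the original schema. It runs on SQLite 3.25 or newer via Python and uses a simplified variant of the query above: the NULL To date falls back to the next row's From date, and to '9999-01-01' when there is no next row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# seq is an assumed ordering column; without one the row sequence is undefined.
conn.execute(
    """CREATE TABLE addresses (
           seq INTEGER, person INTEGER, house INTEGER,
           from_date TEXT, to_date TEXT)"""
)
conn.executemany(
    "INSERT INTO addresses VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 586, "2000-04-16", "2010-12-03"),
        (2, 2, 123, "2001-01-01", "2012-09-27"),
        (3, 2, None, None, None),
        (4, 2, 104, "2004-01-01", "2012-11-24"),
        (5, 3, 987, "1999-12-31", "2009-08-01"),
        (6, 3, None, None, None),
    ],
)

QUERY = """
SELECT person,
       house,
       -- missing From date: take the previous row's To date (stays NULL if none)
       COALESCE(from_date,
                LAG(to_date) OVER (PARTITION BY person ORDER BY seq)) AS filled_from,
       -- missing To date: take the next row's From date, else the open-ended date
       COALESCE(to_date,
                LEAD(from_date) OVER (PARTITION BY person ORDER BY seq),
                '9999-01-01') AS filled_to
FROM addresses
ORDER BY seq
"""

filled = conn.execute(QUERY).fetchall()
for row in filled:
    print(row)
```

With the rows ordered by seq, the NULL row of person 2 sits between houses 123 and 104, which is what lets LAG and LEAD pick up the neighbouring dates shown in the question's desired result.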
create table Temp
(
person varchar(2),
house int,
from_date date,
to_date date
);
insert into temp values
(1,586,'2000-04-16','2010-12-03'),
(2,123,'2001-01-01','2012-09-27'),
(2,NULL,NULL,NULL),
(2,104,'2004-01-01','2012-11-24'),
(3,987,'1999-12-31','2009-08-01'),
(3,NULL,NULL,NULL);
select A.person,
A.house,
isnull(A.from_date,BF.to_date) From_date,
isnull(A.to_date,isnull(CT.From_date,'9999-01-01')) To_date
from
((select *,ROW_NUMBER() over (order by (select 0)) rownum from Temp) A left join
(select *,ROW_NUMBER() over (order by (select 0)) rownum from Temp) BF
on A.person = BF.person and
A.rownum = BF.rownum + 1)left join
(select *,ROW_NUMBER() over (order by (select 0)) rownum from Temp) CT
on A.person = CT.person and
A.rownum = CT.rownum - 1