SQL query to fill sparse data in a timeline

I have a table holding various changes to information about employees. Some information changes over time, but not all together, and changes occur periodically but not regularly. Changes are recorded by date, and if an item did not change for a given employee at a given time, then that item's value is Null in that record. Say it looks like this:
employeeId Date Salary CommuteDistance
1 2000-01-01 1000 Null
2 2000-01-15 2000 20
3 2000-01-30 3000 Null
2 2010-02-15 2100 Null
3 2010-03-30 Null 30
1 2020-02-01 1100 10
1 2030-03-01 Null 100
Now, how can I write a query to fill the null values with the most recent non-null values for all employees at all dates, while keeping the value Null if there is no such previous non-null value? It should look like:
employeeId Date Salary CommuteDistance
1 2000-01-01 1000 Null
2 2000-01-15 2000 20
3 2000-01-30 3000 Null
2 2010-02-15 2100 20
3 2010-03-30 3000 30
1 2020-02-01 1100 10
1 2030-03-01 1100 100
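(Note how the filled-in values, 20 for employee 2 on 2010-02-15, 3000 for employee 3 on 2010-03-30, and 1100 for employee 1 on 2030-03-01, are taken over from previous records of the same employee.)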
I'd like to use the query inside a view, then in turn query that view to get the picture at an arbitrary date (e.g., what were the salary and commute distance for the employees on 2021-08-17? I should be able to do that part, but I'm unable to build the view). Or is there a better way to accomplish this?
There's no point in showing my attempts, since I'm quite inexperienced with advanced SQL (I assume the solution employs advanced knowledge, since I found my basic knowledge insufficient for this) and I got nowhere near the desired result.

You may get the last non-null value of Salary or CommuteDistance for each employee using the following:
SELECT T.employeeId, T.Date,
       COALESCE(Salary, MAX(Salary) OVER (PARTITION BY employeeId, g1)) AS Salary,
       COALESCE(CommuteDistance, MAX(CommuteDistance) OVER (PARTITION BY employeeId, g2)) AS CommuteDistance
FROM
(
  -- g1/g2: date of the latest non-null Salary/CommuteDistance seen so far for the employee
  SELECT *,
         MAX(CASE WHEN Salary IS NOT null THEN Date END) OVER (PARTITION BY employeeId ORDER BY Date) AS g1,
         MAX(CASE WHEN CommuteDistance IS NOT null THEN Date END) OVER (PARTITION BY employeeId ORDER BY Date) AS g2
  FROM TableName
) T
ORDER BY Date
See a demo.
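To see what the grouping columns g1 and g2 look like, here is a minimal sketch that runs only the inner query against the question's sample rows. It assumes SQLite (version 3.25 or later, for window functions) driven from Python's sqlite3 module as a stand-in for the real database, and it stores dates as ISO text; neither choice comes from the original posts.

```python
import sqlite3

# Sample rows from the question; ISO date strings sort chronologically as text.
ROWS = [
    (1, "2000-01-01", 1000, None),
    (2, "2000-01-15", 2000, 20),
    (3, "2000-01-30", 3000, None),
    (2, "2010-02-15", 2100, None),
    (3, "2010-03-30", None, 30),
    (1, "2020-02-01", 1100, 10),
    (1, "2030-03-01", None, 100),
]

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the real DBMS
conn.execute(
    "CREATE TABLE TableName (employeeId INT, Date TEXT, Salary INT, CommuteDistance INT)"
)
conn.executemany("INSERT INTO TableName VALUES (?, ?, ?, ?)", ROWS)

# Inner query only, restricted to employee 1 to keep the output short.
# g1/g2 hold the date of the latest non-null Salary/CommuteDistance so far.
group_keys = conn.execute(
    """
    SELECT employeeId, Date, Salary, CommuteDistance,
           MAX(CASE WHEN Salary IS NOT NULL THEN Date END)
               OVER (PARTITION BY employeeId ORDER BY Date) AS g1,
           MAX(CASE WHEN CommuteDistance IS NOT NULL THEN Date END)
               OVER (PARTITION BY employeeId ORDER BY Date) AS g2
    FROM TableName
    WHERE employeeId = 1
    ORDER BY Date
    """
).fetchall()

for row in group_keys:
    print(row)
```

The 2030-03-01 row has a null Salary but shares g1 = 2020-02-01 with the row that carries 1100, which is why MAX(Salary) over (employeeId, g1) can fill it. The first row has g2 = NULL because no CommuteDistance has been seen yet, so it stays null.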

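For each employee, a running count of non-null values (ordered by Date) puts every non-null Salary or CommuteDistance in the same group as the nulls that follow it. Taking the max within each group then fills in the blanks.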
select employeeId
      ,Date
      ,max(Salary) over(partition by employeeId, s_grp) as Salary
      ,max(CommuteDistance) over(partition by employeeId, d_grp) as CommuteDistance
from (
     select *
           ,count(case when Salary is not null then 1 end) over(partition by employeeId order by Date) as s_grp
           ,count(case when CommuteDistance is not null then 1 end) over(partition by employeeId order by Date) as d_grp
     from t
     ) t
order by Date
employeeId Date Salary CommuteDistance
1 2000-01-01 1000 null
2 2000-01-15 2000 20
3 2000-01-30 3000 null
2 2010-02-15 2100 20
3 2010-03-30 3000 30
1 2020-02-01 1100 10
1 2030-03-01 1100 100
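The question also asks about wrapping the fill query in a view and then asking for the picture at an arbitrary date. A minimal sketch of that follows, assuming SQLite 3.25+ via Python's sqlite3 as a stand-in database. The view name `filled` and the helper `snapshot` are hypothetical names chosen for this illustration, and picking the latest row per employee with ROW_NUMBER() is one possible approach rather than something taken from the answers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the real DBMS
conn.execute("CREATE TABLE t (employeeId INT, Date TEXT, Salary INT, CommuteDistance INT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, "2000-01-01", 1000, None),
        (2, "2000-01-15", 2000, 20),
        (3, "2000-01-30", 3000, None),
        (2, "2010-02-15", 2100, None),
        (3, "2010-03-30", None, 30),
        (1, "2020-02-01", 1100, 10),
        (1, "2030-03-01", None, 100),
    ],
)

# Hypothetical view name: the count-based forward fill wrapped in a view.
conn.execute(
    """
    CREATE VIEW filled AS
    SELECT employeeId, Date,
           MAX(Salary) OVER (PARTITION BY employeeId, s_grp) AS Salary,
           MAX(CommuteDistance) OVER (PARTITION BY employeeId, d_grp) AS CommuteDistance
    FROM (
        SELECT *,
               COUNT(CASE WHEN Salary IS NOT NULL THEN 1 END)
                   OVER (PARTITION BY employeeId ORDER BY Date) AS s_grp,
               COUNT(CASE WHEN CommuteDistance IS NOT NULL THEN 1 END)
                   OVER (PARTITION BY employeeId ORDER BY Date) AS d_grp
        FROM t
    )
    """
)


def snapshot(as_of):
    """Return each employee's latest filled row dated on or before as_of."""
    return conn.execute(
        """
        SELECT employeeId, Date, Salary, CommuteDistance
        FROM (
            SELECT f.*,
                   ROW_NUMBER() OVER (PARTITION BY employeeId ORDER BY Date DESC) AS rn
            FROM filled AS f
            WHERE Date <= ?
        )
        WHERE rn = 1
        ORDER BY employeeId
        """,
        (as_of,),
    ).fetchall()


for row in snapshot("2021-08-17"):
    print(row)
```

Because the view fills values from all earlier rows first, the date filter can be applied afterwards without losing anything: employee 2's row from 2010-02-15 already carries the commute distance 20 from 2000-01-15.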

Related

Populate empty values from another table

Let us say that I have two SQL tables
Employee Recognition Table
Employee Id  Reward Date  Coupon
1            1/1/2020     null
1            1/2/2020     null
1            1/3/2020     null
2            2/1/2020     null
2            2/2/2020     null
3            2/2/2020     null
Coupons
Employee Id  Coupon
1            COUPON1
1            COUPON2
1            COUPON3
2            COUPON4
What I want to do is allot each employee's coupons to that employee's rewards, using each coupon only once. For example:
employee 1 has three coupons, so all three should be allotted
employee 2 has just one coupon, so only one reward should get a coupon
employee 3 has none
So the output should be something like
Employee Recognition Table Updated
Employee Id  Reward Date  Coupon
1            1/1/2020     COUPON1
1            1/2/2020     COUPON2
1            1/3/2020     COUPON3
2            2/1/2020     COUPON4
2            2/2/2020     null
3            2/2/2020     null
Also, both tables contain a lot of records (more than 100k each), so I am wondering what a performant query would look like. I have thought about using lateral joins, but speed seems to be the issue there.
Use below
select * except(pos)
from (
select Employee_Id, Reward_Date,
row_number() over(partition by Employee_Id order by Reward_Date) pos
from recognitions
)
left join (
select Employee_Id, Coupon,
row_number() over(partition by Employee_Id order by Coupon) pos
from coupons
)
using (Employee_Id, pos)
-- order by Employee_Id, Reward_Date
If applied to the sample data in your question, the output matches the expected table shown there.
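For anyone who wants to try the row-number pairing locally, here is a small sketch against the sample data. It assumes SQLite 3.25+ through Python's sqlite3 rather than the original engine, so `select * except(pos)` is replaced by an explicit column list and `using` by an equivalent `on` clause. The dates are written as ISO text on the assumption that the question's dates are month/day/year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the real engine
conn.executescript(
    """
    CREATE TABLE recognitions (Employee_Id INT, Reward_Date TEXT);
    CREATE TABLE coupons (Employee_Id INT, Coupon TEXT);
    INSERT INTO recognitions VALUES
        (1, '2020-01-01'), (1, '2020-01-02'), (1, '2020-01-03'),
        (2, '2020-02-01'), (2, '2020-02-02'), (3, '2020-02-02');
    INSERT INTO coupons VALUES
        (1, 'COUPON1'), (1, 'COUPON2'), (1, 'COUPON3'), (2, 'COUPON4');
    """
)

# Number rewards and coupons per employee, then pair them by position.
# Rewards without a coupon at the same position keep NULL via the LEFT JOIN.
allotted = conn.execute(
    """
    SELECT r.Employee_Id, r.Reward_Date, c.Coupon
    FROM (
        SELECT Employee_Id, Reward_Date,
               ROW_NUMBER() OVER (PARTITION BY Employee_Id ORDER BY Reward_Date) AS pos
        FROM recognitions
    ) AS r
    LEFT JOIN (
        SELECT Employee_Id, Coupon,
               ROW_NUMBER() OVER (PARTITION BY Employee_Id ORDER BY Coupon) AS pos
        FROM coupons
    ) AS c
      ON c.Employee_Id = r.Employee_Id AND c.pos = r.pos
    ORDER BY r.Employee_Id, r.Reward_Date
    """
).fetchall()

for row in allotted:
    print(row)
```

Each table is scanned and numbered once and the join is a plain equality join, which is the reason this shape tends to scale better than a per-row lateral lookup; actual performance on 100k+ rows would still need to be measured on the real engine.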

How to select rows where values changed for an ID

I have a table that looks like the following
id effective_date number_of_int_customers
123 10/01/19 0
123 02/01/20 3
456 10/01/19 6
456 02/01/20 6
789 10/01/19 5
789 02/01/20 4
999 10/01/19 0
999 02/01/20 1
I want to write a query that looks at each ID to see if the salespeople have newly started working internationally between October 1st and February 1st.
The result I am looking for is the following:
id effective_date number_of_int_customers
123 02/01/20 3
999 02/01/20 1
The result would return only the salespeople who originally had 0 international customers and now have at least 1.
I have seen similar posts here that use nested queries to pull records where the first date and last have different values. But I only want to pull records where the original value was 0. Is there a way to do this in one query in SQL?
In your case, a simple aggregation would do -- assuming that 0 is the earliest value:
select id, max(number_of_int_customers)
from t
where effective_date in ('2019-10-01', '2020-02-01')
group by id
having min(number_of_int_customers) = 0;
Obviously, this is not correct if the values can decrease to zero. But this having clause fixes that problem:
having min(case when number_of_int_customers = 0 then effective_date end) = min(effective_date)
An alternative is to use window functions, such as first_value():
select distinct id, last_noic
from (select t.*,
first_value(number_of_int_customers) over (partition by id order by effective_date) as first_noic,
first_value(number_of_int_customers) over (partition by id order by effective_date desc) as last_noic
from t
where effective_date in ('2019-10-01', '2020-02-01')
) t
where first_noic = 0;
Hmmm, on second thought, I like lag() better:
select id, number_of_int_customers
from (select t.*,
lag(number_of_int_customers) over (partition by id order by effective_date) as prev_noic
from t
where effective_date in ('2019-10-01', '2020-02-01')
) t
where prev_noic = 0;
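A quick way to check the lag() version is to run it on the sample rows. The sketch below assumes SQLite 3.25+ via Python's sqlite3 and ISO date strings. Two things in it are not from the original posts and are labeled as such: a hypothetical salesperson 555 who stays at zero, and an optional lower bound on the current value (the `min_now` parameter of the hypothetical helper `newly_international`) so that only people who "now have at least 1" are returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the real DBMS
conn.executescript(
    """
    CREATE TABLE t (id INT, effective_date TEXT, number_of_int_customers INT);
    INSERT INTO t VALUES
        (123, '2019-10-01', 0), (123, '2020-02-01', 3),
        (456, '2019-10-01', 6), (456, '2020-02-01', 6),
        (789, '2019-10-01', 5), (789, '2020-02-01', 4),
        (999, '2019-10-01', 0), (999, '2020-02-01', 1),
        (555, '2019-10-01', 0), (555, '2020-02-01', 0);
    """
)
# Note: id 555 is a hypothetical extra row pair, added only to show the edge case.


def newly_international(min_now):
    """lag() query; min_now=0 behaves like the original filter prev_noic = 0."""
    return conn.execute(
        """
        SELECT id, effective_date, number_of_int_customers
        FROM (
            SELECT t.*,
                   LAG(number_of_int_customers)
                       OVER (PARTITION BY id ORDER BY effective_date) AS prev_noic
            FROM t
            WHERE effective_date IN ('2019-10-01', '2020-02-01')
        )
        WHERE prev_noic = 0 AND number_of_int_customers >= ?
        ORDER BY id
        """,
        (min_now,),
    ).fetchall()


print(newly_international(0))  # original filter: also returns the unchanged 555
print(newly_international(1))  # with the extra bound: only 123 and 999
```

On the question's own data both calls agree; the extra bound only matters if someone can still be at zero on the later date.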

How to write a sql script for a range of Oracle assignment date records by different employee's job titles

I am trying to write an ad-hoc query for a range of assignment date records by employee's job title. These examples are used for the Oracle application assignment table.
First sample:
AsgId Start_Date End_Date Job_ID
1 1/1/14 6/30/14 10
1 7/1/14 11/15/14 10
1 11/16/14 1/10/15 20
1 1/11/15 3/10/15 10
1 3/11/15 3/31/15 10
1 4/1/15 12/31/18 20
I have tried analytical functions, in-line views, and other code without success.
Expected report results of 4 date-range records by job title:
asgid start_date end_date job_title
1 1/1/14 11/15/14 10
1 11/16/14 1/10/15 20
1 1/11/15 3/31/15 10
1 4/1/15 12/31/18 20
Second sample:
EMP_ID START_DATE END_DATE JOB_TITLE
1 1/1/14 11/15/14 10
1 11/16/14 11/10/15 10
1 11/11/15 12/31/15 20
1 1/1/16 1/31/16 10
1 2/1/16 12/31/16 10
Expected report results of 3 date-range records by job title
EMP_ID START_DATE END_DATE JOB_TITLE
1 1/1/14 11/10/15 10
1 11/11/15 12/31/15 20
1 1/1/16 12/31/16 10
This is a type of gaps-and-islands problem. Assuming that there are no gaps or overlaps, you can use left join and a cumulative sum to determine the islands. The rest is aggregation:
select asgid, job_id, min(start_date) as start_date,
max(end_date) as end_date
from (select a.*,
sum(case when aprev.asgid is null then 1 else 0 end) over (partition by a.asgid, a.job_id order by a.start_date) as grp
from assignment a left join
assignment aprev
on aprev.asgid = a.asgid and
aprev.job_id = a.job_id and
aprev.end_date = a.start_date - 1
) a
group by asgid, job_id, grp
order by asgid, min(a.start_date);
Here is a db<>fiddle.
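The same left-join-plus-cumulative-sum idea can be tried on the first sample without an Oracle instance. This is a sketch under stated assumptions: SQLite 3.25+ through Python's sqlite3, dates stored as ISO text, and Oracle's `start_date - 1` rewritten as SQLite's `date(start_date, '-1 day')`. The output aliases `range_start` and `range_end` are names picked for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for Oracle
conn.executescript(
    """
    CREATE TABLE assignment (asgid INT, start_date TEXT, end_date TEXT, job_id INT);
    INSERT INTO assignment VALUES
        (1, '2014-01-01', '2014-06-30', 10),
        (1, '2014-07-01', '2014-11-15', 10),
        (1, '2014-11-16', '2015-01-10', 20),
        (1, '2015-01-11', '2015-03-10', 10),
        (1, '2015-03-11', '2015-03-31', 10),
        (1, '2015-04-01', '2018-12-31', 20);
    """
)

# A row starts a new island when no row with the same asgid and job_id
# ends on the day before it starts; the running sum numbers the islands.
islands = conn.execute(
    """
    SELECT asgid, job_id,
           MIN(start_date) AS range_start,
           MAX(end_date)   AS range_end
    FROM (
        SELECT a.*,
               SUM(CASE WHEN aprev.asgid IS NULL THEN 1 ELSE 0 END)
                   OVER (PARTITION BY a.asgid, a.job_id ORDER BY a.start_date) AS grp
        FROM assignment AS a
        LEFT JOIN assignment AS aprev
               ON aprev.asgid = a.asgid
              AND aprev.job_id = a.job_id
              AND aprev.end_date = date(a.start_date, '-1 day')
    )
    GROUP BY asgid, job_id, grp
    ORDER BY asgid, range_start
    """
).fetchall()

for row in islands:
    print(row)
```

As in the answer, this relies on the ranges having no gaps or overlaps within an assignment.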

Expanding/changing my query to find more entries using (potentially) IFELSE

My question will use this dataset as an example. I have a query set up that picks the most recent date for a given account (I have changed the names to more generic ones for the sake of posting this on the internet, so the query may not make perfect sense). The query returns the rows with a reason_type of 1 that have the most recent date, and it filters on effective_date is not null.
account date effective_date value reason_type
123456 4/20/2017 5/1/2017 5 1
123456 1/20/2017 2/1/2017 10 1
987654 2/5/2018 3/1/2018 15 1
987654 12/31/2017 2/1/2018 20 1
456789 4/27/2018 5/1/2018 50 1
456789 1/24/2018 2/1/2018 60 1
456123 4/25/2017 null 15 2
789123 5/1/2017 null 16 2
666888 2/1/2018 null 31 2
333222 1/1/2018 null 20 2
What I am looking to do now is basically to apply that logic only to reason_type 1 if there is an entry for it, and otherwise have it default to reason_type 2.
I think I should be using an IFELSE, but I'm admittedly not knowledgeable about how I would go about that.
Here is the code that I currently have to return the reason_type 1s most recent entry.
I hope my question is clear.
SELECT account, date, effective_date, value, reason_type
from
(
SELECT account, date, effective_date, value, reason_type,
ROW_NUMBER() over (partition by account order by date desc) rn
from mytable
WHERE value is not null
AND effective_date is not null
)
WHERE rn =1
I think you might want something like this (do you really have a column named date by the way? That seems like a bad idea):
SELECT account, date, effective_date, value, reason_type
FROM (
SELECT account, date, effective_date, value, reason_type
, ROW_NUMBER() OVER ( PARTITION BY account ORDER BY date DESC ) AS rn
FROM mytable
WHERE value IS NOT NULL
) WHERE rn = 1
-- effective_date IS NULL or is on or before today's date
AND ( effective_date IS NULL OR effective_date < TRUNC(SYSDATE+1) );
Hope this helps.
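To check what the relaxed filter returns on the sample data, here is a rough sketch. It assumes SQLite 3.25+ via Python's sqlite3 instead of Oracle, so `TRUNC(SYSDATE+1)` becomes `date('now', '+1 day')`. The column is renamed from `date` to `entry_date` (a hypothetical name) to avoid the reserved-word concern raised above, and dates are stored as ISO text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for Oracle
conn.execute(
    "CREATE TABLE mytable (account INT, entry_date TEXT, effective_date TEXT,"
    " value INT, reason_type INT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?, ?)",
    [
        (123456, "2017-04-20", "2017-05-01", 5, 1),
        (123456, "2017-01-20", "2017-02-01", 10, 1),
        (987654, "2018-02-05", "2018-03-01", 15, 1),
        (987654, "2017-12-31", "2018-02-01", 20, 1),
        (456789, "2018-04-27", "2018-05-01", 50, 1),
        (456789, "2018-01-24", "2018-02-01", 60, 1),
        (456123, "2017-04-25", None, 15, 2),
        (789123, "2017-05-01", None, 16, 2),
        (666888, "2018-02-01", None, 31, 2),
        (333222, "2018-01-01", None, 20, 2),
    ],
)

# Latest row per account; rows with no effective_date (reason_type 2 here)
# are kept, and dated rows must be effective on or before today.
latest = conn.execute(
    """
    SELECT account, entry_date, effective_date, value, reason_type
    FROM (
        SELECT account, entry_date, effective_date, value, reason_type,
               ROW_NUMBER() OVER (PARTITION BY account ORDER BY entry_date DESC) AS rn
        FROM mytable
        WHERE value IS NOT NULL
    )
    WHERE rn = 1
      AND (effective_date IS NULL OR effective_date < date('now', '+1 day'))
    ORDER BY account
    """
).fetchall()

for row in latest:
    print(row)
```

With this data every account has rows of only one reason_type, so each of the seven accounts comes back once: the three reason_type 1 accounts with their most recent entry, and the four reason_type 2 accounts with their single entry.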

Sql Server group by sets of columns

I have a data set where I need to count patient visits with such rules:
Two or more visits to the same doctor in the same day count as 1 visit, regardless of the reason
Two or more visits to different doctors for the same reason count as 1 visit
Two or more visits to different doctors on the same day for different reasons count as two or more visits.
Example data:
DoctorId PatientId VisitDate ReasonCode RowId
-------- --------- --------- ---------- -----
1 100 2014-01-01 200 1
1 100 2014-01-01 210 2
2 100 2014-01-01 200 3
2 100 2014-01-11 300 4
1 100 2014-01-15 200 5
2 400 2014-01-15 200 6
In this example, my final count would be based on grouping rowId 1, 2, 3 for 1 visit; grouping row 4 as 1 visit, grouping row 5 as 1 visit for a total of 3 visits for patient 100. Patient 400 has 1 visit as well.
patientid visitdate numberofvisits
--------- --------- --------------
100 2014-01-01 3
100 2014-01-11 1
100 2014-01-15 1
400 2014-01-15 1
Where I'm stuck is how to handle the group by so that I get the different scenarios covered. If the grouping were doctor, date, I'd be fine. If it were doctor, date, ReasonCode, I'd be fine. It's the logic of the doctorId and the ReasonCode in the scenario where 2 doctors are involved, and doctorid and date in the other when it's the same doctor. I've not been deeply into Sql Server in a long time, so it's possible that a common table expression is the solution and I'm not seeing it. I'm using Sql Server 2014 and there's a decent latitude in performance. I would be looking for a sql server query that produces the results above. As best I can tell, there's no way to group this the way I need it counted.
The answer was an except clause and grouping each of the sets before a final count. Sometimes, we over-complicate things.
-- Table variables in SQL Server are declared with an @ prefix.
DECLARE @tblAllData TABLE
(
DoctorId INT NOT NULL
, PatientId INT NOT NULL
, VisitDate DATE NOT NULL
, ReasonCode INT NOT NULL
, RowId INT NOT NULL
)
-- Load the example data from the question.
INSERT @tblAllData
SELECT 1,100,'2014-01-01',200,1
UNION ALL
SELECT 1,100,'2014-01-01',210,2
UNION ALL
SELECT 2,100,'2014-01-01',200,3
UNION ALL
SELECT 2,100,'2014-01-11',300,4
UNION ALL
SELECT 1,100,'2014-01-15',200,5
UNION ALL
SELECT 2,400,'2014-01-15',200,6
DECLARE @tblTempCountedRows AS TABLE
(
PatientId INT NOT NULL
, VisitDate DATE
, ReasonCode INT
)
-- Group each set separately; EXCEPT also removes duplicate rows.
INSERT @tblTempCountedRows
SELECT PatientId, VisitDate,0
FROM @tblAllData
GROUP BY PatientId, DoctorId, VisitDate
EXCEPT
SELECT PatientId, VisitDate, ReasonCode
FROM @tblAllData
GROUP BY PatientId, VisitDate, ReasonCode
select * from @tblTempCountedRows
DECLARE @tblFinalCountedRows AS TABLE
(
PatientId INT NOT NULL
, VisitCount INT
)
-- Final count of visits per patient.
INSERT @tblFinalCountedRows
SELECT
PatientId
, count(1) as Member_visit_Count
FROM
@tblTempCountedRows
GROUP BY PatientId
SELECT * from @tblFinalCountedRows
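For readers without a SQL Server instance, the same EXCEPT-then-count steps can be tried as a single statement. This is only a sketch: it assumes SQLite via Python's sqlite3 as a stand-in, replaces the table variables with one table (hypothetically named `visits`) and a common table expression, and uses the example rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for SQL Server
conn.executescript(
    """
    CREATE TABLE visits (DoctorId INT, PatientId INT, VisitDate TEXT,
                         ReasonCode INT, RowId INT);
    INSERT INTO visits VALUES
        (1, 100, '2014-01-01', 200, 1),
        (1, 100, '2014-01-01', 210, 2),
        (2, 100, '2014-01-01', 200, 3),
        (2, 100, '2014-01-11', 300, 4),
        (1, 100, '2014-01-15', 200, 5),
        (2, 400, '2014-01-15', 200, 6);
    """
)

# Same shape as the T-SQL steps: group each set, combine with EXCEPT
# (which also drops duplicate rows), then count rows per patient.
visit_counts = conn.execute(
    """
    WITH counted AS (
        SELECT PatientId, VisitDate, 0 AS ReasonCode
        FROM visits
        GROUP BY PatientId, DoctorId, VisitDate
        EXCEPT
        SELECT PatientId, VisitDate, ReasonCode
        FROM visits
        GROUP BY PatientId, VisitDate, ReasonCode
    )
    SELECT PatientId, COUNT(1) AS VisitCount
    FROM counted
    GROUP BY PatientId
    ORDER BY PatientId
    """
).fetchall()

print(visit_counts)
```

On the example data this gives 3 visits for patient 100 and 1 visit for patient 400, matching the totals described in the question.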