How to write a SQL script for a range of Oracle assignment date records by different employees' job titles

I am trying to write an ad-hoc query for a range of assignment date records by employee's job title. These examples are used for the Oracle application assignment table.
First sample:
AsgId Start_Date End_Date Job_ID
1 1/1/14 6/30/14 10
1 7/1/14 11/15/14 10
1 11/16/14 1/10/15 20
1 1/11/15 3/10/15 10
1 3/11/15 3/31/15 10
1 4/1/15 12/31/18 20
I have tried analytical functions, in-line views, and other code without success.
Expected report results of 4 date-range records by job title:
asgid start_date end_date job_title
1 1/1/14 11/15/14 10
1 11/16/14 1/10/15 20
1 1/11/15 3/31/15 10
1 4/1/15 12/31/18 20
Second sample:
EMP_ID START_DATE END_DATE JOB_TITLE
1 1/1/14 11/15/14 10
1 11/16/14 11/10/15 10
1 11/11/15 12/31/15 20
1 1/1/16 1/31/16 10
1 2/1/16 12/31/16 10
Expected report results of 3 date-range records by job title
EMP_ID START_DATE END_DATE JOB_TITLE
1 1/1/14 11/10/15 10
1 11/11/15 12/31/15 20
1 1/1/16 12/31/16 10

This is a type of gaps-and-islands problem. Assuming that there are no gaps or overlaps, you can use left join and a cumulative sum to determine the islands. The rest is aggregation:
select asgid, job_id, min(start_date) as start_date,
       max(end_date) as end_date
from (select a.*,
             sum(case when aprev.asgid is null then 1 else 0 end) over (partition by a.asgid, a.job_id order by a.start_date) as grp
      from assignment a left join
           assignment aprev
           on aprev.asgid = a.asgid and
              aprev.job_id = a.job_id and
              aprev.end_date = a.start_date - 1
     ) a
group by asgid, job_id, grp
order by asgid, min(a.start_date);
Here is a db<>fiddle.
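The same island logic can be sketched outside the database. The following is only an illustration in Python, not the answer's query: the function name merge_islands is made up here, the rows are the first sample from the question, and it relies on the same assumption of no gaps or overlaps, so checking the immediately preceding row is enough.

```python
from datetime import date, timedelta

# Rows from the first sample: (asgid, start_date, end_date, job_id).
rows = [
    (1, date(2014, 1, 1), date(2014, 6, 30), 10),
    (1, date(2014, 7, 1), date(2014, 11, 15), 10),
    (1, date(2014, 11, 16), date(2015, 1, 10), 20),
    (1, date(2015, 1, 11), date(2015, 3, 10), 10),
    (1, date(2015, 3, 11), date(2015, 3, 31), 10),
    (1, date(2015, 4, 1), date(2018, 12, 31), 20),
]


def merge_islands(rows):
    """Merge consecutive rows that share asgid and job_id and touch end-to-start.

    Hypothetical helper; assumes no gaps or overlaps within an asgid.
    """
    merged = []
    for asgid, start, end, job in sorted(rows, key=lambda r: (r[0], r[1])):
        prev = merged[-1] if merged else None
        continues_island = (
            prev is not None
            and prev[0] == asgid
            and prev[3] == job
            and prev[2] + timedelta(days=1) == start
        )
        if continues_island:
            # Extend the current island to this row's end date.
            merged[-1] = (asgid, prev[1], end, job)
        else:
            # A different job (or a different asgid) starts a new island.
            merged.append((asgid, start, end, job))
    return merged


merged = merge_islands(rows)
for row in merged:
    print(row)
```

For the first sample this yields the four date ranges listed in the expected results.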

Related

sql query to fill sparse data in timeline

I have a table holding various information changes related to employees. Some information changes over time, but not all together, and changes occur periodically but not regularly. Changes are recorded by date, and if an item is not changed for the given employee at the given time, then the item's value is Null for that record. Say it looks like this:
employeeId  Date        Salary  CommuteDistance
----------  ----------  ------  ---------------
1           2000-01-01  1000    Null
2           2000-01-15  2000    20
3           2000-01-30  3000    Null
2           2010-02-15  2100    Null
3           2010-03-30  Null    30
1           2020-02-01  1100    10
1           2030-03-01  Null    100
Now, how can I write a query to fill the null values with the most recent non-null values for all employees at all dates, while keeping the value Null if there is no such previous non-null value? It should look like:
employeeId  Date        Salary  CommuteDistance
----------  ----------  ------  ---------------
1           2000-01-01  1000    Null
2           2000-01-15  2000    20
3           2000-01-30  3000    Null
2           2010-02-15  2100    20
3           2010-03-30  3000    30
1           2020-02-01  1100    10
1           2030-03-01  1100    100
(Note how the values that were Null in the input, namely CommuteDistance 20 for employee 2 on 2010-02-15, Salary 3000 for employee 3 on 2010-03-30, and Salary 1100 for employee 1 on 2030-03-01, are carried over from previous records of the same employee.)
I'd like to use the query inside a view, then in turn query that view to get the picture at an arbitrary date (e.g., what were the salary and commute distance for the employees on 2021-08-17? I should be able to do that part, but I'm unable to build the view). Or is there a better way to accomplish this?
There's no point in showing my attempts, since I'm quite inexperienced with advanced SQL (I assume the solution employs advanced knowledge, since I found my basic knowledge insufficient for this) and I got nowhere near the desired result.
You may get the last not null value for employee salary or CommuteDistance using the following:
SELECT T.employeeId, T.Date,
       COALESCE(Salary, MAX(Salary) OVER (PARTITION BY employeeId, g1)) AS Salary,
       COALESCE(CommuteDistance, MAX(CommuteDistance) OVER (PARTITION BY employeeId, g2)) AS CommuteDistance
FROM
(
  SELECT *,
         MAX(CASE WHEN Salary IS NOT null THEN Date END) OVER (PARTITION BY employeeId ORDER BY Date) AS g1,
         MAX(CASE WHEN CommuteDistance IS NOT null THEN Date END) OVER (PARTITION BY employeeId ORDER BY Date) AS g2
  FROM TableName
) T
ORDER BY Date
See a demo.
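For each employee, a running count of non-null values (ordered by Date) puts every non-null Salary or CommuteDistance in the same group as the nulls that follow it. Each group then holds at most one non-null value, so taking the max over the group fills in the blanks.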
select employeeId
      ,Date
      ,max(Salary) over(partition by employeeId, s_grp) as Salary
      ,max(CommuteDistance) over(partition by employeeId, d_grp) as CommuteDistance
from (
     select *
           ,count(case when Salary is not null then 1 end) over(partition by employeeId order by Date) as s_grp
           ,count(case when CommuteDistance is not null then 1 end) over(partition by employeeId order by Date) as d_grp
     from t
     ) t
order by Date
employeeId  Date        Salary  CommuteDistance
----------  ----------  ------  ---------------
1           2000-01-01  1000    null
2           2000-01-15  2000    20
3           2000-01-30  3000    null
2           2010-02-15  2100    20
3           2010-03-30  3000    30
1           2020-02-01  1100    10
1           2030-03-01  1100    100
Fiddle
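To make the carry-forward rule concrete, here is a small Python sketch of the same idea, using the question's sample rows. It is only an illustration of the behaviour the queries aim for, not a replacement for them; the name fill_forward is hypothetical, and dates are kept as ISO strings so that plain string ordering matches date ordering.

```python
# Sample rows from the question: (employeeId, Date, Salary, CommuteDistance).
rows = [
    (1, "2000-01-01", 1000, None),
    (2, "2000-01-15", 2000, 20),
    (3, "2000-01-30", 3000, None),
    (2, "2010-02-15", 2100, None),
    (3, "2010-03-30", None, 30),
    (1, "2020-02-01", 1100, 10),
    (1, "2030-03-01", None, 100),
]


def fill_forward(rows):
    """Replace each None with the employee's most recent earlier non-None value."""
    last_seen = {}  # employeeId -> [last salary, last commute distance]
    filled = []
    for emp, day, salary, commute in sorted(rows, key=lambda r: r[1]):
        last = last_seen.setdefault(emp, [None, None])
        if salary is not None:
            last[0] = salary
        if commute is not None:
            last[1] = commute
        # Values stay None when the employee has no earlier non-None value.
        filled.append((emp, day, last[0], last[1]))
    return filled


filled = fill_forward(rows)
for row in filled:
    print(row)
```

The output rows match the result table above, including the Null commute distances for employees 1 and 3 on their first records.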

Select max date for each register, null if does not exists

I have these tables: Employee (id, name, number), Configuration (id, years, licence_days), Periods (id, start_date, end_date, configuration_id, employee_id, period_type):
Employee table:
id name number
---- ----- -------
1 Bob 355
2 John 467
3 Maria 568
4 Josh 871
configuration table:
id years licence_days
---- ----- ------------
1 1 8
2 3 16
3 5 24
Periods table:
id start_date end_date configuration_id employee_id period_type
---- ---------- ------- ---------------- ----------- -----------
1 2021-05-23 2021-05-31 1 1 vaccation
2 2021-05-24 2021-06-01 1 2 vaccation
3 2021-03-01 2021-03-17 2 2 vaccation
4 2021-05-05 2021-05-21 2 2 vaccation
5 2021-01-01 2021-01-17 2 4 vaccation
I want this result:
Result:
employee_id years licence_days max(end_date)
1 1 8 2021-05-31
1 3 16 null
1 5 24 null
2 1 8 2021-06-01
2 3 16 2021-05-21
2 5 24 null
3 1 8 null
3 3 16 null
3 5 24 null
4 1 8 null
4 3 16 2021-01-17
4 5 24 null
i.e., I want to select all employees combined with all configurations and, for each combination, the max end_date of the "vaccation" type (or null if it does not exist).
How can I do that?
Oracle supports cross joins, right? So maybe something like this?
SELECT e.id AS employee_id, c.years, c.licence_days, MAX(p.end_date) AS max_end_date
FROM Employee e
CROSS JOIN Configuration c
LEFT JOIN Periods p
  ON p.employee_id = e.id
  AND p.configuration_id = c.id
  AND p.period_type = 'vaccation' -- only this period type, spelled as in the data
GROUP BY e.id, c.years, c.licence_days
ORDER BY e.id, c.years
@umberto-petrov chooses wisely with the ANSI CROSS JOIN syntax for a Cartesian join. However, in the unlikely event that you need configurations in the output even when there are no employees, you can go with something like this:
EDIT: Filtering the Periods join with 'vaccation' as asked in the comments.
If you have to filter for some employee ids, replace ON 1 = 1 with ON Employee.id IN (id1, id2, ...). It still keeps every configuration but only takes employees that match the ids.
SELECT Employee.id AS employee_id,
       Configuration.years,
       Configuration.licence_days,
       MAX(Periods.end_date) max_end_date -- end_date lives in Periods, not Configuration
FROM Configuration LEFT JOIN Employee ON 1 = 1
LEFT JOIN Periods ON Periods.configuration_id = Configuration.id
                 AND Periods.employee_id = Employee.id
                 AND Periods.period_type = 'vaccation'
GROUP BY Employee.id,
         Configuration.years,
         Configuration.licence_days
ORDER BY Employee.id,
         Configuration.years,
         Configuration.licence_days
We start from Configuration so that every configuration row appears at least once, then LEFT JOIN Employee on an always-true condition (in effect a Cartesian join that still keeps the configurations when Employee is empty), and finally LEFT JOIN Periods on both keys. That way, if there are no employees, the query still outputs years and licence_days for each configuration, with NULL for employee_id and max_end_date.
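The cross join plus filtered left join can be tried on the sample data with an in-memory SQLite database. This is a sketch under assumptions: SQLite stands in for Oracle, the column types are guesses, and the column names follow the table definitions given in the question (Employee.id, Configuration.id).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (id INTEGER, name TEXT, number INTEGER);
CREATE TABLE Configuration (id INTEGER, years INTEGER, licence_days INTEGER);
CREATE TABLE Periods (id INTEGER, start_date TEXT, end_date TEXT,
                      configuration_id INTEGER, employee_id INTEGER,
                      period_type TEXT);
INSERT INTO Employee VALUES (1,'Bob',355),(2,'John',467),(3,'Maria',568),(4,'Josh',871);
INSERT INTO Configuration VALUES (1,1,8),(2,3,16),(3,5,24);
INSERT INTO Periods VALUES
  (1,'2021-05-23','2021-05-31',1,1,'vaccation'),
  (2,'2021-05-24','2021-06-01',1,2,'vaccation'),
  (3,'2021-03-01','2021-03-17',2,2,'vaccation'),
  (4,'2021-05-05','2021-05-21',2,2,'vaccation'),
  (5,'2021-01-01','2021-01-17',2,4,'vaccation');
""")

# Every employee paired with every configuration; Periods only contributes
# an end_date where a matching 'vaccation' row exists, otherwise NULL.
query = """
SELECT e.id, c.years, c.licence_days, MAX(p.end_date)
FROM Employee e
CROSS JOIN Configuration c
LEFT JOIN Periods p
  ON p.employee_id = e.id
 AND p.configuration_id = c.id
 AND p.period_type = 'vaccation'
GROUP BY e.id, c.years, c.licence_days
ORDER BY e.id, c.years
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Keeping the period_type condition in the ON clause rather than in a WHERE clause matters here: a WHERE filter would discard the employee/configuration pairs that have no matching period.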

adjust date overlaps within a group

I have this table, and I want to set END_DATE to one day before the next ST_DATE whenever date ranges overlap within the same ID.
TABLE HAVE
ID ST_DATE END_DATE
1 2020-01-01 2020-02-01
1 2020-05-10 2020-05-20
1 2020-05-18 2020-06-19
1 2020-11-11 2020-12-01
2 1999-03-09 1999-05-10
2 1999-04-09 2000-05-10
3 1999-04-09 2000-05-10
3 2000-06-09 2000-08-16
3 2000-08-17 2009-02-17
Below is what I'm looking for
TABLE WANT
ID ST_DATE END_DATE
1 2020-01-01 2020-02-01
1 2020-05-10 2020-05-17 =====changed to a day less than the next ST_DATE due to some sort of overlap
1 2020-05-18 2020-06-19
1 2020-11-11 2020-12-01
2 1999-03-09 1999-04-08 =====changed to a day less than the next ST_DATE due to some sort of overlap
2 1999-04-09 2000-05-10
3 1999-04-09 2000-05-10
3 2000-06-09 2000-08-16
3 2000-08-17 2009-02-17
Maybe you can use LEAD() for this. Initial idea:
select
id, st_date, end_date
, lead( st_date ) over ( partition by id order by st_date ) nextstart_
from overlap
;
-- result
ID ST_DATE END_DATE NEXTSTART
---------- --------- --------- ---------
1 01-JAN-20 01-FEB-20 10-MAY-20
1 10-MAY-20 20-MAY-20 18-MAY-20
1 18-MAY-20 19-JUN-20 11-NOV-20
1 11-NOV-20 01-DEC-20
2 09-MAR-99 10-MAY-99 09-APR-99
2 09-APR-99 10-MAY-00
3 09-APR-99 10-MAY-00 09-JUN-00
3 09-JUN-00 16-AUG-00 17-AUG-00
3 17-AUG-00 17-FEB-09
Once you have the next start date and the end_date side by side (as it were),
you can use CASE ... for adjusting the dates as you need them.
select ilv.id, ilv.st_date
, case
    -- >= also catches an end date that falls exactly on the next start date
    when ilv.end_date >= ilv.nextstart_ then
      to_char( ilv.nextstart_ - 1 ) || ' <- modified end date'
    else
      to_char( ilv.end_date )
  end dt_modified
from (
  select
    id, st_date, end_date
  , lead( st_date ) over ( partition by id order by st_date ) nextstart_
  from overlap
) ilv
;
ID ST_DATE DT_MODIFIED
---------- --------- ---------------------------------------
1 01-JAN-20 01-FEB-20
1 10-MAY-20 17-MAY-20 <- modified end date
1 18-MAY-20 19-JUN-20
1 11-NOV-20 01-DEC-20
2 09-MAR-99 08-APR-99 <- modified end date
2 09-APR-99 10-MAY-00
3 09-APR-99 10-MAY-00
3 09-JUN-00 16-AUG-00
3 17-AUG-00 17-FEB-09
DBfiddle here.
If two "windows" for the same id have the same start date, then the problem doesn't make sense. So, let's assume that the problem makes sense - that is, the combination (id, st_date) is unique in the inputs.
Then, the problem can be formulated as follows: for each id, order rows by st_date ascending. Then, for each row, if its end_date is less than the following st_date, return the row as is. Otherwise replace end_date with the following st_date, minus 1. This last step can be achieved with the analytic lead() function.
A solution might look like this:
select id, st_date,
least(end_date, lead(st_date, 1, end_date + 1)
over (partition by id order by st_date) - 1) as end_date
from have
;
The end_date + 1 argument in the lead function handles the last row for each id. For such rows there is no "next" row, so by default lead would return null, and least would then return null as well. The third parameter to lead overrides that default: with end_date + 1 as the fallback, subtracting 1 gives back end_date, so the last row is returned unchanged.
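As a cross-check of the lead/least logic, here is a short Python sketch run on the question's rows. It is illustrative only: trim_overlaps is a made-up name, and it assumes, as the answer does, that (id, st_date) is unique.

```python
from datetime import date, timedelta
from itertools import groupby

# Rows from TABLE HAVE: (id, st_date, end_date).
rows = [
    (1, date(2020, 1, 1), date(2020, 2, 1)),
    (1, date(2020, 5, 10), date(2020, 5, 20)),
    (1, date(2020, 5, 18), date(2020, 6, 19)),
    (1, date(2020, 11, 11), date(2020, 12, 1)),
    (2, date(1999, 3, 9), date(1999, 5, 10)),
    (2, date(1999, 4, 9), date(2000, 5, 10)),
    (3, date(1999, 4, 9), date(2000, 5, 10)),
    (3, date(2000, 6, 9), date(2000, 8, 16)),
    (3, date(2000, 8, 17), date(2009, 2, 17)),
]


def trim_overlaps(rows):
    """Cap each end date at one day before the next start date of the same id."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        for i, (id_, st, end) in enumerate(group):
            if i + 1 < len(group):
                # Equivalent of least(end_date, lead(st_date) - 1).
                next_st = group[i + 1][1]
                end = min(end, next_st - timedelta(days=1))
            # The last row of each id has no next row and is kept as is.
            out.append((id_, st, end))
    return out


trimmed = trim_overlaps(rows)
for row in trimmed:
    print(row)
```

Only the two overlapping rows change: id 1 ends on 2020-05-17 and id 2 ends on 1999-04-08, as in TABLE WANT.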

Count the number of transactions per month for an individual group by date Hive

I have a table of customer transactions where each item purchased by a customer is stored as one row. So, for a single transaction there can be multiple rows in the table. I have another col called visit_date.
There is a category column called cal_month_nbr which ranges from 1 to 12 based on which month transaction occurred.
The data looks like below
Id visit_date Cal_month_nbr
---- ------ ------
1 01/01/2020 1
1 01/02/2020 1
1 01/01/2020 1
2 02/01/2020 2
1 02/01/2020 2
1 03/01/2020 3
3 03/01/2020 3
First, I want to know how many times each customer visits per month, based on visit_date,
i.e. I want the output below:
id cal_month_nbr visit_per_month
--- --------- ----
1 1 2
1 2 1
1 3 1
2 2 1
3 3 1
Second, I want the average number of visits per month for each id,
i.e.:
id Avg_freq_per_month
---- -------------
1 1.33
2 1
3 1
I tried with below query but it counts each item as one transaction
select avg(count_e) as num_visits_per_month, individual_id
from
(
  select r.individual_id, cal_month_nbr, count(*) as count_e
  from
    ww_customer_dl_secure.cust_scan r
  GROUP by
    r.individual_id, cal_month_nbr
  order by count_e desc
) as t
group by individual_id
I would appreciate any help, guidance or suggestions
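You can divide the total number of rows by the number of months. Note that this counts every item row, so several items bought on the same day count as several visits: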
select individual_id,
count(*) / count(distinct cal_month_nbr)
from ww_customer_dl_secure.cust_scan c
group by individual_id;
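If you want the average number of distinct visit days per month, which is what the expected output shows (4 visit days over 3 months = 1.33 for id 1), then: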
select individual_id,
count(distinct visit_date) / count(distinct cal_month_nbr)
from ww_customer_dl_secure.cust_scan c
group by individual_id;
Actually, Hive may not be efficient at calculating count(distinct), so multiple levels of aggregation might be faster:
select individual_id, avg(num_visit_days)
from (select individual_id, cal_month_nbr, count(*) as num_visit_days
      from (select distinct individual_id, visit_date, cal_month_nbr
            from ww_customer_dl_secure.cust_scan c
           ) iv
      group by individual_id, cal_month_nbr
     ) ic
group by individual_id;
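Both requested outputs can be checked against the sample data with an in-memory SQLite database. This is a sketch, not Hive: SQLite stands in for it, the table is called cust_scan without the schema prefix, and the id column is named individual_id as in the queries above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cust_scan (individual_id INTEGER, visit_date TEXT, cal_month_nbr INTEGER)"
)
# One row per purchased item, as in the question's sample.
conn.executemany(
    "INSERT INTO cust_scan VALUES (?, ?, ?)",
    [
        (1, "01/01/2020", 1),
        (1, "01/02/2020", 1),
        (1, "01/01/2020", 1),
        (2, "02/01/2020", 2),
        (1, "02/01/2020", 2),
        (1, "03/01/2020", 3),
        (3, "03/01/2020", 3),
    ],
)

# Visits per customer per month: distinct visit dates, not item rows.
per_month = conn.execute("""
    SELECT individual_id, cal_month_nbr, COUNT(DISTINCT visit_date)
    FROM cust_scan
    GROUP BY individual_id, cal_month_nbr
    ORDER BY individual_id, cal_month_nbr
""").fetchall()

# Average visits per month, using the multi-level aggregation.
avg_freq = conn.execute("""
    SELECT individual_id, AVG(num_visit_days)
    FROM (SELECT individual_id, cal_month_nbr, COUNT(*) AS num_visit_days
          FROM (SELECT DISTINCT individual_id, visit_date, cal_month_nbr
                FROM cust_scan) iv
          GROUP BY individual_id, cal_month_nbr) ic
    GROUP BY individual_id
    ORDER BY individual_id
""").fetchall()

print(per_month)
print(avg_freq)
```

The duplicate row for id 1 on 01/01/2020 is collapsed by the DISTINCT step, which is what keeps the two items bought on that day from counting as two visits.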

Vertica SQL for running count distinct and running conditional count

I'm trying to build a department level score table based on a deeper product url level score table.
The dates are not consecutive.
Not all URLs get score updates on the same day (they are independent of each other).
dist_url should be a running (cumulative) count distinct.
Both dist urls and urls score >=30 are distinct counts.
What I have now is:
Date url Store Dept Page Score
10/1 a US A X 10
10/1 b US A X 30
10/1 c US A X 60
10/4 a US A X 20
10/4 d US A X 60
10/6 b US A X 22
10/9 a US A X 40
10/9 e US A X 10
What I want is:
Date Store Dept Page dist urls urls score >=30
10/1 US A X 3 2
10/4 US A X 4 3
10/6 US A X 4 2
10/9 US A X 5 2
I think the dist_url can be done by using window function, just not sure on query.
Current query is as below, but it's wrong since not cumulative count distinct:
SELECT
bm.AnalysisDate,
su.SoID AS Store,
su.DptCaID AS DTID,
su.PageTypeID AS PTID,
COUNT(DISTINCT bm.SeoURLID) AS NumURLsWithDupScore,
SUM(CASE WHEN bm.DuplicationScore > 30 THEN 1 ELSE 0 END) AS Over30Count
FROM csn_seo.tblBotifyMetrics bm
INNER JOIN csn_seo.tblSEOURLs su
ON bm.SeoURLID = su.ID
WHERE su.DptCaID IS NOT NULL
AND su.DptCaID <> 0
AND su.PageTypeID IS NOT NULL
AND su.PageTypeID <> -1
AND bm.iscompliant = 1
GROUP BY bm.AnalysisDate, su.SoID, su.DptCaID, su.PageTypeID;
Please let me know if anyone has any idea.
Based on your question, you seem to want two levels of logic:
select date, store, dept, page,
       sum(sum(start)) over (partition by store, dept, page order by date) as distinct_urls,
       sum(sum(start_30)) over (partition by store, dept, page order by date) as distinct_urls_30
from ((select store, dept, page, url, min(date) as date, 1 as start, 0 as start_30
       from t
       group by store, dept, page, url
      ) union all
      (select store, dept, page, url, min(date) as date, 0, 1
       from t
       where score >= 30
       group by store, dept, page, url
      )
     ) t
group by date, store, dept, page;
I don't understand how your query is related to your question.
Try as I might, I don't get your output either.
But I think you can avoid the UNION SELECTs. Does this do what you expect?
NULLs are ignored by COUNT(DISTINCT ...), and here you can combine an aggregate expression with an OLAP one.
Vertica also has named windows to improve readability.
WITH
input(Date,url,Store,Dept,Page,Score) AS (
SELECT DATE '2019-10-01','a','US','A','X',10
UNION ALL SELECT DATE '2019-10-01','b','US','A','X',30
UNION ALL SELECT DATE '2019-10-01','c','US','A','X',60
UNION ALL SELECT DATE '2019-10-04','a','US','A','X',20
UNION ALL SELECT DATE '2019-10-04','d','US','A','X',60
UNION ALL SELECT DATE '2019-10-06','b','US','A','X',22
UNION ALL SELECT DATE '2019-10-09','a','US','A','X',40
UNION ALL SELECT DATE '2019-10-09','e','US','A','X',10
)
SELECT
date
, store
, dept
, page
, SUM(COUNT(DISTINCT url) ) OVER(w) AS dist_urls
, SUM(COUNT(DISTINCT CASE WHEN score >=30 THEN url END)) OVER(w) AS dist_urls_gt_30
FROM input
GROUP BY
date
, store
, dept
, page
WINDOW w AS (PARTITION BY store,dept,page ORDER BY date)
;
-- out date | store | dept | page | dist_urls | dist_urls_gt_30
-- out ------------+-------+------+------+-----------+-----------------
-- out 2019-10-01 | US | A | X | 3 | 2
-- out 2019-10-04 | US | A | X | 5 | 3
-- out 2019-10-06 | US | A | X | 6 | 3
-- out 2019-10-09 | US | A | X | 8 | 4
-- out (4 rows)
-- out
-- out Time: First fetch (4 rows): 45.321 ms. All rows formatted: 45.364 ms
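Summing the per-date COUNT(DISTINCT url) counts a URL again on every date it appears, which is why the output above grows to 3, 5, 6, 8 rather than the question's 3, 4, 4, 5. For comparison, here is a Python sketch of a true running distinct count. It rests on assumptions not confirmed by the question: all sample rows share one store/dept/page, and the ">= 30" column is read as "URLs whose latest score so far is at least 30". The name running_counts is hypothetical.

```python
from itertools import groupby

# Sample rows from the question: (date, url, score); store/dept/page omitted
# because every sample row has the same values for them.
rows = [
    ("2019-10-01", "a", 10),
    ("2019-10-01", "b", 30),
    ("2019-10-01", "c", 60),
    ("2019-10-04", "a", 20),
    ("2019-10-04", "d", 60),
    ("2019-10-06", "b", 22),
    ("2019-10-09", "a", 40),
    ("2019-10-09", "e", 10),
]


def running_counts(rows, threshold=30):
    """Per date: URLs seen so far, and URLs whose latest score is >= threshold."""
    latest = {}  # url -> most recent score seen so far
    out = []
    ordered = sorted(rows, key=lambda r: r[0])
    for day, group in groupby(ordered, key=lambda r: r[0]):
        for _, url, score in group:
            latest[url] = score
        high = sum(1 for s in latest.values() if s >= threshold)
        out.append((day, len(latest), high))
    return out


counts = running_counts(rows)
for row in counts:
    print(row)
```

Under this reading the distinct-URL column comes out as 3, 4, 4, 5, matching the question. The threshold column comes out as 2, 3, 2, 3, so the last date gives 3 (a, c and d) where the question shows 2; which reading the asker intended is not clear from the post.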