How to fill missing dates between empty records? - sql

I am trying to fill in dates between empty records, without success. I tried multiple selects and tried a join, but it seems I am missing the point. I would like to generate records for the missing dates so that I can build a chart from this block of code. For now I would like to have the dates filled "manually"; later I will reorganise this code and swap that method for an argument.
Can someone help me with that expression?
SELECT
LOG_LAST AS "data",
SUM(run_cnt) AS "Number of runs"
FROM
dual l
LEFT OUTER JOIN "LOG_STAT" stat ON
stat."LOG_LAST" = l."CLASS"
WHERE
new_class = '$arg[klasa]'
--SELECT to_date(TRUNC (SYSDATE - ROWNUM), 'DD-MM-YYYY'),
--0
--FROM dual CONNECT BY ROWNUM < 366
GROUP BY
LOG_LAST
ORDER BY
LOG_LAST
Edit:
LOG_LAST is a date column (for example: 25.04.2018 15:44:21), run_cnt is a column holding a simple number, LOG_STAT is the table that contains LOG_LAST and run_cnt, and new_class is a column with the name of the record. I would like to list records even when they do not exist. For example: I have records dated 24-09-2018, 23-09-2018, 20-09-2018 and 18-09-2018, and I would like to generate the missing dates in that period and list them too, even without names and run_cnt.

Try filling the nulls with NVL (Oracle's counterpart of ISNULL):
SELECT
nvl(LOG_LAST, DATE '2018-01-01') AS data,
SUM(nvl(run_cnt,0)) AS "Number of runs"
FROM
dual l
LEFT OUTER JOIN "LOG_STAT" stat ON
stat."LOG_LAST" = l."CLASS"
WHERE
new_class = '$arg[klasa]'
--SELECT to_date(TRUNC (SYSDATE - ROWNUM), 'DD-MM-YYYY'),
--0
--FROM dual CONNECT BY ROWNUM < 366
GROUP BY
LOG_LAST
ORDER BY
LOG_LAST

What you want is more or less:
select d.day, sum(ls.run_cnt)
from all_dates d
left join log_stat ls on trunc(ls.log_last) = d.day
-- filter in the ON clause: a WHERE on ls would discard the dates without a match
and ls.new_class = :klasa
group by d.day
order by d.day;
The all_dates table in the above query is supposed to contain all dates from the minimum log_last date of the klasa to its maximum log_last date. You can get these dates with a recursive query:
with ls as
(
select trunc(log_last) as day, sum(run_cnt) as total
from log_stat
where new_class = :klasa
group by trunc(log_last)
)
, all_dates(day) as
(
select min(day) from ls
union all
select day + 1 from all_dates where day < (select max(day) from ls)
)
select d.day, ls.total
from all_dates d
left join ls on ls.day = d.day
order by d.day;
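To try the recursive date-generation idea without an Oracle instance, here is a minimal sketch that runs the same query shape against SQLite through Python's sqlite3 module. It is an adaptation under assumptions, not the answer's code: SQLite's date(day, '+1 day') stands in for Oracle's day + 1, dates are stored as ISO text, the function name densify_daily_totals is hypothetical, and COALESCE(..., 0) is added so that missing days show 0 instead of NULL.

```python
import sqlite3

def densify_daily_totals(rows, klass):
    """Return (day, total) for every day between the first and last log entry of klass.

    rows: iterable of (log_last, run_cnt, new_class), log_last as 'YYYY-MM-DD HH:MM:SS' text.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE log_stat (log_last TEXT, run_cnt INTEGER, new_class TEXT)")
    con.executemany("INSERT INTO log_stat VALUES (?, ?, ?)", rows)
    query = """
        WITH RECURSIVE ls AS (
            -- one row per day that actually has data for the class
            SELECT date(log_last) AS day, SUM(run_cnt) AS total
            FROM log_stat
            WHERE new_class = :klasa
            GROUP BY date(log_last)
        ),
        all_dates(day) AS (
            -- start at the first day, add one day until the last day is reached
            SELECT MIN(day) FROM ls
            UNION ALL
            SELECT date(day, '+1 day') FROM all_dates
            WHERE day < (SELECT MAX(day) FROM ls)
        )
        SELECT d.day, COALESCE(ls.total, 0)
        FROM all_dates d
        LEFT JOIN ls ON ls.day = d.day
        ORDER BY d.day
    """
    result = con.execute(query, {"klasa": klass}).fetchall()
    con.close()
    return result
```

Calling it with entries on 2018-09-18 and 2018-09-20 for one class returns three rows, with a total of 0 for 2018-09-19.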

This is called data densification. The Oracle documentation covers it under "Data Densification for Reporting". An example of data densification:
with ls as
(
select trunc(created) as day,object_type new_class, sum(1) as total
from user_objects
group by trunc(created),object_type
)
, all_dates(day) as
(
select min(day) from ls
union all
select day + 1 from all_dates where day < (select max(day) from ls)
)
select d.day, nvl(ls.total,0),new_class
from all_dates d
left join ls partition by (ls.new_class) on ls.day = d.day
order by d.day;
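The partitioned outer join is Oracle-specific syntax. On engines without it, a similar per-class densification can be sketched by cross joining the generated dates with the distinct classes and then left joining the data. The example below is an assumption-laden illustration using SQLite via Python's sqlite3; the pre-aggregated table ls and the function name densify_per_class are hypothetical.

```python
import sqlite3

def densify_per_class(rows):
    """Return (day, new_class, total) for every day/class combination.

    rows: iterable of (day, new_class, total) with day as 'YYYY-MM-DD' text.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE ls (day TEXT, new_class TEXT, total INTEGER)")
    con.executemany("INSERT INTO ls VALUES (?, ?, ?)", rows)
    query = """
        WITH RECURSIVE all_dates(day) AS (
            SELECT MIN(day) FROM ls
            UNION ALL
            SELECT date(day, '+1 day') FROM all_dates
            WHERE day < (SELECT MAX(day) FROM ls)
        )
        SELECT d.day, c.new_class, COALESCE(ls.total, 0)
        FROM all_dates d
        -- every date is paired with every class, mimicking PARTITION BY (new_class)
        CROSS JOIN (SELECT DISTINCT new_class FROM ls) c
        LEFT JOIN ls ON ls.day = d.day AND ls.new_class = c.new_class
        ORDER BY d.day, c.new_class
    """
    result = con.execute(query).fetchall()
    con.close()
    return result
```

Each class gets one row per day in the overall range, with 0 where that class has no data.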

Related

Oracle query to fill in the missing data in the same table

I have a table in Oracle that has missing data for a given id. I am trying to work out the SQL to fill in the data from the start date 01/01/2019 to end_dt 10/1/2020. The status key can be filled in based on its previous status key. Input:
Expected output:
You can use a recursive query to generate the dates, then cross join that with the list of distinct ids available in the table. Then, use window functions to bring the missing key values:
with cte (mon) as (
select date '2019-01-01' mon from dual
union all select add_months(mon, 1) from cte where mon < date '2020-10-01'
)
select i.id,
coalesce(
t.status_key,
lead(t.previous_status_key ignore nulls) over(partition by i.id order by c.mon)
) as status_key,
coalesce(
t.previous_status_key,
lag(t.status_key ignore nulls, 1, -1) over(partition by i.id order by c.mon)
) previous_status_key,
c.mon
from cte c
cross join (select distinct id from mytable) i
left join mytable t on t.mon = c.mon and t.id = i.id
You did not give a lot of details on how to bring the missing status_keys and previous_status_keys. Here is what the query does:
status_key is taken from the next non-null previous_status_key
previous_status_key is taken from the last non-null status_key, with a default of -1
You can generate the dates and then use cross join and some additional logic to get the information you want:
with dates (mon) as (
select date '2019-01-01' as mon
from dual
union all
select mon + interval '1' month
from dates
where mon < date '2021-01-01'
)
select d.mon, i.id,
coalesce(t.status_key,
lag(t.status_key ignore nulls) over (partition by i.id order by d.mon)
) as status_key,
coalesce(t.previous_status_key,
lag(t.previous_status_key ignore nulls) over (partition by i.id order by d.mon)
) as previous_status_key
from dates d cross join
(select distinct id from t) i left join
t
on d.mon = t.mon and t.id = i.id;
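The effect of coalesce(value, lag(value ignore nulls)) is a forward fill: each generated month takes its own value if one exists, otherwise the most recent earlier value for the same id. A minimal plain-Python sketch of that logic follows; the names month_range and fill_forward and the dictionary input shape are assumptions made for illustration.

```python
from datetime import date

def month_range(start, end):
    """Yield the first day of each month from start to end, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1

def fill_forward(rows, start, end):
    """Densify and forward-fill status keys.

    rows: dict mapping (id, month_start) to status_key.
    Returns a list of (id, month_start, status_key); months before the
    first known value keep None.
    """
    ids = sorted({id_ for id_, _ in rows})
    out = []
    for id_ in ids:
        last = None  # most recent non-null value seen for this id
        for mon in month_range(start, end):
            value = rows.get((id_, mon))
            if value is not None:
                last = value
            out.append((id_, mon, last))
    return out
```

With values known for January and March only, February inherits January's value and April inherits March's.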

How to partition my data by a specific date and another identifier SQL

with cte as
(
select to_date('01-JUN-2020','DD-MON-YYYY')+(level-1) DT
from dual
connect bY level<= 30
)
select *
from cte x
left outer join
(select date from time where emp in (1, 2)) a on x.dt = a.date
In this scenario I am trying to find the missing days that these persons didn't report to work... it works well for 1 person. I get back their missing days correctly. But when I add 2 persons.. I do not get back the correct missing days for them because I'm only joining on date I guess.
I would like to know how I can partition this data by the persons id and date to be able get accurate days that each were missing.
Please help, thanks.
You would typically cross join the list of dates with the list of persons, and then use not exists to pull out the missing person/date tuples:
with cte as (
select date '2020-06-01' + level - 1 dt
from dual
connect by level <= 30
)
select c.dt, e.emp
from cte c
cross join (select distinct emp from times) e
where not exists (
select 1
from times t
where t.emp = e.emp and t.dt = c.dt
)
Note that this uses a literal date rather than to_date(), which is more appropriate here.
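The cross join plus not exists pattern can be tried out with a small sketch against SQLite through Python's sqlite3. This is an illustration under assumptions: a recursive CTE replaces Oracle's connect by level, the times table is assumed to have columns emp and dt (ISO text dates), and the function name missing_days is hypothetical.

```python
import sqlite3

def missing_days(attendance, first_day, n_days):
    """Return (dt, emp) for every day in the window on which emp has no row.

    attendance: iterable of (emp, dt) with dt as 'YYYY-MM-DD' text.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE times (emp INTEGER, dt TEXT)")
    con.executemany("INSERT INTO times VALUES (?, ?)", attendance)
    query = """
        WITH RECURSIVE cte(dt, n) AS (
            -- n counts the generated days, playing the role of LEVEL
            SELECT date(:first_day), 1
            UNION ALL
            SELECT date(dt, '+1 day'), n + 1 FROM cte WHERE n < :n_days
        )
        SELECT c.dt, e.emp
        FROM cte c
        CROSS JOIN (SELECT DISTINCT emp FROM times) e
        WHERE NOT EXISTS (
            SELECT 1 FROM times t WHERE t.emp = e.emp AND t.dt = c.dt
        )
        ORDER BY e.emp, c.dt
    """
    result = con.execute(query, {"first_day": first_day, "n_days": n_days}).fetchall()
    con.close()
    return result
```

Because every date is paired with every person before the existence check, each person's gaps are found independently of the others.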
This gives the missing tuples for all persons at once. If you want just for a predefined list of persons, then:
with cte as (
select date '2020-06-01' + level - 1 dt
from dual
connect by level <= 30
)
select c.dt, e.emp
from cte c
cross join (select 1 emp from dual union all select 2 from dual) e
where not exists (
select 1
from times t
where t.emp = e.emp and t.dt = c.dt
)
If you want to also see the "presence" dates, then use a left join rather than not exists, as in your original query:
with cte as (
select date '2020-06-01' + level - 1 dt
from dual
connect by level <= 30
)
select c.dt, e.emp, -- enumerate the relevant columns from "t" here
from cte c
cross join (select 1 emp from dual union all select 2 from dual) e
left join times t on t.emp = e.emp and t.dt = c.dt

Same output in two different lateral joins

I'm working on a bit of PostgreSQL to grab the first 10 and last 10 invoices of every month between certain dates. I am having unexpected output in the lateral joins. Firstly the limit is not working, and each of the array_agg aggregates is returning hundreds of rows instead of limiting to 10. Secondly, the aggregates appear to be the same, even though one is ordered ASC and the other DESC.
How can I retrieve only the first 10 and last 10 invoices of each month group?
SELECT first.invoice_month,
array_agg(first.id) first_ten,
array_agg(last.id) last_ten
FROM public.invoice i
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id ASC
LIMIT 10
) first ON i.id = first.id
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id DESC
LIMIT 10
) last on i.id = last.id
WHERE i.invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
GROUP BY first.invoice_month, last.invoice_month;
This can be done with a recursive query that generates the range of months for which we need to find the first and last 10 invoices.
WITH RECURSIVE all_months AS (
SELECT date_trunc('month','2018-01-01'::TIMESTAMP) as c_date, date_trunc('month', '2018-05-11'::TIMESTAMP) as end_date, to_char('2018-01-01'::timestamp, 'YYYY-MM') as current_month
UNION
SELECT c_date + interval '1 month' as c_date,
end_date,
to_char(c_date + INTERVAL '1 month', 'YYYY-MM') as current_month
FROM all_months
WHERE c_date + INTERVAL '1 month' <= end_date
),
invocies_with_month as (
SELECT *, to_char(invoice_date::TIMESTAMP, 'YYYY-MM') invoice_month FROM invoice
)
SELECT current_month, array_agg(first_10.id), 'FIRST 10' as type FROM all_months
JOIN LATERAL (
SELECT * FROM invocies_with_month
WHERE all_months.current_month = invoice_month AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
ORDER BY invoice_date ASC limit 10
) first_10 ON TRUE
GROUP BY current_month
UNION
SELECT current_month, array_agg(last_10.id), 'LAST 10' as type FROM all_months
JOIN LATERAL (
SELECT * FROM invocies_with_month
WHERE all_months.current_month = invoice_month AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
ORDER BY invoice_date DESC limit 10
) last_10 ON TRUE
GROUP BY current_month;
In the code above, '2018-01-01' and '2018-05-11' are the dates between which we want to find the invoices. Based on those dates, we generate the months (2018-01, 2018-02, 2018-03, 2018-04, 2018-05) that we need to find the invoices for.
We store this data in all_months.
After we get the months, we do a lateral join in order to join the invoices for every month. We need 2 lateral joins in order to get the first and last 10 invoices.
Finally, the result is represented as:
current_month - the month
array_agg - ids of all selected invoices for that month
type - type of the selected invoices ('FIRST 10' or 'LAST 10').
So in the current implementation, you will have 2 rows for each month (if there is at least 1 invoice for that month). You can easily join that in one row if you need to.
LIMIT is working fine. It's your query that's broken. JOIN is just 100% the wrong tool here; it doesn't even do anything close to what you need. By joining up to 10 rows with up to another 10 rows, you get up to 100 rows back. There's also no reason to self join just to combine filters.
Consider instead window queries. In particular, we have the dense_rank function, which can number every row in the result set according to groups:
SELECT
invoice_month,
time_of_month,
ARRAY_AGG(id) invoice_ids
FROM (
SELECT
id,
invoice_month,
-- Categorize as end or beginning of month
CASE
WHEN month_rank <= 10 THEN 'beginning'
WHEN month_reverse_rank <= 10 THEN 'end'
ELSE 'bug' -- Should never happen. Just a fall back in case of a bug.
END AS time_of_month
FROM (
SELECT
id,
invoice_month,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date) month_rank,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date DESC) month_reverse_rank
FROM (
SELECT
id,
invoice_date,
to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
) AS fiscal_year_invoices
) ranked_invoices
-- Get first and last 10
WHERE month_rank <= 10 OR month_reverse_rank <= 10
) first_and_last_by_month
GROUP BY
invoice_month,
time_of_month
Don't be intimidated by the length. This query is actually very straightforward; it just needed a few subqueries.
This is what it does logically:
Fetch the rows for the fiscal year in question
Assign a "rank" to the row within its month, both counting from the beginning and from the end
Filter out everything that doesn't rank in the 10 top for its month (counting from either direction)
Add an indicator as to whether it was at the beginning or end of the month. (Note that if there are fewer than 20 rows in a month, it will categorize more of them as "beginning".)
Aggregate the IDs together
This is the tool set designed for the job you're trying to do. If really needed, you can adjust this approach slightly to get them into the same row, but you have to aggregate before joining the results together and then join on the month; you can't join and then aggregate.
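The ranking approach can be exercised with a small sketch against SQLite through Python's sqlite3 (window functions need SQLite 3.25 or newer). It is an adaptation under assumptions rather than the answer's exact query: row_number() is used instead of dense_rank() so that ties on invoice_date cannot push a group past n rows, strftime('%Y-%m', ...) stands in for to_char, the aggregation into lists happens in Python instead of array_agg, and the function name first_and_last_per_month is hypothetical.

```python
import sqlite3
from collections import defaultdict

def first_and_last_per_month(invoices, n):
    """Return {month: {'beginning': [ids], 'end': [ids]}} with up to n ids each.

    invoices: iterable of (id, invoice_date) with invoice_date as 'YYYY-MM-DD' text.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE invoice (id INTEGER, invoice_date TEXT)")
    con.executemany("INSERT INTO invoice VALUES (?, ?)", invoices)
    query = """
        SELECT id, invoice_month,
               CASE WHEN month_rank <= :n THEN 'beginning' ELSE 'end' END AS time_of_month
        FROM (
            SELECT id,
                   strftime('%Y-%m', invoice_date) AS invoice_month,
                   row_number() OVER (PARTITION BY strftime('%Y-%m', invoice_date)
                                      ORDER BY invoice_date, id) AS month_rank,
                   row_number() OVER (PARTITION BY strftime('%Y-%m', invoice_date)
                                      ORDER BY invoice_date DESC, id DESC) AS month_reverse_rank
            FROM invoice
        ) AS ranked_invoices
        -- keep only rows near either edge of their month
        WHERE month_rank <= :n OR month_reverse_rank <= :n
        ORDER BY invoice_month, month_rank
    """
    grouped = defaultdict(lambda: {"beginning": [], "end": []})
    for inv_id, month, part in con.execute(query, {"n": n}):
        grouped[month][part].append(inv_id)
    con.close()
    return dict(grouped)
```

As in the SQL above, a month with fewer than 2n invoices has more of them classified as "beginning".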

Hits per day in Google Big Query

I am using Google Big Query to find hits per day. Here is my query,
SELECT COUNT(*) AS Key,
DATE(EventDateUtc) AS Value
FROM [myDataSet.myTable]
WHERE .....
GROUP BY Value
ORDER BY Value DESC
LIMIT 1000;
This works fine, but it ignores dates with 0 hits. I want to include them. I cannot create a temp table in Google Big Query. How can I fix this?
I tested the following and got the error Field 'day' not found.
SELECT COUNT(*) AS Key,
DATE(t.day) AS Value from (
select date(date_add(day, i, "DAY")) day
from (select '2015-05-01 00:00' day) a
cross join
(select
position(
split(
rpad('', datediff(CURRENT_TIMESTAMP(),'2015-05-01 00:00')*2, 'a,'))) i
from (select NULL)) b
) d
left join [sample_data.requests] t on d.day = t.day
GROUP BY Value
ORDER BY Value DESC
LIMIT 1000;
You can only query data that exists in your tables; the query cannot guess which dates are missing. You need to handle this either in your programming language, or by joining with a numbers table and generating the dates on the fly.
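For the first option, handling it in the programming language, a minimal Python sketch could look like the following. It assumes the query results have already been reduced to a list of event dates; the function name hits_per_day is hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

def hits_per_day(event_dates, start, end):
    """Return (day, hits) for every day from start to end inclusive, including days with 0 hits."""
    counts = Counter(event_dates)
    n_days = (end - start).days + 1
    result = []
    for offset in range(n_days):
        day = start + timedelta(days=offset)
        result.append((day, counts.get(day, 0)))  # 0 for days with no events
    return result
```

The date range is generated independently of the data, so days without any events still appear in the output.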
If you know the date range you have in your query, you can generate the days:
select date(date_add(day, i, "DAY")) day
from (select '2015-01-01' day) a
cross join
(select
position(
split(
rpad('', datediff('2015-01-15','2015-01-01')*2, 'a,'))) i
from (select NULL)) b;
Then you can join this result with your query table:
SELECT COUNT(t.day) AS Key,
DATE(d.day) AS Value from (...the.above.query.pasted.here...) d
left join [myDataSet.myTable] t on d.day = t.day
WHERE .....
GROUP BY Value
ORDER BY Value DESC
LIMIT 1000;

Total Count of Active Employees by Date

I have in the past written queries that give me counts by date (hires, terminations, etc...) as follows:
SELECT per.date_start AS "Date",
COUNT(peo.EMPLOYEE_NUMBER) AS "Hires"
FROM hr.per_all_people_f peo,
hr.per_periods_of_service per
WHERE per.date_start BETWEEN peo.effective_start_date AND peo.EFFECTIVE_END_DATE
AND per.date_start BETWEEN :PerStart AND :PerEnd
AND per.person_id = peo.person_id
GROUP BY per.date_start
I was now looking to create a count of active employees by date, however I am not sure how I would date the query as I use a range to determine active as such:
SELECT COUNT(peo.EMPLOYEE_NUMBER) AS "CT"
FROM hr.per_all_people_f peo
WHERE peo.current_employee_flag = 'Y'
and TRUNC(sysdate) BETWEEN peo.effective_start_date AND peo.EFFECTIVE_END_DATE
Here is a simple way to get started. This works for all the effective and end dates in your data:
select thedate,
SUM(num) over (order by thedate) as numActives
from ((select effective_start_date as thedate, 1 as num from hr.per_periods_of_service) union all
(select effective_end_date as thedate, -1 as num from hr.per_periods_of_service)
) dates
It works by adding one person for each start and subtracting one for each end (via num) and doing a cumulative sum. This might produce duplicate dates, so you might also do an aggregation to eliminate those duplicates:
select thedate, max(numActives)
from (select thedate,
SUM(num) over (order by thedate) as numActives
from ((select effective_start_date as thedate, 1 as num from hr.per_periods_of_service) union all
(select effective_end_date as thedate, -1 as num from hr.per_periods_of_service)
) dates
) t
group by thedate;
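The +1/-1 running-sum idea can be sketched in a few lines of plain Python. This is an illustration under assumptions: the periods are given as (start, end) pairs, an end of None means the person is still active, and the function name active_counts is hypothetical. As in the query above, a person stops counting on the end date itself.

```python
from collections import defaultdict
from itertools import accumulate

def active_counts(periods):
    """Return [(date, active_count)] for every date on which the count changes.

    periods: iterable of (start, end); end may be None for open-ended periods.
    """
    delta = defaultdict(int)
    for start, end in periods:
        delta[start] += 1        # one more active person from this date
        if end is not None:
            delta[end] -= 1      # one fewer from the end date
    days = sorted(delta)
    running = accumulate(delta[day] for day in days)  # cumulative sum in date order
    return list(zip(days, running))
```

Netting the changes per date before accumulating plays the role of the group by in the query, so each date appears once.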
If you really want all dates, then it is best to start with a calendar table, and use a simple variation on your original query:
select c.thedate, count(pos.effective_start_date) as NumActives
from calendar c left outer join
hr.per_periods_of_service pos
on c.thedate between pos.effective_start_date and pos.effective_end_date
group by c.thedate;
If you want to count all employees who were active during the entire input date range:
SELECT COUNT(peo.EMPLOYEE_NUMBER) AS "CT"
FROM hr.per_all_people_f peo
WHERE peo.EFFECTIVE_START_DATE <= :StartDate
AND (peo.EFFECTIVE_END_DATE IS NULL OR peo.EFFECTIVE_END_DATE >= :EndDate)
Here is my example based on Gordon Linoff's answer, with a small modification: in the subtracting part, every record appeared with -1 in num even when its end date was NULL, so rows without an end date are now excluded.
use AdventureWorksDW2012 -- selects the database to work with in MS SSMS;
-- may not work on other platforms
select
t.thedate
,max(t.numActives) AS "Total Active Employees"
from (
select
dates.thedate
,SUM(dates.num) over (order by dates.thedate) as numActives
from
(
(
select
StartDate as thedate
,1 as num
from DimEmployee
)
union all
(
select
EndDate as thedate
,-1 as num
from DimEmployee
where EndDate IS NOT NULL
)
) AS dates
) AS t
group by thedate
ORDER BY thedate
This worked for me; I hope it helps somebody.
I was able to get the results I was looking for with the following:
--Active Team Members by Date
SELECT "a_date",
COUNT(peo.EMPLOYEE_NUMBER) AS "CT"
FROM hr.per_all_people_f peo,
(SELECT DATE '2012-04-01'-1 + LEVEL AS "a_date"
FROM dual
CONNECT BY LEVEL <= DATE '2012-04-30'+2 - DATE '2012-04-01'-1
)
WHERE peo.current_employee_flag = 'Y'
AND "a_date" BETWEEN peo.effective_start_date AND peo.EFFECTIVE_END_DATE
GROUP BY "a_date"
ORDER BY "a_date"