SQL - Vertica: How to generate daily rows with most previous date data - sql

I have a base table like below:
score_upd (Upd_dt,Url,Score) AS (
SELECT DATE '2019-07-26','A','x'
UNION ALL SELECT DATE '2019-07-26','B','alpha'
UNION ALL SELECT DATE '2019-08-01','A','y'
UNION ALL SELECT DATE '2019-08-01','B','beta'
UNION ALL SELECT DATE '2019-08-03','A','z'
UNION ALL SELECT DATE '2019-08-03','B','gamma'
)
Upd_dt URL Score
2019-07-26 A x
2019-07-26 B alpha
2019-08-01 A y
2019-08-01 B beta
2019-08-03 A z
2019-08-03 B gamma
I want to create a table at the daily-URL level, where each new row takes its value from the most recent earlier date. The result should look like this:
score_upd (Upd_dt,Url,Score) AS (
SELECT DATE '2019-07-26','A','x'
UNION ALL SELECT DATE '2019-07-26','B','alpha'
UNION ALL SELECT DATE '2019-07-27','A','x'
UNION ALL SELECT DATE '2019-07-27','B','alpha'
UNION ALL SELECT DATE '2019-07-28','A','x'
UNION ALL SELECT DATE '2019-07-28','B','alpha'
UNION ALL SELECT DATE '2019-07-29','A','x'
UNION ALL SELECT DATE '2019-07-29','B','alpha'
UNION ALL SELECT DATE '2019-07-30','A','x'
UNION ALL SELECT DATE '2019-07-30','B','alpha'
UNION ALL SELECT DATE '2019-07-31','A','x'
UNION ALL SELECT DATE '2019-07-31','B','alpha'
UNION ALL SELECT DATE '2019-08-01','A','y'
UNION ALL SELECT DATE '2019-08-01','B','beta'
UNION ALL SELECT DATE '2019-08-02','A','y'
UNION ALL SELECT DATE '2019-08-02','B','beta'
UNION ALL SELECT DATE '2019-08-03','A','z'
UNION ALL SELECT DATE '2019-08-03','B','gamma'
UNION ALL SELECT DATE '2019-08-04','A','z'
UNION ALL SELECT DATE '2019-08-04','B','gamma'
UNION ALL SELECT DATE '2019-08-05','A','z'
UNION ALL SELECT DATE '2019-08-05','B','gamma'
)
Which looks like:
Upd_dt URL Score
2019-07-26 A x
2019-07-26 B alpha
2019-07-27 A x
2019-07-27 B alpha
2019-07-28 A x
2019-07-28 B alpha
2019-07-29 A x
2019-07-29 B alpha
2019-07-30 A x
2019-07-30 B alpha
2019-07-31 A x
2019-07-31 B alpha
2019-08-01 A y
2019-08-01 B beta
2019-08-02 A y
2019-08-02 B beta
2019-08-03 A z
2019-08-03 B gamma
2019-08-04 A z
2019-08-04 B gamma
2019-08-05 A z
2019-08-05 B gamma
.
.
.
Current process is:
I built a daily dimension table since 7/26/2019 till today by:
/*
SELECT CAST(slice_time AS DATE) dates
FROM testcalendar mtc
TIMESERIES slice_time as '1 day'
OVER (ORDER BY CAST(mtc.dates as TIMESTAMP));
*/
so I get:
Dates
2019-07-26
2019-07-27
2019-07-28
2019-07-29
.
.
.
2019-10-12 (today)
I thought I could use a function such as "interpolate previous value" to join my first table to this calendar by date, so that missing days are filled with the values from the most recent earlier date, but it failed.
The result did not contain rows for the missing days.
Please let me know if anyone has any better idea on this.
Thanks!

As a starting warning: only store a "daily photograph" when it really, really is necessary. I once ended up with 364 rows too many per year, because the values only changed once a year. In Vertica, that costs license volume, plus CPU and clock time for joining and grouping ...
But, for the rest - Good start.
But you could apply the TIMESERIES without having to build a calendar.
The trick is to "extrapolate" manually what you can INTERPOLATE automatically.
Add an in-line 'padding' table, which contains the newest value per URL, but give it CURRENT_DATE instead of the newest actual date, using Vertica's peculiar analytic limit clause LIMIT 1 OVER(PARTITION BY url ORDER BY upd_dt DESC).
UNION ALL that padding table with your input, and apply the TIMESERIES clause to the combined result.
Like so:
WITH
-- your input ...
score_upd (Upd_dt,Url,Score) AS (
SELECT DATE '2019-07-26','A','x'
UNION ALL SELECT DATE '2019-07-26','B','alpha'
UNION ALL SELECT DATE '2019-08-01','A','y'
UNION ALL SELECT DATE '2019-08-01','B','beta'
UNION ALL SELECT DATE '2019-08-03','A','z'
UNION ALL SELECT DATE '2019-08-03','B','gamma'
)
-- real WITH clause would start here ...
,
-- newest row per Url, just with current date
pad_newest AS (
SELECT
CURRENT_DATE
, url
, score
FROM score_upd
LIMIT 1 OVER(PARTITION BY url ORDER BY upd_dt DESC)
)
,
with_newest AS (
SELECT
*
FROM score_upd
UNION ALL
SELECT *
FROM pad_newest
)
SELECT
ts_dt::DATE AS upd_dt
, url AS url
, TS_FIRST_VALUE(score) AS score
FROM with_newest
TIMESERIES ts_dt AS '1 day' OVER (
PARTITION BY url ORDER BY upd_dt::TIMESTAMP
)
ORDER BY 1,2
;
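To check what the gap-filled result should contain, here is a minimal Python sketch of the same "carry the previous value forward" logic. It is not the Vertica query above. The function name fill_daily and the fixed end date of 2019-08-05 (standing in for CURRENT_DATE) are assumptions made for illustration only.

```python
from datetime import date, timedelta

def fill_daily(rows, end_date):
    """Expand (upd_dt, url, score) rows to one row per url per day,
    carrying the most recent score forward up to end_date (inclusive)."""
    by_url = {}
    for upd_dt, url, score in rows:
        by_url.setdefault(url, {})[upd_dt] = score

    filled = []
    for url, updates in by_url.items():
        day = min(updates)          # start at the first known date for this url
        current = None
        while day <= end_date:
            # keep the previous score on days without an update
            current = updates.get(day, current)
            filled.append((day, url, current))
            day += timedelta(days=1)
    return sorted(filled)

score_upd = [
    (date(2019, 7, 26), "A", "x"),
    (date(2019, 7, 26), "B", "alpha"),
    (date(2019, 8, 1), "A", "y"),
    (date(2019, 8, 1), "B", "beta"),
    (date(2019, 8, 3), "A", "z"),
    (date(2019, 8, 3), "B", "gamma"),
]

# end_date plays the role of CURRENT_DATE in the padding row (assumed value)
daily = fill_daily(score_upd, end_date=date(2019, 8, 5))
for row in daily:
    print(*row)
```

With that end date, the sketch yields 11 days for each of the two URLs, matching the desired output listed in the question.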

Related

Getting last 4 months data from given date column some months data is missing

I have below data
Record_date ID
28-feb-2022 xyz
31-Jan-2022 ABC
30-nov-2022 jkl
31-oct-2022 dcs
I want to get last 3 months data from given date column. We don't have to consider the missing month.
Output should be:
Record_date ID
28-feb-2022 xyz
31-Jan-2022 ABC
30-nov-2022 jkl
In the last 3 months Dec is missing but we have to ignore it as the data is not available. Tried many things but not working.
Any suggestions?
Assuming you are using Oracle, you can use the Oracle ADD_MONTHS function to filter the data.
--- untested
-- Assumption Record_date is a date column
SELECT * FROM table1
where Record_date > ADD_MONTHS(SYSDATE, -3)
To get the data for the three months that are latest in the table, you can use:
SELECT record_date,
id
FROM (
SELECT t.*,
DENSE_RANK() OVER (ORDER BY TRUNC(Record_date, 'MM') DESC) AS rnk
FROM table_name t
)
WHERE rnk <= 3;
Which, for the sample data:
CREATE TABLE table_name (Record_date, ID) AS
SELECT DATE '2022-02-28', 'xyz' FROM DUAL UNION ALL
SELECT DATE '2022-01-31', 'ABC' FROM DUAL UNION ALL
SELECT DATE '2022-11-30', 'jkl' FROM DUAL UNION ALL
SELECT DATE '2022-10-31', 'dcs' FROM DUAL;
Outputs:
RECORD_DATE ID
2022-11-30 00:00:00 jkl
2022-10-31 00:00:00 dcs
2022-02-28 00:00:00 xyz
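The ranking idea can also be sketched in Python: collect the distinct months present in the data, keep the three most recent, and return the rows that fall in them. Months with no rows never get a rank, so they are skipped automatically. The function name latest_months is hypothetical; the rows are the sample data from above.

```python
from datetime import date

def latest_months(rows, n=3):
    """Keep rows whose (year, month) is among the n most recent months present in the data."""
    months = sorted({(d.year, d.month) for d, _ in rows}, reverse=True)[:n]
    keep = set(months)
    # newest rows first, like ordering by Record_date DESC
    return sorted((r for r in rows if (r[0].year, r[0].month) in keep), reverse=True)

table_name = [
    (date(2022, 2, 28), "xyz"),
    (date(2022, 1, 31), "ABC"),
    (date(2022, 11, 30), "jkl"),
    (date(2022, 10, 31), "dcs"),
]

result = latest_months(table_name)
for record_date, id_ in result:
    print(record_date, id_)
```

For the sample data as typed (with November and October in 2022), this keeps jkl, dcs and xyz, the same rows as the DENSE_RANK query.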

Calculate working hours between exit date and 3 months before exit date bigquery sql

I'm trying to calculate the total working hours between two dates in bigquery sql:
The dates being between MAX(date) and DATE_SUB(MAX(date), interval 3 month).
In other words, I want to know the sum of working hours between the exit date and 3 months prior to the exit date.
The current table is something like this:
id date hours
abc 2020-10-01 12
abc 2020-12-07 4
abc 2020-12-12 12
abc 2020-12-25 6
abc 2021-01-07 9
abc 2021-02-04 7
The ideal output is:
id hours
abc 38
I have multiple workers and workers have different working dates and hours.
We need a subquery here to calculate exit_date first:
with mytable as (
select 'abc' as id, DATE '2020-10-01' as date, 12 as hours union all
select 'abc' as id, DATE '2020-12-07' as date, 4 as hours union all
select 'abc' as id, DATE '2020-12-12' as date, 12 as hours union all
select 'abc' as id, DATE '2020-12-25' as date, 6 as hours union all
select 'abc' as id, DATE '2021-01-07' as date, 9 as hours union all
select 'abc' as id, DATE '2021-02-04' as date, 7 as hours
)
select
id,
sum(hours) as hours
from (
select *, MAX(date) OVER (PARTITION BY id) as exit_date
from mytable
)
where date >= DATE_SUB(exit_date, INTERVAL 3 MONTH)
group by id
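As a cross-check of the expected total, here is a small Python sketch of the same logic: find each worker's exit date, step back three calendar months, and sum the hours from that point on. The helper names sub_months and hours_before_exit are hypothetical. The month arithmetic clamps the day to the end of the target month, which is my understanding of how DATE_SUB handles month intervals.

```python
import calendar
from datetime import date

def sub_months(d, months):
    """Shift d back by whole months, clamping the day to the end of the target month."""
    index = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))

def hours_before_exit(rows, months=3):
    """Sum hours per id between (exit date - months) and the exit date, inclusive."""
    exit_dates = {}
    for id_, day, _ in rows:
        exit_dates[id_] = max(day, exit_dates.get(id_, day))

    totals = {}
    for id_, day, hours in rows:
        if day >= sub_months(exit_dates[id_], months):
            totals[id_] = totals.get(id_, 0) + hours
    return totals

mytable = [
    ("abc", date(2020, 10, 1), 12),
    ("abc", date(2020, 12, 7), 4),
    ("abc", date(2020, 12, 12), 12),
    ("abc", date(2020, 12, 25), 6),
    ("abc", date(2021, 1, 7), 9),
    ("abc", date(2021, 2, 4), 7),
]

totals = hours_before_exit(mytable)
print(totals)
```

For worker abc the exit date is 2021-02-04, so the window starts on 2020-11-04 and the 2020-10-01 row is excluded, giving 4 + 12 + 6 + 9 + 7 = 38.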

Google Big query ML ARIMA is not forecasting correctly

I have input data as shown below (actual data removed). I am trying to forecast the next 3 months of VAL using almost 14 months of data (the frequency is monthly). For the data below, all of the forecasted values come out the same.
In the model evaluation, I get FALSE for every value in the 'Has_Drift' column, and every AIC value is negative.
Can anyone help? What is missing that makes it difficult to forecast 3 months ahead?
Sample input and output below.
CREATE OR REPLACE MODEL <MODEL_NAME>
OPTIONS(MODEL_TYPE = 'ARIMA',
time_series_timestamp_col='DATE_COL',
time_series_data_col='VAL',
DATA_FREQUENCY = 'MONTHLY') AS
SELECT CAST(DATE_COL AS DATE) AS DATE_COL, VAL FROM (
SELECT '2019-06-07' DATE_COL ,0.09262066947 VAL union all
SELECT '2019-07-07',0.07495576437 union all
SELECT '2019-08-07',0.09832972783 union all
SELECT '2019-09-07',0.09959302865 union all
SELECT '2019-10-07',0.1445173433 union all
SELECT '2019-11-07',0.1116498012 union all
SELECT '2019-12-07',0.1065453852 union all
SELECT '2020-01-07',0.1403350342 union all
SELECT '2020-02-07',0.105060523 union all
SELECT '2020-03-07',0.2191159052 union all
SELECT '2020-04-07',0.07962838894 union all
SELECT '2020-05-07',0.131412274 union all
SELECT '2020-06-07',0.173012701 union all
SELECT '2020-07-07',0.1504522412 union all
SELECT '2020-08-07',0.1073950999
)
Forecast Values
SELECT forecast_timestamp, forecast_value
FROM ML.FORECAST(MODEL <MODEL_NAME>,
STRUCT(4 AS horizon, 0.6 AS confidence_level))
Row forecast_timestamp forecast_value
1 2020-08-30 00:00:00 UTC 0.12230825916400008
2 2020-09-29 00:00:00 UTC 0.12230825916400008
3 2020-10-29 00:00:00 UTC 0.12230825916400008
4 2020-11-28 00:00:00 UTC 0.12230825916400008

SQL Select only missing months

Notice the 2017-04-01, 2018-02-01, 2018-07-01, and 2019-01-01 months are missing in the output. I want to show only those months which are missing. Does anyone know how to go about this?
Query:
SELECT TO_DATE("Month", 'mon''yy') as dates FROM sample_sheet
group by dates
order by dates asc;
Output:
2017-01-01
2017-02-01
2017-03-01
2017-05-01
2017-06-01
2017-07-01
2017-08-01
2017-09-01
2017-10-01
2017-11-01
2017-12-01
2018-01-01
2018-03-01
2018-04-01
2018-05-01
2018-06-01
2018-08-01
2018-09-01
2018-10-01
2018-11-01
2018-12-01
2019-02-01
2019-03-01
2019-04-01
I don't know Vertica, so I wrote a working proof of concept in Microsoft SQL Server and tried to convert it to Vertica syntax based on the online documentation.
It should look like this:
with
months as (
select 2017 as date_year, 1 as date_month, to_date('2017-01-01', 'YYYY-MM-DD') as first_date, to_date('2017-01-31', 'yyyy-mm-dd') as last_date
union all
select
year(add_months(first_date, 1)) as date_year,
month(add_months(first_date, 1)) as date_month,
add_months(first_date, 1) as first_date,
last_day(add_months(first_date, 1)) as last_date
from months
where first_date < current_date
),
sample_dates (a_date) as (
select to_date('2017-01-15', 'YYYY-MM-DD') union all
select to_date('2017-01-22', 'YYYY-MM-DD') union all
select to_date('2017-02-01', 'YYYY-MM-DD') union all
select to_date('2017-04-15', 'YYYY-MM-DD') union all
select to_date('2017-06-15', 'YYYY-MM-DD')
)
select *
from sample_dates right join months on sample_dates.a_date between first_date and last_date
where sample_dates.a_date is null
Months is a recursive dynamic table that holds all months since 2017-01, with first and last day of the month. sample_dates is just a list of dates to test the logic - you should replace it with your own table.
Once you build that monthly calendar table all you need to do is check your dates against it using an outer query to see what dates are not between any of those periods between first_date and last_date columns.
You can build a TIMESERIES of all dates between the first input date and the last input date. The largest time unit a TIMESERIES slice can use is the day, because the slice length must be an INTERVAL DAY TO SECOND, so you cannot generate months directly. Instead, keep only the first day of each month from that daily series, then left join the resulting sequence of month firsts with your input and check for NULLs from the input branch of the join to find where the join fails:
WITH
-- your input
input(mth1st) AS (
SELECT DATE '2017-01-01'
UNION ALL SELECT DATE '2017-02-01'
UNION ALL SELECT DATE '2017-03-01'
UNION ALL SELECT DATE '2017-05-01'
UNION ALL SELECT DATE '2017-06-01'
UNION ALL SELECT DATE '2017-07-01'
UNION ALL SELECT DATE '2017-08-01'
UNION ALL SELECT DATE '2017-09-01'
UNION ALL SELECT DATE '2017-10-01'
UNION ALL SELECT DATE '2017-11-01'
UNION ALL SELECT DATE '2017-12-01'
UNION ALL SELECT DATE '2018-01-01'
UNION ALL SELECT DATE '2018-03-01'
UNION ALL SELECT DATE '2018-04-01'
UNION ALL SELECT DATE '2018-05-01'
UNION ALL SELECT DATE '2018-06-01'
UNION ALL SELECT DATE '2018-08-01'
UNION ALL SELECT DATE '2018-09-01'
UNION ALL SELECT DATE '2018-10-01'
UNION ALL SELECT DATE '2018-11-01'
UNION ALL SELECT DATE '2018-12-01'
UNION ALL SELECT DATE '2019-02-01'
UNION ALL SELECT DATE '2019-03-01'
UNION ALL SELECT DATE '2019-04-01'
)
,
-- need a series of month's firsts
-- TIMESERIES works for INTERVAL DAY TO SECOND
-- so build that timeseries, and filter out
-- the month's firsts
limits(mth1st) AS (
SELECT MIN(mth1st) FROM input
UNION ALL SELECT MAX(mth1st) FROM input
)
,
alldates AS (
SELECT dt::DATE FROM limits
TIMESERIES dt AS '1 day' OVER(ORDER BY mth1st::TIMESTAMP)
)
,
allfirsts(mth1st) AS (
SELECT dt FROM alldates WHERE DAY(dt)=1
)
SELECT
allfirsts.mth1st
FROM allfirsts
LEFT JOIN input USING(mth1st)
WHERE input.mth1st IS NULL;
-- out mth1st
-- out ------------
-- out 2017-04-01
-- out 2018-02-01
-- out 2018-07-01
-- out 2019-01-01
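The same "generate every month, then keep those without a match" idea can be sketched in a few lines of Python, which may help to verify the expected result. The function name missing_months is hypothetical, and the sketch assumes every input value is already the first day of its month, as in the output shown in the question.

```python
from datetime import date

def missing_months(firsts):
    """Return the month firsts between min and max of the input that are absent from it."""
    present = set(firsts)
    current, last = min(present), max(present)
    missing = []
    while current <= last:
        if current not in present:
            missing.append(current)
        # step to the first day of the next month, rolling over the year after December
        current = date(current.year + current.month // 12, current.month % 12 + 1, 1)
    return missing

# months present in the question's output, per year
present_by_year = {
    2017: [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12],
    2018: [1, 3, 4, 5, 6, 8, 9, 10, 11, 12],
    2019: [2, 3, 4],
}
input_firsts = [date(y, m, 1) for y, months in present_by_year.items() for m in months]

gaps = missing_months(input_firsts)
for d in gaps:
    print(d)
```

On the question's data this prints 2017-04-01, 2018-02-01, 2018-07-01 and 2019-01-01, the four months the question asks for.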

Next 5 Available Dates

I wonder if anyone could tell me how I can get the next 5 available dates using a table which only stores weekend dates and bank holiday dates. The query has to select the next 5 days that do not collide with any date in the table.
I would like to see the following results from this list of dates:
07/11/2015 (Saturday)
08/11/2015 (Sunday)
09/11/2015 (Holiday)
14/11/2015 (Saturday)
15/11/2015 (Sunday)
Results:
05/11/2015 (Thursday)
06/11/2015 (Friday)
10/11/2015 (Tuesday)
11/11/2015 (Wednesday)
12/11/2015 (Thursday)
Based on limited information, here's a quick hack:
with offsets(n) as (
select 1 union all
select 2 union all
select 3 union all
select 4 union all
select 5 union all
select 6 union all
select 7 union all
select 8 union all
select 9 union all
select 10 union all
select 11
)
select top 5 dateadd(dd, n, cast(getdate() as date)) as dt from offsets
where dateadd(dd, n, cast(getdate() as date)) not in (
select dt from <exclude_dates>
)
order by dt
A possible solution is to create a table of all possible dates in a year.
select top 5 date
from possible_dates
where date not in
(select date from unavailable_dates)
and date > [insert startdate here]
order by date
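Both answers boil down to walking forward one day at a time and skipping anything in the exclusion table. A minimal Python sketch of that loop follows. The function name next_available is hypothetical, and the start date of 04/11/2015 is an assumption, since the question does not state it.

```python
from datetime import date, timedelta

def next_available(start, unavailable, count=5):
    """Return the first `count` dates strictly after `start` that are not in `unavailable`."""
    blocked = set(unavailable)
    found = []
    day = start
    while len(found) < count:
        day += timedelta(days=1)
        if day not in blocked:
            found.append(day)
    return found

# weekend and bank holiday dates from the question
unavailable_dates = [
    date(2015, 11, 7),   # Saturday
    date(2015, 11, 8),   # Sunday
    date(2015, 11, 9),   # Holiday
    date(2015, 11, 14),  # Saturday
    date(2015, 11, 15),  # Sunday
]

# assumed start date: the day before the first expected result
available = next_available(date(2015, 11, 4), unavailable_dates)
for d in available:
    print(d.strftime("%d/%m/%Y (%A)"))
```

Unlike the fixed list of 11 offsets in the first query, the loop keeps going until it has found enough dates, so a long run of excluded days cannot leave it short.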