Add rows between two dates in Presto SQL

I have a table that has three columns: start, end, and emp_num. I want to generate a new table that has every date between start and end for each employee. This needs to be done in Presto.
I referred to this link: inserting dates into a table between a start and end date in Presto
I tried using the unnest function on a generated sequence, but I don't know how to build the sequence by pulling dates from two columns of another table:
select unnest(seq) as t(days)
from (select sequence(start, end, interval '1' day) as seq
from table1)
Here's table and expected format
Table 1:
start | end | emp_num
2018/01/01 | 2018/01/05 | 1
2019/02/01 | 2019/02/05 | 2
Expected:
start | emp_num
2018/01/01 | 1
2018/01/02 | 1
2018/01/03 | 1
2018/01/04 | 1
2018/01/05 | 1
2019/02/01 | 2
2019/02/02 | 2
2019/02/03 | 2
2019/02/04 | 2
2019/02/05 | 2

Here is a query that might get the job done for your use case.
The logic is to use the Presto sequence() function to generate a wide date range (from the year 2000 through 2019 here; adapt as needed) that can be joined with the table to generate the output.
select dt.x, emp_num
from
  ( select x from unnest(sequence(date '2000-01-01', date '2019-12-31')) t(x) ) dt
inner join table1 ta on dt.x >= ta."start" and dt.x <= ta."end" -- "end" is a reserved word in Presto and must be quoted
However, as JNevill commented, it would be more efficient to create a calendar table rather than generating it on the fly every time the query runs.
It should be as simple as:
create table calendar as
select x from unnest(sequence(date '1970-01-01', date '2099-01-01')) t(x);
Note that Presto limits sequence() to 10,000 elements, so a range this wide (about 47,000 days) would need to be built in smaller chunks and unioned together.
And then your query would become:
select dt.x, emp_num
from calendar dt
inner join table1 ta on dt.x >= ta."start" and dt.x <= ta."end"
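As a side note, the asker's original unnest idea can also be made to work row by row, without a synthetic wide range. A sketch (untested, assuming start and end are DATE columns; again end must be quoted, and UNNEST belongs in a CROSS JOIN):
select d.day as "start", t.emp_num
from table1 t
cross join unnest(sequence(t."start", t."end", interval '1' day)) as d(day)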
PS: due to the lack of DB Fiddles for Presto in the wild, I could not test the queries (#PiotrFindeisen, if you happen to read this, a Presto fiddle would be nice to have!).

Related

Compare one row of a table to every row of a second table

I am trying to retrieve the number of days between a random date and the next known date for a holiday. Let's say my first table looks like this:
date | is_holiday | zone
9/11/18 | 0 | A
22/12/18 | 1 | A
and my holidays table looks like this:
start_date | end_date | zone
20/12/18 | 04/01/19 | A
21/12/18 | 04/01/19 | B
...
I want to be able to know how many days there are between an entry that is not a holiday in the first table and the next holiday date.
I have tried to get the next row with a later date in a join clause, but a plain join isn't the right tool for this task. I have also tried grouping by date and comparing the date with the next row, but I can have multiple entries with the same date in the first table, so that doesn't work.
This is the join clause I have tried:
SELECT mai.*, vac.start_date, datediff(vac.start_date, mai.date)
FROM (SELECT *
FROM MAIN
WHERE is_holiday = 0
) mai LEFT JOIN
(SELECT start_date, zone
FROM VACATIONS_UPDATED
ORDER BY start_date
) vac
ON mai.date < vac.start_date AND mai.zone = vac.zone
I expect to get a table looking like this:
date | is_holiday | zone | next_holiday
9/11/18 | 0 | A | 11
22/12/18 | 1 | A | 0
Any lead on how to achieve this?
It might get messy to do it in SQL, but in case you are open to doing it from code, here is what it could look like. You basically need a crossJoin:
Dataset<Row> table1 = <readData>
Dataset<Row> holidays = <readData>
// then cache the small table to get the best performance
holidays.cache();
table1.alias("table1").crossJoin(holidays.alias("holidays"))
    .filter("table1.zone == holidays.zone AND table1.date < holidays.start_date")
    .select("table1.*", "holidays.start_date")
    .withColumn("nextHoliday", functions.datediff(functions.col("start_date"), functions.col("date")));
In scenarios where one row from table1 matches multiple holidays, you can add an id column to table1 and then group the result of the crossJoin.
// add unique id to the rows
table1 = table1.withColumn("id", functions.monotonically_increasing_id() )
Some details on crossJoins:
http://kirillpavlov.com/blog/2016/04/23/beyond-traditional-join-with-apache-spark/
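If you would rather stay in SQL, the original join is close; aggregating down to the nearest upcoming holiday avoids the multiple-match problem. A sketch (untested, using the two-argument DATEDIFF(end, start) the question already relies on):
SELECT mai.date, mai.is_holiday, mai.zone,
       CASE WHEN mai.is_holiday = 1 THEN 0
            ELSE MIN(DATEDIFF(vac.start_date, mai.date))
       END AS next_holiday
FROM MAIN mai
LEFT JOIN VACATIONS_UPDATED vac
  ON mai.zone = vac.zone
 AND mai.date < vac.start_date
GROUP BY mai.date, mai.is_holiday, mai.zone;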

How to sum up a database field value based on date in Oracle

I have a table with two fields, TODAY (a date) and TOT_HITS, as in the setup below.
How can I create a view that sums the TOT_HITS values up to and including each date appearing in the TODAY column?
Use an analytic function to perform the query with only a single table scan:
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE your_table( today, tot_hits ) As
SELECT DATE '2018-01-16', 5498 FROM DUAL UNION ALL
SELECT DATE '2018-01-17', 4235 FROM DUAL;
Query 1:
SELECT t.*,
SUM( tot_hits ) OVER ( ORDER BY today ) AS tot_hits_to_date
FROM your_table t
Results:
| TODAY | TOT_HITS | TOT_HITS_TO_DATE |
|----------------------|----------|------------------|
| 2018-01-16T00:00:00Z | 5498 | 5498 |
| 2018-01-17T00:00:00Z | 4235 | 9733 |
Just try this (the TillDate = ... alias form is SQL Server syntax; Oracle needs AS):
SELECT
    Today,
    Tot_Hits,
    Tot_Hits + NVL((SELECT SUM(Tot_Hits) FROM YourTable WHERE Today < T.Today), 0) AS TillDate
FROM YourTable T

How to fill in empty date rows multiple times?

I am trying to fill in dates with empty data so that my query returns every date and does not skip any.
My application needs to count bookings for activities by date in a report, and I cannot have skipped dates in what is returned by my SQL.
I am trying to use a date table (I have a table with every date from 1/1/2000 to 12/31/2030) to accomplish this by doing a RIGHT OUTER JOIN on this date table, which works when dealing with one set of activities. But I have multiple sets of activities, each needing its own full range of dates regardless of whether there were bookings on a given date.
I also have a function (DateRange) I found that allows for this:
SELECT IndividualDate FROM DateRange('d', '11/01/2017', '11/10/2018')
Let me give an example of what I am getting and what I want to get:
BAD: Without empty date rows:
date | activity_id | bookings
-----------------------------
1/2 | 1 | 5
1/4 | 1 | 4
1/3 | 2 | 6
1/4 | 2 | 2
GOOD: With empty date rows:
date | activity_id | bookings
-----------------------------
1/2 | 1 | 5
1/3 | 1 | NULL
1/4 | 1 | 4
1/2 | 2 | NULL
1/3 | 2 | 6
1/4 | 2 | 2
I hope this makes sense. I get the whole point of joining to a table of just a list of dates OR using the DateRange table function. But neither gets me the "GOOD" result above.
Use a cross join to generate the rows and then left join to fill in the values:
select d.date, a.activity_id, t.bookings
from DateRange('d', '2017-11-01', '2018-11-10') d cross join
     (select distinct activity_id from t) a left join
     t
     on t.date = d.date and t.activity_id = a.activity_id;
It is a bit hard to follow what your data is and what comes from the function. But the idea is the same, wherever the data comes from.
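With the asker's persistent date table instead of the function, the same pattern would look like this (a sketch; the names dates for the date table and bookings for the data table are assumed for illustration):
select d.date, a.activity_id, b.bookings
from dates d cross join
     (select distinct activity_id from bookings) a left join
     bookings b
     on b.date = d.date and b.activity_id = a.activity_id
where d.date between '2017-11-01' and '2018-11-10';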
I figured it out:
SELECT TOP 100 PERCENT masterlist.dt, masterlist.activity_id,
       count(r_activity_sales_bymonth.bookings) AS totalbookings
FROM (SELECT c.activity_id, dateadd(d, b.incr, '2016-12-31') AS dt
      FROM (SELECT TOP 365 incr = row_number() OVER (ORDER BY object_id, column_id), *
            FROM (SELECT a.object_id, a.column_id
                  FROM sys.all_columns a CROSS JOIN
                       sys.all_columns b) AS a) AS b CROSS JOIN
           (SELECT DISTINCT activity_id
            FROM r_activity_sales_bymonth) AS c) AS masterlist LEFT OUTER JOIN
     r_activity_sales_bymonth
     ON masterlist.dt = r_activity_sales_bymonth.purchase_date
    AND masterlist.activity_id = r_activity_sales_bymonth.activity_id
GROUP BY masterlist.dt, masterlist.activity_id
ORDER BY masterlist.dt, masterlist.activity_id

Using generate_series in PostgreSQL query

I have a table with data, and I want to fill the date gaps using generate_series in PostgreSQL, but I can't solve it. How can I join my table to the generate_series result? My attempt is the following:
SELECT series AS time,
data
FROM generate_series('2012-11-11', '2012-11-12', '2 minutes'::interval) AS series
LEFT JOIN data_table ON data_table.time = series
In this date range the result is:
TIME DATA
2012.11.11. 13:00:06 | data 1
2012.11.11. 13:08:06 | data 2
My aim would be similar like this:
TIME DATA
2012.11.11. 13:00:06 | data 1
2012.11.11. 13:02:06 | NULL
2012.11.11. 13:06:06 | NULL
2012.11.11. 13:08:06 | data 2
2012.11.11. 13:10:06 | NULL
2012.11.11. 13:12:06 | NULL
...
Ergo, the whole table should be filled with time rows at a 2-minute interval. How can I achieve this?
I think your query should work. I would give the value a column name:
SELECT g.series AS time, t.data
FROM generate_series('2012-11-11', '2012-11-12', '2 minutes'::interval) AS g(series)
LEFT JOIN data_table t
  ON t.time = g.series;
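One more note: with untyped string literals, Postgres resolves this call to the timestamptz variant of generate_series. If you want plain timestamps, pin the types with explicit casts (illustrative):
SELECT g.series AS time, t.data
FROM generate_series(timestamp '2012-11-11', timestamp '2012-11-12',
                     '2 minutes'::interval) AS g(series)
LEFT JOIN data_table t
  ON t.time = g.series;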

yet another date gap-fill SQL puzzle

I'm using Vertica, which precludes me from using CROSS APPLY, unfortunately. And apparently there's no such thing as CTEs in Vertica.
Here's what I've got:
t:
day | id | metric | d_metric
-----------+----+--------+----------
2011-12-01 | 1 | 10 | 10
2011-12-03 | 1 | 12 | 2
2011-12-04 | 1 | 15 | 3
Note that on the first day, the delta is equal to the metric value.
I'd like to fill in the gaps, like this:
t_fill:
day | id | metric | d_metric
-----------+----+--------+----------
2011-12-01 | 1 | 10 | 10
2011-12-02 | 1 | 10 | 0 -- a delta of 0
2011-12-03 | 1 | 12 | 2
2011-12-04 | 1 | 15 | 3
I've thought of a way to do this day by day, but what I'd really like is a solution that works in one go.
I think I could get something working with LAST_VALUE, but I can't come up with the right JOIN statements that will let me properly partition and order on each id's day-by-day history.
edit:
assume I have a table like this:
calendar:
day
------------
2011-01-01
2011-01-02
...
that can be used in joins. My intent would be to maintain the date range in calendar to match the date range in t.
edit:
A few more notes on what I'm looking for, just to be specific:
In generating t_fill, I'd like to exactly cover the date range in t, as well as any dates that are missing in between. So a correct t_fill will start on the same date and end on the same date as t.
t_fill has two properties:
1) once an id appears on some date, it will always have a row for each later date. This is the gap-filling implied in the original question.
2) Should no row for an id ever appear again after some date, the t_fill solution should merrily generate rows with the same metric value (and 0 delta) from the date of that last data point up to the end date of t.
A solution might backfill earlier dates up to the start of the date range in t. That is, for any id that appears after the first date in t, rows between the first date in t and the first date for the id will be filled with metric=0 and d_metric=0. I don't prefer this kind of solution, since it has a higher growth factor for each id that enters the system. But I could easily deal with it by selecting into a new table only rows where metric!=0 and d_metric!=0.
This is about what Jonathan Leffler proposed, but in old-fashioned low-level SQL (without fancy CTEs, window functions, or aggregating subqueries):
SET search_path='tmp';
DROP TABLE ttable CASCADE;
CREATE TABLE ttable
( zday date NOT NULL
, id INTEGER NOT NULL
, metric INTEGER NOT NULL
, d_metric INTEGER NOT NULL
, PRIMARY KEY (id,zday)
);
INSERT INTO ttable(zday,id,metric,d_metric) VALUES
('2011-12-01',1,10,10)
,('2011-12-03',1,12,2)
,('2011-12-04',1,15,3)
;
DROP TABLE ctable CASCADE;
CREATE TABLE ctable
( zday date NOT NULL
, PRIMARY KEY (zday)
);
INSERT INTO ctable(zday) VALUES
('2011-12-01')
,('2011-12-02')
,('2011-12-03')
,('2011-12-04')
;
CREATE VIEW v_cte AS (
SELECT t.zday,t.id,t.metric,t.d_metric
FROM ttable t
JOIN ctable c ON c.zday = t.zday
UNION
SELECT c.zday,t.id,t.metric, 0
FROM ctable c, ttable t
WHERE t.zday < c.zday
AND NOT EXISTS ( SELECT *
FROM ttable nx
WHERE nx.id = t.id
AND nx.zday = c.zday
)
AND NOT EXISTS ( SELECT *
FROM ttable nx
WHERE nx.id = t.id
AND nx.zday < c.zday
AND nx.zday > t.zday
)
)
;
SELECT * FROM v_cte;
The results:
zday | id | metric | d_metric
------------+----+--------+----------
2011-12-01 | 1 | 10 | 10
2011-12-02 | 1 | 10 | 0
2011-12-03 | 1 | 12 | 2
2011-12-04 | 1 | 15 | 3
(4 rows)
I am not a Vertica user, but if you do not want to use its native support for gap filling, here you can find a more generic SQL-only solution.
If you want to use something like a CTE, how about using a temporary table? Essentially, a CTE is a view for a particular query.
Depending on your needs, you can make the temporary table transaction- or session-scoped, as sketched below.
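For example (a sketch, assuming Vertica's CREATE LOCAL TEMPORARY TABLE syntax; the table name is illustrative):
-- session-scoped scratch table standing in for what a CTE would hold;
-- use ON COMMIT DELETE ROWS instead for transaction scope
CREATE LOCAL TEMPORARY TABLE t_ids ON COMMIT PRESERVE ROWS AS
SELECT DISTINCT id FROM t;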
I'm still curious to know why gap-filling with constant-interpolation wouldn't work here.
Given the complete calendar table, it is doable, though not exactly trivial. Without the calendar table, it would be a lot harder.
Your query needs to be stated moderately precisely, which is usually half the battle in any issue with 'how to write the query'. I think you are looking for:
For each date in Calendar between the minimum and maximum dates represented in T (or other stipulated range),
For each distinct ID represented in T,
Find the metric for the given ID for the most recent record in T on or before the date.
This gives you a complete list of dates with metrics.
You then need to self-join two copies of that list with dates one day apart to form the deltas.
Note that if some ID values don't appear at the start of the date range, they won't show up.
With that as guidance, you should be able to get going, I believe. A rough sketch of those steps follows.
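Here is an untested translation of that outline into plain SQL (no window functions; calendar and t as defined above; the view names are illustrative, and the per-id lookup uses a join plus GROUP BY rather than a correlated subquery, since correlated subqueries can be limited in Vertica):
-- Step 1a: for each calendar day and id, locate the most recent observation on or before it;
-- the join condition also drops calendar days before an id's first observation (no backfill)
CREATE VIEW t_last AS
SELECT c.day, t.id, MAX(t.day) AS obs_day
FROM calendar c
JOIN t ON t.day <= c.day
WHERE c.day <= (SELECT MAX(day) FROM t)
GROUP BY c.day, t.id;
-- Step 1b: carry that observation's metric forward to the calendar day
CREATE VIEW t_fill_m AS
SELECT l.day, l.id, t.metric
FROM t_last l
JOIN t ON t.id = l.id AND t.day = l.obs_day;
-- Step 2: self-join two copies one day apart to rebuild the deltas;
-- on an id's first day there is no prev row, so d_metric equals the metric
SELECT cur.day, cur.id, cur.metric,
       cur.metric - COALESCE(prev.metric, 0) AS d_metric
FROM t_fill_m cur
LEFT JOIN t_fill_m prev
  ON prev.id = cur.id
 AND prev.day = cur.day - 1
ORDER BY cur.id, cur.day;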