PrestoDB: select all dates between two dates - sql

I need to form a report which provides some information per each date within dates interval.
I need to have it within a single query (can't create any functions or supporting tables).
How can I achieve that in PrestoDB?
Note: There are lots of vendor specific solution here, here and even here. But none of them satisfies my need as they either don't work in Presto or use tables/functions.
To be more precise here is an example of query:
WITH ( query to select all dates between 2017.01.01 and 2018.01.01 ) AS dates
SELECT
date date,
count(*) number_of_orders
FROM dates dates
LEFT JOIN order order
ON order.created_at = dates.date

You can use the Presto SEQUENCE() function to generate a sequence of days as an array, and then use UNNEST to explode that array as a result set.
Something like this should work for you:
SELECT date_array AS DAY
FROM UNNEST(
SEQUENCE(
cast('2017-01-01' AS date),
cast('2018-01-01' AS date),
INTERVAL '1' DAY
)
) AS t1(date_array)

Related

SQL: Apply an aggregate result per day using window functions

Consider a time-series table that contains three fields time of type timestamptz, balance of type numeric, and is_spent_column of type text.
The following query generates a valid result for the last day of the given interval.
SELECT
MAX(DATE_TRUNC('DAY', (time))) as last_day,
SUM(balance) FILTER ( WHERE is_spent_column is NULL ) AS value_at_last_day
FROM tbl
2010-07-12 18681.800775017498741407984000
However, I am in need of an equivalent query based on window functions to report the total value of the column named balance for all the days up to and including the given date .
Here is what I've tried so far, but without any valid result:
SELECT
DATE_TRUNC('DAY', (time)) AS daily,
SUM(sum(balance) FILTER ( WHERE is_spent_column is NULL ) ) OVER ( ORDER BY DATE_TRUNC('DAY', (time)) ) AS total_value_per_day
FROM tbl
group by 1
order by 1 desc
2010-07-12 16050.496339044977568391974000
2010-07-11 13103.159119670350269890284000
2010-07-10 12594.525752964512456914454000
2010-07-09 12380.159588711091681327014000
2010-07-08 12178.119542536668113577014000
2010-07-07 11995.943973804127033140014000
EDIT:
Here is a sample dataset:
LINK REMOVED
The running total can be computed by applying the first query above on the entire dataset up to and including the desired day. For example, for day 2009-01-31, the result is 97.13522530000000000000, or for day 2009-01-15 when we filter time as time < '2009-01-16 00:00:00' it returns 24.446144000000000000.
What I need is an alternative query that computes the running total for each day in a single query.
EDIT 2:
Thank you all so very much for your participation and support.
The reason for differences in result sets of the queries was on the preceding ETL pipelines. Sorry for my ignorance!
Below I've provided a sample schema to test the queries.
https://www.db-fiddle.com/f/veUiRauLs23s3WUfXQu3WE/2
Now both queries given above and the query given in the answer below return the same result.
Consider calculating running total via window function after aggregating data to day level. And since you aggregate with a single condition, FILTER condition can be converted to basic WHERE:
SELECT daily,
SUM(total_balance) OVER (ORDER BY daily) AS total_value_per_day
FROM (
SELECT
DATE_TRUNC('DAY', (time)) AS daily,
SUM(balance) AS total_balance
FROM tbl
WHERE is_spent_column IS NULL
GROUP BY 1
) AS daily_agg
ORDER BY daily

How can I iteratively "INSERT INTO {table} SELECT FROM".... for a range of dates?

I am working on a building a Data Vault Temporal Point in Time (TPIT) table. For this I need to iterate through a list of dates and insert rows for each Date. Here is the SQL for one Date '2020-08-19' . I need to iterate over all dates from year 2000 to today.
What is the best way to implement this using SQL in Snowflake?
I can do this in Python, but I was looking for a SQL only way to do this, so that it can be embedded in an SQL refresh script that runs daily.
Thanks.
INSERT
INTO
MEMBERSHIP_TPIT
SELECT
MEMBERSHIP_HUB.MEMBERSHIP_HASHKEY AS MEMBERSHIP_HASHKEY,
'2020-08-19' AS SNAPSHOT_DATE,
MEMBERSHIP_SAT.LOAD_DATE AS LOAD_DATE
FROM
MEMBERSHIP_HUB INNER MEMBERSHIP_SAT ON
(MEMBERSHIP_HUB.MEMBERSHIP_HASHKEY = MEMBERSHIP_SAT.MEMBERSHIP_HASHKEY
AND '2020-08-19' BETWEEN START_DATE AND COALESCE(END_DATE,
'9999-12-31'));
You can use below query along with WITH clause and refer DATE_RANGE as a table in your From clause and refer MY_DATE in place of '2020-08-19'.
WITH DATE_RANGE AS (
SELECT DATEADD(DAY, -1*SEQ4(), CURRENT_DATE()) AS MY_DATE
FROM TABLE(GENERATOR(ROWCOUNT => (365000) )) where my_date >='01-Jan-2000'
)
SELECT * FROM DATE_RANGE

Trying to UNNEST timestamp array field, but need to GROUP BY

I have a repeated field of type TIMESTAMP in a BigQuery table. I am attempting to UNNEST this field. However, I must group or aggregate the field in order. I am not knowledgable with SQL, so I could use some help. The code snippet is part of a larger query that works when substituting subscription.future_renewal_dates with GENERATE_TIMESTAMP_ARRAY
subscription.future_renewal_dates is ARRAY<TIMESTAMP>
The TIMESTAMP array is unique (recurring subscriptions) and cannot be generated using GENERATE_TIMESTAMP_ARRAY, so I have to generate the dates before uploading to BigQuery. UDF is too much.
SELECT
subscription.amount AS subscription_amount,
subscription.status AS subscription_status,
"1" AS analytic_name,
ARRAY (
SELECT
AS STRUCT FORMAT_TIMESTAMP("%x", days) AS type_value, subscription.amount AS analytic_name
FROM
UNNEST(subscription.future_renewal_dates) as days
WHERE
(
days >= TIMESTAMP("2019-06-05T19:30:02+00:00")
AND days <= TIMESTAMP("2019-08-01T03:59:59+00:00")
)
) AS forecast
FROM
`mydataset.subscription` AS subscription
GROUP BY
subscription_amount,
subscription_status,
analytic_name
Cannot figure out how to successfully unnest subscription.future_renewal_dates without error 'UNNEST expression references subscription.future_renewal_dates which is neither grouped nor aggregated'
When you do GROUP BY - all expressions, columns in the SELECT (except those in GROUP BY list) should be used with some aggregation function - which you clearly do not have. So you need to decide what it is that you actually trying to achieve here with that grouping
Below is the option I think you had in mind - though it can be different - but at least you have an idea on how to fix it
SELECT
subscription.amount AS subscription_amount,
subscription.status AS subscription_status,
"1" AS analytic_name,
ARRAY_CONCAT_AGG( ARRAY (
SELECT
AS STRUCT FORMAT_TIMESTAMP("%x", days) AS type_value, subscription.amount AS analytic_name
FROM
UNNEST(subscription.future_renewal_dates) as days
WHERE
(
days >= TIMESTAMP("2019-06-05T19:30:02+00:00")
AND days <= TIMESTAMP("2019-08-01T03:59:59+00:00")
)
)) AS forecast
FROM
`mydataset.subscription` AS subscription
GROUP BY
subscription_amount,
subscription_status,
analytic_name

How to generate Month list in PostgreSQL?

I have a table A with startdate column which is TIMESTAMP WITHOUT TIME ZONE I need to write a query/function that generate a list of months from the MIN value of the column till MAX value of the column.
For example:
startdate
2014-12-08
2015-06-16
2015-02-17
will generate a list of: (Dec-14,Jan-15,Feb-15,Mar-15,Apr-15,May-15,Jun-15)
How do I do that? I never used PostgreSQL to generate data that wasn't there... it always has been finding the correct data in the DB... any ideas how to do that? Is it doable in a query?
For people looking for an unformatted list of months:
select * from generate_series('2017-01-01', now(), '1 month')
You can generate sequences of data with the generate_series() function:
SELECT to_char(generate_series(min, max, '1 month'), 'Mon-YY') AS "Mon-YY"
FROM (
SELECT date_trunc('month', min(startdate)) AS min,
date_trunc('month', max(startdate)) AS max
FROM a) sub;
This generates a row for every month, in a pretty format. If you want to have it like a list, you can aggregate them all in an outer query:
SELECT string_agg("Mon-YY", ', ') AS "Mon-YY list"
FROM (
-- Query above
) subsub;
SQLFiddle here

How to find sum of a column between a given date range, where the table has only start date and end date

I have a postgresql table userDistributions like this :
user_id, start_date, end_date, project_id, distribution
I need to write a query in which a given date range and user id the output should be the sum of all distributions for every day for that given user.
So the output should be like this for input : '2-2-2012' - '2-4-2012', some user id :
Date SUM(Distribution)
2-2-2012 12
2-3-2012 15
2-4-2012 34
A user has distribution in many projects, so I need to sum the distributions in all projects for each day and output that sum against that day.
My problem is what I should group by against ? If I had a field as date (instead of start_date and end_date), then I could just write something like
select date, SUM(distributions) from userDistributions group by date;
but in this case I am stumped as what to do. Thanks for the help.
Use generate_series to produce your dates, something like this:
select dt.d::date, sum(u.distributions)
from userdistributions u
join generate_series('2012-02-02'::date, '2012-02-04'::date, '1 day') as dt(d)
on dt.d::date between u.start_date and u.end_date
group by dt.d::date
Your date format is ambiguous so I guess while converting it to ISO 8601.
This is much like #mu's answer.
However, to cover days with no matches you should use LEFT JOIN:
SELECT d.d::date, sum(u.distributions) AS dist_sum
FROM generate_series('2012-02-02'::date, '2012-02-04'::date, '1 day') AS d(d)
LEFT JOIN userdistributions u ON d.d::date BETWEEN u.start_date AND u.end_date
GROUP BY 1