Is there any way to apply a loop in a Presto query - SQL

My use case is to build the Presto view below on top of the table to get a daily count by subtracting yesterday's value from today's. If there is no data in the table for a given day, the view should dynamically look at the next available day's value and average it over the missing days.
This is a Presto query; I have included only one metric field in the query below.
CREATE OR REPLACE VIEW hive.facebook.post_metrics_daily AS
SELECT
    a.post_id,
    a.page,
    a.dt,
    a.created_time,
    COALESCE(CAST(a.likes AS integer) - COALESCE(CAST(b.likes AS integer), 0), 0) AS likes
FROM hive.facebook.post_metrics a
LEFT JOIN hive.facebook.post_metrics b
    ON a.dt = b.dt + INTERVAL '1' DAY
    AND a.post_id = b.post_id
    AND a.brandname = b.brandname
WHERE a.dt = DATE '2019-09-10'
If the data covers the 9th to the 12th and the 10th is missing, then the view should take the 11th day's data and average the 9th and the 11th to produce the 10th's row. How can it be done? Can this formula be applied in the query, and if yes, how?
(today - yesterday) / (n + 1), where n is the number of missing days.
This is the sample data for likes. Where likes are missing, I need the average likes, and the number of missing days should be identified dynamically by the query.
Date       | Likes (orig.) | Likes (with missing) | Daily likes (orig.) | Expected likes
-----------+---------------+----------------------+---------------------+----------------
2019-10-17 | 20487         | 20487                | 20487               | 20487
2019-10-18 | 25384         | 25384                | 4897                | 4897
2019-10-19 | 26817         | 26817                | 1433                | 1433
2019-10-20 | 27499         | missing              | 682                 | 257
2019-10-21 | 27854         | missing              | 355                 | 258
2019-10-22 | 27987         | missing              | 133                 | 258
2019-10-23 | 28065         | missing              | 78                  | 258
2019-10-24 | 28106         | 28106                | 41                  | 258
2019-10-25 | 28134         | 28134                | 28                  | 28
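For example, likes are last recorded as 26817 on 2019-10-19 and next recorded as 28106 on 2019-10-24, with 4 missing days in between, so each day from 2019-10-20 through 2019-10-24 should get (28106 - 26817) / (4 + 1) = 257.8, i.e. roughly 258 per day, which is what the expected-likes column shows.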

I think you just want lag():
SELECT pm.*,
       (pm.likes +
        LAG(pm.likes) OVER (PARTITION BY pm.post_id, pm.brandname ORDER BY pm.dt)
       ) / 2 AS avg_likes
FROM hive.facebook.post_metrics pm;
If you need to account for the missing days, you need date arithmetic. I think this would be:
SELECT pm.*,
       ( (pm.likes +
          LAG(pm.likes) OVER (PARTITION BY pm.post_id, pm.brandname ORDER BY pm.dt)
         ) /
         DATE_DIFF('day',
                   LAG(pm.dt) OVER (PARTITION BY pm.post_id, pm.brandname ORDER BY pm.dt),
                   pm.dt
         )
       ) AS avg_likes
FROM hive.facebook.post_metrics pm;
If you want this for a particular day, use a subquery or CTE for the above expression and then filter in the outer query, as in the sketch below.
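A minimal sketch of that wrapping, assuming the column names from the question (post_id, brandname, dt, likes); the inner query here uses the question's (today - previous) / gap formula rather than the sum above, and the outer query filters to the day of interest:
SELECT *
FROM (
    SELECT
        pm.post_id,
        pm.brandname,
        pm.dt,
        -- change since the previous recorded day, spread evenly over the gap in days
        (CAST(pm.likes AS integer)
         - LAG(CAST(pm.likes AS integer)) OVER (PARTITION BY pm.post_id, pm.brandname ORDER BY pm.dt))
        / DATE_DIFF('day',
                    LAG(pm.dt) OVER (PARTITION BY pm.post_id, pm.brandname ORDER BY pm.dt),
                    pm.dt) AS daily_likes
    FROM hive.facebook.post_metrics pm
) t
WHERE dt = DATE '2019-09-10'
Note this assigns the averaged value only to the day on which data reappears; producing rows for the genuinely missing dates would additionally need a generated date series (for example Presto's SEQUENCE plus UNNEST, or a calendar table) to left-join against.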

Related

How do I do a SQL join to get the latest data from table1 as of the date in table2?

I have two tables, call them "monthlyStoreCount" and "weeklySales"
monthlyStoreCount
date       | storeCount
-----------+-----------
2022-01-01 | 89
2022-02-01 | 94
...        | ...
weeklySales
date       | sales
-----------+------
2021-12-31 | 66
2022-01-07 | 16
2022-01-14 | 147
2022-01-21 | 185
2022-01-28 | 145
2022-04-04 | 2572
...        | ...
I am looking to join these tables to get the "storeCount" and latest "sales" as of the dates in the monthlyStoreCount table.
Is there any performant way to do this join? With the data shown the desired output would be:
date       | storeCount | sales
-----------+------------+------
2022-01-01 | 89         | 66
2022-02-01 | 94         | 145
...        | ...        | ...
UNTESTED:
Using: https://docs.snowflake.com/en/sql-reference/constructs/join-lateral.html as a primer...
"for each row in left_hand_table LHT:
execute right_hand_subquery RHS using the values from the current row in the LHT"
LATERAL lets us execute the subquery once for each record in monthlyStoreCount. So for each MSC row we take the WeeklySales (WS) records whose date is on or before the MSC date, order them by weekly sales date descending, and keep the first one (the sales row closest to, and not after, the monthly store count date).
SELECT MSC.Date, MSC.StoreCount, sWS.Sales
FROM monthlyStoreCount as MSC,
LATERAL (SELECT WS.Sales
FROM WeeklySales as WS
WHERE MSC.date>= WS.date
ORDER BY WS.Date DESC LIMIT 1) as sWS
ORDER BY MSC.Date ASC;
Instead of using a cartesian product, what if you stack them up and look for the date that occurs right before the date for monthly store counts?
with cte as
(select date, storeCount, 1 as is_monthly
from monthlyStoreCount
union all
select date, sales, 0 as is_monthly
from weeklySales)
select *, lag(storeCount) over (order by date asc, is_monthly asc)
from cte
qualify is_monthly=1;
Hmm....It appears there is one way to make xQbert's lateral join solution work. By slapping an aggregate on it. I don't know why Snowflake doesn't allow the same using limit/top 1.
select *
from monthlyStoreCount as m,
lateral (select array_agg(w.sales) within group(order by w.date desc)[0] as sales
from WeeklySales as w
where m.date>= w.date)
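A possible alternative, assuming your Snowflake account has the MIN_BY / MAX_BY aggregate functions (available in newer releases), is to let MAX_BY pull the sales value from the latest qualifying week directly, instead of building an array and indexing into it:
select m.date, m.storeCount, s.sales
from monthlyStoreCount as m,
     lateral (select max_by(w.sales, w.date) as sales   -- sales from the row with the greatest date
              from WeeklySales as w
              where m.date >= w.date) as s
order by m.date;
Like the array_agg version, the aggregate guarantees exactly one lateral row per month, so months with no earlier weekly row still appear (with NULL sales).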

SQL to find sum of total days in a window for a series of changes

Following is the table:
start_date | recorded_date | id
-----------+---------------+----
2021-11-10 | 2021-11-01    | 1a
2021-11-08 | 2021-11-02    | 1a
2021-11-11 | 2021-11-03    | 1a
2021-11-10 | 2021-11-04    | 1a
2021-11-10 | 2021-11-05    | 1a
I need a query to find the total day changes in aggregate for a given id. In this case, it changed from 10th Nov to 8th Nov so 2 days, then again from 8th to 11th Nov so 3 days and again from 11th to 10th for a day, and finally from 10th to 10th, that is 0 days.
In total there is a change of 2+3+1+0 = 6 days for the id - '1a'.
Basically for each change there is a recorded_date, so we arrange that in ascending order and then calculate the aggregate change of days grouped by id. The final result should be like:
id | Agg_Change
---+-----------
1a | 6
Is there a way to do this using SQL? I am using the Vertica database.
Thanks.
You can use the window function LEAD to get the difference between consecutive rows and then group by id:
select id, sum(daydiff) as Agg_Change
from (
    select id,
           abs(datediff('day', start_date, lead(start_date, 1, start_date) over (partition by id order by recorded_date))) as daydiff
    from tablename
) t
group by id
It's indeed the use of LAG() to get the previous date in an OLAP query, and an outer query getting the absolute date difference, and the sum of it, grouping by id:
WITH
-- your input - don't use in real query ...
indata(start_date,recorded_date,id) AS (
SELECT DATE '2021-11-10',DATE '2021-11-01','1a'
UNION ALL SELECT DATE '2021-11-08',DATE '2021-11-02','1a'
UNION ALL SELECT DATE '2021-11-11',DATE '2021-11-03','1a'
UNION ALL SELECT DATE '2021-11-10',DATE '2021-11-04','1a'
UNION ALL SELECT DATE '2021-11-10',DATE '2021-11-05','1a'
)
-- real query starts here, replace following comma with "WITH" ...
,
w_lag AS (
SELECT
id
, start_date
, LAG(start_date) OVER w AS prevdt
FROM indata
WINDOW w AS (PARTITION BY id ORDER BY recorded_date)
)
SELECT
id
, SUM(ABS(DATEDIFF('day', start_date, prevdt))) AS dtdiff
FROM w_lag
GROUP BY id
-- out id | dtdiff
-- out ----+--------
-- out 1a | 6
I was thinking the LAG function would give me the answer, but it kept giving the wrong result because I had the wrong logic in one place. Here is the answer I needed:
with cte as(
select id, start_date, recorded_date,
row_number() over(partition by id order by recorded_date asc) as idrank,
lag(start_date,1) over(partition by id order by recorded_date asc) as prev
from table_temp
)
select id, sum(abs(date(start_date) - date(prev))) as Agg_Change
from cte
group by 1
If someone has a better solution please let me know.

Aggregating two values in same select statement. Second aggregation is decreasing in value for each row for some reason

I'm currently trying to aggregate two values simultaneously in one select statement; however, the second aggregated value is decreasing for some reason. I know what I'm doing is wrong, but I don't understand why it's wrong (assuming it's the very last code block). Mainly just trying to better understand what's going on, and why it's happening.
I already have a corrected query that works (at the bottom)
Note: Query and outputs are simplified, please ignore any syntax issues. Additionally, in real query, I need to keep subscription_start_date field in until the end.
Query with issue (very last block):
WITH max_product_user_count AS (
-- The total count is obtained when "days" = 0
SELECT
subscription_start_date,
datediff('days', subscription_start_date, subscription_date) AS days,
product,
num_users AS total_user_count
FROM users
WHERE days = 0
),
daily_product_user_count AS (
-- As "days" go up, the number of subscribers for each start date/product type decreases
SELECT
subscription_start_date,
datediff('days', subscription_start_date, subscription_date) AS days,
product,
num_users AS daily_user_count
FROM users
WHERE days IN (0,5,14,21,30,33,60)
)
-- Trying to aggregate by product and day, across all subscription start dates
SELECT
d.product,
d.days,
SUM(daily_user_count) AS daily_count,
SUM(total_user_count) AS total_count
FROM daily_product_user_count d
INNER JOIN max_product_user_count m ON d.subscription_start_date = m.subscription_start_date
AND d.product = m.product
GROUP BY 1,2
ORDER BY 1,2
Current Output:
PRODUCT DAYS DAILY_COUNT TOTAL_COUNT
product_1 0 10000 10000
product_1 5 99231 99781
product_1 14 96124 98123
product_1 21 85123 96441
product_1 30 23412 94142
product_1 33 12931 92111
product_1 60 10231 90123
Expected Output:
PRODUCT DAYS DAILY_COUNT TOTAL_COUNT
product_1 0 10000 10000
product_1 5 99231 10000
product_1 14 96124 10000
product_1 21 85123 10000
product_1 30 23412 10000
product_1 33 12931 10000
product_1 60 10231 10000
Updated correct query:
WITH max_product_user_count AS (
SELECT
subscription_start_date,
datediff('days', subscription_start_date, subscription_date) AS days,
product,
num_users AS total_user_count
FROM users
WHERE days = 0
),
max_user_count_aggregation AS (
SELECT
product,
SUM(total_user_count) AS total_count
FROM max_product_user_count
GROUP BY 1
),
daily_product_user_count AS (
SELECT
subscription_start_date,
datediff('days', subscription_start_date, subscription_date) AS days,
product,
num_users AS daily_user_count
FROM users
WHERE days IN (0,5,14,21,30,33,60)
),
daily_user_count_aggregation AS (
SELECT
product,
days,
SUM(daily_user_count) AS daily_count
FROM daily_product_user_count
GROUP BY 1, 2
)
SELECT
d.product,
d.days,
daily_count,
total_count
FROM daily_user_count_aggregation d
INNER JOIN max_user_count_aggregation m ON d.product = m.product
ORDER BY 1,2
If I understand what you are trying to do, the query is way more complicated than necessary. I think this does what you want:
SELECT datediff('days', subscription_start_date, subscription_date) AS days,
product,
SUM(num_users) FILTER (WHERE days IN (0, 5, 14, 21, 30, 33, 60)) AS daily_user_count,
SUM(num_users) FILTER (WHERE days = 0) AS total_user_count
FROM users
GROUP BY days, product;
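If your warehouse doesn't support the FILTER clause (Snowflake, for instance, historically has not), the usual equivalent is a conditional aggregate with CASE; a sketch of the same query under that assumption:
SELECT datediff('days', subscription_start_date, subscription_date) AS days,
       product,
       SUM(CASE WHEN datediff('days', subscription_start_date, subscription_date) IN (0, 5, 14, 21, 30, 33, 60)
                THEN num_users END) AS daily_user_count,
       SUM(CASE WHEN datediff('days', subscription_start_date, subscription_date) = 0
                THEN num_users END) AS total_user_count
FROM users
GROUP BY 1, 2;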
I would advise you to ask a new question, explaining the logic you want to implement and providing reasonable sample data and desired results.

PostgreSQL query to count/group by day and display days with no data

I need to create a PostgreSQL query that returns
a day
the number of objects found for that day
It's important that every single day appear in the results, even if no objects were found on that day. (This has been discussed before but I haven't been able to get things working in my specific case.)
First, I found a sql query to generate a range of days, with which I can join:
SELECT to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD')
AS date
FROM generate_series(0, 365, 1)
AS offs
Results in:
date
------------
2013-03-28
2013-03-27
2013-03-26
2013-03-25
...
2012-03-28
(366 rows)
Now I'm trying to join that to a table named 'sharer_emailshare' which has a 'created' column:
Table 'public.sharer_emailshare'
column | type
-------------------
id | integer
created | timestamp with time zone
message | text
to | character varying(75)
Here's the best GROUP BY query I have so far:
SELECT d.date, count(se.id) FROM (
select to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD')
AS date
FROM generate_series(0, 365, 1)
AS offs
) d
JOIN sharer_emailshare se
ON (d.date=to_char(date_trunc('day', se.created), 'YYYY-MM-DD'))
GROUP BY d.date;
The results:
date | count
------------+-------
2013-03-27 | 11
2013-03-24 | 2
2013-02-14 | 2
(3 rows)
Desired results:
date | count
------------+-------
2013-03-28 | 0
2013-03-27 | 11
2013-03-26 | 0
2013-03-25 | 0
2013-03-24 | 2
2013-03-23 | 0
...
2012-03-28 | 0
(366 rows)
If I understand correctly this is because I'm using a plain (implied INNER) JOIN, and this is the expected behavior, as discussed in the postgres docs.
I've looked through dozens of StackOverflow solutions, and all the ones with working queries seem specific to MySQL/Oracle/MSSQL and I'm having a hard time translating them to PostgreSQL.
The guy asking this question found his answer, with Postgres, but put it on a pastebin link that expired some time ago.
I've tried to switch to LEFT OUTER JOIN, RIGHT JOIN, RIGHT OUTER JOIN, CROSS JOIN, use a CASE statement to sub in another value if null, COALESCE to provide a default value, etc, but I haven't been able to use them in a way that gets me what I need.
Any assistance is appreciated! And I promise I'll get around to reading that giant PostgreSQL book soon ;)
You just need a left outer join instead of an inner join:
SELECT d.date, count(se.id)
FROM
(
SELECT to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD') AS date
FROM generate_series(0, 365, 1) AS offs
) d
LEFT OUTER JOIN sharer_emailshare se
ON d.date = to_char(date_trunc('day', se.created), 'YYYY-MM-DD')
GROUP BY d.date;
Extending Gordon Linoff's helpful answer, I would suggest a couple of improvements such as:
Use ::date instead of date_trunc('day', ...)
Join on a date type rather than a character type (it's cleaner).
Use specific date ranges so they're easier to change later. In this case I select a year before the most recent entry in the table - something that couldn't have been done easily with the other query.
Compute the totals for an arbitrary subquery (using a CTE). You just have to cast the column of interest to the date type and call it date_column.
Include a column for cumulative total. (Why not?)
Here's my query:
WITH dates_table AS (
SELECT created::date AS date_column FROM sharer_emailshare WHERE showroom_id=5
)
SELECT series_table.date, COUNT(dates_table.date_column), SUM(COUNT(dates_table.date_column)) OVER (ORDER BY series_table.date) FROM (
SELECT (last_date - b.offs) AS date
FROM (
SELECT GENERATE_SERIES(0, last_date - first_date, 1) AS offs, last_date from (
SELECT MAX(date_column) AS last_date, (MAX(date_column) - '1 year'::interval)::date AS first_date FROM dates_table
) AS a
) AS b
) AS series_table
LEFT OUTER JOIN dates_table
ON (series_table.date = dates_table.date_column)
GROUP BY series_table.date
ORDER BY series_table.date
I tested the query, and it produces the same results, plus the column for cumulative total.
I'll try to provide an answer that includes some explanation. I'll start with the smallest building block and work up.
If you run a query like this:
SELECT series.number FROM generate_series(0, 9) AS series(number)
You get output like this:
number
--------
0
1
2
3
4
5
6
7
8
9
(10 rows)
This can be turned into dates like this:
SELECT CURRENT_DATE + sequential_dates.date AS date
FROM generate_series(0, 9) AS sequential_dates(date)
Which will give output like this:
date
------------
2019-09-29
2019-09-30
2019-10-01
2019-10-02
2019-10-03
2019-10-04
2019-10-05
2019-10-06
2019-10-07
2019-10-08
(10 rows)
Then you can do a query like this (for example), joining the original query as a subquery against whatever table you're ultimately interested in:
SELECT sequential_dates.date,
COUNT(calendar_items.*) AS calendar_item_count
FROM (SELECT CURRENT_DATE + sequential_dates.date AS date
FROM generate_series(0, 9) AS sequential_dates(date)) sequential_dates
LEFT JOIN calendar_items ON calendar_items.starts_at::date = sequential_dates.date
GROUP BY sequential_dates.date
Which will give output like this:
date | calendar_item_count
------------+---------------------
2019-09-29 | 1
2019-09-30 | 8
2019-10-01 | 15
2019-10-02 | 11
2019-10-03 | 1
2019-10-04 | 12
2019-10-05 | 0
2019-10-06 | 0
2019-10-07 | 27
2019-10-08 | 24
Based on Gordon Linoff's answer I realized another problem was that I had a WHERE clause that I didn't mention in the original question.
Instead of a naked WHERE, I made a subquery:
SELECT d.date, count(se.id) FROM (
select to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD')
AS date
FROM generate_series(0, 365, 1)
AS offs
) d
LEFT OUTER JOIN (
SELECT * FROM sharer_emailshare
WHERE showroom_id=5
) se
ON (d.date=to_char(date_trunc('day', se.created), 'YYYY-MM-DD'))
GROUP BY d.date;
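An equivalent way to keep the empty days without the subquery is to move that filter into the LEFT JOIN's ON clause, which restricts only the right-hand side and still preserves every generated date:
SELECT d.date, count(se.id)
FROM (
    SELECT to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD') AS date
    FROM generate_series(0, 365, 1) AS offs
) d
LEFT OUTER JOIN sharer_emailshare se
    ON d.date = to_char(date_trunc('day', se.created), 'YYYY-MM-DD')
    AND se.showroom_id = 5
GROUP BY d.date;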
I like Jason Swett's SQL; however, I ran into an issue where the count on some dates should be zero rather than one.
Running the statement select count(*) from public.post_call_info where timestamp::date = '2020-11-23' returns zero, but the query below returns one for that date.
Also, the + gave me a forward schedule, so I changed it to a minus to get 9 days of data prior to the current date.
SELECT sequential_dates.date,
COUNT(*) AS call_count
FROM (SELECT CURRENT_DATE - sequential_dates.date AS date
FROM generate_series(0, 9) AS sequential_dates(date)) sequential_dates
LEFT JOIN public.post_call_info ON public.post_call_info.timestamp::date =
sequential_dates.date
GROUP BY sequential_dates.date
order by date desc
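That zero-versus-one discrepancy is the usual LEFT JOIN counting pitfall: COUNT(*) counts the joined row even when every column from post_call_info is NULL. Counting a column from the right-hand table instead should report zero for days with no calls; a sketch assuming the same table and timestamp column as above:
SELECT sequential_dates.date,
       COUNT(public.post_call_info.timestamp) AS call_count   -- NULLs (no match) are not counted
FROM (SELECT CURRENT_DATE - sequential_dates.date AS date
      FROM generate_series(0, 9) AS sequential_dates(date)) sequential_dates
LEFT JOIN public.post_call_info
    ON public.post_call_info.timestamp::date = sequential_dates.date
GROUP BY sequential_dates.date
ORDER BY date DESC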

T-SQL - SELECT by nearest date and GROUPED BY ID

From the data below I need to select the record nearest to a specified date for each Linked ID using SQL Server 2005:
ID Date Linked ID
...........................
1 2010-09-02 25
2 2010-09-01 25
3 2010-09-08 39
4 2010-09-09 39
5 2010-09-10 39
6 2010-09-10 34
7 2010-09-29 34
8 2010-10-01 37
9 2010-10-02 36
10 2010-10-03 36
So selecting them using 01/10/2010 should return:
1 2010-09-02 25
5 2010-09-10 39
7 2010-09-29 34
8 2010-10-01 37
9 2010-10-02 36
I know this must be possible, but can't seem to get my head round it (must be too near the end of the day :P) If anyone can help or give me a gentle shove in the right direction it would be greatly appreciated!
EDIT: Also I have come across this sql to get the closest date:
abs(DATEDIFF(minute, Date_Column, '2010/10/01'))
but couldn't figure out how to incorporate into the query properly...
Thanks
You can try this:
DECLARE @Date DATE = '10/01/2010';
WITH cte AS
(
    SELECT ID, LinkedID, ABS(DATEDIFF(DD, @Date, [Date])) AS diff,
           ROW_NUMBER() OVER (PARTITION BY LinkedID ORDER BY ABS(DATEDIFF(DD, @Date, [Date]))) AS SEQUENCE
    FROM MyTable
)
SELECT *
FROM cte
WHERE SEQUENCE = 1
ORDER BY ID;
You didn't indicate how you want to handle the case where multiple rows in a LinkedID group are equally close to the target date. This solution will only include one row, and in that case you can't guarantee which of the equally valid rows is included.
You can replace ROW_NUMBER() with RANK() in the query if you want to include all rows that tie for the closest value.
You want to look at the absolute value of the DATEDIFF function (http://msdn.microsoft.com/en-us/library/ms189794.aspx) by days.
The query can look something like this (not tested)
with absDates as
(
    select *, abs(DATEDIFF(day, Date_Column, '2010/10/01')) as days
    from MyTable
), mdays as
(
    select min(days) as mdays, linkedid
    from absDates
    group by linkedid
)
select *
from absDates
inner join mdays on absDates.linkedid = mdays.linkedid and absDates.days = mdays.mdays
You can also try to do it with a subquery in the select statement:
select [LinkedId],
       (select top 1 [Date]
        from [Table]
        where [LinkedId] = x.[LinkedId]
        order by abs(DATEDIFF(DAY, [Date], @date))) as [ClosestDate]
from [Table] x
group by [LinkedId]
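Since the desired output also includes the ID of the nearest row, one more pattern available on SQL Server 2005 is CROSS APPLY with TOP 1; this is only a sketch (not tested) reusing the placeholder names assumed in the answers above (MyTable, ID, [Date], LinkedID):
DECLARE @Date DATETIME;
SET @Date = '2010-10-01';

SELECT t.LinkedID, a.ID, a.[Date]
FROM (SELECT DISTINCT LinkedID FROM MyTable) t
CROSS APPLY (SELECT TOP 1 ID, [Date]
             FROM MyTable
             WHERE LinkedID = t.LinkedID
             ORDER BY ABS(DATEDIFF(DAY, [Date], @Date))) a;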