Getting 3 best Posts per Month with three different queries - google-bigquery

I am having a hard time wrapping my head around the row_number function.
This is my SCHEMA :
I am trying to build a query that would output the top value for Post_impressions within a date range (I.E. a month) WHEN the RowNumber is set to 1, the second best value when it is set to 2 and so on.
Here is the query I came up with so far
SELECT Post_timestamp,
Post_impressions,
Post_tipo,
from
(SELECT Post_timestamp,
Post_impressions,
Post_tipo,
FORMAT_DATE("%Y-%m-%d",DATE_TRUNC(TIMESTAMP(Post_timestamp), DAY)) as TheDate,
row_number() OVER
(PARTITION BY FORMAT_DATE("%Y-%m-%d",DATE_TRUNC(TIMESTAMP(Post_timestamp), DAY)) ORDER BY Post_impressions DESC) AS RowNumber
from `***DATABASENAME***`
WHERE RowNumber = 1 AND TheDate BETWEEN "2021-07-01" AND "2021-07-31";
Thans for your help!

You're getting 31 rows because you're partitioning the subquery by day, and each partition has a RowNumber = 1. You could partition your query by month but I suspect that wouldn't address all your use cases, particularly when you want to look at a time period over multiple partitions.
Alternatively if your use case is limited to month over month, you can simply partition by the month.
SELECT Post_timestamp,
Post_impressions,
Post_tipo,
from
(SELECT Post_timestamp,
Post_impressions,
Post_tipo,
FORMAT_DATE("%Y-%m-%d",DATE_TRUNC(TIMESTAMP(Post_timestamp), day)) as TheDate,
row_number() OVER
(PARTITION BY FORMAT_DATE("%Y-%m-%d",DATE_TRUNC(TIMESTAMP(Post_timestamp), month)) ORDER BY Post_impressions DESC) AS RowNumber
from `***DATABASENAME***`
WHERE RowNumber = 1 AND TheDate BETWEEN "2021-07-01" AND "2021-07-31";

Related

Update bigquery value based on partition by row number

I have a table in which I have records on the wrong date. I want to update them to be the day before for "snapshot_date". I have written the query to select the values I want to update the date for, but I don't know how to write the update query to change it to the previous day.
See screenshot
Query to select problematic records
Select * FROM(
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY Period, User_Struct) rn
FROM `XXX.YYY.TABLE`
where Snapshot_Date = '2021-10-04'
order by Period, User_Struct, Num_Active_Users asc
) where rn = 1
Using DATE_SUB you may get the previous day i.e.
SELECT DATE_SUB(cast('2021-10-04' as DATE), interval '1' day)
will give 2021-10-03.
You may try the following using Big Query Update Statement Syntax
UPDATE
`XXX.YYY.TABLE` t0
SET
t0.Snapshot_Date = DATE_SUB(t2.Snapshot_Date, interval '1' day)
FROM (
SELECT * FROM(
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY Period, User_Struct) rn
FROM
`XXX.YYY.TABLE`
WHERE
Snapshot_Date = '2021-10-04'
ORDER BY -- recommend removing order by here and use recommendation below for row_number
Period, User_Struct, Num_Active_Users asc
) t1
WHERE rn = 1
) t2
WHERE
t0.Snapshot_Date = t2.Snapshot_Date AND -- include other columns to match/join subquery with main table on
You should also specify how your rows should be ordered when using ROW_NUMBER eg
ROW_NUMBER() OVER (PARTITION BY Period, User_Struct ORDER BY Num_Active_Users asc)
if this generates the same/desired results.
Let me know if this works for you.

Postgres windowing (determine contiguous days)

Using Postgres 9.3, I'm trying to count the number of contiguous days of a certain weather type. If we assume we have a regular time series and weather report:
date|weather
"2016-02-01";"Sunny"
"2016-02-02";"Cloudy"
"2016-02-03";"Snow"
"2016-02-04";"Snow"
"2016-02-05";"Cloudy"
"2016-02-06";"Sunny"
"2016-02-07";"Sunny"
"2016-02-08";"Sunny"
"2016-02-09";"Snow"
"2016-02-10";"Snow"
I want something count the contiguous days of the same weather. The results should look something like this:
date|weather|contiguous_days
"2016-02-01";"Sunny";1
"2016-02-02";"Cloudy";1
"2016-02-03";"Snow";1
"2016-02-04";"Snow";2
"2016-02-05";"Cloudy";1
"2016-02-06";"Sunny";1
"2016-02-07";"Sunny";2
"2016-02-08";"Sunny";3
"2016-02-09";"Snow";1
"2016-02-10";"Snow";2
I've been banging my head on this for a while trying to use windowing functions. At first, it seems like it should be no-brainer, but then I found out its much harder than expected.
Here is what I've tried...
Select date, weather, Row_Number() Over (partition by weather order by date)
from t_weather
Would it be better just easier to compare the current row to the next? How would you do that while maintaining a count? Any thoughts, ideas, or even solutions would be helpful!
-Kip
You need to identify the contiguous where the weather is the same. You can do this by adding a grouping identifier. There is a simple method: subtract a sequence of increasing numbers from the dates and it is constant for contiguous dates.
One you have the grouping, the rest is row_number():
Select date, weather,
Row_Number() Over (partition by weather, grp order by date)
from (select w.*,
(date - row_number() over (partition by weather order by date) * interval '1 day') as grp
from t_weather w
) w;
The SQL Fiddle is here.
I'm not sure what the query engine is going to do when scanning multiple times across the same data set (kinda like calculating area under a curve), but this works...
WITH v(date, weather) AS (
VALUES
('2016-02-01'::date,'Sunny'::text),
('2016-02-02','Cloudy'),
('2016-02-03','Snow'),
('2016-02-04','Snow'),
('2016-02-05','Cloudy'),
('2016-02-06','Sunny'),
('2016-02-07','Sunny'),
('2016-02-08','Sunny'),
('2016-02-09','Snow'),
('2016-02-10','Snow') ),
changes AS (
SELECT date,
weather,
CASE WHEN lag(weather) OVER () = weather THEN 1 ELSE 0 END change
FROM v)
SELECT date
, weather
,(SELECT count(weather) -- number of times the weather didn't change
FROM changes v2
WHERE v2.date <= v1.date AND v2.weather = v1.weather
AND v2.date >= ( -- bounded between changes of weather
SELECT max(date)
FROM changes v3
WHERE change = 0
AND v3.weather = v1.weather
AND v3.date <= v1.date) --<-- here's the expensive part
) curve
FROM changes v1
Here is another approach based off of this answer.
First we add a change column that is 1 or 0 depending on whether the weather is different or not from the previous day.
Then we introduce a group_nr column by summing the change over an order by date. This produces a unique group number for each sequence of consecutive same-weather days since the sum is only incremented on the first day of each sequence.
Finally we do a row_number() over (partition by group_nr order by date) to produce the running count per group.
select date, weather, row_number() over (partition by group_nr order by date)
from (
select *, sum(change) over (order by date) as group_nr
from (
select *, (weather != lag(weather,1,'') over (order by date))::int as change
from tmp_weather
) t1
) t2;
sqlfiddle (uses equivalent WITH syntax)
You can accomplish this with a recursive CTE as follows:
WITH RECURSIVE CTE_ConsecutiveDays AS
(
SELECT
my_date,
weather,
1 AS consecutive_days
FROM My_Table T
WHERE
NOT EXISTS (SELECT * FROM My_Table T2 WHERE T2.my_date = T.my_date - INTERVAL '1 day' AND T2.weather = T.weather)
UNION ALL
SELECT
T.my_date,
T.weather,
CD.consecutive_days + 1
FROM
CTE_ConsecutiveDays CD
INNER JOIN My_Table T ON
T.my_date = CD.my_date + INTERVAL '1 day' AND
T.weather = CD.weather
)
SELECT *
FROM CTE_ConsecutiveDays
ORDER BY my_date;
Here's the SQL Fiddle to test: http://www.sqlfiddle.com/#!15/383e5/3

Tagging consecutive days

Supposedly I have data something like this:
ID,DATE
101,01jan2014
101,02jan2014
101,03jan2014
101,07jan2014
101,08jan2014
101,10jan2014
101,12jan2014
101,13jan2014
102,08jan2014
102,09jan2014
102,10jan2014
102,15jan2014
How could I efficiently code this in Greenplum SQL such that I can have a grouping of consecutive days similar to the one below:
ID,DATE,PERIOD
101,01jan2014,1
101,02jan2014,1
101,03jan2014,1
101,07jan2014,2
101,08jan2014,2
101,10jan2014,3
101,12jan2014,4
101,13jan2014,4
102,08jan2014,1
102,09jan2014,1
102,10jan2014,1
102,15jan2014,2
You can do this using row_number(). For a consecutive group, the difference between the date and the row_number() is a constant. Then, use dense_rank() to assign the period:
select id, date,
dense_rank() over (partition by id order by grp) as period
from (select t.*,
date - row_number() over (partition by id order by date) * 'interval 1 day'
from table t
) t

How can I select one row for each week in a date range that spans more than a year?

In my postgreSQL data base, I have a table with columns of dates and prices. ('transdate' and 'price')
I would like to form a query which selects one row for each week over a date range which spans more than one year.
From another question/answer here, I implemented this code which works for date ranges of less than a year:
;with cte as
(
select *,
row_number() over (partition by Extract (week from transdate) order by transdate desc) as rn
from "tablename" where transdate between '06-01-1999' and '06-01-1999'::timestamp + `'50 week'::interval
)
select transdate, price from cte where rn = 1 order by transdate;
However, when I extend the interval greater than 50 weeks, it still only selects a max of 12 months.
How can I re-write this code to select one date/price from every week in the range?
Your problem is that week numbers wrap around at year boundaries but you want to look at the week number and the year at the same time. Lucky for you, you can PARTITION BY several things at once:
row_number() over (
partition by extract(week from transdate),
extract(year from transdate)
order by transdate desc
) as rn

Last day of the month with a twist in SQLPLUS

I would appreciate a little expert help please.
in an SQL SELECT statement I am trying to get the last day with data per month for the last year.
Example, I am easily able to get the last day of each month and join that to my data table, but the problem is, if the last day of the month does not have data, then there is no returned data. What I need is for the SELECT to return the last day with data for the month.
This is probably easy to do, but to be honest, my brain fart is starting to hurt.
I've attached the select below that works for returning the data for only the last day of the month for the last 12 months.
Thanks in advance for your help!
SELECT fd.cust_id,fd.server_name,fd.instance_name,
TRUNC(fd.coll_date) AS coll_date,fd.column_name
FROM super_table fd,
(SELECT TRUNC(daterange,'MM')-1 first_of_month
FROM (
select TRUNC(sysdate-365,'MM') + level as DateRange
from dual
connect by level<=365)
GROUP BY TRUNC(daterange,'MM')) fom
WHERE fd.cust_id = :CUST_ID
AND fd.coll_date > SYSDATE-400
AND TRUNC(fd.coll_date) = fom.first_of_month
GROUP BY fd.cust_id,fd.server_name,fd.instance_name,
TRUNC(fd.coll_date),fd.column_name
ORDER BY fd.server_name,fd.instance_name,TRUNC(fd.coll_date)
You probably need to group your data so that each month's data is in the group, and then within the group select the maximum date present. The sub-query might be:
SELECT MAX(coll_date) AS last_day_of_month
FROM Super_Table AS fd
GROUP BY YEAR(coll_date) * 100 + MONTH(coll_date);
This presumes that the functions YEAR() and MONTH() exist to extract the year and month from a date as an integer value. Clearly, this doesn't constrain the range of dates - you can do that, too. If you don't have the functions in Oracle, then you do some sort of manipulation to get the equivalent result.
Using information from Rhose (thanks):
SELECT MAX(coll_date) AS last_day_of_month
FROM Super_Table AS fd
GROUP BY TO_CHAR(coll_date, 'YYYYMM');
This achieves the same net result, putting all dates from the same calendar month into a group and then determining the maximum value present within that group.
Here's another approach, if ANSI row_number() is supported:
with RevDayRanked(itemDate,rn) as (
select
cast(coll_date as date),
row_number() over (
partition by datediff(month,coll_date,'2000-01-01') -- rewrite datediff as needed for your platform
order by coll_date desc
)
from super_table
)
select itemDate
from RevDayRanked
where rn = 1;
Rows numbered 1 will be nondeterministically chosen among rows on the last active date of the month, so you don't need distinct. If you want information out of the table for all rows on these dates, use rank() over days instead of row_number() over coll_date values, so a value of 1 appears for any row on the last active date of the month, and select the additional columns you need:
with RevDayRanked(cust_id, server_name, coll_date, rk) as (
select
cust_id, server_name, coll_date,
rank() over (
partition by datediff(month,coll_date,'2000-01-01')
order by cast(coll_date as date) desc
)
from super_table
)
select cust_id, server_name, coll_date
from RevDayRanked
where rk = 1;
If row_number() and rank() aren't supported, another approach is this (for the second query above). Select all rows from your table for which there's no row in the table from a later day in the same month.
select
cust_id, server_name, coll_date
from super_table as ST1
where not exists (
select *
from super_table as ST2
where datediff(month,ST1.coll_date,ST2.coll_date) = 0
and cast(ST2.coll_date as date) > cast(ST1.coll_date as date)
)
If you have to do this kind of thing a lot, see if you can create an index over computed columns that hold cast(coll_date as date) and a month indicator like datediff(month,'2001-01-01',coll_date). That'll make more of the predicates SARGs.
Putting the above pieces together, would something like this work for you?
SELECT fd.cust_id,
fd.server_name,
fd.instance_name,
TRUNC(fd.coll_date) AS coll_date,
fd.column_name
FROM super_table fd,
WHERE fd.cust_id = :CUST_ID
AND TRUNC(fd.coll_date) IN (
SELECT MAX(TRUNC(coll_date))
FROM super_table
WHERE coll_date > SYSDATE - 400
AND cust_id = :CUST_ID
GROUP BY TO_CHAR(coll_date,'YYYYMM')
)
GROUP BY fd.cust_id,fd.server_name,fd.instance_name,TRUNC(fd.coll_date),fd.column_name
ORDER BY fd.server_name,fd.instance_name,TRUNC(fd.coll_date)