Update bigquery value based on partition by row number - sql

I have a table in which I have records on the wrong date. I want to update them to be the day before for "snapshot_date". I have written the query to select the values I want to update the date for, but I don't know how to write the update query to change it to the previous day.
Query to select problematic records
Select * FROM(
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY Period, User_Struct) rn
FROM `XXX.YYY.TABLE`
where Snapshot_Date = '2021-10-04'
order by Period, User_Struct, Num_Active_Users asc
) where rn = 1

Using DATE_SUB you can get the previous day, i.e.
SELECT DATE_SUB(CAST('2021-10-04' AS DATE), INTERVAL 1 DAY)
will give 2021-10-03.
You may try the following, using the BigQuery UPDATE statement syntax:
UPDATE
`XXX.YYY.TABLE` t0
SET
t0.Snapshot_Date = DATE_SUB(t2.Snapshot_Date, INTERVAL 1 DAY)
FROM (
SELECT * FROM(
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY Period, User_Struct) rn
FROM
`XXX.YYY.TABLE`
WHERE
Snapshot_Date = '2021-10-04'
ORDER BY -- recommend removing order by here and use recommendation below for row_number
Period, User_Struct, Num_Active_Users asc
) t1
WHERE rn = 1
) t2
WHERE
t0.Snapshot_Date = t2.Snapshot_Date AND -- include other columns to match/join subquery with main table on
You should also specify how the rows are ordered inside ROW_NUMBER, since without an ORDER BY in the OVER clause there is no guarantee which row gets rn = 1, e.g.
ROW_NUMBER() OVER (PARTITION BY Period, User_Struct ORDER BY Num_Active_Users asc)
if this generates the same/desired results.
Let me know if this works for you.
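As a rough, runnable sketch of the same idea (not BigQuery itself): the snippet below uses Python's built-in sqlite3 module as a stand-in, so date(..., '-1 day') replaces DATE_SUB. The table name snapshots, the id key column and the sample rows are all assumptions made up for illustration. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

def shift_first_rows(conn):
    """Move the rn = 1 row of each (Period, User_Struct) group back one day."""
    conn.execute(
        """
        UPDATE snapshots
        SET Snapshot_Date = date(Snapshot_Date, '-1 day')
        WHERE id IN (
            SELECT id FROM (
                SELECT id,
                       ROW_NUMBER() OVER (
                           PARTITION BY Period, User_Struct
                           ORDER BY Num_Active_Users ASC
                       ) AS rn
                FROM snapshots
                WHERE Snapshot_Date = '2021-10-04'
            )
            WHERE rn = 1
        )
        """
    )

conn = sqlite3.connect(":memory:")
# Hypothetical schema: id is an assumed unique key used to match rows back.
conn.execute(
    "CREATE TABLE snapshots (id INTEGER PRIMARY KEY, Period TEXT, "
    "User_Struct TEXT, Num_Active_Users INTEGER, Snapshot_Date TEXT)"
)
conn.executemany(
    "INSERT INTO snapshots VALUES (?, ?, ?, ?, ?)",
    [
        (1, "P1", "A", 5, "2021-10-04"),
        (2, "P1", "A", 9, "2021-10-04"),
        (3, "P1", "B", 7, "2021-10-04"),
        (4, "P1", "B", 3, "2021-10-04"),
        (5, "P1", "A", 1, "2021-10-05"),
    ],
)
shift_first_rows(conn)
updated = dict(conn.execute("SELECT id, Snapshot_Date FROM snapshots ORDER BY id"))
print(updated)
```

Matching on a unique key, as done here, avoids touching the other rows that share the same Snapshot_Date. If your table has no such key, you would need to match on whatever combination of columns identifies a row.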

Related

Getting 3 best Posts per Month with three different queries

I am having a hard time wrapping my head around the row_number function.
This is my SCHEMA :
I am trying to build a query that would output the top value for Post_impressions within a date range (I.E. a month) WHEN the RowNumber is set to 1, the second best value when it is set to 2 and so on.
Here is the query I came up with so far
SELECT Post_timestamp,
Post_impressions,
Post_tipo,
from
(SELECT Post_timestamp,
Post_impressions,
Post_tipo,
FORMAT_DATE("%Y-%m-%d",DATE_TRUNC(TIMESTAMP(Post_timestamp), DAY)) as TheDate,
row_number() OVER
(PARTITION BY FORMAT_DATE("%Y-%m-%d",DATE_TRUNC(TIMESTAMP(Post_timestamp), DAY)) ORDER BY Post_impressions DESC) AS RowNumber
from `***DATABASENAME***`)
WHERE RowNumber = 1 AND TheDate BETWEEN "2021-07-01" AND "2021-07-31";
Thanks for your help!
You're getting 31 rows because you're partitioning the subquery by day, and each partition has a RowNumber = 1. You could partition your query by month but I suspect that wouldn't address all your use cases, particularly when you want to look at a time period over multiple partitions.
Alternatively if your use case is limited to month over month, you can simply partition by the month.
SELECT Post_timestamp,
Post_impressions,
Post_tipo,
from
(SELECT Post_timestamp,
Post_impressions,
Post_tipo,
FORMAT_DATE("%Y-%m-%d",DATE_TRUNC(TIMESTAMP(Post_timestamp), day)) as TheDate,
row_number() OVER
(PARTITION BY FORMAT_DATE("%Y-%m-%d",DATE_TRUNC(TIMESTAMP(Post_timestamp), month)) ORDER BY Post_impressions DESC) AS RowNumber
from `***DATABASENAME***`)
WHERE RowNumber = 1 AND TheDate BETWEEN "2021-07-01" AND "2021-07-31";
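To see the month partition in action, here is a small sketch using Python's built-in sqlite3 module rather than BigQuery, so strftime('%Y-%m', ...) stands in for the FORMAT_DATE/DATE_TRUNC pair. The posts table, the sample rows and the helper nth_best_per_month are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (Post_timestamp TEXT, Post_impressions INTEGER, Post_tipo TEXT)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [
        ("2021-07-01 09:00:00", 120, "photo"),
        ("2021-07-15 12:30:00", 450, "video"),
        ("2021-07-31 18:00:00", 300, "link"),
        ("2021-08-02 08:00:00", 90, "photo"),
        ("2021-08-20 21:15:00", 610, "video"),
    ],
)

def nth_best_per_month(conn, n, start, end):
    """Return the n-th most viewed post of every month between start and end."""
    return conn.execute(
        """
        SELECT month_key, Post_timestamp, Post_impressions, Post_tipo
        FROM (
            SELECT Post_timestamp, Post_impressions, Post_tipo,
                   strftime('%Y-%m', Post_timestamp) AS month_key,
                   ROW_NUMBER() OVER (
                       PARTITION BY strftime('%Y-%m', Post_timestamp)
                       ORDER BY Post_impressions DESC
                   ) AS RowNumber
            FROM posts
            WHERE date(Post_timestamp) BETWEEN ? AND ?
        )
        WHERE RowNumber = ?
        ORDER BY month_key
        """,
        (start, end, n),
    ).fetchall()

best = nth_best_per_month(conn, 1, "2021-07-01", "2021-08-31")
second_july = nth_best_per_month(conn, 2, "2021-07-01", "2021-07-31")
print(best)
print(second_july)
```

The date filter sits inside the subquery here, so rows outside the range never take part in the ranking. With one row per month partition, a range spanning several months returns one row per month rather than one per day.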

ORACLE SQL: Find last minimum and maximum consecutive period

I have the sample data set below, which lists the water meters that were not working for a specific reason over a certain period (January 2016 to December 2018).
I would like to have a query that retrieves the last consecutive period (its minimum and maximum month) in which the meter was not working within that range.
Any help will be greatly appreciated.
You have two options:
select code, to_char(min_period, 'yyyymm') min_period, to_char(max_period, 'yyyymm') max_period
from (
select code, min(period) min_period, max(period) max_period,
max(min(period)) over (partition by code) max_min_period
from (
select code, period, sum(flag) over (partition by code order by period) grp
from (
select code, period,
case when add_months(period, -1)
= lag(period) over (partition by code order by period)
then 0 else 1 end flag
from (select mrdg_acc_code code, to_date(mrdg_per_period, 'yyyymm') period from t)))
group by code, grp)
where min_period = max_min_period
Explanation:
flag the rows where the period is not equal to the previous period plus one month,
create a column grp which sums the flags cumulatively, so each consecutive run gets its own number,
group the data by code and grp, additionally finding the latest start of a period (max_min_period),
show only the rows where min_period = max_min_period.
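The steps above can be tried out with a small sketch in Python's built-in sqlite3 module (not Oracle, so add_months is replaced by an integer month number computed from the yyyymm value). The sample rows are made up, and for brevity the last run is picked with the highest grp per code instead of max(min(period)) over (...); treat both choices as assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (mrdg_acc_code TEXT, mrdg_per_period INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("A", 201601), ("A", 201602), ("A", 201605), ("A", 201606), ("A", 201607),
        ("B", 201711), ("B", 201712), ("B", 201801),
        ("C", 201803),
    ],
)

last_runs = conn.execute(
    """
    WITH numbered AS (
        -- yyyymm -> running month number, so December and January are adjacent
        SELECT mrdg_acc_code AS code,
               mrdg_per_period AS period,
               (mrdg_per_period / 100) * 12 + mrdg_per_period % 100 AS month_no
        FROM t
    ),
    flagged AS (
        -- flag = 1 starts a new run, flag = 0 continues the previous month
        SELECT code, period, month_no,
               CASE WHEN month_no - 1 =
                         LAG(month_no) OVER (PARTITION BY code ORDER BY month_no)
                    THEN 0 ELSE 1 END AS flag
        FROM numbered
    ),
    grouped AS (
        -- running sum of the flags numbers the runs 1, 2, 3, ... per code
        SELECT code, period,
               SUM(flag) OVER (PARTITION BY code ORDER BY month_no) AS grp
        FROM flagged
    ),
    islands AS (
        SELECT code, grp, MIN(period) AS min_period, MAX(period) AS max_period
        FROM grouped
        GROUP BY code, grp
    )
    SELECT code, min_period, max_period
    FROM islands i
    WHERE grp = (SELECT MAX(grp) FROM islands j WHERE j.code = i.code)
    ORDER BY code
    """
).fetchall()
print(last_runs)
```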
Second option is recursive CTE available in Oracle 11g and above:
with
data(period, code) as (
select to_date(mrdg_per_period, 'yyyymm'), mrdg_acc_code from t
where mrdg_per_period between 201601 and 201812),
cte (period, code) as (
select to_char(period, 'yyyymm'), code from data
where (period, code) in (select max(period), code from data group by code)
union all
select to_char(data.period, 'yyyymm'), cte.code
from cte
join data on data.code = cte.code
and data.period = add_months(to_date(cte.period, 'yyyymm'), -1))
select code, min(period) min_period, max(period) max_period
from cte group by code
Explanation:
subquery data keeps only the rows from 2016 to 2018, additionally converting period to a date. We need this for the add_months function to work.
cte is recursive. The anchor finds the starting rows, those with the maximum period for each code. After union all comes the recursive member, which looks for the row one month older than the current one. If it finds one, it moves on to that row; if not, it stops.
the final select groups the data. Notice that periods which were not consecutive were rejected by cte.
Though recursive queries are slower than traditional ones, there can be scenarios where the second solution is better.
Here is the dbfiddle demo for both queries. Good luck.
Use aggregate functions with GROUP BY:
select max(mrdg_per_period) mrdg_per_period, mrdg_acc_code, max(mrdg_date_read), rea_Desc, min(mrdg_per_period) not_working_as_from
from tablename
group by mrdg_acc_code, rea_Desc
This is a bit tricky. This is a gap-and-islands problem. To get all continuous periods, it will help if you have an enumeration of months. So, convert the period to a number of months and then subtract a sequence generated using row_number(). The difference is constant for a group of adjacent months.
This looks like:
select acc_code, min(period), max(period)
from (select t.*,
row_number() over (partition by acc_code order by period_num) as seqnum
from (select t.*, floor(period / 100) * 12 + mod(period, 100) as period_num
from t
) t
where rea_desc = 'METER NOT WORKING'
) t
group by acc_code, (period_num - seqnum);
Then, if you want the last one for each account, you can use a subquery:
select t.*
from (select acc_code, min(period), max(period),
row_number() over (partition by acc_code order by max(period) desc) as seqnum
from (select t.*,
row_number() over (partition by acc_code order by period_num) as seqnum
from (select t.*, floor(period / 100) * 12 + mod(period, 100) as period_num
from t
) t
where rea_desc = 'METER NOT WORKING'
) t
group by acc_code, (period_num - seqnum)
) t
where seqnum = 1;

Running count distinct

I am trying to see how the cumulative number of subscribers changed over time based on unique email addresses and date they were created. Below is an example of a table I am working with.
I am trying to turn it into the table below. Email 1#gmail.com was created twice and I would like to count it once. I cannot figure out how to generate the Running count distinct column.
Thanks for the help.
I would usually do this using row_number():
select date, count(*),
sum(count(*)) over (order by date),
sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by date)
from (select t.*,
row_number() over (partition by email order by date) as seqnum
from t
) t
group by date
order by date;
This is similar to the version using lag(). However, I get nervous using lag if the same email appears multiple times on the same date.
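For anyone who wants to check the row_number() idea on a toy data set, here is a sketch using Python's built-in sqlite3 module. The subscribers table, its created column and the example.com addresses are assumptions for illustration. The aggregation is split into two steps (daily counts first, running sums second) instead of nesting sum(sum(...)) over, purely to keep the sketch easy to follow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscribers (email TEXT, created TEXT)")
conn.executemany(
    "INSERT INTO subscribers VALUES (?, ?)",
    [
        ("a@example.com", "2021-01-01"),
        ("b@example.com", "2021-01-01"),
        ("a@example.com", "2021-01-02"),  # repeat of an earlier email
        ("c@example.com", "2021-01-02"),
        ("b@example.com", "2021-01-03"),  # repeat of an earlier email
        ("d@example.com", "2021-01-03"),
        ("d@example.com", "2021-01-03"),  # same email twice on the same date
    ],
)

running = conn.execute(
    """
    WITH ranked AS (
        -- seqnum = 1 marks exactly one row per email: its first appearance
        SELECT created, email,
               ROW_NUMBER() OVER (PARTITION BY email ORDER BY created) AS seqnum
        FROM subscribers
    ),
    daily AS (
        SELECT created,
               COUNT(*) AS emails_created,
               SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) AS new_emails
        FROM ranked
        GROUP BY created
    )
    SELECT created,
           emails_created,
           SUM(emails_created) OVER (ORDER BY created) AS cumulative_sum,
           SUM(new_emails) OVER (ORDER BY created) AS running_count_distinct
    FROM daily
    ORDER BY created
    """
).fetchall()
print(running)
```

Because row_number() hands out seqnum = 1 to a single row per email, the duplicate on the last date is counted once, which is the case that makes a lag()-based flag harder to reason about.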
Getting the total count and cumulative count is straightforward. To get the cumulative distinct count, use lag to check if the email had a row with a previous date, and set the flag to 0 so it would be ignored during a running sum.
select distinct dt
,count(*) over(partition by dt) as day_total
,count(*) over(order by dt) as cumsum
,sum(flag) over(order by dt) as cumdist
from (select t.*
,case when lag(dt) over(partition by email order by dt) is not null then 0 else 1 end as flag
from tbl t
) t
DEMO HERE
Here is a solution that uses neither sum() over nor lag(), and still produces the correct results.
Hence it may be simpler to read and to maintain.
select
t1.date_created,
(select count(*) from my_table where date_created = t1.date_created) emails_created,
(select count(*) from my_table where date_created <= t1.date_created) cumulative_sum,
(select count( distinct email) from my_table where date_created <= t1.date_created) running_count_distinct
from
(select distinct date_created from my_table) t1
order by 1

Postgres windowing (determine contiguous days)

Using Postgres 9.3, I'm trying to count the number of contiguous days of a certain weather type. If we assume we have a regular time series and weather report:
date|weather
"2016-02-01";"Sunny"
"2016-02-02";"Cloudy"
"2016-02-03";"Snow"
"2016-02-04";"Snow"
"2016-02-05";"Cloudy"
"2016-02-06";"Sunny"
"2016-02-07";"Sunny"
"2016-02-08";"Sunny"
"2016-02-09";"Snow"
"2016-02-10";"Snow"
I want something to count the contiguous days of the same weather. The results should look something like this:
date|weather|contiguous_days
"2016-02-01";"Sunny";1
"2016-02-02";"Cloudy";1
"2016-02-03";"Snow";1
"2016-02-04";"Snow";2
"2016-02-05";"Cloudy";1
"2016-02-06";"Sunny";1
"2016-02-07";"Sunny";2
"2016-02-08";"Sunny";3
"2016-02-09";"Snow";1
"2016-02-10";"Snow";2
I've been banging my head on this for a while trying to use windowing functions. At first, it seems like it should be a no-brainer, but then I found out it's much harder than expected.
Here is what I've tried...
Select date, weather, Row_Number() Over (partition by weather order by date)
from t_weather
Would it be easier to just compare the current row to the next? How would you do that while maintaining a count? Any thoughts, ideas, or even solutions would be helpful!
-Kip
You need to identify the contiguous days where the weather is the same. You can do this by adding a grouping identifier. There is a simple method: subtract a sequence of increasing numbers from the dates, and the result is constant for contiguous dates.
Once you have the grouping, the rest is row_number():
Select date, weather,
Row_Number() Over (partition by weather, grp order by date)
from (select w.*,
(date - row_number() over (partition by weather order by date) * interval '1 day') as grp
from t_weather w
) w;
The SQL Fiddle is here.
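The date-minus-row_number() trick can be checked against the sample data from the question with a small sketch in Python's built-in sqlite3 module. This is SQLite rather than Postgres, so the date arithmetic is done on an integer day number from julianday(), which is an assumption of this sketch and not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_weather (date TEXT, weather TEXT)")
conn.executemany(
    "INSERT INTO t_weather VALUES (?, ?)",
    [
        ("2016-02-01", "Sunny"), ("2016-02-02", "Cloudy"),
        ("2016-02-03", "Snow"), ("2016-02-04", "Snow"),
        ("2016-02-05", "Cloudy"), ("2016-02-06", "Sunny"),
        ("2016-02-07", "Sunny"), ("2016-02-08", "Sunny"),
        ("2016-02-09", "Snow"), ("2016-02-10", "Snow"),
    ],
)

rows = conn.execute(
    """
    SELECT date, weather,
           ROW_NUMBER() OVER (PARTITION BY weather, grp ORDER BY date) AS contiguous_days
    FROM (
        -- day number minus per-weather sequence number: constant within a
        -- run of consecutive days that share the same weather
        SELECT date, weather,
               CAST(julianday(date) AS INTEGER)
                 - ROW_NUMBER() OVER (PARTITION BY weather ORDER BY date) AS grp
        FROM t_weather
    )
    ORDER BY date
    """
).fetchall()
contiguous = [r[2] for r in rows]
print(contiguous)
```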
I'm not sure what the query engine is going to do when scanning multiple times across the same data set (kinda like calculating area under a curve), but this works...
WITH v(date, weather) AS (
VALUES
('2016-02-01'::date,'Sunny'::text),
('2016-02-02','Cloudy'),
('2016-02-03','Snow'),
('2016-02-04','Snow'),
('2016-02-05','Cloudy'),
('2016-02-06','Sunny'),
('2016-02-07','Sunny'),
('2016-02-08','Sunny'),
('2016-02-09','Snow'),
('2016-02-10','Snow') ),
changes AS (
SELECT date,
weather,
CASE WHEN lag(weather) OVER (ORDER BY date) = weather THEN 1 ELSE 0 END change
FROM v)
SELECT date
, weather
,(SELECT count(weather) -- number of times the weather didn't change
FROM changes v2
WHERE v2.date <= v1.date AND v2.weather = v1.weather
AND v2.date >= ( -- bounded between changes of weather
SELECT max(date)
FROM changes v3
WHERE change = 0
AND v3.weather = v1.weather
AND v3.date <= v1.date) --<-- here's the expensive part
) curve
FROM changes v1
Here is another approach based off of this answer.
First we add a change column that is 1 or 0 depending on whether the weather is different or not from the previous day.
Then we introduce a group_nr column by summing the change over an order by date. This produces a unique group number for each sequence of consecutive same-weather days since the sum is only incremented on the first day of each sequence.
Finally we do a row_number() over (partition by group_nr order by date) to produce the running count per group.
select date, weather, row_number() over (partition by group_nr order by date)
from (
select *, sum(change) over (order by date) as group_nr
from (
select *, (weather != lag(weather,1,'') over (order by date))::int as change
from tmp_weather
) t1
) t2;
sqlfiddle (uses equivalent WITH syntax)
You can accomplish this with a recursive CTE as follows:
WITH RECURSIVE CTE_ConsecutiveDays AS
(
SELECT
my_date,
weather,
1 AS consecutive_days
FROM My_Table T
WHERE
NOT EXISTS (SELECT * FROM My_Table T2 WHERE T2.my_date = T.my_date - INTERVAL '1 day' AND T2.weather = T.weather)
UNION ALL
SELECT
T.my_date,
T.weather,
CD.consecutive_days + 1
FROM
CTE_ConsecutiveDays CD
INNER JOIN My_Table T ON
T.my_date = CD.my_date + INTERVAL '1 day' AND
T.weather = CD.weather
)
SELECT *
FROM CTE_ConsecutiveDays
ORDER BY my_date;
Here's the SQL Fiddle to test: http://www.sqlfiddle.com/#!15/383e5/3

Last day of the month with a twist in SQLPLUS

I would appreciate a little expert help please.
In an SQL SELECT statement, I am trying to get the last day with data per month for the last year.
Example, I am easily able to get the last day of each month and join that to my data table, but the problem is, if the last day of the month does not have data, then there is no returned data. What I need is for the SELECT to return the last day with data for the month.
This is probably easy to do, but to be honest, my brain fart is starting to hurt.
I've attached the select below that works for returning the data for only the last day of the month for the last 12 months.
Thanks in advance for your help!
SELECT fd.cust_id,fd.server_name,fd.instance_name,
TRUNC(fd.coll_date) AS coll_date,fd.column_name
FROM super_table fd,
(SELECT TRUNC(daterange,'MM')-1 first_of_month
FROM (
select TRUNC(sysdate-365,'MM') + level as DateRange
from dual
connect by level<=365)
GROUP BY TRUNC(daterange,'MM')) fom
WHERE fd.cust_id = :CUST_ID
AND fd.coll_date > SYSDATE-400
AND TRUNC(fd.coll_date) = fom.first_of_month
GROUP BY fd.cust_id,fd.server_name,fd.instance_name,
TRUNC(fd.coll_date),fd.column_name
ORDER BY fd.server_name,fd.instance_name,TRUNC(fd.coll_date)
You probably need to group your data so that each month's data is in the group, and then within the group select the maximum date present. The sub-query might be:
SELECT MAX(coll_date) AS last_day_of_month
FROM Super_Table AS fd
GROUP BY YEAR(coll_date) * 100 + MONTH(coll_date);
This presumes that the functions YEAR() and MONTH() exist to extract the year and month from a date as an integer value. Clearly, this doesn't constrain the range of dates - you can do that, too. If you don't have the functions in Oracle, then you do some sort of manipulation to get the equivalent result.
Using information from Rhose (thanks):
SELECT MAX(coll_date) AS last_day_of_month
FROM Super_Table fd
GROUP BY TO_CHAR(coll_date, 'YYYYMM');
This achieves the same net result, putting all dates from the same calendar month into a group and then determining the maximum value present within that group.
Here's another approach, if ANSI row_number() is supported:
with RevDayRanked(itemDate,rn) as (
select
cast(coll_date as date),
row_number() over (
partition by datediff(month,coll_date,'2000-01-01') -- rewrite datediff as needed for your platform
order by coll_date desc
)
from super_table
)
select itemDate
from RevDayRanked
where rn = 1;
Rows numbered 1 will be nondeterministically chosen among rows on the last active date of the month, so you don't need distinct. If you want information out of the table for all rows on these dates, use rank() over days instead of row_number() over coll_date values, so a value of 1 appears for any row on the last active date of the month, and select the additional columns you need:
with RevDayRanked(cust_id, server_name, coll_date, rk) as (
select
cust_id, server_name, coll_date,
rank() over (
partition by datediff(month,coll_date,'2000-01-01')
order by cast(coll_date as date) desc
)
from super_table
)
select cust_id, server_name, coll_date
from RevDayRanked
where rk = 1;
If row_number() and rank() aren't supported, another approach is this (for the second query above). Select all rows from your table for which there's no row in the table from a later day in the same month.
select
cust_id, server_name, coll_date
from super_table as ST1
where not exists (
select *
from super_table as ST2
where datediff(month,ST1.coll_date,ST2.coll_date) = 0
and cast(ST2.coll_date as date) > cast(ST1.coll_date as date)
)
If you have to do this kind of thing a lot, see if you can create an index over computed columns that hold cast(coll_date as date) and a month indicator like datediff(month,'2001-01-01',coll_date). That'll make more of the predicates SARGs.
Putting the above pieces together, would something like this work for you?
SELECT fd.cust_id,
fd.server_name,
fd.instance_name,
TRUNC(fd.coll_date) AS coll_date,
fd.column_name
FROM super_table fd
WHERE fd.cust_id = :CUST_ID
AND TRUNC(fd.coll_date) IN (
SELECT MAX(TRUNC(coll_date))
FROM super_table
WHERE coll_date > SYSDATE - 400
AND cust_id = :CUST_ID
GROUP BY TO_CHAR(coll_date,'YYYYMM')
)
GROUP BY fd.cust_id,fd.server_name,fd.instance_name,TRUNC(fd.coll_date),fd.column_name
ORDER BY fd.server_name,fd.instance_name,TRUNC(fd.coll_date)
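A quick way to convince yourself that the MAX-per-month subquery picks the last day that actually has data is a sketch like the following, written against Python's built-in sqlite3 module instead of Oracle. Here strftime('%Y%m', ...) and date(...) stand in for TO_CHAR and TRUNC, and the metric column and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE super_table (cust_id INTEGER, coll_date TEXT, metric INTEGER)")
conn.executemany(
    "INSERT INTO super_table VALUES (?, ?, ?)",
    [
        (1, "2021-01-15 10:00:00", 10),
        (1, "2021-01-29 23:00:00", 11),  # no data on Jan 30 or 31
        (1, "2021-02-10 08:00:00", 12),
        (1, "2021-02-28 08:00:00", 13),
        (1, "2021-03-30 08:00:00", 14),
        (1, "2021-03-31 08:00:00", 15),
        (2, "2021-01-31 08:00:00", 99),  # other customer, must not leak in
    ],
)

def last_day_rows(conn, cust_id):
    """Rows of one customer that fall on the last day with data of each month."""
    return conn.execute(
        """
        SELECT cust_id, date(coll_date) AS coll_day, metric
        FROM super_table
        WHERE cust_id = ?
          AND date(coll_date) IN (
              SELECT MAX(date(coll_date))
              FROM super_table
              WHERE cust_id = ?
              GROUP BY strftime('%Y%m', coll_date)
          )
        ORDER BY coll_day
        """,
        (cust_id, cust_id),
    ).fetchall()

result = last_day_rows(conn, 1)
print(result)
```

Note that the customer filter is applied inside the subquery as well; otherwise another customer's later date in the same month could hide the month entirely.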