I want to count the number of consecutive records for which a field value is stale, based on the rates table.
In the data below, records 3, 4, and 5 have the same rate, 0.770827, so the rate has been stale for 3 days, and the prior rate before it went stale is 0.770886. I would like help writing a query that returns the number of records with a stale rate, and also the prior rate. The sample only shows CAD to USD, but we need the same across different currency pairs.
Any assistance would be greatly appreciated.
Expected Output
When the value changes, mark the row with 1, otherwise 0. Then take a running sum of this column (flg); you now have consecutive groups (grp). Use grp to aggregate: count, and show the min and max dates:
dbfiddle demo
select to_cur, from_cur, min(dt) dt_from, max(dt) dt_to, rate, count(1) cnt
from (
  select dt, to_cur, from_cur, rate,
         sum(flg) over (partition by to_cur, from_cur order by dt) grp
  from (
    select dt, to_cur, from_cur, rate,
           case lag(rate) over (partition by to_cur, from_cur order by dt)
                when rate then 0 else 1 end flg
    from t))
group by grp, to_cur, from_cur, rate
order by from_cur, to_cur, min(dt)
If you want only specific groups after the group by, add:
having count(1) >= 3
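As a sanity check, here is the flag-and-sum technique run on a tiny made-up rates table, using SQLite's window functions through Python's sqlite3 (the dates and the second rate value are invented for the sketch):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table t (dt text, from_cur text, to_cur text, rate real);
insert into t values
  ('2020-01-01','CAD','USD',0.770886),
  ('2020-01-02','CAD','USD',0.770886),
  ('2020-01-03','CAD','USD',0.770827),
  ('2020-01-04','CAD','USD',0.770827),
  ('2020-01-05','CAD','USD',0.770827);
""")

rows = con.execute("""
select to_cur, from_cur, min(dt) dt_from, max(dt) dt_to, rate, count(*) cnt
from (select dt, to_cur, from_cur, rate,
             -- running sum of flags: constant within a run of equal rates
             sum(flg) over (partition by to_cur, from_cur order by dt) grp
      from (select dt, to_cur, from_cur, rate,
                   -- 1 when the rate changed vs. the previous day, else 0
                   case lag(rate) over (partition by to_cur, from_cur order by dt)
                        when rate then 0 else 1 end flg
            from t))
group by grp, to_cur, from_cur, rate
order by from_cur, to_cur, min(dt)
""").fetchall()
print(rows)
```

The three 0.770827 rows collapse into one island with cnt 3, preceded by the 0.770886 island.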
This is a gaps and islands problem.
You can use lag() to retrieve the previous rate for the same currency tuple, then use a window sum to define groups of consecutive records with the same rate. Then you can aggregate the groups, recovering the prior rate with lag() again. The last step is to filter on groups that have at least 3 records.
select *
from (
select
from_cur,
to_cur,
rate,
max(date) max_date,
lag(rate) over(partition by from_cur, to_cur order by max(date)) lag_rate_grp,
count(*) cnt
from (
select
t.*,
sum(case when rate = lag_rate then 0 else 1 end) over(partition by from_cur, to_cur order by date) grp
from (
select
t.*,
lag(rate) over(partition by from_cur, to_cur order by date) lag_rate
from mytable t
) t
) t
group by from_cur, to_cur, rate, grp
) t
where cnt >= 3
order by from_cur, to_cur, max_date
Actually, using the difference between row numbers can save one level of nesting:
select *
from (
select
from_cur,
to_cur,
rate,
max(date) max_date,
lag(rate) over(partition by from_cur, to_cur order by max(date)) lag_rate_grp,
count(*) cnt
from (
select
t.*,
row_number() over(partition by from_cur, to_cur order by date) rn1,
row_number() over(partition by from_cur, to_cur, rate order by date) rn2
from mytable t
) t
group by from_cur, to_cur, rate, rn1 - rn2
) t
where cnt >= 3
order by from_cur, to_cur, max_date
If you want only the earliest record per currency tuple, then you can use row_number():
select *
from (
select
from_cur,
to_cur,
rate,
max(date) max_date,
lag(rate) over(partition by from_cur, to_cur order by max(date)) lag_rate_grp,
count(*) cnt,
row_number() over(partition by from_cur, to_cur, case when count(*) >= 3 then 0 else 1 end order by max(date)) rn
from (
select
t.*,
row_number() over(partition by from_cur, to_cur order by date) rn1,
row_number() over(partition by from_cur, to_cur, rate order by date) rn2
from mytable t
) t
group by from_cur, to_cur, rate, rn1 - rn2
) t
where cnt >= 3 and rn = 1
order by from_cur, to_cur
This is a gaps-and-islands problem, but I would solve it just by subtracting a sequence of days from the date, and then aggregating:
select to_cur, from_cur, rate, min(date), max(date),
count(*) as days_stale
from (select r.*,
row_number() over (partition by to_cur, from_cur, rate order by date) as seqnum
from rates r
) r
group by to_cur, from_cur, rate, (date - seqnum * interval '1' day)
I have tried row_number() with min and max logic, but it fails when a vehicle visits the same location a second time.
This is a gaps-and-islands problem, where you want to group together adjacent rows that have the same vehicle and location.
Here is one approach that uses the difference between row numbers to identify the islands:
select vehicle_no, location, min(time) starttime, max(time) endtime,
max(time) - min(time) timediff
from (
select t.*,
row_number() over(partition by vehicle_no order by time) rn1,
row_number() over(partition by vehicle_no, location order by time) rn2
from mytable t
) t
group by vehicle_no, location, rn1 - rn2
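The row-number-difference trick can be verified on a tiny made-up table (SQLite through Python's sqlite3; integer times keep the subtraction simple). The second visit to location A lands in its own island:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table mytable (vehicle_no text, location text, time int);
insert into mytable values
  ('V1','A',1),('V1','A',2),('V1','B',3),('V1','A',4),('V1','A',5);
""")

rows = con.execute("""
select vehicle_no, location, min(time) starttime, max(time) endtime,
       max(time) - min(time) timediff
from (select t.*,
             -- rn1 counts all rows per vehicle; rn2 restarts per location;
             -- their difference is constant within each consecutive stay
             row_number() over(partition by vehicle_no order by time) rn1,
             row_number() over(partition by vehicle_no, location order by time) rn2
      from mytable t
     ) t
group by vehicle_no, location, rn1 - rn2
order by starttime
""").fetchall()
print(rows)
```

The two stays at A (times 1-2 and 4-5) come out as separate rows rather than being merged.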
This is a gaps-and-islands problem. In this case, you can use the difference of row_number()s to identify the "islands":
select vehicle_no, location, min(time), max(time),
max(time) - min(time) as at_location
from (select t.*,
row_number() over (partition by vehicle_no order by time) as seqnum,
row_number() over (partition by vehicle_no, location order by time) as seqnum_2
from t
) t
group by vehicle_no, location, (seqnum - seqnum_2)
I am trying to solve a query that I have already solved in SQL Server.
Write a SQL query to find continuous dates appear at least three times.
SQLfiddle
Table: orders
*------------*
| mdate |
*------------*
|'2012-05-01'|
|'2012-05-02'|
|'2012-05-03'|
|'2012-05-06'|
|'2012-05-07'|
|'2012-05-10'|
|'2012-05-11'|
*------------*
SQL Server:
select
mdate
from
(
select
mdate,
count(gap) over (partition by gap) as total
from
(
select
mdate,
dateadd(day, - row_number() over (order by mdate), mdate) as gap
from orders
) t
) tt
where total >= 3
Result:
*------------*
| mdate |
*------------*
|'2012-05-01'|
|'2012-05-02'|
|'2012-05-03'|
*------------*
I cannot use the dateadd() function in PostgreSQL, so how can I achieve the same result?
In SQL Server, this would look like:
select mdate
from (select o.*,
count(*) over (partition by dateadd(day, - seqnum, mdate)) as cnt
from (select o.*,
row_number() over (order by mdate) as seqnum
from orders o
) o
) o
where cnt >= 3;
In Postgres:
select mdate
from (select o.*,
count(*) over (partition by mdate - seqnum * interval '1 day') as cnt
from (select o.*,
row_number() over (order by mdate) as seqnum
from orders o
) o
) o
where cnt >= 3;
The only difference is the date arithmetic.
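SQLite has yet another date syntax, so as a runnable check of the same idea, here is the query through Python's sqlite3, where date(mdate, '-' || seqnum || ' day') plays the role of the interval subtraction:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table orders (mdate text);
insert into orders values
  ('2012-05-01'),('2012-05-02'),('2012-05-03'),
  ('2012-05-06'),('2012-05-07'),('2012-05-10'),('2012-05-11');
""")

rows = con.execute("""
select mdate
from (select o.*,
             -- rows in the same consecutive run share the same anchor date
             count(*) over (partition by date(mdate, '-' || seqnum || ' day')) as cnt
      from (select o.*,
                   row_number() over (order by mdate) as seqnum
            from orders o
           ) o
     ) o
where cnt >= 3
order by mdate
""").fetchall()
print(rows)
```

Only the May 1-3 run has three consecutive dates, matching the expected output above.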
You can make an interval out of your ROW_NUMBER computation by CONCAT with ' day' and typecasting; then you can subtract that from mdate. Your query remains the same other than changing
dateadd(day, - row_number() over (order by mdate), mdate) as gap
to
mdate - concat(row_number() over (order by mdate), ' day')::interval as gap
Demo on SQLFiddle
I would like to modify the query below so it only keeps the highest VISIT_flag value grouped by CUSTOMER_ID, TRANS_TO_DATE and then average VISIT_flag by CUSTOMER_ID.
I'm having challenges figuring out how to take the maximum DENSE_RANK() value and aggregate by taking the average.
(
SELECT
CUSTOMER_ID,
TRANS_TO_DATE ,
DENSE_RANK() OVER( PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY HOUR - RN) VISIT_flag
from (
SELECT
CUSTOMER_ID,
TRANS_TO_DATE,
TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$')) HOUR,
ROW_NUMBER() OVER( PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$')) ) as RN
FROM mstr_clickstream
GROUP BY CUSTOMER_ID, TRANS_TO_DATE, REGEXP_SUBSTR(HOUR,'\d+$')
)
ORDER BY CUSTOMER_ID, TRANS_TO_DATE
Following your logic, in order to get the last VISIT_flag, meaning the last "visit" that occurred within a day, you must order descending within the DENSE_RANK. Though descending solves the problem of getting the last visit, you cannot calculate the customer's average visits, because the VISIT_flag will always be 1. To bypass this issue you must declare a second DENSE_RANK with the same partitioning and an ascending order, in order to quantify the visits of the day and calculate your average. So the derived query:
SELECT customer_id, AVG(quantify) FROM (
  SELECT
    customer_id,
    trans_to_date,
    DENSE_RANK() OVER( PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY HOUR DESC, RN DESC, rownum) VISIT_flag,
    DENSE_RANK() OVER( PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY HOUR ASC, RN ASC, rownum) quantify
  FROM (
    SELECT
      CUSTOMER_ID,
      TRANS_TO_DATE,
      TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$')) HOUR,
      ROW_NUMBER() OVER( PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$')) ) as RN
    FROM mstr_clickstream
    GROUP BY CUSTOMER_ID, TRANS_TO_DATE, REGEXP_SUBSTR(HOUR,'\d+$')
  )
)
WHERE VISIT_flag = 1
GROUP BY customer_id
Now, to be honest, the above query can be implemented more simply, without using DENSE_RANK. The above query makes sense only if you remove GROUP BY customer_id and the AVG calculation from the outer query, and you only want information about the last visit.
In any case, here is the easier way:
SELECT CUSTOMER_ID, AVG(cnt) avg_visits FROM (
  SELECT CUSTOMER_ID, TRANS_TO_DATE, count(*) cnt FROM (
    SELECT
      CUSTOMER_ID,
      TRANS_TO_DATE,
      TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$')) HOUR,
      ROW_NUMBER() OVER( PARTITION BY CUSTOMER_ID, TRANS_TO_DATE ORDER BY TO_NUMBER(REGEXP_SUBSTR(HOUR,'\d+$')) ) as RN
    FROM mstr_clickstream
    GROUP BY CUSTOMER_ID, TRANS_TO_DATE, REGEXP_SUBSTR(HOUR,'\d+$')
  )
  GROUP BY CUSTOMER_ID, TRANS_TO_DATE
) GROUP BY CUSTOMER_ID
P.S. I always include rownum in the DENSE_RANK ORDER BY, to handle the exceptional case (there is always one in the database :D) of two rows having the same transaction time. Without it, those rows would get the same dense_rank, which might be an issue for the application that uses the query data.
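Stripping away the Oracle-specific hour extraction (TO_NUMBER, REGEXP_SUBSTR), the count-then-average shape of the easier query can be sketched on a made-up visits table, here via SQLite through Python's sqlite3, where each row stands for one already-deduplicated visit:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table visits (customer_id text, trans_to_date text);
insert into visits values
  ('C1','2021-01-01'),('C1','2021-01-01'),
  ('C1','2021-01-02'),('C1','2021-01-02'),('C1','2021-01-02'),('C1','2021-01-02');
""")

rows = con.execute("""
select customer_id, avg(cnt) avg_visits
from (select customer_id, trans_to_date, count(*) cnt  -- visits per day
      from visits
      group by customer_id, trans_to_date)
group by customer_id                                   -- average over days
""").fetchall()
print(rows)
```

Two visits on the first day and four on the second average out to 3.0 per day.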
I would like to run the following query in BigQuery, ideally as efficiently as possible. The idea is that I have all of these rows corresponding to tests (taken daily) by millions of users and I want to determine, of the users who have been active for over a year, how much each user has improved.
"Improvement" in this case is the average of the first N subtracted from the last N.
In this example, N is 30. (I've also added in the where cnt >= 100 part because I don't want to consider users who took a test a long time ago and just came back to try once more.)
select user_id,
avg(score) filter (where seqnum_asc <= 30) as first_n_avg,
avg(score) filter (where seqnum_desc <= 30) as last_n_avg
from (select *,
row_number() over (partition by user_id order by created_at) as seqnum_asc,
row_number() over (partition by user_id order by created_at desc) as seqnum_desc,
count(*) over (partition by user_id) as cnt
from tests
) t
where cnt >= 100
group by user_id
having max(created_at) >= min(created_at) + interval '1 year';
Just use conditional aggregation and fix the date functions:
select user_id,
avg(case when seqnum_asc <= 30 then score end) as first_n_avg,
avg(case when seqnum_desc <= 30 then score end) as last_n_avg
from (select *,
row_number() over (partition by user_id order by created_at) as seqnum_asc,
row_number() over (partition by user_id order by created_at desc) as seqnum_desc,
count(*) over (partition by user_id) as cnt
from tests
) t
where cnt >= 100
group by user_id
having max(created_at) >= timestamp_add(min(created_at), interval 1 year);
The function in the having clause could be timestamp_add(), datetime_add(), or date_add(), depending on the type of created_at.
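The conditional-aggregation pattern can be checked on a toy table (SQLite through Python's sqlite3; N is shrunk to 2 and the cnt >= 100 filter is dropped so a four-row sample qualifies, and the date arithmetic uses SQLite's date() in place of timestamp_add):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table tests (user_id text, created_at text, score int);
insert into tests values
  ('u1','2019-01-01',10),('u1','2019-06-01',20),
  ('u1','2020-01-02',30),('u1','2020-06-01',40);
""")

rows = con.execute("""
select user_id,
       -- average only the first 2 / last 2 rows per user
       avg(case when seqnum_asc <= 2 then score end) as first_n_avg,
       avg(case when seqnum_desc <= 2 then score end) as last_n_avg
from (select *,
             row_number() over (partition by user_id order by created_at) as seqnum_asc,
             row_number() over (partition by user_id order by created_at desc) as seqnum_desc
      from tests
     ) t
group by user_id
having max(created_at) >= date(min(created_at), '+1 year')
""").fetchall()
print(rows)
```

The first two scores (10, 20) average to 15.0 and the last two (30, 40) to 35.0, and the user passes the one-year activity check.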
Referring to this post: link, I used the answer provided by Gordon Linoff:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
from (select t.*,
row_number() over (partition by taxi order by time) as seqnum,
row_number() over (partition by taxi, client order by time) as seqnum_c
from t
) t
group by t.taxi, t.client, (seqnum - seqnum_c)
having count(*) >= 2
)
group by taxi;
and got my answer perfectly like this:
Tom 3 (AA count as 1, AAA count as 1 and BB count as 1, so total of 3 count)
Bob 1
But now I would like to add one more condition: the time between two consecutive clients for the same taxi should not be longer than 2 hours.
I know that I should probably use row_number() again and calculate the time difference with datediff, but I have no idea where to add it or how.
So any suggestion?
This requires a bit more logic. In this case, I would use lag() to calculate the groups:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
from (select t.*,
             sum(case when prev_client = client and
                           prev_time > time - interval '2 hour'
                      then 0
                      else 1
                 end) over (partition by taxi order by time) as grp
from (select t.*,
lag(client) over (partition by taxi order by time) as prev_client,
lag(time) over (partition by taxi order by time) as prev_time
from t
) t
) t
group by t.taxi, t.client, grp
having count(*) >= 2
)
group by taxi;
Note: You don't specify the database, so this uses ISO/ANSI standard syntax for date/time comparisons. You can adjust this for your actual database.
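To see the grouping logic in action, here is a sketch on a toy trip table (SQLite through Python's sqlite3), with time as integer epoch seconds so the two-hour comparison is plain arithmetic (7200 seconds). The flag is 1 whenever a new group starts, i.e. the client changes or the gap exceeds two hours, and the running sum is partitioned per taxi:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table t (taxi text, client text, time int);
insert into t values
  ('Tom','A',0),('Tom','A',3600),   -- same client, 1h apart: one group
  ('Tom','B',7000),('Tom','B',8000),-- same client, close together: one group
  ('Tom','A',30000);                -- A again, but far away: its own group
""")

rows = con.execute("""
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
      from (select t.*,
                   sum(case when prev_client = client and prev_time > time - 7200
                            then 0 else 1 end)
                     over (partition by taxi order by time) as grp
            from (select t.*,
                         lag(client) over (partition by taxi order by time) as prev_client,
                         lag(time) over (partition by taxi order by time) as prev_time
                  from t
                 ) t
           ) t
      group by t.taxi, t.client, grp
      having count(*) >= 2
     )
group by taxi
""").fetchall()
print(rows)
```

The lone late ride for client A forms a one-row group and is filtered out by having count(*) >= 2, leaving Tom with two qualifying groups.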