How to return max date per month per user - SQL

I have the following table:
I would like to return the maximum threshold date per month for every user, so my final result should look like this:
I wanted to use the analytic function ROW_NUMBER and return the maximum row number, but how do I do that per month for each user? Is there any simpler way to do it in BigQuery?

You can partition the row_number by the user and the month, then take the first row in each partition. Using DATE_TRUNC (rather than EXTRACT(MONTH ...)) keeps the same month of different years in separate partitions:
SELECT user_id, threshold_date, net_deposits_usd
FROM (SELECT user_id, threshold_date, net_deposits_usd,
             ROW_NUMBER() OVER (PARTITION BY user_id, DATE_TRUNC(threshold_date, MONTH)
                                ORDER BY threshold_date DESC) AS rk
      FROM mytable)
WHERE rk = 1
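For a quick check outside BigQuery, the same dedup pattern runs in stock SQLite via Python's `sqlite3`. The table name, column names, and sample rows below are assumed for illustration; `strftime('%Y-%m', ...)` stands in for the month bucket, and ordering by `threshold_date DESC` picks the latest-dated row:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE mytable (
        user_id          INTEGER,
        threshold_date   TEXT,
        net_deposits_usd REAL
    )
""")
con.executemany("INSERT INTO mytable VALUES (?, ?, ?)", [
    (1, "2021-01-05", 100.0),
    (1, "2021-01-20", 250.0),   # latest January row for user 1
    (1, "2021-02-03", 80.0),
    (2, "2021-01-11", 40.0),
])

# Number the rows per user per month, latest date first, and keep row 1.
rows = con.execute("""
    SELECT user_id, threshold_date, net_deposits_usd
    FROM (
        SELECT user_id, threshold_date, net_deposits_usd,
               ROW_NUMBER() OVER (
                   PARTITION BY user_id, strftime('%Y-%m', threshold_date)
                   ORDER BY threshold_date DESC
               ) AS rk
        FROM mytable
    )
    WHERE rk = 1
    ORDER BY user_id, threshold_date
""").fetchall()
print(rows)  # one row per user per month: the latest-dated one
```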

BigQuery now supports qualify, which does exactly what you want. For the month, just use date_trunc():
select t.*
from t
qualify row_number() over (partition by user_id, date_trunc(threshold_date, month)
                           order by threshold_date desc, net_deposits_usd desc
        ) = 1;
A simple alternative uses arrays and group by:
select array_agg(t order by threshold_date desc, net_deposits_usd desc limit 1)[ordinal(1)].*
from t
group by user_id, date_trunc(threshold_date, month);


Get range between FIRST_VALUE and LAST_VALUE

timestamp | id | scope
2021-01-23 12:52:34.159999 UTC | 1 | enter_page
2021-01-23 12:53:02.342 UTC | 1 | view_product
2021-01-23 12:53:02.675 UTC | 1 | checkout
2021-01-23 12:53:04.342 UTC | 1 | search_page
2021-01-23 12:53:24.513 UTC | 1 | checkout
I am trying to get all the values between the FIRST_VALUE and LAST_VALUE in the column 'scope' using window/analytic functions.
I already get first_value() = enter_page
and last_value() = checkout
by using window functions in SQLite:
FIRST_VALUE(scope) OVER (PARTITION BY id ORDER BY julianday(timestamp) ASC) first_page
FIRST_VALUE(scope) OVER (PARTITION BY id ORDER BY julianday(timestamp) DESC) last_page
I am trying to capture all steps in between (excluding the edges): view_product, checkout, search_page[, N-field], to later aggregate them into a string of unique values (GROUP_CONCAT()).
Once that is done, I will check whether the customer opened the checkout multiple times at some point during the purchase journey.
My result should look like:
id | first_page | last_page | inbetween_pages
1 | enter_page | checkout | view_product, checkout, search_page
P.S. I am trying to avoid using Python to process this. I would like a clean way of doing it in pure SQL.
Thanks a lot guys
You can do it with the GROUP_CONCAT() window function, which supports an ORDER BY clause inside OVER, so the scopes in inbetween_pages come out in the correct order. By contrast, the GROUP_CONCAT() aggregate function (before SQLite 3.44) does not support ORDER BY, and the order of its result is not guaranteed:
SELECT DISTINCT id, first_page, last_page,
GROUP_CONCAT(CASE WHEN timestamp NOT IN (min_timestamp, max_timestamp) THEN scope END)
OVER (PARTITION BY id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) inbetween_pages
FROM (
SELECT *,
FIRST_VALUE(scope) OVER (PARTITION BY id ORDER BY timestamp) first_page,
FIRST_VALUE(scope) OVER (PARTITION BY id ORDER BY timestamp DESC) last_page,
MIN(timestamp) OVER (PARTITION BY id) min_timestamp,
MAX(timestamp) OVER (PARTITION BY id) max_timestamp
FROM tablename
)
Results:
id | first_page | last_page | inbetween_pages
1 | enter_page | checkout | view_product,checkout,search_page
Hmmm . . . I am thinking:
select id, group_concat(scope, ',')
from (select t.*,
row_number() over (partition by id order by timestamp) as seqnum_asc,
row_number() over (partition by id order by timestamp desc) as seqnum_desc
from t
order by id, timestamp
) t
where 1 not in (seqnum_asc, seqnum_desc)
group by id;
In SQLite (before 3.44), group_concat() doesn't accept an ORDER BY argument. My understanding is that it tends to respect the ordering of the subquery, which is why the subquery has an ORDER BY, although that ordering is not formally guaranteed.
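The window-function version from the first answer can be verified end to end in stock SQLite via Python's `sqlite3` (timestamps simplified and sample rows assumed):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tablename (timestamp TEXT, id INTEGER, scope TEXT)")
con.executemany("INSERT INTO tablename VALUES (?, ?, ?)", [
    ("2021-01-23 12:52:34", 1, "enter_page"),
    ("2021-01-23 12:53:02", 1, "view_product"),
    ("2021-01-23 12:53:03", 1, "checkout"),
    ("2021-01-23 12:53:04", 1, "search_page"),
    ("2021-01-23 12:53:24", 1, "checkout"),
])

# GROUP_CONCAT as a *window* function: ORDER BY timestamp in the OVER
# clause fixes the concatenation order; the CASE drops the edge rows.
rows = con.execute("""
    SELECT DISTINCT id, first_page, last_page,
        GROUP_CONCAT(CASE WHEN timestamp NOT IN (min_timestamp, max_timestamp)
                          THEN scope END)
        OVER (PARTITION BY id ORDER BY timestamp
              ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS inbetween_pages
    FROM (
        SELECT *,
               FIRST_VALUE(scope) OVER (PARTITION BY id ORDER BY timestamp)      AS first_page,
               FIRST_VALUE(scope) OVER (PARTITION BY id ORDER BY timestamp DESC) AS last_page,
               MIN(timestamp)     OVER (PARTITION BY id)                         AS min_timestamp,
               MAX(timestamp)     OVER (PARTITION BY id)                         AS max_timestamp
        FROM tablename
    )
""").fetchall()
print(rows)  # [(1, 'enter_page', 'checkout', 'view_product,checkout,search_page')]
```

The DISTINCT collapses the five per-row copies of the window result down to one row per id.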

Postgres DB query to get the count, and first and last ids by date in a single query

I have the following db structure.
table
-----
id (uuids)
date (TIMESTAMP)
I want to write a query in Postgres (actually CockroachDB, which is Postgres-compatible, so a Postgres query should be fine).
The query should return a count of records between two dates, the id of the record with the earliest date, and the id of the record with the latest date within that range.
So the query should return the following:
count, id (of the earliest record in the range), id (of the latest record in the range)
Thanks.
You can use row_number() twice, then conditional aggregation (no_records is wrapped in max() so that every selected column is aggregated):
select
    max(no_records) no_records,
    min(id) filter(where rn_asc = 1) first_id,
    max(id) filter(where rn_desc = 1) last_id
from (
    select
        id,
        count(*) over() no_records,
        row_number() over(order by date asc) rn_asc,
        row_number() over(order by date desc) rn_desc
    from mytable
    where date >= ? and date < ?
) t
where 1 in (rn_asc, rn_desc)
The question marks represent the (inclusive) start and (exclusive) end of the date interval.
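The same row_number()-plus-FILTER shape works in SQLite (the FILTER clause on aggregates needs SQLite 3.30+), so it can be sketched via Python's `sqlite3` with assumed sample data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (id TEXT, date TEXT)")
con.executemany("INSERT INTO mytable VALUES (?, ?)", [
    ("aaa", "2021-01-02"),
    ("bbb", "2021-01-05"),
    ("ccc", "2021-01-09"),
])

# Only the first and last rows survive the outer WHERE; the FILTERed
# aggregates then pick the id of each.
row = con.execute("""
    SELECT
        max(no_records)                    AS no_records,
        min(id) FILTER (WHERE rn_asc = 1)  AS first_id,
        max(id) FILTER (WHERE rn_desc = 1) AS last_id
    FROM (
        SELECT
            id,
            count(*)     OVER ()                   AS no_records,
            row_number() OVER (ORDER BY date ASC)  AS rn_asc,
            row_number() OVER (ORDER BY date DESC) AS rn_desc
        FROM mytable
        WHERE date >= '2021-01-01' AND date < '2021-02-01'
    ) t
    WHERE 1 IN (rn_asc, rn_desc)
""").fetchone()
print(row)  # (3, 'aaa', 'ccc')
```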
Of course, if ids are always increasing, simple aggregation is sufficient:
select count(*), min(id) first_id, max(id) last_id
from mytable
where date >= ? and date < ?
Unfortunately, Postgres doesn't support first_value() as an aggregate function. One method is to use arrays:
select count(*),
       (array_agg(id order by date asc))[1] as first_id,
       (array_agg(id order by date desc))[1] as last_id
from mytable
where date >= ? and date < ?

Running total in per year ordered by person based on latest date info

We are trying to calculate a running total per year, ordered by person, based on the latest download date. Here is an example of how the data is ordered:
Expected result:
So for each downloaded date we want the running total of all persons per year (for now the only year is 2018).
What do we have so far:
sum(Amount)
  over(partition by [Year], [Person]
       order by [EndDate])
where max(Downloaded)
Any idea how to fix this?
Just use a window function with an ORDER BY in the OVER clause:
select *,
       sum(Amount) over (partition by Year order by Downloaded) as RunningTotal
from YourTable t
Try using a correlated subquery over a moving Downloaded date range.
SELECT
T.*,
RunningTotalByDate = (
SELECT
SUM(N.Amount)
FROM
YourTable AS N
WHERE
N.Downloaded <= T.Downloaded)
FROM
YourTable AS T
ORDER BY
T.Downloaded ASC,
T.Person ASC
Or with a windowed SUM(). Do not include a PARTITION BY, because it would reset the sum whenever the partitioned-by column value changes.
SELECT
T.*,
RunningTotalByDate = SUM(T.Amount) OVER (ORDER BY T.Downloaded ASC)
FROM
YourTable AS T
ORDER BY
T.Downloaded ASC,
T.Person ASC
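The windowed-SUM idea can be sketched in SQLite via Python's `sqlite3` (table and column names follow the answers; sample rows assumed). Here PARTITION BY Year is kept so the total restarts each year, which matches the "running total per year" the question asks for:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE YourTable (Person TEXT, Year INTEGER, Downloaded TEXT, Amount REAL)
""")
con.executemany("INSERT INTO YourTable VALUES (?, ?, ?, ?)", [
    ("alice", 2018, "2018-01-10", 10.0),
    ("bob",   2018, "2018-01-20", 5.0),
    ("alice", 2018, "2018-02-15", 20.0),
])

# ORDER BY inside OVER turns SUM into a cumulative sum; PARTITION BY Year
# restarts the total at each new year.
rows = con.execute("""
    SELECT Person, Downloaded, Amount,
           SUM(Amount) OVER (PARTITION BY Year ORDER BY Downloaded) AS RunningTotal
    FROM YourTable
    ORDER BY Downloaded, Person
""").fetchall()
print(rows)
# running totals: 10.0, 15.0, 35.0
```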

BigQuery RATIO_TO_REPORT for all data no partition

I want to calculate the ratio of a specific field. I know that in legacy SQL I can use the RATIO_TO_REPORT function, e.g.:
SELECT
month,
RATIO_TO_REPORT(totalPoint) over (partition by month)
FROM (
SELECT
format_datetime('%Y-%m', ts) AS month,
SUM(point) AS totalPoint
FROM
`userPurchase`
GROUP BY
month
ORDER BY
month )
but I want the ratio calculated over all the data, without a partition, e.g. (this code does not work):
SELECT
month,
RATIO_TO_REPORT(totalPoint) over (partition by "all"),
# RATIO_TO_REPORT(totalPoint) over (partition by null)
FROM (
SELECT
format_datetime('%Y-%m', ts) AS month,
SUM(point) AS totalPoint
FROM
`userPurchase`
GROUP BY
month
ORDER BY
month )
It doesn't work. How can I do the same thing? Thanks!
Assuming the rest of the code is correct, just omit the PARTITION BY part:
RATIO_TO_REPORT(totalPoint) OVER ()
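RATIO_TO_REPORT(x) OVER () is just x divided by the total, so the portable spelling is an explicit division by SUM(...) OVER (). A sketch in SQLite via Python's `sqlite3` (sample rows assumed):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE userPurchase (month TEXT, totalPoint REAL)")
con.executemany("INSERT INTO userPurchase VALUES (?, ?)", [
    ("2021-01", 10.0),
    ("2021-02", 30.0),
    ("2021-03", 60.0),
])

# An empty OVER () makes the window the whole result set, so each row
# is divided by the grand total.
rows = con.execute("""
    SELECT month,
           totalPoint / SUM(totalPoint) OVER () AS ratio
    FROM userPurchase
    ORDER BY month
""").fetchall()
print(rows)  # ratios 0.1, 0.3, 0.6 of the total 100
```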

Tagging consecutive days

Suppose I have data something like this:
ID,DATE
101,01jan2014
101,02jan2014
101,03jan2014
101,07jan2014
101,08jan2014
101,10jan2014
101,12jan2014
101,13jan2014
102,08jan2014
102,09jan2014
102,10jan2014
102,15jan2014
How could I efficiently code this in Greenplum SQL such that I can have a grouping of consecutive days similar to the one below:
ID,DATE,PERIOD
101,01jan2014,1
101,02jan2014,1
101,03jan2014,1
101,07jan2014,2
101,08jan2014,2
101,10jan2014,3
101,12jan2014,4
101,13jan2014,4
102,08jan2014,1
102,09jan2014,1
102,10jan2014,1
102,15jan2014,2
You can do this using row_number(). Within a run of consecutive days, the difference between the date and the row_number() is a constant. Then use dense_rank() over that constant to assign the period:
select id, date,
       dense_rank() over (partition by id order by grp) as period
from (select t.*,
             date - row_number() over (partition by id order by date) * interval '1 day' as grp
      from mytable t
     ) t
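The same gaps-and-islands trick runs in SQLite via Python's `sqlite3` (table name assumed; julianday() turns the date into a number so the row_number() can be subtracted from it):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE visits (id INTEGER, date TEXT)")
con.executemany("INSERT INTO visits VALUES (?, ?)", [
    (101, "2014-01-01"), (101, "2014-01-02"), (101, "2014-01-03"),
    (101, "2014-01-07"), (101, "2014-01-08"), (101, "2014-01-10"),
    (101, "2014-01-12"), (101, "2014-01-13"),
    (102, "2014-01-08"), (102, "2014-01-09"), (102, "2014-01-10"),
    (102, "2014-01-15"),
])

# Within a run of consecutive days, date minus row_number is constant;
# dense_rank over that constant numbers the runs 1, 2, 3, ...
rows = con.execute("""
    SELECT id, date,
           dense_rank() OVER (PARTITION BY id ORDER BY grp) AS period
    FROM (
        SELECT t.*,
               julianday(date)
                 - row_number() OVER (PARTITION BY id ORDER BY date) AS grp
        FROM visits t
    ) t
    ORDER BY id, date
""").fetchall()
print([r[2] for r in rows])  # [1, 1, 1, 2, 2, 3, 4, 4, 1, 1, 1, 2]
```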