Based on the first 6 columns, I'd like to calculate the desired count. For each partition of user_id, session_id and orig_id, ordered by rank_agse ascending, I'd like to start at 1 and add one every time the lag_agse column equals 'ACCESSED'. The table below is populated to illustrate what I want.
It seems to me that you are looking for
select user_id, session_id, orig_id, type, lag_agse, rank_agse,
count(case when type = 'ACCESSED' then 1 end)
over (partition by user_id, session_id, orig_id
order by rank_agse) as desired_count
from your_table
order by user_id, session_id, orig_id, rank_agse desc
;
See my Comment under your question regarding ascending vs descending order by RANK_AGSE.
Note that count() over the case expression does the same job as summing 1 when type is 'ACCESSED' and 0 otherwise, only more simply: the case expression returns null for non-matching rows, and count() ignores nulls.
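To make the count() versus sum() point concrete, here is a minimal sketch. It assumes SQLite (through Python's sqlite3 module, with a SQLite build that supports window functions) rather than the asker's database, and the table name demo and its rows are made up for illustration.

```python
import sqlite3

# Hypothetical sample rows: (user_id, rank_agse, type). Not the asker's data.
rows = [
    (1, 1, "ACCESSED"),
    (1, 2, "OTHER"),
    (1, 3, "ACCESSED"),
    (1, 4, "ACCESSED"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table demo (user_id int, rank_agse int, type text)")
conn.executemany("insert into demo values (?, ?, ?)", rows)

# Running conditional count, written both ways over the same window.
result = conn.execute(
    """
    select rank_agse,
           count(case when type = 'ACCESSED' then 1 end)
               over (partition by user_id order by rank_agse) as via_count,
           sum(case when type = 'ACCESSED' then 1 else 0 end)
               over (partition by user_id order by rank_agse) as via_sum
    from demo
    order by rank_agse
    """
).fetchall()

for row in result:
    print(row)
```

Both columns carry the same running total on every row; the count() form simply relies on nulls being skipped instead of adding zeros.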
I have a table with id, order sequence and date, and I am trying to add two columns, one with a difference in date function, and another with a status function that is reliant on the value of the difference in date.
Table looks like this:
The issue I am having is with finding the difference between the dates within each unique id: for the first order sequence it should be null, and for any subsequent order sequence, say 3, it should be the 3rd date minus the 2nd date. This all works with the code I have:
case
when ord_seq = 1 then null
else ord_date - lag(ord_date) over (order by id)
end as date_diff,
However, this only works when the table is already ordered. If I jumble up the order that I input the table in, the values come out a little different. I figured it might be because "lag" function only takes the previous row's value, so if the previous row does not belong to the same id, and is not in chronological order, the dates won't subtract well.
My code looks like this at the moment:
select
id,
ord_seq,
ord_date,
case
when ord_seq = 1 then null
else ord_date - lag(ord_date) over (order by id)
end as date_diff,
case
when ord_seq = 1 then 'New'
when ord_date - lag(ord_date) over (order by id, ord_seq) between 1 and 200 then 'Retain'
when ord_date - lag(ord_date) over (order by id, ord_seq) > 200 then 'Reactivated'
end as status
from t1
order by id, ord_seq, ord_date
My db<>fiddle
Am I using the correct function here? How do I find the difference in date between one unique ID, regardless of the order of the table?
Any help would be much appreciated.
In case you want to see end table result (error is on id 'ddd', ord seq '2' and '3'):
Ordered Input:
Not Ordered Input:
When using this:
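You are missing the partition by clause in your window definition. Without it, lag() looks at the previous row of the whole result set rather than the previous row of the same id. Here it is, working regardless of any table order: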
select *,
ord_date - lag(ord_date) over (partition by id order by ord_seq) as date_diff
from t1;
Please note, however, that database tables have no natural order that you can rely upon and cannot be considered ordered, no matter in what sequence the records have been inserted. You must explicitly specify an order by clause if you need a specific order.
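As a rough check of the partitioned lag(), the sketch below inserts rows in a deliberately jumbled order. It assumes SQLite via Python's sqlite3 module, so the date subtraction is written with julianday() instead of the plain ord_date minus ord_date of the original query, and the rows are hypothetical.

```python
import sqlite3

# Hypothetical rows, inserted out of order on purpose: (id, ord_seq, ord_date).
rows = [
    ("ddd", 3, "2020-09-01"),
    ("aaa", 1, "2020-01-01"),
    ("ddd", 1, "2020-01-10"),
    ("aaa", 2, "2020-01-11"),
    ("ddd", 2, "2020-02-09"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table t1 (id text, ord_seq int, ord_date text)")
conn.executemany("insert into t1 values (?, ?, ?)", rows)

# lag() restarts for every id, so the first order of each id has no
# previous row and the difference comes out as null (None in Python).
result = conn.execute(
    """
    select id, ord_seq,
           cast(julianday(ord_date)
                - julianday(lag(ord_date) over (partition by id order by ord_seq))
                as integer) as date_diff
    from t1
    order by id, ord_seq
    """
).fetchall()

for row in result:
    print(row)
```

Because lag() already yields null on the first row of each partition, the separate case branch for ord_seq = 1 is not needed for date_diff.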
I am trying to fill in some nulls where I just need them to be the previous available value for a name (sorted by date).
So, from this table:
I need the query to output this:
Now, the idea is that for Jane, on the second and third there was no score, so it should be equal to the score from the previous date on which a score was available for Jane. And the same for Jon. I am trying coalesce and range, but range is not implemented yet in Redshift. I also looked into other questions, but they don't fully apply to different categories. Any alternatives?
Thanks!
select day, name,
coalesce(score, (select score
from [your table] as t
where t.name = [your table].name and t.day < [your table].day
and t.score is not null
order by t.day desc limit 1)) as score
from [your table]
The query straightforwardly implements the logic you described:
if score is not null, coalesce will return its value without executing the subquery
if score is null, the subquery will return the last available score for that name before the given date
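Here is a small sketch of that correlated-subquery approach, assuming SQLite via Python's sqlite3 module and a made-up table called scores. The subquery only considers earlier rows whose score is not null, which is what lets it skip over several consecutive gaps.

```python
import sqlite3

# Hypothetical data: (day, name, score); None marks a missing score.
rows = [
    ("2020-01-01", "Jane", 10),
    ("2020-01-02", "Jane", None),
    ("2020-01-03", "Jane", None),
    ("2020-01-01", "Jon", 7),
    ("2020-01-02", "Jon", None),
    ("2020-01-03", "Jon", 9),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table scores (day text, name text, score int)")
conn.executemany("insert into scores values (?, ?, ?)", rows)

# For a null score, look up the most recent earlier non-null score
# for the same name.
result = conn.execute(
    """
    select o.day, o.name,
           coalesce(o.score,
                    (select t.score
                     from scores as t
                     where t.name = o.name
                       and t.day < o.day
                       and t.score is not null
                     order by t.day desc
                     limit 1)) as score
    from scores as o
    order by o.name, o.day
    """
).fetchall()

for row in result:
    print(row)
```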
It's a "gaps and islands" problem and a query can be like this
SELECT
day,
name,
MAX(score) OVER (PARTITION BY name, group_id) AS score
FROM (
SELECT
*,
SUM(CASE WHEN score IS NULL THEN 0 ELSE 1 END) OVER (PARTITION BY name ORDER BY day) AS group_id
FROM data
) groups
ORDER BY name DESC, day
You can check a working demo here
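To see why the running sum works as a group key, the sketch below exposes group_id next to the filled score. It assumes SQLite via Python's sqlite3 module (not Redshift), a table named data as in the query above, and hypothetical rows.

```python
import sqlite3

# Hypothetical data: (day, name, score); None marks a missing score.
rows = [
    ("2020-01-01", "Jane", 10),
    ("2020-01-02", "Jane", None),
    ("2020-01-03", "Jane", None),
    ("2020-01-01", "Jon", 7),
    ("2020-01-02", "Jon", None),
    ("2020-01-03", "Jon", 9),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table data (day text, name text, score int)")
conn.executemany("insert into data values (?, ?, ?)", rows)

# group_id only increases on rows that have a score, so each null row
# shares the group of the last non-null row before it. Every group then
# holds exactly one non-null score, which max() picks up.
result = conn.execute(
    """
    select name, day, group_id,
           max(score) over (partition by name, group_id) as score
    from (
        select day, name, score,
               sum(case when score is null then 0 else 1 end)
                   over (partition by name order by day) as group_id
        from data
    ) as groups
    order by name, day
    """
).fetchall()

for row in result:
    print(row)
```

A null row that comes before the first score of a name lands in group 0 and stays null, since there is nothing earlier to carry forward.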
I want to remove all customer hits that I see on my site after they have registered. However, not all customers will register on the same day so I cannot simply filter on a specific date. I have a registration indicator of 1 or 0 and then a hit timestamp, along with unique indicators for the specific customers. I have tried this:
rank() over (partition by customer_id, registration_ind order by hit_timestamp asc) rnk
However, this still partitions by customer and isn't working for what I want.
Any help please?
Thanks
Is this what you want?
select t.*
from (select t.*,
min(case when registration_ind = 1 then hit_timestamp end) over (partition by customer_id) as registration_timestamp
from t
) t
where registration_timestamp is null or
hit_timestamp < registration_timestamp;
It returns all rows before the first registration timestamp.
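A quick way to try this out is the sketch below. It assumes SQLite via Python's sqlite3 module, keeps the table name t from the query, and uses invented customers: one who registers partway through and one who never registers.

```python
import sqlite3

# Hypothetical hits: (customer_id, registration_ind, hit_timestamp).
rows = [
    ("a", 0, "2021-01-01 10:00"),
    ("a", 1, "2021-01-02 09:00"),  # customer a registers on this hit
    ("a", 0, "2021-01-03 12:00"),
    ("b", 0, "2021-01-05 08:00"),  # customer b never registers
    ("b", 0, "2021-01-06 08:00"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table t (customer_id text, registration_ind int, hit_timestamp text)"
)
conn.executemany("insert into t values (?, ?, ?)", rows)

# Attach each customer's first registration timestamp to all of their
# rows, then keep only the hits strictly before it (or every hit when
# the customer has no registration at all).
result = conn.execute(
    """
    select customer_id, hit_timestamp
    from (select t.*,
                 min(case when registration_ind = 1 then hit_timestamp end)
                     over (partition by customer_id) as registration_timestamp
          from t) as sub
    where registration_timestamp is null
       or hit_timestamp < registration_timestamp
    order by customer_id, hit_timestamp
    """
).fetchall()

for row in result:
    print(row)
```

Note that the strict comparison also drops the registration hit itself; using <= instead would keep it.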
Suppose I have the following record in BQ:
id name age timestamp
1 "tom" 20 2019-01-01
I then perform two "updates" on this record by using the streaming API to 'append' additional data -- https://cloud.google.com/bigquery/streaming-data-into-bigquery. This is mainly to get around the update quota that BQ enforces (and it is a high-write application we have).
I then append two edits to the table, one update that just modifies the name, and then one update that just modifies the age. Here are the three records after the updates:
id name age timestamp
1 "tom" 20 2019-01-01
1 "Tom" null 2019-02-01
1 null 21 2019-03-03
I then want to query this record to get the most "up-to-date" information. Here is how I have started:
SELECT id, **name**, **age**,max(timestamp)
FROM table
GROUP BY id
-- 1,"Tom",21,2019-03-03
How would I get the correct name and age here? Note that there could be thousands of updates to a record, so I don't want to have to write 1000 case statements, if at all possible.
For various other reasons, I usually won't have all row data at one time, I will only have the RowID + FieldName + FieldValue.
I suppose plan B here is to do a query to get the current data and then add my changes to insert the new row, but I'm hoping there's a way to do this in one go without having to do two queries.
Below is for BigQuery Standard SQL
#standardSQL
SELECT id,
ARRAY_AGG(name IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] name,
ARRAY_AGG(age IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] age,
MAX(ts) ts
FROM `project.dataset.table`
GROUP BY id
You can test and play with the above using the sample data from your question, as in the example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 id, "tom" name, 20 age, DATE '2019-01-01' ts UNION ALL
SELECT 1, "Tom", NULL, '2019-02-01' UNION ALL
SELECT 1, NULL, 21, '2019-03-03'
)
SELECT id,
ARRAY_AGG(name IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] name,
ARRAY_AGG(age IGNORE NULLS ORDER BY ts DESC LIMIT 1)[OFFSET(0)] age,
MAX(ts) ts
FROM `project.dataset.table`
GROUP BY id
with result
Row id name age ts
1 1 Tom 21 2019-03-03
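For readers who want the semantics spelled out, here is a plain Python sketch of what the query computes for each id: the newest non-null value of every column plus the newest timestamp. It is only an illustration on the three rows from the question; the function name latest_non_null is made up, and this is not how BigQuery evaluates ARRAY_AGG.

```python
from datetime import date

# The three rows from the question, as (id, name, age, ts) tuples.
rows = [
    (1, "tom", 20, date(2019, 1, 1)),
    (1, "Tom", None, date(2019, 2, 1)),
    (1, None, 21, date(2019, 3, 3)),
]


def latest_non_null(rows):
    """Collapse append-only updates into one record per id."""
    merged = {}
    # Walk oldest to newest so later non-null values overwrite earlier ones.
    for id_, name, age, ts in sorted(rows, key=lambda r: r[3]):
        rec = merged.setdefault(id_, {"name": None, "age": None, "ts": None})
        if name is not None:
            rec["name"] = name
        if age is not None:
            rec["age"] = age
        rec["ts"] = ts  # rows are ascending, so the last one is the max
    return merged


result = latest_non_null(rows)
print(result)
```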
This is a classic case of application of analytic functions in Standard SQL.
Here is how you can achieve your results:
select id, name, age from (
select id, name, age, ts, rank() over (partition by id order by ts desc) rnk
from `yourdataset.yourtable`
)
where rnk = 1
This will sub-group your records based on id and pick the one with the most recent ts (indicating the record most recently added for a given id).
I've just run into this problem in a query. (Sorry, I'll try to simplify the table definitions and shorten the details. If you need more, please tell me.)
I've a table with (ID, date, status). ID is a FK, date is a date, and status is an int value between 1 and 5. Both date and status allow null values. Repeated values are also allowed.
What I need:
Extract one row for each ID, min(date) having a status of 1 or 2, min(date) having a status of 2 (only), max(status)...
I'm completely lost... I'm trying to use
SELECT
ID,
min(date) over (partition by(status)) ?? as min_12_date,
min(date) over (partition by(status)) ?? as min_2_date,
max(status) as max_status
FROM table
group by ID
So, the question is: Is this correct? How can I select only the status I want?
If I understand your question correctly, you don't need analytic functions for this. You only need conditional aggregation:
select id,
min(case when status in (1, 2) then date end) as min_12_date,
min(case when status = 2 then date end) as min_2_date,
max(status) as max_status
from tbl
group by id
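The conditional aggregation can be tried on a few rows with the sketch below. It assumes SQLite via Python's sqlite3 module, keeps the table name tbl from the query, and uses hypothetical rows, including a null date and an id with no status 2 at all.

```python
import sqlite3

# Hypothetical rows: (id, date, status); date may be null.
rows = [
    (1, "2020-01-02", 1),
    (1, "2020-01-03", 2),
    (1, "2020-01-01", 3),
    (1, None, 5),
    (2, "2020-02-01", 4),
    (2, "2020-02-02", 1),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table tbl (id int, date text, status int)")
conn.executemany("insert into tbl values (?, ?, ?)", rows)

# Each case expression yields the date only for the statuses of interest
# and null otherwise; min() skips the nulls.
result = conn.execute(
    """
    select id,
           min(case when status in (1, 2) then date end) as min_12_date,
           min(case when status = 2 then date end) as min_2_date,
           max(status) as max_status
    from tbl
    group by id
    order by id
    """
).fetchall()

for row in result:
    print(row)
```

An id with no row of status 2 gets a null min_2_date rather than being dropped from the result.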