Last click attribution/greatest n per user in SQL - sql

I would like to select the last campaign a user clicked in my dataset and return a table with the name of the last clicked campaign and date for each anonymous id.
This is what I have written
select anon,
source,
medium,
campaign,
max(ts) as ts
from attribution
group by 1,2,3,4
This code seems to return the last click date, but in cases where the user clicked on two campaigns it will return both campaigns with the latest date appended to the date column.
TS in this scenario refers to the timestamp

You could use row_number():
select *
from (
select
anon,
source,
medium,
campaign,
ts,
row_number() over(partition by anon order by ts desc) rn
from attribution
) t
where rn = 1
This assumes that anon is the column that holds the user identifier. If that's not the case, change it to the relevant column in the OVER(PARTITION BY ...) clause.
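To see the row_number() approach in action, here is a minimal sketch that runs it against SQLite through Python's sqlite3 module. The sample rows and campaign names are made up for illustration, and SQLite (3.25 or newer, for window functions) is only a stand-in for whatever database the question actually uses.

```python
import sqlite3

# In-memory database with a hypothetical attribution table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table attribution (anon text, source text, medium text, campaign text, ts text)"
)
conn.executemany(
    "insert into attribution values (?, ?, ?, ?, ?)",
    [
        ("u1", "google", "cpc", "spring_sale", "2020-07-01 10:00:00"),
        ("u1", "facebook", "social", "summer_sale", "2020-07-05 09:30:00"),
        ("u2", "newsletter", "email", "spring_sale", "2020-07-03 12:00:00"),
    ],
)

# Rank each user's clicks from newest to oldest and keep only the newest one.
last_clicks = conn.execute(
    """
    select anon, campaign, ts
    from (
        select anon, campaign, ts,
               row_number() over (partition by anon order by ts desc) as rn
        from attribution
    ) t
    where rn = 1
    order by anon
    """
).fetchall()

for row in last_clicks:
    print(row)
```

Each anon appears exactly once, even when the user clicked several campaigns. If two clicks share the same ts, row_number() picks one of them arbitrarily unless a tie-breaking column is added to the ORDER BY.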

Related

SQL - timeline based queries

I have a table of events which has:
user_id
event_name
event_time
There are event names of types: meeting_started, meeting_ended, email_sent
I want to create a query that counts the number of times an email has been sent during a meeting.
UPDATE: I'm using Google BigQuery.
Example query:
SELECT
event_name,
count(distinct user_id) users,
FROM
events_table
WHERE
event_name IN ('meeting_started', 'meeting_ended')
group by 1
How can I achieve that?
Thanks!
You can do this in BigQuery using last_value():
Presumably, an email is sent during a meeting if the most recent "meeting" event before it is 'meeting_started'. So, you can solve this by getting the most recent meeting event for each event and then filtering:
select et.*
from (select et.*,
last_value(case when event_name in ('meeting_started', 'meeting_ended') then event_name end ignore nulls) over
(partition by user_id order by event_time) as last_meeting_event
from events_table et
) et
where event_name = 'email_sent' and last_meeting_event = 'meeting_started'
This reads like some kind of gaps-and-islands problem, where an island is a meeting, and you want emails that belong to islands.
How do we define an island? Assuming that meeting starts and ends properly interleave, we can just compare the count of starts and ends on a per-user basis. If there are more starts than ends, then a meeting is in progress. Using this logic, you can get all emails that were sent during a meeting like so:
select *
from (
select e.*,
countif(event_name = 'meeting_started') over(partition by user_id order by event_time) as cnt_started,
countif(event_name = 'meeting_ended' ) over(partition by user_id order by event_time) as cnt_ended
from events_table e
) e
where event_name = 'email_sent' and cnt_started > cnt_ended
It is unclear where you want to go from here. If you want the count of such emails, just use select count(*) instead of select * in the outer query.
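The start/end counting logic can be checked on a small data set. The following is a rough sketch using Python's sqlite3 module with invented events. SQLite has no countif, so the sketch assumes sum(case when ... then 1 else 0 end) as an equivalent; the BigQuery query above is unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table events_table (user_id text, event_name text, event_time text)")
conn.executemany(
    "insert into events_table values (?, ?, ?)",
    [
        ("u1", "meeting_started", "2020-07-14 09:00:00"),
        ("u1", "email_sent", "2020-07-14 09:10:00"),  # during a meeting
        ("u1", "meeting_ended", "2020-07-14 09:30:00"),
        ("u1", "email_sent", "2020-07-14 09:45:00"),  # between meetings
        ("u1", "meeting_started", "2020-07-14 10:00:00"),
        ("u1", "email_sent", "2020-07-14 10:05:00"),  # during a meeting
        ("u2", "email_sent", "2020-07-14 08:00:00"),  # user never had a meeting
    ],
)

# Running counts of starts and ends per user; more starts than ends
# means a meeting is in progress at that point in time.
emails_in_meetings = conn.execute(
    """
    select user_id, event_time
    from (
        select e.*,
               sum(case when event_name = 'meeting_started' then 1 else 0 end)
                   over (partition by user_id order by event_time) as cnt_started,
               sum(case when event_name = 'meeting_ended' then 1 else 0 end)
                   over (partition by user_id order by event_time) as cnt_ended
        from events_table e
    ) e
    where event_name = 'email_sent' and cnt_started > cnt_ended
    order by user_id, event_time
    """
).fetchall()

print(len(emails_in_meetings), emails_in_meetings)
```

Only the two emails sent while a meeting was open are returned; the email sent between meetings and the one from a user without meetings are filtered out.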

Get latest status update for every user in the database

I have status_updates table which contains rows with each status update for each user,
id nickname status timestamp
-----------------------------------------------
14638 lovely_john offline 2020-07-14 08:37:18
14640 big_papa online 2020-07-14 08:57:10
When status changes, a new row is added.
How do I select the single latest row (according to the timestamp) for each user and get them all in one query? So, if I have 100 users, I will get 100 rows with the latest status change.
Thanks!
This is best handled by DISTINCT ON
select distinct on (nickname) *
from status_updates
order by nickname, timestamp desc;
You can also use ROW_NUMBER():
select id, nickname, status, timestamp
from
(select id, nickname, status, timestamp,
row_number() over(partition by nickname order by timestamp desc) as rnk
from status_updates) qry
where rnk = 1;
This will provide you the latest record of each user

How do I reset a sum() over () in a SQL Server query?

I have a derived table that looks like this example:
{select * from tb_data}
I want the results to have an additional summation column. The catch is that I need the summation column to reset its running value whenever the info column value = 'reset'.
{select *, (I assume some variation on sum(number) over (partition by id order by date desc)) as summation from tb_data}
and here's what the output should look like:
The actual derived table covers thousands of ids which is why it needs to be partitioned by the id and ordered by date desc and each has a different number of reset points.
What SQL query will get me the output I need?
You could first do a conditional window sum to define the groups: every time a reset is found, a new group starts. Then you can simply do a window sum of numbers within the groups.
select
id,
date,
info,
number,
sum(number) over(partition by id, grp order by date) summation
from (
select
t.*,
sum(case when info = 'reset' then 1 else 0 end)
over(partition by id order by date) grp
from tb_data t
) t
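A small worked example may make the grouping step easier to follow. This is only a sketch, run on SQLite through Python's sqlite3 module with invented rows; the table name tb_data comes from the question, everything else is assumed for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tb_data (id integer, date text, info text, number integer)")
conn.executemany(
    "insert into tb_data values (?, ?, ?, ?)",
    [
        (1, "2020-01-01", "", 1),
        (1, "2020-01-02", "", 2),
        (1, "2020-01-03", "reset", 3),  # a new group starts on this row
        (1, "2020-01-04", "", 4),
        (1, "2020-01-05", "", 5),
    ],
)

# Inner query: running count of resets = group number.
# Outer query: running sum of number within each (id, grp) group.
running = conn.execute(
    """
    select date, info, number,
           sum(number) over (partition by id, grp order by date) as summation
    from (
        select t.*,
               sum(case when info = 'reset' then 1 else 0 end)
                   over (partition by id order by date) as grp
        from tb_data t
    ) t
    order by id, date
    """
).fetchall()

for row in running:
    print(row)
```

The running totals are 1, 3, then 3, 7, 12: the reset row itself opens the new group, so its own number is the first value of the new sum. If the reset row should instead close the previous group, the grouping expression would need to be adjusted.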

PostgreSQL backward intersection & join

I have a survey form with certain questions for a certain facility.
The facility can be monitored (data entry) more than once in a month.
Now I need the latest data (values) for each question,
but if there is no latest data for a question, I want to fall back to prior records (previous dates) of the same month.
I can get the latest record, but I don't know how to get the previous record of the same month if there is no latest data.
I am using PostgreSQL 10.
Table Structure is
Desired output is
You can try to use ROW_NUMBER window function to make it.
SELECT to_char(date, 'MON') month,
facility,
idquestion,
value
FROM (
SELECT *,ROW_NUMBER() OVER(PARTITION BY facility,idquestion ORDER BY DATE DESC) rn
FROM T
) t1
where rn = 1
demo:db<>fiddle
SELECT DISTINCT
to_char(qdate, 'MON'),
facility,
idquestion,
first_value(value) OVER (PARTITION BY facility, idquestion ORDER BY qdate DESC) as value
FROM questions
ORDER BY facility, idquestion
Using window functions:
first_value(value) OVER ... gives you the first value of a window frame. Here the rows are grouped by facility and idquestion, and within each group they are ordered by qdate DESC, so the most recent value comes first no matter which date it has.
DISTINCT removes the duplicate rows that result (e.g. there are two rows for facility == 1 and idquestion == 7)
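The combination of first_value() and DISTINCT can be tried out with a sketch like the one below, which uses Python's sqlite3 module and invented survey rows. SQLite has no to_char, so the month column is left out here; that omission is an assumption of this sketch, not a change to the Postgres query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table questions (qdate text, facility integer, idquestion integer, value text)")
conn.executemany(
    "insert into questions values (?, ?, ?, ?)",
    [
        ("2018-07-01", 1, 7, "A"),
        ("2018-07-15", 1, 7, "B"),  # newer answer for facility 1, question 7
        ("2018-07-01", 1, 8, "C"),  # only answer for question 8
        ("2018-07-15", 2, 7, "D"),
    ],
)

# Every row of a (facility, idquestion) group gets the newest value;
# DISTINCT then collapses the identical rows into one per group.
latest = conn.execute(
    """
    select distinct
           facility,
           idquestion,
           first_value(value) over (
               partition by facility, idquestion order by qdate desc
           ) as value
    from questions
    order by facility, idquestion
    """
).fetchall()

for row in latest:
    print(row)
```

Question 8 has no entry on the later date, so its value from the earlier date is returned, which is the fallback behaviour the question asks for when older rows are simply absent for the newer date.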
Please note:
"date" is a reserved word in standard SQL and the name of a data type in Postgres. I strongly recommend renaming your column to avoid trouble. Furthermore, Postgres folds unquoted identifiers to lower case, so lower-case names are recommended.

Max of a Date field into another field in Postgresql

I have a PostgreSQL table with a few fields such as id and date. I need to find the max date for each id and show it in a new field for all the ids. The SQLFiddle site was not responding, so I have an example in Excel. Here is the screenshot of the data and the output for the table.
You could use the windowing variant of max:
SELECT id, date, MAX(date) OVER (PARTITION BY id)
FROM mytable
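As a quick illustration of the windowed max, here is a minimal sketch on SQLite via Python's sqlite3 module. The rows are invented, and the max_date alias is added only so the output column has a name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table mytable (id integer, date text)")
conn.executemany(
    "insert into mytable values (?, ?)",
    [(1, "2020-01-01"), (1, "2020-03-01"), (2, "2020-02-01")],
)

# No GROUP BY: every row is kept, and each one carries the max date of its id.
rows = conn.execute(
    """
    select id, date, max(date) over (partition by id) as max_date
    from mytable
    order by id, date
    """
).fetchall()

for row in rows:
    print(row)
```

Unlike an aggregate with GROUP BY, the window version returns one output row per input row, which is what the question's expected output shows.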
Something like this might work:
WITH maxdts AS (
SELECT id, max(date) AS maxdt FROM mytable GROUP BY id
)
SELECT t.id, t.date, m.maxdt FROM mytable t JOIN maxdts m ON t.id = m.id;
Keep in mind that, without more information about the table, this could be a horribly inefficient query, but it will get you what you need.