How to calculate running sums with append-only rows - sql

I have a table where rows are never mutated but only inserted; they are immutable records. It has the following fields:
id: int
user_id: int
created: datetime
is_cool: boolean
likes_fruits: boolean
An object is tied to a user, and the "current" object for a given user is the one that has the latest created date. E.g. if I want to update is_cool for a user, I'd append a record with a new created timestamp and is_cool=true.
I want to calculate how many users are is_cool at the end of each day. I.e. I'd like the output table to have the columns:
day: some kind of date_trunc('day', created)
cool_users_count: number of users that have is_cool at the end of this day.
What SQL query can I write that does this? FWIW I'm using Presto (or Redshift if I need to).
Note that there are other columns, e.g. likes_fruits, which means a record where is_cool is false does not mean is_cool was just changed to false - it could have been false for a while.
This is what procedural pseudo-code would look like to represent what I'd want to do in SQL:
// rows = ...
min_date = min([row.created for row in rows])
max_date = max([row.created for row in rows])
counts_by_day = {}
for date in range(min_date, max_date):
    rows_up_until_date = [row for row in rows if row.created <= date]
    latest_row_by_user = rows_up_until_date.reduce(
        {},
        (acc, row) => acc[row.user_id] = row,
    )
    counts_by_day[date] = latest_row_by_user.filter(row => row.is_cool).length
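For reference, the procedure above could be made runnable roughly as follows. This is only a sketch in Python: the Row type, the sample data, and the function name cool_counts_by_day are made up for illustration, and "end of day" is taken to mean every row whose created date falls on or before that day.

```python
from collections import namedtuple
from datetime import datetime, timedelta

# Hypothetical row type mirroring the table's fields (illustration only).
Row = namedtuple("Row", "id user_id created is_cool likes_fruits")


def cool_counts_by_day(rows):
    """Map each day to the number of users whose latest row so far has is_cool set."""
    if not rows:
        return {}
    rows = sorted(rows, key=lambda r: r.created)
    day = rows[0].created.date()
    last_day = rows[-1].created.date()
    latest_by_user = {}
    counts = {}
    i = 0
    while day <= last_day:
        # Consume every row created on or before this day; later rows overwrite earlier ones.
        while i < len(rows) and rows[i].created.date() <= day:
            latest_by_user[rows[i].user_id] = rows[i]
            i += 1
        counts[day] = sum(1 for r in latest_by_user.values() if r.is_cool)
        day += timedelta(days=1)
    return counts


# Made-up sample data.
sample_rows = [
    Row(1, 1, datetime(2020, 1, 1, 9), True, False),
    Row(2, 2, datetime(2020, 1, 1, 10), False, True),
    Row(3, 2, datetime(2020, 1, 2, 8), True, True),
    Row(4, 1, datetime(2020, 1, 3, 12), False, False),
    Row(5, 1, datetime(2020, 1, 3, 18), False, True),
]
counts = cool_counts_by_day(sample_rows)
for day, n in counts.items():
    print(day, n)
```

Because the rows are consumed in created order, a day with no new rows simply carries the previous day's state forward.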

You can do this using just a query. Try using a sum on the boolean and a group by:
select date(created), sum(case when is_cool then 1 else 0 end)
from my_table
group by date(created)
or if you need the number of users
select t.date_created, count(*) num_user
from (
select distinct date(created) date_created, user_id
from my_table
where is_cool = TRUE
) t
group by t.date_created
or if you need the last value of is_cool for each user on each day
select date(max_date), sum(case when is_cool then 1 else 0 end)
from (
    select t.user_id, t.max_date, m.is_cool
    from my_table m
    inner join (
        select max(created) max_date, user_id
        from my_table
        group by user_id, date(created)
    ) t on t.max_date = m.created
       and t.user_id = m.user_id
    where m.is_cool = TRUE
) t2
group by date(max_date)

A correlated subquery might be the simplest solution. The following gets the value of is_cool for each user on each date:
select u.user_id, d.date,
(select t.is_cool
from t
where t.user_id = u.user_id and
t.created < dateadd(day, 1, d.date)
order by t.created desc
limit 1
) as is_cool
from (select distinct date(created) as date
from t
) d cross join
(select distinct user_id
from t
) u ;
Then aggregate:
select date, sum(case when is_cool then 1 else 0 end)
from (select u.user_id, d.date,
(select t.is_cool
from t
where t.user_id = u.user_id and
t.created < dateadd(day, 1, d.date)
order by t.created desc
limit 1
) as is_cool
from (select distinct date(created) as date
from t
) d cross join
(select distinct user_id
from t
) u
) ud
group by date;
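To try the correlated-subquery idea without a Presto or Redshift cluster, here is a small sketch that runs the same shape of query against an in-memory SQLite database from Python. The sample rows are made up, is_cool is stored as 0/1, and date(d.day, '+1 day') is SQLite's spelling of dateadd(day, 1, d.date), so treat it as an illustration rather than a drop-in query for either engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER, user_id INTEGER, created TEXT, is_cool INTEGER)"
)
# Made-up sample data; user 3 has no row on the first day.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2020-01-01 09:00:00", 1),
        (2, 2, "2020-01-01 10:00:00", 0),
        (3, 2, "2020-01-02 08:00:00", 1),
        (4, 3, "2020-01-02 12:00:00", 1),
        (5, 1, "2020-01-03 12:00:00", 0),
        (6, 1, "2020-01-03 18:00:00", 0),
    ],
)

query = """
SELECT day, SUM(COALESCE(is_cool, 0)) AS cool_users_count
FROM (
    SELECT d.day AS day,
           u.user_id AS user_id,
           -- latest row for this user created before the end of this day
           (SELECT t.is_cool
            FROM t
            WHERE t.user_id = u.user_id
              AND t.created < date(d.day, '+1 day')
            ORDER BY t.created DESC
            LIMIT 1) AS is_cool
    FROM (SELECT DISTINCT date(created) AS day FROM t) d
    CROSS JOIN (SELECT DISTINCT user_id FROM t) u
) AS ud
GROUP BY day
ORDER BY day
"""
result = conn.execute(query).fetchall()
for day, cool_users_count in result:
    print(day, cool_users_count)
```

The COALESCE treats a user with no row yet as not cool. Note that the day list comes from the data itself, so a day on which nobody inserted a row does not appear in the output; a calendar table would be needed to fill such gaps.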

Related

Show only latest login from inner join SQL statement

I'm relatively new to SQL and I have the following query to get a list of logins since Jan 1st. I'm trying to only display each user's last login.
SELECT usrlogs.serverlogintime AS Login_Date,
usrlogs.usrname AS User_Name,
usrlogs.usrid AS User_ID,
usrlogs.usrlogid AS Log_ID,
users.status AS Active
FROM usrlogs
INNER JOIN users
ON usrid = uid
WHERE DATE_FORMAT (ServerLoginTime,'%Y-%m-%d') >= '2022-01-01' and status="0"
User_Log_ID increases by 1 with each new login to the server. Is there a way to only display each user's highest Log ID?
You can use a subselect to get the highest usrlogid for each user and then select the usrlogs row with that ID:
SELECT u.serverlogintime AS Login_Date,
u.usrname AS User_Name,
u.usrid AS User_ID,
u.usrlogid AS Log_ID,
users.status AS Active
FROM usrlogs u
INNER JOIN (SELECT MAX(usrlogid) AS usrlogid, usrid
            FROM usrlogs
            GROUP BY usrid) u1
ON u1.usrid = u.usrid AND u1.usrlogid = u.usrlogid
INNER JOIN users
ON u.usrid = users.uid
WHERE DATE_FORMAT (u.ServerLoginTime,'%Y-%m-%d') >= '2022-01-01' and status="0"
You need to use Row_Number() like this:
SELECT * FROM (
SELECT usrlogs.serverlogintime AS Login_Date,
usrlogs.usrname AS User_Name,
usrlogs.usrid AS User_ID,
usrlogs.usrlogid AS Log_ID,
users.status AS Active,
Row_number() over (partition by usrlogs.usrid order by usrlogs.usrlogid desc ) rw
FROM usrlogs
INNER JOIN users
ON usrid = uid
WHERE DATE_FORMAT (ServerLoginTime,'%Y-%m-%d') >= '2022-01-01' and status="0"
) t where t.rw=1
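As a quick check of the Row_number() approach, the sketch below runs an equivalent query on an in-memory SQLite database from Python. The table contents are invented, the date filter compares the text timestamp directly instead of using DATE_FORMAT (which is MySQL-specific), and window functions assume SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (uid INTEGER, status TEXT);
    CREATE TABLE usrlogs (usrlogid INTEGER, usrid INTEGER, usrname TEXT,
                          serverlogintime TEXT);
    """
)
# Made-up sample data.
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "0"), (2, "0"), (3, "1")])
conn.executemany(
    "INSERT INTO usrlogs VALUES (?, ?, ?, ?)",
    [
        (10, 1, "ann", "2022-01-05 08:00:00"),
        (11, 1, "ann", "2022-02-01 09:00:00"),
        (12, 2, "bob", "2021-12-30 10:00:00"),
        (13, 2, "bob", "2022-01-10 11:00:00"),
        (14, 3, "cat", "2022-03-01 12:00:00"),
    ],
)

query = """
SELECT User_ID, Log_ID
FROM (
    SELECT l.usrid AS User_ID,
           l.usrlogid AS Log_ID,
           -- number each user's logins, newest (highest Log ID) first
           ROW_NUMBER() OVER (PARTITION BY l.usrid ORDER BY l.usrlogid DESC) AS rw
    FROM usrlogs l
    INNER JOIN users u ON l.usrid = u.uid
    WHERE l.serverlogintime >= '2022-01-01' AND u.status = '0'
) AS t
WHERE t.rw = 1
ORDER BY User_ID
"""
latest_logins = conn.execute(query).fetchall()
print(latest_logins)
```

User 3 is dropped by the status filter, and each remaining user keeps only the row with the highest Log ID.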

SQLite Getting multiple results with LIMIT 1

I have the following problem.
Part of a task is to determine the visitor(s) with the most money spent between 2000 and 2020.
It just looks like this.
SELECT UserEMail FROM Visitor
JOIN Ticket ON Visitor.UserEMail = Ticket.VisitorUserEMail
where Ticket.Date> date('2000-01-01') AND Ticket.Date < date ('2020-12-31')
Group by Ticket.VisitorUserEMail
order by SUM(Price) DESC;
Is it possible to output more than one person if both have spent the same amount?
Use rank():
SELECT VisitorUserEMail
FROM (SELECT VisitorUserEMail, SUM(PRICE) as sum_price,
RANK() OVER (ORDER BY SUM(Price) DESC) as seqnum
FROM Ticket t
WHERE t.Date >= date('2000-01-01') AND t.Date < date('2021-01-01')
GROUP BY t.VisitorUserEMail
) t
WHERE seqnum = 1;
Note: You don't need the JOIN, assuming that ticket buyers are actually visitors. If that assumption is not true, then use the JOIN.
Use a CTE that returns all the total prices for each email and with NOT EXISTS select the rows with the top total price:
WITH cte AS (
SELECT VisitorUserEMail, SUM(Price) SumPrice
FROM Ticket
WHERE Date >= '2000-01-01' AND Date <= '2020-12-31'
GROUP BY VisitorUserEMail
)
SELECT c.VisitorUserEMail
FROM cte c
WHERE NOT EXISTS (
SELECT 1 FROM cte
WHERE SumPrice > c.SumPrice
)
or:
WITH cte AS (
SELECT VisitorUserEMail, SUM(Price) SumPrice
FROM Ticket
WHERE Date >= '2000-01-01' AND Date <= '2020-12-31'
GROUP BY VisitorUserEMail
)
SELECT VisitorUserEMail
FROM cte
WHERE SumPrice = (SELECT MAX(SumPrice) FROM cte)
Note that you don't need the function date() because the result of date('2000-01-01') is '2000-01-01'.
Also I think that the conditions in the WHERE clause should include the =, right?
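To see that the ranking approach and the MAX approach both keep ties, here is a small sketch using Python's built-in sqlite3 module. The e-mail addresses and prices are made up, and the RANK() variant assumes SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Ticket (VisitorUserEMail TEXT, Price INTEGER, Date TEXT)")
# Made-up sample data: a and b tie at 30; d spends more, but outside the range.
conn.executemany(
    "INSERT INTO Ticket VALUES (?, ?, ?)",
    [
        ("a@example.com", 10, "2005-05-01"),
        ("a@example.com", 20, "2010-06-01"),
        ("b@example.com", 30, "2015-07-01"),
        ("c@example.com", 5, "2019-08-01"),
        ("d@example.com", 100, "2021-06-01"),
    ],
)

totals_cte = """
WITH totals AS (
    SELECT VisitorUserEMail, SUM(Price) AS SumPrice
    FROM Ticket
    WHERE Date >= '2000-01-01' AND Date <= '2020-12-31'
    GROUP BY VisitorUserEMail
)
"""

# Variant 1: rank the totals; tied totals share rank 1.
rank_query = totals_cte + """
SELECT VisitorUserEMail
FROM (SELECT VisitorUserEMail,
             RANK() OVER (ORDER BY SumPrice DESC) AS seqnum
      FROM totals) AS ranked
WHERE seqnum = 1
ORDER BY VisitorUserEMail
"""

# Variant 2: compare each total with the maximum total.
max_query = totals_cte + """
SELECT VisitorUserEMail
FROM totals
WHERE SumPrice = (SELECT MAX(SumPrice) FROM totals)
ORDER BY VisitorUserEMail
"""

top_by_rank = [row[0] for row in conn.execute(rank_query)]
top_by_max = [row[0] for row in conn.execute(max_query)]
print(top_by_rank, top_by_max)
```

With LIMIT 1 only one of the two tied visitors would come back; both variants here return both.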

Find increase in history records in specific range

I want to find records in the date range 1/1/19-1/7/19 whose amount increased,
using the table HISTORY:
DATE AMOUNT ID
(Date, number, varchar2(30))
I can find the IDs inside the range correctly,
assuming an increase/decrease can happen only when there are two records with the same ID:
with suspect as
(select id
from history
where t.createddate < to_date('2019-07-01', 'yyyy-mm-dd')
group by id
having count(1) > 1),
ids as
(select id
from history
join suspect
on history.id = suspect.id
where history.date > to_date('2019-01-01', 'yyyy-mm-dd')
and history.date < to_date('2019-07-01', 'yyyy-mm-dd'))
select count(distinct id)
from history a, history b
where a.id = b.id
and a.date < b.date
and a.amount < b.amount
The problem is that, to find an increase, I need the previous record, which can be before the time range.
I can find the last time before the range, but I failed to use it:
ids_prevtime as (
select history.*, max(t.date) over (partition by t.id) max_date
from history
join ids on history.userid = ids.id
where history.date < to_date('2019-01-01','yyyy-mm-dd' )
), ids_prev as (
select * from ids_prevtime where createdate=max_date
)
I see that you found a solution, but maybe you could do it more simply, using lag():
select count(distinct id)
from (select id, date_, amount,
lag(amount) over (partition by id order by date_) prev_amt
from history)
where date_ between date '2019-01-01' and date '2019-07-01'
and amount > prev_amt;
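The lag() idea can be tried on a toy data set with Python's sqlite3 module. This is only a sketch: the rows are invented, dates are stored as ISO text rather than Oracle DATE values, and window functions assume SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (id TEXT, date_ TEXT, amount INTEGER)")
# Made-up sample data.
conn.executemany(
    "INSERT INTO history VALUES (?, ?, ?)",
    [
        ("A", "2018-12-01", 100),  # previous record lies before the range
        ("A", "2019-02-01", 150),  # increase inside the range -> counted
        ("B", "2019-03-01", 200),
        ("B", "2019-04-01", 180),  # decrease -> not counted
        ("C", "2019-05-01", 50),   # single record, no previous amount
        ("D", "2019-06-01", 10),
        ("D", "2019-08-01", 20),   # increase, but after the range
    ],
)

query = """
SELECT DISTINCT id
FROM (SELECT id, date_, amount,
             -- previous amount for the same id, looking across the whole table
             LAG(amount) OVER (PARTITION BY id ORDER BY date_) AS prev_amt
      FROM history) AS h
WHERE date_ BETWEEN '2019-01-01' AND '2019-07-01'
  AND amount > prev_amt
ORDER BY id
"""
increased_ids = [row[0] for row in conn.execute(query)]
print(increased_ids, len(increased_ids))
```

The important detail is that lag() runs over the unfiltered table and the date filter is applied afterwards, so the previous record is still visible when it falls before the range.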
Add a union of the last history records before the range with the records inside the range:
ids_prev as
(select ID, DATE, AMOUNT
from id_before_rangetime
where createddate = max_date),
ids_in_range as
(select history.*
from history
join ids
on history.ID = ids.ID
where history.date > to_date('2019-01-01', 'yyyy-mm-dd')
and history.date < to_date('2019-07-01', 'yyyy-mm-dd')),
all_relevant as
(select * from ids_in_range union all select * from ids_prev)
and then count increases:
select count(distinct id)
from all_relevant a, all_relevant b
where a.id = b.id
and a.date < b.date
and a.amount < b.amount

What's the proper SQL query to find a 'status change' before given date?

I have a table of logged 'status changes'. I need to find the latest status change for a user, and if it was a) a certain 'type' of status change (s.new_status_id), and b) greater than 7 days old (s.change_date), then include it in the results. My current query sometimes returns the second-to-latest status change for a given user, which I don't want -- I only want to evaluate the last one.
How can I modify this query so that it will only include a record if it is the most recent status change for that user?
Query
SELECT DISTINCT ON (s.applicant_id) s.applicant_id, a.full_name, a.email_address, u.first_name, s.new_status_id, s.change_date, a.applied_class
FROM automated_responses_statuschangelogs s
INNER JOIN application_app a on (a.id = s.applicant_id)
INNER JOIN accounts_siuser u on (s.person_who_modified_id = u.id)
WHERE now() - s.change_date > interval '7' day
AND s.new_status_id IN
(SELECT current_status
FROM application_status
WHERE status_phase_id = 'In The Flow'
)
ORDER BY s.applicant_id, s.change_date DESC, s.new_status_id, s.person_who_modified_id;
You can use row_number() to filter one entry per applicant:
select *
from (
select row_number() over (partition by applicant_id
order by change_date desc) rn
, *
from automated_responses_statuschangelogs
) as lc
join application_app a
on a.id = lc.applicant_id
join accounts_siuser u
on lc.person_who_modified_id = u.id
join application_status stat
on lc.new_status_id = stat.current_status
where lc.rn = 1
and stat.status_phase_id = 'In The Flow'
and lc.change_date < now() - interval '7' day
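The reason the order of operations matters can be shown on a toy table: if the status and age filters run before the "latest row" is picked, an older qualifying row can stand in for a newer one that does not qualify. The sketch below uses Python's sqlite3 module with made-up data, a fixed "today" parameter in place of now(), and a single status column in place of the application_status lookup; window functions assume SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE statuslog (applicant_id INTEGER, new_status_id TEXT, change_date TEXT)"
)
# Made-up sample data.
conn.executemany(
    "INSERT INTO statuslog VALUES (?, ?, ?)",
    [
        (1, "flow", "2024-01-01"),  # old qualifying change ...
        (1, "done", "2024-01-29"),  # ... superseded by a newer change
        (2, "flow", "2024-01-10"),  # latest change, older than 7 days
    ],
)
params = {"today": "2024-01-31"}

# Filter first, then pick one row per applicant: applicant 1 slips through.
filter_first = """
SELECT applicant_id
FROM statuslog
WHERE new_status_id = 'flow' AND change_date < date(:today, '-7 day')
GROUP BY applicant_id
ORDER BY applicant_id
"""

# Rank first, then filter the latest row only.
rank_first = """
SELECT applicant_id
FROM (SELECT applicant_id, new_status_id, change_date,
             ROW_NUMBER() OVER (PARTITION BY applicant_id
                                ORDER BY change_date DESC) AS rn
      FROM statuslog) AS lc
WHERE rn = 1
  AND new_status_id = 'flow'
  AND change_date < date(:today, '-7 day')
ORDER BY applicant_id
"""

wrong = [row[0] for row in conn.execute(filter_first, params)]
right = [row[0] for row in conn.execute(rank_first, params)]
print(wrong, right)
```

Only applicant 2 should be reported, which is what the rank-first query returns.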

How to determine if two records are 1 year apart (using a timestamp)

I need to analyze some weblogs and determine if a user has visited once, taken a year break, and visited again. I want to add a flag to every row (Y/N) with a VisitId that meets the above criteria.
How would I go about creating this sql?
Here are the fields I have, that I think need to be used (by analyzing the timestamp of the first page of each visit):
VisitID - each visit has a unique Id (e.g. 12356, 12345, 16459)
UserID - each user has one Id (e.g. steve = 1, ted = 2, mark = 12345, etc.)
TimeStamp - looks like this: 2010-01-01 00:32:30.000
select VisitID, UserID, TimeStamp from page_view_t where pageNum = 1;
thanks - any help would be greatly appreciated.
You could rank every user's rows, then join the ranked row set to itself to compare adjacent rows:
;
WITH ranked AS (
SELECT
*,
rnk = ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY TimeStamp)
FROM page_view_t
),
flagged AS (
SELECT
*,
IsReturnVisit = CASE
WHEN EXISTS (
SELECT *
FROM ranked
WHERE UserID = r.UserID
AND rnk = r.rnk - 1
AND TimeStamp <= DATEADD(YEAR, -1, r.TimeStamp)
)
THEN 'Y'
ELSE 'N'
END
FROM ranked r
)
SELECT
VisitID,
UserID,
TimeStamp,
IsReturnVisit
FROM flagged
Note: the above flags only return visits.
UPDATE
To flag the first visits same as return visits, the flagged CTE could be modified as follows:
…
SELECT
*,
IsFirstOrReturnVisit = CASE
WHEN p.UserID IS NULL OR r.TimeStamp >= DATEADD(YEAR, 1, p.TimeStamp)
THEN 'Y'
ELSE 'N'
END
FROM ranked r
LEFT JOIN ranked p ON r.UserID = p.UserID AND r.rnk = p.rnk + 1
…
References that might be useful:
WITH common_table_expression (Transact-SQL)
Ranking Functions (Transact-SQL)
ROW_NUMBER (Transact-SQL)
The other guy was faster, but since I took the time to do it and it's a completely different approach, I might as well post it :D.
SELECT pv2.VisitID,
pv2.UserID,
pv2.TimeStamp,
CASE WHEN pv1.VisitID IS NOT NULL
AND pv3.VisitID IS NULL
THEN 'YES' ELSE 'NO' END AS IsReturnVisit
FROM page_view_t pv2
LEFT JOIN page_view_t pv1 ON pv1.UserID = pv2.UserID
AND pv1.VisitID <> pv2.VisitID
AND (pv1.TimeStamp <= DATEADD(YEAR, -1, pv2.TimeStamp)
OR pv2.TimeStamp <= DATEADD(YEAR, -1, pv1.TimeStamp))
AND pv1.pageNum = 1
LEFT JOIN page_view_t pv3 ON pv1.UserID = pv3.UserID
AND (pv3.TimeStamp BETWEEN pv1.TimeStamp AND pv2.TimeStamp
OR pv3.TimeStamp BETWEEN pv2.TimeStamp AND pv1.TimeStamp)
AND pv3.pageNum = 1
WHERE pv2.pageNum = 1
Assuming the page_view_t table stores the UserID and TimeStamp of each visit, the following query returns users who took a break of at least a year (365 days) between two consecutive visits.
select t1.UserID
from page_view_t t1
where (
select datediff(day, max(t2.[TimeStamp]), t1.[TimeStamp])
from page_view_t t2
where t2.UserID = t1.UserID and t2.[TimeStamp] < t1.[TimeStamp]
group by t2.UserID
) >= 365
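For experimenting with the "previous visit at least a year earlier" rule outside SQL Server, here is a sketch that uses LAG() on an in-memory SQLite database from Python. The visits are invented, datetime(prev_ts, '+1 year') stands in for DATEADD(YEAR, 1, ...), and window functions assume SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_view_t (VisitID INTEGER, UserID INTEGER, TimeStamp TEXT, pageNum INTEGER)"
)
# Made-up sample data; only the first page of each visit (pageNum = 1) is compared.
conn.executemany(
    "INSERT INTO page_view_t VALUES (?, ?, ?, ?)",
    [
        (100, 1, "2010-01-01 00:32:30", 1),
        (100, 1, "2010-01-01 00:40:00", 2),
        (101, 1, "2011-02-01 10:00:00", 1),  # more than a year after visit 100
        (200, 2, "2010-03-01 09:00:00", 1),
        (201, 2, "2010-06-01 09:00:00", 1),  # only three months later
    ],
)

query = """
SELECT VisitID, UserID, TimeStamp,
       CASE WHEN prev_ts IS NOT NULL
             AND TimeStamp >= datetime(prev_ts, '+1 year')
            THEN 'Y' ELSE 'N' END AS IsReturnVisit
FROM (SELECT VisitID, UserID, TimeStamp,
             -- start time of the same user's previous visit, if any
             LAG(TimeStamp) OVER (PARTITION BY UserID ORDER BY TimeStamp) AS prev_ts
      FROM page_view_t
      WHERE pageNum = 1) AS v
ORDER BY VisitID
"""
flags = {row[0]: row[3] for row in conn.execute(query)}
print(flags)
```

This flags only the visit that follows the gap; flagging every row of that VisitID would take a join back to page_view_t on VisitID.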