Get totals from difference between rows - sql

I have a table, with the following structure:
(
id SERIAL PRIMARY KEY,
user_id integer NOT NULL REFERENCES user(id) ON UPDATE CASCADE,
status text NOT NULL,
created_at timestamp with time zone NOT NULL,
updated_at timestamp with time zone NOT NULL
)
Example data:
"id","user_id","status","created_at","updated_at"
416,38,"ONLINE","2018-08-07 14:40:51.813+00","2018-08-07 14:40:51.813+00"
417,39,"ONLINE","2018-08-07 14:45:00.717+00","2018-08-07 14:45:00.717+00"
418,38,"OFFLINE","2018-08-07 15:43:22.678+00","2018-08-07 15:43:22.678+00"
419,38,"ONLINE","2018-08-07 16:21:30.725+00","2018-08-07 16:21:30.725+00"
420,38,"OFFLINE","2018-08-07 16:49:10.3+00","2018-08-07 16:49:10.3+00"
421,38,"ONLINE","2018-08-08 11:37:53.639+00","2018-08-08 11:37:53.639+00"
422,38,"OFFLINE","2018-08-08 12:29:08.234+00","2018-08-08 12:29:08.234+00"
423,39,"ONLINE","2018-08-14 15:22:00.539+00","2018-08-14 15:22:00.539+00"
424,39,"OFFLINE","2018-08-14 15:22:02.092+00","2018-08-14 15:22:02.092+00"
When a user on my application goes online, a new row is inserted with status ONLINE. When they go offline, a row with status OFFLINE is inserted. There are other entries created to record different events, but for this query only OFFLINE and ONLINE are important.
I want to produce a chart showing the total number of users online per time period (e.g. 5 minutes) within a date range. If a user is online for any part of a period, they should be counted in that period.
Example:
datetime, count
2019-05-22T12:00:00+0000, 53
2019-05-22T12:05:00+0000, 47
2019-05-22T12:10:00+0000, 49
2019-05-22T12:15:00+0000, 55
2019-05-22T12:20:00+0000, 59
2019-05-22T12:25:00+0000, 56
I'm able to produce a similar chart for an individual user by fetching all status rows within the date range and then processing them manually, but this approach won't scale to all users.
I believe something like this could be accomplished with window functions, but I'm not really sure where to start.

Your question is fairly vague, so nobody can really give you a complete answer. That said, you can probably achieve what you want with a combination of "with" clauses and window functions. The "with" clause lets you break a big problem down into small parts. The following query (written without regard to performance) may help; replace public.tbl_test with your table:
with temp_online as (
select
*
from public.tbl_test
where public.tbl_test.status ilike 'online'
order by created_at
),
temp_offline as (
select
*
from public.tbl_test
where public.tbl_test.status ilike 'offline'
order by created_at
),
temp_change as (
select
* ,
(
select temp_offline.created_at from temp_offline where temp_offline.created_at > temp_online.created_at and temp_offline.user_id = temp_online.user_id order by created_at asc limit 1
) as go_offline
from temp_online
),
temp_result as
(
select *,
go_offline - created_at as online_duration
from temp_change
),
temp_series as
(
SELECT (generate_series || ' minute')::interval + '2019-05-22 00:00:00'::timestamp as temp_date
FROM generate_series(0, 1440,5)
)
select
temp_series.temp_date,
(select count(*) from temp_result where temp_result.created_at <= temp_series.temp_date and temp_result.go_offline >= temp_series.temp_date) as count_users
from
temp_series
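Note that the query above counts the users who are online at each 5-minute instant, while the question asks for anyone online during any part of the period. A minimal sketch of the overlap version follows, run through Python's built-in sqlite3 module so it can be tried without a server. The table name status_log, the helper users_online_per_bucket and the shortened sample rows are assumptions for illustration; on PostgreSQL, generate_series would replace the recursive CTE. Treating a user with no later OFFLINE row as still online is also an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE status_log (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "status TEXT, created_at TEXT)"
)
# A few rows from the question, with the timestamps cut to whole seconds.
conn.executemany(
    "INSERT INTO status_log VALUES (?, ?, ?, ?)",
    [
        (416, 38, "ONLINE", "2018-08-07 14:40:51"),
        (417, 39, "ONLINE", "2018-08-07 14:45:00"),
        (418, 38, "OFFLINE", "2018-08-07 15:43:22"),
        (419, 38, "ONLINE", "2018-08-07 16:21:30"),
        (420, 38, "OFFLINE", "2018-08-07 16:49:10"),
    ],
)

QUERY = """
WITH RECURSIVE
sessions AS (
    -- Pair every ONLINE row with the next OFFLINE row of the same user.
    SELECT o.user_id,
           o.created_at AS online_at,
           (SELECT MIN(f.created_at)
              FROM status_log AS f
             WHERE f.user_id = o.user_id
               AND f.status = 'OFFLINE'
               AND f.created_at > o.created_at) AS offline_at
    FROM status_log AS o
    WHERE o.status = 'ONLINE'
),
buckets(bucket_start) AS (
    -- One row per 5-minute bucket in [range_start, range_end).
    SELECT :range_start
    UNION ALL
    SELECT datetime(bucket_start, '+5 minutes')
    FROM buckets
    WHERE datetime(bucket_start, '+5 minutes') < :range_end
)
SELECT b.bucket_start, COUNT(DISTINCT s.user_id) AS users_online
FROM buckets AS b
LEFT JOIN sessions AS s
       -- The session overlaps the bucket: it starts before the bucket ends
       -- and has not finished before the bucket starts.
       ON s.online_at < datetime(b.bucket_start, '+5 minutes')
      AND (s.offline_at IS NULL OR s.offline_at >= b.bucket_start)
GROUP BY b.bucket_start
ORDER BY b.bucket_start
"""


def users_online_per_bucket(conn, range_start, range_end):
    """Return (bucket_start, users_online) rows for 5-minute buckets."""
    params = {"range_start": range_start, "range_end": range_end}
    return conn.execute(QUERY, params).fetchall()


rows = users_online_per_bucket(conn, "2018-08-07 14:35:00", "2018-08-07 15:00:00")
for bucket_start, users_online in rows:
    print(bucket_start, users_online)
```

The timestamps are compared as text here, which only works because every value uses the same 'YYYY-MM-DD HH:MM:SS' layout.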

Related

SQL Efficient way to loop through sequential time-series data to identify unique users without double counting

I am working on a project where I have users that have 'signed up' during a period from Day1 to Day8. However, due to the circumstances of the issue, a user can 'sign up' more than once. This results in the same user being able to sign up on both Dayx and Dayz. Note: I am using the latest stable version of PostgreSQL for Windows.
The goal is to count only the number of unique sign ups for each day without double counting any users. This means that total sign ups in Day8 need to take into account signups in Days1-Day7 as well.
The solution I have at the moment works technically, but it is very clunky, takes forever to query and does not scale well. Ideally, the SQL query needs to scale for any time period between time x and time y without having to manually write a block of code for each individual time period.
As you can see from my code below, it technically gives me the right answer, but it is cumbersome, slow and does not scale. I am looking for help finding an elegant, scalable solution that does not take 30 minutes to run.
Note: I could write this much more elegantly in Python but am not sure how well Python scales with large datasets stored in RDBMS (ex: Pull all raw data with SQL and then import the CSV into python where a python script will do the calculations instead of doing it in SQL)
TABLE DATA:
+-----------+--------------+-----------------------------------------------+
| cookie_id | time_created | URL |
+-----------+--------------+-----------------------------------------------+
| 3422erq | 2018-10-1 | https:data.join/4wr08w40rwj/utm_source.com |
| 3421ra | 2018-10-1 | https:data.join/convert/45824234/utm_code.com |
| 321af | 2018-10-2 | https:data.join/utm_source=34342.com |
+-----------+--------------+-----------------------------------------------+
SELECT COUNT(DISTINCT cookie_id), time_created FROM Data WHERE url LIKE ('%join%')
AND time_created IN (SELECT MIN(time_created) FROM Data)
GROUP BY time_created
--Code to get all unique users in Day1 (5,304 unique users)
SELECT COUNT(DISTINCT cookie_id), time_created FROM Data WHERE url LIKE ('%join%')
AND time_created IN (SELECT MIN(time_created +1) FROM Data)
AND cookie_id NOT IN (SELECT DISTINCT cookie_id FROM Data WHERE time_created = '2018-10-01')
GROUP BY time_created
--Code to get all unique users in Day2 (9,218 unique users)
SELECT COUNT(DISTINCT cookie_id), time_created FROM Data WHERE url LIKE ('%join%')
AND time_created IN (SELECT MIN(time_created +2) FROM Data)
AND cookie_id NOT IN (SELECT DISTINCT cookie_id FROM Data WHERE time_created BETWEEN '2018-10-01' AND '2018-10-02')
GROUP BY time_created
--Code to get all unique users in Day3 (8,745 unique users)
Expected & actual results are the same. However the code does not scale and is incredibly slow.
So given this table:
CREATE TABLE data
(
cookie_id text,
time_created date,
url text
)
(Yes, no indexes)
I generated 5.5 million rows, each with a random cookie_id of 5 characters from [0-9A-F] and a random date (2018-10-01::date + (10*random())::int), with every 100th row having the https:data.join/.... url while the others had some garbage.
Your second query took around 8.5 minutes. This one, on the other hand, took around 0.2s:
with count_per_day as
(
select time_created, count(*) as unique_users from (
select cookie_id
, time_created
, row_number() over (partition by cookie_id order by time_created) occurrence
from data
where url like 'https:data.join%'
and time_created between '2018-10-01' and '2018-10-08'
) oc
where occurrence = 1
group by time_created
)
select time_created, unique_users, sum(unique_users) over (order by time_created) as running_sum
from count_per_day
Again, with no indexes. If your row counts are orders of magnitude bigger, it may help to know that adding an index on (left(url, 15), time_created, cookie_id) and changing the url condition to left(url, 15) = 'https:data.join' dropped it to below 50ms in my test.

Analytics in sql

I have a table with the following structure:
user_id (int) - event (str) - time (timestamp) - value (int)
Event can take several values: install, login, buy, etc.
I need to get all user records from before they updated the application.
For example, the new version of my application was released on 1 January 2019, but users may install the new version on any day.
How can I get sum(value) for the first and second versions?
I tried a self-join on the table, but I don't think that is the best solution.
Please help.
Here is the definition of your table (as I understood it from your comments and description):
CREATE TABLE user_events (
user_id integer,
event varchar,
time timestamp without time zone,
value integer
);
Here is the query you asked for:
SELECT
COUNT(user_id),
SUM(value)
FROM (
SELECT
DISTINCT ON (user_id)
user_id,time,value
FROM user_events
WHERE event='install'
ORDER BY user_id, time DESC
) last_installations
WHERE
time BETWEEN date '2018-01-01' AND date '2019-01-01';
Some explanations:
The inner query (last_installations) selects the last install event for each user.
The outer query keeps only installations of the first and second versions, and calculates SUM(value) (as you asked) and COUNT(user_id) (which I added for clarity: how many users are on versions 1 and 2 now).
UPDATE
Sum of value for all events, by version:
SELECT
event,
CASE
WHEN time >= date '2018-01-01' AND time < date '2018-06-01' THEN 1
WHEN time >= date '2018-06-01' AND time < date '2019-01-01' THEN 2
WHEN time >= date '2019-01-01' THEN 3
ELSE 0 -- unknown version
END AS version,
SUM(value)
FROM user_events
GROUP BY 1,2

Select a range of Time Stamp in SQL

Please help me in here.
SELECT TOP 200 [TimeStamp]
,[Id]
,[Serial]
,[Server]
,[Message]
,[Station]
,ISNULL([P1],'Active Directory') as 'Category'
,ISNULL([P2],'Item Bold') as 'ItemName'
FROM [data].[dbo].[Message]
WHERE TimeStamp >= '2017-11-13' AND TimeStamp <= '2017-12-30'
ORDER BY TimeStamp Desc
I am trying to get data in a specific range of "TimeStamp". I have a UI where the user can select two timestamps to define the range (see code). My problem is that for a specific TimeStamp there are a lot of rows with identical values. For example, "2017-12-30" has 5 entries, but they differ in "Category".
Now my question is: how would I know which row the user actually picked by "TimeStamp" when several rows have identical values?
Extract the date part of the datetime expression and compare on that. Your query uses SQL Server syntax, where DATE(expr) (a MySQL function) is not available, so use CAST(expr AS date) instead:
SELECT TOP 200 [TimeStamp]
,[Id]
,[Serial]
,[Server]
,[Message]
,[Station]
,ISNULL([P1],'Active Directory') as 'Category'
,ISNULL([P2],'Item Bold') as 'ItemName'
FROM [data].[dbo].[Message]
WHERE CAST([TimeStamp] AS date) >= '2017-11-13' AND CAST([TimeStamp] AS date) <= '2017-12-30'
ORDER BY TimeStamp Desc
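An alternative that leaves the column untouched is a half-open range: keep rows from the start date up to, but not including, the day after the end date. The sketch below only illustrates the idea. It uses Python's sqlite3 module with a made-up message table and sample rows (all assumptions), and the bounds are written out as midnight timestamps to mimic how SQL Server would read a bare date literal compared against a datetime column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE message (ts TEXT, category TEXT)")
# Hypothetical rows around both ends of the selected range.
conn.executemany(
    "INSERT INTO message VALUES (?, ?)",
    [
        ("2017-11-12 23:59:59", "A"),
        ("2017-11-13 00:00:00", "B"),
        ("2017-12-30 00:00:00", "C"),
        ("2017-12-30 17:45:10", "D"),
        ("2017-12-31 00:00:00", "E"),
    ],
)

# Closed range ending at midnight of the last day: rows later that day are lost.
closed = [
    row[0]
    for row in conn.execute(
        "SELECT category FROM message WHERE ts >= ? AND ts <= ? ORDER BY ts",
        ("2017-11-13 00:00:00", "2017-12-30 00:00:00"),
    )
]

# Half-open range ending just before the next day: the whole last day is kept.
half_open = [
    row[0]
    for row in conn.execute(
        "SELECT category FROM message WHERE ts >= ? AND ts < ? ORDER BY ts",
        ("2017-11-13 00:00:00", "2017-12-31 00:00:00"),
    )
]

print(closed)
print(half_open)
```

Because the comparison is made on the raw column, an index on that column can still be used, which is usually not the case once the column is wrapped in a function or cast.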

How to use window functions to get metrics for today, last 7 days, last 30 days for each value of the date?

My problem seems simple on paper:
For a given date, give me active users for that given date, active users in given_Date()-7, active users in a given_Date()-30
Sample data:
"timestamp" "user_public_id"
"23-Sep-15" "805a47023fa611e58ebb22000b680490"
"28-Sep-15" "d842b5bc5b1711e5a84322000b680490"
"01-Oct-15" "ac6b5f70b95911e0ac5312313d06dad5"
"21-Oct-15" "8c3e91e2749f11e296bb12313d086540"
"29-Nov-15" "b144298810ee11e4a3091231390eb251"
For 01-Oct the count for today would be 1, last_7_days would be 3, and last_30_days would be 3+n (where n is the count of user_ids that fall on dates preceding Oct 1st within the 30-day window).
I am on Amazon Redshift. Can somebody provide some sample SQL to help me get started?
The output should look like this:
"timestamp" "users_today", "users_last_7_days", "users_30_days"
"01-Oct-15" 1 3 (3+n)
I know asking for help/incomplete solutions are frowned upon, but this is not getting any other attention so I thought I would do my bit.
I have been pulling my hair out trying to nut this one out, alas, I am a beginner and something is not clicking for me. Perhaps yourself or others will be able to drastically improve my answer, but I think I am on the right track.
SELECT replace(convert(varchar, [timestamp], 111), '/','-') AS [timestamp], -- to get date in same format as you require
(SELECT COUNT([TIMESTAMP]) FROM #SIMPLE WHERE ([TIMESTAMP]) = ([timestamp])) AS users_today,
(SELECT COUNT([TIMESTAMP]) FROM #SIMPLE WHERE [TIMESTAMP] BETWEEN DATEADD(DY,-7,[TIMESTAMP]) AND [TIMESTAMP]) AS users_last_7_days ,
(SELECT COUNT([TIMESTAMP]) FROM #SIMPLE WHERE [TIMESTAMP] BETWEEN DATEADD(DY,-30,[TIMESTAMP]) AND [timestamp]) AS users_last_30_days
FROM #SIMPLE
GROUP BY [timestamp]
Starting with this:
CREATE TABLE #SIMPLE (
[timestamp] datetime, user_public_id varchar(32)
)
INSERT INTO #SIMPLE
VALUES('23-Sep-15','805a47023fa611e58ebb22000b680490'),
('28-Sep-15','d842b5bc5b1711e5a84322000b680490'),
('01-Oct-15','ac6b5f70b95911e0ac5312313d06dad5'),
('21-Oct-15','8c3e91e2749f11e296bb12313d086540'),
('29-Nov-15','b144298810ee11e4a3091231390eb251')
The problem I am having is that each row contains the same counts, despite my grouping by [timestamp].
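The identical counts most likely come from the subqueries never referring to the outer row: without table aliases, both sides of each comparison resolve to the inner #SIMPLE, so every subquery sees the whole table. Below is a minimal sketch of the aliased version. It runs on SQLite through Python's sqlite3 module purely so that it can be executed as-is; the table name simple, the column name ts and the ISO date strings are assumptions, and date(d.ts, '-7 days') stands in for DATEADD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE simple (ts TEXT, user_public_id TEXT)")
# Same sample rows as above, with the dates rewritten as ISO strings.
conn.executemany(
    "INSERT INTO simple VALUES (?, ?)",
    [
        ("2015-09-23", "805a47023fa611e58ebb22000b680490"),
        ("2015-09-28", "d842b5bc5b1711e5a84322000b680490"),
        ("2015-10-01", "ac6b5f70b95911e0ac5312313d06dad5"),
        ("2015-10-21", "8c3e91e2749f11e296bb12313d086540"),
        ("2015-11-29", "b144298810ee11e4a3091231390eb251"),
    ],
)

# d is the outer row (one per date); s is the table scanned by each subquery.
# The BETWEEN bounds are inclusive, mirroring the DATEADD version above.
QUERY = """
SELECT d.ts,
       (SELECT COUNT(DISTINCT s.user_public_id) FROM simple AS s
         WHERE s.ts = d.ts) AS users_today,
       (SELECT COUNT(DISTINCT s.user_public_id) FROM simple AS s
         WHERE s.ts BETWEEN date(d.ts, '-7 days') AND d.ts) AS users_last_7_days,
       (SELECT COUNT(DISTINCT s.user_public_id) FROM simple AS s
         WHERE s.ts BETWEEN date(d.ts, '-30 days') AND d.ts) AS users_last_30_days
FROM (SELECT DISTINCT ts FROM simple) AS d
ORDER BY d.ts
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Counting DISTINCT user_public_id rather than rows keeps a user who appears on several days inside a window from being counted twice.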
Step 1-- Create a table which has daily counts.
create temp table daily_mobile_Sessions as
select "timestamp" ,
count(user_public_id) over (partition by "timestamp" ) as "today"
from mobile_sessions
group by 1, mobile_sessions.user_public_id
order by 1 DESC
Step 2 -- From the table above, we run a second query that uses the "today" field and applies window functions to sum the counts.
select "timestamp", today,
sum(today) over (order by "timestamp" rows between 6 PRECEDING and CURRENT ROW) as "last_7days",
sum(today) over (order by "timestamp" rows between 29 PRECEDING and CURRENT ROW) as "last_30days"
from daily_mobile_Sessions group by "timestamp" , 2 order by 1 desc

Sqlite3: Need to Cartesian On date

I have a table which is a list of games that have been played, in a sqlite3 database. The field "datetime" is the datetime when the game ended. The field "duration" is the number of seconds the game lasted. I want to know what percent of the past 24 hours had at least 5 games running simultaneously. I figured out how to tell how many games are running at a given time:
select count(*)
from games
where strftime('%s',datetime)+0 >= 1257173442 and
strftime('%s',datetime)-duration <= 1257173442
If I had a table that was simply a list of every second (or every 30 seconds or something) I could do an intentional Cartesian product like this:
select count(*)
from (
select count(*) as concurrent, d.second
from games g, date d
where strftime('%s',datetime)+0 >= d.second and
strftime('%s',datetime)-duration <= d.second and
d.second >= strftime('%s','now') - 24*60*60 and
d.second <= strftime('%s','now')
group by d.second) x
where concurrent >=5
Is there a way to create this date table on the fly? Or that I can get a similar effect to this without having to actually create a new table that is simply a list of all the seconds this week?
Thanks
First, I can't think of a way to approach your problem by creating a table on the fly or without the aid of an extra table. Sorry.
My suggestion is for you to rely on a static Numbers table.
Create a fixed table with the format:
CREATE TABLE Numbers (
number INTEGER PRIMARY KEY
);
Populate it with the number of seconds in 24h (24*60*60 = 86400). I would use any scripting language to do that, running this insert statement 86400 times:
insert into numbers default values;
Now the Numbers table has the numbers 1 through 86400. Your query will then be modified to be:
select count(*)
from (
select count(*) as concurrent, strftime('%s','now') - 86401 + n.number second
from games g, numbers n
where strftime('%s',datetime)+0 >= strftime('%s','now') - 86401 + n.number and
strftime('%s',datetime)-duration <= strftime('%s','now') - 86401 + n.number
group by second) x
where concurrent >=5
Without a procedural language in the mix, that is the best you'll be able to do, I think.
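For what it's worth, SQLite 3.8.3 and later support recursive common table expressions, which can generate the list of seconds on the fly instead of storing it. The following is a rough sketch only, run through Python's sqlite3 module. It assumes datetime is stored as an integer number of epoch seconds, and the sample games, the seconds CTE and the sampled_seconds helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: datetime holds the end of the game in epoch seconds.
conn.execute("CREATE TABLE games (datetime INTEGER, duration INTEGER)")
conn.executemany(
    "INSERT INTO games VALUES (?, ?)",
    [(100, 100), (120, 100), (150, 100), (90, 30)],  # hypothetical games
)

QUERY = """
WITH RECURSIVE seconds(second) AS (
    -- Generates start, start + 1, ..., stop without a stored table.
    SELECT :start
    UNION ALL
    SELECT second + 1 FROM seconds WHERE second < :stop
)
SELECT COUNT(*)
FROM (
    SELECT s.second
    FROM seconds AS s
    JOIN games AS g
      ON g.datetime >= s.second
     AND g.datetime - g.duration <= s.second
    GROUP BY s.second
    HAVING COUNT(*) >= :threshold
)
"""


def sampled_seconds(conn, start, stop, threshold):
    """Count the seconds in [start, stop] with at least `threshold` games running."""
    params = {"start": start, "stop": stop, "threshold": threshold}
    return conn.execute(QUERY, params).fetchone()[0]


print(sampled_seconds(conn, 0, 150, 3))
```

For the real question, start and stop would be the epoch seconds for 24 hours ago and now, threshold would be 5, and dividing the result by the number of generated seconds gives the fraction of the day.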
Great question!
Here's a query that I think gives you what you want without using a separate table. Note this is untested (so probably contains errors) and I've assumed datetime is an int column with # of seconds to avoid a ton of strftime's.
select sum(concurrent_period) from (
select min(end_table.datetime - begin_table.begin_time) as concurrent_period
from (
select g1.datetime, g1.num_end, count(*) as concurrent
from (
select datetime, count(*) as num_end
from games group by datetime
) g1, games g2
where g2.datetime >= g1.datetime and
g2.datetime-g2.duration < g1.datetime and
g1.datetime >= strftime('%s','now') - 24*60*60 and
g1.datetime <= strftime('%s','now')+0
) end_table, (
select g3.begin_time, g1.num_begin, count(*) as concurrent
from (
select datetime-duration as begin_time,
count(*) as num_begin
from games group by datetime-duration
) g3, games g4
where g4.datetime >= g3.begin_time and
g4.datetime-g4.duration < g3.begin_time and
g3.begin_time >= strftime('%s','now') - 24*60*60 and
g3.begin_time <= strftime('%s','now')+0
) begin_table
where end_table.datetime > begin_table.begin_time
and begin_table.concurrent < 5
and begin_table.concurrent+begin_table.num_begin >= 5
and end_table.concurrent >= 5
and end_table.concurrent-end_table.num_end < 5
group by begin_table.begin_time
) aah
The basic idea is to make two tables: one with the # of concurrent games at the begin time of each game, and one with the # of concurrent games at the end time. Then join the tables together and only take rows at "critical points" where # of concurrent games crosses 5. For each critical begin time, take the critical end time that happened soonest and that hopefully gives all the periods where at least 5 games were running concurrently.
Hope that's not too convoluted to be helpful!
Kevin rather beat me to the punchline there (+1), but I'll post this variation as it differs at least a little.
The key ideas are
Map the data in to a stream of events with attributes time and 'polarity' (=start or end of game)
Keep a running total of how many games are open at the time of each event
(this is done by forming a self-join on the event stream)
Find the event times where the number of games (as Kevin says) transitions up to 5, or down to 4
A little trick: add up all the down-to-4 times and take away the up-to-5s - the order is not important
The result is the number of seconds spent with 5 or more games open
I don't have SQLite, so I've been testing with MySQL, and to preserve some sanity I've not bothered to limit the time window. That shouldn't be difficult to add.
Also, and more importantly, I've not considered what to do if games are open at the beginning or end of the period!
Something tells me there's a big simplification to be had here, but I've not spotted it yet.
SELECT SUM( event_time )
FROM (
SELECT -ga.event_type * ga.event_time AS event_time,
SUM( ga.event_type * gb.event_type ) event_type
FROM
( SELECT UNIX_TIMESTAMP( g1.endtime - g1.duration ) AS event_time
, 1 event_type
FROM games g1
UNION
SELECT UNIX_TIMESTAMP( g1.endtime )
, -1
FROM games g1 ) AS ga,
( SELECT UNIX_TIMESTAMP( g1.endtime - g1.duration ) AS event_time
, 1 event_type
FROM games g1
UNION
SELECT UNIX_TIMESTAMP( g1.endtime )
, -1
FROM games g1 ) AS gb
WHERE
ga.event_time >= gb.event_time
GROUP BY ga.event_time
HAVING SUM( ga.event_type * gb.event_type ) IN ( -4, 5 )
) AS gr
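On engines with window functions (SQLite has them from 3.25), the running total can be written directly rather than through a self-join. Here is a hedged sketch of the same event-stream idea, run through Python's sqlite3 module; the end_time column, the sample games and the seconds_with_at_least helper are assumptions for illustration, and games that are still open at the edges of a time window are not handled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: end_time is in epoch seconds and duration is in seconds.
conn.execute("CREATE TABLE games (end_time INTEGER, duration INTEGER)")
conn.executemany(
    "INSERT INTO games VALUES (?, ?)",
    [(100, 100), (120, 100), (150, 100), (90, 30)],  # hypothetical games
)

QUERY = """
WITH events AS (
    -- +1 when a game starts, -1 when it ends.
    SELECT end_time - duration AS t, 1 AS delta FROM games
    UNION ALL
    SELECT end_time, -1 FROM games
),
per_time AS (
    -- Merge events that share a timestamp so each time appears once.
    SELECT t, SUM(delta) AS delta FROM events GROUP BY t
),
running AS (
    SELECT t,
           SUM(delta) OVER (ORDER BY t) AS open_games,  -- running total
           LEAD(t) OVER (ORDER BY t) AS next_t          -- time of the next event
    FROM per_time
)
-- Each row covers the span [t, next_t) with a constant number of open games.
SELECT COALESCE(SUM(next_t - t), 0)
FROM running
WHERE open_games >= ? AND next_t IS NOT NULL
"""


def seconds_with_at_least(conn, threshold):
    """Total seconds during which at least `threshold` games were open."""
    return conn.execute(QUERY, (threshold,)).fetchone()[0]


print(seconds_with_at_least(conn, 3))
```

Summing the length of each span between consecutive events replaces the trick of adding the down-transitions and subtracting the up-transitions, and it does not depend on the transitions alternating cleanly.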
Why don't you trim the date and keep only the time? If you filter your data for a single date, every time of day is unique. That way you only need a table with the numbers from 1 to 86400 (or fewer if you use bigger intervals), and you can create two columns, "from" and "to", to define the intervals.
I'm not familiar with SQLite functions, but according to the manual you have to use the strftime function with the format '%H:%M:%S'.