How do I produce a time interval query in SQLite? - sql

I have an event-based table, and I would like to write a query that returns, for each minute, the number of events that were occurring.
For example, I have an event table like:
CREATE TABLE events (
session_id TEXT,
event TEXT,
time_stamp DATETIME
)
Which I have transformed into the following type of table:
CREATE TABLE sessions (
session_id TEXT,
start_ts DATETIME,
end_ts DATETIME,
duration INTEGER
);
Now I want to create a query that counts, for each minute, the sessions that were active during that minute. Essentially, I would get back something like:
TIME_INTERVAL ACTIVE_SESSIONS
------------- ---------------
18:00 1
18:01 5
18:02 3
18:03 0
18:04 2

OK, I think I got closer to what I wanted. It doesn't account for intervals that are empty, but it is good enough for what I need.
select strftime('%Y-%m-%dT%H:%M:00.000',start_ts) TIME_INTERVAL,
(select count(session_id)
from sessions s2
where strftime('%Y-%m-%dT%H:%M:00.000',s1.start_ts) between s2.start_ts and s2.end_ts) ACTIVE_SESSIONS
from sessions s1
group by strftime('%Y-%m-%dT%H:%M:00.000',start_ts);
This generates a row for each minute in which at least one session started, with a count of the sessions that had started (start_ts) but had not yet finished (end_ts) at that minute.
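One possible way to include the empty minutes as well is to generate the minute series with a recursive CTE (available in SQLite 3.8.3 and later) and LEFT JOIN the sessions onto it. The sketch below runs the query through Python's sqlite3 module on made-up sample rows. It assumes that timestamps are stored as 'YYYY-MM-DD HH:MM:SS' text, so that string comparison matches chronological order, and that a session counts for every minute it touches; both are assumptions, not something stated in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sessions (session_id TEXT, start_ts DATETIME, end_ts DATETIME, duration INTEGER);
-- Hypothetical sample rows, stored as 'YYYY-MM-DD HH:MM:SS' text (assumption).
INSERT INTO sessions VALUES
  ('a', '2020-01-01 18:00:10', '2020-01-01 18:01:30', 80),
  ('b', '2020-01-01 18:01:05', '2020-01-01 18:01:50', 45),
  ('c', '2020-01-01 18:03:20', '2020-01-01 18:04:00', 40);
""")

query = """
WITH RECURSIVE minutes(m) AS (
    -- First minute covered by the data, truncated to :00 seconds.
    SELECT strftime('%Y-%m-%d %H:%M:00', min(start_ts)) FROM sessions
    UNION ALL
    -- Keep adding one minute until the last minute covered by the data.
    SELECT datetime(m, '+1 minute') FROM minutes
    WHERE m < (SELECT strftime('%Y-%m-%d %H:%M:00', max(end_ts)) FROM sessions)
)
SELECT strftime('%H:%M', m) AS time_interval,
       count(s.session_id)  AS active_sessions   -- 0 when no session matches
FROM minutes
LEFT JOIN sessions s
       ON s.start_ts < datetime(m, '+1 minute')  -- started before this minute ended
      AND s.end_ts  >= m                         -- ended at or after this minute began
GROUP BY m
ORDER BY m;
"""

rows = conn.execute(query).fetchall()
for time_interval, active_sessions in rows:
    print(time_interval, active_sessions)
```

With these sample rows, 18:02 appears with a count of 0 instead of being dropped.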

PostgreSQL allows the following query.
In contrast to your example, this returns an additional column for the day, and it omits the minutes where nothing happened (count=0).
select
day, hour, minute, count(*)
from
(values ( 0),( 1),( 2),( 3),( 4),( 5),( 6),( 7),( 8),( 9),
(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),
(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),
(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),
(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),
(50),(51),(52),(53),(54),(55),(56),(57),(58),(59))
as minutes (minute),
(values ( 0),( 1),( 2),( 3),( 4),( 5),( 6),( 7),( 8),( 9),
(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),
(20),(21),(22),(23))
as hours (hour),
(select distinct cast(start_ts as date) from sessions
union
select distinct cast(end_ts as date) from sessions)
as days (day),
sessions
where
(day,hour,minute)
between (cast(start_ts as date),extract(hour from start_ts),extract(minute from start_ts))
and (cast(end_ts as date), extract(hour from end_ts), extract(minute from end_ts))
group by
day, hour, minute
order by
day, hour, minute;

This isn't exactly your query, but I think it could help. Did you look into the SQLite R-Tree module? This would allow you to create a virtual index on the start/stop time:
CREATE VIRTUAL TABLE sessions_index USING rtree (id, start, end);
Then you could search via:
SELECT * FROM sessions_index WHERE end >= <first minute> AND start <= <last minute>;

Related

BQ: Select latest date from multiple columns

Good day, all. I wrote a question relating to this earlier, but now I have encountered another problem.
I have to calculate the timestamp difference between the install_time and contributor_time columns. However, I have three contributor_time columns, and I need to select the latest time from those columns first and then subtract it from the install time.
Sample Data
users  install_time  contributor_time_1  contributor_time_2  contributor_time_3
-----  ------------  ------------------  ------------------  ------------------
1      8:00          7:45                7:50                7:55
2      10:00         9:15                9:45                9:30
3      11:00         10:30               null                null
For example, in the table above I would need to select contributor_time_3 and subtract it from install_time for user 1. For user 2, I would do the same, but with contributor_time_2.
Sample Results
users  install_time  time_diff_min
-----  ------------  -------------
1      8:00          5
2      10:00         15
3      11:00         30
The problem I am facing is that 1) the contributor_time columns are in string format and 2) some of them have 'null' string values (which means that I cannot cast it into a timestamp.)
I created a query, but I am facing an error stating that I cannot subtract a string from a timestamp. So I added safe_cast; however, the time_diff_min results only show up when all three contributor_time columns hold a timestamp. For example, in the sample table above, only the first two rows are returned.
The query I have so far is below:
SELECT
users,
install_time,
TIMESTAMP_DIFF(install_time, greatest(contributor_time_1, contributor_time_2, contributor_time_3), MINUTE) as ctct_min
FROM
(SELECT
users,
install_time,
safe_cast(contributor_time_1 as timestamp) as contributor_time_1,
safe_cast(contributor_time_2 as timestamp) as contributor_time_2,
safe_cast(contributor_time_3 as timestamp) as contributor_time_3,
FROM
(SELECT
users,
install_time,
case when contributor_time_1 = 'null' then '0' else contributor_time_1 end as contributor_time_1,
....
FROM datasource
Any help to point me in the right direction is appreciated! Thank you in advance!
Consider below
select users, install_time,
time_diff(
parse_time('%H:%M',install_time),
greatest(
parse_time('%H:%M',contributor_time_1),
parse_time('%H:%M',contributor_time_2),
parse_time('%H:%M',contributor_time_3)
),
minute) as time_diff_min
from `project.dataset.table`
if applied to sample data in your question - output is
Above can be refactored slightly into below
create temp function latest_time(arr any type) as ((
select parse_time('%H:%M',val) time
from unnest(arr) val
order by time desc
limit 1
));
select users, install_time,
time_diff(
parse_time('%H:%M',install_time),
latest_time([contributor_time_1, contributor_time_2, contributor_time_3]),
minute) as time_diff_min
from `project.dataset.table`
This is less verbose and has no redundant parsing, with the same result, so it is just a matter of preference.
You can use greatest():
select t.*,
time_diff(install_time, greatest(contributor_time_1, contributor_time_2, contributor_time_3), minute) as diff_min
from t;
Note: this assumes that the values are never NULL, which seems reasonable based on your sample data.
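The sample data does contain 'null' strings, and in BigQuery GREATEST returns NULL as soon as any argument is NULL, which would explain why only rows with three valid times produce a result. As a rough illustration of the intended logic (not BigQuery syntax), the Python sketch below skips missing values before taking the latest time. The helper names parse_hm and time_diff_min are hypothetical, and the 'H:MM' string format is assumed from the sample table.

```python
from datetime import datetime

def parse_hm(value):
    """Return a datetime for an 'H:MM' string, or None for missing / 'null' values."""
    if value is None or value.strip().lower() == "null":
        return None
    return datetime.strptime(value.strip(), "%H:%M")

def time_diff_min(install_time, *contributor_times):
    """Minutes between install_time and the latest non-null contributor time."""
    parsed = [t for t in map(parse_hm, contributor_times) if t is not None]
    if not parsed:
        return None  # no usable contributor time at all
    delta = parse_hm(install_time) - max(parsed)
    return int(delta.total_seconds() // 60)

# Rows copied from the sample data in the question.
rows = [
    (1, "8:00", "7:45", "7:50", "7:55"),
    (2, "10:00", "9:15", "9:45", "9:30"),
    (3, "11:00", "10:30", "null", "null"),
]
result = [(user, install, time_diff_min(install, *rest)) for user, install, *rest in rows]
for row in result:
    print(row)
```

The same idea in SQL would be to ignore NULLs when picking the maximum, for example by aggregating over the three values, which is what the unnest-based function above does with its ORDER BY and LIMIT.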

Analytics in sql

I have a table with the following structure:
user_id (int) - event (str) - time (timestamp) - value (int)
Event can take several values : install, login, buy, etc.
I need to get all of a user's records from before they updated the application.
For example, the new version of my application was released on 1 January 2019, but users may install the new version on any day.
How can I get sum(value) for the first and second versions?
I tried a self-join on the table, but I don't think this is the best solution.
Please help.
Here is the definition of your table (as I understood it from your comments and description):
CREATE TABLE user_events (
user_id integer,
event varchar,
time timestamp without time zone,
value integer
);
Here is the query you asked for:
SELECT
COUNT(user_id),
SUM(value)
FROM (
SELECT
DISTINCT ON (user_id)
user_id,time,value
FROM user_events
WHERE event='install'
ORDER BY user_id, time DESC
) last_installations
WHERE
time BETWEEN date '2018-01-01' AND date '2019-01-01';
Some explanations:
inner query ( last_installations ) selects last install events for each user
outer query filters out only installations of first and second versions, and calculates SUM(value) (as you asked) and COUNT(user_id) (I added for clarity - how many users are using 1 and 2 versions now)
UPDATE
sum value for all events by version
SELECT
event,
CASE
WHEN time BETWEEN date '2018-01-01' AND timestamp '2018-05-31 23:59:59' THEN 1
WHEN time BETWEEN date '2018-06-01' AND timestamp '2018-12-31 23:59:59' THEN 2
WHEN time >= date '2019-01-01' THEN 3
ELSE 0 -- unknown version
END AS version,
SUM(value)
FROM user_events
GROUP BY 1,2
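A variant of the same bucketing can be written with half-open ranges (>= start, < next start), which avoids having to spell out the last second of each period. The sketch below runs it in SQLite through Python's sqlite3 module. The sample rows are made up, and the version boundaries (1 June 2018 and 1 January 2019) are taken as assumptions from the answer and the question; timestamps are assumed to be stored as 'YYYY-MM-DD HH:MM:SS' text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_events (user_id INTEGER, event TEXT, time TEXT, value INTEGER);
-- Hypothetical sample rows.
INSERT INTO user_events VALUES
  (1, 'buy',   '2018-03-10 12:00:00', 10),
  (1, 'buy',   '2018-05-31 09:00:00',  5),
  (2, 'buy',   '2018-07-01 08:30:00',  7),
  (2, 'login', '2019-02-01 10:00:00',  1),
  (3, 'buy',   '2019-01-01 00:00:00',  4);
""")

query = """
SELECT event,
       CASE
         WHEN time >= '2018-01-01' AND time < '2018-06-01' THEN 1
         WHEN time >= '2018-06-01' AND time < '2019-01-01' THEN 2
         WHEN time >= '2019-01-01' THEN 3
         ELSE 0  -- unknown version
       END AS version,
       SUM(value) AS total
FROM user_events
GROUP BY 1, 2
ORDER BY 1, 2;
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```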

SQL question: count of occurrence greater than N in any given hour

I'm looking through login logs (in Netezza) and trying to find users who have greater than a certain number of logins in any 1 hour time period (any consecutive 60 minute period, as opposed to strictly a clock hour) since December 1st. I've viewed the following posts, but most seem to address searching within a specific time range, not ANY given time period. Thanks.
https://dba.stackexchange.com/questions/137660/counting-number-of-occurences-in-a-time-period
https://dba.stackexchange.com/questions/67881/calculating-the-maximum-seen-so-far-for-each-point-in-time
Count records per hour within a time span
You could use the analytic function lag to look back in a sorted sequence of time stamps to see whether the record that came 19 entries earlier is within an hour difference:
with cte as (
select user_id,
login_time,
lag(login_time, 19) over (partition by user_id order by login_time) as lag_time
from userlog
order by user_id,
login_time
)
select user_id,
min(login_time) as login_time
from cte
where extract(epoch from (login_time - lag_time)) < 3600
group by user_id
The output will show the matching users with the first occurrence when they logged a twentieth time within an hour.
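To make the lag idea concrete, here is a small Python sketch of the same check for a single user: sort the login times and compare each one with the entry n - 1 positions earlier. The function name first_burst and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

def first_burst(logins, n=20, window=timedelta(hours=1)):
    """Return the time of the first login that is the n-th within `window`, else None."""
    times = sorted(logins)
    for i in range(n - 1, len(times)):
        # times[i - (n - 1)] plays the role of lag(login_time, n - 1).
        if times[i] - times[i - (n - 1)] < window:
            return times[i]
    return None

base = datetime(2018, 12, 1, 9, 0)
busy = [base + timedelta(minutes=2 * k) for k in range(20)]    # 20 logins within 38 minutes
quiet = [base + timedelta(minutes=10 * k) for k in range(20)]  # 20 logins spread over 190 minutes

print(first_burst(busy))   # time of the twentieth login
print(first_burst(quiet))  # None: never 20 logins inside one hour
```

Because the list is sorted, if the entry n - 1 positions back is inside the window, all entries in between are too, so one comparison per row is enough.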
I think you could do something like this (I'll use a login table with user and datetime as single columns for the sake of simplicity):
with connections as (
select ua.user
, ua.datetime
from user_logons ua
where ua.datetime >= timestamp'2018-12-01 00:00:00'
)
select ua.user
, ua.datetime
, (select count(*)
from connections ut
where ut.user = ua.user
and ut.datetime between ua.datetime and (ua.datetime + 1 hour)
) as consecutive_logons
from connections ua
It is up to you to complete with your columns (user, datetime)
It is up to you to find the date-add facility of your database (ua.datetime + 1 hour won't work as written); this depends on the DB implementation. For example, it is DATE_ADD in MySQL (https://www.w3schools.com/SQl/func_mysql_date_add.asp).
Because of the subquery (select count(*) ...), the whole query will not be the fastest: it is a correlated subquery, so it needs to be re-evaluated for each row.
The with is simply to compute a subset of user_logons to minimize its cost. This might not be useful, however this will lessen the complexity of the query.
You might have better performance using a stored function or a language driven (eg: java, php, ...) function.
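As an illustration of the language-driven alternative, the sketch below computes the largest number of logins in any rolling one-hour window for one user with a two-pointer scan over the sorted times, so each row is visited a constant number of times instead of being re-counted by a subquery. The function name max_logins_in_window is hypothetical, and the window is treated as strictly shorter than one hour, which is an assumption (the query above uses an inclusive BETWEEN).

```python
from datetime import datetime, timedelta

def max_logins_in_window(logins, window=timedelta(hours=1)):
    """Largest number of logins falling inside any span shorter than `window`."""
    times = sorted(logins)
    best = 0
    left = 0
    for right, current in enumerate(times):
        # Drop logins from the left until the span fits inside the window.
        while current - times[left] >= window:
            left += 1
        best = max(best, right - left + 1)
    return best

day = datetime(2018, 12, 1)
logins = [
    day.replace(hour=9, minute=0),
    day.replace(hour=9, minute=10),
    day.replace(hour=9, minute=50),
    day.replace(hour=10, minute=5),
    day.replace(hour=10, minute=30),
    day.replace(hour=12, minute=0),
]
print(max_logins_in_window(logins))
```

Users whose result exceeds the chosen threshold are the ones being looked for.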

SQL query to find the maximum hour of a particular event in a table

I have a single table with fields (crime-id int, crime_time timestamp, crime string, city string).
There are only 9 unique crimes in the table. I need to find the time, i.e. the hour, in which a particular crime occurred most frequently. E.g. if Robbery occurs most often between 10 and 11, it must show 10 or 11. The time may start at 00:00 and end at 23:59.
viod's answer is almost OK.
But you need a GROUP BY to count the robberies per time slot.
You also need to put an alias on the subquery.
SELECT period, max(nb)
FROM (
SELECT extract(hour from crime_time) as period, count(*) as nb
FROM crimes
WHERE crime_string = 'Robbery'
GROUP BY extract(hour from crime_time)
) as subquery_alias
GROUP BY period
This should do, but I have not tested it (and you may have to find the Hive equivalent of the Postgres function I use, extract; documentation is available here: http://www.postgresql.org/docs/9.1/static/functions-datetime.html).
SELECT max(nb), period
FROM (
SELECT count(*) as nb, period
FROM (
SELECT crime_string, extract(hour from crime_time) as period
FROM crimes
WHERE crime_string = 'Robbery'
)
GROUP BY period
);
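To reduce the per-hour counts to the single busiest hour, one option is to sort by the count and keep only the first row. The sketch below shows this in SQLite through Python's sqlite3 module, using strftime('%H', ...) in place of extract. The table layout follows the question (with the columns named crime_id and crime, which is an assumption), the rows are made up, and ties are broken by the earlier hour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE crimes (crime_id INTEGER, crime_time TEXT, crime TEXT, city TEXT);
-- Hypothetical sample rows.
INSERT INTO crimes VALUES
  (1, '2015-03-01 10:15:00', 'Robbery', 'A'),
  (2, '2015-03-02 10:40:00', 'Robbery', 'B'),
  (3, '2015-03-03 22:05:00', 'Robbery', 'A'),
  (4, '2015-03-04 10:59:00', 'Robbery', 'A'),
  (5, '2015-03-01 22:10:00', 'Theft',   'B'),
  (6, '2015-03-02 22:20:00', 'Theft',   'B'),
  (7, '2015-03-03 22:30:00', 'Theft',   'A'),
  (8, '2015-03-04 22:40:00', 'Theft',   'A');
""")

query = """
SELECT CAST(strftime('%H', crime_time) AS INTEGER) AS period,
       count(*) AS nb
FROM crimes
WHERE crime = 'Robbery'
GROUP BY period
ORDER BY nb DESC, period   -- busiest hour first; earlier hour wins a tie
LIMIT 1;
"""

peak = conn.execute(query).fetchone()
print(peak)
```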

Group by data intervals

I have a single table which stores bandwidth usage on the network over a period of time. One column will contain the date time (primary key) and another column will record the bandwidth. Data is recorded every minute. We will have other columns recording other data at that moment in time.
If the user requests the data on 15 minute intervals (within a 24 hour period given start and end date), is it possible with a single query to get the data I require or would I have to write a stored procedure/cursor to do this? Users may then request 5 minute intervals data etc.
I will most likely be using Postgres but are there other NOSQL options which would be better?
Any ideas?
WITH t AS (
SELECT ts, (random()*100)::int AS bandwidth
FROM generate_series('2012-09-01', '2012-09-04', '1 minute'::interval) ts
)
SELECT date_trunc('hour', ts) AS hour_stump
,(extract(minute FROM ts)::int / 15) AS min15_slot
,count(*) AS rows_in_timeslice -- optional
,sum(bandwidth) AS sum_bandwidth
FROM t
WHERE ts >= '2012-09-02 00:00:00+02'::timestamptz -- user's time range
AND ts < '2012-09-03 00:00:00+02'::timestamptz -- careful with borders
GROUP BY 1, 2
ORDER BY 1, 2;
The CTE t provides data like your table might hold: one timestamp ts per minute with a bandwidth number. (You don't need that part, you work with your table instead.)
Here is a very similar solution for a very similar question - with detailed explanation how this particular aggregation works:
date_trunc 5 minute interval in PostgreSQL
Here is a similar solution for a similar question concerning running sums - with detailed explanation and links for the various functions used:
PostgreSQL: running count of rows for a query 'by minute'
Additional question in comment
WITH -- same as above ...
SELECT DISTINCT ON (1,2)
date_trunc('hour', ts) AS hour_stump
,(extract(minute FROM ts)::int / 15) AS min15_slot
,bandwidth AS bandwith_sample_at_min15
FROM t
WHERE ts >= '2012-09-02 00:00:00+02'::timestamptz
AND ts < '2012-09-03 00:00:00+02'::timestamptz
ORDER BY 1, 2, ts DESC;
Retrieves one un-aggregated sample per 15 minute interval - from the last available row in the window. This will be the 15th minute if the row is not missing. Crucial parts are DISTINCT ON and ORDER BY.
More information about the used technique here:
Select first row in each GROUP BY group?
select
date_trunc('hour', d) +
(((extract(minute from d)::integer / 5 * 5)::text) || ' minute')::interval
as "from",
date_trunc('hour', d) +
((((extract(minute from d)::integer / 5 + 1) * 5)::text) || ' minute')::interval
- '1 second'::interval
as "to",
sum(random() * 1000) as bandwidth
from
generate_series('2012-01-01', '2012-01-31', '1 minute'::interval) s(d)
group by 1, 2
order by 1, 2
;
That is for 5-minute ranges. For 15-minute ranges, replace each 5 in the query with 15.
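The integer-division trick used in these queries (minute / N * N) can be sketched outside the database as well. The Python example below floors each timestamp to the start of its N-minute slot and sums the bandwidth per slot. The function names and sample data are hypothetical, and the approach assumes that N divides 60 evenly.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def floor_to_interval(ts, minutes):
    """Round ts down to the start of its `minutes`-wide slot within the hour."""
    return ts.replace(minute=ts.minute // minutes * minutes, second=0, microsecond=0)

def sum_by_interval(samples, minutes):
    """Sum bandwidth per slot for an iterable of (timestamp, bandwidth) pairs."""
    totals = defaultdict(int)
    for ts, bandwidth in samples:
        totals[floor_to_interval(ts, minutes)] += bandwidth
    return dict(sorted(totals.items()))

start = datetime(2012, 9, 2, 0, 0)
# One sample per minute for an hour, each with bandwidth 1.
samples = [(start + timedelta(minutes=i), 1) for i in range(60)]

by_15 = sum_by_interval(samples, 15)
by_5 = sum_by_interval(samples, 5)
for slot_start, total in by_15.items():
    print(slot_start, total)
```

Changing the interval is then just a parameter, which mirrors swapping 5 for 15 in the SQL.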