I have a table which is a list of games that have been played, in a sqlite3 database. The field "datetime" is the datetime of when the game ended. The field "duration" is the number of seconds the game lasted. I want to know what percent of the past 24 hours had at least 5 games running simultaneously. I figured out how to tell how many games were running at a given time:
select count(*)
from games
where strftime('%s',datetime)+0 >= 1257173442 and
strftime('%s',datetime)-duration <= 1257173442
If I had a table that was simply a list of every second (or every 30 seconds or something) I could do an intentional cartesian product like this:
select count(*)
from (
select count(*) as concurrent, d.second
from games g, date d
where strftime('%s',datetime)+0 >= d.second and
strftime('%s',datetime)-duration <= d.second and
d.second >= strftime('%s','now') - 24*60*60 and
d.second <= strftime('%s','now')
group by d.second) x
where concurrent >=5
Is there a way to create this date table on the fly? Or a way to get a similar effect without having to actually create a new table that is simply a list of all the seconds this week?
Thanks
First, I can't think of a way to approach your problem by creating a table on the fly or without the aid of an extra table. Sorry.
My suggestion is for you to rely on a static Numbers table.
Create a fixed table with the format:
CREATE TABLE Numbers (
number INTEGER PRIMARY KEY
);
Populate it with as many rows as there are seconds in 24h (24*60*60 = 86400). I would use any scripting language to do that, repeating this insert statement:
insert into numbers default values;
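Alternatively, a one-statement sketch, assuming a SQLite build new enough to support recursive CTEs (3.8.3+):
-- Hypothetical one-shot population of Numbers via a recursive CTE
WITH RECURSIVE seq(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM seq WHERE n < 86400
)
INSERT INTO Numbers (number)
SELECT n FROM seq;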
Now the Numbers table has the numbers 1 through 86400. Your query will then be modified to:
select count(*)
from (
select count(*) as concurrent, strftime('%s','now') - 86401 + n.number second
from games g, numbers n
where strftime('%s',datetime)+0 >= strftime('%s','now') - 86401 + n.number and
strftime('%s',datetime)-duration <= strftime('%s','now') - 86401 + n.number
group by second) x
where concurrent >=5
Without a procedural language in the mix, that is the best you'll be able to do, I think.
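As an aside, on SQLite builds that support recursive CTEs (3.8.3+) the seconds series can also be generated on the fly, so the static table isn't strictly needed. A rough sketch, sampling every 30 seconds to keep the row count down:
-- Sketch only: approximate the answer by checking one sample every 30 seconds
WITH RECURSIVE seconds(second) AS (
    SELECT strftime('%s','now') - 24*60*60
    UNION ALL
    SELECT second + 30 FROM seconds
    WHERE second + 30 <= strftime('%s','now') + 0
)
SELECT count(*) * 30 AS seconds_with_5_or_more  -- divide by 86400.0 for the fraction of the day
FROM (
    SELECT s.second, count(*) AS concurrent
    FROM seconds s
    JOIN games g
      ON strftime('%s', g.datetime) + 0 >= s.second
     AND strftime('%s', g.datetime) - g.duration <= s.second
    GROUP BY s.second
) x
WHERE concurrent >= 5;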
Great question!
Here's a query that I think gives you what you want without using a separate table. Note this is untested (so probably contains errors) and I've assumed datetime is an int column with # of seconds to avoid a ton of strftime's.
select sum(concurrent_period) from (
select min(end_table.datetime - begin_table.begin_time) as concurrent_period
from (
select g1.datetime, g1.num_end, count(*) as concurrent
from (
select datetime, count(*) as num_end
from games group by datetime
) g1, games g2
where g2.datetime >= g1.datetime and
g2.datetime-g2.duration < g1.datetime and
g1.datetime >= strftime('%s','now') - 24*60*60 and
g1.datetime <= strftime('%s','now')+0
) end_table, (
select g3.begin_time, g3.num_begin, count(*) as concurrent
from (
select datetime-duration as begin_time,
count(*) as num_begin
from games group by datetime-duration
) g3, games g4
where g4.datetime >= g3.begin_time and
g4.datetime-g4.duration < g3.begin_time and
g3.begin_time >= strftime('%s','now') - 24*60*60 and
g3.begin_time <= strftime('%s','now')+0
) begin_table
where end_table.datetime > begin_table.begin_time
and begin_table.concurrent < 5
and begin_table.concurrent+begin_table.num_begin >= 5
and end_table.concurrent >= 5
and end_table.concurrent-end_table.num_end < 5
group by begin_table.begin_time
) aah
The basic idea is to make two tables: one with the # of concurrent games at the begin time of each game, and one with the # of concurrent games at the end time. Then join the tables together and only take rows at "critical points" where # of concurrent games crosses 5. For each critical begin time, take the critical end time that happened soonest and that hopefully gives all the periods where at least 5 games were running concurrently.
Hope that's not too convoluted to be helpful!
Kevin rather beat me to the punchline there (+1), but I'll post this variation as it differs at least a little.
The key ideas are
Map the data into a stream of events with attributes time and 'polarity' (= start or end of game)
Keep a running total of how many games are open at the time of each event
(this is done by forming a self-join on the event stream)
Find the event times where the number of games (as Kevin says) transitions up to 5, or down to 4
A little trick: add up all the down-to-4 times and take away the up-to-5s - the order is not important
The result is the number of seconds spent with 5 or more games open
I don't have SQLite, so I've been testing with MySQL, and I've not bothered to limit the time window, to preserve some sanity. It shouldn't be difficult to revise.
Also, and more importantly, I've not considered what to do if games are open at the beginning or end of the period!
Something tells me there's a big simplification to be had here, but I've not spotted it yet.
SELECT SUM( event_time )
FROM (
SELECT -ga.event_type * ga.event_time AS event_time,
SUM( ga.event_type * gb.event_type ) event_type
FROM
( SELECT UNIX_TIMESTAMP( g1.endtime - g1.duration ) AS event_time
, 1 event_type
FROM games g1
UNION
SELECT UNIX_TIMESTAMP( g1.endtime )
, -1
FROM games g1 ) AS ga,
( SELECT UNIX_TIMESTAMP( g1.endtime - g1.duration ) AS event_time
, 1 event_type
FROM games g1
UNION
SELECT UNIX_TIMESTAMP( g1.endtime )
, -1
FROM games g1 ) AS gb
WHERE
ga.event_time >= gb.event_time
GROUP BY ga.event_time
HAVING SUM( ga.event_type * gb.event_type ) IN ( -4, 5 )
) AS gr
Why not trim the date and keep only the time? If you filter your data for any given date, every time is unique. That way you only need a table with the numbers from 1 to 86400 (or fewer if you take bigger intervals); you could create two columns, "from" and "to", to define the intervals.
I'm not familiar with SQLite functions, but according to the manual you have to use the strftime function with the format HH:MM:SS (i.e. '%H:%M:%S').
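A rough illustration of the idea; a sketch only, where the intervals table and its from_t/to_t columns are hypothetical, and games that cross midnight would need extra handling:
-- intervals(from_t, to_t): hypothetical table of 'HH:MM:SS' interval boundaries
SELECT i.from_t, i.to_t, count(*) AS games_running
FROM intervals i
JOIN games g
  ON strftime('%H:%M:%S', g.datetime, '-' || g.duration || ' seconds') <= i.to_t
 AND strftime('%H:%M:%S', g.datetime) >= i.from_t
WHERE date(g.datetime) = date('now')   -- filter to a single day so times are comparable
GROUP BY i.from_t, i.to_t;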
Background: I am running TeslaMate/Grafana for monitoring my car status; one of the gauges plots the battery level fetched from the database. My server is located remotely and runs in Docker on an old NAS, so both query performance and network overhead matter.
I found that the kiosk page frequently hangs; on investigation, it might be caused by the query -- two of the plots return 10~100k rows of results from the database. I want to limit the number of rows returned by the SQL queries, as the plots certainly don't have that much precision for drawing such detailed intervals.
I tried to follow this answer and use row_number() to keep only every 100th row of the results, but a more complicated issue turned up: the time intervals between rows are not consistent.
The car has 4 statuses: driving / online / asleep / offline.
If the car is in driving status, the time interval could be less than 200 ms, as the car pushes the status whenever it has new data.
If the car is in online status, the time interval could be several minutes, as the system actively fetches the status from the car.
Even worse, if the system thinks the car is going to sleep and needs to stop fetching status (to avoid preventing the car from sleeping), the interval could be up to 40 minutes, depending on settings.
If the car is in asleep/offline status, no data is recorded at all.
This obviously makes skipping every n-th row a bad idea: for cases 2-4 above, lots of data points might be missing, so Grafana cannot plot a correct graph representing the battery level at satisfactory precision.
I wonder if it's possible to skip rows by the time interval in a datetime field rather than by row_number(), without much overhead in the query? I.e., fetch only rows that are at least 1000 ms after the previously returned row.
E.g., I have following data in the table, I want the rows returned are row 1, 4 and 5.
row date
[1] 1610000001000
[2] 1610000001100
[3] 1610000001200
[4] 1610000002000
[5] 1610000005000
The current (problematic) method I am using is as follows:
SELECT $__time(t.date), t.battery_level AS "SOC [%]"
FROM (
SELECT date, battery_level, row_number() OVER(ORDER BY date ASC) AS row
FROM (
SELECT battery_level, date
FROM positions
WHERE car_id = $car_id AND $__timeFilter(date)
UNION ALL
SELECT battery_level, date
FROM charges c
JOIN charging_processes p ON p.id = c.charging_process_id
WHERE $__timeFilter(date) AND p.car_id = $car_id) AS data
ORDER BY date ASC) as t
WHERE t.row % 100 = 0;
This method clearly has the problem that it only returns every n-th row (e.g. alternate rows when the last line reads t.row % 2 = 0) regardless of how far apart the rows are in time, instead of what I wanted.
PS: please ignore the table structures and the UNION in the sample code; I haven't dug deep enough into the tables, and whatever other tweaks might be possible there are irrelevant to this question anyway.
Thanks in advance!
You can use a recursive CTE:
WITH RECURSIVE rec(cur_row, cur_date) AS (
(
SELECT row, date
FROM t
ORDER BY date
LIMIT 1
)
UNION ALL
(
SELECT row, date
FROM t
JOIN rec
ON t.date >= cur_date + 1000
ORDER BY t.date
LIMIT 1
)
)
SELECT *
FROM rec;
cur_row | cur_date
--------+---------------
      1 | 1610000001000
      4 | 1610000002000
      5 | 1610000005000
Using a function instead would probably be faster:
CREATE OR REPLACE FUNCTION f() RETURNS SETOF t AS
$$
DECLARE
row t%ROWTYPE;
cur_date BIGINT;
BEGIN
FOR row IN
SELECT *
FROM t
ORDER BY date
LOOP
IF row.date >= cur_date + 1000 OR cur_date IS NULL
THEN
cur_date := row.date;
RETURN NEXT row;
END IF;
END LOOP;
END;
$$ LANGUAGE plpgsql;
SELECT *
FROM f();
row | date
----+---------------
  1 | 1610000001000
  4 | 1610000002000
  5 | 1610000005000
I have one list of events. One event name is creating an account and another is creating an account with Facebook. I am trying to see what percentage of accounts created use Facebook.
The code below will give me an accurate count of the number of facebook accounts and total accounts, but when I try to divide the two numbers it just gives me the number 1.
I am very new to SQL, and have spent hours trying to figure out why it is doing that to no avail.
with
fb_act as (
select *
from raw_event
where name = 'onboard_fb_success'
and event_ts::date >= current_date - 30
),
total_act as (
select *
from raw_event
where name ='create_account'
and event_ts::date >= current_date - 30
)
select count(fb_act)/count(total_act), total_act.event_ts::date as day
from total_act, fb_act
group by day
order by day
I expect the output to be about ~.3, but the actual output is always exactly 1.
Your query always returns 1 because the cross join in "from total_act, fb_act" pairs every row of one CTE with every row of the other, so count(fb_act) and count(total_act) both count the same joined rows (and integer division would truncate the result anyway). Conditional aggregation is a much simpler way to write the query. You appear to be using Postgres, so something like this:
select re.event_ts::date as day,
       sum( (re.name = 'onboard_fb_success')::int )::numeric /
       nullif(sum( (re.name = 'create_account')::int ), 0) as ratio
from raw_event re
where re.event_ts::date >= current_date - 30
group by re.event_ts::date
order by day;
I'm looking through login logs (in Netezza) and trying to find users who have greater than a certain number of logins in any 1 hour time period (any consecutive 60 minute period, as opposed to strictly a clock hour) since December 1st. I've viewed the following posts, but most seem to address searching within a specific time range, not ANY given time period. Thanks.
https://dba.stackexchange.com/questions/137660/counting-number-of-occurences-in-a-time-period
https://dba.stackexchange.com/questions/67881/calculating-the-maximum-seen-so-far-for-each-point-in-time
Count records per hour within a time span
You could use the analytic function lag to look back in a sorted sequence of time stamps to see whether the record that came 19 entries earlier is within an hour difference:
with cte as (
select user_id,
login_time,
lag(login_time, 19) over (partition by user_id order by login_time) as lag_time
from userlog
order by user_id,
login_time
)
select user_id,
min(login_time) as login_time
from cte
where extract(epoch from (login_time - lag_time)) < 3600
group by user_id
The output will show the matching users with the first occurrence when they logged a twentieth time within an hour.
I think you might do something like this (for the sake of simplicity I'll use a login table with just user and datetime columns):
with connections as (
select ua.user
, ua.datetime
from user_logons ua
where ua.datetime >= timestamp'2018-12-01 00:00:00'
)
select ua.user
, ua.datetime
, (select count(*)
from connections ut
where ut.user = ua.user
and ut.datetime between ua.datetime and (ua.datetime + 1 hour)
) as consecutive_logons
from connections ua
It is up to you to adapt this to your columns (user, datetime).
It is up to you to find the date-add facilities (ua.datetime + 1 hour won't work as written); this is dependent on the DB implementation, for example MySQL uses DATE_ADD (https://www.w3schools.com/SQl/func_mysql_date_add.asp). See the sketch after these notes.
Due to the subquery (select count(*) ...), the whole query will not be the fastest, because it is a correlated subquery: it needs to be re-evaluated for each row.
The WITH is simply there to compute a subset of user_logons and minimize that cost. It might not help performance much, but it lessens the complexity of the query.
You might get better performance using a stored function or a function driven from a host language (e.g. Java, PHP, ...).
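For instance, with Netezza's PostgreSQL-style interval syntax (an assumption worth verifying on your version), the comparison in the subquery might read:
-- assumes Netezza accepts PostgreSQL-style interval literals
and ut.datetime between ua.datetime and ua.datetime + interval '1 hour'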
I'm at work and need to SUM a specific value by hour. Let's say a shift is 10 hours; I want to be able to loop and do the following pseudocode in PL/SQL:
For (int i=0; i<10; i++) {
SUM( DB.Sales) AS Hour#i
FROM
DB
WHERE DB.sale_Time BETWEEN
DB.Shift_Start_Time + i AND
DB.Shift_Start_Time + (i+1)
}
It should return a table with 11 columns: one for the shift start time, and ten others, each summing the sales of one hour of the 10-hour shift and named after the hour it represents.
Basically I want to use it for other purposes (my work has nothing to do with sales), where "i" can be as big as 1000, so I'm looking for a generic solution to my problem.
Again, I am using PLSQL.
Help will be very appreciated!
Thank you
You can do this using the analytic functions in Oracle. First, I aggregate the sales data by hour, since that seems to be the unit you want, and then sum up the hours using the analytic functions with a windowing clause:
select db.thehour,
       sum(db.sales) over (order by db.thehour
                           range between current row
                                     and numtodsinterval(<i>, 'HOUR') following) as sales
from (select trunc(sale_time, 'HH') as thehour, sum(sales) as sales
      from DB
      group by trunc(sale_time, 'HH')
     ) db
Note that this assumes that there are sales for every hour.
You can do this with a cursor, looping over each element. The key is to group by the hour you're interested in.
BEGIN
  FOR x IN (
    SELECT TRUNC( db.sale_time, 'HH' ) AS start_time,
           SUM( db.sales ) AS hourly_sales
    FROM db
    GROUP BY TRUNC( db.sale_time, 'HH' )
  ) LOOP
    /* Do loop stuff here, e.g. with x.start_time and x.hourly_sales */
    NULL;
  END LOOP;
END;
I eventually gave up on my original demands and found a nice and simple way to loop through the hours using:
SELECT
db.shift_start_time,
T1.n AS Hour,
SUM(db.sales) AS Hourly_Sales
FROM
db,
(SELECT n FROM
(SELECT rownum n FROM DUAL CONNECT BY LEVEL <=10) WHERE n > 0) T1
WHERE
db.sale_time BETWEEN db.shift_start_time+(T1.n - 1)/24 AND db.shift_start_time + (T1.n)/24
GROUP BY
db.shift_start_time,
T1.n
ORDER BY
db.shift_start_time,
T1.n
It also solved my problem where hours are not rounded. If a shift starts at 9:45 it will work as expected and not round it to 9 or 10.
The only issue I have with this solution is that there is a separate row for each hour; nevertheless, it's still the best solution yet.
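If the row-per-hour shape ever becomes a real problem, the same result can be pivoted into columns. A sketch, assuming Oracle 11g or later and a fixed, known list of hours; the hour_1..hour_3 column names are only illustrative:
SELECT *
FROM (
  SELECT db.shift_start_time,
         t1.n AS hr,
         db.sales
  FROM db,
       (SELECT LEVEL AS n FROM DUAL CONNECT BY LEVEL <= 10) t1
  WHERE db.sale_time BETWEEN db.shift_start_time + (t1.n - 1)/24
                         AND db.shift_start_time + t1.n/24
)
PIVOT (SUM(sales) FOR hr IN (1 AS hour_1, 2 AS hour_2, 3 AS hour_3))
ORDER BY shift_start_time;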
I am trying to determine if it is possible, using only SQL in Postgres, to select a range of time-ordered records at a given interval.
Let's say I have 60 records, one record for each minute in a given hour. I want to select records at 5 minute intervals for that hour. The result should be 12 records, each one 5 minutes apart.
This is currently accomplished by selecting the full range of records and then looping through the results and pulling out the records at the given interval. I am trying to see if I can do this purely in SQL, as our db is large and we may be dealing with tens of thousands of records.
Any thoughts?
Yes you can. It's really easy once you get the hang of it. I think it's one of the jewels of SQL, and it's especially easy in PostgreSQL because of its excellent temporal support. Often, complex functions turn into very simple queries in SQL that can scale and be indexed properly.
This uses generate_series to draw up sample time stamps that are spaced 1 minute apart. The outer query then extracts the minute and uses modulo to find the values that are 5 minutes apart.
select
ts,
extract(minute from ts)::integer as minute
from
( -- generate some time stamps - one minute apart
select
current_time + (n || ' minute')::interval as ts
from generate_series(1, 30) as n
) as timestamps
-- extract the minute check if its on a 5 minute interval
where extract(minute from ts)::integer % 5 = 0
-- only pick this hour
and extract(hour from ts) = extract(hour from current_time)
;
ts | minute
--------------------+--------
19:40:53.508836-07 | 40
19:45:53.508836-07 | 45
19:50:53.508836-07 | 50
19:55:53.508836-07 | 55
Notice that a computed (expression) index on the WHERE-clause expression (where the value of the expression makes up the index) could lead to major speed improvements. Maybe not very selective in this case, but good to be aware of.
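For instance, a hypothetical expression index along those lines; this assumes a table named readings with a ts column of type timestamp without time zone (extract() on timestamptz is not immutable, so it cannot be indexed directly):
-- hypothetical names; the doubled parentheses around the expression are required
CREATE INDEX readings_minute_mod5_idx
    ON readings ( (extract(minute from ts)::integer % 5) );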
I wrote a reservation system once in PostgreSQL (which had lots of temporal logic where date intervals could not overlap) and never had to resort to iterative methods.
http://www.amazon.com/SQL-Design-Patterns-Programming-Focus/dp/0977671542 is an excellent book that goes has lots of interval examples. Hard to find in book stores now but well worth it.
Extract the minutes, convert to int4, and see if the remainder from dividing by 5 is 0:
select *
from TABLE
where int4 (date_part ('minute', COLUMN)) % 5 = 0;
If the intervals are not time based and you just want every 5th row, or if the times are regular and you always have one record per minute, the query below gives you one record per every 5 rows:
select *
from
(
select *, row_number() over (order by timecolumn) as rown
from tbl
) X
where mod(rown, 5) = 1
If your time records are not regular, then you need to generate a time series (given in another answer) and left join that into your table, group by the time column (from the series) and pick the MAX time from your table that is less than the time column.
Pseudo
select thetimeinterval, max(timecolumn)
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
And further join it back to the table for the full record (assuming unique times)
select t.* from
tbl inner join
(
select thetimeinterval, max(timecolumn) timecolumn
from ( < the time series subquery > ) X
left join tbl on tbl.timecolumn <= thetimeinterval
group by thetimeinterval
) y on tbl.timecolumn = y.timecolumn
How about this:
select min(ts), extract(minute from ts)::integer / 5 as bucket
from your_table
group by bucket
order by bucket;
This has the advantage of doing the right thing if you have two readings for the same minute, or your readings skip a minute. Instead of using min even better would be to use one of the the first() aggregate functions-- code for which you can find here:
http://wiki.postgresql.org/wiki/First_%28aggregate%29
This assumes that your five minute intervals are "on the fives", so to speak. That is, that you want 07:00, 07:05, 07:10, not 07:02, 07:07, 07:12. It also assumes you don't have two rows within the same minute, which might not be a safe assumption.
select your_timestamp
from your_table
where cast(extract(minute from your_timestamp) as integer) in (0,5);
If you might have two rows with timestamps within the same minute, like
2011-01-01 07:00:02
2011-01-01 07:00:59
then this version is safer.
select min(your_timestamp)
from your_table
group by (cast(extract(minute from your_timestamp) as integer) / 5)
Wrap either of those in a view, and you can join it to your base table.