Time averaging non-continuous data with PostgreSQL 9.2

I have multiple datasets of real data at 1-second time resolution. This data will often have gaps in the time series where the instrument dropped data or was turned off, resulting in a patchy (albeit still very useful) dataset. The resulting data might look like the following:
Timestamp [timestamp] : datastream1 [double precision] : datastream2 [double precision] : etc
2011-01-01 00:00:01 153.256 1255.325
2011-01-01 00:00:02 152.954 1254.288
2011-01-01 00:00:03 151.738 1248.951
2011-01-01 00:00:04 150.015 1249.185
2011-01-01 00:10:08 179.132 1328.115
2011-01-01 00:10:09 178.051 1323.125
2011-01-01 00:10:10 180.870 1336.983
2011-01-04 09:19:02 152.198 1462.814
2011-01-04 09:19:03 158.014 1458.122
2011-01-04 09:19:04 156.070 1464.174
Please note: this data is generally continuous but will have random gaps which must be dealt with.
I need to write code to take the average and stdev of a given time interval, "timeInt", that is able to deal with these gaps. For example, if I wanted a 10 min average of data, my required output would be:
Timestamp_10min : avg_data1 : stdev_data1 : count_data1
where avg_data1 would be the average of all the data points within a given 10 minute period, and count_data1 would be the number of points used in the calculation of that average (i.e. 600 if there was no missing data, 300 if every second point is missing, etc etc).
This code needs to work with any desired input interval (i.e. x minutes, y days, z weeks, months, years, etc).
Currently I am only able to output one-minute averages, using the following code:
CREATE OR REPLACE VIEW "DATATABLE_MIN" AS
SELECT MIN("DATATABLE"."Timestamp") AS "Timestamp_min",
       avg("DATATABLE"."datastream1") AS "datastream1_avg_min",
       stddev("DATATABLE"."datastream1") AS "datastream1_stdev_min",
       count("DATATABLE"."datastream1") AS "datastream1_count_min"
FROM "DATATABLE"
GROUP BY to_char("DATATABLE"."Timestamp", 'YYYY-MM-DD HH24:MI'::text);
Thanks in advance for the help!

To group by 10 minutes, you can use the "epoch":
SELECT MIN(dt."Timestamp") AS "Timestamp_min",
       avg(dt."datastream1") AS "datastream1_avg_min",
       stddev(dt."datastream1") AS "datastream1_stdev_min",
       count(dt."datastream1") AS "datastream1_count_min"
FROM "DATATABLE" dt
GROUP BY trunc(extract(epoch from dt."Timestamp") / (60*10));
The epoch is the number of seconds since a fixed time in the past (1970-01-01 00:00:00 UTC). If you divide it by 600 and truncate, you get the index of the 10-minute interval each row falls into, which is what you need for the aggregation.
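The same idea generalizes to any fixed-length interval by parameterizing the bucket length in seconds. A minimal sketch (the 3600-second bucket, i.e. hourly, is just an illustrative placeholder):
-- Replace 3600 with the desired bucket length in seconds
-- (600 = 10 minutes, 86400 = 1 day, 604800 = 1 week, etc.).
SELECT MIN(dt."Timestamp") AS "Timestamp_min",
       avg(dt."datastream1") AS "datastream1_avg",
       stddev(dt."datastream1") AS "datastream1_stdev",
       count(dt."datastream1") AS "datastream1_count"
FROM "DATATABLE" dt
GROUP BY trunc(extract(epoch from dt."Timestamp") / 3600)
ORDER BY 1;
Note that fixed-length buckets cover seconds through weeks; months and years vary in length, so for those it is more natural to group by date_trunc('month', ...) or date_trunc('year', ...) instead.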


Select unique IDs and divide result into X minute intervals based on given timespan

I'm trying to knock some dust off my good old SQL skills, but I'm afraid I need a push in the right direction to turn those dusty skills into something useful for BigQuery statements.
I'm currently working with a single table schema looking like this:
In the query I would like to be able to supply the following in my where clause:
The date of which I would like the results to stem from.
A time range - in the above result example this range would be from 20:00 to 21:00. If 1. and 2. in this list should be merged together that's also fine.
The eventId I would like to find records for.
Optionally, to be able to determine the interval frequency, i.e. whether results should be divided into 5-, 10-, or 15-minute intervals.
Also I would like to count the unique userIds for each interval. If one user is present during the entire session he/she should be taken into the count in every interval.
So think of it as the following:
How many unique users did we have every 5 minutes at X event, between 20:00 and 21:00 at Y day?
How should my query look if I want a result looking (something) like the following pseudo result:
time_interval number_of_unique_userIds
1 2022-03-16 20:00:00 10
2 2022-03-16 20:05:00 12
3 2022-03-16 20:10:00 15
4 2022-03-16 20:15:00 20
5 2022-03-16 20:20:00 30
6 ... etc.
If the time of the query is before the provided end time in the timespan, it should fill out the rest of the interval rows with 0 unique userIds.
In the following result we've executed mentioned query earlier than the provided end date - let's say that it's executed at 20:49:
time_interval number_of_unique_userIds
X 2022-03-16 20:50:00 0
X 2022-03-16 20:55:00 0
X 2022-03-16 21:00:00 0
Here's what I have so far, but it gives me several of the same interval records with what looks like each userId:
SELECT
TIMESTAMP_SECONDS(5*60 * DIV(UNIX_SECONDS(creationTime), 5*60)) time_interval,
COUNT(DISTINCT(userId)) number_of_unique_userIds
FROM `bigquery.table`
WHERE eventId = 'xyz'
AND creationTime > '2022-03-16 20:00:00' AND creationTime < '2022-03-16 21:00:00'
GROUP BY time_interval
ORDER BY time_interval DESC
This gives me somewhat what I expect, but I think the number_of_unique_userIds seems too low, so I'm a little worried that I'm not getting unique userIds for each interval. What I'm thinking is that userIds counted in the first 5-minute interval are not counted in the next, so I'm not sure this query is sufficient for my needs. Also, it's not filling the blanks with 0 number_of_unique_userIds.
I hope you can help me out here.
Thanks!
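One way to get both the per-interval counts and the zero-filled rows (a sketch, assuming the columns implied by the query above; the CTE names buckets and counts are illustrative): generate every 5-minute bucket in the window with GENERATE_TIMESTAMP_ARRAY, then LEFT JOIN the per-bucket distinct-user counts onto it so empty buckets come out as 0.
WITH buckets AS (
  -- one row per 5-minute interval start in the requested window
  SELECT ts AS time_interval
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY(TIMESTAMP '2022-03-16 20:00:00',
                                       TIMESTAMP '2022-03-16 21:00:00',
                                       INTERVAL 5 MINUTE)) AS ts
),
counts AS (
  -- distinct users per populated bucket, as in the original query
  SELECT TIMESTAMP_SECONDS(5*60 * DIV(UNIX_SECONDS(creationTime), 5*60)) AS time_interval,
         COUNT(DISTINCT userId) AS number_of_unique_userIds
  FROM `bigquery.table`
  WHERE eventId = 'xyz'
    AND creationTime >= '2022-03-16 20:00:00'
    AND creationTime <  '2022-03-16 21:00:00'
  GROUP BY time_interval
)
SELECT b.time_interval,
       IFNULL(c.number_of_unique_userIds, 0) AS number_of_unique_userIds
FROM buckets b
LEFT JOIN counts c USING (time_interval)
ORDER BY b.time_interval;
Note that COUNT(DISTINCT userId) is evaluated per group, so a user who appears in several buckets is counted once in each of them; a user is only missed in a bucket if they have no row there at all.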

Trying to learn SQL Aggregations and Sub-Queries

I am trying to improve my query writing and need help with the following...
I have one table with multiple columns, including Operation_Code, Operation_Category, Downtime_In_Minutes, and Downtime (as a percentage of the last 24 hours). Each line of my result set needs the SUM of Downtime_In_Minutes for each Operation_Code and the count of each occurrence of the Operation_Code. StopDate will always be yesterday; date functions and formatting return yesterday's date. This is not presented in the query below due to the length of the code, but it works. So, each line in the results should look like:
StopDate
Operation_Code
Operation_Category
Count (# of occurrences of each Op_Code)
SUM (in minutes) of all downtime for each Operation_Code
% of Last 24 hours
Example Results:
StopDate   Op_Code  OP_Category  Count  Downtime (Minutes)  % of Last 24
7/18/2021  X123     Grinder      10     720                 50%
7/18/2021  A800     Cutter       12     360                 25%
7/18/2021  O225     Polisher     5      60                  4%
My query without attempting any aggregations is basically:
Select StopDate,
OpCode,
OpCat
From DTS
Where StopDate = yesterday
The basic question is: how do I SUM the count of occurrences and SUM the total time in minutes for each unique Operation_Code?
Thanks in advance!
Are you just looking for aggregation? Then you can use a window function to get the ratio, which I am guessing is based on the downtime:
Select StopDate, OpCode, OpCat, count(*) as cnt,
       sum(Downtime_In_Minutes) as downtime_minutes,
       sum(Downtime_In_Minutes) * 1.0 / nullif(sum(sum(Downtime_In_Minutes)) over (), 0) as downtime_ratio
From DTS
Where StopDate = yesterday
Group by StopDate, OpCode, OpCat;
I assume you know how to deal with "yesterday", because you say that you already have a query.
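As a side note, the example results divide by the minutes in a day rather than by total downtime: 720 / 1440 = 50% and 360 / 1440 = 25%. If that is the intended denominator, a fixed divisor is simpler; a sketch:
Select StopDate, OpCode, OpCat, count(*) as cnt,
       sum(Downtime_In_Minutes) as downtime_minutes,
       -- 1440 = minutes in the 24-hour window implied by the example
       sum(Downtime_In_Minutes) * 100.0 / 1440 as pct_of_last_24
From DTS
Where StopDate = yesterday
Group by StopDate, OpCode, OpCat;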

Exasol Extract from Timestamp Hours AND Minutes

I have an Exasol database with Login values of datatype TIMESTAMP like:
2015-10-01 13:00:34.0
2015-11-02 13:10:10.0
2015-10-06 13:20:03.0
2016-02-01 14:15:34.0
2016-04-03 14:08:10.0
2016-07-01 11:05:07.0
2016-09-03 10:08:12.0
2016-11-15 09:03:30.0
and many many more. I want to do a SQL query where I get
logins from 09:00:00 to 09:15:00, logins from 09:15:00 to 09:30:00, and so on in separate tables (no matter what date it is). I already had success with selecting on a 1 hour interval with:
...EXTRACT(HOUR FROM entryTime) BETWEEN 8 and 8
That way I get entries of my database (no matter what date it is) within 1 hour, but I need smaller intervals, like 09:00:00 - 09:15:00. Any ideas how to solve this in Exasol?
You can simply convert the time part of your timestamp to a string and do a BETWEEN, something like:
WHERE to_char(entryTime, 'HH24MI') BETWEEN '0900' AND '0915'
If you want to use extract and numeric values, I suggest this:
WHERE (EXTRACT(HOUR FROM entryTime) * 100) + EXTRACT(MINUTE FROM entryTime)
BETWEEN 900 and 915
I'm not in front of my computer now, but this (or something pretty similar) should work.
But I suspect that in both cases EXASOL will create an expression index for the first part of the WHERE clause. Because, I guess, you use EXASOL because you have a huge amount of data and want fast performance, my suggestion is to have an additional column in your table where you store the time part of entryTime as a numeric value; that will allow a proper index and give you better performance.
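A minimal sketch of that suggestion, assuming the table from the asker's follow-up (MY_SCHEMA.EXA_LOCAL) and a hypothetical column name entry_hhmi:
-- entry_hhmi is a hypothetical column holding the time part as HHMM
ALTER TABLE MY_SCHEMA.EXA_LOCAL ADD COLUMN entry_hhmi DECIMAL(4,0);
UPDATE MY_SCHEMA.EXA_LOCAL
   SET entry_hhmi = EXTRACT(HOUR FROM entryTime) * 100
                  + EXTRACT(MINUTE FROM entryTime);
-- Filters can then hit the stored value directly:
SELECT * FROM MY_SCHEMA.EXA_LOCAL WHERE entry_hhmi BETWEEN 900 AND 915;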
I found a workaround. The solution is to nest the SQL statements. In the first step you select on the hours, and around that SELECT statement you wrap another one where you specify the minutes.
SELECT * FROM
(SELECT * FROM MY_SCHEMA.EXA_LOCAL WHERE EXTRACT(HOUR FROM TIMESTMP) BETWEEN 9 and 9)
where EXTRACT(MINUTE FROM TIMESTMP) BETWEEN 0 and 15;

Leaking Bucket with SQL

Is there a way to have an SQL query output the rows in a leaking bucket fashion?
Given a bunch of rows (could be a few, could be a lot), each with a created_at column, the rows will be retrieved incrementally every X seconds, ordered by their created_at, starting from some fixed time (like the beginning of the day).
So if these are the rows (created_at should be a datetime type, numbers here are to simplify the example)
food, created_at
apple, 1
orange, 2
banana, 3
meat, 4
broccoli, 5
tomato, 6
and X is 60 seconds and the starting time is now, then when the query is first executed it will only return apple. After a minute it will return apple and orange. After 2 minutes it will return apple, orange, banana, and so forth.
The idea behind all of this is to gradually release rows instead of everything at once, globally.
A more concrete example would be articles edited today on Wikipedia, or pictures posted by your friends. You can consume everything at once, but I would rather do it incrementally over a fixed/user-specified time. If there is a better way to do this, I'd like to know.
Assuming we start at 2015-09-27 16:00:00, and the data in the table looks like this:
food     | created_at
---------+--------------------
apple    | 2015-09-27 16:00:00
orange   | 2015-09-27 16:01:00
banana   | 2015-09-27 16:02:00
meat     | 2015-09-27 16:03:00
broccoli | 2015-09-27 16:04:00
tomato   | 2015-09-27 16:05:00
Then running this
select *
from data
where created_at < timestamp '2015-09-27 16:00:00' + ((interval '1' minute) * X);
Gets you the first row with X=1 (first run), the first and second row with X=2 (second run) and so on.
If you increase the parameter X each time you call the query, this should do what you want.
Theoretically you can calculate X as current_timestamp - "starting time" (e.g. current_timestamp - timestamp '2015-09-27 16:00:00') and then truncate the resulting interval to the unit you are interested in (using date_trunc()).
Note that current_timestamp gives you the current time at the start of the transaction. So unless you are running in auto commit mode, you probably want to use clock_timestamp() instead.
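Putting those pieces together, here is a sketch that derives X inside the query itself (same table and start time as above; the one-minute unit follows the example):
-- Derive elapsed minutes since the start, then release one more row
-- per elapsed minute (X = 1 on the first run, 2 after a minute, ...).
select *
from data
where created_at < timestamp '2015-09-27 16:00:00'
    + interval '1 minute'
      * (1 + floor(extract(epoch from clock_timestamp()
                           - timestamp '2015-09-27 16:00:00') / 60));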

How can I select one row of data per hour, from a table of time stamps?

Excuse me if this is confusing, as I am not very familiar with postgresql. I have a postgres database with a table full of "sites". Each site reports about once an hour, and when it reports, it makes an entry in this table, like so:
site | tstamp
-----+--------------------
6000 | 2013-05-09 11:53:04
6444 | 2013-05-09 12:58:00
6444 | 2013-05-09 13:01:08
6000 | 2013-05-09 13:01:32
6000 | 2013-05-09 14:05:06
6444 | 2013-05-09 14:06:25
6444 | 2013-05-09 14:59:58
6000 | 2013-05-09 19:00:07
As you can see, the time stamps are almost never on-the-nose, and sometimes there will be 2 or more within only a few minutes/seconds of each other. Furthermore, some sites won't report for hours at a time (on occasion). I want to only select one entry per site, per hour (as close to each hour as I can get). How can I go about doing this in an efficient way? I also will need to extend this to other time frames (like one entry per site per day -- as close to midnight as possible).
Thank you for any and all suggestions.
You could use DISTINCT ON:
select distinct on (site, date_trunc('hour', tstamp)) site, tstamp
from t
order by site, date_trunc('hour', tstamp), tstamp
Be careful with the ORDER BY if you care about which entry you get.
Alternatively, you could use the row_number window function to mark the rows of interest and then peel off the first result in each group from a derived table:
select site, tstamp
from (
select site, tstamp,
row_number() over (partition by site, date_trunc('hour', tstamp) order by tstamp) as r
from t
) as dt
where r = 1
Again, you'd adjust the ORDER BY to select the specific row of interest for each date.
You are looking for the closest value per hour. Some are before the hour and some are after. That makes this a hardish problem.
First, we need to identify the range of values that work for a particular hour. For this, I'll consider anything from 15 minutes before the hour to 45 minutes after as being for that hour. So, the period of consideration for 2:00 goes from 1:45 to 2:45 (arbitrary, but seems reasonable for your data). We can do this by shifting the time stamps by 15 minutes.
Second, we need to get the closest value to the hour, so we prefer 1:57 to 2:05. We can do this by ordering on least(minute, 60 - minute): for 1:57 the distance is least(57, 60 - 57) = 3, and for 2:05 it is least(5, 60 - 5) = 5, so 1:57 wins.
We can put these rules into a SQL statement, using row_number():
select site, tstamp, usedTimestamp
from (select site, tstamp,
             date_trunc('hour', tstamp + interval '15 minutes') as usedTimestamp,
             row_number() over (partition by site, to_char(tstamp + interval '15 minutes', 'YYYY-MM-DD-HH24')
                                order by least(extract(minute from tstamp), 60 - extract(minute from tstamp))
                               ) as seqnum
      from t
     ) as dt
where seqnum = 1;
For the extensibility aspect of your question.
I also will need to extend this to other time frames (like one entry per site per day)
From the distinct set of site ids, and using a (recursive) CTE, I would build a set comprising one entry per site per hour (or other specified interval), within a specified StartDateTime to EndDateTime range.
SITE | DATE-TIME-HOUR
6000 | 12.1.2013 00:00:00
6000 | 12.1.2013 01:00:00
...
6000 | 12.1.2013 24:00:00
7000 | 12.1.2013 00:00:00
7000 | 12.1.2013 01:00:00
...
7000 | 12.1.2013 24:00:00
Then I would left join that CTE against your SITES log on site id and on the min absolute difference between the CTE point-in-time and the LOG's point-in-time.
That way you are assured of a row for each site per interval.
P.S. For a site that has not phoned home for a long time, its most recent phone-in timestamp will be repeated multiple times as the closest one available.
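In PostgreSQL that grid can be built with generate_series rather than a recursive CTE; a sketch (table and column names follow the question, the date range is an illustrative placeholder, and the LATERAL join requires PostgreSQL 9.3+):
-- One grid row per site per hour, each joined to the log row whose
-- timestamp is nearest to that point in time.
select g.site, g.hour, l.tstamp
from (select s.site, h.hour
      from (select distinct site from t) s
      cross join generate_series(timestamp '2013-05-09 00:00:00',
                                 timestamp '2013-05-09 23:00:00',
                                 interval '1 hour') as h(hour)
     ) g
left join lateral (
      select tstamp
      from t
      where t.site = g.site
      order by abs(extract(epoch from t.tstamp - g.hour))
      limit 1
     ) l on true
order by g.site, g.hour;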