Django aggregation to lower resolution using grouping by a date range - SQL

Horrible title, but let me explain: I've got this Django model containing a timestamp (date) and the attribute to log (value), e.g. the number of users consuming some resource.
class Viewers(models.Model):
    date = models.DateTimeField()
    value = models.IntegerField()
Every 10 seconds the table contains the number of users, something like this:
| date | value |
|------|-------|
| t1 | 15 |
| t2 | 18 |
| t3 | 27 |
| t4 | 25 |
| .. | .. |
| t30 | 38 |
| t31 | 36 |
| .. | .. |
Now I want to generate different statistics from this data, each with a different resolution. E.g. for a chart of the last day I don't need the 10-second resolution, so I want 5-minute steps (built by averaging the values, and maybe also the dates, of the rows from t1 to t29, t30 to t59, ...), so that I'll get:
| date | value |
|------|-------|
| t15 | 21 |
| t45 | 32 |
| .. | .. |
The parameters to keep variable are the start and end timestamps and the resolution (like 5 minutes). Is there a way to do this using the Django ORM/QuerySet API, and if not, how can I achieve it with custom SQL?

I've been trying to solve this problem in the most 'django' way possible. I've settled on the following. It averages the values over 15-minute time slots between start_date and end_date, where the column name is 'date':
from django.db.models import Avg

readings = Reading.objects.filter(date__range=(start_date, end_date)) \
    .extra(select={'date_slice': "FLOOR (EXTRACT (EPOCH FROM date) / '900' )"}) \
    .values('date_slice') \
    .annotate(value_avg=Avg('value'))
It returns a queryset of dictionaries:
{'value_avg': 1116.4925373134329, 'date_slice': 1546512.0}
{'value_avg': 1001.2028985507246, 'date_slice': 1546513.0}
{'value_avg': 1180.6285714285714, 'date_slice': 1546514.0}
The core of the idea comes from this answer to the same question for PHP/SQL. The code passed to extra is for a Postgres DB.

from django.db.models import Avg
Viewers.objects.filter(date__range=(start_time, end_time)).aggregate(average=Avg('value'))
That will get you the average of all the values between start_time and end_time, returned as a dictionary in the form of { 'average': <the average> }.
start_time and end_time need to be Python datetime objects. So if you have a timestamp, or something, you'll need to convert it first. You can also use datetime.timedelta to calculate the end_time based on the start_time. For a five minute resolution, something like this:
from datetime import timedelta
end_time = start_time + timedelta(minutes=5)
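To get the per-slot series the question asks for (rather than one overall average), the simplest extension of this answer is to loop over the window and run one aggregate per slot. A minimal sketch, assuming the Viewers model from the question:

from datetime import timedelta
from django.db.models import Avg

def slot_averages(start_time, end_time, resolution=timedelta(minutes=5)):
    """Yield (slot_start, average value) pairs, one query per slot."""
    slot_start = start_time
    while slot_start < end_time:
        slot_end = min(slot_start + resolution, end_time)
        row = Viewers.objects.filter(
            date__range=(slot_start, slot_end)   # note: __range is inclusive on both ends
        ).aggregate(average=Avg('value'))
        yield slot_start, row['average']         # None if the slot has no samples
        slot_start = slot_end

This is easy to read but issues one query per slot; the single-query groupings shown in the other answers avoid the extra round trips.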

Have you looked at the range filter?
https://docs.djangoproject.com/en/dev/ref/models/querysets/#range
The example given in the docs seems similar to your situation.

Slightly improving upon the answer by @Richard Corden, in PostgreSQL you can do:
from django.db import models
from django.db.models import Avg, F, Func
from django.db.models.functions import Extract, Floor

def for_interval(self, start=None, end=None, interval=60):
    # (Check start and end values...)
    return self \
        .filter(timestamp__range=(start, end)) \
        .annotate(
            unix_timestamp=Floor(Extract('timestamp', 'epoch') / interval) * interval,
            time=Func(F('unix_timestamp'), function="TO_TIMESTAMP", output_field=models.DateTimeField()),
        ) \
        .values('time') \
        .annotate(value=Avg('value')) \
        .order_by('time')
I would also recommend storing the floor of the interval rather than its midpoint.
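For completeness, a rough sketch of how this could be wired up; the queryset class, model fields, and manager below are assumptions rather than part of the original answer (note the field is called timestamp here, where the question's model used date):

from django.db import models

class ViewersQuerySet(models.QuerySet):
    for_interval = for_interval   # the queryset method defined above

class Viewers(models.Model):
    timestamp = models.DateTimeField()
    value = models.IntegerField()

    objects = ViewersQuerySet.as_manager()

# 5-minute buckets between two datetimes:
rows = Viewers.objects.for_interval(start_time, end_time, interval=300)
# -> [{'time': datetime(...), 'value': 21.0}, {'time': datetime(...), 'value': 32.0}, ...]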

After a lot of trying I made it work as a SQL statement:
SELECT FROM_UNIXTIME(AVG(UNIX_TIMESTAMP(date))), SUM(value)
FROM `my_table`
WHERE date BETWEEN SUBTIME(NOW( ), '0:30:00') AND NOW()
GROUP BY UNIX_TIMESTAMP(date) DIV 300
ORDER BY date DESC
with
start_time = SUBTIME(NOW( ), '0:30:00')
end_time = NOW()
period = 300 # in seconds
In the end it was not really hard, and it is indeed independent of the time resolution of the samples in the origin table.
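If you want to run this from Django rather than the MySQL console, here is a minimal sketch using a raw cursor (the function name and parameter handling are mine, not part of the original answer):

from django.db import connection

def downsample(start_time, end_time, period=300):
    """Average timestamp and summed value per `period`-second bucket (MySQL)."""
    sql = """
        SELECT FROM_UNIXTIME(AVG(UNIX_TIMESTAMP(date))) AS date, SUM(value) AS value
        FROM my_table
        WHERE date BETWEEN %s AND %s
        GROUP BY UNIX_TIMESTAMP(date) DIV %s
        ORDER BY date DESC
    """
    with connection.cursor() as cursor:
        cursor.execute(sql, [start_time, end_time, period])
        return cursor.fetchall()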

Related

How to limit TimescaleDB queries to the used bucket size?

I have a Postgres TimescaleDB database with time series data.
The data in table flows was sampled roughly every 500ms.
I need to get the data for every 1 second.
I tried to do it with time_bucket() function.
This was my test query:
SELECT time_bucket('1 second', time) AS bucket, value AS val
FROM flows fl
WHERE
fl.time > '2021-08-31 06:14:00+00' AND
fl.time <= '2021-08-31 06:18:00+00' AND
fl.sensor_id = 2
ORDER BY fl.time ASC;
The returned data looks as follows:
|bucket |val |
| ---------------------- | ------------------- |
| 2021-08-31 06:14:00+00 | 9.75071040883207 |
| 2021-08-31 06:14:00+00 | 10.008532745208633 |
| 2021-08-31 06:14:01+00 | 9.953632354528265 |
| 2021-08-31 06:14:01+00 | 9.833033340905137 |
| 2021-08-31 06:14:02+00 | 9.77205680132453 |
| 2021-08-31 06:14:02+00 | 10.197350449765523 |
| ... | ... |
As you can see, there are two rows for each bucket of one second. Values are coming from the samples that were collected every 500ms.
How to make sure there is only one value per bucket?
(In my case: One value every second)
I also tried an aggregation function (avg) on value, but that did not change the result.
For time_bucket functions, in order to get the bucketing to work correctly, you will have to aggregate the value column in some way and provide a GROUP BY statement. For example, something like this should correctly bucket the time:
SELECT time_bucket('1 second', time) AS bucket,
sum(value) AS val
FROM flows fl
WHERE
time_bucket('1 second', time) > '2021-08-31 06:14:00+00' AND
time_bucket('1 second', time) <= '2021-08-31 06:18:00+00' AND
fl.sensor_id = 2
GROUP BY bucket, sensor_id
ORDER BY bucket ASC;
Hopefully this works for you!
disclosure: I am a part of the Timescale team 😊
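Since the question was actually after the average rather than the sum, swapping avg() in works the same way once the GROUP BY is in place. A minimal sketch run from Python; psycopg2 and the connection string here are assumptions, not part of the answer:

import psycopg2  # assumption: psycopg2 as the driver

BUCKETED = """
    SELECT time_bucket('1 second', time) AS bucket, avg(value) AS val
    FROM flows
    WHERE time > %s AND time <= %s AND sensor_id = %s
    GROUP BY bucket
    ORDER BY bucket;
"""

with psycopg2.connect("dbname=metrics") as conn, conn.cursor() as cur:  # hypothetical DSN
    cur.execute(BUCKETED, ("2021-08-31 06:14:00+00", "2021-08-31 06:18:00+00", 2))
    rows = cur.fetchall()  # one (bucket, average) row per second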

Storing Dates In SQL: Having Some Trouble

I have a bunch of data sitting in a Postgres database for a website I am building. Problem is, I don't really know what I should do with the date information. For example, I have a table called events that has a date column that stores date information as a string until I can figure out what to do with it.
The reason they are in string format is that, unfortunately for the topic of my website, there is not a good API, so I had to scrape the data. Here's what some of the data looks like inside the column:
| Date |
|-------------------------------------|
| Friday 07.30.2021 at 08:30 AM ET |
| Wednesday 04.07.2021 at 10:00 PM ET |
| Saturday 03.27.2010 |
| Monday 01.11.2010 |
| Saturday 02.09.2019 at 05:00 PM ET |
| Wednesday 03.31.2010 |
It would have been nice to have every row have the time with it, but a lot of entries don't. I have no problem doing some string manipulation on the data to get them into a certain format where they can be turned into a date, I am just somewhat stumped on what I should do next.
What would you do in this situation if you were restricted to the data seen in the events table? Would you store it as UTC? How would you handle the dates without a time? Would you give up and just display everything as EST dates regardless of where the user lives (lol)?
It would be nice to use these dates to display correctly for anyone anywhere in the world, but it looks like I might be pigeonholed because of the dates that don't have a time associated with them.
Converting your mishmash of free-form-looking dates to a standard timestamp is not as daunting as it seems. Your samples indicate you have a string with up to 6 separate pieces of information: day name, date (mm.dd.yyyy), the literal 'at', time of day, day part (AM/PM), and some code for the timezone, each separated by spaces. For them to be useful, the first step is splitting the string into its individual parts. For that, use a regular expression to collapse repeated spaces into single spaces, then use string_to_array to create an array of up to 6 elements. This gives:
+--------------------------+--------+----------------------------------+
| Field | Array | Action / |
| | index | Usage |
+--------------------------+--------+----------------------------------+
| day name | 1 | ignore |
| date | 2 | cast to date |
| literal 'at' | 3 | ignore |
| time of day | 4 | interval as hours since midnight |
| AM/PM | 5 | adjustment for time of day |
| some code for timezone | 6 | ??? |
+--------------------------+--------+----------------------------------+
Putting it all together we arrive at:
with test_data ( stg ) as
( values ('Friday 07.30.2021 at 08:30 AM ET' )
, ('Wednesday 04.07.2021 at 10:00 PM ET')
, ('Saturday 03.27.2010' )
, ('Monday 01.11.2010' )
, ('Saturday 02.09.2019 at 05:00 PM ET' )
, ('Wednesday 03.31.2010')
)
-- <<< Your query begins here >>>
, stg_array( strings) as
( select string_to_array(regexp_replace(stg, '( ){1,}',' ','g'), ' ' )
from test_data --<<< your actual table >>>
) -- select * from stg_array
, as_columns( dt, tod_interval, adj_interval, tz) as
( select strings[2]::date
, case when array_length(strings,1) >= 4
then strings[4]::interval
else '00:00':: interval
end
, case when array_length(strings,1) >= 5 then
case when strings[5]='PM'
then interval '12 hours'
else interval '0 hours'
end
else interval '0 hours'
end
, case when array_length(strings,1) >= 6
then strings[6]
else current_setting('TIMEZONE')
end
from stg_array
)
select dt + tod_interval + adj_interval dtts, tz
from as_columns;
This gives the corresponding timestamp for date, time, and AM/PM indicator (in the current timezone) for items without a timezone specified. For those containing a timezone code, you will have to convert it to a proper timezone name; note that ET is neither a valid timezone name nor a valid abbreviation. Perhaps a lookup table. See the example here; it also contains a regexp-based solution. Also, the example is run on db<>fiddle; their server is in the UK, thus the timezone.
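If you would rather normalize these strings in application code before loading them into Postgres, here is a minimal Python sketch of the same split-and-adjust idea; TZ_LOOKUP below is a hypothetical stand-in for the lookup table suggested above:

from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical abbreviation -> IANA name lookup (only ET appears in the samples).
TZ_LOOKUP = {"ET": "America/New_York"}

def parse_event(raw: str, default_tz: str = "America/New_York") -> datetime:
    parts = raw.split()                                   # collapses repeated spaces
    dt = datetime.strptime(parts[1], "%m.%d.%Y")          # the date is always the 2nd element
    if len(parts) >= 5:                                   # "... at HH:MM AM/PM ..."
        tod = datetime.strptime(f"{parts[3]} {parts[4]}", "%I:%M %p").time()
        dt = datetime.combine(dt.date(), tod)
    tz = TZ_LOOKUP.get(parts[5], default_tz) if len(parts) >= 6 else default_tz
    return dt.replace(tzinfo=ZoneInfo(tz))

# parse_event("Friday 07.30.2021 at 08:30 AM ET") -> 2021-07-30 08:30:00-04:00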

How to sum the minutes of each activity in PostgreSQL?

The column "activitie_time_enter" has the times.
The column "activitie_still" indicates the type of activity.
The column "activitie_walking" indicates the other type of activity.
Table example:
| activitie_time_enter | activitie_still | activitie_walking |
|----------------------|-----------------|-------------------|
| 17:30:20             | Still           |                   |
| 17:31:32             | Still           |                   |
| 17:32:24             |                 | Walking           |
| 17:33:37             |                 | Walking           |
| 17:34:20             | Still           |                   |
| 17:35:37             | Still           |                   |
| 17:45:13             | Still           |                   |
| 17:50:23             | Still           |                   |
| 17:51:32             |                 | Walking           |
What I need is to sum up the total minutes for each activity separately.
Any suggestions or solution?
First calculate the duration for each activity (the WITH CTE) and then do a conditional sum.
with t as
(
select
*, lead(activitie_time_enter) over (order by activitie_time_enter) - activitie_time_enter as duration
from _table
)
select
sum (duration) filter (where activitie_still = 'Still') as total_still,
sum (duration) filter (where activitie_walking = 'Walking') as total_walking
from t;
/** Result:
total_still|total_walking|
-----------+-------------+
00:19:16| 00:01:56|
*/
BTW do you really need two columns (activitie_still and activitie_walking)? Only one activity column with those values will do. This will allow more activities (Running, Sleeping, Working etc.) w/o having to change the table structure.

JOIN or analytic function to match different sensors on nearby timestamps within a large dataset?

I have a large dataset consisting of four sensors in a single stream, but for simplicity's sake let's reduce that to two sensors that transmit at approximately (but not exactly) the same times, like this:
+---------+-------------+-------+
| Sensor | Time | Value |
+---------+-------------+-------+
| SensorA | 10:00:01.14 | 10 |
| SensorB | 10:00:01.06 | 8 |
| SensorA | 10:00:02.15 | 11 |
| SensorB | 10:00:02.07 | 9 |
| SensorA | 10:00:03.14 | 13 |
| SensorA | 10:00:04.09 | 12 |
| SensorB | 10:00:04.13 | 6 |
+---------+-------------+-------+
I am trying to find the difference between SensorA and SensorB when their readings are within a half-second of each other. Like this:
+-------------+-------+
| Trunc_Time | Diff |
+-------------+-------+
| 10:00:01 | 2 |
| 10:00:02 | 2 |
| 10:00:04 | 6 |
+-------------+-------+
I know I could write queries to put each sensor in its own table (say SensorA_table and SensorB_table), and then join those tables like this:
SELECT
TIMESTAMP_TRUNC(a.Time, SECOND) as truncated_sec,
a.Value - b.Value as sensor_diff
FROM SensorA_table AS a JOIN SensorB_Table AS b
ON b.Time BETWEEN TIMESTAMP_SUB(a.Time, INTERVAL 500 MILLISECOND) AND TIMESTAMP_ADD(a.Time, INTERVAL 500 MILLISECOND)
But that seems very expensive to make every row of the SensorA_table compare against every row of the SensorB_table, given that the sensor tables are each about 10 TB. Or does partitioning automatically take care of this and only look at one block of SensorB's table per row of SensorA's table?
Either way, I am wondering if there is a better way to do this than a full JOIN. Since the matching values are all coming from within a few rows of each other in the original table, it seems like an analytic function might be able to look at a smaller amount of data at a time, but because we can't guarantee alternating rows of A & B, there's no clear LAG or LEAD offset that would always return the correct row.
Is it a matter of writing an analytic functions to return a few LAG and LEAD rows for each row, then evaluate each of those rows with a CASE statement to see if it is the correct row, then calculating the value? Or is there a way of doing a join against an analytic function's window?
Thanks for any guidance here.
One method uses lag(). Something like this:
select st.time, st.value - st.prev_value as diff
from (select st.*,
             lag(sensor) over (order by time, sensor) as prev_sensor,
             lag(time) over (order by time, sensor) as prev_time,
             lag(value) over (order by time, sensor) as prev_value
      from sensor_table st
     ) st
where st.sensor <> st.prev_sensor and
      st.prev_time > timestamp_sub(st.time, interval 1 second)

How to do a BETWEEN across 2 columns in SQL

Hi, I want to find out whether I can reserve a date slot. This is my table of timestamps:
+------------+-----------+-----------------+----------------+-------------+--------------+----------------+--------------+------------+
| idReserver | demandeur | reserverAvecQui | reserverWhy | saisieQuant | reserverDate | reserverStartH | reserverEndH | reserverOk |
+------------+-----------+-----------------+----------------+-------------+--------------+----------------+--------------+------------+
| 1 | test.fr | anonyme | Je ne sais pas | 1524167863 | 1524175200 | 1524222000 | 1524240000 | NULL |
+------------+-----------+-----------------+----------------+-------------+--------------+----------------+--------------+------------+
I tried this SQL query:
select * from calendar WHERE reserverStartH >=1524222000 AND reserverEndH <=1524240000;
This query finds the row; if it finds a row, we can't make the reservation because the person is busy during that time.
But now I try this query:
select * from calendar WHERE reserverStartH >=1524222000 AND reserverEndH <=1524222222;
It returns no data, and no data would mean I can make the reservation, but that's wrong, because between 1524222000 and 1524222222 the person is busy.
How do I make the query find the data?
If the query finds data, it means we can't insert the new timestamps because the person is busy.
If the query finds no data, it means we can insert the new timestamps because the person is free at that moment.
You want to check whether two intervals have an intersection.
For example, whether (a, b) intersects (c, d): (10, 20) has an intersection with (15, 25), namely (15, 20).
The logical condition to find if the intersection exists is: c <= b AND d >= a
Translating this to your SQL:
select * from calendar
WHERE 1524222000 <= reserverEndH AND 1524222222 >= reserverStartH;
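The same intersection test, sketched as a small Python helper with the timestamps from the question (purely illustrative):

def overlaps(a_start, a_end, b_start, b_end):
    """Two intervals intersect iff each one starts no later than the other ends."""
    return a_start <= b_end and b_start <= a_end

# Existing reservation vs. the requested slot from the question:
print(overlaps(1524222000, 1524240000, 1524222000, 1524222222))  # True -> busy, can't reserve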