I have a table which contains power values (kW) for devices. Values are read from each device once a minute and inserted into the table with a timestamp. What I need to do is calculate the power consumption (kWh) for a given time span and return the 10 most power-consuming devices. Right now I query the results for the given time span and do the calculation in the backend, looping over all records. This works fine with a small number of devices and a short time span, but in the real use case I could have thousands of devices and a long time span.
So my question is: how could I do all of this in PostgreSQL 9.4.4, so that my query returns only the 10 most power-consuming (device_id, power_consumption) pairs?
Example table:
CREATE TABLE measurements (
id serial primary key,
device_id integer,
power real,
created_at timestamp
);
Simple data example:
| id | device_id | power | created_at |
|----|-----------|-------|--------------------------|
| 1 | 1 | 10 | August, 26 2015 08:23:25 |
| 2 | 1 | 13 | August, 26 2015 08:24:25 |
| 3 | 1 | 12 | August, 26 2015 08:25:25 |
| 4 | 2 | 103 | August, 26 2015 08:23:25 |
| 5 | 2 | 134 | August, 26 2015 08:24:25 |
| 6 | 2 | 2 | August, 26 2015 08:25:25 |
| 7 | 3 | 10 | August, 26 2015 08:23:25 |
| 8 | 3 | 13 | August, 26 2015 08:24:25 |
| 9 | 3 | 20 | August, 26 2015 08:25:25 |
Desired results for the query:
| id | device_id | power_consumption |
|----|-----------|-------------------|
| 1 | 1 | 24.0 |
| 2 | 2 | 186.5 |
| 3 | 3 | 28.0 |
Simplified example (created_at in hours) of how I calculate the kWh value:
data = [
[
{ 'id': 1, 'device_id': 1, 'power': 10.0, 'created_at': 0 },
{ 'id': 2, 'device_id': 1, 'power': 13.0, 'created_at': 1 },
{ 'id': 3, 'device_id': 1, 'power': 12.0, 'created_at': 2 }
],
[
{ 'id': 4, 'device_id': 2, 'power': 103.0, 'created_at': 0 },
{ 'id': 5, 'device_id': 2, 'power': 134.0, 'created_at': 1 },
{ 'id': 6, 'device_id': 2, 'power': 2.0, 'created_at': 2 }
],
[
{ 'id': 7, 'device_id': 3, 'power': 10.0, 'created_at': 0 },
{ 'id': 8, 'device_id': 3, 'power': 13.0, 'created_at': 1 },
{ 'id': 9, 'device_id': 3, 'power': 20.0, 'created_at': 2 }
]
]
# device_id: power_consumption
results = {1: 0, 2: 0, 3: 0}

for d in data:
    for i in range(len(d) - 1):
        # Area under the curve between two consecutive records gives kWh:
        # X-axis is time (h), Y-axis is power (kW), so each slice is the
        # trapezoid (x2 - x1) * (y1 + y2) / 2
        x1 = d[i]['created_at']
        x2 = d[i + 1]['created_at']
        y1 = d[i]['power']
        y2 = d[i + 1]['power']
        results[d[i]['device_id']] += ((x2 - x1) * (y2 + y1)) / 2

print(results)
EDIT: See my answer below for how I ended up solving this.
Some of the elements that you'll need in order to do this are:
Sum() aggregations, to calculate the total of a number of records
Lag()/Lead() functions, to calculate for a given record what the "previous" or "next" record's values were.
So where in your backend code you take a row's created_at and power values, in SQL you would use a Lead() window function to fetch the created_at and power of the record for the same device_id that has the next highest created_at.
Docs for Lead() are here: http://www.postgresql.org/docs/9.4/static/functions-window.html
When for each row you have calculated the power consumption by reference to the "next" record, you can use a Sum() to aggregate up all of the calculated powers for that one device.
When you have calculated the power per device, you can use ORDER BY and LIMIT to select the top n power-consuming devices.
Steps to follow, if you're not confident enough to plunge in and just write the final SQL -- after each step make sure you have SQL you understand, and which returns just the data you need:
Start small, by selecting the data rows that you want.
Work out the Lead() function, defining the appropriate partition and order clauses to get the next row.
Add the calculation of power per row.
Define the Sum() function, and group by the device id.
Add the ORDER BY and LIMIT clauses.
If you have trouble with any one of these steps, they would each make a decent StackOverflow question.
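A minimal sketch of that approach against the measurements table from the question might look like the query below (the time range is just a placeholder, and this uses lead() rather than the lag() variant the asker ended up with):

SELECT device_id,
       sum(
           -- trapezoid rule: average of two consecutive readings (kW)
           -- times the gap between them, converted from seconds to hours
           (power + next_power) / 2
           * extract(epoch FROM (next_created_at - created_at)) / 3600
       ) AS power_consumption
FROM (
    SELECT device_id,
           power,
           created_at,
           lead(power)      OVER w AS next_power,
           lead(created_at) OVER w AS next_created_at
    FROM measurements
    WHERE created_at BETWEEN '2015-08-26 08:00:00' AND '2015-08-26 09:00:00'
    WINDOW w AS (PARTITION BY device_id ORDER BY created_at)
) AS s
GROUP BY device_id
ORDER BY power_consumption DESC
LIMIT 10;

The last reading of each device produces NULLs from lead(), which sum() simply ignores.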
In case someone wonders about the same thing, here is how I solved it.
I followed the instructions from David and came up with this:
SELECT
    t.device_id,
    -- len_y is the sum of two consecutive power readings (kW) and len_x the
    -- interval between them; extract(epoch ...) yields seconds, so dividing
    -- by 7200 (= 2 * 3600) averages the two readings and converts to hours
    sum(len_y * (extract(epoch from date_trunc('milliseconds', len_x))) / 7200) AS total
FROM (
    SELECT
        m.id,
        m.device_id,
        m.power,
        m.created_at,
        m.power + lag(m.power) OVER (
            PARTITION BY device_id
            ORDER BY m.created_at
        ) AS len_y,
        m.created_at - lag(m.created_at) OVER (
            PARTITION BY device_id
            ORDER BY m.created_at
        ) AS len_x
    FROM
        measurements AS m
    WHERE m.created_at BETWEEN '2015-08-26 13:39:57.834674'::timestamp
                           AND '2015-08-26 13:43:57.834674'::timestamp
) AS t
GROUP BY t.device_id
ORDER BY total DESC
LIMIT 10;
After running an AI analysis across tens of thousands of audio files, I end up with this kind of data structure in a Postgres DB:
| id | name        | tag_1 | tag_2 | tag_3 | tag_4         | tag_5        |
|----|-------------|-------|-------|-------|---------------|--------------|
| 1  | first song  | rock  | pop   | 80s   | female singer | classic rock |
| 2  | second song | pop   | rock  | jazz  | electronic    | new wave     |
| 3  | third song  | rock  | funk  | rnb   | 80s           | rnb          |
Tag positions are really important: the more "to the left", the more prominent it is in the song. The number of tags is also finite (50 tags) and the AI always returns 5 of them for every song, no null values expected.
On the other hand, this is what I have to query:
{"rock" => 15, "pop" => 10, "soul" => 3}
Each key is a tag name and each value an arbitrary weight. The number of entries can be anywhere from 1 to 50.
According to the example dataset, in this case it should return [1, 3, 2]
I'm also open for data restructuring if it could be easier to achieve using raw concatenated strings but... is it something doable using Postgres (tsvectors?) or do I really have to use something like Elasticsearch for this?
After a lot of trial and error, this is what I ended up with, using only Postgres:
Convert the whole data set to integers, so it becomes something like this (I also added columns to match the real data set more closely):
| id | bpm | tag_1 | tag_2 | tag_3 | tag_4 | tag_5 |
|----|-----|-------|-------|-------|-------|-------|
| 1  | 114 | 1     | 2     | 3     | 4     | 5     |
| 2  | 102 | 2     | 1     | 6     | 7     | 8     |
| 3  | 110 | 1     | 9     | 10    | 3     | 12    |
Store requests in an array as strings (note that I sanitized those requests before with some kind of "request builder"):
requests = [
"bpm BETWEEN 110 AND 124
AND tag_1 = 1
AND tag_2 = 2
AND tag_3 = 3
AND tag_4 = 4
AND tag_5 = 5",
"bpm BETWEEN 110 AND 124
AND tag_1 = 1
AND tag_2 = 2
AND tag_3 = 3
AND tag_4 = 4
AND tag_5 IN (1, 3, 5)",
"bpm BETWEEN 110 AND 124
AND tag_1 = 1
AND tag_2 = 2
AND tag_3 = 3
AND tag_4 IN (1, 3, 5)
AND tag_5 IN (1, 3, 5)",
....
]
Simply loop over the requests array, from the most precise to the most approximate one:
# Ruby / ActiveRecord example
track_ids = []
requests.each do |request|
  scope = Track.where(request)
  # Exclude tracks already collected by a more precise request
  scope = scope.where("tracks.id NOT IN (?)", track_ids) if track_ids.any?
  track_ids += scope.pluck(:id)
  break if track_ids.length > 200
end
... and done! All my songs are ordered by similarity, with the closest match at the top; the further down the list, the more approximate the match. Since everything is integers, it's pretty fast (fast enough on a 100K-row dataset), and the output looks like pure magic. Bonus point: it is still easily tweakable and maintainable by the whole team.
I do understand that this is rough, so I'm open to a more efficient way of doing the same thing, even if it means adding something else to the stack (ES?), but so far it's a simple solution that just works.
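For reference, each iteration of that loop ends up sending Postgres a query roughly like the one below (the excluded ids 17 and 42 are hypothetical, standing in for tracks already matched by an earlier, stricter request):

SELECT tracks.id
FROM tracks
WHERE (bpm BETWEEN 110 AND 124
       AND tag_1 = 1
       AND tag_2 = 2
       AND tag_3 = 3
       AND tag_4 = 4
       AND tag_5 IN (1, 3, 5))
  AND tracks.id NOT IN (17, 42);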
I have a postgres table with columns:
id: text
availabilities: integer[]
A certain ID can have multiple availabilities (different, not necessarily consecutive days, spread over a range of up to a few years). Each availability is a Unix timestamp (in seconds) for a certain day.
Hours, minutes, seconds, ms are set to 0, i.e. a timestamp represents the start of a day.
Question:
How can I quickly find all IDs that contain at least one availability within a certain from-to range (also given as timestamps)?
I could also store them differently in the array, e.g. as "days since epoch", if needed (to get steps of 1 day instead of 86400 seconds).
However, if possible (and if the speed is roughly the same), I want to keep using an array with one row per ID.
Example:
Data (0 = day-1, 86400 = day-2, ...):
| id | availabilities             |
|----|----------------------------|
| 1  | [0, 86400, 172800, 259200] |
| 2  | [86400, 259200]            |
| 3  | [345600]                   |
| 4  | [172800]                   |
| 5  | [0]                        |
Now I want to get a list of IDs which contains at least 1 availability which:
is between 86400 AND 259200 --> ID 1, 2, 4
is between 172800 AND 172800 --> ID 1, 4
is between 259200 AND (max-int) --> ID 1,2,3
In PostgreSQL, the unnest function is the standard way to convert array elements into rows, and it performs well. You can use this function. Sample query:
with mytable as (
select 1 as id, '{12,2500,6000,200}'::int[] as pint
union all
select 2 as id, '{0,200,3500,150}'::int[]
union all
select 4 as id, '{20,10,8500,1100,9000,25000}'::int[]
)
select id, unnest(pint) as pt from mytable;
-- Return
1 12
1 2500
1 6000
1 200
2 0
2 200
2 3500
2 150
4 20
4 10
4 8500
4 1100
4 9000
4 25000
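Building on that, a minimal sketch of the range filter from the question (reusing the same sample CTE; the bounds 100 and 300 are just placeholders) could be:

with mytable as (
    select 1 as id, '{12,2500,6000,200}'::int[] as pint
    union all
    select 2 as id, '{0,200,3500,150}'::int[]
    union all
    select 4 as id, '{20,10,8500,1100,9000,25000}'::int[]
)
select distinct id
from (select id, unnest(pint) as pt from mytable) as t
where pt between 100 and 300;
-- returns ids 1 and 2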
WITH t(x,y) AS (
VALUES
(1,'[2]'),
(2,'[1, 2]'),
(3,'[2, 1]'),
(4,'[3, 2, 5]'),
(5,'[3, 2, 5, 2, 4]'),
(6,'[3, 2, 2, 0, 4]')
)
--- my wrong answer below
SELECT
REGEXP_EXTRACT(y, '(\d+,\s)?(2)(,\s\d+)?') AS _1st,
REGEXP_EXTRACT(y,'(.*?(2)){1}.*?(\d+,\s(2)(,\s\d+)?)',3) AS _2nd,
REGEXP_EXTRACT(y,'(.*?(2)){2}.*?(\d+,\s(2)(,\s\d+)?)',3) AS _3rd
FROM t
Expected ans:
| x | y | 1st | 2nd | nth |
| - | --------------- | ------- | ------- | ------- |
| 1 | [2] | 2 | | |
| 2 | [1, 2] | 1, 2 | | |
| 3 | [2, 1] | 2, 1 | | |
| 4 | [3, 2, 5] | 3, 2, 5 | | |
| 5 | [3, 2, 5, 2, 4] | 3, 2, 5 | 5, 2, 4 | |
| 6 | [3, 2, 2, 0, 4] | 3, 2, 2 | 2, 2, 0 | |
I need help with the regex for the REGEXP_EXTRACT function in Presto to get the nth occurrence of the number '2' and include the figures before and after it (if any).
Additional info:
The figures in column y are not necessarily single digits.
The order of the numbers is important.
1st, 2nd, 3rd refer to the nth occurrence of the number I am seeking.
I will be looking for a list of numbers, not just 2; 2 is used for illustration purposes.
Must it be a regular expression?
If you treat the text (VARCHAR) [1, 2, 3] as an array representation (JSON or the internal Array data type), you have more functions available to solve your task.
See related functions supported by Presto:
JSON functions
Array functions
I would recommend casting it to an array of integers: CAST('[1,23,456]' AS ARRAY(INTEGER))
Finding the n-th occurrence
From Array functions, array_position(x, element, instance) → bigint to find the n-th occurrence:
If instance > 0, returns the position of the instance-th occurrence of the element in array x.
If instance < 0, returns the position of the instance-to-last occurrence of the element in array x.
If no matching element instance is found, 0 is returned.
Example:
SELECT CAST('[1,2,23,2,456]' AS ARRAY(INTEGER));
SELECT array_position(CAST('[1,2,23,2,456]' AS ARRAY(INTEGER)), 2, 1); -- found in position 2
Now use the found position to build your slice (relatively from that).
Slicing and extracting sub-arrays
either parse it as JSON to a JSON-array. Then use a JSON-path to slice (extract a sub-array) as desired: Array slice operator in JSON-path: [start, stop, step]
or cast it as Array and then use slice(x, start, length) → array
Subsets array x starting from index start (or starting from the end if start is negative) with a length of length.
Examples:
SELECT json_extract(json_parse('[1,2,3]'), '$[-2, -1]'); -- the last two elements
SELECT slice(CAST('[1,23,456]' AS ARRAY(INTEGER)), -2, 2); -- [23, 456]
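Putting the two together, a rough sketch for the question's data might look like this (the one-element window on each side, the json_parse cast, and the missing guard for absent occurrences are assumptions on my part):

WITH t(x, y) AS (
    VALUES (5, '[3, 2, 5, 2, 4]')
)
SELECT
    x,
    -- slice of length 3 starting one element before the n-th occurrence of 2;
    -- array_position returns 0 when there is no such occurrence, so real code
    -- would need a guard around that case
    slice(arr, greatest(array_position(arr, 2, 1) - 1, 1), 3) AS _1st,
    slice(arr, greatest(array_position(arr, 2, 2) - 1, 1), 3) AS _2nd
FROM (
    SELECT x, CAST(json_parse(y) AS ARRAY(INTEGER)) AS arr
    FROM t
) AS s;
-- x = 5: _1st = [3, 2, 5], _2nd = [5, 2, 4]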
Columns: datetime | clientid | amounts | *new_column_to_be_implemented* (a rolling mean of the previous amounts, computed only over rows with the same clientid)
| datetime | clientid | amounts | new column |
|----------|----------|---------|------------|
| day 1    | 2        | 50      | (na)       |
| day 2    | 2        | 60      | 50         |
| day 3    | 1        | 45      | (na)       |
| day 4    | 2        | 45      | 110        |
| day 5    | 3        | 90      | (na)       |
| day 6    | 3        | 10      | 90         |
| day 7    | 2        | 10      | 105        |
So this gets, for example, the mean of the last 2 amounts for the same clientid.
I know it is possible to add a list and append/pop values to remember them, but is there a better way in pandas?
Please make sure to follow the guidelines described in How to make good reproducible pandas examples when asking pandas-related questions; it helps a lot with reproducibility.
The key element of the answer is pairing the groupby and rolling methods: groupby groups all the records with the same clientid, and rolling selects the correct number of records for the mean calculation.
import pandas as pd
import numpy as np
# setting up the dataframe
data = [
['day 1', 2, 50],
['day 2', 2, 60],
['day 3', 1, 45],
['day 4', 2, 45],
['day 5', 3, 90],
['day 6', 3, 10],
['day 7', 2, 10]
]
columns = ['date', 'clientid', 'amounts']
df = pd.DataFrame(data=data, columns=columns)
# mean over a window of 2 rows within each clientid group
rolling_mean = df.groupby('clientid').rolling(2)['amounts'].mean()
# drop the clientid level so the result aligns with the original row index
rolling_mean.index = rolling_mean.index.get_level_values(1)
df['client_rolling_mean'] = rolling_mean
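For what it's worth, the numbers in the question's example column (50, 110, 105) look more like a sum over the two rows preceding the current one than a mean that includes it; if that is the intent, a shifted variant along these lines (an assumption on my part, not part of the original answer) would reproduce them:

# Sum of up to the two previous amounts per clientid, excluding the current row;
# swap .sum() for .mean() if an actual mean is wanted
df['client_rolling_mean'] = df.groupby('clientid')['amounts'].transform(
    lambda s: s.shift().rolling(2, min_periods=1).sum()
)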
I'm trying to merge multiple pandas data frames into one. I have 1 main frame with the locations of measurements. The other data frames contain multiple measurements for one location. Like below:
df 1:
| Location ID | X   | Y   | Z   |
|-------------|-----|-----|-----|
| 1           | 1   | 2   | 3   |
| 2           | 3   | 2   | 1   |
| n           | ... | ... | ... |
df 2:
| Location ID | Date             | Measurement |
|-------------|------------------|-------------|
| 1           | January 1 12:30  | 1           |
| 1           | January 16 12:30 | 4           |
| 1           | ...              | ...         |
df 3:
| Location ID | Date             | Measurement |
|-------------|------------------|-------------|
| 2           | January 1 12:30  | 3           |
| 2           | January 16 12:30 | 9           |
| 2           | ...              | ...         |
df n:
| Location ID | Date             | Measurement |
|-------------|------------------|-------------|
| n           | January 1 12:30  | 4           |
| n           | January 16 12:30 | 6           |
| n           | January 20 11:30 | 7           |
| n           | ...              | ...         |
I'm trying to create a data frame like this:
df_final:
| Location ID | X | Y | Z | January 1 12:30 | January 16 12:30 | January 20 11:30 | ... |
|-------------|---|---|---|-----------------|------------------|------------------|-----|
| 1           | 1 | 2 | 3 | 1               | 4                | NaN              |     |
| 2           | 3 | 2 | 1 | 3               | 9                | NaN              |     |
| n           | 2 | 5 | 7 | 4               | 6                | 7                |     |
The dates are already datetime objects and the Location ID is the index of both dataframes.
I tried the append, merge and concat functions, both using two frames directly and converting a frame to a list with List = frame['measurements'] before adding it.
The problem is that either the rows are appended below the first data frame, while the measured values should go into new columns on an existing row (the one for the respective Location ID), or the dates end up as new rows while new columns are created for the location IDs.
I'm sorry my question lay-out is not so nice, but I'm new to this forum.
Found it myself.
I used frame.pivot to reshape df 2 to df n and then used concat to add them to the locations df.
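A minimal sketch of that pivot + concat approach (the frame and column names are assumptions for illustration, not the asker's actual code):

import pandas as pd

# Location frame, indexed by Location ID
df1 = pd.DataFrame({'X': [1, 3], 'Y': [2, 2], 'Z': [3, 1]},
                   index=pd.Index([1, 2], name='Location ID'))

# Measurement frame (in practice one such frame per location, concatenated)
df2 = pd.DataFrame({
    'Location ID': [1, 1, 2, 2],
    'Date': pd.to_datetime(['2015-01-01 12:30', '2015-01-16 12:30',
                            '2015-01-01 12:30', '2015-01-16 12:30']),
    'Measurement': [1, 4, 3, 9],
})

# One row per location, one column per timestamp
wide = df2.pivot(index='Location ID', columns='Date', values='Measurement')

# Attach the measurement columns to the location frame
df_final = pd.concat([df1, wide], axis=1)
print(df_final)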