Calculating values from the same table but a different period - sql

I have a readings table with the following definition.
 Column     | Type                        | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+--------------------------------------
 id         | integer                     |           | not null | nextval('readings_id_seq'::regclass)
 created_at | timestamp without time zone |           | not null |
 type       | character varying(50)       |           | not null |
 device     | character varying(100)      |           | not null |
 value      | numeric                     |           | not null |
It has data such as:
id | created_at | type | device | value
----+---------------------+--------+--------+-------
1 | 2021-05-11 04:00:00 | weight | 1 | 100
2 | 2021-05-10 03:00:00 | weight | 2 | 120
3 | 2021-05-10 04:00:00 | weight | 1 | 120
4 | 2021-05-10 03:00:00 | weight | 1 | 124
5 | 2021-05-01 22:43:47 | weight | 1 | 130
6 | 2021-05-01 15:00:48 | weight | 1 | 140
7 | 2021-05-01 13:00:48 | weight | 2 | 160
Desired Output
Given a device and a type, I would like the max and min value from the past 7 days for each matched row (active row ignored). If there's nothing in the past 7 days, then it should be 0.
id | created_at | type | device | value | min | max
----+---------------------+--------+--------+-------+-----+-----
1 | 2021-05-11 04:00:00 | weight | 1 | 100 | 120 | 124
3 | 2021-05-10 04:00:00 | weight | 1 | 120 | 124 | 124
4 | 2021-05-10 03:00:00 | weight | 1 | 124 | 0 | 0
5 | 2021-05-01 22:43:47 | weight | 1 | 130 | 140 | 140
6 | 2021-05-01 15:00:48 | weight | 1 | 140 | 0 | 0
I have created a db-fiddle.
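In case the fiddle is unavailable, here is a minimal setup script reconstructed from the definition and sample rows above (a sketch; the fiddle's exact DDL may differ):
CREATE TABLE readings (
    id         serial PRIMARY KEY,
    created_at timestamp NOT NULL,
    type       varchar(50)  NOT NULL,
    device     varchar(100) NOT NULL,
    value      numeric      NOT NULL
);
INSERT INTO readings (created_at, type, device, value) VALUES
    ('2021-05-11 04:00:00', 'weight', '1', 100),
    ('2021-05-10 03:00:00', 'weight', '2', 120),
    ('2021-05-10 04:00:00', 'weight', '1', 120),
    ('2021-05-10 03:00:00', 'weight', '1', 124),
    ('2021-05-01 22:43:47', 'weight', '1', 130),
    ('2021-05-01 15:00:48', 'weight', '1', 140),
    ('2021-05-01 13:00:48', 'weight', '2', 160);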

You can use a lateral left join for your requirement, like below:
select t1.id,
       t1.created_at,
       t1.type,
       t1.device,
       t1.value,
       min(coalesce(t2.value, 0)),
       max(coalesce(t2.value, 0))
from readings t1
left join lateral (
    select *
    from readings
    where id != t1.id
      and created_at between t1.created_at - interval '7 day' and t1.created_at
      and device = t1.device
      and type = t1.type
) t2 on true
where t1.device = '1'      -- Change the device
  and t1.type = 'weight'   -- Change the type
group by 1, 2, 3, 4, 5
order by 1
DEMO

Considering the comments, here is the PostgreSQL query:
select readings.id, readings.type, readings.device, readings.created_at, readings.value,
       min(COALESCE(m_readings.value, 0)) AS min,
       max(COALESCE(m_readings.value, 0)) AS max
from readings
LEFT JOIN readings m_readings
       ON m_readings.type = readings.type
      AND m_readings.device = readings.device
      AND m_readings.id > readings.id
      AND date(m_readings.created_at) between (date(readings.created_at) - 7) and date(readings.created_at)
group by readings.id, readings.type, readings.device, readings.created_at, readings.value
order by readings.id;
Explanation: We make a LEFT JOIN between each record of readings and the other records of readings that have the same type and device but not the same id, keeping only the records from the last 7 days. Then for each type/device we group to get the max and min value over those 7 days.

You should be using window functions for this!
select r.*,
max(value) over (partition by device, type
order by created_at
range between interval '7 day' preceding and interval '1 second' preceding
),
min(value) over (partition by device, type
order by created_at
range between interval '7 day' preceding and interval '1 second' preceding
)
from readings r;
The above returns NULL values when there are no values -- and that makes more sense to me than 0. But if you really want 0, just use COALESCE():
select r.*,
coalesce(max(value) over (partition by device, type
order by created_at
range between interval '7 day' preceding and interval '1 second' preceding
), 0),
coalesce(min(value) over (partition by device, type
order by created_at
range between interval '7 day' preceding and interval '1 second' preceding
), 0)
from readings r;
In addition to being more concise, this is easier to read and should have better performance than other methods.
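Worth noting: RANGE frames with an interval offset require PostgreSQL 11 or later. If performance matters at scale, an index matching the partitioning and ordering should help the window scan (a suggestion, not part of the original answer; the index name is arbitrary):
CREATE INDEX readings_device_type_created_at_idx
    ON readings (device, type, created_at);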

Related

How do I use AVG() with GROUP BY in time_bucket_gapfill() in TimescaleDB, PostgreSQL?

I'm using TimescaleDB in my PostgreSQL and I have the following two Tables:
windows_log
| windows_log_id | timestamp | computer_id | log_count |
------------------------------------------------------------------
| 1 | 2021-01-01 00:01:02 | 382 | 30 |
| 2 | 2021-01-02 14:59:55 | 382 | 20 |
| 3 | 2021-01-02 19:08:24 | 382 | 20 |
| 4 | 2021-01-03 13:05:36 | 382 | 10 |
| 5 | 2021-01-03 22:21:14 | 382 | 40 |
windows_reliability_score
| computer_id (FK) | timestamp | reliability_score |
--------------------------------------------------------------
| 382 | 2021-01-01 22:21:14 | 6 |
| 382 | 2021-01-01 22:21:14 | 6 |
| 382 | 2021-01-01 22:21:14 | 6 |
| 382 | 2021-01-02 22:21:14 | 1 |
| 382 | 2021-01-02 22:21:14 | 3 |
| 382 | 2021-01-03 22:21:14 | 7 |
| 382 | 2021-01-03 22:21:14 | 8 |
| 382 | 2021-01-03 22:21:14 | 9 |
Note: Both tables are indexed on the timestamp column (hypertable)
So I'm trying to get the average reliability_score for each time bucket, but it just gives me the average for everything, instead of the average per specific bucket...
This is my query:
SELECT time_bucket_gapfill(CAST(1 * INTERVAL '1 day' AS INTERVAL), wl.timestamp) AS timestamp,
COALESCE(SUM(log_count), 0) AS log_count,
AVG(reliability_score) AS reliability_score
FROM windows_log wl
JOIN reliability_score USING (computer_id)
WHERE wl.time >= '2021-01-01 00:00:00.0' AND wl.time < '2021-01-04 00:00:00.0'
GROUP BY timestamp
ORDER BY timestamp asc
This is the result I'm looking for:
| timestamp | log_count | reliability_score |
-------------------------------------------------------
| 2021-01-01 00:00:00 | 30 | 6 |
| 2021-01-02 00:00:00 | 20 | 2 |
| 2021-01-03 00:00:00 | 20 | 8 |
But this is what I get:
| timestamp | log_count | reliability_score |
-------------------------------------------------------
| 2021-01-01 00:00:00 | 30 | 5.75 |
| 2021-01-02 00:00:00 | 20 | 5.75 |
| 2021-01-03 00:00:00 | 20 | 5.75 |
Given what we can glean from your example, there's no simple way to do a join between these two tables, with the given functions, and achieve the results you want. The schema, as presented, just makes that difficult.
If this is really what your data/schema look like, then one solution is to use multiple CTEs to get the two values from each distinct table and then join based on bucket and computer.
WITH wrs AS (
SELECT time_bucket_gapfill('1 day', timestamp) AS bucket,
computer_id,
AVG(reliability_score) AS reliability_score
FROM windows_reliability_score
WHERE timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
GROUP BY 1, 2
),
wl AS (
SELECT time_bucket_gapfill('1 day', wl.timestamp) bucket, wl.computer_id,
sum(log_count) total_logs
FROM windows_log wl
WHERE timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
GROUP BY 1, 2
)
SELECT wrs.bucket, wrs.computer_id, reliability_score, total_logs
FROM wrs LEFT JOIN wl ON wrs.bucket = wl.bucket AND wrs.computer_id = wl.computer_id;
The filtering has to be applied inside each CTE because pushdown from the outer query likely wouldn't happen, and then you would scan the entire hypertable before the date filter is applied (not what you want, I assume).
I tried to quickly re-create your sample schema, so I apologize if I got names wrong somewhere.
The main issue is that the join condition is on column computer_id, where both tables have the same value 382. Thus each row from table windows_log will be joined with each row from table reliability_score (a Cartesian product of all rows). Also, the grouping is done on column timestamp, which is ambiguous and is likely to be resolved to timestamp from windows_log. This leads to the result that the average uses all values of reliability_score for each value of the timestamp from windows_log, which explains the undesired result.
The resolution of the grouping ambiguity, which is resolved in favor of the input column, i.e., the table column, is explained in the GROUP BY description of the SELECT documentation:
In case of ambiguity, a GROUP BY name will be interpreted as an input-column name rather than an output column name.
To avoid groups that include all rows matching on computer_id, windows_log_id can be used for grouping. This also allows bringing log_count into the query result. And if it is desired to keep the output name timestamp, GROUP BY should use a reference to the column position. For example:
SELECT time_bucket_gapfill('1 day'::INTERVAL, rs.timestamp) AS timestamp,
AVG(reliability_score) AS reliability_score,
log_count
FROM windows_log wl
JOIN reliability_score rs USING (computer_id)
WHERE rs.timestamp >= '2021-01-01 00:00:00.0' AND rs.timestamp < '2021-01-04 00:00:00.0'
GROUP BY 1, windows_log_id, log_count
ORDER BY timestamp asc
For ORDER BY it is not an issue, since the output name is used. From the same doc:
If an ORDER BY expression is a simple name that matches both an output column name and an input column name, ORDER BY will interpret it as the output column name.
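The GROUP BY ambiguity described above is plain PostgreSQL behaviour, not specific to TimescaleDB; a small sketch with date_trunc (table and column names taken from the question) shows the difference:
-- groups by the raw column wl.timestamp: one group per distinct timestamp value
SELECT date_trunc('day', wl.timestamp) AS timestamp, SUM(log_count)
FROM windows_log wl
GROUP BY timestamp;
-- groups by the day bucket via a positional reference, which is usually what is intended
SELECT date_trunc('day', wl.timestamp) AS timestamp, SUM(log_count)
FROM windows_log wl
GROUP BY 1;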

30 day rolling count of distinct IDs

So after looking at what seems to be a common question being asked and not being able to get any solution to work for me, I decided I should ask for myself.
I have a data set with two columns: session_start_time, uid
I am trying to generate a rolling 30 day tally of unique sessions
It is simple enough to query for the number of unique uids per day:
SELECT
COUNT(DISTINCT(uid))
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '30 days'
It is also relatively simple to calculate the daily unique uids over a date range:
SELECT
DATE_TRUNC('day',session_start_time) AS "date"
,COUNT(DISTINCT uid) AS "count"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY date(session_start_time)
I then tried several ways to do a rolling 30-day unique count over a time interval:
SELECT
DATE(session_start_time) AS "running30day"
,COUNT(distinct(
case when date(session_start_time) >= running30day - interval '30 days'
AND date(session_start_time) <= running30day
then uid
end)
) AS "unique_30day"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '3 months'
GROUP BY date(session_start_time)
Order BY running30day desc
I really thought this would work, but when looking into the results, it appears I'm getting the same results as I was when doing the daily unique count rather than the unique count over 30 days.
I am writing this query from Metabase using the SQL query editor; the underlying tables are in Redshift.
If you read this far, thank you, your time has value and I appreciate the fact that you have spent some of it to read my question.
EDIT:
As rightfully requested, I added an example of the data set I'm working with and the desired outcome.
+-----+-------------------------------+
| UID | SESSION_START_TIME            |
+-----+-------------------------------+
| 10  | 2020-01-13T01:46:07.000-05:00 |
| 5   | 2020-01-13T01:46:07.000-05:00 |
| 3   | 2020-01-18T02:49:23.000-05:00 |
| 9   | 2020-03-06T18:18:28.000-05:00 |
| 2   | 2020-03-06T18:18:28.000-05:00 |
| 8   | 2020-03-31T23:13:33.000-04:00 |
| 3   | 2020-08-28T18:23:15.000-04:00 |
| 2   | 2020-08-28T18:23:15.000-04:00 |
| 9   | 2020-08-28T18:23:15.000-04:00 |
| 3   | 2020-08-28T18:23:15.000-04:00 |
| 8   | 2020-09-15T16:40:29.000-04:00 |
| 3   | 2020-09-21T20:49:09.000-04:00 |
| 1   | 2020-11-05T21:31:48.000-05:00 |
| 6   | 2020-11-05T21:31:48.000-05:00 |
| 8   | 2020-12-12T04:42:00.000-05:00 |
| 8   | 2020-12-12T04:42:00.000-05:00 |
| 5   | 2020-12-12T04:42:00.000-05:00 |
+-----+-------------------------------+
Below is what the result I would like looks like:
+------------+---------------------+
| DATE       | UNIQUE 30 DAY COUNT |
+------------+---------------------+
| 2020-01-13 | 3                   |
| 2020-01-18 | 1                   |
| 2020-03-06 | 3                   |
| 2020-03-31 | 1                   |
| 2020-08-28 | 4                   |
| 2020-09-15 | 2                   |
| 2020-09-21 | 1                   |
| 2020-11-05 | 2                   |
| 2020-12-12 | 2                   |
+------------+---------------------+
Thank you
You can approach this by keeping a counter of when users are counted and then uncounted -- 30 (or perhaps 31) days later. Then, determine the "islands" of being counted, and aggregate. This involves:
Unpivoting the data to have an "enters the count" row and a "leaves the count" row for each session.
Accumulating the count so that on each day, for each user, you know whether they are counted or not.
This defines "islands" of counting. Determine where the islands start and stop -- getting rid of all the detritus in-between.
Now you can simply do a cumulative sum on each date to determine the 30 day session.
In SQL, this looks like:
with t as (
select uid, date_trunc('day', session_start_time) as s_day, 1 as inc
from users_sessions
union all
select uid, date_trunc('day', session_start_time) + interval '31 day' as s_day, -1
from users_sessions
),
tt as ( -- increment the ins and outs to determine whether a uid is in or out on a given day
select uid, s_day, sum(inc) as day_inc,
sum(sum(inc)) over (partition by uid order by s_day rows between unbounded preceding and current row) as running_inc
from t
group by uid, s_day
),
ttt as ( -- find the beginning and end of the islands
select tt.uid, tt.s_day,
(case when running_inc > 0 then 1 else -1 end) as in_island
from (select tt.*,
lag(running_inc) over (partition by uid order by s_day) as prev_running_inc,
lead(running_inc) over (partition by uid order by s_day) as next_running_inc
from tt
) tt
where running_inc > 0 and (prev_running_inc = 0 or prev_running_inc is null) or
running_inc = 0 and (next_running_inc > 0 or next_running_inc is null)
)
select s_day,
sum(sum(in_island)) over (order by s_day rows between unbounded preceding and current row) as active_30
from ttt
group by s_day;
Here is a db<>fiddle.
I'm pretty sure the easier way to do this is to use a join. This creates a list of all the distinct users who had a session on each day and a list of all distinct dates in the data. Then it one-to-many joins the user list to the date list and counts the distinct users. The key here is the expanded join criteria that matches a range of dates to a single date via a system of inequalities.
with users as
(select
distinct uid,
date_trunc('day',session_start_time) AS dt
from <table>
where session_start_time >= '2021-05-01'),
dates as
(select
distinct date_trunc('day',session_start_time) AS dt
from <table>
where session_start_time >= '2021-05-01')
select
count(distinct uid),
dates.dt
from users
join
dates
on users.dt >= dates.dt - 29
and users.dt <= dates.dt
group by dates.dt
order by dt desc
;

SQL interpolating missing values for a specific date range - with some conditions

There are some similar questions on the site, but I believe mine warrants a new post because there are specific conditions that need to be incorporated.
I have a table with monthly intervals, structured like this:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
| 1 | 10 | 12/17/2017 | 1/17/2018 |
| 1 | 10 | 1/18/2018 | 2/18/2018 |
| 1 | 10 | 2/19/2018 | 3/19/2018 |
| 1 | 10 | 3/20/2018 | 4/20/2018 |
| 1 | 10 | 4/21/2018 | 5/21/2018 |
+----+--------+--------------+--------------+
I've found that sometimes there is a month of data missing around the end/beginning of the year where I know it should exist, like this:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
| 2 | 10 | 10/14/2018 | 11/14/2018 |
| 2 | 10 | 11/15/2018 | 12/15/2018 |
| 2 | 10 | 1/17/2019 | 2/17/2019 |
| 2 | 10 | 2/18/2019 | 3/18/2019 |
| 2 | 10 | 3/19/2019 | 4/19/2019 |
+----+--------+--------------+--------------+
What I need is a statement that will:
1. Identify where this year-end period is missing (but not find missing months that aren't at the beginning/end of the year).
2. Create this interval by using the length of an existing interval for that ID (maybe using the mean interval length for the ID to do it?). I could create the interval from the "gap" between the previous and next interval, except that won't work if I'm missing an interval at the beginning or end of the ID's record (i.e. if the record starts at, say, 1/16/2015, I need the amount for 12/15/2014-1/15/2015).
3. Interpolate an 'amount' for this interval using the mean daily 'amount' from the closest existing interval.
The end result for the sample above should look like:
+----+--------+--------------+--------------+
| ID | amount | interval_beg | interval_end |
+----+--------+--------------+--------------+
| 2 | 10 | 10/14/2018 | 11/14/2018 |
| 2 | 10 | 11/15/2018 | 12/15/2018 |
| 2 | 10 | 12/16/2018 | 1/16/2019 |
| 2 | 10 | 1/17/2019 | 2/17/2019 |
| 2 | 10 | 2/18/2019 | 3/18/2019 |
+----+--------+--------------+--------------+
A 'nice to have' would be a flag indicating that this value is interpolated.
Is there a way to do this efficiently in SQL? I have written a solution in SAS, but have a need to move it to SQL, and my SAS solution is very inefficient (optimization isn't a goal, so any statement that does what I need is fantastic).
EDIT: I've made an SQLFiddle with my example table here:
http://sqlfiddle.com/#!18/8b16d
You can use a sequence of CTEs to build up the data for the missing periods. In this query, the first CTE (EOYS) generates all the end-of-year dates (YYYY-12-31) relevant to the table; the second (INTERVALS) computes the average interval length for each ID; and the third (MISSING) attempts to find the start (from t2) and end (from t3) dates of adjoining intervals for any missing (indicated by t1.ID IS NULL) end-of-year interval. The output of this CTE is then used in an INSERT ... SELECT query to add the missing interval records to the table, generating missing dates by adding/subtracting the interval length to the end/start date of the adjacent interval as necessary.
First though we add the interp column to indicate if a row was interpolated:
ALTER TABLE Table1 ADD interp TINYINT NOT NULL DEFAULT 0;
This sets interp to 0 for all existing rows. Then we can do the INSERT, setting interp for all those rows to 1:
WITH EOYS AS (
SELECT DISTINCT DATEFROMPARTS(DATEPART(YEAR, interval_beg), 12, 31) AS eoy
FROM Table1
),
INTERVALS AS (
SELECT ID, AVG(DATEDIFF(DAY, interval_beg, interval_end)) AS interval_len
FROM Table1
GROUP BY ID
),
MISSING AS (
SELECT e.eoy,
ids.ID,
i.interval_len,
COALESCE(t2.amount, t3.amount) AS amount,
DATEADD(DAY, 1, t2.interval_end) AS interval_beg,
DATEADD(DAY, -1, t3.interval_beg) AS interval_end
FROM EOYS e
CROSS JOIN (SELECT DISTINCT ID FROM Table1) ids
JOIN INTERVALS i ON i.ID = ids.ID
LEFT JOIN Table1 t1 ON ids.ID = t1.ID
AND e.eoy BETWEEN t1.interval_beg AND t1.interval_end
LEFT JOIN Table1 t2 ON ids.ID = t2.ID
AND DATEADD(MONTH, -1, e.eoy) BETWEEN t2.interval_beg AND t2.interval_end
LEFT JOIN Table1 t3 ON ids.ID = t3.ID
AND DATEADD(MONTH, 1, e.eoy) BETWEEN t3.interval_beg AND t3.interval_end
WHERE t1.ID IS NULL
)
INSERT INTO Table1 (ID, amount, interval_beg, interval_end, interp)
SELECT ID,
amount,
COALESCE(interval_beg, DATEADD(DAY, -interval_len, interval_end)) AS interval_beg,
COALESCE(interval_end, DATEADD(DAY, interval_len, interval_beg)) AS interval_end,
1 AS interp
FROM MISSING
This adds the following rows to the table:
ID amount interval_beg interval_end interp
2 10 2017-12-05 2018-01-04 1
2 10 2018-12-16 2019-01-16 1
2 10 2019-12-28 2020-01-27 1
Demo on SQLFiddle
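As a follow-up, the interp flag makes it easy to inspect or exclude the generated rows later, for example:
-- list only the interpolated intervals
SELECT ID, amount, interval_beg, interval_end
FROM Table1
WHERE interp = 1
ORDER BY ID, interval_beg;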

Pad rows with default if values not found - PostgreSQL

I want to return the last 7 days of user_activity, but for the empty days I want to add 0 as the value.
Say I have this table
actions | id | date
------------------------
67 | 123 | 2019-07-7
90 | 123 | 2019-07-9
100 | 123 | 2019-07-10
50 | 123 | 2019-07-13
30 | 123 | 2019-07-15
and this should be the expected output , for the last 7 days
actions | id | date
------------------------
90 | 123 | 2019-07-9
100 | 123 | 2019-07-10
0 | 123 | 2019-07-11 <--- padded
0 | 123 | 2019-07-12 <--- padded
50 | 123 | 2019-07-13
0 | 123 | 2019-07-14 <--- padded
30 | 123 | 2019-07-15
Here is my query so far; I can only get the last 7 days, but I'm not sure if it's possible to add the default values:
SELECT *
FROM user_activity
WHERE action_day > CURRENT_DATE - INTERVAL '7 days'
ORDER BY uid, action_day
You may left join your table with generate_series. First you need to build the set of dates for each distinct id; that set can then be correctly joined with the main table.
WITH days
AS (SELECT id,dt
FROM (
SELECT DISTINCT id FROM user_activity
) AS ids
CROSS JOIN generate_series(
CURRENT_DATE - interval '7 days',
CURRENT_DATE, interval '1 day') AS dt
)
SELECT
coalesce(u.actions,0)
,d.id
,d.dt
FROM days d LEFT JOIN user_activity u ON u.id = d.id AND u.action_day = d.dt
DEMO
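A variant of the same idea restricted to a single id (a sketch: id 123 and the actions/action_day column names are taken from the question, and action_day is assumed to be a date column):
SELECT COALESCE(u.actions, 0) AS actions,
       123                    AS id,
       d.dt::date             AS date
FROM generate_series(CURRENT_DATE - interval '7 days',
                     CURRENT_DATE,
                     interval '1 day') AS d(dt)
LEFT JOIN user_activity u
       ON u.id = 123
      AND u.action_day = d.dt::date
ORDER BY d.dt;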

How to calculate the percentage between the last value and the value from X time ago?

Today I tried to play a bit with currencies and gave PostgreSQL a chance to help me a bit.
I have a table in a PostgreSQL database, which has three fields:
CREATE TABLE IF NOT EXISTS binance (
date TIMESTAMP,
symbol VARCHAR(20),
price REAL
)
This table is updated every 10 seconds with ~250 rows. The symbols are always the same between intervals. Example data:
+----------------------------+--------+-------+
| date | symbol | price |
+----------------------------+--------+-------+
| 2018-01-18 00:00:00.000000 | x | 12 |
| 2018-01-18 00:00:00.000120 | y | 15 |
| 2018-01-18 00:00:00.000200 | z | 19 |
| 2018-01-18 00:00:10.080000 | x | 14 |
| 2018-01-18 00:00:10.123000 | y | 16 |
| 2018-01-18 00:00:10.130000 | z | 20 |
+----------------------------+--------+-------+
Now, what I'd like to do is get for each symbol how much did it grow (percentage) in the last 5 minutes.
Let's take a symbol as an example (ETHBTC). Data for this symbol in the last 5 minutes looks like this:
+----------------------------+--------+----------+
| date | symbol | price |
+----------------------------+--------+----------+
| 2018-01-19 22:59:10.000000 | ETHBTC | 0.09082 |
| 2018-01-19 22:58:59.000000 | ETHBTC | 0.0907 |
| 2018-01-19 22:58:47.000000 | ETHBTC | 0.090693 |
| 2018-01-19 22:58:35.000000 | ETHBTC | 0.090697 |
| 2018-01-19 22:58:24.000000 | ETHBTC | 0.090712 |
| 2018-01-19 22:58:11.000000 | ETHBTC | 0.090682 |
| 2018-01-19 22:57:59.000000 | ETHBTC | 0.090774 |
| 2018-01-19 22:57:48.000000 | ETHBTC | 0.090672 |
| 2018-01-19 22:57:35.000000 | ETHBTC | 0.09075 |
| 2018-01-19 22:57:24.000000 | ETHBTC | 0.090727 |
| 2018-01-19 22:57:12.000000 | ETHBTC | 0.090705 |
| 2018-01-19 22:57:00.000000 | ETHBTC | 0.090707 |
| 2018-01-19 22:56:49.000000 | ETHBTC | 0.090646 |
| 2018-01-19 22:56:37.000000 | ETHBTC | 0.090645 |
| 2018-01-19 22:56:25.000000 | ETHBTC | 0.090636 |
| 2018-01-19 22:56:13.000000 | ETHBTC | 0.090696 |
| 2018-01-19 22:56:00.000000 | ETHBTC | 0.090698 |
| 2018-01-19 22:55:48.000000 | ETHBTC | 0.090693 |
| 2018-01-19 22:55:37.000000 | ETHBTC | 0.090698 |
| 2018-01-19 22:55:25.000000 | ETHBTC | 0.090601 |
| 2018-01-19 22:55:13.000000 | ETHBTC | 0.090644 |
| 2018-01-19 22:55:01.000000 | ETHBTC | 0.0906 |
| 2018-01-19 22:54:49.000000 | ETHBTC | 0.0906 |
| 2018-01-19 22:54:37.000000 | ETHBTC | 0.09062 |
| 2018-01-19 22:54:25.000000 | ETHBTC | 0.090693 |
+----------------------------+--------+----------+
To select this data I'm using the following query:
SELECT *
FROM binance
WHERE date >= NOW() AT TIME ZONE 'EET' - INTERVAL '5 minutes'
AND symbol = 'ETHBTC'
ORDER BY date DESC;
What I'd like to do is to find out for every symbol:
the percentage between the last value and the value from 10s ago
the percentage between the last value and the value from 1min ago
the percentage between the last value and the value from 5mins ago
Now, I'm kind of stuck on how a query like this should look. Also, I don't know if this is important or not, but the queries are run from within Python, so I might not be able to take advantage of the full PostgreSQL functionality.
Demo
Rextester online demo: http://rextester.com/QNVGU31219
SQL
Below is the SQL for comparing the latest price with the price 1 minute ago:
WITH cte AS
(SELECT price,
ABS(EXTRACT(EPOCH FROM (
SELECT date - (SELECT MAX(date) - INTERVAL '1 minute' FROM binance))))
AS secs_from_prev_timestamp
FROM binance
WHERE symbol = 'ETHBTC')
SELECT price /
(SELECT price FROM binance
WHERE symbol = 'ETHBTC' AND date = (SELECT MAX(date) FROM binance))
* 100.0 AS percentage_difference
FROM cte
WHERE secs_from_prev_timestamp = (SELECT MIN(secs_from_prev_timestamp) FROM cte);
The above can be simply changed to compare with the price from a different interval ago, e.g. by changing to INTERVAL '5 minutes' instead of INTERVAL '1 minute', or to give results for a different symbol by changing the two references to 'ETHBTC' to a different symbol.
Explanation
The tricky bit is getting the previous price. This is done by using a common table expression (CTE), which lists all the prices and the number of seconds away from the desired timestamp. The absolute value function is used (ABS) so the nearest one will be found, regardless of whether it is more or less than the target timestamp.
Results
In the one example above, the query gives a result of 99.848...%. This is formulated from 0.090682 / 0.09082 * 100.0, where 0.09082 is the latest price and 0.090682 is the price one minute ago.
The above was based on an assumption of what was meant by "percentage difference" but there are alternative percentages that could be calculated - e.g. 0.09082 is 0.152% higher than 0.090682. (Please reply in the comments if my interpretation of percentage difference wasn't what you are after and I'll update the answer accordingly.)
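For reference, both interpretations can be worked out directly from the two sample prices:
SELECT 0.090682 / 0.09082 * 100.0              AS ratio_pct,   -- ~ 99.848: price 1 minute ago as a % of the latest price
       (0.09082 - 0.090682) / 0.090682 * 100.0 AS change_pct;  -- ~  0.152: the latest price is 0.152% higher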
UPDATE - "do it all" query
After reading your comments to Dan's answer that you would like to get all these results using a single query, I've posted one below that should do what is required. Rextester demo here: http://rextester.com/QDUN45907
WITH cte2 AS
(WITH cte1 AS
(SELECT symbol,
price,
ABS(EXTRACT(EPOCH FROM (
SELECT date - (SELECT MAX(date) - INTERVAL '10 seconds' FROM binance))))
AS secs_from_latest_minus_10,
ABS(EXTRACT(EPOCH FROM (
SELECT date - (SELECT MAX(date) - INTERVAL '1 minute' FROM binance))))
AS secs_from_latest_minus_60,
ABS(EXTRACT(EPOCH FROM (
SELECT date - (SELECT MAX(date) - INTERVAL '5 minutes' FROM binance))))
AS secs_from_latest_minus_300
FROM binance)
SELECT symbol,
(SELECT price AS latest_price
FROM binance b2
WHERE b2.symbol = b.symbol AND date = (SELECT MAX(date) FROM binance)),
(SELECT price AS price_latest_minus_10
FROM cte1
WHERE cte1.symbol = b.symbol AND secs_from_latest_minus_10 =
(SELECT MIN(secs_from_latest_minus_10) FROM cte1)),
(SELECT price AS price_latest_minus_60
FROM cte1
WHERE cte1.symbol = b.symbol AND secs_from_latest_minus_60 =
(SELECT MIN(secs_from_latest_minus_60) FROM cte1)),
(SELECT price AS price_latest_minus_300
FROM cte1
WHERE cte1.symbol = b.symbol AND secs_from_latest_minus_300 =
(SELECT MIN(secs_from_latest_minus_300) FROM cte1))
FROM binance b
GROUP BY symbol)
SELECT symbol,
price_latest_minus_10 / latest_price * 100.0 AS percentage_diff_10_secs_ago,
price_latest_minus_60 / latest_price * 100.0 AS percentage_diff_1_minute_ago,
price_latest_minus_300 / latest_price * 100.0 AS percentage_diff_5_minutes_ago
FROM cte2;
To get a Relative Percentage for three different times in a row you have to join each case for every time, in this case 10s / 1min / 5 mins.
Here is the query, NOTE that the JOIN is ON id. You need a primary key or a unique value for this JOIN to work properly:
-- Overall SELECT, '*' includes 5min
SELECT a.*,b."1min",c."10sec"
FROM
-- First we select the group with most rows, that are <=5min
(SELECT *,
-- Formula for the percentage
100*price/last_value(price)
OVER (PARTITION BY symbol
ORDER BY date DESC rows between unbounded preceding and
unbounded following) AS "5min"
FROM test
WHERE date >= NOW() AT TIME ZONE 'EET' - INTERVAL '5 minutes'
ORDER BY symbol,date DESC)a
LEFT JOIN
-- Join with 1 minute query
(SELECT *,
-- Formula for the percentage
100*price/last_value(price)
OVER (PARTITION BY symbol
ORDER BY date DESC rows between unbounded preceding and
unbounded following) AS "1min"
FROM test
WHERE date >= NOW() AT TIME ZONE 'EET' - INTERVAL '1 minutes'
ORDER BY symbol,date DESC)b
-- join with id (primary or unique)
ON a.id = b.id
-- Join with 30 seconds query
LEFT JOIN
(SELECT *,
-- Formula for the percentage
100*price/last_value(price)
OVER (PARTITION BY symbol
ORDER BY date DESC rows between unbounded preceding and
unbounded following) AS "10sec"
FROM test
WHERE date >= NOW() AT TIME ZONE 'EET' - INTERVAL '30 seconds'
ORDER BY symbol,date DESC)c
-- join with id (primary or unique)
ON a.id=c.id
In this query you can alter the formula for the percentage and the time according to your needs. If you'd like the percentage to be relative to another value, like a master price, it will have to be included in each query and added to the formula in place of last_value(price) OVER .... Keep in mind that the formula as written gets the percentage relative to the oldest row in the query.
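A sketch of that substitution (master_price is an assumed reference column, not part of the original schema):
SELECT *,
       -- percentage relative to an assumed reference value instead of the oldest row
       100 * price / master_price AS "5min"
FROM test
WHERE date >= NOW() AT TIME ZONE 'EET' - INTERVAL '5 minutes'
ORDER BY symbol, date DESC;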
Percent Rank:
This query gives a percentage from 0 to 1 for the rows in a query, 0 being the first row and 1 the last one.
For example:
date |symbol |price | percentage
-----------+--------+------+-------------
2017-01-05 | 1 | 0.5 | 1
2017-01-04 | 1 | 1.5 | 0.5
2017-01-03 | 1 | 1 | 0
2017-01-05 | 2 | 1 | 1
2017-01-04 | 2 | 3 | 0.5
2017-01-03 | 2 | 2 | 0
This is the query:
SELECT *,
-- this makes a column with the percentage per row
percent_rank() OVER (PARTITION BY symbol ORDER BY date) AS percent
FROM binance
WHERE date >= NOW() AT TIME ZONE 'EET' - INTERVAL '5 minutes'
ORDER BY symbol,date DESC;
Relative Percentage:
This query shows the percentage relative to the oldest price value in the data set.
For example:
date       | symbol | price | percentage
-----------+--------+-------+-----------
2017-01-05 | 1 | 0.5 | 50
2017-01-04 | 1 | 1.5 | 150
2017-01-03 | 1 | 1 | 100
2017-01-05 | 2 | 1 | 50
2017-01-04 | 2 | 3 | 150
2017-01-03 | 2 | 2 | 100
The query is:
SELECT *,
-- Formula to get the percentage taking the price from the oldest date:
100*price/last_value(price) OVER (PARTITION BY symbol ORDER BY date DESC rows between unbounded preceding and unbounded following) AS percentage
FROM binance
WHERE date >= NOW() AT TIME ZONE 'EET' - INTERVAL '5 minutes'
ORDER BY symbol,date DESC;