PostgreSQL: running count of rows for a query 'by minute' - sql

I need to query for each minute the total count of rows up to that minute.
The best I could achieve so far doesn't do the trick. It returns the count per minute, not the running total up to each minute:
SELECT COUNT(id) AS count
, EXTRACT(hour from "when") AS hour
, EXTRACT(minute from "when") AS minute
FROM mytable
GROUP BY hour, minute

Return only minutes with activity
Shortest
SELECT DISTINCT
date_trunc('minute', "when") AS minute
, count(*) OVER (ORDER BY date_trunc('minute', "when")) AS running_ct
FROM mytable
ORDER BY 1;
Use date_trunc(); it returns exactly what you need.
Don't include id in the query, since you want to GROUP BY minute slices.
count() is typically used as a plain aggregate function. Appending an OVER clause makes it a window function. Omit PARTITION BY in the window definition - you want a running count over all rows. By default, that counts from the first row to the last peer of the current row as defined by ORDER BY. The manual:
The default framing option is RANGE UNBOUNDED PRECEDING, which is the
same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY,
this sets the frame to be all rows from the partition start up
through the current row's last ORDER BY peer.
And that happens to be exactly what you need.
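For illustration, the default frame spelled out explicitly - equivalent to the shorthand in the query above:
count(*) OVER (ORDER BY date_trunc('minute', "when")
               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_ct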
Use count(*) rather than count(id). It better fits your question ("count of rows"). It is generally slightly faster than count(id). And, while we might assume that id is NOT NULL, it has not been specified in the question, so count(id) is wrong, strictly speaking, because NULL values are not counted with count(id).
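A quick demonstration of the difference, using an ad-hoc VALUES list with a NULL id:
SELECT count(*) AS ct_rows, count(id) AS ct_ids
FROM (VALUES (1), (NULL), (2)) t(id);
-- ct_rows = 3, ct_ids = 2: the NULL row is not counted by count(id)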
You can't GROUP BY minute slices at the same query level. Aggregate functions are applied before window functions, the window function count(*) would only see 1 row per minute this way.
You can, however, SELECT DISTINCT, because DISTINCT is applied after window functions.
ORDER BY 1 is just shorthand for ORDER BY date_trunc('minute', "when") here.
1 is a positional reference to the 1st expression in the SELECT list.
Use to_char() if you need to format the result. Like:
SELECT DISTINCT
to_char(date_trunc('minute', "when"), 'DD.MM.YYYY HH24:MI') AS minute
, count(*) OVER (ORDER BY date_trunc('minute', "when")) AS running_ct
FROM mytable
ORDER BY date_trunc('minute', "when");
Fastest
SELECT minute, sum(minute_ct) OVER (ORDER BY minute) AS running_ct
FROM (
   SELECT date_trunc('minute', "when") AS minute
        , count(*) AS minute_ct
   FROM tbl
   GROUP BY 1
   ) sub
ORDER BY 1;
Much like the above, but:
I use a subquery to aggregate and count rows per minute. This way we get 1 row per minute without DISTINCT in the outer SELECT.
Use sum() as window aggregate function now to add up the counts from the subquery.
I found this to be substantially faster with many rows per minute.
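To compare the variants on your own system, a minimal test setup might look like this (hypothetical schema and row volume; mytable and tbl stand for the same table in the queries above):
CREATE TABLE mytable (id serial PRIMARY KEY, "when" timestamp NOT NULL);
INSERT INTO mytable ("when")
SELECT timestamp '2024-01-01 00:00' + random() * interval '2 hours'
FROM generate_series(1, 100000);  -- ~830 rows per minute on average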
Include minutes without activity
Shortest
@GabiMe asked in a comment how to get one row for every minute in the time frame, including minutes where no event occurred (no row in the base table):
SELECT DISTINCT
       minute
     , count(c.minute) OVER (ORDER BY minute) AS running_ct
FROM (
   SELECT generate_series(date_trunc('minute', min("when"))
                        , max("when")
                        , interval '1 min')
   FROM tbl
   ) m(minute)
LEFT JOIN (SELECT date_trunc('minute', "when") FROM tbl) c(minute) USING (minute)
ORDER BY 1;
Generate a row for every minute in the time frame between the first and the last event with generate_series() - here directly based on aggregated values from the subquery.
LEFT JOIN to all timestamps truncated to the minute and count. NULL values (where no row exists) do not add to the running count.
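The distinction between the two counts matters here:
count(c.minute) OVER (ORDER BY minute)  -- counts actual events; NULLs from the LEFT JOIN are ignored
count(*)        OVER (ORDER BY minute)  -- would add 1 for every generated minute, even empty ones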
Fastest
With CTE:
WITH cte AS (
   SELECT date_trunc('minute', "when") AS minute, count(*) AS minute_ct
   FROM tbl
   GROUP BY 1
   )
SELECT m.minute
     , COALESCE(sum(cte.minute_ct) OVER (ORDER BY m.minute), 0) AS running_ct
FROM (
   SELECT generate_series(min(minute), max(minute), interval '1 min')
   FROM cte
   ) m(minute)
LEFT JOIN cte USING (minute)
ORDER BY 1;
Again, aggregate and count rows per minute in the first step, which removes the need for DISTINCT in the outer SELECT.
Unlike count(), sum() can return NULL. Default to 0 with COALESCE.
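A minimal demonstration of that NULL behavior:
SELECT sum(x) AS plain, COALESCE(sum(x), 0) AS with_default
FROM (VALUES (NULL::int)) t(x);
-- plain = NULL, with_default = 0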
With many rows and an index on "when" this version with a subquery was fastest among a couple of variants I tested with Postgres 9.1 - 9.4:
SELECT m.minute
     , COALESCE(sum(c.minute_ct) OVER (ORDER BY m.minute), 0) AS running_ct
FROM (
   SELECT generate_series(date_trunc('minute', min("when"))
                        , max("when")
                        , interval '1 min')
   FROM tbl
   ) m(minute)
LEFT JOIN (
   SELECT date_trunc('minute', "when") AS minute
        , count(*) AS minute_ct
   FROM tbl
   GROUP BY 1
   ) c USING (minute)
ORDER BY 1;

Related

How do I select data every second with PostgreSQL?

I've got a SQL query that selects all data between two dates, and now I would like to add a time-scale factor so that instead of returning all the data it returns one row every second, minute, or hour.
Do you know how I can achieve this?
My query:
"SELECT received_on, $1 FROM $2 WHERE $3 <= received_on AND received_on <= $4", [data_selected, table_name, date_1, date_2]
The table input:
As you can see, there are several rows within the same second; I would like to select only one per second.
If you want to select one row per second, you can use the ROW_NUMBER() function partitioned by received_on, as in the following:
WITH DateGroups AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY received_on ORDER BY adc_v) AS rn
FROM table_name
)
SELECT received_on, adc_v, adc_i, acc_axe_x, acc_axe_y, acc_axe_z
FROM DateGroups
WHERE rn=1
ORDER BY received_on
If you want to select data every minute or hour, you can use the extract function to get the number of epoch seconds in received_on, then divide by 60 to get minutes or by 3600 to get hours.
epoch: For date and timestamp values, the number of seconds since 1970-01-01 00:00:00-00 (can be negative); for interval values, the total number of seconds in the interval
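If you need a readable label for each bucket, the epoch value can be turned back into a timestamp; a sketch, using the question's received_on and table_name (note that to_timestamp() returns timestamptz):
SELECT to_timestamp(floor(extract(epoch from received_on) / 60) * 60) AS minute_bucket
FROM table_name;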
Group by minutes:
WITH DateGroups AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY floor(extract(epoch from (received_on)) / 60) ORDER BY adc_v) AS rn
FROM table_name
)
SELECT received_on, adc_v, adc_i, acc_axe_x, acc_axe_y, acc_axe_z
FROM DateGroups
WHERE rn=1
ORDER BY received_on
Group by hours:
WITH DateGroups AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY floor(extract(epoch from (received_on)) / (60*60)) ORDER BY adc_v) AS rn
FROM table_name
)
SELECT received_on, adc_v, adc_i, acc_axe_x, acc_axe_y, acc_axe_z
FROM DateGroups
WHERE rn=1
ORDER BY received_on
See a demo.
When there are several rows per second and you only want one result row per second, you can decide to pick one of the rows for each second. This can be a randomly chosen row, or you can pick the row with the greatest or least value in a column, as shown in Ahmed's answer.
It would be more typical, though, to aggregate your data per second. The columns show figures and you are interested in those figures. Your sample data shows the value 2509 twice and the value 2510 three times for the adc_v column at 2022-07-29 15:52. Consider what you would like to see. Maybe you don't want this value to go below some boundary, so you show the minimum value MIN(adc_v) to see how low it went within the second. Or you want to see the value that occurred most often in the second, MODE(adc_v). Or you'd like to see the average value AVG(adc_v). Make this decision for every value, so as to get the information most vital to you.
select
received_on,
min(adc_v),
avg(adc_i),
...
from mytable
group by received_on
order by received_on;
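If you go the MODE route, note that in PostgreSQL mode() is an ordered-set aggregate and requires a WITHIN GROUP clause; a sketch:
select received_on,
       mode() within group (order by adc_v) as most_frequent_adc_v
from mytable
group by received_on
order by received_on;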
If you want this for another interval, say an hour instead of a second, truncate your received_on column accordingly. E.g.:
select
date_trunc('hour', received_on) as received_hour,
min(adc_v),
avg(adc_i),
...
from mytable
group by date_trunc('hour', received_on)
order by date_trunc('hour', received_on);

How to reference fields from table created in sub-query's of large JOIN

I am writing a large query with many JOINs (shortened in the example here) and I am trying to reference values from other sub-queries, but I can't figure out how.
This is my example query:
DROP TABLE IF EXISTS breakdown;
CREATE TEMP TABLE breakdown AS
SELECT * FROM
(
   SELECT COUNT(DISTINCT s_id) AS before, date_trunc('day', time) AS day
   FROM table_a
   WHERE date_trunc('sec', earliest) < date_trunc('sec', time)
   GROUP BY day
)
JOIN
(
   SELECT ROUND(before * 100.0 / total, 1) AS Percent_1, day
   FROM breakdown
   GROUP BY day
) USING (day)
JOIN
(
   SELECT COUNT(DISTINCT s_id) AS equal, date_trunc('day', time) AS day
   FROM table_a
   WHERE date_trunc('sec', earliest) = date_trunc('sec', time)
   GROUP BY day
) USING (day)
JOIN
(
   SELECT COUNT(DISTINCT s_id) AS after, date_trunc('day', time) AS day
   FROM table_a
   WHERE date_trunc('sec', earliest) > date_trunc('sec', time)
   GROUP BY day
) USING (day)
JOIN
(
   SELECT COUNT(DISTINCT s_id) AS total, date_trunc('day', earliest) AS day
   FROM first
   GROUP BY 2
) USING (day)
ORDER BY day;

SELECT * FROM breakdown ORDER BY day;
The last subquery gives me the total, and for each of the previous subqueries I want to get the percentage as well.
I found the code for getting the percentage (second JOIN) but I don't know how to reference the values from the other tables.
E.g., for getting the percentage from the first query, I want to take the COUNT of the first query, which I renamed before, and divide it by the COUNT of the last query, which I renamed total. (If there is an easier way to get the percentage for each of the sub-queries, please let me know.) But I can't seem to find how to reference them. I tried adding AS x to the end of each subquery and referencing by that (x.total), as well as referencing via the parent table (breakdown.total), but neither worked.
How can I do this without changing my query too much, as it is a long query with a lot of sub-queries?
This is what my table looks like; I would like to add a percentage for each column.
Using Redshift, BTW.
Thanks
I'm a little confused by all that is going on, as you drop table breakdown and then, in the second subquery of the CREATE TABLE, reference breakdown. I suspect there are some issues in the provided SQL sample. Please update the question if so.
For a number of these subqueries it looks like you are using a subquery where a CASE expression will do. In Redshift you don't want to scan the same table over and over if you can prevent it. For example, the 3rd and 4th subqueries can be replaced with a single query. In these simple cases I like to use DECODE() rather than CASE since it is more readable.
(
   SELECT COUNT(DISTINCT s_id) AS equal, date_trunc('day', time) AS day
   FROM table_a
   WHERE date_trunc('sec', earliest) = date_trunc('sec', time)
   GROUP BY day
) USING (day)
JOIN
(
   SELECT COUNT(DISTINCT s_id) AS after, date_trunc('day', time) AS day
   FROM table_a
   WHERE date_trunc('sec', earliest) > date_trunc('sec', time)
   GROUP BY day
)
Becomes:
(
   SELECT COUNT(DISTINCT DECODE(date_trunc('sec', earliest) = date_trunc('sec', time), true, s_id, NULL)) AS equal,
          COUNT(DISTINCT DECODE(date_trunc('sec', earliest) > date_trunc('sec', time), true, s_id, NULL)) AS after,
          date_trunc('day', time) AS day
   FROM table_a
   GROUP BY day
)
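The same can be written with a standard CASE expression, should you ever need it outside Redshift:
(
   SELECT COUNT(DISTINCT CASE WHEN date_trunc('sec', earliest) = date_trunc('sec', time) THEN s_id END) AS equal,
          COUNT(DISTINCT CASE WHEN date_trunc('sec', earliest) > date_trunc('sec', time) THEN s_id END) AS after,
          date_trunc('day', time) AS day
   FROM table_a
   GROUP BY day
)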
Read each table once (if at all possible) and calculate the desired results in that pass. Then you will have all your values in one layer of the query and can reference these new values. This will be faster (especially on Redshift).
=============================
Expanding based on a comment made by the poster.
It appears that using DECODE() and referencing derived columns in a single query can produce what you want. I don't have your data so I cannot test this, but here is what I'd move to:
SELECT
    date_trunc('day', time) AS day,  -- include the grouping day in the output
    COUNT(DISTINCT DECODE(date_trunc('sec', earliest) < date_trunc('sec', time), true, s_id)) AS before,
    COUNT(DISTINCT DECODE(date_trunc('sec', earliest) = date_trunc('sec', time), true, s_id)) AS equal,
    COUNT(DISTINCT DECODE(date_trunc('sec', earliest) > date_trunc('sec', time), true, s_id)) AS after,
    COUNT(DISTINCT s_id) AS total,
    -- lateral alias references; before and total are defined above
    ROUND(before * 100.0 / total, 1) AS Percent_1
FROM table_a
GROUP BY date_trunc('day', time);
This should be a complete replacement for the SELECT currently inside your CREATE TEMP TABLE. However, I don't have sample data so this is untested.

Time difference between two rows for specified ID

I'm trying to find the time difference in seconds between two rows that have the same ID.
Here's a simple table.
The table is ordered by myid and timestamp. I'm trying to get the total seconds between two rows that have the same myid.
Here's what I have come up with. The only problem with this query is that it calculates the time difference between consecutive rows across the whole table, not just between rows with the same ID.
SELECT DATEDIFF(second, pTimeStamp, TimeStamp), q.*
FROM (
   SELECT *,
          LAG(TimeStamp) OVER (ORDER BY TimeStamp) pTimeStamp
   FROM data
) q
WHERE pTimeStamp IS NOT NULL
This is the output.
I only want the output highlighted in yellow.
Any suggestions?
SQLFIDDLE
The fix is simply a matter of narrowing the window, with PARTITION BY, to rows with the same ID:
SELECT DATEDIFF(second, pTimeStamp, TimeStamp), q.*
FROM (
   SELECT *,
          LAG(TimeStamp) OVER (PARTITION BY ID ORDER BY TimeStamp) pTimeStamp
   FROM data
) q
WHERE pTimeStamp IS NOT NULL
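Side note: DATEDIFF() is SQL Server syntax. In PostgreSQL the same idea could be expressed with EXTRACT(EPOCH FROM ...); a sketch, untested against the fiddle data:
SELECT EXTRACT(EPOCH FROM (TimeStamp - pTimeStamp)) AS diff_seconds, q.*
FROM (
   SELECT *,
          LAG(TimeStamp) OVER (PARTITION BY ID ORDER BY TimeStamp) AS pTimeStamp
   FROM data
) q
WHERE pTimeStamp IS NOT NULL;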

SQL Statement Only latest entry of the day

It seems it has been too long since I last needed to create my own SQL statements. I have a table (GAS_COUNTER) with timestamps (TS) and values (VALUE).
There are hundreds of entries per day, but I only need the latest of each day. I tried different ways but never got what I need.
Edit
Thanks for the fast replies, but some do not meet my needs (I need the latest value of each day in the table) and some don't work. My best attempt so far was:
select distinct (COUNT)
from
   (select
       extract (DAY_OF_YEAR from TS) as COUNT,
       extract (YEAR from TS) as YEAR,
       extract (MONTH from TS) as MONTH,
       extract (DAY from TS) as DAY,
       VALUE as VALUE
    from GAS_COUNTER
    order by COUNT)
but the value is missing. If I put it in the first select, all rows return (logically correct, as every line is distinct).
Here is an example of the table content:
TS VALUE
2015-07-25 08:47:12.663 0.0
2015-07-25 22:50:52.155 2.269999999552965
2015-08-10 11:18:07.667 52.81999999284744
2015-08-10 20:29:20.875 53.27999997138977
2015-08-11 10:27:21.49 54.439999997615814
2nd Edit and solution
select TS, VALUE from GAS_COUNTER
where TS in (
select max(TS) from GAS_COUNTER group by extract(DAY_OF_YEAR from TS)
)
This one would give you the very last record:
select top 1 * from GAS_COUNTER order by TS desc
Here is one that would give you last records for every day:
select VALUE from GAS_COUNTER
where TS in (
select max(TS) from GAS_COUNTER group by to_date(TS,'yyyy-mm-dd')
)
Depending on the database you are using, you might need to replace/adjust the to_date(TS,'yyyy-mm-dd') function. Basically, it should extract the date-only part from the timestamp.
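In PostgreSQL, for example, a plain cast to date does the job; a sketch:
select VALUE from GAS_COUNTER
where TS in (
   select max(TS) from GAS_COUNTER group by TS::date
);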
Select the max value for the timestamp.
select MAX(TS), value -- or whatever other columns you want from the record
from GAS_COUNTER
group by value
Something like this would window the data and give you the last value on the day - but what happens if you get two TS the same? Which one do you want?
select *
from ( select distinct cast(TS as date) as dt
       from GAS_COUNTER ) as gc1         -- distinct days
cross apply (
       select top 1 VALUE                -- last value on the date
       from GAS_COUNTER as gc2
       where gc2.TS < dateadd( day, 1, gc1.dt )
         and gc2.TS >= gc1.dt
       order by gc2.TS desc
     ) as x
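CROSS APPLY and TOP are SQL Server syntax; in PostgreSQL the equivalent pattern would use LATERAL and LIMIT - a sketch under the same assumptions:
select *
from ( select distinct cast(TS as date) as dt
       from GAS_COUNTER ) as gc1         -- distinct days
cross join lateral (
       select VALUE                      -- last value on the date
       from GAS_COUNTER as gc2
       where gc2.TS >= gc1.dt
         and gc2.TS < gc1.dt + interval '1 day'
       order by gc2.TS desc
       limit 1
     ) as x;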

Aggregates for today and the previous day depending on data

Having trouble putting together a query to pull the aggregate values of a given timestamp and the timestamp before it. Given the following schema:
name TEXT,
ts TIMESTAMP,
X NUMERIC,
Y NUMERIC
where there are gaps in the ts column due to gaps in the data, I'm trying to construct a query to produce:
name,
date_trunc('day', q1.ts),
avg(q1.X),
sum(q1.Y),
date_trunc('day', q2.ts),
avg(q2.X),
sum(q2.Y)
The first half is straightforward:
SELECT q1.name, date_trunc('day', q1.ts), avg(q1.X), sum(q1.Y)
FROM data as q1
GROUP BY 1, 2
ORDER BY 1, 2;
But I'm not sure how to generate the relation that finds the "day" before for each row. I'm trying to work out an inner join like this:
SELECT q1.name, q1.day, q1.avg, q1.sum, q2.day, q2.avg, q2.sum
FROM (
   SELECT name, date_trunc('day', ts) AS day, avg(X) AS avg, sum(Y) AS sum
   FROM data
   GROUP BY 1, 2
   ORDER BY 1, 2
) q1
INNER JOIN (
   SELECT name, date_trunc('day', ts) AS day, avg(X) AS avg, sum(Y) AS sum
   FROM data
   GROUP BY 1, 2
   ORDER BY 1, 2
) q2 ON (
   q1.name = q2.name
   AND q2.day = q1.day - interval '1 day'
);
The problem with this is that it doesn't cover cases where the previous "day" is more than 1 day before the current day.
The special difficulty here is that you need to number days after aggregating rows. You can do this in a single query level with the window function row_number(), since window functions are applied after aggregation by GROUP BY.
Also, use a CTE to avoid executing the same subquery multiple times:
WITH q AS (
   SELECT name, ts::date AS day
        , avg(x) AS avg_x, sum(y) AS sum_y
        , row_number() OVER (PARTITION BY name ORDER BY ts::date) AS rn
   FROM data
   GROUP BY 1, 2
   )
SELECT q1.name, q1.day, q1.avg_x, q1.sum_y
     , q2.day AS day2, q2.avg_x AS avg_x2, q2.sum_y AS sum_y2
FROM q q1
LEFT JOIN q q2 ON q1.name = q2.name
              AND q1.rn = q2.rn + 1
ORDER BY 1, 2;
Using the simpler cast to date (ts::date) instead of date_trunc('day', ts) to get "days".
LEFT [OUTER] JOIN (as opposed to [INNER] JOIN) is instrumental to preserve the corner case of the first row, where there is no previous day.
And ORDER BY should be applied to the outer query.
The question isn't crystal clear, but it sounds like you're actually trying to fill gaps while keeping track of leading/lagging rows.
To fill the gaps, look into generate_series() and left join it with your table:
select d
from generate_series(timestamp '2013-12-01', timestamp '2013-12-31', interval '1 day') d;
http://www.postgresql.org/docs/current/static/functions-srf.html
For previous and next row values, look into lead() and lag() window functions:
select date_trunc('day', ts) as curr_row_day,
lag(date_trunc('day', ts)) over w as prev_row_day
from data
window w as (order by ts)
http://www.postgresql.org/docs/current/static/tutorial-window.html
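Putting the two together - a sketch that fills gaps with generate_series() and carries the previous day's average with lag(), assuming the data table from the question (days without rows show NULL aggregates):
select d.day,
       avg(data.x) as avg_x,
       lag(avg(data.x)) over (order by d.day) as prev_avg_x
from generate_series(timestamp '2013-12-01', timestamp '2013-12-31',
                     interval '1 day') d(day)
left join data on data.ts >= d.day
              and data.ts < d.day + interval '1 day'
group by d.day
order by d.day;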