Populating empty bins in a histogram generated using SQL - sql

In Redshift I can create a histogram – in this case it's binning a column named metric into 100ms buckets
select floor(metric / 100) * 100 as bin, count(*) as impressions
from tablename
where epoch > date_part(epoch, dateadd(day, -1, sysdate))
and metric is not null
group by bin
order by bin
There's a danger that some of the bins might be empty and won't appear in the result set, so I want to use generate_series to create the empty bins e.g.
select *, 0 as impressions from generate_series(0, maxMetricValue, 100) as bin
and union the two sets of results together to produce the 'full' histogram
select bin, sum(impressions)
from
(
select floor(metric/100)*100 as bin, count(*) as impressions
from tablename
where epoch > date_part(epoch, dateadd(day, -1, sysdate))
and metric is not null
group by bin
union
select *, 0 as impressions from generate_series(0, maxMetricValue, 100) as bin
)
group by bin
order by bin
The challenge is that calculating the maxMetricValue requires a subquery i.e. select max(metric)… etc and I'd like to avoid that
Is there a way I can calculate the max value from the histogram query and use that instead?
Edit:
Something like this seems along the right lines but Redshift doesn't like it
with histogram as (
select cast(floor(metric/100)*100 as integer) as bin, count(*) as impressions
from tablename
where epoch > date_part(epoch, dateadd(day, -1, sysdate))
and metric is not null
group by bin
order by bin)
select bin, sum(impressions)
from (
select * from histogram
union
select *, 0 as impressions from generate_series(0, (select max(bin) from histogram), 100) as bin
)
group by bin
order by bin
I get this error, but there are no INFO messages visible: ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables.
If I remove the cast I get: ERROR: function generate_series(integer, double precision, integer) does not exist Hint: No function matches the given name and argument types. You may need to add explicit type casts.
If I try using cast or convert in the parameter for generate_series I get the first error again!
Edit 2:
I presume the above query is failing because Redshift is trying to execute generate_series on a compute node rather than the leader node, but I'm not sure.

First off, generate_series is a leader-node-only function and will throw an error when used in combination with user data. A recursive CTE is the way to do this, but since this isn't what you want I won't get into it.
You could create a numbers table and calculate the min, max and count from the other data you know. You could then outer join on some condition that will never match.
However, I expect you will be much better off with the UNION ALL you already have.
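For completeness, a rough sketch of the recursive-CTE route is below (untested; the 10000 ceiling is a placeholder rather than something from the question, and the table/column names are taken from the question):
with recursive bins(bin) as (
select 0
union all
select bin + 100
from bins
where bin + 100 <= 10000 -- placeholder ceiling; replace with whatever maximum makes sense for metric
),
histogram as (
select cast(floor(metric / 100) * 100 as integer) as bin, count(*) as impressions
from tablename
where epoch > date_part(epoch, dateadd(day, -1, sysdate))
and metric is not null
group by 1
)
select b.bin, coalesce(h.impressions, 0) as impressions
from bins b
left join histogram h on h.bin = b.bin
order by b.bin;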

Related

How to compute window function for each nth row in Presto?

I am working with a table that contains timeseries data, with a row for each minute for each user.
I want to compute some aggregate functions on a rolling window of N calendar days.
This is achieved via
SELECT
SOME_AGGREGATE_FUN(col) OVER (
PARTITION BY user_id
ORDER BY timestamp
ROWS BETWEEN (60 * 24 * N) PRECEDING AND CURRENT ROW
) as my_col
FROM my_table
However, I am only interested in the result of this at a daily scale.
i.e. I want the window to be computed only at 00:00:00, but I want the window itself to contain all the minute-by-minute data to be passed into my aggregate function.
Right now I am doing this:
WITH agg_results AS (
SELECT
SOME_AGGREGATE_FUN(col) OVER (
PARTITION BY user_id
ORDER BY timestamp_col
ROWS BETWEEN (60 * 24 * N) PRECEDING AND CURRENT ROW
)
FROM my_table
)
SELECT * FROM agg_results
WHERE
timestamp_col = DATE_TRUNC('day', "timestamp_col")
This works in theory, but it does 60 * 24 times more computations than necessary, which makes the query very slow.
Essentially, I am trying to find a way to make the right window bound skip rows based on a condition. Or, if it is simpler to implement, for every nth row (as I have a constant number of rows for each day).
I don't think that's possible with window functions. You could switch to a subquery instead, assuming that your aggregate function works as a regular aggregate function too (that is, without an OVER() clause):
select
timestamp_col,
(
select some_aggregate_fun(t1.col)
from my_table t1
where
t1.user_id = t.user_id
and t1.timestamp_col >= t.timestamp_col - interval '1' day
and t1.timestamp_col <= t.timestamp_col
)
from my_table t
where timestamp_col = date_trunc('day', timestamp_col)
I am unsure that this would perform better than your original query though; you might need to assess that against your actual dataset.
You can change interval '1' day to the actual interval you want to use.
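For example, for a 7-day window the two correlated conditions would become (same hypothetical names as above):
and t1.timestamp_col >= t.timestamp_col - interval '7' day
and t1.timestamp_col <= t.timestamp_col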

AWS Redshift column "view_table_B.cost" must appear in the GROUP BY clause or be used in an aggregate function

I have 2 queries in AWS Redshift; the queries target different tables with similar schemas. My issue is that one of the queries works while the other fails.
First Query
SELECT view_table_A.accountId, view_table_A.date, SUM(view_table_A.cost) as Cost
FROM view_table_A
GROUP BY accountId, date
HAVING Cost >= '20'
Second Query
SELECT view_table_B.projectname, view_table_B.usagedate, sum(view_table_B.cost) as Cost
FROM view_table_B
GROUP BY projectname, usagedate
HAVING Cost >= '20'
My problem is that the first query works well, while the second query returns the error below:
Amazon Invalid operation: column "view_table_B .cost" must appear in the GROUP BY clause or be used in an aggregate function;
Update-1
I tried removing the ' quotes from the query but still get the same result. I've attached a screenshot of the query I tried to execute in Redshift.
Redshift identifiers are case insensitive, therefore cost and Cost collide in your query.
I was able to reproduce the problem with:
with src(cost, dat) as (
select 1, current_date
union all
select 2, current_date
)
SELECT
dat,
sum(s.cost) as Cost
FROM src s
GROUP BY dat
HAVING Cost = 3
;
it's giving me
[2020-06-04 11:22:44] [42803][500310] Amazon Invalid operation: column "s.cost" must appear in the GROUP BY clause or be used in an aggregate function;
If you renamed the column to something distinct, that would fix the query:
with src(cost, dat) as (
select 1, current_date
union all
select 2, current_date
)
SELECT
dat,
sum(s.cost) as sum_cost
FROM src s
GROUP BY dat
HAVING sum_cost = 3
;
I was also surprised to see that quoting identifiers with " does not solve the problem - as I initially expected.
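Applied to the question's second query, the same rename would look something like this (view and column names taken from the question; using SUM(cost) in the HAVING clause works regardless of whether aliases are allowed there):
SELECT projectname, usagedate, SUM(cost) AS sum_cost
FROM view_table_B
GROUP BY projectname, usagedate
HAVING SUM(cost) >= 20;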

REGR_SLOPE in Teradata SQL Query Returning 0 Slope

I am a relative newbie with Teradata SQL and have run into this strange (I think strange) situation. I am trying to run a regression (REGR_SLOPE) on sensor data. I am gathering sensor readings for a single day; each day has 80 observations, which is confirmed by the COUNT in the outer SELECT. My query is:
SELECT
d.meter_id,
REGR_SLOPE(d.reading_measure, d.x_axis) AS slope,
COUNT(d.x_axis) AS xcount,
COUNT(d.reading_measure) AS read_count
FROM
(
SELECT
meter_id,
reading_measure,
row_number() OVER (ORDER BY Reading_Dttm) AS x_axis
FROM data_mart.v_meter_reading
WHERE Reading_Start_Dt = '2017-12-12'
AND Meter_Id IN (11932101, 11419827, 11385229, 11643466)
AND Channel_Num = 5
) d
GROUP BY 1
When I use the "IN" clause in the subquery to specify Meter_Id, I get slope values, but when I take it out (to run over all meters) all the slopes are 0 (zero). I would simply like to run a line through a day's worth of observations (80).
I'm using Teradata v15.0.
What am I missing / doing wrong?
I would bet a Pepperoni Pizza that it's the x_axis value.
Instead try ROW_NUMBER() OVER (PARTITION BY meter_id ORDER BY reading_dttm)
This will ensure that the x_axis starts again from 1 for each meter, and each reading will always be 1 away from the previous reading on the x_axis.
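A minimal sketch of the corrected subquery, using the same table and columns as the question:
SELECT
meter_id,
reading_measure,
ROW_NUMBER() OVER (PARTITION BY meter_id ORDER BY Reading_Dttm) AS x_axis
FROM data_mart.v_meter_reading
WHERE Reading_Start_Dt = '2017-12-12'
AND Channel_Num = 5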
This makes me think you should probably just use reading_dttm as the x_axis value, rather than fabricating one with ROW_NUMBER(). That way readings with a 5-hour gap between them have a different slope than readings with a 10-day gap between them. You may need to convert the reading_dttm's data type, with a function like TO_UNIXTIME(reading_dttm), or something similar.
I'll message you my address for the Pizza Delivery. (Joking.)
In addition to @MatBailie's answer:
You probably know that you should order by the timestamp instead of the ROW_NUMBER, but you can't, because Teradata doesn't allow timestamps in this place (strange).
There's no built-in TO_UNIXTIME function in Teradata, but you can use this instead:
REPLACE FUNCTION TimeStamp_to_UnixTime (ts TIMESTAMP(6))
RETURNS decimal(18,6)
LANGUAGE SQL
CONTAINS SQL
DETERMINISTIC
SQL SECURITY DEFINER
COLLATION INVOKER
INLINE TYPE 1
RETURN
(Cast(ts AS DATE) - DATE '1970-01-01') * 86400
+ (Extract(HOUR From ts) * 3600)
+ (Extract(MINUTE From ts) * 60)
+ (Extract(SECOND From ts));
If you're not allowed to create UDFs simply cut&paste the calculation.
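For example, inlined into the question's subquery the calculation might look like this (untested sketch, same table and columns as above):
SELECT
meter_id,
reading_measure,
(Cast(Reading_Dttm AS DATE) - DATE '1970-01-01') * 86400
+ (Extract(HOUR From Reading_Dttm) * 3600)
+ (Extract(MINUTE From Reading_Dttm) * 60)
+ (Extract(SECOND From Reading_Dttm)) AS x_axis
FROM data_mart.v_meter_reading
WHERE Reading_Start_Dt = '2017-12-12'
AND Channel_Num = 5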

Aggregating (x,y) coordinate point clouds in PostgreSQL

I have a PostgreSQL database table with the following simplified structure:
device_id (varchar)
pos_x (int)
pos_y (int)
Basically this table contains a lot of two-dimensional waypoint data for devices. Now I want to design a query which reduces the number of coordinates in the output. It should aggregate nearby coordinates (for a certain x,y threshold).
An example:
row 1: DEVICE1;603;1205
row 2: DEVICE1;604;1204
If the threshold is 5, these two rows should be aggregated since the variance is smaller than 5.
Any idea how to do this in PostgreSQL or SQL in general?
Use the often overlooked built-in function width_bucket() in combination with your aggregation:
If your coordinates run from, say, 0 to 2000 and you want to consolidate everything within squares of 5 to single points, I would lay out a grid of 10 (5*2) like this:
SELECT device_id
, width_bucket(pos_x, 0, 2000, 2000/10) * 10 AS pos_x
, width_bucket(pos_y, 0, 2000, 2000/10) * 10 AS pos_y
, count(*) AS ct -- or any other aggregate
FROM tbl
GROUP BY 1,2,3
ORDER BY 1,2,3;
To minimize the error you could GROUP BY the grid as demonstrated, but save actual average coordinates:
SELECT device_id
, avg(pos_x)::int AS pos_x -- save actual averages to minimize error
, avg(pos_y)::int AS pos_y -- cast if you need to
, count(*) AS ct -- or any other aggregate
FROM tbl
GROUP BY
device_id
, width_bucket(pos_x, 0, 2000, 2000/10) * 10 -- aggregate by grid
, width_bucket(pos_y, 0, 2000, 2000/10) * 10
ORDER BY 1,2,3;
There is an sqlfiddle demonstrating both alongside.
Well, this particular case could be simpler:
...
GROUP BY
device_id
, (pos_x / 10) * 10 -- truncates last digit of an integer
, (pos_y / 10) * 10
...
But that's just because the demo grid size of 10 conveniently matches the decimal system. Try the same with a grid size of 17 or something ...
Expand to timestamps
You can expand this approach to cover date and timestamp values by converting them to unix epoch (number of seconds since '1970-1-1') with extract().
SELECT extract(epoch FROM '2012-10-01 21:06:38+02'::timestamptz);
When you are done, convert the result back to timestamp with time zone:
SELECT timestamptz 'epoch' + 1349118398 * interval '1s';
Or simply to_timestamp():
SELECT to_timestamp(1349118398);
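For example, a hypothetical one-hour grid over a timestamptz column could be built the same way (the column name ts is assumed for illustration):
SELECT device_id
     , to_timestamp(floor(extract(epoch FROM ts) / 3600) * 3600) AS hour_bin -- ts is a hypothetical timestamptz column
     , count(*) AS ct
FROM tbl
GROUP BY 1, 2
ORDER BY 1, 2;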
select [some aggregates]
from tbl
group by pos_x / 5, pos_y / 5;
Where instead of 5 you can have any number, depending on how much aggregation you need.

SQL Average Inter-arrival Time, Time Between Dates

I have a table with sequential timestamps:
2011-03-17 10:31:19
2011-03-17 10:45:49
2011-03-17 10:47:49
...
I need to find the average time difference between each of these (there could be dozens) in seconds or whatever is easiest; I can work with it from there. So, for example, the inter-arrival time for only the first two times above would be 870 (14m 30s). For all three times it would be: (870 + 120)/2 = 495 (8m 15s).
A note: I am using PostgreSQL 8.1.22.
EDIT: The table I mention above is from a different query that is literally just a one-column list of timestamps
Not sure I understood your question completely, but this might be what you are looking for:
SELECT avg(difference)
FROM (
SELECT timestamp_col - lag(timestamp_col) over (order by timestamp_col) as difference
FROM your_table
) t
The inner query calculates the distance between each row and the preceding row. The result is an interval for each row in the table.
The outer query simply does an average over all differences.
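If you need the result in seconds rather than as an interval, you could additionally extract the epoch from the difference (a sketch, same hypothetical names as above):
SELECT avg(extract(epoch FROM difference)) AS avg_seconds
FROM (
SELECT timestamp_col - lag(timestamp_col) over (order by timestamp_col) as difference
FROM your_table
) t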
I think you want to find avg(timestamptz).
My solution is avg(current - min value), but since the result is an interval, add it to the min value again.
SELECT avg(target_col - (select min(target_col) from your_table))
+ (select min(target_col) from your_table)
FROM your_table
If you cannot upgrade to a version of PG that supports window functions, you
may compute your table's sequential steps "the slow way."
Assuming your table is "tbl" and your timestamp column is "ts":
SELECT AVG(t1 - t0)
FROM (
-- All this silliness would be moot if we could use
-- `` lead(ts) over (order by ts) ''
SELECT tbl.ts AS t0,
next.ts AS t1
FROM tbl
CROSS JOIN
tbl next
WHERE next.ts = (
SELECT MIN(ts)
FROM tbl subquery
WHERE subquery.ts > tbl.ts
)
) derived;
But don't do that. Its performance will be terrible. Please do what
a_horse_with_no_name suggests, and use window functions.