Average interval between timestamps in an array - sql

In a PostgreSQL 9.x database, I have a column which is an array of type timestamp. Each array has between 1..n timestamps.
I'm trying to extract the average interval between all elements in each array.
I understand using a window function on the source table might be the ideal way to tackle this but in this case I am trying to do it as an operation on the array.
I've looked at several other questions that calculate the moving average of another column, or the average/median date of a list of timestamps.
For example the average interval I'm looking for on an array with 3 elements like this:
'{"2012-10-09 17:04:05.710887"
,"2013-10-18 22:30:08.973749"
,"2014-10-22 22:18:18.885973"}'::timestamp[]
Would be:
-368d
Wondering if I need to unpack the array through a function?

One way of many possible: unnest, join, avg in a lateral subquery:
SELECT *
FROM tbl t
LEFT JOIN LATERAL (
SELECT avg(a2.ts - a1.ts) AS avg_intv
FROM unnest(t.arr) WITH ORDINALITY a1(ts, ord)
JOIN unnest(t.arr) WITH ORDINALITY a2(ts, ord) ON (a2.ord = a1.ord + 1)
) avg ON true;
The [INNER] JOIN in the subquery produces exactly the set of combinations relevant for intervals between elements.
I get 371 days 14:37:06.587543, not '-368d', btw.
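For the three timestamps in the question, that join produces just two pairs: elements 1→2 (374 days 05:26:03.262862) and 2→3 (368 days 23:48:09.912224); the average of those two intervals is the 371-day figure above.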
Related, with more explanation:
PostgreSQL unnest() with element number
You can also only unnest once and use the window functions lead() or lag(), but you were trying to avoid window functions. And you need to make sure of the original order of elements in any case ...
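A minimal sketch of that variant, assuming the same table and column names as above (WITH ORDINALITY preserves the array order; avg() ignores the NULL that lag() produces for the first element):
SELECT *
FROM tbl t
LEFT JOIN LATERAL (
   SELECT avg(ts - prev_ts) AS avg_intv
   FROM (
      SELECT ts, lag(ts) OVER (ORDER BY ord) AS prev_ts
      FROM unnest(t.arr) WITH ORDINALITY a(ts, ord)
      ) sub
   ) avg ON true;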
(There is no array function you could use directly to get what you need - in case you were hoping for that.)
Alternative with CTE
Might be appealing to still unnest only once (even while avoiding window functions):
SELECT *
FROM tbl t
LEFT JOIN LATERAL (
WITH a AS (SELECT * FROM unnest(t.arr) WITH ORDINALITY a1(ts, ord))
SELECT avg(a2.ts - a1.ts) AS avg_intv
FROM a a1
JOIN a a2 ON (a2.ord = a1.ord +1)
) avg ON true;
But I expect the added CTE overhead to cost more than unnesting twice. Mostly just demonstrating a WITH clause in a subquery.

Related

Athena/Presto | Can't match ID row on self join

I'm trying to get the bi-grams on a string column.
I've followed the approach here but Athena/Presto is giving me errors at the final steps.
Source code so far
with word_list as (
SELECT
transaction_id,
words,
n,
regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)') as f70,
f70_remittance_info
FROM exploration_transaction
cross join unnest(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) with ordinality AS t (words, n)
where cardinality((regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)'))) > 1
and f70_remittance_info is not null
limit 50 )
select wl1.f70, wl1.n, wl1.words, wl2.f70, wl2.n, wl2.words
from word_list wl1
join word_list wl2
on wl1.transaction_id = wl2.transaction_id
The specific issue I'm having is on the very last line: when I try to self join on the transaction ids, it always returns zero rows. It does work if I join only on wl1.n = wl2.n - 1 (the position in the array), which is useless if I can't constrain it to the same id.
Athena doesn't support Presto's ngrams function, so I'm left with this approach.
Any clues why this isn't working?
Thanks!
This is speculation. But I note that your CTE is using limit with no order by. That means that an arbitrary set of rows is being returned.
Although some databases materialize CTEs, many do not. They run the code independently each time it is referenced. My guess is that the code is run independently and the arbitrary set of 50 rows has no transaction ids in common.
One solution would be to add order by transaction_id in the subquery.
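A sketch of that fix applied to the query from the question, ordering before the LIMIT so both references to the CTE see the same 50-row sample (the extra join condition on n is the bi-gram pairing mentioned above):
with word_list as (
    SELECT
        transaction_id,
        words,
        n,
        regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)') as f70,
        f70_remittance_info
    FROM exploration_transaction
    cross join unnest(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) with ordinality AS t (words, n)
    where cardinality(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) > 1
    and f70_remittance_info is not null
    order by transaction_id
    limit 50 )
select wl1.f70, wl1.n, wl1.words, wl2.f70, wl2.n, wl2.words
from word_list wl1
join word_list wl2
on wl1.transaction_id = wl2.transaction_id
and wl1.n = wl2.n - 1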

Filtering arrays in hive

I have a Hive table with a column called paid_value that holds an array for each record.
Now I want to filter the array so that the values are between 1000 and 10000 for each record.
I don't know how to do it.
I know the array_contains(Array<T>, value) function, but it doesn't solve my problem since it accepts only a single value to check for, while I want something like 'between 1000 and 10000'.
You can use LATERAL VIEW EXPLODE to explode the array and then apply the filter afterwards. But if your arrays are huge, the process will be slow.
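A minimal sketch of that approach, assuming a table my_table with a key column id next to the paid_value array (both names hypothetical); collect_list rebuilds the filtered array per record:
SELECT id,
       collect_list(val) AS paid_value_filtered
FROM my_table
LATERAL VIEW explode(paid_value) pv AS val
WHERE val BETWEEN 1000 AND 10000
GROUP BY id;
Note that records whose arrays contain no values in range drop out of this result.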
Any other option pretty much needs a UDF to do the filtering.
The other workaround I can think of uses a Brickhouse UDF:
-- this will give you an array of numbers between start(st) and end(ed)
select collect_set(pe.i + st) as range_array
from
(SELECT 1000 as st, 1100 as ed) t
LATERAL VIEW posexplode(split(space(ed-st),' ')) pe AS i,x;
Then use the Brickhouse UDF bhouse_intersect_array:
select count(1)
from range_array cross join <source_tablename>
where size(bhouse_intersect_array(source_array, range_array)) > 0

Postgresql - Map array aggregates into a single array in a particular order

I have a PostgreSQL table containing a column of 1 dimensional array data. I wish to perform an aggregate query on this column, obtaining min/max/mean for each element of the array as well as the group count, returning the result as a 1 dimensional array. The array lengths in the table may vary, but I can be certain that in any grouping I perform, all arrays will be of the same length.
In a simple form, say my arrays are of length 2 and have readings for x and y, I want to return the result as
{Min(x), Max(x), Mean(x), Min(y), Max(y), Mean(y), Count()}
I am able to get a result in the form {Min(x), Min(y), Max(x), Max(y), Mean(x), Mean(y) Count()} but I can't get from there to my desired result.
Here's an example showing where I am so far (this time with arrays of length 3, but without the mean aggregation, as there isn't one for arrays built into PostgreSQL):
CREATE TABLE my_test(some_key numeric, event_data bigint[]);
INSERT INTO my_test(some_key, event_data) VALUES
(1, '{11,12,13}'),
(1, '{5,6,7}'),
(1, '{-11,-12,-13}');
SELECT MIN(event_data) || MAX(event_data) || COUNT(event_data) FROM my_test GROUP BY some_key;
The above gives me
{11,12,13,-11,-12,-13,3}
However, I don't know how to transform a result like the above into what I want, which is:
{11,-11,12,-12,13,-13,3}
What function should I use to transform the above?
Note that the aggregation functions above don't exactly match with those I am using to get min, max - I'm using the aggs_for_vecs extension to give me min, max and mean.
I would recommend using array operations and aggregation:
select x.some_key,
array_agg(u.val order by x.n, u.nn)
from (select t.some_key, ed.n, min(val) as minval, max(val) as maxval
from my_test t cross join lateral
unnest(t.event_data) with ordinality as ed(val, n)
group by t.some_key, ed.n
) x cross join lateral
unnest(array[x.minval, x.maxval]) with ordinality u(val, nn)
group by x.some_key;
Personally, I would prefer an array with three elements and the min/max as a record:
select x.some_key, array_agg((x.minval, x.maxval) order by x.n)
from (select t.some_key, ed.n, min(val) as minval, max(val) as maxval
from my_test t cross join lateral
unnest(t.event_data) with ordinality as ed(val, n)
group by t.some_key, ed.n
) x
group by x.some_key;

Redshift LISTAGG frame clause

I am trying to aggregate strings, but limited to only the preceding rows, not the whole partition. Does anyone know how to do this in Redshift?
What I am trying to achieve is the appended_event_namespace column below.
This is what I've tried so far.
LISTAGG(event_namespace, '/')
WITHIN GROUP (ORDER BY tstamp_true)
OVER (PARTITION BY acct_id) AS appended_event_namespace
This results in the full ApplicationLaunch/CategoryBrowse/NotificationCenter/UserProfile aggregation on every single row instead of what is in the desired screenshot.
The difficulty is in getting it to only append up to the current row since there doesn't seem to be a frame-clause for Redshift's LISTAGG(). Thanks for any ideas that may help.
You can hack this together with another query. Start with your appended_event_namespace as the result of your original LISTAGG
SELECT event_namespace,
SUBSTRING(appended_event_namespace,
1,
POSITION(event_namespace IN appended_event_namespace) + LEN(event_namespace) - 1
) as appended_event_namespace_cum
FROM your_table;
Basically, you take your aggregated, ordered string, and then take the first N characters where N is ([where it appears in the aggregated string ]+[its length]), which will cut out everything after that item. This gives you a cumulative namespace.
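For example, with appended_event_namespace = 'ApplicationLaunch/CategoryBrowse/NotificationCenter/UserProfile' and event_namespace = 'CategoryBrowse', the namespace starts at position 19 and is 14 characters long, so the substring keeps characters 1 through 32 and yields 'ApplicationLaunch/CategoryBrowse'.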
LISTAGG with a frame clause is not supported in Redshift yet. If you have columns you can use for partitioning and ordering, you can do a self join (not as performant, but it accomplishes what you want):
SELECT
t1.id
,t2.tstamp_true
,t1.event_namespace
,LISTAGG(t2.event_namespace,'/') WITHIN GROUP (ORDER BY t2.tstamp_true)
FROM your_table t1
JOIN your_table t2
ON t1.id=t2.id
AND t1.tstamp_true>=t2.tstamp_true
GROUP BY 1,2,3
Alternatively, if you want to avoid the self join, you can use LISTAGG to build a JSON value with the following structure:
[{tstamp_true_1,event_namespace_1},{tstamp_true_N,event_namespace_N},...]
and then write a Python UDF that takes such a JSON value for the given group of rows plus the tstamp_true of the current row and returns the path. The function would keep only the entries whose tstamp_true is at or before the second parameter and concatenate their event_namespace values for the output.
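A hedged sketch of such a UDF (the function name and signature are hypothetical; it assumes the LISTAGG builds a JSON array of [tstamp_true, event_namespace] pairs with timestamps as sortable ISO strings):
CREATE OR REPLACE FUNCTION f_namespace_path(history varchar(max), cur_ts varchar)
RETURNS varchar(max)
STABLE
AS $$
import json
entries = json.loads(history)
# keep the pairs at or before the current row's timestamp, in timestamp order
return '/'.join(ns for ts, ns in sorted(entries) if ts <= cur_ts)
$$ LANGUAGE plpythonu;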

Pairwise array sum aggregate function?

I have a table with arrays as one column, and I want to sum the array elements together:
> create table regres(a int[] not null);
> insert into regres values ('{1,2,3}'), ('{9, 12, 13}');
> select * from regres;
a
-----------
{1,2,3}
{9,12,13}
I want the result to be:
{10, 14, 16}
that is: {1 + 9, 2 + 12, 3 + 13}.
Does such a function already exist somewhere? The intagg extension looked like a good candidate, but such a function does not already exist.
The arrays are expected to be between 24 and 31 elements in length, all elements are NOT NULL, and the arrays themselves will also always be NOT NULL. All elements are basic int. There will be more than two rows per aggregate. All arrays will have the same number of elements, in a query. Different queries will have different number of elements.
My implementation target is: PostgreSQL 9.1.13
General solutions for any number of arrays with any number of elements. Individual elements or the whole array can be NULL, too:
Simpler in 9.4+ using WITH ORDINALITY
SELECT ARRAY (
SELECT sum(elem)
FROM tbl t
, unnest(t.arr) WITH ORDINALITY x(elem, rn)
GROUP BY rn
ORDER BY rn
);
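With the two sample rows from the question ({1,2,3} and {9,12,13}), this returns {10,14,16}; substitute your table and column names for the generic tbl and arr.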
See:
PostgreSQL unnest() with element number
Postgres 9.3+
This makes use of an implicit LATERAL JOIN
SELECT ARRAY (
SELECT sum(arr[rn])
FROM tbl t
, generate_subscripts(t.arr, 1) AS rn
GROUP BY rn
ORDER BY rn
);
See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Postgres 9.1
SELECT ARRAY (
SELECT sum(arr[rn])
FROM (
SELECT arr, generate_subscripts(arr, 1) AS rn
FROM tbl t
) sub
GROUP BY rn
ORDER BY rn
);
The same works in later versions, but set-returning functions in the SELECT list are not standard SQL and were frowned upon by some. Should be OK since Postgres 10, though. See:
What is the expected behaviour for multiple set-returning functions in SELECT clause?
Related:
Is there something like a zip() function in PostgreSQL that combines two arrays?
If you need better performance and can install Postgres extensions, the aggs_for_vecs C extension provides a vec_to_sum function that should meet your need. It also offers various aggregate functions like min, max, avg, and var_samp that operate on arrays instead of scalars.
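A minimal usage sketch, assuming the extension is available on the server and vec_to_sum behaves as described, using the question's table:
CREATE EXTENSION aggs_for_vecs;

SELECT vec_to_sum(a) AS sums
FROM regres;  -- {10,14,16}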
I know the original question and answer are pretty old, but for others who find this... The most elegant and flexible solution I've found is to create a custom aggregate function. Erwin's answer presents some great simple solutions if you only need the single resulting array, but doesn't translate to a solution that could include other table columns and aggregations, in a GROUP BY for example.
With a custom array_add function and array_sum aggregate function:
CREATE OR REPLACE FUNCTION array_add(_a numeric[], _b numeric[])
RETURNS numeric[]
AS
$$
BEGIN
RETURN ARRAY(
SELECT coalesce(a, 0) + coalesce(b, 0)
FROM unnest(_a, _b) WITH ORDINALITY AS x(a, b, n)
ORDER BY n
);
END
$$ LANGUAGE plpgsql;
CREATE AGGREGATE array_sum(numeric[])
(
sfunc = array_add,
stype = numeric[],
initcond = '{}'
);
Then (using the names from your example):
SELECT array_sum(a) a_sums
FROM regres;
Returns your array of sums, and it can just as well be used anywhere other aggregate functions could be used, so if your table also had a column name you wanted to group by, and another array of numbers, column b:
SELECT name, array_sum(a) a_sums, array_sum(b) b_sums
FROM regres
GROUP BY name;
You won't get quite the performance you'd get out of the built-in sum function and just selecting sum(a[1]), sum(a[2]), sum(a[3]); to get that, you'd have to implement the array_add function as a compiled C function. But in cases where you don't have the ability to add custom C functions (like a managed cloud database, e.g. AWS RDS), or you're not aggregating huge numbers of rows, the difference probably won't be noticed.