Aggregate single array of distinct elements from array column, excluding NULL - sql

I'm trying to roll up the distinct non-null values of timestamps stored in a PostgreSQL 9.6 database column.
So given a table containing the following:
date_array
------------------------
{2019-10-21 00:00:00.0}
{2019-08-06 00:00:00.0,2019-08-05 00:00:00.0}
{2019-08-05 00:00:00.0}
(null)
{2019-08-01 00:00:00.0,2019-08-06 00:00:00.0,null}
The desired result would be:
{2019-10-21 00:00:00.0, 2019-08-06 00:00:00.0, 2019-08-05 00:00:00.0, 2019-08-01 00:00:00.0}
The arrays can be different sizes so most solutions I've tried end up running into a Code 0:
SQL State: 2202E
ERROR: cannot accumulate arrays of different dimensionality.
Some other caveats:
The arrays can be null, the arrays can contain a null. They happen to be timestamps of just dates (eg without time or timezone). But in trying to simplify the problem, I've had no luck in changing the sample data to strings (e.g {foo, bar, (null)}, {foo,baz}) - just to focus on the problem and eliminate any issues I miss/don't understand about timestamps w/o timezone.
This following SQL is the closest I've come (it resolves all but the different dimensionality issues):
SELECT
ARRAY_REMOVE ( ARRAY ( SELECT DISTINCT UNNEST ( ARRAY_AGG ( CASE WHEN ARRAY_NDIMS(example.date_array) > 0 AND example.date_array IS NOT NULL THEN example.date_array ELSE '{null}' END ) ) ), NULL) as actualDates
FROM example;
I created the following DB fiddle with sample data that illustrates the problem if the above is lacking: https://www.db-fiddle.com/f/8m469XTDmnt4iRkc5Si1eS/0
Additionally, I've perused stackoverflow on the issue (as well as PostgreSQL documentation) and there are similar questions with answers, but I've found none that are articulating the same problem I'm having.

Use unnest() in FROM clause (in a lateral join):
select array_agg(distinct elem order by elem desc) as result
from example
cross join unnest(date_array) as elem
where elem is not null
Test it in DB Fiddle.
A general note. An alternative solution using an array constructor is more efficient, especially in cases as simple as described. Personally, I prefer to use aggregate functions because this query structure is more general and flexible, easy to extend to handle more complex problems (e.g. having to aggregate more than one column, grouping by another column, etc). In these non-trivial cases, the performance differences tend to decrease, but the code using aggregates remains cleaner and more readable. It's an extremely important factor when you have to maintain really large and complex projects.
See also In Postgres select, return a column subquery as an array?

Plain array_agg() does this with arrays:
Concatenates all the input arrays into an array of one higher
dimension. (The inputs must all have the same dimensionality, and
cannot be empty or null.)
Not what you need. See:
Is there something like a zip() function in PostgreSQL that combines two arrays?
You need something like this: unnest(), process and sort elements an feed the resulting set to an ARRAY constructor:
SELECT ARRAY(
SELECT DISTINCT elem::date
FROM (SELECT unnest(date_array) FROM example) AS e(elem)
WHERE elem IS NOT NULL
ORDER BY elem DESC
);
db<>fiddle here
To be clear: we could use array_agg() (taking non-array input, different from your incorrect use) instead of the final ARRAY constructor. But the latter is faster (and simpler, too, IMO).
They happen to be timestamps of just dates (eg without time or timezone)
So cast to date and trim the noise.
Should be the fastest way:
A correlated subquery is a bit faster than a LATERAL one (and does the simple job).
An ARRAY constructor is a bit faster than the aggregate function array_agg() (and does the simple job).
Most importantly, sorting and applying DISTINCT in a subquery is typically faster than inline ORDER BY and DISTINCT in an aggregate function (and does the simple job).
See:
Unnest arrays of different dimensions
How to select 1d array from 2d array?
Why is array_agg() slower than the non-aggregate ARRAY() constructor?
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Performance comparison:
db<>fiddle here

Related

Greater than any element of array

I am facing this issue:
Is there a way in PostgreSQL to put aggregated timestamp data into an array (for example using array_agg function) and then perform any match on some condition?
I am doing something similar with LIKE on aggregated strings (using string_agg(column,';')). But how to perform something similar on timestamps?
So if result would be '{10.10.2021,20.12.2021,1.1.1996}' as timestamp_array and I would like to filter rows that have at least one array element that after some input?
For example, ... WHERE 31.12.2021 > timestamp_array ... would not match the row above cause there is no array element after 31.12.2021.
But If I query ... WHERE 31.12.1996 > timestamp_array ..., the row above would be matched (cause at least one element of the array is in given interval).
First, you would use standard date formats. Then you can use:
where '2021-12-31' > any (timestamp_array)
Here is a db<>fiddle to illustrate the idea.
I would like to filter rows that has at least one array element that is after some input?
You can use the ANY construct as has been advised.
WHERE '1996-12-31'::timestamp < ANY ('{2021-10-10, 2021-12-20, 1996-01-01}'::timestamp[])
Has to be <, not >, obviously.
Your "timestamps" look a lot like dates - timestamp input accepts that, too.
But always use the recommended ISO 8601 format (as demonstrated), else your input depends on setting of the current session.
See:
IN vs ANY operator in PostgreSQL
But chances are, there is much more efficient way. You speak of "aggregated timestamp data". Typically it's much more efficient to check before aggregating. Not least because that can use indexes, as opposed to your approach. Typically, EXISTS does the job. Something like:
SELECT ...
FROM tbl t
WHERE EXISTS (SELECT FROM tbl t1 WHERE t1.id = t.id AND t1.timestamp_column > '1996-12-31'
GROUP BY ...
Start a new question with details of your query to get a fitting solution.

Filtering arrays in hive

I have a hive table in that I am having a column called paid_value in an array format for each record.
Now I want to filter the array such that the value must be between 1000 and 10000 for each record.
I don't know how to do it.
I know array_contains(Array<T>, value) function but this doesn't solve my problem since it accepts only one value as a checking criteria but I want like 'between 1000 and 10000'.
You can use LATERAL VIEW EXPLODE to explode the array and then do the filter subsequently. But if your array size is huge, your process will be slow.
Other option definitely needs a UDF to do the filter.
The other workaround I can think of is doing with a brickhouse UDF is:
-- this will give you an array of numbers between start(st) and end(ed)
select collect_set(pe.i+1) as range_array
from
(SELECT 1000 as st, 1100 as ed) t
LATERAL VIEW posexplode(split(space(ed-st),' ')) pe AS i,x;
Then I use the brickhouse udf bhouse_intersect_array
select count(1)
from range_array cross join <source_tablename>
where size(bhouse_intersect_array(source_array, range_array)) > 0

Starting from a column type, how to find supported aggregations in Postgres?

I'm trying to figure out from a column type, which aggregates the data type supports. There's a lot of variety amongst types, just a sample below (some of these support more aggregates, of course):
uuid count()
text count(), min(), max()
integer count(), min, max(),avg(),sum()
I've been thrashing around in the system catalogs and views, but haven't found what I'm after. (See "thrashing around.") I've poked at pg_type, pg_aggregate, pg_operator, and a few more.
Is there a straightforward way to start from a column type and gather all supported aggregates?
For background, I'm writing a client-side cross-tab code generator, and the UX is better when the tool automatically prevents you from selecting an aggregation that's not supported. I've hacked in some hard-coded rules for now, but would like to improve the system.
We're on Postgres 11.4.
A plain list of available aggregate functions can be based on pg_proc like this:
SELECT oid::regprocedure::text AS agg_func_plus_args
FROM pg_proc
WHERE prokind = 'a'
ORDER BY 1;
Or with separate function name and arguments:
SELECT proname AS agg_func, pg_get_function_identity_arguments(oid) AS args
FROM pg_proc
WHERE prokind = 'a'
ORDER BY 1, 2;
pg_proc.prokind replaces proisagg in Postgres 11. In Postgres 10 or older use:
...
WHERE proisagg
...
Related:
How to drop all of my functions in PostgreSQL?
How to get function parameter lists (so I can drop a function)
To get a list of available functions for every data type (your question), start with:
SELECT type_id::regtype::text, array_agg(proname) AS agg_functions
FROM (
SELECT proname, unnest(proargtypes::regtype[])::text AS type_id
FROM pg_proc
WHERE proisagg
ORDER BY 2, 1
) sub
GROUP BY type_id;
db<>fiddle here
Just a start. Some of the arguments are just "direct" (non-aggregated) (That's also why some functions are listed multiple times - due to those additional non-aggregate columns, example string_agg). And there are special cases for "ordered-set" and "hypothetical-set" aggregates. See the columns aggkind and aggnumdirectargs of the additional system catalog pg_aggregate. (You may want to exclude the exotic special cases for starters ...)
And many types have an implicit cast to one of the types listed by the query. Prominent example string_agg() works with varchar, too, but it's only listed for text above. You can extend the query with information from pg_cast to get the full picture.
Plus, some aggregates work for pseudo types "any", anyarray etc. You'll want to factor those in for every applicable data type.
The complication of multiple aliases for the same data type names can be eliminated easily, though: cast to regtype to get canonical names. Or use pg_typeof() which returns standard names. Related:
Type conversion. What do I do with a PostgreSQL OID value in libpq in C?
PostgreSQL syntax error in parameterized query on "date $1"
How do I translate PostgreSQL OID using python
Man, that is just stunning Thank you. The heat death of the universe will arrive before I could have figured that out. I had to tweak one line for PG 11 compatibility...says the guy who did not say what version he was on. I've reworked the query to get close to what I'm after and included a bit of output for the archives.
with aggregates as (
SELECT pro.proname aggregate_name,
CASE
WHEN array_agg(typ.typname ORDER BY proarg.position) = '{NULL}'::name[] THEN
'{}'::name[]
ELSE
array_agg(typ.typname ORDER BY proarg.position)
END aggregate_types
FROM pg_proc pro
CROSS JOIN LATERAL unnest(pro.proargtypes) WITH ORDINALITY proarg (oid,
position)
LEFT JOIN pg_type typ
ON typ.oid = proarg.oid
WHERE pro. prokind = 'a' -- I needed this for PG 11, I didn't say what version I was using.
GROUP BY pro.oid,
pro.proname
ORDER BY pro.proname),
-- The *super helpful* code above is _way_ past my skill level with Postgres. So, thrashing around a bit to get close to what I'm after.
-- First up, a CTE to sort everything by aggregation and then combine the types.
aggregate_summary as (
select aggregate_name,
array_agg(aggregate_types) as types_array
from aggregates
group by 1
order by 1)
-- Finally, the previous CTE is used to get the details and a count of the types.
select aggregate_name,
cardinality(types_array) as types_count, -- Couldn't get array_length to work here. ¯\_(ツ)_/¯
types_array
from aggregate_summary
limit 5;
And a bit of output:
aggregate_name types_count types_array
array_agg 2 {{anynonarray},{anyarray}}
avg 7 {{int8},{int4},{int2},{numeric},{float4},{float8},{interval}}
bit_and 4 {{int2},{int4},{int8},{bit}}
bit_or 4 {{int2},{int4},{int8},{bit}}
bool_and 1 {{bool}}
Still on my wish list are
Figuring out how to execute arrays (we aren't using array fields now, and only have a few places that we ever might. At that point, I don't expect we'll try and support pivots on arrays. tab tool
Getting all of the aliases for the various types. it seems like (?) int8, etc. can come through from pg_attribute in multiple ways. For example, timestamptz can come back from "timestamp with time zone".
These results are going to be consumed by client-side code and processed, so I don't need to get Postgres to figure everything out in one query, just enough for me to get the job done.
In any case, thanks very, very much.
There's the pg_proc catalog table, that lists all functions. The column proisagg marks aggregation functions and the column proargtypes holds an array of the OIDs of the argument types.
So for example to get a list of all aggregation functions with the names of their arguments' type you could use:
SELECT pro.proname aggregationfunctionname,
CASE
WHEN array_agg(typ.typname ORDER BY proarg.position) = '{NULL}'::name[] THEN
'{}'::name[]
ELSE
array_agg(typ.typname ORDER BY proarg.position)
END aggregationfunctionargumenttypes
FROM pg_proc pro
CROSS JOIN LATERAL unnest(pro.proargtypes) WITH ORDINALITY proarg (oid,
position)
LEFT JOIN pg_type typ
ON typ.oid = proarg.oid
WHERE pro.proisagg
GROUP BY pro.oid,
pro.proname
ORDER BY pro.proname;
Of course you may need to extend that, e.g. joining and respecting the schemas (pg_namespace) and check for compatible types in pg_type (have a look at the typcategory column for that), etc..
Edit:
I overlooked, that proisagg was removed in version 11 (I'm still mostly on a 9.6) as the other answers mentioned. So for the sake of completeness: As of version 11 replace WHERE pro.proisagg with WHERE pro.prokind = 'a'.
I've been playing around with the suggestions a bit, and want to post one adaptation based on one of Erwin's scripts:
select type_id::regtype::text as type_name,
array_agg(proname) as aggregate_names
from (
select proname,
unnest(proargtypes::regtype[])::text AS type_id
from pg_proc
where prokind = 'a'
order by 2, 1
) subquery
where type_id in ('"any"', 'bigint', 'boolean','citext','date','double precision','integer','interval','numeric','smallint',
'text','time with time zone','time without time zone','timestamp with time zone','timestamp without time zone')
group by type_id;
That brings back details on the types specified in the where clause. Not only is this useful for my current work, it's useful to my understanding generally. I've run into cases where I've had to recast something, like an integer to a double, to get it to work with an aggregate. So far, this has been pretty much trial and error. If you run the query above (or one like it), it's easier to see from the output where you need recasting between similar seeming types.

Efficient way to select one from each category - Rails

I'm developing a simple app to return a random selection of exercises, one for each bodypart.
bodypart is an indexed enum column on an Exercise model. DB is PostgreSQL.
The below achieves the result I want, but feels horribly inefficient (hitting the db once for every bodypart):
BODYPARTS = %w(legs core chest back shoulders).freeze
#exercises = BODYPARTS.map do |bp|
Exercise.public_send(bp).sample
end.shuffle
So, this gives a random exercise for each bodypart, and mixes up the order at the end.
I could also store all exercises in memory and select from them; however, I imagine this would scale horribly (there are only a dozen or so seed records at present).
#exercises = Exercise.all
BODYPARTS.map do |bp|
#exercises.select { |e| e[:bodypart] == bp }.sample
end.shuffle
Benchmarking these shows the select approach as the more effective on a small scale:
Queries: 0.072902 0.020728 0.093630 ( 0.088008)
Select: 0.000962 0.000225 0.001187 ( 0.001113)
MrYoshiji's answer: 0.000072 0.000008 0.000080 ( 0.000072)
My question is whether there's an efficient way to achieve this output, and, if so, what that approach might look like. Ideally, I'd like to keep this to a single db query.
Happy to compose this using ActiveRecord or directly in SQL. Any thoughts greatly appreciated.
From my comment, you should be able to do (thanks PostgreSQL's DISTINCT ON):
Exercise.select('distinct on (bodypart) *')
.order('bodypart, random()')
Postgres' DISTINCT ON is very handy and performance is typically great, too - for many distinct bodyparts with few rows each. But for only few distinct values of bodypart with many rows each (big table - and your use case) there are far superior query techniques.
This will be massively faster in such a case:
SELECT e.*
FROM unnest(enum_range(null::bodypart)) b(bodypart)
CROSS JOIN LATERAL (
SELECT *
FROM exercises
WHERE bodypart = b.bodypart
-- ORDER BY ??? -- for a deterministic pick
LIMIT 1 -- arbitrary pick!
) e;
Assuming that bodypart is the name of the enum as well as the table column.
enum_range is an enum support function that (quoting the manual):
Returns all values of the input enum type in an ordered array
I unnest it and run a LATERAL subquery for each value, which is very fast when supported with the right index. Detailed explanation for the query technique and the needed index (focus on chapter "2a. LATERAL join"):
Optimize GROUP BY query to retrieve latest record per user
For just an arbitrary row for each bodypart, a simple index on exercises(bodypart) does the job. But you can have a deterministic pick like "the latest entry" with the right multicolumn index and a matching ORDER BY clause and almost the same performance.
Related:
Is it a bad practice to query pg_type for enums on a regular basis?
Select first row in each GROUP BY group?

Postgresql Writing max() Window function with multiple partition expressions?

I am trying to get the max value of column A ("original_list_price") over windows defined by 2 columns (namely - a unique identifier, called "address_token", and a date field, called "list_date"). I.e. I would like to know the max "original_list_price" of rows with both the same address_token AND list_date.
E.g.:
SELECT
address_token, list_date, original_list_price,
max(original_list_price) OVER (PARTITION BY address_token, list_date) as max_list_price
FROM table1
The query already takes >10 minutes when I use just 1 expression in the PARTITION (e.g. using address_token only, nothing after that). Sometimes the query times out. (I use Mode Analytics and get this error: An I/O error occurred while sending to the backend) So my questions are:
1) Will the Window function with multiple PARTITION BY expressions work?
2) Any other way to achieve my desired result?
3) Any way to make Windows functions, especially the Partition part run faster? e.g. use certain data types over others, try to avoid long alphanumeric string identifiers?
Thank you!
The complexity of the window functions partitioning clause should not have a big impact on performance. Do realize that your query is returning all the rows in the table, so there might be a very large result set.
Window functions should be able to take advantage of indexes. For this query:
SELECT address_token, list_date, original_list_price,
max(original_list_price) OVER (PARTITION BY address_token, list_date) as max_list_price
FROM table1;
You want an index on table1(address_token, list_date, original_list_price).
You could try writing the query as:
select t1.*,
(select max(t2.original_list_price)
from table1 t2
where t2.address_token = t1.address_token and t2.list_date = t1.list_date
) as max_list_price
from table1 t1;
This should return results more quickly, because it doesn't have to calculate the window function value first (for all rows) before returning values.