Presto filter an array during aggregation - sql

I would like to filter an aggregated array depending on all the values associated with an id. The values are strings and can be of three types: all-x:y, x:y, and empty (here x and y are arbitrary substrings of the values).
I have a few conditions:
If an id has x:y then the result should contain x:y.
If an id always has all-x:y then the resulting aggregation should have all-x:y.
If an id sometimes has all-x:y then the resulting aggregation should have x:y.
For example, with the following:
WITH
my_table(id, my_values) AS (
VALUES
(1, ['all-a','all-b']),
(2, ['all-c','b']),
(3, ['a','b','c']),
(1, ['all-a']),
(2, []),
(3, ['all-c'])
)
The result should be:
(1, ['all-a','b']),
(2, ['c','b']),
(3, ['a','b','c']),
I have worked multiple hours on this but it seems like it's not feasible.
I came up with the following, but it cannot work as written because I can't check the presence of all-x in all arrays, which is what the <<IN ALL>> placeholder stands for:
SELECT
id,
SET_UNION(
CASE
WHEN SPLIT_PART(my_table.my_values,'-',1) = 'all' THEN
CASE
WHEN <<my_table.my_values IN ALL>> THEN my_table.my_values
ELSE REPLACE(my_table.my_values,'all-')
END
ELSE my_table.my_values
END
) AS my_values
FROM my_table
GROUP BY 1
I would need to check that all array values for the specific id contain all-x, and that's where I'm struggling to find a solution.
Any help is appreciated. Thank you for reading.

This should do what you want:
WITH my_table(id, my_values) AS (
VALUES
(1, array['all-a','all-b']),
(2, array['all-c','b']),
(3, array['a','b','c']),
(1, array['all-a']),
(2, array[]),
(3, array['all-c'])
),
with_group_counts AS (
SELECT *, count(*) OVER (PARTITION BY id) group_count -- to see if the number of all-X occurrences match the number of rows for a given id
FROM my_table
),
normalized AS (
SELECT
id,
if(
count(*) OVER (PARTITION BY id, value) = group_count AND starts_with(value, 'all-'), -- if it's an all-X value and every original row for the given id contains it ...
value,
if(starts_with(value, 'all-'), substr(value, 5), value)) AS extracted
FROM with_group_counts CROSS JOIN UNNEST(with_group_counts.my_values) t(value)
)
SELECT id, array_agg(DISTINCT extracted)
FROM normalized
GROUP BY id
The trick is to compute the total number of rows for each id in the original table via the count(*) OVER (PARTITION BY id) expression in the with_group_counts subquery. We can then use that value to determine whether a given value should be treated as an all-x or have the x extracted. Note that the empty-array row for id 2 still contributes to group_count, which is why its single all-c occurrence is demoted to c. That's handled by the following expression:
if(
count(*) OVER (PARTITION BY id, value) = group_count AND starts_with(value, 'all-'),
value,
if(starts_with(value, 'all-'), substr(value, 5), value))
For more information, see the Presto documentation on window functions and on UNNEST.
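If you prefer to avoid window functions, the same always-vs-sometimes test can be expressed with plain GROUP BY aggregation. Here is a minimal sketch of that idea (row_counts and value_counts are names introduced for illustration; it assumes a value appears at most once per array):
WITH
row_counts AS (
SELECT id, count(*) AS n -- number of original rows per id
FROM my_table
GROUP BY id
),
value_counts AS (
SELECT t.id, u.value, count(*) AS cnt -- number of rows per id containing each value
FROM my_table t CROSS JOIN UNNEST(t.my_values) u(value)
GROUP BY t.id, u.value
)
SELECT v.id,
array_agg(DISTINCT
CASE
WHEN starts_with(v.value, 'all-') AND v.cnt = r.n THEN v.value -- in every row: keep all-x
WHEN starts_with(v.value, 'all-') THEN substr(v.value, 5) -- only in some rows: demote to x
ELSE v.value
END) AS my_values
FROM value_counts v JOIN row_counts r ON v.id = r.id
GROUP BY v.id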

Related

Postgres - select non-blank non-null values from multiple ordered rows

There is a lot of data coming from multiple sources that I need to group based on priority, but the data quality from those sources differs - some may be missing data.
The task is to group that data into a separate table, as completely as possible.
For example:
create table grouped_data (
id serial primary key,
type text,
a text,
b text,
c int
);
create table raw_data (
id serial primary key,
type text,
a text,
b text,
c int,
priority int
);
insert into raw_data
(type, a, b, c, priority)
values
('one', null, '', 123, 1),
('one', 'foo', '', 456, 2),
('one', 'bar', 'baz', 789, 3),
('two', null, 'two-b', 11, 3),
('two', '', '', 33, 2),
('two', null, 'two-bbb', 22, 1);
Now I need to group records by type, order by priority, take the first non-null and non-empty value, and put it into grouped_data.
In this case, the value of a for group one would be foo, because the row that holds that value has a higher priority than the one with bar. And c should be 123, as it has the highest priority.
Same for group two: for each column we take the data that is non-null, non-empty, and has the highest priority, or fall back to null if no actual data is present.
In the end, grouped_data is expected to have the following content:
('one', 'foo', 'baz', 123),
('two', null, 'two-bbb', 22)
I've tried grouping, sub-selects, MERGE, cross joins... Alas, my knowledge of PostgreSQL is not good enough to get it working.
One thing I'd like to avoid, too, is going through columns one by one, since in the real world there are a few dozen columns to work with...
A link to a fiddle I've been using to mess around with this: http://sqlfiddle.com/#!17/76699/1
UPD:
Thank you all!
Oleksii Tambovtsev's solution is the fastest one. On a set of data closely resembling a real-world case (2m records, ~30 fields) it takes only 20 seconds to produce the exact same set of data, which was previously generated programmatically and took over 20 minutes.
eshirvana's solution does the same in 95s, Steve Kass' in 125s, and Stefanov.sm's in 308s (which is still a helluva lot faster than doing it programmatically!)
Thank you all :)
You should try this:
SELECT
type,
(array_agg(a ORDER BY priority ASC) FILTER (WHERE a IS NOT NULL AND a != ''))[1] as a,
(array_agg(b ORDER BY priority ASC) FILTER (WHERE b IS NOT NULL AND b != ''))[1] as b,
(array_agg(c ORDER BY priority ASC) FILTER (WHERE c IS NOT NULL))[1] as c
FROM raw_data GROUP BY type ORDER BY type;
You can use the window function first_value:
select distinct
type
, first_value(a) over (partition by type order by nullif(a,'') is null, priority) as a
, first_value(b) over (partition by type order by nullif(b,'') is null, priority) as b
, first_value(c) over (partition by type order by priority) as c
from raw_data
select distinct on (type) type,
first_value(a) over (partition by type order by (nullif(a, '') is null), priority) a,
first_value(b) over (partition by type order by (nullif(b, '') is null), priority) b,
first_value(c) over (partition by type order by (c is null), priority) c
from raw_data;
This should also work.
WITH types(type) AS (
SELECT DISTINCT
type
FROM raw_data
)
SELECT
type,
(SELECT a FROM raw_data WHERE a > '' AND raw_data.type = types.type ORDER BY priority LIMIT 1) AS a,
(SELECT b FROM raw_data WHERE b > '' AND raw_data.type = types.type ORDER BY priority LIMIT 1) AS b,
(SELECT c FROM raw_data WHERE c IS NOT NULL AND raw_data.type = types.type ORDER BY priority LIMIT 1) AS c
FROM types
ORDER BY type;
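To actually populate grouped_data with any of these queries, wrap the select in an insert. For example, with the first approach (the column list comes from the grouped_data definition above; id is filled by its serial default):
insert into grouped_data (type, a, b, c)
select
type,
(array_agg(a order by priority asc) filter (where a is not null and a != ''))[1],
(array_agg(b order by priority asc) filter (where b is not null and b != ''))[1],
(array_agg(c order by priority asc) filter (where c is not null))[1]
from raw_data
group by type;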

Query to get count of distinct items in groupings

I have a table that stores created grouping for items from another table like this:
table1
id | group | qty
111 | cups | 1
222 | plates | 2
333 | spoons | 5
444 | null | 2
555 | knives | 2
table2
group_id | categories | count_inventory
A1 | {"x": ["cups", "plates"]} | 3
B1 | {"x": ["cups"]} | 1
C1 | {"x": ["cups", "spoons"]} | 6
C4 | {"x": ["spoons"]} | 5
So given the above, I want to write a query that returns the count of items from table1 for which a grouping has been created.
It may sound like the query below, but that is actually not what I'm looking for, because groups have to be manually created before they appear in table2, so an item from table1 may not exist in table2 because the grouping hasn't been created yet (e.g. id 555).
SELECT count(id)
FROM table1
WHERE "group" IS NOT NULL
The above will return 4, but I need something that looks at table2 and returns 3, which is the count of items from table1 whose group exists in the categories column of table2.
My real table can be pretty large, up to 100k+ rows, so I don't think it is efficient to check one by one whether each group string from table1 exists in table2, as that would probably take forever to run - or is that the only viable solution?
PPS: the categories column is not of JSON type, it's just a string.
Not sure that this will be faster, but you can prepare an aggregate of the existing categories. Something like this (you can also try set_union instead of array_agg with flatten and array_distinct):
SELECT array_distinct(flatten(array_agg(CAST(JSON_EXTRACT(categories, '$.x') as ARRAY(VARCHAR)))))
FROM table2
And check that group is in the result.
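For example, here is a sketch of that membership check against the aggregate, using Presto's contains function (existing and cats are illustrative names; sample data as in the answer below):
WITH existing AS (
SELECT array_distinct(flatten(array_agg(CAST(JSON_EXTRACT(categories, '$.x') AS ARRAY(VARCHAR))))) AS cats
FROM table2
WHERE categories IS NOT NULL
)
SELECT count(*) AS items_with_group
FROM table1
CROSS JOIN existing
WHERE "group" IS NOT NULL
AND contains(cats, "group"); -- returns 3 for the sample data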
Assuming that table2 does not contain any groups in its arrays that are not in table1, you can try the following:
WITH table1(id, "group", qty) AS (
SELECT *
FROM (VALUES (111, 'cups', 1),
(222, 'plates', 2),
(333, 'spoons', 5),
(444, null, 2),
(555, 'knives', 2))
),
table2(group_id, categories, count_inventory) as (
SELECT *
FROM (VALUES ('A1', CAST(MAP(ARRAY['x'], ARRAY[ARRAY['cups', 'plates']]) AS JSON), 3),
('B1', CAST(MAP(ARRAY['x'], ARRAY[ARRAY['cups']]) AS JSON), 1),
('C1', CAST(MAP(ARRAY['x'], ARRAY[ARRAY['cups', 'spoons']]) AS JSON), 6),
('C4', CAST(MAP(ARRAY['x'], ARRAY[ARRAY['spoons']]) AS JSON), 5)
))
SELECT reduce(
array_agg(CAST(json_extract(categories, '$.x') AS ARRAY(VARCHAR))),
array[],
(s, x) -> array_union(s, x),
x -> cardinality(x)
)
FROM table2 WHERE categories is not null;

Comparing a value of a row with the value of the previous row

I have a table in SQL Server that stores geology samples, and there is a rule that must be adhered to.
The rule is simple: a "DUP_2" sample must always come directly after a "DUP_1" sample (sometimes they are loaded inverted).
CREATE TABLE samples (
id INT
,name VARCHAR(5)
);
INSERT INTO samples VALUES (1, 'ASSAY');
INSERT INTO samples VALUES (2, 'DUP_1');
INSERT INTO samples VALUES (3, 'DUP_2');
INSERT INTO samples VALUES (4, 'ASSAY');
INSERT INTO samples VALUES (5, 'DUP_2');
INSERT INTO samples VALUES (6, 'DUP_1');
INSERT INTO samples VALUES (7, 'ASSAY');
id | name
1 | ASSAY
2 | DUP_1
3 | DUP_2
4 | ASSAY
5 | DUP_2
6 | DUP_1
7 | ASSAY
In this example I would like to show all rows where name equals 'DUP_2' and the name of the predecessor row (ordered by id) is different from 'DUP_1'.
In this case, it would be row 5 only.
I would appreciate very much if you help me.
You can use the LAG() window function, or you can use LEAD() - they are identical except for the direction of the ordering. That is, LAG(name) OVER ( ORDER BY id ) is the same as LEAD(name) OVER ( ORDER BY id DESC ).
WITH s1 ( id, name, prior_name ) AS (
SELECT id, name, LAG(name) OVER ( ORDER BY id ) AS prior_name
FROM samples
)
SELECT id, name
FROM s1
WHERE name = 'DUP_2'
AND COALESCE(prior_name, 'DUMMY') != 'DUP_1';
The reason for the COALESCE() at the end with the DUMMY value is that the first row has no predecessor, so its LAG() is NULL; we still want to return a DUP_2 record in that case, since it doesn't follow a DUP_1 record.
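As a side note, a quick way to see the LAG/LEAD equivalence mentioned above is to run both against the samples table (via_lag and via_lead are illustrative column names):
SELECT id, name,
LAG(name) OVER ( ORDER BY id ) AS via_lag,
LEAD(name) OVER ( ORDER BY id DESC ) AS via_lead -- same values as via_lag
FROM samples;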
You can use lag():
select s.*
from (select s.*,
lag(name) over (order by id) as prev_name
from samples s
) s
where name = 'DUP_2' and (prev_name <> 'DUP_1' or prev_name is null)

Run mode() function of each value in INT ARRAY

I have a table that holds an INT ARRAY data type representing some features (this is done instead of having a separate boolean column for each feature). The column is called feature_ids. If a record has a specific feature, the ID of the feature will be present in the feature_ids column. For context, the mapping of the feature_ids is as follows:
1: Fast
2: Expensive
3: Colorfull
4: Deadly
So in other words, I could also have had 4 columns called is_fast, is_expensive, is_colorfull and is_deadly - but I don't, since my real application has 100+ features, and they change quite a bit.
Now back to the question: I wanna do an aggregate mode() on the records in the table, returning the most "frequent" features to have (e.g. if it's more common to be "fast" than not, etc.). I want the result in the same format as the original feature_ids column, but where the ID of a feature is ONLY represented if it's more common to be there than not, within each group:
CREATE TABLE test (
id INT,
feature_ids integer[] DEFAULT '{}'::integer[],
age INT,
type character varying(255)
);
INSERT INTO test (id, age, feature_ids, type) VALUES (1, 10, '{1,2}', 'movie');
INSERT INTO test (id, age, feature_ids, type) VALUES (2, 2, '{1}', 'movie');
INSERT INTO test (id, age, feature_ids, type) VALUES (3, 9, '{1,2,4}', 'movie');
INSERT INTO test (id, age, feature_ids, type) VALUES (4, 11, '{1,2,3}', 'wine');
INSERT INTO test (id, age, feature_ids, type) VALUES (5, 12, '{1,2,4}', 'hat');
INSERT INTO test (id, age, feature_ids, type) VALUES (6, 12, '{1,2,3}', 'hat');
INSERT INTO test (id, age, feature_ids, type) VALUES (7, 8, '{1,4}', 'hat');
I wanna do a query something like this:
SELECT
type, avg(age) as avg_age, mode() within group (order by feature_ids) as most_frequent_features
from test group by "type"
The result I expect is:
type avg_age most_frequent_features
hat 10.6 [1,2,4]
movie 7.0 [1,2]
wine 11.0 [1,2,3]
I have made an example here: https://www.db-fiddle.com/f/rTP4w7264vDC5rqjef6Nai/1
I find this quite tricky. The following is a rather brute-force approach -- calculating the "mode" explicitly and then bringing in the other aggregates:
select tf.type, t.avg_age,
array_agg(feature_id) as features
from (select t.type, feature_id, count(*) as cnt,
dense_rank() over (partition by t.type order by count(*) desc) as seqnum
from test t cross join
unnest(feature_ids) feature_id
group by t.type, feature_id
) tf join
(select t.type, avg(age) as avg_age
from test t
group by t.type
) t
on tf.type = t.type
where seqnum <= 2
group by tf.type, t.avg_age;
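If the rule is taken literally as "a feature qualifies when it appears in more than half of the group's rows", the data-specific seqnum <= 2 cutoff can be replaced with a majority test. Here is a sketch under that assumption (tf and g are illustrative aliases):
select tf.type, g.avg_age,
array_agg(tf.feature_id order by tf.feature_id) as most_frequent_features
from (select t.type, f.feature_id, count(*) as cnt
from test t cross join
unnest(t.feature_ids) as f(feature_id)
group by t.type, f.feature_id
) tf join
(select type, avg(age) as avg_age, count(*) as n
from test
group by type
) g
on tf.type = g.type
where tf.cnt * 2 > g.n -- feature present in more than half of the group's rows
group by tf.type, g.avg_age;
On the sample data this produces the expected result shown in the question.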

Custom Rank Calculation by a percentage range

I have a requirement to calculate a custom rank. I need to calculate the Annualized % Return for all 6 companies, and their rank follows from their return %. Let's consider this as the input data. Now I need to calculate a Custom Ranking where, if another company's return is within a percentage point of Company A's, all these companies are assigned the same rank (as shown in the chart below). I have 6 companies, and that number is fixed.
So, in a nutshell, my requirement is to find which companies are within a percentage point of Company A's return, then convert their ranks to strings and concatenate them; keep the rest of the ranks the same and assign it all to a new variable.
Attached image is for illustration only.
The trick is to find the dense_rank() based on the absolute value of the Difference from A. A difference of 1.0% or less is treated as 0.
-- Sample Table
declare @company table
(
Company char,
AnnualReturns decimal(5,1)
)
-- Sample Data
insert into @company
values ('A', 5.5), ('B', 7.7), ('C', -1.3), ('D', 6.3), ('E', 5.4), ('F', 9.0)
-- The query
; with cte as
(
select *,
[Difference from A] = AnnualReturns - 5.5,
ActualRank = row_number() over (order by AnnualReturns desc),
dr = dense_rank() over (order by case when abs(AnnualReturns - 5.5) <= 1.0
then 0
else abs(AnnualReturns - 5.5)
end)
from @company
)
select Company, AnnualReturns, [Difference from A], ActualRank,
stuff(RequiredRank, 1, 1, '') as RequiredRank
from cte c
cross apply -- concatenate the rank
(
select '/' + convert(varchar(10), ActualRank)
from cte x
where x.dr = c.dr
order by ActualRank
for xml path('')
) rr (RequiredRank)
order by Company
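On SQL Server 2017 or later, the FOR XML PATH concatenation can be swapped for STRING_AGG, which avoids the stuff() cleanup. Here is a sketch reusing the same cte (an alternative, not part of the original query):
select c.Company, c.AnnualReturns, c.[Difference from A], c.ActualRank,
string_agg(convert(varchar(10), x.ActualRank), '/')
within group (order by x.ActualRank) as RequiredRank
from cte c
join cte x on x.dr = c.dr -- pair each company with everyone in its rank group
group by c.Company, c.AnnualReturns, c.[Difference from A], c.ActualRank
order by c.Company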