Run mode() function on each value in an INT ARRAY - SQL

I have a table with an INT ARRAY column representing a set of features (this is done instead of having a separate boolean column for each feature). The column is called feature_ids. If a record has a specific feature, the ID of that feature is present in the feature_ids column. For context, the feature IDs map as follows:
1: Fast
2: Expensive
3: Colorful
4: Deadly
In other words, I could also have had 4 columns called is_fast, is_expensive, is_colorful and is_deadly - but I don't, since my real application has 100+ features, and they change quite a bit.
Now back to the question: I want to do an aggregate mode() over the records in the table, returning the most "frequent" features to have (e.g. whether it's more common to be "fast" than not, etc.). I want it returned in the same format as the original feature_ids column, but where the ID of a feature is ONLY represented if, within each group, it's more common for it to be present than not:
CREATE TABLE test (
    id INT,
    feature_ids integer[] DEFAULT '{}'::integer[],
    age INT,
    type character varying(255)
);
INSERT INTO test (id, age, feature_ids, type) VALUES (1, 10, '{1,2}', 'movie');
INSERT INTO test (id, age, feature_ids, type) VALUES (2, 2, '{1}', 'movie');
INSERT INTO test (id, age, feature_ids, type) VALUES (3, 9, '{1,2,4}', 'movie');
INSERT INTO test (id, age, feature_ids, type) VALUES (4, 11, '{1,2,3}', 'wine');
INSERT INTO test (id, age, feature_ids, type) VALUES (5, 12, '{1,2,4}', 'hat');
INSERT INTO test (id, age, feature_ids, type) VALUES (6, 12, '{1,2,3}', 'hat');
INSERT INTO test (id, age, feature_ids, type) VALUES (7, 8, '{1,4}', 'hat');
I want to do a query something like this:
SELECT
type, avg(age) as avg_age, mode() within group (order by feature_ids) as most_frequent_features
from test group by "type"
The result I expect is:
type avg_age most_frequent_features
hat 10.6 [1,2,4]
movie 7.0 [1,2]
wine 11.0 [1,2,3]
I have made an example here: https://www.db-fiddle.com/f/rTP4w7264vDC5rqjef6Nai/1

I find this quite tricky. The following is a rather brute-force approach -- calculating the "mode" explicitly and then bringing in the other aggregates:
select tf.type, t.avg_age,
       array_agg(feature_id) as features
from (select t.type, feature_id, count(*) as cnt,
             dense_rank() over (partition by t.type order by count(*) desc) as seqnum
      from test t cross join
           unnest(feature_ids) feature_id
      group by t.type, feature_id
     ) tf join
     (select t.type, avg(age) as avg_age
      from test t
      group by t.type
     ) t
     on tf.type = t.type
where seqnum <= 2
group by tf.type, t.avg_age;
Here is a db<>fiddle.
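If the intent is literally "keep a feature only when it appears in more than half of the group's rows", the threshold can also be made explicit instead of relying on a rank cutoff. The following is just a sketch against the sample schema above; the cnt * 2 > n_rows condition is my reading of "more common to be there than not":

select f.type,
       t.avg_age,
       array_agg(f.feature_id order by f.feature_id) as most_frequent_features
from (select type, avg(age) as avg_age, count(*) as n_rows
      from test
      group by type
     ) t
     join
     (select t.type, feature_id, count(*) as cnt
      from test t cross join
           unnest(feature_ids) feature_id
      group by t.type, feature_id
     ) f
     on f.type = t.type
where f.cnt * 2 > t.n_rows   -- keep a feature only if it is present in more than half the group's rows
group by f.type, t.avg_age;

On the sample data this returns the same three rows as the expected output.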

Performing multiple inserts into a primary table and then into secondary tables using Postgres CTE or PLPGSQL

I have 2 tables with the following structures:
Table_1: id, name, age
Table_2: id, table_1_id, city, state
I need to perform a bulk insert into Table_1:
INSERT INTO Table_1(name, age)
VALUES ('A', 12), ('B', 13), ('C', 14)
RETURNING id
The ids returned from the first insert are to be used in the below query:
INSERT INTO Table_2(table_1_id, city, state)
VALUES (first_row_id, 'Austin', 'Texas'),
(second_row_id, 'Dallas', 'Texas'),
(third_row_id, 'Houston', 'Texas')
How can I achieve the same using CTE or any other efficient way with least amount of code in Postgres?
I have to admit, I've never considered a construct like this, but for what it's worth I believe this will do what you seek in a single statement, unless I've missed something:
with ids as (
    INSERT INTO Table_1(name, age)
    VALUES ('A', 12), ('B', 13), ('C', 14)
    RETURNING id
),
rowz as (
    select id, row_number() over (order by id) as rn
    from ids
)
insert into table_2 (table_1_id, city, state)
select
    r.id, v.city, v.state
from
    rowz r
    join (values (1, 'Austin', 'Texas'), (2, 'Dallas', 'Texas'), (3, 'Houston', 'Texas')) v (id, city, state) on
        r.rn = v.id
Just my $0.02, but I think it would be easier to follow and more scalable if instead you wrapped this in some code and did it that way (pick your favorite programming language or use PLPGSQL).
The flaw I see with this is that ideally you want to provide this:
A 12 Austin Texas
B 13 Houston Texas
And have the script do the rest. Here you have to mess with the individual values, and unless it's always 3 rows you can't really take advantage of parameters (at least not easily).
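Along those lines, here is a sketch of a variant where the whole batch is supplied once, up front, in a single VALUES list. It assumes name is unique within the batch, so the generated ids can be matched back by name rather than by position:

with data (name, age, city, state) as (
    values ('A', 12, 'Austin', 'Texas'),
           ('B', 13, 'Dallas', 'Texas'),
           ('C', 14, 'Houston', 'Texas')
),
ins as (
    insert into table_1 (name, age)
    select name, age
    from data
    returning id, name
)
insert into table_2 (table_1_id, city, state)
select i.id, d.city, d.state
from ins i
join data d on d.name = i.name;  -- match generated ids back by name (assumes unique names per batch)

This also sidesteps the assumption that RETURNING hands the ids back in the same order as the VALUES list.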

Comparing a value of a row with the value of the previous row

I have a table in SQL Server that stores geology samples, and there is a rule that must be adhered to.
The rule is simple: a "DUP_2" sample must always come after a "DUP_1" sample (sometimes they are loaded inverted).
CREATE TABLE samples (
id INT
,name VARCHAR(5)
);
INSERT INTO samples VALUES (1, 'ASSAY');
INSERT INTO samples VALUES (2, 'DUP_1');
INSERT INTO samples VALUES (3, 'DUP_2');
INSERT INTO samples VALUES (4, 'ASSAY');
INSERT INTO samples VALUES (5, 'DUP_2');
INSERT INTO samples VALUES (6, 'DUP_1');
INSERT INTO samples VALUES (7, 'ASSAY');
id | name
---+------
 1 | ASSAY
 2 | DUP_1
 3 | DUP_2
 4 | ASSAY
 5 | DUP_2
 6 | DUP_1
 7 | ASSAY
In this example I would like to show all rows where name is equal to 'DUP_2' and the name of the predecessor row (by id) is different from 'DUP_1'.
In this case, it would be row 5 only.
I would very much appreciate your help.
You can use the LAG() window function, or you can use LEAD() - they are identical except for the way in which they are ordered. That is, LAG(name) OVER ( ORDER BY id ) is the same as LEAD(name) OVER ( ORDER BY id DESC ). (You can read more about these functions here.)
WITH s1 ( id, name, prior_name ) AS (
    SELECT id, name, LAG(name) OVER ( ORDER BY id ) AS prior_name
    FROM samples
)
SELECT id, name
FROM s1
WHERE name = 'DUP_2'
  AND COALESCE(prior_name, 'DUMMY') != 'DUP_1';
The reason for the COALESCE() at the end with the DUMMY value is that the first row won't have a LAG(); it will be NULL, and we still want to return that DUP_2 record since it doesn't follow a DUP_1 record.
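To illustrate the equivalence mentioned above, here is the same check written with LEAD() over a descending order (just a sketch against the same samples table):

WITH s1 ( id, name, prior_name ) AS (
    SELECT id, name, LEAD(name) OVER ( ORDER BY id DESC ) AS prior_name  -- LEAD over descending id = previous row by ascending id
    FROM samples
)
SELECT id, name
FROM s1
WHERE name = 'DUP_2'
  AND COALESCE(prior_name, 'DUMMY') != 'DUP_1';

Either form should return only row 5 for the sample data.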
You can use lag():
select s.*
from (select s.*,
             lag(name) over (order by id) as prev_name
      from samples s
     ) s
where name = 'DUP_2' and (prev_name <> 'DUP_1' or prev_name is null)

Presto filter an array during aggregation

I would like to filter an aggregated array depending on all values associated with an id. The values are strings and can be of three types: all-x:y, x:y and empty (here x and y are arbitrary substrings of the values).
I have a few conditions:
If an id has x:y then the result should contain x:y.
If an id always has all-x:y then the resulting aggregation should have all-x:y
If an id sometimes has all-x:y then the resulting aggregation should have x:y
For example, with the following:
WITH
my_table(id, my_values) AS (
VALUES
(1, ['all-a','all-b']),
(2, ['all-c','b']),
(3, ['a','b','c']),
(1, ['all-a']),
(2, []),
(3, ['all-c']),
),
The result should be:
(1, ['all-a','b']),
(2, ['c','b']),
(3, ['a','b','c']),
I have worked multiple hours on this, but it seems like it's not feasible.
I came up with the following, but it feels like it cannot work, because I can't check for the presence of all-x in all arrays, which is what would go in <<IN ALL>>:
SELECT
    id,
    SET_UNION(
        CASE
            WHEN SPLIT_PART(my_table.values,'-',1) = 'all' THEN
                CASE
                    WHEN <<my_table.values IN ALL>> THEN my_table.values
                    ELSE REPLACE(my_table.values,'all-')
                END
            ELSE my_table.values
        END
    ) AS values
FROM my_table
GROUP BY 1
I would need to check that all of the arrays for a specific id contain all-x, and that's where I'm struggling to find a solution.
After a few hours of searching, I am starting to believe that it is not feasible.
Any help is appreciated. Thank you for reading.
This should do what you want:
WITH my_table(id, my_values) AS (
    VALUES
        (1, array['all-a','all-b']),
        (2, array['all-c','b']),
        (3, array['a','b','c']),
        (1, array['all-a']),
        (2, array[]),
        (3, array['all-c'])
),
with_group_counts AS (
    SELECT *, count(*) OVER (PARTITION BY id) group_count -- to see if the number of all-X occurrences matches the number of rows for a given id
    FROM my_table
),
normalized AS (
    SELECT
        id,
        if(
            count(*) OVER (PARTITION BY id, value) = group_count AND starts_with(value, 'all-'), -- if it's an all-X value and every original row for the given id contains it ...
            value,
            if(starts_with(value, 'all-'), substr(value, 5), value)) AS extracted
    FROM with_group_counts CROSS JOIN UNNEST(with_group_counts.my_values) t(value)
)
SELECT id, array_agg(DISTINCT extracted)
FROM normalized
GROUP BY id
The trick is to compute the total number of rows for each id in the original table via the count(*) OVER (PARTITION BY id) expression in the with_group_counts subquery. We can then use that value to determine whether a given value should be treated as an all-x or whether the x should be extracted. That's handled by the following expression:
if(
    count(*) OVER (PARTITION BY id, value) = group_count AND starts_with(value, 'all-'),
    value,
    if(starts_with(value, 'all-'), substr(value, 5), value))
For more information about window functions in Presto, check out the documentation. You can find the documentation for UNNEST here.
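The same idea can also be expressed without window functions, by pre-aggregating the per-id row counts and the per-(id, value) occurrence counts in two grouped subqueries. This is only a sketch of that variant (same assumed Presto syntax as above; like the query above, it assumes a value is not repeated within a single array):

WITH my_table(id, my_values) AS (
    VALUES
        (1, array['all-a','all-b']),
        (2, array['all-c','b']),
        (3, array['a','b','c']),
        (1, array['all-a']),
        (2, CAST(array[] AS array(varchar))),  -- explicit cast so the empty array has a type
        (3, array['all-c'])
),
group_counts AS (
    SELECT id, count(*) AS group_count         -- number of original rows per id
    FROM my_table
    GROUP BY id
),
exploded AS (
    SELECT id, value, count(*) AS value_count  -- number of rows (per id) that contain each value
    FROM my_table CROSS JOIN UNNEST(my_table.my_values) t(value)
    GROUP BY id, value
)
SELECT e.id,
       array_agg(DISTINCT
           if(starts_with(e.value, 'all-') AND e.value_count < g.group_count,
              substr(e.value, 5),              -- demote all-x to x when it is missing from some rows
              e.value)) AS my_values
FROM exploded e
JOIN group_counts g ON g.id = e.id
GROUP BY e.id

On the sample data this produces the same three rows as the expected result (up to array ordering).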

oracle correlated subquery using distinct listagg

I have an interesting query I'm trying to figure out. I have a view which is getting a column added to it. This column is pivoted data coming from other tables, formed into a single row. Now, I need to wipe out duplicate entries in this pivoted data. LISTAGG is great for getting the data into a single row, but I need to make it unique. While I know how to make it unique, I'm tripping up on the fact that correlated sub-queries only go 1 level deep. So... I'm not really sure how to get a distinct list of values. I can get it to work just fine if I don't do the distinct. Anyone out there able to work some SQL magic?
Sample data:
drop table test;
drop table test_widget;
create table test (id number, description Varchar2(20));
create table test_widget (widget_id number, test_fk number, widget_type varchar2(20));
insert into test values(1, 'cog');
insert into test values(2, 'wheel');
insert into test values(3, 'spring');
insert into test_widget values(1, 1, 'A');
insert into test_widget values(2, 1, 'A');
insert into test_widget values(3, 1, 'B');
insert into test_widget values(4, 1, 'A');
insert into test_widget values(5, 2, 'C');
insert into test_widget values(6, 2, 'C');
insert into test_widget values(7, 2, 'B');
insert into test_widget values(8, 3, 'A');
insert into test_widget values(9, 3, 'C');
insert into test_widget values(10, 3, 'B');
insert into test_widget values(11, 3, 'B');
insert into test_widget values(12, 3, 'A');
commit;
Here is an example of the query that works, but shows duplicate data:
SELECT A.ID
     , A.DESCRIPTION
     , (SELECT LISTAGG (WIDGET_TYPE, ', ') WITHIN GROUP (ORDER BY WIDGET_TYPE)
          FROM TEST_WIDGET
         WHERE TEST_FK = A.ID) widget_types
FROM TEST A
Here is an example of what does NOT work due to the depth of where I try to reference the ID:
SELECT A.ID
     , A.DESCRIPTION
     , (SELECT LISTAGG (WIDGET_TYPE, ', ') WITHIN GROUP (ORDER BY WIDGET_TYPE)
          FROM (SELECT DISTINCT WIDGET_TYPE
                  FROM TEST_WIDGET
                 WHERE TEST_FK = A.ID)
        ) WIDGET_TYPES
FROM TEST A
Here is what I want displayed:
1 cog A, B
2 wheel B, C
3 spring A, B, C
If anyone knows this off the top of their head, that would be fantastic! Otherwise, I can post up some sample create statements to help you with dummy data to figure out the query.
You can apply the distinct in a subquery, which also has the join - avoiding the level issue:
SELECT ID
, DESCRIPTION
, LISTAGG (WIDGET_TYPE, ', ')
WITHIN GROUP (ORDER BY WIDGET_TYPE) AS widget_types
FROM (
SELECT DISTINCT A.ID, A.DESCRIPTION, B.WIDGET_TYPE
FROM TEST A
JOIN TEST_WIDGET B
ON B.TEST_FK = A.ID
)
GROUP BY ID, DESCRIPTION
ORDER BY ID;
ID DESCRIPTION WIDGET_TYPES
---------- -------------------- --------------------
1 cog A, B
2 wheel B, C
3 spring A, B, C
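As a side note: on Oracle 19c and later, LISTAGG accepts DISTINCT directly, so the original correlated form works without the extra derived table. A sketch, assuming 19c is available:

SELECT A.ID
     , A.DESCRIPTION
     , (SELECT LISTAGG (DISTINCT WIDGET_TYPE, ', ') WITHIN GROUP (ORDER BY WIDGET_TYPE)
          FROM TEST_WIDGET
         WHERE TEST_FK = A.ID) widget_types
FROM TEST A;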
I was in a unique situation using the Pentaho reports writer and some inconsistent data. The Pentaho writer uses Oracle to query data, but has limitations. The data pieces were unique but not classified in a consistent manner, so I created a nested listagg inside of a left join to present the data the way I wanted to:
left join
(
    select staff_id,
           listagg(thisThing, ' --- '||chr(10)) within group (order by thisThing) as SCHED_1
    from
    (
        SELECT staff_id,
               RPT_STAFF_SHIFTS.ORGANIZATION||': '||listagg(
                   RPT_STAFF_SHIFTS.DAYS_OF_WEEK
                   , ',') within group (order by BEGIN_DATE desc) as thisThing
        FROM "RPT_STAFF_SHIFTS"
        where "RPT_STAFF_SHIFTS"."END_DATE" is null
        group by staff_id, organization
    )
    group by staff_id
) schedule_1 on schedule_1.staff_id = "RPT_STAFF"."STAFF_ID"
where "RPT_STAFF"."STAFF_ID" = '555555'
This is a different approach than using the nested query, but in some situations it might work better, by taking the level issue into account when developing the query and taking an extra step to fully concatenate the results.

SQL Filter by attributes (Design)

I have a database of products which I'd like to filter by arbitrary categories. Let's say for the sake of an example that I run a garage. I have a section of products which are cars.
Each car should have a collection of attributes; all cars have the same number and type of attributes - for instance colour:red, doors:2, make:ford - with those same attributes set to various values on all the cars.
Gut feeling tells me that it would be best to add "colour", "doors" and "make" columns to the product table.
HOWEVER: Not ALL the products in the table are cars. Perhaps I would like to list tyres on the page of cars. Obviously, "colour" and "doors" won't apply to tyres. Even so, if a user selects colour=red as a filter, I would still like the tyres to be shown, as they lack the colour attribute.
Mulling it over (and I'm really not a database guy, so I apologise if this approach is horrible) I considered having a single "attributes" column which I could fill with an arbitrary number of arbitrarily named attributes, then use SQL's string functions to do the filtering. I guess you could even use a bit field here if you planned carefully. This seems hackish to me though; I'd be interested to know how some of the larger sites such as Amazon do this.
What are the issues with these approaches, can anyone recommend any alternatives or shed any light on the subject for me?
Thanks in advance
You should read about database normalization. It is generally not a good idea to use concatenated strings as values in a single column. I made a very small sqlfiddle for you to start playing around. This does not really solve all your problems, but it may lead you in the right direction.
Schema:
CREATE TABLE product (id int, name varchar(200), info varchar(200));
INSERT INTO product (id, name, info) VALUES (100, "Porsche", "cool");
...
INSERT INTO product (id, name, info) VALUES (103, "Tires", "you need them!");
CREATE TABLE attr (id int, product_id int, a_name varchar(200), a_value varchar(200));
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (1, 100, "color", "black");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (2, 100, "doors", "2");
...
A Query:
SELECT * FROM product INNER JOIN attr ON attr.product_id=product.id
WHERE attr.a_name="doors" AND attr.a_value = "2"
Anyone reading this in the future, I managed to get the results I wanted thanks to luksch taking the time to help me out!!! Thanks!!!
Using this layout:
CREATE TABLE product (id int, name varchar(200));
INSERT INTO product (id, name) VALUES (100, "Red Porsche");
INSERT INTO product (id, name) VALUES (101, "Red Ferrari V8");
INSERT INTO product (id, name) VALUES (102, "Red Ferrari V12");
INSERT INTO product (id, name) VALUES (103, "Blue Porsche");
INSERT INTO product (id, name) VALUES (104, "Blue Ferrari V8");
INSERT INTO product (id, name) VALUES (105, "Blue Ferrari V12");
INSERT INTO product (id, name) VALUES (106, "Snow Tires");
INSERT INTO product (id, name) VALUES (107, "Fluffy Dice");
CREATE TABLE attr (id int, product_id int, a_name varchar(200), a_value varchar(200));
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (1, 100, "colour", "red");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (1, 101, "colour", "red");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (1, 101, "cylinders", "8");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (1, 102, "colour", "red");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (1, 102, "cylinders", "12");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (1, 103, "colour", "blue");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (1, 104, "colour", "blue");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (1, 104, "cylinders", "8");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (1, 105, "colour", "blue");
INSERT INTO attr (id, product_id, a_name, a_value) VALUES (1, 105, "cylinders", "12");
I achieved the result I wanted, which was two things:
Firstly, I wanted to be able to select products by attribute, say by colour and cylinders, but also show any products which have neither the colour nor the cylinders attribute, which I achieved with this query:
SELECT DISTINCT product.id, name, a_value
FROM product
LEFT JOIN attr
ON product_id=product.id
WHERE
(
(a_name="colour" AND a_value="blue")
OR
(a_name IS NULL)
)
AND product.id IN
(
SELECT product.id
FROM product
LEFT JOIN attr
ON product_id=product.id
WHERE
(a_name="cylinders" AND a_value="12")
OR
(a_name IS NULL)
)
This lists all the blue cars with 12 cylinders, and also lists the tyres and fluffy dice, since they have neither a colour nor a cylinder count. This can easily be adapted to filter on one attribute, or you can add more AND / IN clauses to add more filters.
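Just as a sketch of an alternative on the same schema (using the same double-quoted string style as the fiddle above): the filter can also be written as one pair of EXISTS checks per selected attribute, where a product passes if it either lacks that attribute entirely or has it with the requested value. Note the slightly different reading: here a product missing only the cylinders attribute (like the Blue Porsche) would still pass a cylinders filter, whereas the query above only lets through products with no attributes at all.

SELECT p.id, p.name
FROM product p
WHERE
    -- colour filter: no colour attribute at all, or colour = "blue"
    (NOT EXISTS (SELECT 1 FROM attr a WHERE a.product_id = p.id AND a.a_name = "colour")
     OR EXISTS (SELECT 1 FROM attr a WHERE a.product_id = p.id AND a.a_name = "colour" AND a.a_value = "blue"))
AND
    -- cylinders filter: no cylinders attribute at all, or cylinders = "12"
    (NOT EXISTS (SELECT 1 FROM attr a WHERE a.product_id = p.id AND a.a_name = "cylinders")
     OR EXISTS (SELECT 1 FROM attr a WHERE a.product_id = p.id AND a.a_name = "cylinders" AND a.a_value = "12"))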
And I also wanted to be able to list all relevant attributes (I use WHERE 1 in this example, but in practice this would be WHERE idfolders=? to list all attributes relevant to a specific folder):
SELECT DISTINCT a_value, a_name
FROM product
INNER JOIN attr
ON product_id=product.id
WHERE 1