Select concatenated columns based on criteria list in other table - sql

I have a table1
line
a
b
c
d
e
f
g
h
1
18
2
2
22
0
2
1
2
2
20
2
2
2
0
0
0
2
3
10
2
2
222
0
2
1
2
4
12
2
2
3
0
0
0
0
5
15
2
2
3
0
0
0
0
And a table2
 line
criteria
1
 a,b
2
 b,c,f,h
3
 a,b,e,g,h
4
 c,e
I am using this code to see/select the unique results of concated/joined columns, like concat(c,',',d), concat(b,',',d,',',g) and so on from table1 and is working perfectly:
SELECT DISTINCT(CONCAT(c,',',d))
FROM table1
But, instead of writing manually like concat(c,',',d), I want to refer to table2.criteria to get columns references to be concated/joined from table1 so that i can see the entire unique results against each concated criteria
Tried this, but getting an error:
SELECT DISTINCT(SELECT criteria FROM table2)
FROM table1
ERROR: more than one row returned by a subquery used as an expression
SQL state: 21000
The expected unique result is something like this;
| criteria | result |
| ------------ | ---------- |
| a,b | 15,2 |
| a,b | 10,2 |
| a,b | 20,2 |
| a,b | 12,2 |
| a,b | 18,2 |
| b,c,f,h | 2,2,2,2 |
| b,c,f,h | 2,2,0,2 |
| b,c,f,h | 2,2,0,0 |
| a,b,e,g,h | 20,2,0,0,2 |
| a,b,e,g,h | 12,2,0,0,0 |
| a,b,e,g,h | 15,2,0,0,0 |
| a,b,e,g,h | 10,2,0,1,2 |
| a,b,e,g,h | 18,2,0,1,2 |
| c,e | 2,0 |

SQL does not allow to parameterize identifiers. There are various ways to work around this restriction.
It's unclear from the question, but according to comments you want to concatenate the given pattern for every row in table1.
1. Dynamic SQL
Create a helper function (once!) that concatenates and executes statements dynamically.
Basics:
Define table and column names as arguments in a plpgsql function?
CREATE OR REPLACE FUNCTION f_concat_cols(_cols text)
RETURNS TABLE (result text)
LANGUAGE plpgsql AS
$func$
BEGIN
RETURN QUERY EXECUTE format(
$q$SELECT concat_ws(',', %s) FROM table1 ORDER BY line$q$, _cols);
END
$func$;
It's a set-returning function (a.k.a. "table function"), to return one result row for every row in table1 for each given pattern.
Warning: Converting user input to code like this is a prime opportunity for SQL injection. You must make sure that table1.criteria can only hold valid strings!
To get the full result matrix (with distinct results per row in table2), the query is simple now:
SELECT DISTINCT line AS t2_line, criteria, t1.*
FROM table2, f_concat_cols(criteria) t1
ORDER BY t2_line;
2. Workaround with conversion to JSON
SELECT DISTINCT t2.line AS t2_line, t2.criteria, c.*
FROM table2 t2
CROSS JOIN (SELECT line, to_json(t) AS js FROM table1 t) t1
CROSS JOIN LATERAL (
SELECT string_agg(t1.js->>sub, ',') AS result
FROM unnest(string_to_array(t2.criteria, ',')) sub
) c
ORDER BY t2_line;
After converting rows from t1 to a JSON record, we can access keys (converted from column names) directly.
I unnest the pattern, access each single key, and aggregate the result in LATERAL subquery. See:
What is the difference between a LATERAL JOIN and a subquery in PostgreSQL?
You could encapsulate the logic in a function like in 1., but that's optional in this case.
3. Workaround with conversion to Postgres arrays
SELECT DISTINCT t2.line AS t2_line, t2.criteria, c.*
FROM table2 t2
CROSS JOIN (SELECT line, ARRAY [a,b,c,d,e,f,g,h] AS arr FROM table1 t) t1
CROSS JOIN LATERAL (
SELECT string_agg(t1.arr[idx]::text, ',') AS result
FROM unnest(string_to_array(translate(t2.criteria, 'abcdefgh', '12345678'), ',')::int[]) idx
) c
ORDER BY t2_line;
Similar to the "trick" with JSON, we can avoid dynamic SQL by converting columns to a plain Postgres array. Then project column names to integer array indices. I use translate() for the simple case, but that only works for single letters! Use replace() or regexp_replace() or some other method for longer names.
The rest is like the above.
fiddle - showing all.

Related

Snowflake returns 'invalid query block' error when using `=ANY()` subquery operator

I'm trying to filter a table with a list of strings as a parameter, but as I want to make the parameter optional (in Python sql user case) I can't use IN operator.
With postgresql I was able to build the query like this:
SELECT *
FROM table1
WHERE (id = ANY(ARRAY[%(param_id)s]::INT[]) OR %(param_id)s IS NULL)
;
Then in Python one could choose to pass a list of param_id or just None, which will return all results from table1. E.g.
pandas.read_sql(query, con=con, params={param_id: [id_list or None]})
However I couldn't do the same with snowflake because even the following query fails:
SELECT *
FROM table1
WHERE id = ANY(param_id)
;
Does Snowflake not have ANY operator? Because it is in their doc.
If the parameter is a single string literal 1,2,3 then it first needs to be parsed to multiple rows SPLIT_TO_TABLE
SELECT *
FROM table1
WHERE id IN (SELECT s.value
FROM TABLE (SPLIT_TO_TABLE(%(param_id)s, ',')) AS s);
Agree with #Yuya. This is not very clear in documentation. As per doc -
"IN is shorthand for = ANY, and is subject to the same restrictions as ANY subqueries."
However, it does not work this way - IN works with a IN list where as ANY only works with subquery.
Example -
select * from values (1,2),(2,3),(4,5);
+---------+---------+
| COLUMN1 | COLUMN2 |
|---------+---------|
| 1 | 2 |
| 2 | 3 |
| 4 | 5 |
+---------+---------+
IN works fine with list of literals -
select * from values (1,2),(2,3),(4,5) where column1 in (1,2);
+---------+---------+
| COLUMN1 | COLUMN2 |
|---------+---------|
| 1 | 2 |
| 2 | 3 |
+---------+---------+
Below gives error (though as per doc IN and = ANY are same) -
select * from values (1,2),(2,3),(4,5) where column1 = ANY (1,2);
002076 (42601): SQL compilation error:
Invalid query block: (.
Using subquery ANY runs fine -
select * from values (1,2),(2,3),(4,5) where column1 = ANY (select column1 from values (1),(2));
+---------+---------+
| COLUMN1 | COLUMN2 |
|---------+---------|
| 1 | 2 |
| 2 | 3 |
+---------+---------+
Would it not make more sense for both snowflake and postgresql to have two functions/store procedures that have one/two parameters.
Then the one with the “default” just dose not asked this fake question (is in/any some none) and is simpler. Albeit it you question is interesting.

Counting the total number of rows with SELECT DISTINCT ON without using a subquery

I have performing some queries using PostgreSQL SELECT DISTINCT ON syntax. I would like to have the query return the total number of rows alongside with every result row.
Assume I have a table my_table like the following:
CREATE TABLE my_table(
id int,
my_field text,
id_reference bigint
);
I then have a couple of values:
id | my_field | id_reference
----+----------+--------------
1 | a | 1
1 | b | 2
2 | a | 3
2 | c | 4
3 | x | 5
Basically my_table contains some versioned data. The id_reference is a reference to a global version of the database. Every change to the database will increase the global version number and changes will always add new rows to the tables (instead of updating/deleting values) and they will insert the new version number.
My goal is to perform a query that will only retrieve the latest values in the table, alongside with the total number of rows.
For example, in the above case I would like to retrieve the following output:
| total | id | my_field | id_reference |
+-------+----+----------+--------------+
| 3 | 1 | b | 2 |
+-------+----+----------+--------------+
| 3 | 2 | c | 4 |
+-------+----+----------+--------------+
| 3 | 3 | x | 5 |
+-------+----+----------+--------------+
My attemp is the following:
select distinct on (id)
count(*) over () as total,
*
from my_table
order by id, id_reference desc
This returns almost the correct output, except that total is the number of rows in my_table instead of being the number of rows of the resulting query:
total | id | my_field | id_reference
-------+----+----------+--------------
5 | 1 | b | 2
5 | 2 | c | 4
5 | 3 | x | 5
(3 rows)
As you can see it has 5 instead of the expected 3.
I can fix this by using a subquery and count as an aggregate function:
with my_values as (
select distinct on (id)
*
from my_table
order by id, id_reference desc
)
select count(*) over (), * from my_values
Which produces my expected output.
My question: is there a way to avoid using this subquery and have something similar to count(*) over () return the result I want?
You are looking at my_table 3 ways:
to find the latest id_reference for each id
to find my_field for the latest id_reference for each id
to count the distinct number of ids in the table
I therefore prefer this solution:
select
c.id_count as total,
a.id,
a.my_field,
b.max_id_reference
from
my_table a
join
(
select
id,
max(id_reference) as max_id_reference
from
my_table
group by
id
) b
on
a.id = b.id and
a.id_reference = b.max_id_reference
join
(
select
count(distinct id) as id_count
from
my_table
) c
on true;
This is a bit longer (especially the long thin way I write SQL) but it makes it clear what is happening. If you come back to it in a few months time (somebody usually does) then it will take less time to understand what is going on.
The "on true" at the end is a deliberate cartesian product because there can only ever be exactly one result from the subquery "c" and you do want a cartesian product with that.
There is nothing necessarily wrong with subqueries.

Handling multiple return values in subquery

I have the following data:
cte
=================
gp_id | m_ids
------|----------
1 | {123}
2 | {432,222}
3 | {123,222}
And a function with a signature like this (which in fact returns not a table but a couple of ids):
FUNCTION foo(m_ids integer[])
RETURNS TABLE (
first_id integer,
second_id integer
)
Now, I've got to iterate over each row and perform some calculations with that function, so I would get something like this:
gp_id | first_id | second_id
------|----------|-----------
1 | 25 | 25
2 | 13 | 24
3 | 25 | 11
To achieve that I tried the following code:
SELECT gp_id,
(
SELECT *
FROM foo(
(
SELECT m_ids
FROM cte c2
WHERE c2.gp_id = c1.gp_id)) limit 1)
FROM cte c1
The problem is in the SELECT * statement. If I use SELECT first_id, everything works well (except for that I have to run two consecutive queries, which I'd like to avoid, obviously), but in the former case I'm getting the error
subquery must return only one column
which is somewhat expected.
So how can I correctly iterate over the table in one single query?
Use the function in a lateral join:
select gp_id, first_id, second_id
from cte,
lateral foo(m_ids);

Join tables with unknown number of rows without repeating column that is joined by

Here's my quandary:
I need to join all columns of two tables based on a primary key, but I don't want to repeat the primary key in the results.
The second table has the primary key and then unknown number and names of columns.
So essentially I want
SELECT * (except for b.PK) FROM
TableA a
JOIN TableB b ON a.PK = b.PK
The obvious solution would be to select all columns explicitly from table a except for a.PK, but let's say that I don't know the number or names of columns in table a either (except I know it has the PK).
So to sum:
How do I join two tables by their PKs, where I don't know the rest of their columns explicitly, and without repeating the PK in the results?
EDIT: (Using T-SQL with SQL Server)
Something like SELECT * except column foo FROM ... doesn't exist. But you can use a natural join, which eliminates redundant columns. You haven't mentioned your RDBMS, so here's an explanation from the MySQL manual. A natural join is standard SQL though.
The columns of a NATURAL join or a USING join may be different from
previously. Specifically, redundant output columns no longer appear,
and the order of columns for SELECT * expansion may be different from
before.
Consider this set of statements:
CREATE TABLE t1 (i INT, j INT);
CREATE TABLE t2 (k INT, j INT);
INSERT INTO t1 VALUES(1,1);
INSERT INTO t2 VALUES(1,1);
SELECT * FROM t1 NATURAL JOIN t2;
SELECT * FROM t1 JOIN t2 USING (j);
Previously, the statements produced this output:
+------+------+------+------+
| i | j | k | j |
+------+------+------+------+
| 1 | 1 | 1 | 1 |
+------+------+------+------+
+------+------+------+------+
| i | j | k | j |
+------+------+------+------+
| 1 | 1 | 1 | 1 |
+------+------+------+------+
In the first SELECT statement, column j appears in both tables and
thus becomes a join column, so, according to standard SQL, it should
appear only once in the output, not twice. Similarly, in the second
SELECT statement, column j is named in the USING clause and should
appear only once in the output, not twice. But in both cases, the
redundant column is not eliminated. Also, the order of the columns is
not correct according to standard SQL.
Now the statements produce this output:
+------+------+------+
| j | i | k |
+------+------+------+
| 1 | 1 | 1 |
+------+------+------+
+------+------+------+
| j | i | k |
+------+------+------+
| 1 | 1 | 1 |
+------+------+------+
The redundant column is eliminated and the column order is correct
according to standard SQL

Deleting similar columns in SQL

In PostgreSQL 8.3, let's say I have a table called widgets with the following:
id | type | count
--------------------
1 | A | 21
2 | A | 29
3 | C | 4
4 | B | 1
5 | C | 4
6 | C | 3
7 | B | 14
I want to remove duplicates based upon the type column, leaving only those with the highest count column value in the table. The final data would look like this:
id | type | count
--------------------
2 | A | 29
3 | C | 4 /* `id` for this record might be '5' depending on your query */
7 | B | 14
I feel like I'm close, but I can't seem to wrap my head around a query that works to get rid of the duplicate columns.
count is a sql reserve word so it'll have to be escaped somehow. I can't remember the syntax for doing that in Postgres off the top of my head so I just surrounded it with square braces (change it if that isn't correct). In any case, the following should theoretically work (but I didn't actually test it):
delete from widgets where id not in (
select max(w2.id) from widgets as w2 inner join
(select max(w1.[count]) as [count], type from widgets as w1 group by w1.type) as sq
on sq.[count]=w2.[count] and sq.type=w2.type group by w2.[count]
);
There is a slightly simpler answer than Asaph's, with EXISTS SQL operator :
DELETE FROM widgets AS a
WHERE EXISTS
(SELECT * FROM widgets AS b
WHERE (a.type = b.type AND b.count > a.count)
OR (b.id > a.id AND a.type = b.type AND b.count = a.count))
EXISTS operator returns TRUE if the following SQL statement returns at least one record.
According to your requirements, seems to me that this should work:
DELETE
FROM widgets
WHERE type NOT IN
(
SELECT type, MAX(count)
FROM widgets
GROUP BY type
)