Translating an Excel concept into SQL

Let's say I have the following range in Excel, named MyRange (cells B3:D6):
B    | C     | D
-----+-------+-----
1    | 1     | 1
2    | other | 2
TRUE | 3     | 3
4    | 4     | 4
This isn't a table by any means; it's more a collection of Variant values entered into cells. Excel makes it easy to sum these values with =SUM(B3:D6), which gives 25. Let's not go into the details of type checking or anything like that, and just assume that SUM will happily skip values that don't make sense.
If we were translating this concept into SQL, what would be the most natural way to do this? The few approaches that came to mind are (ignore type errors for now):
MyRange returns an array of values:
-- myRangeAsList = [1,1,1,2, ...]
SELECT SUM(elem) FROM UNNEST(myRangeAsList) AS r (elem);
MyRange returns a table-valued function of a single column (basically the opposite of a list):
-- myRangeAsCol = (SELECT 1 UNION ALL SELECT 1 UNION ALL ...
SELECT SUM(elem) FROM myRangeAsCol as r (elem);
Or, perhaps more 'correctly', return a 3-columned table such as:
-- myRangeAsTable = (SELECT 1,1,1 UNION ALL SELECT 2,'other',2 UNION ALL ...
SELECT SUM(a + b + c) FROM myRangeAsTable AS r (a, b, c);
Unfortunately, I think this makes things the most difficult to work with, as we now have to combine an unknown number of columns.
Perhaps returning a single column is the easiest of the above to work with, but even that takes a very simple concept -- SUM(myRange) -- and turns it into something that is anything but simple: SELECT SUM(elem) FROM myRangeAsCol AS r (elem).
Perhaps this could also just be wrapped in a function for convenience, for example:

Just a possible direction to think in:
create temp function extract_values (input string)
returns array<string> language js as """
return Object.values(JSON.parse(input));
""";
with myrangeastable as (
select '1' a, '1' b, '1' c union all
select '2', 'other', '2' union all
select 'true', '3', '3' union all
select '4', '4', '4'
)
select sum(safe_cast(value as float64)) range_sum
from myrangeastable t,
unnest(extract_values(to_json_string(t))) value
with output:
range_sum
25.0
Note: no columns are explicitly referenced, so this should work for any sized range without any changes to the code.
Depending on the specific use case, I think the above can be wrapped into something friendlier for someone coming from Excel, for example the sketch below.
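For instance, a minimal sketch of such a wrapper, assuming BigQuery; range_sum is a hypothetical helper name, not part of the answer above:
create temp function range_sum(input string)
returns float64 language js as """
  // hypothetical helper: sum every value in the row that parses as a number
  return Object.values(JSON.parse(input))
    .map(Number)
    .filter(v => !isNaN(v))
    .reduce((a, b) => a + b, 0);
""";
with myrangeastable as (
  select '1' a, '1' b, '1' c union all
  select '2', 'other', '2' union all
  select 'true', '3', '3' union all
  select '4', '4', '4'
)
select sum(range_sum(to_json_string(t))) as range_sum
from myrangeastable t
With that, the final query reads almost like Excel's SUM(MyRange).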

I'll try to lay out atomic, pure SQL principles, starting with the obvious items and moving to the more complicated ones. The intention is that all items can be used in any RDBMS:
SQL is basically designed to query tabular data which has relations (hence the name, Structured Query Language).
A range in Excel is, for SQL, a table. (Yes, you can have some other structures in different DBs, but keep it simple so you can use the concept across different types of DBs.)
Now we accept that a range in Excel is a table in a database. The next step is how to map the columns and rows of an Excel range to a DB table. It is straightforward: an Excel range column is a column in the DB, and a row is a row. So why is this a separate item? Because the main difference between the two is that in DBs adding a new column is usually a pain; DB tables are almost exclusively designed for new rows, not for new columns. (Of course there are methods to add new columns, and there even exist column-based DBs, but these are out of the scope of this answer.)
Items 2 and 3 in Excel and in a DB:
/*
Item 2: Table
the range in the excel is modeled as the below test_table
Item 3: Columns
id keeps the excel row number
b, c, d are the corresponding b, c, d columns of the excel
*/
create table test_table
(
id integer,
b varchar(20),
c varchar(20),
d varchar(20)
);
-- Item 3: Adding the rows in the DB
insert into test_table values (3 /* same as excel row number */ , '1', '1', '1');
insert into test_table values (4 /* same as excel row number */ , '2', 'other', '2');
insert into test_table values (5 /* same as excel row number */ , 'TRUE', '3', '3');
insert into test_table values (6 /* same as excel row number */ , '4', '4', '4');
Now we have a similar structure. The first thing we want to do is to have an equal number of rows between the Excel range and the DB table. On the DB side this is called filtering, and your tool is the where condition. The where condition goes through all rows (or indexes, for the sake of speed, but that is beyond this answer's scope) and filters out those which do not satisfy the boolean logic in the condition. (So, for example, where 1 = 1 brings all rows because the condition is always true for every row.)
The next thing to do is to sum the related columns. For this purpose you have two options: sum(column_a + column_b) (row-by-row summation) or sum(column_a) + sum(column_b) (column-by-column summation). If we assume none of the data is null, both give the same output (a sketch showing how nulls break that equivalence follows the two queries below).
Items 4 and 5 in Excel and in a DB:
select sum(b + c + d) -- Item 5, first option: We sum row by row
from test_table
where id between 3 and 6; -- Item 4: We simply get all rows, because for all rows above the id is between 3 and 6; if we had another row with id 7, it would be filtered out
+----------------+
| sum(b + c + d) |
+----------------+
| 25 |
+----------------+
select sum(b) + sum(c) + sum(d) -- Item 5, second option: We sum column by column
from test_table
where id between 3 and 6; -- Item 4: We simply get all rows, because for all rows above the id is between 3 and 6; if we had another row with id 7, it would be filtered out
+--------------------------+
| sum(b) + sum(c) + sum(d) |
+--------------------------+
| 25 |
+--------------------------+
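As a side note on the null assumption in Item 5, here is a minimal sketch of how the two options diverge once a null appears, and how coalesce restores the equivalence. It uses a hypothetical extra row with id 99 (not part of the Excel range) and keeps the implicit text-to-number casts the queries above already rely on (MySQL-style; stricter engines would need explicit casts):
-- hypothetical row with a null in column d
insert into test_table values (99, '5', '5', null);
select sum(b + c + d)                                       -- option 1: b + c + d is null for the new row, so the whole row is skipped
     , sum(b) + sum(c) + sum(d)                             -- option 2: each column is summed on its own, nulls skipped per column
     , sum(coalesce(b, 0) + coalesce(c, 0) + coalesce(d, 0)) -- option 1 again, with nulls treated as 0
from test_table
where id between 3 and 99;
With the extra row, the first expression ignores the whole row while the other two still count the non-null 5s, so the row-by-row and column-by-column forms no longer agree unless coalesce is applied.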
At this point it is better to go one step further. In Excel you have the "pivot table" structure. The corresponding structure in SQL is the powerful group by mechanism. The group by basically groups a table according to its expression, and each group behaves like a sub-table. For example, if you say group by column_a, the rows are grouped according to the values of column_a.
SQL is so powerful that you can even filter the sub-groups using the having clause, which acts the same as where but works on the columns in the group by, or on aggregate functions over the other columns (a short having sketch follows the group by example below).
Items 6 and 7 in Excel and in a DB:
-- Item 6: We can have group by clause to simulate a pivot table
insert into test_table values (7 /* same as excel row */ , '4', '2', '2');
select b, sum(d), min(d), max(d), avg(d)
from test_table
where id between 3 and 7
group by b;
+------+--------+--------+--------+--------+
| b | sum(d) | min(d) | max(d) | avg(d) |
+------+--------+--------+--------+--------+
| 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 |
| TRUE | 3 | 3 | 3 | 3 |
| 4 | 6 | 2 | 4 | 3 |
+------+--------+--------+--------+--------+
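A minimal having sketch on the same grouped data, assuming we only want groups whose total of d exceeds 2:
-- Item 7: having filters the groups produced by group by
select b, sum(d) as total_d
from test_table
where id between 3 and 7
group by b
having sum(d) > 2;
Only the 'TRUE' and '4' groups survive, because their totals (3 and 6) are the only ones above 2.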
Beyond this point are details which are not directly related to the question's purpose:
SQL has the ability to join tables (the relations). Joins can be thought of like the VLOOKUP functionality in Excel (a small join sketch follows this list).
RDBMSs have indexing mechanisms to fetch rows as quickly as possible (this is where RDBMSs start to go beyond the purpose of Excel).
RDBMSs can keep huge amounts of data (whereas Excel's maximum number of rows is limited).
Both RDBMSs and Excel can be used by most frameworks as a persistent data layer, but of course Excel is not the one you would pick, because its reason for existence is closer to the presentation layer.
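To illustrate the VLOOKUP analogy, a minimal sketch with a hypothetical lookup table (b_lookup is not part of the original example) joined to test_table:
create table b_lookup
(
b varchar(20),
label varchar(20)
);
insert into b_lookup values ('1', 'one');
insert into b_lookup values ('2', 'two');
insert into b_lookup values ('4', 'four');
-- roughly the SQL counterpart of =VLOOKUP(b, lookup_range, 2, FALSE)
select t.id, t.b, l.label
from test_table t
left join b_lookup l on l.b = t.b;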
The excel file and the SQL used in this answer can be found in this github repo: https://github.com/MehmetKaplan/stackoverflow-72135212/
PS: I used SQL for more than two decades, then scaled it back and started using Excel much more frequently because of job changes. Each time I use Excel I still think of DBs and the "relational algebra" which is the mathematical foundation of RDBMSs.

So in Snowflake:
Strings as input:
if you have your data in an "order" table represented by this CTE, and the data is strings of comma-separated values:
WITH data(raw) as (
select * from values
('null,null,null,null,null,null'),
('null,null,null,null,null,null'),
('null,1,1,1,null,null'),
('null,2, other,2,null,null'),
('null,true,3,3,null,null'),
('null,4,4,4,null,null')
)
this SQL will select the sub-part, try to parse it, and sum the valid values:
select sum(nvl(try_to_double(r.value::text), try_to_number(r.value::text))) as sum_total
from data as d
,table(split_to_table(d.raw,',')) r
where r.index between 2 and 4 /* the column B,C,D filter */
and r.seq between 3 and 6 /* the row 3-6 filter */
;
giving:
SUM_TOTAL
25
Arrays as input:
if you already have arrays... here I smash those strings through STRTOK_TO_ARRAY in the CTE to make some arrays:
WITH data(_array) as (
select STRTOK_TO_ARRAY(column1, ',') from values
('null,null,null,null,null,null'),
('null,null,null,null,null,null'),
('null,1,1,1,null,null'),
('null,2, other,2,null,null'),
('null,true,3,3,null,null'),
('null,4,4,4,null,null')
)
thus again almost the same SQL, but note that the array indexes are 0-based, and I have used FLATTEN:
select sum(nvl(try_to_double(r.value::text), try_to_number(r.value::text))) as sum_total
from data as d
,table(flatten(input=>d._array)) r
where r.index between 1 and 3 /* the column B,C,D filter */
and r.seq between 3 and 6 /* the row 3-6 filter */
;
gives:
SUM_TOTAL
25
With JSON driven data:
This time, using semi-structured data, we can include the filter ranges with the data... and some extra "out of bounds" values, just to show we are not simply converting everything.
WITH data as (
select parse_json('{ "col_from":2,
"col_to":4,
"row_from":3,
"row_to":6,
"data":[[101,102,null,104,null,null],
[null,null,null,null,null,null],
[null,1,1,1,null,null],
[null,2, "other",2,null,null],
[null,true,3,3,null,null],
[null,4,4,4,null,null]
]}') as json
)
select
sum(try_to_double(c.value::text)) as sum_total
from data as d
,table(flatten(input=>d.json:data)) r
,table(flatten(input=>r.value)) c
where r.index+1 between d.json:row_from::number and d.json:row_to::number
and c.index+1 between d.json:col_from::number and d.json:col_to::number
;

Here is another solution using Snowflake scripting (Snowsight format). This code can easily be wrapped as a stored procedure.
declare
table_name := 'xl_concept'; -- input
column_list := 'a,b,c'; -- input
total resultset; -- result output
pos int := 0; -- position for delimiter
sql := ''; -- sql to be generated
col := ''; -- individual column names
begin
sql := 'select sum('; -- initialize sql
loop -- repeat until column list is empty
col := replace(split_part(:column_list, ',', 1), ',', ''); -- get the column name
pos := position(',' in :column_list); -- find the delimiter
sql := sql || 'coalesce(try_to_number('|| col ||'),0)'; -- add to the sql
if (pos > 0) then -- more columns in the column list
sql := sql || ' + ';
column_list := right(:column_list, len(:column_list) - :pos); -- update column list
else -- last entry in the columns list
break;
end if;
end loop;
sql := sql || ') total from ' || table_name||';'; -- finalize the sql
total := (execute immediate :sql); -- run the sql and store total value
return table(total); -- return total value
end;
Only these two variables need to be set: table_name and column_list.
It generates the following SQL to sum up the values:
select sum(coalesce(try_to_number(a),0) + coalesce(try_to_number(b),0) + coalesce(try_to_number(c),0)) from xl_concept
prep steps
create or replace temp table xl_concept (a varchar,b varchar,c varchar)
;
insert into xl_concept
with cte as (
select '1' a, '1' b, '1' c union all
select '2', 'other', '2' union all
select 'true', '3', '3' union all
select '4', '4', '4'
)
select * from cte
;
result for the run with no change
TOTAL
25
result after changing column list to column_list := 'a,c';
TOTAL
17
Also, this can be enhanced by setting column_list to * and reading the column names from information_schema.columns to include all the columns from the table (a sketch follows).
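A minimal sketch of that enhancement, assuming Snowflake's information_schema (unquoted table names are stored upper-case, hence 'XL_CONCEPT'):
select listagg(column_name, ',') within group (order by ordinal_position) as column_list
from information_schema.columns
where table_name = 'XL_CONCEPT';
The resulting string can then be assigned to the column_list variable of the script above.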

In PostgreSQL, a regular expression can be used to filter out non-numeric values before summing:
select sum(e::Numeric) from (
select e
from unnest((Array[['1','2w','1.2e+4'],['-1','2.232','zz']])) as t(e)
where e ~ '^[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$'
) a
The expression for validating a numeric value was taken from the post Return Just the Numeric Values from a PostgreSQL Database Column.
A more robust option is to define a function, as in PostgreSQL alternative to SQL Server's try_cast function.
Function (simplified for this example):
create function try_cast_numeric(p_in text)
returns Numeric
as
$$
begin
begin
return $1::Numeric;
exception
when others then
return 0;
end;
end;
$$
language plpgsql;
Select
select
sum(try_cast_numeric(e))
from
unnest((Array[['1','2w','1.2e+4'],['-1','2.232','zz']])) as t(e)

Most modern RDBMSs support lateral joins and table value constructors. You can use them together to convert arbitrary columns to rows (3 columns per row become 3 rows of 1 column each), then sum. In SQL Server you would:
create table t (
id int not null primary key identity,
a int,
b int,
c int
);
insert into t(a, b, c) values
( 1, 1, 1),
( 2, null, 2),
(null, 3, 3),
( 4, 4, 4);
select sum(value)
from t
cross apply (values
(a),
(b),
(c)
) as x(value);
Below is the implementation of this concept in some popular RDBMS (a PostgreSQL sketch follows the list):
SQL Server
PostgreSQL
MySQL
Generic solution, ANSI SQL
Unpivot solution, Oracle
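As one concrete example, a minimal sketch of the PostgreSQL flavour (the linked fiddles are not reproduced here), using a lateral VALUES list in place of CROSS APPLY against an equivalent table t (the identity-column DDL differs between the two engines):
select sum(x.value)
from t
cross join lateral (values
(t.a),
(t.b),
(t.c)
) as x(value);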

Using a regular expression to extract all numeric values from a row could be another option, I guess.
DECLARE rectangular_table ARRAY<STRUCT<A STRING, B STRING, C STRING>> DEFAULT [
('1', '1', '1'), ('2', 'other', '2'), ('TRUE', '3', '3'), ('4', '4', '4')
];
SELECT SUM(SAFE_CAST(v AS FLOAT64)) AS `sum`
FROM UNNEST(rectangular_table) t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r':"?([-0-9.]*)"?[,}]')) v
output:
+------+------+
| Row | sum |
+------+------+
| 1 | 25.0 |
+------+------+

You could use a CTE with a SELECT FROM VALUES
with xlary as
(
select val from (values
('1')
,('1')
,('1')
,('2')
,('OTHER')
,('2')
,('TRUE')
,('3')
,('3')
,('4')
,('4')
,('4')
) as tbl (val)
)
select sum(try_cast(val as number)) from xlary;

Related

Is there a function in PostgreSQL that counts string match across columns (row-wise)

I want to overwrite a number based on a few conditions.
Intended overwrite:
If a string (in the example I use, it is just a letter) occurs across 3 columns at least 2 times and the numerical column is more than some number, overwrite the numerical value, OR
If another string occurs across 3 columns at least 2 times and the numerical column is more than some other number, overwrite the numerical value, else leave the numerical value unchanged.
The approach I thought of first works, but only if the table has one row. Could this be extended somehow so it could work on more rows? And if my approach is wrong, would you please direct me to the right one?
Please, see the SQL Fiddle
Any help is highly appreciated!
If letter a repeats at least 2 times among section_1, section_2, section_3 and number >= 3, then overwrite number with 3; or if letter b repeats at least 2 times among section_1, section_2, section_3 and number >= 8, write 8; else leave number unchanged.
CREATE TABLE sections (
id int,
section_1 text,
section_2 text,
section_3 text,
number int
);
INSERT INTO sections VALUES
( 1, 'a', 'a', 'c', 5),
( 2, 'b', 'b', 'c', 9),
( 3, 'b', 'b', 'c', 4);
expected result:
id number
1 3
2 8
3 4
Are you looking for a case expression?
select (case when (section_1 = 'a')::int + (section_2 = 'a')::int + (section_3 = 'a')::int >= 2 and
other_col > threshold
then 'special'
end)
You can have additional when conditions. And include this in an update if you really want to change the value (a sketch of the update form follows).
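A minimal sketch of that update form, wired to the concrete rule in the question (PostgreSQL syntax, using the same boolean-to-int casts as above):
update sections
set number = case
    when (section_1 = 'a')::int + (section_2 = 'a')::int + (section_3 = 'a')::int >= 2
         and number >= 3 then 3
    when (section_1 = 'b')::int + (section_2 = 'b')::int + (section_3 = 'b')::int >= 2
         and number >= 8 then 8
    else number
end;
On the sample data this yields 3, 8 and 4 for ids 1, 2 and 3, matching the expected result.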
A typical solution uses a lateral join to unpivot:
select s.*, x.number as new_number
from sections s
cross join lateral (
select count(*) number
from (values (s.section_1), (s.section_2), (s.section_3)) x(section)
where section = 'a'
) x;
This is a bit more scalable than repeating conditional expressions, since you just need to enumerate the columns in the values() row constructor of the subquery.

One-statement Insert+delete in PostgreSQL

Suppose I have a PostgreSQL table t that looks like
id | name | y
----+------+---
0 | 'a' | 0
1 | 'b' | 0
2 | 'c' | 0
3 | 'd' | 1
4 | 'e' | 2
5 | 'f' | 2
With id being the primary key and with a UNIQUE constraint on (name, y).
Suppose I want to update this table in such a way that the part of the data set with y = 0 becomes (without knowing what is already there)
id | name | y
----+------+---
0 | 'a' | 0
1 | 'x' | 0
2 | 'y' | 0
I could use
DELETE FROM t WHERE y = 0 AND name NOT IN ('a', 'x', 'y');
INSERT INTO t (name, y) VALUES ('a', 0), ('x', 0), ('y', 0)
ON CONFLICT (name) DO NOTHING;
I feel like there must be a one-statement way to do this (like what upsert does for the task "update the existing entries and insert missing ones", but then for "insert the missing entries and delete the entries that should not be there"). Is there? I heard rumours that oracle has something called MERGE... I'm not sure what it does exactly.
This can be done with a single statement. But I doubt whether that classifies as "simpler".
Additionally: your expected output doesn't make sense.
Your insert statement does not provide a value for the primary key column (id), so apparently, the id column is a generated (identity/serial) column.
But in that case, new rows can't have the same IDs as the ones before, because when the new rows were inserted, new IDs were generated.
With that caveat about the expected output, the following does what you want:
with data (name, y) as (
values ('a', 0), ('x', 0), ('y', 0)
), changed as (
insert into t (name, y)
select *
from data
on conflict (name,y) do nothing
)
delete from t
where y = 0
  and (name, y) not in (select name, y from data);
That is one statement, but certainly not "simpler". The only advantage I can see is that you do not have to specify the list of values twice.
Online example: https://rextester.com/KKB30299
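As for the MERGE mentioned in the question: PostgreSQL did not have MERGE at all at the time (it arrived in version 15), while SQL Server's MERGE has a WHEN NOT MATCHED BY SOURCE clause that covers the delete part. Purely as an illustration, in SQL Server syntax, not PostgreSQL:
merge t as target
using (values ('a', 0), ('x', 0), ('y', 0)) as source(name, y)
on target.name = source.name and target.y = source.y
when not matched by target then
    insert (name, y) values (source.name, source.y)
when not matched by source and target.y = 0 then
    delete;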
Unless there's a tremendous number of rows to be updated, do it as three update statements.
update t set name = 'a' where id = 0;
update t set name = 'x' where id = 1;
update t set name = 'y' where id = 2;
This is simple. It's easily done in a loop with a SQL builder. There's no race conditions as there are with deleting and inserting. And it preserves the ids and other columns of those rows.
To demonstrate with some pseudo-Ruby code:
new_names = ['a', 'x', 'y']
# In a transaction
db.transaction {
# Query the matching IDs in the same order as their new names
ids_to_update = db.select("
select id from t where y = 0 order by id
")
# Iterate through the IDs and new names together
ids_to_update.zip(new_names).each { |id,name|
# Update the row with its new name
db.execute("
update t set name = ? where id = ?
", name, id)
}
}
Fooling around some, here's how I did it in "one" statement, or at least one thing sent to the server, while preserving the IDs and avoiding race conditions.
do $$
declare change text[];
declare changes text[][];
begin
select array_agg(array[id::text,name])
into changes
from unnest(
(select array_agg(id order by id) from t where y = 0),
array['a','y','z']
) with ordinality as a(id, name);
foreach change slice 1 in array changes
loop
update t set name = change[2] where id = change[1]::int;
end loop;
end$$;
The goal is to produce an array of arrays matching the id to its new name. That can be iterated over to do the updates.
unnest(
(select array_agg(id order by id) from t where y = 0),
array['a','y','z']
) with ordinality as a(id, name);
That bit produces rows with the IDs and their new names side by side.
select array_agg(array[id::text,name])
into changes
from unnest(...) with ordinality as a(id, name);
Then those rows of IDs and names are turned into an array of arrays like: {{1,a},{2,y},{3,z}}. (There's probably a more direct way to do that)
foreach change slice 1 in array changes
loop
update t set name = change[2] where id = change[1]::int;
end loop;
Finally we loop over the array and use it to perform each update.
You can turn this into a proper function and pass in the y value to match and the array of names to change them to. You should verify that the lengths of the ids and names match.
This might be faster, depends on how many rows you're updating, but it isn't simpler, and it took some time to puzzle out.

How to get index of an array value in PostgreSQL?

I have a table called pins like this:
id (int) | pin_codes (jsonb)
--------------------------------
1 | [4000, 5000, 6000]
2 | [8500, 8400, 8600]
3 | [2700, 2300, 2980]
Now, I want the row with pin_code 8600 and with its array index. The output must be like this:
pin_codes | index
------------------------------
[8500, 8400, 8600] | 2
If I want the row with pin_code 2700, the output :
pin_codes | index
------------------------------
[2700, 2300, 2980] | 0
What I've tried so far:
SELECT pin_codes FROM pins WHERE pin_codes #> '[8600]'
It only returns the row with the wanted value. I don't know how to get the index of the value in the pin_codes array!
Any help would be greatly appreciated.
P.S:
I'm using PostgreSQL 10
If you were storing the array as a real array, not as JSON, you could use array_position() to find the (first) index of a given element:
select array_position(array['one', 'two', 'three'], 'two')
returns 2
With some text mangling you can cast the JSON array into a text array:
select array_position(translate(pin_codes::text,'[]','{}')::text[], '8600')
from the_table;
This also allows you to use the ANY operator:
select *
from pins
where '8600' = any(translate(pin_codes::text,'[]','{}')::text[])
The contains #> operator expects arrays on both sides of the operator. You could use it to search for two pin codes at a time:
select *
from pins
where translate(pin_codes::text,'[]','{}')::text[] #> array['8600','8400']
Or use the overlaps operator && to find rows with any of multiple elements:
select *
from pins
where translate(pin_codes::text,'[]','{}')::text[] && array['8600','2700']
would return
id | pin_codes
---+-------------------
2 | [8500, 8400, 8600]
3 | [2700, 2300, 2980]
If you do that a lot, it would be more efficient to store the pin_codes as text[] rather than JSON - then you can also index that column to do searches more efficiently (a sketch follows).
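A minimal sketch of that migration, with a hypothetical pin_codes_arr column added alongside the existing jsonb one:
alter table pins add column pin_codes_arr text[];
update pins
set pin_codes_arr = translate(pin_codes::text, '[]', '{}')::text[];
-- a GIN index supports @> and && searches on the array column
create index pins_pin_codes_arr_idx on pins using gin (pin_codes_arr);
-- array_position() is 1-based, so subtract 1 to match the 0-based index in the question
select id, pin_codes_arr, array_position(pin_codes_arr, '8600') - 1 as index
from pins
where pin_codes_arr @> array['8600'];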
Use the function jsonb_array_elements_text() using with ordinality.
with my_table(id, pin_codes) as (
values
(1, '[4000, 5000, 6000]'::jsonb),
(2, '[8500, 8400, 8600]'),
(3, '[2700, 2300, 2980]')
)
select id, pin_codes, ordinality- 1 as index
from my_table, jsonb_array_elements_text(pin_codes) with ordinality
where value::int = 8600;
id | pin_codes | index
----+--------------------+-------
2 | [8500, 8400, 8600] | 2
(1 row)
As has been pointed out previously, the array_position function is only available in Postgres 9.5 and greater.
Here is a custom function that achieves the same, derived from nathansgreen on GitHub.
-- The array_position function was added in Postgres 9.5.
-- For older versions, you can get the same behavior with this function.
create function array_position(arr ANYARRAY, elem ANYELEMENT, pos INTEGER default 1) returns INTEGER
language sql
as $BODY$
select row_number::INTEGER
from (
select unnest, row_number() over ()
from ( select unnest(arr) ) t0
) t1
where row_number >= greatest(1, pos)
and (case when elem is null then unnest is null else unnest = elem end)
limit 1;
$BODY$;
So in this specific case, after creating the function the following worked for me.
SELECT
pin_codes,
array_position(pin_codes, 8600) AS index
FROM pins
WHERE array_position(pin_codes, 8600) IS NOT NULL;
Worth bearing in mind that it will only return the index of the first occurrence of 8600; you can use the pos argument to start the search from whichever position you like (small example below).
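For example, with a hypothetical literal array (not from the question's data):
-- start the search at position 2, skipping the first occurrence
select array_position(array[8600, 1, 8600], 8600, 2); -- returns 3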
In short, normalize your data structure, or don't do this in SQL. If you want this index of the sub-data element given your current data structure, then do this in your application code (take result, cast to list/array, get index).
Try to unnest the string and assign numbers as follows:
with dat as
(
select 1 id, '8700, 5600, 2300' pins
union all
select 2 id, '2300, 1700, 1000' pins
)
select dat.*, t.rn as index
from
(
select id, t.pins, row_number() over (partition by id) rn
from
(
select id, trim(unnest(string_to_array(pins, ','))) pins from dat
) t
) t
join dat on dat.id = t.id and t.pins = '2300'
If you insist on storing arrays, I'd defer to klin's answer.
As an alternative answer and an extension to my comment: don't store SQL data in arrays. 'Normalize' your data in advance and SQL will handle it significantly better. Klin's answer is good, but may suffer in performance as it's outside of what SQL does best.
I'd break the Array prior to storing it. If the number of pincodes is known, then simply having the table pin_id,pin1,pin2,pin3, pinetc... is functional.
If the number of pins is unknown, a first table as pin that stored the pin_id and any info columns related to that pin ID, and then a second table as pin_id, pin_seq,pin_value is also functional (though you may need to pivot this later on to make sense of the data). In this case, select pin_seq where pin_value = 260 would work.

DB2: fill a dummy field with values in for loop while a select

I want to fill a dummy field with values in a for loop during a select:
Something like this (table account e.g. has a field "login"):
select login,(for i= 1 to 3 {list=list.login.i.","}) as list from account
The result should be
login | list
aaa | aaa1,aaa2,aaa3
bbb | bbb1,bbb2,bbb3
ccc | ccc1,ccc2,ccc3
Can someone please help me if that is possible !!!!
Many Thanks !
If this is a one-off task and the size of your loop is fixed, you can make up a table of integers and do a cartesian product with your table containing the column login:
SELECT ACC.LOGIN || NUMBRS.NUM FROM
ACCOUNT ACC, TABLE (
SELECT '1' AS NUM FROM SYSIBM.SYSDUMMY1 UNION
SELECT '2' AS NUM FROM SYSIBM.SYSDUMMY1 UNION
SELECT '3' AS NUM FROM SYSIBM.SYSDUMMY1
) NUMBRS
which will give you strings like 'aaa1', 'aaa2', 'aaa3', one string per row. Then you can aggregate these strings with LISTAGG (see the sketch after this answer).
If the size is not fixed, you can always make up a temporary table and fill it up with appropriate data and use it instead of the NUMBRS table above.
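A minimal sketch of that aggregation step in DB2 syntax, reusing the NUMBRS derived table from above:
SELECT ACC.LOGIN,
       LISTAGG(ACC.LOGIN || NUMBRS.NUM, ',') WITHIN GROUP (ORDER BY NUMBRS.NUM) AS LIST
FROM ACCOUNT ACC, TABLE (
  SELECT '1' AS NUM FROM SYSIBM.SYSDUMMY1 UNION
  SELECT '2' AS NUM FROM SYSIBM.SYSDUMMY1 UNION
  SELECT '3' AS NUM FROM SYSIBM.SYSDUMMY1
) NUMBRS
GROUP BY ACC.LOGIN
which produces one row per login, e.g. aaa | aaa1,aaa2,aaa3.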

Dynamic pivot for thousands of columns

I'm using pgAdmin III / PostgreSQL 9.4 to store and work with my data. Sample of my current data:
x | y
--+--
0 | 1
1 | 1
2 | 1
5 | 2
5 | 2
2 | 2
4 | 3
6 | 3
2 | 3
How I'd like it to be formatted:
1, 2, 3 -- column names are unique y values
0, 5, 4 -- the first respective x values
1, 5, 6 -- the second respective x values
2, 2, 2 -- etc.
It would need to be dynamic because I have millions of rows and thousands of unique values for y.
Is using a dynamic pivot approach correct for this? I have not been able to successfully implement this:
DECLARE #columns VARCHAR(8000)
SELECT #columns = COALESCE(#columns + ',[' + cast(y as varchar) + ']',
'[' + cast(y as varchar)+ ']')
FROM tableName
GROUP BY y
DECLARE #query VARCHAR(8000)
SET #query = '
SELECT x
FROM tableName
PIVOT
(
MAX(x)
FOR [y]
IN (' + #columns + ')
)
AS p'
EXECUTE(#query)
It is stopping on the first line and giving the error:
syntax error at or near "#"
All dynamic pivot examples I've seen use this, so I'm not sure what I've done wrong. Any help is appreciated. Thank you for your time.
Note: It is important for the x values to be stored in the correct order, as sequence matters. I can add another column to indicate sequential order if necessary.
The term "first row" assumes a natural order of rows, which does not exist in database tables. So, yes, you need to add another column to indicate sequential order like you suspected. I am assuming a column tbl_id for the purpose. Using the ctid would be a measure of last resort. See:
Deterministic sort order for window functions
The code you present looks like MS SQL Server code; invalid syntax for Postgres.
For millions of rows and thousands of unique values for Y it wouldn't even make sense to try and return individual columns. Postgres has generous limits, but not nearly generous enough for that. According to the source code or the manual, the absolute maximum number of columns is 1600.
So we don't even get to discuss the restrictive characteristics of SQL, which demands to know columns and data types at execution time, not dynamically adjusted during execution. You would need two separate calls, like we discussed in great detail under this related question.
Dynamic alternative to pivot with CASE and GROUP BY
Another answer by Clodoaldo under the same question returns arrays. That can actually be completely dynamic. And that's what I suggest here, too. The query is actually rather simple:
WITH cte AS (
SELECT *, row_number() OVER (PARTITION BY y ORDER BY tbl_id) AS rn
FROM tbl
ORDER BY y, tbl_id
)
SELECT text 'y' AS col, array_agg (y) AS values
FROM cte
WHERE rn = 1
UNION ALL
( -- parentheses required
SELECT text 'x' || rn, array_agg (x)
FROM cte
GROUP BY rn
ORDER BY rn
);
Result:
col | values
----+--------
y | {1,2,3}
x1 | {0,5,4}
x2 | {1,5,6}
x3 | {2,2,2}
db<>fiddle here
Old sqlfiddle
Explanation
The CTE computes a row_number rn for each row (each x) per group of y. We are going to use it twice, hence the CTE.
The 1st SELECT in the outer query generates the array of y values.
The 2nd SELECT in the outer query generates all arrays of x values in order. Arrays can have different lengths.
Why the parentheses for UNION ALL? See:
Sum results of a few queries and then find top 5 in SQL