Group Concat in Redshift

I have a table like this:
| Col1 | Col2 |
|:-----------|------------:|
| 1 | a;b; |
| 1 | b;c; |
| 2 | c;d; |
| 2 | d;e; |
I want the result to be something like this:
| Col1 | Col2 |
|:-----------|------------:|
| 1 | a;b;c;|
| 2 | c;d;e;|
Is there some way to write a set function which adds unique values in a column into an array and then displays them? I am using the Redshift database, which mostly uses PostgreSQL, with the following difference:
Unsupported PostgreSQL Functions

Have a look at Redshift's listagg() function, which is similar to MySQL's group_concat. You would need to split the items first and then use listagg() to give you a list of values. Do take note, though, that, as the documentation states:
LISTAGG does not support DISTINCT expressions
(Edit: As of 11th October 2018, DISTINCT is now supported. See the docs.)
So you will have to take care of that yourself. Assuming you have the following table set up:
create table _test (col1 int, col2 varchar(10));
insert into _test values (1, 'a;b;'), (1, 'b;c;'), (2, 'c;d;'), (2, 'd;e;');
Fixed number of items in Col2
Perform as many split_part() operations as there are items in Col2:
select
    col1
    , listagg(col2, ';') within group (order by col2) as col2
from (
    select col1, split_part(col2, ';', 1) as col2 from _test
    union select col1, split_part(col2, ';', 2) as col2 from _test
) t
group by col1
;
Varying number of items in Col2
You need a helper here. As long as there are at least as many rows in the table as there are items in Col2, a workaround with row_number() can work (but it is expensive for large tables):
with _helper as (
    select
        (row_number() over())::int as part_number
    from
        _test
),
_values as (
    select distinct
        col1
        , split_part(col2, ';', part_number) as col2
    from
        _test, _helper
    where
        length(split_part(col2, ';', part_number)) > 0
)
select
    col1
    , listagg(col2, ';') within group (order by col2) as col2
from
    _values
group by
    col1
;
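If your cluster is on a release that already supports LISTAGG DISTINCT (see the 2018 edit above), the separate de-duplication step is not needed; a minimal sketch against the same _test table:
-- assumes LISTAGG DISTINCT support (available since October 2018)
select
    col1
    , listagg(distinct col2, ';') within group (order by col2) as col2
from (
    select col1, split_part(col2, ';', 1) as col2 from _test
    union all select col1, split_part(col2, ';', 2) as col2 from _test
) t
where length(col2) > 0
group by col1
;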

Related

SQL - Create a formatted output with placeholder rows

For reasons of our IT department, I am stuck doing this entirely within an SQL query.
Simplified, I have this as an input table:
| col1 | col2 |
| ---- | ----- |
| SetA | BH101 |
| SetA | BH102 |
| SetA | BH103 |
| SetB | BH201 |
| SetB | BH202 |
| SetB | BH203 |
And I need to create this:
| col1 | col2 |
| ---- | ----- |
| SetA | |
| | BH101 |
| | BH102 |
| | BH103 |
| SetB | |
| | BH201 |
| | BH202 |
| | BH203 |
And I am just not sure where to start with this. In my normal C# way of thinking it's easy: Column1 is ordered; if the value in Col1 is new, add a new row to the output and put the contents of Column1 in it. Then, while the contents of the input Column1 are unchanged, keep adding the contents of Column2 on new rows.
In SQL... nope, I just cannot see the right way to start!
This is a presentation issue that is easily handled in the application or presentation layer; in SQL it can be clunky. The goal of a database is not to render a UI but to store and retrieve data quickly and efficiently, so that it can serve as many clients as possible within the same hardware and software constraints.
The query that could do this can look like:
with
y as (
    select col1, row_number() over(order by col1) as r1
    from (select distinct col1 from t) x
),
z as (
    select
        t.col1, y.r1, t.col2,
        row_number() over(partition by t.col1 order by t.col2) as r2
    from t
    join y on y.col1 = t.col1
)
select col1, col2
from (
    select col1, null as col2, r1, 0 as r2 from y
    union all
    select null, col2, r1, r2 from z
) w
order by r1, r2
As you see, it looks clunky and bloated.
You need a header row for each group, consisting of col1 and null, plus all the rows of the table with null as col1.
You can do it with UNION ALL and conditional sorting:
select
    case when t.col2 is null then t.col1 end col1,
    t.col2
from (
    select col1, col2 from tablename
    union all
    select distinct col1, null from tablename
) t
order by
    t.col1,
    case when t.col2 is null then 1 else 2 end,
    t.col2
See the demo (it is for MySQL, but the query is standard SQL).
Results:
| col1 | col2 |
| ---- | ----- |
| SetA | |
| | BH101 |
| | BH102 |
| | BH103 |
| SetB | |
| | BH201 |
| | BH202 |
| | BH203 |
I agree, formatting should be done outside of SQL, but if you have no choice, here is some SQL Server code that will generate your output:
select *
from (
    select top 100
        case
            when col2 is null then ' ' + col1
            else ''
        end as firstCol,
        IsNull(col2, '') as Col2
    from dbo.test t1
    group by col1, col2 with rollup
    order by col1, col2
) x
where x.firstcol is not null

How can I array aggregate per column where distinct values are less than a given number in Google BigQuery?

I have a dataset table like this in Google BigQuery:
| col1 | col2 | col3 | col4 | col5 | col6 |
| ---- | ---- | ---- | ---- | ---- | ---- |
| a1   | b1   | c1   | d1   | e2   | f1   |
| a2   | b2   | c2   | d1   | e2   | f2   |
| a1   | b3   | c3   | d1   | e3   | f2   |
| a2   | b1   | c4   | d1   | e4   | f2   |
| a1   | b2   | c5   | d1   | e5   | f2   |
Let's say the given threshold number is 4; in that case, I want to transform this into one of the tables given below:
| col1    | col2       | col4 | col5          | col6    |
| ------- | ---------- | ---- | ------------- | ------- |
| [a1,a2] | [b1,b2,b3] | [d1] | [e2,e3,e4,e5] | [f1,f2] |
Or like this:
| col  | values        |
| ---- | ------------- |
| col1 | [a1,a2]       |
| col2 | [b1,b2,b3]    |
| col4 | [d1]          |
| col5 | [e2,e3,e4,e5] |
| col6 | [f1,f2]       |
Please note col3 was removed because it contained more than 4 (the threshold) distinct values. I explored a lot of documents here but was not able to figure out the required query. Can somebody help or point me in the right direction?
Edit: I have one solution in mind, where I do something like this:
select * from (
  select 'col1' as col, array_agg(distinct col1) as vals from `project.dataset.table` union all
  select 'col2', array_agg(distinct col2) from `project.dataset.table` union all
  select 'col3', array_agg(distinct col3) from `project.dataset.table` union all
  select 'col4', array_agg(distinct col4) from `project.dataset.table` union all
  select 'col5', array_agg(distinct col5) from `project.dataset.table` union all
  select 'col6', array_agg(distinct col6) from `project.dataset.table`
) X where array_length(vals) <= 4;
This will give me the second result, but it requires complex query construction given that I don't know the number and names of the columns up front. Also, this might cross the 100MB-per-row limit for a BigQuery table, as I will have more than a billion rows. Please also suggest if there is a better way to do this.
How about:
WITH arrays AS (
SELECT * FROM UNNEST((
SELECT [
STRUCT("col_repo_name" AS col, ARRAY_AGG(DISTINCT repo.name IGNORE NULLS LIMIT 1001) AS values)
, ('col_actor_login', ARRAY_AGG(DISTINCT actor.login IGNORE NULLS LIMIT 1001))
, ('col_type', ARRAY_AGG(DISTINCT type IGNORE NULLS LIMIT 1001))
, ('col_org_login', ARRAY_AGG(DISTINCT org.login IGNORE NULLS LIMIT 1001))
]
FROM `githubarchive.year.2017`
))
)
SELECT *
FROM arrays
WHERE ARRAY_LENGTH(values)<=1000
This query processed 20.6 GB in 11.9 s (half a billion rows). It returned only one row, because every other column had more than 1000 unique values (my threshold).
That's traditional SQL, but here is an even simpler query that produces similar results:
SELECT col, ARRAY_AGG(DISTINCT value IGNORE NULLS LIMIT 1001) values
FROM (
SELECT REGEXP_EXTRACT(x, r'"([^\"]*)"') col , REGEXP_EXTRACT(x, r'":"([^\"]*)"') value
FROM (
SELECT SPLIT(TO_JSON_STRING(STRUCT(repo.name, actor.login, type, org.login)), ',') x
FROM `githubarchive.year.2017`
), UNNEST(x) x
)
GROUP BY col
HAVING ARRAY_LENGTH(values)<=1000
# 17.0 sec elapsed, 20.6 GB processed
Caveat: This will only work if there are no special characters in the column values, such as quotes or commas. If you have those, it won't be as straightforward (but it is still possible).
Below is for BigQuery Standard SQL
#standardSQL
SELECT col, STRING_AGG(DISTINCT value) `values`
FROM (
SELECT
TRIM(z[OFFSET(0)], '"') col,
TRIM(z[OFFSET(1)], '"') value
FROM `project.dataset.table` t,
UNNEST(SPLIT(TRIM(to_JSON_STRING(t), '{}'))) kv,
UNNEST([STRUCT(SPLIT(kv, ':') AS z)])
)
GROUP BY col
HAVING COUNT(DISTINCT value) < 5
You can test and play with the above using the sample data from your question; the result will be:
| Row | col  | values      |
| --- | ---- | ----------- |
| 1   | col1 | a1,a2       |
| 2   | col2 | b1,b2,b3    |
| 3   | col4 | d1          |
| 4   | col5 | e2,e3,e4,e5 |
| 5   | col6 | f1,f2       |
@FelipeHoffa I was able to use your idea with a little modification to the query for my use case.
SELECT * FROM UNNEST((
SELECT [
STRUCT("col_repo_name" AS col, ARRAY_AGG(DISTINCT repo.name IGNORE NULLS LIMIT 1001) AS values)
, ('col_actor_login', ARRAY_AGG(DISTINCT actor.login IGNORE NULLS LIMIT 1001))
, ('col_type', ARRAY_AGG(DISTINCT type IGNORE NULLS LIMIT 1001))
, ('col_org_login', ARRAY_AGG(DISTINCT org.login IGNORE NULLS LIMIT 1001))
]
FROM `githubarchive.year.2017`
))
This UNNEST on an array of structs will not work as it is, because the underlying columns have different data types and BigQuery will not be able to put the arrays under a single column (with an error like: Array elements of types {STRUCT<...>, STRUCT<...>} do not have a common supertype). I modified it to something like this to serve my use case.
SELECT * FROM UNNEST((
SELECT [
STRUCT("col_repo_name" AS col, to_json_string(ARRAY_AGG(DISTINCT repo.name IGNORE NULLS LIMIT 1001)) AS values)
, ('col_actor_login', to_json_string(ARRAY_AGG(DISTINCT actor.login IGNORE NULLS LIMIT 1001)))
, ('col_type', to_json_string(ARRAY_AGG(DISTINCT type IGNORE NULLS LIMIT 1001)))
, ('col_org_login', to_json_string(ARRAY_AGG(DISTINCT org.login IGNORE NULLS LIMIT 1001)))
]
FROM `githubarchive.year.2017`
))
And this worked well!

Selecting multiple values into single row - SQL server

I need to merge a table with an ID and various bit flags like this:
-----------------
a1 | x | | x |
-----------------
a1 | | x | |
-----------------
a1 | | | |
-----------------
b2 | x | | |
-----------------
b2 | | | |
-----------------
c3 | x | x | x |
into this form:
-----------------
a1 | x | x | x |
-----------------
b2 | x | | |
-----------------
c3 | x | x | x |
The problem is that the data are joined by a kind of option ID; each option has a unique ID which is joined with a1, b2. When I try to SELECT it using DISTINCT, I receive the results from table number 1. I can do it with subqueries in the SELECT, but that is a really weak solution for performance reasons.
Do you have any idea how to select and combine all these flags into a single row?
Use aggregation:
select col1, max(col2), max(col3), max(col4)
from table_name
group by col1
For the given result set you can use MIN and GROUP BY:
SELECT
tbl.Col
, MIN(tbl.Col1) Col1
, MIN(tbl.Col2) Col2
, MIN(tbl.Col3) Col3
FROM #table tbl
GROUP BY tbl.Col
However, if you have empty (blank) values rather than NULLs, use MAX(); otherwise MIN() returns the blank values:
SELECT
tbl.Col
, MAX(tbl.Col1) Col1
, MAX(tbl.Col2) Col2
, MAX(tbl.Col3) Col3
FROM #table tbl
GROUP BY tbl.Col
For example:
DECLARE #table TABLE
(
Col VARCHAR(50),
Col1 VARCHAR(50),
Col2 VARCHAR(50),
Col3 VARCHAR(50)
)
INSERT INTO #table
(
Col,
Col1,
Col2,
Col3
)
VALUES
( 'a1', -- Col - varchar(50)
'x', -- Col1 - varchar(50)
Null, -- Col2 - varchar(50)
'x' -- Col3 - varchar(50)
)
, ('a1', NULL, 'x', null)
, ('a1', NULL, 'x', null)
, ('b2', 'x', null, null)
, ('b2', null, null, null)
, ('c3', 'x', 'x', 'x')
SELECT
tbl.Col
, MIN(tbl.Col1) Col1
, MIN(tbl.Col2) Col2
, MIN(tbl.Col3) Col3
FROM #table tbl
GROUP BY tbl.Col
OUTPUT:
| Col | Col1 | Col2 | Col3 |
| --- | ---- | ---- | ---- |
| a1  | x    | x    | x    |
| b2  | x    | NULL | NULL |
| c3  | x    | x    | x    |
You want aggregation:
select col1, max(col2), max(col3), max(col4)
from tablename t
group by col1;
This assumes the blank values are NULL.
The general solution for such a situation is to simply aggregate and either use MIN or MAX on the columns.
SQL Server's data type BIT, however, is quirky. It's a little like a BOOLEAN, but not a real boolean. It is a little like a very limited numeric data type, but it isn't really a numeric type either. And there simply exist no aggregation functions for this data type. In standard SQL you'd have ANY and EVERY for the BOOLEAN type. In PostgreSQL you have BIT_OR and BIT_AND for BIT and BOOL_OR and BOOL_AND for BOOLEAN. SQL Server has nothing.
So convert your columns to a numeric type before using MIN (which would be a bitwise AND) or MAX (which would be a bitwise OR) on it. E.g.
select
id,
max(bit1 + 0) as bit1agg,
max(bit2 + 0) as bit2agg,
max(bit3 + 0) as bit3agg
from mytable
group by id
order by id;
You can also use CAST or CONVERT instead of course.
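For example, the same aggregation with an explicit CAST (a sketch, assuming the same table and column names as above):
-- same idea as above; CAST the BIT columns to INT so MAX/MIN can aggregate them
select
    id,
    max(cast(bit1 as int)) as bit1agg,
    max(cast(bit2 as int)) as bit2agg,
    max(cast(bit3 as int)) as bit3agg
from mytable
group by id
order by id;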

Import Unpivot results to new Table and match on Key

I currently have a few unpivot queries that yield about 2000 rows each. I need to take the results of those queries and put them in a new table, matching on a key.
Query Example:
Select DeviceSlot
FROM tbl1
unpivot(
    DeviceSlot
    For col in(
        col1,
        col2,
        col3
    )
) AS unpvt
Now I need to match the results from the query and insert them into a new table with about 20,000 rows.
Pseudo-Code for this:
Insert Into tbl2(DeviceSlot)
Select DeviceSlot
FROM tbl1
unpivot(
DeviceSlot
For col in(
col1,
col2,
col3
)
)AS Unpivot2
Where tbl1.key = tbl2.key
I've been pretty confused on how to do this, and I apologize if it is not clear.
I also have another unpivot query doing the same thing for different columns.
Not sure what you are asking for. When unpivoting to "normalize" data, the wanted "key" is typically derived during the unpivot. For example, below the id column of the original table is repeated in the unpivoted data to represent a foreign key for some new table.
SQL Fiddle
MS SQL Server 2014 Schema Setup:
CREATE TABLE Table1
([id] int, [col1] varchar(2), [col2] varchar(2), [col3] varchar(2))
;
INSERT INTO Table1
([id], [col1], [col2], [col3])
VALUES
(1, 'a', 'b', 'c'),
(2, 'aa', 'bb', 'cc')
;
Query 1:
select id as table1_fk, colheading, colvalue
from (
select * from table1
) t
unpivot (
colvalue for colheading in (col1, col2, col3)
) u
Results:
| table1_fk | colheading | colvalue |
|-----------|------------|----------|
| 1 | col1 | a |
| 1 | col2 | b |
| 1 | col3 | c |
| 2 | col1 | aa |
| 2 | col2 | bb |
| 2 | col3 | cc |
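If the goal is then to load those unpivoted rows into a second table keyed on the original id, the same query can feed an INSERT directly; a minimal sketch, assuming a hypothetical target table tbl2(table1_fk, colheading, colvalue):
-- tbl2 is hypothetical; adjust column names and types to your actual schema
INSERT INTO tbl2 (table1_fk, colheading, colvalue)
SELECT id AS table1_fk, colheading, colvalue
FROM (
    SELECT * FROM Table1
) t
UNPIVOT (
    colvalue FOR colheading IN (col1, col2, col3)
) u;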

Rotate one column and Select it comma separated [duplicate]

This question already has answers here:
Inner Join Two Table, aggregating varchar fields
(2 answers)
Closed 8 years ago.
I have a table like this:
PK | COL1 | COL2
----------------
1 | A | a
2 | B | b
3 | C | c
4 | A | d
5 | A | e
6 | B | f
7 | C | g
8 | C | h
and I want to do a SELECT that gets me the following result:
COL1 | COL2
---------------
A | a,d,e
B | b,f
C | c,g,h
As I understand SQL at the moment, I don't know how to do this without programming something extra, e.g. with PL/SQL.
But I am searching for a solution that is as DBMS-independent as possible.
This is an Oracle (11.2) solution:
select col1,
listagg(col2, ',') within group (order by col1) as col2
from the_table
group by col1;
As you need this for other DBMS as well, this would be the Postgres solution:
select col1,
string_agg(col2, ',' order by col1) as col2
from the_table
group by col1;
For MySQL this would be:
select col1,
group_concat(col2 ORDER BY col1 SEPARATOR ',') as col2
from the_table
group by col1;
For a SQL Server solution, see Vijaykumar's answer.
Try this!
SELECT [col1],
SUBSTRING(d.col2,1, LEN(d.col2) - 1) col2
FROM
(
SELECT DISTINCT [col1]
FROM table1
) a
CROSS APPLY
(
SELECT [col2] + ', '
FROM table1 AS b
WHERE a.[col1] = b.[col1]
FOR XML PATH('')
) d (col2)
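For completeness: on SQL Server 2017 and later you could also use STRING_AGG, which brings SQL Server in line with the one-liners for the other databases above; a minimal sketch against the same table (here called the_table, as in the earlier answers):
-- requires SQL Server 2017 or later
select col1,
       string_agg(col2, ',') within group (order by pk) as col2
from the_table
group by col1;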