Split values over multiple rows in RedShift - sql

The question of how to split a field (e.g. a CSV string) into multiple rows has already been answered:
Split values over multiple rows.
However, this question refers to MSSQL, and the answers use various features for which there are no RedShift equivalents.
For the sake of completeness, here's an example of what I'd like to do:
Current data:
| Key | Data     |
+-----+----------+
| 1   | 18,20,22 |
| 2   | 17,19    |
Required data:
| Key | Data |
+-----+------+
| 1   | 18   |
| 1   | 20   |
| 1   | 22   |
| 2   | 17   |
| 2   | 19   |
Now, I can suggest a workaround for the case of a small, bounded number of elements in the CSV field: use split_part and union over all possible array locations, like so:
SELECT Key, split_part(Data, ',', 1)
FROM mytable
WHERE split_part(Data, ',', 1) <> ''
UNION
SELECT Key, split_part(Data, ',', 2)
FROM mytable
WHERE split_part(Data, ',', 2) <> ''
-- etc. etc.
However, this is obviously very inefficient, and would not work for longer lists.
Any better ideas on how to do this?
EDIT:
There's also a somewhat similar question regarding multiplying rows: splitting rows in Redshift. However, I don't see how that approach can be applied here.
EDIT 2:
A possible duplicate: Redshift. Convert comma delimited values into rows. But nothing new there - the answer by @Masashi Miyazaki is similar to my suggestion above and suffers from the same issues.

Here is the Redshift answer; it will work with up to 10,000 values per row.
Set up test data:
create table test_data (key varchar(50),data varchar(max));
insert into test_data
values
(1,'18,20,22'),
(2,'17,19')
;
Code:
with ten_numbers as (select 1 as num union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 0)
, generated_numbers AS
(
    -- cross join four digit columns to generate the numbers 0 to 9999
    SELECT (1000 * t1.num) + (100 * t2.num) + (10 * t3.num) + t4.num AS gen_num
    FROM ten_numbers AS t1
    JOIN ten_numbers AS t2 ON 1 = 1
    JOIN ten_numbers AS t3 ON 1 = 1
    JOIN ten_numbers AS t4 ON 1 = 1
)
, splitter AS
(
    -- keep only as many positions as the widest list actually has
    SELECT *
    FROM generated_numbers
    WHERE gen_num BETWEEN 1 AND (SELECT max(REGEXP_COUNT(data, '\\,') + 1)
                                 FROM test_data)
)
, expanded_input AS
(
    -- one output row per key per non-empty list position
    SELECT
        key,
        split_part(data, ',', s.gen_num) AS data
    FROM test_data AS td
    JOIN splitter AS s ON 1 = 1
    WHERE split_part(data, ',', s.gen_num) <> ''
)
SELECT * FROM expanded_input
ORDER BY key, data;
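A note on how this works: the ten_numbers/generated_numbers cross join manufactures the integers 0-9999 without needing a physical numbers table, and the splitter CTE caps that range at the widest list actually present, so split_part is only evaluated for positions that can exist.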

You're also welcome to spin up an RDS PostgreSQL instance and create a dblink to Redshift. Then you can manipulate the result set as on a normal PostgreSQL DB, and even push the result back into Redshift through the same dblink.
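A minimal sketch of that approach, assuming an RDS PostgreSQL instance with the dblink extension installed and network access to the cluster (the connection string below is a placeholder):
CREATE EXTENSION IF NOT EXISTS dblink;

-- pull the rows over from Redshift
SELECT key, data
FROM dblink('host=my-cluster.example.redshift.amazonaws.com port=5439 dbname=mydb user=master password=...',
            'SELECT key, data FROM mytable')
     AS t(key varchar, data varchar);

-- split locally, e.g. with regexp_split_to_table(data, ','), then write the
-- result back to Redshift through the same dblink connection if needed.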

Related

How to convert JSONB array of pair values to rows and columns?

Given that I have a jsonb column with an array of pair values:
[1001, 1, 1002, 2, 1003, 3]
I want to turn each pair into a row, with each pair values as columns:
| a | b |
|------|---|
| 1001 | 1 |
| 1002 | 2 |
| 1003 | 3 |
Is something like that even possible in an efficient way?
I found a few inefficient (slow) ways, like using LEAD() or joining the table to itself on the value from the next row, but queries take ~10 minutes.
DDL:
CREATE TABLE products (
id int not null,
data jsonb not null
);
INSERT INTO products VALUES (1, '[1001, 1, 1002, 2, 1003, 3]');
DB Fiddle: https://www.db-fiddle.com/f/2QnNKmBqxF2FB9XJdJ55SZ/0
Thanks!
This is not an elegant approach from a declarative standpoint, but can you please see whether this performs better for you?
with indexes as (
    select id, generate_series(1, jsonb_array_length(data) / 2) - 1 as idx
    from products
)
select p.id, p.data->>(2 * i.idx) as a, p.data->>(2 * i.idx + 1) as b
from indexes i
join products p on p.id = i.id;
This query
SELECT j.data
FROM products
CROSS JOIN jsonb_array_elements(data) j(data)
should run faster if you just need to unpivot all elements within the query as in the demo.
Demo
or even drop the columns coming from the products table:
SELECT jsonb_array_elements(data)
FROM products
OR
If you need to return like this
| a | b |
|------|---|
| 1001 | 1 |
| 1002 | 2 |
| 1003 | 3 |
as unpivoting into two columns, then use:
SELECT MAX(CASE WHEN mod(rn, 2) = 1 THEN data->>(rn-1)::int END) AS a,
       MAX(CASE WHEN mod(rn, 2) = 0 THEN data->>(rn-1)::int END) AS b
FROM (
    SELECT p.data, row_number() over () as rn
    FROM products p
    CROSS JOIN jsonb_array_elements(data) j(data)
) q
GROUP BY ceil(rn/2::float)
ORDER BY ceil(rn/2::float)
Demo
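For completeness, one more variant that was not in the original answers: on PostgreSQL 9.4+, jsonb_array_elements_text with WITH ORDINALITY numbers each element directly, so the pairs can be grouped without a global row_number(); a sketch:
SELECT p.id,
       max(CASE WHEN e.ord % 2 = 1 THEN e.elem END)::int AS a,
       max(CASE WHEN e.ord % 2 = 0 THEN e.elem END)::int AS b
FROM products p
CROSS JOIN jsonb_array_elements_text(p.data) WITH ORDINALITY AS e(elem, ord)
GROUP BY p.id, (e.ord + 1) / 2   -- integer division groups consecutive pairs
ORDER BY p.id, a;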

How can I find a non-duplicate value in a column, disregarding special characters?

I have an attribute field of labels relevant to my work. I am looking for duplicates within this field; the issue is that the inputs are inconsistent. For example:
Group | Label |
---------------
1     | H7    |
1     | H-7   |
2     | C9    |
2     | C 9   |
3     | D5    |
3     | M 9   |
The result I am looking for is just:
3     | D5    |
3     | M 9   |
as these are truly different from each other. I am using the following query currently:
SELECT *
FROM TABLE t3
WHERE t3.group IN (
    SELECT t1.group
    FROM TABLE t1, TABLE t2
    WHERE t1.group = t2.group
      AND t1.label <> t2.label
)
How can I get the query to disregard special characters?
If the "special" character can be anything other than alphanumeric chars, then you can use regexp_replace:
select max(t.group), max(t.label)
from your_table t
group by regexp_replace(t.label, '[^[:alnum:]]', '')
having count(*) = 1;
If there are only a limited number of special characters possible in the values, then perhaps a non-"regexp" solution would work - using replace.
Also, avoid using keywords such as "group" as identifiers.
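For example, if dash and space are the only special characters that occur, a non-regexp version could be as simple as this sketch (using grp as a stand-in column name, per the note above):
select max(t.grp), max(t.label)
from your_table t
group by replace(replace(t.label, '-', ''), ' ', '')  -- strip dashes, then spaces
having count(*) = 1;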
Try:
select regexp_replace(label,'[^[:alnum:]]',''), count(1) cnt
from some_table
group by regexp_replace(label,'[^[:alnum:]]','')
having count(1) > 1
This will show the duplicate labels (based on alphanumerics only).
You can use regexp_replace():
SELECT t.*
FROM TABLE t
WHERE NOT EXISTS (SELECT 1
                  FROM TABLE tt
                  WHERE tt.group = t.group AND tt.rowid <> t.rowid AND
                        regexp_replace(tt.label, '[^a-zA-Z0-9]', '') = regexp_replace(t.label, '[^a-zA-Z0-9]', '')
                 );
This should return all the original rows that are singletons. If you want all rows for a group where all are singletons:
SELECT t.*
FROM TABLE t
WHERE t.group IN (SELECT tt.group
                  FROM (SELECT tt.group, regexp_replace(tt.label, '[^a-zA-Z0-9]', '') as label_clean, COUNT(*) as cnt
                        FROM TABLE tt
                        GROUP BY tt.group, regexp_replace(tt.label, '[^a-zA-Z0-9]', '')
                       ) tt
                  GROUP BY tt.group
                  HAVING MAX(cnt) = 1
                 );

Undo a LISTAGG in redshift

I have a table that probably resulted from a listagg, similar to this:
# select * from s;
s
-----------
a,c,b,d,a
b,e,c,d,f
(2 rows)
How can I change it into this set of rows:
a
c
b
d
a
b
e
c
d
f
In Redshift, you can join against a table of numbers and use that as the split index:
--with recursive Numbers as (
-- select 1 as i
-- union all
-- select i + 1 as i from Numbers where i <= 5
--)
with Numbers(i) as (
select 1 union
select 2 union
select 3 union
select 4 union
select 5
)
select split_part(s, ',', i) from Numbers, s order by s, i;
EDIT: Redshift doesn't seem to support recursive subqueries, only Postgres does. :(
SQL Fiddle
Oracle 11g R2 Schema Setup:
create table s(
col varchar2(20) );
insert into s values('a,c,b,d,a');
insert into s values('b,e,c,d,f');
Query 1:
SELECT REGEXP_SUBSTR(t1.col, '([^,])+', 1, t2.COLUMN_VALUE)
FROM s t1
CROSS JOIN TABLE(
    CAST(
        MULTISET(
            SELECT LEVEL
            FROM DUAL
            CONNECT BY LEVEL <= REGEXP_COUNT(t1.col, '([^,])+')
        ) AS SYS.odciNumberList
    )
) t2
Results:
| REGEXP_SUBSTR(T1.COL,'([^,])+',1,T2.COLUMN_VALUE) |
|---------------------------------------------------|
| a |
| c |
| b |
| d |
| a |
| b |
| e |
| c |
| d |
| f |
As this is tagged Redshift and no answer so far gives a complete picture of undoing a LISTAGG in Redshift properly, here is code that covers all its use cases:
CREATE TEMPORARY TABLE s (
s varchar(255)
);
INSERT INTO s VALUES('a,c,b,d,a');
INSERT INTO s VALUES('b,e,c,d,f');
SELECT
    TRIM(split_part(s.s, ',', R::smallint)) AS s
FROM s
LEFT JOIN (
    SELECT
        ROW_NUMBER() OVER (PARTITION BY 1) AS R
    FROM any_large_table
    LIMIT 1000
) extend_number
    ON (SELECT MAX(regexp_count(s.s, ',') + 1) FROM s) >= extend_number.R
    AND NULLIF(TRIM(split_part(s.s, ',', extend_number.R::smallint)), '') IS NOT NULL;
DROP TABLE s;
Here “any_large_table” is any table you already have in Redshift that has enough records for your purposes, depending on the number of elements each record's list will contain (in the above case, I make sure up to one thousand positions are covered). Unfortunately, the generate_series function does not work properly in Redshift as far as I know, so joining against a number set like this is the only way.
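If no suitably large table is at hand, the digit cross join from the accepted answer above can manufacture the numbers instead; a sketch producing 1 to 1,000 that could stand in for the extend_number subquery:
WITH digits AS (
    SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
    UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
)
SELECT 100 * h.d + 10 * t.d + u.d + 1 AS R
FROM digits h CROSS JOIN digits t CROSS JOIN digits u;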
One further piece of advice: check whether you can get at the values before they are LISTAGGed whenever possible. As the code above shows, undoing it is quite complex, and you save a lot of maintenance time on your code if you keep things simple (that is, whenever the opportunity is available).

SQL - Compare Table1.items (Ntext) to Table2.item (Varchar)

I'm working on SQL Server 2012.
I would like to split the different items of Table1 in order to compare them with a specific column of Table2.
Table1 has rows like this:
| id  | items                     |
| 1   | aaa;ery;sha;cbre;dezrzyg; |
| 2   | aaa;ery;sha;cbre;dezrzyg; | // could be the same items as another row
| 3   | dg;e3ry;sd6ha;cb8re;48dz; |
| 4   | e5zeza;48;dz;46az;12BREd; |
| ... | ...                       |
| 10  | aaa                       | // currently matches because the query compares the whole cell
items is a string (ntext in the db) and it never contains spaces.
Table2 has rows like this:
| id  | item  |
| 1   | aaa   | // match
| 2   | AAA   | // match
| 3   | aaa52 | // doesn't match
| 4   | 2aaa2 | // doesn't match
| ... | ...   |
item is also a string (nvarchar in the db) and it never contains spaces.
Here is my current SQL query:
SELECT * FROM Table1 t1
INNER JOIN Table2 t2 ON t1.items = t2.item
How could I solve my problem?
Should I split the string and compare each of Table1.items to Table2.item?
Is there something in SQL to resolve it easily?
Is there something in SQL to resolve it easily?
No, but you can use LIKE creatively. Indexes cannot help you with performance when you do something like this.
select *
from Table1 as T1
inner join Table2 as T2
on ';'+cast(T1.items as nvarchar(max))+';' like '%;'+T2.item+';%'
SQL Fiddle
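The reason for wrapping both sides in ';' is that it makes every item, including the first and last, appear delimited on both ends; that's why 'aaa' matches while 'aaa52' and '2aaa2' do not.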
The failsafe solution is to split the content of the items column into a table-like form and then join it to Table2.
Say we have these tables:
create table #t1 (id int, items varchar(100));
go
insert #t1 values
( 1, 'aaa;ery;sha;cbre;dezrzyg;'),
( 2, 'aaa;ery;sha;cbre;dezrzyg;'),
( 3, 'dg;e3ry;sd6ha;cb8re;48dz;'),
( 4, 'e5zeza;48;dz;46az;12BREd;'),
(10, 'aaa');
go
create table #t2 (id int, item varchar(100));
go
insert #t2 values
(1, 'aaa'),
(2, 'AAA'),
(3, 'aaa52'),
(4, '2aaa2')
go
We'll use the following approach to split the items:
select substring(items, n, charindex(';', items + ';', n) - n)
from numbers, #t1
where substring(';' + items, n, 1) = ';'
and n < len(items) + 1
This requires a numbers table, see here how to create it.
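For reference, a minimal sketch of such a numbers table (one common way among many), covering n = 1 to 10,000:
create table numbers (n int primary key);

-- cross join a system catalog with itself to get enough rows to number
insert into numbers (n)
select top (10000) row_number() over (order by (select null))
from sys.all_objects a cross join sys.all_objects b;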
Here's the whole query:
select distinct #t1.id, #t1.items, case when #t2.id is null then 'doesn''t match' else 'match' end
from #t1
cross apply (
select substring(items, n, charindex(';', items + ';', n) - n)
from numbers
where substring(';' + items, n, 1) = ';'
and n < len(items) + 1
) x (col)
left join #t2 on x.col = #t2.item
--where #t2.id is not null

SQL - Find missing int values in mostly ordered sequential series

I manage a message based system in which a sequence of unique integer ids will be entirely represented at the end of the day, though they will not necessarily arrive in order.
I am looking for help in finding missing ids in this series using SQL. If my column values are something like the below, how can I find which ids I am missing in this sequence, in this case 6?
The sequence will begin and end at an arbitrary point each day, so min and max would differ on each run. Coming from a Perl background, I threw some regex in there.
ids
1
2
3
5
4
7
9
8
10
Help would be much appreciated.
Edit: We run oracle
Edit2: Thanks all. I'll be running through your solutions next week in the office.
Edit3: I settled for the time being on something like the below, with ORIG_ID being the original id column and MY_TABLE being the source table. Looking closer at my data, there are a variety of cases beyond just numeric data in a string. In some cases there is a prefix or suffix of non-numeric characters; in others, dashes or spaces are intermixed into the numeric id. Beyond this, ids periodically appear multiple times, so I included DISTINCT.
I would appreciate any further input, specifically in regard to the best route of stripping out non-numeric characters.
SELECT
    CASE
        WHEN NUMERIC_ID + 1 = NEXT_ID - 1
            THEN TO_CHAR(NUMERIC_ID + 1)
        ELSE TO_CHAR(NUMERIC_ID + 1) || '-' || TO_CHAR(NEXT_ID - 1)
    END AS MISSING_SEQUENCES
FROM (
    SELECT
        NUMERIC_ID,
        LEAD(NUMERIC_ID, 1, NULL) OVER (ORDER BY NUMERIC_ID ASC) AS NEXT_ID
    FROM (
        SELECT DISTINCT TO_NUMBER(REGEXP_REPLACE(ORIG_ID, '[^[:digit:]]', '')) AS NUMERIC_ID
        FROM MY_TABLE
    )
)
WHERE NEXT_ID != NUMERIC_ID + 1
I've been there.
FOR ORACLE:
I found this extremely useful query on the net a while ago and noted it down; I don't remember the site now, but you can search for "GAP ANALYSIS" on Google.
SELECT CASE
           WHEN ids + 1 = lead_no - 1 THEN TO_CHAR(ids + 1)
           ELSE TO_CHAR(ids + 1) || '-' || TO_CHAR(lead_no - 1)
       END AS missing_track_no
FROM (SELECT ids,
             LEAD(ids, 1, NULL) OVER (ORDER BY ids ASC) AS lead_no
      FROM YOURTABLE
     )
WHERE lead_no != ids + 1
Here, the result is:
MISSING_TRACK_NO
----------------
6
If there were multiple gaps, say 2, 6, 7 and 9 were missing, then it would be:
MISSING_TRACK_NO
----------------
2
6-7
9
This is sometimes called an exclusion join. That is, try to do a join and return only rows where there is no match.
SELECT t1.value - 1
FROM ThisTable AS t1
LEFT OUTER JOIN ThisTable AS t2
    ON t1.value = t2.value + 1
WHERE t2.value IS NULL
Note this will always report at least one row: one less than the MIN value. Also, if a gap spans two or more numbers, it will only report one missing value per gap.
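If you need the full extent of each gap, one variant (a sketch using a mirrored exclusion join, with the same hypothetical table and column names) pairs each gap with the next value actually present:
SELECT t1.value + 1 AS gap_start,
       (SELECT MIN(t3.value) FROM ThisTable t3 WHERE t3.value > t1.value) - 1 AS gap_end
FROM ThisTable AS t1
LEFT OUTER JOIN ThisTable AS t2
    ON t2.value = t1.value + 1
WHERE t2.value IS NULL                                -- successor is missing
  AND t1.value < (SELECT MAX(value) FROM ThisTable); -- ignore the final row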
You didn't state your DBMS, so I'm assuming PostgreSQL:
select aid as missing_id
from generate_series( (select min(id) from message), (select max(id) from message)) as aid
left join message m on m.id = aid
where m.id is null;
This will report any missing value in a sequence between the minimum and maximum id in your table (including gaps that are bigger than one)
psql (9.1.1)
Type "help" for help.
postgres=> select * from message;
id
----
1
2
3
4
5
7
8
9
11
14
(10 rows)
postgres=> select aid as missing_id
postgres-> from generate_series( (select min(id) from message), (select max(id) from message)) as aid
postgres-> left join message m on m.id = aid
postgres-> where m.id is null;
missing_id
------------
6
10
12
13
(4 rows)
postgres=>
I applied it in MySQL, and it worked:
mysql> select * from sequence;
+--------+
| number |
+--------+
| 1 |
| 2 |
| 4 |
| 6 |
| 7 |
| 8 |
+--------+
6 rows in set (0.00 sec)
mysql> SELECT t1.number - 1 FROM sequence AS t1 LEFT OUTER JOIN sequence AS t2 ON t1.number = t2.number + 1 WHERE t2.number IS NULL;
+---------------+
| t1.number - 1 |
+---------------+
| 0 |
| 3 |
| 5 |
+---------------+
3 rows in set (0.00 sec)
SET search_path='tmp';
DROP table tmp.table_name CASCADE;
CREATE table tmp.table_name ( num INTEGER NOT NULL PRIMARY KEY);
-- make some data
INSERT INTO tmp.table_name(num) SELECT generate_series(1,20);
-- create some gaps
DELETE FROM tmp.table_name WHERE random() < 0.3 ;
SELECT * FROM table_name;
-- EXPLAIN ANALYZE
WITH zbot AS (
SELECT 1+tn.num AS num
FROM table_name tn
WHERE NOT EXISTS (
SELECT * FROM table_name nx
WHERE nx.num = tn.num+1
)
)
, ztop AS (
SELECT -1+tn.num AS num
FROM table_name tn
WHERE NOT EXISTS (
SELECT * FROM table_name nx
WHERE nx.num = tn.num-1
)
)
SELECT zbot.num AS bot
,ztop.num AS top
FROM zbot, ztop
WHERE zbot.num <= ztop.num
AND NOT EXISTS ( SELECT *
FROM table_name nx
WHERE nx.num >= zbot.num
AND nx.num <= ztop.num
)
ORDER BY bot,top
;
Result:
CREATE TABLE
INSERT 0 20
DELETE 9
num
-----
1
2
6
7
10
11
13
14
15
18
19
(11 rows)
bot | top
-----+-----
3 | 5
8 | 9
12 | 12
16 | 17
(4 rows)
Note: a recursive CTE is also possible (and probably shorter).
UPDATE: here comes the recursive CTE ...:
WITH RECURSIVE tree AS (
SELECT 1+num AS num
FROM table_name t0
UNION
SELECT 1+num FROM tree tt
WHERE EXISTS ( SELECT *
FROM table_name xt
WHERE xt.num > tt.num
)
)
SELECT * FROM tree
WHERE NOT EXISTS (
SELECT *
FROM table_name nx
WHERE nx.num = tree.num
)
ORDER BY num
;
Results: (same data)
num
-----
3
4
5
8
9
12
16
17
20
(9 rows)
select student_key, next_student_key
from (
    select student_key, lead(student_key) over (order by student_key) as next_student_key
    from student_table
)
where student_key <> next_student_key - 1;
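This returns, for each gap, the value just before it (student_key) and the next value actually present (next_student_key); the missing ids are the values strictly between them.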