SQL - Find missing int values in mostly ordered sequential series

I manage a message-based system in which a sequence of unique integer ids will be entirely represented by the end of the day, though the ids will not necessarily arrive in order.
I am looking for help finding missing ids in this series using SQL. If my column values are something like the below, how can I find which ids I am missing in this sequence, in this case 6?
The sequence will begin and end at an arbitrary point each day, so the min and max will differ on each run. Coming from a Perl background, I threw some regex in there.
ids
1
2
3
5
4
7
9
8
10
Help would be much appreciated.
Edit: We run Oracle.
Edit2: Thanks, all. I'll be running through your solutions next week in the office.
Edit3: For the time being I settled on something like the below, with ORIG_ID being the original id column and MY_TABLE being the source table. Looking closer at my data, there are a variety of cases beyond just numeric data in a string: in some cases there is a prefix or suffix of non-numeric characters, and in others dashes or spaces are intermixed into the numeric id. Beyond this, ids periodically appear multiple times, so I included DISTINCT.
I would appreciate any further input, specifically in regard to the best route for stripping out the non-numeric characters.
SELECT
    CASE
        WHEN NUMERIC_ID + 1 = NEXT_ID - 1
            THEN TO_CHAR( NUMERIC_ID + 1 )
        ELSE TO_CHAR( NUMERIC_ID + 1 ) || '-' || TO_CHAR( NEXT_ID - 1 )
    END MISSING_SEQUENCES
FROM
(
    SELECT
        NUMERIC_ID,
        LEAD( NUMERIC_ID, 1, NULL ) OVER ( ORDER BY NUMERIC_ID ASC ) AS NEXT_ID
    FROM
    (
        SELECT DISTINCT
            TO_NUMBER( REGEXP_REPLACE( ORIG_ID, '[^[:digit:]]', '' ) ) AS NUMERIC_ID
        FROM MY_TABLE
    )
)
WHERE NEXT_ID != NUMERIC_ID + 1
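For a quick sanity check outside the database, the same strip-dedupe-and-scan logic can be sketched in plain Python. The input strings below are invented examples of the dirty id formats described above (prefixes, spaces, dashes), not real data:

```python
import re

def missing_ranges(raw_ids):
    """Strip non-digits, dedupe, sort, then report each gap between
    consecutive surviving ids as 'n' or 'lo-hi'."""
    digits = (re.sub(r"\D", "", s) for s in raw_ids)
    nums = sorted({int(d) for d in digits if d})
    out = []
    for cur, nxt in zip(nums, nums[1:]):
        if nxt != cur + 1:                     # gap between cur and nxt
            lo, hi = cur + 1, nxt - 1
            out.append(str(lo) if lo == hi else "%d-%d" % (lo, hi))
    return out

print(missing_ranges(["ID-1", "2", "3 ", "5", "4", "7", "9", "8", "10"]))
# ['6']
```

This mirrors the query's behavior: a one-value gap is reported as a single number, a wider gap as a range.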

I've been there.
FOR ORACLE:
I found this extremely useful query on the net a while ago and noted it down, but I don't remember the site now; you may search for "GAP ANALYSIS" on Google.
SELECT CASE
           WHEN ids + 1 = lead_no - 1 THEN TO_CHAR( ids + 1 )
           ELSE TO_CHAR( ids + 1 ) || '-' || TO_CHAR( lead_no - 1 )
       END Missing_track_no
FROM (SELECT ids,
             LEAD( ids, 1, NULL ) OVER ( ORDER BY ids ASC ) lead_no
      FROM YOURTABLE
     )
WHERE lead_no != ids + 1
Here, the result is:
MISSING_TRACK_NO
-----------------
6
If there were multiple gaps, say 2, 6-7, and 9, then it would be:
MISSING_TRACK_NO
-----------------
2
6-7
9

This is sometimes called an exclusion join. That is, try to do a join and return only rows where there is no match.
SELECT t1.value - 1
FROM ThisTable AS t1
LEFT OUTER JOIN ThisTable AS t2
  ON t1.value = t2.value + 1
WHERE t2.value IS NULL
Note this will always report at least one row, which will be one less than the MIN value.
Also, for a gap of two or more numbers, it will only report one missing value.
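A runnable sketch of the exclusion join, using SQLite via Python's sqlite3 module with a made-up table named `seq`, illustrates both the result and the caveats above:

```python
import sqlite3

# Table and data are invented; 'seq' plays the role of ThisTable.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE seq (value INTEGER)")
con.executemany("INSERT INTO seq VALUES (?)",
                [(v,) for v in (1, 2, 4, 6, 7, 8)])

rows = con.execute("""
    SELECT t1.value - 1
    FROM seq AS t1
    LEFT OUTER JOIN seq AS t2 ON t1.value = t2.value + 1
    WHERE t2.value IS NULL
    ORDER BY 1
""").fetchall()

# One spurious row below MIN (0), then one value per gap (3, 5).
print([r[0] for r in rows])
# [0, 3, 5]
```

Note that the gap 4..6 is reported only as 5, exactly as the caveat says.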

You didn't state your DBMS, so I'm assuming PostgreSQL:
select aid as missing_id
from generate_series( (select min(id) from message), (select max(id) from message)) as aid
left join message m on m.id = aid
where m.id is null;
This will report any missing value in a sequence between the minimum and maximum id in your table (including gaps that are bigger than one)
psql (9.1.1)
Type "help" for help.
postgres=> select * from message;
id
----
1
2
3
4
5
7
8
9
11
14
(10 rows)
postgres=> select aid as missing_id
postgres-> from generate_series( (select min(id) from message), (select max(id) from message)) as aid
postgres-> left join message m on m.id = aid
postgres-> where m.id is null;
missing_id
------------
6
10
12
13
(4 rows)
postgres=>
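If you want to try the generate_series approach without a Postgres instance, it can be approximated in SQLite (which lacks generate_series but supports recursive CTEs). Table and column names follow the example above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE message (id INTEGER)")
con.executemany("INSERT INTO message VALUES (?)",
                [(v,) for v in (1, 2, 3, 4, 5, 7, 8, 9, 11, 14)])

# Recursive CTE stands in for generate_series(min(id), max(id)).
missing = [r[0] for r in con.execute("""
    WITH RECURSIVE series(aid) AS (
        SELECT (SELECT MIN(id) FROM message)
        UNION ALL
        SELECT aid + 1 FROM series
        WHERE aid < (SELECT MAX(id) FROM message)
    )
    SELECT aid
    FROM series
    LEFT JOIN message m ON m.id = series.aid
    WHERE m.id IS NULL
    ORDER BY aid
""")]
print(missing)
# [6, 10, 12, 13]
```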

I applied it in MySQL, and it worked:
mysql> select * from sequence;
+--------+
| number |
+--------+
| 1 |
| 2 |
| 4 |
| 6 |
| 7 |
| 8 |
+--------+
6 rows in set (0.00 sec)
mysql> SELECT t1.number - 1 FROM sequence AS t1 LEFT OUTER JOIN sequence AS t2
    -> ON t1.number = t2.number + 1 WHERE t2.number IS NULL;
+---------------+
| t1.number - 1 |
+---------------+
| 0 |
| 3 |
| 5 |
+---------------+
3 rows in set (0.00 sec)

SET search_path='tmp';

DROP TABLE tmp.table_name CASCADE;
CREATE TABLE tmp.table_name ( num INTEGER NOT NULL PRIMARY KEY);

-- make some data
INSERT INTO tmp.table_name(num) SELECT generate_series(1,20);
-- create some gaps
DELETE FROM tmp.table_name WHERE random() < 0.3 ;

SELECT * FROM table_name;

-- EXPLAIN ANALYZE
WITH zbot AS (
    SELECT 1 + tn.num AS num
    FROM table_name tn
    WHERE NOT EXISTS (
        SELECT * FROM table_name nx
        WHERE nx.num = tn.num + 1
    )
)
, ztop AS (
    SELECT -1 + tn.num AS num
    FROM table_name tn
    WHERE NOT EXISTS (
        SELECT * FROM table_name nx
        WHERE nx.num = tn.num - 1
    )
)
SELECT zbot.num AS bot
     , ztop.num AS top
FROM zbot, ztop
WHERE zbot.num <= ztop.num
  AND NOT EXISTS (
      SELECT * FROM table_name nx
      WHERE nx.num >= zbot.num
        AND nx.num <= ztop.num
  )
ORDER BY bot, top
;
Result:
CREATE TABLE
INSERT 0 20
DELETE 9
num
-----
1
2
6
7
10
11
13
14
15
18
19
(11 rows)
bot | top
-----+-----
3 | 5
8 | 9
12 | 12
16 | 17
(4 rows)
Note: a recursive CTE is also possible (and probably shorter).
UPDATE: here comes the recursive CTE ...:
WITH RECURSIVE tree AS (
    SELECT 1 + num AS num
    FROM table_name t0
    UNION
    SELECT 1 + num FROM tree tt
    WHERE EXISTS (
        SELECT * FROM table_name xt
        WHERE xt.num > tt.num
    )
)
SELECT * FROM tree
WHERE NOT EXISTS (
    SELECT * FROM table_name nx
    WHERE nx.num = tree.num
)
ORDER BY num
;
Results: (same data)
num
-----
3
4
5
8
9
12
16
17
20
(9 rows)
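The bot/top gap-range query above is portable; here is a sketch of it on the same sample data (the fixed rows that survived the random DELETE in the transcript), run through SQLite from Python:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table_name (num INTEGER PRIMARY KEY)")
con.executemany("INSERT INTO table_name VALUES (?)",
                [(v,) for v in (1, 2, 6, 7, 10, 11, 13, 14, 15, 18, 19)])

# zbot = candidates for the bottom of a gap, ztop = candidates for the top.
gaps = con.execute("""
    WITH zbot AS (
        SELECT num + 1 AS num FROM table_name tn
        WHERE NOT EXISTS (SELECT 1 FROM table_name nx WHERE nx.num = tn.num + 1)
    ),
    ztop AS (
        SELECT num - 1 AS num FROM table_name tn
        WHERE NOT EXISTS (SELECT 1 FROM table_name nx WHERE nx.num = tn.num - 1)
    )
    SELECT zbot.num AS bot, ztop.num AS top
    FROM zbot, ztop
    WHERE zbot.num <= ztop.num
      AND NOT EXISTS (SELECT 1 FROM table_name nx
                      WHERE nx.num BETWEEN zbot.num AND ztop.num)
    ORDER BY bot, top
""").fetchall()
print(gaps)
# [(3, 5), (8, 9), (12, 12), (16, 17)]
```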

select student_key, next_student_key
from (
    select student_key,
           lead(student_key) over (order by student_key) as next_student_key
    from student_table
)
where student_key <> next_student_key - 1;

Related

How to find duplicate sets of values in column SQL

I have a database table in SQL Server like this:
+----+--------+
| ID | Number |
+----+--------+
| 1 | 4 |
| 2 | 2 |
| 3 | 6 |
| 4 | 5 |
| 5 | 3 |
| 6 | 2 |
| 7 | 6 |
| 8 | 4 |
| 9 | 5 |
| 10 | 1 |
| 11 | 6 |
| 12 | 4 |
| 13 | 2 |
| 14 | 6 |
+----+--------+
I want to get all rows whose value in column Number matches the last row (or the last 2 rows, or the last 3 rows, and so on). Having found those matches, I then want the value that appears next after each match, and a count of how many times it appears.
Result output like this:
If matching the last row:
We see that the numbers following a 6 in column Number are 4 and 5.
In column Number, the pair 6,4 appears 2 times and the pair 6,5 appears 1 time.
+---------------------+-------------------------+--------------+
| "Condition to find" | "Next Number in column" | Times appear |
+---------------------+-------------------------+--------------+
| 6 | 5 | 1 |
| 6 | 4 | 2 |
+---------------------+-------------------------+--------------+
If matching the last two rows:
+---------------------+-------------------------+--------------+
| "Condition to find" | "Next Number in column" | Times appear |
+---------------------+-------------------------+--------------+
| 2,6 | 5 | 1 |
| 2,6 | 4 | 1 |
+---------------------+-------------------------+--------------+
If matching the last 3 rows:
+---------------------+-------------------------+--------------+
| "Condition to find" | "Next Number in column" | Times appear |
+---------------------+-------------------------+--------------+
| 4,2,6 | 5 | 1 |
+---------------------+-------------------------+--------------+
And for the last 4, 5, 6, ... rows, continue until Times appear returns 0:
+---------------------+-------------------------+--------------+
| "Condition to find" | "Next Number in column" | Times appear |
+---------------------+-------------------------+--------------+
| 6,4,2,6 | | |
+---------------------+-------------------------+--------------+
Any idea how to get this? Thanks so much!
Here's an answer which uses the LEAD function, which (once ordered) takes a value from a certain number of rows ahead.
It converts your table with one number per row so that each row also includes the next 3 numbers.
Then you can join on those columns to get the counts.
CREATE TABLE #Src (Id int PRIMARY KEY, Num int)
INSERT INTO #Src (Id, Num) VALUES
( 1, 4),
( 2, 2),
( 3, 6),
( 4, 5),
( 5, 3),
( 6, 2),
( 7, 6),
( 8, 4),
( 9, 5),
(10, 1),
(11, 6),
(12, 4),
(13, 2),
(14, 6)
CREATE TABLE #SrcWithNext (Id int PRIMARY KEY, Num int, Next1 int, Next2 int, Next3 int)
-- First step - use LEAD to get the next 1, 2, 3 values
INSERT INTO #SrcWithNext (Id, Num, Next1, Next2, Next3)
SELECT ID, Num,
LEAD(Num, 1, NULL) OVER (ORDER BY Id) AS Next1,
LEAD(Num, 2, NULL) OVER (ORDER BY Id) AS Next2,
LEAD(Num, 3, NULL) OVER (ORDER BY Id) AS Next3
FROM #Src
SELECT * FROM #SrcWithNext
/* Find number with each combination */
-- 2 chars
SELECT A.Num, A.Next1, COUNT(*) AS Num_Instances
FROM (SELECT DISTINCT Num, Next1 FROM #SrcWithNext) AS A
INNER JOIN #SrcWithNext AS B ON A.Num = B.Num AND A.Next1 = B.Next1
WHERE A.Num <= B.Num
GROUP BY A.Num, A.Next1
ORDER BY A.Num, A.Next1
-- 3 chars
SELECT A.Num, A.Next1, A.Next2, COUNT(*) AS Num_Instances
FROM (SELECT DISTINCT Num, Next1, Next2 FROM #SrcWithNext) AS A
INNER JOIN #SrcWithNext AS B
ON A.Num = B.Num
AND A.Next1 = B.Next1
AND A.Next2 = B.Next2
WHERE A.Num <= B.Num
GROUP BY A.Num, A.Next1, A.Next2
ORDER BY A.Num, A.Next1, A.Next2
-- 4 chars
SELECT A.Num, A.Next1, A.Next2, A.Next3, COUNT(*) AS Num_Instances
FROM (SELECT DISTINCT Num, Next1, Next2, Next3 FROM #SrcWithNext) AS A
INNER JOIN #SrcWithNext AS B
ON A.Num = B.Num
AND A.Next1 = B.Next1
AND A.Next2 = B.Next2
AND A.Next3 = B.Next3
WHERE A.Num <= B.Num
GROUP BY A.Num, A.Next1, A.Next2, A.Next3
ORDER BY A.Num, A.Next1, A.Next2, A.Next3
Here's a db<>fiddle to check.
Notes
The A.Num <= B.Num means it finds all matches to itself, and then only counts others once
This answer finds all combinations. To filter, it currently would need to filter as separate columns e.g., instead of 2,6, you'd filter on Num = 2 AND Next1 = 6. Feel free to then do various text/string concatenation functions to create references for your preferred search/filter approach.
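A reduced sketch of the LEAD idea, matching only the single last row (the Next1 case), can be run in SQLite (3.25+ for window functions) from Python on the question's data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, num INTEGER)")
con.executemany("INSERT INTO src VALUES (?, ?)",
                list(enumerate([4, 2, 6, 5, 3, 2, 6, 4, 5, 1, 6, 4, 2, 6], 1)))

# The last row holds 6, so count what follows each earlier 6.
rows = con.execute("""
    WITH with_next AS (
        SELECT id, num, LEAD(num) OVER (ORDER BY id) AS next1
        FROM src
    )
    SELECT num AS condition_to_find, next1 AS next_number, COUNT(*) AS times
    FROM with_next
    WHERE num = 6 AND next1 IS NOT NULL
    GROUP BY num, next1
    ORDER BY next1
""").fetchall()
print(rows)
# [(6, 4, 2), (6, 5, 1)]
```

This reproduces the "last row" table from the question; the 2-row and 3-row cases add further LEAD columns to the join, as in the answer above.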
Hmmm . . . I am thinking that you want to create the "pattern to find" as a string. Unfortunately, string_agg() is not a window function, but you can use apply:
select t.*, p.*
from t cross apply
(select string_agg(number, ',') within group (order by id) as pattern
from (select top (3) t2.*
from t t2
where t2.id <= t.id
order by t2.id desc
) t2
) p;
You would change the "3" to whatever number of rows that you want.
Then you can use this to identify the rows where the patterns are matched and aggregate:
with tp as (
select t.*, p.*
from t cross apply
(select string_agg(number, ',') within group (order by id) as pattern
from (select top (3) t2.*
from t t2
where t2.id <= t.id
order by t2.id desc
) t2
) p
)
select pattern_to_find, next_number, count(*)
from (select tp.*,
first_value(pattern) over (order by id desc) as pattern_to_find,
lead(number) over (order by id) as next_number
from tp
) tp
where pattern = pattern_to_find
group by pattern_to_find, next_number;
Here is a db<>fiddle.
If you are using an older version of SQL Server -- one that doesn't support string_agg() -- you can calculate the pattern using lag():
with tp as (
select t.*,
concat(lag(number, 2) over (order by id), ',',
lag(number, 1) over (order by id), ',',
number
) as pattern
from t
)
Actually, if you have a large amount of data, it would be interesting to know which is faster -- the apply version or the lag() version. I suspect that lag() might be faster.
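For what it's worth, the lag() variant is easy to test on the sample data in SQLite, which also concatenates with ||. Note that the first two rows get a NULL pattern because lag() runs off the start of the window:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, number INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)",
                list(enumerate([4, 2, 6, 5, 3, 2, 6, 4, 5, 1, 6, 4, 2, 6], 1)))

# Build the 3-wide pattern ending at each row from two lag() calls.
pats = con.execute("""
    SELECT id,
           LAG(number, 2) OVER (ORDER BY id) || ',' ||
           LAG(number, 1) OVER (ORDER BY id) || ',' || number AS pattern
    FROM t
    ORDER BY id
""").fetchall()

print(pats[-1])   # the 'pattern to find' taken from the last row
# (14, '4,2,6')
```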
EDIT:
In unsupported versions of SQL Server, you can get the pattern using:
select t.*, p.*
from t cross apply
(select (select cast(number as varchar(255)) + ','
from (select top (3) t2.*
from t t2
where t2.id <= t.id
order by t2.id desc
) t2
order by t2.id desc
for xml path ('')
) as pattern
) p
You can use similar logic for lead().
I tried to solve this problem by converting the "Number" column to a string.
Here is my code, using a function whose input is the number of last rows to select:
(Be careful: the name of the main table is "test".)
create function duplicate(@nlast int)
returns @temp table (RowNumbers varchar(20), Number varchar(1))
as
begin
    declare @num varchar(20)
    set @num = ''
    declare @count int = 1
    while @count <= (select count(id) from test)
    begin
        set @num = @num + cast((select Number from test where @count = ID) as varchar(20))
        set @count = @count + 1
    end
    declare @lastnum varchar(20)
    set @lastnum = (select RIGHT(@num, @nlast))
    declare @count2 int = 1
    while @count2 <= len(@num) - @nlast
    begin
        if (SUBSTRING(@num, @count2, @nlast) = @lastnum)
        begin
            insert into @temp
            select @lastnum, SUBSTRING(@num, @count2 + @nlast, 1)
        end
        set @count2 = @count2 + 1
    end
    return
end
go

select RowNumbers AS "Condition to find", Number AS "Next Number in column", COUNT(Number) AS "Times appear"
from dbo.duplicate(2)
group by Number, RowNumbers

How can I find a non duplicate value in a column, disregarding special characters?

I have an attribute field of labels relevant to my work. I am looking for duplicates within this field; the issue is, the inputs are inconsistent. For example:
Group | Label |
---------------
1 | H7 |
1 | H-7 |
2 | C9 |
2 | C 9 |
3 | D5 |
3 | M 9 |
The result I am looking for is just:
3 | D5 |
3 | M 9 |
as these are truly different from each other. I am using the following query currently:
SELECT *
FROM TABLE t3
WHERE t3.group IN (
    SELECT t1.group
    FROM TABLE t1, TABLE t2
    WHERE t1.group = t2.group
    AND (t1.label <> t2.label)
)
How can I get the query to disregard special characters?
If the "special" character can be anything other than alphanumeric chars, then you can use regexp_replace:
select max(t.group), max(t.label)
from your_table t
group by regexp_replace(t.label, '[^[:alnum:]]', '')
having count(*) = 1;
If there are only a limited number of special characters possible in the values, then perhaps a non-"regexp" solution would work - using replace.
Also, avoid using keywords such as "group" as identifiers.
Try:
select regexp_replace(label,'[^[:alnum:]]',''), count(1) cnt
from some_table
group by regexp_replace(label,'[^[:alnum:]]','')
having count(1) > 1
This will show the duplicate labels (based on alphanumerics only)
You can use regexp_replace():
SELECT t.*
FROM TABLE t
WHERE NOT EXISTS (SELECT 1
                  FROM TABLE tt
                  WHERE tt.group = t.group AND tt.rowid <> t.rowid AND
                        regexp_replace(tt.label, '[^a-zA-Z0-9]', '') = regexp_replace(t.label, '[^a-zA-Z0-9]', '')
                 );
This should return all the original rows that are singletons. If you want all rows for a group where all are singletons:
SELECT t.*
FROM TABLE t
WHERE t.group IN (SELECT tt.group
                  FROM (SELECT tt.group, regexp_replace(tt.label, '[^a-zA-Z0-9]', '') as label_clean, COUNT(*) as cnt
                        FROM TABLE tt
                        GROUP BY tt.group, regexp_replace(tt.label, '[^a-zA-Z0-9]', '')
                       ) tt
                  GROUP BY tt.group
                  HAVING MAX(cnt) = 1
                 );
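The normalize-then-count idea is engine-independent. A small Python sketch of the same grouping (regexp strip, count per cleaned label, keep singletons) on the question's data:

```python
import re
from collections import Counter

rows = [(1, "H7"), (1, "H-7"), (2, "C9"), (2, "C 9"), (3, "D5"), (3, "M 9")]

def clean(label):
    # Same idea as regexp_replace(label, '[^[:alnum:]]', '')
    return re.sub(r"[^A-Za-z0-9]", "", label)

counts = Counter(clean(label) for _, label in rows)
singletons = [(grp, label) for grp, label in rows if counts[clean(label)] == 1]
print(singletons)
# [(3, 'D5'), (3, 'M 9')]
```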

Running SUM in T-SQL

Sorry for the bad topic, but I wasn't sure what to call it.
I have a table that looks like this:
+-----++-----+
| Id ||Count|
+-----++-----+
| 1 || 1 |
+-----++-----+
| 2 || 5 |
+-----++-----+
| 3 || 8 |
+-----++-----+
| 4 || 3 |
+-----++-----+
| 5 || 6 |
+-----++-----+
| 6 || 8 |
+-----++-----+
| 7 || 3 |
+-----++-----+
| 8 || 1 |
+-----++-----+
I'm trying to make a select from this table where, every time the SUM of row 1 + row 2 + row 3 (etc.) reaches 10, it's a "HIT" and the count starts over again.
Requested output:
+-----++-----++-----+
| Id ||Count|| HIT |
+-----++-----++-----+
| 1 || 1 || N | Count = 1
+-----++-----++-----+
| 2 || 5 || N | Count = 6
+-----++-----++-----+
| 3 || 8 || Y | Count = 14 (over 10)
+-----++-----++-----+
| 4 || 3 || N | Count = 3
+-----++-----++-----+
| 5 || 6 || N | Count = 9
+-----++-----++-----+
| 6 || 8 || Y | Count = 17 (over 10..)
+-----++-----++-----+
| 7 || 3 || N | Count = 3
+-----++-----++-----+
| 8 || 1 || N | Count = 4
+-----++-----++-----+
How do I do this, with the best performance? I have no idea.
You can't do this using window/analytic functions, because the breakpoints are not known in advance. Sometimes it is possible to calculate the breakpoints; however, in this case the breakpoints depend on a non-linear function of the previous values (I can't think of a better word than "non-linear" right now). That is, sometimes adding "1" to an earlier value has zero effect on the calculation for the current row, and sometimes it has a big effect. The implication is that the calculation has to start at the beginning and iterate through the data.
A minor modification to the problem would be solvable with such functions. If the problem were, instead, to carry over the excess amount for each group (rather than restarting the sum), it would be solvable using cumulative sums (and some other trickery).
Recursive queries (which others have provided) or a sequential operation are the best way to approach this problem. Unfortunately, it doesn't have a set-based solution.
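The sequential nature of the problem is easiest to see in a plain loop. This Python sketch reproduces the question's expected output (threshold 10, reset after each hit):

```python
def tag_hits(counts, threshold=10):
    """Accumulate sequentially; flag 'Y' and reset whenever the running
    sum reaches the threshold. Each reset depends on all prior rows,
    which is exactly the step a window function cannot express."""
    total, out = 0, []
    for c in counts:
        total += c
        if total >= threshold:
            out.append((c, total, "Y"))
            total = 0
        else:
            out.append((c, total, "N"))
    return out

for row in tag_hits([1, 5, 8, 3, 6, 8, 3, 1]):
    print(row)
# (1, 1, 'N'), (5, 6, 'N'), (8, 14, 'Y'), (3, 3, 'N'), ...
```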
You could use recursive queries.
Please note the following query assumes the id values are all in sequence; otherwise, use ROW_NUMBER() to create a new id.
WITH cte AS (
    SELECT id, [Count], [Count] AS total_count
    FROM Table1
    WHERE id = 1
    UNION ALL
    SELECT t2.id, t2.[Count],
           CASE WHEN t1.total_count >= 10 THEN t2.[Count] ELSE t1.total_count + t2.[Count] END
    FROM Table1 t2
    INNER JOIN cte t1
        ON t2.id = t1.id + 1
)
SELECT *
FROM cte
ORDER BY id
SQL Fiddle
I'm really hoping someone can show us if it's possible to do this using straightforward window functions. That's the real challenge.
In the meantime, here is how I would do it using recursion. This handles gaps in the sequence, and the edge case of the first row already being >= 10.
I also added the maxrecursion hint to remove the default recursion limit, but I honestly don't know how well it will run with larger amounts of data.
with NumberedRows as (
    select Id, Cnt,
           row_number() over(order by id) as rn
    from CountTable
), RecursiveCTE as (
    select Id, Cnt, rn,
           case when Cnt >= 10 then 0 else Cnt end as CumulativeSum,
           case when Cnt >= 10 then 'Y' else 'N' end as hit
    from NumberedRows
    where rn = 1
    union all
    select n.Id, n.Cnt, n.rn,
           case when (n.Cnt + r.CumulativeSum) >= 10 then 0 else n.Cnt + r.CumulativeSum end as CumulativeSum,
           case when (n.Cnt + r.CumulativeSum) >= 10 then 'Y' else 'N' end as hit
    from RecursiveCTE r
    join NumberedRows n
        on n.rn = r.rn + 1
)
select Id, Cnt, hit
from RecursiveCTE
order by Id
option (maxrecursion 0)
SQLFiddle Demo
How about this using running totals:
DECLARE @Data TABLE(
    Id INT
    ,SubTotal INT
)
INSERT INTO @Data VALUES(1, 5)
INSERT INTO @Data VALUES(2, 3)
INSERT INTO @Data VALUES(3, 4)
INSERT INTO @Data VALUES(4, 4)
INSERT INTO @Data VALUES(5, 7)

DECLARE @RunningTotal INT = 0
DECLARE @HitCount INT = 0

SELECT
    @RunningTotal = CASE WHEN @RunningTotal < 10 THEN @RunningTotal + SubTotal ELSE SubTotal END
    ,@HitCount = @HitCount + CASE WHEN @RunningTotal >= 10 THEN 1 ELSE 0 END
FROM @Data ORDER BY Id

SELECT @HitCount -- Outputs 2
Having re-read the question, I see this does not meet the required output. I'll leave the answer here, as it may be of use to someone looking for an example of a running-total solution to this type of problem that doesn't need each row tagged with a Y or an N.

Undo a LISTAGG in redshift

I have a table that probably resulted from a LISTAGG, similar to this:
# select * from s;
s
-----------
a,c,b,d,a
b,e,c,d,f
(2 rows)
How can I change it into this set of rows:
a
c
b
d
a
b
e
c
d
f
In Redshift, you can join against a table of numbers and use that as the split index:
--with recursive Numbers as (
-- select 1 as i
-- union all
-- select i + 1 as i from Numbers where i <= 5
--)
with Numbers(i) as (
select 1 union
select 2 union
select 3 union
select 4 union
select 5
)
select split_part(s,',', i) from Numbers, s ORDER by s,i;
EDIT: Redshift doesn't seem to support recursive subqueries; only Postgres does. :(
SQL Fiddle
Oracle 11g R2 Schema Setup:
create table s(
col varchar2(20) );
insert into s values('a,c,b,d,a');
insert into s values('b,e,c,d,f');
Query 1:
SELECT REGEXP_SUBSTR(t1.col, '([^,])+', 1, t2.COLUMN_VALUE)
FROM s t1
CROSS JOIN TABLE
(
    CAST
    (
        MULTISET
        (
            SELECT LEVEL
            FROM DUAL
            CONNECT BY LEVEL <= REGEXP_COUNT(t1.col, '([^,])+')
        )
        AS SYS.odciNumberList
    )
) t2
Results:
| REGEXP_SUBSTR(T1.COL,'([^,])+',1,T2.COLUMN_VALUE) |
|---------------------------------------------------|
| a |
| c |
| b |
| d |
| a |
| b |
| e |
| c |
| d |
| f |
As this is tagged Redshift and no answer so far gives a complete overview of undoing a LISTAGG in Redshift properly, here is code that handles all of its use cases:
CREATE TEMPORARY TABLE s (
    s varchar(255)
);
INSERT INTO s VALUES('a,c,b,d,a');
INSERT INTO s VALUES('b,e,c,d,f');

SELECT
    TRIM(split_part(s.s, ',', R::smallint)) AS s
FROM s
LEFT JOIN (
    SELECT
        ROW_NUMBER() OVER (PARTITION BY 1) AS R
    FROM any_large_table
    LIMIT 1000
) extend_number
    ON (SELECT MAX(regexp_count(s.s, ',') + 1) FROM s) >= extend_number.R
   AND NULLIF(TRIM(split_part(s.s, ',', extend_number.R::smallint)), '') IS NOT NULL;

DROP TABLE s;
Here "any_large_table" is any table you already have in Redshift with enough records for your purposes, depending on the number of list elements each record will contain (in the case above, I ensure up to one thousand). Unfortunately, the generate_series function does not work properly in Redshift as far as I know, and this is the only way.
One further piece of advice: check whether you can get at the values before they are LISTAGGed in the first place, whenever possible. As the code above shows, undoing it is quite complex, and you save a lot of maintenance time on your code by keeping things simple (that is, whenever the opportunity is available).
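If you just need to prototype the unnest logic somewhere, a recursive-CTE splitter in SQLite (run from Python) produces the requested rows. SQLite is used here only because it is easy to test locally, not because Redshift supports this syntax:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE s (s TEXT)")
con.executemany("INSERT INTO s VALUES (?)", [("a,c,b,d,a",), ("b,e,c,d,f",)])

# Peel one element per recursion step; pos preserves the original order.
parts = [r[0] for r in con.execute("""
    WITH RECURSIVE split(orig, pos, part, rest) AS (
        SELECT s, 0, NULL, s || ',' FROM s
        UNION ALL
        SELECT orig, pos + 1,
               substr(rest, 1, instr(rest, ',') - 1),
               substr(rest, instr(rest, ',') + 1)
        FROM split
        WHERE rest <> ''
    )
    SELECT part FROM split
    WHERE part IS NOT NULL
    ORDER BY orig, pos
""")]
print(parts)
# ['a', 'c', 'b', 'd', 'a', 'b', 'e', 'c', 'd', 'f']
```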

Split values over multiple rows in RedShift

The question of how to split a field (e.g. a CSV string) into multiple rows has already been answered:
Split values over multiple rows.
However, this question refers to MSSQL, and the answers use various features for which there are no RedShift equivalents.
For the sake of completeness, here's an example of what I'd like to do:
Current data:
| Key | Data |
+-----+----------+
| 1 | 18,20,22 |
| 2 | 17,19 |
Required data:
| Key | Data |
+-----+----------+
| 1 | 18 |
| 1 | 20 |
| 1 | 22 |
| 2 | 17 |
| 2 | 19 |
Now, I can suggest a workaround for the case of a small, bounded number of elements in the CSV field: use split_part and union over all possible array locations, like so:
SELECT Key, split_part(Data, ',', 1)
FROM mytable
WHERE split_part(Data, ',', 1) <> ''
UNION
SELECT Key, split_part(Data, ',', 2)
FROM mytable
WHERE split_part(Data, ',', 2) <> ''
-- etc. etc.
However, this is obviously very inefficient, and would not work for longer lists.
Any better ideas on how to do this?
EDIT:
There's also a somewhat similar question regarding multiplying rows: splitting rows in Redshift. However, I don't see how that approach can be applied here.
EDIT 2:
A possible duplicate: Redshift. Convert comma delimited values into rows. But nothing new there; the answer by @Masashi Miyazaki is similar to my suggestion above and suffers from the same issues.
Here is the Redshift answer; it will work with up to 10 thousand values per row.
Set up test data
create table test_data (key varchar(50),data varchar(max));
insert into test_data
values
(1,'18,20,22'),
(2,'17,19')
;
code
with ten_numbers as (
    select 1 as num union select 2 union select 3 union select 4 union select 5
    union select 6 union select 7 union select 8 union select 9 union select 0
)
, generated_numbers AS
(
    SELECT (1000 * t1.num) + (100 * t2.num) + (10 * t3.num) + t4.num AS gen_num
    FROM ten_numbers AS t1
    JOIN ten_numbers AS t2 ON 1 = 1
    JOIN ten_numbers AS t3 ON 1 = 1
    JOIN ten_numbers AS t4 ON 1 = 1
)
, splitter AS
(
    SELECT *
    FROM generated_numbers
    WHERE gen_num BETWEEN 1 AND (SELECT max(REGEXP_COUNT(data, '\\,') + 1)
                                 FROM test_data)
)
, expanded_input AS
(
    SELECT
        key,
        split_part(data, ',', s.gen_num) AS data
    FROM test_data AS td
    JOIN splitter AS s ON 1 = 1
    WHERE split_part(data, ',', s.gen_num) <> ''
)
SELECT * FROM expanded_input
order by key, data;
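The numbers-table splitter boils down to: generate indexes 1..max_parts, take the split_part of every row at each index, and drop empties. A hypothetical Python helper showing exactly that logic on the question's data:

```python
def split_rows(table):
    """Emulate the numbers-table splitter: for each generated index up
    to the longest list, take that split_part of every row, drop the
    empty results, and emit (key, value) pairs."""
    max_parts = max(csv.count(",") + 1 for _, csv in table)
    out = []
    for i in range(1, max_parts + 1):           # the 'splitter' numbers
        for key, csv in table:
            parts = csv.split(",")
            if i <= len(parts) and parts[i - 1] != "":
                out.append((key, int(parts[i - 1])))
    return sorted(out)

print(split_rows([(1, "18,20,22"), (2, "17,19")]))
# [(1, 18), (1, 20), (1, 22), (2, 17), (2, 19)]
```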
You're welcome to take an RDS PostgreSQL instance and create a dblink to Redshift. Then you can manipulate the result set as on a normal PostgreSQL DB, and even put the result back into Redshift through the same dblink.