How to split a comma-separated string into multiple rows in AWS Redshift? - sql

I'm trying to split string values into multiple rows, grouped by the id column.
Most of the answers I've seen rely on functions that are not supported by AWS Redshift: https://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-functions.html
Assume I have a table like this:
id | order_id
---|------------------
1  | 10001,10005,10006
2  | 11000,12005
And I would like a result like this:
id | order_id
---|---------
1  | 10001
1  | 10005
1  | 10006
2  | 11000
2  | 12005

A few concepts to keep in mind. First is the recursive CTE, which can generate a number for each position in the order_id string. Second is the JSON functions, which can split the string into parts based on the commas.
A full test case with expanded input data:
create table test (id int, order_id varchar(256));
insert into test values
(1, '10001,10005,10006'),
(2, '11000,12005'),
(3, '10001,10005,10006,21000,22005'),
(4, '21000,22005,10001,10005,10006,11000,12005,10001,10005,10006,21000,22005')
;
with recursive numbers(n) as (
    select 0 as n
    union all
    select n + 1
    from numbers n
    where n < (select max(length(order_id) - length(replace(order_id, ',', ''))) from test)
),
input as (
    select id, order_id,
           length(order_id) - length(replace(order_id, ',', '')) as no_of_elements -- counts the number of commas in the string
    from test
)
-- wrapping the list in brackets turns it into a JSON array; n is the 0-based element index
select id, json_extract_array_element_text('[' || order_id || ']', n.n) as order_id
from input t
join numbers n
    on n.n <= t.no_of_elements
order by id, order_id
;
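If you'd rather avoid the JSON functions, the same numbers CTE works with Redshift's native split_part. A minimal sketch against the same test table (split_part indexes parts from 1, so the numbers start at 1 here):
with recursive numbers(n) as (
    select 1 as n
    union all
    select n + 1
    from numbers n
    where n < (select max(length(order_id) - length(replace(order_id, ',', ''))) + 1 from test)
)
select t.id, split_part(t.order_id, ',', n.n) as order_id
from test t
join numbers n
    on n.n <= length(t.order_id) - length(replace(t.order_id, ',', '')) + 1 -- commas + 1 = number of elements
order by id, order_id
;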

Related

SQL recursively creating matching groups based on reference table

Imagine you had a data source like:
Id | Val | Data_Date
---|-----|-----------
1  | A   | 2022-01-01
2  | B   | 2022-01-05
3  | C   | 2022-01-09
4  | D   | 2022-01-31
5  | E   | 2022-02-01
With a reference table matching values in this way:
Target_Val | Matching_Val | Valid_Start | Valid_End
-----------|--------------|-------------|-----------
B          | A            | 2022-01-04  | 2022-01-06
C          | B            | 2022-01-09  | 2022-01-09
D          | A            | 2022-01-31  | 2022-01-31
Imagine you want to create a table grouping values together where there is a match in the reference table within X days, say 4.
And you want to apply this matching recursively.
Output would be something like this:
Group_Id | Id
---------|---
1        | 1
1        | 2
1        | 3
2        | 4
3        | 5
The logic here would be that C matches to B in the appropriate date range, and B matches to A in the appropriate date range, therefore they are all one group.
But although D matches to A, it is too far apart (greater than 4 days). And E doesn't match to anything.
There could be any depth (A > B > C > D ...)
Is there an appropriate algorithm in SQL to accomplish this? The values of the group IDs are unimportant and just meant to group data points together.
Here's my attempt. You do indeed need a recursive CTE, but you need to join the source table to the groups table and then join back to the source table to ensure that the child fits within the parent's 4-day window. E.g. in the case of D and A, as you mention, they match, but they aren't close enough to be counted.
Then I added a calc to work out which rows were valid hierarchies and used that for the recursive join, because we can exclude anything not part of a hierarchy.
After that we need to order the records by their depth so we know which parent record is first, e.g. in the case of A > B > C.
Then DENSE_RANK over the results to get your final groups. This will need some testing with deeper levels of recursion, but it should point you in the right direction:
CREATE TABLE SourceData
(
    Id INTEGER,
    Val CHAR(1),
    Data_Date DATE
);
CREATE TABLE Groups
(
    Target_Val CHAR(1),
    Matching_Val CHAR(1),
    Valid_Start DATE,
    Valid_End DATE
);
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (1,'A','2022-01-01');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (2,'B','2022-01-05');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (3,'C','2022-01-09');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (4,'D','2022-01-31');
INSERT INTO SourceData (Id, Val, Data_Date) VALUES (5,'E','2022-02-01');
INSERT INTO Groups (Target_Val, Matching_Val, Valid_Start, Valid_End ) VALUES ('B','A','2022-01-04','2022-01-06');
INSERT INTO Groups (Target_Val, Matching_Val, Valid_Start, Valid_End ) VALUES ('C','B','2022-01-09','2022-01-09');
INSERT INTO Groups (Target_Val, Matching_Val, Valid_Start, Valid_End ) VALUES ('D','A','2022-01-31','2022-01-31');
WITH sourceCTE AS
(
    SELECT sd.Id, sd.Val, sd.Data_Date, g.Valid_Start, g.Valid_End,
           IIF(s.Val IS NULL, sd.Val, g.Matching_Val) [ParentVal],
           CAST(NULL AS DATE) [start], CAST(NULL AS DATE) [end], 1 [Depth],
           IIF(s.Val IS NULL, 0, 1) IsHierarchy
    FROM SourceData sd
    LEFT JOIN Groups g ON g.Target_Val = sd.Val AND sd.Data_Date BETWEEN g.Valid_Start AND g.Valid_End
    LEFT JOIN SourceData s ON s.Val = g.Matching_Val AND ABS(DATEDIFF(DAY, s.Data_Date, sd.Data_Date)) < 5
    UNION ALL
    SELECT s.Id, s.Val, s.Data_Date, g.Valid_Start, g.Valid_End, g.Matching_Val, g.Valid_Start, g.Valid_End, s.[Depth] + 1, 1
    FROM sourceCTE s
    INNER JOIN Groups g ON g.Target_Val = s.[ParentVal] AND s.IsHierarchy = 1
),
ResultCTE AS
(
    SELECT *, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY [Depth] DESC) [RNum]
    FROM sourceCTE
)
SELECT DENSE_RANK() OVER (ORDER BY ParentVal) [Group_Id], Id
FROM ResultCTE
WHERE [RNum] = 1
I can't promise this is the best solution, because just like the query optimiser I gave up after about 2 hours, ha.
Also, for any future questions, please provide sample data in script format to save time creating the structure.

SQL grouping by distinct values in a multi-value string column

I want to perform a group-by based on the distinct values in a string column that has multiple values.
The column holds a list of strings in a standard format, separated by commas. The potential values are only a, b, c, d.
For example the column collection (type: String) contains:
Row 1: ["a","b"]
Row 2: ["b","c"]
Row 3: ["b","c","a"]
Row 4: ["d"]
The expected output is a count of unique values:
collection | count
a | 2
b | 3
c | 2
d | 1
For all of the below I used this table:
create table tmp (
id INT auto_increment,
test VARCHAR(255),
PRIMARY KEY (id)
);
insert into tmp (test) values
("a,b"),
("b,c"),
("b,c,a"),
("d")
;
If the possible values are only a, b, c, d you can try one of these.
Take note that this only works if you don't have similar values like test and test_new, because then test would also match all the test_new rows and the count would be off:
select collection, COUNT(*) as count from tmp JOIN (
select CONCAT("%", tb.collection, "%") as like_collection, collection from (
select "a" COLLATE utf8_general_ci as collection
union select "b" COLLATE utf8_general_ci as collection
union select "c" COLLATE utf8_general_ci as collection
union select "d" COLLATE utf8_general_ci as collection
) tb
) tb1
ON tmp.test LIKE tb1.like_collection
GROUP BY tb1.collection;
This will give you the result you want:
collection | count
a | 2
b | 3
c | 2
d | 1
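If overlapping values are a concern, one way to make the match exact (a sketch against the same tmp table) is to wrap both the pattern and the column in commas, so only whole tokens can match:
select tb1.collection, COUNT(*) as count from tmp JOIN (
    select CONCAT("%,", tb.collection, ",%") as like_collection, tb.collection from (
        select "a" COLLATE utf8_general_ci as collection
        union select "b" COLLATE utf8_general_ci as collection
        union select "c" COLLATE utf8_general_ci as collection
        union select "d" COLLATE utf8_general_ci as collection
    ) tb
) tb1
ON CONCAT(",", tmp.test, ",") LIKE tb1.like_collection -- ",test_new," no longer matches "%,test,%"
GROUP BY tb1.collection;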
Or you can try this one:
SELECT
(SELECT COUNT(*) FROM tmp WHERE test LIKE '%a%') as a_count,
(SELECT COUNT(*) FROM tmp WHERE test LIKE '%b%') as b_count,
(SELECT COUNT(*) FROM tmp WHERE test LIKE '%c%') as c_count,
(SELECT COUNT(*) FROM tmp WHERE test LIKE '%d%') as d_count
;
The result would be like this
a_count | b_count | c_count | d_count
2 | 3 | 2 | 1
What you need to do is first explode the collection column into separate rows (like a flatMap operation). In Redshift the only way to generate new rows is to JOIN, so let's CROSS JOIN your input table with a static table of consecutive numbers and keep only the numbers up to the number of elements in each collection. Then we'll use the split_part function to read the item at the correct index. Once we have the exploded table, we'll do a simple GROUP BY.
If your items are stored as JSON array strings ('["a", "b", "c"]') then you can use JSON_ARRAY_LENGTH and JSON_EXTRACT_ARRAY_ELEMENT_TEXT instead of REGEXP_COUNT and SPLIT_PART respectively, as sketched after the query below.
with
index as (
    select 1 as i
    union all select 2
    union all select 3
    union all select 4 -- could be substituted with 'select row_number() over () as i from arbitrary_table limit 4'
),
agg as (
    select 'a,b' as collection
    union all select 'b,c'
    union all select 'b,c,a'
    union all select 'd'
)
select
    split_part(collection, ',', i) as item,
    count(*)
from index, agg
where regexp_count(agg.collection, ',') + 1 >= index.i -- only keep rows where the index fits the number of items
group by 1
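For completeness, here is the JSON variant mentioned above as a minimal sketch, assuming the collection column holds JSON array strings (note that json_extract_array_element_text uses 0-based indexes):
with
index as (
    select 1 as i
    union all select 2
    union all select 3
    union all select 4
),
agg as (
    select '["a","b"]' as collection
    union all select '["b","c"]'
    union all select '["b","c","a"]'
    union all select '["d"]'
)
select
    json_extract_array_element_text(collection, i - 1) as item, -- array elements are 0-based
    count(*)
from index, agg
where json_array_length(agg.collection) >= index.i
group by 1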

Create a table of duplicated rows of another table using the select statement

I have a table with one column containing different integers.
For each integer in the table I would like to duplicate it as many times as it has digits.
For example:
12345 (5 digits):
1. 12345
2. 12345
3. 12345
4. 12345
5. 12345
I thought of doing it using with recursive t (...) as (...), but I didn't manage, since I don't really understand how it works and what is happening "behind the scenes".
I don't want to use insert because I want it to be scalable and automatic for as many integers as needed in a table.
Any thoughts and an explanation would be great.
The easiest way is to join to a table with numbers from 1 to n in it.
SELECT n, x
FROM yourtable
JOIN
(
SELECT day_of_calendar AS n
FROM sys_calendar.CALENDAR
WHERE n BETWEEN 1 AND 12 -- maximum number of digits
) AS dt
ON n <= CHAR_LENGTH(TRIM(ABS(x)))
In my example I abused Teradata's built-in calendar, but that's not a good choice: the optimizer doesn't know how many rows will be returned, and as the plan must be a product join it might decide to do something stupid. So better use a numbers table...
Create a numbers table that will contain the integers from 1 to the maximum number of digits that the numbers in your table will have (I went with 6):
create table numbers(num int)
insert numbers
select 1 union select 2 union select 3 union select 4 union select 5 union select 6
You already have your table (but here's what I was using to test):
create table your_table(num int)
insert your_table
select 12345 union select 678
Here's the query to get your results:
select ROW_NUMBER() over(partition by b.num order by b.num) row_num, b.num, LEN(cast(b.num as char)) num_digits
into #temp
from your_table b
cross join numbers n
select t.num
from #temp t
where t.row_num <= t.num_digits
I found a nice way to perform this action. Here goes:
with recursive t (num, num_as_char, char_n)
as
(
    -- anchor: each number as a string, plus its first character
    select num
          ,cast (num as varchar (100)) as num_as_char
          ,substr (num_as_char, 1, 1)
    from numbers
    union all
    -- each iteration strips one leading character, emitting one extra row per digit
    -- (note: reusing a select-list alias like num_as_char2 here is Teradata-specific)
    select num
          ,substr (t.num_as_char, 2) as num_as_char2
          ,substr (num_as_char2, 1, 1)
    from t
    where char_length (num_as_char2) > 0 -- stop once the string is exhausted
)
select *
from t
order by num, char_length (num_as_char) desc

SQL count one field two times in select with different parameters

I'd like to have my query count one column two times in my select, based on the value. So for example.
input: table
id | type
-------------|-------------
1 | 1
2 | 1
3 | 2
4 | 2
5 | 2
output: query (in 1 row, not two):
countfirst = 2 (two times 1)
countsecond = 3 (three times 2)
A default count in a select counts all rows in the query. But I'd like to count rows based on a value without limiting the query. When using, for example, WHERE type = '1', type 2 gets filtered out and cannot be counted anymore.
Is there a solution for this case in SQL?
--- EXAMPLE USE (the situation above is simplified, but the case is the same) ---
With one query I get all cars grouped by type from a table. There are two type signs: yellow (1 in the db) and grey (2 in the db). So in that query I have the following output:
Renault - found ten times - two yellow signs - eight grey signs
Create a table; the script is given below.
CREATE TABLE [dbo].[temptbl](
[id] [int] NULL,
[type] [int] NULL
) ON [PRIMARY]
Execute the insert script as
insert into [temptbl] values(1,1)
insert into [temptbl] values(2,1)
insert into [temptbl] values(3,2)
insert into [temptbl] values(4,2)
insert into [temptbl] values(5,2)
Then execute the query.
;WITH cte AS (
    SELECT [type], COUNT([type]) cnt
    FROM temptbl
    GROUP BY [type]
)
SELECT * FROM cte
PIVOT (SUM([cnt]) FOR [type] IN ([1],[2])) AS pvt
You can use the GROUP BY clause as Mureinik suggested, but with the addition of a WHERE clause to filter the results.
Below shows the results for type = 1 (assuming type is an INT):
SELECT type, COUNT(*) AS NoOfRecords
FROM table
WHERE type IN (1)
GROUP BY type
So if we wanted 1 and 2 we can use:
SELECT type, COUNT(*) AS NoOfRecords
FROM table
WHERE type IN (1, 2)
GROUP BY type
Lastly, that IN statement can pull type from another query:
SELECT type, COUNT(*) AS NoOfRecords
FROM table
WHERE type IN (SELECT type FROM someOtherTable)
GROUP BY type
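If, as in the original question, you want both counts in a single row rather than one row per type, conditional aggregation is the usual technique. A sketch against the temptbl table from the first answer (COUNT skips the NULLs produced when the CASE has no ELSE):
SELECT
    COUNT(CASE WHEN [type] = 1 THEN 1 END) AS countfirst,
    COUNT(CASE WHEN [type] = 2 THEN 1 END) AS countsecond
FROM temptbl;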

Pivot values of a column based on a search string

Note: I would like to do this in a single SQL statement, not PL/SQL, a cursor loop, etc.
I have data that looks like this:
ID String
-- ------
01 2~3~1~4
02 0~3~4~6
03 1~4~5~1
I want to provide a report that somehow pivots the values of the String column into distinct rows such as:
Value "Total number in table"
----- -----------------------
1 3
2 1
3 2
4 3
5 1
6 1
How do I go about doing this? It's like a pivot table, but I am trying to pivot the data in a column rather than pivoting the columns in the table.
Note that in the real application I do not actually know what the values of the String column are; I only know that the separator between values is '~'.
Given this test data:
CREATE TABLE tt (ID INTEGER, VALUE VARCHAR2(100));
INSERT INTO tt VALUES (1,'2~3~1~4');
INSERT INTO tt VALUES (2,'0~3~4~6');
INSERT INTO tt VALUES (3,'1~4~5~1');
This query:
SELECT VALUE, COUNT(*) "Total number in table"
FROM (SELECT tt.ID, SUBSTR(qq.value, sp, ep-sp) VALUE
FROM (SELECT id, value
, INSTR('~'||value, '~', 1, L) sp -- 1st posn of substr at this level
, INSTR(value||'~', '~', 1, L) ep -- posn of delimiter at this level
FROM tt JOIN (SELECT LEVEL L FROM dual CONNECT BY LEVEL < 20) q -- 20 is max #substrings
ON LENGTH(value)-LENGTH(REPLACE(value,'~'))+1 >= L
) qq JOIN tt on qq.id = tt.id)
GROUP BY VALUE
ORDER BY VALUE;
Results in:
VALUE Total number in table
---------- ---------------------
0 1
1 3
2 1
3 2
4 3
5 1
6 1
7 rows selected
You can adjust the maximum number of items in your search string by adjusting the "LEVEL < 20" to "LEVEL < your_max_items".
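As a side note, on Oracle 11g and later the INSTR/SUBSTR arithmetic can be replaced with REGEXP_COUNT and REGEXP_SUBSTR. A minimal sketch against the same tt table (the '[^~]+' pattern assumes there are no empty elements between delimiters):
SELECT VALUE, COUNT(*) "Total number in table"
FROM (SELECT REGEXP_SUBSTR(tt.value, '[^~]+', 1, q.L) VALUE -- Lth delimited item
      FROM tt
      JOIN (SELECT LEVEL L FROM dual CONNECT BY LEVEL < 20) q
        ON REGEXP_COUNT(tt.value, '~') + 1 >= q.L)
GROUP BY VALUE
ORDER BY VALUE;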