Redshift split single dynamic column into multiple rows in new table - sql

With a table like:
uid | segmentids
-------------------------+----------------------------------------
f9b6d54b-c646-4bbb-b0ec | 4454918|4455158|4455638|4455878|4455998
asd7a0s9-c646-asd7-b0ec | 1265899|1265923|1265935|1266826|1266596
gd3355ff-cjr8-assa-fke0 | 2237557|2237581|2237593
laksnfo3-kgi5-fke0-b0ec | 4454918|4455158|4455638|4455878
How to create a new table with:
uid | segmentids
-------------------------+---------------------------
f9b6d54b-c646-4bbb-b0ec | 4454918
f9b6d54b-c646-4bbb-b0ec | 4455158
f9b6d54b-c646-4bbb-b0ec | 4455638
f9b6d54b-c646-4bbb-b0ec | 4455878
f9b6d54b-c646-4bbb-b0ec | 4455998
asd7a0s9-c646-asd7-b0ec | 1265899
asd7a0s9-c646-asd7-b0ec | 1265923
asd7a0s9-c646-asd7-b0ec | 1265935
asd7a0s9-c646-asd7-b0ec | 1266826
asd7a0s9-c646-asd7-b0ec | 1266596
The number of segments is dynamic and can vary with each record.
I tried the split function with a delimiter, but it requires the part index within the string, which is dynamic here.
Any suggestions?

Here is the Redshift answer; it will work with up to 10,000 segment id values per row.
Test data:
create table test_split (uid varchar(50),segmentids varchar(max));
insert into test_split
values
('f9b6d54b-c646-4bbb-b0ec','4454918|4455158|4455638|4455878|4455998'),
('asd7a0s9-c646-asd7-b0ec','1265899|1265923|1265935|1266826|1266596'),
('asd7345s9-c646-asd7-b0ec','1235935|1263456|1265675696'),
('as345a0s9-c646-asd7-b0ec','12765899|12658883|12777935|144466826|1266226|12345')
;
Code:
with ten_numbers as (
    select 1 as num union select 2 union select 3 union select 4 union select 5
    union select 6 union select 7 union select 8 union select 9 union select 0
)
, generated_numbers AS
(
    SELECT (1000 * t1.num) + (100 * t2.num) + (10 * t3.num) + t4.num AS gen_num
    FROM ten_numbers AS t1
    JOIN ten_numbers AS t2 ON 1 = 1
    JOIN ten_numbers AS t3 ON 1 = 1
    JOIN ten_numbers AS t4 ON 1 = 1
)
, splitter AS
(
    SELECT *
    FROM generated_numbers
    WHERE gen_num BETWEEN 1 AND (SELECT max(REGEXP_COUNT(segmentids, '\\|') + 1)
                                 FROM test_split)
)
--select * from splitter;
, expanded_input AS
(
    SELECT
        uid,
        split_part(segmentids, '|', s.gen_num) AS segment
    FROM test_split AS ts
    JOIN splitter AS s ON 1 = 1
    WHERE split_part(segmentids, '|', s.gen_num) <> ''
)
SELECT * FROM expanded_input;
The first two CTEs (ten_numbers and generated_numbers) generate 10,000 numbered rows; this is needed because generate_series is not supported in Redshift.
The next step (splitter) keeps only the rows numbered up to the maximum number of delimiters + 1 (which is the maximum number of segments in any row).
Finally, we cross join splitter with the input data, pick out each value using split_part, and then exclude blank parts (which occur wherever a row has fewer than the maximum number of segments).
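As a quick check of the split_part behavior this relies on (example values of my own, not from the question): Redshift's SPLIT_PART returns an empty string when the part index exceeds the number of parts, which is exactly what the final WHERE clause filters out.
select split_part('4454918|4455158', '|', 1);  -- returns '4454918'
select split_part('4454918|4455158', '|', 3);  -- returns '' and is excluded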

You can iterate over the SUPER array returned by split_to_array -- see the "Unnesting and flattening" section of this post. Using the same test_split table as the previous answer:
WITH seg_array AS (
    SELECT uid,
           split_to_array(segmentids, '|') AS segs
    FROM test_split
)
SELECT uid,
       segmentid::int
FROM seg_array a, a.segs AS segmentid;
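To land the result in a new table, as the question asks, either query can be wrapped in CREATE TABLE ... AS; a sketch (the table name test_split_exploded is my own choice):
create table test_split_exploded as
with seg_array as (
    select uid,
           split_to_array(segmentids, '|') as segs
    from test_split
)
select uid,
       segmentid::int as segmentid
from seg_array a, a.segs as segmentid;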

Redshift now has the SUPER data type and the split_to_array function, which is similar to PostgreSQL's string_to_array.
Redshift now also supports unnesting arrays through a syntax similar to a LATERAL JOIN in PostgreSQL.
Using these techniques, we may write the same transformation in 2022 as:
WITH split_up AS (
    SELECT
        uid
        , split_to_array(segmentids, '|') AS segment_array
    FROM test_split
)
SELECT
    su.uid
    , CAST(sid AS VARCHAR) AS segmentid
FROM split_up su
JOIN su.segment_array sid ON TRUE

Related

Pivot with column name in Postgres

I have the following table tbl:
column1 | column2 | column3
-----------------------------------
1 | 'value1' | 3
2 | 'value2' | 4
How to do "pivot" with column names to produce output like:
column1 | 1 | 2
column2 | 'value1' |'value2'
column3 | 3 | 4
As has been commented, the issue of data types is undefined in the question.
If you are OK with all result columns being type text (every data type can be converted to text), you can use one of these:
Plain SQL
WITH cte AS (
   SELECT nu.*
   FROM tbl t
      , LATERAL (
        VALUES
          (1, t.column1::text)
        , (2, t.column2)
        , (3, t.column3::text)
        ) nu(rn, c)
   )
SELECT *
FROM (TABLE cte OFFSET 0 LIMIT 3) c1
JOIN (TABLE cte OFFSET 3 LIMIT 3) c2 USING (rn);
The same with useful column names:
WITH cte AS (
   SELECT nu.*
   FROM tbl t
      , LATERAL (
        VALUES
          ('column1', t.column1::text)
        , ('column2', t.column2)
        , ('column3', t.column3::text)
        ) nu(rn, c)
   )
SELECT * FROM (
   SELECT *
   FROM (TABLE cte OFFSET 0 LIMIT 3) c1
   JOIN (TABLE cte OFFSET 3 LIMIT 3) c2 USING (rn)
   ) t (key, row1, row2);
Works in any modern version of Postgres.
The SQL string has to be adapted to the number of rows and columns. See fiddles below!
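For instance, scaled to 3 source rows of the same 3-column table (an untested sketch following the same pattern; each extra source row adds one more chunk and one more join):
WITH cte AS (
   SELECT nu.*
   FROM tbl t
      , LATERAL (
        VALUES
          ('column1', t.column1::text)
        , ('column2', t.column2)
        , ('column3', t.column3::text)
        ) nu(rn, c)
   )
SELECT * FROM (
   SELECT *
   FROM (TABLE cte OFFSET 0 LIMIT 3) c1
   JOIN (TABLE cte OFFSET 3 LIMIT 3) c2 USING (rn)
   JOIN (TABLE cte OFFSET 6 LIMIT 3) c3 USING (rn)
   ) t (key, row1, row2, row3);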
Using a document type as stepping stone
Makes for shorter code.
With many rows and many columns, performance of the SQL solution may scale better because the intermediate derived table is smaller.
(The approach is limited anyway, as you can't have more than ~1600 table columns in Postgres.)
Since everything is converted to text anyway, hstore seems most efficient. See:
Key value pair in PostgreSQL
SELECT key
     , arr[1] AS row1
     , arr[2] AS row2
FROM (
    SELECT x.key, array_agg(x.value) AS arr
    FROM tbl t, each(hstore(t)) x
    GROUP BY 1
) sub
ORDER BY 1;
Technically speaking, we would have to enforce the right sort order in array_agg(), but that should work without an explicit ORDER BY. To be absolutely sure, you can add one: array_agg(x.value ORDER BY t.ctid), using ctid for lack of better information.
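Spelled out, that fully deterministic variant looks like this (using ctid as a stand-in for physical row order is itself an assumption, not a guarantee):
SELECT key
     , arr[1] AS row1
     , arr[2] AS row2
FROM (
    SELECT x.key, array_agg(x.value ORDER BY t.ctid) AS arr
    FROM tbl t, each(hstore(t)) x
    GROUP BY 1
) sub
ORDER BY 1;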
You can do the same with JSON functions (Postgres 9.3+). Just replace each(hstore(t)) with json_each_text(row_to_json(t)). The rest is identical.
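A sketch of that JSON variant:
SELECT key
     , arr[1] AS row1
     , arr[2] AS row2
FROM (
    SELECT x.key, array_agg(x.value) AS arr
    FROM tbl t, json_each_text(row_to_json(t)) x
    GROUP BY 1
) sub
ORDER BY 1;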
These fiddles demonstrate how to scale each query:
Original example with 2 rows of 3 columns:
db<>fiddle here
Scaled up to 3 rows of 4 columns:
db<>fiddle here

Subquery using multiple IN operator

I am trying to fetch all the ids with values from list 1, and then, for those ids, fetch all the values from list 2 along with a count of those list 2 values.
DECLARE @Table1 AS TABLE (
id int,
l1 varchar(20)
);
INSERT INTO @Table1 VALUES
(1,'sun'),
(2,'shine'),
(3,'moon'),
(4,'light'),
(5,'earth'),
(6,'revolves'),
(7,'flow'),
(8,'fire'),
(9,'fighter'),
(10,'sun'),
(10,'shine'),
(11,'shine'),
(12,'moon'),
(1,'revolves'),
(10,'revolves'),
(2,'air'),
(3,'shine'),
(4,'fire'),
(5,'love'),
(6,'sun'),
(7,'rises');
/*
OPERATION 1
fetch all distinct ids that have values from List 1
List1
sun
moon
earth
Initial OUTPUT1:
distinct_id list1_value
1 sun
3 moon
5 earth
10 sun
12 moon
6 sun
OPERATION2
fetch all the id, count_of_list2_values, list2_values
based on the ids that we received from OPERATION 1
List2
shine
revolves
Expected Output:
id list1-value count_of_list2_values, list2_values
1 sun 1 revolves
3 moon 1 shine
5 earth 0 NULL
10 sun 2 shine,revolves
12 moon 0 NULL
6 sun 1 revolves
*/
My query:
Here is what I tried
select id, count(l1), l1
from @Table1
where id in ('shine','revolves') and id in ('sun','moon','earth')
How can I achieve this? I know this should be a subquery with multiple INs, but how can it be done?
SQL fiddle Link:
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=7a85dbf51ca5b5d35e87d968c46300bb
There are several ways this could be done. Here's how I'd do it:
First set up the data:
DECLARE @Table1 AS TABLE (
id int,
l1 varchar(20)
) ;
INSERT INTO @Table1 VALUES
(1,'sun'),
(2,'shine'),
(3,'moon'),
(4,'light'),
(5,'earth'),
(6,'revolves'),
(7,'flow'),
(8,'fire'),
(9,'fighter'),
(10,'sun'),
(10,'shine'),
(11,'shine'),
(12,'moon'),
(1,'revolves'),
(10,'revolves'),
(2,'air'),
(3,'shine'),
(4,'fire'),
(5,'love'),
(6,'sun'),
(7,'rises') ;
Since this is a known list, set the "target" data up as its own set. (In SQL, tables are almost invariably better to work with than demented lists. Oops, typo! I meant delimited lists.)
DECLARE @Targets AS TABLE (
l2 varchar(20)
) ;
INSERT INTO @Targets VALUES
('sun'),
('moon'),
('earth') ;
OPERATION 1
fetch all distinct ids that have values from List 1
(sun, moon, earth)
Easy enough with a join:
SELECT t1.id
from @Table1 t1
inner join @Targets tg
    on tg.l2 = t1.l1
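With this sample data each id matches at most one target value; if an id could match several, add DISTINCT so Operation 1 really returns distinct ids (a minor variation):
SELECT DISTINCT t1.id
from @Table1 t1
inner join @Targets tg
    on tg.l2 = t1.l1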
OPERATION 2
fetch all the id, count_of_list2_values, list2_values
based on the ids that we received from OPERATION 1
If I'm following the desired logic correctly, then (read the "join" comments first):
SELECT
    tt.Id
    -- This next counts how many items in the Operation 1 list are not in the target list
    -- (Spaced out, to make it easier to compare with the next line)
    , sum(case when tg2.l2 is null then 1 else 0 end)
    -- And this concatenates them together in a string (in later editions of SQL Server)
    , string_agg(case when tg2.l2 is null then tt.l1 else null end, ', ')
from @Table1 tt
inner join ( -- Operation 1 as a subquery, producing the list of ids to work with
    select t1.id
    from @Table1 t1
    inner join @Targets tg
        on tg.l2 = t1.l1
) xx
    on xx.id = tt.id
-- This is used to identify the target values vs. the non-target values
left outer join @Targets tg2
    on tg2.l2 = tt.l1
-- Aggregate, because that's what we need to do
group by tt.Id
-- Order it, because why not?
order by tt.Id
If you're using SQL Server 2017, then you can use the string_agg function and the outer apply operator:
select
l1.id,
l1.l1,
l2.cnt as count_of_list2_values,
l2.l1 as list2_values
from @Table1 as l1
outer apply (
select
count(*) as cnt,
string_agg(tt.l1, ',') as l1
from @Table1 as tt
where
tt.l1 in ('shine','revolves') and
tt.id = l1.id
) as l2
where
l1.l1 in ('sun','moon','earth')
db fiddle demo
In previous versions, I'm not sure it's possible to aggregate and count in one pass without creating a special function for it. You can, of course, do it with XQuery like this, but it might be a bit of overkill (I'd not do this in production code, at least):
select
l1.id,
l1.l1,
l2.data.value('count(l1)', 'int'),
stuff(l2.data.query('for $i in l1 return concat(",",$i/text()[1])').value('.','nvarchar(max)'),1,1,'')
from @Table1 as l1
outer apply (
select
tt.l1
from @Table1 as tt
where
tt.l1 in ('shine','revolves') and
tt.id = l1.id
for xml path(''), type
) as l2(data)
where
l1.l1 in ('sun','moon','earth')
db fiddle demo
If you don't mind doing it with a double scan/seek of the table, then you can either use forpas's answer below or do something like this:
with cte_list2 as (
select tt.l1, tt.id
from @Table1 as tt
where
tt.l1 in ('shine','revolves')
)
select
l1.id,
l1.l1,
l22.cnt as count_of_list2_values,
stuff(l21.data.value('.', 'nvarchar(max)'),1,1,'') as list2_values
from @Table1 as l1
outer apply (
select
',' + tt.l1
from cte_list2 as tt
where
tt.id = l1.id
for xml path(''), type
) as l21(data)
outer apply (
select count(*) as cnt
from cte_list2 as tt
where
tt.id = l1.id
) as l22(cnt)
where
l1.l1 in ('sun','moon','earth')
With this:
with
cte as(
select t1.id, t2.l1
from @Table1 t1 left join (
select * from @Table1 where l1 in ('shine','revolves')
) t2 on t2.id = t1.id
where t1.l1 in ('sun','moon','earth')
),
cte1 as(
select
c.id,
stuff(( select ',' + cte.l1 from cte where id = c.id for xml path(''), type).value('.', 'NVARCHAR(MAX)'), 1, 1, '') col
from cte c
)
select
id,
count(col) count_of_list2_values,
max(col) list2_values
from cte1
group by id
The 1st CTE gives these results:
id | l1
-: | :-------
1 | revolves
3 | shine
5 | null
10 | shine
10 | revolves
12 | null
6 | revolves
and the 2nd operates on these results to concatenate the common grouped values of l1.
Finally, I use group by id and aggregation on the results of the 2nd CTE.
See the demo
Results:
id | count_of_list2_values | list2_values
-: | --------------------: | :-------------
1 | 1 | revolves
3 | 1 | shine
5 | 0 | null
6 | 1 | revolves
10 | 2 | shine,revolves
12 | 0 | null

SQL grouping by distinct values in a multi-value string column

I want to perform a group-by based on the distinct values in a string column that has multiple values.
The said column has a list of strings in a standard format separated by commas. The potential values are only a,b,c,d.
For example the column collection (type: String) contains:
Row 1: ["a","b"]
Row 2: ["b","c"]
Row 3: ["b","c","a"]
Row 4: ["d"]
The expected output is a count of unique values:
collection | count
a | 2
b | 3
c | 2
d | 1
For all of the below I used this table:
create table tmp (
id INT auto_increment,
test VARCHAR(255),
PRIMARY KEY (id)
);
insert into tmp (test) values
("a,b"),
("b,c"),
("b,c,a"),
("d")
;
If the possible values are only a, b, c, d, you can try one of these.
Take note that the first one only works if you do not have similar values like test and test_new, because then the test rows would also be joined with all the test_new rows and the count would not match (the FIND_IN_SET sketch below avoids this).
select tb1.collection, COUNT(*) as count
from tmp
JOIN (
    select CONCAT("%", tb.collection, "%") as like_collection, collection
    from (
        select "a" COLLATE utf8_general_ci as collection
        union select "b" COLLATE utf8_general_ci as collection
        union select "c" COLLATE utf8_general_ci as collection
        union select "d" COLLATE utf8_general_ci as collection
    ) tb
) tb1
    ON tmp.test LIKE tb1.like_collection
GROUP BY tb1.collection;
This will give you the result you want:
collection | count
a | 2
b | 3
c | 2
d | 1
Or you can try this one:
SELECT
(SELECT COUNT(*) FROM tmp WHERE test LIKE '%a%') as a_count,
(SELECT COUNT(*) FROM tmp WHERE test LIKE '%b%') as b_count,
(SELECT COUNT(*) FROM tmp WHERE test LIKE '%c%') as c_count,
(SELECT COUNT(*) FROM tmp WHERE test LIKE '%d%') as d_count
;
The result would be like this:
a_count | b_count | c_count | d_count
2 | 3 | 2 | 1
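Since the values in tmp.test are a plain comma-separated list, MySQL's FIND_IN_SET matches whole list items only, which avoids the substring caveat noted above. A sketch of that alternative:
SELECT tb1.collection, COUNT(*) AS count
FROM tmp
JOIN (
    SELECT 'a' AS collection
    UNION SELECT 'b'
    UNION SELECT 'c'
    UNION SELECT 'd'
) tb1 ON FIND_IN_SET(tb1.collection, tmp.test) > 0
GROUP BY tb1.collection;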
What you need to do is to first explode the collection column into separate rows (like a flatMap operation). In Redshift the only way to generate new rows is to JOIN, so let's CROSS JOIN your input table with a static table of consecutive numbers and take only the numbers less than or equal to the number of elements in each collection. Then we'll use the split_part function to read the item at the correct index. Once we have the exploded table, we'll do a simple GROUP BY.
If your items are stored as JSON array strings ('["a", "b", "c"]') then you can use JSON_ARRAY_LENGTH and JSON_EXTRACT_ARRAY_ELEMENT_TEXT instead of REGEXP_COUNT and SPLIT_PART respectively.
with
index as (
select 1 as i
union all select 2
union all select 3
union all select 4 -- could be substituted with 'select row_number() over () as i from arbitrary_table limit 4'
),
agg as (
select 'a,b' as collection
union all select 'b,c'
union all select 'b,c,a'
union all select 'd'
)
select
split_part(collection, ',', i) as item,
count(*)
from index,agg
where regexp_count(agg.collection, ',') + 1 >= index.i -- only get rows where number of items matches
group by 1
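A sketch of that JSON-array variant (JSON_EXTRACT_ARRAY_ELEMENT_TEXT is 0-indexed, hence the i - 1):
with
index as (
    select 1 as i
    union all select 2
    union all select 3
    union all select 4
),
agg as (
    select '["a","b"]' as collection
    union all select '["b","c"]'
    union all select '["b","c","a"]'
    union all select '["d"]'
)
select
    json_extract_array_element_text(collection, i - 1) as item,
    count(*)
from index, agg
where json_array_length(agg.collection) >= index.i
group by 1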

sequence increment with 2018AA00001 in postgresql

I have created generate_letters and generate_num tables using the queries below:
create table generate_letters as select chr(i) as letter from generate_series(65,90) i;
create table generate_num as select lpad(i::text,6,'0') as num from generate_series(1,100000) i;
After doing a cross join of the two tables using the query below:
select concat_ws('','2018',gl1.letter,gl2.letter,d.num) as seq
from generate_letters gl1
cross join generate_letters gl2
cross join generate_num d limit 10;
I am getting the output:
2018AA000001
2018AA000002
2018AA000003
2018AA000004
2018AA000005
But how do I use a sequence to increment the bill_id column as in the output above? Please suggest.
Create a sequence and then use the DEFAULT clause for the required expression.
SQL Fiddle
PostgreSQL 9.6 Schema Setup:
CREATE SEQUENCE yourseq INCREMENT 1 START 1 MINVALUE 1;
CREATE TABLE yourtable
(
bill_id TEXT DEFAULT '2018AA' || lpad(NEXTVAL('yourseq'::regclass)::text, 6, '0'),
bill_desc TEXT
);
INSERT INTO yourtable(bill_desc) VALUES ('Telephone Bill');
INSERT INTO yourtable(bill_desc) VALUES ('Water Bill');
Query 1:
select * FROM yourtable
Results:
| bill_id | bill_desc |
|--------------|----------------|
| 2018AA000001 | Telephone Bill |
| 2018AA000002 | Water Bill |
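If the '2018' prefix should follow the current year instead of being hard-coded (an assumption about the requirement, not something the question states), the default can compute it:
CREATE TABLE yourtable
(
    bill_id   TEXT DEFAULT to_char(now(), 'YYYY') || 'AA'
                    || lpad(NEXTVAL('yourseq'::regclass)::text, 6, '0'),
    bill_desc TEXT
);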

How do I trace former ids using a recursive query?

I have a table of provider information (providers) that contains the columns reporting_unit and predesessor. predesessor is either null or contains the reporting_unit that the row used to represent. I need to find what the current reporting_unit for any provider is. By that I mean: for any reporting_unit with a predesessor, that reporting_unit is the current_reporting_unit for the predesessor. I am trying to use a recursive CTE to accomplish this because some of the time there are multiple links.
The table looks like this:
CREATE TABLE providers (
reporting_unit TEXT,
predesessor TEXT
);
INSERT INTO providers
VALUES
(NULL, NULL),
('ARE88', NULL),
('99BX7', '99BX6'),
('99BX6', '99BX5'),
('99BX5', NULL)
;
The results I would like to get from that are:
reporting_unit | current_reporting_unit
---------------------------------------
'99BX5' | '99BX7'
'99BX6' | '99BX7'
My current query is :
WITH RECURSIVE current_ru AS (
SELECT reporting_unit, predesessor
FROM providers
WHERE predesessor IS NULL
UNION ALL
SELECT P.reporting_unit, P.predesessor
FROM providers P
JOIN current_ru CR
ON P.reporting_unit = CR.predesessor
)
SELECT *
FROM current_ru
;
But that isn't giving me the results I'm looking for. I have tried a number of variations on this query, but they all seem to end up in an infinite loop. How can I get the results shown above?
You should trace the relations in the reverse order. Add a depth column to find the deepest link:
with recursive current_ru (reporting_unit, predesessor, depth) as (
select reporting_unit, predesessor, 1
from providers
where predesessor is not null
union
select r.reporting_unit, p.predesessor, depth + 1
from providers p
join current_ru r
on p.reporting_unit = r.predesessor
)
select *
from current_ru;
reporting_unit | predesessor | depth
----------------+-------------+-------
99BX7 | 99BX6 | 1
99BX6 | 99BX5 | 1
99BX6 | | 2
99BX7 | 99BX5 | 2
99BX7 | | 3
(5 rows)
Now switch the two columns, change their names, eliminate null rows and select the deepest links:
with recursive current_ru (reporting_unit, predesessor, depth) as (
select reporting_unit, predesessor, 1
from providers
where predesessor is not null
union
select r.reporting_unit, p.predesessor, depth + 1
from providers p
join current_ru r
on p.reporting_unit = r.predesessor
)
select distinct on(predesessor)
predesessor reporting_unit,
reporting_unit current_reporting_unit
from current_ru
where predesessor is not null
order by predesessor, depth desc;
reporting_unit | current_reporting_unit
----------------+------------------------
99BX5 | 99BX7
99BX6 | 99BX7
(2 rows)
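If the data can ever contain an actual loop (two units listing each other as predecessors), the recursion itself will never terminate, which matches the infinite loops mentioned in the question. On Postgres 14+ you can guard against that with the CYCLE clause; a minimal sketch:
with recursive current_ru (reporting_unit, predesessor, depth) as (
    select reporting_unit, predesessor, 1
    from providers
    where predesessor is not null
    union all
    select r.reporting_unit, p.predesessor, depth + 1
    from providers p
    join current_ru r on p.reporting_unit = r.predesessor
) cycle reporting_unit set is_cycle using path
select *
from current_ru
where not is_cycle;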