bigquery split string to chars - google-bigquery

Suppose I have a table, in which one of the columns is a string:
id | value
________________
1 | HELLO
----------------
2 | BYE
How would I split each STRING into its chars, to create the following table:
id | value
________________
1 | H
----------------
1 | E
----------------
1 | L
----------------
1 | L
....
?

You can use the SPLIT function with an empty string as the delimiter, i.e.
SELECT id, SPLIT(value, '') value FROM Table
Please note that SPLIT returns a repeated field, and if you want flat results (it wasn't clear from your question), you would use
SELECT * FROM
FLATTEN((SELECT id, SPLIT(value, '') value FROM Table), value)
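If you're on standard SQL rather than legacy SQL, a rough equivalent (just a sketch, assuming the same Table/id/value names, and that SPLIT with an empty delimiter splits into individual characters there too) flattens the array with UNNEST instead of FLATTEN:
#standardSQL
-- sketch: UNNEST the character array produced by SPLIT(value, '')
SELECT id, c AS value
FROM `Table`, UNNEST(SPLIT(value, '')) AS c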

Apparently, if you pass an empty delimiter, it works:
select id, split(str, '')
from (
select 1 as id, "HELLO" as str
)

Related

How to extract a JSON value in Hive

I have a JSON string that is stored in a single cell in the DB, corresponding to a parent ID:
{"profileState":"ACTIVE","isDefault":"true","joinedOn":"2019-03-24T15:19:52.639Z","profileType":"ADULT","id":"abc","signupDeviceId":"1"}||{"profileState":"ACTIVE","isDefault":"true","joinedOn":"2021-09-05T07:47:00.245Z","imageId":"19","profileType":"KIDS","name":"Kids","id":"efg","signupDeviceId":"1"}
Now I want to use the above JSON to extract the id from it. Let's say we have data like
Parent ID | Profile JSON
1 | {profile_json} (see above string)
I want the output to look like this
Parent ID | ID
1 | abc
1 | efg
Now, I've tried a couple of iterations to solve this
First Approach:
select
get_json_object(p.profile, '$$.id') as id,
test.parent_id
from (
select split(
regexp_replace(
regexp_extract(profiles, '^\\[(.+)\\]$$',1),
'\\}\\,\\{', '\\}\\|\\|\\{'),
'\\|\\|') as profile_list,
parent_id
from source_table) test
lateral view explode(test.profile_list) p as profile
But this is returning the id column as having NULL values. Is there something I'm missing here?
Second Approach:
with profiles as(
select regexp_replace(
regexp_extract(profiles, '^\\[(.+)\\]$$',1),
'\\}\\,\\{', '\\}\\|\\|\\{') as profile_list,
parent_id
from source_table
)
SELECT
get_json_object (t1.profile_list,'$.id')
FROM profiles t1
The second approach is only returning the first id (abc) as per the above JSON string.
I tried to replicate this in Apache Hive v4.
Data
+----------------------------------------------------+------------------+
| data | parent_id |
+----------------------------------------------------+------------------+
| {"profileState":"ACTIVE","isDefault":"true","joinedOn":"2019-03-24T15:19:52.639Z","profileType":"ADULT","id":"abc","signupDeviceId":"1"}||{"profileState":"ACTIVE","isDefault":"true","joinedOn":"2021-09-05T07:47:00.245Z","imageId":"19","profileType":"KIDS","name":"Kids","id":"efg","signupDeviceId":"1"} | 1.0 |
+----------------------------------------------------+------------------+
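For reference, a minimal sketch of how this demo table could be set up (the tabl1 name matches the query below; column types are assumed from the values shown):
-- assumed setup for the replication above
CREATE TABLE tabl1 (data string, parent_id double);
INSERT INTO tabl1 VALUES
('{"profileState":"ACTIVE","isDefault":"true","joinedOn":"2019-03-24T15:19:52.639Z","profileType":"ADULT","id":"abc","signupDeviceId":"1"}||{"profileState":"ACTIVE","isDefault":"true","joinedOn":"2021-09-05T07:47:00.245Z","imageId":"19","profileType":"KIDS","name":"Kids","id":"efg","signupDeviceId":"1"}', 1.0);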
Sql
select pid,get_json_object(expl_jid,'$.id') json_id from
(select parent_id pid,split(data,'\\|\\|') jid from tabl1)a
lateral view explode(jid) exp_tab as expl_jid;
+------+----------+
| pid | json_id |
+------+----------+
| 1.0 | abc |
| 1.0 | efg |
+------+----------+
Solved this. I was using an extra $ in the first approach:
select
get_json_object(p.profile, '$.id') as id,
test.parent_id
from (
select split(
regexp_replace(
regexp_extract(profiles, '^\\[(.+)\\]$$',1),
'\\}\\,\\{', '\\}\\|\\|\\{'),
'\\|\\|') as profile_list,
parent_id
from source_table) test
lateral view explode(test.profile_list) p as profile

Sort each character in a string from a specific column in Snowflake SQL

I am trying to alphabetically sort the characters of each value in a column with Snowflake. For example I have:
| NAME |
| ---- |
| abc |
| bca |
| acb |
and want
| NAME |
| ---- |
| abc |
| abc |
| abc |
how would I go about doing that? I've tried using SPLIT and then ordering the rows, but that doesn't seem to work without a specific delimiter.
Using REGEXP_REPLACE to introduce a separator after each character, STRTOK_SPLIT_TO_TABLE to get individual letters as rows, and LISTAGG to combine them again as a sorted string:
SELECT tab.col, LISTAGG(s.value) WITHIN GROUP (ORDER BY s.value) AS result
FROM tab
, TABLE(STRTOK_SPLIT_TO_TABLE(REGEXP_REPLACE(tab.col, '(.)', '\\1~'), '~')) AS s
GROUP BY tab.col;
For sample data:
CREATE OR REPLACE TABLE tab
AS
SELECT 'abc' AS col UNION
SELECT 'bca' UNION
SELECT 'acb';
Output:
| COL | RESULT |
| --- | ------ |
| abc | abc |
| acb | abc |
| bca | abc |
A similar implementation to Lukasz's, but using regexp_extract_all to extract individual characters as an array that we later split into rows using flatten. The listagg then stitches it back together in the order we specify in the within group clause.
with cte (col) as
(select 'abc' union
select 'bca' union
select 'acb')
select col, listagg(b.value) within group (order by b.value) as col2
from cte, lateral flatten(regexp_extract_all(col,'.')) b
group by col;

How to use regexp_count with regexp_substr to output multiple matches per string in SQL (Redshift)?

I have a table containing a column with strings. I want to extract all pieces of text in each string that come immediately after a certain substring. For this minimum reproducible example, let's assume this substring is abc. So I want all subsequent terms after abc.
I'm able to achieve this in cases where there is only 1 abc per row, but my logic fails when there are multiple abcs. I'm also getting the number of substring occurrences, but am having trouble relating that to retrieving all of those occurrences.
My approach/attempt:
I created a temp table that contains the # of successful regex matches in my main string:
CREATE TEMP TABLE match_count AS (
SELECT DISTINCT id, main_txt, regexp_count(main_txt, 'abc (\\S+)', 1) AS cnt
FROM my_data_source
WHERE regexp_count(main_txt, 'abc (\\S+)', 1) > 0);
My output:
id main_txt cnt
1 wpfwe abc weiofnew abc wieone 2
2 abc weoin 1
3 abc weoifn abc we abc w 3
To get my final output, I have a query like:
SELECT id, main_txt, regexp_substr(main_txt, 'abc (\\S+)', 1, cnt, 'e') AS output
FROM match_count;
My actual final output:
id main_txt output
1 wpfwe abc weiofnew abc wieone wieone
2 abc weoin weoin
3 abc weoifn abc we abc w w
My expected final output:
id main_txt output
1 wpfwe abc weiofnew abc wieone weiofnew
1 wpfwe abc weiofnew abc wieone wieone
2 abc weoin weoin
3 abc weoifn abc we abc w weoifn
3 abc weoifn abc we abc w we
3 abc weoifn abc we abc w w
So my code only gets the final match (where the occurrence # = cnt). How can I modify it to include every match?
One way to solve this problem is to use a recursive CTE to make a list of match numbers for each string (so if there are 2 matches, it generates rows with 1 and 2 in them); these are then joined back to the main table as the occurrence parameter to regexp_substr:
WITH RECURSIVE match_counts(id, match_count) AS (
SELECT DISTINCT id, regexp_count(main_txt, 'abc (\\S+)', 1)
FROM my_data_source
WHERE regexp_count(main_txt, 'abc (\\S+)', 1) > 0
),
match_nums(id, match_num, match_count) AS (
SELECT id, 1, match_count
FROM match_counts
UNION ALL
SELECT id, match_num + 1, match_count
FROM match_nums
WHERE match_num < match_count
)
SELECT m.id, main_txt, regexp_substr(main_txt, 'abc (\\S+)', 1, match_num, 'e') AS output
FROM my_data_source m
JOIN match_nums n ON m.id = n.id
ORDER BY m.id, n.match_num
Unfortunately I don't have access to Redshift to test this; however, I have tested it on Oracle (which has similar regexp functions) and it works there:
Oracle demo on dbfiddle. Note that Oracle doesn't support the e parameter to regexp_substr, so it returns the entire match instead of the group. (Edit - it has been confirmed to work on Redshift too, thanks @HaleemurAli.)
Note that if the delimiter abc might legitimately occur at the end of a word, you should add a word boundary to the beginning of the regex (i.e. \\babc (\\S+)) to prevent it matching (for example) deabc.
The solutions below do not consistently handle the case where main_text has consecutive occurrences of abc.
ex.
wpfwe abc abc abc weiofnew abc wieone
set up
CREATE TABLE test_hal_unnest (id int, main_text varchar (500));
INSERT INTO test_hal_unnest VALUES
(1, 'wpfwe abc weiofnew abc wieone'),
(2, 'abc weoin'),
(3, 'abc weoifn abc we abc w');
Possible solution by splitting the string into words
Assuming you are searching for all words that come after the word abc in a string, you don't necessarily have to use regex. Regex support in Redshift is unfortunately not as full-featured as Postgres or some other databases; for instance, you can't extract all substrings that match a regex pattern into an array, or split a string into an array based on a regex pattern.
steps:
1. split text to array with delimiter ' '
2. unnest array with ordinality
3. look up the previous array element using LAG, ordered by the word index
4. filter rows where the previous word is abc
The extra columns idx & prev_word are left in the final output to illustrate how the problem is solved. They may be dropped from the final query without issue.
WITH text_split AS (
SELECT Id
, main_text
, SPLIT_TO_ARRAY(main_text, ' ') text_arr
FROM test_hal_unnest
)
, text_unnested AS (
SELECT ts.id
, ts.main_text
, ts.text_arr
, CAST(ta as VARCHAR) text_word -- converts super >> text
, idx -- this is the word index
FROM text_split ts
JOIN ts.text_arr ta AT idx
ON TRUE
-- ^^ array unnesting happens via joins
)
, with_prevword AS (
SELECT id
, main_text
, idx
, text_word
, LAG(text_word) over (PARTITION BY id ORDER BY idx) prev_word
FROM text_unnested
ORDER BY id, idx
)
SELECT *
FROM with_prevword
WHERE prev_word = 'abc';
output:
id | main_text | idx | text_word | prev_word
----+-------------------------------+-----+-----------+-----------
1 | wpfwe abc weiofnew abc wieone | 2 | weiofnew | abc
1 | wpfwe abc weiofnew abc wieone | 4 | wieone | abc
2 | abc weoin | 1 | weoin | abc
3 | abc weoifn abc we abc w | 1 | weoifn | abc
3 | abc weoifn abc we abc w | 3 | we | abc
3 | abc weoifn abc we abc w | 5 | w | abc
(6 rows)
note on unnest array with ordinality
Quoting the Redshift documentation on this topic, since it's kind of hidden:
Amazon Redshift also supports an array index when iterating over the array using the AT keyword. The clause x AS y AT z iterates over array x and generates the field z, which is the array index.
alternative shorter solution by splitting on abc
This problem would be more easily solved with the regular expression functionality available in Redshift if, instead of
1, wpfwe abc weiofnew abc wieone
the source data was already split up into multiple rows on abc
1, wpfwe
1, abc weiofnew
1, abc wieone
This solution first expands the source data by splitting on abc. However, since split_to_array does not accept a regular expression pattern, we first inject a delimiter ; before abc, and then split on ;.
Any delimiter will work, as long as it is guaranteed not to be present in the main_text column.
WITH text_array AS (
SELECT
id
, main_text
, SPLIT_TO_ARRAY(REGEXP_REPLACE(main_text, 'abc ', ';abc '), ';') array
FROM test_hal_unnest
)
SELECT
ta.id
, ta.main_text
, REGEXP_SUBSTR(CAST(st AS VARCHAR), 'abc (\\S+)', 1, 1, 'e') output
FROM text_array ta
JOIN ta.array st ON TRUE
WHERE st LIKE 'abc%';

Count string occurrences within a list column - Snowflake/SQL

I have a table with a column that contains a list of strings like below:
EXAMPLE:
STRING User_ID [...]
"[""null"",""personal"",""Other""]" 2122213 ....
"[""Other"",""to_dos_and_thing""]" 2132214 ....
"[""getting_things_done"",""TO_dos_and_thing"",""Work!!!!!""]" 4342323 ....
QUESTION:
I want to be able to get a count of the number of times each unique string appears (strings are separable within the strings column by commas) but only know how to do the following:
SELECT u.STRING, count(u.USERID) as cnt
FROM table u
group by u.STRING
order by cnt desc;
However, the above method doesn't work, as it only counts the number of user IDs that use a specific grouping of strings.
The ideal output using the example above would look like this:
DESIRED OUTPUT:
STRING COUNT_Instances
"null" 1223
"personal" 543
"Other" 324
"to_dos_and_thing" 221
"getting_things_done" 146
"Work!!!!!" 22
Based on your description, here is my sample table:
create table u (user_id number, string varchar);
insert into u values
(2122213, '"[""null"",""personal"",""Other""]"'),
(2132214, '"[""Other"",""to_dos_and_thing""]"'),
(2132215, '"[""getting_things_done"",""TO_dos_and_thing"",""Work!!!!!""]"' );
I used SPLIT_TO_TABLE to split each string into rows, and then REGEXP_SUBSTR to clean the data. So here's the query and output:
select REGEXP_SUBSTR( s.VALUE, '""(.*)""', 1, 1, 'i', 1 ) extracted, count(*) from u,
lateral SPLIT_TO_TABLE( string , ',' ) s
GROUP BY extracted
order by count(*) DESC;
+---------------------+----------+
| EXTRACTED | COUNT(*) |
+---------------------+----------+
| Other | 2 |
| null | 1 |
| personal | 1 |
| to_dos_and_thing | 1 |
| getting_things_done | 1 |
| TO_dos_and_thing | 1 |
| Work!!!!! | 1 |
+---------------------+----------+
SPLIT_TO_TABLE https://docs.snowflake.com/en/sql-reference/functions/split_to_table.html
REGEXP_SUBSTR https://docs.snowflake.com/en/sql-reference/functions/regexp_substr.html

Sort the digits of a numerical string

I need to SORT all the digits from some string values in Postgres.
For instance, if I have strings like:
"70005" ==> "00057"
"70001" ==> "00017"
"32451" ==> "12345"
I can't cast the strings to integer or bigint due to my logic limitations. Is it possible to do this?
Use a recursive CTE. Take the first char; if it is '0', ignore it, otherwise prepend it to the beginning of the target string.
Then use LPAD to pad with '0' on the left until you get length 10.
SQL DEMO
WITH RECURSIVE cte (id, source, target) as (
SELECT 1 as id, '70001' as source , '' as target
UNION
SELECT 2 as id, '70005' as source , '' as target
UNION ALL
SELECT id,
substring(source from 2 for length(source)-1) as source,
CASE WHEN substring(source from 1 for 1) = '0' THEN target
ELSE substring(source from 1 for 1) || target
END
FROM cte
WHERE length(source) > 0
), reverse as (
SELECT id,
target,
row_number() over (partition by id
order by length(target) desc) rn
FROM cte
)
SELECT id, LPAD(target::text, 10, '0')
FROM reverse
WHERE rn = 1
OUTPUT
| id | lpad |
|----|------------|
| 1 | 0000000017 |
| 2 | 0000000057 |
Assuming that your data is organized like this:
Table: strings
| id | string |
|----+---------|
| 1 | '70005' |
| 2 | '70001' |
etc...
Then you can use a query like this:
SELECT all_digits.id,
array_to_string(array_agg(all_digits.digit ORDER BY all_digits.digit), '')
FROM (
SELECT strings.id, digits.digit
FROM strings, unnest(string_to_array(strings.string, NULL)) digits(digit)
) all_digits
GROUP BY all_digits.id
What this query does is split your table up into one row for each character in the string, and then aggregate the characters back into a string in sorted order.
There's a SQL fiddle here: http://sqlfiddle.com/#!15/7f7fb0/14
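A shorter variant of the same idea (a sketch only, assuming the same strings table as above; not part of the fiddle) lets string_agg do the ordering and concatenation in one step:
-- sketch: string_agg sorts and concatenates the characters directly
SELECT s.id,
       string_agg(d.digit, '' ORDER BY d.digit) AS sorted
FROM strings s,
     unnest(string_to_array(s.string, NULL)) AS d(digit)
GROUP BY s.id;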