How to explode substrings inside a string in a column in SQL

Let's say I have a table like the one below:
| Header 1 | Header 2 | Header 3 |
--------------------------------------------------------------------------------------
| id1 | detail1 | <a#test.com> , <b#test.com> , <c#test.com> , <d#test.com> |
How do I explode it in SQL, based on the email substrings inside the angle brackets, so that it looks like the one below?
| Header 1 | Header 2 | Header 3 |
-------------------------------------------
| id1 | detail1 | a#test.com |
| id1 | detail1 | b#test.com |
| id1 | detail1 | c#test.com |
| id1 | detail1 | d#test.com |

Using regexp_extract_all and explode should do.
select `Header 1`, `Header 2`, explode(regexp_extract_all(`Header 3`, '<(.+?)>')) as `Header 3` from table
This should get you:
+--------+--------+----------+
|Header 1|Header 2|Header 3 |
+--------+--------+----------+
|id1 |detail1 |a#test.com|
|id1 |detail1 |b#test.com|
|id1 |detail1 |c#test.com|
|id1 |detail1 |d#test.com|
+--------+--------+----------+
Be aware that regexp_extract_all was added in Spark 3.1.0.
For Spark below 3.1.0
This can be done with split, which is somewhat of a dirty hack, but the strategy and the results are the same.
select `Header 1`, `Header 2`, explode(array_remove(split(`Header 3`, '[<>,\\s]+'), '')) as `Header 3` from table
What this does is regex-match the delimiters and split the string into an array. It also needs an array_remove call to remove the unneeded empty strings.
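For readers who want to check the split logic outside Spark, here is a minimal Python sketch of the same idea (split on the delimiter characters, then drop the empty strings that the leading and trailing delimiters leave behind); the row values come from the question:

```python
import re

# Python sketch of the split-based approach above (illustrative only; not Spark).
row = {"Header 1": "id1", "Header 2": "detail1",
       "Header 3": "<a#test.com> , <b#test.com> , <c#test.com> , <d#test.com>"}

# split on any run of '<', '>', ',' or whitespace (the '[<>,\s]+' pattern),
# then drop the empty strings left by the leading/trailing delimiters
parts = [p for p in re.split(r"[<>,\s]+", row["Header 3"]) if p]

# explode: one output row per array element
exploded = [(row["Header 1"], row["Header 2"], p) for p in parts]
```

The filter step plays the role of array_remove: without it, the leading `<` and trailing `>` each produce an empty element.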
Explanation
With regexp_extract_all, we use the pattern <(.+?)> to extract all strings within angle brackets into an array like this:
['a#test.com', 'b#test.com', 'c#test.com', 'd#test.com']
For the pattern (.+?) here:
. matches any single character;
+ is a quantifier on ., matching 1 or more times;
? is a non-greedy modifier, making the match stop as soon as possible;
the parentheses make the content inside the angle brackets a capturing group, so we can extract it from the group later.
Now with explode, we can separate elements of the array into multiple rows, hence the result above.
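The extract-and-explode behaviour can be sketched in plain Python, using re.findall as a stand-in for regexp_extract_all (with one capturing group, findall returns the group contents):

```python
import re

# Python sketch of regexp_extract_all + explode (illustrative only; not Spark).
value = "<a#test.com> , <b#test.com> , <c#test.com> , <d#test.com>"

# With one capturing group, re.findall returns the group contents,
# mirroring what regexp_extract_all(value, '<(.+?)>') extracts.
emails = re.findall(r"<(.+?)>", value)

# explode: one output row per array element
rows = [("id1", "detail1", e) for e in emails]
```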

Related

hive regex find pattern and return it in select statement

I would like to extract the 3 words before "selay dervice", but the query returns an empty column :(
with a as (
select * from tablename1 b
where lower(ptranscript) rlike 'selay dervice'
)
select *,regexp_extract(lower(a.ptranscript),'([a-zA-Z0-9]+\s+){3}selay dervice',0) from a
Update 1
As pointed out by Raid earlier, in Hive we cannot use \s and have to use \\s. I updated the regex accordingly and it works:
with a as (
select * from tablename1 b
where lower(ptranscript) rlike 'selay dervice'
)
select *,regexp_extract(lower(a.ptranscript),'([a-zA-Z0-9]+\\s+){3}selay dervice',0) from a
Try below:
with a as (
select * from tablename1 b
where lower(ptranscript) rlike 'selay dervice'
)
select *,regexp_extract(lower(a.ptranscript),'(?:[a-zA-Z0-9]+ ){3}selay dervice',0) from a
Note that if there are fewer than 3 words before selay dervice you will get empty results.
I tested a similar query on the latest Apache Hive and got something like the below:
+----------------------------------+-----------------------------+
| key | regex_ext |
+----------------------------------+-----------------------------+
| rlk1 selay dervice | |
| selay dervice k4 | |
| k5 selay dervice ew | |
| thre word b4 selay dervice | thre word b4 selay dervice |
| four word be four selay dervice | word be four selay dervice |
+----------------------------------+-----------------------------+
Edit 1:
The result does not vary with or without ?.
All three versions below give the same result:
'(?:[a-zA-Z0-9]+ )'
'([a-zA-Z0-9]+ )'
'([a-zA-Z0-9]+\\s)'
Per the docs, \s matches any whitespace character, not just the space character.
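To see the behaviour described above outside Hive, here is a small Python sketch using re.search with the ?: variant of the pattern; the sample strings are modelled on the test table and are illustrative:

```python
import re

# Python sketch of the Hive pattern (?: variant); illustrative inputs.
pattern = r"(?:[a-zA-Z0-9]+\s+){3}selay dervice"

samples = [
    "rlk1 selay dervice",               # only 1 word before -> empty result
    "thre word b4 selay dervice",       # exactly 3 words -> full match
    "four word be four selay dervice",  # 4 words -> only the last 3 are kept
]

results = [m.group(0) if (m := re.search(pattern, s)) else "" for s in samples]
```

Note the fourth word is dropped in the last sample because {3} requires exactly three word-plus-whitespace repetitions immediately before the phrase.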

How to split string into rows by number of characters in Bigquery?

If I have a table, for example:
mydataset.itempf containing:
id | item
1 | ABCDEFGHIJKL
2 | ZXDFKDLFKFGF
And I would like the "item" field to be split by 4 characters into different rows like:
id | item
1 | ABCD
1 | EFGH
1 | IJKL
2 | ZXDF
2 | KDLF
2 | KFGF
How can I write this in bigquery? Please help.
Consider the below approach:
select id, item
from your_table,
unnest(regexp_extract_all(item, r'.{1,4}')) item
If applied to the sample data in your question, the output is the rows shown above.
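The chunking that regexp_extract_all(item, r'.{1,4}') performs can be sketched in Python; re.findall with the same pattern produces the 4-character chunks (plus a shorter tail when the length is not a multiple of 4):

```python
import re

# Python sketch of regexp_extract_all(item, r'.{1,4}') + unnest
# (illustrative only; not BigQuery). Data comes from the question.
table = [(1, "ABCDEFGHIJKL"), (2, "ZXDFKDLFKFGF")]

rows = [
    (item_id, chunk)
    for item_id, item in table
    # greedy {1,4}: full 4-char chunks, with a shorter tail if needed
    for chunk in re.findall(r".{1,4}", item)
]
```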
Use substring with a count; that should make it easier to see which chunks are longer than others.

How can I return the best matched row first in sort order from a set returned by querying a single search term against multiple columns in Postgres?

Background
I have a Postgres 11 table like so:
CREATE TABLE
some_schema.foo_table (
id INTEGER PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
bar_text TEXT,
foo_text TEXT,
foobar_text TEXT
);
It has some data like this:
INSERT INTO some_schema.foo_table (bar_text, foo_text, foobar_text)
VALUES ('eddie', '123456', 'something0987');
INSERT INTO some_schema.foo_table (bar_text, foo_text, foobar_text)
VALUES ('Snake', '12345-54321', 'that_##$%_snake');
INSERT INTO some_schema.foo_table (bar_text, foo_text, foobar_text)
VALUES ('Sally', '12345', '24-7avocado');
id | bar_text | foo_text | foobar_text
----+----------+-------------+-----------------
1 | eddie | 123456 | something0987
2 | Snake | 12345-54321 | that_##$%_snake
3 | Sally | 12345 | 24-7avocado
The problem
I need to query each one of these columns and compare the values to a given term (passed in as an argument from app logic), and make sure the best-matched row (considering comparison with all the columns, not just one) is returned first in the sort order.
There is no way to know in advance which of the columns is likely to be a better match for the given term.
If I compare the given term to each value using the similarity() function, I can see at a glance which row has the best match in any of the three columns and can see that's the one I would want ranked first in the sort order.
SELECT
f.id,
f.foo_text,
f.bar_text,
f.foobar_text,
similarity('12345', foo_text) AS foo_similarity,
similarity('12345', bar_text) AS bar_similarity,
similarity('12345', foobar_text) AS foobar_similarity
FROM some_schema.foo_table f
WHERE
(
f.foo_text ILIKE '%12345%'
OR
f.bar_text ILIKE '%12345%'
OR
f.foobar_text ILIKE '%12345%'
)
;
id | foo_text | bar_text | foobar_text | foo_similarity | bar_similarity | foobar_similarity
----+-------------+----------+-----------------+----------------+----------------+-------------------
2 | 12345-54321 | Snake | that_##$%_snake | 0.5 | 0 | 0
3 | 12345 | Sally | 24-7avocado | 1 | 0 | 0
1 | 123456 | eddie | something0987 | 0.625 | 0 | 0
(3 rows)
Clearly in this case, id #3 (Sally) is the best match (exact, as it happens); this is the row I'd like to return first.
However, since I don't know ahead of time that foo_text is going to be the column with the best match, I don't know how to define the ORDER BY clause.
I figured this would be a common enough problem, but I haven't found any hints in a fair bit of searching SO and DDG.
How can I always rank the best-matched row first in the returned set, without knowing which column will provide the best match to the search term?
Use greatest():
ORDER BY greatest(similarity('12345', foo_text), similarity('12345', bar_text), similarity('12345', foobar_text)) DESC
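As a rough illustration of the ordering idea (not of pg_trgm itself), here is a Python sketch that ranks rows by the greatest of the per-column scores; difflib's ratio is used as a stand-in for similarity(), so the scores differ from Postgres, but the best-match-first ordering is the same for this data:

```python
from difflib import SequenceMatcher

# Stand-in for Postgres similarity(): difflib's ratio is NOT trigram
# similarity, but it illustrates the "greatest of several scores" ordering.
def sim(a, b):
    return SequenceMatcher(None, a, b).ratio()

# rows from the question: (id, foo_text, bar_text, foobar_text)
rows = [
    (1, "123456", "eddie", "something0987"),
    (2, "12345-54321", "Snake", "that_##$%_snake"),
    (3, "12345", "Sally", "24-7avocado"),
]

term = "12345"
# ORDER BY greatest(sim(col1), sim(col2), sim(col3)) DESC
ranked = sorted(rows, key=lambda r: max(sim(term, c) for c in r[1:]), reverse=True)
```

The exact match (Sally, id 3) sorts first, without knowing in advance which column produced the best score.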

Recursive self join over file data

I know there are many questions about recursive self joins, but they're mostly in a hierarchical data structure as follows:
ID | Value | Parent id
-----------------------------
But I was wondering if there was a way to do this in a specific case that I have where I don't necessarily have a parent id. My data will look like this when I initially load the file.
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,...
2 | *,record,abc,efg,hij,...
3 | ,,1,x,y,z,...
4 | ,,2,q,r,s,...
5 | 3,Formula,5,6,7,8,...
6 | *,record,lmn,opq,rst,...
7 | ,,1,t,u,v,...
8 | ,,2,l,m,n,...
Essentially, its a CSV file where each row in the table is a line in the file. Lines 1 and 5 identify an object header and lines 3, 4, 7, and 8 identify the rows belonging to the object. The object header lines can have only 40 attributes which is why the object is broken up across multiple sections in the CSV file.
What I'd like to do is take the table, separate out the record # column, and join it with itself multiple times so it achieves something like this:
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,5,6,7,8,...
2 | *,record,abc,efg,hij,lmn,opq,rst
3 | ,,1,x,y,z,t,u,v,...
4 | ,,2,q,r,s,l,m,n,...
I know it's probably possible, I'm just not sure where to start. My initial idea was to create a view that separates out the first and second columns, and use the view as a way of joining repeatedly on those two columns. However, I have some problems:
I don't know how many sections will occur in the file for the same object
The file can contain other objects as well so joining on the first two columns would be problematic if you have something like
ID | Line |
-------------------------
1 | 3,Formula,1,2,3,4,...
2 | *,record,abc,efg,hij,...
3 | ,,1,x,y,z,...
4 | ,,2,q,r,s,...
5 | 3,Formula,5,6,7,8,...
6 | *,record,lmn,opq,rst,...
7 | ,,1,t,u,v,...
8 | ,,2,l,m,n,...
9 | ,4,Data,1,2,3,4,...
10 | *,record,lmn,opq,rst,...
11 | ,,1,t,u,v,...
In the above case, my plan could join rows from the Data object in row 9 with the first rows of the Formula object by matching the record value of 1.
UPDATE
I know this is somewhat confusing. I tried doing this with C# a while back, but I had to basically write a recursive descent parser for the specific file format, and it simply took too long because I had to get the data into the database afterwards and it was too much for Entity Framework. It was taking hours just to convert one file since these files are excessively large.
Either way, @Nolan Shang has the closest result to what I want. The only difference is this (sorry for the bad formatting):
+----+------------+--------------------------+----------------------------------+
| ID | header     | x                        | value                            |
+----+------------+--------------------------+----------------------------------+
| 1  | 3,Formula, | ,1,2,3,4,5,6,7,8         | 3,Formula,1,2,3,4,5,6,7,8        |
| 2  | ,,         | ,1,x,y,z,t,u,v           | ,1,x,y,z,t,u,v                   |
| 3  | ,,         | ,2,q,r,s,l,m,n           | ,2,q,r,s,l,m,n                   |
| 4  | *,record,  | ,abc,efg,hij,lmn,opq,rst | *,record,abc,efg,hij,lmn,opq,rst |
| 5  | ,4,        | ,Data,1,2,3,4            | ,4,Data,1,2,3,4                  |
| 6  | *,record,  | ,lmn,opq,rst             | ,lmn,opq,rst                     |
| 7  | ,,         | ,1,t,u,v                 | ,1,t,u,v                         |
+----+------------+--------------------------+----------------------------------+
I agree that it would be better to export this to a scripting language and do it there. This will be a lot of work in TSQL.
You've intimated that there are other possible scenarios you haven't shown, so I obviously can't give a comprehensive solution. I'm guessing this isn't something you need to do quickly on a repeated basis. More of a one-time transformation, so performance isn't an issue.
One approach would be to do a LEFT JOIN to a hard-coded table of the possible identifying sub-strings like:
3,Formula,
*,record,
,,1,
,,2,
,4,Data,
Looks like it pretty much has to be human-selected and hard-coded because I can't find a reliable pattern that can be used to SELECT only these sub-strings.
Then you SELECT from this artificially-created table (or derived table, or CTE) and LEFT JOIN to your actual table with a LIKE to get all the rows that use each of these values as their starting substring, strip out the starting characters to get the rest of the string, and use the STUFF..FOR XML trick to build the desired Line.
How you get the ID column depends on what you want. For instance, in your second example, I don't know what ID you want for the ,4,Data,... line. Do you want 5 because that's the next number in the results, or do you want 9 because that's the ID of the first occurrence of that sub-string? Code accordingly. If you want 5, it's a ROW_NUMBER(). If you want 9, you can add an ID column to the artificial table you created at the start of this approach.
BTW, there's really nothing recursive about what you need done, so if you're still thinking in those terms, now would be a good time to stop. This is more of a "Group Concatenation" problem.
Here is a sample, but it differs somewhat from what you need.
That is because I use the value up to the second comma as the group header, so ,,1 and ,,2 will be treated as the same group; if you can use a parent id to indicate a group, that would be better.
DECLARE @testdata TABLE(ID int, Line varchar(8000))
INSERT INTO @testdata
SELECT 1,'3,Formula,1,2,3,4,...' UNION ALL
SELECT 2,'*,record,abc,efg,hij,...' UNION ALL
SELECT 3,',,1,x,y,z,...' UNION ALL
SELECT 4,',,2,q,r,s,...' UNION ALL
SELECT 5,'3,Formula,5,6,7,8,...' UNION ALL
SELECT 6,'*,record,lmn,opq,rst,...' UNION ALL
SELECT 7,',,1,t,u,v,...' UNION ALL
SELECT 8,',,2,l,m,n,...' UNION ALL
SELECT 9,',4,Data,1,2,3,4,...' UNION ALL
SELECT 10,'*,record,lmn,opq,rst,...' UNION ALL
SELECT 11,',,1,t,u,v,...'
;WITH t AS(
SELECT *,REPLACE(SUBSTRING(t.Line,LEN(c.header)+1,LEN(t.Line)),',...','') AS data
FROM @testdata AS t
CROSS APPLY(VALUES(LEFT(t.Line,CHARINDEX(',',t.Line, CHARINDEX(',',t.Line)+1 )))) c(header)
)
SELECT MIN(ID) AS ID,t.header,c.x,t.header+STUFF(c.x,1,1,'') AS value
FROM t
OUTER APPLY(SELECT ','+tb.data FROM t AS tb WHERE tb.header=t.header FOR XML PATH('') ) c(x)
GROUP BY t.header,c.x
+----+------------+------------------------------------------+-----------------------------------------------+
| ID | header | x | value |
+----+------------+------------------------------------------+-----------------------------------------------+
| 1 | 3,Formula, | ,1,2,3,4,5,6,7,8 | 3,Formula,1,2,3,4,5,6,7,8 |
| 3 | ,, | ,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v | ,,1,x,y,z,2,q,r,s,1,t,u,v,2,l,m,n,1,t,u,v |
| 2 | *,record, | ,abc,efg,hij,lmn,opq,rst,lmn,opq,rst | *,record,abc,efg,hij,lmn,opq,rst,lmn,opq,rst |
| 9 | ,4, | ,Data,1,2,3,4 | ,4,Data,1,2,3,4 |
+----+------------+------------------------------------------+-----------------------------------------------+
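The grouping idea in the T-SQL sample (the header is everything up to and including the second comma; rows sharing a header get their remaining data concatenated in ID order) can be sketched in Python; the line values here are shortened versions of the question's data:

```python
# Python sketch of the T-SQL sample's "group concatenation": the header is
# everything up to and including the second comma, and rows sharing a header
# get their remaining data concatenated in ID order.
lines = [
    (1, "3,Formula,1,2,3,4"),
    (3, ",,1,x,y,z"),
    (5, "3,Formula,5,6,7,8"),
    (7, ",,1,t,u,v"),
]

groups = {}
for line_id, line in lines:
    # find the second comma, then split into header and remaining data
    second_comma = line.index(",", line.index(",") + 1)
    header, data = line[: second_comma + 1], line[second_comma + 1:]
    entry = groups.setdefault(header, {"id": line_id, "parts": []})
    entry["parts"].append(data)

# one merged line per header, keyed by the first ID seen (MIN(ID))
result = [(g["id"], header + ",".join(g["parts"])) for header, g in groups.items()]
```

This also reproduces the caveat in the answer: with only the header as the group key, `,,1` and `,,2` rows would merge into one group, which is why a parent id would be better.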

Dynamically Modify Internal Table Values by Data Type

This is a similar question to the one I posted last week.
I have an internal table based off of a dictionary structure with a format similar to the following:
+---------+--------+---------+--------+---------+--------+-----+
| column1 | delim1 | column3 | delim2 | column5 | delim3 | ... |
+---------+--------+---------+--------+---------+--------+-----+
| value1 | | | value 1 | | | value 1 | | | ... |
| value2 | | | value 2 | | | value 2 | | | ... |
| value3 | | | value 3 | | | value 3 | | | ... |
+---------+--------+---------+--------+---------+--------+-----+
The delim* columns are all of type delim, and the typing of the non-delimiter columns are irrelevant (assuming none of them are also type delim).
The data in this table is obtained in a single statement:
SELECT * FROM <table_name> INTO CORRESPONDING FIELDS OF TABLE <internal_table_name>.
Thus, I have a completely full table except for the delimiter values, which are determined by user input on the selection screen (that is, we cannot rely on them always being ,, or any other common delimiter).
I'd like to find a way to dynamically set all of the values of type delim to some input for every row.
Obviously I could just hardcode the delimiter names and loop over the table setting all of them, but that's not dynamic. Unfortunately I can't bank on a simple API.
What I've tried (this doesn't work, and it's such a bad technique that I felt dirty just writing it):
DATA lt_fields TYPE TABLE OF rollname.
SELECT fieldname FROM dd03l
INTO TABLE lt_fields
WHERE tabname = '<table_name>'
AND as4local = 'A'
AND rollname = 'DELIM'.
LOOP AT lt_output ASSIGNING FIELD-SYMBOL(<fs>).
LOOP AT lt_fields ASSIGNING FIELD-SYMBOL(<fs2>).
<fs>-<fs2> = '|'.
ENDLOOP.
ENDLOOP.
Once again, I'm not set in my ways and would switch to another approach altogether if I believe it's better.
Although I still believe you're barking up the wrong tree with the entire approach, you have been pointed in the right direction both here and in the previous question:
ASSIGN COMPONENT <fs2> OF STRUCTURE <fs> TO FIELD-SYMBOL(<fs3>). " might just as well continue with the write-only naming conventions
IF sy-subrc = 0.
<fs3> = '|'.
ENDIF.