splitting strings in oracle sql based on length - sql

I want to split my strings in Oracle based on length with space as a delimiter.
For example,
MY_STRING="welcome to programming world"
My output should be
STRING1="welcome to "
STRING2="programming "
The strings should be a maximum of 13 characters in length. The words after position 26 can be ignored.

You don't mention what version of Oracle you're using. If you're using 11g or above you can use regular expressions to get what you need (REGEXP_INSTR has been available since 10g, but REGEXP_COUNT, used below, was only added in 11g):
with spaces as (
select regexp_instr('welcome to programming world' || ' '
, '[[:space:]]', 1, level) as s
from dual
connect by level <= regexp_count('welcome to programming world' || ' '
, '[[:space:]]')
)
, actual as (
select max(case when s <= 13 then s else 0 end) as a
, max(case when s <= 26 then s else 0 end) as b
from spaces
)
select substr('welcome to programming world',1,a)
, substr('welcome to programming world', a + 1, b - a)
from actual
This finds the position of every space, then picks the last one at or before position 13 (and, likewise, the last one at or before position 26). Finally, a simple substr splits your string at those positions. Each piece will have a trailing space, so you might want to trim it.
You have to concatenate a space onto your string so that there is always a trailing space; otherwise the last word would be dropped if your string is shorter than 26 characters.
Assuming you're using an earlier version you could hack something together with instr and length but it won't be very pretty at all.
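The logic of the query (collect the space positions, then cut at the last space at or before each limit) can also be sketched outside the database. The Python below is only an illustration of that logic, not Oracle code; the function name split_at_spaces and its parameters are made up for this example.

```python
def split_at_spaces(text, first_limit=13, second_limit=26):
    """Split text into two pieces, each ending at the last space
    at or before the given limit (hypothetical helper)."""
    padded = text + " "  # trailing space so a short final word is not dropped
    # 1-based positions of every space, mirroring regexp_instr in the query
    spaces = [i + 1 for i, ch in enumerate(padded) if ch == " "]
    a = max([s for s in spaces if s <= first_limit], default=0)
    b = max([s for s in spaces if s <= second_limit], default=0)
    # padded[:a] is characters 1..a, padded[a:b] is characters a+1..b
    return padded[:a], padded[a:b]


print(split_at_spaces("welcome to programming world"))
```

For the sample string this yields "welcome to " and "programming ", each with the trailing space mentioned above.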

Related

substring after split by a separator oracle

Like if I have a string "123456,852369,7852159,1596357"
The output I am looking for is "1234,8523,7852,1596".
The requirement is to collect the first 4 characters after every ',' separator,
that is, split, take a substring of each part, and concatenate again.
select
REGEXP_REPLACE('MEDA,MEDA,MEDA,MEDA,MEDA,MEDA,MEDA,MEDA,MDCB,MDCB,MDCB,MDCB,MDCB,MDCB', '([^,]+)(,\1)+', '\1')
from dual;
we want to collect 4 char after every ',' separator
Here is an approach using regexp_replace:
select regexp_replace(
'123456,852369,7852159,1596357',
'([^,]{4})[^,]*(,|$)',
'\1\2'
)
from dual
Regexp breakdown:
([^,]{4}) 4 characters other than "," (capture that group as \1)
[^,]* 0 to n characters other than "," (no capture)
(,|$) either character "," or the end of string (capture this as \2)
The function replaces each match with capture 1 (the 4 characters we want) followed by capture 2 (the separator, if there is one).
Demo:
RESULT
1234,8523,7852,1596
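The same pattern can be tried outside Oracle. Here is a minimal Python sketch using the standard re module; the function name first_four_per_field is hypothetical and only serves the illustration.

```python
import re


def first_four_per_field(csv_text):
    """Keep the first 4 characters of each comma-separated field."""
    # \1 is the 4 kept characters, \2 is the separator (or the end of the string)
    return re.sub(r"([^,]{4})[^,]*(,|$)", r"\1\2", csv_text)


print(first_four_per_field("123456,852369,7852159,1596357"))
```

Fields shorter than 4 characters never match the pattern, so they pass through unchanged.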
One option might be to split the string, extract 4 characters and aggregate them back:
with test (col) as
(select '123456,852369,7852159,1596357' from dual)
select listagg(regexp_substr(col, '[^,]{4}', 1, level), ',')
within group (order by level) result
from test
connect by level <= regexp_count(col, ',') + 1;
RESULT
--------------------------------------------------------------------------------
1234,8523,7852,1596
With REGEXP_REPLACE:
select regexp_replace(the_string, '(^|,)([^,]{4})[^,]*', '\1\2')
from mytable;
This looks for
the beginning of the string or the comma
then four characters that are not a comma
then any number of trailing characters that are not a comma
And only keeps
the beginning or the comma
the four characters that follow
Demo: https://dbfiddle.uk/efUFvKyO

How can I separate a string in BigQuery into multiple columns without breaking up distinct words?

I'm trying to separate a string into two columns, but only if the total string's length is larger than 25 characters. If it's shorter than 25 characters, then I want it on the 2nd column only. If it's longer than 25, then I want the first part of the string to be in the 1st column and the second part of the string to be in the 2nd column.
Here's the kicker... I don't want words to be broken up. So if the total length of the string is 26, I know that I'll need two columns, but I need to figure out where to splice up the string so that only complete words are represented in each column.
For example, the string is "Transportation Project Manager". Since it has over 25 characters, I want the first column to say "Transportation Project" and the second column to say "Manager". "Transportation Project" has less than 25 characters but I want it to stop there since there isn't another complete word that would fit within the 25 character limit.
Another example- The string is "Caseworker I". Since it's less than 25 characters, I want the whole string to be represented in column 2.
Thank you for your time!
To split a string into 2 columns while respecting a defined maximum length (following the logic you described), we can use a JavaScript user-defined function (UDF) in BigQuery together with the built-in function LENGTH.
First, the string is analysed. If the character right after the maximum threshold is a white space, the string is split at the given maximum length. Otherwise, the characters are checked one by one, counting backwards, until a white space is found, and the string is split there. This procedure keeps the function from breaking up a word, and the split always respects the maximum allowed length.
Below is the query with some sample data,
CREATE TEMP FUNCTION split_str_1(s string,len int64)
RETURNS string
LANGUAGE js AS """
var len_aux = len, prev = 0;
//first part of the string, within the threshold
var output = [];
//the rest of the string, without the first part
var output2 = [];
//if the character right after the threshold is a whitespace, then split the string there
if(s[len_aux++] == ' ') {
output.push(s.substring(prev,len_aux));
prev = len_aux;
output2.push(s.substring(prev,s.length));
}
else{
do {
if(s.substring(len_aux - 1, len_aux) == ' ')
{
output.push(s.substring(prev,len_aux));
prev = len_aux;
output2.push(s.substring(prev,s.length));
break;
}len_aux--;
} while(len_aux > prev)
}
//outputting the first part of the string
return output[0];
""";
CREATE TEMP FUNCTION split_str_2(s string,len int64)
RETURNS string
LANGUAGE js AS """
var len_aux = len, prev = 0;
//first part of the string, within the threshold
var output = [];
//the rest of the string, without the first part
var output2 = [];
//if the character right after the threshold is a whitespace, then split the string there
if(s[len_aux++] == ' ') {
output.push(s.substring(prev,len_aux));
//move the start marker so the remainder does not repeat the first part
prev = len_aux;
output2.push(s.substring(prev,s.length));
}
else{
do {
if(s.substring(len_aux - 1, len_aux) == ' ')
{
output.push(s.substring(prev,len_aux));
prev = len_aux;
output2.push(s.substring(prev,s.length));
break;
}len_aux--;
} while(len_aux > prev)
}
//outputting the second part of the string
return output2[0];
""";
WITH data AS (
SELECT "Trying to split a string with more than 25 characters length" AS str UNION ALL
SELECT "Trying to split" AS str
)
SELECT str,
IF(LENGTH(str)>25, split_str_1(str,25), null) as column_1,
CASE WHEN LENGTH(str)>25 THEN split_str_2(str,25) ELSE str END AS column_2
FROM data
And the output,
Notice that there are 2 JavaScript UDFs: the first returns the first part of the string and the second returns the remainder, for strings longer than 25 characters. The maximum allowed length is passed as an argument, but it could also be hard-coded within the UDF as len = 25.
I think your angle of attack should be to find the first space before the 25th character and then split based on that.
Using phrases from the other submitted answers as sample data:
with sample_data as(
select 'Transportation Project Manager' as phrase union all
select 'Caseworker I'as phrase union all
select "This's 25 characters long" as phrase union all
select "This's 25 characters long (not!)" as phrase union all
select 'Antidisestablishmentarianist' as phrase union all
select 'Trying to split a string with more than 25 characters in length' as phrase union all
select 'Trying to split' as phrase
),
temp as (
select
phrase,
length(phrase) as phrase_len,
-- Find the first space before the 25th character
-- by reversing the first 25 characters
25-strpos(reverse(substr(phrase,1,25)),' ') as first_space_before_25
from sample_data
)
select
phrase,
phrase_len,
first_space_before_25,
case when phrase_len <= 25 or first_space_before_25 = 25 then null
when phrase_len > 25 then substr(phrase,1,first_space_before_25)
else null
end as col1,
case when phrase_len <= 25 or first_space_before_25 = 25 then phrase
when phrase_len > 25 then substr(phrase,first_space_before_25+1, phrase_len)
else null
end as col2
from temp
I think this gets you pretty close using basic sql string manipulation. You might need/want to clean this up a bit depending on if you want col2 to start with a space or be trimmed, and depending on your cutoff point (you mentioned less than 25 and greater than 25, but not exactly 25).
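The reverse-and-strpos trick amounts to "find the last space within the first 25 characters". A small Python sketch of the same rule, with a hypothetical function name, makes the edge cases easier to poke at; it is an illustration of the logic, not BigQuery code.

```python
def split_before_limit(phrase, limit=25):
    """Mirror the SQL above: cut at the last space inside the first `limit` characters."""
    if len(phrase) <= limit:
        return None, phrase          # short phrases go entirely to col2
    idx = phrase[:limit].rfind(" ")  # 0-based index of the last space, -1 if none
    if idx == -1:
        return None, phrase          # a single long word cannot be split
    return phrase[:idx], phrase[idx:]  # col2 keeps its leading space


print(split_before_limit("Transportation Project Manager"))
```

One consequence of only looking inside the first 25 characters: for "This's 25 characters long (not!)" the word "long" ends exactly at position 25, yet it lands in the second column, because the last space within those 25 characters comes before it.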
Below is for BigQuery Standard SQL
#standardSQL
SELECT phrase,
IF(IFNULL(cut, len ) >= len, NULL, SUBSTR(phrase, 1, cut)) col1,
IF(IFNULL(cut, len ) >= len, phrase, SUBSTR(phrase, cut + 1)) col2
FROM (
SELECT phrase, LENGTH(phrase) len,
(
SELECT cut FROM (
SELECT -1 + SUM(LENGTH(word) + 1) OVER(ORDER BY OFFSET) AS cut
FROM UNNEST(SPLIT(phrase, ' ')) word WITH OFFSET
)
WHERE cut <= 25
ORDER BY cut DESC
LIMIT 1
) cut
FROM `project.dataset.table`
)
You can test, play with above using sample data (nicely provided in other answers) as in below example
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'Transportation Project Manager' AS phrase UNION ALL
SELECT 'Caseworker I' UNION ALL
SELECT "This's 25 characters long" UNION ALL
SELECT "This's 25 characters long (not!)" UNION ALL
SELECT 'Antidisestablishmentarianist' UNION ALL
SELECT 'Trying to split a string with more than 25 characters in length' UNION ALL
SELECT 'Trying to split'
)
SELECT phrase,
IF(IFNULL(cut, len ) >= len, NULL, SUBSTR(phrase, 1, cut)) col1,
IF(IFNULL(cut, len ) >= len, phrase, SUBSTR(phrase, cut + 1)) col2
FROM (
SELECT phrase, LENGTH(phrase) len,
(
SELECT cut FROM (
SELECT -1 + SUM(LENGTH(word) + 1) OVER(ORDER BY OFFSET) AS cut
FROM UNNEST(SPLIT(phrase, ' ')) word WITH OFFSET
)
WHERE cut <= 25
ORDER BY cut DESC
LIMIT 1
) cut
FROM `project.dataset.table`
)
with output
| Row | phrase | col1 | col2 |
| :-- | :-- | :-- | :-- |
| 1 | Transportation Project Manager | Transportation Project | Manager |
| 2 | Caseworker I | null | Caseworker I |
| 3 | This's 25 characters long | null | This's 25 characters long |
| 4 | This's 25 characters long (not!) | This's 25 characters long | (not!) |
| 5 | Antidisestablishmentarianist | null | Antidisestablishmentarianist |
| 6 | Trying to split a string with more than 25 characters in length | Trying to split a string | with more than 25 characters in length |
| 7 | Trying to split | null | Trying to split |
Note: if you want to get rid of leading (in col2) and trailing (in col1) spaces - you can just add TRIM() to handle this little extra logic
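The running sum in the query (each word's length plus one for its separator, minus one overall) gives the position at which each word ends; the cut is the largest such position that is still at most 25. A rough Python sketch of that idea follows; the function name split_phrase is an assumption made for the example.

```python
def split_phrase(phrase, limit=25):
    """Cut after the last whole word that ends at or before `limit` characters."""
    if len(phrase) <= limit:
        return None, phrase
    cut = None
    running = -1  # mirrors "-1 + SUM(LENGTH(word) + 1)" in the query
    for word in phrase.split(" "):
        running += len(word) + 1  # end position of this word
        if running > limit:
            break
        cut = running
    if cut is None:
        return None, phrase  # the first word alone is already too long
    return phrase[:cut], phrase[cut:]  # col2 starts with the separating space


print(split_phrase("This's 25 characters long (not!)"))
```

Unlike an approach that only inspects the first 25 characters, this keeps a word that ends exactly at position 25 in the first column.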
Wow, this is a great interview question! Here's what I came up with:
WITH sample_data AS
(
SELECT 'Transportation Project Manager' AS phrase
UNION ALL
SELECT 'Caseworker I' AS phrase
UNION ALL
SELECT "This's 25 characters long" AS phrase
UNION ALL
SELECT "This's 25 characters long (not!)" AS phrase
UNION ALL
SELECT 'Antidisestablishmentarianist' AS phrase
),
unnested_words AS --Make a dataset with one row per "word" per phrase
(
SELECT
*,
--To preserve the spaces for character counts, prepend one to every word but the first
CASE WHEN i = 0 THEN '' ELSE ' ' END || word AS word_with_space
FROM
sample_data
CROSS JOIN
UNNEST(SPLIT(phrase, ' ')) AS word WITH OFFSET AS i
),
with_word_length AS
(
SELECT
*,
--This doesn't need its own CTE, but done here for clarity
LENGTH(word_with_space) AS word_length
FROM
unnested_words
),
running_sum AS --Mark when the total character length exceeds 25
(
SELECT
*,
SUM(word_length) OVER (PARTITION BY phrase ORDER BY i) <= 25 AS is_first_25
FROM
with_word_length
),
by_subphrase AS --Make a subphrase of words in the first 25, and one for any others
(
SELECT
phrase,
ARRAY_TO_STRING(ARRAY_AGG(word_with_space ORDER BY i), '') AS subphrase
FROM
running_sum
GROUP BY
phrase, is_first_25
),
by_phrase AS --Put subphrases into an array (back to one row per phrase)
(
SELECT
phrase, ARRAY_AGG(subphrase) AS subphrases
FROM
by_subphrase
GROUP BY
1
)
SELECT
phrase,
--Break the array of subphrases into columns per your rules
CASE WHEN ARRAY_LENGTH(subphrases) = 1 THEN subphrases[OFFSET(0)] ELSE subphrases[OFFSET(1)] END,
CASE WHEN ARRAY_LENGTH(subphrases) = 1 THEN NULL ELSE subphrases[OFFSET(0)] END
FROM
by_phrase
Not very pretty but gets it done.

Oracle SQL - Redacting multiple occurences all but last four digits of numbers of varying length within free text narrative

Is there a straightforward way, perhaps using REGEXP_REPLACE or the like, to redact all but the last four digits of numbers (of varying length, 5 digits or more) appearing within free text? There may be multiple occurrences of separate numbers within the text.
E.g.
Input = 'This is a test text with numbers 12345, 9876543210 and separately number 1234567887654321 all buried within the text'
Output = 'This is a test text with numbers ****5, *****3210 and separately number ************4321 all buried within the text'
With REGEXP_REPLACE it's obviously straightforward to replace all numbers with the *, but it's maintaining the final four digits and replacing with the correct number of *s that's vexing me.
Any help would be much appreciated!
(Just for context, due to the usual kind of business limitations, this had to be done within the query retrieving the data rather than using actual Oracle DBMS redaction functionality).
Many thanks.
You could try the following regex:
regexp_replace(txt, '(\d{4})(\d+(\D|$))', '****\2')
This captures a sequence of 4 digits that is followed by at least one more digit and then a non-digit character (or the end of the string), and replaces those 4 digits with 4 stars while keeping the rest.
Demo on DB Fiddle:
with t as (select 'select This is a test text with numbers 12345, 9876543210 and separately number 1234567887654321 all buried within the text' txt from dual)
select regexp_replace(txt, '(\d{4})(\d+\D)', '****\2') new_text from t
| NEW_TEXT |
| :-------------------------------------------------------------------------------------------------------------------------- |
| select This is a test text with numbers ****5, ****543210 and separately number ****567887654321 all buried within the text |
Edit
Here is a simplified version, suggested by Aleksej in the comments:
regexp_replace(txt, '(\d{4})(\d+)', '****\2')
This works because of the greediness of the regexp engine, which will slurp as many digits as possible with '\d+'.
If you really need to keep the length of the numbers, then (I think) there is no way to do it in one step. You'll have to split the string into numbers and non-numbers and then replace the digits separately:
SELECT listagg(CASE WHEN REGEXP_LIKE(txt, '\d{5,}') -- if the string is of your desired format
THEN LPAD('*', LENGTH(txt) - 4,'*') || SUBSTR(txt, LENGTH(txt) -3) -- replace all digits but the last 4 with *
ELSE txt END)
within GROUP (ORDER BY lvl)
FROM (SELECT LEVEL lvl, REGEXP_SUBSTR(txt, '(\d+|\D+)', 1, LEVEL ) txt -- Split the string in numerical and non numerical parts
FROM (select 'This is a test text with numbers 12345, 9876543210 and separately number 1234567887654321 all buried within the text' AS txt FROM dual)
CONNECT BY REGEXP_SUBSTR(txt, '(\d+|\D+)', 1, LEVEL ) IS NOT NULL)
Result:
This is a test text with numbers *2345, ******3210 and separately number ************4321 all buried within the text
And as your example replaced the first four digits of your first number, you might also want to replace at least 4 digits:
SELECT listagg(CASE WHEN REGEXP_LIKE(txt, '\d{5,}') -- if the string is of your desired format
THEN LPAD('*', GREATEST(LENGTH(txt) - 4, 4),'*') || SUBSTR(txt, GREATEST(LENGTH(txt) -3, 5)) -- replace all digits but the last 4 with *
ELSE txt END)
within GROUP (ORDER BY lvl)
FROM (SELECT LEVEL lvl, REGEXP_SUBSTR(txt, '(\d+|\D+)', 1, LEVEL ) txt -- Split the string in numerical and non numerical parts
FROM (select 'This is a test text with numbers 12345, 9876543210 and separately number 1234567887654321 all buried within the text' AS txt FROM dual)
CONNECT BY REGEXP_SUBSTR(txt, '(\d+|\D+)', 1, LEVEL ) IS NOT NULL)
(Added GREATEST in the second line to replace at least 4 digits.)
Result:
This is a test text with numbers ****5, ******3210 and separately number ************4321 all buried within the text
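The length-preserving variant (the first of the two queries above) boils down to "for every run of 5 or more digits, replace all but the last 4 with stars". In a language whose regex replace accepts a callback this fits in one pass. The Python below is a sketch of that idea only, with a hypothetical function name and parameters; it is not something Oracle's REGEXP_REPLACE can do directly.

```python
import re


def redact_numbers(text, keep=4, min_len=5):
    """Mask all but the last `keep` digits of every digit run of at least `min_len`."""
    def mask(match):
        digits = match.group(0)
        return "*" * (len(digits) - keep) + digits[-keep:]

    # [0-9]{5,} with the defaults: shorter numbers are left untouched
    return re.sub(r"[0-9]{%d,}" % min_len, mask, text)


print(redact_numbers("numbers 12345 and 9876543210, but not 1234"))
```

With the defaults, 12345 becomes *2345 and a four-digit number such as 1234 is left as it is.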

Counting word lengths in a string

I am using an Oracle regular expression to extract the first letter of each word in a string. The results are returned in a single cell, with spaces representing hard breaks. Here is an example...
input:
'I hope that some kind person
browsing stack overflow
can help me'
output:
ihtskp bso chm
What I am trying to do next is count the length of each "word" in my output, like this:
6 3 3
Alternatively, a count of the words in each line of the original string would be acceptable, as it would yield the same result.
Thanks!
Count the number of spaces and add one:
select (length(your_col) - length(replace(your_col, ' '))+1) from your_table;
It will give you the number of words per line. From there you can get all counts on one line by using listagg function:
select LISTAGG(cnt,' ') within group (order by null) from (
select (length(a)-length(replace(a,' '))+1) cnt from (
select 'apa bpa bv' a from dual
union all
select 'n bb gg' a from dual
union all
select 'ff ff rr gg' a from dual))
group by null;
Perhaps you also need to split the strings if they contain newlines or are they split already?
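If the whole text does arrive as one value with embedded newlines, the "spaces plus one" rule can be applied per line. A small Python sketch of that rule is shown here as an illustration of the counting logic rather than as Oracle code; the function name words_per_line is made up.

```python
def words_per_line(text):
    """Count words on each line as (number of spaces + 1), joined by spaces."""
    return " ".join(str(line.count(" ") + 1) for line in text.splitlines())


sample = "I hope that some kind person\nbrowsing stack overflow\ncan help me"
print(words_per_line(sample))
```

Like the SQL version, this assumes words are separated by exactly one space; doubled spaces or blank lines would inflate the counts.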
I tried to edit my original post but it hasn't appeared, but I figured out a way to solve my issue. I just decided to break the words into rows, since I know how to character count rows, and then reassembled the character counts into a single cell using listagg:
with my_string as (
select lvl, regexp_substr (words,'[0-9]+|[a-z]+|[A-Z]+',1,lvl) parsed
from (
select words, level lvl
from letters connect by level <= length(words) - length(replace(words,' ')) + 1)
)
-- order by position (lvl) so the counts keep the original word order
select listagg(length(parsed),' ') within group (order by lvl) word_count
from my_string

Remove leading zeros

Given data in a column which look like this:
00001 00
00026 00
I need to use SQL to remove anything after the space and all leading zeros from the values so that the final output will be:
1
26
How can I best do this?
Btw I'm using DB2
This was tested on DB2 for Linux/Unix/Windows and z/OS.
You can use the LOCATE() function in DB2 to find the character position of the first space in a string, and then send that to SUBSTR() as the end location (minus one) to get only the first number of the string. Casting to INT will get rid of the leading zeros, but if you need it in string form, you can CAST again to CHAR.
SELECT CAST(SUBSTR(col, 1, LOCATE(' ', col) - 1) AS INT)
FROM tab
In DB2 (Express-C 9.7.5) you can use the SQL standard TRIM() function:
db2 => CREATE TABLE tbl (vc VARCHAR(64))
DB20000I The SQL command completed successfully.
db2 => INSERT INTO tbl (vc) VALUES ('00001 00'), ('00026 00')
DB20000I The SQL command completed successfully.
db2 => SELECT TRIM(TRIM('0' FROM vc)) AS trimmed FROM tbl
TRIMMED
----------------------------------------------------------------
1
26
2 record(s) selected.
The inner TRIM() removes leading and trailing zero characters, while the outer trim removes spaces.
This worked for me on the AS400 DB2.
The "L" stands for Leading.
You can also use "T" for Trailing.
I am assuming the field type is currently VARCHAR, do you need to store things other than INTs?
If the field type was INT, they would be removed automatically.
Alternatively, to select the values:
SELECT CAST(CAST(Col1 AS INT) AS VARCHAR(11)) AS Col1
I found this thread for some reason and find it odd that no one actually answered the question. It seems that the goal is to return a left adjusted field:
SELECT
TRIM(L '0' FROM SUBSTR(trim(col) || ' ',1,LOCATE(' ',trim(col) || ' ') - 1))
FROM tab
One option is implicit casting: SELECT SUBSTR(column, 1, 5) + 0 AS column_as_number ...
That assumes that the structure is nnnnn nn, ie exactly 5 characters, a space and two more characters.
Explicit casting, ie SUBSTR(column,1,5)::INT is also a possibility, but exact syntax depends on the RDBMS in question.
Use the following to achieve this when the space location is variable, or even when it's fixed and you want to make a more robust query (in case it moves later):
SELECT CAST(SUBSTR(LTRIM('00123 45'), 1, CASE WHEN LOCATE(' ', LTRIM('00123 45')) <= 1 THEN LENGTH('00123 45') ELSE LOCATE(' ', LTRIM('00123 45')) - 1 END) AS BIGINT)
If you know the column will always contain a blank space after the start:
SELECT CAST(SUBSTR(LTRIM('00123 45'), 1, LOCATE(' ', LTRIM('00123 45')) - 1) AS BIGINT)
both of these result in:
123
so your query would be:
SELECT CAST(SUBSTR(LTRIM(myCol1), 1, CASE WHEN LOCATE(' ', LTRIM(myCol1)) <= 1 THEN LENGTH(myCol1) ELSE LOCATE(' ', LTRIM(myCol1)) - 1 END) AS BIGINT)
FROM myTable1
This removes any content after the first space character (ignoring leading spaces), and then converts the remainder to a 64bit integer which will then remove all leading zeroes.
If you want to keep all the numbers and just remove the leading zeroes and any spaces you can use:
SELECT CAST(REPLACE('00123 45', ' ', '') AS BIGINT)
My answer might seem quite verbose compared to simply SELECT CAST(SUBSTR(myCol1, 1, 5) AS BIGINT) FROM myTable1, but it allows for the space character not always being there and for myCol1 values that are not of the form nnnnn nn. If the string is nn nn, for example, the simpler query's conversion to an integer will fail.
Remember to be careful if you use the TRIM function to remove the leading zeroes, and actually in all situations you will need to test your code with data like 00120 00 and see if it returns 12 instead of the correct value of 120.
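The pitfall is easy to reproduce outside the database: trimming '0' from both ends of 00120 eats the trailing zero as well. The Python sketch below illustrates the difference between a two-sided and a leading-only trim; the function name strip_leading_zeros is hypothetical and the code only models the string logic, not DB2's TRIM.

```python
def strip_leading_zeros(value):
    """Keep the part before the first space and drop leading zeros only."""
    first = value.strip().split(" ", 1)[0]
    # lstrip removes zeros on the left only; an all-zero field becomes "0", not ""
    return first.lstrip("0") or "0"


print("00120".strip("0"))               # two-sided trim loses the trailing zero
print(strip_leading_zeros("00120 00"))  # leading-only trim keeps it
```

The first call prints 12 and the second prints 120, which is why test data such as 00120 00 is worth including.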