Counting word lengths in a string - sql

I am using an Oracle regular expression to extract the first letter of each word in a string. The results are returned in a single cell, with spaces representing hard breaks. Here is an example...
input:
'I hope that some kind person
browsing stack overflow
can help me'
output:
ihtskp bso chm
What I am trying to do next is count the length of each "word" in my output, like this:
6 3 3
Alternatively, a count of the words in each line of the original string would be acceptable, as it would yield the same result.
Thanks!

Count the number of spaces and add one:
select (length(your_col) - length(replace(your_col, ' '))+1) from your_table;
It will give you the number of words per line. From there you can get all counts on one line by using listagg function:
select LISTAGG(cnt,' ') within group (order by null) from (
select (length(a)-length(replace(a,' '))+1) cnt from (
select 'apa bpa bv' a from dual
union all
select 'n bb gg' a from dual
union all
select 'ff ff rr gg' a from dual))
group by null;
Perhaps you also need to split the strings if they contain newlines or are they split already?

I tried to edit my original post but it hasn't appeared, but I figured out a way to solve my issue. I just decided to break the words into rows, since I know how to character count rows, and then reassembled the character counts into a single cell using listagg:
with my_string as (
select regexp_substr (words,'[0-9]+|[a-z]+|[A-Z]+',1,lvl) parsed
from (
select words, level lvl
from letters connect by level <= length(words) - length(replace(words,' ')) + 1)
)
select listagg(length(parsed),' ') within group (order by parsed) word_count
from my_string

Related

How can I count repeated values in the string in BigQuery?

Example:
I have the following string:
201904,BLANK,201902,BLANK,BLANK,201811,201810,201809
How can I count the number of repeated values "BLANK" that goes one by one?
In the described example the answer is 2, but what is the query?
Thanks for your help in advance!
Below is for BigQuery Standard SQL (with quick simplified example)
Corrected Version
#standardSQL
WITH `project.dataset.table` AS (
SELECT '201904,BLANK,201902,BLANK,BLANK,201811,201810,201809,BLANK,BLANK,BLANK' value UNION ALL
SELECT '201904,BLANK,201902,BLANK,BLANK,BLANK,201811' UNION ALL
SELECT '201904,BLANK,201902,BLANK,201811,201902,BLANK,201811'
)
SELECT value,
(
SELECT MAX(ARRAY_LENGTH(SPLIT(list))) - 1
FROM UNNEST(REGEXP_EXTRACT_ALL(value || ',', r'(?:BLANK,){1,}')) list
) max_repeated_count
FROM `project.dataset.table`
The idea here is
extract all instances of consecutive BLANK
split each such instances to array of elements of BLANK
and finally get max length of those arrays as a result
Just something came as quick approach
Refactored Version
#standardSQL
WITH `project.dataset.table` AS (
SELECT '201904,BLANK,201902,BLANK,BLANK,201811,201810,201809,BLANK,BLANK,BLANK' value UNION ALL
SELECT '201904,BLANK,201902,BLANK,BLANK,BLANK,201811' UNION ALL
SELECT '201904,BLANK,201902,BLANK,201811,201902,BLANK,201811'
)
SELECT value,
(
SELECT MAX(LENGTH(element) - 1)
FROM UNNEST(REGEXP_EXTRACT_ALL(REPLACE(value || ',', 'BLANK', ''), r',+')) element
) max_repeated_count
FROM `project.dataset.table`
Both with output
Row value max_repeated_count
1 201904,BLANK,201902,BLANK,BLANK,201811,201810,201809,BLANK,BLANK,BLANK 3
2 201904,BLANK,201902,BLANK,BLANK,BLANK,201811 3
3 201904,BLANK,201902,BLANK,201811,201902,BLANK,201811 1
Refactored version is slightly different (but main idea the same)
it removes all BLANKS (assuming BLANK cannot be part of other element - if it can - code can easily be adjusted)
then extract all consecutive entries of commas into array
calculates max length of such sequences of commas
Maybe I misunderstood, but can't you simply split by the value you're looking for and subtract 2 (1 for the first element and 1 for counting elements after splitting):
declare t DEFAULT '201904,BLANK,201902,BLANK,BLANK,201811,201810,201809';
SELECT
t as theString,
split(t,'BLANK') as theSplittedString,
array_length(split(t,'BLANK'))-2 as theAmount
n>0 - amount of repetition,
0 - no repetition,
-1 - element not found

Oracle SQL - Redacting multiple occurences all but last four digits of numbers of varying length within free text narrative

Is there are straightforward way, perhaps using REGEXP_REPLACE or the like, to redact all but the last four digits of numbers (or varying length of 5 or above) appearing within free text (there may be multiple occurrences of separate numbers within the text)?
E.g.
Input = 'This is a test text with numbers 12345, 9876543210 and separately number 1234567887654321 all buried within the text'
Output = 'This is a test text with numbers ****5, *****3210 and separately number ************4321 all buried within the text'
With REGEX_REPLACE it's obviously straightforward to replace all numbers with the *, but it's maintaining the final four digits and replacing with the correct number of *s that's vexing me.
Any help would be much appreciated!
(Just for context, due to the usual kind of business limitations, this had to be done within the query retrieving the data rather than using actual Oracle DBMS redaction functionality).
Many thanks.
You could try the following regex:
regexp_replace(txt, '(\d{4})(\d+(\D|$))', '****\2')
This captures sequences of 4 digits followed by at least one digit, then by a non-digit character (or the end of string), and replaces them with 4 stars.
Demo on DB Fiddle:
with t as (select 'select This is a test text with numbers 12345, 9876543210 and separately number 1234567887654321 all buried within the text' txt from dual)
select regexp_replace(txt, '(\d{4})(\d+\D)', '****\2') new_text from t
| NEW_TEXT |
| :-------------------------------------------------------------------------------------------------------------------------- |
| select This is a test text with numbers ****5, ****543210 and separately number ****567887654321 all buried within the text |
Edit
Here is a simplified version, suggested by Aleksej in the comments:
regexp_replace(txt, '(\d{4})(\d+)', '****\2')
This works because of the greadiness of the regexp engine, that will slurp as many '\d+' as possible.
If you really need to keep the length of the numbers, then (I think) there is not wayy todo it in one step. You'll have to split the string in numbers and not numbers and then replace the digits seperatly:
SELECT listagg(CASE WHEN REGEXP_LIKE(txt, '\d{5,}') -- if the string is of your desired format
THEN LPAD('*', LENGTH(txt) - 4,'*') || SUBSTR(txt, LENGTH(txt) -3) -- replace all digits but the last 4 with *
ELSE txt END)
within GROUP (ORDER BY lvl)
FROM (SELECT LEVEL lvl, REGEXP_SUBSTR(txt, '(\d+|\D+)', 1, LEVEL ) txt -- Split the string in numerical and non numerical parts
FROM (select 'This is a test text with numbers 12345, 9876543210 and separately number 1234567887654321 all buried within the text' AS txt FROM dual)
CONNECT BY REGEXP_SUBSTR(txt, '(\d+|\D+)', 1, LEVEL ) IS NOT NULL)
Result:
This is a test text with numbers *2345, ******3210 and separately number ************4321 all buried within the text
And as your example replaced the first for digits of your first number - you might also want to replace at least 4 digits:
SELECT listagg(CASE WHEN REGEXP_LIKE(txt, '\d{5,}') -- if the string is of your desired format
THEN LPAD('*', GREATEST(LENGTH(txt) - 4, 4),'*') || SUBSTR(txt, GREATEST(LENGTH(txt) -3, 5)) -- replace all digits but the last 4 with *
ELSE txt END)
within GROUP (ORDER BY lvl)
FROM (SELECT LEVEL lvl, REGEXP_SUBSTR(txt, '(\d+|\D+)', 1, LEVEL ) txt -- Split the string in numerical and non numerical parts
FROM (select 'This is a test text with numbers 12345, 9876543210 and separately number 1234567887654321 all buried within the text' AS txt FROM dual)
CONNECT BY REGEXP_SUBSTR(txt, '(\d+|\D+)', 1, LEVEL ) IS NOT NULL)
(Added GREATEST in the second line to replace at least 4 digits.)
Result:
This is a test text with numbers ****5, ******3210 and separately number ************4321 all buried within the text

USING SQL . extract numbers comma separated from string 'HEADER|N1000|E1001|N1002|E1003|N1004|N1005'

'HEADER|N1000|E1001|N1002|E1003|N1004|N1005'
'HEADER|N156|E1|N7|E122|N4|E5'
'HEADER|E0|E1|E2|E3|E4|E5'
'HEADER|N0|N1|N2|N3|N4|N5'
'HEADER|N125'
How to extract the numbers in comma-separated format from this stringS?
Expected result:
1000,1001,1002,1003,1004,1005
How to extract the numbers with N or E as suffix/prefix ie.
N1000
Expected result:
1000,1002,1004,1005
below regex does not return the result needed. But I want some thing like this
select REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005', '.*?(\d+)', '\1,'), ',?\.*$', '') from dual
the problem here is
when i want numbers with E OR N
select REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005', '.*?N(\d+)', '\1,'), ',?\.*$', '') from dual
select REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005', '.*?E(\d+)', '\1,'), ',?\.*$', '') from dual
they give good results for this scenerio
but when i input 'HEADER|N1000|E1001' it gives wrong answer plzzz verify and correct it
Update
Based on the changes to the question, the original answer is not valid. Instead, the solution is considerably more complex, using a hierarchical query to extract all the numbers from the string and then LISTAGG to put back together a list of numbers extracted from each string. To extract all numbers we use this query:
WITH cte AS (
SELECT DISTINCT data, level AS l, REGEXP_SUBSTR(data, '[NE]\d+', 1, level) AS num FROM test
CONNECT BY REGEXP_SUBSTR(data, '[NE]\d+', 1, level) IS NOT NULL
)
SELECT data, LISTAGG(SUBSTR(num, 2), ',') WITHIN GROUP (ORDER BY l) AS "All numbers"
FROM cte
GROUP BY data
Output (for the new sample data):
DATA All numbers
HEADER|E0|E1|E2|E3|E4|E5 0,1,2,3,4,5
HEADER|N0|N1|N2|N3|N4|N5 0,1,2,3,4,5
HEADER|N1000|E1001|N1002|E1003|N1004|N1005 1000,1001,1002,1003,1004,1005
HEADER|N125 125
HEADER|N156|E1|N7|E122|N4|E5 156,1,7,122,4,5
To select only numbers beginning with E, we modify the query to replace the [EN] in the REGEXP_SUBSTR expressions with just E i.e.
SELECT DISTINCT data, level AS l, REGEXP_SUBSTR(data, 'E\d+', 1, level) AS num FROM test
CONNECT BY REGEXP_SUBSTR(data, 'E\d+', 1, level) IS NOT NULL
Output:
DATA E-numbers
HEADER|E0|E1|E2|E3|E4|E5 0,1,2,3,4,5
HEADER|N0|N1|N2|N3|N4|N5
HEADER|N1000|E1001|N1002|E1003|N1004|N1005 1001,1003
HEADER|N125
HEADER|N156|E1|N7|E122|N4|E5 1,122,5
A similar change can be made to extract numbers commencing with N.
Demo on dbfiddle
Original Answer
One way to achieve your desired result is to replace a string of characters leading up to a number with that number and a comma, and then replace any characters from the last ,| to the end of string from the result:
SELECT REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|', '.*?(\d+)', '\1,'), ',?\|.*$', '') FROM dual
Output:
1000,1001,1002,1003,1004,1005
To only output the numbers beginning with N, we add that to the prefix string before the capture group:
SELECT REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|', '.*?N(\d+)', '\1,'), ',?\|.*$', '') FROM dual
Output:
1000,1002,1004,1005
To only output the numbers beginning with E, we add that to the prefix string before the capture group:
SELECT REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|', '.*?E(\d+)', '\1,'), ',?\|.*$', '') FROM dual
Output:
1001,1003
Demo on dbfiddle
I don't know what DBMS you are using, but here's one way to do it in Postgres:
WITH cte AS (
SELECT CAST('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|' AS VARCHAR(1000)) AS myValue
)
SELECT SUBSTRING(MyVal FROM 2)
FROM (
SELECT REGEXP_SPLIT_TO_TABLE(myValue,'\|') MyVal
FROM cte
) src
WHERE SUBSTRING(MyVal FROM 1 FOR 1) = 'N'
;
SQL Fiddle
As Far as I have understood the question , you want to extract substrings starting with N from the string, You can try following (And then you can merge the output seperated by commas if needed)
select REPLACE(value, 'N', '')
from STRING_SPLIT('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|', '|')
where value like 'N%'
OutPut :
1000
1002
1004
1005

Concatenate & Trim String

Can anyone help me, I have a problem regarding on how can I get the below result of data. refer to below sample data. So the logic for this is first I want delete the letters before the number and if i get that same thing goes on , I will delete the numbers before the letter so I can get my desired result.
Table:
SALV3000640PIX32BLU
SALV3334470A9CARBONGRY
TP3000620PIXL128BLK
Desired Output:
PIX32BLU
A9CARBONGRY
PIXL128BLK
You need to use a combination of the SUBSTRING and PATINDEX Functions
SELECT
SUBSTRING(SUBSTRING(fielda,PATINDEX('%[^a-z]%',fielda),99),PATINDEX('%[^0-9]%',SUBSTRING(fielda,PATINDEX('%[^a-z]%',fielda),99)),99) AS youroutput
FROM yourtable
Input
yourtable
fielda
SALV3000640PIX32BLU
SALV3334470A9CARBONGRY
TP3000620PIXL128BLK
Output
youroutput
PIX32BLU
A9CARBONGRY
PIXL128BLK
SQL Fiddle:http://sqlfiddle.com/#!6/5722b6/29/0
To do this you can use
PATINDEX('%[0-9]%',FieldName)
which will give you the position of the first number, then trim off any letters before this using SUBSTRING or other string functions. (You need to trim away the first letters before continuing with the next step because unlike CHARINDEX there is no starting point parameter in the PATINDEX function).
Then on the remaining string use
PATINDEX('%[a-z]%',FieldName)
to find the position of the first letter in the remaining string. Now trim off the numbers in front using SUBSTRING etc.
You may find this other solution helpful
SQL to find first non-numeric character in a string
Try this it may helps you
;With cte (Data)
AS
(
SELECT 'SALV3000640PIX32BLU' UNION ALL
SELECT 'SALV3334470A9CARBONGRY' UNION ALL
SELECT 'SALV3334470A9CARBONGRY' UNION ALL
SELECT 'SALV3334470B9CARBONGRY' UNION ALL
SELECT 'SALV3334470D9CARBONGRY' UNION ALL
SELECT 'TP3000620PIXL128BLK'
)
SELECT * , CASE WHEN CHARINDEX('PIX',Data)>0 THEN SUBSTRING(Data,CHARINDEX('PIX',Data),LEN(Data))
WHEN CHARINDEX('A9C',Data)>0 THEN SUBSTRING(Data,CHARINDEX('A9C',Data),LEN(Data))
ELSE NULL END AS DesiredResult FROM cte
Result
Data DesiredResult
-------------------------------------
SALV3000640PIX32BLU PIX32BLU
SALV3334470A9CARBONGRY A9CARBONGRY
SALV3334470A9CARBONGRY A9CARBONGRY
SALV3334470B9CARBONGRY NULL
SALV3334470D9CARBONGRY NULL
TP3000620PIXL128BLK PIXL128BLK

Using REGEXP_SUBSTR with Strings Qualifier

Getting Examples from similar Stack Overflow threads,
Remove all characters after a specific character in PL/SQL
and
How to Select a substring in Oracle SQL up to a specific character?
I would want to retrieve only the first characters before the occurrence of a string.
Example:
STRING_EXAMPLE
TREE_OF_APPLES
The Resulting Data set should only show only STRING_EXAM and TREE_OF_AP because PLE is my delimiter
Whenever i use the below REGEXP_SUBSTR, It gets only STRING_ because REGEXP_SUBSTR treats PLE as separate expressions (P, L and E), not as a single expression (PLE).
SELECT REGEXP_SUBSTR('STRING_EXAMPLE','[^PLE]+',1,1) from dual;
How can i do this without using numerous INSTRs and SUBSTRs?
Thank you.
The problem with your query is that if you use [^PLE] it would match any characters other than P or L or E. You are looking for an occurence of PLE consecutively. So, use
select REGEXP_SUBSTR(colname,'(.+)PLE',1,1,null,1)
from tablename
This returns the substring up to the last occurrence of PLE in the string.
If the string contains multiple instances of PLE and only the substring up to the first occurrence needs to be extracted, use
select REGEXP_SUBSTR(colname,'(.+?)PLE',1,1,null,1)
from tablename
Why use regular expressions for this?
select substr(colname, 1, instr(colname, 'PLE')-1) from...
would be more efficient.
with
inputs( colname ) as (
select 'FIRST_EXAMPLE' from dual union all
select 'IMPLEMENTATION' from dual union all
select 'PARIS' from dual union all
select 'PLEONASM' from dual
)
select colname, substr(colname, 1, instr(colname, 'PLE')-1) as result
from inputs
;
COLNAME RESULT
-------------- ----------
FIRST_EXAMPLE FIRST_EXAM
IMPLEMENTATION IM
PARIS
PLEONASM