Proper Case in Big Query - google-bigquery

I have the sentence "i want to buy bananas" in the column 'Bananas' in BigQuery.
I want to get "I Want To Buy Bananas". How do I do it? I was expecting a PROPER(Bananas) function when I saw LOWER and UPPER, but it seems that proper case is not supported?

October 2020 Update:
BigQuery now supports the INITCAP function, which takes a STRING and returns it with the first character in each word in uppercase and all other characters in lowercase. Non-alphabetic characters remain the same.
So the fancy-shmancy type of UDF shown below is not needed anymore. Instead, you can just use
#standardSQL
SELECT str, INITCAP(str) proper_str
FROM `project.dataset.table`
-- ~~~~~~~~~~~~~~~~~~
The example below is for BigQuery Standard SQL
#standardSQL
CREATE TEMP FUNCTION PROPER(str STRING) AS ((
SELECT STRING_AGG(CONCAT(UPPER(SUBSTR(w,1,1)), LOWER(SUBSTR(w,2))), ' ' ORDER BY pos)
FROM UNNEST(SPLIT(str, ' ')) w WITH OFFSET pos
));
WITH `project.dataset.table` AS (
SELECT 'i Want to buy bananas' str
)
SELECT str, PROPER(str) proper_str
FROM `project.dataset.table`
The result is:
Row | str                   | proper_str
1   | i Want to buy bananas | I Want To Buy Bananas
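For readers who want to check the logic outside BigQuery, here is a minimal Python sketch of what the UDF above does: split on single spaces, uppercase the first character of each word, lowercase the rest, and join with spaces. The function name proper_case is only illustrative and is not part of BigQuery.

```python
def proper_case(text: str) -> str:
    """Rough Python equivalent of the PROPER UDF above (illustrative only)."""
    # Split on single spaces, mirroring SPLIT(str, ' ').
    words = text.split(" ")
    # Uppercase the first character and lowercase the remainder of each word.
    return " ".join(word[:1].upper() + word[1:].lower() for word in words)


print(proper_case("i Want to buy bananas"))
```

Like the UDF, this only treats a space as a word boundary, so a hyphenated word is capitalised once.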

I expanded on Mikhail Berlyant's answer to also capitalise after hyphens (-), as I needed to use proper case for place names. I had to switch from the SPLIT function to a regex to do this.
I test for an empty string at the start and return an empty string (as opposed to null) to match the behaviour of the native UPPER and LOWER functions.
CREATE TEMP FUNCTION PROPER(str STRING) AS ((
SELECT
IF(str = '', '',
STRING_AGG(
CONCAT(
UPPER(SUBSTR(single_words,1,1)),
LOWER(SUBSTR(single_words,2))
),
'' ORDER BY position
)
)
FROM UNNEST(REGEXP_EXTRACT_ALL(str, r' +|-+|.[^ -]*')) AS single_words
WITH OFFSET AS position
));
WITH test_table AS (
SELECT 'i Want to buy bananas' AS str
UNION ALL
SELECT 'neWCASTle upon-tyne' AS str
)
SELECT str, PROPER(str) AS proper_str
FROM test_table
Output:
Row | str                   | proper_str
1   | i Want to buy bananas | I Want To Buy Bananas
2   | neWCASTle upon-tyne   | Newcastle Upon-Tyne
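The tokenising idea in this answer can be tried out in Python as a rough sketch. It reuses the same pattern as the REGEXP_EXTRACT_ALL call above, so runs of spaces and runs of hyphens become their own tokens and are passed through unchanged. The name proper_case_hyphen is hypothetical.

```python
import re

# Same pattern as in the UDF: a run of spaces, a run of hyphens,
# or any character followed by characters that are neither space nor hyphen.
TOKEN = re.compile(r" +|-+|.[^ -]*")


def proper_case_hyphen(text: str) -> str:
    """Capitalise the first letter after spaces and hyphens (illustrative sketch)."""
    tokens = TOKEN.findall(text)
    # Separator tokens are unaffected by upper()/lower(), so they survive as-is.
    return "".join(tok[:1].upper() + tok[1:].lower() for tok in tokens)


print(proper_case_hyphen("neWCASTle upon-tyne"))
```

Because the tokens are joined with an empty string, the original spacing is preserved exactly, which is why the SQL version aggregates with '' rather than ' '.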

Related

Remove items in a delimited list that are non numeric in SQL for Redshift

I am working with a field called codes that is a delimited list of values, separated by commas. Within each item there is a title ending in a colon and then a code number following the colon. I want a list of only the code numbers after each colon.
Example Value:
name-form-na-stage0:3278648990379886572,rules-na-unwanted-sdfle2:6886328308933282817,us-disdg-order-stage1:1273671130817907765
Desired Output:
3278648990379886572,6886328308933282817,1273671130817907765
The title always starts with a letter and ends with a colon, so I can see how REGEXP_REPLACE might work to replace any string that starts with a letter and ends with a colon with '', but I am not good at REGEXP_REPLACE patterns. ChatGPT is down, fml.
Side note, if anyone knows of a good guide for understanding pattern notation for regular expressions it would be much appreciated!
I tried this and it is not working: REGEXP_REPLACE(REPLACE(REPLACE(codes,':', ' '), ',', ' ') ,' [^0-9]+ ', ' ')
This solution assumes a few things:
No colons anywhere else except immediately before the numbers
No number at the very start
At a high level, this query finds how many colons there are, splits the entire string into that many parts, and then only keeps the number up to the comma immediately after the number, and then aggregates the numbers into a comma-delimited list.
Assuming a table like this:
create temp table tbl_string (id int, strval varchar(1000));
insert into tbl_string
values
(1, 'name-form-na-stage0:3278648990379886572,rules-na-unwanted-sdfle2:6886328308933282817,us-disdg-order-stage1:1273671130817907765');
with recursive cte_num_of_delims AS (
select max(regexp_count(strval, ':')) AS num_of_delims
from tbl_string
), cte_nums(nums) AS (
select 1 as nums
union all
select nums + 1
from cte_nums
where nums <= (select num_of_delims from cte_num_of_delims)
), cte_strings_nums_combined as (
select id,
strval,
nums as index
from cte_nums
cross join tbl_string
), prefinal as (
select *,
split_part(strval, ':', index) as parsed_vals
from cte_strings_nums_combined
where parsed_vals != ''
and index != 1
), final as (
select *,
case
when charindex(',', parsed_vals) = 0
then parsed_vals
else left(parsed_vals, charindex(',', parsed_vals) - 1)
end as final_vals
from prefinal
)
select listagg(final_vals, ',')
from final
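The same steps (split on the colon, drop the leading title, keep everything up to the next comma, then join) can be sketched in Python to sanity-check the expected output. This is only an illustration of the logic under the two assumptions listed above, not Redshift code; extract_codes is a made-up name.

```python
def extract_codes(codes: str) -> str:
    """Return only the numbers that follow each colon, comma-separated."""
    # Everything before the first colon is a title, so skip that part.
    parts = codes.split(":")[1:]
    # Each remaining part starts with the number, optionally followed by
    # a comma and the next title; keep only the text before that comma.
    numbers = [part.split(",", 1)[0] for part in parts]
    return ",".join(n for n in numbers if n)


sample = (
    "name-form-na-stage0:3278648990379886572,"
    "rules-na-unwanted-sdfle2:6886328308933282817,"
    "us-disdg-order-stage1:1273671130817907765"
)
print(extract_codes(sample))
```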

How can I separate a string in BigQuery into multiple columns without breaking up distinct words?

I'm trying to separate a string into two columns, but only if the total string's length is larger than 25 characters. If it's shorter than 25 characters, then I want it on the 2nd column only. If it's longer than 25, then I want the first part of the string to be in the 1st column and the second part of the string to be in the 2nd column.
Here's the kicker... I don't want words to be broken up. So if the total length of the string is 26, I know that I'll need two columns, but I need to figure out where to splice up the string so that only complete words are represented in each column.
For example, the string is "Transportation Project Manager". Since it has over 25 characters, I want the first column to say "Transportation Project" and the second column to say "Manager". "Transportation Project" has less than 25 characters but I want it to stop there since there isn't another complete word that would fit within the 25 character limit.
Another example- The string is "Caseworker I". Since it's less than 25 characters, I want the whole string to be represented in column 2.
Thank you for your time!
To split a string into two columns while respecting a defined maximum length (following the logic you described), we can use a JavaScript user-defined function (UDF) in BigQuery together with the built-in function LENGTH.
First, the string is analysed. If the character right after the maximum length is a white space, the string is split at that point. Otherwise, the characters are checked one by one, counting backwards, until a white space is found, and the string is split there. This prevents the function from breaking up a word, so the string is always split within the maximum allowed length.
Below is the query with some sample data,
CREATE TEMP FUNCTION split_str_1(s string,len int64)
RETURNS string
LANGUAGE js AS """
var len_aux = len, prev = 0;
//first part of the string within the threshold
output = [];
//the rest of the string without the first part
output2 = [];
//if the next character in the string is a whitespace, then split the string
if(s[len_aux++] == ' ') {
output.push(s.substring(prev,len_aux));
//move the start marker so output2 holds only the remainder
prev = len_aux;
output2.push(s.substring(prev,s.length));
}
else{
do {
if(s.substring(len_aux - 1, len_aux) == ' ')
{
output.push(s.substring(prev,len_aux));
prev = len_aux;
output2.push(s.substring(prev,s.length));
break;
}len_aux--;
} while(len_aux > prev)
}
//outputting the first part of the string
return output[0];
""";
CREATE TEMP FUNCTION split_str_2(s string,len int64)
RETURNS string
LANGUAGE js AS """
var len_aux = len, prev = 0;
//first part of the string within the threshold
output = [];
//the rest of the string without the first part
output2 = [];
//if the next character in the string is a whitespace, then split the string
if(s[len_aux++] == ' ') {
output.push(s.substring(prev,len_aux));
//move the start marker so output2 holds only the remainder
prev = len_aux;
output2.push(s.substring(prev,s.length));
}
else{
do {
if(s.substring(len_aux - 1, len_aux) == ' ')
{
output.push(s.substring(prev,len_aux));
prev = len_aux;
output2.push(s.substring(prev,s.length));
break;
}len_aux--;
} while(len_aux > prev)
}
//outputting the first part of the string
return output2[0];
""";
WITH data AS (
SELECT "Trying to split a string with more than 25 characters length" AS str UNION ALL
SELECT "Trying to split" AS str
)
SELECT str,
IF(LENGTH(str)>25, split_str_1(str,25), null) as column_1,
CASE WHEN LENGTH(str)>25 THEN split_str_2(str,25) ELSE str END AS column_2
FROM data
And the output,
Notice that there are two JavaScript UDFs. This is because the first one returns the first part of the string and the second one returns the second part when the string is longer than 25 characters. Also, the maximum allowed length is passed as an argument, but it could be defined statically within the UDF as len=25.
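The backward scan described in this answer can be sketched in a few lines of Python. This is a simplified illustration and not a port of the UDFs: it returns both parts from one function, drops the space at the split point, and puts the whole string in the second column when no split is possible. The name split_at_word_boundary and the default of 25 are assumptions taken from the question.

```python
from typing import Optional, Tuple


def split_at_word_boundary(s: str, max_len: int = 25) -> Tuple[Optional[str], str]:
    """Split s into (column_1, column_2) without breaking a word."""
    if len(s) <= max_len:
        # Short strings go to the second column only.
        return None, s
    # Search backwards for the last space at index <= max_len, so the
    # first part is at most max_len characters long.
    cut = s.rfind(" ", 0, max_len + 1)
    if cut == -1:
        # A single long word cannot be split.
        return None, s
    return s[:cut], s[cut + 1:]


print(split_at_word_boundary("Transportation Project Manager"))
print(split_at_word_boundary("Caseworker I"))
```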
I think your angle of attack should be to find the last space within the first 25 characters and then split based on that.
Using phrases from the other submitted answers as sample data:
with sample_data as(
select 'Transportation Project Manager' as phrase union all
select 'Caseworker I' as phrase union all
select "This's 25 characters long" as phrase union all
select "This's 25 characters long (not!)" as phrase union all
select 'Antidisestablishmentarianist' as phrase union all
select 'Trying to split a string with more than 25 characters in length' as phrase union all
select 'Trying to split' as phrase
),
temp as (
select
phrase,
length(phrase) as phrase_len,
-- Find the first space before the 25th character
-- by reversing the first 25 characters
25-strpos(reverse(substr(phrase,1,25)),' ') as first_space_before_25
from sample_data
)
select
phrase,
phrase_len,
first_space_before_25,
case when phrase_len <= 25 or first_space_before_25 = 25 then null
when phrase_len > 25 then substr(phrase,1,first_space_before_25)
else null
end as col1,
case when phrase_len <= 25 or first_space_before_25 = 25 then phrase
when phrase_len > 25 then substr(phrase,first_space_before_25+1, phrase_len)
else null
end as col2
from temp
I think this gets you pretty close using basic sql string manipulation. You might need/want to clean this up a bit depending on if you want col2 to start with a space or be trimmed, and depending on your cutoff point (you mentioned less than 25 and greater than 25, but not exactly 25).
Below is for BigQuery Standard SQL
#standardSQL
SELECT phrase,
IF(IFNULL(cut, len ) >= len, NULL, SUBSTR(phrase, 1, cut)) col1,
IF(IFNULL(cut, len ) >= len, phrase, SUBSTR(phrase, cut + 1)) col2
FROM (
SELECT phrase, LENGTH(phrase) len,
(
SELECT cut FROM (
SELECT -1 + SUM(LENGTH(word) + 1) OVER(ORDER BY OFFSET) AS cut
FROM UNNEST(SPLIT(phrase, ' ')) word WITH OFFSET
)
WHERE cut <= 25
ORDER BY cut DESC
LIMIT 1
) cut
FROM `project.dataset.table`
)
You can test and play with the above using sample data (nicely provided in other answers), as in the example below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'Transportation Project Manager' AS phrase UNION ALL
SELECT 'Caseworker I' UNION ALL
SELECT "This's 25 characters long" UNION ALL
SELECT "This's 25 characters long (not!)" UNION ALL
SELECT 'Antidisestablishmentarianist' UNION ALL
SELECT 'Trying to split a string with more than 25 characters in length' UNION ALL
SELECT 'Trying to split'
)
SELECT phrase,
IF(IFNULL(cut, len ) >= len, NULL, SUBSTR(phrase, 1, cut)) col1,
IF(IFNULL(cut, len ) >= len, phrase, SUBSTR(phrase, cut + 1)) col2
FROM (
SELECT phrase, LENGTH(phrase) len,
(
SELECT cut FROM (
SELECT -1 + SUM(LENGTH(word) + 1) OVER(ORDER BY OFFSET) AS cut
FROM UNNEST(SPLIT(phrase, ' ')) word WITH OFFSET
)
WHERE cut <= 25
ORDER BY cut DESC
LIMIT 1
) cut
FROM `project.dataset.table`
)
with output:
Row | phrase                                                          | col1                      | col2
1   | Transportation Project Manager                                  | Transportation Project    | Manager
2   | Caseworker I                                                    | null                      | Caseworker I
3   | This's 25 characters long                                       | null                      | This's 25 characters long
4   | This's 25 characters long (not!)                                | This's 25 characters long | (not!)
5   | Antidisestablishmentarianist                                    | null                      | Antidisestablishmentarianist
6   | Trying to split a string with more than 25 characters in length | Trying to split a string  | with more than 25 characters in length
7   | Trying to split                                                 | null                      | Trying to split
Note: if you want to get rid of leading (in col2) and trailing (in col1) spaces - you can just add TRIM() to handle this little extra logic
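The running-sum idea in this query (accumulate word lengths plus one per space, then take the largest total that still fits in 25) can be checked with a small Python sketch. It is an illustration under the same rules as the SQL, including the untrimmed leading space in the second part. The function name is hypothetical.

```python
from itertools import accumulate
from typing import Optional, Tuple


def split_by_running_length(phrase: str, limit: int = 25) -> Tuple[Optional[str], str]:
    """Mirror of the windowed SUM: find the last word end within the limit."""
    words = phrase.split(" ")
    # Position of the end of each word: running sum of (length + 1), minus 1.
    cuts = [total - 1 for total in accumulate(len(w) + 1 for w in words)]
    fitting = [c for c in cuts if c <= limit]
    cut = max(fitting) if fitting else None
    if cut is None or cut >= len(phrase):
        # Either nothing fits or the whole phrase fits: col1 is null.
        return None, phrase
    # Like SUBSTR(phrase, cut + 1), the second part keeps its leading space.
    return phrase[:cut], phrase[cut:]


print(split_by_running_length("Transportation Project Manager"))
```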
Wow, this is a great interview question! Here's what I came up with:
WITH sample_data AS
(
SELECT 'Transportation Project Manager' AS phrase
UNION ALL
SELECT 'Caseworker I' AS phrase
UNION ALL
SELECT "This's 25 characters long" AS phrase
UNION ALL
SELECT "This's 25 characters long (not!)" AS phrase
UNION ALL
SELECT 'Antidisestablishmentarianist' AS phrase
),
unnested_words AS --Make a dataset with one row per "word" per phrase
(
SELECT
*,
--To preserve the spaces for character counts, prepend one to every word but the first
CASE WHEN i = 0 THEN '' ELSE ' ' END || word AS word_with_space
FROM
sample_data
CROSS JOIN
UNNEST(SPLIT(phrase, ' ')) AS word WITH OFFSET AS i
),
with_word_length AS
(
SELECT
*,
--This doesn't need its own CTE, but done here for clarity
LENGTH(word_with_space) AS word_length
FROM
unnested_words
),
running_sum AS --Mark when the total character length exceeds 25
(
SELECT
*,
SUM(word_length) OVER (PARTITION BY phrase ORDER BY i) <= 25 AS is_first_25
FROM
with_word_length
),
by_subphrase AS --Make a subphrase of words in the first 25, and one for any others
(
SELECT
phrase,
ARRAY_TO_STRING(ARRAY_AGG(word_with_space ORDER BY i), '') AS subphrase
FROM
running_sum
GROUP BY
phrase, is_first_25
),
by_phrase AS --Put subphrases into an array (back to one row per phrase)
(
SELECT
phrase, ARRAY_AGG(subphrase) AS subphrases
FROM
by_subphrase
GROUP BY
1
)
SELECT
phrase,
--Break the array of subphrases into columns per your rules
CASE WHEN ARRAY_LENGTH(subphrases) = 1 THEN subphrases[OFFSET(0)] ELSE subphrases[OFFSET(1)] END,
CASE WHEN ARRAY_LENGTH(subphrases) = 1 THEN NULL ELSE subphrases[OFFSET(0)] END
FROM
by_phrase
Not very pretty but gets it done.

USING SQL . extract numbers comma separated from string 'HEADER|N1000|E1001|N1002|E1003|N1004|N1005'

'HEADER|N1000|E1001|N1002|E1003|N1004|N1005'
'HEADER|N156|E1|N7|E122|N4|E5'
'HEADER|E0|E1|E2|E3|E4|E5'
'HEADER|N0|N1|N2|N3|N4|N5'
'HEADER|N125'
How can I extract the numbers from these strings in comma-separated format?
Expected result:
1000,1001,1002,1003,1004,1005
How can I extract only the numbers with a given prefix (N or E), e.g. only those like
N1000
Expected result:
1000,1002,1004,1005
The regex below does not return the result I need, but I want something like this:
select REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005', '.*?(\d+)', '\1,'), ',?\.*$', '') from dual
The problem arises when I want only the numbers with E or only those with N:
select REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005', '.*?N(\d+)', '\1,'), ',?\.*$', '') from dual
select REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005', '.*?E(\d+)', '\1,'), ',?\.*$', '') from dual
These give good results for this scenario, but when I input 'HEADER|N1000|E1001' they give the wrong answer. Please verify and correct it.
Update
Based on the changes to the question, the original answer is not valid. Instead, the solution is considerably more complex, using a hierarchical query to extract all the numbers from the string and then LISTAGG to put back together a list of numbers extracted from each string. To extract all numbers we use this query:
WITH cte AS (
SELECT DISTINCT data, level AS l, REGEXP_SUBSTR(data, '[NE]\d+', 1, level) AS num FROM test
CONNECT BY REGEXP_SUBSTR(data, '[NE]\d+', 1, level) IS NOT NULL
)
SELECT data, LISTAGG(SUBSTR(num, 2), ',') WITHIN GROUP (ORDER BY l) AS "All numbers"
FROM cte
GROUP BY data
Output (for the new sample data):
DATA All numbers
HEADER|E0|E1|E2|E3|E4|E5 0,1,2,3,4,5
HEADER|N0|N1|N2|N3|N4|N5 0,1,2,3,4,5
HEADER|N1000|E1001|N1002|E1003|N1004|N1005 1000,1001,1002,1003,1004,1005
HEADER|N125 125
HEADER|N156|E1|N7|E122|N4|E5 156,1,7,122,4,5
To select only numbers beginning with E, we modify the query to replace the [NE] in the REGEXP_SUBSTR expressions with just E, i.e.
SELECT DISTINCT data, level AS l, REGEXP_SUBSTR(data, 'E\d+', 1, level) AS num FROM test
CONNECT BY REGEXP_SUBSTR(data, 'E\d+', 1, level) IS NOT NULL
Output:
DATA E-numbers
HEADER|E0|E1|E2|E3|E4|E5 0,1,2,3,4,5
HEADER|N0|N1|N2|N3|N4|N5
HEADER|N1000|E1001|N1002|E1003|N1004|N1005 1001,1003
HEADER|N125
HEADER|N156|E1|N7|E122|N4|E5 1,122,5
A similar change can be made to extract numbers commencing with N.
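To see what the [NE]\d+ pattern picks up without an Oracle instance, here is a small Python sketch of the same extraction. It is only an illustration of the regex logic; the function name and the prefixes parameter are assumptions, not part of the answer's SQL.

```python
import re


def extract_numbers(data: str, prefixes: str = "NE") -> str:
    """Return the digits that follow any of the given prefix letters."""
    # Build a character class such as [NE] and capture only the digits.
    pattern = "[" + re.escape(prefixes) + r"](\d+)"
    return ",".join(re.findall(pattern, data))


row = "HEADER|N1000|E1001|N1002|E1003|N1004|N1005"
print(extract_numbers(row))        # all numbers
print(extract_numbers(row, "N"))   # only N-prefixed numbers
print(extract_numbers(row, "E"))   # only E-prefixed numbers
```

The letters in HEADER are never followed by a digit, so they do not produce false matches.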
Original Answer
One way to achieve your desired result is to replace a string of characters leading up to a number with that number and a comma, and then replace any characters from the last ,| to the end of string from the result:
SELECT REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|', '.*?(\d+)', '\1,'), ',?\|.*$', '') FROM dual
Output:
1000,1001,1002,1003,1004,1005
To only output the numbers beginning with N, we add that to the prefix string before the capture group:
SELECT REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|', '.*?N(\d+)', '\1,'), ',?\|.*$', '') FROM dual
Output:
1000,1002,1004,1005
To only output the numbers beginning with E, we add that to the prefix string before the capture group:
SELECT REGEXP_REPLACE(REGEXP_REPLACE('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|', '.*?E(\d+)', '\1,'), ',?\|.*$', '') FROM dual
Output:
1001,1003
I don't know what DBMS you are using, but here's one way to do it in Postgres:
WITH cte AS (
SELECT CAST('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|' AS VARCHAR(1000)) AS myValue
)
SELECT SUBSTRING(MyVal FROM 2)
FROM (
SELECT REGEXP_SPLIT_TO_TABLE(myValue,'\|') MyVal
FROM cte
) src
WHERE SUBSTRING(MyVal FROM 1 FOR 1) = 'N'
;
As far as I understand the question, you want to extract the substrings starting with N from the string. You can try the following, which uses SQL Server's STRING_SPLIT (you can then merge the output, separated by commas, if needed):
select REPLACE(value, 'N', '')
from STRING_SPLIT('HEADER|N1000|E1001|N1002|E1003|N1004|N1005|', '|')
where value like 'N%'
Output:
1000
1002
1004
1005

PostgreSQL count number of times substring occurs in text

I'm writing a PostgreSQL function to count the number of times a particular text substring occurs in another piece of text. For example, calling count('foobarbaz', 'ba') should return 2.
I understand that to test whether the substring occurs, I use a condition similar to the below:
WHERE 'foobarbaz' like '%ba%'
However, I need it to return 2 for the number of times 'ba' occurs. How can I proceed?
Thanks in advance for your help.
I would highly suggest checking out this answer I posted to "How do you count the occurrences of an anchored string using PostgreSQL?". The chosen answer was shown to be massively slower than an adapted version of regexp_replace(). The overhead of creating the rows and then running the aggregate is simply too high.
The fastest way to do this is as follows...
SELECT
(length(str) - length(replace(str, replacestr, '')) )::int
/ length(replacestr)
FROM ( VALUES
('foobarbaz', 'ba')
) AS t(str, replacestr);
Here we
Take the length of the string, L1
Subtract from L1 the length of the string with all of the replacements removed L2 to get L3 the difference in string length.
Divide L3 by the length of the replacement to get the occurrences
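The three steps above are easy to verify with a short Python sketch. This only illustrates the arithmetic and is not PostgreSQL code; count_occurrences is a made-up name, and like the SQL it counts non-overlapping matches.

```python
def count_occurrences(text: str, sub: str) -> int:
    """Count non-overlapping occurrences of sub using the length difference."""
    if not sub:
        raise ValueError("sub must not be empty")
    # L1 is the original length, L2 the length with every occurrence removed.
    l1 = len(text)
    l2 = len(text.replace(sub, ""))
    # Each removed occurrence shortens the string by len(sub) characters.
    return (l1 - l2) // len(sub)


print(count_occurrences("foobarbaz", "ba"))
```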
For comparison that's about five times faster than the method of using regexp_matches() which looks like this.
SELECT count(*)
FROM ( VALUES
('foobarbaz', 'ba')
) AS t(str, replacestr)
CROSS JOIN LATERAL regexp_matches(str, replacestr, 'g');
How about use a regular expression:
SELECT count(*)
FROM regexp_matches('foobarbaz', 'ba', 'g');
The 'g' flag repeats multiple matches on a string (not just the first).
There is a
str_count( src, occurrence )
function based on
SELECT (length( str ) - length(replace( str, occurrence, '' ))) / length( occurrence )
and a
str_countm( src, regexp )
based on the @MikeT-mentioned
SELECT count(*) FROM regexp_matches( str, regexp, 'g')
available here: postgres-utils
Try with:
SELECT array_length (string_to_array ('1524215121518546516323203210856879', '1'), 1) - 1
--RESULT: 7

Counting word lengths in a string

I am using an Oracle regular expression to extract the first letter of each word in a string. The results are returned in a single cell, with spaces representing hard breaks. Here is an example...
input:
'I hope that some kind person
browsing stack overflow
can help me'
output:
ihtskp bso chm
What I am trying to do next is count the length of each "word" in my output, like this:
6 3 3
Alternatively, a count of the words in each line of the original string would be acceptable, as it would yield the same result.
Thanks!
Count the number of spaces and add one:
select (length(your_col) - length(replace(your_col, ' '))+1) from your_table;
It will give you the number of words per line. From there you can get all counts on one line by using listagg function:
select LISTAGG(cnt,' ') within group (order by null) from (
select (length(a)-length(replace(a,' '))+1) cnt from (
select 'apa bpa bv' a from dual
union all
select 'n bb gg' a from dual
union all
select 'ff ff rr gg' a from dual))
group by null;
Perhaps you also need to split the strings if they contain newlines or are they split already?
I tried to edit my original post but the edit hasn't appeared. In the meantime I figured out a way to solve my issue: I break the words into rows, since I know how to count the characters in a row, and then reassemble the character counts into a single cell using listagg, ordered by the position of each word:
with my_string as (
select lvl, regexp_substr (words,'[0-9]+|[a-z]+|[A-Z]+',1,lvl) parsed
from (
select words, level lvl
from letters connect by level <= length(words) - length(replace(words,' ')) + 1)
)
select listagg(length(parsed),' ') within group (order by lvl) word_count
from my_string
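As a quick cross-check of the expected result, both approaches from this thread (measuring each "word" of the first-letter output, or counting the words on each line of the original text) can be sketched in Python. The function names are illustrative only and nothing here is Oracle-specific.

```python
def word_lengths(letters: str) -> str:
    """Length of each space-separated 'word', in original order."""
    return " ".join(str(len(word)) for word in letters.split())


def words_per_line(text: str) -> str:
    """Number of words on each non-empty line of the original text."""
    return " ".join(
        str(len(line.split())) for line in text.splitlines() if line.strip()
    )


original = "I hope that some kind person\nbrowsing stack overflow\ncan help me"
print(word_lengths("ihtskp bso chm"))
print(words_per_line(original))
```

Both calls should agree, since each letter in the first-letter output stands for one word of the corresponding line.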