Community,
I need assistance with removing the UNDER SCORES '_' and make the name readable first name letter UpperCase last name UpperCase, while removing the number as well. Hope this makes sense. I am running Presto and using Query Fabric. I there a better way to write this syntax?
Email Address
Full_Metal_Jacket#movie.com
TOP_GUN2#movie.email.com
Needed Outcome
Full Metal Jacket
Top Gun
Partical working Resolution:
,REPLACE(SPLIT_PART(T.EMAIL, '#', 1),'_',' ') Name
Something like this:
,LOWER(REPLACE(UPPER(SPLIT_PART(T.EMAIL, '#', 1)),'_',' '))Name
Try this:
WITH t(email) AS (
VALUES 'Full_Metal_Jacket#movie.com', 'TOP_GUN2#movie.email.com'
)
SELECT array_join(
transform(
split(regexp_extract(email, '(^[^0-9#]+)', 1), '_'),
part -> upper(substr(part, 1, 1)) || lower(substr(part, 2))),
' ')
FROM t;
How it works:
extract the non-numeric prefix up to the # using a regex via regexp_extract
split the prefix on _ to produce an array
transform the array by capitalizing the first letter of each element and lowercasing the rest.
Finally, join them all together with a space using the array_join function.
Update:
Here's another variant without involving transform and the intermediate array:
regexp_replace(
replace(regexp_extract(email, '(^[^0-9#]+)', 1), '_', ' '),
'(\w)(\w*)',
x -> upper(x[1]) || lower(x[2]))
Like the approach above, it first extracts the non-numeric prefix, then it replaces underscores with spaces with the replace function, and finally, it uses regexp_replace to process each word. The (\w)(\w*) regular expression captures the first letter of the word and the rest of the word into two separate capture groups. The x -> upper(x[1]) || lower(x[2]) lambda expression then capitalizes the first letter (first capture group -- x[1]) and lower cases the rest (second capture group -- x[2]).
Related
I have data in table:
id
question
1
1.1 Covid-19 [cases]
2
1.1 Covid-19 [deaths]
I want to split the data into columns. To get below output:
id
questionid
question_name
sub_question_name
1
1.1
Covid-19
cases
2
1.1
Covid-19
deaths
Is any function to get above output.?
One way of doing this is using the much useful PostgreSQL SPLIT_PART function, which allows you to split on a character (in your specific case, the space). As long as you don't need brackets for the last field, you may split on the open bracket and remove the last bracket with the RTRIM function.
SELECT id,
SPLIT_PART(question, ' ', 1) AS questionid,
SPLIT_PART(question, ' ', 2) AS question_name,
RTRIM(SPLIT_PART(question, '[', 2), ']') AS sub_question_name
FROM tab
Check the demo here.
You can deepen your understanding of these functions on PostgreSQL official documentation related to the string functions.
EDIT: For a more advanced matching, you should consider using regex and PostgreSQL pattern matching:
SELECT id,
(REGEXP_MATCHES(question, '^[\d\.]+'))[1] AS questionid,
(REGEXP_MATCHES(question, '(?<= )[^[]+'))[1] AS question_name,
(REGEXP_MATCHES(question, '(?<=\[).*(?=\]$)'))[1] AS sub_question_name
FROM tab
Regex for questionid Explanation:
^: start of string
[\d\.]+: any existing combination of digit and dots
Regex for question_name Explanation:
(?<= ): positive lookbehind that matches a space before the match
[^[]+: any existing combination of any character other than [
Regex for sub_question_name Explanation:
(?<=\[): positive lookbehind that matches an open bracket before the match
.*: any character
(?=\]$): positive lookahead that matches a closed bracket after the match
Check the demo here.
You can also use regexp_replace, in this example, the regexp_replace will replace the square brackets (first and third groups) group 1 -> ^(\[), group 3 -> (\])$ by the second group (.*).
the third argument \2 in the end of the function indicates what group should remain in the text.
select
id,
split_part(question, ' ', 1) p1,
split_part(question, ' ', 2) p2,
regexp_replace(split_part(question, ' ', 3), '^(\[)(.*)(\])$', '\2') p3
from
covid;
Here is the example
I have a SQL query which returns some rows having the below format:
DB_host
DB_host_instance
How can i filter to get rows which only have the format of 'DB_host' (place a condition to return values with only one occurrence of '_')
i tried using [0-9a-zA-Z_0-9a-zA-Z], but seems like its not right. Please suggest.
One option would be using REGEXP_COUNT and at most one underscore is needed then use
WHERE REGEXP_COUNT( col, '_' ) <= 1
or strictly one underscore should exist then use
WHERE REGEXP_COUNT( col, '_' ) = 1
A simple method is a regular expression:
where regexp_like(col, '^[^_]+_[^_]+$')
This matches the full string when there is a string with no underscores followed by an underscore followed by another string with no underscores.
You could also do this with LIKE, but it is more complicated:
where col like '%\_%' and col not like '%\_%\_%'
That is, has one underscore but not two. The \ is needed because _ is a wildcard for LIKE patterns.
You can suppress underscores in the string, and ensure that the length of the result is just one character less than the original:
where len(replace(col, '_', '')) = len(col) - 1
I wonder how this method would compare to a regex or two likes in terms of efficiency on a large dataset. I would not be surprised it it was more efficient.
I would just like to know where do I put the \g in this query?
SELECT project,
SUBSTRING(address FROM 'A-Za-z') AS letters,
SUBSTRING(address FROM '\d') AS numbers
FROM repositories
I tried this but this brings back nothing (it doesn't throw an error though)
SELECT project,
SUBSTRING(CONCAT(address, '#') FROM 'A-Za-z' FOR '#') AS letters,
SUBSTRING(CONCAT(address, '#') FROM '\d' FOR '#') AS numbers
FROM repositories
Here is an example: I would like the string 1DDsg6bXmh3W63FTVN4BLwuQ4HwiUk5hX to return DDsgbXmhWFTVNBLwuQHwiUkhX. So basically return all the letters...and then my second one is to return all the numbers.
The g (“global”) modifier in regular expressions indicates that all matches rather than only the first one should be used.
That doesn't make much sense in the substring function, which returns only a single value, namely the first match. So there is no way to use g with substring.
In those functions where it makes sense in PostgreSQL (regexp_replace and regexp_matches), the g can be specified in the optional last flags parameter.
If you want to find all substrings that match a pattern, use regexp_matches.
For your example, which really has nothing to do with substring at all, I'd use
SELECT translate('1DDsg6bXmh3W63FTVN4BLwuQ4HwiUk5hX', '0123456789', '');
translate
---------------------------
DDsgbXmhWFTVNBLwuQHwiUkhX
(1 row)
So this is not pure SQL but Postgresql, but this also does the job:
SELECT project,
regexp_replace(address, '[^A-Za-z]', '', 'g') AS letters,
regexp_replace(address, '[^0-9]', '', 'g') AS numbers
FROM repositories;
Im using regexp to find the text after a word appear.
Fiddle demo
The problem is some address use different abreviations for big house: Some have space some have dot
Quinta
QTA
Qta.
I want all the text after any of those appear. Ignoring Case.
I try this one but not sure how include multiple start
SELECT
REGEXP_SUBSTR ("Address", '[^QUINTA]+') "REGEXPR_SUBSTR"
FROM Address;
Solution:
I believe this will match the abbreviations you want:
SELECT
REGEXP_REPLACE("Address", '^.*Q(UIN)?TA\.? *|^.*', '', 1, 1, 'i')
"REGEXPR_SUBSTR"
FROM Address;
Demo in SQL fiddle
Explanation:
It tries to match everything from the begging of the string:
until it finds Q + UIN (optional) + TA + . (optional) + any number of spaces.
if it doesn't find it, then it matches the whole string with ^.*.
Since I'm using REGEXP_REPLACE, it replaces the match with an empty string, thus removing all characters until "QTA", any of its alternations, or the whole string.
Notice the last parameter passed to REGEXP_REPLACE: 'i'. That is a flag that sets a case-insensitive match (flags described here).
The part you were interested in making optional uses a ( pattern ) that is a group with the ? quantifier (which makes it optional). Therefore, Q(UIN)?TA matches either "QUINTA" or "QTA".
Alternatively, in the scope of your question, if you wanted different options, you need to use alternation with a |. For example (pattern1|pattern2|etc) matches any one of the 3 options. Also, the regex (QUINTA|QTA) matches exactly the same as Q(UIN)?TA
What was wrong with your pattern:
The construct you were trying ([^QUINTA]+) uses a character class, and it matches any character except Q, U, I, N, T or A, repeated 1 or more times. But it's applied to characters, not words. For example, [^QUINTA]+ matches the string "BCDEFGHJKLMOPRSVWXYZ" completely, and it fails to match "TIA".
I have a string that appears as:
00012345678 Rain, Kip
I would like to filter out the first numbers/integers, then re-arrange the first and last name.
Kip Rain
I was thinking that I could do INSTR({string},',','1') to get to the first comma, but I am unsure how to do both numbers and punctuation in one line. Would I have to chain the INSTR?
Thanks for your help!
You can chain them; but with complicated things this quickly becomes confusing to work out what's happening. Unless you have demonstrable performance concerns it's often quicker to use regular expressions. In this case, it's probably easiest to use REGEXP_REPLACE()
select regexp_replace(your_string
, '[^[:alpha:]]+([[:alpha:]]+)[^[:alpha:]]+([[:alpha:]]+)'
, '\2 \1')
from ...
The second parameter is the match string; in this case we're searching for everything that is not an alphabetic character ([^[:alpha:]]) 1 or more times (+), followed by alphabetic characters ([[:alpha:]]) 1 or more times. This is repeated to take into account the spaces and comma; and would match your string as follows:
|string | matched by |
+--------------+----------------+
|'00012345678 '| [^[:alpha:]]+ |
|'Rain' | ([[:alpha:]]+) |
|', ' | [^[:alpha:]]+ |
|'Kip' | ([[:alpha:]]+) |
The parenthesis here represent groups; the first set the first group etc...
The third parameter of REGEXP_REPLACE() tells Oracle what to replace your string with; this where the groups come in - you can replace groups in any order. In this instance I want the second group (Kip), followed by a space, followed by the first group (Rain).
You can see this all demonstrated in this SQL Fiddle
Yes, it is alright to chain them:
substr(str, 1, instr(str, ' ')) number_part
substr(str, instr(str, ' '), instr(str, ',') - instr(str, ' ')) Kip
substr(str, instr(str, ' ', 2), len(str)) Rain
In last example you may use something more preceise than len(str) if your string is longer.
I am biased towards using the regular expression variation of the substr function.
First obtain a repeating list of non-numeric characters as follows:
REGEXP_SUBSTR('00012345678 Rain, Kip','([[:alpha:]]|[-])+',1,1)
where [[:alpha:]] is a character class where all alphabetic characters are included.
The bracketed expression, [-], is just a matching list which is my way of identifying that the last name, Rain, could include a hyphen. The alternation operator, '|', states that either the alphabetic or hyphen characters are acceptable.
The '+' indicates that we are looking to match one or more occurrences.
Second, obtain the last non-numeric characters at the end of the string:
REGEXP_SUBSTR('00012345678 Rain, Kip','[^, ]+$',1,1)
Here, I am going to the end of the string (using the anchor, '$'), and find all character after the comma and space.
Next I combine (with a space in between) using the concatenator operator, ||.
REGEXP_SUBSTR('00012345678 Rain, Kip','[^, ]+$',1,1) ||' ' || REGEXP_SUBSTR('00012345678 Rain, Kip','([[:alpha:]]|[ -])+',1,1)