Convert a string to 'Proper' casing - sql

Is there a way to convert a string to 'Proper' casing? I'm using the Excel definition of 'Proper' which will format text such that the first letter of any word is capitalized and the remaining letters are lower case.
Sample Inputs | Outputs
I browsed the string functions and operators in the Presto documentation and it seems like this isn't possible, but I'm hoping someone here can prove me wrong!

You can use regexp_replace with a lambda expression to turn a string into title case:
select regexp_replace('Hell asdasd QWEEQ aWQW', '(\w)(\w*)', x -> upper(x[1]) || lower(x[2]));
Output:
_col0
Hell Asdasd Qweeq Awqw
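If you want to check what the pattern does outside Presto, here is a minimal Python sketch of the same idea: capture the first word character and the rest of the word separately, then upper-case the first group and lower-case the second. The function name to_proper is made up for illustration, and Python's \w may not treat every non-ASCII character the same way Presto's does.

```python
import re


def to_proper(text):
    # (\w) captures the first character of each word, (\w*) the rest of it.
    return re.sub(
        r"(\w)(\w*)",
        lambda m: m.group(1).upper() + m.group(2).lower(),
        text,
    )


print(to_proper("Hell asdasd QWEEQ aWQW"))
```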


What's the best way to 'normalize' a string in Redshift?

Since my texts are in Portuguese, there are many words with accents and other special characters, like: "coração", "hambúrguer", "São Paulo".
Normally, I treat these names in Python with the following function:
from unicodedata import normalize

def string_normalizer(text):
    result = normalize("NFKD", text.lower()).encode("ASCII", "ignore").decode("ASCII")
    return result.replace(" ", "-")
This replaces blank spaces with '-', strips accents and other special characters, and converts the text to lowercase. The word "coração" would become "coracao", "São Paulo" would become "sao-paulo", and so on. Now, I'm not sure what's the best way to do this in Redshift. My solution would be to apply multiple replaces, something like this:
replace(replace(replace(lower(column), 'á', 'a'), 'ç', 'c')...
Even though this works, it doesn't look like the best solution. Is there an easy way to normalize my string?
In Redshift, you can use the translate function to normalize a string. The translate function takes three arguments: the source string, the characters to replace, and the replacement characters. You can use this function to replace all the special characters in your string with their ASCII equivalent.
For example, the following query uses the translate function to replace all the special characters in a string with their ASCII equivalent. Additionally, spaces are replaced with "-" characters.
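SELECT translate('São Paulo', ' áàãâäéèêëíìîïóòõôöúùûüçÁÀÃÂÄÉÈÊËÍÌÎÏÓÒÕÖÔÚÙÛÜÇ', '-aaaaaeeeeiiiiooooouuuucAAAAAEEEEIIIIOOOOOUUUUC')
This query returns the string "Sao-Paulo", because the leading space in the second argument maps to "-". The two character lists must line up position by position, so each accented character needs exactly one replacement at the same index. You can wrap the call in the lower function to convert the result to lowercase.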
Here's an example of how you could use these functions together to normalize a string:
SELECT lower(translate('São Paulo', ' áàãâäéèêëíìîïóòõôöúùûüçÁÀÃÂÄÉÈÊËÍÌÎÏÓÒÕÖÔÚÙÛÜÇ', '-aaaaaeeeeiiiiooooouuuucAAAAAEEEEIIIIOOOOOUUUUC'))
This query would return the string "sao-paulo".
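Because translate pairs characters by position, it is easy for the two long literals to drift out of alignment. One way to guard against that is to generate both arguments from per-letter groups, sketched here in Python. The names GROUPS, build_arguments and normalize_like_sql are made up for illustration, and the sketch assumes the accented characters are stored in precomposed (NFC) form.

```python
# Hypothetical grouping: ASCII target letter -> accented variants to replace.
GROUPS = {
    "a": "áàãâä",
    "e": "éèêë",
    "i": "íìîï",
    "o": "óòõôö",
    "u": "úùûü",
    "c": "ç",
}


def build_arguments(groups):
    """Return (from_chars, to_chars) with matching positions for translate."""
    from_chars, to_chars = " ", "-"  # spaces become hyphens
    for target, accented in groups.items():
        # Lowercase variants first, then their uppercase counterparts.
        from_chars += accented + accented.upper()
        to_chars += target * len(accented) + target.upper() * len(accented)
    return from_chars, to_chars


from_chars, to_chars = build_arguments(GROUPS)


def normalize_like_sql(text):
    # Mirrors lower(translate(text, from_chars, to_chars)) on the Python side.
    return text.translate(str.maketrans(from_chars, to_chars)).lower()


print(normalize_like_sql("São Paulo"))
```

The generated from_chars and to_chars strings can then be pasted into the SQL call as its second and third arguments.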

How to get URL-friendly string in Oracle?

I'm looking for a way to get URL-friendly strings. I'm already handling the special characters (?;:!?./§*$^¨£µ...) properly.
However, I'm struggling with Latin letters that have accents, like: ÄÊÍÕàùã...
For a string like
ÄÊÍÕABCDEàùã
I'm expecting
AEIOABCDEaua
I tried :
SELECT CONVERT('ÄÊÍÕABCDEàùã', 'US7ASCII', 'AL32UTF8') FROM DUAL;
But it returns
AEI?ABCDEau?
It's ignoring some of the characters (Õ,ã). I tried all character sets detailed here but none of them converted all the string characters properly.
Is there a way to convert all Latin letters to their corresponding unaccented form?
Thanks
Cheers,
Answer :
select utl_raw.cast_to_varchar2((nlssort('ÄÊÍÕABCDEàùã', 'nls_sort=binary_ai'))) from dual
returns (all lowercase, since binary_ai ignores case as well as accents):
aeioabcdeaua
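If the original case has to survive, as in the expected output from the question, one option is to strip the accents on the application side instead. This is only a sketch of that alternative in Python, not something Oracle does for you; strip_accents is a made-up name, and characters without an ASCII decomposition are simply dropped.

```python
import unicodedata


def strip_accents(text):
    # NFKD splits each accented letter into a base letter plus combining marks;
    # encoding to ASCII with "ignore" then discards the marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("ÄÊÍÕABCDEàùã"))
```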

Parse string with `T` to timestamp PostgreSQL

I have this string 2019-02-14T17:49:20.987 which I want to parse into a timestamp. So I am playing with the to_timestamp function and it seems to work fine except... The problem is with this T letter there. How do I make PostgreSQL skip it?
What pattern should I use in to_timestamp?
Of course I can replace the T with a space and then parse it but I find this approach too clumsy.
Quote from the manual
If there are characters in the template string that are not template patterns, the corresponding characters in the input data string are simply skipped over (whether or not they are equal to the template string characters).
So just put any non-template character there (e.g. X):
select to_timestamp('2019-02-14T17:49:20.987', 'YYYY-MM-DDXHH24:MI:SS.MS')
Online example: https://rextester.com/OHYD18205
Alternatively, you can simply cast the value:
select '2019-02-14T17:49:20.987'::timestamp
The string with T is a valid input literal for timestamp or timestamptz:
select '2019-02-14T17:49:20.987'::timestamp;
timestamp
-------------------------
2019-02-14 17:49:20.987
(1 row)
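For comparison, here is how the same string parses on the client side in Python. This is just a sketch for cross-checking values. Unlike the to_timestamp template above, strptime requires the literal T in the format to match the input exactly.

```python
from datetime import datetime

raw = "2019-02-14T17:49:20.987"

# Explicit format: the T is a literal, %f accepts the fractional seconds.
parsed_explicit = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f")

# ISO 8601 shortcut, comparable to casting the literal in SQL.
parsed_iso = datetime.fromisoformat(raw)

print(parsed_explicit, parsed_iso)
```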

How to cut text from a specific character in an SQLite query

SQLITE Query question:
I have a query which returns string with the character '#' in it.
I would like to remove all characters after this specific character '#':
select field from mytable;
result :
text#othertext
text2#othertext
text3#othertext
So in my sample I would like to create a query which only returns :
text
text2
text3
I tried something with instr() to get the index, but instr() was not recognized as a function -> SQL Error: no such function: instr (probably because the db version is old; sqlite_version() -> 3.7.5).
Any hints on how to achieve this?
There are two approaches:
1. You can rtrim the string of all characters other than the # character, and then rtrim the # itself.
This assumes, of course, that (a) there is only one # in the string; and (b) that you're dealing with simple strings (e.g. 7-bit ASCII) in which it is easy to list all the characters to be stripped.
2. You can use sqlite3_create_function to create your own rendition of INSTR. The specifics here will vary a bit upon how you're using
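Both approaches can be tried out with a small Python script, sketched below under a few assumptions: the table and rows are the ones from the question, the text after # contains only ASCII letters and digits, and every row contains a #. Python's sqlite3 module exposes sqlite3_create_function as Connection.create_function; the function name my_instr is made up so it cannot be confused with the built-in instr of newer SQLite versions.

```python
import sqlite3
import string

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (field TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?)",
    [("text#othertext",), ("text2#othertext",), ("text3#othertext",)],
)

# Approach 1: strip every listed character from the right (this stops at '#'),
# then strip the '#' itself.
strip_chars = string.ascii_letters + string.digits
via_rtrim = [
    row[0]
    for row in conn.execute(
        "SELECT rtrim(rtrim(field, ?), '#') FROM mytable ORDER BY rowid",
        (strip_chars,),
    )
]


# Approach 2: register a user-defined replacement for INSTR.
def my_instr(haystack, needle):
    if haystack is None or needle is None:
        return None
    return haystack.find(needle) + 1  # 1-based position, 0 when not found


conn.create_function("my_instr", 2, my_instr)
via_udf = [
    row[0]
    for row in conn.execute(
        "SELECT substr(field, 1, my_instr(field, '#') - 1) "
        "FROM mytable ORDER BY rowid"
    )
]

print(via_rtrim, via_udf)
```

Note that the rtrim variant would strip a row down to nothing if it contained no # at all, so it is only safe when the separator is guaranteed to be present.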

substring extraction in HQL

There's a URL field in my Hive DB that is of string type with this specific pattern:
/Cats-g294078-o303631-Maine_Coon_and_Tabby.html
and I would like to extract the two Cat "types" near the end of the string, with the result being something like:
mainecoontabby
Basically, I'd like to only extract - as one lowercase string - the Cat "types" which are always separated by '_and_', preceded by '-', and followed by '.html'.
Is there a simple way to do this in HQL? I know HQL has limited functionality, otherwise I'd be using regexp or substring or something like that.
Thanks,
Clark
HQL does have a substr function as cited here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
It returns the part of a string from a given start position to the end (or for a given length).
I'd also use the locate function to determine the positions of the '-' and '_' characters in the URL.
As long as there are always three dashes and three underscores, this should be pretty straightforward.
Otherwise, you might need case statements to handle a varying number of dashes and underscores.
solution here...
LOWER(REGEXP_REPLACE(SUBSTRING(catString, LOCATE('-', catString, 19)+1), '(_to_)|(\.html)|_', ''))
Interestingly, the following did NOT work... JJFord3, any idea why?
LOWER(REGEXP_EXTRACT(SUBSTRING(FL.url, LOCATE('-', FL.url, 19)+1), '[^(_to_)|(\.html)|_]', 0))
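A likely reason the second attempt fails: inside square brackets, the parentheses, pipes and words are not groups and alternatives any more. The whole thing becomes a negated character class that matches a single character, so the extract returns only the first character that is not in the set. The Python sketch below shows the difference on the sample URL from the question. It assumes the separator is '_and_' as in that sample, and that Python's re behaves like Hive's Java regex for these two patterns.

```python
import re

url = "/Cats-g294078-o303631-Maine_Coon_and_Tabby.html"
tail = url[url.rfind("-") + 1:]  # "Maine_Coon_and_Tabby.html"

# Alternation: remove "_and_", ".html", or any leftover "_".
cleaned = re.sub(r"_and_|\.html|_", "", tail).lower()

# Bracketed version: a negated character class, so it matches ONE character
# that is none of ( _ t o ) | . h m l
first_match = re.search(r"[^(_to_)|(\.html)|_]", tail).group(0)

print(cleaned, first_match)
```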