How do we take a substring starting from the end of a string in Postgres? In Oracle we can pass the occurrence number of the pattern and extract the expected value. Postgres does not seem to have such an option.
I tried the substring(), left() and right() functions, but none of them worked. Any suggestion would be helpful.
Value in the column,
col1
100~500~~~~Bangalore~~~~KA~null~Train
Expected result,
Train
To get the characters after the last ~ you can use substring() with a regex:
substring(col1 from '~([^~]+)$')
Or get the position of the last ~ by using the reverse function:
right(col1, strpos(reverse(col1), '~') - 1)
A more general approach is to convert the string to an array, then pick the last array element:
(string_to_array(col1, '~'))[cardinality(string_to_array(col1, '~'))]
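To check the logic of the three expressions above outside the database, here is a minimal Python sketch. The function names are hypothetical, and returning the whole value when no `~` is present is an assumption of this sketch, not Postgres behaviour.

```python
import re

SAMPLE = "100~500~~~~Bangalore~~~~KA~null~Train"

def last_token_regex(value):
    # Same pattern as substring(col1 from '~([^~]+)$'):
    # capture the run of non-~ characters at the very end.
    match = re.search(r"~([^~]+)$", value)
    return match.group(1) if match else None

def last_token_reverse(value):
    # Same idea as right(col1, strpos(reverse(col1), '~') - 1):
    # find the first ~ in the reversed string, then take that many
    # characters from the right-hand side of the original.
    idx = value[::-1].find("~")
    if idx == -1:
        return value  # assumption: no separator means one single token
    return value[len(value) - idx:]

def last_token_split(value):
    # Same idea as picking the last element of string_to_array(col1, '~').
    return value.split("~")[-1]

print(last_token_regex(SAMPLE), last_token_reverse(SAMPLE), last_token_split(SAMPLE))
```

In this sketch the approaches differ when the value ends with `~`: the regex version finds no match, while the other two return an empty string.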
The best solution to this sort of problem is to not store multiple values delimited by some character in a single column. If you really need to de-normalize, using arrays or JSON would at least be a bit more flexible (and robust).
I need to create a View on top of a Hive Table, masking data in a particular column.
The table has a column of type String. The data in that column is JSON. I need to mask the value of a particular field, say 'ip_address':
{"id":1,"first_name":"john","last_name":"doe","email":"sample#123.com","ip_address":"111.111.111.111"}
expected:
{"id":1,"first_name":"john","last_name":"doe","email":"sample#123.com","ip_address":null}
These are the built-in Hive functions I have tried; none of them seem to help:
mask
get_json_object
STR_TO_MAP
if clause
Also, I don't think substring and regexp_extract are useful here because the position of the field value is not always predetermined, and I'm not familiar with regular expressions.
PS: Any help that lets me avoid writing a new UDF is appreciated.
Use regexp_replace:
select regexp_replace(column_name,'"ip_address":".*?"', '"ip_address":null') as column_name
This works regardless of where the field appears in the JSON.
You can also allow any number of optional spaces before and after the colon:
regexp_replace(column_name,'"ip_address" *: *".*?"', '"ip_address":null')
Regexp '"ip_address" *: *".*?"' meaning:
"ip_address" - literally "ip_address"
' *' - 0 or more spaces (allowed in JSON)
: - literally :
' *' - 0 or more spaces
".*?" - any number of any characters (non-greedy) inside double-quotes.
See also similar question if you want to replace value with some calculated value, for example obfuscate using sha256, not with just null: https://stackoverflow.com/a/54179543/2700344
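The pattern can be tried outside Hive as well. The following is a minimal Python sketch using the same regular expression. `mask_ip` is a hypothetical helper name, and the sketch assumes the field value contains no escaped double quotes (true for IP addresses).

```python
import json
import re

# Same pattern as in the regexp_replace call, allowing spaces around the colon.
IP_PATTERN = re.compile(r'"ip_address" *: *".*?"')

def mask_ip(doc):
    # Replace the quoted ip_address value with a JSON null.
    return IP_PATTERN.sub('"ip_address":null', doc)

row = ('{"id":1,"first_name":"john","last_name":"doe",'
       '"email":"sample#123.com","ip_address":"111.111.111.111"}')
masked = mask_ip(row)
print(masked)

# The result is still valid JSON, with only ip_address changed.
parsed = json.loads(masked)
```

Because the match is anchored on the key name rather than on a position, the field can sit anywhere in the document.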
Is there a way to obtain only the file extension excluding query parameters using split_part and reverse from an SQL query?
ie.
www.example.com?hffhqowhf
or
test.jpg?34rfqeyfhhf
Returns:
com
jpg
Not tied down to com or jpg but in general?
Many Thanks
There are a number of ways of achieving this (for example using a combination of INSTR and SUBSTR functions) but the cleanest way is probably to use a regular expression, something like this:
(Assumption: the string always has a query parameter starting with a '?', and that is the only occurrence of this character in the string.
Caveat: I don't currently have access to Impala, so you may need to adjust the regular expression to get it to work precisely as you require.)
Reverse the string (REVERSE function) - so that the substring you want is between the '?' and the next '.'. If you don't reverse the string it is harder to identify which '.' in the string you are dealing with
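Extract the substring between '?' and '.', excluding these two characters. The '?' is a regex quantifier, so it has to be matched literally, for example by putting it in a character class:
select regexp_extract(reverse('www.example.com?hffhqowhf'),'[?]([^.]+)',1);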
Reverse the output again to get the required result
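The three steps can be checked with a short Python sketch. `extension` is a hypothetical name, and the sketch keeps the stated assumption that the string contains exactly one '?'.

```python
import re

def extension(url):
    # Step 1: reverse the string so the extension follows the '?'.
    reversed_url = url[::-1]
    # Step 2: capture everything between the literal '?' and the next '.'.
    match = re.search(r"\?([^.]+)", reversed_url)
    if match is None:
        return None  # no query string, so the assumption does not hold
    # Step 3: reverse the captured text back into reading order.
    return match.group(1)[::-1]

for sample in ("www.example.com?hffhqowhf", "test.jpg?34rfqeyfhhf"):
    print(sample, "->", extension(sample))
```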
I have a collection of strings and want to filter out those where the last four characters are: (alpha)(alpha)(number)(number).
I know I can take a substring of each of these and check the characters separately, but what is the method to determine the types of the characters in the sequence?
This is for SQL in Hive.
You can use regular expressions. Something like:
where col regexp '[a-zA-Z]{2}[0-9]{2}$'
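To see which values that pattern accepts, here is a small Python sketch. It uses `re.search`, on the assumption that Hive's regexp operator also matches anywhere in the string, which is why the `$` anchor is needed. The function name is hypothetical.

```python
import re

# Two ASCII letters followed by two digits, anchored at the end of the string.
TAIL_PATTERN = re.compile(r"[a-zA-Z]{2}[0-9]{2}$")

def ends_with_alpha_alpha_digit_digit(value):
    return TAIL_PATTERN.search(value) is not None

for value in ("itemAB12", "ab12", "item1B12", "AB123"):
    print(value, ends_with_alpha_alpha_digit_digit(value))
```

Only the last four characters matter: "AB123" is rejected because its last four characters are "B123".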
I am running the query below in Teradata:
sel requesttext from dbc.tables
where tablename='old_employee_table'
Result:
alter table DB_NAME.employee_table,no fallback ;
I want to get the result below using SQL:
DB_NAME.employee_table
Requesttext can be:
create set table DB_NAME.employee_table;
The database name and table name can occur anywhere in the result. Since a . (dot) joins them, I want to split on the dot.
Basically, I need SQL that returns the values on either side of the dot.
I want both the database name and the table name in the result.
I'm not a Teradata person, but this should work for both strings given so far, as long as teradata's regexp_substr() supports positive look-behind and positive look-ahead assertions (I might have the Teradata syntax wrong, so a little tweaking may be needed):
SELECT REGEXP_SUBSTR(requesttext, '(?<= )(\w+\.\w+)(?=[,$]?)', 1, 1)
FROM dbc.tables
WHERE tablename='old_employee_table'
See the regex101 example. Hopefully it translates to Teradata easily.
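The regex looks for and returns the words on either side of the period, including the period itself, when they are preceded by a space. The look-ahead is meant to require a following comma or end of line, but inside a character class $ is a literal dollar sign and the trailing ? makes the whole class optional, so the look-ahead never rejects a match; the look-behind and \w+\.\w+ do the real work.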
You could do this with either regexp_substr() or strtok().
As Jamie Zawinski said:
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems.
So I would go with the strtok() method. Also I'm lazy and regular expressions are hard.
Function strtok() takes three arguments:
The string being split
The delimiter to split the string
The number of the token to grab.
To get at the <database>.<table> from that string that is returned in your query, we can split by a space, grab the third token, then split that by a comma and grab the first token.
That would look like:
SELECT strtok(strtok(requestText,' ',3),',',1)
FROM dbc.tables
WHERE tablename='old_employee_table'
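The nested call can be traced with a small Python stand-in. `strtok_like` is a hypothetical helper, and treating the delimiter argument as a set of characters and skipping empty tokens is an assumption about how strtok() behaves.

```python
import re

def strtok_like(text, delimiters, n):
    # Split on any of the delimiter characters, drop empty pieces,
    # and return the n-th token (1-based), or None if there is none.
    tokens = [t for t in re.split("[" + re.escape(delimiters) + "]", text) if t]
    return tokens[n - 1] if 1 <= n <= len(tokens) else None

def db_and_table(request_text):
    # Third space-separated token, then the part before the first comma.
    third = strtok_like(request_text, " ", 3)
    return strtok_like(third, ",", 1) if third is not None else None

print(db_and_table("alter table DB_NAME.employee_table,no fallback ;"))
print(db_and_table("create set table DB_NAME.employee_table;"))
```

The second call shows the limit of a fixed token position: for the "create set table" form the third token is the word "table", so the position would have to change for that statement.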
In Amazon Redshift tables, I have a string column from which I need to extract numbers only out. For this currently I use
translate(stringfield, '0123456789'||stringfield, '0123456789')
I was trying out the REPLACE function, but it is not going to be elegant.
Any thoughts on converting the string into ASCII first and then doing some operation to extract only the numbers? Or any other alternatives?
This is hard because Redshift does not support functions and is missing a lot of traditional functions.
Edit:
I am trying out the query below, but it only returns 051-a92, whereas I need 05192 as output. I am thinking of substring etc., but I only have regexp_substr available right now. How do I get rid of any characters in between?
select REGEXP_SUBSTR('somestring-051-a92', '[0-9]+..[0-9]+', 1)
This might be late, but I was solving the same problem and finally came up with this:
select REGEXP_replace('somestring-051-a92', '[a-z/-]', '')
Alternatively, you can now create a Python UDF.
Typically your inputs will conform to some sort of pattern that can be used to do the parsing using SUBSTRING() with CHARINDEX() { aka STRPOS(), POSITION() }.
E.g. find the first hyphen and the second hyphen and take the data between them.
If not (and assuming your character range is limited to ASCII) then your best bet would be to nest 26+ REPLACE() functions to remove all of the standard alpha characters (and any punctuation as well).
If you have multibyte characters in your data though then this is a non-starter.
A better method is to remove all the non-numeric characters:
select REGEXP_replace('somestring-051-a92', '[^0-9]', '')
You can specify "any non digit" that includes non-printable, symbols, alpha, etc.
e.g., regexp_replace('brws--A*1','[\D]')
returns
"1"
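The difference between removing a list of unwanted characters and removing everything that is not a digit can be seen in a short Python sketch. The function names are hypothetical, and Python's `re` module stands in for Redshift's regex engine here.

```python
import re

def strip_listed_chars(value):
    # Same character class as '[a-z/-]': lowercase letters, '/' and '-'.
    return re.sub(r"[a-z/-]", "", value)

def digits_only(value):
    # Same character class as '[^0-9]': anything that is not an ASCII digit.
    return re.sub(r"[^0-9]", "", value)

for sample in ("somestring-051-a92", "brws--A*1"):
    print(sample, strip_listed_chars(sample), digits_only(sample))
```

Both versions agree on the first sample, but the listed-characters version leaves "A*1" for the second one, because uppercase letters and symbols are not in its class.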