Impala/Hive function to get the substring of a string - hive

I am using regexp_extract to get a substring from a string.
My string is ":abd: 576892034 :erg: 94856023MXCI :oute: A RF WERS YUT :oowpo: 649217349GBT GB"
How can I get this using the regexp_extract function?
I need 576892034 if I pass :abd:

Try:
REGEXP_EXTRACT('your string', ':abd: ([^:]+)', 1)
The regexp :abd: ([^:]+) means: match ':abd: ' followed by one or more characters that are not ':'.
This regexp assumes that ':' does not appear within the "value" strings. As such, it would fail on this input:
:abd: 5768:92034 :erg: 94856023MXCI :oute: A RF WERS YUT :oowpo: 649217349GBT GB
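
The behaviour of this pattern can be checked outside Hive. Below is a minimal sketch in Python, whose `re` module treats this particular pattern the same way as the Java regex engine behind Hive's REGEXP_EXTRACT. The helper name `extract_tag` is made up for illustration. Note that `[^:]+` also captures the space before the next tag, so the sketch trims the result.

```python
import re

def extract_tag(text, tag):
    """Return the value that follows ':tag: ', up to the next ':' (hypothetical helper)."""
    match = re.search(':' + re.escape(tag) + r': ([^:]+)', text)
    # [^:]+ also swallows the space before the next tag, so strip it.
    return match.group(1).strip() if match else None

s = ":abd: 576892034 :erg: 94856023MXCI :oute: A RF WERS YUT :oowpo: 649217349GBT GB"
print(extract_tag(s, "abd"))    # 576892034
print(extract_tag(s, "oute"))   # A RF WERS YUT

# A ':' inside the value cuts the match short, as described above.
bad = ":abd: 5768:92034 :erg: 94856023MXCI"
print(extract_tag(bad, "abd"))  # 5768
```

In Hive or Impala the same trimming can be done by wrapping the call in TRIM(...).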

Related

REGEXP_EXTRACT in hive or Impala to extract a Substring

Hi, I am new to Hive. I am using regexp_extract to get a substring from a string.
My fixmessagestr is 10123=TICKET~}|167=CS~}|1=XTL9911~}|336=REG~}|10120= ~}|111=909~}|
How can I get XTL9911 using the regexp_extract function? I need the value for tag 1.
I am using the query below, and it returns null:
select regexp_extract(fixmessagestr, '}1=(.*?)', 1)
In your string there is a pipe | before tag 1, not a curly brace. In any case, both the pipe and the curly brace are special characters in regexp and should be escaped with a double backslash (in Hive/Impala; in other databases it can be a single backslash). Also add the end character, which is ~:
select regexp_extract('10123=TICKET~}|167=CS~}|1=XTL9911~}|336=REG~}|10120= ~}|111=909~}|',
'\\|1=(.*?)~', 1)
Result:
XTL9911
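
To see why the terminating ~ matters, here is a small sketch in Python (its `re` module handles these patterns like the Java engine used by Hive). Python raw strings need only a single backslash, whereas the Hive string literal above needs two.

```python
import re

fixmessagestr = "10123=TICKET~}|167=CS~}|1=XTL9911~}|336=REG~}|10120= ~}|111=909~}|"

# Original attempt: the text '}1=' never occurs (a '|' sits between '}' and '1').
print(re.search(r'\}1=(.*?)', fixmessagestr))                  # None

# Corrected pattern: escaped pipe, lazy capture, terminated by '~'.
match = re.search(r'\|1=(.*?)~', fixmessagestr)
tag_1 = match.group(1)
print(tag_1)                                                   # XTL9911

# Without a terminator, a lazy group at the very end of the pattern
# matches as little as possible, which is nothing at all.
print(repr(re.search(r'\|1=(.*?)', fixmessagestr).group(1)))   # ''
```

One caveat with this pattern: it requires a pipe in front of the tag, so it would not match if tag 1 were the first field in the message.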

Snowflake SQL Regex

I am trying to identify a value that is nested in a string using Snowflake's regexp_substr()
The value that I want to access is in quotes:
...
Type:
value: "CategoryA"
...
Edit: This text is nested in a much larger portion of text.
I want to extract CategoryA for all columns using regexp_substr. But I am unsure how.
I have tried:
regexp_substr(col, 'Type\\W+(\\w+)\\W+\\w.+')
and while that gives the portion of the string, I just want what is in quotes and can't figure out how to do so.
You could use regexp_replace() instead:
regexp_replace(col, '(^[^"]*")|("[^"]*$)', '')
The regexp matches on both following conditions, and replaces matching parts with the empty string:
^[^"]*": everything from the beginning of the string to the first double quote
"[^"]*$: everything from the last double quote to the end of the string
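
A quick way to check what this pattern keeps is to run it in Python, used here only as a stand-in for Snowflake's regex engine. The sample inputs are made up. The second case shows a limitation: the approach assumes the column contains exactly one quoted value.

```python
import re

# Same pattern as in the answer: strip up to the first quote,
# and from the last quote to the end.
PATTERN = r'(^[^"]*")|("[^"]*$)'

text = 'Type:\n  value: "CategoryA"\nOther: 1'   # made-up sample input
print(re.sub(PATTERN, '', text))                  # CategoryA

# Limitation: with several quoted values, everything between the first
# and the last quote survives.
two_values = 'a "X" b "Y" c'
print(re.sub(PATTERN, '', two_values))            # X" b "Y
```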

Oracle mask multiple instances of a set of characters in a string - multiple variations of "phrase to be masked " in the string

I am attempting to mask multiple instances of a set of characters in a string.
The most likely variations of the string are
BL-nn-nnnnnnn
BLnn-nnnnnnn
BLnnnnnnnnn
BL-nnnnnnnnn
I want to mask all of them with 'BL-XX-XXXXXXX ' (note the space character at the end of the masked string). There are other words in the string that can start with BL as well.
Any help with REGEXP_REPLACE() function is greatly appreciated. Thanks!
The following:
REGEXP_REPLACE(TEST_DATA, 'BL[-0-9 ]+', 'BL-XX-XXXXXXX ')
seems to do what you want.
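
As a rough check of that pattern against the four variations from the question, here is a sketch in Python (standing in for Oracle's regex engine; the function name `mask_bl` and the sample sentence are made up):

```python
import re

MASK = 'BL-XX-XXXXXXX '

def mask_bl(text):
    # 'BL' followed by one or more dashes, digits or spaces.
    return re.sub(r'BL[-0-9 ]+', MASK, text)

for variant in ['BL-12-3456789', 'BL12-3456789', 'BL123456789', 'BL-123456789']:
    print(repr(mask_bl(variant)))    # 'BL-XX-XXXXXXX ' each time

# Words that merely start with BL are left alone.
print(mask_bl('BLUE crate BL12-3456789 shipped'))
# BLUE crate BL-XX-XXXXXXX shipped
```

Because the character class includes a space, the match also consumes any spaces after the number, and 'BL' followed by a space and digits (for example 'TBL 5') would be masked too.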
First remove all dashes, then reformat using regexp_replace() with the [[:alnum:]] pattern, which matches alphanumeric characters (letters as well as digits):
select regexp_replace(replace(str,'-',''), 'BL([[:alnum:]]{2})([[:alnum:]]{7})','BL-\1-\2 ')
as "Result String"
from tab
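
For comparison, this is roughly what the query above does, sketched in Python. The class `[0-9A-Za-z]` is an ASCII-only approximation of `[[:alnum:]]`, and `normalize_bl` is a made-up name. Two things stand out: the digits are reformatted rather than replaced by X, and because letters also match, ordinary words starting with BL can be rewritten.

```python
import re

def normalize_bl(text):
    # Drop every dash, then re-insert them after 2 and 7 characters.
    # [0-9A-Za-z] approximates Oracle's [[:alnum:]] (ASCII only, an assumption).
    no_dashes = text.replace('-', '')
    return re.sub(r'BL([0-9A-Za-z]{2})([0-9A-Za-z]{7})', r'BL-\1-\2 ', no_dashes)

print(repr(normalize_bl('BL123456789')))     # 'BL-12-3456789 '
print(repr(normalize_bl('BL-12-3456789')))   # 'BL-12-3456789 '
print(repr(normalize_bl('BLACKBOARDS')))     # 'BL-AC-KBOARDS '
```

Restricting the groups to digits ([[:digit:]] in Oracle) would avoid the last case.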

Removing leading special characters in Hive

I am trying to remove leading special characters (could be -"$&^#_)
from "Persi és Levon Cnatówóeez using Hive.
select REGEXP_REPLACE('“Persi és Levon Cnatówóeez', '[^a-zA-Z0-9]+', '')
but this removes all special characters.
I am expecting an output similar to
Persi és Levon Cnatówóeez
Try this:
select REGEXP_REPLACE('"Persi és Levon Cnatówóeez', '[^a-zA-Z0-9\u00E0-\u00FC ]+', '');
I tried it on Hive: it removes any character that is not a letter (a-zA-Z), a digit (0-9), an accented character (\u00E0-\u00FC) or a space.
0: jdbc:hive2://localhost:10000> select REGEXP_REPLACE('"Persi és Levon Cnatówóeez', '[^a-zA-Z0-9\u00E0-\u00FC ]+', '');
+----------------------------+--+
| _c0 |
+----------------------------+--+
| Persi és Levon Cnatówóeez |
+----------------------------+--+
1 row selected (0.104 seconds)
0: jdbc:hive2://localhost:10000>
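
Note that this pattern is not anchored, so it strips the listed characters everywhere in the string, not only at the start. The Python sketch below (a stand-in for Hive's Java regex engine) contrasts it with an anchored variant. The variant and both function names are assumptions for illustration, not something tested on Hive.

```python
import re

KEEP = 'a-zA-Z0-9\u00E0-\u00FC'   # ASCII letters, digits, U+00E0 through U+00FC

def strip_everywhere(text):
    # The answer's pattern: delete every run of characters outside KEEP and space.
    return re.sub('[^' + KEEP + ' ]+', '', text)

def strip_leading(text):
    # Variant (assumption): anchor with ^ so only the start of the string is touched.
    return re.sub('^[^' + KEEP + ']+', '', text)

name = '\u201cPersi és Levon Cnatówóeez'   # starts with a curly opening quote
print(strip_everywhere(name))               # Persi és Levon Cnatówóeez
print(strip_leading(name))                  # Persi és Levon Cnatówóeez

# The difference shows up when special characters occur later in the string.
print(strip_everywhere('"Jean-Luc & Co.'))  # JeanLuc  Co
print(strip_leading('"Jean-Luc & Co.'))     # Jean-Luc & Co.
```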
From the Hive documentation:
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)
Returns the string resulting from replacing all substrings in INITIAL_STRING that match the java regular expression syntax defined in PATTERN with instances of REPLACEMENT. For example, regexp_replace("foobar", "oo|ar", "") returns 'fb.' Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.
Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
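
The documentation's example and its escaping note can be reproduced with a short Python sketch. Python is only a stand-in here: the second and third calls pass the pattern as the regex engine would see it once Hive has unescaped the string literal.

```python
import re

# The documentation's example: remove every 'oo' or 'ar'.
print(re.sub('oo|ar', '', 'foobar'))       # fb

# What the regex engine receives after the HiveQL literal is unescaped:
# '\s' arrives as plain 's', while '\\s' arrives as '\s'.
print(re.sub('s', '', 'class room'))       # cla room
print(re.sub(r'\s', '', 'class room'))     # classroom
```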
You should do something like this:
select REGEXP_REPLACE('“Persi és Levon Cnatówóeez', '^[\!-\/\[-\`]+', '')
I don't have Hive available right now to try this code, but the idea should be correct. In the second argument you must put what you want to replace, not what you want to keep in your string. In this specific case, it should remove (replace with the empty string '') every consecutive character at the beginning of the line that is in the range from ! to /, or in the range from [ to `, referring to the ASCII table.
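
The two ASCII ranges can be tried out in Python (a stand-in for Hive's regex engine; the sample strings are made up). The sketch also shows that a curly quote such as the one in the question (U+201C) is not an ASCII character, so these ranges do not cover it.

```python
import re

# Leading run of characters in the ASCII ranges '!'..'/' and '['..'`'.
LEADING_PUNCT = r'^[!-/\[-`]+'

print(re.sub(LEADING_PUNCT, '', '"$&^#_Persi és Levon'))   # Persi és Levon
print(re.sub(LEADING_PUNCT, '', 'Jean-Luc'))               # Jean-Luc

# The curly quote U+201C is outside both ranges, so it is left in place.
curly = '\u201cPersi'
print(re.sub(LEADING_PUNCT, '', curly) == curly)           # True
```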

Teradata substring out of bounds

I'm having trouble working out the bounds for a substring. For example, from the string 063016_shape_tea_cleanse__emshptea1_ I want to extract emshptea1, but it also has to work for the string 063016_shape_tea_cleanse__emshptea1_TESTDATA_HERE.
Currently I have:
sel SUBSTR('063016_shape_tea_cleanse__emshptea1_',POSITION('__' IN '063016_shape_tea_cleanse__emshptea1_')+2,
POSITION('_' IN SUBSTR('063016_shape_tea_cleanse__emshptea1_',POSITION('__' IN '063016_shape_tea_cleanse__emshptea1_') + 2,CHARACTER_LENGTH('063016_shape_tea_cleanse__emshptea1_') - (POSITION('__' IN '063016_shape_tea_cleanse__emshptea1_') + 2)))-1)
But that is erroring out due to it trying to substring 27 to -1.
You might use a regular expression, this will extract everything between __ and the following _ or end of string:
REGEXP_SUBSTR(col, '(?<=__).+?(?=(_|$))')
'(?<= )' is a look-behind, i.e. search for preceding characters without adding them to the result. Here: search for __
'.+' matches any character, one or multiple times. On its own this would match until the end of the string ("greedy"); the following '?' ("lazy") prevents that.
'(?= )' is a look-ahead, i.e. search for following characters without adding them to the result.
'( | )': the pipe splits an expression into multiple alternatives. Here: either an underscore character or the end of the string, $
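
The same pattern can be exercised in Python, whose `re` module supports these look-behind and look-ahead constructs. This is a sketch for checking the logic, not Teradata code, and `between_markers` is a made-up helper name.

```python
import re

# Text after '__' up to (not including) the next '_' or the end of the string.
PATTERN = r'(?<=__).+?(?=(_|$))'

def between_markers(text):
    match = re.search(PATTERN, text)
    return match.group(0) if match else None

for s in ['063016_shape_tea_cleanse__emshptea1_',
          '063016_shape_tea_cleanse__emshptea1_TESTDATA_HERE',
          '063016_shape_tea_cleanse__emshptea1']:
    print(between_markers(s))   # emshptea1 each time
```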