Hive SQL Extract string of varying length between two non-alphanumeric characters - hive

I would like to extract strings of varying length located between two repeating underscores in Hive QL. Below I show a sampling of the pattern of the rows. Specifically, I would like to extract the string between the 3rd and 4th underscores. Thanks!
2016_sadfsa_IL_THIS_xsdaf_asd_eventbyevent_tsaC_NA_300x250
2017_thisshopper_MA_THIS_NAT_Leb_ReasonsWhy_HDIMC_NA_300x600
2017_FordShopper_IL_THESE_NAT_sov_winterEvent_HDIMC_NA_300x600
Just kept trying and I modified this from previous responses to non-Hive SQL. I am still interested in knowing better ways of doing this. Note that creative_str is the name of the column:
select creative_str, ltrim(rtrim(substring(regexp_replace(cast(creative_str as varchar(1000)), '_', repeat(cast(' ' as varchar(1000)),10000)), 30001, 10000)))
from impression_cr

You should be able to do this with Hive's SPLIT() function, which returns an array. Hive arrays are zero-indexed, so the value between the third and fourth underscores is element [3]:
SELECT SPLIT("2016_sadfsa_IL_THIS_xsdaf_asd_eventbyevent_tsaC_NA_300x250", "[_]")[3],
SPLIT("2017_thisshopper_MA_THIS_NAT_Leb_ReasonsWhy_HDIMC_NA_300x600", "[_]")[3],
SPLIT("2017_FordShopper_IL_THESE_NAT_sov_winterEvent_HDIMC_NA_300x600", "[_]")[3]
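To sanity-check the indexing outside Hive, here is a minimal Python sketch of the same split-and-index logic. It is only an illustration, not Hive code; the helper name field_between_underscores is made up for this example.

```python
# Illustration only: mirrors the split-on-underscore, take-element-3 logic in plain Python.
rows = [
    "2016_sadfsa_IL_THIS_xsdaf_asd_eventbyevent_tsaC_NA_300x250",
    "2017_thisshopper_MA_THIS_NAT_Leb_ReasonsWhy_HDIMC_NA_300x600",
    "2017_FordShopper_IL_THESE_NAT_sov_winterEvent_HDIMC_NA_300x600",
]


def field_between_underscores(value, index=3):
    """Return the zero-based field at `index`, or None if there are too few fields."""
    parts = value.split("_")
    return parts[index] if index < len(parts) else None


extracted = [field_between_underscores(row) for row in rows]
print(extracted)  # ['THIS', 'THIS', 'THESE']
```

Index 3 is the fourth field, which is the text between the third and fourth underscores.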

Related

Add a character in a string at certain location based on logic in SQL Server

I have comma-separated data like this in one of the columns:
48FGTG,100ERTD,18NH,07EWR,9FDC,2POANAR,100GTEDC
46FGTG,78ERTD,67NH,76EWR,3FDC
The leading number in each item is a percentage that varies from 0 to 100; everything from the first alphabetic character onward is the label.
I need to update the data to look like this:
48% FGTG,100% ERTD,18% NH,07% EWR,9% FDC,2% POANAR,100% GTEDC
46% FGTG,78% ERTD,67% NH,76% EWR,3% FDC
I can match the percentage with a regex, but I'm not sure how to use it in SQL. Any lead would be helpful.
You can do it like this for a single row (note the space after the % so the output matches the requested format):
select STRING_AGG(substring(value,0,PATINDEX('%[^0-9]%',value))+'% '+substring(value,PATINDEX('%[^0-9]%',value),len(value)),',') from string_split('48FGTG,100ERTD,18NH,07EWR,9FDC,2POANAR,100GTEDC',',')
Here's what I have done:
1. Use PATINDEX to find the position of the first non-digit character.
2. Use SUBSTRING to extract the leading number and then the remaining string.
3. Use STRING_AGG to concatenate the rebuilt items, placing a comma between them.
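The three steps can be checked outside SQL Server with a small Python sketch. This is an illustration of the logic only, not T-SQL; add_percent and convert_row are hypothetical names, and the regex search stands in for PATINDEX.

```python
import re


def add_percent(item):
    """Insert '% ' between the leading digits and the rest of one item."""
    # Step 1: position of the first non-digit character (the role PATINDEX plays).
    match = re.search(r"[^0-9]", item)
    if match is None or match.start() == 0:
        return item  # no label or no leading number: leave the item unchanged
    cut = match.start()
    # Step 2: leading number, then the remaining string.
    return item[:cut] + "% " + item[cut:]


def convert_row(row):
    # Step 3: rebuild the comma-separated value (the role STRING_AGG plays).
    return ",".join(add_percent(item) for item in row.split(","))


converted = convert_row("48FGTG,100ERTD,18NH,07EWR,9FDC,2POANAR,100GTEDC")
print(converted)  # 48% FGTG,100% ERTD,18% NH,07% EWR,9% FDC,2% POANAR,100% GTEDC
```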

Postgres SQL regexp_replace replace all number

I need some help with the following. I have a text field in SQL that records a list of times separated by '|'. For example:
'14613|15474|3832|148|5236|5348|1055|524'. Each value is a time in milliseconds. The field can have any length; for example, '3215|2654' and '4565' (only one value) are both valid. I need to take this field and replace every number with the value -1000.
So '14613|15474|3832|148|5236|5348|1055|524' becomes '-1000|-1000|-1000|-1000|-1000|-1000|-1000|-1000'
Or '3215|2654' => '-1000|-1000', or '4565' => '-1000'.
I tried regexp_replace(times_field,'[[:digit:]]','-1000','g'), but it replaces each digit rather than each complete number, so in this example
'3215|2654', which should become '-1000|-1000', I get
'-1000-1000-1000-1000|-1000-1000-1000-1000'. I tried other combinations and more regexp options, but I'm stuck.
Any help is appreciated, thanks!
We can try using REGEXP_REPLACE here:
UPDATE yourTable
SET times_field = REGEXP_REPLACE(times_field, '\y[0-9]+\y', '-1000', 'g');
If instead you don't really want to alter your data but rather just view your data this way, then use a select:
SELECT
times_field,
REGEXP_REPLACE(times_field, '\y[0-9]+\y', '-1000', 'g') AS times_field_replace
FROM yourTable;
Note that in either case we pass g as the fourth parameter to REGEXP_REPLACE to do a global replacement of all pipe-separated numbers.
[[:digit:]] matches a single digit [0-9].
The + quantifier matches one or more times, as many times as possible.
So your regexp should look like this:
regexp_replace(times_field,'[[:digit:]]+','-1000','g')
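The difference between matching one digit and matching a run of digits is easy to see in a quick Python check. This is only an illustration using Python's re module rather than Postgres; re.sub replaces every match by default, which corresponds to the 'g' flag.

```python
import re

times = "3215|2654"

# One replacement per digit: reproduces the unwanted result from the question.
per_digit = re.sub(r"[0-9]", "-1000", times)

# One replacement per run of digits: the + quantifier consumes the whole number.
per_number = re.sub(r"[0-9]+", "-1000", times)

print(per_digit)   # -1000-1000-1000-1000|-1000-1000-1000-1000
print(per_number)  # -1000|-1000
```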

Oracle SQL regex: extract every instance of a string and preceding/following characters

I'm pulling data from an Oracle CLOB field containing tens of thousands of characters. The data look like this:
...
196|9900000296567|V|
197|S05S53499|D|
198|TO|20170128000000|50118.0|||T|N|
196|9900009777884|V|
197|H02FC07599|D|
198|01|20170128000000|64452.0|||T|N|
198|02|20170128000000|14235.0|||T|N|
196|9900014386487|V|
197|S10C20599|D|
198|1|20170128000000|6246.0|||T|N|
196|9900015184256|V|
197|S13G44199|D|
198|L|20170128000000|1731.0|||T|N|
198|N|20170128000000|5915.0|||T|N|
196|9900018826270|V|
197|S10C20599|D|
198|01|20170128000000|3678.0|||T|N|
198|02|20170128000000|25286.0|||T|N|
...
I want to extract every occurrence of a string (e.g. S10C20599) with the preceding 25 characters and following 75 characters. If this bit is not possible I'd happily settle for the same number of preceding and following characters. I don't care if I get overlaps in the extracted data, and the code should not error if the search string occurs <25 characters from the beginning of the file or <75 characters from the end.
Thanks for any tips.
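If there is only one value, you can use:
select regexp_substr(col, '.{0,25}S10C20599.{0,75}', 1, 1, 'n')
The range quantifier takes a comma ({0,25}, not {0-25}), and the 'n' match parameter lets . match newline characters, which matters here because the data spans multiple lines.
Otherwise, you need some sort of recursive or hierarchical query to fetch multiple values from a single string.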
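For the multiple-occurrence case, the windowing logic itself is simple; here is a rough Python sketch of what such a query would need to compute. It is not Oracle code, and occurrences_with_context is a hypothetical helper name. Windows are clamped at both ends of the text, so a match near the start or end does not raise an error, and overlapping windows are allowed.

```python
def occurrences_with_context(text, needle, before=25, after=75):
    """Return each occurrence of `needle` with up to `before`/`after` characters around it."""
    results = []
    start = text.find(needle)
    while start != -1:
        lo = max(0, start - before)                        # clamp at start of text
        hi = min(len(text), start + len(needle) + after)   # clamp at end of text
        results.append(text[lo:hi])
        start = text.find(needle, start + 1)               # continue after this hit
    return results


sample = (
    "196|9900014386487|V|\n"
    "197|S10C20599|D|\n"
    "198|1|20170128000000|6246.0|||T|N|\n"
    "196|9900018826270|V|\n"
    "197|S10C20599|D|\n"
    "198|01|20170128000000|3678.0|||T|N|\n"
)
windows = occurrences_with_context(sample, "S10C20599")
print(len(windows))  # 2
```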

Display certain sequence only in VARCHAR

I have a column error_desc with values like:
Failure occurred in (Class::Method) xxxxCalcModule::endCustomer. Fan id 111232 is not Effective or not present in BL9_XXXXX for date 20160XXX.
What SQL query can I use to display only the number 111232 from that column? The number is placed at 66th position in VARCHAR column and ends 71st.
SELECT substr(ERROR_DESC,66,6) as ABC FROM bl1_cycle_errors where error_desc like '%FAN%'
This solution uses regular expressions.
The challenge I faced was excluding alphanumeric tokens. To detect the standalone number, we have to keep only pure numbers and filter out words, alphanumeric tokens, and punctuation.
Pure strings and words not containing numbers can be easily filtered out using
[^[:digit:]]
Possible combinations of alphanumerics are:
1. Begins with letters, contains numbers, may end with letters or punctuation:
[a-zA-Z]+[0-9]+[[:punct:]]*[a-zA-Z]*[[:punct:]]*
2. Begins with numbers, then contains letters, may contain punctuation:
[0-9]+[[:punct:]]*[a-zA-Z]+[[:punct:]]*
3. Begins with numbers, then contains punctuation, may contain letters:
[0-9]+[a-zA-Z]*[[:punct:]]+[a-zA-Z]*
Combining these regular expressions using | operator we get:
select trim(REGEXP_REPLACE(error_desc,'[^[:digit:]]|[a-zA-Z]+[0-9]+[[:punct:]]*[a-zA-Z]*[[:punct:]]*|[0-9]+[[:punct:]]*[a-zA-Z]+[[:punct:]]*|[0-9]+[a-zA-Z]*[[:punct:]]+[a-zA-Z]*',' '))
from error_table;
Will work in most cases.

Cut string after first occurrence of a character

I have strings like 'keepme:cutme' or 'string-without-separator' which should become respectively 'keepme' and 'string-without-separator'. Can this be done in PostgreSQL? I tried:
select substring('first:last' from '.+:')
But this leaves the : in and won't work if there is no : in the string.
Use split_part():
SELECT split_part('first:last', ':', 1) AS first_part
Returns the whole string if the delimiter is not there. And it's simple to get the 2nd or 3rd part etc.
Substantially faster than functions using regular expression matching. And since we have a fixed delimiter we don't need the magic of regular expressions.
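To make the "returns the whole string if the delimiter is not there" behaviour concrete, here is the equivalent logic as a tiny Python sketch. It is an illustration only, not PostgreSQL; first_part is a hypothetical name.

```python
def first_part(value, delimiter=":"):
    """Return everything before the first delimiter, or the whole string if it is absent."""
    return value.split(delimiter, 1)[0]


print(first_part("keepme:cutme"))              # keepme
print(first_part("string-without-separator"))  # string-without-separator
```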
Related:
Split comma separated column data into additional columns
regexp_replace() may be overkill for what you need, but it also gives you the additional benefit of regex, for instance if the strings use multiple delimiters.
Example use:
select regexp_replace( 'first:last', E':.*', '');

SQL Select to pick everything after the last occurrence of a character

select right('first:last', charindex(':', reverse('first:last')) - 1)