Extract substring with a specific pattern in Hive SQL - sql

I have a column with the sample data below. I need to extract the substring that starts with "M6" from each value. Is there a way to do it with regexp_extract?
Data Column
HEY01230328_M6K21SG_UNO_NYC_241
M6EW2BJ_UNO_NYC_251
M6HW2WL_UNO_NYC_251
HEY08460329_NA_M6LAB3D_UNO_NYC_241
Desired Output
M6K21SG
M6EW2BJ
M6HW2WL
M6LAB3D

Try using:
SELECT REGEXP_EXTRACT(colname, ".*(M6[^_]*).*", 1) FROM tableName
Regex used:
.*(M6[^_]*).*
Regex Demo
Explanation:
.* - matches 0+ occurrences of any character that is not a newline character
(M6[^_]*) - matches M6 followed by 0+ occurrences of any character that is not a _. So, after M6, it keeps matching until it reaches the next _ or the end of the string. The parentheses store this sub-match in Group 1
.* - matches 0+ occurrences of any character that is not a newline character
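As a quick sanity check, the pattern can be run against the sample rows. This is a minimal sketch in Python, whose re module is assumed to handle this particular pattern the same way as the Java regex engine behind Hive's regexp_extract. The helper name extract_m6 and the empty-string fallback are made up for illustration.

```python
import re

# Same pattern as in the answer above; group 1 holds the "M6..." token.
M6_PATTERN = re.compile(r".*(M6[^_]*).*")


def extract_m6(value: str) -> str:
    """Return the M6 token, or an empty string (an assumption) when absent."""
    match = M6_PATTERN.search(value)
    return match.group(1) if match else ""


rows = [
    "HEY01230328_M6K21SG_UNO_NYC_241",
    "M6EW2BJ_UNO_NYC_251",
    "M6HW2WL_UNO_NYC_251",
    "HEY08460329_NA_M6LAB3D_UNO_NYC_241",
]

for row in rows:
    print(extract_m6(row))
```

Because the leading .* is greedy, a value containing several M6 tokens would yield the last one, not the first.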

Related

SQL, extract everything before 5th comma

For example, my column "tags" has
"movie/spiderman,genre/action,movie:marvel",
"movie/kingsman,genre/action",
"movie/spiderman,genre/action,movie:marvel,movie:dfjkl,movie:fskj,movie:aa,movie:mdkk"
I'm trying to return everything before the 5th comma. Below is an example of the result:
"movie/spiderman,genre/action,movie:marvel",
"movie/kingsman,genre/action",
"movie/spiderman,genre/action,movie:marvel,movie:dfjkl,movie:fskj"
I've tried the code below, but it's not working.
select
NVL(SUBSTRING(tags, 1,REGEXP_INSTR(tags,',',1,5) -1),tags)
from myTable
You can use
REGEXP_REPLACE(tags, '^(([^,]*,){4}[^,]*).*', '\\1')
See the regex demo.
The REGEXP_REPLACE will find the occurrence of the following pattern:
^ - start of string
(([^,]*,){4}[^,]*) - Group 1 (\1 refers to this part of the match): four repetitions of zero or more chars other than a comma followed by a comma, and then zero or more chars other than a comma
.* - the rest of the string.
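The \1 replacement restores the Group 1 value in the resulting string. A string with fewer than four commas does not match the pattern at all, so it is returned unchanged.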
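The behaviour of this replacement can be checked outside the database. The following is a rough sketch using Python's re.sub as a stand-in for REGEXP_REPLACE (an assumption; the SQL engine's regex flavor may differ in other details). The function name before_fifth_comma is hypothetical.

```python
import re

# Group 1 = the first five comma-separated items; the trailing .* is dropped.
FIRST_FIVE = re.compile(r"^(([^,]*,){4}[^,]*).*")


def before_fifth_comma(tags: str) -> str:
    """Keep everything before the 5th comma; shorter strings pass through."""
    return FIRST_FIVE.sub(r"\1", tags, count=1)


samples = [
    "movie/spiderman,genre/action,movie:marvel",
    "movie/kingsman,genre/action",
    "movie/spiderman,genre/action,movie:marvel,movie:dfjkl,"
    "movie:fskj,movie:aa,movie:mdkk",
]

for sample in samples:
    print(before_fifth_comma(sample))
```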

Hive Regex Pattern Issue

My data has 3 column values, something like this:
1111 some input $in put1
1121 - $in put2
Between the first and second values there is a space delimiter. Between the second and third values there is a space AND a "$" delimiter.
The second value in the second row is not provided, so it is just a dash (-).
My table statement in Hive is as follows:
CREATE TABLE tab1(someid string,something1 string, something2 string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES(
"input.regex"="^(\\S+)\\s+(\\S+\\s+\\S+)\\s+.(.+)$",
"output.format.string"="%1$s %2$s %3$s")
stored as textfile;
The result I am getting is:
1111 Some input $in put1
1121 - $in put2
What I am expecting is:
1111 Some input in put1
1121 - in put2
If the second value is just a dash (-), then the dash should be taken as the value for the second column.
In the last column, I do not want any delimiter symbol.
What am I doing wrong in the regex? I would like to keep the same \S-style pattern. Please help.
Does this work?
"input.regex"="^(\\S+)\\s+([^$]+)\\$(.+)$"
Regex breakdown:
^ beginning of the string
(\\S+) 1 to N non-space characters (capturing)
\\s+ 1 to N space characters
([^$]+) 1 to N characters other than a literal "$" (capturing); note that this group also captures the space right before the "$"
\\$ a literal "$"
(.+) 1 to N characters
$ end of the string
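To see what each group captures, the suggested pattern can be tried on the two sample lines. This is only a sketch in Python (an assumption: RegexSerDe itself uses Java regex and, as far as I know, requires the whole line to match, which fullmatch imitates here). The TRIMMED variant is not part of the answer above; it is a hypothetical tweak that keeps the space before "$" out of the second column.

```python
import re

# Pattern from the answer, with the SerDe's doubled backslashes undone.
SUGGESTED = re.compile(r"^(\S+)\s+([^$]+)\$(.+)$")

# Hypothetical variant: lazy group 2, then the spaces before "$" are
# consumed outside the group, so column 2 has no trailing space.
TRIMMED = re.compile(r"^(\S+)\s+([^$]+?)\s+\$(.+)$")


def split_row(pattern, line):
    """Return the three captured columns, or None if the line does not match."""
    match = pattern.fullmatch(line)
    return match.groups() if match else None


for line in ["1111 some input $in put1", "1121 - $in put2"]:
    print(split_row(SUGGESTED, line), split_row(TRIMMED, line))
```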

Delete specific pattern between commas in text file

I have thousands of SQL queries written in Notepad++, one query per line. Every query contains a comma-separated list of columns to select from the database. We now want to drop certain columns from that list when they follow a specific pattern/regular expression. Each SQL query has this structure:
A trimmed column is selected with the alias 'PK'.
Every query has a date-based where condition at the end.
Sometimes the pattern we wish to remove also appears in the PK expression, the where clause, or both. We don't want to remove it from those places, only from the column selection list.
Below is an example of a SQL query:
select (TRIM(TAE_TSP_REC_UPDATE)) as PK,TAE_AMT_FAIR_MV,TAE_TXT_ACCT_NUM,TAE_CDE_OWNER_TYPE,TAE_DTE_AQA_ABA,TAE_RID_OWNER,TAE_FID_OWNER,TAE_CID_OWNER,TAE_TSP_REC_UPDATE from TABLE_TAX_REP where DATE(TAE_TSP_REC_UPDATE)>='03/31/2018'
After removing the columns/patterns, the query should look like this:
select (TRIM(TAE_TSP_REC_UPDATE)) as PK,TAE_AMT_FAIR_MV,TAE_TXT_ACCT_NUM,TAE_CDE_OWNER_TYPE,TAE_DTE_AQA_ABA from TABLE_TAX_REP where DATE(TAE_TSP_REC_UPDATE)>='03/31/2018'
We want to remove columns matching the patterns below from every query, between the commas:
.FID.
.RID.
.CID.
.TSP.
If the pattern exists within a TRIM/DATE function, it should not be touched. It should only be removed from the column selection list.
Could somebody please help me with this? Thanks in advance.
You may use
(?:\G(?!^)|\sas\s(?=.*'\d{2}/\d{2}/\d{4}'$))(?:(?!\sfrom\s).)*?\K,?\s*[A-Z_]+_(?:[FRC]ID|TSP)_[A-Z_]+
Details
(?:\G(?!^)|\sas\s(?=.*'\d{2}/\d{2}/\d{4}'$)) - two alternatives:
\G(?!^) - the end of the previous match, but not a position at the start of the line
| - or
\sas\s(?=.*'\d{2}/\d{2}/\d{4}'$) - an as surrounded with single whitespaces that is followed with any 0+ chars other than line break chars and then ', 2 digits, /, 2 digits, /, 4 digits and ' at the end of the line
(?:(?!\sfrom\s).)*? - matches any char other than a line break char, 0 or more times, as few as possible, as long as that char does not start a whitespace, from, whitespace sequence
\K - a match reset operator discarding all text matched so far
,?\s* - an optional comma followed with 0+ whitespaces
[A-Z_]+_(?:[FRC]ID|TSP)_[A-Z_]+ - uppercase ASCII letters and/or _, 1 or more occurrences, followed by _, then either F, R or C followed by ID, or TSP, then _, and again 1 or more uppercase ASCII letters and/or _.
See the regex demo.
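The regex above relies on \G and \K, which Notepad++ supports but many other engines (Python's re among them) do not. For anyone scripting the cleanup instead, here is a rough sketch of a different technique: split each query around the column list and filter the columns. It assumes every query has the "as PK" alias and a " from " clause as described in the question; strip_columns and UNWANTED are hypothetical names.

```python
import re

# A column is dropped when it is named like XXX_FID_YYY, XXX_RID_YYY,
# XXX_CID_YYY or XXX_TSP_YYY (same token pattern as the regex above).
UNWANTED = re.compile(r"^[A-Z_]+_(?:[FRC]ID|TSP)_[A-Z_]+$")

# head = everything up to "as PK", cols = the column list, tail = from/where.
QUERY_PARTS = re.compile(r"^(.*?\sas\sPK)(.*?)(\sfrom\s.*)$")


def strip_columns(query: str) -> str:
    """Remove unwanted columns from the select list only."""
    match = QUERY_PARTS.match(query)
    if match is None:
        return query  # unexpected shape: leave the line untouched
    head, cols, tail = match.groups()
    kept = [c for c in cols.split(",") if not UNWANTED.match(c.strip())]
    return head + ",".join(kept) + tail


query = (
    "select (TRIM(TAE_TSP_REC_UPDATE)) as PK,TAE_AMT_FAIR_MV,"
    "TAE_TXT_ACCT_NUM,TAE_CDE_OWNER_TYPE,TAE_DTE_AQA_ABA,TAE_RID_OWNER,"
    "TAE_FID_OWNER,TAE_CID_OWNER,TAE_TSP_REC_UPDATE from TABLE_TAX_REP "
    "where DATE(TAE_TSP_REC_UPDATE)>='03/31/2018'"
)
print(strip_columns(query))
```

The TRIM(...) and DATE(...) expressions are never inspected, because only the text between "as PK" and " from " is split on commas.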

How can I extract the numerical prefix of a string using REGEXP_EXTRACT on Hive?

I'm not sure how to write my regex command on Hive to pull the numerical prefix substring from this string: 211118-1_20569 - (DHCP). I need to return 211118, but also have the flexibility to return digits with smaller or larger values depending on the size of the numerical prefix.
hive> select regexp_extract('211118-1_20569 - (DHCP)','^\\d+',0);
OK
211118
or
hive> select regexp_extract('211118-1_20569 - (DHCP)','^[0-9]+',0);
OK
211118
^ - The beginning of a line
\d - A digit: [0-9]
[0-9] - the characters between '0' and '9'
X+ - X, one or more times
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
regexp_extract(string subject, string pattern, int index)
predefined character classes (e.g. \d) must be written with an additional backslash (\\d) inside the Hive string literal
index = 0 returns the whole match; index = N returns capturing group N
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringOperators
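The same anchored pattern can be tried outside Hive to confirm that it adapts to prefixes of any length. This is a small sketch in Python (assumed to behave like Java regex for ^[0-9]+); numeric_prefix is a made-up helper, and returning an empty string when there is no prefix is a choice made here, not a statement about Hive.

```python
import re

# One or more ASCII digits anchored at the start of the string.
PREFIX = re.compile(r"^[0-9]+")


def numeric_prefix(value: str) -> str:
    """Return the leading run of digits, or "" when the string has none."""
    match = PREFIX.search(value)
    return match.group(0) if match else ""


for value in ["211118-1_20569 - (DHCP)", "7-1_20569", "20211118001-2_1"]:
    print(numeric_prefix(value))
```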

Match line that doesn't contain certain words

I have the following string:
ignoreword1,word1, ignoreword2
I would like to match any word that is not ignoreword1 or ignoreword2.
This is what I have so far:
(?s)^((?!ignoreword1).)*$
The main goal is to use the regex in a PostgreSQL database to select rows where the column matches a substring after removing "ignoreword1", "ignoreword2" and the comma ",".
To match any word that is not ignoreword1 or ignoreword2, use
\b(?!(?:ignoreword1|ignoreword2)\b)\w+
In PostgreSQL, word boundaries are [[:<:]] and [[:>:]], so use something like:
[[:<:]](?!(?:ignoreword1|ignoreword2)[[:>:]])[a-zA-Z]+
Pattern details:
[[:<:]] - leading word boundary
(?!(?:ignoreword1|ignoreword2)[[:>:]]) - fail the match if the word starting at this position is exactly ignoreword1 or ignoreword2
[a-zA-Z]+ - one or more ASCII letters.
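The generic \b form of the pattern can be tested quickly in Python, as sketched below. This only exercises the first pattern; the PostgreSQL variant with [[:<:]] and [[:>:]] uses syntax that Python's re does not accept, so it is not covered here. The name kept_words is hypothetical.

```python
import re

# A word is kept unless it is exactly ignoreword1 or ignoreword2.
KEEP_WORD = re.compile(r"\b(?!(?:ignoreword1|ignoreword2)\b)\w+")


def kept_words(text: str) -> list:
    """Return every word in text that is not one of the ignored words."""
    return KEEP_WORD.findall(text)


print(kept_words("ignoreword1,word1, ignoreword2"))
```

The \b inside the lookahead matters: without it, a longer word such as ignoreword10 would also be rejected.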