Hive Regex Pattern Issue - sql

My data has three column values, something like this:
1111 some input $in put1
1121 - $in put2
Between the first value and second there is a space delimiter. In between the second and third columns, there is a space AND a "$" delimiter.
The second value in the second row is not provided so it is just a dash (-).
My CREATE TABLE statement in Hive is as follows:
CREATE TABLE tab1(someid string,something1 string, something2 string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES(
"input.regex"="^(\\S+)\\s+(\\S+\\s+\\S+)\\s+.(.+)$",
"output.format.string"="%1$s %2$s %3$s")
stored as textfile;
The result I am getting is:
1111 Some input $in put1
1121 - $in put2
What I am expecting is -
1111 Some input in put1
1121 - in put2
If the second value is just a dash (-), the dash should be taken as the value of the second column.
In the last column, I do not want the "$" delimiter symbol.
What am I doing wrong in the regex? I would like to keep the same \S-style pattern. Please help.

Does this work?
"input.regex"="^(\\S+)\\s+([^$]+)\\$(.+)$"
Regex breakdown:
^ beginning of the string
(\\S+) 1 to N non-space characters (capturing)
\\s+ 1 to N space characters
([^$]+) 1 to N characters other than a literal "$"
\\$ a literal "$"
(.+) 1 to N characters
$ end of the string
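
To check the breakdown above against the two sample rows, here is a minimal sketch in Python. It assumes Python's re module behaves like the Java regex engine behind RegexSerDe for this particular pattern; the names split_row and TRIMMED are hypothetical, and TRIMMED is a variation of mine, not part of the answer.

```python
import re

# The answer's pattern with single backslashes
# (the Hive DDL doubles them because of string escaping).
PATTERN = re.compile(r"^(\S+)\s+([^$]+)\$(.+)$")

# Hypothetical variation: a lazy group plus \s+ keeps the space
# in front of "$" out of the second column.
TRIMMED = re.compile(r"^(\S+)\s+([^$]+?)\s+\$(.+)$")

def split_row(line, pattern=PATTERN):
    """Return the three captured columns, or None if the line does not match."""
    m = pattern.match(line)
    return m.groups() if m else None

for row in ["1111 some input $in put1", "1121 - $in put2"]:
    print(split_row(row), split_row(row, TRIMMED))
```

Note that with the answer's pattern the second column keeps the trailing space that sits before the "$" (for example "some input "). If that matters, the lazy variant shown as TRIMMED may be worth trying in the SerDe as well.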

Related

SQL, extract everything before 5th comma

For example, my column "tags" has
"movie/spiderman,genre/action,movie:marvel",
"movie/kingsman,genre/action",
"movie/spiderman,genre/action,movie:marvel,movie:dfjkl,movie:fskj,movie:aa,movie:mdkk"
I'm trying to return everything before the 5th comma. Below is the expected result:
"movie/spiderman,genre/action,movie:marvel",
"movie/kingsman,genre/action",
"movie/spiderman,genre/action,movie:marvel,movie:dfjkl,movie:fskj"
I've tried the code below, but it's not working.
select
NVL(SUBSTRING(tags, 1,REGEXP_INSTR(tags,',',1,5) -1),tags)
from myTable
You can use
REGEXP_REPLACE(tags, '^(([^,]*,){4}[^,]*).*', '\\1')
The REGEXP_REPLACE will find the occurrence of the following pattern:
^ - start of string
(([^,]*,){4}[^,]*) - Group 1 (\1 refers to this part of the match): four sequences of any zero or more chars other than a comma and a comma, and then zero or more chars other than a comma
.* - the rest of the string.
The \1 replacement restores Group 1 value in the resulting string.
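
The same pattern can be tried outside the database. This is a small sketch in Python, assuming its re module treats this pattern the same way as the SQL engine; first_five is a hypothetical helper name.

```python
import re

# Group 1: four "item," sequences followed by a fifth item; .* eats the rest.
PATTERN = re.compile(r"^(([^,]*,){4}[^,]*).*")

def first_five(tags):
    """Keep everything before the 5th comma; shorter strings are returned unchanged."""
    return PATTERN.sub(r"\1", tags)

samples = [
    "movie/spiderman,genre/action,movie:marvel",
    "movie/kingsman,genre/action",
    "movie/spiderman,genre/action,movie:marvel,movie:dfjkl,movie:fskj,movie:aa,movie:mdkk",
]
for s in samples:
    print(first_five(s))
```

Strings with fewer than four commas do not match at all, so the replacement leaves them as they are, which is why no NVL fallback is needed.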

Extract substring with a specific pattern in Hive SQL

I have a column with the sample data below. I need to extract every substring that starts with "M6". Is there a way to do it with regexp_extract?
Data Column
HEY01230328_M6K21SG_UNO_NYC_241
M6EW2BJ_UNO_NYC_251
M6HW2WL_UNO_NYC_251
HEY08460329_NA_M6LAB3D_UNO_NYC_241
Desired Output
M6K21SG
M6EW2BJ
M6HW2WL
M6LAB3D
Try using:
SELECT REGEXP_EXTRACT(colname, ".*(M6[^_]*).*", 1) FROM tableName
Regex used:
.*(M6[^_]*).*
Explanation:
.* - matches 0+ occurrences of any character that is not a newline character
(M6[^_]*) - matches M6 followed by 0+ occurrences of any character that is not a _. So, after M6, it keeps matching until it reaches the next _. The parentheses store this sub-match in Group 1
.* - matches 0+ occurrences of any character that is not a newline character
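
To see what Group 1 captures for each sample value, here is a quick sketch in Python. It assumes Python's re module matches this pattern the same way Hive's Java regex engine does; extract_m6 is a hypothetical name, and returning None when nothing matches is my choice for the sketch, not a statement about what regexp_extract returns.

```python
import re

PATTERN = re.compile(r".*(M6[^_]*).*")

def extract_m6(value):
    """Return the text captured by Group 1, or None when there is no M6 token."""
    m = PATTERN.search(value)
    return m.group(1) if m else None

data = [
    "HEY01230328_M6K21SG_UNO_NYC_241",
    "M6EW2BJ_UNO_NYC_251",
    "M6HW2WL_UNO_NYC_251",
    "HEY08460329_NA_M6LAB3D_UNO_NYC_241",
]
for value in data:
    print(extract_m6(value))
```

Because the leading .* is greedy, a value containing more than one M6 token would yield the last one; the sample data has only one per row, so it makes no difference here.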

Insert comma after every 7th character using regex and hive sql

I want to insert a comma after every 7th character, and make sure every group of 7 characters is followed by a comma, using regex in Hive SQL.
Spaces should be ignored when counting the 7 characters.
Sample Input Data:
12F123f, 123asfH 0DB68ZZ, AG12453
112312f, 1212sfH 0DB68ZZ, AQ13463
Output:
12F123f,123asfH,0DB68ZZ,AG12453
112312f,1212sfH,0DB68ZZ,AQ13463
I tried the code below, but it didn't insert the commas correctly.
select regexp_replace('12345 12456,12345 123', '(/(.{5})/g,"$1$")','')
I think you can use
select regexp_replace('12345 12456,12345 123', '(?!^)[\\s,]+([^\\s,]+)', ',$1')
Details
(?!^) - no match if at string start
[\s,]+ - 1 or more whitespaces or commas
([^\s,]+) - Capturing group 1: one or more chars other than whitespaces and commas.
The ,$1 replacement replaces the match with a comma and the value in Group 1.
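
Here is a small sketch of that replacement in Python, run against the sample rows from the question. It assumes Python's re module handles this pattern like Hive's Java regex engine; note that Python writes the backreference as \1 where Hive uses $1, and normalize is a hypothetical name.

```python
import re

# One or more separators (whitespace or commas), not at the string start,
# followed by the next token in Group 1.
PATTERN = re.compile(r"(?!^)[\s,]+([^\s,]+)")

def normalize(text):
    """Collapse every run of spaces/commas between tokens into a single comma."""
    return PATTERN.sub(r",\1", text)

for row in ["12F123f, 123asfH 0DB68ZZ, AG12453",
            "112312f, 1212sfH 0DB68ZZ, AQ13463"]:
    print(normalize(row))
```

This normalizes the separators rather than counting characters, so it relies on the tokens already being 7 characters long, as they are in the sample data.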
You just want to replace each space with a comma, am I right? The SQL is below:
select regexp_replace('12F123f,123asfH 0DB68ZZ,AG12453',' ',',') as result;
+----------------------------------+--+
| result |
+----------------------------------+--+
| 12F123f,123asfH,0DB68ZZ,AG12453 |
+----------------------------------+--+

How can I extract the numerical prefix of a string using REGEX_EXTRACT on Hive?

I'm not sure how to write my regex command on Hive to pull the numerical prefix substring from this string: 211118-1_20569 - (DHCP). I need to return 211118, but also have the flexibility to return digits with smaller or larger values depending on the size of the numerical prefix.
hive> select regexp_extract('211118-1_20569 - (DHCP)','^\\d+',0);
OK
211118
or
hive> select regexp_extract('211118-1_20569 - (DHCP)','^[0-9]+',0);
OK
211118
^ - The beginning of a line
\d - A digit: [0-9]
[0-9] - the characters between '0' and '9'
X+ - X, one or more times
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
regexp_extract(string subject, string pattern, int index)
predefined character classes (e.g. \d) should be preceded with additional backslash (\\d)
index = 0 matches the whole pattern
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringOperators
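
For a quick check outside Hive, the two patterns can be sketched in Python as below. This assumes Python's re module behaves like Java's for these patterns on ASCII input; numeric_prefix is a hypothetical name, and returning an empty string when there is no prefix is my choice for the sketch.

```python
import re

def numeric_prefix(text, pattern=r"^\d+"):
    """Return the leading run of digits (group 0, the whole match), or '' if none."""
    m = re.search(pattern, text)
    return m.group(0) if m else ""

sample = "211118-1_20569 - (DHCP)"
print(numeric_prefix(sample))             # uses ^\d+
print(numeric_prefix(sample, r"^[0-9]+")) # uses ^[0-9]+
print(numeric_prefix("42-7_1 - (DHCP)"))  # a shorter prefix works the same way
```

Because + means "one or more", the prefix can be any length, which covers the requirement for smaller or larger values.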

Postgresql : Pattern matching of values starting with "IR"

I have a table whose contents look like this:
id | value
------------
1 |CT 6510
2 |IR 52
3 |IRAB
4 |IR AB
5 |IR52
I need to get only those rows whose value starts with "IR" followed by a number (ignoring spaces). That means I should get these rows:
2 |IR 52
5 |IR52
because each starts with "IR" and the next non-space character is a digit, unlike IRAB, which also starts with "IR" but has "A" as the next character. So far I have only been able to query everything starting with IR, which returns the other IR rows as well.
select * from public.record where value ilike 'ir%'
How do I do this? Thanks.
You can use the ~ operator, which performs regular expression matching.
e.g:
SELECT * from public.record where value ~ '^IR ?\d';
Add an asterisk to make the match case-insensitive.
SELECT * from public.record where value ~* '^ir ?\d';
The symbols mean:
^: beginning of the string
?: the character before (here a white space) is optional
\d: all digits, equivalent to [0-9]
See for more info: Regular Expression Match Operators
See also this question, very informative: difference-between-like-and-in-postgres
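
To see which of the sample rows the pattern above selects, here is a minimal sketch in Python. It assumes Python's re.search is a fair stand-in for PostgreSQL's ~ operator for this pattern (both look for a match anywhere unless anchored); starts_with_ir_digit is a hypothetical name, and its ignore_case flag mimics ~*.

```python
import re

rows = {1: "CT 6510", 2: "IR 52", 3: "IRAB", 4: "IR AB", 5: "IR52"}

def starts_with_ir_digit(value, ignore_case=False):
    """True when value starts with IR, an optional single space, then a digit."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(r"^IR ?\d", value, flags) is not None

matching_ids = [row_id for row_id, value in rows.items()
                if starts_with_ir_digit(value)]
print(matching_ids)
```

The " ?" allows at most one space between IR and the digit; if values with several spaces can occur, a pattern such as '^IR\s*\d' might be a better fit.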