Separate area code from phone with HIVEQL - sql

I have a table with DDD and Phone fields. Some rows were registered correctly; in others the DDD is stuck to the front of the phone number, and I need to separate them.
my table:
Modified table:
I am starting my studies in HIVEQL, how can I make this change?

Use regexp_extract(str, regex, group_number) to extract ddd and telefone. Demo:
with mytable as (--test data
select stack(3,'5566997000000','5521997000001','24997000011') as str
)
select regexp_extract(str,'^(?:55)?(\\d{2})(\\d+)',1) as ddd,
regexp_extract(str,'^(?:55)?(\\d{2})(\\d+)',2) as telefone
from mytable
Result:
ddd telefone
66 997000000
21 997000001
24 997000011
Regexp '^(?:55)?(\\d{2})(\\d+)' meaning:
^ - beginning of the string anchor
(?:55)? - non-capturing group with 55 country code zero or one time (optional)
(\\d{2}) - capturing group with two digits - ddd
(\\d+) - capturing group with 1+ digits - telefone
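A quick way to sanity-check the pattern outside Hive is to run it through Python's re module. This is only a sketch: Hive uses Java regular expressions, Python is a stand-in here, and the helper name split_phone and the extra edge-case input are assumptions made for illustration. Note that a Python raw string takes single backslashes where the Hive string literal needs them doubled.

```python
import re

# Same pattern as in the Hive query, written as a Python raw string.
PATTERN = re.compile(r'^(?:55)?(\d{2})(\d+)')

def split_phone(raw):
    """Return (ddd, telefone) for a digit string, or None if it does not match."""
    m = PATTERN.match(raw)
    return (m.group(1), m.group(2)) if m else None

samples = ['5566997000000', '5521997000001', '24997000011']
results = [split_phone(s) for s in samples]

# Hypothetical edge case: area code 55 stored without the country code.
# The optional (?:55)? consumes the area code, so the split shifts by two digits.
edge = split_phone('55997000000')

print(results)
print(edge)
```

The edge case suggests that, if such rows can exist in the table, the length of the string may need to be checked before deciding whether a leading 55 is a country code.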

Related

is there a way to extract 70% of character from a string using bigquery?

I have column with Names of various length like :
Name | ID
Avi | 01
Li | 02
Amandeep | 03
I want to extract 70% of characters.
I am using :
substring(Name,1, (length(Name)-5))
But this does not work when length(name) is less than 2 or 3
I think you want:
SELECT Name, ID, SUBSTRING(Name, 1, CAST(CEIL(0.7*LENGTH(Name)) AS INT64)) AS Name70Pct
FROM yourTable;
Here we are taking 70% of the length of the name. I wrap with CEIL() to ensure that a name of one character will at least return that one character.
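The rounding behaviour can be checked with a small Python sketch of the same arithmetic. The function name first_70_percent is an assumption for illustration; Python slicing is 0-based while SUBSTRING is 1-based, so name[:keep] plays the role of SUBSTRING(Name, 1, keep).

```python
import math

def first_70_percent(name):
    # Number of characters to keep: 70% of the length, rounded up.
    keep = math.ceil(0.7 * len(name))
    return name[:keep]

names = ['Avi', 'Li', 'Amandeep']
shortened = [first_70_percent(n) for n in names]

print(shortened)
```

Because of the rounding up, very short names come back whole, which is the case the length(Name)-5 attempt could not handle.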

RegEx with spaces and delimiters

I have a two columns with the following data:
Column 1: BIG123 - Telecommunications (John Barrot)
Column 2: 7 Congressional 1 - Toward
The data format is the same, with spaces and the "-" as the delimiter for each column, but the organization, names, and beginning code can be longer or shorter than what you see here (instead of Telecommunications it can be CEO, or instead of John Barrot it can be Guy Rodriguez, etc.). I need to extract the following:
(Column name first, then the value)
Organization: Telecommunications
Supervisor: John Barrot
Profile: Congressional 1 - Toward
I have been using the following cheat sheet but I am still having issues extracting: https://cheatography.com/davechild/cheat-sheets/regular-expressions/
I have tried regex_extract(column1, [A-Z][a-z]) and I only get the first two letters of column 1 after the "-".
Any help would be great.
Thanks,
DW
With your example, try the following:
with sample_data as (
select 'BIG123 - Telecommunications (John Barrot)' AS COLUMN_1, '7 Congressional 1 - Toward' as COLUMN_2
)
select regexp_extract(COLUMN_1, r'.+-\s(\S+)') as Organization
, regexp_extract(COLUMN_1, r'.+\((.+\w)') as Supervisor
, regexp_extract(COLUMN_2, r'\d+\s(.+)') as Profile
from sample_data
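To see what each of the three patterns captures, here is a rough check using Python's re module as a stand-in (BigQuery uses the RE2 engine, so treat this as a sketch, not a guarantee). The helper extract and the two-word organization row are assumptions added for illustration.

```python
import re

column_1 = 'BIG123 - Telecommunications (John Barrot)'
column_2 = '7 Congressional 1 - Toward'

def extract(pattern, text):
    """Return the first capture group of the first match, or None."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

organization = extract(r'.+-\s(\S+)', column_1)  # word right after "- "
supervisor = extract(r'.+\((.+\w)', column_1)    # text inside the parentheses
profile = extract(r'\d+\s(.+)', column_2)        # everything after the leading number

# Hypothetical row with a two-word organization: \S+ stops at the first space.
truncated = extract(r'.+-\s(\S+)', 'BIG123 - Human Resources (Guy Rodriguez)')

print(organization, supervisor, profile, truncated, sep=' | ')
```

If organizations can contain spaces, the first pattern would need to capture up to the opening parenthesis instead of using \S+.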

BigQuery query to get all occurrences that match a certain pattern after a slash

I have a table in bigquery that has a column with many rows that contain strings that can be something like this
row 1 mmmmm hhhhh ccccc tttt /tst /kl:2 /aaaa nnnn
row 2 ddd bb /lamp /mode:2 /nana
row 3 /dada
I need to catch all of the:
tst, kl and aaaa, lamp, mode, nana, dada (meaning all of the words after slash)
How do I do that?
I tried something like this, but it did not find them:
SELECT column1,
SPLIT(REGEXP_REPLACE(column1,r'(\/.*?(\s|$))', ',')) AS regex_found
FROM table
You could use a query similar to this one. It returns a single repeated column, tokens, containing the words you're interested in:
SELECT
REGEXP_EXTRACT_ALL(column, r'\/([^ :]+)') AS tokens
FROM
UNNEST(['mmmmm hhhhh ccccc tttt /tst /kl:2 /aaaa nnnn', 'ddd bb /lamp /mode:2 /nana', '/dada']) AS column
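The same pattern can be tried in Python, where re.findall plays roughly the role of REGEXP_EXTRACT_ALL and returns the capture group for every match. This is a sketch using Python's regex engine as a stand-in for BigQuery's; the forward slash does not need escaping in either.

```python
import re

rows = [
    'mmmmm hhhhh ccccc tttt /tst /kl:2 /aaaa nnnn',
    'ddd bb /lamp /mode:2 /nana',
    '/dada',
]

# After each slash, capture characters up to the next space or colon.
tokens = [re.findall(r'/([^ :]+)', row) for row in rows]

for row_tokens in tokens:
    print(row_tokens)
```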

How to select the last sentence starting with a number from a column

I want to get the last sentence that starts with a number in a column.
Example Code:
WITH q AS (SELECT '1.abc def ghi 2.sdadasd. rewtretrtr1 3. hjgjhjhgj, yo whats. 4. gog mi man. Its been' AS sentence FROM DUAL)
SELECT SUBSTR(sentence, INSTR(sentence,'.',-1) + 1)
FROM q;
My Output
Its been
Expected Output
4. gog mi man. Its been
Is this possible in Oracle?
This is a good use case for the handy Oracle regexp function REGEXP_SUBSTR():
SELECT REGEXP_SUBSTR(sentence, '\d\.\D+$') FROM q;
Regexp breakdown:
\d -- a digit
\. -- a dot
\D+ -- as many non-digit characters as possible (at least one)
$ -- end of string
REGEXP_SUBSTR() searches the string for the given regular expression and returns a given occurrence (the first occurrence by default).
Demo on DB Fiddle:
WITH q AS (SELECT '1.abc def ghi 2.sdadasd. rewtretrtr1 3. hjgjhjhgj, yo whats. 4. gog mi man. Its been' AS sentence FROM DUAL)
SELECT REGEXP_SUBSTR(sentence, '\d\.\D+$') FROM q;
| REGEXP_SUBSTR(SENTENCE,'\D\.\D+$') |
| :--------------------------------- |
| 4. gog mi man. Its been |
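For readers without an Oracle instance at hand, the pattern can be checked roughly with Python's re module. This is a sketch with Python as a stand-in engine; re.search returns the leftmost match, similar to the first occurrence that REGEXP_SUBSTR returns by default.

```python
import re

sentence = ('1.abc def ghi 2.sdadasd. rewtretrtr1 3. hjgjhjhgj, '
            'yo whats. 4. gog mi man. Its been')

# A digit, a dot, then only non-digits up to the end of the string.
m = re.search(r'\d\.\D+$', sentence)
last_part = m.group(0) if m else None

print(last_part)
```

Earlier candidates such as "1." or "3." fail because a digit appears before the end of the string, so \D+$ cannot be satisfied from there.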
EDIT
It turns out that you are dealing with much more complex strings:
the portion to capture might contain numbers
the string may contain new line
I hence would suggest a new approach, that relies on REGEXP_REPLACE() to remove the unwanted part of the string.
Consider:
SELECT REGEXP_REPLACE(sentence, '.*\d+\.', '', 1, 0, 'n') FROM q;
Regexp .*\d+\. will greedily match everything from the beginning of the string to the last occurrence of one or more digits followed by a dot. REGEXP_REPLACE will remove that part of the string. The 'n' modifier allows the . character to match the newline character.
With this expression, you get the expected part of the string, only minus the digit(s) and dot at the beginning (that's as good as it gets, since Oracle does not support regex lookaheads... sigh).
Demo on DB Fiddle:
Given this input string:
We have received customer approval on the
warranty nozzle including revised ERO repairs. Please proceed with the repairs.
Please provide photos and damage mapping when complete per customer requests." 9/12/19 MH
10. CHECKING WITH VENDOR ABOUT ECD. 9/13/19
MH11. Per Vendor,
"Originally I quoted a 3-4 week delivery once approved. This month is shot. W
e are booked solid. We estimate a delivery date of 10/11" 9/13/19 MH
The query returns:
Per Vendor,
"Originally I quoted a 3-4 week delivery once approved. This month is shot. W
e are booked solid. We estimate a delivery date of 10/11" 9/13/19 MH
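The effect of the 'n' modifier can be illustrated in Python, where re.DOTALL is the closest equivalent. This is only a sketch: the shortened input text is made up for the example, and Python's engine stands in for Oracle's.

```python
import re

# Hypothetical multi-line note, loosely modelled on the input above.
text = ('Intro 9/12/19\n'
        '10. CHECKING WITH VENDOR. 9/13/19\n'
        'MH11. Per Vendor,\n'
        '"delivery date of 10/11" 9/13/19 MH')

# With DOTALL, "." also matches newlines, so the greedy .* runs
# up to the last "digits followed by a dot" in the whole text.
with_flag = re.sub(r'.*\d+\.', '', text, flags=re.DOTALL)

# Without it, "." stops at each newline and every line is trimmed separately.
without_flag = re.sub(r'.*\d+\.', '', text)

print(repr(with_flag))
print(repr(without_flag))
```

The pattern as written leaves the space that follows the dot at the start of the result, so a trim on the output may be wanted.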
This is quite tricky, if your sentences can contain digits. But it can be done in Oracle:
WITH q AS (
SELECT '1.abc def ghi 2.sdadasd. rewtretrtr1 3. hjgjhjhgj, yo whats. 4. gog mi man. Its been' AS sentence FROM DUAL union all
SELECT '1.abc def ghi 2.sdadasd. rewtretrtr1 3. hjgjhjhgj, yo whats. 4. gog mi 3 men. Its been' AS sentence FROM DUAL
)
SELECT regexp_substr(sentence, '\d[.](\D|\d+[^.])*$')
FROM q;
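A rough Python check of this pattern on the two test rows, again with Python's re module as a stand-in for Oracle's engine. After the leading digit and dot, the pattern accepts either non-digits or a run of digits that is not followed by a dot.

```python
import re

sentences = [
    '1.abc def ghi 2.sdadasd. rewtretrtr1 3. hjgjhjhgj, yo whats. 4. gog mi man. Its been',
    '1.abc def ghi 2.sdadasd. rewtretrtr1 3. hjgjhjhgj, yo whats. 4. gog mi 3 men. Its been',
]

pattern = re.compile(r'\d[.](\D|\d+[^.])*$')

def last_numbered(sentence):
    """Return the matched tail of the sentence, or None if nothing matches."""
    m = pattern.search(sentence)
    return m.group(0) if m else None

matches = [last_numbered(s) for s in sentences]

for item in matches:
    print(item)
```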

Hive UDF to return multiple column output

How do I create a UDF that takes a String and returns multiple Strings?
The UDFs I have seen so far could only give one output. How can I get multiple fields as output from a UDF?
The simplest example would be an implementation of name -> FirstName, LastName.
I am not looking for an alternate solution to split names, but for an API / UDF that would help implement such needs.
Let's say nameSplitter is my UDF:
Select age,nameSplitter(name) as firstName,LastName from myTable;
Input:
------------------------
Age | Name
------------------------
24 | John Smit
13 | Sheldon Cooper
-------------------------
Output:
-----------------------------------
Age | First Name | Last Name
-----------------------------------
24 | John | Smit
13 | Sheldon | Cooper
-----------------------------------
Use the split() function; it splits a string around a regexp pattern and returns an array:
select age,
NameSplitted[0] as FirstName,
NameSplitted[1] as LastName
from
(
select age,
split(Name,' +') as NameSplitted
from myTable
)s;
Or just select age, split(Name,' +')[0] FirstName, split(Name,' +')[1] LastName from myTable;
Pattern ' +' means one or more spaces.
Also, if you have names of three words or more and you want to take only the first word as the first name and everything else as the last name, or to apply a more complex rule, you can use the regexp_extract function, as in this example:
hive> select regexp_extract('Johannes Chrysostomus Wolfgangus Theophilus Mozart', '^(.*?)(?: +)(.*)$', 1);
OK
Johannes
Time taken: 1.144 seconds, Fetched: 1 row(s)
hive> select regexp_extract('Johannes Chrysostomus Wolfgangus Theophilus Mozart', '^(.*?)(?: +)(.*)$', 2);
OK
Chrysostomus Wolfgangus Theophilus Mozart
Time taken: 0.692 seconds, Fetched: 1 row(s)
The pattern here means: '^' is the beginning of the string; (.*?) is the first capturing group, matching as few characters as possible (non-greedy); (?: +) is a non-capturing group matching one or more spaces; (.*) is the last capturing group, greedily matching any number of characters; and '$' is the end of the string.
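Both approaches can be tried outside Hive with a short Python sketch. Python's re module is a stand-in for Hive's Java regex engine, and the helper name split_name, along with its fallback for single-word names, is an assumption made for illustration.

```python
import re

def split_name(full_name):
    """Split into (first word, everything after the first run of spaces)."""
    m = re.match(r'^(.*?)(?: +)(.*)$', full_name)
    # Assumed fallback: a name without spaces has an empty last name.
    return (m.group(1), m.group(2)) if m else (full_name, '')

# Equivalent of split(Name, ' +'): an array of the space-separated words.
parts = re.split(r' +', 'John Smit')

mozart = split_name('Johannes Chrysostomus Wolfgangus Theophilus Mozart')

print(parts)
print(mozart)
```

Because the first group is non-greedy, it stops at the first run of spaces, and the greedy last group takes the rest of the string.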