How to extract the words inside the 2nd brackets using Regexp_extract in Bigquery? - google-bigquery

I have a textPayload column in a BigQuery table containing values like this:
textPayload
# User#Host: root[root] # [44.27.156.25] thread_id: 67301 server_id: 1220687984
I need to extract the user name and the host as separate fields, like this:
User:root Host:44.27.156.25
Every value in this column follows the format shown above.
I started with Select Regexp_Extract(textPayload, but I am unable to work out the regex.
I am new to REGEXP_EXTRACT and cannot extract the second value, which is the host (44.27.156.25).
Can anyone help me extract the host with REGEXP_EXTRACT?

You could use the context of User# in the text payload, and that you want an IP address in square brackets, to find the content you want:
SELECT
textPayload,
REGEXP_EXTRACT(textPayload, r"\bUser#.*?\[(.*?)\]") AS User,
REGEXP_EXTRACT(textPayload, r"\[(\d+\.\d+\.\d+\.\d+)\]") AS Host
FROM yourTable;

Try regexp_extract_all with r'\[(.+?)\]':
select regexp_extract_all('# User#Host: root[root] # [44.27.156.25] thread_id: 67301 server_id: 1220687984', r'\[(.+?)\]')
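
The patterns in both answers can be sanity-checked outside BigQuery. The sketch below uses Python's re module as a stand-in; it is not BigQuery's RE2 engine, but these particular patterns use only syntax both engines share. The variable names are illustrative and not part of the original answers.

```python
import re

# Sample value copied from the question.
payload = "# User#Host: root[root] # [44.27.156.25] thread_id: 67301 server_id: 1220687984"

# Every bracketed value, in order of appearance (mirrors REGEXP_EXTRACT_ALL).
bracketed = re.findall(r"\[(.+?)\]", payload)

# Context-based patterns (mirror the two REGEXP_EXTRACT calls).
user = re.search(r"\bUser#.*?\[(.*?)\]", payload).group(1)
host = re.search(r"\[(\d+\.\d+\.\d+\.\d+)\]", payload).group(1)

print(bracketed)
print(f"User:{user} Host:{host}")
```

With REGEXP_EXTRACT_ALL the host is the second element of the returned array, so it still has to be picked out by position; the context-based patterns return each field directly.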

Related

How do I print the first occurrence of a string after a special character in Hive using regexp_extract or split?

I am having a deep dilemma in hive. My data set in Hive looks like this:
##214628##564#7576#7876
#12771#242###256823
###3264###7236473####3
In each instance, I want to print only the first string after the #. So the output should be something like this:
214628
12771
3264
I tried using the regexp_extract function, but I am getting only NULL values. Since Hive doesn't support regexp_substr, the following syntax doesn't work:
to_number(trim(regexp_substr(col_name,'[^#]+',1,1)))
Any suggestions are welcome!
You can use regexp_replace and then substr combination.
First remove all multiple occurrences of # from the string using regexp_replace().
regexp_replace(col,'#+','#') -- for data '#####123##' this will produce '#123#'
Then remove the first # using substr, and use instr to take everything up to the next #.
substr(substr(str,2),1, instr(substr(str,2),'#')-1) -- this will produce '123'
You can see the whole SQL below.
select substr(substr(str,2),1, instr(substr(str,2),'#')-1) as result
from (
SELECT regexp_replace('#####123##','#+','#') as str) a
I assumed the value always starts with #. If it may not, add a check such as if left(str,1)='#'... and handle it according to the data.
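
The steps above can be traced on the sample rows from the question. This is a Python sketch, not Hive SQL: first_token_via_substr is a hypothetical helper that mirrors the regexp_replace / substr / instr chain, and first_token shows a single-pattern alternative ('[^#]+', the first run of characters that are not #). Whether Hive's regexp_extract behaves identically with that pattern is an assumption to verify on your cluster.

```python
import re

# Sample rows copied from the question.
rows = ["##214628##564#7576#7876", "#12771#242###256823", "###3264###7236473####3"]

def first_token_via_substr(value):
    """Mirror the answer: collapse runs of '#', drop the leading '#', cut at the next '#'."""
    collapsed = re.sub(r"#+", "#", value)   # like regexp_replace(col, '#+', '#')
    rest = collapsed[1:]                    # like substr(str, 2)
    end = rest.find("#")                    # like instr(...) - 1, but 0-based
    return rest[:end] if end != -1 else rest

def first_token(value):
    """Alternative: take the first run of non-'#' characters directly."""
    match = re.search(r"[^#]+", value)
    return match.group(0) if match else None

for row in rows:
    print(first_token_via_substr(row), first_token(row))
```

Unlike the substr chain, the single-pattern version does not depend on the value starting with #.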

SQL to find curly brackets in a JSON file

I am reading a JSON file to find if the file has data.
If it has data the file would look like
{"transactions":[{"id":"132482","postingId":"754","studentId":"12345"}
If the file has no data it would look like
{}
I am trying to email the user only if the file has data.
I tried the regexes [{}], ^{}, /{/} and ^{(*)}$ to detect the no-data case and skip the email.
All of these expressions failed.
SELECT '{"transactions":[{"id":"132482","postingId":"754","studentId":"602000335"}' value
FROM dual
WHERE REGEXP_LIKE ('{"transactions":[{"id":"132482","postingId":"754","studentId":"602000335"}', '^{(*)}$');
Am I missing something?
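Use the regex:
[a-zA-Z0-9_]
It matches any string that contains at least one letter, digit or underscore, so a file with data matches and an empty {} does not. Note that the anchored form ^[a-zA-Z0-9_]+$ only matches strings made up entirely of those characters, so it would never match JSON, which also contains braces, quotes and colons.
SELECT '{"transactions":[{"id":"132482","postingId":"754","studentId":"602000335"}' value
FROM dual
WHERE REGEXP_LIKE ('{"transactions":[{"id":"132482","postingId":"754","studentId":"602000335"}', '[a-zA-Z0-9_]');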
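
The idea of "the file has data if it contains at least one letter, digit or underscore" can be checked quickly outside the database. The sketch below uses Python's re module rather than Oracle's REGEXP_LIKE, and has_data is a hypothetical helper name; it only illustrates how an unanchored character class separates the two sample inputs from the question.

```python
import re

# Sample inputs copied from the question.
with_data = '{"transactions":[{"id":"132482","postingId":"754","studentId":"602000335"}'
empty = "{}"

def has_data(text):
    """True if the text contains at least one letter, digit or underscore."""
    return re.search(r"[A-Za-z0-9_]", text) is not None

print(has_data(with_data))
print(has_data(empty))
```

For anything beyond this yes/no check, parsing the file as JSON is usually more robust than a regex.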

Regexp_Extract BigQuery anything up to "|"

I'm fairly new to coding and I was wondering if you could give me a hand writing some regular expression for BigQuery SQL.
Basically I would like to extract everything before the bar sign "|" for one of my column.
Example:
Source string:
bla-BLABLA-cid=123456_sept1220_blabla--potato-Blah|someMore_string_stuff-IDontNeed
Desired output:
bla-BLABLA-cid=123456_sept1220_blabla--potato-Blah
I thought about using the REGEXP_EXTRACT(string, delimiter) function but I'm totally unable to write some regex (LOL). Therefore I had a look over Stack, and have found stuff like:
SELECT REGEXP_EXTRACT( String_Name , "\S*\s*\|" ) ,
# or
SELECT REGEXP_EXTRACT( String_Name , '.+?(?=|)')
But every time I get error messages like " invalid perl operator: (?= " or "Illegal escape space"
Would you have any suggestions on why I get these messages and/or how could I proceed to extract these strings?
Many many thanks in advance <3
You can use SPLIT instead:
SELECT SPLIT("bla-BLABLA-cid=123456_sept1220_blabla--potato-Blah|someMore_string_stuff-IDontNeed", "|")[OFFSET(0)]
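To get rid of the "Illegal escape" error, prefix the pattern string with r:
SELECT REGEXP_EXTRACT(String_Name, r'\S*\s*\|')
This is the syntax for a raw string constant, so the backslashes reach the regex engine unchanged; the BigQuery documentation on string literals explains the details. The "invalid perl operator: (?=" error has a different cause: BigQuery's regex engine (RE2) does not support lookahead.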
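
The two approaches can be compared on the sample string from the question. This sketch uses Python rather than BigQuery, so str.split and re stand in for SPLIT and REGEXP_EXTRACT. The pattern [^|]* (everything up to the first bar) is an illustrative alternative that is not taken from the answers; it needs no lookahead.

```python
import re

# Sample string and desired output copied from the question.
source = "bla-BLABLA-cid=123456_sept1220_blabla--potato-Blah|someMore_string_stuff-IDontNeed"
desired = "bla-BLABLA-cid=123456_sept1220_blabla--potato-Blah"

# Like SPLIT(..., "|")[OFFSET(0)]: keep the part before the first bar.
via_split = source.split("|")[0]

# The raw-string pattern from the answer matches up to AND including the bar.
with_bar = re.search(r"\S*\s*\|", source).group(0)

# Illustrative alternative: a run of non-bar characters from the start.
via_regex = re.match(r"[^|]*", source).group(0)

print(via_split == desired, via_regex == desired, with_bar.endswith("|"))
```

Since the pattern from the answer keeps the trailing bar in its match, either strip it afterwards or use a pattern that excludes it.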

Big Query Regex Extraction

I am trying to extract the item_subtype field from a URL.
This regex works fine to get the first field, item_type:
SELECT REGEXP_EXTRACT('info?item_type=icecream&item_subtype=chocolate/cookies%20cream,vanilla&page=1', r'item_type=(\w+)')
but what is the correct regex to get everything starting from 'chocolate' up to, but not including, '&page=1'?
I have tried this, but can't get it to go any further:
SELECT REGEXP_EXTRACT('info?item_type=icecream&item_subtype=chocolate/cookies%20cream,vanilla&page=1', r'item_subtype=(\w+[^Z])')
Basically, I want to extract 'chocolate/cookies%20cream,vanilla'.
In your case, \w+ only matches one or more letters, digits or underscores. Your expected values may contain other characters, too.
You may use
SELECT REGEXP_EXTRACT('info?item_type=icecream&item_subtype=chocolate/cookies%20cream,vanilla&page=1', r'item_subtype=([^&]+)')
Notes:
item_subtype= - this string is matched as a literal char sequence
([^&]+) - a Capturing group 1 that matches and captures one or more chars other than & into a separate memory buffer that is returned by REGEXP_EXTRACT function.
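
The difference between \w+ and [^&]+ can be seen on the URL from the question. The sketch uses Python's re module as a stand-in for BigQuery's REGEXP_EXTRACT; these patterns use only syntax both engines share, and the variable names are illustrative.

```python
import re

# URL copied from the question.
url = "info?item_type=icecream&item_subtype=chocolate/cookies%20cream,vanilla&page=1"

# \w+ is enough for item_type because the value contains only word characters.
item_type = re.search(r"item_type=(\w+)", url).group(1)

# The attempt from the question stops right after the first non-word character.
attempt = re.search(r"item_subtype=(\w+[^Z])", url).group(1)

# [^&]+ keeps going until the next '&', i.e. the end of the parameter value.
item_subtype = re.search(r"item_subtype=([^&]+)", url).group(1)

print(item_type)
print(attempt)
print(item_subtype)
```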

REGEXP REPLACE with backslashes in Spark-SQL

I have a string containing the \s\ keyword, and I want to replace it with an empty string.
select string,REGEXP_REPLACE(string,'\\\s\\','') from test
But the above statement does not replace it in Spark SQL.
input: \s\help
output: help
I want to use regexp_replace.
To replace one \ in the actual string, you need \\\\ (4 backslashes) in the pattern passed to regexp_replace. See https://stackoverflow.com/a/4025508/9042433 for why 4 backslashes are needed to match a single backslash.
So the required statement becomes:
select name, regexp_replace(name, '\\\\s\\\\', '') from test
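
The two layers of escaping can be made visible in a language with the same kind of string literals. The sketch below uses Python, not Spark SQL: a non-raw Python string literal also halves each pair of backslashes before the regex engine sees the pattern. It assumes Spark SQL's string-literal escaping works the same way, as the answer describes.

```python
import re

# The literal "\\s\\help" holds the actual characters \s\help.
text = "\\s\\help"

# Layer 1 (string literal): "\\\\s\\\\" becomes the 5 characters \\s\\
# Layer 2 (regex engine):   \\s\\ matches a literal \ then s then a literal \
pattern = "\\\\s\\\\"

result = re.sub(pattern, "", text)

print(len(pattern))
print(result)
```

Each backslash in the data costs two backslashes at the regex layer, and each of those costs two more at the string-literal layer, which is where the count of four comes from.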