Single hive query to remove certain text in data - sql

I have a column whose data comes in two formats:
1) "/abc/testapp/v1?FirstName=username&Lastname=test123"
2) "/abc/testapp/v1?FirstName=username"
I want the output to be "/abc/testapp/v1?FirstName=username" in both cases, that is, strip out the part starting with "&Lastname". The idea is to remove Lastname together with its value.
If the data doesn't contain "&Lastname" (the second scenario), the query should still work.
The value of Lastname in the example is "test123", but in general it will be dynamic.
I started with regexp_replace, but I am only able to remove "&Lastname", not its value:
select regexp_replace("/abc/testapp/v1?FirstName=username&Lastname=test123&type=en_US","&Lastname","");
Can someone please help me achieve both with a single Hive query?

Use split function:
with your_data as (--Use your table instead of this example
select stack (2,
"/abc/testapp/v1?FirstName=username&Lastname=test123",
"/abc/testapp/v1?FirstName=username"
) as str
)
select split(str,'&')[0] from your_data;
Result:
_c0
/abc/testapp/v1?FirstName=username
/abc/testapp/v1?FirstName=username
Or use '&Lastname' pattern for split:
select split(str,'&Lastname')[0] from your_data;
This keeps any other &-separated parameters that appear before &Lastname; everything from &Lastname onward is dropped.
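Since the question started from regexp_replace, here is a minimal sketch of that route as well. It assumes Hive's regexp_replace with Java regex syntax and assumes the Lastname value never contains a literal &. The third sample row, with a trailing &type=en_US, is taken from the question's own regexp_replace attempt; the column alias cleaned is just an illustrative name.

```sql
-- Sketch only: removes "&Lastname=<value>" wherever it occurs and leaves
-- strings without "&Lastname" untouched.
with your_data as (--Use your table instead of this example
select stack (3,
"/abc/testapp/v1?FirstName=username&Lastname=test123",
"/abc/testapp/v1?FirstName=username",
"/abc/testapp/v1?FirstName=username&Lastname=test123&type=en_US"
) as str
)
-- [^&]* matches the value up to the next & or the end of the string
select regexp_replace(str, '&Lastname=[^&]*', '') as cleaned from your_data;
```

Unlike the split approach, this should keep parameters that follow Lastname, so the third row would come back as /abc/testapp/v1?FirstName=username&type=en_US.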

Using split works in both cases, with or without the last name. In Hive you do not need a table to select from; you can execute the function directly, like select functionname(...):
select
split("/abc/testapp/v1FirstName=username&Lastname=test123",'&')[0]
select
split("/abc/testapp/v1FirstName=username",'&')[0]
Result :
_c0
/abc/testapp/v1FirstName=username
You can also combine both into a single query:
select
split("/abc/testapp/v1FirstName=username&Lastname=test123",'&')[0],
split("/abc/testapp/v1FirstName=username",'&')[0]
_c0 _c1
/abc/testapp/v1FirstName=username /abc/testapp/v1FirstName=username


How to get the nth match from regexp_matches() as plain text

I have this code:
with demo as (
select 'WWW.HELLO.COM' web
union all
select 'hi.co.uk' web)
select regexp_matches(replace(lower(web),'www.',''),'([^\.]*)') from demo
And the table I get is:
regexp_matches
{hello}
{hi}
What I would like to do is:
with demo as (
select 'WWW.HELLO.COM' web
union all
select 'hi.co.uk' web)
select regexp_matches(replace(lower(web),'www.',''),'([^\.]*)')[1] from demo
Or even the BigQuery version:
with demo as (
select 'WWW.HELLO.COM' web
union all
select 'hi.co.uk' web)
select regexp_matches(replace(lower(web),'www.',''),'([^\.]*)')[offset(1)] from demo
But neither works. Is this possible? If it isn't clear, the result I would like is:
match
hello
hi
Use split_part() instead. Simpler, faster. To get the first word, before the first separator .:
WITH demo(web) AS (
VALUES
('WWW.HELLO.COM')
, ('hi.co.uk')
)
SELECT split_part(replace(lower(web), 'www.', ''), '.', 1)
FROM demo;
See:
Split comma separated column data into additional columns
regexp_matches() returns setof text[], i.e. 0-n rows of text arrays. (Because each regular expression can result in a set of multiple matching strings.)
In Postgres 10 or later, there is also the simpler variant regexp_match() that only returns the first match, i.e. text[]. Either way, the surrounding curly braces in your result are the text representation of the array literal.
You can take the first row and unnest the first element of the array, but since you neither want the set nor the array to begin with, use split_part() instead. Simpler, faster, and less versatile. But good enough for the purpose. And it returns exactly what you want to begin with: text.
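For completeness, the regexp_match() variant mentioned above could be sketched like this, assuming Postgres 10 or later. The parentheses around the function call are needed before the array subscript, and the alias match is only illustrative.

```sql
-- Sketch only: regexp_match() returns a single text[] (or NULL),
-- so [1] picks the first captured group as plain text.
WITH demo(web) AS (
   VALUES
     ('WWW.HELLO.COM')
   , ('hi.co.uk')
   )
SELECT (regexp_match(replace(lower(web), 'www.', ''), '([^.]*)'))[1] AS match
FROM   demo;
```

For the two sample rows this should return hello and hi, the same as the split_part() query.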
I'm a little confused. Doesn't this do what you want?
with demo as (
select 'WWW.HELLO.COM' web
union all
select 'hi.co.uk' web
)
select (regexp_matches(replace(lower(web), 'www.',''), '([^\.]*)'))[1]
from demo
This is basically your query with extra parentheses so it does not generate a syntax error.
Here is a db<>fiddle illustrating that it returns what you want.

Databricks - String manipulation via sql command

I have a table column in Databricks from which I need to extract whatever appears between the 15th and 16th occurrence of the character #, as in the following example:
Column
1234##E#A#1234#01/01/4500#X#*ABCDE#7#1##N#N#N#0#Z.POIUS.LKJS_20200103#0#
Results
Z.POIUS.LKJS_20200103
How can I do this?
select reverse(substring_index(reverse(substring_index('1234##E#A#1234#01/01/4500#X#*ABCDE#7#1##N#N#N#0#Z.POIUS.LKJS_20200103#0#', '#', 16)),'#', 1))
You can just split the string and take the element at index 15 (the array is zero-based, so this is the 16th field), e.g. something like this:
%sql
SELECT *,
regexp_extract( yourCol, '(?:[^#]*(#)){15}(.[^#]+)', 2 ) xregex,
split( yourCol, '#' )[15] AS xsplit
FROM tmp
I was also experimenting with regex, which may be appropriate for some cases too.
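As a possible simplification of the reverse/substring_index answer, substring_index also accepts a negative count, which takes everything to the right of the last delimiter. A minimal sketch, assuming Spark SQL's substring_index as available in Databricks; the alias result is illustrative.

```sql
-- Sketch only: the inner call keeps everything before the 16th '#',
-- the outer call (count = -1) keeps what follows the last remaining '#'.
SELECT substring_index(
         substring_index('1234##E#A#1234#01/01/4500#X#*ABCDE#7#1##N#N#N#0#Z.POIUS.LKJS_20200103#0#', '#', 16),
         '#', -1) AS result
```

For the sample string this should give Z.POIUS.LKJS_20200103 without reversing the string twice.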

Regular expression for getting data after - in sql

I have a column with assignment numbers like: 11827, 27266, 91717, 09818-2, 726252-3, 8716151-0, 827272, 18181
Right now I am selecting the records like this:
select assignment_number from table;
But now I want the column retrieved so that only the numbers are returned, without the -2, -3, etc. suffixes, like:
726252-3 --> 726252
8716151-0 --> 8716151
I know I can use a regex for this, but I do not know how to use it.
This will select everything before the first - character:
^([^-]+)
For 726252-3 it will match 726252.
You would use regexp_substr():
select regexp_substr(assignment_number, '[0-9]+')
This will return the first string of numbers encountered in the string.
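The two answers can be combined: pass the "everything before the first -" pattern to regexp_substr. This is only a sketch and assumes a dialect that provides regexp_substr (for example Oracle or MySQL 8+), since the question does not name one; your_table and the alias assignment_base are hypothetical names.

```sql
-- Sketch only: ^[^-]+ matches from the start of the value up to,
-- but not including, the first '-'. Values without '-' match in full.
select regexp_substr(assignment_number, '^[^-]+') as assignment_base
from your_table;
```

With this pattern 726252-3 should yield 726252, and 11827 should come back unchanged.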

BigQuery SPLIT() and grouping by result

Using SPLIT() & NTH(), I'm splitting a string value, and taking the 2nd substring as the result. I then want to group on that result. However, when I use SPLIT() in conjunction with a GROUP BY, it keeps giving the error:
Error: (L1:55): Cannot group by an aggregate
The result is a string, so why is it not possible to group on it?
For example, this works and returns the correct string:
SELECT NTH(2,SPLIT('FIRST-SECOND','-')) as second_part FROM [FOO.bar] limit 10
But then grouping on the result does not work:
SELECT NTH(2,SPLIT('FIRST-SECOND','-')) as second_part FROM [FOO.bar] GROUP BY second_part limit 10
My best guess would be that you can get an equivalent result by using a subquery. Something like :
SELECT * FROM (Select NTH(2,SPLIT('FIRST-SECOND','-')) as second_part FROM [FOO.bar] limit 10) GROUP BY second_part
I guess the system implements NTH as an aggregate internally.
If there are always just 2 values separated by a delimiter, then a simpler approach would be to use REGEXP_EXTRACT:
SELECT REGEXP_EXTRACT('FIRST-SECOND','-(.*)') as second_part
from [FOO.bar]
GROUP BY second_part
limit 10
I like David's answer - sometimes splitting can get a bit more complicated using RegEx. Extracting the first option from a split command, then GROUPing BY is a very common operation. The way I normally do this in BigQuery is using a REGEXP_EXTRACT as follows:
In this simple example, the column "splitme" is pipe-delimited (|).
SELECT REGEXP_EXTRACT(splitme, r'(?U)^(.*)\|') AS title, COUNT(*) as c
FROM [my_table]
GROUP BY title;
This means, extract the string from the beginning of "splitme" to the first occurrence of a pipe (|). The "(?U)" is the "un-greedy" match flag in the re2 RegEx engine's syntax. Without this flag, if there are multiple pipe-delimited values, this RegEx would match everything up until the last pipe.
In practice, I usually use something like the query below, with N being the number of values in the "list" to skip.
SELECT REGEXP_EXTRACT(string + '|', r'(?U)^(?:.*\|){N}(.*)\|') AS substring
So if I were interested in the third value in the list, I would use:
SELECT
REGEXP_EXTRACT(string + '|', r'(?U)^(?:.*\|){2}(.*)\|') AS substring,
COUNT(1) AS weight
FROM yourtable
GROUP BY 1
More details on re2 syntax here
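The answers above all target BigQuery legacy SQL. If the query can be run in standard SQL instead (an assumption, since the question uses legacy syntax), SPLIT returns an array that can be indexed directly and grouped on. A minimal sketch with made-up sample rows; demo, s, and c are illustrative names.

```sql
-- Sketch only, BigQuery standard SQL (not legacy SQL).
WITH demo AS (
  SELECT 'FIRST-SECOND' AS s UNION ALL
  SELECT 'OTHER-SECOND' UNION ALL
  SELECT 'NODASH'
)
-- SAFE_OFFSET(1) is the zero-based second element, or NULL if it is missing
SELECT SPLIT(s, '-')[SAFE_OFFSET(1)] AS second_part, COUNT(*) AS c
FROM demo
GROUP BY second_part;
```

Rows without a delimiter end up in a NULL group here rather than raising an out-of-bounds error, which is why SAFE_OFFSET is used instead of OFFSET.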

How to get part of the string that matched with regular expression in Oracle SQL

Let's say I have the following string: 'product=1627;color=45;size=7' in some field of the table.
I want to query for the color and get 45.
With this query:
SELECT REGEXP_SUBSTR('product=1627;color=45;size=7', 'color\=([^;]+);?') "colorID"
FROM DUAL;
I get :
colorID
---------
color=45;
1 row selected
Is it possible to get only part of the matched string, 45 in this example?
One way to do it is with REGEXP_REPLACE. You need to define the whole string as a regex pattern and then use just the element you want as the replacement string. In this example the colorID is the third capture group in the pattern, so the replacement is '\3':
SELECT REGEXP_REPLACE('product=1627;color=45;size=7'
, '(.*)(color\=)([^;]+);?(.*)'
, '\3') "colorID"
FROM DUAL;
It is possible there may be less clunky regex solutions, but this one definitely works. Here's a SQL Fiddle.
Try something like this:
SELECT REGEXP_SUBSTR(REGEXP_SUBSTR('product=1627;color=45;size=7', 'color\=([^;]+);?'), '[[:digit:]]+') "colorID"
FROM DUAL;
From Oracle 11g onwards we can specify capture groups in REGEXP_SUBSTR.
SELECT REGEXP_SUBSTR('product=1627;color=45;size=7', 'color=(\d+);', 1, 1, 'i', 1) "colorID"
FROM DUAL;