Split and rejoin part of a string in BigQuery - sql

I'm querying the GitHub sample_files dataset in BigQuery and I want to get the path excluding the filename.
So if I have /path/to/file.txt
I want it to return /path/to
In python I could do something like
"/".join(str.split(a, "/")[0:-1])
but I'm not sure how to do that in bigquery/sql
Any ideas? Thanks!

One method is regexp_replace():
regexp_replace('/path/to/file.txt', '/[^/]+$', '')

I would use REGEXP_EXTRACT as in below example
REGEXP_EXTRACT(full_path, r'(.+)/[^/]*$')
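To see how the two regex answers above behave, here is a minimal Python sketch of the same patterns. It assumes Python's `re` module treats these particular patterns the same way BigQuery's regex engine does; the function names `dirname_replace` and `dirname_extract` are made up for illustration.

```python
import re

def dirname_replace(path: str) -> str:
    # Mirrors regexp_replace(path, '/[^/]+$', ''): drop the last "/name" segment.
    return re.sub(r"/[^/]+$", "", path)

def dirname_extract(path: str):
    # Mirrors REGEXP_EXTRACT(path, r'(.+)/[^/]*$'): capture everything before the last "/".
    match = re.search(r"(.+)/[^/]*$", path)
    return match.group(1) if match else None

for p in ["/path/to/file.txt", "/path/to/"]:
    print(p, dirname_replace(p), dirname_extract(p))
```

In this sketch the two patterns agree on /path/to/file.txt but differ on a path with a trailing slash: the replace pattern leaves /path/to/ unchanged, while the extract pattern returns /path/to.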
If for some reason you need, or are more comfortable with, the same split-and-rejoin approach as in your question, you can use SPLIT as shown below (sample data included for testing):
#standardSQL
WITH `project.dataset.table` AS (
SELECT '/path/to/file.txt' full_path UNION ALL
SELECT '/path/to/'
)
SELECT full_path,
(
SELECT STRING_AGG(part, '/' ORDER BY OFFSET)
FROM UNNEST(SPLIT(full_path, '/')) part WITH OFFSET
WHERE OFFSET < ARRAY_LENGTH(SPLIT(full_path, '/')) - 1
) path
FROM `project.dataset.table`
with output
Row full_path path
1 /path/to/file.txt /path/to
2 /path/to/ /path/to
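For comparison, the offset filter in the query above can be sketched in Python roughly like this. The helper name `parent_path` is hypothetical, and `enumerate` stands in for `WITH OFFSET`.

```python
def parent_path(full_path: str, sep: str = "/") -> str:
    parts = full_path.split(sep)
    # Keep every part whose offset is below len(parts) - 1, like the WHERE clause.
    kept = [part for offset, part in enumerate(parts) if offset < len(parts) - 1]
    # Joining in list order plays the role of STRING_AGG over ordered offsets.
    return sep.join(kept)

for p in ["/path/to/file.txt", "/path/to/"]:
    print(p, parent_path(p))
```

Both sample rows come out as /path/to because a trailing slash produces an empty last part, which is dropped like any other last part.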

Related

Impala / Hive - Extract text from comma delimited string where each separation doesnt match a pattern

Is there a way in Hive or Impala to extract a string from a delimited string, but only where the string I want doesn't match one or more patterns?
For instance, I have a field with IPs (the number varies depending on network adapters):
169.254.182.175,192.168.0.1,10.199.44.111
I would like to extract the IP that doesn't start with 169.254. (there could be many of these) and doesn't equal 192.168.0.1
The IPs can be in any order as well.
I tried doing substr with nested cases, but due to the unknown number of IPs in the string it didn't work out.
Could this be accomplished with regexp_extract or something similar?
Thanks,
You may use regexp_replace with capturing groups for the patterns that you do not want to keep, and reference only the group of interest in the replacement string.
See example below in Impala (impalad version 3.4.0):
select
addr_list,
/*Concat is used just for visualization*/
rtrim(ltrim(regexp_replace(addr_list,concat(
/*Group of 169.254.*.* that should be excluded*/
'(169\\.254\\.\\d{1,3}\\.\\d{1,3})', '|',
/*Another group for 192.168.0.1*/
'(192\\.168\\.0\\.1)', '|',
/*And the group that we need to keep*/
'(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})'
/*So keep the third group in the output.
Other groups will be replaced with empty string*/
), '\\3'), ','), ',') as ip_whitelist
from(values
('169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2' as addr_list),
('10.58.3.142,169.254.2.12'),
('192.168.0.1,192.100.0.2,154.16.171.3')
) as t
addr_list | ip_whitelist
169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2 | 10.199.44.111
10.58.3.142,169.254.2.12 | 10.58.3.142
192.168.0.1,192.100.0.2,154.16.171.3 | 192.100.0.2,154.16.171.3
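regexp_extract behaves differently: the same regex with 3 as the return group returns nothing at all for cases 1 and 3. A likely reason is that regexp_extract only looks at the first match, and in those rows the first match comes from group 1 or group 2, so group 3 is empty.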
select
t.addr_list,
rtrim(ltrim(regexp_replace(addr_list, r.regex, '\\3'), ','), ',') as ip_whitelist,
regexp_extract(addr_list, r.regex, 3) as ip_wl_extract
from(values
('169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2' as addr_list),
('10.58.3.142,169.254.2.12'),
('192.168.0.1,192.100.0.2,154.16.171.3')
) as t
cross join (
select concat(
'(169\\.254\\.\\d{1,3}\\.\\d{1,3})', '|',
'(192\\.168\\.0\\.1)', '|',
'(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})'
) as regex
) as r
addr_list | ip_whitelist | ip_wl_extract
169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2 | 10.199.44.111 |
10.58.3.142,169.254.2.12 | 10.58.3.142 | 10.58.3.142
192.168.0.1,192.100.0.2,154.16.171.3 | 192.100.0.2,154.16.171.3 |
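The difference between replacing every match and extracting from a single match can be reproduced outside Impala. The following is a Python sketch, not Impala code: it assumes Python's `re` module as a stand-in, and the names `whitelist_replace` and `whitelist_extract` are invented for the example.

```python
import re

# Same three alternatives as the Impala pattern; group 3 is the one to keep.
PATTERN = re.compile(
    r"(169\.254\.\d{1,3}\.\d{1,3})"
    r"|(192\.168\.0\.1)"
    r"|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"
)

def whitelist_replace(addr_list: str) -> str:
    # Every match is replaced by its group 3; an unmatched group becomes "" (Python 3.5+).
    return PATTERN.sub(r"\3", addr_list).strip(",")

def whitelist_extract(addr_list: str):
    # Only the first match is inspected, so group 3 is None when that match
    # came from group 1 or group 2.
    first = PATTERN.search(addr_list)
    return first.group(3) if first else None

rows = [
    "169.254.182.175,192.168.0.1,169.254.2.12,10.199.44.111,169.254.0.2",
    "10.58.3.142,169.254.2.12",
    "192.168.0.1,192.100.0.2,154.16.171.3",
]
for row in rows:
    print(row, "|", whitelist_replace(row), "|", whitelist_extract(row))
```

In this sketch only the second row yields a value from the single-match extract, which lines up with the output shown above.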

Regex that matches strings with specific text not between text in BigQuery

I have the following strings:
step_1->step_2->step_3
step_1->step_3
step_1->step_2->step_1->step_3
step_1->step_2->step_1->step_2->step_3
What I would like to do is capture the ones where there is no step_2 between step_1 and step_3.
The results should be like this:
string result
step_1->step_2->step_3 false
step_1->step_3 true
step_1->step_2->step_1->step_3 true
step_1->step_2->step_1->step_2->step_3 false
I have tried to use the negative lookahead but I found out that BigQuery doesn't support it. Any ideas?
You are essentially looking for when the pattern does not exist. The following regex would support that embedded in a case statement. This would not support a scenario where you have both conditions in a single string, however that was not a scenario you listed in your sample data.
Try the following:
with sample_data as (
select 'step_1->step_2->step_3' as string union all
select 'step_1->step_3' union all
select 'step_1->step_2->step_1->step_3' union all
select 'step_1->step_2->step_1->step_2->step_3' union all
select 'step_1->step_2->step_1->step_2->step_2->step_3' union all
select 'step_1->step_2->step_1->step_2->step_2'
)
select
string,
-- CASE WHEN regexp_extract(string, r'step_1->(\w+)->step_3') IS NULL THEN TRUE
CASE WHEN regexp_extract(string, r'1(->step_2)+->step_3') IS NULL THEN TRUE
ELSE FALSE END as result
from sample_data
This results in:
Consider also below option
select string,
not regexp_contains(string, r'step_1->(step_2->)+step_3\b') as result
from your_table
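As a quick sanity check of the pattern above against the sample strings from the question, here is a small Python sketch. It assumes Python's `re` module handles this pattern the same way BigQuery does; the function name `no_step2_before_step3` is made up.

```python
import re

# A step_1 followed by one or more step_2 and then step_3 is the case to reject.
BLOCKED = re.compile(r"step_1->(step_2->)+step_3\b")

def no_step2_before_step3(path: str) -> bool:
    # Equivalent in spirit to NOT REGEXP_CONTAINS(path, pattern).
    return BLOCKED.search(path) is None

samples = [
    "step_1->step_2->step_3",
    "step_1->step_3",
    "step_1->step_2->step_1->step_3",
    "step_1->step_2->step_1->step_2->step_3",
]
for s in samples:
    print(s, no_step2_before_step3(s))
```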
I believe @Daniel_Zagales' answer is the one you were expecting. However, here is a broader solution that may be interesting for your use case. It consists of using arrays:
WITH sample AS (
SELECT 'step_1->step_2->step_3' AS path
UNION ALL SELECT 'step_1->step_3'
UNION ALL SELECT 'step_1->step_2->step_1->step_3'
UNION ALL SELECT 'step_1->step_2->step_1->step_2->step_3'
),
temp AS (
SELECT
path,
SPLIT(REGEXP_REPLACE(path,'step_', ''), '->') AS sequences
FROM
sample)
SELECT
path,
position,
flattened AS current_step,
LAG(flattened) OVER (PARTITION BY path ORDER BY position) AS previous_step,
LEAD(flattened) OVER (PARTITION BY path ORDER BY position) AS following_step
FROM
temp,
temp.sequences AS flattened
WITH
OFFSET AS position
This query returns the following table
The concept is to get an array of the step number (splitting on '->' and erasing 'step_') and to keep the OFFSET (crucial as UNNESTing arrays does not guarantee keeping the order of an array).
The table obtained contains for each path and step of said path, the previous and following step. It is therefore easy to test for instance if successive steps have a difference of 1.
(SELECT * FROM <previous> WHERE ABS(current_step-previous_step) != 1 for example)
(CASTing to INT required)
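The previous/following-step idea can be sketched in Python as well, which may help when reasoning about what the window functions produce. This is only an illustration under the assumption that every step is named step_<integer>; `step_neighbours` and `has_jump` are hypothetical helper names.

```python
def step_neighbours(path: str):
    # Split on '->', strip the 'step_' prefix and cast to int.
    steps = [int(s.replace("step_", "")) for s in path.split("->")]
    rows = []
    for position, current in enumerate(steps):
        # previous/following play the role of LAG and LEAD ordered by position.
        previous = steps[position - 1] if position > 0 else None
        following = steps[position + 1] if position + 1 < len(steps) else None
        rows.append((position, current, previous, following))
    return rows

def has_jump(path: str) -> bool:
    # True when some step differs from its predecessor by something other than 1.
    return any(
        previous is not None and abs(current - previous) != 1
        for _, current, previous, _ in step_neighbours(path)
    )

print(step_neighbours("step_1->step_3"))
print(has_jump("step_1->step_2->step_3"))
```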

Regex: how to get the text between a few colons?

So, I have a lot of strings like the ones below in my database:
product1:1stparty:single_aduls:android:
product2:3rdparty:married_adults:ios:
product3:3rdparty:other_adults:android:
I need a regex to get only the text after the product name and before the device category. So, in the first line I'd get 1stparty:single_aduls, in the second 3rdparty:married_adults and in the third 3rdparty:other_adults. I'm stuck and can't find a way to solve that. Could anyone help me please?
As a regular expression, you can use:
select regexp_extract('product1:1stparty:single_aduls:android:', '^[^:]*:(.*):[^:]*:$')
This returns everything after the first colon and before the penultimate colon.
We can try using REGEXP_REPLACE here:
SELECT REGEXP_REPLACE(val, r"^.*?:|:[^:]+:$", "") AS output
FROM yourTable;
This approach removes either the leading ...: or trailing :...: from the column, leaving behind the content you want.
You can also use the standard split function and access the result array elements by index, which is quite clear to read and understand.
with a as (
select split('product1:1stparty:single_aduls:android:', ':') as splitted
)
select splitted[ordinal(2)] || ':' || splitted[ordinal (3)] as subs
from a
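The regex and split approaches above can be cross-checked on the sample strings with a short Python sketch. It assumes Python's `re` module behaves like the SQL regex functions for these patterns, and the three function names are invented for illustration.

```python
import re

def middle_by_extract(text: str):
    # Capture what lies between the first colon and the penultimate colon.
    match = re.search(r"^[^:]*:(.*):[^:]*:$", text)
    return match.group(1) if match else None

def middle_by_replace(text: str) -> str:
    # Remove the leading "name:" and the trailing ":device:".
    return re.sub(r"^.*?:|:[^:]+:$", "", text)

def middle_by_split(text: str) -> str:
    # Second and third fields (1-based ordinals 2 and 3), rejoined with a colon.
    parts = text.split(":")
    return parts[1] + ":" + parts[2]

samples = [
    "product1:1stparty:single_aduls:android:",
    "product2:3rdparty:married_adults:ios:",
    "product3:3rdparty:other_adults:android:",
]
for s in samples:
    print(middle_by_extract(s), middle_by_replace(s), middle_by_split(s))
```

All three agree on these samples. They would diverge if a row had more or fewer colon-separated fields, since the split version always takes exactly the second and third fields.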
Consider below example
with your_table as (
select 'product1:1stparty:single_aduls:android:' txt union all
select 'product2:3rdparty:married_adults:ios:' union all
select 'product3:3rdparty:other_adults:android:'
)
select *,
(
select string_agg(part, ':' order by offset)
from unnest(split(txt, ':')) part with offset
where offset in (1, 2)
) result
from your_table
with output

Get name after and before certain character in SQL Server

I got the following entry in my database:
\\folder.abc\es\Folder-A\\2020-08-03\namefile.csv
So basically, I want everything after the last \ and before .
the namefile in that example
Thanks in advance.
If you are using an older version of SQL Server that doesn't support string_split, the reverse function comes in handy, as follows.
The steps are: reverse the string, find the position of "." and the position of "\", then apply substring to slice out the data between the two positions. Finally, reverse the result again to get the proper value.
Here is an example
with data
as(select '\\folder.abc\es\Folder-A\\2020-08-03\namefile.csv' as col
)
select reverse(substring(reverse(col)
,charindex('.',reverse(col))+1
,charindex('\',reverse(col))
-
charindex('.',reverse(col))-1
)
) as file_name
from data
+-----------+
| file_name |
+-----------+
| namefile |
+-----------+
dbfiddle link
https://dbfiddle.uk/?rdbms=sqlserver_2014&fiddle=8c0fc11f5ec813671228c362f5375126
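The reverse-and-slice steps described above can be traced in a few lines of Python. This is a sketch of the same idea rather than SQL Server code; it assumes the file name contains a dot and the path contains a backslash, and `file_stem` is a hypothetical name.

```python
def file_stem(path: str) -> str:
    rev = path[::-1]                   # reverse the string
    dot = rev.index(".")               # first "." in the reversed string = last "." in the original
    slash = rev.index("\\")            # first "\" in the reversed string = last "\" in the original
    return rev[dot + 1:slash][::-1]    # slice between the two positions, then reverse back

sample = r"\\folder.abc\es\Folder-A\\2020-08-03\namefile.csv"
print(file_stem(sample))  # namefile
```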
You can use:
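select t.*,
-- subtract 1 so the dot itself is dropped; appending '.' guards values without a dot
left(s.value, charindex('.', s.value + '.') - 1)
from t cross apply
string_split(t.entry, '\') s
-- skip the empty components produced by consecutive backslashes
where s.value <> '' and t.entry like concat('%', s.value);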
This splits the string into different components and matches on the one at the end of the string. If components can repeat, the above can return duplicates. That is easily addressed by moving more logic into the apply:
select t.*, s.val
from t cross apply
(select top (1) left(s.value, charindex('.', s.value + '.') - 1) as val
from string_split(t.entry, '\') s
where s.value <> '' and t.entry like concat('%', s.value)
) s
You can just use String functions (REVERSE,CHARINDEX,SUBSTRING).
SELECT
REVERSE(
SUBSTRING(REVERSE('\\folder.abc\es\Folder-A\\2020-08-03\namefile.csv'),
CHARINDEX('.',REVERSE('\\folder.abc\es\Folder-A\\2020-08-03\namefile.csv'))+1,
CHARINDEX('\',REVERSE('\\folder.abc\es\Folder-A\\2020-08-03\namefile.csv'))-
CHARINDEX('.',REVERSE('\\folder.abc\es\Folder-A\\2020-08-03\namefile.csv'))-1))
OR
SELECT
REVERSE
(
SUBSTRING( --get filename
reverse(path), --to get position last \
CHARINDEX('.',reverse(path))+1,
CHARINDEX('\',reverse(path))- CHARINDEX('.',reverse(path))-1)
)

How to count number commas present in a CSV file

I have one CSV file (ABC.txt) which has data as below:
1234,"djjdjd",45566,84774,45666,"djdjd"
I want to count the number of commas present in a row of this CSV file.
How can I get this?
REGEXP_COUNT() can do this easily:
with tbl(data_string) as
(
select '1234,"djjdjd",45566,84774,45666,"djdjd"' from dual
)
select regexp_count(data_string, ',') from tbl;
For a pure (oracle) sql solution, try
SELECT NVL(LENGTH(REGEXP_REPLACE(csv_row, '[^,]', '')), 0) FROM csv_table
assuming that you have the data from your csv file stored in the database.
The query removes all characters except commas from the original string, so determining the length becomes equivalent to counting.
Note that you might want to treat commas inside csv fields differently.
Alternative ( see Alex K.'s comment )
SELECT LENGTH(csv_row) - NVL(LENGTH(REPLACE(csv_row, ',', '')), 0) FROM csv_table
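The counting ideas above, plus the caveat about commas inside quoted fields, can be illustrated with a small Python sketch. It is not Oracle code; it uses Python's standard `csv` module as one possible way to treat quoted commas differently, and the function names are made up.

```python
import csv

def count_commas(row: str) -> int:
    # Direct count, comparable to REGEXP_COUNT(row, ',').
    return row.count(",")

def count_commas_by_length(row: str) -> int:
    # Length difference after removing commas, like LENGTH(row) - LENGTH(REPLACE(row, ',', '')).
    return len(row) - len(row.replace(",", ""))

def count_separators(row: str) -> int:
    # Counts only commas that separate fields; commas inside quoted fields are ignored.
    fields = next(csv.reader([row]))
    return max(len(fields) - 1, 0)

row = '1234,"djjdjd",45566,84774,45666,"djdjd"'
print(count_commas(row), count_commas_by_length(row), count_separators(row))

quoted = '1,"a,b",2'
print(count_commas(quoted), count_separators(quoted))
```

For the row from the question all three counts agree. For a row with a comma inside a quoted field, the raw count is higher than the number of field separators.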