Extract a sub string from a string - google-bigquery

In google-bigquery, I need to pull the string that is between domain** and ** as in the example bellow
The string is under the column "Site_Data"
Can someone help me? 10x!

See example below
#standardSQL
WITH yourTable AS (
SELECT '756-1__6565656565656, tagtype**unmapped,domain**www.sport.com,userarriveddirectly**False' AS Site_Data
)
SELECT
REGEXP_EXTRACT(Site_Data, r'domain\*\*(.*)\*\*') AS x,
Site_Data
FROM yourTable

Do all of the strings have that format? There are a couple of different options, assuming that you always need the third string after the ** delimiter.
1) Use SPLIT, e.g.:
#standardSQL
WITH SampleData AS (
SELECT '756-1__67648582789116,tagtype**unmapped,domain**www.sport.com,userarriveddirectly**False' AS site_data
)
SELECT SPLIT(site_data, '**')[OFFSET(2)] AS visit_type
FROM SampleData;
2) Use REGEXP_EXTRACT, e.g.:
#standardSQL
WITH SampleData AS (
SELECT '756-1__67648582789116,tagtype**unmapped,domain**www.sport.com,userarriveddirectly**False' AS site_data
)
SELECT REGEXP_EXTRACT(site_data, r'[^\*]+\*\*[^\*]+\*\*([^\*]+)') AS visit_type
FROM SampleData;
Taking this a step further, if you want to split the domain and the arrival type, you can use SPLIT again:
#standardSQL
WITH SampleData AS (
SELECT '756-1__67648582789116,tagtype**unmapped,domain**www.sport.com,userarriveddirectly**False' AS site_data
)
SELECT
SPLIT(visit_type)[OFFSET(0)] AS domain,
SPLIT(visit_type)[OFFSET(1)] AS arrival_type
FROM (
SELECT SPLIT(site_data, '**')[OFFSET(2)] AS visit_type
FROM SampleData
);

Related

Geography function over a column

I am trying to use the st_makeline() function in order to create lines for every points and the next one in a single column.
Do I need to create another column with the 2 points already ?
with t1 as(
SELECT *, ST_GEOGPOINT(cast(long as float64) , cast(lat as float64)) geometry FROM `my_table.faissal.trajets_flix`
where id = 1
order by index_loc
)
select index_loc geometry
from t1
Here are the results
Thanks for your help
You seems to want to write this code:
https://cloud.google.com/bigquery/docs/reference/standard-sql/geography_functions#st_makeline
WITH t1 as (
SELECT *, ST_GEOGPOINT(cast(long as float64), cast(lat as float64)) geometry
FROM `my_table.faissal.trajets_flix`
-- WHERE id = 1
)
SELECT id, ST_MAKELINE(ARRAY_AGG(geometry ORDER BY index_loc)) traj
FROM t1
GROUP BY id;
with output:
When visualized on the map.
Consider also below simple and cheap option
select st_geogfromtext(format('linestring(%s)',
string_agg(long || ' ' || lat order by index_loc))
) as path
from `my_table.faissal.trajets_flix`
where id = 1
if applied to sample data in your question - output is
which is visualized as

BQ function to transform caron character from a string to character without caron

I am trying to get rid of a caron characters (č,ć,š,ž) and transform them to c,c,s,z :
with my_table as (
select 'abcčd sštuv' caron, 'abccd sstuv' no_caron union all
select 'uvzž', 'uvzz' union all
select 'ijkl cČd', 'ijkl cCd' union ALL
select 'Ćdef', 'Cdef'
)
SELECT *
FROM my_table
I was trying to get rid of them with
SELECT *,
regexp_replace(caron, r'š', 's') as no_caron
FROM my_table
but I think this is inefficient. I know there is an option to write your on function as described
here, but I have no idea how to use it in my case.
Thanks in advance!
Use below
SELECT *, regexp_replace(normalize(caron, NFD), r"\pM", '') output
FROM my_table
if applied to sample data in your question - output is

Extracting Key-worlds out of string and show them in another column

I need to write a query to extract specific names out of String and have them show in another column for example a column has this field
Column:
Row 1: jasdhj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj
Row 2:jasnasc8918212,ahsahkdjjMina67,
Row 3:kasdhakshd,asda,asdasd,121,121,Sina878788kasas
Key Words: Amin,Mina,Sina
How could I have these key words in another column? I dont want to insert another column but if that's the only solution let me know.
Any help appreciated!
Below is for BigQuery Standard SQL
#standardSQL
WITH keywords AS (
SELECT keyword
FROM UNNEST(SPLIT('Amin,Mina,Sina')) keyword
)
SELECT str, STRING_AGG(keyword) keywords_in_str
FROM `project.dataset.table`
CROSS JOIN keywords
WHERE REGEXP_CONTAINS(str, CONCAT(r'(?i)', keyword))
GROUP BY str
You can test, play with above using dummy data from your question as below
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj' str UNION ALL
SELECT 'jasnasc8918212,ahsahkdjjMina67,' UNION ALL
SELECT 'kasdhakshd,asda,asdasd,121,121,Sina878788kasas'
), keywords AS (
SELECT keyword
FROM UNNEST(SPLIT('Amin,Mina,Sina')) keyword
)
SELECT str, STRING_AGG(keyword) keywords_in_str
FROM `project.dataset.table`
CROSS JOIN keywords
WHERE REGEXP_CONTAINS(str, CONCAT(r'(?i)', keyword))
GROUP BY str
with results as
Row str keywords_in_str
1 jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj Amin,Mina
2 jasnasc8918212,ahsahkdjjMina67, Mina
3 kasdhakshd,asda,asdasd,121,121,Sina878788kasas Sina
to count the no of keywords
#standardSQL
WITH `project.dataset.table` AS (
SELECT 'jasdhMINAj31e31jh123hkkj,12l1,3jjds,Amin,02323rdcsnj' str UNION ALL
SELECT 'jasnasc8918212,ahsahkdjjMina67,' UNION ALL
SELECT 'kasdhakshd,asda,asdasd,121,121,Sina878788kasas'
)
select str,array(select as struct countif(lower(x) ="amin") amin,countif(lower(x) ="mina") mina,countif(lower(x)="sina") sina from unnest(x)x)keyword from
(select str,regexp_extract_all(str,"(?i)(Amin|Mina|Sina)")x from `project.dataset.table`)

Bigquery array of STRINGs to array of INTs

I'm trying to pull an array of INT64 s in BigQuery standard SQL from a column which is a long string of numbers separated by commas (for example, 2013,1625,1297,7634). I can pull an array of strings easily with:
SELECT
SPLIT(string_col,",")
FROM
table
However, I want to return an array of INT64 s, not an array of strings. How can I do that? I've tried
CAST(SPLIT(string_col,",") AS ARRAY<INT64>)
but that doesn't work.
Below is for BigQuery Standard SQL
#standardSQL
WITH yourTable AS (
SELECT 1 AS id, '2013,1625,1297,7634' AS string_col UNION ALL
SELECT 2, '1,2,3,4,5'
)
SELECT id,
(SELECT ARRAY_AGG(CAST(num AS INT64))
FROM UNNEST(SPLIT(string_col)) AS num
) AS num,
ARRAY(SELECT CAST(num AS INT64)
FROM UNNEST(SPLIT(string_col)) AS num
) AS num_2
FROM yourTable
Mikhail beat me to it and his answer is more extensive but adding this as a more minimal repro:
SELECT CAST(num as INT64) from unnest(SPLIT("2013,1625,1297,7634",",")) as num;

Regex pattern inside REPLACE function

SELECT REPLACE('ABCTemplate1', 'Template\d+', '');
SELECT REPLACE('ABC_XYZTemplate21', 'Template\d+', '');
I am trying to remove the part Template followed by n digits from a string. The result should be
ABC
ABC_XYZ
However REPLACE is not able to read regex. I am using SQLSERVER 2008. Am I doing something wrong here? Any suggestions?
SELECT SUBSTRING('ABCTemplate1', 1, CHARINDEX('Template','ABCTemplate1')-1)
or
SELECT SUBSTRING('ABC_XYZTemplate21',1,PATINDEX('%Template[0-9]%','ABC_XYZTemplate21')-1)
More generally,
SELECT SUBSTRING(column_name,1,PATINDEX('%Template[0-9]%',column_name)-1)
FROM sometable
WHERE PATINDEX('%Template[0-9]%',column_name) > 0
You can use substring with charindex or patindex if the pattern being looked for is fixed.
select SUBSTRING('ABCTemplate1',1, CHARINDEX ( 'Template' ,'ABCTemplate1')-1)
My answer expects that "Template" is enough to determine where to cut the string:
select LEFT('ABCTemplate1', CHARINDEX('Template', 'ABCTemplate1') - 1)
Using numbers table..
;with cte
as
(select 'ABCTemplate1' as string--this can simulate your table column
)
select * from cte c
cross apply
(
select replace('ABCTemplate1','template'+cast(n as varchar(2)),'') as rplcd
from
numbers
where n<=9
)
b
where c.string<>b.rplcd
Using Recursive CTE..
;with cte
as
(
select cast(replace('ABCTemplate21','template','') as varchar(100)) as string,0 as num
union all
select cast(replace(string,cast(num as varchar(2)),'') as varchar(100)),num+1
from cte
where num<=9
)
select top 1 string from cte
order by num desc