Convert Eastern Arabic numerals to Western Arabic numerals in BigQuery - google-bigquery

I have a problem: a timestamp written in Eastern Arabic numerals has entered my table. BigQuery doesn't recognise it as a timestamp and will not execute my queries.
I want to convert this:
'٢٠١٨-١٠-١١T١٦:٠١:٤١.٠٤١Z'
into this:
'2018-10-11T16:01:41.041Z'
in BigQuery. Is this possible?

How about this SQL UDF:
CREATE TEMP FUNCTION arabicConvert(input STRING) AS ((
SELECT STRING_AGG(COALESCE(FORMAT('%i', i), letter), '')
FROM (SELECT SPLIT(input, '') x), UNNEST(x) letter
LEFT JOIN (SELECT letter_dict,i FROM (
SELECT SPLIT('٠١٢٣٤٥٦٧٨٩', '') l), UNNEST(l) letter_dict WITH OFFSET i
)
ON letter=letter_dict
));
SELECT arabicConvert('٢٠١٨-١٠-١١T١٦:٠١:٤١.٠٤١Z') converted
This returns 2018-10-11T16:01:41.041Z

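There is an alternative, lighter option :o)
It shifts the code points of the Eastern Arabic digits (U+0660 to U+0669, i.e. 1632 to 1641) down by 1584 so that they land on the ASCII digits (48 to 57):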
CREATE TEMP FUNCTION arabicNumeralsConvert(input STRING) AS ((
CODE_POINTS_TO_STRING(ARRAY(
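-- Shift only the ten Eastern Arabic digits; leave every other character alone
SELECT IF(code BETWEEN 1632 AND 1641, code - 1584, code)
FROM UNNEST(TO_CODE_POINTS(input)) code WITH OFFSET pos
ORDER BY pos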
))
));
WITH t AS (
SELECT '٢٠١٨-١٠-١١T١٦:٠١:٤١.٠٤١Z' str UNION ALL
SELECT '2018-10-12T20:34:57.546Z'
)
SELECT str, arabicNumeralsConvert(str) converted
FROM t
The result is:
str converted
٢٠١٨-١٠-١١T١٦:٠١:٤١.٠٤١Z 2018-10-11T16:01:41.041Z
2018-10-12T20:34:57.546Z 2018-10-12T20:34:57.546Z
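If the built-in TRANSLATE function is available in your BigQuery environment (an assumption worth checking against the current function reference), the same digit mapping can be sketched without a UDF, and the cleaned string can then be cast to a real TIMESTAMP. The names raw_events and ts_text are made up for illustration.

```sql
-- Sketch only: raw_events and ts_text are hypothetical names.
WITH raw_events AS (
  SELECT '٢٠١٨-١٠-١١T١٦:٠١:٤١.٠٤١Z' AS ts_text UNION ALL
  SELECT '2018-10-12T20:34:57.546Z'
)
SELECT
  ts_text,
  -- Map each Eastern Arabic digit to the ASCII digit at the same position.
  TRANSLATE(ts_text, '٠١٢٣٤٥٦٧٨٩', '0123456789') AS converted,
  -- SAFE_CAST yields NULL instead of an error if the text is still not a valid timestamp.
  SAFE_CAST(TRANSLATE(ts_text, '٠١٢٣٤٥٦٧٨٩', '0123456789') AS TIMESTAMP) AS ts
FROM raw_events;
```

Rows where ts comes back NULL are the ones that still need manual attention.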

Related

How to split a string with "."-separated values in SQL

For example, abc.def.efg should be separated into the independent strings abc, def and efg.
Head
abc.def.efg
to
left center right
abc def efg
On SQL Server, with a 3-part delimited string, you can use PARSENAME (it handles at most four parts and counts them from the right):
with t as (
select 'left.centre.right' Head
)
select ParseName(Head,3) L, ParseName(Head,2) C, ParseName(Head,1) R
from t;
On MySQL, you can do:
with t as (
select 'left.centre.right' Head
)
select
substring_index(Head,'.',1) as L,
substring_index(substring_index(Head,'.',2),'.',-1) as M,
substring_index(Head,'.',-1) as R
from t;
Results:
L M R
left centre right
see: DBFIDDLE, and DOCS
Look into the split_part() equivalent for the RDBMS you're using.
For example, in PostgreSQL:
SELECT
split_part(Head, '.', 1) AS "left",
split_part(Head, '.', 2) AS center,
split_part(Head, '.', 3) AS "right"
FROM your_table
EDIT: corrected the indexes, see: https://www.postgresqltutorial.com/postgresql-split_part/
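To try the split_part() approach without an existing table, here is a self-contained PostgreSQL sketch. The CTE name t and column name head are placeholders, and the extra string_to_array column is only an illustration of what to reach for when the number of parts is not known in advance.

```sql
-- PostgreSQL sketch; t and head are placeholder names.
WITH t(head) AS (
  VALUES ('left.centre.right')
)
SELECT
  split_part(head, '.', 1) AS "left",
  split_part(head, '.', 2) AS center,
  split_part(head, '.', 3) AS "right",
  -- string_to_array keeps every part, however many there are.
  string_to_array(head, '.') AS all_parts
FROM t;
```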

ARRAY_AGG not allowed in user-defined function (Standard SQL)

While working on a user-defined function in BigQuery to extract emails from messy data sets, I'm facing an issue: ARRAY_AGG() is not allowed in the body of a temp user-defined function (UDF).
CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS (
ARRAY_AGG(
DISTINCT
(SELECT * FROM
UNNEST(
SPLIT(
REPLACE(
LOWER(
ARRAY_TO_STRING(emails, ",")
)," ", ""
)
)
) AS e where e like '%#%'
) IGNORE NULLS
)[SAFE_OFFSET(index)]
);
SELECT GET_EMAIL(["bob#hotmail.com,test#gmail.com", "12345", "bon#yahoo.com"],1) as email_1
I've tried to bypass ARRAY_AGG by selecting from UNNEST WITH OFFSET and then filtering with a WHERE clause so that the offset equals the index.
However, that hits a column limitation (no more than one column is allowed in the SELECT clause of a scalar subquery), and the error suggests using SELECT AS STRUCT instead.
So I gave SELECT AS STRUCT a try:
CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS (
(SELECT AS STRUCT DISTINCT list.e, list.o FROM
UNNEST(
SPLIT(
REPLACE(
LOWER(
ARRAY_TO_STRING(emails, ", ")
)," ", ""
)
)
) AS list
WITH OFFSET as list.o
WHERE list.e like '%#%' AND list.o = index)
);
SELECT GET_EMAIL(["bob#hotmail.com,test#gmail.com", "12345", "bob#yahoo.com"],1) as email_1
But it doesn't like my DISTINCT, and even after removing it, it complains about parsing e and o.
So I'm out of ideas here; I've probably tied myself in a knot. Can anyone suggest how to do this job inside a UDF? Thanks.
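The version below works. The UDF body has to be a scalar expression, so the subquery is wrapped in a second pair of parentheses, and ARRAY(...) builds the list that SAFE_OFFSET then indexes: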
CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS ((
SELECT ARRAY(
SELECT *
FROM UNNEST(
SPLIT(
REPLACE(
LOWER(
ARRAY_TO_STRING(emails, ",")
)," ", ""
)
)
) AS e WHERE e LIKE '%#%'
)[SAFE_OFFSET(index)]
));
SELECT GET_EMAIL(["bob#hotmail.com,test#gmail.com", "12345", "bon#yahoo.com"], 1) AS email_1
with the result:
Row email_1
1 test#gmail.com
Or the version below, which is just a slight correction of your original query (ARRAY_AGG is moved into a SELECT):
CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS ((
SELECT ARRAY_AGG(e)[SAFE_OFFSET(index)]
FROM UNNEST(
SPLIT(
REPLACE(
LOWER(
ARRAY_TO_STRING(emails, ",")
)," ", ""
)
)
) AS e WHERE e LIKE '%#%'
));
SELECT GET_EMAIL(["bob#hotmail.com,test#gmail.com", "12345", "bon#yahoo.com"], 1) AS email_1
which, of course, gives the same result.
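The question also wanted DISTINCT, which neither version above applies. A minimal sketch of one way to add it while keeping first-appearance order is shown here; the function name GET_DISTINCT_EMAIL, the parameter idx and the extra duplicate in the sample input are all assumptions made for illustration.

```sql
-- Sketch only: GET_DISTINCT_EMAIL and idx are hypothetical names.
CREATE TEMP FUNCTION GET_DISTINCT_EMAIL(emails ARRAY<STRING>, idx INT64) AS ((
  SELECT ARRAY_AGG(e ORDER BY first_pos)[SAFE_OFFSET(idx)]
  FROM (
    -- Keep one row per address and remember where it first appeared.
    SELECT e, MIN(pos) AS first_pos
    FROM UNNEST(
      SPLIT(REPLACE(LOWER(ARRAY_TO_STRING(emails, ",")), " ", ""))
    ) AS e WITH OFFSET AS pos
    WHERE e LIKE '%#%'
    GROUP BY e
  )
));
SELECT GET_DISTINCT_EMAIL(
  ["bob#hotmail.com,test#gmail.com", "12345", "BOB#hotmail.com", "bon#yahoo.com"], 2
) AS email_3;
```

Because the input is lower-cased before grouping, BOB#hotmail.com and bob#hotmail.com count as one address, so the index refers to positions in the de-duplicated list.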

REGEX get all matched patterns by SQL DB2

Hi all,
I need to extract from a string every substring that matches the regex pattern "TTT\d{3}".
For the example string below, I would like to get:
TTT108,TTT109,TTT111,TTT110
The Db2 function I would like to use is REGEXP_REPLACE(str,'REGEX pattern', ',').
The number of matches in each string can be 0, 1, 2, 3, ...
Thank you.
The example:
TTT108(optional);TTT109(optional);TTT111(optional);TTT110optional);ENTITYLIST_2=(optional);ENTITYLIST_3=(optional);Containment_Status=(optional)
If you want to extract the matching tokens instead of replacing the non-matching characters, please check if this helps:
with data (s) as (values
('TTT108(optional);TTT109(optional);TTT111(optional);TTT110optional);ENTITYLIST_2=(optional);ENTITYLIST_3=(optional);Containment_Status=(optional)')
)
select listagg(sst,', ') within group (order by n)
from (
select n,
regexp_substr(s,'(TTT[0-9][0-9][0-9])', 1, n)
from data
cross join (values (1),(2),(3),(4),(5)) x (n) -- any numbers table
where n <= regexp_count(s,'(TTT[0-9][0-9][0-9])')
) x (n,sst)
For any number of tokens & Db2 versions before 11.1:
select id, listagg(tok, ',') str
from
(
values
(1, 'TTT108(optional);TTT109(optional);TTT111(optional);TTT110optional);ENTITYLIST_2=(optional);ENTITYLIST_3=(optional);Containment_Status=(optional)')
) mytable (id, str)
, xmltable
(
'for $id in tokenize($s, ";") let $new := replace($id, "(TTT\d{3}).*", "$1") where matches($id, "(TTT\d{3}).*") return <i>{string($new)}</i>'
passing mytable.str as "s"
columns tok varchar(6) path '.'
) t
group by id;

Bigquery array of STRINGs to array of INTs

I'm trying to pull an array of INT64s in BigQuery standard SQL from a column that holds a long string of numbers separated by commas (for example, 2013,1625,1297,7634). I can pull an array of strings easily with:
SELECT
SPLIT(string_col,",")
FROM
table
However, I want to return an array of INT64s, not an array of strings. How can I do that? I've tried
CAST(SPLIT(string_col,",") AS ARRAY<INT64>)
but that doesn't work.
The query below is for BigQuery Standard SQL:
#standardSQL
WITH yourTable AS (
SELECT 1 AS id, '2013,1625,1297,7634' AS string_col UNION ALL
SELECT 2, '1,2,3,4,5'
)
SELECT id,
(SELECT ARRAY_AGG(CAST(num AS INT64))
FROM UNNEST(SPLIT(string_col)) AS num
) AS num,
ARRAY(SELECT CAST(num AS INT64)
FROM UNNEST(SPLIT(string_col)) AS num
) AS num_2
FROM yourTable
Mikhail beat me to it and his answer is more extensive but adding this as a more minimal repro:
SELECT CAST(num as INT64) from unnest(SPLIT("2013,1625,1297,7634",",")) as num;
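If the column may contain stray spaces or non-numeric tokens, CAST will fail the whole query. A hedged variant using SAFE_CAST is sketched below; the sample string is invented, and the filter is there because BigQuery does not allow NULL elements in an array returned by a query.

```sql
-- Sketch only: the input literal is made-up sample data.
SELECT ARRAY(
  SELECT SAFE_CAST(TRIM(num) AS INT64)
  FROM UNNEST(SPLIT('2013, 1625,abc,7634', ',')) AS num WITH OFFSET AS pos
  -- Drop tokens that are not valid integers instead of raising an error.
  WHERE SAFE_CAST(TRIM(num) AS INT64) IS NOT NULL
  -- Keep the numbers in their original left-to-right order.
  ORDER BY pos
) AS nums;
```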

Get SQL Substring After a Certain Character but before a Different Character

I have some key values that I want to parse out of my SQL Server table. Here are some examples of these key values:
R50470B50469
B17699C88C68AM
R22818B17565C32G16SU
B1444
What I want to get out of each string is all the digits that occur after the character 'B' but before any other letter, such as 'C', if one exists. How can I do this in SQL?
WITH VALS(Val) AS
(
SELECT 'R50470B50469' UNION ALL
SELECT 'B17699C88C68AM' UNION ALL
SELECT 'R22818B17565C32G16SU' UNION ALL
SELECT 'B1444'
)
SELECT SUBSTRING(Tail,0,PATINDEX('%[AC-Z]%', Tail))
FROM VALS
CROSS APPLY
(SELECT RIGHT(Val, LEN(Val) - CHARINDEX('B', Val)) + 'X') T(Tail)
WHERE Val LIKE '%B%'
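An alternative SQL Server sketch of the same idea stops at the first non-digit rather than at the first letter other than 'B'. This is a variation on the answer above, not the author's query, and the aliases vals, tail and b_digits are illustrative.

```sql
-- SQL Server sketch; vals, tail and b_digits are illustrative names.
WITH vals(val) AS
(
    SELECT 'R50470B50469' UNION ALL
    SELECT 'B17699C88C68AM' UNION ALL
    SELECT 'R22818B17565C32G16SU' UNION ALL
    SELECT 'B1444'
)
SELECT v.val,
       -- Appending 'X' guarantees PATINDEX finds a non-digit even when the digits run to the end.
       LEFT(t.tail, PATINDEX('%[^0-9]%', t.tail + 'X') - 1) AS b_digits
FROM vals AS v
CROSS APPLY
    -- Everything after the first 'B'.
    (SELECT SUBSTRING(v.val, CHARINDEX('B', v.val) + 1, LEN(v.val)) AS tail) AS t
WHERE v.val LIKE '%B%';
```

For the four sample keys this yields 50469, 17699, 17565 and 1444.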