Avoid handling invalid codepoint in BigQuery

I'm using the following SQL code to decode Unicode escape sequences (such as \u05DE) in BigQuery:
#standardSQL
CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
(SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
FROM UNNEST(SPLIT(s, '\\u')) AS x
WHERE x != ''
)
);
WITH normal AS (
select '\\u05DE\\u05EA\\u05DE\\u05D8\\u05D9\\u05E7\\u05D4123' as edited
), uchars AS (
SELECT DISTINCT
c,
DecodeUnicode(c) uchar
FROM normal,
UNNEST(REGEXP_EXTRACT_ALL(edited, r'(\\u[ABCDEF0-9]{4,8})')) c
)
SELECT
edited,
STRING_AGG(IFNULL(uchar, x), '' ORDER BY pos) decoded
FROM(
SELECT
edited,
pos,
SUBSTR(edited,
SUM(CASE char WHEN '' THEN 1 ELSE 6 END)
OVER(PARTITION BY edited ORDER BY pos) - CASE char WHEN '' THEN 0 ELSE 5 END,
CASE char WHEN '' THEN 1 ELSE 6 END) x,
uchar
FROM normal ,
UNNEST(REGEXP_EXTRACT_ALL(edited, r'(\\u[ABCDEF0-9]{4,8})|.')) char WITH
OFFSET AS pos LEFT JOIN uchars u ON u.c = char
)
GROUP BY edited
The problem is that some of the values I'm handling are not valid input for the DecodeUnicode function above. For example, the part '\u05D4123' does not decode to a valid code point.
What can I change in my code so that the function skips such values and I no longer get the 'Invalid codepoint' error?

One option is to use SAFE.CODE_POINTS_TO_STRING instead of CODE_POINTS_TO_STRING, so that an invalid code point returns NULL instead of raising an error. You will still need to remove the escapes that were not decoded from the result, for example with a regular expression as in the example below:
#standardSQL
CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
(SELECT SAFE.CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
FROM UNNEST(SPLIT(s, '\\u')) AS x
WHERE x != ''
)
);
WITH normal AS (
SELECT '\\u05DE\\u05EA\\u05DE\\u05D8\\u05D9\\u05E7\\u05D4123' AS edited
), uchars AS (
SELECT DISTINCT
c,
DecodeUnicode(c) uchar
FROM normal,
UNNEST(REGEXP_EXTRACT_ALL(edited, r'(\\u[ABCDEF0-9]{4,8})')) c
)
SELECT
edited,
STRING_AGG(IFNULL(uchar, x), '' ORDER BY pos) decoded,
REGEXP_REPLACE(STRING_AGG(IFNULL(uchar, x), '' ORDER BY pos) ,r'\\u[ABCDEF0-9]{4,8}', '') decoded_and_fixed
FROM(
SELECT
edited,
pos,
SUBSTR(edited,
SUM(CASE char WHEN '' THEN 1 ELSE 6 END)
OVER(PARTITION BY edited ORDER BY pos) - CASE char WHEN '' THEN 0 ELSE 5 END,
CASE char WHEN '' THEN 1 ELSE 6 END) x,
uchar
FROM normal ,
UNNEST(REGEXP_EXTRACT_ALL(edited, r'(\\u[ABCDEF0-9]{4,8})|.')) char WITH
OFFSET AS pos LEFT JOIN uchars u ON u.c = char
)
GROUP BY edited
with the result:
Row edited decoded decoded_and_fixed
1 \u05DE\u05EA\u05DE\u05D8\u05D9\u05E7\u05D4123 מתמטיק\u05D4 מתמטיק
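
To see why the last escape fails, it can help to replay the logic outside SQL. The following is a minimal Python sketch, not the BigQuery implementation: the function name decode_unicode_escapes is made up for illustration, and the validity rule (at most 0x10FFFF, no surrogates) is the assumption used to stand in for what SAFE.CODE_POINTS_TO_STRING rejects. The greedy {4,8} pattern swallows the trailing digits, so '\u05D4123' is read as 0x05D4123, which is above the valid range.

```python
import re

MAX_CODE_POINT = 0x10FFFF  # highest valid Unicode code point


def decode_unicode_escapes(text: str) -> str:
    """Replace \\uXXXX escapes with characters, leaving invalid ones as-is."""

    def replace(match):
        value = int(match.group(1), 16)
        # Surrogates (0xD800-0xDFFF) and values above 0x10FFFF are not
        # valid code points, so the escape is returned unchanged.
        if value > MAX_CODE_POINT or 0xD800 <= value <= 0xDFFF:
            return match.group(0)
        return chr(value)

    # Same greedy pattern as the query: 4 to 8 hex digits after "\u".
    return re.sub(r"\\u([0-9A-F]{4,8})", replace, text)


sample = r"\u05DE\u05EA\u05DE\u05D8\u05D9\u05E7\u05D4123"
print(decode_unicode_escapes(sample))
```

Unlike the query above, this sketch keeps the whole undecoded escape (including the trailing 123) in the output, which makes the offending value easy to spot.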

Related

Convert String to Tuple in BigQuery

I have a variable passed as an argument in BigQuery which is in the format "('a','b','c')"
with vars as (
select "{0}" as var1,
)
-- where, {0} = "('a','b','c')"
To use it in BigQuery I need to make it a tuple ('a','b','c').
How can it be done?
Any alternate approach is also welcome.
Example:
with vars as (
select "('a','b','c')" as index
)
select * from `<some_other_db>.table` where index in (
select index from vars)
-- gives me empty results because index is now a string
Present output:
select * from <db_name>.table where index in "('a','b','c')"
Required output:
select * from <db_name>.table where index in ('a','b','c')

Below is for BigQuery Standard SQL
#standardSQL
WITH vars AS (
SELECT "('a','b','c')" AS var
)
SELECT *
FROM `<some_other_db>.table`
WHERE index IN UNNEST((
SELECT SPLIT(REGEXP_REPLACE(var, r'[()\']', ''))
FROM vars
))
You can test and experiment with the above using some dummy data, as in the example below:
#standardSQL
WITH vars AS (
SELECT "('a','b','c')" AS var
), `<some_other_db>.table` AS (
SELECT 1 id, 'a' index UNION ALL
SELECT 2, 'd' UNION ALL
SELECT 3, 'c' UNION ALL
SELECT 4, 'e'
)
SELECT *
FROM `<some_other_db>.table`
WHERE index IN UNNEST((
SELECT SPLIT(REGEXP_REPLACE(var, r'[()\']', ''))
FROM vars
))
with the output:
Row id index
1 1 a
2 3 c
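
The strip-and-split step in the query above can be checked in isolation. Here is a small Python sketch of the same idea; parse_tuple_literal is a hypothetical helper name, and the rows are the dummy data from the example.

```python
import re
from typing import List


def parse_tuple_literal(text: str) -> List[str]:
    """Turn "('a','b','c')" into ['a', 'b', 'c']."""
    # Drop parentheses and single quotes, then split on commas,
    # mirroring SPLIT(REGEXP_REPLACE(var, r'[()\']', '')).
    stripped = re.sub(r"[()']", "", text)
    return stripped.split(",")


allowed = parse_tuple_literal("('a','b','c')")
rows = [(1, "a"), (2, "d"), (3, "c"), (4, "e")]
# Equivalent of WHERE index IN UNNEST(...)
matches = [row for row in rows if row[1] in allowed]
print(matches)  # [(1, 'a'), (3, 'c')]
```

Note that this simple approach assumes the values themselves contain no commas, parentheses, or quotes.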

I think this does what you are asking for:
with vars as ( select "('a','b','c')" as var1)
select as struct
MAX(CASE WHEN n = 0 then var END) as f1,
MAX(CASE WHEN n = 1 then var END) as f2,
MAX(CASE WHEN n = 2 then var END) as f3
from vars v cross join
unnest(SPLIT(REPLACE(REPLACE(var1, '(', ''), ')', ''), ',')) var with offset n;

Possible to Search Partial Matched Strings from same table?

I have a table and lets say the table has items with the item numbers:
12345
12345_DDM
345653
2345664
45567
45567_DDM
I am having trouble creating a query that will get all of the _DDM and the corresponding item that has the same prefix digits.
So in this case I'd want both 12345 and 12345_DDM etc to be returned

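Use LIKE to find the rows that end in _DDM. Because _ is a single-character wildcard in LIKE patterns, escape it so that it matches a literal underscore.
Use EXISTS to find the rows whose number also has a matching _DDM row.
select *
from tablename t1
where columnname LIKE '%\_DDM' ESCAPE '\'
or exists (select 1 from tablename t2
where t1.columnname + '_DDM' = t2.columnname)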
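
The selection rule (a row is kept if it ends in _DDM, or if the same value with _DDM appended also exists) can be sketched in Python as follows. The function name ddm_pairs is made up for illustration; the item numbers are the ones from the question.

```python
def ddm_pairs(items):
    """Return the _DDM items together with their base items."""
    suffix = "_DDM"
    item_set = set(items)  # fast lookup for the EXISTS-style check
    return [
        item
        for item in items
        if item.endswith(suffix) or item + suffix in item_set
    ]


items = ["12345", "12345_DDM", "345653", "2345664", "45567", "45567_DDM"]
print(ddm_pairs(items))  # ['12345', '12345_DDM', '45567', '45567_DDM']
```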

Try this query:
--sample data
;with tbl as (
select col from (values ('12345'),('12345_DDM'),('345653'),('2345664'), ('45567'),('45567_DDM')) A(col)
)
--select query
select col from (
select col,
prefix,
max(case when charindex('_DDM', col) > 0 then 1 else 0 end) over (partition by prefix) [prefixGroupWith_DDM]
from (
select col,
case when charindex('_DDM', col) - 1 > 0 then substring(col, 1, charindex('_DDM', col) - 1) else col end [prefix]
from tbl
) a
) a where [prefixGroupWith_DDM] = 1

SQL LEFT() not working as expected when used with GROUP BY and Partition

I have codes like 1231231A, 1231231A, 3453454B, etc.
I need to group them by their number (ignoring the trailing character, which is a version) and get just one of each. I also need to drop the last character. My code groups them and returns one of each, but it still returns the last character.
Why is it returning the last character when I chop it off?
Expected output is
1231231
3453454
What I'm getting is
1231231A
3453454B
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY T.fldProductDescrip
ORDER BY T.fldEffectiveDate DESC) AS rn
FROM (
-- Insert statements for procedure here
SELECT JST.flduid
,JST.fldEffectiveDate
,(CASE
WHEN RIGHT(fldProductDescrip, 1) LIKE '[0-9]'
THEN fldProductDescrip
ELSE LEFT(fldProductDescrip, DATALENGTH(fldProductDescrip) - 1)
END) as fldProductDescrip
,(
CASE
WHEN PE.fldLogoutDateTime IS NULL
THEN PE.fldESigUser
ELSE ''
END
) AS LoggedIn
,(
CASE
WHEN PE.fldLogoutDateTime IS NULL
THEN PE.fldLoginDateTime
ELSE ''
END
) AS LoggedInDateTime
FROM tblJSJobSheetTemplates JST
INNER JOIN tblJSProducts JP ON JST.fldProductUID = JP.fldUID
INNER JOIN tblJSProductEsig PE ON JP.fldProductDescrip = PE.fldProduct
) AS T
WHERE LoggedIn <> ''
)AS G WHERE rn = 1
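
One possible cause, offered only as a guess because the question does not show the column types: in SQL Server, DATALENGTH returns the number of bytes, while LEN returns the number of characters. If fldProductDescrip is an nvarchar column (an assumption), each character takes two bytes, so DATALENGTH(...) - 1 is larger than the string and LEFT returns it unchanged. The Python sketch below imitates that arithmetic; the left helper is a stand-in for the SQL function, not a real API.

```python
def left(text: str, count: int) -> str:
    """Stand-in for SQL LEFT(): the first `count` characters."""
    return text[:max(count, 0)]


code = "1231231A"
char_len = len(code)                      # what LEN would count: 8
byte_len = len(code.encode("utf-16-le"))  # two bytes per character: 16

print(left(code, byte_len - 1))  # 1231231A (nothing is cut off)
print(left(code, char_len - 1))  # 1231231
```

If that assumption holds, replacing DATALENGTH with LEN in the CASE expression would be the change to try.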

Check palindrome without using string functions with condition

I have a table EmployeeTable.
I want only the records where characters 1 to 5 of employeename form a palindrome, with these additional conditions: if the name has more than 10 characters, check characters 4 to 8; if it has fewer than 7 characters, check characters 2 to 5; and if it has fewer than 5 characters, check all of the characters. Only the names whose checked characters form a palindrome should be displayed.
Examples:
neen will be displayed
neetan will not be selected
kiratitamara will be selected
I tried something like this with string functions for the first case, where the name is less than 5 characters long:
SELECT SUBSTRING(EmployeeName,1,5),* from EmployeeTable where
REVERSE (SUBSTRING(EmployeeName,1,5))=SUBSTRING(EmployeeName,1,5)
I want to do this without string functions.
Can anyone help me with this?

You need at least SUBSTRING(). I have a solution like this (in SQL Server):
DECLARE @txt varchar(max) = 'abcba'
;WITH CTE (cNo, cChar) AS (
SELECT 1, SUBSTRING(@txt, 1, 1)
UNION ALL
SELECT cNo + 1, SUBSTRING(@txt, cNo + 1, 1)
FROM CTE
WHERE SUBSTRING(@txt, cNo + 1, 1) <> ''
)
)
SELECT COUNT(*)
FROM (
SELECT *, ROW_NUMBER() OVER (ORDER BY cNo DESC) as cRevNo
FROM CTE t1 CROSS JOIN
(SELECT Max(cNo) AS strLength FROM CTE) t2) dt
WHERE
dt.cNo <= dt.strLength / 2
AND
dt.cChar <> (SELECT dti.cChar FROM CTE dti WHERE dti.cNo = cRevNo)
The result shows the number of differences; 0 means there are no differences, so the string is a palindrome.
Note:
The current solution is case-insensitive. To make it case-sensitive, compare the strings under a case-sensitive collation such as Latin1_General_BIN.
You can wrap this solution in a scalar-valued function (SVF) or something similar.
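
The counting idea in this answer (compare each character in the first half with its mirror position and count the differences) can be sketched in Python as below. The name count_mismatches is hypothetical, and the lower() calls are an assumption that imitates a case-insensitive collation.

```python
def count_mismatches(text: str) -> int:
    """Count positions in the first half that differ from their mirror."""
    length = len(text)
    mismatches = 0
    # Positions are 1-based to match the cNo column of the CTE.
    for pos in range(1, length // 2 + 1):
        mirror = length - pos + 1
        if text[pos - 1].lower() != text[mirror - 1].lower():
            mismatches += 1
    return mismatches


print(count_mismatches("abcba"))  # 0, so it is a palindrome
print(count_mismatches("abcda"))  # 1
```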

I don't really understand why you don't want to use string functions in your query, but here is one solution: compute everything beforehand.
Add the columns:
ALTER TABLE EmployeeTable
ADD SubString AS
SUBSTRING(EmployeeName,
(
-- start position of the checked range
CASE WHEN LEN(EmployeeName)>10
THEN 4
WHEN LEN(EmployeeName)>7
THEN 2
ELSE 1 END
)
,
(
-- SUBSTRING takes a length, not an end position
CASE WHEN LEN(EmployeeName)>10
THEN 5 -- characters 4 to 8
WHEN LEN(EmployeeName)>7
THEN 4 -- characters 2 to 5
ELSE 5 END -- characters 1 to 5
))
PERSISTED
GO
ALTER TABLE EmployeeTable
ADD Palindrome AS
REVERSE(SUBSTRING(EmployeeName,
(
-- same start position as in SubString
CASE WHEN LEN(EmployeeName)>10
THEN 4
WHEN LEN(EmployeeName)>7
THEN 2
ELSE 1 END
)
,
(
-- same length as in SubString
CASE WHEN LEN(EmployeeName)>10
THEN 5
WHEN LEN(EmployeeName)>7
THEN 4
ELSE 5 END
))) PERSISTED
GO
Then your query will look like this:
SELECT * from EmployeeTable
where Palindrome = SubString
But this is not a good idea. Please tell us why you don't want to use string functions.
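
For comparison, here is a Python sketch of the range rule as the question states it (more than 10 characters: 4 to 8; fewer than 5: all characters; fewer than 7: 2 to 5; otherwise 1 to 5). The function names are made up, and the order in which the overlapping conditions are tested is my reading of the question, not something it spells out.

```python
def checked_range(name: str) -> str:
    """Return the part of the name that has to be a palindrome."""
    n = len(name)
    if n > 10:
        start, end = 4, 8
    elif n < 5:
        start, end = 1, n   # short names are checked in full
    elif n < 7:
        start, end = 2, 5
    else:
        start, end = 1, 5
    # Positions are 1-based and inclusive, as in the question.
    return name[start - 1:end]


def is_selected(name: str) -> bool:
    part = checked_range(name)
    return part == part[::-1]


for name in ["neen", "neetan", "kiratitamara"]:
    print(name, is_selected(name))
```

With the three examples from the question this selects neen and kiratitamara and rejects neetan.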

You could build a list of palindrome words using a recursive query that generates palindromes up to a length of n characters, and then select the employees whose name matches one of them. This may be a really inefficient approach, but it does the trick.
This is a sample query for Oracle; PostgreSQL should support this feature as well, with small differences in syntax. I don't know about other RDBMSs.
with EmployeeTable AS (
SELECT 'ADA' AS employeename
FROM DUAL
UNION ALL
SELECT 'IDA' AS employeename
FROM DUAL
UNION ALL
SELECT 'JACK' AS employeename
FROM DUAL
), letters as (
select chr(ascii('A') + rownum - 1) as letter
from dual
connect by ascii('A') + rownum - 1 <= ascii('Z')
), palindromes(word, len ) as (
SELECT WORD, LEN
FROM (
select CAST(NULL AS VARCHAR2(100)) as word, 0 as len
from DUAL
union all
select letter as word, 1 as len
from letters
)
union all
select l.letter||p.word||l.letter AS WORD, len + 1 AS LEN
from palindromes p
cross join letters l
where len <= 4
)
SEARCH BREADTH FIRST BY word SET order1
CYCLE word SET is_cycle TO 'Y' DEFAULT 'N'
select *
from EmployeeTable
WHERE employeename IN (
SELECT WORD
FROM palindromes
)
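
The generation step (start from the empty word and the single letters, then repeatedly wrap each word in the same letter on both sides) can be sketched in Python like this. The function name and the max_len parameter are illustrative and do not correspond to anything in the Oracle query.

```python
from string import ascii_uppercase


def palindromes_up_to(max_len: int) -> set:
    """Generate every palindrome over A-Z with at most max_len letters."""
    words = {""} | set(ascii_uppercase)  # seeds: empty word and single letters
    frontier = set(words)
    while frontier:
        # Wrapping a palindrome in the same letter gives a longer palindrome.
        frontier = {
            letter + word + letter
            for word in frontier
            for letter in ascii_uppercase
            if len(word) + 2 <= max_len
        }
        words |= frontier
    return words


known = palindromes_up_to(3)
employees = ["ADA", "IDA", "JACK"]
print([name for name in employees if name in known])  # ['ADA']
```

The set grows by a factor of 26 for every two extra letters, which is why this approach becomes impractical quickly.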

DECLARE @cPalindrome VARCHAR(100) = 'SUBI NO ONIBUS'
SET @cPalindrome = REPLACE(@cPalindrome, ' ', '')
;WITH tPalindromo (iNo) AS (
SELECT 1
WHERE SUBSTRING(@cPalindrome, 1, 1) = SUBSTRING(@cPalindrome, LEN(@cPalindrome), 1)
UNION ALL
SELECT iNo + 1
FROM tPalindromo
WHERE SUBSTRING(@cPalindrome, iNo + 1, 1) = SUBSTRING(@cPalindrome, LEN(@cPalindrome) - iNo, 1)
AND LEN(@cPalindrome) > iNo
)
SELECT IIF(MAX(iNo) = LEN(@cPalindrome), 'PALINDROME', 'NOT PALINDROME')
FROM tPalindromo

Oracle sql to case conversion for every adjacent characters in a string

With a condition that no two adjacent characters (from a to z) should be in the same case,
I need to change helloworld to HeLlOwOrLd, and I used a query like this:
SELECT listagg(jumping_char,'') WITHIN GROUP(ORDER BY rn) jumped_word
FROM
(SELECT rn,
CASE
WHEN mod(rn, 2) = 1
THEN upper(split_word)
ELSE lower(split_word)
END jumping_char
FROM
(SELECT regexp_substr('helloworld','.',LEVEL)split_word,
ROWNUM rn
FROM dual
CONNECT BY LEVEL <= LENGTH('helloworld')
)
);
Now I have a string like hello2world, which should become HeLlO2wOrLd.
Any simple or different queries are appreciated. Thanks in advance.

If I understand you correctly, you want to "skip" over non-letters in the input. You can achieve that by using regexp_count() to count the letters up to position rn (instead of simply using rn) in your old solution:
SELECT listagg(jumping_char,'') WITHIN GROUP(ORDER BY rn) jumped_word
FROM
(SELECT rn,
CASE
when mod (regexp_count(substr('hello2world', 1, rn), '[a-zA-Z]'), 2) = 1
THEN upper(split_word)
ELSE lower(split_word)
END jumping_char
FROM
(SELECT regexp_substr('hello2world','.',LEVEL)split_word,
ROWNUM rn
FROM dual
CONNECT BY LEVEL <= LENGTH('hello2world')
)
);
UPDATE:
Here's an alternative solution using the MODEL clause, just for completeness' sake:
with t as
(select 'hello2world' txt from dual)
select listagg(case
when mod(v2.char_cnt, 2) = 1
then upper(v2.txt)
else lower(v2.txt)
end,
'') within group(order by v2.rn)
from (
select
v1.txt,
rownum as rn,
sum(case
when regexp_like(txt, '[a-zA-Z]')
then 1
else 0
end) over (partition by 1 order by rownum) as char_cnt
from (
SELECT TXT
FROM T
MODEL
RETURN UPDATED ROWS
PARTITION BY(ROWNUM RN)
DIMENSION BY (0 POSITION)
MEASURES (TXT ,length(txt) NB_MOT)
RULES
(TXT[FOR POSITION FROM 1 TO NB_MOT[0] INCREMENT 1] =
substr(txt[0], CV(POSITION), 1) )
) v1
) v2
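
Both queries rely on the same rule: keep a running count of the letters seen so far and pick the case from its parity, so that digits and other characters do not advance the count. A minimal Python sketch of that rule follows; alternate_case is a made-up name.

```python
def alternate_case(text: str) -> str:
    """Alternate upper/lower case over letters a-z, skipping other characters."""
    result = []
    letters_seen = 0
    for ch in text:
        if ch.isascii() and ch.isalpha():
            letters_seen += 1
            # Odd-numbered letters become upper case, even-numbered lower case.
            result.append(ch.upper() if letters_seen % 2 == 1 else ch.lower())
        else:
            result.append(ch)  # non-letters pass through and are not counted
    return "".join(result)


print(alternate_case("helloworld"))   # HeLlOwOrLd
print(alternate_case("hello2world"))  # HeLlO2wOrLd
```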
