I'm trying out the NORMALIZE function with NFKC in BigQuery from the documentation, and I see that I can convert a string to a readable format. For example:
WITH EquivalentNames AS (
SELECT name
FROM UNNEST([
'Jane\u2004Doe',
'\u0026 Hello'
]) AS name
)
SELECT
NORMALIZE(name, NFKC) AS normalized_str
FROM EquivalentNames
GROUP BY 1;
The ampersand character shows up correctly. However, I have a table with a STRING column that has Unicode characters in its values, and I'm not able to use NORMALIZE to convert those to a readable format.
I've also tried some of the other solutions presented in Decode Unicode's to Local language in Bigquery, but nothing is working yet.
Attached is an example of the data:
You posted a question about NORMALIZE, but didn't make your goals clear.
Here I'll answer the question about NORMALIZE, to point out that it probably doesn't do what you are expecting it to do; but at least it's acting as expected.
There are many ways to encode the same string with Unicode. Normalize chooses one, while preserving the string.
See this query:
SELECT *, a=b ab, a=c ac, a=d ad, b=c bc, b=d bd, c=d cd
FROM (
SELECT NORMALIZE('hello ñá 😞', NFC) a
, NORMALIZE('hello ñá 😞', NFKC) b
, NORMALIZE('hello ñá 😞', NFD) c
, NORMALIZE('hello ñá 😞', NFKD) d
)
As you see, every time you get the same string; they just have different non-visible representations.
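To see that those representations really do differ under the hood, here is a quick check of my own (a sketch, not from the original post) comparing the byte lengths of the NFC and NFD forms of a single accented character:
#standardSQL
SELECT
  BYTE_LENGTH(NORMALIZE('ñ', NFC)) AS nfc_bytes,  -- 2 bytes: precomposed U+00F1
  BYTE_LENGTH(NORMALIZE('ñ', NFD)) AS nfd_bytes   -- 3 bytes: 'n' plus combining tilde U+0303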
The \u2004 is a so-called thick space, which is why you think it is not showing correctly: you just see a space.
But if you try some other codes, for example \u2020, you will see it actually shows up even without extra processing with the NORMALIZE function.
As in below:
#standardSQL
WITH EquivalentNames AS (
SELECT name
FROM UNNEST([
'Jane\u2020Doe',
'\u0026 Hello'
]) AS name
)
SELECT
name, NORMALIZE(name, NFKC) AS normalized_str
FROM EquivalentNames
GROUP BY 1
with result
Row name normalized_str
1 Jane†Doe Jane†Doe
2 & Hello & Hello
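Worth noting (my own addition, just a sketch): NFKC does actually fold the thick space \u2004 into a plain space \u0020, which you can verify by looking at the code points rather than the rendered text:
#standardSQL
SELECT
  TO_CODE_POINTS('Jane\u2004Doe')[OFFSET(4)] AS before_nfkc,                 -- 8196 = U+2004 (thick space)
  TO_CODE_POINTS(NORMALIZE('Jane\u2004Doe', NFKC))[OFFSET(4)] AS after_nfkc  -- 32 = U+0020 (regular space)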
Related
I am using Redshift.
I have a table like this (the metrics column is a SUPER type, built with the array() function within Redshift):
user  metrics
red   array(2021, 120)
red   array(2020, 99)
blue  array(2021, 151)
I would like to do:
select user, max(metrics) from table group by user
and get this:
user  metrics
red   array(2021, 120)
blue  array(2021, 151)
Sadly, using this query I only get null values.
Do you know how to handle that?
Thanks
If you are familiar with Redshift Spectrum, the logic is very similar to unnesting an array field when you query an external schema.
In your case, the query is pretty simple:
SELECT t.user, max(metric)
FROM my_schema.my_table as t, t.metrics as metric
GROUP BY 1
If the array contains types other than numerical ones, you can simply cast the element to int or double, like:
max(metric::int)
This way, pure strings such as "hello world" are treated as null, but a string like "33333" is converted to an int.
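Putting it together, a minimal sketch of the casting version (assuming the table and column names from the question; "user" may need quoting since it is a reserved word in Redshift):
-- Sketch only: my_schema.my_table and the column names are placeholders from the question
SELECT t."user", MAX(metric::int) AS max_metric
FROM my_schema.my_table AS t, t.metrics AS metric
GROUP BY 1;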
I have a big problem right now and I really need your help, because I can't find the right answer.
I am currently writing a script that triggers a migration process from a table with raw data (data we received from an Excel file) to a new normalized schema.
My problem is that there is a column PRICE (VARCHAR2 datatype) with a bunch of traps. For example: 540S, 25oo, I200, S000.
And I need to convert it to the correct NUMBER(9,2) format so I can get: 5405, 2500, 1200, 5000 as NUMBER for the previous examples and INSERT INTO my_new_table.
Is there any way I can parse every CHAR of these strings and check certain conditions?
Or is there a better way?
Thank you :)!
One of the wonderful things about Oracle that some other DBs lack is the TRANSLATE function:
SELECT TRANSLATE(price, 'SsIilOoxyz', '5511100') FROM t
This will convert:
S, s to 5
I, i and l to 1
O, o to 0
Remove any x, y or z from the number
The second and third arguments to TRANSLATE define which characters are to be mapped. If the second argument (the from-string) is longer than the third (the to-string), then any character beyond the length of the to-string is deleted from the resulting string. Mapping is direct, based on position:
'SsIilOoxyz',
'5511100'
Look at the columns of the characters; the character above is mapped to the character below:
S->5,
s->5,
I->1,
i->1,
l->1,
O->0,
o->0,
x->removed,
y->removed,
z->removed
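As a quick sanity check (a sketch of my own; the inline sample values are the ones from the question), running the original values through TRANSLATE and TO_NUMBER gives exactly the desired numbers:
SELECT price,
       TO_NUMBER(TRANSLATE(price, 'SsIilOoxyz', '5511100')) AS cleaned_price
FROM (
  SELECT '540S' AS price FROM dual UNION ALL
  SELECT '25oo' FROM dual UNION ALL
  SELECT 'I200' FROM dual UNION ALL
  SELECT 'S000' FROM dual
);
-- 540S -> 5405, 25oo -> 2500, I200 -> 1200, S000 -> 5000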
You can use translate() along with to_number(). Your rules are not exactly clear, but something like this:
select to_number(translate(price, '0123456789IoS', '012345678910'))
from t;
This replaces I with 1, o with 0, and removes S.
I have a column containing a json-string as follows:
[{"answer":"europe-austria-swiss","text":"Österreich, Schweiz"},{"answer":"europe-italy","text":"Italien"},{"answer":"europe-france","text":"Frankreich"}]
I want to extract ALL answers given in ONE column and row, separated by a comma:
europe-austria-swiss, europe-italy, europe-france
I think I tried all possibilities offered by JSON_EXTRACT and JSON_EXTRACT_ARRAY, or replacing parentheses and other signs, but I either only get the first entry extracted (in this case
europe-austria-swiss
) or it splits up in rows as array from which I can no longer extract the strings of "answer".
Has anyone any idea on how to solve that problem? It's very much appreciated!
This column is of course part of a much larger table (if that is relevant anyhow).
I think I know what's going on (please, correct me if I'm wrong).
My best guess is that you are trying something like:
SELECT JSON_EXTRACT(json_text, "$.answer") AS answers
FROM UNNEST([
'{"answer":"europe-austria-swiss","text":"Österreich, Schweiz"},{"answer":"europe-italy","text":"Italien"},{"answer":"europe-france","text":"Frankreich"}'
]) as json_text
This returns:
"europe-austria-swiss"
However, if you change the underlying data to something like this (each line as a JSON string object), it should resolve the issue:
SELECT JSON_EXTRACT(json_text, "$.answer") AS answers
FROM UNNEST([
'{"answer":"europe-austria-swiss","text":"Österreich, Schweiz"}',
'{"answer":"europe-italy","text":"Italien"}',
'{"answer":"europe-france","text":"Frankreich"}'
]) as json_text
Result:
"europe-austria-swiss"
"europe-italy"
"europe-france"
Hope this helps!
Below is for BigQuery Standard SQL
#standardSQL
SELECT (
SELECT STRING_AGG(JSON_EXTRACT_SCALAR(answer, '$.answer'), ' ,')
FROM UNNEST(JSON_EXTRACT_ARRAY(json_string)) answer
) AS answers
FROM `project.dataset.table`
You can test and play with the above using the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT '[{"answer":"europe-austria-swiss","text":"Österreich, Schweiz"},{"answer":"europe-italy","text":"Italien"},{"answer":"europe-france","text":"Frankreich"}]' json_string
)
SELECT (
SELECT STRING_AGG(JSON_EXTRACT_SCALAR(answer, '$.answer'), ' ,')
FROM UNNEST(JSON_EXTRACT_ARRAY(json_string)) answer
) AS answers
FROM `project.dataset.table`
with result
Row answers
1 europe-austria-swiss ,europe-italy ,europe-france
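If you want the exact format from the question (comma followed by a space), just swap the separator argument of STRING_AGG:
STRING_AGG(JSON_EXTRACT_SCALAR(answer, '$.answer'), ', ')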
BigQuery Standard SQL does not seem to allow a period "." in the select statement. Even a simple query (see below) seems to fail. This is a big problem for datasets with field names that contain ".". Is there an easy way to avoid this issue?
select id, time_ts as time.ts
from `bigquery-public-data.hacker_news.comments`
LIMIT 10
Returns error...
Error: Syntax error: Unexpected "." at [1:27]
This also fails...
select * except(detected_circle.center_x )
from [bigquery-public-data:eclipse_megamovie.photos_v_0_2]
LIMIT 10
It depends on what you are trying to accomplish. One interpretation is that you want to return a STRUCT named time with a single field named ts inside of it. If that's the case, you can use the STRUCT operator to build the result:
SELECT
id,
STRUCT(time_ts AS ts) AS time
FROM `bigquery-public-data.hacker_news.comments`
LIMIT 10;
In the BigQuery UI, it will display the result as id and time.ts, where the latter indicates that ts is inside a STRUCT named time.
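For example (a small sketch of my own, reusing the same public table), the nested field can still be referenced with dot notation from an outer query:
SELECT id, t.time.ts
FROM (
  SELECT id, STRUCT(time_ts AS ts) AS time
  FROM `bigquery-public-data.hacker_news.comments`
) AS t
LIMIT 10;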
BigQuery disallows columns in the result whose names include periods, so you'll get an error if you run the following query:
SELECT
id,
time_ts AS `time.ts`
FROM `bigquery-public-data.hacker_news.comments`
LIMIT 10;
Invalid field name "time.ts". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
Elliot's answer is great and addresses the first part of your question, so let me address the second part of it (as it is quite different).
First, I wanted to mention that select modifiers like SELECT * EXCEPT are supported in BigQuery Standard SQL, so instead of
SELECT * EXCEPT(detected_circle.center_x )
FROM [bigquery-public-data:eclipse_megamovie.photos_v_0_2]
LIMIT 10
you should rather have tried
#standardSQL
SELECT * EXCEPT(detected_circle.center_x )
FROM `bigquery-public-data.eclipse_megamovie.photos_v_0_2`
LIMIT 10
and of course now we are back to the issue of using a period in Standard SQL.
So, the above code can only be interpreted as trying to eliminate the center_x field from the detected_circle STRUCT (nullable RECORD). Technically speaking, this makes sense and can be done using the code below:
SELECT *
REPLACE(STRUCT(detected_circle.radius, detected_circle.center_y ) AS detected_circle)
FROM `bigquery-public-data.eclipse_megamovie.photos_v_0_2`
LIMIT 10
... still not clear to me how to use your recommendation to remove the entire detected_circle.*
To remove the entire detected_circle STRUCT, you can simply use:
SELECT * EXCEPT(detected_circle)
FROM `bigquery-public-data.eclipse_megamovie.photos_v_0_2`
LIMIT 10
I currently am working with a large data set that was pre-populated in BigQuery. I have a column of orderID's which have the following set-up: o377412876, o380940924, etc. This is stored in a string. I need to do the following and am running into problems:
1) Strip off the first character using the BigQuery query language
2) Convert (or treat) the remaining values as an integer.
I will then run a join against the values. Now, I would be abundantly happier doing this operation in Python, R, or another language. That said, the challenge I have been given, based on client needs, is to write all the scripts in BigQuery's querying language.
SELECT 10 * INTEGER(REGEXP_REPLACE(x, '^.', ''))
FROM
(SELECT 'o1234' AS x)
12340
You can use the SUBSTR function and SAFE_CAST (in case there are NULL values in your column). INTEGER() does not work in BigQuery Standard SQL.
SELECT SAFE_CAST(SUBSTR(x, 2) AS INT64)
FROM (SELECT 'o1234' AS x)
Output: 1234
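If the goal is the join mentioned in the question, a minimal sketch could look like this (the table and column names here are hypothetical placeholders; only SUBSTR and SAFE_CAST come from the answer above):
-- Hypothetical tables: `project.dataset.orders` has a STRING order_id like 'o377412876',
-- and `project.dataset.items` has a numeric (INT64) order_id.
SELECT o.order_id, i.*
FROM `project.dataset.orders` o
JOIN `project.dataset.items` i
  ON i.order_id = SAFE_CAST(SUBSTR(o.order_id, 2) AS INT64)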