Safe casting a regexp match from REGEXP_REPLACE

I have a table in BigQuery whose text values contain encoded code points (e.g. &#38;), which I want to replace with the characters they represent.
My strategy is to use a regexp replace that swaps each number for the corresponding character.
If I try:
WITH items as (SELECT "Test &#38; " as item)
SELECT
CODE_POINTS_TO_STRING([SAFE_CAST(REGEXP_EXTRACT(item, r"&#([0-9]{2})") AS INT64)]) as test_replace
FROM items
This produces the output that I want for the entry:
[
{
"test_replace": "&"
}
]
If I try:
WITH items as (SELECT "Test &#38; " as item)
SELECT
REGEXP_REPLACE(
item,
r"&#([0-9]{2});",
CODE_POINTS_TO_STRING([SAFE_CAST("\\1" as INT64)])
) as full_replace
FROM items
This produces a null output:
[
{
"full_replace": null
}
]
However if I hard code the value in:
WITH items as (SELECT "Test &#38; " as item)
SELECT
REGEXP_REPLACE(
item,
r"&#([0-9]{2});",
CODE_POINTS_TO_STRING([SAFE_CAST("38" as INT64)])
) as full_replace
FROM items
This works:
[
{
"full_replace": "Test & "
}
]
I know that the regexp is matching correctly, because if I try:
WITH items as (SELECT "Test &#38; " as item)
SELECT
REGEXP_REPLACE(
item,
r"&#([0-9]{2});",
CONCAT("\\1", "test")
) as part_replace
FROM items
This will return:
[
{
"part_replace": "Test 38test "
}
]
My question is therefore: how do I get SAFE_CAST() to evaluate the regexp match? It seems to be evaluating the string literal "\\1" instead.

Try the approach in the example below:
#standardSQL
CREATE TEMP FUNCTION multiReplace(item STRING, arr ARRAY<STRUCT<x STRING, y STRING>>)
RETURNS STRING
LANGUAGE js AS """
for (var i = 0; i < arr.length; i++) {
item = item.replace(arr[i].x, arr[i].y);
}
return item;
""";
WITH items AS (
SELECT "Test &#38; abc &#39; xyz" AS item UNION ALL
SELECT "abc xyz"
)
SELECT item, multiReplace(item, points) full_replace
FROM (
SELECT
item,
ARRAY(
SELECT AS STRUCT val, CODE_POINTS_TO_STRING([SAFE_CAST(SUBSTR(val, -3, 2) AS INT64)]) point
FROM UNNEST(REGEXP_EXTRACT_ALL(item, r'(&#[0-9]{2};)')) val
) points
FROM items
)
with result
Row | item | full_replace
1 | Test &#38; abc &#39; xyz | Test & abc ' xyz
2 | abc xyz | abc xyz
Option 2
A simpler way to approach the above is to do the whole replacement inside the UDF:
#standardSQL
CREATE TEMP FUNCTION multiReplace(item STRING)
RETURNS STRING
LANGUAGE js AS """
var decodeHtmlEntity = function(str) {
return str.replace(/&#([0-9]{2});/g, function(match, dec) {
return String.fromCharCode(dec);
});
};
return decodeHtmlEntity(item);
""";
WITH items AS (
SELECT "Test &#38; abc &#39; xyz" AS item UNION ALL
SELECT "abc xyz"
)
SELECT item, multiReplace(item) full_replace
FROM items
with the same output
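The NULL in the question most likely comes from evaluation order: the replacement argument of REGEXP_REPLACE is an ordinary expression that is computed before any matching happens, so SAFE_CAST sees the literal placeholder text rather than the captured digits. The UDF avoids this because JavaScript's `replace` invokes the callback once per match. Below is a minimal standalone sketch of both behaviours that runs outside BigQuery. The function names `decodeEntities` and `eagerReplace` are illustrative only, and returning `null` to mimic SQL NULL propagation is an assumption of the sketch.

```javascript
// Decode two-digit decimal entities such as "&#38;" using a per-match callback.
function decodeEntities(text) {
  return text.replace(/&#([0-9]{2});/g, function (match, dec) {
    // dec is the captured digit string, e.g. "38"
    return String.fromCharCode(Number(dec));
  });
}

// Analogue of the failing query: the replacement value is computed once,
// up front, from the placeholder text itself, before any match exists.
function eagerReplace(text) {
  const placeholder = "$1";         // JavaScript's counterpart of "\\1"
  const code = Number(placeholder); // NaN, comparable to SAFE_CAST yielding NULL
  if (Number.isNaN(code)) {
    return null;                    // assumption: mimic SQL NULL propagation
  }
  return text.replace(/&#([0-9]{2});/g, String.fromCharCode(code));
}

console.log(decodeEntities("Test &#38; abc &#39; xyz")); // Test & abc ' xyz
console.log(eagerReplace("Test &#38; "));                // null
```

The callback form is the part that carries over to the UDF in Option 2; the eager form is only there to show why the cast never sees "38".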

Related

Query key values in a json column

I have a table "jobs" in which one of the columns, "check_list" (varchar(max)), holds JSON values. An example value would be:
{
"items":[
{
"name":"machine 1",
"state":"",
"comment":"",
"isReleaseToProductionCheck":true,
"machine_id":10
},
{
"name":"machine 2",
"state":"",
"comment":"",
"isReleaseToProductionCheck":true,
"machine_id":12
}
]
}
Now how would I write a SQL query that returns only the rows where the column "check_list" has an item with machine_id = 12?

In the end, after some trial and error, this was the solution that worked for me. I had to add the ISJSON check because some of the older data was not valid JSON.
WITH jobs (id, workorder, selectedMachine) AS(
SELECT
[id],
[workorder],
(
select
*
from
openjson(check_list, '$.items') with (machine_id int '$.machine_id')
where
machine_id = 12
) as selectedMachine
FROM
engineering_job_schedule
WHERE
ISJSON(check_list) > 0
)
Select
*
from
jobs
where
selectedMachine = 12
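To make the filter logic explicit, here is the same idea sketched outside the database: skip rows whose check_list is not valid JSON, then keep rows where any entry in items has machine_id equal to 12. This is only an illustration in JavaScript. The helper name `rowsWithMachine` and the sample rows are made up, and the try/catch stands in for the ISJSON check.

```javascript
// Keep rows whose check_list JSON has an item with the given machine_id.
function rowsWithMachine(rows, machineId) {
  return rows.filter(function (row) {
    let parsed;
    try {
      parsed = JSON.parse(row.check_list);
    } catch (e) {
      return false; // invalid JSON: plays the role of ISJSON(check_list) > 0
    }
    const items = parsed && Array.isArray(parsed.items) ? parsed.items : [];
    return items.some(function (item) {
      return item !== null && typeof item === "object" && item.machine_id === machineId;
    });
  });
}

// Hypothetical sample rows, shaped like the question's data.
const sampleRows = [
  { id: 1, check_list: '{"items":[{"name":"machine 1","machine_id":10},{"name":"machine 2","machine_id":12}]}' },
  { id: 2, check_list: '{"items":[{"name":"machine 3","machine_id":7}]}' },
  { id: 3, check_list: "not json" },
];

console.log(rowsWithMachine(sampleRows, 12).map(function (r) { return r.id; })); // [ 1 ]
```

Unlike the scalar subquery in the SQL above, `some` does not care whether a row contains more than one matching item.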

create a group of linked items

There is a list of users who buy different product items. I want to group the items by buying behavior: if any user buys two products, those products shall be in the same group, i.e. a purchase links the products.
user | item
1 | cat food
1 | cat toy
2 | cat toy
2 | cat snacks
10 | dog food
10 | dog collar
11 | dog food
11 | candy
12 | candy
12 | apples
15 | paper
In this sample, all cat items shall be grouped together: "cat food" to "cat toy" to "cat snacks". The dog items, candy, and apples should form one group, because users' purchases link them. The paper is another group.
There are about 200 different products in the table, and I need to do a disjoint-set union (DSU).

There are several JavaScript implementations of disjoint-set union (DSU); one of them is used here as a user-defined function (UDF) in BigQuery. The main idea is to use a find and a union function and to store the links in a tree, represented as an array (the implementation is adapted from the gist linked in the code comment).
create temp function DSU(A array<struct<a string,b string>>)
returns array<struct<a string,b string>>
language js as
"""
// https://gist.github.com/KSoto/3300322fc2fb9b270dce2bf1e3d80cf3
// Disjoint-set bigquery
class DSU {
constructor() {
this.parents = [];
}
find(x) {
if(typeof this.parents[x] != "undefined") {
if(this.parents[x]<0) {
return x;
} else {
if(this.parents[x]!=x) {
this.parents[x]=this.find(this.parents[x]);
}
return (this.parents[x]);
}
} else {
this.parents[x]=-1;
return x;
}
}
union(x,y) {
var xpar = this.find(x);
var ypar = this.find(y);
if(xpar != ypar) {
this.parents[xpar]+=this.parents[ypar];
this.parents[ypar]=xpar;
}
}
console_print() {
// console.log(this.parents);
}
}
var dsu = new DSU();
for(var i in A){
dsu.union(A[i].a,A[i].b);
}
var out=[]
for(var i in A){
out[i]={b:dsu.find(A[i].a),a:A[i].a};
}
return out;
""";
with #recursive
your_table as (
SELECT 1 as user, "cat food" as item
UNION ALL SELECT 1, "cat toy"
UNION ALL SELECT 2, "cat snacks"
UNION ALL SELECT 2, "cat toy"
UNION ALL SELECT 10, "dog food"
union all select 10, "dog collar"
union all select 11, "dog food"
union all select 11, "candy"
union all select 12, "candy"
union all select 12, "apples"
union all select 15, "paper"
), helper as (
select distinct a, b
from (
Select user,min(item) as b, array_agg(item) as a_list
from your_table
group by 1
), unnest(a_list) as a
)
Select * except(tmp_count),
first_value(item) over(partition by b order by tmp_count desc,b) as item_most_common
from
(
select * ,
count(item) over(partition by b,item) as tmp_count
from your_table
left join (
select X.a, min(X.b) as b
from (select DSU(array_agg(struct(''||a,''||b))) as X from helper), unnest(X) X
group by 1
order by 1
) as combined
on ''||item=combined.a
)
The data is in the table your_table. A helper table pairs every item a user bought with that user's alphabetically first item (min(item)), which is enough to link all of that user's items. Combined into an array, these pairs are passed to the UDF DSU. The function returns each item in column a and its group in column b. We want the most common item of each group to be shown as the group name, so we use window functions to determine it.
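The find/union idea can also be tried outside BigQuery. The following is a minimal standalone sketch, not the UDF above: it uses a Map instead of an array, links each item to the first item seen for the same user (rather than min(item)), and the names `DisjointSet` and `groupItems` are illustrative.

```javascript
// Minimal disjoint-set union keyed by item name.
class DisjointSet {
  constructor() {
    this.parent = new Map();
  }
  find(x) {
    if (!this.parent.has(x)) {
      this.parent.set(x, x); // first sighting: x is its own root
    }
    let root = x;
    while (this.parent.get(root) !== root) {
      root = this.parent.get(root);
    }
    // Path compression: point every node on the walked path at the root.
    let node = x;
    while (this.parent.get(node) !== root) {
      const next = this.parent.get(node);
      this.parent.set(node, root);
      node = next;
    }
    return root;
  }
  union(x, y) {
    const rx = this.find(x);
    const ry = this.find(y);
    if (rx !== ry) {
      this.parent.set(ry, rx);
    }
  }
}

// Link every item to the first item seen for the same user, then collect groups.
function groupItems(purchases) {
  const dsu = new DisjointSet();
  const firstItemOfUser = new Map();
  for (const { user, item } of purchases) {
    if (!firstItemOfUser.has(user)) {
      firstItemOfUser.set(user, item);
    }
    dsu.union(firstItemOfUser.get(user), item);
  }
  const groups = new Map();
  for (const { item } of purchases) {
    const root = dsu.find(item);
    if (!groups.has(root)) {
      groups.set(root, new Set());
    }
    groups.get(root).add(item);
  }
  return Array.from(groups.values(), function (s) { return Array.from(s).sort(); });
}

// Sample data from the question.
const purchases = [
  { user: 1, item: "cat food" }, { user: 1, item: "cat toy" },
  { user: 2, item: "cat toy" }, { user: 2, item: "cat snacks" },
  { user: 10, item: "dog food" }, { user: 10, item: "dog collar" },
  { user: 11, item: "dog food" }, { user: 11, item: "candy" },
  { user: 12, item: "candy" }, { user: 12, item: "apples" },
  { user: 15, item: "paper" },
];

console.log(groupItems(purchases));
```

On the sample data this yields three groups: the cat items, the dog items together with candy and apples, and paper on its own.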

BigQuery SQL JSON Returning additional rows when current row contains multiple values

I have a table that looks like this
keyA | data:{"value":false}}
keyB | data:{"value":3}}
keyC | data:{"value":{"paid":10,"unpaid":20}}}
For keyA,keyB I can easily extract a single value with JSON_EXTRACT_SCALAR, but for keyC I would like to return multiple values and change the key name, so the final output looks like this:
keyA | false
keyB | 3
keyC-paid | 10
keyC-unpaid | 20
I know I can use UNNEST and JSON_EXTRACT to pull out multiple values and create additional rows, but I am unsure how to combine them so that the key column name is adjusted as well.

Even more generic approach
create temp function extract_keys(input string) returns array<string> language js as """
return Object.keys(JSON.parse(input));
""";
create temp function extract_values(input string) returns array<string> language js as """
return Object.values(JSON.parse(input));
""";
create temp function extract_all_leaves(input string) returns string language js as '''
function flattenObj(obj, parent = '', res = {}){
for(let key in obj){
let propName = parent ? parent + '.' + key : key;
if(typeof obj[key] == 'object'){
flattenObj(obj[key], propName, res);
} else {
res[propName] = obj[key];
}
}
return JSON.stringify(res);
}
return flattenObj(JSON.parse(input));
''';
select col || replace(replace(key, 'value', ''), '.', '-') as col, value,
from your_table,
unnest([struct(extract_all_leaves(data) as json)]),
unnest(extract_keys(json)) key with offset
join unnest(extract_values(json)) value with offset
using(offset)
Applied to the sample data in your question, the output is
[output not preserved]
The benefit of this approach is that it is quite generic and thus can handle any level of nesting in the JSON.
[example table and its output not preserved]
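As a standalone illustration of the flattening and key-renaming steps, the sketch below runs the same logic in plain JavaScript on the three sample rows from the question. It is an approximation rather than the UDFs themselves: `expandRow` is a hypothetical helper, values are stringified with `String`, and null is treated as a leaf value.

```javascript
// Flatten nested objects into dotted paths, e.g. {value:{paid:10}} -> {"value.paid":10}.
function flattenObj(obj, parent = "", res = {}) {
  for (const key in obj) {
    const propName = parent ? parent + "." + key : key;
    if (obj[key] !== null && typeof obj[key] === "object") {
      flattenObj(obj[key], propName, res);
    } else {
      res[propName] = obj[key]; // leaf value (null is kept as a leaf in this sketch)
    }
  }
  return res;
}

// Turn one (col, data) row into output rows, renaming keys like the query does:
// drop every "value" and turn every "." into "-".
function expandRow(col, data) {
  const leaves = flattenObj(JSON.parse(data));
  return Object.keys(leaves).map(function (key) {
    const suffix = key.split("value").join("").split(".").join("-");
    return [col + suffix, String(leaves[key])];
  });
}

console.log(expandRow("keyA", '{"value":false}'));
console.log(expandRow("keyB", '{"value":3}'));
console.log(expandRow("keyC", '{"value":{"paid":10,"unpaid":20}}'));
```

The split/join pairs are used so that every occurrence is replaced, matching the behaviour of SQL's replace.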

Try this one:
WITH sample AS (
SELECT 'keyA' AS col, '{"value":false}' AS data
UNION ALL
SELECT 'keyB' AS col, '{"value":3}' AS data
UNION ALL
SELECT 'keyC' AS col, '{"value":{"paid":10,"unpaid":20}}' AS data
)
SELECT col || IFNULL('-' || k, '') AS col,
IFNULL(v, JSON_VALUE(data, '$.value')) AS data
FROM (
SELECT col, data,
`bqutil.fn.json_extract_keys`(JSON_QUERY(data, '$.value')) AS keys,
`bqutil.fn.json_extract_values`(JSON_QUERY(data, '$.value')) AS vals
FROM sample
) LEFT JOIN UNNEST(keys) k WITH OFFSET ki
LEFT JOIN UNNEST(vals) v WITH OFFSET vi ON ki = vi;

What SQL query is the equivalent to this function for retrieving a list of unique items

I'm trying to change this function into an SQL query (using Room). The goal is to return a list of items with no duplicates.
A duplicate is defined by either the item.id or any combination of linked ids being present.
fun removeDuplicates(items: List<Table>?) : List<Table>?{
val returnItems = ArrayList<Table>()
items?.distinctBy { _item ->
_item.id
}?.forEach { item ->
val LID1 = item.linked_id_1
val LID2 = item.linked_id_2
val isFoundReturnItem = returnItems.firstOrNull {
(it.linked_id_1 == LID2 && it.linked_id_2 == LID1) ||
(it.linked_id_1 == LID1 && it.linked_id_2 == LID2)
}
//only add to our new list if not already present
if(isFoundReturnItem == null)
returnItems.add(item)
}
return returnItems
}

If I read your question right, here is the answer for Microsoft SQL Server. General structure:
Select Distinct Field1, Field2, ...
From Table
Where Field1 between 'a' and 'm'
For your case, DISTINCT returns only distinct rows:
Select Distinct Item
From YourTableName
You can also use GROUP BY, which allows aggregations over the distinct values:
Select Field1, Field2 = max(Field2), ...
From Table
Where Field1 between 'a' and 'm'
Group by Field1

Snowflake get_path() or flatten() array query - to find latest key:value

I have a column 'amp' in a table 'EXAMPLE'. Column 'amp' is an array which looks like this:
[{
"list": [{
"element": {
"x_id": "12356789XXX",
"y_id": "12356789XXX38998",
}
},
{
"element": {
"x_id": "5677888356789XXX",
"y_id": "1XXX387688",
}
}]
}]
How should I query using get_path() or flatten() (or some other alternative) to extract the latest x_id and y_id values?
This example has only 2 elements, but there could be anywhere from 1 to 6000 elements containing x_id and y_id.
Help much appreciated!

Someone may have a more elegant way than this, but you can use a CTE. In the first table expression, grab the max index of the array. In the second part, grab the values you need.
set json = '[{"list": [{"element": {"x_id": "12356789XXX","y_id": "12356789XXX38998"}},{"element": {"x_id": "5677888356789XXX","y_id": "1XXX387688",}}]}]';
create temp table foo(v variant);
insert into foo select parse_json($json);
with
MAX_INDEX(M) as
(
select max("INDEX") MAX_INDEX
from foo, lateral flatten(v, recursive => true)
),
VALS(V, P, K) as
(
select "VALUE", "PATH", "KEY"
from foo, lateral flatten(v, recursive => true)
)
select k as "KEY", V::string as VALUE from vals, max_index
where VALS.P = '[0].list[' || max_index.m || '].element.x_id' or
VALS.P = '[0].list[' || max_index.m || '].element.y_id'
;

Assuming that the outer array ALWAYS contains a single dictionary element, you could use this:
SELECT amp[0]:"list"[ARRAY_SIZE(amp[0]:"list")-1]:"element":"x_id"::VARCHAR AS x_id
,amp[0]:"list"[ARRAY_SIZE(amp[0]:"list")-1]:"element":"y_id"::VARCHAR AS y_id
FROM T
;
Or if you prefer a bit more modularity/readability, you could use this:
WITH CTE1 AS (
SELECT amp[0]:"list" AS _ARRAY
FROM T
)
,CTE2 AS (
SELECT _ARRAY[ARRAY_SIZE(_ARRAY)-1]:"element" AS _DICT
FROM CTE1
)
SELECT _DICT:"x_id"::VARCHAR AS x_id
,_DICT:"y_id"::VARCHAR AS y_id
FROM CTE2
;
Note: I have not used FLATTEN here because I did not see a good reason to use it.
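Both answers rely on the same indexing idea: "latest" is taken to mean the last position in the list, so the element of interest sits at index size minus one. A small JavaScript sketch of that logic follows, assuming the column value has already been parsed into an array shaped like the question's example. `latestIds` is a hypothetical helper name.

```javascript
// Return x_id and y_id of the last entry in amp[0].list, or null if there is none.
function latestIds(amp) {
  const outer = Array.isArray(amp) ? amp[0] : undefined;
  const list = outer ? outer.list : undefined;
  if (!Array.isArray(list) || list.length === 0) {
    return null;
  }
  // Same idea as list[ARRAY_SIZE(list) - 1] in the SQL above.
  const last = list[list.length - 1].element;
  return { x_id: last.x_id, y_id: last.y_id };
}

// Sample value from the question.
const amp = [{
  list: [
    { element: { x_id: "12356789XXX", y_id: "12356789XXX38998" } },
    { element: { x_id: "5677888356789XXX", y_id: "1XXX387688" } },
  ],
}];

console.log(latestIds(amp)); // { x_id: '5677888356789XXX', y_id: '1XXX387688' }
```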
