Cannot pass input field of repeated record type into BigQuery UDF - google-bigquery

When I pass an input field of repeated record type into a BigQuery UDF, it keeps saying that the input field is not found.
These are my two rows of data:
{"name":"cynthia", "Persons":[ { "name":"john","age":1},{"name":"jane","age":2} ]}
{"name":"jim","Persons":[ { "name":"mary","age":1},{"name":"joe","age":2} ]}
This is the schema of the data:
[
{"name":"name","type":"string"},
{"name":"Persons","mode":"repeated","type":"RECORD",
"fields":
[
{"name": "name","type": "STRING"},
{"name": "age","type": "INTEGER"}
]
}
]
And this is the query:
SELECT
name,maxts
FROM
js
(
//input table
[dw_test.clokTest_bag],
//input columns
name, Persons,
//output schema
"[
{name: 'name', type:'string'},
{name: 'maxts', type:'string'}
]",
//function
"function(r, emit)
{
emit({name: r.name, maxts: '2'});
}"
)
LIMIT 10
Error I got when trying to run the query:
Error: 5.3 - 15.6: Undefined input field Persons
Job ID: ord2-us-dc:job_IPGQQEOo6NHGUsoVvhqLZ8pVLMQ
Would someone please help?
Thank you.

In your list of input columns, list the leaf fields directly:
//input columns
name, Persons.name, Persons.age,
They'll still appear in their proper structure when you get the records in your UDF.
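For completeness, the query from the question with only the input-columns line changed would then be (an untested sketch, otherwise identical to the original):
SELECT
  name, maxts
FROM
  js
  (
    //input table
    [dw_test.clokTest_bag],
    //input columns: leaf fields, not the record itself
    name, Persons.name, Persons.age,
    //output schema
    "[
      {name: 'name', type:'string'},
      {name: 'maxts', type:'string'}
    ]",
    //function
    "function(r, emit)
    {
      emit({name: r.name, maxts: '2'});
    }"
  )
LIMIT 10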

Related

How to flatten out nested array of strings in json column?

I have the following table:
id  | contents
--- | --------
123 | { blocks: [{ text: "abc" }, { text: "123" }] }
foo | { blocks: [{ text: "bar" }, { text: "moretext" }, { text: "ok" }] }
I want to write a view of the above that looks like:
id  | contents                                                     | raw_text
--- | ------------------------------------------------------------ | --------
123 | {blocks: [{text: "abc"}, {text: "123"}]}                     | abc, 123
foo | {blocks: [{text: "bar"}, {text: "moretext"}, { text: "ok"}]} | bar, moretext, ok
This was the query I tried running:
select post.id,
       array_to_string(array_agg(jsonb_array_elements(post.contents -> 'blocks') ->> 'text')) as paragraphs
from post
group by id
But it results in the error
aggregate function calls cannot contain set-returning function calls.
If a JSON array of all the values is also acceptable, you can use a JSON path query:
select id, contents,
jsonb_path_query_array(contents, '$.blocks[*].text')
from post;
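For the sample rows above, this should return ["abc", "123"] for id 123 and ["bar", "moretext", "ok"] for foo, as jsonb arrays alongside the original columns.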
As there is no simple cast from a JSON array to a native Postgres array, and you need the result as a CSV string, you have to unnest and aggregate with a scalar sub-query:
select id, contents,
(select string_agg(x.item ->> 'text', ', ')
from jsonb_array_elements(contents -> 'blocks') as x(item)) as raw_text
from post;
The reason for your error is that you are nesting a set-returning function inside an aggregate function call, which simply isn't supported.
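If you'd rather keep an aggregate in the main query, the usual rewrite is to move the set-returning function out of the aggregate and into the FROM clause as a lateral join (a sketch; note that contents must then be part of the GROUP BY):
select post.id, post.contents,
       string_agg(x.item ->> 'text', ', ') as raw_text
from post
cross join lateral jsonb_array_elements(post.contents -> 'blocks') as x(item)
group by post.id, post.contents;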

How do I use BigQuery DML to transform some fields of a struct nested within an array, within a struct, within an array?

I think this is a more complex version of the question in Update values in struct arrays in BigQuery.
I'm trying to update some of the fields in a struct, where the struct is heavily nested. I'm having trouble creating the SQL to do it. Here's my table schema:
CREATE TABLE `my_dataset.test_data_for_so`
(
date DATE,
hits ARRAY<STRUCT<search STRUCT<query STRING, other_column STRING>, metadata ARRAY<STRUCT<key STRING, value STRING>>>>
);
This is what the schema looks like in the BigQuery GUI after I create the table:
Here's the data I've inserted:
INSERT INTO `my_dataset.test_data_for_so` (date, hits)
VALUES (
CAST('2021-01-01' AS date),
[
STRUCT(
STRUCT<query STRING, other_column STRING>('foo bar', 'foo bar'),
[
STRUCT<key STRING, value STRING>('foo bar', 'foo bar')
]
)
]
)
My goal is to transform the "search.query" and "metadata.value" fields. For example, uppercasing them, leaving every other column (and every other struct field) in the row unchanged.
I'm looking for a solution involving either manually specifying each column in the SQL, or preferably, one where I can only mention the columns/fields I want to transform in the SQL, omitting all other columns/fields. This is a minimal example. The table I'm working on in production has hundreds of columns and fields.
For example, that row, when transformed this way, would change from:
[
{
"date": "2021-01-01",
"hits": [
{
"search": {
"query": "foo bar",
"other_column": "foo bar"
},
"metadata": [
{
"key": "foo bar",
"value": "foo bar"
}
]
}
]
}
]
to:
[
{
"date": "2021-01-01",
"hits": [
{
"search": {
"query": "FOO BAR",
"other_column": "foo bar"
},
"metadata": [
{
"key": "foo bar",
"value": "FOO BAR"
}
]
}
]
}
]
preferably, one where I can only mention the columns/fields I want to transform in the SQL ...
Use the approach below - it does exactly what you wish: ONLY the fields to be updated are referenced, and all the others (tens or hundreds ...) are preserved as is.
update your_table
set hits = array(
  select as struct * replace(
    (select as struct * replace(upper(query) as query) from unnest([search])) as search,
    array(select as struct * replace(upper(value) as value) from unnest(metadata)) as metadata
  )
  from unnest(hits)
)
where true;
If applied to the sample data in your question, the result is the transformed row shown above, with search.query and metadata.value uppercased.
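To check the outcome against the expected JSON in the question, you could dump the rows as JSON (a minimal sketch using BigQuery's TO_JSON_STRING):
SELECT TO_JSON_STRING(t, true) AS row_json
FROM `my_dataset.test_data_for_so` t;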

How to use Trino/Presto to query Redis

I have a simple string and hash stored in redis
get test
"1"
hget htest first
"first hash"
I'm able to see the "table" test, but there are no columns
trino> show columns from redis.default.test;
Column | Type | Extra | Comment
--------+------+-------+---------
(0 rows)
and obviously I can't get results from a select
trino> select * from redis.default.test;
Query 20210918_174414_00006_dmp3x failed: line 1:8: SELECT * not allowed from relation that has no columns
I see in the documentation that I might need to create a table definition file, but I wasn't able to create one that will work.
I tried a few variations of this; here is one example:
{
"tableName": "test",
"schemaName": "default",
"value": {
"dataFormat": "json",
"fields": [
{
"name": "number",
"mapping": 0,
"type": "INT"
}
]
}
}
Any idea what I am doing wrong?
I focused on the string since it's simpler, but I also need to query the hash.
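One thing to double-check (an assumption, since the catalog configuration isn't shown in the question): the Redis connector only reads table definition files from the directory named by redis.table-description-dir, so a minimal catalog file would look something like this (node address and directory are placeholders):
# etc/catalog/redis.properties
connector.name=redis
redis.nodes=localhost:6379
redis.table-names=test
redis.default-schema=default
redis.table-description-dir=etc/redis/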

How to perform a SELECT on the results returned from a GROUP BY in Druid?

I am having a hard time converting this simple SQL Query below into Druid:
SELECT country, city, COUNT(*)
FROM people_data
WHERE name = 'Mary'
GROUP BY country, city;
So I came up with this query so far:
{
"queryType": "groupBy",
"dataSource" : "people_data",
"granularity": "all",
"metric" : "num_of_pages",
"dimensions": ["country", "city"],
"filter" : {
"type" : "and",
"fields" : [
{
"type": "in",
"dimension": "name",
"values": ["Mary"]
},
{
"type" : "javascript",
"dimension" : "email",
"function" : "function(value) { return (value.length !== 0) }"
}
]
},
"aggregations": [
{ "type": "longSum", "name": "num_of_pages", "fieldName": "count" }
],
"intervals": [ "2016-07-20/2016-07-21" ]
}
The query above runs, but the filter doesn't seem to be evaluated at all, since I see people in my output with names other than Mary. Does anyone have any input on how to make this work?
The simple answer is that you cannot select arbitrary dimensions in your groupBy queries.
Strictly speaking, even the SQL query does not make sense: if, for a given combination of country and city, there are many different values of name and street, how do you squeeze those into a single row? You have to aggregate them, e.g. by using the max function.
In this case you can include the same column in your data as both a dimension and a metric, e.g. name_dim and name_metric, and include a corresponding aggregation over your metric, max(name_metric).
Please note that if these columns (name etc.) have high-cardinality values, that will kill Druid's roll-up feature.
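For what it's worth, a native query closer to the original SQL would drop the top-level "metric" key (it belongs to topN queries, not groupBy) and count rows directly; a sketch, assuming the datasource was ingested without roll-up:
{
  "queryType": "groupBy",
  "dataSource": "people_data",
  "granularity": "all",
  "dimensions": ["country", "city"],
  "filter": { "type": "selector", "dimension": "name", "value": "Mary" },
  "aggregations": [
    { "type": "count", "name": "cnt" }
  ],
  "intervals": ["2016-07-20/2016-07-21"]
}
If the data was ingested with roll-up and an ingestion-time count metric, then the longSum over that column from the question is the right way to express COUNT(*).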

Get Most Recent Column Value With Nested And Repeated Fields

I have a table with the following structure:
and the following data in it:
[
{
"addresses": [
{
"city": "New York"
},
{
"city": "San Francisco"
}
],
"age": "26.0",
"name": "Foo Bar",
"createdAt": "2016-02-01 15:54:25 UTC"
},
{
"addresses": [
{
"city": "New York"
},
{
"city": "San Francisco"
}
],
"age": "26.0",
"name": "Foo Bar",
"createdAt": "2016-02-01 15:54:16 UTC"
}
]
What I'd like to do is recreate the same table (same structure) but with only the latest version of a row. In this example let's say that I'd like to group by everything by name and take the row with the most recent createdAt.
I tried to do something like this: Google Big Query SQL - Get Most Recent Column Value but I couldn't get it to work with record and repeated fields.
I really hoped someone from the Google team would provide an answer to this question, as it is a very frequent topic/problem asked here on SO. BigQuery is definitely not friendly enough with writing nested/repeated stuff back to BQ off of a BQ query.
So, I will provide the workaround I found a relatively long time ago. I DO NOT like it, but (and that is why I hoped for an answer from the Google team) it works. I hope you will be able to adapt it to your particular scenario.
So, based on your example, assume you have a table as below,
and you expect to get the most recent records based on the createdAt column, so the result will look like:
The code below does this:
SELECT name, age, createdAt, addresses.city
FROM JS(
  ( // input table
    SELECT name, age, createdAt, NEST(city) AS addresses
    FROM (
      SELECT name, age, createdAt, addresses.city
      FROM (
        SELECT
          name, age, createdAt, addresses.city,
          MAX(createdAt) OVER(PARTITION BY name, age) AS lastAt
        FROM yourTable
      )
      WHERE createdAt = lastAt
    )
    GROUP BY name, age, createdAt
  ),
  name, age, createdAt, addresses, // input columns
  "[ // output schema
    {'name': 'name', 'type': 'STRING'},
    {'name': 'age', 'type': 'INTEGER'},
    {'name': 'createdAt', 'type': 'INTEGER'},
    {'name': 'addresses', 'type': 'RECORD',
     'mode': 'REPEATED',
     'fields': [
       {'name': 'city', 'type': 'STRING'}
     ]
    }
  ]",
  "function(row, emit) { // function
    var c = [];
    for (var i = 0; i < row.addresses.length; i++) {
      c.push({city: row.addresses[i]});
    }
    emit({name: row.name, age: row.age, createdAt: row.createdAt, addresses: c});
  }"
)
The way the above code works is: it implicitly flattens the original records; finds the rows that belong to the most recent record (partitioned by name and age); and assembles those rows back into their respective records. The final step is processing with a JS UDF to build a proper schema that can actually be written back to a BigQuery table as nested/repeated rather than flattened.
That last step is the most annoying part of this workaround, as it needs to be customized each time for the specific schema(s).
Please note: in this example there is only one nested field inside the addresses record, so the NEST() function worked. In scenarios where you have more than one field inside, the above approach still works, but you need to concatenate those fields to put them inside NEST(), and then do extra splitting of those fields inside the JS function, etc.
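For readers on Standard SQL: this round-trip is no longer necessary there, because ARRAY_AGG over the whole row preserves nested/repeated fields; a minimal sketch (table name assumed):
SELECT AS VALUE ARRAY_AGG(t ORDER BY createdAt DESC LIMIT 1)[OFFSET(0)]
FROM yourTable t
GROUP BY name;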
You can see examples in the answers below:
Create a table with Record type column
create a table with a column type RECORD
How to store the result of query on the current table without changing the table schema?
I hope this is a good foundation for you to experiment with and make your case work!