Adding an ORDER BY clause to a query without flattening results leads to "Cannot query the cross product of repeated fields" (google-bigquery)

Query:
"SELECT * FROM [table] ORDER BY id DESC LIMIT 10"
AllowLargeResults = true
FlattenResults = false
table schema:
[
{
"name": "id",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "repeated_field_1",
"type": "STRING",
"mode": "REPEATED"
},
{
"name": "repeated_field_2",
"type": "STRING",
"mode": "REPEATED"
}
]
The query "SELECT * FROM [table] LIMIT 10" works just fine. I get this error when I add an order by clause, even though the order by does not mention either repeated field.
Is there any way to make this work?

The ORDER BY clause causes BigQuery's legacy SQL to automatically flatten the output of a query, so your query attempts to generate a cross product of repeated_field_1 and repeated_field_2.
If you don't care about preserving the repeatedness of the fields, you can explicitly FLATTEN both of them, which makes the query generate, on purpose, the very cross product the original query is complaining about:
SELECT *
FROM FLATTEN(FLATTEN([table], repeated_field_1), repeated_field_2)
ORDER BY id DESC
LIMIT 10
Other than that, I don't have a good workaround that lets your query both ORDER BY and output repeated fields.
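For what it's worth, the auto-flattening is specific to legacy SQL. A minimal sketch of the same query in standard SQL, assuming the table is reachable under a hypothetical `project.dataset.table` name; standard SQL keeps repeated fields as ARRAYs and does not flatten them on ORDER BY:
#standardSQL
SELECT *  -- repeated_field_1 and repeated_field_2 come back as arrays
FROM `project.dataset.table`
ORDER BY id DESC
LIMIT 10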
See also: BigQuery flattens result when selecting into table with GROUP BY even with “noflatten_results” flag on

Related

How to group by the number of values in an array in PostgreSQL

I have a posts table with a few columns, including a liked_by column whose type is an integer array.
Since I can't post the table here, here is a single post's JSON structure:
"post": {
"ID": 1,
"CreatedAt": "2022-08-15T11:06:44.386954+05:30",
"UpdatedAt": "2022-08-15T11:06:44.386954+05:30",
"DeletedAt": null,
"title": "Pofst1131",
"postText": "yyhfgwegfewgewwegwegwegweg",
"img": "fegjegwegwg.com",
"userName": "AthfanFasee",
"likedBy": [
3,
1,
4
],
"createdBy": 1,
}
I'm trying to return posts in the order they are liked (most-liked posts first), which means ordering them by the number of values inside the liked_by array. How can I achieve this in Postgres?
As a side note, I'm using Go with the GORM ORM, but I'm building raw SQL instead of using the ORM tools, so a Go-based solution would be fine too. The way I achieved this in MongoDB with Node.js was to add a total like count field derived from the size of the likedBy array and sort on that field, as below:
if (sort === 'likesCount') {
  data = Post.aggregate([
    {
      $addFields: {
        totalLikesCount: { $size: "$likedBy" }
      }
    }
  ]);
  data = data.sort('-totalLikesCount');
} else {
  data = data.sort('-createdAt');
}
Use a native query.
Provided that the table column that contains the sample data is called post, then
select <list of expressions> from the_table
order by json_array_length(post->'likedBy') desc;
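As an aside, json_array_length assumes the column type is json; if post were jsonb, the jsonb variant of the function applies (same hypothetical table and column as above):
select <list of expressions> from the_table
order by jsonb_array_length(post->'likedBy') desc;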
Unrelated, but why not try a normalized data design?
Edit
Now that I know your table structure, here is the updated query. Use array_length.
select <list of expressions> from public.posts
order by array_length(liked_by, 1) desc nulls last;
You may also wish to add a WHERE clause.
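Putting it together, a minimal sketch assuming hypothetical id and title columns on public.posts:
select id,
       title,
       -- array_length returns null for empty arrays, so coalesce to 0
       coalesce(array_length(liked_by, 1), 0) as total_likes
from public.posts
order by total_likes desc;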

Flattening a nested and repeated structure in BigQuery (standard SQL)

There are a lot of posts on unnesting repeated fields in BigQuery, and, being new to this environment, I have tried almost every code variation I found to flatten a data file. But I cannot seem to produce one without creating blanks in the id field. It seems like I need to unnest a nested variable?
I'm using a COVID Dimensions data set that is part of the public collection. Here is some minimal code that produces my problem:
SELECT
id,
authors
FROM
`covid-19-dimensions-ai.data.publications`
CROSS JOIN
UNNEST(authors)
LIMIT 1000
And here is the JSON structure after running this query. Everything is flattened into the structure I want, but I don't know how to fill in / avoid the blank id values.
{
  "id": "pub.1130234899",
  "authors": {
    "first_name": "Eric M",
    "last_name": "Yoshida",
    "initials": null,
    "researcher_id": "ur.01071531321.03",
    "grid_ids": [
      "grid.17091.3e"
    ],
    "corresponding": false,
    "raw_affiliations": [
      "Division of Gastroenterology, University of British Columbia, Vancouver, British Columbia, Canada"
    ],
    "affiliations_address": [
      {
        "grid_id": "grid.17091.3e",
        "city_id": "6173331",
        "state_code": "CA-BC",
        "country_code": "CA",
        "raw_affiliation": "Division of Gastroenterology, University of British Columbia, Vancouver, British Columbia, Canada"
      }
    ]
  }
}
The problem is that your SELECT still references the original authors array rather than the element produced by UNNEST, which is why id shows up blank. Alias the unnested value and select the alias instead; see the small correction to your original query:
SELECT
id,
author
FROM
`covid-19-dimensions-ai.data.publications`
CROSS JOIN
UNNEST(authors) author
LIMIT 1000
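From there you can select individual author fields through the alias; a sketch pulling a few of the fields visible in the JSON above:
SELECT
id,
author.first_name,
author.last_name,
author.researcher_id
FROM
`covid-19-dimensions-ai.data.publications`
CROSS JOIN
UNNEST(authors) author
LIMIT 1000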

Snowflake SQL: How to loop through array with JSON objects, to find item that meets condition

I've been breaking my head over this. In Snowflake, my field city_info looks like this (for 3 sample records):
[{"name": "age", "content": 35}, {"name": "city", "content": "Chicago"}]
[{"name": "age", "content": 20}, {"name": "city", "content": "Boston"}]
[{"name": "city", "content": "New York"}, {"name": "age", "content": 42}]
I'm trying to extract a city column from this:
Chicago
Boston
New York
I tried to flatten this
select *
from lateral flatten(input =>
select city_info::VARIANT as event
from data
)
And from there I can derive the value, but this only works for one row (I have to add LIMIT 1, which doesn't make sense, as I need this for all my rows).
If I try to run it for all 3 rows, it tells me the subquery returns more than one row.
Any help is appreciated! Chris
You could write it as:
SELECT value:content::string AS city_name
FROM tab,
LATERAL FLATTEN(input => tab.city_info)
WHERE value:name::string = 'city'
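If city_info happens to be stored as a plain string rather than a VARIANT, parsing it first should do the trick (a sketch, assuming the same tab table and city_info column):
SELECT value:content::string AS city_name
FROM tab,
-- PARSE_JSON turns the string into a VARIANT that FLATTEN can open
LATERAL FLATTEN(input => PARSE_JSON(tab.city_info))
WHERE value:name::string = 'city'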

How to query and iterate over array of structures in Athena (Presto)?

I have an S3 bucket with 500,000+ JSON records, e.g.:
{
  "userId": "00000000001",
  "profile": {
    "created": 1539469486,
    "userId": "00000000001",
    "primaryApplicant": {
      "totalSavings": 65000,
      "incomes": [
        { "amount": 5000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" },
        { "amount": 2000, "incomeType": "OTHER", "frequency": "MONTHLY" }
      ]
    }
  }
}
I created a new table in Athena
CREATE EXTERNAL TABLE profiles (
  userId string,
  profile struct<
    created:int,
    userId:string,
    primaryApplicant:struct<
      totalSavings:int,
      incomes:array<struct<amount:int,incomeType:string,frequency:string>>
    >
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true' )
LOCATION 's3://profile-data'
I am interested in the incomeTypes, e.g. "SALARY", "PENSIONS", "OTHER", etc., and ran this query, changing jsonData.incometype each time:
SELECT jsonData
FROM "sampledb"."profiles"
CROSS JOIN UNNEST(sampledb.profiles.profile.primaryApplicant.incomes) AS la(jsonData)
WHERE jsonData.incometype='SALARY'
This worked fine with CROSS JOIN UNNEST, which flattened the incomes array so that the data example above spans two rows. The only idiosyncratic thing was that CROSS JOIN UNNEST made all the field names lowercase; e.g., a row looked like this:
{amount=1520, incometype=SALARY, frequency=FORTNIGHTLY}
Now I have been asked how many users have two or more "SALARY" entries, e.g.:
"incomes": [
{ "amount": 3000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" },
{ "amount": 4000, "incomeType": "SALARY", "frequency": "MONTHLY" }
],
I'm not sure how to go about this.
How do I query the array of structures to look for duplicate incomeTypes of "SALARY"?
Do I have to iterate over the array?
What should the result look like?
UNNEST is a very powerful feature, and it's possible to solve this problem using it. However, I think using Presto's Lambda functions is more straightforward:
SELECT COUNT(*)
FROM sampledb.profiles
WHERE CARDINALITY(FILTER(profile.primaryApplicant.incomes, income -> income.incomeType = 'SALARY')) > 1
This solution uses FILTER on the profile.primaryApplicant.incomes array to get only those with an incomeType of SALARY, and then CARDINALITY to extract the length of that result.
Case sensitivity is never easy with SQL engines. In general I think you should not expect them to respect case, and many don't. Athena in particular explicitly converts column names to lower case.
You can combine filter with cardinality to count the array elements having incomeType = 'SALARY' and keep only the users where that count is more than one.
This can be further improved so that the intermediate array is not materialized, by using reduce (see the examples in the docs; I'm not quoting them here, since they do not directly answer your question).
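For reference, a sketch of the reduce variant, assuming the same table and Presto's reduce signature (array, initial state, input function, output function):
SELECT COUNT(*)
FROM sampledb.profiles
WHERE REDUCE(
  profile.primaryApplicant.incomes,
  0,                              -- running count of SALARY entries
  (s, income) -> s + IF(income.incomeType = 'SALARY', 1, 0),
  s -> s                          -- return the final count
) > 1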

Flattening multiple repeated fields in Google BigQuery

I'm trying to flatten data from repeated fields in BigQuery. I have had a look at Querying multiple repeated fields in BigQuery; however, I can't seem to get it to work.
My data looks like the following:
[
  {
    "visitorId": null,
    "visitNumber": "15",
    "device": {
      "browser": "Safari (in-app)",
      "browserVersion": "(not set)",
      "browserSize": "380x670",
      "operatingSystem": "iOS"
    },
    "hits": [
      {
        "isEntrance": "true",
        "isExit": "true",
        "referer": null,
        "page": {
          "pagePath": "/news/bla-bla-bla",
          "hostname": "www.example.com",
          "pageTitle": "Win tickets!!",
          "searchKeyword": null,
          "searchCategory": null,
          "pagePathLevel1": "/news/",
          "pagePathLevel2": "/bla-bla-bla",
          "pagePathLevel3": "",
          "pagePathLevel4": ""
        },
        "transaction": null
      }
    ]
  }
]
What I want are the fields inside the hits.page repeated record.
For instance, I want to fetch hits.page.pagePath (with the value "/news/bla-bla-bla").
I have tried the following query, but I get an error:
SELECT
visitorId,
visitNumber,
device.browser,
hits.page.pagePath
FROM
`Project.Page`
LIMIT 1000
The error I'm getting is this:
Error: Cannot access field page on a value with type ARRAY<STRUCT<hitNumber INT64, time INT64, hour INT64, ...>>
In the ga_sessions schema, the field hits is represented as an ARRAY type.
Usually, when working with this type of field, you need to apply the UNNEST operation to open up the array.
Specifically, in the FROM clause you can apply a CROSS JOIN (you unnest arrays with a cross join, which can be written as a comma followed by the UNNEST function), like so:
SELECT
visitorId,
visitNumber,
device.browser,
hits.page.pagePath
FROM `Project.Page`,
UNNEST(hits) hits
LIMIT 1000
If you want specific pagePaths, you can filter for them like so:
SELECT
visitorId,
visitNumber,
device.browser,
hits.page.pagePath
FROM `Project.Page`,
UNNEST(hits) hits
WHERE regexp_contains(hits.page.pagePath, r'/news/bla-bla-bla')
LIMIT 1000
Make sure to read through the BigQuery documentation on this topic; it's really well written, and you'll learn a lot of new techniques for processing big data.
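And if a table contains more than one repeated field, unnesting each of them in a correlated subquery rather than in the FROM clause avoids the cross-product error from the first question on this page. A sketch against the same hypothetical Project.Page table; IGNORE NULLS guards against null pagePath values, which ARRAY_AGG rejects:
SELECT
visitorId,
visitNumber,
-- one array per session row, no cross join with other repeated fields
(SELECT ARRAY_AGG(h.page.pagePath IGNORE NULLS) FROM UNNEST(hits) h) AS pagePaths
FROM `Project.Page`
LIMIT 1000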