Unnesting Multiple Nested Fields Deep in BigQuery

I am trying to use the standard SQL dialect in BigQuery to unnest the changelog.histories.items repeated record (outlined in green) to access the rows in the nested items table (outlined in blue). The parent record changelog (outlined in red) is not a repeated record, and therefore I am having trouble figuring out what to unnest.
Queries that attempt to unnest changelog.histories or changelog.histories.items result in the error below.
SELECT changelog.histories.items.to
FROM jirasparta_database.jira_issues,
unnest(changelog.histories)
Error: Cannot access field items on a value with type ARRAY<STRUCT<..., items ARRAY<STRUCT<to STRING, field STRING, fieldtype STRING, ...>>, ...>> at [1:28]

#standardSQL
SELECT item.to
FROM jirasparta_database.jira_issues,
UNNEST(changelog.histories) history, UNNEST(history.items) item
Basically, you have to flatten the STRUCT and ARRAY values. You can have a look at the BigQuery documentation on working with arrays for more details.

Related

SparkSQL: How to query a column with datatype: List of Maps

I have a dataframe with a column of array (or list) type, where each element is a map of String to a complex data type (meaning String, nested map, list, etc.; you may assume the column data type is similar to List[Map[String,AnyRef]]).
Now I want to query this table like:
select * from tableX where column.<any of the array element>['someArbitaryKey'] in ('a','b','c')
I am not sure how to represent <any of the array element> in Spark SQL. Need help.
The idea is to transform the list of maps into a list of booleans, where each boolean indicates whether the respective map contains the wanted key (k2 in the code below). After that, all we have to do is check whether the boolean array contains at least one true element.
select * from tableX where array_contains(transform(col1, map -> map_contains_key(map, 'k2')), true)
I have assumed that the name of the column holding the list of maps is col1.
The second parameter of the transform function could be replaced by any expression that returns a boolean value. In this example map_contains_key is used, but any check resulting in a boolean value would work.
A bit unrelated: I believe that the data type of the map cannot be Map[String,AnyRef], as there is no encoder available for AnyRef.
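To make the approach concrete, here is a self-contained sketch; the sample rows are made up, and map_contains_key requires Spark 3.4 or later:
CREATE OR REPLACE TEMP VIEW tableX AS
SELECT array(map('k1', 'v1'), map('k2', 'v2')) AS col1
UNION ALL
SELECT array(map('k3', 'v3'));

-- Returns only the first row, since only its list contains a map with key 'k2'.
SELECT * FROM tableX
WHERE array_contains(transform(col1, m -> map_contains_key(m, 'k2')), true);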

Cast in Google BigQuery not appropriate?

I have a #StandardSQL query
SELECT
CAST(created_utc AS STRING),
author
FROM
`table`
WHERE
something = "Something"
which gives me the following error,
Error: Cannot read field 'created_utc' of type STRING as INT64
An example of created_utc is 1517360483
If I understand that error (which I clearly don't), created_utc is stored as a string, but the query is trying, unsuccessfully, to convert it to an INT64. I would have hoped the CAST function would enforce keeping it as a string.
What have I done wrong?
The problem is that you don't actually have a single table. In your question you wrote `table`, but I suspect that you are querying `table*`, which matches multiple tables, where one of them happens to have a different type for that column. Instead of using `table*`, your options are to:
Use UNION ALL with the individual tables, performing casts as appropriate in the SELECT lists (sketched below).
If you know which table(s) have that column as an INT64 instead of a STRING, and you are okay with excluding them, you can use a filter on _TABLE_SUFFIX to skip reading from certain tables.
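For example, option 1 might look like this; the table names are hypothetical, with created_utc stored as INT64 in one table and as STRING in the other:
#standardSQL
SELECT CAST(created_utc AS STRING) AS created_utc, author
FROM `dataset.table_a`  -- created_utc is INT64 here, so cast it
WHERE something = "Something"
UNION ALL
SELECT created_utc, author
FROM `dataset.table_b`  -- created_utc is already a STRING here
WHERE something = "Something"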
As Elliott has already pointed out, some of your values simply cannot be cast to INT64 because they do not represent integers and contain characters other than digits.
Using the SELECT below you can identify such values, which will help you locate the problematic entries and decide on next actions.
#standardSQL
SELECT created_utc, author
FROM `table`
WHERE something = "Something"
AND NOT REGEXP_CONTAINS(created_utc, r'^[0-9]+$')

SQL: how to return nth value from json in postgresql

I am trying to SELECT and parse a javascript list in a postgres table column; it has no keys:
{coastal,transitional,contemporary,romantic,traditional,
industrial,modern,contemporary_eclectic,regency,mediterranean}
What SQL command gets the nth value?
I know you can get values by key like this:
SELECT {column_name}->>{key value}
FROM {table_name}
But I really want to just pull values by list-value order. Is there some syntax that I cannot find? Or do I need to transform this array into a different data type?
The same actually works for arrays:
{column_name}->>N
where N is the integer position of an element.
References:
https://www.postgresql.org/docs/current/static/functions-json.html
Turns out I asked the wrong question--I have a postgres array, not a JSON:
I was struggling because postgres starts counting array positions at 1, not 0--doh!
{column_name}[1] -- this is the first value in the array
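To see both access styles side by side, here is a minimal sketch assuming a hypothetical table styles with a text[] column tags and a jsonb column doc:
SELECT tags[1]   AS first_array_element,  -- Postgres arrays are 1-based
       doc ->> 0 AS first_json_element    -- JSON arrays are 0-based
FROM styles;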

Get an average value for element in column of arrays of json data in postgres

I have some data in a postgres table that is a string representation of an array of json data, like this:
[
{"UsageInfo"=>"P-1008366", "Role"=>"Abstract", "RetailPrice"=>2, "EffectivePrice"=>0},
{"Role"=>"Text", "ProjectCode"=>"", "PublicationCode"=>"", "RetailPrice"=>2},
{"Role"=>"Abstract", "RetailPrice"=>2, "EffectivePrice"=>0, "ParentItemId"=>"396487"}
]
This is the data in one cell from a single column of similar data in my database.
The datatype of this stored in the db is varchar(max).
My goal is to find the average RetailPrice of EVERY json item with "Role"=>"Abstract", including all of the json elements in the array, and all of the rows in the database.
Something like:
SELECT avg(json_extract_path_text(json_item, 'RetailPrice'))
FROM (
   SELECT cast(json_items to varchar[]) as json_item
   FROM my_table
   WHERE json_extract_path_text(json_item, 'Role') like 'Abstract'
   )
Now, obviously this particular query wouldn't work for a few reasons. Postgres doesn't let you directly convert a varchar to a varchar[]. Even after I had an array, this query would do nothing to iterate through the array. There are probably other issues with it too, but I hope it helps to clarify what it is I want to get.
Any advice on how to get the average retail price from all of these arrays of json data in the database?
It does not seem like Redshift supports the json data type per se. At least, I found nothing about it in the online manual.
But I found a few JSON functions in the manual, which should be instrumental:
JSON_ARRAY_LENGTH
JSON_EXTRACT_ARRAY_ELEMENT_TEXT
JSON_EXTRACT_PATH_TEXT
Since generate_series() is not supported, we have to substitute for that ...
SELECT tbl_id
     , round(avg((json_extract_path_text(elem, 'RetailPrice'))::numeric), 2) AS avg_retail_price
FROM (
   SELECT *, json_extract_array_element_text(json_items, pos) AS elem
   FROM (VALUES (0),(1),(2),(3),(4),(5)) a(pos)
   CROSS JOIN tbl
   ) sub
WHERE json_extract_path_text(elem, 'Role') = 'Abstract'
GROUP BY 1;
I substituted with a poor man's solution: a dummy table counting from 0 to n (the VALUES expression). Make sure you count up to the maximum number of possible elements in your array. If you need this on a regular basis, create an actual numbers table.
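If you do, it could be as simple as this (hypothetical table name nums; extend the range to the longest array you expect):
CREATE TABLE nums (pos int);
INSERT INTO nums VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
-- then replace the VALUES expression in the query above with: FROM nums CROSS JOIN tbl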
Modern Postgres has much better options, like json_array_elements() to unnest a json array. Compare to your sibling question for Postgres:
Can get an average of values in a json array using postgres?
I tested in Postgres with the related operator ->>, where it works:
SQL Fiddle.
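For reference, the modern Postgres version collapses the unnesting step into json_array_elements(); this is a sketch that assumes json_items can be cast to json (the => in the sample data would have to be valid JSON : syntax first):
SELECT t.tbl_id
     , round(avg((elem ->> 'RetailPrice')::numeric), 2) AS avg_retail_price
FROM tbl t
CROSS JOIN LATERAL json_array_elements(t.json_items::json) AS elem
WHERE elem ->> 'Role' = 'Abstract'
GROUP BY 1;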

How to use array_agg() for varchar[]

I have a column in our database called min_crew that holds character varying arrays (varchar[]) such as '{CA, FO, FA}'.
I have a query where I'm trying to get aggregates of these arrays without success:
SELECT use.user_sched_id, array_agg(se.sched_entry_id) AS seids
, array_agg(se.min_crew)
FROM base.sched_entry se
LEFT JOIN base.user_sched_entry use ON se.sched_entry_id = use.sched_entry_id
WHERE se.sched_entry_id = ANY(ARRAY[623, 625])
GROUP BY user_sched_id;
Both 623 and 625 have the same use.user_sched_id, so the result should be the grouping of the seids and the min_crew, but I just keep getting this error:
ERROR: could not find array type for data type character varying[]
If I remove the array_agg(se.min_crew) portion of the code, I do get a table returned with the user_sched_id = 2131 and seids = '{623, 625}'.
The standard aggregate function array_agg() only works for base types, not array types, as input.
(But Postgres 9.5+ has a new variant of array_agg() that can handle array input! See the sketch further down.)
You could use the custom aggregate function array_agg_mult() as defined in this related answer:
Selecting data into a Postgres array
Create it once per database. Then your query could work like this:
SELECT use.user_sched_id, array_agg(se.sched_entry_id) AS seids
,array_agg_mult(ARRAY[se.min_crew]) AS min_crew_arr
FROM base.sched_entry se
LEFT JOIN base.user_sched_entry use USING (sched_entry_id)
WHERE se.sched_entry_id = ANY(ARRAY[623, 625])
GROUP BY user_sched_id;
There is a detailed rationale in the linked answer.
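On Postgres 9.5 or later you can skip the custom aggregate entirely, since the built-in array_agg() accepts array input; a sketch against the same tables:
SELECT use.user_sched_id
     , array_agg(se.sched_entry_id) AS seids
     , array_agg(se.min_crew) AS min_crew_arr  -- built-in array_agg() handles arrays in 9.5+
FROM base.sched_entry se
LEFT JOIN base.user_sched_entry use USING (sched_entry_id)
WHERE se.sched_entry_id = ANY(ARRAY[623, 625])
GROUP BY user_sched_id;
The extents of the aggregated arrays must still match, as explained next.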
Extents have to match
In response to your comment, consider this quote from the manual on array types:
Multidimensional arrays must have matching extents for each dimension.
A mismatch causes an error.
There is no way around that, the array type does not allow such a mismatch in Postgres. You could pad your arrays with NULL values so that all dimensions have matching extents.
But I would rather translate the arrays to comma-separated lists with array_to_string() for the purpose of this query and use string_agg() to aggregate the text - preferably with a different separator. I am using a newline in my example:
SELECT use.user_sched_id, array_agg(se.sched_entry_id) AS seids
,string_agg(array_to_string(se.min_crew, ','), E'\n') AS min_crews
FROM ...
Normalize
You might want to consider normalizing your schema to begin with. Typically, you would implement such an n:m relationship with a separate table like outlined in this example:
How to implement a many-to-many relationship in PostgreSQL?
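A normalized design might store one crew role per row; the table and column names here are hypothetical:
CREATE TABLE base.sched_entry_crew (
  sched_entry_id int  REFERENCES base.sched_entry (sched_entry_id),
  crew_role      text NOT NULL,
  PRIMARY KEY (sched_entry_id, crew_role)
);

-- Aggregating per schedule entry then becomes plain array_agg() over a base type:
SELECT sched_entry_id, array_agg(crew_role) AS min_crew
FROM base.sched_entry_crew
GROUP BY sched_entry_id;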