Accessing complex types in AWS Athena - sql

I've used Glue to generate tables for Athena. I have some nested array/struct values (complex types) that I'm having trouble accessing via query.
I have two tables; the one in question is named "sample_parquet". The relevant column is:
ids (array<struct<idType:string,idValue:string>>)
The cell has the value:
[{idtype=ttd_id, idvalue=cf275376-8116-4cad-a035-e241e14b1470}, {idtype=md5_email, idvalue=932babe184fb11c92b09b3e13e936124}]
And I've tried:
select ids.idtype from sample_parquet limit 1
Which yields:
SYNTAX_ERROR: line 1:8: Expression "ids" is not of type ROW
And:
select s.idtype from sample_parquet.ids s limit 1;
Which yields:
SYNTAX_ERROR: line 1:22: Schema sample_parquet does not exist
I've also tried:
select json_extract(ids, '$.idtype') as idtype from sample_parquet limit 1;
Which yields:
SYNTAX_ERROR: line 8:8: Unexpected parameters (array(row(idtype varchar,idvalue varchar)), varchar(8)) for function json_extract. Expected: json_extract(varchar(x), JsonPath) , json_extract(json, JsonPath)
Thanks for any help.

You are trying to access the elements of an array like you'd access a dictionary/key-value.
Use UNNEST to flatten the array and then you can use the . operator.
For more information, see the AWS docs on working with JSON and arrays in Athena.
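For example, a query along these lines should work (a rough sketch using the table and column names from the question; when you unnest an array of structs, the struct's fields expand into separate columns):
-- one output row per element of ids
select t.idtype, t.idvalue
from sample_parquet
cross join unnest(ids) as t(idtype, idvalue);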

ids is a column of type array, not a relation (e.g. a table, view, or a subquery). Confusingly, when dealing with nested types in Athena/Presto you have to stop thinking in terms of SQL and instead think more as you would in a programming language.
There are dedicated functions that act on arrays and maps, as well as lambda functions (no relation to the AWS service), that can be used to dig into nested types.
When you say SELECT ids.idtype … I assume that what you're after could be written like ids.map((id) => id.idtype) in JavaScript. In Athena/Presto this could be expressed as SELECT transform(ids, id -> id.idtype) ….
The result of transform will be a relation with a column of type array<string>. If you want each element of that array as a separate row, you need to use UNNEST, but if you instead want the first value you can use the element_at function. There are also other functions that you may be familiar with, such as filter, slice, and flatten, which produce new arrays, as well as reduce, which produces a scalar value.
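A sketch of what those could look like against the table from the question (untested; column names taken from the schema above):
-- all idtype values per row, as an array<string>
select transform(ids, id -> id.idtype) as idtypes from sample_parquet;
-- only the first idtype per row
select element_at(transform(ids, id -> id.idtype), 1) as first_idtype from sample_parquet;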

Related

Using Bookshelf to execute a query on Postgres JSONB array elements

I have a Postgres table with jsonb array elements and I'm trying to write SQL queries to extract the matching elements. I have the raw SQL query running from the Postgres command-line interface:
select * from movies where director #> any (array ['70', '45']::jsonb[])
This returns the results I'm looking for (all records from the movies table where the director jsonb elements contain any of the elements in the input element).
In the code, the value for ['70', '45'] would be a dynamic variable, i.e. fixArr, and the length of the array is unknown.
I'm trying to build this into my Bookshelf code but haven't been able to find any examples that address the complexity of the use case. I've tried the following approaches but none of them work:
models.Movies.where('director', '#> any', '(array' + JSON.stringify(fixArr) + '::jsonb[])').fetchAll()
ERROR: The operator "#> any" is not permitted
db.knex.raw('select * from movies where director #> any(array'+[JSON.stringify(fixArr)]+'::jsonb[])')
ERROR: column "45" does not exist
models.Movies.query('where', 'director', '#>', 'any (array', JSON.stringify(fixArr) + '::jsonb[])').fetchAll()
ERROR: invalid input syntax for type json
Can anyone help with this?
As you have noticed, neither knex nor Bookshelf brings any support for making jsonb queries easier. As far as I know, the only knex-based ORM that supports jsonb queries etc. nicely is Objection.js
In your case I suppose a better operator for checking whether the jsonb column contains any of the given values would be ?|, so the query would be something like:
const idsAsString = ids.map(val => `'${val}'`).join(',');
db.knex.raw(`select * from movies where director \\?| array[${idsAsString}]`);
More info on how to deal with jsonb queries and indexing with knex can be found here: https://www.vincit.fi/en/blog/objection-js-postgresql-power-json-queries/
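For reference, the raw SQL generated by the snippet above would look roughly like this (values inlined for illustration):
select * from movies where director ?| array['70', '45'];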
No, you're just running into the limitations of that particular query builder and ORM.
The easiest way is using bookshelf.Model.query and knex.raw (whereRaw, etc.). Alias with AS and subclass your Bookshelf model to add these aliased attributes if you care about such things.
If you want things to look clean and abstracted through Bookshelf, you'll just need to denormalize the JSONB into flat tables. This might be the best approach if your JSONB is mostly flat and simple.
If you end up using lots of JSONB (it can be quite performant with appropriate indexes), then the Bookshelf ORM is wasted effort; the knex query builder is only worth the trouble insofar as it handles escaping, quoting, etc.

Access column from composite type array in Postgres C API

I access an array of composite values like this:
PG_GETARG_ARRAYTYPE_P(0)
/* Then I deconstruct it into C array */
deconstruct_array()
/* Later I iterate thru values and attempt to access columns of my composite type */
GetAttributeByName(input_data1[i], "keyColumnName", &isnull[0])
This is how it looks in SQL:
SELECT * FROM my_c_function(array[(44, 1)::comp_type, (43, 0)::comp_type], array[(42, 1)::comp_type, (43, 1)::comp_type]);
Expected result:
array[(44, 1)::comp_type, (42, 1)::comp_type, (43, 1)::comp_type] /*order doesn't matter*/
But this does not work, because GetAttributeByName() works only with HeapTupleHeader, and sadly I have an array of Datum values.
Normally you get a HeapTupleHeader by accessing a function argument like so: PG_GETARG_HEAPTUPLEHEADER(0), but that is not meant for arrays (or am I wrong?).
So is there some function/macro to get columns from a Datum that is a composite type, or to convert a composite-type Datum into a HeapTuple? I have gone as deep as heap_getattr(), but can't really find anything useful. I also can't think of an existing function that accesses a composite array in a similar fashion and could show me how to do it.
For context:
I have two arrays of a composite type and I want to write a C function that concatenates them quickly. However, I cannot simply append the right argument to the left one, because they could share a "key" column, and in that case I would like the result to contain only the values from the right side.
This is a simple task in PL/pgSQL (unnest, full join, array_agg), but it is very slow. I have tested the same task with hstore and json, and both are much faster than unnest + array_agg, but I cannot use those data types without extensive changes to the database structure, so I was looking for a different solution.
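For reference, the slow SQL approach described above might look roughly like the sketch below. It assumes comp_type is declared as something like (key int, val int); the field name key is hypothetical, since the question doesn't show the type definition.
-- merge two comp_type[] values, preferring the right-hand array when keys collide
select array_agg(case when r.key is not null then r else l end) as merged
from unnest(array[(44, 1)::comp_type, (43, 0)::comp_type]) as l
full join unnest(array[(42, 1)::comp_type, (43, 1)::comp_type]) as r
  on l.key = r.key;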
I guess all you need is the DatumGetHeapTupleHeader macro defined in fmgr.h.

Check if two collections have common part

I have a Hive table in which two columns are arrays (created using the collect_set built-in function). I would like to get only those rows in which "any element in col1 = any element in col2", so simply rows where the arrays have a common part. I see that there is an array_contains(Array, value) function, but it needs a single value, not a collection. Is it possible to express such a condition?
You can explode the array elements and use a simple where condition.
See the explode UDTF manual.
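A rough sketch of that idea (my_table, col1, and col2 are placeholder names): explode one array, then test membership in the other with array_contains.
-- keep rows where at least one element of col1 also appears in col2
select distinct t.*
from my_table t
lateral view explode(col1) e as elem
where array_contains(t.col2, elem);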

What is the order of data across multiple nested fields in BigQuery?

Given a BigQuery table with the schema: target:STRING,evName:STRING,evTime:TIMESTAMP, consider the following subselect:
SELECT target,
NEST(evName) AS evNames,
NEST(evTime) AS evTimes,
FROM [...]
GROUP BY target
This will group events by target into rows with two repeated fields evNames and evTimes. I understand that the values within each of the repeated fields are not ordered in any predictable way, but is the ordering guaranteed to be consistent between the two repeated fields?
In other words, if I pick N-th value from evNames and N-th value from evTimes within a given row, will they form a proper pair from the original table?
What I would really like to do is to create a nested repeated record, something like:
SELECT target, NEST(RECORD(evName, evTime)) AS events FROM [...] GROUP BY target
but I believe creating RECORDs on the fly like this is currently not supported.
By the way, this question is motivated by the desire to use the recently introduced BigQuery user-defined functions to implement state machines, as an alternative to window-function tricks.
Note: I realize that an alternative is to emulate record by serializing multiple fields into a single string representation, e.g.:
SELECT target, NEST(CONCAT(evName, ',', STRING(evTime))) ...
and then deserialize the "record" in later stages, but I'd like to avoid that if I can.

Yii CSqlDataProvider confusion

I am having some trouble understanding CSqlDataProvider and how it works.
When I am using CActiveDataProvider, the results can be accessed as follows:
$data->userProfile['first_name'];
When I use CSqlDataProvider, however, I understand that the results are returned as an array, not an object. But the structure of the array is flat. In other words, I am seeing the following array:
$data['first_name']
instead of
$data['userProfile']['first_name']
But the problem here is what if I have another joined table (let's call it 'author') in my sql code that also contains a first_name field? With CActiveDataProvider, the two fields are disambiguated, so I can do the following to access the two fields:
$data->userProfile['first_name'];
$data->author['first_name'];
But with CSqlDataProvider, there doesn't seem to be any way I can access the data as follows:
$data['userProfile']['first_name'];
$data['author']['first_name'];
So, outside of assigning a unique name to those fields directly inside my SQL, by doing something like this:
select author.first_name as author_first_name, userProfile.first_name as user_first_name
And then referring to them like this:
$data['author_first_name'];
$data['user_first_name']
is there any way to get CSqlDataProvider to automatically structure the arrays so they are nested in the same way that CActiveDataProvider objects are, so that I can access them using $data['userProfile']['first_name']?
Or is there another class I should be using to obtain these kinds of nested arrays?
Many thanks!
As far as I can tell, no Yii DB methods break out JOIN query results into 2D arrays like you are looking for. I think you will need to alias the column names in your select statement, as you suggest.
MySQL returns a single row of data when you JOIN tables in a query, and CSqlDataProvider returns exactly what MySQL does: a single flat array representation indexed/keyed by the column names, just like your query returns.
If you want to break apart your results into a multi-dimensional array, I would either alias the columns or use a regular CActiveDataProvider (to which you can still pass complex queries and joins via CDbCriteria).