Querying Redshift Spectrum array of string columns - sql

I have an external (S3) table in my Redshift cluster with an array-of-strings column. It's literally just a list of strings. I can query and select only the array column with no problems, and I can query all 3 of the array columns with no problems, but as soon as I try to query other columns that are not arrays I get the following:
error: Spectrum Scan Error
I have tried the following, as I saw it on some other Stack Overflow questions:
select id_col, b
from test.test_table as a, a.array_col as b
but when I run the above I get: navigation on array_col is not allowed as it is not a struct/tuple type
Of course, this error message makes sense as it isn't a struct or tuple type, but I am lost as to how on earth I can query a simple array of strings, and I have found no documentation on how to do this. Any help or advice would be greatly appreciated!

Since you aliased the external table, you have to use that alias for all the fields that you want to retrieve, in the same way you did with the array column. Your query will be:
select a.id_col, b
from test.test_table as a, a.array_col as b
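The same pattern extends to any additional non-array columns, as long as each one is qualified with the table alias. As a hedged sketch (other_col is a made-up column name, not from your table):
select a.id_col, a.other_col, b
from test.test_table as a, a.array_col as b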

Related

Looking for guidance on my sql query that apparently includes an array

Quite new to SQL, and looking for help on what I'm doing wrong.
With the code below, I'm getting the error "cannot access field value on a value with type array<struct> at [1:30]".
The "audience size value" comes from the dataset public_campaigns, whereas the engagement rate comes from the dataset public_instagram_channels.
I think the dataset that's causing the issue here is public_campaigns.
Thanks in advance for your help!
SELECT creator_audience_size.value, AVG(engagement_rate/1000000) AS avgER
FROM `public_instagram_channels` AS pic
JOIN `public_campaigns` AS pc
ON pic.id = pc.id
GROUP BY creator_audience_size.value
This is to do with the type of one of the columns using REPEATED mode.
In Google BigQuery you have to use UNNEST on these repeated columns to get their individual values in the result set.
It's unclear from what you've posted which column is the repeated type - looking at the table definition for public_instagram_channels and public_campaigns will reveal this - look for the word REPEATED in the Mode column of the table definition.
Once you've found it, include UNNEST in your query, as per this untested example:
SELECT creator_audience_size.value, AVG(engagement_rate/1000000) AS avgER
FROM `public_instagram_channels` AS pic,
UNNEST(`column_name`) AS whatever_you_want
JOIN `public_campaigns` AS pc ON pic.id = pc.id
GROUP BY creator_audience_size.value
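For instance, if creator_audience_size turns out to be the repeated column (an assumption here, since the table definitions aren't shown, and assuming its elements are structs with a value field), the query could look like this:
SELECT cas.value, AVG(engagement_rate/1000000) AS avgER
FROM `public_instagram_channels` AS pic
JOIN `public_campaigns` AS pc ON pic.id = pc.id
CROSS JOIN UNNEST(pc.creator_audience_size) AS cas
GROUP BY cas.value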

Transform a column of type string to an array/record i.e. nesting a column

I am trying to calculate and retrieve some indicators from multiple tables I have in my dataset on BigQuery. I want to invoke nesting on sfam, which is a column of strings that I can't nest for now, i.e. it could have values or be null. So the goal is to transform that column into an array/record; that's the idea that came to mind, and I have no idea how to go about doing it.
The product and cart are grouped by key_web, dat_log, univ, suniv, fam and sfam.
The data is broken down into universes, referred to as univ, each composed of sub-universes, referred to as suniv. Sub-universes contain families, referred to as fam, which may or may not have sub-families, referred to as sfam. I want to invoke nesting on prd.sfam to reduce the resulting columns.
The data is collected from Google Analytics for insight into website traffic and user activity.
I am trying to get information and indicators about each visitor, the amount of time he/she spent on particular pages, actions taken and so on. The resulting table gives me the sum of time spent on those pages, the total number of visits for a single day and a breakdown of which category it belongs to, thus the univ, suniv, fam and sfam columns, which are of type string (sfam could be null since some sub-universes suniv only have families fam and don't go down to a sub-family level sfam).
dat_log: refers to the date
nrb_fp: number of views for a product page
tps_fp: total time spent on said page
I tried different methods that I found online but none worked, so I'm posting my code and problem in the hope of finding guidance and a solution!
A simpler query would be:
select
prd.key_web
, dat_log
, prd.nrb_fp
, prd.tps_fp
, prd.univ
, prd.suniv
, prd.fam
, prd.sfam
from product as prd
left join cart as cart
on prd.key_web = cart.key_web
and prd.dat_log = cart.dat_log
and prd.univ = cart.univ
and prd.suniv = cart.suniv
and prd.fam = cart.fam
and prd.sfam = cart.sfam
And this is a sample result of the query for the last 6 columns:
Again, I want to get a column of arrays for sfam where I have all the string values of sfam, even nulls.
I limited the output to only the last 6 columns; the first 3 are the row, key_web and dat_log. Each fam is composed of several sfam or none (null), and I want to be able to do nesting on either the fam or sfam.
I want to get a column of array as sfam where I have all the string values of sfam even nulls.
This is not possible in BigQuery. As the documentation explains:
Currently, BigQuery has the following limitation with respect to NULLs and ARRAYs:
BigQuery raises an error if query result has ARRAYs which contain NULL elements, although such ARRAYs can be used inside the query.
That is, your result set cannot contain an array with NULL elements.
Obviously, in BigQuery you cannot output an array which holds NULLs, but if for some reason you need to preserve them somehow, the workaround is to create an array of structs as opposed to an array of single elements.
For example (BigQuery Standard SQL), if you try to execute the below
SELECT ['a', 'b', NULL] arr1, ['x', NULL, NULL] arr2
you will get error: Array cannot have a null element; error in writing field arr1
While if you try the below
SELECT ARRAY_AGG(STRUCT(val1, val2)) arr
FROM UNNEST(['a', 'b', NULL]) val1 WITH OFFSET
JOIN UNNEST(['x', NULL, NULL]) val2 WITH OFFSET
USING(OFFSET)
you get this result:
Row  arr.val1  arr.val2
1    a         x
     b         null
     null      null
As you can see, with this approach you can even have both elements as NULL.
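Applied to the question's data, a hedged sketch (assuming the product table and column names from the simpler query above, and that the remaining key columns are what you want to group on), the same array-of-structs trick could collect fam/sfam pairs even when sfam is null:
select
prd.key_web
, prd.dat_log
, prd.univ
, prd.suniv
, array_agg(struct(prd.fam as fam, prd.sfam as sfam)) as fam_sfam
from product as prd
group by prd.key_web, prd.dat_log, prd.univ, prd.suniv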

How to explode nested array of structure with unknown array length in Hive?

I have a Hive table emp_test as below:
'name' as string
'testing' as array<struct<code:string, tests:array<struct<testtype:string, errorline:string>>>>
and column values: "name" as "JOHN" and "testing" as
[{"code":"cod1234","tests":[{"testtype":"java","errorline":"100"},{"testtype":"C++","errorline":"10000"}]},
 {"code":"cod6790","tests":[{"testtype":"hive","errorline":"10"},{"testtype":"pig","errorline":"978"},{"testtype":"spark","errorline":"35"}]}]
How do I select these values and store them in another table emp_test_detail(name, code, testtype, errorline) as:
JOHN  cod1234  java   100
JOHN  cod1234  C++    10000
JOHN  cod6790  hive   10
JOHN  cod6790  pig    978
JOHN  cod6790  spark  35
I have tried the below query but got an error:
insert into emp_test_detail select
emp_tasting.code,
emp_tasting.emp_tests.testtype,
emp_tasting.emp_tests.errorline from emp_test
lateral view explode(testing) mytest as emp_tasting
lateral view explode(testing[0].tests) mytest as emp_tasting;
Here I don't know the exact length of the testing array, so how do I reference the array fields?
Please help me with this.
In your example query the error is likely related to using emp_tasting, the same column alias for both lateral view explode lines. They need to have different aliases.
To un-nest an array two levels deep, you need to explode the first array, then refer to the alias of that exploded array when exploding the nested array.
For example, you wanted name, code, testtype, errorline
name is available directly in the table
code is available from the first explode
testtype and errorline are available from the nested explode.
Note that I am looking at your schema, not the data you've listed; it's easier for me to reason about.
This query should do what you want
SELECT
name,
testingelement.code,
test.testtype,
test.errorline
FROM emp_test
LATERAL VIEW explode(testing) testingarray as testingelement
LATERAL VIEW explode(testingelement.tests) testsarray as test;
Table and column aliases
Note that explode has two aliases added after it, the first is for the table expression it generates, the second is for the column(s).
So in this example
LATERAL VIEW explode(testing) testingarray as testingelement
testingarray is the table alias
testingelement is the array column alias you need to reference to extract the fields within the struct.
Skipping the first explode
If you only wanted fields directly from the table and from the nested array, then you can shortcut that query with a single LATERAL VIEW explode:
LATERAL VIEW explode(testing.tests) testsarray as test
The problem with that is it will also explode empty arrays, and you can't use * star expansion, you have to refer to field names explicitly. That's not a bad thing.
What is a bad thing is having to use array indexes in a query. As soon as you start writing field[0] then something smells funky. That would only ever get the first element of the array, and as you've said it relies on knowing the size of the array beforehand which would have very limited use cases.
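To actually load emp_test_detail as the question asks, the same SELECT can feed an INSERT (a sketch assuming emp_test_detail already exists with columns name, code, testtype, errorline):
INSERT INTO TABLE emp_test_detail
SELECT
name,
testingelement.code,
test.testtype,
test.errorline
FROM emp_test
LATERAL VIEW explode(testing) testingarray as testingelement
LATERAL VIEW explode(testingelement.tests) testsarray as test;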

Hive UDF to generate all possible ordered combinations from the list

I am trying to figure out how to write a Hive UDF that would take a list as input and output a list of the 2-way ordered combinations of all elements in the list.
Input:
list_variable_b
[5142430,5146974,5141766]
Output:
list_variable_b
[(5142430,5146974),(5146974,5141766),(5142430,5141766)]
So you're asking how to write a UDF that can take an array<bigint> and
turn it into an array<struct<int,int>> or array<array<int>>.
It sounds like you want what's called n choose k, which will produce n!/((n-k)!k!) elements.
Now, Hive has two kinds of UDFs: simple UDFs, which can only process primitive (non-collection) types, and generic UDFs. Since you are processing an array here, you'll need a generic UDF. Generic UDFs can do much more than simple UDFs, but they are also more difficult to write. A good guide on how to do it is here: http://www.baynote.com/2012/11/a-word-from-the-engineers/
Another way would be to use a double LATERAL VIEW with the caveat that all the elements in the array have to be unique for this to work.
If the table is
create table xx ( col array<int>);
such that
select * from xx;
OK
[5142430,5146974,5141766]
Using a double lateral view to do the cartesian product of the array on itself, then keep only the pairs where one element is bigger than the other:
select a1,b1 from xx
lateral view explode(col) a as a1
lateral view explode(col) b as b1 where a1 < b1;
5142430 5146974
5141766 5142430
5141766 5146974
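If you need the pairs back as a single array column rather than one pair per row, a hedged sketch (collect_list over struct values needs a reasonably recent Hive release, and the struct field names here are made up) could wrap them like this:
select collect_list(named_struct('first', a1, 'second', b1)) as pairs
from xx
lateral view explode(col) a as a1
lateral view explode(col) b as b1
where a1 < b1;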

Using previous table in pig group syntax after filter

Suppose I have a table in Pig with 3 columns, a, b, c. Now suppose I want to filter the table by b == 4 and then group it by a. I believe that would look something like this:
t1 = my_table; -- the table contains three columns a, b, c
t1_filtered = FILTER t1 by (
b == 4
);
t1_grouped = GROUP t1_filtered by my_table.a;
My question is why can't it look like this:
t1 = my_table; -- the table contains three columns a, b, c
t1_filtered = FILTER t1 by (
b == 4
);
t1_grouped = GROUP t1_filtered by t1_filtered.a;
Why do you have to reference the table from before the filter? I'm trying to learn Pig and I find myself making this mistake a lot. It seems to me that t1_filtered should equal a table that is just the filtered version of t1, so a simple group should make sense, but I've been told you need to reference the table from before. Does anyone know what's going on behind the scenes and why this makes sense? Also, help naming this question would be appreciated.
The way you have dereferenced (.) is also not correct. This is how it should be:
A = LOAD '/filepath/to/tabledata' using PigStorage(',') as (a:int,b:int,c:int);
B = FILTER A BY a==1;
C = GROUP B BY a;
But your way of dereferencing (.) will also work in some cases. You can only use the dot (.) when you are referencing a complex data type like a map, tuple or bag. If you use the dot operator to access normal fields, it expects a scalar output. If it has more than one output, you will get an error something like this:
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (1,2,3), 2nd :(2,2,2)
Your way of using the dot operator would work only if the relation you dereference yields a single row of output; if not, you will end up with this error. Relation B is not a complex data type, which is the reason we do not use any dereferencing operator in the GROUP BY clause.
Hope this answers your question.
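As a small hedged illustration, reusing the A relation loaded above with the question's b == 4 filter, the commented-out line shows the pattern that triggers the scalar error when A has more than one row:
B = FILTER A BY b == 4;
C = GROUP B BY a;
-- D = GROUP B BY A.a; -- fails with "Scalar has more than one row in the output" when A has multiple rows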