Hive aggregation function that produces a map - sql

I have the following hive table
ID, class,value
1, A, 0.3
1, B, 0.4
1, C, 0.5
2, B, 0.1
2, C, 0.2
I want to get
ID, class:value
1, [A:0.3, B:0.4, C:0.5]
2, [B:0.1, C:0.2]
I know that there is a collect_set() UDAF that produces a list of class values or a list of value values. Is there any way to get a list of key:value pairs?
NOTE:
I guess I can use two collect_set() one for class column and one for value column but I am not sure if the lists will be in the same order.

I've used the UnionUDAF from the Brickhouse library to do something similar. You create a map from each pair, and then union them all together during the aggregation.
Add JAR brickhouse.jar;
create temporary function BH_union as 'brickhouse.udf.collect.UnionUDAF';
SELECT S.ID, BH_union(S.v_map)
FROM (SELECT ID, map(class, value) as v_map from mytable) S
GROUP by S.ID
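To make the map-then-union idea concrete, here is a minimal plain-Python sketch of what the query above computes. It only models the semantics and is not Brickhouse or Hive code; the rows list and the union_maps_by_id helper are hypothetical names used for illustration.

```python
from collections import defaultdict

# Rows of the example table from the question: (ID, class, value).
rows = [
    (1, "A", 0.3),
    (1, "B", 0.4),
    (1, "C", 0.5),
    (2, "B", 0.1),
    (2, "C", 0.2),
]

def union_maps_by_id(rows):
    """Hypothetical helper: build a one-entry map per row, then union per ID."""
    result = defaultdict(dict)
    for row_id, cls, value in rows:
        single = {cls: value}          # like map(class, value) in the subquery
        result[row_id].update(single)  # like the union during GROUP BY ID
    return dict(result)

aggregated = union_maps_by_id(rows)
print(aggregated)
```

Because each key travels together with its value inside one map, the ordering concern about two separate lists does not arise. In this sketch a repeated class for the same ID is overwritten by the later value; if your data can contain such duplicates, check how the UDAF treats them.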

You can also use custom Map/Reduce scripts, or collect_list() (available from Hive 0.13.0), to achieve the same.
Let me know if you need more help in this.

Related

How does BigQuery manage a struct field in a SELECT

The following queries a struct from a public data source:
SELECT year FROM `bigquery-public-data.words.eng_gb_1gram` LIMIT 1000
Its schema is:
And the resultset is:
It seems BigQuery automatically translates a struct to all its (leaf) fields when accessed, is that correct? Or how does BigQuery handle directly calling a struct in a select statement?
Two things are going on. You have an array of structs (aka "records").
Each element of the array appears on a separate line in the result set.
Each field in the struct is a separate column.
So, your results are not for a struct but for an array of structs.
You can see what happens for a single struct using:
select year[safe_ordinal(1)]
from . . .
You will get a single row for each row in the data, with the first element of the year array in the row. It will have separate columns, with the names of year.year, year.term_frequency and so on. If you wanted these as "regular" columns, you can use:
select year[ordinal(1)].*
from . . .
Then the columns are year, term_frequency, and so on.
As you might know, a RECORD can be NULLABLE, in which case it is a single STRUCT, or REPEATED, in which case it is an array of structs.
You can use dot-star notation with a struct to select all of its fields, just as you do with a table's rows using SELECT * FROM tbl or its equivalent SELECT t.* FROM tbl t
So, for example, the code below
with tbl as (
select struct(1 as a, 2 as b, 3 as c) as col_struct,
[ struct(11 as x, 12 as y, 13 as z),
struct(21, 22, 23),
struct(31, 32, 33)
] as col_array
)
select col_struct.*
from tbl
produces one row with the columns a, b and c (values 1, 2 and 3),
as if it were a row of a "mini" table called col_struct
The same dot-star notation does not work for arrays. If you want to output the elements of an array separately, you first need to unnest that array, as in the example below
with tbl as (
select struct(1 as a, 2 as b, 3 as c) as col_struct,
[ struct(11 as x, 12 as y, 13 as z),
struct(21, 22, 23),
struct(31, 32, 33)
] as col_array
)
select rec
from tbl, unnest(col_array) rec
which outputs three rows, each containing one struct with the fields x, y and z
And now, because each row is a struct, you can use dot-star notation
select rec.*
from tbl, unnest(col_array) rec
which outputs three rows with the separate columns x, y and z
And, finally - you can combine above as
select col_struct.*, rec.*
from tbl t, t.col_array rec
which outputs three rows, each carrying the columns a, b, c, x, y and z
Note: from tbl t, t.col_array rec is a shortcut for from tbl, unnest(col_array) rec
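The combined query above can be pictured with a small plain-Python model: every row is paired with each element of its own array, and the struct fields are spread out into columns. This is only a sketch of the semantics, not how BigQuery executes it; unnest_join is a hypothetical helper, and it assumes the unnamed structs in the array take the field names x, y and z from the first element.

```python
# One-row table from the example: a struct column and an array-of-structs column.
tbl = [
    {
        "col_struct": {"a": 1, "b": 2, "c": 3},
        "col_array": [
            {"x": 11, "y": 12, "z": 13},
            {"x": 21, "y": 22, "z": 23},  # field names assumed from the first struct
            {"x": 31, "y": 32, "z": 33},
        ],
    }
]

def unnest_join(table):
    """Hypothetical helper: pair every row with each element of its own array."""
    out = []
    for row in table:
        for rec in row["col_array"]:                   # from tbl t, t.col_array rec
            out.append({**row["col_struct"], **rec})   # select col_struct.*, rec.*
    return out

flattened = unnest_join(tbl)
for r in flattened:
    print(r)
```

The struct values are repeated on every output row, while the array contributes one row per element.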
One more note: if you reference a field name that is used in multiple places in your schema, the engine picks the outermost matching one. If that match is inside an ARRAY, you first need to unnest that array. If it is part of a STRUCT, you need to fully qualify the path.
For example, with the simplified data above
select a from tbl -- will not work
select col_struct.a from tbl -- will work
select col_array.x from tbl -- will not work
select x from tbl, unnest(col_array) -- will work
Much more can be said on this subject depending on your exact use case, but the above covers some hopefully helpful basics

Example of table function

Is the UNNEST an example of a table-function? It seems to produce a single named column if I'm understanding it correctly. Something like:
`vals`
[1,2,3]
unnest(vals) as v
`v`
1
2
3
with Table as (
select [1,2,3] vals
) select v from Table, UNNEST(vals) as v
Is this an example of a table-function? If not, what kind of function is it? Are there any other predefined table functions in BQ?
The UNNEST operator takes an ARRAY and returns a table, with one row for each element in the ARRAY. You can also use UNNEST outside of the FROM clause with the IN operator.
So you may call it a table function if you wish :o)
You can read more about UNNEST in the BigQuery documentation
"It seems to produce a single named column if I'm understanding it correctly"
Not exactly: for an array of structs it produces one column per struct field. See the example below
with Table as (
select [struct(1 as a,2 as b),struct(3, 4), struct(5, 6)] vals
)
select v.* from Table, UNNEST(vals) as v
which outputs three rows with two columns, a and b: (1, 2), (3, 4) and (5, 6)

Using SQL Query to return value from BigQuery User Defined Function

Can I use a query in Google BigQuery User Defined Function to return some value? I've been searching docs and stackoverflow for hours without any luck and I have a very specific use case where I need to return a single scalar value based on the values of multiple columns.
Following will be the use case for the query:
SELECT campaign,source,medium, get_channel(campaign,source,medium)
FROM table_name
The get_channel() UDF will use these parameters and a complex select statement to return a single scalar value for the row. I've prepared the query; I just need to find a way to use that query in the UDF, and honestly I am at a loss.
Is my use case correct? Is this even possible? Are there any alternatives to do this?
It looks like you want to use a UDF to select a scalar value from some lookup table. If so, NO: you cannot reference a table in a UDF. See more in Limits and Limitations
But if you just want to have some complex manipulation with arguments - sure - see dummy example below
#standardSQL
CREATE TEMPORARY FUNCTION get_channel(campaign INT64, source INT64, medium INT64) AS ((
SELECT campaign + source + medium as result_of_complex_select_statement
));
WITH `project.dataset.table_name` AS (
SELECT 1 AS campaign, 2 AS source, 3 AS medium UNION ALL
SELECT 4, 5, 6 UNION ALL
SELECT 7, 8, 9
)
SELECT
campaign,
source,
medium,
get_channel(campaign,source,medium) AS channel
FROM `project.dataset.table_name`
You should rather use JOIN to achieve your goal
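The JOIN suggestion can be pictured as a keyed lookup: each row is matched against a lookup table on (campaign, source, medium), and the channel comes from the matching row. The plain-Python sketch below models that idea only; the lookup contents, the sample rows and the left_join_channel helper are all hypothetical.

```python
# Hypothetical lookup table: (campaign, source, medium) -> channel.
channel_lookup = {
    ("spring_sale", "google", "cpc"): "Paid Search",
    ("newsletter_7", "mailing", "email"): "Email",
}

# Hypothetical rows standing in for table_name.
rows = [
    {"campaign": "spring_sale", "source": "google", "medium": "cpc"},
    {"campaign": "unknown", "source": "direct", "medium": "none"},
]

def left_join_channel(rows, lookup):
    """Attach a channel to every row; None when nothing matches (like a LEFT JOIN)."""
    joined = []
    for row in rows:
        key = (row["campaign"], row["source"], row["medium"])
        joined.append({**row, "channel": lookup.get(key)})
    return joined

result = left_join_channel(rows, channel_lookup)
for r in result:
    print(r)
```

Rows without a match keep their own columns and get an empty channel, which is the behaviour a LEFT JOIN would give; an inner join would drop them instead.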

How to use subquery result as the column name of another query

I want to use the result of a subquery as the column name in another query, because the column holding the current forecast data changes all the time and the subquery decides which column that is. My example:
select item,
item_type,
...
forcast_0 * 0.9 as finalforcast,
forcast_0 * 0.8 as newforcast
from sales_data
Here the forcast_0 column name is the result (fore_column_name) of the subquery, and the result may change to forcast_1 or forcast_2:
select
fore_column_name
from forecast_history
where ...
Also, the forcast column will be used multiple times in the first query. how could I implement this?
Use your sub query as an inline table. Something like....
select item,
item_type,
..
decode(fore_column_name, 'foo', 1, 2) * 0.9 as finalforcast,
decode(fore_column_name, 'foo', 1, 2) * 0.8 as newforcast
from sales_data,
(
select fore_column_name
from forecast_history
where ...
) inlineTable
I'm assuming here that the value from the sub-query will be the same for each row - so a quick cross-join will suffice. If the value will vary depending on the values in each row of the sales_data table, then some other type of join would be more appropriate.
Quick link to decode - in case you aren't familiar with it.
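The decode call in the answer uses placeholder values ('foo', 1, 2). One way to read the idea is that decode maps the column name returned by the sub-query onto the matching column of the current row. The plain-Python sketch below models that reading; the decode function here is a simplified stand-in that does not reproduce the database's NULL handling, and the sales_data rows and the chosen column name are hypothetical.

```python
def decode(expr, *args):
    """Rough model of DECODE(expr, search1, result1, ..., [default])."""
    pairs, default = args, None
    if len(args) % 2 == 1:            # odd count: the last argument is the default
        pairs, default = args[:-1], args[-1]
    for search, result in zip(pairs[0::2], pairs[1::2]):
        if expr == search:
            return result
    return default

# Hypothetical sales_data rows holding several forecast columns.
sales_data = [
    {"item": "pen", "forcast_0": 100.0, "forcast_1": 200.0},
    {"item": "ink", "forcast_0": 10.0, "forcast_1": 20.0},
]
fore_column_name = "forcast_1"   # stands in for the sub-query result

report = []
for row in sales_data:
    # Map the column *name* onto the matching column *value* for this row.
    forecast = decode(fore_column_name,
                      "forcast_0", row["forcast_0"],
                      "forcast_1", row["forcast_1"])
    report.append({"item": row["item"],
                   "finalforcast": forecast * 0.9,
                   "newforcast": forecast * 0.8})

for line in report:
    print(line)
```

Every candidate column has to be listed explicitly, so this approach suits a small, known set of forecast columns rather than arbitrary column names.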

PostgreSQL: How to access column on anonymous record

I have a problem that I'm working on. Below is a simplified query to show the problem:
WITH the_table AS (
SELECT a, b
FROM (VALUES('data1', 2), ('data3', 4), ('data5', 6)) x (a, b)
), my_data AS (
SELECT 'data7' AS c, array_agg(ROW(a, b)) AS d
FROM the_table
)
SELECT c, d[array_upper(d, 1)]
FROM my_data
In the my_data section, you'll notice that I'm creating an array from multiple rows, and the array is returned in one row with other data. This array needs to contain the information for both a and b, and keep the two values linked together. What would seem to make sense would be to use an anonymous row or record (I want to avoid actually creating a composite type).
This all works well until I need to start pulling data back out. In the above instance, I need to access the last entry in the array, which is done easily by using array_upper, but then I need to access the value in what used to be the b column, which I cannot figure out how to do.
Essentially, right now the above query is returning:
"data7";"(data5,6)"
And I need to return
"data7";6
How can I do this?
NOTE: While in the above example I'm using text and integers as the types for my data, they are not the actual final types, but are rather used to simplify the example.
NOTE: This is using PostgreSQL 9.2
EDIT: For clarification, something like SELECT 'data7', 6 is not what I'm after. Imagine that the_table is actually pulling from database tables and not from the WITH statement that I put in for convenience, and that I don't readily know what data is in the table.
In other words, I want to be able to do something like this:
SELECT c, (d[array_upper(d, 1)]).b
FROM my_data
And get this back:
"data7";6
Essentially, once I've put something into an anonymous record by using the row() function, how do I get it back out? How do I split up the 'data5' part and the 6 part so that they don't both return in one column?
For another example:
SELECT ROW('data5', 6)
makes 'data5' and 6 return in one column. How do I take that one column and break it back into the original two?
I hope that clarifies
If you can install the hstore extension:
with the_table as (
select a, b
from (values('data1', 2), ('data3', 4), ('data5', 6)) x (a, b)
), my_data as (
select 'data7' as c, array_agg(row(a, b)) as d
from the_table
)
select c, (avals(hstore(d[array_upper(d, 1)])))[2]
from my_data
;
c | avals
-------+-------
data7 | 6
This is just something quickly thrown together around a similar problem, not an answer to your question. It appears to be one direction towards identifying columns.
with x as (select 1 a, 2 b union all values (1,2),(1,2),(1,2))
select a from x;