I have in Hive a field which contains a map that looks like this:
{"258":0.10075276284486512,"259":0.00093852142318649,"262":0.015979321337627,"264":0.0020453444772401,"265":0.024689771044731,"268":0.018837925051338,"274":0.011282124863882}
I would like to extract the key (and value, if possible) with the greatest value in this map for each row. In this case, the ideal function would look like this:
select max_val(col)
from table
Output:
max_val
"258"
"165"
"204"
Explode the map column, then use a ranking function such as rank() to order the values as required and take the first such row. (This assumes there is a column other than the map that identifies a row, id in the query shown below.)
select id, k, v
from (select id, k, v, rank() over (partition by id order by v desc) as rnum
      from tbl
      lateral view explode(mapCol) t as k, v
     ) t
where rnum = 1
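Note that rank() returns every tied key when two entries share the same maximum value. If you want exactly one row per id regardless of ties, row_number() is a drop-in replacement; a minimal sketch against the same hypothetical tbl:
select id, k, v
from (select id, k, v,
             -- row_number() breaks ties arbitrarily, so exactly one row survives per id
             row_number() over (partition by id order by v desc) as rnum
      from tbl
      lateral view explode(mapCol) t as k, v
     ) t
where rnum = 1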
If I have a query like this
SELECT * FROM table1
I get a result something like this:
How can I write a query from the same table that returns me something like this:
The value_name entries have to turn into columns, and the value column has to supply their values.
Also notice that the ids are repeated and each id's description is always the same.
I'm working with PostgreSQL.
If you know the values in advance, you can use conditional aggregation:
select id, description,
       max(value) filter (where value_name = 'FE') as fe,
       max(value) filter (where value_name = 'H2O') as h2o,
       max(value) filter (where value_name = 'N') as n
from t
group by id, description;
If you don't know the names, then you cannot accomplish this with a single SQL query. You need to use dynamic SQL or use an alternate data representation such as JSON.
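For the JSON route, one possibility (a sketch, assuming a reasonably recent PostgreSQL and the same table t) is to collapse the name/value pairs into a single jsonb column instead of separate columns:
select id, description,
       -- collapse each group's value_name/value pairs into one jsonb object
       jsonb_object_agg(value_name, value) as vals
from t
group by id, description;
Each row then carries one object keyed by value_name, which the application can unpack without knowing the names up front.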
In BigQuery, I am trying to copy column values into other rows using a PARTITION BY statement anchored to a particular ID number.
Here is an example:
Right now, I am trying to use:
MIN(col_a) OVER (PARTITION BY CAST(id AS STRING), date ORDER BY date) AS col_b
It doesn't seem like the aggregate function is working properly: col_b still has null values when I try this method. Am I misunderstanding how aggregate functions work?
You can use this:
MIN(col_a) OVER (PARTITION BY id) AS col_b
If you have one value per id, this will return that value.
Note that converting id to a string is unnecessary. Also, you don't need a cumulative minimum, hence no ORDER BY.
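In context, the full query might look like this (a sketch, assuming a hypothetical table named my_table with the id, date, and col_a columns from the question):
select id, date, col_a,
       -- my_table is a hypothetical name; id, date, col_a are the columns from the question
       min(col_a) over (partition by id) as col_b
from my_table;
Because MIN() ignores NULLs, every row in an id partition gets the non-null col_a value (the smallest one, if there are several).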
Another option is using coalesce with a correlated subquery:
select *,
       coalesce(col_a, (select min(col_a) from my_table b where a.id = b.id)) as col_b
from my_table a;
This is a very simple question, but I can't seem to find documentation on it. How would one query rows by index (i.e., select the 10th through 20th rows in a table)?
I know there's a row_number function, but it doesn't seem to do what I want.
Do not specify any partition, so your row number will be an integer between 1 and your number of records.
SELECT row_num FROM (
  SELECT row_number() over () as row_num
  FROM your_table
)
WHERE row_num between 100000 and 100010
I seem to have found a roundabout and clunky way of doing this in Athena, so any better answers are welcome. This approach requires that you already have some numeric column in your table, in this case named some_numeric_column:
SELECT some_numeric_column, row_num FROM (
  SELECT some_numeric_column,
         row_number() over (order by some_numeric_column) as row_num
  FROM your_table
)
WHERE row_num between 100000 and 100010
To explain: you first select some numeric column in your data, then create a column (called row_num) of row numbers based on the order of that numeric column. You then wrap all of that in an outer SELECT, because Athena doesn't support creating and then filtering on the row_num column within a single query; if you don't wrap it in a second SELECT, Athena will complain that it can't find a column named row_num.
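Applied to the original question (rows 10 through 20), the same pattern looks like this; a sketch reusing the hypothetical your_table and some_numeric_column names, and keeping in mind that the numbering is only as stable as the column you order by:
SELECT * FROM (
  SELECT t.*,
         -- number the rows by the chosen ordering column, then keep rows 10-20
         row_number() over (order by some_numeric_column) as row_num
  FROM your_table t
)
WHERE row_num between 10 and 20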
My query needs to avoid duplicates in a particular column while selecting all columns, but DISTINCT is not working since the seq_num column is also being selected.
Any idea how to make the query work?
In the example query below, seq_num is a unique key.
Edit: sample data included in the picture below.
select DISTINCT(name), seq_num from table_1;
![sample data](https://i.stack.imgur.com/Y3NYn.jpg)
For two columns this query will be enough:
SELECT name, min(seq_num)
FROM table
GROUP BY name
For more columns, use the row_number analytic function:
SELECT name, col1, col2, .... col500, seq_num
FROM (
  SELECT t.*, row_number() over (partition by name order by seq_num) as rn
  FROM table t
)
WHERE rn = 1
The above queries pick only one row for each name: the one with the smallest seq_num value.
You cannot do what you want with DISTINCT. Read more about the DISTINCT clause and how it applies to the whole query result set, and you will see that it is not suitable for your issue. If you provide some sample data showing what you have and what the query should return, we can help you further.
I have a table with many duplicate items – many rows with the same id, perhaps differing only in a requested_at column.
I'd like to do a select * from the table, but only return one row per id – the most recently requested.
I've looked into group by id but then I need to do an aggregate for each column. This is easy with requested_at – max(requested_at) as requested_at – but the others are tough.
How do I make sure I get the value for title, etc that corresponds to that most recently updated row?
I suggest a similar form (to the ROW_NUMBER() answer below) that avoids a sort inside the window function:
SELECT *
FROM (
  SELECT
    *,
    MAX(<timestamp_column>)
      OVER (PARTITION BY <id_column>)
      AS max_timestamp
  FROM <table>
)
WHERE <timestamp_column> = max_timestamp
Try something like this:
SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER()
      OVER (
        PARTITION BY <id_column>
        ORDER BY <timestamp_column> DESC)
      AS row_number
  FROM <table>
)
WHERE row_number = 1
Note it will add a row_number column, which you might not want. To fix this, you can select individual columns by name in the outer select statement.
In your case, it sounds like the requested_at column is the one you want to use in the ORDER BY.
And, you will also want to use allow_large_results, set a destination table, and specify no flattening of results (if you have a schema with repeated fields).
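If your BigQuery environment supports the QUALIFY clause in standard SQL, the wrapping subquery can be dropped entirely. A minimal sketch using the same placeholder names (adding WHERE TRUE, since BigQuery may require a WHERE, GROUP BY, or HAVING clause alongside QUALIFY):
SELECT *
FROM <table>
WHERE TRUE  -- satisfies BigQuery's requirement for a WHERE/GROUP BY/HAVING clause with QUALIFY
QUALIFY ROW_NUMBER() OVER (PARTITION BY <id_column> ORDER BY <timestamp_column> DESC) = 1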