How can I concatenate multiple array columns in Hive? - hive

In a Hive table, let's assume I have data looking as follows:
+----+-----------+-----------+
| id | some_ids1 | some_ids2 |
+----+-----------+-----------+
| 1  | [2, 3]    | [4, 5, 6] |
| 2  | [7]       | [8, 9]    |
+----+-----------+-----------+
and I would like to obtain something like:
+----+------------------+
| id | concatenated_ids |
+----+------------------+
| 1  | [2, 3, 4, 5, 6]  |
| 2  | [7, 8, 9]        |
+----+------------------+
In my actual example, the inner types are not primitives, so casting to string doesn't work. How are array columns concatenated in Hive?

Use a UDF: pass in the two columns, concatenate them, and return the result as an array.
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDF;

public class ArraySum extends UDF {
    public List<Double> evaluate(List<Double> list1, List<Double> list2) {
        // addAll() mutates in place and returns a boolean, so copy first
        List<Double> result = new ArrayList<>(list1);
        result.addAll(list2);
        return result;
    }
}
Hope this helps!
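The UDF's job is just list concatenation; as a quick plain-Python illustration of what evaluate should return for the sample row with id 1:

```python
list1 = [2, 3]
list2 = [4, 5, 6]

# Copy the first list and append the second -- equivalent to the
# list concatenation the Java UDF performs.
combined = list(list1)
combined.extend(list2)
print(combined)  # [2, 3, 4, 5, 6]
```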

Related

Pandas, new column containing value if key equals another column

Apologies if the title didn't make sense, I found it difficult to explain in one line.
I have a data frame, two of the columns contain data like this:
Column 1
| Selection_id |
| -------- |
| 46660181 |
| 40115397 |
| 267698 |
| 34774 |
| 449342 |
Column 2
| Bsps |
| -------- |
| 46660181: 2.1, 40115397: 1.75, 267698: 3.15, 34774: 2.64, 449342: 3.9 |
What I need to do is create a new column containing the value from the bsps as long as the selection id matches the bsp key. So 46660181 would have 2.1 in the column next to it.
I hope that makes sense!
I don't have great knowledge of Python or Pandas, as I've not been using them for long, but I'll do my best to follow along!
Any help with this would be appreciated, thank you.
Assuming some things here... There are faster ways than df.apply if you want to restructure your initial data, but apply works as long as your data is on the smaller end and doesn't need to scale.
import pandas as pd

df = pd.DataFrame(dict(cola=[2, 1, 3],
                       colb=[[{1: 4}, {2: 5}, {3: 6}],
                             [{1: 4}, {2: 5}, {3: 6}],
                             [{1: 4}, {2: 5}, {3: 6}]]))
| | cola | colb |
|---:|-------:|:-------------------------|
| 0 | 2 | [{1: 4}, {2: 5}, {3: 6}] |
| 1 | 1 | [{1: 4}, {2: 5}, {3: 6}] |
| 2 | 3 | [{1: 4}, {2: 5}, {3: 6}] |
solution:
df['val'] = df.apply(lambda row: [x.get(row.cola) for x in row.colb if x.get(row.cola) is not None][0], axis=1)
| | cola | colb | val |
|---:|-------:|:-------------------------|------:|
| 0 | 2 | [{1: 4}, {2: 5}, {3: 6}] | 5 |
| 1 | 1 | [{1: 4}, {2: 5}, {3: 6}] | 4 |
| 2 | 3 | [{1: 4}, {2: 5}, {3: 6}] | 6 |
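A sketch of the restructuring idea mentioned in the answer (assuming, as in the toy data, every row carries the same list of single-key dicts): flattening that list into one lookup dict lets a vectorised Series.map replace the row-wise apply.

```python
import pandas as pd

df = pd.DataFrame(dict(cola=[2, 1, 3],
                       colb=[[{1: 4}, {2: 5}, {3: 6}]] * 3))

# Merge the list of single-key dicts into one flat lookup dict
# (taken from the first row, since every row holds the same mapping),
# then map the key column through it -- no per-row Python lambda needed.
lookup = {k: v for d in df.loc[0, "colb"] for k, v in d.items()}
df["val"] = df["cola"].map(lookup)
print(df["val"].tolist())  # [5, 4, 6]
```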

Flatten Nested Array and Aggregate in Snowflake

My table column has nested arrays in a Snowflake database. I want to perform some aggregations using SQL (Snowflake SQL).
My table name is: DATA
The PROJ column is of VARIANT data type. The nested arrays will not always have 3 elements, as demonstrated in the DATA table.
| ID | PROJ | LOCATION |
|----|-------------------------------|----------|
| 1 |[[0, 4], [1, 30], [10, 20]] | S |
| 2 |[[0, 2], [1, 20]] | S |
| 3 |[[0, 8], [1, 10], [10, 100]] | S |
Desired Output:
| Index | LOCATION | Min | Max | Mean|
|-------|----------|------|-----|-----|
| 0 | S | 2 | 8 | 4.66|
| 1 | S | 10 | 30 | 20 |
| 10 | S | 20 | 100| 60 |
First the nested array should be flattened; then Index is the first element of each subarray and Value is the second element (arrays are 0-based):
CREATE OR REPLACE TABLE DATA
AS
SELECT 1 AS ID, [[0, 4], [1, 30], [10, 20]] AS PROJ UNION
SELECT 2 AS ID, [[0, 2], [1, 20]] AS PROJ UNION
SELECT 3 AS ID, [[0, 8], [1, 10], [10, 100]] AS PROJ;
Query:
SELECT s.VALUE[0]::INT AS Index,
MIN(s.VALUE[1]::INT) AS MinValue,
MAX(s.VALUE[1]::INT) AS MaxValue,
AVG(s.VALUE[1]::INT) AS MeanValue
FROM DATA
,LATERAL FLATTEN(input=> PROJ) s
GROUP BY s.VALUE[0]::INT
ORDER BY Index;
Output:
| INDEX | MINVALUE | MAXVALUE | MEANVALUE |
|-------|----------|----------|-----------|
| 0     | 2        | 8        | 4.666667  |
| 1     | 10       | 30       | 20.000000 |
| 10    | 20       | 100      | 60.000000 |
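As a sanity check of the flatten-and-aggregate logic outside Snowflake, here is a small Python sketch with plain nested lists standing in for the VARIANT column:

```python
from collections import defaultdict
from statistics import mean

rows = [
    [[0, 4], [1, 30], [10, 20]],
    [[0, 2], [1, 20]],
    [[0, 8], [1, 10], [10, 100]],
]

# Flatten: collect every [index, value] pair across all rows,
# grouped by the index (the first element of each subarray).
groups = defaultdict(list)
for proj in rows:
    for index, value in proj:
        groups[index].append(value)

for index in sorted(groups):
    vals = groups[index]
    print(index, min(vals), max(vals), round(mean(vals), 2))
```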

How to get the sum of a transposed array in PostgreSQL

I've designed a table with the structure below.
Column | Type | Collation | Nullable | Default
-------------------------+--------------------------+-----------+----------+---------
counts | integer[] | | |
Every value in the counts column has 9 elements.
None of the elements is null.
I would like to get the sums of the transposed array, like in the Python code below, using only a SQL query.
import numpy as np

counts = np.array([
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11]
])
counts = counts.transpose()
sums = counts.sum(axis=1).tolist()  # sum each row of the transpose
print(sums)  # [12, 15, 18, 21]
So given example data:
counts |
-----------------------|
{0,15,8,6,10,12,4,0,5} |
{0,4,6,14,7,9,8,0,9} |
{0,6,7,4,11,6,10,0,10} |
Because there are over a thousand rows, it takes around 500 ms to compute the transposed sums on the frontend side.
I would like a result:
counts |
---------------------------|
{0,25,21,24,28,27,22,0,24} |
or
value0 | value1 | value2 | value3 | value4 | value5 | value6 | value7 | value8 |
--------|--------|--------|--------|--------|--------|--------|--------|--------|
0 | 25 | 21 | 24 | 28 | 27 | 22 | 0 | 24 |
In my opinion, this could be solved with SQL functions, but I don't know how to use them.
Just SUM() the array elements:
SELECT SUM(i[1]) -- first
, SUM(i[2]) -- second
, SUM(i[3]) -- third
-- etc.
, SUM(i[9])
FROM (VALUES
('{0,15,8,6,10,12,4,0,5}'::int[]),
('{0,4,6,14,7,9,8,0,9}'),
('{0,6,7,4,11,6,10,0,10}')) s(i);
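The same element-wise aggregation can be cross-checked in Python, with plain lists standing in for the integer[] rows; zip(*rows) performs the transpose:

```python
rows = [
    [0, 15, 8, 6, 10, 12, 4, 0, 5],
    [0, 4, 6, 14, 7, 9, 8, 0, 9],
    [0, 6, 7, 4, 11, 6, 10, 0, 10],
]

# zip(*rows) transposes, so each tuple holds one "column" across all rows,
# mirroring SUM(i[1]), SUM(i[2]), ... in the SQL above.
totals = [sum(col) for col in zip(*rows)]
print(totals)  # [0, 25, 21, 24, 28, 27, 22, 0, 24]
```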

In Hive, how to combine multiple tables to produce single row containing array of objects?

I have two tables as follows:
users table
==========================
| user_id name age |
|=========================
| 1 pete 20 |
| 2 sam 21 |
| 3 nash 22 |
==========================
hobbies table
======================================
| user_id hobby time_spent |
|=====================================
| 1 football 2 |
| 1 running 1 |
| 1 basketball 3 |
======================================
First question: I would like to make a single Hive query that can return rows in this format:
{ "user_id": 1, "name": "pete", "hobbies": [ {"hobby": "football", "time_spent": 2}, {"hobby": "running", "time_spent": 1}, {"hobby": "basketball", "time_spent": 3} ] }
Second question: If the hobbies table were to be as follows:
========================================
| user_id hobby scores |
|=======================================
| 1 football 2,3,1 |
| 1 running 1,1,2,5 |
| 1 basketball 3,6,7 |
========================================
Would it be possible to get the row output where scores is a list in the output as shown below:
{ "user_id": 1, "name": "pete", "hobbies": [ {"hobby": "football", "scores": [2, 3, 1]}, {"hobby": "running", "scores": [1, 1, 2, 5]}, {"hobby": "basketball", "scores": [3, 6, 7]} ] }
I was able to find the answer to my first question:
select u.user_id, u.name,
collect_list(
str_to_map(
concat_ws(",", array(
concat("hobby:", h.hobby),
concat("time_spent:", h.time_spent)
))
)
) as hobbies
from users as u
join hobbies as h on u.user_id=h.user_id
group by u.user_id, u.name;
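For comparison, the same shaping (join users to hobbies, then collect each user's hobby rows into a list of per-hobby dicts) can be sketched in pandas; the data frames here are just the sample tables re-typed:

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1], "name": ["pete"]})
hobbies = pd.DataFrame({
    "user_id": [1, 1, 1],
    "hobby": ["football", "running", "basketball"],
    "time_spent": [2, 1, 3],
})

# Join, then gather each user's hobby rows into a list of dicts,
# mirroring collect_list(str_to_map(...)) in the Hive query.
joined = users.merge(hobbies, on="user_id")
result = (joined.groupby(["user_id", "name"])[["hobby", "time_spent"]]
                .apply(lambda g: g.to_dict("records"))
                .reset_index(name="hobbies"))
print(result.loc[0, "hobbies"][0])  # {'hobby': 'football', 'time_spent': 2}
```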

pandas: how to get each customer's probability with predict_proba

I am using xgboost with objective='binary:logistic' to calculate each customer's probability of making the spend.
Using predict_proba in sklearn prints two probabilities, for class 0 and class 1, like:
[[0.56651809 0.43348191]
[0.15598162 0.84401838]
[0.86852502 0.13147498]]
How can I attach each customer's ID with pandas to get something like:
+----+------------+------------+
| ID | prob_0 | prob_1 |
+----+------------+------------+
| 1 | 0.56651809 | 0.43348191 |
| 2 | 0.15598162 | 0.84401838 |
| 3 | 0.86852502 | 0.13147498 |
+----+------------+------------+
You can use the pandas DataFrame() constructor to build that shape.
import pandas as pd

list_data = [[0.56651809, 0.43348191],
             [0.15598162, 0.84401838],
             [0.86852502, 0.13147498]]
columns = ['prob_0', 'prob_1']
index = [1, 2, 3]
pd.DataFrame(data=list_data, columns=columns, index=index)
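In practice you would pass the predict_proba output straight into the constructor; a sketch with a hard-coded array standing in for model.predict_proba(X), and a hypothetical ids list assumed to be aligned row-for-row with the feature matrix:

```python
import numpy as np
import pandas as pd

# Stand-in for proba = model.predict_proba(X)
proba = np.array([[0.56651809, 0.43348191],
                  [0.15598162, 0.84401838],
                  [0.86852502, 0.13147498]])
ids = [1, 2, 3]  # hypothetical customer IDs, one per row of X

df = pd.DataFrame(proba, columns=["prob_0", "prob_1"])
df.insert(0, "ID", ids)  # put the ID column first, as in the desired table
print(df)
```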