Is there a Hive UDF that creates a map with unique values?
For example:
col_1 | col_2
-------------
a | x
a | y
b | y
b | y
c | z
c | NULL
d | NULL
This should return a map as follows:
{ a: [x,y], b: [y], c: [z] }
I'm looking for something similar to Presto's multimap_agg function.
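For reference, the Presto usage would look roughly like this (a sketch; tbl is a placeholder table name):
SELECT multimap_agg(col_1, col_2) FROM tbl;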
Use collect_set to remove duplicate col_2 values per col_1, then wrap the result with map:
select map(col_1, uniq_col_2)
from (select col_1, collect_set(col_2) as uniq_col_2
        from tbl
       where col_2 is not null
       group by col_1
     ) t
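Note that this produces one single-entry map per col_1 value rather than one combined map; with the sample data the rows would come out roughly as:
{"a":["x","y"]}
{"b":["y"]}
{"c":["z"]}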
I have a query that returns two rows with the information needed
SELECT src_file_dt, a, b ,c FROM my_table WHERE src_file_dt IN ('1531675040', '1531675169');
it will return:
src_file_dt | a | b | c
1531675040 | 2 | 6 | 9
1531675169 | 8 | 2 | 0
Now I need the data in the following layout; how do I get output like this?
fields | prev (1531675040) | curr (1531675169)
a | 2 | 8
b | 6 | 2
c | 9 | 0
There should be an easier way to achieve this. I've built the result using the explode function, selecting the data for the prev and curr columns separately:
select t1.keys, t1.vals as prev, t2.vals as curr
from (
    SELECT explode(map('a', a, 'b', b, 'c', c)) as (keys, vals)
    FROM my_table
    WHERE src_file_dt = '1531675040'
) t1,
(
    SELECT explode(map('a', a, 'b', b, 'c', c)) as (keys, vals)
    FROM my_table
    WHERE src_file_dt = '1531675169'
) t2
where t1.keys = t2.keys;
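A single-scan alternative (a sketch, assuming the same my_table) is to explode the map once via LATERAL VIEW and pivot back with conditional aggregation:
SELECT t.keys AS fields,
       MAX(CASE WHEN src_file_dt = '1531675040' THEN t.vals END) AS prev,
       MAX(CASE WHEN src_file_dt = '1531675169' THEN t.vals END) AS curr
FROM my_table
LATERAL VIEW explode(map('a', a, 'b', b, 'c', c)) t AS keys, vals
WHERE src_file_dt IN ('1531675040', '1531675169')
GROUP BY t.keys;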
I want to find all the non-duplicated records and update one of the columns.
Ex.
Col_1 | Col_2 | Col_3 | Col_4 | Col_5
A | AA | BB | 1 |
A | AB | BC | 2 |
A | AC | BD | 3 |
B | BB | CC | 1 |
B | BB | CC | 2 |
C | CC | DD | 1 |
My query has to group by Col_1, and I want to find the records that are not unique based on Col_2 and Col_3, and then update Col_5.
Basically output should be as below,
Col_1 | Col_2 | Col_3 | Col_4 | Col_5
A | AA | BB | 1 | 1
A | AB | BC | 2 | 1
A | AC | BD | 3 | 1
B | BB | CC | 1 | 0
B | BB | CC | 2 | 0
C | CC | DD | 1 | 0
Does anyone have an idea how I can achieve this? This is a large database, so performance is also a key factor.
Thanks heaps!
There are plenty of ways to do it. This solution was written for Postgres, which is what I have access to, but I bet it will also work in T-SQL, as the syntax should be common.
;WITH
cte_1 AS (
    SELECT col_1 FROM some_table GROUP BY col_1 HAVING count(*) > 1
),
cte_2 AS (
    SELECT col_1 FROM some_table GROUP BY col_1, col_2, col_3 HAVING count(*) > 1
),
cte_3 AS (
    SELECT cte_1.col_1 FROM cte_1
    LEFT JOIN cte_2 ON cte_1.col_1 = cte_2.col_1
    WHERE cte_2.col_1 IS NULL
)
UPDATE some_table SET col_5 = 1
FROM cte_3 WHERE cte_3.col_1 = some_table.col_1;
So, what happens above?
First we build three CTEs, which allow us to split the logic into smaller parts:
cte_1 extracts the col_1 groups that have multiple rows (and so can have multiple col_2 and col_3 values)
cte_2 selects the col_1 groups that have non-unique (col_2, col_3) combinations
cte_3 returns, via the LEFT JOIN, those col_1 groups whose (col_2, col_3) combinations are all unique
Using the last CTE, cte_3, we are able to update some_table correctly. For the sample data, cte_1 yields A and B, cte_2 yields B, so cte_3 yields A, and only A's rows get col_5 = 1.
I assume that your table is called some_table here. If you're worried about performance, you should have a primary key, and it would also be good to have indexes on col_2 and col_3 (standalone, though composite indexes such as (col_1, col_2) and so on may help too). You may also want to move from CTEs to temporary tables (which could also be indexed to gain efficiency). Please also note that this query works fine with your example, but without real data it may be just guessing. I mean: what should happen when col_1 = 'A' has some unique and some non-unique col_2 at the same time? But I believe it's a good starting point. The following variant compares the group counts to handle that mixed case:
;WITH
cte_1 AS (
    SELECT col_1, count(*) AS items FROM some_table GROUP BY col_1 HAVING count(*) > 1
),
cte_2 AS (
    SELECT col_1, count(*) AS items FROM some_table GROUP BY col_1, col_2, col_3 HAVING count(*) > 1
),
cte_3 AS (
    SELECT cte_1.col_1 FROM cte_1
    LEFT JOIN cte_2 ON cte_1.col_1 = cte_2.col_1
    WHERE cte_2.col_1 IS NULL OR cte_1.items > cte_2.items
    GROUP BY cte_1.col_1
)
UPDATE some_table SET col_5 = 1
FROM cte_3 WHERE cte_3.col_1 = some_table.col_1;
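As a quick illustration of the mixed case: if col_1 = 'A' had the rows (AA, BB), (AA, BB), (AB, BC), then cte_1.items would be 3 while cte_2.items would be 2, so A would still be flagged. Note this only checks that at least one combination in the group is unique; whether that is the desired behaviour depends on the real data.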
I have a table table1 like below
+----+------+------+------+------+------+
| id | loc | val1 | val2 | val3 | val4 |
+----+------+------+------+------+------+
| 1 | loc1 | 10 | 190 | null | 20 |
| 2 | loc2 | 20 | null | 10 | 10 |
+----+------+------+------+------+------+
I need to combine val1 to val4 into a new column val, with one row for each, so that the output looks like below.
NOTE: the data I have has val1 to val30, i.e. 30 columns per row that need to be converted into rows.
+----+------+--------+
| id | loc | val |
+----+------+--------+
| 1 | loc1 | 10 |
| 1 | loc1 | 190 |
| 1 | loc1 | null |
| 1 | loc1 | 20 |
| 2 | loc2 | 20 |
| 2 | loc2 | null |
| 2 | loc2 | 10 |
| 2 | loc2 | 10 |
+----+------+--------+
You can use a lateral join to transform columns into rows:
SELECT a.id,a.loc,t.vals
FROM table1 a,
unnest(ARRAY[a.val1,a.val2,a.val3,a.val4]) t(vals);
If you want to do this with dynamically added columns:
CREATE OR REPLACE FUNCTION columns_to_rows(
out id integer,
out loc text,
out vals integer
)
RETURNS SETOF record AS
$body$
DECLARE
columns_to_rows text;
BEGIN
SELECT string_agg('a.'||attname, ',') into columns_to_rows
FROM pg_attribute
WHERE attrelid = 'your_table'::regclass AND --table name
attnum > 0 and --get just the visible columns
attname <> all (array [ 'id', 'loc' ]) AND --exclude some columns
NOT attisdropped ; --column is not dropped
RETURN QUERY
EXECUTE format('SELECT a.id,a.loc,t.vals
FROM your_table a,
unnest(ARRAY[%s]) t(vals)',columns_to_rows);
END;
$body$
LANGUAGE plpgsql;
Look at this link for more detail: Columns to rows
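To use it, just call the function (assuming your_table inside the body has been replaced with the real table name):
SELECT * FROM columns_to_rows();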
You could use a cross join with generate_series for this:
select
id,
loc,
case x.i
when 1 then val1
when 2 then val2
. . .
end as val
from t
cross join generate_series(1, 4) x (i)
It uses the table only once and can be easily extended to accommodate more columns.
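For the four columns in the question, the filled-in CASE would be (a sketch, using the question's table1):
select
  id,
  loc,
  case x.i
    when 1 then val1
    when 2 then val2
    when 3 then val3
    when 4 then val4
  end as val
from table1 t
cross join generate_series(1, 4) x (i);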
Demo
Note: in the accepted answer, the first approach reads the table many times (as many times as there are columns to unpivot), and the second approach is wrong, as there is no UNPIVOT in PostgreSQL.
I'm sure there's a classier approach than this.
SELECT * FROM (
select id, loc, val1 as val from #t a
UNION ALL
select id, loc, val2 as val from #t a
UNION ALL
select id, loc, val3 as val from #t a
UNION ALL
select id, loc, val4 as val from #t a
) x
order by ID
Here's my attempt with UNPIVOT, but it can't get the NULLs; perhaps a join could bring the NULLs back? Anyway, I'll still try.
SELECT *
FROM (
SELECT * FROM #t
) main
UNPIVOT (
new_val
FOR val IN (val1, val2, val3, val4)
) unpiv
This will not work in Postgres, as the user needs; I saw that mentioned in the comments.
Here is a way to handle NULL (note that it conflates genuine 0 values with NULLs, so it assumes 0 never occurs in the data):
select p.id, p.loc,
       CASE WHEN p.val = 0 THEN NULL ELSE p.val END AS val
from
(
    SELECT id, loc,
           ISNULL(val1,0) AS val1, ISNULL(val2,0) AS val2,
           ISNULL(val3,0) AS val3, ISNULL(val4,0) AS val4
    FROM Table1
) T
unpivot
(
    val
    for locval in (val1, val2, val3, val4)
) p
Test
EDIT:
The best solution from my side:
select a.id, a.loc, ex.val
from (select 'val1' as [over] union all select 'val2' union all select 'val3'
      union all select 'val4') pmu
cross join (select id, loc from Table1) as a
left join
    Table1 pt
    unpivot
    (
        [val]
        for [over] in (val1, val2, val3, val4)
    ) ex
on pmu.[over] = ex.[over] and
   a.id = ex.id
Test
I have the following records:
Col_1 | Col_2 | Col_3
------+-------+------
A | A | XYZ01
A | A | XYZ02
A | A | XYZ03
A | B | XYZ04
B | B | XYZ05
B | B | XYZ06
B | B | XYZ07
B | B | XYZ08
I need a query that returns at most 2 records for each distinct (Col_1, Col_2) combination, regardless of Col_3 (i.e. something like a 2-record sample of each distinct Col_1, Col_2 combination).
So this query should return:
Col_1 | Col_2 | Col_3
------+-------+------
A | A | XYZ01
A | A | XYZ02
A | B | XYZ04
B | B | XYZ05
B | B | XYZ06
SELECT *
FROM (
    SELECT col_1,
           col_2,
           col_3,
           row_number() OVER (
               PARTITION BY col_1, col_2
               ORDER BY col_1
           ) AS foo
    FROM TABLENAME
) bar
WHERE foo < 3
TOP will not work because you want to 'group by' multiple columns. What will help is partitioning the data and assigning a row number within each partition. By partitioning on col_1 and col_2 we create 3 different groupings:
1. All rows with 'A' in both col_1 and col_2
2. All rows with 'B' in both col_1 and col_2
3. All rows with 'A' in col_1 and 'B' in col_2
We order by col_1 (I picked this because your result set was ordered by it). Then each row in a grouping gets its row number within that grouping.
We use this information as a derived table, and select * from this derived table where the row number is less than 3. This gets us the first two rows in each grouping, as the trace below shows.
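With the sample data, the inner query would assign row numbers something like this (the order within a partition is arbitrary, since the ORDER BY column is constant inside each partition):
Col_1 | Col_2 | Col_3 | foo
A     | A     | XYZ01 | 1
A     | A     | XYZ02 | 2
A     | A     | XYZ03 | 3
A     | B     | XYZ04 | 1
B     | B     | XYZ05 | 1
B     | B     | XYZ06 | 2
B     | B     | XYZ07 | 3
B     | B     | XYZ08 | 4
Keeping foo < 3 then leaves exactly the five rows in the desired output.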
As far as Oracle goes, the use of RANK would work; ordering by Col_3 keeps the ranks within each partition distinct:
Select * From (SELECT Col_1,
                      Col_2,
                      Col_3,
                      RANK() OVER (PARTITION BY Col_1, Col_2 ORDER BY Col_3) part
               FROM someTable) st
where st.part < 3;
Since I was reminded that you can't use the alias in the original WHERE clause, I made a change that will still work, though it may not be the most elegant.
How can I perform a query that returns a row if the wanted values are at the same index in different columns? For example:
select id_reg, a1, a2 from lb_reg_teste2;
id_reg | a1 | a2
--------+------------------+-------------
1 | {10,10,20,20,10} | {3,2,4,3,6}
(1 row)
The query would be something like:
select id_reg from lb_reg_teste2 where idx(a1, '20') = idx(a2, '3');
# Should return id_reg = 1
I found this script, but it only returns the first occurrence of a value in an array. For this case, I need all occurrences.
CREATE OR REPLACE FUNCTION idx(anyarray, anyelement)
RETURNS int AS
$$
SELECT i FROM (
SELECT generate_series(array_lower($1,1),array_upper($1,1))
) g(i)
WHERE $1[i] = $2
LIMIT 1;
$$ LANGUAGE sql IMMUTABLE;
You can extract the values from the arrays along with their indices, and then filter out the results.
If the arrays have the same number of elements, consider this query:
SELECT id_reg,
generate_subscripts(a1,1) as idx1,
unnest(a1) as val1,
generate_subscripts(a2,1) as idx2,
unnest(a2) as val2
FROM lb_reg_teste2
With the sample values of the question, this would generate this:
id_reg | idx1 | val1 | idx2 | val2
--------+------+------+------+------
1 | 1 | 10 | 1 | 3
1 | 2 | 10 | 2 | 2
1 | 3 | 20 | 3 | 4
1 | 4 | 20 | 4 | 3
1 | 5 | 10 | 5 | 6
Then use it as a subquery and add a WHERE clause to filter out as necessary.
For the example with 20 and 3 as the values to find at the same index:
SELECT DISTINCT id_reg FROM
( SELECT id_reg,
generate_subscripts(a1,1) as idx1,
unnest(a1) as val1,
generate_subscripts(a2,1) as idx2,
unnest(a2) as val2
FROM lb_reg_teste2 ) s
WHERE idx1=idx2 AND val1=20 AND val2=3;
If the number of elements of a1 and a2 differ, the subquery above will generate extra rows (up to NxM rows, where N and M are the array sizes: PostgreSQL versions before 10 expand the two set-returning functions to the least common multiple of their row counts, while PostgreSQL 10 and later run them in lockstep and pad the shorter one with NULLs), so this will be less efficient but still produce the correct result, as far as I understood what you expect.
In this case, a variant would be to generate two distinct subqueries with the (values,indices) of each array and join them by the equality of the indices.
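A sketch of that variant (assuming the same lb_reg_teste2 table and the 20/3 example values):
SELECT DISTINCT s1.id_reg
FROM (SELECT id_reg, generate_subscripts(a1,1) AS idx, unnest(a1) AS val
      FROM lb_reg_teste2) s1
JOIN (SELECT id_reg, generate_subscripts(a2,1) AS idx, unnest(a2) AS val
      FROM lb_reg_teste2) s2
  ON s1.id_reg = s2.id_reg AND s1.idx = s2.idx
WHERE s1.val = 20 AND s2.val = 3;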