I use the split function to create an array in Hive. How can I get the first n elements of the array, and then iterate over that sub-array?
What I tried:
select col1 from table
where split(col2, ',')[0:5]
[0:5] looks like Python slice syntax, but it doesn't work here.
This is a much simpler way of doing it. There is a UDF here called TruncateArrayUDF.java that can do what you are asking. Just clone the repo from the main page and build the jar with Maven.
Example Data:
| col1 |
----------------------
1,2,3,4,5,6,7
11,12,13,14,15,16,17
Query:
add jar /complete/path/to/jar/brickhouse-0.7.0-SNAPSHOT.jar;
create temporary function trunc as 'brickhouse.udf.collect.TruncateArrayUDF';
select pos
,newcol
from (
select trunc(split(col1, '\\,'), 5) as p
from table
) x
lateral view posexplode(p) explodetable as pos, newcol
Output:
pos | newcol
----+--------
  0 | 1
  1 | 2
  2 | 3
  3 | 4
  4 | 5
  0 | 11
  1 | 12
  2 | 13
  3 | 14
  4 | 15
This is a tricky one.
First grab the brickhouse jar from here.
Then add it to Hive: add jar /path/to/jars/brickhouse-0.7.0-SNAPSHOT.jar;
Now create the two functions we will be using:
CREATE TEMPORARY FUNCTION array_index AS 'brickhouse.udf.collect.ArrayIndexUDF';
CREATE TEMPORARY FUNCTION numeric_range AS 'brickhouse.udf.collect.NumericRange';
The query will be:
select a,
n as array_index,
array_index(split(a,','),n) as value_from_Array
from ( select "abc#1,def#2,hij#3" a from dual union all
select "abc#1,def#2,hij#3,zzz#4" a from dual) t1
lateral view numeric_range( length(a)-length(regexp_replace(a,',',''))+1 ) n1 as n
Explained :
select "abc#1,def#2,hij#3" a from dual union all
select "abc#1,def#2,hij#3,zzz#4" a from dual
Is just selecting some test data; in your case, replace this with your table name.
lateral view numeric_range( length(a)-length(regexp_replace(a,',',''))+1 ) n1 as n
numeric_range is a UDTF that returns a table for a given range. In this case, I asked for a range between 0 (the default) and the number of elements in the string (calculated as the number of commas + 1).
This way, each row will be multiplied by the number of elements in the given column.
array_index(split(a,','),n)
This is exactly like writing split(a,',')[n], but Hive doesn't support that syntax.
So we get the n-th element for each duplicated row of the initial string, resulting in:
abc#1,def#2,hij#3,zzz#4 0 abc#1
abc#1,def#2,hij#3,zzz#4 1 def#2
abc#1,def#2,hij#3,zzz#4 2 hij#3
abc#1,def#2,hij#3,zzz#4 3 zzz#4
abc#1,def#2,hij#3 0 abc#1
abc#1,def#2,hij#3 1 def#2
abc#1,def#2,hij#3 2 hij#3
If you really want a specific number of elements (say 5), then just use:
lateral view numeric_range(5) n1 as n
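In Python terms, numeric_range plays the role of range() and array_index the role of plain list indexing. A rough sketch of what the query does per row (the helper name is made up for illustration):

```python
def explode_elements(a, limit=None):
    """Mimic lateral view numeric_range(...) + array_index(split(a, ','), n)."""
    # number of elements = number of commas + 1, as computed in the query
    n_elems = len(a) - len(a.replace(",", "")) + 1
    if limit is not None:          # the numeric_range(5) variant
        n_elems = min(n_elems, limit)
    parts = a.split(",")
    # one output row per n in [0, n_elems), like the numeric_range UDTF
    return [(a, n, parts[n]) for n in range(n_elems)]

for row in explode_elements("abc#1,def#2,hij#3,zzz#4"):
    print(row)
```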
I have a computational task that can be reduced to the following problem:
I have a large set of integer pairs (key, val) that I want to group into windows. The first window starts with the first pair p ordered by the key attribute and spans all pairs whose key lies in [p[0].key, p[0].key + N), for some arbitrary positive integer N common to all windows.
The next window starts with the first pair (ordered by key) not included in the previous windows, and again spans all pairs from its key to key + N, and so on for the following windows.
The last step is to sum the second attribute for each window and display it together with the first key of the window.
For example, given this list of records:
key | val
----+----
  1 |  3
  2 |  7
  5 |  1
  6 |  4
  7 |  1
 10 |  3
 13 |  5
and N=3, the windows would be:
{(1,3),(2,7)},
{(5,1),(6,4),(7,1)},
{(10,3)}
{(13,5)}
The final result:
key | sum_of_values
----+--------------
  1 | 10
  5 |  6
 10 |  3
 13 |  5
This is easy to program with a standard programming language but I have no clue how to solve this with SQL.
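For reference, here is that "standard programming language" version, as a short Python sketch; it pins down the intended semantics before translating them to SQL:

```python
def window_sums(pairs, N):
    """Group (key, val) pairs into windows of width N and sum vals per window."""
    result = []
    start = None
    for key, val in sorted(pairs):            # walk pairs ordered by key
        if start is None or key >= start + N:
            start = key                       # key falls outside the window: open a new one
            result.append((start, 0))
        first, total = result[-1]
        result[-1] = (first, total + val)     # accumulate into the current window
    return result

print(window_sums([(1, 3), (2, 7), (5, 1), (6, 4), (7, 1), (10, 3), (13, 5)], 3))
# [(1, 10), (5, 6), (10, 3), (13, 5)]
```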
Note: if ClickHouse doesn't support the RECURSIVE keyword, just remove that keyword from the expression.
ClickHouse seems to use non-standard syntax for the WITH clause. The query below uses standard SQL; adjust as needed. ClickHouse may not support this approach at all, in which case we would need another method of walking through the data.
Standard SQL:
There are a few ways; here's one approach. First assign row numbers, to allow recursively stepping through the rows (we could use LEAD as well).
Then assign a group (a start_key value) to each row, based on the current key, the previous row's group, and whether the two are within some distance (N = 3, in this case).
The last step is to SUM the values per group start_key and use start_key as the starting key of each group.
WITH RECURSIVE nrows (xkey, val, n) AS (
SELECT xkey, val, ROW_NUMBER() OVER (ORDER BY xkey) FROM test
)
, cte (xkey, val, n, start_key) AS (
SELECT xkey, val, n, xkey FROM nrows WHERE n = 1
UNION ALL
SELECT t1.xkey, t1.val, t1.n
, CASE WHEN t1.xkey <= t2.start_key + (3-1) THEN t2.start_key ELSE t1.xkey END
FROM nrows AS t1
JOIN cte AS t2
ON t2.n = t1.n-1
)
SELECT start_key
, SUM(val) AS sum_values
FROM cte
GROUP BY start_key
ORDER BY start_key
;
Result:
+-----------+------------+
| start_key | sum_values |
+-----------+------------+
| 1 | 10 |
| 5 | 6 |
| 10 | 3 |
| 13 | 5 |
+-----------+------------+
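As a check, the recursive CTE above runs as-is on SQLite, which supports both WITH RECURSIVE and window functions; here it is driven from Python's sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test (xkey INTEGER, val INTEGER)")
con.executemany("INSERT INTO test VALUES (?, ?)",
                [(1, 3), (2, 7), (5, 1), (6, 4), (7, 1), (10, 3), (13, 5)])

rows = con.execute("""
    WITH RECURSIVE nrows (xkey, val, n) AS (
        SELECT xkey, val, ROW_NUMBER() OVER (ORDER BY xkey) FROM test
    ),
    cte (xkey, val, n, start_key) AS (
        SELECT xkey, val, n, xkey FROM nrows WHERE n = 1
        UNION ALL
        SELECT t1.xkey, t1.val, t1.n,
               CASE WHEN t1.xkey <= t2.start_key + (3 - 1)
                    THEN t2.start_key ELSE t1.xkey END
        FROM nrows AS t1
        JOIN cte AS t2 ON t2.n = t1.n - 1
    )
    SELECT start_key, SUM(val) AS sum_values
    FROM cte
    GROUP BY start_key
    ORDER BY start_key
""").fetchall()

print(rows)  # [(1, 10), (5, 6), (10, 3), (13, 5)]
```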
I was working with the <@ operator and two arrays of strings.
anyarray <@ anyarray → boolean
Every string is formed as ${name}_${number}, and I would like to check whether the name part is included in the other array and the number is equal to or lower than the one there.
['elementOne_10'] vs ['elementOne_7', 'elementTwo20'] → true
['elementOne_10'] vs ['elementOne_17', 'elementTwo20'] → false
What would be an efficient way to do this?
Assuming your sample data elementTwo20 in fact follows your described schema and should be elementTwo_20:
step-by-step demo:db<>fiddle
SELECT
id
FROM (
SELECT
*,
split_part(u, '_', 1) as name, -- 3
split_part(u, '_', 2)::int as num,
split_part(compare, '_', 1) as comp_name,
split_part(compare, '_', 2)::int as comp_num
FROM
t,
unnest(data) u, -- 1
(SELECT unnest('{elementOne_10}'::text[]) as compare) s -- 2
)s
GROUP BY id -- 4
HAVING
ARRAY_AGG(name) @> ARRAY_AGG(comp_name) -- 5
AND MAX(comp_num) BETWEEN MIN(num) AND MAX(num)
unnest() your array elements into one record per element
JOIN and unnest() your comparison data
split the element strings into their name and num parts
unnest() creates several records per original array; they can be grouped back together by an identifier (best is an id column)
Filter with your criteria in the HAVING clause: compare the name parts, for example with array operators; for the BETWEEN comparison you can use MIN and MAX on the num part.
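Outside the database, the comparison the question describes (the name must be present, and the stored number must be less than or equal to the probed one) can be sketched in Python; the helper names are made up for illustration:

```python
def name_num(s):
    """Split 'elementOne_10' into ('elementOne', 10)."""
    name, _, num = s.rpartition("_")
    return name, int(num)

def contains_with_lower_num(data, probes):
    """True if every probe name exists in data with a number <= the probe's."""
    stored = dict(name_num(x) for x in data)
    return all(name in stored and stored[name] <= num
               for name, num in map(name_num, probes))

print(contains_with_lower_num(["elementOne_7", "elementTwo_20"], ["elementOne_10"]))   # True
print(contains_with_lower_num(["elementOne_17", "elementTwo_20"], ["elementOne_10"]))  # False
```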
Note:
As @a_horse_with_no_name correctly mentioned: if possible, think about your database design and normalize it:
Don't store arrays -> You don't need to unnest them on every operation
Relevant data should be kept separated, not concatenated as a string -> You don't need to split them on every operation
id | name | num
---------------------
1 | elementOne | 7
1 | elementTwo | 20
2 | elementOne | 17
2 | elementTwo | 20
This is exactly the result of the inner subquery. You have to recreate it every time you need these data; it's better to store the data in that form directly.
Running PostgreSQL 12.4. I am trying to accomplish this, but the syntax given there doesn't seem to work in psql, and I could not find another approach.
I have the following data:
Table 1
ID Trait
1 X
1 Y
1 Z
2 A
2 B
Table 2
ID Traits, Listed
1
2
3
4
I would like to create the following result:
Table 2
ID Traits, Listed
1 X + Y + Z
2 A + B
Concatenating with + would be ideal, as some traits have inherent commas.
Thank you for any help!
Try something like:
update table2
SET traits = agg.t
FROM
(select id,string_agg(trait, ',') t from table1 group by id) agg
where
table2.id = agg.id;
dbfiddle
Concatenating with + would be ideal, as some traits have inherent commas.
You can use whatever delimiter you like (just change the second argument to string_agg).
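string_agg is PostgreSQL's aggregate; the same pattern, with the "+" delimiter the question asked for, can be sketched on SQLite (group_concat in place of string_agg, and a correlated subquery in place of UPDATE ... FROM) via Python's sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (id INTEGER, trait TEXT)")
con.execute("CREATE TABLE table2 (id INTEGER, traits TEXT)")
con.executemany("INSERT INTO table1 VALUES (?, ?)",
                [(1, "X"), (1, "Y"), (1, "Z"), (2, "A"), (2, "B")])
con.executemany("INSERT INTO table2 VALUES (?, NULL)", [(1,), (2,)])

# group_concat plays the role of string_agg; ' + ' is the delimiter
con.execute("""
    UPDATE table2
    SET traits = (SELECT group_concat(trait, ' + ')
                  FROM table1
                  WHERE table1.id = table2.id)
""")

rows = con.execute("SELECT id, traits FROM table2 ORDER BY id").fetchall()
print(rows)
```

Note that neither group_concat nor string_agg guarantees element order unless you add an ORDER BY inside the aggregate (PostgreSQL supports string_agg(trait, '+' ORDER BY trait)).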
I'm working on the URL extraction on AWS Redshift. The URL column looks like this:
url item origin
http://B123//ajdsb apple US
http://BYHG//B123 banana UK
http://B325//BF89//BY85 candy CA
The result I want to get is to get the series that starts with B and also expand rows if there are multiple series in a URL.
extracted item origin
B123 apple US
BYHG banana UK
B123 banana UK
B325 candy CA
BF89 candy CA
BY85 candy CA
My current code is:
select REGEXP_SUBSTR(url, '(B[0-9A-Z]{3})') as extracted, item, origin
from data
The regex part works well, but I have problems extracting multiple values and expanding them into new rows. I tried REGEXP_MATCHES(url, '(B[0-9A-Z]{3})', 'g'), but the function regexp_matches does not exist on Redshift...
The solution I use is fairly ugly but achieves the desired results. It involves using REGEXP_COUNT to determine the maximum number of matches in a row then joining the resulting table of numbers to a query using REGEXP_SUBSTR.
-- Get a table of the distinct match counts found across rows
-- e.g. if the rows have 1, 2 and 3 matches, this query will return 1, 2, 3
WITH n_table AS (
SELECT
DISTINCT REGEXP_COUNT(url, '(B[0-9A-Z]{3})') AS n
FROM data
)
-- Join the previous table to the data table and use n in the REGEXP_SUBSTR call to get the nth match
SELECT
REGEXP_SUBSTR(url, '(B[0-9A-Z]{3})', 1, n) AS extracted,
item,
origin
FROM data,
n_table
-- Only keep non-null matches
WHERE n > 0
AND REGEXP_COUNT(url, '(B[0-9A-Z]{3})') >= n
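Outside Redshift, the row expansion itself is just "find all matches, emit one row per match"; a Python sketch with re.findall shows the target output:

```python
import re

data = [("http://B123//ajdsb", "apple", "US"),
        ("http://BYHG//B123", "banana", "UK"),
        ("http://B325//BF89//BY85", "candy", "CA")]

# one output row per match of B followed by three alphanumerics
expanded = [(match, item, origin)
            for url, item, origin in data
            for match in re.findall(r"B[0-9A-Z]{3}", url)]

for row in expanded:
    print(row)
```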
IronFarm's answer inspired me, though I wanted to find a solution that didn't require a cross join. Here's what I came up with:
with
-- raw data
src as (
select
1 as id,
'abc def ghi' as stuff
union all
select
2 as id,
'qwe rty' as stuff
),
-- for each id, get a series of indexes for
-- each match in the string
match_idxs as (
select
id,
generate_series(1, regexp_count(stuff, '[a-z]{3}')) as idx
from
src
)
select
src.id,
match_idxs.idx,
regexp_substr(src.stuff, '[a-z]{3}', 1, match_idxs.idx) as stuff_match
from
src
join match_idxs using (id)
order by
id, idx
;
This yields:
id | idx | stuff_match
----+-----+-------------
1 | 1 | abc
1 | 2 | def
1 | 3 | ghi
2 | 1 | qwe
2 | 2 | rty
(5 rows)
I have a hierarchical table of regions and sub-regions, and I need to list a tree of regions and sub-regions (which is easy), but I also need a column that displays, for each region, all the ids of its sub-regions.
For example:
id name superiorId
-------------------------------
1 RJ NULL
2 Tijuca 1
3 Leblon 1
4 Gavea 2
5 Humaita 2
6 Barra 4
I need the result to be something like:
id name superiorId sub-regions
-----------------------------------------
1 RJ NULL 2,3,4,5,6
2 Tijuca 1 4,5,6
3 Leblon 1 null
4     Gavea       2          6
5 Humaita 2 null
6 Barra 4 null
I have done that by creating a function that builds the list with STUFF(), but when I select all the regions of a country, for example, the query becomes really, really slow, since the function runs once per region to collect that region's sub-region ids.
Does anybody know how to get this in an optimized way?
To clarify: the function returns all the sub-region ids as a single string, separated by commas.
The function is:
CREATE FUNCTION getSubRegions (@RegionId int)
RETURNS TABLE
AS
RETURN (
    select stuff((SELECT CAST(wine_reg.wine_reg_id as varchar) + ','
                  from (select wine_reg_id
                             , wine_reg_name
                             , wine_region_superior
                        from wine_region as t1
                        where wine_region_superior = @RegionId
                           or exists
                              ( select *
                                from wine_region as t2
                                where wine_reg_id = t1.wine_region_superior
                                  and wine_region_superior = @RegionId )
                       ) wine_reg
                  ORDER BY wine_reg.wine_reg_name ASC
                  for XML path('')), 1, 0, '') as Sons
)
GO
When we used to build these concatenated lists in the database, we took a similar approach to what you are doing at first.
Then, when we needed more speed, we turned them into CLR functions:
http://msdn.microsoft.com/en-US/library/a8s4s5dz(v=VS.90).aspx
Now our database is only responsible for storing and retrieving data; this sort of thing lives in the data layer of the application.
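For what it's worth, a set-based alternative to the per-row function is a single recursive CTE that enumerates every (ancestor, descendant) pair once and then aggregates. The sketch below runs on SQLite via Python's sqlite3 with made-up table and column names; SQL Server would use the same CTE shape, with STUFF ... FOR XML PATH (or STRING_AGG on 2017+) for the comma list:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE region (id INTEGER, name TEXT, superiorId INTEGER)")
con.executemany("INSERT INTO region VALUES (?, ?, ?)", [
    (1, "RJ", None), (2, "Tijuca", 1), (3, "Leblon", 1),
    (4, "Gavea", 2), (5, "Humaita", 2), (6, "Barra", 4)])

rows = con.execute("""
    WITH RECURSIVE sub (root, id) AS (
        -- direct children: one (parent, child) pair per edge
        SELECT superiorId, id FROM region WHERE superiorId IS NOT NULL
        UNION ALL
        -- grandchildren and deeper, carrying the original root along
        SELECT s.root, r.id
        FROM sub AS s
        JOIN region AS r ON r.superiorId = s.id
    )
    SELECT r.id, r.name, r.superiorId, group_concat(s.id) AS subregions
    FROM region AS r
    LEFT JOIN sub AS s ON s.root = r.id
    GROUP BY r.id, r.name, r.superiorId
    ORDER BY r.id
""").fetchall()

for row in rows:
    print(row)
```

The whole descendant relation is computed once per query instead of once per row, which is what makes this faster than calling a scalar function for every region.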