Extract last N elements of an array in SQL (Hive)

I have a column of arrays and I want to extract the last N elements of each array.
For example, extracting the last two elements:
Column A
['a', 'b', 'c']
['d', 'e']
['f', 'g', 'h', 'i']
Expected output:
Column A
['b', 'c']
['d', 'e']
['h', 'i']
Ideally this would be done without using a UDF.

One method uses reverse, posexplode, filtering by position, and re-assembling the array:
with your_table as (
select stack (4,
0, array(), --empty array to check it works if no elements or less than n
1, array('a', 'b', 'c'),
2, array('d', 'e'),
3, array('f', 'g', 'h', 'i')
) as (id, col_A)
)
select s.id, collect_list(s.value) as col_A
from
(select s.id, reverse(a.value) as value, a.pos --reverse each element back, since reverse() above also flipped its characters
from your_table s
lateral view outer posexplode(split(reverse(concat_ws(',',s.col_A)),',')) a as pos, value
where a.pos between 0 and 1 --last two (use n-1 instead of 1 if you want last n)
distribute by s.id sort by a.pos desc --keep original order
)s
group by s.id
Result:
s.id col_a
0 []
1 ["b","c"]
2 ["d","e"]
3 ["h","i"]
A more elegant way is to use the brickhouse numeric_range UDF, as shown in this answer.
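Another option, sketched below and not tested here, skips the string round trip and filters on the position returned by posexplode directly. It avoids the assumptions of the concat_ws/split approach (string elements that contain no commas). The sample CTE is copied from the answer above, and the literal 2 stands for n. As in that answer, the final order relies on the sort inside the subquery being preserved by collect_list.

```sql
with your_table as (
select stack (4,
0, array(), --empty array, to check the edge case
1, array('a', 'b', 'c'),
2, array('d', 'e'),
3, array('f', 'g', 'h', 'i')
) as (id, col_A)
)
select s.id, collect_list(s.value) as col_A
from
(select t.id, a.value, a.pos
from your_table t
lateral view outer posexplode(t.col_A) a as pos, value
--keep positions size-2 .. size-1 (replace 2 with n for the last n);
--the "is null" branch keeps the row that OUTER emits for an empty array
where a.pos is null or a.pos >= size(t.col_A) - 2
distribute by t.id sort by a.pos --ascending position keeps the original order
)s
group by s.id;
```

Because collect_list skips NULLs, an empty input array should still come back as an empty array rather than disappearing from the result.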

Related

How to get all overlapping (ordered) 3-tuples from an array in BigQuery

Given a table like the following
elems
['a', 'b', 'c', 'd', 'e']
['v', 'w', 'x', 'y']
I'd like to transform it into something like this:
tuple
['a', 'b', 'c']
['b', 'c', 'd']
['c', 'd', 'e']
['v', 'w', 'x']
['w', 'x', 'y']
I.e., I'd like to get all overlapping 3-tuples.
My current attempt looks as follows:
WITH foo AS (
SELECT ['a', 'b', 'c', 'd', 'e'] AS elems UNION ALL
SELECT ['v', 'w', 'x', 'y']),
single AS (
SELECT * FROM
foo,
UNNEST(elems) elem
),
tuples AS (
SELECT ARRAY_AGG(elem) OVER (ROWS BETWEEN 2 PRECEDING AND 0 FOLLOWING) AS tuple
FROM single
)
SELECT * FROM tuples
WHERE ARRAY_LENGTH(tuple) >= 3
But the problem is, it returns some unwanted rows too, i.e., the ones that are "between" the original rows from the foo table.
tuple
['a', 'b', 'c']
['b', 'c', 'd']
['c', 'd', 'e']
['d', 'e', 'v'] <--- unwanted
['e', 'v', 'w'] <--- unwanted
['v', 'w', 'x']
['w', 'x', 'y']
Also, is it guaranteed, that the order of rows in single is correct, or does it only work in my minimal example by chance, because of the low cardinality? (I guess there may be a simple solution without this step in between.)
Consider the approach below:
select [elems[offset(index - 1)], elems[offset(index)], elems[offset(index + 1)]] as tuple
from your_table, unnest([array_length(elems)]) len,
unnest(generate_array(1, len - 2)) index
Applied to the sample data in your question, this returns the five tuples listed in your expected output.
You might consider below query.
Also, is it guaranteed, that the order of rows in single is correct, or does it only work in my minimal example by chance, because of the low cardinality?
AFAIK, it's not guaranteed unless you explicitly use WITH OFFSET in the query.
WITH foo AS (
SELECT ['a', 'b', 'c', 'd', 'e'] AS elems UNION ALL
SELECT ['v', 'w', 'x', 'y']),
single AS (
SELECT * FROM
foo,
UNNEST(elems) elem WITH OFFSET
),
tuples AS (
SELECT ARRAY_AGG(elem) OVER (PARTITION BY FORMAT('%t', elems) ORDER BY offset ROWS BETWEEN 2 PRECEDING AND 0 FOLLOWING) AS tuple
FROM single
)
SELECT * FROM tuples
WHERE ARRAY_LENGTH(tuple) >= 3;
Just to give you another idea
create temp function slice(arr ARRAY<string>, pos float64, len float64)
returns array<string> language js as
r"return arr.slice(pos, pos + len);";
select slice(elems, index, 3) as tuple
from foo, unnest([array_length(elems)]) len,
unnest(generate_array(0, len - 3)) index
I leave it up to you to refactor the above query until it looks something like
select tuple
from foo, unnest(slices(elems, 3)) as tuple
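One possible shape for that refactoring, as an untested sketch: `slices` is the helper name used in the answer, but its body here is an assumption and is written as a SQL UDF rather than JavaScript. BigQuery does not allow an array of arrays, so each slice is wrapped in a STRUCT with a single field `tuple`, which is why the final query selects `s.tuple` instead of unnesting straight into `tuple`.

```sql
-- Hypothetical helper: all overlapping windows of length n, in order.
CREATE TEMP FUNCTION slices(arr ARRAY<STRING>, n INT64) AS (
  ARRAY(
    SELECT AS STRUCT
      ARRAY(
        SELECT x
        FROM UNNEST(arr) AS x WITH OFFSET AS pos
        WHERE pos BETWEEN first_pos AND first_pos + n - 1
        ORDER BY pos
      ) AS tuple
    -- GENERATE_ARRAY yields no start positions when the array is shorter than n
    FROM UNNEST(GENERATE_ARRAY(0, ARRAY_LENGTH(arr) - n)) AS first_pos
    ORDER BY first_pos
  )
);

WITH foo AS (
  SELECT ['a', 'b', 'c', 'd', 'e'] AS elems UNION ALL
  SELECT ['v', 'w', 'x', 'y']
)
SELECT s.tuple
FROM foo, UNNEST(slices(elems, 3)) AS s;
```

Since every window is built from a single input array, no tuple can span two rows of foo.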

How to compute cosine similarity between two texts in presto?

Hello everyone: I wanted to use COSINE_SIMILARITY in Presto SQL to compute the similarity between two texts. Unfortunately, COSINE_SIMILARITY does not take the texts as the inputs; it takes maps instead. I am not sure how to convert the texts into those maps in presto. I want the following, if we have a table like this:
id | text1 | text2
1 | a b b | b c
Then we can compute the cosine similarity as:
COSINE_SIMILARITY(
MAP(ARRAY['a', 'b', 'c'], ARRAY[1, 2, 0]),
MAP(ARRAY['a', 'b', 'c'], ARRAY[0, 1, 1])
)
i.e., the two texts combined contain three distinct words: 'a', 'b', and 'c'. text1 has 1 occurrence of 'a', 2 of 'b', and 0 of 'c', which gives the first MAP; similarly, text2 has 0 occurrences of 'a', 1 of 'b', and 1 of 'c', which gives the second MAP.
The final table should look like this:
id | text1 | text2 | all_unique_words | map1 | map2 | similarity
1 | a b b | b c | [a b c] | [1, 2, 0] | [0, 1, 1] | 0.63
How can we convert two texts into two such maps in presto? Thanks in advance!
Use split to turn each string into an array and then, depending on your Presto version, use either the unnest + histogram trick or array_frequency:
-- sample data
with dataset(id, text1, text2) as (values (1, 'a b b', 'b c'))
-- query
select id, COSINE_SIMILARITY(histogram(t1), histogram(t2))
from dataset,
unnest (split(text1, ' '), split(text2, ' ')) as t(t1, t2)
group by id;
Output:
id | _col1
1 | 0.6324555320336759
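The array_frequency variant mentioned above is not shown in the answer. A minimal sketch, assuming your Presto version provides array_frequency: the cast to double via transform_values is a precaution added here, in case the integer counts are not coerced implicitly to the map(varchar, double) that cosine_similarity expects.

```sql
-- sample data, same as above
with dataset(id, text1, text2) as (values (1, 'a b b', 'b c'))
-- query: one word-count map per text, no unnest or group by needed
select id,
       cosine_similarity(
         transform_values(array_frequency(split(text1, ' ')), (k, v) -> cast(v as double)),
         transform_values(array_frequency(split(text2, ' ')), (k, v) -> cast(v as double))
       ) as similarity
from dataset;
```

The maps are sparse: a word missing from one text is simply absent from that map, which cosine_similarity treats as a zero count.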

Get first N elements from an array in BigQuery table

I have an array column and I would like to get the first N elements of it (keeping the array data type). Is there a nice way to do this? Ideally without unnesting, ranking, and using array_agg to rebuild the array.
I could also do this (for getting first 2 elements):
WITH data AS
(
SELECT 1001 as id, ['a', 'b', 'c'] as array_1
UNION ALL
SELECT 1002 as id, ['d', 'e', 'f', 'g'] as array_1
UNION ALL
SELECT 1003 as id, ['h', 'i'] as array_1
)
select *,
[array_1[SAFE_OFFSET(0)], array_1[SAFE_OFFSET(1)]] as my_result
from data
But obviously this is not a nice solution, as it would fail whenever an array has only 1 element.
Here's a general solution with a UDF that you can call for any array type:
CREATE TEMP FUNCTION TopN(arr ANY TYPE, n INT64) AS (
ARRAY(SELECT x FROM UNNEST(arr) AS x WITH OFFSET off WHERE off < n ORDER BY off)
);
WITH data AS
(
SELECT 1001 as id, ['a', 'b', 'c'] as array_1
UNION ALL
SELECT 1002 as id, ['d', 'e', 'f', 'g'] as array_1
UNION ALL
SELECT 1003 as id, ['h', 'i'] as array_1
)
select *, TopN(array_1, 2) AS my_result
from data
It uses unnest and the array function, which it sounds like you didn't want to use, but it has the advantage of being general enough that you can pass any array to it.
Another option for BigQuery Standard SQL (with JS UDF)
#standardSQL
CREATE TEMP FUNCTION FirstN(arr ARRAY<STRING>, N FLOAT64)
RETURNS ARRAY<STRING> LANGUAGE js AS """
return arr.slice(0, N);
""";
SELECT *,
FirstN(array_1, 3) AS my_result
FROM data

Conditionally include in fields for presto query

I have a presto query which works as expected:
SELECT json_format(cast(MAP(ARRAY['random_name'],
ARRAY[
MAP(
ARRAY['a', 'b', 'c'],
ARRAY[a, b, c]
)]
) as JSON)) as metadata from a_table_with_a_b_c; -- a, b, c are all ints
Now I only want to include a,b,c when they are larger than 0, how do I change the query? I can add 'CASE WHEN' but it seems I will have 'a:null' instead of not having it.
You can try this:
with rows as (
select row_number() over() as row_id, ARRAY['a', 'b', 'c'] as keys, ARRAY[a, b, c] as vals
from a_table_with_a_b_c
)
select
json_format(cast(MAP(ARRAY['random_name'],
ARRAY[
MAP(
array_agg(key), array_agg(value)
)]
) as JSON)) as metadata
from rows
cross join unnest (keys, vals) as t (key, value)
where value is not null and value > 0
group by row_id;
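A shorter alternative, offered as an untested sketch, filters the inner map with map_filter instead of unnesting and re-aggregating. The VALUES rows are made-up sample data standing in for the real table. Note one behavioural difference: a row where none of a, b, c is positive is kept with an empty inner map, whereas the unnest version above drops it.

```sql
-- hypothetical sample rows in place of the real table
WITH a_table_with_a_b_c(a, b, c) AS (
  VALUES (1, 0, 3), (0, 0, 5)
)
SELECT json_format(CAST(
         MAP(ARRAY['random_name'],
             ARRAY[
               -- keep only entries whose value is > 0; NULL values are dropped too
               map_filter(
                 MAP(ARRAY['a', 'b', 'c'], ARRAY[a, b, c]),
                 (k, v) -> coalesce(v > 0, false)
               )
             ]
         ) AS JSON)) AS metadata
FROM a_table_with_a_b_c;
```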

Is there any way to order the result set by what you want in SQL Server?

Is there any way to have SQL Server return rows in a specific, user-defined sequence?
For example, I want to get some rows by their identifier,
ordered as C, D, A, F:
SELECT *
FROM BRANCH
WHERE IDENTIFIER IN ('C', 'D', 'A', 'F')
But this query returns the rows in an arbitrary order,
for example:
'F', 'D', 'A', 'C'
'A', 'B', 'C', 'D'
How can I get the result set ordered as 'C', 'D', 'A', 'F'? I need this for use with FOR XML PATH.
SELECT b.*
FROM dbo.BRANCH b
JOIN (
VALUES
(1, 'C'),
(2, 'D'),
(3, 'A'),
(4, 'F')
) c(ID, IDENTIFIER) ON c.IDENTIFIER = b.IDENTIFIER
ORDER BY c.ID
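The join above attaches an explicit sort key to each identifier. For a short, fixed list, the same ordering can also be sketched with a CASE expression in ORDER BY. This is an alternative to the answer's query, not part of it; the table and column names are taken from the question.

```sql
SELECT b.*
FROM dbo.BRANCH b
WHERE b.IDENTIFIER IN ('C', 'D', 'A', 'F')
ORDER BY CASE b.IDENTIFIER   -- map each identifier to its desired position
           WHEN 'C' THEN 1
           WHEN 'D' THEN 2
           WHEN 'A' THEN 3
           WHEN 'F' THEN 4
         END;
```

The VALUES join is easier to maintain as the list grows, because each identifier and its position are written once rather than in both the WHERE and ORDER BY clauses.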