Filtering arrays in Hive

I have a Hive table with a column called paid_value that holds an array of numbers for each record.
Now I want to filter each record's array so that only values between 1000 and 10000 remain, and I don't know how to do it.
I know the array_contains(Array<T>, value) function, but it doesn't solve my problem: it accepts only a single value as the search criterion, whereas I need a range check like 'between 1000 and 10000'.

You can use LATERAL VIEW EXPLODE to explode the array and then apply the filter afterwards, as sketched below. If your arrays are large, though, this process will be slow.
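For example, a minimal sketch of that approach (the table name my_table and key column id are illustrative, not from your schema; note that collect_set also de-duplicates, so it assumes the surviving values are unique per record):
-- explode, filter the exploded values, then re-assemble per record
select id, collect_set(v) as filtered_paid_value
from my_table
lateral view explode(paid_value) e as v
where v between 1000 and 10000
group by id;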
Any other option really needs a UDF to do the filtering. The workaround I can think of uses a Brickhouse UDF:
-- this will give you an array of the numbers between start (st) and end (ed)
select collect_set(pe.i + st) as range_array
from (select 1000 as st, 1100 as ed) t
lateral view posexplode(split(space(ed - st), ' ')) pe as i, x;
Then use the Brickhouse UDF bhouse_intersect_array (with the result of the query above made available as range_array, e.g. via a subquery or temporary table):
select count(1)
from range_array cross join <source_tablename>
where size(bhouse_intersect_array(source_array, range_array)) > 0;

Related

Aggregate single array of distinct elements from array column, excluding NULL

I'm trying to roll up the distinct non-null values of timestamps stored in a PostgreSQL 9.6 database column.
So given a table containing the following:
date_array
------------------------
{2019-10-21 00:00:00.0}
{2019-08-06 00:00:00.0,2019-08-05 00:00:00.0}
{2019-08-05 00:00:00.0}
(null)
{2019-08-01 00:00:00.0,2019-08-06 00:00:00.0,null}
The desired result would be:
{2019-10-21 00:00:00.0, 2019-08-06 00:00:00.0, 2019-08-05 00:00:00.0, 2019-08-01 00:00:00.0}
The arrays can be different sizes, so most solutions I've tried end up running into:
SQL State: 2202E
ERROR: cannot accumulate arrays of different dimensionality
Some other caveats:
The arrays can be null, and the arrays can contain a null. The values happen to be timestamps of just dates (i.e. without time or time zone). But in trying to simplify the problem, I've had no luck either when changing the sample data to strings (e.g. {foo,bar,(null)}, {foo,baz}) - that was just to focus on the problem and rule out any issues I miss/don't understand about timestamps without time zone.
The following SQL is the closest I've come (it resolves everything but the different-dimensionality issue):
SELECT
  ARRAY_REMOVE(
    ARRAY(
      SELECT DISTINCT UNNEST(
        ARRAY_AGG(
          CASE WHEN ARRAY_NDIMS(example.date_array) > 0
                    AND example.date_array IS NOT NULL
               THEN example.date_array
               ELSE '{null}'
          END))),
    NULL) AS actualDates
FROM example;
I created the following DB fiddle with sample data that illustrates the problem if the above is lacking: https://www.db-fiddle.com/f/8m469XTDmnt4iRkc5Si1eS/0
Additionally, I've perused Stack Overflow on the issue (as well as the PostgreSQL documentation); there are similar questions with answers, but none that articulate the same problem I'm having.
Use unnest() in the FROM clause (in a lateral join):
select array_agg(distinct elem order by elem desc) as result
from example
cross join unnest(date_array) as elem
where elem is not null
Test it in DB Fiddle.
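Against the sample data above, this should return a single row like the following (a hedged illustration; the exact display format depends on your client):
                                       result
-------------------------------------------------------------------------------------
 {"2019-10-21 00:00:00","2019-08-06 00:00:00","2019-08-05 00:00:00","2019-08-01 00:00:00"}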
A general note. An alternative solution using an array constructor is more efficient, especially in cases as simple as described. Personally, I prefer to use aggregate functions because this query structure is more general and flexible, easy to extend to handle more complex problems (e.g. having to aggregate more than one column, grouping by another column, etc). In these non-trivial cases, the performance differences tend to decrease, but the code using aggregates remains cleaner and more readable. It's an extremely important factor when you have to maintain really large and complex projects.
See also In Postgres select, return a column subquery as an array?
Plain array_agg() does this with arrays:
Concatenates all the input arrays into an array of one higher
dimension. (The inputs must all have the same dimensionality, and
cannot be empty or null.)
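For instance, a plain aggregation over the sample column fails right away (a minimal illustration; depending on row order you may instead hit the null-array variant of the error first):
SELECT array_agg(date_array) FROM example;
-- ERROR: cannot accumulate arrays of different dimensionality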
Not what you need. See:
Is there something like a zip() function in PostgreSQL that combines two arrays?
You need something like this: unnest(), process and sort the elements, and feed the resulting set to an ARRAY constructor:
SELECT ARRAY(
   SELECT DISTINCT elem::date
   FROM  (SELECT unnest(date_array) FROM example) AS e(elem)
   WHERE  elem IS NOT NULL
   ORDER  BY elem::date DESC
);
db<>fiddle here
To be clear: we could use array_agg() (taking non-array input, different from your incorrect use) instead of the final ARRAY constructor. But the latter is faster (and simpler, too, IMO).
They happen to be timestamps of just dates (eg without time or timezone)
So cast to date and trim the noise.
Should be the fastest way:
A correlated subquery is a bit faster than a LATERAL one (and does the simple job).
An ARRAY constructor is a bit faster than the aggregate function array_agg() (and does the simple job).
Most importantly, sorting and applying DISTINCT in a subquery is typically faster than inline ORDER BY and DISTINCT in an aggregate function (and does the simple job).
See:
Unnest arrays of different dimensions
How to select 1d array from 2d array?
Why is array_agg() slower than the non-aggregate ARRAY() constructor?
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Performance comparison:
db<>fiddle here

Function to sum all first values of Result array in SQL

I have a table with "Number", "Name" and "Result" columns. Result is a 2D text array, and I need to create a column named "Average" that sums all the first values of the Result array and divides by 2. Can somebody help me please? I must use CREATE FUNCTION for this. It looks like this:
Table1
Number | Name  | Result                       | Average
-------|-------|------------------------------|--------
01     | Kevin | {{2.0,10},{3.0,50}}          | 2.5
02     | Max   | {{1.0,10},{4.0,30},{5.0,20}} | 5.0
Average = (2.0 + 3.0) / 2 = 2.5
        = (1.0 + 4.0 + 5.0) / 2 = 5.0
First of all: you should avoid storing arrays in a table (or generate them in a subquery only if really necessary). Normalize the data; it makes life much easier in nearly every use case.
Second: you should avoid multi-dimensional arrays. They are very hard to handle. See Unnest array by one level
However, in your special case you could do something like this:
demo:db<>fiddle
SELECT
    number,
    name,
    SUM(value) FILTER (WHERE idx % 2 = 1) / 2 AS average      -- 2
FROM mytable,
    unnest(avg_result) WITH ORDINALITY as elements(value, idx) -- 1
GROUP BY number, name
unnest() expands the array elements into one element per record. But this is not a one-level expansion: it expands ALL elements in depth. To keep track of your elements, you can add an index using WITH ORDINALITY.
Because you have nested two-element arrays, the unnested data can be used as follows: you want to sum the first of every two elements, which is every odd-indexed element. Using the FILTER clause in the aggregation lets you aggregate exactly these elements.
However: if the array was itself the result of a subquery, you should think about doing this operation BEFORE the array aggregation (if that aggregation is really necessary at all). That makes things easier.
Assumptions:
the number column is the primary key
the result column is of type text or varchar
Here are the steps for your requirement:
Add the column to your table using the following query (you can skip this step if the column already exists):
alter table table1 add column average decimal;
Update the calculated values using the query below:
update table1 t1
set average = t2.value_
from (
    select
        number,
        sum(t::decimal) / 2 as value_
    from table1
    cross join lateral unnest((result::text[][])[1:999][1]) as t
    group by 1
) t2
where t1.number = t2.number;
Explanation: here unnest((result::text[][])[1:999][1]) returns the first value of each inner array (assuming you can have up to 999 inner arrays in your 2D array; you can increase or decrease that bound as per your requirement).
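To see what the array slice does in isolation (a small illustration using Kevin's value from the question; a bare subscript after a slice is treated as 1:N, so [1:999][1] keeps only the first column):
select ('{{2.0,10},{3.0,50}}'::text[][])[1:999][1];
-- {{2.0},{3.0}}   -- unnest() then yields 2.0 and 3.0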
DEMO
Now you can create your function as per your requirement by wrapping the query above.
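For instance, a minimal sketch of such a function (the name first_values_avg is illustrative; it relies on the same text-typed result column assumed above):
create or replace function first_values_avg(result text)
returns decimal as $$
    -- sum the first element of each inner array, then divide by 2
    select sum(t::decimal) / 2
    from unnest((result::text[][])[1:999][1]) as t;
$$ language sql;
-- usage (hypothetical):
-- update table1 set average = first_values_avg(result);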

Unnesting structs in BigQuery

What is the correct way to flatten a struct of two arrays in BigQuery? I have a dataset like the one pictured here (the struct.destination and struct.visitors arrays are ordered - i.e. the visitor counts correspond specifically to the destinations in the same row):
I want to reorganize the data so that I have a total visitor count for each unique combination of origins and destinations. Ideally, the end result will look like this:
I tried using UNNEST twice in a row - once on struct.destination and then on struct.visitors, but this produces the wrong result (each destination gets mapped to every value in the array of visitor counts when it should only get mapped to the value in the same row):
SELECT
origin,
unnested_destination,
unnested_visitors
FROM
dataset.table,
UNNEST(struct.destination) AS unnested_destination,
UNNEST(struct.visitors) AS unnested_visitors
You have one struct that is repeated. So, I think you want:
SELECT origin,
s.destination,
s.visitors
FROM dataset.table t CROSS JOIN
UNNEST(t.struct) s;
EDIT:
I see, you have a struct of two arrays. You can do:
SELECT origin, d AS destination, v AS visitors
FROM dataset.table t CROSS JOIN
     UNNEST(t.struct.destination) d WITH OFFSET nd LEFT JOIN
     UNNEST(t.struct.visitors) v WITH OFFSET nv
     ON nd = nv
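A self-contained way to sanity-check the offset join (hypothetical literal data standing in for the real table):
WITH t AS (
  SELECT 'A' AS origin,
         STRUCT(['X', 'Y'] AS destination, [10, 20] AS visitors) AS s
)
SELECT origin, d AS destination, v AS visitors
FROM t CROSS JOIN
     UNNEST(t.s.destination) d WITH OFFSET nd LEFT JOIN
     UNNEST(t.s.visitors) v WITH OFFSET nv
     ON nd = nv
-- origin | destination | visitors
-- A      | X           | 10
-- A      | Y           | 20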
It's difficult to test without having the underlying data, so I created my own query against your dataset. As far as I can tell, destination|visitors is not in an ARRAY format but rather a STRUCT format, so you do not need to UNNEST it. Please also view this thread :)
SELECT
origin,
COUNT(struct.destination),
COUNT(struct.visitors)
FROM dataset.table
GROUP BY 1

Average interval between timestamps in an array

In a PostgreSQL 9.x database, I have a column which is an array of type timestamp. Each array has between 1..n timestamps.
I'm trying to extract the average interval between all elements in each array.
I understand using a window function on the source table might be the ideal way to tackle this but in this case I am trying to do it as an operation on the array.
I've looked at several other questions that are trying to calculate the moving average of another column etc or the avg (median date of a list of timestamps).
For example the average interval I'm looking for on an array with 3 elements like this:
'{"2012-10-09 17:04:05.710887"
,"2013-10-18 22:30:08.973749"
,"2014-10-22 22:18:18.885973"}'::timestamp[]
Would be:
-368d
Wondering if I need to unpack the array through a function?
One way of many possible: unnest, join, avg in a lateral subquery:
SELECT *
FROM tbl t
LEFT JOIN LATERAL (
   SELECT avg(a2.ts - a1.ts) AS avg_intv
   FROM unnest(t.arr) WITH ORDINALITY a1(ts, ord)
   JOIN unnest(t.arr) WITH ORDINALITY a2(ts, ord) ON a2.ord = a1.ord + 1
   ) avg ON true;
db<>fiddle here
The [INNER] JOIN in the subquery produces exactly the set of combinations relevant for intervals between elements.
I get 371 days 14:37:06.587543, not '-368d', btw.
Related, with more explanation:
PostgreSQL unnest() with element number
You can also unnest only once and use the window functions lead() or lag() (sketched below), but you were trying to avoid window functions. And you need to make sure of the original order of elements in any case ...
(There is no array function you could use directly to get what you need - in case you were hoping for that.)
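For completeness, a sketch of that lead()/lag() variant (same assumed table tbl with array column arr; the window function has to be computed in a subquery before aggregating, since aggregates cannot contain window function calls directly):
SELECT *
FROM tbl t
LEFT JOIN LATERAL (
   SELECT avg(step) AS avg_intv
   FROM  (SELECT ts - lag(ts) OVER (ORDER BY ord) AS step
          FROM unnest(t.arr) WITH ORDINALITY a(ts, ord)) s
   ) x ON true;
-- avg() ignores the NULL that lag() produces for the first element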
Alternative with CTE
Might be appealing to still unnest only once (even while avoiding window functions):
SELECT *
FROM tbl t
LEFT JOIN LATERAL (
   WITH a AS (SELECT * FROM unnest(t.arr) WITH ORDINALITY a1(ts, ord))
   SELECT avg(a2.ts - a1.ts) AS avg_intv
   FROM a a1
   JOIN a a2 ON a2.ord = a1.ord + 1
   ) avg ON true;
But I expect the added CTE overhead to cost more than unnesting twice. Mostly just demonstrating a WITH clause in a subquery.

Hive UDF to generate all possible ordered combinations from the list

I am trying to figure out how to write a Hive UDF that takes a list as input and outputs a list of all 2-way ordered combinations of the list's elements.
Input:
list_variable_b
[5142430,5146974,5141766]
Output:
list_variable_b
[(5142430,5146974),(5146974,5141766),(5142430,5141766)]
So you're asking how to write a UDF that can take an array<bigint> and
turn it into an array<struct<int,int>> or an array<array<int>>.
It sounds like you want what's called "n choose k", which produces n!/((n-k)!k!) elements.
Now, Hive has two kinds of UDFs: simple UDFs, which can only process primitive (non-collection) types, and Generic UDFs. Since you are processing an array here, you'll need a Generic UDF. Generic UDFs can do much more than simple UDFs, but they are also more difficult to write. A good guide on how to do it is here: http://www.baynote.com/2012/11/a-word-from-the-engineers/
Another way would be to use a double LATERAL VIEW with the caveat that all the elements in the array have to be unique for this to work.
If the table is
create table xx ( col array<int>);
such that
select * from xx;
OK
[5142430,5146974,5141766]
Using a double lateral view to take the Cartesian product of the array with itself, we then keep only the pairs where one element is smaller than the other:
select a1,b1 from xx
lateral view explode(col) a as a1
lateral view explode(col) b as b1 where a1 < b1;
5142430 5146974
5141766 5142430
5141766 5146974
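To fold those pairs back into a single list per the desired output, you can aggregate them again (a sketch; collect_list over complex types requires a reasonably recent Hive version, and uniqueness of the elements is still assumed):
select collect_list(array(a1, b1)) as list_variable_b
from xx
lateral view explode(col) a as a1
lateral view explode(col) b as b1
where a1 < b1;
This yields an array<array<int>>, one of the two shapes mentioned at the top of this answer.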