SQL / HiveQL to assign values to buckets based on a table


I have a table "bucket" containing minimum int values for buckets, like this
min_value bucket_id
--------- ---------
0 1
12345 2
67890 3
i.e. any value >= 0 and < 12345 belongs in bucket 1, ..., any value >= 67890 belongs in bucket 3.
and a table of int values "value" like this:
id value
-- -----
11 10
22 20000
33 80000
I would like to figure out which bucket each value belongs to. So
select id, bucket_id
from (some join, or whatever, of bucket and value)
gives me
id bucket_id
-- ---------
11 1
22 2
33 3
I am trying to implement this in HiveQL. Any ideas?

I assumed that the bucket with the largest min_value has no upper bound, i.e. its only condition is min_value <= value (since there is no bucket with a larger min_value). I also assumed that column value of table value and column min_value of table bucket are integers. That matters because the query relies on comparisons, which behave differently for strings, so string columns would need to be cast first.
The following query works when the values in table value are non-negative. If negative values are involved, replace
max(if(a.value >= b.min_value, b.min_value, 0))
with
max(if(a.value >= b.min_value, b.min_value, <minimum possible value that "value" field may have>)):
select
c.id,
if(d.bucket_id is null, 'not in bucket', d.bucket_id)
from
(
select
a.id,
max(if(a.value >= b.min_value, b.min_value, 0)) as bucket_min_value
from
value a
left join
bucket b
group by a.id
)
c
left join
bucket d
on c.bucket_min_value = d.min_value
;

You can use window functions to define ranges for the bucket ids and then join the bucket table. Check this out.
> select * from bucket;
+-------------------+-------------------+--+
| bucket.min_value | bucket.bucket_id |
+-------------------+-------------------+--+
| 0 | 1 |
| 12345 | 2 |
| 67890 | 3 |
+-------------------+-------------------+--+
> select * from buckvalue;
+---------------+------------------+--+
| buckvalue.id | buckvalue.value |
+---------------+------------------+--+
| 11 | 10 |
| 22 | 20000 |
| 33 | 80000 |
+---------------+------------------+--+
> select bucket_id, min_value, lead(min_value) over(order by bucket_id) as max1 from bucket;
INFO : OK
+------------+------------+--------+--+
| bucket_id | min_value | max1 |
+------------+------------+--------+--+
| 1 | 0 | 12345 |
| 2 | 12345 | 67890 |
| 3 | 67890 | NULL |
+------------+------------+--------+--+
> select t1.id, t1.value, t2.bucket_id from buckvalue t1 left outer join ( select bucket_id, min_value, lead(min_value) over(order by bucket_id) as max1 from bucket ) t2
where t1.value >= t2.min_value and t1.value < coalesce(t2.max1,99999);
+--------+-----------+---------------+--+
| t1.id | t1.value | t2.bucket_id |
+--------+-----------+---------------+--+
| 11 | 10 | 1 |
| 22 | 20000 | 2 |
| 33 | 80000 | 3 |
+--------+-----------+---------------+--+
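The range join above can be tried out locally. The following is a minimal sketch that runs the same idea in SQLite through Python's sqlite3 module rather than in Hive (an assumption made only so the example is self-contained; it needs SQLite 3.25 or later for window functions). The alias next_min and the extra row with id 44 are illustrative additions, and the open-ended last bucket is handled with an IS NULL check instead of a hard-coded upper limit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bucket (min_value INTEGER, bucket_id INTEGER);
    INSERT INTO bucket VALUES (0, 1), (12345, 2), (67890, 3);
    CREATE TABLE buckvalue (id INTEGER, value INTEGER);
    -- id 44 is an extra boundary case: exactly the min_value of bucket 2
    INSERT INTO buckvalue VALUES (11, 10), (22, 20000), (33, 80000), (44, 12345);
""")

rows = conn.execute("""
    SELECT v.id, r.bucket_id
    FROM buckvalue v
    JOIN (
        -- next_min is the lower bound of the following bucket, NULL for the last one
        SELECT bucket_id, min_value,
               LEAD(min_value) OVER (ORDER BY min_value) AS next_min
        FROM bucket
    ) r
      ON v.value >= r.min_value
     AND (r.next_min IS NULL OR v.value < r.next_min)
    ORDER BY v.id
""").fetchall()

print(rows)
```

Ordering the window by min_value rather than bucket_id means the ranges stay correct even if bucket ids are not assigned in ascending order.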

I found a really simple query to do this. It works by finding all the buckets whose minimum value is less than or equal to the value, and taking the maximum bucket_id. This relies on bucket_id increasing together with min_value.
create temporary table bucket as select * from (select 0 min_value, 1 bucket_id union select 12345, 2 union select 67890, 3) a;
create temporary table value as select * from (select 11 id, 10 value union select 22, 20000 union select 33, 80000) a;
select value.id, max(bucket.bucket_id) bucket_id
from value
join bucket
where value.value >= bucket.min_value
group by value.id;
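A quick way to check the boundary behaviour of this max-bucket join is to feed it values that sit exactly on a min_value. The sketch below does that in SQLite via Python's sqlite3 module (not Hive, which is an assumption for the sake of a runnable example). The table name vals and the rows with ids 44 and 55 are hypothetical additions. The comparison is written as >= so that a value equal to a min_value lands in the bucket that starts there, as the question defines it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bucket (min_value INTEGER, bucket_id INTEGER);
    INSERT INTO bucket VALUES (0, 1), (12345, 2), (67890, 3);
    CREATE TABLE vals (id INTEGER, value INTEGER);
    -- ids 44 and 55 sit exactly on bucket boundaries
    INSERT INTO vals VALUES (11, 10), (22, 20000), (33, 80000),
                            (44, 12345), (55, 67890);
""")

bucket_rows = conn.execute("""
    SELECT v.id, MAX(b.bucket_id) AS bucket_id
    FROM vals v
    JOIN bucket b ON v.value >= b.min_value  -- every bucket that starts at or below the value
    GROUP BY v.id
    ORDER BY v.id
""").fetchall()

print(bucket_rows)
```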

Related

Possible to use a column name in a UDF in SQL?

I have a query in which a series of steps is repeated constantly over different columns, for example:
SELECT DISTINCT
MAX (
CASE
WHEN table_2."GRP1_MINIMUM_DATE" <= cohort."ANCHOR_DATE" THEN 1
ELSE 0
END)
OVER (PARTITION BY cohort."USER_ID")
AS "GRP1_MINIMUM_DATE",
MAX (
CASE
WHEN table_2."GRP2_MINIMUM_DATE" <= cohort."ANCHOR_DATE" THEN 1
ELSE 0
END)
OVER (PARTITION BY cohort."USER_ID")
AS "GRP2_MINIMUM_DATE"
FROM INPUT_COHORT cohort
LEFT JOIN INVOLVE_EVER table_2 ON cohort."USER_ID" = table_2."USER_ID"
I was considering writing a function to accomplish this as doing so would save on space in my query. I have been reading a bit about UDF in SQL but don't yet understand if it is possible to pass a column name in as a parameter (i.e. simply switch out "GRP1_MINIMUM_DATE" for "GRP2_MINIMUM_DATE" etc.). What I would like is a query which looks like this
SELECT DISTINCT
FUNCTION(table_2."GRP1_MINIMUM_DATE") AS "GRP1_MINIMUM_DATE",
FUNCTION(table_2."GRP2_MINIMUM_DATE") AS "GRP2_MINIMUM_DATE",
FUNCTION(table_2."GRP3_MINIMUM_DATE") AS "GRP3_MINIMUM_DATE",
FUNCTION(table_2."GRP4_MINIMUM_DATE") AS "GRP4_MINIMUM_DATE"
FROM INPUT_COHORT cohort
LEFT JOIN INVOLVE_EVER table_2 ON cohort."USER_ID" = table_2."USER_ID"
Can anyone tell me if this is possible/point me to some resource that might help me out here?
Thanks!

There is no direct way to do this, as @Tejash already stated, but it looks like your database model is not ideal. It would be better to have a table with USER_ID and GRP_ID as keys and MINIMUM_DATE as a separate field.
Without changing the table structure, you can use an UNPIVOT query to mimic this design:
WITH INVOLVE_EVER(USER_ID, GRP1_MINIMUM_DATE, GRP2_MINIMUM_DATE, GRP3_MINIMUM_DATE, GRP4_MINIMUM_DATE)
AS (SELECT 1, SYSDATE, SYSDATE, SYSDATE, SYSDATE FROM dual UNION ALL
SELECT 2, SYSDATE-1, SYSDATE-2, SYSDATE-3, SYSDATE-4 FROM dual)
SELECT *
FROM INVOLVE_EVER
unpivot ( minimum_date FOR grp_id IN ( GRP1_MINIMUM_DATE AS 1, GRP2_MINIMUM_DATE AS 2, GRP3_MINIMUM_DATE AS 3, GRP4_MINIMUM_DATE AS 4))
Result:
| USER_ID | GRP_ID | MINIMUM_DATE |
|---------|--------|--------------|
| 1 | 1 | 09/09/19 |
| 1 | 2 | 09/09/19 |
| 1 | 3 | 09/09/19 |
| 1 | 4 | 09/09/19 |
| 2 | 1 | 09/08/19 |
| 2 | 2 | 09/07/19 |
| 2 | 3 | 09/06/19 |
| 2 | 4 | 09/05/19 |
With this you can write your query without further code duplication and, if needed, use the PIVOT syntax to get one line per USER_ID.
The final query could then look like this:
WITH INVOLVE_EVER(USER_ID, GRP1_MINIMUM_DATE, GRP2_MINIMUM_DATE, GRP3_MINIMUM_DATE, GRP4_MINIMUM_DATE)
AS (SELECT 1, SYSDATE, SYSDATE, SYSDATE, SYSDATE FROM dual UNION ALL
SELECT 2, SYSDATE-1, SYSDATE-2, SYSDATE-3, SYSDATE-4 FROM dual)
, INPUT_COHORT(USER_ID, ANCHOR_DATE)
AS (SELECT 1, SYSDATE-1 FROM dual UNION ALL
SELECT 2, SYSDATE-2 FROM dual UNION ALL
SELECT 3, SYSDATE-3 FROM dual)
-- Above is sample data; the actual query starts here:
, unpiv AS (SELECT *
FROM INVOLVE_EVER
unpivot ( minimum_date FOR grp_id IN ( GRP1_MINIMUM_DATE AS 1, GRP2_MINIMUM_DATE AS 2, GRP3_MINIMUM_DATE AS 3, GRP4_MINIMUM_DATE AS 4)))
SELECT qcsj_c000000001000000 user_id, GRP1_MINIMUM_DATE, GRP2_MINIMUM_DATE, GRP3_MINIMUM_DATE, GRP4_MINIMUM_DATE
FROM INPUT_COHORT cohort
LEFT JOIN unpiv table_2
ON cohort.USER_ID = table_2.USER_ID
pivot (MAX(CASE WHEN minimum_date <= cohort."ANCHOR_DATE" THEN 1 ELSE 0 END) AS MINIMUM_DATE
FOR grp_id IN (1 AS GRP1,2 AS GRP2,3 AS GRP3,4 AS GRP4))
Result:
| USER_ID | GRP1_MINIMUM_DATE | GRP2_MINIMUM_DATE | GRP3_MINIMUM_DATE | GRP4_MINIMUM_DATE |
|---------|-------------------|-------------------|-------------------|-------------------|
| 3 | | | | |
| 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 1 | 1 |
This way you only have to write your calculation logic once (see line starting with pivot).
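If the query is built from application code, another way to write the calculation only once is to generate the repeated expression from a list of column names. The following is a rough sketch of that idea in Python with SQLite, not Oracle. The helper flag_expr, the fixed sample dates and the use of GROUP BY instead of DISTINCT with a window function are all assumptions made for illustration. Column names are taken from a fixed list here; they should never come from untrusted input, since they are pasted into the SQL text.

```python
import sqlite3

def flag_expr(column: str) -> str:
    """Return the SQL text of one 'minimum date reached by anchor date' flag."""
    return (
        f'MAX(CASE WHEN t2."{column}" <= c."ANCHOR_DATE" THEN 1 ELSE 0 END) '
        f'AS "{column}"'
    )

columns = [f"GRP{i}_MINIMUM_DATE" for i in range(1, 5)]
query = (
    "SELECT c.USER_ID, " + ", ".join(flag_expr(col) for col in columns) +
    " FROM INPUT_COHORT c"
    " LEFT JOIN INVOLVE_EVER t2 ON c.USER_ID = t2.USER_ID"
    " GROUP BY c.USER_ID ORDER BY c.USER_ID"
)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE INVOLVE_EVER (USER_ID INTEGER, GRP1_MINIMUM_DATE TEXT,
        GRP2_MINIMUM_DATE TEXT, GRP3_MINIMUM_DATE TEXT, GRP4_MINIMUM_DATE TEXT);
    -- ISO date strings compare correctly as text
    INSERT INTO INVOLVE_EVER VALUES
        (1, '2019-09-09', '2019-09-09', '2019-09-09', '2019-09-09'),
        (2, '2019-09-08', '2019-09-07', '2019-09-06', '2019-09-05');
    CREATE TABLE INPUT_COHORT (USER_ID INTEGER, ANCHOR_DATE TEXT);
    INSERT INTO INPUT_COHORT VALUES
        (1, '2019-09-08'), (2, '2019-09-07'), (3, '2019-09-06');
""")

flags = conn.execute(query).fetchall()
print(flags)
```

In this variant a user without any row in INVOLVE_EVER gets 0 for every flag, because the comparison with NULL falls through to the ELSE branch.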

Oracle Sql: Obtain a Sum of a Group, if Subgroup condition met

I have a dataset for which I am trying to obtain a summed value for each group, if a subgroup within each group meets a certain condition. I am not sure if this is possible, or if I am approaching this problem incorrectly.
My data is structured as follows:
+----+-------------+---------+-------+
| ID | Transaction | Product | Value |
+----+-------------+---------+-------+
| 1 | A | 0 | 10 |
| 1 | A | 1 | 15 |
| 1 | A | 2 | 20 |
| 1 | B | 1 | 5 |
| 1 | B | 2 | 10 |
+----+-------------+---------+-------+
In this example I want to obtain the sum of values by the ID column, if a transaction does not contain any products labeled 0. In the above described scenario, all values related to Transaction A would be excluded because Product 0 was purchased. With the outcome being:
+----+-------------+
| ID | Sum of Value|
+----+-------------+
| 1 | 15 |
+----+-------------+
This process would repeat for multiple IDs with each ID only containing the sum of values if the transaction does not contain product 0.

Hmmm . . . one method is to use not exists for the filtering:
select id, sum(value)
from t
where not exists (select 1
from t t2
where t2.id = t.id and t2.transaction = t.transaction and
t2.product = 0
)
group by id;

You do not need a correlated subquery with not exists.
Just use group by with a having clause:
with s (id, transaction, product, value) as (
select 1, 'A', 0, 10 from dual union all
select 1, 'A', 1, 15 from dual union all
select 1, 'A', 2, 20 from dual union all
select 1, 'B', 1, 5 from dual union all
select 1, 'B', 2, 10 from dual)
select id, sum(sum_value) as sum_value
from
(select id, transaction,
sum(value) as sum_value
from s
group by id, transaction
having count(decode(product, 0, 1)) = 0
)
group by id;
ID SUM_VALUE
---------- ----------
1 15
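The same group-then-filter approach can be checked outside Oracle. Below is a minimal sketch using SQLite through Python's sqlite3 module, which is an assumption made for testability. The column is named txn instead of transaction to stay clear of keyword issues, and a portable CASE expression stands in for decode.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, txn TEXT, product INTEGER, value INTEGER);
    INSERT INTO t VALUES
        (1, 'A', 0, 10), (1, 'A', 1, 15), (1, 'A', 2, 20),
        (1, 'B', 1, 5),  (1, 'B', 2, 10);
""")

totals = conn.execute("""
    SELECT id, SUM(txn_total) AS sum_value
    FROM (
        SELECT id, txn, SUM(value) AS txn_total
        FROM t
        GROUP BY id, txn
        -- keep only transactions that contain no product 0
        HAVING SUM(CASE WHEN product = 0 THEN 1 ELSE 0 END) = 0
    )
    GROUP BY id
""").fetchall()

print(totals)
```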

Find unique dataset with max. value from 3 columns

Imagine the following table:
ID:PrimaryKey (Sequence generated Number)
ColA:ForeignKey(Number)
ColB:ForeignKey(Number)
ColC:ForeignKey(Number)
State:Enumeration(Number) 10,20,30,... 90
ValidFrom:TimeStamp(6)
LastUpdate:TimeStamp(6)
I have now created a query to fetch each combination in the highest states (70 and above). The combination of ColA, ColB and ColC should be unique. If a ValidFrom is available, the row with the highest ValidFrom wins. If there are 2 rows in state 90, the newest one wins:
So for some table like this
|------|------|------|-------|-------------|------------|
| ColA | ColB | ColC | State |ValidFrom |LastUpdate |
|------|------|------|-------|-------------|------------|
| 1 | 1 | 1 | 10 | null | 10.10.2018 | //Excluded
|------|------|------|-------|-------------|------------|
| 1 | 1 | 1 | 70 | null | 09.10.2018 | // lower State
|------|------|------|-------|-------------|------------|
| 1 | 1 | 1 | 90 | null | 05.05.2018 | // older LastUpdate
|------|------|------|-------|-------------|------------|
| 1 | 1 | 1 | 90 | null | 12.07.2018 | //Should Win
|------|------|------|-------|-------------|------------|
| 1 | 2 | 1 | 90 | 18.10.2018 | 12.07.2018 | //Should Win
|------|------|------|-------|-------------|------------|
| 1 | 2 | 1 | 90 | null | 18.11.2018 | //loses against ValidFrom
|------|------|------|-------|-------------|------------|
| 3 | 2 | 1 | 90 | 02.12.2018 | 04.08.2018 | //lower ValidFrom
|------|------|------|-------|-------------|------------|
| 3 | 2 | 1 | 70 | 19.10.2018 | 17.11.2018 | //lower state
|------|------|------|-------|-------------|------------|
| 3 | 2 | 1 | 90 | 18.10.2018 | 14.08.2018 | //Should win
|------|------|------|-------|-------------|------------|
So as you can see, the combination of ColA, ColB and ColC should be unique in the end.
So I started writing a script that gives me all the rows with the highest state per combination:
SELECT MAINSELECT.*
FROM
FOO MAINSELECT
WHERE
MAINSELECT.STATE >= 70
AND NOT EXISTS
( SELECT SUBSELECT.ID
FROM
FOO SUBSELECT
WHERE SUBSELECT.ID <> MAINSELECT.ID
AND SUBSELECT.COLA = MAINSELECT.COLA
AND SUBSELECT.COLB = MAINSELECT.COLB
AND SUBSELECT.COLC = MAINSELECT.COLC
AND SUBSELECT.STATE > MAINSELECT.STATE);
This now gives me all rows in the highest state. As I do not want to use an OR condition, I tried to split the problem into two queries (combined with a union): one for rows where ValidFrom is NULL and one for rows with the maximum ValidFrom. So I extended the base SELECT like this to get all rows whose ValidFrom is not NULL and is the maximum ValidFrom:
SELECT MAINSELECT.*
FROM
FOO MAINSELECT
WHERE
MAINSELECT.STATE >= 70
AND MAINSELECT.VALIDFROM IS NOT NULL
AND NOT EXISTS
( SELECT SUBSELECT.ID
FROM
FOO SUBSELECT
WHERE SUBSELECT.ID <> MAINSELECT.ID
AND SUBSELECT.COLA = MAINSELECT.COLA
AND SUBSELECT.COLB = MAINSELECT.COLB
AND SUBSELECT.COLC = MAINSELECT.COLC
AND SUBSELECT.STATE > MAINSELECT.STATE)
AND NOT EXISTS
( SELECT SUBSELECT.ID
FROM
FOO SUBSELECT
WHERE SUBSELECT.ID <> MAINSELECT.ID -- Should not be the same
AND SUBSELECT.COLA = MAINSELECT.COLA -- Same combination!
AND SUBSELECT.COLB = MAINSELECT.COLB
AND SUBSELECT.COLC = MAINSELECT.COLC
AND SUBSELECT.STATE = MAINSELECT.STATE --Filter on same state!
AND SUBSELECT.VALIDFROM > MAINSELECT.VALIDFROM);
But this doesn't seem to work because now nothing is printed.
I am expecting just row: 5 and 9! [Starting at 1 ;-)]
And I currently get row: 5, 7 and 9!
So the combination [3,2,1] is duplicate.
I do not get why the 2nd NOT EXISTS does not work. It's like there are 0F*** given!

Use row_number():
dbfiddle demo
select *
from (
select row_number() over (
partition by cola, colb, colc
order by state desc, validfrom desc nulls last, lastupdate desc) rn,
foo.*
from foo)
where rn = 1
With this ordering, row 7 wins against row 9 because its ValidFrom, 2018-12-02, is newer than 2018-10-18.
Explanation:
partition by cola, colb, colc means the numbering is done separately for each combination of these columns,
followed by the ordering criteria: the higher state wins, then the newer, non-null validfrom wins, and finally the newer lastupdate wins.
For each combination of a, b, c we get a separate set of numbered rows. The outer query keeps only the rows numbered 1.
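To see the numbering in action on the sample rows from the question, here is a small sketch that runs in SQLite via Python's sqlite3 module (an assumption, the original targets Oracle). Dates are stored as ISO text, the state >= 70 filter from the question is added, and nulls last is emulated with an explicit validfrom IS NULL sort key so it does not depend on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY, cola INTEGER, colb INTEGER,
                      colc INTEGER, state INTEGER, validfrom TEXT, lastupdate TEXT)
""")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 1, 1, 10, None, "2018-10-10"),
        (2, 1, 1, 1, 70, None, "2018-10-09"),
        (3, 1, 1, 1, 90, None, "2018-05-05"),
        (4, 1, 1, 1, 90, None, "2018-07-12"),
        (5, 1, 2, 1, 90, "2018-10-18", "2018-07-12"),
        (6, 1, 2, 1, 90, None, "2018-11-18"),
        (7, 3, 2, 1, 90, "2018-12-02", "2018-08-04"),
        (8, 3, 2, 1, 70, "2018-10-19", "2018-11-17"),
        (9, 3, 2, 1, 90, "2018-10-18", "2018-08-14"),
    ],
)

winners = conn.execute("""
    SELECT id, cola, colb, colc
    FROM (
        SELECT foo.*,
               ROW_NUMBER() OVER (
                   PARTITION BY cola, colb, colc
                   ORDER BY state DESC,
                            validfrom IS NULL,  -- rows with a validfrom sort first
                            validfrom DESC,
                            lastupdate DESC
               ) AS rn
        FROM foo
        WHERE state >= 70
    )
    WHERE rn = 1
    ORDER BY id
""").fetchall()

print(winners)
```

Each combination of cola, colb, colc appears exactly once in the result, and for the combination [3,2,1] the row with id 7 is kept.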

I found the answer. Instead of using NOT EXISTS, I use max, rpad and coalesce to build a string that I compare:
SELECT
MAINSELECT.*
FROM
FOO MAINSELECT
WHERE (1 = 1)
AND MAINSELECT.STATE >= 70
AND coalesce(to_char(MAINSELECT.state), rpad('0', 3, '0') ) || coalesce(to_char(MAINSELECT.validfrom,'YYMMDDhh24missFF'), rpad('0', 18, '0') ) || coalesce(to_char(MAINSELECT.lastupdate,'YYMMDDhh24missFF'), rpad('0', 18, '0') )
= (select max(coalesce(to_char(SUBSELECT.state), rpad('0', 3, '0') ) || coalesce(to_char(SUBSELECT.validfrom,'YYMMDDhh24missFF'), rpad('0', 18, '0') )|| coalesce(to_char(SUBSELECT.lastupdate,'YYMMDDhh24missFF'), rpad('0', 18, '0')))
FROM
FOO SUBSELECT
WHERE (1 = 1)
AND SUBSELECT.STATE >= 70
AND SUBSELECT.COLA = MAINSELECT.COLA
AND SUBSELECT.COLB = MAINSELECT.COLB
AND SUBSELECT.COLC = MAINSELECT.COLC
);
This builds a single string from the columns STATE, VALIDFROM and LASTUPDATE and then finds the maximum of these strings per combination. STATE comes first in the string, so the highest state dominates the comparison.

How can I combine these two tables and create a new table?

How to combine two tables and create a new table
first table :
ExitDate | fullname | outputnumber
------------------------------------------------
2012/01/01 a 10
2012/01/06 b 2
2012/01/08 c 3
2012/01/12 d 4
second table
inputnumber | date
-------------------------------
100 2012/01/05
150 2012/01/07
200 2012/01/10
the answer table
ExitDate | fullname | outputnumber | inputnumber | date
-------------------------------------------------------------------------------
2012/01/01 a 10 - -
- - - 100 2012/01/05
2012/01/06 b 2 - -
- - - 150 2012/01/07
2012/01/08 c 3 - -
- - - 200 2012/01/10
2012/01/12 d 4 - -
Note: the date and the position of each row are important, and I am using SQL Server.

If I understand correctly, you need union all. Something like this:
select * from (
select ExitDate, fullname, outputnumber, NUll as inputnumber, NUll as [date] from first_table
union all
select NUll as ExitDate, NUll as fullname, NUll as outputnumber, inputnumber, [date] from second_table
) t
order by coalesce(ExitDate, [date])
The entire result is then sorted by the combined dates from the ExitDate and date columns.
rextester demo

I think you can have a better table like this:
select *
from (
select fullname, 0 as io, outputnumber as number, ExitDate as date
from table1
union all
select '-', 1, inputnumber, date
from table2) t
order by date, io;
fullname | io | number | date
---------+----+--------+-------------
a | 0 | 10 | 2012/01/01
- | 1 | 100 | 2012/01/05
b | 0 | 2 | 2012/01/06
- | 1 | 150 | 2012/01/07
c | 0 | 3 | 2012/01/08
- | 1 | 200 | 2012/01/10
d | 0 | 4 | 2012/01/12

You can get the exact output you want using full outer join:
select t1.*, t2.*
from t1 full outer join
t2
on 1 = 0
order by coalesce(t1.exitdate, t2.date);
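The interleaving by date can be reproduced with a small script. This sketch uses the union all variant in SQLite through Python's sqlite3 module instead of SQL Server (an assumption, so that it runs without a server). The table names first_table and second_table follow the first answer, and the dates are kept as 'YYYY/MM/DD' text, which sorts correctly as strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE first_table (exitdate TEXT, fullname TEXT, outputnumber INTEGER);
    INSERT INTO first_table VALUES
        ('2012/01/01', 'a', 10), ('2012/01/06', 'b', 2),
        ('2012/01/08', 'c', 3),  ('2012/01/12', 'd', 4);
    CREATE TABLE second_table (inputnumber INTEGER, date TEXT);
    INSERT INTO second_table VALUES
        (100, '2012/01/05'), (150, '2012/01/07'), (200, '2012/01/10');
""")

combined = conn.execute("""
    SELECT exitdate, fullname, outputnumber, inputnumber, date
    FROM (
        SELECT exitdate, fullname, outputnumber,
               NULL AS inputnumber, NULL AS date
        FROM first_table
        UNION ALL
        SELECT NULL, NULL, NULL, inputnumber, date
        FROM second_table
    )
    -- each row has exactly one of the two dates, so coalesce picks it
    ORDER BY COALESCE(exitdate, date)
""").fetchall()

for row in combined:
    print(row)
```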

SQL - min() gets the lowest value, max() the highest, what if I want the 2nd (or 5th or nth) lowest value?

The problem I'm trying to solve is that I have a table like this:
a and b refer to points in a different table. distance is the distance between the points.
| id | a_id | b_id | distance | delete |
| 1 | 1 | 1 | 1 | 0 |
| 2 | 1 | 2 | 0.2345 | 0 |
| 3 | 1 | 3 | 100 | 0 |
| 4 | 2 | 1 | 1343.2 | 0 |
| 5 | 2 | 2 | 0.45 | 0 |
| 6 | 2 | 3 | 110 | 0 |
....
The important column I'm looking at is a_id. If I wanted to keep the closest b for each a, I could do something like this:
update mytable set delete = 1 from (select a_id, min(distance) as dist from mytable group by a_id) as x where mytable.a_id = x.a_id and distance > dist;
delete from mytable where delete = 1;
Which would give me a result table like this:
| id | a_id | b_id | distance | delete |
| 1 | 1 | 1 | 1 | 0 |
| 5 | 2 | 2 | 0.45 | 0 |
....
i.e. I need one row for each value of a_id, and that row should have the lowest value of distance for each a_id.
However, I want to keep the 10 closest points for each a_id. I could do this with a plpgsql function but I'm curious if there is a more SQL-y way.
min() and max() return the smallest and largest values. If there were an aggregate function like nth(), which returned the nth largest/smallest value, then I could do this in a similar manner to the above.
I'm using PostgreSQL.

Try this:
SELECT *
FROM (
SELECT a_id, (
SELECT b_id
FROM mytable mib
WHERE mib.a_id = ma.a_id
ORDER BY
distance
LIMIT 1 OFFSET s
) AS b_id
FROM (
SELECT DISTINCT a_id
FROM mytable mia
) ma, generate_series (0, 9) s
) ab
WHERE b_id IS NOT NULL
Checked on PostgreSQL 8.3

I love postgres, so I took it as a challenge the second I saw this question.
So, for the table:
Table "pg_temp_29.foo"
Column | Type | Modifiers
--------+---------+-----------
value | integer |
With the values:
SELECT value FROM foo ORDER BY value;
value
-------
0
1
2
3
4
5
6
7
8
9
14
20
32
(13 rows)
You can do a:
SELECT value FROM foo ORDER BY value DESC LIMIT 1 OFFSET X
Where X = 0 for the highest value, 1 for the second highest, 2... And so forth.
This can be further embedded in a subquery to retrieve the value needed. So, to use the dataset provided in the original question, we can get the ten lowest distances for each a_id by doing:
SELECT a_id, distance FROM mytable t1
WHERE id IN
(SELECT id FROM mytable t2 WHERE t1.a_id = t2.a_id
ORDER BY distance LIMIT 10)
ORDER BY a_id, distance;
a_id | distance
------+----------
1 | 0.2345
1 | 1
1 | 100
2 | 0.45
2 | 110
2 | 1343.2

Does PostgreSQL have the analytic function rank()? If so, try the following (note that the subquery needs an alias):
select a_id, b_id, distance
from
( select a_id, b_id, distance, rank() over (partition by a_id order by distance) rnk
from mytable
) ranked where rnk <= 10;
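The ranking approach can be tried on the six sample rows from the question. The sketch below runs in SQLite via Python's sqlite3 module rather than PostgreSQL (an assumption made for a self-contained example; window functions need SQLite 3.25 or later). It keeps the 2 closest points per a_id instead of 10, simply because the sample is small, and it uses row_number() so that ties cannot return more than n rows per group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER, a_id INTEGER, b_id INTEGER, distance REAL);
    INSERT INTO mytable VALUES
        (1, 1, 1, 1), (2, 1, 2, 0.2345), (3, 1, 3, 100),
        (4, 2, 1, 1343.2), (5, 2, 2, 0.45), (6, 2, 3, 110);
""")

n = 2  # number of closest points to keep per a_id (10 in the question)
closest = conn.execute("""
    SELECT a_id, b_id
    FROM (
        SELECT a_id, b_id, distance,
               ROW_NUMBER() OVER (PARTITION BY a_id ORDER BY distance) AS rn
        FROM mytable
    ) ranked
    WHERE rn <= ?
    ORDER BY a_id, rn
""", (n,)).fetchall()

print(closest)
```

With rank() instead of row_number(), rows that share the same distance share a rank, so a group could yield more than n rows.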

This SQL should find you the Nth lowest distance and should work in SQL Server, MySQL, DB2, Oracle, Teradata, and almost any other RDBMS (note: performance is low because of the correlated subquery):
SELECT * /*This is the outer query part */
FROM mytable tbl1
WHERE (N-1) = ( /* Subquery starts here */
SELECT COUNT(DISTINCT(tbl2.distance))
FROM mytable tbl2
WHERE tbl2.distance < tbl1.distance)
The most important thing to understand in the query above is that the subquery is evaluated each and every time a row is processed by the outer query. In other words, the inner query can not be processed independently of the outer query since the inner query uses the tbl1 value as well.
In order to find the Nth lowest value, we just find the value that has exactly N-1 distinct values lower than itself.
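As a small check of the counting idea, the following sketch runs the correlated subquery in SQLite through Python's sqlite3 module (an assumption, chosen so the example is runnable as is). N is passed as a bound parameter, and the six sample distances from the question are used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER, a_id INTEGER, b_id INTEGER, distance REAL);
    INSERT INTO mytable VALUES
        (1, 1, 1, 1), (2, 1, 2, 0.2345), (3, 1, 3, 100),
        (4, 2, 1, 1343.2), (5, 2, 2, 0.45), (6, 2, 3, 110);
""")

n = 3  # look for the 3rd lowest distance over the whole table
nth_lowest = conn.execute("""
    SELECT tbl1.id
    FROM mytable tbl1
    WHERE ? - 1 = (
        -- how many distinct distances are strictly smaller than this row's
        SELECT COUNT(DISTINCT tbl2.distance)
        FROM mytable tbl2
        WHERE tbl2.distance < tbl1.distance
    )
""", (n,)).fetchall()

print(nth_lowest)
```

Only 0.2345 and 0.45 are smaller than 1, so the row with id 1 holds the 3rd lowest distance.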
