In Hive, how to do a calculation between 2 rows?

I have this table.
+----+------+------+------+
| ks | time | val1 | val2 |
+----+------+------+------+
| A  |    1 |    1 |    1 |
| B  |    1 |    3 |    5 |
| A  |    2 |    6 |    7 |
| B  |    2 |   10 |   12 |
| A  |    4 |    6 |    7 |
| B  |    4 |   20 |   26 |
+----+------+------+------+
What I want to get is, for each row:
ks | time | val1 | val1 of the next time for the same ks
To be clear, the result for the above example should be:
+----+------+------+-----------+
| ks | time | val1 | next.val1 |
+----+------+------+-----------+
| A  |    1 |    1 |         6 |
| B  |    1 |    3 |        10 |
| A  |    2 |    6 |         6 |
| B  |    2 |   10 |        20 |
| A  |    4 |    6 |      null |
| B  |    4 |   20 |      null |
+----+------+------+-----------+
(I need the same next value for val2 as well.)
I tried a lot to come up with a Hive query for this, but still no luck. I was able to write such a query in SQL as mentioned here (Quassnoi's answer), but couldn't create the equivalent in Hive because Hive doesn't support subqueries in SELECT.
Can someone please help me achieve this?
Thanks in advance.
EDIT:
Query I tried was,
SELECT ks, time, val1, next[0] as next.val1 from
(SELECT ks, time, val1
COALESCE(
(
SELECT Val1, time
FROM myTable mi
WHERE mi.val1 > m.val1 AND mi.ks = m.ks
ORDER BY time
LIMIT 1
), CAST(0 AS BIGINT)) AS next
FROM myTable m
ORDER BY time) t2;

Your query seems quite similar to the "year ago" reporting that is ubiquitous in financial reporting. I think a LEFT OUTER JOIN is what you are looking for.
We join table myTable to itself, naming the two instances of the same table m and n. For every entry in the first table m we will attempt to find a matching record in n with the same ks value but an incremented value of time. If this record does not exist, all column values for n will be NULL.
SELECT
m.ks,
m.time,
m.val1,
n.val1 as next_val1,
m.val2,
n.val2 as next_val2
FROM
myTable m
LEFT OUTER JOIN
myTable n
ON (
m.ks = n.ks
AND
m.time + 1 = n.time
);
Returns the following.
ks time val1 next_val1 val2 next_val2
A 1 1 6 1 7
A 2 6 6 7 7
A 3 6 NULL 7 NULL
B 1 3 10 5 12
B 2 10 20 12 26
B 3 20 NULL 26 NULL
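Note that the join above matches on m.time + 1 = n.time, so it assumes the time values are consecutive; for the sample data in the question, which jumps from 2 to 4, that join would return NULL instead of the next existing time. If your Hive version supports windowing functions (Hive 0.11 and later), LEAD() returns the next row's value per ks directly, without that assumption. A minimal sketch:
-- LEAD() takes the value from the next row within each ks, ordered by time,
-- and returns NULL for the last time of each ks, matching the desired output.
SELECT
    ks,
    time,
    val1,
    LEAD(val1) OVER (PARTITION BY ks ORDER BY time) AS next_val1,
    val2,
    LEAD(val2) OVER (PARTITION BY ks ORDER BY time) AS next_val2
FROM myTable;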
Hope that helps.

I find that using Hive's custom map/reduce functionality works great for solving queries like this. It gives you the opportunity to consider a set of input rows and "reduce" them to one (or more) results.
This answer discusses the solution.
The key is to use CLUSTER BY to send all rows with the same key value to the same reducer (and hence the same reduce script), collect them there, output the reduced results when the key changes, and then start collecting for the new key.
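A rough sketch of that approach, assuming a hypothetical reducer script (here called next_val.py, not part of the original answer) that reads the time-sorted rows for each ks and emits every row together with the following row's values:
ADD FILE next_val.py;

FROM (
    SELECT ks, time, val1, val2
    FROM myTable
    DISTRIBUTE BY ks   -- all rows for a given ks go to the same reducer
    SORT BY ks, time   -- CLUSTER BY ks would sort by ks only, so sort by time explicitly
) sorted
SELECT TRANSFORM (ks, time, val1, val2)
       USING 'python next_val.py'
       AS (ks, time, val1, next_val1, val2, next_val2);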

Related

How to sum rows in groups of 3?

I have a table that looks like this:
id | amount
1 | 8
2 | 3
3 | 9
3 | 2
4 | 5
5 | 3
5 | 1
5 | 7
6 | 3
7 | 3
8 | 5
I need a query that returns the summed amount of rows grouped by every 3 consecutive IDs. The result should be:
ids (not a necessary column, just to explain better) | amount
1,2,3 | 22
4,5,6 | 19
7,8 | 8
In my table, you can assume IDs are always consecutive, so there can't be a 10 without a 9 existing too. But the same ID can also show up multiple times with different amounts (just like in my example above).
Assuming ID is a numeric data type.
Demo
SELECT max(id) maxID, SUM(Amount) as Amount
FROM TBLNAME
GROUP BY Ceiling(id/3.0)
ORDER BY maxID
Giving us:
+-------+--------+
| maxid | amount |
+-------+--------+
| 3 | 22 |
| 6 | 19 |
| 8 | 8 |
+-------+--------+
Doc Link: Ceiling
maxID is included just so the ORDER BY makes sense and the totals can be validated.
I used 3.0 instead of 3 to force implicit casting to a decimal data type (a hack, I know, but it works); otherwise integer math takes place and the rounding during the division produces an incorrect result.
Without the .0 on the divisor we'd get:
+-------+--------+
| maxid | amount |
+-------+--------+
| 2 | 11 |
| 5 | 27 |
| 8 | 11 |
+-------+--------+
CEILING() is used over FLOOR() since FLOOR() would not group IDs 1-3 into the same set.
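If you'd rather avoid relying on implicit casting, an explicit CAST gives the same grouping; a sketch against the same table and column names as above:
-- Same grouping with an explicit cast instead of the ".0" trick;
-- ids 1-3 fall into group 1, 4-6 into group 2, 7-8 into group 3.
SELECT MAX(id) AS maxID, SUM(amount) AS amount
FROM TBLNAME
GROUP BY CEILING(CAST(id AS DECIMAL(10, 2)) / 3)
ORDER BY maxID;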

How to return the same period last year data with SQL?

I am trying to create a view in PostgreSQL with the requirements below:
The view needs to show the same-period-last-year data for every record.
Sample data:
date_sk | location_sk | division_sk | employee_type_sk | value
20180202 | 6 | 8 | 4 | 1
20180202 | 7 | 2 | 4 | 2
20190202 | 6 | 8 | 4 | 1
20190202 | 7 | 2 | 4 | 1
20200202 | 6 | 8 | 4 | 1
20200202 | 7 | 2 | 4 | 3
In the table, date_sk, location_sk, division_sk and employee_type_sk together form a composite key that uniquely identifies a record.
You can check the required output as below:
date_sk | location_sk | division_sk | employee_type_sk | value | value_last_year
20180202 | 6 | 8 | 4 | 1 | NULL
20180203 | 7 | 2 | 4 | 2 | NULL
20190202 | 6 | 8 | 4 | 1 | 1
20190203 | 7 | 3 | 4 | 1 | NULL
20200202 | 6 | 8 | 4 | 1 | 1
20200203 | 7 | 3 | 4 | 3 | 1
The records start on 20180202; therefore, the data for the same period last year is unavailable. At the 4th record, there is a difference in division_sk compared with the same period last year, hence value_last_year is NULL.
My current solution is to create a view from the sample data with an additional column same_date_last_year, then LEFT JOIN the same table. The SQL queries are below:
CREATE VIEW test_view AS
SELECT *,
CONCAT(LEFT(date_sk, 4) - 1, RIGHT(date_sk, 4)) AS same_date_last_year
FROM test_table;
SELECT
test_view.date_sk,
test_view.location_sk,
test_view.division_sk,
test_view.employee_type_sk,
test_view.value,
test_table.value AS value_last_year
FROM test_view
LEFT JOIN test_table ON (test_view.same_date_last_year = test_table.date_sk);
We have a lot of data in the table, and my solution above is unacceptable in terms of performance.
Is there a different query which yields the same result and might improve the performance?
You could simply use a correlated subquery here which is likely best for performance:
select *,
       (select t2.value
        from t t2
        where t2.date_sk = t.date_sk - interval '1' year
          and t2.location_sk = t.location_sk
          and t2.division_sk = t.division_sk
          and t2.employee_type_sk = t.employee_type_sk
       ) as value_last_year
from t;
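For the correlated subquery to stay fast on a large table, a supporting index covering the lookup columns would normally be needed. A hypothetical example (the index name and column order are illustrative, not from the original post):
-- Hypothetical supporting index for the correlated lookup above.
CREATE INDEX idx_t_same_period_lookup
    ON t (location_sk, division_sk, employee_type_sk, date_sk);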
WITH CTE (DATE_SK, LOCATION_SK, DIVISION_SK, EMPLOYEE_TYPE_SK, VALUE) AS
(
    SELECT CAST('20180202' AS DATE), 6, 8, 4, 1 UNION ALL
    SELECT CAST('20180203' AS DATE), 7, 2, 4, 2 UNION ALL
    SELECT CAST('20190202' AS DATE), 6, 8, 4, 1 UNION ALL
    SELECT CAST('20190203' AS DATE), 7, 2, 4, 1 UNION ALL
    SELECT CAST('20200202' AS DATE), 6, 8, 4, 1 UNION ALL
    SELECT CAST('20200203' AS DATE), 7, 2, 4, 3
)
SELECT C.DATE_SK, C.LOCATION_SK, C.DIVISION_SK, C.EMPLOYEE_TYPE_SK, C.VALUE,
       LAG(C.VALUE) OVER (PARTITION BY C.LOCATION_SK, C.DIVISION_SK, C.EMPLOYEE_TYPE_SK
                          ORDER BY C.DATE_SK ASC) AS LAGG
FROM CTE AS C
ORDER BY C.DATE_SK ASC;
Could you please try whether the above works for you? I assume DATE_SK is a date column or can be cast to a date.

SQL - How to transform a table with a range of values into another with all the numbers in that range?

I have a Table (A) with some intervals from start_val to end_val with an attribute for that range of values.
I want a Table (B) in which each row is a number in the interval of start_val to end_val with the attribute of that range.
I need to do that using SQL.
Example
Table A:
+---------+--------+----------+
|start_val| end_val| attribute|
+---------+--------+----------+
| 10 | 12 | 1 |
| 20 | 23 | 2 |
+---------+--------+----------+
Table B (Expected result):
+-------------------------------+-----------+
| value in [start_val, end_val] | attribute |
+-------------------------------+-----------+
| 10                            | 1         |
| 11                            | 1         |
| 12                            | 1         |
| 20                            | 2         |
| 21                            | 2         |
| 22                            | 2         |
| 23                            | 2         |
+-------------------------------+-----------+
Here is a way to do this
select m.start_val + n - 1 as start_val_computed,
       m.attribute
from t m
join lateral generate_series(1, (m.end_val - m.start_val) + 1) n
  on 1 = 1;
+--------------------+-----------+
| start_val_computed | attribute |
+--------------------+-----------+
| 10 | 1 |
| 11 | 1 |
| 12 | 1 |
| 20 | 2 |
| 21 | 2 |
| 22 | 2 |
| 23 | 2 |
+--------------------+-----------+
working example
https://dbfiddle.uk/?rdbms=postgres_12&fiddle=ce9e13765b5a4c3616d95ec659c1dfc9
You may use a calendar table approach:
SELECT
t1.val,
t2.attribute
FROM generate_series(10, 23) AS t1(val)
INNER JOIN TableA t2
ON t1.val BETWEEN t2.start_val AND t2.end_val
ORDER BY
t2.attribute,
t1.val;
Note: You may expand the bounds in the above call to generate_series to cover whatever range you think your data would need.
This is a variant of George's solution, but it is a bit simpler:
select n, m.attribute
from t m cross join lateral
generate_series(m.start_val, m.end_val) n;
The changes are:
CROSS JOIN instead of JOIN. So, no need for an ON clause.
No arithmetic in the GENERATE_SERIES().
No arithmetic in the SELECT.
You can just call the result of GENERATE_SERIES() whatever name you want in the result set.
Postgres actually allows you to put GENERATE_SERIES() in the SELECT:
select generate_series(m.start_val, m.end_val) as n, m.attribute
from t m;
However, I am not a fan of putting row generating functions anywhere other than the FROM clause. I just find it confusing to figure out what the query is doing.

Select max value from column for every value in other two columns

I'm working on a webapp that tracks TV shows, and I need to get all episode IDs that are season finales, meaning the highest episode number of each season, for all TV shows.
This is a simplified version of my "episodes" table.
id tvshow_id season epnum
---|-----------|--------|-------
1 | 1 | 1 | 1
2 | 1 | 1 | 2
3 | 1 | 1 | 3
4 | 1 | 2 | 1
5 | 1 | 2 | 2
6 | 2 | 1 | 1
7 | 2 | 1 | 2
8 | 2 | 1 | 3
9 | 2 | 1 | 4
10 | 2 | 2 | 1
11 | 2 | 2 | 2
The expected output:
id
---|
3 |
5 |
9 |
11 |
I've managed to get this working for the latest season but I can't make it work for all seasons.
I've also tried to take some ideas from this but I can't seem to find a way to add the tvshow_id in there.
I'm using Postgres v10
SELECT id
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY tvshow_id, season ORDER BY epnum DESC) AS ranking
      FROM tbl) c
WHERE ranking = 1;
You can use the SQL below to get your result, using GROUP BY in a subquery:
select id from tab_x
where (tvshow_id,season,epnum) in (
select tvshow_id,season,max(epnum)
from tab_x
group by tvshow_id,season)
Below is a simple query to get the desired result. It also performs well thanks to the DISTINCT ON () clause:
select distinct on (tvshow_id, season)
       id
from your_table
order by tvshow_id, season, epnum desc;

How to match variable data in SQL Server

I need to map a many-to-many relationship between two flat tables. Table A contains a list of possible configurations (where each column is a question and the cell value is the answer). NULL values denote that the answer does not matter. Table B contains actual configurations with the same columns.
Ultimately, I need the final results to show which configurations are mapped between table B and A:
Example
ActualId | ConfigId
---------+---------
5 | 1
6 | 1
8 | 2
. | .
. | .
. | .
N | M
To give a simple example of the tables and data I'm working with, the first table would look like this:
Table A
--------
ConfigId | Size | Color | Cylinders | ... | ColumnN
---------+------+-------+-----------+-----+--------
1 | 3 | | 4 | ... | 5
2 | 4 | 5 | 5 | ... | 5
3 | | 5 | | ... | 5
And Table B would look like this:
Table B
-------
ActualId | Size | Color | Cylinders | ... | ColumnN
---------+------+-------+-----------+-----+--------
1 | 3 | 1 | 4 | ... | 5
2 | 3 | 8 | 4 | ... | 5
3 | 4 | 5 | 5 | ... | 5
4 | 7 | 5 | 6 | ... | 5
Since the NULL values denote that any value can work, the expected result would be:
Expected
---------
ActualId | ConfigId
---------+---------
1 | 1
2 | 1
3 | 2
3 | 3
4 | 3
I'm trying to figure out the best way to go about matching the actual data which has over a hundred columns. I know trying to check each and every column for NULL values is absolutely wrong and will not perform well. I'm really fascinated with this problem and would love some help to find the best way to tackle this.
So, this joins TABLEA to TABLEB on Size, Color and Cylinders.
The Size match compares A against B:
If A.SIZE is NULL, the comparison becomes B.SIZE = B.SIZE, which is true whenever B.SIZE is not NULL.
If A.SIZE is not NULL, the comparison is A.SIZE = B.SIZE, which is only true if they match.
The matching on Color and Cylinders is similar.
SELECT *
FROM TABLEA A
INNER JOIN TABLEB B
    ON  ISNULL(A.SIZE, B.SIZE) = B.SIZE
    AND ISNULL(A.COLOR, B.COLOR) = B.COLOR
    AND ISNULL(A.CYLINDERS, B.CYLINDERS) = B.CYLINDERS;
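If only the ActualId-to-ConfigId mapping from the Expected table is needed, the same ISNULL pattern can be projected down to the two ID columns; a sketch, with the remaining columns following the same pattern:
-- Same ISNULL-based join, selecting only the ID pair from the Expected result;
-- repeat the ISNULL comparison for each remaining configuration column.
SELECT B.ActualId, A.ConfigId
FROM TABLEB B
INNER JOIN TABLEA A
    ON  ISNULL(A.Size, B.Size) = B.Size
    AND ISNULL(A.Color, B.Color) = B.Color
    AND ISNULL(A.Cylinders, B.Cylinders) = B.Cylinders
    -- AND ISNULL(A.ColumnN, B.ColumnN) = B.ColumnN, and so on
ORDER BY B.ActualId, A.ConfigId;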