hive - split a row into multiple rows between the range of values - sql

I have the table below and would like to split each row across the range from the start column to the end column,
i.e., id and value should repeat for each value between start and end (both inclusive).
--------------------------------------
id | value | start | end
--------------------------------------
1 | 5 | 1 | 4
2 | 8 | 5 | 9
--------------------------------------
Desired output
--------------------------------------
id | value | current
--------------------------------------
1 | 5 | 1
1 | 5 | 2
1 | 5 | 3
1 | 5 | 4
2 | 8 | 5
2 | 8 | 6
2 | 8 | 7
2 | 8 | 8
2 | 8 | 9
--------------------------------------
I can write my own UDF in Java/Python to get this result, but I would like to check whether I can implement it in Hive SQL using existing Hive UDFs.
Thanks in advance.

In other databases this can be accomplished with a recursive common table expression, but Hive doesn't support those.
One option is to create a table of numbers and use it to generate rows between start and end.
create table numbers
location 'hdfs_location' as
select row_number() over(order by somecolumn) as num
from some_table --this can be any table with the desired number of rows
;
-- Join it with the existing table (`end` and `current` are quoted because they are reserved words)
select t.id, t.value, n.num as `current`
from tbl t
join numbers n on n.num >= t.start and n.num <= t.`end`
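
For comparison, here is a minimal sketch of the recursive common table expression mentioned above. It will not run in Hive; it uses SQLite through Python's sqlite3 module purely for illustration. The column names range_start and range_end and the CTE name expanded are assumptions made for this sketch, not part of the original table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl (id INTEGER, value INTEGER, range_start INTEGER, range_end INTEGER);
    INSERT INTO tbl VALUES (1, 5, 1, 4), (2, 8, 5, 9);
""")

# Seed each row at range_start, then keep adding 1 until range_end is reached.
expanded_rows = conn.execute("""
    WITH RECURSIVE expanded(id, value, cur, range_end) AS (
        SELECT id, value, range_start, range_end FROM tbl
        UNION ALL
        SELECT id, value, cur + 1, range_end FROM expanded WHERE cur < range_end
    )
    SELECT id, value, cur FROM expanded ORDER BY id, cur
""").fetchall()

for row in expanded_rows:
    print(row)
```

The numbers-table join above reaches the same rows without recursion, which is why it is the practical choice in Hive.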

You can do this with the posexplode() UDTF.
WITH
data AS (
SELECT 1 AS id, 5 AS value, 1 AS start, 4 AS `end`
UNION ALL
SELECT 2 AS id, 8 AS value, 5 AS start, 9 AS `end`
)
SELECT distinct id, value, (zr.start+rge.diff) as `current`
FROM data zr LATERAL VIEW posexplode(split(space(zr.`end`-zr.start),' ')) rge as diff, x
Here is the output:
+-----+--------+----------+--+
| id | value | current |
+-----+--------+----------+--+
| 1 | 5 | 1 |
| 1 | 5 | 2 |
| 1 | 5 | 3 |
| 1 | 5 | 4 |
| 2 | 8 | 5 |
| 2 | 8 | 6 |
| 2 | 8 | 7 |
| 2 | 8 | 8 |
| 2 | 8 | 9 |
+-----+--------+----------+--+
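
To see why the space/split/posexplode trick produces one row per value in the range, here is a small Python emulation of the three steps. It is only a sketch of the idea, not Hive code, and the function name expand_range is made up for illustration.

```python
def expand_range(start, end):
    # space(end - start): a string made of (end - start) spaces
    spaces = " " * (end - start)
    # split(..., ' '): the empty pieces are kept, giving (end - start + 1) elements
    pieces = spaces.split(" ")
    # posexplode: one row per element; the position is the offset from start
    return [start + pos for pos, _ in enumerate(pieces)]


print(expand_range(1, 4))
print(expand_range(5, 9))
```

Only the positions matter here; the exploded elements themselves are empty strings, which is why the query above ignores the second alias (x).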

Related

SQL How to summarize integer/numeric values on different rows

I am trying to merge integer and numeric values from different SQL rows within the same table into one row so that they are summarized.
| ID | Count | Total Payment
1 | 1 | 5 | 10.99
2 | 1 | 3 | 4.86
3 | 2 | 8 | 19.88
4 | 2 | 2 | 15.99
5 | 2 | 5 | 8.45
6 | 3 | 4 | 12.98
7 | 3 | 10 | 40.42
As such I want to summarize the above rows into the below rows.
| ID | Count | Total Payment
1 | 1 | 8 | 15.85
2 | 2 | 15 | 44.32
3 | 3 | 14 | 53.40
How do I do this?
Thank you HonyBadger and Mathieu Guindon.
The correct code was:
SELECT [id], SUM([count]) AS [count], SUM([total_payment]) AS [total_payment]
FROM [table_name]
GROUP BY [id]
ORDER BY [id];
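
As a quick check of the GROUP BY approach against the sample rows from the question, here is a minimal sketch using SQLite through Python's sqlite3 module. The table name payments and the result aliases are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE payments (id INTEGER, "count" INTEGER, total_payment REAL)')
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [
        (1, 5, 10.99), (1, 3, 4.86),
        (2, 8, 19.88), (2, 2, 15.99), (2, 5, 8.45),
        (3, 4, 12.98), (3, 10, 40.42),
    ],
)

# One output row per id, with both numeric columns summed.
summary = conn.execute("""
    SELECT id, SUM("count") AS total_count, SUM(total_payment) AS total_paid
    FROM payments
    GROUP BY id
    ORDER BY id
""").fetchall()

for row_id, total_count, total_paid in summary:
    print(row_id, total_count, round(total_paid, 2))
```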

Oracle SQL unpivot and keep rows with null values [duplicate]

I'm currently doing an unpivot for an Oracle data source (v.12.2) like this:
SELECT *
FROM some_table
UNPIVOT (
(X,Y,Val)
FOR SITE
IN (
(SITE1_X, SITE1_Y, SITE1_VAL) AS '1',
(SITE2_X, SITE2_Y, SITE2_VAL) AS '2',
(SITE3_X, SITE3_Y, SITE3_VAL) AS '3'
))
This works fine so far, with one exception: I have another column, let's say extend_info. If this column has the value y, there is only one row for that ID and all of its site columns are null. I would still like to keep this row rather than drop it.
I'm not really sure how to do this or what would be a nice way to do this. Any recommendations?
Example:
Original Table:
ID | SITE1_X | SITE1_Y |SITE1_VAL | SITE2_X | SITE2_Y | SITE2_VAL | ... | extend_info
-------
1 | 0 | 0 | 5 | 1 | 1 | 10 | ... | n
2 | 0 | 0 | 3 | null | null | null | ... | n
3 | null | null | null | null | null | null | ... | y
current output:
ID | SITE | X | Y | VAL | extend_info
-------
1 | 1 | 0 | 0 | 5 | n
2 | 1 | 1 | 1 | 10 | n
3 | 2 | 0 | 0 | 3 | n
desired output:
ID | SITE | X | Y | VAL | extend_info
-------
1 | 1 | 0 | 0 | 5 | n
2 | 1 | 1 | 1 | 10 | n
3 | 2 | 0 | 0 | 3 | n
4 | | | | | y
I don't really care what is in SITE|X|Y|VAL in that case, can be 0 for everything or null.
Bonus question:
If extend_info is y I would like to join another table with this ID. The other table looks like this:
ID | F_ID | X | Y | VAL
-----
1 | 4 | 1 | 1 | 8
2 | 4 | 2 | 2 | 9
and in that case my final output table should look like:
ID | SITE | X | Y | VAL | X_OTHER_TABLE | Y_OTHER_TABLE
-------
1 | 1 | 0 | 0 | 5 |
2 | 1 | 1 | 1 | 10 |
3 | 2 | 0 | 0 | 3 |
4 | | | | 8 | 1 | 1
5 | | | | 9 | 2 | 2
I know... the database structure is super ugly but that is what a vendor provides us and we are trying to create a View to make it easier to perform some data analysis tasks on it.
It doesn't have to look 1:1 like my final example, but I hope my intention is clear: I want one single table/view with all the information in a single format.
Thanks for any help!
I would recommend a lateral join:
SELECT s.id, u.*
FROM some_table s CROSS JOIN LATERAL
(SELECT s.SITE1_X as SITE_X, s.SITE1_Y as SITE_Y, s.SITE1_VAL as SITE_VAL FROM DUAL UNION ALL
SELECT s.SITE2_X, s.SITE2_Y, s.SITE2_VAL FROM DUAL UNION ALL
SELECT s.SITE3_X, s.SITE3_Y, s.SITE3_VAL FROM DUAL
) u;
You can just join additional tables to this as you like.
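
The point of expanding each row into one row per site is that rows whose site columns are all null are kept. Below is a rough sketch of that behaviour using SQLite through Python's sqlite3 module. SQLite has no LATERAL, so the sketch swaps in a plain UNION ALL over the table, which is a different technique with the same row-per-site result. Only the two sites spelled out in the example table are modelled, and the site number and extend_info columns are added to the output as an assumption about what the view needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE some_table (
        id INTEGER,
        site1_x INTEGER, site1_y INTEGER, site1_val INTEGER,
        site2_x INTEGER, site2_y INTEGER, site2_val INTEGER,
        extend_info TEXT
    )
""")
conn.executemany(
    "INSERT INTO some_table VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 0, 0, 5, 1, 1, 10, "n"),
        (2, 0, 0, 3, None, None, None, "n"),
        (3, None, None, None, None, None, None, "y"),
    ],
)

# One SELECT per site; unlike a default UNPIVOT, all-null rows are not dropped.
unpivoted = conn.execute("""
    SELECT id, 1 AS site, site1_x AS x, site1_y AS y, site1_val AS val, extend_info
    FROM some_table
    UNION ALL
    SELECT id, 2, site2_x, site2_y, site2_val, extend_info
    FROM some_table
    ORDER BY id, site
""").fetchall()

for row in unpivoted:
    print(row)
```

Every ID now appears once per site, including ID 3, so any further filtering (for example dropping null rows unless extend_info is y) can be done in a WHERE clause on top of this result.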

Rolling Average in sqlite

I want to calculate a rolling average in a table and keep track of the starting time of each calculated window frame.
My problem is that I expect the result to have fewer rows than the table, but my query returns exactly the same number of rows. I think I understand why it does not work, but I don't know the remedy.
Let's say I have a table with example data that looks like this:
+------+-------+
| Tick | Value |
+------+-------+
| 1 | 1 |
| 2 | 3 |_
| 3 | 5 |
| 4 | 7 |_
| 5 | 9 |
| 6 | 11 |_
| 7 | 13 |
| 8 | 15 |_
| 9 | 17 |
| 10 | 19 |_
+------+-------+
I want to calculate the average of every n items, for example of every two rows (see the marks above), so that I get a result of:
+--------------+--------------+
| OccurredTick | ValueAverage |
+--------------+--------------+
| 1 | 2 |
| 3 | 6 |
| 5 | 10 |
| 7 | 14 |
| 9 | 18 |
+--------------+--------------+
I tried that with
SELECT
    FIRST_VALUE(Tick) OVER (
        ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING
    ) OccurredTick,
    AVG(Value) OVER (
        ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING
    ) ValueAverage
FROM TableName;
What I get in return is:
+--------------+--------------+
| OccurredTick | ValueAverage |
+--------------+--------------+
| 1 | 2 |
| 2 | 4 |
| 3 | 6 |
| 4 | 8 |
| 5 | 10 |
| 6 | 12 |
| 7 | 14 |
| 8 | 16 |
| 9 | 18 |
| 10 | 19 |
+--------------+--------------+
You could use aggregation. If tick is always increasing with no gaps:
select min(tick), avg(value) avg_value
from mytable
group by cast((tick - 1) / 2 as integer)
You can change 2 to whatever group size suits you best.
If the tick values are not sequential, we can generate a sequence with row_number():
select min(tick), avg(value) avg_value
from (
select t.*, row_number() over(order by tick) rn
from mytable t
) t
group by cast((rn - 1) / 2 as integer)
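
Here is a small sketch that runs the row_number() variant through Python's sqlite3 module, once on the data from the question and once on ticks with gaps. It assumes an SQLite version with window functions (3.25 or later); the helper name bucket_averages and the column aliases are made up for the example.

```python
import sqlite3


def bucket_averages(rows, group_size):
    """Average `value` over consecutive buckets of `group_size` rows, ordered by tick."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mytable (tick INTEGER, value INTEGER)")
    conn.executemany("INSERT INTO mytable VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT MIN(tick) AS occurred_tick, AVG(value) AS value_average
        FROM (
            SELECT t.*, ROW_NUMBER() OVER (ORDER BY tick) AS rn
            FROM mytable t
        ) AS numbered
        GROUP BY (rn - 1) / ?   -- integer division puts rows 1..n in bucket 0, and so on
        ORDER BY occurred_tick
        """,
        (group_size,),
    ).fetchall()
    conn.close()
    return result


sequential = [(tick, 2 * tick - 1) for tick in range(1, 11)]
gapped = [(1, 1), (2, 3), (4, 5), (7, 7), (8, 9), (11, 11)]

print(bucket_averages(sequential, 2))
print(bucket_averages(gapped, 2))
```

Because the buckets are built from the row number rather than from tick itself, gaps in tick do not change which rows are averaged together.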

Add ID to duplicates in two columns

I have a table representing trade exchanges between cities and I'd like to add an id that would indicate groups of same origin/destination and destination/origin alike.
For example:
| origin | destination
|--------|------------
| 8 | 2
| 2 | 8
| 8 | 2
| 8 | 5
| 8 | 5
| 9 | 1
| 1 | 9
would become:
| id | origin | destination
|----|--------|------------
| 0 | 8 | 2
| 0 | 2 | 8
| 0 | 8 | 2
| 1 | 8 | 5
| 1 | 8 | 5
| 2 | 9 | 1
| 2 | 1 | 9
I can have same origin/destination but I can also have origin/destination = destination/origin and I want all of those groups identified.
One way: with the window function dense_rank() and GREATEST / LEAST:
SELECT dense_rank() OVER (ORDER BY GREATEST(origin, destination)
, LEAST (origin, destination)) - 1 AS id
, origin, destination
FROM trade;
Subtracting 1 makes the ids start at 0, as in your example.
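
The same idea can be tried out with SQLite through Python's sqlite3 module. This is only a sketch: SQLite has no GREATEST / LEAST, so the two-argument scalar forms of MAX and MIN stand in for them, and window functions require SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trade (origin INTEGER, destination INTEGER)")
conn.executemany(
    "INSERT INTO trade VALUES (?, ?)",
    [(8, 2), (2, 8), (8, 2), (8, 5), (8, 5), (9, 1), (1, 9)],
)

# Sorting by (larger city, smaller city) makes 8/2 and 2/8 tie, so they share a rank.
grouped = conn.execute("""
    SELECT DENSE_RANK() OVER (
               ORDER BY MAX(origin, destination), MIN(origin, destination)
           ) - 1 AS id,
           origin, destination
    FROM trade
""").fetchall()

for row in sorted(grouped):
    print(row)
```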

SQL - Select distinct on two column

I have this table 'words' with more information:
+---------+------------+-----------
| ID |ID_CATEGORY | ID_THEME |
+---------+------------+-----------
| 1 | 1 | 1
| 2 | 1 | 1
| 3 | 1 | 1
| 4 | 1 | 2
| 5 | 1 | 2
| 6 | 1 | 2
| 7 | 2 | 3
| 8 | 2 | 3
| 9 | 2 | 3
| 10 | 2 | 4
| 11 | 2 | 4
| 12 | 3 | 5
| 13 | 3 | 5
| 14 | 3 | 6
| 15 | 3 | 6
| 16 | 3 | 6
And I have this query, which gives me 3 random ids from different categories, but not from different themes too:
SELECT Id
FROM words
GROUP BY Id_Category, Id_Theme
ORDER BY RAND()
LIMIT 3
What I want as result is:
+---------+------------+-----------
| ID |ID_CATEGORY | ID_THEME |
+---------+------------+-----------
| 2 | 1 | 1
| 7 | 2 | 3
| 14 | 3 | 6
That is, repeat no category or theme.
When you use GROUP BY, you cannot reliably include in the select list a column that is neither grouped nor aggregated. So, in your query, Id cannot simply be included in the select list.
So you need to do something a bit more complex:
SELECT Id_Category, Id_Theme,
(SELECT Id FROM Words W
WHERE W.Id_Category = G.Id_Category AND W.Id_Theme = G.Id_Theme
ORDER BY RAND() LIMIT 1
) Id
FROM Words G
GROUP BY Id_Category, Id_Theme
ORDER BY RAND()
LIMIT 3
NOTE: the query groups by the required columns, and the subselect takes a random Id from all the possible Ids in each group. The main query then keeps three random groups.
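
The logic described in the note can be sketched in plain Python as follows. This is an emulation of the steps, not the SQL itself; the function name pick_random_ids and the seeded random generator are assumptions made so the example is repeatable.

```python
import random
from collections import defaultdict


def pick_random_ids(words, how_many, rng):
    # Step 1: group the ids by (category, theme), like GROUP BY Id_Category, Id_Theme.
    ids_by_group = defaultdict(list)
    for word_id, category, theme in words:
        ids_by_group[(category, theme)].append(word_id)
    # Step 2: take one random id per group, like the correlated subselect.
    one_per_group = [
        (rng.choice(ids), category, theme)
        for (category, theme), ids in ids_by_group.items()
    ]
    # Step 3: keep a random subset of the groups, like ORDER BY RAND() LIMIT n.
    return rng.sample(one_per_group, how_many)


# (id, id_category, id_theme) rows from the question's table
words = [
    (1, 1, 1), (2, 1, 1), (3, 1, 1), (4, 1, 2), (5, 1, 2), (6, 1, 2),
    (7, 2, 3), (8, 2, 3), (9, 2, 3), (10, 2, 4), (11, 2, 4),
    (12, 3, 5), (13, 3, 5), (14, 3, 6), (15, 3, 6), (16, 3, 6),
]

picked = pick_random_ids(words, 3, random.Random(0))
for row in picked:
    print(row)
```

As in the query, each returned row comes from a different (category, theme) pair; two rows may still share a category.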