Reference the output of a calculated column in Hive SQL - sql

I have a self-referencing/recursive calculation in Excel that needs to be moved to Hive SQL. Basically the column needs to SUM the two values only if the total of the concrete column plus the result from the previous calculation is greater than 0.
The data is as follows, A is the value and B is the expected output:
| A | B |
|-----|-----|
| -1 | 0 |
| 2 | 2 |
| -2 | 0 |
| 2 | 2 |
| 2 | 4 |
| -1 | 3 |
| 2 | 5 |
In Excel it would be written in column B as:
=MAX(0,B1+A2)
The problem in SQL is you need to have the output of the current calculation. I think I've got it sorted in SQL as the following:
DECLARE #Numbers TABLE(A INT, Rn INT)
INSERT INTO #Numbers VALUES (-1,1),(2,2),(-2,3),(2,4),(2,5),(-1,6),(2,7);
WITH lagged AS
(
SELECT A, 0 AS B, Rn
FROM #Numbers
WHERE Rn = 1
UNION ALL
SELECT i.A,
CASE WHEN ((i.A + l.B) >= 0) THEN (i.A + l.B)
ELSE l.B
END,
i.Rn
FROM #Numbers i INNER JOIN lagged l
ON i.Rn = l.Rn + 1
)
SELECT *
FROM lagged;
But this being Hive, it doesn't support CTEs so I need to dumb the SQL down a touch. Is that possible using LAG/LEAD? My brain is hurting having got this far!

I initially thought that it would help to first compute the Sum of all elements until each rank and then fix the values somehow using negative elements.
However, one big negative that would zero the B column will carry forward in the sum and will make all following elements negative.
It's as Gordon commented - 0 is max in the calculation =MAX(0,B1+A2) depends on the previous location where it happened and it seems to be impossible to compute them in advance analytically.

Related

Split a quantity into multiple rows with limit on quantity per row

I have a table of ids and quantities that looks like this:
dbo.Quantity
id | qty
-------
1 | 3
2 | 6
I would like to split the quantity column into multiple lines and number them, but with a set limit (which can be arbitrary) on the maximum quantity allowed for each row.
So for the value of 2, expected output should be:
dbo.DesiredResult
id | qty | bucket
---------------
1 | 2 | 1
1 | 1 | 2
2 | 1 | 2
2 | 2 | 3
2 | 2 | 4
2 | 1 | 5
In other words,
Running SELECT id, SUM(qty) as qty FROM dbo.DesiredResult should return the original table (dbo.Quantity).
Running
SELECT id, SUM(qty) as qty FROM dbo.DesiredResult GROUP BY bucket
should give you this table.
id | qty | bucket
------------------
1 | 2 | 1
1 | 2 | 2
2 | 2 | 3
2 | 2 | 4
2 | 1 | 5
I feel I can do this with cursors imperitavely, looping over each row, keeping a counter that increments and resets as the "max" for each is filled. But this is very "anti-SQL" I feel there is a better way around this.
One approach is recursive CTE which emulates cursor sequentially going through rows.
Another approach that comes to mind is to represent your data as intervals and intersections of intervals.
Represent this:
id | qty
-------
1 | 3
2 | 6
as intervals [0;3), [3;9) with ids being their labels
0123456789
|--|-----|
1 2 - id
It is easy to generate this set of intervals using running total SUM() OVER().
Represent your buckets also as intervals [0;2), [2;4), [4;6), etc. with their own labels
0123456789
|-|-|-|-|-|
1 2 3 4 5 - bucket
It is easy to generate this set of intervals using a table of numbers.
Intersect these two sets of intervals preserving information about their labels.
Working with sets should be possible in a set-based SQL query, rather than a sequential cursor or recursion.
It is bit too much for me to write down the actual query right now. But, it is quite possible that ideas similar to those discussed in Packing Intervals by Itzik Ben-Gan may be useful here.
Actually, once you have your quantities represented as intervals you can generate required number of rows/buckets on the fly from the table of numbers using CROSS APPLY.
Imagine we transformed your Quantity table into Intervals:
Start | End | ID
0 | 3 | 1
3 | 9 | 2
And we also have a table of numbers - a table Numbers with column Number with values from 0 to, say, 100K.
For each Start and End of the interval we can calculate the corresponding bucket number by dividing the value by the bucket size and rounding down or up.
Something along these lines:
SELECT
Intervals.ID
,A.qty
,A.Bucket
FROM
Intervals
CROSS APPLY
(
SELECT
Numbers.Number + 1 AS Bucket
,#BucketSize AS qty
-- it is equal to #BucketSize if the bucket is completely within the Start and End boundaries
-- it should be adjusted for the first and last buckets of the interval
FROM Numbers
WHERE
Numbers.Number >= Start / #BucketSize
AND Numbers.Number < End / #BucketSize + 1
) AS A
;
You'll need to check and adjust formulas for errors +-1.
And write some CASE WHEN logic for calculating the correct qty for the buckets that happen to be on the lower and upper boundary of the interval.
Use a recursive CTE:
with cte as (
select id, 1 as n, qty
from t
union all
select id, n + 1, qty
from cte
where n + 1 < qty
)
select id, n
from cte;
Here is a db<>fiddle.

How to integrate over segments using SQL

I have a table with columns t_b; t_e; x were [t_b, t_e) denotes a period during which x resources where used. I want to compute a table were for each hour h I have amount of resources that where used during [h, h+1) period.
So far my only idea was to generate multiple rows from each input row for each hour (I use an extension of SQL with UDFs) and then simply group by by hour, but I'm afraid this may be too slow considering large amount of data at hand.
Say for example I have a table with two rows:
+-----+-----+---+
| t_b | t_e | x |
+-----+-----+---+
| 1 | 3.5 | a |
| 0.5 | 4 | b |
+-----+-----+---+
Then resulting table should be:
+---+-------------+
| h | x |
+---+-------------+
| 0 | 0*a + 0.5*b |
| 1 | 1*a + 1*b |
| 2 | 1*a + 1*b |
| 3 | 0.5*a + 1*b |
+---+-------------+
You can have a trigger on insert into the stats table that also adds to the aggregate table (the per-hour sums).
If you also need to convert the existing data, you need to run over every row of your current table, split it into amounts/hours and add to the aggregate table.
This is an sql-server example for all number columns
with h as (
-- your hours tally here
select top(24) row_number() over(order by (select null)) eoh from sys.all_objects
), myTable as (
select 1 t_b, 3.5 t_e, 20 v union all
select 0.5, 4, 40
)
select eoh-1 h_starth
, sum(v * (case when t_e < eoh then t_e else eoh end - case when t_b > eoh-1 then t_b else eoh-1 end)) usage
from h
left join myTable t on t_e > eoh - 1 and eoh > t_b -- [..) intresection with [..)
group by eoh;
Fiddle

Calculate overall percentage of Access Query

I have an MS Access Query which returns the following sample data:
+-----+------+------+
| Ref | ANS1 | ANS2 |
+-----+------+------+
| 123 | A | A |
| 234 | B | B |
| 345 | C | C |
| 456 | D | E |
| 567 | F | G |
| 678 | H | I |
+-----+------+------+
Is it possible to have Access return the overall percentage where ANS1 = ANS2?
So my new query would return:
50
I know how to get a count of the records returned by the original query, but not how to calculate the percentage.
Since you're looking for a percentage of some condition being met across the entire dataset, the task can be reduced to having a function return either 1 (when the condition is validated), or 0 (when the condition is not validated), and then calculating an average across all records.
This could be achieved in a number of ways, one example might be to use a basic iif statement:
select avg(iif(t.ans1=t.ans2,1,0)) from YourTable t
Or, using the knowledge that a boolean value in MS Access is represented using -1 (True) or 0 (False), the expression can be reduced to:
select -avg(t.ans1=t.ans2) from YourTable t
In each of the above, change YourTable to the name of your table.
If you know how to get a count, then apply that same knowledge twice:
SELECT Count([ANS1]) As MatchCount FROM [Data]
WHERE [ANS1] = [ANS2]
divided by the total count
SELECT Count([ANS1]) As AllCount FROM [Data]
To combine both of these in a basic SQL query, one needs a "dummy" query since Access doesn't allow selection of only raw data:
SELECT TOP 1
((SELECT Count([ANS1]) As MatchCount FROM [Data] WHERE [ANS1] = [ANS2])
/
(SELECT Count([ANS1]) As AllCount FROM [Data]))
AS MatchPercent
FROM [Data]
This of course assumes that there is at least one row... so it doesn't divide by zero.

repeating / duplicating query entries based on a table value

Related to / copied from this PostgreSQL topic: so-link
Let's say I have a table with two rows
id | value |
----+-------+
1 | 2 |
2 | 3 |
I want to write a query that will duplicate (repeat) each row based on
the value. I want this result (5 rows total):
id | value |
----+-------+
1 | 2 |
1 | 2 |
2 | 3 |
2 | 3 |
2 | 3 |
How is this possible in SQL Anywhere (Sybase SQL)?
The easiest way to do this is to have a numbers table . . . one that generates integers. Perhaps you have one handy. There are other ways. For instance, using a recursive CTE:
with numbers as (
select 1 as n
union all
select n + 1
from numbers
where n < 100
)
select t.*
from yourtable t join
numbers n
on n.n <= value;
Not all versions of Sybase necessarily support recursive CTEs There are other ways to generate such a table or you might already have one handy.

Finding the row with most common attribute using SQL

I have the following table in my database:
user_id | p1 | p2 | p3
1 | x | y | z
2 | x | x | x
3 | y | y | z
I need to find the row(s) that contains the most common value between that same row.
i.e., the first row has no common value, the second contain three common values and the third one contains two common values.
Then, the output in this case should be
user_id | p1 | p2 | p3
2 | x | x | x
Any ideas?
(It would be nice if the solution did not require a vendor-specific feature, but anything will help).
For a non vendor specific solution You could do
SELECT *
FROM YourTable
ORDER BY
CASE WHEN p1=p2 THEN 1 ELSE 0 END +
CASE WHEN p1=p3 THEN 1 ELSE 0 END +
CASE WHEN p2=p3 THEN 1 ELSE 0 END DESC
And then LIMIT, TOP, ROW_NUMBER or whatever dependant upon RDBMS to just get the top row.
But if you have a specific RDBMS in mind there may be other ways that are more maintainable for larger number of columns (e.g. for SQL Server 2008)
SELECT TOP 1 *
FROM YourTable
ORDER BY
(SELECT COUNT (DISTINCT p) FROM (VALUES(p1),(p2),(p3)) T(p))
Also how do you want ties handled?