HiveQL: Difference between two closest elements in a column

Let's say I have a very simple table like this:
ID  Integer
A   4
A   9
A   2
B   4
B   7
B   3
I want to group by ID. What would be an appropriate query that returns the minimum difference between any two values within each group, like this:
ID  MIN_DIF
A   2
B   1
Simplicity of the query right now is more important than efficiency, but both the most basic and the most efficient query would be appreciated.
Side note: finding the average distance would be a bonus, but I need the minimum first.

You can use lag() or lead(); since int and integer are reserved words, the value column is called val here:
select id, min(val - prev_val) as min_dif
from (select t.*,
             lag(val) over (partition by id order by val) as prev_val
      from t
     ) t
where prev_val is not null
group by id;
An alternative method avoids window functions, but would probably have much worse performance:
select t.id, min(t2.val - t.val) as min_dif
from t join
     t t2
     on t.id = t2.id
where t2.val > t.val
group by t.id;
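For the average-distance side note, the same lag() approach works; a minimal sketch, again with the value column called val:
select id, avg(val - prev_val) as avg_dif
from (select t.*,
             lag(val) over (partition by id order by val) as prev_val
      from t
     ) t
where prev_val is not null
group by id;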

Related

Compare every field in table to every other field in same table

Imagine a table with only one column.
+------+
| v |
+------+
|0.1234|
|0.8923|
|0.5221|
+------+
I want to do the following for row K:
Take row K=1 value: 0.1234
Count how many values in the rest of the table are less than or equal to the value in row 1.
Iterate through all rows
Output should be:
+------+-------+
| v |output |
+------+-------+
|0.1234| 0 |
|0.8923| 2 |
|0.5221| 1 |
+------+-------+
Quick update: I was using this approach to compute a statistic at every value of v in the above table. The cross join approach was way too slow for the size of data I was dealing with, so instead I computed my statistic for a grid of v values and then matched them to the vs in the original data. v_table is the data table from before and stat_comp is the statistics table.
AS SELECT t1.*
         ,CASE WHEN v <= 1.000000 THEN pr_1
               WHEN v <= 2.000000 AND v > 1.000000 THEN pr_2
               -- remaining grid buckets elided in the original post
          END
FROM v_table AS t1
LEFT OUTER JOIN stat_comp AS t2
-- join condition elided in the original post
Window functions were added to ANSI/ISO SQL in 1999 and to Hive in version 0.11, which was released on 15 May 2013.
What you are looking for is a variation on rank with ties high, which in ANSI/ISO SQL:2011 would look like this:
rank () over (order by v with ties high) - 1
Hive currently does not support with ties ... but the logic can be implemented using count(*) over (...)
select v
,count(*) over (order by v) - 1 as rank_with_ties_high_implicit
from mytable
;
or
select v
,count(*) over
(
order by v
range between unbounded preceding and current row
) - 1 as rank_with_ties_high_explicit
from mytable
;
Generate sample data
select 0.1234 as v into #t
union all
select 0.8923
union all
select 0.5221
This is the query:
;with ct as (
    select ROW_NUMBER() over (order by v) rn
         , v
    from #t
)
select distinct v, a.cnt
from ct ot
outer apply (select count(*) cnt
             from ct
             where ct.rn <> ot.rn
               and ct.v <= ot.v) a
After seeing your edits, it really does look like you could use a Cartesian product, i.e. a CROSS JOIN, here. I called your table foo, and cross joined it to itself as bar:
SELECT foo.v, COUNT(foo.v) - 1 AS output
FROM foo
CROSS JOIN foo bar
WHERE foo.v >= bar.v
GROUP BY foo.v;
Here's a fiddle.
This query cross joins the table with itself so that every ordered pair of the column's elements is returned (you can see this yourself by removing the COUNT and GROUP BY clauses and adding bar.v to the SELECT). It then counts one for each pair where foo.v >= bar.v, and the - 1 removes each row's pairing with itself, yielding the final result.
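For illustration, the pair list that sentence describes can be produced with this sketch (same foo/bar aliases as above):
SELECT foo.v, bar.v AS bar_v
FROM foo
CROSS JOIN foo bar
WHERE foo.v >= bar.v;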
You can take the full Cartesian product of the table with itself and sum a case statement:
select a.x
, sum(case when b.x < a.x then 1 else 0 end) as count_less_than_x
from (select distinct x from T) a
, T b
group by a.x
This will give you one row per unique value in the table with the count of non-unique rows whose value is less than this value.
Notice that there is neither a join nor a where clause. In this case, we actually want that. For each row of a we get a full copy aliased as b. We can then check each one to see whether or not it's less than a.x. If it is, we add 1 to the count. If not, we just add 0.

How to write a LEFT JOIN in BigQuery's Standard SQL?

We have a query that works in BigQuery's Legacy SQL. How do we write it in Standard SQL so it works?
SELECT Hour, Average, L.Key AS Key FROM
(SELECT 1 AS Key, *
FROM test.table_L AS L)
LEFT JOIN
(SELECT 1 AS Key, Avg(Total) AS Average
FROM test.table_R) AS R
ON L.Key = R.Key ORDER BY Hour ASC
Currently the error it gives is:
Equality is not defined for arguments of type ARRAY<INT64> at [4:74]
BigQuery has two modes for queries: Legacy SQL and Standard SQL. We have looked at the BigQuery Standard SQL documentation and also see just one SO answer on Standard SQL joins in BigQuery - but so far, it is unclear to us what the key change needed might be.
Table_L looks like this:
Row Hour
1 A
2 B
3 C
Table_R looks like this:
Row Value
1 10
2 20
3 30
Results Desired:
Row Hour Average(OfR) Key
1 A 20 1
2 B 20 1
3 C 20 1
How do we rewrite this BigQuery Legacy SQL query to work in Standard SQL?
Based on your recent update to the question and the comments, try the query below:
WITH Table_L AS (
SELECT 1 AS Row, 'A' AS Hour UNION ALL
SELECT 2 AS Row, 'B' AS Hour UNION ALL
SELECT 3 AS Row, 'C' AS Hour
),
Table_R AS (
SELECT 1 AS Row, 10 AS Value UNION ALL
SELECT 2 AS Row, 20 AS Value UNION ALL
SELECT 3 AS Row, 30 AS Value
)
SELECT
Row,
Hour,
(SELECT AVG(Value) FROM Table_R) AS AverageOfR,
1 AS Key
FROM Table_L
The above is for testing; the query you should run in "production" is:
SELECT
Row,
Hour,
(SELECT AVG(Value) FROM Table_R) AS AverageOfR,
1 AS Key
FROM Table_L
If for some reason you are bound to a JOIN, use the CROSS JOIN version below:
SELECT
Row,
Hour,
AverageOfR,
1 AS Key
FROM Table_L
CROSS JOIN (SELECT AVG(Value) AS AverageOfR FROM Table_R)
or the LEFT JOIN version below with the Key field involved (in case Key really is important for your logic, which somehow I feel is true):
SELECT
Row,
Hour,
AverageOfR,
L.Key AS Key
FROM (SELECT 1 AS Key, Row, Hour FROM Table_L) AS L
LEFT JOIN (SELECT 1 AS Key, AVG(Value) AS AverageOfR FROM Table_R) AS R
ON L.Key = R.Key
Your error message suggests that key is not a column in table_L. If it is not, then don't include it in the query.
It looks like you simply want the average of the total from table_R. You can approach this as:
SELECT l.*, r.average
FROM test.table_L as l CROSS JOIN
(SELECT Avg(Total) as average
FROM test.table_R
) R
ORDER BY l.hour ASC;

Joining next Sequential Row

I am planning an SQL statement right now and would need someone to look over my thoughts.
This is my Table:
id stat period
--- ------- --------
1 10 1/1/2008
2 25 2/1/2008
3 5 3/1/2008
4 15 4/1/2008
5 30 5/1/2008
6 9 6/1/2008
7 22 7/1/2008
8 29 8/1/2008
Create Table
CREATE TABLE tbstats
(
id INT IDENTITY(1, 1) PRIMARY KEY,
stat INT NOT NULL,
period DATETIME NOT NULL
)
go
INSERT INTO tbstats
(stat,period)
SELECT 10,CONVERT(DATETIME, '20080101')
UNION ALL
SELECT 25,CONVERT(DATETIME, '20080102')
UNION ALL
SELECT 5,CONVERT(DATETIME, '20080103')
UNION ALL
SELECT 15,CONVERT(DATETIME, '20080104')
UNION ALL
SELECT 30,CONVERT(DATETIME, '20080105')
UNION ALL
SELECT 9,CONVERT(DATETIME, '20080106')
UNION ALL
SELECT 22,CONVERT(DATETIME, '20080107')
UNION ALL
SELECT 29,CONVERT(DATETIME, '20080108')
go
I want to calculate the difference between each statistic and the next, and then calculate the mean value of the 'gaps.'
Thoughts:
I need to join each record with its subsequent row. I can do that using the ever-flexible joining syntax, thanks to the fact that I know the id field is an integer sequence with no gaps.
By aliasing the table I could incorporate it into the SQL query twice, then join them together in a staggered fashion by adding 1 to the id of the first aliased table. The first record in the table has an id of 1. 1 + 1 = 2 so it should join on the row with id of 2 in the second aliased table. And so on.
Now I would simply subtract one from the other.
Then I would use the ABS function to ensure that I always get positive integers as a result of the subtraction regardless of which side of the expression is the higher figure.
Is there an easier way to achieve what I want?
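For reference, the staggered self-join plan above as a minimal sketch (it assumes the id sequence really has no gaps):
select a.id,
       abs(b.stat - a.stat) as gap
from tbstats a
join tbstats b
  on b.id = a.id + 1;
Wrapping the abs(...) expression in avg(1.0 * ...) over the same join would give the mean of the gaps.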
The lead analytic function should do the trick:
SELECT period, stat, stat - LEAD(stat) OVER (ORDER BY period) AS gap
FROM tbstats
The average value of the gaps can be computed from just the first and last values: the consecutive differences telescope, so their sum is the last value minus the first, and dividing by one less than the number of elements gives the mean (with the sample data, (29 - 10) / 7 ≈ 2.71):
select sum(case when seqnum = num then stat else -stat end) * 1.0
           / (max(num) - 1) as avg_gap   -- * 1.0 avoids integer division
from (select stat,
             row_number() over (order by period) as seqnum,
             count(*) over () as num
      from tbstats
     ) t
where seqnum = num or seqnum = 1;
Of course, you can also do the calculation using lead(), but this will also work in SQL Server 2005 and 2008.
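For completeness, a minimal lead() version of the same average (this assumes SQL Server 2012 or later, where lead() is available):
select avg(1.0 * gap) as avg_gap
from (select lead(stat) over (order by period) - stat as gap
      from tbstats
     ) t
where gap is not null;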
You can also achieve this by using a join:
SELECT t1.period,
       t1.stat,
       t1.stat - t2.stat AS gap
FROM tbstats t1
LEFT JOIN tbstats t2
       ON t1.id + 1 = t2.id
To calculate the difference between each statistic and the next, LEAD() and LAG() may be the simplest option. You provide an ORDER BY, and LEAD(something) returns the next something and LAG(something) returns the previous something in the given order.
select
x.id thisStatId,
LAG(x.id) OVER (ORDER BY x.id) lastStatId,
x.stat thisStatValue,
LAG(x.stat) OVER (ORDER BY x.id) lastStatValue,
x.stat - LAG(x.stat) OVER (ORDER BY x.id) diff
from tbStats x

Summing a column up to a certain row (using GROUP BY and OVER)?

I have a table that lists the duration of different activities. It looks like
id duration
1 15
2 30
3 30
4 45
5 30
...etc
I want to sum these activities like
for (lastActivity=1 to 5)
SELECT id, SUM(duration) FROM durations
WHERE id<=lastActivity
to produce an output like
id endtime
1 15
2 45
3 75
4 120
5 150
where each row sums the duration of the activities up to its position in the list.
It seems an easy task (and possibly is), but I can't figure out what the SQL should look like to produce such an output. I have tried using GROUP BY together with the OVER clause, but perhaps there's a simpler way of doing this.
SELECT t.id,
       t.duration,
       rt.runningTotal
FROM mytable t
CROSS APPLY (SELECT SUM(duration) AS runningTotal
             FROM mytable
             WHERE id <= t.id) AS rt
ORDER BY t.id
The APPLY operator allows you to invoke a table-valued function for each row returned by an outer table expression of a query. The table-valued function acts as the right input and the outer table expression acts as the left input. The right input is evaluated for each row from the left input and the rows produced are combined for the final output. The list of columns produced by the APPLY operator is the set of columns in the left input followed by the list of columns returned by the right input.
Note: to use APPLY, the database compatibility level must be at least 90. APPLY was introduced in SQL Server 2005.
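As an aside, on SQL Server 2012 and later a windowed SUM expresses the running total directly, with no correlated subquery; a minimal sketch against the question's durations table:
SELECT id,
       SUM(duration) OVER (ORDER BY id
                           ROWS BETWEEN UNBOUNDED PRECEDING
                                    AND CURRENT ROW) AS endtime
FROM durations
ORDER BY id;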
You can use a running total; check this post: Running total in SQL Server (Stack Overflow).
This will degrade depending on how large your actual table is, but it should do the trick. Some interesting reading around this can be found here:
SELECT 1 as id, 15 as num into #test
UNION ALL SELECT 2, 30
UNION ALL SELECT 3, 30
UNION ALL SELECT 4, 45
UNION ALL SELECT 5, 30
select t1.id
      ,MAX(t1.num) as id_num
      ,SUM(t2.num) as running_total
from #test t1
LEFT OUTER JOIN #test t2
     on t2.id <= t1.id
GROUP BY t1.id
Try this:
select d2.id, sum(d1.duration) as endtime
from durations d1, durations d2
where d1.id <= d2.id
group by d2.id

PostgreSQL if query?

Is there a way to select records using an if statement?
My table looks like this:
id | num | dis
1 | 4 | 0.5234333
2 | 4 | 8.2234
3 | 8 | 2.3325
4 | 8 | 1.4553
5 | 4 | 3.43324
And I want to select, for each num, the row where dis is the lowest number. So, a query that will produce the following results:
id | num | dis
1 | 4 | 0.5234333
4 | 8 | 1.4553
If you want all the rows with the minimum value within the group:
SELECT id, num, dis
FROM table1 T1
WHERE dis = (SELECT MIN(dis) FROM table1 T2 WHERE T1.num = T2.num)
Or you could use a join to get the same result:
SELECT T1.id, T1.num, T1.dis
FROM table1 T1
JOIN (
SELECT num, MIN(dis) AS dis
FROM table1
GROUP BY num
) T2
ON T1.num = T2.num AND T1.dis = T2.dis
If you only want a single row from each group, even if there are ties, then you can use this:
SELECT id, dis, num FROM (
SELECT id, dis, num, ROW_NUMBER() OVER (PARTITION BY num ORDER BY dis) rn
FROM table1
) T1
WHERE rn = 1
Unfortunately this won't be very efficient. If you need something more efficient, please see Quassnoi's page on selecting rows with a groupwise maximum for PostgreSQL, where he suggests several ways to perform this query and explains the performance of each. The summary from the article is as follows:
Unlike MySQL, PostgreSQL implements several clean and documented ways to select the records holding group-wise maximums, including window functions and DISTINCT ON.
However, due to the lack of loose index scan support in PostgreSQL's optimizer and the less efficient usage of indexes in PostgreSQL, the queries using these functions take too long.
To work around these problems and improve the queries against low-cardinality grouping conditions, a certain solution described in the article should be used.
This solution uses recursive CTEs to emulate a loose index scan and is very efficient if the grouping columns have low cardinality.
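As a rough illustration of that recursive-CTE technique, here is a minimal sketch adapted to this table (my adaptation, not the article's exact code; it assumes an index on (num, dis) for the approach to pay off):
WITH RECURSIVE t AS (
    SELECT min(num) AS num FROM table1
    UNION ALL
    SELECT (SELECT min(num) FROM table1 WHERE num > t.num)
    FROM t
    WHERE t.num IS NOT NULL
)
SELECT num,
       (SELECT min(dis) FROM table1 WHERE table1.num = t.num) AS dis
FROM t
WHERE num IS NOT NULL;
Each recursion step jumps straight to the next distinct num through the index instead of scanning every row, which is why it wins when num has low cardinality.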
Use this (DISTINCT ON keeps the first row for each num in the given sort order):
SELECT DISTINCT ON (num) id, num, dis
FROM tbl
ORDER BY num, dis
Or, if you intend to use other RDBMSs in the future, use this:
select * from tbl a where dis =
(select min(dis) from tbl b where b.num = a.num)
If you need to have IF logic, you can use PL/pgSQL:
http://www.postgresql.org/docs/8.4/interactive/plpgsql-control-structures.html
But try to solve your issue with SQL first if possible; it will be faster. Use PL/pgSQL only when SQL can't solve your problem.
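For illustration, a minimal PL/pgSQL sketch of IF/ELSE control flow (the function name and threshold are made up for the example):
CREATE FUNCTION classify_dis(d double precision) RETURNS text AS $$
BEGIN
    -- made-up threshold, purely illustrative
    IF d < 1.0 THEN
        RETURN 'low';
    ELSE
        RETURN 'high';
    END IF;
END;
$$ LANGUAGE plpgsql;
-- usage: SELECT num, dis, classify_dis(dis) FROM table1;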