Calculate difference of counter data in Hive - hive

I have counter data stored in a Hive table. The counter increments over time and is sometimes reset to zero.
I want to calculate the difference between consecutive rows, but when the counter is reset the difference is negative. Example data and the expected output:
data: 1, 3, 6, 7, 1, 4
difference: 2, 3, 1, -6, 3, NA
expected: 2, 3, 1, 1, 3, NA
Usually such an operation is done by calculating a lag and subtracting it from the data. When the difference is negative (a reset), the difference should instead be the current value, i.e. the count accumulated since the reset. Here is an example of a function that does this in R/dplyr:
diff_counter <- function(x) {
  # difference between consecutive measurements
  prev <- lag(x)
  dx <- x - prev
  # a negative difference marks a counter reset
  reset_idx <- dx < 0 & !is.na(dx)
  # after a reset, the current value is the increment since the reset
  dx[reset_idx] <- x[reset_idx]
  return(dx)
}
Can I do something similar in Hive?
Regards
Paweł

Assuming that t is your datetime column and the counter gets incremented in that order, you may use a CASE block with the LEAD function like this.
SELECT x,
       CASE
         -- >= 0 so that an unchanged counter yields 0 rather than the next value
         WHEN LEAD(x) OVER (ORDER BY t) - x >= 0
           THEN LEAD(x) OVER (ORDER BY t) - x
         ELSE LEAD(x) OVER (ORDER BY t)
       END AS diff
FROM yourtable;
| X | DIFF |
|---|--------|
| 1 | 2 |
| 3 | 3 |
| 6 | 1 |
| 7 | 1 |
| 1 | 3 |
| 4 | (null) |
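To cross-check the expected column outside Hive, the same forward-difference rule can be sketched in Python. This is only an illustration: the function name `diff_counter` is borrowed from the R snippet in the question, and treating an unchanged counter as a difference of 0 is an assumption.

```python
from typing import List, Optional


def diff_counter(values: List[int]) -> List[Optional[int]]:
    """Forward difference of a counter that may be reset to zero."""
    result: List[Optional[int]] = []
    for current, nxt in zip(values, values[1:]):
        delta = nxt - current
        # A negative delta means the counter was reset in between, so the
        # next reading is the amount accumulated since the reset.
        result.append(delta if delta >= 0 else nxt)
    if values:
        result.append(None)  # the last row has no following row
    return result


print(diff_counter([1, 3, 6, 7, 1, 4]))  # [2, 3, 1, 1, 3, None]
```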

Related

SQL query: Fill missing values with 0

I have a table with gaps in data at certain times (see there is no data between 37 & 46). I need to fill in those gaps with 0 for better display on the frontend.
date | mydata
--------+-----------------
911130 | 10
911131 | 11
911132 | 9
911133 | 6
911134 | 5
911135 | 5
911136 | 10
911137 | 8
911146 | 4
911147 | 5
911148 | 9
911149 | 14
911150 | 8
The times are sequential integers (UNIX timestamps initially). I have aggregated my data into 5 minute time buckets.
The frontend query will pass in a start & end time and aggregate the data into larger buckets. For example:
SELECT
    (five_min_time / 6) AS date,
    SUM(mydata) AS mydata
FROM mydata_table_five_min
WHERE
    five_min_time BETWEEN (1640000000 / 300) AND (1640086400 / 300)
GROUP BY date
ORDER BY date ASC;
I would like to be able to get a result:
date | mydata
--------+-----------------
911130 | 10
911131 | 11
911132 | 9
911133 | 6
911134 | 5
911135 | 5
911136 | 10
911137 | 8
911138 | 0
911139 | 0
911140 | 0
911141 | 0
911142 | 0
911143 | 0
911144 | 0
911145 | 0
911146 | 4
911147 | 5
911148 | 9
911149 | 14
911150 | 8
As a note, this query is being run in AWS Redshift.
I'm not sure whether a recursive CTE works in Redshift, but something like this works in PostgreSQL:
with recursive rcte as (
    select
        min(half_hour) as n,
        max(half_hour) as n_max
    from cte_data
    union all
    select n+1, n_max
    from rcte
    where n < n_max
)
, cte_data as (
    select
        (five_min_time / 6) as half_hour,
        sum(mydata) as mydata
    from mydata_table_five_min
    -- cast to timestamp rather than date so the 12:00 time of day is not dropped
    where five_min_time between (date_part('epoch','2021-12-20 12:00'::timestamp)::int / 300)
                            and (date_part('epoch','2021-12-21 12:00'::timestamp)::int / 300)
    group by half_hour
)
select n as date
    --, to_timestamp(n*6*300) as dt
    , coalesce(t.mydata, 0) as mydata
from rcte c
left join cte_data t
    on t.half_hour = c.n
order by date;
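The idea behind the query (generate every bucket between the minimum and maximum, then left join the aggregated data and default to 0) can be sketched in Python. The name `fill_gaps` and the dictionary input are assumptions made for the illustration, not part of the original query.

```python
from typing import Dict, List, Tuple


def fill_gaps(rows: Dict[int, int], start: int, end: int) -> List[Tuple[int, int]]:
    """Return one (bucket, value) pair per bucket in [start, end], using 0 for gaps."""
    # range() plays the role of the recursive CTE that counts from n to n_max;
    # dict.get(..., 0) plays the role of LEFT JOIN plus COALESCE(mydata, 0).
    return [(bucket, rows.get(bucket, 0)) for bucket in range(start, end + 1)]


data = {911136: 10, 911137: 8, 911146: 4}
filled = fill_gaps(data, min(data), max(data))
print(filled[:4])  # [(911136, 10), (911137, 8), (911138, 0), (911139, 0)]
```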

Calculate moving average using SQL window functions with leading nulls where not enough data is available

I want to calculate a moving average using SQL window functions. The following example of a 2 "day" moving average basically works fine, but it also calculates an average if only one data point is available. I would rather have the average be null as long as not enough data is available.
create table average(
    nr int,
    value float
);
insert into average values (1, 2), (2, 4), (3, 6), (3, 8), (4, 10);
SELECT
    nr,
    value,
    AVG(value) OVER (ORDER BY nr ROWS BETWEEN 1 PRECEDING AND 0 FOLLOWING)::FLOAT AS "Moving-Average-2"
FROM average;
result:
1 2 2
2 4 3
3 6 5
3 8 7
4 10 9
expected result:
1 2 null
2 4 3
3 6 5
3 8 7
4 10 9
EDIT 1:
Of course, the window size can be anything, not only 2.
You could use another window function (COUNT()) to make sure that at least two records are available in the window before doing the computation, like:
SELECT
nr,
value,
CASE WHEN COUNT(*) OVER(ORDER BY nr ROWS BETWEEN 1 PRECEDING AND 0 FOLLOWING) > 1
THEN AVG(value) OVER (ORDER BY nr ROWS BETWEEN 1 PRECEDING AND 0 FOLLOWING)::FLOAT
ELSE NULL
END AS "Moving-Average-2"
FROM average;
Demo on DB Fiddle:
| nr | value | Moving-Average-2 |
| --- | ----- | ---------------- |
| 1 | 2 | |
| 2 | 4 | 3 |
| 3 | 6 | 5 |
| 3 | 8 | 7 |
| 4 | 10 | 9 |
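The COUNT-based guard generalizes to any window size: emit NULL until the frame holds the full number of rows. A minimal Python sketch of that rule follows; the function name `moving_average` and the `window` parameter are assumptions for the illustration.

```python
from typing import List, Optional


def moving_average(values: List[float], window: int) -> List[Optional[float]]:
    """Trailing moving average that yields None until `window` rows are available."""
    if window < 1:
        raise ValueError("window must be at least 1")
    result: List[Optional[float]] = []
    for i in range(len(values)):
        if i + 1 < window:
            # Fewer than `window` rows in the frame: same as COUNT(*) < window in SQL.
            result.append(None)
        else:
            frame = values[i + 1 - window : i + 1]
            result.append(sum(frame) / window)
    return result


print(moving_average([2, 4, 6, 8, 10], 2))  # [None, 3.0, 5.0, 7.0, 9.0]
```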
Since you happen to form the average only between 1 preceding row and the current one, just using lag() might be simplest:
select nr, value
,(value + lag(value, 1, NULL) OVER (ORDER BY nr)) / 2 AS "Moving-Average-2"
from average;
lag() has an overloaded variant that lets you provide a default value (as 3rd parameter) in case there is no row. Provide NULL and you are there. Or, since NULL is the default default anyway, just:
... ,(value + lag(value) OVER (ORDER BY nr)) / 2 AS "Moving-Average-2"
While the underlying table column is of type float, you need no cast to float in this case.
This is assuming the column value is defined NOT NULL (like indicated by your sample data). Else you also get NULL where the previous row has value IS NULL and the current row has a value, while avg() returns the value in this case! (Or this may be what you want anyway, given your question.)
This may be a handy place to use a window specification:
select a.*,
(case when row_number() over w > 1
then avg(value) over w
end) as running_average
from average a
window w as (order by nr rows between 1 preceding and current row);
I don't think avg() will return NULL for the first row on its own; otherwise the query below, using BETWEEN 1 PRECEDING AND CURRENT ROW, would be enough:
select nr, value,
avg(value) OVER (ORDER BY nr ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) AS "Moving-Average-2"
from average;
But you can handle it by using CASE WHEN (this assumes the first row has nr = 1):
select nr, value,
case when nr=1 then null else
avg(value) OVER (ORDER BY nr ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) end AS "Moving-Average-2"
from average;
nr value Moving-Average-2
1 2
2 4 3
3 6 5
3 8 7
4 10 9

Oracle SQL - Efficiently calculate number of concurrent phone calls

I know that this question is essentially a duplicate of an older question I asked but quite a few things changed since I asked that question so I thought I'd ask a new question about it.
I have a table that holds phone call records which has the following fields:
END: Holds the timestamp of when a call ended - Data Type: DATE
LINE: Holds the phone line that was used for a call - Data Type: NUMBER
CALLDURATION: Holds the duration of a call in seconds - Data Type: NUMBER
The table has entries like this:
END LINE CALLDURATION
---------------------- ------------------- -----------------------
25/01/2012 14:05:10 6 65
25/01/2012 14:08:51 7 1142
25/01/2012 14:20:36 5 860
I need to create a query that returns the number of concurrent phone calls based on the data from that table. The query should calculate that number in different intervals. What I mean by that is that the results of the query should only contain a new entry whenever a call was started or ended. As long as the number of concurrent phone calls stays the same there should not be any additional entry in the output.
To make this more clear, here is an example of everything the query should return based on the example entries from the previous table:
TIMESTAMP LINE CALLDURATION STATUS CURRENTLYUSEDLINES
---------------------- ----- ------------- ------- -------------------
25/01/2012 13:49:49 7 1142 1 1
25/01/2012 14:04:05 6 65 1 2
25/01/2012 14:05:10 6 65 -1 1
25/01/2012 14:06:16 5 860 1 2
25/01/2012 14:08:51 7 1142 -1 1
25/01/2012 14:20:36 5 860 -1 0
I got the following example query from a colleague, but unfortunately I do not fully understand it. It also does not work exactly as it should: for calls with a duration of 0 seconds it sometimes shows "-1" in the CURRENTLYUSEDLINES column:
SELECT COALESCE (SUM (STATUS) OVER (ORDER BY END ROWS BETWEEN UNBOUNDED PRECEDING AND 0 PRECEDING), 0) CURRENTLYUSEDLINES
FROM (SELECT END - CALLDURATION / 86400 AS TIMESTAMP,
LINE,
CALLDURATION,
1 AS STATUS
FROM t_calls
UNION ALL
SELECT END,
LINE,
CALLDURATION,
-1 AS STATUS
FROM t_calls) t
ORDER BY 1;
Now I am supposed to make that query work like in the example but I'm not sure how to do that.
Could someone help me out with this or at least explain this query so I can try fixing it myself?
I think this will solve your problem:
SELECT TIMESTAMP,
SUM(SUM(STATUS)) OVER (ORDER BY TIMESTAMP) as CURRENTLYUSEDLINES
FROM ((SELECT END - CALLDURATION / (24*60*60) AS TIMESTAMP,
COUNT(*) AS STATUS
FROM t_calls
GROUP BY END - CALLDURATION / (24*60*60)
) UNION ALL
(SELECT END, - COUNT(*) AS STATUS
FROM t_calls
GROUP BY END
)
) t
GROUP BY TIMESTAMP
ORDER BY 1;
This is a slight simplification of your query. But by doing all the aggregations, you should be getting 0s, but not negative values.
You are getting negative values because the "ends" of the calls are being processed before the "starts". This query does all the work "at the same time", because there is only one row per timestamp.
You can use an UNPIVOT (using a similar technique to my answer here):
SQL Fiddle
Oracle 11g R2 Schema Setup:
CREATE TABLE table_name ( END, LINE, CALLDURATION ) AS
SELECT CAST( TIMESTAMP '2012-01-25 14:05:10' AS DATE ), 6, 65 FROM DUAL UNION ALL
SELECT CAST( TIMESTAMP '2012-01-25 14:08:51' AS DATE ), 7, 1142 FROM DUAL UNION ALL
SELECT CAST( TIMESTAMP '2012-01-25 14:20:36' AS DATE ), 5, 860 FROM DUAL;
Query 1:
SELECT p.*,
SUM( status ) OVER ( ORDER BY dt, status DESC ) AS currentlyusedlines
FROM (
SELECT end - callduration / 86400 As dt,
t.*
FROM table_name t
)
UNPIVOT( dt FOR status IN ( dt As 1, end AS -1 ) ) p
Results:
| LINE | CALLDURATION | STATUS | DT | CURRENTLYUSEDLINES |
|------|--------------|--------|----------------------|--------------------|
| 7 | 1142 | 1 | 2012-01-25T13:49:49Z | 1 |
| 6 | 65 | 1 | 2012-01-25T14:04:05Z | 2 |
| 6 | 65 | -1 | 2012-01-25T14:05:10Z | 1 |
| 5 | 860 | 1 | 2012-01-25T14:06:16Z | 2 |
| 7 | 1142 | -1 | 2012-01-25T14:08:51Z | 1 |
| 5 | 860 | -1 | 2012-01-25T14:20:36Z | 0 |
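The running-sum logic can be checked with a small Python sketch: turn each call into a start event (+1) and an end event (-1), sort by time with starts before ends, and accumulate. The function name `concurrent_calls` and the tuple layout are assumptions for the illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Call = Tuple[datetime, int, int]          # (end time, line, duration in seconds)
Event = Tuple[datetime, int, int, int]    # (timestamp, line, status, lines in use)


def concurrent_calls(calls: List[Call]) -> List[Event]:
    """Return one row per call start/end with the running number of lines in use."""
    events = []
    for end, line, duration in calls:
        start = end - timedelta(seconds=duration)
        events.append((start, 1, line))
        events.append((end, -1, line))
    # Sort by time, then status descending, so a start is counted before an end
    # at the same timestamp (this keeps zero-length calls from going negative).
    events.sort(key=lambda e: (e[0], -e[1]))
    running = 0
    result = []
    for timestamp, status, line in events:
        running += status
        result.append((timestamp, line, status, running))
    return result


calls = [
    (datetime(2012, 1, 25, 14, 5, 10), 6, 65),
    (datetime(2012, 1, 25, 14, 8, 51), 7, 1142),
    (datetime(2012, 1, 25, 14, 20, 36), 5, 860),
]
rows = concurrent_calls(calls)
print([r[3] for r in rows])  # [1, 2, 1, 2, 1, 0]
```

Sorting on the status as a tie-breaker corresponds to `ORDER BY dt, status DESC` in the query above.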

Disassemble string, group, and reconstruct in Oracle SQL

So here is what a sample of my data look like:
ID | Amount
1111-1 | 5
1111-1 | -5
1111-2 | 5
1111-2 | -5
12R-1 | 8
12R-1 | -8
12R-3 | 8
12R-3 | -8
54A73-1| 2
54A73-1| -2
54A73-2| 2
54A73-2| -1
What I want to do is group by the string in the ID column before the dash, and find the group of IDs that have a sum of zero. The kicker is that after I find which group of IDs sum to zero, I want to add back the dash and number following the dash.
Here is what I hope the solution to look like:
ID | Amount
1111-1 | 5
1111-1 | -5
1111-2 | 5
1111-2 | -5
12R-1 | 8
12R-1 | -8
12R-3 | 8
12R-3 | -8
Notice how the IDs starting with 54A73 are not there anymore; that's because the sum of their Amounts is not equal to zero.
Any help solving this question would be much appreciated!
Here's one option joining the table back to itself after grouping by the beginning part of the id field using left and locate:
MySQL Version
select id, amount
from yourtable t
join (
select left(id, locate('-', id)-1) shortid
from yourtable
group by left(id, locate('-', id)-1)
having sum(amount) = 0
) t2 on left(t.id, locate('-', t.id)-1) = t2.shortid
Oracle Version
select id, amount
from yourtable t
join (
select substr(id, 0, instr(id,'-')-1) shortid
from yourtable
group by substr(id, 0, instr(id,'-')-1)
having sum(amount) = 0
) t2 on substr(t.id, 0, instr(t.id,'-')-1) = t2.shortid
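The same two-step approach (sum per prefix, then keep the original rows whose prefix sums to zero) can be sketched in Python to verify the expected output. The function name `rows_in_zero_sum_groups` is hypothetical; the prefix is taken as the text before the first dash, as in the queries above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rows_in_zero_sum_groups(rows: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Keep the rows whose ID prefix (text before the first dash) sums to zero."""
    totals: Dict[str, int] = defaultdict(int)
    for full_id, amount in rows:
        totals[full_id.split("-", 1)[0]] += amount
    # Second pass: the original rows are returned unchanged, so the dash and
    # the number after it never have to be re-attached.
    return [(full_id, amount) for full_id, amount in rows
            if totals[full_id.split("-", 1)[0]] == 0]


sample = [
    ("1111-1", 5), ("1111-1", -5), ("1111-2", 5), ("1111-2", -5),
    ("12R-1", 8), ("12R-1", -8), ("12R-3", 8), ("12R-3", -8),
    ("54A73-1", 2), ("54A73-1", -2), ("54A73-2", 2), ("54A73-2", -1),
]
kept = rows_in_zero_sum_groups(sample)
print(sorted({full_id for full_id, _ in kept}))  # ['1111-1', '1111-2', '12R-1', '12R-3']
```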

Joining series of dates and counting continuous days

Let's say I have a table as below
date add_days
2015-01-01 5
2015-01-04 2
2015-01-11 7
2015-01-20 10
2015-01-30 1
What I want to do is compute days_balance, i.e. check whether each date is later or earlier than the previous date + N days (add_days), and accumulate the day count when the rows form a continuous series.
So the algorithm should work like
for i in 2:N_rows {
days_balance[i] := date[i-1] + add_days[i-1] - date[i]
if days_balance[i] >= 0 then
date[i] := date[i] + days_balance[i]
}
The expected result should be as follows
date days_balance
2015-01-01 0
2015-01-04 2
2015-01-11 -3
2015-01-20 -2
2015-01-30 0
Is it possible in pure SQL? I imagine it should be with some conditional joins, but cannot see how it could be implemented.
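For reference, a literal Python transcription of the pseudocode above reproduces the expected table. It is a sketch under two assumptions: the first row's balance is taken as 0 (as in the expected result), and the function name `days_balance_literal` is made up for the illustration.

```python
from datetime import date, timedelta
from typing import List, Tuple


def days_balance_literal(rows: List[Tuple[date, int]]) -> List[int]:
    """Apply the loop from the question: a non-negative balance pushes date[i] forward."""
    dates = [d for d, _ in rows]
    add_days = [a for _, a in rows]
    balances = [0]  # assumption: the first row has no predecessor, so its balance is 0
    for i in range(1, len(rows)):
        balance = (dates[i - 1] + timedelta(days=add_days[i - 1]) - dates[i]).days
        if balance >= 0:
            # The (possibly shifted) date is what the next iteration compares against.
            dates[i] = dates[i] + timedelta(days=balance)
        balances.append(balance)
    return balances


table = [
    (date(2015, 1, 1), 5),
    (date(2015, 1, 4), 2),
    (date(2015, 1, 11), 7),
    (date(2015, 1, 20), 10),
    (date(2015, 1, 30), 1),
]
print(days_balance_literal(table))  # [0, 2, -3, -2, 0]
```

Note that a negative balance is not carried forward here, because the date is only adjusted when the balance is non-negative.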
I'm posting another answer since it may be useful to compare the two: this one does an n^2-style join, while the other one uses a recursive CTE. It takes advantage of the fact that you don't have to calculate days_balance for every previous row before calculating it for a particular row; you only need to sum values from the previous rows.
drop table junk
create table junk(date DATETIME, add_days int)
insert into junk values
('2015-01-01',5 ),
('2015-01-04',2 ),
('2015-01-11',7 ),
('2015-01-20',10 ),
('2015-01-30',1 )
;WITH cte as
(
select ROW_NUMBER() OVER (ORDER BY date) i, date, add_days, ISNULL(DATEDIFF(DAY, LAG(date) OVER (ORDER BY date), date), 0) days_since_prev
FROM Junk
)
, combinedWithAllPreviousDaysCte as
(
select i [curr_i], date [curr_date], add_days [curr_add_days], days_since_prev [curr_days_since_prev], 0 [prev_add_days], 0 [prev_days_since_prev] from cte where i = 1 --get first row explicitly since it has no preceding rows
UNION ALL
select curr.i [curr_i], curr.date [curr_date], curr.add_days [curr_add_days], curr.days_since_prev [curr_days_since_prev], prev.add_days [prev_add_days], prev.days_since_prev [prev_days_since_prev]
from cte curr
join cte prev on curr.i > prev.i --join to all previous days
)
select curr_i, curr_date, SUM(prev_add_days) - curr_days_since_prev - SUM(prev_days_since_prev) [days_balance]
from combinedWithAllPreviousDaysCte
group by curr_i, curr_date, curr_days_since_prev
order by curr_i
outputs:
+--------+-------------------------+--------------+
| curr_i | curr_date | days_balance |
+--------+-------------------------+--------------+
| 1 | 2015-01-01 00:00:00.000 | 0 |
| 2 | 2015-01-04 00:00:00.000 | 2 |
| 3 | 2015-01-11 00:00:00.000 | -3 |
| 4 | 2015-01-20 00:00:00.000 | -5 |
| 5 | 2015-01-30 00:00:00.000 | -5 |
+--------+-------------------------+--------------+
Well, I think I have it with a recursive CTE (sorry, I only have Microsoft SQL Server available to me at the moment, so it may not comply with PostgreSQL).
Also I think the expected results you had were off (see comment above). If not, this can probably be modified to conform to your math.
drop table junk
create table junk(date DATETIME, add_days int)
insert into junk values
('2015-01-01',5 ),
('2015-01-04',2 ),
('2015-01-11',7 ),
('2015-01-20',10 ),
('2015-01-30',1 )
;WITH cte as
(
select ROW_NUMBER() OVER (ORDER BY date) i, date, add_days, ISNULL(DATEDIFF(DAY, LAG(date) OVER (ORDER BY date), date), 0) days_since_prev
FROM Junk
)
,recursiveCte (i, date, add_days, days_since_prev, days_balance, math) as
(
select top 1
i,
date,
add_days,
days_since_prev,
0 [days_balance],
CAST('no math for initial one, just has zero balance' as varchar(max)) [math]
from cte where i = 1
UNION ALL --recursive step now
select
curr.i,
curr.date,
curr.add_days,
curr.days_since_prev,
prev.days_balance - curr.days_since_prev + prev.add_days [days_balance],
CAST(prev.days_balance as varchar(max)) + ' - ' + CAST(curr.days_since_prev as varchar(max)) + ' + ' + CAST(prev.add_days as varchar(max)) [math]
from cte curr
JOIN recursiveCte prev ON curr.i = prev.i + 1
)
select i, DATEPART(day,date) [day], add_days, days_since_prev, days_balance, math
from recursiveCTE
order by date
And the results are like so:
+---+-----+----------+-----------------+--------------+------------------------------------------------+
| i | day | add_days | days_since_prev | days_balance | math |
+---+-----+----------+-----------------+--------------+------------------------------------------------+
| 1 | 1 | 5 | 0 | 0 | no math for initial one, just has zero balance |
| 2 | 4 | 2 | 3 | 2 | 0 - 3 + 5 |
| 3 | 11 | 7 | 7 | -3 | 2 - 7 + 2 |
| 4 | 20 | 10 | 9 | -5 | -3 - 9 + 7 |
| 5 | 30 | 1 | 10 | -5 | -5 - 10 + 10 |
+---+-----+----------+-----------------+--------------+------------------------------------------------+
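The recurrence used by the recursive CTE (previous balance, minus days since the previous row, plus the previous row's add_days) can be sketched in Python to confirm the days_balance column. The function name `running_days_balance` is an assumption for the illustration.

```python
from datetime import date
from typing import List, Optional, Tuple


def running_days_balance(rows: List[Tuple[date, int]]) -> List[int]:
    """balance[i] = balance[i-1] - days_since_prev[i] + add_days[i-1], starting at 0."""
    balances: List[int] = []
    balance = 0
    prev_date: Optional[date] = None
    prev_add = 0
    for current_date, add_days in rows:
        if prev_date is not None:
            days_since_prev = (current_date - prev_date).days
            balance = balance - days_since_prev + prev_add
        balances.append(balance)
        prev_date, prev_add = current_date, add_days
    return balances


junk = [
    (date(2015, 1, 1), 5),
    (date(2015, 1, 4), 2),
    (date(2015, 1, 11), 7),
    (date(2015, 1, 20), 10),
    (date(2015, 1, 30), 1),
]
print(running_days_balance(junk))  # [0, 2, -3, -5, -5]
```

Unlike the loop in the question, this version carries a negative balance forward, which is why rows 4 and 5 come out as -5 rather than -2 and 0.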
I don't quite get how your algorithm returns your expected results, but let me share a technique I came up with that might help.
This will only work if the end result of your data is to be exported to Excel, and even then it won't work in all scenarios, depending on the format you export your dataset in. But here it is.
If you're familiar with Excel formulas: what I discovered is that if you write an Excel formula in your SQL as another field, Excel will execute that formula for you as soon as you export the data (the method that works best for me is just copying and pasting the result into Excel, so that it isn't formatted as text).
So for your example, here's what you could do (noting again that I don't understand your algorithm, so this is probably wrong, but it's just to give you the concept):
SELECT
date
, add_days
, '=INDEX($1:$65536,ROW()-1,COLUMN()-2)'
||'+INDEX($1:$65536,ROW()-1,COLUMN()-1)'
||'-INDEX($1:$65536,ROW(),COLUMN()-2)'
AS "days_balance[i]"
,'=IF(INDEX($1:$65536,ROW(),COLUMN()-1)>=0'
||',INDEX($1:$65536,ROW(),COLUMN()-3)'
||'+INDEX($1:$65536,ROW(),COLUMN()-1))'
AS "date[i]"
FROM
myTable
ORDER BY /*Ensure to order by whatever you need for your formula to work*/
The key part to making this work is using the INDEX formula function to select a cell based on the position of the current cell. So ROW()-1 means "get the result of the previous record", and COLUMN()-2 means "take the value from two columns to the left of the current one". You can't use plain cell references like A2+B2-A3, because the row numbers won't change on export and they assume fixed column positions.
I used SQL string concatenation with || just so it's easier to read on screen.
I tried this one in Excel; it didn't match your expected results. But if this technique works for you, just correct the Excel formula to suit.