Hiveql aggregation :difference between two values - hive

I'm using hive coupled with hadoop.I'm lookinfor a function (hiveql) which permit to have difference between the last/first value of the day.The data has recorded every 5 minute (Gauge or Counter which increments) for each resource name and i want to have an aggregation having one value per day per resource name (mac).
illustration

Use analytic functions first_value and last_value:
select colldate,
max(last_success - first_success) as success,
max(last_conmiss - first_conmiss) as conmiss,
resourcename
from
(
select
first_value(success) over(partition by resourcename, colldate order by colltime) first_success,
last_value(success) over(partition by resourcename, colldate order by colltime) last_success,
first_value(conmiss) over(partition by resourcename, colldate order by colltime) first_conmiss,
last_value(conmiss) over(partition by resourcename, colldate order by colltime) last_conmiss,
colldate, resourcename
from (select s.*, to_date(s.colltime) colldate from table s) s
)s
group by colldate, resourcename;
See docs here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

Related

How to override past records in SQL?

I have a SQL table where one field can have the same value in multiple records. For example:
I would like to select the EndTime and UserId of each UserID once. If there are multiple records with the same UserID  I want to select the most recent record and override the past ones.
I'm working with Sqlite in python.
You can use a window function such as ROW_NUMBER()
SELECT EndTime, userID
FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY userID ORDER BY StartTime DESC) AS rn,
t.*
FROM t
) AS tt
WHERE rn = 1
where grouping by userID is provided by PARTITION BY userID, and picking the recent values is provided through sorting descendingly by StartTime
Demo
All you need is aggregation:
SELECT UserID, MAX(StartTime) StartTime, EndTime
FROM tablename
GROUP BY UserID;
because SQLite will return for each UserID the row with the max StartTime.
See the demo.

Rank() based on column entries while the data is ordered by date

I'm trying to use dense_rank() function over the pagename column after the data is ordered by time_id.
Expected output in rank column, rn, is: [1,2,2,3,4].
Currently I wrote it as:
with tbl2 as
(select UID, pagename, date_id, time_id, source--, dense_rank() over(partition by UID order by pagename) as rn
from tbl1
order by time_id)
select *, dense_rank() over(partition by UID order by time_id, pagename) as rn
from tbl2
Any help would be appreciated
Edit 1: What I am trying to achieve here is to give ranks, as per the user on-screen action flow, to the pages that are visited. Suppose if the same page 'A' is visited back after visiting a different page 'B' then the ranks for these page visits A, B, A will be 1,2,3 (note that the same page A has different ranks 1 & 3)
step-by-step demo:db<>fiddle
SELECT
*,
SUM(is_diff) OVER (ORDER BY date_id, time_id, page)
FROM (
SELECT
*,
CASE WHEN page = lag(page) over (order by date_id, time_id) THEN 0 ELSE 1 END as is_diff
FROM mytable
)s
This looks exactly like a problem I asked some years ago: Window functions: PARTITION BY one column after ORDER BY another
You want to execute a window function on columns (uuid, page) but want to keep the current order which is given by unrelated columns (date_id, time_id).
The problem is, that PARTITION BY orders the records before the ORDER BY clause. So, it defines the primary order and this is not expected.
Once I found a solution for that. I adapted it to your used case. Please read the explanation over there: https://stackoverflow.com/a/52439794/3984221
Interesting part: Your special rank() case is not explicitly required in the query, because my solution creates that out-of-the-box ("by accident" so-to-speak ;) ).
Hmmm . . . If you want the pages ordered by their earliest time, then use two levels of window functions:
select t.*,
dense_rank() over (partition by uid order by min_rn, pagename) as ranking
from (select t.*,
min(rn) over (partition by uid, pagename) as min_rn
from t
) t
Note: This uses rn as a convenient shortcut because the date/time is split into two columns. You can also combine them:
select t.*,
dense_rank() over (partition by uid order by min_dt, pagename) as ranking
from (select t.*,
min(date_id || time_id) over (partition by uid, pagename) as min_dt
from t
) t;
Note: This solution is different from S_man's. On your sample data, they do the same thing. However, if the user returns to a page, then his gives page a new ranking. This gives the page the same ranking as the first time it appears. It is not clear what you really want.
You can use DENSE_RANK() like this for your requirment,
SELECT
u_id,
page_name,
date_id,
time_id,
source,
DENSE_RANK()
OVER (
PARTITION BY page_name
ORDER BY u_id DESC
) rn
FROM ( SELECT * FROM tbl1 ORDER BY time_id ) AS result;

Generate custom group ranking in sql

As posted, I am trying to generate group ranking based on Is_True_Mod column. Here Until next 1 comes, I want 1 group to be there. Please find expected output in SQL. Here in expected output, rows grouped based on Is_True_Mode column. Regular ranking showing for reference ( order by ranking should be their )
You can identify the groups using a cumulative sum. Then you can you row_number() to enumerate the rows:
select t.*,
row_number() over (partition by grp order by regularranking) as expected_output
from (select t.*,
sum(is_true_mode) over (order by regularranking) as grp
from t
) t;

SQL Finding five largest numbers instead of one Max in a table

I have a table and I need to run a query that contains some aggregation Functions like Maximum , Average , Standard Deviation , ...
but instead of one Maximum I should return 5 largest number.
the simplified query is something like this:
SELECT OSI_KEY , MAX(VALUE) , AVG(VALUE) , STDDEV(VALUE), variance(VALUE)
FROM DATA_VALUES_5MIN_6_2013
GROUP BY OSI_KEY
ORDER BY OSI_KEY
and I need some Magical ;) Query like this:
SELECT OSI_KEY , MAX1(VALUE) ,MAX2(VALUE) ,MAX3(VALUE) ,MAX4(VALUE) , MAX5(VALUE) ,
AVG(VALUE) , STDDEV(VALUE), variance(VALUE)
FROM DATA_VALUES_5MIN_6_2013
GROUP BY OSI_KEY
ORDER BY OSI_KEY
I appreciate your considerations.
Oracle has an NTH_VALUE() function. Unfortunately, it is only an analytic function and not a window function. This leads to the strange construct of SELECT DISTINCT with a bunch of analytic functions:
SELECT DISTINCT OSI_KEY,
MAX(VALUE) OVER (PARTITION BY OSI_KEY),
NTH_VALUE(VALUE, 2) OVER (PARTITION BY OSI_KEY ORDER BY VALUE DESC) as MAX_2,
NTH_VALUE(VALUE, 3) OVER (PARTITION BY OSI_KEY ORDER BY VALUE DESC) as MAX_3,
NTH_VALUE(VALUE, 4) OVER (PARTITION BY OSI_KEY ORDER BY VALUE DESC) as MAX_4,
NTH_VALUE(VALUE, 5) OVER (PARTITION BY OSI_KEY ORDER BY VALUE DESC) as MAX_5,
AVG(VALUE) OVER (PARTITION BY OSI_KEY),
STDDEV(VALUE) OVER (PARTITION BY OSI_KEY),
variance(VALUE) OVER (PARTITION BY OSI_KEY)
FROM DATA_VALUES_5MIN_6_2013
ORDER BY OSI_KEY;
You can also do this using conditional aggregation, with a row_number() or dense_rank() in a subquery.
SELECT OSI_KEY, MaxValue FROM (
SELECT OSI_KEY, MAX(value) AS MaxValue FROM table GROUP BY OSI_KEY
)
ORDER BY MaxValue DESC
FETCH FIRST 5 ROWS ONLY;

Limit result set in sql window function

Assume I would like to rewrite the following aggregate query
select id, max(hittime)
from status
group by id
using an aggregate windowing function like
select id, max(hittime) over(partition by id order by hittime desc) from status
How can I specify, that I am only interested in the first result within the partition?
EDIT: I was thinking that there might be a solution with [ RANGE | ROWS ] BETWEEN frame_start AND frame_end. What to get not only max(hittime) but also the second, third ...
I think what you need is a ranking function, either ROW_NUMBER or DENSE_RANK depending on how you want to handle ties.
select id, hittime
from (
select id, hittime,
dense_rank() over(partition by id order by hittime desc) as ranking
from status
) as x
where ranking = 1; --to get max hittime
--where ranking <=2; --max and second largest
Use distinct statement.
select DISTINCT id, max(hittime) over(partition by id order by hittime desc) from status