Hive transformation


I am trying to make a simple Hive transformation.
Can someone suggest a way to do this? I have tried collect_set and am currently looking at Klout's open source UDF.

I think this gives you what you want. I wasn't able to run it and debug it though. Good luck!
select start_points.unit
, start_time as start
, min(stop_time) as stop -- earliest stop at or after this start
from
(select * from
(select date_time as start_time
, unit
, lag(unit) over (order by date_time) as previous_unit
from table
) previous
where previous_unit is null or unit <> previous_unit -- first row of each run
) start_points
left outer join
(select * from
(select date_time as stop_time
, unit
, lead(unit) over (order by date_time) as next_unit
from table
) next
where next_unit is null or unit <> next_unit -- last row of each run
) stop_points
on start_points.unit = stop_points.unit
where stop_time >= start_time
group by start_points.unit, start_time
;

What about using the min and max functions? I think the following will get you what you need:
SELECT
Unit,
MIN(datetime) as start,
MAX(datetime) as stop
from table_name
group by Unit
;

I found it. Thanks for the pointer to use window functions:
select *
from
(select *,
case when lag(unit,1) over (partition by id order by effective_time_ut desc) is NULL THEN 1
when unit<>lag(unit,1) over (partition by id order by effective_time_ut desc) then 1
when lead(unit,1) over (partition by id order by effective_time_ut desc) is NULL then 1
else 0 end as different_loc
from units_we_care) a
where different_loc=1

create table temptable as
select unit, start_date, end_time,
row_number() over (order by start_date) as row_num -- number the units chronologically
from (select unit, min(date_time) as start_date, max(date_time) as end_time
from table
group by unit) a;
select a.unit, a.start_date as start_date, nvl(b.start_date, a.end_time) as end_time
from temptable a
left outer join temptable b on (a.row_num + 1) = b.row_num;

Related

Getting category based on production shift

I have this query
with cte as(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY seq ORDER BY date_time) rn1,
ROW_NUMBER() OVER (PARTITION BY seq, output > 0
ORDER BY date_time) rn2
FROM myTable
)
select
seq,
date_time::date,
MIN(date_time) AS MinDatetime,
MAX(date_time) AS MaxDatetime,
SUM(output) AS sum_output
FROM cte cte
GROUP by
seq,
date_time::date ,
cntpr > 0,
rn1 - rn2
ORDER BY
seq,
MIN(date_time);
Here is the result:
(result image)
What I would like to do is join my result to this master table:
(master table image)
The expected result places MinDatetime and MaxDatetime within the master table's start and end shift times, to show the shift information, like this:
(expected result image)
Any help would be much appreciated. Thank you!

This is the solution I came up with:
select seq, shift, start_shift, end_shift, MinDateTime, MaxDateTime
from
(
select
seq,
MIN(date_time) AS MinDatetime,
MAX(date_time) AS MaxDatetime,
SUM(output) AS sum_output
FROM cte cte
GROUP by
seq
ORDER BY
seq,
MIN(date_time::date)) t
join mstr
on
CASE
WHEN start_shift < end_shift THEN (MinDateTime::time between start_shift and end_shift) OR (MaxDateTime::time between start_shift and end_shift)
ELSE (MinDateTime::time >= start_shift) OR
(MaxDateTime::time >= start_shift) OR
(MinDateTime::time <= end_shift) OR
(MaxDateTime::time <= end_shift)
END
ORDER BY seq;
Fiddle: https://www.db-fiddle.com/f/4jyoMCicNSZpjMt4jFYoz5/4208
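Explanation: I compute the groups, then join them to the master table wherever a group's time range matches a shift interval. The ELSE branch handles shifts that cross midnight (start_shift later than end_shift).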
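To make the interval matching easier to follow, here is a minimal Python sketch of the same CASE logic. The shift names and times are hypothetical (the master table from the question is not reproduced here), and the function names overlaps_shift and matching_shifts are made up for illustration.

```python
from datetime import time


def overlaps_shift(min_t: time, max_t: time, start: time, end: time) -> bool:
    """Mirror of the CASE expression used in the join condition."""
    if start < end:
        # Shift within one day: either endpoint falls inside [start, end].
        return (start <= min_t <= end) or (start <= max_t <= end)
    # Overnight shift (for example 22:00 to 06:00): the interval wraps past
    # midnight, so a time matches if it is late enough or early enough.
    return min_t >= start or max_t >= start or min_t <= end or max_t <= end


# Hypothetical shifts, not taken from the original master table.
shifts = {
    "day": (time(6, 0), time(14, 0)),
    "evening": (time(14, 0), time(22, 0)),
    "night": (time(22, 0), time(6, 0)),
}


def matching_shifts(min_t: time, max_t: time) -> list:
    return [name for name, (s, e) in shifts.items()
            if overlaps_shift(min_t, max_t, s, e)]


print(matching_shifts(time(23, 15), time(23, 50)))  # ['night']
print(matching_shifts(time(13, 30), time(14, 30)))  # ['day', 'evening']
```

As the second call shows, a group that straddles a shift boundary matches both shifts, so the join can return more than one row per group.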

SQL Server LEAD function

-- FIRST LOGIN DATE
WITH CTE_FIRST_LOGIN AS
(
SELECT
PLAYER_ID, EVENT_DATE,
ROW_NUMBER() OVER (PARTITION BY PLAYER_ID ORDER BY EVENT_DATE ASC) AS RN
FROM
ACTIVITY
),
-- CONSECUTIVE LOGINS
CTE_CONSEC_PLAYERS AS
(
SELECT
PLAYER_ID,
LEAD(EVENT_DATE,1) OVER (PARTITION BY EVENT_DATE ORDER BY EVENT_DATE) NEXT_DATE
FROM
ACTIVITY A
JOIN
CTE_FIRST_LOGIN C ON A.PLAYER_ID = C.PLAYER_ID
WHERE
NEXT_DATE = DATEADD(DAY, 1, A.EVENT_DATE) AND C.RN = 1
GROUP BY
A.PLAYER_ID
)
-- FRACTION
SELECT
NULLIF(ROUND(1.00 * COUNT(CTE_CONSEC.PLAYER_ID) / COUNT(DISTINCT PLAYER_ID), 2), 0) AS FRACTION
FROM
ACTIVITY
JOIN
CTE_CONSEC_PLAYERS CTE_CONSEC ON CTE_CONSEC.PLAYER_ID = ACTIVITY.PLAYER_ID
I am getting the following error when I run this query.
[42S22] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Invalid column name 'NEXT_DATE'. (207) (SQLExecDirectW)
This is LeetCode medium question 550, Game Play Analysis IV. I wanted to know why it can't identify the column NEXT_DATE here. What am I missing? Thanks!

The problem is in this CTE:
-- CONSECUTIVE LOGINS prep
CTE_CONSEC_PLAYERS AS (
SELECT
PLAYER_ID,
LEAD(EVENT_DATE,1) OVER (PARTITION BY EVENT_DATE ORDER BY EVENT_DATE) NEXT_DATE
FROM ACTIVITY A
JOIN CTE_FIRST_LOGIN C ON A.PLAYER_ID = C.PLAYER_ID
WHERE NEXT_DATE = DATEADD(DAY, 1, A.EVENT_DATE) AND C.RN = 1
GROUP BY A.PLAYER_ID
)
Note that you create NEXT_DATE as a column alias in this CTE but also refer to it in the same CTE's WHERE clause. This is invalid: in SQL's logical order of evaluation, WHERE runs before the SELECT list, so the NEXT_DATE alias does not exist yet when WHERE is evaluated. Within the same query block, a SELECT alias can only be referenced in ORDER BY, which is evaluated last. Your CTE has no ORDER BY, so in practice NEXT_DATE is only visible to later CTEs or queries that select from CTE_CONSEC_PLAYERS.
To fix this you'd probably want two CTEs like this (untested):
-- CONSECUTIVE LOGINS prep
CTE_CONSEC_PLAYERS_pre AS (
SELECT
PLAYER_ID,
RN,
EVENT_DATE,
LEAD(EVENT_DATE,1) OVER (PARTITION BY PLAYER_ID ORDER BY EVENT_DATE) AS NEXT_DATE
FROM CTE_FIRST_LOGIN -- already has one row per login, so no join back to ACTIVITY is needed
),
-- CONSECUTIVE LOGINS
CTE_CONSEC_PLAYERS AS (
SELECT
PLAYER_ID,
MAX(NEXT_DATE) AS NEXT_DATE
FROM CTE_CONSEC_PLAYERS_pre
WHERE NEXT_DATE = DATEADD(DAY, 1, EVENT_DATE) AND RN = 1
GROUP BY PLAYER_ID
)

You gave every table an alias (for example JOIN CTE_FIRST_LOGIN C has the alias C), and every column access is via the alias. You need to add the correct alias from the correct table to NEXT_DATE.

Your primary issue is that NEXT_DATE is computed by a window function, so it cannot be referenced in the WHERE clause of the same query: WHERE is evaluated before window functions and SELECT aliases.
But this query also seems over-complicated.
The problem to be solved appears to be: what fraction of all players logged in again on the day after their first login?
This can be done in a single pass (no joins) by using multiple window functions together:
WITH CTE_FIRST_LOGIN AS (
SELECT
PLAYER_ID,
EVENT_DATE,
ROW_NUMBER() OVER (PARTITION BY PLAYER_ID ORDER BY EVENT_DATE) AS RN,
-- if EVENT_DATE is a datetime and can have multiple per day then group by CAST(EVENT_DATE AS date) first
LEAD(EVENT_DATE, 1) OVER (PARTITION BY PLAYER_ID ORDER BY EVENT_DATE) AS NextDate
FROM ACTIVITY
),
BY_PLAYERS AS (
SELECT
c.PLAYER_ID,
SUM(CASE WHEN c.RN = 1 AND c.NextDate = DATEADD(DAY, 1, c.EVENT_DATE)
THEN 1 END) AS IsConsecutive
FROM CTE_FIRST_LOGIN AS c
GROUP BY c.PLAYER_ID
)
SELECT ROUND(
1.00 *
COUNT(c.IsConsecutive) /
NULLIF(COUNT(*), 0)
,2) AS FRACTION
FROM BY_PLAYERS AS c;
You could theoretically merge BY_PLAYERS into the outer query and use COUNT(DISTINCT ...), but splitting them feels cleaner.
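For anyone who wants to try the single-pass idea without a SQL Server instance, here is a rough sketch using Python's built-in sqlite3 module. This is an assumption-laden port, not the query above verbatim: SQLite's date(x, '+1 day') stands in for DATEADD, the sample rows are made up, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (player_id INTEGER, event_date TEXT)")
conn.executemany(
    "INSERT INTO activity VALUES (?, ?)",
    [
        (1, "2016-03-01"), (1, "2016-03-02"),  # logs in again the next day
        (2, "2017-06-25"),                     # single login
        (3, "2016-03-02"), (3, "2018-07-03"),  # returns much later
    ],
)

query = """
WITH logins AS (
    SELECT player_id,
           event_date,
           ROW_NUMBER() OVER (PARTITION BY player_id ORDER BY event_date) AS rn,
           LEAD(event_date) OVER (PARTITION BY player_id ORDER BY event_date) AS next_date
    FROM activity
),
by_player AS (
    -- rn and next_date can be used here because they come from the
    -- previous CTE, not from this query's own SELECT list.
    SELECT player_id,
           MAX(CASE WHEN rn = 1 AND next_date = date(event_date, '+1 day')
                    THEN 1 ELSE 0 END) AS is_consecutive
    FROM logins
    GROUP BY player_id
)
SELECT ROUND(1.0 * SUM(is_consecutive) / COUNT(*), 2) FROM by_player
"""
fraction = conn.execute(query).fetchone()[0]
print(fraction)  # 0.33
```

Only player 1 comes back the day after the first login, so the result is 1 of 3 players.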

How to get the validity date range of a price from individual daily prices in SQL

I have some prices for the month of January.
Date,Price
1,100
2,100
3,115
4,120
5,120
6,100
7,100
8,120
9,120
10,120
Now, the output I need is a non-overlapping date range for each price:
price,from,To
100,1,2
115,3,3
120,4,5
100,6,7
120,8,10
I need to do this using SQL only.
For now, if I simply group by and take min and max dates, I get the below, which is an overlapping range:
price,from,to
100,1,7
115,3,3
120,4,10

This is a gaps-and-islands problem. The simplest solution is the difference of row numbers:
select price, min(date), max(date)
from (select t.*,
row_number() over (order by date) as seqnum,
row_number() over (partition by price order by date) as seqnum2
from t
) t
group by price, (seqnum - seqnum2)
order by min(date);
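Why this works is a little hard to explain, but the results of the subquery make it visible. Within a run of adjacent rows with the same price, seqnum and seqnum2 both increase by 1 per row, so their difference stays constant. When the price changes and later comes back, the difference takes a new value, so each run ends up in its own group.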
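To see the difference of row numbers at work, the sketch below loads the ten prices from the question into an in-memory SQLite database through Python's sqlite3 module. SQLite is only an assumption for the demo (the thread does not name an engine for this answer), and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (date INTEGER, price INTEGER)")
prices = [100, 100, 115, 120, 120, 100, 100, 120, 120, 120]
conn.executemany("INSERT INTO t VALUES (?, ?)", list(enumerate(prices, start=1)))

inner = """
SELECT date, price,
       ROW_NUMBER() OVER (ORDER BY date) AS seqnum,
       ROW_NUMBER() OVER (PARTITION BY price ORDER BY date) AS seqnum2
FROM t
"""

# Print the intermediate columns: the last value is constant within each run.
for date, price, seqnum, seqnum2 in conn.execute(inner + " ORDER BY date"):
    print(date, price, seqnum, seqnum2, seqnum - seqnum2)

# Group on (price, difference) to collapse each run into one range.
ranges = conn.execute(
    "SELECT price, MIN(date), MAX(date) FROM (" + inner + ") "
    "GROUP BY price, seqnum - seqnum2 ORDER BY MIN(date)"
).fetchall()
print(ranges)
# [(100, 1, 2), (115, 3, 3), (120, 4, 5), (100, 6, 7), (120, 8, 10)]
```

The two runs of price 100 get differences 0 and 3, which is why grouping by price alone merges them and grouping by price plus the difference does not.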

SELECT Lag.price,Lag.[date] AS [From], MIN(Lead.[date]-Lag.[date])+Lag.[date] AS [to]
FROM
(
SELECT [date],[Price]
FROM
(
SELECT [date],[Price],LAG(Price) OVER (ORDER BY DATE,Price) AS LagID FROM #table1 A
)B
WHERE CASE WHEN Price <> ISNULL(LagID,1) THEN 1 ELSE 0 END = 1
)Lag
JOIN
(
SELECT [date],[Price]
FROM
(
SELECT [date],Price,LEAD(Price) OVER (ORDER BY DATE,Price) AS LeadID FROM [#table1] A
)B
WHERE CASE WHEN Price <> ISNULL(LeadID,1) THEN 1 ELSE 0 END = 1
)Lead
ON Lag.[Price] = Lead.[Price]
WHERE Lead.[date]-Lag.[date] >= 0
GROUP BY Lag.[date],Lag.[price]
ORDER BY Lag.[date]

Another method using ROWS UNBOUNDED PRECEDING
SELECT price, MIN([date]) AS [from], [end_date] AS [To]
FROM
(
SELECT *, MIN([abc]) OVER (ORDER BY DATE DESC ROWS UNBOUNDED PRECEDING ) end_date
FROM
(
SELECT *, CASE WHEN price = next_price THEN NULL ELSE DATE END AS abc
FROM
(
SELECT a.* , b.[date] AS next_date, b.price AS next_price
FROM #table1 a
LEFT JOIN #table1 b
ON a.[date] = b.[date]-1
)AA
)BB
)CC
GROUP BY price, end_date

How to calculate total hours from multiple in time and out time from below?

The first punch is the in time and the second punch is the out time.
If possible, ignore duplicate punches that fall within the same minute.
I need all in times and out times in one row, together with the total hours, in any format.
I tried the query below but can't get my expected output:
WITH Level1
AS (
SELECT A.emp_reader_id,
DT
,A.EventCatId
,A.Belongs_to
,ROW_NUMBER() OVER ( PARTITION BY A.Belongs_to,A.emp_reader_id ORDER BY DT ) AS RowNum
FROM dbo.trnevents A
)
,
LEVEL2
AS (-- find the last and next event type for each row
SELECT A.emp_reader_id,A.DT , A.EventCatId ,COALESCE(LastVal.EventCatId, 10) AS LastEvent,
COALESCE(NextVal.EventCatId, 10) AS NextEvent ,A.Belongs_to
FROM Level1 A
LEFT JOIN Level1 LastVal
ON A.emp_reader_id = LastVal.emp_reader_id and A.Belongs_to=LastVal.Belongs_to
AND A.RowNum - 1 = LastVal.RowNum
LEFT JOIN Level1 NextVal
ON A.emp_reader_id = NextVal.emp_reader_id and A.Belongs_to=NextVal.Belongs_to
AND A.RowNum + 1 = NextVal.RowNum
)
select * from level2 where emp_reader_id=92 order by dt desc

Try the script below. I treated all DT values that fall within the same minute as a single entry for the calculation.
WITH CTE AS
(
SELECT MAX(emp_reader_id) emp_reader_id,
CAST(DT AS DATE) Date_for_Group,
LEFT(CAST(DT AS VARCHAR),16) Time_For_Group,
ROW_NUMBER() OVER(PARTITION BY CAST(DT AS DATE) ORDER BY LEFT(CAST(DT AS VARCHAR),16)) RN,
CASE
WHEN ROW_NUMBER() OVER(PARTITION BY CAST(DT AS DATE) ORDER BY LEFT(CAST(DT AS VARCHAR),16))%2 = 0 THEN 'OUT'
ELSE 'IN'
END In_Out
FROM your_table
GROUP BY CAST(DT AS DATE),LEFT(CAST(DT AS VARCHAR),16)
)
SELECT A.emp_reader_id,A.Date_for_Group,
SUM(DATEDIFF(Minute,CAST(A.Time_For_Group AS DATETIME),CAST(B.Time_For_Group AS DATETIME)))/60 Hr,
SUM(DATEDIFF(Minute,CAST(A.Time_For_Group AS DATETIME),CAST(B.Time_For_Group AS DATETIME)))%60 Min
FROM CTE A
INNER JOIN CTE B
ON A.emp_reader_id = B.emp_reader_id
AND A.RN = B.RN -1
AND A.Date_for_Group = B.Date_for_Group
WHERE A.In_Out = 'IN'
GROUP BY A.emp_reader_id,A.Date_for_Group
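The pairing rule used by this script (collapse punches within the same minute, then treat odd rows as IN and even rows as OUT) can also be sketched in plain Python, which may help when checking the SQL result by hand. The punch times below are hypothetical, and total_worked is an illustrative helper rather than part of the script.

```python
from datetime import datetime, timedelta


def total_worked(punches):
    """Pair punches as IN/OUT after collapsing duplicates within the same minute."""
    # Truncate to the minute and de-duplicate, similar in spirit to
    # grouping by LEFT(CAST(DT AS VARCHAR), 16) in the SQL.
    stamps = sorted({p.replace(second=0, microsecond=0) for p in punches})
    total = timedelta()
    # Positions 1, 3, 5, ... are IN and 2, 4, 6, ... are OUT.
    # A trailing IN without a matching OUT is ignored.
    for punch_in, punch_out in zip(stamps[0::2], stamps[1::2]):
        total += punch_out - punch_in
    return total


# Hypothetical punches for one employee on one day.
punches = [
    datetime(2019, 7, 1, 9, 0, 5),
    datetime(2019, 7, 1, 9, 0, 40),   # duplicate within the same minute
    datetime(2019, 7, 1, 12, 30, 0),
    datetime(2019, 7, 1, 13, 15, 0),
    datetime(2019, 7, 1, 18, 0, 0),
]
worked = total_worked(punches)
hours, mins = divmod(int(worked.total_seconds()) // 60, 60)
print(hours, mins)  # 8 15
```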

First assign a row number to each row, ordered by the datetime column, then build the same result set again with the row number shifted by 1.
Inner join the two on the row numbers. After that, select the min and max of the time-in and time-out columns and group by date to get the total work hours for that day. Hope it helps.
select empid
,date
,min(timein) as timein,max (timeout) timeout,convert(nvarchar(20),datediff(hh,min (timein),max(timeout))%24)
+':'+
convert(nvarchar(20),datediff(mi,min (timein),max(timeout))%60) as totalhrs
from(
Select a.empid,cast(a.dt as date) date,b.dt as timein,a.dt as timeout from(
SELECT DT
,[empid]
, id
,row_number() over(order by dt) as inn
FROM [test1].[dbo].[Table_2]
)a
inner join(
SELECT distinct DT
,[empid]
, id
,rank() over(order by dt)+1 as out
FROM [test1].[dbo].[Table_2])b
on FORMAT(a.dt,'hh:mm') <> FORMAT(b.dt,'hh:mm')
and cast(a.dt as date)=cast(b.dt as date)
and a.inn=b.out)b
group by b.empid,b.date

How to generate session_id by sql?

My tracking system does not generate session IDs.
I have user_id and event_date_time.
I need a new session_id for each user session, where a new session starts when an event occurs 30 minutes or more after that user's previous event_date_time.
My final goal is to calculate the median session time.
I tried to assign session_id=1 or session_id=2 depending on whether the gap between event_date_time and next_event_time exceeds 30 minutes for the same guid, but I'm stuck from here:
select a.*,
case when (a.next_event_date-a.event_date)*24*60<30 and userID=next_userID
then 1
when (a.next_event_date-a.event_date)*24*60>=30 and userID=next_userID then
2
end session_id
from
(select f.userID,
lead(f.userID) over (partition by f.guid order by f.event_date)
next_guid,
f.event_date,
lead(f.event_date) over (partition by f.guid order by f.event_date)
next_event_date
from event_table f
)a
where next_event_date is not null

If I understood correctly, you could generate IDs this way:
select id, guid, event_date,
sum(chg) over (partition by guid order by event_date) session_id
from (
select id, guid, event_date,
case when lag(guid) over (partition by guid order by event_date) = guid
and 24 * 60 * (event_date -lag(event_date)
over (partition by guid order by event_date) ) < 30
then 0 else 1
end chg
from event_table ) a
Compare neighbouring rows: if the guids differ or the time difference is 30 minutes or more, assign 1, otherwise 0. Then take a running sum of these values with the analytic SUM, which gives the session ID.
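The same flag-then-running-sum pattern can be tried locally with Python's sqlite3 module. This is a sketch under assumptions: the query above relies on Oracle-style date subtraction, so the SQLite port uses julianday() differences instead, the sample events are made up, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_table (guid TEXT, event_date TEXT)")
conn.executemany(
    "INSERT INTO event_table VALUES (?, ?)",
    [
        ("u1", "2020-01-01 10:00:00"),
        ("u1", "2020-01-01 10:10:00"),
        ("u1", "2020-01-01 11:00:00"),  # 50 minutes after the previous event
        ("u1", "2020-01-01 11:05:00"),
        ("u2", "2020-01-01 10:05:00"),
    ],
)

query = """
SELECT guid, event_date,
       SUM(chg) OVER (PARTITION BY guid ORDER BY event_date) AS session_id
FROM (
    SELECT guid, event_date,
           -- chg = 0 when the previous event of the same guid is less than
           -- 30 minutes back; otherwise (or for the first event) chg = 1.
           CASE WHEN (julianday(event_date)
                      - julianday(LAG(event_date) OVER (PARTITION BY guid ORDER BY event_date))
                     ) * 24 * 60 < 30
                THEN 0 ELSE 1 END AS chg
    FROM event_table
)
ORDER BY guid, event_date
"""
sessions = conn.execute(query).fetchall()
for row in sessions:
    print(row)
```

With this data, u1 gets session 1 for the first two events and session 2 for the last two, and u2 gets its own session 1. The IDs restart per guid, so a session is identified by the pair (guid, session_id).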

I think you're on the right track using lead or lag. My recommendation would be to break this into steps and create a temp table to work against:
With the first query, assign every record its own unique ID, either a sequence number or GUID. You could also capture some of the lagged data in this step.
With a second query, find the overlaps (< 30 minutes) and make the overlapping records all the same -- either the same as the earliest or latest in that grouping, doesn't matter as long as it's consistent.
Something like this:
create table events_temp as (
select f.*,
row_number() over (partition by f.userID order by f.event_date) as user_row,
lag(f.userID) over (partition by f.userID order by f.event_date) as prev_userID,
lag(f.event_date) over (partition by f.userID order by f.event_date) as prev_event_date
from event_table f
order by f.userId, f.event_date
);
select a.*,
case when prev_userID = userID
and 24 * 60 * (event_date - prev_event_date) < 30
then lag(user_row) over (partition by userID order by user_row)
else user_row
end as session_id
from events_temp a;
