Airflow: how can I automate a query so it runs for every date specified rather than hard-coding? - hive

I am new to Airflow, so apologies if this has been asked somewhere.
I have a query I run in Hive against a table that is partitioned on year-month, e.g. 202001.
How can I run a query in Airflow that substitutes different values for a variable within the query? E.g., taking this example:
from airflow import DAG
from airflow.operators.mysql_operator import MySqlOperator
default_arg = {'owner': 'airflow', 'start_date': '2020-02-28'}
dag = DAG('simple-mysql-dag',
          default_args=default_arg,
          schedule_interval='00 11 2 * *')
mysql_task = MySqlOperator(dag=dag,
                           mysql_conn_id='mysql_default',
                           task_id='mysql_task',
                           sql='<path>/sample_sql.sql',
                           params={'test_user_id': -99})
where my sample_sql.hql looks like:
ALTER TABLE sample_df DROP IF EXISTS
PARTITION (
cpd_ym = ${ym}
) PURGE;
INSERT INTO sample_df
PARTITION (
cpd_ym = ${ym}
)
SELECT
*
from sourcedf
;
ANALYZE TABLE sample_df
PARTITION (
cpd_ym = ${ym}
)
COMPUTE STATISTICS;
ANALYZE TABLE sample_df
PARTITION (
cpd_ym = ${ym}
)
COMPUTE STATISTICS FOR COLUMNS;
I want to run the above for different values of ym using Airflow, e.g. between 202001 and 202110. How can I do this?

I'm a bit confused because you are asking about Hive yet you show an example with MySqlOperator. In any case, assuming the sql/hql parameter is templated, you can use execution_date directly in your query. Thus you can extract the year and month to be used for the partition value.
Example:
mysql_task = MySqlOperator(
    dag=dag,
    task_id='mysql_task',
    sql="""SELECT {{ execution_date.strftime('%Y%m') }}""",
)
So in your sample_sql.hql it will be:
ALTER TABLE sample_df DROP IF EXISTS
PARTITION (
cpd_ym = {{ execution_date.strftime('%Y%m') }}
) PURGE;
You mentioned that you are new to Airflow, so make sure you are aware of what execution_date is and how it's calculated (if you are not, check this answer). You can apply string manipulations to other macros as well. Choose the macro that suits your needs (execution_date / prev_execution_date / next_execution_date / etc.).
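To cover the whole range 202001 to 202110, one common approach is to let Airflow backfill the DAG: give it a monthly schedule, a start_date of 2020-01-01 and catchup=True, so the scheduler creates one run (and therefore one rendered partition value) per month. A minimal sketch, assuming the MySqlOperator from your example and a templated sample_sql.sql (swap in the Hive operator and connection you actually use):
from datetime import datetime
from airflow import DAG
from airflow.operators.mysql_operator import MySqlOperator

default_arg = {'owner': 'airflow', 'start_date': datetime(2020, 1, 1)}

# catchup=True makes the scheduler create one run per schedule interval
# between start_date and now, so the template below renders
# 202001, 202002, ... for the successive monthly runs
dag = DAG('monthly-partition-dag',
          default_args=default_arg,
          schedule_interval='00 11 2 * *',  # once a month, as in your example
          catchup=True)

mysql_task = MySqlOperator(dag=dag,
                           mysql_conn_id='mysql_default',
                           task_id='mysql_task',
                           # inside sample_sql.sql, reference the partition as:
                           #   cpd_ym = {{ execution_date.strftime('%Y%m') }}
                           sql='<path>/sample_sql.sql')
Keep in mind that the run stamped with execution_date 2020-01-01 only executes after its schedule interval has passed, which is usually what you want when loading a month partition. Pause the DAG (or give it an end_date) once the 202110 run has completed; alternatively, the same range can be run on demand with the backfill CLI (airflow backfill -s 2020-01-01 -e 2021-10-31 <dag_id> in Airflow 1.x).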

Related

Query takes too long to run, how to optimize it?

The query structure: a helper SELECT in the WITH clause selects the most recent entry using 'TOP 1 transaction_date', then the main query does many joins. It takes too much time to run - what am I doing wrong?
CREATE VIEW [IRWSMCMaterialization].[FactInventoryItemOnHandDailyView] AS
WITH TempTBLFactIvnItmDaily AS (
SELECT TOP 20
ITEM_NUMBER AS [InventoryItemNumber]
,CAST(FORMAT(TRANSACTION_DATE, 'yyyyMMdd') AS INT) AS [DateKey]
,BRANCH_PLANT_FHK AS [BranchPlantKey]
,BRANCH_PLANT_CODE AS [BranchPlantCode]
,CAST(QUANTITY_ON_HAND AS BIGINT) AS [QuantityOnHand]
,TRANSACTION_DATE AS [Date]
,WAREHOUSE_LOCATION_FHK AS [WarehouseLocationKey]
,WAREHOUSE_LOCATION_CODE AS [WarehouseLocationCode]
,WAREHOUSE_LOT_NUMBER_CODE AS [WarehouseLotNumber]
,WAREHOUSE_LOT_NUMBER_FHK AS [WarehouseLotNumberKey]
,UNIT_OF_MEASURE AS [UnitOfMeasureName]
,UNIT_OF_MEASURE_PHK AS [UnitOfMeasureKey]
FROM dbo.RS_INV_ITEM_ON_HAND
-- below is where clause, choose only most recent entry
WHERE TRANSACTION_DATE = (SELECT TOP 1 TRANSACTION_DATE FROM dbo.RS_INV_ITEM_ON_HAND ORDER BY TRANSACTION_DATE DESC)
)
SELECT [InventoryItemNumber],
[DateKey],
[Date],
[BranchPlantCode] AS [BP],
[WarehouseLocationCode] AS [Location],
[QuantityOnHand],
[UnitOfMeasureName] AS [UoM],
CASE [WarehouseLotNumber]
WHEN 'Not Assigned' THEN NULL
ELSE [WarehouseLotNumber]
END
AS [Lot]
FROM TempTBLFactIvnItmDaily iioh
JOIN DWH.DimBranchPlant bp ON iioh.BranchPlantKey = bp.BRANCH_PLANT_PHK
JOIN DWH.DimWarehouseLocation wloc ON iioh.WarehouseLocationKey = wloc.WAREHOUSE_LOCATION_PHK
JOIN DWH.DimWarehouseLotNumber wlot ON iioh.WarehouseLotNumberKey = wlot.WarehouseLotNumber_PHK
JOIN DWH.DimUnitOfMeasure uom ON CAST(iioh.UnitOfMeasureKey AS VARCHAR(100)) = uom.UNIT_OF_MEASURE_PHK
where bp.BRANCH_PLANT_CODE = '96100'
AND iioh.QuantityOnHand > 0
AND (wloc.WAREHOUSE_LOCATION_CODE like '6000W01%' OR wloc.WAREHOUSE_LOCATION_CODE like 'BL%')
GO
There are a lot of things that do not seem good. First of all, your base query should be a lot simpler. Something like this:
SELECT iioh.ITEM_NUMBER AS [InventoryItemNumber],
CAST(FORMAT(iioh.TRANSACTION_DATE, 'yyyyMMdd') AS INT) AS [DateKey],
iioh.TRANSACTION_DATE AS [Date],
iioh.BRANCH_PLANT_CODE AS [BP],
iioh.WAREHOUSE_LOCATION_CODE AS [Location],
CAST(iioh.QUANTITY_ON_HAND AS BIGINT) AS [QuantityOnHand],
iioh.UNIT_OF_MEASURE AS [UoM],
NULLIF(iioh.WAREHOUSE_LOT_NUMBER_CODE, 'Not Assigned') AS [Lot]
FROM dbo.RS_INV_ITEM_ON_HAND iioh
JOIN DWH.DimBranchPlant bp
ON iioh.BRANCH_PLANT_FHK = bp.BRANCH_PLANT_PHK
JOIN DWH.DimWarehouseLocation wloc
ON iioh.WAREHOUSE_LOCATION_FHK = wloc.WAREHOUSE_LOCATION_PHK
JOIN DWH.DimUnitOfMeasure uom
ON CAST(iioh.UNIT_OF_MEASURE_PHK AS VARCHAR(100)) = uom.UNIT_OF_MEASURE_PHK
where bp.BRANCH_PLANT_CODE = '96100'
AND iioh.QUANTITY_ON_HAND > 0
AND (wloc.WAREHOUSE_LOCATION_CODE like '6000W01%' OR wloc.WAREHOUSE_LOCATION_CODE like 'BL%')
AND iioh.TRANSACTION_DATE = @TRANSACTION_DATE
For example, you are joining DWH.DimWarehouseLotNumber but you are not extracting any of its columns - do you really need it? Also, there are other columns which are not returned by the view - why query them?
From there, you are first filtering by date and then by the other fields, so your first TOP 20 records may be filtered out by the subsequent conditions - is this the behavior you want?
Also, do you really want this cast?
ON CAST(iioh.UnitOfMeasureKey AS VARCHAR(100)) = uom.UNIT_OF_MEASURE_PHK
From a performance standpoint it's better to use CONVERT rather than FORMAT. Also, why not save/materialize TRANSACTION_DATE as an INT (for example using a persisted computed column, or setting it during inserts/updates) instead of calculating this value on each read?
Filtering by location code using a LIKE clause can hurt performance, too. Why not add a new column WareHouseLocationCodeType and set the same value for all locations satisfying this condition:
(wloc.WAREHOUSE_LOCATION_CODE like '6000W01%' OR wloc.WAREHOUSE_LOCATION_CODE like 'BL%')
Then you can filter by this column in the view, since this filter is clearly important for you. You can also create a filtered index on this column to improve performance further.
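For illustration only, here is a rough sketch of those schema tweaks (the persisted integer date and the location-type column with a filtered index), assuming the table and column names from the question; the DateKey column, the WareHouseLocationCodeType value 'ViewLocations' and the index name are made up and would need adapting:
-- persist the yyyyMMdd integer once instead of computing FORMAT() on every read;
-- CONVERT with style 112 is deterministic, so the computed column can be PERSISTED
ALTER TABLE dbo.RS_INV_ITEM_ON_HAND
ADD DateKey AS CAST(CONVERT(VARCHAR(8), TRANSACTION_DATE, 112) AS INT) PERSISTED;

-- hypothetical location-type column, populated once so the view can filter on equality
ALTER TABLE DWH.DimWarehouseLocation
ADD WareHouseLocationCodeType VARCHAR(20) NULL;

UPDATE DWH.DimWarehouseLocation
SET WareHouseLocationCodeType = 'ViewLocations'
WHERE WAREHOUSE_LOCATION_CODE LIKE '6000W01%'
OR WAREHOUSE_LOCATION_CODE LIKE 'BL%';

-- filtered index covering only the locations the view cares about
CREATE NONCLUSTERED INDEX IX_DimWarehouseLocation_CodeType
ON DWH.DimWarehouseLocation (WAREHOUSE_LOCATION_PHK, WAREHOUSE_LOCATION_CODE)
WHERE WareHouseLocationCodeType = 'ViewLocations';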
Also, you may want to create an inline table-valued function instead of a view and pass the date as a parameter:
CREATE OR ALTER FUNCTION [IRWSMCMaterialization].[FactInventoryItemOnHandDailyView]
(
@TRANSACTION_DATE datetime
)
RETURNS TABLE
AS
RETURN
(
SELECT iioh.ITEM_NUMBER AS [InventoryItemNumber],
CAST(FORMAT(iioh.TRANSACTION_DATE, 'yyyyMMdd') AS INT) AS [DateKey],
iioh.TRANSACTION_DATE AS [Date],
iioh.BRANCH_PLANT_CODE AS [BP],
iioh.WAREHOUSE_LOCATION_CODE AS [Location],
CAST(iioh.QUANTITY_ON_HAND AS BIGINT) AS [QuantityOnHand],
iioh.UNIT_OF_MEASURE AS [UoM],
NULLIF(iioh.WAREHOUSE_LOT_NUMBER_CODE, 'Not Assigned') AS [Lot]
,iioh.TRANSACTION_DATE
FROM dbo.RS_INV_ITEM_ON_HAND iioh
JOIN DWH.DimBranchPlant bp
ON iioh.BRANCH_PLANT_FHK = bp.BRANCH_PLANT_PHK
JOIN DWH.DimWarehouseLocation wloc
ON iioh.WAREHOUSE_LOCATION_FHK = wloc.WAREHOUSE_LOCATION_PHK
JOIN DWH.DimUnitOfMeasure uom
ON CAST(iioh.UNIT_OF_MEASURE_PHK AS VARCHAR(100)) = uom.UNIT_OF_MEASURE_PHK
where bp.BRANCH_PLANT_CODE = '96100'
AND iioh.QUANTITY_ON_HAND > 0
AND (wloc.WAREHOUSE_LOCATION_CODE like '6000W01%' OR wloc.WAREHOUSE_LOCATION_CODE like 'BL%')
AND iioh.TRANSACTION_DATE = @TRANSACTION_DATE
)
Then call it like this:
SELECT TOP 20 *
FROM [IRWSMCMaterialization].[FactInventoryItemOnHandDailyView] ('2020-12-04')
ORDER BY TRANSACTION_DATE DESC
Query optimization is a science of its own. If you want to find the bottlenecks in your query, you can follow some of these steps:
As the first step, enable statistics with these commands:
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
Execute these commands in a query window, then run your query in the same window. When the query has finished, switch to the Messages tab and you will see a lot of useful information: execution time, parse and compile time, and, maybe most interesting, I/O reads.
As the second step, try to understand which table has a lot of reads. For example, if you are expecting 10 rows from the query but some tables show 10k or 100k logical reads, something is wrong - to return those 10 rows the query is reading 10k pages from one table. You are probably missing an index on that table; try to find which index you need.
If you have some static values in the WHERE clause, like the following, then think about a filtered index:
bp.BRANCH_PLANT_CODE = '96100' AND iioh.QuantityOnHand > 0
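As a hedged example of what that might look like for the QuantityOnHand predicate on the base table from the question (the key and INCLUDE columns are guesses and would need tuning against the actual plan):
CREATE NONCLUSTERED INDEX IX_RS_INV_ITEM_ON_HAND_OnHand
ON dbo.RS_INV_ITEM_ON_HAND (TRANSACTION_DATE)
INCLUDE (ITEM_NUMBER, BRANCH_PLANT_FHK, WAREHOUSE_LOCATION_FHK, QUANTITY_ON_HAND)
WHERE QUANTITY_ON_HAND > 0;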
Not always, but in some cases a conversion can break index usage. If you cast a column or apply some other function to it in the WHERE clause, like the following, the query optimizer will not use an index on that column during query execution, even if one exists:
CAST(iioh.UnitOfMeasureKey AS VARCHAR(100))
Finally, if you have an OR logical operator in your query, try executing each part of the OR separately and compare the performance. This operator can really kill your query; here is one example:
AND (wloc.WAREHOUSE_LOCATION_CODE like '6000W01%' OR wloc.WAREHOUSE_LOCATION_CODE like 'BL%')
Once you determine that you don't have any issues here, you can dig further.

Filter by clustering fields using a sub-select query

With Google Bigquery, I am querying a clustered table by applying a filter on the clustering field projectId, like so:
WITH userProjects AS (
SELECT
projectsArray
FROM
projectsPerUser
WHERE
userId = "eben#somewhere.com"
)
SELECT
userProperty
FROM
`mydata.mydataset.mytable`
WHERE
--projectId IN UNNEST((SELECT projectsArray FROM userProjects))
projectId IN ("mydata", "anotherproject")
AND _PARTITIONTIME >= "2019-03-20"
Clustering is applied correctly in the code snippet above, but when I use the commented-out line --projectId IN UNNEST((SELECT projectsArray FROM userProjects)), clustering doesn't apply.
I've tried wrapping it in a UDF like this as well, which also doesn't work:
CREATE TEMP FUNCTION storedValue(item ARRAY<STRING>) AS (
item
);
...
WHERE projectId IN UNNEST(storedValue((SELECT projectsListArray FROM projectsList)))
As I understand from this, the execution path for sub-select queries is different to merely filtering on a scalar or array directly.
I expect a solution to exist where I can programmatically supply an array to filter on that will still allow me the cost benefit a clustered table provides.
In summary:
WHERE projectId IN ("mydata", "anotherproject") [OK]
WHERE projectId IN UNNEST((SELECT projectsArray FROM userProjects)) [Not OK]
WHERE projectId IN UNNEST(storedValue((SELECT projectsListArray FROM projectsList))) [Not OK]
Any ideas?
My suggestion is to rewrite your query so that your nested SELECT is a temporary table (which you've already done) and then perform the filtering you require by using an INNER JOIN rather than a set membership test, so your query would become something like this:
WITH userProjects AS (
SELECT
projectsArray
FROM
projectsPerUser
WHERE
userId = "eben#somewhere.com"
)
SELECT
userProperty
FROM
`mydata.mydataset.mytable` as a
JOIN
userProjects as b
ON a.projectId = b.projectsArray
WHERE
_PARTITIONTIME >= "2019-03-20"
I believe this will result in a query which does not scan the full partition if that field is clustered.
FWIW, clustering works well for me with dynamic filters:
SELECT title, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE DATE(TIMESTAMP_TRUNC(datehour, DAY)) = '2019-01-01'
AND wiki='en'
AND title IN ('Dogfight_(disambiguation)','Dogfight','Dogfight_(film)')
GROUP BY 1
1.8 sec elapsed, 364 MB processed
if instead I do
AND title IN (
SELECT DISTINCT prev
FROM `fh-bigquery.wikipedia_vt.clickstream_materialized`
WHERE date='2019-01-01' AND prev LIKE 'Dogfight%'
ORDER BY 1 LIMIT 3)
2.9 sec elapsed, 513.8 MB processed
If I go to v2 (not clustered), instead of v3:
FROM `fh-bigquery.wikipedia_v2.pageviews_2019`
2.6 sec elapsed, 9.6 GB processed
I'm not sure what's happening in your tables - but it might be interesting to revisit.

Transforming SQL many to many JOIN with a constraint into DAX/PowerBI

I'm trying to recreate a join between 2 tables, timeregistration and schedule, into one table, which I then filter based on the difference between the scheduled start time and the actual start time. Basically, I'm trying to determine whether an employee is late or early.
My query in SQL looks like this and works as desired; I just can't seem to figure out how to make this join work in DAX for Power BI, let alone the time constraint.
SELECT
tr.employee_employeeID AS EmployeeID,
tr.rawstarttime AS Actual_Start,
s.StartTime_Schedule AS Schedule_Start
FROM
timeregistration tr,
schedule s
WHERE
tr.Employee_EmployeeID = s.Employee_EmployeeID
AND
DATEDIFF(MINUTE, tr.RawStartTime, s.StartDateTime) < 60
AND
DATEDIFF(MINUTE, tr.RawEndTime, s.StartDateTime) > -60
And the tables and relation look something like this:
(image: table structure)
I have tried the following so far:
GENERATE(timeregistration; schedule)
- which returned
The Column with the name of 'EmployeeID' already exists in the 'date_diff' Table.
NATURALINNERJOIN(timeregistration; schedule)
- which returned
No common join columns detected. The join function 'NATURALINNERJOIN' requires at-least one common join column.
CROSSJOIN(timeregistration; schedule)
- which returned
The Column with the name of 'EmployeeID' already exists in the 'date_diff' Table.
As of right now I wouldn't really know what to do with the JOIN, so any help would be appreciated.
With kind regards,
Martien
(edit: Fixed formatting mistakes)
Solved it with this DAX query
date_diff =
FILTER(
ADDCOLUMNS(
GENERATEALL(
schedule;
VAR schedule_id = schedule[EmployeeID]
RETURN
SELECTCOLUMNS(
CALCULATETABLE(
timeregistration;
timeregistration[EmployeeID] = schedule_id
);
"StartTime_timeregistration"; timeregistration[StartTime_actual]
)
);
"diff"; DATEDIFF([StartTime_schedule];[StartTime_timeregistration];SECOND)
);
3600 >= [diff] &&
-3600 <= [diff] &&
NOT(ISBLANK([StartTime_timeregistration]))
)

How can I alter this to run in PostgreSQL?

What needs changing for this to run in PostgreSQL?
I was given this piece of SQL:
UPDATE ACC
SET ACC.ACC_EC = SITESmin.ACC_EC,
ACC.ACC_NC = SITESmin.ACC_NC
FROM ACC
INNER JOIN LATERAL ( SELECT TOP 1
*
FROM SITES
ORDER BY ( acc_ec - site_etg ) * ( acc_ec - site_etg ) + (acc_ncb - site_ntg ) * ( acc_ncb - site_ntg )
) SITESmin;
It seems to be using SET but I do not know why, so if it's not needed drop it.
I am trying to get PostgreSQL to work out distances. For every record in file one I have to compare to 3300 records in file two and select the nearest. Received wisdom suggests an array solution for the 3300, but I do not know how to do that. Perhaps it is a "sub query" in SQL.
If I am permitted to upload samples I will do so, though I have the feeling this is not allowed?
Here are the field names:
public.acc.Location_Easting_OSGR
public.acc.Location_Northing_OSGR
"public"."Sites"."SITE_ETG"
"public"."Sites"."SITE_NTG"
Try this:
WITH SITESmin as (
SELECT ACC_EC, ACC_NC
FROM SITES
ORDER BY ( acc_ec - site_etg ) * ( acc_ec - site_etg ) + (acc_ncb - site_ntg ) * ( acc_ncb - site_ntg )
LIMIT 1
)
UPDATE ACC
SET ACC_EC = SITESmin.ACC_EC,
ACC_NC = SITESmin.ACC_NC
FROM SITESmin;
If it does not work, please provide the schema and some data to make it easier to reproduce
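Note that the CTE above picks a single nearest site for the whole table. If, as the question suggests, you want the nearest of the ~3300 sites for every ACC row, a correlated sub-select does that per row. A hedged sketch (untested), assuming the goal is to copy the nearest site's coordinates into each acc row as the original UPDATE appears to intend, and using the column names from the snippet above (map them to Location_Easting_OSGR / SITE_ETG etc. as needed):
-- for each acc row, find the site with the smallest squared distance
UPDATE acc a
SET (acc_ec, acc_nc) = (
SELECT s.site_etg, s.site_ntg
FROM sites s
ORDER BY (a.acc_ec - s.site_etg) * (a.acc_ec - s.site_etg)
       + (a.acc_nc - s.site_ntg) * (a.acc_nc - s.site_ntg)
LIMIT 1
);
With only 3300 sites this scans the sites table once per acc row, which is usually acceptable; for much larger tables you would look at PostGIS and its KNN (<->) operator instead.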

Better way to use Apache Hive to sessionize my log data?

Is there a better way to use Apache Hive to sessionize my log data? I'm not sure that the approach below is optimal:
The log data is stored in sequence files; a single log entry is a JSON string; eg:
{"source": {"api_key": "app_key_1", "user_id": "user0"}, "events": [{"timestamp": 1330988326, "event_type": "high_score", "event_params": {"score": "1123", "level": "9"}}, {"timestamp": 1330987183, "event_type": "some_event_0", "event_params": {"some_param_00": "val", "some_param_01": 100}}, {"timestamp": 1330987775, "event_type": "some_event_1", "event_params": {"some_param_11": 100, "some_param_10": "val"}}]}
Formatted, this looks like:
{'source': {'api_key': 'app_key_1', 'user_id': 'user0'},
'events': [{'event_params': {'level': '9', 'score': '1123'},
'event_type': 'high_score',
'timestamp': 1330988326},
{'event_params': {'some_param_00': 'val', 'some_param_01': 100},
'event_type': 'some_event_0',
'timestamp': 1330987183},
{'event_params': {'some_param_10': 'val', 'some_param_11': 100},
'event_type': 'some_event_1',
'timestamp': 1330987775}]
}
'source' contains some info ( user_id and api_key ) about the source of the events contained in 'events'; 'events' contains a list of events generated by the source; each event has 'event_params', 'event_type', and 'timestamp' ( timestamp is a Unix timestamp in GMT ). Note that timestamps within a single log entry, and across log entries may be out of order.
Note that I'm constrained such that I cannot change the log format, cannot initially log the data into separate files that are partitioned ( though I could use Hive to do this after the data is logged ), etc.
In the end, I'd like a table of sessions, where a session is associated with an app ( api_k ) and user, and has a start time and session length ( or end time ); sessions are split where, for a given app and user, a gap of 30 or more minutes occurs between events.
My solution does the following ( Hive script and python transform script are below; doesn't seem like it would be useful to show the SerDe source, but let me know if it would be ):
[1] load the data into log_entry_tmp, in a denormalized format
[2] explode the data into log_entry, so that, eg, the above single entry would now have multiple entries:
{"source_api_key":"app_key_1","source_user_id":"user0","event_type":"high_score","event_params":{"score":"1123","level":"9"},"event_timestamp":1330988326}
{"source_api_key":"app_key_1","source_user_id":"user0","event_type":"some_event_0","event_params":{"some_param_00":"val","some_param_01":"100"},"event_timestamp":1330987183}
{"source_api_key":"app_key_1","source_user_id":"user0","event_type":"some_event_1","event_params":{"some_param_11":"100","some_param_10":"val"},"event_timestamp":1330987775}
[3] transform and write data into session_info_0, where each entry contains events' app_id, user_id, and timestamp
[4] transform and write data into session_info_1, where entries are ordered by app_id, user_id, event_timestamp, and each entry contains a session_id; the Python transform script finds the splits and groups the data into sessions
[5] transform and write final session data to session_info_2 ; the sessions' app + user, start time, and length in seconds
[Hive script]
drop table if exists app_info;
create external table app_info ( app_id int, app_name string, api_k string )
location '${WORK}/hive_tables/app_info';
add jar ../build/our-serdes.jar;
-- [1] load the data into log_entry_tmp, in a denormalized format
drop table if exists log_entry_tmp;
create external table log_entry_tmp
row format serde 'com.company.TestLogSerde'
location '${WORK}/hive_tables/test_logs';
drop table if exists log_entry;
create table log_entry (
entry struct<source_api_key:string,
source_user_id:string,
event_type:string,
event_params:map<string,string>,
event_timestamp:bigint>);
-- [2] explode the data into log_entry
insert overwrite table log_entry
select explode (trans0_list) t
from log_entry_tmp;
drop table if exists session_info_0;
create table session_info_0 (
app_id string,
user_id string,
event_timestamp bigint
);
-- [3] transform and write data into session_info_0, where each entry contains events' app_id, user_id, and timestamp
insert overwrite table session_info_0
select ai.app_id, le.entry.source_user_id, le.entry.event_timestamp
from log_entry le
join app_info ai on (le.entry.source_api_key = ai.api_k);
add file ./TestLogTrans.py;
drop table if exists session_info_1;
create table session_info_1 (
session_id string,
app_id string,
user_id string,
event_timestamp bigint,
session_start_datetime string,
session_start_timestamp bigint,
gap_secs int
);
-- [4] transform and write data into session_info_1, where entries are ordered by app_id, user_id, event_timestamp, and each entry contains a session_id; the python transform script finds the splits and groups the data into sessions
insert overwrite table session_info_1
select
transform (t.app_id, t.user_id, t.event_timestamp)
using './TestLogTrans.py'
as (session_id, app_id, user_id, event_timestamp, session_start_datetime, session_start_timestamp, gap_secs)
from
(select app_id as app_id, user_id as user_id, event_timestamp as event_timestamp from session_info_0 order by app_id, user_id, event_timestamp ) t;
drop table if exists session_info_2;
create table session_info_2 (
session_id string,
app_id string,
user_id string,
session_start_datetime string,
session_start_timestamp bigint,
len_secs int
);
-- [5] transform and write final session data to session_info_2 ; the sessions' app + user, start time, and length in seconds
insert overwrite table session_info_2
select session_id, app_id, user_id, session_start_datetime, session_start_timestamp, sum(gap_secs)
from session_info_1
group by session_id, app_id, user_id, session_start_datetime, session_start_timestamp;
[TestLogTrans.py]
#!/usr/bin/python
import sys, time
def buildDateTime(ts):
    return time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(ts))

curGroup = None
prevGroup = None
curSessionStartTimestamp = None
curSessionStartDatetime = None
prevTimestamp = None

# rows arrive sorted by app_id, user_id, event_timestamp (see step [4])
for line in sys.stdin.readlines():
    fields = line.split('\t')
    if len(fields) != 3:
        raise Exception('fields = %s', fields)
    app_id = fields[0]
    user_id = fields[1]
    event_timestamp = int(fields[2].strip())
    curGroup = '%s-%s' % (app_id, user_id)
    curTimestamp = event_timestamp
    if prevGroup == None:
        # first row: open the first session
        prevGroup = curGroup
        curSessionStartTimestamp = curTimestamp
        curSessionStartDatetime = buildDateTime(curSessionStartTimestamp)
        prevTimestamp = curTimestamp
    isNewGroup = (curGroup != prevGroup)
    gapSecs = 0 if isNewGroup else (curTimestamp - prevTimestamp)
    # a new app/user group or a gap of 30+ minutes (1800 s) starts a new session
    isSessionSplit = (gapSecs >= 1800)
    if isNewGroup or isSessionSplit:
        curSessionStartTimestamp = curTimestamp
        curSessionStartDatetime = buildDateTime(curSessionStartTimestamp)
    session_id = '%s-%s-%d' % (app_id, user_id, curSessionStartTimestamp)
    print '%s\t%s\t%s\t%d\t%s\t%d\t%d' % (session_id, app_id, user_id, curTimestamp, curSessionStartDatetime, curSessionStartTimestamp, gapSecs)
    prevGroup = curGroup
    prevTimestamp = curTimestamp
I think you could easily drop step 3 and put the query you use there in as a subquery in your FROM clause in step 4. Materializing that transform doesn't appear to gain you anything.
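For instance, a hedged sketch of that consolidation (untested), reusing the tables and column names from your script - the join from step 3 simply moves into the FROM subquery of step 4:
insert overwrite table session_info_1
select
transform (t.app_id, t.user_id, t.event_timestamp)
using './TestLogTrans.py'
as (session_id, app_id, user_id, event_timestamp, session_start_datetime, session_start_timestamp, gap_secs)
from
(select ai.app_id as app_id, le.entry.source_user_id as user_id, le.entry.event_timestamp as event_timestamp
from log_entry le
join app_info ai on (le.entry.source_api_key = ai.api_k)
order by app_id, user_id, event_timestamp) t;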
Otherwise I think for what you're trying to achieve here, this seems a reasonable approach.
Potentially you could achieve step 2 using a custom mapper, passing its output into step 4 as a custom reducer (with step 3 built in as a subquery). That would reduce your MapReduce jobs by one, and could therefore give you a significant saving in time.