I have a report table, an account table, and an account history table. The account history table has zero or more records per account, where each record is the state of the account's cancelled flag after it changed.
There is other stuff going on, but basically I want to return the account detail data, with the state of the account cancelled bit on the start date and on the end date as separate columns.
What is the best way to do this?
I have a working query below.
(Idea) Should I do separate joins on the history table, one for each date?
I guess I could do it in three separate queries (get the begin snapshot, get the end snapshot, then the normal report query with a join to each snapshot).
Something else?
Expected output:
AccountID, OtherData, StartDateCancelled, EndDateCancelled
Test Tables:
CREATE TABLE #Report (ReportID INT, StartDate DATETIME, EndDate DATETIME)
CREATE TABLE #ReportAccountDetail (ReportID INT, Accountid INT, Cancelled BIT)
CREATE TABLE #AccountHistory (AccountID INT, ModifiedDate DATETIME, Cancelled BIT)
INSERT INTO #Report
SELECT 1,'1/1/2011', '2/1/2011'
--
INSERT INTO #ReportAccountDetail
SELECT 1 AS ReportID, 1 AS AccountID, 0 AS Cancelled
UNION
SELECT 1,2,0
UNION
SELECT 1,3,1
UNION
SELECT 1,4,1
--
INSERT INTO #AccountHistory
SELECT 2 AS AccountID, '1/2/2010' AS ModifiedDate, 1 AS Cancelled
UNION--
SELECT 3, '2/1/2011', 1
UNION--
SELECT 4, '1/1/2010', 1
UNION
SELECT 4, '2/1/2010', 0
UNION
SELECT 4, '2/1/2011', 1
Current Query:
SELECT Accountid, OtherData,
MAX(CASE WHEN BeginRank = 1 THEN CASE WHEN BeginHistoryExists = 1 THEN HistoryCancelled ELSE DefaultCancel END ELSE NULL END ) AS StartDateCancelled,
MAX(CASE WHEN EndRank = 1 THEN CASE WHEN EndHistoryExists = 1 THEN HistoryCancelled ELSE DefaultCancel END ELSE NULL END ) AS EndDateCancelled
FROM
(
SELECT c.Accountid,
'OtherData' AS OtherData,
--lots of other data
ROW_NUMBER() OVER (PARTITION BY c.AccountID ORDER BY
CASE WHEN ch.ModifiedDate <= Report.StartDate THEN 1 ELSE 0 END DESC, ch.ModifiedDate desc) AS BeginRank,
CASE WHEN ch.ModifiedDate <= Report.StartDate THEN 1 ELSE 0 END AS BeginHistoryExists,
ROW_NUMBER() OVER ( PARTITION BY c.AccountID ORDER BY
CASE WHEN ch.ModifiedDate <= Report.EndDate THEN 1 ELSE 0 END DESC, ch.ModifiedDate desc) AS EndRank,
CASE WHEN ch.ModifiedDate <= Report.EndDate THEN 1 ELSE 0 END AS EndHistoryExists,
CAST( ch.Cancelled AS INT) AS HistoryCancelled,
0 AS DefaultCancel
FROM
#Report AS Report
INNER JOIN #ReportAccountDetail AS C ON Report.ReportID = C.ReportID
--Others joins related for data to return
LEFT JOIN #AccountHistory AS CH ON CH.AccountID = C.AccountID
WHERE Report.ReportID = 1
) AS x
GROUP BY AccountID, OtherData
Input on writing Stack Overflow questions is welcome. Thanks!
ROW_NUMBER() often surprises me and outperforms my expectations. In this case, however, I'd be tempted to just use correlated sub-queries. At least, I'd test them against the alternatives.
Note: I would also use real tables, with real indexes, and a realistic volume of fake data. (If it's worth posting this question, I'm assuming that it's worth testing this realistically.)
SELECT
[Report].ReportID,
[Account].AccountID,
[Account].OtherData,
ISNULL((SELECT TOP 1 Cancelled FROM AccountHistory WHERE AccountID = [Account].AccountID AND ModifiedDate <= [Report].StartDate ORDER BY ModifiedDate DESC), 0) AS StartDateCancelled,
ISNULL((SELECT TOP 1 Cancelled FROM AccountHistory WHERE AccountID = [Account].AccountID AND ModifiedDate <= [Report].EndDate ORDER BY ModifiedDate DESC), 0) AS EndDateCancelled
FROM
Report AS [Report]
LEFT JOIN
ReportAccountDetail AS [Account]
ON [Account].ReportID = [Report].ReportID
ORDER BY
[Report].ReportID,
[Account].AccountID
Note: For whatever reason, I've found that TOP 1 and ORDER BY is faster than MAX().
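As a rough way to check the correlated sub-query idea against the question's test data, here is a minimal sketch using Python's built-in sqlite3 module. It is not T-SQL: LIMIT 1 stands in for TOP 1 and COALESCE for ISNULL, the dates are rewritten in ISO format so that text comparison orders them correctly, and the in-memory tables are only an assumption for illustration.

```python
import sqlite3

# Hypothetical in-memory copy of the question's test data (ISO dates, US month/day order assumed).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Report (ReportID INT, StartDate TEXT, EndDate TEXT);
CREATE TABLE ReportAccountDetail (ReportID INT, AccountID INT);
CREATE TABLE AccountHistory (AccountID INT, ModifiedDate TEXT, Cancelled INT);
INSERT INTO Report VALUES (1, '2011-01-01', '2011-02-01');
INSERT INTO ReportAccountDetail VALUES (1, 1), (1, 2), (1, 3), (1, 4);
INSERT INTO AccountHistory VALUES
    (2, '2010-01-02', 1),
    (3, '2011-02-01', 1),
    (4, '2010-01-01', 1),
    (4, '2010-02-01', 0),
    (4, '2011-02-01', 1);
""")

# For each account, take the latest history row on or before each report date;
# accounts with no such row default to 0 (not cancelled).
QUERY = """
SELECT a.AccountID,
       COALESCE((SELECT h.Cancelled FROM AccountHistory AS h
                 WHERE h.AccountID = a.AccountID AND h.ModifiedDate <= r.StartDate
                 ORDER BY h.ModifiedDate DESC LIMIT 1), 0) AS StartDateCancelled,
       COALESCE((SELECT h.Cancelled FROM AccountHistory AS h
                 WHERE h.AccountID = a.AccountID AND h.ModifiedDate <= r.EndDate
                 ORDER BY h.ModifiedDate DESC LIMIT 1), 0) AS EndDateCancelled
FROM Report AS r
JOIN ReportAccountDetail AS a ON a.ReportID = r.ReportID
WHERE r.ReportID = 1
ORDER BY a.AccountID
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Accounts 3 and 4 are the interesting cases here: both have a history row dated exactly on the end date, so they should come out as not cancelled at the start and cancelled at the end.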
In terms of your suggested answer, I'd modify it slightly to just use ISNULL instead of trying to make the Exists columns work.
I'd also join on the "other data" after all of the working out, rather than inside the inner-most query, so as to avoid having to group by all the "other data".
WITH
HistoricData AS
(
SELECT
Report.ReportID,
c.Accountid,
c.OtherData,
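-- BeginRank is NULL for rows after StartDate, so a later row can never be picked up as the start state
CASE WHEN ch.ModifiedDate <= Report.StartDate THEN ROW_NUMBER() OVER (PARTITION BY c.ReportID, c.AccountID ORDER BY CASE WHEN ch.ModifiedDate <= Report.StartDate THEN 1 ELSE 0 END DESC, ch.ModifiedDate DESC) END AS BeginRank,
ROW_NUMBER() OVER (PARTITION BY c.ReportID, c.AccountID ORDER BY ch.ModifiedDate DESC) AS EndRank,
-- MAX() does not accept BIT in SQL Server, so cast to INT
CAST(CH.Cancelled AS INT) AS Cancelled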
FROM
#Report AS Report
INNER JOIN
#ReportAccountDetail AS C
ON Report.ReportID = C.ReportID
LEFT JOIN
#AccountHistory AS CH
ON CH.AccountID = C.AccountID
AND CH.ModifiedDate <= Report.EndDate
)
,
FlattenedData AS
(
SELECT
ReportID,
Accountid,
OtherData,
ISNULL(MAX(CASE WHEN BeginRank = 1 THEN Cancelled END), 0) AS StartDateCancelled,
ISNULL(MAX(CASE WHEN EndRank = 1 THEN Cancelled END), 0) AS EndDateCancelled
FROM
[HistoricData]
GROUP BY
ReportID,
AccountID,
OtherData
)
SELECT
*
FROM
[FlattenedData]
LEFT JOIN
[OtherData]
ON Whatever = YouLike
WHERE
[FlattenedData].ReportID = 1
And a final possible version...
WITH
ReportStartHistory AS
(
SELECT
*
FROM
(
SELECT
[Report].ReportID,
ROW_NUMBER() OVER (PARTITION BY [Report].ReportID, [History].AccountID ORDER BY [History].ModifiedDate DESC) AS SequenceID, -- DESC so that 1 is the latest row on or before StartDate
[History].*
FROM
Report AS [Report]
INNER JOIN
AccountHistory AS [History]
ON [History].ModifiedDate <= [Report].StartDate
)
AS [data]
WHERE
SequenceID = 1
)
,
ReportEndHistory AS
(
SELECT
*
FROM
(
SELECT
[Report].ReportID,
ROW_NUMBER() OVER (PARTITION BY [Report].ReportID, [History].AccountID ORDER BY [History].ModifiedDate DESC) AS SequenceID, -- DESC so that 1 is the latest row on or before EndDate
[History].*
FROM
Report AS [Report]
INNER JOIN
AccountHistory AS [History]
ON [History].ModifiedDate <= [Report].EndDate
)
AS [data]
WHERE
SequenceID = 1
)
SELECT
[Report].ReportID,
[Account].*,
ISNULL([ReportStartHistory].Cancelled, 0) AS StartDateCancelled,
ISNULL([ReportEndHistory].Cancelled, 0) AS EndDateCancelled
FROM
Report AS [Report]
INNER JOIN
ReportAccountDetail AS [Account]
ON [Account].ReportID = [Report].ReportID
LEFT JOIN
[ReportStartHistory]
ON [ReportStartHistory].ReportID = [Report].ReportID
AND [ReportStartHistory].AccountID = [Account].AccountID
LEFT JOIN
[ReportEndHistory]
ON [ReportEndHistory].ReportID = [Report].ReportID
AND [ReportEndHistory].AccountID = [Account].AccountID
Related
If I were to have a table such as the one below:
id_  last_updated_by
1    robot
1    human
1    robot
2    robot
3    robot
3    human
Using SQL, how could I group by the ID and create a new column to indicate whether a human has ever updated the record like this:
id_  last_updated_by  updated_by_human
1    robot            1
2    robot            0
3    robot            1
UPDATE
I'm currently doing the following, though I'm not sure how efficient it is: selecting the latest record and then joining it with my calculated column via a sub-select.
SELECT MAIN.TRANSACTION_ID,
MAIN.CREATED_DATE,
MAIN.CREATED_BY_USER_ID,
MAIN.OWNER_USER_ID,
STP.TOUCHED_BY_HUMAN
FROM (
SELECT TRANSACTION_ID,
CREATED_DATE,
CREATED_BY_USER_ID,
OWNER_USER_ID
FROM TABLE_NAME
WHERE CREATED_DATE >= CAST('{start_date} 00:00:00' AS TIMESTAMP)
AND CREATED_DATE <= CAST('{end_date} 23:59:59' AS TIMESTAMP)
QUALIFY row_number() OVER (partition by TRANSACTION_ID order by End_Dt desc) = 1
) MAIN
LEFT JOIN (
SELECT TRANSACTION_ID,
CASE
WHEN CREATED_BY_USER_ID IN ('ROBOT', 'MACHINE') OR
CREATED_BY_USER_ID LIKE 'N%' OR
CREATED_BY_USER_ID IS NULL
THEN 0
ELSE 1 END AS CREATED_BY_HUMAN,
CASE
WHEN OWNER_USER_ID IN ('ROBOT', 'MACHINE') OR
OWNER_USER_ID LIKE 'N%' OR
OWNER_USER_ID IS NULL
THEN 0
ELSE 1 END AS OWNED_BY_HUMAN,
CASE
WHEN CREATED_BY_HUMAN = 0 AND
OWNED_BY_HUMAN = 0
THEN 0
ELSE 1 END AS TOUCHED_BY_HUMAN
FROM TABLE_NAME
WHERE CREATED_DATE >= CAST('{start_date} 00:00:00' AS TIMESTAMP)
AND CREATED_DATE <= CAST('{end_date} 23:59:59' AS TIMESTAMP)
QUALIFY row_number() OVER (partition by TRANSACTION_ID order by TOUCHED_BY_HUMAN desc) = 1
) STP
ON MAIN.TRANSACTION_ID = STP.TRANSACTION_ID
If I'm following your problem, then something like this should work.
SELECT
t.*
,CASE WHEN a.id_ IS NOT NULL THEN 1 ELSE 0 END AS updated_by_human
FROM table_name t
LEFT JOIN (SELECT DISTINCT id_ FROM table_name WHERE last_updated_by = 'human') a ON t.id_ = a.id_
That takes care of the updated_by_human field, but if you also need to reduce the records in table (only keeping a subset) then you need more information to do that.
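To see what the LEFT JOIN against the distinct "human" ids produces on the sample rows from the question, here is a small sketch using Python's sqlite3 module. The table name table_name and the in-memory database are assumptions for illustration only.

```python
import sqlite3

# Hypothetical in-memory copy of the sample table from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_name (id_ INT, last_updated_by TEXT);
INSERT INTO table_name VALUES
    (1, 'robot'), (1, 'human'), (1, 'robot'),
    (2, 'robot'),
    (3, 'robot'), (3, 'human');
""")

# Every row keeps its own data; the flag comes from whether the id
# appears in the set of ids that a human has ever updated.
QUERY = """
SELECT t.id_, t.last_updated_by,
       CASE WHEN a.id_ IS NOT NULL THEN 1 ELSE 0 END AS updated_by_human
FROM table_name AS t
LEFT JOIN (SELECT DISTINCT id_ FROM table_name
           WHERE last_updated_by = 'human') AS a
       ON t.id_ = a.id_
ORDER BY t.id_
"""
rows = conn.execute(QUERY).fetchall()
flags = {id_: flag for id_, _, flag in rows}
print(flags)
```

Note that all six source rows are still returned; only the flag is shared per id. Reducing to one row per id would need an ordering column, which the sample data does not have.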
Exists clauses are usually not that performant but if your data isn't big this should work.
select id_,
IF (EXISTS (SELECT 1 FROM table_name t2 WHERE t2.last_updated_by = 'human' and t2.id_ = t1.id_), 1, 0) AS updated_by_human
from table_name t1;
Here is another way, which returns only the ids that a human has updated:
SELECT t1.id_
FROM table_name t1
GROUP BY t1.id_
HAVING COUNT(*) > 0
AND MAX(CASE t1.last_updated_by WHEN 'human' THEN 1 ELSE 0 END) = 1;
Since you didn't specify which column determines the newest record for a given id, I assume there is a column that tracks the insert/modify timestamp (which is pretty standard table design). Let's call it last_update_timestamp. (If you don't have one, I'd still urge you to add it, as an audit trail without a timestamp does not make sense.)
Given that your table is named updating_trail:
SELECT updating_trail.*, last_update_trail.modified_by_human
FROM updating_trail
INNER JOIN (
-- for each id_, determine the latest last_update_timestamp and a flag that is 1 if any record has last_updated_by = 'human'
SELECT updating_trail.id_, MAX(last_update_timestamp) AS most_recent_update_ts, MAX(CASE WHEN updating_trail.last_updated_by = 'human' THEN 1 ELSE 0 END) AS modified_by_human
FROM updating_trail
GROUP BY updating_trail.id_
) last_update_trail
ON updating_trail.id_ = last_update_trail.id_ AND updating_trail.last_update_timestamp = last_update_trail.most_recent_update_ts;
Gives:
id_  last_updated_by  last_update_timestamp     modified_by_human
1    robot            2021-10-19T20:00:00.000Z  1
2    robot            2021-10-19T17:00:00.000Z  0
3    robot            2021-10-19T16:00:00.000Z  1
Check out this sample db fiddle I created for you
This is a 1:1 translation of your query to conditional aggregation:
SELECT TRANSACTION_ID,
CREATED_DATE,
CREATED_BY_USER_ID,
OWNER_USER_ID,
Max(CASE
WHEN CREATED_BY_USER_ID IN ('ROBOT', 'MACHINE') OR
CREATED_BY_USER_ID LIKE 'N%' OR
CREATED_BY_USER_ID IS NULL
THEN 0
ELSE 1
END) Over (PARTITION BY TRANSACTION_ID) AS CREATED_BY_HUMAN
FROM Table_Name
WHERE CREATED_DATE >= Cast('{start_date} 00:00:00' AS TIMESTAMP)
AND CREATED_DATE <= Cast('{end_date} 23:59:59' AS TIMESTAMP)
QUALIFY Row_Number() Over (PARTITION BY TRANSACTION_ID ORDER BY End_Dt DESC) = 1
I have a simple select query with some joins like:
SELECT
[c].[column1]
, [c].[column2]
FROM [Customer] AS [c]
INNER JOIN ...
So I do a left join with my principal table as:
LEFT JOIN [Communication] AS [com] ON [c].[CustomerGuid] = [com].[ComGuid]
This relationship is 1 to *: one customer can have multiple communications.
So in my select I want to get the value 1 or 0 depending on a condition:
Condition:
If ComTypeKey (from the Communication table) has a row with value 3 and another row with value 4, return 1, otherwise 0.
So I tried something like:
SELECT
[c].[column1]
, [c].[column2]
, IIF([com].[ComTypeKey] = 3 AND [com].[ComTypeKey] = 4,1,0)
FROM [Customer] AS [c]
INNER JOIN ...
LEFT JOIN [Communication] AS [com] ON [c].[CustomerGuid] = [com].[ComGuid]
But it returns two rows, because there are 2 rows in Communication. What I want is a single row with value 1 if my condition is true.
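If you have multiple rows you need GROUP BY. Then count the distinct relevant keys: the flag is 1 only when both 3 and 4 are present, and 0 otherwise.
SELECT
[c].[column1]
, [c].[column2]
, CASE WHEN COUNT(DISTINCT CASE WHEN [com].[ComTypeKey] IN (3,4) THEN [com].[ComTypeKey] END) = 2 THEN 1 ELSE 0 END as FLAG_CONDITION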
FROM [Customer] AS [c]
INNER JOIN ...
LEFT JOIN [Communication] AS [com]
ON [c].[CustomerGuid] = [com].[ComGuid]
GROUP BY
[c].[column1]
, [c].[column2]
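A grouped flag for "has both key 3 and key 4" can be sketched as follows, using Python's sqlite3 module. The customers, GUID values and column contents are made up for illustration. The sketch counts distinct matching keys rather than matching rows, so a customer with two rows of key 3, or with no communications at all, gets 0.

```python
import sqlite3

# Hypothetical data: A has keys 3 and 4, B has key 3 twice, C has no communications.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (CustomerGuid TEXT, column1 TEXT);
CREATE TABLE Communication (ComGuid TEXT, ComTypeKey INT);
INSERT INTO Customer VALUES ('A', 'has 3 and 4'), ('B', 'has only 3'), ('C', 'no rows');
INSERT INTO Communication VALUES ('A', 3), ('A', 4), ('B', 3), ('B', 3);
""")

# One output row per customer; the flag is 1 only when both distinct keys are present.
QUERY = """
SELECT c.CustomerGuid,
       CASE WHEN COUNT(DISTINCT CASE WHEN com.ComTypeKey IN (3, 4)
                                     THEN com.ComTypeKey END) = 2
            THEN 1 ELSE 0 END AS flag_condition
FROM Customer AS c
LEFT JOIN Communication AS com ON c.CustomerGuid = com.ComGuid
GROUP BY c.CustomerGuid
ORDER BY c.CustomerGuid
"""
rows = conn.execute(QUERY).fetchall()
print(rows)
```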
I'm not really sure I understand.
This will literally find if both values 3 and 4 exist for that CustomerGuid, and only select one of them in that case - not filtering out any record otherwise.
If this is not what you want, providing sample data with the expected result would remove the ambiguity.
SELECT Field1,
Field2,
...
FieldN
FROM (SELECT TMP.*,
CASE WHEN hasBothValues = 1 THEN
ROW_NUMBER() OVER ( PARTITION BY CustomerGuid ORDER BY 1 )
ELSE 1
END AS interim_rn
FROM (SELECT TD.*,
MAX(CASE WHEN Value1 = '3' THEN 1 ELSE 0 END) OVER
( PARTITION BY CustomerGuid ) *
MAX(CASE WHEN Value1 = '4' THEN 1 ELSE 0 END) OVER
( PARTITION BY CustomerGuid ) AS hasBothValues
FROM TEST_DATA TD
) TMP
) TMP2
WHERE interim_rn = 1
I have written two queries, but I feel they are inefficient.
One prepares the data (the data originally came from an old FoxPro database and the dates etc. were nvarchars, so I convert them to dates etc.). The second collates all of the data ready to be exported to a CSV, which is eventually sent to a web service.
So the first query...
I have a table of people and a table of placements (a placement being a job that they have had). The placements table will have lots of different rows for a single person and I need only the latest (based on start and end date). Is the below the most efficient way of doing this?
PersonCode = unique id for the person, Code = unique id for the placement
SELECT * FROM Person c
LEFT JOIN
(
SELECT MAX(StartDate) AS StartDate, MAX(EndDate) AS EndDate, MAX(Code) AS Code, PersonCode
FROM PersonPlacement
GROUP BY PersonCode
) cp ON c.PersonCode = cp.PersonCode
LEFT JOIN PersonPlacement cp2 ON cp.Code = cp2.Code
So my second query is below...
The second query reads from the first query and needs to do the following:
Get only unique candidates based on last contact date (the original data had dupes)
Get the latest placement
Get Resume data
Only get people that are not currently in a job based on start and end date of placement
If they are in a job that is ending soon then show them
See query below...
SELECT *
FROM Pre_PersonView c
INNER JOIN (
SELECT PersonCode, Code, row_number() over(partition by PersonCode order by StartDate desc) as rn
FROM Pre_PersonView
) pj ON c.PersonCode = pj.PersonCode AND pj.rn = 1
LEFT JOIN Pre_PersonView cp ON pj.Code = cp.Code
INNER JOIN (
SELECT PersonCode, row_number() over(partition by PersonCode order by LastContactDate desc) as rn
FROM Person
) uc ON c.PersonCode = uc.PersonCode AND uc.rn = 1
LEFT JOIN [PersonResumeText] ct ON c.PersonId = ct.PersonId
WHERE c.PersonCode NOT IN
(
SELECT pcv.PersonCode
FROM Pre_PersonView pcv
WHERE pcv.Department IN ('x','y','z')
AND pcv.StartDate <= GETDATE()
AND (CASE WHEN pcv.EndDate = '1899-12-30' THEN GETDATE() + 1 ELSE pcv.EndDate END) > GETDATE()
)
AND DATEDIFF(DAY, ISNULL((CASE WHEN cp.StartDate = '0216-07-22' THEN '2016-07-22' ELSE cp.StartDate END), GETDATE() -365), ISNULL((CASE WHEN cp.EndDate = '1899-12-30' THEN NULL ELSE cp.EndDate END), GETDATE() + 1))
>
(CASE WHEN cp.Department IN ('x','y','z') THEN 365 ELSE 2 END)
Again, my question is: is this the most efficient way to do this?
I have a program with which my users can look up all the data traffic that happened in the last 7 days. I use a stored procedure to get that data, 250 records at a time (the user can page through it). The problem is that users got a lot of timeouts when they wanted to see that data.
Here is the stored procedure before I tried to optimize it.
@MaxRecCount INT,
@PageOffset INT,
@IncludeData BIT
SELECT [Client], [Schema], [Version], [Records], [Fetched], [Receipted], [ProvidedAt], [FetchedAt], [ReceiptedAt],[PacketIds], [Record] FROM (
SELECT TOP(@MaxRecCount) MAX(bai_ExportPendingArchive.[UserName]) AS Client,
MAX(bai_ExportPendingArchive.Category) AS [Schema],
MAX(bai_ExportPendingArchive.ContractVersion) AS [Version],
COUNT(*) AS [Records],
SUM (CASE WHEN bai_ExportPendingAckArchive.ExportPendingId IS NULL THEN 0 ELSE 1 END) as [Fetched],
SUM (CASE WHEN bai_ExportPendingAckArchive.Receipted IS NULL THEN 0 ELSE 1 END) as [Receipted],
MAX(bai_ExportArchive.Inserted) AS [ProvidedAt],
MAX(CASE WHEN bai_ExportPendingAckArchive.ExportPendingId IS NULL THEN NULL ELSE bai_ExportPendingAckArchive.Inserted END) AS [FetchedAt],
MAX(CASE WHEN bai_ExportPendingAckArchive.Receipted IS NULL THEN NULL ELSE bai_ExportPendingAckArchive.Receipted END) AS [ReceiptedAt],
bai_ExportArchive.PacketIds AS [PacketIds],
NULL AS [Record],
ROW_NUMBER() Over (Order By MAX(bai_ExportArchive.Inserted) desc) as [RowNumber]
FROM bai_ExportArchive
INNER JOIN bai_ExportPendingArchive ON bai_ExportArchive.Id = bai_ExportPendingArchive.ExportId
LEFT OUTER JOIN bai_ExportPendingAckArchive ON bai_ExportPendingAckArchive.ExportPendingId = bai_ExportPendingArchive.Id
GROUP BY bai_ExportPendingArchive.[UserName], bai_ExportArchive.PacketIds, bai_ExportPendingArchive.Category
) AS InnerTable WHERE RowNumber > (@PageOffset * @MaxRecCount) and RowNumber <= (@PageOffset * @MaxRecCount + @MaxRecCount)
ORDER BY RowNumber
@MaxRecCount, @PageOffset and @IncludeData are parameters that come from my C# method.
This version needed about 1:35 min to get the data I wanted. To make the stored procedure faster I inserted a WHERE clause to filter on the Inserted column (I also created an index on this column) and used OFFSET FETCH:
The stored procedure after the optimization:
@MaxRecCount INT,
@PageOffset INT,
@IncludeData BIT
DECLARE @pageStart INT
DECLARE @pageEnd INT
SET @pageStart = @PageOffset * @MaxRecCount
SET @pageEnd = @pageStart + @MaxRecCount + 50
IF @IncludeData = 0
BEGIN
SELECT [Client], [Schema], [Version], [Records], [Fetched], [Receipted], [ProvidedAt], [FetchedAt], [ReceiptedAt],[PacketIds], [Record] FROM (
SELECT TOP(@MaxRecCount) bai_ExportPendingArchive.[UserName] AS Client,
bai_ExportPendingArchive.Category AS [Schema],
MAX(bai_ExportPendingArchive.ContractVersion) AS [Version],
COUNT(*) AS [Records],
SUM (CASE WHEN bai_ExportPendingAckArchive.ExportPendingId IS NULL THEN 0 ELSE 1 END) as [Fetched],
SUM (CASE WHEN bai_ExportPendingAckArchive.Receipted IS NULL THEN 0 ELSE 1 END) as [Receipted],
MAX(bai_ExportArchive.Inserted) AS [ProvidedAt],
MAX(CASE WHEN bai_ExportPendingAckArchive.ExportPendingId IS NULL THEN NULL ELSE bai_ExportPendingAckArchive.Inserted END) AS [FetchedAt],
MAX(CASE WHEN bai_ExportPendingAckArchive.Receipted IS NULL THEN NULL ELSE bai_ExportPendingAckArchive.Receipted END) AS [ReceiptedAt],
bai_ExportArchive.PacketIds AS [PacketIds],
NULL AS [Record],
ROW_NUMBER() Over (Order By MAX(bai_ExportArchive.Inserted) desc) as [RowNumber]
FROM bai_ExportArchive
INNER JOIN bai_ExportPendingArchive ON bai_ExportArchive.Id = bai_ExportPendingArchive.ExportId
LEFT OUTER JOIN bai_ExportPendingAckArchive ON bai_ExportPendingAckArchive.ExportPendingId = bai_ExportPendingArchive.Id
Where bai_ExportArchive.Inserted <= (Select bai_ExportArchive.Inserted from bai_ExportArchive Order by bai_ExportArchive.Inserted DESC Offset @pageStart ROWS FETCH NEXT 1 ROWS Only)
And bai_ExportArchive.Inserted > (Select bai_ExportArchive.Inserted from bai_ExportArchive Order by bai_ExportArchive.Inserted DESC Offset @pageEnd ROWS FETCH NEXT 1 ROWS Only)
GROUP BY bai_ExportPendingArchive.[UserName], bai_ExportArchive.PacketIds, bai_ExportPendingArchive.Category
) AS InnerTable
ORDER BY RowNumber
This version gives me the data in about 2 s. The only problem is that I work on Microsoft SQL Server 2014, but my users use SQL Server 2008 and later. OFFSET FETCH doesn't work in SQL Server 2008, and now I'm clueless as to how I can optimize my stored procedure so that it is fast and works on SQL Server 2008.
I'm thankful for any help :)
Try this method to handle the pagination in SQL Server 2005/2008.
First use a CTE for your select query with a ROW_NUMBER() column to identify the record number/count. After that you can select a range of records from this CTE using your PAGE_NUMBER and PAGE_COUNT. Example is below
DECLARE @P_PAGE_NUM INT = 0
,@P_PAGE_SIZE INT = 20
;WITH CTE
AS
( /*SELECT ROW_NUMBER() OVER (ORDER BY COL_to_SORT DESC) AS [ROW_NO]
,...
WHERE ....
*/ -- You can replace your select query here, but column [ROW_NO] should be there in your select list.
--ie ROW_NUMBER() OVER (ORDER BY put_column-to-sort-here DESC) AS [ROW_NO]
)
SELECT *
--,( SELECT COUNT(*) FROM CTE) AS [TOTAL_ROW_COUNT]
FROM CTE
WHERE (
ISNULL(@P_PAGE_NUM,0) = 0 OR
[ROW_NO] BETWEEN ( @P_PAGE_NUM - 1) * @P_PAGE_SIZE + 1
AND @P_PAGE_NUM * @P_PAGE_SIZE
)
ORDER BY [ROW_NO]
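The CTE plus ROW_NUMBER() paging pattern can be tried out with a small sketch in Python's sqlite3 module (this needs an SQLite build with window functions, 3.25 or later). The traffic table, its columns and the fetch_page helper are hypothetical names for illustration; the page number is 1-based, as in the template above.

```python
import sqlite3

def fetch_page(conn, page_num, page_size):
    """Return one 1-based page of rows, newest first, via ROW_NUMBER() in a CTE."""
    query = """
    WITH cte AS (
        SELECT id, inserted,
               ROW_NUMBER() OVER (ORDER BY inserted DESC) AS row_no
        FROM traffic
    )
    SELECT id, row_no
    FROM cte
    WHERE row_no BETWEEN (:page - 1) * :size + 1 AND :page * :size
    ORDER BY row_no
    """
    return conn.execute(query, {"page": page_num, "size": page_size}).fetchall()

# Hypothetical table with ten rows dated 2024-01-01 through 2024-01-10.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traffic (id INTEGER PRIMARY KEY, inserted TEXT)")
conn.executemany(
    "INSERT INTO traffic (id, inserted) VALUES (?, ?)",
    [(i, f"2024-01-{i:02d}") for i in range(1, 11)],
)

page2 = fetch_page(conn, 2, 3)  # rows 4..6 of the newest-first ordering
print(page2)
```

On a large table the row numbering still has to rank every row that survives the WHERE clause inside the CTE, so filtering by date range before numbering is likely to matter as much as the paging syntax itself.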
I need to get the first and last record for a user if one of the key fields is different over time using a Hive table:
This is some sample data:
UserID EntryDate Activity
a3324 1/1/16 walk
a3324 1/2/16 walk
a3324 1/3/16 walk
a3324 1/4/16 run
a5613 1/1/16 walk
a5613 1/2/16 walk
a5613 1/3/16 walk
a5613 1/4/16 walk
And I'm looking for output preferably like this:
a3324 1/1/16 walk 1/4/16 run
Or at least like this:
a3324 walk run
I start writing code like this:
SELECT UserID, MINIMUM(EntryDate), MAXIMUM(EntryDate), Activity
FROM
SELECT UserID, DISTINCT Activity
GROUP BY UserID
HAVING Count(Activity) > 1
But I know that's not it.
I'd also like to be able to specify the cases where the original activity was Walk and the second activity was Run perhaps in the Where clause.
Can you help with an approach?
Thanks
You can use LAG / LEAD to get a solution:
SELECT * FROM (
select UserID, EntryDate, Activity,
lead(Activity, 1) over (partition by UserID order by EntryDate) as nextActivity
from table) as A
where Activity <> nextActivity
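For a quick check of the LEAD idea on the sample data, here is a sketch using Python's sqlite3 module (SQLite 3.25 or later for window functions) rather than Hive. The table name activity_log is an assumption, and the dates are rewritten in ISO format so they sort correctly as text.

```python
import sqlite3

# Hypothetical in-memory copy of the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activity_log (UserID TEXT, EntryDate TEXT, Activity TEXT);
INSERT INTO activity_log VALUES
    ('a3324', '2016-01-01', 'walk'), ('a3324', '2016-01-02', 'walk'),
    ('a3324', '2016-01-03', 'walk'), ('a3324', '2016-01-04', 'run'),
    ('a5613', '2016-01-01', 'walk'), ('a5613', '2016-01-02', 'walk'),
    ('a5613', '2016-01-03', 'walk'), ('a5613', '2016-01-04', 'walk');
""")

# Pair each row with the user's next activity and keep only the rows where it changes.
# The last row per user has a NULL next activity, so the <> comparison drops it.
QUERY = """
SELECT UserID, EntryDate, Activity, NextActivity
FROM (
    SELECT UserID, EntryDate, Activity,
           LEAD(Activity, 1) OVER (PARTITION BY UserID ORDER BY EntryDate) AS NextActivity
    FROM activity_log
) AS a
WHERE Activity <> NextActivity
"""
changes = conn.execute(QUERY).fetchall()
print(changes)
```

This reports every point where the activity changes, not just first versus last, so a user who goes walk, run, walk would produce two rows.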
SELECT
t.UserId
,MIN(CASE WHEN t.RowNumAsc = 1 THEN t.EntryDate END) as MinEntryDate
,MIN(CASE WHEN t.RowNumAsc = 1 THEN t.Activity END) as MinActivity
,MAX(CASE WHEN t.RowNumDesc = 1 THEN t.EntryDate END) as MaxEntryDate
,MAX(CASE WHEN t.RowNumDesc = 1 THEN t.Activity END) as MaxActivity
FROM
(
SELECT
UserId
,EntryDate
,Activity
,ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY EntryDate) as RowNumAsc
,ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY EntryDate DESC) as RowNumDesc
FROM
Table
) t
WHERE
t.RowNumAsc = 1
OR t.RowNumDesc = 1
GROUP BY
t.UserId
Looks like window functions are supported (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics) so using 2 row numbers 1 for EntryDate Ascending and another for Descending with Conditional Aggregation should get you to the answer.
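The two-row-number approach can be checked against the sample data with a sketch in Python's sqlite3 module (SQLite 3.25 or later). This is not Hive: the table name activity_log and the ISO-formatted dates are assumptions, and the outer filter that keeps only users whose first and last activity differ is an addition to match the question's desired output.

```python
import sqlite3

# Hypothetical in-memory copy of the sample data from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activity_log (UserID TEXT, EntryDate TEXT, Activity TEXT);
INSERT INTO activity_log VALUES
    ('a3324', '2016-01-01', 'walk'), ('a3324', '2016-01-02', 'walk'),
    ('a3324', '2016-01-03', 'walk'), ('a3324', '2016-01-04', 'run'),
    ('a5613', '2016-01-01', 'walk'), ('a5613', '2016-01-02', 'walk'),
    ('a5613', '2016-01-03', 'walk'), ('a5613', '2016-01-04', 'walk');
""")

# Number each user's rows from both ends, collapse the first and last row
# into one line per user, then keep users whose activity differs.
QUERY = """
SELECT UserID, FirstDate, FirstActivity, LastDate, LastActivity
FROM (
    SELECT UserID,
           MIN(CASE WHEN rn_asc = 1 THEN EntryDate END) AS FirstDate,
           MIN(CASE WHEN rn_asc = 1 THEN Activity END) AS FirstActivity,
           MAX(CASE WHEN rn_desc = 1 THEN EntryDate END) AS LastDate,
           MAX(CASE WHEN rn_desc = 1 THEN Activity END) AS LastActivity
    FROM (
        SELECT UserID, EntryDate, Activity,
               ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY EntryDate) AS rn_asc,
               ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY EntryDate DESC) AS rn_desc
        FROM activity_log
    ) AS t
    WHERE rn_asc = 1 OR rn_desc = 1
    GROUP BY UserID
) AS flat
WHERE FirstActivity <> LastActivity
"""
result = conn.execute(QUERY).fetchall()
print(result)
```

To restrict to the walk-then-run case mentioned in the question, the outer WHERE could instead compare FirstActivity to 'walk' and LastActivity to 'run'.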
And if you don't want to use Analytic Functions (window functions) you can use self left joins and conditional aggregation:
SELECT
t.UserId
,MIN(CASE WHEN mn.UserId IS NULL THEN t.EntryDate END) as MinEntryDate
,MIN(CASE WHEN mn.UserId IS NULL THEN t.Activity END) as MinActivity
,MAX(CASE WHEN mx.UserId IS NULL THEN t.EntryDate END) as MaxEntryDate
,MAX(CASE WHEN mx.UserId IS NULL THEN t.Activity END) as MaxActivity
FROM
Table t
LEFT JOIN Table mn
ON t.UserId = mn.UserId
AND t.EntryDate > mn.EntryDate
LEFT JOIN Table mx
ON t.UserId = mx.UserId
AND t.EntryDate < mx.EntryDate
WHERE
mn.UserId IS NULL
OR mx.UserId IS NULL
GROUP BY
t.UserId
Or a correlated Sub Query way:
SELECT
UserId
,MIN(EntryDate) as MinEntryDate
,(SELECT
Activity
FROM
Activity a
WHERE
u.UserId = a.UserId
AND a.EntryDate = MIN(u.EntryDate)
LIMIT 1
) as MinActivity
,MAX(EntryDate) as MaxEntryDate
,(SELECT
Activity
FROM
Activity a
WHERE
u.UserId = a.UserId
AND a.EntryDate = MAX(u.EntryDate)
LIMIT 1
) as MaxActivity
FROM
Activity u
GROUP BY
UserId