Getting around BigQuery subquery & apply limitations - google-bigquery

I have a SQL Server query that I'm trying to convert to run in BigQuery. There are three tables involved:
CalendarMonths
FirstDayOfMonth | FirstDayOfNextMonth
----------------------------+----------------------------
2017-02-01 00:00:00.000 UTC | 2017-03-01 00:00:00.000 UTC
2017-03-01 00:00:00.000 UTC | 2017-04-01 00:00:00.000 UTC
Clients
clientid | name | etc.
---------+----------------+------
1 | Bob's Shop |
2 | Anne's Cookies |
ClientLogs
id | clientid | timestamp | price_current | price_old | license_count_current | license_count_old |
----+----------+----------------+---------------+-----------+-----------------------+---------------
1 | 1 | 2017-02-01 UTC | 1200 | 0 | 10 | 0 |
2 | 1 | 2018-02-03 UTC | 2400 | 1200 | 20 | 10 |
3 | 2 | 2016-07-13 UTC | 1200 | 0 | 10 | 0 |
4 | 2 | 2018-03-30 UTC | 0 | 1200 | 0 | 10 |
The T-SQL query looks something like this:
SELECT
FirstDayOfMonth, FirstDayOfNextMonth,
(SELECT SUM(sizeatdatelog.price_current)
FROM clients c
CROSS APPLY (SELECT TOP 1 *
FROM clientlogs
WHERE clientid = c.clientid
AND [timestamp] < cm.FirstDayOfMonth
ORDER BY [timestamp] DESC) sizeatdatelog
WHERE sizeatdatelog.license_count_current > 0) as StartingRevenue,
(another subquery for starting client count) as StartingClientCount,
(another subquery for churned revenue) as ChurnedRevenue,
(there are about 6 other subqueries)
FROM
CalendarMonths cm
ORDER BY
cm.FirstDayOfMonth
And the final output looks like:
FirstDayOfMonth | FirstDayOfNextMonth | StartingRevenue | StartingClientCount | etc
-------------------------------------------------------------------------------------------------------
2017-02-01 00:00:00.000 UTC | 2017-03-01 00:00:00.000 UTC | 68382995.43 | 79430 |
2017-03-01 00:00:00.000 UTC | 2017-04-01 00:00:00.000 UTC | 69843625.12 | 80430 |
In BigQuery, I added a simple subquery in the select clause and it worked great:
SELECT FirstDayOfMonth, FirstDayOfNextMonth, (SELECT clientId FROM clientlogs LIMIT 1 ) as cl
FROM CalendarMonths cm
ORDER BY cm.FirstDayOfMonth
However, as soon as I add a where clause to the subquery, I get this error message:
Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
How should I proceed from this point? If I can't get the results I'm looking for in one query, maybe I should create multiple scheduled jobs that build temporary tables, plus a final scheduled job that joins them all together. Or maybe I could do this in code on GCP, or call the BigQuery API from Apps Script. The data size isn't huge and the query isn't run often. I care more about maintainability than efficiency, so ideally there is a way to get this data in one query.

The query below is for BigQuery Standard SQL:
#standardSQL
SELECT FirstDayOfMonth, FirstDayOfNextMonth,
SUM(price_current) StartingRevenue, COUNT(1) StartingClientCount
FROM (
SELECT FirstDayOfMonth, FirstDayOfNextMonth,
clientid, price_current
FROM (
SELECT FirstDayOfMonth, FirstDayOfNextMonth, clientid,
FIRST_VALUE(price_current) OVER(latest_values) price_current,
FIRST_VALUE(license_count_current) OVER(latest_values) license_count_current
FROM `project.dataset.CalendarMonths` cm
JOIN `project.dataset.ClientLogs` cl
ON `timestamp` < FirstDayOfMonth
WINDOW latest_values AS (PARTITION BY FirstDayOfMonth, clientid ORDER BY `timestamp` DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
)
WHERE license_count_current > 0
GROUP BY FirstDayOfMonth, FirstDayOfNextMonth, clientid, price_current
)
GROUP BY FirstDayOfMonth, FirstDayOfNextMonth
ORDER BY FirstDayOfMonth
Most likely the above can be extended to cover the rest of your subqueries.
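The idea behind the query above (for each month, take each client's latest log row before the month starts, then aggregate the active ones) can be sketched in plain Python. This is only an illustration on the sample rows from the question; the function and variable names are made up for the sketch and are not part of BigQuery.

```python
from datetime import date

# Sample rows from the question:
# (clientid, timestamp, price_current, license_count_current)
client_logs = [
    (1, date(2017, 2, 1), 1200, 10),
    (1, date(2018, 2, 3), 2400, 20),
    (2, date(2016, 7, 13), 1200, 10),
    (2, date(2018, 3, 30), 0, 0),
]
month_starts = [date(2017, 2, 1), date(2017, 3, 1)]


def starting_metrics(month_start, logs):
    """Return (StartingRevenue, StartingClientCount) for one month."""
    latest = {}  # clientid -> most recent log row strictly before month_start
    for row in logs:
        clientid, ts = row[0], row[1]
        if ts < month_start and (clientid not in latest or ts > latest[clientid][1]):
            latest[clientid] = row
    # Keep only clients that still hold licenses (license_count_current > 0)
    active = [r for r in latest.values() if r[3] > 0]
    return sum(r[2] for r in active), len(active)


results = {m: starting_metrics(m, client_logs) for m in month_starts}
for month, (revenue, clients) in results.items():
    print(month, revenue, clients)
```

Note that the "latest row" is chosen per client and per month, which is why the window in the SQL has to be partitioned by both.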

A correlated subquery like
SELECT TOP 1 *
FROM clientlogs
WHERE clientid = c.clientid
AND [timestamp] < cm.FirstDayOfMonth
ORDER BY [timestamp] DESC
in BigQuery usually needs to be rewritten as an aggregation along the lines of
SELECT ARRAY_AGG(foo ORDER BY foo.timestamp DESC LIMIT 1)[OFFSET(0)]
FROM ... AS foo
WHERE {correlated condition}
BigQuery is more likely to accept simple correlated subqueries of the form
SELECT
{optional aggregation}
FROM table
WHERE {correlated condition}

For the sake of the community I'm posting the query I ended up using. Huge thanks to Mikhail Berlyant for his help with this one.
I ended up breaking the query into CTEs so I could use correlated subqueries to get the specific data I needed.
WITH previousMonths AS (
SELECT *
FROM (
SELECT FirstDayOfMonth, FirstDayOfNextMonth, account_c,
FIRST_VALUE(acl.timestamp_c ) OVER (start_values) timestamp_c,
FIRST_VALUE(acl.acv_current_c ) OVER (start_values) acv_current_c,
FIRST_VALUE(acl.license_count_current_c) OVER(start_values) license_count_current_c,
FIRST_VALUE(acl.price_current_c) OVER (start_values) price_current_c
FROM warehouse.project.calendar_months cm
JOIN warehouse.project.account_change_logs acl ON timestamp_c < FirstDayOfMonth
WINDOW start_values AS (PARTITION BY account_c, FirstDayOfMonth ORDER BY timestamp_c DESC)
)
GROUP BY FirstDayOfMonth, FirstDayOfNextMonth, account_c,
timestamp_c, acv_current_c, license_count_current_c, price_current_c
),
currentMonth AS (
SELECT *
FROM (
SELECT FirstDayOfMonth, FirstDayOfNextMonth, account_c,
FIRST_VALUE(acl.timestamp_c ) OVER (change_values) timestamp_c,
FIRST_VALUE(acl.acv_current_c ) OVER (change_values) acv_current_c,
FIRST_VALUE(acl.license_count_current_c) OVER(change_values) license_count_current_c,
FIRST_VALUE(acl.acv_old_c) OVER(PARTITION BY account_c, FirstDayOfMonth ORDER BY timestamp_c) acv_old_at_start_of_month_c,
FIRST_VALUE(acl.license_count_old_c) OVER(PARTITION BY account_c, FirstDayOfMonth ORDER BY timestamp_c) license_count_old_at_start_of_month_c,
FIRST_VALUE(acl.price_current_c) OVER (change_values) price_current_c
FROM warehouse.project.calendar_months cm
JOIN warehouse.project.account_change_logs acl
ON timestamp_c >= FirstDayOfMonth AND timestamp_c < FirstDayOfNextMonth
WINDOW change_values AS (PARTITION BY account_c, FirstDayOfMonth ORDER BY timestamp_c DESC)
)
GROUP BY FirstDayOfMonth, FirstDayOfNextMonth, account_c,
timestamp_c, acv_current_c, acv_old_at_start_of_month_c, license_count_current_c,
license_count_old_at_start_of_month_c, price_current_c
)
SELECT FirstDayOfMonth, FirstDayOfNextMonth,
(SELECT COUNT(acv_current_c) FROM previousMonths pm WHERE pm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_current_c > 0) as StartingAccounts,
(SELECT COUNT(acv_current_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_old_at_start_of_month_c = 0 AND license_count_current_c > 0) as NewAccounts,
(SELECT COUNT(acv_current_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_current_c = 0) as ChurnAccounts,
(SELECT SUM(license_count_current_c) FROM previousMonths pm WHERE pm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_current_c > 0) as StartingUsers,
(SELECT SUM(license_count_current_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_old_at_start_of_month_c = 0 AND license_count_current_c > 0) as NewUsers,
(SELECT SUM(license_count_current_c - license_count_old_at_start_of_month_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_old_at_start_of_month_c < license_count_current_c
AND license_count_old_at_start_of_month_c <> 0) as ExpansionUsers,
(SELECT SUM(license_count_old_at_start_of_month_c - license_count_current_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_old_at_start_of_month_c > license_count_current_c
AND license_count_current_c <> 0) as ContractionUsers,
(SELECT SUM(license_count_old_at_start_of_month_c - license_count_current_c) FROM currentMonth cm WHERE cm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_old_at_start_of_month_c > license_count_current_c
AND license_count_current_c = 0) as ChurnUsers,
(SELECT SUM(acv_current_c) FROM previousMonths pm WHERE pm.FirstDayOfMonth = cal.FirstDayOfMonth
AND license_count_current_c > 0) as StartingARR
--etc, etc,
FROM warehouse.project.calendar_months cal
ORDER BY FirstDayOfMonth
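The conditions in the subqueries above amount to classifying each account by its license count at the start of the month versus its latest count within the month. A minimal Python sketch of that classification follows. The function name, the "unchanged" bucket and the sample pairs are hypothetical, and "churn" here follows the ChurnUsers condition (it requires a non-zero starting count), which is slightly stricter than the ChurnAccounts condition.

```python
def classify(old_at_start, current):
    """Classify one account's monthly change from its license counts."""
    if old_at_start == 0 and current > 0:
        return "new"
    if current == 0 and old_at_start > 0:
        return "churn"
    if 0 < old_at_start < current:
        return "expansion"
    if old_at_start > current > 0:
        return "contraction"
    return "unchanged"


# Hypothetical (license_count_old_at_start_of_month, license_count_current) pairs
accounts = [(0, 10), (10, 20), (20, 5), (10, 0), (7, 7)]

user_totals = {}
for old, cur in accounts:
    kind = classify(old, cur)
    # Number of users gained or lost by this account during the month
    user_totals[kind] = user_totals.get(kind, 0) + abs(cur - old)

print(user_totals)
```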

Related

Obtain Name Column Based on Value

I have a table that calculates the number of associated records that fit a criteria for each parent record. See example below:
note - morning, afternoon and evening are only weekdays
| id | morning | afternoon | evening | weekend |
| -- | ------- | --------- | ------- | ------- |
| 1 | 0 | 2 | 3 | 1 |
| 2 | 2 | 9 | 4 | 6 |
What I am trying to achieve is to determine which columns have the lowest value and get their column name as such:
| id | time_of_day |
| -- | ----------- |
| 1 | morning |
| 2 | afternoon |
Here is my current SQL code to result in the first table:
SELECT
leads.id,
COALESCE(morning, 0) morning,
COALESCE(afternoon, 0) afternoon,
COALESCE(evening, 0) evening,
COALESCE(weekend, 0) weekend
FROM leads
LEFT OUTER JOIN (
SELECT DISTINCT ON (lead_id) lead_id, COUNT(*) AS morning
FROM lead_activities
WHERE lead_activities.modality = 'Call' AND lead_activities.bound_type = 'outbound' AND extract('dow' from created_at) IN (0,1,2,3,4,5) AND (extract('hour' from created_at) >= 0 AND extract('hour' from created_at) < 12)
GROUP BY lead_id
) morning ON morning.lead_id = leads.id
LEFT OUTER JOIN (
SELECT DISTINCT ON (lead_id) lead_id, COUNT(*) AS afternoon
FROM lead_activities
WHERE lead_activities.modality = 'Call' AND lead_activities.bound_type = 'outbound' AND extract('dow' from created_at) IN (0,1,2,3,4,5) AND (extract('hour' from created_at) >= 12 AND extract('hour' from created_at) < 17)
GROUP BY lead_id
) afternoon ON afternoon.lead_id = leads.id
LEFT OUTER JOIN (
SELECT DISTINCT ON (lead_id) lead_id, COUNT(*) AS evening
FROM lead_activities
WHERE lead_activities.modality = 'Call' AND lead_activities.bound_type = 'outbound' AND extract('dow' from created_at) IN (0,1,2,3,4,5) AND (extract('hour' from created_at) >= 17 AND extract('hour' from created_at) < 25)
GROUP BY lead_id
) evening ON evening.lead_id = leads.id
LEFT OUTER JOIN (
SELECT DISTINCT ON (lead_id) lead_id, COUNT(*) AS weekend
FROM lead_activities
WHERE lead_activities.modality = 'Call' AND lead_activities.bound_type = 'outbound' AND extract('dow' from created_at) IN (6,7)
GROUP BY lead_id
) weekend ON weekend.lead_id = leads.id
You can use CASE/WHEN/ELSE to check for the specific conditions and produce different values. For example:
with
q as (
-- your query here
)
select
id,
case
when morning <= least(afternoon, evening, weekend) then 'morning'
when afternoon <= least(morning, evening, weekend) then 'afternoon'
when evening <= least(morning, afternoon, weekend) then 'evening'
else 'weekend'
end as time_of_day
from q
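For comparison, the same "name of the smallest column" logic can be sketched in Python. The first row is taken from the question; the second row is a made-up tie case. As with the CASE expression, ties go to the column listed first.

```python
COLUMNS = ("morning", "afternoon", "evening", "weekend")


def lowest_column(row):
    # min() returns the first minimal item, so ties go to the earlier column,
    # mirroring the order of the WHEN branches in the CASE expression
    return min(COLUMNS, key=lambda name: row[name])


rows = [
    {"id": 1, "morning": 0, "afternoon": 2, "evening": 3, "weekend": 1},
    {"id": 99, "morning": 5, "afternoon": 4, "evening": 4, "weekend": 9},  # hypothetical tie
]

time_of_day = {row["id"]: lowest_column(row) for row in rows}
print(time_of_day)
```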

How to get daily budget based on monthly budget and workings days

I have 2 tables: one with the monthly budget and one with working days.
What I want is to work out the daily budget from the monthly budget and the number of working days.
Example:
August has a budget of 1000 and 21 working days.
September has a budget of 2000 and 23 working days.
I want to figure out the total budget between two dates,
e.g. between 2020-08-02 and 2020-09-15.
But I must be sure that days in August take their budget from August, days in September take theirs from September, etc.
tbBudget:
Date | Amount
2020-08-01 | 1000
2020-09-01 | 2000
2020-10-01 | 3000
tbWorkingDays
Date | WorkingDay
2020-08-01 | 0
2020-08-02 | 0
2020-08-03 | 1
2020-08-04 | 1
2020-08-05 | 1
2020-08-06 | 1
2020-08-07 | 1
2020-08-08 | 1
...
2020-09-01 | 1
2020-09-02 | 1
2020-09-03 | 0
2020-09-04 | 1
...
2020-10-01 | 1
2020-10-02 | 0
2020-10-03 | 1
2020-10-04 | 1
I have no idea how to solve this issue. Can you help me?
My result should be like:
Date | WorkingDay | BudgetAmount
2020-08-02 | 0 | 0.0
2020-08-03 | 1 | 47.6
2020-08-04 | 1 | 47.6
2020-08-05 | 1 | 47.6
..
2020-09-13 | 1 | 86.9
2020-09-14 | 1 | 86.9
2020-09-15 | 1 | 86.9
Using CTE and group by:
with CTE1 AS(
SELECT FORMAT(A.DATE, 'MMyyyy') DATE, B.AMOUNT, SUM(CASE WHEN [WorkingDay] = 1 THEN 1 ELSE 0 END) AS TOTAL_WORKING_DAYS
FROM tbWorkingDays A INNER JOIN tbBudget B
ON (FORMAT(A.DATE, 'MMyyyy') = FORMAT(B.DATE, 'MMyyyy')) GROUP BY FORMAT(A.[DATE], 'MMyyyy'), B.AMOUNT
)
SELECT A.DATE,
A.WORKINGDAY,
CASE WHEN A.WORKINGDAY = 1 THEN B.AMOUNT/B.TOTAL_WORKING_DAYS
ELSE 0 END AS BudgetAmount
FROM CTE1 B
INNER JOIN
tbWorkingDays A
ON (FORMAT(A.DATE, 'MMyyyy') = B.DATE);
Assuming that the budgets are by month:
select wd.*,
(case when wd.workingday = 0 then 0
else b.amount * 1.0 / sum(wd.workingday) over (partition by b.date)
end) as daily_amount
from tbWorkingDays wd join
tbBudget b
on wd.date >= b.date and wd.date < dateadd(month, 1, b.date);
If the budget dates are not per month, then use apply instead:
select wd.*,
(case when wd.workingday = 0 then 0
else b.amount * 1.0 / sum(wd.workingday) over (partition by b.date)
end) as daily_amount
from tbWorkingDays wd cross apply
(select top (1) b.*
from tbBudget b
where wd.date >= b.date
order by b.date desc
) b;
Use SUM as an analytic (window) function to get the number of working days per month, then divide the monthly amount by it.
Here is a functioning solution:
with tally as
(
SELECT
row_number() over (order by (select null))-1 n
from (values (null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null)) a(a)
cross join (values (null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null)) b(b)
cross join (values (null),(null),(null),(null),(null),(null),(null),(null),(null),(null),(null)) c(c)
)
, tbWorkingDays as
(
select
cast(dateadd(day,n,'2020-01-01') as date) [Date],
iif(DATEPART(WEEKDAY,cast(dateadd(day,n,'2020-01-01') as date)) in (1,7),0,1) WorkingDay
from tally
where n<365
)
, tbBudget AS
(
select * from
(values
(cast('2020-08-01' as date), cast(1000 as decimal(19,2)))
,(cast('2020-09-01' as date), cast(2000 as decimal(19,2)))
,(cast('2020-10-01' as date), cast(3000 as decimal(19,2)))
) a([Date],[Amount])
)
select
a.[Date]
,a.WorkingDay*
(b.Amount/
sum(a.WorkingDay) over (partition by year(a.Date)*100+month(a.Date)))
from tbWorkingDays a
inner join tbBudget b
on a.Date between b.Date and dateadd(day,-1,dateadd(month,1,b.date))
The work is done here:
select
a.[Date]
,a.WorkingDay*
(b.Amount/
sum(a.WorkingDay) over (partition by year(a.Date)*100+month(a.Date)))
from tbWorkingDays a
inner join tbBudget b
on a.Date between b.Date and dateadd(day,-1,dateadd(month,1,b.date))
The expression
sum(a.WorkingDay) over (partition by year(a.Date)*100+month(a.Date))
sums the number of working days for the current month. I then join against the budget, take the budget amount for the month, and divide it by the expression above.
To make sure there is budget only on working days, I simply multiply by WorkingDay: since 0 marks a non-working day, the result is 0 for all non-working days.
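The arithmetic can be checked with a short Python sketch. It assumes, as the generated tbWorkingDays above does, that Monday to Friday are working days; the real table in the question may mark days differently, and all names below are made up for the illustration.

```python
from datetime import date, timedelta

# Monthly budgets keyed by (year, month); values taken from the question
monthly_budget = {(2020, 8): 1000.0, (2020, 9): 2000.0}


def is_working_day(d):
    # Assumption: Monday-Friday are working days (weekday() is 0..4)
    return d.weekday() < 5


def days_in_range(start, end):
    """Yield every date from start to end, inclusive."""
    d = start
    while d <= end:
        yield d
        d += timedelta(days=1)


def working_days_in_month(year, month):
    first = date(year, month, 1)
    first_of_next = date(year + (month == 12), month % 12 + 1, 1)
    last = first_of_next - timedelta(days=1)
    return sum(is_working_day(d) for d in days_in_range(first, last))


def daily_budget(d):
    # Non-working days get nothing; working days share the month's budget evenly
    if not is_working_day(d):
        return 0.0
    return monthly_budget[(d.year, d.month)] / working_days_in_month(d.year, d.month)


total = sum(daily_budget(d) for d in days_in_range(date(2020, 8, 2), date(2020, 9, 15)))
print(round(total, 2))
```

Each day draws on the budget of its own month, so a range that spans August and September mixes two different daily rates.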

SQL Select Statement for Time and attendance for a month

Can anyone help with this one, please? Our attendance system generates the following data:
Empid Department Timestamp Read_ID
3221 IT 2017-01-29 11:12:00.000 1
5565 IT 2017-01-29 12:28:06.000 1
5565 IT 2017-01-29 12:28:07.000 1
3221 IT 2017-01-29 13:12:00.000 2
5565 IT 2017-01-29 13:28:06.000 2
3221 IT 2017-01-30 07:42:15.000 1
3221 IT 2017-01-30 16:16:15.000 2
3221 IT 2017-01-31 09:05:00.000 1
3221 IT 2017-01-31 11:05:00.000 2
3221 IT 2017-01-31 13:20:00.000 1
3221 IT 2017-01-31 16:10:00.000 2
Where the Read_ID values are:
1 = Entry
2 = Exit
I'm looking for a SQL query to run on MS SQL Server 2014 that summarizes attendance time for each employee on a monthly basis, for instance:
Empid Department Year Month TotalHours
3221 IT 2017 1 15:24
5565 IT 2017 1 01:00
This query should give you the result you need. It works by selecting each entry and joining it with the next exit of the same employee (entries without a later exit are ignored); this gives the duration of each shift. The results are then grouped, and the shift durations are summed in each group.
SELECT
t1.empid,
t1.department,
YEAR(t1.timestamp) Year,
MONTH(t1.timestamp) Month,
CONVERT(
varchar(12),
DATEADD(minute, SUM(DATEDIFF(minute, t1.timestamp, t2.timestamp)), 0),
114
) TotalHours
FROM
mytable t1
INNER JOIN mytable t2
ON t1.empid = t2.empid
AND t2.read_id = 2
AND t2.timestamp = (
SELECT MIN(timestamp)
FROM mytable
WHERE
read_id = 2
AND empid = t2.empid
AND timestamp > t1.timestamp
)
WHERE
t1.read_id = 1
GROUP BY t1.empid, t1.department, YEAR(t1.timestamp), MONTH(t1.timestamp)
ORDER BY 1, 2, 3, 4
Returns :
empid | department | Year | Month | TotalHours
----: | :--------- | ---: | ----: | :-----------
3221 | IT | 2017 | 1 | 15:24:00:000
5565 | IT | 2017 | 1 | 02:00:00:000
There is an edge case, however, where an employee enters twice and then exits. This happens in your data: employee 5565 enters at 29/01/2017 12:28:06 and again at 12:28:07, and then exits at 29/01/2017 13:28:06. The above query takes both overlapping entries into account and maps them to the same exit, so this hour of work is counted twice.
That gives 02:00 for this employee rather than the 01:00 in your expected results. Here is an alternative query that, if the same employee has several consecutive entries, only takes the latest one into account:
SELECT
t1.empid,
t1.department,
YEAR(t1.timestamp) Year,
MONTH(t1.timestamp) Month,
CONVERT(
varchar(12),
DATEADD(minute, SUM(DATEDIFF(minute, t1.timestamp, t2.timestamp)), 0),
114
) TotalHours
FROM
mytable t1
INNER JOIN mytable t2
ON t1.empid = t2.empid
AND t2.read_id = 2
AND t2.timestamp = (
SELECT MIN(timestamp)
FROM mytable
WHERE
read_id = 2
AND empid = t2.empid
AND timestamp > t1.timestamp
)
WHERE
t1.read_id = 1
AND NOT EXISTS (
SELECT 1
FROM mytable
WHERE
read_id = 1
AND empid = t1.empid
AND timestamp > t1.timestamp
AND timestamp < t2.timestamp
)
GROUP BY t1.empid, t1.department, YEAR(t1.timestamp), MONTH(t1.timestamp)
ORDER BY 1, 2, 3, 4
Returns :
empid | department | Year | Month | TotalHours
----: | :--------- | ---: | ----: | :-----------
3221 | IT | 2017 | 1 | 15:24:00:000
5565 | IT | 2017 | 1 | 01:00:00:000
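To make the difference between the two queries concrete, here is a rough Python sketch of the same pairing logic run on the rows from the question. The function name and the latest_entry_only flag are invented for the illustration. It sums exact seconds, whereas DATEDIFF(minute, ...) counts minute boundaries, so the figures for employee 5565 differ by a second from the SQL output.

```python
from collections import defaultdict
from datetime import datetime

ENTRY, EXIT = 1, 2

# (empid, timestamp, read_id) rows from the question
punches = [
    (3221, "2017-01-29 11:12:00", 1), (5565, "2017-01-29 12:28:06", 1),
    (5565, "2017-01-29 12:28:07", 1), (3221, "2017-01-29 13:12:00", 2),
    (5565, "2017-01-29 13:28:06", 2), (3221, "2017-01-30 07:42:15", 1),
    (3221, "2017-01-30 16:16:15", 2), (3221, "2017-01-31 09:05:00", 1),
    (3221, "2017-01-31 11:05:00", 2), (3221, "2017-01-31 13:20:00", 1),
    (3221, "2017-01-31 16:10:00", 2),
]


def worked_seconds(rows, latest_entry_only=False):
    """Sum entry -> next exit durations per (empid, year, month)."""
    totals = defaultdict(int)
    pending = defaultdict(list)  # empid -> entries still waiting for an exit
    parsed = [(e, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), r) for e, ts, r in rows]
    for empid, ts, read_id in sorted(parsed, key=lambda p: p[1]):
        if read_id == ENTRY:
            if latest_entry_only:
                pending[empid] = [ts]      # a newer entry replaces older ones
            else:
                pending[empid].append(ts)  # every entry pairs with the next exit
        elif read_id == EXIT:
            # Exits with no pending entry contribute nothing
            for entry in pending.pop(empid, []):
                totals[(empid, entry.year, entry.month)] += int((ts - entry).total_seconds())
    return dict(totals)


first_query = worked_seconds(punches)
second_query = worked_seconds(punches, latest_entry_only=True)
print(first_query)
print(second_query)
```

Entries that never get a later exit stay in the pending list and are never counted, which matches the INNER JOIN behaviour described above.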
Try this. I was not sure what time format would satisfy your system, so I put both:
SELECT * INTO #Tbl3 FROM (VALUES
(3221,'IT','2017-01-29 11:12:00.000',1),
(5565,'IT','2017-01-29 12:28:06.000',1),
(5565,'IT','2017-01-29 12:28:07.000',1),
(3221,'IT','2017-01-29 13:12:00.000',2),
(5565,'IT','2017-01-29 13:28:06.000',2),
(3221,'IT','2017-01-30 07:42:15.000',1),
(3221,'IT','2017-01-30 16:16:15.000',2),
(3221,'IT','2017-01-31 09:05:00.000',1),
(3221,'IT','2017-01-31 11:05:00.000',2),
(3221,'IT','2017-01-31 13:20:00.000',1),
(3221,'IT','2017-01-31 16:10:00.000',2))
x (Empid,Department,Timestamp,Read_ID)
;With cte as (
SELECT t1.Empid, t1.Department
, [Year] = Year(t1.Timestamp)
, [Month] = Month(t1.Timestamp)
, Seconds = SUM(DATEDIFF(second, t1.Timestamp, t2.Timestamp))
FROM #Tbl3 as t1
OUTER APPLY (
SELECT Timestamp = MIN(t.Timestamp)
FROM #Tbl3 as t
WHERE t.Department = t1.Department and t.Empid = t1.Empid
and t.Timestamp > t1.Timestamp and t.Read_ID = 2
) as t2
WHERE t1.Read_ID = 1
GROUP BY t1.Empid, t1.Department, Year(t1.Timestamp), Month(t1.Timestamp))
SELECT *, TotalHours = Seconds / 3600., TotalTime =
RIGHT('0'+CAST(Seconds / 3600 as VARCHAR),2) + ':' +
RIGHT('0'+CAST((Seconds % 3600) / 60 as VARCHAR),2) + ':' +
RIGHT('0'+CAST(Seconds % 60 as VARCHAR),2)
FROM cte;
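One thing to watch with time formatting is totals that grow large: RIGHT('0' + ..., 2) keeps only two digits of the hour count, and a time-style CONVERT of a datetime wraps around after 24 hours. A small Python sketch of a formatter that does not truncate hours is shown below; the function name is hypothetical.

```python
def format_hms(total_seconds):
    """Format a non-negative number of seconds as HH:MM:SS, without capping hours."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


print(format_hms(55440))          # total for employee 3221 in the sample data
print(format_hms(100 * 3600 + 5))  # more than 99 hours still prints in full
```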

Selecting a single row in the same table/view if a query returns no results

I have the following view in my SQL database, which selects data from a Transaction table and a Customer table:
+-------+-----------+---------------------+--------+
| RowNo | Name | Date | Amount |
+-------+-----------+---------------------+--------+
| 1 | Customer1 | 2018-11-10 01:00:00 | 55.49 |
| 2 | Customer2 | 2018-11-10 02:00:00 | 58.15 |
| 3 | Customer3 | 2018-11-10 03:00:00 | 79.15 |
| 4 | Customer1 | 2018-11-11 04:00:00 | 41.89 |
| 5 | Customer2 | 2018-11-11 05:00:00 | 5.15 |
| 6 | Customer3 | 2018-11-11 06:00:00 | 35.17 |
| 7 | Customer1 | 2018-11-12 07:00:00 | 43.78 |
| 8 | Customer1 | 2018-11-12 08:00:00 | 93.78 |
| 9 | Customer2 | 2018-11-12 09:00:00 | 80.74 |
+-------+-----------+---------------------+--------+
I need an SQL query that will return all a customer's transactions for a given day (easy enough), but then if a customer had no transactions on the given day, the query must return the customer's most recent transaction.
Edit:
The view is as follows:
Create view vwReport as
Select c.Name, t.Date, t.Amount
from Transaction t
inner join Customer c on c.Id = t.CustomerId
And then to get the data I just do a select from the view:
Select * from
vwReport r
where r.Date between '2018-11-10 00:00:00' and '2018-11-11 00:00:00'
So, to clarify, I need one query that returns all the customer transactions for a day, and included in that results set is the last transaction of any customers who don't have a transaction on that day. So, in the table above, running the query for 2018-11-12, should return row 7, 8 and 9, as well as row 6 for Customer3 that did not have a transaction on the 12th.
Take your existing query and UNION ALL it with a "most recent transaction query" for everyone who doesn't have a transaction in that range.
with found as
(
select c.Id, c.Name, t.Date, t.Amount
from Transaction t
inner join Customer c on c.Id = t.CustomerId
where Date between '2018-11-10 00:00:00' and '2018-11-11 00:00:00'
)
, unfound as
(
select c.Id, c.Name, t.Date, t.Amount, RANK() OVER (PARTITION BY Name ORDER BY CAST(Date AS DATE) DESC) AS row
from Transaction t
inner join Customer c on c.Id = t.CustomerId
WHERE Date < '2018-11-10 00:00:00'
)
select Name, Date, Amount
from found
union all
select Name, Date, Amount
from unfound
where Id not in ( select Id from found ) and row = 1
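The "found, otherwise most recent" logic can be sketched in Python against the rows from the question. The function name is made up, and unlike the RANK() version this sketch returns a single most recent transaction per customer rather than every transaction on that customer's latest day.

```python
from datetime import datetime, date

# (RowNo, Name, Date, Amount) rows from the question
rows = [
    (1, "Customer1", "2018-11-10 01:00:00", 55.49),
    (2, "Customer2", "2018-11-10 02:00:00", 58.15),
    (3, "Customer3", "2018-11-10 03:00:00", 79.15),
    (4, "Customer1", "2018-11-11 04:00:00", 41.89),
    (5, "Customer2", "2018-11-11 05:00:00", 5.15),
    (6, "Customer3", "2018-11-11 06:00:00", 35.17),
    (7, "Customer1", "2018-11-12 07:00:00", 43.78),
    (8, "Customer1", "2018-11-12 08:00:00", 93.78),
    (9, "Customer2", "2018-11-12 09:00:00", 80.74),
]


def report(rows, day):
    parsed = [(no, name, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), amt)
              for no, name, ts, amt in rows]
    # "found": every transaction that falls on the requested day
    on_day = [r for r in parsed if r[2].date() == day]
    found_names = {r[1] for r in on_day}
    # "unfound": latest earlier transaction for customers with nothing on that day
    latest = {}
    for r in parsed:
        if r[1] not in found_names and r[2].date() < day:
            if r[1] not in latest or r[2] > latest[r[1]][2]:
                latest[r[1]] = r
    return sorted(on_day + list(latest.values()), key=lambda r: r[0])


row_numbers = [r[0] for r in report(rows, date(2018, 11, 12))]
print(row_numbers)
```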
If you're interested in selecting multiple rows with ties, you could use the RANK() function to rank each customer's rows by date descending and keep the top-ranked ones:
SELECT * FROM (
SELECT *, RANK() OVER (PARTITION BY Name ORDER BY CAST(Date AS DATE) DESC) AS rn
FROM txntbl
WHERE CAST(Date AS DATE) <= '2018-11-12'
) AS x
WHERE rn = 1
You can use a correlated subquery:
select t.*
from transactions t
where t.date = (select max(t2.date)
from transactions t2
where t2.name = t.name and
t2.date <= @date
);
Note: This only returns customers who had a transaction on or before the date in question.
With the limited information available from the question, the following presents a solution using a join as opposed to a correlated subquery:
select t1.*
from
vwReport t1 inner join
(
select t2.name, max(t2.date) as mdate
from vwReport t2
where t2.date <= @date
group by t2.name
) t3
on t1.name = t3.name and t1.date = t3.mdate
Use UNION to add the customer's latest transaction only if there are no transactions in the given date range (BETWEEN '2018-11-10 00:00:00' AND '2018-11-11 00:00:00'):
SELECT * FROM vwReport r
WHERE (r.Date BETWEEN '2018-11-10 00:00:00' AND '2018-11-11 00:00:00')
AND (r.Name = @name)
UNION
SELECT * FROM vwReport r
WHERE (r.Date = (SELECT MAX(r2.Date) FROM vwReport r2 WHERE r2.Name = @name))
AND (r.Name = @name)
AND ((SELECT COUNT(*) FROM vwReport r3
WHERE (r3.Date BETWEEN '2018-11-10 00:00:00' AND '2018-11-11 00:00:00')
AND (r3.Name = @name)) = 0)

Delete rows in single table in SQL Server where timestamp column differs

I have a table of employee timeclock punches that looks something like this:
| EmployeeID | PunchDate | PunchTime | PunchType | Sequence |
|------------|------------|-----------|-----------|----------|
| 5386 | 12/27/2016 | 03:57:42 | On Duty | 552 |
| 5386 | 12/27/2016 | 09:30:00 | Off Duty | 563 |
| 5386 | 12/27/2016 | 10:02:00 | On Duty | 564 |
| 5386 | 12/27/2016 | 12:10:00 | Off Duty | 570 |
| 5386 | 12/27/2016 | 12:22:00 | On Duty | 571 |
| 5386 | 12/28/2016 | 05:13:32 | Off Duty | 578 |
What I need to do is delete any rows where the difference in minutes between an Off Duty punch and the following On Duty punch is less than, say, 25 minutes. In the example above, I would want to remove Sequence 570 and 571.
I'm already creating this table by pulling all Off Duty punches from another table and using this query to pull all On Duty punches that follow an Off Duty punch:
SELECT * FROM [dbo].[Punches]
INSERT INTO [dbo].[UpdatePunches] (EmployeeID,PunchDate,PunchTime,PunchType,Sequence)
SELECT * FROM [dbo].[Punches]
WHERE Sequence IN (
SELECT Sequence + 1
FROM [dbo].[Punches]
WHERE PunchType LIKE 'Off Duty%') AND
PunchType LIKE 'On Duty%'
I have been trying to fit some sort of DATEDIFF query both in this code and as a separate step to weed these out, but have not had any luck. I can't use specific Sequence numbers because those are going to change for every punch.
I'm using SQL Server 2008.
Any suggestions would be much appreciated.
You can assign rownumbers per employee based on punchdate and punchtime and join each row with the next based on ascending order of date and time.
Thereafter, get the rownumbers of those rows where the difference is less than 25 minutes and finally delete those rows.
with rownums as
(select t.*,row_number() over(partition by employeeid
order by cast(punchdate +' '+punchtime as datetime) ) as rn
from t)
,rownums_to_delete as
(
select r1.rn,r1.employeeid
from rownums r1
join rownums r2 on r1.employeeid=r2.employeeid and r1.rn=r2.rn+1
where dateadd(minute,25,cast(r2.punchdate +' '+r2.punchtime as datetime)) > cast(r1.punchdate +' '+r1.punchtime as datetime)
and r1.punchtype <> r2.punchtype
union all
select r2.rn, r2.employeeid
from rownums r1
join rownums r2 on r1.employeeid=r2.employeeid and r1.rn=r2.rn+1
where dateadd(minute,25,cast(r2.punchdate +' '+r2.punchtime as datetime)) > cast(r1.punchdate +' '+r1.punchtime as datetime)
and r1.punchtype <> r2.punchtype
)
delete r
from rownums_to_delete rd
join rownums r on rd.employeeid=r.employeeid and r.rn=rd.rn
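If the date and time columns are not varchar but actual date and time data types, cast each to datetime before adding them, e.g. cast(punchdate as datetime) + cast(punchtime as datetime), since SQL Server does not allow adding a date to a time directly.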
Edit: An easier version of the query would be
with todelete as (
select t1.employeeid,cast(t2.punchdate+' '+t2.punchtime as datetime) as punchtime,
t2.punchtype,t2.sequence,
cast(t1.punchdate+' '+t1.punchtime as datetime) next_punchtime,
t1.punchtype as next_punchtype,t1.sequence as next_sequence
from t t1
join t t2 on t1.employeeid=t2.employeeid
and cast(t2.punchdate+' '+t2.punchtime as datetime) between dateadd(minute,-25,cast(t1.punchdate+' '+t1.punchtime as datetime)) and cast(t1.punchdate+' '+t1.punchtime as datetime)
where t2.punchtype <> t1.punchtype
)
delete t
from t
join todelete td on t.employeeid = td.employeeid
and cast(t.punchdate+' '+t.punchtime as datetime) in (td.punchtime,td.next_punchtime)
;
SQL Server has a nice ability called updatable CTEs. Using lead() and lag(), you can do exactly what you want. The following assumes that the date is actually stored as a datetime -- this is just for the convenience of adding the date and time together (you can also explicitly use conversion):
with todelete as (
select tcp.*,
(punchdate + punchtime) as punchdatetime,
lead(punchtype) over (partition by employeeid order by punchdate, punchtime) as next_punchtype,
lag(punchtype) over (partition by employeeid order by punchdate, punchtime) as prev_punchtype,
lead(punchdate + punchtime) over (partition by employeeid order by punchdate, punchtime) as next_punchdatetime,
lag(punchdate + punchtime) over (partition by employeeid order by punchdate, punchtime) as prev_punchdatetime
from timeclockpunches tcp
)
delete from todelete
where (punchtype = 'Off Duty' and
next_punchtype = 'On Duty' and
punchdatetime > dateadd(minute, -25, next_punchdatetime)
) or
(punchtype = 'On Duty' and
prev_punchtype = 'Off Duty' and
prev_punchdatetime > dateadd(minute, -25, punchdatetime)
);
EDIT:
In SQL Server 2008, you can use the same idea, just not as efficiently:
delete t
from t outer apply
(select top 1 tprev.*
from t tprev
where tprev.employeeid = t.employeeid and
(tprev.punchdate < t.punchdate or
(tprev.punchdate = t.punchdate and tprev.punchtime < t.punchtime)
)
order by tprev.punchdate desc, tprev.punchtime desc
) tprev outer apply
(select top 1 tnext.*
from t tnext
where tnext.employeeid = t.employeeid and
(t.punchdate < tnext.punchdate or
(t.punchdate = tnext.punchdate and t.punchtime < tnext.punchtime)
)
order by tnext.punchdate asc, tnext.punchtime asc
) tnext
where (t.punchtype = 'Off Duty' and
tnext.punchtype = 'On Duty' and
t.punchdatetime > dateadd(minute, -25, tnext.punchdatetime)
) or
(t.punchtype = 'On Duty' and
tprev.punchtype = 'Off Duty' and
tprev.punchdatetime > dateadd(minute, -25, t.punchdatetime)
);
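The lead/lag comparison boils down to walking each employee's punches in time order and looking at neighbouring pairs. Below is a minimal Python sketch of that logic on the sample rows. The function name is invented, and the last punch is assumed to fall on 2016-12-28, which the question's table does not state outright.

```python
from datetime import datetime, timedelta

# (EmployeeID, punch datetime, PunchType, Sequence) based on the question's sample
punches = [
    (5386, "2016-12-27 03:57:42", "On Duty", 552),
    (5386, "2016-12-27 09:30:00", "Off Duty", 563),
    (5386, "2016-12-27 10:02:00", "On Duty", 564),
    (5386, "2016-12-27 12:10:00", "Off Duty", 570),
    (5386, "2016-12-27 12:22:00", "On Duty", 571),
    (5386, "2016-12-28 05:13:32", "Off Duty", 578),  # date is an assumption
]


def sequences_to_delete(rows, min_break=timedelta(minutes=25)):
    """Return Sequence values of Off Duty / On Duty pairs closer than min_break."""
    parsed = sorted(
        ((emp, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), kind, seq)
         for emp, ts, kind, seq in rows),
        key=lambda r: (r[0], r[1]),  # per employee, in time order
    )
    doomed = set()
    # Compare each punch with the one that follows it (the "lead" row)
    for prev, nxt in zip(parsed, parsed[1:]):
        same_employee = prev[0] == nxt[0]
        short_break = (prev[2] == "Off Duty" and nxt[2] == "On Duty"
                       and nxt[1] - prev[1] < min_break)
        if same_employee and short_break:
            doomed.update((prev[3], nxt[3]))
    return sorted(doomed)


to_delete = sequences_to_delete(punches)
print(to_delete)
```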
You could create a DateTime from the Date and Time fields in a CTE and then lookup the next On Duty Time after the Off Duty Time like below:
;
WITH OnDutyDateTime AS
(
SELECT
EmployeeID,
Sequence,
DutyDateTime = DATEADD(ms, DATEDIFF(ms, '00:00:00', PunchTime), CONVERT(DATETIME, PunchDate))
FROM
#TempEmployeeData
where PunchType = 'On Duty'
),
OffDutyDateTime As
(
SELECT
EmployeeID,
Sequence,
DutyDateTime = DATEADD(ms, DATEDIFF(ms, '00:00:00', PunchTime), CONVERT(DATETIME, PunchDate))
FROM
#TempEmployeeData
where PunchType = 'Off Duty'
)
SELECT
OffDutyDateTime = DutyDateTime,
OnDutyDateTime = (SELECT TOP 1 DutyDateTime FROM OnDutyDateTime WHERE EmployeeID = A.EmployeeID AND Sequence > A.Sequence ORDER BY Sequence ASC ),
DiffInMinutes = DATEDIFF(minute,DutyDateTime,(SELECT TOP 1 DutyDateTime FROM OnDutyDateTime WHERE EmployeeID = A.EmployeeID AND Sequence > A.Sequence ORDER BY Sequence ASC ))
FROM
OffDutyDateTime A
OffDutyDateTime OnDutyDateTime DiffInMinutes
----------------------- ----------------------- -------------
2016-12-27 09:30:00.000 2016-12-27 10:02:00.000 32
2016-12-27 12:10:00.000 2016-12-27 12:22:00.000 12
2016-12-28 05:13:32.000 NULL NULL
(3 row(s) affected)
Maybe something like this would be easy to slot in. It simply uses a subquery to find the next 'On Duty' punch and compares it in the main query to the 'Off Duty' punch.
DELETE p
FROM [dbo].[Punches] p
WHERE p.PunchType = 'Off Duty'
AND p.PunchTime >=
dateadd(minute, -25, isnull(
(select top 1 p2.PunchTime from [dbo].[Punches] p2 where
p2.EmployeeID = p.EmployeeID and p2.PunchType = 'On Duty'
and p.Sequence < p2.Sequence and p2.PunchDate = p.PunchDate
order by p2.Sequence asc),
'2500-01-01'))