Improving performance of outer apply - sql

Let me briefly describe what I'm attempting in case someone has a much more elegant way of solving the same problem. I'm trying to write a stored procedure that looks at sales orders in a database, finds when the same item is ordered by the same customer multiple times, and predicts the next order date using the average of the previous intervals between orders for the same item. The query below will form the basis for a temp table to work against, probably with cursors and running averages.
So far the query I have looks like this
SELECT sl.custaccount ,
sl.itemid ,
sl.shippingdaterequested ,
nextdate.shippingdaterequested AS nextshippingdaterequested
FROM salesline AS sl
OUTER APPLY ( SELECT TOP 1
sl2.custaccount ,
sl2.itemid ,
sl2.shippingdaterequested
FROM salesline AS sl2
WHERE sl2.shippingdaterequested > sl.shippingdaterequested
AND sl2.custaccount = sl.custaccount
AND sl2.itemid = sl.itemid
GROUP BY sl2.custaccount ,
sl2.itemid ,
sl2.shippingdaterequested
ORDER BY sl2.shippingdaterequested
) AS nextdate
GROUP BY sl.custaccount ,
sl.itemid ,
sl.shippingdaterequested ,
nextdate.shippingdaterequested
This query gives me a row for every sales line with a column representing the next time that item was ordered by that customer. If that column is NULL, I know the record I'm on is the last time.
The basic problem is that this query is way too slow. It runs fine against a single customer at a time, returning results in a second, but running it against ~100,000 customers would take around 27 hours.
I know the underlying issue is that I'm outer applying, so it's probably doing row-by-agonizing-row processing, but I'm not sure of another way to get there that would work out faster. Any thoughts?

I think you are making it more complex than it needs to be.
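Just take the min and max and divide by the number of intervals, which is one less than the number of distinct order dates
SELECT sl.custaccount ,
sl.itemid ,
MAX(sl.shippingdaterequested) AS lastShip ,
DATEDIFF(dd, MIN(sl.shippingdaterequested),
MAX(sl.shippingdaterequested)) / ( COUNT(DISTINCT sl.shippingdaterequested) - 1 ) AS interval ,
DATEADD(dd,
DATEDIFF(dd, MIN(sl.shippingdaterequested),
MAX(sl.shippingdaterequested)) / ( COUNT(DISTINCT sl.shippingdaterequested) - 1 ),
MAX(sl.shippingdaterequested)) AS nextShip
FROM salesline AS sl
GROUP BY sl.custaccount ,
sl.itemid
HAVING COUNT(DISTINCT sl.shippingdaterequested) > 1 -- at least one interval, so no division by zero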
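As a sanity check on the min/max shortcut: n distinct order dates have n - 1 gaps between them, and those gaps add up to max minus min, so the average gap is (max - min) / (n - 1). The sketch below compares that with an explicit average of the gaps, pairing each date with the next one via LEAD instead of OUTER APPLY. It uses an in-memory SQLite database through Python purely for illustration; the sample rows are made up, julianday stands in for DATEDIFF, and it assumes an SQLite build with window functions (3.25 or later). SQL Server 2012 and later also offer LEAD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE salesline (custaccount TEXT, itemid TEXT, shippingdaterequested TEXT)"
)
# Hypothetical sample data: item A ordered three times, item B only once.
conn.executemany(
    "INSERT INTO salesline VALUES (?, ?, ?)",
    [
        ("C1", "A", "2015-01-01"),
        ("C1", "A", "2015-01-11"),
        ("C1", "A", "2015-01-31"),
        ("C1", "B", "2015-02-01"),
    ],
)

# Average of the individual gaps; LEAD pairs each order date with the next one.
avg_of_gaps = conn.execute("""
    WITH dates AS (
        SELECT DISTINCT custaccount, itemid, shippingdaterequested
        FROM salesline
    ),
    gaps AS (
        SELECT custaccount, itemid,
               julianday(LEAD(shippingdaterequested) OVER (
                   PARTITION BY custaccount, itemid
                   ORDER BY shippingdaterequested))
               - julianday(shippingdaterequested) AS gap
        FROM dates
    )
    SELECT custaccount, itemid, AVG(gap)
    FROM gaps
    WHERE gap IS NOT NULL
    GROUP BY custaccount, itemid
""").fetchall()

# Shortcut: (max - min) divided by the number of intervals.
shortcut = conn.execute("""
    SELECT custaccount, itemid,
           (julianday(MAX(shippingdaterequested))
            - julianday(MIN(shippingdaterequested)))
           / (COUNT(DISTINCT shippingdaterequested) - 1)
    FROM salesline
    GROUP BY custaccount, itemid
    HAVING COUNT(DISTINCT shippingdaterequested) > 1
""").fetchall()

print(avg_of_gaps, shortcut)
```

Both queries return a single row for customer C1, item A, with an average interval of 15 days (gaps of 10 and 20 days); item B is skipped because it has no interval.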

Related

Access: Having trouble with getting average movies per day

I have a database project at my school and I am almost finished. The only thing that I still need is the average number of movies per day. I have a watchhistory table where you can find the users who have watched a movie. The instruction is to filter the people out of the watchhistory who have an average of 2 movies per day.
I wrote the following SQL statement. But every time I get errors. Can someone help me?
SQL:
SELECT
customer_mail_address,
COUNT(movie_id) AS AantalBekeken,
COUNT(movie_id) / SUM(GETDATE() -
(SELECT subscription_start FROM Customer)) AS AveragePerDay
FROM
Watchhistory
GROUP BY
customer_mail_address
The error:
Msg 130, Level 15, State 1, Line 1
Cannot perform an aggregate function on an expression containing an aggregate or a subquery.
I tried something different, and this query counts the total movies per day. Now I need the average of everything, and the SQL should only show the customers who have an average of more than 2 movies per day.
SELECT
Count(movie_id) as AantalPerDag,
Customer_mail_address,
Cast(watchhistory.watch_date as Date) as Date
FROM
Watchhistory
GROUP BY
customer_mail_address, Cast(watch_date as Date)
The big problem that I see is that you're trying to use a subquery as if it's a single value. A subquery could potentially return many values, and unless you have only one customer in your system it will do exactly that. You should be JOINing to the Customer table instead. Hopefully the JOIN only returns one customer per row in WatchHistory. If that's not the case then you'll have more work to do there.
SELECT
C.customer_mail_address, -- qualified, since the question suggests WatchHistory has a column of the same name
COUNT(movie_id) AS AantalBekeken,
CAST(COUNT(movie_id) AS DECIMAL(10, 4)) / DATEDIFF(dy, C.subscription_start, GETDATE()) AS AveragePerDay
FROM
WatchHistory WH
INNER JOIN Customer C ON C.customer_id = WH.customer_id -- I'm guessing at the join criteria here since no table structures were provided
GROUP BY
C.customer_mail_address,
C.subscription_start
HAVING
COUNT(movie_id) / DATEDIFF(dy, C.subscription_start, GETDATE()) <> 2
I'm guessing that the criteria isn't exactly 2 movies per day, but either less than 2 or more than 2. You'll need to adjust based on that. Also, you'll need to adjust the precision for the average based on what you want.
What the error message is telling you is that you can't apply an aggregate such as SUM to an expression that contains a subquery (or another aggregate).
Try moving the subscription_start lookup out of the SUM, for example by joining to Customer instead of using a subquery, and
try using HAVING at the end of your query to select only the users that have count / days = 2
Maybe this is what you need?
Let's join the two tables Watchhistory and Customer
select Customer.customer_mail_address,
COUNT(movie_id) AS AantalBekeken,
COUNT(movie_id) * 1.0 / datediff(Day, Customer.subscription_start, GETDATE()) AS AveragePerDay
from Watchhistory inner join Customer
on Watchhistory.customer_mail_address = Customer.customer_mail_address
GROUP BY
Customer.customer_mail_address, Customer.subscription_start
having COUNT(movie_id) * 1.0 / datediff(Day, Customer.subscription_start, GETDATE()) = 2
Change the last line of code according to what you need (I did not understand whether you want these customers in or out). Note that HAVING cannot refer to the AveragePerDay alias, so the expression is repeated there.
I got it guys. Finally :)
SELECT customer_mail_address, SUM(AveragePerDay) / COUNT(customer_mail_address) AS gemiddelde
FROM (SELECT DISTINCT customer_mail_address, COUNT(CAST(watch_date AS date)) AS AveragePerDay
FROM dbo.Watchhistory
GROUP BY customer_mail_address, CAST(watch_date AS date)) AS d
GROUP BY customer_mail_address
HAVING (SUM(AveragePerDay) / COUNT(customer_mail_address)) >= 2
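The two-step shape of that final query (count per customer per day, then average those daily counts) can be sketched as follows. This is only an illustration on an in-memory SQLite database driven from Python, with made-up rows and SQLite's DATE() standing in for CAST(... AS date). Note that it averages over the days on which a customer watched at least one movie, not over every day of the subscription. SQLite's AVG returns a floating-point value, whereas SUM / COUNT on integers in SQL Server would truncate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Watchhistory (customer_mail_address TEXT, movie_id INTEGER, watch_date TEXT)"
)
# Hypothetical sample data: a@x.test watches 3 then 2 movies, b@x.test 1 then 2.
conn.executemany(
    "INSERT INTO Watchhistory VALUES (?, ?, ?)",
    [
        ("a@x.test", 1, "2015-01-01 10:00:00"),
        ("a@x.test", 2, "2015-01-01 13:00:00"),
        ("a@x.test", 3, "2015-01-01 20:00:00"),
        ("a@x.test", 4, "2015-01-02 09:00:00"),
        ("a@x.test", 5, "2015-01-02 21:00:00"),
        ("b@x.test", 1, "2015-01-01 11:00:00"),
        ("b@x.test", 2, "2015-01-02 12:00:00"),
        ("b@x.test", 3, "2015-01-02 18:00:00"),
    ],
)

heavy_watchers = conn.execute("""
    SELECT customer_mail_address, AVG(per_day) AS avg_per_day
    FROM (
        -- step 1: movies per customer per calendar day
        SELECT customer_mail_address, DATE(watch_date) AS day, COUNT(*) AS per_day
        FROM Watchhistory
        GROUP BY customer_mail_address, DATE(watch_date)
    ) AS d
    -- step 2: average the daily counts and keep customers at 2 or more
    GROUP BY customer_mail_address
    HAVING AVG(per_day) >= 2
    ORDER BY customer_mail_address
""").fetchall()

print(heavy_watchers)
```

With this data only a@x.test qualifies, at 2.5 movies per active day; b@x.test averages 1.5 and is filtered out.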

SQL subquery based on row values with unrelated table

I need to get a count of records in an unrelated table, based on the row values in a query with some moderately complex joins. All data is on one server in a single SQL 2012 database, on several different tables.
I am recreating ticket movement history for a single ticket at a time, from audit records and need to calculate business days for the spans in rows created by the joins. Tickets are moved around between areas (ASSIGNMENT), and there are guidelines on how long it should be at any one area. The ticket may go to the same area multiple times with each time restarting the time count.
I need to consider company holidays in the business day calculations. After looking at several solutions for business day calculations on SE I decided to go with a company calendar table (dbo.UPMCCALENDARM1) and count the dates between spans. Seemed like a great idea...
I can't figure out how to use the row values as parameters for the date count query.
The query below has working solutions with a variable and with a cross join, but they only work with hard-coded dates. If I try to use the field values instead, it does not work, because they are not part of the subquery and cannot be bound.
-- between DV_im_Audit_ASSIGNMENT.Time and Detail.RESOLVED_TIME
In theory I could probably get there using this full query in the sub query to get the date count, but this is as short as I can make it and still get clean data. It is a pretty heavy lift for an on demand report, that would be my last option. So I want to reach out to UPMCCALENDARM1 as each occurrence of DV_im_Audit_ASSIGNMENT.Time and Detail.RESOLVED_TIME are listed.
Can it be done? If so how?
declare @NonBus integer
set @NonBus = '0'
set @NonBus = (select Count(UPMCCALENDARM1.DATE) as NonBus
from dbo.UPMCCALENDARM1
where UPMC_BUSINESS_DAY = 'f'
and UPMCCALENDARM1.DATE
between '2015-08-01' and '2015-08-31'
-- between DV_im_Audit_ASSIGNMENT.Time and Detail.RESOLVED_TIME
)
select DV_im_Audit_ASSIGNMENT.Incident_ID
, DV_im_Audit_ASSIGNMENT.Old_ASSIGNMENT
, DV_im_Audit_ASSIGNMENT.New_ASSIGNMENT
, DV_im_Audit_ASSIGNMENT.Time as Assign_Time
, B.Time as Reassign_Time
, Detail.OPEN_TIME
, Cal.NonBus
, NonBus
, Detail.RESOLVED_TIME
, A.rownumA
, B.rownumB
from dbo.DV_im_Audit_ASSIGNMENT
--Get RownumA as a select join so I can work with it here, else get an invalid column name 'rownumA' error
left join(select Incident_ID
, Old_ASSIGNMENT
, New_ASSIGNMENT
, [Time]
, rownumA = ROW_NUMBER() OVER (ORDER BY DV_im_Audit_ASSIGNMENT.Incident_ID, DV_im_Audit_ASSIGNMENT.Time)
from dbo.DV_im_Audit_ASSIGNMENT
where Incident_ID = ?
) as A
on DV_im_Audit_ASSIGNMENT.Incident_ID = A.Incident_ID
and DV_im_Audit_ASSIGNMENT.New_ASSIGNMENT = A.New_ASSIGNMENT
and DV_im_Audit_ASSIGNMENT.Time = A.Time
--Get time assigned to next group, is problematic when assigned to the same group multiple times.
left join(select Incident_ID
, Old_ASSIGNMENT
, New_ASSIGNMENT
, [Time]
, rownumB = ROW_NUMBER() OVER (ORDER BY DV_im_Audit_ASSIGNMENT.Incident_ID, DV_im_Audit_ASSIGNMENT.Time)
from dbo.DV_im_Audit_ASSIGNMENT
where Incident_ID = ?
) as B
on DV_im_Audit_ASSIGNMENT.Incident_ID = B.Incident_ID
and DV_im_Audit_ASSIGNMENT.New_ASSIGNMENT = B.Old_ASSIGNMENT
and DV_im_Audit_ASSIGNMENT.Time < B.Time
and rownumA = (B.rownumB - 1)
--Get current ticket info
left join (select Incident_ID
, OPEN_TIME
, RESOLVED_TIME
from dbo.DV_im_PROBSUMMARYM1_Detail
where Incident_ID = ?
) as Detail
on DV_im_Audit_ASSIGNMENT.Incident_ID = Detail.Incident_ID
--Count non-business days. This section is in testing and does not use dataview as a source.
-- this gets the date count for one group of dates, need a different count for each row based on assign time.
cross join (Select Count(UPMCCALENDARM1.DATE) as NonBus
from dbo.UPMCCALENDARM1
where UPMC_BUSINESS_DAY = 'f'
and UPMCCALENDARM1.DATE
between '2015-08-01' and '2015-08-30'
-- between DV_im_Audit_ASSIGNMENT.Time and Detail.RESOLVED_TIME
) as Cal
--Get data for one ticket
where DV_im_Audit_ASSIGNMENT.Incident_ID = ?
ORDER BY DV_im_Audit_ASSIGNMENT.Incident_ID, DV_im_Audit_ASSIGNMENT.Time
Results
FYI - I am running this SQL through BIRT 4.2; I believe there are a few SQL items that will not pass through BIRT.
Following the suggestion by @Dominique, I created a custom scalar function (using the wizard in SSMS). I used default values for the dates because I had started by playing with a stored procedure, and that made it easier to test. This problem requires a function because it returns a value per row, which a stored procedure does not.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- =============================================
-- Author: James Jenkins
-- Create date: September 2015
-- Description: Counts Business Days for UPMC during a span of dates
-- =============================================
CREATE FUNCTION dbo.UPMCBusinessDayCount
(
-- Add the parameters for the function here
@StartDate date = '2015-08-01',
@EndDate date = '2015-08-31'
)
RETURNS int
AS
BEGIN
-- Declare the return variable here
DECLARE @BusDay int
-- Add the T-SQL statements to compute the return value here
SELECT @BusDay = Count(UPMCCALENDARM1.DATE)
from dbo.UPMCCALENDARM1
where UPMC_BUSINESS_DAY = 't'
and UPMCCALENDARM1.DATE between @StartDate and @EndDate
-- Return the result of the function
RETURN @BusDay
END
GO
After the function is created in the database I added these two lines to my select statement, and it works perfectly.
--Custom function counts business days on UPMCCALENDARM1
, dbo.UPMCBusinessDayCount(DV_im_Audit_ASSIGNMENT.Time, Detail.RESOLVED_TIME) as BusDay
I can use this function for any span that has date data in this (or any query on the database). I will probably be removing the default dates as well as adding a third parameter to count non-business days (UPMC_BUSINESS_DAY = 'f'). But as it is the problem is solved.
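For comparison, the same per-row count can also be written as a correlated subquery in the select list, which is what the scalar function effectively evaluates for each row. The sketch below is not the query from this post: it runs on an in-memory SQLite database from Python, and the calendar and spans tables, their column names, and the weekend-only calendar are all hypothetical stand-ins for UPMCCALENDARM1 and the audit join.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calendar (cal_date TEXT PRIMARY KEY, business_day TEXT)")
conn.execute("CREATE TABLE spans (id INTEGER, start_date TEXT, end_date TEXT)")

# Hypothetical calendar for August 2015: weekdays 't', weekends 'f', no holidays.
day = date(2015, 8, 1)
while day <= date(2015, 8, 31):
    flag = "t" if day.weekday() < 5 else "f"
    conn.execute("INSERT INTO calendar VALUES (?, ?)", (day.isoformat(), flag))
    day += timedelta(days=1)

# Two made-up spans standing in for (Assign_Time, RESOLVED_TIME) pairs.
conn.executemany(
    "INSERT INTO spans VALUES (?, ?, ?)",
    [
        (1, "2015-08-03", "2015-08-07"),  # Monday to Friday
        (2, "2015-08-03", "2015-08-14"),  # Monday to Friday of the next week
    ],
)

# The subquery refers to the outer row's dates, so it is evaluated per row.
rows = conn.execute("""
    SELECT s.id,
           (SELECT COUNT(*)
            FROM calendar c
            WHERE c.business_day = 't'
              AND c.cal_date BETWEEN s.start_date AND s.end_date) AS bus_days
    FROM spans s
    ORDER BY s.id
""").fetchall()

print(rows)
```

The first span covers 5 business days and the second 10. Whether a report tool such as BIRT passes this form through unchanged is not something the sketch can confirm.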

SQL: Average value per day

I have a table called ‘tweets’. The table includes (amongst others) the columns 'tweet_id', 'created_at' (dd/mm/yyyy hh/mm/ss), ‘classified’ and 'processed_text'. Within the ‘processed_text’ column there are certain strings such as '{TICKER|IBM}', to which I will refer as ticker-strings.
My target is to get the average value of ‘classified’ per ticker-string per day. The column ‘classified’ contains the numerical values -1, 0 and 1.
At this moment, I have a working SQL query for the average value of ‘classified’ for one ticker-string per day. See the script below.
SELECT Date( `created_at` ) , AVG( `classified` ) AS Classified
FROM `tweets`
WHERE `processed_text` LIKE '%{TICKER|IBM}%'
GROUP BY Date( `created_at` )
There are however two problems with this script:
It does not include days on which there were zero ‘processed_text’s like {TICKER|IBM}. I would however like it to spit out the value zero in this case.
I have 100+ different ticker-strings and would thus like to have a script which can process multiple strings at the same time. I can also do them manually, one by one, but this would cost me a terrible lot of time.
When I had a similar question for counting the ‘tweet_id’s per ticker-string, somebody else suggested using the following:
SELECT d.date, coalesce(IBM, 0) as IBM, coalesce(GOOG, 0) as GOOG,
coalesce(BAC, 0) AS BAC
FROM dates d LEFT JOIN
(SELECT DATE(created_at) AS date,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|IBM}%' then tweet_id
END) as IBM,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|GOOG}%' then tweet_id
END) as GOOG,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|BAC}%' then tweet_id
END) as BAC
FROM tweets
GROUP BY date
) t
ON d.date = t.date;
This script worked perfectly for counting the tweet_ids per ticker-string. However, as I stated, I am now looking to find the average classified score per ticker-string. My question is therefore: could someone show me how to adjust this script so that I can calculate the average classified score per ticker-string per day?
SELECT d.date, t.ticker, COALESCE(COUNT(DISTINCT t.tweet_id), 0) AS tweets
FROM dates d
LEFT JOIN
(SELECT DATE(created_at) AS date,
-- extract the text between '{TICKER|' and the next '}'
SUBSTR(processed_text,
LOCATE('{TICKER|', processed_text) + 8,
LOCATE('}', processed_text, LOCATE('{TICKER|', processed_text))
- LOCATE('{TICKER|', processed_text) - 8) AS ticker,
tweet_id
FROM tweets
WHERE processed_text LIKE '%{TICKER|%') t
ON d.date = t.date
GROUP BY d.date, t.ticker
This will put each ticker on its own row, not a column. If you want them moved to columns, you have to pivot the result. How you do this depends on the DBMS. Some have built-in features for creating pivot tables. Others (e.g. MySQL) do not and you have to write tricky code to do it; if you know all the possible values ahead of time, it's not too hard, but if they can change you have to write dynamic SQL in a stored procedure.
See MySQL pivot table for how to do it in MySQL.
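The per-ticker, per-day average that the question asks about can also be sketched outside the database. The following is a minimal illustration in Python, not the answer's method: the function name, the tuple layout, and the sample tweets are assumptions, and a tweet that mentions several tickers is counted once for each of them.

```python
import re
from collections import defaultdict

# Matches ticker-strings such as {TICKER|IBM} and captures the symbol.
TICKER_RE = re.compile(r"\{TICKER\|([^}]+)\}")


def average_classified(tweets):
    """Average 'classified' per (date, ticker).

    tweets: iterable of (created_date, classified, processed_text) tuples.
    """
    scores = defaultdict(list)
    for created_date, classified, text in tweets:
        # set() so a ticker repeated within one tweet is counted only once
        for ticker in set(TICKER_RE.findall(text)):
            scores[(created_date, ticker)].append(classified)
    return {key: sum(vals) / len(vals) for key, vals in scores.items()}


# Hypothetical sample data
sample = [
    ("2015-03-01", 1, "buy {TICKER|IBM} now"),
    ("2015-03-01", -1, "{TICKER|IBM} versus {TICKER|GOOG}"),
    ("2015-03-02", 1, "{TICKER|GOOG} is up"),
]
averages = average_classified(sample)
print(averages)
```

Days without any tweet for a ticker are simply absent from the result; looking them up with `averages.get((day, ticker), 0.0)` would give the zero the question asks for.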

SQL to calculate value of Shares at a particular time

I'm looking for a way to calculate what the value of shares is at a given time.
In the example I need to calculate and report on the redemptions of shares in a given month.
There are 3 tables that I need to look at:
Redemptions table that has the Date of the redemption, the number of shares that were redeemed and the type of share.
The share type table which has the share type and links the 1st and 3rd tables.
The Share price table which has the share type, valuation date, value.
So what I need to do is report on and have calculated based on the number of share redemptions the value of those shares broken down by month.
Does that make sense?
Thanks in advance for your help!
Apologies, I think I should elaborate a little further as there might have been some misunderstandings. This isn't to calculate daily changing stocks and shares, it's more for fund management. What this means is that the share price only changes on a monthly basis and it's also normally a month behind.
The effect of this is that the query needs to look at the date of the redemption and work out the month and year. It then needs to look at the share price table, and if there's a share price for the given date (this will need to be calculated, as it will be a single day, i.e. the price was x on day y), multiply the number of units by this value. However, if there isn't a share price for the given date, it should use the last price for that particular share type.
Hopefully this might be a little more clear but if there's any other information I can provide to make this easier then please let me know and I'll supply you with the information.
Regards,
Phil
This should do the trick (note: updated to group by ShareType):
SELECT
ST.ShareType,
RedemptionMonth = DateAdd(month, DateDiff(month, 0, R.RedemptionDate), 0),
TotalShareValueRedeemed = Sum(P.SharePrice * R.SharesRedeemed)
FROM
dbo.Redemption R
INNER JOIN dbo.ShareType ST
ON R.ShareTypeID = ST.ShareTypeID
CROSS APPLY (
SELECT TOP 1 P.*
FROM dbo.SharePrice P
WHERE
R.ShareTypeID = P.ShareTypeID
AND R.RedemptionDate >= P.SharePriceDate
ORDER BY P.SharePriceDate DESC
) P
GROUP BY
ShareType,
DateAdd(month, DateDiff(month, 0, R.RedemptionDate), 0)
ORDER BY
ShareType,
RedemptionMonth
;
See it working in a Sql Fiddle.
This can easily be parameterized by simply adding a WHERE clause with conditions on the Redemption table. If you need to show a 0 for share types in months where they had no Redemptions, please let me know and I'll improve my answer--it would help if you would fill out your use case scenario a little bit, and describe exactly what you want to input and what you want to see as output.
Also please note: I'm assuming here that there will always be a price for a share redemption--if a redemption exists that is before any share price for it, that redemption will be excluded.
If you have the valuations for every day, then the calculation is a simple join followed by an aggregation. The resulting query is something like:
select year(redemptiondate), month(redemptiondate),
sum(r.NumShares*sp.Price) as TotalPrice
from Redemptions r left outer join
ShareType st
on r.sharetype = st.sharetype left outer join
SharePrice sp
on st.sharename = sp.sharename and r.redemptiondate = sp.pricedate
group by year(redemptiondate), month(redemptiondate)
order by 1, 2;
If I understand your question, you need a query like
select shares.id, shares.name, sum (redemption.quant * shareprices.price)
from shares
inner join redemption on shares.id = redemption.share
inner join shareprices on shares.id = shareprices.share
where redemption.curdate between :p1 and :p2
group by shares.id, shares.name
order by shares.id
:p1 and :p2 are date parameters
If you just need it for one date range:
SELECT s.ShareType, SUM(ISNULL(sp.SharePrice, 0) * ISNULL(r.NumRedemptions, 0)) [RedemptionPrice]
FROM dbo.Shares s
LEFT JOIN dbo.Redemptions r
ON r.ShareType = s.ShareType
OUTER APPLY (
SELECT TOP 1 SharePrice
FROM dbo.SharePrice p
WHERE p.ShareType = s.ShareType
AND p.ValuationDate <= r.RedemptionDate
ORDER BY p.ValuationDate DESC) sp
WHERE r.RedemptionDate BETWEEN @Date1 AND @Date2
GROUP BY s.ShareType
Where @Date1 and @Date2 are your dates
The ISNULL checks are just there so it actually gives you a value if something is null (it'll be 0). It's completely optional in this case, just a personal preference.
The OUTER APPLY acts like a LEFT JOIN that will filter down the results from SharePrice to make sure you get the most recent ValuationDate from table based on the RedemptionDate, even if it wasn't from the same date range as that date. It could probably be achieved another way, but I feel like this is easily readable.
If you don't feel comfortable with the OUTER APPLY, you could use a subquery in the SELECT part (i.e., ISNULL(r.NumRedemptions, 0) * (/* subquery from dbo.SharePrice here */))
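A sketch of that subquery variant, under stated assumptions: it runs on an in-memory SQLite database from Python, uses LIMIT 1 where SQL Server would use TOP 1, and the rows are made up. The per-redemption value is computed in a derived table first and summed outside it, because SQL Server does not allow a subquery directly inside SUM (the Msg 130 error quoted earlier on this page).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE SharePrice (ShareType TEXT, ValuationDate TEXT, SharePrice REAL);
CREATE TABLE Redemptions (ShareType TEXT, RedemptionDate TEXT, NumRedemptions INTEGER);
-- hypothetical month-end prices and two redemptions
INSERT INTO SharePrice VALUES ('A', '2015-01-31', 10.0), ('A', '2015-02-28', 12.0);
INSERT INTO Redemptions VALUES ('A', '2015-02-15', 5), ('A', '2015-03-10', 2);
""")

rows = conn.execute("""
    SELECT ShareType, RedemptionMonth, SUM(RedemptionValue)
    FROM (
        SELECT r.ShareType,
               strftime('%Y-%m', r.RedemptionDate) AS RedemptionMonth,
               -- latest price on or before the redemption date
               r.NumRedemptions * (
                   SELECT p.SharePrice
                   FROM SharePrice p
                   WHERE p.ShareType = r.ShareType
                     AND p.ValuationDate <= r.RedemptionDate
                   ORDER BY p.ValuationDate DESC
                   LIMIT 1) AS RedemptionValue
        FROM Redemptions r
    ) AS v
    GROUP BY ShareType, RedemptionMonth
    ORDER BY ShareType, RedemptionMonth
""").fetchall()

print(rows)
```

The February redemption is valued at the January price (5 × 10.0 = 50.0) and the March redemption at the February price (2 × 12.0 = 24.0). A redemption dated before any price would get a NULL value here, which SUM ignores.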

Calculate Average after populating a temp table

I have been tasked with figuring out the average length of time that our customers stick with us. (Specifically from the date they become a customer, to when they placed their last order.)
I am not 100% sure that I am doing this properly, but my thought was to gather the date we enter the customer into the database, and then head over to the order table and grab their most recent order date, dump them into a temp table, and then figure out the length of time between those two dates, and then tally an average based on that number.
(I have to do some other wibbly wobbly time stuff as well, but this is the one that's kicking my butt.)
The end goal with this is to be able to say "On Average our customers stick with us for 4 years, and 3 months." (Or whatever the data shows it to be.)
SELECT * INTO #AvgTable
FROM(
SELECT DISTINCT (c.CustNumber) AS [CustomerNumber]
, COALESCE(convert( VARCHAR(10),c.OrgEnrollDate,101),'') AS [StartDate]
, COALESCE(CONVERT(VARCHAR(10),MAX(co.OrderDate),101),'')AS [EndDate]
,DATEDIFF(DD,c.OrgEnrollDate, co.OrderDate) as [LengthOfTime]
FROM dbo.Customer c
JOIN dbo.CustomerOrder co ON c.ID = co.CustomerID
WHERE c.Archived = 0
AND co.Archived =0
AND c.OrgEnrollDate IS NOT NULL
AND co.OrderDate IS NOT NULL
GROUP BY c.CustNumber
, co.OrderDate 2
)
--This is where I start falling apart
Select AVG[LengthofTime]
From #AvgTable
If I understand you correctly, then just try
SELECT AVG(DATEDIFF(dd, StartDate, EndDate)) AvgTime
FROM #AvgTable
My guess is that since you are storing the data in a temp table, that the integer result of the datediff is being implicitly converted back to a datetime (which you cannot do an average on).
Don't store the difference in your temp table (don't even have a temp table, but that is a whole different conversation). Just do the differencing in your select.
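Doing the differencing directly in the select, without a temp table, might look like the sketch below. It runs on an in-memory SQLite database from Python with made-up rows; julianday subtraction stands in for DATEDIFF(dd, ...), and the Archived filters from the question are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customer (ID INTEGER, CustNumber TEXT, OrgEnrollDate TEXT);
CREATE TABLE CustomerOrder (CustomerID INTEGER, OrderDate TEXT);
-- hypothetical customers and orders
INSERT INTO Customer VALUES (1, 'C001', '2010-01-01'), (2, 'C002', '2012-01-01');
INSERT INTO CustomerOrder VALUES
    (1, '2010-06-01'), (1, '2011-01-01'),
    (2, '2012-01-11');
""")

# One row per customer with the last order date, then average the day counts.
(avg_days,) = conn.execute("""
    SELECT AVG(julianday(o.LastOrder) - julianday(c.OrgEnrollDate))
    FROM Customer c
    JOIN (SELECT CustomerID, MAX(OrderDate) AS LastOrder
          FROM CustomerOrder
          GROUP BY CustomerID) AS o
      ON o.CustomerID = c.ID
    WHERE c.OrgEnrollDate IS NOT NULL
""").fetchone()

print(avg_days)
```

Here the two customers stayed 365 and 10 days, so the average is 187.5 days. Turning that into "years and months" for reporting needs a convention for month length, which is a separate choice.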