Calculate Average after populating a temp table - sql

I have been tasked with figuring out the average length of time that our customers stick with us. (Specifically from the date they become a customer, to when they placed their last order.)
I am not 100% sure that I am doing this properly, but my thought was to gather the date we enter the customer into the database, head over to the order table and grab their most recent order date, dump both into a temp table, figure out the length of time between those two dates, and then tally an average based on that number.
(I have to do some other wibbly wobbly time stuff as well, but this is the one that's kicking my butt.)
The end goal with this is to be able to say "On Average our customers stick with us for 4 years, and 3 months." (Or whatever the data shows it to be.)
SELECT * INTO #AvgTable
FROM(
SELECT c.CustNumber AS [CustomerNumber]
, COALESCE(CONVERT(VARCHAR(10), c.OrgEnrollDate, 101), '') AS [StartDate]
, COALESCE(CONVERT(VARCHAR(10), MAX(co.OrderDate), 101), '') AS [EndDate]
, DATEDIFF(DD, c.OrgEnrollDate, MAX(co.OrderDate)) AS [LengthOfTime]
FROM dbo.Customer c
JOIN dbo.CustomerOrder co ON c.ID = co.CustomerID
WHERE c.Archived = 0
AND co.Archived = 0
AND c.OrgEnrollDate IS NOT NULL
AND co.OrderDate IS NOT NULL
GROUP BY c.CustNumber
, c.OrgEnrollDate
) AS t
--This is where I start falling apart
Select AVG[LengthofTime]
From #AvgTable

If I understand you correctly, then just try:
SELECT AVG(DATEDIFF(dd, StartDate, EndDate)) AvgTime
FROM #AvgTable

My guess is that since you are storing the data in a temp table, the integer result of the DATEDIFF is being implicitly converted back to a datetime (which you cannot average).
Don't store the difference in your temp table (you don't even need a temp table, but that is a whole different conversation). Just do the differencing in your select.
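Putting the two answers together, here is a minimal sketch of the temp-table-free version, using only the tables and columns from the question (the year/month split at the end is a rough approximation, not a calendar-exact calculation):
DECLARE @avgDays int = (
    SELECT AVG(DATEDIFF(DD, c.OrgEnrollDate, lastOrder.LastOrderDate))
    FROM dbo.Customer c
    CROSS APPLY (
        -- most recent non-archived order per customer
        SELECT MAX(co.OrderDate) AS LastOrderDate
        FROM dbo.CustomerOrder co
        WHERE co.CustomerID = c.ID
          AND co.Archived = 0
    ) lastOrder
    WHERE c.Archived = 0
      AND c.OrgEnrollDate IS NOT NULL
      AND lastOrder.LastOrderDate IS NOT NULL
);

-- Rough split into the "X years and Y months" phrasing from the question
SELECT @avgDays / 365 AS AvgYears,
       (@avgDays % 365) / 30 AS AvgMonths;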

How can I get the first instance of an event per day, returning multiple columns plus the full datetime value?

I need to generate a SQL script that will pull out distinct entries using a number of columns, one of which is a datetime column. I am only interested in the first occurrence of each event per day, and the query needs to span multiple days. The query will run against a very large database and can potentially return hundreds of thousands of results, if not millions, so it needs to be as efficient as possible. This will eventually run in SSRS to pull access transactions.
I've tried using GROUP BY, DISTINCT, subqueries, FIRST, and such without success. All the examples I can find online don't have JOIN statements or calculated columns, such as extracting just the date from a datetime field.
I've simplified the script below to pull only one day and one door, but production will cover multiple days and doors. This code returns the data I need (I don't care about the COUNT), but I also need the (DateAdd(minute,-(ServerLocaleOffset),ServerUTC)) value in my result set somehow. The problem is that, since it goes down to the second, it makes every record distinct.
DECLARE @Begin datetime2 = '4/10/2019',
        @End datetime2 = '4/11/2019',
        @Door varchar(max) = 'Front Entrance'
SELECT
CONVERT(VARCHAR(10), (DateAdd(minute,-(ServerLocaleOffset),ServerUTC)),101) AS 'Date'
,AJ.PrimaryObjectIdentity
,AJ.SecondaryObjectIdentity
,AJ.MessageType
,AJ.PrimaryObjectName
,AJ.SecondaryObjectName
,AP.Text13
,COUNT(*) AS 'Count'
FROM Access.JournalLogView AJ
LEFT OUTER JOIN Access.Personnel as AP on AP.GUID = AJ.PrimaryObjectIdentity
WHERE (MessageType like 'CardAdmitted' OR MessageType like 'CardRejected')
AND (DateAdd(minute,-(ServerLocaleOffset),ServerUTC)) BETWEEN @Begin AND @End
AND (SecondaryObjectName IN (@Door))
GROUP BY CONVERT(VARCHAR(10), (DateAdd(minute,-(ServerLocaleOffset),ServerUTC)),101)
,PrimaryObjectIdentity
,SecondaryObjectIdentity
,MessageType
,PrimaryObjectName
,SecondaryObjectName
,Text13
ORDER BY AJ.PrimaryObjectName
I want to get the columns called out in the SELECT statement, plus the datetime that includes the seconds. Again, I want the most efficient way of pulling this data. Thank you very much.
Assuming PrimaryObjectIdentity is the key that identifies the personnel in JournalLogView, and that the adjusted ServerUTC (offset by ServerLocaleOffset) is the datetime to order on, I have written down this:
DECLARE @Begin datetime2 = '4/10/2019',
        @End datetime2 = '4/11/2019',
        @Door varchar(max) = 'Front Entrance'
WITH cte
AS(
SELECT
ROW_NUMBER() OVER
(PARTITION BY PrimaryObjectIdentity, CONVERT(VARCHAR(10), (DateAdd(minute,-(ServerLocaleOffset),ServerUTC)), 101) ORDER BY DateAdd(minute,-(ServerLocaleOffset),ServerUTC)) AS row_num,
--whatever the columns you want here
*
FROM
Access.JournalLogView)
SELECT
DateAdd(minute,-(ServerLocaleOffset),ServerUTC) AS 'DateTime'
,AJ.PrimaryObjectIdentity
,AJ.SecondaryObjectIdentity
,AJ.MessageType
,AJ.PrimaryObjectName
,AJ.SecondaryObjectName
,AP.Text13
--I guess count(*) won't be of use a we are selecting only the first row
,COUNT(*) AS 'Count'
FROM cte AJ
LEFT OUTER JOIN
Access.Personnel as AP
on
AP.GUID = AJ.PrimaryObjectIdentity
WHERE
AJ.row_num = 1
AND (MessageType like 'CardAdmitted' OR MessageType like 'CardRejected')
AND (DateAdd(minute,-(ServerLocaleOffset),ServerUTC)) BETWEEN @Begin AND @End
AND (SecondaryObjectName IN (@Door))
GROUP BY (DateAdd(minute,-(ServerLocaleOffset),ServerUTC))
,PrimaryObjectIdentity
,SecondaryObjectIdentity
,MessageType
,PrimaryObjectName
,SecondaryObjectName
,Text13
ORDER BY AJ.PrimaryObjectName
In this query, I have used PARTITION BY to partition the whole table by user and date, and then assigned ROW_NUMBER() to each row starting from the first entry of each user on that particular date. So any row with row_num = 1 gives you the first entry of that user on that date (which is the condition I have used in the WHERE clause). Hope this helps :)

SQL: Average value per day

I have a table called 'tweets'. The table includes (amongst others) the columns 'tweet_id', 'created_at' (dd/mm/yyyy hh/mm/ss), 'classified' and 'processed_text'. Within the 'processed_text' column there are certain strings, such as '{TICKER|IBM}', to which I will refer as ticker-strings.
My target is to get the average value of 'classified' per ticker-string per day. The column 'classified' contains the numerical values -1, 0 and 1.
At the moment, I have a working SQL query for the average value of 'classified' for one ticker-string per day. See the script below.
SELECT Date( `created_at` ) , AVG( `classified` ) AS Classified
FROM `tweets`
WHERE `processed_text` LIKE '%{TICKER|IBM}%'
GROUP BY Date( `created_at` )
There are however two problems with this script:
It does not include days on which there were zero 'processed_text's matching {TICKER|IBM}. I would, however, like it to output the value zero in that case.
I have 100+ different ticker-strings and would thus like a script that can process multiple strings at the same time. I could also do them manually, one by one, but that would cost me a terrible amount of time.
When I had a similar question about counting the 'tweet_id's per ticker-string, somebody suggested the following:
SELECT d.date, coalesce(IBM, 0) as IBM, coalesce(GOOG, 0) as GOOG,
coalesce(BAC, 0) AS BAC
FROM dates d LEFT JOIN
(SELECT DATE(created_at) AS date,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|IBM}%' then tweet_id
END) as IBM,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|GOOG}%' then tweet_id
END) as GOOG,
COUNT(DISTINCT CASE WHEN processed_text LIKE '%{TICKER|BAC}%' then tweet_id
END) as BAC
FROM tweets
GROUP BY date
) t
ON d.date = t.date;
This script worked perfectly for counting the tweet_ids per ticker-string. As I stated, however, I am now looking to find the average classified scores per ticker-string. My question is therefore: could someone show me how to adjust this script so that I can calculate the average classified scores per ticker-string per day?
SELECT d.date, t.ticker, COALESCE(COUNT(DISTINCT t.tweet_id), 0) AS tweets
FROM dates d
LEFT JOIN
(SELECT DATE(created_at) AS date,
tweet_id,
SUBSTR(processed_text,
LOCATE('{TICKER|', processed_text) + 8,
LOCATE('}', processed_text, LOCATE('{TICKER|', processed_text))
- LOCATE('{TICKER|', processed_text) - 8) AS ticker
FROM tweets) t
ON d.date = t.date
GROUP BY d.date, t.ticker
This will put each ticker on its own row, not a column. If you want them moved to columns, you have to pivot the result. How you do this depends on the DBMS. Some have built-in features for creating pivot tables. Others (e.g. MySQL) do not and you have to write tricky code to do it; if you know all the possible values ahead of time, it's not too hard, but if they can change you have to write dynamic SQL in a stored procedure.
See MySQL pivot table for how to do it in MySQL.
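For the averages themselves, a conditional-aggregation sketch in the same spirit as the earlier count query (reusing the dates helper table and the LIKE patterns from the question; untested against the real schema). AVG ignores the NULLs that the CASE produces for non-matching tweets, and COALESCE supplies the requested zero on days with no matches:
SELECT d.date,
       -- per-ticker daily averages; CASE yields NULL for non-matching rows, which AVG skips
       COALESCE(AVG(CASE WHEN t.processed_text LIKE '%{TICKER|IBM}%'  THEN t.classified END), 0) AS IBM,
       COALESCE(AVG(CASE WHEN t.processed_text LIKE '%{TICKER|GOOG}%' THEN t.classified END), 0) AS GOOG,
       COALESCE(AVG(CASE WHEN t.processed_text LIKE '%{TICKER|BAC}%'  THEN t.classified END), 0) AS BAC
FROM dates d
LEFT JOIN tweets t ON DATE(t.created_at) = d.date
GROUP BY d.date;
As with the count version, each extra ticker is one more CASE column, so 100+ tickers would call for generating the statement dynamically.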

Improving performance of outer apply

Let me briefly describe what I'm attempting, in case someone has a much more elegant way of solving the same problem. I'm trying to write a stored procedure that looks at sales orders in a database, finds when the same item is ordered by the same customer multiple times, and predicts the next order date using an average of the previous intervals between orders for the same item. The query below will form the basis for a temp table to work against, probably with cursors and running averages.
So far the query I have looks like this:
SELECT sl.custaccount ,
sl.itemid ,
sl.shippingdaterequested ,
nextdate.shippingdaterequested AS nextshippingdaterequested
FROM salesline AS sl
OUTER APPLY ( SELECT TOP 1
sl2.custaccount ,
sl2.itemid ,
sl2.shippingdaterequested
FROM salesline AS sl2
WHERE sl2.shippingdaterequested > sl.shippingdaterequested
AND sl2.custaccount = sl.custaccount
AND sl2.itemid = sl.itemid
GROUP BY sl2.custaccount ,
sl2.itemid ,
sl2.shippingdaterequested
ORDER BY sl2.shippingdaterequested
) AS nextdate
GROUP BY sl.custaccount ,
sl.itemid ,
sl.shippingdaterequested ,
nextdate.shippingdaterequested
This query gives me a row for every sales line with a column representing the next time that item was ordered by that customer. If that column is NULL, I know the record I'm on is the last time.
The basic problem is that this query is way too slow. It runs fine if I go against a single customer at a time, returning results in a second, but running against ~100,000 customers would take around 27 hours.
I know the basic problem is that I'm outer applying, so it's probably doing row-by-agonizing-row processing, but I'm not sure of another way to get there that would work out faster. Any thoughts?
I think you are making it more complex than it needs to be.
Just take the min and max and divide the span by the count:
SELECT sl.custaccount ,
sl.itemid ,
MAX(sl.shippingdaterequested) AS lastShip ,
DATEDIFF(dd, MIN(sl.shippingdaterequested),
MAX(sl.shippingdaterequested)) / COUNT(*) AS interval ,
DATEADD(dd,
DATEDIFF(dd, MIN(sl.shippingdaterequested),
MAX(sl.shippingdaterequested)) / COUNT(*),
MAX(sl.shippingdaterequested)) AS nextShip
FROM salesline AS sl
GROUP BY sl.custaccount ,
sl.itemid
HAVING COUNT(*) > 1
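On SQL Server 2012 or later, LAG computes the per-gap intervals in a single pass with no correlated apply at all; here is a sketch against the same salesline columns. (Note that n orders have only n - 1 gaps, so averaging the gaps directly sidesteps the off-by-one you get from dividing the full span by COUNT(*).)
WITH gaps AS (
    SELECT custaccount,
           itemid,
           shippingdaterequested,
           -- days since this customer's previous order of the same item (NULL on the first order)
           DATEDIFF(dd,
                    LAG(shippingdaterequested) OVER (
                        PARTITION BY custaccount, itemid
                        ORDER BY shippingdaterequested),
                    shippingdaterequested) AS daysSincePrev
    FROM salesline
)
SELECT custaccount,
       itemid,
       MAX(shippingdaterequested) AS lastShip,
       AVG(daysSincePrev) AS avgInterval,       -- AVG ignores the NULL first rows
       DATEADD(dd, AVG(daysSincePrev), MAX(shippingdaterequested)) AS nextShip
FROM gaps
GROUP BY custaccount, itemid
HAVING COUNT(daysSincePrev) >= 1;               -- keep only items ordered more than once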

Find closest date in SQL Server

I have a table dbo.X with DateTime column Y which may have hundreds of records.
My stored procedure has a parameter @CurrentDate; I want to find the date in column Y of the table dbo.X above which is less than and closest to @CurrentDate.
How to find it?
The where clause will match all rows with date less than @CurrentDate and, since they are ordered descending, the TOP 1 will be the closest date to the current date.
SELECT TOP 1 *
FROM x
WHERE x.date < @CurrentDate
ORDER BY x.date DESC
Use DATEDIFF and order your result by how many days or seconds lie between each date and the input (keeping the difference positive so that ascending order puts the closest date first).
Something like this:
select top 1 rowId, dateCol, datediff(second, dateCol, @CurrentDate) as SecondsBetweenDates
from myTable
where dateCol < @CurrentDate
order by datediff(second, dateCol, @CurrentDate)
I think I have a better solution for this problem.
I will show a few images to support and explain the final solution.
Background
In my solution I have a table of FX rates. These represent market rates for different currencies. However, our service provider has had a problem with the rate feed, and as such some rates have zero values. I want to fill the missing data with rates for the same currency that are closest in time to the missing rate. Basically, I want to get the RateId of the nearest non-zero rate, which I will then substitute. (This is not shown here in my example.)
1) So to start off, let's identify the missing rates:
Query showing my missing rates, i.e. those with a rate value of zero
2) Next, let's identify the rates that are not missing.
Query showing rates that are not missing
3) This query is where the magic happens. I have made an assumption here which can be removed, but which improves the efficiency of the query: the join on the casted import dates (in cte_Nearest_Transaction below) expects to find a substitute transaction on the same day as the missing / zero transaction.
The magic is the ROW_NUMBER call: it assigns 1 to the shortest time difference between the missing and non-missing transaction, 2 to the next closest, and so on.
Please note that I must also join on the currency so that I do not mismatch currency types. That is, I don't want to substitute an AUD rate with CHF values; I want the closest match within the same currency.
Combining the two data sets with a row_number to identify nearest transaction
4) Finally, let's get the data where RowNum = 1.
The final query
The full query is as follows:
; with cte_zero_rates as
(
Select *
from fxrates
where (spot_exp = 0 or spot_imp = 0)
),
cte_non_zero_rates as
(
Select *
from fxrates
where (spot_exp > 0 and spot_imp > 0)
)
,cte_Nearest_Transaction as
(
select z.FXRatesID as Zero_FXRatesID
,z.importDate as Zero_importDate
,z.currency as Zero_Currency
,nz.currency as NonZero_Currency
,nz.FXRatesID as NonZero_FXRatesID
,nz.spot_imp
,nz.importDate as NonZero_importDate
,DATEDIFF(ss, z.importDate, nz.importDate) as TimeDifference
,ROW_NUMBER() Over(partition by z.FXRatesID order by abs(DATEDIFF(ss, z.importDate, nz.importDate)) asc) as RowNum
from cte_zero_rates z
left join cte_non_zero_rates nz on nz.currency = z.currency
and cast(nz.importDate as date) = cast(z.importDate as date)
--order by z.currency desc, z.importDate desc
)
select n.Zero_FXRatesID
,n.Zero_Currency
,n.Zero_importDate
,n.NonZero_importDate
,DATEDIFF(s, n.NonZero_importDate,n.Zero_importDate) as Delay_In_Seconds
,n.NonZero_Currency
,n.NonZero_FXRatesID
from cte_Nearest_Transaction n
where n.RowNum = 1
and n.NonZero_FXRatesID is not null
order by n.Zero_Currency, n.NonZero_importDate

Window moving average in sql server

I am trying to create a function that computes a windowed moving average in SQL Server 2008. I am quite new to SQL, so I am having a fair bit of difficulty. The data I am trying to average needs to be grouped by day (it is all timestamped) and then have a variable moving-average window applied to it.
I already have a function that groups the data by day (and @id), shown at the bottom. I have a few questions:
Would it be better to call the grouping function inside the moving average function or should I do it all at once?
Is it possible to get the moving average for the dates input into the function, but go back n days before the start so that the first n days of the returned data do not have 0 for their average? (I.e., if they want a 7-day moving average from 01-08-2011 to 02-08-2011, I would start the calculation on 01-01-2011 so that the first day they defined has a value.)
I am in the process of looking into how to do the moving average, and I know a running window seems to be the best option (currentAvg = (prevSum + todayCount - nthDayAgoCount) / nDays), but I am still working out the SQL implementation of this.
I have a grouping function that looks like this (some variables removed for visibility purposes):
SELECT
'ALL' as GeogType,
CAST(v.AdmissionOn as date) as dtAdmission,
CASE WHEN @id IS NULL THEN 99 ELSE v.ID END,
COUNT(*) as nVisits
FROM dbo.Table1 v INNER JOIN dbo.Table2 t ON v.FSLDU = t.FSLDU5
WHERE v.AdmissionOn >= '01-01-2010' AND v.AdmissionOn < DATEADD(day,1,'02-01-2010')
AND v.ID = Coalesce(@id,ID)
GROUP BY
CAST(v.AdmissionOn as date),
CASE WHEN @id IS NULL THEN 99 ELSE v.ID END
ORDER BY 2,3,4
Which returns a table like so:
ALL 2010-01-01 1 103
ALL 2010-01-02 1 114
ALL 2010-01-03 1 86
ALL 2010-01-04 1 88
ALL 2010-01-05 1 84
ALL 2010-01-06 1 87
ALL 2010-01-07 1 82
EDIT: To answer the first question I asked:
I ended up creating a function which declares a temporary table, inserts the results of the count function into it, and then uses the example from user662852 to compute the moving average.
Take the hardcoded date range out of your query. Write the output (like your sample at the end) to a temp table (I called it #visits below).
Try this self-join on the temp table:
Select list.dtadmission
, AVG(data.nvisits) as Avg
, SUM(data.nvisits) as sum
, COUNT(data.nvisits) as RollingDayCount
, MIN(data.dtadmission) as Verifymindate
, MAX(data.dtadmission) as Verifymaxdate
from #visits as list
inner join #visits as data
on list.dtadmission between data.dtadmission and DATEADD(DD,6,data.dtadmission)
group by list.dtadmission
EDIT: I didn't have enough room in Comments to say this in response to your question:
My join is "kinda cartesian" because it uses a BETWEEN in the join constraint. Each record in list goes up against every other record, and then I keep the ones where the date I report is between a lower bound of six days earlier (a seven-day window) and today. Every data date is available to every list date; this is the key to your question. I could have written the join condition as
list.dtadmission between DATEADD(DD,-6,data.dtadmission) and data.dtadmission
But what really happened was I tested it as
list.dtadmission between DATEADD(DD,6,data.dtadmission) and data.dtadmission
Which returns no records because the syntax is "Between LOW and HIGH". I facepalmed on 0 records and swapped the arguments, that's all.
Try the following and see what I mean. This is the cartesian join for just one list date:
SELECT
list.[dtAdmission] as listdate
,data.[dtAdmission] as datadate
,data.nVisits as datadata
,DATEADD(dd,6,list.dtadmission) as listplus6
,DATEADD(dd,6,data.dtAdmission) as dataplus6
from [sandbox].[dbo].[admAvg] as list inner join [sandbox].[dbo].[admAvg] as data
on
1=1
where list.dtAdmission = '5-Jan-2011'
Compare this to the actual join condition
SELECT
list.[dtAdmission] as listdate
,data.[dtAdmission] as datadate
,data.nVisits as datadata
,DATEADD(dd,6,list.dtadmission) as listplus6
,DATEADD(dd,6,data.dtAdmission) as dataplus6
from [sandbox].[dbo].[admAvg] as list inner join [sandbox].[dbo].[admAvg] as data
on
list.dtadmission between data.dtadmission and DATEADD(DD,6,data.dtadmission)
where list.dtAdmission = '5-Jan-2011'
See how list date is between datadate and dataplus6 in all the records?
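As a closing note: on SQL Server 2012 or later the self-join is unnecessary, because AVG supports a ROWS window directly. Here is a sketch against the same #visits temp table (ROWS counts rows rather than calendar days, so it assumes one row per date):
SELECT dtadmission,
       nvisits,
       -- trailing 7-day average: the current row plus the 6 preceding rows
       AVG(CAST(nvisits AS decimal(10, 2))) OVER (
           ORDER BY dtadmission
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS Rolling7DayAvg
FROM #visits
ORDER BY dtadmission;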