SQL query compare value with average of similar records - sql

The table has 3 columns: Category, Value (int), Date.
What I want the SQL query to do is check, for each record belonging to a specific category, whether its value lies within a specific tolerance range (say t) of the average value over the last 100 records that have the same weekday (Monday, Tuesday, etc.) and the same category as the record in question.
I was able to implement this partially, since I know the category beforehand, but the weekday depends on the record being queried. Also, I am currently only checking whether the value is greater than the average, whereas I need to check whether it lies within a certain tolerance.
SELECT Value, Date,
       CASE WHEN Value > (SELECT AVG(Value)
                          FROM Table
                          WHERE Category = 'CategoryX'
                            AND Date BETWEEN current_date - 700 AND current_date - 1)
            THEN 1
            ELSE 0
       END AS check_avg
FROM Table
WHERE Category = 'CategoryX'
Sample:
Category    Value   Date
CategoryX   5000    2022-06-29
CategoryX   4500    2022-06-27
CategoryX   1000    2022-06-22
CategoryY   4500    2022-06-15
CategoryX   2000    2022-06-15
CategoryX   3000    2022-06-08
Expected Result:
Value in the record with today's date: 5000.
Average of the values in records with the same weekday and same category: (1000 + 2000 + 3000) / 3 = 2000.
If the tolerance is 50%, the allowed value should be between 1000 and 3000.
So the result should be 0.

Make sure that in both queries you are evaluating the same category and the same weekday. Then sort the values that will be used to compute the average by date and take only the immediately preceding 100 records. Finally, check that the difference between the current value and the average is below the tolerance epsilon.
SELECT Value, Date,
       CASE WHEN ABS(Value - (SELECT AVG(Value)
                              FROM (SELECT TOP 100 Value
                                    FROM Table
                                    WHERE Category = t.Category
                                      AND DATEPART(WEEKDAY, Date) = DATEPART(WEEKDAY, t.Date)
                                      AND Date <= t.Date
                                    ORDER BY Date DESC) AS last100)) < epsilon
            THEN 1
            ELSE 0
       END AS check_avg
FROM Table t
WHERE Category = 'CategoryX'
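If the tolerance is relative, e.g. 50% of the average as in the expected result above, compare against a fraction of the average rather than a fixed epsilon. A minimal sketch of that variant, assuming SQL Server; CROSS APPLY lets the derived table reference the outer row directly, and 0.5 stands for the 50% tolerance:
SELECT t.Value, t.Date,
       CASE WHEN ABS(t.Value - a.avg_value) <= 0.5 * a.avg_value THEN 1
            ELSE 0
       END AS check_avg
FROM Table t
CROSS APPLY (
    -- average of the 100 most recent earlier records with the same weekday and category
    SELECT AVG(prev.Value * 1.0) AS avg_value    -- * 1.0 avoids integer division
    FROM (SELECT TOP 100 Value
          FROM Table
          WHERE Category = t.Category
            AND DATEPART(WEEKDAY, Date) = DATEPART(WEEKDAY, t.Date)
            AND Date < t.Date                    -- strictly earlier records only
          ORDER BY Date DESC) AS prev
) AS a
WHERE t.Category = 'CategoryX'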


Calculating Datediff of two days based on when the sum of a column hits a number cap

Tried to see if this was asked anywhere else, but it doesn't seem like it. I'm trying to create a SQL query that gives me the date difference in days between '2022-10-01' and the date when our impression sum hits our cap of 5.
For context, we may see duplicate dates because someone revisits our website that day, so we'll get a different session number to pair with that count. Here's an example table of one individual and how many impressions were logged.
My goal is to get the number of days it takes to hit an impression cap of 5. So for this individual, they would hit the cap on '2022-10-07', and the days between '2022-10-01' and '2022-10-07' is 6. I am also calculating the difference before/after '2023-01-01', since I need this count for Q4 of '22 and Q1 of '23, but I will not include that in the example table. I have other individuals to include, but for the purpose of asking here I kept it to one.
Current Query:
select
    click_date,
    case
        when date(click_date) < date('2023-01-01') and sum(impression_cnt = 5) then datediff('day', '2022-10-01', click_date)
        when date(click_date) >= date('2023-01-01') and sum(impression_cnt = 5) then datediff('day', '2023-01-01', click_date)
        else 0
    end days_to_capped
from table
group by customer, click_date, impression_cnt
customer   click date    impression_cnt
123456     2022-10-05    2
123456     2022-10-05    1
123456     2022-10-06    1
123456     2022-10-07    1
123456     2022-10-11    1
123456     2022-10-11    3
Result Table
customer   days_to_cap
123456     6
I'm currently only getting 0 days and then 81 days once it hits 2022-12-21 (the last date) for this individual, so I know I need to fix my query. Any help would be appreciated!
Edited: This is in Snowflake!
So, the issue with your query is that the sum is being calculated at the level you are grouping by, which is every field, so it will always just be the value of the impression_cnt field for that row.
What you need is a running sum, which is a SUM() OVER (PARTITION BY ...) expression, and then a QUALIFY on the results of that.
First, just to get the data that you have:
with x as (
    select *
    from values
        (123456, '2022-10-05'::date, 2),
        (123456, '2022-10-05'::date, 1),
        (123456, '2022-10-06'::date, 1),
        (123456, '2022-10-07'::date, 1),
        (123456, '2022-10-11'::date, 1),
        (123456, '2022-10-11'::date, 3) x (customer, click_date, impression_cnt)
)
Then I query the CTE to do the running sum, with a QUALIFY clause to choose the record that actually has the value I'm looking for:
select
    customer,
    case
        when click_date < '2023-01-01'::date
             and sum(impression_cnt) over (partition by customer order by click_date) = 5
            then datediff('day', '2022-10-01', click_date)
        when click_date >= '2023-01-01'::date
             and sum(impression_cnt) over (partition by customer order by click_date) = 5
            then datediff('day', '2023-01-01', click_date)
        else 0
    end days_to_capped
from x
qualify days_to_capped > 0;
The qualify filters your results to just the record that you cared about.
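One caveat, as a hedged variant: if a customer's running total jumps past the cap without ever equalling exactly 5 (say 4 and then 7), the = 5 test matches no row. Comparing with >= and taking the first date that crosses the cap avoids that. A sketch using the same x CTE, keeping only the '2022-10-01' anchor for brevity:
select
    customer,
    datediff('day', '2022-10-01',
             min(case when running_total >= 5 then click_date end)) as days_to_capped
from (
    select
        customer,
        click_date,
        -- cumulative impressions per customer in date order
        sum(impression_cnt) over (partition by customer order by click_date) as running_total
    from x
) r
group by customer;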

LAG and LEAD based on parameter

I have a table, Invoices, with the following structure:
Id   InvoiceNo   Date
1    10          11-12-21
2    20          12-12-21
3    30          13-12-21
4    40          NULL
5    50          14-12-21
6    60          NULL
7    70          NULL
8    80          15-12-21
What I need to do is find the InvoiceNos whose next or previous row has a NULL in the Date field.
So, based on the provided data, I should receive:
InvoiceNo
30
50
80
But how do I do this? One option I found is the LAG() and LEAD() functions; with them I can get the neighbouring numbers and dates, but I cannot pass extra conditions, so I cannot apply a "Date is not null" check.
You can use lag and lead to find the previous and next rows, and then wrap the query with another query that returns only the rows where one of them is null. Note that lag of the first row and lead of the last row will return null by default, so you need to explicitly state a non-null default, such as getdate(). To match your expected output, the outer query also keeps only rows whose own Date is not null:
SELECT InvoiceNo
FROM (SELECT InvoiceNo,
             Date,
             LAG(Date, 1, GETDATE()) OVER (ORDER BY InvoiceNo) AS lag_date,
             LEAD(Date, 1, GETDATE()) OVER (ORDER BY InvoiceNo) AS lead_date
      FROM invoices) t
WHERE Date IS NOT NULL
  AND (lag_date IS NULL OR lead_date IS NULL)
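An equivalent sketch that avoids the GETDATE() sentinel is to LAG/LEAD a 0/1 null flag instead of the date itself; for the first and last rows the flag is simply NULL and never equals 1:
SELECT InvoiceNo
FROM (SELECT InvoiceNo,
             Date,
             -- 1 if the neighbouring row's Date is NULL, 0 otherwise
             LAG(CASE WHEN Date IS NULL THEN 1 ELSE 0 END) OVER (ORDER BY InvoiceNo) AS prev_is_null,
             LEAD(CASE WHEN Date IS NULL THEN 1 ELSE 0 END) OVER (ORDER BY InvoiceNo) AS next_is_null
      FROM invoices) t
WHERE Date IS NOT NULL
  AND (prev_is_null = 1 OR next_is_null = 1)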

Retrieve data 60 days prior to their retest date

I have a requirement where I need to retrieve row(s) 60 days prior to their "Retest Date", which is a column in the table (u_retest in the sample data below).
reagentlotid   reagentlotdesc   u_retest
RL-0000004     NULL             2021-09-30 17:00:00.00
RL-0000005     NULL             2021-09-29 04:21:00.00
RL-0000006     NULL             2021-09-29 04:22:00.00
RL-0000007     Y-T4             2021-08-28 05:56:00.00
RL-0000008     NULL             2021-09-30 05:56:00.00
RL-0000009     NULL             2021-09-28 04:23:00.00
This is what I was trying to do in SQL Server:
select r.reagentlotid, r.reagentlotdesc, r.u_retestdt
from reagentlot r
where u_retestdt = DATEADD(DD,60,GETDATE());
But it didn't work; the above query returns 0 rows.
Could someone please help me with this query?
Use a range, if you want all data from the day 60 days hence:
select r.reagentlotid, r.reagentlotdesc, r.u_retestdt
from reagentlot r
where u_retestdt >= CAST(DATEADD(DD, 60, GETDATE()) AS DATE)
  and u_retestdt <  CAST(DATEADD(DD, 61, GETDATE()) AS DATE)
Dates are like numbers; the time is like a decimal part. 12:00:00 is halfway through a day, so it's like x.5 - SQL Server even lets you manipulate datetime types by adding fractions of days, etc. (adding 0.5 is adding 12h).
If you had a column of numbers like 1.1, 1.5, 2.4 and you want all the one-point-somethings, you can't get any of them by saying score = 1; you say score >= 1 and score < 2.
Generally, you should try to avoid manipulating table data in a query's WHERE clause because it usually makes indexes unusable: if you want "all numbers between 1 and 2", use a range; don't chop the decimal off the table data in order to compare it to 1. Same with dates; don't chop the time off - use a range:
--yes
WHERE score >= 1 and score < 2
--no
WHERE CAST(score as INTEGER) = 1
--yes
WHERE birthdatetime >= '1970-01-01' and birthdatetime < '1970-01-02'
--no
WHERE CAST(birthdatetime as DATE) = '1970-01-01'
Note that I am using a CAST to cut the time off in my recommendation to you, but that's to establish a pair of constants, "midnight on the day 60 days in the future" and "midnight on the day 61 days in the future", that will be used in the range check.
Follow the rule of thumb of "avoid calling functions on columns in a where clause" and generally, you'll be fine :)
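A short sketch of the same pattern with the two boundary constants computed once up front (the variable names are illustrative):
DECLARE @from date = CAST(DATEADD(DD, 60, GETDATE()) AS date);  -- midnight, 60 days ahead
DECLARE @to   date = DATEADD(DD, 1, @from);                     -- midnight, 61 days ahead

SELECT r.reagentlotid, r.reagentlotdesc, r.u_retestdt
FROM reagentlot r
WHERE r.u_retestdt >= @from
  AND r.u_retestdt <  @to;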
Try something like this; -60 days may fall in the current or the previous year. HTH
;with doy1 as (
    select DATENAME(dayofyear, dateadd(day, -60, GetDate())) as doy
)
, doy2 as (
    select case when doy > 0 then doy
                when doy < 0 then 365 - doy end as doy
         , case when doy > 0 then year(getdate())
                when doy < 0 then year(getdate()) - 1 end as yr
    from doy1
)
select r.reagentlotid
     , r.reagentlotdesc
     , cast(r.u_retestdt as date) as u_retestdt
from reagentlot r
inner join doy2 d on DATENAME(dayofyear, r.u_retestdt) = d.doy
where DATENAME(dayofyear, r.u_retestdt) = doy
  and year(r.u_retestdt) = d.yr

Selecting records with total greater then specified value over specified time interval

This is for Check Cashing business.
I have a table of checks cashed:
CustomerID   CustomerName   DateTimeCashed        CheckAmount   CheckFee   CheckPaypot
00100        John Doe       01/01/2017 12:40:30   1000          20         980
00200        John Smith     01/02/2017 13:24:45   2000          40         1960
..................
There are thousands of records like this.
I need to build a query which would return all records where total CheckPaypot for each Customer cashed in any 24 hour period exceeds 10000.
I know how to do this if a 24-hour interval is defined as a day from 12:00 AM to 11:59 PM.
Select * from (
    Select CustomerID, CustomerName, DateTimeCashed, CheckAmount, CheckFee, CheckPaypot,
           (Select sum(ch.CheckPaypot)
            from Checks ch
            where ch.CustomerID = c.CustomerID
              and CONVERT(date, ch.DateTimeCashed) = CONVERT(date, c.DateTimeCashed)) as Total
    from Checks c) x
where x.Total > 10000
But the requirement is that the time interval is floating, meaning that the beginning and end can be anything as long as the length of the interval is 24 hours. So if the customer cashed 3 checks, 1 check in the afternoon and 2 checks before noon of the next day, and the total of these checks is over $10000, they all must be included in the result.
Thank you,
lenkost.
SELECT
CustomerID,
SUM(CheckPaypot)
FROM
tb_previsao
WHERE
DateTimeCashed > DateTimeCashed - INTERVAL '1' DAY
GROUP BY
CustomerID
HAVING
SUM(CheckPaypot) > 10000;
Unfortunately, you'll have to use a correlated subquery:
SELECT *
FROM (
    SELECT outer_ch.*,
           (SELECT SUM(inner_ch.CheckPaypot)
            FROM checks inner_ch
            WHERE DATEDIFF(HOUR, inner_ch.datetimecashed, outer_ch.datetimecashed)
                      BETWEEN 0 AND 23
              AND inner_ch.customerid = outer_ch.customerid) AS running_sum_checkpaypot
    FROM checks outer_ch
) t
WHERE running_sum_checkpaypot > 10000
I say "unfortunately" because correlated subqueries are necessarily inefficient as they execute a separate subquery for each row in the result set. If this doesn't perform well enough, try to avoid doing a full table scan for each of these subqueries, e.g. by putting an index on customerid.

MDX last order date and last order value

I've googled but I cannot figure this out.
I have a fact table like this one:
fact_order
id   id_date   amount   id_supplier
1    1         100      4
2    3         200      4
where id_date refers to the primary key of a date dimension that has:
id   date         month
1    01/01/2011   january
2    02/01/2011   january
3
I would like to write a calculated member that gives me the last date and the last amount for the same supplier.
Last date and last amount -- are these the maximum values for this supplier?
If "yes", you can create two measures with "max" aggregation for the id_date and amount fields.
And convert the max id_date to an appropriate view in the following way:
CREATE MEMBER CURRENTCUBE.[Measures].[Max Date]
AS
IIF([Measures].[Max Date Key] is NULL,NULL,
STRTOMEMBER("[Date].[Calendar].[Date].&["+STR([Measures].[Max Date Key])+"]").name),
VISIBLE = 1 ;
It will work if the maximum dates in your dictionary have the maximum IDs. In my opinion you should use a date_id not of 1, 2, 3, ..., but of 20110101, 20110102, etc.
If you don't want to obtain max values, please provide more details and a small example.