SQL Server LAG function with WHERE condition

In the database I have values that increase by +1000, +2000, +3000, ... relative to the previous value. Sometimes they decrease instead of increasing, and I wrote a query to list the differences:
SELECT NUMBER - LAG(NUMBER) OVER (ORDER BY DATE_TIME) AS DIFF
FROM exampleTable WITH (NOLOCK)
WHERE CONDITION1 = 'abcdef' AND DATE_TIME >= '20220801'
This works: I export to Excel, filter, and find the rows where the difference is less than 0. But can I filter on that difference directly in the WHERE clause in SQL?
I tried HAVING, since it is a computed rather than a regular column, but that didn't work either:
AND (NUMBER - LAG(NUMBER) OVER (ORDER BY DATE_TIME)) < 0
ORDER BY DATE_TIME ASC

So basically it is like this: window functions are evaluated after WHERE and HAVING, so you cannot filter on them directly. Wrap the query in a CTE and filter there:
;WITH CTE AS (
    SELECT NUMBER - LAG(NUMBER) OVER (ORDER BY DATE_TIME) AS DIFF
    FROM exampleTable WITH (NOLOCK)
    WHERE CONDITION1 = 'abcdef' AND DATE_TIME >= '20220801'
)
SELECT * FROM CTE WHERE DIFF < 0
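If you also want to see when each drop happened, a variant that carries DATE_TIME through the CTE (a sketch, assuming the same table and columns as the question):
;WITH CTE AS (
    SELECT DATE_TIME,
           NUMBER,
           NUMBER - LAG(NUMBER) OVER (ORDER BY DATE_TIME) AS DIFF
    FROM exampleTable WITH (NOLOCK)
    WHERE CONDITION1 = 'abcdef' AND DATE_TIME >= '20220801'
)
SELECT DATE_TIME, NUMBER, DIFF
FROM CTE
WHERE DIFF < 0            -- rows where the value decreased
ORDER BY DATE_TIME;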

Related

BigQuery - Extract last entry of each group

I have one table where multiple records are inserted for each group of product. Now I want to SELECT only the last entry per group. For details, see the screenshot: the yellow-highlighted records should be returned by the select query.
The HAVING MAX and HAVING MIN clauses for the ANY_VALUE function are now in preview. They were introduced for some aggregate functions in the February 6, 2023 release - https://cloud.google.com/bigquery/docs/release-notes#February_06_2023
With them the query can be very simple - consider the approach below:
select any_value(t having max datetime).*
from your_table t
group by t.id, t.product
Applied to the sample data in your question, this produces the expected output.
You might also consider the below:
SELECT *
FROM sample_table
QUALIFY DateTime = MAX(DateTime) OVER (PARTITION BY ID, Product);
If you're more familiar with aggregate functions than window functions, the below might be another option.
SELECT ARRAY_AGG(t ORDER BY DateTime DESC LIMIT 1)[SAFE_OFFSET(0)].*
FROM sample_table t
GROUP BY t.ID, t.Product
You can use a window function to partition by the key columns and pick the required row via the ORDER BY field.
For example:
select * from (
  select *,
    rank() over (partition by ID, Product order by DateTime desc) as rank
  from `project.dataset.table`
)
where rank = 1
In T-SQL you could select the last record of each group with TOP (1) WITH TIES:
SELECT TOP (1) WITH TIES * FROM Tablename
ORDER BY ROW_NUMBER() OVER (PARTITION BY ID, Product ORDER BY DateTime DESC);

Why can't I use my column alias in WHERE clause?

I want to compare a value of my current row with the value of the row before. I came up with this, but it won't work. It can't find PREV_NUMBER_OF_PEOPLE so my WHERE clause is invalid. I'm not allowed to use WITH. Does anyone have an idea?
SELECT
ID
,NUMBER_OF_PEOPLE
,LAG(NUMBER_OF_PEOPLE) OVER (ORDER BY DATE) AS PREV_NUMBER_OF_PEOPLE
,DATE
FROM (
SELECT * FROM DATAFRAME
WHERE DATE>=CURRENT_DATE-90
ORDER BY DATE DESC
) AS InnerQuery
WHERE NUMBER_OF_PEOPLE <> PREV_NUMBER_OF_PEOPLE
You have several issues with your query:
The filtering conditions should be in the outer query.
The new column definition should be in the inner query.
The order by should be in the outer query.
With these changes, it should work fine:
SELECT ID, NUMBER_OF_PEOPLE, PREV_NUMBER_OF_PEOPLE, DATE
FROM (SELECT D.*,
LAG(NUMBER_OF_PEOPLE) OVER (ORDER BY DATE) AS PREV_NUMBER_OF_PEOPLE
FROM DATAFRAME D
) AS InnerQuery
WHERE NUMBER_OF_PEOPLE <> PREV_NUMBER_OF_PEOPLE AND
DATE >= CURRENT_DATE - 90
ORDER BY DATE DESC;
You need the filtering after the LAG() so you can include the earliest day in the date range. If you filter in the inner query, the LAG() will return NULL in that case.
You need to define the alias in the subquery so you can refer to it in the WHERE. Aliases defined in a SELECT cannot be used in the corresponding WHERE. This is a SQL rule, not due to the database you are using.
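To see why the placement matters, here is a minimal illustration with inline data (hypothetical rows; standard-SQL VALUES syntax, which varies by engine):
WITH d (id, number_of_people, date) AS (
    VALUES (1, 5, DATE '2024-01-01'),  -- falls outside the 90-day window
           (2, 7, DATE '2024-03-01'),  -- earliest row inside the window
           (3, 7, DATE '2024-03-02')
)
SELECT id, number_of_people,
       LAG(number_of_people) OVER (ORDER BY date) AS prev_number_of_people
FROM d;
-- Computed over all rows, row 2 gets prev = 5, so 7 <> 5 keeps it.
-- Had the 90-day filter run first, row 1 would be gone, row 2's LAG()
-- would return NULL, and NULL <> 7 would silently drop the row as well.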
You could use common table expression (CTE's) to split the query processing.
Something like this:
WITH cte1 AS
(
SELECT * -- field list is advised...
FROM DATAFRAME
WHERE DATE >= CURRENT_DATE-90
),
cte2 AS
(
SELECT ID
,NUMBER_OF_PEOPLE
,LAG(NUMBER_OF_PEOPLE) OVER (ORDER BY DATE) AS PREV_NUMBER_OF_PEOPLE
,DATE
FROM cte1
)
SELECT ID
,NUMBER_OF_PEOPLE
,PREV_NUMBER_OF_PEOPLE
,DATE
FROM cte2
WHERE NUMBER_OF_PEOPLE <> PREV_NUMBER_OF_PEOPLE
ORDER BY DATE DESC;
Logical query processing is the conceptual interpretation of the query that defines the correct result, and unlike the keyed-in order of the query clauses, it starts by evaluating the FROM clause. Understanding logical query processing is crucial for correct understanding of T-SQL.
The main statement used to retrieve data in T-SQL is the SELECT statement. Following are the main query clauses specified in the order that you are supposed to type them (known as “keyed-in order”):
SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
But as mentioned, the logical query processing order, which is the conceptual interpretation order, is different. It starts with the FROM clause. Here is the logical query processing order of the six main query clauses:
FROM
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
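The practical consequence for aliases (a minimal sketch against the question's DATAFRAME table):
SELECT NUMBER_OF_PEOPLE AS n
FROM DATAFRAME
-- WHERE n > 10     -- fails: WHERE is processed before SELECT assigns the alias
ORDER BY n;         -- works: ORDER BY is processed after SELECT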
You can use a CTE:
WITH CTE1 AS (
SELECT * FROM DATAFRAME
WHERE DATE>=CURRENT_DATE-90
),
CTE2 AS (
SELECT
ID
,NUMBER_OF_PEOPLE
,LAG(NUMBER_OF_PEOPLE) OVER (ORDER BY DATE) AS PREV_NUMBER_OF_PEOPLE
,DATE
FROM CTE1
)
SELECT * FROM CTE2
WHERE NUMBER_OF_PEOPLE <> PREV_NUMBER_OF_PEOPLE
Just move the lag() into the derived table.
SELECT *
FROM (
SELECT id,
number_of_people,
lag(number_of_people) over (order by date) as prev_number_of_people,
date
FROM dataframe
WHERE date >= current_date - 90
) AS InnerQuery
WHERE number_of_people <> prev_number_of_people
ORDER BY date DESC

Trying to get the greatest value from a customer on a given day

What I need to do: if a customer makes more than one transaction in a day, I need to display the greatest value (and ignore any other values).
The query is pretty big, but the code I inserted below is the focus of the issue. I’m not getting the results I need. The subselect ideally should be reducing the number of rows the query generates since I don’t need all the transactions, just the greatest one, however my code isn’t cutting it. I’m getting the exact same number of rows with or without the subselect.
Note: I don't actually have a t.* in the actual query; there are just a dozen or so other fields being pulled in. I added the t.* just to simplify the code example.
SELECT
    t.*,
    (SELECT TOP (1) t1.Value
     FROM #temp t1
     WHERE t1.CustomerGUID = t.CustomerGUID
       AND t1.Date = t.Date
     ORDER BY t1.Value DESC) AS "Value"
FROM #temp t
Is there an obvious flaw in my code or is there a better way to achieve the result of getting the greatest value transaction per day per customer?
Thanks
You may want to do as follows:
SELECT
t1.CustomerGUID,
t1.Date,
MAX(t1.Value) AS Value
FROM #temp t1
GROUP BY
t1.CustomerGUID,
t1.Date
You can use ROW_NUMBER() as shown below.
SELECT *
FROM
(
    SELECT *, ROW_NUMBER() OVER (PARTITION BY CustomerGUID, Date ORDER BY Value DESC) AS SrNo
    FROM <YourTable>
) AS T
WHERE
    SrNo = 1
Sample data would be more helpful.
Try this window function:
MAX(value) OVER (PARTITION BY date, customer)
It's faster and more efficient. (The ORDER BY inside the OVER clause is unnecessary for MAX and only adds sort work.)
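To filter with it rather than just annotate each row, compare against the windowed max in a derived table (a sketch, assuming the question's #temp columns):
SELECT *
FROM (
    SELECT t.*,
           MAX(t.Value) OVER (PARTITION BY t.CustomerGUID, t.Date) AS MaxValue
    FROM #temp t
) x
WHERE Value = MaxValue;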
Probably many other ways to do it, but this one is simple and works
select t.*
from (
select
convert(varchar(8), r.date,112) one_day
,max(r.Value) max_sale
from #temp r
group by convert(varchar(8), r.date,112)
) e
inner join #temp t on t.value = e.max_sale and convert(varchar(8), t.date,112) = e.one_day
If you have 2 people who spend the exact same amount and that's also the max, you'll get 2 records for that day.
The convert(varchar(8), r.date, 112) will perform as desired on date, datetime and datetime2 data types. If your date is a varchar, char, nchar or nvarchar, you'll want to examine the data to find out whether to use left(t.date,10) or left(t.date,8).
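As a quick sanity check of style 112 (a hypothetical one-liner):
SELECT CONVERT(varchar(8), CAST('2022-08-01T10:30:00' AS datetime2), 112);  -- returns '20220801'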
If I've understood your requirement correctly, you have stated "greatest value transaction per day per customer". That suggests to me you don't want 1 row per customer in the output but a row per day per customer.
To achieve this you can group on the day like this (note that grouping on datepart(day, ...) alone would merge the 5th of January with the 5th of February, so cast to a date instead):
SELECT t.customerid, CONVERT(date, t.date) AS Daydate,
       MAX(t.value) AS value
FROM #temp t
GROUP BY t.customerid, CONVERT(date, t.date);

Group by every N records in T-SQL

I have some performance test results on the database, and what I want to do is to group every 1000 records (previously sorted in ascending order by date) and then aggregate results with AVG.
I'm actually looking for a standard SQL solution, however any T-SQL specific results are also appreciated.
The query looks like this:
SELECT TestId,Throughput FROM dbo.Results ORDER BY id
WITH T AS (
SELECT RANK() OVER (ORDER BY ID) Rank,
P.Field1, P.Field2, P.Value1, ...
FROM P
)
SELECT (Rank - 1) / 1000 GroupID, AVG(...)
FROM T
GROUP BY ((Rank - 1) / 1000)
;
Something like that should get you started. If you can provide your actual schema I can update as appropriate.
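Applied to the Results table from the question (a sketch assuming dbo.Results has the id and Throughput columns shown above):
WITH T AS (
    SELECT RANK() OVER (ORDER BY id) AS rk, Throughput
    FROM dbo.Results
)
SELECT (rk - 1) / 1000 AS GroupID, AVG(Throughput) AS AvgThroughput
FROM T
GROUP BY (rk - 1) / 1000;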
Credit goes to Yuck; I only post as an answer so I could include a code block. I did a count test to see if it was grouping by 1000 and the first set was 999; the version below produced set sizes of 1,000. Great query, Yuck.
WITH T AS (
SELECT RANK() OVER (ORDER BY sID) Rank, sID
FROM docSVsys
)
SELECT (Rank-1) / 1000 GroupID, count(sID)
FROM T
GROUP BY ((Rank-1) / 1000)
order by GroupID
I +1'd @Yuck, because I think that is a good answer. But it's worth mentioning NTILE().
Reason being, if you have 10,010 records (for example), then you'll have 11 groupings -- the first 10 with 1000 in them, and the last with just 10.
If you're comparing averages between each group of 1000, then you should either discard the last group as it's not a representative group, or...you could make all the groups the same size.
NTILE() would make all groups the same size; the only caveat is that you'd need to know how many groups you wanted.
So if your table had 25,250 records, you'd use NTILE(25), and your groupings would be approximately 1000 in size -- they'd actually be 1010 in size; the benefit being, they'd all be the same size, which might make them more relevant to each other in terms of whatever comparison analysis you're doing.
You could get your group size simply by:
DECLARE @ntile int
SET @ntile = (SELECT count(1) FROM myTable) / 1000
And then modifying @Yuck's approach with the NTILE() substitution:
;WITH myCTE AS (
SELECT NTILE(@ntile) OVER (ORDER BY id) myGroup,
col1, col2, ...
FROM dbo.myTable
)
SELECT myGroup, col1, col2...
FROM myCTE
GROUP BY (myGroup), col1, col2...
;
The answer above does not actually assign a unique group id to each 1000 records in engines where / performs floating-point division (this answer uses BigQuery syntax; T-SQL integer division already truncates). Adding Floor() is needed there. The following will return all records from your table, with a unique GroupID for each 1000 rows:
WITH T AS (
SELECT RANK() OVER (ORDER BY your_field) Rank,
your_field
FROM your_table
WHERE your_field = 'your_criteria'
)
SELECT Floor((Rank-1) / 1000) GroupID, your_field
FROM T
And for my needs, I wanted my GroupID to be a random set of characters, so I changed the Floor(...) GroupID to:
TO_HEX(SHA256(CONCAT(CAST(Floor((Rank-1) / 10) AS STRING),'seed1'))) GroupID
Without the seed value, you and I would get the exact same output, because we're just doing a SHA256 on the numbers 1, 2, 3, etc. Adding the seed makes the output unique but still repeatable.
This is BigQuery syntax. T-SQL might be slightly different.
Lastly, if you want to leave off the last chunk that is not a full 1000, you can find it by doing:
WITH T AS (
SELECT RANK() OVER (ORDER BY your_field) Rank,
your_field
FROM your_table
WHERE your_field = 'your_criteria'
)
SELECT Floor((Rank-1) / 1000) GroupID, your_field
, COUNT(*) OVER(PARTITION BY TO_HEX(SHA256(CONCAT(CAST(Floor((Rank-1) / 1000) AS STRING),'seed1')))) AS CountInGroup
FROM T
ORDER BY CountInGroup
You can also use Row_Number() instead of Rank(). No Floor() required.
declare @groupsize int = 50
;with ct1 as ( select YourColumn, RowID = Row_Number() over(order by YourColumn)
from YourTable
)
select YourColumn, RowID, GroupID = (RowID-1)/@groupsize + 1
from ct1
I read more about NTILE after reading @user15481328's answer
(resource: https://www.sqlservertutorial.net/sql-server-window-functions/sql-server-ntile-function/ )
and this solution allowed me to find the max date within each of the 25 groups of my data set:
with cte as (
select date,
NTILE(25) OVER ( order by date ) bucket_num
from mybigdataset
)
select max(date), bucket_num
from cte
group by bucket_num
order by bucket_num

Remove ORDER BY clause from PARTITION BY clause?

Is there a way I can reduce the impact of the 'ORDER BY lro_pid' clause in the OVER portion of the inner query below?
SELECT *
FROM (SELECT a.*,
Row_Number() over (PARTITION BY search_point_type
ORDER BY lro_pid) spt_rank
FROM lro_search_point a
ORDER BY spt_rank)
WHERE spt_rank = 1;
I don't care to order this result within the partition since I want to order it by a different variable entirely. lro_pid is an indexed column, but this still seems like a waste of resources as it currently stands. (Perhaps there is a way to limit the ordering to a range of a single row?? Hopefully no time/energy would be spent on sorting within the partition at all)
A couple of things to try:
Can you e.g. ORDER BY 'constant' in the OVER clause?
If ordering by a constant is not permitted, how about ORDER BY (lro_pid * 0)?
I'm not an Oracle expert (MSSQL is more my thing) - hence questions to answer your question!
Using a constant in the analytic ORDER BY as @Will A suggested appears to be the fastest method.
The optimizer still performs a sort, but it's faster than sorting a column.
Also, you probably want to remove the second ORDER BY, or at least move it to the outer query.
Below is my test case:
--Create table, index, and dummy data.
create table lro_search_point(search_point_type number, lro_pid number, column1 number
,column2 number, column3 number);
create index lro_search_point_idx on lro_search_point(lro_pid);
insert /*+ append */ into lro_search_point
select mod(level, 10), level, level, level, level from dual connect by level <= 100000;
commit;
--Original version. Averages 0.53 seconds.
SELECT * FROM
(
SELECT a.*, Row_Number() over (PARTITION BY search_point_type ORDER BY lro_pid) spt_rank
FROM lro_search_point a
ORDER BY spt_rank
)
WHERE spt_rank=1;
--Sort by constant. Averages 0.33 seconds.
--This query and the one above have the same explain plan, basically it's
--SELECT/VIEW/SORT ORDER BY/WINDOW SORT PUSHED RANK/TABLE ACCESS FULL.
SELECT * FROM
(
SELECT a.*, Row_Number() over (PARTITION BY search_point_type ORDER BY -1) spt_rank
FROM lro_search_point a
ORDER BY spt_rank
)
WHERE spt_rank=1;
--Remove the ORDER BY (or at least move it to the outer query). Averages 0.27 seconds.
SELECT * FROM
(
SELECT a.*, Row_Number() over (PARTITION BY search_point_type ORDER BY -1) spt_rank
FROM lro_search_point a
)
WHERE spt_rank=1;
--Replace analytic with aggregate functions, averages 0.28 seconds.
--This idea is the whole reason I did this, but turns out it's no faster. *sigh*
--Plan is SELECT/SORT GROUP BY/TABLE ACCESS FULL.
--Note I'm using KEEP instead of just regular MIN.
--I assume that you want the values from the same row.
SELECT a.search_point_type
,min(lro_pid) keep (dense_rank first order by -1)
,min(column1) keep (dense_rank first order by -1)
,min(column2) keep (dense_rank first order by -1)
,min(column3) keep (dense_rank first order by -1)
FROM lro_search_point a
group by a.search_point_type;
You can't drop the ORDER BY entirely (it is mandatory for ROW_NUMBER), but you could use ORDER BY rownum as a cheap substitute.
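A sketch of how that would look in the inner query (assuming the same lro_search_point table as above):
SELECT a.*, ROW_NUMBER() OVER (PARTITION BY search_point_type ORDER BY rownum) spt_rank
FROM lro_search_point a;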