Set-based alternative to loop in SQL Server

Set-based alternative to loop in SQL Server - sql

I know that there are several posts about how BAD it is to try to loop in SQL Server in a stored procedure. But I haven't quite found what I am trying to do. We are using data connectivity that can be linked internally directly into excel.
I have seen some posts where a few people have said they could convert most loops to a standard query. But for the life of me I am having trouble with this one.
I need all custIDs who have orders right before an event of type 38,40. But only get them if there is no other order between the event and the order in the first query.
So there are 3 parts. I first query for all orders (orders table) based on a time frame into a temporary table.
Select into temp1 odate, custId from orders where odate>'5/1/12'
Then I could use the temp table to inner join on the secondary table to get a customer event (LogEvent table) that may have occurred some time in the past prior to the current order.
Select into temp2 eventdate, temp1.custID from LogEvent inner join temp1 on
temp1.custID=LogEvent.custID where EventType in (38,40) and temp1.odate>eventdate
order by eventdate desc
The problem here is that the queries I am trying to run will return all rows for each of the customers from the first query where I only want the latest for each customer. So this is where on the client side I would loop to only get one Event instead of all the old ones. But as all the query has to run inside of Excel I can't really loop client side.
The third step then could use the results from the second query to make check if the event occurred between most current order and any previous order. I only want the data where the event precedes the order and no other orders are in between.
Select ordernum, shopcart.custID from shopcart right outer join temp2 on
shopcart.custID=temp2.custID where shopcart.odate >= temp2.eventdate and
ordernum is null
Is there a way to simplify this and make it set-based to run in SQL Server instead of some kind of loop that I is perform at the client?

THis is a great example of switching to set-based notation.
First, I combined all three of your queries into a single query. In general, having a single query let's the query optimizer do what it does best -- determine execution paths. It also prevents accidental serialization of queries on a multithreaded/multiprocessor machine.
The key is row_number() for ordering the events so the most recent has a value of 1. You'll see this in the final WHERE clause.
select ordernum, shopcart.custID
from (Select eventdate, temp1.custID,
row_number() over (partition by temp1.CustID order by EventDate desc) as seqnum
from LogEvent inner join
(Select odate, custId
from order
where odate>'5/1/12'
) temp1
on temp1.custID=LogEvent.custID
where EventType in (38,40) and temp1.odate>eventdate order by eventdate desc
) temp2 left outer join
ShopCart
on shopcart.custID=temp2.custID
where seqnum = 1 and shopcart.odate >= temp2.eventdate and ordernum is null
I kept your naming conventions, even though I think "from order" should generate a syntax error. Even if it doesn't it is bad practice to name tables and columns with reserved SQL words.

If you are using a newer version of sql server, then you can use the ROW_NUMBER function. I will write an example shortly.
;WITH myCTE AS
(
SELECT
eventdate, temp1.custID,
ROW_NUMBER() OVER (PARTITION BY temp1.custID ORDER BY eventdate desc) AS CustomerRanking
FROM LogEvent
JOIN temp1
ON temp1.custID=LogEvent.custID
WHERE EventType IN (38,40) AND temp1.odate>eventdate
)
SELECT * into temp2 from myCTE WHERE CustomerRanking = 1;
This gets you the most recent event for each customer without a loop.
Also, you could use RANK, however that will create duplicates for ties, whereas ROW_NUMBER will guarantee no duplicate numbers for your partition.

Related

Get most recent record from Right table with sub query

When I join to the right table I am getting way too many duplicates. I am trying to grab the most recent record from the right table however, it does not matter what I try it does not work.
So Far I have tried:
PROC SQL;
CREATE TABLE fs1.sample AS
SELECT A.*,
B.xx1,
max(B.time_s)
FROM lx1.results a left join (Select Distinct C.id, c.per FROM lx2.results c
Where c.id = a.id
and COMPGED(a.txt1, c.txt1,'i') < 100
and c.dt > a.dt
and c.ksv = 37
and datepart(c.lsg) >= '12DEC2020'd ) b
ON a.id = b.id
group by a.id, a.txt1
QUIT;
Unfortunately, I get an error. I also tried using case when exists, but that takes way too long. Essentially I am trying to grab the most recent record from the right table based on time_s. I also want to make sure the record I grab from the right table somewhat matches a.txt1.
Cheers

When you perform a join, you attach all records from the table that match your join conditions.
If the table is indexed appropriately, a subquery could achieve the goal of obtaining the most recent value, however, if the query uses the wrong index, TOP or equivalent functions may return the wrong result.
There are a number of ways to accomplish the task of retrieving the most recent record but they are contingent on a couple of things.
Firstly, you need to be able to identify what the most recent row is, usually by a column called CreatedDate or something similar against the IDs. (You should know what that business logic is, it may be that the table is chronologically entered [as most tables are] and therefore, SubID might be a thing. We're going to assume it is CreatedDate.)
Secondly, you need to rank the rows in terms of the CreatedDate in a descending order so that the newest matching ID is ranked 1.
Finally, you filter your results by 1 to return the newest result, but you could also filter by <= x if you are interested in the top x newest return results per ID.
To use more mathematical language: We are deriving a value from the CreatedDate and ID values and then using that derivative value to sort and filter the data. In this case we are deriving the RowNumber from the CreatedDate in descending order for each ID.
In order to accomplish this, you can use the Windowed Function ROW_NUMBER(),
ROW_NUMBER() OVER (PARTITION BY id ORDER BY CreatedDate DESC) as RankID
This windowed function will return a row value for each ID relative to the CreatedDate in descending order, where the newest created date is equal to 1.
You can then put brackets around the whole query to make it into a table so you will be able to filter the results of that Windowed Function.
SELECT id, txt
(SELECT id, txt
,ROW_NUMBER() OVER (PARTITION BY id ORDER BY CreatedDate DESC) as RankID
FROM SourceTable) A
WHERE RankID = 1
This should achieve your goal of returning the "newest result".
What ever your column is that determines the age of the data relative to the ID, it can be multiple, should be placed within the ORDER BY.
In order to make this query perform faster, you should index your data appropriately, whereby ID is the the first column, and CreatedDate Desc is your next column. This means your system will not have to perform a costly sort every time this runs, but that depends on whether you plan on using this query often and how much overhead it is grabbing.

SQL Query Deduplication / Join Issue

I've been having the worst time trying to write what I feel should be a pretty simple query to deal with duplicate entries.
For context: I've created a data warehouse using Big Query and am using Stitch to pull data from Hubspot. Everything works as expected as in: I have confirmed that I have the right number of records in BigQuery.
The issue comes into how Stitch refreshes data. Instead of updating records based on object id, it appends a new row. According to their documentation, the query below should work, but it doesn't for the simple reason that there exist multiple versions of a given record with the same _sdc_sequence (which I don't think should exist). There are other _sdc (stitch system fields) that I can use to help, but it's also not completely reliable for the same reasons as above.
SELECT DISTINCT o.*
FROM [sample-table:hubspot.companies] o
INNER JOIN (
SELECT
MAX(_sdc_sequence) AS seq,
id
FROM [sample-table:hubspot.companies]
GROUP BY companyid ) oo
ON o.companyid = oo.companyid
AND o._sdc_sequence = oo.seq
The query above returns fewer results than it should. If I run the following query, I get the right number of results, but I need the other fields besides companyid like name, description, revenue, etc.
SELECT o.companyid
FROM [samples_table:hubspot.companies] o
GROUP BY o.companyid
I was trying something like this, but it doesn't work (I'm getting the following error (Expression 'oo.properties.name.value' is not present in the GROUP BY list).
SELECT o.companyid,
oo.properties.name.value,
oo.properties.hubspot_owner_id.value,
oo.properties.description.value
FROM [sample_table:hubspot.companies] o
LEFT JOIN [sample_table:hubspot.companies] oo
ON o.companyid = oo.companyid
GROUP BY o.companyid
I'm my mind, the way that I'm thinking about this is:
Get list of unique records id (companyid)
Do a SQL "vlookup equivalent" of the raw, ungrouped company table that is sorted by insert time to get the first record that matches the id (which will be the most recent since the table is sorted)
I just don't know how to write this...

Try using window functions:
#standardSQL
SELECT c.*
FROM (SELECT c.*,
ROW_NUMBER() OVER (PARTITION BY companyid ORDER BY _sdc_sequence DESC) as seqnum
FROM `sample-table.hubspot.companies` c
) c
WHERE seqnum = 1;

Below is for BigQuery Standard SQL
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY _sdc_sequence DESC LIMIT 1)[OFFSET(0)]
FROM `sample-table.hubspot.companies` t
GROUP BY companyid

Retrieving last record in each group from database with additional max() condition in MSSQL

This is a follow-up question to Retrieving last record in each group from database - SQL Server 2005/2008
In the answers, this example was provided to retrieve last record for a group of parameters (example below retrieves last updates for each value in computername):
select t.*
from t
where t.lastupdate = (select max(t2.lastupdate)
from t t2
where t2.computername = t.computername
);
In my case, however, "lastupdate" is not unique (some updates come in batches and have same lastupdate value, and if two updates of "computername" come in the same batch, you will get non-unique output for "computername + lastupdate").
Suppose I also have field "rowId" that is just auto-incremental. The mitigation would be to include in the query another criterion for a max('rowId') field.
NB: while the example employs time-specific name "lastupdate", the actual selection criteria may not be related to the time at all.
I, therefore, like to ask, what would be the most performant query that selects the last record in each group based both on "group-defining parameter" (in the case above, "computername") and on maximal rowId?

If you don't have uniqueness, then row_number() is simpler:
select t.*
from (select t.*,
row_number() over (partition by computername order by lastupdate, rowid desc) as seqnum
from t
) t
where seqnum = 1;
With the right indexes, the correlated subquery is usually faster. However, the performance difference is not that great.

Oracle subquery in select

I have a table that keeps costs of products. I'd like to get the average cost AND last buying invoice for each product.
My solution was creating a sub-select to get last buying invoice but unfortunately I'm getting
ORA-00904: "B"."CODPROD": invalid identifier
My query is
SELECT (b.cod_aux) product,
-- here goes code to get average cost,
(SELECT round(valorultent, 2)
FROM (SELECT valorultent
FROM pchistest
WHERE codprod = b.codprod
ORDER BY dtultent DESC)
WHERE ROWNUM = 1)
FROM pchistest a, pcembalagem b
WHERE a.codprod = b.codprod
GROUP BY a.codprod, b.cod_aux
ORDER BY b.cod_aux
In short what I'm doing on sub-select is ordering descendantly and getting the first row given the product b.codprod

Your problem is that you can't use your aliased columns deeper than one sub-query. According to the comments, this was changed in 12C, but I haven't had a chance to try it as the data warehouse that I use is still on 11g.
I would use something like this:
SELECT b.cod_aux AS product
,ROUND (r.valorultent, 2) AS valorultent
FROM pchistest a
JOIN pcembalagem b ON (a.codprod = b.codprod)
JOIN (SELECT valorultent
,codprod
,ROW_NUMBER() OVER (PARTITION BY codprod
ORDER BY dtultent DESC)
AS row_no
FROM pchistest) r ON (r.row_no = 1 AND r.codprod = b.codprod)
GROUP BY a.codprod, b.cod_aux
ORDER BY b.cod_aux
I avoid sub-queries in SELECT statements. Most of the time, the optimizer wants to run a SELECT for each item in the cursor, OR it does some crazy nested loops. If you do it as a sub-query in the JOIN, Oracle will normally process the rows that you are joining; normally, it is more efficient. Finally, complete your per item functions (in this case, the ROUND) in the final product. This will prevent Oracle from doing it on ALL rows, not just the ones you use. It should do it correctly, but it can get confused on complex queries.
The ROW_NUMBER() OVER (PARTITION BY ..) is where the magic happens. This adds a row number to each group of CODPRODs. This allows you to pluck the top row from each CODPROD, so this allows you to get the newest/oldest/greatest/least/etc from your sub-query. It is also great for filtering duplicates.

Find the Max and related fields

Here is my (simplified) problem, very common I guess:
create table sample (client, recordDate, amount)
I want to find out the latest recording, for each client, with recordDate and amount.
I made the below code, which works, but I wonder if there is any better pattern or Oracle tweaks to improve the efficiency of such SELECT. (I am not allowed to modify to the structure of the database, so indexes etc are out of reach for me, and out of scope for the question).
select client, recordDate, Amount
from sample s
inner join (select client, max(recordDate) lastDate
from sample
group by client) t on s.id = t.id and s.recordDate = t.lastDate
The table has half a million records and the select takes 2-4 secs, which is acceptable but I am curious to see if that can be improved.
Thanks

In most cases Windowed Aggregate Functions might perform better (at least it's easier to write):
select client, recordDate, Amount
from
(
select client, recordDate, Amount,
rank() over (partition by client order by recordDate desc) as rn
from sample s
) dt
where rn = 1

Another structure for the query is not exists. This can perform faster under some circumstances:
select client, recordDate, Amount
from sample s
where not exists (select 1
from sample s2
where s2.client = s.client and
s2.recordDate > s.recordDate
);
This would take good advantage of an index on sample(client, recordDate), if one were available.
And, another thing to try is keep:
select client, max(recordDate),
max(Amount) keep (dense_rank first order by recordDate desc)
from sample s
group by client;
This version assumes only one max record date per client (your original query does not make that assumption).
These queries (plus the one by dnoeth) should all have different query plans and you might get lucky on one of them. The best solution, though, is to have the appropriate index.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas