Using Rank vs Max (T-SQL)

I'm currently troubleshooting an old job whose query takes a long time to run. The old job uses the first query below, but I have been testing with the second.
Differences between:
Select Max(Cl1) as Tab,
Max(Cl2) as Tb,
Customer
From TableA
group by Customer
vs
Select Customer,
Tab,
tb
From
(Select Customer,
Tab,
tb,
Rank() over (partition by Customer order by Cl1 desc) rk1,
Rank() over (partition by Customer order by Cl2 desc) rk2
From TableA) X
Where X.rk1 = 1 and X.rk2 = 1
Tab Tb Customer
A45845 100052 Shin
A45845 100053 Shin
A45845 100054 Reek
The table will always have value (no nulls or blank value) for both Tab and Tb columns. Tab is not unique to a particular customer. Tb is a sequential and continuously increasing integer with no duplicates possible (unique). The latest Tab value for a customer will also have the most recent Tb as well.
Though the results are the same, is there something I may not be considering when changing the query in this case?
Edit: Fixed errors in the second query that I introduced when building the example without using the real column or table names. Also expanded on the scenario. My apologies for the updated info and fix to the original post; I was called away before I even had a chance to double-check it.

Seriously doubt Rank is going to be faster.
Where you would need rank is if you wanted the value of CL2 on the row where CL1 is max.
Do you have indexes on Customer, CL1, and CL2?
Check fragmentation.
Check the execution plans.
And no way those are returning the same results.
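If indexes are missing, a covering index is the usual first step for either form of the query. A sketch only - the index name and the choice of INCLUDE columns are illustrative, not from the post:

```sql
-- Hypothetical covering index so the GROUP BY / window scan
-- can be satisfied from the index alone (name is illustrative)
CREATE INDEX IX_TableA_Customer
    ON TableA (Customer)
    INCLUDE (Cl1, Cl2);
```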

This would be the equivalent query, but I doubt it would be more efficient. @Damien_The_Unbeliever please let me know if I'm wrong again. (DISTINCT added for the scenario where there are multiple Cl1 and Cl2 rows with the same value. This can be removed if the primary key is across Customer, Cl1 and Cl2.)
Select Distinct Customer,
X.Cl1 as Tab,
X2.Cl2 as tb
From (Select Customer,
Rank() over (partition by Customer order by Cl1 desc) rk1,
Cl1
From TableA) X
Join (Select Customer,
Rank() over (partition by Customer order by Cl2 desc) rk2,
Cl2
From TableA) X2
On X.Customer = X2.Customer
Where X.rk1 = 1
And X2.rk2 = 1

The first query returns values for Tab and Tb, the second always returns 1 and 1 for the two columns (because of the where clause).
The second will only return customers where there is a row that is the first by both CL1 and Cl2. The first returns a row for all customers.
The second will return duplicates, when multiple rows satisfy the two ordering conditions for a given customer. The first returns only one row per customer.
The second has a syntax error, so it will not run (the comma after rk2). The first seems to be valid SQL syntax.
The first seems simpler and more understandable. However, I think my first point is the biggest difference. (I'm ignoring the fact that the columns are in different orders.)
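To illustrate the first point with the question's own sample rows, here is a quick self-contained T-SQL sketch:

```sql
-- Reproduce the question's sample data in a table variable
DECLARE @TableA TABLE (Customer varchar(10), Cl1 varchar(10), Cl2 int);
INSERT INTO @TableA VALUES
    ('Shin', 'A45845', 100052),
    ('Shin', 'A45845', 100053),
    ('Reek', 'A45845', 100054);

-- The MAX/GROUP BY form: exactly one row per customer
SELECT MAX(Cl1) AS Tab, MAX(Cl2) AS Tb, Customer
FROM @TableA
GROUP BY Customer;
-- Returns: (A45845, 100053, Shin) and (A45845, 100054, Reek)
```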

Related

How to work past "At most one record can be returned by this subquery"

I'm having trouble understanding this error despite all the research I have done. I have the following query
SELECT M.[PO Concatenate], Sum(M.SumofAward) AS TotalAward, (SELECT TOP 1 M1.[Material Group] FROM
[MGETCpreMG] AS M1 WHERE M1.[PO Concatenate]=M.[PO Concatenate] ORDER BY M1.SumofAward DESC) AS TopGroup
FROM MGETCpreMG AS M
GROUP BY M.[PO Concatenate];
For a brief instant it displays the results I want, but then the "At most one record can be returned by this subquery" error comes and wipes all the data to #Name?
For context, [MGETCpreMG] is a query off a main table [MG ETC] that was used to consolidate Award for differing Material Groups on a PO transaction ([PO Concatenate])
SELECT [MG ETC].[PO Concatenate], Sum([MG ETC].Award) AS SumOfAward, [MG ETC].[Material Group]
FROM [MG ETC]
GROUP BY [MG ETC].[PO Concatenate], [MG ETC].[Material Group]
ORDER BY [MG ETC].[PO Concatenate];
I'm thinking it lies in my inability to understand how to utilize a subquery.
In the case where the query can return more than one value? Simply add an additional ORDER BY column.
So, a common sub query might be to get the last invoice. So you might have:
select ID, CompanyName,
(SELECT TOP 1 InvoiceDate from tblInvoice
where tblInvoice.CustomerID = tblCompany.ID
Order by InvoiceDate DESC)
As LastInvoiceDate
From tblCustomers
Now the above might work for some time, but then it will blow up since you might have two invoices for the same day!
So, all you have to do is add that extra order by clause - say on the PK of the child table like this:
Order by InvoiceDate DESC, ID DESC)
So top 1 will respect the "additional" order columns you add, and thus only ever return one row - even if there are multiple values that match the top 1 column.
I suppose in the above we could perhaps forget the InvoiceDate and always take the topmost last autonumber ID, but for a lot of queries you can't always be sure - it might be that we want the last most expensive invoice amount. And again, if the max (top) value were the same for two large invoice amounts, then two rows could be returned. So, simply add an extra ORDER BY clause with a 2nd column that further orders the data, and thus TOP 1 will only ever pull the first value. Your top group is such an example. Just tack on the extra ORDER BY "ID" or whatever the autonumber ID column is.
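Applied to the query from the question, the tie-breaker would look like this (a sketch; [Material Group] is used as the second sort column on the assumption that it distinguishes the tied rows):

```sql
SELECT M.[PO Concatenate], Sum(M.SumofAward) AS TotalAward,
       (SELECT TOP 1 M1.[Material Group]
        FROM [MGETCpreMG] AS M1
        WHERE M1.[PO Concatenate] = M.[PO Concatenate]
        ORDER BY M1.SumofAward DESC,
                 M1.[Material Group]) AS TopGroup  -- tie-breaker added
FROM MGETCpreMG AS M
GROUP BY M.[PO Concatenate];
```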

Selecting rows from other tables based on the first table using SQL

I have three T-SQL statements that I'd like to combine into one, so it is just a single call to the database, not three.
SELECT * FROM Clients
The first one, selects every client from the Clients table.
SELECT * FROM History
The second one, selects all the history entries from the History table. I then use some code to find the first history for each client. i.e. first history in the table for ClientID gets set into the HasHistory column for that ClientID.
SELECT * FROM Actions
The final one, I get all the actions from the action table. I then use some code to find the last action for each client. i.e. last action in the table for ClientID gets set into the LastAction column for that ClientID.
So I'm wondering if there is a way to write an SQL statement like this for example? Note this is not real SQL, just pseudo code to illustrate what I'm trying to achieve.
SELECT *
FROM Clients
AND
SELECT First History Row
FROM History
WHERE History.ClientID = Clients.ClientID
AND
SELECT Last Action Row
FROM Actions
WHERE Actions.ClientID = Clients.ClientID
There are a number of ways you can do this, but here is one example. I'll work on it a bit at a time to explain what we are doing. You haven't shown us the table design, so the column names are a guess, but you should get the idea.
First, you have to somehow mark which history rows you care about. One way to do this is to do a query that puts an order number on every history row, that starts from 1 with every new client, and orders them by date. This way, the first history row for each client (the one you want) always has a row number of one. This would look something like
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY clientID ORDER BY historyDate) AS orderNo
FROM
History
You would do something similar with actions, except you want the latest action, not the first one, so your order by column has to be in reverse order - you do this by telling the ORDER BY to use descending order, something like this
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY clientID ORDER BY actionDate DESC) AS orderNo
FROM Actions
You should now have two queries where the only rows you want are marked with an order number of one. What you do now is start with your first query, and join to these other two queries so that you only join to the orderNo = 1 rows. Then all the data you want will be available in one row. You have to decide which join type to use - an inner join will only return Clients that actually have a history and an action. If you want to see clients that have no rows at all in the other tables, you need to use a left outer join. But your final query (you only need this one) will look something like
SELECT
C.*, H.*, A.*
FROM
Clients C
LEFT OUTER JOIN
(SELECT
*,
ROW_NUMBER() OVER (PARTITION BY clientID ORDER BY historyDate) AS orderNo
FROM History) H ON H.clientID = C.clientID AND H.orderNo = 1
LEFT OUTER JOIN
(SELECT
*,
ROW_NUMBER() OVER (PARTITION BY clientID ORDER BY actionDate DESC) AS orderNo
FROM Actions) A ON A.clientID = C.clientID AND A.orderNo = 1
What this says is: take Clients (which we'll call C), then for each row, try and join to (match a row from) the History query we looked at above (which we'll call H) where the client ID is the same and the orderNo is 1 - ie the first history row. It also does the same for the Actions query.
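Since the question mentions T-SQL, OUTER APPLY with TOP 1 is a common alternative sketch for the same "first/last row per client" pattern (column names are the same guesses as above):

```sql
SELECT C.*, H.*, A.*
FROM Clients C
OUTER APPLY (SELECT TOP 1 *
             FROM History h
             WHERE h.clientID = C.clientID
             ORDER BY h.historyDate ASC) H    -- first history row per client
OUTER APPLY (SELECT TOP 1 *
             FROM Actions a
             WHERE a.clientID = C.clientID
             ORDER BY a.actionDate DESC) A;   -- last action row per client
```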

How to get the most frequent value SQL

I have a table Orders(id_trip, id_order), a table Trip(id_hotel, id_bus, id_type_of_trip) and a table Hotel(id_hotel, name).
I would like to get the name of the most frequent hotel in the Orders table.
SELECT name
FROM (SELECT t.name, rank() over (order by t.cnt desc) rnk
      FROM (SELECT hotel.name, count(*) cnt
            FROM Orders
            JOIN Trip
            on Orders.id_trip = Trip.id_hotel
            JOIN hotel
            on trip.id_hotel = hotel.id_hotel
            GROUP BY hotel.name) t)
WHERE rnk = 1;
The "most frequently occurring value" in a distribution is a distinct concept in statistics, with a technical name. It's called the MODE of the distribution. And Oracle has the STATS_MODE() function for it. https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions154.htm
For example, using the EMP table in the standard SCOTT schema, select stats_mode(deptno) from scott.emp will return 30 - the number of the department with the most employees. (30 is the department "name" or number, it is NOT the number of employees in that department!)
In your case:
select stats_mode(h.name) from (the rest of your query)
Note: if two or more hotels are tied for "most frequent", then STATS_MODE() will return one of them (non-deterministic). If you need all the tied values, you will need a different solution - a good example is in the documentation (linked above). This is a documented flaw in Oracle's understanding and implementation of the statistical concept.
Use FIRST for a single result:
SELECT MAX(hotel.name) KEEP (DENSE_RANK FIRST ORDER BY cnt DESC)
FROM (
SELECT hotel.name, COUNT(*) cnt
FROM orders
JOIN trip USING (id_trip)
JOIN hotel USING (id_hotel)
GROUP BY hotel.name
) t
Here is one method:
select name
from (select h.name,
row_number() over (order by count(*) desc) as seqnum -- use `rank()` if you want duplicates
from orders o join
trip t
on o.id_trip = t.id_trip join -- this seems like the right join condition
hotels h
on t.id_hotel = h.id_hotel
) oth
where seqnum = 1;
Getting the most recent statistical mode out of a data sample
I know it's been more than a year, but here's my answer. I came across this question hoping to find a simpler solution than what I know, but alas, nope.
I had a similar situation where I needed to get the mode from a data sample, with the requirement to get the mode of the most recently inserted value if there were multiple modes.
In such a case neither the STATS_MODE nor the LAST aggregate functions would do (as they would tend to return the first mode found, not necessarily the mode with the most recent entries.)
In my case it was easy to use the ROWNUM pseudo-column because the tables in question were performance metric tables that only experienced inserts (not updates).
In this oversimplified example, I'm using ROWNUM - it could easily be changed to a timestamp or sequence field if you have one.
SELECT ID
FROM
(SELECT ID,
COUNT( * ) CNT,
MAX( R ) R
FROM
( SELECT ID, ROWNUM R FROM FOO
)
GROUP BY ID
ORDER BY CNT DESC,
R DESC
)
WHERE
(
ROWNUM < 2
);
That is, get the total count and max ROWNUM for each value (I'm assuming the values are discrete. If they aren't, this ain't gonna work.)
Then sort so that the ones with largest counts come first, and for those with the same count, the one with the largest ROWNUM (indicating most recent insertion in my case).
Then skim off the top row.
Your specific data model should have a way to discern the most recent (or the oldest or whatever) rows inserted in your table, and if there are collisions, then there's not much of a way other than using ROWNUM or getting a random sample of size 1.
If this doesn't work for your specific case, you'll have to create your own custom aggregator.
Now, if you don't care which mode Oracle is going to pick (your business case just requires a mode and that's it), then STATS_MODE will do fine.

Select query with join in huge table taking over 7 hours

Our system is facing performance issues selecting rows out of a 38 million rows table.
This table with 38 million rows stores information from clients/suppliers etc. These appear across many other tables, such as Invoices.
The main problem is that our database is far from normalized. The Clients_Suppliers table has a composite key made of 3 columns, the Code - varchar2(16), Category - char(2) and the last one is up_date, a date. Every change in one client's address is stored in that same table with a new date. So we can have records such as this:
code ca up_date
---------------- -- --------
1234567890123456 CL 01/01/09
1234567890123456 CL 01/01/10
1234567890123456 CL 01/01/11
1234567890123456 CL 01/01/12
6543210987654321 SU 01/01/10
6543210987654321 SU 08/03/11
Worse, in every table that uses a client's information, instead of the full composite key, only the code and category are stored. Invoices, for instance, has its own keys, including the emission date. So we can have something like this:
invoice_no serial_no emission code ca
---------- --------- -------- ---------------- --
1234567890 12345 05/02/12 1234567890123456 CL
My specific problem is that I have to generate a list of clients for which invoices were created in a given period. Since I have to get the most recent info on the clients, I have to use max(up_date).
So here's my query (in Oracle):
SELECT
CL.CODE,
CL.CATEGORY,
-- other address fields
FROM
CLIENTS_SUPPLIERS CL,
INVOICES I
WHERE
CL.CODE = I.CODE AND
CL.CATEGORY = I.CATEGORY AND
CL.UP_DATE =
(SELECT
MAX(CL2.UP_DATE)
FROM
CLIENTS_SUPPLIERS CL2
WHERE
CL2.CODE = I.CODE AND
CL2.CATEGORY = I.CATEGORY AND
CL2.UP_DATE <= I.EMISSION
) AND
I.EMISSION BETWEEN DATE1 AND DATE2
It takes up to seven hours to select 178,000 rows. Invoices has 300,000 rows between DATE1 and DATE2.
It's a (very, very, very) bad design, and I've raised the fact that we should improve it by normalizing the tables. That would involve creating a table for clients with a new int primary key for each pair of code/category, and another one for Addresses (with the client primary key as a foreign key), then using the Addresses' primary key in each table that relates to clients.
But it would mean changing the whole system, so my suggestion has been shunned. I need to find a different way of improving performance (apparently using only SQL).
I've tried indexes, views, temporary tables but none have had any significant improvement on performance. I'm out of ideas, does anyone have a solution for this?
Thanks in advance!
What does the DBA have to say?
Has he/she tried:
Coalescing the tablespaces
Increasing the parallel query slaves
Moving indexes to a separate tablespace on a separate physical disk
Gathering stats on the relevant tables/indexes
Running an explain plan
Running the query through the index optimiser
I'm not saying the SQL is perfect, but if performance is degrading over time, the DBA really needs to be having a look at it.
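A couple of items on that checklist can be sketched in Oracle like this (the statement inside EXPLAIN PLAN is a placeholder for the slow query, and the schema/table names are taken from the question):

```sql
-- Gather fresh optimizer statistics on the big table
EXEC DBMS_STATS.GATHER_TABLE_STATS(USER, 'CLIENTS_SUPPLIERS');

-- Capture and display an execution plan for the slow query
EXPLAIN PLAN FOR
SELECT /* ... the slow query from the question goes here ... */ * FROM DUAL;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```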
SELECT
CL2.CODE,
CL2.CATEGORY,
... other fields
FROM
CLIENTS_SUPPLIERS CL2 INNER JOIN (
SELECT DISTINCT
CL.CODE,
CL.CATEGORY,
I.EMISSION
FROM
CLIENTS_SUPPLIERS CL INNER JOIN INVOICES I ON CL.CODE = I.CODE AND CL.CATEGORY = I.CATEGORY
WHERE
I.EMISSION BETWEEN DATE1 AND DATE2) CL3 ON CL2.CODE = CL3.CODE AND CL2.CATEGORY = CL3.CATEGORY
WHERE
CL2.UP_DATE <= CL3.EMISSION
GROUP BY
CL2.CODE,
CL2.CATEGORY
HAVING
CL2.UP_DATE = MAX(CL2.UP_DATE)
The idea is to separate the process: first we tell Oracle to give us the list of clients for which there are invoices in the period you want, and then we get the last version of them. In your version there's a check against MAX 38,000,000 times, which I really think is what cost most of the time spent in the query.
However, I'm not asking about indexes, assuming they are correctly set up...
Assuming that the number of rows for a (code,ca) is smallish, I would try to force an index scan per invoice with an inline view, such as:
SELECT invoice_id,
(SELECT MAX(rowid) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
FROM clients_suppliers c
WHERE c.code = i.code
AND c.category = i.category
AND c.up_date < i.invoice_date)
FROM invoices i
WHERE i.invoice_date BETWEEN :p1 AND :p2
You would then join this query to CLIENTS_SUPPLIERS hopefully triggering a join via rowid (300k rowid read is negligible).
You could improve the above query by using SQL objects:
CREATE TYPE client_obj AS OBJECT (
name VARCHAR2(50),
add1 VARCHAR2(50),
/*address2, city...*/
);
SELECT i.o.name, i.o.add1 /*...*/
FROM (SELECT DISTINCT
(SELECT client_obj(
max(name) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC),
max(add1) KEEP (DENSE_RANK FIRST ORDER BY up_date DESC)
/*city...*/
) o
FROM clients_suppliers c
WHERE c.code = i.code
AND c.category = i.category
AND c.up_date < i.invoice_date)
FROM invoices i
WHERE i.invoice_date BETWEEN :p1 AND :p2) i
The correlated subquery may be causing issues, but to me the real problem is that, in what seems to be your main client table, you cannot easily grab the most recent data without doing the max(up_date) mess. It's really a mix of history and current data, and, as you describe, poorly designed.
Anyway, it will help you in this and other long running joins to have a table/view with ONLY the most recent data for a client. So, first build a mat view for this (untested):
create or replace materialized view recent_clients_view
tablespace my_tablespace
nologging
build deferred
refresh complete on demand
as
select * from
(
select c.*, row_number() over (partition by code, category order by up_date desc, rowid desc) rnum
from clients c
)
where rnum = 1;
Add unique index on code,category. The assumption is that this will be refreshed periodically on some off hours schedule, and that your queries using this will be ok with showing data AS OF the date of the last refresh. In a DW env or for reporting, this is usually the norm.
The snapshot table for this view should be MUCH smaller than the full clients table with all the history.
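The unique index and the periodic refresh mentioned above might be sketched like this (index name and refresh schedule are assumptions):

```sql
-- Unique index backing the equijoin on the snapshot
CREATE UNIQUE INDEX recent_clients_uq
    ON recent_clients_view (code, category);

-- Complete refresh, e.g. called from a nightly job
BEGIN
    DBMS_MVIEW.REFRESH('RECENT_CLIENTS_VIEW', method => 'C');
END;
/
```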
Now, you are joining invoices to this smaller view, doing an equijoin on code, category (where emission between date1 and date2). Something like:
select cv.*
from
recent_clients_view cv,
invoices i
where cv.code = i.code
and cv.category = i.category
and i.emission between :date1 and :date2;
Hope that helps.
You might try rewriting the query to use analytic functions rather than a correlated subquery:
select *
from (SELECT CL.CODE, CL.CATEGORY, -- other address fields
max(up_date) over (partition by cl.code, cl.category) as max_up_date
FROM CLIENTS_SUPPLIERS CL join
INVOICES I
on CL.CODE = I.CODE AND
CL.CATEGORY = I.CATEGORY and
I.EMISSION BETWEEN DATE1 AND DATE2 and
up_date <= i.emission
) t
where t.up_date = max_up_date
You might want to remove the max_up_date column in the outside select.
As some have noticed, this query is subtly different from the original, because it is taking the max of up_date over all dates. The original query has the condition:
CL2.UP_DATE <= I.EMISSION
However, by transitivity, this means that:
CL2.UP_DATE <= DATE2
So the only difference is when the max of the update date is less than DATE1 in the original query. However, these rows would be filtered out by the comparison to UP_DATE.
Although this query is phrased slightly differently, I think it does the same thing. I must admit to not being 100% positive, since this is a subtle situation on data that I'm not familiar with.

Find row number in a sort based on row id, then find its neighbours

Say that I have some SELECT statement:
SELECT id, name FROM people
ORDER BY name ASC;
I have a few million rows in the people table and the ORDER BY clause can be much more complex than what I have shown here (possibly operating on a dozen columns).
I retrieve only a small subset of the rows (say rows 1..11) in order to display them in the UI. Now, I would like to solve following problems:
Find the number of a row with a given id.
Display the 5 items before and the 5 items after a row with a given id.
Problem 2 is easy to solve once I have solved problem 1, as I can then use something like this if I know that the item I was looking for has row number 1000 in the sorted result set (this is the Firebird SQL dialect):
SELECT id, name FROM people
ORDER BY name ASC
ROWS 995 TO 1005;
I also know that I can find the rank of a row by counting all of the rows which come before the one I am looking for, but this can lead to very long WHERE clauses with tons of OR and AND in the condition. And I have to do this repeatedly. With my test data, this takes hundreds of milliseconds, even when using properly indexed columns, which is way too slow.
Is there some means of achieving this by using some SQL:2003 features (such as row_number supported in Firebird 3.0)? I am by no way an SQL guru and I need some pointers here. Could I create a cached view where the result would include a rank/dense rank/row index?
Firebird appears to support window functions (called analytic functions in Oracle). So you can do the following:
To find the "row" number of a row with a given id (the row number has to be computed over all rows first, so the id filter goes in an outer query - filtering inside the same block would make the window function see only the one filtered row):
select seqnum
from (select id,
             row_number() over (order by name, id) as seqnum
      from t
     ) t
where id = <id>
This assumes the id's are unique.
To solve the second problem:
select t.*
from (select id, name,
             row_number() over (order by name, id) as rownum
      from t
     ) t join
     (select rownum
      from (select id,
                   row_number() over (order by name, id) as rownum
            from t
           )
      where id = <id>
     ) tid
     on t.rownum between tid.rownum - 5 and tid.rownum + 5
I might suggest something else, though, if you can modify the table structure. Most databases offer the ability to add an auto-increment column that is populated when a row is inserted. If your records are never deleted, this can serve as your counter, simplifying your queries.
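For instance, in Firebird 3 an identity column could provide that insertion counter (a sketch; the table and column names are illustrative, and this only stands in for the sort position if rows are inserted in the desired order, as the suggestion assumes):

```sql
-- Monotonically increasing value assigned on insert (Firebird 3+)
ALTER TABLE people
    ADD seq BIGINT GENERATED BY DEFAULT AS IDENTITY;
```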