The faster of two SQL queries, sort and select top 1, or select MAX - sql

Which one is faster of the following two queries?
Query 1:
SELECT TOP 1 order_date
FROM orders WITH (NOLOCK)
WHERE customer_id = 9999999
ORDER BY order_date DESC
Query 2:
SELECT MAX(order_date)
FROM orders WITH (NOLOCK)
WHERE customer_id = 9999999

With an index on order_date, they perform the same.
Without an index, MAX is slightly faster, since it uses a Stream Aggregate rather than a Top N Sort.
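For reference, a composite index that lets both forms resolve with a simple seek might look like this (the index name is hypothetical; the columns follow the question):
CREATE INDEX IX_orders_customer_date
ON orders (customer_id, order_date DESC);
With such an index in place, both the TOP 1 query and the MAX query can read a single row from the end of the matching index range.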

ORDER BY is almost always the slowest option, because the table's data must be sorted.
Aggregate functions shouldn't slow things down as much as a sort.
However, some aggregate functions use a sort as part of their implementation, so a particular set of tables in a particular database product must be tested experimentally to see which is faster -- for that set of data.

I would say the first, because the second requires it to go through an aggregate function.
However, as marc_s said, test it.

You can't answer that without knowing at least something about your indexes (have you got one on customer_id, or on order_date?), the amount of data in the table, the state of your statistics, etc.

It depends on how big your table is. With the second query there's no need for TOP 1, since MAX only returns one value anyway. If I were you, I would use the second query.

Related

Adding a SUM statement increases run time way too much, is there a better method?

I have a table with invoice payments, which can be partial or full. I am comparing the calculated sum of payments to the total amount of the invoice. The sum appears twice in the query: once in the SELECT statement and again in the WHERE clause. Even if I remove one so it appears only in the WHERE or only in the SELECT, the query takes more than an hour to run. If I remove the SUM entirely, it takes 10 seconds to run.
Is there a better method to get the sum? Should I use an indexed view? A temp table? Note that an invoice number is unique only to a vendor, not unique in general. The initial FROM is a view, if that makes a difference.
select distinct
    transdate,
    invoicedate,
    PAY.OrderAccount,
    v.VendorName,
    invoiceamountmst,
    (select sum(PAY1.settleamountcur)
     from [VIEW_INVOICE_PAYMENT] PAY1
     where PAY.INVOICEID = PAY1.INVOICEID
       and PAY.OrderAccount = PAY1.OrderAccount) as "InvoiceSUM",
    settleamountcur,
    Currencycodeinvoice,
    PAY.Description,
    Voucher
from VIEW_INVOICE_PAYMENT PAY
inner join INVOICE on INVOICE_DOC_NO = invoiceid
join VENDOR V on PAY.OrderAccount = v.VendorAccount
where TRANSDATE is not null
  and (select sum(PAY1.settleamountcur)
       from [VIEW_INVOICE_PAYMENT] PAY1
       where PAY.INVOICEID = PAY1.INVOICEID
         and PAY.OrderAccount = PAY1.OrderAccount) = total_cost_on_invoice
In this answer, when I refer to 'that select', I'm referring to the sub-query in the middle: select sum(PAY1.settleamountcur) ...
Note that the aliasing in 'that select' looks a little strange, e.g., select sum(PAY1.settleamountcur) from [VIEW_INVOICE_PAYMENT] AX1. Where does the PAY1 alias come from? I may have missed something. If that's a typo in your code, it could be doing bad things (if it even runs). Assuming it's not, however...
For your broader problem, I believe the engine will be running that select statement once for every row returned by your overall query. Indeed, it may be doing it more often, depending on where the filtering happens in the execution plan.
Note I'm assuming SQL Server in this answer - but it should apply to other databases as well.
A couple of options:
1. Instead of referring to the view, bring its underlying tables into your current query and modify the query accordingly.
2. Remove the aggregation from the subquery and instead do it over the whole data set, e.g., GROUP BY the relevant fields and SUM across the relevant fields. This can be combined with option 1.
3. Put the sub-query in a CTE, or as a sub-query within the FROM clause. This may make the engine treat it as a single table rather than running it many times (or it may not).
4. (Sometimes my preferred option for large tables) Get the relevant data from the view into a temporary table first, e.g.,
SELECT INVOICEID, OrderAccount, SUM(settleamountcur) AS total_settleamountcur
INTO #Temp
FROM [VIEW_INVOICE_PAYMENT]
GROUP BY INVOICEID, OrderAccount
-- Add any WHERE/HAVING clauses you can to filter
-- Consider creating the temp table first with a primary key, making joins easier for SQL Server
Then use the #Temp table instead of that select sub-query.
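A minimal sketch of how the main query might then look (reusing the tables and columns from the question; untested):
select distinct
    transdate,
    invoicedate,
    PAY.OrderAccount,
    v.VendorName,
    invoiceamountmst,
    T.total_settleamountcur as "InvoiceSUM",
    settleamountcur,
    Currencycodeinvoice,
    PAY.Description,
    Voucher
from VIEW_INVOICE_PAYMENT PAY
inner join INVOICE on INVOICE_DOC_NO = invoiceid
join VENDOR V on PAY.OrderAccount = v.VendorAccount
join #Temp T on T.INVOICEID = PAY.INVOICEID
            and T.OrderAccount = PAY.OrderAccount
where TRANSDATE is not null
  and T.total_settleamountcur = total_cost_on_invoice
This way each invoice's sum is computed once, up front, rather than once per output row.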

Get data like in database order without sorting it

I have a table which has ShipCountry, ShipCity and Freight column in SQL database. I tried to retrieve data from that table by using the below query.
Select ShipCountry from CountryDetails Group by ShipCountry
If I run this query, I get the results in ascending order. Instead, I need the data in database order. How can I achieve this through a SQL query?
Note: if I run the query below, it returns the data in database order. I only get sorted data when I add the GROUP BY clause to my query.
Select ShipCountry from CountryDetails
Using GROUP BY for ordering is improper (GROUP BY is for aggregate functions such as MIN, MAX, or COUNT).
If you need a specific order, use ORDER BY instead:
Select ShipCountry from CountryDetails Order by ShipCountry
Otherwise, if you don't want an order, simply use:
Select ShipCountry from CountryDetails
Remember that the values stored in the db have no inherent order; they are selected in whatever sequence is used to retrieve the data.
Each time you need an order, you must explicitly use ORDER BY.
To avoid "redundant values", use DISTINCT rather than GROUP BY, e.g.:
Select distinct ShipCountry from CountryDetails
As has already been stated, what you describe might lead to unexpected results for your end users.
Let's assume you have a table without any indexes or keys (a so-called heap). A heap can pretty much be compared to a phone book (yeah, I've been around for a while) consisting of hundreds of pages on which information is randomly ordered. A heap is exactly that: a lot of randomly ordered data. Whenever you query such a table, the query analyzer will do its very best to figure out the fastest way to deliver the data.
Such decisions from the query analyzer are guided by statistics; a collection of metrics about the data and the distribution thereof. SQL Server uses these statistics to figure out the cardinality (the uniqueness of values), and thus pick the fastest way to return data.
When you simply issue a SELECT * FROM myTable on a heap, those statistics will determine the order in which your data is returned. However, this also means that over time, the statistics will change, as more data flows into the table. This has the effect that the sort order of your data today is not necessarily the sort order in which the data is returned tomorrow, or even five minutes from now.
If that is fine with your end users, then a SELECT * FROM myTable is the right solution for you. But, if you absolutely need to have the data returned in a certain order, you should always implement an ORDER BY clause.
If you want to get "database order" in most cases: if you sort by the primary key, the result will usually be the same as what you see without ordering. Here, id is the primary key; if you cannot use the primary key, add an identity column and use that:
id  name
1   elly
2   ahmad
3   joseph
4   omar
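A minimal sketch of the identity-column approach (using the CountryDetails table from the question; the RowId column name is hypothetical):
ALTER TABLE CountryDetails ADD RowId INT IDENTITY(1,1);

SELECT ShipCountry
FROM CountryDetails
ORDER BY RowId;
Note that the identity values for pre-existing rows are assigned in whatever order the engine processes them, so only rows inserted afterwards are guaranteed to follow insertion order.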

SQL Select, different than the last 10 records

I have a table called "dutyroster". I want to make a random selection from this table's "names" column, but I want the selection to be different from the last 10 records, so that the same guy is not given a second duty within 10 days. Is that possible?
Create a temporary table with only one column called oldnames, which will initially have no records. For each selection, execute a query like:
select names from dutyroster where dutyroster.names not in (select oldnames from temporarytable) limit 10
and when execution is done, add the result set to the temporary table.
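A minimal sketch of that flow (MySQL syntax, matching the LIMIT above; table and column names follow the answer and are otherwise hypothetical):
CREATE TEMPORARY TABLE temporarytable (oldnames VARCHAR(100));

-- Draw one name that is not among the recorded previous selections:
SELECT names
FROM dutyroster
WHERE names NOT IN (SELECT oldnames FROM temporarytable)
ORDER BY RAND()
LIMIT 1;

-- Record the selection so later draws exclude it:
INSERT INTO temporarytable (oldnames) VALUES ('the selected name');

-- To cap the exclusion window at 10, you would also delete the oldest row
-- once the table holds more than 10 entries (this requires a timestamp or
-- id column to identify it; not shown here).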
The other answer here already addresses how to avoid repeating selections.
To accomplish the random part of the selection, leverage NEWID() directly within your SELECT statement, for example:
SELECT TOP 10
    newid() AS [RandomSortColumn],
    *
FROM dutyroster
ORDER BY [RandomSortColumn] ASC
Keep executing the query and you'll keep getting different results. Use the technique from the other answer to avoid doubling a guy up.
The basic idea is to use a subquery to exclude the users chosen in the last ten days, then sort the rest randomly:
select dr.*
from dutyroster dr
where dr.name not in (select dr2.name
                      from dutyroster dr2
                      where dr2.datetimecol >= date_sub(curdate(), interval 10 day)
                     )
order by rand()
limit 1;
Different databases may have different syntax for limit, rand(), and for the date/time functions. The above gives the structure of the query, but the functions may differ.
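For instance, a SQL Server rendering of the same structure might look like this (the column names are carried over from the query above and are assumptions):
SELECT TOP 1 dr.*
FROM dutyroster dr
WHERE dr.name NOT IN (SELECT dr2.name
                      FROM dutyroster dr2
                      WHERE dr2.datetimecol >= DATEADD(DAY, -10, GETDATE()))
ORDER BY NEWID();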
If you have a large amount of data and performance is a concern, there are other (more complicated) ways to take a random sample.
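One such option on SQL Server is TABLESAMPLE, which samples storage pages rather than individual rows, so it is fast on large tables but only approximate (and can return zero rows on small ones); a minimal sketch:
SELECT TOP 1 names
FROM dutyroster TABLESAMPLE (10 PERCENT)
ORDER BY NEWID();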
You could use TOP for SQL Server, and for MySQL you could use LIMIT.
Maybe this would help...
SELECT TOP number|percent column_name(s)
FROM table_name;
Source: http://www.w3schools.com/sql/sql_top.asp

Is there a performance difference in using a GROUP BY with MAX() as the aggregate vs ROW_NUMBER over partition by?

Is there a performance difference between the following 2 queries, and if so, then which one is better?:
select
    q.id,
    q.name
from (
    select id, name,
           row_number() over (partition by name order by id desc) as row_num
    from table
) q
where q.row_num = 1
versus
select
    max(id),
    name
from table
group by name
(The result set should be the same)
This is assuming that no indexes are set.
UPDATE: I tested this, and the group by was faster.
I had a table of about 4.5M rows, and I wrote both a MAX with GROUP BY solution and a ROW_NUMBER solution and tested them both. The MAX required two clustered scans of the table, one to aggregate and a second to join to the rest of the columns, whereas ROW_NUMBER needed only one. (Obviously one or both of these could be indexed to minimize IO, but the point is that the GROUP BY requires two index scans.)
According to the optimizer, in my case the ROW_NUMBER version was about 60% more efficient by subtree cost, and according to statistics IO it used about 20% less CPU time. However, in real elapsed time, the ROW_NUMBER solution took about 80% longer. So the GROUP BY wins in my case.
This seems to match the other answers here.
The GROUP BY should be faster. ROW_NUMBER has to assign a row number to every row in the table, and it does this before filtering out the ones it doesn't want.
The second query is, by far, the better construct. In the first, you have to be sure that the columns in the partition clause match the columns that you want. More importantly, "group by" is a well-understood construct in SQL. I would also speculate that the group by might make better use of indexes, but that is speculation.
I'd use the group by name.
There's not much in it when the index is name, id DESC (Plan 1),
but if the index is declared as name, id ASC (Plan 2) then on SQL Server 2008 I see that the ROW_NUMBER version is unable to use this index and gets a sort operation, whereas the GROUP BY is able to use a backwards index scan to avoid it.
You'd need to check the plans on your version of SQL Server and with your data and indexes to be sure.
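For reference, the two index shapes being compared might be declared like this (index names are hypothetical; the table and columns follow the question, with brackets since TABLE is a reserved word):
create index ix_name_id_desc on [table] (name, id desc); -- Plan 1
create index ix_name_id_asc on [table] (name, id asc);   -- Plan 2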

Multiple SQL searches vs searching through one returned array

Is it faster to do multiple SQL queries on one table with different conditions, or to get all the items from the table into an array and then separate them out that way? I realize I'm not explaining my question that well, so here is an example:
I want to pull records on posts and display them in categories based on when they were posted, say within one year, within one month, one week, etc. The nature of the categories results in lower level categories being entirely contained within upper level ones.
Should I do a SQL find with different conditions for each category, resulting in multiple calls to the database, or should I do one search, returning all of the items and then sort them out from the array? Thanks for your responses, sorry I'm new at this.
Typically I would say that you are going to get better performance by letting your database engine do the sorting work. Each database engine has this functionality and typically it can do it faster than you can.
So I would vote to use the database to get your multiple groups rather than trying to do it yourself in memory.
I typically perform one large SQL query and then break the array up in Ruby, to minimize the number and duration of database connections.
This isn't necessarily any faster, and I have never benchmarked it, but fewer reads against the db hopefully means it will scale better.
Edit: Never mind, I didn't quite understand the question. Just let SQL perform the ordering for you in a convenient fashion, then process the array yourself.
You can probably make it even easier if you let your SELECT statement generate helper columns to say which categories (e.g., based on the date) a record belongs to.
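A minimal sketch of that helper-column idea (the posts table and date column are assumptions borrowed from the UNION example below; the bucket labels and boundaries are illustrative, in SQL Server syntax):
SELECT p.*,
       CASE
           WHEN p.date > DATEADD(WEEK, -1, GETDATE())  THEN 'week'
           WHEN p.date > DATEADD(MONTH, -1, GETDATE()) THEN 'month'
           WHEN p.date > DATEADD(YEAR, -1, GETDATE())  THEN 'year'
           ELSE 'older'
       END AS category
FROM posts p
ORDER BY p.date DESC;
Because the lower-level categories are contained in the upper ones, the application can still place a 'week' row into the month and year groups while traversing the results.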
The simplest, and easiest to understand would be to perform multiple queries for each criteria, and then form each of those result sets into a group. I don't think you want to start traversing result sets and duplicating rows.
If you really want to do it in one query, you could try a UNION query.
SELECT *, 1 AS grp FROM posts WHERE date > '2009-07-24 00:00:00'
UNION ALL
SELECT *, 2 AS grp FROM posts WHERE date > '2009-07-17 00:00:00'
UNION ALL
SELECT *, 3 AS grp FROM posts WHERE date > '2009-06-24 00:00:00'
UNION ALL
SELECT *, 4 AS grp FROM posts WHERE date > '2008-07-24 00:00:00'
ORDER BY grp, date DESC
After that you just need to traverse the list once and filter into new lists by the "grp" column. (Note that GROUP is a reserved word, so the helper column needs another name, and a UNION allows only one ORDER BY, at the end.)
It depends. If you're using OR operators in your procedures, then things could get kind of slow. It would be better at that point to use multiple SQL statements.
But really, you need to analyze the query plans and decide for yourself if it is efficient enough or not. Run real world examples and TEST TEST TEST.