SQL Server: index for finding the latest value greater than a passed value

I have a table with 4 columns
USER_ID: numeric
EVENT_DATE: date
VERSION: date
SCORE: decimal
I have a clustered index on (USER_ID, EVENT_DATE, VERSION). These three values together are unique.
I need to get the maximum EVENT_DATE for a set of USER_IDs (~1000 different ids) where SCORE is larger than a specific value, considering only entries with a specific VERSION.
SELECT M.*
FROM (VALUES
    ( 5237 ),
    -- ………1000 more
    ( 27054 )) C (USER_ID)
CROSS APPLY
    (SELECT TOP 1 C.USER_ID, M.EVENT_DATE, M.SCORE
     FROM MY_HUGE_TABLE M
     WHERE C.USER_ID = M.USER_ID
       AND M.VERSION = 'xxxx-xx-xx'
       AND M.SCORE > 2
     ORDER BY M.EVENT_DATE DESC) M
When I execute the query, the runtime is poor, presumably due to a missing index on the SCORE column.
If I remove the filter "M.SCORE > 2", I get my results ten times faster, but then the latest scores may be less than 2.
Could anyone give me a hint on how to set up an index that would improve this query's performance?
Thank you very much in advance.

For your query, the optimal index would be on (USER_ID, VERSION, EVENT_DATE DESC, SCORE).
Unfortunately, your clustered index doesn't match: USER_ID matches as the leading column, but VERSION and EVENT_DATE appear in the wrong order after it. So only USER_ID can help, and that probably doesn't do much to filter the data.
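As DDL, the suggested index might look like this (a sketch; the index name is illustrative, and depending on your data SCORE might work just as well as an INCLUDE column instead of a trailing key column):
CREATE NONCLUSTERED INDEX IX_MY_HUGE_TABLE_latest_score
    ON MY_HUGE_TABLE (USER_ID, VERSION, EVENT_DATE DESC, SCORE);
With this index, each CROSS APPLY probe becomes a single seek to (USER_ID, VERSION) followed by a short scan down EVENT_DATE until the first row with SCORE > 2.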

Related

Optimize query to get rows with the highest filtered count in another table

I'm trying to create the most optimal query where the database would return the names of readers who often borrow sci-fi books. This is what I'm trying to optimize:
SELECT reader.name,
       COUNT(CASE WHEN book.status_id = 1 AND book.category_id = 2 THEN 1 END)
FROM reader
JOIN book ON book.reader_id = reader.id
GROUP BY reader.name
ORDER BY COUNT(CASE WHEN book.status_id = 1 AND book.category_id = 2 THEN 1 END) DESC
LIMIT 10;
How can I improve this query other than by using an INNER JOIN or increasing memory consumption?
This is my ERD diagram (image not included here).
You could try adding your criteria to the join condition and just using the total count. It really depends on how much data you have, etc.:
SELECT reader.name,
       COUNT(1) AS COUNTER
FROM reader
JOIN book ON book.reader_id = reader.id
         AND book.status_id = 1
         AND book.category_id = 2
GROUP BY reader.name
ORDER BY COUNTER DESC
LIMIT 10;
This assumes at least 10 readers pass the criteria (as another answer also silently assumes); otherwise you get fewer than 10 result rows.
Start with the filter. Aggregate & limit before joining to the second table. Much cheaper:
SELECT r.reader_id, r.surname, r.name, b.ct
FROM (
    SELECT reader_id, count(*) AS ct
    FROM book
    WHERE status_id = 1
      AND category_id = 2
    GROUP BY reader_id
    ORDER BY ct DESC, reader_id -- tiebreaker
    LIMIT 10
) b
JOIN reader r ON r.id = b.reader_id
ORDER BY b.ct DESC, r.reader_id; -- tiebreaker
A multicolumn index on (status_id, category_id) would help a lot, or an index on just one of the two columns if either predicate is very selective. If performance of this particular query is your paramount objective, use this partial multicolumn index:
CREATE INDEX book_reader_special_idx ON book (reader_id)
WHERE status_id = 1 AND category_id = 2;
Typically, though, you'd vary the query, and then this last index is too specialized.
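For reference, the plain multicolumn index mentioned above might be created like this (a sketch; the index name is illustrative):
CREATE INDEX book_status_category_idx ON book (status_id, category_id);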
Additional points:
Group by reader_id, which is the primary key (I assume) and guaranteed to be unique, as opposed to reader.name! Your original is likely to produce wrong results, since name appears to be just the first name from the looks of your ERD, so distinct readers sharing a name would be lumped together.
It's also typically substantially faster to group by a single integer than by two varchar(25) columns. But that's secondary; correctness comes first.
Also output surname and reader_id to disambiguate identical names. (Even name & surname are not reliably unique.)
count(*) is a bit faster than count(1), while doing exactly the same here.
Add a tiebreaker to the ORDER BY clause to get a stable sort order and deterministic results. (Else, the result can be different every time with ties on the count.)

Speeding up SQL query

I have a PostgreSQL table with about 50 million entries. I want to find two ids by looking for the first and the last time a timestamp appears, to give me a range of entries.
Finding the "first" id takes about 100 milliseconds.
But finding the second id takes about 3 minutes.
Query for finding the first id:
SELECT id
FROM transactions
WHERE "hashBlock" =
    (SELECT hash
     FROM blocks
     WHERE n_time > 1262300400
     ORDER BY id ASC
     LIMIT 1)
ORDER BY id ASC
LIMIT 1
Query for finding the second id:
SELECT id
FROM transactions
WHERE "hashBlock" =
    (SELECT hash
     FROM blocks
     WHERE n_time < 1306879200
     ORDER BY id DESC
     LIMIT 1)
ORDER BY id DESC
LIMIT 1
I guess the longer runtime results from the query scanning from the first id until it finds an id that satisfies the query, while the second one has to start at the last id.
Is there any way to speed up the second query?
I would create the following indexes, if you don't have them already:
create index ix1 on transactions ("hashBlock", id);
create index ix2 on blocks (n_time, id, hash);
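With ix1 in place, each outer lookup can equivalently be written as a MIN()/MAX() aggregate, which Postgres can answer with a single index probe per hash (a sketch; the blocks subqueries stay exactly as in the question):
SELECT min(id) FROM transactions WHERE "hashBlock" = (...); -- first id
SELECT max(id) FROM transactions WHERE "hashBlock" = (...); -- last id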

Incremental DISTINCT / GROUP BY operation

I have a simple two-stage SQL query that operates on two tables A and B, where I use a sub-select to retrieve a number of IDs of table A that are stored as foreign keys in B, using a (possibly complex) query on table B (and possibly other joined tables). Then I want to simply return the first x IDs of A. I tried a query like this:
SELECT sq.id
FROM (
    SELECT a_id AS id, created_at
    FROM B
    WHERE ...
    ORDER BY created_at DESC
) sq
GROUP BY sq.id
ORDER BY max(sq.created_at) DESC
LIMIT 10;
which is quite slow as Postgres seems to perform the GROUP BY / DISTINCT operation on the whole result set before limiting it. If I LIMIT the sub-query (e.g. to 100), the performance is just fine (as I'd expect), but of course it's no longer guaranteed that there will be at least 10 distinct a_id values in the resulting rows of sq.
Similarly, the query
SELECT a_id AS id
FROM B
WHERE ...
GROUP BY id
ORDER BY max(created_at) DESC
LIMIT 10
is quite slow as Postgres seems to perform a sequential scan on B instead of using an (existing) index. If I remove the GROUP BY clause it uses the index just fine.
The data in table B is such that most rows contain different a_ids, hence even without the GROUP BY most of the returned IDs will be different. The goal I pursue with the grouping is to assure that the result set always contains a given number of entries from A.
Is there a way to perform an "incremental DISTINCT / GROUP BY"? In my naive thinking it would suffice for Postgres to produce result rows and group them incrementally until it reaches the number specified by LIMIT, which in most cases should be nearly instantaneous as most a_id values are different. I tried various ways to query the data but so far I didn't find anything that works reliably.
The Postgres version is 9.6; the data schema is as follows:
Table "public.a"
Column | Type | Modifiers
--------+-------------------+------------------------------------------------
id | bigint | not null default nextval('a_id_seq'::regclass)
bar | character varying |
Indexes:
"a_pkey" PRIMARY KEY, btree (id)
"ix_a_bar" btree (bar)
Referenced by:
TABLE "b" CONSTRAINT "b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)
Table "public.b"
Column | Type | Modifiers
------------+-----------------------------+--------------------------------------------------
id | bigint | not null default nextval('b_id_seq'::regclass)
foo | character varying |
a_id | bigint | not null
created_at | timestamp without time zone |
Indexes:
"b_pkey" PRIMARY KEY, btree (id)
"ix_b_created_at" btree (created_at)
"ix_b_foo" btree (foo)
Foreign-key constraints:
"b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)
This problem is much more complex than it might seem at first glance.
If ...
- your criteria are not very selective (much more than 10 distinct a_id qualify)
- you don't have many duplicate a_id in table B (like you stated)
then there is a very fast way.
To simplify a bit I assume created_at is also defined NOT NULL, or you need to do more.
WITH RECURSIVE top10 AS (
( -- extra parentheses required
SELECT a_id, ARRAY[a_id] AS id_arr, created_at
FROM b
WHERE ... -- your other filter conditions here
ORDER BY created_at DESC, a_id DESC -- both NOT NULL
LIMIT 1
)
UNION ALL -- UNION ALL, not UNION, since we exclude dupes a priori
(
SELECT b.a_id, id_arr || b.a_id, b.created_at
FROM top10 t
JOIN b ON (b.created_at, b.a_id)
< (t.created_at, t.a_id) -- comparing ROW values
AND b.a_id <> ALL (t.id_arr)
WHERE ... -- repeat conditions
ORDER BY created_at DESC, a_id DESC
LIMIT 1
)
)
SELECT a_id
FROM top10
LIMIT 10;
Ideally supported by an index on (created_at DESC, a_id DESC) (or just (created_at, a_id)).
Depending on your other WHERE conditions, other (partial?) indexes may serve even better.
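Such an index might be declared like this (a sketch; the name is illustrative, and a plain ascending (created_at, a_id) works too, since Postgres can scan btree indexes backwards):
CREATE INDEX b_created_at_a_id_idx ON b (created_at DESC, a_id DESC);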
This is particularly efficient for a small result set. Else, and depending on various other details, other solutions may be faster.
Related (with much more explanation):
Can spatial index help a “range - order by - limit” query
Optimize GROUP BY query to retrieve latest record per user
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Best way to select random rows PostgreSQL
PostgreSQL sort by datetime asc, null first?
Select first row in each GROUP BY group?
The only way the planner has a chance to avoid sorting the whole table is if you have an index on the complete ORDER BY clause.
Then an index scan can be chosen to get the correct ordering, and the first ten result rows may be found quickly.

Getting the last record in SQL with a WHERE condition

I have a loanTable that contains two fields, loan_id and status:
loan_id  status
===============
1        0
2        9
1        6
5        3
4        5
1        4   <-- How do I select this??
4        6
In this situation I need to show the last status of loan_id 1, i.e. status 4. Can you please help me with this query?
Since the 'last' row for ID 1 is neither the minimum nor the maximum, you are living in a state of mild confusion. Rows in a table have no order. So, you should be providing another column, possibly the date/time when each row is inserted, to provide the sequencing of the data. Another option could be a separate, automatically incremented column which records the sequence in which the rows are inserted. Then the query can be written.
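For example, adding such a sequencing column might look like this in SQL Server syntax (a sketch; the column name is illustrative):
ALTER TABLE LoanTable ADD Status_ID INT IDENTITY(1,1) NOT NULL;
-- note: existing rows receive their identity values in no guaranteed order;
-- only rows inserted afterwards are reliably sequenced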
If the extra column is called status_id, then you could write:
SELECT L1.*
FROM LoanTable AS L1
WHERE L1.Status_ID = (SELECT MAX(Status_ID)
FROM LoanTable AS L2
WHERE L2.Loan_ID = 1);
(The table aliases L1 and L2 could be omitted without confusing the DBMS or experienced SQL programmers.)
As it stands, there is no reliable way of knowing which is the last row, so your query is unanswerable.
Does your table happen to have a primary id or a timestamp? If not, then what you want is not really possible.
If yes then:
SELECT TOP 1 status
FROM loanTable
WHERE loan_id = 1
ORDER BY primaryId DESC
-- or
-- ORDER BY yourTimestamp DESC
I assume that by "last status" you mean the record that was inserted most recently? AFAIK there is no way to make such a query unless you add a timestamp column to your table that stores the date and time each record was added. An RDBMS doesn't keep any internal order of the records.
So if last = last inserted, that's not possible with the current schema until you add a PK:
select top 1 status, loan_id
from loanTable
where loan_id = 1
order by id desc -- PK
Use a data reader and capture the value on each iteration; when the while loop exits, the variable holds the last row's value. As the other posters stated, unless you put a sort on the query, the row order could change. Even if there is a clustered index on the table, it might not return the rows in that order (without a sort on the clustered index).
SqlDataReader rdr = SQLcmd.ExecuteReader();
string lastVal = null;
while (rdr.Read())
{
    lastVal = rdr[0].ToString(); // overwritten on each row; holds the last row's value after the loop
}
rdr.Close();
You could also use ROW_NUMBER(), but that requires a sort, and you cannot use ROW_NUMBER() directly in the WHERE clause; you can work around that by creating a derived table. The rdr solution above is faster.
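A sketch of that derived-table workaround, assuming the table had some orderable column (here a hypothetical inserted_at timestamp):
SELECT status
FROM (SELECT status,
             ROW_NUMBER() OVER (ORDER BY inserted_at DESC) AS rn
      FROM loanTable
      WHERE loan_id = 1) t
WHERE rn = 1;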
In an Oracle database this is very simple:
select * from (select * from loanTable order by rownum desc) where rownum=1
Hi, in case this has not been solved yet:
To get the last record for any field from a table, the easiest way would be to add an ID to each record, say pID. Suppose you would like to get the last record for each Name; run this simple query:
SELECT Name, MAX(pID) AS LastID
INTO [TableName]
FROM [YourTableName]
GROUP BY Name -- or any other field you would like your last records to appear by
You should now have a table containing the Names in one column and the last available ID for that Name.
Now you can use a join to get the other details from your primary table; say these are some price or date columns, then run the following:
SELECT a.*, b.Price -- or b.[date], or whatever other fields you want
FROM [TableName] a
LEFT JOIN [YourTableName] b
       ON a.Name = b.Name AND a.LastID = b.pID
This should give you the last record for each Name. For the first record, run the same queries as above, just replacing MAX with MIN.
This should be easy to follow and should run quickly as well.
If you don't have any identifying columns you could use to get the insert order, you can always do it like this, but it's hacky and not very pretty:
select t.row1,
       t.row2,
       ROW_NUMBER() OVER (ORDER BY t.[count]) AS rownum
from (select tab.row1,
             tab.row2,
             1 as [count]
      from [table] tab) t
So basically you take the 'natural order', if you can call it that, and add a column where every row holds the same value. Sorting by that column preserves the 'natural order' and gives you an opportunity to place a row number column on the outer query.
Personally, if the system you are using hasn't got a timestamp/identity column and the current users rely on the 'natural order', I would quickly add a column and use this query to create some sort of timestamp/incremental key, rather than risk some automated mechanism changing the 'natural order' and breaking the data you depend on.
I think this code may help you:
WITH cte_Loans
AS
(
    SELECT LoanID
          ,[Status]
          ,ROW_NUMBER() OVER(ORDER BY (SELECT 1)) AS RN -- arbitrary order; not a guaranteed insert order
    FROM LoanTable
)
SELECT LoanID
      ,[Status]
FROM cte_Loans L1
WHERE RN = ( SELECT MAX(RN)
             FROM cte_Loans L2
             WHERE L2.LoanID = L1.LoanID)

How can I optimize this SQL query to get rid of the filesort and temp table?

Here's the query:
SELECT count(id) AS count
FROM `numbers`
GROUP BY MONTH(created_at),
         YEAR(created_at)
ORDER BY YEAR(created_at),
         MONTH(created_at)
That query throws a 'Using temporary' and 'Using filesort' when doing EXPLAIN.
Ultimately what I'm doing is looking at a table of user-submitted tracking numbers and counting the number of submitted rows, grouping the counts by month/year.
I.e. in November 2008 there were 11,312 submitted rows.
UPDATE: here's the DESCRIBE for the numbers table.
Field               Type          Null  Key  Default  Extra
id                  int(11)       NO    PRI  NULL     auto_increment
tracking            varchar(255)  YES        NULL
service             varchar(255)  YES        NULL
notes               text          YES        NULL
user_id             int(11)       YES        NULL
active              tinyint(1)    YES        1
deleted             tinyint(1)    YES        0
feed                text          YES        NULL
status              varchar(255)  YES        NULL
created_at          datetime      YES        NULL
updated_at          datetime      YES        NULL
scheduled_delivery  date          YES        NULL
carrier_service     varchar(255)  YES        NULL
Give this a shot:
SELECT COUNT(x.id)
FROM (SELECT t.id,
             MONTH(t.created_at) AS created_month,
             YEAR(t.created_at) AS created_year
      FROM NUMBERS t) x
GROUP BY x.created_year, x.created_month
ORDER BY x.created_year, x.created_month
It's not a good habit to use functions in the WHERE, GROUP BY and ORDER BY clauses because indexes can't be used.
...query throws a 'Using temporary' and 'Using filesort' when doing EXPLAIN.
From what I found, that's to be expected when using DISTINCT/GROUP BY.
Make sure you have a covering index over YEAR and MONTH (that is, both fields within the same index) so that the ORDER BY component of your query can use an index. This should remove the need for a filesort, although a temporary table may still be needed to handle the grouping.
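On MySQL 8.0.13 or later, a functional index over those expressions is possible (a sketch; the index name is illustrative, and each expression needs its own parentheses):
CREATE INDEX ix_numbers_year_month
    ON numbers ((YEAR(created_at)), (MONTH(created_at)));
On older versions you would need a helper column instead.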
SELECT count(`id`) AS count,
       MONTH(`created_at`) AS month,
       YEAR(`created_at`) AS year
FROM `numbers`
GROUP BY year, month
ORDER BY year, month
This will be the best you can get, as far as I can tell. I created a table with an id and a datetime column and filled it with 10,000 rows. The query above uses a sub-select, but it really doesn't do anything different and has the overhead of the sub-select. The resulting time for mine was 0.015s and his was 0.016s.
Make sure that you have an index on created_at; this will help your initial query out. It is pretty rare not to end up with a filesort when a GROUP BY comes about, but it may be possible in other situations. MySQL's docs have an article about this if you feel so inclined. I do not see how those methods can be applied here with the information you have provided.
Whenever MySQL has to do work in memory and that work exceeds the available amount (for implicit temporary tables, the smaller of tmp_table_size and max_heap_table_size), it starts having to use the disk to store the temporary work. You could increase those variables, but setting them too high could cause performance problems in other areas.
(The common advice of ~50-75% of RAM on a dedicated server applies to innodb_buffer_pool_size, which governs InnoDB's data cache rather than temporary tables.)
The best method would be creating a helper column that contains the numeric values of YEAR and MONTH concatenated together:
YEAR(created_at) * 100 + MONTH(created_at)
Grouping on this column, once it is indexed, can use INDEX FOR GROUP BY.
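On MySQL 5.7+ that helper column can be a generated column with an index on it (a sketch; the names ym and ix_numbers_ym are illustrative):
ALTER TABLE numbers
    ADD COLUMN ym INT AS (YEAR(created_at) * 100 + MONTH(created_at)) STORED,
    ADD INDEX ix_numbers_ym (ym);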
However, you can also create two helper tables, the first one containing a reasonable range of years (say, 1900 to 2100), the second one containing months (0 to 11), and use these tables to generate the sets:
SELECT (
       SELECT COUNT(*)
       FROM numbers
       WHERE created_at >= '1900-01-01' + INTERVAL y YEAR + INTERVAL m MONTH
         AND created_at <  '1900-01-01' + INTERVAL y YEAR + INTERVAL (m + 1) MONTH
       )
FROM year_table
CROSS JOIN month_table
WHERE y BETWEEN 2008 AND 2010
I'm sorry, but I have to disagree with the other answers.
I think what you need is to add an index to your table, preferably a covering index.
If you add an index on the column you are searching on (created_at) that also includes the column you want in the result (id), it will be dramatically faster than before.
The reason you are getting a temp table is the GROUP BY.
To speed up the GROUP BY, you can change the MySQL server settings to increase tmp_table_size and max_heap_table_size so that the temp table stays in memory.
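For example (a sketch; the 256 MB figures are purely illustrative, and since the in-memory limit is the smaller of the two values, the variables are usually raised together):
SET GLOBAL tmp_table_size      = 256 * 1024 * 1024;
SET GLOBAL max_heap_table_size = 256 * 1024 * 1024;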