SQL Limit on "WHERE X IN (...)" - sql

I've got some data I'd like to pull off our SQL server.
This old database does not have any primary keys associated with it, so pulling data is like querying an Excel spreadsheet (what it actually originated as years ago).
I need to run reports on this data, though.
Currently, I get a list of distinct serial numbers for a given time period, then pull all of the records for a given serial number. For a 1-month time frame, this can be 1500 to 3000 serial numbers. The serial number field is formatted as char(20), even though the serial numbers are only 15 characters long.
BEGIN UPDATE
There are typically 5 to 15 entries in this table per Serial_Number.
There are at most 10 machines writing data to this table, so identical Date_Time values are possible.
END UPDATE
This process takes a while, but between different serial numbers in the list, I am able to update the Windows Form with a Progress Bar so management knows something is happening and about how much longer to expect.
I am always trying to make this query run faster.
Now, I am thinking about pulling the data I need using a WHERE clause such as:
SELECT Col1, Col2, Col3
FROM Table1
WHERE Serial_Number IN (
SELECT DISTINCT Serial_Number
FROM Table1
WHERE Date_Time BETWEEN @startDate AND @endDate
)
My question is: Are there any issues I could run into with this, particularly because we have so many distinct serial numbers during a given time frame?
And, of course, you know someone in Management is going to try running a year's worth of data when they are bored! Then, they are going to try running data since Jesus was born, just because they've got nothing better to do.
Restate Question: Is there a limit to the WHERE clause's IN method that limits the number of items I can pass in?

Index Serial_Number and Date_Time in Table1 (with separate indexes, not a single compound index) and this should perform fairly well for you unless the table is really truly ginormous.
You might get a little more speed with one index on Serial_Number and the second on (Date_Time, Serial_Number). That second index covers the sub query, allowing it to be answered from the index alone.
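A minimal sketch of the two-index layout described above. SQLite (through Python's sqlite3 module) stands in for SQL Server, and the index names and sample rows are invented for illustration. The query plan is printed rather than interpreted, because its wording differs between engines and versions.

```python
import sqlite3

# SQLite is only a stand-in for SQL Server; table and column names follow the
# question, while the index names and sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (Serial_Number TEXT, Date_Time TEXT, Col1 TEXT)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [
        ("SN001", "2011-01-05 08:00:00", "a"),
        ("SN001", "2011-02-10 09:30:00", "b"),
        ("SN002", "2011-02-11 10:00:00", "c"),
        ("SN003", "2011-03-01 11:00:00", "d"),
    ],
)

# One index for the outer lookup by serial number ...
conn.execute("CREATE INDEX IX_Table1_Serial ON Table1 (Serial_Number)")
# ... and one that contains both columns the subquery touches.
conn.execute("CREATE INDEX IX_Table1_Date_Serial ON Table1 (Date_Time, Serial_Number)")

subquery = (
    "SELECT DISTINCT Serial_Number FROM Table1 "
    "WHERE Date_Time BETWEEN ? AND ?"
)
window = ("2011-02-01 00:00:00", "2011-02-28 23:59:59")

serials = sorted(row[0] for row in conn.execute(subquery, window))
print(serials)

# Ask the engine how it intends to run the subquery.
for plan_row in conn.execute("EXPLAIN QUERY PLAN " + subquery, window):
    print(plan_row)
```

On SQL Server the equivalent check would be done with the execution plan rather than EXPLAIN QUERY PLAN.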
Note: I'm suggesting indexes, which don't require uniqueness, not primary keys.

Well, in the naive case where there are no indexes (which it sounds like is your case) you're going to have to scan over all the rows in Table1 to perform the DISTINCT on Serial_Number anyway. So I'm not sure it's going to help you much.
I would highly recommend the following:
Use the execution plan to determine what's going on in your query, and
Use that information to add some relevant indexes to speed your operations.
Just from what we see here, it sounds like Date_Time would be a good candidate for a clustered index in Table1.
Edit:
To make a nonunique clustered index as I describe above, you can use the following:
CREATE CLUSTERED INDEX IX_Table1_Date_Time
ON Table1 (Date_Time)
(from http://msdn.microsoft.com/en-us/library/aa258260(v=sql.80).aspx)
This will reorder your table such that all rows are sorted in Date_Time order. Further work with the execution plan will help identify other indexes that may greatly help your performance, depending on the exact types of queries you run.

Honestly, I see no benefit to the WHERE clause as it is written.
You use an expensive inner query, but don't do anything meaningful with the results. I don't even see you getting the Serial_Number in the results anywhere. However, based on your question, it does sound like you need it.
I don't see the need for the DISTINCT keyword for Serial_Number, since the duplicates would not be eliminated in the results in the outer query.
What is wrong with doing this?
SELECT Serial_Number, Col1, Col2, Col3
FROM Table1
WHERE Date_Time BETWEEN @startDate AND @endDate
This should do the same thing as your original query. But it would eliminate the expensive nested query.
Just put an index on Date_Time and it should work. This would also eliminate the need for the index on Serial_Number.
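Whether the two queries really return the same rows depends on the data: the nested form returns every record for a serial number that appears in the window, including records dated outside it. The sketch below uses SQLite as a stand-in and invented rows, so treat it as a check to repeat against the real table rather than a statement about it.

```python
import sqlite3

# Hypothetical sample data: SN001 has one record before the window and one inside it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (Serial_Number TEXT, Date_Time TEXT, Col1 TEXT)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [
        ("SN001", "2011-01-05 08:00:00", "a"),
        ("SN001", "2011-02-10 09:30:00", "b"),
        ("SN002", "2011-02-11 10:00:00", "c"),
        ("SN003", "2011-03-01 11:00:00", "d"),
    ],
)
window = ("2011-02-01 00:00:00", "2011-02-28 23:59:59")

# Original form: all records for serial numbers seen in the window.
nested = conn.execute(
    "SELECT Serial_Number, Date_Time FROM Table1 "
    "WHERE Serial_Number IN ("
    "  SELECT DISTINCT Serial_Number FROM Table1 "
    "  WHERE Date_Time BETWEEN ? AND ?) "
    "ORDER BY Serial_Number, Date_Time",
    window,
).fetchall()

# Simplified form: only records whose own Date_Time is in the window.
flat = conn.execute(
    "SELECT Serial_Number, Date_Time FROM Table1 "
    "WHERE Date_Time BETWEEN ? AND ? "
    "ORDER BY Serial_Number, Date_Time",
    window,
).fetchall()

print(len(nested), len(flat))
```

If every record for a serial number always falls inside the reporting window, the two results coincide and the simpler query is enough.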

Apparently, there is no way to tell what the maximum length of the WHERE X IN (...) can be.
For now, this is the answer.
If, at some later point in time, someone comes along and finds something to the contrary, please post that answer and I will mark it as such.
Thanks,
Joe
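If the serial numbers are passed in as an explicit list rather than a subquery, one way to stay clear of whatever ceiling a given server has is to send the list in batches. The sketch below is only an illustration: SQLite stands in for SQL Server, the batch size of 10 is arbitrary, and the chunks helper and sample rows are hypothetical. It also leaves a natural place to update a progress bar between batches.

```python
import sqlite3


def chunks(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (Serial_Number TEXT, Date_Time TEXT, Col1 TEXT)")

# Hypothetical data: 25 serial numbers with 2 records each.
serials = [f"SN{i:03d}" for i in range(25)]
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [(sn, "2011-02-01 08:00:00", f"reading-{n}") for sn in serials for n in range(2)],
)

fetched = []
for batch in chunks(serials, 10):  # batch size is an arbitrary choice
    placeholders = ", ".join("?" for _ in batch)
    sql = (
        "SELECT Serial_Number, Col1 FROM Table1 "
        f"WHERE Serial_Number IN ({placeholders})"
    )
    fetched.extend(conn.execute(sql, batch).fetchall())
    # A progress bar update could go here, once per batch.

print(len(fetched))
```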

Related

Index for join query with where clause PostgreSQL

I have to optimize the following query with the help of indexes.
SELECT f.*
FROM first f
JOIN second s on f.attributex_id = s.id
WHERE f.attributex_id IS NOT NULL AND f.attributey_id IS NULL
ORDER BY s.month ASC LIMIT 100;
Further infos:
attributex_id is a foreign key pointing to second.id
attributey_id is a foreign key pointing to another table not used in the query
Changing the query is not an option
For most entries in first (98%), f.attributex_id IS NOT NULL will be true. The same goes for the second condition, f.attributey_id IS NULL.
I tried to add an index as follows.
CREATE INDEX index_for_first
ON first (attributex_id, attributey_id)
WHERE attributex_id IS NOT NULL AND (attributey_id IS NULL)
But the index is not used (checked via Explain Analyze) when executing the query. What kind of indexes would I need to optimize the query and what am I doing wrong with the above index?
Does an index on s.month make sense, too (month is unique)?
Based on the query text and the fact that nearly all records in first satisfy the where clause, what you're essentially trying to do is
identify the 100 second records with the lowest month value
output the contents of the related records in the first table.
To achieve that you can create indexes on
second.month
first.attributex_id
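The two indexes listed above could be created as in the sketch below. SQLite stands in for PostgreSQL, the index names and rows are invented, the table names are quoted only to be safe with keywords, and LIMIT is reduced from 100 to 2 so that the tiny sample shows the cut-off.

```python
import sqlite3

# SQLite is a stand-in for PostgreSQL; index names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE "second" (id INTEGER PRIMARY KEY, month INTEGER);
    CREATE TABLE "first" (
        id INTEGER PRIMARY KEY,
        attributex_id INTEGER REFERENCES "second"(id),
        attributey_id INTEGER
    );
    -- The two indexes suggested in the answer.
    CREATE INDEX index_second_month ON "second" (month);
    CREATE INDEX index_first_attributex ON "first" (attributex_id);

    INSERT INTO "second" (id, month) VALUES (1, 3), (2, 1), (3, 2);
    INSERT INTO "first" (id, attributex_id, attributey_id) VALUES
        (10, 1, NULL),
        (11, 2, NULL),
        (12, 3, NULL),
        (13, 2, 7),
        (14, NULL, NULL);
    """
)

rows = conn.execute(
    """
    SELECT f.*
    FROM "first" f
    JOIN "second" s ON f.attributex_id = s.id
    WHERE f.attributex_id IS NOT NULL AND f.attributey_id IS NULL
    ORDER BY s.month ASC LIMIT 2
    """
).fetchall()

ids = [row[0] for row in rows]
print(ids)
```

Whether PostgreSQL actually picks these indexes still has to be confirmed with EXPLAIN ANALYZE on the real data.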
Caveats
Since this query must be optimized, it's safe to say there are many rows in both tables. Since there are only 12 months in the year, the output of the query is probably not deterministic (i.e., it may return a different set of rows each time it's run, even if there is no activity in either table between runs) since many records likely share the same value for month. Adding "tie breaker" column(s) to the index on second may help, though your order by only includes month, so no guarantees. Also, if second.month can have null values, you'll need to decide whether those null values should collate first or last among values.
Also, this particular query is not the only one being run against your data. These indexes will take up disk space and incrementally slow down writes to the tables. If you have a dozen queries that perform poorly, you might fall into a trap of creating a couple indexes to help each one individually and that's not a solution that scales well.
Finally, you stated that
changing the query is not an option
Does that mean you're not allowed to change the text of the query, or the output of the query?
I personally feel like re-writing the query to select from second and then join first makes the goal of the query more obvious. The fact that your initial instinct was to add indexes to first lends credence to this idea. If the query were written as follows, it would have been more obvious that the thing to do is facilitate efficient access to the tiny set of rows in second that you're interested in:
...
from second s
join first f ...
where ...
order by s.month asc limit 100;

Speeding up aggregations for a large table in Oracle

I am trying to see how to improve performance for aggregation queries in an Oracle database. The system is used to run financial series simulations.
Here is the simplified set-up:
The first table table1 has the following columns
date | id | value
It is read-only, has about 100 million rows and is indexed on id, date
The second table table2 is generated by the application according to user input, is relatively small (300K rows) and has this layout:
id | start_date | end_date | factor
After the second table is generated, I need to compute totals as follows:
select date, sum(value * nvl(factor,1)) as total
from table1
left join table2 on table1.id = table2.id
and table1.date between table2.start_date and table2.end_date
group by date
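To make the intended semantics of the query above concrete, here is a small worked example. It is only a sketch: SQLite stands in for Oracle, COALESCE replaces NVL, and the rows are invented. Values with no matching range in table2 keep a factor of 1.

```python
import sqlite3

# SQLite stands in for Oracle; COALESCE is used in place of NVL.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (date TEXT, id INTEGER, value REAL);
    CREATE TABLE table2 (id INTEGER, start_date TEXT, end_date TEXT, factor REAL);

    -- Hypothetical rows: two series, two dates.
    INSERT INTO table1 VALUES
        ('2020-01-01', 1, 10.0),
        ('2020-01-01', 2, 5.0),
        ('2020-01-02', 1, 20.0),
        ('2020-01-02', 2, 4.0);

    -- Series 1 is scaled by 0.5 from 2020-01-02 onwards; series 2 has no factor.
    INSERT INTO table2 VALUES (1, '2020-01-02', '2020-01-31', 0.5);
    """
)

totals = conn.execute(
    """
    SELECT table1.date, SUM(value * COALESCE(factor, 1)) AS total
    FROM table1
    LEFT JOIN table2
      ON table1.id = table2.id
     AND table1.date BETWEEN table2.start_date AND table2.end_date
    GROUP BY table1.date
    ORDER BY table1.date
    """
).fetchall()

# 2020-01-01: 10*1 + 5*1 = 15; 2020-01-02: 20*0.5 + 4*1 = 14
print(totals)
```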
My issue is that this is slow, taking up to 20-30 minutes if the second table is particularly large. Is there a generic way to speed this up, perhaps trading off storage space and execution time, ideally, to achieve something running in under a minute?
I am not a database expert and have been reading Oracle performance tuning docs but was not able to find anything appropriate for this. The most promising idea I found was OLAP cubes, but I understand this would help only if my second table was fixed and I simply needed to apply different filters on the data.
First, to provide any real insight, you'd need to determine the execution plan that Oracle is producing for the slow query.
You say the second table is ~300K rows - yes that's small compared to 100M but since you have a range condition in the join between the two tables, it's hard to say how many rows from table1 are likely to be accessed in any given execution of the query. If a large proportion of the table is accessed, but the query optimizer doesn't recognize that, the index may actually be hurting instead of helping.
You might benefit from re-organizing table1 as an index-organized table, since you already have an index that covers most of the columns. But all I can say from the information so far is that it might help, but it might not.
Apart from indexes, also try the suggestions below. My two cents!
Try running this query with the PARALLEL hint so that it uses multiple processors: /*+ PARALLEL(table1,4) */.
NVL is being evaluated for millions of rows, and this will have some impact. Is there any way the data can be organised to avoid it?
When you know the dates in advance, you could probably divide this query into two chunks: first fetch the ids from TABLE2 using the start date and end date, then JOIN that result to TABLE1 using a view or temp table. That way the index (with id as its leading edge) is used optimally.
Thanks!

Index on VARCHAR column

I have a table of 32,589 rows, and one of the columns is called 'Location' and is a Varchar(40) column type. The column holds a location, which is actually a suburb, all uppercase text.
A function that uses this table does a:
IF EXISTS(SELECT * FROM MyTable WHERE Location = 'A Suburb')
...
Would it be beneficial to add an index to this column, for efficiency? This is more of a read-only table, so there are not many edits or inserts except for maintenance.
Without an index, SQL Server will have to perform a table scan to find the first instance of the location you're looking for. You might get lucky and have the value be in one of the first few rows, but it could be at row 32,000, which would be a waste of time. Adding an index only takes a few seconds and you'll probably see a big performance gain.
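As a rough sketch of that suggestion, the index and the existence check might look like this. SQLite stands in for SQL Server, and the index name, the location_exists helper, and the suburb names are invented for illustration.

```python
import sqlite3

# SQLite is a stand-in for SQL Server; names other than MyTable/Location are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (Location VARCHAR(40))")
conn.executemany(
    "INSERT INTO MyTable (Location) VALUES (?)",
    [("RICHMOND",), ("CARLTON",), ("FITZROY",), ("RICHMOND",)],
)

# A plain (non-unique) index on the column used in the WHERE clause.
conn.execute("CREATE INDEX IX_MyTable_Location ON MyTable (Location)")


def location_exists(connection, name):
    """Return True if at least one row has the given Location."""
    row = connection.execute(
        "SELECT EXISTS(SELECT 1 FROM MyTable WHERE Location = ?)", (name,)
    ).fetchone()
    return bool(row[0])


print(location_exists(conn, "RICHMOND"), location_exists(conn, "NOWHERE"))
```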
I concur with @Brian Shamblen's answer.
Also, try using TOP 1 in the inner select
IF EXISTS(SELECT TOP 1 * FROM MyTable WHERE Location = 'A Suburb')
You don't have to select all the records matching your criteria for EXISTS, one is enough.
An opportunistic approach to performance tuning is usually a bad idea.
To answer the specific question - if your function is using location in a where clause, and the table has more than a few hundred rows, and the values in the location column are not all identical, creating an index will speed up your function.
Whether you notice any difference is hard to say - there may be much bigger performance problems lurking in the database, and you might be fixing the wrong problem.

Should I use a unique ID for a row in a junction table?

I am using SQL Server 2008.
A while back, I asked the question "should I use RecordID in a junction table". The tables would look like this:
// Images
ImageID // pk
// Persons
PersonID // pk
// Images_Persons
RecordID // pk
ImageID // fk
PersonID // fk
I was strongly advised NOT to use RecordID because it's useless in a table where the two IDs create a unique combination, meaning there will be no duplicate records.
Now, I am trying to find a random record in the junction table to create a quiz. I want to pull the first id and see if someone can match the second id. Specifically, I grab a random image and display it with three possible choices of persons.
The following query works, but I've read quite a bit of negativity suggesting that it's very slow. My database might have 10,000 records, so I don't think that matters much. I've also read that the values generated aren't truly random.
SELECT TOP 1 * FROM Images_Persons ORDER BY newid();
Should I add the RecordID column or not? Is there a better way to find a random record in this case?
Previous questions for reference
Should I use "RecordID" as a column name?
SQL - What is the best table design to store people as musicians and artists?
NEWID is random enough and probably best
10k rows is peanuts
You don't need a surrogate key for a junction (link, many-many) table
Edit: in case you want to prematurely optimise...
You could ignore this and read these from @Mitch Wheat. But with just 10k rows your development time will be longer than any saved execution time.
Efficiently select random rows from large resultset with LINQ (ala TABLESAMPLE)
Efficiently randomize (shuffle) data in Sql Server table
Personally, I don't think that having the RecordID column should be advised AGAINST. Rather I'd advise that often it is UNNECESSARY.
There are cases where having a single value to identify a row makes for simpler code. But they're at the cost of additional storage, often additional indexes, etc. The overheads realistically are small, but so are the benefits.
In terms of the selection of random records, the existence of a single unique identifier can make the task easier if the identifiers are both sequential and consecutive.
The reason I say this is because your proposed solution requires the assignment of NEWID() to every record, and the sorting of all records to find the first one. As the table size grows this operation grows, and can become relatively expensive. Whether it's expensive enough to be worth optimising depends on whatever else is happening, how often, etc.
Where there are sequential consecutive unique identifiers, however, one can then choose a random value between MIN(id) and MAX(id), and then SEEK that value out. The requirement that all values are consecutive, however, is often a constraint too far; you're never allowed to delete a value mid-table, for example...
To overcome this, and depending on indexes, you may find the following approach useful.
DECLARE
@max_id INT
SELECT
@max_id = COUNT(*)
FROM
Images_Persons
SELECT
*
FROM
(
SELECT
*,
ROW_NUMBER() OVER (ORDER BY ImageID, PersonID) AS id
FROM
Images_Persons
)
AS data
WHERE
data.id = CAST(@max_id * RAND() + 1 AS INT)
-- Assuming that `ImageID, PersonID` is the clustered index.
A downside here is that RAND() is notoriously poor at being truly random. Yet it is normally perfectly suitable if executed at a random time relative to any other call to RAND().
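The same count-then-pick idea can be sketched outside T-SQL. This is a swapped-in variant, not the query above: SQLite stands in for SQL Server, LIMIT/OFFSET replaces ROW_NUMBER(), Python's random module replaces RAND(), and the random_pair helper and sample rows are hypothetical.

```python
import random
import sqlite3

# SQLite is a stand-in for SQL Server; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Images_Persons ("
    " ImageID INTEGER, PersonID INTEGER,"
    " PRIMARY KEY (ImageID, PersonID))"
)
pairs = [(1, 100), (1, 101), (2, 100), (3, 102), (4, 103)]
conn.executemany("INSERT INTO Images_Persons VALUES (?, ?)", pairs)


def random_pair(connection, rng):
    """Pick one (ImageID, PersonID) row at a random position in key order."""
    (total,) = connection.execute("SELECT COUNT(*) FROM Images_Persons").fetchone()
    if total == 0:
        return None  # nothing to pick from
    offset = rng.randrange(total)  # 0 .. total - 1
    return connection.execute(
        "SELECT ImageID, PersonID FROM Images_Persons "
        "ORDER BY ImageID, PersonID LIMIT 1 OFFSET ?",
        (offset,),
    ).fetchone()


print(random_pair(conn, random.Random()))
```

No surrogate RecordID is needed for this; the composite key supplies the ordering.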
Consider what you've got.
SELECT TOP 1 * FROM Images_Persons ORDER BY newid();
Not truly random? Excluding the 'truly random is impossible' bit, you're probably right - I believe that there are patterns in generated uniqueidentifiers. But you should test this yourself. It'd be simple; just create a table with 1 to 100 in it, order by newid() a lot of times, and look at the results. If it's random 'enough' for you (which it probably will be, for a quiz) then it's good enough.
Very slow? I wouldn't worry about that. I'd be very surprised if the newid() is slower than reading the record from the table. But again, test and benchmark.
I'd be happy with the solution you have, pending tests if you're concerned about it.
I've always used order by newid().

What's the most efficient way to check the presence of a row in a table?

Say I want to check if a record in a MySQL table exists. I'd run a query, check the number of rows returned. If 0 rows do this, otherwise do that.
SELECT * FROM table WHERE id=5
SELECT id FROM table WHERE id=5
Is there any difference at all between these two queries? Is effort spent in returning every column, or is effort spent in filtering out the columns we don't care about?
SELECT COUNT(*) FROM table WHERE id=5
Is a whole new question. Would the server grab all the values and then count the values (harder than usual), or would it not bother grabbing anything and just increment a variable each time it finds a match (easier than usual)?
I think I'm making a lot of false assumptions about how MySQL works, but that's the meat of the question! Where am I wrong? Educate me, Stack Overflow!
Optimizers are pretty smart (generally). They typically only grab what they need so I'd go with:
SELECT COUNT(1) FROM mytable WHERE id = 5
The most explicit way would be
SELECT CASE WHEN EXISTS (SELECT 1 FROM table WHERE id = 5) THEN 1 ELSE 0 END
If there is an index on (or starting with) id, it will only search, with maximum efficiency, for the first entry in the index it can find with that value. It won't read the record.
If you SELECT COUNT(*) (or COUNT anything else) it will, under the same circumstances, count the index entries, but not read the records.
If you SELECT *, it will read all the records.
Limit your results to at most one row by appending LIMIT 1, if all you want to do is check the presence of a record.
SELECT id FROM table WHERE id=5 LIMIT 1
This will definitely ensure that no more than one row is returned or processed. In my experience, LIMIT 1 (or TOP 1 depending on the DB) to check for existence of a row makes a big difference in terms of performance for large tables.
EDIT: I think I misread your question, but I'll leave my answer here anyway if it's of any help.
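For comparison, the three existence checks discussed so far (LIMIT 1, EXISTS, and COUNT) can be put side by side. This is a sketch only: SQLite stands in for MySQL, the table is called items because table is a reserved word, and the helper names are hypothetical. It shows that the three forms agree on the answer, not how they compare on speed.

```python
import sqlite3

# SQLite is a stand-in for MySQL; `items` and the helper names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(5, "five"), (6, "six")])


def exists_via_limit(connection, item_id):
    """Fetch at most one row and test whether anything came back."""
    row = connection.execute(
        "SELECT 1 FROM items WHERE id = ? LIMIT 1", (item_id,)
    ).fetchone()
    return row is not None


def exists_via_exists(connection, item_id):
    """Let the database return an explicit 1 or 0."""
    (flag,) = connection.execute(
        "SELECT CASE WHEN EXISTS (SELECT 1 FROM items WHERE id = ?) "
        "THEN 1 ELSE 0 END",
        (item_id,),
    ).fetchone()
    return flag == 1


def exists_via_count(connection, item_id):
    """Count matching rows and compare with zero."""
    (count,) = connection.execute(
        "SELECT COUNT(*) FROM items WHERE id = ?", (item_id,)
    ).fetchone()
    return count > 0


for check in (exists_via_limit, exists_via_exists, exists_via_count):
    print(check.__name__, check(conn, 5), check(conn, 42))
```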
I would think this
SELECT null FROM table WHERE id = 5 LIMIT 1;
would be faster than this
SELECT 1 FROM table WHERE id = 5 LIMIT 1;
but the timer says the winner is "SELECT 1".
For the first two queries, most people will generally say: always specify exactly what you need and leave the rest. The cost isn't only server effort; bandwidth is also spent returning data that you aren't going to do anything with.
As for the count, the previous answer will do for your result set, unless you're working with a language or interface that exposes an affected-rows value. That can sometimes be used to find out how many rows were returned by the last query. You'll need to look at your interface documentation as to how to get that information.
The difference between your 3 queries depends on how you've built your index. Only returning the primary key is likely to be faster as MySQL will have your index in memory, and not have to hit disk. Adding the LIMIT 1 is also a good trick that will speed up the optimizer significantly in early 5.0.x branches and earlier.
Try EXPLAIN SELECT id FROM table WHERE id=5 and check the Extra column for the presence of Using index. If it's there, then your query is coming straight from the index, and is going to be much faster.