What is the fastest way to perform a date query in Oracle SQL? - sql

We have a 6B row table that is giving us challenges when retrieving data.
Our query returns values instantly when doing a...
SELECT * WHERE Event_Code = 102225120
That type of instant result is exactly what we need. We now want to filter to receive values for just a particular year - but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row. There are other approaches like using TRUNC, or BETWEEN, or specifying the datetime in YYYY-MM-DD format for doing comparisons.
Of note, we do not have the option to add indexes to the database as it is a vendor's database.
What is the way to add a date filtering query and enable Oracle to begin streaming the results back in the fastest way possible?

Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row
That question is quite different from yours. Firstly, your statement above applies to any data type, not only dates. Also the word many is relative to the number of records in the table. If the optimizer decides that the query will return many of all records in your table, then it may decide that a full scan of the table is faster than using the index. In your situation, this translates to how many records are in 2017 out of all records in the table? This calculation gives you the cardinality of your query which then gives you an idea if an index will be faster or not.
Now, if you decide that an index will be faster, based on the above, the next step is to know how to build your index. In order for the optimizer to use the index, it must match the condition that you're using. You are not comparing dates in your query, you are only comparing the year part. So an index on the date column will not be used by this query. You need to create an index on the year part, so use the same condition to create the index.
we do not have the option to add indexes to the database as it is a vendor's database.
If you cannot modify the database, there is no way to optimize your query. You need to talk to the vendor and get access to modify the database or ask them to add the index for you.

A function can also cause slowness for the number of records involved. Not sure if Function Based Index can help you for this, but you can try.
Had you tried to add a year column in the table? If not, try to add a year column and update it using code below.
UPDATE table
SET year = EXTRACT(YEAR FROM PERFORMED_DATE_TIME);
This will take time though.
But after this, you can run the query below.
SELECT *
FROM table
WHERE Event_Code = 102225120 AND year = 2017;
Also, try considering Table Partitioned for this big data. For starters, see link below,
link: https://oracle-base.com/articles/8i/partitioned-tables-and-indexes

Your question is a bit ambiguous IMHO:
but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Do you mean that
SELECT * WHERE Event_Code = 102225120
is fast, but
SELECT * WHERE Event_Code = 102225120 AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
is slow???
For starters I'll agree with Mitch Wheat that you should try to use PERFORMED_DATE_TIME between Jan 1, 2017 and Dec 31, 2017 instead of Year(field) = 2017. Even if you'd have an index on the field, the latter would hardly be able to make use of it while the first method would benefit enormously.
I'm also hoping you want to be more specific than just 'give me all of 2017' because returning over 1B rows is NEVER going to be fast.
Next, if you can't make changes to the database, would you be able to maintain a 'shadow' in another database? This would require that you create a table with all date-values AND the PK of the original table in another database and query those to find the relevant PK values and then JOIN those back to your original table to find whatever you need. The biggest problem with this would be that you need to keep the shadow in sync with the original table. If you know the original table only changes overnight, you could merge the changes in the morning and query all day. If the application is 'real-time(ish)' then this probably won't work without some clever thinking... And yes, your initial load of 6B values will be rather heavy =)

May this could be usefull (because you avoid functions (a cause for context switching) and if you have an index on your date field, it could be used) :
with
dt as
(
select
to_date('01/01/2017', 'DD/MM/YYYY') as d1,
to_date('31/01/2017', 'DD/MM/YYYY') as d2
from dual
),
dates as
(
select
dt.d1 + rownum -1 as d
from dt
connect by dt.d1 + rownum -1 <= dt.d2
)
select *
from your_table, dates
where dates.d = PERFORMED_DATE_TIME

Move the date literal to RHS:
AND PERFORMED_DATE_TIME >= date '2017-01-01'
AND PERFORMED_DATE_TIME < date '2018-01-01'
But without an (undisclosed) appropriate index on PERFORMED_DATE_TIME, the query is unlikely to be any faster.
One option to create indexes in third party databases is to script in the index and then before any vendor upgrade run a script to remove any indexes you've added. If the index is important, ask the vendor to add it to their database design.

Related

Index for join query with where clause PostgreSQL

I have to optimize the following query with the help of indexes.
SELECT f.*
FROM first f
JOIN second s on f.attributex_id = s.id
WHERE f.attributex_id IS NOT NULL AND f.attributey_id IS NULL
ORDER BY s.month ASC LIMIT 100;
Further infos:
attributex_id is a foreign key pointing to second.id
attributey_id is a foreign key pointing to another table not used in the query
Changing the query is not an option
Most entries (98%) in first the following will be true f.attributex_id IS NOT NULL. Same for the second condition f.attributey_id IS NULL
I tried to add as index as follows.
CREATE INDEX index_for_first
ON first (attributex_id, attributey_id)
WHERE attributex_id IS NOT NULL AND (attributey_id IS NULL)
But the index is not used (checked via Explain Analyze) when executing the query. What kind of indexes would I need to optimize the query and what am I doing wrong with the above index?
Does an index on s.month make sense, too (month is unique)?
Based on the query text and the fact that nearly all records in first satisfy the where clause, what you're essentially trying to do is
identify the 100 second records with the lowest month value
output the contents of the related records in the first table.
To achieve that you can create indexes on
second.month
first.attributex_id
Caveats
Since this query must be optimized, it's safe to say there are many rows in both tables. Since there are only 12 months in the year, the output of the query is probably not deterministic (i.e., it may return a different set of rows each time it's run, even if there is no activity in either table between runs) since many records likely share the same value for month. Adding "tie breaker" column(s) to the index on second may help, though your order by only includes month, so no guarantees. Also, if second.month can have null values, you'll need to decide whether those null values should collate first or last among values.
Also, this particular query is not the only one being run against your data. These indexes will take up disk space and incrementally slow down writes to the tables. If you have a dozen queries that perform poorly, you might fall into a trap of creating a couple indexes to help each one individually and that's not a solution that scales well.
Finally, you stated that
changing the query is not an option
Does that mean you're not allowed to change the text of the query, or the output of the query?
I personally feel like re-writing the query to select from second and then join first makes the goal of the query more obvious. The fact that your initial instinct was to add indexes to first lends credence to this idea. If the query were written as follows, it would have been more obvious that the thing to do is facilitate efficient access to the tiny set of rows in second that you're interested in:
...
from second s
join first f ...
where ...
order by s.month asc limit 100;

Creating index on timestamp column for query which uses year function

I have a HISTORY table with 9 million records. I need to find year-wise, month-wise records created. I was using query no 1, However it timed out several times.
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
GROUP BY
year(created), MONTHNAME(created);
I decided to add where year(created), this time the query took 30 mins (yes it takes so long) to execute.
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
WHERE
year(created) = 2010
GROUP BY
year(created), MONTHNAME(created) ;
I was planning to add an index on created timestamp column, however before doing so, I need the opinion (since its going to take a long time to index such a huge table).
Will adding an index on created(timestamp) column improve performance, considering year function is used on the column?
An index won't really help because you have formed the query such that it must perform a complete table scan, index or no index. You have to form the where clause so it is in the form:
where field op constant
where field is, of course, your field; op is = <= => <> between in, etc. and constant is either a direct constant, 42, or an operation that can be executed once and the result cached, getdate().
Like this:
where created >= DateFromParts( #year, 1, 1 )
and created < DateFromParts( #year + 1, 1, 1 )
The DateFromParts function will generate a value which remains in effect for the duration of the query. If created is indexed, now the optimizer will be able to seek to exactly where the correct dates start and tell when the last date in the range has been processed and it can stop. You can keep year(created) everywhere else -- just get rid of it from the where clause.
This is called sargability and you can google all kinds of good information on it.
P.S. This is in Sql Server format but you should be able to calculate "beginning of specified year" and "beginning of year after specified year" in whatever DBMS you're using.
An index will be used, when it helps narrow down the number of rows read.
It will also be used, when it avoids reading the table at all. This is the case, when the index contains all the columns referenced in the query.
In your case the only column referenced is created, so adding an index on this column should help reducing the necessary reads and improve the overall runtime of your query. However, if created is the only column in the table, the index won't change anything in the first query, because it doesn't reduce the number of pages to be read.
Even with a large table, you can test, if an index makes a difference. You can copy only part of the rows to a new table and compare the execution plans on the new table with and without an index, e.g.
insert into testhistory
select *
from history
fetch first 100000 rows only
You want what's known as a Calendar Table (the particular example uses SQL Server, but the solution should be adaptable). Then, you want lots of indices on it (since writes are few, and this is a primary dimension table for analysis).
Assuming you have a minimum Calendar Table that looks like this:
CREATE TABLE Calendar (isoDate DATE,
dayOfMonth INTEGER,
month INTEGER,
year INTEGER);
... with an index over [dayOfMonth, month, year, isoDate], your query can be re-written like this:
SELECT Calendar.year, Calendar.month,
COUNT(*) AS ymCount
FROM Calendar
JOIN History
ON History.created >= Calendar.isoDate
AND History.created < Calendar.isoDate + 1 MONTH
WHERE Calendar.dayOfMonth = 1
GROUP BY Calendar.year, Calendar.month
The WHERE Calendar.dayOfMonth = 1 is automatically limiting results to 12-per-year. The start of the range is trivially located with the index (given the SARGable data), and the end of the range as well (yes, doing math on a column generally disqualifies indices... on the side the math is used. If the optimizer is at all smart it's going to going to gen a virtual intermediate table containing the start/end of range).
So, index-based (and likely index-only) access for the query. Learn to love indexed dimension tables, that can be used for range queries (Calendar Tables being one of the most useful).
I'll assume you are using SQL Server based on your tags.
Yes, the index will make your query faster.
I recommend only using the 'created' column as a key for the index and to not include any additional columns from the History table because they will be unused and only result in more reads than what is necessary.
And of course, be mindful when you create indexes on tables that have a lot of INSERT, UPDATE, DELETE activity as your new index will make these actions more expensive when being performed on the table.
As been stated before, in your case, an index won't be used because the index is created on the column 'created' and you are querying on 'year(created)'.
What you can do is add two generated columns year_gen = year(create) and month_gen = MONTHNAME(created) to your table and index these two columns. The DB2 Query Optimizer will automatically use these two generated columns and it will also use the indices created on these columns.
The code should be something like (but not 100% sure since I have no DB2 to test)
SET INTEGRITY FOR HISTORY OFF CASCADE DEFERRED #
ALTER TABLE HISTORY ADD COLUMN YEAR_GEN SMALLINT GENERATED ALWAYS AS (YEAR(CREATE)),
ADD COLUMN MONTH_GEN VARCHAR(20) GENERATED ALWAYS AS (YEAR(CREATE)) #
SET INTEGRITY FOR HISTORY IMMEDIATE CHECKED FORCE GENERATED #
CREATE INDEX HISTORY_YEAR_IDX ON HISTORY YEAR_GEN ASC CLUSTER #
CREATE INDEX HISTORY_MONTH_IDX ON HISTORY YEAR_GEN ASC #
Just a sidenote: the set integrity off is mandatory to add generated columns. Your table is inaccessible untill you reset the integrity to checked and you force the re-calculation of the generated columns (this might take a while in your case).
Setting integrity off without cascade deferred will set every table with a foreign key to the HISTORY table to OFF too. You will have to manually reset the integrity of these tables too. If I remember correctly, using cascade deferred in combination with incomming foreign keys may cause DB2 to set the integrity of your table to 'checked by user'.

Writing Efficient Queries in SAS Using Proc sql with Teradata

EDIT: Here is a more complete set of code that shows exactly what's going on per the answer below.
libname output '/data/files/jeff'
%let DateStart = '01Jan2013'd;
%let DateEnd = '01Jun2013'd;
proc sql;
CREATE TABLE output.id AS (
SELECT DISTINCT id
FROM mydb.sale_volume AS sv
WHERE sv.category IN ('a', 'b', 'c') AND
sv.trans_date BETWEEN &DateStart AND &DateEnd
)
CREATE TABLE output.sums AS (
SELECT id, SUM(sales)
FROM mydb.sale_volue AS sv
INNER JOIN output.id AS ids
ON ids.id = sv.id
WHERE sv.trans_date BETWEEN &DateStart AND &DateEnd
GROUP BY id
)
run;
The goal is to simply query the table for some id's based on category membership. Then I sum these members' activity across all categories.
The above approach is far slower than:
Running the first query to get the subset
Running a second query the sums every ID
Running a third query that inner joins the two result sets.
If I'm understanding correctly, it may be more efficient to make sure that all of my code is completely passed through rather than cross-loading.
After posting a question yesterday, a member suggested I might benefit from asking a separate question on performance that was more specific to my situation.
I'm using SAS Enterprise Guide to write some programs/data queries. I don't have permissions to modify the underlying data, which is stored in 'Teradata'.
My basic problem is writing efficient SQL queries in this environment. For example, I query a large table (with tens of millions of records) for a small subset of ID's. Then, I use this subset to query the larger table again:
proc sql;
CREATE TABLE subset AS (
SELECT
id
FROM
bigTable
WHERE
someValue = x AND
date BETWEEN a AND b
)
This works in a matter of seconds and returns 90k ID's. Next, I want to query this set of ID's against the big table, and problems ensue. I'm wanting to sum values over time for the ID's:
proc sql;
CREATE TABLE subset_data AS (
SELECT
bigTable.id,
SUM(bigTable.value) AS total
FROM
bigTable
INNER JOIN subset
ON subset.id = bigTable.id
WHERE
bigTable.date BETWEEN a AND b
GROUP BY
bigTable.id
)
For whatever reason, this takes a really long time. The difference is that the first query flags 'someValue'. The second looks at all activity, regardless of what's in 'someValue'. For example, I could flag every customer who orders a pizza. Then I would look at every purchase for all customers who ordered pizza.
I'm not overly familiar with SAS so I'm looking for any advice on how to do this more efficiently or speed things up. I'm open to any thoughts or suggestions and please let me know if I can offer more detail. I guess I'm just surprised the second query takes so long to process.
The most critical thing to understand when using SAS to access data in Teradata (or any other external database for that matter) is that the SAS software prepares SQL and submits it to the database. The idea is to try and relieve you (the user) from all the database specific details. SAS does this using a concept called "implict pass-through", which just means that SAS does the translation from SAS code into DBMS code. Among the many things that occur is data type conversion: SAS only has two (and only two) data types, numeric and character.
SAS deals with translating things for you but it can be confusing. For example, I've seen "lazy" database tables defined with VARCHAR(400) columns having values that never exceed some smaller length (like column for a person's name). In the data base this isn't much of a problem, but since SAS does not have a VARCHAR data type, it creates a variable 400 characters wide for each row. Even with data set compression, this can really make the resulting SAS dataset unnecessarily large.
The alternative way is to use "explicit pass-through", where you write native queries using the actual syntax of the DBMS in question. These queries execute entirely on the DBMS and return results back to SAS (which still does the data type conversion for you. For example, here is a "pass-through" query that performs a join to two tables and creates a SAS dataset as a result:
proc sql;
connect to teradata (user=userid password=password mode=teradata);
create table mydata as
select * from connection to teradata (
select a.customer_id
, a.customer_name
, b.last_payment_date
, b.last_payment_amt
from base.customers a
join base.invoices b
on a.customer_id=b.customer_id
where b.bill_month = date '2013-07-01'
and b.paid_flag = 'N'
);
quit;
Notice that everything inside the pair of parentheses is native Teradata SQL and that the join operation itself is running inside the database.
The example code you have shown in your question is NOT a complete, working example of a SAS/Teradata program. To better assist, you need to show the real program, including any library references. For example, suppose your real program looks like this:
proc sql;
CREATE TABLE subset_data AS
SELECT bigTable.id,
SUM(bigTable.value) AS total
FROM TDATA.bigTable bigTable
JOIN TDATA.subset subset
ON subset.id = bigTable.id
WHERE bigTable.date BETWEEN a AND b
GROUP BY bigTable.id
;
That would indicate a previously assigned LIBNAME statement through which SAS was connecting to Teradata. The syntax of that WHERE clause would be very relevant to if SAS is even able to pass the complete query to Teradata. (You example doesn't show what "a" and "b" refer to. It is very possible that the only way SAS can perform the join is to drag both tables back into a local work session and perform the join on your SAS server.
One thing I can strongly suggest is that you try to convince your Teradata administrators to allow you to create "driver" tables in some utility database. The idea is that you would create a relatively small table inside Teradata containing the ID's you want to extract, then use that table to perform explicit joins. I'm sure you would need a bit more formal database training to do that (like how to define a proper index and how to "collect statistics"), but with that knowledge and ability, your work will just fly.
I could go on and on but I'll stop here. I use SAS with Teradata extensively every day against what I'm told is one of the largest Teradata environments on the planet. I enjoy programming in both.
You imply an assumption that the 90k records in your first query are all unique ids. Is that definite?
I ask because the implication from your second query is that they're not unique.
- One id can have multiple values over time, and have different somevalues
If the ids are not unique in the first dataset, you need to GROUP BY id or use DISTINCT, in the first query.
Imagine that the 90k rows consists of 30k unique ids, and so have an average of 3 rows per id.
And then imagine those 30k unique ids actually have 9 records in your time window, including rows where somevalue <> x.
You will then get 3x9 records back per id.
And as those two numbers grow, the number of records in your second query grows geometrically.
Alternative Query
If that's not the problem, an alternative query (which is not ideal, but possible) would be...
SELECT
bigTable.id,
SUM(bigTable.value) AS total
FROM
bigTable
WHERE
bigTable.date BETWEEN a AND b
GROUP BY
bigTable.id
HAVING
MAX(CASE WHEN bigTable.somevalue = x THEN 1 ELSE 0 END) = 1
If ID is unique and a single value, then you can try constructing a format.
Create a dataset that looks like this:
fmtname, start, label
where fmtname is the same for all records, a legal format name (begins and ends with a letter, contains alphanumeric or _); start is the ID value; and label is a 1. Then add one row with the same value for fmtname, a blank start, a label of 0, and another variable, hlo='o' (for 'other'). Then import into proc format using the CNTLIN option, and you now have a 1/0 value conversion.
Here's a brief example using SASHELP.CLASS. ID here is name, but it can be numeric or character - whichever is right for your use.
data for_fmt;
set sashelp.class;
retain fmtname '$IDF'; *Format name is up to you. Should have $ if ID is character, no $ if numeric;
start=name; *this would be your ID variable - the look up;
label='1';
output;
if _n_ = 1 then do;
hlo='o';
call missing(start);
label='0';
output;
end;
run;
proc format cntlin=for_fmt;
quit;
Now instead of doing a join, you can do your query 'normally' but with an additional where clause of and put(id,$IDF.)='1'. This won't be optimized with an index or anything, but it may be faster than the join. (It may also not be faster - depends on how the SQL optimizer is working.)
If the id is unique you might add a UNIQUE PRIMARY INDEX(id) to that table, otherwise it defaults to a Non-unique PI.
Knowing about uniquenes helps the optimizer to produce a better plan.
Without more info like an Explain (just put EXPLAIN in front of the SELECT) it's hard to tell how this can be improved.
One alternate solution is to use SAS procedures. I don't know what your actual SQL is doing, but if you're just doing frequencies (or something else that can be done in a PROC), you could do:
proc sql;
create view blah as select ... (your join);
quit;
proc freq data=blah;
tables id/out=summary(rename=count=total keep=id count);
run;
Or any number of other options (PROC MEANS, PROC TABULATE, etc.). That may be faster than doing the sum in SQL (depending on some details, such as how your data is organized, what you're actually doing, and how much memory you have available). It has the added benefit that SAS might choose to do this in-database, if you create the view in the database, which might be faster. (In fact, if you just run the freq off the base table, it's possible that would be even faster, and then join the results to the smaller table).

What is the fastest way to select rows from a huge database?

I have a huge database of more than 3 million rows (my users information), I need to select all users that have birthdays in the current day.
The birthday column is a text (e.g. '19/03' or '19/03/1975') with day and month and sometimes the years.
When I try to select rows with like of left functions it take more then a minute to return the results.
I've tried to use 3 int column for day, month and year and then make the selection but it toke longer to get the results.
Any idea on how to make it run faster?
I'm using SQL Server 2008
Thanks
As marc_s mentions, if at all possible, store this as a date type - it'll make it way faster for SQL Server to perform comparisons on, and it'll be way easier to maintain. Next up, make sure to put an index on that column, and consider including any extra columns if you're only looking up the birthday to select a small subset of the total row.
Finally - and this is a big one. TEXT is just about the worst data type you could choose. The way TEXT is stored, the data isn't actually stored on the page itself. Instead it leaves behind a 16-byte pointer to another page. This other page will then contain the data itself in a record. But it gets worse, that record will be a SMALL_ROOT datatype taking up 84 bytes of space when your data is between 0 and 64 bytes in length!
Thus, what could've been saved as an 8-byte datetime or a 4-byte date now takes up a total of 100 bytes, and causes an off-row lookup for each and every row. Basically the perfect storm for bad performance.
If you cannot change it to a more proper datetime, at the very least, change it to a varchar!
first of all save the date in a format that is supported by SQL Server something like DATE or DATETIME (in your case I am guessing DATE should be enough) once you have that you can use SQL functions like MONTH and DAY as follows and avoid complex string manipulation function like LEFT etc.
Your query will look like this:
select * from MyTable where MONTH(dateColumnA) = '1' && DAY(dateColumnB) ='7' --1 is for january
I am not sure if this will solve your performance problems entirely but you can run this query in SQL Query Analyzer and see what recommendation it throws with respect to indexes etc. I dont have a great deal of knowledge about indexes on Date type columns
Most of what I had to say has already been said: Use a DATE type to store the date, and make sure that it is indexed. If you're going to use the three integers to store the date and search by that, then make sure that they're indexed as well:
CREATE INDEX IX_MyTable_Date_Ints ON MyTable(intYear, intMonth, intDay)
CREATE INDEX IX_MyTable_Date ON MyTable(BirthDate)
If you're wanting to be able to search the user table for birthdays excluding the year, I would recommend storing the birthday in a different date field, using a fixed year, e.g. 3004 - instead of using three integers. You base year should be a leap-year, to cater for anyone who may have been born on 29 February. If you use a year far in the future, you can use the year to determine that a date is effectively a date for which the year should be disregarded.
Then you can search for the birthday, regardless of the year, without having to do a function call on each record, by adding "WHERE birth_day = '3004-12-10'. If this field is indexed, you should be able to return all matching rows in a flash. You need to bear in mind that when searching an index, the server will need to do a maximum of 32 comparisons to find a match in 4 billion records. Never underestimate the benefits of indexing!
I would be inclined to put maintain the birthday through a trigger, so that it keeps itself updated. For those birth dates where you don't have the year, just use your base year (3004). Since your base year is in the future, you know that this birth date doesn't have a year.
CREATE TABLE MyTable (
MyTable_key INT IDENTITY(1, 1),
username VARCHAR(30),
birth_date DATE,
birth_day DATE
)
ALTER TABLE MyTable ADD CONSTRAINT PK_MyTable PRIMARY KEY CLUSTERED (MyTable_key)
CREATE INDEX MyTable_birth_date ON MyTable(birth_date)
CREATE INDEX MyTable_birth_day ON MyTable(birth_day)
GO
CREATE TRIGGER tr_MyTable_calc_birth_day ON MyTable AFTER INSERT, UPDATE AS
UPDATE t SET birth_day = DATEADD(YEAR, 3004-DATEPART(YEAR, t.birth_date), t.birth_date)
FROM MyTable t, inserted i WHERE i.MyTable_key = t.MyTable_key
To update your existing table, run the update as a standalone query, without the join to the inserted table as it was used in the trigger:
UPDATE MyTable SET birth_day = DATEADD(YEAR, 3004-DATEPART(YEAR, birth_date), birth_date)
Hope this helps.
Try to use Result Set instead of DataTable or DataSet. ResultSet is fast when compared to both of these

Is there efficient SQL to query a portion of a large table

The typical way of selecting data is:
select * from my_table
But what if the table contains 10 million records and you only want records 300,010 to 300,020
Is there a way to create a SQL statement on Microsoft SQL that only gets 10 records at once?
E.g.
select * from my_table from records 300,010 to 300,020
This would be way more efficient than retrieving 10 million records across the network, storing them in the IIS server and then counting to the records you want.
SELECT * FROM my_table is just the tip of the iceberg. Assuming you're talking a table with an identity field for the primary key, you can just say:
SELECT * FROM my_table WHERE ID >= 300010 AND ID <= 300020
You should also know that selecting * is considered poor practice in many circles. They want you specify the exact column list.
Try looking at info about pagination. Here's a short summary of it for SQL Server.
Absolutely. On MySQL and PostgreSQL (the two databases I've used), the syntax would be
SELECT [columns] FROM table LIMIT 10 OFFSET 300010;
On MS SQL, it's something like SELECT TOP 10 ...; I don't know the syntax for offsetting the record list.
Note that you never want to use SELECT *; it's a maintenance nightmare if anything ever changes. This query, though, is going to be incredibly slow since your database will have to scan through and throw away the first 300,010 records to get to the 10 you want. It'll also be unpredictable, since you haven't told the database which order you want the records in.
This is the core of SQL: tell it which 10 records you want, identified by a key in a specific range, and the database will do its best to grab and return those records with minimal work. Look up any tutorial on SQL for more information on how it works.
When working with large tables, it is often a good idea to make use of Partitioning techniques available in SQL Server.
The rules of your partitition function typically dictate that only a range of data can reside within a given partition. You could split your partitions by date range or ID for example.
In order to select from a particular partition you would use a query similar to the following.
SELECT <Column Name1>…/*
FROM <Table Name>
WHERE $PARTITION.<Partition Function Name>(<Column Name>) = <Partition Number>
Take a look at the following white paper for more detailed infromation on partitioning in SQL Server 2005.
http://msdn.microsoft.com/en-us/library/ms345146.aspx
I hope this helps however please feel free to pose further questions.
Cheers, John
I use wrapper queries to select the core query and then just isolate the ROW numbers that i wish to take from the query - this allows the SQL server to do all the heavy lifting inside the CORE query and just pass out the small amount of the table that i have requested. All you need to do is pass the [start_row_variable] and the [end_row_variable] into the SQL query.
NOTE: The order clause is specified OUTSIDE the core query [sql_order_clause]
w1 and w2 are TEMPORARY table created by the SQL server as the wrapper tables.
SELECT
w1.*
FROM(
SELECT w2.*,
ROW_NUMBER() OVER ([sql_order_clause]) AS ROW
FROM (
<!--- CORE QUERY START --->
SELECT [columns]
FROM [table_name]
WHERE [sql_string]
<!--- CORE QUERY END --->
) AS w2
) AS w1
WHERE ROW BETWEEN [start_row_variable] AND [end_row_variable]
This method has hugely optimized my database systems. It works very well.
IMPORTANT: Be sure to always explicitly specify only the exact columns you wish to retrieve in the core query as fetching unnecessary data in these CORE queries can cost you serious overhead
Use TOP to select only a limited amont of rows like:
SELECT TOP 10 * FROM my_table WHERE ID >= 300010
Add an ORDER BY if you want the results in a particular order.
To be efficient there has to be an index on the ID column.