Error: TABLE_QUERY expressions cannot query BigQuery tables - google-bigquery

This is a follow-up question regarding Jordan's answer here: Weird error in BigQuery
I have been querying a reference table within TABLE_QUERY() for quite some time. Now, following the recent changes Jordan is referring to, many of our queries are broken. I would like to ask the community for advice on an alternative to what we are doing.
I have tables containing events ("MyTable_YYYYMMDD"). I want to query my data for the period of a specific campaign (or several campaigns). The period of each campaign is stored in a table with all campaign data (ID, StartCampaignDate, EndCampaignDate). In order to query only the relevant tables, we use TABLE_QUERY(), and within TABLE_QUERY() we construct a list of all relevant table names based on the campaign data.
This query runs in various forms many times with different parameters. The reasons for using the wildcard function (rather than querying the entire dataset) are performance, execution costs, and maintenance costs. So having it query all tables and then filter the results is not an option, as it drives execution costs too high.
A sample query looks like:
SELECT
*
FROM
TABLE_QUERY([MyProject:MyDataSet], 'table_id IN
(SELECT CONCAT("MyTable_",STRING(Year*100+Month)) TBL_NAME
FROM DWH.Dim_Periods P
CROSS JOIN DWH.Campaigns AS LC
WHERE ID IN ("86254e5a-b856-3b5a-85e1-0f5ab3ff20d6")
AND DATE(P.Date) BETWEEN DATE(StartCampaignDate) AND DATE(EndCampaignDate))')
This is now broken...
My question: the information about which tables to query is stored in a reference table. How would you query only the relevant tables (partitions) now that TABLE_QUERY() is no longer allowed to query reference tables?
Many thanks

The "simple" way I see is to split it into two steps.
Step 1 - build list that will be used to filter table_id's
SELECT GROUP_CONCAT_UNQUOTED(
CONCAT('"',"MyTable_",STRING(Year*100+Month),'"')
) TBL_NAME_LIST
FROM DWH.Dim_Periods P
CROSS JOIN DWH.Campaigns AS LC
WHERE ID IN ("86254e5a-b856-3b5a-85e1-0f5ab3ff20d6")
AND DATE(P.Date) BETWEEN DATE(StartCampaignDate) AND DATE(EndCampaignDate)
Note the change to your query: it now turns the result into a single list that you will use in step 2.
Step 2 - final query
SELECT
*
FROM
TABLE_QUERY([MyProject:MyDataSet],
'table_id IN (<paste list (TBL_NAME_LIST) built in first query>)')
The steps above are easy to implement in any client you might be using.
If you run this from within the BigQuery Web UI, it requires a few extra manual "moves" that you might not be happy about.
My answer is obvious and you most likely already have this as an option, but I wanted to mention it.
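As a rough illustration of how a client could glue the two steps together, here is a minimal Python sketch. It assumes the step 1 query has already been run and its table names fetched by whatever client you use; the helper name build_table_query and the sample table names are hypothetical and only serve the example.

```python
def build_table_query(dataset, table_names):
    """Compose the step-2 legacy SQL from table names fetched in step 1."""
    if not table_names:
        raise ValueError("step 1 returned no table names")
    # De-duplicate while keeping the original order, then quote each name.
    unique_names = dict.fromkeys(table_names)
    quoted = ", ".join('"{}"'.format(name) for name in unique_names)
    return (
        "SELECT *\n"
        "FROM TABLE_QUERY([{}],\n"
        "  'table_id IN ({})')".format(dataset, quoted)
    )


# Hypothetical output of step 1 (note the duplicate entry).
names = ["MyTable_201501", "MyTable_201502", "MyTable_201501"]
query = build_table_query("MyProject:MyDataSet", names)
print(query)
```

The resulting string can then be submitted as the step 2 query. Because the list is pasted in as literals, TABLE_QUERY() no longer has to read the reference table itself.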

This is not an ideal solution, but it seems to do the job.
In my previous query I passed the list of IDs as a parameter from an external process that constructed the query. I wanted this process to be unaware of any logic implemented in the query.
Eventually we came up with this solution:
Instead of passing a list of IDs, we pass a JSON document that contains the relevant metadata for each ID. We parse this JSON within the TABLE_QUERY() function. So instead of querying a physical reference table, we query a sort of "table variable" that we have put in the JSON.
Below is a sample query that runs on the public dataset that demonstrates this solution.
SELECT
YEAR,
COUNT (*) CNT
FROM
TABLE_QUERY([fh-bigquery:weather_gsod], 'table_id in
(Select table_id
From
(Select table_id,concat(Right(table_id,4),"0101") as TBL_Date from [fh-bigquery:weather_gsod.__TABLES_SUMMARY__]
where table_id Contains "gsod"
)TBLs
CROSS JOIN
(select
Regexp_Replace(Regexp_extract(SPLIT(DatesInput,"},{"),r"\"fromDate\":\"(\d\d\d\d-\d\d-\d\d)\""),"-","") as fromDate,
Regexp_Replace(Regexp_extract(SPLIT(DatesInput,"},{"),r"\"toDate\":\"(\d\d\d\d-\d\d-\d\d)\""),"-","") as toDate,
FROM
(Select
"[
{
\"CycleID\":\"123456\",
\"fromDate\":\"1929-01-01\",
\"toDate\":\"1950-01-10\"
},{
\"CycleID\":\"123456\",
\"fromDate\":\"1970-02-01\",
\"toDate\":\"2000-02-10\"
}
]"
as DatesInput)) RefDates
WHERE TBLs.TBL_Date>=RefDates.fromDate
AND TBLs.TBL_Date<=RefDates.toDate
)')
GROUP BY
YEAR
ORDER BY
YEAR
This solution is not ideal as it requires an external process to be aware of the data stored in the reference tables.
Ideally the BigQuery team will re-enable this very useful functionality.

Related

MS Access 2010 SQL Top N query by group performance issue (continued)

I have significant performance issues (up to a time-out) in MS Access 2010 with the query below. The table TempTableAnalysis contains between 10,000 and 15,000 records. I have already received input from this forum to work with a temporary table in the top 10 query (MS Access 2010 SQL Top N query by group performance issue).
Can anyone explain how to implement the temporary table in the subquery and how to join it? I can't get it to work.
Any other suggestions to improve performance are highly appreciated.
Here is my query:
SELECT
t2.Loc,
t2.ABCByPick,
t2.Planner,
t2.DmdUnit,
ROUND(t2.MASE,2) AS MASE,
ROUND(t2.AFAR,2) AS AFAR
FROM TempTableAnalysis AS t2
WHERE t2.MASE IN (
SELECT TOP 10 t1.MASE
FROM TempTableAnalysis AS t1
WHERE t1.ABCByPick = t2.ABCByPick
ORDER BY t1.MASE DESC
)
ORDER BY
t2.ABCByPick,
t2.MASE DESC;
Optimizing Access Query Performance For Large Data Sets
Based on your posted SQL Query, you have some options available to optimize and speed up the performance.
SELECT
t2.Loc,
t2.ABCByPick,
t2.Planner,
t2.DmdUnit,
ROUND(t2.MASE,2) AS MASE,
ROUND(t2.AFAR,2) AS AFAR
FROM TempTableAnalysis AS t2
...
This is the first part, where TempTableAnalysis is the multi-thousand record subquery. If you want to squeeze a little more performance out of this "Temp" table, don't use it as a dynamic query (i.e., one calculated on demand each time the query is opened). Instead, try constructing a macro that pushes the output to a static table:
Appending Subquery Data to a Static Table:
Create a QUERY object and change its type to DELETE. Design it to delete the contents of your "temporary" table object. If you prefer using SQL, the command will look like:
DELETE My_Table.*
FROM My_Table;
Create a QUERY object and change its type to APPEND. Design it to query all fields from your query defined by the SQL statement of this OP. Again, the SQL version of this task has the following syntax:
INSERT INTO StaticAnalysisTable ( ID, Loc, Item, AvgOfScaledError )
SELECT t1.ID, t1.Loc, t1.Item, t1.AvgOfScaledError
FROM TempTableAnalysis as t1;
The next step is to automate the population of this static table and it is optional. It's simple however and will make it less likely that you will make the mistake of forgetting to "Refresh" and accessing your static table while it has stale data... causing inaccuracies in your results.
Create a macro with two steps. Each step will have the following definition: OPEN QUERY. When prompted for the query to open, reference the objects you created in the previous two steps in the following order (important): (1) DELETE Query: (your delete query name) then (2) APPEND Query: (your append query name).
SQL Query Comments and Suggestions
The following part of the posted SQL query could use some help:
...
WHERE t2.MASE IN (
SELECT TOP 10 t1.MASE
FROM TempTableAnalysis AS t1
WHERE t1.ABCByPick = t2.ABCByPick
ORDER BY t1.MASE DESC
)
ORDER BY
t2.ABCByPick,
t2.MASE DESC;
There is a join across the sub query that generates the TOP-10 data and the outermost query that correlates these results with the supplementing MASE table data. This isn't necessary if the TempTableAnalysis.MASE represents a key value.
ORDER BY in the innermost query isn't necessary unless it is intended to force some sort of selection criteria (as when using SQL analytical functions); this doesn't look like one of those cases. Ordering records from large data sets is also a wasteful CPU and memory sink.
EDIT: Just as a counter-point argument, the ORDER BY clause used beside a TOP N query actually has a purpose, but I am still not clear if it is necessary. Just to round out the discussion, another SO thread talks about How to Select Top 10 in an Access Query.
WHERE t2.MASE IN (...
You may be experiencing blocks in performance with very large in-list set operations. On an Oracle database server, I have discovered with other developers that there is a limitation to the number of discrete elements in an in-list query operator. That value was in the thousands... which may be further limited based on server and database resources.
Consider using a SQL JOIN operator. The place where you define TABLE objects can also be populated with SQL defined queries with aliases known as INLINE VIEWS. Since you're using ACCESS, if an inline view does not work directly, just define another ACCESS QUERY object and reference it in your final query as if it were a table...
A possible rewrite to the ending part of the original query:
SELECT
t2.Loc,
t2.ABCByPick,
t2.Planner,
...
FROM TempTableAnalysis AS t2,
(SELECT TOP 10 t1.MASE, t1.ABCByPick
FROM TempTableAnalysis AS t1) AS ttop
WHERE t2.MASE = ttop.MASE
AND t2.ABCByPick = ttop.ABCByPick
ORDER BY
t2.ABCByPick,
t2.MASE DESC;
You will definitely need to run through these recommendations and validate the output data for accuracy. This represents approaches to capturing some of the "low-hanging fruit" (easy items) that you can pursue to speed up your query and reporting operations.
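To validate the output of any rewritten query, it can help to have an independent reference for "top N per group". The following is a minimal Python sketch, not Access code: the sample rows and the value of N are made up, and it assumes the rows have already been exported as (ABCByPick, MASE) pairs. Unlike Access's TOP, heapq.nlargest does not return extra rows for ties at the cutoff.

```python
from collections import defaultdict
from heapq import nlargest

# Made-up sample rows: (ABCByPick, MASE).
rows = [("A", 5.0), ("A", 3.0), ("A", 9.0), ("B", 1.0), ("B", 7.0)]

# One pass over the data to bucket MASE values by group.
mase_by_group = defaultdict(list)
for abc_by_pick, mase in rows:
    mase_by_group[abc_by_pick].append(mase)

# Keep the N largest MASE values per group (N = 2 here for brevity).
top_n = {group: nlargest(2, values) for group, values in mase_by_group.items()}
print(top_n)
```

The table is read once and each group is reduced separately, which is the same "fewer trips through the large table" idea discussed above.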
Conclusions and Closing Comments
As background for other readers, the database object TempTableAnalysis is not a static table. It is the result of a subquery presented in another SO post requesting help with an Access TOP N query. The query comes from multiple tables approaching 10,000 records in size (each?).
Tip: A query result in Access ALSO has potential table-like behaviors. You can push the output to a table for joining (as described above) or just join to the query object itself (careful though, especially when you get to "chaining" multiple query operations...)
The strategy of this solution was:
To minimize the number of trips through one or more instances of this very large table.
To pre-process and index optimize any data that would otherwise be "static" for the duration of its analysis.
To audit and review the SQL code used to obtain the final results.
Definitely look into Access MACROS. Coupled with identifying static data in your data sets, you can offload processing of your complex background analytic queries to improve the user experience when they view and query through the final results. Good Luck!

How do I use the TABLE_QUERY() function in BigQuery?

A couple of questions about the TABLE_QUERY function:
The examples show using table_id in the query string, are there other fields available?
It seems difficult to debug. I'm getting "error evaluating subsidiary query" when I try to use it.
How does TABLE_QUERY() work?
The TABLE_QUERY() function allows you to write a SQL WHERE clause that is evaluated to find which tables to run the query over. For instance, you can run the following query to count the rows in all tables in the publicdata:samples dataset that are older than 7 days:
SELECT count(*)
FROM TABLE_QUERY(publicdata:samples,
"MSEC_TO_TIMESTAMP(creation_time) < "
+ "DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY')")
Or you can run this to query over all tables that have ‘git’ in the name (which are the github_timeline and the github_nested sample tables) and find the most common urls:
SELECT url, COUNT(*) AS cnt
FROM TABLE_QUERY(publicdata:samples, "table_id CONTAINS 'git'")
GROUP EACH BY url
ORDER BY cnt DESC
LIMIT 100
Despite being very powerful, TABLE_QUERY() can be difficult to use. The WHERE clause must be specified as a string, which can be a little bit awkward. Moreover, it can be difficult to debug, since when there is a problem, you only get the error “Error evaluating subsidiary query”, which isn’t always helpful.
How it works:
TABLE_QUERY() essentially executes two queries. When you run TABLE_QUERY(<dataset>, <table_query>), BigQuery executes SELECT table_id FROM <dataset>.__TABLES_SUMMARY__ WHERE <table_query> to get the list of table IDs to run the query on, then it executes your actual query over those tables.
The __TABLES_SUMMARY__ portion of that query may look unfamiliar. __TABLES_SUMMARY__ is a meta-table containing information about tables in a dataset. You can use this meta-table yourself. For example, the query SELECT * FROM publicdata:samples.__TABLES_SUMMARY__ will return metadata about the tables in the publicdata:samples dataset.
Available Fields:
The fields of the __TABLES_SUMMARY__ meta-table (that are all available in the TABLE_QUERY query) include:
table_id: name of the table.
creation_time: time, in milliseconds since 1/1/1970 UTC, that the table was created. This is the same as the creation_time field on the table.
type: whether it is a view (2) or regular table (1).
The following fields are not available in TABLE_QUERY() since they are members of __TABLES__ but not __TABLES_SUMMARY__. They're kept here for historical interest and to partially document the __TABLES__ metatable:
last_modified_time: time, in milliseconds since 1/1/1970 UTC, that the table was updated (either metadata or table contents). Note that if you use the tabledata.insertAll() to stream records to your table, this might be a few minutes out of date.
row_count: number of rows in the table.
size_bytes: total size in bytes of the table.
How to debug
In order to debug your TABLE_QUERY() queries, you can do the same thing that BigQuery does; that is, you can run the meta-table query yourself. For example:
SELECT * FROM publicdata:samples.__TABLES_SUMMARY__
WHERE MSEC_TO_TIMESTAMP(creation_time) <
DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY')
lets you not only debug your query but also see what tables would be returned when you run the TABLE_QUERY function. Once you have debugged the inner query, you can put it together with your full query over those tables.
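If you debug these filters often, the translation from a TABLE_QUERY() filter to the corresponding meta-table query can be scripted. This is a minimal Python sketch based only on the description above, not on BigQuery internals; the helper name metadata_query is hypothetical.

```python
def metadata_query(dataset, table_query):
    """Return the meta-table query implied by TABLE_QUERY(dataset, table_query)."""
    return (
        "SELECT table_id FROM {}.__TABLES_SUMMARY__ WHERE {}"
        .format(dataset, table_query)
    )


# Same filter as the 'git' example above.
debug_sql = metadata_query("publicdata:samples", "table_id CONTAINS 'git'")
print(debug_sql)
```

Running the printed statement on its own shows which tables the filter selects, and any syntax error is reported against the filter itself rather than as a generic subsidiary-query error.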
Alternative answer, for those moving forward to Standard SQL:
BigQuery Standard SQL doesn't support TABLE_QUERY, but it supports * expansion for table names.
When expanding table names *, you can use the meta-column _TABLE_SUFFIX to narrow the selection.
Table expansion with * only works when all tables have compatible schemas.
For example, to get the average worldwide NOAA GSOD temperature between 2010 and 2014:
#standardSQL
SELECT AVG(temp) avg_temp, _TABLE_SUFFIX y
FROM `bigquery-public-data.noaa.gsod_20*` #every year that starts with "20"
WHERE _TABLE_SUFFIX BETWEEN "10" AND "14" #only years between 2010 and 2014
GROUP BY y
ORDER BY y

Writing Efficient Queries in SAS Using Proc sql with Teradata

EDIT: Here is a more complete set of code that shows exactly what's going on per the answer below.
libname output '/data/files/jeff';
%let DateStart = '01Jan2013'd;
%let DateEnd = '01Jun2013'd;
proc sql;
CREATE TABLE output.id AS (
SELECT DISTINCT id
FROM mydb.sale_volume AS sv
WHERE sv.category IN ('a', 'b', 'c') AND
sv.trans_date BETWEEN &DateStart AND &DateEnd
);
CREATE TABLE output.sums AS (
SELECT sv.id, SUM(sales)
FROM mydb.sale_volume AS sv
INNER JOIN output.id AS ids
ON ids.id = sv.id
WHERE sv.trans_date BETWEEN &DateStart AND &DateEnd
GROUP BY sv.id
);
quit;
The goal is to simply query the table for some id's based on category membership. Then I sum these members' activity across all categories.
The above approach is far slower than:
Running the first query to get the subset
Running a second query that sums every ID
Running a third query that inner joins the two result sets.
If I'm understanding correctly, it may be more efficient to make sure that all of my code is completely passed through rather than cross-loading.
After posting a question yesterday, a member suggested I might benefit from asking a separate question on performance that was more specific to my situation.
I'm using SAS Enterprise Guide to write some programs/data queries. I don't have permissions to modify the underlying data, which is stored in 'Teradata'.
My basic problem is writing efficient SQL queries in this environment. For example, I query a large table (with tens of millions of records) for a small subset of ID's. Then, I use this subset to query the larger table again:
proc sql;
CREATE TABLE subset AS (
SELECT
id
FROM
bigTable
WHERE
someValue = x AND
date BETWEEN a AND b
)
This works in a matter of seconds and returns 90k ID's. Next, I want to query this set of ID's against the big table, and problems ensue. I'm wanting to sum values over time for the ID's:
proc sql;
CREATE TABLE subset_data AS (
SELECT
bigTable.id,
SUM(bigTable.value) AS total
FROM
bigTable
INNER JOIN subset
ON subset.id = bigTable.id
WHERE
bigTable.date BETWEEN a AND b
GROUP BY
bigTable.id
)
For whatever reason, this takes a really long time. The difference is that the first query flags 'someValue'. The second looks at all activity, regardless of what's in 'someValue'. For example, I could flag every customer who orders a pizza. Then I would look at every purchase for all customers who ordered pizza.
I'm not overly familiar with SAS so I'm looking for any advice on how to do this more efficiently or speed things up. I'm open to any thoughts or suggestions and please let me know if I can offer more detail. I guess I'm just surprised the second query takes so long to process.
The most critical thing to understand when using SAS to access data in Teradata (or any other external database, for that matter) is that the SAS software prepares SQL and submits it to the database. The idea is to relieve you (the user) of all the database-specific details. SAS does this using a concept called "implicit pass-through", which just means that SAS does the translation from SAS code into DBMS code. Among the many things that occur is data type conversion: SAS has two (and only two) data types, numeric and character.
SAS deals with translating things for you but it can be confusing. For example, I've seen "lazy" database tables defined with VARCHAR(400) columns having values that never exceed some smaller length (like column for a person's name). In the data base this isn't much of a problem, but since SAS does not have a VARCHAR data type, it creates a variable 400 characters wide for each row. Even with data set compression, this can really make the resulting SAS dataset unnecessarily large.
The alternative is to use "explicit pass-through", where you write native queries using the actual syntax of the DBMS in question. These queries execute entirely on the DBMS and return results back to SAS (which still does the data type conversion for you). For example, here is a "pass-through" query that joins two tables and creates a SAS dataset as a result:
proc sql;
connect to teradata (user=userid password=password mode=teradata);
create table mydata as
select * from connection to teradata (
select a.customer_id
, a.customer_name
, b.last_payment_date
, b.last_payment_amt
from base.customers a
join base.invoices b
on a.customer_id=b.customer_id
where b.bill_month = date '2013-07-01'
and b.paid_flag = 'N'
);
quit;
Notice that everything inside the pair of parentheses is native Teradata SQL and that the join operation itself is running inside the database.
The example code you have shown in your question is NOT a complete, working example of a SAS/Teradata program. To better assist, you need to show the real program, including any library references. For example, suppose your real program looks like this:
proc sql;
CREATE TABLE subset_data AS
SELECT bigTable.id,
SUM(bigTable.value) AS total
FROM TDATA.bigTable bigTable
JOIN TDATA.subset subset
ON subset.id = bigTable.id
WHERE bigTable.date BETWEEN a AND b
GROUP BY bigTable.id
;
That would indicate a previously assigned LIBNAME statement through which SAS was connecting to Teradata. The syntax of that WHERE clause would be very relevant to whether SAS is even able to pass the complete query to Teradata. (Your example doesn't show what "a" and "b" refer to.) It is very possible that the only way SAS can perform the join is to drag both tables back into a local work session and perform the join on your SAS server.
One thing I can strongly suggest is that you try to convince your Teradata administrators to allow you to create "driver" tables in some utility database. The idea is that you would create a relatively small table inside Teradata containing the ID's you want to extract, then use that table to perform explicit joins. I'm sure you would need a bit more formal database training to do that (like how to define a proper index and how to "collect statistics"), but with that knowledge and ability, your work will just fly.
I could go on and on but I'll stop here. I use SAS with Teradata extensively every day against what I'm told is one of the largest Teradata environments on the planet. I enjoy programming in both.
You imply an assumption that the 90k records in your first query are all unique ids. Is that definite?
I ask because the implication from your second query is that they're not unique.
- One id can have multiple values over time, and have different somevalues
If the ids are not unique in the first dataset, you need to GROUP BY id or use DISTINCT, in the first query.
Imagine that the 90k rows consists of 30k unique ids, and so have an average of 3 rows per id.
And then imagine those 30k unique ids actually have 9 records in your time window, including rows where somevalue <> x.
You will then get 3x9 records back per id.
And as those two numbers grow, the number of records in your second query grows geometrically.
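The 3x9 effect is easy to reproduce on a toy table. The sketch below uses Python's built-in sqlite3 module rather than SAS or Teradata, and the data is made up: one id with three rows where somevalue = 'x' and six other rows. It is only meant to show the row multiplication, not the performance characteristics of either system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bigTable (id INTEGER, somevalue TEXT, value INTEGER);
    -- id 1: three rows with somevalue = 'x' and six rows with 'y'
    INSERT INTO bigTable VALUES
        (1, 'x', 10), (1, 'x', 10), (1, 'x', 10),
        (1, 'y', 10), (1, 'y', 10), (1, 'y', 10),
        (1, 'y', 10), (1, 'y', 10), (1, 'y', 10);
""")


def joined_total(subset_sql):
    """Join bigTable to the given subset and return (id, joined rows, total)."""
    sql = """
        SELECT b.id, COUNT(*) AS n_rows, SUM(b.value) AS total
        FROM bigTable AS b
        INNER JOIN ({}) AS s ON s.id = b.id
        GROUP BY b.id
    """.format(subset_sql)
    return conn.execute(sql).fetchone()


# Subset without DISTINCT: id 1 appears three times.
with_duplicates = joined_total("SELECT id FROM bigTable WHERE somevalue = 'x'")
# Subset with DISTINCT: id 1 appears once.
with_distinct = joined_total("SELECT DISTINCT id FROM bigTable WHERE somevalue = 'x'")

print(with_duplicates)  # (1, 27, 270)
print(with_distinct)    # (1, 9, 90)
```

With duplicates in the subset, the join produces 3 x 9 = 27 rows and triples the sum; with DISTINCT it produces the expected 9 rows.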
Alternative Query
If that's not the problem, an alternative query (which is not ideal, but possible) would be...
SELECT
bigTable.id,
SUM(bigTable.value) AS total
FROM
bigTable
WHERE
bigTable.date BETWEEN a AND b
GROUP BY
bigTable.id
HAVING
MAX(CASE WHEN bigTable.somevalue = x THEN 1 ELSE 0 END) = 1
If ID is unique and a single value, then you can try constructing a format.
Create a dataset that looks like this:
fmtname, start, label
where fmtname is the same for all records, a legal format name (begins and ends with a letter, contains alphanumeric or _); start is the ID value; and label is a 1. Then add one row with the same value for fmtname, a blank start, a label of 0, and another variable, hlo='o' (for 'other'). Then import into proc format using the CNTLIN option, and you now have a 1/0 value conversion.
Here's a brief example using SASHELP.CLASS. ID here is name, but it can be numeric or character - whichever is right for your use.
data for_fmt;
set sashelp.class;
retain fmtname '$IDF'; *Format name is up to you. Should have $ if ID is character, no $ if numeric;
start=name; *this would be your ID variable - the look up;
label='1';
output;
if _n_ = 1 then do;
hlo='o';
call missing(start);
label='0';
output;
end;
run;
proc format cntlin=for_fmt;
quit;
Now instead of doing a join, you can do your query 'normally' but with an additional where clause of and put(id,$IDF.)='1'. This won't be optimized with an index or anything, but it may be faster than the join. (It may also not be faster - depends on how the SQL optimizer is working.)
If the id is unique you might add a UNIQUE PRIMARY INDEX(id) to that table, otherwise it defaults to a Non-unique PI.
Knowing about uniqueness helps the optimizer to produce a better plan.
Without more info like an Explain (just put EXPLAIN in front of the SELECT) it's hard to tell how this can be improved.
One alternate solution is to use SAS procedures. I don't know what your actual SQL is doing, but if you're just doing frequencies (or something else that can be done in a PROC), you could do:
proc sql;
create view blah as select ... (your join);
quit;
proc freq data=blah;
tables id/out=summary(rename=count=total keep=id count);
run;
Or any number of other options (PROC MEANS, PROC TABULATE, etc.). That may be faster than doing the sum in SQL (depending on some details, such as how your data is organized, what you're actually doing, and how much memory you have available). It has the added benefit that SAS might choose to do this in-database, if you create the view in the database, which might be faster. (In fact, if you just run the freq off the base table, it's possible that would be even faster, and then join the results to the smaller table).

Microsoft Access SQL STDEV of COUNT of data

I have a table in MS Access 2010 I'm trying to analyze of people who belong to various groups having completed various jobs. What I would like to do is calculate the standard deviation of the count of the number of jobs each person has completed per group. Meaning, the output I would like is that for each group, I'd have a number that constitutes the standard deviation of how many jobs each person did.
The data is structured like this:
OldGroup, OldPerson, JobID
I know that I need to do a COUNT of the job IDs by Group and Person. I tried creating a subquery to work with, but that didn't work:
SELECT data.OldGroup, STDEV(
SELECT COUNT(data.JobID)
FROM data
WHERE data.Classification = 1
GROUP BY data.OldGroup, data.OldPerson
)
FROM data
GROUP BY data.OldGroup;
This returned an error "At most one record can be returned by this subquery," which I know is wrong, since when I tried to run the subquery as a standalone query it successfully returned more than one record.
Question:
How can I get the STDEV of a COUNT?
Subquestion: If this question can be answered by correcting incorrect syntax in my examples, please do so.
A minor change in strategy that wouldn't work for all cases but did end up working for this one seemed to take care of the problem. Instead of sticking the subquery in the SELECT statement, I put it in FROM, mimicking creating a separate table.
As such, my code looks like:
SELECT OldGroup, STDEV(NumberJobs) AS JobsStDev
FROM (
SELECT OldGroup, OldPerson, COUNT(JobID) AS NumberJobs
FROM data
WHERE data.Classification = 1
GROUP BY OldGroup, OldPerson
) AS TempTable
GROUP BY OldGroup;
That seemed to get the job done.
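For cross-checking the query result on a small sample, the same two-level aggregation can be sketched in Python. The rows below are made up and are assumed to be already filtered to Classification = 1. statistics.stdev computes the sample standard deviation, which is also what Access's StDev returns, and it needs at least two people per group.

```python
from collections import Counter, defaultdict
from statistics import stdev

# Made-up rows: (OldGroup, OldPerson, JobID).
rows = [
    ("A", "ann", 1), ("A", "ann", 2), ("A", "bob", 3),
    ("A", "cat", 4), ("A", "cat", 5), ("A", "cat", 6),
    ("B", "dan", 7), ("B", "eve", 8), ("B", "eve", 9), ("B", "eve", 10),
]

# Inner query: number of jobs per (group, person).
jobs_per_person = Counter((group, person) for group, person, _ in rows)

# Outer query: collect the per-person counts for each group...
counts_by_group = defaultdict(list)
for (group, _person), n_jobs in jobs_per_person.items():
    counts_by_group[group].append(n_jobs)

# ...and take the sample standard deviation of those counts.
jobs_stdev = {group: stdev(counts) for group, counts in counts_by_group.items()}
print(jobs_stdev)
```

Group A has per-person counts 2, 1 and 3, so its standard deviation is 1.0; group B has counts 1 and 3, giving the square root of 2.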
Try doing a make-table query for "SELECT COUNT(data.JobID)...."
Then for the 2nd query, use the new base table.
Sometimes it is just easier to do something in 2 or more queries.

Is there efficient SQL to query a portion of a large table

The typical way of selecting data is:
select * from my_table
But what if the table contains 10 million records and you only want records 300,010 to 300,020
Is there a way to create a SQL statement on Microsoft SQL that only gets 10 records at once?
E.g.
select * from my_table from records 300,010 to 300,020
This would be way more efficient than retrieving 10 million records across the network, storing them in the IIS server and then counting to the records you want.
SELECT * FROM my_table is just the tip of the iceberg. Assuming you're talking a table with an identity field for the primary key, you can just say:
SELECT * FROM my_table WHERE ID >= 300010 AND ID <= 300020
You should also know that selecting * is considered poor practice in many circles. They want you to specify the exact column list.
Try looking at info about pagination. Here's a short summary of it for SQL Server.
Absolutely. On MySQL and PostgreSQL (the two databases I've used), the syntax would be
SELECT [columns] FROM table LIMIT 10 OFFSET 300010;
On MS SQL, it's something like SELECT TOP 10 ...; I don't know the syntax for offsetting the record list.
Note that you never want to use SELECT *; it's a maintenance nightmare if anything ever changes. This query, though, is going to be incredibly slow since your database will have to scan through and throw away the first 300,010 records to get to the 10 you want. It'll also be unpredictable, since you haven't told the database which order you want the records in.
This is the core of SQL: tell it which 10 records you want, identified by a key in a specific range, and the database will do its best to grab and return those records with minimal work. Look up any tutorial on SQL for more information on how it works.
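The difference between the two approaches can be tried out on a toy table. This sketch uses Python's built-in sqlite3 module, not SQL Server, and a made-up table with gapless ids from 1 to 1000, so the row numbers are scaled down from the question. It only shows that both forms return the same rows when the key is gapless and an ORDER BY is given.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO my_table (id, payload) VALUES (?, ?)",
    ((i, "row {}".format(i)) for i in range(1, 1001)),
)

# Key-range lookup: the primary key lets the engine seek straight to id 301.
by_key = conn.execute(
    "SELECT id FROM my_table WHERE id BETWEEN ? AND ? ORDER BY id",
    (301, 310),
).fetchall()

# Offset paging: the engine has to step past the first 300 rows first.
by_offset = conn.execute(
    "SELECT id FROM my_table ORDER BY id LIMIT 10 OFFSET 300"
).fetchall()

print(by_key == by_offset)  # True
```

If ids have gaps, the two queries no longer agree, which is one reason to be explicit about whether you want "rows 301 to 310" or "ids 301 to 310".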
When working with large tables, it is often a good idea to make use of Partitioning techniques available in SQL Server.
The rules of your partition function typically dictate that only a range of data can reside within a given partition. You could split your partitions by date range or ID, for example.
In order to select from a particular partition you would use a query similar to the following.
SELECT <Column Name1>…/*
FROM <Table Name>
WHERE $PARTITION.<Partition Function Name>(<Column Name>) = <Partition Number>
Take a look at the following white paper for more detailed information on partitioning in SQL Server 2005.
http://msdn.microsoft.com/en-us/library/ms345146.aspx
I hope this helps however please feel free to pose further questions.
Cheers, John
I use wrapper queries to select from the core query and then isolate just the row numbers that I wish to take from it. This allows SQL Server to do all the heavy lifting inside the CORE query and pass out only the small portion of the table that I have requested. All you need to do is pass the [start_row_variable] and the [end_row_variable] into the SQL query.
NOTE: The order clause is specified OUTSIDE the core query [sql_order_clause]
w1 and w2 are TEMPORARY table created by the SQL server as the wrapper tables.
SELECT
w1.*
FROM(
SELECT w2.*,
ROW_NUMBER() OVER ([sql_order_clause]) AS ROW
FROM (
<!--- CORE QUERY START --->
SELECT [columns]
FROM [table_name]
WHERE [sql_string]
<!--- CORE QUERY END --->
) AS w2
) AS w1
WHERE ROW BETWEEN [start_row_variable] AND [end_row_variable]
This method has hugely optimized my database systems. It works very well.
IMPORTANT: Be sure to always explicitly specify only the exact columns you wish to retrieve in the core query as fetching unnecessary data in these CORE queries can cost you serious overhead
Use TOP to select only a limited amount of rows, like:
SELECT TOP 10 * FROM my_table WHERE ID >= 300010
Add an ORDER BY if you want the results in a particular order.
To be efficient there has to be an index on the ID column.