MS Access 2010 SQL Top N query by group performance issue (continued) - sql

I have signficant performcance issues (up to time-out) in MS Access 2010 with the query below. The table TempTableAnalysis contains between 10'000-15'000 records. I have already received input from this forum to work with a temporary table in the top 10 query (MS Access 2010 SQL Top N query by group performance issue)
Can anyone explain how to implement the temporary table in the subquery and how to join it? I can't get it to work.
Any other suggestions to improve performance are highly appreciated.
Here is my query:
SELECT
t2.Loc,
t2.ABCByPick,
t2.Planner,
t2.DmdUnit,
ROUND(t2.MASE,2) AS MASE,
ROUND(t2.AFAR,2) AS AFAR
FROM TempTableAnalysis AS t2
WHERE t2.MASE IN (
SELECT TOP 10 t1.MASE
FROM TempTableAnalysis AS t1
WHERE t1.ABCByPick = t2.ABCByPick
ORDER BY t1.MASE DESC
)
ORDER BY
t2.ABCByPick,
t2.MASE DESC;

Optimizing Access Query Performance For Large Data Sets
Based on your posted SQL Query, you have some options available to optimize and speed up the performance.
SELECT
t2.Loc,
t2.ABCByPick,
t2.Planner,
t2.DmdUnit,
ROUND(t2.MASE,2) AS MASE,
ROUND(t2.AFAR,2) AS AFAR
FROM TempTableAnalysis AS t2
...
This is the first part where TempTableAnalysis is the multi-thousand record subquery. If you want to squeeze a little more performance out of the use of this "Temp" Table, don't use it as a dynamic query (i.e., calculated on demand each time the query is opened), try constructing a macro that pushes the output to a static table:
Appending Subquery Data to a Static Table:
Create a QUERY object and change its type to DELETE. Design it to delete the contents of your "temporary" table object. If you prefer using SQL, the command will look like:
DELETE My_Table.*
FROM My_Table;
Create a QUERY object and change its type to APPEND. Design it to query all fields from your query defined by the SQL statement of this OP. Again, the SQL version of this task has the following syntax:
INSERT INTO StaticAnalysisTable ( ID, Loc, Item, AvgOfScaledError )
SELECT t1.ID, t1.Loc, t1.Item, t1.AvgOfScaledError
FROM TempTableAnalysis as t1;
The next step is to automate the population of this static table and it is optional. It's simple however and will make it less likely that you will make the mistake of forgetting to "Refresh" and accessing your static table while it has stale data... causing inaccuracies in your results.
Create a macro with two steps. Each step will have the following definition: OPEN QUERY. When prompted for the query to open, reference the objects you created in the previous two steps in the following order (important): (1) DELETE Query: (your delete query name) then (2) APPEND Query: (your append query name).
SQL Query Comments and Suggestions
The following part of the posted SQL query could use some help:
...
WHERE t2.MASE IN (
SELECT TOP 10 t1.MASE
FROM TempTableAnalysis AS t1
WHERE t1.ABCByPick = t2.ABCByPick
ORDER BY t1.MASE DESC
)
ORDER BY
t2.ABCByPick,
t2.MASE DESC;
There is a join across the sub query that generates the TOP-10 data and the outermost query that correlates these results with the supplementing MASE table data. This isn't necessary if the TempTableAnalysis.MASE represents a key value.
ORDER BY
in the inner most query isn't necessary unless it is intended to force some sort of selection criteria (as in when using SQL analytical functions) this doesn't look like one of those cases. Ordering records from large data sets is also a wasteful cpu and memory sink.
EDIT: Just as a counter-point argument, the ORDER BY clause used beside a TOP N query actually has a purpose, but I am still not clear if it is necessary. Just to round out the discussion, another SO thread talks about How to Select Top 10 in an Access Query.
WHERE t2.MASE IN (...
You may be experiencing blocks in performance with very large in-list set operations. On an Oracle database server, I have discovered with other developers that there is a limitation to the number of discrete elements in an in-list query operator. That value was in the thousands... which may be further limited based on server and database resources.
Consider using a SQL JOIN operator. The place where you define TABLE objects can also be populated with SQL defined queries with aliases known as INLINE VIEWS. Since you're using ACCESS, if an inline view does not work directly, just define another ACCESS QUERY object and reference it in your final query as if it were a table...
A possible rewrite to the ending part of the original query:
SELECT
t2.Loc,
t2.ABCByPick,
t2.Planner,
...
FROM TempTableAnalysis AS t2,
(SELECT TOP 10 t1.MASE, t1.ABCByPick
FROM TempTableAnalysis AS t1) AS ttop
WHERE t2.MASE = ttop.MASE
AND t2.ABCByPick = ttop.ABCByPick
ORDER BY
t2.ABCByPick,
t2.MASE DESC;
You will definitely need to run through these recommendations and validate the output data for accuracy. This represents approaches to capturing some of the "low-hanging fruit" (easy items) that you can pursue to speed up your query and reporting operations.
Conclusions and Closing Comments
As a background to other readers, the database object TempTableAnalysis is not a static table. It is the result of a sub query presented in another SO post requesting help with a Access TOP N Query. The query comes from multiple tables approaching 10,000 records in size (each?).
Tip: A query result in Access ALSO has potential table-like behaviors. You can push the output to a table for joining (as described above) or just join to the query object itself (careful though, especially when you get to "chaining" multiple query operations...)
The strategy of this solution was:
To minimize the number of trips through one or more instances of this very large table.
To pre-process and index optimize any data that would otherwise be "static" for the duration of its analysis.
To audit and review the SQL code used to obtain the final results.
Definitely look into Access MACROS. Coupled with identifying static data in your data sets, you can offload processing of your complex background analytic queries to improve the user experience when they view and query through the final results. Good Luck!

Related

Query process steps of select query

I have confused with query process steps of select query. I read some docs, select query will run like this
1. Getting Data (From, Join)
2. Row Filter (Where)
3. Grouping (Group by)
4. Group Filter (Having)
5. Return Expressions (Select)
6. Order & Paging (Order by & Limit / Offset)
I retry test run a query join A table ( 70m records ) and B table( 75m records)
select *
from A join B on A.code = B.box_code
where B.box_code = '123'
compare with
select *
from A join (select * from B where box_code = '123' ) on A.code = B.box_code
I assume the first query will run slower than second query. Because the first query will take time when mapping large data while second query filters box_code before mapping data. But two queries run the same. Why did that happen?
I searched google, it may be related to clustered index, but I am not sure.
1 more question , why clustered index can get where condition to filter data before join ? i think the query will run join before where
Where did I get it wrong?
illustrating images
first query
second query
Thanks
This part is wrong...
select query will run like this
Getting Data (From, Join)
Row Filter (Where)
Grouping (Group by)
Group Filter (Having)
Return Expressions (Select)
Order & Paging (Order by & Limit / Offset)
Oracle has a number of operations that it can perform to satisfy a query. Some operations may require child operations to be completed first. Operations include things like TABLE ACCESS BY INDEX ROWID, INDEX RANGE SCAN, and NESTED LOOPS.
Oracle's optimizer decides which operations are necessary and in what order. It very often will, for example, apply WHERE conditions to a row source before joining that row source to another one. It does that for exactly the reason you imply in your post: because it is probably faster to filter a million rows down to 10 before doing a join.
Oracle maintains an elaborate set of statistics on each table and column so that it can estimate when you submit your query what is likely to work well.
Theoretically, your job when writing SQL is to describe what you want and leave the how part to Oracle. In practice, the how part is still important, so your question is a very good one. Read Oracle's documentation on the subject, titled "Oracle Database SQL Tuning Guide". There is a version for each release of the database and they're available for free online (see: https://docs.oracle.com).

Error: TABLE_QUERY expressions cannot query BigQuery tables

This s a followup question regarding Jordans answer here: Weird error in BigQuery
I was using to query reference table within "Table_Query" for quit some time. Now, following the recent changes Joradan is referring to, many of our queries are broken... I would like to ask the community advice for alternative solution to what we are doing.
I have tables containing events ("MyTable_YYYYMMDD"). I want to query my data for a period of a specific (or several) campaign. The period of that campaign is stored in a table with all campaigns data (ID, StartCampaignDate, EndCampaignDate). In order to query only the relevant tables, we use Table_Query(), and within the TableQuery() we construct a list of all relevant table names based on the campaigns data.
This query runs in various forms many times with different params. the reason for using wildcard function (rather than query the entire dataset), is performance, execution costs, and maintenance costs. So, having it query all tables and filter just the results is not an option as it drives execution costs too high.
a sample query will look like:
SELECT
*
FROM
TABLE_QUERY([MyProject:MyDataSet] 'table_id IN
(SELECT CONCAT("MyTable_",STRING(Year*100+Month)) TBL_NAME
FROM DWH.Dim_Periods P
CROSS JOIN DWH.Campaigns AS LC
WHERE ID IN ("86254e5a-b856-3b5a-85e1-0f5ab3ff20d6")
AND DATE(P.Date) BETWEEN DATE(StartCampaignDate) AND DATE(EndCampaignDate))')
This is now broken...
My question - the info, which tables should you query is stored on a reference table, How would you query only the relevant tables (partitions) when "TableQuery" is no longer allowed to query reference tables?
Many thanks
The "simple" way I see is split it to two steps
Step 1 - build list that will be used to filter table_id's
SELECT GROUP_CONCAT_UNQUOTED(
CONCAT('"',"MyTable_",STRING(Year*100+Month),'"')
) TBL_NAME_LIST
FROM DWH.Dim_Periods P
CROSS JOIN DWH.Campaigns AS LC
WHERE ID IN ("86254e5a-b856-3b5a-85e1-0f5ab3ff20d6")
AND DATE(P.Date) BETWEEN DATE(StartCampaignDate) AND DATE(EndCampaignDate)
Note the change in your query to transform result to list that you will use in step 2
Step 2 - final query
SELECT
*
FROM
TABLE_QUERY([MyProject:MyDataSet],
'table_id IN (<paste list (TBL_NAME_LIST) built in first query>)')
Above steps are easy to implement in any client you potentially using
If you use it from within BigQuery Web UI - this makes you do a little extra manual "moves" that you might not be happy about
My answer is obvious and you most likely have this already as an option, but wanted to mention
This is not ideal solution. But it seems to do the job.
In my previous query I passed the IDs List as a parameter in an external process that constructed the query. I wanted this process to be unaware to any logic implemented in the query.
Eventually we came up with this solution:
Instead of passing a list of IDs, we pass a JSON that contains the relevant meta data for each ID. We parse this JSON within the Table_Query() function. So instead of querying a physical reference table, we query some sort of a "table variable" that we have put in a JSON.
Below is a sample query that runs on the public dataset that demonstrates this solution.
SELECT
YEAR,
COUNT (*) CNT
FROM
TABLE_QUERY([fh-bigquery:weather_gsod], 'table_id in
(Select table_id
From
(Select table_id,concat(Right(table_id,4),"0101") as TBL_Date from [fh-bigquery:weather_gsod.__TABLES_SUMMARY__]
where table_id Contains "gsod"
)TBLs
CROSS JOIN
(select
Regexp_Replace(Regexp_extract(SPLIT(DatesInput,"},{"),r"\"fromDate\":\"(\d\d\d\d-\d\d-\d\d)\""),"-","") as fromDate,
Regexp_Replace(Regexp_extract(SPLIT(DatesInput,"},{"),r"\"toDate\":\"(\d\d\d\d-\d\d-\d\d)\""),"-","") as toDate,
FROM
(Select
"[
{
\"CycleID\":\"123456\",
\"fromDate\":\"1929-01-01\",
\"toDate\":\"1950-01-10\"
},{
\"CycleID\":\"123456\",
\"fromDate\":\"1970-02-01\",
\"toDate\":\"2000-02-10\"
}
]"
as DatesInput)) RefDates
WHERE TBLs.TBL_Date>=RefDates.fromDate
AND TBLs.TBL_Date<=RefDates.toDate
)')
GROUP BY
YEAR
ORDER BY
YEAR
This solution is not ideal as it requires an external process to be aware of the data stored in the reference tables.
Ideally the BigQuery team will re-enable this very useful functionality.

5+ Intermediate SQL Tables to Arrive at Desired Table, Postgres

I am generating reports on electoral data that group voters into their age groups, and then assign those age groups a quartile, before finally returning the table of age groups and quartiles.
By the time I arrive at the table with the schema and data that I want, I have created 7 intermediate tables that might as well be deleted at this point.
My question is, is it plausible that so many intermediate tables are necessary? Or this a sign that I am "doing it wrong?"
Technical Specifics:
Postgres 9.4
I am chaining tables, starting with the raw database tables and successively transforming the table closer to what I want. For instance, I do something like:
CREATE TABLE gm.race_code_and_turnout_count AS
SELECT race_code, count(*)
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
And then I do
CREATE TABLE gm.race_code_and_percent_of_total_turnout AS
SELECT race_code, count, round((count::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.race_code_and_turnout_count
And that first table goes off in a second branch:
CREATE TABLE gm.race_code_and_turnout_percentage AS
SELECT t1.race_code, round((t1.count::numeric / t2.count)*100,2) as turnout_percentage
FROM gm.race_code_and_turnout_count AS t1
JOIN gm.race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
So each table is building on the one before it.
While temporary tables are used a lot in SQL Server (mainly to overcome the peculiar locking behaviour that it has) it is far less common in Postgres (and your example uses regular tables, not temporary tables).
Usually the overhead of creating a new table is higher than letting the system store intermediate on disk.
From my experience, creating intermediate tables usually only helps if:
you have a lot of data that is aggregated and can't be aggregated in memory
the aggregation drastically reduces the data volume to be processed so that the next step (or one of the next steps) can handle the data in memory
you can efficiently index the intermediate tables so that the next step can make use of those indexes to improve performance.
you re-use a pre-computed result several times in different steps
The above list is not completely and using this approach can also be beneficial if only some of these conditions are true.
If you keep creating those tables create them at least as temporary or unlogged tables to minimized the IO overhead that comes with writing that data and thus keep as much data in memory as possible.
However I would always start with a single query instead of maintaining many different tables (that all need to be changed if you have to change the structure of the report).
For example your first two queries from your question can easily be combined into a single query with no performance loss:
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code;
This is going to be faster than writing the data twice to disk (including all transactional overhead).
If you stack your queries using common table expressions Postgres will automatically store the data on disk if it gets too big, if not it will process it in-memory. When manually creating the tables you force Postgres to write everything to disk.
So you might want to try something like this:
with race_code_and_turnout_count as (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
), race_code_and_total_count as (
select ....
from ....
), race_code_and_turnout_percentage as (
SELECT t1.race_code,
round((t1.count::numeric / t2.count)*100,2) as turnout_percentage
FROM ace_code_and_turnout_count AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
)
select *
from ....;
and see how that performs.
If you don't re-use the intermediate steps more than once, writing them as a derived table instead of a CTE might be faster in Postgres due to the way the optimizer works, e.g.:
SELECT t1.race_code,
round((t1.count::numeric / t2.count)*100,2) as turnout_percentage
FROM (
SELECT race_code,
count(*) as cnt,
round((count(*)::numeric/11362)*100,2) AS percent_of_total_turnout
FROM gm.active_dem_voters_34th_house_in_2012_primary
GROUP BY race_code
) AS t1
JOIN race_code_and_total_count AS t2
ON t1.race_code = t2.race_code
If it performs well and results in the right output, I see nothing wrong with it. I do however suggest to use (local) temporary tables if you need intermediate tables.
Your series of queries can always be optimized to use fewer intermediate steps. Do that if you feel your reports start performing poorly.

SQL "WITH" Performance and Temp Table (possible "Query Hint" to simplify)

Given the example queries below (Simplified examples only)
DECLARE #DT int; SET #DT=20110717; -- yes this is an INT
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=#DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
and ...
DECLARE #DT int; SET #DT=20110717;
BEGIN TRY DROP TABLE #LargeData END TRY BEGIN CATCH END CATCH; -- dump any possible table.
SELECT * -- This is a MASSIVE table indexed on dt field
INTO #LargeData -- put smaller results into temp
FROM mydata
WHERE dt=#DT;
WITH Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM #LargeData
)
SELECT * FROM Ordered
Both produce the same results, which is a limited and ranked list of values from a list based on a fields data.
When these queries get considerably more complicated (many more tables, lots of criteria, multiple levels of "with" table alaises, etc...) the bottom query executes MUCH faster then the top one. Sometimes in the order of 20x-100x faster.
The Question is...
Is there some kind of query HINT or other SQL option that would tell the SQL Server to perform the same kind of optimization automatically, or other formats of this that would involve a cleaner aproach (trying to keep the format as much like query 1 as possible) ?
Note that the "Ranking" or secondary queries is just fluff for this example, the actual operations performed really don't matter too much.
This is sort of what I was hoping for (or similar but the idea is clear I hope). Remember this query below does not actually work.
DECLARE #DT int; SET #DT=20110717;
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=#DT
**OPTION (USE_TEMP_OR_HARDENED_OR_SOMETHING) -- EXAMPLE ONLY**
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
EDIT: Important follow up information!
If in your sub query you add
TOP 999999999 -- improves speed dramatically
Your query will behave in a similar fashion to using a temp table in a previous query. I found the execution times improved in almost the exact same fashion. WHICH IS FAR SIMPLIER then using a temp table and is basically what I was looking for.
However
TOP 100 PERCENT -- does NOT improve speed
Does NOT perform in the same fashion (you must use the static Number style TOP 999999999 )
Explanation:
From what I can tell from the actual execution plan of the query in both formats (original one with normal CTE's and one with each sub query having TOP 99999999)
The normal query joins everything together as if all the tables are in one massive query, which is what is expected. The filtering criteria is applied almost at the join points in the plan, which means many more rows are being evaluated and joined together all at once.
In the version with TOP 999999999, the actual execution plan clearly separates the sub querys from the main query in order to apply the TOP statements action, thus forcing creation of an in memory "Bitmap" of the sub query that is then joined to the main query. This appears to actually do exactly what I wanted, and in fact it may even be more efficient since servers with large ammounts of RAM will be able to do the query execution entirely in MEMORY without any disk IO. In my case we have 280 GB of RAM so well more then could ever really be used.
Not only can you use indexes on temp tables but they allow the use of statistics and the use of hints. I can find no refernce to being able to use the statistics in the documentation on CTEs and it says specifically you cann't use hints.
Temp tables are often the most performant way to go when you have a large data set when the choice is between temp tables and table variables even when you don't use indexes (possobly because it will use statistics to develop the plan) and I might suspect the implementation of the CTE is more like the table varaible than the temp table.
I think the best thing to do though is see how the excutionplans are different to determine if it is something that can be fixed.
What exactly is your objection to using the temp table when you know it performs better?
The problem is that in the first query SQL Server query optimizer is able to generate a query plan. In the second query a good query plan isn't able to be generated because you're inserting the values into a new temporary table. My guess is there is a full table scan going on somewhere that you're not seeing.
What you may want to do in the second query is insert the values into the #LargeData temporary table like you already do and then create a non-clustered index on the "valuefield" column. This might help to improve your performance.
It is quite possible that SQL is optimizing for the wrong value of the parameters.
There are a couple of options
Try using option(RECOMPILE). There is a cost to this as it recompiles the query every time but if different plans are needed it might be worth it.
You could also try using OPTION(OPTIMIZE FOR #DT=SomeRepresentatvieValue) The problem with this is you pick the wrong value.
See I Smell a Parameter! from The SQL Server Query Optimization Team blog

Is there efficient SQL to query a portion of a large table

The typical way of selecting data is:
select * from my_table
But what if the table contains 10 million records and you only want records 300,010 to 300,020
Is there a way to create a SQL statement on Microsoft SQL that only gets 10 records at once?
E.g.
select * from my_table from records 300,010 to 300,020
This would be way more efficient than retrieving 10 million records across the network, storing them in the IIS server and then counting to the records you want.
SELECT * FROM my_table is just the tip of the iceberg. Assuming you're talking a table with an identity field for the primary key, you can just say:
SELECT * FROM my_table WHERE ID >= 300010 AND ID <= 300020
You should also know that selecting * is considered poor practice in many circles. They want you specify the exact column list.
Try looking at info about pagination. Here's a short summary of it for SQL Server.
Absolutely. On MySQL and PostgreSQL (the two databases I've used), the syntax would be
SELECT [columns] FROM table LIMIT 10 OFFSET 300010;
On MS SQL, it's something like SELECT TOP 10 ...; I don't know the syntax for offsetting the record list.
Note that you never want to use SELECT *; it's a maintenance nightmare if anything ever changes. This query, though, is going to be incredibly slow since your database will have to scan through and throw away the first 300,010 records to get to the 10 you want. It'll also be unpredictable, since you haven't told the database which order you want the records in.
This is the core of SQL: tell it which 10 records you want, identified by a key in a specific range, and the database will do its best to grab and return those records with minimal work. Look up any tutorial on SQL for more information on how it works.
When working with large tables, it is often a good idea to make use of Partitioning techniques available in SQL Server.
The rules of your partitition function typically dictate that only a range of data can reside within a given partition. You could split your partitions by date range or ID for example.
In order to select from a particular partition you would use a query similar to the following.
SELECT <Column Name1>…/*
FROM <Table Name>
WHERE $PARTITION.<Partition Function Name>(<Column Name>) = <Partition Number>
Take a look at the following white paper for more detailed infromation on partitioning in SQL Server 2005.
http://msdn.microsoft.com/en-us/library/ms345146.aspx
I hope this helps however please feel free to pose further questions.
Cheers, John
I use wrapper queries to select the core query and then just isolate the ROW numbers that i wish to take from the query - this allows the SQL server to do all the heavy lifting inside the CORE query and just pass out the small amount of the table that i have requested. All you need to do is pass the [start_row_variable] and the [end_row_variable] into the SQL query.
NOTE: The order clause is specified OUTSIDE the core query [sql_order_clause]
w1 and w2 are TEMPORARY table created by the SQL server as the wrapper tables.
SELECT
w1.*
FROM(
SELECT w2.*,
ROW_NUMBER() OVER ([sql_order_clause]) AS ROW
FROM (
<!--- CORE QUERY START --->
SELECT [columns]
FROM [table_name]
WHERE [sql_string]
<!--- CORE QUERY END --->
) AS w2
) AS w1
WHERE ROW BETWEEN [start_row_variable] AND [end_row_variable]
This method has hugely optimized my database systems. It works very well.
IMPORTANT: Be sure to always explicitly specify only the exact columns you wish to retrieve in the core query as fetching unnecessary data in these CORE queries can cost you serious overhead
Use TOP to select only a limited amont of rows like:
SELECT TOP 10 * FROM my_table WHERE ID >= 300010
Add an ORDER BY if you want the results in a particular order.
To be efficient there has to be an index on the ID column.