Partition elimination in Greenplum

Partition elimination in Greenplum - sql

I have a scenario like this:
SELECT * FROM PACKAGE WHERE PACKAGE_TYPE IN ('BOX','CARD')
The table is partitioned by PACKAGE_TYPE field. Assume that there are twenty possible values for PACKAGE_TYPE field. So there are twenty partitions including BOX, CARD and DEFAULT partitions. When the above query is run, partition elimination happens correctly and only the BOX and CARDpartitions get scanned. The result is quick.
However, when the same query is written like this:
SELECT * FROM PACKAGE WHERE PACKAGE_TYPE IN (SELECT PACKAGE_TYPE FROM PACKAGE_LIST_TABLE), where the column PACKAGE_TYPE in PACKAGE_LIST_TABLE contains two values BOX and CARD.
When the above query is run, all the 20 partitions are being scanned. It degrades the performance.
It seems that the compiler is failing to identify the second query correctly and as a result all the partitions are getting accessed.
Any workarounds to overcome this?
Thanks in advance.

The Postgres manual page on Partitioning includes this caveat
Constraint exclusion only works when the query's WHERE clause contains constants (or externally supplied parameters). For example, a comparison against a non-immutable function such as CURRENT_TIMESTAMP cannot be optimized, since the planner cannot know which partition the function value might fall into at run time.
In order to eliminate a seek on a partition, Postgres must know when creating a query plan that no rows from that partition are relevant. In your query, this occurs only after the sub-query has completed, so the query would have to be split into two, with the second part planned only after the first completes.
If the partitions include an index on the partitioned column (PACKAGE_TYPE) as well as a constraint, the planner may elect to use an index scan on each partition, leading to the incorrect partitions being reasonably efficiently eliminated at runtime anyway. (That is, there would be 20 index scans, but each would require very little resource.)
An alternative would be to split the query yourself, and build the SQL dynamically. Since the SELECT PACKAGE_TYPE FROM PACKAGE_LIST_TABLE can only ever return up to 20 distinct values, you could select those into an array/set in your application or a user-defined function. Then you can pass these in as literals in the IN ( ... ) clause as in your first example (or equivalently = ANY(array_expression)), and achieve the partition elimination.

Related

SQL : Can WHERE clause increase a SELECT DISTINCT query's speed?

So here's the specific situation: I have primary unique indexed keys set to each entry in the database, but each row has a secondID referring to an attribute of the entry, and as such, the secondIDs are not unique. There is also another attribute of these rows, let's call it isTitle, which is NULL by default, but each group of entries with the same secondID have at least one entry with 1 isTitle value.
Considering the conditions above, would a WHERE clause increase the processing speed of the query or not? See the following:
SELECT DISTINCT secondID FROM table;
vs.
SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
EDIT:
The first query without the WHERE clause is faster, but could someone explain me why? Algorithmically the process should be faster with having only one 'if' in the cycle, no?

In general, to benchmark performances of queries, you usually use queries that gives you the execution plan the query they receive in input (Every small step that the engine is performing to solve your request).
You are not mentioning your database engine (e.g. PostgreSQL, SQL Server, MySQL), but for example in PostgreSQL the query is the following:
EXPLAIN SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
Going back to your question, since the isTitle is not indexed, I think the first action the engine will do is a full scan of the table to check that attribute and then perform the SELECT hence, in my opinion, the first query:
SELECT DISTINCT secondID FROM table;
will be faster.
If you want to optimize it, you can create an index on isTitle column. In such scenario, the query with the WHERE clause will become faster.

This is a very hard question to answer, particularly without specifying the database. Here are three important considerations:
Will the database engine use the index on secondID for select distinct? Any decent database optimizer should, but that doesn't mean that all do.
How wide is the table relative to the index? That is, is scanning the index really that much faster than scanning the table?
What is the ratio of isTitle = 1 to all rows with the same value of secondId?
For the first query, there are essentially two ways to process the query:
Scan the index, taking each unique value as it comes.
Scan the table, sort or hash the table, and choose the unique values.
If it is not obvious, (1) is much faster than (2), except perhaps in trivial cases where there are a small number of rows.
For the second query, the only real option is:
Scan the table, filter out the non-matching values, sort or hash the table, and choose the unique values.
The key issues here are how much data needs to be scanned and how much is filtered out. It is even possible -- if you had, say, zillions of rows per secondaryId, no additional columns, and small number of values -- that this might be comparable or slightly faster than (1) above. There is a little overhead for scanning an index and sorting a small amount of data is often quite fast.
And, this method is almost certainly faster than (2).
As mentioned in the comments, you should test the queries on your system with your data (use a reasonable amount of data!). Or, update the table statistics and learn to read execution plans.

Where clause is slowing my query from 2 seconds to 24 seconds

I am trying to write a simple query to count the results from a big table.
SELECT COUNT(*)
FROM DM.DM_CUSTOMER_SEG_BRIDGE_CORP_DW AL3
WHERE (AL3.REFERENCE_YEAR(+) =2012)
Above query is taking around 24 seconds to return me output. If I remove where clause and execute same query, it is giving me result in 2 seconds.
May i know what is the reason for that. I am relatively new to SQL queries.
Please help
Thanks,
Naveen

You might need an index on the table. Typically you will need an index on any columns used in the where clause
as for the (+) syntax I think it is redundant (i'm no Oracle expert) but see Difference between Oracle's plus (+) notation and ansi JOIN notation?

The reason may seem subtle. But there are multiple ways that Oracle could approach a query like this:
SELECT COUNT(*)
FROM DM.DM_CUSTOMER_SEG_BRIDGE_CORP_DW AL3
One way is to read all the rows in the table. Because this is a big table, that is not the most efficient approach. A second method would be to use statistics of some sort, where the number of rows are in the statistics. I don't think Oracle ever does this, but it is conceivable.
The final method is to read an index. Typically, an index would be much smaller than the table and it might already be in memory. The above query would be reading a much smaller amount of data. (Here is an interesting article on counting all the rows in a table.)
When you introduce the where clause,
WHERE (AL3.REFERENCE_YEAR(+) =2012)
Oracle can no longer scan just any index. It would have to scan the reference_year index. What is the problem? If it scanned an index, it would still need to fetch the data records to get the value of reference_year -- and that is equivalent (actually worse) than scanning the whole table.
Even with an index on reference_year, you are not guaranteed to use the index. The problem is something called selectivity. The number of rows that you are fetching may still be quite large, relative to the number of rows in the database (in this context, 10% is "quite large"). The Oracle optimize may choose to do a full table scan rather than read the index.

Oracle partition pruning with NLS_COMP = Linguistic

Oracle 10g.
We have a large table partitioned by a varchar2 column (if it were up to me, it wouldn't be this column, but it is) with each partition having a single value. Ex. PARTITION "PARTION1" VALUES ('C').
We also have NLS_COMP = LINGUISTIC.
Partition pruning, when indicating a value in that column, doesn't work.
SELECT * from table1 where column_partitioned_by = 'C'
That does a full table scan on all partitions and not only the relevant one.
According to the docs here, "The NLS_COMP parameter does not affect comparison behavior for partitioned tables."
If I issue:
ALTER SESSION SET NLS_COMP = BINARY
And then:
SELECT * from table1 where column_partitioned_by = 'C'
it does correctly prune the partitions down. (I'm basing the prune/not prune off of the plans generated)
Is there anything, short of hardcoding partition names into the from clause, that would work here?
Additionally, changing the partition definition is out as well. I'm in the minority on my team as even seeing this as a problem. Before I got there, the previous team decided it would "solve" this problem by sending all application sql queries through a string-find-and-replace that adds hardcoded partition names in the FROM clause and has somebody manually update partition names in stored procs as needed...but it will break one day and it will break hard. I'm trying to find the least invasive approach but I'm afraid there may not be one.
Preferably, it would be a solution that only changing queries themselves and not the underlying db structure. Like I said, this solution simply may not exist...

Some solutions to prototype:
The CAST function. You can partition by an expression; the downside is your application would have to provide a similar expression.
Partition on NLS_SORT(column_partitioned_by, 'NLSSORT=BINARY'). Again, application changes required.
Converting column_partitioned_by to a numeric value, possibly using a code table to transform between the two. You'd have to include a join to that table throughout the application, though.

effect of number of projections on query performance

I am looking to improve the performance of a query which selects several columns from a table. was wondering if limiting the number of columns would have any effect on performance of the query.

Reducing the number of columns would, I think, have only very limited effect on the speed of the query but would have a potentially larger effect on the transfer speed of the data. The less data you select, the less data that would need to be transferred over the wire to your application.

I might be misunderstanding the question, but here goes anyway:
The absolute number of columns you select doesn't make a huge difference. However, which columns you select can make a significant difference depending on how the table is indexed.
If you are selecting only columns that are covered by the index, then the DB engine can use just the index for the query without ever fetching table data. If you use even one column that's not covered, though, it has to fetch the entire row (key lookup) and this will degrade performance significantly. Sometimes it will kill performance so much that the DB engine opts to do a full scan instead of even bothering with the index; it depends on the number of rows being selected.
So, if by removing columns you are able to turn this into a covering query, then yes, it can improve performance. Otherwise, probably not. Not noticeably anyway.
Quick example for SQL Server 2005+ - let's say this is your table:
ID int NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
Name varchar(50) NOT NULL,
Status tinyint NOT NULL
If we create this index:
CREATE INDEX IX_MyTable
ON MyTable (Name)
Then this query will be fast:
SELECT ID
FROM MyTable
WHERE Name = 'Aaron'
But this query will be slow(er):
SELECT ID, Name, Status
FROM MyTable
WHERE Name = 'Aaron'
If we change the index to a covering index, i.e.
CREATE INDEX IX_MyTable
ON MyTable (Name)
INCLUDE (Status)
Then the second query becomes fast again because the DB engine never needs to read the row.

Limiting the number of columns has no measurable effect on the query. Almost universally, an entire row is fetched to cache. The projection happens last in the SQL pipeline.
The projection part of the processing must happen last (after GROUP BY, for instance) because it may involve creating aggregates. Also, many columns may be required for JOIN, WHERE and ORDER BY processing. More columns than are finally returned in the result set. It's hardly worth adding a step to the query plan to do projections to somehow save a little I/O.
Check your query plan documentation. There's no "project" node in the query plan. It's a small part of formulating the result set.
To get away from "whole row fetch", you have to go for a columnar ("Inverted") database.

It can depend on the server you're dealing with (and, in the case of MySQL, the storage engine). Just for example, there's at least one MySQL storage engine that does column-wise storage instead of row-wise storage, and in this case more columns really can take more time.
The other major possibility would be if you had segmented your table so some columns were stored on one server, and other columns on another (aka vertical partitioning). In this case, retrieving more columns might involve retrieving data from different servers, and it's always possible that the load is imbalanced so different servers have different response times. Of course, you usually try to keep the load reasonably balanced so that should be fairly unusual, but it's still possible (especially if, for example, if one of the servers handles some other data whose usage might vary independently from the rest).

yes, if your query can be covered by a non clustered index it will be faster since all the data is already in the index and the base table (if you have a heap) or clustered index does not need to be touched by the optimizer

To demonstrate what tvanfosson has already written, that there is a "transfer" cost I ran the following two statements on a MSSQL 2000 DB from query analyzer.
SELECT datalength(text) FROM syscomments
SELECT text FROM syscomments
Both results returned 947 rows but the first one took 5 ms and the second 973 ms.
Also because the fields are the same I would not expect indexing to factor here.

Why do SQL statements take so long when "limited"?

consider the following pgSQL statement:
SELECT DISTINCT some_field
FROM some_table
WHERE some_field LIKE 'text%'
LIMIT 10;
Consider also, that some_table consists of several million records, and that some_field has a b-tree index.
Why does the query take so long to execute (several minutes)? What I mean is, why doesnt it loop through creating the result set, and once it gets 10 of them, return the result? It looks like the execution time is the same, regardless of whether or not you include a 'LIMIT 10' or not.
Is this correct or am I missing something? Is there anything I can do to get it to return the first 10 results and 'screw' the rest?
UPDATE: If you drop the distinct, the results are returned virtually instantaneously. I do know however, that many of the some_table records are fairly unique already, and certianly when I run the query without he distinct declaration, the first 10 results are in fact unique. I also eliminated the where clause (eliminating it as a factor). So, my original question still remains, why isnt it terminating as soon as 10 matches are found?

You have a DISTINCT. This means that to find 10 distinct rows, it's necessary to scan all rows that match the predicate until 10 different some_fields are found.
Depending on your indices, the query optimizer may decide that scanning all rows is the best way to do this.
10 distinct rows could represent 10, a million, an infinity of non-distinct rows.

Can you post the results of running EXPLAIN on the query? This will reveal what Postgres is doing to execute the query, and is generally the first step in diagnosing query performance problems.
It may be sorting or constructing a hash table of the entire rowset to eliminate the non-distinct records before returning the first row to the LIMIT operator. It makes sense that the engine should be able to read a fraction of the records, returning one new distinct at a time until the LIMIT clause has satisfied its 10 quota, but there may not be an operator implemented to make that work.
Is the some_field unique? If not, it would be useless in locating distinct records. If it is, then the DISTINCT clause would be unnecessary, since that index already guarantees that each row is unique on some_field.

Any time there's an operation that involves aggregation, and "DISTINCT" certainly qualifies, the optimizer is going to do the aggration before even thinking about what's next. And aggration means scanning the whole table (in your case involving a sort, unless there's an index).
But the most likely deal-breaker is that you are grouping on an operation on a column, rather than a plain column value. The optimizer generally disregards a number of possible operations once you are operating on a column transformation of some kind. It's quite possibly not smart enough to know that the ordering of "LIKE 'text%'" and "= 'text'" is the same for grouping purposes.
And remember, you're doing an aggregation on an operation on a column.

how big is the table? do you have any indexes on the table? check your query execution plan to determine if it's doing a table scan, an index scan, or an index seek. if it's doing a table scan then you most likely dont have any indexes.
try putting an index on the field your filtering by and/or the field your selecting.

I'm suspicious it's because you don't have an ORDER BY. Without ordering, you might have to cruise a whole lot of records to get 10 that meet your criterion.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas