Oracle partition pruning with NLS_COMP = LINGUISTIC

Oracle 10g.
We have a large table partitioned by a varchar2 column (if it were up to me, it wouldn't be this column, but it is) with each partition having a single value. Ex. PARTITION "PARTION1" VALUES ('C').
We also have NLS_COMP = LINGUISTIC.
Partition pruning, when indicating a value in that column, doesn't work.
SELECT * from table1 where column_partitioned_by = 'C'
That does a full table scan on all partitions and not only the relevant one.
According to the docs here, "The NLS_COMP parameter does not affect comparison behavior for partitioned tables."
If I issue:
ALTER SESSION SET NLS_COMP = BINARY
And then:
SELECT * from table1 where column_partitioned_by = 'C'
it does correctly prune the partitions down. (I'm basing the prune/not prune off of the plans generated)
Is there anything, short of hardcoding partition names into the from clause, that would work here?
Additionally, changing the partition definition is out as well. I'm in the minority on my team as even seeing this as a problem. Before I got there, the previous team decided it would "solve" this problem by sending all application sql queries through a string-find-and-replace that adds hardcoded partition names in the FROM clause and has somebody manually update partition names in stored procs as needed...but it will break one day and it will break hard. I'm trying to find the least invasive approach but I'm afraid there may not be one.
Preferably, it would be a solution that only changes the queries themselves and not the underlying db structure. Like I said, this solution simply may not exist...

Some solutions to prototype:
The CAST function. You can partition by an expression; the downside is your application would have to provide a similar expression.
Partition on NLSSORT(column_partitioned_by, 'NLS_SORT=BINARY'). Again, application changes required.
Converting column_partitioned_by to a numeric value, possibly using a code table to transform between the two. You'd have to include a join to that table throughout the application, though.

Related

Partition elimination in Greenplum

I have a scenario like this:
SELECT * FROM PACKAGE WHERE PACKAGE_TYPE IN ('BOX','CARD')
The table is partitioned by the PACKAGE_TYPE field. Assume that there are twenty possible values for the PACKAGE_TYPE field, so there are twenty partitions, including the BOX, CARD and DEFAULT partitions. When the above query is run, partition elimination happens correctly and only the BOX and CARD partitions get scanned. The result is quick.
However, when the same query is written like this:
SELECT * FROM PACKAGE WHERE PACKAGE_TYPE IN (SELECT PACKAGE_TYPE FROM PACKAGE_LIST_TABLE), where the column PACKAGE_TYPE in PACKAGE_LIST_TABLE contains two values BOX and CARD.
When the above query is run, all the 20 partitions are being scanned. It degrades the performance.
It seems that the compiler is failing to identify the second query correctly and as a result all the partitions are getting accessed.
Any workarounds to overcome this?
Thanks in advance.
The Postgres manual page on Partitioning includes this caveat:
Constraint exclusion only works when the query's WHERE clause contains constants (or externally supplied parameters). For example, a comparison against a non-immutable function such as CURRENT_TIMESTAMP cannot be optimized, since the planner cannot know which partition the function value might fall into at run time.
In order to eliminate a seek on a partition, Postgres must know when creating a query plan that no rows from that partition are relevant. In your query, this occurs only after the sub-query has completed, so the query would have to be split into two, with the second part planned only after the first completes.
If the partitions include an index on the partitioned column (PACKAGE_TYPE) as well as a constraint, the planner may elect to use an index scan on each partition, leading to the incorrect partitions being reasonably efficiently eliminated at runtime anyway. (That is, there would be 20 index scans, but each would require very little resource.)
An alternative would be to split the query yourself, and build the SQL dynamically. Since the SELECT PACKAGE_TYPE FROM PACKAGE_LIST_TABLE can only ever return up to 20 distinct values, you could select those into an array/set in your application or a user-defined function. Then you can pass these in as literals in the IN ( ... ) clause as in your first example (or equivalently = ANY(array_expression)), and achieve the partition elimination.
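The "split the query yourself" suggestion above can be sketched roughly as follows. This is a minimal illustration in Python against an in-memory SQLite database, which has no partitions at all; it only shows the two-step shape (resolve the short list of keys first, then issue the main query with an explicit IN list). The function name fetch_packages and the id column are hypothetical, and whether bound parameters or inlined literals are needed to trigger partition elimination on Greenplum is an assumption to verify against its query plans.

```python
import sqlite3


def fetch_packages(conn):
    """Two-step lookup: resolve the key list first, then query with an explicit IN list."""
    # Step 1: read the (at most ~20) distinct partition keys up front.
    types = [row[0] for row in conn.execute(
        "SELECT DISTINCT package_type FROM package_list_table ORDER BY package_type")]
    if not types:
        return []
    # Step 2: one placeholder per value, so the main query is planned against
    # known values instead of an unresolved sub-query.
    placeholders = ", ".join("?" for _ in types)
    sql = ("SELECT id, package_type FROM package "
           f"WHERE package_type IN ({placeholders}) ORDER BY id")
    return conn.execute(sql, types).fetchall()


# Hypothetical stand-in schema; the real PACKAGE table is partitioned.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE package (id INTEGER PRIMARY KEY, package_type TEXT);
    CREATE TABLE package_list_table (package_type TEXT);
    INSERT INTO package (id, package_type) VALUES (1, 'BOX'), (2, 'CARD'), (3, 'CRATE');
    INSERT INTO package_list_table VALUES ('BOX'), ('CARD');
""")
rows = fetch_packages(conn)
print(rows)
```

The price of this approach is an extra round trip and a small window in which PACKAGE_LIST_TABLE can change between the two statements.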

optimize query with column in where clause

I have an SQL query which fetches the first N rows of a table that is designed as a low-level queue.
select top N * from my_table where status = 0 order by date asc
The intention behind this query is as follows:
First, this question is intended to be database agnostic, as my implementation will support sql server, oracle, DB2 and sybase. The sql syntax above of "top N" is just an example.
The table can contain millions of rows.
N is a relatively small number in comparison, e.g. 100.
status is 0 when the row is in the queue. Later it is changed to 1 to indicate that it is in processing. After processing it is deleted. So it is expected that at least 90% of the rows in the table will be with status 0.
rows in the table should be fetched according to their date, hence the order by clause.
What is the optimal index to make this query run fastest?
I initially thought the index should be on (date, status), but I am not sure about it anymore. Since the status column will contain mostly zeros, is there an added-value to it? Will it be sufficient to index by (date) alone?
Or maybe it should be (status, date)?
I don't think there is an efficient solution that will be RDBMS-independent. For example, Oracle has bitmap indexes and SQL Server has filtered (partial) indexes, and I don't see a reason not to use them even if, for instance, MySQL or SQLite has nothing similar. Also, historically SQL Server implements clustered tables (IOTs in the Oracle world) much better than Oracle does, so a clustered index on the date column may work perfectly for SQL Server but not for Oracle.
I'd rather change the approach a bit. If you say 90% of rows don't satisfy the status=0 condition, why not try refactoring the schema and adding a new table (or materialized view) that holds only the records you are interested in? The number of new programmable objects required to keep that table up to date and merge its data with the original table is relatively small, even if the RDBMS doesn't support materialized views directly. Also, if it's possible to redesign the underlying logic so that rows are never updated, only inserted or deleted, that will help avoid lock contention, and as a result the whole system will perform better.
Have a clustered index on Date and a non clustered index on Status.
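One way to settle the (date, status) versus (status, date) question for a given engine is to ask its planner. The sketch below does that for SQLite only, using Python's sqlite3 module; the column names queued_at and payload and the index name are hypothetical, LIMIT stands in for "top N", and other engines may well choose differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (id INTEGER PRIMARY KEY, status INTEGER NOT NULL,
                           queued_at TEXT NOT NULL, payload TEXT);
    -- Equality column first, ordering column second.
    CREATE INDEX ix_status_queued ON my_table (status, queued_at);
""")
conn.executemany(
    "INSERT INTO my_table (status, queued_at, payload) VALUES (?, ?, ?)",
    [(1 if i % 10 == 0 else 0, f"2024-01-{(i % 28) + 1:02d}", "x") for i in range(1000)])

# The fourth column of each EXPLAIN QUERY PLAN row holds the readable detail text.
plan = " | ".join(row[3] for row in conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM my_table WHERE status = 0 ORDER BY queued_at LIMIT 100"))
print(plan)
```

With status as the leading column, the index entries for status = 0 are already ordered by queued_at, so the plan should show an index search without a separate sort step. When nearly every row has status 0, an index on the date column alone gives much the same ordering benefit, which is the trade-off the question asks about.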

select * vs select column

If I just need 2/3 columns and I query SELECT * instead of providing those columns in select query, is there any performance degradation regarding more/less I/O or memory?
The network overhead might be present if I do select * without a need.
But in a select operation, does the database engine always pull atomic tuple from the disk, or does it pull only those columns requested in the select operation?
If it always pulls a tuple then I/O overhead is the same.
At the same time, there might be a memory consumption for stripping out the requested columns from the tuple, if it pulls a tuple.
So if that's the case, select someColumn will have more memory overhead than that of select *
There are several reasons you should never (never ever) use SELECT * in production code:
since you're not giving your database any hints as to what you want, it will first need to check the table's definition in order to determine the columns on that table. That lookup will cost some time - not much in a single query - but it adds up over time
if you need only 2/3 of the columns, you're selecting 1/3 too much data, which needs to be retrieved from disk and sent across the network
if you start to rely on certain aspects of the data, e.g. the order of the columns returned, you could get a nasty surprise once the table is reorganized and new columns are added (or existing ones removed)
in SQL Server (not sure about other databases), if you need a subset of columns, there's always a chance a non-clustered index might be covering that request (contain all columns needed). With a SELECT *, you're giving up on that possibility right from the get-go. In this particular case, the data would be retrieved from the index pages (if those contain all the necessary columns) and thus disk I/O and memory overhead would be much less compared to doing a SELECT *.... query.
Yes, it takes a bit more typing initially (tools like SQL Prompt for SQL Server will even help you there) - but this is really one case where there's a rule without any exception: do not ever use SELECT * in your production code. EVER.
It always pulls a tuple (except in cases where the table has been vertically segmented, i.e. broken up into column pieces), so, to answer the question you asked, it doesn't matter from a performance perspective. However, for many other reasons (below), you should always select specifically those columns you want, by name.
It always pulls a tuple because (in every vendor's RDBMS I am familiar with) the underlying on-disk storage structure for everything, including table data, is based on defined I/O pages (in SQL Server, for example, each page is 8 kilobytes), and every I/O read or write is by page. That is, every write or read is a complete page of data.
A consequence of this underlying structural constraint is that each row of data in a database must always be on one and only one page. It cannot span multiple pages of data (except for special things like blobs, where the actual blob data is stored in separate page-chunks and the table row column then only holds a pointer...). But these exceptions are just that, exceptions, and generally do not apply except in special cases (for special types of data, or certain optimizations for special circumstances).
Even in these special cases, generally, the actual table row of data itself (which contains the pointer to the actual data for the Blob, or whatever), it must be stored on a single IO Page...
EXCEPTION. The only place where Select * is OK, is in the sub-query after an Exists or Not Exists predicate clause, as in:
Select colA, colB
From table1 t1
Where Exists (Select * From Table2
Where column = t1.colA)
EDIT: To address @Mike Sherer's comment: yes, it is true, both technically (with a bit of definition for your special case) and aesthetically. First, even when the set of columns requested is a subset of those stored in some index, the query processor must fetch every column stored in that index, not just the ones requested, for the same reasons: ALL I/O must be done in pages, and index data is stored in I/O pages just like table data. So if you define "tuple" for an index page as the set of columns stored in the index, the statement is still true.
And the statement is true aesthetically because the point is that it fetches data based on what is stored in the I/O page, not on what you ask for, and this is true whether you are accessing the base table I/O page or an index I/O page.
For other reasons not to use Select *, see Why is SELECT * considered harmful? :
You should always only select the columns that you actually need. It is never less efficient to select less instead of more, and you also run into fewer unexpected side effects - like accessing your result columns on client side by index, then having those indexes become incorrect by adding a new column to the table.
[edit]: Meant accessing. Stupid brain still waking up.
Unless you're storing large blobs, performance isn't a concern. The big reason not to use SELECT * is that if you're using returned rows as tuples, the columns come back in whatever order the schema happens to specify, and if that changes you will have to fix all your code.
On the other hand, if you use dictionary-style access then it doesn't matter what order the columns come back in because you are always accessing them by name.
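As a rough illustration of the difference, here is a sketch using Python's sqlite3 module, whose sqlite3.Row factory allows both positional and by-name access. The users table, its columns and the sample values are hypothetical; the two schemas stand for the same table before and after someone adds a column in the middle.

```python
import sqlite3


def telephone_of(conn, name):
    """Return the telephone read by column name, and whatever sits at position 1."""
    conn.row_factory = sqlite3.Row  # rows now support row["column"] as well as row[i]
    row = conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchone()
    return row["telephone"], row[1]


def make_db(schema, values):
    conn = sqlite3.connect(":memory:")
    conn.execute(schema)
    conn.execute(f"INSERT INTO users VALUES ({', '.join('?' for _ in values)})", values)
    return conn


old = make_db("CREATE TABLE users (name TEXT, telephone TEXT)",
              ("John", "555-0100"))
new = make_db("CREATE TABLE users (name TEXT, email TEXT, telephone TEXT)",
              ("John", "john@example.com", "555-0100"))

old_result = telephone_of(old, "John")
new_result = telephone_of(new, "John")
print(old_result, new_result)
```

Access by name keeps returning the telephone number under both schemas, while position 1 silently becomes the email address once the column is added.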
This immediately makes me think of a table I was using which contained a column of type blob; it usually contained a JPEG image, a few Mbs in size.
Needless to say, I didn't SELECT that column unless I really needed it. Having that data floating around, especially when I selected multiple rows, was just a hassle.
However, I will admit that I otherwise usually query for all the columns in a table.
During a SQL select, the DB is always going to refer to the metadata for the table, regardless of whether it's SELECT * or SELECT a, b, c... Why? Because that's where the information on the structure and layout of the table on the system is.
It has to read this information for two reasons. One, to simply compile the statement. It needs to make sure you specify an existing table at the very least. Also, the database structure may have changed since the last time a statement was executed.
Now, obviously, DB metadata is cached in the system, but it's still processing that needs to be done.
Next, the metadata is used to generate the query plan. This happens each time a statement is compiled as well. Again, this runs against cached metadata, but it's always done.
The only time this processing is not done is when the DB is using a pre-compiled query, or has cached a previous query. This is the argument for using binding parameters rather than literal SQL. "SELECT * FROM TABLE WHERE key = 1" is a different query than "SELECT * FROM TABLE WHERE key = ?" and the "1" is bound on the call.
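A minimal sketch of the bound-parameter form, using Python's sqlite3 module (the items table and its columns are hypothetical). The SQL text stays identical across calls and only the bound value changes, which is what lets a driver or server reuse a previously compiled statement; how much reuse actually happens depends on the driver and the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, label TEXT);
    INSERT INTO items VALUES (1, 'one'), (2, 'two');
""")

# One statement text for every lookup; the value is bound at call time.
LOOKUP = "SELECT label FROM items WHERE id = ?"

labels = [conn.execute(LOOKUP, (item_id,)).fetchone()[0] for item_id in (1, 2)]
print(labels)
```

Had the values been spliced into the string instead, each lookup would be a distinct statement text and would have to be compiled separately.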
DBs rely heavily on page caching for their work. Many modern DBs are small enough to fit completely in memory (or, perhaps I should say, modern memory is large enough to fit many DBs). Then your primary I/O cost on the back end is logging and page flushes.
However, if you're still hitting the disk for your DB, a primary optimization done by many systems is to rely on the data in indexes, rather than the tables themselves.
If you have:
CREATE TABLE customer (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(150) NOT NULL,
city VARCHAR(30),
state VARCHAR(30),
zip VARCHAR(10));
CREATE INDEX k1_customer ON customer(id, name);
Then if you do "SELECT id, name FROM customer WHERE id = 1", it is very likely that your DB will pull this data from the index, rather than from the table.
Why? It will likely use the index anyway to satisfy the query (vs a table scan), and even though 'name' isn't used in the where clause, that index will still be the best option for the query.
Now the database has all of the data it needs to satisfy the query, so there's no reason to hit the table pages themselves. Using the index results in less disk traffic since you have a higher density of rows in the index vs the table in general.
This is a hand-wavy explanation of a specific optimization technique used by some databases. Many have several optimization and tuning techniques.
In the end, SELECT * is useful for dynamic queries you have to type by hand; I'd never use it for "real code". Identification of individual columns gives the DB more information that it can use to optimize the query, and gives you better control in your code against schema changes, etc.
I think there is no exact answer to your question, because you have to weigh performance against the ease of maintaining your apps. Selecting specific columns performs better than select *, but if you are developing an object-oriented system you will want to use object.properties, and you may need any property in any part of the app. If you don't use select * to populate all properties, you will need to write more methods to fetch properties in special situations. Your apps need to perform well using select *, and in some cases you will need to select specific columns to improve performance. Then you will have the best of both worlds: apps that are easy to write and maintain, and performance when you need it.
The accepted answer here is wrong. I came across this when another question was closed as a duplicate of this (while I was still writing my answer - grr - hence the SQL below references the other question).
You should always use SELECT attribute, attribute.... NOT SELECT *
It's primarily for performance issues.
SELECT name FROM users WHERE name='John';
Is not a very useful example. Consider instead:
SELECT telephone FROM users WHERE name='John';
If there's an index on (name, telephone) then the query can be resolved without having to look up the relevant values from the table - there is a covering index.
Further, suppose the table has a BLOB containing a picture of the user, and an uploaded CV, and a spreadsheet...
Using SELECT * will pull all this information back into the DBMS buffers (forcing other useful information out of the cache). Then it will all be sent to the client, using up time on the network and memory on the client for data which is redundant.
It can also cause functional issues if the client retrieves the data as an enumerated array (such as PHP's mysql_fetch_array($x, MYSQL_NUM)). Maybe when the code was written 'telephone' was the third column to be returned by SELECT *, but then someone comes along and decides to add an email address to the table, positioned before 'telephone'. The desired field is now shifted to the 4th column.
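The covering-index point above can be checked against a planner. The sketch below does so for SQLite only, via Python's sqlite3 module; the users table, its cv column and the index name are hypothetical, and the wording of plan output varies between engines and versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, telephone TEXT, cv BLOB);
    CREATE INDEX ix_name_telephone ON users (name, telephone);
""")


def plan(sql):
    # The fourth column of each EXPLAIN QUERY PLAN row holds the readable detail text.
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


narrow = plan("SELECT telephone FROM users WHERE name = 'John'")
wide = plan("SELECT * FROM users WHERE name = 'John'")
print(narrow)
print(wide)
```

For the narrow query SQLite reports a covering index, meaning it never touches the table rows; for SELECT * it still uses the index to find John but then has to visit the table to fetch the remaining columns, including the blob.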
There are reasons for doing things either way. I use SELECT * a lot on PostgreSQL because there are a lot of things you can do with SELECT * in PostgreSQL that you can't do with an explicit column list, particularly when in stored procedures. Similarly in Informix, SELECT * over an inherited table tree can give you jagged rows while an explicit column list cannot because additional columns in child tables are returned as well.
The main reason why I do this in PostgreSQL is that it ensures that I get a well-formed type specific to a table. This allows me to take the results and use them as the table type in PostgreSQL. This also allows for many more options in the query than a rigid column list would.
On the other hand, a rigid column list gives you an application-level check that db schemas haven't changed in certain ways and this can be helpful. (I do such checks on another level.)
As for performance, I tend to use VIEWs and stored procedures returning types (and then a column list inside the stored procedure). This gives me control over what types are returned.
But keep in mind I am using SELECT * usually against an abstraction layer rather than base tables.
Reference taken from this article:
Without SELECT *:
When you use "SELECT *", you select more columns from the database, and some of these columns might not be used by your application.
This creates extra cost and load on the database system, and more data travels across the network.
With SELECT *:
If you have special requirements and have created a dynamic environment in which added or deleted columns are handled automatically by application code, you don't need to change application or database code, and the change takes effect in production automatically. In this special case you can use "SELECT *".
Just to add a nuance to the discussion which I don't see here: In terms of I/O, if you're using a database with column-oriented storage you can do A LOT less I/O if you only query for certain columns. As we move to SSDs the benefits may be a bit smaller vs. row-oriented storage but there's a) only reading the blocks that contain columns you care about b) compression, which generally greatly reduces the size of the data on disk and therefore the volume of data read from disk.
If you're not familiar with column-oriented storage, one implementation for Postgres comes from Citus Data, another is Greenplum, another Paraccel, another (loosely speaking) is Amazon Redshift. For MySQL there's Infobright, the now-nigh-defunct InfiniDB. Other commercial offerings include Vertica from HP, Sybase IQ, Teradata...
select * from table1 INTERSECT select * from table2
is equivalent to
select distinct t1 from table1 where Exists (select t2 from table2 where table1.t1 = t2 )
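A quick check of the two forms, sketched with Python's sqlite3 module under the assumption that each table has a single column (t1 and t2, as in the queries above) and contains no NULLs. The equivalence should not be expected to hold in general: INTERSECT compares whole rows and treats NULLs as matching, whereas the = in the EXISTS form never matches a NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (t1 INTEGER);
    CREATE TABLE table2 (t2 INTEGER);
    INSERT INTO table1 VALUES (1), (2), (2), (3);
    INSERT INTO table2 VALUES (2), (3), (3), (4);
""")

via_intersect = conn.execute(
    "SELECT * FROM table1 INTERSECT SELECT * FROM table2 ORDER BY 1").fetchall()

via_exists = conn.execute(
    "SELECT DISTINCT t1 FROM table1 "
    "WHERE EXISTS (SELECT t2 FROM table2 WHERE table1.t1 = t2) "
    "ORDER BY t1").fetchall()

print(via_intersect, via_exists)
```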

Optimize SQL query that uses NOT EXISTS with many columns in not exists' WHERE clause

Edit: using SQL Server 2005.
I have a query that has to check whether rows from a legacy database have already been imported into a new database and imports them if they are not already there. Since the legacy database was badly designed, there is no unique id for the rows from the legacy table so I have to use heuristics to decide whether the row has been imported. (I have no control over the legacy database.) The new database has slightly different structure and I have to check several values such as whether create dates match, group number match, etc. to heuristically decide whether the row exists in the new database or not. Not very pretty, but the bad design of the legacy system it has to interface with leaves me little choice.
Anyhow, the users of the system started throwing 10x to 100x more data at the system than I designed for, and now the query is running too slowly. Can you suggest a way to make it faster? Here is the code, with some parts redacted for privacy or to simplify, but I think I left the important part:
INSERT INTO [...NewDatabase...]
SELECT [...Bunch of columns...]
FROM [...OldDatabase...] AS t1
WHERE t1.Printed = 0
AND NOT EXISTS(SELECT *
FROM [...New Database...] AS s3
WHERE year(s3.dtDatePrinted) = 1850 --This allows for re-importing rows marked for reprint
AND CAST(t1.[Group] AS int) = CAST(s3.vcGroupNum AS int)
AND RTRIM(t1.Subgroup) = s3.vcSubGroupNum
AND RTRIM(t1.SSN) = s3.vcPrimarySSN
AND RTRIM(t1.FirstName) = s3.vcFirstName
AND RTRIM(t1.LastName) = s3.vcLastName
AND t1.CaptureDate = s3.dtDateCreated)
Not knowing what the schema looks like, your first step is to EXPLAIN those sub-queries. That should show you where the database is chewing up its time. If there are no indexes, it's likely doing multiple full table scans. If I had to guess, I'd say t1.Printed and s3.dtDatePrinted are the two most vital to get indexed, as they'll weed out what's already been converted.
Also, anything which needs to be calculated might cause the database not to use the index, for example the calls to RTRIM and CAST. That suggests you have dirty data in the new database. Trim it off permanently, and see about changing t1.Group to the right type.
year(s3.dtDatePrinted) = 1850 may fool the optimizer into not using an index for s3.dtDatePrinted (EXPLAIN should let you know). This appears to be just a flag set by you to check if the row has already been converted, so set it to a specific date (e.g. 1850-01-01 00:00:00) and do a specific match (i.e. s3.dtDatePrinted = '1850-01-01 00:00:00'), and now that's a simple index lookup.
Making your comparison simpler would also help. Essentially what you have here is a 1-to-1 relationship between t1 and s3 (if t1 is the real name for the new table, consider something more descriptive). So rather than matching each individual bit of s3 to t1, just give t1 a column to reference the primary key of its corresponding s3 row. Then you just have one thing to check. If you can't alter t1, then you could use a third table to track t1-to-s3 mappings.
Once you have that, all you should have to do is a join to find rows in s3 which are not in t1.
SELECT s3.*
FROM s3
LEFT JOIN t1 ON t1.s3 = s3.id -- or whatever s3's primary key is
WHERE t1.s3 IS NULL
Try replacing this:
year(s3.dtDatePrinted) = 1850
With this:
s3.dtDatePrinted >= '1850-01-01' and s3.dtDatePrinted < '1851-01-01'
In this case, and if there's an index on dtDatePrinted MAYBE the optimizer could use a range index scan.
But I agree with previous posters that you should avoid the RTRIMs. One idea is keeping in s3 the untrimmed (original) value, or creating an intermediate table that maps untrimmed values with trimmed (new) ones. Or even creating materialized views. But all this work is useless without proper indexes.
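The effect of wrapping an indexed column in a function can be seen in a query plan. The sketch below uses SQLite through Python's sqlite3 module, not SQL Server 2005, so it only illustrates the general principle: strftime stands in for year(), the table is cut down to a few columns, and the index name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE s3 (id INTEGER PRIMARY KEY, dtDatePrinted TEXT, vcLastName TEXT);
    CREATE INDEX ix_date_printed ON s3 (dtDatePrinted);
""")


def plan(sql):
    # The fourth column of each EXPLAIN QUERY PLAN row holds the readable detail text.
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Function applied to the column: the index on the raw column cannot be searched.
wrapped = plan("SELECT * FROM s3 WHERE strftime('%Y', dtDatePrinted) = '1850'")

# Same condition expressed as a half-open range on the bare column.
ranged = plan("SELECT * FROM s3 "
              "WHERE dtDatePrinted >= '1850-01-01' AND dtDatePrinted < '1851-01-01'")
print(wrapped)
print(ranged)
```

The first plan falls back to scanning the table, while the second searches the index on dtDatePrinted. The same reasoning applies to the RTRIM and CAST calls in the original query, which is why cleaning the stored values is suggested above.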

SQL `LIKE` complexity

Does anyone know what the complexity is for the SQL LIKE operator for the most popular databases?
Let's consider the three core cases separately. This discussion is MySQL-specific, but might also apply to other DBMS due to the fact that indexes are typically implemented in a similar manner.
LIKE 'foo%' is quick if run on an indexed column. MySQL indexes are a variation of B-trees, so when performing this query it can simply descend the tree to the node corresponding to foo, or the first node with that prefix, and traverse the tree forward. All of this is very efficient.
LIKE '%foo' can't be accelerated by indexes and will result in a full table scan. If you have other criteria that can be executed using indexes, it will only scan the rows that remain after the initial filtering.
There's a trick though: If you need to do suffix matching - searching for file names with extension .foo, for instance - you can achieve the same performance by adding a column with the same contents as the original one but with the characters in reverse order.
ALTER TABLE my_table ADD COLUMN col_reverse VARCHAR (256) NOT NULL;
ALTER TABLE my_table ADD INDEX idx_col_reverse (col_reverse);
UPDATE my_table SET col_reverse = REVERSE(col);
Searching for rows with col ending in .foo then becomes:
SELECT * FROM my_table WHERE col_reverse LIKE 'oof.%'
Finally, there's LIKE '%foo%', for which there are no shortcuts. If there are no other limiting criteria which reduce the number of rows to a feasible number, it will cause a hard performance hit. You might want to consider a full text search solution instead, or some other specialized solution.
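To see why the prefix case is cheap and why the reversed-column trick works, here is a small Python sketch in which a sorted list stands in for the ordered keys of a B-tree index. This is an illustration of the idea only, not how MySQL implements it, and the sample file names are made up. A prefix search becomes two binary searches that bracket a contiguous range; a suffix search becomes a prefix search over the reversed strings.

```python
from bisect import bisect_left


def prefix_matches(sorted_keys, prefix):
    """Return the contiguous run of keys starting with prefix (like LIKE 'prefix%')."""
    lo = bisect_left(sorted_keys, prefix)
    # Upper bound: the prefix followed by the largest code point.
    # Assumes no key contains U+10FFFF right after the prefix.
    hi = bisect_left(sorted_keys, prefix + "\U0010FFFF")
    return sorted_keys[lo:hi]


names = sorted(["bar.txt", "foo", "food.foo", "fool", "zap.foo"])

# LIKE 'foo%': two O(log n) searches plus a walk over the matches.
prefix_hits = prefix_matches(names, "foo")

# LIKE '%.foo': index the reversed strings and search for the reversed suffix.
reversed_index = sorted(name[::-1] for name in names)
suffix_hits = [key[::-1] for key in prefix_matches(reversed_index, ".foo"[::-1])]

print(prefix_hits, suffix_hits)
```

A pattern such as '%foo%' has no such anchor at either end of the key, so there is no range to bracket and every key has to be inspected.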
If you are asking about the performance impact:
The problem with LIKE is that it can keep the database from using an index. On Oracle I think it doesn't use indexes anymore (but I'm still on Oracle 9). SQL Server uses indexes if the wildcard is only at the end. I don't know about other databases.
Depends on the RDBMS, the data (and possibly size of data), indexes and how the LIKE is used (with or without prefix wildcard)!
You are asking too general a question.