I am trying to understand what the +0 at the end of this Oracle 9i query means:
SELECT /*+ INDEX (a CODE_ZIP_CODE_IX) */
a.city,
a.state,
LPAD(a.code,5,0) ZipCode,
b.County_Name CoName,
c.Description RegDesc,
d.Description RegTypeDesc
FROM TBL_CODE_ZIP a,
TBL_CODE_COUNTY b,
TBL_CODE_REGION c,
TBL_CODE_REGION_TYPE d
WHERE a.City = 'LONDONDERRY'
AND a.State = 'NH'
AND lpad(a.Code,5,0) = '03038'
AND a.Region_Type_Code = 1
AND b.County(+) = a.County_Code
AND b.STATE(+) = a.STATE
AND c.Code(+) = a.Region_Code
AND d.Code(+) = a.Region_Type_Code
ORDER BY a.Code +0
Any ideas?
NOTE: I don't think it has to do with ascending or descending, since I can't add ASC or DESC between a.Code and +0, but I can add ASC or DESC after the +0.
The + 0 was a trick back in the days of the rule-based optimizer: it made it impossible to use an index on the numeric column. Similarly, a || '' was used for alphanumeric columns.
For your query, the only conclusion I can reach after inspecting it is that its creator was struggling with performance. If (that's my assumption) index CODE_ZIP_CODE_IX is an index on TBL_CODE_ZIP(Code), then the query won't use it, even though it is hinted to use it. The creator probably wasn't aware that by using LPAD(a.code,5,0) instead of a.code, the index cannot be used. An ORDER BY clause takes its intermediate result set - which resides in memory - and sorts it; no index is needed for that. But with the + 0 it looks like he was trying to disable it anyway.
So, the tricks that were used were ineffective, and are now only misleading, as you have found out.
Regards,
Rob.
PS1: It's better to use LPAD(TO_CHAR(a.code),5,'0') or TO_CHAR(a.code,'fm00009'). Then it is clear what you are doing with the datatype.
PS2: Your query might benefit from using a function based index on LPAD(TO_CHAR(a.code),5,'0'), or whatever expression you use to left pad your zipcode.
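A sketch of what that might look like, assuming the table and column from the question (the index name is made up, and the query's predicate has to use exactly the same expression for the index to be eligible):
CREATE INDEX code_zip_lpad_ix
ON TBL_CODE_ZIP (LPAD(TO_CHAR(code), 5, '0'));
-- the WHERE clause can then match the indexed expression directly:
-- AND LPAD(TO_CHAR(a.Code), 5, '0') = '03038'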
My guess would be that a.code is a VARCHAR2 containing a numeric string, and the +0 is effectively casting it to a NUMBER so the sort will be numeric rather than alphabetical.
You should be able to add ASC/DESC after the +0
Note: I had deleted this answer, because Mark B was the faster typist. However, I have re-instated it because I think there is some value in demonstrating what may have been the underlying intent of the SQL which Lucas posted.
Suppose CODE had been a VARCHAR2 column holding strings of digits (zip codes). The problem is that varchars sort as strings not numbers. Adding a zero to the CODE spawns an implicit cast to number, and hence sorts numerically:
SQL> select id, code
2 from t72
3 order by code
4 /
ID CODE
---------- -----
1 1
2 11
3 111
4 12
SQL> select id, code
2 from t72
3 order by code+0
4 /
ID CODE
---------- -----
1 1
2 11
4 12
3 111
SQL>
If the stored codes had been left-padded with zeroes then the cast would not have been necessary, as they would sort in numeric order anyway.
As others have observed, using TO_NUMBER() would have been the better choice. The +0 is less obvious than an explicit cast, and it is always good to be clear about intent.
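For example, assuming CODE holds only strings of digits, the explicit version of the same sort would be:
select id, code
from t72
order by to_number(code);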
Is there an index on TBL_CODE_ZIP.Code? I've seen queries that add 0 to a number (or '' to a string) in order to force the optimizer to avoid using an index for that part of the query. (Of course, the proper way to avoid using an index is to add an appropriate hint)
Maybe the original writer had a problem where the ORDER BY was being optimized to an index scan, which caused the query to run slower; so they added +0 to force a different access path and do an ordinary sort.
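For reference, one way to write the "appropriate hint" mentioned above is Oracle's NO_INDEX hint rather than an arithmetic trick. A trimmed-down sketch against the query from the question:
SELECT /*+ NO_INDEX(a CODE_ZIP_CODE_IX) */
a.city, a.state, LPAD(a.code,5,0) ZipCode
FROM TBL_CODE_ZIP a
WHERE a.City = 'LONDONDERRY'
AND a.State = 'NH'
ORDER BY a.Code;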
First of all, sorry for answering such an old question. However, the +0 effectively tells the database to ignore the index (if there is one on the a.Code column) for this specific query.
Sometimes an index makes retrieval fast and sometimes it makes it very slow, depending on the optimizer mode of the database.
So you have two options: either keep the +0 trick, or drop the index if it is on a.Code; you should get the same speed either way.
I have two files which I would like to match by name and I would like to take account of spelling errors by using the compged function. The names have been thoroughly cleaned and I have no other useful match variables that could be used to reduce the search space.
The files name1 and name2 have over 500k rows each and thus after 11 hours this code has not run.
Is there some way I can code this more efficiently or is my issue purely due to computing power?
proc sql;
create table name1_name2_Fuzzy as
select a.*, b.*
from name1 as a
inner join name2 as b
on COMPGED(a.match_name, b.match_name) < 200;
quit;
The COMPGED function has a parameter that you didn't use, and it can improve the performance (maybe 6 or 7 hours instead of 11).
This parameter is the cutoff. If you choose 300 as the cutoff, then as soon as the distance between two words reaches 300, SAS stops the calculation and returns 300.
So in your case you should choose a cutoff greater than 200 (and NOT >= 200).
COMPLEV is faster than COMPGED. If you don't need the exact cost of each operation (set with the CALL COMPCOST routine), you can use it instead of COMPGED and save minutes or maybe hours of computation. COMPLEV also has the cutoff option.
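A sketch of both suggestions applied to the original query (300 is just an example cutoff comfortably above the 200 threshold):
proc sql;
create table name1_name2_Fuzzy as
select a.*, b.*
from name1 as a
inner join name2 as b
on COMPGED(a.match_name, b.match_name, 300) < 200; /* the third argument is the cutoff */
quit;
/* or, if the exact operation costs don't matter: */
/* on COMPLEV(a.match_name, b.match_name, 300) < 200 */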
Hope this helps!
Working off memory here, but if the first char of each match_name is different, the COMPGED will be over 200, true? So, you wouldn't consider them a match?
If so, make an indexed column with the first character of match_name in each table, and join on that before the COMPGED. That should eliminate most of the non-matches so far fewer COMPGED calculations will be needed.
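A hedged sketch of that idea, written inline rather than with a separate indexed column (it assumes the first character really is a safe blocking key for your data):
proc sql;
create table name1_name2_Fuzzy as
select a.*, b.*
from name1 as a
inner join name2 as b
on substr(a.match_name, 1, 1) = substr(b.match_name, 1, 1) /* cheap blocking condition */
and COMPGED(a.match_name, b.match_name, 300) < 200; /* fuzzy match only on the reduced set */
quit;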
I have the following FireBird query:
update hrs h
set h.plan_week_id=
(select first 1 c.plan_week_id from calendar c
where c.calendar_id=h.calendar_id)
where coalesce(h.calendar_id,0) <> 0
(Intention: For records in hrs with a (non-zero) calendar_id
take calendar.plan_week_id and put it in hrs.plan_week_id)
The trick to select the first record in Oracle is to use WHERE ROWNUM=1, and if I understand correctly I do not have to use ROWNUM in a separate outer query because I 'only' match ROWNUM=1 - thanks SO for suggesting questions that may already have your answer ;-)
This would make it
update hrs h
set h.plan_week_id=
(select c.plan_week_id from calendar c
where (c.calendar_id=h.calendar_id) and (rownum=1))
where coalesce(h.calendar_id,0) <> 0
I'm actually using the 'first record' together with the selection of only one field to guarantee that I get one value back which can be put into h.plan_week_id.
Question: Will the above query work under Oracle as intended?
Right now, I do not have a filled Oracle DB at hand to run the query on.
Like Nicholas Krasnov said, you can test it in SQL Fiddle.
But if you ever find yourself about to use where rownum = 1 in a subquery, alarm bells should go off, because in 90% of the cases you are doing something wrong. Very rarely will you need a random value; only when all selected values are the same is a rownum = 1 valid.
In this case I expect calendar_id to be a primary key in calendar. Therefore each record in hrs can only have one plan_week_id selected, and the where rownum = 1 is not required.
And to answer your question: yes, it will run just fine. Though the brackets in the where clauses are not required either, and in fact only confuse (me).
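So, assuming calendar_id really is unique in calendar, a sketch of the simplified statement that should behave the same in both Firebird and Oracle:
update hrs h
set h.plan_week_id =
(select c.plan_week_id from calendar c
where c.calendar_id = h.calendar_id)
where coalesce(h.calendar_id, 0) <> 0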
I ran across a problem with a SQL statement today that I was able to fix by adding additional criteria, however I really want to know why my change fixed the problem.
The problem query:
SELECT *
FROM
(SELECT ah.*,
com.location,
ha.customer_number,
d.name applicance_NAME,
house.name house_NAME,
tr.name RULE_NAME
FROM actionhistory ah
INNER JOIN community com
ON (ah.city_id = com.city_id)
INNER JOIN house_address ha
ON (ah.applicance_id = ha.applicance_id
AND ha.status_cd = 'ACTIVE')
INNER JOIN applicance d
ON (ah.applicance_id = d.applicance_id)
INNER JOIN house house
ON (house.house_id = ah.house_id)
LEFT JOIN the_rule tr
ON (tr.the_rule_id = ah.the_rule_id)
WHERE actionhistory_id >= 'ACT100010000'
ORDER BY actionhistory_id
)
WHERE rownum <= 30000;
The "fix"
SELECT *
FROM
(SELECT ah.*,
com.location,
ha.customer_number,
d.name applicance_NAME,
house.name house_NAME,
tr.name RULE_NAME
FROM actionhistory ah
INNER JOIN community com
ON (ah.city_id = com.city_id)
INNER JOIN house_address ha
ON (ah.applicance_id = ha.applicance_id
AND ha.status_cd = 'ACTIVE')
INNER JOIN applicance d
ON (ah.applicance_id = d.applicance_id)
INNER JOIN house house
ON (house.house_id = ah.house_id)
LEFT JOIN the_rule tr
ON (tr.the_rule_id = ah.the_rule_id)
WHERE actionhistory_id >= 'ACT100010000' and actionhistory_id <= 'ACT100030000'
ORDER BY actionhistory_id
)
All of the _id columns are indexed sequences.
The first query's explain plan had a cost of 372 and the second was 14. This is running on an Oracle 11g database.
Additionally, if actionhistory_id in the where clause is anything less than ACT100000000, the original query returns instantly.
This is because of the index on the actionhistory_id column.
During the first query Oracle has to read all the index blocks containing entries for records at or after 'ACT100010000', then match each index entry back to the table to get the records, and only then take the first 30,000 rows from the result set.
During the second query Oracle only has to read the index blocks for records between 'ACT100010000' and 'ACT100030000', then fetch from the table just the records represented in that smaller set of index entries - a lot less work in the table-access step than in the first query.
Noticing your last line about ids less than ACT100000000: it sounds to me like those records may all be in the same block (or in a contiguous set of blocks).
EDIT: Please also consider what Justin says - I was talking about actual performance, but he points out that because the id is a varchar the range of potential values is far greater than for a number, and the estimated plan may reflect a higher cost than reality because the optimizer doesn't know the full range until execution. To further optimize, taking his point into consideration, you could put a function-based index on the id column, or make it a combination key with the varchar portion in one column and the numeric portion in another.
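A rough sketch of the function-based index idea, assuming every id is 'ACT' followed by digits (the index name is made up, and the query would have to use exactly the same expression):
CREATE INDEX actionhistory_id_num_ix
ON actionhistory (TO_NUMBER(SUBSTR(actionhistory_id, 4)));
-- range predicates can then be written against the indexed expression:
-- WHERE TO_NUMBER(SUBSTR(actionhistory_id, 4)) BETWEEN 100010000 AND 100030000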
What are the plans for both queries?
Are the statistics on your tables up to date?
Do the two queries return the same set of rows? It's not obvious that they do but perhaps ACT100030000 is the largest actionhistory_id in the system. It's also a bit confusing because the first query has a predicate on actionhistory_id with a value of TRA100010000 which is very different than the ACT value in the second query. I'm guessing that is a typo?
Are you measuring the time required to fetch the first row? Or the time required to fetch the last row? What are those elapsed times?
My guess, without that information, is that using the wrong data type for your actionhistory_id column is hurting the Oracle optimizer's ability to generate appropriate cardinality estimates, which likely causes it to underestimate the selectivity of your predicates and to generate poorly performing plans. A human may be able to guess that actionhistory_id is a string that starts with ACT10000 and then has 30,000 sequential numeric values from 00001 to 30000, but the optimizer is not that smart. It sees a 13-character string and isn't able to figure out that the trailing characters are always going to be digits - so there are only 10 possible values per character rather than 256 (assuming 8-bit characters) - or that the leading characters are always going to be the same constant value. If, on the other hand, actionhistory_id were defined as a NUMBER with values between 1 and 30000, it would be dramatically easier for the optimizer to make reasonable estimates about the selectivity of various predicates.
I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the order by line is uncommented, in which case the same query takes forever to run (185 ms without sorting compared to 25 minutes with). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?
select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
The sort is not the problem - in fact the CPU and memory cost of the sort is close to zero since Postgres has Top-N sort where the result set is scanned while keeping up to date a small sort buffer holding only the Top-N rows.
select count(*) from (1 million row table) -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x; -- 1.80 s
So you see the Top-10 sorting only adds 10 ms to a dumb fast count(*) versus a lot longer for a real sort. That's a very neat feature, I use it a lot.
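If you want to confirm that on your side, EXPLAIN ANALYZE labels the limited sort explicitly (the table name and figures here are only illustrative):
EXPLAIN ANALYZE SELECT * FROM some_big_table ORDER BY x LIMIT 10;
-- Limit (...)
--   ->  Sort (...)
--         Sort Method: top-N heapsort  Memory: 25kB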
OK, now without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using:
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
OK. I don't know how many rows are selected in both tables (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.
Then we get the worst-case CROSS JOIN condition ever:
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
This means all rows of A are matched against all rows of B (so, this expression is going to be evaluated a large number of times), using a bunch of pretty complex, slow, and cpu-intensive functions.
Of course it's horribly slow!
When you remove the ORDER BY, Postgres just comes up (by chance?) with a bunch of matching rows right at the start, outputs those, and stops once the LIMIT is reached.
Here's a little example:
Tables a and b are identical and contain 1000 rows, and a column of type BOX.
select * from a cross join b where (a.b && b.b) --- 0.28 s
Here 1000000 box overlap (operator &&) tests are completed in 0.28s. The test data set is generated so that the result set contains only 1000 rows.
create index a_b on a using gist(b);
create index b_b on b using gist(b);
select * from a cross join b where (a.b && b.b) --- 0.01 s
Here the index is used to optimize the cross join, and speed is ridiculous.
You need to optimize that geometry matching.
add columns which will cache:
ST_Centroid(A.wkb_geometry)
ST_Buffer((B.wkb_geometry), 1000)
There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date.
add columns of type BOX which will cache:
Bounding Box of ST_Centroid(A.wkb_geometry)
Bounding Box of ST_Buffer((B.wkb_geometry), 1000)
add gist indexes on the BOXes
add a Box overlap test (using the && operator) which will use the index
keep your ST_Within which will act as a final filter on the rows that pass
Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator, see here:
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
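A hedged sketch of the cached-column recipe above, using the table and column names from the question (the new column and index names are made up; on older PostGIS you may prefer AddGeometryColumn):
-- cache the expensive geometry expressions (keep them current with triggers)
ALTER TABLE global_site ADD COLUMN centroid_geom geometry;
ALTER TABLE na_utility_line ADD COLUMN buffer_geom geometry;
UPDATE global_site SET centroid_geom = ST_Centroid(wkb_geometry);
UPDATE na_utility_line SET buffer_geom = ST_Buffer(wkb_geometry, 1000);
-- GiST indexes so the bounding-box overlap test can use an index
CREATE INDEX global_site_centroid_gix ON global_site USING gist (centroid_geom);
CREATE INDEX na_utility_line_buffer_gix ON na_utility_line USING gist (buffer_geom);
-- in the query: cheap indexed prefilter first, exact test as the final filter
-- AND A.centroid_geom && B.buffer_geom
-- AND ST_Within(A.centroid_geom, B.buffer_geom)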
I would suggest creating an index on area_acre. You may want to take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html
I would recommend doing this sort of thing during off-peak hours though, because it can be somewhat intensive with a large amount of data. One thing you will also have to look at with indexes is rebuilding them on a schedule to ensure performance over time. Again, this schedule should be outside of peak hours.
You may want to take a look at this article from a fellow SO'er and his experience with database slowdowns over time with indexes: Why does PostgresQL query performance drop over time, but restored when rebuilding index
If the A.area_acre field is not indexed that may slow it down. You can run the query with EXPLAIN to see what it is doing during execution.
First off, I would look at creating indexes, ensuring your db is being vacuumed, and increasing the shared buffers and work_mem settings for your install.
First thing to look at is whether you have an index on the field you're ordering by. If not, adding one will dramatically improve performance. I don't know postgresql that well but something similar to:
CREATE INDEX area_acre ON global_site(area_acre)
As noted in other replies, the indexing process is intensive when working with a large data set, so do this during off-peak.
I am not familiar with the PostgreSQL optimizations, but it sounds like what is happening when the query is run with the ORDER BY clause is that the entire result set is created, then it is sorted, and then the top 11 rows are taken from that sorted result. Without the ORDER BY, the query engine can just generate the first 11 rows in whatever order it pleases and then it's done.
Having an index on the area_acre field very possibly may not help for the sorting (ORDER BY) depending on how the result set is built. It could, in theory, be used to generate the result set by traversing the global_site table using an index on area_acre; in that case, the results would be generated in the desired order (and it could stop after generating 11 rows in the result). If it does not generate the results in that order (and it seems like it may not be), then that index will not help in sorting the results.
One thing you might try is to remove the "CROSS JOIN" from the query. I doubt that this will make a difference, but it's worth a test. Because a WHERE clause is involved joining the two tables (via ST_WITHIN), I believe the result is the same as an inner join. It is possible that the use of the CROSS JOIN syntax is causing the optimizer to make an undesirable choice.
Otherwise (aside from making sure indexes exist for fields that are being filtered), you could play a bit of a guessing game with the query. One condition that stands out is the area_acre >= 500. This means that the query engine is considering all rows that meet that condition. But then only the first 11 rows are taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue. The somevalue is the guessing part that would need adjustment to make sure you get at least 11 rows. This, however, seems like a pretty cheesy thing to do, so I mention it with some reticence.
Have you considered creating expression-based indexes for the benefit of the hairier joins and where conditions?
SELECT MAX(verification_id)
FROM VERIFICATION_TABLE
WHERE head = 687422
AND mbr = 23102
AND RTRIM(LTRIM(lname)) = '.iq bzw'
AND TO_CHAR(dob,'MM/DD/YYYY')= '08/10/2004'
AND system_code = 'M';
This query is taking 153 seconds to run. There are millions of rows in VERIFICATION_TABLE.
I think the query is taking long because of the functions in the where clause. However, I need to do LTRIM/RTRIM on the columns, and the date has to be matched in MM/DD/YYYY format. How can I optimize this query?
Explain plan:
SELECT STATEMENT, GOAL = ALL_ROWS 80604 1 59
SORT AGGREGATE 1 59
TABLE ACCESS FULL P181 VERIFICATION_TABLE 80604 1 59
Primary key:
VRFTN_PK Primary VERIFICATION_ID
Indexes:
N_VRFTN_IDX2 head, mbr, dob, lname, verification_id
N_VRFTN_IDX3 last_update_date
N_VRFTN_IDX4 mbr, lname, dob, verification_id
N_VRFTN_IDX4 verification_id
Though, in the explain plan I don't see the indexes/primary key being used. Is that the problem?
Try this:
SELECT MAX(verification_id)
FROM VERIFICATION_TABLE
WHERE head = 687422
AND mbr = 23102
AND TRIM(lname) = '.iq bzw'
AND TRUNC(dob) = TO_DATE('08/10/2004','MM/DD/YYYY')
AND system_code = 'M';
Remove that TRUNC() if dob doesn't have a time component already; from the looks of it (Date of Birth?) it may not. Past that, you need some indexing work. If you're querying that much in this style, I'd index mbr and head in a 2-column index; if you said what the columns mean, it'd help determine the best indexing here.
The only index that is a possible candidate for use in your query is N_VRFTN_IDX2, because it indexes four of the columns you use in your WHERE clause: HEAD, MBR, DOB and LNAME.
However, because you apply functions to both DOB and LNAME they are ineligible for consideration. The optimizer may then decide not to use that index because it thinks HEAD+MBR on their own are an insufficiently selective combination. If you removed the TO_CHAR() call from DOB then you have three leading columns on N_VRFTN_IDX2 which might make it more attractive to the optimizer. Likewise, is it necessary to TRIM() LNAME?
The other thing is, the need to look up SYSTEM_CODE means the query has to read from the table (because that column is not indexed). If N_VRFTN_IDX2 has a poor clustering factor the optimizer may decide to go for a FULL TABLE SCAN because the indexed reads are an overhead. Whereas if you added SYSTEM_CODE to the index, the entire query could be satisfied by an INDEX RANGE SCAN, which would be a lot faster.
Finally, how fresh are your statistics? If your statistics are stale, that might lead the optimizer to make a duff decision. For instance, more accurate statistics might lead the optimizer to use the compound index even with just the two leading columns.
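A sketch of that covering-index suggestion (the index name is made up; it pays off most once the TO_CHAR and TRIM wrappers are removed so DOB and LNAME can be matched directly):
CREATE INDEX n_vrftn_idx_cover
ON verification_table (head, mbr, system_code, dob, lname, verification_id);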
You should turn the literal into a DATE and not the column into a VARCHAR2 like this:
AND dob = TO_DATE('08/10/2004','MM/DD/YYYY')
Or use the preferable ANSI date literal syntax:
AND dob = DATE '2004-08-10'
If the dob column contains time (a date of birth doesn't usually, except presumably in a hospital!) then you can do:
AND dob >= DATE '2004-08-10'
AND dob < DATE '2004-08-11'
Check the datatypes for HEAD and MBR.
The values "687422 and 23102" have the 'feel' of being quite selective. That is, if you have hundreds of thousands of values for head and millions of records in the table, it would seem that HEAD is quite selective. [That could be totally misleading though.]
Anyway, you may find that HEAD and/or MBR are actually stored as VARCHAR2 or CHAR fields rather than NUMBER. If so, comparing the character to a number would prevent the use of the index. Try the following (and I've included the conversion of the dob predicate with a date but added the explicit format mask).
SELECT MAX(verification_id)
FROM VERIFICATION_TABLE
WHERE head = '687422'
AND mbr = '23102'
AND RTRIM(LTRIM(lname)) = '.iq bzw'
AND TRUNC(dob) = TO_DATE('08/10/2004','MM/DD/YYYY')
AND system_code = 'M';
Please provide an EXPLAIN output on this query so we know where the slow-down occurs. Two thoughts:
change
AND TO_CHAR(dob,'MM/DD/YYYY')= '08/10/2004'
to
AND dob = <date here, not sure which oracle str2date function you need>
and use a function based index on
RTRIM(LTRIM(lname))
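A sketch of such a function-based index (the name is made up; the query must keep exactly the same expression for it to be usable):
CREATE INDEX vrftn_lname_trim_ix
ON verification_table (RTRIM(LTRIM(lname)));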
Try this:
SELECT MAX(verification_id)
FROM VERIFICATION_TABLE
WHERE head = 687422
AND mbr = 23102
AND TRIM(lname) = '.iq bzw'
AND dob between TO_DATE('08/10/2004','MM/DD/YYYY') and TO_DATE('08/11/2004','MM/DD/YYYY')
AND system_code = 'M';
This way a possible index on dob will be used.