Query time higher with 'WHERE' than without it - SQL

I have what I think is a strange issue. Normally, I would expect a query to take less time when I add a restriction (so that fewer rows are processed). But, and I don't know why, this is not the case here. Maybe I'm doing something wrong, but I don't get an error; the query just seems to run 'till infinity'.
This is the query:
SELECT
    A.ENTITYID AS ORG_ID,
    A.ID_VALUE AS LEI,
    A.MODIFIED_BY,
    A.AUDITDATETIME AS LAST_DATE_MOD
FROM (
    SELECT
        CASE WHEN IFE.NEWVALUE IS NOT NULL
             THEN EXTRACTVALUE(xmltype(IFE.NEWVALUE), '/DocumentElement/ORG_IDENTIFIERS/ID_TYPE')
             ELSE NULL
        END AS ID_TYPE,
        CASE WHEN IFE.NEWVALUE IS NOT NULL
             THEN EXTRACTVALUE(xmltype(IFE.NEWVALUE), '/DocumentElement/ORG_IDENTIFIERS/ID_VALUE')
             ELSE NULL
        END AS ID_VALUE,
        (SELECT u.username FROM admin.users u WHERE u.userid = ife.analystuserid) AS MODIFIED_BY,
        ife.*
    FROM ife.audittrail ife
    WHERE
        --IFE.AUDITDATETIME >= '01-JUN-2016' AND
        attributeid = 499
        AND ROWNUM <= 10000
        AND (CASE WHEN IFE.NEWVALUE IS NOT NULL THEN EXTRACTVALUE(xmltype(IFE.NEWVALUE), '/DocumentElement/ORG_IDENTIFIERS/ID_TYPE') ELSE NULL END) = '38'
) A
--WHERE A.AUDITDATETIME >= '01-JUN-2016';
So I tried each of the two commented clauses (one at a time, of course).
With both of them the same thing happens: the query runs for so long that I have to abort it.
Do you know why this could be happening? How could I apply the restriction, maybe in a different way?
The values of the AUDITDATETIME field look like '06-MAY-2017', for example. In that format.
Thank you very much in advance

I think you may misunderstand how databases work.
Firstly, read up on EXPLAIN - you can find out exactly what is taking time, and why, by learning to read the EXPLAIN output.
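For example, since the syntax in the question (xmltype, ROWNUM, EXTRACTVALUE) looks like Oracle, here is a minimal sketch of getting a plan there, using a simplified statement borrowed from the question:
EXPLAIN PLAN FOR
    SELECT * FROM ife.audittrail WHERE attributeid = 499;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);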
Secondly - the performance characteristics of any given query are determined by a whole range of things, but usually the biggest effort goes not into processing rows, but into finding them.
Without an index, the database has to look at every row in the table and compare it to your WHERE clause. It's the equivalent of searching the phone book for a phone number rather than a name (the phone book is indexed on "last name").
You can improve this by creating indexes - for instance, on columns "AUDITDATETIME" and "attributeid".
Unlike the phone book, a database server can support multiple indexes - and if those indexes match your where clause, your query will be (much) faster.
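For instance, a sketch in Oracle syntax (assuming that's your platform, given the xmltype/ROWNUM usage; the index names are illustrative):
CREATE INDEX idx_audittrail_attr ON ife.audittrail (attributeid);
CREATE INDEX idx_audittrail_date ON ife.audittrail (auditdatetime);
A composite index on (attributeid, auditdatetime) may serve both predicates at once, depending on how selective each column is.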
Finally, using an XML string extraction for a comparison in the where clause is likely to be extremely slow unless you've got an index on that XML data.
This is the equivalent of searching the phone book and translating the street address from one language to another - not only do you have to inspect every address, you have to execute an expensive translation step for each item.
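One way around that, assuming you are on Oracle and are allowed to alter the table (the column and index names below are illustrative): extract the value once into a plain column and index it, so the WHERE clause compares a stored string instead of parsing XML for every row.
ALTER TABLE ife.audittrail ADD (id_type_extracted VARCHAR2(100));

-- One-off backfill; new rows would need the same treatment (e.g. via a trigger).
UPDATE ife.audittrail
   SET id_type_extracted = EXTRACTVALUE(xmltype(NEWVALUE), '/DocumentElement/ORG_IDENTIFIERS/ID_TYPE')
 WHERE NEWVALUE IS NOT NULL;

CREATE INDEX idx_audittrail_id_type ON ife.audittrail (id_type_extracted);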

You probably need index(es)... We can all guess at which indexes you already have and which you need to add, but most DBMSs have built-in query optimizers and tools to expose their plans.
If you are using MS SQL Server, you can execute the query with the execution plan included; that will tell you which index you need to add to optimize this particular query. It will even let you copy/paste the command to create it.
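For instance, a T-SQL sketch (this only applies if the engine really is SQL Server; the table and column are borrowed from the question): SHOWPLAN_XML returns the estimated plan, including any missing-index suggestions, without executing the statement.
SET SHOWPLAN_XML ON;
GO
SELECT * FROM audittrail WHERE attributeid = 499;
GO
SET SHOWPLAN_XML OFF;
GO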

Related

Replace specific characters within SQL query

I'm struggling with some special characters that work fine in my SQL query but will create problems in a secondary system (Excel), so I would like to replace them during the query itself if possible.
TRANSACTIONS
ID DESC
1 14ft
2 15/16ft
3 17ft
This is just a dummy example, but "/" represents one of the characters I need to remove, and there are a few different ones. Although it would technically work, I can't use:
select ID, case when DESC = '15/16ft' then '15_16ft' else DESC end as DESC from TRANSACTIONS
I can't keep track of all the strings, so I need an approach based on characters. I'd prefer converting them to another character or removing them altogether.
Unfortunately I'm not sure of the exact DB engine, although there's a good chance it's an IBM-based product, but most "generic" SQL queries tend to run fine. And just to emphasize: I'm looking to convert the data within the SQL query, not update the database records. Thanks a lot!
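Since the engine is uncertain, here is a sketch using the widely supported REPLACE function - nest one call per character to substitute (the '-' is just a second illustrative character, and "DESC" is quoted because it is a reserved word in most dialects):
SELECT ID,
       REPLACE(REPLACE("DESC", '/', '_'), '-', '_') AS DESC_CLEAN
FROM TRANSACTIONS
On DB2 and Oracle, TRANSLATE can map several characters in a single call, but beware that the two engines take the from/to arguments in opposite order, so check the docs for whichever engine this turns out to be.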

SQL Querying - Matching datetimes vs Matching integers

I have a bunch of data in my database and I want to filter out data that has been stored for longer than a week. I'm using SQL Server and I found that I could use the 'DATEDIFF' function.
At the moment it works great and fast, but I don't have a lot of records yet, so everything runs quite smoothly.
After some research online I found out that the comparison of integers in databases is faster than the comparison of strings, and I assume at this point that the comparison of datetimes (using the given function) is even slower at larger scale.
Currently I would filter out records that are older than a week like so:
SELECT * FROM onlineMainTable WHERE DATEDIFF(wk, Date, GETDATE()) > 1
I assume that this query would be quite slow if there were thousands of rows in the table.
The status column represents a calculation status. I wondered if I would speed up the process by looking for a specific status instead of matching datetimes. In order to set that status to the one that represents 'old records', I would need to update those rows before I select them. It would look something like this:
UPDATE table SET Status = -1 WHERE NOT Status = -1 AND DATEDIFF(wk, Date, GETDATE()) > 1;
SELECT * FROM table WHERE Status = -1;
I used the '-1' as an example.
So I could obviously be wrong, but I think the update in this case would be fast enough, since there won't be many records to update (older ones will already have had their status set). The selection would be faster as well, since I would be matching integers instead of datetimes.
The downside to my (possible) solution is that I would query twice every time I fetch data, even when it might not be needed (if every row is newer than 1 week).
It comes down to this: Should I compare datetimes or should I update an integer column based on that datetime and then select using the comparison of those ints?
If there is a different/better way of doing this, I'm all ears.
Context
I am making a webapp for quotation requests. Requests should expire after a week, since they won't be valid at that point. I need to display both valid requests and expired requests (so customers have an overview). All these requests are stored in a database table.
Indexes are objects designed to improve SELECT query performance; the drawback is that they slow down INSERT, DELETE and UPDATE operations, so they should only be used where necessary. DBMSs generally provide tools to explain a query's execution plan.
Maybe you just need to add an index on Date column:
create index "index_name" on onlineMainTable(Date)
and the query could be:
SELECT * FROM onlineMainTable WHERE Date > DATEADD(week, -1, GETDATE());
Unlike the original DATEDIFF(wk, Date, GETDATE()) > 1 predicate, this form leaves the Date column bare in the comparison, so the index on it can actually be used (the predicate is sargable).

For an Oracle NUMBER datatype, LIKE operator vs BETWEEN..AND operator

Assume mytable is an Oracle table and it has a field called id. The datatype of id is NUMBER(8). Compare the following queries:
select * from mytable where id like '715%'
and
select * from mytable where id between 71500000 and 71599999
I would think the second is more efficient, since I think "number comparison" would require fewer assembly-language instructions than "string comparison". I need a confirmation or correction. Please confirm/correct, and add any further comments related to either operator.
UPDATE: I forgot to mention 1 important piece of info. id in this case must be an 8-digit number.
If you only want values between 71500000 and 71599999 then yes, the second one is much more efficient. The first one would also return values between 7150-7159, 71500-71599, and so forth. You would either need to sift through unnecessary results or write another couple of lines of code to filter the rest of them out. The second option is definitely more efficient for what you seem to want to do.
It seems like the execution plan on the second query is more efficient.
The first query is doing a full table scan of the id's, whereas the second query is not.
I don't like the idea of using LIKE with a numeric column.
Also, it may not give the results you are looking for.
If you have a value of 715000000, it will show up in the query result, even though it is larger than 71599999.
Also, I do not like BETWEEN on principle. If a thing is between two other things, it should not include those two other things - but that is just a personal annoyance.
I prefer to use >= and <=. This avoids confusion when I read the query. In addition, sometimes I have to change the query to something like >= a AND < c; if I had started with the BETWEEN operator, I would have to rewrite it when I don't want to be inclusive.
Harv
In addition to the other points raised, using LIKE in the manner you suggest would cause Oracle to not use any indexes on the ID column, due to the implicit conversion of the data from number to character, resulting in a full table scan when using LIKE versus an index range scan when using BETWEEN. Assuming, of course, you have an index on ID. Even if you don't, Oracle will still have to do the type conversion on each value it scans in the LIKE case, which it won't have to do in the other.
You can use a math function; otherwise you have to use the TO_CHAR function to be able to use LIKE, but that will cause performance problems.
select * from mytable where floor(id / 100000) = 715
or
select * from mytable where floor(id / 100000) = TO_NUMBER('715') -- this is parametric
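If the prefix-style predicate is genuinely needed, a function-based index on the very same expression would keep it index-friendly - a sketch (the index name is illustrative):
CREATE INDEX mytable_id_prefix_idx ON mytable (FLOOR(id / 100000));

SELECT * FROM mytable WHERE FLOOR(id / 100000) = 715;
The indexed expression must match the one in the WHERE clause exactly for the optimizer to use it.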

IP address numbers in MySQL subquery

I have a problem with a subquery involving IPV4 addresses stored in MySQL (MySQL 5.0).
The IP addresses are stored in two tables, both in network number format - e.g. the format output by MySQL's INET_ATON(). The first table ('events') contains lots of rows with IP addresses associated with them, the second table ('network_providers') contains a list of provider information for given netblocks.
events table (~4,000,000 rows):
event_id (int)
event_name (varchar)
ip_address (unsigned int)
network_providers table (~60,000 rows):
ip_start (unsigned int)
ip_end (unsigned int)
provider_name (varchar)
Simplified for the purposes of the problem I'm having, the goal is to create an export along the lines of:
event_id,event_name,ip_address,provider_name
If I do a query along the lines of either of the following, I get the result I expect:
SELECT provider_name FROM network_providers WHERE INET_ATON('192.168.0.1') >= network_providers.ip_start ORDER BY network_providers.ip_start DESC LIMIT 1
SELECT provider_name FROM network_providers WHERE 3232235521 >= network_providers.ip_start ORDER BY network_providers.ip_start DESC LIMIT 1
That is to say, it returns the correct provider_name for whatever IP I look up (of course I'm not really using 192.168.0.1 in my queries).
However, when performing this same query as a subquery, in the following manner, it doesn't yield the result I would expect:
SELECT
    events.event_id,
    events.event_name,
    (SELECT provider_name FROM network_providers
     WHERE events.ip_address >= network_providers.ip_start
     ORDER BY network_providers.ip_start DESC LIMIT 1) AS provider
FROM events
Instead, a different (incorrect) value for provider is returned. Over 90% (but curiously not all) of the values returned in the provider column contain the wrong provider information for that IP.
Using events.ip_address in a subquery just to echo out the value confirms that it contains the value I'd expect and that the subquery can read it. Replacing events.ip_address with an actual network number also works; it's only using it dynamically in the subquery in this manner that doesn't work for me.
I suspect the problem is there is something fundamental and important about subqueries in MySQL that I don't get. I've worked with IP addresses like this in MySQL quite a bit before, but haven't previously done lookups for them using a subquery.
The question:
I'd really appreciate an example of how I could get the output I want, and if someone here knows, some enlightenment as to why what I'm doing doesn't work so I can avoid making this mistake again.
Notes:
The actual real-world usage I'm trying to do is considerably more complicated (involving joining two or three tables). This is a simplified version, to avoid overly complicating the question.
Additionally, I know I'm not using a BETWEEN on ip_start & ip_end - that's intentional (the DBs can be out of date, and in such cases the owner in the DB is almost always in the next specified range, so a 'best guess' is fine in this context) - however, I'm grateful for any suggestions for improvement that relate to the question.
Efficiency is always nice, but in this case absolutely not essential - any help appreciated.
You should take a look at this post:
http://jcole.us/blog/archives/2007/11/24/on-efficiently-geo-referencing-ips-with-maxmind-geoip-and-mysql-gis/
It has some nice ideas for working with IPs in queries very similar to yours.
Another thing you should try is using a stored function instead of a sub-query. That would simplify your query as follows:
SELECT
    events.event_id,
    events.event_name,
    GET_PROVIDER_NAME(events.ip_address) AS provider
FROM events
There seems to be no way to achieve what I wanted with a JOIN or Subquery.
To expand on Ike Walker's suggestion of using a stored function, I ended up creating a stored function in MySQL with the following:
DELIMITER //
DROP FUNCTION IF EXISTS get_network_provider //
CREATE FUNCTION get_network_provider(ip_address_number INT UNSIGNED) RETURNS VARCHAR(255)
READS SQL DATA
BEGIN
    DECLARE network_provider VARCHAR(255);
    -- Best-guess match: the provider whose range starts closest below the IP.
    SELECT provider_name INTO network_provider FROM network_providers
        WHERE ip_address_number >= network_providers.ip_start
        AND network_providers.provider_name != ""
        ORDER BY network_providers.ip_start DESC LIMIT 1;
    RETURN network_provider;
END //
DELIMITER ;
Explanation:
The check to ignore blank names, and using >= & ORDER BY for ip_start rather than BETWEEN ip_start and ip_end is a specific fudge for the two combined network provider databases I'm using, both of which need to be queried in this way.
This approach works well when the query calling the function only needs to return a few hundred results (though it may take a handful of seconds). On queries that return a few thousand results, it may take 2 or 3 minutes. For queries with tens of thousands of results (or more) it's too slow to be practical use.
This was not unexpected from using a stored function like this (i.e. every result returned triggering a separate query) but I did hit a drop in performance sooner than I had expected.
Recommendation:
The upshot of this was that I needed to accept that the data structure is just not suitable for my needs. This had already been pointed out to me by a friend; it just wasn't something I really wanted to hear at the time (because I really wanted to use that specific network_provider DB, due to other keys in the table that were useful to me, e.g. for things like geolocation).
If you end up trying to use any of the IP provider DBs (or indeed any other database) that follow a similarly dubious data format, then I can only suggest that they are just not going to be suitable, and it's not worth trying to cobble something together that will work with them as they are.
At the very least, you need to reformat the data so that it can be reliably used with a simple BETWEEN statement (no sorting, and no other comparisons), so you can use it with subqueries (or JOINs) - although data that messed up is probably an indicator that it is not all that reliable anyway.
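For reference, once every row has reliable, non-overlapping ip_start/ip_end bounds, the lookup collapses to a plain join - a sketch using the tables from the question:
SELECT e.event_id, e.event_name, e.ip_address, np.provider_name
FROM events e
JOIN network_providers np
  ON e.ip_address BETWEEN np.ip_start AND np.ip_end;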

Providex Query Performance

I am running a query against a ProvideX database that we use in MAS 90. The query joins three tables, and has been slow but not unbearably so, taking about 8 minutes per run. The query has a fair number of conditions in the WHERE clause.
I'm going to omit the SELECT part of the query as it's long and simple: just a list of fields from the three tables to be used in the results.
The tables and the WHERE clauses of the 8-minute version are:
(The first parameter is the lower bound of the user-selected date range, the second is the upper bound.)
FROM "AR_InvoiceHistoryDetail" "AR_InvoiceHistoryDetail",
"AR_InvoiceHistoryHeader" "AR_InvoiceHistoryHeader", "IM1_InventoryMasterfile"
"IM1_InventoryMasterfile"
WHERE "AR_InvoiceHistoryDetail"."InvoiceNo" = "AR_InvoiceHistoryHeader"."InvoiceNo"
AND "AR_InvoiceHistoryDetail"."ItemCode" = "IM1_InventoryMasterfile"."ItemNumber"
AND "AR_InvoiceHistoryHeader"."SalespersonNo" = 'SMC'
AND "AR_InvoiceHistoryHeader"."OrderDate" >= #p_dr
AND "AR_InvoiceHistoryHeader"."OrderDate" <= #p_d2
However, it turns out that another date field in the same table needs to be the one the date range is compared with. So I changed the OrderDate references at the end of the WHERE clause to InvoiceDate. I haven't had the query run successfully at all yet, and I've waited over 40 minutes. I have no control over indexing, because this is a MAS 90 database whose characteristics I don't believe I can change directly.
What could cause such a large (at least 5-fold) difference in performance? Is it that OrderDate might have been indexed while InvoiceDate was not? I have tried BETWEEN clauses, but they don't seem to work in the ProvideX dialect. I am using the ODBC interface through .NET in my custom report engine. I have been debugging the report, and it was sitting at the database execution point when I asked VS to Break All - the same spot where the 8-minute report was waiting - so it is almost certainly either something in my query or something in the database that is screwed up.
If it's just the case that InvoiceDate isn't indexed, what else can I do in the ProvideX dialect of SQL to optimize the performance of these queries? Should I change the order of my criteria? This report gets results for a specific salesperson, which is why the 'SMC' clause exists. The prior clauses are for the inner joins, and the last clauses are for the date range.
I used an identical date range in both the OrderDate and InvoiceDate versions, have run them all multiple times, and got the same results.
I still don't know exactly why it was so slow, but we had another problem with the results coming from the query (we had switched back to using OrderDate): we weren't getting some of the results because of the nature of the IM1 table.
So I added a LEFT OUTER JOIN once I figured out ProvideX's syntax for it. And for some reason, even though we still have 3 tables, it runs a lot faster now.
The new query criteria are:
FROM "AR_InvoiceHistoryHeader" "AR_InvoiceHistoryHeader",
{OJ "AR_InvoiceHistoryDetail" "AR_InvoiceHistoryDetail"
LEFT OUTER JOIN "IM1_InventoryMasterfile" "IM1_InventoryMasterfile"
ON "AR_InvoiceHistoryDetail"."ItemCode" =
"IM1_InventoryMasterfile"."ItemNumber" }
WHERE "AR_InvoiceHistoryDetail"."InvoiceNo" =
"AR_InvoiceHistoryHeader"."InvoiceNo" AND
"AR_InvoiceHistoryHeader"."SalespersonNo" = 'SMC'
AND "AR_InvoiceHistoryHeader"."InvoiceDate" >= ?
AND "AR_InvoiceHistoryHeader"."InvoiceDate" <= ?
Strange, but at least I learned more of the world of ProvideX SQL in the process.
I've never used providex before.
A search turned up this reference article on the syntax for creating an index.
Looking over your query, there are three tables and five criteria. Two of the criteria are join criteria, and three are filtering criteria:
AND "AR_InvoiceHistoryHeader"."SalespersonNo" = 'SMC'
AND "AR_InvoiceHistoryHeader"."OrderDate" >= #p_dr
AND "AR_InvoiceHistoryHeader"."OrderDate" <= #p_d2
I don't know how good SalespersonNo is for limiting return results, but it might be good to add an index on that.
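I don't know whether the ProvideX ODBC driver accepts DDL at all (the linked article covers the native syntax), but in generic SQL terms that index would look something like this sketch (the index name is illustrative):
CREATE INDEX idx_hdr_salesperson ON AR_InvoiceHistoryHeader (SalespersonNo);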
I haven't used .NET so my question may show ignorance, but in Access you must use a SQL Pass-Through query to wring any results from ProvideX, if more than one table is involved.