SQL Querying - Matching datetimes vs Matching integers

I have a bunch of data in my database and I want to filter out data that has been stored for longer than a week. I'm using SQL Server and I found that I could use the DATEDIFF function.
At the moment it works great and fast, but I don't have a lot of records yet, so everything runs quite smoothly.
After some research online I found out that comparing integers in databases is faster than comparing strings, so I assume that comparing datetimes (using the given function) is even slower at scale.
Let's say my table contains, among other columns, a Date column and an integer Status column. Currently I would filter out records that are older than a week, like so:
SELECT * FROM onlineMainTable WHERE DATEDIFF(wk, Date, GETDATE()) > 1
I assume that this query would be quite slow if there were thousands of rows in the table.
The status column represents a calculation status. I wondered if I could speed up the process by looking for a specific status instead of matching datetimes. To set that status to the one that represents 'old records', I would need to update those rows before I select them. It would look something like this:
UPDATE table SET Status = -1 WHERE NOT Status = -1 AND DATEDIFF(wk, Date, GETDATE()) > 1;
SELECT * FROM table WHERE Status = -1;
I used the '-1' as an example.
So I could obviously be wrong, but I think the update in this case would be fast enough, since there won't be many records to update (older ones have already had their status set). The selection would be faster as well, since I would be matching integers instead of datetimes.
The downside to my (possible) solution is that I would query twice every time I fetch data, even when it might not be needed (if every row is newer than 1 week).
It comes down to this: Should I compare datetimes or should I update an integer column based on that datetime and then select using the comparison of those ints?
If there is a different/better way of doing this, I'm all ears.
Context
I am making a webapp for quotation requests. Requests should expire after a week, since they won't be valid at that point. I need to display both valid requests and expired requests (so customers have an overview). All these requests are stored in a database table.

Indexes are objects designed to improve SELECT query performance; the drawback is that they slow down INSERT, DELETE and UPDATE operations, so they should only be used where necessary. Most DBMSs provide tools to explain a query's execution plan.
Maybe you just need to add an index on Date column:
create index "index_name" on onlineMainTable(Date)
and query could be
SELECT * FROM onlineMainTable WHERE Date > DATEADD(week,-1,GETDATE());
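Given the stated context (both valid and expired requests need to be displayed), a single labelling query may beat maintaining a Status column entirely - a minimal sketch, assuming the table and columns from the question, with a hypothetical RequestState alias:
SELECT *,
       CASE WHEN Date > DATEADD(week, -1, GETDATE())
            THEN 'valid'
            ELSE 'expired'
       END AS RequestState  -- hypothetical alias; Date and the table are from the question
FROM onlineMainTable;
The bare comparison on Date keeps the predicate sargable, so the index suggested above can still be used.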

Related

Querying time higher with 'Where' than without it

I have something which I think is a strange issue. Normally, I would expect a query to take less time if I add a restriction (so that fewer rows are processed). But for some reason this is not the case. Maybe I'm doing something wrong, but I don't get an error; the query just seems to run till infinity.
This is the query
SELECT
    A.ENTITYID AS ORG_ID,
    A.ID_VALUE AS LEI,
    A.MODIFIED_BY,
    A.AUDITDATETIME AS LAST_DATE_MOD
FROM (
    SELECT
        CASE WHEN IFE.NEWVALUE IS NOT NULL
             THEN EXTRACTVALUE(xmltype(IFE.NEWVALUE), '/DocumentElement/ORG_IDENTIFIERS/ID_TYPE')
             ELSE NULL
        END AS ID_TYPE,
        CASE WHEN IFE.NEWVALUE IS NOT NULL
             THEN EXTRACTVALUE(xmltype(IFE.NEWVALUE), '/DocumentElement/ORG_IDENTIFIERS/ID_VALUE')
             ELSE NULL
        END AS ID_VALUE,
        (SELECT u.username FROM admin.users u WHERE u.userid = ife.analystuserid) AS Modified_by,
        ife.*
    FROM ife.audittrail ife
    WHERE
        --IFE.AUDITDATETIME >= '01-JUN-2016' AND
        attributeid = 499
        AND ROWNUM <= 10000
        AND (CASE WHEN IFE.NEWVALUE IS NOT NULL THEN EXTRACTVALUE(xmltype(IFE.NEWVALUE), '/DocumentElement/ORG_IDENTIFIERS/ID_TYPE') ELSE NULL END) = '38'
) A
--WHERE A.AUDITDATETIME >= '01-JUN-2016';
So I tried with the two clauses commented out (one at a time, of course).
And the same thing happens with both of them: the query runs for so long that I have to abort it.
Do you know why this could be happening? Is there a different way to apply the restriction?
The values of the AUDITDATETIME field look like '06-MAY-2017', for example. In that format.
Thank you very much in advance
I think you may misunderstand how databases work.
Firstly, read up on EXPLAIN - you can find out exactly what is taking time, and why, by learning to read the EXPLAIN statement.
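For example, in Oracle (which this query appears to be, given xmltype and ROWNUM), getting the plan looks something like this - a sketch using a simplified form of the query above:
EXPLAIN PLAN FOR
SELECT * FROM ife.audittrail WHERE attributeid = 499;  -- simplified form of the slow query
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);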
Secondly - the performance characteristics of any given query are determined by a whole range of things, but usually the biggest effort goes not into processing rows, but into finding them.
Without an index, the database has to look at every row in the database and compare it to your where clause. It's the equivalent of searching in the phone book for a phone number, rather than a name (the phone book is indexed on "last name").
You can improve this by creating indexes - for instance, on columns "AUDITDATETIME" and "attributeid".
Unlike the phone book, a database server can support multiple indexes - and if those indexes match your where clause, your query will be (much) faster.
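For instance (a sketch; the index name is made up, the columns are the ones from the question):
CREATE INDEX idx_audittrail_attr_date ON ife.audittrail (attributeid, auditdatetime);  -- composite: equality column first, then the date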
Finally, using an XML string extraction for a comparison in the where clause is likely to be extremely slow unless you've got an index on that XML data.
This is the equivalent of searching the phone book and translating the street address from one language to another - not only do you have to inspect every address, you have to execute an expensive translation step for each item.
You probably need index(es)... We can all guess at what indexes you already have and need to add, but most DBMSs have built-in query optimizers.
If you are using MS SQL Server you can execute the query with its actual query plan; that will tell you what index you need to add to optimize this particular query. It will even let you copy/paste the command to create it.

Constructing an SQL Query to get records between two dates

I'm trying to filter out and report records in a database that fall between a specified date range. I know there are other threads here on how to do something similar, but my dates are stored as date timestamps (which is why I think the issue is arising).
My current query is as follows:
"SELECT * FROM JOURNAL WHERE Date_Time>'10/10/2013 00:00:00'"
(Note that journal is the name of the table I'm pulling the data from and date_time is the field in which the date is stored. I'm aware the query doesn't quite do what I want it to yet, but I was just testing out a simpler case at first.)
When I run this query (as part of an Excel macro), Excel reports that it can't find any records, even though I know there are records past this date. Does anyone know how to do this properly?
Edit: I've got it, it was an issue unrelated to the query (something else in the macro) Thanks so much for the help (changing the date format worked)
Have you tried another date format? Like this:
"SELECT * FROM JOURNAL WHERE Date_Time>'2013-10-10 00:00:00'"
A simple between statement is what you need:
SELECT * FROM JOURNAL WHERE Date_Time between '10/10/2013 00:00:00' and '[otherdate]'
You need to check one important thing when you run this: whether the server treats BETWEEN as inclusive or not. If it's inclusive, both endpoint dates are included. If not, the range will exclude the endpoint on one or both sides.
I've seen SQL servers that are the same in every respect actually treat this condition differently. So it's a good idea to check that.
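If inclusivity is a concern, one common alternative (a sketch against the question's JOURNAL table; the end date is illustrative) is a half-open range, which avoids the ambiguity entirely:
SELECT * FROM JOURNAL
WHERE Date_Time >= '2013-10-10 00:00:00'
AND Date_Time < '2013-10-17 00:00:00'  -- exclusive upper bound: the day after the last day wanted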

How could I write this code in a more performant way?

In our app people have 1 or multiple projects. These projects have a start and an end date. People have a limited amount of available days.
Now we have a page that displays the availability of a given person on a week by week basis. It currently shows 18 weeks.
The way we currently calculate the available time for a given week is like this:
def days_available(query_date = Date.today)
  days_engaged = projects.current.where("start_date < ? AND finish_date > ?", query_date, query_date).sum(:days_on_project)
  available = days_total - hours_engaged
end
This means that to display the page described above, the app will fire 18(!) queries at the database. We have pages that list the availability of multiple people in a table. For these pages the number of queries quickly becomes staggering.
It is also quite slow.
How could we handle the availability retrieval in a more performant manner?
This is quite a common scenario when working with date ranges in an entity. The easiest and fastest way to do it is in SQL:
Join your events to a generated date table (see "generate days from date range") so that you have a row for each day a person or people are occupied. Once you have the data in this form, it is simply a matter of grouping by the week part of the date and counting the rows per grouping.
You can extend this to group by person for multiple person queries.
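A minimal sketch of that shape, assuming MySQL's YEARWEEK() and a pre-built calendar table dates(d) with one row per day (the person_id column is illustrative, since the projects schema isn't fully shown):
SELECT p.person_id,
       YEARWEEK(dates.d) AS week,
       COUNT(*) AS days_occupied  -- one row per person per occupied day, counted per week
FROM projects p
JOIN dates ON dates.d BETWEEN p.start_date AND p.finish_date
GROUP BY p.person_id, YEARWEEK(dates.d);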
From a SQL point of view, I'd advise using a stored procedure and passing in your date/range requirement; you can then return a recordset for a user, or possibly multiple users. This way your code only has to hit the DB once.
You can then output the recordset data in one go by iterating through it.
Hope this helps.
Use a stored procedure to fire your query at SQL to get the data.
Pass parameters to the SQL query - in your case, today's date.
Apply your conditions and logic in the SQL stored procedure. Using a procedure is a good and fast way to retrieve data from SQL, and it will also protect your code from SQL injection.
Call that SP from your code - as I don't know Ruby on Rails, I can't provide the steps for calling a stored procedure from it.
After that, the data fetched by your stored procedure will be available in a data table or something like that.
After getting the data you can perform whatever you need.
Hope this helps
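A minimal sketch of such a procedure (MySQL syntax assumed, since the DBMS isn't stated; table and column names are taken from the question's snippet):
DELIMITER //
CREATE PROCEDURE days_engaged_for(IN query_date DATE)
BEGIN
    -- total days on all projects overlapping the given date
    SELECT SUM(days_on_project) AS days_engaged
    FROM projects
    WHERE start_date < query_date AND finish_date > query_date;
END //
DELIMITER ;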
See what query is executed; you can also run the EXPLAIN command on your query:
explain select * from project where start_date < any_date and end_date > any_date2
You will see the query plan. Use this plan to optimize your query.
For example:
If you have an index on the end_date field, put that condition first (end_date > any_date2 AND start_date < any_date). This step will use the index if you have one on that field, but it is DB-dependent - this example is for MySQL. If you want to use an index in MySQL, the indexed condition must be in the left part of the WHERE clause.
There's not really enough information in your question to know exactly what you're trying to achieve here, e.g. the code snippet doesn't make use of the returned database query, so you could just remove it to make it faster. Perhaps this is just a bug in the code you posted?
Having said that, there are some techniques you should look into to implement your functionality.
I would take a look at using data warehouse techniques. I would think of your 'availability information' as a Fact table in a star schema, with 'Dates' and 'People' as Dimension tables.
You can then use queries to get things like a list of users on given projects in a given week, and their availability.
Data warehousing has a whole bunch of resources you can tap into to help make this perform well, and also a lot of terminology that can be confusing, but for this type of 'I need to slice and dice my data across several sets of things (people and time)' problem, data warehousing techniques can be quite powerful.
As I don't know Ruby on Rails, from a SQL point of view I suggest you write a stored procedure and return a dataset, then do the necessary table operations on the dataset from the front end. It will reduce unnecessary calls to the DB.

What's the best way to handle a SQL query on a Date (no time)?

We have date columns in our database that are just a day - like birth date. However, SQL Server stores them as a date & time and the time in the records has various values (no idea how it ended up that way).
The problem is people will run a query for all birthdates <= {some date} and the ones that are equal aren't returned because a DateTime (using ADO.NET) set to a given date has a time of midnight.
I understand what's going on. The question is how best to handle this. We could force in a time of 23:59:59.999999999 to the date, but that feels like it would have problems.
What's the standard best practice for handling this?
Simply add 1 day to {some_date} and use a less than comparison. Just make sure it's the next day at 12am...
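In T-SQL that looks something like this (a sketch; the table, column and @SomeDate are all illustrative):
DECLARE @SomeDate date = '2013-10-10';
SELECT * FROM dbo.People
WHERE BirthDate < DATEADD(DAY, 1, @SomeDate);  -- strictly less than the next day at midnight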
If you need to query this frequently, I would probably add a computed, persisted column that casts your DATETIME to just a DATE (assuming you're on SQL Server 2008 or newer):
ALTER TABLE dbo.YourTableName
ADD JustDay AS CAST(YourDateTimeColumn AS DATE) PERSISTED
That way, you can now query on JustDay and it's just a DATE - no time portion involved. Since it's computed, there's no need to update it constantly; SQL Server will do this automagically for you. And since it's persisted, it's part of the table's on-disk structure and just as fast as a query on any other column - and it can even be indexed, if needed.
It's a classic space-vs-speed tradeoff: since you're now also storing the date-only portion of all your birthdays, your on-disk structure will be larger; on the other hand, since you have a nice, date-only column that can be indexed, you have a great way to speed up searches.
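Indexing the computed column would look like this (a sketch; the index name is arbitrary and @SomeDate is illustrative):
CREATE INDEX IX_YourTableName_JustDay ON dbo.YourTableName (JustDay);
-- date-only comparisons can now seek on the index:
SELECT * FROM dbo.YourTableName WHERE JustDay <= @SomeDate;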
You say:
"The problem is people will run a query for all birthdates <= {some date}"
You could leave it as is and make sure people get rid of the time by using something like the following in their WHERE clauses:
CONVERT(DATETIME,CONVERT(CHAR(8),birthdates,112))<= {some date}
..or in later versions of SQL-Server:
CONVERT(DATE,birthdates)<= {some date}
But this is a workaround and best to take the other advice and get rid of the time in the actual target data.
One more option is:
DATEDIFF(d, birthdates, {some date}) >= 0

IP address numbers in MySQL subquery

I have a problem with a subquery involving IPV4 addresses stored in MySQL (MySQL 5.0).
The IP addresses are stored in two tables, both in network number format - e.g. the format output by MySQL's INET_ATON(). The first table ('events') contains lots of rows with IP addresses associated with them, the second table ('network_providers') contains a list of provider information for given netblocks.
events table (~4,000,000 rows):
event_id (int)
event_name (varchar)
ip_address (unsigned int)
network_providers table (~60,000 rows):
ip_start (unsigned int)
ip_end (unsigned int)
provider_name (varchar)
Simplified for the purposes of the problem I'm having, the goal is to create an export along the lines of:
event_id,event_name,ip_address,provider_name
If I do a query along the lines of either of the following, I get the result I expect:
SELECT provider_name FROM network_providers WHERE INET_ATON('192.168.0.1') >= network_providers.ip_start ORDER BY network_providers.ip_start DESC LIMIT 1
SELECT provider_name FROM network_providers WHERE 3232235521 >= network_providers.ip_start ORDER BY network_providers.ip_start DESC LIMIT 1
That is to say, it returns the correct provider_name for whatever IP I look up (of course I'm not really using 192.168.0.1 in my queries).
However, when performing this same query as a subquery, in the following manner, it doesn't yield the result I would expect:
SELECT
    events.event_id,
    events.event_name,
    (SELECT provider_name FROM network_providers
     WHERE events.ip_address >= network_providers.ip_start
     ORDER BY network_providers.ip_start DESC LIMIT 1) AS provider
FROM events
Instead, a different (incorrect) value for provider is returned. Over 90% (but curiously not all) of the values returned in the provider column contain the wrong provider information for that IP.
Using events.ip_address in a subquery just to echo out the value confirms it contains the value I'd expect and that the subquery can read it. Replacing events.ip_address with an actual network number also works; it is only using it dynamically in the subquery in this manner that doesn't work for me.
I suspect the problem is there is something fundamental and important about subqueries in MySQL that I don't get. I've worked with IP addresses like this in MySQL quite a bit before, but haven't previously done lookups for them using a subquery.
The question:
I'd really appreciate an example of how I could get the output I want, and if someone here knows, some enlightenment as to why what I'm doing doesn't work so I can avoid making this mistake again.
Notes:
The actual real-world usage I'm trying to do is considerably more complicated (involving joining two or three tables). This is a simplified version, to avoid overly complicating the question.
Additionally, I know I'm not using a BETWEEN on ip_start & ip_end - that's intentional (the DBs can be out of date, and in such cases the owner in the DB is almost always in the next specified range, so a 'best guess' is fine in this context). However, I'm grateful for any suggestions for improvement that relate to the question.
Efficiency is always nice, but in this case absolutely not essential - any help appreciated.
You should take a look at this post:
http://jcole.us/blog/archives/2007/11/24/on-efficiently-geo-referencing-ips-with-maxmind-geoip-and-mysql-gis/
It has some nice ideas for working with IPs in queries very similar to yours.
Another thing you should try is using a stored function instead of a sub-query. That would simplify your query as follows:
SELECT
    events.event_id,
    events.event_name,
    GET_PROVIDER_NAME(events.ip_address) AS provider
FROM events
There seems to be no way to achieve what I wanted with a JOIN or Subquery.
To expand on Ike Walker's suggestion of using a stored function, I ended up creating a stored function in MySQL with the following:
DELIMITER //
DROP FUNCTION IF EXISTS get_network_provider //
-- Parameter is INT UNSIGNED to match the unsigned ip columns (INET_ATON values can exceed signed INT)
CREATE FUNCTION get_network_provider(ip_address_number INT UNSIGNED) RETURNS VARCHAR(255)
BEGIN
    DECLARE network_provider VARCHAR(255);
    SELECT provider_name INTO network_provider FROM network_providers
    WHERE ip_address_number >= network_providers.ip_start
    AND network_providers.provider_name != ""
    ORDER BY network_providers.ip_start DESC LIMIT 1;
    RETURN network_provider;
END //
Explanation:
The check to ignore blank names, and the use of >= with ORDER BY on ip_start rather than BETWEEN ip_start AND ip_end, are specific fudges for the two combined network provider databases I'm using, both of which need to be queried in this way.
This approach works well when the query calling the function only needs to return a few hundred results (though it may take a handful of seconds). On queries that return a few thousand results, it may take 2 or 3 minutes. For queries with tens of thousands of results (or more) it's too slow to be of practical use.
This was not unexpected from using a stored function like this (i.e. every result returned triggering a separate query) but I did hit a drop in performance sooner than I had expected.
Recommendation:
The upshot of this was that I needed to accept that the data structure is just not suitable for my needs. This had already been pointed out to me by a friend; it just wasn't something I really wanted to hear at the time (because I really wanted to use that specific network_provider DB due to other keys in the table that were useful to me, e.g. for things like geolocation).
If you end up trying to use any of the IP provider DBs (or indeed any other database) that follow a similar dubious data format, then I can only suggest that they are just not going to be suitable, and it's not worth trying to cobble something together that will work with them as they are.
At the very least you need to reformat the data so that it can be reliably used with a simple BETWEEN statement (no sorting, and no other comparisons) so you can use it with subqueries (or JOINs) - although it's likely an indicator that any data that's that messed up is probably not all that reliable anyway.
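Once the ranges are clean and non-overlapping, the original goal falls out of a plain join - a sketch using the two tables from the question:
SELECT e.event_id, e.event_name, e.ip_address, np.provider_name
FROM events e
JOIN network_providers np
    ON e.ip_address BETWEEN np.ip_start AND np.ip_end;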