I need to select records with dates in a range (from 1998 to 1999). I wrote this statement, which did not seem to work. Why?
SELECT *
FROM Factory
WHERE
(EXTRACT(YEAR FROM date) AS dyear) BETWEEN '1998' AND '1999'
You can use YEAR() to get the year from the date.
SELECT *
FROM Factory
WHERE YEAR(date) BETWEEN 1998 AND 1999
MSAccess YEAR()
Applying the Year() function for every row in Factory will be a noticeable performance challenge if the table includes thousands of rows. (Actually it would be a performance challenge for a smaller table, too, but you would be less likely to notice the hit in that case.) A more efficient approach would be to index the [date] field and use indexed retrieval to limit the db engine's workload.
SELECT f.*
FROM Factory AS f
WHERE f.date >= #1998-1-1# AND f.date < #2000-1-1#;
Whenever possible, design your queries to take advantage of indexed retrieval. That can improve performance dramatically. As a simplistic rule of thumb: indexed retrieval = good; full table scan = bad. Try to avoid full table scans whenever possible.
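If the [date] field isn't already indexed, a minimal sketch of adding one in Access SQL (table and field names taken from the query above):
CREATE INDEX idxFactoryDate ON Factory ([date]);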
I'm attempting to create a view to join in hierarchical data to a normalized dataset in SQL using a profileID field.
The issue that I'm having is that my company's hierarchy data is, for lack of a better term, gapped. There are startdate and enddate fields that need to be considered in the join.
Currently I'm working with something like the following -
Select * from
dbo.datatable dt
inner join dbo.hierarchy h
on dt.profileid = h.profileid
AND dt.date >= h.startdate
AND dt.date < h.enddate
I've got a clustered index on dt that includes date and profileid and a clustered index on h that includes startdate, enddate, and profileid. SSMS has also suggested a couple indexes that I've added as well that include a lot of the data fields.
I cannot change the format of the hierarchy, but the view is absurdly slow when I try to pull a large number of days in a sql query. This dataset is end-user facing, so it's gotta be fast and usable.
Any tips are greatly appreciated!
In both tables, put profileId first in the index, because it is tested with =.
Alas, that is probably the only optimization you can do. Looking at a date range degenerates into using only one of the tests, and leads to scanning up to half the table. Or, after testing profileId, scanning half the rows for that profileId.
If the start-end ranges never overlap, there may be a trick to make things faster, but it will involve changes to the schema and code.
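For illustration, such indexes might look like this (index names are made up; tailor the column lists to your actual queries):
-- profileid first because it is an equality test; the date column(s) follow for the range
CREATE NONCLUSTERED INDEX IX_datatable_profileid_date
    ON dbo.datatable (profileid, [date]);
CREATE NONCLUSTERED INDEX IX_hierarchy_profileid_dates
    ON dbo.hierarchy (profileid, startdate, enddate);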
I have this oracle query that takes around 1 minute to get the results:
SELECT TRUNC(sysdate - data_ricezione) AS delay
FROM notifiche#fe_engine2fe_gateway n
WHERE NVL(n.data_ricezione, TO_DATE('01011900', 'ddmmyyyy')) =
(SELECT NVL(MAX(n2.data_ricezione), TO_DATE('01011900', 'ddmmyyyy'))
FROM notifiche#fe_engine2fe_gateway n2
WHERE n.id_sdi = n2.id_sdi)
--AND sysdate-data_ricezione > 15
Basically I have this table named "notifiche", where each record represents a kind of update to another type of object (invoices). I want to know which invoices have not received any update in the last 15 days. I can do that by joining the notifiche table to itself (as n2), getting the most recent record for each invoice, and evaluating the difference between the update date (data_ricezione) and the current date (sysdate).
When I add the commented condition, the query then takes a seemingly infinite time to complete (I mean hours; I never saw the end of it...)
How is it possible that this simple condition makes the query so slow?
How can I improve the performance?
Try to keep data_ricezione alone; if there's an index on it, it might help.
So: switch from
and sysdate - data_ricezione > 15
to
and -data_ricezione > 15 - sysdate --> then multiply both sides by (-1)
to
and data_ricezione < sysdate - 15
As everything is done over the database link, see whether the driving_site hint does any good, i.e.
select /*+ driving_site (n) */ --> "n" is table's alias
trunc(sysdate-data_ricezione) as delay
from
notifiche#fe_engine2fe_gateway n
...
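Putting those two suggestions together, the query might look something like this (a sketch based on the original query above):
select /*+ driving_site(n) */
  trunc(sysdate - n.data_ricezione) as delay
from notifiche#fe_engine2fe_gateway n
where nvl(n.data_ricezione, to_date('01011900', 'ddmmyyyy')) =
      (select nvl(max(n2.data_ricezione), to_date('01011900', 'ddmmyyyy'))
       from notifiche#fe_engine2fe_gateway n2
       where n2.id_sdi = n.id_sdi)
  and n.data_ricezione < sysdate - 15  -- rewritten so an index on data_ricezione can be considered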
Use an analytic function to avoid a self-join over a database link. The query below only reads the table once, divides the rows into windows, finds the MAX value for each window, and lets you select rows based on that maximum. Analytic functions are tricky to understand at first, but they often lead to code that is smaller and more efficient.
select id_sdi, data_ricezione
from
(
select id_sdi, data_ricezione, max(data_ricezione) over (partition by id_sdi) max_date
from notifiche#fe_engine2fe_gateway
)
where sysdate - max_date > 15;
As for why adding a simple condition can make the query slow - it's all about cardinality estimates. Cardinality, the number of rows, drives most of the database optimizer's decision. The best way to join a small amount of data may be very different than the best way to join a large amount of data. Oracle must always guess how many rows are returned by an operation, to know which algorithm to use.
Optimizer statistics (metadata about the tables, columns, and indexes) are what Oracle uses to make cardinality estimates. For example, to guess the number of rows filtered out by sysdate-data_ricezione > 15, the optimizer would want to know how many rows are in the table (DBA_TABLES.NUM_ROWS), what the maximum value for the column is (DBA_TAB_COLUMNS.HIGH_VALUE), and maybe a breakdown of how many rows fall into different age ranges (DBA_TAB_HISTOGRAMS).
All of that information depends on optimizer statistics being correctly gathered. If a DBA foolishly disabled automatic optimizer statistics gathering, then these problems will happen all the time. But even if your system is using good settings, the predicate you're using may be an especially difficult case. Optimizer statistics aren't free to gather, so the system only collects them when 10% of the data changes. But since your predicate involves SYSDATE, the percentage of rows will change every day even if the table doesn't change. It may make sense to manually gather stats on this table more often than the default schedule, or use a /*+ dynamic_sampling */ hint, or create a SQL Profile/Plan Baseline, or one of the many ways to manage optimizer statistics and plan stability. But hopefully none of that will be necessary if you use an analytic function instead of a self-join.
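For completeness, a hedged sketch of the manual route (the owner name is a placeholder, and since the table sits behind a database link the statistics would be gathered on the remote database):
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'REMOTE_OWNER', tabname => 'NOTIFICHE');  -- placeholder owner
END;
/
-- or let the optimizer sample at parse time with a hint on the problematic query:
select /*+ dynamic_sampling(n 4) */ trunc(sysdate - n.data_ricezione) as delay
from notifiche#fe_engine2fe_gateway n
where n.data_ricezione < sysdate - 15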
We have a 6B row table that is giving us challenges when retrieving data.
Our query returns values instantly when doing a...
SELECT * WHERE Event_Code = 102225120
That type of instant result is exactly what we need. We now want to filter to receive values for just a particular year - but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row. There are other approaches like using TRUNC, or BETWEEN, or specifying the datetime in YYYY-MM-DD format for doing comparisons.
Of note, we do not have the option to add indexes to the database as it is a vendor's database.
What is the way to add a date filtering query and enable Oracle to begin streaming the results back in the fastest way possible?
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row
That question is quite different from yours. Firstly, the statement you quoted applies to any data type, not only dates. Also, the word many is relative to the number of records in the table. If the optimizer decides that the query will return many of all the records in your table, then it may decide that a full scan of the table is faster than using the index. In your situation, this translates to: how many records are in 2017, out of all the records in the table? That calculation gives you the cardinality of your query, which in turn gives you an idea of whether an index will be faster or not.
Now, if you decide that an index will be faster, based on the above, the next step is to know how to build your index. In order for the optimizer to use the index, it must match the condition that you're using. You are not comparing dates in your query, you are only comparing the year part. So an index on the date column will not be used by this query. You need to create an index on the year part, so use the same condition to create the index.
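To make that concrete, a sketch of such a function-based index (the table name is a placeholder since the real one isn't given, and this assumes you could add indexes at all):
-- Matches the predicates Event_Code = ... AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
CREATE INDEX ix_eventcode_year
  ON your_event_table (Event_Code, EXTRACT(YEAR FROM PERFORMED_DATE_TIME));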
we do not have the option to add indexes to the database as it is a vendor's database.
If you cannot modify the database, there is no way to optimize your query. You need to talk to the vendor and get access to modify the database or ask them to add the index for you.
Applying a function to every row can also cause slowness with this number of records involved. I'm not sure whether a function-based index would help here, but you can try.
Have you tried adding a year column to the table? If not, add one and populate it using the code below.
UPDATE table
SET year = EXTRACT(YEAR FROM PERFORMED_DATE_TIME);
This will take time though.
But after this, you can run the query below.
SELECT *
FROM table
WHERE Event_Code = 102225120 AND year = 2017;
Also, consider table partitioning for data this large. For starters, see the link below:
link: https://oracle-base.com/articles/8i/partitioned-tables-and-indexes
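As a rough illustration of that idea (table name and column list are assumptions, and repartitioning a vendor table would still need their cooperation):
CREATE TABLE events_by_year (
  event_code          NUMBER,
  performed_date_time DATE
  -- remaining columns omitted
)
PARTITION BY RANGE (performed_date_time)
INTERVAL (NUMTOYMINTERVAL(1, 'YEAR'))
(PARTITION p_initial VALUES LESS THAN (DATE '2017-01-01'));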
Your question is a bit ambiguous IMHO:
but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Do you mean that
SELECT * WHERE Event_Code = 102225120
is fast, but
SELECT * WHERE Event_Code = 102225120 AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
is slow???
For starters, I'll agree with Mitch Wheat that you should try to use PERFORMED_DATE_TIME between Jan 1, 2017 and Dec 31, 2017 instead of Year(field) = 2017. Even if you had an index on the field, the latter would hardly be able to make use of it, while the first method would benefit enormously.
I'm also hoping you want to be more specific than just 'give me all of 2017' because returning over 1B rows is NEVER going to be fast.
Next, if you can't make changes to the database, would you be able to maintain a 'shadow' in another database? This would require that you create a table with all date-values AND the PK of the original table in another database and query those to find the relevant PK values and then JOIN those back to your original table to find whatever you need. The biggest problem with this would be that you need to keep the shadow in sync with the original table. If you know the original table only changes overnight, you could merge the changes in the morning and query all day. If the application is 'real-time(ish)' then this probably won't work without some clever thinking... And yes, your initial load of 6B values will be rather heavy =)
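A rough sketch of that shadow idea (every name here is hypothetical, and the sync mechanism is left out, as discussed):
-- Shadow table in your own database: just the PK and the date, both indexable
CREATE TABLE event_shadow (
  source_pk            NUMBER PRIMARY KEY,  -- PK of the vendor table (hypothetical name)
  performed_date_time  DATE NOT NULL
);
CREATE INDEX ix_shadow_perf_date ON event_shadow (performed_date_time);
-- Find the relevant keys cheaply here, then join back to the vendor table for the full rows
SELECT v.*
FROM event_shadow s
JOIN vendor_events v ON v.source_pk = s.source_pk  -- vendor table/PK names are placeholders
WHERE s.performed_date_time >= DATE '2017-01-01'
  AND s.performed_date_time <  DATE '2018-01-01';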
Maybe this could be useful (because you avoid functions, which are a cause of context switching, and if you have an index on your date field it could be used):
with
dt as
(
select
to_date('01/01/2017', 'DD/MM/YYYY') as d1,
to_date('31/12/2017', 'DD/MM/YYYY') as d2
from dual
),
dates as
(
select
dt.d1 + rownum -1 as d
from dt
connect by dt.d1 + rownum -1 <= dt.d2
)
select *
from your_table, dates
where dates.d = PERFORMED_DATE_TIME -- assumes PERFORMED_DATE_TIME has no time portion; otherwise compare against TRUNC(PERFORMED_DATE_TIME)
Move the date literal to RHS:
AND PERFORMED_DATE_TIME >= date '2017-01-01'
AND PERFORMED_DATE_TIME < date '2018-01-01'
But without an (undisclosed) appropriate index on PERFORMED_DATE_TIME, the query is unlikely to be any faster.
One option for creating indexes in third-party databases is to script in the index and then, before any vendor upgrade, run a script to remove any indexes you've added. If the index is important, ask the vendor to add it to their database design.
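For example (index name and column list are hypothetical), keep the scripts paired:
-- add_custom_indexes.sql
CREATE INDEX ix_custom_event_date ON event_table (event_code, performed_date_time);  -- table name is a placeholder
-- remove_custom_indexes.sql (run before any vendor upgrade)
DROP INDEX ix_custom_event_date;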
I have a table which has Year, Month and few numeric columns
Year Month Total
2011 10 100
2011 11 150
2011 12 100
2012 01 50
2012 02 200
Now, I want to SELECT the rows between 2011 Nov and 2012 Feb. Note that I want the query to use a range, just as if I had a date column in the table.
Coming up with a way to use BETWEEN with the table as it is will work, but it will perform worse in every case:
It will at best consume more CPU to do some kind of calculation on the rows instead of working with them as dates.
It will at the very worst force a table scan on every row in the table, but if your columns have indexes, then with the right query a seek is possible. This could be a HUGE performance difference, because forcing the constraints into a BETWEEN clause will disable using the index.
I suggest the following instead if you have an index on your date columns and care at all about performance:
DECLARE
@FromDate date = '20111101',
@ToDate date = '20120201';
SELECT *
FROM dbo.YourTable T
WHERE
(
T.[Year] > Year(@FromDate)
OR (
T.[Year] = Year(@FromDate)
AND T.[Month] >= Month(@FromDate)
)
) AND (
T.[Year] < Year(@ToDate)
OR (
T.[Year] = Year(@ToDate)
AND T.[Month] <= Month(@ToDate)
)
);
However, it is understandable that you don't want to use such a construction as it is very awkward. So here is a compromise query, that at least uses numeric computation and will use less CPU than date-to-string-conversion computation (though not enough less to make up for the forced scan which is the real performance problem).
SELECT *
FROM dbo.YourTable T
WHERE
T.[Year] * 100 + T.[Month] BETWEEN 201111 AND 201202;
If you have an index on Year, you can get a big boost by submitting the query as follows, which has the opportunity to seek:
SELECT *
FROM dbo.YourTable T
WHERE
T.[Year] * 100 + T.[Month] BETWEEN 201111 AND 201202
AND T.[Year] BETWEEN 2011 AND 2012; -- allows use of an index on [Year]
While this breaks your requirement of using a single BETWEEN expression, it is not too much more painful and will perform very well with the Year index.
You can also change your table. Frankly, using separate numbers for your date parts instead of a single column with a date data type is not good. The reason it isn't good is because of the exact issue you are facing right now--it is very hard to query.
In some data warehousing scenarios where saving bytes matters a lot, I could envision situations where you might store the date as a number (such as 201111) but that is not recommended. The best solution is to change your table to use dates instead of splitting out the numeric value of the month and the year. Simply store the first day of the month, recognizing that it stands in for the entire month.
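If you can make that change, a minimal sketch (the new column name is illustrative; DATEFROMPARTS requires SQL Server 2012 or later):
ALTER TABLE dbo.YourTable ADD MonthStart date NULL;
GO
-- Populate the new column: the first of the month stands in for the whole month
UPDATE dbo.YourTable
SET MonthStart = DATEFROMPARTS([Year], [Month], 1);
-- Range queries then work directly, e.g.
-- WHERE MonthStart BETWEEN '20111101' AND '20120201'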
If changing the way you use these columns is not an option but you can still change your table, then you can add a persisted computed column:
ALTER Table dbo.YourTable
ADD ActualDate AS (DateAdd(year, [Year] - 1900, DateAdd(month, [Month], '18991201')))
PERSISTED;
With this you can just do:
SELECT *
FROM dbo.YourTable
WHERE
ActualDate BETWEEN '20111101' AND '20120201';
The PERSISTED keyword means that while you still will get a scan, it won't have to do any calculation on each row since the expression is calculated on each INSERT or UPDATE and stored in the row. But you can get a seek if you add an index on this column, which will make it perform very well (though all in all, this is still not as ideal as changing to use an actual date column, because it will take more space and will affect INSERTs and UPDATEs):
CREATE NONCLUSTERED INDEX IX_YourTable_ActualDate ON dbo.YourTable (ActualDate);
Summary: if you truly can't change the table in any way, then you are going to have to make a compromise in some way. It will not be possible to get the simple syntax you want that will also perform well, when your dates are stored split into separate columns.
(Year > @FromYear OR Year = @FromYear AND Month >= @FromMonth)
AND (Year < @ToYear OR Year = @ToYear AND Month <= @ToMonth)
Your example table seems to indicate that there's only one record per year and month (if it's really a summary-by-month table). If that's so, you're likely to accrue very little data in the table even over several decades of activity. The concatenated expression solution will work and performance (in this case) won't be an issue:
SELECT * FROM Table WHERE ((Year * 100) + Month) BETWEEN 201111 AND 201202
If that's not the case and you really have a large number of records in the table (more than a few thousand records), you have a couple of choices:
Change your table to store the year and month in the format YYYYMM (either as an integer value or as text). This column can replace your current year and month columns or be in addition to them (although this breaks normal form). Index this column and query against it.
Create a separate table with one record per year and month and also the indexable column as described above. In your query, JOIN this table back to the source table and perform your query against the indexed column in the smaller table.
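A sketch of that second option (table, constraint, and column names are illustrative):
CREATE TABLE dbo.YearMonth (
    [Year]    int NOT NULL,
    [Month]   int NOT NULL,
    YearMonth int NOT NULL,              -- YYYYMM, e.g. 201111
    CONSTRAINT PK_YearMonth PRIMARY KEY ([Year], [Month])
);
CREATE INDEX IX_YearMonth_YearMonth ON dbo.YearMonth (YearMonth);
-- Filter on the small indexed table, then join back to the source table
SELECT T.*
FROM dbo.YourTable AS T
JOIN dbo.YearMonth AS YM
    ON YM.[Year] = T.[Year] AND YM.[Month] = T.[Month]
WHERE YM.YearMonth BETWEEN 201111 AND 201202;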
I have a table that loads employee records weekly on Monday. The load date is stored on the record. I need to sum the total changed (add/update) records from one week to the next.
This is what I have so far. It splits new record and updated record counts for the latest load date compared to the previous load date.
I'm not sure if this is a good way to do this and I'd really appreciate any feedback I could get about my method, or advice on a better way to accomplish my goal.
Thanks.
SELECT
RIGHT(CONVERT(VARCHAR(10), REPORT_DATE, 103), 7) AS REPORT_DATE,
[NEW],
[UPDATED]
FROM
(
SELECT
CUR.LOAD_DATE AS REPORT_DATE,
CASE
WHEN PRV.LOAD_DATE IS NULL THEN 'NEW'
ELSE 'UPDATED'
END AS RECORD_TYPE,
COUNT(*) AS RECORD_COUNT
FROM
(SELECT *
FROM EMPLOYEES
WHERE LOAD_DATE = (SELECT MAX(LOAD_DATE) FROM EMPLOYEES)) CUR
LEFT OUTER JOIN
(SELECT *
FROM EMPLOYEES
WHERE LOAD_DATE = (SELECT DATEADD(WEEK,-1,MAX(LOAD_DATE)) FROM EMPLOYEES))PRV
ON
CUR.EMPLOYEE_ID = PRV.EMPLOYEE_ID
WHERE
PRV.EMPLOYEE_ID IS NULL
OR (CUR.FIRST_NAME != PRV.FIRST_NAME
OR CUR.LAST_NAME != PRV.LAST_NAME
OR CUR.ADDRESS1 != PRV.ADDRESS1
OR CUR.ADDRESS2 != PRV.ADDRESS2
OR CUR.CITY != PRV.CITY
OR CUR.STATE != PRV.STATE
OR CUR.ZIP != PRV.ZIP
OR CUR.POSITION != PRV.POSITION
OR CUR.LOCATION != PRV.LOCATION)
GROUP BY
CUR.LOAD_DATE,
PRV.LOAD_DATE
) DT
PIVOT
(SUM(RECORD_COUNT) FOR RECORD_TYPE IN ([NEW], [UPDATED])) PV;
I have a couple of suggestions that could simplify your code and even improve the performance of the query.
Since you are looking for the last date the employee data was loaded, consider adding a table that logs the loading process, including the time of each load. That would improve performance, and you wouldn't have to run "SELECT MAX(LOAD_DATE) FROM ..." twice.
You could also add a column that records each record's updated time; then, when looking for changed records, you only have to compare the record's updated time with the load time. An UPDATE trigger on this table is a good way to maintain the updated time.
The point of both suggestions is to avoid joining the table to itself and touching the data pages. Since your report only retrieves counts, you don't need all of the information in the EMPLOYEES table.
First, the code more clearly matches your intention of summing the total changed records. Second, the database only needs an index to COUNT your metric (a proper index on LOAD_DATE, of course), so performance should be better than the self-join method.
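A hedged sketch of those two suggestions (all new object names are illustrative):
-- A small log table: one row per weekly load, so MAX(LOAD_DATE) no longer scans EMPLOYEES
CREATE TABLE dbo.LOAD_LOG (
    LOAD_DATE date NOT NULL PRIMARY KEY
);
GO
-- Track when each employee row last changed
ALTER TABLE EMPLOYEES ADD UPDATED_TIME datetime2 NULL;
GO
-- Maintain UPDATED_TIME automatically
CREATE TRIGGER TRG_EMPLOYEES_UPDATED
ON EMPLOYEES
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Stamp the changed rows with the time of the update
    UPDATE E
    SET UPDATED_TIME = SYSDATETIME()
    FROM EMPLOYEES AS E
    INNER JOIN inserted AS I ON I.EMPLOYEE_ID = E.EMPLOYEE_ID;
END;
With something like this in place, the weekly report only needs to count rows for the latest LOAD_DATE from the log table instead of joining EMPLOYEES to itself.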
There are many ways to generate a report in SQL. Because SQL can be a hard-to-read language, concise writing matters for maintenance. And because tracking down performance problems in SQL is tough, writing efficient SQL up front is worth more than rewriting it afterwards.
In my experience, "decent SQL" is about:
Acceptable performance under the workloads you can plausibly anticipate.
Making the code as readable as possible without sacrificing that performance.
Forgive me for repeating my point: if you have a complex SQL statement with poor performance, you take on more risk when you later have to modify it for the sake of improving performance.