I have a table records with four fields:
id - the row id
value - the row value
source - the source of the value
timestamp - the time when the row was inserted (should this be a unix timestamp or a datetime?)
And I want to perform a query like this:
SELECT timestamp, value FROM records WHERE timestamp >= a AND timestamp <= b
However in a table with millions of records this query is super inefficient!
I am using Azure SQL Server as the DBMS. Can this be optimised?
If so can you provide a step-by-step guide to do it (please don't skip "small" steps)? Be it creating indexes, redesigning the query statement, redesigning the table (partitioning?)...
Thanks!
After creating an index on the column you search on, you can write the range with the BETWEEN operator. It is shorthand for the >= / <= pair, so both forms can use the index equally well.
SELECT XXX FROM ABC WHERE DateField BETWEEN '1/1/2015' AND '12/31/2015'
Also, in SQL Server 2016 you can create range indexes for use on things like time-stamps using memory optimized tables. That's really the way to do it.
I would recommend using the datetime, or even better the datetime2 data type to store the date data (datetime2 being better as it has a higher level of precision, and with lower precision levels will use less storage).
As for your query, based upon the statement you posted you would want the timestamp to be the key column, and then include the value. This is because you are using the timestamp as your predicate, and returning the value along with it.
CREATE NONCLUSTERED INDEX IX_Records_Timestamp on Records (Timestamp) INCLUDE (Value)
This being said, be careful with your column names. I would highly recommend not using reserved keywords as column names, as they can be a lot more difficult to work with.
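One way to check that an index like the one above really covers the query is to look at the query plan. The following is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for Azure SQL, and because SQLite has no INCLUDE clause, value is appended as a trailing key column instead. The index name ix_records_timestamp and the ISO-8601 text timestamps are illustrative choices, not something from the question.

```python
import sqlite3

def range_query_plan():
    """Return SQLite's plan text for the timestamp range query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records ("
        "id INTEGER PRIMARY KEY, value REAL, source TEXT, timestamp TEXT)"
    )
    # SQLite has no INCLUDE clause, so value is a trailing key column here.
    conn.execute(
        "CREATE INDEX ix_records_timestamp ON records (timestamp, value)"
    )
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT timestamp, value FROM records "
        "WHERE timestamp >= ? AND timestamp <= ?",
        ("2015-01-01", "2015-12-31"),
    ).fetchall()
    conn.close()
    # The fourth column of each plan row is the human-readable detail.
    return [row[3] for row in rows]

plan = range_query_plan()
print(plan)
```

If the plan mentions a covering index search, the query is answered from the index alone. On SQL Server the equivalent check would be the actual execution plan, which should show an index seek without a key lookup.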
We have a 6B row table that is giving us challenges when retrieving data.
Our query returns values instantly when doing a...
SELECT * WHERE Event_Code = 102225120
That type of instant result is exactly what we need. We now want to filter to receive values for just a particular year - but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row. There are other approaches like using TRUNC, or BETWEEN, or specifying the datetime in YYYY-MM-DD format for doing comparisons.
Of note, we do not have the option to add indexes to the database as it is a vendor's database.
What is the way to add a date filtering query and enable Oracle to begin streaming the results back in the fastest way possible?
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row
That question is quite different from yours. Firstly, your statement above applies to any data type, not only dates. Also, the word many is relative to the number of records in the table. If the optimizer estimates that the query will return a large share of all records in your table, it may decide that a full scan of the table is faster than using the index. In your situation, this translates to: how many records are in 2017 out of all records in the table? That ratio is the selectivity of your query, which gives you an idea of whether an index will be faster or not.
Now, if you decide that an index will be faster, based on the above, the next step is to know how to build your index. In order for the optimizer to use the index, it must match the condition that you're using. You are not comparing dates in your query, you are only comparing the year part. So an index on the date column will not be used by this query. You need to create an index on the year part, so use the same condition to create the index.
we do not have the option to add indexes to the database as it is a vendor's database.
If you cannot modify the database, there is no way to optimize your query. You need to talk to the vendor and get access to modify the database or ask them to add the index for you.
A function applied to the column can also cause slowness for the number of records involved. Not sure if a function-based index can help you here, but you can try.
Have you tried adding a year column to the table? If not, try adding a year column and populate it using the code below.
UPDATE table
SET year = EXTRACT(YEAR FROM PERFORMED_DATE_TIME);
This will take time though.
But after this, you can run the query below.
SELECT *
FROM table
WHERE Event_Code = 102225120 AND year = 2017;
Also, consider partitioning a table this large. For a starting point, see:
https://oracle-base.com/articles/8i/partitioned-tables-and-indexes
Your question is a bit ambiguous IMHO:
but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Do you mean that
SELECT * WHERE Event_Code = 102225120
is fast, but
SELECT * WHERE Event_Code = 102225120 AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
is slow???
For starters I'll agree with Mitch Wheat that you should try to use PERFORMED_DATE_TIME between Jan 1, 2017 and Dec 31, 2017 instead of Year(field) = 2017. Even if you had an index on the field, the latter would hardly be able to make use of it, while the former would benefit enormously.
I'm also hoping you want to be more specific than just 'give me all of 2017' because returning over 1B rows is NEVER going to be fast.
Next, if you can't make changes to the database, would you be able to maintain a 'shadow' in another database? This would require that you create a table with all date-values AND the PK of the original table in another database and query those to find the relevant PK values and then JOIN those back to your original table to find whatever you need. The biggest problem with this would be that you need to keep the shadow in sync with the original table. If you know the original table only changes overnight, you could merge the changes in the morning and query all day. If the application is 'real-time(ish)' then this probably won't work without some clever thinking... And yes, your initial load of 6B values will be rather heavy =)
Maybe this could be useful (it avoids functions on the column, which are a cause of context switching, and if you have an index on your date field, it could be used):
with
dt as
(
select
to_date('01/01/2017', 'DD/MM/YYYY') as d1,
to_date('31/01/2017', 'DD/MM/YYYY') as d2
from dual
),
dates as
(
select
dt.d1 + rownum -1 as d
from dt
connect by dt.d1 + rownum -1 <= dt.d2
)
select *
from your_table, dates
where dates.d = PERFORMED_DATE_TIME
Move the date literal to RHS:
AND PERFORMED_DATE_TIME >= date '2017-01-01'
AND PERFORMED_DATE_TIME < date '2018-01-01'
But without an (undisclosed) appropriate index on PERFORMED_DATE_TIME, the query is unlikely to be any faster.
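To see the difference between wrapping the column in a function and comparing it against a half-open range, the plans can be put side by side. This is a sketch only, and it assumes things the question does not have: SQLite (via Python's sqlite3) instead of Oracle, a hypothetical events table, and an index named ix_events_performed on the date column. strftime plays the role of EXTRACT(YEAR FROM ...) here.

```python
import sqlite3

def plan_detail(conn, sql):
    """Return the first detail line of SQLite's query plan for sql."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "event_code INTEGER, performed_date_time TEXT, payload TEXT)"
)
conn.execute(
    "CREATE INDEX ix_events_performed ON events (performed_date_time)"
)

# Function applied to the column: the index on the raw column cannot be used.
wrapped_plan = plan_detail(
    conn,
    "SELECT * FROM events "
    "WHERE strftime('%Y', performed_date_time) = '2017'",
)

# Bare column compared with constants: the index can be searched.
range_plan = plan_detail(
    conn,
    "SELECT * FROM events "
    "WHERE performed_date_time >= '2017-01-01' "
    "AND performed_date_time < '2018-01-01'",
)

print(wrapped_plan)
print(range_plan)
```

The first plan is a full scan and the second is an index search, which is the same effect the answers above describe for Oracle when an index on the date column exists.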
One option to create indexes in third party databases is to script in the index and then before any vendor upgrade run a script to remove any indexes you've added. If the index is important, ask the vendor to add it to their database design.
I have a SQLServer database with tables containing temperature measurements.
The tables have columns for MeasurementId(prim key), SensorId, Timestamp and Value.
We now have enough measurements that our queries are starting to get a little bit slow and I'm trying to improve this.
The Timestamp values are not necessarily in order, but for each SensorId they are ordered.
My question is: Is there any way I can use this knowledge to improve the performance of a query like SELECT * FROM MeasurementTable WHERE SensorId=xx AND Timestamp>yy ?
I.e., can I hint to SQL Server that once you've narrowed your results to a unique SensorId, the rows are guaranteed to be ordered by timestamp?
For your query, you simply want a composite index:
create index idx_measurementtable_sensorid_timestamp
on MeasurementTable(SensorId, Timestamp);
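The reason this works is that the composite index already stores the rows sorted by SensorId and, within each sensor, by Timestamp, so no hint is needed. As a rough illustration (assuming SQLite through Python's sqlite3 as a stand-in for SQL Server, with made-up sample rows), the plan shows both predicates being resolved inside the index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MeasurementTable ("
    "MeasurementId INTEGER PRIMARY KEY, SensorId INTEGER, "
    "Timestamp TEXT, Value REAL)"
)
conn.execute(
    "CREATE INDEX idx_measurementtable_sensorid_timestamp "
    "ON MeasurementTable (SensorId, Timestamp)"
)
# Made-up sample data: two sensors, timestamps ordered per sensor.
conn.executemany(
    "INSERT INTO MeasurementTable (SensorId, Timestamp, Value) "
    "VALUES (?, ?, ?)",
    [
        (1, "2020-01-01 00:00:00", 20.5),
        (2, "2020-01-01 00:00:00", 18.0),
        (1, "2020-01-02 00:00:00", 21.0),
    ],
)

query = (
    "SELECT SensorId, Timestamp, Value FROM MeasurementTable "
    "WHERE SensorId = ? AND Timestamp > ?"
)
params = (1, "2020-01-01 00:00:00")

# Equality on the leading column plus a range on the second column.
detail = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchone()[3]
rows = conn.execute(query + " ORDER BY Timestamp", params).fetchall()

print(detail)
print(rows)
```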
I have a HISTORY table with 9 million records. I need to find the number of records created per year and per month. I was using query no. 1; however, it timed out several times.
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
GROUP BY
year(created), MONTHNAME(created);
I decided to add WHERE year(created); this time the query took 30 minutes (yes, it takes that long) to execute.
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
WHERE
year(created) = 2010
GROUP BY
year(created), MONTHNAME(created) ;
I was planning to add an index on the created timestamp column; however, before doing so, I would like an opinion (since it's going to take a long time to index such a huge table).
Will adding an index on created(timestamp) column improve performance, considering year function is used on the column?
An index won't really help because you have formed the query such that it must perform a complete table scan, index or no index. You have to form the where clause so it is in the form:
where field op constant
where field is, of course, your field; op is =, <=, >=, <>, between, in, etc.; and constant is either a direct constant, 42, or an operation that can be executed once and the result cached, getdate().
Like this:
where created >= DateFromParts( @year, 1, 1 )
and created < DateFromParts( @year + 1, 1, 1 )
The DateFromParts function will generate a value which remains in effect for the duration of the query. If created is indexed, now the optimizer will be able to seek to exactly where the correct dates start and tell when the last date in the range has been processed and it can stop. You can keep year(created) everywhere else -- just get rid of it from the where clause.
This is called sargability and you can google all kinds of good information on it.
P.S. This is in Sql Server format but you should be able to calculate "beginning of specified year" and "beginning of year after specified year" in whatever DBMS you're using.
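As a sketch of the same idea outside SQL Server, the two year boundaries can be computed once in application code and passed in as parameters, so the WHERE clause compares the bare column against constants. The example assumes SQLite through Python's sqlite3, ISO-8601 text timestamps, and a few made-up rows; year_bounds and monthly_counts are hypothetical helper names.

```python
import sqlite3
from datetime import date

def year_bounds(year):
    """Return (start of year, start of next year) as ISO date strings."""
    return date(year, 1, 1).isoformat(), date(year + 1, 1, 1).isoformat()

def monthly_counts(conn, year):
    """Count rows per month of one year using a half-open range filter."""
    start, end = year_bounds(year)
    return conn.execute(
        "SELECT strftime('%m', created) AS month, COUNT(*) "
        "FROM history "
        "WHERE created >= ? AND created < ? "  # bare column, constant bounds
        "GROUP BY strftime('%m', created) "
        "ORDER BY month",
        (start, end),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (id INTEGER PRIMARY KEY, created TEXT)")
conn.execute("CREATE INDEX ix_history_created ON history (created)")
conn.executemany(
    "INSERT INTO history (created) VALUES (?)",
    [
        ("2009-12-31 23:59:59",),
        ("2010-01-01 00:00:00",),
        ("2010-01-15 08:30:00",),
        ("2010-12-31 23:59:59",),
        ("2011-01-01 00:00:00",),
    ],
)

print(monthly_counts(conn, 2010))
```

The function call stays in the SELECT and GROUP BY, where it only touches rows that already passed the range filter.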
An index will be used, when it helps narrow down the number of rows read.
It will also be used, when it avoids reading the table at all. This is the case, when the index contains all the columns referenced in the query.
In your case the only column referenced is created, so adding an index on this column should help reducing the necessary reads and improve the overall runtime of your query. However, if created is the only column in the table, the index won't change anything in the first query, because it doesn't reduce the number of pages to be read.
Even with a large table, you can test, if an index makes a difference. You can copy only part of the rows to a new table and compare the execution plans on the new table with and without an index, e.g.
insert into testhistory
select *
from history
fetch first 100000 rows only
You want what's known as a Calendar Table (the particular example uses SQL Server, but the solution should be adaptable). Then, you want lots of indices on it (since writes are few, and this is a primary dimension table for analysis).
Assuming you have a minimum Calendar Table that looks like this:
CREATE TABLE Calendar (isoDate DATE,
dayOfMonth INTEGER,
month INTEGER,
year INTEGER);
... with an index over [dayOfMonth, month, year, isoDate], your query can be re-written like this:
SELECT Calendar.year, Calendar.month,
COUNT(*) AS ymCount
FROM Calendar
JOIN History
ON History.created >= Calendar.isoDate
AND History.created < Calendar.isoDate + 1 MONTH
WHERE Calendar.dayOfMonth = 1
GROUP BY Calendar.year, Calendar.month
The WHERE Calendar.dayOfMonth = 1 automatically limits the results to 12 per year. The start of the range is trivially located with the index (given the SARGable data), and so is the end of the range. (Yes, doing math on a column generally disqualifies indices, but only on the side where the math is used. If the optimizer is at all smart, it is going to generate a virtual intermediate table containing the start and end of each range.)
So, index-based (and likely index-only) access for the query. Learn to love indexed dimension tables, that can be used for range queries (Calendar Tables being one of the most useful).
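A small runnable version of the Calendar Table join can make the idea concrete. This is a sketch under assumptions: it uses SQLite through Python's sqlite3, so the "+ 1 MONTH" arithmetic becomes date(isoDate, '+1 month'), dates are stored as ISO-8601 text, and the History rows are invented for the demonstration.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Calendar ("
    "isoDate TEXT PRIMARY KEY, dayOfMonth INTEGER, "
    "month INTEGER, year INTEGER)"
)
conn.execute("CREATE TABLE History (id INTEGER PRIMARY KEY, created TEXT)")
conn.execute("CREATE INDEX ix_history_created ON History (created)")

# Fill the calendar with one row per day of 2010.
day = date(2010, 1, 1)
while day < date(2011, 1, 1):
    conn.execute(
        "INSERT INTO Calendar VALUES (?, ?, ?, ?)",
        (day.isoformat(), day.day, day.month, day.year),
    )
    day += timedelta(days=1)

# Invented sample rows.
conn.executemany(
    "INSERT INTO History (created) VALUES (?)",
    [
        ("2010-01-31 23:59:59",),
        ("2010-02-01 00:00:00",),
        ("2010-02-14 12:00:00",),
        ("2010-07-04 09:00:00",),
    ],
)

# Each first-of-month row defines a half-open range [isoDate, isoDate + 1 month).
counts = conn.execute(
    "SELECT Calendar.year, Calendar.month, COUNT(*) AS ymCount "
    "FROM Calendar "
    "JOIN History "
    "ON History.created >= Calendar.isoDate "
    "AND History.created < date(Calendar.isoDate, '+1 month') "
    "WHERE Calendar.dayOfMonth = 1 "
    "GROUP BY Calendar.year, Calendar.month "
    "ORDER BY Calendar.year, Calendar.month"
).fetchall()

print(counts)
```

Because this is an inner join, months without any History rows do not appear in the result; a LEFT JOIN from Calendar would be needed to report them as zero.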
I'll assume you are using SQL Server based on your tags.
Yes, the index will make your query faster.
I recommend only using the 'created' column as a key for the index and to not include any additional columns from the History table because they will be unused and only result in more reads than what is necessary.
And of course, be mindful when you create indexes on tables that have a lot of INSERT, UPDATE, DELETE activity as your new index will make these actions more expensive when being performed on the table.
As has been stated before, in your case an index won't be used, because the index is created on the column 'created' and you are querying on 'year(created)'.
What you can do is add two generated columns, year_gen = year(created) and month_gen = MONTHNAME(created), to your table and index these two columns. The DB2 Query Optimizer will automatically use these two generated columns, and it will also use the indices created on them.
The code should be something like (but not 100% sure since I have no DB2 to test)
SET INTEGRITY FOR HISTORY OFF CASCADE DEFERRED #
ALTER TABLE HISTORY ADD COLUMN YEAR_GEN SMALLINT GENERATED ALWAYS AS (YEAR(CREATED)),
ADD COLUMN MONTH_GEN VARCHAR(20) GENERATED ALWAYS AS (MONTHNAME(CREATED)) #
SET INTEGRITY FOR HISTORY IMMEDIATE CHECKED FORCE GENERATED #
CREATE INDEX HISTORY_YEAR_IDX ON HISTORY (YEAR_GEN ASC) CLUSTER #
CREATE INDEX HISTORY_MONTH_IDX ON HISTORY (MONTH_GEN ASC) #
Just a side note: setting integrity off is mandatory to add generated columns. Your table is inaccessible until you reset the integrity to checked and force the recalculation of the generated columns (this might take a while in your case).
Setting integrity off without cascade deferred will set every table with a foreign key to the HISTORY table to OFF too. You will have to manually reset the integrity of these tables as well. If I remember correctly, using cascade deferred in combination with incoming foreign keys may cause DB2 to set the integrity of your table to 'checked by user'.
I'm relatively new to SQL and I have a statement which takes forever to run.
SELECT
sum(a.amountcur)
FROM
custtrans a
WHERE
a.transdate <= '2013-12-31';
It's a large table but the statement takes about 6 minutes!
Any ideas why?
Your select, as you posted it, will read 99% of the whole table (2013-12-31 is just a week ago, and I assume most entries are before that date and only very few after). If your table has many large columns (like varchar2(4000)), all that data will be read as well when Oracle scans the table. So you might read several KB for each row just to get the 30 bytes you need for amountcur and transdate.
If this is your scenario, create a combined index on transdate and amountcur:
CREATE INDEX myindex ON custtrans(transdate, amountcur)
With the combined index, Oracle can read the index to fulfill your query and doesn't have to touch the main table at all, which might result in considerably less data that needs to be read from disk.
Make sure the table has an index on transdate.
create index custtrans_idx on custtrans (transdate);
Also if this field is defined as a date in the table then do
SELECT sum(a.amountcur)
FROM custtrans a
WHERE a.transdate <= to_date('2013-12-31', 'yyyy-mm-dd');
If the table is really large, the query has to scan every row with transdate below given.
Even if you have an index on transdate and it helps to stop the scan early (which it may not), when the number of matching rows is very high, it would take considerable time to scan them all and sum the values.
To speed things up, you could calculate partial sums, e.g. for each past month, assuming that your data is historical and the past does not change. Then you'd need to scan custtrans for only the last 1-2 months, quickly scan the table with monthly sums, and add the results.
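The partial-sum idea can be sketched as follows. Everything beyond custtrans, transdate and amountcur is an assumption made for illustration: the summary table custtrans_monthly, the helper total_up_to, the sample rows, and the use of SQLite through Python's sqlite3 instead of Oracle with ISO-8601 text dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE custtrans (transdate TEXT, amountcur REAL)")
conn.executemany(
    "INSERT INTO custtrans VALUES (?, ?)",
    [
        ("2013-10-05", 100.0),
        ("2013-10-20", 50.0),
        ("2013-11-11", 25.0),
        ("2013-12-02", 10.0),
        ("2013-12-31", 5.0),
        ("2014-01-03", 1.0),
    ],
)

# Hypothetical summary table: one pre-computed total per closed month.
conn.execute(
    "CREATE TABLE custtrans_monthly (month TEXT PRIMARY KEY, total REAL)"
)
conn.execute(
    "INSERT INTO custtrans_monthly "
    "SELECT substr(transdate, 1, 7), SUM(amountcur) "
    "FROM custtrans WHERE transdate < '2013-12-01' "
    "GROUP BY substr(transdate, 1, 7)"
)

def total_up_to(conn, cutoff, open_month_start):
    """Sum closed months from the summary, then add the still-open period."""
    closed = conn.execute(
        "SELECT COALESCE(SUM(total), 0) FROM custtrans_monthly "
        "WHERE month < ?",
        (open_month_start[:7],),
    ).fetchone()[0]
    recent = conn.execute(
        "SELECT COALESCE(SUM(amountcur), 0) FROM custtrans "
        "WHERE transdate >= ? AND transdate <= ?",
        (open_month_start, cutoff),
    ).fetchone()[0]
    return closed + recent

total = total_up_to(conn, "2013-12-31", "2013-12-01")
print(total)
```

Only the rows of the open month are read from the large table; the rest comes from a handful of summary rows. The approach is only valid as long as closed months really are immutable, or the summary is refreshed when they change.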
Try to create an index only on column amountcur:
CREATE INDEX myindex ON custtrans(amountcur)
In this case Oracle will read most probably only the Index (Index Full Scan), nothing else.
Correction, as mentioned in comment. It must be a composite Index:
CREATE INDEX myindex ON custtrans(transdate, amountcur)
But maybe it is a bit useless to create an index just for a single select statement.
One option is to create an index on the column used in the where clause (this is useful if you want to retrieve only 10-15% rows by using indexed column).
Another option is to partition your table if it has millions of rows. In this case too, if you try to retrieve 70-80% of the data, it won't help.
The best option is first to analyze your requirements and then make a choice.
Whenever you deal with date functions it's better to use to_date() function. Do not rely on implicit data type conversion.
I have a huge database of more than 3 million rows (my users information), I need to select all users that have birthdays in the current day.
The birthday column is a text (e.g. '19/03' or '19/03/1975') with day and month and sometimes the years.
When I try to select rows with the LIKE or LEFT functions, it takes more than a minute to return the results.
I've tried to use 3 int columns for day, month and year and then make the selection, but it took longer to get the results.
Any idea on how to make it run faster?
I'm using SQL Server 2008
Thanks
As marc_s mentions, if at all possible, store this as a date type - it'll make it way faster for SQL Server to perform comparisons on, and it'll be way easier to maintain. Next up, make sure to put an index on that column, and consider including any extra columns if you're only looking up the birthday to select a small subset of the total row.
Finally - and this is a big one. TEXT is just about the worst data type you could choose. The way TEXT is stored, the data isn't actually stored on the page itself. Instead it leaves behind a 16-byte pointer to another page. This other page will then contain the data itself in a record. But it gets worse, that record will be a SMALL_ROOT datatype taking up 84 bytes of space when your data is between 0 and 64 bytes in length!
Thus, what could've been saved as an 8-byte datetime or a 4-byte date now takes up a total of 100 bytes, and causes an off-row lookup for each and every row. Basically the perfect storm for bad performance.
If you cannot change it to a more proper datetime, at the very least, change it to a varchar!
First of all, save the date in a type that SQL Server supports natively, such as DATE or DATETIME (in your case I am guessing DATE should be enough). Once you have that, you can use SQL functions like MONTH and DAY as follows and avoid complex string manipulation functions like LEFT.
Your query will look like this:
select * from MyTable where MONTH(dateColumnA) = 1 AND DAY(dateColumnA) = 7 -- 1 is for January
I am not sure if this will solve your performance problems entirely, but you can run this query in SQL Query Analyzer and see what recommendations it makes with respect to indexes etc. I don't have a great deal of knowledge about indexes on date columns.
Most of what I had to say has already been said: Use a DATE type to store the date, and make sure that it is indexed. If you're going to use the three integers to store the date and search by that, then make sure that they're indexed as well:
CREATE INDEX IX_MyTable_Date_Ints ON MyTable(intYear, intMonth, intDay)
CREATE INDEX IX_MyTable_Date ON MyTable(BirthDate)
If you want to be able to search the user table for birthdays excluding the year, I would recommend storing the birthday in a different date field, using a fixed year, e.g. 3004, instead of using three integers. Your base year should be a leap year, to cater for anyone who may have been born on 29 February. If you use a year far in the future, you can use the year to determine that a date is effectively a date for which the year should be disregarded.
Then you can search for the birthday, regardless of the year, without having to do a function call on each record, by adding "WHERE birth_day = '3004-12-10'". If this field is indexed, you should be able to return all matching rows in a flash. You need to bear in mind that when searching an index, the server will need to do a maximum of 32 comparisons to find a match in 4 billion records. Never underestimate the benefits of indexing!
I would be inclined to maintain the birthday through a trigger, so that it keeps itself updated. For those birth dates where you don't have the year, just use your base year (3004). Since your base year is in the future, you know that this birth date doesn't have a year.
CREATE TABLE MyTable (
MyTable_key INT IDENTITY(1, 1),
username VARCHAR(30),
birth_date DATE,
birth_day DATE
)
ALTER TABLE MyTable ADD CONSTRAINT PK_MyTable PRIMARY KEY CLUSTERED (MyTable_key)
CREATE INDEX MyTable_birth_date ON MyTable(birth_date)
CREATE INDEX MyTable_birth_day ON MyTable(birth_day)
GO
CREATE TRIGGER tr_MyTable_calc_birth_day ON MyTable AFTER INSERT, UPDATE AS
UPDATE t SET birth_day = DATEADD(YEAR, 3004-DATEPART(YEAR, t.birth_date), t.birth_date)
FROM MyTable t, inserted i WHERE i.MyTable_key = t.MyTable_key
To update your existing table, run the update as a standalone query, without the join to the inserted table as it was used in the trigger:
UPDATE MyTable SET birth_day = DATEADD(YEAR, 3004-DATEPART(YEAR, birth_date), birth_date)
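If the existing text values ('19/03' or '19/03/1975') have to be converted before loading them into the two date columns, the same base-year rule can be applied in application code. This is a minimal sketch in Python; parse_birthday and birthday_key are hypothetical helper names, and the input is assumed to be well-formed DD/MM or DD/MM/YYYY text.

```python
from datetime import date

BASE_YEAR = 3004  # leap year far in the future, as suggested above

def birthday_key(day, month):
    """Return the year-independent birthday as a date in the base year."""
    return date(BASE_YEAR, month, day)

def parse_birthday(text):
    """Parse 'DD/MM' or 'DD/MM/YYYY' into (birth_date, birth_day).

    When the year is missing, birth_date falls back to the base year,
    which marks it as "year unknown".
    """
    parts = [int(p) for p in text.split("/")]
    day, month = parts[0], parts[1]
    year = parts[2] if len(parts) == 3 else BASE_YEAR
    return date(year, month, day), birthday_key(day, month)

print(parse_birthday("19/03"))
print(parse_birthday("19/03/1975"))
print(parse_birthday("29/02/1976"))
```

Because 3004 is a leap year, 29 February maps to a valid birth_day value.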
Hope this helps.
Try to use Result Set instead of DataTable or DataSet. ResultSet is fast when compared to both of these