how to create an index for a non-deterministic function - sql

I have a table with Date-Of-Birth column. I have defined a function, say FIND_AGE, which takes it as input and returns age (it uses system date in calculations).
I want to optimize a query which returns all records having a certain age, say 30. I understand that we can't use non-deterministic functions (like FIND_AGE) while creating indexes.
Is there still a way I can create an index to optimize the query to fetch all records having age 30?

If you have a performance issue, I would advise sharing the whole query. Generally, storing a date and having an index on it is enough to find particular records based on it.
For example, getting users born exactly 30 years before the current date:
SELECT *
FROM my_table
WHERE dob = DATEADD(year, -30, GETDATE())
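Note that dob = DATEADD(year, -30, GETDATE()) only matches people born exactly 30 years ago today. To fetch everyone whose age is currently 30, you need a one-year range on the date of birth; the predicate is still sargable, so an index on dob can be used. A rough sketch, assuming the same column names:
SELECT *
FROM my_table
WHERE dob > DATEADD(year, -31, CAST(GETDATE() AS date))
AND dob <= DATEADD(year, -30, CAST(GETDATE() AS date))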
If you have billions of records, which is unusual for user data, I can accept that this is the cause of your performance issue.
If not, it will be better to check how the data from this table is read. You may currently have an index on this column which is ignored by the engine because it is not covering. For example, you are also reading the first and last names of the users. So, your index can be:
CREATE INDEX INX_my_table_DOB_I_FirstName_LastName ON my_table
(
DOB
)
INCLUDE (FirstName, LastName);
Or perhaps you are also filtering by country code, so the index will be:
CREATE INDEX INX_my_table_DOB_CountryCode_I_FirstName_LastName ON my_table
(
DOB
,CountryCode
)
INCLUDE (FirstName, LastName);
If your users table has many columns, or large columns holding text, XML, BLOBs, etc., scanning the table instead of using the index can be the root of your issues.

If your table has an "updatedDate" column, you could take the opportunity to maintain an indexed "ageAtUpdatedDate" column at low cost, at the same time you update the "updatedDate" column. The people having age X now are then obviously among the ones having "ageAtUpdatedDate <= X", so you can reduce the size of the data set of people to test for "age = X". However, the reduction of the data set size will depend on X relative to the histogram of ages in your population, so the improvement will be "random"...
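A rough sketch of that idea, assuming SQL Server syntax and the column and table names used above (dob, updatedDate and ageAtUpdatedDate on my_table; @id stands for whatever key your existing update already uses):
UPDATE my_table
SET updatedDate = GETDATE(),
    ageAtUpdatedDate = DATEDIFF(year, dob, GETDATE())
        - CASE WHEN DATEADD(year, DATEDIFF(year, dob, GETDATE()), dob) > GETDATE()
               THEN 1 ELSE 0 END -- subtract 1 if the birthday hasn't occurred yet this year
WHERE id = @id;

CREATE INDEX IX_my_table_ageAtUpdatedDate ON my_table (ageAtUpdatedDate);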

Related

Select random rows according to a given criteria PostgreSQL

I have a table user with ten million rows. It has the fields: id int4 primary key, rating int4, country varchar(32), last_active timestamp. It has gaps in the identifiers.
The task is to select five random users for a given country who were active within the last two days and have a rating in a given range.
Is there a tricky way to select them faster than the query below?
SELECT id
FROM user
WHERE last_active > '2020-04-07'
AND rating between 200 AND 280
AND country = 'US'
ORDER BY random()
LIMIT 5
I thought about this query:
SELECT id
FROM user
WHERE last_active > '2020-04-07'
AND rating between 200 AND 280
AND country = 'US'
AND id > (SELECT random()*max(id) FROM user)
ORDER BY id ASC
LIMIT 5
but the problem is that there are lots of inactive users with small identifier values, and the majority of new users are at the end of the id range. So, this query would select some users too often.
Based on the EXPLAIN plan, your table is large. About 2 rows per page. Either it is very bloated, or the rows themselves are very wide.
The key to getting good performance is probably to get it to use an index-only scan, by creating an index which contains all 4 columns referenced in your query. The column tested for equality should come first. After that, you have to choose between your two range-or-inequality queried columns ("last_active" or "rating"), based on whichever you think will be more selective. Then you add the other range-or-inequality and the id column to the end, so that an index-only scan can be used. So maybe create index on app_user (country, last_active, rating, id). That will probably be good enough.
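For reference, a sketch of that suggested index as a statement (the app_user name follows the suggestion above; adjust it if your table is really named user):
CREATE INDEX app_user_country_active_rating_id_idx
ON app_user (country, last_active, rating, id);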
You could also try a GiST index on those same columns. This has the theoretical advantage that the two range-or-inequality restrictions can be used together in defining which index pages to look at. But in practice GiST indexes have very high overhead, and this overhead would likely exceed the theoretical benefit.
If the above aren't good enough, you could try partitioning. But how exactly you do that should be based on a holistic view of your application, not just one query.

Creating index on timestamp column for query which uses year function

I have a HISTORY table with 9 million records. I need to find the counts of records created per year and month. I was using the first query below; however, it timed out several times.
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
GROUP BY
year(created), MONTHNAME(created);
I decided to add a WHERE year(created) filter; this time the query took 30 minutes (yes, it takes that long) to execute.
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
WHERE
year(created) = 2010
GROUP BY
year(created), MONTHNAME(created) ;
I was planning to add an index on the created timestamp column; however, before doing so, I'd like an opinion (since it's going to take a long time to index such a huge table).
Will adding an index on the created (timestamp) column improve performance, considering the year function is used on the column?
An index won't really help because you have formed the query such that it must perform a complete table scan, index or no index. You have to form the where clause so it is in the form:
where field op constant
where field is, of course, your field; op is =, <=, >=, <>, BETWEEN, IN, etc.; and constant is either a direct constant, such as 42, or an operation that can be executed once and the result cached, such as getdate().
Like this:
where created >= DateFromParts( @year, 1, 1 )
and created < DateFromParts( @year + 1, 1, 1 )
The DateFromParts function will generate a value which remains in effect for the duration of the query. If created is indexed, now the optimizer will be able to seek to exactly where the correct dates start and tell when the last date in the range has been processed and it can stop. You can keep year(created) everywhere else -- just get rid of it from the where clause.
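Putting it together, a sketch of the rewritten query, keeping the original SELECT and GROUP BY and only making the WHERE clause sargable (SQL Server syntax for the date construction; @year stands for the year you are filtering on):
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
WHERE
created >= DateFromParts(@year, 1, 1)
AND created < DateFromParts(@year + 1, 1, 1)
GROUP BY
year(created), MONTHNAME(created);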
This is called sargability and you can google all kinds of good information on it.
P.S. This is in SQL Server format, but you should be able to calculate "beginning of the specified year" and "beginning of the year after the specified year" in whatever DBMS you're using.
An index will be used when it helps narrow down the number of rows read.
It will also be used when it avoids reading the table at all. This is the case when the index contains all the columns referenced in the query.
In your case the only column referenced is created, so adding an index on this column should help reducing the necessary reads and improve the overall runtime of your query. However, if created is the only column in the table, the index won't change anything in the first query, because it doesn't reduce the number of pages to be read.
Even with a large table, you can test whether an index makes a difference. You can copy only part of the rows to a new table and compare the execution plans on the new table with and without an index, e.g.
insert into testhistory
select *
from history
fetch first 100000 rows only
You want what's known as a Calendar Table (the particular example uses SQL Server, but the solution should be adaptable). Then, you want lots of indices on it (since writes are few, and this is a primary dimension table for analysis).
Assuming you have a minimum Calendar Table that looks like this:
CREATE TABLE Calendar (isoDate DATE,
dayOfMonth INTEGER,
month INTEGER,
year INTEGER);
... with an index over [dayOfMonth, month, year, isoDate], your query can be re-written like this:
SELECT Calendar.year, Calendar.month,
COUNT(*) AS ymCount
FROM Calendar
JOIN History
ON History.created >= Calendar.isoDate
AND History.created < Calendar.isoDate + 1 MONTH
WHERE Calendar.dayOfMonth = 1
GROUP BY Calendar.year, Calendar.month
The WHERE Calendar.dayOfMonth = 1 automatically limits results to 12 per year. The start of the range is trivially located with the index (given the SARGable data), and the end of the range as well (yes, doing math on a column generally disqualifies indices... but only on the side where the math is used. If the optimizer is at all smart, it's going to generate a virtual intermediate table containing the start/end of each range).
So, index-based (and likely index-only) access for the query. Learn to love indexed dimension tables that can be used for range queries (Calendar Tables being one of the most useful).
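If you don't already have such a table, one hedged way to populate a minimal Calendar is a recursive common table expression (DB2-style, matching the date arithmetic used above; the date range here is arbitrary):
INSERT INTO Calendar (isoDate, dayOfMonth, month, year)
WITH dates (d) AS (
  SELECT DATE('2000-01-01') FROM SYSIBM.SYSDUMMY1
  UNION ALL
  SELECT d + 1 DAY FROM dates WHERE d < DATE('2030-12-31')
)
SELECT d, DAY(d), MONTH(d), YEAR(d) FROM dates;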
I'll assume you are using SQL Server based on your tags.
Yes, the index will make your query faster.
I recommend using only the 'created' column as the index key and not including any additional columns from the History table, because they would be unused and only result in more reads than necessary.
And of course, be mindful when you create indexes on tables that have a lot of INSERT, UPDATE, DELETE activity as your new index will make these actions more expensive when being performed on the table.
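For reference, that would be a single-column index along these lines (the index name is arbitrary):
CREATE INDEX IX_History_created ON HISTORY (created);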
As has been stated before, in your case an index won't be used, because the index is created on the column 'created' and you are querying on 'year(created)'.
What you can do is add two generated columns, year_gen = year(created) and month_gen = MONTHNAME(created), to your table and index these two columns. The DB2 query optimizer will automatically use these two generated columns, and it will also use the indexes created on them.
The code should be something like (but not 100% sure since I have no DB2 to test)
SET INTEGRITY FOR HISTORY OFF CASCADE DEFERRED #
ALTER TABLE HISTORY ADD COLUMN YEAR_GEN SMALLINT GENERATED ALWAYS AS (YEAR(CREATED)),
ADD COLUMN MONTH_GEN VARCHAR(20) GENERATED ALWAYS AS (MONTHNAME(CREATED)) #
SET INTEGRITY FOR HISTORY IMMEDIATE CHECKED FORCE GENERATED #
CREATE INDEX HISTORY_YEAR_IDX ON HISTORY (YEAR_GEN ASC) CLUSTER #
CREATE INDEX HISTORY_MONTH_IDX ON HISTORY (MONTH_GEN ASC) #
Just a side note: the SET INTEGRITY ... OFF is mandatory to add generated columns. Your table is inaccessible until you reset the integrity to CHECKED and force the re-calculation of the generated columns (this might take a while in your case).
Setting integrity off without CASCADE DEFERRED will set every table with a foreign key to the HISTORY table to OFF too, and you will have to manually reset the integrity of those tables as well. If I remember correctly, using CASCADE DEFERRED in combination with incoming foreign keys may cause DB2 to set the integrity of your table to 'checked by user'.

Oracle sql statement on very large table

I'm relatively new to SQL and I have a statement which takes forever to run.
SELECT
sum(a.amountcur)
FROM
custtrans a
WHERE
a.transdate <= '2013-12-31';
It's a large table, but the statement takes about 6 minutes!
Any ideas why?
Your select, as you post it, will read 99% of the whole table (2013-12-31 is just a week ago, and I assume most entries are before that date and only very few after). If your table has many large columns (like varchar2(4000)), all that data will be read as well when Oracle scans the table. So you might read several KB per row just to get the 30 bytes you need for amountcur and transdate.
If you have this scenario, create a combined index on transdate and amountcur:
CREATE INDEX myindex ON custtrans(transdate, amountcur)
With the combined index, Oracle can read the index to fulfill your query and doesn't have to touch the main table at all, which might result in considerably less data needing to be read from disk.
Make sure the table has an index on transdate.
create index custtrans_idx on custtrans (transdate);
Also if this field is defined as a date in the table then do
SELECT sum(a.amountcur)
FROM custtrans a
WHERE a.transdate <= to_date('2013-12-31', 'yyyy-mm-dd');
If the table is really large, the query has to scan every row with transdate below the given value.
Even if you have an index on transdate and it helps to stop the scan early (which it may not), when the number of matching rows is very high, it would take considerable time to scan them all and sum the values.
To speed things up, you could calculate partial sums, e.g. for each past month, assuming that your data is historical and the past does not change. Then you'd only need to scan custtrans for 1-2 months, quickly scan the table with the monthly sums, and add the results.
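A hedged sketch of that idea in Oracle syntax (the summary table name is made up; it only works if closed months are never updated afterwards):
CREATE TABLE custtrans_monthly AS
SELECT TRUNC(transdate, 'MM') AS month_start,
       SUM(amountcur) AS month_sum
FROM custtrans
WHERE transdate < TRUNC(SYSDATE, 'MM') -- only fully closed months
GROUP BY TRUNC(transdate, 'MM');

SELECT (SELECT NVL(SUM(month_sum), 0)
        FROM custtrans_monthly
        WHERE month_start < TRUNC(DATE '2013-12-31', 'MM'))
     + (SELECT NVL(SUM(amountcur), 0)
        FROM custtrans
        WHERE transdate >= TRUNC(DATE '2013-12-31', 'MM')
        AND transdate <= DATE '2013-12-31') AS total_amount
FROM dual;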
Try to create an index only on column amountcur:
CREATE INDEX myindex ON custtrans(amountcur)
In this case Oracle will most probably read only the index (INDEX FULL SCAN), nothing else.
Correction, as mentioned in the comments: it must be a composite index:
CREATE INDEX myindex ON custtrans(transdate, amountcur)
But maybe it is a bit useless to create an index just for a single select statement.
One option is to create an index on the column used in the WHERE clause (this is useful if you want to retrieve only 10-15% of the rows by using the indexed column).
Another option is to partition your table if it has millions of rows. Even then, if you try to retrieve 70-80% of the data, it won't help.
The best option is first to analyze your requirements and then make a choice.
Whenever you deal with dates, it's better to use the to_date() function. Do not rely on implicit data type conversion.

What is the fastest way to select rows from a huge database?

I have a huge database of more than 3 million rows (my users' information), and I need to select all users whose birthday falls on the current day.
The birthday column is text (e.g. '19/03' or '19/03/1975') with the day and month, and sometimes the year.
When I try to select rows with LIKE or LEFT functions, it takes more than a minute to return the results.
I've tried using 3 int columns for day, month and year and then making the selection, but it took even longer to get the results.
Any idea on how to make it run faster?
I'm using SQL Server 2008
Thanks
As marc_s mentions, if at all possible, store this as a date type - it'll make it way faster for SQL Server to perform comparisons on, and it'll be way easier to maintain. Next up, make sure to put an index on that column, and consider including any extra columns if you're only looking up the birthday to retrieve a small subset of the row's columns.
Finally - and this is a big one - TEXT is just about the worst data type you could choose. The way TEXT is stored, the data isn't actually stored on the page itself. Instead it leaves behind a 16-byte pointer to another page. This other page will then contain the data itself in a record. But it gets worse: that record will be a SMALL_ROOT datatype taking up 84 bytes of space when your data is between 0 and 64 bytes in length!
Thus, what could've been saved as an 8-byte datetime or a 4-byte date now takes up a total of 100 bytes, and causes an off-row lookup for each and every row. Basically the perfect storm for bad performance.
If you cannot change it to a more proper datetime, at the very least, change it to a varchar!
First of all, save the date in a format that is supported by SQL Server, something like DATE or DATETIME (in your case I am guessing DATE should be enough). Once you have that, you can use SQL functions like MONTH and DAY as follows, and avoid complex string manipulation functions like LEFT etc.
Your query will look like this:
select * from MyTable where MONTH(dateColumn) = 1 AND DAY(dateColumn) = 7 -- 1 is for January
I am not sure if this will solve your performance problems entirely, but you can run this query in SQL Query Analyzer and see what recommendations it gives with respect to indexes etc. I don't have a great deal of knowledge about indexes on date-type columns.
Most of what I had to say has already been said: Use a DATE type to store the date, and make sure that it is indexed. If you're going to use the three integers to store the date and search by that, then make sure that they're indexed as well:
CREATE INDEX IX_MyTable_Date_Ints ON MyTable(intYear, intMonth, intDay)
CREATE INDEX IX_MyTable_Date ON MyTable(BirthDate)
If you want to be able to search the user table for birthdays excluding the year, I would recommend storing the birthday in a separate date field using a fixed year, e.g. 3004, instead of using three integers. Your base year should be a leap year, to cater for anyone who may have been born on 29 February. If you use a year far in the future, you can use the year to determine that a date is effectively one for which the year should be disregarded.
Then you can search for the birthday, regardless of the year, without having to do a function call on each record, by adding "WHERE birth_day = '3004-12-10'". If this field is indexed, you should be able to return all matching rows in a flash. You need to bear in mind that when searching an index, the server will need to do a maximum of 32 comparisons to find a match in 4 billion records. Never underestimate the benefits of indexing!
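As a hedged example (SQL Server 2008 syntax, mirroring the expression used in the trigger below), "birthdays today" then becomes an index-friendly equality search:
SELECT username
FROM MyTable
WHERE birth_day = DATEADD(YEAR, 3004 - YEAR(GETDATE()), CAST(GETDATE() AS DATE));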
I would be inclined to maintain the birthday through a trigger, so that it keeps itself updated. For those birth dates where you don't have the year, just use your base year (3004). Since your base year is in the future, you know that this birth date doesn't have a year.
CREATE TABLE MyTable (
MyTable_key INT IDENTITY(1, 1),
username VARCHAR(30),
birth_date DATE,
birth_day DATE
)
ALTER TABLE MyTable ADD CONSTRAINT PK_MyTable PRIMARY KEY CLUSTERED (MyTable_key)
CREATE INDEX MyTable_birth_date ON MyTable(birth_date)
CREATE INDEX MyTable_birth_day ON MyTable(birth_day)
GO
CREATE TRIGGER tr_MyTable_calc_birth_day ON MyTable AFTER INSERT, UPDATE AS
UPDATE t SET birth_day = DATEADD(YEAR, 3004-DATEPART(YEAR, t.birth_date), t.birth_date)
FROM MyTable t, inserted i WHERE i.MyTable_key = t.MyTable_key
To update your existing table, run the update as a standalone query, without the join to the inserted table as it was used in the trigger:
UPDATE MyTable SET birth_day = DATEADD(YEAR, 3004-DATEPART(YEAR, birth_date), birth_date)
Hope this helps.
Try to use a ResultSet instead of a DataTable or DataSet. A ResultSet is faster compared to both of these.

Indexes, EXPLAIN PLAN, and record access in Oracle SQL

I have been learning about indexes in Oracle SQL, and I wanted to conduct a small experiment with a test table to see how indexes really worked. As I discovered from an earlier post made here, the best way to do this is with EXPLAIN PLAN. However, I am running into something which confuses me.
My sample table contains attributes (EmpID, Fname, Lname, Occupation, .... etc). I populated it with 500,000 records using a Java program I wrote (random names, occupations, etc). Now, here are some sample queries with and without indexes:
NO INDEX:
SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
EXPLAIN PLAN says:
OPERATION OPTIMIZER COST
TABLE ACCESS(FULL) TEST.EMPLOYEE ANALYZED 1169
Now I create index:
CREATE INDEX occupation_idx
ON EMPLOYEE (Occupation);
WITH INDEX "occupation_idx":
SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
EXPLAIN PLAN says:
OPERATION OPTIMIZER COST
TABLE ACCESS(FULL) TEST.EMPLOYEE ANALYZED 1169
So... the cost is STILL the same, 1169? Now I try this:
WITH INDEX "occupation_idx":
SELECT Occupation FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
EXPLAIN PLAN says:
OPERATION OPTIMIZER COST
INDEX(RANGE SCAN) TEST.OCCUPATION_IDX ANALYZED 67
So, it appears that the index is only utilized when that column is the only one I'm pulling values from. But I thought that the point of an index was to unlock the entire record using the indexed column as the key? The search above is a pretty pointless one... it searches for values which you already know. The only worthwhile query I can think of which ONLY involves an indexed column's value (and not the rest of the record) would be an aggregate such as COUNT or something.
What am I missing?
Even with your index, Oracle decided to do a full scan for the second query.
Why did it do this? Oracle would have created two plans and come up with a cost for each:
1) Full scan
2) Index access
Oracle selected the plan with the lower cost. Obviously it came up with the full scan as the lower cost.
If you want to see the cost of the index plan, you can do an explain plan with a hint like this to force the index usage:
SELECT /*+ INDEX(EMPLOYEE occupation_idx) */ Fname
FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
If you do an explain plan on the above, you will see that the cost is greater than the full scan cost. This is why Oracle did not choose to use the index.
A simple way to consider the cost of the index plan is:
The blevel of the index (how many blocks must be read from top to bottom)
The number of table blocks that must be subsequently read for records matching in the index. This relies on Oracle's estimate of the number of employees that have an occupation of 'DOCTOR'. In your simple example, this would be:
number of rows / number of distinct values
More complicated considerations include the clustering factor and index cost adjustments, which both reflect the likelihood that a block being read is already in memory and hence does not need to be read from disk.
Perhaps you could update your question with the results from your query with the index hint and also the results of this query:
SELECT COUNT(*), COUNT(DISTINCT( Occupation ))
FROM EMPLOYEE;
This will allow people to comment on the cost of the index plan.
I think I see what's happening here.
When you have the index in place, and you do:
SELECT Occupation FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
The execution plan will use the index. This is a no-brainer, because all the data that's required to satisfy the query is right there in the index, and Oracle never even has to reference the table at all.
However, when you do:
SELECT Fname FROM EMPLOYEE WHERE Occupation = 'DOCTOR';
then, if Oracle uses the index, it will do an INDEX RANGE SCAN followed by a TABLE ACCESS BY ROWID to look up the Fname that corresponds to that Occupation. Now, depending on how many rows have DOCTOR for Occupation, Oracle will have to make one or more trips to the table, to look up the Fname. If, for example, you have a table, and all the employees have Occupation set to 'DOCTOR', the index isn't of much use, and Oracle will simply do a FULL TABLE SCAN of the table. If there are 10,000 employees, and only one is a DOCTOR, then again, it's a no-brainer, and Oracle will use the index.
But there are some subtleties when you're somewhere between those two extremes. People like to talk about 'selectivity', i.e., how many rows are identified by the index, vs. the size of the table, when discussing whether the index will be used. But, that's not really true. What Oracle really cares about is block selectivity. That is, how many blocks does it have to visit, to satisfy the query? So, first, how "wide" is the RANGE SCAN? The more limited the range of values specified by the predicate values, the better. Second, when your query needs to do table lookups, how many different blocks will it have to visit to find all the data it needs? That is, how "random" is the data in the table relative to the index order? This is called the CLUSTERING_FACTOR. If you analyze the index to collect statistics, and then look at USER_INDEXES, you'll see that the CLUSTERING_FACTOR is now populated.
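If you want to see the figures the optimizer is working with, a quick look at the data dictionary after gathering statistics will show them (a sketch, using the table name from your example):
SELECT index_name, blevel, leaf_blocks, clustering_factor, num_rows
FROM user_indexes
WHERE table_name = 'EMPLOYEE';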
So, what's CLUSTERING_FACTOR? CLUSTERING_FACTOR is the "orderedness" of the table, with respect to the index's key column(s). The value of CLUSTERING_FACTOR will always be between the number of blocks in a table and the number of rows in a table. A low CLUSTERING_FACTOR, that is, one that is very near to the number of blocks in the table, indicates a table that's very ordered, relative to the index. A high CLUSTERING_FACTOR, that is, one that is very near to the number of rows in the table, is very unordered, relative to the index.
It's an important concept to understand that the CLUSTERING_FACTOR describes the order of data in the table relative to the index. So, rebuilding an index, for example, will not change the CLUSTERING_FACTOR. It's also important to understand that the same table could have two indexes, and one could have an excellent CLUSTERING_FACTOR, and the other could have an extremely poor CLUSTERING_FACTOR. The table itself can only be ordered in one way.
So, why have I spent so much time describing CLUSTERING_FACTOR? Because when you have an execution plan that does an INDEX RANGE SCAN followed by TABLE ACCESS BY ROWID, you can be sure that the CLUSTERING_FACTOR has been considered by Oracle's optimizer, to come up with the execution plan. For example, suppose you have a 10,000 row table, and suppose 100 of the rows have Occupation = 'DOCTOR'. You write the query above, asking for the Fname of the employees whose occupation is DOCTOR. Well, Oracle can very easily and efficiently determine the rowids of the rows where occupation is DOCTOR. But, how many table blocks will Oracle need to visit, to do the Fname lookup? It could be only 1 or 2 table blocks, if the data is clustered (ordered) by Occupation in the table. But, it could be as many as 100, if the data is very unordered in the table! So, again, 10,000 row table, and, let's assume, (for the purposes of illustration and simple math) that the table has 100 rows/block, and so, 100 blocks. Depending on table order (i.e. CLUSTERING_FACTOR), the number of table block visits could be as few as 1, or as many as 100.
So, I hope this helps you understand why the optimizer may be reluctant to use an index in some cases.
An index is a copy of the table which stores only the following data:
Indexed field(s)
A pointer to the original row (rowid).
Say you have a table like this:
rowid id name occupation
[1] 1 John clerk
[2] 2 Jim manager
[3] 3 Jane boss
Then an index on occupation would look like this:
occupation rowid
boss [3]
clerk [1]
manager [2]
, with the records sorted on occupation in a B-Tree.
As you can see, if you only select the indexed fields, you only need the index (the second table).
If you select anything other than occupation:
SELECT *
FROM mytable
WHERE occupation = 'clerk'
then the engine has to do two things: first, find the relevant records in the index; second, find the records in the original table by rowid. It's as if you joined the two tables on rowid.
Since the rowids in the index are not in order, the reads to the original table are not sequential and can be slow. It may be faster to read the original table in sequential order and just filter the records with occupation = 'clerk'.
The engine does not "unlock" the records: it just finds the rowid in the index, and if there are not enough data in the index itself, it looks up data in the original table by the rowid found.
As a WAG: analyze the table and the index, then see if the plan changes.
When you are selecting just the occupation, the entire query can be satisfied from the index. The index literally has a copy of the occupation. The moment you add an additional column to the select, Oracle has to go to the data record to get it. The optimizer chooses to read all of the data rows instead of all of the index rows plus the data rows. It's cheaper.
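On the "analyze" suggestion above, a minimal sketch of gathering fresh optimizer statistics for the table and its indexes (the TEST schema name is taken from the plan output; adjust as needed):
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'TEST',
    tabname => 'EMPLOYEE',
    cascade => TRUE); -- also gather statistics on the table's indexes
END;
/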