What is the fastest way to select rows from a huge database? - sql

I have a huge database of more than 3 million rows (my users information), I need to select all users that have birthdays in the current day.
The birthday column is a text (e.g. '19/03' or '19/03/1975') with day and month and sometimes the years.
When I try to select rows with LIKE or LEFT functions, it takes more than a minute to return the results.
I've tried using three int columns for day, month and year and then making the selection, but it took longer to get the results.
Any idea on how to make it run faster?
I'm using SQL Server 2008
Thanks

As marc_s mentions, if at all possible, store this as a date type - it'll make it way faster for SQL Server to perform comparisons on, and it'll be way easier to maintain. Next up, make sure to put an index on that column, and consider including any extra columns if you're only looking up the birthday to select a small subset of the total row.
Finally - and this is a big one. TEXT is just about the worst data type you could choose. The way TEXT is stored, the data isn't actually stored on the page itself. Instead it leaves behind a 16-byte pointer to another page. This other page will then contain the data itself in a record. But it gets worse, that record will be a SMALL_ROOT datatype taking up 84 bytes of space when your data is between 0 and 64 bytes in length!
Thus, what could've been saved as an 8-byte datetime or a 4-byte date now takes up a total of 100 bytes, and causes an off-row lookup for each and every row. Basically the perfect storm for bad performance.
If you cannot change it to a more proper datetime, at the very least, change it to a varchar!

First of all, save the date in a type that SQL Server supports, such as DATE or DATETIME (in your case I am guessing DATE should be enough). Once you have that, you can use SQL functions like MONTH and DAY as follows and avoid string manipulation functions like LEFT.
Your query will look like this:
SELECT * FROM MyTable WHERE MONTH(dateColumn) = 1 AND DAY(dateColumn) = 7 -- 1 is for January
I am not sure if this will solve your performance problems entirely, because MONTH() and DAY() still have to be evaluated for every row, but you can run this query in SQL Query Analyzer and see what recommendations it makes with respect to indexes etc. I don't have a great deal of knowledge about indexes on date type columns.

Most of what I had to say has already been said: Use a DATE type to store the date, and make sure that it is indexed. If you're going to use the three integers to store the date and search by that, then make sure that they're indexed as well:
CREATE INDEX IX_MyTable_Date_Ints ON MyTable(intYear, intMonth, intDay)
CREATE INDEX IX_MyTable_Date ON MyTable(BirthDate)
If you want to be able to search the user table for birthdays excluding the year, I would recommend storing the birthday in a separate date field, using a fixed year, e.g. 3004, instead of using three integers. Your base year should be a leap year, to cater for anyone who may have been born on 29 February. If you use a year far in the future, you can use the year to determine that a date is effectively a date for which the year should be disregarded.
Then you can search for the birthday, regardless of the year, without having to do a function call on each record, by adding "WHERE birth_day = '3004-12-10'". If this field is indexed, you should be able to return all matching rows in a flash. Bear in mind that when searching an index, the server needs at most about 32 comparisons to find a match in 4 billion records. Never underestimate the benefits of indexing!
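The "32 comparisons" figure is the binary-search bound, i.e. the base-2 logarithm of the row count rounded up. A minimal sketch of that arithmetic follows; the function name is made up for illustration, and a real B-tree index has a much higher fan-out than 2, so the number of pages actually read is smaller still.

```python
def max_comparisons(row_count: int) -> int:
    """Worst-case number of halving steps needed to isolate one key
    among row_count sorted keys (ceil(log2(row_count)) for row_count > 1)."""
    return (row_count - 1).bit_length()

print(max_comparisons(4_000_000_000))  # 32
print(max_comparisons(3_000_000))      # 22, roughly the size of the table in the question
```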
I would be inclined to maintain the birthday through a trigger, so that it keeps itself updated. For those birth dates where you don't have the year, just use your base year (3004). Since your base year is in the future, you know that this birth date doesn't have a year.
CREATE TABLE MyTable (
    MyTable_key INT IDENTITY(1, 1),
    username VARCHAR(30),
    birth_date DATE,   -- full birth date; uses year 3004 when the year is unknown
    birth_day DATE     -- day and month moved into base year 3004, maintained by the trigger
)
ALTER TABLE MyTable ADD CONSTRAINT PK_MyTable PRIMARY KEY CLUSTERED (MyTable_key)
CREATE INDEX MyTable_birth_date ON MyTable(birth_date)
CREATE INDEX MyTable_birth_day ON MyTable(birth_day)
GO
CREATE TRIGGER tr_MyTable_calc_birth_day ON MyTable AFTER INSERT, UPDATE AS
    -- Recalculate birth_day only for the rows touched by this statement
    UPDATE t SET birth_day = DATEADD(YEAR, 3004-DATEPART(YEAR, t.birth_date), t.birth_date)
    FROM MyTable t
    JOIN inserted i ON i.MyTable_key = t.MyTable_key
To update your existing table, run the update as a standalone query, without the join to the inserted table as it was used in the trigger:
UPDATE MyTable SET birth_day = DATEADD(YEAR, 3004-DATEPART(YEAR, birth_date), birth_date)
Hope this helps.
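For anyone who wants to try the base-year idea without a SQL Server instance, here is a rough sketch using Python's built-in sqlite3 module. It is only an illustration under stated assumptions: SQLite stores the dates as ISO text, the normalization is done in Python rather than in a trigger, and the sample users and helper names are made up.

```python
import sqlite3
from datetime import date

BASE_YEAR = 3004  # a leap year, as suggested above, so 29 February stays valid

def to_birth_day(birth_date: date) -> str:
    """Move a birth date into BASE_YEAR and return it as ISO text."""
    return birth_date.replace(year=BASE_YEAR).isoformat()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (username TEXT, birth_date TEXT, birth_day TEXT)")
conn.execute("CREATE INDEX MyTable_birth_day ON MyTable (birth_day)")

# Hypothetical sample users.
people = [("ann", date(1975, 3, 19)), ("bob", date(1988, 3, 19)), ("cy", date(1976, 2, 29))]
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?)",
    [(name, d.isoformat(), to_birth_day(d)) for name, d in people],
)

def birthdays_on(month: int, day: int) -> list:
    """Usernames whose birthday falls on the given day, whatever the birth year."""
    key = date(BASE_YEAR, month, day).isoformat()
    rows = conn.execute(
        "SELECT username FROM MyTable WHERE birth_day = ? ORDER BY username", (key,)
    )
    return [row[0] for row in rows]

print(birthdays_on(3, 19))  # ['ann', 'bob']
print(birthdays_on(2, 29))  # ['cy']
```

The lookup is a plain equality on an indexed column, so no function has to be applied to the stored values at query time.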

Try to use Result Set instead of DataTable or DataSet. ResultSet is fast when compared to both of these

Related

how to create an index for a non-deterministic function

I have a table with Date-Of-Birth column. I have defined a function, say FIND_AGE, which takes it as input and returns age (it uses system date in calculations).
I want to optimize a query which returns all records having a certain age, say 30. I understand that we can't use non-deterministic functions (like FIND_AGE) while creating indexes.
Is there still a way I can create an index to optimize the query to fetch all records having age 30?
If you have a performance issue, I would advise sharing the whole query. Generally, storing a date and having an index on it is enough to find particular records based on it.
For example, getting users who were born exactly 30 years before the current date:
SELECT *
FROM my_table
-- the cast drops the time part of GETDATE() so the comparison is date to date
WHERE dob = DATEADD(year, -30, CAST(GETDATE() AS date))
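Note that the query above matches a single birth date, whereas "age is 30" covers a whole year of birth dates. One way to keep the predicate index-friendly is to compute the two bounds once and filter with something like dob >= start AND dob < end. A small sketch of that calculation follows; the helper names are made up, and treating 29 February as 28 February in non-leap years is an assumption about how birthdays should be counted.

```python
from datetime import date, timedelta

def years_before(d: date, years: int) -> date:
    """Same calendar day `years` earlier; 29 February falls back to 28 February."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        return d.replace(year=d.year - years, day=28)

def dob_range_for_age(age: int, today: date) -> tuple:
    """Half-open range [start, end) of birth dates for people who are `age` today."""
    start = years_before(today, age + 1) + timedelta(days=1)
    end = years_before(today, age) + timedelta(days=1)
    return start, end

start, end = dob_range_for_age(30, date(2024, 6, 15))
print(start, end)  # 1993-06-16 1994-06-16
```

The two values can then be passed as parameters, so the indexed dob column is compared against constants and no function is applied to it.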
If you have billions of records, which is unusual for users data, I can accept this is the cause of your performance issue.
If not, it is better to check how the data from this table is read. You may already have an index on this column that the engine ignores because it is not covering. For example, you are also reading the first and last names of the users. So your index can be:
CREATE INDEX INX_my_table_DOB_I_FisrtsName_LastName ON my_table
(
DOB
)
INCLUDE (FirstName, LastName);
or, if you are also filtering by country code, the index will be:
CREATE INDEX INX_my_table_DOB_I_FisrtsName_LastName ON my_table
(
DOB
,CountryCode
)
INCLUDE (FirstName, LastName);
If your users table has many columns, or large columns holding text, xml, blob, etc., scanning the table instead of using the index can be the root of your issues.
If your table has an "updatedDate" column, you could take the opportunity to maintain an indexed column "ageAtUpdatedDate" at low cost, at the same time as you update "updatedDate". The people having age X now are necessarily among those having "ageAtUpdatedDate <= X", so you could reduce the set of people to test for "age = X". However, the reduction of the data set size depends on X relative to the histogram of ages in your population, so the improvement will be "random"...

What is the fastest way to perform a date query in Oracle SQL?

We have a 6B row table that is giving us challenges when retrieving data.
Our query returns values instantly when doing a...
SELECT * WHERE Event_Code = 102225120
That type of instant result is exactly what we need. We now want to filter to receive values for just a particular year - but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row. There are other approaches like using TRUNC, or BETWEEN, or specifying the datetime in YYYY-MM-DD format for doing comparisons.
Of note, we do not have the option to add indexes to the database as it is a vendor's database.
What is the way to add a date filtering query and enable Oracle to begin streaming the results back in the fastest way possible?
"Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row"
That question is quite different from yours. Firstly, the statement above applies to any data type, not only dates. Also, the word "many" is relative to the number of records in the table. If the optimizer estimates that the query will return a large fraction of all records in your table, it may decide that a full scan of the table is faster than using the index. In your situation, this translates to: how many records are in 2017 out of all records in the table? This ratio gives you the selectivity of your query, which in turn gives you an idea of whether an index will be faster or not.
Now, if you decide that an index will be faster, based on the above, the next step is to know how to build your index. In order for the optimizer to use the index, it must match the condition that you're using. You are not comparing dates in your query, you are only comparing the year part. So an index on the date column will not be used by this query. You need to create an index on the year part, so use the same condition to create the index.
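The "index must match the condition" point can be tried out in miniature with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: SQLite stands in for Oracle (where the equivalent would be a function-based index on the EXTRACT expression), the dates are ISO text so substr() plays the role of EXTRACT(YEAR ...), and the table and index names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_code INTEGER, performed TEXT)")  # ISO text dates

QUERY = "SELECT * FROM events WHERE substr(performed, 1, 4) = '2017'"

def plan(sql: str) -> str:
    """Return SQLite's query plan for sql as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

before = plan(QUERY)

# Index exactly the expression that the predicate uses.
conn.execute("CREATE INDEX ix_events_year ON events (substr(performed, 1, 4))")
after = plan(QUERY)

print(before)  # a full scan of events
print(after)   # a search that names ix_events_year
```

The exact plan wording differs between SQLite versions, but the second plan should name the expression index while the first cannot.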
"we do not have the option to add indexes to the database as it is a vendor's database."
If you cannot modify the database, there is no way to optimize your query. You need to talk to the vendor and get access to modify the database or ask them to add the index for you.
A function can also cause slowness given the number of records involved. I'm not sure if a function-based index can help you with this, but you can try.
Have you tried adding a year column to the table? If not, try adding a year column and updating it using the code below.
UPDATE table
SET year = EXTRACT(YEAR FROM PERFORMED_DATE_TIME);
This will take time though.
But after this, you can run the query below.
SELECT *
FROM table
WHERE Event_Code = 102225120 AND year = 2017;
Also, consider table partitioning for data this big. For starters, see the link below:
link: https://oracle-base.com/articles/8i/partitioned-tables-and-indexes
Your question is a bit ambiguous IMHO:
but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Do you mean that
SELECT * WHERE Event_Code = 102225120
is fast, but
SELECT * WHERE Event_Code = 102225120 AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
is slow???
For starters, I'll agree with Mitch Wheat that you should try to use PERFORMED_DATE_TIME >= Jan 1, 2017 and PERFORMED_DATE_TIME < Jan 1, 2018 instead of Year(field) = 2017. (A range ending at Dec 31, 2017 would drop rows with a time of day on that last date.) Even if you had an index on the field, the Year(field) form would hardly be able to make use of it, while the range form would benefit enormously.
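If PERFORMED_DATE_TIME carries a time of day, the choice of upper bound matters. A tiny illustration with made-up sample values, using plain Python datetimes in place of table rows:

```python
from datetime import datetime

events = [
    datetime(2017, 1, 1, 0, 0),
    datetime(2017, 12, 31, 0, 0),
    datetime(2017, 12, 31, 18, 30),  # afternoon of the last day of the year
    datetime(2018, 1, 1, 0, 0),
]

# BETWEEN-style: both ends inclusive, upper bound is midnight at the start of 31 Dec.
between = [e for e in events if datetime(2017, 1, 1) <= e <= datetime(2017, 12, 31)]

# Half-open: on or after 1 Jan 2017 and strictly before 1 Jan 2018.
half_open = [e for e in events if datetime(2017, 1, 1) <= e < datetime(2018, 1, 1)]

print(len(between), len(half_open))  # 2 3
```

The inclusive range silently loses the 18:30 row, while the half-open range keeps every 2017 value and still excludes midnight on 1 Jan 2018.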
I'm also hoping you want to be more specific than just 'give me all of 2017' because returning over 1B rows is NEVER going to be fast.
Next, if you can't make changes to the database, would you be able to maintain a 'shadow' in another database? This would require that you create a table with all date-values AND the PK of the original table in another database and query those to find the relevant PK values and then JOIN those back to your original table to find whatever you need. The biggest problem with this would be that you need to keep the shadow in sync with the original table. If you know the original table only changes overnight, you could merge the changes in the morning and query all day. If the application is 'real-time(ish)' then this probably won't work without some clever thinking... And yes, your initial load of 6B values will be rather heavy =)
Maybe this could be useful (because you avoid functions, which are a cause of context switching, and if you have an index on your date field, it could be used):
with
  dt as
  (
    select
      to_date('01/01/2017', 'DD/MM/YYYY') as d1,
      to_date('31/01/2017', 'DD/MM/YYYY') as d2
    from dual
  ),
  -- one row per calendar day from d1 to d2
  dates as
  (
    select
      dt.d1 + rownum - 1 as d
    from dt
    connect by dt.d1 + rownum - 1 <= dt.d2
  )
-- the equality join only matches rows whose time component is midnight
select *
from your_table, dates
where dates.d = PERFORMED_DATE_TIME
Move the date literal to RHS:
AND PERFORMED_DATE_TIME >= date '2017-01-01'
AND PERFORMED_DATE_TIME < date '2018-01-01'
But without an (undisclosed) appropriate index on PERFORMED_DATE_TIME, the query is unlikely to be any faster.
One option to create indexes in third party databases is to script in the index and then before any vendor upgrade run a script to remove any indexes you've added. If the index is important, ask the vendor to add it to their database design.

How to improve a SQL timestamp range query?

I have a table records having four fields:
id - the row id
value - the row value
source - the source of the value
timestamp - the time when the row was inserted (should this be a unix timestamp or a datetime?)
And I want to perform a query like this:
SELECT timestamp, value FROM records WHERE timestamp >= a AND timestamp <= b
However in a table with millions of records this query is super inefficient!
I am using Azure SQL Server as the DBMS. Can this be optimised?
If so can you provide a step-by-step guide to do it (please don't skip "small" steps)? Be it creating indexes, redesigning the query statement, redesigning the table (partitioning?)...
Thanks!
After creating an index on the field you want to search, you can use a BETWEEN operator. It is shorthand for >= AND <=, so it expresses the range as a single predicate that the index can seek on.
SELECT XXX FROM ABC WHERE DateField BETWEEN '1/1/2015' AND '12/31/2015'
Also, in SQL Server 2016 you can create range indexes for use on things like time-stamps using memory optimized tables. That's really the way to do it.
I would recommend using the datetime, or even better the datetime2 data type to store the date data (datetime2 being better as it has a higher level of precision, and with lower precision levels will use less storage).
As for your query, based upon the statement you posted you would want the timestamp to be the key column, and then include the value. This is because you are using the timestamp as your predicate, and returning the value along with it.
CREATE NONCLUSTERED INDEX IX_Records_Timestamp on Records (Timestamp) INCLUDE (Value)
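To see the effect of a covering index without an Azure instance, here is a rough sketch using Python's built-in sqlite3 module. The assumptions are that SQLite stands in for SQL Server, the timestamp column is renamed ts and stored as ISO text, and, because SQLite has no INCLUDE clause, value is appended as a trailing key column instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, value REAL, source TEXT, ts TEXT)"
)
# SQLite has no INCLUDE clause, so `value` is added as a second key column.
conn.execute("CREATE INDEX IX_Records_Timestamp ON records (ts, value)")

plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT ts, value FROM records WHERE ts >= ? AND ts <= ?",
    ("2015-01-01", "2015-12-31"),
).fetchall()
plan = " | ".join(row[-1] for row in plan_rows)
print(plan)  # a range search answered from the index alone
```

The plan should report a search using a covering index, meaning both the filter and the returned columns are served by the index without touching the table rows.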
This being said, be careful with your column names. I would highly recommend not using reserved keywords for column names, as they can be a lot more difficult to work with.

Creating index on timestamp column for query which uses year function

I have a HISTORY table with 9 million records. I need to find year-wise, month-wise records created. I was using query no 1, However it timed out several times.
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
GROUP BY
year(created), MONTHNAME(created);
I decided to add where year(created), this time the query took 30 mins (yes it takes so long) to execute.
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
WHERE
year(created) = 2010
GROUP BY
year(created), MONTHNAME(created) ;
I was planning to add an index on the created timestamp column; however, before doing so, I need an opinion (since it's going to take a long time to index such a huge table).
Will adding an index on created(timestamp) column improve performance, considering year function is used on the column?
An index won't really help because you have formed the query such that it must perform a complete table scan, index or no index. You have to form the where clause so it is in the form:
where field op constant
where field is, of course, your field; op is =, <=, >=, <>, between, in, etc.; and constant is either a direct constant, such as 42, or an operation that can be executed once and the result cached, such as getdate().
Like this:
where created >= DateFromParts( @year, 1, 1 )
and created < DateFromParts( @year + 1, 1, 1 )
The DateFromParts function will generate a value which remains in effect for the duration of the query. If created is indexed, now the optimizer will be able to seek to exactly where the correct dates start and tell when the last date in the range has been processed and it can stop. You can keep year(created) everywhere else -- just get rid of it from the where clause.
This is called sargability and you can google all kinds of good information on it.
P.S. This is in Sql Server format but you should be able to calculate "beginning of specified year" and "beginning of year after specified year" in whatever DBMS you're using.
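The difference is easy to observe on a small scale. Here is a rough sketch using Python's built-in sqlite3 module, assuming SQLite as a stand-in for the real DBMS, ISO text dates, strftime() in place of year(), and a made-up payload column so that the index is not covering:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (id INTEGER PRIMARY KEY, payload TEXT, created TEXT)")
conn.execute("CREATE INDEX ix_history_created ON history (created)")

def plan(where: str) -> str:
    """Return SQLite's query plan for a SELECT with the given WHERE clause."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT payload FROM history WHERE " + where
    ).fetchall()
    return " | ".join(row[-1] for row in rows)

# Function wrapped around the column: the index on created cannot be searched.
wrapped = plan("strftime('%Y', created) = '2010'")

# Bare column compared against constants: the index can be searched by range.
ranged = plan("created >= '2010-01-01' AND created < '2011-01-01'")

print(wrapped)
print(ranged)
```

The first plan should be a scan of the whole table and the second a search on ix_history_created; the exact wording varies between SQLite versions.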
An index will be used, when it helps narrow down the number of rows read.
It will also be used, when it avoids reading the table at all. This is the case, when the index contains all the columns referenced in the query.
In your case the only column referenced is created, so adding an index on this column should help reducing the necessary reads and improve the overall runtime of your query. However, if created is the only column in the table, the index won't change anything in the first query, because it doesn't reduce the number of pages to be read.
Even with a large table, you can test, if an index makes a difference. You can copy only part of the rows to a new table and compare the execution plans on the new table with and without an index, e.g.
insert into testhistory
select *
from history
fetch first 100000 rows only
You want what's known as a Calendar Table (the particular example uses SQL Server, but the solution should be adaptable). Then, you want lots of indices on it (since writes are few, and this is a primary dimension table for analysis).
Assuming you have a minimum Calendar Table that looks like this:
CREATE TABLE Calendar (isoDate DATE,
dayOfMonth INTEGER,
month INTEGER,
year INTEGER);
... with an index over [dayOfMonth, month, year, isoDate], your query can be re-written like this:
SELECT Calendar.year, Calendar.month,
COUNT(*) AS ymCount
FROM Calendar
JOIN History
ON History.created >= Calendar.isoDate
AND History.created < Calendar.isoDate + 1 MONTH
WHERE Calendar.dayOfMonth = 1
GROUP BY Calendar.year, Calendar.month
The WHERE Calendar.dayOfMonth = 1 automatically limits the results to 12 per year. The start of the range is trivially located with the index (given the SARGable data), and so is the end of the range. (Yes, doing math on a column generally disqualifies indices, but only on the side where the math is used. If the optimizer is at all smart, it is going to generate a virtual intermediate table containing the start and end of each range.)
So, index-based (and likely index-only) access for the query. Learn to love indexed dimension tables, that can be used for range queries (Calendar Tables being one of the most useful).
I'll assume you are using SQL Server based on your tags.
Yes, the index will make your query faster.
I recommend only using the 'created' column as a key for the index and to not include any additional columns from the History table because they will be unused and only result in more reads than what is necessary.
And of course, be mindful when you create indexes on tables that have a lot of INSERT, UPDATE, DELETE activity as your new index will make these actions more expensive when being performed on the table.
As has been stated before, in your case an index won't be used, because the index is created on the column 'created' and you are querying on 'year(created)'.
What you can do is add two generated columns, year_gen = year(created) and month_gen = MONTHNAME(created), to your table and index these two columns. The DB2 Query Optimizer will automatically use these two generated columns, and it will also use the indices created on these columns.
The code should be something like this (but I'm not 100% sure, since I have no DB2 to test on):
SET INTEGRITY FOR HISTORY OFF CASCADE DEFERRED #
ALTER TABLE HISTORY ADD COLUMN YEAR_GEN SMALLINT GENERATED ALWAYS AS (YEAR(CREATED))
ADD COLUMN MONTH_GEN VARCHAR(20) GENERATED ALWAYS AS (MONTHNAME(CREATED)) #
SET INTEGRITY FOR HISTORY IMMEDIATE CHECKED FORCE GENERATED #
CREATE INDEX HISTORY_YEAR_IDX ON HISTORY (YEAR_GEN ASC) CLUSTER #
CREATE INDEX HISTORY_MONTH_IDX ON HISTORY (MONTH_GEN ASC) #
Just a sidenote: setting integrity off is mandatory to add generated columns. Your table is inaccessible until you reset the integrity to checked and force the recalculation of the generated columns (this might take a while in your case).
Setting integrity off without cascade deferred will set every table with a foreign key to the HISTORY table to OFF too. You will have to manually reset the integrity of these tables too. If I remember correctly, using cascade deferred in combination with incoming foreign keys may cause DB2 to set the integrity of your table to 'checked by user'.

Get Day, Month, Year, Lifetime total records with one query w/ optimizations

I have a Postgres DB running 7.4 (Yeah we're in the midst of upgrading)
I have four separate queries to get the Daily, Monthly, Yearly and Lifetime record counts
SELECT COUNT(field)
FROM database
WHERE date_field
BETWEEN DATE_TRUNC('DAY', LOCALTIMESTAMP)
AND DATE_TRUNC('DAY', LOCALTIMESTAMP) + INTERVAL '1 DAY'
For Month just replace the word DAY with MONTH in the query and so on for each time duration.
Looking for ideas on how to get all the desired results with one query and any optimizations one would recommend.
Thanks in advance!
NOTE: date_field is timestamp without time zone
UPDATE:
Sorry I do filter out records with additional query constraints, just wanted to give the gist of the date_field comparisons. Sorry for any confusion
I have some idea of using prepared statements and simple statistics (record_count_t) table for that:
-- DROP TABLE IF EXISTS record_count_t;
-- DEALLOCATE record_count;
-- DROP FUNCTION updateRecordCounts();
CREATE TABLE record_count_t (type char, count bigint);
INSERT INTO record_count_t (type) VALUES ('d'), ('m'), ('y'), ('l');
PREPARE record_count (text) AS
UPDATE record_count_t SET count =
(SELECT COUNT(field)
FROM database
WHERE
CASE WHEN $1 <> 'l' THEN
DATE_TRUNC($1, date_field) = DATE_TRUNC($1, LOCALTIMESTAMP)
ELSE TRUE END)
WHERE type = $1;
CREATE FUNCTION updateRecordCounts() RETURNS void AS
$$
EXECUTE record_count('d');
EXECUTE record_count('m');
EXECUTE record_count('y');
EXECUTE record_count('l');
$$
LANGUAGE SQL;
SELECT updateRecordCounts();
SELECT type,count FROM record_count_t;
Use the updateRecordCounts() function any time you need to update the statistics.
I'd guess that it is not possible to optimize this any further than it already is.
If you're collecting daily/monthly/yearly stats, as I'm assuming you are doing, one option (after upgrading, of course) is a with statement and the relevant joins, e.g.:
with daily_stats as (
(what you posted)
),
monthly_stats as (
(what you posted monthly)
),
etc.
select daily_stats.stats,
monthly_stats.stats,
etc.
stats
left join yearly_stats on ...
left join monthly_stats on ...
left join daily_stats on ...
However, that will actually perform less well than running each query separately in a production environment, because you'll introduce left joins in the DB which could be done just as well in the middleware (i.e. show daily, then monthly, then yearly and finally lifetime stats). (If not better, since you'll be avoiding full table scans.)
By keeping things as is, you'll save precious DB resources for reads and writes on actual data. The tradeoff (less network traffic between your database and your app) is almost certainly not worth it.
Yikes! Don't do this!!! Not because you can't do what you're asking, but because you probably shouldn't be doing what you're asking in this manner. I'm guessing the reason you've got date_field in your example is because you've got a date_field attached to a user or some other meta-data.
Think about it: you are asking PostgreSQL to scan 100% of the records relevant to a given user. Unless this is a one-time operation, you almost assuredly do not want to do this. If this is a one-time operation and you are planning on caching this value as a meta-data, then who cares about the optimizations? Space is cheap and will save you heaps of execution time down the road.
You should add 4x per-user (or whatever it is) meta-data fields that help sum up the data. You have two options, I'll let you figure out how to use this so that you keep historical counts, but here's the easy version:
CREATE TABLE user_counts_only_keep_current (
user_id , -- Your user_id
lifetime INT DEFAULT 0,
yearly INT DEFAULT 0,
monthly INT DEFAULT 0,
daily INT DEFAULT 0,
last_update_utc TIMESTAMP WITH TIME ZONE,
FOREIGN KEY(user_id) REFERENCES "user"(id)
);
CREATE UNIQUE INDEX this_tbl_user_id_udx ON user_counts_only_keep_current(user_id);
Set up some stored procedures that zero out individual columns if last_update_utc doesn't match the current day according to NOW(). You can get creative from here, but incrementing records like this is going to be the way to go.
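The rollover logic such a stored procedure would need can be sketched outside the database first. The following Python sketch is only an illustration: the function name and the dictionary layout are made up, and it assumes both timestamps are in UTC and that each call records exactly one new row.

```python
from datetime import datetime, timezone

def bump_counts(counts: dict, last_update: datetime, now: datetime) -> dict:
    """Zero any period counter whose period has rolled over, then count one new record."""
    updated = dict(counts)
    if last_update.year != now.year:
        updated["yearly"] = 0
    if (last_update.year, last_update.month) != (now.year, now.month):
        updated["monthly"] = 0
    if last_update.date() != now.date():
        updated["daily"] = 0
    for key in ("lifetime", "yearly", "monthly", "daily"):
        updated[key] += 1
    return updated

before = {"lifetime": 10, "yearly": 7, "monthly": 4, "daily": 2}
last = datetime(2024, 1, 31, 23, 0, tzinfo=timezone.utc)
now = datetime(2024, 2, 1, 0, 5, tzinfo=timezone.utc)
print(bump_counts(before, last, now))
# {'lifetime': 11, 'yearly': 8, 'monthly': 1, 'daily': 1}
```

Reading the four counters is then a single-row lookup by user_id instead of a scan over that user's records.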
Handling time series data in any relational database requires special care and maintenance. Look into PostgreSQL's table inheritance if you want good temporal data management... but really, don't do whatever it is you're about to do to your application, because it's almost certainly going to result in bad things(tm).