SQL Server BETWEEN - sql

I have a table with Year and Month columns and a few numeric columns:
Year  Month  Total
2011  10     100
2011  11     150
2011  12     100
2012  01     50
2012  02     200
Now I want to SELECT the rows between November 2011 and February 2012. Note that I want the query to use a range, just as if I had a date column in the table.

Coming up with a way to use BETWEEN with the table as it is will work, but it will perform worse in every case:
At best, it will consume more CPU to do some kind of calculation on the rows instead of working with them as dates.
At worst, it will force a scan of every row in the table. If your columns are indexed, the right query can seek instead, and that can be a HUGE performance difference: wrapping the columns in a computed expression inside a BETWEEN clause prevents the index from being used.
I suggest the following instead if you have an index on your date columns and care at all about performance:
DECLARE
    @FromDate date = '20111101',
    @ToDate date = '20120201';
SELECT *
FROM dbo.YourTable T
WHERE
    (
        T.[Year] > Year(@FromDate)
        OR (
            T.[Year] = Year(@FromDate)
            AND T.[Month] >= Month(@FromDate)
        )
    ) AND (
        T.[Year] < Year(@ToDate)
        OR (
            T.[Year] = Year(@ToDate)
            AND T.[Month] <= Month(@ToDate)
        )
    );
However, it is understandable that you don't want to use such a construction, as it is very awkward. So here is a compromise query that at least uses numeric computation and will use less CPU than a date-to-string conversion (though not enough less to make up for the forced scan, which is the real performance problem).
SELECT *
FROM dbo.YourTable T
WHERE
T.[Year] * 100 + T.[Month] BETWEEN 201111 AND 201202;
If you have an index on Year, you can get a big boost by submitting the query as follows, which has the opportunity to seek:
SELECT *
FROM dbo.YourTable T
WHERE
T.[Year] * 100 + T.[Month] BETWEEN 201111 AND 201202
AND T.[Year] BETWEEN 2011 AND 2012; -- allows use of an index on [Year]
While this breaks your requirement of using a single BETWEEN expression, it is not too much more painful and will perform very well with the Year index.
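As a quick sanity check, the three WHERE clauses above can be run against the sample rows from the question to confirm that they select the same months. This is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for SQL Server, so it says nothing about seeks versus scans, and the literal bounds replace the date variables.

```python
import sqlite3

# In-memory SQLite database standing in for dbo.YourTable (assumption: same columns).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE YourTable ("Year" INTEGER, "Month" INTEGER, Total INTEGER)')
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [(2011, 10, 100), (2011, 11, 150), (2011, 12, 100), (2012, 1, 50), (2012, 2, 200)],
)

SELECT = 'SELECT "Year", "Month", Total FROM YourTable WHERE '
ORDER = ' ORDER BY "Year", "Month"'

# 1. The explicit comparison that can use an index on (Year, Month).
explicit = (
    '("Year" > 2011 OR ("Year" = 2011 AND "Month" >= 11)) '
    'AND ("Year" < 2012 OR ("Year" = 2012 AND "Month" <= 2))'
)
# 2. The compromise: one BETWEEN over a computed YYYYMM number.
composite = '"Year" * 100 + "Month" BETWEEN 201111 AND 201202'
# 3. The compromise plus a plain range on Year.
composite_with_year = composite + ' AND "Year" BETWEEN 2011 AND 2012'

rows_explicit = conn.execute(SELECT + explicit + ORDER).fetchall()
rows_composite = conn.execute(SELECT + composite + ORDER).fetchall()
rows_composite_with_year = conn.execute(SELECT + composite_with_year + ORDER).fetchall()

print(rows_explicit)
print(rows_explicit == rows_composite == rows_composite_with_year)
```

All three forms return November 2011 through February 2012 and leave out October 2011; the difference between them is purely in how the engine can execute them.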
You can also change your table. Frankly, using separate numbers for your date parts instead of a single column with a date data type is not good. The reason it isn't good is because of the exact issue you are facing right now--it is very hard to query.
In some data warehousing scenarios where saving bytes matters a lot, I could envision situations where you might store the date as a number (such as 201111) but that is not recommended. The best solution is to change your table to use dates instead of splitting out the numeric value of the month and the year. Simply store the first day of the month, recognizing that it stands in for the entire month.
If changing the way you use these columns is not an option but you can still change your table, then you can add a persisted computed column:
ALTER TABLE dbo.YourTable
ADD ActualDate AS (DateAdd(year, [Year] - 1900, DateAdd(month, [Month], '18991201')))
PERSISTED;
With this you can just do:
SELECT *
FROM dbo.YourTable
WHERE
ActualDate BETWEEN '20111101' AND '20120201';
The PERSISTED keyword means that while you still will get a scan, it won't have to do any calculation on each row since the expression is calculated on each INSERT or UPDATE and stored in the row. But you can get a seek if you add an index on this column, which will make it perform very well (though all in all, this is still not as ideal as changing to use an actual date column, because it will take more space and will affect INSERTs and UPDATEs):
CREATE NONCLUSTERED INDEX IX_YourTable_ActualDate ON dbo.YourTable (ActualDate);
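The nested DateAdd expression in the computed column is easier to trust once the arithmetic is traced by hand. The sketch below mirrors it in Python; add_months and actual_date are hypothetical helper names used only for this illustration and are not SQL Server functions.

```python
from datetime import date


def add_months(d: date, months: int) -> date:
    """Add whole months to a date, keeping the day of month (safe for day 1)."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, d.day)


def actual_date(year: int, month: int) -> date:
    # Inner DateAdd: 1899-12-01 plus [Month] months lands on the 1st of that month in 1900.
    in_1900 = add_months(date(1899, 12, 1), month)
    # Outer DateAdd: shift by ([Year] - 1900) whole years.
    return add_months(in_1900, (year - 1900) * 12)


print(actual_date(2011, 11))
print(actual_date(2012, 2))
```

Starting from December 1899 is what makes month 1 land on January 1900, so each Year/Month pair maps to the first day of that month.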
Summary: if you truly can't change the table in any way, then you are going to have to make a compromise in some way. It will not be possible to get the simple syntax you want that will also perform well, when your dates are stored split into separate columns.

(Year > @FromYear OR Year = @FromYear AND Month >= @FromMonth)
AND (Year < @ToYear OR Year = @ToYear AND Month <= @ToMonth)

Your example table seems to indicate that there's only one record per year and month (if it's really a summary-by-month table). If that's so, you're likely to accrue very little data in the table even over several decades of activity. The concatenated expression solution will work and performance (in this case) won't be an issue:
SELECT * FROM Table WHERE ((Year * 100) + Month) BETWEEN 201111 AND 201202
If that's not the case and you really have a large number of records in the table (more than a few thousand records), you have a couple of choices:
Change your table to store year and month in the format YYYYMM (either as an integer value or text). This column can replace your current Year and Month columns or be in addition to them (although this breaks normal form). Index this column and query against it.
Create a separate table with one record per year and month and also the indexable column as described above. In your query, JOIN this table back to the source table and perform your query against the indexed column in the smaller table.

Related

What is the fastest way to perform a date query in Oracle SQL?

We have a 6B row table that is giving us challenges when retrieving data.
Our query returns values instantly when doing a...
SELECT * WHERE Event_Code = 102225120
That type of instant result is exactly what we need. We now want to filter to receive values for just a particular year - but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row. There are other approaches like using TRUNC, or BETWEEN, or specifying the datetime in YYYY-MM-DD format for doing comparisons.
Of note, we do not have the option to add indexes to the database as it is a vendor's database.
What is the way to add a date filtering query and enable Oracle to begin streaming the results back in the fastest way possible?
Another SO post mentions that indexes don't necessarily help date queries when pulling many rows as opposed to an individual row
That question is quite different from yours. Firstly, the statement above applies to any data type, not only dates. Also, the word "many" is relative to the number of records in the table. If the optimizer estimates that the query will return a large share of the records in your table, it may decide that a full scan of the table is faster than using the index. In your situation, this translates to: how many records fall in 2017 out of all records in the table? That ratio is the selectivity of your predicate, and it gives you an idea of whether an index will be faster or not.
Now, if you decide that an index will be faster, based on the above, the next step is to know how to build your index. In order for the optimizer to use the index, it must match the condition that you're using. You are not comparing dates in your query, you are only comparing the year part. So an index on the date column will not be used by this query. You need to create an index on the year part, so use the same condition to create the index.
we do not have the option to add indexes to the database as it is a vendor's database.
If you cannot modify the database, there is no way to optimize your query. You need to talk to the vendor and get access to modify the database or ask them to add the index for you.
A function applied to every row can also cause slowness with this many records. I am not sure whether a function-based index can help you here, but you can try.
Have you tried adding a year column to the table? If not, try adding one and populating it with the code below.
UPDATE table
SET year = EXTRACT(YEAR FROM PERFORMED_DATE_TIME);
This will take time, though.
After that, you can run the query below.
SELECT *
FROM table
WHERE Event_Code = 102225120 AND year = 2017;
Also, consider table partitioning for data of this size. For starters, see the link below:
link: https://oracle-base.com/articles/8i/partitioned-tables-and-indexes
Your question is a bit ambiguous IMHO:
but the moment we add...
AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
...the query takes over 10 minutes to begin returning any values.
Do you mean that
SELECT * WHERE Event_Code = 102225120
is fast, but
SELECT * WHERE Event_Code = 102225120 AND EXTRACT(YEAR FROM PERFORMED_DATE_TIME) = 2017
is slow???
For starters, I'll agree with Mitch Wheat that you should try to use PERFORMED_DATE_TIME between Jan 1, 2017 and Dec 31, 2017 instead of Year(field) = 2017. Even if you had an index on the field, the latter could hardly make use of it, while the former would benefit enormously.
I'm also hoping you want to be more specific than just 'give me all of 2017' because returning over 1B rows is NEVER going to be fast.
Next, if you can't make changes to the database, would you be able to maintain a 'shadow' in another database? This would require that you create a table with all date-values AND the PK of the original table in another database and query those to find the relevant PK values and then JOIN those back to your original table to find whatever you need. The biggest problem with this would be that you need to keep the shadow in sync with the original table. If you know the original table only changes overnight, you could merge the changes in the morning and query all day. If the application is 'real-time(ish)' then this probably won't work without some clever thinking... And yes, your initial load of 6B values will be rather heavy =)
Maybe this could be useful (because you avoid functions, which are a cause of context switching, and if you have an index on your date field, it could be used):
with
dt as
(
select
to_date('01/01/2017', 'DD/MM/YYYY') as d1,
to_date('31/01/2017', 'DD/MM/YYYY') as d2
from dual
),
dates as
(
select
dt.d1 + rownum -1 as d
from dt
connect by dt.d1 + rownum -1 <= dt.d2
)
select *
from your_table, dates
where dates.d = PERFORMED_DATE_TIME
Move the date literal to RHS:
AND PERFORMED_DATE_TIME >= date '2017-01-01'
AND PERFORMED_DATE_TIME < date '2018-01-01'
But without an (undisclosed) appropriate index on PERFORMED_DATE_TIME, the query is unlikely to be any faster.
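The difference between filtering on a function of the column and filtering on a half-open range of the column itself can be seen in a small experiment. The sketch below uses Python's built-in sqlite3 module as a stand-in, so the plan output is SQLite's and not Oracle's; the events table, the ix_events_dt index and the sample rows are made up for the illustration, and strftime plays the role of EXTRACT(YEAR FROM ...).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_code INTEGER, performed_date_time TEXT)")
conn.execute("CREATE INDEX ix_events_dt ON events (performed_date_time)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (1, "2016-12-31 23:59:59"),
        (1, "2017-01-01 00:00:00"),
        (1, "2017-12-31 18:30:00"),
        (1, "2018-01-01 00:00:00"),
    ],
)

# Function wrapped around the column: the index on the raw column cannot be searched.
by_function = "SELECT * FROM events WHERE strftime('%Y', performed_date_time) = '2017'"
# Half-open range on the bare column: the literals sit on the right-hand side.
by_range = (
    "SELECT * FROM events "
    "WHERE performed_date_time >= '2017-01-01' AND performed_date_time < '2018-01-01'"
)


def plan(sql):
    # The fourth column of EXPLAIN QUERY PLAN holds the human-readable step.
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


rows_function = sorted(conn.execute(by_function).fetchall())
rows_range = sorted(conn.execute(by_range).fetchall())
plan_function = plan(by_function)
plan_range = plan(by_range)

print(rows_function == rows_range)
print(plan_function)
print(plan_range)
```

Both queries return the same 2017 rows, including the one late on December 31, but only the range form is reported as an index search. Oracle's optimizer makes its own decisions, so treat this as an illustration of the principle and not a prediction of the vendor database's plan.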
One option to create indexes in third party databases is to script in the index and then before any vendor upgrade run a script to remove any indexes you've added. If the index is important, ask the vendor to add it to their database design.

Is hive partitioning hierarchical in nature?

Say we have a table partitioned as:-
CREATE EXTERNAL TABLE MyTable (
col1 string,
col2 string,
col3 string
)
PARTITIONED BY(year INT, month INT, day INT, hour INT, combination_id BIGINT);
Now obviously year is going to store the year value (e.g. 2016), month will store the month value (e.g. 7), day will store the day (e.g. 18) and hour will store the hour in 24-hour format (e.g. 13). And combination_id is going to be the combination of the padded values for all of these (a single-digit value is padded with 0 on the left). So in this case, for example, the combination id is 2016071813.
So we fire a query (let's call it Query A):-
select * from mytable where combination_id = 2016071813
Now Hive doesn't know that combination_id is actually combination of year,month,day and hour. So will this query not take proper advantage of partitioning?
In other words, if I have another query, call it Query B, will this be more optimal than query A or there is no difference?:-
select * from mytable where year=2016 and month=7 and day=18 and hour=13
If Hive partitioning scheme is really hierarchical in nature then Query B should be better from performance point of view is what I am thinking. Actually I want to decide whether to get rid of combination_id altogether from partitioning scheme if it is not contributing to better performance at all.
The only real advantage for using combination id is to be able to use BETWEEN operator in select:-
select * from mytable where combination_id between 2016071813 and 2016071823
But if this is not going to take advantage of partitioning scheme, it is going to hamper performance.
Yes. Hive partitioning is hierarchical.
You can simply check this by printing the partitions of the table using below query.
show partitions MyTable;
Output:
year=2016/month=5/day=5/hour=5/combination_id=2016050505
year=2016/month=5/day=5/hour=6/combination_id=2016050506
year=2016/month=5/day=5/hour=7/combination_id=2016050507
In your scenario, you don't need to specify combination_id as partition column if you are not using for querying.
You can partition either by
Year, month, day, hour columns
or
combination_id only
Partitioning by multiple columns helps performance in grouping operations.
Say you want to find the maximum of col1 for the month of March in the years 2015 and 2016.
Hive can fetch the records by going straight to the specific year partitions (year=2015 and year=2016) and the month partition (month=3).
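The relationship between the two predicates in the question can be illustrated with a toy model of the partition directory names. This is not Hive and says nothing about how its planner prunes; partition_spec and partition_path are hypothetical helpers that only reproduce the naming scheme printed by show partitions.

```python
from datetime import datetime, timedelta


def partition_spec(ts):
    # Partition column values for one hour, in the order declared in PARTITIONED BY.
    return {
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "combination_id": int(ts.strftime("%Y%m%d%H")),
    }


def partition_path(spec):
    # Directory name in the style printed by "show partitions".
    return "/".join(f"{name}={value}" for name, value in spec.items())


# Two days of hourly partitions, starting at 2016-07-18 00:00 (made-up range).
start = datetime(2016, 7, 18, 0)
partitions = [partition_spec(start + timedelta(hours=h)) for h in range(48)]

# Query A: filter on combination_id only.
query_a = [p for p in partitions if p["combination_id"] == 2016071813]
# Query B: filter on the four separate columns.
query_b = [
    p for p in partitions
    if (p["year"], p["month"], p["day"], p["hour"]) == (2016, 7, 18, 13)
]
# The BETWEEN form from the question.
between = [p for p in partitions if 2016071813 <= p["combination_id"] <= 2016071823]

print(partition_path(query_a[0]))
print(query_a == query_b, len(between))
```

In this model both predicates single out the same directory, because combination_id is itself a partition column whose value is fully determined by the other four.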

Could adding a column with a month identifier increase query performance?

The situation is as follows. There is a table with about 40 000 000 rows per month times 24 months, so let's say almost 1 000 000 000 rows. Each row has a timestamp column with an index created on it.
The most frequent queries are the ones that aggregate data for a specific month, for example January 2016. If we assign a separate identifier to every month, let's call it "idm", make it equal 1 for January 2016 (February 2016 = 2 and so on), and create an index on idm, would it have any effect on query performance when comparing these WHERE clauses:
timestamp >= '20160101' AND timestamp < '20160201'
idm = 1
?
Would using idm be faster?
If you have an index on timestamp and on the proposed idm column, then the two would probably be identical. This is an approximate answer. If you have other conditions in the where clause, then idm = 1 is better for performance, because an equality predicate gives the optimizer more ways to combine it with other conditions.
However, indexes are not the right approach. Because of the nature of your data and queries, you should consider table partitions. This would allow each month of data to be stored separately. You can read about table partitioning here.
If you don't want to partition the table, I would recommend making idm or timestamp a clustered index. This will help queries even when the where clause selects a relatively high proportion of the rows in the table.
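For reference, the idm numbering described in the question is a simple function of the timestamp, and the two WHERE clauses select the same rows. A minimal sketch, assuming idm = 1 means January 2016 as stated in the question; the function name idm and the sample timestamps are made up for the illustration.

```python
from datetime import datetime

BASE_YEAR = 2016  # assumption taken from the question: idm = 1 is January 2016


def idm(ts):
    # Months are numbered consecutively: Jan 2016 -> 1, Feb 2016 -> 2, Jan 2017 -> 13.
    return (ts.year - BASE_YEAR) * 12 + ts.month


samples = [
    datetime(2015, 12, 31, 23, 59),
    datetime(2016, 1, 1),
    datetime(2016, 1, 31, 23, 59, 59),
    datetime(2016, 2, 1),
    datetime(2017, 1, 15),
]

# timestamp >= '20160101' AND timestamp < '20160201'
by_range = [t for t in samples if datetime(2016, 1, 1) <= t < datetime(2016, 2, 1)]
# idm = 1
by_idm = [t for t in samples if idm(t) == 1]

print(by_range == by_idm)
```

Since the two predicates are logically equivalent, any performance difference comes from how the engine can use the index or partition, not from which rows are selected.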

MS Access SQL to Select date range

I need to select records whose dates fall in a range (from 1998 to 1999). I wrote the statement below, which did not seem to work. Why?
SELECT *
FROM Factory
WHERE
(EXTRACT(YEAR FROM date) AS dyear) BETWEEN '1998' AND '1999'
You can use YEAR() to get the year from the date.
SELECT *
FROM Factory
WHERE YEAR(date) BETWEEN 1998 AND 1999
MSAccess YEAR()
Applying the Year() function for every row in Factory will be a noticeable performance challenge if the table includes thousands of rows. (Actually it would be a performance challenge for a smaller table, too, but you would be less likely to notice the hit in that case.) A more efficient approach would be to index the [date] field and use indexed retrieval to limit the db engine's workload.
SELECT f.*
FROM Factory AS f
WHERE f.date >= #1998-1-1# AND f.date < #2000-1-1#;
Whenever possible, design your queries to take advantage of indexed retrieval. That can improve performance dramatically. As a simplistic rule of thumb: indexed retrieval = good; full table scan = bad. Try to avoid full table scans whenever possible.

What is the fastest way to select rows from a huge database?

I have a huge database of more than 3 million rows (my users' information). I need to select all users that have birthdays on the current day.
The birthday column is text (e.g. '19/03' or '19/03/1975') with the day and month and sometimes the year.
When I try to select rows with the LIKE or LEFT functions, it takes more than a minute to return the results.
I've tried using 3 int columns for day, month and year and then making the selection, but it took longer to get the results.
Any ideas on how to make it run faster?
I'm using SQL Server 2008
Thanks
As marc_s mentions, if at all possible, store this as a date type - it'll make it way faster for SQL Server to perform comparisons on, and it'll be way easier to maintain. Next up, make sure to put an index on that column, and consider including any extra columns if you're only looking up the birthday to select a small subset of the total row.
Finally - and this is a big one. TEXT is just about the worst data type you could choose. The way TEXT is stored, the data isn't actually stored on the page itself. Instead it leaves behind a 16-byte pointer to another page. This other page will then contain the data itself in a record. But it gets worse, that record will be a SMALL_ROOT datatype taking up 84 bytes of space when your data is between 0 and 64 bytes in length!
Thus, what could've been saved as an 8-byte datetime or a 4-byte date now takes up a total of 100 bytes, and causes an off-row lookup for each and every row. Basically the perfect storm for bad performance.
If you cannot change it to a more proper datetime, at the very least, change it to a varchar!
First of all, save the date in a type that SQL Server supports, such as DATE or DATETIME (in your case I am guessing DATE should be enough). Once you have that, you can use SQL functions like MONTH and DAY as follows and avoid complex string manipulation functions like LEFT.
Your query will look like this:
SELECT * FROM MyTable WHERE MONTH(dateColumnA) = 1 AND DAY(dateColumnA) = 7 -- 1 is for January
I am not sure if this will solve your performance problems entirely, but you can run this query in SQL Query Analyzer and see what recommendations it makes with respect to indexes. I don't have a great deal of knowledge about indexes on date columns.
Most of what I had to say has already been said: Use a DATE type to store the date, and make sure that it is indexed. If you're going to use the three integers to store the date and search by that, then make sure that they're indexed as well:
CREATE INDEX IX_MyTable_Date_Ints ON MyTable(intYear, intMonth, intDay)
CREATE INDEX IX_MyTable_Date ON MyTable(BirthDate)
If you want to be able to search the user table for birthdays excluding the year, I would recommend storing the birthday in a different date field, using a fixed year, e.g. 3004, instead of using three integers. Your base year should be a leap year, to cater for anyone who may have been born on 29 February. If you use a year far in the future, you can use the year to determine that a date is effectively a date for which the year should be disregarded.
Then you can search for the birthday, regardless of the year, without having to do a function call on each record, by adding WHERE birth_day = '3004-12-10'. If this field is indexed, you should be able to return all matching rows in a flash. You need to bear in mind that when searching an index, the server will need to do a maximum of 32 comparisons to find a match in 4 billion records. Never underestimate the benefits of indexing!
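The figure of 32 comparisons follows from halving the search space at every step. The sketch below treats the index lookup as a plain binary search over sorted keys, which is a simplification of how a B-tree index actually works; max_comparisons is a hypothetical helper name.

```python
def max_comparisons(row_count):
    # Worst case for binary search is floor(log2(n)) + 1 comparisons,
    # which for a positive integer equals its bit length.
    return row_count.bit_length()


print(max_comparisons(4_000_000_000))  # 32
print(max_comparisons(3_000_000))      # 22
```

For the 3 million rows in the question, the same reasoning gives about 22 comparisons, against up to 3 million when every row has to be inspected.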
I would be inclined to maintain the birthday through a trigger, so that it keeps itself updated. For those birth dates where you don't have the year, just use your base year (3004). Since your base year is in the future, you know that this birth date doesn't have a year.
CREATE TABLE MyTable (
MyTable_key INT IDENTITY(1, 1),
username VARCHAR(30),
birth_date DATE,
birth_day DATE
)
ALTER TABLE MyTable ADD CONSTRAINT PK_MyTable PRIMARY KEY CLUSTERED (MyTable_key)
CREATE INDEX MyTable_birth_date ON MyTable(birth_date)
CREATE INDEX MyTable_birth_day ON MyTable(birth_day)
GO
CREATE TRIGGER tr_MyTable_calc_birth_day ON MyTable AFTER INSERT, UPDATE AS
UPDATE t SET birth_day = DATEADD(YEAR, 3004-DATEPART(YEAR, t.birth_date), t.birth_date)
FROM MyTable t, inserted i WHERE i.MyTable_key = t.MyTable_key
To update your existing table, run the update as a standalone query, without the join to the inserted table as it was used in the trigger:
UPDATE MyTable SET birth_day = DATEADD(YEAR, 3004-DATEPART(YEAR, birth_date), birth_date)
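The effect of the DATEADD expression used in the trigger and in the update can be checked outside the database. This sketch mirrors it in Python under the same assumption of 3004 as the base year; birth_day and has_known_year are hypothetical helper names, not part of the schema above.

```python
from datetime import date

BASE_YEAR = 3004  # the fixed leap year suggested above


def birth_day(birth_date):
    # Same month and day, moved into the base year (mirrors the DATEADD expression).
    return birth_date.replace(year=BASE_YEAR)


def has_known_year(birth_date):
    # A birth_date stored with the base year means the real year is unknown.
    return birth_date.year != BASE_YEAR


print(birth_day(date(1975, 3, 19)))
print(birth_day(date(1976, 2, 29)))
print(has_known_year(date(3004, 3, 19)))
```

Because 3004 is a leap year, 29 February survives the move; with a non-leap base year that date would not exist.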
Hope this helps.
Try using a ResultSet instead of a DataTable or DataSet. A ResultSet is faster than either of these.