Object with starts_on and ends_on best query suggestion - sql

I need to select from a table that has a starts_on field and an ends_on field.
I need to pass start date and end date for filtering and retrieving those objects.
At the moment it's working and I use the following:
SELECT * FROM ***
WHERE ((starts_on >= START_DATE AND starts_on <= END_DATE) OR
(ends_on >= START_DATE AND ends_on <= END_DATE) OR
(starts_on <= END_DATE AND ends_on >= END_DATE))
ORDER BY starts_on, id
It looks a bit messy, but can't see an easy way to simplify it. Any idea?
I'm using postgres 9.1 as dbms.
Edit:
starts_on | timestamp without time zone |
ends_on | timestamp without time zone |
Ex: if one entry has starts_on = '2012/02/02' and ends_on = '2012/02/05', I want the following behavior:
if I filter by start date 2012/01/01 and end date 2012/03/01 I want the item to be returned
if I filter by start date 2012/02/04 and end date 2012/03/01 I want the item to be returned
if I filter by start date 2012/02/05 and end date 2012/03/01 I want the item to be returned
if I filter by start date 2012/02/04 and end date 2012/02/04 I want the item to be returned
if I filter by start date 2012/02/06 and end date 2012/03/01 I want the item to NOT be returned
if I filter by start date 2012/01/01 and end date 2012/02/01 I want the item to NOT be returned

Query
If you want all rows where the time period between starts_on and ends_on overlaps with the passed time period of START_DATE and END_DATE, and "end" is always later than "start", and all involved columns are of type timestamp (as opposed to time or date), this simpler query does the job:
SELECT *
FROM tbl
WHERE starts_on <= END_DATE
AND ends_on >= START_DATE
ORDER BY starts_on, id;
Fits the question as later clarified.
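As a quick sanity check, the sketch below runs the simplified predicate against the example from the question's edit. The CTE names item and filters and the expected column are made up for this illustration; only the comparison logic comes from the query above.

```sql
-- Hypothetical check: one sample period tested against the six filters from the question.
WITH item(starts_on, ends_on) AS (
    VALUES (timestamp '2012-02-02', timestamp '2012-02-05')
),
filters(start_date, end_date, expected) AS (
    VALUES (timestamp '2012-01-01', timestamp '2012-03-01', true),
           (timestamp '2012-02-04', timestamp '2012-03-01', true),
           (timestamp '2012-02-05', timestamp '2012-03-01', true),
           (timestamp '2012-02-04', timestamp '2012-02-04', true),
           (timestamp '2012-02-06', timestamp '2012-03-01', false),
           (timestamp '2012-01-01', timestamp '2012-02-01', false)
)
SELECT f.start_date,
       f.end_date,
       f.expected,
       -- same two comparisons as the simplified WHERE clause
       (i.starts_on <= f.end_date AND i.ends_on >= f.start_date) AS returned
FROM   filters f
CROSS  JOIN item i
ORDER  BY f.start_date, f.end_date;
```

If the predicate matches the requested behavior, returned equals expected in every row.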
Index
The best index for this query would be a multi-column index like:
CREATE INDEX tbl_range_idx ON tbl (starts_on, ends_on DESC)
Would work with DESC / ASC almost as well, because an index can be searched in both directions almost equally well.
How do I figure?
The index is searched on the first condition starts_on <= END_DATE, qualifying rows are at the beginning.
From there, Postgres can take all rows that end late enough according to ends_on >= START_DATE. Qualifying rows come first. Optimal index.
But don't just take my word - test performance with EXPLAIN ANALYZE. Run a couple of times to exclude caching effects.
There is also the OVERLAPS operator for the same purpose. Simplifies the logic, but isn't superior otherwise.
And there are the new range types in PostgreSQL 9.2, with their own operators. Not for 9.1 though.

Related

sql query to get today new records compared with yesterday

i have this table:
COD (Integer) (PK)
ID (Varchar)
DATE (Date)
I just want to get the new IDs from today, compared with yesterday (the IDs from today that are not present yesterday).
This needs to be done with just one query and maximum efficiency, because the table will have 4-5 million records.
As a Java developer I am able to do this with 2 queries, but doing it with just one is beyond my knowledge, so any help would be much appreciated.
EDIT: the date format is dd/mm/yyyy, and each ID may appear 0 or 1 times per day.
Here is a solution that will go over the base data one time only. It selects the id and the date where the date is either yesterday or today (or both). Then it GROUPS BY id - each group will have either one or two rows. Then it filters by the condition that the MIN date in the group is "today". Those are the id's that exist today but did not exist yesterday.
DATE is an Oracle keyword, best not used as a column name. I changed that to DT. I also assume that your "dt" field is a pure date (as pure as it can be in Oracle, meaning: time of day, which is always present, is 00:00:00).
select id
from your_table
where dt in (trunc(sysdate), trunc(sysdate) - 1)
group by id
having min(dt) = trunc(sysdate)
;
Edit: Gordon makes a good point: perhaps you may have more than one such row per ID, in the same day? In that case the time-of-day may also be different from 00:00:00.
If so, the solution can be adapted:
select id
from your_table
where dt >= trunc(sysdate) - 1 and dt < trunc(sysdate) + 1
group by id
having min(dt) >= trunc(sysdate)
;
Either way: (1) the base table is read just once; (2) the column DT is not wrapped within any function, so if there is an index on that column, it can be used to access just the needed rows.
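To see how the GROUP BY / HAVING filter behaves, here is a small sketch in Oracle syntax that swaps the real table for an inline CTE. The sample ids 'A', 'B' and 'C' are invented for the illustration, and the column list on the CTE assumes Oracle 11gR2 or later.

```sql
-- Hypothetical sample rows: A exists on both days, B only today, C only yesterday.
WITH your_table (id, dt) AS (
    SELECT 'A', TRUNC(SYSDATE) - 1 FROM dual UNION ALL
    SELECT 'A', TRUNC(SYSDATE)     FROM dual UNION ALL
    SELECT 'B', TRUNC(SYSDATE)     FROM dual UNION ALL
    SELECT 'C', TRUNC(SYSDATE) - 1 FROM dual
)
SELECT id
FROM   your_table
WHERE  dt IN (TRUNC(SYSDATE), TRUNC(SYSDATE) - 1)
GROUP  BY id
-- the earliest date in the group is today, so the id has no row for yesterday
HAVING MIN(dt) = TRUNC(SYSDATE);
```

Only 'B' should come back: for 'A' and 'C' the minimum date in the group is yesterday.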
The typical method would use not exists:
select t.*
from t
where t.date >= trunc(sysdate) and t.date < trunc(sysdate + 1) and
not exists (select 1
from t t2
where t2.id = t.id and
t2.date >= trunc(sysdate - 1) and t2.date < trunc(sysdate)
);
This is a general solution. If you know that there is at most one record per day, there are better solutions, such as using lag().
Use MINUS. I suppose your date column has a time part, so you need to truncate it.
select id from mytable where trunc(date) = trunc(sysdate)
minus
select id from mytable where trunc(date) = trunc(sysdate) - 1;
I suggest the following function-based index. Without it, the query would have to scan the full table, which would probably be quite slow.
create index idx on mytable (trunc(date), id);

Postgresql query between date ranges

I am trying to query my postgresql db to return results where a date is in certain month and year. In other words I would like all the values for a month-year.
The only way I've been able to do it so far is like this:
SELECT user_id
FROM user_logs
WHERE login_date BETWEEN '2014-02-01' AND '2014-02-28'
Problem with this is that I have to calculate the first date and last date before querying the table. Is there a simpler way to do this?
Thanks
With dates (and times) many things become simpler if you use >= start AND < end.
For example:
SELECT
user_id
FROM
user_logs
WHERE
login_date >= '2014-02-01'
AND login_date < '2014-03-01'
In this case you still need to calculate the start date of the month you need, but that should be straightforward in any number of ways.
The end date is also simplified; just add exactly one month. No messing about with 28th, 30th, 31st, etc.
This structure also has the advantage of being able to maintain use of indexes.
Many people may suggest a form such as the following, but it cannot use a plain index on login_date:
WHERE
date_part('year', login_date) = 2014
AND date_part('month', login_date) = 2
This involves calculating the conditions for every single row in the table (a scan) instead of using an index to find the range of rows that will match (a range-seek).
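A minimal sketch of the half-open pattern with the end bound derived from the start, so that only the first day of the month has to be supplied. It assumes login_date is of type date and reuses the user_logs table from the question; the literal is just the example month.

```sql
SELECT user_id
FROM   user_logs
WHERE  login_date >= date '2014-02-01'
       -- date + interval yields a timestamp, so cast back to date for a same-type comparison
AND    login_date <  (date '2014-02-01' + interval '1 month')::date;
```

Because the column itself is left untouched on the left-hand side, a plain index on login_date remains usable.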
From PostgreSQL 9.2, range types are supported. So you can write this like:
SELECT user_id
FROM user_logs
WHERE '[2014-02-01, 2014-03-01)'::daterange @> login_date
This should be more efficient than the string comparison.
Just in case somebody lands here... since 8.1 you can simply use:
SELECT user_id
FROM user_logs
WHERE login_date BETWEEN SYMMETRIC '2014-02-01' AND '2014-02-28'
From the docs:
BETWEEN SYMMETRIC is the same as BETWEEN except there is no
requirement that the argument to the left of AND be less than or equal
to the argument on the right. If it is not, those two arguments are
automatically swapped, so that a nonempty range is always implied.
SELECT user_id
FROM user_logs
WHERE login_date BETWEEN '2014-02-01' AND '2014-03-01'
The BETWEEN keyword behaves specially with dates: a date given without a time is assumed to be at 00:00:00 (i.e. midnight).
Read the documentation.
http://www.postgresql.org/docs/9.1/static/functions-datetime.html
I used a query like that:
WHERE
(
date_trunc('day',table1.date_eval) = '2015-02-09'
)
or
WHERE (date_trunc('day', table1.date_eval) >= '2015-02-09' AND date_trunc('day', table1.date_eval) < '2015-02-10')

Efficient way of counting a large content from a column or two in a database using selected time period

I need to list the number of column1 entries that have been added to the database over a selected time period (counted back from the day the list is requested): daily, weekly (last 7 days), monthly (last 30 days) and quarterly (last 3 months). For example, below is the table I created to perform this task.
Column   | Type                        | Modifiers
---------+-----------------------------+--------------------------
column1  | character varying(256)      | not null default nextval
date     | timestamp without time zone | not null default now()
column2  | character varying(256)      | ..........
Now, I need the total count of entries in column1 for the selected time period.
For example:
Column1         | Date                       | Column2
----------------+----------------------------+--------------
abcdef          | 2013-05-12 23:03:22.995562 | 122345rehr566
njhkepr         | 2013-04-10 21:03:22.337654 | 45hgjtron
ffb3a36dce315a7 | 2013-06-14 07:34:59.477735 | jkkionmlopp
abcdefgggg      | 2013-05-12 23:03:22.788888 | 22345rehr566
From the above data, for the daily time period the count should be 2.
I have tried doing this query
select count(column1) from table1 where date='2012-05-12 23:03:22';
and got exactly one record matching the timestamp. But I believe this is not a proper or efficient way of retrieving the count. It would be great if anyone could help me learn the right, efficient way of writing such a query. I am new to the database world, and I am trying to write efficient queries.
Thanks!
[EDIT]
Each query currently takes 175854 ms to process. What would be an efficient way to reduce the processing time? Any help would be really great. I am using PostgreSQL.
To be efficient, conditions should compare values of the same type as the columns being compared. In this case, the column being compared - Date - has type timestamp, so we need to use a range of timestamp values.
In keeping with this, you should use current_timestamp for the "now" value, and as confirmed by the documentation, subtracting an interval from a timestamp yields a timestamp, so...
For the last 1 day:
select count(*) from table1
where "Date" > current_timestamp - interval '1 day'
For the last 7 days:
select count(*) from table1
where "Date" > current_timestamp - interval '7 days'
For the last 30 days:
select count(*) from table1
where "Date" > current_timestamp - interval '30 days'
For the last 3 months:
select count(*) from table1
where "Date" > current_timestamp - interval '3 months'
Make sure you have an index on the Date column.
If you find that the index is not being used, try converting the condition to a between, eg:
where "Date" between current_timestamp - interval '3 months' and current_timestamp
Logically the same, but may help the optimizer to choose the index.
Note that column1 is irrelevant to the question; being unique there is no possibility of the row count being different from the number of different values of column1 found by any given criteria.
Also, the choice of "Date" for the column name is poor, because a) it is a reserved word, and b) it is not in fact a date.
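If all four counts are needed together, one option is to compute them in a single pass over the widest window. This is only a sketch: the temporary table and its five rows are invented so the statement can be run on its own, and the column is quoted as "Date" to follow the queries above. The CASE form is used rather than a FILTER clause so that it also works on older PostgreSQL versions.

```sql
-- Hypothetical stand-in for the real table, with rows of various ages.
CREATE TEMP TABLE table1 (
    column1 varchar(256) NOT NULL,
    "Date"  timestamp    NOT NULL DEFAULT now()
);

INSERT INTO table1 (column1, "Date") VALUES
    ('a', now() - interval '2 hours'),
    ('b', now() - interval '3 days'),
    ('c', now() - interval '20 days'),
    ('d', now() - interval '2 months'),
    ('e', now() - interval '1 year');

-- One scan of the last 3 months; the narrower windows are counted conditionally.
SELECT count(CASE WHEN "Date" > current_timestamp - interval '1 day'   THEN 1 END) AS last_day,
       count(CASE WHEN "Date" > current_timestamp - interval '7 days'  THEN 1 END) AS last_7_days,
       count(CASE WHEN "Date" > current_timestamp - interval '30 days' THEN 1 END) AS last_30_days,
       count(*)                                                                    AS last_3_months
FROM   table1
WHERE  "Date" > current_timestamp - interval '3 months';
```

With the sample rows, the counts should come out as 1, 2, 3 and 4. The WHERE clause still restricts the scan to the last three months, so an index on the column can be used as before.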
If you want to count number of records between two dates:
select count(*)
from Table1
where "Date" >= '2013-05-12' and "Date" < '2013-05-13'
-- count for one day, upper bound not included
select count(*)
from Table1
where "Date" >= '2013-05-12' and "Date" < '2013-06-12'
-- count for one month, upper bound not included
select count(*)
from Table1
where
"Date" >= current_date and
"Date" < current_date + interval '1 day'
-- current date
What I understand from your wording is
select date_trunc('day', "date"), count(*)
from t
where "date" >= '2013-01-01'
group by 1
order by 1
Replace 'day' for 'week', 'month', 'quarter' as needed.
http://www.postgresql.org/docs/current/static/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC
Create an index on the "date" column.
select count(distinct column1) from table1 where date > '2012-05-12 23:03:22';
I assume "number of column1" means "number of distinct values in column1".
Edit:
Regarding your second question (speed of the query): I would assume that an index on the date column should speed up the runtime. Depending on the data content, this could even be declared unique.
To throw another option into the mix...
Add a column of type "date" and index that -- named "datecol" for this example:
create index tbl_datecol_idx on tbl (datecol);
analyze tbl;
Then your query can use an equality operator:
select count(*) from tbl where datecol = current_date - 1; --yesterday
Or if you can't add the date datatype column, you could create a functional index on the existing column:
create index tbl_date_fbi on tbl ( ("date"::DATE) );
analyze tbl;
select count(*) from tbl where "date"::DATE = current_date - 1;
Note1: you do not need to query "column1" directly as every row has that attribute filled due to the NOT NULL.
Note2: Creating a column named "date" is poor form, and even worse that it is of type TIMESTAMP.

Getting results between two dates in PostgreSQL

I have the following table:
+-----------+-----------+------------+----------+
| id | user_id | start_date | end_date |
| (integer) | (integer) | (date) | (date) |
+-----------+-----------+------------+----------+
Fields start_date and end_date are holding date values like YYYY-MM-DD.
An entry from this table can look like this: (1, 120, 2012-04-09, 2012-04-13).
I have to write a query that can fetch all the results matching a certain period.
The problem is that if I want to fetch results from 2012-01-01 to 2012-04-12, I get 0 results even though there is an entry with start_date = "2012-04-09" and end_date = "2012-04-13".
SELECT *
FROM mytable
WHERE (start_date, end_date) OVERLAPS ('2012-01-01'::DATE, '2012-04-12'::DATE);
Datetime functions is the relevant section in the docs.
Assuming you want all "overlapping" time periods, i.e. all that have at least one day in common.
Try to envision time periods on a straight time line and move them around before your eyes and you will see the necessary conditions.
SELECT *
FROM tbl
WHERE start_date <= '2012-04-12'::date
AND end_date >= '2012-01-01'::date;
This is sometimes faster for me than OVERLAPS - which is the other good way to do it (as @Marco already provided).
Note the subtle difference. The manual:
OVERLAPS automatically takes the earlier value of the pair as the
start. Each time period is considered to represent the half-open
interval start <= time < end, unless start and end are equal in which
case it represents that single time instant. This means for instance
that two time periods with only an endpoint in common do not overlap.
Bold emphasis mine.
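The difference can be seen with two periods that share only an endpoint. This sketch uses the sample entry from the question (2012-04-09 to 2012-04-13) and a made-up filter period ending on 2012-04-09; the column aliases are illustrative.

```sql
SELECT -- half-open semantics: the periods only touch at 2012-04-09, so this is false
       (date '2012-04-09', date '2012-04-13')
           OVERLAPS (date '2012-01-01', date '2012-04-09')  AS overlaps_result,
       -- inclusive comparison (start <= filter end AND end >= filter start): true
       (date '2012-04-09' <= date '2012-04-09'
        AND date '2012-04-13' >= date '2012-01-01')         AS inclusive_result;
```

So when both boundary days should count as part of the period, the plain comparison form is the one that matches.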
Performance
For big tables the right index can help performance (a lot).
CREATE INDEX tbl_date_inverse_idx ON tbl(start_date, end_date DESC);
Possibly with another (leading) index column if you have additional selective conditions.
Note the inverse order of the two columns. See:
Optimizing queries on a range of timestamps (two columns)
I just had the same question and answered it this way, in case this helps.
select *
from table
where start_date between '2012-01-01' and '2012-04-13'
or end_date between '2012-01-01' and '2012-04-13'
To have a query working in any locale settings, consider formatting the date yourself:
SELECT *
FROM testbed
WHERE start_date >= to_date('2012-01-01','YYYY-MM-DD')
AND end_date <= to_date('2012-04-13','YYYY-MM-DD');
Looking at the dates for which it doesn't work -- those where the day is less than or equal to 12 -- I'm wondering whether it's parsing the dates as being in YYYY-DD-MM format?
You have to use the date part fetching method:
SELECT * FROM testbed WHERE start_date ::date >= to_date('2012-09-08' ,'YYYY-MM-DD') and date::date <= to_date('2012-10-09' ,'YYYY-MM-DD')
No offense, but to check SQL performance I executed some of the above-mentioned solutions in pgsql.
Let me share the statistics of the top 3 solution approaches that I came across.
1) Took : 1.58 MS Avg
2) Took : 2.87 MS Avg
3) Took : 3.95 MS Avg
Now try this :
SELECT * FROM table WHERE DATE_TRUNC('day', date ) >= Start Date AND DATE_TRUNC('day', date ) <= End Date
Now this solution took : 1.61 Avg.
And the best solution is the 1st, suggested by marco-mariani:
SELECT *
FROM ecs_table
WHERE (start_date, end_date) OVERLAPS ('2012-01-01'::DATE, '2012-04-12'::DATE + interval '1 day');
Let's try range data type.
--sample data.
begin;
create temp table tbl(id serial, user_id integer, start_date date, end_date date);
insert into tbl(user_id, start_date, end_date) values(1, '2012-04-09', '2012-04-13');
insert into tbl(user_id, start_date, end_date) values(1, '2012-01-09', '2012-04-12');
insert into tbl(user_id, start_date, end_date) values(1, '2012-02-09', '2012-04-10');
insert into tbl(user_id, start_date, end_date) values(1, '2012-04-09', '2012-04-10');
commit;
--add a new daterange column.
begin;
alter table tbl add column tbl_period daterange ;
update tbl set tbl_period = daterange(start_date,end_date);
commit;
--now test time.
select * from tbl
where tbl_period && daterange('2012-04-10' ::date, '2012-04-12'::date);
returns:
id | user_id | start_date | end_date | tbl_period
----+---------+------------+------------+-------------------------
1 | 1 | 2012-04-09 | 2012-04-13 | [2012-04-09,2012-04-13)
2 | 1 | 2012-01-09 | 2012-04-12 | [2012-01-09,2012-04-12)
further reference: https://www.postgresql.org/docs/current/functions-range.html#RANGE-OPERATORS-TABLE
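Note that daterange(start_date, end_date) builds a half-open range, so end_date itself is excluded, as the [..) notation in the output shows. If the end date should count as part of the period, as in the original question, a third argument '[]' makes both bounds inclusive. The following is a sketch under that assumption, run against the sample tbl created above; the index name is made up, and the expression index is optional.

```sql
-- Optional: a GiST index on the same expression the query uses (PostgreSQL 9.2+).
-- Index creation fails if any row has start_date later than end_date.
CREATE INDEX tbl_period_incl_gist_idx
    ON tbl USING gist (daterange(start_date, end_date, '[]'));

-- Overlap test with inclusive bounds on both the stored period and the filter.
SELECT id, user_id, start_date, end_date
FROM   tbl
WHERE  daterange(start_date, end_date, '[]')
       && daterange(date '2012-01-01', date '2012-04-12', '[]');
```

This variant needs no extra column, at the cost of repeating the range expression in every query that should use the index.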

How to filter table to date when it has a timestamp with time zone format?

I have a very large dataset - records in the hundreds of millions/billions.
I would like to filter the data in this column - i am only showing 2 records of millions:
arrival_time
2019-04-22 07:36:09.870+00
2019-06-07 09:46:09.870+00
How can I filter the data in this column by the date part only? That is, I would like to filter where arrival_time is 2019-04-22, which would give me the first record and any other records with the matching date 2019-04-22.
I have tried casting the column with timestamp::date = "2019-04-22", but this has been costly and does not work well given that I have such vast numbers of records.
sample code is:
select
*
from
mytable
where
arrival_time::timestamp::date = '2019-09-30'
Again, this is very costly, since the cast to date is done before the filtering!
Any ideas? I am using PostgreSQL and pgAdmin 4.
This query:
where (arrival_time::timestamp)::date = '2019-09-30'
Is converting arrival_time to another type. That generally precludes the use of index and makes it harder for the optimizer to choose the best execution path.
Instead, compare to same data type:
where arrival_time >= '2019-09-30'::timestamp and
arrival_time < ('2019-09-30'::timestamp + interval '1 day')
You can try to filter for the upper and lower bounds of that day.
...
WHERE arrival_time >= '2019-04-22'::timestamp
AND arrival_time < '2019-04-23'::timestamp
...
Like that an index on arrival_time should be usable and help to improve performance.
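Since the column is timestamp with time zone, a bound written as '2019-04-22'::timestamp is interpreted in the session's time zone when it is compared. A sketch that pins the day to UTC with explicit offsets, which is an assumption about what "the date" should mean here, and checks the plan; the index name is hypothetical:

```sql
-- Plain b-tree index on the raw column (skip if one already exists).
CREATE INDEX mytable_arrival_time_idx ON mytable (arrival_time);

-- Half-open range covering 2019-04-22 in UTC; EXPLAIN shows whether the index is chosen.
EXPLAIN
SELECT *
FROM   mytable
WHERE  arrival_time >= timestamptz '2019-04-22 00:00:00+00'
AND    arrival_time <  timestamptz '2019-04-23 00:00:00+00';
```

If the day should follow another time zone, the offsets in the two literals would change accordingly.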