Optimization on large tables - SQL

I have the following query that joins two large tables. I am trying to join on patient_id and on records that are no more than 30 days apart.
select *
from chairs c
join data id
  on c.patient_id = id.patient_id
 and to_date(c.from_date, 'YYYYMMDD') - to_date(id.from_date, 'YYYYMMDD') >= 0
 and to_date(c.from_date, 'YYYYMMDD') - to_date(id.from_date, 'YYYYMMDD') < 30
Currently, this query takes 2 hours to run. What indexes can I create on these tables to make this query run faster?

I will take a shot in the dark, because as others have said, it depends on the table structure, the indexes, and the planner's output.
The most obvious thing here is that, wherever possible, you want to represent dates with an actual date datatype instead of strings. That is the first and most important change you should make here: no index can save you while you are transforming strings, because the problem is very likely not the patient_id, it's your date calculation.
Other than that, forcing a hash join on patient_id and then doing the filtering could help if for some reason the planner decided to use nested loops for that condition. But that is for after you have fixed your date representation AND you still have a problem AND you see that the planner chooses nested loops on that attribute.
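As a minimal sketch of that first change, assuming PostgreSQL and that the YYYYMMDD strings are all valid dates (the ALTER statements and the rewritten join are illustrative, not tested against your schema):
ALTER TABLE chairs ALTER COLUMN from_date TYPE date USING to_date(from_date, 'YYYYMMDD');
ALTER TABLE data   ALTER COLUMN from_date TYPE date USING to_date(from_date, 'YYYYMMDD');
-- With real dates the join can compare the columns directly
-- and an index on (patient_id, from_date) becomes usable:
SELECT *
FROM chairs c
JOIN data id
  ON  c.patient_id = id.patient_id
 AND id.from_date <= c.from_date
 AND id.from_date >  c.from_date - 30;  -- date minus integer yields a date in PostgreSQL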

Some observations if you are stuck with string fields for the dates:
YYYYMMDD date strings sort correctly, so they can be compared with <, > and =.
Building the comparison strings on the chairs side of the JOIN will make good use of an index on data (patient_id, from_date).
So my suggestion would be to write expressions that build the date strings you want to use in the JOIN. Or, to put it another way: do not transform the child table's string data into something else.
Example expression that takes 30 days off a string date and returns a string date:
select to_char(to_date('20200112', 'YYYYMMDD') - INTERVAL '30 DAYS','YYYYMMDD')
Untested:
select *
from chairs c
join data id
  on c.patient_id = id.patient_id
 and id.from_date between to_char(to_date(c.from_date, 'YYYYMMDD') - INTERVAL '30 DAYS', 'YYYYMMDD')
                      and c.from_date
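If the strings stay, the supporting index mentioned above would look something like this (the index name is made up):
CREATE INDEX data_patient_from_date_idx ON data (patient_id, from_date);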

For this query:
select *
from chairs c join data id
  on c.patient_id = id.patient_id and
     to_date(c.from_date, 'YYYYMMDD') - to_date(id.from_date, 'YYYYMMDD') >= 0 and
     to_date(c.from_date, 'YYYYMMDD') - to_date(id.from_date, 'YYYYMMDD') < 30;
You should start with indexes on (patient_id, from_date) -- you can put them in both tables.
The date comparisons are problematic. Storing the values as actual dates can help. But it is not a 100% solution because comparison operations are still needed.
Depending on what you are actually trying to accomplish, there might be other ways of writing the query. I would encourage you to ask a new question, providing sample data, desired results, and a clear explanation of what you really want. For instance, this query is likely to return a lot of rows, and that alone takes time.

Your query has a non-SARGable predicate because it uses functions that are evaluated row by row. You need to get rid of such functions and replace them with direct access to the columns. For example:
SELECT *
FROM chairs AS c
JOIN data AS id
ON c.patient_id = id.patient_id
AND c.from_date BETWEEN id.from_date AND id.from_date + INTERVAL '30 days'
will run faster with these two indexes:
CREATE INDEX X_SQLpro_001 ON chairs (patient_id, from_date);
CREATE INDEX X_SQLpro_002 ON data (patient_id, from_date);
Also, try to avoid
SELECT *
and list only the columns you actually need.


Find numbers which have the same last digits in 2 different columns

I have 2 columns of phone numbers, and the requirement is to get the numbers which have the same last 8 digits. ColumnA's numbers have 11 digits and columnB's numbers have 9 or 10 digits.
I tried to use SUBSTR or LIKE and the LEFT/RIGHT functions to solve this, but the data is too big and I can't do it that way.
select trunc(ta.timeA), ta.columnA
from table1A ta,
tableB tb
WHERE substr(ta.columnA,-8) LIKE substr(tb.columnB,-8)
and trunc(ta.timeA) = trunc(tb.timeB)
AND trunc(ta.timeA) >= TO_DATE('01/01/2018', 'dd/mm/yyyy')
AND trunc(ta.timeA) < TO_DATE('01/01/2018', 'dd/mm/yyyy') + 1
GROUP BY ta.columnA, trunc(ta.timeA)
You want to select from tableA, so do this. Don't join. You only want to select tableA rows that have a match in tableB. So place an EXISTS clause in your WHERE clause.
select trunc(timea), columna
from table1a ta
where trunc(timea) >= date '2018-01-01'
and trunc(timea) < date '2018-01-02'
and exists
(
select *
from tableb tb
where trunc(tb.timeb) = trunc(ta.timea)
and substr(tb.columnb, -8) = substr(ta.columna, -8)
)
order by trunc(timea), columna;
In order to have this run fast, create the following indexes:
create index idxa on table1a( trunc(timea), substr(columna, -8) );
create index idxb on tableb( trunc(timeb), substr(columnb, -8) );
I don't see, however, why you are so eager to have this run fast. Do you want to keep all data as is and run the query again and again? There should be a better solution. Splitting the area code and number into two separate columns is the first thing that comes to mind.
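One rough way to sketch that idea without physically splitting the column is a virtual column plus an index on it (a different technique than a real split; it assumes Oracle 11g+, and the column and index names are made up):
alter table table1a add (last8 varchar2(8) generated always as (substr(columna, -8)) virtual);
create index idx_table1a_last8 on table1a (last8);
Queries could then join and filter on last8 directly instead of computing SUBSTR for every row.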
UPDATE: Even faster than the suggested idxa would be a covering index for tableA:
create index idxa on table1a( trunc(timea), substr(columna, -8), columna );
Here the DBMS can work with the index alone and doesn't have to access the table. So, in case the above is still a bit too slow for you, you can try this altered index instead.
And as Alex Poole has pointed out in the comments below, it should simply be
where trunc(timea) = date '2018-01-01'
if the range you are looking at is always a single day, as in the example.
You can try the below, using the = operator instead of the LIKE operator,
since you want to match the last 8 digits.
select trunc(ta.timeA),ta.columnA
from table1A ta inner join tableB tb
on substr(ta.columnA,-8) = substr(tb.columnB,-8)
and trunc(ta.timeA) = trunc(tb.timeB)
AND trunc(ta.timeA) >= TO_DATE('01/01/2018', 'dd/mm/yyyy')
AND trunc(ta.timeA) < TO_DATE('01/01/2018', 'dd/mm/yyyy') + 1
GROUP BY ta.columnA, trunc(ta.timeA)
It would be easier to help if you were more specific about your SQL environment; below is some advice on this query that would apply in most environments.
When dealing with large data sets performance becomes even more critical and small changes in technique can have a big impact.
For example:
LIKE is normally used for a partial match with wildcards; do you not mean equals? LIKE is slower than equals, so if you're not using wildcards I recommend testing for equality.
Also, you start with a cross (Cartesian product) join, but your WHERE clause then defines very specific match criteria (matching time fields). If you need a matching time field, make it part of the table join; this reduces the number of join results, which significantly shrinks the dataset that the other criteria then need to be applied to.
Also, having calculated date values in your WHERE clause is slow. It is better to set a #fromDate and #toDate parameter before the query and then use these in the WHERE clause; they are then effectively literals that don't need to be calculated for every row.
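A rough illustration of that last point against the original query, with the date bounds passed as bind variables (here :from_date and :to_date, which are made-up names) instead of being calculated per row:
select trunc(ta.timeA), ta.columnA
from table1A ta
join tableB tb
  on substr(ta.columnA, -8) = substr(tb.columnB, -8)
 and trunc(tb.timeB) = trunc(ta.timeA)
where ta.timeA >= :from_date
  and ta.timeA <  :to_date
group by ta.columnA, trunc(ta.timeA);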

SQL SELECT that excludes rows with any of a list of values?

I have found many Questions and Answers about a SELECT excluding rows with a value "NOT IN" a sub-query (such as this). But how to exclude a list of values rather than a sub-query?
I want to search for rows whose timestamp is within a range but exclude some specific date-times. In English, that would be:
Select all the ORDER rows recorded between noon and 2 PM today except for the ones of these times: Today 12:34, Today 12:55, and Today 13:05.
SQL might be something like:
SELECT *
FROM order_
WHERE recorded_ >= ?
AND recorded_ < ?
AND recorded_ NOT IN ( list of date-times… )
;
So two parts to this Question:
How to write the SQL to exclude rows having any of a list of values?
How to set an arbitrary number of arguments on a PreparedStatement in JDBC? (The arbitrary number being the count of the list of values to be excluded.)
Pass array
A fast and NULL-safe alternative would be a LEFT JOIN to an unnested array:
SELECT o.*
FROM order_ o
LEFT JOIN unnest(?::timestamp[]) x(recorded_) USING (recorded_)
WHERE o.recorded_ >= ?
AND o.recorded_ < ?
AND x.recorded_ IS NULL;
This way you can prepare a single statement and pass any number of timestamps as array.
The explicit cast ::timestamp[] is only necessary if you cannot type your parameters (like you can in prepared statements). The array is passed as single text (or timestamp[]) literal:
'{2015-07-09 12:34, 2015-07-09 12:55, 2015-07-09 13:05}', ...
Or put CURRENT_DATE into the query and pass the times to add, as outlined by #drake. More about adding a time / interval to a date:
How to get the end of a day?
Pass individual values
You could also use a VALUES expression - or any other method to create an ad-hoc table of values.
SELECT o.*
FROM order_ o
LEFT JOIN (VALUES (?::timestamp), (?), (?) ) x(recorded_)
USING (recorded_)
WHERE o.recorded_ >= ?
AND o.recorded_ < ?
AND x.recorded_ IS NULL;
And pass:
'2015-07-09 12:34', '2015-07-09 12:55', '2015-07-09 13:05', ...
This way you can only pass a predetermined number of timestamps.
Asides
For up to 100 parameters (or your setting of max_function_args), you could use a server-side function with a VARIADIC parameter:
Return rows matching elements of input array in plpgsql function
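A minimal sketch of such a function, assuming PostgreSQL 9.2+ (for named parameters in SQL functions) and the order_ table from the question; the function name and parameter names are made up:
CREATE OR REPLACE FUNCTION f_orders_excluding(_from timestamp, _to timestamp, VARIADIC _skip timestamp[])
  RETURNS SETOF order_
  LANGUAGE sql AS
$func$
   SELECT o.*
   FROM   order_ o
   WHERE  o.recorded_ >= _from
   AND    o.recorded_ <  _to
   AND    o.recorded_ <> ALL (_skip);   -- excludes every timestamp in the variadic list
$func$;
-- Example call:
SELECT * FROM f_orders_excluding('2015-07-09 12:00', '2015-07-09 14:00',
                                 '2015-07-09 12:34', '2015-07-09 12:55', '2015-07-09 13:05');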
I know that you are aware of timestamp characteristics, but for the general public: equality matches can be tricky for timestamps, since those can have up to 6 fractional digits for seconds and you need to match exactly.
Related
Select rows which are not present in other table
Optimizing a Postgres query with a large IN
SELECT *
FROM order_
WHERE recorded_ BETWEEN CURRENT_DATE + time '12:00' AND CURRENT_DATE + time '14:00'
  AND recorded_ NOT IN (CURRENT_DATE + time '12:34',
                        CURRENT_DATE + time '12:55',
                        CURRENT_DATE + time '13:05')
;

SQL create table with select statements and defined date ranges

I'm trying to track down new id numbers over time for at least the past twelve months. Note, the data is such that once id numbers are in, they stick around for at least 3-5 years, and I literally just run this thing once a month. These are the specs: Oracle Database 11g Release 11.2.0.3.0 - 64bit Production
PL/SQL Release 11.2.0.3.0 - Production.
So far I'm wondering if I can use more dynamic date ranges and run this whole thing on a timer, or if this is the best way to write it. I've just picked up SQL, mostly through Googling and looking at sample queries that others have graciously shared. I also do not know how to write PL/SQL right now, but am willing to learn.
create table New_ids_calendar_year_20xx
as
select b.id_num, (bunch of other fields)
from (select * from source_table where date = last_day(date_add(sysdate,-11))) a,
     (select * from source_table where date = last_day(date_add(sysdate,-10))) b
where a.id_num (+) = b.id_num
union all
*repeats this same select statement with union all until:
last_day(date_add(sysdate,0))
In Oracle there is no built-in function date_add; maybe you have one that you created yourself. Anyway, for adding and subtracting dates I used simple sysdate + number arithmetic. I am also not quite sure about the logic behind your whole query. As for field names, it is better to avoid reserved words like date in column names, so I used tdate here.
This query does what your unioned query did for the last 30 days. For other periods, change 30 to something else.
The whole solution is based on the hierarchical subquery connect by, which gives a simple list of the numbers 0..29.
select b.id_num, b.field1 field1_b, a.field1 field1_a --..., (bunch of other fields)
from (select level - 1 lvl from dual connect by level <= 30) l
join source_table b
on b.tdate = last_day(trunc(sysdate) - l.lvl - 1)
left join source_table a
on a.id_num = b.id_num and a.tdate = last_day(trunc(sysdate) - l.lvl)
order by lvl desc
For the date column you may want to use trunc(tdate) if you also store a time portion, but then an index on the date field, if one exists, will not be used.
In that case change the date condition to something like x - 1 <= tdate and tdate < x.
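For illustration, the join on a in the query above could then be written as a range instead, so a plain index on tdate stays usable (a sketch, assuming tdate carries a time portion):
left join source_table a
  on  a.id_num = b.id_num
 and a.tdate >= last_day(trunc(sysdate) - l.lvl)
 and a.tdate <  last_day(trunc(sysdate) - l.lvl) + 1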

Postgresql query between date ranges

I am trying to query my PostgreSQL db to return results where a date is in a certain month and year. In other words, I would like all the values for a month-year.
The only way I've been able to do it so far is like this:
SELECT user_id
FROM user_logs
WHERE login_date BETWEEN '2014-02-01' AND '2014-02-28'
The problem with this is that I have to calculate the first and last date of the month before querying the table. Is there a simpler way to do this?
Thanks
With dates (and times) many things become simpler if you use >= start AND < end.
For example:
SELECT
user_id
FROM
user_logs
WHERE
login_date >= '2014-02-01'
AND login_date < '2014-03-01'
In this case you still need to calculate the start date of the month you need, but that should be straight forward in any number of ways.
The end date is also simplified; just add exactly one month. No messing about with 28th, 30th, 31st, etc.
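For example, both bounds can be derived from any date inside the target month; a sketch using date_trunc:
SELECT user_id
FROM user_logs
WHERE login_date >= date_trunc('month', date '2014-02-15')
AND login_date < date_trunc('month', date '2014-02-15') + interval '1 month'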
This structure also has the advantage of being able to maintain use of indexes.
Many people may suggest a form such as the following, but they do not use indexes:
WHERE
date_part('year', login_date) = 2014
AND date_part('month', login_date) = 2
This involves calculating the conditions for every single row in the table (a scan) instead of using an index to find the range of rows that will match (a range seek).
Since PostgreSQL 9.2, range types are supported, so you can write this like:
SELECT user_id
FROM user_logs
WHERE '[2014-02-01, 2014-03-01)'::daterange @> login_date
This should be more efficient than the string comparison.
Just in case somebody lands here... since 8.1 you can simply use:
SELECT user_id
FROM user_logs
WHERE login_date BETWEEN SYMMETRIC '2014-02-01' AND '2014-02-28'
From the docs:
BETWEEN SYMMETRIC is the same as BETWEEN except there is no
requirement that the argument to the left of AND be less than or equal
to the argument on the right. If it is not, those two arguments are
automatically swapped, so that a nonempty range is always implied.
SELECT user_id
FROM user_logs
WHERE login_date BETWEEN '2014-02-01' AND '2014-03-01'
The BETWEEN keyword works fine for dates; it assumes a time of 00:00:00 (i.e. midnight) for date values.
Read the documentation.
http://www.postgresql.org/docs/9.1/static/functions-datetime.html
I used a query like that:
WHERE
(
  date_trunc('day', table1.date_eval) = '2015-02-09'
)
or
WHERE (date_trunc('day', table1.date_eval) >= '2015-02-09' AND date_trunc('day', table1.date_eval) < '2015-02-10')

Working with a date in SQL

I know this query is not right, and it's kind of jumbled, but it sort of displays what I want to do. What I'm trying to figure out is how to use the current date in a query. Basically I want to subtract a stored date from the current date and, if the result is < 30, do something. But I obviously don't know how to work with dates... I am assuming it shouldn't be a char value, but if I just use sysdate, Oracle gives me a table error.
select e.STUDENT_ID
from COURSES c, CLASS_ENROLLMENT e, (SELECT TO_CHAR (SYSDATE, 'MM-DD-YYYY') as now
FROM DUAL) t
where t - c.END_DATE <= 30;
Assuming the rest of your query is correct this should solve your date problem:
select e.STUDENT_ID
from COURSES c, CLASS_ENROLLMENT e
where to_date(to_char(sysdate, 'DD-MON-RR')) - c.END_DATE <= 30;
You don't need to select sysdate from dual unless you just want to display it for one reason or another without querying anything else. You can just specify sysdate as part of any query anywhere.
The reason I converted it to a character and then back to a date in this case is just to remove the time from sysdate; this way it will be 30 days from the given day, regardless of time. If you didn't care about time, you could just say sysdate - c.end_date <= 30.
As an FYI, you probably need to add join conditions between the COURSES and CLASS_ENROLLMENT tables. The above should not result in a SQL error and should do what you want with respect to the date of the records, but it's unlikely to be what you want (in full).
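For illustration only, the overall shape with a join condition added; COURSE_ID is an assumed column name, since the real key between the two tables isn't shown in the question:
select e.STUDENT_ID
from COURSES c
join CLASS_ENROLLMENT e
  on e.COURSE_ID = c.COURSE_ID   -- assumed join column
where to_date(to_char(sysdate, 'DD-MON-RR')) - c.END_DATE <= 30;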