SQL Server full-text search performance with additional conditions

We have a performance problem with SQL Server (2008 R2) full-text search. When we add WHERE conditions alongside the full-text condition, the query gets too slow.
Here is my simplified query:
SELECT * FROM Calls C
WHERE (C.CallTime BETWEEN '2013-08-01 00:00:00' AND '2013-08-07 00:00:00')
AND CONTAINS(CustomerText, '("efendim")')
The Calls table's primary key is CallId (int, clustered index), and the table is also indexed by CallTime. We have 16,000,000 rows, and CustomerText is about 10 KB per row.
Looking at the execution plan, it first computes the full-text search result set and then joins it with the Calls table on CallId. Because of that, the more rows the full-text result set has, the slower the query gets (over a minute).
This is the execution plan:
When I run the WHERE conditions separately, the CallTime condition returns 360,000 rows:
SELECT COUNT(*) FROM Calls C
WHERE (C.CallTime BETWEEN '2013-08-01 00:00:00' AND '2013-08-07 00:00:00')
and the CONTAINS condition returns 1,200,000 rows:
SELECT COUNT(*) FROM Calls C
WHERE CONTAINS(AgentText, '("efendim")')
So, what can I do to increase the performance of my query?

If Calls is indexed and sorted by call time, then instead of:
WHERE (C.CallTime BETWEEN '2013-08-01 00:00:00' AND '2013-08-07 00:00:00')
you can first look up the smallest CallId whose call time is greater than '2013-08-01 00:00:00' and the largest CallId whose call time is smaller than '2013-08-07 00:00:00',
and your new condition becomes:
WHERE (C.CallId BETWEEN first_id AND last_id)
which is faster than comparing dates.
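A minimal T-SQL sketch of that approach, assuming CallId increases monotonically with CallTime (table and column names come from the question; the variable names are illustrative):
DECLARE @firstId INT, @lastId INT;

-- Resolve the date range to a CallId range once, using the CallTime index
SELECT @firstId = MIN(CallId), @lastId = MAX(CallId)
FROM Calls
WHERE CallTime BETWEEN '2013-08-01 00:00:00' AND '2013-08-07 00:00:00';

-- Filter the full-text matches on the clustered key range instead of the dates
SELECT *
FROM Calls C
WHERE C.CallId BETWEEN @firstId AND @lastId
AND CONTAINS(C.CustomerText, '("efendim")');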

Related

Query performance with sorting in Postgres

I have a performance issue with a query on a table which has 33M rows. The query should return 6M rows.
I'm trying to get the response to the request to begin without any significant delay; this is required for data streaming in my app.
After the start, the data transfer may take longer. The difficulty is that the query has sorting.
So I created an index on the fields that are used in the ORDER BY and in the WHERE clause.
The table looks like this:
CREATE TABLE Table1 (
Id SERIAL PRIMARY KEY,
Field1 INT NOT NULL,
Field2 INT NOT NULL,
Field3 INT NOT NULL,
Field4 VARCHAR(200) NOT NULL,
CreateDate TIMESTAMP,
CloseDate TIMESTAMP NULL
);
CREATE INDEX IX_Table1_SomeIndex ON Table1 (Field2, Field4);
And the query looks like this:
SELECT * FROM Table1 t
WHERE t.CreateDate >= '2020-01-01' AND t.CreateDate < '2021-01-01'
ORDER BY t.Field2, t.Field4
It leads to the following:
when I add "LIMIT 1000" it returns the result immediately and builds the following plan:
the plan with 'LIMIT'
when I run it without "LIMIT" it "thinks" for about a minute and then returns data for about 16 minutes. And it builds the following plan:
the plan without 'LIMIT'
Why are the plans different?
Could you help me find a solution that streams immediately (without LIMIT)?
Thanks!
You will need to use a server side cursor or something similar for this to work. Otherwise it runs the query to completion before returning any results. There is no "streaming" by default. How you do this depends on your client, which you don't mention.
If you simply DECLARE a cursor and then FETCH in chunks, then the setting cursor_tuple_fraction will control whether it chooses the plan with a faster start up cost (like what you get with the LIMIT), or a faster overall run cost (like you get without the LIMIT).
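A minimal sketch of that cursor approach against the example table (the cursor name and chunk size are illustrative):
BEGIN;
DECLARE table1_cur CURSOR FOR
SELECT * FROM Table1 t
WHERE t.CreateDate >= '2020-01-01' AND t.CreateDate < '2021-01-01'
ORDER BY t.Field2, t.Field4;

-- Stream the result in chunks; repeat FETCH until it returns no rows
FETCH 1000 FROM table1_cur;
FETCH 1000 FROM table1_cur;

CLOSE table1_cur;
COMMIT;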
If "when I add LIMIT 1000 it returns result immediately" and you want to avoid latency then I would suggest that you run a slightly modified query many times in a loop with LIMIT 1000. An important benefit would be that there will be no long running transactions.
The query to run many times in a loop should return records starting after the largest value of (field2, field4) from the previous iteration run.
SELECT *
FROM table1 t
WHERE t.CreateDate >= '2020-01-01' AND t.CreateDate < '2021-01-01'
AND (t.field2, t.field4) > (:last_run_largest_f2_value, :last_run_largest_f4_value)
ORDER BY t.field2, t.field4
LIMIT 1000;
last_run_largest_f2_value and last_run_largest_f4_value are parameters. Their values shall come from the last record returned by the previous iteration.
AND (t.field2, t.field4) > (:last_run_largest_f2_value, :last_run_largest_f4_value) shall be omitted in the first iteration.
Important limitation
This is an alternative to OFFSET that will only work correctly if the (field2, field4) values are unique.
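If (field2, field4) is not unique, one common variant (my addition, not part of the original answer) is to append the primary key so the sort key becomes unique:
SELECT *
FROM table1 t
WHERE t.CreateDate >= '2020-01-01' AND t.CreateDate < '2021-01-01'
AND (t.field2, t.field4, t.id) > (:last_f2, :last_f4, :last_id)
ORDER BY t.field2, t.field4, t.id
LIMIT 1000;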

SQL range conditions less than, greater than and between

What I would like to accomplish is: query whether the documents_created totals of the 'Email OCR In' and 'Universal Production' rows are the same amount as the 'Email OCR' documents_created. If not, pull that batch. Finally, if the attachment count is less than 7 entries after the Email OCR In and Universal Production files are pulled, then return said result.
current query below:
use N
SELECT id,
type,
NAME,
log_time ,
start_time ,
documents_created ,
pages_created,
processed,
processed_time 
FROM N_LF_OCR_LOG
WHERE
-- Log time is current day
log_time between  CONVERT(date, getdate()) AND CONVERT(datetime,floor(CONVERT(float,getdate()))) + '23:59:00' 
-- Documents created is NULL or non zero
AND (documents_created IS NULL OR documents_created <> 0)
or  ( documents_created is null and log_time between  CONVERT(date, getdate()) AND CONVERT(datetime,floor(CONVERT(float,getdate()))) + '23:59:00')
-- Filter for specific types
AND type IN ('Email OCR In',
'Universal Production')
-- Filter to rows where number of pages and documents created are not equal
AND documents_created <2 and pages_created >2
ORDER BY log_time
,id asc
,processed_time asc
Any idea how to incorporate that? I'm a novice. Thanks.
When creating an index, you just specify the columns to be indexed. There is no difference between creating an index for a range query and for an exact match. You can add multiple columns to the same index so that all of them can benefit from it, because only one index per table at a time can be selected to support a query.
You could create an index just covering your WHERE clause:
CREATE INDEX test1 ON N_LF_OCR_LOG (log_time, documents_created, type, pages_created);
Or also add the columns required for the ordering to the index. The order of the columns in the index is important and must match the ordering in the query:
CREATE INDEX test1 ON N_LF_OCR_LOG (log_time, id, processed_time, documents_created, type, pages_created);
Or add a covering index that also contains the returned columns, so you do not have to load any values from the table and can answer the complete query by just using the index. This gives the best response time for the query, but the index takes up more space on disk.
CREATE INDEX test1 ON N_LF_OCR_LOG (log_time, id, processed_time, documents_created, type, pages_created) INCLUDE (NAME, start_time, processed);
Check the execution plan of your query to see how well your index performs.

SQL query takes longer when the date range is smaller?

I have a simple select statement which selects data from a SQL Server 2000 (so old) table with about 10-20 million rows, like this:
@startDate = '2014-01-25' -- yyyy-mm-dd
@endDate = '2014-02-20'
SELECT
Id, 6-7 other columns
FROM
Table1 as t1
LEFT OUTER JOIN
Table2 as t2 ON t1.Code = t2.Code
WHERE
t1.Id = 'G59' -- yes, it's a varchar
AND (t1.Entry_Date >= @startDate AND t1.Entry_Date < @endDate)
This gives me about 40K rows in about 10 seconds. But if I set @startDate = '2014-01-30', keeping @endDate the same ALWAYS, then the query takes about 2 min 30 sec.
To produce the same number of rows, I tried it with 01-30 again and it took 2 min 48 seconds.
I am surprised to see the difference. I was not expecting it to be so big; rather, I was expecting the smaller date range to take the same time or less.
What could be the reason for this, and how do I fix it?
Have you recently inserted and/or deleted a large number of rows? It could be that the statistics on the table's indexes are out of date, and thus the query optimizer goes for an "index seek + key lookup" scenario on the smaller date range - but that turns out to be slower than just doing a table/clustered index scan.
I would recommend updating the statistics (see this TechNet article on how to update statistics) and trying again - any improvement?
The query optimizer uses statistics to determine whether it is faster to just do a table scan (read all the table's data pages and select the rows that match), or whether it is faster to look up the search value in an index; that index typically doesn't contain all the data, so once a match is found, a key lookup has to be performed on the table to get at the data - which is an expensive operation, so it is only viable for small sets of data. If out-of-date statistics "mislead" the query optimizer, it might choose a suboptimal execution plan.
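A minimal sketch of updating the statistics in T-SQL (the FULLSCAN option is a suggestion of mine, not from the original answer; both commands are available on SQL Server 2000):
-- Refresh statistics for the table the query filters on
UPDATE STATISTICS Table1 WITH FULLSCAN;
-- Or refresh statistics for every table in the database
EXEC sp_updatestats;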

Adding string comparison where clause to inner join takes huge performance hit with SQLite

I'm playing around with Philadelphia transit data and I have an SQLite database storing the GTFS data. I have this query looking for departure times at a particular stop:
SELECT "stop_time".departure_time FROM "stop_time"
INNER JOIN "trip" ON "trip".trip_id = "stop_time".trip_id
WHERE
(trip.route_id = '10726' )
-- AND (trip.service_id = '1')
AND (stop_time.stop_id = '220')
AND (time( stop_time.departure_time ) > time('08:30:45'))
AND (time( stop_time.departure_time ) < time('09:30:45'));
The clause matching service_id to 1 is currently commented out. If I run the query as it is now, without matching service_id, it takes 2 seconds. If I uncomment the service_id clause, it takes 30. I'm clueless as to why, since I'm already looking into the trip table for the route_id.
Any thoughts?
This generally happens when an index defined on (route_id, stop_id, departure_time) is used while there is no filter on service_id. Once you include service_id in the WHERE clause, since it is not present in the index, a table scan is required, hence the jump in execution time. If you include service_id in the index definition as well, the table scan will be replaced with an index scan.
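A hedged sketch of such an index on the trip table (the index name is illustrative, and the schema's existing indexes aren't shown in the question, so treat this as an assumption):
-- Cover both filter columns on trip so the lookup doesn't fall back to a scan
CREATE INDEX idx_trip_route_service ON trip (route_id, service_id);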
The reason is that you have an index on service_id that is being chosen in preference to another index, and there are not many different values of service_id, so using that index isn't very useful because there are so many rows with service_id = '1'.

Efficient querying of multi-partition Postgres table

I've just restructured my database to use partitioning in Postgres 8.2. Now I have a problem with query performance:
SELECT *
FROM my_table
WHERE time_stamp >= '2010-02-10' and time_stamp < '2010-02-11'
ORDER BY id DESC
LIMIT 100;
There are 45 million rows in the table. Prior to partitioning, this would use a reverse index scan and stop as soon as it hit the limit.
After partitioning (on time_stamp ranges), Postgres does a full index scan of the master table and the relevant partition and merges the results, sorts them, then applies the limit. This takes way too long.
I can fix it with:
SELECT * FROM (
SELECT *
FROM my_table_part_a
WHERE time_stamp >= '2010-02-10' and time_stamp < '2010-02-11'
ORDER BY id DESC
LIMIT 100) t
UNION ALL
SELECT * FROM (
SELECT *
FROM my_table_part_b
WHERE time_stamp >= '2010-02-10' and time_stamp < '2010-02-11'
ORDER BY id DESC
LIMIT 100) t
UNION ALL
... and so on ...
ORDER BY id DESC
LIMIT 100
This runs quickly. The partitions where the time stamps are out of range aren't even included in the query plan.
My question is: Is there some hint or syntax I can use in Postgres 8.2 to prevent the query-planner from scanning the full table but still using simple syntax that only refers to the master table?
Basically, can I avoid the pain of dynamically building the big UNION query over each partition that happens to be currently defined?
EDIT: I have constraint_exclusion enabled (thanks @Vinko Vrsalovic)
Have you tried Constraint Exclusion (section 5.9.4 in the document you've linked to)?
Constraint exclusion is a query optimization technique that improves performance for partitioned tables defined in the fashion described above. As an example:
SET constraint_exclusion = on;
SELECT count(*) FROM measurement WHERE logdate >= DATE '2006-01-01';
Without constraint exclusion, the above query would scan each of the partitions of the measurement table. With constraint exclusion enabled, the planner will examine the constraints of each partition and try to prove that the partition need not be scanned because it could not contain any rows meeting the query's WHERE clause. When the planner can prove this, it excludes the partition from the query plan. You can use the EXPLAIN command to show the difference between a plan with constraint_exclusion on and a plan with it off.
I had a similar problem that I was able to fix by casting the conditions in the WHERE clause.
E.g. (assuming the time_stamp column is of type timestamptz):
WHERE time_stamp >= '2010-02-10'::timestamptz and time_stamp < '2010-02-11'::timestamptz
Also, make sure the CHECK condition on the table is defined the same way...
E.g.:
CHECK (time_stamp < '2010-02-10'::timestamptz)
I had the same problem, and it boiled down to two causes in my case:
I had an indexed column of type timestamp WITH time zone, but the partition constraint on that column used type timestamp WITHOUT time zone.
After fixing the constraints, an ANALYZE of all child tables was needed.
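A sketch of what that fix can look like for one child table (the constraint name, partition name, and bounds are illustrative assumptions, not taken from the original post):
-- Recreate the CHECK constraint with the same type as the indexed column
ALTER TABLE my_table_part_a DROP CONSTRAINT my_table_part_a_time_stamp_check;
ALTER TABLE my_table_part_a ADD CONSTRAINT my_table_part_a_time_stamp_check
CHECK (time_stamp >= '2010-02-10'::timestamptz AND time_stamp < '2010-02-11'::timestamptz);
-- Refresh planner statistics for the child table
ANALYZE my_table_part_a;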
Edit: another bit of knowledge - it's important to remember that constraint exclusion (which allows PG to skip scanning some tables based on your partitioning criteria) doesn't work with, quote, "non-immutable functions such as CURRENT_TIMESTAMP".
I had queries with CURRENT_DATE, and it was part of my problem.
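A hedged illustration of that pitfall (not from the original post): instead of filtering with a non-immutable function, compute the date beforehand and pass it as a constant:
-- Constraint exclusion cannot prune partitions here
SELECT * FROM my_table WHERE time_stamp >= CURRENT_DATE - INTERVAL '1 day';
-- Passing a literal lets the planner prove which partitions to skip
SELECT * FROM my_table WHERE time_stamp >= '2010-02-10'::timestamptz;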