I have a LIKE query that's processing millions of rows:
SELECT
sample_id,
REPLACE( sample_id, '*', '') AS term
FROM
sample.table
WHERE
sample_id LIKE '%*%'
ORDER BY
sample_id ASC;
I tried batching the queries, but it's still too slow to process. Has anyone experienced this in the past and successfully solved it? I'm basically open to any ideas at this point. Thanks!
You did not mention which RDBMS you are using, but you can speed up processing by using a properly designed index.
Index properties (based on Microsoft SQL Server):
filtered index:
you can implement a filtered index. The filter corresponds to the WHERE clause of your query, so you can add "sample_id LIKE '%*%'" as the filter condition.
covering index:
your query is not complicated, so it should be easy to create a covering index
for it. By covering index I mean a structure which contains all the columns
mentioned in your query. It will help the RDBMS engine decide to use it during
execution because it contains all the needed columns, and the filter as well,
as mentioned in the first point.
So the syntax could look like this (Microsoft SQL Server pseudo code):
CREATE INDEX idx1 ON your_table_name (sample_id) WHERE sample_id LIKE '%*%'
If you build it, you will have a DEDICATED structure for your query. You can think of it as a subset of the data from your table, but physically present in your database, written to disk and constantly updated as the data changes. As long as this index has the filter, it contains only the rows needed by your query. So you can imagine that if the RDBMS engine chooses it, by parsing and analyzing your code, the WHERE clause will not have to be evaluated against the whole table.
Unfortunately, I am not aware whether RDBMSes other than Microsoft SQL Server deliver filtered indexes.
If your RDBMS doesn't allow filtered indexes, you can at least create a covering one. It might still be a lighter structure than your table; however, you didn't present the structure of your table.
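For example, since the query references only sample_id, even a plain single-column index covers it. (Note that SQL Server's filtered-index grammar accepts only simple comparison predicates, so the LIKE filter in the pseudo-code above is illustrative rather than directly creatable.) The leading wildcard in LIKE '%*%' still rules out an index seek, but the engine can scan the narrow index instead of the whole table. A minimal sketch, where the index name is made up and the bracketed table name just mirrors the naming from the question:
-- The leading wildcard still forces a scan, but scanning this
-- narrow index is cheaper than scanning the full table.
CREATE INDEX idx_sample_id_cover ON sample.[table] (sample_id);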
An index doesn't come without a cost, but that is a further story. Just remember that it takes up space on disk and is updated along with the data in your table.
Related
I want to improve the performance of a simple query, with a typical structure like this:
SELECT title,datetime
FROM LICENSE_MOVIES
WHERE client='Alex'
As you can read on different websites, like this one, you should create an index like this:
CREATE INDEX INDEX_LICENSE_MOVIES
ON LICENSE_MOVIES(client);
But there is no performance improvement in the query; it is as if it were ignoring the index.
I have tried to use hints like this webpage says.
And the query looks like this:
SELECT /*+ INDEX(LICENSE_MOVIES INDEX_LICENSE_MOVIES) */ title, datetime
FROM LICENSE_MOVIES
WHERE client='Alex'
Is there any error in this syntax? Why can't I see any improvement?
Oracle has a smart optimizer. It does not always use indexes -- in fact, you might be surprised to learn that sometimes using an index is exactly the wrong thing to do.
In your case, your data fits on a handful of data pages (well, dozens). The question is: how many "Alex"s are in the data? If there is just one, then Oracle should use the index, as follows:
Oracle looks up the row containing "Alex" in the index.
Oracle identifies the data page where the row is located.
Oracle loads the data page.
Oracle processes the query and returns the results.
If lots of rows (say more than a few dozen) are for "Alex", then the optimizer is going to "think" . . . "Gosh, I need to read every data page anyway. Let me avoid using the index and just scan all the data."
Of course, this decision is based on the available statistics (which might be inaccurate or out-of-date). But there are definitely circumstances where a full table scan is the right approach, even when an index is available.
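If stale statistics are the suspect, refreshing them is a cheap first check. A minimal sketch, assuming the table lives in a schema called YOUR_SCHEMA (a placeholder):
BEGIN
  -- Refresh optimizer statistics so the cardinality estimate for
  -- client = 'Alex' reflects the current data.
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'YOUR_SCHEMA',
    tabname => 'LICENSE_MOVIES',
    cascade => TRUE  -- also refresh statistics on the table's indexes
  );
END;
/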
Assume I'm running a website that shows funny cat pictures. I have a table called CatPictures with the columns Filename, Awesomeness, and DeletionDate, and the following index:
create nonclustered index CatsByAwesomeness
on CatPictures (Awesomeness)
include (Filename)
where DeletionDate is null
My main query is this:
select Filename from CatPictures where DeletionDate is null and Awesomeness > 10
I, as a human being, know that the above index is all that SQL Server needs, because the index filter condition already ensures the DeletionDate is null part.
SQL Server however doesn't know this; the execution plan for my query will not use my index.
Even when adding an index hint, it will still explicitly check DeletionDate by looking at the actual table data (and in addition complain about a missing index that would include DeletionDate).
Of course I could
include (Filename, DeletionDate)
instead, and it will work.
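Spelled out, that fallback index would be:
create nonclustered index CatsByAwesomeness
on CatPictures (Awesomeness)
include (Filename, DeletionDate)
where DeletionDate is null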
But it seems a waste to include that column, since this just uses up space without adding any new information.
Is there a way to make SQL Server aware that the filter condition is already doing the job of checking DeletionDate?
No, not currently.
See this connect item. It is Closed as Won't Fix. (Or this one for the IS NULL case specifically)
The connect item does provide a workaround shown below.
Posted by RichardB CFCU on 29/09/2011 at 9:15 AM
A workaround is to INCLUDE the column that is being filtered on.
Example:
CREATE NONCLUSTERED INDEX [idx_FilteredKey1] ON [dbo].[TABLE]
(
[TABLE_ID] ASC,
[TABLE_ID2] ASC
)
INCLUDE ( [REMOVAL_TIMESTAMP]) --explicitly include the column here
WHERE ([REMOVAL_TIMESTAMP] IS NULL)
Is there a way to make SQL Server aware that the filter condition is
already doing the job of checking DeletionDate?
No.
Filtered indexes were designed to solve certain problems, not ALL of them. Things evolve, and some day you may see SQL Server supporting the feature you expect of filtered indexes, but it is also possible that you never will.
There are several good reasons I can see for how it works.
What it improves on:
Storage. The index contains only keys matching the filtering condition
Performance. A shoo-in from the above. Less to write and fewer pages = faster retrieval
What it does not do:
Change the query engine radically
Putting them together, considering that SQL Server is a heavily pipelined beast capable of multi-processor parallelism, we get the following behaviour when servicing a query:
Pre-condition to the query optimizer selecting indexes: check whether a Filtered Index is applicable against the WHERE clause.
The query optimizer continues its normal work of determining selectivity from statistics, weighing up index seek + bookmark lookup vs clustered/heap scan depending on whether the index is covering, etc.
Threading the filtered index's condition into the query optimizer "core" is, I suspect, a much bigger job than leaving it at step 1.
Personally, I respect the SQL Server dev team and if it were easy enough, they might pull it into a not-too-distant sprint and get it done. However, what's there currently has achieved what it was intended to and makes me quite happy.
Just found that "gap in functionality"; it's really sad that filtered indexes are ignored by the optimizer.
I think I'll try to use indexed views for that; take a look at this article:
http://www.sqlperformance.com/2013/04/t-sql-queries/optimizer-limitations-with-filtered-indexes
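For reference, a minimal sketch of the indexed-view alternative, reusing the CatPictures example from above (the view and index names are made up; indexed views require SCHEMABINDING and a unique clustered index):
-- The view materializes only the non-deleted rows, so queries
-- against it never need to re-check DeletionDate.
CREATE VIEW dbo.ActiveCatPictures
WITH SCHEMABINDING
AS
SELECT Filename, Awesomeness
FROM dbo.CatPictures
WHERE DeletionDate IS NULL;
GO
-- Assumes Filename is unique among non-deleted rows.
CREATE UNIQUE CLUSTERED INDEX IX_ActiveCatPictures
ON dbo.ActiveCatPictures (Filename);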
I have the following query:
SELECT * FROM messages GROUP BY peer
(really it's more complicated with joins, but I omitted them here for simplicity)
The problem is that SQLite doesn't use any indexes and always performs a full scan of the table. Expectedly, it works fast on small data sets but it's noticeably slow with a big table containing thousands of rows. Here's the output of the EXPLAIN QUERY PLAN command:
0|0|0|SCAN TABLE messages USING INDEX messages_peer_mid (~1000000 rows)
Despite saying "USING INDEX", it still performs a full scan. Is there any way to make SQLite use the index for this query, or is it better to give up on GROUP BY and look for some other approach?
The plan takes into account the amount of data and performs a scan because its algorithm probably concludes it's faster to do so.
Other comments: your query has no WHERE condition and you are returning ALL columns, so why wouldn't you expect a table scan?
Indexes assist in selecting records from a table (via a WHERE clause or as a result of a JOIN operation). GROUP BY is performed on a set of records after they've been selected and retrieved from the table; at best an index can hand the rows over in grouped order, but it can't avoid reading all of them.
If you want to know more about what options are available for index use in your query, please post the entire query.
Also, you note that the SQL you gave is a symbolic representation of the code you're running, but if you're really using *, or any non-aggregated field names other than peer, in your statement, you may not be getting the results you want.
Finally, you ask whether it's better to give up on GROUP BY and look for some other approach. GROUP BY serves a specific function in SQL: producing aggregated result sets from non-aggregated data. If that's your goal, GROUP BY is likely the best solution, because it defers the decision about how to retrieve and process the data to the database engine, which is highly optimized and cognizant of database statistics. If that's not your goal and you're using GROUP BY as an "approach" to some other functionality, let us know what it is you're actually trying to achieve.
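For instance, if the actual goal is "the latest message per peer", an explicit aggregate states that intent, and SQLite can answer it by walking the messages_peer_mid index. A sketch, assuming the index covers a mid column (inferred from the index name):
-- 'mid' is an assumption based on the index name messages_peer_mid;
-- the aggregate can be computed by walking that index in order.
SELECT peer, MAX(mid) AS latest_mid
FROM messages
GROUP BY peer;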
Recently, I came across a pattern (not sure; it could be an anti-pattern) for sorting data in a SELECT query. The pattern is a verbose, non-declarative way of ordering data: dump the relevant data from the actual table into a temporary table, and then apply the ORDER BY to a field on the temporary table. I guess the only reason someone would do that is to improve performance (which I doubt); I see no other benefit.
For example, let's say there is a user table. The table might contain millions of rows. We want to retrieve all the users whose first name starts with 'G', sorted by first name. The natural, more declarative way to implement a SQL query for this scenario is:
More natural and declarative way
SELECT * FROM Users
WHERE NAME LIKE 'G%'
ORDER BY Name
Verbose way
SELECT * INTO TempTable
FROM Users
WHERE NAME LIKE 'G%'
SELECT * FROM TempTable
ORDER BY Name
With that context, I have a few questions:
1. Will there be any performance difference between the two ways if there is no index on the first name field? If yes, which one would be better?
2. Will there be any performance difference between the two ways if there is an index on the first name field? If yes, which one would be better?
3. Shouldn't the SQL Server optimizer generate the same execution plan for both ways?
4. Is there any benefit to writing it the verbose way from any other perspective, like locking/blocking?
Thanks in advance.
Regularly: an anti-pattern used by people without an idea of what they are doing.
SOMETIMES: OK, because SQL Server has a problem that is not resolvable otherwise; I haven't seen that case in years, though.
It makes things slower because it forces the tempdb table to be fully populated FIRST, whereas otherwise the query could POSSIBLY be resolved more efficiently.
The last time I saw that was about 3 years ago. We made it 3 times as fast by not being "smart" with a tempdb table ;)
Answers:
1: No, it still needs a table scan either way, obviously.
2: Possibly; it depends on the amount of data, but an index seek would deliver the data in order already (as the index is ordered by its content). See the sketch after this list.
3: No, obviously. Query plan optimization is statement by statement. By cutting the execution in two, the query optimizer CAN NOT merge the two statements into a single plan.
4: Only if you run into a query optimizer issue or a limitation on how many tables you can join; not in this degenerate case (degenerate in a technical sense, i.e. very simplistic). But if you need to join MANY, MANY tables, it may be better to go with an interim step.
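To make point 2 concrete, a sketch (the index name is made up): with an index on Name, the LIKE 'G%' predicate becomes a range seek that already returns rows in Name order, so the ORDER BY needs no extra sort:
-- The seek over the 'G' range delivers rows already sorted by Name,
-- so the ORDER BY in the natural query adds no sort operator.
CREATE INDEX IX_Users_Name ON Users (Name);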
If the field you want to ORDER BY is not indexed, you could put everything into a temp table, index it, and then do the ordering; it might be faster. You would have to test to make sure.
There is never any benefit of the second approach that I can think of.
It means that if the data is available pre-ordered, SQL Server can't take advantage of this, and an unnecessary blocking operator and additional sort are added to the plan.
In the case that the data is not available pre-ordered, SQL Server will sort it in a work table, either in memory or in tempdb, anyway, so adding an explicit #temp table just adds an unnecessary additional step.
Edit
I suppose one case where the second approach could give an apparent benefit might be if the presence of the ORDER BY caused SQL Server to choose a different plan that turned out to be suboptimal. In that case I would resolve it in a different way, by either improving statistics or using hints/query rewrites to avoid the undesired plan.
Does anyone know what the complexity is for the SQL LIKE operator for the most popular databases?
Let's consider the three core cases separately. This discussion is MySQL-specific, but it might also apply to other DBMSs, since indexes are typically implemented in a similar manner.
LIKE 'foo%' is quick if run on an indexed column. MySQL indexes are a variation of B-trees, so when performing this query the engine can simply descend the tree to the node corresponding to foo (or the first node with that prefix) and traverse the tree forward. All of this is very efficient.
LIKE '%foo' can't be accelerated by indexes and will result in a full table scan. If you have other criteria that can be evaluated using indexes, only the rows that remain after the initial filtering will be scanned.
There's a trick, though: if you need to do suffix matching (searching for file names with the extension .foo, for instance), you can achieve the same performance by adding a column with the same contents as the original one, but with the characters in reverse order.
ALTER TABLE my_table ADD COLUMN col_reverse VARCHAR (256) NOT NULL;
ALTER TABLE my_table ADD INDEX idx_col_reverse (col_reverse);
UPDATE my_table SET col_reverse = REVERSE(col);
Searching for rows with col ending in .foo then becomes:
SELECT * FROM my_table WHERE col_reverse LIKE 'oof.%'
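On newer MySQL versions (5.7 and up), a stored generated column can keep the reversed copy in sync automatically instead of relying on manual UPDATEs or triggers. A sketch under that version assumption:
-- MySQL 5.7+: the reversed copy maintains itself on every write,
-- and the stored generated column can be indexed like any other.
ALTER TABLE my_table
  ADD COLUMN col_reverse VARCHAR(256)
    GENERATED ALWAYS AS (REVERSE(col)) STORED,
  ADD INDEX idx_col_reverse (col_reverse);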
Finally, there's LIKE '%foo%', for which there are no shortcuts. If there are no other limiting criteria that reduce the number of rows to something feasible, it will cause a hard performance hit. You might want to consider a full-text search solution instead, or some other specialized solution.
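In MySQL, the built-in full-text facility is one such option; note that it matches whole words rather than arbitrary substrings, so it's a different tool rather than a drop-in replacement for LIKE '%foo%'. A sketch:
-- Full-text search matches word tokens, not substrings.
ALTER TABLE my_table ADD FULLTEXT INDEX ft_col (col);
SELECT * FROM my_table WHERE MATCH(col) AGAINST ('foo');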
If you are asking about the performance impact:
The problem with LIKE is that it can keep the database from using an index. On Oracle I think it doesn't use indexes anymore (but I'm still on Oracle 9). SQL Server uses indexes if the wildcard is only at the end. I don't know about other databases.
It depends on the RDBMS, the data (and possibly the size of the data), the indexes, and how the LIKE is used (with or without a prefix wildcard)!
You are asking too general a question.