Use of MD5(URL) instead of URL in DB for WHERE - sql

I have a big MySQL InnoDB table (about 1 million records, growing by 300K weekly), let's say with blog posts. This table has a url field with an index.
When adding new records I check for existing records with the same URL. Here is what the query looks like:
SELECT COUNT(*) FROM `tablename` WHERE url='http://www.google.com/';
Currently the system runs about 10-20 of these queries per second, and that rate will grow. I'm thinking about improving performance by adding an additional field containing the MD5 hash of the URL.
SELECT COUNT(*) FROM `tablename` WHERE md5url=MD5('http://www.google.com/');
The hash is shorter and has a constant length, which should suit the index better than the URL field. What do you guys think about it, does it make sense?
A friend of mine suggested using CRC32 instead of MD5, but I'm not sure how unique CRC32 results would be. Let me know what you think about CRC32 for this role.
UPDATE: the URL column is unique for each row.

Create a non-clustered index on URL. That will let your SQL engine do all the optimization internally and will produce the best results!
If you create an index on a VARCHAR column, SQL will create a hash internally anyway and using the index can give better performance by an order of magnitude or even more!
Also, something to keep in mind if you're only checking whether a URL exists, is that certain SQL products will produce faster results with a query like this:
IF NOT EXISTS(SELECT * FROM `tablename` WHERE url='')
-- return TRUE or do your logic here
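In MySQL specifically (which the OP is using), a roughly equivalent existence check could look like this; a minimal sketch, reusing the table and URL from the question:
-- returns a row only if the URL is already present
SELECT 1 FROM `tablename` WHERE url = 'http://www.google.com/' LIMIT 1;
-- or, always returning 0 or 1
SELECT EXISTS(SELECT 1 FROM `tablename` WHERE url = 'http://www.google.com/');
Either way the engine can stop at the first match instead of counting every one.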

I think CRC32 would actually be better for this role, as it's shorter and takes up less space. If you're receiving that many queries, the objective is to save space anyway? If it does the job, I'd say go for it.
Although, since it's only 32-bit and shorter in length, it's not as unique as MD5, of course. You will have to decide whether you want uniqueness or want to save space.
I still think I'd choose CRC32.
My system generates roughly 4k queries per second, and I use CRC32 for links.

Using the built-in indexing is always best, or you should volunteer to add to their codebase anyways ;)
When using a hash, create a 2-column index on the hash and the URL. If you index only the first few characters of the URL, a lookup still does a complete match, but the index doesn't store more than those first few characters.
Something like this:
INDEX(CRC32_col, URL_col(5))
Either hash would work in that case. It's a trade-off of space vs speed.
Also, this query will be much faster:
SELECT * FROM table WHERE hash_col = 'hashvalue' AND url_col = 'urlvalue' LIMIT 1;
This will find the first value and stop. Much faster than finding many matches for the COUNT(*) calculation.
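As a rough sketch of the whole approach in MySQL (this assumes MySQL 5.7+ for the generated column; the column and index names are made up, and on older versions you'd add a plain column and populate it yourself):
-- stored, automatically maintained hash of the URL, plus the composite index
ALTER TABLE `tablename`
  ADD COLUMN url_crc INT UNSIGNED AS (CRC32(url)) STORED,
  ADD INDEX idx_url_crc_url (url_crc, url(10));
-- the hash narrows the search, the URL comparison removes false positives
SELECT 1 FROM `tablename`
WHERE url_crc = CRC32('http://www.google.com/')
  AND url = 'http://www.google.com/'
LIMIT 1;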
Ultimately the best choice is to make test cases for each variant and benchmark.

Don't most SQL engines use hash functions internally for text column searches?

If you're going to use hashed keys and you're concerned about collisions, use two different hash functions and concatenate the two hashed values.
But even if you do this, you should always store the original key value in the row as well.

If the count returned by that SELECT tends to be rather high, an alternative solution would be to have a separate table which keeps track of the counts. Obviously there are penalties for using that technique, but if this specific query is a common one and is too slow, this might be a solution.
There are obvious trade-offs involved in this solution, and you probably do not want to update this second table after every individual insertion, as that would slow down your insertions.

If you choose a hash you need to take collisions into account. Even with a large hash like MD5 you have to account for the collision probability, better known as the birthday problem. With a smaller hash like CRC-32 the collision probability will be quite large, and your WHERE has to specify both the hash and the full URL.
But I gotta ask, is this the best way to spend your efforts? Is there nothing else left to optimize? You may well be doing premature optimization unless you have clear metrics and measurements indicating that this problem is the bottleneck of the system. After all, this kind of seek is what databases are optimized for (all of them), and by doing something like a hash you may actually decrease performance (e.g. your index may become fragmented because hashes have a different distribution than URLs).

Related

How can I measure the cost of a database index?

Is there a good method for judging whether the costs of creating a database index in Postgres (slower INSERTS, time to build an index, time to re-index) are worth the performance gains (faster SELECTS)?
I am actually going to disagree with Hexist. PostgreSQL's planner is pretty good, and it supports good sequential access to table files based on physical order scans, so indexes are not necessarily going to help. Additionally, there are many cases where the planner has to pick an index. And you are already getting indexes for unique constraints and primary keys.
I think one of the good default positions with PostgreSQL (MySQL, btw, is totally different!) is to wait until you need an index to add one, and then only add the indexes you most clearly need. This is, however, just a starting point, and it assumes either a general lack of experience in looking at query plans or a lack of understanding of where the application is likely to go. Having experience in these areas matters.
In general, where you have tables likely to span more than 10 pages (that's 40kb of data and headers), it's a good idea to index foreign keys. These can be assumed to be clearly needed. Small lookup tables spanning a single page should never have non-unique indexes, because those indexes are never going to be used for selects (no query plan beats a sequential scan over a single page).
Beyond that point you also need to look at data distribution. Indexing boolean columns is usually a bad idea and there are better ways to index things relating to boolean searches (partial indexes being a good example). Similarly indexing commonly used function output may seem like a good idea sometimes, but that isn't always the case. Consider:
CREATE INDEX gj_transdate_year_idx ON general_journal (extract('YEAR' FROM transdate));
This will not do much. However an index on transdate might be useful if paired with a sparse index scan via a recursive CTE.
Once the basic indexes are in place, then the question becomes what other indexes do you need to add. This is often better left to later use case review than it is designed in at first. It isn't uncommon for people to find that performance significantly benefits from having fewer indexes on PostgreSQL.
Another major thing to consider is what sort of indexes you create, and these are often use-case specific. A b-tree index on an array record, for example, might make sense if ordinality is important to the domain and you are frequently searching based on initial elements, but if ordinality is unimportant, I would recommend a GIN index, because a btree will do very little good (of course that is an atomicity red flag, but sometimes that makes sense in Pg). Even when ordinality is important, sometimes you need GIN indexes anyway, because you need to be able to do commutative scans as if ordinality were not important. This is true if using ip4r, for example, to store cidr blocks and using an EXCLUDE constraint to ensure that no block contains any other block (the actual scan requires using an overlap operator rather than a contain operator, since you don't know on which side of the operator the violation will be found).
Again this is somewhat database-specific. On MySQL, Hexist's recommendations would be correct, for example. On PostgreSQL, though, it's good to watch for problems.
As far as measuring goes, the best tool is EXPLAIN ANALYZE.
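A minimal sketch of its use, borrowing the general_journal table mentioned above (the date literal is arbitrary):
-- shows the plan the planner chose plus the actual row counts and timings
EXPLAIN ANALYZE SELECT * FROM general_journal WHERE transdate >= '2012-01-01';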
Generally speaking, unless you have a log or archive table that you won't be doing selects on very frequently (or it's OK if they take a while to run), you should index anything your select/update/delete statements will be using in a where clause.
This however is not always as simple as it seems, as just because a column is used in a where clause and is indexed doesn't mean the SQL engine will be able to use the index. Using the EXPLAIN and EXPLAIN ANALYZE capabilities of PostgreSQL you can examine which indexes were used in selects, which helps you figure out whether having an index on a column will even help you.
This is generally true because without an index your select speed goes from some O(log n)-looking operation down to O(n), while your insert speed only improves from c*O(log n) to d*O(log n), where d is usually less than c. I.e. you may speed up your inserts a little by not having an index, but you're going to kill your select speed if the data isn't indexed, so it's almost always worth it to have an index on your data if you're going to be selecting against it.
Now, if you have some small table that you do a lot of inserts and updates on, frequently remove all the entries from, and only periodically do some selects on, it could turn out to be faster not to have any indexes. However, that would be a fairly special-case scenario, so you'd have to do some benchmarking and decide whether it makes sense in your specific case.
Nice question. I'd like to add a bit more to what @hexist has already mentioned and to the info provided by @ypercube's link.
By design, the database doesn't know in which part of the table it will find the data that satisfies the provided predicates. Therefore, the DB will perform a full or sequential scan of all the table's data, filtering out the rows it doesn't need.
An index is a special data structure that, for a given key, can specify precisely in which rows of the table such values will be found. The main differences when an index is involved:
there is a cost for the index scan itself, i.e. DB has to find a value in the index first;
there's an extra cost of reading specific data from the table itself.
Working with an index leads to a random I/O pattern, compared to the sequential one used in a full scan. You can google for comparison figures on random versus sequential disk access, but they can differ by up to an order of magnitude (random being slower, of course).
Still, it's clear that in some cases index access will be cheaper and in others a full scan should be preferred. This depends on how many rows (out of the total) the specified predicate will return, i.e. on its selectivity:
if the predicate returns a relatively small number of rows, say, less than 10% of the total, then it seems valuable to pick those directly via the index. This is the typical case for primary/unique keys or queries like: I need the address information for the customer with internal number = XXX;
if the predicate has no big impact on the selectivity, i.e. 30% (or more) of the rows are returned, then it's cheaper to do a full scan, because sequential disk access will beat random access and data will be delivered faster. All reports covering big ranges (like a month, or all customers) fall here;
if there's a need to obtain an ordered list of values and there's an index, then doing an index scan is the fastest option. This is a special case of #2, when you need report data ordered by some column;
if the number of distinct values in the column is relatively small compared to the total number of values, then an index will be a good choice. This is a case called a Loose Index Scan, and typical queries look like: I need the 20 most recent purchases for each of the top 5 categories by number of goods.
How does the DB decide what to do, index or full scan? It's a runtime decision based on statistics, so make sure to keep those up to date. In fact, the numbers provided above have no real-life value; you have to evaluate each query independently.
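For example, refreshing the statistics for one table is a single statement (a sketch, reusing the hypothetical tablename from the first question; PostgreSQL normally handles this for you via autovacuum):
ANALYZE tablename;            -- PostgreSQL
ANALYZE TABLE tablename;      -- MySQL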
All this is a very rough description of what happens. I would very much recommend looking into How the PostgreSQL Planner Uses Statistics; it's the best treatment of the subject I've seen.

Multiple indexes on one column

Using Oracle, there is a table called User.
Columns: Id, FirstName, LastName
Indexes: 1. PK(Id), 2. UPPER(FirstName), 3. LOWER(FirstName), 4. Index(FirstName)
As you can see index 2, 3, 4 are indexes on the same column - FirstName.
I know this creates overhead, but my question is: when selecting, how will the database react/optimize?
For instance:
SELECT Id FROM User u WHERE
u.FirstName LIKE 'MIKE%'
Will Oracle hit the right index or not?
The problem is that via Hibernate (which uses prepared statements) this query slows down VERY much.
Thanks.
UPDATE: Just to clarify indexes 2 and 3 are functional indexes.
In addition to Mat's point that either index 2 or 3 should be redundant because you should choose one approach to doing case-insensitive searches and to Richard's point that it will depend on the selectivity of the index, be aware that there are additional concerns when you are using the LIKE clause.
Assuming you are using bind variables (which it sounds like you are based on your use of prepared statements), the optimizer has to guess at how selective the actual bind value is going to be. Something short like 'S%' is going to be very non-selective, causing the optimizer to generally prefer a table scan. A longer string like 'Smithfield-Manning%', on the other hand, is likely to be very selective and would likely use index 4. How Oracle handles this variability will depend on the version.
In Oracle 10, Oracle introduced bind variable peeking. This meant that the first time Oracle parsed a query after a reboot (or after the query plan being aged out of the shared pool), Oracle looked at the bind value and decided what plan to use based on that value. Assuming that most of your queries would benefit from the index scan because users are generally searching on relatively selective values, this was great if the first query after a reboot had a selective condition. But if you got unlucky and someone did a WHERE firstname LIKE 'S%' immediately after a reboot, you'd be stuck with the table scan query plan until the query plan was removed from the shared pool.
Starting in Oracle 11, however, the optimizer has the ability to do adaptive cursor sharing. That means that the optimizer will try to figure out that WHERE firstname LIKE 'S%' should do a table scan and WHERE firstname LIKE 'Smithfield-Manning%' should do an index scan and will maintain multiple query plans for the same statement in the shared pool. That solves most of the problems that we had with bind variable peeking in earlier versions.
But even here, the accuracy of the optimizer's selectivity estimates is generally going to be problematic for medium-length strings. It's generally going to know that a single-character string is very weakly selective and that a 20-character string is highly selective, but even with a 256-bucket histogram, it's not going to have a whole lot of information about how selective something like WHERE firstname LIKE 'Smit%' really is. It may know roughly how selective 'Sm%' is based on the column histogram, but it's guessing rather blindly at how selective the next two characters are. So it's not uncommon to end up in a situation where most of the queries work efficiently but the optimizer is convinced that WHERE firstname LIKE 'Cave%' isn't selective enough to use an index.
Assuming that this is a common query, you may want to consider using Oracle's plan stability features to force Oracle to use a particular plan regardless of the value of a bind variable. This may mean that users that enter a single character have to wait even longer than they would otherwise have waited because the index scan is substantially less efficient than doing a table scan. But that may be worth it for other users that are searching for short but reasonably distinctive last names. And you may do things like add a ROWNUM limiter to the query or add logic to the front end that requires a minimum number of characters in the search box to avoid situations where a table scan would be more efficient.
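A sketch of the ROWNUM limiter mentioned above (the bind variable name and the cap of 200 are made up):
SELECT id FROM user u WHERE u.firstname LIKE :name_prefix AND ROWNUM <= 200;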
It's a bit strange to have both the upper and lower function-based indexes on the same field. And I don't think the optimizer will use either in your query as it is.
You should pick one or the other (and probably drop the last one too), and only ever query on the upper (or lower)-case with something like:
select id from user u where upper(u.firstname) like 'MIKE%'
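The matching function-based index would then be something along these lines (the index name is made up):
CREATE INDEX user_upper_firstname_idx ON user (UPPER(firstname));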
Edit: look at this post too, it has some interesting info: How to use a function-based index on a column that contains NULLs in Oracle 10+?
It may not hit any of your indexes, because you are returning ID in the SELECT clause, which is not covered by the indexes.
If the index is very selective, and Oracle decides it is still worthwhile to use it to find 'MIKE%' and then perform a lookup on the table to get the ID column, then it may use 4. Index(FirstName). Indexes 2 and 3 will only be used if the column searched uses the exact function defined in the index.

Maximizing performance of SQL indexes for VARCHAR columns with homogeneous prefixes

I'm designing a DB2 table, one VARCHAR column of which will store an alpha-numeric product identifier. The first few characters of these IDs have very little variation. The column will be indexed, and I'm concerned that performance may suffer because of the common prefixes.
As far as I can tell, DB2 does not use hash codes for selecting VARCHARs. (At least basic DB2, I don't know about any extensions.)
If this is to be a problem, I can think of three obvious solutions.
Create an extra, hash code column.
Store the text backward, to ensure good distribution of initial characters.
Break the product IDs into two columns, one containing a long enough prefix to produce better distribution in the remainder.
Each of these would be a hack, of course.
Solution #2 would provide the best key distribution. The backwards text could be stored in a separate column, or I could reverse the string after reading. Each approach involves overhead, which I would want to profile and compare.
With solution #3, the key distribution still would be non-optimal, and I'd need to concatenate the text after reading, or use 3 columns for the data.
If I leave my product IDs as-is, is my index likely to perform poorly? If so, what is the best method to optimize the performance?
I'm a SQL Server DBA, not a DB2 one, but I wouldn't think that having common prefixes would hurt you at all, indexing-wise.
The index pages simply store a "from" and "to" range of key values with pointers to the actual pages. The fact that an index page happens to store FrobBar001291 to FrobBar009281 shouldn't matter in the slightest to the db engine.
In fact, having these common prefixes allows the index to take advantage of other queries like:
SELECT * FROM Products WHERE ProductID LIKE 'FrobBar%'
I agree with BradC that I don't think this is a problem at all, and even if there was some small benefit to the alternatives you suggest, I imagine all the overhead and complexity would outweigh any benefits.
If you're looking to understand and improve index performance, there are a number of topics in the Info Center (http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/nav/2_3_2_4_1) that you should consider, in particular the last two below:
Index structure
Index cleanup and maintenance
Asynchronous index cleanup
Asynchronous index cleanup for MDC tables
Online index defragmentation
Using relational indexes to improve performance
Relational index planning tips
Relational index performance tips

How to tell if a query will scale well?

What are some of the methods/techniques experienced SQL developers use to determine whether a particular SQL query will scale well as load increases, rows in associated tables increase, etc.?
Some rules that I follow that make the most difference.
Don't use per-row functions in your queries like if, case, coalesce and so on. Work around them by putting data in the database in the format you're going to need it, even if that involves duplicate data.
For example, if you need to look up surnames fast, store them both in the entered form and in their lowercase form, and index the lowercase form. Then you don't have to worry about things like select * from tbl where lower(surname) = 'smith';
Yes, I know that breaks 3NF but you can still guarantee data integrity by judicious use of triggers or pre-computed columns. For example, an insert/update trigger on the table can force the lower_surname column to be set to the lowercase version of surname.
This moves the cost of conversion to the insert/update (which happens infrequently) and away from the select (which happens quite a lot more). You basically amortise the cost of conversion.
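A minimal MySQL sketch of the trigger variant, keeping the tbl/lower_surname names from above (the trigger names and column size are made up; a stored generated column would be an alternative on MySQL 5.7+):
-- keep a lowercase copy of the surname and index it
ALTER TABLE tbl
  ADD COLUMN lower_surname VARCHAR(100),
  ADD INDEX idx_lower_surname (lower_surname);
-- pay the conversion cost once, at write time
CREATE TRIGGER tbl_bi BEFORE INSERT ON tbl
FOR EACH ROW SET NEW.lower_surname = LOWER(NEW.surname);
CREATE TRIGGER tbl_bu BEFORE UPDATE ON tbl
FOR EACH ROW SET NEW.lower_surname = LOWER(NEW.surname);
-- the frequent query now hits the index directly
SELECT * FROM tbl WHERE lower_surname = 'smith';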
Make sure that every column used in a where clause is indexed. Not necessarily on its own but at least as the primary part of a composite key.
Always start off in 3NF and only revert if you have performance problems (in production). 3NF is often the easiest to handle and reverting should only be done when absolutely necessary.
Profile, in production (or elsewhere, as long as you have production data and schemas). Database tuning is not a set-and-forget operation unless the data in your tables never changes (very rare). You should be monitoring, and possibly tuning, periodically to avoid the possibility that changing data will bring down performance.
Don't, unless absolutely necessary, allow naked queries to your database. Try to control what queries can be run. Your job as a DBA will be much harder if some manager can come along and just run:
select * from very_big_table order by column_without_index;
on your database.
If managers want to be able to run ad-hoc queries, give them a cloned DBMS (or replica) so that your real users (the ones that need performance) aren't affected.
Don't use union when union all will suffice. If you know that there can be no duplicates between two selects of a union, there's no point letting the DBMS try to remove them.
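For example (the table names are made up):
-- no duplicate-elimination step, so no extra sort or hash
SELECT customer_id FROM current_orders
UNION ALL
SELECT customer_id FROM archived_orders;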
Similarly, don't use select distinct on a table if you're retrieving all the primary key columns (or all columns in a unique constraint). There is no possibility of duplicates in those cases so, again, you're asking the DBMS to do unnecessary work.
Example: we had a customer with a view using select distinct * on one of their tables. Querying the view took 50 seconds. When we replaced it with a view starting select *, the time came down to sub-second. Needless to say, I got a good bottle of red wine out of that :-)
Try to avoid select * as much as possible. In other words, only get the columns you need. This makes little difference when you're using MySQL on your local PC but, when you have an app in California querying a database in Inner Mongolia, you want to minimise the amount of traffic being sent across the wire as much as possible.
Don't make tables wide; keep them narrow, and keep the indexes narrow as well. Make sure that queries are fully covered by indexes and that those queries are SARGable.
Test with a ton of data before going into production. Take a look at this: your testbed has to have the same volume of data as production in order to simulate normal usage.
Pull up the execution plan and look for any of the following:
Table Scan
[Clustered] Index Scan
RID Lookup
Bookmark Lookup
Key Lookup
Nested Loops
Any of those things (in descending order from most to least scalable) mean that the database/query likely won't scale to much larger tables. An ideal query will have mostly index seeks, hash or merge joins, the occasional sort, and other low-impact operations (spools and so on).
The only way to prove that it will scale, as other answers have pointed out, is to test it on data of the desired size. The above is just a rule of thumb.
In addition (and along the same lines) to Robert's suggestion, consider the execution plan. Is it utilizing indexes? Are there any scans or such? Can you simplify the query in any way? For example, eliminate IN in favor of EXISTS, and only join to tables you actually need to join to.
You don't mention the technology -- keep in mind that different technologies can affect the efficiency of more complex queries.
I strongly recommend reading some reference material on this. This (hyperlink below) is probably a pretty good book to look into. Make sure to look under "Selectivity", among other topics.
SQL Tuning - Dan Tow

Using IN or a text search

I want to search a table to find all rows where one particular field is one of two values. I know exactly what the values would be, but I'm wondering which is the most efficient way to search for them:
For the sake of example, the two values are "xpoints" and "ypoints". I know for certain that there will be no other values in that field that end with "points", so the two queries I'm considering are:
WHERE `myField` IN ('xpoints', 'ypoints')
--- or...
WHERE `myField` LIKE '_points'
which would give the best results in this case?
As always with SQL queries, run it through the profiler to find out. However, my gut instinct would have to say that the IN search would be quicker. Especially in the example you gave: if the field is indexed, it would only have to do 2 lookups. If you did a LIKE search, it may have to do a scan, because you are looking for records that end with a certain value. It would also be more accurate, as LIKE '_points' could also return 'gpoints' or any other similar string.
Unless all of the data items in the column in question start with 'x' or 'y', I believe IN will always give you a better query. If the column is indexed, as @Kibbee points out, you will only have to perform 2 lookups to get both. Alternatively, if it is not indexed, a table scan using IN will only have to check the first letter most of the time, whereas with LIKE it will have to check two characters every time (assuming all items are at least 2 characters), since the first character is allowed to be anything.
Try it and see. Create a large amount of test data, and try it with and without an index on myField. While you are at it, see if there's a noticeable difference between
LIKE 'xpoint_' and LIKE '_points'.
It depends on what the optimizer does with each query.
For small amounts of data, the difference will be negligible. Do whichever one makes more sense. For large amounts of data the amount of disk I/O matters much more than the amount of CPU time.
I'm betting that IN will get you better results than LIKE, if there is an index on myfield. I'm also betting that 'xpoint_' runs faster than '_points'. But there's nothing like trying it yourself.
MySQL can't use an index when using string comparisons such as LIKE '%foo' or '_foo', but can use an index for comparisons like 'foo%' and 'foo_'.
So in your case, IN will be much faster assuming that the field is indexed.
If you're working with a limited set of possible values, it's worth specifying the field as an ENUM - MySQL will then store it internally as an integer and make this sort of lookup much faster, and save disk space.
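A sketch of that change (the table name is made up, and this assumes the column really only ever holds those two values):
ALTER TABLE mytable MODIFY `myField` ENUM('xpoints', 'ypoints');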
The IN version will be faster than the LIKE version, especially since your wildcard isn't at the end of the comparison. But even under ideal conditions IN would still win, at least up until your IN list nears the maximum allowed query size.