To create index on a varchar column or not - Access - vb.net

I have read a couple of articles about when to create an index on a column, but all of them were about MySQL, SQL Server or Oracle. I now have a fair idea of whether I should create an index on my column, but I would like a learned opinion on it before I actually try it.
I have an MS Access database with around 15 tables. All tables have a column called [Locations], and this column is used in almost all WHERE clauses and most of the JOIN conditions. The column currently has 5 distinct values, i.e. the 5 locations A, B, C, D, E.
So my question is: even though this column is part of most WHERE clauses and JOINs, the limited variety of values (just 5) makes me hesitate to create an index on it.
Please advise.

It is important to bear in mind that an Access database is a "peer-to-peer" (as opposed to "client-server") database, so table scans can be particularly detrimental to performance, especially if the back-end database file is on a network share. Therefore it is always a good idea to ensure that there are indexes on all fields that participate in WHERE clauses or the ON conditions of JOINs.
Example: I have a sample table with one million rows and a field named [Category] that contains the value 'A' or 'B'. Without an index on the [Category] field the query
SELECT COUNT(*) AS n FROM [TestData] WHERE [Category] = 'B'
had to do a table scan and that generated about 48 MB of total network traffic. Simply adding an index on [Category] reduced the total network traffic to 0.27 MB for the exact same query.
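For reference, such an index can be created with plain DDL in Access SQL (a minimal sketch; the index names are arbitrary, and [Table1] stands in for any of the 15 tables with a [Locations] column):
CREATE INDEX idxCategory ON TestData ([Category]);
CREATE INDEX idxLocations ON Table1 ([Locations]);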

Related

Oracle multiple vs single column index

Imagine I have a table with the following columns:
Column: A (number(10)) (PK)
Column: B (number(10))
Column: C (number(10))
CREATE TABLE schema_name.table_name (
    column_a number(10) primary key,
    column_b number(10),
    column_c number(10)
);
Column A is my PK.
Imagine my application now has a flow that queries by B and C. Something like:
SELECT * FROM SCHEMA.TABLE WHERE B=30 AND C=99
If I create an index only using the Column B, this will already improve my query right?
The strategy behind this query would benefit from the index on column B?
Q1 - If so, why should I create an index with those two columns?
Q2 - If I decided to create an index with B and C, If I query selecting only B, would this one be affected by the index?
The simple answers to your questions.
For this query:
SELECT *
FROM SCHEMA.TABLE
WHERE B = 30 AND C = 99;
The optimal index is either (B, C) or (C, B). The order does not matter here, because both comparisons are equality comparisons (=).
An index on either column can be used, but all the matching values will need to be scanned to compare to the second value.
If you have an index on (B, C), then this can be used for a query on WHERE B = 30. Oracle also implements a skip-scan optimization, so it is possible that the index could also be used for WHERE C = 99 -- but it probably would not be.
I think the documentation for MySQL has a good introduction to multi-column indexes. It doesn't cover the skip-scan but is otherwise quite applicable to Oracle.
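To make that concrete, the composite index for the example table could be created like this (a sketch; the index name is arbitrary):
CREATE INDEX ix_table_b_c ON schema_name.table_name (column_b, column_c);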
Short answer: always check real performance, not theory. In other words, my answer needs to be verified against a real database.
In SQL databases (Oracle, PostgreSQL, SQL Server, etc.) the primary key is used for at least two purposes:
Ordering of rows (e.g. if the PK only ever increments, new values are simply appended)
Linking to rows. Any extra index will contain the whole PK, so that it is possible to jump from the secondary index back to the full row.
If I create an index only using the Column B, this will already improve my query right?
The strategy behind this query would benefit from the index on column B?
It depends. If your table is small, Oracle may simply full-scan it. For a large table, Oracle can (and in the common scenario will) use the index on column B and then do a range scan, checking all rows with B = 30. So if only one row has B = 30 you get good performance; if millions of rows match, Oracle will need to do millions of reads. Oracle estimates this from the table statistics.
Q1 - If so, why should I create an index with those two columns?
It gives direct access to the row: in this case Oracle requires just a few jumps to find your row. Moreover, you can apply the UNIQUE modifier to help Oracle; it will then know that no more than a single row can be returned.
However, if your table has other columns, the real execution plan will also include an access via the PK (to retrieve those other columns).
If I decided to create an index with B and C, If I query selecting only B, would this one be affected by the index?
Yes. If an index has several columns, Oracle sorts its entries according to the column ordering. E.g. if you create an index with columns (B, C), Oracle will be able to use it to evaluate predicates like B = 30, i.e. when you restrict only B.
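As mentioned above, the unique variant (an alternative to the plain composite index, valid only if the (B, C) combination really is unique) would be declared like this:
CREATE UNIQUE INDEX ux_table_b_c ON schema_name.table_name (column_b, column_c);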
Well, it all depends.
If that table is tiny, you won't see any benefit regardless of any indexes you might create - it is just too small and Oracle returns the data immediately.
If the table is huge, then it depends on the column's selectivity. There's no guarantee that Oracle will ever use that index. If the optimizer decides (based on the information it has - don't forget to regularly collect statistics!) that the index should not be used, then you created it in vain (though you can use a hint to force it, but - unless you know what you're doing - don't).
How will you know what's going on? See the explain plan.
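For instance, a minimal sketch against the example table (EXEC is SQL*Plus / SQL Developer shorthand; names follow the example above):
EXEC DBMS_STATS.GATHER_TABLE_STATS('SCHEMA_NAME', 'TABLE_NAME');
EXPLAIN PLAN FOR
    SELECT * FROM schema_name.table_name WHERE column_b = 30 AND column_c = 99;
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);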
But, generally speaking, yes - indexes help.
Q1 - If so, why should I create an index with those two columns?
Which "two columns"? A? If it is a primary key column, Oracle automatically creates an index, you don't have to do that.
Q2 - If I decided to create an index with B and C, If I query selecting only B, would this one be affected by the index?
If you are talking about a composite index (containing both the B and C columns, in that order), and the query uses column B, then yes - the index will (OK, might) be used. But if the query uses only column C, then this index will be completely useless.
In spite of this question being answered and one answer being accepted already, I'll just throw in some more information :-)
An index is an offer to the DBMS that it can use to access data quicker in some situations. Whether it actually uses the index is a decision made by the DBMS.
Oracle has a built-in optimizer that looks at the query and tries to find the best execution plan to get the results you are after.
Let's say that 90% of all rows have B = 30 AND C = 99. Why then should Oracle laboriously walk through the index only to end up accessing almost every row in the table anyway? So, even with an index on both columns, Oracle may decide not to use the index at all - and may even perform the query faster because of that decision.
Now to the questions:
If I create an index only using the Column B, this will already improve my query right?
It may. If Oracle thinks that B = 30 reduces the number of rows it has to read from the table immensely, it will.
If so, why should I create an index with those two columns?
If the combination of B = 30 AND C = 99 limits the rows to read from the table further, it's a good idea to use this index instead.
If I decided to create an index with B and C, If I query selecting only B, would this one be affected by the index?
If the index is on (B, C), i.e. B first, then Oracle may find it useful, yes. In the extreme case that there are only the two columns in the table, that would even be a covering index (i.e. containing all columns accessed in the query) and the DBMS wouldn't have to read any table row, as all the information is already in the index itself. If the index is (C, B), i.e. C first, it is quite unlikely that the index would be used. In some edge-case situations, Oracle might do so, though.
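To illustrate the covering case, assuming the composite index on (B, C) from above exists, a query that touches only those two columns can be answered from the index alone:
-- No table access needed: every referenced column is in the index
SELECT column_b, column_c
FROM schema_name.table_name
WHERE column_b = 30 AND column_c = 99;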

Finding the "next 25 rows" in Oracle SQL based on an indexed column

I have a large table (~200M rows) that is indexed on a numeric column, Z. There is also an index on the key column, K.
K Z
= ==========================================
1 0.6508784068583483336644518457703156855132
2 0.4078768075307567089075462518978907890789
3 0.5365440453204830852096396398565048002638
4 0.7573281573257782352853823856682368153782
What I need to be able to do is find the 25 records "surrounding" a given record. For instance, the "next" record starting at K=3 would be K=1, followed by K=4.
I have been led by several sources (most notably this paper from some folks at Florida State University) to believe that SQL like the following should work. It's not hard to imagine that scanning along the indexed column in ascending or descending order would be efficient.
select * from (
    select *
    from T
    where Z >= [origin's Z value]
    order by Z asc
) where rownum <= 25;
In theory, this should find the 25 "next" rows, and a similar variation would find the 25 "previous" rows. However, this can take minutes and the explain plan consistently contains a full table scan. A full table scan is simply too expensive for my purpose, but nothing I do seems to prompt the query optimizer to take advantage of the index (short, of course, of changing the ">=" above to an equals sign, which indicates that the index is present and operational). I have tried several hints to no avail (index, index_asc in several permutations).
Is what I am trying to do impossible? If I were trying to do this on a large data structure over which I had more control, I'd build a linked list on the indexed column's values and a tree to find the right entry point. Then traversing the list would be very inexpensive (yes I might have to run all over the disk to find the records I'm looking for, but I surely wouldn't have to scan the whole table).
I'll add in case it's important to my query that the database I'm using is running Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit.
I constructed a small test case with 10K rows. When I populated the table such that the Z values were already ordered, the exact query you gave tended to use the index. But when I populated it with random values, and refreshed the table statistics, it started doing full table scans, at least for some values of n larger than 25. So there is a tipping point at which the optimizer decides that the amount of work it will do to look up index entries then find the corresponding rows in the table is more than the amount of work to do a full scan. (It might be wrong in its estimate, of course, but that is what it has to go on.)
I noticed that you are using SELECT *, which means the query returns both columns. This means the actual table rows must be accessed, since neither index includes both columns. That might push the optimizer towards preferring a full table scan for larger sample sizes. If the query could be fulfilled from the index alone, it would be more likely to use the index.
One possibility is that you don't really need to return the values of K at all. If so, I'd suggest that you change both occurrences of SELECT * to SELECT z. In my test, this change caused a query that had been doing a full table scan to use an index scan instead (and not access the table itself at all).
If you do need to include K in the result, then you might try creating an index on (Z, K). This index could be used to satisfy the query without accessing the table.
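A minimal sketch of that suggestion (the index name is arbitrary, and :origin_z stands in for the origin's Z value):
CREATE INDEX t_z_k_ix ON T (Z, K);

select * from (
    select K, Z
    from T
    where Z >= :origin_z
    order by Z asc
) where rownum <= 25;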

how to speed up a clustered index scan while selecting all fields on range of rows or all the rows

I have a table
Books(BookId, Name, ...... , PublishedYear)
I do have about 30 fields in my Books table, where BookId is the primary key (Identity column). I have about 2 million records for this table.
I know select * is an evil performance killer..
I have a situation where I need to select a range of rows, or all the rows, with all of their columns.
Select * from Books;
this query takes more than 2 seconds to scan through the data pages and get all the records. Checking the execution plan, it still uses a clustered index scan.
Obviously 2 seconds may not be that bad, but when this table has to be joined with other tables in a batch, execution takes over 15 minutes (there are no duplicate records in the final result - the counts match). The join criteria are pretty simple and yield no duplication.
With this table excluded, the batch completes in sub-second time.
Is there a way to optimize this, given that I will have to select all the columns? :(
Thanks in advance.
I've just run a batch against my developer instance, one SELECT specifying all columns and one using *. There is no evidence (nor should there be) of any difference aside from the raw parsing of my input. If I remember correctly, that old saying really means: do not SELECT columns you are not using - they use up resources without benefit.
When you try to improve performance in your code, always check your assumptions; they might only apply to some older version (of SQL Server, etc.) or to some other method.
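One quick way to check this on your own instance is to compare the two forms with statistics turned on (a sketch; the explicit column list is shortened for illustration):
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT * FROM Books;
SELECT BookId, Name, PublishedYear FROM Books;  -- explicit column list

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;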

TSQL join performance

My problem is that this query takes forever to run:
Select
tableA.CUSTOMER_NAME,
tableB.CUSTOMER_NUMBER,
TableB.RuleID
FROM tableA
INNER JOIN tableB on tableA.CUST_PO_NUMBER like tableB.CustomerMask
Here is the structure of the tables:
CREATE TABLE [dbo].[TableA](
[CUSTOMER_NAME] [varchar](100) NULL,
[CUSTOMER_NUMBER] [varchar](50) NULL,
[CUST_PO_NUMBER] [varchar](50) NOT NULL,
[ORDER_NUMBER] [varchar](30) NOT NULL,
[ORDER_TYPE] [varchar](30) NULL)
CREATE TABLE [dbo].[TableB](
[RuleID] [varchar](50) NULL,
[CustomerMask] [varchar](500) NULL)
TableA has 14 million rows and TableB has 1000 rows. Data in the CustomerMask column can be anything like '%', 'ttt%', '%ttt%', etc.
How can I tune it to make it faster?
Thanks!
The short answer is don't use the LIKE operator to join two tables containing millions of rows. It's not going to be fast, no matter how you tune it. You might be able to improve it incrementally, but it will just be putting lipstick on a pig.
You need to have a distinct value on which to join the tables. Right now it has to do a complete scan of TableA and an item-by-item wildcard comparison between CUST_PO_NUMBER and CustomerMask. You're looking at 14 billion comparisons, all using the slow LIKE operator.
The only suggestion I can give is to re-think the architecture of associating rules with Customers.
While you can't change what's already there, you can create a new table like this:
CREATE TABLE [dbo].[TableC](
[CustomerMask] [varchar](500) NULL,
[CUST_PO_NUMBER] [varchar](50) NOT NULL)
Then have a trigger on both TableA and TableB that inserts / updates / deletes records in TableC whenever rows start or stop matching the condition CUST_PO_NUMBER LIKE CustomerMask (for the trigger on TableB you only need to update TableC if the CustomerMask field has changed).
Then your query will just become:
SELECT
tableA.CUSTOMER_NAME,
tableB.CUSTOMER_NUMBER,
TableB.RuleID
FROM tableA
INNER JOIN tableC on tableA.CUST_PO_NUMBER = tableC.CUST_PO_NUMBER
INNER JOIN tableB on tableC.CustomerMask = tableB.CustomerMask
This will greatly improve your query performance and it shouldn't greatly affect your write performance. You will basically only be performing the like query once for each record (unless they change).
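A minimal sketch of the INSERT trigger on TableA (illustrative only; a complete solution also needs UPDATE/DELETE handling and the corresponding trigger on TableB):
CREATE TRIGGER trg_TableA_Insert ON dbo.TableA
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Record every mask that the newly inserted PO numbers match
    INSERT INTO dbo.TableC (CustomerMask, CUST_PO_NUMBER)
    SELECT b.CustomerMask, i.CUST_PO_NUMBER
    FROM inserted i
    INNER JOIN dbo.TableB b
        ON i.CUST_PO_NUMBER LIKE b.CustomerMask;
END;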
Just change the join order and it will be faster - enjoy! Use this query:
Select tableA.CUSTOMER_NAME, tableB.CUSTOMER_NUMBER, TableB.RuleID
FROM tableB
INNER JOIN tableA
on tableB.CustomerMask like tableA.CUST_PO_NUMBER
Am I missing something? What about the following:
Select
tableA.CUSTOMER_NAME,
tableA.CUSTOMER_NUMBER,
tableB.RuleID
FROM tableA, tableB
WHERE tableA.CUST_PO_NUMBER = tableB.CustomerMask
EDIT2: Thinking about it, how many of those masks start and end with wildcards? You might gain some performance by first:
Indexing CUST_PO_NUMBER
Creating a persisted computed column CUST_PO_NUMBER_REV that's the reverse of CUST_PO_NUMBER
Indexing the persisted column
Putting statistics on these columns
Then you might build three queries, and UNION ALL the results together:
SELECT ...
FROM ...
ON CUST_PO_NUMBER LIKE CustomerMask
WHERE /* First character of CustomerMask is not a wildcard but last one is */
UNION ALL
SELECT ...
FROM ...
ON CUST_PO_NUMBER_REV LIKE REVERSE(CustomerMask)
WHERE /* Last character of CustomerMask is not a wildcard but first one is */
UNION ALL
SELECT ...
FROM ...
ON CUST_PO_NUMBER LIKE CustomerMask
WHERE /* Everything else */
That's just a quick example, you'll need to take some care that the WHERE clauses give you mutually exclusive results (or use UNION, but aim for mutually exclusive results first).
If you can do that, you should have two queries using index seeks and one query using index scans.
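A sketch of the supporting DDL for the numbered steps above (index names are arbitrary):
ALTER TABLE dbo.TableA
    ADD CUST_PO_NUMBER_REV AS REVERSE(CUST_PO_NUMBER) PERSISTED;

CREATE INDEX IX_TableA_CUST_PO_NUMBER ON dbo.TableA (CUST_PO_NUMBER);
CREATE INDEX IX_TableA_CUST_PO_NUMBER_REV ON dbo.TableA (CUST_PO_NUMBER_REV);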
EDIT: You can implement a sharding system to spread out the customers and customer masks tables across multiple servers and then have each server evaluate 1/n% of the results. You don't need to partition the data -- simple replication of the entire contents of each table will do. Link the servers to your main server and you can do something to the effect of:
SELECT ... FROM OPENQUERY(LinkedServer1, 'SELECT ... LIKE ... WHERE ID BETWEEN 0 AND 99')
UNION ALL
SELECT ... FROM OPENQUERY(LinkedServer2, 'SELECT ... LIKE ... WHERE ID BETWEEN 100 AND 199')
Note: the OPENQUERY may be extraneous, SQL Server might be smart enough to evaluate queries on remote servers and stream the results back. I know it doesn't do that for JET linked servers, but it might handle its own kind better.
That, or throw more hardware at the problem.
You can create an Indexed View of your query to improve performance.
From Designing Indexed Views:
For a standard view, the overhead of dynamically building the result set for each query that references a view can be significant for views that involve complex processing of large numbers of rows, such as aggregating lots of data, or joining many rows. If such views are frequently referenced in queries, you can improve performance by creating a unique clustered index on the view. When a unique clustered index is created on a view, the result set is stored in the database just like a table with a clustered index is stored.
Another benefit of creating an index on a view is that the optimizer starts using the view index in queries that do not directly name the view in the FROM clause. Existing queries can benefit from the improved efficiency of retrieving data from the indexed view without having to be recoded.
This should improve the performance of this particular query, but note that inserts, updates and deletes on the tables it uses may be slowed down.
You can't use LIKE if you care about performance.
If you are trying to do approximate string matching (e.g. Test, est, best, etc.) and you don't want to use SQL full-text search, take a look at this article.
At least you can shortlist approximate matches, then run your wildcard test on them.
--EDIT 2--
Your problem is interesting in the context of your limitation. Thinking about it again, I am pretty sure that using 3-grams would boost the performance (going back to my initial suggestion).
Let's say you set up your 3-gram data; you'll have the following tables:
Customer : 14M
Customer3Grams : Maximum 700M //Considering the field is varchar(50)
3Grams : 78
Pattern : 1000
Pattern3Grams : 50K
To join patterns to customers you then need the following join:
Pattern x Pattern3Grams x Customer3Grams x Customer
With appropriate indexing (which is easy) each look-up can happen in O(LOG(50K) + LOG(700M) + LOG(14M)), which comes to about 47.6.
Assuming appropriate indexes are present, the whole join can be calculated with fewer than 50,000 look-ups, plus of course the cost of scanning after the look-ups. I expect it to be very efficient (a matter of seconds).
The cost of creating 3-grams for each new customer is also minimal, because at most 50x75 possible 3-grams would have to be appended to the Customer3Grams table.
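A rough sketch of the shortlist-then-verify query (the Customer3Grams and Pattern3Grams tables and their columns are hypothetical; the final LIKE performs the exact wildcard test on the shortlisted pairs):
SELECT DISTINCT a.CUST_PO_NUMBER, b.RuleID
FROM dbo.Pattern3Grams pg
INNER JOIN dbo.Customer3Grams cg ON cg.Gram = pg.Gram
INNER JOIN dbo.TableA a ON a.CUST_PO_NUMBER = cg.CUST_PO_NUMBER
INNER JOIN dbo.TableB b ON b.RuleID = pg.RuleID
WHERE a.CUST_PO_NUMBER LIKE b.CustomerMask;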
--EDIT--
Depending on your data, I can also suggest hash-based clustering. I assume customer numbers are numbers with some character patterns in them (e.g. 123231ttt3x4). If this is the case, you can create a simple hash function that calculates the bit-wise OR over the letters (not the digits) and add it as an indexed column to your table. You can then filter on the result of the hash before applying LIKE.
Depending on your data, this may cluster your data effectively and improve your search by a factor of the number of clusters (the number of distinct hashes). You can test it by applying the hash and counting the number of distinct hashes generated.

MySQL - Selecting data from multiple tables all with same structure but different data

OK, here is my dilemma. I have a database set up with about 5 tables, all with the exact same data structure. The data is separated in this manner for localization purposes and to split up a total of about 4.5 million records.
A majority of the time only one table is needed and all is well. However, sometimes data is needed from 2 or more of the tables and it needs to be sorted by a user defined column. This is where I am having problems.
data columns:
id, band_name, song_name, album_name, genre
MySQL statment:
SELECT * from us_music, de_music where `genre` = 'punk'
MySQL spits out this error:
#1052 - Column 'genre' in where clause is ambiguous
Obviously, I am doing this wrong. Anyone care to shed some light on this for me?
I think you're looking for the UNION clause, a la
(SELECT * from us_music where `genre` = 'punk')
UNION
(SELECT * from de_music where `genre` = 'punk')
It sounds like you'd be happier with a single table. The fact that the five tables have the same schema and sometimes need to be presented as if they came from one table points to putting it all in one table.
Add a new column which can be used to distinguish among the five languages (I'm assuming it's language that is different among the tables since you said it was for localization). Don't worry about having 4.5 million records. Any real database can handle that size no problem. Add the correct indexes, and you'll have no trouble dealing with them as a single table.
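A sketch of what that single table could look like (table and column names are illustrative):
CREATE TABLE music (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    band_name VARCHAR(100),
    song_name VARCHAR(100),
    album_name VARCHAR(100),
    genre VARCHAR(50),
    locale CHAR(2)  -- e.g. 'us', 'de'
);
CREATE INDEX idx_music_genre_locale ON music (genre, locale);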
Any of the above answers are valid, or an alternative way is to qualify the column with the table name as well - e.g.:
SELECT * from us_music, de_music where `us_music`.`genre` = 'punk' AND `de_music`.`genre` = 'punk'
The column is ambiguous because it appears in both tables, so you would need to qualify the WHERE (or sort) field fully, e.g. us_music.genre or de_music.genre. But you'd usually only specify two tables if you were then going to join them together in some fashion. The structure you're dealing with is occasionally referred to as a partitioned table, although it's usually done to separate the dataset into distinct files as well, rather than just to split the dataset arbitrarily. If you're in charge of the database structure and there's no good reason to partition the data, then I'd build one big table with an extra "origin" field that contains a country code - but you're probably doing it for a legitimate performance reason.
Either use a UNION to combine the tables you're interested in (http://dev.mysql.com/doc/refman/5.0/en/union.html) or use the MERGE storage engine (http://dev.mysql.com/doc/refman/5.1/en/merge-storage-engine.html).
Your original attempt to span both tables creates an implicit JOIN. This is frowned upon by most experienced SQL programmers because it separates the tables being combined from the condition that says how to combine them.
The UNION is a good solution for the tables as they are, but there should be no reason they can't be put into the one table with decent indexing. I've seen adding the correct index to a large table increase query speed by three orders of magnitude.
A UNION statement can take a great deal of time on huge data sets. It is better to perform the select in 2 steps:
select the ids
then select from the main table using them
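A minimal sketch of that two-step approach (the ids in step 2 are whatever step 1 returned; they are shown literally here only for illustration):
-- Step 1: fetch only the matching ids (cheap if genre is indexed)
SELECT id FROM us_music WHERE genre = 'punk';
-- Step 2: fetch the full rows for the ids found in step 1
SELECT * FROM us_music WHERE id IN (17, 42, 105);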