MSSQL: Why is this index 10 times faster than the other one? - sql

I found a very strange behaviour for which I have no explanation. We have a simple table with around 450,000 entries (MSSQL 2008 R2).
The indexes for this table are very simple:
Index #1 contains:
[OwnerUserID] -> int, 4 byte
[TargetQuestionID] -> int, 4 byte
[LastChange] -> date, 8 byte
Index #2 contains:
[LastChange] is a date, 8 byte
[OwnerUserID] is an int, 4 byte
[TargetQuestionID] is an int, 4 byte
As you can see, the only difference is the slightly different order of the columns; in both indexes, the leaf entries have the same size, 16 bytes (far from what I've seen some DBAs do on really big databases)
The queries are simple:
Query #1:
- Asks just for the most recently entered element ( top(1) ) ordered by LastChange, so it takes only LastChange into account
Query #2:
- Asks just for the most recently entered element ( top(1) ) for a distinct OwnerUserID, so it takes OwnerUserID and LastChange into account
Results are:
Index #1 is super slow for query #1, although I thought it should be OK since the leaf entries are really not big (16 bytes)
Index #2 is super slow for query #2 (but since it takes two values into account, OwnerUserID + LastChange = 12 bytes, I do not see any reason why it should be much slower/faster)
Our idea was to have only one index, but since the performance for each query scenario differs by a factor of 10 - 11, we ended up creating BOTH of these indexes side by side where we thought we could go with one, since the index is not so big/complex that you would expect this slight difference in column order to hurt.
So now we are using double the space, and since the table grows by around 10k rows per day, we will have disk space issues somewhere in the future...
At first I thought this was because of some internal NHibernate issue, but we checked in Performance Monitor and the results are absolutely reproducible.
It seems like MSSQL performance with indexes depends highly on the usage of datetime columns, since this simple example shows that they can wreck the whole performance :-/

Indexes are commonly used to make a fast binary search possible instead of a slow sequential search. To achieve this they store the index keys in sorted order, typically in a tree. But a binary search is only possible if the leading part of the key is known, and thus the order of the columns is important. In your case this means:
Query#1 needs the record with the most recent LastChange. This query can be optimized with an index which starts with LastChange, e.g. Index#2, because the newest entry sits at one end of the index. With Index#1 it needs to fall back to a sequential search.
Query#2 first needs to find the rows for the given OwnerUserID, and an index which starts with OwnerUserID (Index#1) can help here. Then it needs to find the most recent LastChange for that OwnerUserID. Index#1 does not help with this step, because the next field in the index is TargetQuestionID, not LastChange, so all entries of that owner have to be read. Index#2 might help here if there are lots of records for the same OwnerUserID, because a scan from the newest entry backwards will hit one quickly. Otherwise it will probably do a sequential search.
So for an index the order of fields should match the queries. Also you might need to update your statistics so that the query planner has an idea whether it is better to do a sequential search (few entries per OwnerUserID) or to use Index#2 (lots of entries per OwnerUserID). I don't know if and how this can be done with MSSQL; I only know it from PostgreSQL.
An index is always a trade-off: it slows down inserts, but speeds up queries. So it highly depends on your application how many indexes you have and how they should be constructed.
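To look at the column-order effect on Query #1 in isolation, here is a minimal sketch using Python's built-in sqlite3 module. SQLite only stands in for SQL Server here (an assumption; the two planners differ in detail), and the table name Entries, the Payload column and the index name ix are made up for illustration. It asks the planner how it would answer the "latest row" query with each column order and checks whether an explicit sort step shows up.

```python
import sqlite3

# Query #1 from the question, written with LIMIT instead of TOP(1).
QUERY_1 = "SELECT LastChange FROM Entries ORDER BY LastChange DESC LIMIT 1"

def plan_with_index(index_columns, query=QUERY_1):
    """Create an empty table with a single index and return the plan steps."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Entries ("
        "OwnerUserID INTEGER, TargetQuestionID INTEGER, "
        "LastChange TEXT, Payload TEXT)"  # hypothetical table layout
    )
    conn.execute(f"CREATE INDEX ix ON Entries ({index_columns})")
    # The fourth column of each EXPLAIN QUERY PLAN row is the readable detail.
    steps = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
    conn.close()
    return steps

def needs_sort(steps):
    """True if the plan has to sort the rows itself (a temp B-tree step)."""
    return any("TEMP B-TREE" in step for step in steps)

index_1 = plan_with_index("OwnerUserID, TargetQuestionID, LastChange")
index_2 = plan_with_index("LastChange, OwnerUserID, TargetQuestionID")
print("Index #1 order:", index_1, "| sort needed:", needs_sort(index_1))
print("Index #2 order:", index_2, "| sort needed:", needs_sort(index_2))
```

With the Index #1 column order the planner has to read every entry and sort; with the Index #2 order it can read the newest entry straight from the end of the index.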

Related

select most recent values in very large table

I am an operations guy tasked with pulling data from a very large table. I'm not a DBA and cannot partition it or change the indexing. Table has nearly a billion entries, is not partitioned, and could probably be indexed "better". I need two fields, which we'll call mod_date and obj_id (mod_date is indexed). EDIT: I also add a filter for 'client' which I've blurred out in my screenshot of the explain plan.
My data:
Within the group of almost a billion rows, we have fewer than 10,000 obj_id values to query across several years (a few might even be NULL). Some of the <10k obj_ids -- probably between 1,000-2,500 -- have more than 10 million mod_date values each. When the obj_ids have over a few million mod_dates, each obj_id takes several minutes to scan and sort using MAX(mod_date). The full result set takes over 12 hours to query and no one has made it to completion without some "issue" (locked out, unplugged laptop, etc.). Even if we got the first 50 rows returned we'd still need to export to Excel ... it's only going to be about 8,000 rows with 2 columns but we can never make it to the end.
So here is a simplified query I'd use if it were a small table:
select MAX(trunc(mod_date,'dd')) as last_modified_date, obj_id
from my_table
where client = 'client_name'
and obj_type_id = 12
group by obj_id;
Cardinality is 317917582, "Cost" is 12783449
The issue:
The issue is the speed of the query with such a large unpartitioned table, given the current indexes. All the other answers I've seen about "most recent date" tend to use MAX, possibly in combination with FIRST_VALUE, which seem to require a full scan of all rows in order to sort them and then determine which is the most recent.
I am hoping there is a way to avoid that, to speed up the results. It seems that Oracle (I am using Oracle SQL developer) should be able to take an obj_id, look for the most recent mod_date row starting from "now" and working backwards, and move on as soon as it finds any mod_date value … because it's a date. Is there a way to do that?
Even with such a large table, obj_ids having fewer than 10,000 mod_dates can return the MAX(mod_date) very quickly (seconds or less). The issue we are having is the obj_ids having the most mod_dates (over 10 million) take the longest to scan and sort, when they "should" be the quickest if I could get Oracle to start looking at the most recent first … because it would find a recent date quickly and move on!
First, I'd say it's a common misconception that in order to make a query run faster, you need an index (or better indexes). A full table scan makes sense when you're pulling more than 10% of the data (rough estimate, depends on multiblock read count, block size, etc).
My advice is to set up a materialized view (MY_MV or whatever) that simply does the group by query (across all ids). If you need to limit the ids to a 10k subset, just make sure you full scan the table (check the explain plan). You can add a full hint if needed (select /*+ full(t) */ .. from big_table t ...)
Then do:
dbms_mview.refresh('MY_MV','C',atomic_refresh=>false);
That's it. No issues with a client only returning the first x rows and then re-running the entire query when you go to pull everything (ugh). Full scans are also easier to track in long ops (it's harder to tell what progress you've made if you are doing nested loops on an index, for example).
Once it's done, dump the entire MV table to a file or whatever you need.
tbone has it right I think. Or, if you do not have authority to create a materialized view, as he suggests, you might create a shell script on the database server to run your query via SQL*Plus and spool the output to a file. Then, run that script using nohup and you shouldn't need to worry about laptops getting turned off, etc.
But I wanted to explain something about your comment:
Oracle should be able to take an obj_id, look for the most recent mod_date row starting from "now" and working backwards, and move on as soon as it finds any mod_date value … because it's a date. Is there a way to do that?
That would be a horrible way for Oracle to run your query, given the indexes you have listed. Let's step through it...
There is no index on obj_id, so Oracle needs to do a full table scan to make sure it gets all the distinct obj_id values.
So, it starts the FTS and finds obj_id 101. It then says "I need max(mod_date) for 101... ah ha! I have an index!" So, it does a reverse index scan. For each entry in the index, it looks up the row from table and checks it to see if it is obj_id 101. If the obj_id was recently updated, we're good because we find it and stop early. But if the obj_id has not been updated in a long time, we have to read many index entries and, for each, access the table row(s) to perform the check.
In the worst case -- if the obj_id is one of those few you mentioned where max(mod_date) will be NULL, we would use the index to look up EVERY SINGLE ROW in your table that has a non-null mod_date.
Doing so many index lookups would be an awful plan if it did that just once, but you're talking about doing it for several old or never-updated obj_id values.
Anyway, it's all academic. There is no Oracle query plan that will run the query that way. It's for good reason.
Without better indexing, you're just not going to improve upon a single full table scan.
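The cost argument above can be made concrete with a toy simulation in Python. This is only a sketch under simplifying assumptions: rows are (obj_id, mod_date) tuples, a list sorted by mod_date descending plays the role of the index on mod_date, and "touched" counts how many entries each strategy has to look at. The function names and the sample data are invented for illustration.

```python
def single_scan(rows):
    """One pass over the table, tracking the latest mod_date per obj_id."""
    latest = {}
    touched = 0
    for obj_id, mod_date in rows:
        touched += 1
        if obj_id not in latest or mod_date > latest[obj_id]:
            latest[obj_id] = mod_date
    return latest, touched

def reverse_probe(rows):
    """For each obj_id, walk the date 'index' backwards until it shows up."""
    by_date_desc = sorted(rows, key=lambda r: r[1], reverse=True)
    latest = {}
    touched = 0
    for target in {obj_id for obj_id, _ in rows}:
        for obj_id, mod_date in by_date_desc:
            touched += 1
            if obj_id == target:
                latest[target] = mod_date
                break
    return latest, touched

# Three stale obj_ids with one old row each, plus one busy obj_id.
rows = [(101, 1), (102, 2), (103, 3)] + [(200, day) for day in range(4, 11)]

scan_result, scan_touched = single_scan(rows)
probe_result, probe_touched = reverse_probe(rows)
print(scan_touched, probe_touched)  # 10 entries vs 28 entries
```

The busy obj_id is found after a single probe, but each stale obj_id forces a walk past all the newer entries, which is why the probing strategy ends up touching more entries than one full scan.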

Sql Server Paging issue

Friends,
I have already implemented paging in my SP -
with MyData As (
select ROW_NUMBER() over (order by somecolumn desc) AS [Row],
x,y,z,...
)
Select x,y,z,...
From MyData
Where [Row] between ((@currentPage - 1) * @pageSize + 1) and (@currentPage * @pageSize)
The problem here is that data is retrieved very fast if the WITH clause returns a small number of rows, but it takes a long time when there are millions of records. Sometimes it times out.
Is there any other alternative?
Thanks for sharing your valuable time.
SQL Server optimisation is a very broad subject, and it is pretty much impossible to work out the issue with the limited amount of information you have posted. However, since you're in a rush for a solution: firstly, I would suggest checking your actual execution plan, posting it here, and making sure that the index is actually being used. If it is not, consider using the FASTFIRSTROW table hint to force the index to be used - check here and here on how it can improve things and here on how it can make things worse.
Next to consider is SQL parameter sniffing - it's unlikely from what you have said, but possible; check here for details.
For large-scale performance gains you may need to look at architectural changes; at the very least, ensure that your transaction logs are on a different disk from your data. The reason you separate the database files from the log files is that database access is random and log access is sequential. Best practice dictates that you don't mix those two I/O types on the same disk.
Also, if you've got millions of rows, then you really need to consider splitting the data across multiple disks.
Finally, I would strongly consider partitioning either the table or the index; see here for a start.
The reason why your query is slow is that you have to sort the whole table on every request. To speed it up significantly you need to avoid sorting big chunks of data, at the cost of CPU, HDD/memory, or limitations on the pagination logic.
As there is not much information about how your table is sorted, or whether you insert in the middle / delete entries very often, I'll narrow down your question by making these assumptions:
I would imagine you have a table storing an archive of articles. New entries are mostly at the bottom of the table, entries from the middle of the table deleted rarely.
You sort always by the same column somecolumn and in the same order, e.g. descending.
You do not have any user entered filters (like article title or author).
This makes the table static in terms of the output: each article stays in the same place unless a new one is inserted, and new ones come to the top of your output. Then you can store ROW_NUMBER() OVER () as a column. A more convenient solution is an IDENTITY column. It will speed things up if you create a clustered index on this column.
alter table MyTable add [Record_Number] int IDENTITY(1,1) NOT NULL
SQL Server numbers the existing rows when an IDENTITY column is added. If you use a plain int column filled from ROW_NUMBER() instead, add it as null so you can populate the values the first time; then you can make it not null.
On the other hand, you can get the last row number very quickly by
select @Max_Row = MAX(Record_Number) from MyTable
Now when you have the total number of rows, page size and page number, you can select the rows you need in one statement without sorting the whole lot.
Select * From MyTable
Where Record_Number between
(@Max_Row - @Page * @Page_Size) + 1 AND
@Max_Row - (@Page - 1) * @Page_Size
If you do have a filter in your CTE, then give some more information about how your data is structured, so we can think of a way to limit the scope of CTE.
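The row-number paging described in this answer can be sketched with Python's sqlite3 module. This is an illustration under assumptions, not the poster's code: SQLite's INTEGER PRIMARY KEY stands in for a clustered IDENTITY column, the Title column and the fetch_page helper are invented, and the table is assumed to be non-empty with no gaps in the numbering.

```python
import sqlite3

def fetch_page(conn, page, page_size):
    """Return one page, newest first, via a key range instead of a full sort."""
    (max_row,) = conn.execute(
        "SELECT MAX(Record_Number) FROM MyTable"
    ).fetchone()
    low = max_row - page * page_size + 1
    high = max_row - (page - 1) * page_size
    rows = conn.execute(
        "SELECT Record_Number, Title FROM MyTable "
        "WHERE Record_Number BETWEEN ? AND ? "
        "ORDER BY Record_Number DESC",
        (low, high),
    )
    return rows.fetchall()

conn = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY is filled automatically, like an IDENTITY column.
conn.execute(
    "CREATE TABLE MyTable (Record_Number INTEGER PRIMARY KEY, Title TEXT)"
)
conn.executemany(
    "INSERT INTO MyTable (Title) VALUES (?)",
    [(f"article {n}",) for n in range(1, 11)],
)

print(fetch_page(conn, page=1, page_size=3))  # rows 10, 9, 8
print(fetch_page(conn, page=2, page_size=3))  # rows 7, 6, 5
```

Deleted rows leave gaps in Record_Number, so a page can come back shorter than page_size; that is the price of the "rarely deleted" assumption made above.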

SQL Query Slow? Should it be?

Using SQLite, I've got a table with ~10 columns. There are ~25 million rows.
That table has an INDEX on 'sid, uid, area, type'.
I run a select like so:
SELECT sid from actions where uid=1234 and area=1 and type=2
That returns me 1571 results, and takes 4 minutes to complete.
Is that sane?
I'm far from an SQL expert, so hopefully someone can fill me in on what I'm missing. Why could this possibly take 4+ minutes with everything indexed?
Any recommended resources to learn about achieving high SQL performance? I feel like a lot of the Google results just give me opinions or anecdotes, I wouldn't mind a solid book.
Create uid+area+type index instead, or uid+area+type+sid
Since the index starts with the sid column, it must do a scan (start at the beginning, read to the end) of either the index or the table to find your data matching the other 3 columns. This means it has to read all 25 million rows to find the answer. Even if it's reading just the rows of the index rather than the table, that's a lot of work.
Imagine a phone book of the greater New York metropolitan area, organized by (with an 'index' on) Last Name, First Name.
You submit SELECT [Last Name] FROM NewYorkPhoneBook WHERE [First Name] = 'Thelma'
It has to read all 25 million entries to find all those Thelmas. Unless you either specify the last name and can then turn directly to the page where that last name first appears (a seek), or have an index organized by First Name (a seek on the index followed by a seek on the table, aka a "bookmark lookup"), there's no way around it.
The index you would create to make your query faster is on uid, area, type. You could include sid, though leave it out if sid is part of the primary key.
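Since the question is about SQLite, the difference is easy to check with Python's sqlite3 module and EXPLAIN QUERY PLAN. The sketch below is a minimal illustration: the table is empty, the note column and the index name idx_actions are invented, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

QUERY = "SELECT sid FROM actions WHERE uid = 1234 AND area = 1 AND type = 2"

def plan_for(index_columns):
    """Create the table with one index and return the plan steps for QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE actions ("
        "sid INTEGER, uid INTEGER, area INTEGER, type INTEGER, note TEXT)"
    )
    conn.execute(f"CREATE INDEX idx_actions ON actions ({index_columns})")
    # The fourth column of each EXPLAIN QUERY PLAN row is the readable detail.
    steps = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
    conn.close()
    return steps

original = plan_for("sid, uid, area, type")   # index from the question
suggested = plan_for("uid, area, type, sid")  # index suggested above

print("original :", original)   # a SCAN: every entry is read
print("suggested:", suggested)  # a SEARCH on uid, area and type
```

A SCAN step means the whole index or table is read; a SEARCH step means SQLite can jump directly to the matching entries.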
Note: Tables often do have multiple indexes. Just note that the more indexes, the slower the write performance. Unnecessary indexes can slow overall performance, sometimes radically so. Testing and eventually experience will help guide you in this. Also, reasoning it out as a real-world problem (like my phone book examples) can really help. If it wouldn't make sense with phone books (and separate phone book indexes) then it probably won't make sense in the database.
One more thing: even if you put an index on those columns, if your query is going to end up pulling a great percentage of the rows in the main table, it will still be cheaper to scan the table rather than do the bookmark lookup (seek the index then seek the table for each row found). The exact "tipping point" of whether to do a bookmark lookup with a seek, or to do a table scan isn't something I can tell you off the top of my head, but it is based on solid math.
The index is not really useful as it starts with the wrong field... which means a table scan.
Looks like you have a normal computer there, not something made for databases. I run table scans over 650 million rows in about a minute on my lower-end db server, but that means reading about a gigabyte per second from the discs, which are a RAID 10 of 10k RPM discs. Just to say that databases love IO, and to a degree that you have never seen before. Larger db servers have many discs to satisfy the IOPS (IO per second) requirement. I have seen a server with 190 discs.
So, you have two choices: beef up your IOPS capability (means spending money), or set up indices that get used because they are "proper".
Proper means: an index is only useful if the fields it contains are used from left to right. Not necessarily in the same order... but if a field is missed there is a chance the SQL system decides it is not worth pursuing the index and instead goes for a table scan (as in your case).
When you create your new index on uid, area and type, you should also do a select distinct on each one to determine which has the fewest distinct entries, then create your index such that the fewer the differences the earlier they show up in the index definition.

Different execution plan for similar queries

I am running two very similar update queries but for a reason unknown to me they are using completely different execution plans. Normally this wouldn't be a problem but they are both updating exactly the same amount of rows but one is using an execution plan that is far inferior to the other, 4 secs vs 2 mins, when scaled up this is causing me a massive problem.
The only difference between the two queries is one is using the column CLI and the other DLI. These columns are exactly the same datatype, and are both indexed exactly the same, but for the DLI query execution plan, the index is not used.
Any help as to why this is happening is much appreciated.
-- Query 1
UPDATE a
SET DestKey = (
SELECT TOP 1 b.PrefixKey
FROM refPrefixDetail AS b
WHERE a.DLI LIKE b.Prefix + '%'
ORDER BY len(b.Prefix) DESC )
FROM CallData AS a
-- Query 2
UPDATE a
SET DestKey = (
SELECT TOP 1 b.PrefixKey
FROM refPrefixDetail b
WHERE a.CLI LIKE b.Prefix + '%'
ORDER BY len(b.Prefix) DESC )
FROM CallData AS a
Examine the statistics on these two columns (how the data values for the columns are distributed among all the rows). This will probably explain the difference. One of these columns may have a distribution of values that causes the query, in processing, to examine a substantially higher number of rows than the other query requires (remember that the number of rows updated is controlled by the Top 1 part); in that case it is possible that the query optimizer will choose not to use the index. Updating statistics will make them more accurate, but if the distribution of values is such that the optimizer chooses not to use the index, then you may be out of luck...
Understanding how indices work is useful here. An index is a tree-structure of nodes, where each node (starting with a root node) contains information that allows the query processor to determine which branch of the tree to go to next, based on the value it is "searching" for. It is analogous to a binary-Tree except that in databases the trees are not binary, at each level there may be more than 2 branches below each node.
So, to traverse the index from the root to the leaf level, the processor has to read the index once for each level in the index hierarchy (if the index is 5 levels deep, for example, it needs to do 5 I/O operations for each record it searches for).
So in this example, say, if the query needs to examine more than approximately 20% of the records in the table (based on the value distribution of the column you are searching against), then the query optimizer will say to itself, "self, finding 20% of the records, with five I/Os per record search, takes the same number of I/Os as reading the entire table", so it just ignores the index and does a table scan.
There's really no way to avoid this except by adding additional criteria to your query to further restrict the number of records the query must examine to generate its results...
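The break-even arithmetic in this example can be written out as a few lines of Python. This is a toy model that follows the answer's simplification (one I/O per index level per record searched, one I/O per record for a table scan); real optimizers cost pages and caching rather than rows, and the function names are made up for illustration.

```python
def index_lookup_ios(rows_examined, index_depth):
    """I/Os when every examined row costs one root-to-leaf index traversal."""
    return rows_examined * index_depth

def table_scan_ios(total_rows):
    """I/Os for a full scan in this toy model: one per row."""
    return total_rows

def break_even_fraction(index_depth):
    """Fraction of rows at which index lookups cost as much as a scan."""
    return 1 / index_depth

total_rows = 1_000_000
depth = 5
rows_at_break_even = total_rows // depth  # 20% of the table

print(break_even_fraction(depth))                     # 0.2
print(index_lookup_ios(rows_at_break_even, depth))    # 1000000
print(table_scan_ios(total_rows))                     # 1000000
```

Above that fraction the index lookups cost more I/Os than the scan in this model, which is the point at which the optimizer gives up on the index.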
Try updating your statistics. If that does not help try rebuilding your indexes. It is possible that the cardinality of the data in each column is quite different, causing different execution plans to be selected.

SQL: Inner joining two massive tables

I have two massive tables with about 100 million records each and I'm afraid I needed to perform an Inner Join between the two. Now, both tables are very simple; here's the description:
BioEntity table:
BioEntityId (int)
Name (nvarchar 4000, although this is an overkill)
TypeId (int)
EGM table (an auxiliary table, in fact, resulting from bulk import operations):
EGMId (int)
PId (int)
Name (nvarchar 4000, although this is an overkill)
TypeId (int)
LastModified (date)
I need to get a matching Name in order to associate BioEntityId with the PId residing in the EGM table. Originally, I tried to do everything with a single inner join but the query appeared to be taking way too long and the logfile of the database (in simple recovery mode) managed to chew up all the available disk space (that's just over 200 GB, when the database occupies 18GB) and the query would fail after waiting for two days, If I'm not mistaken. I managed to keep the log from growing (only 33 MB now) but the query has been running non-stop for 6 days now and it doesn't look like it's gonna stop anytime soon.
I'm running it on a fairly decent computer (4GB RAM, Core 2 Duo (E8400) 3GHz, Windows Server 2008, SQL Server 2008) and I've noticed that the computer jams occasionally every 30 seconds (give or take) for a couple of seconds. This makes it quite hard to use it for anything else, which is really getting on my nerves.
Now, here's the query:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM INNER JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
I had manually setup some indexes; both EGM and BioEntity had a non-clustered covering index containing TypeId and Name. However, the query ran for five days and it did not end either, so I tried running Database Tuning Advisor to get the thing to work. It suggested deleting my older indexes and creating statistics and two clustered indexes instead (one on each table, just containing the TypeId which I find rather odd - or just plain dumb - but I gave it a go anyway).
It has been running for 6 days now and I'm still not sure what to do...
Any ideas guys? How can I make this faster (or, at least, finite)?
Update:
- Ok, I've canceled the query and rebooted the server to get the OS up and running again
- I'm rerunning the workflow with your proposed changes, specifically cropping the nvarchar field to a much smaller size and swapping "like" for "=". This is gonna take at least two hours, so I'll be posting further updates later on
Update 2 (1PM GMT time, 18/11/09):
- The estimated execution plan reveals a 67% cost regarding table scans followed by a 33% hash match. Next comes 0% parallelism (isn't this strange? This is the first time I'm using the estimated execution plan but this particular fact just lifted my eyebrow), 0% hash match, more 0% parallelism, 0% top, 0% table insert and finally another 0% select into. Seems the indexes are crap, as expected, so I'll be making manual indexes and discard the crappy suggested ones.
I'm not an SQL tuning expert, but joining hundreds of millions of rows on a VARCHAR field doesn't sound like a good idea in any database system I know.
You could try adding an integer column to each table and computing a hash on the NAME field that should get the possible matches to a reasonable number before the engine has to look at the actual VARCHAR data.
For huge joins, sometimes explicitly choosing a loop join speeds things up:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM
INNER LOOP JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
As always, posting your estimated execution plan could help us provide better answers.
EDIT: If both inputs are sorted (they should be, with the covering index), you can try a MERGE JOIN:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM
INNER JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
OPTION (MERGE JOIN)
First, 100M-row joins are not at all unreasonable or uncommon.
However, I suspect the cause of the poor performance you're seeing may be related to the INTO clause. With that, you are not only doing a join, you are also writing the results to a new table. Your observation about the log file growing so huge is basically confirmation of this.
One thing to try: remove the INTO and see how it performs. If the performance is reasonable, then to address the slow write you should make sure that your DB log file is on a separate physical volume from the data. If it isn't, the disk heads will thrash (lots of seeks) as they read the data and write the log, and your perf will collapse (possibly to as little as 1/40th to 1/60th of what it could be otherwise).
Maybe a bit offtopic, but:
" I've noticed that the computer jams occasionally every 30 seconds (give or take) for a couple of seconds."
This behavior is characteristic for cheap RAID5 array (or maybe for single disk) while copying (and your query mostly copies data) gigabytes of information.
More about problem - can't you partition your query into smaller blocks? Like names starting with A, B etc or IDs in specific ranges? This could substantially decrease transactional/locking overhead.
I'd try maybe removing the 'LIKE' operator; as you don't seem to be doing any wildcard matching.
As recommended, I would hash the name to make the join more reasonable. I would strongly consider investigating assigning the id during the import of batches through a lookup if it is possible, since this would eliminate the need to do the join later (and potentially repeatedly having to perform such an inefficient join).
I see you have this index on the TypeID - this would help immensely if this is at all selective. In addition, add the column with the hash of the name to the same index:
SELECT EGM.Name
,BioEntity.BioEntityId
INTO AUX
FROM EGM
INNER JOIN BioEntity
ON EGM.TypeId = BioEntity.TypeId -- Hopefully a good index
AND EGM.NameHash = BioEntity.NameHash -- Should be a very selective index now
AND EGM.name LIKE BioEntity.Name
Another suggestion I might offer is try to get a subset of the data instead of processing all 100 M rows at once to tune your query. This way you don't have to spend so much time waiting to see when your query is going to finish. Then you could consider inspecting the query execution plan which may also provide some insight to the problem at hand.
100 million records is HUGE. I'd say to work with a database that large you'd require a dedicated test server. Using the same machine to do other work while performing queries like that is not practical.
Your hardware is fairly capable, but for joins that big to perform decently you'd need even more power. A quad-core system with 8GB would be a good start. Beyond that you have to make sure your indexes are setup just right.
do you have any primary keys or indexes? can you select it in stages? i.e. where name like 'A%', where name like 'B%', etc.
I had manually setup some indexes; both EGM and BioEntity had a non-clustered covering index containing TypeId and Name. However, the query ran for five days and it did not end either, so I tried running Database Tuning Advisor to get the thing to work. It suggested deleting my older indexes and creating statistics and two clustered indexes instead (one on each table, just containing the TypeId which I find rather odd - or just plain dumb - but I gave it a go anyway).
You said you made a clustered index on TypeId in both tables, although it appears you have a primary key on each table already (BioEntityId & EGMId, respectively). You do not want your TypeId to be the clustered index on those tables. You want BioEntityId & EGMId to be clustered (that will physically sort your data on disk in the order of the clustered index). You want non-clustered indexes on the foreign keys you will be using for lookups, i.e. TypeId. Try making the primary keys clustered, and adding a non-clustered index on both tables that ONLY CONTAINS TypeId.
In our environment we have tables that are roughly 10-20 million records apiece. We do a lot of queries similar to yours, where we are combining two datasets on one or two columns. Adding an index for each foreign key should help out a lot with your performance.
Please keep in mind that with 100 million records, those indexes are going to require a lot of disk space. However, it seems like performance is key here, so it should be worth it.
K. Scott has a pretty good article here which explains some issues more in depth.
Reiterating a few prior posts here (which I'll vote up)...
How selective is TypeId? If you only have 5, 10, or even 100 distinct values across your 100M+ rows, the index does nothing for you -- particularly since you're selecting all the rows anyway.
The suggestion of creating a column on CHECKSUM(Name) in both tables seems good. Perhaps make this a persisted computed column:
CREATE TABLE BioEntity
(
BioEntityId int
,Name nvarchar(4000)
,TypeId int
,NameLookup AS checksum(Name) persisted
)
and then create an index like so (I'd use clustered, but even nonclustered would help):
CREATE clustered INDEX IX_BioEntity__Lookup on BioEntity (NameLookup, TypeId)
(Check BOL, there are rules and limitations on building indexes on computed columns that may apply to your environment.)
Done on both tables, this should provide a very selective index to support your query if it's revised like this:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM INNER JOIN BioEntity
ON EGM.NameLookup = BioEntity.NameLookup
and EGM.name = BioEntity.Name
and EGM.TypeId = BioEntity.TypeId
Depending on many factors it will still run long (not least because you're copying how much data into a new table?) but this should take less than days.
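The hash-column idea can be tried out on a small scale with Python's sqlite3 module. This is a sketch under assumptions rather than the SQL Server setup itself: zlib.crc32 stands in for CHECKSUM(Name), the hash is computed on insert instead of as a persisted computed column, and the sample rows and index names are invented.

```python
import sqlite3
import zlib

def name_hash(name):
    """Cheap 32-bit hash of the name; a stand-in for CHECKSUM(Name)."""
    return zlib.crc32(name.encode("utf-8"))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE BioEntity (BioEntityId INTEGER PRIMARY KEY, "
    "Name TEXT, TypeId INTEGER, NameLookup INTEGER)"
)
conn.execute(
    "CREATE TABLE EGM (EGMId INTEGER PRIMARY KEY, PId INTEGER, "
    "Name TEXT, TypeId INTEGER, NameLookup INTEGER)"
)

bio = [(1, "insulin", 7), (2, "actin", 7), (3, "insulin", 9)]
egm = [(10, 500, "insulin", 7), (11, 501, "myosin", 7), (12, 502, "insulin", 9)]
conn.executemany(
    "INSERT INTO BioEntity VALUES (?, ?, ?, ?)",
    [(i, n, t, name_hash(n)) for i, n, t in bio],
)
conn.executemany(
    "INSERT INTO EGM VALUES (?, ?, ?, ?, ?)",
    [(i, p, n, t, name_hash(n)) for i, p, n, t in egm],
)

# Small integer keys lead the index; the wide Name column stays out of it.
conn.execute("CREATE INDEX IX_BioEntity_Lookup ON BioEntity (NameLookup, TypeId)")
conn.execute("CREATE INDEX IX_EGM_Lookup ON EGM (NameLookup, TypeId)")

matches = conn.execute(
    "SELECT EGM.PId, BioEntity.BioEntityId "
    "FROM EGM INNER JOIN BioEntity "
    "ON EGM.NameLookup = BioEntity.NameLookup "
    "AND EGM.Name = BioEntity.Name "  # still compared, in case hashes collide
    "AND EGM.TypeId = BioEntity.TypeId "
    "ORDER BY EGM.PId"
).fetchall()
print(matches)  # [(500, 1), (502, 3)]
```

The comparison on Name stays in the join condition because two different names can share a hash; the hash only narrows the candidates cheaply.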
Why an nvarchar? Best practice is, if you don't NEED (or expect to need) the unicode support, just use varchar. If you think the longest name is under 200 characters, I'd make that column a varchar(255). I can see scenarios where the hashing that has been recommended to you would be costly (it seems like this database is insert intensive). With that much size, however, and the frequency and random nature of the names, your indexes will become fragmented quickly in most scenarios where you index on a hash (dependent on the hash) or the name.
I would alter the name column as described above and make the clustered index TypeId, EGMId/BioentityId (the surrogate key for either table). Then you can join nicely on TypeId, and the "rough" join on Name will have less to loop through. To see how long this query might run, try it for a very small subset of your TypeIds, and that should give you an estimate of the run time (although it might ignore factors like cache size, memory size, hard disk transfer rates).
Edit: if this is an ongoing process, you should enforce the foreign key constraint between your two tables for future imports/dumps. If it's not ongoing, the hashing is probably your best bet.
I would try to solve the issue outside the box; maybe there is some other algorithm that could do the job much better and faster than the database. Of course it all depends on the nature of the data, but there are some string search algorithms that are pretty fast (Boyer-Moore, ZBox etc.), or other data mining approaches (MapReduce?). By carefully crafting the data export it could be possible to bend the problem to fit a more elegant and faster solution. Also, it could be possible to better parallelize the problem and, with a simple client, make use of the idle cycles of the systems around you; there are frameworks that can help with this.
The output of this could be a list of refid tuples that you could use to fetch the complete data from the database much faster.
This does not prevent you from experimenting with index, but if you have to wait 6 days for the results I think that justifies resources spent exploring other possible options.
my 2 cent
Since you're not asking the DB to do any fancy relational operations, you could easily script this. Instead of killing the DB with a massive yet simple query, try exporting the two tables (can you get offline copies from the backups?).
Once you have the tables exported, write a script to perform this simple join for you. It'll take about the same amount of time to execute, but won't kill the DB.
Due to the size of the data and length of time the query takes to run, you won't be doing this very often, so an offline batch process makes sense.
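For the script, you'll want to index the larger dataset, then iterate through the smaller dataset and do lookups into the large dataset index. With a hash-based index each lookup takes constant time on average, so the whole run is roughly O(n + m) instead of the O(n*m) of a naive nested loop.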
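A minimal Python sketch of such a script is shown here. It assumes the exported rows are already available as tuples and that the index fits in memory (for 100 million rows it would have to be partitioned or kept on disk); the function name offline_join and the sample data are invented for illustration. A dict serves as the index, so each lookup is constant time on average.

```python
from collections import defaultdict

def offline_join(bio_rows, egm_rows):
    """Yield (Name, BioEntityId) for rows matching on Name and TypeId."""
    # Build the index over one dataset, keyed on the join columns.
    index = defaultdict(list)
    for bio_id, name, type_id in bio_rows:
        index[(name, type_id)].append(bio_id)
    # Stream the other dataset and probe the index once per row.
    for _p_id, name, type_id in egm_rows:
        for bio_id in index.get((name, type_id), ()):
            yield name, bio_id

# (BioEntityId, Name, TypeId) and (PId, Name, TypeId) sample rows.
bio_rows = [(1, "insulin", 7), (2, "actin", 7), (3, "insulin", 9)]
egm_rows = [(500, "insulin", 7), (501, "myosin", 7), (502, "insulin", 9)]

result = list(offline_join(bio_rows, egm_rows))
print(result)  # [('insulin', 1), ('insulin', 3)]
```

Using index.get rather than index[...] keeps the probe from adding empty entries to the defaultdict for names that never match.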
If the hash match consumes too many resources, then do your query in batches of, say, 10000 rows at a time, "walking" the TypeID column. You didn't say the selectivity of TypeID, but presumably it is selective enough to be able to do batches this small and completely cover one or more TypeIDs at a time. You're also looking for loop joins in your batches, so if you still get hash joins then either force loop joins or reduce the batch size.
Using batches will also, in simple recovery mode, keep your tran log from growing very large. Even in simple recovery mode, a huge join like you are doing will consume loads of space because it has to keep the entire transaction open, whereas when doing batches it can reuse the log file for each batch, limiting its size to the largest needed for one batch operation.
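The batching idea can be sketched with Python's sqlite3 module. Assumptions to note: SQLite stands in for SQL Server, each batch covers exactly one TypeId rather than a fixed number of rows, the AUX table is created up front instead of via SELECT ... INTO, and the sample data and the join_in_batches helper are invented for illustration.

```python
import sqlite3

def join_in_batches(conn):
    """Run the join one TypeId at a time, committing after each batch."""
    type_ids = [
        row[0]
        for row in conn.execute("SELECT DISTINCT TypeId FROM EGM ORDER BY TypeId")
    ]
    for type_id in type_ids:
        conn.execute(
            "INSERT INTO AUX (Name, BioEntityId) "
            "SELECT EGM.Name, BioEntity.BioEntityId "
            "FROM EGM INNER JOIN BioEntity "
            "ON EGM.Name = BioEntity.Name AND EGM.TypeId = BioEntity.TypeId "
            "WHERE EGM.TypeId = ?",
            (type_id,),
        )
        conn.commit()  # each batch is its own short transaction
    return len(type_ids)

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE BioEntity (BioEntityId INTEGER PRIMARY KEY, Name TEXT, TypeId INTEGER);
    CREATE TABLE EGM (EGMId INTEGER PRIMARY KEY, PId INTEGER, Name TEXT, TypeId INTEGER);
    CREATE TABLE AUX (Name TEXT, BioEntityId INTEGER);
    """
)
conn.executemany(
    "INSERT INTO BioEntity VALUES (?, ?, ?)",
    [(1, "insulin", 7), (2, "actin", 7), (3, "insulin", 9)],
)
conn.executemany(
    "INSERT INTO EGM VALUES (?, ?, ?, ?)",
    [(10, 500, "insulin", 7), (11, 501, "myosin", 7), (12, 502, "insulin", 9)],
)
conn.commit()

batches = join_in_batches(conn)
aux_rows = conn.execute(
    "SELECT Name, BioEntityId FROM AUX ORDER BY BioEntityId"
).fetchall()
print(batches, aux_rows)  # 2 [('insulin', 1), ('insulin', 3)]
```

Because every batch commits on its own, a failure part-way through loses only the current batch, and the batches already written to AUX stay in place.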
If you truly need to join on Name, then you might consider some helper tables that convert names into IDs, basically repairing the denormalized design temporarily (if you can't repair it permanently).
The idea about checksum can be good, too, but I haven't played with that very much, myself.
In any case, such a huge hash match is not going to perform as well as batched loop joins. If you could get a merge join it would be awesome...
I wonder, whether the execution time is taken by the join or by the data transfer.
Assuming the average data size in your Name column is 150 chars, you will actually have 300 bytes (nvarchar stores 2 bytes per character) plus the other columns per record. Multiply this by 100 million records and you get about 30GB of data to transfer to your client. Do you run the client remotely or on the server itself?
Maybe you wait for 30GB of data being transferred to your client...
EDIT: Ok, i see you are inserting into Aux table. What is the setting of the recovery model of the database?
To investigate the bottleneck on the hardware side, it might be interesting whether the limiting resource is reading data or writing data. You can start a run of the windows performance monitor and capture the length of the queues for reading and writing of your disks for example.
Ideally, you should place the db log file, the input tables and the output table on separate physical volumes to increase speed.