Index not used Postgres

Index not used Postgres - sql

Tracking indexes and analyzing the tables on which index add, we encounter some situations:
some of our tables have index, but when I execute a query with a clause where on index field, doesn't account in your idx_scan field respective. Same relname and schemaname, so, I couldn't be wrong.
Testing more, I deleted and create the table again, after that the query returned to account the idx_scan.
That occurred with another tables too, we executed some queries with indexes and didn't account idx_scan field, only in seq_scan and even if I create another field in the same table with index, this new field doesn't count idx_scan.
Whats the problem with these tables? What do we do wrong? Only if I create a new table with indexes that account in idx_scan, just in an old table that has wrong.
We did migration sometimes with this database, maybe it can be the problem? Happened on localhost and server online.
Another event that we saw, some indexes were accounted, idx_scan > 0, and when execute query select, does not increase idx_scan again, the number was fixed and just increase seq_scan.
I believe those problems can be related.
I appreciate some help, it's a big mystery prowling our DB and have no idea what the problem can be.

A couple suggestions (and what to add to your question).
The first is that index scans are not always favored to to sequential scans. For example, if your table is small or the planner estimates that most pages will need to be fetched, an index scan will be omitted in favor of a sequential scan.
Remember: no plan beats retrieving a single page off disk and sequentially running through it.
Similarly if you have to retrieve, say, 50% of the pages of a relation, doing an index scan is going to trade somewhat less disk/IO total for a great deal more random disk/IO. It might be a win if you use SSD's but certainly not with conventional hard drives. After all you don't really want to be waiting for platters to turn. If you are using SSD's you can tweak planner settings accordingly.
So index vs sequential scan is not the end of the story. The question is how many rows are retrieved, how big the tables are, what percentage of disk pages are retrieved, etc.
If it really is picking a bad plan (rather than a good plan that you didn't consider!) then the question becomes why. There are ways of setting statistics targets but these may not be really helpful.
Finally the planner really can't choose an index in some cases where you might like it to. For example, suppose I have a 10 million row table with records spanning 5 years (approx 2 million rows per year on average). I would like to get the distinct years. I can't do this with a standard query and index, but I can build a WITH RECURSIVE CTE to essentially execute the same query once for each year and that will use an index. Of course you had better have an index in that case or WITH RECURSIVE will do a sequential scan for each year which is certainly not what you want!
tl;dr: It's complicated. You want to make sure this is really a bad plan before jumping to conclusions and then if it is a bad plan see what you can do about it depending on your configuration.

Related

Make query run faster - IT HAS NO JOIN

I got a really huge amount of data that are used to be joined anywhere just to get it (because it was really slow the team decided to gather it all into one table), but now even though they're literally right in one table (no join needed).
It's still so slow. Taking a one day range filter event will lead to time out (took more than 10s, yes that's how bad it is).
What should I suggest to my DBA?

What is the "selectivity"? That is, how many rows does your select expect to retrieve? 100% of the rows? 1% of the rows? 0.01% of the rows?
1. Low selectivity
If the selectivity is low (i.e less than 5%, ideally less than 0.5%) then good indexing is the best practice.
If so, which columns in the where clause (filtering columns) have the best (lowest) selectivity? Add these columns first in the index.
Once you have decided on the best index, you can make the table a "clustered index" table using that index. That way the heap will be presorted (fast lookup) by the index columns, for improved io since the disk blocks will be looked up sequentially.
2. High selectivity
If the selectivity is high (20% or more), there's no much you can do on your side (development). You could still get some improvement by:
Removing unneeded columns.
Make sure the select uses a FULL TABLE SCAN.
Ask the DBA to assign more resources (SGA, disk priority, paralellism, etc.)
3. Otherwise
The amount of data you have vastly exceeds the database resources you have. There's nothing you can do about it, except to tell the client about this reality, and:
Find together a way of defining smaller queries that can be achievable.
4. Finally
If you don't understanf the terms of selectivity, full table scan, indexing, database resources, heap, disk blocks, I would recommend you study them. I'm fairly sure you need to fully understand them right now!

As others have said, you need an index. However if it's really huge you can partition the data.
This allows you to drop sections of the data without using time consuming deletes. For example if you're working with some sort of historical data and want to keep 3 months worth, you can partition by month, then each month drop the oldest partition.
However on a more general note, it's rarely a good idea to take a slow multi-table query and glom it all together to improve performance. What you really need is to figure out what's wrong with the slow query and fix it.
This is a job for your DBA.

Querying Oracle table of high degree of parallelism results in full table scan

Well, the title described what I've just encountered recently with Oracle database.
Here's some background:
Table in concern in partitioned by hash into 4 partitions.
Parallel degree of the table is 4.
Hash key equals PK.
There is quite a number of rows in the table, around 200M.
PK index is also partitioned (local partition).
Parallel degree of the index is 1.
Okay now I've got a query behaves strangely as I change the parallel degree of the table.
If table degree is 4, it results in full table scan (coordinated parallel full table scan) as revealed by explain plan. Takes 30 minutes or more to complete the query.
If table degree is 1-3, it correctly make use of the PK index (range scan, single threaded) and returns result in 20 seconds.
If I set both table degree and index degree to 4, results in full table scan (same result as the first scenario in above).
This behavior, however, does not happen in another database where I have an nearly identical clone of the table. The only difference is number of records. The table in another database is of slightly smaller size (minus 1-2 million). The smaller table, also with degree of 4, does not runs into full table scan with the same query.
I've spent some time on Googling around and found the following things about parallel query:
From Oracle official doc
A high degree of parallelism for a table skews the optimizer toward full table scans over range scans. Examine the DEGREE column in ALL_TABLES for the table to determine the degree of parallelism.
And from http://www.toadworld.com/Portals/0/GuyH/Articles/Oracle%20Parallel%20SQL%20Part%201.pdf
Parallel query should be applied when
The SQL performs at least one full table, index or partition scan
And from AskTom.com
Parallel query is suitable for a certain class of large problems: very large problems
that have no other solution. Parallel query is my last path of action for solving a
performance problem; it's never my first course of action.
It seems that parallel execution is designed for processing a very large scale of data when no other better solution exists. It attempts to give better performance by running things in parallel, with each CPU (process) dedicated to work on separated portion of data (block range, table partitions or index partitions). Such that it is not designed to speed up general query, or query that does not cover a sufficient portion of the whole table.
Is my above understanding correct that parallel should not be used as a mean to speed up general query?
If yes, is that also means that the best practice to turn off parallel (degree as 0) and enable for particular query/operation through hint or parallel clause?
And in addition to all, what should be the best practice for setting up PARALLEL? If what I want to do is give best read performance through multi-threading, what should the setup be?
Lots of questions here. Lots of thanks in advance.

As a general rule I agree with Tom. Our main base table is an approx 240m rows iot, plus other indexes, with somewhere between 10 and 1,000 insert, delete, update operations happening 24 hours a day. We generally get information out of it in split seconds and then if we want a lot of information go for the full scan and deal with the 2.5 hours it takes. In answer to some of your questions, if you're going to be doing more large queries than small ones then go with the partition. If not then don't.

For your specific query, parallelism likely isn't your biggest problem. The new estimated cost and time of a query will be very roughly equal to the original cost divided by the degree of parallelism. The optimizer could be wrong here; for example, if you only have one hard drive then the new plan probably won't be any faster at all. But a 4x estimate mistake shouldn't lead to a 90x performance difference. This leads me to believe that your plan was already on the brink of failure, and this just tipped it over. How close are the estimated and actual cardinalities of your non-parallel plan? Whatever is causing those differences might be responsible for the bulk of your problem.
For your more general questions, there are no simple answers. There are several dozen things you may need to consider for parallelism, only you can know which ones will apply to your situation. Your best bet is to stop trying to Google it, and instead read the manual. The Using Parallel Execution chapter in the Data Warehousing Guide is a good place to start.

Degree of a relation or table in SQL means number of attribute in a relation.
For Example: If a relation in SQL has three rows and four columns then its degree in four. Simply we can say that number of columns of a relation called its degree.

Improve performance of querys in Postgresql with an index

I have in PostgreSQL tables, each with millions of records and more that one hundred fields.
One of them is a date field, which we filter by this in our queries. The creation of an index for this date field improved the performance of the queries that read an small range of dates, but in big range of dates the performance decreased...
I must prioritize one over the other? The performance in small ranges can be improved without decreasing the big range queries?

Queries in PostgreSQL cannot be answered just using the information in an index. Whether or not the row is visible, from the perspective of the query that is executing, is stored in the main row itself. So when you add an index to something, and execute a query that uses it, there are two steps involved:
Navigate the index to determine which data blocks are used
Retrieve those blocks and return the rows that match the query
It is therefore possible that answering a query with an index can take longer than just going directly to the data blocks and fetching the rows. The most common case where this happens is if you are actually grabbing a large portion of the data. Typically if more than 20% of the table is used, it's considered fast to just sequentially access it. Sometimes the planner thinks less than 20% will be accessed, so the index is preferred, but that's not true; that's one way adding an index can slow a query. This may be the situation you're seeing, based on your description--if the large ranges are touching more of the table than the optimizer estimates, using an index can be a net slowdown.
To figure this out, the database collects statistics about each column in each table, to determine whether a particular WHERE condition is selective enough to use an index. The idea is that you need to have saved so many blocks by not reading the whole table that adding the index I/O on top of it is still a net win.
This computation can go wrong, such that you end up doing more I/O than had you just read the table directly, in a couple of cases. The cause of most of them show up if you run the query using EXPLAIN ANALYZE. If the "expected" values versus the "actual" numbers are very different, this can suggest the optimizer had bad statistics on the table. Another possibility is that the optimizer just made a mistake about how selective the query is--it thought it would only return a small number of rows, but it actually returns most of the table. Here, again, better statistics is the normal way to start working on that. If you're on PostgreSQL 8.3 or earlier, the amount of statistics collected is very low by default.
Some workloads end up adjusting the random_page_cost tunable as well, which controls where this index vs. table scan trade-off happens at. That's only something to consider after the stats information is checked though. See Tuning Your PostgreSQL Server for an intro to several things you can adjust here.

I'd try several things:
increase DB cache parameters
add the index on that date field
redesign/modify the application to work with smaller ranges (althogh this suggestion might seem obvious, it is usually first to be thrown away)

The creation of an index for this date field improved the performance of the queries that read an small range of dates, but in big range of dates the performance decreased...
Try clustering your table using that index. The performance decrease might be due to the entire table getting opened on large ranges. And if so, clustering the table along that index would lead to less disk seeks.

Two suggestions:
1) Investigate the use of table inheritance for time-series data. For example, create a child table per month and then INDEX the date on each table. PostgreSQL is smart enough to only perform index_scan's on the child tables that have the actual data in the date range. Once the child table is "sealed" because it is a new month, run CLUSTER on the table to sort the data by date.
2) Look at creating a bunch of INDEX's that use WHERE clauses.
Suggestion #1 is going to be the winner long term but will take some work to setup (but will scale/run forever), but suggestion #2 may be a quick interim fix if you have a limited date range that you care about scanning. Remember, you can only use IMMUTABLE functions in your INDEX's WHERE clause.
CREATE INDEX tbl_date_2011_05_idx ON tbl(date) WHERE date >= '2011-05-01' AND date <= '2011-06-01';

SQL Query Slow? Should it be?

Using SQLite, Got a table with ~10 columns. Theres ~25million rows.
That table has an INDEX on 'sid, uid, area, type'.
I run a select like so:
SELECT sid from actions where uid=1234 and area=1 and type=2
That returns me 1571 results, and takes 4 minutes to complete.
Is that sane?
I'm far from an SQL expert, so hopefully someone can fill me in on what I'm missing. Why could this possibly take 4+ minutes with everything indexed?
Any recommended resources to learn about achieving high SQL performance? I feel like a lot of the Google results just give me opinions or anecdotes, I wouldn't mind a solid book.

Create uid+area+type index instead, or uid+area+type+sid

Since the index starts with the sid column, it must do a scan (start at the beginning, read to the end) of either the index or the table to find your data matching the other 3 columns. This means it has to read all 25 million rows to find the answer. Even if it's reading just the rows of the index rather than the table, that's a lot of work.
Imagine a phone book of the greater New York metropolitan area, organized by (with an 'index' on) Last Name, First Name.
You submit SELECT [Last Name] FROM NewYorkPhoneBook WHERE [First Name] = 'Thelma'
It has to read all 25 million entries to find all those Thelmas. Unless you either specify the last name and can then turn directly to the page where that last name first appers (a seek), or have an index organized by First Name (a seek on the index followed by a seek on the table, aka a "bookmark lookup"), there's no way around it.
The index you would create to make your query faster is on uid, area, type. You could include sid, though leave it out if sid is part of the primary key.
Note: Tables often do have multiple indexes. Just note that the more indexes, the slower the write performance. Unnecessary indexes can slow overall performance, sometimes radically so. Testing and eventually experience will help guide you in this. Also, reasoning it out as a real-world problem (like my phone book examples) can really help. If it wouldn't make sense with phone books (and separate phone book indexes) then it probably won't make sense in the database.
One more thing: even if you put an index on those columns, if your query is going to end up pulling a great percentage of the rows in the main table, it will still be cheaper to scan the table rather than do the bookmark lookup (seek the index then seek the table for each row found). The exact "tipping point" of whether to do a bookmark lookup with a seek, or to do a table scan isn't something I can tell you off the top of my head, but it is based on solid math.

The index is not really usefull as it does start with the wrong field... which means a table scan.
Looks like you have a normal computer there, not something made for databases. I run table scans over 650 million rows in about a minute on my lower end db server, but that means reading about a gigabyte per second from the discs, which are a RAID of 10k RM discs - RAID 10. Just to say that basically... that databases love IO, and that in a degree that you have never seen before. Basically larger db servers have many discs to satisfy the IOPS (IO per second) requirement. I have seen a server with 190 discs.
So, you ahve two choices: beed up your IOPS capability (means spending money), or set up indices that get used because they are "proper".
Proper means: an index only is usefull if the fields it contains are used from left to right. Not necessarily in the same order... but if a field is missed there is a chance the SQL System decides it is not worth pursuing the index and instead goes table scan (as in your case).

When you create your new index on uid, area and type, you should also do a select distinct on each one to determine which has the fewest distinct entries, then create your index such that the fewer the differences the earlier they show up in the index definition.

SQL: Inner joining two massive tables

I have two massive tables with about 100 million records each and I'm afraid I needed to perform an Inner Join between the two. Now, both tables are very simple; here's the description:
BioEntity table:
BioEntityId (int)
Name (nvarchar 4000, although this is an overkill)
TypeId (int)
EGM table (an auxiliar table, in fact, resulting of bulk import operations):
EMGId (int)
PId (int)
Name (nvarchar 4000, although this is an overkill)
TypeId (int)
LastModified (date)
I need to get a matching Name in order to associate BioEntityId with the PId residing in the EGM table. Originally, I tried to do everything with a single inner join but the query appeared to be taking way too long and the logfile of the database (in simple recovery mode) managed to chew up all the available disk space (that's just over 200 GB, when the database occupies 18GB) and the query would fail after waiting for two days, If I'm not mistaken. I managed to keep the log from growing (only 33 MB now) but the query has been running non-stop for 6 days now and it doesn't look like it's gonna stop anytime soon.
I'm running it on a fairly decent computer (4GB RAM, Core 2 Duo (E8400) 3GHz, Windows Server 2008, SQL Server 2008) and I've noticed that the computer jams occasionally every 30 seconds (give or take) for a couple of seconds. This makes it quite hard to use it for anything else, which is really getting on my nerves.
Now, here's the query:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM INNER JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
I had manually setup some indexes; both EGM and BioEntity had a non-clustered covering index containing TypeId and Name. However, the query ran for five days and it did not end either, so I tried running Database Tuning Advisor to get the thing to work. It suggested deleting my older indexes and creating statistics and two clustered indexes instead (one on each table, just containing the TypeId which I find rather odd - or just plain dumb - but I gave it a go anyway).
It has been running for 6 days now and I'm still not sure what to do...
Any ideas guys? How can I make this faster (or, at least, finite)?
Update:
- Ok, I've canceled the query and rebooted the server to get the OS up and running again
- I'm rerunning the workflow with your proposed changes, specifically cropping the nvarchar field to a much smaller size and swapping "like" for "=". This is gonna take at least two hours, so I'll be posting further updates later on
Update 2 (1PM GMT time, 18/11/09):
- The estimated execution plan reveals a 67% cost regarding table scans followed by a 33% hash match. Next comes 0% parallelism (isn't this strange? This is the first time I'm using the estimated execution plan but this particular fact just lifted my eyebrow), 0% hash match, more 0% parallelism, 0% top, 0% table insert and finally another 0% select into. Seems the indexes are crap, as expected, so I'll be making manual indexes and discard the crappy suggested ones.

I'm not an SQL tuning expert, but joining hundreds of millions of rows on a VARCHAR field doesn't sound like a good idea in any database system I know.
You could try adding an integer column to each table and computing a hash on the NAME field that should get the possible matches to a reasonable number before the engine has to look at the actual VARCHAR data.

For huge joins, sometimes explicitly choosing a loop join speeds things up:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM
INNER LOOP JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
As always, posting your estimated execution plan could help us provide better answers.
EDIT: If both inputs are sorted (they should be, with the covering index), you can try a MERGE JOIN:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM
INNER JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
OPTION (MERGE JOIN)

First, 100M-row joins are not at all unreasonable or uncommon.
However, I suspect the cause of the poor performance you're seeing may be related to the INTO clause. With that, you are not only doing a join, you are also writing the results to a new table. Your observation about the log file growing so huge is basically confirmation of this.
One thing to try: remove the INTO and see how it performs. If the performance is reasonable, then to address the slow write you should make sure that your DB log file is on a separate physical volume from the data. If it isn't, the disk heads will thrash (lots of seeks) as they read the data and write the log, and your perf will collapse (possibly to as little as 1/40th to 1/60th of what it could be otherwise).

Maybe a bit offtopic, but:
" I've noticed that the computer jams occasionally every 30 seconds (give or take) for a couple of seconds."
This behavior is characteristic for cheap RAID5 array (or maybe for single disk) while copying (and your query mostly copies data) gigabytes of information.
More about problem - can't you partition your query into smaller blocks? Like names starting with A, B etc or IDs in specific ranges? This could substantially decrease transactional/locking overhead.

I'd try maybe removing the 'LIKE' operator; as you don't seem to be doing any wildcard matching.

As recommended, I would hash the name to make the join more reasonable. I would strongly consider investigating assigning the id during the import of batches through a lookup if it is possible, since this would eliminate the need to do the join later (and potentially repeatedly having to perform such an inefficient join).
I see you have this index on the TypeID - this would help immensely if this is at all selective. In addition, add the column with the hash of the name to the same index:
SELECT EGM.Name
,BioEntity.BioEntityId
INTO AUX
FROM EGM
INNER JOIN BioEntity
ON EGM.TypeId = BioEntity.TypeId -- Hopefully a good index
AND EGM.NameHash = BioEntity.NameHash -- Should be a very selective index now
AND EGM.name LIKE BioEntity.Name

Another suggestion I might offer is try to get a subset of the data instead of processing all 100 M rows at once to tune your query. This way you don't have to spend so much time waiting to see when your query is going to finish. Then you could consider inspecting the query execution plan which may also provide some insight to the problem at hand.

100 million records is HUGE. I'd say to work with a database that large you'd require a dedicated test server. Using the same machine to do other work while performing queries like that is not practical.
Your hardware is fairly capable, but for joins that big to perform decently you'd need even more power. A quad-core system with 8GB would be a good start. Beyond that you have to make sure your indexes are setup just right.

do you have any primary keys or indexes? can you select it in stages? i.e. where name like 'A%', where name like 'B%', etc.

I had manually setup some indexes; both EGM and BioEntity had a non-clustered covering index containing TypeId and Name. However, the query ran for five days and it did not end either, so I tried running Database Tuning Advisor to get the thing to work. It suggested deleting my older indexes and creating statistics and two clustered indexes instead (one on each table, just containing the TypeId which I find rather odd - or just plain dumb - but I gave it a go anyway).
You said you made a clustered index on TypeId in both tables, although it appears you have a primary key on each table already (BioEntityId & EGMId, respectively). You do not want your TypeId to be the clustered index on those tables. You want the BioEntityId & EGMId to be clustered (that will physically sort your data in order of the clustered index on disk. You want non-clustered indexes on foreign keys you will be using for lookups. I.e. TypeId. Try making the primary keys clustered, and adding a non-clustered index on both tables that ONLY CONTAINS TypeId.
In our environment we have a tables that are roughly 10-20 million records apiece. We do a lot of queries similar to yours, where we are combining two datasets on one or two columns. Adding an index for each foreign key should help out a lot with your performance.
Please keep in mind that with 100 million records, those indexes are going to require a lot of disk space. However, it seems like performance is key here, so it should be worth it.
K. Scott has a pretty good article here which explains some issues more in depth.

Reiterating a few prior posts here (which I'll vote up)...
How selective is TypeId? If you only have 5, 10, or even 100 distinct values across your 100M+ rows, the index does nothing for you -- particularly since you're selecting all the rows anyway.
I'd suggest creating a column on CHECKSUM(Name) in both tables seems good. Perhaps make this a persisted computed column:
CREATE TABLE BioEntity
(
BioEntityId int
,Name nvarchar(4000)
,TypeId int
,NameLookup AS checksum(Name) persisted
)
and then create an index like so (I'd use clustered, but even nonclustered would help):
CREATE clustered INDEX IX_BioEntity__Lookup on BioEntity (NameLookup, TypeId)
(Check BOL, there are rules and limitations on building indexes on computed columns that may apply to your environment.)
Done on both tables, this should provide a very selective index to support your query if it's revised like this:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM INNER JOIN BioEntity
ON EGM.NameLookup = BioEntity.NameLookup
and EGM.name = BioEntity.Name
and EGM.TypeId = BioEntity.TypeId
Depending on many factors it will still run long (not least because you're copying how much data into a new table?) but this should take less than days.

Why an nvarchar? Best practice is, if you don't NEED (or expect to need) the unicode support, just use varchar. If you think the longest name is under 200 characters, I'd make that column a varchar(255). I can see scenarios where the hashing that has been recommended to you would be costly (it seems like this database is insert intensive). With that much size, however, and the frequency and random nature of the names, your indexes will become fragmented quickly in most scenarios where you index on a hash (dependent on the hash) or the name.
I would alter the name column as described above and make the clustered index TypeId, EGMId/BioentityId (the surrogate key for either table). Then you can join nicely on TypeId, and the "rough" join on Name will have less to loop through. To see how long this query might run, try it for a very small subset of your TypeIds, and that should give you an estimate of the run time (although it might ignore factors like cache size, memory size, hard disk transfer rates).
Edit: if this is an ongoing process, you should enforce the foreign key constraint between your two tables for future imports/dumps. If it's not ongoing, the hashing is probably your best best.

I would try to solve the issue outside the box, maybe there is some other algorithm that could do the job much better and faster than the database. Of course it all depends on the nature of the data but there are some string search algorithm that are pretty fast (Boyer-Moore, ZBox etc), or other datamining algorithm (MapReduce ?) By carefully crafting the data export it could be possible to bend the problem to fit a more elegant and faster solution. Also, it could be possible to better parallelize the problem and with a simple client make use of the idle cycles of the systems around you, there are framework that can help with this.
the output of this could be a list of refid tuples that you could use to fetch the complete data from the database much faster.
This does not prevent you from experimenting with index, but if you have to wait 6 days for the results I think that justifies resources spent exploring other possible options.
my 2 cent

Since you're not asking the DB to do any fancy relational operations, you could easily script this. Instead of killing the DB with a massive yet simple query, try exporting the two tables (can you get offline copies from the backups?).
Once you have the tables exported, write a script to perform this simple join for you. It'll take about the same amount of time to execute, but won't kill the DB.
Due to the size of the data and length of time the query takes to run, you won't be doing this very often, so an offline batch process makes sense.
For the script, you'll want to index the larger dataset, then iterate through the smaller dataset and do lookups into the large dataset index. It'll be O(n*m) to run.

If the hash match consumes too many resources, then do your query in batches of, say, 10000 rows at a time, "walking" the TypeID column. You didn't say the selectivity of TypeID, but presumably it is selective enough to be able to do batches this small and completely cover one or more TypeIDs at a time. You're also looking for loop joins in your batches, so if you still get hash joins then either force loop joins or reduce the batch size.
Using batches will also, in simple recovery mode, keep your tran log from growing very large. Even in simple recovery mode, a huge join like you are doing will consume loads of space because it has to keep the entire transaction open, whereas when doing batches it can reuse the log file for each batch, limiting its size to the largest needed for one batch operation.
If you truly need to join on Name, then you might consider some helper tables that convert names into IDs, basically repairing the denormalized design temporarily (if you can't repair it permanently).
The idea about checksum can be good, too, but I haven't played with that very much, myself.
In any case, such a huge hash match is not going to perform as well as batched loop joins. If you could get a merge join it would be awesome...

I wonder, whether the execution time is taken by the join or by the data transfer.
Assumed, the average data size in your Name column is 150 chars, you will actually have 300 bytes plus the other columns per record. Multiply this by 100 million records and you get about 30GB of data to transfer to your client. Do you run the client remote or on the server itself ?
Maybe you wait for 30GB of data being transferred to your client...
EDIT: Ok, i see you are inserting into Aux table. What is the setting of the recovery model of the database?
To investigate the bottleneck on the hardware side, it might be interesting whether the limiting resource is reading data or writing data. You can start a run of the windows performance monitor and capture the length of the queues for reading and writing of your disks for example.
Ideal, you should place the db log file, the input tables and the output table on separate physical volumes to increase speed.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas