Redshift: Query Optimisation

Need help optimizing the query below. Please advise.
DB: Redshift
Sort keys:
order: install_ts
order_item: install_ts
sub_order: install_ts
sub_order_item: install_ts
Dist key: not set
Query: extracting selected columns (not all) from the tables below
SELECT *,
       RANK() OVER (PARTITION BY oo.id, oi.id
                    ORDER BY i.suborder_id DESC) AS suborder_rank
FROM order_item oi
LEFT JOIN "order" oo ON oo.id = oi.order_id   -- "order" is a reserved word, so it must be quoted
LEFT JOIN sub_order_item i ON i.order_item_id = oi.id
LEFT JOIN sub_order s ON i.suborder_id = s.id
WHERE
(
    (oo.update_ts BETWEEN '2021-04-13 18:29:59' AND '2021-04-14 18:29:59' AND oo.create_ts >= current_date - 500)
    OR oo.create_ts BETWEEN '2021-04-13 18:29:59' AND '2021-04-14 18:29:59'
    OR oi.create_ts BETWEEN '2021-04-13 18:29:59' AND '2021-04-14 18:29:59'
    OR (oi.update_ts BETWEEN '2021-04-13 18:29:59' AND '2021-04-14 18:29:59' AND oi.create_ts >= current_date - 500)
)

Without knowing your data in more detail, and without knowing the full query load you want to optimize, it will be difficult to get things exactly right. However, I see a number of things here that are POTENTIAL areas for significant performance issues. I do this work for many clients, and while the optimization methods in Redshift differ from other databases, there are a number of things you can do.
First off, you need to remember that Redshift is a networked cluster of computers acting as a coherent database. This allows it to scale to very large database sizes and achieve horizontally scalable compute performance. It also means that there are network cables in the middle of your query.
This leads to the issue I see most often: massive amounts of data being redistributed across those network cables during the query. In your case you have not assigned a distribution key, which will lead to your data being spread arbitrarily across the cluster. But when you perform those joins based on "id", all the rows with the same id need to be moved to one slice (compute element) of the Redshift cluster. Basically all your data is being transferred around your cluster, and this takes time. Assigning a distribution key that is also a commonly joined-on (or grouped-by) column, such as id, will greatly reduce this network traffic.
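As a minimal sketch of collocating one of those joins, and only an assumption since the real schemas aren't shown (Redshift supports changing the distribution key in place):
ALTER TABLE order_item ALTER DISTKEY id;                  -- the column sub_order_item joins on
ALTER TABLE sub_order_item ALTER DISTKEY order_item_id;   -- matching side of that join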
The next most common issue is scanning (or writing) too much data from (or to) disk. The disks on Redshift are fast, but the data is often massive and still moves through cables (faster than the network, but it still takes time). Redshift helps with this by storing metadata alongside your data on disk, and when it can, it will skip reading unneeded data based on your WHERE clauses. In your case you have WHERE clauses that reduce the needed rows (I hope), which is good. But you are filtering on columns that are not the sort key. If these columns correlate well with the sort key this may not be a problem, but if they do not, you could be reading more data than you need. You will want to look at the metadata for these tables to see how well your sort keys are working for you. Also, Redshift only reads the columns you reference in your query, so if you don't ask for a column, its data is not read from disk. Having "*" in your SELECT reads all the columns; if you don't need them all, think about listing only the needed columns.
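A quick way to look at that metadata, sketched against the standard SVV_TABLE_INFO system view (table names taken from the question):
SELECT "table", diststyle, sortkey1, unsorted, size
FROM svv_table_info
WHERE "table" IN ('order', 'order_item', 'sub_order', 'sub_order_item');  -- "unsorted" is the percent of unsorted rows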
If your query works on more data than fits in memory during execution, the extra data will need to spill to disk. This creates writes of the swapped data to disk, followed by reads of it back in, and this can happen multiple times during the query. I don't know if this is happening in your case, but you will want to check whether your query is spilling.
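One way to check, sketched against the standard SVL_QUERY_SUMMARY system view (the query id is a placeholder; is_diskbased = 't' marks a step that spilled):
SELECT query, seg, step, label, rows, workmem, is_diskbased
FROM svl_query_summary
WHERE query = 123456   -- substitute your query id from STL_QUERY
ORDER BY seg, step;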
The thing that impacts everything else is issues in the query itself, and these take a number of forms. If your joins are not properly qualified, the amount of data can explode, making the two issues above much worse. Window functions can be expensive, especially when they partition on something other than the distribution key, but also when they operate on more data than is needed. There are many others, but these two are possible problem areas in your query. I just cannot tell from the info provided.
I'd start by looking at the explain plan and the actual execution information from the system tables. Big numbers for rows, cost, or time will lead you to the areas that likely need optimization. Like I said, I do this work for customers all the time and can help here as time allows, but much more information will be needed to understand what is really impacting your query.
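As a starting point, a hedged sketch: EXPLAIN will show DS_BCAST_INNER or DS_DIST_* steps wherever rows are being redistributed over the network, which ties back to the distribution discussion above.
EXPLAIN
SELECT oi.id
FROM order_item oi
LEFT JOIN "order" oo ON oo.id = oi.order_id;   -- simplified from the question's query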


How small should a table using Diststyle ALL be in Amazon Redshift?
It says here: http://dwbitechguru.blogspot.com/2014/11/performance-tuning-in-amazon-redshift.html
that for very small tables, Redshift should use DISTSTYLE ALL instead of EVEN or KEY. How small is small? If I were to specify a row count in the WHERE clause of the query select relname, reldiststyle from pg_class, how many rows should I specify?
It really depends on the cluster size you are using. DISTSTYLE ALL copies your table's data to every node, to mitigate the data transfer requirement across nodes. You can find out the size of your table and the available storage per Redshift node; if you can afford to store one copy of the table per node, do it!
Also, if you join other tables with this table very frequently, say in 70% of your queries, I believe the extra space is worth the better query performance.
If your join keys across tables have matching cardinality, you could instead distribute all the tables on that key, so that rows with the same key land on the same node, which obviates the replication of data.
I would suggest trying out the two options above, comparing the average run times of around 10 queries, and then coming to a decision.
Considering a star schema, distribution style ALL is normally used for dimension tables. This has the advantage of speeding up joins; let's explain it through an example. If we want to obtain the quantity sold of each product by country, we need to join the fact_sales table with the dim_store table on the store_id key.
So setting DISTSTYLE ALL on dim_store lets the join happen in parallel on each node, compared to the disadvantage of shuffling under DISTSTYLE EVEN. Alternatively, you can let Redshift automatically pick an optimal distribution style by setting DISTSTYLE AUTO; for more info check this link.
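For illustration, a minimal sketch using the dim_store example above (standard Redshift DDL; the table name comes from the example, not a real schema):
ALTER TABLE dim_store ALTER DISTSTYLE ALL;    -- replicate the dimension to every node
ALTER TABLE dim_store ALTER DISTSTYLE AUTO;   -- or let Redshift decide over time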

AWS Redshift column limit?

I've been doing some load testing of AWS Redshift for a new application, and I noticed that it has a column limit of 1600 per table. Worse, queries slow down as the number of columns increases in a table.
What doesn't make any sense here is that Redshift is supposed to be a column-store database, and there shouldn't in theory be an I/O hit from columns that are not selected in a particular where clause.
More specifically, when TableName has 1600 columns, I found that the query below is substantially slower than if TableName had, say, 1000 columns and the same number of rows. As the number of columns decreases, performance improves.
SELECT COUNT(1) FROM TableName
WHERE ColumnName LIKE '%foo%'
My three questions are:
What's the deal? Why does Redshift have this limitation if it claims to be a column store?
Any suggestions for working around this limitation? Joins of multiple smaller tables seems to eventually approximate the performance of a single table. I haven't tried pivoting the data.
Does anyone have a suggestion for a fast, real-time performance, horizontally scalable column-store database that doesn't have the above limitations? All we're doing is count queries with simple where restrictions against approximately 10M (rows) x 2500 (columns) data.
I can't explain precisely why it slows down so much but I can verify that we've experienced the same thing.
I think part of the issue is that Redshift stores a minimum of 1MB per column per slice. Having a lot of columns creates a lot of disk seek activity and I/O overhead.
1MB blocks are problematic because most of that space will be empty, yet it will still be read off the disk.
Having lots of blocks means that column data will not be located close together, so Redshift has to do a lot more work to find it.
Also (it just occurred to me), I suspect that Redshift's MVCC controls add a lot of overhead. It tries to ensure you get a consistent read while your query is executing, and presumably that requires making a note of all the blocks for the tables in your query, even blocks for columns that are not used. See: Why is an implicit table lock being released prior to end of transaction in RedShift?
FWIW, our columns were virtually all BOOLEAN, and we've had very good results from compacting them (bit masking) into INT/BIGINTs and accessing the values using the bit-wise functions. One example table went from 1400 columns (~200GB) to ~60 columns (~25GB), and query times improved more than 10x (from 30-40 secs down to 1-2 secs).
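As a sketch of that bit-packing idea (table and column names hypothetical), each flag lives in one bit of a BIGINT and is tested with a bit-wise operator:
SELECT COUNT(1)
FROM packed_table
WHERE (flags & 32) <> 0;   -- 32 = 2^5, i.e. the flag packed into bit 5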

Understanding "Resources exceeded during query execution" with GROUP EACH BY in BigQuery

I'm writing a background job to automatically process A/B test data in BigQuery, and I'm finding that I'm hitting "Resources exceeded during query execution" when doing large GROUP EACH BY statements. I saw from Resources Exceeded during query execution that reducing the number of groups can make queries succeed, so I split up my data into smaller pieces, but I'm still hitting errors (although less frequently). It would be nice to get a better intuition about what actually causes this error. In particular:
Does "resources exceeded" always mean that a shard ran out of memory, or could it also mean that the task ran out of time?
What's the right way to approximate the memory usage and the total memory I have available? Am I correct in assuming each shard tracks about 1/n of the groups and keeps the group key and all aggregates for each group, or is there another way that I should be thinking about it?
How is the number of shards determined? In particular, do I get fewer shards/resources if I'm querying over a smaller dataset?
The problematic query looks like this (in practice, it's used as a subquery, and the outer query aggregates the results):
SELECT
alternative,
snapshot_time,
SUM(column_1),
...
SUM(column_139)
FROM
my_table
CROSS JOIN
[table containing 24 unix timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time
GROUP EACH BY alternative, user_id, snapshot_time
(Here's an example failed job: 124072386181:job_XF6MksqoItHNX94Z6FaKpuktGh4 )
I realize this query may be asking for trouble, but in this case, the table is only 22MB and the query results in under a million groups and it's still failing with "resources exceeded". Reducing the number of timestamps to process at once fixes the error, but I'm worried that I'll eventually hit a data scale large enough that this approach as a whole will stop working.
As you've guessed, BigQuery chooses a number of parallel workers (shards) for GROUP EACH and JOIN EACH queries based on the size of the tables being operated upon. It is a rough heuristic, but in practice, it works pretty well.
What is interesting about your query is that the GROUP EACH is being done over a larger table than the original table because of the expansion in the CROSS JOIN. Because of this, we choose a number of shards that is too small for your query.
To answer your specific questions:
Resources exceeded almost always means that a worker ran out of memory. This could be a shard or a mixer, in Dremel terms (mixers are the nodes in the computation tree that aggregate results. GROUP EACH BY pushes aggregation down to the shards, which are the leaves of the computation tree).
There isn't a good way to approximate the amount of resources available. This changes over time, with the goal that more of your queries should just work.
The number of shards is determined by the total bytes processed in the query. As you've noticed, this heuristic doesn't work well with joins that expand the underlying data sets. That said, there is active work underway to be smarter about how we pick the number of shards. To give you an idea of scale, your query got scheduled on only 20 shards, which is a tiny fraction of what a larger table would get.
As a workaround, you could save the intermediate result of the CROSS JOIN as a table and run the GROUP EACH BY over that temporary table. That would let BigQuery use the expanded size when picking the number of shards. (If that doesn't work, please let me know; it is possible that we need to tweak our assignment thresholds.)
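A sketch of that workaround in the question's legacy SQL dialect (dataset and table names hypothetical; step 1 is run as a query job with dataset.expanded as its destination table):
-- Step 1: materialize the expanded CROSS JOIN
SELECT alternative, user_id, snapshot_time, column_1, ..., column_139
FROM my_table
CROSS JOIN [dataset.timestamps] timestamps
WHERE last_updated_time < timestamps.snapshot_time;
-- Step 2: GROUP EACH BY now sees the expanded table size when shards are picked
SELECT alternative, snapshot_time, SUM(column_1), ..., SUM(column_139)
FROM [dataset.expanded]
GROUP EACH BY alternative, user_id, snapshot_time;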

Hive - Efficient join of two tables

I am joining two large tables in Hive (one is over 1 billion rows, one is about 100 million rows) like so:
create table joinedTable as select t1.id, ... from t1 join t2 ON (t1.id = t2.id);
I have bucketed the two tables in the same way, clustering by id into 100 buckets for each, but the query is still taking a long time.
Any suggestions on how to speed this up?
As you bucketed the data by the join keys, you could use the Bucket Map Join. For that, the number of buckets in one table must be a multiple of the number of buckets in the other table. It can be activated by executing set hive.optimize.bucketmapjoin=true; before the query. If the tables don't meet the conditions, Hive will simply perform a normal inner join.
If both tables have the same number of buckets and the data is sorted by the bucket keys, Hive can perform the faster Sort-Merge Join. To activate it, you have to execute the following commands:
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
You can find some visualizations of the different join techniques under https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf.
As I see it, the answer is a bit more complicated than what @Adrian Lange offered.
First, you must understand a very important difference between the Bucket Map Join and the Sort-Merge Bucket Join (SMBJ):
To perform a bucket map join, "the number of buckets in one table must be a multiple of the number of buckets in the other table", as stated before, and in addition hive.optimize.bucketmapjoin must be set to true.
When you issue a join, Hive will convert it into a bucket map join if the above condition holds, BUT pay attention: Hive will not enforce the bucketing! This means that creating the table as bucketed isn't enough for it to actually be bucketed into the specified number of buckets, as Hive doesn't enforce this unless hive.enforce.bucketing is set to true (in which case the number of buckets is set by the number of reducers in the final stage of the query inserting data into the table).
On the performance side, please note that when using a bucket map join, a single task reads the "smaller" table into the distributed cache before the mappers access it and do the join. This stage would probably be very, very long and ineffective when your table has ~100M rows!
Afterwards, the join is done in the reducers, the same as in a regular join.
To perform an SMBJ, both tables must have exactly the same number of buckets, on the same columns, and be sorted by these columns, in addition to setting hive.optimize.bucketmapjoin.sortedmerge to true.
As in the previous optimization, Hive doesn't enforce the bucketing and the sorting; it assumes you made sure that the tables are actually bucketed and sorted (not only by definition, but by setting hive.enforce.sorting or by manually sorting the data while inserting it). This is very important, as it may otherwise lead to wrong results in both cases.
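A minimal sketch of tables that meet those preconditions (column names hypothetical), with enforcement switched on before loading:
set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;
CREATE TABLE t1_bucketed (id BIGINT, payload STRING)
CLUSTERED BY (id) SORTED BY (id ASC) INTO 100 BUCKETS;
INSERT OVERWRITE TABLE t1_bucketed
SELECT id, payload FROM t1;   -- repeat for the second table with the same bucket count and sort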
On the performance side, this optimization is considerably more efficient, for the following reasons:
Each mapper reads both buckets, and there is no single-task contention for distributed cache loading.
The join being performed is a merge-sort join on data that is already sorted, which is far more efficient.
Please note the following considerations:
In both cases, set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; should be executed.
In both cases, a /*+ MAPJOIN(b) */ hint should be applied in the query (right after the SELECT, where b is the smaller table); see the sketch below.
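For example, a hypothetical query shape with the hint in place (t2 standing in for the smaller table):
SELECT /*+ MAPJOIN(t2) */ t1.id, t1.payload
FROM t1_bucketed t1
JOIN t2_bucketed t2 ON (t1.id = t2.id);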
How many buckets?
Look at it from this angle: the consideration applies mainly to the bigger table, as it has the most impact, and the configuration is then applied to the smaller table as a must. As a rule of thumb, each bucket should contain between 1 and 3 blocks, probably somewhere near 2 blocks. So if your block size is 256MB, it seems reasonable to me to have ~512MB of data in each bucket of the bigger table, and this becomes a simple division problem.
Also, don't forget that these optimizations alone won't always guarantee a faster query.
Let's say you choose to do an SMBJ; this adds the cost of sorting the two tables prior to running the join, so the more times you run your query, the less you are "paying" for this sorting stage.
Sometimes a simple join will give the best performance, and none of the above optimizations will help; you will have to optimize the regular join process either at the application/logical level or by tuning MapReduce/Hive settings like memory usage and parallelism.
I don't think "the number of buckets in one table must be a multiple of the number of buckets in the other table" is a must criterion for the bucket map join; we can have the same number of buckets as well.

SQL: Inner joining two massive tables

I have two massive tables with about 100 million records each, and I'm afraid I need to perform an inner join between the two. Now, both tables are very simple; here's the description:
BioEntity table:
BioEntityId (int)
Name (nvarchar 4000, although this is overkill)
TypeId (int)
EGM table (an auxiliary table, in fact, resulting from bulk import operations):
EMGId (int)
PId (int)
Name (nvarchar 4000, although this is overkill)
TypeId (int)
LastModified (date)
I need to get a matching Name in order to associate the BioEntityId with the PId residing in the EGM table. Originally, I tried to do everything with a single inner join, but the query appeared to be taking way too long, and the log file of the database (in simple recovery mode) managed to chew up all the available disk space (just over 200GB, when the database occupies 18GB), and the query would fail after waiting for two days, if I'm not mistaken. I managed to keep the log from growing (it is only 33MB now), but the query has been running non-stop for 6 days and it doesn't look like it's gonna stop anytime soon.
I'm running it on a fairly decent computer (4GB RAM, Core 2 Duo (E8400) 3GHz, Windows Server 2008, SQL Server 2008) and I've noticed that the computer jams occasionally every 30 seconds (give or take) for a couple of seconds. This makes it quite hard to use it for anything else, which is really getting on my nerves.
Now, here's the query:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM INNER JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
I had manually set up some indexes; both EGM and BioEntity had a non-clustered covering index containing TypeId and Name. However, the query ran for five days and did not finish either, so I tried running the Database Tuning Advisor to get the thing to work. It suggested deleting my older indexes and instead creating statistics and two clustered indexes (one on each table, containing just the TypeId, which I find rather odd, or just plain dumb, but I gave it a go anyway).
It has been running for 6 days now and I'm still not sure what to do...
Any ideas guys? How can I make this faster (or, at least, finite)?
Update:
- Ok, I've canceled the query and rebooted the server to get the OS up and running again
- I'm rerunning the workflow with your proposed changes, specifically cropping the nvarchar field to a much smaller size and replacing "like" with "=". This is gonna take at least two hours, so I'll be posting further updates later on
Update 2 (1PM GMT time, 18/11/09):
- The estimated execution plan shows 67% of the cost in table scans, followed by a 33% hash match. Next comes 0% parallelism (isn't this strange? This is the first time I'm using the estimated execution plan, but this particular fact raised my eyebrow), 0% hash match, more 0% parallelism, 0% top, 0% table insert and finally another 0% select into. The indexes are clearly no good, as expected, so I'll make manual indexes and discard the suggested ones.
I'm not an SQL tuning expert, but joining hundreds of millions of rows on a VARCHAR field doesn't sound like a good idea in any database system I know.
You could try adding an integer column to each table, computed as a hash of the NAME field; that should whittle the possible matches down to a reasonable number before the engine has to look at the actual VARCHAR data.
For huge joins, sometimes explicitly choosing a loop join speeds things up:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM
INNER LOOP JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
As always, posting your estimated execution plan could help us provide better answers.
EDIT: If both inputs are sorted (they should be, with the covering index), you can try a MERGE JOIN:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM
INNER JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
OPTION (MERGE JOIN)
First, 100M-row joins are not at all unreasonable or uncommon.
However, I suspect the cause of the poor performance you're seeing may be related to the INTO clause. With that, you are not only doing a join, you are also writing the results to a new table. Your observation about the log file growing so huge is basically confirmation of this.
One thing to try: remove the INTO and see how it performs. If the performance is reasonable, then to address the slow write you should make sure that your DB log file is on a separate physical volume from the data. If it isn't, the disk heads will thrash (lots of seeks) as they read the data and write the log, and your perf will collapse (possibly to as little as 1/40th to 1/60th of what it could be otherwise).
Maybe a bit off-topic, but:
"I've noticed that the computer jams occasionally every 30 seconds (give or take) for a couple of seconds."
This behavior is characteristic of a cheap RAID5 array (or maybe of a single disk) copying gigabytes of information (and your query mostly copies data).
More about the problem: can't you partition your query into smaller blocks? For example, names starting with A, B, etc., or IDs in specific ranges? This could substantially decrease transactional/locking overhead.
I'd try removing the LIKE operator, as you don't seem to be doing any wildcard matching.
As recommended, I would hash the name to make the join more reasonable. I would also strongly consider investigating whether the id can be assigned during the batch import through a lookup, if possible, since this would eliminate the need to do the join later (and potentially having to perform such an inefficient join repeatedly).
I see you have an index on the TypeId; this would help immensely if it is at all selective. In addition, add the column with the hash of the name to the same index:
SELECT EGM.Name
,BioEntity.BioEntityId
INTO AUX
FROM EGM
INNER JOIN BioEntity
ON EGM.TypeId = BioEntity.TypeId -- Hopefully a good index
AND EGM.NameHash = BioEntity.NameHash -- Should be a very selective index now
AND EGM.name LIKE BioEntity.Name
Another suggestion I might offer is to tune your query against a subset of the data instead of processing all 100M rows at once. This way you don't have to spend so much time waiting to see whether your query will finish. Then you could inspect the query execution plan, which may also provide some insight into the problem at hand.
100 million records is HUGE. I'd say that to work with a database that large you'd need a dedicated test server. Using the same machine for other work while performing queries like that is not practical.
Your hardware is fairly capable, but for joins that big to perform decently you'd need more power. A quad-core system with 8GB would be a good start. Beyond that, you have to make sure your indexes are set up just right.
Do you have any primary keys or indexes? Can you select it in stages? I.e. WHERE name LIKE 'A%', WHERE name LIKE 'B%', etc.
You said you made a clustered index on TypeId in both tables, although it appears you have a primary key on each table already (BioEntityId and EGMId, respectively). You do not want TypeId to be the clustered index on those tables. You want BioEntityId and EGMId to be clustered (that will physically sort your data in order of the clustered index on disk). You want non-clustered indexes on the foreign keys you will be using for lookups, i.e. TypeId. Try making the primary keys clustered, and adding a non-clustered index on both tables that contains ONLY TypeId.
In our environment we have tables that are roughly 10-20 million records apiece. We do a lot of queries similar to yours, combining two datasets on one or two columns. Adding an index on each foreign key should help out a lot with your performance.
Please keep in mind that with 100 million records, those indexes are going to require a lot of disk space. However, it seems like performance is key here, so it should be worth it.
K. Scott has a pretty good article here which explains some issues more in depth.
Reiterating a few prior posts here (which I'll vote up)...
How selective is TypeId? If you only have 5, 10, or even 100 distinct values across your 100M+ rows, the index does nothing for you -- particularly since you're selecting all the rows anyway.
Creating a column on CHECKSUM(Name) in both tables seems good. Perhaps make this a persisted computed column:
CREATE TABLE BioEntity
(
BioEntityId int
,Name nvarchar(4000)
,TypeId int
,NameLookup AS checksum(Name) persisted
)
and then create an index like so (I'd use clustered, but even nonclustered would help):
CREATE CLUSTERED INDEX IX_BioEntity__Lookup ON BioEntity (NameLookup, TypeId)
(Check BOL, there are rules and limitations on building indexes on computed columns that may apply to your environment.)
Done on both tables, this should provide a very selective index to support your query if it's revised like this:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM INNER JOIN BioEntity
ON EGM.NameLookup = BioEntity.NameLookup
and EGM.name = BioEntity.Name
and EGM.TypeId = BioEntity.TypeId
Depending on many factors it will still run long (not least because you're copying how much data into a new table?), but this should take less than days.
Why nvarchar? Best practice is: if you don't NEED (or expect to need) Unicode support, just use varchar. If you think the longest name is under 200 characters, I'd make that column a varchar(255). I can see scenarios where the hashing that has been recommended to you would be costly (this database seems insert-intensive). With that much size, however, and given the frequency and random nature of the names, your indexes will become fragmented quickly in most scenarios where you index on a hash (dependent on the hash) or on the name.
I would alter the name column as described above and make the clustered index TypeId, EGMId/BioEntityId (the surrogate key for either table). Then you can join nicely on TypeId, and the "rough" join on Name will have less to loop through. To see how long this query might run, try it for a very small subset of your TypeIds; that should give you an estimate of the run time (although it might ignore factors like cache size, memory size, and hard disk transfer rates).
Edit: if this is an ongoing process, you should enforce a foreign key constraint between your two tables for future imports/dumps. If it's not ongoing, the hashing is probably your best bet.
I would try to solve the issue outside the box; maybe there is some other algorithm that could do the job much better and faster than the database. Of course, it all depends on the nature of the data, but there are some string search algorithms that are pretty fast (Boyer-Moore, ZBox, etc.), and other data mining approaches (MapReduce?). By carefully crafting the data export, it could be possible to bend the problem to fit a more elegant and faster solution. It could also be possible to better parallelize the problem and, with a simple client, make use of the idle cycles of the systems around you; there are frameworks that can help with this.
The output of this could be a list of refid tuples that you could use to fetch the complete data from the database much faster.
This does not prevent you from experimenting with indexes, but if you have to wait 6 days for the results, I think that justifies resources spent exploring other possible options.
My 2 cents.
Since you're not asking the DB to do any fancy relational operations, you could easily script this. Instead of killing the DB with a massive yet simple query, try exporting the two tables (can you get offline copies from the backups?).
Once you have the tables exported, write a script to perform this simple join for you. It'll take about the same amount of time to execute, but won't kill the DB.
Due to the size of the data and length of time the query takes to run, you won't be doing this very often, so an offline batch process makes sense.
For the script, you'll want to index the larger dataset, then iterate through the smaller dataset and do lookups into the large dataset's index; that makes the run roughly O(n + m) rather than a full O(n*m) scan.
If the hash match consumes too many resources, then do your query in batches of, say, 10000 rows at a time, "walking" the TypeID column. You didn't say the selectivity of TypeID, but presumably it is selective enough to be able to do batches this small and completely cover one or more TypeIDs at a time. You're also looking for loop joins in your batches, so if you still get hash joins then either force loop joins or reduce the batch size.
Using batches will also, in simple recovery mode, keep your tran log from growing very large. Even in simple recovery mode, a huge join like you are doing will consume loads of space because it has to keep the entire transaction open, whereas when doing batches it can reuse the log file for each batch, limiting its size to the largest needed for one batch operation.
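A sketch of a single batch under those assumptions (table and column names from the question; @lo/@hi are hypothetical window variables advanced by a surrounding loop):
INSERT INTO AUX (Name, BioEntityId)
SELECT EGM.Name, BioEntity.BioEntityId
FROM EGM
INNER LOOP JOIN BioEntity
    ON EGM.TypeId = BioEntity.TypeId
   AND EGM.Name LIKE BioEntity.Name
WHERE EGM.TypeId BETWEEN @lo AND @hi;   -- advance @lo/@hi each iteration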
If you truly need to join on Name, then you might consider some helper tables that convert names into IDs, basically repairing the denormalized design temporarily (if you can't repair it permanently).
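As a sketch of such a helper table (structure hypothetical): assign a surrogate integer key per distinct name once, then run the big join on that integer instead of the nvarchar.
SELECT IDENTITY(int, 1, 1) AS NameId, Name
INTO NameLookup
FROM (SELECT DISTINCT Name FROM BioEntity) AS d;
-- add NameId to EGM and BioEntity via UPDATEs joined on Name,
-- then join the two big tables on NameId (and TypeId) rather than Name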
The idea about checksum can be good, too, but I haven't played with that very much, myself.
In any case, such a huge hash match is not going to perform as well as batched loop joins. If you could get a merge join it would be awesome...
I wonder whether the execution time is taken by the join or by the data transfer.
Assuming the average data size in your Name column is 150 chars, you will actually have 300 bytes plus the other columns per record. Multiply this by 100 million records and you get about 30GB of data to transfer to your client. Do you run the client remotely or on the server itself?
Maybe you are simply waiting for 30GB of data to be transferred to your client...
EDIT: OK, I see you are inserting into the AUX table. What is the recovery model setting of the database?
To investigate the bottleneck on the hardware side, it might be interesting to see whether the limiting resource is reading data or writing data. You can start a run of the Windows Performance Monitor and capture the lengths of the read and write queues of your disks, for example.
Ideally, you should place the DB log file, the input tables, and the output table on separate physical volumes to increase speed.