PostgreSQL Query Optimization and the Postmaster Process

I am currently working with a large PostgreSQL database derived from a Wikipedia dump; it contains about 40 GB of data. The database is running on an HP ProLiant ML370 G5 server with SUSE Linux Enterprise Server 10; I am querying it from my laptop over a private network managed by a simple D-Link router. I assigned static DHCP (private) IPs to both the laptop and the server.
Anyway, from my laptop, using pgAdmin III, I send off some SQL commands/queries; some of these are CREATE INDEX, DROP INDEX, DELETE, SELECT, etc. Sometimes I send a command (like CREATE INDEX) and it returns, telling me that the query executed successfully. However, the postmaster process assigned to that command seems to remain sleeping on the server. I do not really mind this in itself, since I assume PostgreSQL maintains a pool of postmasters ready to process queries. But when such a process eats up 6 GB of its 9.4 GB of assigned RAM (as it does at the moment), I worry. Maybe this is a cache of data kept in [shared] memory in case another query needs that same data, but I do not know.
Another thing is bothering me.
I have 2 tables. One is the page table, which has an index on its page_id column. The other is the pagelinks table, whose pl_from column references either nothing or a value in page.page_id; unlike page_id, pl_from has no index (yet). To give you an idea of the scale of the tables and the necessity of finding a viable solution: the page table has 13.4 million rows (after I deleted those I do not need), while the pagelinks table has 293 million.
I need to execute the following command to clean the pagelinks table of some of its useless rows:
DELETE FROM pagelinks USING page WHERE pl_from NOT IN (page_id);
So basically, I wish to rid the pagelinks table of all links coming from a page not in the page table. Even after disabling nested loops and/or sequential scans, the query optimizer always gives me the following "solution":
Nested Loop (cost=494640.60..112115531252189.59 rows=3953377028232000 width=6)
Join Filter: ("outer".pl_from <> "inner".page_id)
-> Seq Scan on pagelinks (cost=0.00..5889791.00 rows=293392800 width=17)
-> Materialize (cost=494640.60..708341.51 rows=13474691 width=11)
-> Seq Scan on page (cost=0.00..402211.91 rows=13474691 width=11)
It seems that such a task would take weeks, if not longer, to complete; obviously, this is unacceptable. It seems to me that I would much rather it used the page_id index to do its thing... but it is a stubborn optimizer and I might be wrong.

To your second question: you could try creating a new table with just the records you need with a CREATE TABLE AS statement; if the new table is sufficiently small, it might be faster - but it might not help either.

Indeed, I decided to CREATE a temporary table to speed up query execution:
CREATE TABLE temp_to_delete AS(
(SELECT DISTINCT pl_from FROM pagelinks)
EXCEPT
(SELECT page_id FROM page));
DELETE FROM pagelinks USING temp_to_delete
WHERE pagelinks.pl_from IN (temp_to_delete.pl_from);
Surprisingly, this query completed in about 4 hours, while the initial query had remained active for about 14 hours before I decided to kill it. More specifically, the DELETE returned:
Query returned successfully: 31340904 rows affected, 4415166 ms execution time.
As for the first part of my question, it seems that the postmaster process does indeed keep some information in cache; when another query needs information that is not in the cache and memory (RAM) is required, the cache is emptied. And the postmasters are indeed just a pool of processes.
It has also occurred to me that gnome-system-monitor gives incomplete information and is of little informational value. It is mostly because of this application that I have been so confused lately; for example, it does not account for the memory usage of other users (like the postgres user!) and even tells me that I have 12 GB of RAM left when this is simply not true. Hence, I tried out a couple of system monitors, since I like to know how PostgreSQL is using its resources, and xosview seems to be a valid tool.
Hope this helps!

Your postmaster process will stay there as long as the connection to the client is open.
Does pgAdmin close the connection? I don't know.
The memory used could be shared_buffers (check your config settings) or not.
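If you want to see what those settings actually are on the server, a quick check from any SQL session (standard PostgreSQL parameters, nothing specific to this setup) is:
-- Current shared buffer pool size and per-operation sort/hash memory.
SHOW shared_buffers;
SHOW work_mem;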
Now, the query. For big maintenance operations like this, feel free to set work_mem to something large, like a few GB. It looks like you have lots of RAM, so use it.
set work_mem to '4GB';
EXPLAIN DELETE FROM pagelinks WHERE pl_from NOT IN (SELECT page_id FROM page);
It should seq scan page, hash it, and seq scan pagelinks, peeking into the hash to check for page_ids. It should be quite fast (much faster than 4 hours!), but you need a large work_mem for the hash.
But since you are deleting a significant portion of your table, it might be faster to do it like this:
CREATE TABLE pagelinks2 AS SELECT a.* FROM pagelinks a JOIN page b ON a.pl_from = b.page_id;
(you could use a simple JOIN instead of IN)
You can also add an ORDER BY to this query, and your new table will be nicely ordered on disk for optimal access later.
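For example, a minimal sketch of that rebuild-and-replace approach (pagelinks2 is a throwaway name and the ORDER BY column is just a guess at what you will access by; adapt as needed):
-- Keep only links whose source page still exists, ordered on disk by pl_from.
CREATE TABLE pagelinks2 AS
SELECT a.*
FROM pagelinks a
JOIN page b ON a.pl_from = b.page_id
ORDER BY a.pl_from;

-- Then swap the tables and recreate any indexes you need, e.g.:
-- ALTER TABLE pagelinks RENAME TO pagelinks_old;
-- ALTER TABLE pagelinks2 RENAME TO pagelinks;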

Related

Executing same simple select statement or stored procedure on SQL Azure takes long time or times-out

I have two SQL Azure instances with Standard S2 (50 DTUs). When I run simple select statements on the two instances, one of them takes much more time than the other, or times out. The slower instance has more records in its tables.
Both instances have the same table schema. In the slower instance, the LogEvidence table has 1,324,928 records and the LogItem table has 649,391. In the faster instance, the LogEvidence table has 89,504 records and the LogItem table has 89,496.
Below is the simple select statement
select count(*) from logitem
The above select statement takes 0 s on the faster instance, while on the slower instance it takes 138 s. And if I execute any stored procedure, the slower instance takes much more time or times out.
The time taken by both instances should be almost the same.
Those simple queries perform big scans on the table and involve reading all rows. If the table has a clustered index, you don't have to perform a SELECT COUNT(*) to know the number of records in the table. The following query should do that much faster:
SELECT OBJECT_NAME(ps.object_id) , i.name , row_count
FROM sys.dm_db_partition_stats AS ps INNER JOIN sys.indexes AS i
ON ps.index_id = i.index_id AND ps.object_id = i.object_id
WHERE i.name like '%logitem%'
If the table does not have an ID, please add an autoid (identity) column to the table and make it the clustered index.
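A minimal sketch of that, assuming logitem really has no clustered index yet (the column and constraint names here are made up):
-- Add a surrogate identity column and make it the clustered primary key.
ALTER TABLE dbo.logitem ADD LogItemId INT IDENTITY(1,1) NOT NULL;
ALTER TABLE dbo.logitem ADD CONSTRAINT PK_logitem PRIMARY KEY CLUSTERED (LogItemId);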
You can also try adding a seemingly useless WHERE clause, like the one below, to the query; you may get better performance.
SELECT count(*)
FROM logitem
WHERE id > 0
Where Id is the autoid column.
I have some experience with Azure, and from your description I think you can do one of the following things:
Since you are using only COUNT, indexes play no role. I understand the other answer says to use WHERE id > 0, but Azure should be able to count 1M rows without a 30-second timeout. For other queries, however, you need indexes, or Azure will fail.
Check whether your server is under maintenance. The chance is low, but it does happen to us: we are on S4, and occasionally our server just gets slow, but after 10-30 minutes it works fine again. Maybe the underlying hardware gets into some process that slows it down.
This is the most important reason for slow execution, especially if a lot of writes and deletes happen on your server: check the database size. Azure databases get fragmented quickly; we have to address data fragmentation every 10 days. If your bacpac size is 100 MB but your database size in Azure shows 5-6 GB, it definitely needs optimization, as a lot of fragments have accumulated. MSDN provides queries to rebuild indexes and remove fragmentation; I don't remember the URL, but a simple Google search will bring it up (a sketch of the typical check and rebuild follows after this list). It should speed things up.
Azure has a feature that auto-generates indexes; check whether both tables share the same indexes. Maybe your faster instance has an index Azure created by itself.
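For the fragmentation point above, those scripts boil down to something like the following (a sketch only, using the LogItem table from the question as an example; you may prefer REORGANIZE over REBUILD for lighter maintenance):
-- Check fragmentation of every index on the table.
SELECT i.name AS index_name,
       ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.LogItem'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON ps.object_id = i.object_id AND ps.index_id = i.index_id;

-- Rebuild all indexes on the table to remove the fragmentation.
ALTER INDEX ALL ON dbo.LogItem REBUILD;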
You should step back and ponder your assumption:
1. "performance should be about the same" - you have more data in one case than in the other. In the limit, you should expect the performance of the second one to be somewhat slower than that of the original.
Now, let's go into the "why" it can be slower and how you can investigate each case:
Step 1: Look at the query plans for each case and see what you have. Likely, you will have something like:
StreamAgg <- Clustered Index Scan
(if you have other b-tree indexes, you might scan one of them and it might be faster since the index would not be as wide and thus the index will have fewer pages to scan)
Step 2: You can look at the actual execution times and resource use for each query to see why they are different. One way to do this is to run "set statistics time on", then "set statistics io on", then run your query. It will dump extra information into SSMS when you run the query from there. (You can read about this here: https://learn.microsoft.com/en-us/sql/t-sql/statements/set-statistics-io-transact-sql?view=sql-server-2017)
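Concretely, in SSMS that looks something like this (using the logitem query from the question; the SET options themselves are standard T-SQL):
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT COUNT(*) FROM logitem;
-- The Messages tab will now show logical/physical reads plus CPU and elapsed time,
-- which you can compare between the two instances.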
If you review the output from each one, you may find reasons for why the performance is different. One possible explanation is that the amount of memory is limited in an S2 and you are just at the boundary for where all the pages fit in memory vs. not for the two examples. In that case, doing a count(*) query would need to cycle through all the pages and do much more IO than in the smaller case where they might all be in memory already.
Step 3: You can also potentially examine the query store to get insight into why one case is fast and one case is not. An overview of how to use it is here:
https://learn.microsoft.com/en-us/sql/relational-databases/performance/monitoring-performance-by-using-the-query-store?view=sql-server-2017
Note: it is on-by-default in SQL Azure so you can just go look at the time window when you ran the queries to get insight into what was happening at that time in your database.
Finally, you might consider ways to make the query go faster if you need it to be faster.
* Creating a narrow b-tree index on the table may help for that one query (COUNT(*) doesn't return any columns and just needs a count of rows from some non-filtered index); see the sketch after this list.
* You could use a Columnstore index (which requires an S3 or above for memory reasons). This kind of column-oriented index is optimized for this kind of query and would be much faster as the table grows in the future.
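As a rough sketch of those two options (the index names and the choice of id as the key column are assumptions, not taken from your schema):
-- A narrow, non-filtered b-tree index that COUNT(*) can scan instead of the wide clustered index.
CREATE NONCLUSTERED INDEX IX_logitem_id_narrow ON dbo.logitem (id);

-- Or, on S3 and above, a columnstore index suited to this kind of aggregate query.
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_logitem_cs ON dbo.logitem (id);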
Hope that helps!

My Postgres database wasn't using my index; I resolved it, but don't understand the fix, can anyone explain what happened?

The relevant part of my database schema is a table called User, which has a boolean field Admin. There was an index on this Admin field.
The day before, I had restored my full production database onto my development machine and then made only very minor changes to the database, so the two should have been very similar.
When I ran the following command on my development machine, I got the expected result:
EXPLAIN SELECT * FROM user WHERE admin IS TRUE;
Index Scan using index_user_on_admin on user (cost=0.00..9.14 rows=165 width=3658)
Index Cond: (admin = true)
Filter: (admin IS TRUE)
However, when I ran the exact same command on my production machine, I got this:
Seq Scan on user (cost=0.00..620794.93 rows=4966489 width=3871)
Filter: (admin IS TRUE)
So instead of using the exact index that was a perfect match for the query, it was using a sequential scan of almost 5 million rows!
I then tried to run EXPLAIN ANALYZE SELECT * FROM user WHERE admin IS TRUE; with the hope that ANALYZE would make Postgres realize a sequential scan of 5 million rows wasn't as good as using the index, but that didn't change anything.
I also tried to run REINDEX INDEX index_user_on_admin in case the index was corrupted, without any benefit.
Finally, I called VACUUM ANALYZE user and that resolved the problem in short order.
My main understanding of vacuum is that it is used to reclaim wasted space. What could have been going on that would cause my index to misbehave so badly, and why did vacuum fix it?
It was most likely the ANALYZE that helped, by updating the data statistics used by the planner to determine what would be the best way to run a query. VACUUM ANALYZE just runs the two commands in order, VACUUM first, ANALYZE second, but ANALYZE itself would probably be enough to help.
The ANALYZE option to EXPLAIN has nothing to do with the ANALYZE command. It just causes Postgres to run the query and report the actual run times, so that they can be compared with the planner's predictions (EXPLAIN without ANALYZE only displays the query plan and what the planner thinks it will cost, but does not actually run the query). So EXPLAIN ANALYZE did not help because it did not update the statistics. ANALYZE and EXPLAIN ANALYZE are two completely different actions that just happen to use the same word.
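So in this case the statement that would have refreshed the planner's statistics directly is simply ANALYZE on the table. A minimal illustration (quoting the table name here because user is a reserved word):
-- Refresh the planner statistics for just this table.
ANALYZE "user";

-- EXPLAIN ANALYZE, by contrast, only runs the query and reports actual timings;
-- it does not update any statistics.
EXPLAIN ANALYZE SELECT * FROM "user" WHERE admin IS TRUE;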
PostgreSQL keeps a number of advanced statistics about the table condition, index condition, data, etc... This can get out of sync sometimes. Running VACUUM will correct the problem.
It is likely that when you reloaded the table from scratch on development, it had the same effect.
Take a look at this:
http://www.postgresql.org/docs/current/static/maintenance.html#VACUUM-FOR-STATISTICS
A partial index seems a good solution for your issue:
CREATE INDEX admin_users_ix ON users (admin)
WHERE admin IS TRUE;
It makes no sense to index a lot of tuples over an identical field value.
Here is what I think is the most likely explanation.
Your index is useful only when a very small number of rows is returned (by the way, I don't like to index booleans for this reason; you might consider using a partial index instead, or even adding a WHERE admin IS TRUE clause to the index, since that restricts the index to the cases where it is likely to be usable anyway).
If more than around (IIRC) 10% of the pages in the table are to be retrieved, the planner is likely to choose a lot of sequential disk I/O over a smaller amount of random disk I/O, because that way you don't have to wait for the platters to turn. Seek speed is a big issue there, and PostgreSQL tries to balance it against the amount of actual data to be retrieved from the relation.
You had statistics which indicated either that the table was smaller than it really was, or that admins made up a larger portion of users than they actually did, so the planner used bad information to make its decision.
VACUUM ANALYZE does three things. First, it freezes tuples visible to all transactions so that transaction wraparound is not an issue. Second, it marks tuples visible to no transactions as free space. Neither of these affected your issue. The third is that it analyzes the tables and gathers statistics on them. Keep in mind this is a random sample and can therefore sometimes be off. My guess is that on the previous run it grabbed pages with lots of admins and thus grossly overestimated the number of admins in the system.
This is probably a good time to double-check your autovacuum settings, because it is also possible that the statistics are very much out of date elsewhere, though that is far from certain. In particular, the cost-based vacuum settings have defaults that sometimes keep vacuum from ever fully catching up.
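One quick way to review those settings (a standard catalog query, nothing installation-specific):
-- Current autovacuum and cost-based vacuum settings, and where each value comes from.
SELECT name, setting, source
FROM pg_settings
WHERE name LIKE 'autovacuum%' OR name LIKE 'vacuum_cost%';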

SQL queries exceedingly slow when referencing primary key

I have an MS SQL table with about 8 million records. There is a primary key (with a clustered index at only 0.8% fragmentation) on the column "ID". When I run seemingly any query referencing the ID column, the query takes very long (and in fact ultimately crashes my application). This includes simple queries like "SELECT * FROM table WHERE ID=2020". By contrast, queries that do not reference ID (such as "SELECT TOP 100 * FROM table") are just fine.
Any ideas?
I'd check statistics, if you've already checked fragmentation.
Have they been disabled or not updated?
The quick way to check is to use STATS_DATE.
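For example, something like this (the table name is a placeholder; adjust it to your table):
-- When each statistics object on the table was last updated.
SELECT s.name AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo.YourTable');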
If the query is taking 10 minutes (!?!), you've got something seriously wrong. Even a table scan of 8 million records should take only a second or two. I would check the event log for any indication of imminent hardware failure, or try moving the database to a different server to see if there's some other hardware fault.
I'd suspect that the query is doing a table scan (or, at least, an index scan on a really large or statistics-out-of-date index). Generate an estimated execution plan (Ctrl+L in SSMS), or have SQL Server return the execution plan it actually used, because it's sometimes different (Ctrl+M to enable it, then run your query normally; it will create a new tab next to your results).
Once you have the execution plan, search for a table scan or an index scan; that is most likely the source of your slowness. The estimated execution plan may even recommend an index to help the query return more quickly; newer versions of SQL Server/SSMS include this feature.
Though I suspect you won't find anything interesting - your query is just a single step - here's a quick intro on reading execution plans: http://sqlserverpedia.com/wiki/Examining_Query_Execution_Plans
Is ID a GUID by any chance? SQL Server has an issue with the hash generation on GUID columns which makes them perform badly in indexes.
Take your exact query and grab the "estimated execution plan" from SQL Server Management Studio. If your query is triggering a table scan (which it shouldn't be), then this would explain the timing. Perhaps the plan will also show you what else is happening.
I assume that you have already rebuilt the index (based on your 0.8% fragmentation comment).
I guess if you are using the default read level and you have a long running update you may be hitting a deadlock as well? That could definitely cause this too.

SQL: Inner joining two massive tables

I have two massive tables with about 100 million records each, and I'm afraid I need to perform an inner join between the two. Now, both tables are very simple; here's the description:
BioEntity table:
BioEntityId (int)
Name (nvarchar 4000, although this is overkill)
TypeId (int)
EGM table (an auxiliary table, in fact, resulting from bulk import operations):
EMGId (int)
PId (int)
Name (nvarchar 4000, although this is overkill)
TypeId (int)
LastModified (date)
I need to get a matching Name in order to associate the BioEntityId with the PId residing in the EGM table. Originally, I tried to do everything with a single inner join, but the query appeared to be taking way too long, and the log file of the database (in simple recovery mode) managed to chew up all the available disk space (just over 200 GB, while the database occupies 18 GB); the query finally failed after two days, if I'm not mistaken. I have since managed to keep the log from growing (it is only 33 MB now), but the query has now been running non-stop for 6 days and it doesn't look like it's going to stop anytime soon.
I'm running it on a fairly decent computer (4GB RAM, Core 2 Duo (E8400) 3GHz, Windows Server 2008, SQL Server 2008) and I've noticed that the computer jams occasionally every 30 seconds (give or take) for a couple of seconds. This makes it quite hard to use it for anything else, which is really getting on my nerves.
Now, here's the query:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM INNER JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
I had manually set up some indexes; both EGM and BioEntity had a non-clustered covering index containing TypeId and Name. However, the query ran for five days and did not finish either, so I tried running the Database Tuning Advisor to get the thing to work. It suggested deleting my older indexes and instead creating statistics and two clustered indexes (one on each table, containing just the TypeId, which I find rather odd - or just plain dumb - but I gave it a go anyway).
It has been running for 6 days now and I'm still not sure what to do...
Any ideas guys? How can I make this faster (or, at least, finite)?
Update:
- Ok, I've canceled the query and rebooted the server to get the OS up and running again
- I'm rerunning the workflow with your proposed changes, specifically cropping the nvarchar field to a much smaller size and swapping "like" for "=". This is going to take at least two hours, so I'll post further updates later on.
Update 2 (1PM GMT time, 18/11/09):
- The estimated execution plan shows a 67% cost for table scans followed by a 33% hash match. Next comes 0% parallelism (isn't this strange? This is the first time I've used the estimated execution plan, but this particular fact raised my eyebrow), 0% hash match, more 0% parallelism, 0% top, 0% table insert and finally another 0% select into. The indexes seem to be useless, as expected, so I'll create manual indexes and discard the suggested ones.
I'm not an SQL tuning expert, but joining hundreds of millions of rows on a VARCHAR field doesn't sound like a good idea in any database system I know.
You could try adding an integer column to each table and computing a hash of the NAME field; that should narrow the possible matches down to a reasonable number before the engine has to look at the actual VARCHAR data.
For huge joins, sometimes explicitly choosing a loop join speeds things up:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM
INNER LOOP JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
As always, posting your estimated execution plan could help us provide better answers.
EDIT: If both inputs are sorted (they should be, with the covering index), you can try a MERGE JOIN:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM
INNER JOIN BioEntity
ON EGM.name LIKE BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
OPTION (MERGE JOIN)
First, 100M-row joins are not at all unreasonable or uncommon.
However, I suspect the cause of the poor performance you're seeing may be related to the INTO clause. With that, you are not only doing a join, you are also writing the results to a new table. Your observation about the log file growing so huge is basically confirmation of this.
One thing to try: remove the INTO and see how it performs. If the performance is reasonable, then to address the slow write you should make sure that your DB log file is on a separate physical volume from the data. If it isn't, the disk heads will thrash (lots of seeks) as they read the data and write the log, and your perf will collapse (possibly to as little as 1/40th to 1/60th of what it could be otherwise).
Maybe a bit off-topic, but:
"I've noticed that the computer jams occasionally every 30 seconds (give or take) for a couple of seconds."
This behavior is characteristic of a cheap RAID 5 array (or maybe of a single disk) while copying gigabytes of information (and your query mostly copies data).
More about the problem: can't you partition your query into smaller blocks? For example, names starting with A, B, etc., or IDs in specific ranges? This could substantially decrease transaction/locking overhead.
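A rough sketch of that slicing, assuming you switch the join to = as suggested elsewhere in the thread and accumulate the results range by range (the ranges and the AUX target are purely illustrative):
-- First slice creates the result table...
SELECT EGM.Name, BioEntity.BioEntityId
INTO AUX
FROM EGM
INNER JOIN BioEntity
  ON EGM.Name = BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
WHERE EGM.Name LIKE 'A%';

-- ...then each further range appends to it (repeat for 'B%', 'C%', ...).
INSERT INTO AUX (Name, BioEntityId)
SELECT EGM.Name, BioEntity.BioEntityId
FROM EGM
INNER JOIN BioEntity
  ON EGM.Name = BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
WHERE EGM.Name LIKE 'B%';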
I'd try removing the LIKE operator, since you don't seem to be doing any wildcard matching.
As recommended, I would hash the name to make the join more reasonable. I would also strongly consider assigning the ID during the import batches through a lookup, if at all possible, since this would eliminate the need to do the join later (and potentially having to repeat such an inefficient join).
I see you have an index on TypeId; this would help immensely if it is at all selective. In addition, add the column with the hash of the name to the same index:
SELECT EGM.Name
,BioEntity.BioEntityId
INTO AUX
FROM EGM
INNER JOIN BioEntity
ON EGM.TypeId = BioEntity.TypeId -- Hopefully a good index
AND EGM.NameHash = BioEntity.NameHash -- Should be a very selective index now
AND EGM.name LIKE BioEntity.Name
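The supporting indexes for that join shape might look like this; NameHash is the hypothetical hash column discussed above (it does not exist in your schema yet), and the index names are made up:
-- Narrow keys (TypeId, NameHash) with Name carried along for the final LIKE check.
CREATE NONCLUSTERED INDEX IX_EGM_TypeId_NameHash
    ON EGM (TypeId, NameHash) INCLUDE (Name);
CREATE NONCLUSTERED INDEX IX_BioEntity_TypeId_NameHash
    ON BioEntity (TypeId, NameHash) INCLUDE (Name);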
Another suggestion I might offer is to work on a subset of the data instead of processing all 100 million rows at once while tuning your query. That way you don't have to spend so much time waiting to see whether your query is going to finish. You can then inspect the query execution plan, which may also provide some insight into the problem at hand.
100 million records is HUGE. I'd say that to work with a database that large you'd need a dedicated test server. Using the same machine for other work while performing queries like that is not practical.
Your hardware is fairly capable, but for joins that big to perform decently you'd need even more power. A quad-core system with 8 GB would be a good start. Beyond that, you have to make sure your indexes are set up just right.
Do you have any primary keys or indexes? Can you select it in stages? i.e. WHERE name LIKE 'A%', WHERE name LIKE 'B%', etc.
You said you made a clustered index on TypeId in both tables, although it appears you already have a primary key on each table (BioEntityId and EGMId, respectively). You do not want TypeId to be the clustered index on those tables; you want BioEntityId and EGMId to be clustered (that will physically sort your data on disk in the order of the clustered index). You want non-clustered indexes on the foreign keys you will be using for lookups, i.e. TypeId. So try making the primary keys clustered, and adding a non-clustered index on both tables that ONLY contains TypeId.
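In other words, something along these lines (the constraint and index names are made up, and EMGId is the key column as listed in the question's table description; adjust if your actual column name differs):
-- Clustered primary keys on the surrogate keys...
ALTER TABLE BioEntity ADD CONSTRAINT PK_BioEntity PRIMARY KEY CLUSTERED (BioEntityId);
ALTER TABLE EGM ADD CONSTRAINT PK_EGM PRIMARY KEY CLUSTERED (EMGId);

-- ...and narrow non-clustered indexes on the lookup column only.
CREATE NONCLUSTERED INDEX IX_BioEntity_TypeId ON BioEntity (TypeId);
CREATE NONCLUSTERED INDEX IX_EGM_TypeId ON EGM (TypeId);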
In our environment we have tables that are roughly 10-20 million records apiece. We do a lot of queries similar to yours, where we combine two datasets on one or two columns. Adding an index on each foreign key should help your performance a lot.
Please keep in mind that with 100 million records, those indexes are going to require a lot of disk space. However, it seems like performance is key here, so it should be worth it.
K. Scott has a pretty good article here which explains some issues more in depth.
Reiterating a few prior posts here (which I'll vote up)...
How selective is TypeId? If you only have 5, 10, or even 100 distinct values across your 100M+ rows, the index does nothing for you -- particularly since you're selecting all the rows anyway.
I'd suggest creating a column based on CHECKSUM(Name) in both tables. Perhaps make it a persisted computed column:
CREATE TABLE BioEntity
(
BioEntityId int
,Name nvarchar(4000)
,TypeId int
,NameLookup AS checksum(Name) persisted
)
and then create an index like so (I'd use clustered, but even nonclustered would help):
CREATE clustered INDEX IX_BioEntity__Lookup on BioEntity (NameLookup, TypeId)
(Check BOL, there are rules and limitations on building indexes on computed columns that may apply to your environment.)
Done on both tables, this should provide a very selective index to support your query if it's revised like this:
SELECT EGM.Name, BioEntity.BioEntityId INTO AUX
FROM EGM INNER JOIN BioEntity
ON EGM.NameLookup = BioEntity.NameLookup
and EGM.name = BioEntity.Name
and EGM.TypeId = BioEntity.TypeId
Depending on many factors it will still run long (not least because you're copying how much data into a new table?) but this should take less than days.
Why nvarchar? Best practice is: if you don't NEED (or expect to need) Unicode support, just use varchar. If you think the longest name is under 200 characters, I'd make that column a varchar(255). I can see scenarios where the hashing that has been recommended to you would be costly (this database seems to be insert-intensive). With that much size, however, and given the frequency and random nature of the names, your indexes will fragment quickly in most scenarios where you index on a hash (depending on the hash) or on the name.
I would alter the Name column as described above and make the clustered index (TypeId, EGMId/BioEntityId) (the surrogate key for either table). Then you can join nicely on TypeId, and the "rough" join on Name will have less to loop through. To see how long this query might run, try it on a very small subset of your TypeIds; that should give you an estimate of the run time (although it might ignore factors like cache size, memory size and hard disk transfer rates).
Edit: if this is an ongoing process, you should enforce a foreign key constraint between your two tables for future imports/dumps. If it's not ongoing, the hashing is probably your best bet.
I would try to solve the issue outside the box; maybe there is some other algorithm that could do the job much better and faster than the database. Of course it all depends on the nature of the data, but there are some string-search algorithms that are pretty fast (Boyer-Moore, ZBox, etc.), as well as other data-mining approaches (MapReduce?). By carefully crafting the data export it might be possible to bend the problem to fit a more elegant and faster solution. It might also be possible to parallelize the problem better and, with a simple client, make use of the idle cycles of the systems around you; there are frameworks that can help with this.
The output of this could be a list of refid tuples that you could then use to fetch the complete data from the database much faster.
This does not prevent you from experimenting with indexes, but if you have to wait 6 days for the results, I think that justifies spending resources on exploring other possible options.
My 2 cents.
Since you're not asking the DB to do any fancy relational operations, you could easily script this. Instead of killing the DB with a massive yet simple query, try exporting the two tables (can you get offline copies from the backups?).
Once you have the tables exported, write a script to perform this simple join for you. It'll take about the same amount of time to execute, but won't kill the DB.
Due to the size of the data and length of time the query takes to run, you won't be doing this very often, so an offline batch process makes sense.
For the script, you'll want to index the larger dataset, then iterate through the smaller dataset and do lookups into the large dataset index. It'll be O(n*m) to run.
If the hash match consumes too many resources, then do your query in batches of, say, 10,000 rows at a time, "walking" the TypeID column. You didn't say how selective TypeID is, but presumably it is selective enough to allow batches this small that completely cover one or more TypeIDs at a time. You're also looking for loop joins in your batches, so if you still get hash joins then either force loop joins or reduce the batch size.
Using batches will also, in simple recovery mode, keep your transaction log from growing very large. Even in simple recovery mode, a huge join like the one you are doing will consume loads of space, because it has to keep the entire transaction open, whereas with batches the log file can be reused for each batch, limiting its size to the largest needed for one batch operation.
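A rough shape of that batching, purely as a sketch (it assumes TypeID values are small integers you can walk one at a time, that the AUX target table already exists, and that you want to force loop joins as described):
DECLARE @TypeId INT = 0;
DECLARE @MaxTypeId INT;
SELECT @MaxTypeId = MAX(TypeId) FROM EGM;

WHILE @TypeId <= @MaxTypeId
BEGIN
    -- Each slice is its own (implicit) transaction, so the log space can be reused between batches.
    INSERT INTO AUX (Name, BioEntityId)
    SELECT EGM.Name, BioEntity.BioEntityId
    FROM EGM
    INNER LOOP JOIN BioEntity
      ON EGM.Name = BioEntity.Name AND EGM.TypeId = BioEntity.TypeId
    WHERE EGM.TypeId = @TypeId;

    SET @TypeId = @TypeId + 1;
END;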
If you truly need to join on Name, then you might consider some helper tables that convert names into IDs, basically repairing the denormalized design temporarily (if you can't repair it permanently).
The idea about checksum can be good, too, but I haven't played with that very much, myself.
In any case, such a huge hash match is not going to perform as well as batched loop joins. If you could get a merge join it would be awesome...
I wonder whether the execution time is taken up by the join or by the data transfer.
Assuming the average data size in your Name column is 150 characters, you will actually have 300 bytes plus the other columns per record. Multiply this by 100 million records and you get about 30 GB of data to transfer to your client. Do you run the client remotely or on the server itself?
Maybe you are just waiting for 30 GB of data to be transferred to your client...
EDIT: OK, I see you are inserting into the AUX table. What is the recovery model setting of the database?
To investigate the bottleneck on the hardware side, it might be interesting to find out whether the limiting resource is reading data or writing data. You can start a run of the Windows Performance Monitor and capture, for example, the read and write queue lengths of your disks.
Ideally, you should place the DB log file, the input tables and the output table on separate physical volumes to increase speed.

Long UPDATE in postgresql

I have been running an UPDATE on a table containing 250 million rows with 3 indexes; this UPDATE uses another table containing 30 million rows. It has been running for about 36 hours now. I am wondering if there is a way to find out how close it is to being done, because if it is going to take a million days to do its thing, I will kill it; yet if it only needs another day or two, I will let it run. Here is the command/query:
UPDATE pagelinks SET pl_to = page_id
FROM page
WHERE
(pl_namespace, pl_title) = (page_namespace, page_title)
AND
page_is_redirect = 0
;
The EXPLAIN is not the issue here, and I only mention the big table's multiple indexes in order to somewhat justify how long it takes to UPDATE it. But here is the EXPLAIN anyway:
Merge Join (cost=127710692.21..135714045.43 rows=452882848 width=57)
Merge Cond: (("outer".page_namespace = "inner".pl_namespace) AND ("outer"."?column4?" = "inner"."?column5?"))
-> Sort (cost=3193335.39..3219544.38 rows=10483593 width=41)
Sort Key: page.page_namespace, (page.page_title)::text
-> Seq Scan on page (cost=0.00..439678.01 rows=10483593 width=41)
Filter: (page_is_redirect = 0::numeric)
-> Sort (cost=124517356.82..125285665.74 rows=307323566 width=46)
Sort Key: pagelinks.pl_namespace, (pagelinks.pl_title)::text
-> Seq Scan on pagelinks (cost=0.00..6169460.66 rows=307323566 width=46)
Now I also sent a parallel query-command in order to DROP one of pagelinks' indexes; of course it is waiting for the UPDATE to finish (but I felt like trying it anyway!). Hence, I cannot SELECT anything from pagelinks for fear of corrupting the data (unless you think it would be safe to kill the DROP INDEX postmaster process?).
So I am wondering if there is a table that keeps track of the number of dead tuples or something, because it would be nice to know how fast or how far along the UPDATE is in completing its task.
Thx
(PostgreSQL is not as intelligent as I thought; it needs heuristics)
Did you read the PostgreSQL documentation for "Using EXPLAIN", to interpret the output you're showing?
I'm not a regular PostgreSQL user, but I just read that doc and then compared it to the EXPLAIN output you're showing. Your UPDATE query is using no indexes, and it is forced to do table scans and sort both page and pagelinks. The sorts are no doubt large enough to need temporary disk files, which I think are created under your temp_tablespace.
Then I see the estimated database pages read. The top level of that EXPLAIN output says (cost=127710692.21..135714045.43). The units here are disk I/O accesses, so it's going to access the disk over 135 million times to do this UPDATE.
Note that even 10,000rpm disks with 5ms seek time can achieve at best 200 I/O operations per second under optimal conditions. This would mean that your UPDATE would take 188 hours (7.8 days) of disk I/O, even if you could sustain saturated disk I/O for that period (i.e. continuous reads/writes with no breaks). This is impossible, and I'd expect the actual throughput to be off by at least an order of magnitude, especially since you have no doubt been using this server for all sorts of other work in the meantime. So I'd guess you're only a fraction of the way through your UPDATE.
If it were me, I would have killed this query on the first day, and found another way of performing the UPDATE that made better use of indexes and didn't require on-disk sorting. You probably can't do it in a single SQL statement.
As for your DROP INDEX, I would guess it's simply blocking, waiting for exclusive access to the table, and while it's in this state I think you can probably kill it.
This is very old, but if you want a way to monitor your update: remember that sequence values are non-transactional and visible globally, so you can create one to monitor this update from another session by doing this:
create sequence yourprogress;
UPDATE pagelinks SET pl_to = page_id
FROM page
WHERE
(pl_namespace, pl_title) = (page_namespace, page_title)
AND
page_is_redirect = 0 AND NEXTVAL('yourprogress')!=0;
Then, in another session, just do this (don't worry about transactions; sequence values are visible globally):
select last_value from yourprogress;
This will show how many rows have been affected so far, so you can estimate how long it will take.
At the end, just restart your sequence for another try:
alter sequence yourprogress restart with 1;
Or just drop it:
drop sequence yourprogress;
You need indexes or, as Bill pointed out, it will need to do sequential scans on all the tables.
CREATE INDEX page_ns_title_idx on page(page_namespace, page_title);
CREATE INDEX pl_ns_title_idx on pagelink(pl_namespace, pl_title);
CREATE INDEX page_redir_idx on page(page_is_redirect);