PostgreSQL: VACUUM FULL duration estimation

I inherited a PostgreSQL database in production with one table that is around 250 GB in size. It only has around ten thousand live rows, which I estimate at no more than 20 MB.
The table grew to such a size because autovacuum was turned off at some point. (I know why this was done. It will be reactivated, and the original issue has been fixed, so this is not part of the question.)
Our problem is that many queries take a pretty long time. For example, a SELECT count(*) FROM foo; takes around 15 minutes.
Now, after considering other options, I'd like to run a VACUUM FULL on the table. I'm trying to estimate how long this would take to complete so I can plan a maintenance window.
In my understanding, VACUUM FULL creates a new table, copies all live tuples to it and replaces the original table with this copy.
My estimate would be that this process doesn't take much longer than a simple query like the one above, as the live data is pretty slim in overall size and row count.
Would you agree that my expectation of the run time of VACUUM FULL is somewhat realistic? If not, why not?
Are there best practices for estimating VACUUM FULL durations?
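
For reference, the live/dead tuple estimates behind these numbers can be sanity-checked from the statistics views (a quick sketch; the counters are estimates and only as fresh as the last ANALYZE):

-- Estimated live vs. dead tuples for the bloated table:
SELECT n_live_tup, n_dead_tup, last_autovacuum, last_analyze
FROM pg_stat_user_tables
WHERE relname = 'foo';

SELECT pg_size_pretty(pg_table_size('foo'));  -- on-disk size of the table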

The only dependable estimate can be had by restoring a file system backup on a similar machine and testing it there. That's what I would recommend.
The duration will not only depend on the size, but also on the amount of bloat: the less live data there is, the faster it will be.
That said, I'd ask for a maintenance window of 2 hours, which should be ample on anything but very questionable hardware.
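
If you do test it on a restored copy, a psql session along these lines captures both the duration and the space reclaimed (a sketch; remember that VACUUM FULL holds an ACCESS EXCLUSIVE lock on the table for the whole run):

\timing on
SELECT pg_size_pretty(pg_total_relation_size('foo'));  -- size before
VACUUM (FULL, VERBOSE) foo;
SELECT pg_size_pretty(pg_total_relation_size('foo'));  -- size after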

Related

BigQuery query taking a long time

A simple count query on one of my tables takes a long time to complete (~18 secs). The table has around half a million rows, while the same query on a bigger table (around 3 million rows) takes less than 3 secs. The schema is exactly the same, and the query is a simple SELECT count(*) FROM [dataset.table].
Any ideas why this is happening and what I can do to prevent it?
It looks like the issue with your table is that it was created in a lot of small chunks; this takes more work to query, since we spend a lot of time on filesystem operations (listing files and opening them).
Even so, a table the size of yours should not be so slow; BigQuery is currently experiencing high filesystem load that is causing high variability in latency. We're actively working on resolving this one. So that is the first problem.
The second problem is that we probably should do a better job of compacting the table. I've filed an internal bug that we should tweak our heuristics to be a bit more aggressive in compaction.
As a workaround, you can compact the table manually by copying the table in place. In other words, run a SELECT * FROM ... and write the output to the same table, using writeDisposition:WRITE_TRUNCATE, destinationTable:<your table>, allowLargeResults:true and flattenSchema:false.
Again, this last step shouldn't be needed, but for now it should improve your situation.
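
A sketch of that workaround, with placeholder dataset and table names; the listed options belong to the query job configuration, not to the SQL text itself:

-- Submit as a query job configured with writeDisposition: WRITE_TRUNCATE,
-- destinationTable: mydataset.mytable, allowLargeResults: true,
-- flattenSchema: false (mydataset.mytable is a placeholder):
SELECT * FROM [mydataset.mytable]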

Loading huge flatfiles to SQL table is too slow via SSIS package

I receive about 8 huge delimited flat files to be loaded into a SQL Server (2012) table once every week. The total number of rows across all the files is about 150 million, and each file has a different number of rows. I have a simple SSIS package which loads data from the flat files (using a Foreach Loop container) into a history table. Then a select query runs on this history table to select the current week's data and loads it into a staging table.
We ran into problems as the history table grew very large (8 billion rows). So I decided to back up the data in the history table and truncate it. Before truncation, the package execution time ranged from 15 hrs to 63 hrs. We hoped that after truncation it would go back to 15 hrs or less. But to my surprise, even after 20+ hours the package is still running. The worst part is that it is still loading the history table. The latest count is around 120 million. It still has to load the staging data, and that might take just as long.
Neither the history table nor the staging tables have any indexes, which is why the select query on the history table used to take most of the execution time. But loading from all the flat files into the history table was always under 3 hrs.
I hope I'm making sense. Can someone help me understand what could be the reason behind this unusual execution time this week? Thanks.
Note: The biggest file (8 GB) was read at the flat file source in 3 minutes. So I'm thinking the source is not the bottleneck here.
There's no good reason, IMHO, why that server should take that long to load that much data. Are you saying that the process which used to take 3 hours now takes 60+? Is it the first (data-load) or the second (history-table) portion that has suddenly become slow? Or both at once?
I think the first thing that I would do is to "trust, but verify" that there are no indexes at play here. The second thing I'd look at is the storage allocation for this tablespace ... is it running out of room, such that the SQL server is having to do a bunch of extra calisthenics to obtain and to maintain storage? How does this process COMMIT? After every row? Can you prove that the package definition has not changed in the slightest, recently?
Obviously, "150 million rows" is not a lot of data, these days; neither is 8GB. If you were "simply" moving those rows into an un-indexed table, "3 hours" would be a generous expectation. Obviously, the only credible root-cause of this kind of behavior is that the disk-I/O load has increased dramatically, and I am healthily suspicious that "excessive COMMITs" might well be part of the cause: re-writing instead of "lazy-writing," re-reading instead of caching.

SQL SERVER - Execution plan

I have a VIEW in both databases. In one database it takes less than 1 second to run, but in the other it takes 1 minute or more. I checked the indexes and everything is the same. The difference in the number of rows between the two databases is less than 10 million.
I checked the execution plan, and what I found is that in the database that takes more time, I have 3 Hash Match operators (1 aggregate and 2 right outer joins) that are responsible for 100% of the query batch. In the other database I don't have these in the execution plan.
Can anyone tell me where I can begin to search for the problem?
Thank you, sorry for the bad English.
You can check this link here for a quick explanation on different types of joins.
Basically, with the information you've given us, here are some of the alternatives for what might be wrong:
One DB has indexes the other doesn't.
The size difference between some of the joined tables in one DB over the other is dramatic enough to change the type of join used.
While your indexes might be the same on both DB table groups, as you said.. it's possible the other DB has outdated / bad statistics or too much index fragmentation, resulting in sub-optimal plans.
EDIT:
Regarding your comment below, it's true that rebuilding indexes is similar to dropping and recreating indexes. And since creating indexes also creates the statistics for those indexes, rebuilding will take care of them as well. Sometimes that's not enough, however.
While officially, default statistics should be built with about a 20% sampling rate of the actual data, in reality the sampling rate can be as low as just a few percent, depending on how massive the table is. It's rarely anywhere near 20%. Because of that, many DBAs build statistics manually with FULLSCAN to obtain a 100% sampling rate.
The statistics take the same amount of storage space either way, so there are really no downsides to this aside from the extra time required in maintenance plans. In my current project, we have several situations where the default sampling rate for the statistics is not enough, and would still produce bad plans. So we routinely update all statistics with FULLSCAN every few weeks to make sure the performance stays top notch.
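
A minimal sketch of that routine (the table name is a placeholder):

-- Rebuild all statistics on one table with a 100% sampling rate:
UPDATE STATISTICS dbo.MyTable WITH FULLSCAN;
-- Or refresh every statistics object in the database; note that
-- sp_updatestats uses the default sampling rate, not FULLSCAN:
EXEC sys.sp_updatestats;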

View Clustered Index Seek over 0.5 million rows takes 7 minutes

Take a look at this execution plan: http://sdrv.ms/1agLg7K
It’s not estimated, it’s actual. From an actual execution that took roughly 30 minutes.
Select the second statement (takes 47.8% of the total execution time – roughly 15 minutes).
Look at the top operation in that statement – View Clustered Index Seek over _Security_Tuple4.
The operation costs 51.2% of the statement – roughly 7 minutes.
The view contains about 0.5M rows (for reference, log2(0.5M) ~= 19 – a mere 19 steps if each index tree node had only two children; the real fanout is much higher, so even fewer steps).
The result of that operator is zero rows (doesn’t match the estimate, but never mind that for now).
Actual executions – zero.
So the question is: how the bleep could that take seven minutes?! (and of course, how do I fix it?)
EDIT: Some clarification on what I'm asking here.
I am not interested in general performance-related advice, such as "look at indexes", "look at sizes", "parameter sniffing", "different execution plans for different data", etc.
I know all that already, I can do all that kind of analysis myself.
What I really need is to know what could cause that one particular clustered index seek to be so slow, and then what could I do to speed it up.
Not the whole query.
Not any part of the query.
Just that one particular index seek.
END EDIT
Also note how the second and third most expensive operations are seeks over _Security_Tuple3 and _Security_Tuple2 respectively, and they only take 7.5% and 3.7% of the time. Meanwhile, _Security_Tuple3 contains roughly 2.8M rows, which is six times that of _Security_Tuple4.
Also, some background:
This is the only database from this project that misbehaves.
There are a couple dozen other databases of the same schema, none of them exhibit this problem.
The first time this problem was discovered, it turned out that the indexes were 99% fragmented.
Rebuilding the indexes did speed it up, but not significantly: the whole query took 45 minutes before rebuild and 30 minutes after.
While playing with the database, I have noticed that simple queries like “select count(*) from _Security_Tuple4” take several minutes. WTF?!
However, they only took several minutes on the first run, and after that they were instant.
The problem is not connected to the particular server, neither to the particular SQL Server instance: if I back up the database and then restore it on another computer, the behavior remains the same.
First I'd like to point out a little misconception here: although the delete statement is said to take nearly 48% of the entire execution, this does not have to mean it takes 48% of the time needed; in fact, the 51% assigned inside that part of the query plan most definitely should NOT be interpreted as taking 'half of the time' of the entire operation!
Anyway, going by your remark that it takes a couple of minutes to do a COUNT(*) of the table 'the first time', I'm inclined to say that you have an IO issue related to said table/view. Personally, I don't like materialized views very much, so I have no real experience with them and how they behave internally, but normally I would suggest that fragmentation is taking its toll on the underlying storage system. The reason it works fast the second time is that it's much faster to access the pages from the cache than it was when fetching them from disk, especially when they are all over the place. (Are there any (max) fields in the view?)
Anyway, to find out what is taking so long, I'd suggest you take this code out of the trigger it's currently in, 'fake' the inserted and deleted tables, and then try running the queries again, adding timestamps and/or using a program like SQL Sentry Plan Explorer to see how long each part REALLY takes (it has a duration column when you run a script from within the program).
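
A rough sketch of that harness, with made-up table and column names (the trigger's actual statements go where indicated):

-- Empty stand-ins for the trigger's virtual tables (dbo.SourceTable is hypothetical):
SELECT * INTO #inserted FROM dbo.SourceTable WHERE 1 = 0;
SELECT * INTO #deleted  FROM dbo.SourceTable WHERE 1 = 0;
-- ... INSERT representative test rows into #inserted / #deleted here ...

DECLARE @t0 datetime2 = SYSUTCDATETIME();
-- <first statement from the trigger body, rewritten to read #inserted / #deleted>
SELECT DATEDIFF(MILLISECOND, @t0, SYSUTCDATETIME()) AS stmt1_elapsed_ms;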
It might well be that you're looking at the wrong part; experience shows that cost and actual execution times are not always as related as we'd like to think.
Observations include:
Is this the biggest of these databases that you are working with? If so, size matters to the optimizer. It will make quite a different plan for large data sets versus smaller data sets.
The estimated rows and the actual rows are quite divergent. This is most apparent in the fourth query, "delete c from #alternativeRoutes....", where _Security_Tuple5 is estimated to return 16 rows but actually used 235,904 rows. For that many rows, an Index Scan could be more performant than Index Seeks. Are the statistics on the table up to date, or do they need to be updated?
The "select count(*) from _Security_Tuple4" takes several minutes the first time; the second time it is instant. This is because the data is now all cached in memory (until it ages out), so the second query is fast.
Because the problem moves with the database, the statistics, any missing indexes, et cetera are in the database. I would also suggest checking that the indexes match those of the other databases using the same schema.
This is not a full analysis, but it gives you some things to look at.
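
On the statistics question, the age of the existing statistics is easy to check (a sketch; the dbo schema is an assumption):

-- Last update time for every statistics object on the table:
SELECT s.name AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS last_updated
FROM sys.stats AS s
WHERE s.object_id = OBJECT_ID('dbo._Security_Tuple5');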
Fyodor,
First:
The problem is not connected to the particular server, neither to the particular SQL Server instance: if I back up the database and then restore it on another computer, the behavior remains the same.
I presume that you: a) run this query in an isolated environment, and b) the data is not being modified.
Is this correct?
Second: post here your CREATE INDEX script. Do you have a funny FILLFACTOR? SORT_IN_TEMPDB?
Third: what type are your ParentId and ObjectId columns? int, smallint, uniqueidentifier, varchar?

libpq very slow for large (20 million record) database

I am new to SQL/RDBMS.
I have an application which adds rows with 10 columns to a PostgreSQL server using the libpq library. Right now, my server is running on the same machine as my Visual C++ application.
I have added around 15-20 million records. A simple query to get the total count takes 4-5 minutes: select count(*) from <tableName>;.
I have indexed my table on the time at which I enter the data (timecode). Most of the time I need counts with different WHERE / AND clauses added.
Is there any way to make things faster? I need it to be as fast as possible, because once the server moves onto the network, things will become much slower.
Thanks
I don't think network latency will be a large factor in how long your query takes. All the processing is being done on the PostgreSQL server.
The PostgreSQL MVCC design means each row in the table - not just the index(es) - must be walked to calculate the count(*), which is an expensive operation. In your case there are a lot of rows involved.
There is a good wiki page on this topic here http://wiki.postgresql.org/wiki/Slow_Counting with suggestions.
Two suggestions from this link. One is to count an indexed column:
select count(index-col) from ...;
... though this only works under some circumstances.
If you have more than one index see which one has the least cost by using:
EXPLAIN ANALYZE select count(index-col) from ...;
If you can live with an approximate value, another is to query the Postgres system catalog for an approximate row count:
select reltuples from pg_class where relname='mytable';
How good this approximation is depends on how often autovacuum is set to run and many other factors; see the comments.
Consider pg_relation_size('tablename') and divide it by the seconds spent in
select count(*) from tablename
That will give the throughput of your disk(s) when doing a full scan of this table. If it's too low, you want to focus on improving that in the first place.
Having a good I/O subsystem and well performing operating system disk cache is crucial for databases.
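
Put concretely, in psql (mytable is a placeholder):

\timing on
SELECT pg_relation_size('mytable');  -- bytes a sequential scan must read
SELECT count(*) FROM mytable;        -- note the duration psql reports
-- throughput in MB/s = (bytes / 1048576) / seconds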
The default postgres configuration is meant to not consume too much resources to play nice with other applications. Depending on your hardware and the overall utilization of the machine, you may want to adjust several performance parameters way up, like shared_buffers, effective_cache_size or work_mem. See the docs for your specific version and the wiki's performance optimization page.
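
The current values are easy to inspect before changing anything (a trivial sketch; of these, shared_buffers is the only one that requires a server restart to change):

SHOW shared_buffers;
SHOW effective_cache_size;
SHOW work_mem;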
Also note that the speed of select count(*)-style queries have nothing to do with libpq or the network, since only one resulting row is retrieved. It happens entirely server-side.
You don't state what your data is, but normally the way to handle tables with a very large amount of data is to partition the table. http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
This will not speed up your select count(*) from <tableName>; query, and might even slow it down, but if you are normally only interested in a portion of the data in the table, this can be helpful.
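
A minimal sketch of the inheritance-based scheme those 9.1 docs describe, reusing the timecode column from the question (the table names and the monthly range are made up):

CREATE TABLE measurements (
    timecode  timestamptz NOT NULL,
    payload   text
);
CREATE TABLE measurements_2013_01 (
    CHECK (timecode >= DATE '2013-01-01' AND timecode < DATE '2013-02-01')
) INHERITS (measurements);
-- With constraint_exclusion = partition (the default), a query such as
--   SELECT count(*) FROM measurements WHERE timecode >= '2013-01-15';
-- skips child tables whose CHECK constraint rules them out.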