analyze table, optimize table, how often? - sql

In MySQL I have 3 tables: one with about 500,000 rows, another with 300,000, and a third with around 5,000.
They each get maybe 50-500 additional rows daily.
Should I run ANALYZE TABLE and OPTIMIZE TABLE on them? If so, how often?

OPTIMIZE TABLE rebuilds the table for InnoDB, so it can take a wicked long time to run. It's used for reclaiming space and recreating indexes. I'd say run it rarely, if at all.
See the OPTIMIZE TABLE documentation.
ANALYZE TABLE should be rerun whenever the overall distribution of the indexed data changes significantly. So if you're inserting the same kind of data at the same rate over time, there's no need to run it often; maybe once a month. But if things change drastically, such that you suddenly get far more of one type of data than another, or something else unusual happens, run it afterwards.
I run it, for example, after loading a new table with data, and if you have no better guide, running it against all of them once a week is a reasonable default, as sketched below.
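For example, a minimal sketch of that weekly pass, assuming three hypothetical table names t1, t2, and t3:

-- cheap: refreshes index statistics so the optimizer keeps picking good plans
ANALYZE TABLE t1, t2, t3;
-- expensive: rebuilds the table and its indexes (InnoDB copies the whole table); run rarely
OPTIMIZE TABLE t1;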

Related

Is it possible to implement point in time recovery (PITR) in PostgreSQL for a single table?

Let's say I have a database with lots of tables, but there's one big table that's being updated regularly. At any given point in time this table contains billions of rows, and it is updated so regularly that we can expect a 100% refresh of the table by the end of each quarter, so the volume of data being moved around is on the order of tens of billions of rows. Because this table changes so constantly, I want to implement PITR, but only for this one table. I have two options:
1. Hack PostgreSQL's built-in PITR to apply to only one table.
2. Build it myself by creating a base backup, setting up continuous archiving, and using a Python script to replay the log of SQL statements up to a point in time (or using PostgreSQL's EXECUTE statement to loop through the archive). The big con with this is that it won't have the timeline functionality.
My problem is that I don't know if option 1 is even possible, and I don't know if option 2 even makes sense (looping through billions of rows sounds like it defeats the purpose of PITR, which is speed and convenience). What other options do I have?
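For context, PostgreSQL's built-in PITR (the starting point for option 1) is configured for the whole cluster, not per table. A minimal sketch of the usual settings, assuming a local /archive directory for WAL segments and PostgreSQL 12 or later:

-- on the primary: enable continuous WAL archiving (archive_mode needs a restart)
ALTER SYSTEM SET wal_level = 'replica';
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'test ! -f /archive/%f && cp %p /archive/%f';
-- on a restored base backup: replay WAL up to a point in time
-- (requires a recovery.signal file in the data directory, then start the server)
ALTER SYSTEM SET restore_command = 'cp /archive/%f %p';
ALTER SYSTEM SET recovery_target_time = '2024-01-01 00:00:00';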

Select LIMIT 1 takes long time on postgresql

I'm running a simple query on localhost PostgreSQL database and it runs too long:
SELECT * FROM features LIMIT 1;
I expect such a query to finish in a fraction of a second, since it basically says "peek anywhere in the table and pick one row". Or doesn't it?
table size is 75 GB, with an estimated row count of 1.84405e+008 (about 184 million rows)
I'm the only user of the database
the database server was just started, so I guess nothing is cached in memory
I totally agree with #larwa1n's comment on your post.
The reason here, I guess, is simply that the SELECT itself is slow to read even the first row.
In my experience there may be other reasons as well, which I list below:
The table is too big, so add a WHERE clause and an index so the query doesn't have to scan the whole table.
The performance of your server or disk drive is too slow.
Some other process is taking most of the resources.
Another reason may be a missing maintenance task: check whether autovacuum is running. If it isn't, check whether this table has been vacuumed at all; if not, run a VACUUM FULL on that table (sketched below). When you do a lot of inserts/updates/deletes on a large table without vacuuming, the table ends up stored in fragmented disk blocks, which makes queries take longer.
Hopefully this answer will help you find the underlying reason.
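A minimal sketch of those checks, assuming the table is named features as in the question:

-- has (auto)vacuum ever processed the table, and how many dead rows has it accumulated?
SELECT relname, last_vacuum, last_autovacuum, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'features';
-- see where the time actually goes for the slow query
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM features LIMIT 1;
-- last resort: rewrite the table to remove bloat (takes an exclusive lock)
VACUUM FULL features;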

BigQuery query taking a long time

A simple count query on one of my tables takes a long time to complete (~18 secs). The table has around half a million rows, while the same query on a bigger table (around 3 million rows) takes less than 3 secs. The schema is exactly the same, and the query is a simple SELECT count(*) FROM [dataset.table].
Any ideas why this is happening and what can I do to prevent it?
It looks like the issue with your table is that it was created in a lot of small chunks; this takes more work to query, since we spend a lot of time on filesystem operations (listing files and opening them).
Even so, a table the size of yours should not be so slow; BigQuery is currently experiencing high filesystem load that is causing high variability in latency. We're actively working on resolving this one. So that is the first problem.
The second problem is that we probably should do a better job of compacting the table. I've filed an internal bug that we should tweak our heuristics to be a bit more aggressive in compaction.
As a workaround, you can compact the table manually by copying the table in place. In other words, run a SELECT * FROM ... and write the output to the same table, using writeDisposition:WRITE_TRUNCATE, destinationTable:<your table>, allowLargeResults:true, and flattenSchema:false.
Again, this last step shouldn't be needed, but for now it should improve your situation.

Loading huge flatfiles to SQL table is too slow via SSIS package

I receive about 8 huge delimited flat files to be loaded into a SQL Server (2012) table once every week. The total number of rows across all the files is about 150 million, and each file has a different number of rows. I have a simple SSIS package which loads data from the flat files (using a foreach container) into a history table. Then a select query runs on this history table to pick out the current week's data and load it into a staging table.
We ran into problems as the history table grew very large (8 billion rows), so I decided to back up the data in the history table and truncate it. Before truncation, the package execution time had ranged from 15 hours up to 63 hours. We hoped that after truncation it would go back to 15 hours or less, but to my surprise, even after 20+ hours the package is still running, and the worst part is that it is still loading the history table. The latest count is around 120 million rows, and it still has to load the staging data, which might take just as long.
Neither the history table nor the staging table has any indexes, which is why the select query on the history table used to take most of the execution time. But loading all the flat files into the history table was always under 3 hours.
I hope I'm making sense. Can someone help me understand what could be the reason behind this unusual execution time this week? Thanks.
Note: the biggest file (8 GB) was read by the flat file source in 3 minutes, so I'm thinking the source is not the bottleneck here.
There's no good reason, IMHO, why that server should take that long to load that much data. Are you saying that the process which used to take 3 hours, now takes 60+? Is it the first (data-load) or the second (history-table) portion that has suddenly become slow? Or, both at once?
I think the first thing that I would do is to "trust, but verify" that there are no indexes at play here. The second thing I'd look at is the storage allocation for this tablespace ... is it running out of room, such that the SQL server is having to do a bunch of extra calisthenics to obtain and to maintain storage? How does this process COMMIT? After every row? Can you prove that the package definition has not changed in the slightest, recently?
Obviously, "150 million rows" is not a lot of data, these days; neither is 8GB. If you were "simply" moving those rows into an un-indexed table, "3 hours" would be a generous expectation. Obviously, the only credible root-cause of this kind of behavior is that the disk-I/O load has increased dramatically, and I am healthily suspicious that "excessive COMMITs" might well be part of the cause: re-writing instead of "lazy-writing," re-reading instead of caching.

how to automatically determine which tables need a vacuum / reindex in postgresql

I've written a maintenance script for our database and would like to run that script on whichever tables most need vacuuming/reindexing during our downtime each day. Is there any way to determine that within Postgres?
I would classify tables needing attention like this:
tables that need vacuuming
tables that need reindexing (we find this makes a huge difference to performance)
I see something roughly promising here
It sounds like you are trying to re-invent autovacuum. Any reason you can't just enable that and let it do its job?
For the actual information you want, look at pg_stat_all_tables and pg_stat_all_indexes.
For a good example of how to use the data in those views, look at the autovacuum source. It doesn't query the views directly, but it uses the same information.
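For example, one rough way to rank tables by how badly they need a vacuum, using pg_stat_all_tables (the 10,000 dead-tuple cut-off is just an illustrative threshold):

-- tables with the most dead rows and no recent (auto)vacuum are the best candidates
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_all_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC;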
I think you really should consider auto-vacuum.
However, if I understood your needs correctly, here's what I'd do:
For every table (how many tables do you have?) define the criteria;
For example, table 'foo' needs to be reindexed every X new records and vacuumed every X updates, deletes or inserts.
Write your own application to do that.
Every day it checks the tables' status, saves it in a log (to compare the row counts over time), and then reindexes/vacuums the tables that match your criteria.
Sounds a little hacky, but I think it's a good way to build a custom autovacuum with custom 'trigger' criteria.
How about adding the same trigger function, fired after any CRUD action, to all the tables?
The function receives the table name, checks the status of the table, and then runs vacuum or reindex on that table.
It should be a "simple" PL/pgSQL trigger, but then those are never simple...
Also, if your DB machine is strong enough and your downtime long enough, just run a script every night to reindex it all and vacuum it all, as sketched below. That way, even if your criteria were not met at check time (at night) but were close to being met (a few records short of your threshold), it will not pose an issue the next day when the table does reach the criteria.
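A minimal sketch of that blanket nightly pass, assuming a database named mydb (placeholder); note that REINDEX DATABASE takes heavy locks, so it only fits inside a real downtime window:

-- vacuum and re-analyze every table in the current database
VACUUM ANALYZE;
-- rebuild every index (mydb must be the database you are connected to)
REINDEX DATABASE mydb;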