I use temp tables frequently to simplify data loads (easier debugging, cleaner select statements, etc). If performance demands it, I'll create a physical table etc.
I noticed recently that I automatically declare my temp tables as global (##temp_load) as opposed to local (#temp_table). I don't know why but that's been my habit for years. I never need the tables to be global but I'm curious if there is additional overhead for creating them as global. And should I work on changing my habits.
Are there additional risks for making them global?
Non-Global temp tables are pretty much guaranteed never to collide.
Global temp tables are similar to materialized tables in that the name needs to be unique per server.
As a rule, only use ##GLOBAL_TEMP tables when you must.
Otherwise, if you are writing a proc that could me run more than once simultaneously, the procs will interact with each other in unpredictable ways, making it extremely difficult to troubleshoot - Instance 1 can change data being used by Instance 2 which causes Instance 3 to generate incorrect results as well.
My personal opinion on Temp tables is that I only use them when:
I have a medium-to-large resultset (more than 1m rows)
I will need to index that resultset
I will not need to use that resultset more than once per iteration of the process
I am confident I will not need to resume the process at any point
I highlighted that last bullet because this is the main reason I try to minimize temp table use:
If you have a long-running process, and you use temp tables to store intermediate data sets, and something dies say 90% of the way through, you have to completely restart if that data is not in a materialized table most of the time.
Some of my processes run for days on billions of rows of data, so I am not interested in restarting from scratch ever.
Related
Let's say I have a database with lots of tables, but there's one big table that's being updated regularly. At any given point in time, this table contains billions of rows, and let's say that the table is updated so regularly that we can expect a 100% refresh of the table by the end of each quarter. So the volume of data being moved around is in the order tens of billions. Because this table is changing so constantly, I want to implement a PITR, but only for this one table. I have two options:
Hack PostgreSQL's in-house PITR to apply only for one table.
Build it myself by creating a base backup, set up continuous archiving, and using a python script to execute the log of SQL statements up to a point in time (or use PostgreSQL's EXECUTE statement to loop through the archive). The big con with this is that it won't have the timeline functionality.
My problem is, I don't know if option 1 is even possible, and I don't know if option 2 even makes sense (looping through billions of rows sounds like it defeats the purpose of PITR, which is speed and convenience.) What other options do I have?
My first thought is to first load data from S3 to a temporary table, apply the necessary transformations and then INSERT INTO target, final table. All the tables would have the same columns and are in Redshift.
However, how big of a performance hit would there be because of using multiple UPDATEs? Would it be better to split UPDATEs and filtering between multiple temporary tables for daily batch processing.
Instead of S3 -> TEMP -> FINAL, the flow would look like S3 -> TEMP1 -> ... -> TEMPN - > FINAL, where "->" would be "INSERT INTO".
Also, is it better to create temporary tables (CREATE TEMP TABLE) on the spot and dropping them every day, or use persisting temporary tables that would be truncated every day. I think using persisting temp tables would be the better choice as it allows me to check how the data looked as it was loaded and transformed that day.
As you are seeing there are lots of ways to run an update process and which is better will depend on factors that are not presented here. First off let's clarify what a TEMP table is and differentiate it from a staging table. A temp table only lives as long as the current session (connection) is active. If the connection drops then so does the TEMP table. A staging table is a permanent table used for staging data which more closely matches what you are describing you parts of your question. I'll use these two terms to be clear about which is being made (TEMP or staging).
Your question revolves around how big of a performance hit it would be to have a series of tables in the ETL (ELT?) process to improve, I expect, diagnosability / debug-ability. This is a fine goal but there are some downsides as with all tradeoffs in the real world.
If this is correct these tables will need to be staging tables as TEMP tables will disappear when the ETL session ends.
Saving a bunch of staging tables when one could be used has some downsides but how big these are depends on you situation. If your cluster is fairly idle and the ETL data payload isn't huge then the impact to the ETL process of the extra tables will be real but not large (a couple of seconds or less). These impacts are mostly around setting up (or truncating) the staging or TEMP tables. But if your cluster is running other workloads when ETL runs then the impact can be much larger.
You see there are many "resources" in a Redshift cluster that all need to be shared by everything running on the database. Some like memory allocation can be (somewhat) controlled through the WLM. Others cannot. The two biggies are network bandwidth and disk bandwidth. There is a fixed capacity to these bandwidths in Redshift and even though they are high, they are finite. There are other limits to Redshift's ability to execute a total workload but these in my experience are the big two.
Every time you create a table, TEMP or permanent, the data is stored to disk. This means a write to disk as well as distributing the data per the distribution settings of the table. Then when the table is accessed the data needs to be read from disk. All this unneeded data movement will have some impact, how large will depend on how big it is and what else is going on at the time. So you see the impact will be moderately small up to very large depending on a lot of factors, not the least of which is how many tables are you creating. The cost of doing this will need to be offset by the benefit of having these extra tables which is a business decision.
A common pattern is to load (COPY) data to a temp or staging table and then extract the DELETE patterns to one staging table and the INSERT data to another. Once the deletes and inserts are applied to then save these tables with a date stamp in the name and possibly unloaded to S3. After a while these sets of data are deleted, 1 month is common. This way you can figure out 'what happened' if things go sideways. This plus good database backups can be used to recover from code bugs.
Your secondary question is about whether it is better to drop and recreate or truncate. There have been a number of performance improvements to both of these statements. With a grain of salt, I'll offer my slightly dated experience comparing these. Both are fast but I saw drop and recreate as slightly faster (fewer dependencies to manage). That said the main difference is in how they interoperate with other aspects of the database. DROP will fail if there are dependent views (unless cascaded) and table permissions will be lost. DROP cannot be run in a transaction block and since it needs an exclusive lock on the table can be held off my another session reading the table. TRUNCATE can run in a transaction block but will force a commit so transaction changes will become visible to all. It is usually these differences that made the decision about TRUNCATE vs. DROP and there are other options such as DELETE and ALTER TABLE APPEND that have their own set of plusses and minuses.
So I'd generally advise against creating more tables than are actually needed in the ETL process when all needs are weighed (including performance and business needs). You may have excess capacity now but usually Redshift clusters get busier over time. The guiding principal here is don't move large amounts of data more times than is necessary.
I have a doubt in mind when retrieving data from database.
There are two tables and master table id always inserted to other table.
I know that data can retrieve from two table by joining but want to know,
if i first retrieve all my desire data from master table and then in loop (in programing language) join to other table and retrieve data, then which is efficient and why.
As far as efficiency goes the rule is you want to minimize the number of round trips to the database, because each trip adds a lot of time. (This may not be as big a deal if the database is on the same box as the application calling it. In the world I live in the database is never on the same box as the application.) Having your application loop means you make a trip to the database for every row in the master table, so the time your operation takes grows linearly with the number of master table rows.
Be aware that in dev or test environments you may be able to get away with inefficient queries if there isn't very much test data. In production you may see a lot more data than you tested with.
It is more efficient to work in the database, in fewer larger queries, but unless the site or program is going to be very busy, I doubt that it'll make much difference that the loop is inside the database or outside the database. If it is a website application then looping large loops outside the database and waiting on results will take a more significant amount of time.
What you're describing is sometimes called the N+1 problem. The 1 is your first query against the master table, the N is the number of queries against your detail table.
This is almost always a big mistake for performance.*
The problem is typically associated with using an ORM. The ORM queries your database entities as though they are objects, the mistake is assume that instantiating data objects is no more costly than creating an object. But of course you can write code that does the same thing yourself, without using an ORM.
The hidden cost is that you now have code that automatically runs N queries, and N is determined by the number of matching rows in your master table. What happens when 10,000 rows match your master query? You won't get any warning before your database is expected to execute those queries at runtime.
And it may be unnecessary. What if the master query matches 10,000 rows, but you really only wanted the 27 rows for which there are detail rows (in other words an INNER JOIN).
Some people are concerned with the number of queries because of network overhead. I'm not as concerned about that. You should not have a slow network between your app and your database. If you do, then you have a bigger problem than the N+1 problem.
I'm more concerned about the overhead of running thousands of queries per second when you don't have to. The overhead is in memory and all the code needed to parse and create an SQL statement in the server process.
Just Google for "sql n+1 problem" and you'll lots of people discussing how bad this is, and how to detect it in your code, and how to solve it (spoiler: do a JOIN).
* Of course every rule has exceptions, so to answer this for your application, you'll have to do load-testing with some representative sample of data and traffic.
I have a table where i have millions of records. The total size of that table only is somewhere 6-7 GigaByte. This table is my application log table. This table is growing really fast, which makes sense. Now I want to move records from log table into backup table. Here is the scenario and here is my question.
Table Log_A
Insert into Log_b select * from Log_A;
Delete from Log_A;
I am using postgres database. the question is
When this query is performed Does all the records from Log_A gets load in physical memory ? NOTE: My both of the above query runs inside a stored procedure.
If No, then how will it works ?
I hope this question applies for all database.
I hope if somebody could provide me some idea on this.
In PostgreSQL, that's likely to execute a sequential scan, loading some records into shared_buffers, inserting them, writing the dirty buffers out, and carrying on.
All the records will pass through main memory, but they don't all have to be in memory at once. Because they all get read from disk using normal buffered reads (pread) it will affect the operating system disk cache, potentially pushing other data out of the cache.
Other databases may vary. Some could execute the whole SELECT before processing the INSERT (though I'd be surprised if any serious ones did). Some do use O_DIRECT reads or raw disk I/O to avoid the OS cache affects, so the buffer cache effects might be different. I'd be amazed if any database relied on loading the whole SELECT into memory, though.
When you want to see what PostgreSQL is doing and how, the EXPLAIN and EXPLAIN (BUFFERS, ANALYZE) commands are quite useful. See the manual.
You may find writable common table expressions interesting for this purpose; it lets you do all this in one statement. In this simple case there's probably little benefit, but it can be a big win in more complex data migrations.
BTW, make sure to run that pair of queries wrapped in BEGIN and COMMIT.
Probably not.
Each record is individually processed; this particular query doesn't need to have knowledge of any of the other records to successfully execute. So the only record that needs to be in memory at any given moment is the one currently being processed.
But it really depends on whether or not the database thinks it can do it faster by loading up the whole table. Check the execution plan of the query.
If your setup allows it, just rename the old table and create a new empty one. Much faster, obviously, as no copying is done at all.
ALTER TABLE log_a RENAME TO log_b;
CREATE TABLE log_a (LIKE log_b INCLUDING ALL);
The LIKE clause copies the structure of the (now renamed) old table. INCLUDING ALL includes defaults, constraints, indexes, ...
Foreign key constraints or views depending on the table or other less common dependencies (but not queries in plpgsql functions) might be a hurdle for this route. You would have to recreate those to have them point to the new table. But a logging table like you describe probably carries no such dependencies.
This acquires an exclusive lock on the table. I assume, typical write access will be INSERT only in your case? One way to deal with concurrent access would then be to create the new table in a different schema and alter the search_path for your application user. Then the applications starts to write to the new table without concurrency issues. Of course, you wouldn't schema-qualify the table name in your INSERT statements for this to take effect.
CREATE SCHEMA log20121018;
CREATE TABLE log20121018.log_a (LIKE log20121011.log_a INCLUDING ALL);
ALTER ROLE myrole SET search_path = app, log20121018, public;
Or alter the search_path setting at whatever level is effective for you:
globally, per database, per role, per session, per function ...
So let's say I have a few million records to pull from in order to generate some reports, and instead of running my reports of the live table, I create a temp where I can then create my indexes and use it for further data extraction.
I know cached tables tend to be quicker / faster seeing as the data is stored in memory, but I'm curious to know if there are instances where using a physical temp table is better than Global Temporary Tables and why? What kind of scenario would one be better than the other when dealing with larger volumes of data?
Global Temporary Tables in Oracle are not like temporary tables in SQL Server. They are not cached in memory, they are written to the temporary tablespace.
If you are handling a large amount of data and retaining it for a reasonable amount of time - which seems likely as you want to build additional indexes - I think you should use a regular table. This is even more the case if your scenario has a single session, perhaps a background job, working with the data.
I use Subquery Factoring before I consider temp tables. If there's a need for reuse in various functions or procedures, I turn it into a view (which can turn into a materialized view depending on the data returned).
According to asktom:
...temp table and global temp table are synonymous in Oracle.
For reporting, temporary tables are helpful in that data can only be seen by the session that created it, meaning that you shouldn't have to worry about any concurrency issues.
With a non-temporary table you need to add a session handle/identifier to the table in order to distinguish between sessions.
The primary difference between ordinary (heap) tables and global temp tables in Oracle is their visibility and volatility:
Once rows are committed to an ordinary table they are visible to other sessions and are retained until deleted.
Rows in a global temp table are never visible to other sessions, and are not retained after the session ends.
So the choice should primarily be down to what your application design needs, rather than just about performance (not to say performance isn't important).
The contents of an Oracle temporary table are only visible within the session that created the data and will disappear when the session ends. So you will have to copy the data for every report.
Is this report you are doing a one time operation or will the report be run periodically? Copying large quantities of data just to run a report does not seem a good solution to me. Why not run the report on the original data?
If you can't use the original tables you may be able to create a meterialized view so the latest data is available when you need it.