Oracle - To Global Temp or NOT to Global Temp - sql

So let's say I have a few million records to pull from in order to generate some reports, and instead of running my reports off the live table, I create a temp table where I can then create my indexes and use it for further data extraction.
I know cached tables tend to be faster since the data is stored in memory, but I'm curious to know if there are instances where using a physical temp table is better than a Global Temporary Table, and why. What kind of scenario would make one better than the other when dealing with larger volumes of data?

Global Temporary Tables in Oracle are not like temporary tables in SQL Server. They are not cached in memory, they are written to the temporary tablespace.
If you are handling a large amount of data and retaining it for a reasonable amount of time - which seems likely as you want to build additional indexes - I think you should use a regular table. This is even more the case if your scenario has a single session, perhaps a background job, working with the data.
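As a rough sketch of the two options (the table, column, and index names here are invented for illustration, not taken from the question):
-- Global temporary table: rows live in the temp tablespace and vanish at end of session.
CREATE GLOBAL TEMPORARY TABLE report_work_gtt (
  customer_id NUMBER,
  order_total NUMBER
) ON COMMIT PRESERVE ROWS;

-- Regular heap table: rows persist across sessions, and you can gather stats
-- and build whatever indexes the reporting queries need.
CREATE TABLE report_work (
  customer_id NUMBER,
  order_total NUMBER
);
CREATE INDEX report_work_cust_ix ON report_work (customer_id);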

I use Subquery Factoring before I consider temp tables. If there's a need for reuse in various functions or procedures, I turn it into a view (which can turn into a materialized view depending on the data returned).
According to asktom:
...temp table and global temp table are synonymous in Oracle.
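A minimal sketch of the subquery-factoring approach (table and column names are made up):
WITH order_totals AS (
  SELECT customer_id, SUM(order_total) AS total_spent
  FROM   orders
  WHERE  order_date >= SYSDATE - 30
  GROUP  BY customer_id
)
SELECT customer_id, total_spent
FROM   order_totals
WHERE  total_spent > 1000;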

For reporting, temporary tables are helpful in that data can only be seen by the session that created it, meaning that you shouldn't have to worry about any concurrency issues.
With a non-temporary table you need to add a session handle/identifier to the table in order to distinguish between sessions.
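For instance (a sketch only; the report_work table and its columns are invented), you could stamp each row with the current session's identifier and filter on it:
-- Load the work table, tagging rows with this session's identifier.
INSERT INTO report_work (session_id, customer_id, order_total)
SELECT SYS_CONTEXT('USERENV', 'SESSIONID'), customer_id, order_total
FROM   orders;

-- Each report then only reads its own session's rows.
SELECT *
FROM   report_work
WHERE  session_id = SYS_CONTEXT('USERENV', 'SESSIONID');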

The primary difference between ordinary (heap) tables and global temp tables in Oracle is their visibility and volatility:
Once rows are committed to an ordinary table they are visible to other sessions and are retained until deleted.
Rows in a global temp table are never visible to other sessions, and are not retained after the session ends.
So the choice should primarily be down to what your application design needs, rather than just about performance (not to say performance isn't important).
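For reference, a quick sketch of the two flavours of global temp table in Oracle (names invented):
-- Rows disappear at COMMIT:
CREATE GLOBAL TEMPORARY TABLE gtt_per_txn (id NUMBER) ON COMMIT DELETE ROWS;

-- Rows survive COMMIT but disappear when the session ends:
CREATE GLOBAL TEMPORARY TABLE gtt_per_session (id NUMBER) ON COMMIT PRESERVE ROWS;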

The contents of an Oracle temporary table are only visible within the session that created the data and will disappear when the session ends. So you will have to copy the data for every report.
Is this report you are doing a one time operation or will the report be run periodically? Copying large quantities of data just to run a report does not seem a good solution to me. Why not run the report on the original data?
If you can't use the original tables you may be able to create a materialized view so the latest data is available when you need it.
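A minimal materialized view sketch (object names are placeholders; the refresh options would depend on how current the report needs to be):
CREATE MATERIALIZED VIEW sales_report_mv
REFRESH COMPLETE ON DEMAND
AS
SELECT product_id, SUM(amount) AS total_amount
FROM   sales
GROUP  BY product_id;

-- Refresh before running the reports:
EXEC DBMS_MVIEW.REFRESH('SALES_REPORT_MV');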

Related

How many temporary/staging tables to use during the transform step of ETL?

My first thought is to first load data from S3 into a temporary table, apply the necessary transformations and then INSERT INTO the target (final) table. All the tables would have the same columns and are in Redshift.
However, how big of a performance hit would there be because of using multiple UPDATEs? Would it be better to split the UPDATEs and filtering between multiple temporary tables for daily batch processing?
Instead of S3 -> TEMP -> FINAL, the flow would look like S3 -> TEMP1 -> ... -> TEMPN -> FINAL, where "->" would be "INSERT INTO".
Also, is it better to create temporary tables (CREATE TEMP TABLE) on the spot and drop them every day, or to use persistent tables that would be truncated every day? I think using persistent tables would be the better choice, as it allows me to check how the data looked as it was loaded and transformed that day.
As you are seeing, there are lots of ways to run an update process, and which is better will depend on factors that are not presented here. First off, let's clarify what a TEMP table is and differentiate it from a staging table. A temp table only lives as long as the current session (connection) is active. If the connection drops then so does the TEMP table. A staging table is a permanent table used for staging data, which more closely matches what you are describing in parts of your question. I'll use these two terms to be clear about which is being meant (TEMP or staging).
Your question revolves around how big of a performance hit it would be to have a series of tables in the ETL (ELT?) process in order to improve, I expect, diagnosability / debuggability. This is a fine goal, but there are some downsides, as with all tradeoffs in the real world.
If this is correct these tables will need to be staging tables as TEMP tables will disappear when the ETL session ends.
Saving a bunch of staging tables when one could be used has some downsides, but how big these are depends on your situation. If your cluster is fairly idle and the ETL data payload isn't huge then the impact to the ETL process of the extra tables will be real but not large (a couple of seconds or less). These impacts are mostly around setting up (or truncating) the staging or TEMP tables. But if your cluster is running other workloads when the ETL runs then the impact can be much larger.
You see there are many "resources" in a Redshift cluster that all need to be shared by everything running on the database. Some like memory allocation can be (somewhat) controlled through the WLM. Others cannot. The two biggies are network bandwidth and disk bandwidth. There is a fixed capacity to these bandwidths in Redshift and even though they are high, they are finite. There are other limits to Redshift's ability to execute a total workload but these in my experience are the big two.
Every time you create a table, TEMP or permanent, the data is stored to disk. This means a write to disk as well as distributing the data per the distribution settings of the table. Then when the table is accessed the data needs to be read from disk. All this unneeded data movement will have some impact; how large will depend on how much data is involved and what else is going on at the time. So you see, the impact will be anywhere from moderately small to very large depending on a lot of factors, not the least of which is how many tables you are creating. The cost of doing this will need to be offset by the benefit of having these extra tables, which is a business decision.
A common pattern is to load (COPY) data to a TEMP or staging table and then extract the rows to DELETE into one staging table and the rows to INSERT into another. Once the deletes and inserts are applied, these tables are saved with a date stamp in the name and possibly unloaded to S3. After a while these sets of data are deleted; keeping one month is common. This way you can figure out 'what happened' if things go sideways. This, plus good database backups, can be used to recover from code bugs.
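Roughly, that pattern might look like this (all object names, the S3 path, and the IAM role below are placeholders, not anything from the question):
-- Stage the day's file(s):
CREATE TABLE stage_events (LIKE events);

COPY stage_events
FROM 's3://my-bucket/events/2021-06-01/'
IAM_ROLE 'arn:aws:iam::<account>:role/<copy-role>'
FORMAT AS CSV;

-- Apply deletes and inserts to the target in one transaction:
BEGIN;
DELETE FROM events
USING stage_events
WHERE events.event_id = stage_events.event_id;

INSERT INTO events
SELECT * FROM stage_events;
COMMIT;

-- Keep stage_events around (renamed or unloaded to S3 with a date stamp)
-- for a month or so in case you need to see 'what happened'.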
Your secondary question is about whether it is better to drop and recreate or to truncate. There have been a number of performance improvements to both of these statements. Taken with a grain of salt, I'll offer my slightly dated experience comparing them. Both are fast, but I saw drop-and-recreate as slightly faster (fewer dependencies to manage). That said, the main difference is in how they interoperate with other aspects of the database. DROP will fail if there are dependent views (unless cascaded) and table permissions will be lost. DROP cannot be run in a transaction block, and since it needs an exclusive lock on the table it can be held off by another session reading the table. TRUNCATE can run in a transaction block but will force a commit, so transaction changes will become visible to all. It is usually these differences that make the decision about TRUNCATE vs. DROP, and there are other options such as DELETE and ALTER TABLE APPEND that have their own set of pluses and minuses.
So I'd generally advise against creating more tables than are actually needed in the ETL process when all needs are weighed (including performance and business needs). You may have excess capacity now, but Redshift clusters usually get busier over time. The guiding principle here is: don't move large amounts of data more times than is necessary.

What are the benefits of a Make Table vs a Select query in Access?

I know you can run SELECT queries on top of SELECT queries in Access, but the application also provides the Make Table query type.
I'm wondering what the benefits/reasons for using Make Table might be?
You would usually use Make Table for performance reasons. If you have a fairly complex query that returns a subset of your table's data, and that you may need to retrieve multiple times, it can be expensive to re-run the query multiple times.
Using Make Table allows you to incur the cost of running the expensive query once, and make a copy of the query results into a table. Querying this copy would then be a lot less expensive than running your original expensive query.
This is usually a good option when you don't expect your original data to change frequently, or if you don't care that you are working off a copy of the data that may not be 100% up-to-date with the original data.
Notice what the following article on Create a make table query has to say:
Typically, you create make table queries when you need to copy or archive data. For example, suppose you have a table (or tables) of past sales data, and you use that data in reports. The sales figures cannot change because the transactions are at least one day old, and constantly running a query to retrieve the data can take time — especially if you run a complex query against a large data store. Loading the data into a separate table and using that table as a data source can reduce workload and provide a convenient data archive. As you proceed, remember that the data in your new table is strictly a snapshot; it has no relationship or connection to its source table or tables.
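In Access SQL, a make table query is essentially a SELECT ... INTO statement; a minimal sketch (table and field names are invented):
SELECT CustomerID, Sum(OrderTotal) AS TotalSales
INTO SalesSnapshot
FROM Orders
WHERE OrderDate < Date()
GROUP BY CustomerID;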
The main downside here is that a make table query creates a table. And when you are done with that table, the effort and time to delete it and recover the VERY LARGE increase in the size of the database file still has to occur. For general reports and queries of data, a plain query makes much more sense. A comparison would be to build a NEW garage every time you want to park your car.
The database engine and query system can fetch and pull rows at a very high rate, and those results can then be rendered into a report or form without having to create a temp table. It makes little sense to go through all of the trouble of having the system create a WHOLE NEW table for such results when they can with ease be sent to a report.
In other words, creating a whole table just to display or use some data that the database engine already fetched and returned makes little sense. A table is a set of rows that holds data that can be updated, and the results are permanent. A query is an "on the fly" result set or subset of data that only exists in memory and is discarded after you use the results.
So for general reporting and display of data, it makes no sense to create a temp table. A MUCH WORSE issue is that if you have two users wanting to run a report, and they both need different results but you send the results to the SAME temp table, then you have a big mess and a collision between the two users. So use of a temp table in Access for the most part makes little sense, and this is EVEN MORE so when working in a multi-user environment. And as noted, once the table is created, then after you are done you need to delete and remove the table. And with many users in a multi-user database this becomes even more of a problem and issue.
However, in a multi-user environment, as pointed out, if the resulting data needs additional processing, then sending the results to a temp table can be of use. This approach, however, assumes that EACH USER has their own front end and own copy of the application side. And better still, the temp table is created outside of the back end, in the front end application that resides on each computer. Since the application part (front end) is placed on each computer, creating the temp table does not occur in the production database (back end), and as a result you can have multiple users function correctly without each individual user creating a temp table in the production back end database. So if one is to adopt a make table query, it likely should occur on each local workstation and not in the back end database when you have a multi-user database application.
Thus for the most part, a make table and the reporting and querying of data are VERY different goals and tasks. You don't want to, nor as a general rule should you, create a whole brand new table for a simple query. In a multi-user database system the users might run 100's of reports in a given day, and FEW if any systems will send such data to a temp table in place of sending the query results directly to the report.
It creates a table, which is useful if you have a need for that table, for example for temporary use where you have to modify the data for calculations or further processing while not disturbing the original data.

When this query is performed, do all the records get loaded into physical memory?

I have a table with millions of records. The total size of that table alone is somewhere around 6-7 gigabytes. This table is my application log table. This table is growing really fast, which makes sense. Now I want to move records from the log table into a backup table. Here is the scenario, and here is my question.
Table Log_A
INSERT INTO Log_b SELECT * FROM Log_A;
DELETE FROM Log_A;
I am using a Postgres database. The question is:
When this query is performed, do all the records from Log_A get loaded into physical memory? NOTE: Both of the above queries run inside a stored procedure.
If not, then how does it work?
I hope this question applies to all databases.
I hope somebody can give me some idea about this.
In PostgreSQL, that's likely to execute a sequential scan, loading some records into shared_buffers, inserting them, writing the dirty buffers out, and carrying on.
All the records will pass through main memory, but they don't all have to be in memory at once. Because they all get read from disk using normal buffered reads (pread) it will affect the operating system disk cache, potentially pushing other data out of the cache.
Other databases may vary. Some could execute the whole SELECT before processing the INSERT (though I'd be surprised if any serious ones did). Some use O_DIRECT reads or raw disk I/O to avoid the OS cache effects, so the buffer cache effects might be different. I'd be amazed if any database relied on loading the whole SELECT into memory, though.
When you want to see what PostgreSQL is doing and how, the EXPLAIN and EXPLAIN (BUFFERS, ANALYZE) commands are quite useful. See the manual.
You may find writable common table expressions interesting for this purpose; it lets you do all this in one statement. In this simple case there's probably little benefit, but it can be a big win in more complex data migrations.
BTW, make sure to run that pair of queries wrapped in BEGIN and COMMIT.
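For reference, a writable-CTE version that does the move in one atomic statement might look like this (a sketch, assuming log_a and log_b have identical column layouts):
-- Needs PostgreSQL 9.1+; a single statement, so it is atomic on its own.
WITH moved AS (
    DELETE FROM log_a
    RETURNING *
)
INSERT INTO log_b
SELECT * FROM moved;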
Probably not.
Each record is individually processed; this particular query doesn't need to have knowledge of any of the other records to successfully execute. So the only record that needs to be in memory at any given moment is the one currently being processed.
But it really depends on whether or not the database thinks it can do it faster by loading up the whole table. Check the execution plan of the query.
If your setup allows it, just rename the old table and create a new empty one. Much faster, obviously, as no copying is done at all.
ALTER TABLE log_a RENAME TO log_b;
CREATE TABLE log_a (LIKE log_b INCLUDING ALL);
The LIKE clause copies the structure of the (now renamed) old table. INCLUDING ALL includes defaults, constraints, indexes, ...
Foreign key constraints or views depending on the table or other less common dependencies (but not queries in plpgsql functions) might be a hurdle for this route. You would have to recreate those to have them point to the new table. But a logging table like you describe probably carries no such dependencies.
This acquires an exclusive lock on the table. I assume typical write access will be INSERT only in your case? One way to deal with concurrent access would then be to create the new table in a different schema and alter the search_path for your application user. Then the application starts to write to the new table without concurrency issues. Of course, you wouldn't schema-qualify the table name in your INSERT statements for this to take effect.
CREATE SCHEMA log20121018;
CREATE TABLE log20121018.log_a (LIKE log20121011.log_a INCLUDING ALL);
ALTER ROLE myrole SET search_path = app, log20121018, public;
Or alter the search_path setting at whatever level is effective for you:
globally, per database, per role, per session, per function ...
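For example (the database and function names here are placeholders):
-- Per session:
SET search_path = app, log20121018, public;

-- Per database:
ALTER DATABASE mydb SET search_path = app, log20121018, public;

-- Per function:
ALTER FUNCTION log_insert() SET search_path = app, log20121018, public;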

Performance Overhead on SQL Server Temp Tables

I use temp tables frequently to simplify data loads (easier debugging, cleaner select statements, etc). If performance demands it, I'll create a physical table etc.
I noticed recently that I automatically declare my temp tables as global (##temp_load) as opposed to local (#temp_table). I don't know why, but that's been my habit for years. I never need the tables to be global, but I'm curious if there is additional overhead for creating them as global. And should I work on changing my habits?
Are there additional risks for making them global?
Non-Global temp tables are pretty much guaranteed never to collide.
Global temp tables are similar to materialized tables in that the name needs to be unique per server.
As a rule, only use ##GLOBAL_TEMP tables when you must.
Otherwise, if you are writing a proc that could be run more than once simultaneously, the procs will interact with each other in unpredictable ways, making it extremely difficult to troubleshoot - Instance 1 can change data being used by Instance 2, which causes Instance 3 to generate incorrect results as well.
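A quick T-SQL illustration of the difference (table names are arbitrary):
-- Local temp table: scoped to this session; two sessions can each
-- create their own #work without colliding.
CREATE TABLE #work (id INT, amount MONEY);

-- Global temp table: one name shared across the whole instance; a second
-- session running the same proc lands in the same ##work table.
CREATE TABLE ##work (id INT, amount MONEY);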
My personal opinion on Temp tables is that I only use them when:
I have a medium-to-large resultset (more than 1m rows)
I will need to index that resultset
I will not need to use that resultset more than once per iteration of the process
I am confident I will not need to resume the process at any point
I highlighted that last bullet because this is the main reason I try to minimize temp table use:
If you have a long-running process, and you use temp tables to store intermediate data sets, and something dies, say, 90% of the way through, you usually have to completely restart if that intermediate data is not in a materialized table.
Some of my processes run for days on billions of rows of data, so I am not interested in restarting from scratch ever.

Load data in a Global Temporary Table

I have two different Oracle sessions ("session A" and "session B") on the same Oracle user.
A Global Temporary Table is populated, in "session A", with about 320,000 records.
How can I quickly insert the same 320,000 records into the global temporary table in "session B"?
Thank you in advance for your kind suggestions!
EDIT: I have forgotten to specify that I am allowed to create ONLY GLOBAL TEMPORARY TABLES.
EDIT: I have forgotten to specify that I am not allowed to create database links.
The data within a temporary table is only ever visible to the current session, so I don't think there's a way to do what you want to do without another approach.
DBMS_PIPE is the 'classic' mechanism for pushing information from one session to another. Session A would have to push data into the pipe and session B would have to pull it.
But generally the idea of databases is that sessions are independent and any commonality is in the preserved data. Going against this suggests you are using the wrong tool.
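A bare-bones DBMS_PIPE sketch (the pipe name and payload are illustrative only; real code would loop over the rows, pack typed columns, and handle timeouts and errors):
-- Session A: pack a value and send it down a named pipe.
DECLARE
  l_status INTEGER;
BEGIN
  DBMS_PIPE.PACK_MESSAGE('some payload');
  l_status := DBMS_PIPE.SEND_MESSAGE('my_pipe');
END;
/

-- Session B: wait up to 10 seconds for a message and unpack it.
DECLARE
  l_status INTEGER;
  l_value  VARCHAR2(4000);
BEGIN
  l_status := DBMS_PIPE.RECEIVE_MESSAGE('my_pipe', 10);
  DBMS_PIPE.UNPACK_MESSAGE(l_value);
END;
/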
The data in a global temporary table is only visible to the session that inserted it. So you would have to run the same process that populated the table in session B.
Of course, the fact that you appear to want to access the same 320,000 rows in two different sessions would seem to imply that a global temporary table is not the appropriate data structure to be using. Perhaps you want to load that data into a permanent table (possibly along with some sort of identifier if you will have multiple SessionA/SessionB pairs). Or perhaps whatever logic Session B is running ought to be run by Session A.
And just taking a step back, since Oracle implements multi-version read consistency such that readers don't block writers and writers don't block readers, it would be very unusual to need to have a 320,000 row temporary table in the first place.