When this query is performed, do all the records get loaded into physical memory? - sql

I have a table with millions of records; that table alone is somewhere around 6-7 GB. It is my application's log table, and it is growing really fast, which makes sense. Now I want to move records from the log table into a backup table. Here is the scenario, and here is my question.
Table: Log_A
INSERT INTO Log_B SELECT * FROM Log_A;
DELETE FROM Log_A;
I am using a PostgreSQL database. The questions are:
When this query is performed, do all the records from Log_A get loaded into physical memory? Note: both of the above queries run inside a stored procedure.
If not, then how does it work?
I suppose this question applies to all databases.
I hope somebody can give me some idea about this.

In PostgreSQL, that's likely to execute a sequential scan, loading some records into shared_buffers, inserting them, writing the dirty buffers out, and carrying on.
All the records will pass through main memory, but they don't all have to be in memory at once. Because they all get read from disk using normal buffered reads (pread) it will affect the operating system disk cache, potentially pushing other data out of the cache.
Other databases may vary. Some could execute the whole SELECT before processing the INSERT (though I'd be surprised if any serious ones did). Some use O_DIRECT reads or raw disk I/O to avoid the OS cache effects, so the buffer-cache impact might be different. I'd be amazed if any database relied on loading the whole SELECT into memory, though.
When you want to see what PostgreSQL is doing and how, the EXPLAIN and EXPLAIN (BUFFERS, ANALYZE) commands are quite useful. See the manual.
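For example (note that EXPLAIN ANALYZE actually executes the statement, so wrap it in a transaction you can roll back if you only want the plan):
BEGIN;
EXPLAIN (ANALYZE, BUFFERS) INSERT INTO Log_B SELECT * FROM Log_A;
ROLLBACK;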
You may find writable common table expressions interesting for this purpose; they let you do all of this in one statement. In this simple case there's probably little benefit, but it can be a big win in more complex data migrations.
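A minimal sketch of the writable-CTE form, assuming Log_A and Log_B have identical column lists:
WITH moved AS (
    DELETE FROM Log_A
    RETURNING *
)
INSERT INTO Log_B
SELECT * FROM moved;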
BTW, make sure to run that pair of queries wrapped in BEGIN and COMMIT.
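In other words, something along these lines:
BEGIN;
INSERT INTO Log_B SELECT * FROM Log_A;
DELETE FROM Log_A;
COMMIT;
One thing to watch: under the default READ COMMITTED level the DELETE can remove rows that other sessions committed after the INSERT ran, so those rows would never reach Log_B; the writable-CTE form above sidesteps that window.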

Probably not.
Each record is individually processed; this particular query doesn't need to have knowledge of any of the other records to successfully execute. So the only record that needs to be in memory at any given moment is the one currently being processed.
But it really depends on whether or not the database thinks it can do it faster by loading up the whole table. Check the execution plan of the query.

If your setup allows it, just rename the old table and create a new empty one. Much faster, obviously, as no copying is done at all.
ALTER TABLE log_a RENAME TO log_b;
CREATE TABLE log_a (LIKE log_b INCLUDING ALL);
The LIKE clause copies the structure of the (now renamed) old table. INCLUDING ALL includes defaults, constraints, indexes, ...
Foreign key constraints or views depending on the table or other less common dependencies (but not queries in plpgsql functions) might be a hurdle for this route. You would have to recreate those to have them point to the new table. But a logging table like you describe probably carries no such dependencies.
This acquires an exclusive lock on the table. I assume typical write access will be INSERT-only in your case? One way to deal with concurrent access would then be to create the new table in a different schema and alter the search_path for your application user. Then the application starts writing to the new table without concurrency issues. Of course, you must not schema-qualify the table name in your INSERT statements for this to take effect.
CREATE SCHEMA log20121018;
CREATE TABLE log20121018.log_a (LIKE log20121011.log_a INCLUDING ALL);
ALTER ROLE myrole SET search_path = app, log20121018, public;
Or alter the search_path setting at whatever level is effective for you:
globally, per database, per role, per session, per function ...
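For example (mydb is an assumed database name; the other names reuse the ones above):
ALTER DATABASE mydb SET search_path = app, log20121018, public;   -- per database
ALTER ROLE myrole SET search_path = app, log20121018, public;     -- per role, as above
SET search_path = app, log20121018, public;                       -- current session only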

Related

How to perform data archive in SQL Server with many tables?

Let's say I have a database with many tables in it. I want to perform data archiving on certain tables, that is, create a new table with the same structure (same constraints, indexes, columns, triggers, etc.) and insert specific data into the new table from the old table.
For example, the current table has data from 2008-2017 and I want to move only the data from 2010-2017 into the new table. After that, I can delete the old table and rename the new table following a naming convention similar to the old table's.
How should I approach this?
For the sort of clone-rename-drop logic you're talking about, the basics are pretty straightforward. Really the only time this is a good idea is if you have a table with a large amount of data, which you can't afford downtime or blocking on, and you only plan to do this once. The process looks something like this (a T-SQL sketch of the rename swap follows the steps):
Insert all the data from your original table into the clone table
In a single transaction, sp_rename the original table from (for example) myTable to myTable_OLD (just something to distinguish it from the real table). Then sp_rename the clone table from (for example) myTable_CLONE to myTable
Drop myTable_OLD when you're happy everything has worked how you want. If it didn't work how you want, just sp_rename the objects back.
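A sketch of the swap, assuming dbo.myTable_CLONE has already been populated:
BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.myTable', 'myTable_OLD';
    EXEC sp_rename 'dbo.myTable_CLONE', 'myTable';
COMMIT TRANSACTION;
-- Only once you're happy everything worked:
-- DROP TABLE dbo.myTable_OLD;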
A couple of considerations to think about if you go that route:
Identity columns: If your table has any identities on it, you'll have to use IDENTITY_INSERT when you copy the data, then reseed the identity to pick up where the old identity left off (see the sketch after this list)
Do you have the luxury of blocking the table while you do this? Generally if you need to do this sort of thing, the answer is no. What I find works well is to insert all the rows I need using (nolock), or however you need to do it so the impact of the select from the original table is mitigated. Then, after I've moved 99% of the data, I will then open a transaction, block the original table, insert just the new data that's come in since the bulk of the data movement, then do the sp_rename stuff
That way you don't lock anything for the bulk of the data movement, and you only block the table for the very last bit of data that came into the original table between your original insert and your sp_rename
How you determine what's come in "since you started" will depend on how your table is structured. If you have an identity or a datestamp column, you can probably just pick rows which came in after the max of those fields you moved over. If your table does NOT have something you can easily hook into, you might need to get creative.
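A hypothetical sketch for the identity consideration (the table and column names here are assumptions):
SET IDENTITY_INSERT dbo.myTable_CLONE ON;

INSERT INTO dbo.myTable_CLONE (Id, LogDate, Message)
SELECT Id, LogDate, Message
FROM dbo.myTable;

SET IDENTITY_INSERT dbo.myTable_CLONE OFF;

-- After the rename swap, reseed so new rows continue from the old maximum:
DBCC CHECKIDENT ('dbo.myTable', RESEED);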
Alternatives
A couple other alternatives that came to mind:
Table Partitioning:
This shards a single table across multiple partitions (which can be managed sort of like individual tables). You can, say, partition your data by year, and then when you want to purge the trailing year of data, you "switch out" that partition to a special table which you can then truncate (a sketch follows). All those operations are metadata-only, so they're super fast. This also works really well for huge amounts of data where deletes and all their pesky transaction logging aren't feasible.
The downside to table partitioning is it's kind of a pain to set up and manage.
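A hypothetical switch-out, assuming the table is partitioned by year and dbo.myTable_Switch is an empty staging table matching the structure and filegroup of partition 1:
ALTER TABLE dbo.myTable
    SWITCH PARTITION 1 TO dbo.myTable_Switch;

TRUNCATE TABLE dbo.myTable_Switch;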
Batched Deletes:
If your data isn't too big, you could just do batched deletes on the trailing end of your data. If you can find a way to get clustered index seeks for your deletes, they should be reasonably lightweight. As long as you're not accumulating data faster than you can get rid of it, the benefit of this kind of thing is that you just run it semi-continuously and it nibbles away at the trailing end of your data.
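A hypothetical batched delete (the table name, date column, and retention window are assumptions):
WHILE 1 = 1
BEGIN
    DELETE TOP (5000)
    FROM dbo.myTable
    WHERE LogDate < DATEADD(YEAR, -7, GETDATE());

    IF @@ROWCOUNT = 0 BREAK;
END;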
Snapshot Isolation:
If deletes cause too much blocking, you can also set up something like snapshot isolation, which basically stores historical versions of rows in tempdb. Any query running under read committed snapshot will then read those pre-change rows instead of contending for locks on the "real" table. You can then do batched deletes to your heart's content and know that any queries that hit the table will never get blocked by a delete (or any other DML operation), because they'll either read the pre-delete snapshot or the post-delete snapshot; they won't wait for an in-process delete to figure out whether it's going to commit or roll back. This is not without its drawbacks, unfortunately: for large data sets it can put a big burden on tempdb, and it too can be a little bit of a black box. It's also going to require buy-in from your DBAs.
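Turning it on is a database-level setting (the database name here is an assumption):
ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON;    -- readers see row versions by default
-- or let sessions opt in explicitly with SET TRANSACTION ISOLATION LEVEL SNAPSHOT:
ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;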

Oracle BEST PRACTICE to update 50 million child rows in a table using value from parent table

I have a child table with 100 million rows and need to update 50 million rows of a column using the value from the parent table. I have read that, assuming we have enough space, "create table as select" would be the fastest, but I want to know if anyone disagrees, or if other factors need to be considered to make a better guess. Would it be better to go that route versus using PL/SQL's BULK COLLECT / FORALL UPDATE feature?
If you have a lot of data then CREATE TABLE AS SELECT is definitely faster because it does not require UNDO table space. However, to recreate all the indices on the new table can be quite a hassle due to name conflicts.
The good news is: 50 million rows is not really a lot of data. If you have a modern midrange server it should not cause problems, so it is not worth the extra work. The best way to find out is to make a copy of the original table (including all indices) and try the update there. Then you get a rough idea how long it will take.
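A hedged sketch of the CTAS route (table and column names are assumptions; add a CASE expression around the parent value if only some child rows should change):
CREATE TABLE child_new NOLOGGING PARALLEL AS
SELECT c.child_id,
       c.parent_id,
       p.some_value AS col_to_update,   -- take the new value from the parent
       c.other_col
FROM   child  c
JOIN   parent p ON p.parent_id = c.parent_id;

-- Then recreate indexes, constraints, and grants on child_new, drop child, and rename child_new to child.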
Parallel UPDATE is probably the best option for a large change to a child table. (If you have Enterprise Edition, sufficient resources, a sane configuration, etc.)
alter session enable parallel dml;
update /*+ parallel */ ...;
(You might want to play with different parallel numbers, like parallel(8). The default degree of parallelism is usually good enough. But some platforms like SPARC inflate their "CPU_COUNT", leading to ridiculous degrees of parallelism.)
Parallel UPDATE is likely not the optimal solution. Recreating the objects can be faster because it can almost completely avoid generating REDO and UNDO. But re-creating objects is usually buggy and getting that optimal performance is tricky.
Here are things to consider before you decide to simply drop and recreate a table:
Grants. Save and re-apply the object grants after the objects are recreated.
Dependent objects. The process needs to re-create all objects, and dependent objects, in the exact same way. This can be painfully difficult depending on how complex your schema is. DBMS_METADATA can be tricky, and in some cases still won't make the objects exactly the same way. If you decide to hard-code the DDL instead you have to remember to update the process whenever the objects change.
Invalid objects. Most objects will automatically recompile when necessary. But you probably don't want to wait for that because it always looks bad to have invalid objects. And even if they do compile correctly, some programs may still get those pesky ORA-04068: existing state of packages has been discarded errors. (Because most PL/SQL programmers are unaware of session state and make every package variable public by default.)
Statistics. Simply re-gathering them after the table is re-created is not always sufficient. Histograms depend on whether columns were used in a predicate. If the table is re-created all the columns are new and no histograms will be initially created.
Direct-path writes are elusive. A parent-child table implies a foreign key, which normally prevents direct-path writes. The process needs to disable or drop the foreign key, set the table and index to NOLOGGING, and then remember to set them back to LOGGING at the end. And when you re-create the foreign key, if you want to validate it in parallel you have to initially create it as ENABLE NOVALIDATE, set the table to parallel, validate the constraint, and then set the table back to NOPARALLEL (a sketch follows this list).
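A hypothetical sketch of that last sequence (the constraint, table, and column names are assumptions):
ALTER TABLE child ADD CONSTRAINT child_parent_fk
    FOREIGN KEY (parent_id) REFERENCES parent (parent_id)
    ENABLE NOVALIDATE;

ALTER TABLE child PARALLEL 8;
ALTER TABLE child MODIFY CONSTRAINT child_parent_fk VALIDATE;
ALTER TABLE child NOPARALLEL;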
In a large data warehouse it's worth going through all those steps and building code for dealing with all the issues. If this is your only large table UPDATE I suggest you avoid that work and accept a slightly non-optimal solution.

Making structural changes to very large tables in an online environment

So here's what I'm facing.
The Problem
A large table, with ~230,000,000 rows.
We want to change the clustering index and primary key of this table to a simple bigint identity field. There is one other empty field being added to the table, for future use.
The existing table has a composite key. For the sake of argument, let's say it's two bigints. The first one may have 1 or 10,000 'children' in the second part of the key.
Requirements
Minimal downtime, preferably just the length of time it takes to run sp_rename.
Existing rows may change while we're copying data. The updates must be reflected in the new table.
Ideas
Put a trigger on the existing table to update the row in the new table if it already exists there.
Iterate through the original table, copying data into the new table ~10,000 rows at a time. Maybe 2,000 of the first part of the old key.
When the copy is complete, rename the old table to "ExistingTableOld" and the new one from "NewTable" to "ExistingTable". This should allow stored procs to continue to run without intervention.
Are there any glaring omissions in the plan, or best practices I'm ignoring?
Difficult problem. Your plan sounds good, but I'm not totally sure you really need to batch the query as long as you run it in a transaction isolation level of READ UNCOMMITTED to stop locks being generated.
My experience with making big schema changes is that they are best done during a maintenance window, at night or over a weekend, when users are booted off the system, just like running dbcc checkdb with the repair option. Then, when things go south, you have the option to roll back to the full backup that you providentially made right before starting the upgrade.
Item #3 on your list: Renaming the old/new tables. You'll probably want to recompile the stored procedures/views. My experience is that execution plans are bound against the object ids rather than object names.
Consider table dbo.foo: if it is renamed to dbo.foo_old, any stored procedures or user-defined functions won't necessarily error out until the dependent object is recompiled and its execution plan rebound. Cached execution plans continue to work perfectly fine.
sp_recompile is your friend.
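For example, after the swap (the table name reuses the example from the question):
EXEC sp_recompile N'dbo.ExistingTable';
-- Marks every stored procedure, trigger, and function that references the table for recompilation on next run.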

Oracle - To Global Temp or NOT to Global Temp

So let's say I have a few million records to pull from in order to generate some reports, and instead of running my reports off the live table, I create a temp table where I can then create my indexes and use it for further data extraction.
I know cached tables tend to be quicker, seeing as the data is stored in memory, but I'm curious to know whether there are instances where using a physical temp table is better than Global Temporary Tables, and why. In what kind of scenario would one be better than the other when dealing with larger volumes of data?
Global Temporary Tables in Oracle are not like temporary tables in SQL Server. They are not cached in memory, they are written to the temporary tablespace.
If you are handling a large amount of data and retaining it for a reasonable amount of time - which seems likely as you want to build additional indexes - I think you should use a regular table. This is even more the case if your scenario has a single session, perhaps a background job, working with the data.
I use Subquery Factoring before I consider temp tables. If there's a need for reuse in various functions or procedures, I turn it into a view (which can turn into a materialized view depending on the data returned).
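A minimal subquery-factoring (WITH clause) sketch; the table and column names are assumptions:
WITH order_totals AS (
    SELECT customer_id, SUM(amount) AS total_amount
    FROM   orders
    GROUP  BY customer_id
)
SELECT c.customer_name, t.total_amount
FROM   customers c
JOIN   order_totals t ON t.customer_id = c.customer_id;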
According to asktom:
...temp table and global temp table are synonymous in Oracle.
For reporting, temporary tables are helpful in that data can only be seen by the session that created it, meaning that you shouldn't have to worry about any concurrency issues.
With a non-temporary table you need to add a session handle/identifier to the table in order to distinguish between sessions.
The primary difference between ordinary (heap) tables and global temp tables in Oracle is their visibility and volatility:
Once rows are committed to an ordinary table they are visible to other sessions and are retained until deleted.
Rows in a global temp table are never visible to other sessions, and are not retained after the session ends.
So the choice should primarily be down to what your application design needs, rather than just about performance (not to say performance isn't important).
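For reference, a global temporary table looks something like this (the name and columns are assumptions):
CREATE GLOBAL TEMPORARY TABLE report_staging (
    report_id   NUMBER,
    created_at  DATE
) ON COMMIT PRESERVE ROWS;   -- rows survive commits but vanish when the session ends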
The contents of an Oracle temporary table are only visible within the session that created the data and will disappear when the session ends. So you will have to copy the data for every report.
Is this report you are doing a one time operation or will the report be run periodically? Copying large quantities of data just to run a report does not seem a good solution to me. Why not run the report on the original data?
If you can't use the original tables, you may be able to create a materialized view so the latest data is available when you need it.
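A hypothetical materialized view refreshed once a day (the name and query are assumptions):
CREATE MATERIALIZED VIEW report_base_mv
    BUILD IMMEDIATE
    REFRESH COMPLETE
    START WITH SYSDATE NEXT SYSDATE + 1
AS
SELECT region, TRUNC(order_date) AS order_day, COUNT(*) AS order_count
FROM   orders
GROUP  BY region, TRUNC(order_date);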

Persistent temp tables in SQL?

Is it possible to have a 'persistent' temp table in MS-SQL? What I mean is that I currently have a background task which generates a global temp table, which is used by a variety of other tasks (which is why I made it global). Unfortunately, if the table becomes unused, it gets deleted by SQL automatically. This is gracefully handled by my system, since it just queues it up to be rebuilt again, but ideally I would like it to be built just once a day. So, ideally, I could just set some timeout parameter, like "if nothing touches this for 1 hour, then delete it".
I really don't want it in my existing DB because it will cause loads more headaches related to managing the DB (fragmentation, log growth, etc), since it's effectively rollup data, only useful for a 24 hour period, and takes up more than one gigabyte of HD space.
Worst case my plan is to create another DB on the same drive as tempdb, call it something like PseudoTempDB, and just handle the dropping myself.
Any insights would be greatly appreciated!
If you create a table as tempdb.dbo.TempTable, it won't get dropped until:
a - SQL Server is restarted
b - You explicitly drop it
If you would like to have it always available, you could create that table in model, so that it gets copied to tempdb during the restart (but it will also be created in any new database you create afterwards, so you would have to delete it manually), or use a startup stored procedure to have it created. There would be no way of persisting the data through restarts, though.
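A hypothetical startup-procedure sketch (the procedure, table, and column names are assumptions); startup procedures must live in master and be enabled with sp_procoption:
USE master;
GO
CREATE PROCEDURE dbo.usp_RebuildRollupTable
AS
BEGIN
    CREATE TABLE tempdb.dbo.RollupData (
        RollupDate  DATETIME NOT NULL,
        RowCnt      INT      NOT NULL
    );
END;
GO
EXEC sp_procoption @ProcName   = N'dbo.usp_RebuildRollupTable',
                   @OptionName = 'startup',
                   @OptionValue = 'on';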
I would go with your plan B, "create another DB on the same drive as tempdb, call it something like PseudoTempDB, and just handle the dropping myself."
How about creating a permanent table? Say, MyTable. Once every 24 hours, refresh the data like this:
Create a new table MyTableNew and populate it
Within a transaction, drop MyTable and use sp_rename to rename MyTableNew to MyTable, as sketched below
This way, you're recreating the table every day.
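A sketch of that second step (the table names follow the example above):
BEGIN TRANSACTION;
    DROP TABLE dbo.MyTable;
    EXEC sp_rename 'dbo.MyTableNew', 'MyTable';
COMMIT TRANSACTION;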
If you're worried about log files, store the table in a different database and set it to Recovery Model: Simple.
I have to admit to doing a double-take on this question: "persistent" and "temp" don't usually go together! How about a little out-of-the-box thinking? Perhaps your background task could periodically run a trivial query to keep SQL from marking the table as unused. That way, you'd take pretty direct control over creation and tear down.
After 20 years of experience dealing with all major RDBMS in existence, I can only suggest a couple of things for your consideration:
Note the oxymoronic concepts: "persistent" and "temp" are complete opposites. Choose one, and one only.
You're not doing your database any favors by writing data to the temp DB on a manual, semi-permanent, user-driven basis. Normal (user) tablespaces are already there for that purpose. The temp DB is for temporary things.
If you already know that such a table will be permanently used ("daily basis" IS permanent), then create it as a normal table in a user database/schema.
Every time you delete and recreate the very same table, you're fragmenting your whole database, with the perverse bonus of never giving the DB engine's optimizer a chance to assist you with any sort of crude optimization. Instead, try truncating it. Your rollback segments will thank you for that small relief, and the disk space will probably still be allocated for when you repopulate the table the next day. You can force that desired behavior by specifying a separate tablespace and datafile for that table alone.
Finally, and utterly more important: stop mortifying yourself and your DB engine over a measly 1 GB of data. You're wasting CPU and I/O cycles, and adding latency, fragmentation, and so on, for the sake of saving literally 0.02 cents' worth of hardware real estate. Talk about dropping to the floor in a tuxedo to pick up a brown cent. 😂
Finally, and utterly more important: Stop mortifying you and your DB engine for a measly 1 GB of data. You're wasting CPU, I/O cycles, adding latency, fragmentation, and so on for the sake of saving literally 0.02 cents of hardware real state. Talk about dropping to the floor in a tuxedo to pick up a brown cent. 😂