How to quickly duplicate rows in SQL


Edit: I'm running SQL Server 2008.
I have about 400,000 rows in my table. I would like to duplicate these rows until my table has 160 million rows or so. I have been using a statement like this:
INSERT INTO [DB].[dbo].[Sales]
([TotalCost]
,[SalesAmount]
,[ETLLoadID]
,[LoadDate]
,[UpdateDate])
SELECT [TotalCost]
,[SalesAmount]
,[ETLLoadID]
,[LoadDate]
,[UpdateDate]
FROM [DB].[dbo].[Sales]
This process is very slow, and I have to re-issue the query a large number of times. Is there a better way to do this?

To do this many inserts you will want to disable all indexes and constraints (including foreign keys) and then run a series of:
INSERT INTO mytable
SELECT fields FROM mytable
If you need to specify ID, pick some number like 80,000,000 and include in the SELECT list ID+80000000. Run as many times as necessary (no more than 10 since it should double each time).
Also, don't run within a transaction. The overhead of doing so over such a huge dataset will be enormous. You'll probably run out of resources (rollback segments or whatever your database uses) anyway.
Then re-enable all the constraints and indexes. This will take a long time but overall it will be quicker than adding to indexes and checking constraints on a per-row basis.
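A minimal sketch of that sequence in T-SQL, using the Sales table from the question. The index name IX_Sales_LoadDate is hypothetical, and the sketch assumes it is a nonclustered index: disabling a clustered index makes the table inaccessible, so that one should be left alone.

```sql
-- Hypothetical nonclustered index; repeat for each nonclustered index on the table.
ALTER INDEX [IX_Sales_LoadDate] ON [DB].[dbo].[Sales] DISABLE;

-- Stop checking foreign key and check constraints on this table.
ALTER TABLE [DB].[dbo].[Sales] NOCHECK CONSTRAINT ALL;

-- Run the doubling insert as many times as needed.
INSERT INTO [DB].[dbo].[Sales]
    ([TotalCost], [SalesAmount], [ETLLoadID], [LoadDate], [UpdateDate])
SELECT [TotalCost], [SalesAmount], [ETLLoadID], [LoadDate], [UpdateDate]
FROM [DB].[dbo].[Sales];

-- Rebuild the index and re-validate the constraints against the new rows.
ALTER INDEX [IX_Sales_LoadDate] ON [DB].[dbo].[Sales] REBUILD;
ALTER TABLE [DB].[dbo].[Sales] WITH CHECK CHECK CONSTRAINT ALL;
```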

Since each time you run that command it will double the size of your table, you would only need to run it about 9 times (400,000 * 2^9 = 204,800,000). Yes, it might take a while because copying that much data takes some time.

The speed of the insert will depend on a number of things...the physical disk speed, indexes, etc. I would recommend removing all indexes from the table and adding them back when you're done. If the table is heavily indexed then that should help quite a bit.
You should be able to repeatedly run that query in a loop until the desired number of rows is achieved. Every time you run it you'll double the data, so you'll end up with:
400,000
800,000
1,600,000
3,200,000
6,400,000
12,800,000
25,600,000
51,200,000
102,400,000
204,800,000
After nine executions.
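The loop could be sketched roughly as follows for SQL Server 2008. The variable names are made up for illustration, and the TOP clause is an addition of mine so that the last pass stops at the target instead of overshooting to 204,800,000.

```sql
DECLARE @target BIGINT = 160000000;  -- desired row count from the question
DECLARE @rows   BIGINT;

SELECT @rows = COUNT_BIG(*) FROM [DB].[dbo].[Sales];

WHILE @rows < @target
BEGIN
    -- Each pass copies the existing rows; TOP caps the final pass
    -- so the table ends up with exactly @target rows.
    INSERT INTO [DB].[dbo].[Sales]
        ([TotalCost], [SalesAmount], [ETLLoadID], [LoadDate], [UpdateDate])
    SELECT TOP (@target - @rows)
        [TotalCost], [SalesAmount], [ETLLoadID], [LoadDate], [UpdateDate]
    FROM [DB].[dbo].[Sales];

    -- Add the number of rows the INSERT just wrote.
    SET @rows = @rows + ROWCOUNT_BIG();
END;
```

Each INSERT commits on its own here, so no single transaction has to cover the whole 160 million rows.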

You don't state your SQL database, but most have a bulk loading tool to handle this scenario. Check the docs. If you have to do it with INSERTs, remove all indexes from the table first and reapply them after the data is INSERTed; this will generally be much faster than indexing during insertion.

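This may still take a while to run. SQL Server cannot turn logging off completely, but you might want to reduce it (for example, by switching to the SIMPLE or BULK_LOGGED recovery model) while you create your data. The cross join below multiplies the 400,000 existing rows by 400, which adds 160 million rows in a single statement.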
INSERT INTO [DB].[dbo].[Sales] (
[TotalCost] ,[SalesAmount] ,[ETLLoadID]
,[LoadDate] ,[UpdateDate]
)
SELECT s.[TotalCost] ,s.[SalesAmount] ,s.[ETLLoadID]
,s.[LoadDate] ,s.[UpdateDate]
FROM [DB].[dbo].[Sales] s (NOLOCK)
CROSS JOIN (SELECT TOP 400 totalcost FROM [DB].[dbo].[Sales] (NOLOCK)) o
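On the logging point, a hedged sketch of what reducing it might look like. It assumes the database normally runs in FULL recovery and that losing point-in-time recovery during the load is acceptable. Whether SQL Server actually logs the insert minimally depends on the table (for example, whether it is a heap) and on the TABLOCK hint, so treat this as something to test rather than a guarantee.

```sql
-- Assumption: point-in-time recovery for [DB] is not needed during the load.
ALTER DATABASE [DB] SET RECOVERY SIMPLE;

-- TABLOCK is one of the prerequisites for a minimally logged INSERT ... SELECT.
INSERT INTO [DB].[dbo].[Sales] WITH (TABLOCK)
    ([TotalCost], [SalesAmount], [ETLLoadID], [LoadDate], [UpdateDate])
SELECT [TotalCost], [SalesAmount], [ETLLoadID], [LoadDate], [UpdateDate]
FROM [DB].[dbo].[Sales];

-- Assumption: the database was in FULL recovery before; take a full backup afterwards.
ALTER DATABASE [DB] SET RECOVERY FULL;
```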

Related

How to improve the execution time of inserting to a table from selecting rows from a table of million rows in ORACLE?

I have an empty table T1 into which rows have to be inserted by selecting them from another table T2 in Oracle.
For example:
INSERT INTO T1
SELECT * FROM T2;
The issue is that table T2 has about 10 million rows. The SELECT statement on its own seems to execute in around 25-30 seconds, but the INSERT into T1 takes 20-30 minutes to complete.
Why does the above statement take so long to execute, and what is the best approach to inserting data into T1 by selecting from T2?

For one thing, the "apparent" execution time of a simple SELECT query is a bit misleading: the database engine figures out how to do the query then returns only the first "chunk" of information to you. (As you then move through the dataset, additional "chunks" are transparently supplied as needed.) But when you specify INSERT, now the database has no choice but to actually go through all those millions of rows.
There are often specialized tools that are specifically intended for "bulk" data operations such as this one. These might be significantly faster.
Another standard practice is to temporarily disable indexes. This avoids the overhead of updating the indexes for every record: the index will be completely rebuilt when you turn it back on. (The "bulk operations" tools aforementioned will usually do things like that automagically.)
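For Oracle, disabling an index for the load might look like the sketch below. The index name t1_some_col_ix is hypothetical, and this only works for non-unique indexes: DML against a table with an unusable unique index raises an error instead of skipping it.

```sql
-- Hypothetical non-unique index on T1.
ALTER INDEX t1_some_col_ix UNUSABLE;

-- Let DML ignore unusable indexes (already the default in current releases).
ALTER SESSION SET skip_unusable_indexes = TRUE;

INSERT INTO T1
SELECT * FROM T2;
COMMIT;

-- Rebuild once, after all rows are in place.
ALTER INDEX t1_some_col_ix REBUILD;
```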

Adding an APPEND hint may enable a direct path insert, which can avoid generating extra REDO data used for recovery:
INSERT /*+ append */ INTO T1
SELECT * FROM T2;
Adding parallelism can further improve performance:
ALTER SESSION ENABLE PARALLEL DML;
INSERT /*+ parallel append */ INTO T1
SELECT * FROM T2;
Those two features could shrink the run time from minutes to seconds but there are a lot of caveats you need to understand. Direct-path writes lock the table, and are not recoverable; if the data is important you may not want to wait for the next full backup. Parallel queries work harder, not smarter, and may steal resources from more important jobs. Finding the optimal degree of parallelism is tricky, and direct-path inserts have many limitations, like triggers and some kinds of referential integrity constraints.
With the right hardware, system configuration, and code, you can realistically improve performance by 100x. But if you're new to these features, prepare to spend hours learning about them.
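As a starting point for experimenting, here is a variant with an explicit degree of parallelism and the commit that a direct-path insert needs. The degree of 4 is an arbitrary assumption, not a recommendation.

```sql
ALTER SESSION ENABLE PARALLEL DML;

-- Degree 4 is a placeholder; the right value depends on the hardware and workload.
INSERT /*+ APPEND PARALLEL(T1, 4) */ INTO T1
SELECT /*+ PARALLEL(T2, 4) */ * FROM T2;

-- Until this commit, the same session cannot query T1 (ORA-12838).
COMMIT;
```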

limit to insert rows into a table (Oracle)

In Oracle PL/SQL, I join a few tables and insert the result into another table. The result could be thousands, lakhs, or even millions of rows. Can I insert them like this:
insert into tableA
select * from tableB;
Is there any chance of failure because of the number of rows?
Or is there a better way to insert the values when there is a large number of records?
Thanks in advance.

Well, everything is finite inside the machine, so if that select returns too many rows, it for sure won't work (although there must be maaany rows, the number is dependent on your storage and memory size, your OS, and maybe other things).
If you think your query can surpass the limit, then do the insertion in batches, and commit after each batch. Of course you need to be aware you must do something if at 50% of the inserts you decide you need to cancel the process (as a rollback will not be useful here).
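Batching with a commit per batch could be sketched in PL/SQL like this. It assumes tableA and tableB have the same column layout (as the original INSERT ... SELECT * implies), and the batch size of 10,000 is arbitrary.

```sql
DECLARE
  CURSOR c IS SELECT * FROM tableB;
  TYPE t_rows IS TABLE OF tableB%ROWTYPE;
  l_rows t_rows;
BEGIN
  OPEN c;
  LOOP
    -- Fetch the next batch; 10000 is an arbitrary batch size.
    FETCH c BULK COLLECT INTO l_rows LIMIT 10000;
    EXIT WHEN l_rows.COUNT = 0;

    -- Insert the whole batch in one round trip to the SQL engine.
    FORALL i IN 1 .. l_rows.COUNT
      INSERT INTO tableA VALUES l_rows(i);

    -- Committed batches stay in tableA even if the process is cancelled later.
    COMMIT;
  END LOOP;
  CLOSE c;
END;
/
```

Committing while the cursor is still open can lead to "snapshot too old" errors on long runs, which is one more reason to prefer a single statement when the undo space allows it.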

My recommended steps are different because performance typically increases when you load more data in one SQL statement using SQL or PL/SQL:
I would recommend checking the size of your rollback segment (RBS) and possibly bringing online a larger dedicated one for such a transaction.
For inserts, you can say something like 'rollback consumed' = 'amount of data inserted'. You know the typical row width from the database statistics (see user_tables after analyze table tableB compute statistics for table for all columns for all indexes).
Determine how many rows you can insert per iteration.
Insert this amount of data in big insert and commit.
Repeat.
Locking normally is not an issue with insert, since what does not yet exist can't be locked :-)
When running on partitioned tables, you might want to consider different scenarios that allow the (sub)partitions to share the work. When loading from text files with SQL*Loader, you might use a different approach too, such as direct path, which adds preformatted data blocks to the database without going through the SQL engine.
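To estimate the row width mentioned above, something like the following could be used. It gathers statistics with DBMS_STATS rather than the ANALYZE command named in the answer, and the approx_bytes column is only a rough figure.

```sql
-- Refresh the statistics for tableB in the current schema.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLEB');
END;
/

-- avg_row_len is the average row width in bytes.
SELECT num_rows,
       avg_row_len,
       num_rows * avg_row_len AS approx_bytes
FROM   user_tables
WHERE  table_name = 'TABLEB';
```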

To insert a limited number of rows you can use ROWNUM, which is a pseudocolumn.
For example, to fill a table with 10,000 rows from another table having 50,000 rows you can use:
insert into new_table_name select * from old_table_name where rownum <= 10000;
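Note that ROWNUM is assigned before any ORDER BY is applied, so if it matters which rows are copied, the ordering has to go into a subquery. A sketch, with some_col as a hypothetical sort column:

```sql
INSERT INTO new_table_name
SELECT *
FROM  (SELECT * FROM old_table_name ORDER BY some_col)  -- some_col is hypothetical
WHERE ROWNUM <= 10000;
```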

How can I efficiently manipulate 500k records in SQL Server 2005?

I am getting a large text file of updated information from a customer that contains updates for 500,000 users. However, as I am processing this file, I often am running into SQL Server timeout errors.
Here's the process I follow in my VB application that processes the data (in general):
Delete all records from the temporary table (to remove last month's data) (e.g. DELETE FROM tempTable)
Rip text file into the temp table
Fill in extra information into the temp table, such as their organization_id, their user_id, group_code, etc.
Update the data in the real tables based on the data computed in the temp table
The problem is that I often run commands like UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id) and these commands frequently time out. I have tried bumping the timeouts up to as far as 10 minutes, but they still fail. Now, I realize that 500k rows is no small number of rows to manipulate, but I would think that a database purported to be able to handle millions and millions of rows should be able to cope with 500k pretty easily. Am I doing something wrong with how I am going about processing this data?
Please help. Any and all suggestions welcome.

Subqueries like the one you give us in the question:
UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id)
are only good on one row at a time, so you must be looping. Think set based:
UPDATE t
SET user_id = u.user_id
FROM tempTable t
inner join myUsers u ON t.external_id=u.external_id
Remove your loops as well; this will update all rows in one statement and be significantly faster!

Needs more information. I regularly manipulate 3-4 million rows in a 150 million row table and I do NOT consider this a lot of data. I have a "products" table that contains about 8 million entries, including full text search. No problems there either.
Can you elaborate on your hardware? I assume "normal desktop PC" or "low end server", both with an absolutely non-optimal disc layout, and thus tons of IO problems on updates.

Make sure you have indexes on the tables that you are doing the selects from. In your example UPDATE command, you select the user_id from the myUsers table by external_id. Do you have an index on the external_id column of the myUsers table? The downside of indexes is that they increase the time for inserts and updates, so avoid unnecessary indexes on the tables you are trying to update. If those tables do have indexes you don't need during the import, consider dropping them and then rebuilding them afterwards.
Finally, run your queries in SQL Server Management Studio and have a look at the execution plan to see how the query is being executed. Look for things like table scans to see where you might be able to optimize.
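Since the UPDATE in the question looks rows up in myUsers by external_id, an index along these lines might help. The index name is made up, and the INCLUDE clause is an optional extra that lets the lookup be answered from the index alone.

```sql
-- Hypothetical index name; supports the lookup by external_id.
CREATE NONCLUSTERED INDEX IX_myUsers_external_id
    ON myUsers (external_id)
    INCLUDE (user_id);
```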

Look at KM's answer and don't forget about indexes and primary keys.

Are you indexing your temp table after importing the data?
temp_table.external_id should definitely have an index since it is in the where clause.
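A sketch of that index, created after the text file has been loaded so the load itself does not pay for index maintenance. The index name is hypothetical and the table is called tempTable as in the question.

```sql
-- Run after the import, before the UPDATE statements.
CREATE NONCLUSTERED INDEX IX_tempTable_external_id
    ON tempTable (external_id);
```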

There are more efficient ways of importing large blocks of data. Look in SQL Server Books Online under bcp (the bulk copy program).
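Besides the bcp command-line utility, T-SQL has a BULK INSERT statement that does the same job from inside the server. A sketch, assuming a tab-delimited file at a made-up path whose columns line up with tempTable:

```sql
-- Faster than DELETE for emptying the staging table.
TRUNCATE TABLE tempTable;

-- Hypothetical path and delimiters; adjust to the customer's file format.
BULK INSERT tempTable
FROM 'C:\import\updates.txt'
WITH (
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 50000,
    TABLOCK
);
```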

Select million+ records while huge insert is running

I am trying to extract application log records from a single table. The select statement is pretty straightforward.
select top 200000 *
from dbo.transactionlog
where rowid>7
and rowid <700000 and
Project='AmWINS'
The query takes more than 5 minutes. Is that considered long? While the select is running, a bulk insert is also running.
[EDIT]
Actually, I am having a serious problem on my current production logging database.
Basically, we only have one table (transactionlog), and all application logs are inserted into it. For a project like AmWINS, based on a select count, about 800K+ records are inserted per day. Records are inserted 24 hours a day in the production environment. Users want to extract data from the table when they need to check the transaction logs, so we need to select records out of the table when necessary.
I tried to simulate this in the UAT environment by pumping in the same volume as production, which has already grown to 10 million records as of today. While extracting records, I simulated a bulk insert at the same time to make it look like the production environment. It took about 5 minutes just to extract 200k records.
During the extraction, I monitored the physical SQL server and its CPU spiked up to 95%.
The table has 13 fields and an identity column (rowid) of type bigint. rowid is the PK.
Indexes are created on Date, Project, module and RefNumber.
The table is created with row locks and page locks enabled.
I am using SQL Server 2005.
I hope you can give me some professional advice. Thanks.

It may be possible for you to use the "Nolock" table hint, as described here:
Table Hints MSDN
Your SQL would become something like this:
select top 200000 * from dbo.transactionlog with (nolock) ...
This would achieve better performance if you aren't concerned about the complete accuracy of the data returned.

What are you doing with the 200,000 rows? Are you running this over a network? Depending on the width of your table, just getting that amount of data across the network could be the bulk of the time spent.

It depends on your hardware. Pulling 200000 rows out while there is data being inserted requires some serious IO, so unless you have a 30+disk system, it will be slow.
Also, is your rowID column indexed? This will help with the select, but could slow down the bulk insert.

I am not sure, but doesn't bulk insert in MS SQL lock the whole table?

As ck already said, indexing is important, so make sure you have an appropriate index ready. I would not only set an index on rowid but also on Project. I would also rewrite the where clause to:
WHERE Project = 'AmWINS' AND rowid BETWEEN 8 AND 699999
Reason: I guess Project is more restrictive than rowid and - correct me, if I'm wrong - BETWEEN is faster than a < and > comparison.

You could also export this as a local dat or sql file.

No amount of indexing will help here because it's a SELECT * query, so it's most likely a PK scan or a horrendous bookmark lookup.
And the TOP is meaningless because there is no ORDER BY.
The simultaneous insert is probably a red herring as far as I can tell, unless the table only has 2 columns and the bulk insert is locking the whole table. With a simple int IDENTITY column the insert and select may not interfere with each other either,
especially if the bulk insert is only a few thousand rows (or even tens of thousands).
Edit. The TOP and rowid values do not imply a million plus
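A sketch of the query with those two points addressed: an explicit column list instead of SELECT *, and an ORDER BY so that TOP returns a defined set of rows. The column names are taken from the indexes listed in the question; the rest of the 13 columns are unknown, so the list is an assumption about what is actually needed.

```sql
SELECT TOP (200000)
       rowid, Project, [Date], Module, RefNumber  -- only the columns that are needed
FROM   dbo.transactionlog
WHERE  Project = 'AmWINS'
  AND  rowid > 7
  AND  rowid < 700000
ORDER BY rowid;  -- makes TOP deterministic
```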

Oracle SQL technique to avoid filling trans log

Newish to Oracle programming (from Sybase and MS SQL Server). What is the "Oracle way" to avoid filling the trans log with large updates?
In my specific case, I'm doing an update of potentially a very large number of rows. Here's my approach:
UPDATE my_table
SET a_col = null
WHERE my_table_id IN
(SELECT my_table_id FROM my_table WHERE some_col < some_val and rownum < 1000)
...where I execute this inside a loop until the updated row count is zero.
Is this the best approach?
Thanks,

The amount of redo and undo generated will not be reduced at all if you break up the UPDATE into multiple runs of, say, 1000 records. On top of that, the total query time will most likely be higher compared to running a single large SQL statement.
There's no real way to address the UNDO/REDO log issue in UPDATEs. With INSERTs and CREATE TABLEs you can use a DIRECT aka APPEND option, but I guess this doesn't easily work for you.

Depends on the percent of rows almost as much as the number. And it also depends on if the update makes the row longer than before. i.e. going from null to 200bytes in every row. This could have an effect on your performance - chained rows.
Either way, you might want to try this.
Build a new table with the column corrected as part of the select instead of an update. You can build that new table via CTAS (Create Table as Select) which can avoid logging.
Drop the original table.
Rename the new table.
Reindex, repoint constraints, rebuild triggers, recompile packages, etc.
You can avoid a lot of logging this way.
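The first three steps might look like the sketch below. It assumes my_table has only the three columns named in the question (a real table would list all of its columns), and the literal 100 stands in for some_val, since bind variables are not allowed in DDL.

```sql
-- Build the corrected copy; NOLOGGING keeps redo to a minimum for the CTAS.
CREATE TABLE my_table_new NOLOGGING AS
SELECT my_table_id,
       CASE WHEN some_col < 100 THEN NULL ELSE a_col END AS a_col,  -- 100 = some_val
       some_col
FROM   my_table;

DROP TABLE my_table;

RENAME my_table_new TO my_table;
```

Indexes, constraints, triggers, and grants on the original table are not carried over and have to be recreated on the new one.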

Any UPDATE is going to generate redo. Realistically, a single UPDATE that updates all the rows is going to generate the smallest total amount of redo and run for the shortest period of time.
Assuming you are updating the vast majority of the rows in the table, if there are any indexes that use A_COL, you may be better off disabling those indexes before the update and then doing a rebuild of those indexes with NOLOGGING specified after the massive UPDATE statement. In addition, if there are any triggers or foreign keys that would need to be fired/ validated as a result of the update, getting rid of those temporarily might be helpful.
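A sketch of that sequence, with a hypothetical non-unique index my_table_a_col_ix on A_COL and the literal 100 standing in for some_val. It relies on skip_unusable_indexes being TRUE, which is the default in current releases.

```sql
-- Keep triggers from firing for every updated row.
ALTER TABLE my_table DISABLE ALL TRIGGERS;

-- Hypothetical index on a_col; marked unusable so the UPDATE does not maintain it.
ALTER INDEX my_table_a_col_ix UNUSABLE;

UPDATE my_table
SET    a_col = NULL
WHERE  some_col < 100;   -- 100 stands in for some_val
COMMIT;

-- Rebuild once, with minimal redo, then restore the triggers.
ALTER INDEX my_table_a_col_ix REBUILD NOLOGGING;
ALTER TABLE my_table ENABLE ALL TRIGGERS;
```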
