How to merge 500 million table with another 500 million table - sql

I have to merge two 500M+ row tables.
What is the best method to merge them?
I just need to display the records from these two SQL-Server tables if somebody searches on my webpage.
These are fixed tables, no one will ever change data in these tables once they are live.
create a view myview as select * from table1 union select * from table2
Is there any harm using the above method?
If I start merging 500M rows it will run for days and if machine reboots it will make the database go into recovery mode, and then I have to start from the beginning again.
Why Am I merging these table?
I have a website which provides a search on the person table.
This table have columns like Name, Address, Age etc
We got 500 million similar .txt files which we loaded into some other
table.
Now we want the website search page to query both tables to see if
a person exists in the table.
We get similar .txt files of 100 million or 20 million, which we load
to this huge table.
How we are currently doing it?
We import the .txt files into separate tables ( some columns are different
in .txt)
Then we arrange the columns and do the data type conversions
Then insert this staging table into the liveCopy huge table ( in
test environment)
We have SQL server 2008 R2
Can we use table partitioning for performance benefits?
Is it ok to create monthly small tables and create a view on top of
them?
How can indexing be done in this case?
We only load new data once in a month and do the select
Does replication help?
Biggest issue I am facing is managing huge tables.
I hope I explained the situation .
Thanks & Regards

1) Usually developers, to achieve more performance, are splitting large tables into smaller ones and call this as partitioning (horizontal to be more precise, because there is also vertical one). Your view is a sample of such partitions joined. Of course, it is mostly used to split a large amount of data into range of values (for example, table1 contains records with column [col1] < 0, while table2 with [col1] >= 0). But even for unsorted data it is ok too, because you get more room for speed improvements. For example - parallel reads if put tables to different storages. So this is a good choice.
2) Another way is to use MERGE statement supported in SQL Server 2008 and higher - http://msdn.microsoft.com/en-us/library/bb510625(v=sql.100).aspx.
3) Of course you can copy using INSERT+DELETE, but in this case or in case of MERGE command used do this in a small batches. Smth like:
SET ROWCOUNT 10000
DECLARE #Count [int] = 1
WHILE #Count > 0 BEGIN
... INSERT+DELETE/MERGE transcation...
SET #Count = ##ROWCOUNT
END

If your purpose is truly just to move the data from the two tables into one table, you will want to do it in batches - 100K records at a time, or something like that. I'd guess you crashed before because your T-Log got full, although that's just speculation. Make sure to throw in a checkpoint after each batch if you are in Full recovery mode.
That said, I agree with all the comments that you should provide why you are doing this - it may not be necessary at all.

You may want to have a look at an Indexed View.
In this way, you can set up indexes on your view and get the best performance out of it. The expensive part of using Indexed Views is in the CRUD operations - but for read performance it would be your best solution.
http://www.brentozar.com/archive/2013/11/what-you-can-and-cant-do-with-indexed-views/
https://www.simple-talk.com/sql/learn-sql-server/sql-server-indexed-views-the-basics/

If the two tables are linked one to one, then you are wasting the cpu time a lot for each read. Especially that you mentioned that the tables don't change at all. You should have only one table in this case.
Try creating a new table including (at least) the two columns from the two tables.
You can do this by:
Select into newTable
from A left join B on A.x=B.y
or (if some people don't have the information of the text file)
Select into newTable
from A inner join B on A.x=B.y
And note that you have to have made index on the join fields at least (to speed up the process).
More details about the fields may help giving more precise answer as well.

Related

Should I generate massive amounts of SQL data on the client or in SQL Server?

I am writing a program to generate a massive (~1 billion records spread across ~20 tables) amount of data and populate tables in SQL Server. This is data that spans across multiple tables with potentially multiple foreign key constraints, as well as multiple 'enum' like tables, whose distribution of values need to be seemingly random as well and are referenced often from other tables. This leads to a lot of ORDER BY NEWID() type code, which seems slow to me.
My question is: which strategy would be more performant:
Generate and insert data in SQL Server, using set based operations and a bunch of ORDER BY NEWID() to get randomness
Generate all the data on a client (should make operations like choosing a random value from an enum table much faster), then import the data into SQL Server
I can see some positives and negatives from both strategies. Obviously the generation of random data would be easier and probably more performant in the client. However getting that data to the server would be slow. Otherwise, importing the data and inserting it in a set based operation should be similar in scale.
Has anyone done something similar?
ORDER BY NEWID(), as other members said can be extremely expensive operation.
There are other, faster ways of getting random data in SQL Server:
SELECT * FROM StackOverflow.dbo.Users TABLESAMPLE (.01 PERCENT);
or
DECLARE #row bigint=(
SELECT RAND(CHECKSUM(NEWID()))*SUM([rows]) FROM sys.partitions
WHERE index_id IN (0, 1) AND [object_id]=OBJECT_ID(‘dbo.thetable’));
SELECT *
FROM dbo.thetable
ORDER BY (SELECT NULL)
OFFSET #row ROWS FETCH NEXT 1 ROWS ONLY;
Credits to Brent Ozar and his recent blog post: https://www.brentozar.com/archive/2018/03/get-random-row-large-table/
I would choose generation of massive data amounts on RDBMS side..
You don't need to create billions of newid
Create one table with a million random and reference it a number of times. If you random repeats every million rows I suspect all would be OK.
Do a random stating point and increment. Use % on increment to cycle.
If you need values 0 - n again use a %.

Redshift performance difference between CTAS and select count

I have query A, which mostly left joins several different tables.
When I do:
select count(1) from (
A
);
the query returns the count in approximately 40 seconds. The count is not big, at around 2.8M rows.
However, when I do:
create table tbl as A;
where A is the same query, it takes approximately 2 hours to complete. Query A returns 14 columns (not many) and all the tables used on the query are:
Vacuumed;
Analyzed;
Distributed across all nodes (DISTSTYLE ALL);
Encoded/Compressed (except on their sortkeys).
Any ideas on what should I look at?
When using CREATE TABLE AS (CTAS), a new table is created. This involves copying all 2.8 million rows of data. You didn't state the size of your table, but this could conceivable involve a lot of data movement.
CTAS does not copy the DISTKEY or SORTKEY. The CREATE TABLE AS documentation says that the default DISTKEY is EVEN. Therefore, the CTAS operation would also have involved redistributing the data amongst nodes. Since the source table was DISTKEY ALL, at least the data was available on each node for distribution, so this shouldn't have been too bad.
If your original table DDL included compression, then these settings would probably have been copied across. If the DDL did not specify compression, then the copy to the new table might have triggered the automatic compression analysis, which involves loading 100,000 rows, choosing a compression type for each column, dropping that data and then starting the load again. This could consume some time.
Finally, it comes down to the complexity of Query A. It is possible that Redshift was able to optimize the query by reading very little data from disk because it realized that very few columns of data (or perhaps no columns) were required to read from disk to display the count. This really depends upon the contents of that Query.
It could simply be that you've got a very complex query that takes a long time to process (that wasn't processed as part of the Count). If the query involves many JOIN and WHERE statements, it could be optimized by wise use of DISTKEY and SORTKEY values.
CREATE TABLE writes all data that is returned by the query to disk, count query does not, that explains the difference. Writing all rows is more expensive operation compared to reading row count.

UPDATE large number of columns in SQL

In a database on the cloud, I have a table with about ten thousand columns. I'm going to update it every few minutes with some local data which is the output of a local code (below my_col_val[]). My questions are:
1- What is the best and fastest way to update each row? (For Loop?)
2- Using a char pointer to save the SQL query (szSQL[]) is the best way when it contains a SQL quesry of size of order 1MB?
My code (in C) now roughly looks like:
char * szSQL[?];// (What is the best size?)
char * my_col [?];
char * my_col_val[?];
SQLHSTMT hStmt = NULL;
sprintf(szSQL, "UPDATE my_table SET %s='%s',...,%s='%s'\ // there should be 8000 %s='%s' statements
WHERE ID = my_ID FROM my_table", my_col[0], my_col_val[0], ..., my_col[n], my_col_val[n]); //wher n=8000
SQLExecDirect(hstm, szSQL, SQL_NTS);
I like #Takarii 's solution using three tables. The best strategy involves 1) how to insert the new rows of measurements and 2) what will it be used for. The latter is of particular interest as that may need additional indexes, and these must be maintained by the db when executing the insert statements. The least indexes are required, the faster the inserts will be. For example, although there is a relation between the three tables, the measurement table could not declare its foreign key with other tables, reducing this index' overhead.
As the table will grow and grow, the db will get slower and slower. Then, it can be beneficial to create a new table for each day of measurements.
As the sensor data is of diferent types, the data could be inserted as string data and only be parsed by the retriever program.
Another help could be that, if the recorded data is only retrieved periodically, the measurements could be written to a flat file and inserted in batch periodically, let's say every hour.
Maybe these ideas can be of help.
Based on your comments, and your request above, Here are my suggestions:
1) As you suggested, an individual table for each machine (not ideal, but will work)
Working on that assumption, you will want an individual row for each sensor, but the problem comes when you need to add additional machines - generally table create privileges are restricted by sysadmins
2) Multiple tables to identify sensor information and assignment, along with a unified results table.
Table 1 - Machine
Table 2 - Sensor
Table 3 - Results
Table 1 would contain the information about the machine with which your sensors are assigned (machine_id, **insert extra columns as needed**)
Table 2 contains the sensor information - this is where your potential 10,000 columns would go, however they are now rows with ID's (sensor_id, sensor_name)
Table 3 contains the results of the sensor readings, with an assignment to a sensor and then to a machine (result_id, machine_id(fk), sensor_id(fk), result)
Then using joins, you can pull out the data for each machine as needed. This will be far more efficient than your current 10k column design

How can I efficiently manipulate 500k records in SQL Server 2005?

I am getting a large text file of updated information from a customer that contains updates for 500,000 users. However, as I am processing this file, I often am running into SQL Server timeout errors.
Here's the process I follow in my VB application that processes the data (in general):
Delete all records from temporary table (to remove last month's data) (eg. DELETE * FROM tempTable)
Rip text file into the temp table
Fill in extra information into the temp table, such as their organization_id, their user_id, group_code, etc.
Update the data in the real tables based on the data computed in the temp table
The problem is that I often run commands like UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id) and these commands frequently time out. I have tried bumping the timeouts up to as far as 10 minutes, but they still fail. Now, I realize that 500k rows is no small number of rows to manipulate, but I would think that a database purported to be able to handle millions and millions of rows should be able to cope with 500k pretty easily. Am I doing something wrong with how I am going about processing this data?
Please help. Any and all suggestions welcome.
subqueries like the one you give us in the question:
UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id)
are only good on one row at a time, so you must be looping. Think set based:
UPDATE t
SET user_id = u.user_id
FROM tempTable t
inner join myUsers u ON t.external_id=u.external_id
and remove your loops, this will update all rows in one statement and be significantly faster!
Needs more information. I am manipulating 3-4 million rows in a 150 million row table regularly and I am NOT thinking this is a lot of data. I have a "products" table that contains about 8 million entries - includign full text search. No problems either.
Can you just elaborte on your hardware? I assume "normal desktop PC" or "low end server", both with absolutely non-optimal disc layout, and thus tons of IO problems - on updates.
Make sure you have indexes on your tables that you are doing the selects from. In your example UPDATE command, you select the user_id from the myUsers table. Do you have an index with the user_id column on the myUsers table? The downside of indexes is that they increase time for inserts/updates. Make sure you don't have indexes on the tables you are trying to update. If the tables you are trying to update do have indexes, consider dropping them and then rebuilding them after your import.
Finally, run your queries in SQL Server Management Studio and have a look at the execution plan to see how the query is being executed. Look for things like table scans to see where you might be able to optimize.
Look at the KM's answer and don't forget about indexes and primary keys.
Are you indexing your temp table after importing the data?
temp_table.external_id should definitely have an index since it is in the where clause.
There are more efficient ways of importing large blocks of data. Look in SQL Books Online under BCP (Bulk Copy Protocol.)

SQL - Optimizing performance of bulk inserts and large joins?

I am doing ETL for log files into a PostgreSQL database, and want to learn more about the various approaches used to optimize performance of loading data into a simple star schema.
To put the question in context, here's an overview of what I do currently:
Drop all foreign key and unique
constraints
Import the data (~100 million records)
Re-create the constraints and run analyze on the fact table.
Importing the data is done by loading from files. For each file:
1) Load the data from into a temporary table using COPY (the PostgreSQL bulk upload tool)
2) Update each of the 9 dimension tables with any new data using an insert for each such as:
INSERT INTO host (name)
SELECT DISTINCT host_name FROM temp_table
EXCEPT
SELECT name FROM host;
ANALYZE host;
The analyze is run at the end of the INSERT with the idea of keeping the statistics up to date over the course of tens of millions of updates (Is this advisable or necessary? At minimum it does not seem to significantly reduce performance).
3) The fact table is then updated with an unholy 9-way join:
INSERT INTO event (time, status, fk_host, fk_etype, ... )
SELECT t.time, t.status, host.id, etype.id ...
FROM temp_table as t
JOIN host ON t.host_name = host.name
JOIN url ON t.etype = etype.name
... and 7 more joins, one for each dimension table
Are there better approaches I'm overlooking?
I've tried several different approaches to trying to normalize the data incoming from a source as such and generally I've found the approach you're using now to be my choice. Its easy to follow and minor changes stay minor. Trying to return the generated id from one of the dimension tables during stage 2 only complicated things and usually generates far too many small queries to be efficient for large data sets. Postgres should be very efficient with your "unholy join" in modern versions and using "select distinct except select" works well for me. Other folks may know better, but I've found your current method to be my perferred method.
During stage 2 you know the primary key of each dimension you're inserting data into (after you've inserted it), but you're throwing this information away and rediscovering it in stage 3 with your "unholy" 9-way join.
Instead I'd recommend creating one sproc to insert into your fact table; e.g. insertXXXFact(...), which calls a number of other sprocs (one per dimension) following the naming convention getOrInsertXXXDim, where XXX is the dimension in question. Each of these sprocs will either look-up or insert a new row for the given dimension (thus ensuring referential integrity), and should return the primary key of the dimension your fact table should reference. This will significantly reduce the work you need to do in stage 3, which is now reduced to a call of the form insert into XXXFact values (DimPKey1, DimPKey2, ... etc.)
The approach we've adopted in our getOrInsertXXX sprocs is to insert a dummy value if one is not available and have a separate cleanse process to identify and enrich these values later on.