How Is Database Query Data Processed Between 3 Computers? - sql

I'm trying to understand how data is really processed and sent between servers, but am unable to find a step-by-step, detailed explanation. To explain using an example:
Assume you have 100M records totaling 10GB in size and you need to transfer the data between 2 database servers over a LAN, using a separate desktop computer to execute the script. In total you have three computers:
Server A has the source database you are selecting the data from.
Server B has the destination database you are writing the data to.
Desktop C is where you are executing your script from, using a client tool like SQL Developer or TOAD.
The SQL would look something like this:
Insert Into ServerBDatabase.MyTable (MyFields)
Select MyFields
From ServerADatabase.MyTable
Can someone explain where the data is processed and how it is sent? Or even point me to a good resource to read up on the subject.
For example, I'm trying to understand if it's something like this:
The client tool on Desktop C sends the query command to the DBMS on Server A. At this point, all that is being sent over the LAN is the text of the command.
The DBMS on Server A receives the query command, interprets it, and then
processes the data on Server A. At this point, the full 10GB data is loaded into memory for processing and nothing is being sent over the LAN.
Once the full 100M records are done being processed on Server A, the DBMS on Server A sends both the text of the query command and the full 100M records of data over the LAN to the DBMS on Server B. Given bandwidth constraints, this data is broken up and sent in chunks over the LAN at some amount of bits per second and loaded on Server B's memory in those data chunk amounts.
As the DBMS on Server B receives the chunks of data over the LAN, it pieces them back together in its memory to return them to their full 100M-record state.
Once the 100M records are fully pieced back together by the DBMS on Server B, the DBMS on Server B then executes the query command to insert the records from its memory into the destination table one row at a time and writes them to disk.
All of these are assumptions, so I know I could have it all wrong, which is why I'm seeking help. Please correct my errors and/or fill in the blanks.

This is your query:
Insert Into ServerBDatabase.MyTable (MyFields)
Select MyFields
From ServerADatabase.MyTable;
Presumably, you are executing this on a client tool connected to Server A (although the same reasoning applies to Server B). The entire query is sent to the server, and Server A would need to have a remote connection to Server B.
In other words, once the query is sent, the client has nothing to do with it -- other than waiting for the query to complete.
As for the details of inserting from one database to another, those depend on the database. I would expect the data to be streamed into the table on Server B, all within a single transaction. The exact details might depend on the specific database software and its configuration.
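For example, if both databases happened to be SQL Server instances, the remote insert might be expressed with a linked server defined on Server A; this is only a sketch with placeholder names, not the exact syntax from the question:
-- Run on Server A; assumes a linked server named [ServerB] has already been configured there.
-- The SELECT runs locally on Server A and the rows are streamed over the link to Server B.
INSERT INTO [ServerB].[ServerBDatabase].[dbo].[MyTable] (MyFields)
SELECT MyFields
FROM [ServerADatabase].[dbo].[MyTable];
Other engines express the same thing differently (Oracle, for instance, uses database links), but the flow is the same: the statement executes on the server you are connected to, and that server sends the rows to the remote one.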

Related

Database copy limit per database reached. The database X cannot have more than 10 concurrent database copies (Azure SQL)

In our application, we have a master database 'X'. For each new client, we will create a new database copy of master database 'X'.
I am using the following SQL command which will be executed against Azure SQL server.
CREATE DATABASE [NEW NAME] AS COPY OF [MASTER DB]
We are using a custom queue tier so that we can create more than one client at a time, in parallel.
I am facing issues in following scenario.
I am trying to create 70 clients. Once 25 clients are created, I get the error below.
Database copy limit per database reached. The database 'BlankDBClient' cannot have more than 10 concurrent database copies
Can you please share your thoughts on this?
SQL Azure has logic to do various operations online/automatically for you (backups, upgrades, etc). There are IOs required to do each copy, so there are limits in place because the machine does not have infinite IOPS. (Those limits may change a bit over time as we work to improve the service, get newer hardware, etc.)
In terms of what options you have, you could:
Restore N databases from a database backup (which would still have IO limits but they may be higher for you depending on your reservation size)
Consider models to copy in parallel using a single source to hierarchically create what you need (copy 2 from one, then copy 2 from each of the ones you just copied, etc)
Stage out the copies over time based on the limits you get back from the system.
Try a larger reservation size for the source and target during the copy to get more IOPS and lower the time to perform the operations.
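For the last option, scaling the source or target is an online T-SQL change on Azure SQL; a rough sketch, where the database name and service objective are only examples:
-- Temporarily scale up the source database to get more IOPS while the copies run
ALTER DATABASE [MASTERDB] MODIFY (SERVICE_OBJECTIVE = 'P2');
-- ...submit the copies...
-- Scale back down afterwards to control cost
ALTER DATABASE [MASTERDB] MODIFY (SERVICE_OBJECTIVE = 'S2');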
In addition to Connor's answer, you can consider having a dacpac or bacpac of that master database stored on Azure Storage, and once you have submitted 25 concurrent database copies you can start restoring the dacpac from Azure Storage.
You can also monitor how many database copies show COPYING in the state_desc column of the following queries. After sending the first batch of 25 copies, when those queries return fewer than 25 rows, start sending more copies until you reach the 25 limit again. Keep doing this until you finish the queue of copies required.
Select
    [sys].[databases].[name],
    [sys].[databases].[state_desc],
    [sys].[dm_database_copies].[start_date],
    [sys].[dm_database_copies].[modify_date],
    [sys].[dm_database_copies].[percent_complete],
    [sys].[dm_database_copies].[error_code],
    [sys].[dm_database_copies].[error_desc],
    [sys].[dm_database_copies].[error_severity],
    [sys].[dm_database_copies].[error_state]
From [sys].[databases]
Left Outer Join [sys].[dm_database_copies]
    On [sys].[databases].[database_id] = [sys].[dm_database_copies].[database_id]
Where [sys].[databases].[state_desc] = 'COPYING'

SELECT state_desc, *
FROM sys.databases
WHERE [state_desc] = 'COPYING'
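If you want to automate the wait-for-a-free-slot part, a minimal polling sketch (the one-minute delay and the threshold of 25 are just placeholders) could look like this, run against the logical server's master database:
-- Wait until fewer than 25 copies are still in flight before submitting the next batch
DECLARE @InFlight INT = 25;
WHILE @InFlight >= 25
BEGIN
    WAITFOR DELAY '00:01:00';   -- check once a minute
    SELECT @InFlight = COUNT(*)
    FROM sys.databases
    WHERE state_desc = 'COPYING';
END;
-- safe to submit more CREATE DATABASE ... AS COPY OF ... statements here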

exporting a table containing a large number of records to another database on a linked server

I have a task in a project that requires the results of a process, which could be anywhere from 1,000 up to 10,000,000 records (approximate upper limit), to be inserted into a table with the same structure in another database across a linked server. The requirement is to be able to transfer in chunks to avoid any timeouts.
In doing some testing, I set up a linked server and, using the following code as a test, transferred approximately 18,000 records:
DECLARE @BatchSize INT = 1000

WHILE 1 = 1
BEGIN
    INSERT INTO [LINKEDSERVERNAME].[DBNAME2].[dbo].[TABLENAME2] WITH (TABLOCK)
    (
        id
        ,title
        ,Initials
        ,[Last Name]
        ,[Address1]
    )
    SELECT TOP(@BatchSize)
        s.id
        ,s.title
        ,s.Initials
        ,s.[Last Name]
        ,s.[Address1]
    FROM [DBNAME1].[dbo].[TABLENAME1] s
    WHERE NOT EXISTS (
        SELECT 1
        FROM [LINKEDSERVERNAME].[DBNAME2].[dbo].[TABLENAME2]
        WHERE id = s.id
    )

    IF @@ROWCOUNT < @BatchSize BREAK
END
This works fine; however, it took 5 minutes to transfer the data.
I would like to implement this using SSIS and am looking for any advice in how to do this and speed up the process.
Open Visual Studio/Business Intelligence Designer Studio (BIDS)/SQL Server Data Tools-BI edition(SSDT)
Under the Templates tab, select Business Intelligence, Integration Services Project. Give it a valid name and click OK.
In Package.dtsx which will open by default, in the Connection Managers section, right click - "New OLE DB Connection". In the Configure OLE DB Connection Manager section, Click "New..." and then select your server and database for your source data. Click OK, OK.
Repeat the above process but use this for your destination server (linked server).
Rename the above connection managers from server\instance.databasename to something better. If databasename does not change across the environments then just use the database name. Otherwise, go with the common name of it. i.e. if it's SLSDEVDB -> SLSTESTDB -> SLSPRODDB as you migrate through your environments, make it SLSDB. Otherwise, you end up with people talking about the connection manager whose name is "sales dev database" but it's actually pointing at production.
Add a Data Flow to your package. Call it something useful besides Data Flow Task. DFT Load Table2 would be my preference but your mileage may vary.
Double click the data flow task. Here you will add an OLE DB Source, a Lookup Transformation and an OLE DB Destination. Possibly more but, as always, it will depend.
OLE DB Source - use the first connection manager we defined and a query
SELECT
s.id
,s.title
,s.Initials
,s.[Last Name]
,s.[Address1]
FROM [dbo].[TABLENAME1] s
Only pull in the columns you need. Your query currently filters out any duplicates that already exist in the destination. Doing that can be challenging. Instead, we'll bring the entirety of TABLENAME1 into the pipeline and filter out what we don't need. For very large volumes in your source table, this may be an untenable approach and we'd need to do something different.
From the Source we need to use a Lookup Transformation. This will allow us to detect the duplicates. Use the second connection manager we defined, the one that points to the destination. Change the no-match behavior from "Fail Component" to "Redirect Unmatched rows" (name approximate).
Use your query to pull back the key value(s)
SELECT T2.id
FROM [dbo].[TABLENAME2] AS T2;
Map T2.id to the id column.
When the package starts, it will issue the above query against the target table and cache all the values of T2.id into memory. Since this is only a single column, that shouldn't be too expensive but again, for very large tables, this approach may not work.
There are 3 outputs now available from the Lookup: Match, NoMatch and Error. Match will be anything that exists in both the source and the destination. You don't care about those, as you are only interested in what exists in the source and not the destination. When you might care is if you have to determine whether there is a change between the values in the source and the destination. NoMatch are the rows that exist in the Source but don't exist in the Destination. That's the stream you want. For completeness, Error would capture things that went very wrong, but I've not experienced it "in the wild" with a Lookup.
Connect the NoMatch stream to the OLE DB Destination. Select your Table Name there and ensure the words Fast Load are in the destination. Click on the Columns tab and make sure everything is routed up.
Whether you need to fiddle with the knobs on the OLE DB Destination is highly variable. I would test it, especially with your larger sets of data and see whether the timeout conditions are a factor.
Design considerations for larger sets
It depends.
Really, it does. But, I would look at identifying where the pain point lies.
If the source table is very large and the pain point is pulling all that data into the pipeline just to filter most of it back out, then I'd look at something like a preliminary Data Flow that brings all the rows from the Lookup query over to the source database (use the T2 query), writes them into a staging table, and makes that one column your clustered key. Then modify your source query to reference the staging table.
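As a rough sketch of that staging approach (the staging table name and the id data type are assumptions for illustration):
-- Staging table on the source server that holds only the destination keys,
-- clustered on id so the anti-join back to TABLENAME1 stays cheap
CREATE TABLE dbo.Stage_Table2Keys
(
    id INT NOT NULL,
    CONSTRAINT PK_Stage_Table2Keys PRIMARY KEY CLUSTERED (id)
);

-- The source query for the main data flow then becomes a purely local anti-join:
SELECT s.id, s.title, s.Initials, s.[Last Name], s.[Address1]
FROM [dbo].[TABLENAME1] AS s
WHERE NOT EXISTS (SELECT 1 FROM dbo.Stage_Table2Keys AS k WHERE k.id = s.id);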
Depending on how active the destination table is (whether any other process could load it), I might keep that lookup in the data flow to ensure I don't load duplicates. If this process is the only one that loads it, then drop the Lookup.
If the Lookup is at fault - it can't pull in all the IDs - then either go with the first alternative listed above or look at changing your caching mode from Full to Partial. Do realize that Partial caching will issue a query against the target system for potentially every row that comes out of the source database.
If the destination is giving issues - I'd determine what the issue is. If it's network latency for the loading of data, drop the value of MaximumCommitInsertSize from 2147483647 to something reasonable, like your batch size from above (although 1k might be a bit low). If you're still encountering blocking, then perhaps staging the data to a different table on the remote server and then doing an insert locally might be an approach.
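A sketch of that last idea - stage on the remote server, then insert locally there (the staging table name is an assumption):
-- 1) From the source server, push the rows into a staging table on the remote server
INSERT INTO [LINKEDSERVERNAME].[DBNAME2].[dbo].[Stage_TABLENAME2] (id, title, Initials, [Last Name], [Address1])
SELECT s.id, s.title, s.Initials, s.[Last Name], s.[Address1]
FROM [DBNAME1].[dbo].[TABLENAME1] AS s;

-- 2) Then, in a job or procedure that runs on the remote server itself,
--    move the staged rows into the real table as a purely local insert
INSERT INTO [dbo].[TABLENAME2] (id, title, Initials, [Last Name], [Address1])
SELECT st.id, st.title, st.Initials, st.[Last Name], st.[Address1]
FROM [dbo].[Stage_TABLENAME2] AS st
WHERE NOT EXISTS (SELECT 1 FROM [dbo].[TABLENAME2] AS t WHERE t.id = st.id);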

2 SQL Servers but different tempdb IO pattern; one spikes up and down between 5MB/sec and 0.2MB/sec

I have 2 MSSQL servers (let's call them SQL1 and SQL2) running a total of 1866 databases.
SQL1 has 993 databases (993203 registered users).
SQL2 has 873 databases (931259 registered users).
Each SQL server has a copy of an InternalMaster database (for some shared table data) and then multiple customer databases, 1 database per customer (customer/client, not registered user).
At the time of writing this we had just over 10,000 users online using our software.
SQL2 behaves as expected: Database I/O is generally 0.2MB/sec and goes up and down in a normal flow; I/O goes up on certain reports and queries and so on, in a random fashion.
However, SQL1 has a constant pattern, almost like a life support machine.
I don't understand why two servers with the same infrastructure work so differently. The spike starts at around 2MB/sec and then increases to a max of around 6MB/sec. Both servers have identical IOPS provisions for the data, log and transaction partitions and identical AWS specs. The Data File I/O view shows that tempdb is the culprit of this spike.
Any advice would be great, as I just can't get my head around how one tempdb would act differently to another when running the same software and setup on both servers.
Regards
Liam
Liam,
Please see the website below, which explains how to configure tempdb. Judging by your image, you only have one data file for the tempdb database.
http://www.brentozar.com/sql/tempdb-performance-and-configuration/
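If a single data file does turn out to be the issue, adding files is a straightforward T-SQL change; this is only a sketch - the file count, paths, sizes and growth settings depend entirely on your cores and storage:
-- Add extra tempdb data files so allocation contention is spread across files;
-- a common starting point is one file per CPU core, up to 8
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2, FILENAME = 'T:\TempDB\tempdev2.ndf', SIZE = 4GB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev3, FILENAME = 'T:\TempDB\tempdev3.ndf', SIZE = 4GB, FILEGROWTH = 512MB);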
Hope this helps

Keeping temp tables in memory after switching servers?

How can I get #tempTable to stay in memory after switching servers? Is this possible?
Select *
into #tempTable
from dbo.table
I have data in server 1 that I want to compare with data in server 2, but I only have read-only access to server 2 (so I can't just move my data there), and the table in server 2 is too big to move to server 1. This is why I want to know how to keep a temp table in memory after connecting to a new server.
Any help would be appreciated, thanks.
What you write is possible, but it just creates a temporary table on the server where the table is.
You probably want:
select *
into #Server2Table
from server2.database.dbo.table
You can then use #Server2Table in additional queries on the same connection where you copied it (such as the same window in SSMS or the same job step or the same stored procedure). If you need it in a more permanent location, either use a global temporary table (starts with ##) or a "real" table.
This requires the ability to link servers, using something like:
sp_addlinkedserver server2
You would run this on server1. Perhaps your DBA will need to set it up.
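A minimal sketch of setting that up on server1 (the login mapping and credentials here are placeholders your DBA would adjust):
-- Run on server1: register server2 as a linked server and map a login for it
EXEC sp_addlinkedserver
    @server = N'server2',
    @srvproduct = N'SQL Server';
EXEC sp_addlinkedsrvlogin
    @rmtsrvname = N'server2',
    @useself = N'False',
    @rmtuser = N'readonly_user',
    @rmtpassword = N'********';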
I have found that queries often run faster when loading cross-server tables into a temporary table. This is not because temporary tables are stored in memory. This is because there is more information available about a table on a local server for the SQL optimizer to take advantage of.
You could export the data from your query and import that into a database on your first server. You can use the SQL Server import/export wizard. A bit convoluted, but if this happens a lot you can use SSIS to automate the move, or even simply tick the 'Save this package' box. Once you have the data in your first server you can do whatever you like with it.

SSIS data import with resume

I need to push a large SQL table from my local instance to SQL Azure. The transfer is a simple, 'clean' upload - simply push the data into a new, empty table.
The table is extremely large (~100 million rows) and consists only of GUIDs and other simple types (no timestamps or anything).
I create an SSIS package using the Data Import / Export Wizard in SSMS. The package works great.
The problem is when the package is run over a slow or intermittent connection. If the internet connection goes down halfway through, then there is no way to 'resume' the transfer.
What is the best approach to engineering an SSIS package to upload this data, in a resumable fashion? i.e. in case of connection failure, or to allow the job to be run only between specific time windows.
Normally, in a situation like that, I'd design the package to enumerate through batches of size N (1k rows, 10M rows, whatever) and log to a processing table which batch was the last one successfully transmitted. However, with GUIDs you can't quite partition them out into buckets.
In this particular case, I would modify your data flow to look like Source -> Lookup -> Destination. In your lookup transformation, query the Azure side and only retrieve the keys (SELECT myGuid FROM myTable). Here, we're only going to be interested in rows that don't have a match in the lookup recordset as those are the ones pending transmission.
A full cache is going to cost about 1.5GB (100M * 16 bytes) of memory, assuming the Azure side was fully populated, plus the associated data transfer costs. That cost will be less than truncating and re-transferring all the data, but I want to make sure I call it out.
Just order by your GUID when uploading. And make sure you use the max(guid) from Azure as your starting point when recovering from a failure or restart.
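A sketch of that restart logic on the source side (the linked server, table and column names are assumptions):
-- Find the highest key that already landed in Azure, then resume the upload from there
DECLARE @LastGuid UNIQUEIDENTIFIER;
SELECT @LastGuid = MAX(myGuid)
FROM [AZURELINKEDSERVER].[AzureDb].[dbo].[myTable];

SELECT s.*
FROM [dbo].[myTable] AS s
WHERE @LastGuid IS NULL OR s.myGuid > @LastGuid
ORDER BY s.myGuid;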