I am working on building an application to assist with dynamically taking data from different sources (files and emails, usually CSV and Excel, plus APIs and other SQL databases), processing it, and moving it to a central SQL Server. All the tables are uploaded to the main SQL Server and processed to insert new rows into the destination table and to update rows whose data has changed. The main server is Microsoft SQL Server.
When the data is uploaded to the main server for comparison, it is stored in a temporary table which is dropped after the comparison is done. The statement I am using is created dynamically by the program so it can adapt to different datasets. What I have been using is a NOT EXISTS, but when I run it against a table of 380k+ rows it takes 2+ hours to process the data. I have also tried EXCEPT, but I can't use it because some of the tables contain text columns, which EXCEPT does not allow. The datasets being uploaded to the server are written to and read from at different intervals based on the schedules built into the program.
I am looking for a more efficient approach, or any improvements I could make, to bring down the run time for this table. The program that manages the uploads runs on a different server from the SQL instance, which is part of the organization's SQL farm. I am not very experienced with SQL, so I appreciate any help I can get. Below I added links to the code and an example statement produced by the system when it runs the comparison.
C# Code: https://pastebin.com/8PeUvekG
SQL Statement: https://pastebin.com/zc9kshJw
INSERT INTO vewCovid19_SP
(Street_Number,Street_Dir,Street_Name,Street_Type,Apt,Municipality,County,x_cord,y_cord,address,Event_Number,latitude,longitude,Test_Type,Created_On_Date,msg)
SELECT A.Street_Number,A.Street_Dir,A.Street_Name,A.Street_Type,A.Apt,A.Municipality,A.County,A.x_cord,A.y_cord,A.address,A.Event_Number,A.latitude,A.longitude,A.Test_Type,A.Created_On_Date,A.msg
FROM #TEMP_UPLOAD A
WHERE NOT EXISTS
(SELECT * FROM vewCovid19_SP B
WHERE ISNULL(CONVERT(VARCHAR,A.Street_Number), 'NULL') = ISNULL(CONVERT(VARCHAR,B.Street_Number), 'NULL')
AND ISNULL(CONVERT(VARCHAR,A.Street_Dir), 'NULL') = ISNULL(CONVERT(VARCHAR,B.Street_Dir), 'NULL')
AND ISNULL(CONVERT(VARCHAR,A.Apt), 'NULL') = ISNULL(CONVERT(VARCHAR,B.Apt), 'NULL')
AND ISNULL(CONVERT(VARCHAR,A.Street_Name), 'NULL') = ISNULL(CONVERT(VARCHAR,B.Street_Name), 'NULL')
AND ISNULL(CONVERT(VARCHAR,A.Street_Type), 'NULL') = ISNULL(CONVERT(VARCHAR,B.Street_Type), 'NULL'));
DROP TABLE #TEMP_UPLOAD
One simple query form would be to load a new table from a UNION (which includes de-duplication).
e.g.
insert into vewCovid19_SP_new
select *
from vewCovid19_SP
union
select *
from #temp_upload
then swap the tables with ALTER TABLE ... SWITCH or drop and sp_rename.
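A rough sketch of the swap, assuming vewCovid19_SP_new was just built with the UNION above and has exactly the same definition as vewCovid19_SP (SWITCH requires matching schemas and an empty target):
BEGIN TRAN;
TRUNCATE TABLE vewCovid19_SP;  -- SWITCH needs an empty target table
ALTER TABLE vewCovid19_SP_new SWITCH TO vewCovid19_SP;
COMMIT;
-- alternatively, drop and rename instead of SWITCH:
-- DROP TABLE vewCovid19_SP;
-- EXEC sp_rename 'vewCovid19_SP_new', 'vewCovid19_SP';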
Related
I'm trying to find the best way to duplicate all the rows in a table. By this I mean inserting them again into the same table, but I need to update a single column only on the rows that were inserted.
This is to help me write a script so I can automate some work on some clients.
I can't use SELECT * as it will throw an error because of the identity columns, but at the same time I don't want to manually write out all the column names for several tables.
Is there a simple way to do this in SQL Server?
Sorry for not showing a piece of code, but I have nothing at the moment and I'm not really fluent in SQL.
EDIT: I ended up following the advice of JamieD77 in the comments below this post: moving everything to a table, dropping the id column, updating what I need and then moving it back, as that seems to be the most efficient approach.
In SQL Server Management Studio you can drag the "Columns" folder under a table and drop it on the query window and it will paste a comma-delimited list of all the columns.
Or run a query like:
select string_agg(quotename(name),', ')
from sys.columns
where object_id = object_id('MyTable')
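Applying the same idea to the original duplication question, here is a hedged sketch; the table dbo.MyTable and the column ClientId are placeholders for your real names. It builds the column list from sys.columns, skips identity and computed columns, and swaps in the new value for the one column that should change on the inserted copies:
DECLARE @cols nvarchar(max), @sql nvarchar(max);
SELECT @cols = STRING_AGG(CAST(QUOTENAME(name) AS nvarchar(max)), ', ')
FROM sys.columns
WHERE object_id = OBJECT_ID('dbo.MyTable')
  AND is_identity = 0
  AND is_computed = 0;
-- insert every row again, overriding only [ClientId] in the new copies
SET @sql = N'INSERT INTO dbo.MyTable (' + @cols + N') SELECT '
         + REPLACE(@cols, N'[ClientId]', N'@newClientId')
         + N' FROM dbo.MyTable;';
EXEC sys.sp_executesql @sql, N'@newClientId int', @newClientId = 42;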
I am uploading a data set from Excel into our local server that we use to run offline queries, and I want to use the information in that table to filter a table in our production system, so I can pull the filtered table into our local DB rather than pulling the entire table from prod. I was looking at creating a variable that contained all the account numbers in a string and then using that in an IN(?) statement. This worked, but it only picked up the first account in the string and not the remainder.
I am unable to create a temp table in prod or anything like that, as it is locked down for DBAs only. I have tried a few things in SSIS like variable creation etc., but they haven't worked. I am trying to avoid dumping the entire table. If I have to do this I will set it up to update from the valid-from field to capture new records.
SELECT
*
FROM account_table
where [account number] in (?)
I have created the ? variable in a separate Execute SQL Task, where the ? is:
select
string_agg([account number], ', ')
from [distinct account_table]
I have also tried:
select
string_agg([account number], ',')
from [distinct account_table]
but that didn't work.
You can use expressions instead of passing parameters.
Open the Execute SQL Task editor, go to the Expressions tab, and add a new expression for the SqlStatementSource property as follows:
"SELECT
*
FROM account_table
where [account number] in ('" +
@[User::Variable] + "')"
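Note that for the IN (...) to match more than one account, User::Variable has to contain a quoted, comma-separated list. A possible sketch for the Execute SQL Task that fills the variable, where local_account_upload stands in for your local staging table:
-- separator is ',' so the expression's surrounding quotes complete the list: 123','456 becomes '123','456'
SELECT STRING_AGG(CONVERT(varchar(20), [account number]), ''',''')
FROM (SELECT DISTINCT [account number] FROM local_account_upload) AS a;
Keep in mind that STRING_AGG over non-MAX types errors out past 8000 characters, so for large account lists the Lookup approach described below is safer.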
On the other hand, if I were facing a similar situation I would use a Lookup transformation to do this instead of SQL statements.
First of all, a linked server is not needed here; it's merely a bridge that connects two servers.
I might be missing something obvious here... You say that you've loaded the dataset into your local DB and now you want to get the data from prod that contains those accounts.
My suggestion is to create a data flow and use an OLE DB source to get the data from prod. Then add a Lookup that joins to your dataset in your local DB (join on account; make sure that you connect to your local DB and that you redirect non-matching rows to the no match output).
Then add an OLE DB destination to load the data into your local DB.
I have a third party application from which queries will hit the SQL Server 2008 database to fetch the data ASAP (near to real time). The same query can be called by multiple users at different times.
Is there a way to store the latest result and serve the results for subsequent queries without actually hitting the database again and again for the same piece of data?
Get the results from a procedure that stores the data in a global temporary table, or switch to a permanent table if you regularly drop connections (change tempdb..##Results to Results). Passing @refresh = 1 refreshes the data:
Create procedure [getresults] (@refresh int = 0)
as
begin
IF @refresh = 1 and OBJECT_ID('tempdb..##Results') IS NOT NULL
drop table ##Results
IF OBJECT_ID('tempdb..##Results') IS NULL
select * into ##Results from [INSERT SQL HERE]
SELECT * FROM ##RESULTS
END
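Callers would then use it along these lines:
EXEC getresults @refresh = 1;  -- rebuild ##Results from the underlying query
EXEC getresults;               -- subsequent calls just read the cached copy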
Can you create an indexed view for the data?
When the data is updated the view will be updated too; when the 3rd party makes a call, the view contents will be returned without needing to hit the base tables.
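A minimal sketch of an indexed view, using a hypothetical base table dbo.Orders; the view needs SCHEMABINDING and a unique clustered index before it is materialized:
CREATE VIEW dbo.vw_OpenOrders
WITH SCHEMABINDING
AS
SELECT OrderID, CustomerID, OrderTotal
FROM dbo.Orders
WHERE OrderStatus = 'Open';
GO
CREATE UNIQUE CLUSTERED INDEX IX_vw_OpenOrders ON dbo.vw_OpenOrders (OrderID);
On SQL Server 2008 editions other than Enterprise, the query has to reference the view WITH (NOEXPAND) for the index to be used.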
Unfortunately, the SQL Server you are using doesn't have a query cache like, for example, MySQL's query cache. But as per the documentation I just saw here: Buffer Management
Data pages which are read during a SELECT are first brought into the buffer cache. Subsequent requests reading the same data can thus be served more quickly than the initial request, without needing to access the disk.
I have a stored proc as the SQL command text, which is getting passed a parameter that contains a table name. The proc then returns data from that table. I cannot call the table directly as the OLE DB source because some business logic needs to happen to the result set in the proc. In SQL 2008 this worked fine. In an upgraded 2012 package I get "The metadata could not be determined because ... contains dynamic SQL. Consider using the WITH RESULT SETS clause to explicitly describe the result set."
The problem is I cannot define the field names in the proc because the table name that gets passed as a parameter can be a different value and the resulting fields can be different every time. Anybody encounter this problem or have any ideas? I've tried all sorts of things with dynamic SQL using "dm_exec_describe_first_result_set", temp tables and CTEs that contains WITH RESULT SETS, but it doesn't work in SSIS 2012, same error. Context is a problem with a lot of the dynamic SQL approaches.
This is latest thing I tried, with no luck:
DECLARE @sql VARCHAR(MAX)
SET @sql = 'SELECT * FROM ' + @dataTableName
DECLARE @listStr VARCHAR(MAX)
SELECT @listStr = COALESCE(@listStr + ',', '') + [name] + ' ' + system_type_name FROM sys.dm_exec_describe_first_result_set(@sql, NULL, 1)
exec('exec(''SELECT * FROM myDataTable'') WITH RESULT SETS ((' + @listStr + '))')
So I ask out of kindness, but why on God's green earth are you using an SSIS Data Flow task to handle dynamic source data like this?
The reason you're running into trouble is because you're perverting every purpose of an SSIS Data flow task:
to extract a known source with known metadata that can be statically typed and cached in design-time
to run through a known process with straightforward (and ideally asynchronous) transformations
to take that transformed data and load it into a known destination also with known metadata
It's fine to have parameterized data sources that bring back different data. But to have them bring back entirely different metadata each time with no congruity between the different sets is, frankly, ridiculous, and I'm not entirely sure I want to know how you handled all your column metadata in the working 2008 package.
This is why it wants you to add a WITH RESULT SETS clause to the SSIS query - so it can generate some metadata. It doesn't do this at runtime - it can't! It has to have a known set of columns (because it aliases them all into compiled variables anyway) to work with. It expects the same columns every time it runs that Data Flow Task - the exact same columns, down to the names, the types, and the constraints.
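For reference, the clause the error message is asking for looks like this (the proc, column names and types are purely illustrative):
EXEC dbo.MyDynamicProc ?
WITH RESULT SETS ((Column1 int, Column2 nvarchar(100), Column3 datetime));
The column list still has to be fixed, which is exactly the problem here.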
Which leads to one (terrible, terrible) solution - just stick all the data into a temporary table with Column1, Column2 ... ColumnN and then use the same variable you're using as the table name parameter to conditionally branch your code and do whatever you want with the columns.
Another more sane solution would be to create a data flow task for each of your source tables, and use your parameter in a precedence constraint to just pick which data flow task should run.
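For example, the precedence constraint into each data flow could use an expression such as the following (variable and table names are hypothetical):
@[User::dataTableName] == "CustomerExtract"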
For a solution this poorly suited to an out-of-the-box ETL, you should also seriously consider rolling your own in C# or a Script Task instead of the Data Flow Task provided by SSIS.
In short, please don't do this. Think of the children (packages)!
I've used CozyRoc Dynamic DataFlow Plus to achieve this.
Using configuration tables to build the SQL Select statements, I have a single SSIS package that loads data from Oracle and Sybase (or any OLEDB source) to MS SQL. Some of the result sets are in the millions of rows and performance is excellent.
Instead of writing a new package every time a new table is needed, this can be configured in minutes and run on the pre-tested and robust existing package.
Without it I would have been up for writing hundreds of packages.
I need to update my contacts database in SQL Server with changes made in a remote database (also SQL Server, on a different server on the same local network). I can't make any changes to the remote database, which is a commercial product. I'm connected to the remote database using a linked server. Both tables contain around 200K rows.
My logic at this point is very simple: [simplified pseudo-SQL follows]
/* Get IDs of new contacts into local temp table */
Select remote.ID into #NewContactIDs
From Remote.Contacts remote
Left Join Local.Contacts local on remote.ID=local.ID
Where local.ID is null
/* Get IDs of changed contacts */
Select remote.ID into #ChangedContactIDs
From Remote.Contacts remote
Join Local.Contacts local on remote.ID=local.ID
Where local.ModifyDate < remote.ModifyDate
/* Pull down all new or changed contacts */
Select ID, FirstName, LastName, Email, ...
Into #NewOrChangedContacts
From Remote.Contacts remote
Where remote.ID in (
Select ID from #NewContactIDs
union
Select ID from #ChangedContactIDs
)
Of course, doing those joins and comparisons over the wire is killing me. I'm sure there's a better way - advice?
Consider maintaining a lastCompareTimestamp (the last time you did the compare) in your local system. Grab all the remote records with ModifyDates > lastCompareTimestamp and throw them in a local temp table. Work with them locally from there.
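A sketch of that approach; the linked server name REMOTESRV, the database RemoteDb and the local bookkeeping table dbo.SyncState are assumptions:
DECLARE @lastCompare datetime;
SELECT @lastCompare = lastCompareTimestamp FROM dbo.SyncState;
-- pull only the rows modified since the last run into a local temp table
SELECT ID, FirstName, LastName, Email, ModifyDate
INTO #NewOrChangedContacts
FROM REMOTESRV.RemoteDb.dbo.Contacts
WHERE ModifyDate > @lastCompare;
-- work locally from here (insert/update the local contacts), then move the bookmark forward
-- (using the max ModifyDate actually pulled avoids clock-skew gaps)
UPDATE dbo.SyncState SET lastCompareTimestamp = GETDATE();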
The last compare date is a great idea.
One other method I have had great success with is SSIS (though it has a learning curve, and might be overkill unless you do this type of thing a lot):
Make a package
Set a data source for each of the two tables. If you expect a lot of change, pull the whole tables; if you expect only incremental changes, then filter by mod date. Make sure the results are ordered.
Funnel both sets into a Full Outer Join
Split the results of the join into three buckets: unchanged, changed, new
Discard the unchanged records, send the new records to an insert destination, and send the changed records to either a staging table for a SQL-based update, or - for few rows - an OLEDB command with a parameterized update statement.
OR, if you are on SQL Server 2008 or later, use MERGE:
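A minimal MERGE sketch using the column names from the pseudo-SQL above, with Local.Contacts standing in for the real local table and #NewOrChangedContacts holding the rows pulled from the remote server:
MERGE Local.Contacts AS tgt
USING #NewOrChangedContacts AS src
    ON tgt.ID = src.ID
WHEN MATCHED AND tgt.ModifyDate < src.ModifyDate THEN
    UPDATE SET FirstName  = src.FirstName,
               LastName   = src.LastName,
               Email      = src.Email,
               ModifyDate = src.ModifyDate
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, FirstName, LastName, Email, ModifyDate)
    VALUES (src.ID, src.FirstName, src.LastName, src.Email, src.ModifyDate);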