I have the following stored procedure to loop through hundreds of different JSON files that are downloaded to the server every day.
The issue is that the query takes a good 15 minutes to run, and I will soon need to create something similar for a larger number of JSON files. Can somebody point me in the right direction for improving the performance of this query?
DECLARE @json VARCHAR(MAX) = ''
DECLARE @Int INT = 1
DECLARE @Union INT = 0
DECLARE @sql NVARCHAR(MAX)
DECLARE @PageNo INT = 300

WHILE (@Int < @PageNo)
BEGIN
    SET @sql = (
        'SELECT
            @cnt = value
        FROM
            OPENROWSET (BULK ''C:\JSON\tickets' + CONVERT(varchar(10), @Int) + '.json'', SINGLE_CLOB) as j
        CROSS APPLY OPENJSON(BulkColumn)
        WHERE
            [key] = ''tickets''
        ')

    EXECUTE sp_executesql @sql, N'@cnt nvarchar(max) OUTPUT', @cnt = @json OUTPUT

    IF NOT EXISTS (SELECT * FROM OPENJSON(@json) WITH ([id] int) j JOIN tickets t ON t.id = j.id)
    BEGIN
        INSERT INTO
            tickets (id, Field1)
        SELECT
            *
        FROM OPENJSON(@json)
        WITH ([id] int, Field1 int)
    END

    SET @Int = @Int + 1
END
It seems your BULK INSERT in the loop is the bottleneck. Generally a BULK INSERT is the fastest way to load data; here, the sheer number of files seems to be your problem.
To make things faster you would want to read the JSON files in parallel. You could do that by first building the complete dynamic SQL query for all files, or perhaps for a few file groups, and reading them simultaneously.
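As a rough illustration of the "one dynamic statement for many files" idea, here is a minimal sketch that concatenates the per-file reads into a single UNION ALL query; it reuses the C:\JSON\tickets<N>.json naming from the question, and #rawJson is a hypothetical staging table.

-- Minimal sketch: read every file in one statement instead of one sp_executesql call per file.
IF OBJECT_ID('tempdb..#rawJson') IS NOT NULL DROP TABLE #rawJson;
CREATE TABLE #rawJson (doc nvarchar(max));

DECLARE @sql nvarchar(max) = N'';
DECLARE @i int = 1;

WHILE @i < 300
BEGIN
    SET @sql += CASE WHEN @i > 1 THEN N' UNION ALL ' ELSE N'' END
              + N'SELECT BulkColumn FROM OPENROWSET(BULK ''C:\JSON\tickets'
              + CONVERT(nvarchar(10), @i) + N'.json'', SINGLE_CLOB) AS j'
    SET @i += 1;
END

INSERT INTO #rawJson (doc)
EXEC sp_executesql @sql;

-- The tickets can then be shredded from #rawJson with OPENJSON in one set-based INSERT.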
I would rather advise using Integration Services with a script component as the source in parallel data flow tasks. First read all files from your destination folder, split them into, for example, 4 groups, and for each group have a loop container that runs in parallel. Depending on your executing machine, you can use as many parallel flows as possible. Already 2 data flows should make up for the overhead of Integration Services.
Another option would be to write a CLR (common language runtime) stored procedure and deserialize the JSON in parallel using C#.
It also depends on the machine doing the job. You would want enough RAM and free CPU power, so consider doing the import while the machine is not busy.
One method I've had success with when loading data into tables from lots of individual XML files, and which you might be able to apply to this problem, is the FileTable feature of SQL Server.
The way it worked was to set up a FileTable in the database, then give the process that was uploading the XML files access to the FILESTREAM share that was created on the server. XML files were then dropped into the share and were immediately available in the database for querying using XPath.
A process would then run XPath queries to load the required data from the XML into the required tables and keep track of which files had been loaded; when the next schedule came along, it would only load data from the newest files.
A scheduled task on the machine would then remove files when they were no longer required.
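For illustration only, a minimal sketch of that querying step, assuming a hypothetical FileTable named dbo.XmlDrop (name and file_stream are standard FileTable columns; the XPath expressions and ticket elements are made up):

-- Query XML files dropped into a hypothetical FileTable directly, no separate import step needed
SELECT  ft.name AS source_file,
        x.doc.value('(/ticket/id)[1]', 'int') AS id,
        x.doc.value('(/ticket/field1)[1]', 'nvarchar(200)') AS field1
FROM    dbo.XmlDrop AS ft
CROSS APPLY (SELECT CONVERT(xml, ft.file_stream)) AS x(doc)
WHERE   ft.name LIKE '%.xml';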
Have a read up on FileTable here:
FileTables (SQL Server)
It's available in all SQL Server editions.
I'm currently working on a .NET application and want to make it as modular as possible. I've already created a basic SELECT procedure, which returns data by checking the inputted parameters on the SQL Server side.
I want to create a procedure that parses structured data passed as a string and inserts its contents into the corresponding table in the database.
For example, I have a table as
CREATE TABLE ExampleTable (
id_exampleTable int IDENTITY (1, 1) NOT NULL,
exampleColumn1 nvarchar(200) NOT NULL,
exampleColumn2 int NULL,
exampleColumn3 int NOT NULL,
CONSTRAINT pk_exampleTable PRIMARY KEY ( id_exampleTable )
)
And my procedure starts as
CREATE PROCEDURE InsertDataIntoCorrespondingTable
    @dataTable nvarchar(max), --name of Table in my DB
    @data nvarchar(max) --normalized string parameter as 'column1, column2, column3, etc.'
AS
BEGIN
    IF @dataTable = 'table'
    BEGIN
        /**Parse this string and execute insert command**/
    END
    ELSE IF /**Other statements**/
END
TL;DR
So basically, I'm looking for a solution that can help me achieve something like this
EXEC InsertDataIntoCorrespondingTableByID
    @dataTable = 'ExampleTable',
    @data = '''exampleColumn1'', 2, 3'
Which should be equal to just
INSERT INTO ExampleTable SELECT 'exampleColumn1', 2, 3
Sure, I can push data as INSERT statements (for each and every one of the 14 tables inside the DB...), generated inside the app, but I want to conquer T-SQL :)
This might be reasonable (to some degree) on an RDBMS that supports structured data like JSON or XML natively, but doing this the way you are planning is going to cause some real pain-in-the-rear support and, more importantly, a SQL injection attack vector. I would leave this to the realm of the web backend server where it belongs.
You are likely going to end up inventing your own structured data markup language and a parser for it inside SQL Server. That's a wheel that doesn't need to be reinvented. If you do end up building this, strongly consider going with JSON to avoid all the issues that custom structured data formats inherently bring with them, assuming your version of SQL Server supports JSON parsing/packaging.
Your front end that packages your data into your SDML is going to have to assume column ordinals, but column ordinal is not something that one should rely on in a database. SQL amateurs often do; I know this from years in the industry, dealing with end users who get upset when a new column is introduced in a position they don't want. Adding a column to a table shouldn't break an application. If it does, that application has bad code.
Regarding the SQL injection attack vector, your SP code is going to get ugly. You'll need to parse out each item in @data into a variable of its own in order to properly parameterize the dynamic SQL that is being built. See here under the "working with parameters" section for what that will look like. Failure to add this to your SP code means that values passed in that @data SDML could become executable SQL instead of literals, and that would be very bad. This is not easy to solve in SP language. Where it IS easy to solve, though, is in the backend server code. Every database library on the planet supports parameterized query building/execution natively.
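To make that parameterization point concrete, here is a minimal sketch (not the linked example; the table and column names are taken from the question) of how each parsed value has to travel as an sp_executesql parameter rather than as concatenated SQL text:

-- Minimal sketch: the table name is quoted, the values are bound as parameters, never embedded as SQL text
DECLARE @sql nvarchar(max) =
    N'INSERT INTO ' + QUOTENAME('ExampleTable') +
    N' (exampleColumn1, exampleColumn2, exampleColumn3) VALUES (@p1, @p2, @p3)';

EXEC sp_executesql
     @sql,
     N'@p1 nvarchar(200), @p2 int, @p3 int',
     @p1 = N'exampleColumn1', @p2 = 2, @p3 = 3;   -- the literals from the question's @data example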
Once you have this built, you will be dynamically generating an INSERT statement and dynamically generating variables, or an array or some other data structure, to pass parameters into that INSERT statement to avoid SQL injection attacks. It's going to be dynamic, on top of dynamic, on top of dynamic, which leads to:
From a support context, imagine that your application just totally throws up one day. You have to dive in to investigate. You track down the SDML that your front end created that caused the failure, and you open up your SP code to troubleshoot. Imagine what this code ends up looking like:
It has to determine if the table exists
It has to parse the SDML to get each literal
It has to read DB metadata to get the column list
It has to dynamically write the insert statement, listing the columns from metadata and dynamically creating sql parameters for the VALUES() list.
It has to execute sending a dynamic number of variables into the dynamically generated sql.
My support staff would hang me out to dry if they had to deal with that, and I'm the one paying them.
All of this is solved by using a proper backend to handle communication, deeper validation, sql parameter binding, error catching and handling, and all the other things that backend servers are meant to do.
I believe that your back end web server should be VERY aware of the underlying data model. It should be the connection between your view, your data, and your model. Leave the database to the things it's good at (reading and writing data). Leave your front end to the things that it's good at (presenting a UI for the end user).
I suppose you could do something like this (may need a little extra work)
declare @columns varchar(max);

select @columns = string_agg(name, ', ') WITHIN GROUP (ORDER BY column_id)
from sys.all_columns
where object_id = object_id(@dataTable);

declare @sql nvarchar(max) = concat('INSERT INTO ', @dataTable, ' (', @columns, ') VALUES (', @data, ')');

exec sp_executesql @sql;
But please don't. If this were a good idea, there would be tons of examples of how to do it. There aren't, so it's probably not a good idea.
There are, however, tons of examples of using ORMs or auto-generated code instead, because that way your code is maintainable, debuggable and performant.
I have several linked servers and I want to insert a value into each of them. On the first try, the INSERT using a CURSOR took far too long; it ran for about 17 hours. I was curious about those INSERT queries, so I checked a single INSERT statement using Display Estimated Execution Plan, and it showed a cost of 46% on the Remote Insert and 54% on the Constant Scan.
Below is the code snippet I was working with:
DECLARE @Linked_Servers varchar(100)

DECLARE CSR_STAGGING CURSOR FOR
SELECT [Linked_Servers]
FROM MyTable_Contain_Lists_of_Linked_Server

OPEN CSR_STAGGING
FETCH NEXT FROM CSR_STAGGING INTO @Linked_Servers
WHILE @@FETCH_STATUS = 0
BEGIN
    BEGIN TRY
        EXEC('
            INSERT INTO ['+@Linked_Servers+'].[DB].[Schema].[Table] VALUES (''bla'',''bla'',''bla'')
        ')
    END TRY
    BEGIN CATCH
        DECLARE @ERRORMSG as varchar(8000)
        SET @ERRORMSG = ERROR_MESSAGE()
    END CATCH

    FETCH NEXT FROM CSR_STAGGING INTO @Linked_Servers
END
CLOSE CSR_STAGGING
DEALLOCATE CSR_STAGGING
Below is a figure of how I checked the estimated execution plan of my query.
I checked only the INSERT query, not all of the queries.
How can I get the best practice and best performance for a remote INSERT?
You can try this, but I think the difference will be only negligibly better. I recall that when reading up on the different approaches to doing inserts across linked servers, most of the standard approaches were basically on par with each other, though it's been a while since I looked that up, so don't quote me.
It will also require you to do some light rewriting due to the obvious differences in approach (and assuming that you are able to do so anyway). The dynamic SQL required to do this might be tricky, though, as I am not entirely sure whether you can call OPENQUERY within dynamic SQL (I should know this, but I've never needed to).
However, if you can use this approach, the main benefit is that the WHERE clause gets the destination schema without having to select any data (because 1 will never equal 0).
INSERT OPENQUERY (
    [your-server-name],
    'SELECT
        somecolumn
      , anotherColumn
    FROM destinationTable
    WHERE 1=0'
    -- this will help reduce the scan as it will
    -- get schema details without having to select data
)
SELECT
    somecolumn
  , anotherColumn
FROM sourceTable
Another approach you could take is to build an insert proc on the destination server/DB. Then you just call the proc, sending the params over. While this is a little more work and introduces more objects to maintain, it adds simplicity to your process and potentially reduces I/O when sending things across the linked servers, not to mention it might save on the CPU cost of your constant scans as well. I think it's probably a cleaner approach than trying to optimize linked-server behavior.
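As a rough sketch of that idea, the cursor body from the question would shrink to a remote procedure call via a four-part name (usp_InsertRow and its parameters are hypothetical, and the linked servers need the RPC Out option enabled):

-- Hypothetical proc usp_InsertRow created on each destination DB; parameter names are placeholders
EXEC('
    EXEC [' + @Linked_Servers + '].[DB].[Schema].[usp_InsertRow]
         @col1 = ''bla'', @col2 = ''bla'', @col3 = ''bla''
')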
I am in the design stage of an application, and a large part of its functionality is importing data into a SQL Server database. As there are numerous tables in the database, I want to avoid the conventional approach of creating models and writing stored procedures for each import. Is there a way I can create a single stored procedure for different tables and insert data into them?
Note: columns will vary from table to table.
Thanks in advance
Well, I would stick with the comments discouraging it, but on the other hand, if this procedure will be super simple and maintenance will be transferred to the JSON creator, you can do it like this:
declare @tablename as nvarchar(max)
declare @json as nvarchar(max)
declare @query as nvarchar(max)

-- only accept table names that appear in an allow-list
set @tablename = (SELECT TableName FROM YourAllowedTableNamesList WHERE TableName = @tablename)

set @query =
    'INSERT INTO ' + QUOTENAME(@tablename) +
    ' SELECT * FROM OPENJSON(@json)'

-- pass the JSON as a parameter so quotes inside it cannot break or hijack the statement
exec sp_executesql @query, N'@json nvarchar(max)', @json = @json
Yes, I have done something like this at my current shop. Your question is too broad, so I will give you only a broad overview of what we have done.
We wrote a console app that gets a SQL Command from a meta table and executes it on the source into an in-memory DataTable. It then bulk-inserts that data into a staging table on the destination database.
Then we run a generic merge proc that looks at the system tables to get the primary keys and datatypes of the final destination table and constructs INSERT and UPDATE statements using dynamic SQL.
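For illustration only, a minimal sketch of that kind of metadata lookup; dbo.ExampleTable is a hypothetical target, and the real merge proc does considerably more with the result:

-- Minimal sketch: list a target table's columns, their types, and whether each belongs to the primary key
DECLARE @table sysname = N'dbo.ExampleTable';   -- hypothetical target table

SELECT  c.name AS column_name,
        t.name AS data_type,
        CASE WHEN ic.column_id IS NOT NULL THEN 1 ELSE 0 END AS is_primary_key
FROM    sys.columns AS c
JOIN    sys.types AS t ON t.user_type_id = c.user_type_id
LEFT JOIN sys.indexes AS i
       ON i.object_id = c.object_id AND i.is_primary_key = 1
LEFT JOIN sys.index_columns AS ic
       ON ic.object_id = i.object_id AND ic.index_id = i.index_id AND ic.column_id = c.column_id
WHERE   c.object_id = OBJECT_ID(@table)
ORDER BY c.column_id;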
Despite the well-meaning warnings of others, it's working well for us, though it does have some limitations, such as an inability to handle BLOB datatypes in a generic way. There may be other limitations that we just haven't encountered yet as well.
I've just written something to insert 10000 rows into a table for the purposes of load testing.
The data in each of the rows is the same and uninteresting.
I did it like this:
DECLARE @i int = 0
WHILE @i < 10000 BEGIN
    exec blah.CreateBasicRow;
    SET @i = @i + 1
END
All CreateBasicRow does is fill out the NOT NULL columns with something valid.
It turns out this is very slow, and it even seems to hang occasionally! What are my alternatives? Would it be better to write something that generates a long file with all the data repeated in fewer INSERT clauses? Are there any other options?
Update
A constraint is that this needs to be in a form that sqlcmd can deal with; our database versioning process produces SQL files to be run by sqlcmd. So I could generate a patch file with the data in a different form, but I couldn't use a different tool to insert the data.
You can speed this exact code up by wrapping a transaction around the loop. That way SQL Server does not have to harden the log to disk on each iteration (possibly multiple times depending on how often you issue a DML statement in that proc).
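For example, a minimal sketch of the same loop wrapped in one transaction (keeping blah.CreateBasicRow from the question):

-- One transaction around the whole loop: the log is hardened once at COMMIT instead of once per iteration
BEGIN TRANSACTION;

DECLARE @i int = 0;
WHILE @i < 10000
BEGIN
    EXEC blah.CreateBasicRow;
    SET @i = @i + 1;
END

COMMIT TRANSACTION;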
That said, the fastest way to go is to insert all records at once. Something like
insert into Target
select someComputedColumns
from Numbers n
WHERE n.ID <= 10000
This should execute in <<1sec for typical cases. It breaks the encapsulation of using that procedure, though.
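If a Numbers table isn't available, a derived tally keeps this in a plain SQL file that sqlcmd can run; a minimal sketch, where Target and SomeNotNullColumn are placeholders for the real table and its NOT NULL columns:

-- Generate 10000 rows without a physical Numbers table
WITH n AS (
    SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ID
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
INSERT INTO Target (SomeNotNullColumn)
SELECT 'valid value'   -- the same uninteresting data in every row, as in the question
FROM n;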
I've been searching for a solution for how to get a file's name using SQL Server. I know it's possible in C#, but how is it done in SQL?
For example, I have a file (say, uploadfile.txt) located in C:\ that is about to be uploaded. I have a table which has a field "filename". How do I get the name of this file?
This is the script that I have as of the moment.
-- Insert to table
BULK INSERT Price_Template_Host
FROM 'C:\uploadfile.txt'
WITH
(
FIELDTERMINATOR = '\t',
ROWTERMINATOR = '\n'
)
-- Insert into transaction log table filename and datetime()
To the best of my knowledge, there is no direct method in T-SQL to locate a file on the file system. After all, this is not what the language is intended to be used for. As your script shows, BULK INSERT requires that the fully qualified file name already be known at the time the statement is called.
There are of course a whole variety of ways you could identify/locate a file outside of plain T-SQL: for example using SSIS, xp_cmdshell (which has security caveats), or a managed code module within SQL Server.
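For example, a minimal sketch of the xp_cmdshell route; it must be explicitly enabled, and the path and extension here are only placeholders:

-- Capture a directory listing into a temp table; xp_cmdshell returns one row per line of output
IF OBJECT_ID('tempdb..#files') IS NOT NULL DROP TABLE #files;
CREATE TABLE #files (filename nvarchar(260));

INSERT INTO #files (filename)
EXEC master.sys.xp_cmdshell 'dir /b C:\*.txt';

SELECT filename
FROM #files
WHERE filename IS NOT NULL;   -- xp_cmdshell appends a NULL row at the end of the output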
To provide you with specific guidance, it may help if you could share details of the business process that you are trying to implement.
I would personally attack this problem with an SSIS package, which would give you much more flexibility in terms of the load and subsequent logging. However, if you're set on doing this through T-SQL, consider exec'ing dynamically constructed SQL:
declare @cmd nvarchar(max), @filename nvarchar(255)

set @filename = 'C:\uploadfile.txt'

set @cmd =
    'BULK INSERT Price_Template_Host
     FROM ''' + @filename + '''
     WITH
     (
        FIELDTERMINATOR = ''\t'',
        ROWTERMINATOR = ''\n''
     )'

-- Debug only
print @cmd

-- Insert to table
exec(@cmd)

-- Insert into transaction log table filename and datetime()
insert into dbo.LoadLog (filename, TheTime)
values (@filename, getdate())
If I understand your question correctly, this parameterizes the filename so that you can capture it further down in the script.