Batch insert data, counting new inserts - SQL

Suppose I have a simple schema with a composite primary key, e.g.:
pk1: string
pk2: int
date: Timestamp
I am reading data from somewhere else in batches of about 50 and would like to store it. The data source I am pulling from is a sliding window, so I will receive data that I have already inserted; I can't just blindly insert, or I get a PK constraint violation.
I would like a reasonable way to insert the new items as a batch, while also knowing how many new items I actually inserted, for logging purposes.

Doing the insert
For PostgreSQL 9.5+, you can use INSERT ... ON CONFLICT DO NOTHING.
Example:
INSERT INTO users (id, user_name, email)
VALUES (1, 'hal', 'hal@hal.hal')
ON CONFLICT DO NOTHING;
For earlier versions (9.x, I think), it is possible to create a CTE from raw VALUES and then insert from there:
WITH batch (id, user_name, email) AS (
VALUES
(1, 'hal', 'hal@hal.hal'),
(2, 'sal', 'sal@sal.sal')
)
INSERT INTO users (id, user_name, email)
SELECT id, user_name, email
FROM batch
WHERE batch.id NOT IN (SELECT id FROM users);
Or, instead of using a CTE, stage the values in a staging table that is truncated after each batch is processed.
Also, note that it might be necessary to explicitly cast strings to the appropriate data types when the CTE method is used, since PostgreSQL has to infer the column types of a bare VALUES list.
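For example, casting the values in the first row is enough for the column types to be inferred; a sketch of just the CTE part:
WITH batch (id, user_name, email) AS (
VALUES
(1::integer, 'hal'::text, 'hal@hal.hal'::text),
(2, 'sal', 'sal@sal.sal')
)
SELECT id, user_name, email FROM batch;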
A third option would be to implement this using a stored procedure & trigger. This is more complicated than the other two, but would work with earlier versions of PostgreSQL.
Logging
Both of those methods report the number of rows inserted, but the logging itself has to be performed by the database client.
e.g. in Python, the psycopg2 library is used to interact with PostgreSQL, and psycopg2 cursor objects have a rowcount property. I'm sure well-designed libraries in other languages / frameworks implement the same functionality somehow. Logging the number of rows inserted has to be done from the part of the program that interacts with the database.
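If the count is also needed on the SQL side rather than from the client's rowcount, a data-modifying CTE can return it directly. A sketch, reusing the users table from above (PostgreSQL 9.5+):
WITH ins AS (
INSERT INTO users (id, user_name, email)
VALUES
(1, 'hal', 'hal@hal.hal'),
(2, 'sal', 'sal@sal.sal')
ON CONFLICT DO NOTHING
RETURNING 1
)
SELECT count(*) AS rows_inserted FROM ins;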
However, if the logs of how many rows are inserted are required in the same database, then both the upsert & the logging may be performed via a single trigger + stored procedure.
Finally, as this is a special case of upsert, more information can be found by searching for postgresql upsert on Stack Overflow or other sites. I found the following page from the PostgreSQL wiki very informative:
https://wiki.postgresql.org/wiki/UPSERT#PostgreSQL_.28today.29

Related

how to have postgres ignore inserts with a duplicate key but keep going

I am inserting records from an in-memory collection into Postgres and want the database to ignore any record that already exists in the database (by virtue of having the same primary key) but keep going with the rest of my inserts.
I'm using Clojure and HugSQL, btw, but I'm guessing the answer is language agnostic.
As I'm essentially treating the database as a set in this way, I may be engaging in an antipattern.
If you're using Postgres 9.5 or newer (which I assume you are, since it was released back in January 2016), there's a very useful ON CONFLICT clause you can use:
INSERT INTO mytable (id, col1, col2)
VALUES (123, 'some_value', 'some_other_value')
ON CONFLICT (id) DO NOTHING
I had to solve this for an early version of Postgres, so instead of having a single INSERT statement with multiple rows, I used multiple INSERT statements and ran all of them in a script, making sure that an error would not stop the script (I used Adminer with "stop on error" unchecked). That way the statements that didn't throw an error were executed, and all of the new entries got inserted.
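On pre-9.5 versions it can also be done in a single multi-row statement by filtering out existing keys with NOT EXISTS; a sketch using the table from the example above:
INSERT INTO mytable (id, col1, col2)
SELECT v.id, v.col1, v.col2
FROM (VALUES
(123, 'some_value', 'some_other_value'),
(124, 'another_value', 'one_more_value')
) AS v (id, col1, col2)
WHERE NOT EXISTS (
SELECT 1 FROM mytable m WHERE m.id = v.id
);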

FileTable and Foreign Key from another table

I'm trying to use FileTable with Entity Framework (I know it is not supported directly), so I use custom SQL commands to insert and delete (no update) the data. My problem is that I have a table which refers to the FileTable with a foreign key to the stream_id of the FileTable. If I insert into the FileTable, how can I get the stream_id back?
I want to use SqlBulkCopy to insert lots of files. I can bulk insert into the FileTable, but SqlBulkCopy won't tell me the inserted stream_id values.
If I execute single insert statements with SELECT SCOPE_IDENTITY() or something similar, the performance becomes worse.
I want to insert around 5,000 files (2 MB to 20 MB each) into the FileTable and connect them with my own table via the foreign key. Is this bad practice, and should I use a simple path column and store the data directly in the filesystem instead? I thought FileTable does exactly this for me, because I need to secure the database, and the files should stay in sync with it even if I restore a state from one hour or four days in the past. I cannot back up the database and the filesystem at exactly the same time so that they are 100 percent synchronized.
I want to use SqlBulkCopy to insert lots of files. I can bulk insert into the FileTable, but SqlBulkCopy won't tell me the inserted stream_id values.
SqlBulkCopy doesn't allow you to retrieve inserted identity values, or any other generated values.
Solution 1
You can find a lot of code snippets on the web for inserting into a temporary table using SqlBulkCopy. You can then insert from the temporary table into the destination table, using the OUTPUT clause to get the stream_id values.
It's a few more steps, but the performance is still very good.
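A sketch of that two-step approach (the FileTable name, the staging columns, and the use of the file name as the correlation key are assumptions):
-- Stage the files with SqlBulkCopy into a temp table first
CREATE TABLE #staging (
name nvarchar(255) NOT NULL,
file_stream varbinary(max) NOT NULL
);

-- (run SqlBulkCopy against #staging here)

-- Move the rows into the FileTable, capturing the generated stream_id values
DECLARE @new_ids TABLE (stream_id uniqueidentifier, name nvarchar(255));

INSERT INTO dbo.MyFileTable (name, file_stream)
OUTPUT inserted.stream_id, inserted.name INTO @new_ids
SELECT name, file_stream
FROM #staging;

-- @new_ids now maps each file name to its stream_id for the foreign key table
SELECT stream_id, name FROM @new_ids;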
Solution 2
Disclaimer: I'm the owner of the project Entity Framework Extensions
Disclaimer: I'm the owner of the project Bulk Operations
Neither library is free, but both make it easier to get around SqlBulkCopy's limitations, and both support outputting identity values.
// Easy to customize
var bulk = new BulkOperation<Customer>(connection);
bulk.BatchSize = 1000;
bulk.ColumnInputExpression = c => new { c.Name, c.FirstName };
bulk.ColumnOutputExpression = c => c.CustomerID;
bulk.ColumnPrimaryKeyExpression = c => c.Code;
bulk.BulkMerge(customers);
// Easy to use
var bulk = new BulkOperation(connection);
bulk.BulkInsert(dt);
bulk.BulkUpdate(dt);
bulk.BulkDelete(dt);
bulk.BulkMerge(dt);

PL/SQL embedded insert into table that may not exist

I much prefer using this 'embedded' style of insert in a PL/SQL block (as opposed to the execute immediate style of dynamic SQL, where you have to escape quotes etc.).
-- a contrived example
PROCEDURE CreateReport( customer IN VARCHAR2, reportdate IN DATE )
IS
BEGIN
-- drop table, create table with explicit column list
CreateReportTableForCustomer;
INSERT INTO TEMP_TABLE
VALUES ( customer, reportdate );
END;
/
The problem here is that Oracle checks at compile time whether TEMP_TABLE exists and has the correct number of columns, and throws a compile error if it doesn't.
So I was wondering if there's any way around that? Essentially I want to use a placeholder for the table name to trick Oracle into not checking whether the table exists.
EDIT:
I should have mentioned that a user is able to execute any 'report' (as above): a mechanism that will execute an arbitrary query but always write to TEMP_TABLE (in the user's schema). Thus each time the report proc is run, it drops TEMP_TABLE and recreates it with, most probably, a different column list.
You could use a dynamic SQL statement to insert into the maybe-existent temp_table, and then catch and handle the exception that occurs when the table doesn't exist.
Example:
execute immediate 'INSERT INTO '||TEMP_TABLE_NAME||' VALUES ( :customer, :reportdate )' using customer, reportdate;
Note that varying the table name in a dynamic SQL statement is not ideal, since the name cannot be bound and has to be concatenated in; if you can ensure the table name stays the same, that would be best.
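A minimal sketch of that catch-and-handle approach, assuming a fixed table name and the parameters from the question:
DECLARE
table_missing EXCEPTION;
PRAGMA EXCEPTION_INIT(table_missing, -942); -- ORA-00942: table or view does not exist
BEGIN
EXECUTE IMMEDIATE
'INSERT INTO temp_table VALUES ( :customer, :reportdate )'
USING customer, reportdate;
EXCEPTION
WHEN table_missing THEN
NULL; -- the table has not been created yet; skip (or create it) as appropriate
END;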
Maybe you should be using a global temporary table (GTT). These are permanent table structures that hold temporary data for an Oracle session. Many different sessions can insert data into the same GTT, and each will only be able to see their own data. The data is automatically deleted either on COMMIT or when the session ends, according to the GTT's definition.
You create the GTT (once only) like this:
create global temporary table my_gtt
(customer number, report_date date)
on commit delete rows; -- or "on commit preserve rows", as applicable
Then your programs can just use it like any other table - the only difference being it always begins empty for your session.
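For example (a sketch; only the inserting session sees its rows):
insert into my_gtt (customer, report_date) values (42, sysdate);
select * from my_gtt;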
Using GTTs is much preferable to dropping/recreating tables on the fly. If your application needs a different structure for each report, I strongly suggest you work out all the different structures that the reports need, and create a separate GTT for each, instead of creating ordinary tables at runtime.
That said, if this is just not feasible (and I've seen good examples when it's not, e.g. in a system that supports a wide range of ad-hoc requests from users), you'll have to go with the EXECUTE IMMEDIATE approach.

How can I remove a lock from a table in SQL Server 2005?

I am using a function in a stored procedure. The procedure contains a transaction that updates and inserts values into a table, while the function called by the procedure also fetches data from that same table.
The procedure hangs when it calls the function.
Is there any solution for this?
If I'm hearing you right, you're talking about an insert BLOCKING ITSELF, not two separate queries blocking each other.
We had a similar problem, an SSIS package was trying to insert a bunch of data into a table, but was trying to make sure those rows didn't already exist. The existing code was something like (vastly simplified):
INSERT INTO bigtable
SELECT customerid, productid, ...
FROM rawtable
WHERE NOT EXISTS (SELECT 1 FROM bigtable
WHERE bigtable.CustomerID = rawtable.CustomerID
AND bigtable.ProductID = rawtable.ProductID)
AND ... (other conditions)
This ended up blocking itself because the SELECT inside the WHERE NOT EXISTS was preventing the INSERT from occurring.
We considered a few different options, I'll let you decide which approach works for you:
Change the transaction isolation level (see this MSDN article). Our SSIS package was defaulted to SERIALIZABLE, which is the most restrictive. (note, be aware of issues with READ UNCOMMITTED or NOLOCK before you choose this option)
Create a UNIQUE index with IGNORE_DUP_KEY = ON (see the sketch below). This means we can insert ALL rows (and remove the WHERE NOT EXISTS clause altogether). Duplicates will be rejected, but the batch won't fail completely, and all other valid rows will still insert.
Change your query logic to do something like put all candidate rows into a temp table, then delete all rows that are already in the destination, then insert the rest.
In our case, we already had the data in a temp table, so we simply deleted the rows we didn't want inserted, and did a simple insert on the rest.
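A sketch of options 2 and 3, assuming bigtable is keyed on (CustomerID, ProductID); all names are illustrative:
-- Option 2: a unique index that discards duplicate rows instead of failing the batch
CREATE UNIQUE INDEX IX_bigtable_customer_product
ON bigtable (CustomerID, ProductID)
WITH (IGNORE_DUP_KEY = ON);

-- Option 3: stage the candidates, delete those that already exist, insert the rest
SELECT customerid, productid
INTO #candidates
FROM rawtable;

DELETE c
FROM #candidates c
INNER JOIN bigtable b
ON b.CustomerID = c.customerid
AND b.ProductID = c.productid;

INSERT INTO bigtable (CustomerID, ProductID)
SELECT customerid, productid
FROM #candidates;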
This can be difficult to diagnose. Microsoft has provided some information here:
INF: Understanding and resolving SQL Server blocking problems
A brute force way to kill the connection(s) causing the lock is documented here:
http://shujaatsiddiqi.blogspot.com/2009/01/killing-sql-server-process-with-x-lock.html
Some more Microsoft info here: http://support.microsoft.com/kb/323630
How big is the table? Do you have the same problem if you call the procedure from separate windows? Maybe the problem is related to the amount of data the procedure is working with and a lack of indexes.

Is it possible in SQL Server to create a function which could handle a sequence?

We are looking at various options for porting our persistence layer from Oracle to another database, and one we are looking at is MS SQL. However, we use Oracle sequences throughout the code, and because of this it seems the move will be a headache. I understand about IDENTITY columns, but that would be a massive overhaul of the persistence code.
Is it possible in SQL Server to create a function which could handle a sequence?
That depends on your current use of sequences in Oracle. Typically a sequence is read in an insert trigger.
From your question I guess that it is the persistence layer that generates the sequence value before inserting into the database (including the new PK).
In MSSQL, you can combine SQL statements with ';', so to retrieve the identity column of the newly created record, use INSERT INTO ...; SELECT SCOPE_IDENTITY().
Thus the command that inserts a record returns a result set with a single row and a single column containing the value of the identity column.
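For example (a sketch against an illustrative table with an identity column):
INSERT INTO RealTable (datacolumn) VALUES ('some value');
SELECT SCOPE_IDENTITY() AS NewID;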
You can of course turn this approach around, and create Sequence tables (similar to the dual table in Oracle), in something like this:
INSERT INTO SequenceTable (dummy) VALUES ('X');
SELECT @ID = SCOPE_IDENTITY();
INSERT INTO RealTable (ID, datacolumns) VALUES (@ID, @data1, @data2, ...)
I did this last year on a project. Basically, I just created a table with the name of the sequence, current value, & increment amount.
Then I created 4 procs:
GetCurrentSequence( sequenceName)
GetNextSequence( sequenceName)
CreateSequence( sequenceName, startValue, incrementAmount)
DeleteSequence( sequenceName)
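A minimal sketch of what the table and the increment proc might look like (all names and types here are assumptions):
CREATE TABLE SequenceTable (
SequenceName sysname PRIMARY KEY,
CurrentValue int NOT NULL,
IncrementAmount int NOT NULL
);
GO
CREATE PROCEDURE GetNextSequence @sequenceName sysname
AS
BEGIN
-- increment and read in a single statement so concurrent callers never get the same value
DECLARE @next int;
UPDATE SequenceTable
SET @next = CurrentValue = CurrentValue + IncrementAmount
WHERE SequenceName = @sequenceName;
RETURN @next;
END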
But there is a limitation you may not appreciate: functions cannot have side effects. So you could create a function for GetCurrentSequence(...), but GetNextSequence(...) would need to be a proc, since you will probably want to increment the current sequence value. However, if it's a proc, you won't be able to use it directly in your insert statements.
So instead of
insert into mytable(id, ....) values( GetNextSequence('MySequence'), ....);
you will need to break it up into two statements:
declare @newID int;
exec @newID = GetNextSequence 'MySequence';
insert into mytable(id, ....) values(@newID, ....);
Also, SQL Server doesn't have any mechanism that can do something like
MySequence.Current
or
MySequence.Next
Hopefully, somebody will tell me I am incorrect with the above limitations, but I'm pretty sure they are accurate.
Good luck.
If you have a lot of code, you're going to want to do a massive overhaul of the code anyway; what works well in Oracle is not always going to work well in MSSQL. If you have a lot of cursors, for instance, while you could convert them line for line to MSSQL, you're not going to get good performance.
In short, this is not an easy undertaking.