Data Import Question: Should I use a cursor? - sql

I'm currently working on a SQL import routine to move data from a legacy application into a more modern, robust system. The routine imports data from a flat-file legacy table (stored as a .csv file) into SQL Server tables that follow the classic order/order-detail pattern. Here's what the tables look like:
**LEGACY_TABLE**
Cust_No
Item_1_No
Item_1_Qty
Item_1_Prc
Item_2_No
Item_2_Qty
Item_2_Prc
...
Item_7_No
Item_7_Qty
Item_7_Prc
As you can see, the legacy table is basically a 22 column spreadsheet that is used to represent a customer, along with up to 7 items and their quantity and purchase price, respectively.
The new table(s) look like this:
**INVOICE**
Invoice_No
Cust_No
**INVOICE_LINE_ITEM**
Invoice_No
Item_No
Item_Qty
Item_Prc
My quick-and-dirty approach has been to create a replica of the LEGACY_TABLE (let's call it LEGACY_TABLE_SQL) in SQL Server. This table will be populated from the .csv file using a database import that is already built into the application.
From there, I created a stored procedure to actually copy each of the values in the LEGACY_TABLE_SQL table to the INVOICE/INVOICE_LINE_ITEM tables as well as handle the underlying logical constraints (i.e. performing existence tests, checking for already open invoices, etc.). Finally, I've created a database trigger that calls the stored procedure when new data is inserted into the LEGACY_TABLE_SQL table.
The stored procedure looks something like this:
CREATE PROC IMPORT_PROCEDURE
@CUST_NO
@ITEM_NO
@ITEM_QTY
@ITEM_PRC
However, instead of calling the procedure once, I actually call the stored procedure seven times (once for each item) using a database trigger. I only execute the stored procedure when the ITEM_NO is NOT NULL, to account for blank items in the .csv file. Therefore, my trigger looks like this:
CREATE TRIGGER IMPORT_TRIGGER
if ITEM_NO_1 IS NOT NULL
begin
exec IMPORT_PROCEDURE CUST_NO, ITEM_NO_1, ITEM_QTY_1, ITEM_PRC_1
end
...so on and so forth.
I'm not sure that this is the most efficient way to accomplish this task. Does anyone have any tips or insight that they wouldn't mind sharing?

I would separate the import process from any triggers. A trigger is useful if you're going to have rows being constantly added to the import table from a constantly running, outside source. It doesn't sound like this is your situation though, since you'll be importing an entire file at once. Triggers tend to hide code and can be difficult to work with in some situations.
How often are you importing these files?
I would have an import process that is mostly stand-alone. It might use stored procedures or tables in the database, but I wouldn't use triggers. A simple approach would be something like below. I've added a column to the Legacy_Invoices (also renamed to something that's more descriptive) so that you can track when items have been imported and from where. You can expand this to track more information if necessary.
Also, I don't see how you're tracking invoice numbers in your code. I've assumed an IDENTITY column in the Legacy_Invoices. This is almost certainly insufficient since I assume that you're creating invoices in your own system as well (outside of the legacy system). Without knowing your invoice numbering scheme though, it's impossible to give a solution there.
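For reference, the staging columns that the script below relies on could be added with something along these lines (a sketch only; the IDENTITY column is just a stand-in until a real invoice numbering scheme is decided):
ALTER TABLE Legacy_Invoices ADD
    invoice_no      INT IDENTITY(1,1),                           -- placeholder numbering; replace with your real scheme
    import_status   VARCHAR(30) NOT NULL DEFAULT 'Awaiting Import',
    import_datetime DATETIME NULL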
BEGIN TRAN

DECLARE @now DATETIME = GETDATE()

UPDATE Legacy_Invoices
SET import_datetime = @now
WHERE import_status = 'Awaiting Import'

INSERT INTO dbo.Invoices (invoice_no, cust_no)
SELECT DISTINCT invoice_no, cust_no
FROM Legacy_Invoices
WHERE import_datetime = @now

UPDATE Legacy_Invoices
SET import_status = 'Invoice Imported'
WHERE import_datetime = @now

INSERT INTO dbo.Invoice_Lines (invoice_no, item_no, item_qty, item_prc)
SELECT invoice_no, item_no_1, item_qty_1, item_prc_1
FROM Legacy_Invoices LI
WHERE import_datetime = @now AND
      import_status = 'Invoice Imported' AND
      item_no_1 IS NOT NULL

UPDATE Legacy_Invoices
SET import_status = 'Item 1 Imported'
WHERE import_datetime = @now AND
      import_status = 'Invoice Imported'

<Repeat for item_no_2 through 7>

COMMIT TRAN
Here's a big caveat though. While cursors are normally not desirable in SQL and you want to use set-based processing versus RBAR (row by agonizing row) processing, data imports are often an exception.
The problem with the above is that if one row fails, the whole import step fails. It's also very difficult to run a single entity (an invoice plus its line items) through business logic when you're importing them in bulk. This is one place where SSIS really shines. It's extremely fast (assuming that you set it up properly), even when importing one entity at a time. You can then put all sorts of error handling in it to make sure that the import runs smoothly. One import row has an erroneous invoice number? No problem, mark it as an error and move on. A row has item #2 filled in, but no item #1, or has a price without a quantity? No problem, mark the error and move on.
For a single import I might stick with the code above (adding in appropriate error handling of course), but for a repeating process I would almost certainly use SSIS. You can import millions of rows in seconds or minutes even with individual error handling on each business entity.
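If you do stay in plain T-SQL but want that "mark the error and move on" behaviour per invoice, one rough way to get it is a loop over the staged customers with TRY/CATCH, flagging failures instead of aborting the whole batch. This is only a sketch; it assumes the same staging columns as above and one invoice per cust_no:
DECLARE @cust_no INT

DECLARE cust_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT cust_no FROM Legacy_Invoices WHERE import_status = 'Awaiting Import'

OPEN cust_cursor
FETCH NEXT FROM cust_cursor INTO @cust_no

WHILE @@FETCH_STATUS = 0
BEGIN
    BEGIN TRY
        BEGIN TRAN
        -- insert the invoice and its line items for this one customer here,
        -- using the same INSERT statements as above restricted to @cust_no
        UPDATE Legacy_Invoices SET import_status = 'Imported' WHERE cust_no = @cust_no
        COMMIT TRAN
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRAN
        -- flag the failure (you could also log ERROR_MESSAGE() to a wider column) and keep going
        UPDATE Legacy_Invoices SET import_status = 'Error' WHERE cust_no = @cust_no
    END CATCH

    FETCH NEXT FROM cust_cursor INTO @cust_no
END

CLOSE cust_cursor
DEALLOCATE cust_cursor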
If you have any problems with getting SSIS running (there are tutorials all over the web and on MSDN at Microsoft) then post any problems here and you should get quick answers.

I'm not sure why you would add a trigger. Will you be continuing to use the LEGACY_TABLE_SQL?
If not, then how about this one-time procedure? It uses Oracle syntax but can be adapted to most databases.
CREATE OR REPLACE PROCEDURE MIGRATE IS
  CURSOR all_data IS
    SELECT invoice_no, cust_no, Item_1_no, Item_1_qty, Item_1_prc, ...
    FROM LEGACY_TABLE_SQL;
BEGIN
  FOR data IN all_data LOOP
    INSERT INTO INVOICE (invoice_no, cust_no)
    VALUES (data.invoice_no, data.cust_no);
    IF data.Item_1_no IS NOT NULL THEN
      INSERT INTO INVOICE_LINE_ITEM (invoice_no, item_no, item_qty, item_prc)
      VALUES (data.invoice_no, data.Item_1_no, data.Item_1_qty, data.Item_1_prc);
    END IF;
    -- further inserts for each item (Item_2 through Item_7)
  END LOOP;
  COMMIT;
END;
This can be further optimized in Oracle with BULK COLLECT.
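As a rough illustration of that optimization (again assuming the invoice_no column exists on LEGACY_TABLE_SQL), the header insert could be batched like this:
DECLARE
  TYPE t_num_tab IS TABLE OF NUMBER;
  l_invoice_no t_num_tab;
  l_cust_no    t_num_tab;
BEGIN
  -- fetch the legacy rows in one round trip instead of one fetch per loop iteration
  SELECT invoice_no, cust_no
  BULK COLLECT INTO l_invoice_no, l_cust_no
  FROM LEGACY_TABLE_SQL;

  -- insert all invoice headers in a single bulk-bound statement
  FORALL i IN 1 .. l_invoice_no.COUNT
    INSERT INTO INVOICE (invoice_no, cust_no)
    VALUES (l_invoice_no(i), l_cust_no(i));

  -- the line-item inserts (Item_1 through Item_7) can be handled the same way
  COMMIT;
END;
This is a sketch only; for very large tables you would add a LIMIT clause and loop so the collections don't exhaust memory.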
I would create the INVOICE_LINE_ITEM table with default values of 0 for all items.
I would also consider these possibilities:
Is the invoice number really unique now and in the future? It may be a good idea to add a pseudo key based off a sequence.
Is there any importance to null item_no entries? Could this indicate a back order, a short shipment, or just bad data entry?
EDIT: As you advise that you will continue to use the legacy table, you need to prioritize what you want: is efficiency and performance your number one priority, or maintainability, or a synchronous transaction?
For example:
- if performance is not really critical, then implement this as you outlined
- if this will have to be maintained, then you might want to invest more into the coding
- if you do not require a synchronous transaction, you could add a column to your LEGACY_TABLE_SQL called processed, with a default value of 0. Then, once a day or hour, schedule a job to pick up all the orders that have not been processed (see the sketch below).
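A minimal sketch of that asynchronous variant, assuming SQL Server on the receiving side and keeping the same invoice_no assumption as above:
ALTER TABLE LEGACY_TABLE_SQL ADD processed BIT NOT NULL DEFAULT 0

-- scheduled job (e.g. SQL Server Agent, hourly): import only rows not yet processed
INSERT INTO INVOICE (Invoice_No, Cust_No)
SELECT invoice_no, Cust_No
FROM LEGACY_TABLE_SQL
WHERE processed = 0

INSERT INTO INVOICE_LINE_ITEM (Invoice_No, Item_No, Item_Qty, Item_Prc)
SELECT invoice_no, Item_1_No, Item_1_Qty, Item_1_Prc
FROM LEGACY_TABLE_SQL
WHERE processed = 0 AND Item_1_No IS NOT NULL
-- ...repeat for Item_2 through Item_7...

UPDATE LEGACY_TABLE_SQL
SET processed = 1
WHERE processed = 0
In a real job you would capture the candidate rows first (or wrap this in a transaction) so rows arriving mid-run aren't flagged without being imported.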

Related

How do you deduplicate records in a BigQuery table?

We have a script that should run daily at 12 am on a GCP Cloud Function, triggered by Cloud Scheduler, that sends data to a table in BigQuery.
Unfortunately, the cron job used to send the data every minute at 12 am, which means the file was uploaded 60 times instead of only once.
The cron timer was * * 3 * * * instead of 00 3 * * *
How can we fix the table?
Note that the transferred data has since been deleted from the source. So far we have relied on selecting the unique values at query time, but the table is getting too large.
Any help would be much appreciated
I have two options for you, plus a comment on how to avoid this in future. I recommend reading and comparing both options before proceeding.
Option One
If this is a one-off fix, I recommend you simply
navigate to the table (your_dataset.your_table) in the UI
click 'snapshot' and create a snapshot in case you make a mistake in the next part
run SELECT DISTINCT * FROM your_dataset.your_table in the UI
click 'save results' and select 'bigquery table' then save as a new table (e.g. your_dataset.your_table_deduplicated)
navigate back to the old table and click the 'delete' button, then authorise the deletion
navigate to the new table and click the 'copy' button, then save it in the location the old table was in before (i.e. call the copy your_dataset.your_table)
delete your_dataset.your_table_deduplicated
This procedure will result in your replacing the current table with another with the same schema but without duplicated records. You should check that it looks as you expect before you discard your snapshot.
Option Two
A quicker approach, if you're comfortable with it, would be using the Data Manipulation Language (DML).
There is a DELETE statement, but you'd have to construct an appropriate WHERE clause to only delete the duplicate rows.
There is a simpler approach, which is equivalent to option one and just requires you to run this query:
CREATE OR REPLACE TABLE your_dataset.your_table AS
SELECT DISTINCT * FROM your_dataset.your_table
Again, you may wish to take a snapshot before running this.
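If you would rather take that snapshot with SQL as well, BigQuery supports snapshot DDL; something like this (placeholder names, optional 7-day expiry):
CREATE SNAPSHOT TABLE your_dataset.your_table_snapshot
CLONE your_dataset.your_table
OPTIONS (expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY));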
The Future
If you have a cloud function that sends data to BigQuery on a schedule, then best-practice would be for this function to be idempotent (i.e. doesn't matter how many times you run it, if the input is the same the output is the same).
A typical pattern would be to add a stage to your function to pre-filter the new records.
Depending on your requirements, this stage could
prepare the new records you want to insert, which should have some unique, immutable ID field
SELECT some_unique_id FROM your_dataset.your_table -> old_record_ids
filter the new records, e.g. in python new_records = [record for record in prepared_records if record["id"] not in old_record_ids]
upload only the records that don't exist yet
This will prevent the sort of issues you have encountered here.
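If you would rather push that existence check into BigQuery itself, a MERGE keyed on the unique ID gives the same "insert only if new" behaviour. A sketch with assumed table and column names, where each new batch is first loaded into a staging table with the same schema:
MERGE your_dataset.your_table T
USING your_dataset.your_staging_table S   -- newly prepared records loaded here first
ON T.some_unique_id = S.some_unique_id
WHEN NOT MATCHED THEN
  INSERT ROW;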

SSIS Inserting incrementing ID with starting range into multiple tables at a time

Is there a reliable way to solve what should be an easy task?
I've got a number of XML files which will be converted into 6 SQL tables (via SSIS).
Before the end of this process I need to add a new column (in fact, one common to all tables) to each of them.
This column represents an ID with a starting value and a +1 increment step, like (350000, 1).
Yes, I know how to solve this at the SSMS/SQL stage, but I need a solution at SSIS's pre-SQL conversion level.
I'm sure there are well-known patterns for dealing with this.
I am going to take a stab at this. Just to be clear, I don't have a lot of information in your question to go on.
Most XML files that I have dealt with have a common element (let's call it a customer) with one to many attributes (this can be invoices, addresses, email, contacts, etc).
So your table structure will be somewhat star shaped around the customer.
So your XML will have a core customer information on a 1 to 1 basis that can be loaded into a single main table, and will have array information of invoices and an array of addresses etc. Those arrays would be their own tables referencing the customer as a key.
I think you are asking how to create that key.
Load the customer data first and return the identity column to be used as a foreign key when loading the other tables.
I find it easiest to do this in a script component. I'm only going to explain how to get the key back; I personally would handle the whole process in C# (deserializing and all).
Add this to your using block:
using System.Data.OleDb;
Add this into your main or row-processing method, depending on where the script task / component sits:
string SQL = @"INSERT INTO Customer (CustName, field1, field2, ...)
               VALUES (?, ?, ?, ...);
               SELECT CAST(SCOPE_IDENTITY() AS INT);";
OleDbCommand cmd = new OleDbCommand();
cmd.Connection = new OleDbConnection(connectionString);  // connection string assumed to come from your package configuration
cmd.CommandType = System.Data.CommandType.Text;
cmd.CommandText = SQL;
cmd.Parameters.AddWithValue("@p1", Row.CustName);  // OLE DB parameters are positional (?); the name is only a label
...
cmd.Connection.Open();
int CustomerKey = (int)cmd.ExecuteScalar();  // ExecuteScalar returns the first column of the first row, which here is SCOPE_IDENTITY()
cmd.Connection.Close();
Now you can use CustomerKey for all of the other tables.
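If the requirement is really just that the generated key starts at 350000 and increments by 1, that can also be expressed on the table itself, so the script above only has to read the value back. A sketch with a hypothetical Customer table:
CREATE TABLE dbo.Customer
(
    CustomerKey INT IDENTITY(350000, 1) PRIMARY KEY,  -- seed 350000, step +1; SCOPE_IDENTITY() returns it after each insert
    CustName    NVARCHAR(100) NOT NULL
    -- other 1-to-1 customer columns ...
);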

How do I not waste Generator values when using them server side with Firebird?

Check this simple piece of code that uses a generator to create unique primary keys in a Firebird table:
CREATE OR ALTER TRIGGER ON_BEFOREINSERT_PK_BOOKING_ITEM FOR BOOKING_ITEM BEFORE INSERT POSITION 0
AS
BEGIN
IF ((NEW.booking_item_id IS NULL) OR (NEW.booking_item_id = 0)) THEN BEGIN
SELECT GEN_ID(LastIdBookingItem, 1) FROM RDB$DATABASE INTO :NEW.booking_item_id;
END
END!
This trigger grabs and increments then assigns a generated value for the booking item id thus creating an auto-incremented key for the BOOKING_ITEM table. The trigger even checks that the booking id has not already been assigned a value.
The problem is the auto-incremented value will be lost (wasted) if, for some reason, the BOOKING_ITEM record cannot be posted.
I have a couple of ideas on how to avoid this wasting but have concerns about each one. Here they are:
Decrement the counter if a posting error occurs. Within the trigger I set up a try-except block (do try-except blocks even exist in Firebird PSQL?) and run a SELECT GEN_ID(LastIdBookingItem, -1) FROM RDB$DATABASE on post exceptions. Would this work? What if another transaction sneaks in and increments the generator before I decrement it? That would really mess things up.
Use a temporary id. Set the id to some unique temp value that I change to the generator value I want in an AFTER INSERT trigger. This method feels somewhat contrived and requires a way of ensuring that the temp id is unique. But what if the booking_item_id was supplied client side; how would I distinguish that from a temp id? Plus, I need another trigger.
Use Transaction Control. This is like option 1. except instead of using the try-except block to reset the generator I start a transaction and then roll it back if the record fails to post. I don't know the syntax for using transaction control. I thought I read somewhere that SAVEPOINT/SET TRANSACTION is not allowed in PSQL. Plus the roll back would have to happen in the AFTER INSERT trigger so once again I need another trigger.
Surely this is an issue for any Firebird developer that wants to use Generators. Any other ideas? Is there something I'm missing?
Sequences are outside transaction control, and meddling with them to get 'gapless' numbers will only cause troubles because another transaction could increment the sequence as well concurrently, leading to gaps+duplicates instead of no gaps:
start: generator value = 1
T1: increment: value is 2
T2: increment: value is 3
T1: 'rollback', decrement: value is 2 (and not 1 as you expect)
T3: increment: value is 3 => duplicate value
Sequences should primarily be used for generating artificial primary keys, and you shouldn't care about the existence of gaps: it doesn't matter as long as the number uniquely identifies the record.
If you need an auditable sequence of numbers, and the requirement is that there are no gaps, then you shouldn't use a database sequence to generate it. You could use a sequence to assign numbers after creating and committing the invoice itself (so that it is sure to be persisted). An invoice without a number is simply not final yet. However, even here there is a window of opportunity for a gap, e.g. if an error or other failure occurs between assigning the invoice number and committing.
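For example (a sketch only, with hypothetical table, column and generator names), the numbering could be done as its own short statement once the invoice is known to be persisted:
UPDATE invoice
SET invoice_number = GEN_ID(gen_invoice_number, 1)
WHERE invoice_id = :committed_invoice_id   -- parameter bound by the client after the commit
  AND invoice_number IS NULL;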
Another way might be to explicitly create a zero-invoice (marked as cancelled/number lost) with the gap numbers, so that the auditor knows what happened to that invoice.
Depending on local law and regulations, you shouldn't 're-use' or recycle lost numbers as that might be construed as fraud.
You might find other ideas in "An Auditable Series of Numbers". This also contains a Delphi project using IBObjects, but the document itself describes the problem and possible solutions pretty well.
What if, instead of using generators, you create a table with as many columns as the number of generators, giving each column the name of a generator. Something like:
create table generators
(
invoiceNumber integer default 0 not null,
customerId integer default 0 not null,
other generators...
)
Now you have a table where you can increment the invoice number using SQL inside a transaction, something like:
begin transaction;
update generators set invoiceNumber = invoiceNumber + 1 returning invoiceNumber;
insert into invoices (...) values (...);
commit;
If anything goes wrong, the transaction is rolled back together with the new invoice number, so I think there would be no more gaps in the sequence.
Enio

Can I insert in a programmatically defined PostgreSQL table using SQL language?

Context: I'm trying to INSERT data in a partitioned table. My table is partitioned by months, because I have lots of data (and a volume expected to increase) and the most recent data is more often queried. Any comment on the partition choice is welcome (but an answer to my question would be more than welcome).
The documentation has a partition example in which, when inserting a line, a trigger is called that checks the new data date and insert it accordingly in the right "child" table. It uses a sequence of IF and ELSIF statements, one for each month. The guy (or gal) maintaining this has to create a new table and update the trigger function every month.
I don't really like this solution. I want to code something that will work perfectly, that I won't need to update every now and then, and that will outlive me and my great-great-grandchildren.
So I was wondering if I could make a trigger that would look like this:
INSERT INTO get_the_appropriate_table_name(NEW.date) VALUES (NEW.*);
Unfortunately all my attempts have failed. I tried using "regclass" tricks, but with no success.
In short, I want to make up a string and use it as a table name. Is that possible?
I was just about to write a trigger function using EXECUTE to insert into a table according to the date_parts of now(), or create it first if it doesn't exist yet, when I found that somebody had already done that for us, right under the chapter of the docs you are referring to yourself:
http://www.postgresql.org/docs/9.0/interactive/ddl-partitioning.html
Scroll all the way down to the "User Comments" section. Laban Mwangi posted an example.
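For illustration, a dynamic-insert trigger function along those lines might look roughly like this (the measurements parent table, its date column, and the measurements_YYYY_MM child naming scheme are assumptions, and the create-if-missing part is left out):
CREATE OR REPLACE FUNCTION insert_into_partition()
  RETURNS trigger AS
$$
BEGIN
   -- build the child table name from the row's date and route the whole row there dynamically
   EXECUTE format('INSERT INTO %I SELECT ($1).*',
                  'measurements_' || to_char(NEW.date, 'YYYY_MM'))
   USING NEW;
   RETURN NULL;  -- cancel the insert into the parent table; the row went to the child
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER measurements_insert_trigger
BEFORE INSERT ON measurements
FOR EACH ROW EXECUTE PROCEDURE insert_into_partition();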
Update:
The /interactive branch of the Postgres manual has since been removed, links are redirected. So the comment I was referring to is gone. Look to these later, closely related answers for detailed instructions:
INSERT with dynamic table name in trigger function
Table name as a PostgreSQL function parameter
For partitioning, there are better solutions by now, like range partitioning in Postgres 10 or later. Example:
Better database for “keep always the 5 latest entries per ID and delete older”?

Reporting Services / Supporting robust filtering

I'm looking for sort of a 'best practice' way to tackle this common scenario. I think the question is best asked with an example. Let's assume the following:
The goal is to write an 'order summary' report which displays a listing of orders based on various filtering criteria.
Example: The user wants to report on all orders created between X and Y dates
Example: The user wants to report on all orders with the status of 'open'
Example: The user wants to report on all orders generated by another user XYZ
Example: The user wants to report on all orders between $1000 and $10000
These reports will likely be launched from different pages, but perhaps there might be an 'advanced search' page which allows them to check/uncheck filters and define parameters
I want to use remote processing to generate the report
Creating a single report with all of these filters implemented via report parameters and report filters becomes cumbersome and unmaintainable VERY quickly. This leads me to believe that I should create a single stored procedure which accepts all of the possible filter values (and a NULL if the result set should not be filtered by the parameter).
Do you agree with this assessment?
If so, I am not a TSQL expert and would like to have some general advice on how to implement this stored procedure. So far I am doing it like this:
Create a table variable of order IDs, @resultset
Populate @resultset initially via the first filter (I chose start and stop date)
For each filter:
If the filter is defined, create a table variable @tempresultset and insert all records from @resultset WHERE (filter is applicable)
Delete from @resultset, then insert into @resultset select orderid from @tempresultset
Return @resultset after all filters have been applied
This just doesn't feel right or efficient... Is there a better way to approach this?
Any other suggestions or advice on how to approach this general problem would be greatly appreciated. I feel somewhat lost on the proper way to implement this solution to what seems like should be a very common problem.
After some research I've found a good way to implement these optional filters within a single SELECT statement in a stored procedure:
It looks something like:
SELECT ordernumber
FROM orders
WHERE
    --Filter #1 - based on Parameter #1
    (@param1 IS NULL OR somefield = @param1)
    --Filter #2 - based on Parameter #2
    AND (@param2 IS NULL OR somefield2 = @param2)
    --Filter #3 - based on Parameter #3
    AND (@param3 IS NULL OR somefield3 = @param3)
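One caveat with this catch-all pattern: SQL Server may cache a plan built for one combination of parameters and reuse it for a very different one. If plan quality becomes a problem, a common (though not free) mitigation is to request a fresh plan per execution:
SELECT ordernumber
FROM orders
WHERE (@param1 IS NULL OR somefield  = @param1)
  AND (@param2 IS NULL OR somefield2 = @param2)
  AND (@param3 IS NULL OR somefield3 = @param3)
OPTION (RECOMPILE)  -- compile with the actual parameter values on each call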