I have developed an online survey that stores my data in a Microsoft SQL 2005 database. I have written a set of outlier checks on my data in R. The general workflow for these scripts is:
Read data from SQL database with sqlQuery()
Perform outlier analysis
Write offending respondents back to database in separate table using sqlSave()
The table I am writing back to has the structure:
CREATE TABLE outliers2(
modelid int
, password varchar(50)
, reason varchar(50),
Constraint PK_outliers2 PRIMARY KEY(modelid, reason)
)
GO
As you can see, I've set the primary key to be modelid and reason. The same respondent may be an outlier for multiple checks, but I do not want to insert the same modelid and reason combo for any respondent.
Since we are still collecting data, I would like to be able to update these scripts on a daily / weekly basis as I develop the models I am estimating on the data. Here is the general form of the sqlSave() command I'm using:
sqlSave(db, db.insert, "outliers2", append = TRUE, fast = FALSE, rownames = FALSE)
where db is a valid ODBC Connection and db.insert has the form
> head(db.insert)
modelid password reason
1 873 abkd WRONG DIRECTION
2 875 ab9d WRONG DIRECTION
3 890 akdw WRONG DIRECTION
4 905 pqjd WRONG DIRECTION
5 941 ymne WRONG DIRECTION
6 944 okyt WRONG DIRECTION
sqlSave() chokes when it tries to insert a row that violates the primary key constraint and does not continue with the other records for the insert. I would have thought that setting fast = FALSE would have alleviated this problem, but it doesn't.
Any ideas on how to get around this problem? I could always drop the table at the beginning of the first script, but that seems pretty heavy handed and will undoubtedly lead to problems down the road.
In this case, everything is working as expected. You're uploading everything as a batch, and SQL Server stops the batch as soon as it finds an error. Unfortunately, I don't know of a graceful built-in solution. But I think it is possible to build a system in the database to handle this more efficiently. I like doing data storage/management in databases rather than within R, so my solution is very database heavy. Others may offer you a solution that is more R oriented.
First, create a simple table, without constraints, to hold your new rows and adjust your sqlSave statement accordingly. This is where R will upload the information to.
CREATE TABLE tblTemp(
modelid int
, password varchar(50)
, reason varchar(50)
, duplicate int
)
GO
Your query that puts information into this table should default the column 'duplicate' to 'No'. I use a pattern where 1 = Y and 5 = N. You could also mark only the duplicates, but I tend to prefer to be explicit with my logic.
You will also need a place to dump all rows which violate the PK in outliers2.
CREATE TABLE tblDuplicates(
modelid int
, password varchar(50)
, reason varchar(50)
)
GO
OK. Now all you need to do is to create a trigger to move the new rows from tblTemp to outliers2. This trigger will move all duplicate rows to tblDuplicates for later handling, deletion, whatever.
CREATE TRIGGER FindDups
ON tblTemp
AFTER INSERT
AS
I'm not going to write out the entire trigger. I don't have a SQL Server 2005 instance to test it against, and I'd probably make a syntax error; I don't want to give you bad code. But here's what the trigger needs to do, with a rough sketch after the steps:
1) Identify all rows in tblTemp that would violate the PK in outliers2 and, where duplicates are found, set duplicate to 1. This would be done with an UPDATE statement.
2) Copy all rows where duplicate = 1 to tblDuplicates. You would do this with an INSERT INTO tblDuplicates ......
3) Now copy the non-duplicate rows to outliers2 with an INSERT INTO statement that looks almost exactly like the one used in step 2.
4) DELETE all rows from tblTemp, to clear it out for your next batch of updates. This step is important.
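Putting the header above and those four steps together, an untested sketch (assuming the 1 = Y / 5 = N convention) might look like this; treat it as a starting point rather than production code:
CREATE TRIGGER FindDups
ON tblTemp
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Step 1: flag rows that would violate the PK on outliers2
    UPDATE t
    SET duplicate = 1
    FROM tblTemp t
    INNER JOIN outliers2 o
        ON o.modelid = t.modelid
       AND o.reason = t.reason;

    -- Step 2: park the flagged rows in tblDuplicates for later review
    INSERT INTO tblDuplicates (modelid, password, reason)
    SELECT modelid, password, reason
    FROM tblTemp
    WHERE duplicate = 1;

    -- Step 3: copy the clean rows into outliers2
    INSERT INTO outliers2 (modelid, password, reason)
    SELECT modelid, password, reason
    FROM tblTemp
    WHERE duplicate = 5;

    -- Step 4: empty the staging table for the next batch
    DELETE FROM tblTemp;
END
GO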
The nice part about doing it this way is that sqlSave() won't error out just because you have a violation of your PK, and you can deal with the duplicates at a later time, like tomorrow. :-)
I have several tables that I need to contain a "Null" value, so that if another table links to that particular record, it basically gets "Nothing". All of these tables have differing numbers of records - if I stick something on the end, it gets messy trying to find the "Null" record. So I would want to perform an INSERT query to append a record that either has the value 0 for the ID field, or a fixed number like 9999999. I've tried both, and Access doesn't let me run the query because of a key violation.
The thing is, I've run the same query before, and it's worked fine. But then I had to delete all the data and re-upload it, and now that I'm trying it again, it's not working. Here is the query:
INSERT INTO [Reading] ([ReadingID], [EntryFK], [Reading], [NotTrueReading])
VALUES (9999999, 0, "", FALSE)
In place of 9999999 I've also tried 0. Both queries fail because of key violations.
I know that this isn't good db design. Nonetheless, is there any way to make this work? I'm not sure why I can't do it now whereas I could do it before.
I'm not sure if I'm fully understanding the issue here, but there may be a couple of reasons why this isn't working. The biggest thing is that any sort of primary key column has to be unique for every record in your lookup table. Like you mentioned above, 0 is a pretty common value for 'unknown' so I think you're on the right track.
Does 0 or 9999999 already exist in [Reading]? If so, that could be one explanation. When you wiped the table out before, did you completely drop and recreate the table or just truncate it? Depending on how the table was set up, some databases will 'remember' all of the keys they used in the past if you simply delete all of the data and re-insert it rather than dropping and recreating the table (that is, if I had 100 records in a table and then truncated it or deleted those records, the next time I insert a record into that table it will still start at 101 as its default PK value).
One thing you could do is drop and recreate the table and set it up so that the primary key is generated by the database itself if it isn't already (i.e. an 'identity' / AutoNumber type of column) and ensure that it starts at 0. Once you do that, the first record you will want to insert is your 'unknown' value (0), like so, where you let the database itself handle what the ReadingID will be:
INSERT INTO [Reading] ([EntryFK], [Reading], [NotTrueReading]) VALUES (0, "", FALSE)
Then insert the rest of your data. If the other table looking up to [Reading] has a null value in the FK column, then you can always join back to [Reading] on coalesce(fk_ReadingID,0) = Reading.ReadingID.
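As a rough sketch in generic SQL, assuming a hypothetical child table [Entry] with a nullable fk_ReadingID column (inside Access itself you would typically reach for Nz(fk_ReadingID, 0) instead of coalesce):
SELECT e.*, r.[Reading]
FROM [Entry] AS e
INNER JOIN [Reading] AS r
    ON coalesce(e.fk_ReadingID, 0) = r.ReadingID;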
Hope that helps in some capacity.
I have a database with 2 tables: CurrentTickets & ClosedTickets. When a user creates a ticket via web application, a new row is created. When the user closes a ticket, the row from currenttickets is inserted into ClosedTickets and then deleted from CurrentTickets. If a user reopens a ticket, the same thing happens, only in reverse.
The catch is that one of the columns being copied back to CurrentTickets is the PK column (TicketID), which has IDENTITY set to ON.
I know I can set IDENTITY_INSERT to ON, but as I understand it, this is generally frowned upon. I'm assuming that my database is a bit poorly designed. Is there a way for me to accomplish what I need without using IDENTITY_INSERT? How would I keep the TicketID column auto-incremented without making it an identity column? I figure I could add another column RowID and make that the PK, but I still want the TicketID column to auto-increment if possible while not being an identity column.
This just seems like bad design with 2 tables. Why not just have a single tickets table that stores all tickets? Then add a column called IsClosed, which is false by default. Once a ticket is closed you simply update the value to true, and you don't have to do any copying to and from other tables.
All of your code around this part of your application will be much simpler and easier to maintain with a single table for tickets.
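A minimal sketch of that single-table design, with hypothetical column names:
CREATE TABLE Tickets
(
    TicketID int IDENTITY(1,1) PRIMARY KEY,
    Title varchar(200) NOT NULL,
    IsClosed bit NOT NULL DEFAULT 0  -- 0 = open, 1 = closed
);

-- Closing (or reopening) a ticket is now just an update, no copying between tables:
UPDATE Tickets SET IsClosed = 1 WHERE TicketID = 42;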
Simple answer: DO NOT make it an identity column if you want to influence the next Id generated in that column.
Also, I think you have a really poor schema. Rather than having two tables, just add another column to your CurrentTickets table, something like Open BIT, set its value to 1 by default, and change the value to 0 when the client closes the ticket.
And you can turn it on/off as many times as the client changes his mind, without having to go through all the trouble of IDENTITY_INSERT and managing a whole separate table.
Update
Since you have now mentioned it's SQL Server 2014, you have access to something called a Sequence object.
You define the object once, and then every time you want a sequential number you just select the next value from it; it is a kind of hybrid between an identity column and a simple INT column.
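A minimal sketch, with an illustrative sequence name (sequences are available from SQL Server 2012 onwards):
CREATE SEQUENCE dbo.TicketNumbers
    AS int
    START WITH 1
    INCREMENT BY 1;

-- Each call returns the next number; use it as the TicketID when inserting
-- into CurrentTickets or ClosedTickets, no IDENTITY column required.
SELECT NEXT VALUE FOR dbo.TicketNumbers AS NextTicketID;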
To achieve this in the latest versions of SQL Server, use the OUTPUT clause (definition on MSDN).
OUTPUT clause used with a table variable:
DECLARE @MyTableVar TABLE (...);
DELETE FROM dbo.CurrentTickets
OUTPUT DELETED.* INTO @MyTableVar
WHERE <...>;
INSERT INTO ClosedTickets
SELECT * FROM @MyTableVar;
The second table should have an ID column, but without the IDENTITY property; uniqueness is enforced by the other table.
Let's say I have a table defined as follows:
CREATE TABLE SomeTable
(
P_Id int PRIMARY KEY IDENTITY,
CompoundKey varchar(255) NOT NULL
)
CompoundKey is a string with the primary key P_Id concatenated to the end, like Foo00000001, which comes from "Foo" + 00000001. At the moment, insertions into this table happen in two steps.
Insert a dummy record with a placeholder string for CompoundKey.
Update the CompoundKey column with the generated compound key.
I'm looking for a way to avoid the 2nd update entirely and do it all with one insert statement. Is this possible? I'm using MS SQL Server 2005.
p.s. I agree that this is not the most sensible schema in the world, and this schema will be refactored (and properly normalized) but I'm unable to make changes to the schema for now.
You could use a computed column; change the schema to read:
CREATE TABLE SomeTable
(
P_Id int PRIMARY KEY IDENTITY,
CompoundKeyPrefix varchar(255) NOT NULL,
CompoundKey AS CompoundKeyPrefix + CAST(P_Id AS VARCHAR(10))
)
This way, SQL Server will automagically give you your compound key in a new column, and will automatically maintain it for you. You may also want to look into the PERSISTED keyword for computed columns, which causes SQL Server to materialise the value in the data files rather than computing it on the fly. You can also add an index against the column should you so wish.
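A minimal sketch of the persisted and indexed variant, reusing the same hypothetical names (untested):
CREATE TABLE SomeTable
(
    P_Id int PRIMARY KEY IDENTITY,
    CompoundKeyPrefix varchar(255) NOT NULL,
    CompoundKey AS CompoundKeyPrefix + CAST(P_Id AS varchar(10)) PERSISTED
);

-- The persisted computed column can then be indexed like any other column:
CREATE INDEX IX_SomeTable_CompoundKey ON SomeTable (CompoundKey);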
A trigger would easily accomplish this
This is simply not possible.
The "next ID" doesn't exist and thus cannot be read to fulfill the UPDATE until the row is inserted.
Now, if you were sourcing your autonumbers from somewhere else you could, but I don't think that's a good answer to your question.
Even if you use triggers, an UPDATE is still executed; you just don't execute it manually.
You can obscure the population of the CompoundKey, but at the end of the day it's still going to be an UPDATE
I think your safest bet is just to make sure the UPDATE is in the same transaction as the INSERT or use a trigger. But, for the academic argument of it, an UPDATE still occurs.
Two things:
1) if you end up using the two-step insert/update, you must use a transaction! Otherwise other processes may see the database in an inconsistent state (i.e. see the record without its CompoundKey).
2) I would refrain from trying to paste the Id onto the end of CompoundKey in a transaction, trigger, etc. It is much cleaner to do it on output if you need it, e.g. in queries (select concat(CompoundKey, Id) as CompoundKeyId ...). If you need it as a foreign key in other tables, just use the pair (CompoundKey, Id).
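A sketch of that query-time approach, using this answer's column names; note that concat() only exists from SQL Server 2012 onwards, so on 2005 you would concatenate with + and an explicit cast:
SELECT Id,
       CompoundKey,
       CompoundKey + CAST(Id AS varchar(10)) AS CompoundKeyId
FROM SomeTable;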
I have a nightly SSIS process that exports a TON of data from an AS400 database system. Due to bugs in the AS400 DB software, occasional duplicate keys are inserted into data tables. Every time a new duplicate is added to an AS400 table, it kills my nightly export process. This issue has moved from being a nuisance to a problem.
What I need is to have an option to insert only unique data. If there are duplicates, select the first encountered row of the duplicate rows. Is there SQL Syntax available that could help me do this? I know of the DISTINCT ROW clause but that doesn't work in my case because for most of the offending records, the entirety of the data is non-unique except for the fields which comprise the PK.
In my case, it is more important for my primary keys to remain unique in my SQL Server DB cache than to have a full snapshot of the data. Is there something I can do to force this constraint on the export in SSIS/SQL Server without crashing the process?
EDIT
Let me further clarify my request. What I need is to ensure that the data in my exported SQL Server tables maintains the same keys that are maintained in the AS400 data tables. In other words, creating a unique row-count identifier wouldn't work, nor would inserting all of the data without a primary key.
If a bug in the AS400 software allows for mistaken, duplicate PKs, I want to either ignore those rows or, preferably, just select one of the rows with the duplicate key but not both of them.
This filtering should probably happen in the SELECT statement in my SSIS project, which connects to the mainframe through an ODBC connection.
I suspect that there may not be a "simple" solution to my problem. I'm hoping, however, that I'm wrong.
Since you are using SSIS, you must be using an OLE DB Source to fetch the data from the AS400, and you will be using an OLE DB Destination to insert data into SQL Server.
Let's assume that you don't have any transformations yet.
Add a Sort transformation after the OLE DB Source. In the Sort transformation, there is a check box option at the bottom to remove duplicate rows based on a given set of column values. Check all the fields but don't select the primary key that comes from the AS400. This will eliminate the duplicate rows but will still insert the data that you need.
I hope that is what you are looking for.
In SQL Server 2005 and above:
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY almost_unique_field ORDER BY id) rn
FROM import_table
) q
WHERE rn = 1
There are several options.
If you use the IGNORE_DUP_KEY option (http://www.sqlservernation.com/home/creating-indexes-with-ignore_dup_key.html) on your primary key, SQL will issue a warning and only the duplicate records will fail.
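A minimal sketch of that option, with illustrative table and column names; duplicate keys in an insert then raise a warning and are skipped rather than failing the whole statement:
CREATE TABLE dbo.ImportTarget
(
    PKID int NOT NULL,
    SomeValue varchar(50) NULL,
    CONSTRAINT PK_ImportTarget PRIMARY KEY (PKID)
        WITH (IGNORE_DUP_KEY = ON)
);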
You can also group/roll-up your data but this can get very expensive. What I mean by that is:
SELECT Id, MAX(value1), MAX(value2), MAX(value3) etc
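Spelled out a bit more as a sketch, assuming a hypothetical staging table (this gets very expensive on wide tables):
SELECT PKID,
       MAX(value1) AS value1,
       MAX(value2) AS value2,
       MAX(value3) AS value3
FROM staging_table
GROUP BY PKID;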
Another option is to add an identity column (and cluster on this for an efficient join later) to your staging table and then create a mapping in a temp table. The mapping table would be:
CREATE TABLE #mapping
(
RowID INT PRIMARY KEY CLUSTERED,
PKID INT
)
INSERT INTO #mapping
SELECT MIN(rowid), PKID FROM staging_table
GROUP BY PKID
INSERT INTO presentation_table
SELECT S.*
FROM Staging_table S
INNER JOIN #mapping M
ON S.RowID = M.RowID
If I understand you correctly, you have duplicated PKs that have different data in the other fields.
First, put the data from the other database into a staging table. I find it easier to research issues with imports (especially large ones) if I do this. Actually, I use two staging tables (and for this case I strongly recommend it): one with the raw data and one with only the data I intend to import into my system.
Now you can use an Execute SQL task to grab one of the records for each key (see @Quassnoi's answer for an idea of how to do that; you may need to adjust his query for your situation). Personally, I put an identity column into my staging table so I can identify which is the first or last occurrence of duplicated data. Then put the record you chose for each key into your second staging table. If you are using an exception table, copy the records you are not moving into it, and don't forget a reason code for the exception ("Duplicated key", for instance).
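A rough sketch of that step, assuming hypothetical staging tables staging_raw (with an identity column rowid), staging_clean, and staging_exceptions, plus a couple of illustrative data columns:
-- Keep the first occurrence of each key
INSERT INTO staging_clean (PKID, value1, value2)
SELECT s.PKID, s.value1, s.value2
FROM staging_raw s
INNER JOIN (
    SELECT PKID, MIN(rowid) AS first_rowid
    FROM staging_raw
    GROUP BY PKID
) f ON s.rowid = f.first_rowid;

-- Park every other occurrence in the exception table with a reason code
INSERT INTO staging_exceptions (PKID, value1, value2, reason)
SELECT s.PKID, s.value1, s.value2, 'Duplicated key'
FROM staging_raw s
WHERE s.rowid NOT IN (
    SELECT MIN(rowid)
    FROM staging_raw
    GROUP BY PKID
);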
Now that you have only one record per key in a staging table, your next task is to decide what to do about the other data that is not unique. If there are two different business addresses for the same customer, which do you choose? This is a matter of business-rule definition, not strictly speaking SSIS or SQL code. You must define the business rules for how you choose the data when it needs to be merged between two records (what you are doing is the equivalent of a de-duping process). If you are lucky, there is a date field or another way to determine which is the newest or oldest data, and that is the data they want you to use. In that case, once you have selected just one record, you are done with the initial transform.
More than likely, though, you may need different rules for each field to choose the correct data. In this case you write SSIS transforms in a data flow or Execute SQL tasks to pick the correct data and update the staging table.
Once you have the exact record you want to import, do the data flow to move it into the correct production tables.
I want to create a history table to track field changes across a number of tables in DB2.
I know history is usually done by copying an entire table's structure and giving it a suffixed name (e.g. user --> user_history). Then you can use a pretty simple trigger to copy the old record into the history table on an UPDATE.
However, for my application this would use too much space. It doesn't seem like a good idea (to me at least) to copy an entire record to another table every time a field changes. So I thought I could have a generic 'history' table which would track individual field changes:
CREATE TABLE history
(
history_id BIGINT GENERATED ALWAYS AS IDENTITY,
record_id INTEGER NOT NULL,
table_name VARCHAR(32) NOT NULL,
field_name VARCHAR(64) NOT NULL,
field_value VARCHAR(1024),
change_time TIMESTAMP,
PRIMARY KEY (history_id)
);
OK, so every table that I want to track has a single, auto-generated id field as the primary key, which would be put into the 'record_id' field. And the maximum VARCHAR size in the tables is 1024. Obviously if a non-VARCHAR field changes, it would have to be converted into a VARCHAR before inserting the record into the history table.
Now, this could be a completely misguided way to do things (hey, let me know why if it is), but I think it's a good way of tracking changes that need to be pulled up rarely and need to be stored for a significant amount of time.
Anyway, I need help with writing the trigger to add records to the history table on an update. Let's for example take a hypothetical user table:
CREATE TABLE user
(
user_id INTEGER GENERATED ALWAYS AS IDENTITY,
username VARCHAR(32) NOT NULL,
first_name VARCHAR(64) NOT NULL,
last_name VARCHAR(64) NOT NULL,
email_address VARCHAR(256) NOT NULL,
PRIMARY KEY(user_id)
);
So, can anyone help me with a trigger on an update of the user table to insert the changes into the history table? My guess is that some procedural SQL will need to be used to loop through the fields in the old record, compare them with the fields in the new record and if they don't match, then add a new entry into the history table.
It'd be preferable to use the same trigger action SQL for every table, regardless of its fields, if it's possible.
Thanks!
I don't think this is a good idea: with a big table where more than one value changes per update, you generate even more overhead per value. But that depends on your application.
Furthermore, you should consider the practical value of such a history table. You have to pull a lot of rows together to get even a glimpse of the context in which a value changed, and it requires you to code another application just to handle this complex history logic for an end user. And for a DB admin it would be cumbersome to restore values out of the history.
It may sound a bit harsh, but that is not the intent. An experienced programmer in our shop had a similar idea with table journaling. He got it up and running, but it ate disk space like there's no tomorrow.
Just think about what your history table should really accomplish.
Have you considered doing this as a two step process? Implement a simple trigger that records the original and changed version of the entire row. Then write a separate program that runs once a day to extract the changed fields as you describe above.
This makes the trigger simpler, safer, faster and you have more choices for how to implement the post processing step.
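A minimal sketch of that first step, assuming a hypothetical user_audit table with the same columns as user plus a row_version marker and a change timestamp (untested DB2 syntax):
CREATE TRIGGER user_audit_upd
    AFTER UPDATE ON user
    REFERENCING OLD AS o NEW AS n
    FOR EACH ROW MODE DB2SQL
BEGIN ATOMIC
    -- Store the before and after images of the row; the separate nightly job can
    -- then diff each OLD/NEW pair and write field-level changes to the history table.
    INSERT INTO user_audit
        (user_id, row_version, username, first_name, last_name, email_address, changed_at)
    VALUES
        (o.user_id, 'OLD', o.username, o.first_name, o.last_name, o.email_address, CURRENT TIMESTAMP),
        (n.user_id, 'NEW', n.username, n.first_name, n.last_name, n.email_address, CURRENT TIMESTAMP);
END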
We do something similar on our SQL Server database, but the audit tables are for each individual table audited (one central table would be huge, as our database is many, many gigabytes in size).
One thing you need to do is make sure you also record who made the change. You should also record the old and new values together (makes it easier to put data back if you need to) and the change type (insert, update, delete). You don't mention recording deletes from the table, but we find those are some of the things we most frequently use the audit table for.
We use dynamic SQL to generate the code to create the audit tables (by using the table that stores the system information), and all audit tables have the exact same structure (makes it easier to get data back out).
When you create the code to store the data in your history table, create the code as well to restore the data if need be. This will save tons of time down the road when something needs to be restored and you are under pressure from senior management to get it done now.
Now, I don't know if you were planning to be able to restore data from your history table, but once you have one, I can guarantee that management will want it used that way.
CREATE TABLE HIST.TB_HISTORY (
HIST_ID BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 0, INCREMENT BY 1, NO CACHE) NOT NULL,
HIST_COLUMNNAME VARCHAR(128) NOT NULL,
HIST_OLDVALUE VARCHAR(255),
HIST_NEWVALUE VARCHAR(255),
HIST_CHANGEDDATE TIMESTAMP NOT NULL,
PRIMARY KEY(HIST_ID)
)
GO
CREATE TRIGGER COMMON.TG_BANKCODE AFTER
UPDATE OF FRD_BANKCODE ON COMMON.TB_MAINTENANCE
REFERENCING OLD AS oldcol NEW AS newcol FOR EACH ROW MODE DB2SQL
WHEN(COALESCE(newcol.FRD_BANKCODE,'#null#') <> COALESCE(oldcol.FRD_BANKCODE,'#null#'))
BEGIN ATOMIC
CALL FB_CHECKING.SP_FRAUDHISTORY_ON_DATACHANGED(
newcol.FRD_FRAUDID,
'FRD_BANKCODE',
oldcol.FRD_BANKCODE,
newcol.FRD_BANKCODE,
newcol.FRD_UPDATEDBY
);--
INSERT INTO FB_CHECKING.TB_FRAUDMAINHISTORY(
HIST_COLUMNNAME,
HIST_OLDVALUE,
HIST_NEWVALUE,
HIST_CHANGEDDATE