I have written a program in C# to read data from two tables, transform it, and write it to four other tables.
Three of the destination tables have a NOT NULL integer column pointing to the primary key of the fourth destination table, which I will call V.
When my program has read a chunk of data into memory and transformed it, it uses SqlBulkCopy to write to table V. Upon completion it uses a SELECT statement to retrieve the primary keys.
These primary keys are then assigned to the corresponding rows of the three other destination tables in memory. Finally, the last three tables are written to the database using SqlBulkCopy in one transaction.
However, if the program writes successfully to table V but then fails to write the other three tables, I am left with dirty data.
Can I somehow map the data going into table V to its primary keys, inside a transaction surrounding all of the code?
I would sincerely appreciate any suggestion that solves my problem.
Place a foreign key on the three tables' columns that refer back to the primary key on table V. That will make certain there is no dirty data. You may have to rework your logic and the order the tables are written in, but adding the foreign keys should clean up the process and make it work more smoothly after refactoring.
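As a rough sketch (the table and column names below are placeholders, not taken from your schema), the foreign keys could be added like this:
-- Child1..Child3 and the VId/Id columns are placeholder names.
ALTER TABLE dbo.Child1
    ADD CONSTRAINT FK_Child1_V FOREIGN KEY (VId) REFERENCES dbo.V (Id);
ALTER TABLE dbo.Child2
    ADD CONSTRAINT FK_Child2_V FOREIGN KEY (VId) REFERENCES dbo.V (Id);
ALTER TABLE dbo.Child3
    ADD CONSTRAINT FK_Child3_V FOREIGN KEY (VId) REFERENCES dbo.V (Id);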
My question is probably very specific to Postgres, or maybe it isn't.
A program which I cannot modify has access to Postgres via Npgsql and uses a simple SELECT command; that is all I know.
I also have access via Npgsql. The table is defined as:
-- Table: public.n_data
-- DROP TABLE public.n_data;
CREATE TABLE public.n_data
(
  u_id integer,
  p_id integer NOT NULL,
  data text,
  CONSTRAINT nc PRIMARY KEY (p_id)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE public.n_data
  OWNER TO postgres;
(If that info is useful anyway)
I access one single big column, read from it and write back to it.
This all works fine so far.
The question is: how does Postgres handle it if we write at the same time?
Are there any problems with that?
And if Postgres does not handle that automatically: what if I read the data, process it, the data changes in the meantime, and I then write back the processed data? That would mean lost data.
It's a bit tricky to test for data integrity, since this data block is huge and corruption is hard to find.
I do this with C#, if that means anything.
Locking in most¹ relational databases (including Postgres) is always at row level, never at column level (a relational database has columns and rows, not "cells", "fields", or "records").
If two transactions modify the same row, the second one will have to wait until the first one commits or rolls back.
If two transactions modify different rows then they can do that without any problems as long as they don't modify columns that are part of a unique constraint or primary key to the same value.
Read access to data is never blocked in Postgres by regular DML statements. So yes, while one transaction modifies data, another one will see the old data until the first transaction commits its changes ("read consistency").
To handle lost updates you can either use the serializable isolation level or make all transactions follow the pattern that they first need to obtain a lock on the row (select ... for update) and hold that until they are finished. Search for "pessimistic locking" to get more details about this pattern.
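For example, with the n_data table from the question, the pessimistic pattern could look like this (the literal p_id value is just a placeholder):
BEGIN;
-- Lock the row so no other transaction can modify it until we commit.
SELECT data
FROM public.n_data
WHERE p_id = 1
FOR UPDATE;

-- ... process the data in the application ...

UPDATE public.n_data
SET data = 'processed value'
WHERE p_id = 1;

COMMIT;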
Another option is to include a "modified" timestamp in your table. When a process reads the data it also reads the modification timestamp. When it sends back the new changes it includes a where modified_at = <value obtained when reading> condition - if the data has changed in the meantime, the condition no longer holds, nothing is updated, and you need to restart your transaction. Search for "optimistic locking" to find more details about this pattern.
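A minimal sketch of the optimistic pattern, assuming a modified_at column has been added to n_data (it is not part of the original definition):
-- 1. Read the data together with its modification timestamp.
SELECT data, modified_at
FROM public.n_data
WHERE p_id = 1;

-- 2. Write back only if nothing changed in the meantime.
UPDATE public.n_data
SET data = 'processed value',
    modified_at = now()
WHERE p_id = 1
  AND modified_at = '2020-01-01 12:00:00+00';  -- the value obtained in step 1

-- If this updates 0 rows, the data changed underneath you: re-read and retry.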
¹ Some DBMS do page locking and some escalate many row-level locks to a table lock. Neither is the case in Postgres.
In our DB (on SQL Server 2005) we have a "Customers" table, whose primary key is Client Code, a surrogate bigint IDENTITY(1,1) key; the table is referenced by a number of other tables in our DB through foreign keys.
A new CR implementation we are estimating would require us to change the ID column type to varchar, with the Client Code generation algorithm shifting from a simple numeric progression to a strict 2-character representation, with codes ranging from 01 to 99 and then progressing like this:
1A -> 2A -> ... -> 9A -> 1B -> ... 9Z
I'm fairly new to database design, but I smell some serious problems here. First of all, what about this client code generation algorithm? What if I need a Client Code to go beyond the 9Z limit?
Then I have some questions: would this change be feasible, given that the table is already filled with a fair amount of data and referenced by multiple entities? If so, how would you approach this problem, and how would you implement Client Code generation?
I would leave the primary key as it is and create another (unique) key on the generated client code.
I would do that anyway. It's always better to have a short numeric primary key instead of long character keys.
In some situations you might prefer a GUID (for replication purposes), but otherwise a numeric int/bigint key is always preferable.
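A minimal sketch of that layout, with hypothetical column names on the Customers table:
CREATE TABLE dbo.Customers
(
    CustomerId bigint IDENTITY(1,1) NOT NULL,  -- surrogate key stays the primary key
    ClientCode varchar(2) NOT NULL,            -- generated client code gets its own unique key
    -- ... other columns ...
    CONSTRAINT PK_Customers PRIMARY KEY (CustomerId),
    CONSTRAINT UQ_Customers_ClientCode UNIQUE (ClientCode)
);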
My biggest concern with what you are proposing is that you will be limited to 360 primary records. That seems like a small number.
Performing the change is a multi-step operation. You need to create the new field in the core table and all its related tables.
To do an in-place update, you need to generate the code in the core table. Then you need to update all the related tables to have the code based on the old id. Then you need to add the foreign key constraint to all the related tables. Then you need to remove the old key field from all the related tables.
We only did that in our development server. When we upgraded the live databases, we created a new database for each and copied the data over using a python script that queried the old database and inserted into the new database. I now update that script for every software upgrade so the core engine stays the same, but I can specify different tables or data modifications. I get the bonus of having a complete backup of the original database if something unexpected happens when upgrading production.
One strong argument in favor of a non-identity/guid code is that you want a human readable/memorable code and you need to be able to move records between two systems.
Performance is not necessarily a concern in SQL Server 2005 and 2008. We recently went through a change where we moved from int ids everywhere to 7 or 8 character "friendly" record codes. We expected to see some kind of performance hit, but we in fact saw a performance improvement.
We also found that we needed a way to quickly generate a code. Our codes have two parts, a 3-character alpha prefix and a 4 or 5 digit suffix. Once we had a large number of codes (15,000-20,000) we found it too slow to parse the code into prefix and suffix and find the lowest unused code (it took several seconds). Because of this, we also store the prefix and the suffix separately (in the primary key table) so that we can quickly find the next available lowest code with a particular prefix. The cached prefix and suffix made the search almost free.
We allow changing of the codes, and the changed values propagate via cascade update rules on the foreign key relationships. We keep an identity key on the core code table to simplify the update of the code.
We don't use an ORM, so I don't know what specific things to be aware of with that. We also have on the order of 60,000 primary keys in our biggest instance, but we have hundreds of related tables, some with millions of rows related to the code table.
One big advantage that we got was, in many cases, we did not need to do a join to perform operations. Everywhere in the software the user references things by friendly code. We don't have to do a lookup of the int ID (or a join) to perform certain operations.
The new code generation algorithm isn't worth thinking about. You can write a program to generate all possible codes in just a few lines of code. Put them in a table, and you're practically done. You just need to write a function to return the smallest one not yet used. Here's a Ruby program that will give you all the possible codes.
# test.rb -- generate a peculiar sequence of two-character codes.
i = 1
('A'..'Z').each do |c|
  (1..9).each do |n|
    printf("'%d%s', %d\n", n, c, i)
    i += 1
  end
end
The program writes CSV-style output that you should be able to import easily into a table (redirect it to a file). You need two columns to control the sort order, because the new values don't naturally sort the way your requirements specify.
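As a rough illustration (the AllCodes table and its columns are hypothetical, as is Customers.ClientCode), finding the lowest unused code could then look like this:
-- Lowest (by the custom sort order) code not yet used by any customer.
SELECT TOP 1 ac.Code
FROM AllCodes ac
WHERE NOT EXISTS (SELECT 1 FROM Customers c WHERE c.ClientCode = ac.Code)
ORDER BY ac.SortOrder;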
I'd be more concerned about the range than the algorithm. If you're right about the requirement, you're limited to 234 client codes. If you're wrong, and the range extends from "1A" to "ZZ", you're limited to less than a thousand.
To implement this requirement in an existing table, you need to follow a careful procedure. I'd try it several times in a test environment before trying it on a production table. (This is just a sketch. There are a lot of details.)
Create and populate a two-column table to map existing bigints to the new CHAR(2).
Create new CHAR(2) columns in all the tables that need them.
Update all the new CHAR(2) columns.
Create new NOT NULL UNIQUE or PRIMARY KEY constraints and new FOREIGN KEY constraints on the new CHAR(2) columns.
Rewrite user interface code (?) to target the new columns. (Might not be necessary if you rename the new CHAR(2) and old BIGINT columns.)
Set a target date to drop the old BIGINT columns and constraints.
And so on.
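A rough T-SQL sketch of the first few steps, using hypothetical names (Customers with a bigint ClientId key, one dependent table Orders, and a CHAR(2) ClientCode):
-- 1. Mapping table from old bigint keys to new CHAR(2) codes.
CREATE TABLE dbo.CodeMap
(
    ClientId   bigint  NOT NULL PRIMARY KEY,
    ClientCode char(2) NOT NULL UNIQUE
);
-- ... populate CodeMap with the generated codes ...

-- 2. Add the new column to the parent and child tables.
ALTER TABLE dbo.Customers ADD ClientCode char(2) NULL;
ALTER TABLE dbo.Orders    ADD ClientCode char(2) NULL;

-- 3. Fill the new columns from the mapping table.
UPDATE c SET c.ClientCode = m.ClientCode
FROM dbo.Customers c JOIN dbo.CodeMap m ON m.ClientId = c.ClientId;

UPDATE o SET o.ClientCode = m.ClientCode
FROM dbo.Orders o JOIN dbo.CodeMap m ON m.ClientId = o.ClientId;

-- 4. New constraints on the CHAR(2) columns.
ALTER TABLE dbo.Customers ALTER COLUMN ClientCode char(2) NOT NULL;
ALTER TABLE dbo.Customers ADD CONSTRAINT UQ_Customers_ClientCode UNIQUE (ClientCode);
ALTER TABLE dbo.Orders ADD CONSTRAINT FK_Orders_Customers_ClientCode
    FOREIGN KEY (ClientCode) REFERENCES dbo.Customers (ClientCode);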
Not really addressing whether this is a good idea or not, but you can change your foreign keys to cascade updates. Once you've done that, updating the primary key in the parent table will update the corresponding key in the child table accordingly.
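For example (hypothetical table, column, and constraint names), recreating a foreign key with cascading updates looks like this:
ALTER TABLE dbo.Orders DROP CONSTRAINT FK_Orders_Customers;

ALTER TABLE dbo.Orders
    ADD CONSTRAINT FK_Orders_Customers
    FOREIGN KEY (ClientCode) REFERENCES dbo.Customers (ClientCode)
    ON UPDATE CASCADE;

-- Updating the parent key now propagates to the child rows automatically.
UPDATE dbo.Customers SET ClientCode = '1A' WHERE ClientCode = '99';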
I'm creating a database with several SQL files:
1 file creates the tables.
1 file adds constraints.
1 file drops constraints.
The primary key is a constraint; however, I've been told by someone to define the primary key in the table definition, without being given a reason why.
Is it better to define the primary key as a constraint that can be added and dropped, or is it better to do it in the table definition?
My current thinking is to do it in the table definition, because doing it as a removable constraint could potentially lead to some horrible issues with duplicate keys.
But dropping constraints could lead to serious issues anyway, so it is expected that if someone did drop the primary key, they would have taken appropriate steps to avoid problems, as they should for any other data entry.
A primary key is a constraint, but a constraint is not necessarily a primary key. Short of doing some major database surgery, there should never be a need to drop a primary key, ever.
Defining the primary key along with the table is good practice - if you separate the table and the key definition, that opens the window to the key definition getting lost or forgotten. Given that any decent database design utterly depends on consistent keys, you don't ever want to have even the slightest chance that your primary keys aren't functioning properly.
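To illustrate (table and column names are made up), both of these yield the same primary key; the argument is simply about where the definition lives:
-- Key defined together with the table:
CREATE TABLE Widgets
(
    WidgetId int NOT NULL,
    Name     varchar(50) NOT NULL,
    CONSTRAINT PK_Widgets PRIMARY KEY (WidgetId)
);

-- The alternative: key added later, e.g. from a separate constraints script:
-- ALTER TABLE Widgets ADD CONSTRAINT PK_Widgets PRIMARY KEY (WidgetId);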
From a maintainability perspective I would say that it is better to have the Primary Key in the table definition as it is a very good indicator of what the table will most likely be used for.
The other constraints are important as well though and your argument holds.
All of this is somewhat platform specific, but a primary key is a logical concept, whereas a constraint (or unique index, or whatever) is a physical thing that implements the logical concept of "primary key".
That's another reason to argue for putting it with the table itself - its logical home - rather than in the constraints file.
For effective source control it usually makes sense to have a separate script for each object (constraints included). That way you can track changes to each object individually.
There's a certain logical sense in keeping everything related to a table in one file--column definitions, keys, indexes, triggers, etc. If you never have to rebuild a very large database from SQL, that will work fine almost all the time. The few times it doesn't work well probably aren't worth changing the process of keeping all the related things together in one file.
But if you have to rebuild a very large database, or if you need to move a database onto a different server for testing, or if you just want to fiddle around with things, it makes sense to split things up. In PostgreSQL, we break things up like this. All these files are under version control.
All CREATE DOMAIN statements in one file.
Each CREATE TABLE statement in a separate file. That file includes all constraints except FOREIGN KEY constraints, expressed as ALTER TABLE statements. (More about this in a bit.)
Each table's FOREIGN KEY constraints in a separate file.
Each table's indexes for non-key columns in a separate file.
Each table's triggers in a separate file. (If a table has three triggers, all three go in one file.)
Each table's data in a separate file. (Only for tables loaded before bringing the database online.)
Each table's rules in a separate file.
Each function in a separate file. (Functions are PostgreSQL's equivalent to stored procedures.)
Without foreign key constraints, we can load tables in any order. After the tables are loaded, we can run a single script to rebuild all the foreign keys. The makefile takes care of bundling the right individual files together. (Since they're separate files, we can run them individually if we want to.)
Tables load faster if they don't have constraints. I said we put each CREATE TABLE statement in a separate file. The file includes all constraints except FOREIGN KEY constraints, expressed as ALTER TABLE statements. You can use the streaming editor sed to split those files into two pieces. One piece has the column definitions; the other piece has all the 'ALTER TABLE ADD CONSTRAINT' statements. The makefile takes care of splitting the source files and bundling them together--all the table definitions in one SQL file, and all the ALTER TABLE statements in another. Then we can run a single script to create all the tables, load the tables, then run a single script to rebuild all the constraints.
make is your friend.
I am working on a legacy database. I am not able to change the schema :( In a couple of tables the primary key uses multiple columns.
In the app I read the data from each row into a table; the user then updates the data and I write the data back into the table.
Currently I concatenate the various PK columns and store them as a unique id for when I put the data back into the table.
Now I was wondering if there is a more efficient way to do that. Coming from a MySQL background I am not aware of any, but thought SQL Server 2005 may have a function like
SELECT PRIMARYKEY() as pk, ... FROM table WHERE ...
the above would select the key that the database engine uses as the primary key for the given record
I searched and couldn't find anything. It's probably just me being fussy, but I don't like the concatenation trick.
DC
In SQL Server, there is no equivalent of PRIMARYKEY() that I would be aware of, really. You can consult the system catalog views to find out which columns make up the primary key, but you can't just simply select the primary key value(s) with a function call.
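What you can do is query the catalog for the columns that make up the primary key of a given table; for example (substitute your table name):
SELECT kcu.COLUMN_NAME, kcu.ORDINAL_POSITION
FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu
  ON kcu.CONSTRAINT_NAME = tc.CONSTRAINT_NAME
 AND kcu.TABLE_NAME = tc.TABLE_NAME
WHERE tc.CONSTRAINT_TYPE = 'PRIMARY KEY'
  AND tc.TABLE_NAME = 'YourTable'
ORDER BY kcu.ORDINAL_POSITION;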
I would agree with StarShip3000 - what do you concatenate your PK values for? While I don't think a compound primary key made up of several columns is necessarily a very good idea, if it's a legacy system and you can't change it, I wouldn't bother concatenating the PK values on read, and then having to split them apart again when you write your data back. Just leave the structure as it is - compound keys aren't generally recommended, but they are indeed supported, no problem.
"Currently I concatenate the various PK columns and store them as a unique id for when I put the data back into the table."
Can't you just store the pk as two columns in the target table and use that to join back to the two columns on the source table?
What benefit is concatenating giving you here?
Edit: Let me completely rephrase this, because I'm not sure there's an XML way like I was originally describing.
Yet another edit: This needs to be a repeatable process, and it has to be able to be set up in a way that it can be called in C# code.
In database A, I have a set of tables, related by PKs and FKs. A parent table, with child and grandchild tables, let's say.
I want to copy a set of rows from database A to database B, which has identically named tables and fields. For each table, I want to insert into the same table in database B. But I can't be constrained to use the same primary keys. The copy routine must create new PKs for each row in database B, and must propagate those to the child rows. I'm keeping the same relations between the data, in other words, but not the same exact PKs and FKs.
How would you solve this? I'm open to suggestions. SSIS isn't completely ruled out, but it doesn't look to me like it'll do this exact thing. I'm also open to a solution in LINQ, or using typed DataSets, or using some XML thing, or just about anything that'll work in SQL Server 2005 and/or C# (.NET 3.5). The best solution wouldn't require SSIS, and wouldn't require writing a lot of code. But I'll concede that this "best" solution may not exist.
(I didn't make this task up myself, nor the constraints; this is how it was given to me.)
I think the SQL Server utility tablediff.exe might be what you are looking for.
First, let me say that SSIS is your best bet. But, to answer the question you asked...
I don't believe you will be able to get away with creating new IDs all around; well, you could, but you would need to keep the original IDs to use for lookups.
The best you can get is one INSERT statement per table. Here is an example of the code to do SELECTs that get you the data from your XML sample:
declare @xml xml
set @xml='<People Key="1" FirstName="Bob" LastName="Smith">
  <PeopleAddresses PeopleKey="1" AddressesKey="1">
    <Addresses Key="1" Street="123 Main" City="St Louis" State="MO" ZIP="12345" />
  </PeopleAddresses>
</People>
<People Key="2" FirstName="Harry" LastName="Jones">
  <PeopleAddresses PeopleKey="2" AddressesKey="2">
    <Addresses Key="2" Street="555 E 5th St" City="Chicago" State="IL" ZIP="23456" />
  </PeopleAddresses>
</People>
<People Key="3" FirstName="Sally" LastName="Smith">
  <PeopleAddresses PeopleKey="3" AddressesKey="1">
    <Addresses Key="1" Street="123 Main" City="St Louis" State="MO" ZIP="12345" />
  </PeopleAddresses>
</People>
<People Key="4" FirstName="Sara" LastName="Jones">
  <PeopleAddresses PeopleKey="4" AddressesKey="2">
    <Addresses Key="2" Street="555 E 5th St" City="Chicago" State="IL" ZIP="23456" />
  </PeopleAddresses>
</People>
'

select t.b.value('./@Key', 'int') PeopleKey,
       t.b.value('./@FirstName', 'nvarchar(50)') FirstName,
       t.b.value('./@LastName', 'nvarchar(50)') LastName
from @xml.nodes('//People') t(b)

select t.b.value('../../@Key', 'int') PeopleKey,
       t.b.value('./@Street', 'nvarchar(50)') Street,
       t.b.value('./@City', 'nvarchar(50)') City,
       t.b.value('./@State', 'char(2)') [State],
       t.b.value('./@ZIP', 'char(5)') Zip
from @xml.nodes('//Addresses') t(b)
What this does is take nodes from the XML and parse out the data. To get the relational ID from People we use ../../ to go up the chain.
Dump the XML approach and use the import wizard / SSIS.
By far the easiest way is Red Gate's SQL Data Compare. You can set it up to do just what you described in a minute or two.
I love Red Gate's SQL Compare and Data Compare too but it won't meet his requirements for the changing primary keys as far as I can tell.
If cross-database queries/linked servers are an option, you could do this with a stored procedure that copies the parent/child records from DB A into temporary tables on DB B, then adds a column for the new primary key to the temp child table, which you would update after inserting the headers.
My question is if the records don't have the same primary key how do you tell if it's a new record? Is there some other candidate key? If these are new tables why can't they have the same primary key?
I have created the same thing with a set of stored procedures.
Database B will have its own primary keys, but it stores Database A's primary keys for debugging purposes. It means I can have more than one Database A!
Data is copied via a linked server. Not too fast; SSIS is faster. But SSIS is not for beginners, and it is not easy to code something that works with changing source tables.
And it is easy to call a stored procedure from C#.
I'd script it in a stored procedure, using INSERTs to do the hard work. Your code will take the PKs from Table A (presumably via SCOPE_IDENTITY()) - I assume that the PK for Table A is an identity field?
You could use temporary tables, cursors or you might prefer to use the CLR - it might lend itself to this kind of operation.
I'd be surprised to find a tool that could do this off the shelf with either a) pre-determined keys, or b) identity fields (clearly Tables B & C don't have them).
Are you clearing the destination tables each time and then starting again? That will make a big difference to the solution you need to implement. If you are doing a complete re-import each time then you could do something like the following:
Create a temporary table or table variable to record the old and new primary keys for the parent table.
Insert the parent table data into the destination and use the OUTPUT clause to capture the new IDs and insert them with the old IDs into the temp table.
NOTE: Using the output clause is efficient and allows you to do the insert in bulk without cycling through each record to be inserted.
Insert the child table data, joining to the temp table to retrieve the new foreign key required (see the sketch just below).
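A rough T-SQL sketch of steps 2 and 3, using hypothetical Parent/Child tables and database names, and assuming the parent has a column (Name here) that uniquely identifies each row, since OUTPUT on a plain INSERT can only expose the inserted columns:
-- Mapping table for the newly generated parent keys.
DECLARE @KeyMap TABLE (NewParentId int NOT NULL, Name nvarchar(100) NOT NULL);

-- Step 2: insert the parent rows and capture the new keys in one pass.
INSERT INTO DatabaseB.dbo.Parent (Name)
OUTPUT inserted.ParentId, inserted.Name INTO @KeyMap (NewParentId, Name)
SELECT p.Name
FROM DatabaseA.dbo.Parent AS p;

-- Step 3: insert the child rows, translating the foreign key through the map.
INSERT INTO DatabaseB.dbo.Child (ParentId, Detail)
SELECT m.NewParentId, c.Detail
FROM DatabaseA.dbo.Child AS c
JOIN DatabaseA.dbo.Parent AS p ON p.ParentId = c.ParentId
JOIN @KeyMap AS m ON m.Name = p.Name;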
The above process could be done using T-SQL Script, C# code or SSIS. My preference would be for SSIS.
If you are adding each time then you may need to keep a permanent table to track the relationship between source database primary keys and destination database primary keys (at least for the parent table). If you needed to keep this kind of data out of the destination database, you could get SSIS to store/retrieve it from some kind of logging database or even a flat file.
You could probably avoid the above scenario if there is a combination of fields in the parent table that can be used to uniquely identify that record and therefore "find" the primary key for that record in the destination database.
I think most likely what I'm going to use is typed datasets. It won't be a generalized solution; we'll have to regenerate them if any of the tables change. But based on what I've been told, that's not a problem; the tables aren't expected to change much.
Datasets will make it reasonably easy to loop through the data hierarchically and refresh PKs from the database after insert.
When dealing with similar tasks I simply created a set of stored procedures to do the job.
As the task that you specified is pretty custom, you are not likely to find a "ready to use" solution.
Just to give you some hints:
If the databases are on different servers use linked servers so you can access both source and destination tables simply through TSQL
In the stored procedure:
Identify the parent items that need to be copied - you said that the primary keys are different so you need to use unique constraints instead (you should be able to define them if the tables are normalised)
Identify the child items that need to be copied based on the identified parents, to check if some of them are already in the destination db use the unique constraints approach again
Identify the grandchild items (same logic as with parent-child)
Copy data over starting with the lowest level (grandchildren, children, parents)
There is no need for cursors etc.; simply store the intermediate results in a temporary table (or table variable if working within one stored procedure)
That approach worked for me pretty well.
You can of course add a parameter to the main stored procedure so you can either copy all new records or only the ones that you specify.
Let me know if that is of any help.