We have an application which stores its data in SQL Server. Each table has a bigint primary key. We used to generate these exclusively on demand, i.e. when you go to insert a new row, you first make a call to generate the next ID, and then you do the insert.
We added support to run in offline mode: if your connection is down (or SQL Server is down), it saves the data to a local file until you go back online, and then syncs everything you've done since then.
This required being able to generate IDs on the client side. Instead of asking SQL for the next 1 ID, it now asks for the next hundred or thousand or 10,000 IDs, and then stores the range locally, so it doesn't have to ask for more until those 10,000 run out. It would actually get them in smaller chunks, so when 5000 run out, it still has a buffer of 5000, and it can ask for 5000 more.
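For readers who want to picture the scheme, here is a minimal sketch of a client-side block allocator with a low-water refill. The names `IdBlockAllocator`, `FakeServer` and `reserve` are made up for illustration, and the server is faked in memory. Nothing here persists the ranges to disk, which is exactly the part the question is about.

```python
from collections import deque
from typing import Callable, Deque, Tuple


class IdBlockAllocator:
    """Hands out IDs from locally held ranges and refills before running dry."""

    def __init__(self, reserve_block: Callable[[int], int],
                 block_size: int = 5000, low_water: int = 5000) -> None:
        # reserve_block(n) is a hypothetical server call that atomically
        # reserves n consecutive IDs and returns the first one.
        self._reserve_block = reserve_block
        self._block_size = block_size
        self._low_water = low_water
        self._ranges: Deque[Tuple[int, int]] = deque()  # (next_id, end_exclusive)

    def remaining(self) -> int:
        return sum(end - nxt for nxt, end in self._ranges)

    def refill(self) -> None:
        """Call while online; tops the buffer up above the low-water mark."""
        while self.remaining() <= self._low_water:
            first = self._reserve_block(self._block_size)
            self._ranges.append((first, first + self._block_size))

    def next_id(self) -> int:
        if not self._ranges:
            raise RuntimeError("no IDs left; go online and refill")
        nxt, end = self._ranges[0]
        if nxt + 1 == end:
            self._ranges.popleft()          # this range is now used up
        else:
            self._ranges[0] = (nxt + 1, end)
        return nxt


class FakeServer:
    """In-memory stand-in for the database-side counter."""

    def __init__(self, start: int = 1) -> None:
        self._next = start

    def reserve(self, count: int) -> int:
        first = self._next
        self._next += count
        return first
```

Two allocators sharing one `FakeServer` never receive overlapping ranges. Duplicates can only appear if a client's stored `(next_id, end)` state is rolled back to an older copy, which matches the symptom described.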
The problem is, as soon as this went live, we started getting reports of primary key violations. We stored the data in the Windows registry, in HKEY_CURRENT_USER (the only place in the registry a user is guaranteed to be able to write to). So after some research, we discovered that HKEY_CURRENT_USER is part of the roaming profile, so it's possible the IDs could get overwritten with an old version. Especially if the user logs into multiple computers on the network simultaneously.
So we re-wrote the part that generates IDs to read/write a file from the user's "Local Settings" directory. Surely that shouldn't get overwritten by an old version. But even now, I still see occasional primary key violations. The only thing we can do in that case is delete any keys in the file, kick the user out of the application, and don't let them back in until they get new ID ranges.
But if "Local Settings" isn't safe, what would be? Is there anywhere you can store a persistent value on a computer which is guaranteed not to be rolled back to an old version? Can anyone explain why "Local Settings" does not meet this criterion?
I've considered a GUID-like solution, but that has problems of its own.
In a distributed environment like yours, your best bet is to use a GUID.
Do you have to use the same key when you persist the data locally that you use when you sync with the database?
I would be sorely tempted to use a GUID when you persist the data locally and then generate the real key when you're actually writing the data to the database. Or persist the data locally starting with a value of 1 and then generate real keys when you actually write the data to the database.
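A rough sketch of that idea, assuming rows are held locally as plain dicts and that parents are queued before their children. The function names and the `server_ids` iterator (standing in for whatever the database hands out at insert time) are hypothetical.

```python
import itertools
import uuid
from typing import Dict, Iterator, List, Optional


def make_local_row(data: str, parent: Optional[str] = None) -> dict:
    """Create an offline row keyed by a GUID instead of a server ID."""
    return {"local_id": str(uuid.uuid4()), "parent": parent, "data": data}


def sync(pending: List[dict], server_ids: Iterator[int]) -> List[dict]:
    """Swap local GUIDs for real keys, fixing up parent references.

    Assumes `pending` is ordered so a parent appears before its children.
    """
    real_key: Dict[str, int] = {}
    synced = []
    for row in pending:
        real_key[row["local_id"]] = next(server_ids)
        synced.append({
            "id": real_key[row["local_id"]],
            "parent_id": real_key[row["parent"]] if row["parent"] else None,
            "data": row["data"],
        })
    return synced


# Usage: an order and one order line created offline, then synced.
order = make_local_row("order")
line = make_local_row("line", parent=order["local_id"])
synced_rows = sync([order, line], itertools.count(1001))
```

The point of the design is that no real key exists until the server assigns it, so there is no client-side range that can be rolled back.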
Set up an IDENTITY (http://www.simple-talk.com/sql/t-sql-programming/identity-columns/) on the bigint primary key so that SQL Server generates the values automatically.
When your application is offline, you keep the pending changes local. When it comes back online, you send your updates (including new records) and SQL Server will INSERT them and automatically assign a primary key, since you have the IDENTITY set up.
If you need to know what key value was generated after an insert, you can use the @@IDENTITY function (http://msdn.microsoft.com/en-us/library/aa933167%28v=sql.80%29.aspx), or SCOPE_IDENTITY(), which is not affected by inserts that triggers make into other tables.
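To illustrate the flow without a SQL Server at hand, here is a small sketch that uses Python's built-in sqlite3 as a stand-in. `AUTOINCREMENT` plays the role of IDENTITY and `cursor.lastrowid` the role of reading the generated key back. The `events` table and the `local_ref` field are invented for the example.

```python
import sqlite3
from typing import Dict, List


def upload_pending(conn: sqlite3.Connection, pending: List[dict]) -> Dict[str, int]:
    """Insert offline rows and return {local_ref: database-generated key}."""
    assigned = {}
    with conn:  # commits on success, rolls back if an insert fails
        for row in pending:
            cur = conn.execute(
                "INSERT INTO events (payload) VALUES (?)", (row["payload"],)
            )
            assigned[row["local_ref"]] = cur.lastrowid
    return assigned


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
)
keys = upload_pending(conn, [
    {"local_ref": "a", "payload": "first"},
    {"local_ref": "b", "payload": "second"},
])
```

The client keeps only its own temporary references while offline and learns the real keys from the returned mapping.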
I recently started working with Access, and there's something that so far has caused me no problems but that I'm concerned could bring me some issues as the database continues expanding.
When I create tables, Microsoft Access recommends using its default primary key, which I usually do. The problem is that, for some reason, when the table gets populated the primary key "ID" is inconsistent: it will go from 4 to 2679 (just a random example), skipping lots of numbers. If I'm correct, this primary key gets set as auto-increment automatically, correct? So why is it skipping all the numbers in between?
The table gets populated with a simple SQL query using Visual Studio and C#. See below a photo from my Access table.
SQL Server used to do that (in v6.0/6.5 and possibly later ones). It's quite conceivable that Access uses the same mechanism.
IDENTITY works by having the next number (or last, who cares) stored on disc in the DB. To speed up access it is cached in memory, and only occasionally written back to disc (it is SQL Server after all). Depending on how SQL Server was shut down, the disc update might be missed. When the server was restarted it had some way of detecting that the disc version was stale and would bump it by some number.
Oracle does the same with SEQUENCE's. This got complicated on multi-machine cluster installations where there are multiple servers for the same database. To support this, the first time a server had to get a sequence number it got a lot of them (the Cache variable part of a SEQUENCE's definition, default 20 IIRC) and updated the SEQUENCE assuming that it would use all of the numbers assigned. If it didn't use all the numbers assigned then there would be gaps in the numbers used. (It also meant that with a SEQUENCE in a cluster, the SEQUENCE numbers would not necessarily be used sequentially: machine A writes 21, B writes 41, A writes 22, etc.) I've never checked but I assume that a SQL Server in a fail-over cluster might have the same gaps.
Apply the same mechanism to Access where there is no central server for the DB, just potentially lots of local ones on each client's machine. You can see that there is the potential for gaps.
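A toy simulation of that caching behaviour, purely to show where the gaps come from. `CachedSequence` is an invented class, not how any of these engines is actually implemented.

```python
class CachedSequence:
    """Sequence that reserves numbers in blocks and keeps the block in memory."""

    def __init__(self, cache_size: int = 20) -> None:
        self.cache_size = cache_size
        self.on_disk = 1      # persisted: first number not yet reserved
        self._next = 0        # in memory only: next number to hand out
        self._limit = 0       # in memory only: end of the reserved block

    def next_value(self) -> int:
        if self._next >= self._limit:
            # Reserve a whole block and persist as if all of it will be used.
            self._next = self.on_disk
            self._limit = self.on_disk + self.cache_size
            self.on_disk = self._limit
        value = self._next
        self._next += 1
        return value

    def crash_and_restart(self) -> None:
        """Lose the in-memory block; only `on_disk` survives."""
        self._next = 0
        self._limit = 0


seq = CachedSequence(cache_size=20)
before = [seq.next_value() for _ in range(3)]
seq.crash_and_restart()
after = seq.next_value()
```

Here the first three values come from the block 1..20, and after the restart the next value starts from the following block, so the unused remainder of the first block is skipped.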
I have been wondering about the uniqueness of the GUID across the sql servers.
I have one central Database server and 100's of client databases (both SQL Servers). I have a merge replication (bi-directional) setup to sync the data between client and master servers. The sync process will happen 2-3 times a day.
For each of the tables to be synced I am using GUID as PrimaryKey and each table locally gets new records added and new GUIDs are generated locally.
When GUIDs are created on each client machine as well as on the master DB server, how does SQL Server make sure each GUID is unique across all client and master DBs?
How does it keep track of GUIDs generated on other client/server DBs, so that it will not repeat one?
GUIDs are unique (for your purposes)
There are endless debates on the internet - I like this one
I think GUID's are not really necessarily unique. Their uniqueness comes from the fact that it's extremely unlikely to generate the same GUID randomly but that's all.
But for your purpose, that should be ok - they should be unique on a distributed system with extremely high probability.
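To put a number on "extremely high probability", here is a back-of-the-envelope sketch using the standard birthday approximation. It assumes purely random (version 4) GUIDs with 122 random bits. That is an assumption about how the GUIDs are generated, not something established in this thread.

```python
import math
import uuid

RANDOM_BITS = 122  # a version-4 UUID has 122 random bits (6 are fixed)


def collision_probability(n: int, bits: int = RANDOM_BITS) -> float:
    """Birthday approximation: p ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    return -math.expm1(-n * (n - 1) / 2.0 ** (bits + 1))


example_guid = uuid.uuid4()
p_billion = collision_probability(10 ** 9)
```

Under that assumption, even a billion GUIDs across all clients give a collision chance on the order of 1e-19, which is why nobody needs to "keep track" of the ones generated elsewhere.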
You will have to do more research, but I think GUID is based upon MAC address and timestamp, if I remember right.
http://www.sqlteam.com/article/uniqueidentifier-vs-identity
I know some MCM's who have come across a unique key violation on a GUID.
How can this happen? Well, in the Virtual World, you have virtual adapters.
If you copy a virtual machine from one host to another, you can end up with the same adapter and MAC address.
Now if both images are running at the same time, it is possible to get non-unique GUIDs.
However, the condition is rare. You can always add another field to the key to make it unique.
There is a whole debate on whether or not to use a GUID as a clustered PK. Remember, any other index will take a copy of the PK in the leaf (nodes). This is 16 bytes for every record x number of indexes.
I hope this helps.
John
You don't need to do anything special to ensure a GUID/Uniqueidentifier is globally unique. That basic guarantee is the motivating requirement for the GUID.
I'm synchronizing SQL Server 2008 with ~6 SQL Server 2008 Express clients (everything R2 I believe), using the SyncOrchestrator or specifically using http://code.msdn.microsoft.com/windowsdesktop/Database-SyncSQL-Server-e97d1208 as a base with slight modifications. To my knowledge this means all connections are peers or nodes.
I have 2 scopes. One is download only and the other is upload only. The download only scope is ridden with identity columns primarily because I didn't know any better and still couldn't wrap my head around introducing Guids as the PK on the client side. It doesn't totally matter as all clients should have exact replicas of about 8 or so tables and these machines don't touch this data in any way, only read it.
The upload only scope uses Guids as fortunately I can control that portion of the database and there would be no way 10 clients all using the same identity seed could sync back to the server properly. Both scopes use the default provisioning with bulk inserts and the whole 9 yards so there shouldn't be anything I'm doing on the provisioning end to screw this up.
I initially set everything up not using PerformPostRestoreFixup AND the initial database would be manually synchronized with insert statements from the host. This seemed fine but no updates or deletes seemed to ever be applied. You can safely ignore this (only used for historical accuracy and to prove my ineptness) as I then used VS2010 Database Projects to rebuild the database down to schema only & synchronized. I then used the steps outlined here (http://social.microsoft.com/Forums/br/syncdevdiscussions/thread/9ac6d1a1-1565-4b82-a8d8-3d4a9ff5d07b) (sync, backup, restore, call PerformPostRestoreFixup, sync on x clients) and on my dev box where I'm setting all this up I could see updates and deletes just fine. It's when I deploy this to the x clients that I'm not seeing a mirror of the database as I think I should.
The initial sync will complain and try to synchronize all records again. I believe this is expected. During ApplyChangeFailed event on the client I set everything other than DbConflictType.ErrorsOccurred to ApplyAction.RetryWithForceWrite. This may be a source of problems as I initially thought this should be done to force the change down to the client. I want the server to always win in this scenario but during trace I always see the phrase "Local wins" during the bulk insert/update calls. It's possible I'm seeing the error before the re-apply happens but it's awkward to look at.
The only problem I seem to be having is with the download only scope. The initial client database is about a week old now and if I use the performpostrestorefixup steps I don't see any of the updates that have applied between now and then as I think I should. It's as if SyncFx almost prefers a blank database on the client side to kick off the initial sync then all the updates seem to apply just fine with no ApplyChangesFailed events kicking off.
If anyone has seen this before or has a clue where to go I would greatly appreciate it. My brain has fried trying to determine what it is that's going on. My last ditch effort will be to deploy blank databases to all the clients and have them start the sync. I've had no issues with this on the dev side but I can only test one other client to know if that'll do anything different. Aside from that I don't know what to do other than to keep doing manual syncs which would defeat this purpose entirely. I thought PerformPostRestoreFixup would alleviate the issue entirely but I seem to be having the same problems with or without it or perhaps I'm not looking at what I need to be.
Thanks
I wanted to report and close the entry with my findings.
When I would deploy a previously configured client database, I'd often get ApplyChangeFailed events in the form of this log:
"[05:30:41 PM] - ApplyChange Failed: TableName: , Stage: ApplyingInserts, ConflictType: LocalInsertRemoteInsert, Action: RetryWithForceWrite"
This is what I thought would be expected as it tried to reinsert the data that is already there. What this should've been changed to was an update statement during RetryWithForceWrite but I found the data was not updating with what was being sent down.
Once I started each client with a completely blank database and provisioned locally, all of these errors went away. It's as if every client expects some unique id only it sets. I'm also using x64 builds versus x86 which may have some or no bearing on the results. I wish I could determine what exactly happened but it seems that when in doubt, and whenever possible, starting from absolute zero and letting sync fill in the data is your safest option.
I have a table in an Access database.
This Access database is used on a regular basis, basically from 9-5.
Someone else has a copy of this exact table. Sometimes records are added, sometimes deleted, and sometimes data within the records is updated.
I need to update the Access database table with the offsite table every hour or so. What is the best algorithm for updating the data? There are about 5000 records.
Would it severely lock up the table for a few seconds every hour?
I would like to publicly apologize for my rude comment to David Fenton.
My impression is that this question ties together pieces you've been exploring with your previous questions:
a file "listener" to detect the presence of a new file and do something with it when found
list files with some extension in a folder
DoCmd.TransferText to pull file data into your database
Insert, Update, Delete records in a table based on an imported set of records
Maybe it's time to give us a more detailed picture of what you're dealing with.
Tony asked if both sites are on the same WAN (Wide Area Network). You replied they are on Windows. Elsewhere you said you're using a network. Please tell us about the network.
I'm still unsure whether you need a one-way or two-way data exchange. You've talked about importing changes from the remote table into the local master table. Do you need to do the same type of operation at the remote site: import changes made to the table at the master site?
Tell us what needs to happen regarding the issue James raised. Can local and remote users ever edit the same record? If they can, how will you resolve the conflict? Similarly, what should happen if a remote user updates a record and a local user deletes their copy of that record?
Based on what you've told us so far, this sounds like a real challenge for Access, made more challenging by the rate of record changes (5,000 per hour). I like the outline Kevin suggested. However your challenge will be more complicated since you also need to account for record deletions at both sites.
It seems like you may have to create something which duplicates Access' Replication feature. Maybe you should look at the Jet Replication Wiki to see if you can modify your design to take advantage of Replication. I can't help you there, and unfortunately you appear to have frustrated David Fenton who is a leading authority on Jet Replication.
If a few seconds of blocking matters, you would do better to move to a different database engine (like SQLite, MySQL or MS SQL Server). If you want a single file, then SQLite is the best for you. MySQL (with InnoDB) and SQL Server lock individual rows, so you can read and write simultaneously; SQLite locks the whole file, but only for the duration of each write transaction.
If you stay with access, you will probably have to implement a timer to update only a few records at a time.
Before you do anything else you need to establish the "rules" as far as collisions go.
If a row in the local copy is updated and the same row in the remote copy is updated which one is the "correct" version? Ditto for deletions, inserts are even more of a pain as you can have the "same" set of values but perhaps a different key.
After you have worked out how to handle each of these cases you can then go on to thinking about the implementation.
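One way to make those rules concrete is to compare each row in both copies against the state it had at the last sync. The sketch below assumes you keep such a "base" snapshot, which is an extra requirement not mentioned above, and the outcome labels are made up.

```python
from typing import Optional

Row = Optional[dict]  # None means the row does not exist in that copy


def classify(base: Row, local: Row, remote: Row) -> str:
    """Decide what happened to one key since the last sync."""
    local_changed = local != base
    remote_changed = remote != base
    if not local_changed and not remote_changed:
        return "unchanged"
    if local_changed and not remote_changed:
        return "take-local"      # covers local insert, update and delete
    if remote_changed and not local_changed:
        return "take-remote"     # covers remote insert, update and delete
    if local == remote:
        return "same-change"     # both sides made the identical edit
    return "conflict"            # needs one of your collision rules
```

Only the `"conflict"` outcome needs a business decision, for example an update on one side and a delete on the other.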
As other posters have suggested, the way to completely avoid these issues is to switch to SQL Server or any other "proper" database which can be updated over the network by all users and where concurrency issues are handled by the DBMS when the updates are applied.
Other users have already suggested switching to a server-based database, i.e. SQL Server etc. I would echo this and say it is the best way to go. However, if you are stuck with Access and have no choice, then I would suggest you add a field (with an index) along the lines of “Last Updated”. You could then export all records that have been modified within a particular time frame. Export these as a CSV file, ship it over to the remote site and import it into the “master” Access database. With a bit of scripting you could automate this process.
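A rough sketch of that export/import cycle, with the tables represented as plain Python structures rather than Access. The column names `id`, `name` and `last_updated` are placeholders. Note that it handles inserts and updates only, because a deleted row has no "Last Updated" value left to export.

```python
import csv
import io
from datetime import datetime
from typing import Dict, Iterable, TextIO


def export_changes(rows: Iterable[dict], since: datetime, out: TextIO) -> int:
    """Write rows modified after `since` as CSV; return how many were written."""
    changed = [r for r in rows if r["last_updated"] > since]
    writer = csv.DictWriter(out, fieldnames=["id", "name", "last_updated"])
    writer.writeheader()
    for r in changed:
        writer.writerow({**r, "last_updated": r["last_updated"].isoformat()})
    return len(changed)


def import_changes(master: Dict[int, dict], src: TextIO) -> None:
    """Upsert CSV rows into `master` (keyed by id); the newest timestamp wins."""
    for r in csv.DictReader(src):
        key = int(r["id"])
        stamp = datetime.fromisoformat(r["last_updated"])
        current = master.get(key)
        if current is None or current["last_updated"] < stamp:
            master[key] = {"id": key, "name": r["name"], "last_updated": stamp}


# Usage: only the row touched after 09:00 travels to the master copy.
remote_rows = [
    {"id": 1, "name": "old", "last_updated": datetime(2024, 1, 1, 8)},
    {"id": 2, "name": "new", "last_updated": datetime(2024, 1, 1, 10)},
]
buffer = io.StringIO()
exported = export_changes(remote_rows, datetime(2024, 1, 1, 9), buffer)
buffer.seek(0)
master = {2: {"id": 2, "name": "stale", "last_updated": datetime(2024, 1, 1, 7)}}
import_changes(master, buffer)
```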
I want to secure events stored in one table, which has relations to others.
Events are inserted by a Windows service that connects to the hardware and reads from it.
The events table has a PK, a date and time, and 3 different values.
The problem is that every admin can log in and insert/update/delete data in this table, e.g. using SQL Server Management Studio. I created triggers to prevent updates and deletes, so an admin who doesn't know about the triggers will fail to change the data, but one who does can easily disable them and do whatever he wants.
So after long thinking I have one idea: add a new column to the table and store something like a checksum in it, calculated from the other values. This checksum would be generated in the insert/update statements.
If someone inserts or updates something manually I will know, because checking the data against the checksum will show mismatches.
My question is: if you have had a similar problem, how did you solve it?
What algorithm should I use for the checksum? How can I secure against delete statements (I know about gaps in the PK, but that is not enough)?
I'm using SQL Server 2005.
As admins have permissions to do everything on your SQL Server, I recommend a tamper-evident auditing solution. In this scenario, everything that happens on a database or SQL Server instance is captured and saved in a tamper-evident repository. If someone who has the privileges (like an admin) modifies or deletes audited data from the repository, it will be reported.
ApexSQL Comply is such a solution, and it has a built in integrity check option
There are several anti-tampering measures that provide different integrity checks and detect tampering even when it’s done by a trusted party. To ensure data integrity, the solution uses hash values. A hash value is a numeric value created using a specific algorithm that uniquely identifies it
Every table in the central repository database has a RowVersion and a RowHash column. The RowVersion contains the row timestamp – the last time the row was modified. The RowHash column contains the unique row identifier for the row, calculated using the values of the other table columns.
When the original record in the auditing repository is modified, ApexSQL Comply automatically updates the RowVersion value to reflect the time of the last change. To verify data integrity, ApexSQL Comply calculates the RowHash value for the row based on the existing row values. The values used in data integrity verification are now updated, and the newly calculated RowHash value will therefore be different from the RowHash value stored in the central repository database. This will be reported as suspected tampering.
To hide the tampering, I would have to calculate a new value for RowHash and update it. This is not easy, as the formula used for calculation is complex and non-disclosed. But that’s not all. The RowHash value is calculated using the RowHash value from the previous row. So, to cover up tampering, I would have to recalculate and modify the RowHash values in all following rows.
For some tables in the ApexSQL Comply central repository database, the RowHash values are calculated based on the rows in other tables, so to cover tracks of tampering in one table, the admin would have to modify the records in several central repository database tables
This solution is not tamper-proof, but it definitely makes covering the tracks of tampering quite difficult.
Disclaimer: I work for ApexSQL as a Support Engineer
Security through obscurity is a bad idea. If there's a formula to calculate a checksum, someone can do it manually.
If you can't trust your DB admins, you have bigger problems.
Anything you do at the server level the admin can undo. That's the very definition of its role and there's nothing you can do to prevent it.
In SQL 2008 you can request auditing of the said SQL server with X events, see http://msdn.microsoft.com/en-us/library/cc280386.aspx. This is CC compliant solution that is tamper evident. That means the admin can stop the audit and do its mischievous actions, but the stopping of the audit is recorded.
In SQL 2005 the auditing solution recommended is using the profiler infrastructure. This can be made tamper evident when correctly deployed. You would prevent data changes with triggers and constraints and audit DDL changes. If the admin changes the triggers, this is visible in the audit. If the admin stops the audit, this is also visible in the audit.
Do you plan this as a one-time action against a rogue admin or as a feature to be added to your product? Using digital signatures to sign all your application data can be very costly in app cycles. You also have to design a secure scheme to show that records were not deleted, including the last records (i.e. not a simple gap in an identity column). E.g. you could compute CHECKSUM_AGG over BINARY_CHECKSUM(*), sign the result in the app and store the signed value for each table after each update. Needless to say, this will slow down your application, as you basically serialize every operation. For individual row checksums/hashes you would have to compute the entire signature in your app, and that may require values your app does not yet have (i.e. the identity column value to be assigned to your insert). And how far do you want to go? A simple hash can be broken if the admin gets hold of your app and monitors what you hash, and in what order (this is trivial to achieve). He can then recompute the same hash. An HMAC requires you to store a secret in the application, which is basically impossible against a determined hacker. These concerns may seem overkill, but if this is an application you sell, for instance, then all it takes is for one hacker to break your hash sequence or HMAC secret. Google will make sure everyone else finds out about it, eventually.
My point is that you're fighting an uphill, losing battle if you're trying to deter the admin via technology. The admin is a person you trust, and if this is broken in your case, the problem is trust, not technology.
Ultimately, even if admins do not have delete rights, they can give themselves access, make the change to not deny deletes, delete the row and then restore the permission and then revoke their access to make permission changes.
If you are auditing that, then when they give themselves access, you fire them.
As far as an effective tamper-resistant checksum, it's possible to use public/private key signing. This will mean that if the signature matches the message, then no one except who the record says created/modified the record could have done it. Anyone can change and sign the record with their own key, but not as someone else.
I'll just point to Protect sensitive information from the DBA in SQL Server 2008
The idea of a checksum computed by the application is a good one. I would suggest that you research Message Authentication Codes, or MACs, for a more secure method.
Briefly, some MAC algorithms (HMAC) use a hash function, and include a secret key as part of the hash input. Thus, even if the admin knows the hash function that is used, he can't reproduce the hash, because he doesn't know all of the input.
Also, in your case, a sequential number should be part of the hash input, to prevent deletion of entire entries.
Ideally, you should use a strong cryptographic hash function from the SHA-2 family. MD5 has known vulnerabilities, and similar problems are suspected in SHA-1.
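A minimal sketch of what that could look like, using HMAC-SHA256 from the Python standard library. The function names and the row layout are invented for the example, and it assumes the key lives somewhere the admin cannot read, which other answers here point out is the hard part.

```python
import hashlib
import hmac
from typing import List, Tuple

AuditRow = Tuple[int, Tuple[str, ...], str]  # (sequence number, fields, stored MAC)


def row_mac(key: bytes, seq: int, fields: Tuple[str, ...]) -> str:
    """HMAC-SHA256 over the sequence number and all field values."""
    msg = str(seq).encode()
    for field in fields:
        data = field.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
        msg += b"|" + str(len(data)).encode() + b":" + data
    return hmac.new(key, msg, hashlib.sha256).hexdigest()


def find_tampering(key: bytes, rows: List[AuditRow]) -> List[str]:
    """Check rows ordered by sequence number, expected to run 1, 2, 3, ..."""
    problems = []
    expected = 1
    for seq, fields, mac in rows:
        if seq != expected:
            problems.append(f"missing rows {expected}..{seq - 1}")
        if not hmac.compare_digest(mac, row_mac(key, seq, fields)):
            problems.append(f"row {seq} altered")
        expected = seq + 1
    return problems
```

Because the sequence number is part of the MAC input, renumbering the remaining rows to hide a deletion also breaks their MACs. Deleting the most recent rows is still not detected by this sketch; that would need something extra, such as a separately stored count.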
It might be more effective to try to lock down permissions on the table. With the checksum, it seems like a malicious user might be able to spoof it, or insert data that appears to be valid.
http://www.databasejournal.com/features/mssql/article.php/2246271/Managing-Users-Permissions-on-SQL-Server.htm
If you are concerned about people modifying the data, you should also be concerned about them modifying the checksum.
Can you not simply password protect certain permissions on that database?