In the context of logging operations by applications, what do you think are the best practices for updating progress from databese perspective? In my experience, it is best to only insert new records with new statuses into the log table in the database. Unfortunately, I often see how junior programmers try to update the statuses in the existing logs and act only on one entry with a unique process id. This leads to deadlocks by page locks or multithreading if they are scanning through different filters. Then you need to introduce proper lock management to such solutions, which further complicates the logic and such people later have a complete problem with understanding the behavior of the database.
So is a plain insert the only simplest and maintainable solution, or do you know of other simpler approaches?
Thanks in advance for Your knowledge.
A log record is, almost by definition, a record of a point in time. If you update it then you’ve lost the record of that point in time. Therefore log records should never be updated, they should be insert only
Related
I need to create an Audit table that is going to track the actions (insert, update, delete) of my tables in the database and add new row with date, row id, table name and a few more details, so I will know what action happened and when.
So basically from my understanding I need a trigger for each table which is going to track insert/update/delete and a trigger on the database which is going to track new table creation.
My main problem is understanding how to connect between those things so when a new table is being created a trigger will be created for that table which is going to track the actions and add new rows for the Audit table as needed.
Is it possible to make a DDL trigger for create_table and inside of it another trigger for insert / update / delete ?
What you're hoping for is not possible. And I'd strongly advise that you'd be better off thinking about what you really want to achieve at a business level with auditing. It will yield a much simpler and more practical solution.
First up
...trigger on the database which is going to track new table creation.
I cannot stress enough how terrible this idea is. Who exactly has such unfettered access to you database that they can create tables without going through code-review and QA? Which should of course be on the gated pathway towards production. Once you realise that schema changes should not happen ad-hoc, it's patently obvious that you don't need triggers (which are by their very nature reactive) to do something because the schema changed.
Even if you could write such triggers: it's at a meta-programming level that simply isn't worth the effort of trying to foresee all possible permutations.
Better options include:
Requirements assessment and acceptance: This is new information in the system. What are the audit requirements?
Design review: New table; does it need auditing?
Test design: How to test an audit requirements?
Code Review: You've added a new table. Does it need auditing?
Not to mention features provided by tools such as:
Source Control.
Db deployment utilities (whether home-grown or third party).
Part two
... a trigger will be created for that table which is going to track the actions and add new rows for the Audit table as needed.
I've already pointed out why doing the above automatically is a terrible. Now I'm going a step further to point out that doing the above at all is also a bad idea.
It's a popular approach, and I'm sure to get some flack from people who've nicely compartmentalised their particular flavour of it; swearing blind how much time it "saves" them. (There may even be claims to it being a "business requirement"; which I can assure you is more likely a misstated version of the real requirement.)
There are fundamental problems with this approach:
It's reactive instead of proactive. So it usually lacks context.
You'll struggle to audit attempted changes that get rolled back. (Which can be a nightmare for debugging and usually violates real business audit requirements.)
Interpreting audit will be a nightmare because it's just raw data. The information is lost in the detail.
As columns are added/renamed/deleted your audit data loses cohesion. (This is usually the least of problems though.)
These extra tables that always get updated as part of other updates can wreak havoc on performance.
Usually this style of auditing involves: every time a column is added to the "base" table, it's also added to the "audit" table. (This ultimately makes the "audit" table very much like a poorly architected persistent transaction log.)
Most people following this approach overlook the significance of NULLable columns in the "base" tables.
I can tell you from first hand experience, interpreting such audit trails in any but the simplest of cases is not easy. The amount of time wasted is ridiculous: investigating issues, training others to be able to interpret them correctly, writing utilities to try make working with these audit trails less painful, painstakingly documenting findings (because the information is not immediately apparent in the raw data).
If you have any sense of self-preservation you'll heed my advice.
Make it great
(Sorry, couldn't resist.)
A better approach is to proactively plan for what needs auditing. Push for specific business requirements. Note that different cases may need different auditing techniques:
If user performs action X, record A details about the action for legal traceability.
If user attempts to do Y but it prevented by system rules, record B details to track rule system integrity.
If user fails to log in, record C details for security purposes.
If system is upgraded, record D details for troubleshooting.
If certain system events occur, record E details ...
The important thing is that once you know the real business requirements, you won't be saying: "Uh, let's just track everything. It might be useful." Instead you'll:
Be able to produce a cleaner more appropriate and reliable design for each distinct kind of auditing.
Be able to test that it behaves as required!
Be able to use the audit data more easily whenever it's needed.
I happen to be using innodb, read-committed.
My simple question is this relative to a transaction:
I have a table (TreeNodeId) which holds a set of 4 different nodekeys, that represent all extant nodes in my system that relate to available paths to webpages. Each key represents an item in the database, and each row in the table represents various combinations in which items are used.
At the beginning of a transaction, based on the items being changed, I make a single query for all rows in TreeNodeId that reflect some extant combination of my one or 2 items.
Will this single query be internally consistent, even if it fetches 10,000 rows? Is it possible for the db query set to get the first 100 rows, and then for some other simultaneous transaction to commit new or deleted rows that would cause the remaing results to be inconsistent?
Andy
If you isolation level is read 'committed' it will only return results that have been 'committed' by the transaction log. So if you start a query that is in isolation level 'committed' at that point in time the sql transaction log will only give you transactions that had posted to it's log as committed. If in the middle of the select someone posts records they will be seen as 'uncommitted' at that point in time till they end their operation and will be 'committed'. However even if you change the level to 'uncommitted' you should not get data as it is in mid stream, you should get what is available to the engine at the moment you began your operation according to the transaction logs.
Committed versus uncommitted will get records in the system at the moment of select that are there based on your select. So if I had say 3,000,000 records and 200,000 records inserting but they were committing one at a time and only 100,000 had committed and 100,000 were aware of operation in the logs but not committed yet.
Committed would give me 3,100,000 and Uncommitted would give 3,200,000. However there are schools of thought and I just got into a discussion yesterday with someone on this.... Uncommitted will give you the uncommitted results and are known as 'dirty reads' in that you are reading logs that are not set yet(you rebel). You are saying "Hey database I don't care what you got incoming that is finalized I WANT IT NOW." When you say committed you are saying: "Database I only want qualified data, if something is not finalized I don't want it."
Advantages with each:
Uncommitted you will not LOCK anything. You are basically saying to the system: "Don't lock anything out, just let me go through the system freely getting what is there and I don't care if you change something. I want it at moment of operation." If something is trying to insert or update when you perform this it WILL NOT LOCK IT.
Committed will not lock anything except that which is in process to commit till your operation has been completed. You are safe in knowing your data returned is finalized but your run the risk of BLOCKING transactions trying to insert or post. Your are essentially telling the database: "Wait for me to finish before continuing your commits on tables I am accessing. I need my data accurate so hands off till I am done". This will potentially lock data while it is performing the reads on a table you are gathering from. This is not that common as most selects are near instant but on huge systems that are transactional based on posting thousands of records a second it is a BIG CONCERN.
Honestly in my discussion I favored uncommitted and the other person favored committed. I argued it is far more acceptable to get dirty data than stop production inserts. They argued that phantom reads and other instances were worse. This is an opinion and SQL systems are designed around inserts and selects but seldom can you do both exceptionally fast without taking a little away from the other. My answer if you want accurate reporting is do nightly backups, SSIS packages, binary collections, or something similar in an isolation level such as snapshot or committed and put that data somewhere. Let that data have been set in a way that we know it is finalized and it is locked so it may not be changed later and report off of that. Don't report off of production data hot and make it a point to tell everyone to do that. That is bad practice in and of itself to tell people to report off of live data performing inserts and updates in real time.
Will it hurt if you are a small mom and pop store with only 5 or 10 people using the database, probably not. Will it hurt if you are little bigger and have 50 people accessing the same database but it is about 100 gigs and semi transactional in that you get trickle's of data during the day. Still probably not. Will it hurt if you have 200 people and multiple servers and databases and a main transactional database brain storing the composite of all the data. ABSOLUTELY, don't read from a main production database with intense operations if it's main purpose is to get data to store.
EDIT to further point from real world example:
That is why usually at the top of most operations where I am not using table variables (declare #Table table) I set this: "set transaction isolation level read uncommitted". Will I be using this intensely every time I query? LOL, I hope not. In fact Full disclosure it may NEVER EVER help me from this point on because I isolate my data a lot with temp tables for huge transaction reporting. But I will not be getting yelled at by others I have a long running transaction blocking their inserts. You will also see a lot of people do this: "select * from table (nolock)" I Generally give code like this to lesser query designers as it embeds the nolock hint with the query. If I tell everyone to do this they will make it policy.
You do not have to do this and in fact some people will maybe follow me and claim this is wrong and post their side. I do it MOSTLY FOR PRODUCTION PROTECTION and anyone that tells me that is wrong I would like to hear why they like to lock tables and report off of them in production versus getting their data in or updated in real time first. I would have a hard time going to a manager and saying: "You know that huge account you were waiting to post 2 million records on and know the instance it was done. Well John down the hall really wanted to run this query that takes an hour to run because it was sloppily designed. He chose to use committed and is hitting some of the tables doing inserts so we are getting occasional locks. Well I think it is more important he get his report than we get business." I wonder what the manager would tell me back?
We need to insert data(8k records) into a CRM Entity, the data will come from other CRM Entities. Currently we are doing it through code but it takes too much time (Hours). I was wondering if we use SQL to insert directly into the CRM Database it will be a lot easier and will take only minutes. But before moving farward I have few questions:
Is it safe to insert directly into CRM Database, using SQL?
What is the best practice for insert data into CRM using SQL?
What things should i consider before trying it?
EDIT:
4: How do I increase the insert performance?
No, it is not. It is considered unsupported
Don't do it
Rollup 12 was just released and contains a new API feature. There is now a ExecuteMultipleRequest which could be used for batched bulk imports. See http://msdn.microsoft.com/en-us/library/jj863631.aspx
It shouldn't take hours to insert 8000 records. It would help to see your code, but here are some things to consider to improve performance:
Reuse your IOrganizationService. I've found a 10x increase in performance by reusing a IOrganizationService, rather than creating a new one with each record that is being updated
Use multi-threading. You have to be careful with this one, because it could lead to worse performance if the function to check for the entity existing is your bottle neck.
Tweak your exists function. If the check for the entity existing is taking a long time, consider pulling back the entire table and storing it in memory (assuming it's not ridiculously huge). This would remove 8000 separate select statements.
Turn off plugins that may be degrading performance. If you have any plugins registered on the entity, see if performance increases if you disable them during the import.
Create a new, "How do I increase the insert performance" question with your code posted for additional help.
I have not used the CRM application you are referring to, but if you bypass the code you might bypass certain restrictions or even triggers that the code has in place based on certain values sent in.
For example, if you sent a number in through the code, it might perform some mathematical function on that number and add it to some other value and end up storing two values in the database (one value for the number you entered, and another value representing the total including the newly added one).
So if you had just inserted the one value straight into the database, then the total wouldn't get updated with it.
That is just a hypothetical scenario. You may not run into any problems like that or any others, but there could be the chance.
Well i found this article very helpful. It says:
The direct SQL writes to CRM database are not supported.The reason for this is that creating a record in CRM database is so much more than just INSERT INTO…-statement. The first step of optimizing is to understand what happens behind the scenes and can affect the speed:
1. CRM entities usually consist of 2 physical tables.
2. Cascade rules/Sharing: If created record has any relationships with
cascade rules, web service will handle the cascades automatically.
For example cascaded sharing will lead to additional records being
created in PrincipalObjectAccess table. In case of one-time
migrations, disabling the cascade rules while migration runs can
save lot of time
3. Record Ownership: If you are inserting records, make sure you are
setting the owner as an attribute for create and not as an
additional owner assign request. Assigning owner actually takes
4. Money/Time: Web Service handles currencies and time zones.
5. Workflows/Plugins: If the system has any custom workflows and/or
plugins, I strongly recommend pausing them for the duration of
migration.
I am using Action Filter Attributes for loging user activity on certain action which has SQL database interaction. Similarly I can log the activity in the SQL tables using triggers on tables during each activity on the tables. I would like to know which of the above two methods is a best practice ( perfomance wise )
I think that the actionfilter is certainly the cleanest and best practice appraoch since it is in the application layer. Part of the benefit of being there is its managed code and if something breaks you can easily locate the problem. There is also the benefit that all your code is in one spot too.
Database triggers are a big no no in many companies since they have a habit of causing infinite loop well an unknowing programmer creates some logic that steps on the trigger over and over again causing the database to fail. Some companies do allow triggers but very well documented and very lightly used. Hope this helps.
Performance of logging depends greatly on the system architecture. If you have 3 load balanced web servers hitting one main database, triggers would have to handle all the load while Action Filters would split the load in three. In that scenario, Action Filters would be better.
In terms of best practices, I wouldn't use either of those approaches. I would set up Transactional Replication to another SQL server. This approach would run without impacting performance at all. The transaction log is already being generated and replication would just spin up a separate process that's reading that log.
In an app, Users and Cases have a many-to-many relationship. Users pull their list of Cases often, Users can update a single case at a time (a 1-10 second operation, requiring more than one UPDATE). Under READCOMMITTED, any in-use Case would block all associated Users from pulling their list of Cases. Also, the most recent data is a hotspot for both reads and writes to the Cases table.
I think I want to employ dirty reads to keep the experience snappy. READPAST on Cases won't work for this purpose. NOLOCK will work, but I'd like to be able to show which records are dirty when they are listed.
I don't know of any native way to show which records are dirty, so I'm thinking that for each update or insert to Cases, an INUSE flag will be set. This flag must be cleared by the end of the updating transaction such that under READCOMMITTED, this flag will never appear to be set. Note that this is NOT to replace concurrency management, only to show which records are potentially dirty to the User.
My question is whether this is reliable - if we UPDATE two or more fields (INUSE plus the other fields) in a single statement, is it possible that a concurrent NOLOCK query would read some of the new values but not others? If so, is it possible to guarantee that INUSE be set first?
And if I'm thinking about this all wrong, please enlighten me. My ideal situation would be to, in a manageable way, be able to show the values as they were PRIOR to any related transaction so the data is immediately available and always consistent (but partially out-dated). But I don't think this is available - especially in the more complex actual database.
Thanks!
Restating the problem just to be sure: User A on connection A updates two columns (col1, col2) in MyTable. While this is going on, user B on connection B issues a dirty read, selecting data from that row. You are wondering if user B could get, say, the updated value in col1 AND the old/not updated value in col2. Correct?
I have to say: no way could this happen. As I understand it, updates are indeed an atomic transaction, and if you're writing data to the page (in memory), then the entire row update would have to finish on that set of bytes before anything else (another thread) could get access to them.
But I don't know for sure, and I can't imagine how to set up a test to confirm or deny this. The only answer I'd rely on would have to come from someone who actually had a hand in writing the code, or perhaps a Microsoft technician who has similar access. If you don't get any good answers here, posting the question on the appropriate MSDN forum (link) might get a good answer.
Have you considered using SNAPSHOT isolation level? When used for a query, it requires no locks whatsoever, and it gives precisely the semantics that you're asking for:
show the values as they were PRIOR to any related transaction so the data is immediately available and always consistent (but partially out-dated)