When should I be concerned about transaction size? - sql

I have a feature where we need to merge user records. The user chooses a user to keep and a user to discard.
We have 59 relations to user records in our database, so for each one I need to do:
UPDATE (Table) set UserNo=(userToKeep) WHERE UserNo=(userToDiscard)
and then DELETE the userToDiscard and their user prefs (118).
Should I be worried about the transaction size? This is MS-SQL 2005.
Is there anything I could do?
Thanks

Have you tested how long the process actually takes? How often are users merged?
If you have indexes on the user ID in each of these tables (and I would think that would be the natural thing to do anyway), then even with 59 tables it shouldn't take too long to perform those updates and deletes. If you only actually merge users a couple of times a week, then a little blip like that shouldn't be an issue. At worst, someone has to wait an extra couple of seconds to do something once or twice a week.
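For what it's worth, here is a minimal sketch of the merge wrapped in a single transaction so it either fully succeeds or fully rolls back; the table names are placeholders (the real schema has 59 referencing tables plus the user prefs):
DECLARE @userToKeep INT, @userToDiscard INT;
SET @userToKeep = 123;     -- example values only
SET @userToDiscard = 456;

BEGIN TRANSACTION;

-- One UPDATE per table that references UserNo (59 of them in the real schema)
UPDATE Orders   SET UserNo = @userToKeep WHERE UserNo = @userToDiscard;
UPDATE Invoices SET UserNo = @userToKeep WHERE UserNo = @userToDiscard;
-- ... and so on for the remaining referencing tables ...

-- Finally remove the discarded user and their prefs
DELETE FROM UserPrefs WHERE UserNo = @userToDiscard;
DELETE FROM Users     WHERE UserNo = @userToDiscard;

COMMIT TRANSACTION;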
Another option would be to save these user merge requests in a table and do the actual work in a nightly process (or whenever "off-hours" is for your application). You would need to make it clear to the users of your application, though, that merges do not take effect immediately. You would also need to account for a number of possible contingencies: what if the same user is set to merge with two different users that night, and so on.
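If you go the deferred route, the queue could be as simple as a table like this (all names here are made up):
CREATE TABLE UserMergeRequests (
    RequestId     INT IDENTITY(1,1) PRIMARY KEY,
    UserToKeep    INT NOT NULL,
    UserToDiscard INT NOT NULL,
    RequestedAt   DATETIME NOT NULL DEFAULT GETDATE(),
    ProcessedAt   DATETIME NULL
);
-- A nightly job picks up rows where ProcessedAt IS NULL, runs the merge
-- transaction for each, stamps ProcessedAt, and resolves conflicts such as
-- the same user appearing as both a keep and a discard target in one batch.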

It depends on how large your user table is, and what indexes you have in place.

Merging users does not sound like a feature that would be used very often. Given that, there's a 98% probability you shouldn't worry about transaction size (the remaining 2% is reserved for possible deadlocks).

Generally, transactions should be the smallest size they need to be, to minimize contention and possible deadlock situations (although making them too small can cause overhead as well). Would the queries that go against these tables give incorrect results if some of the rows were changed first and others later? Depending on your application, this could cause a business problem.
Any idea how many rows will be updated in each table? If each user could have millions of rows in a table, you might need to be more careful than if there are a handful of rows in each table.

Related

Maintenance of application log files - sql

I want to create a log table to keep track of users and their actions on the website. For example, when a user logs in, a record will be created in the log table. When a user creates information, a record will be created in the log table. Similarly, for every action, a record will be created in the log table. In this way, the log table will grow very fast. What is a better way to maintain such big tables, apart from creating triggers and scheduling scripts to clean the data frequently?
From my experience, excessive logging typically doesn't gain you much. A lot of people lose the usefulness of logging in the sheer volume of it... just a little warning beforehand.
As for maintaining a table that size, I recommend partitioning the table and writing a specific set of stored procedures that make effective use of a few indexes you place on the table. Any ad-hoc work on the table should be done minimally, and if it is done, make sure the ad-hoc queries hit one of the indexes you set up on the table. Also, WITH (NOLOCK) will be your friend for SELECT statements if a large amount of inserting is going on.
This is the general approach I take for the transaction tables I handle, and they typically get around 1-2 million rows a day.
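To make the NOLOCK part concrete, a hypothetical example against a busy log table (table, column and index names are made up):
-- Supporting index so per-user lookups don't scan the whole log
CREATE NONCLUSTERED INDEX IX_UserActionLog_User_Date
    ON dbo.UserActionLog (UserId, CreatedAt);

-- WITH (NOLOCK) reads without taking shared locks, so it won't block or be
-- blocked by the constant inserts - at the cost of possibly seeing dirty rows.
DECLARE @UserId INT;
SET @UserId = 42;

SELECT TOP (100) LogId, UserId, Action, CreatedAt
FROM dbo.UserActionLog WITH (NOLOCK)
WHERE UserId = @UserId
ORDER BY CreatedAt DESC;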

What is a good way to manage large ever growing tables in a database?

I am building a web application for medical record keeping. A requirement for this application is logging all changes (view, create, update, delete) to a patient's data and pretty much any other useful info in the system (login, cron run, data export, etc.).
I am currently storing the data in a database table, which is working fine. However, it is likely this table will grow unwieldy very quickly and bloat the database. I am not allowed to delete log entries.
My current plan is to choose an arbitrary size (such as 1 million entries, large but still manageable). When the table hits 1 million entries, I would move the 100,000 oldest entries into a file and store it on our file server.
Does anyone have any experience with this issue that has other/better ideas on how to handle it?
Additional info:
My primary concern is that nothing will ever be deleted from this data. However, the data does not necessarily need to be accessed after several months. Since this data could logically hit 1 billion entries within a couple of years (and I have 300 copies of this db that all include this table), what is a good way to manage the size and performance? This table needs to be displayed through a pager, which is obviously going to be an issue when it breaks 1 million entries, let alone 1 billion.
Cases like this are tailor-made for partitioning. Using a partitioning strategy, you span your data across multiple tables. This helps to balance I/O, speed up access times for partition-specific queries, etc. This is a discipline in and of itself, and the choice of partitioning key is crucial. In many cases such as log data like this, people often partition on a datetime value.
Partitioned Tables and Indexes (SQL Server)
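To give a rough idea, a minimal sketch of date-based partitioning in SQL Server - the boundary dates, filegroup and names below are purely illustrative:
-- One partition per monthly boundary listed here; new boundaries get added over time
CREATE PARTITION FUNCTION pfLogByMonth (DATETIME)
AS RANGE RIGHT FOR VALUES ('2013-01-01', '2013-02-01', '2013-03-01');

-- Map every partition to PRIMARY for simplicity (real setups often spread filegroups)
CREATE PARTITION SCHEME psLogByMonth
AS PARTITION pfLogByMonth ALL TO ([PRIMARY]);

-- The log table lives on the scheme, keyed by the datetime column
CREATE TABLE dbo.AuditLog (
    LogId    BIGINT IDENTITY(1,1) NOT NULL,
    LoggedAt DATETIME NOT NULL,
    UserId   INT NOT NULL,
    Action   VARCHAR(50) NOT NULL,
    Detail   VARCHAR(MAX) NULL,
    CONSTRAINT PK_AuditLog PRIMARY KEY (LoggedAt, LogId)
) ON psLogByMonth (LoggedAt);
Old partitions can then be switched out to an archive table almost instantly instead of deleting or moving rows one by one.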

Removing rows without transaction logging?

We have a pretty big table with hundreds of millions of rows. It takes about 5-15 minutes to remove the rows for a specific foreign key value. For example, removing 8 million rows takes 15 minutes.
The question is: does the removal of the rows actually free up space, given that the database has transaction logging on? Can I remove rows while bypassing transaction logging for that operation?
In simple terms, you can't get around the transaction logging. That's just how the database ensures consistency - if the transaction fails halfway through (or the server's power fails, for example), the database engine needs to know how to get into a consistent state again. Also, appending the things to be changed into the transaction log is much faster than actually performing a change on the data files of the DB, especially in cases like yours.
There are a few special cases where it's safe to get around those things - TRUNCATE TABLE will remove all the rows at once, but only if the table is not referenced by any foreign keys, which makes it rather trivial. You can't limit it in any way, though.
The newly free space will be reclaimed as part of the database maintenance cycle. During each database backup, the database is synchronized to have all the data written in the data files, and the transaction log is backed up and emptied in the DB itself (I'm oversimplifying, since there's a lot of possible configurations - in any case, this is something your DBA should care about).
If this is posing a problem to you, the solution wouldn't be to get around the transaction logging anyway. You probably want to ask why (and how often) you need to delete millions of rows at a time.
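That said, if the real pain is log growth and long-held locks during those big deletes, the usual workaround is not to bypass logging (you can't) but to delete in smaller batches so each transaction commits quickly. A rough sketch, with a made-up table and key value:
DECLARE @rows INT;
SET @rows = 1;

WHILE @rows > 0
BEGIN
    -- Each batch is its own small transaction; the log space can be reused
    -- between batches (given SIMPLE recovery or regular log backups).
    DELETE TOP (10000) FROM dbo.BigTable
    WHERE ForeignKeyId = 12345;

    SET @rows = @@ROWCOUNT;
END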

Should I create multiple tables, or even databases for multiple users of a CRM

I'm working on creating an application best described as a CRM. There is a relatively complex table structure, and I'm thinking about allowing users to do a fair bit of customization (adding fields and the like). One concern is that I will be reaching a certain level of scale almost immediately. We have about 50,000 individual users who will be coming online within about nine months of launch. So I want to build to last.
I'm thinking about two and maybe even three options.
One table set with a userID column on everything, with custom attributes handled by creating a table that indexes the custom attributes and another table that holds their values, which can then be joined to the existing contact records for the user. -- From what I've read, this seems like the right option, but I keep feeling like it's not. It seems like once these tables start reaching millions of records, searching for just one user's records in every query is going to become a database hog.
For each user account, recreate the table set, prefixed with a unique identifier (the userID, for example). Then rather than using a WHERE userID=? everywhere, I can use a FROM ?_contacts. For attributes I could then have a custom attributes table where users could add additional columns for custom attributes. -- This feels like the simplest way to go, though of course when I decide to change the database structure there would be a migration from hell.
The third option, which I'm pretty confident is wrong but for that reason alone cannot rule out, is that a new database should be created for each user, with all the requisite tables.
Am I crazy? Is option one really the best?
The first method is the best. Create individual userIds and then you can assign specific roles to them. Database retrieval time does depend on the number of records, but there is a trade-off: you can write efficient SQL queries to fetch the data. According to this site, you probably won't run out of memory or run into concurrency issues, because with a good server the performance ought to be fine, provided that you are efficient in writing queries.
If you recreate the table set for each user, you will just end up creating lots of tables, which can make indexing slow and is bad practice. Instead, stick with a properly normalized relational schema and normalize the database and data tables to improve efficiency.
Creating a new database for each and every user just combines the complexity of both the options above, resulting in shabby and disorganized database access. If you decide to run individual database instances for every single user, you will end up consuming your server's physical resources (RAM and CPU), which will affect the service quality of all the other users.
Go with option 1. Assign separate userIds and give them roles and privileges where needed. That is more efficient than the other two methods.
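For what it's worth, a rough sketch of what option 1's attribute/value tables might look like - all names here are illustrative:
CREATE TABLE Contacts (
    ContactId INT IDENTITY(1,1) PRIMARY KEY,
    UserId    INT NOT NULL,              -- the owning CRM user
    Name      NVARCHAR(200) NOT NULL
);

CREATE TABLE CustomAttributes (
    AttributeId   INT IDENTITY(1,1) PRIMARY KEY,
    UserId        INT NOT NULL,          -- each user defines their own attributes
    AttributeName NVARCHAR(100) NOT NULL
);

CREATE TABLE CustomAttributeValues (
    ContactId   INT NOT NULL REFERENCES Contacts(ContactId),
    AttributeId INT NOT NULL REFERENCES CustomAttributes(AttributeId),
    Value       NVARCHAR(MAX) NULL,
    PRIMARY KEY (ContactId, AttributeId)
);

-- With this index, "give me this user's contacts" is a seek, not a scan,
-- even when the tables hold millions of rows across all users.
CREATE INDEX IX_Contacts_UserId ON Contacts (UserId);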

Are single statement UPDATES atomic, regardless of the isolation level? (SQL Server 2005)

In an app, Users and Cases have a many-to-many relationship. Users pull their list of Cases often, Users can update a single case at a time (a 1-10 second operation, requiring more than one UPDATE). Under READCOMMITTED, any in-use Case would block all associated Users from pulling their list of Cases. Also, the most recent data is a hotspot for both reads and writes to the Cases table.
I think I want to employ dirty reads to keep the experience snappy. READPAST on Cases won't work for this purpose. NOLOCK will work, but I'd like to be able to show which records are dirty when they are listed.
I don't know of any native way to show which records are dirty, so I'm thinking that for each update or insert to Cases, an INUSE flag will be set. This flag must be cleared by the end of the updating transaction such that under READCOMMITTED, this flag will never appear to be set. Note that this is NOT to replace concurrency management, only to show which records are potentially dirty to the User.
My question is whether this is reliable - if we UPDATE two or more fields (INUSE plus the other fields) in a single statement, is it possible that a concurrent NOLOCK query would read some of the new values but not others? If so, is it possible to guarantee that INUSE be set first?
And if I'm thinking about this all wrong, please enlighten me. My ideal situation would be to, in a manageable way, be able to show the values as they were PRIOR to any related transaction so the data is immediately available and always consistent (but partially out-dated). But I don't think this is available - especially in the more complex actual database.
Thanks!
Restating the problem just to be sure: User A on connection A updates two columns (col1, col2) in MyTable. While this is going on, user B on connection B issues a dirty read, selecting data from that row. You are wondering if user B could get, say, the updated value in col1 AND the old/not updated value in col2. Correct?
I have to say: no way could this happen. As I understand it, updates are indeed an atomic transaction, and if you're writing data to the page (in memory), then the entire row update would have to finish on that set of bytes before anything else (another thread) could get access to them.
But I don't know for sure, and I can't imagine how to set up a test to confirm or deny this. The only answer I'd rely on would have to come from someone who actually had a hand in writing the code, or perhaps a Microsoft technician who has similar access. If you don't get any good answers here, posting the question on the appropriate MSDN forum (link) might get a good answer.
Have you considered using SNAPSHOT isolation level? When used for a query, it requires no locks whatsoever, and it gives precisely the semantics that you're asking for:
show the values as they were PRIOR to any related transaction so the data is immediately available and always consistent (but partially out-dated)
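Roughly, enabling and using it looks like this (the database, table and column names are placeholders):
-- One-time setup at the database level
ALTER DATABASE MyAppDb SET ALLOW_SNAPSHOT_ISOLATION ON;

-- Readers opt in per session; they see committed data as of the start of the
-- transaction, without blocking the writers or being blocked by them.
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;

SELECT c.CaseId, c.Status
FROM dbo.Cases AS c
WHERE c.AssignedUserId = 42;

COMMIT TRANSACTION;
There is also READ_COMMITTED_SNAPSHOT, which switches the default READ COMMITTED behaviour over to versioned reads without any code changes.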