Best way to do 10,000 inserts in a SQL db? - sql

What happens is, whenever I upload media onto my site, everyone gets a notification. Each person can click a box to dismiss it, and it will be gone from their message queue forever.
If I had 10,000 people on my site, how would I add the notification to every person's message queue? I can imagine it takes a lot of time, so should I opt for something like a filesystem journal: mark that I need to notify people, record the data and my current position, then update my position every 100 inserts or so? I would need a PK on my watcher list so that if anyone registers in the middle of it my ordering won't break, since I'll be sorting by PK.
Is this the best solution for a mass notification system?
-edit-
This site is a user-created content site. Admins can send out global messages, and popular people may have thousands of subscribers.

If 10,000 inserts into a narrow many-to-many table linking the recipients to the messages (recipientid, messageid, status) are slow, I expect you've got bigger problems with your design.
This is the kind of operation where I wouldn't typically even worry about batching, or about people subscribing in the middle of the post operation. Basically:
Assuming @publisherid and @msg are known, on SQL Server:
BEGIN TRANSACTION
INSERT INTO msgs (publisherid, msg)
VALUES (@publisherid, @msg)
DECLARE @messageid int
SET @messageid = SCOPE_IDENTITY()
INSERT INTO msgqueue (recipientid, messageid, status)
SELECT subscriberid, @messageid, 0 -- unread
FROM subscribers
WHERE subscribers.publisherid = @publisherid
COMMIT TRANSACTION

Maybe just record, for each user, which notifications they have seen - so the set of notifications to show a user is the ones created after their "earliest_notification" horizon (when they registered, or a week ago...) minus the ones they have acknowledged. That way you delay inserting anything until it's a single user at a time - plus, if you only show users messages less than a week old, you can purge the read-this-notification flags that are a week or more old.
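A minimal sketch of that idea, assuming hypothetical notifications, users and acks tables (the names and columns are illustrative, not from the question):
-- Notifications to show one user: newer than their horizon, not yet acknowledged
SELECT n.id, n.body
FROM notifications n
JOIN users u ON u.id = @userid
WHERE n.created_at >= u.earliest_notification
  AND NOT EXISTS (SELECT 1 FROM acks a
                  WHERE a.user_id = u.id AND a.notification_id = n.id)

-- Weekly cleanup: drop acknowledgement flags for notifications older than a week
DELETE FROM acks
WHERE notification_id IN (SELECT id FROM notifications
                          WHERE created_at < DATEADD(day, -7, GETDATE()))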
(Performance optimisation hint from my DBA at my old job: "business processes are the easiest things to change, look at them first".)

I'd say just let the database take care of things it is well capable of and designed to do: inserting and managing data. Don't try to do it in application code; just write the SQL to insert the data all in one go. 10,000 rows is a doddle for any real database.

IMHO inserting records may not be the most efficient solution to this problem. Can you use a client-side cookie to store whether the user removed the notification, or do you have to keep track of this even if they clear cookies? If you upload a new video, the app can just compare the cookie with the new video record ID and decide to show or hide the notification based on what's stored in the cookie. This would save you a ton of database inserts and keep most of the heavy lifting on the client.

If you're using Postgres, you can use the COPY command which is the fastest method of them all:
COPY tablename (col1, col2, col3) FROM '/path/to/tabfile';
where tabfile is a tab-separated file with lots of entries. This will fail if the table has UNIQUE constraints and the file contains duplicates, though.
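If duplicates are a concern, one workaround (a sketch, assuming PostgreSQL 9.5+ for ON CONFLICT and an illustrative staging table name) is to COPY into a staging table first and then insert only the rows that don't violate the constraint:
-- Stage the raw file, then merge it into the real table
CREATE TEMP TABLE tablename_stage (LIKE tablename INCLUDING DEFAULTS);
COPY tablename_stage (col1, col2, col3) FROM '/path/to/tabfile';
INSERT INTO tablename (col1, col2, col3)
SELECT col1, col2, col3 FROM tablename_stage
ON CONFLICT DO NOTHING;   -- skip rows that would violate a unique constraint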

Related

Can you view the contents of a SQL (Server) table in realtime?

Is it possible to have a table or view which updates in real-time - so I can see the changes without refreshing?
The nearest thing I ever found was
raiserror('',0,1) with nowait --to flush the buffer
print 'hello' --say hello
waitfor delay '00:00:01' --pause for 1 second
GO 5 --loop 5 times
but obviously using that for a SELECT gives you multiple result sets rather than a refreshing table.
Query Notifications can notify you in real time when changes in the table occur, but you will have to query the table again to see what changed. At the least, it eliminates polling. As a cache-invalidation solution, it is intended to be used with relatively static data that changes seldom.
For frequently changing data, the best approach is to poll and have a way to return only the changes (e.g. an updated_at column), but it is rather trickier to detect deletes.
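A minimal sketch of that polling pattern, assuming the table has an updated_at column maintained by the application or a trigger (an assumption, not something SQL Server gives you for free):
-- Poll for anything changed since the last poll; remember MAX(updated_at) for next time
SELECT *
FROM mytable
WHERE updated_at > @last_poll_time
-- Deletes are the tricky part: either soft-delete (an is_deleted flag) so they
-- show up as updates, or log them to a separate audit table.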
Change Data Capture is a technology that records changes (and makes discovery of deletes trivial) and lets you query for them, but it is intended for occasionally connected systems (e.g. phones updating a local snapshot from the mothership database), not for live change monitoring.

Efficiently detecting concurrent insertions using standard SQL

The Requirements
I have a following table (pseudo DDL):
CREATE TABLE MESSAGE (
MESSAGE_GUID GUID PRIMARY KEY,
INSERT_TIME DATETIME
)
CREATE INDEX MESSAGE_IE1 ON MESSAGE (INSERT_TIME);
Several clients concurrently insert rows in that table, possibly many times per second. I need to design a "Monitor" application that will:
Initially, fetch all the rows currently in the table.
After that, periodically check if there are any new rows inserted and then fetch these rows only.
There may be multiple Monitors concurrently running. All the Monitors need to see all the rows (i.e. when a row is inserted, it must be "detected" by all the currently running Monitors).
This application will be developed for Oracle initially, but we need to keep it portable to every major RDBMS and would like to avoid as much database-specific stuff as possible.
The Problem
The naive solution would be to simply find the maximal INSERT_TIME in rows selected in step 1 and then...
SELECT * FROM MESSAGE WHERE INSERT_TIME >= :max_insert_time_from_previous_select
...in step 2.
However, I'm worried this might lead to race conditions. Consider the following scenario:
Transaction A inserts a new row but does not yet commit.
Transaction B inserts a new row and commits.
The Monitor selects rows and sees that the maximal INSERT_TIME is the one inserted by B.
Transaction A commits. At this point, A's INSERT_TIME is actually earlier than B's (A's INSERT was actually executed before B's, before we even knew who was going to commit first).
The Monitor selects rows newer than B's INSERT_TIME (as a consequence of step 3). Since A's INSERT_TIME is earlier than B's insert time, A's row is skipped.
So, the row inserted by transaction A is never fetched.
Any ideas how to design the client SQL or even change the database schema (as long as it is mildly portable), so these kinds of concurrency problems are avoided, while still keeping a decent performance?
Thanks.
Without using any of the platform-specific change data capture (CDC) technologies, there are a couple of approaches.
Option 1
Each Monitor registers a sort of subscription to the MESSAGE table. The code that writes messages then writes each MESSAGE once per Monitor, i.e.
CREATE TABLE message_subscription (
  message_subscription_id NUMBER PRIMARY KEY,
  message_id RAW(32) NOT NULL,
  monitor_id NUMBER NOT NULL,
  CONSTRAINT uk_message_sub UNIQUE (message_id, monitor_id)
);
INSERT INTO message_subscription
  SELECT message_subscription_seq.nextval,
         sys_guid,
         monitor_id
    FROM monitor_subscribers;
Each Monitor then deletes the message from its subscription once it has been processed.
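A sketch of the read-and-dequeue step for one Monitor (Oracle-style binds; the join column names are assumed to match the tables above):
-- Fetch this Monitor's pending messages
SELECT m.*
  FROM message m
  JOIN message_subscription ms ON ms.message_id = m.message_guid
 WHERE ms.monitor_id = :monitor_id
 ORDER BY m.insert_time;

-- After processing a message, remove it from this Monitor's subscription
DELETE FROM message_subscription
 WHERE monitor_id = :monitor_id
   AND message_id = :message_id;
COMMIT;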
Option 2
Each Monitor maintains a cache of the recent messages it has processed that is at least as long as the longest-running transaction could be. If the Monitor maintained a cache of the messages it has processed for the last 5 minutes, for example, it would query your MESSAGE table for all messages later than its LAST_MONITOR_TIME. The Monitor would then be responsible for noting that some of the rows it had selected had already been processed. The Monitor would only process MESSAGE_ID values that were not in its cache.
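A minimal sketch of the Option 2 poll (Oracle-style bind; :last_monitor_time is an assumption and should deliberately lag, e.g. by the 5-minute cache window, so rows committed late by long-running transactions still fall inside it):
SELECT message_guid, insert_time
  FROM message
 WHERE insert_time > :last_monitor_time
 ORDER BY insert_time;
-- The Monitor then ignores any MESSAGE_GUID already in its processed cache.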
Option 3
Just like Option 1, you set up subscriptions for each Monitor but you use some queuing technology to deliver the messages to the Monitor. This is less portable than the other two options but most databases can deliver messages to applications via queues of some sort (e.g. JMS queues if your Monitor is a Java application). This saves you from reinventing the wheel by building your own queue table and gives you a standard interface in the application tier to code against.
You need to be able to identify all rows added since the last time you checked (i.e. the monitor checks). You have a "time of insert" column. However, as you spell it out, that time of insert column cannot be used with "greater than [last check]" logic to reliably identify subsequently inserted new items. Commits do not occur in the same order as (initial) inserts. I am not aware of anything that works on all major RDBMSs that would clearly and safely apply such an "as of" tag at the actual time of commit. [This is not to say I would know it if such a thing existed, but it seems a pretty safe guess to me.] Thus, you will have to use something other than a "greater than last check" algorithm.
That leads to filtering. Upon insert, an item (row) is flagged as "not yet checked"; when a monitor logs in, it reads all not-yet-checked items, returns that set, and flips the flag to "checked" (and if there are multiple monitors, this must all be done within its own transaction). The monitors' queries will have to read all the data and pick out the rows which have not yet been checked. The implication, however, is that this will be a fairly small set of data, at least relative to the entire set of data. From here, I see two likely options:
Add a column, perhaps "Checked". Store a binary 1/0 value for is/isn't checked. The distribution of this value will be extremely skewed -- 99.9% checked, 0.1% unchecked -- so it should be rather efficient. (Some RDBMSs provide filtered indexes, such that the checked rows won't even be in the index; once flipped to checked, a row will presumably never be flipped back, so the overhead to support this shouldn't be too great.) See the sketch after this list.
Add a separate table identifying those rows in the "primary" table that have not yet been checked. When a monitor logs in, it reads and deletes the items from that table. This doesn't seem efficient... but again, if the data set involved is small, the overall performance pain might be acceptable.
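A minimal sketch of the flag approach for a single Monitor; the CHECKED column is an assumption. Both statements should run in one transaction under SERIALIZABLE isolation (or after locking the rows with SELECT ... FOR UPDATE), so a row inserted between the two statements cannot be flipped to checked without ever being read:
-- Read everything not yet seen...
SELECT message_guid, insert_time
  FROM message
 WHERE checked = 0;

-- ...then mark that same set as seen and commit
UPDATE message
   SET checked = 1
 WHERE checked = 0;
COMMIT;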
You should use Oracle AQ with a multi-subscriber queue.
This is Oracle specific, but you can create an abstraction layer of stored procedures (or abstract in Java if you like) so that you have a common API to enqueue the new messages and have each subscriber (monitor) dequeue any pending messages. Behind that API, for Oracle you use AQ.
I am not sure if there is a queuing solution for other databases.
I don't think you will be able to come up with a totally database-agnostic approach that meets your requirements. You could extend the earlier example that included the 'checked' column to have a second table called monitor_checked, which would contain one row per message per monitor. That is basically what AQ does behind the scenes, so it is sort of reinventing the wheel.
With PostgreSQL, use PgQ. It has all those little details worked out for you.
I doubt you will find a robust and manageable database-agnostic solution for this.

Creating a variable on database to hold global stats

Let's pretend I've got a social network.
I always show the user how many users are registered and have activated their profiles.
So, every time a single user logs in, the app goes to the DB and runs:
select count(*) from users where status = 'activated'
so if 5,000 users log in, or simply refresh the page, that's 5,000 executions of the SQL above.
I was wondering whether it would be better to have a variable somewhere (I still have no idea where to put it) that is incremented every time a user activates their profile, so that when I want to show how many users are registered on that social network, I only read the value of this variable.
How can I do this? Is it really a better solution than what I've got?
You could use an indexed view, that SQL Server will automatically maintain:
create table dbo.users (
ID int not null,
Activated bit not null
)
go
create view dbo.user_status_stats (Activated,user_count)
with schemabinding
as
select Activated,COUNT_BIG(*) from dbo.users group by Activated
go
create unique clustered index IX_user_status_stats on dbo.user_status_stats (Activated)
go
This just has two possible statuses, but could expand to more using a different data type. As I say, in this case, SQL Server will maintain the counts behind the scenes, so you can just query the view:
SELECT user_count from user_status_stats with (NOEXPAND) where Activated = 1
and it won't have to query the underlying table. You need to use the WITH (NOEXPAND) hint on editions other than Enterprise/Developer.
Although, as @Jim suggested, doing a COUNT(*) against an index should also be pretty quick when the index column(s) can satisfy the query criteria using equality comparisons.
As you've already guessed - it's not a great idea to calculate this value every time someone hits the site.
You could do as you suggest, and update a central value as users are added, although you'll have to ensure that you don't end up with two processes updating the number simultaneously.
Alternatively you could have a job which runs your SQL routinely and updates the central 'user count' value.
As another alternative, you could use something like MemCache to hold the calculated value for a period of time, and then recalculate it when the cache expires.
There are a few options you could consider:
1) like you say, maintain a global count each time a profile is activated to save the hit on the users table each time. You could just store that count in a "Stats" table and then query that value from there.
2) don't show the actual "live" count, show a count that's "pretty much up to date" - e.g. cache the count in your application and have the value expire periodically so you then requery the count less frequently. Or if you store the count in a "Stats" table per above, you could have a scheduled job that updates the count every hour, instead of every time a profile is activated.
It depends whether you want to show the exact figure in real time or whether you can live with a delay. Obviously, data volumes matter too - if you have a large database, then a slightly out-of-date cached value could be worthwhile.
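A minimal sketch of the "Stats" table idea from option 1 (the table and column names are assumptions); because the increment is a single atomic UPDATE, it also sidesteps the two-processes-updating-at-once problem mentioned above:
-- One-row table holding the cached count
CREATE TABLE Stats (activated_user_count int NOT NULL)
INSERT INTO Stats (activated_user_count) VALUES (0)

-- Run whenever a profile is activated (a single UPDATE is atomic)
UPDATE Stats SET activated_user_count = activated_user_count + 1

-- Cheap read for every page view
SELECT activated_user_count FROM Stats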
From a purely SQL Server standpoint, no, you are not going to find a better way of doing this. Unless, perhaps, your social network is Facebook-sized. Denormalizing your data design (such as keeping a count in a separate table) creates possible sources of the data getting out of sync. It doesn't have to get out of sync if it is coded properly, but it can...
Just make sure that you have an index on Status; then SQL will not scan the table for the count, but will scan the index instead. The index will be much smaller (that is, more data will fit in a disk page). If you were to convert your status to an int, smallint, or tinyint you would get even more index rows per disk page and thus much less IO. To get your description ('activated', etc.), use a reference table. The reference table would be so small, SQL would just keep the whole thing in RAM after the first access.
Now, if you still think this is too much overhead (and it shouldn't be) you could come up with a hybrid method. You could store your count in a separate table (which SQL would keep in RAM if it is just the one record) or, assuming your site is in ASP.NET, you could create an Application variable to keep track of the count. You could increment it in Session_Start and decrement it in Session_End. But you will have to come up with a way of making the increment and decrement thread-safe so two sessions don't try to update the value at the same time.
You can also use a global temporary table; you will always get fast retrieval, even if you are polling every 30 seconds. The example triggers (Link1, Link2) will maintain such activity in this table.

Deleting rows from a contended table

I have a DB table in which each row has a randomly generated primary key, a message and a user. Each user has about 10-100 messages but there are 10k-50k users.
I write the messages daily for each user in one go. I want to throw away the old messages for each user before writing the new ones to keep the table as small as possible.
Right now I effectively do this:
delete from table where user='mk'
Then write all the messages for that user. I'm seeing a lot of contention because I have lots of threads doing this at the same time.
I do have an additional requirement to retain the most recent set of messages for each user.
I don't have access to the DB directly. I'm trying to guess at the problem based on some second-hand feedback. The reason I'm focusing on this scenario is that the delete query is showing a lot of wait time (again - to the best of my knowledge), plus it's a newly added bit of functionality.
Can anyone offer any advice?
Would it be better to:
select key from table where user='mk'
Then delete individual rows from there? I'm thinking that might lead to less brutal locking.
If you do this every day for every user, why not just delete every record from the table in a single statement? Or even
truncate table whatever reuse storage
/
edit
The reason why I suggest this approach is that the process looks like a daily batch upload of user messages preceded by a clearing-out of the old messages. That is, the business rule seems to me to be "the table will hold only one day's worth of messages for any given user". If this process is done for every user, then a single operation would be the most efficient.
However, if users do not get a fresh set of messages each day and there is a subsidiary rule which requires us to retain the most recent set of messages for each user then zapping the entire table would be wrong.
No, it is always better to perform a single SQL statement on a set of rows than a series of "row-by-row" (or what Tom Kyte calls "slow-by-slow") operations. When you say you are "seeing a lot of contention", what are you seeing exactly? An obvious question: is column USER indexed?
(Of course, the column name can't really be USER in an Oracle database, since it is a reserved word!)
EDIT: You have said that column USER is not indexed. This means that each delete will involve a full table scan of up to 50K*100 = 5 million rows (or at best 10K * 10 = 100,000 rows) to delete a mere 10-100 rows. Adding an index on USER may solve your problems.
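A one-line sketch of that fix, assuming the column is really named something other than USER (which is a reserved word in Oracle); MESSAGES and USERNAME here are stand-ins for the real names:
CREATE INDEX messages_username_ix ON messages (username);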
Are you sure you're seeing lock contention? It seems more likely that you're seeing disk contention due to too many concurrent (but unrelated) updates. The solution to that is simply to reduce the number of threads you're using: less disk contention will mean higher total throughput.
I think you need to define your requirements a bit more clearly...
For instance: if you know all of the users you want to write messages for, insert their IDs into a temp table, index it on ID, and batch-delete. Then the threads you are firing off only do two things: write the ID of the user to one temp table, and write the messages to another temp table. When the threads have finished executing, the main thread should run something like:
DELETE FROM Messages WHERE ID IN (SELECT TEMP_ID FROM TEMP_MEMBERS)
INSERT INTO Messages SELECT * FROM TEMP_MESSAGES
I'm not familiar with Oracle syntax, but that is the way I would approach it if the users' messages are all written in rapid succession.
Hope this helps
TALK TO YOUR DBA
He is there to help you. When we DBAs take access away from the developers for something such as this, it is assumed we will provide the support for you for that task. If your code is taking too long to complete and that time appears to be tied up in the database, your DBA will be able to look at exactly what is going on and offer suggestions or possibly even solve the problem without you changing anything.
Just glancing over your problem statement, it doesn't appear you'd be looking at contention issues, but I don't know anything about your underlying structure.
Really, talk to your DBA. He will probably enjoy looking at something fun instead of planning the latest CPU deployment.
This might speed things up:
Create a lookup table:
create table rowid_table (row_id ROWID, user VARCHAR2(100));
create index rowid_table_ix1 on rowid_table (user);
Run a nightly job:
truncate table rowid_table;
insert /*+ append */ into rowid_table
select ROWID row_id , user
from table;
exec dbms_stats.gather_table_stats('SCHEMAOWNER','ROWID_TABLE');
Then when deleting the records:
delete from table
where ROWID IN (select row_id
from rowid_table
where user = 'mk');
Your own suggestion seems very sensible. Locking in small batches has two advantages:
the transactions will be smaller
locking will be limited to only a few rows at a time
Locking in batches should be a big improvement.
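A sketch of what a batched delete could look like in Oracle PL/SQL (the table and column names are stand-ins for the real ones):
-- Delete the user's old messages a few hundred rows at a time,
-- committing between batches so locks are held only briefly.
BEGIN
  LOOP
    DELETE FROM messages
     WHERE username = 'mk'
       AND ROWNUM <= 500;
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;
  END LOOP;
  COMMIT;
END;
/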

Best approach to cache Counts from SQL tables?

I would like to develop a Forum from scratch, with special needs and customization.
I would like to prepare my forum for intensive usage and am wondering how to cache things like user post counts and user reply counts.
Having only three tables, tblForum, tblForumTopics, tblForumReplies, what is the best approach to caching the user topic and reply counts?
Think of a simple scenario: a user clicks a link, opens the Replies.aspx?id=x&page=y page, and starts reading replies. On the HTTP request, the server will run a SQL command which fetches all replies for that page, also inner joining with tblForumReplies to find out the number of replies for each user that replied:
select
tblForumReplies.*,
tblFR.TotalReplies
from
tblForumReplies
inner join
(
select IdRepliedBy, count(*) as TotalReplies
from tblForumReplies
group by IdRepliedBy
) as tblFR
on tblFR.IdRepliedBy = tblForumReplies.IdRepliedBy
Unfortunately this approach is very CPU-intensive, and I would like to hear your ideas on how to cache things like table counts.
If I count replies for each user on insert/delete and store the count in a separate field, how do I synchronize it with manual data changes? Suppose I manually delete replies directly in SQL.
These are the three approaches I'd be thinking of:
1) Maybe SQL Server performance will be good enough that you don't need to cache. You might be underestimating how well SQL Server can do its job. If you do your joins right, it's just one query to get all the counts of all the users that are in that thread. If you are thinking of this as one query per user, that's wrong.
2) Don't cache. Redundantly store the user counts on the user table. Update the user row whenever a post is inserted or deleted.
3) If you have thousands of users, even many thousand, but not millions, you might find that it's practical to cache user and their counts in the web layer's memory - for ASP.NET, the "Application" cache.
I would not bother with caching until I'm sure I need it. In my experience, there is no way to predict the places that will require caching. Try an iterative approach: implement without a cache, then gather statistics, and then implement the right kind of caching (there are many kinds, like content, data, aggregates, distributed, and so on).
BTW, I do not think that your query is CPU-consuming. SQL Server will optimize that stuff and COUNT(*) will run in ticks...
tbl prefixes suck -- as much as Replies.aspx?id=x&page=y URIs do. Consider ASP.NET MVC, or at least its routing part.
Second, do not optimize prematurely. However, if you really need to, denormalize your data: add a TotalReplies column to your ForumTopics table and either rely on your DAL/BL to keep this field up to date (possibly with a scheduled task to resync it), or use triggers.
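A minimal trigger sketch of that denormalization (T-SQL; the TotalReplies column on tblForumTopics and the IdTopic key on tblForumReplies are assumptions made to match the question's naming). Because a trigger fires even when rows are deleted manually in SQL, it also addresses the synchronization worry above:
CREATE TRIGGER trg_ForumReplies_Count
ON tblForumReplies
AFTER INSERT, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Add the newly inserted replies to each topic's cached count
    UPDATE t
       SET t.TotalReplies = t.TotalReplies + i.cnt
      FROM tblForumTopics t
      JOIN (SELECT IdTopic, COUNT(*) AS cnt FROM inserted GROUP BY IdTopic) i
        ON i.IdTopic = t.IdTopic;

    -- Subtract the deleted replies
    UPDATE t
       SET t.TotalReplies = t.TotalReplies - d.cnt
      FROM tblForumTopics t
      JOIN (SELECT IdTopic, COUNT(*) AS cnt FROM deleted GROUP BY IdTopic) d
        ON d.IdTopic = t.IdTopic;
END;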
For each reply you would keep TotalReplies and TotalDirectReplies. That way, you can support a tree-like structure of replies and keep the counts updated throughout the entire hierarchy without needing to count each time.