Efficiency - is looping greater than the sum of its parts? - SQL

This is a question of concept, and I am just moving from MS Access to SQL Server for stability and scalability.
I need to maintain a database that pulls from another server daily. Due to the possibility (and probability) of record changes on the other server, I have to pull using a 10-day rolling window, with the expectation that anything older than 10 days will not change, by policy.
The pull is in stages: getting just the records within a date range on the initial pass, then moving to other tables one at a time to pull relevant and related data.
I have written a script that works with date-range variables. If I set the range to 10 days, it pulls everything in about 8 hours. In a test to see if looping might be better, I had the script loop one day at a time, starting with today minus 10 and continuing while the loop date is less than today; it took 16 hours to do 3 days.
Being new, I am curious about the logical reason why the looping-by-date approach is so much slower. My thought was to try to mitigate the impact on the other server, but maybe this isn't conceptually the case.
The code works perfectly in both cases with the only difference being looped or all at once.
Thanks for any insight on this!

Since you are pulling from another server, you are probably using a linked server. Also, since it's the most intuitive way to do it, you are probably doing something like this:
select somefields
from ServerName.DatabaseName.Owner.TableName
where whatever
With this syntax, SQL Server often pulls the entire contents of the remote table over to the local server first and only then applies the WHERE clause. If the remote table has a lot of data, that transfer takes a long time.
When you ran your original query, the data was transferred once. When you set up your loop, it was transferred once per iteration. That's why it took so much more time.
To speed this up with linked servers, use OPENQUERY.
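For example, a sketch reusing the placeholder names from above; OPENQUERY sends the whole pass-through query to the remote server, so the filtering happens there and only matching rows come back over the wire:
select somefields
from openquery(ServerName, 'select somefields from DatabaseName.Owner.TableName where whatever')
Note that OPENQUERY only accepts a literal query string, so a rolling date range usually means building the string dynamically (for example with EXEC (...) AT ServerName) rather than embedding local variables.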

"Trigger" on frequent changes with time range?

I have a need to monitor a particular table in a SQL database, and get notified (email ideally) if there are a large number of changes within any 30 second period (for example). Unsure if this is best done in SQL Server, or by some other method.
I've been hunting and can't seem to find anything relevant and thought I'd ask here to get pointed in the right direction.
So logically, something like this:
If # of Updates or Inserts exceeds 500 within any 30 second timeframe, send email to x#xxxxxx.com
Any thoughts on how best to attack this would be greatly appreciated.
I'd split this into separate concerns.
Firstly, how do you detect changes? The easiest approach is a "last updated" column; you can then run a SELECT that counts rows changed in the last 30 seconds and compare that count to your threshold.
The second question is how you execute that query and send your emails. Your question title mentions triggers - these are specific SQL Server objects. I have a personal dislike of triggers - they are "side effects", and tend to make code hard to understand, and lead to entertaining bugs. In your case, they're probably a bad solution - you want triggers to be incredibly fast, as they hold up your data changes while they execute. Sending an email is probably slower than you want.
The simplest option is probably to use an agent job, scheduled to run at whatever frequency you need. Again, performance may be a challenge here - having a high-load query run every 30 seconds may slow down your database dramatically.
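As a rough sketch of that "last updated" check (the table and column names here are placeholders), the scheduled query could look like this, with the count compared against your threshold:
select count(*) as RecentChanges
from dbo.MonitoredTable                              -- placeholder name for the monitored table
where LastUpdated >= dateadd(second, -30, sysdatetime());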
If you are using any audit/log table which keeps track of transactions (Insert, Update or Delete) on that particular table, then it should be fairly easy.
Use a SQL Server Agent job to execute two tasks (sketched below):
Execute a script that counts the number of transactions recorded in that audit/log table over the last 30 seconds.
If the count exceeds 500, run a PowerShell script or .exe file to send the emails.
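A minimal sketch of that job step, assuming Database Mail is configured and an audit table with a change timestamp exists (the table, column, and profile names are assumptions):
declare @changes int;

select @changes = count(*)
from dbo.AuditLog                                    -- hypothetical audit/log table
where ChangeTime >= dateadd(second, -30, sysdatetime());

if @changes > 500
    exec msdb.dbo.sp_send_dbmail
        @profile_name = 'AlertsProfile',             -- assumed Database Mail profile
        @recipients   = 'x#xxxxxx.com',
        @subject      = 'High change rate on monitored table',
        @body         = 'More than 500 inserts/updates in the last 30 seconds.';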

Running an SQL query in the background

I'm trying to update a modest dataset of 60k records with a value which takes a little time to compute. From a small trial run of 6k records in the production environment, it took 4 minutes to complete, so the full execution should take around 40 minutes.
However, this trial run showed that SQL timeouts were occurring on user requests when accessing data in related tables (but not necessarily on the actual rows being updated).
My question is: is there a way of running non-urgent queries as a background operation in SQL Server without causing timeouts or table locking for extended periods of time? The data in the column being updated during this period is not essential to return with its new value; in other words, if a request happened to come in for one of these rows, returning the old value would be perfectly acceptable rather than locking the set until the update is complete. (I'm not sure of the ins and outs of how this works; obviously I do want to prevent data corruption. Perhaps there is a way of queuing any additional changes in the background.)
This is possibly a situation where the NOLOCK hint is appropriate. You can read about SQL Server isolation levels in the documentation. And Googling "SQL Server NOLOCK" will give you plenty of material on why you should not over-use the construct.
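For illustration only, a reader query using the hint might look like this (table and column names are placeholders); the trade-off is that it reads uncommitted data and can miss or double-read rows:
select SomeColumn
from dbo.RelatedTable with (nolock)    -- dirty read: may return uncommitted or inconsistent data
where SomeFilter = 1;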
I might also investigate whether you need a SQL query to compute values. A single query that takes 4 minutes on 6k records... well, that is a long time. You might want to consider reading the data into an application (say, using Python, R, or whatever) and doing the data manipulation there. It may also be possible to speed up the query processing itself.

stored proc time outs

We are currently having difficulties with a SQL Server procedure timing out on queries. Nine times out of ten the query will run within 5 seconds max; however, on occasion the proc can continue to run in excess of 2 minutes, causing timeouts on the front end (a .NET MVC application).
They have been investigating this for over a week now, checking jobs and server performance, and all seems to be OK.
The DBAs have narrowed it down to a particular table which is being bombarded with inserts/updates from different applications. This, in combination with the complex SELECT query that joins on that table, is (I'm being told) what is causing the timeouts.
Are there any suggestions at all as to how to get around these timeouts? For example:
Replicate the table and query the new table?
Any additional debugging that can prove that this is actually the issue?
Perhaps cache the data on the front end and, on a timeout, serve data from the cache?
A table being bombarded with updates is a table being bombarded with locks. And yes, this can affect performance.
First, copy the table and run the query against the copy multiple times; there are other possible causes of the performance issue, and this is a quick way to check whether contention on the live table is really to blame.
One cause of unstable stored procedure performance in SQL Server is compilation. The code in the stored procedure is compiled the first time it is executed -- the resulting execution plan might work for some inputs and not others. This is readily fixed by using the option to recompile the queries each time (although this adds overhead).
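For example (a sketch with hypothetical procedure and table names), the statement-level hint looks like this; it trades plan-cache reuse for a plan tailored to each execution's parameter values:
alter procedure dbo.GetRecentOrders    -- hypothetical procedure
    @CustomerId int
as
begin
    select OrderId, OrderDate
    from dbo.Orders                    -- hypothetical table
    where CustomerId = @CustomerId
    option (recompile);                -- compile a fresh plan on every execution
end;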
Then, think about the query. Does it need the most up-to-date data? If not, perhaps you can just copy the table once per hour or once per day.
If the most recent data is needed, you might need to re-think the architecture. A table that does insert-only using a clustered identity column always inserts at the end of the table. This is less likely to interfere with queries on the table.
Replication may or may not help the problem. After all, full replication will be doing the updates on the replicated copy. You don't solve the "bombardment" problem by bombarding two tables.
If your queries involve a lot of historical data, then partitioning might help. Only the most recent partition would be "bombarded", leaving the others more responsive to queries.
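For illustration, a bare-bones sketch of date partitioning (the boundary dates and names here are invented; in practice you would align the table's clustered index with the partition scheme):
create partition function pf_ByYear (datetime2)
    as range right for values ('2015-01-01', '2016-01-01');

create partition scheme ps_ByYear
    as partition pf_ByYear all to ([PRIMARY]);

-- the table (or its clustered index) is then created ON ps_ByYear(DateColumn),
-- so recent inserts land only in the newest partition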
We used to face many timeouts and used to get a lot of escalations. This is the approach we followed for reducing timeouts.
Some of it may be applicable in your case, some may not, but the following will not cause any harm.
Change the SQL Server settings below (a script for these follows):
1. Remote login timeout: 60
2. Remote query timeout: 0
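If you want to script those two settings rather than set them in SSMS, something like this should work (a remote query timeout of 0 means no timeout):
exec sp_configure 'remote login timeout (s)', 60;
reconfigure;
exec sp_configure 'remote query timeout (s)', 0;
reconfigure;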
Also, if your Windows server is set to use dynamic RAM, try changing it to static RAM.
You may also have to tune some Windows Server settings; see, for example:
TCP Offloading/Chimney & RSS…What is it and should I disable it?
Following the above steps reduced our timeouts by 99%.
For the remaining 1%, we dealt with each case separately:
1. Update statistics for the tables involved in the query (example below).
2. Try fine-tuning the query further.
This helped us eliminate timeouts entirely.
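For point 1, a minimal example (the table name is a placeholder); sp_updatestats refreshes statistics database-wide if you prefer:
update statistics dbo.BusyTable;    -- refresh statistics on the table involved in the query
-- or, for the whole database:
exec sp_updatestats;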

What would be the most efficient method for storing/updating Interval based data in SQL?

I have a database table with 700 million plus rows (growing exponentially) of time-based data.
Fields:
PK.ID,
PK.TimeStamp,
Value
I also have 3 other tables grouping this data into days, months, and years; they contain the sum of the value for each ID in that time period. These tables are updated nightly by a SQL job. The situation has arisen whereby the tables will need to be updated on the fly when the data in the base table is updated. This can be up to 2.5 million rows at a time (not very often; more typically around 200-500k rows up to every 5 minutes). Is this possible without causing massive performance hits, or what would be the best method for achieving this?
N.B.
The daily, monthly, and yearly tables can be changed if needed; they are used to speed up queries such as "Get the monthly totals for these 5 IDs for the last 5 years". In raw data this is about 13 million rows; from the monthly table it is 300 rows.
I do have SSIS available to me.
I can't afford to lock any tables during the process.
700M records in 5 months means 8.4B in 5 years (assuming data inflow doesn't grow).
Welcome to the world of big data. It's exciting here and we welcome more and more new residents every day :)
I'll describe three incremental steps that you can take. The first two are just temporary - at some point you'll have too much data and will have to move on. However, each one takes more work and/or more money so it makes sense to take it a step at a time.
Step 1: Better Hardware - Scale up
Faster disks, RAID, and much more RAM will take you some of the way. Scaling up, as this is called, breaks down eventually, but if your data is growing linearly and not exponentially, then it'll keep you floating for a while.
You can also use SQL Server replication to create a copy of your database on another server. Replication works by reading transaction logs and sending them to your replica. Then you can run the scripts that create your aggregate (daily, monthly, annual) tables on a secondary server that won't kill the performance of your primary one.
Step 2: OLAP
Since you have SSIS at your disposal, start looking into multidimensional data. With good design, OLAP cubes will take you a long way. They may even be enough to manage billions of records, and you'll be able to stop there for several years (been there, done that, and it carried us for two years or so).
Step 3: Scale Out
Handle more data by distributing the data and its processing over multiple machines. When done right, this allows you to scale almost linearly: as you get more data, add more machines to keep processing time constant.
If you have the $$$, use solutions from Vertica or Greenplum (there may be other options, these are the ones that I'm familiar with).
If you prefer open source / byo, use Hadoop, log event data to files, use MapReduce to process them, store results to HBase or Hypertable. There are many different configurations and solutions here - the whole field is still in its infancy.
Indexed views.
Indexed views will allow you to store and index aggregated data. One of the most useful aspects of them is that you don't even need to directly reference the view in any of your queries. If someone queries an aggregate that's in the view, the query engine will pull data from the view instead of checking the underlying table.
You will pay some overhead to update the view as data changes, but from your scenario it sounds like this would be acceptable.
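A minimal sketch of such an indexed view for the monthly totals, assuming the base table is dbo.BaseTable with a non-nullable Value column (indexed views require SCHEMABINDING, COUNT_BIG(*), and a unique clustered index; whether queries are matched to the view automatically depends on edition, otherwise reference it WITH (NOEXPAND)):
create view dbo.MonthlyTotals
with schemabinding
as
select
    ID,
    year([TimeStamp])  as Yr,
    month([TimeStamp]) as Mo,
    sum(Value)         as TotalValue,
    count_big(*)       as RowCnt       -- required for an indexed view with GROUP BY
from dbo.BaseTable
group by ID, year([TimeStamp]), month([TimeStamp]);
go
create unique clustered index IX_MonthlyTotals
    on dbo.MonthlyTotals (ID, Yr, Mo);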
Why don't you create monthly tables, just to save the info you need for those months? It would be like simulating multidimensional tables. Or, if you have access to multidimensional systems (Oracle, DB2, or so), just work with multidimensionality; that works well with time-period problems like yours. At this moment I don't have enough info to give you, but you can learn a lot about it just by googling.
Just as an idea.

Postgres: How to fire multiple queries at the same time?

I have one procedure which updates record values, and I want to fire it against all records in a table (over 30k records). Procedure execution time is from 2 up to 10 seconds, because it depends on network load.
Now I'm doing UPDATE table SET field = procedure_name(paramns); but with that amount of records it takes up to 40 minutes to process the whole table.
Now I'm using 4 different connections which fork to the background and fire the query with a WHERE clause set to iterate over the modulo of the row IDs to speed this up ( WHERE id_field % 4 = ), and this works well and cuts the time to populate the table down to ~10 minutes.
But I want to avoid using cron, shell jobs, and multiple connections for this. I know that it can be done with libpq, but is there a way to fire up a query (4 different non-blocking queries) and not wait until it finishes execution, within a single connection?
Or can anyone point me to some clues on how to write such a function using Postgres internals, or simply in C, and bind it as a stored procedure?
Cheers Darius
I've got a sure answer for this question - IF you will share with us what your ab workout is!!! I'm getting fat by the minute and I need answers myself...
OK I'll answer anyway.
If you are updating one table, on one database server, in 40 minutes 'single threaded' and in 10 minutes with 4 threads, the bottleneck is not the database server; otherwise, it would get bogged down in I/O. If you are executing a bunch of UPDATES, one call per record, the network round-trip time is killing you.
I'm pretty sure this is the case, and not that it's either an I/O bottleneck on the DB or the possibility that procedure_name(paramns) is taking a long time. (If the procedure really took 2-10 seconds per record, it would take something like 2,500 minutes to do 30K records.) The reason I am sure is that starting 4 concurrent processes cuts the time to a quarter, so in particular it is not an I/O issue on the DB server.
This might be the one excuse for putting business logic in an SP on the server. Optimization unfortunately means breaking the rules. The consequence is difficult maintenance. But, duh!!
However, the best solution would be to get this set up to use 'bulk update' queries. That might mean you have to take several strange and unintuitive steps such as this:
This will require a lot of modification if multiple users can run it concurrently.
Refactor the system so procedure_name(paramns) can get all the data it needs to process all records via a SELECT statement. You may need to use creative joins. If it's an SP, of course, you are now moving the logic to the client.
Then have the program create an XML or other importable flat-file format with the PK of each record to update and the new field value or values. Write all the updates to this file instead of executing them on the DB.
Have a temp table on the database that matches the layout of this flat file.
Run an import on the database: clear the temp table and import the file.
Do an UPDATE joining the temp table to the table being updated (using Postgres's UPDATE ... SET ... FROM syntax; see the sketch right below).
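A rough Postgres sketch of the last three steps, keeping the hypothetical names mytbl, myPK, and myval and assuming the flat file is CSV:
-- temp table matching the flat-file layout
create temp table mytemp (pk integer primary key, newval text);

-- bulk-load the file (COPY reads a server-side path; use psql's \copy for a client-side file)
copy mytemp from '/tmp/updates.csv' with (format csv);

-- apply all updates in one set-based statement
update mytbl
set    myval = mytemp.newval
from   mytemp
where  mytbl.myPK = mytemp.pk;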
You can try some of these things 'by hand' first before you bother coding, to see if it's worth the speed increase.
If possible, you can still put this all in an SP!
I'm not making any guarantees, especially as I look down at my ever-fattening belly, but, this has the potential to melt your update job down to under a minute.
It is possible to update multiple rows at once. Below is an example in Postgres:
UPDATE
table_name
SET
column_name = temp.column_name
FROM
(VALUES
(<id1>, <value1>),
(<id2>, <value2>),
(<id3>, <value3>)
) AS temp("id", "column_name")
WHERE
table_name.id = temp.id
PHP has some functions for asynchronous queries:
pg_send_execute()
pg_send_prepare()
pg_send_query()
pg_send_query_params()
No idea about other programming languages, you have to dig into the manuals.
I think you can't. A single connection can handle a single query at a time. It's described in the libpq documentation chapter "Asynchronous Command Processing":
"After successfully calling PQsendQuery, call PQgetResult one or more times to obtain the results. PQsendQuery cannot be called again (on the same connection) until PQgetResult has returned a null pointer, indicating that the command is done."