SQL For Loop Tuning - sql

We're about to start a new process. It will involve over fifty tables with a total of more than 2 million rows.
The process will loop in a For/For Each box. Inside it, the tables will undergo different processes, basically updates (the most frequently called will be a delete looking for duplicates). At the end we'll get a new table with the full content of all those 50 tables, with all the updates done.
So my question is: is it better, in terms of speed, to look for duplicates in the tables during every loop, or to do a full delete at the end of the process, checking the full result?
The amount of work will be more or less the same.
Thanks a lot.
EDIT.
More info.
The loop is kind of needed. Those 50 tables are located on two different servers, Oracle and Access. The loop is there to populate them into the local SQL Server.
On every iteration of the loop I do some updates and work on the tables so they are ready.
The question is whether the work we do on the tables is faster if it's run inside the loop or outside it.
Thanks, I hope that makes it clearer.

Sounds like a single statement.
Also, why denormalize prematurely?
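To make the single-statement suggestion concrete: the duplicate-removal step from the question can usually be one set-based DELETE rather than per-row work inside the loop. A minimal sketch below uses Python's sqlite3 as a stand-in database; the `staging` table and its columns are made up for illustration, and the real statement would run on the local SQL Server.

```python
import sqlite3

# Hypothetical staging table standing in for one of the 50 imported tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, name TEXT, value INTEGER)")
conn.executemany("INSERT INTO staging (name, value) VALUES (?, ?)",
                 [("a", 1), ("a", 1), ("b", 2), ("b", 2), ("c", 3)])

# One set-based statement: keep the lowest id per (name, value) group,
# delete every other duplicate in a single pass.
conn.execute("""
    DELETE FROM staging
    WHERE id NOT IN (
        SELECT MIN(id) FROM staging GROUP BY name, value
    )
""")
conn.commit()
remaining = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
print(remaining)  # → 3
```

Whether this runs once per loop iteration or once over the final combined table, a single statement like this lets the engine deduplicate in one scan instead of row-at-a-time.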

Related

different response time when executing a specific query

I've been trying to figure out a performance issue for a while and would appreciate if someone can help me understand the issue.
Our application is connected to Oracle 11g. We have a very big table in which we keep data for the last two months. We do millions of inserts every half an hour and a big bulk delete operation at the end of each day. Two of our columns are indexed, and we definitely have skewed columns.
The problem is that we are facing many slow responses when reading from this table. I've done some research, as I am not a DB expert. I know about bind variable peeking and cursor sharing. The issue is that even for one specific query with specific parameters, we see different execution times!
There is no LOB column in the table, and the query we use to read data is not complex! It looks for all rows with a specific name (the column is indexed) within a specific range (that column is also indexed).
I am wondering if the large number of insertions/deletions we do could cause any issues?
Is there any type of analysis we could consider to get more input on this issue?
I can see several possible causes of the inconsistency in your query times.
One is the number of updates being done while your query is running. As long as there are locks on the tables used in the query, your query has to wait for them to be released.
The statistics on the table can also become very out of sync with this much data manipulation. I would try two things. First, find out when the DBMS_STATS.GATHER_DATABASE_STATS_JOB_PROC job is run and make sure the bulk delete is performed before this job each night. If this does not help, ask the DBA to set up DBMS_MONITOR on your database to help you troubleshoot the issue.

is it ok to loop a sql query in a programming language

I have a question about retrieving data from a database.
There are two tables, and the master table's ID is always inserted into the other table.
I know that data can be retrieved from the two tables by joining them, but I want to know:
if I first retrieve all my desired data from the master table and then, in a loop (in the programming language), join to the other table and retrieve its data, which approach is more efficient, and why?
As far as efficiency goes, the rule is that you want to minimize the number of round trips to the database, because each trip adds a lot of time. (This may not be as big a deal if the database is on the same box as the application calling it. In the world I live in, the database is never on the same box as the application.) Having your application loop means you make a trip to the database for every row in the master table, so the time your operation takes grows linearly with the number of master table rows.
Be aware that in dev or test environments you may be able to get away with inefficient queries if there isn't very much test data. In production you may see a lot more data than you tested with.
It is more efficient to work in the database, in fewer, larger queries, but unless the site or program is going to be very busy, I doubt it'll make much difference whether the loop is inside the database or outside it. If it is a website application, then running large loops outside the database and waiting on results will take a significantly larger amount of time.
What you're describing is sometimes called the N+1 problem. The 1 is your first query against the master table, the N is the number of queries against your detail table.
This is almost always a big mistake for performance.*
The problem is typically associated with using an ORM. The ORM queries your database entities as though they were objects, and the mistake is to assume that instantiating a data object is no more costly than creating any other object. But of course you can write code that does the same thing yourself, without using an ORM.
The hidden cost is that you now have code that automatically runs N queries, and N is determined by the number of matching rows in your master table. What happens when 10,000 rows match your master query? You won't get any warning before your database is expected to execute those queries at runtime.
And it may be unnecessary. What if the master query matches 10,000 rows, but you really only wanted the 27 rows for which there are detail rows (in other words, an INNER JOIN)?
Some people are concerned with the number of queries because of network overhead. I'm not as concerned about that. You should not have a slow network between your app and your database. If you do, then you have a bigger problem than the N+1 problem.
I'm more concerned about the overhead of running thousands of queries per second when you don't have to. The overhead is in memory and all the code needed to parse and create an SQL statement in the server process.
Just Google for "sql n+1 problem" and you'll find lots of people discussing how bad this is, how to detect it in your code, and how to solve it (spoiler: do a JOIN).
* Of course every rule has exceptions, so to answer this for your application, you'll have to do load-testing with some representative sample of data and traffic.
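Both shapes of the N+1 problem can be shown side by side. The sketch below uses Python's sqlite3 with made-up `master`/`detail` tables; the point is that the loop issues N extra queries while the JOIN produces the same rows in one round trip.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE detail (id INTEGER PRIMARY KEY, master_id INTEGER, note TEXT);
    INSERT INTO master VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO detail VALUES (1, 1, 'x'), (2, 1, 'y'), (3, 3, 'z');
""")

# N+1 pattern: one query for the master rows, then one query per master row.
masters = conn.execute("SELECT id, name FROM master ORDER BY id").fetchall()
n_plus_1 = []
for mid, name in masters:  # N additional round trips
    for (note,) in conn.execute(
            "SELECT note FROM detail WHERE master_id = ? ORDER BY id", (mid,)):
        n_plus_1.append((name, note))

# Single JOIN: the same result set in one round trip, and master rows
# without detail rows (here, 'b') drop out automatically.
joined = conn.execute("""
    SELECT m.name, d.note
    FROM master m
    JOIN detail d ON d.master_id = m.id
    ORDER BY d.id
""").fetchall()

print(n_plus_1 == joined)  # → True
```

Against an in-memory database the difference is invisible; over a network, the loop pays one round trip per master row while the JOIN pays one in total.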

Getting records for comparison: one by one or in one go?

I have a list of products, and I must compare each product in the list with the original ones that exist in the database. Which would be the right way?
Get all records from the DB (keep them in an ArrayList), compare each one, update;
or
Get one from the DB, compare it, update it, get the next.
There are many joins in the DB schema, so the second way seems unsuitable. On the other hand, the products table contains over 5,000 records, and I have doubts about keeping all of them (this can be reduced to ~500 by some queries) in memory.
The best way is somewhere between the two options:
Get all the rows returned from the database in one query, but process each row as you read it, so that only one product is in memory at a time.
Such an approach is completely scalable.
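The middle approach can be sketched with Python's sqlite3 module (the `products` table and the `incoming` price list are invented for illustration): one query, rows processed as the cursor yields them, updates collected and applied afterwards so only one row at a time is compared in memory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO products (price) VALUES (?)",
                 [(10.0,), (20.0,), (30.0,)])

# Hypothetical incoming list to compare against: new prices keyed by product id.
incoming = {1: 10.0, 2: 25.0, 3: 30.0}

to_update = []
# Single query; the cursor streams rows one at a time, so the whole
# table never has to sit in an application-side list.
for pid, price in conn.execute("SELECT id, price FROM products"):
    new_price = incoming.get(pid)
    if new_price is not None and new_price != price:
        to_update.append((new_price, pid))

# Apply all the changed rows in one batch after the read finishes.
conn.executemany("UPDATE products SET price = ? WHERE id = ?", to_update)
conn.commit()
print(len(to_update))  # → 1
```

Collecting the updates and applying them after the read also avoids modifying the table while a SELECT on it is still being iterated.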
As far as databases are concerned, 5000 results is tiny. I would pull all records and act on them in memory. This would be MUCH lighter on both the DB and on the app server.

SQL SERVER Procedure Inconsistent Performance

I am working on a SQL Job which involves 5 procs, a few while loops and a lot of Inserts and Updates.
This job processes around 75000 records.
Now, the job works fine for 10000/20000 records at a speed of around 500/min. After around 20000 records, execution just dies. It loads around 3000 records every 30 mins and stays at the same speed.
I was suspecting the network, but I don't know for sure. These kinds of queries are difficult to analyze through SQL Performance Monitor. I'm not very sure where to start.
Also, there is a single cursor in one of the procs, which executes for very few records.
Any suggestions on how to speed this process up on the full-size data set?
I would check if your updates are within a transaction. If they are, it could explain why it dies after a certain amount of "modified" data. You might check how large your "tempdb" gets as an indicator.
Also I have seen cases when during long-running transactions the database would die when there are other "usages" at the same time, again because of transactionality and improper isolation levels used.
If you can split your job into independent, non-overlapping chunks, you might want to do it: for example, doing the job in chunks by dates, ID ranges of "root" objects, etc.
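The chunking idea can be sketched as follows, again using Python's sqlite3 as a stand-in (the `records` table and chunk size are invented): each ID-range chunk is committed in its own short transaction, so locks are held briefly and a failure only rolls back the current chunk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO records (status) VALUES (?)", [("new",)] * 25)

CHUNK = 10
last_id = 0
chunks = 0
while True:
    # "with conn" wraps each chunk in its own transaction:
    # commit on success, rollback on exception.
    with conn:
        cur = conn.execute(
            "UPDATE records SET status = 'done' WHERE id > ? AND id <= ?",
            (last_id, last_id + CHUNK))
        if cur.rowcount == 0:
            break  # no rows left in this range: we're past the end
    last_id += CHUNK
    chunks += 1

print(chunks)  # → 3
```

On a real server the chunk boundary would come from the data (date ranges, key ranges of the "root" objects), but the transaction-per-chunk structure is the same.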
I suspect your whole process is flawed. I import a datafile that contains 20,000,000 records and hits many more tables and does some very complex processing in less time than you are describing for 75000 records. Remember looping is every bit as bad as using cursors.
I think if you set this up as an SSIS package you might be surprised to find the whole thing can run in just a few minutes.
With your current set-up consider if you are running out of room in the temp database or maybe it is trying to grow and can't grow fast enough. Also consider if at the time the slowdown starts, is there some other job running that might be causing blocking? Also get rid of the loops and process things in a set-based manner.
Okay...so here's what I am doing in steps:
Loading a file in a TEMP table, just an intermediary.
Do some validations on all records using set-based statements.
Actual Processing Starts NOW.
TRANSACTION BEGIN HERE......
LOOP STARTS HERE
a. Pick records based on the TEMP table's PK (say, customer A).
b. Retrieve data from existing tables (e.g. employer information).
c. Validate the information received/retrieved.
d. Check if the record already exists: UPDATE if so, else INSERT. (THIS HAPPENS IN A SEPARATE PROCEDURE)
e. Find ALL of Customer A's family members (PROCESSED IN ANOTHER **LOOP** - SEPARATE PROC)
f. Update the status for Customer A and his family members.
LOOP ENDS HERE
TRANSACTION ENDS HERE
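The set-based rewrite suggested above would collapse steps (d) and (f) into statements that handle every staged row at once, instead of one customer per loop iteration. A minimal sketch, using Python's sqlite3 and invented `temp_load`/`customers` tables (the real job's schema and validation rules are not shown here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE temp_load (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, status TEXT);
    INSERT INTO temp_load VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cal');
    INSERT INTO customers VALUES (1, 'Ann', 'old'), (4, 'Dee', 'old');
""")

with conn:
    # Step (d) as one statement: update-or-insert for every staged row at once.
    conn.execute("""
        INSERT OR REPLACE INTO customers (customer_id, name, status)
        SELECT customer_id, name, 'loaded' FROM temp_load
    """)
    # Step (f) as one UPDATE over all affected customers, not per-row in a loop.
    conn.execute("""
        UPDATE customers SET status = 'processed'
        WHERE customer_id IN (SELECT customer_id FROM temp_load)
    """)

rows = conn.execute(
    "SELECT customer_id, status FROM customers ORDER BY customer_id").fetchall()
print(rows)  # → [(1, 'processed'), (2, 'processed'), (3, 'processed'), (4, 'old')]
```

On SQL Server the upsert step would more likely be a MERGE or an UPDATE-from-join followed by an INSERT of the missing keys, but the shape is the same: two statements replacing N loop iterations.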

how to automatically determine which tables need a vacuum / reindex in postgresql

I've written a maintenance script for our database and would like to run it on whichever tables most need vacuuming/reindexing during our downtime each day. Is there any way to determine that within Postgres?
I would classify tables needing attention like this:
tables that need vacuuming
tables that need reindexing (we find this makes a huge difference to performance)
I see something roughly promising here
It sounds like you are trying to re-invent autovacuum. Any reason you can't just enable that and let it do its job?
For the actual information you want, look at pg_stat_all_tables and pg_stat_all_indexes.
For a good example of how to use the data in it, look at the source for auto-vacuum. It doesn't query the views directly, but it uses that information.
I think you really should consider auto-vacuum.
However, if I understood your needs correctly, here's what I'd do:
For every table (how many tables do you have?), define the criteria;
for example, table 'foo' needs to be reindexed every X new records and vacuumed every X updates, deletes, or inserts.
Then write your own application to do that.
Every day it checks the tables' status, saves it in a log (to compare the row differences over time), and then reindexes/vacuums the tables that match your criteria.
Sounds a little hacky, but I think it's a good way to build a custom autovacuum with custom 'trigger' criteria.
How about adding the same trigger function, which runs after any CRUD action, to all the tables?
The function receives the table name, checks the status of the table, and then runs vacuum or reindex on that table.
It should be a "simple" PL/pgSQL trigger, but then those are never simple...
Also, if your DB machine is strong enough and your downtime long enough, just run a script every night to reindex it all and vacuum it all... That way, even if your criteria were not met at test time (at night) but were close to it (a few records under the threshold), it will not pose an issue the next day when the criteria are reached.