We process CSV files from our upstream systems and load them into the master tables in our SQL Server database. We are currently onboarding a new upstream system and suddenly our UPDATE statement took a very long time. It could be because the incoming data has related data already in our system, which caused a huge update. We were able to identify the table being updated through sp_whoisactive.
My questions are:
1. After the update, is there a way to figure out the number of rows updated for that table from somewhere like the error log, the default trace, or a DMV?
2. During the update, if we see this kind of huge update happening in the future, can we set up some trace to identify how many rows will be updated, or capture the UPDATE statement with its current parameter values? In sp_whoisactive we get the UPDATE statement with variables, but we don't know the current parameter values.
3. Proactively, should we set up Extended Events or something else to capture these kinds of huge updates in the future?
Let's start with your third question first. Yes. If you really want to track specific values for changes, the best way to do this is through Extended Events, and you must set it up and have it running ahead of time. As you'll see in the rest of this post, depending on when you look, there may be no easy way to retrieve the specific information you're after. An event like sql_statement_completed will give you precise row counts for a given statement, and you can filter it to a specific table.
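Here's a minimal sketch of such a session. The table name (MyMasterTable) and the file target path are placeholders you'd swap for your own; the sql_statement_completed event carries a row_count field, so every captured statement shows how many rows it touched.

CREATE EVENT SESSION [TrackBigUpdates] ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(
    ACTION (sqlserver.sql_text, sqlserver.database_name)
    -- keep only statements that touch the table we care about
    WHERE sqlserver.like_i_sql_unicode_string(sqlserver.sql_text, N'%MyMasterTable%')
)
ADD TARGET package0.event_file
(SET filename = N'C:\XE\TrackBigUpdates.xel');  -- path must exist on the server
GO
ALTER EVENT SESSION [TrackBigUpdates] ON SERVER STATE = START;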
Second question: during the update, you can't accurately see how many rows are being updated within the transaction. However, you can get a guess at how many rows are likely to be updated. The execution plan has the row estimates the optimizer anticipates. You can pull the plan for the running request from sys.dm_exec_query_plan and combine it with sys.dm_exec_sql_text to find the query (sys.dm_exec_requests supplies the handles for both). I'm sure sp_whoisactive can also supply this information (it's just querying the DMVs). You can also watch Live Query Statistics if you've set your server up correctly ahead of time. That will give you the estimated row counts, but then it will show you the actuals as they occur.
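As a rough sketch, this pulls the text and the estimated plan for whatever is currently executing; the row estimates live inside the query_plan XML.

SELECT  r.session_id,
        r.status,
        t.text       AS running_batch,
        p.query_plan AS estimated_plan  -- open this to inspect the estimated row counts
FROM    sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle)    AS t
CROSS APPLY sys.dm_exec_query_plan(r.plan_handle) AS p
WHERE   r.session_id <> @@SPID;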
Now for the tough question. Can you get row counts after the fact? Kind of. If the query just executed and hasn't executed again, sys.dm_exec_query_stats has a last_rows column that will provide that information. If the query has run again since, that information is lost, because the column only covers the most recent execution. If you're on Azure SQL Database or SQL Server 2019, you can also look at sys.dm_exec_query_plan_stats to see the last actual plan (Execution Plan Plus Runtime Metrics), which will also have row counts. Although, if that's all you're looking for and this is the most recent execution, the query stats DMV is easier. I don't know if that column is exposed by sp_whoisactive, but you can just query the DMV yourself.
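A minimal sketch of that lookup; the MyMasterTable filter is just a placeholder for however you identify your UPDATE in the plan cache.

SELECT  qs.last_rows,
        qs.last_execution_time,
        t.text AS batch_text
FROM    sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS t
WHERE   t.text LIKE N'%UPDATE%MyMasterTable%'
ORDER BY qs.last_execution_time DESC;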
However, if the query has run more than once, you're out of luck. You can look at the execution plan, as mentioned before, to see what the row estimates are. If the query suffered from waits of more than 30 seconds, it will show up in the system_health Extended Events session, but that won't include row counts. Really, unless it's the very last time that exact query was run, there's no way after the fact to get the row count.
Related
I've been trying to figure out a performance issue for a while and would appreciate it if someone could help me understand it.
Our application is connected to Oracle 11g. We have a very big table in which we keep data for the last two months. We do millions of inserts every half hour and a big bulk delete at the end of each day. Two of our columns are indexed and we definitely have skewed columns.
The problem is that we are seeing many slow responses when reading from this table. I've done some research, as I am not a DB expert. I know about bind variable peeking and cursor sharing. The issue is that even for one specific query with specific parameters, we see different execution times!
There is no LOB column in the table, and the query we use to read data is not complex! It looks for all rows with a specific name (the column is indexed) within a specific range (that column is also indexed).
I am wondering if the large number of insertions/deletions we do could cause any issues?
Is there any type of analysis we could consider to get more input on this issue?
I can see several possible causes of the inconsistency of your query times.
The number of updates being done while your query is running. Heavy concurrent DML can slow your reads: in some cases the query has to wait for locks to be released, and Oracle also has to apply undo to build read-consistent versions of blocks that are being changed underneath it.
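If you want to check the blocking angle while the slow query runs, a quick sketch against v$session (standard columns in 10g/11g) is:

SELECT sid, serial#, blocking_session, event, seconds_in_wait
FROM   v$session
WHERE  blocking_session IS NOT NULL;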
The statistics on the table can get badly out of sync with this much data manipulation. I would try two things. First, find out when the DBMS_STATS.GATHER_DATABASE_STATS_JOB_PROC job runs and make sure the bulk delete is performed before that job each night. If that does not help, I would ask the DBA to set up DBMS_MONITOR tracing on your database to help you troubleshoot the issue.
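As a sketch, with a made-up owner APP_OWNER and table BIG_TABLE, gathering fresh statistics right after the bulk delete would look something like:

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => 'APP_OWNER',
    tabname    => 'BIG_TABLE',
    cascade    => TRUE,                        -- refresh the index stats too
    method_opt => 'FOR ALL COLUMNS SIZE AUTO'  -- histograms help with the skewed columns
  );
END;
/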
I'm trying to update a modest dataset of 60k records with a value which takes a little time to compute. From a small trial run of 6k records in the production environment, it took 4 minutes to complete, so the full execution should take around 40 minutes.
However, this trial run showed SQL timeouts occurring on user requests when accessing data in related tables (but not necessarily on the actual rows being updated).
My question is: is there a way of running non-urgent queries as a background operation in SQL Server without causing timeouts or locking tables for extended periods? It is not essential that the column being updated return the new value during this period; if a request happened to come in for one of these rows, returning the old value would be perfectly acceptable, rather than locking the set until the update is complete. (I'm not sure of the ins and outs of how this works; obviously I do want to prevent data corruption. Perhaps there is a way of queuing any additional changes in the background.)
This is possibly a situation where the NOLOCK hint is appropriate. You can read about SQL Server isolation levels in the documentation. And Googling "SQL Server NOLOCK" will give you plenty of material on why you should not over-use the construct.
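For illustration only, with made-up table and column names, a reader that can tolerate dirty/stale values while the long update runs would look like:

SELECT OrderId, StatusCode
FROM   dbo.Orders WITH (NOLOCK)  -- reads uncommitted data; acceptable here per the question
WHERE  CustomerId = 42;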
I might also investigate whether you need a SQL query to compute values. A single query that takes 4 minutes on 6k records . . . well, that is a long time. You might want to consider reading the data into an application (say, using Python, R, or whatever) and doing the data manipulation there. It may also be possible to speed up the query processing itself.
Is it possible, with a MySQL script full of plain mysql commands piped into the mysql binary, to count the current records and insert that count into a stats table, perhaps with the time and date generated automatically?
I want to do this so calculations can be done, e.g. working out the total number of new records inserted in a given period.
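As a sketch, assuming hypothetical orders and orders_stats tables, the script could simply snapshot the count with a timestamp each time it runs:

-- record the current row count so growth between runs can be calculated later
INSERT INTO orders_stats (sampled_at, total_rows)
SELECT NOW(), COUNT(*) FROM orders;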
If you are interested in benchmarking your insert statements, you might be able to get what you want by looking at the general query log file. It should show you the date and time of each query executed upon the database. If that isn't sufficient, you might also try looking at the binary log file. That might contain information about how many rows were affected by each query.
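For reference, on MySQL 5.1 and later the general query log can be switched on at runtime (the file path here is just an example); on older versions it has to be set in the config file.

SET GLOBAL general_log_file = '/var/log/mysql/general.log';
SET GLOBAL general_log = 'ON';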
Is there a way to find out the number of rows inserted/deleted in a table in MySQL? Is this kind of statistics kept somewhere in the database? If not, what would be the best way to implement something to keep track of these statistics?
When I say how many, I mean within a certain period (the last 24 hours, since the server was up, the last week, etc.).
When I need to keep track of deleted things, I just don't delete.
I change a column value that excludes it from normal user results.
If space is an issue, you can empty the contents you no longer care about.
For inserted rows, you can use COUNT().
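A minimal sketch of that approach, assuming a records table with hypothetical created_at and deleted_at columns that the application populates:

-- "delete" by flagging instead of removing the row
UPDATE records SET deleted_at = NOW() WHERE id = 123;

-- rows inserted / "deleted" in the last 24 hours
SELECT SUM(created_at >= NOW() - INTERVAL 1 DAY) AS inserted_last_24h,
       SUM(deleted_at >= NOW() - INTERVAL 1 DAY) AS deleted_last_24h
FROM   records;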
The Binary Log contains records of all queries that update or insert data. I don't know if it stores the number of affected rows, however.
There is also a General Query Log, which tracks all queries that were run.
(Information current as of MySQL 5.0. If you're using an older version, YMMV.)
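If you want to poke at the binary log from the client, these statements exist for that; the log file name shown is just a typical default.

SHOW BINARY LOGS;
SHOW BINLOG EVENTS IN 'mysql-bin.000001' LIMIT 20;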
If I want to handle logging my SQL queries, I have two possibilities:
Turning the MySQL Log function on
Writing my own 'trace' class
I prefer doing number 2.
Why?
Because it is more controllable. You can easily distinguish between INSERT, DELETE, UPDATE, and other queries.
But that is not the only advantage of your own trace class: creating trace files (so-called "logs") makes administrative tasks much easier.
You can structure the trace output, put it into a separate database, or store it in an XML or JSON file.
You can order things as you want them to be.
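As a sketch, a home-grown trace table that such a class could write into (all names made up) might look like this; filtering by query_type is then a simple WHERE clause.

CREATE TABLE query_trace (
  id          BIGINT AUTO_INCREMENT PRIMARY KEY,
  executed_at DATETIME    NOT NULL,
  query_type  VARCHAR(10) NOT NULL,   -- 'INSERT', 'UPDATE', 'DELETE', ...
  query_text  TEXT        NOT NULL,
  duration_ms INT         NULL
);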
I am trying to select hundreds of rows from a DB that contains hundreds of thousands of rows, and update those rows afterwards.
The problem is that I don't want to go to the DB twice for this, since the update only marks those rows as "read".
Is there any way I can do this in Java using plain JDBC libraries? (Hopefully without using stored procedures.)
Update: OK, here is some clarification.
There are a few instances of the same application running on different servers. They all need to select hundreds of "UNREAD" rows sorted by the creation_date column, read the blob data within them, write it to a file, and FTP that file to some server. (I know, prehistoric, but requirements are requirements.)
The read-and-update part is there to ensure each instance gets a different set of data. (They must be taken in order, so tricks like odds and evens won't work :/)
We select the data FOR UPDATE, the data transfers over the wire (we wait and wait), then we update the rows as "READ" and release the lock for reading. This whole thing takes too long. By reading and updating at the same time, I would like to reduce the lock time (from the moment we issue the SELECT FOR UPDATE to the actual UPDATE) so that running multiple instances would increase rows read per second.
Still have ideas?
It seems to me there might be more than one way to interpret the question here.
1. You are selecting the rows for the sole purpose of updating them and not reading them.
2. You are selecting the rows to show to somebody, and marking them as read either one at a time or all as a group.
3. You want to select the rows and mark them as read at the time you select them.
Let's take Option 1 first, as that seems to be the easiest. You don't need to select the rows in order to update them, just issue an update with a WHERE clause:
update table_x
set read = 'T'
where date > sysdate-1;
Looking at option 2, you want to mark them as read when a user has read them (or a downstream system has received them, or whatever). For this, you'll probably have to do another update. If you query for the primary key in addition to the other columns you'll need in the first select, you will have an easier time updating, as the DB won't have to do table or index scans to find the rows.
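Sticking with the made-up table_x and read column from the example above, plus a hypothetical id key and payload column:

SELECT id, payload
FROM   table_x
WHERE  read = 'F'
ORDER BY creation_date;

-- later, once those rows have actually been consumed:
UPDATE table_x
SET    read = 'T'
WHERE  id IN (101, 102, 103);   -- the keys returned by the first select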
In JDBC (Java) there is a facility to do a batch update, where you execute a set of updates all at once. That's worked out well when I need to perform a lot of updates that are of the exact same form.
Option 3, where you want to select and update all in one shot. I don't find much use for this, personally, but that doesn't mean others don't. I suppose some kind of stored procedure would reduce the round trips. I'm not sure what db you are working with here and can't really offer specifics.
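That said, if the database happens to be SQL Server, one way to do the select-and-mark in a single round trip (just a sketch, using the made-up columns from above) is the OUTPUT clause with READPAST, so each instance skips rows another instance has already locked:

WITH next_batch AS (
    SELECT TOP (100) id, payload, [read]        -- [read] is bracketed: READ is reserved in T-SQL
    FROM   table_x WITH (READPAST, UPDLOCK, ROWLOCK)
    WHERE  [read] = 'F'
    ORDER BY creation_date
)
UPDATE next_batch
SET    [read] = 'T'
OUTPUT inserted.id, inserted.payload;

Each call claims up to 100 of the oldest unread rows and marks them in the same statement.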
Going to the DB isn't so bad. If you aren't returning anything 'across the wire', then an update shouldn't do you too much damage, and it's only a few hundred thousand rows. What is your worry?
If you're doing a SELECT in JDBC and iterating over the ResultSet to UPDATE each row, you're doing it wrong. That's an (n+1) query problem that will never perform well.
Just do an UPDATE with a WHERE clause that determines which of those rows needs to be updated. It's a single network round trip that way.
Don't be too code-centric. Let the database do the job it was designed for.
Can't you just use the same connection without closing it?