I want to create a scalar SQL function that checks the row count of a table using dm_db_partition_stats. I have a handful of tables that get pushed to me and during that time the tools that use those tables are rendered useless.
I have these tables backed up on another server. What I'd like to do is run a check on the row count. If the results are 0 then the scalar function will return a 1. In the .NET front-end if a 1 is returned then it can query the backup data.
My question is when will the row count get updated in dm_db_partition_stats? Is it immediately or is there some lag involved?
The Dynamic Management Views directly return information about the current internal state of the server, so they are as immediate and real time as it is possible to get. However, the row count from that view is only guaranteed to be approximate, and if there are active transactions in the process of inserting or deleting rows, the count you get may or may not match what you would see if you actually queried the table. So from what you describe it sounds possible that the code that runs subsequently might not find what it was expecting.
We process CSV files from our upstream systems and load them into the master tables in our SQL Server database. We are currently onboarding a new upstream system, and suddenly our UPDATE statement took a very long time. It may be that the incoming data matched a lot of existing related data in our system, which caused a huge update. We were able to identify the table being updated through sp_whoisactive.
My query is:
After the update, is there a way to find the number of rows updated in the table from somewhere like the error log, the default trace, or a DMV?
During the update, if we see this kind of huge update in the future, can we set up a trace to identify the number of rows that will be updated, or capture the update statement with the current parameter values? In sp_whoisactive we get the update statement with variables, but we don't know their current values.
Proactively, should we set up Extended Events or something else to capture these kinds of huge updates in the future?
Let's start with your third question. Yes. If you really want to track specific values for changes, the best way to do this is through Extended Events, and you must set it up and have it running ahead of time. As you'll see in the rest of this post, depending on the situation there may be no easy way to retrieve the specific information you're looking for after the fact. Something like sql_statement_completed will give you precise row counts for a given event. You can filter it to a specific table.
For your second question: during an update, you can't accurately see how many rows are being updated within the transaction. However, you can get an estimate of how many rows are likely to be updated. The execution plan holds the row estimates the optimizer anticipates, so you can query it from sys.dm_exec_query_plan. Combine it with sys.dm_exec_requests to find the running query. I'm sure sp_whoisactive can also supply this information (it's just querying the DMVs). You can also watch Live Query Statistics if you've set your server up correctly ahead of time. That will give you the estimated row counts, and it will then show you the actuals as they occur.
Now for the tough question. Can you get row counts after the fact? Kind of. If the query just executed and hasn't executed again, sys.dm_exec_query_stats has a last_rows column that will provide that info. If the query has run more than once, though, that information is lost because the column only reflects the most recent execution. If you're on Azure SQL Database or SQL Server 2019, you can also look to sys.dm_exec_query_plan_stats to see the last execution plan plus runtime metrics. That will also have row counts. Although, if that's all you're looking for, and this is the most recent execution, the query stats DMV is easier. I don't know if that column is included in sp_whoisactive, but you can just query the DMV yourself.
However, if the query has run more than once, you're out of luck. You can look to the execution plan, as was mentioned before, to see what the row estimates are. If the query suffered from waits more than 30 seconds, it will show up in the system_health extended event session, but that won't include row counts. Really, unless it's the very last time the exact query was run, there's no way after the fact to get the row count value.
I have a dashboard that displays data from a stored procedure.
The stored procedure contains the calculations for the data displayed in the dashboard, and I am getting a performance issue when executing it. So I decided to run the SP in the background and dump the data into a physical table, so that I can fetch the data directly from that table. But that table again ends up holding millions of rows, so I will still have a performance problem. I can't find a way to solve this; kindly help me with it.
The problem lies in the amount of data the dashboard is trying to process.
Since it's okay for you to dump the output to a physical table, simply create an aggregate version of that table. For example, instead of keeping millions of records, you can group by country, department, employee, etc., and dump the aggregated output into a physical table instead. Usually we group the transactions per day, in other words 1 row per transaction day, or GROUP BY CAST(transaction_date AS VARCHAR(12)).
Better yet, if it is possible, modify the stored procedure to return only a few rows of data that is already aggregated.
At least where we work, we call these "reporting tables", and they contain only a few thousand rows that drive the dashboards. So we have an SP, let's say "usp_Report", that is used by the dashboard. It does two things: (1) update the "reporting table" in aggregate form, and (2) return the data found in the "reporting table". The update portion only happens once per day/hour, so we program this refresh-frequency control within the stored procedure.
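The reporting-table idea above can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module as a stand-in for the real database; the table names (transactions, report_daily), the columns, and the sample rows are all hypothetical, and a real usp_Report would do the same two steps in T-SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (
        id               INTEGER PRIMARY KEY,
        transaction_date TEXT NOT NULL,  -- ISO-8601 timestamp
        country          TEXT NOT NULL,
        amount           REAL NOT NULL
    );
    -- Hypothetical reporting table: one row per day and country.
    CREATE TABLE report_daily (
        transaction_day TEXT NOT NULL,
        country         TEXT NOT NULL,
        txn_count       INTEGER NOT NULL,
        total_amount    REAL NOT NULL,
        PRIMARY KEY (transaction_day, country)
    );
""")
conn.executemany(
    "INSERT INTO transactions (transaction_date, country, amount) VALUES (?, ?, ?)",
    [
        ("2024-01-01 09:00:00", "US", 10.0),
        ("2024-01-01 17:30:00", "US", 5.0),
        ("2024-01-01 12:00:00", "DE", 7.5),
        ("2024-01-02 08:15:00", "US", 1.0),
    ],
)
conn.commit()


def refresh_report(conn):
    """Step 1: rebuild the aggregate table from the detail rows."""
    with conn:  # both statements commit together, or roll back together
        conn.execute("DELETE FROM report_daily")
        conn.execute("""
            INSERT INTO report_daily (transaction_day, country, txn_count, total_amount)
            SELECT date(transaction_date), country, COUNT(*), SUM(amount)
            FROM transactions
            GROUP BY date(transaction_date), country
        """)


def read_report(conn):
    """Step 2: the dashboard reads only the small aggregate table."""
    return conn.execute(
        "SELECT transaction_day, country, txn_count, total_amount "
        "FROM report_daily ORDER BY transaction_day, country"
    ).fetchall()


refresh_report(conn)
for row in read_report(conn):
    print(row)
```

The refresh can run on a schedule (hourly or daily), while the dashboard only ever calls the cheap read step.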
I am going to be maintaining a local copy of a database on bigquery. I will be using the API and tabledata:list. This database is not my own, and is regularly updated by the maintainers by appending new data (say every hour).
First, can I assume that when this data is appended, it will definitely be added to the end of the database?
Now, let's assume that currently the database has 1,000,000 rows and I am now downloading all of these by paging through tabledata:list. Also, let's assume that the database is updated partway through (with 10,000 rows). By using the page tokens, can I be assured that I will only download the 1m rows present when I started in the order they are in in the database?
Finally, now let's say that I come to update my copy. If I initiate the tabledata:list with a startIndex of 1,000,000 and I use a maxResults of 1000, will I get 10 pages containing the updated data that I am expecting?
I suppose all these questions boil down to whether bigquery respects the order the data is in, whether this order is used by tabledata:list, and whether appended data is guaranteed to follow previous data.
As there is a column whose values are unique, and I can perform a simple select count(1) from table to get the length of the table, I can of course check that my local copy is complete by comparing the length of my local db with that of the remote, however if the above weren't guaranteed and I ended up with holes in my data, it would be quite impractical to remedy as the primary key is not sequential (otherwise I could just fill in the missing rows) and the database is very large.
When you append data, we will append to the end of the table data list, however, bigquery may periodically coalesce data, which does not respect ordering. We have been discussing being able to preserve the ordering, or at least have a way of accessing the most recent data, but this is not yet implemented or designed. If it is an important feature for you, let us know and we'll prioritize it accordingly.
If you use page tokens, you are assured of a stable listing. If the table gets updated in the middle of paging through the data, you'll still only see the data that was in the table when you created the page token. Note that because of this, page tokens are only valid for 24 hours.
This should work as long as no coalesce has occurred since you have updated the table.
You can get the number of rows in the table by calling tables.get, which is usually simpler and faster than running a query.
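As a rough sketch of the paging loop described above: the iter_rows helper below keeps requesting pages until the response no longer carries a token. The list_page callable is a hypothetical wrapper around whatever client performs the tabledata:list request; only the rows and pageToken field names are taken from the API's response shape. The fake lister is purely an assumption, there to make the sketch runnable and to mimic the stable-listing behaviour.

```python
from typing import Callable, Iterator, Optional


def iter_rows(
    list_page: Callable[[Optional[str], int], dict],
    max_results: int = 1000,
) -> Iterator[dict]:
    """Yield every row by following page tokens until none is returned."""
    page_token = None
    while True:
        response = list_page(page_token, max_results)
        yield from response.get("rows", [])
        page_token = response.get("pageToken")
        if not page_token:
            break


def make_fake_lister(table: list) -> Callable[[Optional[str], int], dict]:
    """Hypothetical stand-in for the service: snapshots the table on the first request."""
    snapshots = {}

    def list_page(page_token, max_results):
        if page_token is None:
            snap_id, start = str(len(snapshots)), 0
            snapshots[snap_id] = list(table)  # freeze the contents seen by this listing
        else:
            snap_id, start_text = page_token.split(":")
            start = int(start_text)
        snapshot = snapshots[snap_id]
        end = start + max_results
        response = {"rows": snapshot[start:end]}
        if end < len(snapshot):
            response["pageToken"] = f"{snap_id}:{end}"
        return response

    return list_page


table = [{"id": i} for i in range(25)]
rows = iter_rows(make_fake_lister(table), max_results=10)
first = next(rows)              # the first request creates the snapshot
table.append({"id": 25})        # an append arrives while we are still paging
downloaded = [first] + list(rows)
print(len(downloaded))          # rows seen by the listing that started before the append
```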
Hopefully some smarter DBAs than I can help me find a good solution for what I need to do.
For the sake of discussion, lets assume I have a table called 'work' with some number of columns, one of which is a column that represents ownership of that row of work from a given client. The scenario is that I'll have 2 clients connected and polling a table for work to be done, when a row (or some number of rows) shows up, the first client that selects the rows will also update them to imply ownership, that update will remove those rows from being returned to any other client's selects. My question is, in this scenario, what sort of locking can I use to prevent 2 clients from hitting the table at the same time and both of them being returned the same rows via the select?
The UPDATE statement with RETURNING clause is the way to do this.
UPDATE table
SET ownership = owner
RETURNING ( column list );
My question is, in this scenario, what sort of locking can I use to prevent 2 clients from hitting the table at the same time and both of them being returned the same rows via the select?
No locking needed here.
In the UPDATE, simply specify that you only want the script to take ownership of the task if the owner is still null (assuming that's how you flag unassigned tasks). This should work:
UPDATE foo SET owner = ? WHERE id = ? AND owner IS NULL
If the number of modified rows is equal to the number you expected (or a RETURNING clause returns results, as suggested by @Ketema), then you successfully grabbed ownership.
Fake edit because I noticed your comment mere moments before submitting this answer:
eg: 2 clients issuing that query at the same time, they have no chance of manipulating the same rows?
Correct. You might want to read up on MVCC. Running these statements outside of a transaction will do the right thing. Behavior inside a transaction will be different.
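The conditional-update approach can be sketched as follows, assuming a hypothetical work table with id and owner columns, where a NULL owner means the row is unclaimed. It uses Python's sqlite3 module only so the example is runnable; the UPDATE statement itself has the same shape in PostgreSQL.

```python
import sqlite3


def claim_task(conn, task_id, client):
    """Try to take ownership of one task; return True only if this call won it."""
    with conn:  # commit the single statement as its own transaction
        cur = conn.execute(
            "UPDATE work SET owner = ? WHERE id = ? AND owner IS NULL",
            (client, task_id),
        )
    # rowcount is the number of rows the UPDATE actually modified.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE work (id INTEGER PRIMARY KEY, payload TEXT, owner TEXT)")
conn.execute("INSERT INTO work (id, payload) VALUES (1, 'job-a')")
conn.commit()

first = claim_task(conn, 1, "client-1")   # row was unclaimed, so this succeeds
second = claim_task(conn, 1, "client-2")  # owner is no longer NULL, so nothing is modified
print(first, second)
```

The losing client sees a modified-row count of zero and simply moves on to another row.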
What would be the easiest way to count the new records that are inserted into a database? Is it possible to include a count query in with the load query?
Or is something more complex needed, such as recording the existing last record and counting everything added after it?
edit:
I have a cron job, that uses LOAD DATA INFILE in a script that is passed directly to mysql. This data is used with a php web application. As part of the php web application, I need to generate weekly reports, including how many records were inserted in the last week.
I am unable to patch mysql, or drastically change the database schema/structure, but I am able to add in new tables or fields. I would prefer not to count records from the csv file and store this result in a textfile or something. INstead, I would prefer to do everything from within PHP with queries.
Assuming you're using MySQL 5 or greater, you could create a trigger that fires on insert into a specific table. Note that an "insert" trigger also fires with the "LOAD" command.
Using a trigger would require you to persist the count information in a separate table. Basically you'd need to create a new table with 1 row/column to hold the count. The trigger would then update that value with the amount of data loaded.
Here's the MySQL manual page on triggers; the syntax is fairly straightforward. http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
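A minimal sketch of the trigger-plus-counter idea is shown below. It uses Python's sqlite3 module and SQLite trigger syntax so that it runs as-is; the MySQL CREATE TRIGGER syntax differs slightly (see the manual page above), and the table names records and insert_counter are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT);

    -- Single-row, single-column table that holds the running count.
    CREATE TABLE insert_counter (inserted INTEGER NOT NULL);
    INSERT INTO insert_counter (inserted) VALUES (0);

    -- Fires once for every row inserted into records.
    CREATE TRIGGER count_inserts AFTER INSERT ON records
    BEGIN
        UPDATE insert_counter SET inserted = inserted + 1;
    END;
""")


def load_rows(conn, payloads):
    """Stand-in for the bulk load: insert one row per payload."""
    with conn:
        conn.executemany(
            "INSERT INTO records (payload) VALUES (?)",
            [(p,) for p in payloads],
        )


def inserted_so_far(conn):
    """Read the counter maintained by the trigger."""
    return conn.execute("SELECT inserted FROM insert_counter").fetchone()[0]


load_rows(conn, ["a", "b", "c"])
print(inserted_so_far(conn))
```

For a weekly report, the counter value could be recorded at the start of each week and subtracted from the current value.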
edit
Alternatively, if you don't want to persist the data within the database you could perform your "Load" operations within a stored procedure. This would allow you to perform a select count() on the table before you begin the Load and after the Load is complete. You would just need to subtract the resulting values to determine how many rows were inserted during the Load.
Here's the MySQL manual page on procedures.
http://dev.mysql.com/doc/refman/5.0/en/create-procedure.html
That would probably depend on what is determined as being new. Is it entries entered into the database in the last five minutes or 10 minutes etc? Or is it any record past a certain Auto ID?
If you are looking at time based method of determining what's new, you can have a field (probably of type datetime) that records the time when the record was inserted and to get the number, you simply do a...
select count(*) from table where currentTime > 'time-you-consider-to-be-new'
If you don't want to record the time, you can use an auto-increment key: keep track of the last inserted ID and count the rows that come after it in any given time window. For example, if the last ID was 10000 one hour ago, every record with an ID greater than 10000 has been inserted since then. Count those records, store the new last insert ID, and repeat whenever needed.
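The last-ID approach might look like the sketch below. It assumes a hypothetical records table with an auto-increment id and uses Python's sqlite3 module as a runnable stand-in; the same SELECT works against MySQL from PHP.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
)


def insert_rows(conn, payloads):
    """Stand-in for the periodic load."""
    with conn:
        conn.executemany(
            "INSERT INTO records (payload) VALUES (?)",
            [(p,) for p in payloads],
        )


def count_new_rows(conn, last_seen_id):
    """Return (rows with id > last_seen_id, new high-water mark to store)."""
    count, max_id = conn.execute(
        "SELECT COUNT(*), MAX(id) FROM records WHERE id > ?",
        (last_seen_id,),
    ).fetchone()
    # MAX(id) is NULL when nothing new arrived, so keep the old mark in that case.
    return count, (max_id if max_id is not None else last_seen_id)


insert_rows(conn, ["a", "b", "c"])
new_count, last_id = count_new_rows(conn, 0)
print(new_count, last_id)
```

The returned high-water mark has to be persisted somewhere (a small table or a config value) between report runs.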
If you are not looking at a specific table, you can use the following:
show global status like "Com_%";
This will show you statistics for every type of statement. These counters only ever increase, so if you want to use them, record the initial value when you start tracking and subtract it from your final value (but yeah, that's a given). Note that they count statements executed, not rows affected.
If you are looking for pure statistics, I can recommend using Munin with the MySQL plugins.
From where do you load the data? You might consider counting the records before you insert them into the database. If it's an SQL script, you might write a quick and dirty bash script (with grep or something similar) to count the fields.
You say you can't change the structure. Does that mean you can't change the table you are inserting into, or you can't change the database at all? If you can add a table, then just create a table with 2 columns - a timestamp and the key of the table you are loading. Before you load your csv file, create another csv file with just those two columns, and load that csv after your main one.
This might be simpler than you want, but what about a Nagios monitor to track the row count? (Also consider asking around on serverfault.com; this stuff is totally up their alley.)
Perhaps you could write a small shell script that queries the database for the number of rows. You could then have a Cron job that runs every minute/hour/day etc and outputs the COUNT to a log file. Over time, you could review the log file and see the rate at which the database is growing. If you also put a date in the log file, you could review it easier over longer periods.
See if this is the kind of MySQL data collection you're interested in: http://code.google.com/p/google-mysql-tools/wiki/UserTableMonitoring.
If that is the case, Google offers a MySQL patch (to apply to a clean mysql directory source) at http://google-mysql-tools.googlecode.com/svn/trunk/mysql-patches/all.v4-mysql-5.0.37.patch.gz. You can read more about the patch at http://code.google.com/p/google-mysql-tools/wiki/Mysql5Patches.
If this is not what you're looking for, I suggest you explain yourself a little more in order for us to help you better.
Could you use a trigger on the table which will insert into a table you created, which in the structure has a timestamp?
You could then use a date calculation on a period range to find the information needed.
I don't know what version of MySQL you are using, but here is a link to the syntax for trigger creation in version 5.0: http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
Good luck,
Matt
Well, if you need exhaustive information: which rows were inserted, updated or deleted, it might make sense to create an additional audit table to store those things with a timestamp. You could do this with triggers. I would also write a stored procedure which would execute as event and erase old entries (whatever you consider old).
Refer to the link posted by Lima on how to create triggers in MySQL.
Refer to page 655 of "MySQL Cookbook" by Paul Dubois (2nd Edition) or page 158 of "SQL for smarties" by Joe Celko.
So will the 'load' only insert new data into the table, or rewrite the whole table?
If it loads new data, then you can do a select count(*) from yourtable
once before the loading and once after the loading. The difference will show you how many new records were inserted.
If, on the other hand, you rewrite the whole table and want to find the records that differ from the previous version, then you would need a completely different approach.
Which one is it?
Your question is a bit ambiguous, but the MySQL C API provides a function, mysql_affected_rows, that you can call after each query to get the number of affected rows. For an insert it returns the number of rows inserted. Be aware that for updates it returns the number of rows changed, not the number of rows that matched the where clause.
If you are performing a number of queries and need to know how many were inserted the most reliable way would probably be doing a count before and after the queries.
As noted in sobbayi's answer adding a "created at" timestamp to your tables would allow you to query for records created after (or before) a given time.
UPDATE:
OK here is what you need to do to get a count before and after:
create a table for the counts:
create table row_counts (ts timestamp not null, row_count integer not null);
in your script add the following before and after your LOAD DATA INFILE query:
insert into row_counts (ts,row_count) select now(),count(0) from YOUR_TABLE;
load file inline......
insert into row_counts (ts,row_count) select now(),count(0) from YOUR_TABLE;
the row_counts table will now have the count before and after your load.
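For illustration, here is the before/after pattern end to end, using Python's sqlite3 module in place of the mysql script. The row_counts table follows the definition above; your_table, the load_with_counts helper, and the plain INSERT standing in for the load step are assumptions made so the sketch can run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE your_table (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE row_counts (ts TEXT NOT NULL, row_count INTEGER NOT NULL);
""")


def snapshot_count(conn):
    """Record the current row count of your_table with a timestamp."""
    conn.execute(
        "INSERT INTO row_counts (ts, row_count) "
        "SELECT datetime('now'), COUNT(*) FROM your_table"
    )


def load_with_counts(conn, payloads):
    """Snapshot, load, snapshot again; return how many rows the load added."""
    with conn:
        snapshot_count(conn)
        conn.executemany(  # stand-in for the real bulk load
            "INSERT INTO your_table (payload) VALUES (?)",
            [(p,) for p in payloads],
        )
        snapshot_count(conn)
    # The two most recent snapshots are the "after" and "before" counts.
    after, before = [
        r[0]
        for r in conn.execute(
            "SELECT row_count FROM row_counts ORDER BY rowid DESC LIMIT 2"
        )
    ]
    return after - before


print(load_with_counts(conn, ["a", "b", "c"]))
```

A weekly report can then take the difference between the first and last snapshots whose ts falls inside the week.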
show global status like 'Com_insert';
flush status and show session status... will work for just the current connection.
see http://dev.mysql.com/doc/refman/5.1/en/server-status-variables.html#statvar_Com_xxx
Since you asked for the easiest way, I would suggest using a trigger on insert. You could use a single-column, single-row table as a counter and update it with the trigger.