As part of Extended Events I track rpc_completed and sql_batch_completed events. The goal is to catch all the queries hitting the db so I have a better understanding of the traffic. Based on that I would like to split applications into two groups: read and read-write. The read ones I could later move to the read-only AG nodes for performance reasons. My question is: what is the best way to do that? I can see the Writes field, which at first sight looks like what I am after. I have done some tests and it looks fine; I can see that for insert/delete/update statements the value in this column is greater than 0, while for selects it is 0.
Do you know of any pitfalls of fully depending on that field? Could you recommend another way of dealing with that task?
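For reference, a minimal sketch of the session I'm using (the session name and file path are just placeholders); the Writes field is part of the default payload of both events:

-- capture completed RPCs and batches; the writes column comes with the event payload
CREATE EVENT SESSION TrackQueryWrites ON SERVER
ADD EVENT sqlserver.rpc_completed
    (ACTION (sqlserver.client_app_name, sqlserver.database_name)),
ADD EVENT sqlserver.sql_batch_completed
    (ACTION (sqlserver.client_app_name, sqlserver.database_name))
ADD TARGET package0.event_file (SET filename = N'C:\Temp\TrackQueryWrites.xel');

ALTER EVENT SESSION TrackQueryWrites ON SERVER STATE = START;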
Update:
Do you know what is the definition of the writes field for sql_batch_completed or rpc_completed?
I couldn't find it. Is it the same as for Profiler? Per the thread "Making sense of the number of reads/writes in SQL Profiler" it is:
Writes: Number of logical write I/Os issued by the user during the connection.
Kind regards,
Rafal
So in the past, for our ETL processes, we used to do a column-by-column lookup and comparison. This approach worked for us for a while and no one bothered to go back and look at how to make it more efficient.
Recently we noticed that our job run times are starting to creep up, which led to some discussion, and one suggestion for optimizing our process was to use "hashing".
From my limited time Googling I am getting some mixed messages about the benefits of hashing.
Now, I know a column-by-column lookup is a tried-and-true method: simple to understand and implement, and relatively accurate.
With hashing - How would this be any better?
Logically I would think that with hashing, instead of comparing multiple rows and many columns, we could compare multiple rows and just one single column, for example.
But there is mention that there could be "clashes" or "collisions" with hashing - so could this lead to inaccuracies or challenges in troubleshooting?
Before we start tearing things apart I just wanted to see if I may have missed anything.
EDIT 2022-07-16
When I refer to Azure I used it a bit generically, as we are planning on using Azure Data Factory and Azure Synapse for our data movement/ETL/ELT. So we would be using BOTH Azure SQL Server and also Azure Data Lake Gen2 (and eventually other Azure services). It's quite a few moving pieces, in particular with ADLS - one item we were trying to figure out is how we can compare differences between "files". That led us down a rabbit hole of "can we use hashing? What is hashing?"
Hashing is normally better as you only compare one column, the one that contains the hash, rather than every column.
Collisions can occur with hashes, but very rarely, and it depends on the hash function you use. You would need to research your chosen hash function, determine under what conditions you might get collisions, and then decide whether that is a concern given your particular circumstances.
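As a rough illustration of the idea in T-SQL (the table and column names are invented; HASHBYTES with SHA2_256 assumes SQL Server 2012+ and CONCAT_WS assumes 2017+): hash the comparable columns once per row, then compare only that single value.

-- Rows whose comparable columns differ between staging and target,
-- detected by comparing one hash value instead of every column.
-- Note: decide how NULLs are represented so they cannot collide with empty strings.
SELECT s.BusinessKey
FROM   staging.Customer AS s
JOIN   dbo.Customer     AS t ON t.BusinessKey = s.BusinessKey
WHERE  HASHBYTES('SHA2_256', CONCAT_WS(N'|', s.FirstName, s.LastName, s.Email))
    <> HASHBYTES('SHA2_256', CONCAT_WS(N'|', t.FirstName, t.LastName, t.Email));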
I was reading about DDD and CQRS (using ASP.NET Core and MSSQL) and their different approaches, and then I read a topic about separating the read and write databases. So I started to search the web for how to do so and how to sync those databases, but sadly (maybe I was searching wrong) I didn't find any good source on how to do it.
So here is my question:
How should I separate those databases, and then how should I sync the data between them? E.g. I have a table called "User" which exists in both the separated read and write dbs; now if I add a new row to the table in the write db, I have to tell the read db to sync itself with the write db so I can have the new data there to query and use later - but how? I also read something about the Event Sourcing pattern and Event-Driven Architecture, but they didn't help me figure out how to sync.
So does anyone know how to do this, or have any good resources on this topic which can help a dummy? :)
(Consider you're explaining to someone who is learning it for the first time!)
Thanks!
I have a related answer that may provide some background on how to approach CQRS.
The main point to keep in mind is that the "write" side is concerned with changes/transactions (OLTP) and the "read" side is concerned with queries (OLAP).
How you update your "read" side (read model) is going to depend on how you make the "write" side changes. When using an Event Store things may be easier, in that each event has a global sequence number and each projection (read model) tracks where it is in terms of that global sequence number. So when new events arrive (the projection polls for them), they can be actioned if they apply to the projection.
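A crude sketch of that mechanism in T-SQL (all names are hypothetical): the event store carries a global sequence number, and each projection remembers the last number it has applied and polls for anything newer.

-- Hypothetical event store with a global sequence number.
CREATE TABLE dbo.EventStore
(
    SequenceNumber bigint IDENTITY(1,1) PRIMARY KEY,
    EventType      nvarchar(200) NOT NULL,
    Payload        nvarchar(max) NOT NULL,
    RecordedAt     datetime2     NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Each projection (read model) tracks how far it has read.
CREATE TABLE dbo.ProjectionCheckpoint
(
    ProjectionName     nvarchar(200) PRIMARY KEY,
    LastSequenceNumber bigint NOT NULL
);

-- Polling query: fetch the events this projection has not yet applied.
SELECT e.SequenceNumber, e.EventType, e.Payload
FROM   dbo.EventStore e
JOIN   dbo.ProjectionCheckpoint c ON c.ProjectionName = N'UserListProjection'
WHERE  e.SequenceNumber > c.LastSequenceNumber
ORDER BY e.SequenceNumber;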
If you simply update the "write" side with, say, a SQL query then things are going to be a bit different, and possibly tricky, since you don't have any mechanism to replay those changes into the read model should you wish to change it. In such a case you could use messaging (and possibly store the messages), or make the changes to the "read" side together with the "write" side... which isn't ideal, unless you need 100% consistency.
As mentioned by @Levi Ramsey, the read model is usually quite a bit different from the write model in that it is optimised for reading, so it may include denormalized data or simply be in a data store that is more suited to read models.
The main benefit of CQRS is around being able to use different data models and/or different databases for queries vs. updates. If they are using the same data model, there's often not much benefit (at least not with a DB like SQL Server which is, at most scales, reasonable for both) to CQRS.
This in turn implies that it's generally not possible to just have the two databases automatically be in sync, because there's going to be some model translation involved (e.g. from a relational DB (with a normalized schema) like SQL Server to a denormalized document DB like Mongo).
One fairly common pattern is to have the software which writes to the DB also publish events describing what was updated to some event bus. Another piece of software subscribes to those events and performs the appropriate updates to the read DB. Note that this implies the existence of a period of time where queries against the read DB and the write DB will give different results.
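If both sides live in SQL Server, one way to capture those events on the write side is an outbox-style table filled in the same transaction as the write-model change; a separate worker then reads the outbox, pushes each row to the bus (or applies it straight to the read database), and marks it dispatched. This is only a sketch and every name in it is invented:

DECLARE @UserId int = 42, @UserName nvarchar(100) = N'alice';   -- sample values

BEGIN TRANSACTION;

    -- the normal write-model change
    INSERT dbo.Users (UserId, UserName) VALUES (@UserId, @UserName);

    -- an event describing the change, committed atomically with it
    INSERT dbo.Outbox (EventType, AggregateId, Payload, CreatedAt)
    VALUES (N'UserCreated', @UserId, @UserName, SYSUTCDATETIME());

COMMIT TRANSACTION;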
This might seem like a fairly specific question but I'm wondering if there is any technology/pattern out there that might help me in a current project. I have a hugely complex database which is updated by multiple systems. I now need to do change tracking on various bits of data that is spread across multiple tables so that I can send it to a third party system.
I've considered a number of options but unfortunately I can't seem to come to any conclusion other than using database triggers. I'm thinking of storing a flag in a table (or queue) to identify the rows that have changed and then building an XML diff containing the changed data to send to a web service. This feels a little dirty, so I was wondering if anyone could think of a better alternative.
Depending on the database platform you're using, you might look into Change Data Capture. Since you mention .NET, here's some info about it: http://technet.microsoft.com/en-us/library/bb522489(v=sql.105).aspx
Other database systems may offer something similar.
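On SQL Server, enabling CDC and reading the captured changes looks roughly like this (dbo.Orders is just an example table):

-- Enable CDC for the database, then for each table of interest.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'Orders',
     @role_name     = NULL;   -- no gating role in this sketch

-- Later, read everything captured so far for that table.
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Orders'),
        @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM   cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all');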
Another option would be insert/update/delete triggers on the tables, however triggers should be approached carefully as they can cause some significant performance problems if not done right.
And yet another option would be what you describe - some sort of flag to monitor for changes. Simple CREATED and MODIFIED timestamp fields can go a long way here: rather than just a bit indicator suggesting that the row may need attention, you'll know when the update happened, and your export process can be programmed accordingly (e.g., select * from table where modified > getdate()-1).
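A sketch of that last approach with made-up names: default both columns, keep MODIFIED current from the application (or a trigger), and let the export job pick up anything changed since its last run.

ALTER TABLE dbo.Orders
    ADD Created  datetime2 NOT NULL CONSTRAINT DF_Orders_Created  DEFAULT SYSUTCDATETIME(),
        Modified datetime2 NOT NULL CONSTRAINT DF_Orders_Modified DEFAULT SYSUTCDATETIME();

-- Export job: everything touched since the previous run (watermark stored by the job).
DECLARE @LastExportedAt datetime2 = '2024-01-01';   -- placeholder watermark

SELECT *
FROM   dbo.Orders
WHERE  Modified > @LastExportedAt;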
I want to write a generic logging snippet into a collection of stored procedures. I'm writing this to have a quantitative measure of our front-end user experience, as I know which SPs are used by the front-end software and how they are used. I'd like to use this to gather a baseline before we commence performance tuning and afterwards to show the outcome of the tuning.
I can dynamically pull the object name from @@PROCID, but I've been unable to determine all the parameters passed and their values. Anyone know if this is possible?
EDIT: marking my response as the answer to close this question. It appears extended events are the least intrusive option in terms of performance; however, I'm not sure if there is any substantial difference between minimal profiling and extended events. Perhaps something for a rainy day.
I can get the details of the parameters taken by the proc without parsing its text (at least in SQL Server 2005).
select * from INFORMATION_SCHEMA.PARAMETERS
where SPECIFIC_NAME = OBJECT_NAME(@@PROCID)
And I guess that this means that I could, with some appropriately madcap dynamic SQL, also pull out their values.
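One caveat: dynamic SQL runs in its own scope, so it can list the parameter names but cannot read the calling procedure's parameter values; inside the proc you still have to reference them explicitly. A minimal sketch of the kind of logging snippet the question describes, with a hypothetical log table and example parameters:

-- Hypothetical log table.
CREATE TABLE dbo.ProcCallLog
(
    LogId    int IDENTITY(1,1) PRIMARY KEY,
    ProcName sysname       NOT NULL,
    Params   nvarchar(max) NULL,
    CalledAt datetime      NOT NULL DEFAULT GETDATE()
);

-- Snippet pasted at the top of each procedure (@CustomerId, @OrderDate are examples).
INSERT dbo.ProcCallLog (ProcName, Params)
VALUES (OBJECT_NAME(@@PROCID),
        N'@CustomerId=' + CAST(@CustomerId AS nvarchar(20))
      + N', @OrderDate=' + CONVERT(nvarchar(30), @OrderDate, 126));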
I don't know how to do this off the top of my head, but I would consider running a trace instead if I were you. You can use SQL Server Profiler to gather only information for the stored procedures that you specify (using filters). You can send the output to a table and then query the results to your heart's content. The output can include IO information, what parameters were passed, the client userid and machine, and much much more.
After running the trace you can aggregate the results into reports that would show how many times a procedure was called, what parameters were used, etc...
Here is a link that might help:
http://msdn.microsoft.com/en-us/library/ms187929.aspx
It appears the best solution for my situation is to do profiling, gathering only SP:Starting and SP:Completed, and to write some T-SQL to iterate through the data and populate a tracking table.
I personally preferred code generation for this, but politically where I'm working they preferred this solution. We lost some granularity in logging, but it is a sufficient solution to my problem.
EDIT: This ended up being an OK solution. Even profiling just these two items degrades performance to a noticeable degree. :( I wish we had an MSFT-provided way to profile a workload that didn't degrade production performance. Oracle has a nice solution to this, but it has its tradeoffs as well. I'd love to see MSFT implement something similar. The new DMVs and extended events help to correlate items. Thanks again for the link Martin.
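For the record, the extended events equivalents of SP:Starting and SP:Completed are module_start and module_end; a lightweight session scoped to one database might look like this (session name, database name and file path are placeholders):

CREATE EVENT SESSION TrackProcCalls ON SERVER
ADD EVENT sqlserver.module_start
    (WHERE (sqlserver.database_name = N'MyAppDb')),
ADD EVENT sqlserver.module_end
    (WHERE (sqlserver.database_name = N'MyAppDb'))
ADD TARGET package0.event_file (SET filename = N'C:\Temp\TrackProcCalls.xel');

ALTER EVENT SESSION TrackProcCalls ON SERVER STATE = START;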
I'd like to start a discussion about the implementation of a database system.
I'm working for a company with a database system that has grown over roughly the last 10 years.
Let me try to describe what it's doing and how it's implemented:
The system is divided into 3 main parts handled by 3 different teams.
Entry:
The Entry Team is responsible for creating GUIs for the system. In the background is a huge MS SQL database (ca. 100 tables) and the GUI is created using .NET. There are different GUI applications and each application has lots of different tabs to fill in the corresponding tables. If e.g. a new column is added to the database, this column is added manually to the GUI application.
Dataflow:
The purpose of the Dataflow Team is to do data calculations and prepare the data for the Reporting Team. This is done via multiple levels. Let me try to explain the process a little bit more in detail:
1. The Dataflow Team uses the data from the Entry database, copied to another server and another database via transactional replication (this data contains information from all clients).
2. Once per hour a self-written application checks for changed rows in the input tables (using a ChangedDate column) and then calls a stored procedure for each output table, calculating new data using 1-N of the input tables.
3. After that the data is copied to another database on another server, again using transactional replication. Here another stored procedure is called to calculate additional new output tables. This stored procedure is started via a SQL job.
4. From there the data is split into different databases, each database being client specific. This copying is done by another self-written application using the .NET bulk copy command (filtering on the client).
5. These client-specific databases are copied to different client-specific reporting databases on other servers via another self-written application, which compares the reporting database with the client-specific database to calculate the data difference. Just the differences are copied (because the reporting databases used to run on the client servers).
This whole process is orchestrated by another self-written application to control, e.g., whether the transactional replications have finished before starting the job that calls the stored procedure, etc. Furthermore, the synchronisation between the different clients is also orchestrated here. The process can be graphically displayed by a self-written monitoring tool, which looks pretty complex as you can imagine...
The status of all these components is logged and can be viewed with another self-written application.
If new columns or tables are added, all these components have to be changed manually.
For deployment, installation instructions are written using MS Word. (About 10 people work in this team.)
Reporting:
The Reporting Team created its own platform written in .NET to allow the client to create custom reports via a GUI. The reports are accessible via the web.
The biggest tables have around 1 million rows. So, I hope I didn't forget anything important.
Well, what I want to discuss is how other people handle this scenario; I can't imagine that every company writes its own custom applications.
What are the actual options for allowing fast calculations on databases (besides T-SQL)? I'm somehow missing the link here to the object-oriented programming I'm used to from my old company, but we never dealt with so much data, and maybe for fast calculations this is the way to do it... Or is it possible to use e.g. LINQ or BizTalk Server to create the algorithms and calculations, maybe even in a graphical way? The question is just how to convert the existing meter-long stored procedures into the new format...
In future we want to use data warehousing, but that will take a while, so maybe it's possible to have a separate step to streamline the process.
Any comments are appreciated.
Thanks
Daniel
Why on earth would you want to convert existing working complex stored procs (which can be performance tuned) to LINQ (or am I misunderstanding you)? Because you personally don't like T-SQL? Not a good enough reason. Are they too slow? Then they can be tuned (which is something you really don't want to try to do in LINQ). It is possible the process can be made better using SSIS, but as complex as SSIS is, and given the amount of time a rewrite of the process would take, I'm not sure you really would gain anything by doing so.
"I'm somehow missing the link here to the object oriented programming..." Relational databases are NOT Object-oriented and cannot perform well if you try to treat them like they are. Learn to think in terms of sets not objects when accessing databases. You are coming from the mindset of one user at a time inserting one record at a time, but this is not the mindset neeeded to deal with the transfer of large amounts of data. For these types of things, using the database to handle the problem is better than doing things in an object-oriented manner. Once you have a large amount of data and lots of reporting, people are far more interested in performance than you may have been used to in the past when you used some tools that might not be so good for performance. Whether you like T-SQL or not, it is SQL Server's native language and the database is optimized for it's use.
The best advice, having been here before, is to begin by learning how SQL works, and doing that in the context of the existing architecture sounds like a good way to start (since nothing you've described sounds irrational on the face of it).
Whatever abstractions you try to lay on top (LINQ, Biztalk, whatever) all eventually resolve to pure SQL. And almost always they add overhead and complexity.
Your OO paradigms aren't transferable. Any suggestions about abstractions will need to be firmly defensible based on your firm grasp of the SQL consequences.
It will take a while, but it's all worth knowing, both professionally and personally.
I'm currently re-engineering a complex system which is moving from Focus (a database and language) to a data warehouse (separate team) and processing (my team) and reporting (separate team).
The current process is combined - data is loaded and managed in the Focus language and Focus database(s) and then reported (and historical data is retained).
In the new process, the DW is loaded and then our process begins. Our processes are completely coded in SQL, and a million-row fact table (for one month) would be relatively small. We have some feeds where the monthly data is 25 million rows. There are some statistics tables produced which are over 200 million rows (a month). The processing can take several hours a month, end to end. We use tables to store intermediate results, and we ensure indexing strategies are suitable for the processing. Except for one piece implemented as an SSIS flow from the database back to itself because of extremely poor scalar UDF performance, the entire system is implemented as a series of T-SQL SPs.
We also have a process monitoring system similar to what you are discussing, as well as having the dependencies in a table, which ensures that each process runs only if all its prerequisites are satisfied. I've recently grafted on MSAGL to graphically display and interact with the process (previously I was using Graphviz to generate static images) from a .NET Windows application. The new system thus has much clearer dependency information as well as good information about process performance, so effort can be concentrated on the slowest-performing bottlenecks.
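A stripped-down illustration of that dependency idea (the tables and statuses are invented): record which process depends on which, record each run, and only start a process once every prerequisite has succeeded.

CREATE TABLE dbo.ProcessDependency
(
    ProcessName sysname NOT NULL,
    DependsOn   sysname NOT NULL,
    PRIMARY KEY (ProcessName, DependsOn)
);

CREATE TABLE dbo.ProcessRun
(
    ProcessName sysname     NOT NULL,
    RunDate     date        NOT NULL,
    Status      varchar(20) NOT NULL,   -- 'Running', 'Succeeded', 'Failed'
    PRIMARY KEY (ProcessName, RunDate)
);

-- Ready to run only if no prerequisite is missing or unsuccessful today.
DECLARE @ProcessName sysname = N'CalcOutputTables',
        @RunDate     date    = CAST(GETDATE() AS date);

SELECT CASE WHEN EXISTS
       (SELECT 1
        FROM   dbo.ProcessDependency d
        LEFT JOIN dbo.ProcessRun r
               ON  r.ProcessName = d.DependsOn
               AND r.RunDate     = @RunDate
               AND r.Status      = 'Succeeded'
        WHERE  d.ProcessName = @ProcessName
          AND  r.ProcessName IS NULL)
       THEN 0 ELSE 1 END AS ReadyToRun;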
I would not plan on doing any re-engineering of any complex system without a clear strategy, a good inventory of the existing system and a large budget for time and money.
From the sounds of what you are saying, you have a three step process.
Input data
Analyze data
Report data
Steps one and three need to be completed by "users". Therefore, a GUI is needed for each respective team to do the task at hand; otherwise, they would be working directly on SQL Server and would require extensive SQL knowledge. For these items, I do not see any issue with the approach your organization is taking: you are building a customized system to report on the data at hand. The only item that might be worth considering on this side is standardization between the teams on common libraries and the technologies used.
Your middle step does seem to be a bit lengthy, with many moving parts. However, I've worked on a number of large reporting systems where that is truly the only way to get around it, though it's hard to say more without knowing your organization and the exact nature of its operations.
By "fast calculations" you must mean "fast retrieval" Data warehouses (both relational and otherwise) are fast with math because the answers are pre-calculated in advance. SQL, unless you are using CLR stored procedures, is usually a rather slow when it comes to math.
You'd be hard-pressed to beat the performance of BCP and SQL with anything else. If the update routines are long and bloated because they loop through the tables, then sure, I can see why you'd want to go to .NET. But you'd probably gain more performance by figuring out how to rewrite them to be nice and set-based. BCP is not going to be beaten. When I used SQL Server 2000, BCP was often faster than DTS. And SSIS in general (due to all the data type checking) seems to be way slower than DTS. If you kill performance, no doubt people are going to be coming to you. Still, if you are doing a ton of row-by-row complex calculations, optimizing that into a CLR stored procedure, or even a .NET application that is called from SQL Server to do the processing, will probably result in a speed-up. Of course, if you were row processing and you manage to rewrite the queries to do set processing, you'd probably get a bigger speed-up. But depending upon how complex the calculations are, .NET may help.
Now if a front end change could immediately update and propagate the data, then you might want to change things to .NET so that as soon as a row is changed it can be recalculated and update all the clients. However if a lot of rows are changed or the database is just ginormous then you will kill performance. If the operation needs to be done in bulk then probably the way it is currently being done is the best.
The only thing I might add is that maybe there is a lot of duplicate SQL that looks exactly the same except for the table name and/or the column names. If so, you can probably use .NET combined with SQL-SMO (or DMO if using SQL Server 2000) to code-generate it.
Here's an example that I often see used to load a data warehouse (a rough sketch follows the steps below):
Assuming some raw tables are loaded with the data from the source:
select changed rows from source into temporary tables
see if any columns that matter were changed
if so terminate existing row (or clone it into some history table)
insert/update new row
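A rough T-SQL sketch of those steps for one hypothetical table (in practice the key and the "columns that matter" would be generated per table, and NULL-safe comparisons added):

-- 1. changed rows from the source into a temporary table
SELECT s.CustomerKey, s.Name, s.Segment
INTO   #Changed
FROM   stage.Customer AS s;

-- 2./3. a column that matters changed: terminate the existing current row
UPDATE d
SET    d.EffectiveTo = GETDATE()
FROM   dw.DimCustomer AS d
JOIN   #Changed       AS c ON c.CustomerKey = d.CustomerKey
WHERE  d.EffectiveTo IS NULL
  AND (c.Name <> d.Name OR c.Segment <> d.Segment);

-- 4. insert the new version of any row that no longer has a current version
INSERT dw.DimCustomer (CustomerKey, Name, Segment, EffectiveFrom, EffectiveTo)
SELECT c.CustomerKey, c.Name, c.Segment, GETDATE(), NULL
FROM   #Changed AS c
LEFT JOIN dw.DimCustomer AS d
       ON d.CustomerKey = c.CustomerKey
      AND d.EffectiveTo IS NULL
WHERE  d.CustomerKey IS NULL;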
I often see one of those queries per table and the only variations are the table/column names and maybe references to the key column. You can easily get the column definitions and key definitions out of SQL Server and then make a .NET program to create the INSERT/SELECT/ETC. In the worst case you may just have to store some type of table with TABLE_NAME, COLUMN_NAME for the columns that matter. Then instead of having to wrap your head around a complex ETL process and 20 or 200 update queries, you just need to wrap your head around UPDATE and one query. Any changes to the way things are done can be done once and applied to all the queries.
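Much of that text can even be generated without leaving T-SQL; for example, building the column list per table from the catalog (STRING_AGG assumes SQL Server 2017+, so an older instance would use the FOR XML PATH trick instead):

-- One comma-separated, quoted column list per table, ready to splice into
-- generated INSERT/SELECT statements.
SELECT  c.TABLE_SCHEMA,
        c.TABLE_NAME,
        STRING_AGG(CONVERT(nvarchar(max), QUOTENAME(c.COLUMN_NAME)), N', ')
            WITHIN GROUP (ORDER BY c.ORDINAL_POSITION) AS ColumnList
FROM    INFORMATION_SCHEMA.COLUMNS AS c
GROUP BY c.TABLE_SCHEMA, c.TABLE_NAME;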
In particular my guess is that you can apply this technique to the individual client databases if you haven't already. Probably all the queries/bulk copy scripts are the same or almost the same with the exception of database/server name. So you can just autogenerate them based on a CLIENTs table or something.....