PAL Wizard for BizTalk performance: how to find out which process is using the most available memory - sql-server-2005

While checking the performance of a BizTalk application using PAL, I observed that the performance counter \Memory\Available MBytes has raised an alert. However, I am not able to find out which process or program is actually causing this.
I can see the information below along with the details of the performance counter, but there is no indication of which process is responsible.
Alerts
An alert is generated if any of the above thresholds were broken during one of the time intervals analyzed. An alert condition of OK means that the counter instance was analyzed, but did not break any thresholds. The background of each of the values represents the highest priority threshold that the value broke. See the 'Thresholds Analyzed' section as the color key to determine which threshold was broken. A white background indicates that the value was not analyzed by any of the thresholds.
Time | Condition | Counter | Min | Avg | Max | Hourly Trend

PAL might be referring to the BizTalk memory thresholds used in throttling. If you run Perfmon against these counters, you can see how much memory each of your BTSNTSvc instances is consuming, as well as the threshold levels that will cause throttling and, if applicable, the current throttling state of your process.
From firsthand experience, BizTalk throttling on memory conditions is a really bad thing.
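Separately from PAL, to answer the literal "which process is using the memory" part, you can simply rank processes on the BizTalk box by working set. A minimal sketch using the third-party psutil package (the package choice and the top-10 cut-off are my assumptions, not part of PAL or BizTalk):

```python
import psutil

# Rank all processes by resident (working-set) memory, largest first.
procs = []
for p in psutil.process_iter():
    try:
        procs.append((p.memory_info().rss, p.pid, p.name()))
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        continue  # process exited or is protected; skip it

for rss, pid, name in sorted(procs, reverse=True)[:10]:
    # BTSNTSvc.exe entries here are the BizTalk host instances.
    print(f"{name:<30} pid={pid:<7} {rss / 1024 / 1024:8.1f} MB")
```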

Related

Tracing only error traces in Jaeger

I was actually trying to sample only the error traces in my application, but I already have a probabilistic sampler parameter set, which makes the sampling decision on the initial span; the rest of the spans then follow that same decision. I tried using the force-sampling option in Jaeger, but it doesn't seem to override the original decision, made on the initial span, about whether to sample or not. Kindly help me out here.
Jaeger clients implement so-called head-based sampling, where a sampling decision is made at the root of the call tree and propagated down the tree along with the trace context. This is done to guarantee consistent sampling of all spans of a given trace (or none of them), because we don't want to make the coin flip at every node and end up with partial/broken traces.
Implementing on-error sampling in a head-based sampling system is not really possible. Imagine that your service calls service A, which returns successfully, and then service B, which returns an error. Let's assume the root of the trace was not sampled (because otherwise you'd catch the error normally). That means that by the time you learn of the error from B, the whole sub-tree at A has already executed and all of its spans have been discarded because of the earlier decision not to sample. The sub-tree at B has also finished executing. The only thing you can sample at this point is the spans in the current service.
You could also implement reverse propagation of the sampling decision via the response to your caller. So in the best case you could end up with a sub-branch of the whole trace sampled, plus possible future branches if the trace continues from above (e.g. via retries). But you can never capture the full trace, and sometimes the reason B failed was because A (successfully) returned some data that caused the error later.
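To make the "decision at the root" point concrete, here is a minimal, hypothetical sketch of head-based sampling (illustrative only, not the actual Jaeger client code):

```python
import random

class TraceContext:
    """Carries the sampling decision made at the root down the call tree."""
    def __init__(self, sampled):
        self.sampled = sampled

def start_root_span(probability=0.001):
    # The coin flip happens exactly once, at the root of the trace.
    return TraceContext(sampled=random.random() < probability)

def start_child_span(parent_ctx):
    # Children never re-flip; they inherit the root's decision, which is what
    # keeps a trace either fully sampled or fully dropped.
    return TraceContext(sampled=parent_ctx.sampled)

def finish_span(ctx, span):
    if ctx.sampled:
        pass  # report the span to the collector
    # Otherwise the span is discarded immediately, even if it later turns
    # out that a downstream call returned an error.
```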
Note that reverse propagation is not supported by OpenTracing or OpenTelemetry today, but it has been discussed in recent meetings of the W3C Trace Context working group.
The alternative way to implement sampling is tail-based sampling, a technique employed by some commercial vendors today, such as Lightstep and DataDog. It is also on the roadmap for Jaeger (we're working on it right now at Uber). With tail-based sampling, 100% of spans are captured from the application but only stored in memory in a collection tier until the full trace is gathered and a sampling decision is made. The decision-making code has a lot more information at that point, including errors, unusual latencies, etc. If we decide to sample the trace, only then does it go to disk storage; otherwise we evict it from memory, so that we only need to keep spans in memory for a few seconds on average. Tail-based sampling imposes a heavier performance penalty on the traced applications because 100% of traffic needs to be profiled by the tracing instrumentation.
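As an illustration of the tail-based approach (a conceptual sketch of a collector-side buffer, not Jaeger's actual implementation; the error and latency rules are assumptions):

```python
from collections import defaultdict

# trace_id -> spans received so far, held in memory only.
pending_traces = defaultdict(list)

def on_span_received(span):
    pending_traces[span["trace_id"]].append(span)

def on_trace_complete(trace_id, store):
    spans = pending_traces.pop(trace_id, [])
    if not spans:
        return
    # The decision is made with the whole trace in hand, so errors and
    # unusual latencies can drive it.
    has_error = any(s.get("error") for s in spans)
    too_slow = max(s["duration_ms"] for s in spans) > 1000  # illustrative threshold
    if has_error or too_slow:
        store(spans)  # only now does the trace go to durable storage
    # Otherwise the spans are simply dropped, freeing the memory.
```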
You can read more about head-based and tail-based sampling either in Chapter 3 of my book (https://www.shkuro.com/books/2019-mastering-distributed-tracing/) or in the awesome paper "So, you want to trace your distributed system? Key design insights from years of practical experience" by Raja R. Sambasivan, Rodrigo Fonseca, Ilari Shafer, Gregory R. Ganger (http://www.pdl.cmu.edu/PDL-FTP/SelfStar/CMU-PDL-14-102.pdf).

How to set optimum value for storm.config.setMessageTimeoutSecs

I am working on an Apache Storm topology. I have different bolts that carry out functionality like Cosmos DB insertion, REST API calls, etc.
I want to set a value for storm.config.setMessageTimeoutSecs for my Storm topology.
I have currently set it to 5 minutes, and I can still see failures on the spout side due to message timeouts. Is there a maximum value for the message timeout of a topology,
and how do I set an optimum value for storm.config.setMessageTimeoutSecs?
I don't believe there is a maximum timeout, no. You probably don't want to set it too high though, since it will mean that if a tuple is actually lost during network transfer, it will take much longer to figure out that it's been lost and to replay it.
Are you setting topology.max.spout.pending to a reasonable value? That can help to avoid too many tuples in flight at a time, which helps keep the complete latency down.
I'd also look at whether 5+ minutes is a reasonable processing time for your tuples. I don't know your use case, so maybe that's reasonable, but it seems like a long time to me.
If you actually need to process some tuples that will take such a long time, you might consider splitting your topology into smaller topologies, so you can set a high timeout for the slow part, and a lower timeout for the rest.
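There is no formula built into Storm for this, but as a rough back-of-the-envelope heuristic (an assumption on my part, not a Storm API): with topology.max.spout.pending tuples in flight, a tuple can end up waiting behind all of the others at the slowest bolt, so the timeout should comfortably exceed that worst case.

```python
import math

def suggest_message_timeout_secs(worst_case_tuple_secs, max_spout_pending,
                                 safety_factor=3):
    """Rough heuristic: in the worst case a tuple queues behind every other
    in-flight tuple at the slowest bolt, so give the timeout headroom beyond that."""
    return math.ceil(safety_factor * worst_case_tuple_secs * max_spout_pending)

# e.g. tuples that can take ~2 s end-to-end with max.spout.pending = 50
print(suggest_message_timeout_secs(2, 50))  # -> 300, i.e. the 5 minutes already tried
```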

Designing a service for scale. Number of servers needed

Suppose that I need to design a web service. To keep it simple, assume that I use LAMP (Linux-Apache-MySQL-PHP).
I know that I will serve exactly N user requests per second. The requests are basically simple CRUD operations to the database, no file uploads or complex calculations.
Suppose that each request executes in M ms and takes K MB of memory on my server, which has G GB of RAM.
How many such servers do I need? Is it just N * K / G?
A reasonable value for M is 200 ms. What is a reasonable value for K?
Do we need to take CPU power into account in this question?
Any additional considerations?
What you're doing is a good back-of-the-envelope approximation, but by no means should you use your thought exercise as a definitive guide for scaling your service.
That is because no service will exhibit the kind of constant behavior you describe (blame it on unpredictable peripheral I/O, garbage collection, external factors, user input, etc.).
The correct approach is to perform scale and load testing. After you've written your service, start load testing it and note its performance characteristics. If you do things right you should reach a point where your configuration maxes out: either the CPU, the network throughput, the memory, or disk I/O. If none of them are maxed out and you still hit a limit, then it's one of your upstream dependencies (your database, etc.).
Once you've reached your peak it will tell you how many requests per second you can handle at peak.
You will also notice that in most cases peak performance is not sustainable: your setup may be able to burst for short periods of time handling many more requests per second than under sustained load.
After you get the numbers for a single server, you can start to vary in two ways:
test with different hardware configurations (add more RAM if you're memory bound, add a better CPU if you're CPU bound, etc)
test with multiple servers; start adding servers and see how your service scales horizontally
Ideally your service should scale linearly as you add servers but you will likely find that the performance curve is not linear.
Get your numbers, tweak your design. Rinse. Repeat.
There is no substitute, magic formula.
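That said, to make the back-of-the-envelope arithmetic from the question concrete (memory only; the 70% headroom and the example numbers are assumptions, and none of this replaces load testing): with N requests/second each taking M ms, roughly N * M / 1000 requests are in flight at once (Little's law), and each of them holds K MB.

```python
import math

def servers_needed_by_memory(n_req_per_sec, m_ms, k_mb, g_gb, usable_fraction=0.7):
    """Memory-only back-of-the-envelope; CPU, network and disk I/O also need
    checking. usable_fraction leaves headroom for the OS, Apache and MySQL."""
    concurrent_requests = n_req_per_sec * (m_ms / 1000.0)   # Little's law
    memory_needed_mb = concurrent_requests * k_mb
    usable_mb_per_server = g_gb * 1024 * usable_fraction
    return math.ceil(memory_needed_mb / usable_mb_per_server)

# e.g. 500 req/s, 200 ms each, 20 MB per request, 8 GB servers
print(servers_needed_by_memory(500, 200, 20, 8))  # -> 1; memory is rarely the limit here
```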

Microsoft SQL @@cpu_busy replacement for CPU saturation stat

We have a process that was originally written for Sybase that looks at the @@cpu_busy variables to detect the CPU load on the RDBMS. The application will be running against Microsoft SQL Server.
curCPULoadPercent = 100 * (@@cpu_busy - prevCpuBusy) / ((@@cpu_busy - prevCpuBusy) + (@@io_busy - prevIOBusy) + (@@idle - prevIdle))
If any of the above deltas are negative, the ratio is ignored and the previous one is used.
Unfortunately, instead of letting the number roll over, the Microsoft engineers stop increasing it once it reaches 134217727, and so it fails to reveal CPU load after about 29 days (from what I have been reading).
(The Sybase engineers knew that a negative number subtracted from a negative number gives a positive difference.)
The following provides a hint on how to monitor CPU load, but we are unable to determine which wait types to look at.
https://msdn.microsoft.com/en-us/library/ms179984.aspx
We have been comparing the @@cpu_busy differences against the wait types, and no particular wait type matches.
Our plan is to build a predictive model to predict the old ratio from correlated wait types but I am wondering if someone has already figured this out.
Any suggestions?
Starting with SQL Server 2005, the runtime detail you can get from SQL Server for things like wait types, locks, load, and memory pressure is extremely detailed and varied.
Do not use the @@ variables for what you are doing. If you can define exactly what you are trying to detect, I can guarantee that the newer SQL dynamic management views have exactly what you are looking for, and in particular the values can be obtained per spid or for the instance as a whole.
The SQL Profiler tools can probably graph the values you are looking for.
Benjamin Nevarez has the answer for CPU utilization:
http://sqlblog.com/blogs/ben_nevarez/archive/2009/07/26/getting-cpu-utilization-data-from-sql-server.aspx
It uses Dynamic Mgmt View data from sys.dm_os_ring_buffers
where ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'.
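A sketch of pulling that data from Python via pyodbc (the query follows the ring-buffer approach in the linked post; the connection string and the ODBC driver name are assumptions for your environment):

```python
import pyodbc

# Connection details are assumptions; adjust server/auth for your environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=master;Trusted_Connection=yes;"
)

# SQL process CPU and system idle, sampled roughly once a minute by the
# scheduler monitor ring buffer; everything else is obtained by subtraction.
QUERY = """
SELECT
    record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS sql_cpu_pct,
    record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') AS system_idle_pct,
    timestamp
FROM (
    SELECT timestamp, CONVERT(xml, record) AS record
    FROM sys.dm_os_ring_buffers
    WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
      AND record LIKE '%<SystemHealth>%'
) AS rb
"""

for sql_cpu, idle, ts in conn.execute(QUERY):
    other = 100 - sql_cpu - idle
    print(f"tick={ts} sql={sql_cpu}% other={other}% idle={idle}%")
```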

How should I handle measurement logging in my Discrete Event Simulation engine?

NOTE: This question has been ported over from Programmers since it appears to be more appropriate here given the limitation of the language I'm using (VBA), the availability of appropriate tags here and the specificity of the problem (on the inference that Programmers addresses more theoretical Computer Science questions).
I'm attempting to build a Discrete Event Simulation library by following this tutorial and fleshing it out. I am limited to using VBA, so "just switch to [insert language here] and it's easy!" is unfortunately not possible. I have specifically chosen to implement this in Access VBA to have a convenient location to store configuration information and metrics.
How should I handle logging metrics in my Discrete Event Simulation engine?
If you don't want/need background, skip to The Design or The Question section below...
Simulation
The goal of a simulation of the type in question is to model a process to perform analysis of it that wouldn't be feasible or cost-effective in reality.
The canonical example of a simulation of this kind is a Bank:
Customers enter the bank and get in line with a statistically distributed frequency
Tellers are available to handle customers from the front of the line one by one taking an amount of time with a modelable distribution
As the line grows longer, the number of tellers available may have to be increased or decreased based on business rules
You can break this down into generic objects:
Entity: These would be the customers
Generator: This object generates Entities according to a distribution
Queue: This object represents the line at the bank. They find much real world use in acting as a buffer between a source of customers and a limited service.
Activity: This is a representation of the work done by a teller. It generally processes Entities from a Queue
Discrete Event Simulation
Instead of a continuous tick-by-tick simulation such as one might do with physical systems, a "Discrete Event" Simulation recognizes that in many systems only critical events require processing, and the rest of the time nothing important to the state of the system is happening.
In the case of the Bank, critical events might be a customer entering the line, a teller becoming available, the manager deciding whether or not to open a new teller window, etc.
In a Discrete Event Simulation, the flow of time is kept by maintaining a Priority Queue of Events instead of an explicit clock. Time is incremented by popping the next event in chronological order (the minimum event time) off the queue and processing as necessary.
The Design
I've got a Priority Queue implemented as a Min Heap for now.
In order for the objects of the simulation to be processed as events, they implement an ISimulationEvent interface that provides an EventTime property and an Execute method. Those together mean the Priority Queue can schedule the events, then Execute them one at a time in the correct order and increment the simulation clock appropriately.
The simulation engine is a basic event loop that pops the next event and Executes it until there are none left. An event can reschedule itself to occur again or allow itself to go idle. For example, when a Generator is Executed it creates an Entity and then reschedules itself for the generation of the next Entity at some point in the future.
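For reference, the loop described above is small; here is a sketch of its shape (shown in Python purely for brevity; the same structure maps onto VBA classes implementing ISimulationEvent plus a Min Heap module):

```python
import heapq

class SimulationEngine:
    """Minimal discrete-event loop: pop the earliest event, advance the clock,
    execute it; an event may reschedule itself for a future time."""
    def __init__(self):
        self.clock = 0.0
        self._queue = []   # min-heap ordered by event time
        self._seq = 0      # tie-breaker so simultaneous events stay ordered

    def schedule(self, event_time, execute):
        heapq.heappush(self._queue, (event_time, self._seq, execute))
        self._seq += 1

    def run(self):
        while self._queue:
            event_time, _, execute = heapq.heappop(self._queue)
            self.clock = event_time   # time jumps straight to the next event
            execute(self)             # the event may call schedule() again
```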
The Question
How should I handle logging metrics in my Discrete Event Simulation engine?
In the midst of this simulation, it is necessary to take metrics. How long are Entities waiting in the Queue? How many Activity resources are being utilized at any one point? How many Entities were generated since the last metrics were logged?
It follows logically that the metric logging should be scheduled as an event to take place every few units of time in the simulation.
The difficulty is that this ends up being a cross-cutting concern: metrics may need to be taken of Generators or Queues or Activities or even Entities. Consider also that it might be necessary to take derivative calculated metrics: e.g. measure a, b, c, and ((a-c)/100) + Log(b).
I'm thinking there are a few main ways to go:
Have a single, global Stats object that is aware of all of the simulation objects. Have the Generator/Queue/Activity/Entity objects store their properties in an associative array so that they can be referred to at runtime (VBA doesn't support much in the way of reflection). This way the statistics can be attached as needed Stats.AddStats(Object, Properties). This wouldn't support calculated metrics easily unless they are built into each object class as properties somehow.
Have a single, global Stats object that is aware of all of the simulation objects. Create some sort of ISimStats interface for the Generator/Queue/Activity/Entity classes to implement that returns an associative array of the important stats for that particular object. This would also allow runtime attachment, Stats.AddStats(ISimStats). The calculated metrics would have to be hardcoded in the straightforward implementation of this option.
Have multiple Stats objects, one per Generator/Queue/Activity/Entity as a child object. This might make it easier to implement simulation object-specific calculated metrics, but clogs up the Priority Queue a little bit with extra things to schedule. It might also cause tighter coupling, which is bad :(.
Some combination of the above or completely different solution I haven't thought of?
Let me know if I can provide more (or less) detail to clarify my question!
Any and every performance metric is a function of the model's state. The only time the state changes in a discrete event simulation is when an event occurs, so events are the only times you have to update your metrics. If you have enough storage, you can log every event, its time, and the state variables which got updated, and retrospectively construct any performance metric you want. If storage is an issue, you can calculate some performance measures within the events that affect those measures. For instance, the appropriate time to calculate delay in queue is when a customer begins service (assuming you tagged each customer object with its arrival time). For delay in system it's when the customer ends service. If you want average delays, you can update the averages in those events. When somebody arrives, the size of the queue gets incremented; when they begin service, it gets decremented. Etc., etc., etc.
You'll have to be careful calculating statistics such as average queue length, because you have to weight the queue lengths by the amount of time you were in that state: Avg(queue_length) = (1/T) * integral[queue_length(t) dt]. Since the queue length can only change at events, this actually boils down to summing the queue lengths multiplied by the amount of time spent at each length, then dividing by the total elapsed time.
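A sketch of that time-weighted bookkeeping (a hypothetical helper to illustrate the arithmetic, not part of the tutorial's design; shown in Python for brevity):

```python
class TimeWeightedStat:
    """Accumulates the time integral of a piecewise-constant quantity
    (e.g. queue length) so its time-weighted average can be reported."""
    def __init__(self, start_time=0.0, start_value=0):
        self.start_time = start_time
        self.last_time = start_time
        self.value = start_value
        self.area = 0.0   # running integral[value(t) dt]

    def update(self, now, new_value):
        # Call from every event that changes the value (arrival, start of service, ...).
        self.area += self.value * (now - self.last_time)
        self.last_time = now
        self.value = new_value

    def average(self, now):
        total_area = self.area + self.value * (now - self.last_time)
        elapsed = now - self.start_time
        return total_area / elapsed if elapsed > 0 else self.value
```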