Jaeger: ignore certain spans in a tracing - jaeger

I'm just getting started with implementing Jaeger in our pipeline, using python and the opentelemetry packages. A single task has let's say 5 steps. Step 1 can take 1-2 days. Then step 2 can take 3-4 hours. And then steps 3-5 typically take 5-30 minutes each.
My issue is that, this means that monitoring the overall time taken for each tracing is useful to find anomalies in Step 1, but useless for the other steps, because the variation in time taken for Step 1 swamps out any variation in the times taken for the rest of the steps. This makes it hard for the teams focused on monitoring and improving steps 2-5 to see at a glance when their processes are getting slow.
Is there a way in the Jaeger UI to create separate dashboards or queries, where tracings can be calculated based on a specified set of spans? If each span is named appropriately, can I do something like:
Dashboard 1: Show tracing times calculated with just span 1
Dashboard 2: Show tracing times calculated with just span 2
Dashboard 3: Show tracing times calculated with spans 3-5
If this isn't possible with Jaeger, is it possible with Grafana or some other frontend UI?
Note, I still need to record all spans, because I want to monitor all the steps, so removing them from the data itself is not an option. Ideally I want to do the filtering at the UI level to create separate dashboards that are useful to their respective teams.

Related

Storing vast amounts of "uptime" data for a website monitoring service

this is more of a general discussion rather than a code question.
I have a website monitoring platform whereby users of the system can input their website URL and we'll check it every X minutes based on the customer's interval, at each interval, an entry is stored as a UptimeCheck model in the Laravel 8 project with the status being down or up.
If a customer has 20 monitors, and each checks every minute, then over a 30 day period for the one customer they'd accumulate over 1 million rows.
My query, is really do I need to keep this number of rows?
The reason this number of rows is kept is so that we can present a graph showing the average website uptime.
My thinking is that if I created some kind of SVG programatically for each day and store this in the table then I wouldn't need to store as many entries, but my concern here is how would I merge SVG models into one to present a daily graph?
What kind of libraries could I use and how else might I approach this?
Unlike performance, the trick for storing uptime data is simple. You don't store it. ;)
You need to store DOWNTIME data instead. Register only unavailability events and extrapolate uptime when displaying reports.

SSAS Process Default behavior

I'm trying to make sense of Process Default behavior on SSAS 2017 Enterprise Edition.
My cube is processed daily in this standard sequence:
Loop through 30 dimensions and performing Process Add or Process Update as required.
Process approximately 80 partitions for the previous day.
Exec a Process Default as the final step.
Everything works just fine, and for the amount of data involved, performs really well. However I have observed that after the process default completes, if I re-run the process default step manually (with no other activity having occurred whatsoever), it will take exactly the same time as the first run.
My understanding was that this step basically scans the cube looking for unprocessed objects and will process any objects found to be unprocessed. Given the flow of dimension processing, and subsequent partition processing, I'd certainly expect some objects to be unprocessed on the first run - particularly aggregations and indexes.
The end to end processing time is around 65 mins, but 10 mins of this is the final process default step.
What would explain this is that if the process default isn't actually finding anything to do, and the elapsed time is the cost of scanning the meta data. Firstly it seems an excessive amount of time, but also if I don't run the step, the cube doesn't come online, which suggests it is definitely doing something.
I've had a trawl through Profiler to try to find events to capture what process default is doing, but I'm not able to find anything that would capture the event specifically. I've also monitored the server performance during the step, and nothing is under any real load.
Any suggestions or clarifications..?

ADF Dataflows; Do I have any control or influence over cluster startup time. (NOT "TTL")

Yes, I know about TTL; Yes, I'm configuring that; No, that's not what I'm asking about here.
Spinning up an initial cluster for a Dataflow takes around 5 minutes.
Starting acquiring compute from an existing "warm" cluster (i.e. one which has been left 'Alive' using TTL), for a new dataflow still appears to take 1-2 minutes.
Those are pretty large numbers, especially if you have a multi-step ETL process, and have broken up your pipeline to separate concerns (or if you're executing the dataflows in a loop, to process data per-source-day)
Controlling the TTL gives me some control over which of those two possibilities I'm triggering, but even 2 minutes can be a quite substantial overhead. (I have a pipeline where fully half the execution time is waiting for those 1-2 minute 'Acquire Compute' startups)
Do I have any control at all, over how long startup takes in each case? Is there anything that I can do to speed up the startup, or anything that I should avoid to prevent making things even worse!
There's a new feature in town, to fix exactly this problem.
Release blog:
https://techcommunity.microsoft.com/t5/azure-data-factory/how-to-startup-your-data-flows-execution-in-less-than-5-seconds/ba-p/2267365
ADF has added a new option in the Azure Integration Runtime for data flow TTL: Quick re-use. ... By selecting the re-use option with a TTL setting, you can direct ADF to maintain the Spark cluster for that period of time after your last data flow executes in a pipeline. This will provide much faster sequential executions using that same Azure IR in your data flow activities.

Google Cloud ML, extend previous run of hyperparameter tuning

I am running hyper parameter tuning using Google Cloud ML. I am wondering if it is possible to benefit from (possibly partial) previous runs.
One application would be :
I launch an hyperparameter tuning job
I stop it because I want to change the type of cluster I am using
I want to restart my hypertune job on a new cluster, but I want to benefit from previous runs I already paid for.
or another application :
I launch an hypertune campain
I want to extend the number of trials afterwards, without starting from scratch
and then for instance, I want remove one degree of liberty (e.g. training_rate), focusing on other parameters
Basically, what I need is "how can I have a checkpoint for hypertune ?"
Thx !
Yes, this is an interesting workflow -- Its not exactly possible with the current set of APIs, so its something we'll need to consider in future planning.
However, I wonder if there are some workarounds that can pan out to approximate your intended workflow, right now.
Start with higher number of trials - given you can cancel a job, but not extend one.
Finish a training job early based on some external input - eg. once you've arrived at a fixed training_rate, you could record that in a file in GCS, and mark subsequent trials with different training rate as infeasible, so those trials end fast.
To go further, eg. launch another job (to add runs, or change scale tier), you could potentially try using the same output directory, and this time lookup previous results for a given set of hyperparameters with an objective metric (you'll need to record them somewhere where you can look them up -- eg. create gcs files to track the trial runs), so the particular trial completes early, and training moves on to the next trial. Essentially rolling your own "checkpoint for hypertune".
As I mentioned, all of these are workarounds, and exploratory thoughts on what might be possible from your end with current capabilities.

Finding an applications scalibility point using JMeter

I am trying to find an applications scalibility point using JMeter. I define the scalability point as "The minimum number of concurrent users from which any increase no longer increases the Throughput per second".
I am using the following technique. Schedule my load test to run for an hour, starting a new thread sending SOAP/XML-RPC Requests every 30 seconds. I do this by setting my number of threads to 120 and my ramp up period to 3600 seconds.
Then looking at my TOTAL rows Throughput in my Summary Report Listener. A new row (thread) is added every 30 seconds, the total throughput number rises until it plateaus at about 123 requests per second after 80 of the threads are active in my case. It then slowly drops the throughput number to 120 per second as the last 20 threads are added. I then conclude that my applications scalability point is 123 requests per second with 80 active users.
My question, is this a valid way to find an application scalibility point or is there different technique that I should be trying?
From a technical perspective what you're doing does answer your question regarding one specific user scenario, though I think you might be missing the big picture.
First of all keep in mind that the actual HTTP request you're sending and ramp up times can often impact what you call a scalability point. Are your requests hitting a cache? Are they not random enough? Are they too random? Do they represent real world requests? is 30 seconds going to give you the same results as 20 seconds or 10 seconds?
From my personal experience it's MUCH easier and more intuitive to look at graphs when trying to analyze app performance. It's not just a question of raw numbers but also looking and trends and rates of change.
For example here is an example testing the ghost.org blogging platofom using JMeter with an interactive JMeter results graph.
http://blazemeter.com/blog/ghost-performance-benchmark