Will collectd attempt to catch up if it has fallen behind? - collectd

I'm using collectd with a custom writer plugin. The writer reports to servers in a different data center and occasionally we have network connection issues between the data centers. Collectd continues to work but falls behind. After the network connections return to normal collectd continues to report at the same rate as specified by Interval in the config and never catches up.
Is there something I need to do in the writer plugin so that collectd will report more often or send more data to flush the queue and get caught back up?

The problem was that more metrics were being reported than could be send in the current 1 minute interval. The network issues just made the problem worse. The solution is to change the interval to 5 minutes.

Related

Are delayed messages in Redis reliable?

I have an architecture solution that relies on the delayed messages.
In short:
There are many clients (mostly mobile devices running android or ios) that can process a given job.
I am creating a job delegation (in RDBMS) for a given client expecting it to be picked up within a certain period of time and the "chosen" client receives a push notification that there is something for it to process. IMO the details about the algorithm of choosing single client out of many is irrelevant to the problem so skipping this part.
When the client pulls a job delegation then the status of it is changed from pending to processing.
As mentioned clients are mobile devices and are often carried by people in move and thus can, due to many reasons, be unable to pull the job delegation from the server and process it.
That's why during the creation of the job delegation, there is also a delayed message dispatched in Redis which is supposed to check in now() + 40 seconds if the job was pulled or not (so if the status is pending or not).
If the delegation hasn't been pulled by the client (status = pending) server times it out and creates a new job delegation with status = pending for a different client. As so on as so for.
It works pretty well except the fact that I've noticed the "check if should timeout" jobs do not ALWAYS run at the time I would expect them to be run. The average is 7 seconds later and the max is 29 seconds later for the analyzed sample of few thousands of jobs. Redis is used as a queue but also as a key-value cache store and in general heavily utilized by the system. May it become that much impacted by the load? I've sort of "reproduced" the issue also on my local environment with a containerized setup with much less load so I doubt it's entirely due to the Redis being busy.
The delay in execution (vs expected) is quite a problem here because it may happen that, especially in case of trying few clients from the list, the total time since creation of the job till it's successfully processed can increase a lot.
So back to the original question. Is the delayed messaging functionality in Redis reliable?
Are there any good recommended docs about it?
Are there any more reliable solutions designed to solve that issue?
Expecting that messages set to be executed in a given timestamp is executed no later than 2-3 seconds from that timestamp.

nifi error flow not working unless two files in queue

This is for nifi 1.3.
I have an executeScript Processor which is connected to a putSplunk Processor through a failure flow. I am doing some testing so currently I  make the executeScript Processor fail which causes a flow file to go into the failure flow. The flow file seems to sit in the queue and is not processed or at least not processed fully by the putSplunk processor. If I stop the processor group containing these processors and then start again while a
flow file is stuck in the failure flow queue then the flow file is pushed through and processed in the putSplunk Processor. I have this already running in nifi 1.1 and I do not get this issue. What is going on?
I got it working, just in case anyone runs into this. Seems like it has something to do with the scheduler while I could set the putSplunk Timer to high number of seconds in 1.1 it seems that in 1.3 I need to set the number of seconds to 0 or a low amount of time and it seems to work now. Maybe it does not run for the first time until the number of seconds specified or there are multiple flow files queued. Success and Failure are terminated so that does not matter.

Data not showing up intermittently on the OpenTSDB UI

We are running some high volume tests by pushing metrics to OpenTSDB (2.3.0) with BigTable, and a curious problem surfaces from time to time. For some metrics, an hour of data stops showing up on the web UI when we run a query. The span of "missing" data is very clearcut and borders on the hour (UTC). After a while, while rerunning the same query, the data shows up. There does not seem to be any pattern that we can deduce here, other than the hour span. Any pointers on what to look for and debug this?
How long do you have to wait before the data shows up? Is it always the most recent hour that is missing?
Have you tried using OpenTSDB CLI when this is happening and issuing a scan to see if the data is available that way?
http://opentsdb.net/docs/build/html/user_guide/cli/scan.html
You could also check via an HBase shell scan to see if you can get the raw data that way (here's information on how it's stored in HBase):
http://opentsdb.net/docs/build/html/user_guide/backends/hbase.html
If you can verify the data is there then it seems likely to be a web UI problem. If not, the next likely culprit is something getting backed up in the write pipeline.
I am not aware of any particular issue in the Google Cloud Bigtable backend layer that would cause this behavior, but I believe some folks have encountered issues with OpenTSDB compactions during periods of high load that result in degraded performance.
It's worth checking in the Google Cloud Console to see if there's any outliers in the latency, CPU or throughput graphs that correlate with the times during which you experience the issue.

Pentaho BI Server load test - Possible deadlock

We are conducting a load testing on our BI infrastructure at the moment. We are testing with 10 concurrent users against single pentaho node (bi server platform).
A test scenario for each user is:
Open pentaho page
Authenticate to the platform
Open a report using URL (like this http://itrac5125:8080/pentaho/api/repos/%3Ahome%3ALoadTesting%3A4Measures.xanalyzer/editor)
When report is refreshed go to 3) and open another report
As you see steps 3. and 4. are in the loop.
After 15 minutes of running this test the BI platform becomes extremely unresponsive. It takes almost three minutes to load home page. Once loaded, trying to push buttons like Browse Files / Create nnw did not result in any change of view.
We used a java profiler tool to what's happening inside application and discovered 200 http threads (see Threads) attachment. Around 95% of them were for the majority of time blocked waiting for a resource (see Blocked). Is this normal? I am afraid that managing this amount of threads that are waiting for a resource might be quite an overhead for processor. We checked code of BI platform (see Code) and there is indeed a lock on a resource, that judging by number of threads waiting inside this method seems to be recalculated very often.
Threads (http://postimg.org/image/4c2yug17f/full/)
Blocked (http://postimg.org/image/gm32nbd29/)
Code (http://postimg.org/image/6p5vt1b6r/)
Attaching as well cpu and ram usage graphs that were taken for the time period when the test was executed.
CPU (http://postimg.org/image/tbxubog6b/full/):
RAM (http://postimg.org/image/jecpimes9/full/):
Is there anyone experiencing similar issues? I would be happy to hear about other experience in terms of load testing / load optimazing for Pentaho BI Server.
After over a week of testing it turned out to be an issue on Pentaho side related to wrong synchronization of threads that lead to a deadlock.
We have been able to contact with Pentaho and they confirmed it is a bug on their side (see jira: http://jira.pentaho.com/browse/BISERVER-12642). This should be fixed in a service pack for Pentaho 5.4.

WinForms ReportViewer: slow initial rendering

UPDATE 2.4.2010
Yeah, this is an old question but I thought I would give an update. So, I'm working with the ReportViewer again and it's still rendering slowly on the initial load. The only difference is that the SQL database is on the reporting server.
UPDATE 3.16.2009
I have done profiling and it's not the SQL that is making the ReportViewer render slowly on the first call. On the first call, the ReportViewer control locks up the UI thread and makes the program unresponsive. After about 5 seconds the ReportViewer will unlock the UI thread and display "Report is being generated" and then finally show the report. I know 5 seconds is not much but this shouldn't be happening. My coworker does the same thing in a program of his and the ReportViewer immediately displays the "Report is being generated" upon any request.
The only difference is that the reporting server is on one server and the data is on another server. However, when I am developing the reports within SSRS, there is no delay.
UPDATE
I have noticed that only the first load of the ReportViewer takes a long time; each subsequent load of the same or different reports loads fast.
I have a WinForms ReportViewer that I'm using in Remote processing mode that can take up to 30 seconds to render when the ReportViewer.RefreshReport() method is called. However, the report itself runs fast.
This is the code to setup my ReportViewer:
rvReport.ProcessingMode = ProcessingMode.Remote
rvReport.ShowParameterPrompts = False
rvReport.ServerReport.ReportServerUrl = New Uri(_reportServerURL)
rvReport.ServerReport.ReportPath = _reportPath
This is where the ReportViewer can take up to 30 seconds to render:
rvReport.RefreshReport()
I found the answer on other forums. MSDN explains that a DLL is searching for some Verisign web server and it takes forever... there are 2 ways to turn it off, one is a checkbox in internet explorer and another is adding some lines to the app.config file of the app.
You can pull a report in two modes, local and server. If you're running in local mode, it's going to pull both the data and the report definition onto your machine, then render them both. In server mode, it's going to just let SSRS do all the work, then pull back the information to render.
If you're using local mode, it could be a hardware issue. If you've got a huge dataset, that's a lot of data to store in memory.
Other than that, that's not a lot of info to go on...
Update: since you've noticed it's only the first call that takes a while, have you done any profiling to determine if the bulk of the work is done on the backend SQL calls or is spent in the actual report render?
If it's faster on subsequent calls, it's possible you're (incidentally) caching at one level or another. You can cache reports (http://www.sqlservercurry.com/2007/12/configure-report-to-be-cached-ssrs-2005.html) or it could be that the execution plan to return the data is being cached deep in SQL Server.
In summary of the various ideas already presented, it could be
startup time for the report viewer infrastructure on the client
cache loading time on the client
query execution time at the server
report rendering time at the server
Try running the report, closing down the client, restarting the client and running the report again. If the report is much faster the second time, repeat this experiment but load, run and unload another large application in between report runs.
If the second report run continues to be much quicker, then the difference you are seeing has more to do with the SQL Server's I/O cache than what's happening on the client. You can further test this by deliberately displacing the MSSQL cache by running a query that pulls a lot of data from tables that aren't used in the report.
All of the above is interesting but unimportant. If you want to ensure snappy report response Reporting Services provides extensive support for scheduled generation of reports, so that when the consumer requests the report, the only delay is network delivery.
If your users insist on reporting on up to the minute (live) data they'll either have to specify tighter constraint parameters or get used to waiting.
ReportServer always takes a while to wake up because it's running under IIS. There is a process time out on each AppPool. We have the same issue with our ASP.NET application's report viewer. You could try increasing the AppPool keep alive times in the IIS settings.
See here:
http://www.sqlreportingservices.net/Ask/5536.aspx
http://www.developmentnow.com/g/115_2005_9_0_0_597422/First-run-of-reports-is-SLOW.htm
I'm assuming you're running SQL2005 SSRS of course.
One option is to upgrade to 2008 where SSRS no longer depends on IIS.
Thinking way out of the box: Is the report server on different machine to the one running the application? The network could be taking a long time to resolve "reportServerURL". Once resolved the name would be cached and hence subsequent calls would be quicker..
I have had this problem before with badly configured DNS servers. Try replacing "reportServerUrl" with "reportServerIPAddress" and see if the initial call to ReportViewer is any faster.
I was having this same problem.
i find out that changing the default printer(slow network here) fix the problem.
The ReportViewer gets some information from the default printer,
and since the network here is very slow, i was having 10 seconds of delay
Hope it helps
UPDATE
I have noticed that only the first load of the ReportViewer takes a long time; each subsequent load of the same or different reports loads fast.
You are set to run on server which means the SRS server needs to do the rendering as such the first time there will be a delay for one or all of the following reasons (these are the slowest of the bunch, there are others but they are quicker):
DNS resolution: The URL needs to be resolved to an IP address. Once this is done it is cached locally which speeds it up.
ASP.NET/IIS needs time to warm up. There is all kinds of compilation and initial loading that must occur - after loaded it will remain in the servers memory until you restart IIS or the default clean up time occurs.
Reporting Services needs time to warm up in the same way as ASP.NET/IIS does.
To test for this use a network monitor such as Netmon (if you are a Microsoft fan) or Wireshark (my recommendation) and watch the traffic from your machine to the server. You'll see the DNS request, then the HTTP requests go and the delay will be in the returning data. On second call you will see the speed is vastly different in the return and DNS checks.
What you could do to prevent this is a warm up script - I don't know one for SRS but here is a link to a SharePoint one which would not be hard to change since it has the exact same issues.
It seems as though you are going after the SSRS report directly. You may want to hit the SSRS web service instead. That may improve your performance.
Here is a possible resolution for your problem:
Try to access the first report from web before accessing any report with the application.
If the problem doesn't appear, you could make an application that will "preload" the first report, in order to allow reporting services to do their start-up.
I've seen this kind of solution for some demo applications from Microsoft. The applications where using Analysis Services and Reporting Services.
Good luck otherwise
To my knowledge, I think it's a problem Microsoft is finding it tough resolve.
Initially, the report loader is only slow at firt time rendering of report and subsequent reports loads noramal (a bit faster).
To help counter this, place a Startup Form with a label (Label1) and Timer (Timer1) control. Set Label1.Text="Please, wait (about 15 secs)". Set Timer1.Interval=3.
At the form_Load event of the Startup Form, set Timer1.Start.
At Tick event of Timer1 place "frmMyReportForm.reportViewer1.SetDisplayMode(Microsoft.Reporting.WinForms.DisplayMode.Normal)"
"frmMyReportForm" any of the forms in your project containing a reportviewer control.
All the delays will be caught here so that when you generate the actual report, there will be no delays.
I hope this might be helpful to my fellow developers.