ApplicationInsights dotnet core custom initializer - asp.net-core

I'm unable to figure out how to configure AI for aspnetcore project. I've done the following:
services.AddSingleton<ITelemetryInitializer, AppInsightsInitializer>();
services.AddApplicationInsightsTelemetry(Configuration);
Where I need the loggedin user and the servicename so I've got this initializer:
public class AppInsightsInitializer : ITelemetryInitializer
{
private IHttpContextAccessor _httpContextAccessor;
public AppInsightsInitializer(IHttpContextAccessor httpContextAccessor)
{
_httpContextAccessor = httpContextAccessor ?? throw new ArgumentNullException("httpContextAccessor");
}
public void Initialize(ITelemetry telemetry)
{
var httpContext = _httpContextAccessor.HttpContext;
telemetry.Context.Properties["appname"] = "MyCoolService";
if (httpContext != null && httpContext.User.Identity.IsAuthenticated == true && httpContext.User.Identity.Name != null)
{
telemetry.Context.User.AuthenticatedUserId = httpContext.User.Identity.Name;
}
}
}
I've got no applicationinights.config file (I understand they are not needed)
Problem: I got 4 entries of each log (same id). The data is correct. I also got the following errror in the logs:
AI: Error collecting 9 of the configured performance counters. Please check the configuration. Counter \Process(??APP_WIN32_PROC??)\% Processor Time: Failed to perform the first read for performance counter. Please make sure it exists.
Category: Process, counter: % Processor Time, instance Counter \Memory\Available Bytes: Failed to register performance counter.
Category: Memory, counter: Available Bytes, instance: .
Counter \ASP.NET Applications(??APP_W3SVC_PROC??)\Requests/Sec: Failed to perform the first read for performance counter. Please make sure it exists.
Category: ASP.NET Applications, counter: Requests/Sec, instance MyCoolv3.Api.exe Counter \.NET CLR Exceptions(??APP_CLR_PROC??)\# of Exceps Thrown / sec: Failed to perform the first read for performance counter. Please make sure it exists.
Category: .NET CLR Exceptions, counter: # of Exceps Thrown / sec, instance Counter \ASP.NET Applications(??APP_W3SVC_PROC??)\Request Execution Time: Failed to perform the first read for performance counter. Please make sure it exists.
Category: ASP.NET Applications, counter: Request Execution Time, instance MyCoolv3.Api.exe Counter \Process(??APP_WIN32_PROC??)\Private Bytes: Failed to perform the first read for performance counter. Please make sure it exists. Category: Process, counter: Private Bytes, instance Counter \Process(??APP_WIN32_PROC??)\IO Data Bytes/sec: Failed to perform the first read for performance counter. Please make sure it exists.
Category: Process, counter: IO Data Bytes/sec, instance Counter \ASP.NET Applications(??APP_W3SVC_PROC??)\Requests In Application Queue: Failed to perform the first read for performance counter. Please make sure it exists. Category: ASP.NET Applications, counter: Requests In Application Queue, instance MyCoolv3.Api.exe
Counter \Processor(_Total)\% Processor Time: Failed to register performance counter. Category: Processor, counter: % Processor Time, instance: _Total.
I'm using:
<PackageReference Include="Microsoft.ApplicationInsights.AspNetCore" Version="2.1.0-beta1" />
<PackageReference Include="Microsoft.AspNetCore" Version="1.1.1" />
<PackageReference Include="Microsoft.AspNetCore.Mvc" Version="1.1.2" />

services.AddApplicationInsightsTelemetry(Configuration);
isn't needed anymore (hypothetically, you should even be getting deprecation warnings on that line?)
if you create a asp.net core new project in VS2017, AI will already be there in package references (though the 2.0 version, not that 2.1 beta version), and all of the wireup would already be done in program.cs and in some other files.
If you're porting an existing one, then instead of the above AddApplicationInsights... line, instead you'd have
.UseApplicationInsights()
in your program.cs startup of your app instead. for more details, there's some info about this in the 2.1 beta release notes on github
We're also working on updating the "configure application insights" tools in VS2017 to properly "migrate" apps like this in a future update.
I'm not sure why you'd get multiple instances of any events unless you're explicitly logging them, or if you possibly have multiple calls to startup (which also shouldn't affect anything). Where are are you seeing multiple instances? in VS's appinsights tools? in the portal?

Related

Static Hangfire RecurringJob methods in LINQPad are not behaving

I have a script in LINQPad that looks like this:
var serverMode = EnvironmentType.EWPROD;
var jobToSchedule = JobType.ABC;
var hangfireCs = GetConnectionString(serverMode);
JobStorage.Current = new SqlServerStorage(hangfireCs);
Action<string, string, XElement> createOrReplaceJob =
(jobName, cronExpression, inputPackage) =>
{
RecurringJob.RemoveIfExists(jobName);
RecurringJob.AddOrUpdate(
jobName,
() => new BTR.Evolution.Hangfire.Schedulers.JobInvoker().Invoke(
jobName,
inputPackage,
null,
JobCancellationToken.Null),
cronExpression, TimeZoneInfo.Local);
};
// psuedo code to prepare inputPackage for client ABC...
createOrReplaceJob("ABC.CustomReport.SurveyResults", "0 2 * * *", inputPackage);
JobStorage.Current.GetConnection().GetRecurringJobs().Where( j => j.Id.StartsWith( jobToSchedule.ToString() ) ).Dump( "Scheduled Jobs" );
I have to schedule in both QA and PROD. To do that, I toggle the serverMode variable and run it once for EWPROD and once for EWQA. This all worked fine until recently, and I don't know exactly when it changed unfortunately because I don't always have to run in both environments.
I did purchase/install LINQPad 7 two days ago to look at some C# 10 features and I'm not sure if that affected it.
But here is the problem/flow:
Run it for EWQA and everything works.
Run it for EWPROD and the script (Hangfire components) seem to run in a mix of QA and PROD.
When I'm running it the 'second time' in EWPROD I've confirmed:
The hangfireCs (connection string) is right (pointing to PROD) and it is assigned to JobStorage.Current
The query at the end of the script, JobStorage.Current.GetConnection().GetRecurringJobs() uses the right connection.
The RecurringJob.* methods inside the createOrReplaceJob Action use the connection from the previous run (i.e. EWQA). If I monitor my QA Hangfire db, I see the job removed and added.
Temporary workaround:
Run it for EWQA and everything works.
Restart LINQPad or use 'Cancel and Reset All Queries' method
Run it for EWPROD and now everything works.
So I'm at a loss of where the issue might lie. I feel like my upgrade/install of LINQPad7 might be causing problems, but I'm not sure if there is a different way to make the RecurringJob.* static methods use the 'updated' connection string.
Any ideas on why the restart or reset is now needed?
LINQPad - 5.44.02
Hangfire.Core - 1.7.17
Hangfire.SqlServer - 1.7.17
This is caused by your script (or a library that you call) caching something statically, and not cleaning up between executions.
Either clear/dispose objects when you're done (e.g., JobStorage.Current?) or tell LINQPad not to re-use the process between executions, by adding Util.NewProcess=true; to your script.

Infinispan clustered lock performance does not improve with more nodes?

I have a piece of code that is essentially executing the following with Infinispan in embedded mode, using version 13.0.0 of the -core and -clustered-lock modules:
#Inject
lateinit var lockManager: ClusteredLockManager
private fun getLock(lockName: String): ClusteredLock {
lockManager.defineLock(lockName)
return lockManager.get(lockName)
}
fun createSession(sessionId: String) {
tryLockCounter.increment()
logger.debugf("Trying to start session %s. trying to acquire lock", sessionId)
Future.fromCompletionStage(getLock(sessionId).lock()).map {
acquiredLockCounter.increment()
logger.debugf("Starting session %s. Got lock", sessionId)
}.onFailure {
logger.errorf(it, "Failed to start session %s", sessionId)
}
}
I take this piece of code and deploy it to kubernetes. I then run it in six pods distributed over six nodes in the same region. The code exposes createSession with random Guids through an API. This API is called and creates sessions in chunks of 500, using a k8s service in front of the pods which means the load gets balanced over the pods. I notice that the execution time to acquire a lock grows linearly with the amount of sessions. In the beginning it's around 10ms, when there's about 20_000 sessions it takes about 100ms and the trend continues in a stable fashion.
I then take the same code and run it, but this time with twelve pods on twelve nodes. To my surprise I see that the performance characteristics are almost identical to when I had six pods. I've been digging in to the code but still haven't figured out why this is, I'm wondering if there's a good reason why infinispan here doesn't seem to perform better with more nodes?
For completeness the configuration of the locks are as follows:
val global = GlobalConfigurationBuilder.defaultClusteredBuilder()
global.addModule(ClusteredLockManagerConfigurationBuilder::class.java)
.reliability(Reliability.AVAILABLE)
.numOwner(1)
and looking at the code the clustered locks is using DIST_SYNC which should spread out the load of the cache onto the different nodes.
UPDATE:
The two counters in the code above are simply micrometer counters. It is through them and prometheus that I can see how the lock creation starts to slow down.
It's correctly observed that there's one lock created per session id, this is per design what we'd like. Our use case is that we want to ensure that a session is running in at least one place. Without going to deep into detail this can be achieved by ensuring that we at least have two pods that are trying to acquire the same lock. The Infinispan library is great in that it tells us directly when the lock holder dies without any additional extra chattiness between pods, which means that we have a "cheap" way of ensuring that execution of the session continues when one pod is removed.
After digging deeper into the code I found the following in CacheNotifierImpl in the core library:
private CompletionStage<Void> doNotifyModified(K key, V value, Metadata metadata, V previousValue,
Metadata previousMetadata, boolean pre, InvocationContext ctx, FlagAffectedCommand command) {
if (clusteringDependentLogic.running().commitType(command, ctx, extractSegment(command, key), false).isLocal()
&& (command == null || !command.hasAnyFlag(FlagBitSets.PUT_FOR_STATE_TRANSFER))) {
EventImpl<K, V> e = EventImpl.createEvent(cache.wired(), CACHE_ENTRY_MODIFIED);
boolean isLocalNodePrimaryOwner = isLocalNodePrimaryOwner(key);
Object batchIdentifier = ctx.isInTxScope() ? null : Thread.currentThread();
try {
AggregateCompletionStage<Void> aggregateCompletionStage = null;
for (CacheEntryListenerInvocation<K, V> listener : cacheEntryModifiedListeners) {
// Need a wrapper per invocation since converter could modify the entry in it
configureEvent(listener, e, key, value, metadata, pre, ctx, command, previousValue, previousMetadata);
aggregateCompletionStage = composeStageIfNeeded(aggregateCompletionStage,
listener.invoke(new EventWrapper<>(key, e), isLocalNodePrimaryOwner));
}
The lock library uses a clustered Listener on the entry modified event, and this one uses a filter to only notify when the key for the lock is modified. It seems to me the core library still has to check this condition on every registered listener, which of course becomes a very big list as the number of sessions grow. I suspect this to be the reason and if it is it would be really really awesome if the core library supported a kind of key filter so that it could use a hashmap for these listeners instead of going through a whole list with all listeners.
I believe you are creating a clustered lock per session id. Is this what you need ? what is the acquiredLockCounter? We are about to deprecate the "lock" method in favour of "tryLock" with timeout since the lock method will block forever if the clustered lock is never acquired. Do you ever unlock the clustered lock in another piece of code? If you shared a complete reproducer of the code will be very helpful for us. Thanks!

Query process instances based on starting message name

ENV: camunda 7.4, BPMN 2.0
Given a process, which can be started by multiple start message events.
is it possible to query process instances started by specific messages identified by message name?
if yes, how?
if no, why?
if not at the moment, when?
Some APIs like IncidentMessages?
That is no out-of-the-box feature but should be easy to build by using process variables.
The basic steps are:
1. Implement an execution listener that sets the message name as a variable:
public class MessageStartEventListener implements ExecutionListener {
public void notify(DelegateExecution execution) throws Exception {
execution.setVariable("startMessage", "MessageName");
}
}
Note that via DelegateExecution#getBpmnModelElementInstance you can access the BPMN element that the listener is attached to, so you could determine the message name dynamically.
2. Declare the execution listener at the message start events:
<process id="executionListenersProcess">
<startEvent id="theStart">
<extensionElements>
<camunda:executionListener
event="start" class="org.camunda.bpm.examples.bpmn.executionlistener.MessageStartEventListener" />
</extensionElements>
<messageEventDefinition ... />
</startEvent>
...
</process>
Note that with a BPMN parse listener, you can add such a listener programmatically to every message start event in every process definition. See this example.
3. Make a process instance query filtering by that variable
RuntimeService runtimeService = processEngine.getRuntimeService();
List<ProcessInstance> matchingInstances = runtimeService
.createProcessInstanceQuery()
.variableValueEquals("startMessage", "MessageName")
.list();

Time out for test cases in googletest

Is there a way in gtest to have a timeout for inline/test cases or even tests.
For example I would like to do something like:
EXPECT_TIMEOUT(5 seconds, myFunction());
I found this issue googletest issues as 'Type:Enhancement' from Dec 09 2010.
https://code.google.com/p/googletest/issues/detail?id=348
Looks like there is no gtest way from this post.
I am probably not the first to trying to figure out a way for this.
The only way I can think is to make a child thread run the function, and if it does not return by the
time limit the parent thread will kill it and show timeout error.
Is there any way where you don't have to use threads?
Or any other ways?
I just came across this situation.
I wanted to add a failing test for my reactor. The reactor never finishes. (it has to fail first). But I don't want the test to run forever.
I followed your link but still not joy there. So I decided to use some of the C++14 features and it makes it relatively simple.
But I implemented the timeout like this:
TEST(Init, run)
{
// Step 1 Set up my code to run.
ThorsAnvil::Async::Reactor reactor;
std::unique_ptr<ThorsAnvil::Async::Handler> handler(new TestHandler("test/data/input"));
ThorsAnvil::Async::HandlerId id = reactor.registerHandler(std::move(handler));
// Step 2
// Run the code async.
auto asyncFuture = std::async(
std::launch::async, [&reactor]() {
reactor.run(); // The TestHandler
// should call reactor.shutDown()
// when it is finished.
// if it does not then
// the test failed.
});
// Step 3
// DO your timeout test.
EXPECT_TRUE(asyncFuture.wait_for(std::chrono::milliseconds(5000)) != std::future_status::timeout);
// Step 4
// Clean up your resources.
reactor.shutDown(); // this will allow run() to exit.
// and the thread to die.
}
Now that I have my failing test I can write the code that fixes the test.

Cascading S3 Sink Tap not being deleted with SinkMode.REPLACE

We are running Cascading with a Sink Tap being configured to store in Amazon S3 and were facing some FileAlreadyExistsException (see [1]).
This was only from time to time (1 time on around 100) and was not reproducable.
Digging into the Cascading codem, we discovered the Hfs.deleteResource() is called (among others) by the BaseFlow.deleteSinksIfNotUpdate().
Btw, we were quite intrigued with the silent NPE (with comment "hack to get around npe thrown when fs reaches root directory").
From there, we extended the Hfs tap with our own Tap to add more action in the deleteResource() method (see [2]) with a retry mechanism calling directly the getFileSystem(conf).delete.
The retry mechanism seemed to bring improvement, but we are still sometimes facing failures (see example in [3]): it sounds like HDFS returns isDeleted=true, but asking directly after if the folder exists, we receive exists=true, which should not happen. Logs also shows randomly isDeleted true or false when the flow succeeds, which sounds like the returned value is irrelevant or not to be trusted.
Can anybody bring his own S3 experience with such a behavior: "folder should be deleted, but it is not"? We suspect a S3 issue, but could it also be in Cascading or HDFS?
We run on Hadoop Cloudera-cdh3u5 and Cascading 2.0.1-wip-dev.
[1]
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory s3n://... already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:132)
at com.twitter.elephantbird.mapred.output.DeprecatedOutputFormatWrapper.checkOutputSpecs(DeprecatedOutputFormatWrapper.java:75)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:923)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:882)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:882)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:856)
at cascading.flow.hadoop.planner.HadoopFlowStepJob.internalNonBlockingStart(HadoopFlowStepJob.java:104)
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:174)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.j
[2]
#Override
public boolean deleteResource(JobConf conf) throws IOException {
LOGGER.info("Deleting resource {}", getIdentifier());
boolean isDeleted = super.deleteResource(conf);
LOGGER.info("Hfs Sink Tap isDeleted is {} for {}", isDeleted,
getIdentifier());
Path path = new Path(getIdentifier());
int retryCount = 0;
int cumulativeSleepTime = 0;
int sleepTime = 1000;
while (getFileSystem(conf).exists(path)) {
LOGGER
.info(
"Resource {} still exists, it should not... - I will continue to wait patiently...",
getIdentifier());
try {
LOGGER.info("Now I will sleep " + sleepTime / 1000
+ " seconds while trying to delete {} - attempt: {}",
getIdentifier(), retryCount + 1);
Thread.sleep(sleepTime);
cumulativeSleepTime += sleepTime;
sleepTime *= 2;
} catch (InterruptedException e) {
e.printStackTrace();
LOGGER
.error(
"Interrupted while sleeping trying to delete {} with message {}...",
getIdentifier(), e.getMessage());
throw new RuntimeException(e);
}
if (retryCount == 0) {
getFileSystem(conf).delete(getPath(), true);
}
retryCount++;
if (cumulativeSleepTime > MAXIMUM_TIME_TO_WAIT_TO_DELETE_MS) {
break;
}
}
if (getFileSystem(conf).exists(path)) {
LOGGER
.error(
"We didn't succeed to delete the resource {}. Throwing now a runtime exception.",
getIdentifier());
throw new RuntimeException(
"Although we waited to delete the resource for "
+ getIdentifier()
+ ' '
+ retryCount
+ " iterations, it still exists - This must be an issue in the underlying storage system.");
}
return isDeleted;
}
[3]
INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] at least one sink is marked for delete
INFO [pool-2-thread-15] (BaseFlow.java:1287) - [...] sink oldest modified date: Wed Dec 31 23:59:59 UTC 1969
INFO [pool-2-thread-15] (HiveSinkTap.java:148) - Now I will sleep 1 seconds while trying to delete s3n://... - attempt: 1
INFO [pool-2-thread-15] (HiveSinkTap.java:130) - Deleting resource s3n://...
INFO [pool-2-thread-15] (HiveSinkTap.java:133) - Hfs Sink Tap isDeleted is true for s3n://...
ERROR [pool-2-thread-15] (HiveSinkTap.java:175) - We didn't succeed to delete the resource s3n://... Throwing now a runtime exception.
WARN [pool-2-thread-15] (Cascade.java:706) - [...] flow failed: ...
java.lang.RuntimeException: Although we waited to delete the resource for s3n://... 0 iterations, it still exists - This must be an issue in the underlying storage system.
at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:179)
at com.qubit.hive.tap.HiveSinkTap.deleteResource(HiveSinkTap.java:40)
at cascading.flow.BaseFlow.deleteSinksIfNotUpdate(BaseFlow.java:971)
at cascading.flow.BaseFlow.prepare(BaseFlow.java:733)
at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:761)
at cascading.cascade.Cascade$CascadeJob.call(Cascade.java:710)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
First, double check the Cascading compatibility page for supported distributions.
http://www.cascading.org/support/compatibility/
Note Amazon EMR is listed as they periodically run the compatibility tests and report the results back.
Second, S3 is an eventually consistent filesystem. HDFS is not. So assumptions about the behavior of HDFS don't carry over to storing data against S3. For example, a rename is really a copy and delete. Where the copy can take hours. Amazon has patched their internal distribution to accommodate many of the differences.
Third, there are no directories in S3. It is a hack and supported differently by different S3 interfaces (jets3t vs s3cmd vs ...). This is bound to be problematic considering the prior point.
Fourth, network latency and reliability are critical, especially when communicating to S3. Historically I've found the Amazon network to be better behaved when manipulating massive datasets on S3 when using EMR vs standard EC2 instances. I also believe their is a patch in EMR that improves matters here as well.
So I'd suggest try running the EMR Apache Hadoop distribution to see if your issues clear up.
When running any jobs on Hadoop that use files in S3, the nuances of eventual consistency must be kept in mind.
I've helped troubleshoot many apps which turned out to have similar race conditions for delete as their root issue -- whether they were in Cascading or Hadoop streaming or written directly in Java.
There was discussion at one point of having notifications from S3 after a given key/value pair had been fully deleted. I haven't kept up on where that feature stood. Otherwise, it's probably best to design systems -- again, whether in Cascading or any other app that uses S3 -- such that data which is consumed or produced by a batch workflow gets managed in HDFS or HBase or a key/value framework (e.g., have used Redis for this). Then S3 gets used for durable storage, but not for intermediate data.