Log messages classification/grouping and finding human readable pattern for each group - tensorflow

As someone new to data science and machine learning, I would like to ask the following questions about the problem explained below:
Is machine learning a good fit for such a problem, or is it overkill?
Is this problem related to a classical problem with published papers, so that I can choose the right solution?
The problem:
I've been researching a pretty interesting problem that I believe many analytics systems have already solved with an automated process.
We collect many JavaScript error messages that occur in all kinds of browsers and custom-built web applications. Our goal is to group similar messages and label each group with the common pattern the grouped messages share.
Example:
+---------------------------------------------------------------+
|Label: "Forbidden: User session {{placeholder1}} has expired." |
+---------------------------------------------------------------+
|Message: "Forbidden: User session aad3-1v299-4400 has expired."|
|Message: "Forbidden: User session jj41-1d333-bbaa has expired."|
|Message: "Forbidden: User session aab3-bn12n-1111 has expired."|
+---------------------------------------------------------------+
So far we have a semi-automated process that solves the problem, but from time to time we get new user-generated JavaScript error messages that slip through our filters.
I've been thinking about a naive two-step approach that uses existing libraries/tools/algorithms:
1. For a batch of error lines, run an algorithm (e.g. Levenshtein distance) that finds similar strings, and group the similar errors.
2. Within each group of similar strings, run a diff and find the dynamic parts. Check the diff:
For reference, here are error messages that were collected over a period of one minute:
Message: 3312445,Error: Unknown page "retina_list"
Message: 9931234,Error: Unknown page "widget_summary"
Message: ReferenceError: 'alg,TypeError: g' is undefined
Message: 522574,Error: Unknown page "page_options"
Message: ReferenceError: '297756| Zly / Error in handler for event:,[object Object],ApiListenerError: TypeError: a' is undefined
Message: [Euv warn]: style="width: {{item.evaluation}}em": interpolation in 'style' attribute will cause the attribute to be discarded in Internet Explorer. Use krt-bind:style instead. (found in component: <default-componentfalse2320383>)
Message: [Euv warn]: src="//www.example.com/image/{{item._id}}-1.jpg?w=220&h=165&mode=crop": interpolation in 'src' attribute will cause a 404 request. Use krt-bind:src instead. (found in component: <default-componentfalse8372912>)
Message: [Euv warn]: src="//www.example.com/image/{{item._id}}?car=recommend_sp312": interpolation in 'src' attribute will cause a 404 request. Use krt-bind:src instead. (found in component: <default-componentfalse3330736>)
Message: [Euv warn]: src="//www.example.com/image/{{item._id}}-1.jpg?w=220&h=165&mode=crop": interpolation in 'src' attribute will cause a 404 request. Use krt-bind:src instead. (found in component: <default-componentfalse4893336>)
Message: ReferenceError: 'alg,TypeError: g' is undefined
Message: 73276| Zly / Error in handler for event:,[object Object],ApiListenerError: TypeError: Cannot read property 'campaignName' of undefined
Message: ReferenceError: 'alg,TypeError: g' is undefined
Message: ReferenceError: 'bend,TypeError: f' is undefined
I've been playing lately with TensorFlow.js, where I am a complete beginner, but I may try to train something that could help me classify strings and label them.
I also think that generating the group label is a harder problem than grouping the strings, because sometimes a pair of similar strings have very different lengths and the placeholders are long sentences with special characters like \,".^%#&*!?<>|][{}.

As you have pointed out, it sounds like we can separate this problem into two distinct steps:
1. Group together similar messages, and
2. Label each group.
Step 1:
While I am not too familiar with Tensorflow JS, I do not believe it is overkill to use Machine Learning (ML) to tackle this problem, especially for step 1.
In fact, this type of problem is a great candidate for a specific form of ML known as Unsupervised Learning, and more specifically, Clustering. In Unsupervised Learning, we look to find “previously unknown patterns in our data without pre-existing labels”.
See: https://en.wikipedia.org/wiki/Unsupervised_learning
In this context, that means that we do not know if “Error Message 1” and “Error Message 2” will belong to the same group before we apply our Clustering algorithm. Using your example, we can reasonably suspect that the messages:
“Forbidden: User session aad3-1v299-4400 has expired”
“Forbidden: User session jj41-1d333-bbaa has expired”
will belong to the same group, but the Clustering algorithm does not know this when it starts.
We can contrast this with a form of Supervised Learning known as Classification, where we know beforehand that we expect a group to have the form
“Forbidden: User session {{placeholder1}} has expired”.
Then the pre-existing labels in the data are that messages such as
“Forbidden: User session aad3-1v299-4400 has expired”
“Forbidden: User session jj41-1d333-bbaa has expired”
belong to the expected group just above. We essentially give the ML model a bunch of examples of what this group looks like, and then incoming messages that appear to be similar will be classified to this group.
From your description, it sounds like for Step 1 you want to compute a string distance (such as Levenshtein) between all of the example messages, and then apply a Clustering algorithm to those distances. Once you have groups (clusters) of messages, Step 2 involves finding an appropriate label for each group.
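Step 1 could be sketched roughly as follows. This is only a minimal illustration under several assumptions: it uses Python's standard-library difflib.SequenceMatcher as a stand-in for a Levenshtein-based similarity, a simple greedy single-pass clustering (each message joins the first cluster whose first member is similar enough), and an arbitrary threshold of 0.6 that would need tuning on real data. The function names are made up for this example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1]; a stand-in for a normalized Levenshtein score."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_messages(messages, threshold=0.6):
    """Greedy single-pass clustering.

    Each message is compared with the first member ("leader") of every
    existing cluster and joins the first cluster that is similar enough;
    otherwise it starts a new cluster. The threshold is an assumption.
    """
    clusters = []
    for msg in messages:
        for cluster in clusters:
            if similarity(msg, cluster[0]) >= threshold:
                cluster.append(msg)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([msg])
    return clusters


messages = [
    "Forbidden: User session aad3-1v299-4400 has expired.",
    "Forbidden: User session jj41-1d333-bbaa has expired.",
    'Error: Unknown page "retina_list"',
    "Forbidden: User session aab3-bn12n-1111 has expired.",
    'Error: Unknown page "widget_summary"',
]
clusters = cluster_messages(messages)
for cluster in clusters:
    print(len(cluster), cluster[0])
```

A real clustering algorithm (hierarchical clustering or DBSCAN over a pairwise distance matrix, for instance) would be less sensitive to message order than this greedy pass, but the overall shape of the step stays the same.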
Step 2:
Agreed that finding an appropriate label for each group is likely the harder problem here. One approach that could be useful is to count how many times a word or phrase appears within a group or cluster, and if it does not meet some pre-defined threshold, to replace it with a placeholder as in your example label. For example, the words “Forbidden”, “User”, “session”, and “expired” will be common to the group, whereas the alphanumeric IDs listed are unique to the individual messages. If the threshold is that a word or phrase must show up in at least two messages, only the IDs will be replaced by the placeholder.
In this approach, you are essentially looking to find words or phrases that are uncommon to the group, and do not provide useful information in forming an appropriate label. In a way, this is the opposite of a concept used in many search engines that aims to find how common or important a word or phrase is to a document (see https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
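The thresholding idea for Step 2 might look something like the sketch below. It assumes whitespace tokenization, uses the first message of the group as the template, and treats a token as "common" when it appears in at least min_count messages of the group; the function name and the placeholder format are chosen here just to mirror the example label in the question.

```python
from collections import Counter


def label_for_group(messages, min_count=2):
    """Build a label by replacing rare tokens with numbered placeholders."""
    # Document frequency: in how many messages of the group each token occurs.
    doc_freq = Counter()
    for msg in messages:
        doc_freq.update(set(msg.split()))

    label_tokens = []
    placeholder_count = 0
    previous_was_placeholder = False
    # Use the first message as the template for the label.
    for token in messages[0].split():
        if doc_freq[token] >= min_count:
            label_tokens.append(token)
            previous_was_placeholder = False
        elif not previous_was_placeholder:
            # Collapse a run of rare tokens into a single placeholder.
            placeholder_count += 1
            label_tokens.append("{{placeholder%d}}" % placeholder_count)
            previous_was_placeholder = True
    return " ".join(label_tokens)


group = [
    "Forbidden: User session aad3-1v299-4400 has expired.",
    "Forbidden: User session jj41-1d333-bbaa has expired.",
    "Forbidden: User session aab3-bn12n-1111 has expired.",
]
label = label_for_group(group)
print(label)
```

Whitespace tokenization is the weak point of this sketch: when the dynamic part is glued to static text inside one token (as in the `<default-componentfalse2320383>` messages above), the whole token becomes a placeholder, so a finer tokenizer or a character-level diff would be needed for those groups.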

Why is VK_SAMPLE_COUNT_1_BIT an invalid choice for multisampling in Vulkan?

Hello people of StackOverflow,
I am currently working on a game engine using the Vulkan graphics API. In the past I was just setting anti-aliasing to the maximum supported value. Today, however, I was trying to turn it off (to improve performance on weaker systems). To do this I tried to set the MSAA sample count in my engine to VK_SAMPLE_COUNT_1_BIT, but this produced the validation error:
Validation Error: [ VUID-VkSubpassDescription-pResolveAttachments-00848 ] Object 0: handle = 0x55aaa6e32828, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0xfad6c3cb | ValidateCreateRenderPass(): Subpass 0 requests multisample resolve from attachment 0 which has VK_SAMPLE_COUNT_1_BIT. The Vulkan spec states: If pResolveAttachments is not NULL, for each resolve attachment that is not VK_ATTACHMENT_UNUSED, the corresponding color attachment must not have a sample count of VK_SAMPLE_COUNT_1_BIT (https://www.khronos.org/registry/vulkan/specs/1.2-extensions/html/vkspec.html#VUID-VkSubpassDescription-pResolveAttachments-00848)
I can work around this problem relatively easily, so it isn't really an issue for me. However, I was wondering why exactly this limit is in place. If I want to set the MSAA sample count to 1, why can't I?
Thanks,
sckzor
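A sample count of 1 means "not a multisampled image", and resolving from a non-multisampled image doesn't make sense: there is nothing to resolve. That is also why you can't use such images for anything else that expects a multisampled image (you can't use an MS-style sampler or texture function on them). As the quoted spec text implies, when MSAA is off you should leave pResolveAttachments NULL (or mark the resolve attachment VK_ATTACHMENT_UNUSED) and render directly to the single-sample color attachment.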

Using Optaplanner for VRPPD

I am trying to run the example "optaplanner-mixedvrp-experiment" developed by Geoffrey De Smet, and when I run it, it throws the following error:
Caused by: java.lang.IllegalStateException: The entity (MY) has a
variable (previousStandstill) with value (MUNO) which has a
sourceVariableName variable (nextVisit) with a value (WERBOMONT) which
is not null. Verify the consistency of your input problem for that
sourceVariableName variable.
I have not made any changes. I only cloned the project and ran it: I import a dataset, solve it, and it throws this error.
Do you know what could be happening?
I am applying it to the development of a VRP variant with multiple deliveries and pickups, and it throws the same error there. I have activated FULL_ASSERT mode, and nextVisit, previousStandstill and visitIndex are always null.
It's been a long time since I looked at that code, so it's using an old version of OptaPlanner. Our goal is still to clean it up and offer an out-of-the-box example for VRPPD (and probably remove some boilerplate along the way, using the upcoming #CollectionPlanningVariable etc). That being said, we have multiple users and customers who used that optaplanner-mixedvrp-experiment to successfully build VRPPD implementations.
Which dataset did you try?
FWIW, that IllegalStateException says that A.previous = B while B.next is not A. So either the dataset importer didn't import it correctly before calling solve() (especially likely if it fails before the first construction heuristic (CH) step in FULL_ASSERT), or one of the custom moves corrupted the model.
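To check the first possibility, the invariant can be verified on the imported data before calling solve(). The snippet below is not OptaPlanner code; it is a language-neutral sketch in Python with a hypothetical Standstill class whose previous_standstill and next_visit attributes mirror the previousStandstill and nextVisit variables from the error message. The same loop could be written in Java against the real domain classes.

```python
class Standstill:
    """Hypothetical stand-in for a depot or visit in the imported dataset."""

    def __init__(self, name):
        self.name = name
        self.previous_standstill = None  # mirrors previousStandstill
        self.next_visit = None           # mirrors nextVisit


def find_inconsistencies(standstills):
    """Return (A, B, B.next) name triples where A.previous is B but B.next is not A."""
    problems = []
    for a in standstills:
        b = a.previous_standstill
        if b is not None and b.next_visit is not a:
            next_name = b.next_visit.name if b.next_visit is not None else None
            problems.append((a.name, b.name, next_name))
    return problems


# Reproduce the situation described by the exception message.
my, muno, werbomont = Standstill("MY"), Standstill("MUNO"), Standstill("WERBOMONT")
my.previous_standstill = muno
muno.next_visit = werbomont  # should be MY (or unset) for a consistent chain

problems = find_inconsistencies([my, muno, werbomont])
print(problems)
```

If a check like this already reports problems right after the import, the importer is the culprit; if the data is clean at that point, the custom moves are the more likely suspect.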

How to get Exception source “Activity Description name”

When exceptions occur in a UiPath project, I have an email sent out with the exception info included. The issue is that I can only see where the error occurred by looking at the selector information, such as:
Cannot find the UI element corresponding to this selector:
<html app='chrome.exe' title='Microsoft Dynamics GP' />
<webctrl aaname='Add' idx='1'
parentid='a00000000000000008549000000030009000000000001000000000000' tag='DIV' />
Neither this info nor the stack trace nor anything else is really helpful for quickly finding the source of the problem. I have looked through the UiPath documentation and forum and found only this question, which seemed to point to using exception.Source to show the name of the activity where the error occurred. However, exception.Source only returns “UiPath.Core.Activities” instead of "Type into Copy Job# 'INPUT'" in the following example:
This obviously causes a big problem with exception handling. How can I easily return the source with each exception?
When your selector fails, you will end up with a new object of type UiPath.Core.SelectorNotFoundException. However, until the team at UiPath decides to add the Display Name into the inner exception, there is little you can do in this particular case.
Take the following example - the first line shows the Inner Exception, and the second one in red is essentially just the exception being rethrown. Note that only the latter one contains the Display Name property.
The Source itself will usually be of type UiPath.Core.Activities, but since this is just the type's name, we don't have any link to the faulting object. Here's what you can do:
1. Add some details to your exception. You don't want to do this for each activity, but you could have certain blocks of try-catches (example: logging into the system consists of three individual activities, and they reside in one block).
2. Rethrow the exception. That way the Display Name will end up in the execution log file.

NuPIC OPF Runtime error getOutputData unknown output categoriesOut

I'm trying to run a TemporalClassification model using OPF to recognize patterns from a stream. I've adjusted the model params so that it has two Sensor inputs: a ScalarEncoder and an SDRCategoryEncoder. The latter is marked as classifierOnly and is also set as the predictedField in inferences.
When trying to feed model with input data I get
RuntimeError: getOutputData unknown output 'categoriesOut' on region Classifier.
A NontemporalClassification model (only inferenceType changed) runs without this error.
I've found 6 occurrences of categoriesOut in the nupic code: https://github.com/numenta/nupic/search?utf8=%E2%9C%93&q=categoriesOut
And error arises in nupic/frameworks/opf/clamodel.py at line 558
classificationDist = classifier.getOutputData('categoriesOut')
Seems that ClassifierRegion in the network is not prepared properly to output data.
Can anyone explain why the classifier region doesn't have 'categoriesOut'? I guess there's a misconfiguration in my model params, but there were no errors or warnings during initialization of the model. Are there any mandatory parameters and assignments (other than those noted in the NuPIC documentation) necessary for a TemporalClassification model to run?
There are several types of classifier regions in NuPIC. You can find them in the nupic/regions folder. I checked the sources and found that 'categoriesOut' is in the outputs dict of the KNNClassifierRegion:
https://github.com/numenta/nupic/blob/469f6372082e95dd5d2a96181b745ba36d2e7a8a/nupic/regions/KNNClassifierRegion.py
outputs=dict(
    categoriesOut=dict(
        description='A vector representing, for each category '
                    'index, the likelihood that the input to the node belongs '
                    'to that category based on the number of neighbors of '
                    'that category that are among the nearest K.',
        dataType='Real32',
        count=0,
        regionLevel=True,
        isDefaultOutput=True),
Ensure you use KNNClassifierRegion when configuring your TemporalClassification model. Samples for NontemporalClassification use CLAClassifier, but CLAClassifierRegion has no categoriesOut in its outputs, so the error described in your question will arise if you keep
'regionName' : 'CLAClassifierRegion'
for TemporalClassification model.

Handling error after aggregation

I am reading some lines from a CSV file, converting them to business objects, aggregating these into batches and passing the resulting aggregates to a bean, which may throw a PersistenceException.
Something like this:
from(file:inputdir).split().tokenize("\n").bean(a).aggregate(constant(true), new AbstractListAggregationStrategy(){...}).completionSize(3).bean(b)
I have an onException(Exception.class).handled(true).to("file:failuredir").log(). If an exception occurs in bean(a), everything is handled as expected: the offending lines from inputdir/input.csv are written to failuredir/input.csv.
Now if bean(b) fails, Camel seems to fail reconstructing the original message:
message.org.apache.camel.component.file.GenericFileOperationFailedException: Cannot store file: target/failure/ID-myhostname-34516-1372093690069-0-7
Having tried various attempts to get this working, like using HawtDBAggregationRepository, toggling useOriginalMessage at onException and propagating back the exception in my AggregationStrategy, I am out of ideas.
How can I achieve the same behaviour for bean(b) which can be seen with bean(a)?
The aggregator is a stateful EIP, so when it sends out a message, that message is a new Exchange. As a result, bean(b) cannot get access to the original message that came from the file route.