Rapidminer process failed message - text-mining

I use Rapidminer in text mining process
it work with naive well but with decision tree it give error message
the dataset contains about 50000 word,I use stem and stopword filtering
error message

Related

GNURadio Companion and OFDM TX and RX in single Graph

I am following this github example for understanding OFDM on gnuradio-companion, I am able to execute ofdm_tx individually (64 and 512 FFT point) without any issues, but when I connect these two in single graph, I am able to get spectrum from ofdm_tx (no output from ofdm_rx or getting straight line).
My question here, each time I close my output spectrum, my tool get hanged and in background (inside gnu-companion) I observe the following message tarin (attached, printscreen). Similar thing also observed when I run ofdm_rx individually.
Error message in Console :
packet_headerparser_b :info: Detected an invalid packet at item 1448.
header_payload_demux :info :parser returned #f
Please guide me in this regard,
by selecting "NO" for vector source "Repeat" variable , issue sorted out (no hang), but not able to see spectrum anymore.

Log messages classification/grouping and finding human readable pattern for each group

As new to data science and machine learning I would like to ask the following questions about the problem explained below:
Is machine learning good for such problem or is it overkill?
Could this problem be related with another classical problem that has already published papers so I can choose the right solution?
The problem:
I've been doing a research on pretty interesting problem that I believe many Analytics system solved by automated process.
We are collecting many JavaScript error messages that happen in all kind of browsers and custom build web applications. Our goal is to group the similar messages and label each group by the common pattern the grouped messages have.
Example:
+---------------------------------------------------------------+
|Label: "Forbidden: User session {{placeholder1}} has expired." |
+---------------------------------------------------------------+
|Message: "Forbidden: User session aad3-1v299-4400 has expired."|
|Message: "Forbidden: User session jj41-1d333-bbaa has expired."|
|Message: "Forbidden: User session aab3-bn12n-1111 has expired."|
+---------------------------------------------------------------+
So far we have semi-automated process that solves the problem but from time to time we get new user generated JavaScript error messages that slip through our filters.
I've been thinking about naive 2 step approach that uses existing libraries/tools/algorithms.
For a batch of error lines run an algorithm (e.g. Levenshtein) that finds similar strings. Group the similar errors.
Within a group of similar strings run a diff and find the dynamic parts. Check the diff:
For reference here we have error messages that were collected in the period of one minute:
Message: 3312445,Error: Unknown page "retina_list"
Message: 9931234,Error: Unknown page "widget_summary"
Message: ReferenceError: 'alg,TypeError: g' is undefined
Message: 522574,Error: Unknown page "page_options"
Message: ReferenceError: '297756| Zly / Error in handler for event:,[object Object],ApiListenerError: TypeError: a' is undefined
Message: [Euv warn]: style="width: {{item.evaluation}}em": interpolation in 'style' attribute will cause the attribute to be discarded in Internet Explorer. Use krt-bind:style instead. (found in component: <default-componentfalse2320383>)
Message: [Euv warn]: src="//www.example.com/image/{{item._id}}-1.jpg?w=220&h=165&mode=crop": interpolation in 'src' attribute will cause a 404 request. Use krt-bind:src instead. (found in component: <default-componentfalse8372912>)
Message: [Euv warn]: src="//www.example.com/image/{{item._id}}?car=recommend_sp312": interpolation in 'src' attribute will cause a 404 request. Use krt-bind:src instead. (found in component: <default-componentfalse3330736>)
Message: [Euv warn]: src="//www.example.com/image/{{item._id}}-1.jpg?w=220&h=165&mode=crop": interpolation in 'src' attribute will cause a 404 request. Use krt-bind:src instead. (found in component: <default-componentfalse4893336>)
Message: ReferenceError: 'alg,TypeError: g' is undefined
Message: 73276| Zly / Error in handler for event:,[object Object],ApiListenerError: TypeError: Cannot read property 'campaignName' of undefined
Message: ReferenceError: 'alg,TypeError: g' is undefined
Message: ReferenceError: 'bend,TypeError: f' is undefined
I've been playing lately with Tensorflow JS where I am complete beginner but I may try to train something that could help me classify strings and label them.
I also think that the more serious problem is to generate the group label than grouping the strings because sometimes a pair of similar strings have very different length and the placeholders are long sentences with special characters like \,".^%#&*!?<>|][{}.
As you have pointed out, it sounds like we can separate this problem into two distinct steps.
Group together similar messages, and
Label each group.
Step 1:
While I am not too familiar with Tensorflow JS, I do not believe it is overkill to use Machine Learning (ML) to tackle this problem, especially for step 1.
In fact, this type of problem is a great candidate for a specific form of ML known as Unsupervised Learning, and more specifically, Clustering. In Unsupervised Learning, we look to find “previously unknown patterns in our data without pre-existing labels”.
See: https://en.wikipedia.org/wiki/Unsupervised_learning
In this context, that means that we do not know if “Error Message 1” and “Error Message 2” will belong to the same group before we apply our Clustering algorithm. Using your example, we can reasonably suspect that the messages:
“Forbidden: User session aad3-1v299-4400 has expired"
“Forbidden: User session jj41-1d333-bbaa has expired"
will belong to the same group, but the Clustering algorithm does not know this when it starts.
We can contrast this with a form of Supervised Learning known as Classification, where we know beforehand that we expect a group to have the form
“Forbidden: User session {{placeholder1}} has expired".
Then the pre-existing labels in the data are that messages such as
“Forbidden: User session aad3-1v299-4400 has expired"
“Forbidden: User session jj41-1d333-bbaa has expired"
belong to the expected group just above. We essentially give the ML model a bunch of examples of what this group looks like, and then incoming messages that appear to be similar will be classified to this group.
It sounds like from your description that for Step 1, you want to perform a string match (such as Levenshtein) to compare all of the example messages, and then apply a Clustering algorithm to those results. Then after you have groups (clusters) of messages, Step 2 involves finding an appropriate label for each group.
Step 2:
Agreed that finding an appropriate label for each group is likely the harder problem here. One approach that could be useful is to count how many times a word or phrase appears within a group or cluster, and if it does not meet some pre-defined threshold, to use a placeholder as you have in your example label. For example, the words “Forbidden”, “User”, “session”, and “expired” will be common to the group, whereas the alpha numeric ID’s listed are unique to the individual messages. If the threshold is that a word or phrase must show up in at least two messages, only the ID’s will be replaced by the placeholder.
In this approach, you are essentially looking to find words or phrases that are uncommon to the group, and do not provide useful information in forming an appropriate label. In a way, this is the opposite of a concept used in many search engines that aims to find how common or important a word or phrase is to a document (see https://en.wikipedia.org/wiki/Tf%E2%80%93idf).

how can i handle syslog-ng parser failure when using a patterndb

We parse millions of messages a day using syslog-ng, and are in the process of implementing patterndb.
Due to inconsistency in how the messages are composed, in a small percentage of cases, my patterns are insufficient to capture the fields of the message (spacing is off, or sometimes a field is missing altogether).
How can I deal with these cases? Ideally, the parser entry in my log destination would evaluate to false (like a filter) and it would be captured by my fallback log destination.
Try setting drop-unmatched(yes) (needs syslog-ng OSE 3.11 or later):
parser pattern_db {
db-parser(
file("/opt/syslog-ng/var/db/patterndb.xml")
drop-unmatched(yes)
);
};
Also, recent syslog-ng versions have several different parsers that might be better for certain log messages than patterndb, for example, JSON and key=value parsers.

ADLA/U-SQL Error: Vertex user code error

I just have a simple U-SQL that extracts a csv using Extractors.Csv(encoding:Encoding.[Unicode]); and outputs into a lake store table. The file size is small around 600MB and is unicode type. The number of rows is 700K+
These are the columns:
UserId int,
Email string,
AltEmail string,
CreatedOn DateTime,
IsDeleted bool,
UserGuid Guid,
IFulfillmentContact bool,
IsBillingContact bool,
LastUpdateDate DateTime,
IsTermsOfUse string,
UserTypeId string
When I submit this job to my local, it works great without any issues. Once I submit it to ADLA, I get the following error:
Vertex failure triggered quick job abort. Vertex failed: SV1_Extract_Partition[0][0] with error: Vertex user code error.
Vertex failed with a fail-fast error
Vertex SV1_Extract_Partition[0][0].v1 {BA7B2378-597C-4679-AD69-07413A143E47} failed
Error:
Vertex user code error
exitcode=CsExitCode_StillActive Errorsnippet=An error occurred while processing adl://lakestore.azuredatalakestore.net/Data/User.csv
Any help is appreciated!
Since the file is larger than 250MB, you need to make sure that you upload it as a row-oriented file and not a binary file.
Also, please check the reply for the following question to see how you currently can find more details on the error: Debugging u-sql Jobs

NuPIC OPF Runtime error getOutputData unknown output categoriesOut

I'm trying to run TemporalClassification model using OPF to recognize patterns from stream. I've adjusted model params so it has two Sensor inputs: ScalarEncoder and SDRCategoryEncoder. The latter marked as classifierOnly. And also it's set as predictedField in inferences.
When trying to feed model with input data I get
RuntimeError: getOutputData unknown output 'categoriesOut' on region Classifier.
NontemporalClassification (only inferenceType changed) model runs without such error.
I've found 6 occurances of categoriesOut in nupic code: https://github.com/numenta/nupic/search?utf8=%E2%9C%93&q=categoriesOut
And error arises in nupic/frameworks/opf/clamodel.py at line 558
classificationDist = classifier.getOutputData('categoriesOut')
Seems that ClassifierRegion in the network is not prepared properly to output data.
Can anyone explain why the classfier region doesn't have 'categoriesOut'? I guess there's misconfiguration in my model params, but there were no errors or warnings during initialization of model. Is there any mandatory parameters and assignments (except noticed in NUPIC documentation) necessary for TemporalClassification model to run?
There are several types of ClassifierRegions in NuPIC. You can find them in nupic/regions folder. I've checked sources and found that 'categoriesOut' is in the outputs dict of the KNNClassifierRegion
https://github.com/numenta/nupic/blob/469f6372082e95dd5d2a96181b745ba36d2e7a8a/nupic/regions/KNNClassifierRegion.py
outputs=dict(
categoriesOut=dict(
description='A vector representing, for each category '
'index, the likelihood that the input to the node belongs '
'to that category based on the number of neighbors of '
'that category that are among the nearest K.',
dataType='Real32',
count=0,
regionLevel=True,
isDefaultOutput=True),
Ensure you use KNNClassifierRegion when configuring your TemporalClassification model. Samples for NontemporalClassification use CLAClassifier, but CLAClassifierRegion has no categoriesOut in its outputs and error described in your question will arise if you keep
'regionName' : 'CLAClassifierRegion'
for TemporalClassification model.