I have got a large set of data. The data has 13 parameters and those parameters depend on each other and the dependency is established by some rules.
Example:- If say parameter_one is "A", and parameter_two is "B", and there is rule stating that parameter_one==A and parameter_two==B=>parameter_three==C, then parameter_three should be C(ideally). So, basically it's a lot of if/else statements.
Now, I just have the data, and we have to make the machine learning model learn the rules, so that whenever any data comes which doesn't obeys the rules:- as in above example, if parameter_three would have been 'D' instead of 'C', then it's a violation of the rule. How can I make the model learn these rules?
Also, the rules can't be written manually since there are a lot of rules and it's not scaling.
My try
I thought of using an autoencoder and pass the training data through it. After that for each data, we would use the reconstruction loss to check if it's a violation case or not. However, it's overfitting and not working well on test data.
I have previously tried to use deep neural network also, but it was not helping there. Can anyone help me out here?
Thanks in advance.
You could use Association Rule Mining algorithms like Apriori or FP-Growth to generate the frequent item sets.
From the frequent item sets you can generate Association rules.
Once you have the association rules, you can assign a weight to each rule (or use some parameter like confidence/lift of the rule).
When you want to test it on a new data entry, do weighted summing (if the new entry satisfies a rule, use the rule's weight to calculate the score/sum for the new entry).
If the generated score for the new entry is greater than a chosen threshold, you can say the new entry passes the preset rules otherwise it's in violation of the rules.
Weighted summing gives you flexibility to assign importance to association rules. You can also do this, if new entry does not satisfy even one of the association rules, then it is in violation of preset rules.
Related
I am working on an optimization model using AnyLogic. Is there a way to specify an array of decision variables in AnyLogic like how it is in IBM Cplex? For lesser number of decision variables (say 2 to 5), I used to specify them individually, for example, numAgents_1, numAgents_2 for locations 1 and 2. However, as my model grows in size and more locations are added (up to 40), is there a way I can specify them as an array or list of decision variables?
Any help regarding this would be really useful. Thanks.
Yes, but you need to use a "custom experiment" instead and set it up using an Array of decision variables.
This is not totally straight forward, however, best start by checking the example models that apply custom experiments.
Some starting points below:
I was able to find a few, but I was wondering, is there more algorithms that based on data encoding/modification instead of complete encryption of it. Examples that I found:
Steganography. The method is based on hiding a message within a message;
Tokenization. Data is mapped in the tokenization server to a random token that represents the real data outside of the server;
Data perturbation. As far as I know it works mostly with databases. Adds noise to the sensitive records yet allows to read general and public fields, like sum of the records on a specific day.
Are there any other methods like this?
If your purpose is to publish this data there are other methods similars to data perturbation, its called Data Anonymization [source]:
Data masking—hiding data with altered values. You can create a mirror
version of a database and apply modification techniques such as
character shuffling, encryption, and word or character substitution.
For example, you can replace a value character with a symbol such as
“*” or “x”. Data masking makes reverse engineering or detection
impossible.
Pseudonymization—a data management and de-identification method that
replaces private identifiers with fake identifiers or pseudonyms, for
example replacing the identifier “John Smith” with “Mark Spencer”.
Pseudonymization preserves statistical accuracy and data integrity,
allowing the modified data to be used for training, development,
testing, and analytics while protecting data privacy.
Generalization—deliberately removes some of the data to make it less
identifiable. Data can be modified into a set of ranges or a broad
area with appropriate boundaries. You can remove the house number in
an address, but make sure you don’t remove the road name. The purpose
is to eliminate some of the identifiers while retaining a measure of
data accuracy.
Data swapping—also known as shuffling and permutation, a technique
used to rearrange the dataset attribute values so they don’t
correspond with the original records. Swapping attributes (columns)
that contain identifiers values such as date of birth, for example,
may have more impact on anonymization than membership type values.
Data perturbation—modifies the original dataset slightly by applying techniques that round numbers and add random noise. The range
of values needs to be in proportion to the perturbation. A small base
may lead to weak anonymization while a large base can reduce the
utility of the dataset. For example, you can use a base of 5 for
rounding values like age or house number because it’s proportional to
the original value. You can multiply a house number by 15 and the
value may retain its credence. However, using higher bases like 15 can
make the age values seem fake.
Synthetic data—algorithmically manufactured information that has no
connection to real events. Synthetic data is used to create artificial
datasets instead of altering the original dataset or using it as is
and risking privacy and security. The process involves creating
statistical models based on patterns found in the original dataset.
You can use standard deviations, medians, linear regression or other
statistical techniques to generate the synthetic data.
Is this what are you looking for?
EDIT: added link to the source and quotation.
I am new to Doc2Vec, please bear with the naive questions.
I have generated Doc2vector score i.e. using the 'Paragraph Vector' algorithm.
I have an array output for each document.
I use the model.similar for doc1 and get the output - doc5 and doc10 are similar to doc1.
Q1) How to summarize using the code what are the important words or high-level summary this document holds?
In addition, If I use the array output and run K- means to get 5 clusters. How to define the cluster definition.
Q2) I can read the documents but the number of documents is very high and doing a manual read to find the cluster definition is not possible.
There's no built-in 'summarization' function for Doc2Vec doc-vectors (or clusters of same).
Theoretically, the model could do something that's sort-of the opposition of doc-vector inference. It could take a doc-vector – perhaps one corresponding to a existing document – and then provide it to the model, run the model "forward", and read out the activation levels of all its output nodes. At least in models using the default negative-sampling, those nodes map one-to-one with known vocabulary words, and you could plausibly sort/scale those activation levels to find the top-N "most-associated" words with that doc-vector.
You could look at the predict_output_word() method source of Word2Vec to get a rough idea of how such a calculation could work:
https://github.com/RaRe-Technologies/gensim/blob/3514d3fb9224280edd8ddd14c46b722220df5436/gensim/models/word2vec.py#L1131
As mentioned, this isn't an existing capability, and I don't know of an online source for code to do such a calculation. But, if it were implemented, it would be a welcome contribution.
(I'm not sure what your Q2 question actually is.)
In a vrp problem, I want to employ different probability distribitions for all moveSelectors depending on where in the chain the planning is being done. More concretely, I want to employ a block distribution for the first entity in the chain and parabolic for anywhere else in the (same) chain.
Now, I could configure identical moves one with block distribution and one with parabolic, but this is going to get cluttered very quick. So, instead I am wondering what will happen if I state that in the implemented NearbyDistanceMeter the distance is 0 if it is the first entity in the chain and a value > 0 if it is not the first entity. Will this work as intended ?
It will not.
The NearbyDistanceMeter should be idempotent (give the same result when called twice), regardless of the state of the planning variables.
In fact, it's called & cached before solving really starts.
I successfully amended the nice CloudBalancing example to include the fact that I may only have a limited number of computers open at any given time (thanx optaplanner team - easy to do). I believe this is referred to as a bounded-space problem. It works dandy.
The processes come in groupwise, say 20 processes in a given order per group. I would like to amend the example to have optaplanner also change the order of these groups (not the processes within one group). I have therefore added a class ProcessGroup in the domain with a member List<Process>, the instances of ProcessGroup being stored in a List<ProcessGroup>. The desired optimisation would shuffle the members of this List, causing the instances of ProcessGroup to be placed at different indices of the List List<ProcessGroup>. The index of ProcessGroup should be ProcessGroup.index.
The documentation states that "if in doubt, the planning entity is the many side of the many-to-one relationsship." This would mean that ProcessGroup is the planning entity, the member index being a planning variable, getting assigned to (hopefully) different integers. After every new assignment of indices, I would have to resort the list List<ProcessGroup in ascending order of ProcessGroup.index. This seems very odd and cumbersome. Any better ideas?
Thank you in advance!
Philip.
The current design has a few disadvantages:
It requires 2 (genuine) entity classes (each with 1 planning variable): probably increases search space (= longer to solve, more difficult to find a good or even feasible solution) + it increases configuration complexity. Don't use multiple genuine entity classes if you can avoid it reasonably.
That Integer variable of GroupProcess need to be all different and somehow sequential. That smelled like a chained planning variable (see docs about chained variables and Vehicle Routing example), in which case the entire problem could be represented as a simple VRP with just 1 variable, but does that really apply here?
Train of thought: there's something off in this model:
ProcessGroup has in Integer variable: What does that Integer represent? Shouldn't that Integer variable be on Process instead? Are you ordering Processes or ProcessGroups? If it should be on Process instead, then both Process's variables can be replaced by a chained variable (like VRP) which will be far more efficient.
ProcessGroup has a list of Processes, but that a problem property: which means it doesn't change during planning. I suspect that's correct for your use case, but do assert it.
If none of the reasoning above applies (which would surprise me) than the original model might be valid nonetheless :)