Is it possible to access SCIP's statistics output values directly from the PySCIPOpt Model object?

I'm using SCIP to solve MILPs in Python via PySCIPOpt. After solving a problem, the solver statistics can either be 1) printed as a string using printStatistics(), or 2) saved to an external file using writeStatistics(). For example:
import pyscipopt as pso
model = pso.Model()
model.addVar(name="x", obj=1)
model.optimize()
model.printStatistics()
model.writeStatistics(filename="stats.txt")
There's a lot of information in the printStatistics/writeStatistics output that doesn't seem to be accessible from the Python model object directly (e.g. the primal-dual integral, data for individual branching rules or primal heuristics, etc.). It would be helpful to be able to extract this data via, e.g., attributes of the model object or a dictionary.
Is there any way to access this information from the model object without having to parse the raw text/file output?

PySCIPOpt does not provide access to the statistics directly. The data for the various tables (e.g. separators, presolvers, etc.) are stored separately for every single plugin in SCIP and are sometimes not straightforward to collect.
If you are only interested in certain statistics about the general solving process, then you might want to add PySCIPOpt wrappers for a few of the simple get functions defined in scip_solvingstats.c.
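For example, a few general statistics already have wrappers on the Model object; a minimal sketch (the exact set of getters may vary by PySCIPOpt version):
import pyscipopt as pso

model = pso.Model()
model.addVar(name="x", obj=1)
model.optimize()

# General solving statistics exposed directly by PySCIPOpt:
print(model.getSolvingTime())  # time spent solving, in seconds
print(model.getNNodes())       # number of branch-and-bound nodes processed
print(model.getGap())          # relative primal-dual gap
print(model.getStatus())       # solve status, e.g. "optimal"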
Lastly, you might want to check out IPET for parsing the statistics output.

Is it possible in my case to implement a strategy pattern with different semantics of algorithms?

I have several classes that literally convert one data model to another.
Different classes use different versions of the data model.
Here is my setup: I have three converters (so far) and two algorithms that convert one data model to another, but each algorithm works for a different version of the data model. For example, AlgoVerOne works for an older version of the data model, while AlgoVer2 works for a newer version that contains more (or less) information.
What matters is that ConverterA and ConverterB use the same version of the data model, so the conversion algorithm is exactly the same because the data-model versions do not differ.
PROBLEM
My problem is that the semantics of some parts differ between these two classes. Say there is an element in the data model with a value of 100. This value can be converted and inserted into the other data model, because both classes use the same version of it, but a value of 100 means "car" for ConverterA and "bus" for ConverterB.
So the algorithm needed to convert one data model to another is the same, but the value of an element within that data model is semantically different for the two classes.
I don't want to write a completely separate algorithm for each class, because only about 1% of the semantics of the whole data model differs.
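A minimal sketch of the structure being described, in Python with hypothetical names: one algorithm per data-model version, with the differing value semantics injected as a strategy.
class SemanticsA:
    """Hypothetical value semantics for ConverterA: 100 means 'car'."""
    def interpret(self, value):
        return {100: "car"}.get(value, value)

class SemanticsB:
    """Hypothetical value semantics for ConverterB: 100 means 'bus'."""
    def interpret(self, value):
        return {100: "bus"}.get(value, value)

class AlgoVer2:
    """One conversion algorithm per data-model version; semantics are injected."""
    def __init__(self, semantics):
        self.semantics = semantics

    def convert(self, source_model):
        # Hypothetical conversion: reinterpret each element's value.
        return [self.semantics.interpret(v) for v in source_model]

converter_a = AlgoVer2(SemanticsA())  # plays the role of ConverterA
converter_b = AlgoVer2(SemanticsB())  # plays the role of ConverterB
print(converter_a.convert([100]))  # ['car']
print(converter_b.convert([100]))  # ['bus']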

How do I add a new feature column to a tf.data.Dataset object?

I am building an input pipeline for proprietary data using TensorFlow 2.0's data module and the tf.data.Dataset object to store my features. Here is my issue: the data source is a CSV file that has only 3 columns, a label column and two columns which just hold strings referring to the JSON files where the data is stored. I have developed functions that access all the data I need, and I am able to use Dataset's map function on the columns to get the data, but I don't see how I can add a new column to my tf.data.Dataset object to hold the new data. So if anyone could help with the following questions, it would really help:
How can a new feature be appended to a tf.data.Dataset object?
Should this process be done on the entire Dataset before iterating through it, or during (I think during iteration would allow utilization of the performance boost, but I don't know how this functionality works)?
I have all the methods for taking the input as the elements from the columns and performing everything required to get the features for each element; I just don't understand how to get this data into the dataset. I could do "hacky" workarounds, using a Pandas DataFrame as a "mediator" or something along those lines, but I want to keep everything within the TensorFlow Dataset and pipeline process, for both performance gains and higher-quality code.
I have looked through the TensorFlow 2.0 documentation for the Dataset class (https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset), but haven't been able to find a method that manipulates the structure of the object.
Here is the function I use to load the original dataset:
def load_dataset(self):
    # TODO: Function to get max number of available CPU threads
    dataset = tf.data.experimental.make_csv_dataset(self.dataset_path,
                                                    self.batch_size,
                                                    label_name='score',
                                                    shuffle_buffer_size=self.get_dataset_size(),
                                                    shuffle_seed=self.seed,
                                                    num_parallel_reads=1)
    return dataset
Then, I have methods which allow me to take a string input (column element) and return the actual feature data. And I am able to access the elements from the Dataset using a function like ".map". But how do I add that as a column?
Wow, this is embarrassing, but I have found the solution, and its simplicity literally makes me feel like an idiot for asking this. But I will leave the answer up in case anyone else ever faces this issue.
You first create a new tf.data.Dataset object using any function that returns a Dataset, such as ".map".
Then you create a new Dataset by zipping the original and the one with the new data:
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
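A self-contained sketch of that pattern (tf.strings.length here is just a stand-in for the real string-to-feature function, and the JSON paths are hypothetical):
import tensorflow as tf

# dataset1 stands in for the original column of file references.
dataset1 = tf.data.Dataset.from_tensor_slices(["a.json", "b.json", "c.json"])

# dataset2 holds the new feature, produced by mapping over the column.
dataset2 = dataset1.map(lambda path: tf.strings.length(path))

# Zipping pairs the old and new columns element-wise.
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
for path, feature in dataset3:
    print(path.numpy(), feature.numpy())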

Interpret the Doc2Vec Vectors Clusters Representation

I am new to Doc2Vec, please bear with the naive questions.
I have generated Doc2Vec vectors, i.e. using the 'Paragraph Vector' algorithm.
I have an array output for each document.
I use the model's similarity method for doc1 and get the output that doc5 and doc10 are similar to doc1.
Q1) How can I summarize, using code, the important words or high-level topics this document holds?
In addition, if I use the array output and run k-means to get 5 clusters, how do I characterize each cluster?
Q2) I can read the documents, but the number of documents is very high, and manually reading them to characterize each cluster is not possible.
There's no built-in 'summarization' function for Doc2Vec doc-vectors (or clusters of same).
Theoretically, the model could do something that's sort of the opposite of doc-vector inference. It could take a doc-vector – perhaps one corresponding to an existing document – provide it to the model, run the model "forward", and read out the activation levels of all its output nodes. At least in models using the default negative sampling, those nodes map one-to-one to known vocabulary words, and you could plausibly sort/scale those activation levels to find the top-N words "most associated" with that doc-vector.
You could look at the predict_output_word() method source of Word2Vec to get a rough idea of how such a calculation could work:
https://github.com/RaRe-Technologies/gensim/blob/3514d3fb9224280edd8ddd14c46b722220df5436/gensim/models/word2vec.py#L1131
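For illustration only, here is that word-level method on a toy corpus (this is not the doc-vector variant, which would still have to be written):
from gensim.models import Word2Vec

# Tiny toy corpus; predict_output_word() requires the default negative sampling.
sentences = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat", "slept"]]
model = Word2Vec(sentences, vector_size=16, min_count=1, epochs=50)

# "Forward pass": the most probable output words given some context words.
print(model.predict_output_word(["the", "cat"], topn=3))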
As mentioned, this isn't an existing capability, and I don't know of an online source for code to do such a calculation. But, if it were implemented, it would be a welcome contribution.
(I'm not sure what your Q2 question actually is.)
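Regarding characterizing clusters, one rough heuristic (not a summary) is to report the vocabulary words whose word-vectors lie nearest each k-means centroid. A sketch on a toy corpus; treat the output as a hint only, since word- and doc-vectors share a space only loosely in the default PV-DM mode:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

# Toy corpus purely for illustration; substitute your own documents.
texts = ["the cat sat", "dogs bark loudly", "cats purr softly",
         "the dog ran", "a cat slept"]
docs = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]
model = Doc2Vec(docs, vector_size=16, min_count=1, epochs=50)

# Cluster the document vectors (gensim 4.x exposes them as model.dv).
vectors = [model.dv[i] for i in range(len(docs))]
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

# Words nearest each cluster centroid, as a crude cluster label.
for c, centroid in enumerate(km.cluster_centers_):
    print(c, model.wv.similar_by_vector(centroid, topn=3))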

Elegantly handle samples with insufficient data in workflow?

I've set up a Snakemake pipeline for doing some simple QC and analysis on shallow shotgun metagenomics samples coming through our lab.
Some of the tools in the pipeline will fail or error when samples with low amounts of data are delivered as inputs -- but this is sometimes not knowable from the raw input data, as intermediate filtering steps (such as adapter trimming and host genome removal) can remove varying numbers of reads.
Ideally, I would like to be able to handle these cases with some sort of check on certain input rules, which could evaluate the number of reads in an input file and choose whether or not to continue with that portion of the workflow graph. Has anyone implemented something like this successfully?
I'm not aware of a way to stop the workflow based on a computation happening inside the workflow. The rules to be executed are determined from the final required output, and the run fails if this final output cannot be generated.
One approach could be to catch the particular tool failure (a try ... except construct in a run section, or return-code handling in a shell section) and generate a dummy output file for the corresponding rule, then have the downstream rules "propagate" dummy-file generation based on a test identifying the rule's input as such a dummy file.
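A minimal sketch of that dummy-file idea (rule names, tools, and paths are all hypothetical):
# Hypothetical filtering step: on tool failure, emit a dummy file instead.
rule filter_reads:
    input:
        "trimmed/{sample}.fastq"
    output:
        "filtered/{sample}.fastq"
    shell:
        "filter_tool {input} > {output} || echo DUMMY > {output}"

# Downstream rule: detect the dummy and propagate it instead of failing.
rule analyze:
    input:
        "filtered/{sample}.fastq"
    output:
        "results/{sample}.txt"
    run:
        with open(input[0]) as f:
            if f.readline().strip() == "DUMMY":
                with open(output[0], "w") as out:
                    out.write("DUMMY\n")
            else:
                shell("analysis_tool {input} > {output}")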
Another approach could be to pre-process the data outside of your snakemake workflow to determine which input to skip, and then use some filtering on the wildcards combinations as described here: https://stackoverflow.com/a/41185568/1878788.
I've been trying to find a solution to this issue as well.
Thus far I think I've identified a few potential solutions but have yet to be able to correctly implement them.
I use seqkit stats to quickly generate a text file and use the num_seqs column to filter with. You can write a quick pandas function to return a list of files that pass your threshold, and I use config.yaml to pass the minimum read threshold:
import pandas as pd

def get_passing_fastq_files(wildcards):
    # seqkit stats output is tab-separated; treat missing values as 0
    qc = pd.read_table('fastq.stats.txt').fillna(0)
    passing = list(qc[qc['num_seqs'] > config['minReads']]['file'])
    return passing
Trying to implement that as an input function in Snakemake has been an esoteric nightmare, to be honest; probably my own lack of nuanced understanding of the Wildcards object.
I think the use of a checkpoint is also necessary, to force Snakemake to recompute the DAG after filtering samples out. I haven't been able to connect all the dots yet, however, and I'm trying to avoid janky solutions that use token files etc.
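For what it's worth, here is a rough, untested sketch of how a checkpoint and an input function might fit together (sample names, paths, and the aggregate rule are hypothetical):
import pandas as pd

SAMPLES = ["s1", "s2"]  # hypothetical sample names

# A checkpoint forces Snakemake to re-evaluate the DAG once its output exists.
checkpoint qc_stats:
    input:
        expand("fastq/{sample}.fastq", sample=SAMPLES)
    output:
        "fastq.stats.txt"
    shell:
        "seqkit stats -T {input} > {output}"

def get_passing_fastq_files(wildcards):
    # Only safe to read after the checkpoint has produced its output.
    stats = checkpoints.qc_stats.get().output[0]
    qc = pd.read_table(stats).fillna(0)
    return list(qc[qc["num_seqs"] > config["minReads"]]["file"])

rule aggregate_passing:
    input:
        get_passing_fastq_files
    output:
        "results/passing_summary.txt"
    shell:
        "cat {input} > {output}"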

Strategies for handling nominal values with numerical attributes

I'm using a data set that consists of mostly nominal values from SFDC (e.g. EE Names, Title, Role, Lead Source, Account Name, etc.) and am trying to correlate the features to a boolean class of whether a Sales Lead was converted to a Sales Contact.
I wanted to run this data through some basic feature-selection algorithms, but most require numerical values only. I could map each of the unique classifications to a new field (feature) with a boolean mapping scheme, but then I'll generate an extremely large number of new features, and I'm not sure that will give a meaningful output. Admittedly the best solution might be to run the data through a decision tree, but I wanted to see whether there are other strategies the community has successfully used on real-world applications for data sets of mostly nominal data.
I'm using python with scipy/numpy/pandas/scikit-learn to do my analysis.
I would first try to use sklearn.feature_extraction.DictVectorizer and then try Chi2 univariate feature selection that can work with sparse data representations. For instance there is an application of chi2 feature selection on sparse text data here in scikit-learn: http://scikit-learn.org/dev/auto_examples/document_classification_20newsgroups.html
Unfortunately, scikit-learn's decision trees and ensembles do not work on sparse representations yet.
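A minimal sketch of that combination (the field names and records here are made up):
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Tiny illustrative records; substitute your SFDC fields.
records = [{"Title": "VP Sales", "LeadSource": "Web"},
           {"Title": "Engineer", "LeadSource": "Referral"},
           {"Title": "VP Sales", "LeadSource": "Referral"}]
y = [1, 0, 1]  # lead converted or not

vec = DictVectorizer()  # one-hot encodes nominal values into a sparse matrix
X = vec.fit_transform(records)

# chi2 works directly on the sparse, non-negative one-hot matrix.
selector = SelectKBest(chi2, k=2).fit(X, y)
kept = selector.get_support(indices=True)
# (older scikit-learn versions call this method get_feature_names())
print([vec.get_feature_names_out()[i] for i in kept])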