How do I add a new feature column to a tf.data.Dataset object? - tensorflow

I am building an input pipeline for proprietary data using TensorFlow 2.0's data module, and I am using the tf.data.Dataset object to store my features. Here is my issue: the data source is a CSV file that has only 3 columns, a label column and two columns which just hold strings referring to the JSON files where the data is actually stored. I have developed functions that access all the data I need, and I am able to use Dataset's map function on the columns to get the data, but I don't see how I can add a new column to my tf.data.Dataset object to hold the new data. So if anyone could help with the following questions, it would really help:
How can a new feature be appended to a tf.data.Dataset object?
Should this process be done on the entire Dataset before iterating through it, or during (I think during iteration would allow utilization of the performance boost, but I don't know how this functionality works)?
I have all the methods for taking the input as the elements from the columns and performing everything required to get the features for each element, I just don't understand how to get this data into the dataset. I could do "hacky" workarounds, using a Pandas Dataframe as a "mediator" or something along those lines, but I want to keep everything within the Tensorflow Dataset and pipeline process, for both performance gains and higher quality code.
I have looked through the Tensorflow 2.0 documentation for the Dataset class (https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset), but haven't been able to find a method that can manipulate the structure of the object.
Here is the function I use to load the original dataset:
def load_dataset(self):
    # TODO: Function to get max number of available CPU threads
    dataset = tf.data.experimental.make_csv_dataset(self.dataset_path,
                                                    self.batch_size,
                                                    label_name='score',
                                                    shuffle_buffer_size=self.get_dataset_size(),
                                                    shuffle_seed=self.seed,
                                                    num_parallel_reads=1)
    return dataset
Then, I have methods which allow me to take a string input (column element) and return the actual feature data. And I am able to access the elements from the Dataset using a function like ".map". But how do I add that as a column?

Wow, this is embarrassing, but I have found the solution, and its simplicity literally makes me feel like an idiot for asking this. But I will leave the answer up just in case anyone else is ever facing this issue.
You first create a new tf.data.Dataset object using any function that returns a Dataset, such as ".map".
Then you create a new Dataset by zipping the original and the one with the new data:
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
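For example, here is a minimal sketch of that pattern, assuming the dataset comes from the load_dataset method above and that one of the columns is a (hypothetical) 'json_path' string column; the feature-building function is just a placeholder for your own logic:
import tensorflow as tf

def build_new_feature(features, label):
    # Hypothetical placeholder: replace with your own logic that turns the
    # stored file-path string into the actual feature tensors you need.
    return tf.strings.length(features['json_path'])

dataset1 = self.load_dataset()          # the method defined earlier
dataset2 = dataset1.map(build_new_feature)
# Each element of dataset3 now pairs the original (features, label) structure
# with the new feature; both branches must iterate in the same order to stay aligned.
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))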

Related

Is it possible to access SCIP's Statistics output values directly from PyScipOpt Model Object?

I'm using SCIP to solve MILPs in Python using PyScipOpt. After solving a problem, the solver statistics can be either 1) printed as a string using printStatistics(), or 2) saved to an external file using writeStatistics(). For example:
import pyscipopt as pso
model = pso.Model()
model.addVar(name="x", obj=1)
model.optimize()
model.printStatistics()
model.writeStatistics(filename="stats.txt")
There's a lot of information in printStatistics/writeStatistics that doesn't seem to be accessible from the Python model object directly (e.g. the primal-dual integral value, data for individual branching rules or primal heuristics, etc.). It would be helpful to be able to extract the data from this output via, e.g., attributes of the model object or a dictionary.
Is there any way to access this information from the model object without having to parse the raw text/file output?
PySCIPOpt does not provide access to the statistics directly. The data for the various tables (e.g. separators, presolvers, etc.) are stored separately for every single plugin in SCIP and are sometimes not straightforward to collect.
If you are only interested in certain statistics about the general solving process, then you might want to add PySCIPOpt wrappers for a few of the simple get functions defined in scip_solvingstats.c.
Lastly, you might want to check out IPET for parsing the statistics output.
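If parsing the text output is acceptable in the meantime, a rough sketch might look like the following; the flat "name : value" layout is an assumption about the top-level statistics lines, and the indented sub-tables would need more careful handling:
import pyscipopt as pso

model = pso.Model()
model.addVar(name="x", obj=1)
model.optimize()
model.writeStatistics(filename="stats.txt")

stats = {}
with open("stats.txt") as f:
    for line in f:
        # Most top-level statistics lines look roughly like "Total Time    :    0.00"
        if ":" in line:
            key, _, value = line.partition(":")
            stats[key.strip()] = value.strip()

print(stats.get("Total Time"))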

Filter TabularDataset in Azure ML

My dataset is huge. I am using Azure ML notebooks and using azureml.core to read the dataset and convert it to an azureml.data.tabular_dataset.TabularDataset. Is there any way I could filter the data in the TabularDataset without converting it to a pandas DataFrame?
I am using the code below to read the data. As the data is huge, the pandas DataFrame runs out of memory. I don't have to load the complete data into the program; only a subset is required. Is there any way I could filter the records before converting to a pandas DataFrame?
def read_Dataset(dataset):
    ws = Workspace.from_config()
    ds = ws.datasets
    tab_dataset = ds.get(dataset)
    dataframe = tab_dataset.to_pandas_dataframe()
    return dataframe
At this point in time, we only support simple sampling, filtering by column name, and filtering by datetime (reference here). Full filtering capability (e.g. by column value) on TabularDataset is an upcoming feature in the next couple of months. We will update our public documentation once the feature is ready.
You can subset your data in two ways:
row-wise - use the TabularDataset class filter method
column-wise - use the TabularDataset class keep_columns method or drop_columns method
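For example, a minimal sketch of both approaches (the column names 'score' and 'feature_a' and the registered dataset name are hypothetical, and filter was still a preview feature at the time):
from azureml.core import Workspace

ws = Workspace.from_config()
tab_dataset = ws.datasets.get("my-dataset")  # hypothetical registered dataset name

# Column-wise: keep only the columns you actually need
subset = tab_dataset.keep_columns(['score', 'feature_a'])

# Row-wise: filter before anything is materialized in memory
subset = subset.filter(subset['score'] > 0.5)

# Only the reduced subset is pulled into pandas
dataframe = subset.to_pandas_dataframe()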
Hope this helps tackle the out-of-memory error.

Pyspark: use a customized function to store each row into a self-defined object, for example a node object

Is there a way to utilize the map function to store each row of the pyspark dataframe into a self-defined python class object?
[image: pyspark dataframe]
For example, in the picture above I have a Spark dataframe, and I want to store the id, features, and label of every row in a node object (with 3 attributes: node_id, node_features, and node_label). I am wondering if this is feasible in pyspark. I have tried something like
for row in df.rdd.collect():
    do_something(row)
but this cannot handle big data and is extremely slow. I am wondering if there is a more efficient way to resolve it. Much thanks.
You can use the foreach method for your operation. The operation will be parallelized by Spark.
Refer to Pyspark applying foreach if you need more details.
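A minimal sketch of the foreach approach (the Node class, the sample data, and the per-row work are hypothetical placeholders for your own code):
from pyspark.sql import SparkSession

class Node:
    def __init__(self, node_id, node_features, node_label):
        self.node_id = node_id
        self.node_features = node_features
        self.node_label = node_label

def process_row(row):
    # Runs on the executors, one call per row, in parallel across partitions.
    node = Node(row['id'], row['features'], row['label'])
    # ... do something with node here; note that objects created on the
    # executors are not sent back to the driver ...

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, [0.1, 0.2], 0), (2, [0.3, 0.4], 1)],
    ['id', 'features', 'label'])
df.foreach(process_row)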

Add dataset parameters into column to use them in BigQuery later with DataPrep

I am importing several files from Google Cloud Storage (GCS) through Google DataPrep and store the results in tables of Google BigQuery. The structure on GCS looks something like this:
//source/user/me/datasets/{month}/2017-01-31-file.csv
//source/user/me/datasets/{month}/2017-02-28-file.csv
//source/user/me/datasets/{month}/2017-03-31-file.csv
We can create a dataset with parameters as outlined on this page. This all works fine and I have been able to import it properly.
However, in this BigQuery table (output), I have no way of selecting only the rows that match a given parameter, for instance a particular month.
How could I therefore add these Dataset Parameters (here: {month}) into my BigQuery table using DataPrep?
While the original answers were true at the time of posting, there was an update rolled out last week that added a number of features not specifically addressed in the release notes—including another solution for this question.
In addition to SOURCEROWNUMBER() (which can now also be expressed as $sourcerownumber), there's now also a source metadata reference called $filepath—which, as you would expect, stores the local path to the file in Cloud Storage.
There are a number of caveats here, such as it not returning a value for BigQuery sources and not being available if you pivot, join, or unnest... but in your scenario, you could easily bring it into a column and do any needed matching or dropping using it.
NOTE: If your data source sample was created before this feature, you'll need to create a new sample in order to see it in the interface (instead of just NULL values).
Full notes for these metadata fields are available here:
https://cloud.google.com/dataprep/docs/html/Source-Metadata-References_136155148
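To illustrate the kind of matching that becomes possible once $filepath is in a column, here is a small Python sketch of extracting the {month} directory from a path like the ones above; the concrete value of $filepath and the "2017-01" folder name are assumptions for illustration:
import re

filepath = "//source/user/me/datasets/2017-01/2017-01-31-file.csv"  # hypothetical $filepath value
match = re.search(r"/datasets/([^/]+)/", filepath)
month = match.group(1) if match else None
print(month)  # -> 2017-01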
There is currently no access to the data source location or parameter match values within the flow; only the data in the dataset is available to you (except SOURCEROWNUMBER()).
Partial Solution
One method I have been using to mimic parameter insertion into the eventual table is to have multiple dataset imports by parameter and then union these before running your transformations into a final table.
For each known parameter search dataset, have a recipe that fills a column with that parameter per dataset and then union the results of each of these.
Obviously, this is only so scalable, i.e. it works if you know the set of parameter values that will match. Once you get down to the granularity of a timestamp in the source file, this is no longer feasible.
In this example just the year value is the filtered parameter.
Longer Solution (An aside)
The alternative I eventually skated to was to define dataflow jobs using Dataprep, use these as Dataflow templates, and then run an orchestration function that ran the Dataflow job (not Dataprep) and amended the parameters for input AND output via the API. Then there was a transformation BigQuery job that did the roundup append function.
Worth it if the flow is pretty settled, but not for ad hoc work; it all depends on your scale.

Tensorflow Quickstart Tutorial example - where do we say which column is what?

I'm slowly training myself on Tensorflow, following the Get Started section on Tensorflow.org. So far so good.
I have one question regarding the Iris data set example:
https://www.tensorflow.org/get_started/estimator
When it comes to that section:
https://www.tensorflow.org/get_started/estimator#construct_a_deep_neural_network_classifier
It is not entirely clear to me when/where we tell the system which columns are features and which column is the target value.
The only thing I see is that when we load the dataset, we tell the system the target is in integer format and the features are in float32. But what if my data is all integer or all float32? How can I tell the system that, for example, the first 5 columns are features and the target is in the last column? Or is it implied that the first columns are always features and the last column can only be the target?
The statement 'feature_columns=feature_columns' when defining the DNNClassifier says which columns are to be considered part of the data, and the input function says which values are the inputs and which are the labels.
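A minimal sketch of how those two pieces work together, loosely following the TF 1.x estimator API that tutorial was written against; the data here is random stand-in data (not the actual Iris set) and the feature key "x" with a 4-wide shape is just the convention that example used:
import numpy as np
import tensorflow as tf

# Hypothetical training data: 4 float features per row, integer class labels
train_x = np.random.rand(120, 4).astype(np.float32)
train_y = np.random.randint(0, 3, size=120)

# feature_columns declares which inputs are features and how to interpret them
feature_columns = [tf.feature_column.numeric_column("x", shape=[4])]

classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[10, 20, 10],
    n_classes=3)

# The input function is where features and labels are separated:
# everything in the features dict is an input, the second element is the target.
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": train_x}, y=train_y, num_epochs=None, shuffle=True)

classifier.train(input_fn=train_input_fn, steps=2000)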