Is there a way to partition a tf.Dataset with TensorFlow’s Dataset API? - tensorflow

I checked the doc but I could not find a method for it. I want to do cross-validation, so I kind of need it.
Note that I'm not asking how to split a tensor, as I know that TensorFlow provides an API for that and it has been answered in another question. I'm asking how to partition a tf.Dataset (which is an abstraction).

You could either:
1) Use the shard transformation to partition the dataset into multiple "shards". Note that for best performance, sharding should be applied early, to the data sources (e.g. filenames).
2) As of TensorFlow 1.12, you can also use the window transformation to build a dataset of datasets.
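For illustration, a minimal sketch of both (the element values and the number of shards are made up, not from the question):
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# 1) shard: partition by keeping every num_shards-th element, starting at index
fold_0 = dataset.shard(num_shards=5, index=0)   # yields 0, 5
fold_1 = dataset.shard(num_shards=5, index=1)   # yields 1, 6

# 2) window (TF >= 1.12): a dataset whose elements are themselves datasets,
#    each containing up to `size` consecutive elements
windows = dataset.window(size=2)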

I am afraid you cannot. The Dataset API is a way to efficiently stream inputs to your net at run time. It is not a set of tools to manipulate datasets as a whole -- in that regard it might be a bit of a misnomer.
Also, if you could, this would probably be a bad idea. You would rather have this train/test split done once and for all:
it lets you review those sets offline
if the split is done each time you run an experiment, there is a risk that samples start swapping sets if you are not extremely careful (e.g. when you add more data to your existing dataset)
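One common way to be careful here (my own sketch, not from the original answer) is to decide train/test membership by hashing a stable per-example identifier, so the assignment never changes as more data is added; this assumes each example carries some id field:
import hashlib

def in_test_set(example_id, test_fraction=0.2):
    # The hash of a given id is always the same, so the example can never
    # swap sets between runs or when new data is appended.
    h = int(hashlib.md5(str(example_id).encode()).hexdigest(), 16)
    return (h % 10000) / 10000.0 < test_fraction

# `examples` is assumed to be an iterable of dicts with an 'id' key
train = [ex for ex in examples if not in_test_set(ex['id'])]
test = [ex for ex in examples if in_test_set(ex['id'])]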
See also a related question about how to split a set into training & testing in tensorflow.

Related

Using PCA on Part of Dataframe

I want to apply a clustering algorithm to a dataframe that contains a lot of features (32 columns).
Some of the features are encoded using a one-hot encoder.
I want to use PCA (Principal Component Analysis) to reduce the dimensionality and make the machine learning process easier.
Is it possible to use PCA on just some columns of the dataframe and keep the other columns as they are, then use a machine learning model?
Or is it obligatory to apply PCA to the whole dataframe before clustering?
I guess there should be no issue with doing what you describe.
What this does, effectively, is merge some of the objects' features into fewer ones, and then use the other, non-merged ones in addition to the merged ones. I don't know what effect that would have on the outcome; it might be good to run a correlation to see whether the unmerged features add anything to the PCA-merged ones. You might find that they basically duplicate what is already there.
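With scikit-learn, one way to do this is a ColumnTransformer that applies PCA to the one-hot columns and passes the remaining columns through unchanged (a sketch; the column names and cluster count are made up, not from the question):
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans

onehot_cols = ['cat_0', 'cat_1', 'cat_2']   # columns to compress with PCA
# every other column is appended unchanged via remainder='passthrough'
preprocess = ColumnTransformer([('pca', PCA(n_components=2), onehot_cols)],
                               remainder='passthrough')

model = Pipeline([('prep', preprocess), ('cluster', KMeans(n_clusters=3))])
# model.fit(df)  # df is the full pandas DataFrame with all the columns above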
Since clustering is an exploratory method, you can basically do whatever you want. It is of course advisable to have a reason for doing so, as it otherwise ends up as simply trial-and-error, and if you find a result, you won't be able to describe why you got there. It is possible (or even likely for some data sets) that there are multiple ways to cluster them, so you should make decisions based on what you know about the data already, so they can be justified in those terms.
Running random trial-and-error clustering until you find a structure makes it a bit difficult to come up with a good explanation why that structure is valid.

Implement data generator in federated training

(I have posted the question on https://github.com/tensorflow/federated/issues/793 and maybe also here!)
I have adapted my own data and model to the federated interfaces and the training converged. But I am confused about an issue: in an image classification task, the whole dataset is extremely large and it can't be stored in a single federated_train_data nor loaded into memory all at once. So I need to load the dataset from disk into memory in batches on the fly and use Keras model.fit_generator instead of model.fit during training, which is the approach people use to deal with large data.
I suppose that in the iterative_process shown in the image classification tutorial, the model is fitted on a fixed set of data. Is there any way to adjust the code to let it fit a data generator? I have looked into the source code but am still quite confused. I would be incredibly grateful for any hints.
Generally, TFF considers the feeding of data to be part of the "Python driver loop", which is a helpful distinction to make when writing TFF code.
In fact, when writing TFF, there are generally three levels at which one may be writing:
TensorFlow defining local processing (i.e., processing that will happen on the clients, or on the server, or in the aggregators, or at any other placement one may want, but only a single placement).
Native TFF defining the way data is communicated across placements. For example, writing tff.federated_sum inside of a tff.federated_computation decorator; writing this line declares "this data is moved from clients to server, and aggregated via the sum operator".
Python "driving" the TFF loop, e.g. running a single round. It is the job of this final level to do what a "real" federated learning runtime would do; one example here would be selecting the clients for a given round.
If this breakdown is kept in mind, using a generator or some other lazy-evaluation-style construct to feed data into a federated computation becomes relatively simple; it is just done at the Python level.
One way this could be done is via the create_tf_dataset_for_client method on the ClientData object; as you loop over rounds, your Python code can select from the list of client_ids, then instantiate a new list of tf.data.Datasets and pass them in as your new set of client data. An example of this relatively simple usage would be here, and a more advanced usage (involving defining a custom client_datasets_fn which takes client_id as a parameter, and passing it to a separately-defined training loop) would be here, in the code associated with this paper.
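Concretely, that driver loop might look something like the following sketch (it assumes an iterative_process and a client_data object already exist, e.g. from TFF's federated averaging builder and one of the tff.simulation datasets; the round and client counts are made up):
import random

NUM_ROUNDS = 100          # illustrative
CLIENTS_PER_ROUND = 10    # illustrative

state = iterative_process.initialize()
for round_num in range(NUM_ROUNDS):
    # Python "driver loop": pick which clients participate in this round.
    sampled_ids = random.sample(client_data.client_ids, CLIENTS_PER_ROUND)
    # Each call returns a tf.data.Dataset "recipe"; nothing is loaded into
    # memory until the computation actually iterates over it.
    federated_train_data = [client_data.create_tf_dataset_for_client(cid)
                            for cid in sampled_ids]
    state, metrics = iterative_process.next(state, federated_train_data)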
One final note: instantiating a tf.data.Dataset does not actually load the dataset into memory; the dataset is only loaded when it is iterated over. One helpful tip I have received from the lead author of tf.data.Dataset is to think of it more as a "dataset recipe" than a literal instantiation of the dataset itself. It has been suggested that perhaps a better name would have been DataSource for this construct; hopefully that helps build a mental model of what is actually happening. Similarly, using the tff.simulation.ClientData object generally shouldn't load anything into memory until it is iterated over in training on the clients; this should make some nuances around managing dataset memory simpler.

How Data and String are treated in graphlab

I have a large dataset in which some of the columns are dates and others are categorical data like Status, Department Name, Country Name.
So how is this data treated in graphlab when I call the graphlab.linear_regression.create method? Do I have to pre-process this data and convert it into numbers, or can I provide it to graphlab directly?
Graphlab is mostly used for computing on tabular and graph-based datasets, and has high scalability and performance. In graphlab.linear_regression.create, graphlab has an inbuilt feature for understanding the type of data and choosing the most suitable method of linear regression for optimizing results. For example, when both the target and the features are numeric, graphlab most of the time takes Newton's method of linear regression. Similarly, depending on the dataset, it understands the need and chooses a method accordingly.
Now, about preprocessing: graphlab only takes an SFrame for learning, and it needs to be parsed correctly before any learning. While creating an SFrame, unprocessed and error-producing data is always surfaced and throws an error. So, in order to go through any learning, you need to have clean data. If the SFrame accepts the data, along with your chosen target and features for the learning that you want, you are good to go, but pre-processing and cleaning the data is always recommended. Also, it's always good practice to do feature engineering before any learning algorithm, and redefining data types before learning is always recommended for accuracy.
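As a small illustration of that workflow (the file name, column names and target are made up, not from the question):
import graphlab as gl

sf = gl.SFrame.read_csv('my_data.csv')   # parse the raw file into an SFrame

# String columns such as 'Status' or 'Country Name' are treated as categorical
# features automatically; graphlab picks a suitable solver for the data.
model = gl.linear_regression.create(sf,
                                    target='sales',
                                    features=['Status', 'Department Name',
                                              'Country Name'])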
About your point on how data is treated in Graphlab, I would say: it depends! Some datasets are tabular and are treated accordingly, and some have a graph structure. Graphlab performs very well when it comes to regression trees and boosted classifiers, which follow the decision tree concept and are quite time- and resource-consuming in libraries other than graphlab.
For me, graphlab performed very well while creating a recommendation engine where I had a dataset of nodes and edges, and a boosted tree classifier with 18 iterations also worked flawlessly in quite scalable time, so I must say that even for tree-structured data, graphlab performs very well. I hope this answer helps.

What is the safest way to handle different versions of a DataFrame in pandas?

I'm learning some pandas/ML type stuff. Right now I'm doing a Kaggle tutorial, and the example data we've been given has a bunch of features. I suspect that some of these features are adding noise to the model rather than helping. So, I want to apply several models to the data with all features (as in the tutorial) and record their scores as a baseline. Then, I want to remove one feature at a time, and use the same models on the data without that one feature, and compare the scores.
What's the best way to do this? Naively, I'd just make a different copy of the dataset for each removed feature, but copy() is a little confusing in pandas (in version 0.20, the docs say that it makes a deep copy by default, which should be exactly what I want, right? A copy with no connection/reference to the original?). I tried it and it didn't seem to actually make the copy.
Is there a better way? Thank you.
Using a for loop:
variables = locals()
features = ['A', 'B', 'C']
for i in features:
    # df.drop returns a new DataFrame, so the original df is left untouched
    variables["dfremoved{0}".format(i)] = df.drop(i, axis=1)
    # Do your fit and predict here, within the for loop,
    # using the DataFrame with feature i removed
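A variant of the same idea (my own sketch, not part of the original answer) keeps the per-feature frames in a plain dict instead of writing into locals(), which makes them easier to inspect afterwards:
removed = {}
for col in ['A', 'B', 'C']:
    removed[col] = df.drop(col, axis=1)   # new DataFrame without that column
    # fit and score your models on removed[col] here
# removed['A'], removed['B'], ... remain available after the loop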

What is the concept of CNTKTextFormatDeserializer and why use it?

I am using the CNTKTextReader to read in my training and test sets. The train file is getting large (2.7 GB now, and soon to get bigger).
I don't understand what "CNTKTextFormatDeserializer" is -- the doc I found didn't explain the big picture of what it is and why to use it; it just went into its syntax.
So, is it a way to use a binary version of these files to make them more compact?
Readers in general are just a way to make certain aspects of training easier. These include
randomization: SGD generalizes better when the data presented to it comes in random order. The reader can randomize the data for you, with shuffling happening on the fly.
distributed training: For distributed training the reader is aware of the multiple workers and can make sure they receive distinct chunks of data.
memory budget issues: The reader does not load the whole training file in memory.
language agnostic i/o: The reader provides a cross-platform way to read data. (If you want to always be in Python you might not care about this, but others do.)
The CTF format is a little verbose and indeed there is a binary format deserializer that was recently added.
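For reference, a minimal sketch of wiring the CTF deserializer into a minibatch source (the file name and stream shapes are made up, not from the question):
from cntk.io import MinibatchSource, CTFDeserializer, StreamDef, StreamDefs

mb_source = MinibatchSource(
    CTFDeserializer('train.ctf', StreamDefs(
        features=StreamDef(field='x', shape=784, is_sparse=False),
        labels=StreamDef(field='y', shape=10, is_sparse=False))),
    randomize=True)   # shuffling happens on the fly, without loading the whole file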