Using tf.data or queues for CNN text classification in TensorFlow

I used the CNN text classification code from this GitHub repository: https://github.com/dennybritz/cnn-text-classification-tf, but my dataset is quite large: 10,000 documents (about 120 MB).
For efficient performance, I want to either evaluate on a smaller subset of my data, or use TensorFlow queues or tf.data to read the data sequentially. How can I solve this, and which .py file in the project has to be changed?
Thanks.
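One way to approach this (a rough sketch, not the repository's own code): keep the preprocessing as-is, but replace the in-memory batch generator with a tf.data pipeline that shuffles, batches and prefetches the training arrays. In the linked project the batching happens in train.py (via data_helpers.batch_iter), so that is the file to change, and you can also cut the dev set down to a smaller sample there. The .npy file names below are placeholders for whatever your preprocessing produces, and the placeholder names assume the repository's TextCNN model.

import numpy as np
import tensorflow as tf

# Hypothetical inputs: the padded, vocabulary-indexed documents and their one-hot
# labels produced by the preprocessing step (adapt the loading to your own setup).
x_train = np.load("x_train.npy")
y_train = np.load("y_train.npy")

dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .shuffle(buffer_size=10000)   # shuffle within a bounded buffer
           .batch(64)
           .repeat()                     # cycle through the data for several epochs
           .prefetch(1))                 # overlap input preparation with training

iterator = dataset.make_one_shot_iterator()   # TF 1.x API, matching the repository
next_x, next_y = iterator.get_next()

# Inside the training loop, pull a batch and feed it to the existing placeholders
# instead of looping over data_helpers.batch_iter:
#   x_batch, y_batch = sess.run([next_x, next_y])
#   feed_dict = {cnn.input_x: x_batch, cnn.input_y: y_batch,
#                cnn.dropout_keep_prob: 0.5}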

Related

Issues in padding (pre-processing) of huggingface gpt2 transformer model and issues with very large dataset during model training

Objective: I am trying to train a TensorFlow Hugging Face GPT-2 model (language-model training from scratch).
Model Description:
Hugging Face GPT-2 TensorFlow model.
A picture of the config is attached (Model Config).
Dataset Description:
I have a large dataset (~20 GB).
The data is split across multiple text files, with each new line being a training example.
I am facing two issues.
The examples are of different lengths, and I am not sure how to make them all the same length to feed to the model.
Solutions tried: We can pad them, but I am not sure how to do that in batches in TensorFlow. I looked into data collators.
Doubt: Should padding be done to make all examples the same size within each batch, or across the whole dataset? And would this be done with tokens or some other information? (There are different data collators for language modelling, etc.)
Since the data is very large, it cannot be loaded into memory at once while training (when doing model.fit), and I am not sure how to proceed.
Solutions: I am thinking of training and saving the model on small files, but that would require manual intervention or a for loop, and the model would not be trained on the whole dataset in one go. If there are other alternatives, help would be really appreciated.
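For what it's worth, both issues (padding per batch and data that doesn't fit in memory) can be handled by streaming the text files through tf.data and using padded_batch, roughly as sketched below. This is only a sketch under assumptions: the data/*.txt glob, the maximum length of 512 and the batch size of 8 are placeholders, and labels and attention masks still have to be added for your particular training setup.

import tensorflow as tf
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default

files = tf.data.Dataset.list_files("data/*.txt")       # hypothetical glob
lines = files.interleave(tf.data.TextLineDataset,
                         cycle_length=4,
                         num_parallel_calls=tf.data.experimental.AUTOTUNE)

def encode(line):
    def _tokenize(t):
        ids = tokenizer(t.numpy().decode("utf-8"),
                        truncation=True, max_length=512)["input_ids"]
        return tf.constant(ids, dtype=tf.int32)
    ids = tf.py_function(_tokenize, [line], tf.int32)
    ids.set_shape([None])                               # 1-D sequence of token ids
    return ids

dataset = (lines
           .map(encode, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .padded_batch(8, padded_shapes=[None],
                         padding_values=tf.constant(tokenizer.pad_token_id, tf.int32))
           .prefetch(tf.data.experimental.AUTOTUNE))

# dataset now yields padded (batch, seq_len) int32 batches streamed from disk;
# add labels / attention masks as required by your training loop before model.fit.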

Transforming tensorflow datasets to beam datasets

There are a variety of ways to get a dataset you can train on in TensorFlow. One of the things TensorFlow Transform does is provide the ability to do preprocessing via AnalyzeAndTransformDataset and TransformDataset. Surprisingly, the dataset being referred to is not a TensorFlow dataset, but rather a dataset in the Apache Beam sense. That is understandable to some degree, given that the function is tft_beam.AnalyzeAndTransformDataset.
The heart of my question is this: given that the metadata is already known to TensorFlow, why aren't there easier ways to get from a TensorFlow dataset to a Beam dataset? I understand that a TensorFlow dataset will generally repeat itself forever, but is there a way to transform a TensorFlow dataset into a dataset that can be processed by Beam? Or is the only solution to have the Beam dataset created by pointing to the original data on disk? Does this have to do with the unboundedness of a TensorFlow dataset, or is there some other reason a TensorFlow dataset cannot be analyzed/transformed through appropriate transformations so that this is abstracted from the developer? All of the examples I have seen start with dictionaries, and there is another Stack Overflow question here that talks about this to some extent, but it doesn't fully explain why this is the way it is.
This seems to be a question for the TensorFlow team rather than Apache Beam, but the TFX transforms you referred to are built on top of Beam transforms (so Beam is used as a utility). You are not directly working with Beam constructs (PCollections, PTransforms, etc.). If you want to build a Beam pipeline using the intermediate data, you might need to start with TFRecord files and use Beam's tfrecordio source, as the other post mentioned.
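To illustrate that last point (a sketch only, assuming the data already sits on disk as TFRecord files; the path is made up, and the decoding into instance dicts for tft_beam would follow your own metadata, e.g. via TFT's coders):

import apache_beam as beam
import tensorflow as tf

def to_example(serialized):
    # Parse the serialized record back into a tf.train.Example proto.
    # Decoding it further into the instance dicts expected by
    # tft_beam.AnalyzeAndTransformDataset depends on your dataset metadata.
    return tf.train.Example.FromString(serialized)

with beam.Pipeline() as pipeline:
    examples = (pipeline
                | "ReadTFRecords" >> beam.io.ReadFromTFRecord("data/train-*.tfrecord")
                | "Parse" >> beam.Map(to_example))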

Solutions for big data preprocessing for feeding deep neural network models built with TensorFlow 2.0?

Currently I am using Python, NumPy, pandas and scikit-learn to do data preprocessing (LabelEncoder, MinMaxScaler, fillna, etc.), and then feeding the processed data to DNN models built with TensorFlow 2.0. This input pipeline meets my needs when the data is small enough to fit in a PC's RAM.
Now I have some large datasets, more than 10 GB, and some are larger still. I also plan to deploy the models in a production environment, which means there will be new data coming in every day. For DNN model training, TensorFlow 2.0 offers distribution strategies. But for data preprocessing I obviously cannot use pandas or scikit-learn on the large datasets with one PC. It seems to me I would need a for loop where I repeatedly fetch a small part of the data and use it for training?
I am wondering what people typically use, in either experimental or production environments, for big-data preprocessing.
Should I use Spark (Scala) / PySpark together with a TensorFlow input pipeline?
Yeah, the way you are currently doing preprocessing will not scale well.
PySpark is one good way to run your preprocessing layer. Set up a simple standalone Spark cluster with a few workers and then run your preprocessing (LabelEncoder / OneHotEncoder / fillna / ...) there. This should scale well, and it abstracts away the distributed computation layer.
PS: PySpark might not be the only way forward, but it is a good one for this use case.
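A minimal sketch of that idea follows; the column names, fill values and paths are invented for illustration, and the Spark ML stages stand in for their scikit-learn counterparts:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, MinMaxScaler

spark = SparkSession.builder.appName("preprocess").getOrCreate()

# Hypothetical input path and column names.
df = spark.read.csv("hdfs:///data/train/*.csv", header=True, inferSchema=True)
df = df.fillna({"category": "unknown", "amount": 0.0})              # ~ fillna in pandas

stages = [
    StringIndexer(inputCol="category", outputCol="category_idx"),   # ~ LabelEncoder
    VectorAssembler(inputCols=["amount"], outputCol="amount_vec"),
    MinMaxScaler(inputCol="amount_vec", outputCol="amount_scaled"), # ~ MinMaxScaler
]
processed = Pipeline(stages=stages).fit(df).transform(df)

# Persist the result (Parquet here, or TFRecords via the spark-tensorflow-connector)
# and stream it into TensorFlow with tf.data, so nothing has to fit in one PC's RAM.
processed.write.mode("overwrite").parquet("hdfs:///data/train_processed")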

preprocess data sets for TensorFlow high-level estimators

I'm coming from a Scikit Learn background.
I'm having difficulty understanding how to preprocess data sets for Tensorflow.
I'm trying to implement SVM with the iris data set.
If I have two numpy arrays, one containing a list of the features, and the other containing the list of the labels, which functions would I use to create the classifier?
estimator = SVM(
    example_id_column='example_id',
    feature_columns=[real_feature_column, sparse_feature_column],
    l2_regularization=10.0)
I'm assuming the example_id_column would be
example_id_column = '0,1,2'
I'm not sure how to obtain the feature_columns.
I think the most effective way is to use TFRecord files. There's a comprehensive tutorial available that's still mostly relevant, too. This also has the advantage of letting you define much more of your pipeline as part of the graph, of being able to do concurrent reads from the source files, and of not needing to fit your dataset in memory. It's definitely worth the effort.
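For reference, writing the two NumPy arrays from the question into a TFRecord file takes only a few lines; this is a sketch with an arbitrary file name and feature keys:

import numpy as np
import tensorflow as tf

# Hypothetical arrays shaped like the iris data: (n, 4) float features, (n,) int labels.
features = np.random.rand(150, 4).astype(np.float32)
labels = np.random.randint(0, 3, size=150)

# On older TF 1.x the writer lives at tf.python_io.TFRecordWriter instead.
with tf.io.TFRecordWriter("iris.tfrecord") as writer:
    for x, y in zip(features, labels):
        example = tf.train.Example(features=tf.train.Features(feature={
            "x": tf.train.Feature(float_list=tf.train.FloatList(value=x)),
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(y)])),
        }))
        writer.write(example.SerializeToString())

# At training time, read it back with tf.data.TFRecordDataset and
# tf.io.parse_single_example inside the estimator's input_fn.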

How can I use a Torch model?

I have a Torch model that was trained on a large-scale dataset (the Places dataset), and its authors uploaded it to GitHub. I am working on a similar project and I want to make use of its trained weights instead of training it on the large dataset myself, to save time and effort. Is that possible? How can I extract only the trained filter weights? I don't want to copy the code, I only want to reuse the weights.
NOTE: I use TensorFlow in my implementation.
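One possible route (a sketch only, assuming the published model is a Lua Torch .t7 file, as the original Places CNNs were): read it with the torchfile package, which loads the layer weights as NumPy arrays without needing Torch itself, and then assign those arrays to the matching variables in your TensorFlow graph. The file name below is a placeholder, and the exact layer layout depends on how the model was serialized.

import torchfile   # pip install torchfile; reads Lua Torch .t7 files without Torch

net = torchfile.load("places_model.t7")   # hypothetical file name
print(net)   # inspect the structure: the modules and their 'weight'/'bias' entries

# Each layer's weight and bias come back as NumPy arrays. Convolution kernels use
# Torch's (out_channels, in_channels, height, width) layout, so transpose them to
# TensorFlow's (height, width, in_channels, out_channels) before assigning, e.g.:
#   tf_kernel = torch_kernel.transpose(2, 3, 1, 0)
#   sess.run(tf_weight_var.assign(tf_kernel))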