Can we build an LSTM classification model using PySpark and TensorFlow?

I am familiar with building an LSTM model for time-series classification problems using TensorFlow on a small dataset. Now I want to go further and build a new LSTM model on a much bigger dataset (around 30 GB of text files), and I understand that PySpark is an effective way to handle big data processing.
My question is: can we create an LSTM model using PySpark, or does anyone have a guideline or an example use case for the same purpose?
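PySpark itself has no LSTM layer, so the usual pattern is to split responsibilities: Spark does the distributed preprocessing of the 30 GB corpus and writes out model-ready shards, and TensorFlow then streams those shards with tf.data to train the LSTM. A minimal sketch of that pattern follows; the paths, column names, row format, and model sizes are all assumptions for illustration:

```python
from pyspark.sql import SparkSession
import tensorflow as tf

# --- Stage 1: distributed preprocessing with Spark -------------------------
spark = SparkSession.builder.appName("lstm-preprocess").getOrCreate()

# Read the raw corpus and write model-ready (sequence, label) rows as CSV
# shards. "raw_data/*.csv" and the column names are hypothetical.
df = spark.read.csv("raw_data/*.csv", header=True)
df.select("sequence", "label").write.csv("preprocessed", header=False)

# --- Stage 2: stream the shards into an LSTM with tf.data ------------------
seq_len, vocab_size, num_classes = 100, 20_000, 5   # assumed sizes

def parse_line(line):
    # Assumes each row looks like "12 7 93 ... 5,3": space-separated
    # integer token ids, a comma, then the integer class label.
    parts = tf.strings.split(line, ",")
    tokens = tf.strings.to_number(tf.strings.split(parts[0], " "), tf.int32)
    label = tf.strings.to_number(parts[1], tf.int32)
    return tokens[:seq_len], label

files = tf.io.gfile.glob("preprocessed/part-*")
dataset = (tf.data.TextLineDataset(files)
           .map(parse_line, num_parallel_calls=tf.data.AUTOTUNE)
           .padded_batch(64, padded_shapes=([seq_len], [])))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(dataset, epochs=3)
```

If the training itself must run distributed across the Spark cluster rather than on one machine, projects such as TensorFlowOnSpark target exactly that setup.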

Related

TensorFlow: is it possible to identify whether data was used for training?

I have created a text classification model (.pb) using TensorFlow, and prediction works well.
Is it possible to check whether a sentence given for prediction was already used to train the model? I need to retrain the model when a sentence it has not seen before is given to it for prediction.
I did some research and couldn't find a way to recover the training data from the .pb file alone, because that file stores only the learned parameters, not the actual training data. If you still have the dataset, you can easily verify against it directly.
I don't think you can ever recover the exact training data with only the trained model, since the model contains the learned weights and not the training examples themselves.
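One workaround, assuming you still control the training pipeline: store a fingerprint of every training sentence alongside the exported .pb file, and check new sentences against that set before deciding to retrain. A minimal sketch, where the normalization rule and example data are assumptions:

```python
import hashlib

def fingerprint(sentence: str) -> str:
    # Normalize case and whitespace so trivial variations still match.
    normalized = " ".join(sentence.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# At training time: save the fingerprints alongside the exported .pb model.
train_sentences = ["the movie was great", "terrible service"]  # example data
seen = {fingerprint(s) for s in train_sentences}

# At prediction time: only queue genuinely unseen sentences for retraining.
new_sentence = "The movie was great"
if fingerprint(new_sentence) not in seen:
    print("unseen sentence -> add it to the next retraining batch")
else:
    print("already in the training set -> no retraining needed")
```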

Can I pick test and validation data out of the training data after doing data augmentation?

I am training a U-Net for semantic segmentation, but I only have 200 labeled images. Given the small size of the dataset, it definitely needs some data augmentation.
I have a question about the test and validation sets.
I have a custom data generator which keeps feeding data from a folder for training the model.
So what I plan to do is:
do data augmentation on the training set and keep all of it in the same folder
"randomly" pick some of that data for the test and validation sets (before training, of course).
I am not sure whether this is fine, since the augmentation is only simple processing (flipping, transposing, adjusting brightness).
Would it be better to split the data first, and then do the augmentation only on the data remaining in the training folder?
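Splitting first is the safer order: if you augment before splitting, a flipped or brightened copy of a validation image can end up in training, which leaks information and inflates validation scores. A minimal sketch of the split-first order, assuming the 200 originals sit in a "labeled/" folder (folder names and the flip-only augmentation are illustrative, and Pillow >= 9.1 is assumed):

```python
import os
import random
import shutil
from PIL import Image

random.seed(42)
images = sorted(os.listdir("labeled"))        # the 200 original labeled images
random.shuffle(images)

# Split the originals 80/10/10 before any augmentation happens.
n = len(images)
splits = {
    "train": images[: int(0.8 * n)],
    "val":   images[int(0.8 * n): int(0.9 * n)],
    "test":  images[int(0.9 * n):],
}

for split, files in splits.items():
    os.makedirs(split, exist_ok=True)
    for f in files:
        shutil.copy(os.path.join("labeled", f), os.path.join(split, f))

# Augment only the training split, so no derived copy of a validation or
# test image can ever appear in the training folder.
for f in splits["train"]:
    img = Image.open(os.path.join("train", f))
    img.transpose(Image.Transpose.FLIP_LEFT_RIGHT).save(
        os.path.join("train", "flip_" + f))
```

For segmentation, remember that every geometric transform (flip, transpose) must be applied identically to the corresponding label mask.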

How to train on new data continuously in TensorFlow

I use TF-Slim to train on the flowers dataset; the script is this. The flowers dataset has only 5 classes. If I add some new images to the roses class, or add a new class, what should I do after training for 1000 steps? Do I need to delete the already-trained data, such as the checkpoint files?
There is a similar question on Data Science Stack Exchange, with an answer that covers your scenario:
Once a model is trained and you get new data which can be used for training, you can load the previous model and train onto it. For example, you can save your model as a .pickle file and load it and train further onto it when new data is available. Do note that for the model to predict correctly, the new training data should have a similar distribution as the past data.
I do the same in my own project, where I started with a small dataset that grew bigger over time. After adding new data I retrain the model from the last checkpoint.
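A minimal sketch of the "continue from the last checkpoint" idea using Keras (the TF-Slim flow is analogous: point the training script's checkpoint path at the last checkpoint instead of starting fresh). The file name "flowers_model.h5" and the placeholder arrays are assumptions:

```python
import numpy as np
import tensorflow as tf

# Load the weights from the previous training run.
model = tf.keras.models.load_model("flowers_model.h5")

# Stand-ins for the newly added rose images; replace with your real data.
new_images = np.random.rand(100, 224, 224, 3).astype("float32")
new_labels = np.random.randint(0, 5, size=100)

# A small learning rate lets the new data fine-tune the existing weights
# instead of overwriting them.
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(new_images, new_labels, batch_size=32, epochs=5)
model.save("flowers_model.h5")
```

Note that adding a brand-new class is different from adding new images: the final layer's output shape changes, so that layer must be replaced and reinitialized, while the earlier layers can still start from the checkpoint. Either way, do not delete the checkpoint files; they are exactly what lets you continue instead of restarting.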

How to use TensorFlow to predict on large CSV files in chunks and glue the results together

I've trained a prediction model with TensorFlow, and there's a large test.csv file that's too big to fit into memory. Is it possible to feed it one smaller chunk at a time and then concatenate the results within one session?
Using tf.estimator.Estimator for your model and calling its predict method with numpy_input_fn will give you all the pieces you need to build this.
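If you are not tied to the Estimator API, the same chunk-and-concatenate idea can be sketched with pandas and a Keras model; pandas streams the file with chunksize so memory usage stays bounded. File names and the feature preprocessing below are assumptions:

```python
import numpy as np
import pandas as pd
import tensorflow as tf

model = tf.keras.models.load_model("model.h5")   # hypothetical trained model

predictions = []
for chunk in pd.read_csv("test.csv", chunksize=10_000):
    # Adapt this line to whatever preprocessing your model expects.
    features = chunk.values.astype("float32")
    predictions.append(model.predict(features, verbose=0))

# Glue the per-chunk results back together in the original row order.
result = np.concatenate(predictions, axis=0)
np.savetxt("predictions.csv", result, delimiter=",")
```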

Multi-label supervised classification of text data

I am solving a machine learning problem using Python, and my knowledge of machine learning is limited. The problem comes with a training dataset consisting of text samples and the labels for those samples. All possible label values are given, so this is a supervised problem. Some text samples have an empty set of labels. Now I have to build a model that predicts the labels for given text data.
What I have done so far: I created a pandas dataframe from the training data with columns [text_data, label1, label2, label3, ..., labeln], where the label columns hold either 0 or 1. Then I cleaned and tokenized text_data, removed stop words from the tokens, and stemmed the tokens using PorterStemmer. I split the dataframe into training and validation data 80:20, and now I am trying to build a model on the training data that predicts the validation data's labels. But I am very confused about how to build the model. I tried a few things like a Naive Bayes classifier, but it didn't work, or maybe I made a mistake somewhere. Any idea how I should proceed?
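One standard approach for this setup is TF-IDF features with a one-vs-rest linear classifier, which fits one independent binary classifier per label column, exactly matching your 0/1 dataframe layout. A sketch, where the file name and column-naming convention are assumptions based on the description above:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

df = pd.read_csv("train.csv")                     # hypothetical file
label_cols = [c for c in df.columns if c.startswith("label")]

# 80:20 split, as described above; y is the binary label-indicator matrix.
X_train, X_val, y_train, y_val = train_test_split(
    df["text_data"], df[label_cols].values, test_size=0.2, random_state=42)

# TF-IDF turns the cleaned text into sparse features; OneVsRest trains one
# logistic regression per label, so each label is predicted independently.
clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(X_train, y_train)

# Micro-averaged F1 handles the label imbalance typical of multi-label data.
print("val micro-F1:", f1_score(y_val, clf.predict(X_val), average="micro"))
```

A plain Naive Bayes classifier can also work in this scheme, but it must be wrapped in OneVsRestClassifier the same way; fit on the whole label matrix at once, most classifiers will reject the 2-D target, which may be why your attempt failed.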