How to speed up model.predict() with LightGBM and pandas

I have a dataset of about 1 million rows on which I am training an LGBMRegressor model.
Features: I have about 10 features such as longitude, latitude, postalCode, Bathrooms, Bedrooms, BuildingArea, LotArea, HomeMedianValue, RentMedianArea, etc., and the target is RentValuePerMonth.
Model: the best performance, with the lowest MdAPE (median absolute percentage error) of about 7%, comes from LGBMRegressor. After optimizing the hyperparameters, I get:
LGBMRegressor(max_depth=25, n_estimators=4000, learning_rate=0.1,
              max_bin=1300, num_leaves=1500)
Model training takes about half an hour.
Predicting on 1 million rows also takes the model about 30 minutes.
The problem is that I need to predict the target for a dataset of 125 million rows, which will take a huge amount of time.
I tried to predict 20 million rows in one call, and the model has now been running for 3 days.
What am I doing wrong, and how can I speed up the process?
Thanks!
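One likely cost driver is the model itself: with n_estimators=4000 and num_leaves=1500, every prediction walks 4000 very large trees, so shrinking the ensemble (e.g. via early stopping) reduces inference time proportionally. Independent of that, a common pattern is to stream the table and predict in bounded chunks instead of one giant call. A minimal sketch, assuming the features live in a hypothetical big_dataset.csv and model is the already-fitted LGBMRegressor:

import numpy as np
import pandas as pd

# Hypothetical file and column names; model is the fitted LGBMRegressor.
feature_cols = ["longitude", "latitude", "postalCode", "Bathrooms",
                "Bedrooms", "BuildingArea", "LotArea",
                "HomeMedianValue", "RentMedianArea"]

preds = []
# Stream the 125M-row table in 1M-row chunks so each predict() call
# stays bounded in memory and finishes in predictable time.
for chunk in pd.read_csv("big_dataset.csv", usecols=feature_cols,
                         chunksize=1_000_000):
    preds.append(model.predict(chunk))
predictions = np.concatenate(preds)

Since rows are independent at prediction time, the chunks can also be fanned out across processes or machines.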

Related

How are datasets structured in TensorFlow?

In my first TensorFlow project, I have a big dataset (1M elements) containing 8 categories of elements, with each category having a different number of elements. I want to split the big dataset into 10 exclusive small datasets, each having approximately 1/10 of each category. (This is for 10-fold cross-validation purposes.)
Here is how I do it:
I wind up with 80 datasets (each of the 8 categories split into 10 small datasets), then I randomly sample from the 80 of them using sample_from_datasets. However, after some steps I get many warnings saying "DirectedInterleave selected an exhausted input: 36", where 36 can be some other integer.
The reason I use sample_from_datasets is that I tried shuffling the original dataset instead. Even when shuffling only 0.4 x the total elements, it still takes a very long time to finish (about 20 minutes).
My questions are:
1. Based on my case, is there any good advice on how to structure the datasets?
2. Is it normal to have such a long shuffling time? Is there a better solution for shuffling?
3. Why do I get this "DirectedInterleave selected an exhausted input" warning, and what does it mean?
Thank you.
Split your whole dataset into Training, Testing and Validation partitions. As you have 1M data points, you can split like this: 60% training, 20% testing and 20% validation. How you split the dataset is completely up to you and your requirements, but normally most of the data is used for training the model, and the rest is used for testing and validation.
As you have eight category datasets, split each category into Training, Testing and Validation sets.
Say you have data in categories A, B, C and D. Split each of them like this:
'A' - 60% for training, 20% for testing and 20% for validation
'B' - 60% for training, 20% for testing and 20% for validation
'C' - 60% for training, 20% for testing and 20% for validation
'D' - 60% for training, 20% for testing and 20% for validation
Finally, merge the A, B, C and D training, testing and validation datasets respectively.
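The per-category 60/20/20 split described above is just a stratified split; a minimal sketch with scikit-learn (array names and sizes are made up):

import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-ins: X holds features, y holds the 8 category labels.
X = np.random.rand(1_000_000, 16)
y = np.random.randint(0, 8, size=1_000_000)

# stratify keeps every category's proportion identical in each
# partition: first carve off 40%, then halve it into test/validation.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

This has the same effect as splitting categories A, B, C and D separately and then merging.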

I want train_test_split to train mainly on one specific number range

I am running some regression models in Jupyter/Python to predict the cycle time of certain projects. I used train_test_split from sklearn to randomly divide my dataset.
The models tend to work pretty well for projects with high cycle times (between 150-300 days), but I care more about the lower cycle times, between 0 and 50 days.
I believe the model is more accurate on the higher range because most of the projects (about 60-70%) have cycle times over 100 days. I want my model to mainly get the lower cycle times right, because for my purposes a project with a 120-day cycle time is the same as one with a 300-day cycle time.
In my mind, I need to train more on the projects with shorter cycle times, and I feel like this might help.
Is there a way to split the data less randomly, i.e. train on a higher ratio of shorter-cycle-time projects?
Is there a better or different approach I should consider?
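If the split itself stays random, one option is to re-weight or oversample the short-cycle rows so the model's loss emphasizes them. A minimal sketch with a made-up dataset (the 5x weight and the 50-day cutoff are arbitrary choices):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data: X features, y cycle times in days.
X = np.random.rand(1000, 5)
y = np.random.uniform(0, 300, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Up-weight short-cycle projects (< 50 days) so the fit prioritizes
# the range that matters; the factor of 5 is arbitrary.
weights = np.where(y_train < 50, 5.0, 1.0)
model = LinearRegression().fit(X_train, y_train, sample_weight=weights)

Most sklearn regressors accept sample_weight in fit, so the same idea carries over to other models.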

Setting up training in TensorFlow with overlapping data?

I am trying to train a neural network to forecast using time series data: predicting the temperature 10 minutes into the future. Say I have temperature readings every 5 minutes and I want to give the network 15 minutes' worth of data (three points) to use for the prediction, and the data I have is this:
[1,2,3,4,5,6,7,8,9,10,11,12]
so if I were to train on the data, one potential training sample is [1,2,3] as x and [5] as y (as it is 10 minutes into the future, i.e. two 5-minute steps).
I want a way to train on all possible inputs, which are as follows:
[1,2,3][5]
[2,3,4][6]
[3,4,5][7]
[4,5,6][8]
[5,6,7][9]
[6,7,8][10]
[7,8,9][11]
[8,9,10][12]
But I don't want to train by first saving each possible example to disk and then training from that; it takes up more space than necessary because the data is duplicated. I would like to do this in some kind of preprocessing step.
All the instructions and examples I have found for the TensorFlow input pipeline, such as https://www.tensorflow.org/guide/datasets, use "non-overlapping" data; I can't find anything that deals with my scenario.
The problem is that I have no idea how to set up this overlapping-data scenario in TensorFlow without saving massive amounts of duplicated data to disk. If anyone has any links or guides on the best way to do this, I'd very much appreciate it. Thank you.
You are probably looking for this transformation: https://www.tensorflow.org/api_docs/python/tf/contrib/data/sliding_window_batch
tf.contrib.data.sliding_window_batch(window_size=3, stride=1)
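Note that tf.contrib was removed in TensorFlow 2.x; an equivalent built on tf.data.Dataset.window might look like this (a sketch that produces exactly the (x, y) pairs listed in the question):

import tensorflow as tf

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

# Windows of 5 readings: the first 3 form the input x, the last one
# (two 5-minute steps after the inputs, i.e. 10 minutes) is the target y.
ds = tf.data.Dataset.from_tensor_slices(data)
ds = ds.window(5, shift=1, drop_remainder=True)
ds = ds.flat_map(lambda w: w.batch(5))
ds = ds.map(lambda w: (w[:3], w[-1]))

for x, y in ds:
    print(x.numpy(), y.numpy())   # [1 2 3] 5 ... [8 9 10] 12

Because the windows are generated lazily by the pipeline, no duplicated data is ever written to disk.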

LSTM (long short-term memory) on Penn Treebank data

The Penn Treebank data seems difficult to understand. Below are two links:
https://github.com/townie/PTB-dataset-from-Tomas-Mikolov-s-webpage/tree/master/data
https://www.tensorflow.org/tutorials/recurrent
My concern is as follows. The reader gives me around a million occurrences of 10,000 words. I have written code to convert this dataset into a one-hot encoding; thus, I have a million vectors of 10,000 dimensions, each vector having a 1 at a single location. Now I want to train an LSTM (long short-term memory) model on this for prediction.
For simplicity, let us assume that there are 30 occurrences (not a million) and that the sequence length for the LSTM is 10 (the number of timesteps it unrolls). Let us denote these occurrences by
X1,X2,....,X10,X11,...,X20,X21,...X30
Now, my concern is: should I use 3 data samples for training,
X1,...,X10 and X11,...,X20 and X21,...,X30,
or should I use 21 data samples for training,
X1,...,X10 and X2,...,X11 and X3,...,X12, and so on until X21,...,X30?
If I go with the latter, am I not breaking the i.i.d. assumption on the training sequences?
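For concreteness, the two strategies differ only in the window stride; a small numpy sketch of both (integer tokens stand in for X1..X30):

import numpy as np

tokens = np.arange(1, 31)   # stand-ins for X1..X30
seq_len = 10

# Non-overlapping windows: 3 samples of length 10.
non_overlapping = tokens.reshape(-1, seq_len)

# Overlapping windows with stride 1: 21 samples of length 10.
overlapping = np.stack([tokens[i:i + seq_len]
                        for i in range(len(tokens) - seq_len + 1)])

print(non_overlapping.shape, overlapping.shape)   # (3, 10) (21, 10)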

Using Torch for time series prediction with LSTMs

My main problem is how I should pre-process my dataset, which is basically a sequence of 60 per-minute numeric inputs that result in one hourly output. Each minute's input vector produces some output, but unfortunately that output can't be observed until the hour has passed.
I thought about putting the 60 inputs into one big input vector corresponding to the one hourly output and feeding that to a normal ML classifier, hence having one sample at a time. But I don't think it would be a time series anymore.
How can I represent this so that it is doable in an LSTM setting?
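Assuming PyTorch, the usual representation is a 3-D tensor of shape (batch, 60, features), with the LSTM's last hidden state feeding a one-value head; a minimal sketch with made-up sizes:

import torch
import torch.nn as nn

# Made-up sizes: a batch of hourly samples, each a sequence of 60
# minute-level vectors with 8 features per minute.
batch, seq_len, n_features = 32, 60, 8
x = torch.randn(batch, seq_len, n_features)

class HourlyLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(n_features, 64, batch_first=True)
        self.head = nn.Linear(64, 1)

    def forward(self, x):
        out, _ = self.lstm(x)         # (batch, 60, 64)
        return self.head(out[:, -1])  # last timestep -> one hourly output

y_pred = HourlyLSTM()(x)              # shape: (batch, 1)

This keeps the minute-level sequence intact, unlike flattening the 60 inputs into one long feature vector.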