How to run huge datasets in Vertex AI - tensorflow

I am working with large feature sets (20,000 rows x 20,000 columns) and Vertex AI has a hard limit of 1,000 columns. How can I import data into Google Cloud efficiently so that I can run TensorFlow models or AutoML on my data? I haven't been able to find documentation for this issue.

Are you trying this with datasets / AutoML? One thing to try is Feature Store (https://cloud.google.com/vertex-ai/docs/featurestore) or putting the data into BigQuery (https://cloud.google.com/blog/products/data-analytics/automl-tables-now-generally-available-bigquery-ml). I know that might be a change in your workflow, but it should be able to accommodate up to 10,000 columns.
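As a rough sketch (all project, dataset, and file names below are placeholders), one route is to load the wide table into BigQuery with the Python client and then create a Vertex AI tabular dataset that points at it:
import pandas as pd
from google.cloud import aiplatform, bigquery

# Hypothetical identifiers - replace with your own project/dataset/table.
bq_table = "my-project.my_dataset.wide_features"

# Load the wide feature matrix (e.g. from a local Parquet file) and push it into BigQuery.
df = pd.read_parquet("features.parquet")
client = bigquery.Client()
client.load_table_from_dataframe(df, bq_table).result()

# Create a Vertex AI tabular dataset backed by the BigQuery table.
aiplatform.init(project="my-project", location="us-central1")
dataset = aiplatform.TabularDataset.create(
    display_name="wide-features",
    bq_source=f"bq://{bq_table}",
)
Keep in mind that BigQuery itself caps a table at 10,000 columns, so a 20,000-column matrix would still need to be split or reshaped (for example into a long/narrow layout) first.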

Related

Pandas to Koalas (Databricks) conversion code for big scoring dataset

I have been encountering OOM errors while trying to score a huge dataset. The dataset shape is (15 million, 230). Since the working environment is Databricks, I decided to update the scoring code to Koalas and take advantage of the Spark architecture to alleviate my memory issues.
However, I've run into some issues trying to convert part of my code from pandas to Koalas. Any help on how to work around this issue is much appreciated.
Currently, I'm trying to add a few adjusted columns to my dataframe, but I'm getting a PandasNotImplementedError: The method pd.Series.__iter__() is not implemented. If you want to collect your data as a NumPy array, use 'to_numpy()' instead.
Code/Problem area :
df[new_sixmon_cols] = df[sixmon_cols].div([min(6,i) for i in df['mob']],axis=0)
df[new_twelvemon_cols] = df[twelvemon_cols].div([min(12,i) for i in df['mob']],axis=0)
df[new_eighteenmon_cols] = df[eighteenmon_cols].div([min(18,i) for i in df['mob']],axis=0)
df[new_twentyfourmon_cols] = df[twentyfourmon_cols].div([min(24,i) for i in df['mob']],axis=0)
print('The shape of df after add adjusted columns for all non indicator columns is:')
print(df.shape)
I believe the problem area is div([min(6, i) for i in df['mob']]), but I'm not certain how to go about converting this particular piece of code efficiently, or, more generally, how to handle scoring a big dataset leveraging Databricks or the cloud environment.
Some pointers about the data/model:
The data has, of course, already been feature-reduced and selected.
I built the model with 2.5m records and now I'm trying to work on scoring files.
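A possible workaround for the PandasNotImplementedError (just a sketch; the toy column names bal_a / bal_b stand in for the real scoring columns) is to express min(6, i) as a vectorized Series.clip instead of a Python list comprehension, so nothing iterates over the Series on the driver:
import databricks.koalas as ks

# Toy frame standing in for the real scoring data.
df = ks.DataFrame({
    "mob": [3, 8, 15, 30],
    "bal_a": [10.0, 20.0, 30.0, 40.0],
    "bal_b": [5.0, 6.0, 7.0, 8.0],
})
sixmon_cols = ["bal_a", "bal_b"]
new_sixmon_cols = ["adj_" + c for c in sixmon_cols]

# min(6, mob) expressed as a vectorized clip, then divide column by column.
capped_mob = df["mob"].clip(upper=6)
for old_col, new_col in zip(sixmon_cols, new_sixmon_cols):
    df[new_col] = df[old_col] / capped_mob
The same pattern should apply to the 12-, 18-, and 24-month blocks, and it should carry over to pyspark.pandas on newer Databricks runtimes, where Koalas now lives.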

Is it possible to refresh a customer level dataset daily to give churn predictions using BigqueryML

Is it possible to refresh a customer-level dataset daily to give churn predictions using BigQuery ML?
You can run a scheduled query to run CREATE MODEL periodically. The model is persisted in BigQuery storage when CREATE MODEL is executed.
For immediate updates, automating the pipeline workflow with either Composer or Airflow is the best option.
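For the scheduled-query route, a rough sketch using the BigQuery Data Transfer Service Python client could look like this (the project, model, table names, and schedule are placeholders):
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()
parent = transfer_client.common_project_path("my-project")  # hypothetical project

# Re-run CREATE OR REPLACE MODEL once a day so the model trains on fresh data.
retrain_query = """
CREATE OR REPLACE MODEL `my-project.churn.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT * FROM `my-project.churn.training_features`
"""

transfer_config = bigquery_datatransfer.TransferConfig(
    display_name="daily-churn-model-retrain",
    data_source_id="scheduled_query",
    params={"query": retrain_query},
    schedule="every 24 hours",
)
transfer_config = transfer_client.create_transfer_config(
    parent=parent, transfer_config=transfer_config
)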
The word "refresh" here may be ambiguous, and the word "dataset" probably refers to a BigQuery table.
Your question as I understand it is whether you can create an up-to-date table with users and a churn score that was calculated using BigQuery ML.
The answer is yes, after you train a model using BQML you can run scheduled queries to give you predictions. Please keep in mind:
Updating data in place in a BigQuery table is strongly discouraged. Instead, consider appending the results to a table with a timestamp that identifies the predictions (for each day, say), and then create a view that shows only the most recent predictions.
You may need to run a chain of queries, for example: create a training dataset with features, run predictions, and save the results. For this you may want to use Cloud Workflows.
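As a rough sketch of the prediction step in that chain (the model, table, and dataset names are placeholders), a daily job could append timestamped churn scores using ML.PREDICT with an append-mode destination table:
from google.cloud import bigquery

client = bigquery.Client()

# Append today's churn scores to a history table; a view on top of this
# table can then filter to the latest prediction_ts per customer.
predict_query = """
SELECT CURRENT_TIMESTAMP() AS prediction_ts, *
FROM ML.PREDICT(MODEL `my-project.churn.churn_model`,
                TABLE `my-project.churn.customer_features`)
"""
job_config = bigquery.QueryJobConfig(
    destination="my-project.churn.churn_predictions",
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
client.query(predict_query, job_config=job_config).result()
The query can be triggered from a scheduled query, Cloud Workflows, or Composer, and the view mentioned above can then surface only the rows with the most recent prediction_ts per customer.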
Another way to achieve this is to create a BigQuery ML model and store its output at runtime, then use the predicted output to act on BigQuery tables as needed.
You can run Dataflow or simple Python code to achieve the required data archival, for example:
from google.cloud import bigquery

# Clear out the destination table before writing a fresh set of predictions.
stream_query = """DELETE FROM `ikea-itsd-ml.test123.YOUR_NEW_TABLE4` WHERE 1=1"""
stream_client = bigquery.Client()
stream_job = stream_client.query(stream_query)
stream_job.result()  # wait for the DELETE to complete

Does Google's AutoML Table shuffle my data samples before training/evaluation?

I searched through the documentation but still have no clue whether or not the service shuffles the data before training/evaluation. I need to know this because my data is time-series, so it would only be realistic to evaluate a trained model on samples from an earlier period of time.
Can someone please let me know the answer, or guide me on how to figure this out?
I know that I can export the evaluation result and work on it, but BigQuery does not seem to respect the order of the original data, and there is no absolute time feature in the data.
It doesn't shuffle the data, it splits it.
Take a look here: About controlling data split. It says:
By default, AutoML Tables randomly selects 80% of your data rows for training, 10% for validation, and 10% for testing.
If your data is time-sensitive, you should use the Time column.
By using it, AutoML Tables will use the earliest 80% of the rows for training, the next 10% of rows for validation, and the latest 10% of rows for testing.
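For intuition (this is not AutoML's actual implementation, just a small pandas sketch with made-up column names), a chronological 80/10/10 split amounts to the following:
import pandas as pd

# Toy time-series frame; 'ts' plays the role of the designated Time column.
df = pd.DataFrame({
    "ts": pd.date_range("2021-01-01", periods=100, freq="D"),
    "feature": range(100),
})

df = df.sort_values("ts").reset_index(drop=True)
n = len(df)
train = df.iloc[: int(n * 0.8)]                    # earliest 80% of rows
validation = df.iloc[int(n * 0.8): int(n * 0.9)]   # next 10%
test = df.iloc[int(n * 0.9):]                      # latest 10%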

Pyspark, dask, or any other python: how to pivot a large table without crashing laptop?

I can pivot a smaller dataset fine using pandas, dask, or pyspark.
However, when the dataset exceeds around 2 million rows, it crashes my laptop. The final pivoted table would have 1,000 columns and about 1.5 million rows. I suspect that on the way to the pivot table there must be some huge RAM usage that exceeds system memory, and I don't understand how pyspark or dask is useful if intermediate steps won't fit in RAM at all times.
I thought dask and pyspark would allow larger-than-RAM datasets even with just 8 GB of RAM. I also thought these libraries would chunk the data for me and never exceed the amount of RAM that I have available. I realize that I could read my huge dataset in very small chunks, pivot each chunk, and immediately write the result of the pivot to a parquet or HDF5 file, manually. This should never exceed RAM. But then wouldn't this manual effort defeat the purpose of all of these libraries? I am under the impression that what I am describing is included right out of the box with these libraries, or am I wrong here?
If I have a 100 GB file of 300 million rows and want to pivot it using a laptop, is it even possible? (I can wait a few hours if needed.)
Can anyone help out here? I'll go ahead and add a bounty for this.
Please just show me how to take a large parquet file that is itself too large for RAM and pivot it into a table that is also too large for RAM, without ever exceeding available RAM (say 8 GB) at any point.
# df is a PySpark DataFrame
from pyspark.sql.functions import count
df_pivot = df.groupby(df.id).pivot("city").agg(count(df.visit_id))
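For the manual chunked route described in the question (read a small chunk, pivot it, write it out immediately), a minimal sketch with pyarrow and pandas could look like the following. It assumes the input parquet is sorted by id so no id spans two chunks, and that the full set of city values is known up front; both are assumptions, and the file names are made up:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

all_cities = ["NYC", "LA", "SF"]                 # assumed to be known in advance
parquet_file = pq.ParquetFile("visits.parquet")  # hypothetical input file
writer = None

for batch in parquet_file.iter_batches(batch_size=200_000):
    chunk = batch.to_pandas()
    # Pivot only this chunk: one row per id, one count column per city.
    pivoted = (chunk.pivot_table(index="id", columns="city",
                                 values="visit_id", aggfunc="count",
                                 fill_value=0)
                    .reindex(columns=all_cities, fill_value=0)
                    .reset_index())
    table = pa.Table.from_pandas(pivoted, preserve_index=False)
    if writer is None:
        writer = pq.ParquetWriter("pivoted.parquet", table.schema)
    writer.write_table(table)

if writer is not None:
    writer.close()
In Spark, writing the pivoted result straight to disk with df_pivot.write.parquet(...) instead of collecting it to the driver may also help keep memory bounded.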

Are there any guidelines on sharding a data set?

Are there any guidelines on choosing the number of shard files for a data set, or the number of records in each shard?
In the examples of using tensorflow.contrib.slim,
there are roughly 1,024 records in each shard of the ImageNet data set (tensorflow/models/inception), and
there are roughly 600 records in each shard of the flowers data set (tensorflow/models/slim).
Do the number of shard files and the number of records in each shard have any impact on the training and the performance of the trained model?
To my knowledge, if we don't split the data set into multiple shards, shuffling will not be quite random, as the capacity of the RandomShuffleQueue may be less than the size of the data set.
Are there any other advantages of using multiple shards?
Update
The documentation says
If you have more reading threads than input files, to avoid the risk that you will have two threads reading the same example from the same file near each other.
Why can't we use 50 threads to read from 5 files?
Newer versions of TensorFlow (2.5+) have a shard feature for tf.data.Dataset.
Below is sample code from the TensorFlow documentation:
A = tf.data.Dataset.range(10)
B = A.shard(num_shards=3, index=0)
list(B.as_numpy_iterator())  # -> [0, 3, 6, 9]: every 3rd element, starting at index 0
When reading a single input file, you can shard elements as follows
d = tf.data.TFRecordDataset(input_file)
d = d.shard(num_workers, worker_index)
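When there are several input files, a common pattern (a sketch based on the tf.data guidance; the file pattern and worker numbers are placeholders) is to shard the file list per worker before interleaving the reads, so each worker reads a disjoint subset of files:
import tensorflow as tf

num_workers, worker_index = 4, 0   # hypothetical distributed setup

# Shard at the file level, before any shuffling or record-level parsing.
files = tf.data.Dataset.list_files("/data/train-*.tfrecord", shuffle=False)
files = files.shard(num_shards=num_workers, index=worker_index)

dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,                      # read 4 files concurrently
    num_parallel_calls=tf.data.AUTOTUNE,
)
dataset = dataset.shuffle(buffer_size=10_000).batch(32)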