Pandas to Koalas (Databricks) conversion code for big scoring dataset - pandas

I have been running into OOM errors while trying to score a huge dataset. The dataset shape is (15 million, 230). Since the working environment is Databricks, I decided to rewrite the scoring code in Koalas and take advantage of the Spark architecture to alleviate my memory issues.
However, I've run into some issues trying to convert part of my code from pandas to Koalas. Any help on how to work around this issue is much appreciated.
Currently, I'm trying to add a few adjusted columns to my dataframe, but I'm getting a PandasNotImplementedError: The method pd.Series.__iter__() is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
Code/problem area:
df[new_sixmon_cols] = df[sixmon_cols].div([min(6, i) for i in df['mob']], axis=0)
df[new_twelvemon_cols] = df[twelvemon_cols].div([min(12, i) for i in df['mob']], axis=0)
df[new_eighteenmon_cols] = df[eighteenmon_cols].div([min(18, i) for i in df['mob']], axis=0)
df[new_twentyfourmon_cols] = df[twentyfourmon_cols].div([min(24, i) for i in df['mob']], axis=0)
print('The shape of df after adding adjusted columns for all non-indicator columns is:')
print(df.shape)
I believe the problem area is the list comprehension inside div([min(6, i) for i in df['mob']]), but I'm not certain how to convert this particular piece of code efficiently, or, more generally, how to handle scoring a big dataset leveraging Databricks or the cloud environment.
Some pointers about the data/model:
The data has, of course, already been through feature reduction and selection.
I built the model on 2.5 million records and am now working on the scoring files.
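One direction I'm considering (a rough sketch, not yet verified) is to express min(6, i) as a column operation via clip instead of a Python list comprehension, which is what triggers __iter__ — something like:
import databricks.koalas as ks   # on newer runtimes: import pyspark.pandas as ps

# ks.set_option('compute.ops_on_diff_frames', True)   # may be needed on some Koalas versions

capped_mob = df['mob'].clip(upper=6)           # elementwise min(6, mob)
for new_col, old_col in zip(new_sixmon_cols, sixmon_cols):
    df[new_col] = df[old_col] / capped_mob     # column arithmetic, no driver-side loop

# Repeat with clip(upper=12), clip(upper=18) and clip(upper=24) for the other column groups.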

Related

Feature Selection for Text Classification with Information Gain in R

I'm trying to prepare my dataset, ideally for binary document classification with an SVM algorithm in R.
The dataset is a combination of 150171 labelled variables and 2099 observations stored in a dataframe. The variables are a combination of uni- and bigrams retrieved from a text dataset.
When I try to calculate the information gain as a feature selection method, the error "cannot allocate vector of size X Gb" occurs, although I have already extended my memory and I'm running on a 64-bit operating system. I tried the following package:
install.packages("FSelector")
library(FSelector)
value <- information.gain(Usefulness ~., dat_SentimentAnalysis)
Does anybody know a solution/any trick for this problem?
Thank you very much in advance!

label encoding in dask_cudf dataframe

I am trying to use dask_cudf to preprocess a very large dataset (150,000,000+ records) for multi-class xgboost training and am having trouble encoding the class column (dtype is string). I tried using the 'replace' function, but the error message said the two dtypes must match. I tried using dask_ml.LabelEncoder, but it said string arrays aren't supported in cudf. I tried using compute() in various ways, but I kept running into out-of-memory errors (I'm assuming because operations on a cudf dataframe require a smaller dataset). I also tried pulling the class column out, encoding it, and then merging it back into the dataframe, but the partitions do not line up. I tried manually lining them up, but dask_cudf seemingly does not support repartitioning using the 'divisions' parameter (I got an error saying something like 'old and new partitions do not match'). Any help on how to do this would be much appreciated.
Strings aren't supported by xgboost. Not having seen your data, here are a few quick-and-dirty ways I've modified string columns for training, since the strings themselves generally don't matter:
If the strings are actually numeric (like dates), convert them to int (int8, int16, int32).
I did this by hash-mapping the strings and then running xgboost (basically creating a reversible conversion between string and integer, as long as you don't change the integers) and training on the current column, now hashed as an integer.
If the strings are classes, manually assign class numbers (0, 1, 2, ..., n) in a new column and train on that one.
There are definitely other, better ways. As for the second part of your question, I left a comment.
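For the class-numbering route, here is a rough sketch of one way it could look with dask_cudf; the frame name ddf and the column name 'label' are placeholders, and the exact calls may need adjusting to your cudf version:
import numpy as np
import cudf

# ddf is assumed to be a dask_cudf DataFrame with a string class column 'label'.
classes = ddf['label'].unique().compute()                     # small cudf Series of distinct labels
lookup = classes.to_frame(name='label').reset_index(drop=True)
lookup['label_id'] = np.arange(len(lookup), dtype='int32')    # class codes 0..n-1

# Broadcast-join the tiny lookup table onto every partition; no repartitioning needed.
ddf = ddf.map_partitions(lambda part, lk: part.merge(lk, on='label', how='left'), lookup)

# Train xgboost on 'label_id' and drop the original string column.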
Now, your XGBoost model and your dask_cudf dataframe's per-GPU allocation must fit on a single GPU, or you will get memory errors. If your model will be considering a large amount of data, train on the cluster with the largest GPU memory you can get. A100s come with 40 GB or 80 GB. Some older compute GPUs, the V100 and GV100, have 32 GB. The A6000 and RTX 8000 have 48 GB; then it goes to 24 GB, 16 GB, and lower from there. Size your GPUs accordingly.

What's the real advantage of using Tensorflow Transform?

Today I'm using pandas as my main data pre-processing tool in my project, where I need to do some transformations on the data to ensure it is in the correct format expected by my Python class.
So I heard about TF Transform and tested it a little, but I didn't see any obvious advantage (obviously I'm referring to the data transformation itself, not to a machine learning pipeline).
For example, I wrote a simple piece of TFT code to uppercase all values in my dataframe column:
upper = tf.strings.upper(input, encoding='', name=None)
The execution time of this pre-processing function is: 17.1880068779
This, on the other hand, is the code I use to do the exact same thing with a dataframe:
x = dataset['CITY'].str.upper()
The execution time is: 0.0188028812408
So, am I doing something wrong? I think that if we had dozens of transformations and a dataset of millions of lines, maybe TFT would come out ahead in this comparison, but for a 100k-row dataframe it doesn't seem so useful.
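For reference, a minimal sketch of how the two operations could be timed in isolation (hypothetical data; this measures only the raw string op in eager mode, not a full tf.Transform/Beam pipeline):
import time
import pandas as pd
import tensorflow as tf

cities = pd.Series(['sao paulo'] * 100_000, name='CITY')    # hypothetical 100k-row column

t0 = time.perf_counter()
upper_pd = cities.str.upper()                               # pandas version
t1 = time.perf_counter()
upper_tf = tf.strings.upper(tf.constant(cities.tolist()))   # raw TF op
t2 = time.perf_counter()

print(f'pandas: {t1 - t0:.4f}s  tf.strings.upper: {t2 - t1:.4f}s')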

Pandas v 0.25 groupby with many columns gives memory error

After updating to pandas v0.25.2, a script doing a groupby over many columns on a large dataframe no longer works. I get a memory error:
MemoryError: Unable to allocate array with shape (some huge number...,) and data type int64
Doing a bit of research, I found issue #14942 reported on GitHub for an earlier version:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cat': np.random.randint(0, 255, size=3000000),
    'int_id': np.random.randint(0, 255, size=3000000),
    'other_id': np.random.randint(0, 10000, size=3000000),
    'foo': 0
})
df['cat'] = df.cat.astype(str).astype('category')

# killed after 6 minutes of 100% cpu and 90G maximum main memory usage
grouped = df.groupby(['cat', 'int_id', 'other_id']).count()
Running this code (on version 0.25.2) also gives a memory error. Am I doing something wrong (has the groupby behaviour in pandas v0.25 changed?), or has this issue, which is marked as resolved, returned?
Use observed=True to fix it and prevent the groupby from expanding all possible combinations of the factor variables:
grouped = df.groupby(['cat', 'int_id', 'other_id'], observed=True).count()
There is a related GitHub Issue: PERF: groupby with many empty groups memory blowup.
While the proposed solution addresses the issue, another problem is likely to arise when dealing with larger datasets. pandas groupby is slow and memory hungry; it may need 5-10x the memory of the dataset. A more effective solution is to use a tool that is an order of magnitude faster and less memory hungry, and that seamlessly integrates with pandas by reading directly from the dataframe's memory. There is no need for a data round trip, and typically no need for extensive data chunking.
My new tool of choice for quick data aggregation is https://duckdb.org. It takes your existing dataframe df and queries it directly, without even importing it into the database. Here is an example final result using your dataframe-generation code; notice that the total time was 0.45 sec. I'm not sure why pandas does not use DuckDB for the groupby under the hood.
The db object is created using the small wrapper class below, which lets you simply type db = DuckDB() and be ready to explore the data in any project. You can expand it further, or even simplify it using %sql magics (see the DuckDB documentation). By the way, sql() returns a dataframe, so you can also do db.sql(...).pivot_table(...); it is that simple.
import duckdb

class DuckDB:
    def __init__(self, db=None):
        self.db_loc = db or ':memory:'
        self.db = duckdb.connect(self.db_loc)

    def sql(self, sql=""):
        # Run a query and return the result as a pandas dataframe.
        return self.db.execute(sql).fetchdf()

    def __del__(self):
        self.db.close()
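For illustration, here is a hypothetical usage of the wrapper, reproducing the groupby from the question; the view name df_view and the explicit register() call are my own additions:
db = DuckDB()
db.db.register('df_view', df)    # register the pandas dataframe built with the question's code
grouped = db.sql("""
    SELECT cat, int_id, other_id, COUNT(*) AS n
    FROM df_view
    GROUP BY cat, int_id, other_id
""")
print(grouped.head())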
Note: DuckDB is good but not perfect; still, it turned out to be far more stable than Dask or even PySpark, with a much simpler setup. For larger datasets you may need a real database, but for datasets that fit in memory this is great. Regarding memory usage: if you have a larger dataset, make sure you limit DuckDB's memory with pragmas, or it will eat it all in no time. The limit simply spills the excess to disk, without you having to deal with data chunking. Also, do not treat it as a persistent database; treat it as an in-memory database, and if you need to keep results, export them to parquet instead of saving the database file. The storage format is not stable between releases, so you would have to export to parquet anyway to move from one version to the next.
I expanded this dataframe to 300 million records (around 1.2 billion values in total, roughly 9 GB). It still completed your groupby and other summary stats on a 32 GB machine with 18 GB still free.

Pandas is using lot of memory

I have a Flask app where an API is exposed to dump data from an Oracle database to a Postgres database.
I am using pandas to copy the contents of tables from Oracle, MySQL and Postgres to Postgres.
After running it constantly for 15 days or so, the CPU/memory consumption is very high.
It usually transfers at least 5 million records every two days.
Can anyone help me optimize the pandas writes?
If you have a preprocessing step, I suggest using dask. Dask offers parallel computation and does not fill memory unless you explicitly force it to (forcing means computing a task on the dataframe). Refer to the dask documentation for the read_sql_table method.
import dask.dataframe as dd

# Read the data as a dask dataframe. This line will change with your source;
# since you are reading from a database, dd.read_sql_table is the closer fit.
df = dd.read_csv('path/to/file')

# ... do the preprocessing steps on the dask dataframe ...

# Finally, write it out (e.g. df.to_sql(...) or df.to_parquet(...)).
This approach comes in very handy if you have to deal with a large dataset and a preprocessing step, possibly a reduction. Refer to the dask documentation for more information. It may give a significant improvement depending on your preprocessing step.
Alternatively, you can use the chunksize parameter of pandas, as #TrigonaMinima suggested. This lets your machine retrieve the data in chunks, "x rows at a time", so you can apply the preprocessing to each chunk as above; this may require creating a temp file and appending to it.
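A minimal sketch of that chunked approach, assuming SQLAlchemy engines; the connection strings, table names, and chunk size are placeholders:
import pandas as pd
from sqlalchemy import create_engine

src = create_engine('oracle+cx_oracle://user:pass@src-host:1521/?service_name=SRC')
dst = create_engine('postgresql+psycopg2://user:pass@dst-host:5432/target_db')

# Stream the source table in chunks instead of loading it all at once.
for chunk in pd.read_sql('SELECT * FROM source_table', src, chunksize=50_000):
    # ... any preprocessing on the chunk ...
    chunk.to_sql('target_table', dst, if_exists='append', index=False)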