run pandas udf on single batch - pandas

According to the docs, pandas UDFs work the following way:
The Python function should take a pandas Series as an input and return
a pandas Series of the same length, and you should specify these in
the Python type hints. Spark runs a pandas UDF by splitting columns
into batches, calling the function for each batch as a subset of the
data, then concatenating the results.
Is it possible to run the pandas UDF on only a single batch? I saw the option spark.sql.execution.arrow.maxRecordsPerBatch, but it defaults to 10,000 and my dataset, which is much smaller, still runs in batches.
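To make the quoted behaviour concrete, here is a minimal pandas-only sketch of "split into batches, call the function per batch, concatenate". It does not use Spark at all; run_in_batches, plus_one and demean are hypothetical names invented for illustration. It shows why batching is invisible for element-wise functions but matters as soon as the function uses a statistic of its input.

```python
import pandas as pd

def run_in_batches(func, series, batch_size):
    """Mimic the documented behaviour: split, apply per batch, concatenate."""
    batches = [series.iloc[start:start + batch_size]
               for start in range(0, len(series), batch_size)]
    return pd.concat([func(batch) for batch in batches])

def plus_one(s: pd.Series) -> pd.Series:
    # Element-wise: the result does not depend on how rows are batched.
    return s + 1

def demean(s: pd.Series) -> pd.Series:
    # Uses a per-batch statistic: the result changes with the batch boundaries.
    return s - s.mean()

values = pd.Series([1.0, 2.0, 3.0, 4.0])
same = run_in_batches(plus_one, values, batch_size=2).equals(plus_one(values))
differs = not run_in_batches(demean, values, batch_size=2).equals(demean(values))
```

If the function needs to see a whole group at once, a grouped operation is probably a better fit than trying to force a single batch.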

Related

pandas to_numeric a large wide dataframe

I need to apply pd.to_numeric to a long and wide (1000+ columns) dataframe where invalid values are coerced as NaN.
Currently I'm using
df.apply(pd.to_numeric, errors="coerce")
which can take a substantial amount of time due to the number of columns.
df.astype()
does not work either, as it does not take a coerce option.
Any comment is appreciated.
As has already been noted in the comments, the amount of data you're working with makes it hard for pandas transformations to be anything but slow.
I recommend you set up a PySpark session on your local machine, transform the DataFrame column types there, and convert to pandas at the end if you really need it.
In PySpark, you can convert all of your DataFrame's columns to float by doing this:
from pyspark.sql.functions import col
df = df.select(*(col(c).cast("float").alias(c) for c in df.columns))
Afterwards you can just save your DataFrame back to where you want it to be (or maybe stick to PySpark and join the group!):
df.toPandas().to_csv('my_file.csv')
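For comparison, this is a small sketch of the pandas coercion from the question, which is the behaviour any replacement has to reproduce. The frame and its column names are made up for illustration; checking a Spark result against it on a small sample first is probably worthwhile.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: every column arrives as strings, some of them invalid.
raw = pd.DataFrame({"a": ["1", "2.5", "oops"], "b": ["x", "4", "5"]})

# Extra keyword arguments to DataFrame.apply are forwarded to pd.to_numeric,
# so each column is converted and unparsable entries become NaN.
coerced = raw.apply(pd.to_numeric, errors="coerce")
```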

Is it possible to use Series.str.extract with Dask?

I'm currently processing a large dataset with Pandas and I have to extract some data using pandas.Series.str.extract.
It looks like this:
df['output_col'] = df['input_col'].str.extract(r'.*"mytag": "(.*?)"', expand=False).str.upper()
It works well; however, as it has to be done about ten times (using various source columns), the performance isn't very good. To improve performance by using several cores, I wanted to try Dask, but it doesn't seem to be supported (I cannot find any reference to an extract method in Dask's documentation).
Is there any way to perform such a pandas operation in parallel?
I have found this method where you basically split your dataframe into multiple ones, create a process per subframes and then concatenate them back.
You should be able to do this like in pandas. It's mentioned in this segment of the documentation, but it might be valuable to expand it.
import pandas as pd
import dask.dataframe as dd

s = pd.Series(["example", "strings", "are useful"])
ds = dd.from_pandas(s, npartitions=2)
ds.str.extract(r"[a-z\s]{4}(.{2})", expand=False).str.upper().compute()
0 PL
1 NG
2 US
dtype: object
Your best bet is to use map_partitions, which enables you to perform general pandas operations to the parts of your series, and acts like a managed version of the multiprocessing method you linked.
def inner(df):
    df['output_col'] = df['input_col'].str.extract(
        r'.*"mytag": "(.*?)"', expand=False).str.upper()
    return df
out = df.map_partitions(inner)
Since this is a string operation, you probably want processes (e.g., the distributed scheduler) rather than threads. Note that your performance will be far better if you load your data using Dask (e.g., dd.read_csv) rather than creating the dataframe in memory and then passing it to Dask.
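To show what map_partitions manages for you, here is a pandas-only sketch of the split, apply and concatenate pattern, run sequentially. apply_by_partition is a hypothetical helper and the sample strings are invented; Dask would run the per-partition calls in parallel rather than in a loop.

```python
import numpy as np
import pandas as pd

PATTERN = r'.*"mytag": "(.*?)"'

def inner(part):
    # Work on a copy so the caller's frame is not modified in place.
    part = part.copy()
    part["output_col"] = part["input_col"].str.extract(PATTERN, expand=False).str.upper()
    return part

def apply_by_partition(frame, func, npartitions):
    # Split the row positions into contiguous chunks, apply func to each, then concatenate.
    positions = np.array_split(np.arange(len(frame)), npartitions)
    return pd.concat([func(frame.iloc[idx]) for idx in positions])

df = pd.DataFrame({"input_col": ['{"mytag": "abc"}', '{"other": "x"}', '{"mytag": "de"}']})
out = apply_by_partition(df, inner, npartitions=2)
```

Because the extraction only looks at one row at a time, the partitioned result matches what inner gives on the whole frame.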

Implementing pythonic statistical functions on spark dataframes

I have very large datasets in spark dataframes that are distributed across the nodes.
I can do simple statistics like mean, stdev, skewness, kurtosis, etc. using the Spark library pyspark.sql.functions.
If I want to use advanced statistical tests like Jarque-Bera (JB) or Shapiro-Wilk (SW), I use Python libraries like scipy, since the standard Apache PySpark libraries don't have them. But in order to do that, I have to convert the Spark dataframe to pandas, which means forcing the data onto the master node like so:
import scipy.stats as stats
pandas_df=spark_df.toPandas()
JBtest=stats.jarque_bera(pandas_df)
SWtest=stats.shapiro(pandas_df)
I have multiple features, and each feature ID corresponds to a dataset on which I want to perform the test statistic.
My question is:
Is there a way to apply these pythonic functions on a spark dataframe while the data is still distributed across the nodes, or do I need to create my own JB/SW test statistic functions in spark?
Thank you for any valuable insight
You should be able to define a vectorized user-defined function that wraps the SciPy function (https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html), like this:
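from pyspark.sql.functions import pandas_udf, PandasUDFType
import scipy.stats as stats
# jarque_bera returns (statistic, p-value) for the whole sample,
# so a grouped-aggregate UDF that yields one value per group is used here.
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def vector_jarque_bera(x):
    return float(stats.jarque_bera(x)[0])
# test (call .groupBy(...) on your feature ID column first to get one statistic per feature):
spark_df.agg(vector_jarque_bera(spark_df['x']).alias('y'))
Note that the vectorized function takes a column as its argument and, because the test summarizes a whole sample, returns one value per group.
(N.B. The @pandas_udf decorator is what transforms the pandas function defined right below it into a vectorized function. The function reduces each group to a single scalar, which is why PandasUDFType.GROUPED_AGG is passed; PandasUDFType.SCALAR would require a returned Series of the same length as the input.)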
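On the "create my own test statistic" part of the question: the Jarque-Bera statistic depends only on the sample size, skewness and excess kurtosis, which the question says Spark can already compute. The NumPy sketch below spells out that formula with population (biased) moments; jarque_bera_statistic is a hypothetical name. Whether a given Spark kurtosis function reports raw or excess kurtosis should be checked before plugging its output in. Shapiro-Wilk relies on the ordered sample values, so this shortcut does not carry over to it.

```python
import numpy as np

def jarque_bera_statistic(sample):
    """JB = n/6 * (S^2 + K^2/4), with S the skewness and K the excess kurtosis."""
    x = np.asarray(sample, dtype=float)
    n = x.size
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # population variance
    skewness = np.mean(centered ** 3) / m2 ** 1.5
    excess_kurtosis = np.mean(centered ** 4) / m2 ** 2 - 3.0
    return n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)

jb = jarque_bera_statistic([1, 2, 3, 4, 5])
```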

Is it possible to use pyspark to speed up regression analysis on each column of a very large size of an array?

I have a very large array and want to run a linear regression on each of its columns. To speed up the calculation, I created a list with each column of the array as an element. I then used pyspark to create an RDD and applied a function to it. I ran into memory problems when creating that RDD (i.e. during parallelization).
I tried increasing spark.driver.memory to 50g in spark-defaults.conf, but the program still appears to hang.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from pyspark import SparkContext
sc = SparkContext("local", "get Linear Coefficients")
def getLinearCoefficients(column):
    y=column[~np.isnan(column)] # Extract the non-NaN values of the column
    x=np.where(~np.isnan(column))[0]+1 # Extract the corresponding indexes plus 1
    # Only fit the linear regression when at least 3 data pairs exist.
    if y.shape[0]>=3:
        model=LinearRegression(fit_intercept=True) # Initialize the linear regression model
        model.fit(x[:,np.newaxis],y) # Fit the model using the data
        n=y.shape[0]
        slope=model.coef_[0]
        intercept=model.intercept_
        r2=r2_score(y,model.predict(x[:,np.newaxis]))
        rmse=np.sqrt(mean_squared_error(y,model.predict(x[:,np.newaxis])))
    else:
        n,slope,intercept,r2,rmse=np.nan,np.nan,np.nan,np.nan,np.nan
    return n,slope,intercept,r2,rmse
random_array=np.random.rand(300,2000*2000) # Here we use a random array without missing data for testing purposes.
columns=[col for col in random_array.T]
columnsRDD=sc.parallelize(columns)
columnsLinearRDD=columnsRDD.map(getLinearCoefficients)
n=np.array([e[0] for e in columnsLinearRDD.collect()])
slope=np.array([e[1] for e in columnsLinearRDD.collect()])
intercept=np.array([e[2] for e in columnsLinearRDD.collect()])
r2=np.array([e[3] for e in columnsLinearRDD.collect()])
rmse=np.array([e[4] for e in columnsLinearRDD.collect()])
The program stalled with the following output.
Exception in thread "dispatcher-event-loop-0" java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.scheduler.TaskSetManager$$anonfun$resourceOffer$1.apply(TaskSetManager.scala:486)
at org.apache.spark.scheduler.TaskSetManager$$anonfun$resourceOffer$1.apply(TaskSetManager.scala:467)
at scala.Option.map(Option.scala:146)
at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:467)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:315)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at org.apache.spark.scheduler.TaskSchedulerImpl.org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:310)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$11.apply(TaskSchedulerImpl.scala:412)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$11.apply(TaskSchedulerImpl.scala:409)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:409)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:396)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:396)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:86)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:64)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I guess it is possible to use pyspark to speed up the calculation, but how can I do it? Should I modify other parameters in spark-defaults.conf? Or vectorize each column of the array (I know the range() function in Python 3 works that way and is much faster)?
That is not going to work that way. You are basically doing three things:
you are using a RDD for parallelization,
you are calling your getLinearCoefficients() function and finally
you call collect() on it to use your existing code.
There is nothing wrong with the first point, but there is a huge mistake in the second and third steps. Your getLinearCoefficients() function does not benefit from pyspark, as you use numpy and sklearn (have a look at this post for a better explanation). For most of the functions you are using, there is a pyspark equivalent.
The problem with the third step is the collect() function. When you call collect(), pyspark brings all the rows of the RDD to the driver and executes the sklearn functions there. Therefore you only get the parallelization that sklearn itself allows. Using pyspark the way you currently do is completely pointless and maybe even a drawback. Pyspark is not a framework that lets you run arbitrary Python code in parallel. When you want to execute your code in parallel with pyspark, you have to use the pyspark functions.
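A related detail about collect(): the question calls it five times on an RDD that is not cached, and each call is a separate action that re-runs the map. A sketch of collecting once and unpacking the fields is shown below; the results list is a made-up stand-in for what a single columnsLinearRDD.collect() would return.

```python
import numpy as np

# Stand-in for one call to columnsLinearRDD.collect():
# one (n, slope, intercept, r2, rmse) tuple per column.
results = [(3, 0.5, 1.0, 0.9, 0.1), (4, -1.0, 2.0, 0.8, 0.2)]

# zip(*results) regroups the tuples field by field, so every field
# becomes its own array without touching the RDD again.
n, slope, intercept, r2, rmse = (np.array(field) for field in zip(*results))
```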
So what can you do?
First of all, you could use the n_jobs parameter of the LinearRegression class to use more than one core for your calculation. This allows you at least to use all cores of one machine.
Another thing you could do is step away from sklearn and use the LinearRegression of pyspark (have a look at the guide and the API). With this you can use a whole cluster for your linear regression.
For large datasets with more than 100k samples, using LinearRegression is discouraged. General advice is to use the SGDRegressor and set the parameters correctly, so that OLS loss is being used:
from sklearn.linear_model import SGDRegressor
And replace your LinearRegression with:
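model = SGDRegressor(loss='squared_loss', penalty='none', fit_intercept=True)
Setting loss='squared_loss' and penalty='none' sets the SGDRegressor to use OLS and no regularization, thus it should produce results similar to LinearRegression. (Recent scikit-learn releases name this loss 'squared_error' and take penalty=None, so adjust the arguments to your installed version.)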
Try out some options like learning_rate and eta0/power_t to find an optimum in the performance.
Furthermore I recommend using train_test_split to split the data set and use the test set for scoring. A good test size to begin with is test_size=.3.
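Since the question also asks about vectorizing, here is a sketch of a third route that is not part of the answers above: with a single predictor, the least-squares slope and intercept have a closed form, so all columns can be fitted at once with NumPy instead of one model per column. columnwise_linear_fit is a hypothetical helper; it uses the 1-based row index as x and skips NaN entries, as in the question, but leaves out r2 and rmse. For an array as large as the one in the question, the temporaries would likely require processing the columns in blocks.

```python
import numpy as np

def columnwise_linear_fit(values):
    """Fit y = slope * x + intercept per column, where x is the 1-based row index."""
    values = np.asarray(values, dtype=float)
    valid = ~np.isnan(values)
    x = np.arange(1, values.shape[0] + 1, dtype=float)[:, np.newaxis]
    n = valid.sum(axis=0).astype(float)
    xs = np.where(valid, x, 0.0)       # x restricted to valid entries
    ys = np.where(valid, values, 0.0)  # y restricted to valid entries
    with np.errstate(invalid="ignore", divide="ignore"):
        x_mean = xs.sum(axis=0) / n
        y_mean = ys.sum(axis=0) / n
        dx = np.where(valid, x - x_mean, 0.0)
        dy = np.where(valid, ys - y_mean, 0.0)
        slope = (dx * dy).sum(axis=0) / (dx * dx).sum(axis=0)
        intercept = y_mean - slope * x_mean
    # Mirror the question's rule: fewer than 3 data pairs gives NaN results.
    too_few = n < 3
    slope[too_few] = np.nan
    intercept[too_few] = np.nan
    return n, slope, intercept

nan = np.nan
data = np.array([[3.0, nan, 1.0],
                 [5.0, 5.0, nan],
                 [7.0, nan, nan],
                 [9.0, 9.0, nan],
                 [11.0, 11.0, 4.0]])
n_valid, slope, intercept = columnwise_linear_fit(data)
```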

Fastest way to load multiple numpy arrays into spark rdd?

I'm new to Spark. In my application, I would like to create an RDD from many numpy arrays. Each numpy array is (10,000, 5,000). Currently, I'm trying the following:
rdd_list = []
for np_array in np_arrays:
    pandas_df = pd.DataFrame(np_array)
    spark_df = sqlContext.createDataFrame(pandas_df) ##SLOW STEP
    rdd_list.append(spark_df.rdd)
big_rdd = sc.union(rdd_list)
All of the steps are fast, except that converting the pandas dataframe to a Spark dataframe is very slow. If I use a subset of the numpy array, such as (10,000, 500), it takes a couple of minutes to convert it to a Spark dataframe. But if I use the full numpy array (10,000, 5,000), it just hangs.
Is there anything I can do to speed up my workflow? Or should I be doing this in a completely different way? (FYI, I'm kind of stuck with the initial numpy arrays.)
For my application I had used the class ArrayRDD from the sparkit-learn project for loading numpy arrays into spark RDDs. I had no complaints but your mileage may vary.
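As a rough illustration of the block idea (this is not the sparkit-learn API), the arrays can be cut into row blocks so that each RDD element is a whole NumPy block rather than a single row that went through pandas. to_row_blocks is a hypothetical helper, and the small arrays stand in for the real (10,000, 5,000) ones.

```python
import numpy as np

def to_row_blocks(arrays, rows_per_block):
    """Yield (array_index, start_row, block) so each block can become one RDD element."""
    for i, arr in enumerate(arrays):
        for start in range(0, arr.shape[0], rows_per_block):
            yield i, start, arr[start:start + rows_per_block]

# Small stand-ins for the real arrays.
np_arrays = [np.zeros((10, 4)), np.ones((7, 4))]
blocks = list(to_row_blocks(np_arrays, rows_per_block=4))
# With a SparkContext `sc`, the blocks could then be distributed with sc.parallelize(blocks).
```

The block size is a tuning knob: larger blocks mean fewer, heavier elements to serialize.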