Implementing pythonic statistical functions on spark dataframes - pandas

I have very large datasets in spark dataframes that are distributed across the nodes.
I can do simple statistics like mean, stdev, skewness, kurtosis etc. using the Spark library pyspark.sql.functions.
If I want to use advanced statistical tests like Jarque-Bera (JB) or Shapiro-Wilk (SW), I use Python libraries like scipy, since the standard Apache PySpark libraries don't have them. But in order to do that, I have to convert the Spark dataframe to pandas, which means forcing the data onto the master node like so:
import scipy.stats as stats
pandas_df=spark_df.toPandas()
JBtest=stats.jarque_bera(pandas_df)
SWtest=stats.shapiro(pandas_df)
I have multiple features, and each feature ID corresponds to a dataset on which I want to perform the test statistic.
My question is:
Is there a way to apply these pythonic functions on a spark dataframe while the data is still distributed across the nodes, or do I need to create my own JB/SW test statistic functions in spark?
Thank you for any valuable insight

You should be able to define a vectorized user-defined function that wraps the Pandas function (https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html), like this:
from pyspark.sql.functions import pandas_udf, PandasUDFType
import scipy.stats as stats
@pandas_udf('double', PandasUDFType.SCALAR)
def vector_jarque_bera(x):
    return stats.jarque_bera(x)
# test:
spark_df.withColumn('y', vector_jarque_bera(spark_df['x']))
Note that the vectorized function takes a column as its argument and returns a column.
(N.B. The @pandas_udf decorator is what transforms the Pandas function defined right below it into a vectorized function. Each element of the returned vector is itself a scalar, which is why the argument PandasUDFType.SCALAR is passed. Be aware that a SCALAR pandas UDF must return a Series of the same length as its input, whereas stats.jarque_bera returns a single statistic and p-value for the whole sample, so the wrapper may need adapting, for example to a grouped UDF, before it behaves as intended.)
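Since each feature ID has its own dataset, a grouped variant may be a closer fit than a scalar UDF. The following is only a minimal sketch, assuming Spark 3.0 or later (where `groupBy(...).applyInPandas` is available) and two hypothetical columns, `feature_id` and `value`; the names `normality_tests`, `run_per_feature` and `RESULT_SCHEMA` are made up for illustration. Each group is handed to the function as a pandas DataFrame, so every single feature has to fit in the memory of one executor.

```python
import pandas as pd
from scipy import stats

# Hypothetical column names: "feature_id" identifies the feature,
# "value" holds the observations to be tested.
RESULT_SCHEMA = (
    "feature_id long, jb_stat double, jb_pvalue double, "
    "sw_stat double, sw_pvalue double"
)


def normality_tests(pdf: pd.DataFrame) -> pd.DataFrame:
    """Run both tests on the rows of one feature and return a one-row frame."""
    values = pdf["value"].to_numpy(dtype=float)
    jb_stat, jb_pvalue = stats.jarque_bera(values)
    sw_stat, sw_pvalue = stats.shapiro(values)
    return pd.DataFrame(
        {
            "feature_id": [pdf["feature_id"].iloc[0]],
            "jb_stat": [float(jb_stat)],
            "jb_pvalue": [float(jb_pvalue)],
            "sw_stat": [float(sw_stat)],
            "sw_pvalue": [float(sw_pvalue)],
        }
    )


def run_per_feature(spark_df):
    """Ask Spark to call normality_tests once per feature_id on the executors."""
    return spark_df.groupBy("feature_id").applyInPandas(
        normality_tests, schema=RESULT_SCHEMA
    )
```

Because `normality_tests` is plain pandas and scipy, it can be checked locally on a small pandas DataFrame before it is handed to Spark through `run_per_feature(spark_df)`.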

Related

Combining pandas dataframes and ArcGIS feature classes

I have a bit of a general question about the compatibility of Pandas dataframes and Arc featureclasses.
My current project is within ArcGIS, so I am mapping mostly with feature classes. I am, however, most familiar with using pandas to perform simple data analysis with tables. Therefore, I am attempting to work with dataframes for the most part, and then join their data to feature classes for final mapping using some key field common between the sets.
Attempts:
1. I have found that arcpy AddJoin does not accept dataframes.
2. I am currently trying to convert the df to CSV and then do an AddJoin; however, I am unsure whether this is supported, and I far prefer the functionality of filtering dataframes with "df.loc" etc.
An update cursor seems to be a good option; however, I am having trouble accessing the key field of the "row" in my loop to match records. I will post another question about this, as it is a separate issue.
Which of these or other options is the best for this purpose?
Thanks!
Esri introduced something called Spatially Enabled DataFrame:
The Spatially Enabled DataFrame inserts a custom namespace called spatial into the popular Pandas DataFrame structure to give it spatial abilities. This allows you to use intuitive, pandorable operations on both the attribute and spatial columns.
import arcpy
import pandas as pd
# important as it "enhances" Pandas by importing these classes
from arcgis.features import GeoAccessor, GeoSeriesAccessor
# from a shape file
df = pd.DataFrame.spatial.from_featureclass(r"data\hospitals.shp")
# from a map layer
project = arcpy.mp.ArcGISProject('CURRENT')
map = project.activeMap
first_layer = map.listLayers()[0]
layer_name = first_layer.name
df = pd.DataFrame.spatial.from_featureclass(layer_name)
# or directly by name
df = pd.DataFrame.spatial.from_featureclass("Streets")
# or if nested within a group layer (e.g. Buildings)
df = pd.DataFrame.spatial.from_featureclass(r"Buildings\Residential")
# save to shapefile
df.spatial.to_featureclass(location=r"c:\temp\residential_buildings.shp")
However, you have to use intermediate files if you go back and forth (to my knowledge). Although it's a bit tricky to have geopandas installed alongside arcpy, it may be worth looking into using (only) geopandas.
IMHO, you should avoid going back and forth between arcpy and pandas unnecessarily. Pandas lets you merge, join and concat dataframes. Or you may be able to do everything in geopandas without needing to touch arcpy functions at all.
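As an illustration of doing the join on the pandas side, here is a minimal sketch with two made-up tables that share a hypothetical key field `FACILITY_ID`. In practice the first frame would come from `from_featureclass`; here it is built by hand so the example runs without ArcGIS.

```python
import pandas as pd

# Hypothetical attribute table, standing in for a frame read from a feature class.
features = pd.DataFrame(
    {
        "FACILITY_ID": [101, 102, 103],
        "NAME": ["North Clinic", "East Clinic", "South Clinic"],
    }
)

# Hypothetical analysis results, filtered with ordinary pandas indexing.
visits = pd.DataFrame(
    {"FACILITY_ID": [101, 103, 104], "visits": [250, 90, 40]}
)
visits = visits.loc[visits["visits"] > 50]

# A left join keeps every feature; features without a match get NaN.
# validate raises an error if the key is unexpectedly duplicated.
joined = features.merge(
    visits, on="FACILITY_ID", how="left", validate="one_to_one"
)
print(joined)
```

If `features` is a spatially enabled frame, the joined result should still carry its geometry column, so it could then be written out once with `to_featureclass` rather than round-tripping through CSV.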

joblib.Memory and pandas.DataFrame inputs

I've been finding that joblib.Memory.cache results in unreliable caching when using dataframes as inputs to the decorated functions. Playing around, I found that joblib.hash results in inconsistent hashes, at least in some cases. If I understand correctly, joblib.hash is used by joblib.Memory, so this is probably the source of the problem.
Problems seem to occur when new columns are added to dataframes followed by a copy, or when a dataframe is saved and loaded from disk. The following example compares the inconsistent hash output when applied to dataframes, or the consistent results when applied to the equivalent numpy data.
import pandas as pd
import joblib
df = pd.DataFrame({'A':[1,2,3],'B':[4.,5.,6.], })
df.index.name='MyInd'
df['B2'] = df['B']**2
df_copy = df.copy()
df_copy.to_csv("df.csv")
df_fromfile = pd.read_csv('df.csv').set_index('MyInd')
print("DataFrame Hashes:")
print(joblib.hash(df))
print(joblib.hash(df_copy))
print(joblib.hash(df_fromfile))
def _to_tuple(df):
    return (df.values, df.columns.values, df.index.values, df.index.name)
print("Equivalent Numpy Hashes:")
print(joblib.hash(_to_tuple(df)))
print(joblib.hash(_to_tuple(df_copy)))
print(joblib.hash(_to_tuple(df_fromfile)))
results in output:
DataFrame Hashes:
4e9352c1ffc14fb4bb5b1a5ad29a3def
2d149affd4da6f31bfbdf6bd721e06ef
6843f7020cda9d4d3cbf05dfc47542d4
Equivalent Numpy Hashes:
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
6ad89873c7ccbd3b76ae818b332c1042
The "Equivalent Numpy Hashes" output is the behavior I'd like. I'm guessing the problem is due to some kind of complex internal metadata that DataFrames utilize. Is there any canonical way to use joblib.Memory.cache on pandas DataFrames so that it caches based upon the data values only?
A "good enough" workaround would be if there is a way a user can tell joblib.Memory.cache to utilize something like my _to_tuple function above for specific arguments.
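One possible shape for such a workaround is to compute a key from the data yourself and tell joblib to skip the DataFrame argument with the `ignore` option of `Memory.cache`. This is only a sketch, not a canonical joblib feature: `frame_key`, `column_sums` and the `calls` list are hypothetical names, and the key is built with `pd.util.hash_pandas_object` instead of `_to_tuple`, which is a substitution on my part.

```python
import hashlib
import tempfile

import joblib
import pandas as pd

# Throwaway cache directory for the demonstration.
memory = joblib.Memory(location=tempfile.mkdtemp(), verbose=0)
calls = []  # only here to show when the function body really runs


def frame_key(df: pd.DataFrame) -> str:
    """Build a key from values, index, column names and index name only."""
    digest = hashlib.sha256()
    row_hashes = pd.util.hash_pandas_object(df, index=True).to_numpy()
    digest.update(row_hashes.tobytes())
    digest.update(repr([str(c) for c in df.columns]).encode("utf-8"))
    digest.update(repr(df.index.name).encode("utf-8"))
    return digest.hexdigest()


@memory.cache(ignore=["df"])
def _column_sums_cached(key: str, df: pd.DataFrame) -> dict:
    # joblib hashes only `key`; `df` is excluded from the cache lookup.
    calls.append(key)
    return df.sum().to_dict()


def column_sums(df: pd.DataFrame) -> dict:
    """Public wrapper: callers pass just the DataFrame."""
    return _column_sums_cached(frame_key(df), df)


df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, 5.0, 6.0]})
df.index.name = "MyInd"
df["B2"] = df["B"] ** 2

first = column_sums(df)
second = column_sums(df.copy())  # same data, so this should be a cache hit
print(first == second, len(calls))
```

The trade-off is that the cache is only as trustworthy as the key: anything the key leaves out (for example dtypes, in this sketch) will not invalidate a cached result.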

run pandas udf on single batch

According to the docs, pandas UDFs work the following way:
The Python function should take a pandas Series as an input and return
a pandas Series of the same length, and you should specify these in
the Python type hints. Spark runs a pandas UDF by splitting columns
into batches, calling the function for each batch as a subset of the
data, then concatenating the results.
Is it possible to run the pandas UDF on only a single batch? I saw the option spark.sql.execution.arrow.maxRecordsPerBatch, but its default is 10,000 and my dataset, which is much smaller, still runs in batches.
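To make the batching described in the quoted docs concrete, here is a pure-pandas simulation (no Spark involved). The helper `run_in_batches` is a hypothetical stand-in for what Spark does, and `center` is an example of a Series-to-Series function whose result depends on which rows share a batch. As far as I understand, maxRecordsPerBatch is only an upper bound and each partition is processed separately, which would explain batches on a small dataset, but treat that as an assumption.

```python
import pandas as pd


def center(values: pd.Series) -> pd.Series:
    """Series-to-Series function whose output depends on the whole input."""
    return values - values.mean()


def run_in_batches(func, column: pd.Series, batch_size: int) -> pd.Series:
    """Hypothetical stand-in for Spark's batching: split, apply, concatenate."""
    pieces = [
        func(column.iloc[start:start + batch_size])
        for start in range(0, len(column), batch_size)
    ]
    return pd.concat(pieces)


column = pd.Series([1.0, 2.0, 3.0, 10.0])

whole = center(column)                       # one batch: the mean is 4.0
batched = run_in_batches(center, column, 2)  # two batches: means 1.5 and 6.5

print(whole.tolist())
print(batched.tolist())
```

The two results differ, which is why a function that needs a statistic over all rows cannot rely on receiving them in one scalar pandas UDF call.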

Is it possible to use Series.str.extract with Dask?

I'm currently processing a large dataset with Pandas and I have to extract some data using pandas.Series.str.extract.
It looks like this:
df['output_col'] = df['input_col'].str.extract(r'.*"mytag": "(.*?)"', expand=False).str.upper()
It works well; however, as it has to be done about ten times (using various source columns), the performance isn't very good. To improve the performance by using several cores, I wanted to try Dask, but it doesn't seem to be supported (I cannot find any reference to an extract method in Dask's documentation).
Is there any way to perform such a Pandas operation in parallel?
I have found this method, where you basically split your dataframe into multiple ones, create a process per subframe, and then concatenate them back.
You should be able to do this like in pandas. It's mentioned in this segment of the documentation, but it might be valuable to expand it.
import pandas as pd
import dask.dataframe as dd
s = pd.Series(["example", "strings", "are useful"])
ds = dd.from_pandas(s, 2)
ds.str.extract(r"[a-z\s]{4}(.{2})", expand=False).str.upper().compute()
0 PL
1 NG
2 US
dtype: object
Your best bet is to use map_partitions, which enables you to perform general pandas operations to the parts of your series, and acts like a managed version of the multiprocessing method you linked.
def inner(df):
    df['output_col'] = df['input_col'].str.extract(
        r'.*"mytag": "(.*?)"', expand=False).str.upper()
    return df
out = df.map_partitions(inner)
Since this is a string operation, you probably want processes (e.g., the distributed scheduler) rather than threads. Note that your performance will be far better if you load your data using dask (e.g., dd.read_csv) rather than creating the dataframe in memory and then passing it to dask.
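To show what a partition-wise function like `inner` does, here is a pandas-only sketch that applies it to slices of a small made-up frame and concatenates the pieces. The helper `apply_by_partition` is hypothetical and runs sequentially; dask's map_partitions plays that role and schedules the pieces in parallel. This version of `inner` uses `assign` so that it returns a new frame instead of modifying its input.

```python
import pandas as pd


def inner(df: pd.DataFrame) -> pd.DataFrame:
    """Same extraction as above, returning a new frame instead of mutating."""
    extracted = df["input_col"].str.extract(
        r'.*"mytag": "(.*?)"', expand=False
    )
    return df.assign(output_col=extracted.str.upper())


def apply_by_partition(df: pd.DataFrame, func, npartitions: int) -> pd.DataFrame:
    """Hypothetical, sequential stand-in for map_partitions."""
    size = max(1, -(-len(df) // npartitions))  # ceiling division
    parts = [df.iloc[i:i + size] for i in range(0, len(df), size)]
    return pd.concat([func(part) for part in parts])


# Made-up sample rows; the third one has no "mytag" entry.
df = pd.DataFrame(
    {
        "input_col": [
            '{"id": 1, "mytag": "abc"}',
            '{"mytag": "x y", "k": 2}',
            '{"other": "z"}',
            '{"id": 4, "mytag": "last"}',
        ]
    }
)

whole = inner(df)
by_parts = apply_by_partition(df, inner, npartitions=2)
print(by_parts)
```

Because the extraction works row by row, splitting the frame does not change the result, which is what makes this kind of operation a good candidate for map_partitions.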

Is it possible to use pyspark to speed up regression analysis on each column of a very large size of an array?

I have a very large array and want to do linear regression on each of its columns. To speed up the calculation, I created a list with each column of the array as an element. I then used pyspark to create an RDD and applied a function to it. I ran into memory problems when creating that RDD (i.e. during parallelization).
I have tried increasing spark.driver.memory to 50g in spark-defaults.conf, but the program still seems to hang.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from pyspark import SparkContext
sc = SparkContext("local", "get Linear Coefficients")
def getLinearCoefficients(column):
    y=column[~np.isnan(column)] # Extract column non-nan values
    x=np.where(~np.isnan(column))[0]+1 # Extract corresponding indices plus 1
    # We only do linear regression interpolation when at least 3 data pairs exist.
    if y.shape[0]>=3:
        model=LinearRegression(fit_intercept=True) # Initialize linear regression model
        model.fit(x[:,np.newaxis],y) # Fit the model using data
        n=y.shape[0]
        slope=model.coef_[0]
        intercept=model.intercept_
        r2=r2_score(y,model.predict(x[:,np.newaxis]))
        rmse=np.sqrt(mean_squared_error(y,model.predict(x[:,np.newaxis])))
    else:
        n,slope,intercept,r2,rmse=np.nan,np.nan,np.nan,np.nan,np.nan
    return n,slope,intercept,r2,rmse
random_array=np.random.rand(300,2000*2000) # Here we use a random array without missing data for testing purposes.
columns=[col for col in random_array.T]
columnsRDD=sc.parallelize(columns)
columnsLinearRDD=columnsRDD.map(getLinearCoefficients)
n=np.array([e[0] for e in columnsLinearRDD.collect()])
slope=np.array([e[1] for e in columnsLinearRDD.collect()])
intercept=np.array([e[2] for e in columnsLinearRDD.collect()])
r2=np.array([e[3] for e in columnsLinearRDD.collect()])
rmse=np.array([e[4] for e in columnsLinearRDD.collect()])
The program stalled with output like the following:
Exception in thread "dispatcher-event-loop-0" java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.scheduler.TaskSetManager$$anonfun$resourceOffer$1.apply(TaskSetManager.scala:486)
at org.apache.spark.scheduler.TaskSetManager$$anonfun$resourceOffer$1.apply(TaskSetManager.scala:467)
at scala.Option.map(Option.scala:146)
at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:467)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:315)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
at org.apache.spark.scheduler.TaskSchedulerImpl.org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:310)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$11.apply(TaskSchedulerImpl.scala:412)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4$$anonfun$apply$11.apply(TaskSchedulerImpl.scala:409)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:409)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$4.apply(TaskSchedulerImpl.scala:396)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:396)
at org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:86)
at org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:64)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I guess it is possible to use pyspark to speed up the calculation, but how could I do it? By modifying other parameters in spark-defaults.conf? Or by vectorizing each column of the array (I know the range() function in Python 3 works that way and it is really faster)?
That is not going to work that way. You are basically doing three things:
1. you are using an RDD for parallelization,
2. you are calling your getLinearCoefficients() function, and finally
3. you call collect() on it to use your existing code.
There is nothing wrong with the first point, but there is a huge mistake in the second and third steps. Your getLinearCoefficients() function does not benefit from pyspark, as you use numpy and sklearn (have a look at this post for a better explanation). For most of the functions you are using, there is a pyspark equivalent.
The problem with the third step is the collect() function. When you call collect(), pyspark is bringing all the rows of the RDD to the driver and executes the sklearn functions there. Therefore you get only the parallelization which is allowed by sklearn. Using pyspark is completely pointless in the way you are doing it currently and maybe even a drawback. Pyspark is not a framework which allows you to run your python code in parallel. When you want to execute your code in parallel with pyspark, you have to use the pyspark functions.
So what can you do?
First of all, you could use the n_jobs parameter of the LinearRegression class to use more than one core for your calculation. This allows you at least to use all cores of one machine.
Another thing you could do is step away from sklearn and use the LinearRegression of pyspark (have a look at the guide and the API). With this you can use a whole cluster for your linear regression.
For large datasets with more than 100k samples, using LinearRegression is discouraged. The general advice is to use the SGDRegressor and set the parameters so that the OLS loss is used:
from sklearn.linear_model import SGDRegressor
And replace your LinearRegression with:
model = SGDRegressor(loss='squared_loss', penalty='none', fit_intercept=True)
Setting loss='squared_loss' and penalty='none' makes the SGDRegressor use OLS without regularization, so it should produce results similar to LinearRegression. (In recent scikit-learn releases, the loss is spelled 'squared_error' and the penalty is disabled with penalty=None.)
Try out some options like learning_rate and eta0/power_t to find the best-performing configuration.
Furthermore I recommend using train_test_split to split the data set and use the test set for scoring. A good test size to begin with is test_size=.3.
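The question also asks whether the per-column fits could be vectorized. Neither answer above covers this, so take the following as a separate sketch rather than as part of either recommendation: with a single predictor, the OLS slope and intercept have a closed form, so all columns can be fitted at once in numpy without Spark or sklearn. The function name `columnwise_linear_fit` and the `min_points` parameter are made up here; the outputs mirror those of getLinearCoefficients (x is the 1-based row index, NaNs are skipped).

```python
import numpy as np


def columnwise_linear_fit(data, min_points=3):
    """Fit y = slope * x + intercept to every column at once.

    x is the 1-based row index, NaN entries are skipped, and columns with
    fewer than min_points valid values get NaN for every output.
    """
    data = np.asarray(data, dtype=float)
    x = np.arange(1, data.shape[0] + 1, dtype=float)[:, np.newaxis]
    valid = ~np.isnan(data)
    n = valid.sum(axis=0).astype(float)

    # Empty columns cause 0/0 below; those entries are masked out afterwards.
    with np.errstate(invalid="ignore", divide="ignore"):
        x_mean = np.where(valid, x, 0.0).sum(axis=0) / n
        y_mean = np.where(valid, data, 0.0).sum(axis=0) / n
        dx = np.where(valid, x - x_mean, 0.0)
        dy = np.where(valid, data - y_mean, 0.0)
        slope = (dx * dy).sum(axis=0) / (dx * dx).sum(axis=0)
        intercept = y_mean - slope * x_mean
        residual = np.where(valid, data - (slope * x + intercept), 0.0)
        sse = (residual ** 2).sum(axis=0)
        r2 = 1.0 - sse / (dy * dy).sum(axis=0)
        rmse = np.sqrt(sse / n)

    too_few = n < min_points
    n, slope, intercept, r2, rmse = (
        np.where(too_few, np.nan, values)
        for values in (n, slope, intercept, r2, rmse)
    )
    return n, slope, intercept, r2, rmse


# Small made-up example: the first column is exactly y = 2x + 1.
example = np.array(
    [[3.0, 1.0], [5.0, np.nan], [7.0, 2.5], [9.0, 2.0], [11.0, 5.0]]
)
n, slope, intercept, r2, rmse = columnwise_linear_fit(example)
print(slope, intercept)
```

For an array as large as the 300 x 4,000,000 one in the question, the temporaries created here are each as big as the input, so it would probably be necessary to call the function on slices of columns and concatenate the results.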