Is there an Apache Arrow equivalent of the Spark Pandas UDF?

Spark provides a few different ways to implement UDFs that consume and return Pandas DataFrames. I am currently using the cogrouped version that takes two (co-grouped) Pandas DataFrames as input and returns a third.
For efficient translation between Spark DataFrames and Pandas DataFrames, Spark uses the Apache Arrow memory layout; however, a transformation is still required to go from Arrow to Pandas and back. I would really like to access the Arrow data directly, as that is how I will ultimately be working with the data inside the UDF (using Polars).
It seems wasteful to go from Spark -> Arrow -> Pandas -> Arrow (Polars) on the way in and the reverse on the return.

import pyarrow as pa
import polars as pl
from pyspark.sql import SQLContext

sql_context = SQLContext(spark)
data = [('James', [1, 2])]
spark_df = sql_context.createDataFrame(data=data, schema=["name", "properties"])
# _collect_as_arrow() is a private PySpark method that returns the data as Arrow RecordBatches
df = pl.from_arrow(pa.Table.from_batches(spark_df._collect_as_arrow()))
print(df)
shape: (1, 2)
┌───────┬────────────┐
│ name  ┆ properties │
│ ---   ┆ ---        │
│ str   ┆ list[i64]  │
╞═══════╪════════════╡
│ James ┆ [1, 2]     │
└───────┴────────────┘
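For context, the cogrouped Pandas UDF pattern referred to above looks roughly like this (a minimal sketch: df1 with columns (key, v1), df2 with columns (key, v2), and the result schema are all hypothetical):

import pandas as pd

def combine(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    # each call receives the co-grouped rows of both inputs that share the same key
    return pd.merge(left, right, on="key")

result = (
    df1.groupby("key")
       .cogroup(df2.groupby("key"))
       .applyInPandas(combine, schema="key long, v1 double, v2 double")
)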

Related

Why does pySpark crash when using Apache Arrow for string types?

In an attempt to get some outlier plots on large datasets I need to convert a Spark DataFrame to pandas. Turning to Apache Arrow, a simple run is crashing my pyspark console when casting x as string (it works fine without the cast). Why?
Using Python version 3.8.9 (default, Apr 10 2021 15:47:22)
Spark context Web UI available at http://6d0b1018a45a:4040
Spark context available as 'sc' (master = local[*], app id = local-1621164597906).
SparkSession available as 'spark'.
>>> import time
>>> from pyspark.sql.functions import rand
>>> from pyspark.sql import functions as F
>>> spark = SparkSession.builder.appName("Console_Test").getOrCreate()
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
21/05/16 11:31:03 WARN SQLConf: The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it.
>>> a_df = spark.range(1 << 25).toDF("id").withColumn("x", rand())
>>> a_df = a_df.withColumn("id", F.col("id").cast("string"))
>>> start_t = time.time()
>>> a_pd = a_df.toPandas()
Killed
#
Additionally, I noticed that options such as spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000") seemingly have no effect, as the web UI shows significantly more than 5000 records being assigned to the tasks.
Any indication of how to resolve the pyspark console crash, or how to more directly render large scatter plots, would be highly appreciated. I have (unsuccessfully) tried to find a way to apply Table.to_pandas(split_blocks=True, self_destruct=True), but did not manage to get the Arrow Table structure from a Spark DataFrame.
You are trying to convert 33.5 million (2^25) rows into a Pandas dataframe. This will lead to an OutOfMemoryError, as all of the data will be transferred to the Spark driver.
A way to find outliers would be to calculate the histogram for the column x and then filter down a_df to the relevant bins in Spark before creating the Pandas dataframe:
hist = a_df.select("x").rdd.flatMap(lambda x: x).histogram(10) #create 10 bins
hist is a tuple of two arrays: the first array contains the boundaries of the bins and the second array contains the numbers of elements in each bin:
([1.7855041778425118e-08,
0.1000000152099446,
0.20000001256484742,
0.30000000991975023,
0.40000000727465307,
0.5000000046295558,
0.6000000019844587,
0.6999999993393615,
0.7999999966942644,
0.8999999940491672,
0.99999999140407],
[3355812,
3356891,
3352364,
3352438,
3357564,
3356213,
3354933,
3355144,
3357241,
3355832])
rand creates uniformly distributed random numbers, so the histogram in this case is not very interesting. But for real-world distributions, the histogram will be useful.
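As a sketch of that last filtering step (the choice of which bins count as "interesting" is hypothetical here; F is pyspark.sql.functions, as imported above):

boundaries, counts = hist
# keep only the two least-populated bins as outlier candidates (hypothetical choice)
candidate_bins = sorted(range(len(counts)), key=lambda i: counts[i])[:2]

cond = None
for i in candidate_bins:
    lo, hi = boundaries[i], boundaries[i + 1]
    clause = (F.col("x") >= lo) & (F.col("x") < hi)
    cond = clause if cond is None else (cond | clause)

# far fewer rows than the full 2**25 now reach the driver
outlier_pd = a_df.filter(cond).toPandas()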

Stacking after unstacking a data frame is different in pandas

I created a Series with a 3-level index. Later, I applied the unstack method followed by the stack method, then checked the new and old objects for equality. Why are they different? Isn't unstacking the opposite of stacking? Here is my code:
import numpy as np
import pandas as pd
data = pd.Series([7]*9, index = [[1,2,3,2,4,9,6,7,9], ['a','c','f','a', 'k','f','c','d','a'], [np.nan]*9])
data2 = data.unstack().stack()
print(data2.equals(data))
The output is False, but I don't know why!

How do I convert multiple Pandas DFs into a single Spark DF?

I have several Excel files that I need to load and pre-process before loading them into a Spark DF. I have a list of these files that need to be processed. I do something like this to read them in:
file_list_rdd = sc.emptyRDD()
for file_path in file_list:
    current_file_rdd = sc.binaryFiles(file_path)
    print(current_file_rdd.count())
    file_list_rdd = file_list_rdd.union(current_file_rdd)
I then have some mapper function that turns file_list_rdd from a set of (path, bytes) tuples to (path, Pandas DataFrame) tuples. This allows me to use Pandas to read the Excel file and to manipulate the files so that they're uniform before making them into a Spark DataFrame.
How do I take an RDD of (file path, Pandas DF) tuples and turn it into a single Spark DF? I'm aware of functions that can do a single transformation, but not one that can do several.
My first attempt was something like this:
sqlCtx = SQLContext(sc)

def convert_pd_df_to_spark_df(item):
    return sqlCtx.createDataFrame(item[0][1])

processed_excel_rdd.map(convert_pd_df_to_spark_df)
I'm guessing that didn't work because sqlCtx isn't distributed with the computation (it's a guess because the stack trace doesn't make much sense to me).
Thanks in advance for taking the time to read :).
This can be done by converting the pandas DataFrames to Arrow RecordBatches, which Spark > 2.3 can process into a DataFrame very efficiently.
https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
This snippet monkey-patches spark to include a createFromPandasDataframesRDD method.
The createFromPandasDataframesRDD method accepts an RDD of pandas DFs (it assumes they all have the same columns) and returns a single Spark DF.
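The core idea behind the snippet, sketched here without the Spark-internal assembly step that the gist takes care of:

import pyarrow as pa

def to_record_batch(rdd_row):
    _, pdf = rdd_row
    # one Arrow RecordBatch per pandas DataFrame; Spark's Arrow code path can then
    # turn these batches into DataFrame partitions without a row-by-row conversion
    return pa.RecordBatch.from_pandas(pdf, preserve_index=False)

batches_rdd = processed_excel_rdd.map(to_record_batch)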
I solved this by writing a function like this:
from pyspark.sql import Row

def pd_df_to_row(rdd_row):
    key = rdd_row[0]
    pd_df = rdd_row[1]
    rows = list()
    for index, series in pd_df.iterrows():
        # Take a row of the df, export it as a dict, and pass the unpacked dict into the Row constructor
        row_dict = {str(k): v for k, v in series.to_dict().items()}
        rows.append(Row(**row_dict))
    return rows
You can invoke it by calling something like:
processed_excel_rdd = processed_excel_rdd.flatMap(pd_df_to_row)
processed_excel_rdd now contains Spark Row objects. You can now say:
processed_excel_rdd.toDF()
There's probably something more efficient than the Series-> dict-> Row operation, but this got me through.
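One candidate (an untested sketch) is to build a reusable Row factory and iterate with itertuples, which skips the per-row Series construction that makes iterrows slow:

from pyspark.sql import Row

def pd_df_to_rows(rdd_row):
    _, pd_df = rdd_row
    # Row(*names) builds a reusable row "class" that is then called with the values
    RowFactory = Row(*[str(c) for c in pd_df.columns])
    return [RowFactory(*rec) for rec in pd_df.itertuples(index=False, name=None)]

processed_excel_rdd.flatMap(pd_df_to_rows).toDF()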
Why not make a list of the dataframes or filenames and then call union in a loop? Something like this:
If pandas dataframes:
dfs = [df1, df2, df3, df4]
sdf = None
for df in dfs:
    if sdf:
        sdf = sdf.union(spark.createDataFrame(df))
    else:
        sdf = spark.createDataFrame(df)
If filenames:
names = [name1, name2, name3, name4]
sdf = None
for name in names:
    if sdf:
        sdf = sdf.union(spark.createDataFrame(pd.read_excel(name)))
    else:
        sdf = spark.createDataFrame(pd.read_excel(name))

How to use grouped set of images and data to train using tensorflow?

So say I have an object that has been photographed from multiple angles; how do I tell TensorFlow that the combined images are one object and not separate objects? All tutorials seem to cover individual images per class.
For example this structure:
-vegetables
├───carrot
│   ├───carrot_1
│   │   └───image_1.jpg
│   │   └───image_2.jpg
│   │   └───image_3.jpg
│   └───carrot_2
│       └───image_1.jpg
├───potato
│   ├───potato_1
│   │   └───image_1.jpg
Also in the future I would like to combine these images with more sensor output.

How to np.roll() faster?

I'm using np.roll() to do nearest-neighbor-like averaging, but I have a feeling there are faster ways. Here is a simplified example, but imagine 3 dimensions and more complex averaging "stencils". Just for example see section 6 of this paper.
Here are a few lines from that simplified example:
for j in range(nper):
    phi2 = 0.25*(np.roll(phi, 1, axis=0) +
                 np.roll(phi, -1, axis=0) +
                 np.roll(phi, 1, axis=1) +
                 np.roll(phi, -1, axis=1))
    phi[do_me] = phi2[do_me]
So should I be looking for something that returns views instead of arrays (as it seems roll returns arrays)? In this case is roll initializing a new array each time it's called? I noticed the overhead is huge for small arrays.
In fact it's most efficient for arrays of [100,100] to [300,300] in size on my laptop. Possibly caching issues above that.
Would scipy.ndimage.interpolation.shift() perform better, the way it is implemented here, and if so, is it fixed? In the linked example above, I'm throwing away the wrapped parts anyway, but might not always.
note: in this question I'm looking only for what is available within NumPy / SciPy. Of course there are many good ways to speed up Python and even NumPy, but that's not what I'm looking for here, because I'm really trying to understand NumPy better. Thanks!
np.roll has to create a copy of the array each time, which is why it is (comparatively) slow. Convolution with something like scipy.ndimage.filters.convolve() will be a bit faster, but it may still create copies (depending on the implementation).
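For reference, a minimal sketch of that convolution route, with a kernel matching the question's 4-point stencil (phi as in the question; the kernel is symmetric, so convolution and correlation coincide):

import numpy as np
from scipy.ndimage import convolve

kernel = np.array([[0.0 , 0.25, 0.0 ],
                   [0.25, 0.0 , 0.25],
                   [0.0 , 0.25, 0.0 ]])
phi2 = convolve(phi, kernel, mode='wrap')  # 'wrap' gives the same periodic boundary as np.roll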
In this particular case, we can avoid copying altogether by using numpy views and padding the original array in the beginning.
import numpy as np

def no_copy_roll(nx, ny):
    phi_padded = np.zeros((ny+2, nx+2))
    # these are views into different sub-arrays of phi_padded;
    # if two sub-arrays overlap, they share memory
    phi_north = phi_padded[:-2, 1:-1]
    phi_east = phi_padded[1:-1, 2:]
    phi_south = phi_padded[2:, 1:-1]
    phi_west = phi_padded[1:-1, :-2]
    phi = phi_padded[1:-1, 1:-1]

    do_me = np.zeros_like(phi, dtype='bool')
    do_me[1:-1, 1:-1] = True

    x0, y0, r0 = 40, 65, 12
    x = np.arange(nx, dtype='float')[None, :]
    y = np.arange(ny, dtype='float')[:, None]
    rsq = (x-x0)**2 + (y-y0)**2
    circle = rsq <= r0**2
    phi[circle] = 1.0
    do_me[circle] = False

    n, nper = 100, 100
    phi_hold = np.zeros((n+1, ny, nx))
    phi_hold[0] = phi
    for i in range(n):
        for j in range(nper):
            phi2 = 0.25*(phi_south +
                         phi_north +
                         phi_east +
                         phi_west)
            phi[do_me] = phi2[do_me]
        phi_hold[i+1] = phi
    return phi_hold
This will cut about 35% off the time for a simple benchmark like
from original import original_roll
from mwe import no_copy_roll
import numpy as np
nx, ny = (301, 301)
arr1 = original_roll(nx, ny)
arr2 = no_copy_roll(nx, ny)
assert np.allclose(arr1, arr2)
Here is my profiling result:
37.685 <module> timing.py:1
├─ 22.413 original_roll original.py:4
│  ├─ 15.056 [self]
│  └─ 7.357 roll <__array_function__ internals>:2
│     └─ 7.243 roll numpy\core\numeric.py:1110
│           [10 frames hidden] numpy
├─ 14.709 no_copy_roll mwe.py:4
└─ 0.393 allclose <__array_function__ internals>:2
   └─ 0.393 allclose numpy\core\numeric.py:2091
         [2 frames hidden] numpy
            0.391 isclose <__array_function__ internals>:2
            └─ 0.387 isclose numpy\core\numeric.py:2167
                  [4 frames hidden] numpy
For more elaborate stencils this approach still works, but it can get a bit unwieldy. In this case you can take a look at skimage.util.view_as_windows, which uses a variation of this trick (numpy stride tricks) to return an array that gives you cheap access to a window around each element. You will, however, have to do your own padding, and will need to take care to not create copies of the resulting array, which can get expensive fast.
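A minimal sketch of that windowed-view idea, assuming scikit-image is installed (phi stands in for the array from the question):

import numpy as np
from skimage.util import view_as_windows

phi_padded = np.pad(phi, 1, mode='wrap')        # do your own (periodic) padding
windows = view_as_windows(phi_padded, (3, 3))   # a strided view, shape (ny, nx, 3, 3), no copy
# the centre of window (i, j) is phi[i, j]; pick out its four neighbours
phi2 = 0.25 * (windows[..., 0, 1] + windows[..., 2, 1] +
               windows[..., 1, 0] + windows[..., 1, 2])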
The fastest implementation I could get so far is based on the low-level implementation of scipy.ndimage.interpolation.shift that you already mentioned:
from scipy.ndimage.interpolation import _nd_image, _ni_support

# data is the array to be shifted in place; shift holds the precomputed per-axis offsets
cval = 0.0  # unused for mode `wrap`
mode = _ni_support._extend_mode_to_code('wrap')
_nd_image.zoom_shift(data, None, shift, data, 0, mode, cval)  # in-place update
Precomputing mode, cval and shift and calling the low-level zoom_shift method directly got me about a 5x speedup compared with calling shift, and a 10x speedup compared with np.roll.