apache arrow - adequacy for parallel processing - pandas

I have a huge dataset and am using Apache Spark for data processing.
Using Apache Arrow, we can convert a Spark data frame to a Pandas data frame and run operations on it.
After the conversion, will it keep the parallel-processing performance seen in Spark, or will it behave like Pandas?

As the documentation says:
Note that even with Arrow, toPandas() results in the collection of all records in the DataFrame to the driver program and should be done on a small subset of the data.
The data is sent to the driver when it is moved into the Pandas data frame. That means you may have performance issues if there is too much data for the driver to handle. For that reason, if you decide to use Pandas, try to group (aggregate) the data before calling toPandas().
It won't have the same parallelization once it's converted to a Pandas data frame, because the Spark executors are no longer involved. The beauty of Arrow is that it moves data from a Spark data frame to Pandas efficiently, but you still have to think about the size of the data.
Another possibility would be to use a framework like Koalas. It has some of the "beauties" of Pandas but is integrated into Spark.
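The "group before toPandas()" advice could look roughly like the sketch below. The helper name collect_summary and the max_rows guard are hypothetical, and the config key is the Spark 3.x name (older releases used spark.sql.execution.arrow.enabled). It assumes you already have a SparkSession and a Spark data frame.

```python
def collect_summary(spark, spark_df, key_col, value_col, max_rows=100_000):
    """Aggregate on the executors, then move only the small result to the driver.

    `spark` is an existing SparkSession and `spark_df` a Spark DataFrame.
    The helper name and the `max_rows` guard are illustrative assumptions.
    """
    # Arrow speeds up the transfer itself; it does not make Pandas parallel.
    # (Spark 3.x key; older releases used "spark.sql.execution.arrow.enabled".)
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    # This aggregation still runs in parallel on the cluster.
    summary = spark_df.groupBy(key_col).sum(value_col)

    # Refuse to collect if the aggregated result is still large.
    n_rows = summary.count()
    if n_rows > max_rows:
        raise ValueError(f"{n_rows} rows is too many to collect to the driver")

    # From here on the data lives in a single-process Pandas DataFrame.
    return summary.toPandas()
```

A call such as collect_summary(spark, df, "country", "amount") (column names made up) would return a Pandas data frame with one row per country and a column named sum(amount).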

Related

Why Import pandas in PySpark?

Hi. At university, in the data science area, we learned that if we want to work with small data we should use Pandas, and if we work with Big Data we should use Spark (PySpark, in the case of Python programmers).
Recently, at a cloud hackathon (Azure Synapse, which runs on Spark), I saw Pandas being imported in the notebook. I assume the code is good because it was written by Microsoft people:
import pandas
from azureml.core import Dataset
training_pd = training_data.toPandas().to_csv('training_pd.csv', index=False)
Why do they do that?
Pandas dataframes do not support parallelization. On the other hand, with Pandas you need no cluster, and you have more libraries and easy-to-extend examples. And let's be real, for data that fits comfortably in memory on one machine its performance is often better, because there is no cluster overhead.
So, if you start your data engineering life learning Pandas, you're stuck with two things:
Externalized knowledge: ready-made code, snippets, and projects;
Internalized knowledge: API that you know well and prefer much more, patterns, guarantees, and gut feeling how to write this code in general.
To a man with a hammer, everything looks like a nail. And that's not always a bad thing. If you have strict deadlines, done is better than perfect! Better to use Pandas now than to spend years learning proper scalable solutions.
Imagine you want to use an Apache Zeppelin notebook in PySpark mode, with all these cool visualizations. But it doesn't quite meet your requirements, and you're thinking about how to quick-fix that. At the same time, you can instantly google a ready-made solution for Pandas. This is the way to go if you have no other option to meet your deadlines.
Another guess is that if you write your code in plain Python, you can debug it easily in any good IDE like PyCharm, using the interactive debugger. That generally isn't possible in online notebooks, especially in Spark mode. Do you know any good debugger for Spark? I don't know of any (people from the Big Data Tools plugin for IDEA are trying to fix this for Scala, but not for Python, as far as I know). So you have to write code in the IDE and then copy-paste it into the notebook.
And last but not least, it may be just a mistake. People do not always know perfectly what they're doing, especially in a field as large as Big Data. You're fortunate to have this university course. The average Joe on the internet had no such option.
I should stop here because only speculations lie ahead.
The main difference between working with PySpark and Pandas is the syntax. To show this difference, I provide a simple example of reading in a parquet file and doing some transformations on the data. As you can see, the syntax is completely different between PySpark and Pandas, which means that your Pandas knowledge is not directly transferable to PySpark.
# Pandas
import pandas as pd
pandasDF = pd.read_parquet(path_to_data)
pandasDF['SumOfTwoColumns'] = pandasDF['Column1'] + pandasDF['Column2']
pandasDF.rename({'Column1': 'Col1', 'Column2': 'Col2'}, axis=1, inplace=True)
# PySpark (assumes an existing SparkSession named spark)
from pyspark.sql.functions import col
sparkDF = spark.read.parquet(path_to_data)
sparkDF = sparkDF.withColumn('SumOfTwoColumns', col('Column1') + col('Column2'))
sparkDF = sparkDF.withColumnRenamed('Column1', 'Col1').withColumnRenamed('Column2', 'Col2')
These differences in usage and syntax mean that there is a learning curve when moving from pure Pandas code to pure PySpark code. They also mean that your legacy Pandas code cannot be used directly on Spark with PySpark. Luckily there are solutions that allow you to use your Pandas code and knowledge on Spark.
Solutions to leverage the power of Spark with Pandas
There are mainly two options to use Pandas code on Spark: Koalas and Pandas UDFs.
Although it's not recommended to use Pandas while working with PySpark, I have sometimes seen people do it.
Basically, it seems that the person who wrote this feels more comfortable in Pandas. Of course Pandas doesn't scale: if your data set grows, you need more RAM and probably a faster CPU (faster in terms of single-core performance). While this may be limiting in some scenarios, it seems that in this example the CSV was not big enough to need Spark.
I cannot see any other reason.

Fetch_pandas vs Unload as Parquet - Snowflake data unloading using Python connector

I am new to Snowflake and Python. I am trying to figure out which would be faster and more efficient:
Read data from snowflake using fetch_pandas_all() or fetch_pandas_batches() OR
Unload data from Snowflake into Parquet files and then read them into a dataframe.
CONTEXT
I am working on a data layer regression testing tool, that has to verify and validate datasets produced by different versions of the system.
Typically a simulation run produces around 40-50 million rows, each having 18 columns.
I know very little about Pandas or Python, but I am learning (I used to be a front-end developer).
Any help appreciated.
LATEST UPDATE (09/11/2020)
I used fetch_pandas_batches() to pull down data into manageable dataframes and then wrote them to the SQLite database. Thanks.
Based on your use-case, you are likely better off just running a fetch_pandas_all() command to get the data into a df. The performance is likely better as it's one hop of the data, and it's easier to code, as well. I'm also a fan of leveraging the SQLAlchemy library and using the read_sql command. That looks something like this:
resultSet = pd.read_sql(text(sqlQuery), SnowEngine)
once you've established an engine connection. Same concept, but leverages the SQLAlchemy library instead.
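For comparison, the batch route described in the update above (pull manageable batches, write each one to SQLite) could be sketched like this. To keep the sketch runnable without a Snowflake account, it takes any iterable of row batches; the function name load_batches and the table and column names are hypothetical. With fetch_pandas_batches() each batch is a Pandas DataFrame, so you would pass df.itertuples(index=False, name=None) per batch, or simply call df.to_sql(table, conn, if_exists="append", index=False) instead.

```python
import sqlite3


def load_batches(conn, table, columns, batches):
    """Append batches of rows to a SQLite table, one transaction per batch.

    `batches` is any iterable of row batches (each row a tuple), so only
    one batch has to be held in memory at a time.
    """
    col_defs = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')

    total = 0
    for batch in batches:
        rows = list(batch)
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', rows
            )
        total += len(rows)
    return total
```

Committing once per batch keeps memory flat and means a failure partway through only loses the batch in flight.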

Why do some notes in Spark run very slowly, and why do repeated runs in the same situation take different amounts of time?

My question is about the execution time of PySpark code in Zeppelin.
I have some notes in which I work with SQL queries. In one of my notes, I convert my dataframe to Pandas with the .toPandas() function. The size of my data is about 600 megabytes.
My problem is that it takes a long time.
If I use sampling, for example like this:
df.sample(False, 0.7).toPandas()
it works correctly and in an acceptable time.
The other strange point is that when I run this note several times, sometimes it is fast and sometimes slow. For example, the first run after restarting the PySpark interpreter is faster.
How can I work with Zeppelin in a stable state?
And which parameters matter for running Spark code in an acceptable time?
The problem here is not Zeppelin, but you as a programmer. Spark is a distributed (cluster computing) data analysis engine written in Scala, which therefore runs in a JVM. PySpark is the Python API for Spark, which uses the Py4J library to provide an interface to JVM objects.
Methods like .toPandas() or .collect() return a Python object that is not just an interface to JVM objects (i.e. it actually contains your data). They are costly because they require transferring your (distributed) data from the JVM to the Python interpreter inside the Spark driver. Therefore you should use them only when the resulting data is small, and work with PySpark dataframes for as long as possible.
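One way to keep the collected result small is to trim the dataframe in Spark before converting it. This is only a sketch: the helper name preview_as_pandas and the default cap are made up, and the right limit depends on your driver memory.

```python
def preview_as_pandas(spark_df, columns, max_rows=10_000):
    """Project and cap a Spark DataFrame before converting it to Pandas."""
    # select() and limit() are lazy and executed by Spark, so only the
    # trimmed result crosses from the JVM to the Python driver process.
    trimmed = spark_df.select(*columns).limit(max_rows)
    return trimmed.toPandas()
```

Unlike df.sample(False, 0.7), which still returns roughly 70% of the rows, this puts a hard upper bound on what reaches the driver.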
Your other issue regarding different execution times needs to be discussed with your cluster admin. Network spikes and jobs submitted by other users can influence your execution time heavily. I am also surprised that your first run after a restart of the Spark interpreter is faster, because during the first run the SparkContext is created and cluster resources are allocated, which adds some overhead.

Apache Ignite analogue of Spark vector UDF and distributed compute in general

I have been using Spark for some time now with success in Python; however, we have a product written in C# that would greatly benefit from distributed and parallel execution. I did some research and tried out the new C# API for Spark, but it is a little restrictive at the moment.
As for Ignite, on the surface it seems like a decent alternative. It's got good .NET support, it has clustering ability, and it can distribute compute across the grid.
However, I was wondering if it really can replace Spark in our use case. What we need is a distributed way to perform data-frame-type operations. In particular, a lot of our Python code was implemented using Pandas UDFs, and we let Spark worry about the data transfer and merging of results.
If I wanted to use Ignite, where our data is really more like a table (typically CSV sourced) rather than key/value based, is there an efficient way to represent that data across the grid and send computations to the cluster that execute on an arbitrary subset of the data in the same way Spark does, especially in the sense that the results of the calculations just become 1..n more columns in the dataframe without having to collect all the results back to the main program?
You can load your structured data (CSV) into Ignite using its SQL implementation:
https://apacheignite-sql.readme.io/docs/overview
It lets you run distributed SQL queries over this data and supports indexes. Spark can also work with structured data using SQL, but it has no indexes. Indexes can significantly improve the performance of your SQL operations.
If you already have a working solution built on Spark data frames, you can keep the same logic and use the Ignite integration with Spark instead:
https://apacheignite-fs.readme.io/docs/ignite-data-frame
In this case, you can have all the data stored in Ignite SQL tables and run SQL queries and other operations through Spark.
Here is an example of how to load CSV data into Ignite using Spark data frames and how it can be configured:
https://www.gridgain.com/resources/blog/how-debug-data-loading-spark-ignite

How does Spark DataFrame handle a Pandas DataFrame that is larger than memory

I am learning Spark now, and it seems to be the big data solution for Pandas Dataframe, but I have this question which makes me unsure.
Currently I am storing Pandas dataframes that are larger than memory using HDF5. HDF5 is a great tool which allows me to do chunking on the pandas dataframe. So when I need to do processing on large Pandas dataframe, I will do it in chunks. But Pandas does not support distributed processing and HDF5 is only for a single PC environment.
Using a Spark dataframe may be a solution, but my understanding of Spark is that the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing.
Is my understanding correct? If this is the case, then how does Spark handle a dataframe that is larger than the memory? Does it support chunking, like HDF5?
the dataframe must be able to fit in memory, and once loaded as a Spark dataframe, Spark will distribute the dataframe to the different workers to do the distributed processing.
This is true only if you're trying to load your data on the driver and then parallelize it. In a typical scenario you store the data in a format that can be read in parallel. That means:
the data has to be accessible on each worker, for example through a distributed file system
the file format has to support splitting (the simplest example is plain old CSV)
In a situation like this, each worker reads only its own part of the dataset, without any need to hold the data in driver memory. All the logic for computing splits is handled transparently by the applicable Hadoop input format.
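To make "each worker reads only its own part" concrete, here is a toy model of line-oriented input splits in plain Python. It is not Spark's or Hadoop's actual code; the function names are made up, and it assumes a header-less, newline-delimited file with no newlines embedded inside quoted fields.

```python
import os


def compute_splits(file_size, num_splits):
    """Divide [0, file_size) into roughly equal, contiguous byte ranges."""
    step = max(1, -(-file_size // num_splits))  # ceiling division
    return [(start, min(start + step, file_size))
            for start in range(0, file_size, step)]


def read_split(path, start, end):
    """Return the complete lines that belong to the byte range [start, end).

    A range rarely begins on a line boundary, so every reader except the
    first discards the line it lands in; the previous reader picks that
    line up by reading past its own `end`.
    """
    lines = []
    with open(path, "rb") as f:
        f.seek(start)
        if start > 0:
            f.readline()  # owned by the previous split
        while f.tell() <= end:
            line = f.readline()
            if not line:  # end of file
                break
            lines.append(line.rstrip(b"\r\n").decode("utf-8"))
    return lines


def read_in_splits(path, num_splits):
    """Simulate `num_splits` workers, each reading only its own range."""
    size = os.path.getsize(path)
    return [read_split(path, s, e) for s, e in compute_splits(size, num_splits)]
```

Every line ends up in exactly one split, and no reader ever needs the whole file, which is why a splittable format lets the dataset be larger than any single machine's memory.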
Regarding HDF5 files you have two options:
read the data in chunks on the driver, build a Spark DataFrame from each chunk, and union the results. This is inefficient but easy to implement
distribute the HDF5 file or files and read the data directly on the workers. This is generally harder to implement and requires a smart data distribution strategy
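The first option could be sketched as below. The helper name chunks_to_spark_df is hypothetical. It takes any iterable of Pandas DataFrames; one way to get such an iterable is pandas.read_hdf(path, key, chunksize=...), assuming the store was written in table format.

```python
from functools import reduce


def chunks_to_spark_df(spark, pandas_chunks):
    """Build one Spark DataFrame from an iterable of Pandas DataFrames.

    `spark` is an existing SparkSession. Every chunk is read and
    serialized on the driver, which is why this route is inefficient.
    """
    parts = [spark.createDataFrame(chunk) for chunk in pandas_chunks]
    if not parts:
        raise ValueError("no chunks to load")
    # unionByName matches columns by name rather than by position.
    return reduce(lambda left, right: left.unionByName(right), parts)
```

For repeated use it is usually worth writing the resulting Spark DataFrame out once in a splittable format such as Parquet, so later jobs can read it in parallel instead of going through the driver again.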