Why Import pandas in PySpark?

Hi. At university, in the data science area, we learned that if we want to work with small data we should use pandas, and if we work with Big Data we should use Spark; for Python programmers, that means PySpark.
Recently, at a hackathon in the cloud (Azure Synapse, which runs on Spark), I saw pandas being imported in the notebook (I suppose the code is good, since it was written by Microsoft people):
import pandas
from azureml.core import Dataset
training_pd = training_data.toPandas().to_csv('training_pd.csv', index=False)
Why do they do that?

Pandas dataframes do not support parallelization. On the other hand, with Pandas you need no cluster, you have more libraries and easy-to-extend examples. And let's be real, its performance is better for every task that doesn't require scaling.
So, if you start your data engineering life learning Pandas, you end up with two things:
Externalized knowledge: ready-made code, snippets, and projects;
Internalized knowledge: an API that you know well and prefer, plus patterns, guarantees, and a gut feeling for how to write this code in general.
To a man with a hammer, everything looks like a nail. And that's not always a bad thing. If you have strict deadlines, done is better than perfect! Better to use Pandas now than to spend years learning proper scalable solutions.
Imagine you want to use an Apache Zeppelin notebook in PySpark mode, with all these cool visualizations. But it doesn't quite meet your requirements, and you're thinking about how to quick-fix that. At the same time, you can instantly google a ready-made solution for Pandas. That is a way to go when you have no other option to meet your deadlines.
Another guess is that if you write your code in Python, you can debug it easily in any good IDE, like PyCharm, using the interactive debugger. That generally isn't true for online notebooks, especially in Spark mode. Do you know any good debugger for Spark? I know of none (people from the Big Data Tools plugin for IDEA are trying to fix this for Scala, but not for Python, as far as I know). So you have to write code in the IDE and then copy-paste it into the notebook.
And last but not least, it may be just a mistake. People do not always know exactly what they're doing, especially in a field as large as Big Data. You're fortunate to have this university course; the average Joe on the internet had no such option.
I should stop here, because only speculation lies ahead.

The main difference between working with PySpark and Pandas is the syntax. To show this difference, I provide a simple example of reading in a parquet file and doing some transformations on the data. As you can see, the syntax is completely different between PySpark and Pandas, which means that your Pandas knowledge is not directly transferable to PySpark.
# Pandas
import pandas as pd
pandasDF = pd.read_parquet(path_to_data)
pandasDF['SumOfTwoColumns'] = pandasDF['Column1'] + pandasDF['Column2']
pandasDF.rename({'Column1': 'Col1', 'Column2': 'Col2'}, axis=1, inplace=True)
# PySpark
from pyspark.sql.functions import col
sparkDF = spark.read.parquet(path_to_data)
sparkDF = sparkDF.withColumn('SumOfTwoColumns', col('Column1') + col('Column2'))
sparkDF = sparkDF.withColumnRenamed('Column1', 'Col1').withColumnRenamed('Column2', 'Col2')
These differences in usage, and also in syntax, mean that there is a learning curve when moving from pure Pandas code to pure PySpark code. It also means that your legacy Pandas code cannot be used directly on Spark with PySpark. Luckily, there are solutions that allow you to use your Pandas code and knowledge on Spark.
Solutions to leverage the power of Spark with Pandas
There are mainly two options for using Pandas code on Spark: Koalas and Pandas UDFs.
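To make the second option concrete, here is a minimal sketch of a Pandas UDF. It is not taken from the answer above: it assumes the sparkDF and the Column1/Column2 names from the earlier example plus a running SparkSession, and the function name is made up.
import pandas as pd
from pyspark.sql.functions import pandas_udf
# Declare the UDF's return type; Spark sends the column data to the Python
# workers in batches and hands each batch to this function as pandas Series.
@pandas_udf('double')
def sum_of_two_columns(col1: pd.Series, col2: pd.Series) -> pd.Series:
    # Plain Pandas code runs here, batch by batch
    return col1 + col2
sparkDF = sparkDF.withColumn('SumOfTwoColumns', sum_of_two_columns('Column1', 'Column2'))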
Although it's not recommended to use Pandas while working with PySpark, I have sometimes seen people do the same.

Basically, it seems that the person who did that work feels more comfortable in Pandas. Of course, Pandas doesn't scale: if your data set grows, you need more RAM and probably a faster CPU (faster in terms of single-core performance). While this may be limiting for some scenarios, it seems that in this example the CSV is not big enough to need Spark.
I cannot see any other reason.

Related

Processing xlsx files using Python Matplotlib and other libraries

Python newbie. It looks to me that in order to build a flexible data collection from xlsx it will be well worth my time to learn the Pandas DataFrame approach. More upfront learning time, but a much better platform once I have it working.
I've played around with some code examples, but I really want an expert to give me advice on the best approach.

Is there a Pandas Profiling-like implementation built on Polars?

We use Pandas and Pandas Profiling extensively in our projects to generate profile reports. We were going to explore using Polars as a Pandas alternative and wanted to check whether there are any implementations like Pandas Profiling built on top of Polars.
I searched a bit before posting this question and did not find any similar implementations, so I wanted to check whether anyone else knows of one.
I'm not aware of any project implemented natively with Polars. That said, there's an easy way to use Pandas Profiling with Polars.
From the Other DataFrame libraries page of the Pandas Profiling documentation:
If you have data in another framework of the Python Data ecosystem, you can use pandas-profiling by converting to a pandas DataFrame, as direct integrations are not yet supported.
On the above page, you'll see suggestions for using Pandas Profiling with other dataframe libraries, such as Modin, Vaex, PySpark, and Dask.
We can do the same thing easily with Polars, using the to_pandas method.
Adapting an example from the Quick Start Guide to use Polars:
import polars as pl
import numpy as np
from pandas_profiling import ProfileReport
# Build a Polars DataFrame from random data
df = pl.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])
# Convert with to_pandas() so pandas-profiling can consume the frame
profile = ProfileReport(df.to_pandas(), title="Pandas Profiling Report")
profile.to_file('your_report.html')
In general, you're always one method call away from plugging Polars into any framework that uses Pandas. I myself use to_pandas so that I can use Polars with my favorite graphing library, plotnine.
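As a small, hedged illustration of that pattern (the data and column names below are made up, not from the question):
import polars as pl
from plotnine import ggplot, aes, geom_point
# Build the data in Polars, then hand a pandas copy to plotnine
pl_df = pl.DataFrame({"x": [1, 2, 3, 4], "y": [10, 20, 15, 25]})
plot = ggplot(pl_df.to_pandas(), aes(x="x", y="y")) + geom_point()
plot.save("scatter.png")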
(As an aside, thank you for sharing the Pandas Profiling project here. I was quite impressed with the output generated, and will probably use it on projects going forward.)

In what situations are Datasets preferred to Dataframes and vice-versa in Apache Spark?

I have been searching for links, documents, or articles that will help me understand when we should go for Datasets over Dataframes and vice versa.
All I find on the internet are headlines about when to use a Dataset, but when opened, they just list the differences between a Dataframe and a Dataset. There are so many links that merely list differences in the name of scenarios.
There is only one question on Stack Overflow that has the right title, but even in that answer the Databricks documentation link is not working.
I am looking for some information that can help me understand fundamentally when to go for a Dataset, or in what scenarios a Dataset is preferred over a Dataframe and vice versa.
If not an answer, even a link or documentation that can help me understand is appreciated.
The page you are looking for has moved here. According to that session, in summary, the Dataset API is available for Scala (and Java) only, and it combines the benefits of both RDDs and Dataframes, which are:
Functional Programming (RDDs)
Type-safe (RDDs)
Relational (Dataframes)
Catalyst query optimization (Dataframes)
Tungsten direct/packed RAM (Dataframes)
JIT code generation (Dataframes)
Sorting/Shuffling without deserializing (Dataframes)
In addition, Datasets consume less memory and can catch analysis errors at compile time, while for Dataframes such errors are only caught at runtime. This is also a good article.
Therefore, the answer is that you are better off using Datasets when you are coding in Scala or Java and want functional programming and lower memory use along with all the Dataframe capabilities.
Datasets are preferred over Dataframes in Apache Spark when the data is strongly typed, i.e., when the schema is known ahead of time and the data is not necessarily homogeneous. This is because Datasets can enforce type safety, which means that type errors will be caught at compile time rather than at runtime. In addition, Datasets can take advantage of the Catalyst optimizer, which can lead to more efficient execution. Finally, Datasets can be easily converted to Dataframes, so there is no need to choose between the two upfront.

Fetch_pandas vs Unload as Parquet - Snowflake data unloading using Python connector

I am new to Snowflake and Python. I am trying to figure out which would be faster and more efficient:
Read data from snowflake using fetch_pandas_all() or fetch_pandas_batches() OR
Unload data from Snowflake into Parquet files and then read them into a dataframe.
CONTEXT
I am working on a data layer regression testing tool that has to verify and validate datasets produced by different versions of the system.
Typically a simulation run produces around 40-50 million rows, each having 18 columns.
I know very little about pandas or Python, but I am learning (I used to be a front-end developer).
Any help appreciated.
LATEST UPDATE (09/11/2020)
I used fetch_pandas_batches() to pull down data into manageable dataframes and then wrote them to the SQLite database. Thanks.
Based on your use-case, you are likely better off just running a fetch_pandas_all() command to get the data into a df. The performance is likely better as it's one hop of the data, and it's easier to code, as well. I'm also a fan of leveraging the SQLAlchemy library and using the read_sql command. That looks something like this:
import pandas as pd
from sqlalchemy import text
# SnowEngine here is the SQLAlchemy engine created for the Snowflake connection
resultSet = pd.read_sql(text(sqlQuery), SnowEngine)
once you've established an engine connection. Same concept, but leverages the SQLAlchemy library instead.
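For the batched approach mentioned in the update, a rough sketch might look like the following; the connection parameters, query, and table names are placeholders rather than details from the question.
import sqlite3
import snowflake.connector
# Placeholder credentials and query
con = snowflake.connector.connect(user='...', password='...', account='...')
cur = con.cursor()
cur.execute("SELECT * FROM SIMULATION_RESULTS")
sqlite_con = sqlite3.connect("results.db")
# fetch_pandas_batches() yields one pandas DataFrame per batch, so only a
# slice of the 40-50 million rows is in memory at any time
for batch_df in cur.fetch_pandas_batches():
    batch_df.to_sql("simulation_results", sqlite_con, if_exists="append", index=False)
sqlite_con.close()
con.close()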

If I use python pandas, is there any need for structured arrays?

Now that pandas provides a data frame structure, is there any need for structured/record arrays in numpy? There are some modifications I need to make to an existing code which requires this structured array type framework, but I am considering using pandas in its place from this point forward. Will I at any point find that I need some functionality of structured/record arrays that pandas does not provide?
pandas's DataFrame is a high level tool while structured arrays are a very low-level tool, enabling you to interpret a binary blob of data as a table-like structure. One thing that is hard to do in pandas is nested data types with the same semantics as structured arrays, though this can be imitated with hierarchical indexing (structured arrays can't do most things you can do with hierarchical indexing).
Structured arrays are also amenable to working with massive tabular data sets loaded via memory maps (np.memmap). This is a limitation that will be addressed in pandas eventually, though.
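To make the contrast concrete, here is a small sketch; the record layout is invented for illustration.
import numpy as np
import pandas as pd
# A structured array: every element is a fixed-layout record in one binary buffer
arr = np.zeros(3, dtype=[('id', 'i4'), ('value', 'f8'), ('label', 'U10')])
arr['id'] = [1, 2, 3]
arr['value'] = [0.5, 1.5, 2.5]
arr['label'] = ['a', 'b', 'c']
# Structured arrays convert directly to and from DataFrames
df = pd.DataFrame(arr)
rec = df.to_records(index=False)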
I'm currently in the middle of a transition to Pandas DataFrames from various NumPy arrays. This has been relatively painless since Pandas, AFAIK, is built largely on top of NumPy. What I mean by that is that .mean(), .sum(), etc. all work as you would hope. On top of that, the ability to add a hierarchical index and use the .ix[] (index) attribute and .xs() (cross-section) method to pull out arbitrary pieces of the data has greatly improved the readability and performance of my code (mainly by reducing the number of round-trips to my database).
One thing I haven't fully investigated yet is Pandas compatibility with the more advanced functionality of Scipy and Matplotlib. However, in case of any issues, it's easy enough to pull out a single column that behaves enough like an array for those libraries to work, or even convert to an array on the fly. A DataFrame's plotting methods, for instance, rely on matplotlib and take care of any conversion for you.
Also, if you're like me and your main use of Scipy is the statistics module, pystatsmodels is quickly maturing and relies heavily on pandas.
That's my two cents' worth
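A quick, hedged sketch of the hierarchical-indexing pattern described above, with made-up data (note that .ix has since been removed from pandas; .loc plays that role today):
import pandas as pd
# Invented data with a two-level (hierarchical) index
idx = pd.MultiIndex.from_product([['A', 'B'], [2019, 2020]], names=['group', 'year'])
df = pd.DataFrame({'value': [1.0, 2.0, 3.0, 4.0]}, index=idx)
sub = df.loc['A']                  # label-based selection on the outer level
xsec = df.xs(2020, level='year')   # cross-section on the inner level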
I never took the time to dig into pandas, but I use structured array quite often in numpy. Here are a few considerations:
structured arrays are as convenient as recarrays, with less overhead, if you don't mind losing the possibility of accessing fields by attribute. But then, have you ever tried to use min or max as a field name in a recarray?
NumPy has been developed over a far longer period than pandas, with a larger crew, and it has become ubiquitous enough that a lot of third-party packages rely on it. You can expect structured arrays to be more portable than pandas dataframes.
Are pandas dataframes easily picklable? Can they be sent back and forth with PyTables, for example?
Unless you're 100% sure that you'll never have to share your code with non-pandas users, you might want to keep some structured arrays around.
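On the picklability and PyTables questions: pandas DataFrames pickle like any other Python object, and pandas's HDFStore is backed by PyTables. A minimal sketch, assuming PyTables (the tables package) is installed and using arbitrary file names:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
# Round-trip through pickle
df.to_pickle('frame.pkl')
same = pd.read_pickle('frame.pkl')
# Round-trip through HDF5 via PyTables
with pd.HDFStore('frame.h5') as store:
    store['frame'] = df
    restored = store['frame']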