If I use Python pandas, is there any need for NumPy structured arrays?

Now that pandas provides a data frame structure, is there any need for structured/record arrays in NumPy? There are some modifications I need to make to existing code that requires this structured-array framework, but I am considering using pandas in its place from this point forward. Will I at any point find that I need some functionality of structured/record arrays that pandas does not provide?

pandas's DataFrame is a high level tool while structured arrays are a very low-level tool, enabling you to interpret a binary blob of data as a table-like structure. One thing that is hard to do in pandas is nested data types with the same semantics as structured arrays, though this can be imitated with hierarchical indexing (structured arrays can't do most things you can do with hierarchical indexing).
Structured arrays are also amenable to working with massive tabular data sets loaded via memory maps (np.memmap). This is a limitation that will be addressed in pandas eventually, though.
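As a rough sketch of that distinction (field names, dtypes, and the file name are invented for the example), a structured dtype can describe the binary layout of a record, be memory-mapped from disk, and still be handed to pandas when you want the high-level tooling:
import numpy as np
import pandas as pd

# A structured dtype describes the binary layout of one record.
row_dtype = np.dtype([('id', np.int32), ('x', np.float64), ('y', np.float64)])

# An in-memory structured array: a binary blob interpreted as a table.
arr = np.zeros(5, dtype=row_dtype)
arr['id'] = np.arange(5)
arr['x'] = np.linspace(0.0, 1.0, 5)

# The same dtype works for huge on-disk tables via np.memmap.
arr.tofile('rows.bin')
mm = np.memmap('rows.bin', dtype=row_dtype, mode='r')
print(mm['x'].mean())      # column access without reading the whole file eagerly

# And a DataFrame can be built on top of a structured array when needed.
df = pd.DataFrame(arr)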

I'm currently in the middle of transitioning to pandas DataFrames from various NumPy arrays. This has been relatively painless since pandas, AFAIK, is built largely on top of NumPy. What I mean by that is that .mean(), .sum(), etc. all work as you would hope. On top of that, the ability to add a hierarchical index and use the .ix[] (index) attribute and .xs() (cross-section) method to pull out arbitrary pieces of the data has greatly improved the readability and performance of my code (mainly by reducing the number of round-trips to my database).
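A minimal sketch of that pattern (index and column names invented; in current pandas .loc plays the role of the old .ix):
import pandas as pd

# Hypothetical measurements keyed by (site, year) -- a hierarchical index.
idx = pd.MultiIndex.from_product([['A', 'B'], [2020, 2021]], names=['site', 'year'])
df = pd.DataFrame({'temp': [20.1, 21.3, 18.7, 19.2],
                   'rain': [3.2, 2.9, 4.1, 3.8]}, index=idx)

print(df.mean())                    # NumPy-style reductions work as expected
print(df.loc['A'])                  # every row for site 'A'
print(df.xs(2021, level='year'))    # cross-section: every site, year 2021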
One thing I haven't fully investigated yet is Pandas compatibility with the more advanced functionality of Scipy and Matplotlib. However, in case of any issues, it's easy enough to pull out a single column that behaves enough like an array for those libraries to work, or even convert to an array on the fly. A DataFrame's plotting methods, for instance, rely on matplotlib and take care of any conversion for you.
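For example (toy data; to_numpy() is the modern spelling, older code used .values), a column can be handed to SciPy more or less directly, and the plotting methods call matplotlib for you:
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({'temp': [20.1, 21.3, 18.7, 19.2]})

print(stats.zscore(df['temp']))          # a Series is array-like enough for SciPy
print(np.median(df['temp'].to_numpy()))  # or convert explicitly to an ndarray
ax = df.plot(y='temp')                   # wraps matplotlib, conversion handled for you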
Also, if you're like me and your main use of SciPy is the statistics module, statsmodels is quickly maturing and relies heavily on pandas.
That's my two cents' worth.

I never took the time to dig into pandas, but I use structured arrays quite often in NumPy. Here are a few considerations:
structured arrays are as convenient as recarrays with less overhead, if you don't mind losing the possibility to access fields by attribute. But then, have you ever tried to use min or max as a field name in a recarray? (See the sketch below.)
NumPy has been developed over a far longer period than pandas, with a larger crew, and it has become ubiquitous enough that a lot of third-party packages rely on it. You can expect structured arrays to be more portable than pandas DataFrames.
Are pandas DataFrames easily picklable? Can they be sent back and forth with PyTables, for example?
Unless you're 100% sure that you'll never have to share your code with non-pandas users, you might want to keep some structured arrays around.
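To make the min/max point concrete, here is a small sketch (toy dtype) showing why attribute access breaks down for such field names while plain field indexing keeps working:
import numpy as np

dt = np.dtype([('min', np.float64), ('max', np.float64)])
a = np.zeros(3, dtype=dt)        # plain structured array
a['min'] = [1.0, 2.0, 3.0]       # field indexing always works
print(a['min'])

r = a.view(np.recarray)          # recarray adds attribute access...
print(r.min)                     # ...but this is ndarray.min, the method, not the field
print(r['min'])                  # the field is still reachable by name
For what it's worth, DataFrames do pickle cleanly, and pandas' HDFStore reads and writes HDF5 through PyTables.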

Related

Using PCA on Part of Dataframe

I want to apply a clustering algorithm to a dataframe that contains a lot of features (32 columns).
Some of the features are encoded using a one-hot encoder.
I want to use PCA (Principal Component Analysis) to reduce the dimensionality and make the machine learning process easier.
Is it possible to apply PCA to just some columns of the dataframe, keep the other columns as they are, and then use a machine learning model?
Or is it obligatory to apply PCA to the whole dataframe before clustering?
I guess there should be no issue with doing what you describe.
What this does, effectively, is merge some of the objects' features into fewer ones, and then use the other, non-merged features in addition to the merged ones. I don't know what effect that would have on the outcome; it might be good to run a correlation to see whether the unmerged features add anything to the PCA-merged ones. You might find that they basically duplicate what is there already.
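A minimal sketch of applying PCA to only some columns before clustering (column names and sizes invented; scikit-learn's ColumnTransformer with remainder='passthrough' is one way to do it):
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

# Toy frame: pretend the 'ohe_*' columns came out of a one-hot encoder.
df = pd.DataFrame({
    'ohe_a': [1, 0, 0, 1], 'ohe_b': [0, 1, 0, 0], 'ohe_c': [0, 0, 1, 0],
    'age':   [23, 45, 31, 52], 'income': [40000, 85000, 52000, 91000],
})

pca_cols = ['ohe_a', 'ohe_b', 'ohe_c']     # columns to compress with PCA
# Everything else passes through untouched (in practice you would likely
# scale those columns too, so they don't dominate the distance metric).
pre = ColumnTransformer([('pca', PCA(n_components=2), pca_cols)],
                        remainder='passthrough')

model = Pipeline([('pre', pre), ('cluster', KMeans(n_clusters=2, n_init=10))])
print(model.fit_predict(df))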
Since clustering is an exploratory method, you can basically do whatever you want. It is of course advisable to have a reason for doing so, as it otherwise ends up as simply trial-and-error, and if you find a result, you won't be able to describe why you got there. It is possible (or even likely for some data sets) that there are multiple ways to cluster them, so you should make decisions based on what you know about the data already, so they can be justified in those terms.
Running random trial-and-error clustering until you find a structure makes it a bit difficult to come up with a good explanation why that structure is valid.

In what situations are Datasets preferred to Dataframes and vice-versa in Apache Spark?

I have been searching for any links, documents, or articles that will help me understand when to choose Datasets over Dataframes and vice versa.
All I find on the internet are headlines about when to use a Dataset, but when opened, they just list the differences between a Dataframe and a Dataset. There are so many links that merely list differences under the guise of scenarios.
There is only one question on Stack Overflow that has the right title, but even in that answer the Databricks documentation link is not working.
I am looking for some information that can help me understand fundamentally when to go for a Dataset, or in what scenarios a Dataset is preferred over a Dataframe and vice versa.
If not an answer, even a link or documentation that can help me understand is appreciated.
The page you are looking for has moved here. According to that session, in summary, the Dataset API is available for Scala (and Java) only, and it combines the benefits of both RDDs and Dataframes, which are:
Functional Programming (RDDs)
Type-safe (RDDs)
Relational (Dataframes)
Catalyst query optimization (Dataframes)
Tungsten direct/packed RAM (Dataframes)
JIT code generation (Dataframes)
Sorting/Shuffling without deserializing (Dataframes)
In addition, Datasets consume less memory and can catch analysis errors at compile time, whereas for Dataframes those errors are only caught at runtime. This is also a good article.
Therefore, the answer is that you are better off using Datasets when you are coding in Scala or Java and want functional programming and lower memory use together with all the Dataframe capabilities.
Datasets are preferred over Dataframes in Apache Spark when the data is strongly typed, i.e., when the schema is known ahead of time and the data is not necessarily homogeneous. This is because Datasets can enforce type safety, which means that type errors will be caught at compile time rather than at runtime. In addition, Datasets can take advantage of the Catalyst optimizer, which can lead to more efficient execution. Finally, Datasets can be easily converted to Dataframes, so there is no need to choose between the two upfront.

Why Import pandas in PySpark?

Hi. At university, in the data science area, we learned that if we want to work with small data we should use pandas, and if we work with Big Data we should use Spark; in the case of Python programmers, PySpark.
Recently I saw, in a hackathon in the cloud (Azure Synapse, which works on top of Spark), pandas being imported in the notebook (I suppose the code is good because it was written by Microsoft people):
import pandas
from azureml.core import Dataset
training_pd = training_data.toPandas().to_csv('training_pd.csv', index=False)
Why do they do that?
Pandas dataframes do not support parallelization. On the other hand, with Pandas, you need no cluster, you have more libraries and easy-to-extend examples. And let's be real, its performance is better for every task that doesn't require scaling.
So, if you start your data engineering life learning Pandas, you're stuck with two things:
Externalized knowledge: ready-made code, snippets, and projects;
Internalized knowledge: an API that you know well and much prefer, patterns, guarantees, and a gut feeling for how to write this code in general.
To a man with a hammer, everything looks like a nail. And that's not always a bad thing. If you have strict deadlines, done is better than perfect! Better to use Pandas now than to spend years learning proper scalable solutions.
Imagine you want to use an Apache Zeppelin notebook in PySpark mode, with all those cool visualizations. But it doesn't quite meet your requirements, and you're thinking about how to quick-fix that. At the same time, you can instantly google a ready-made solution for Pandas. That's the way to go when you have no other option to meet your deadlines.
Another guess is that if you write your code in Python, you can debug it easily in any good IDE like PyCharm, using the interactive debugger. And that generally isn't the case for online notebooks, especially in Spark mode. Do you know any good debugger for Spark? I know of none (people from the Big Data Tools plugin for IDEA are trying to fix this for Scala, but not for Python, as far as I know). So you have to write code in the IDE and then copy-paste it into the notebook.
And last but not least, it may be just a mistake. People do not always know exactly what they're doing, especially in a field as large as Big Data. You're fortunate to have this university course; the average Joe on the internet had no such option.
I should stop here because only speculations lie ahead.
The main difference between working with PySpark and Pandas is the syntax. To show this difference, I provide a simple example of reading in a parquet file and doing some transformations on the data. As you can see, the syntax is completely different between PySpark and Pandas, which means that your Pandas knowledge is not directly transferable to PySpark.
# Pandas
import pandas as pd

pandasDF = pd.read_parquet(path_to_data)
pandasDF['SumOfTwoColumns'] = pandasDF['Column1'] + pandasDF['Column2']
pandasDF.rename({'Column1': 'Col1', 'Column2': 'Col2'}, axis=1, inplace=True)

# PySpark
from pyspark.sql.functions import col

sparkDF = spark.read.parquet(path_to_data)
sparkDF = sparkDF.withColumn('SumOfTwoColumns', col('Column1') + col('Column2'))
sparkDF = sparkDF.withColumnRenamed('Column1', 'Col1').withColumnRenamed('Column2', 'Col2')
These differences in usage, but also in syntax, mean that there is a learning curve when transferring from using pure Pandas code to pure PySpark code. This also means that your legacy Pandas code can not be used directly on Spark with PySpark. Luckily there are solutions that allow you to use your Pandas code and knowledge on Spark.
Solutions to leverage the power of Spark with Pandas
There are mainly two options for using Pandas code on Spark: Koalas and Pandas UDFs.
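For example, a pandas UDF keeps the familiar pandas column arithmetic while Spark distributes the work; a rough sketch (re-reading the same parquet as above for clarity and assuming Column1 and Column2 are numeric doubles):
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

# The body is plain pandas; Spark ships the columns out in Arrow batches,
# runs the function on each batch, and reassembles the result column.
@pandas_udf(DoubleType())
def sum_two(col1: pd.Series, col2: pd.Series) -> pd.Series:
    return col1 + col2

sparkDF = spark.read.parquet(path_to_data)
sparkDF = sparkDF.withColumn('SumOfTwoColumns', sum_two('Column1', 'Column2'))
Koalas (shipped as pyspark.pandas since Spark 3.2) goes further and lets much of the original pandas code run on Spark with little more than the import changed.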
Although it's not recommended to use Pandas while working with PySpark, sometimes I have seen people do it anyway.
Basically, it seems that the person who did that work feels more comfortable in Pandas. Of course, Pandas doesn't scale, and if your data set grows, you need more RAM and probably a faster CPU (faster in terms of single-core performance). While this may be limiting for some scenarios, it seems that in this example the CSV is not big enough to warrant using Spark.
I cannot see any other reason.

Pandas panelnd vs dataframe with hierarchical index

I was wondering when and why I should prefer a panel(nd) over a dataframe with hierarchical index, and vice versa. In my very brief experience, I would say that the former is more convenient for slicing, while the latter for mathematical operations. My particular need would be to interactively manipulate 3-5 dimensional panels with convenient slicing and element-wise operations.
Generally stick with a multi-indexed frame as they are more fully supported.
A panelnd is like a generalized n-dim Panel, good mainly for single-dtyped data. It does work like a Panel, but has some quirks and missing features (that's why it's experimental).
There are ways to apply some operations to multiple slabs of an n-dim object (especially via the new apply in 0.13.1; see here).
Once I get to more than 3 dimensions, I mainly 'hold' the data and slice it to work in 2 dimensions, then reassemble it if needed. Storage can also be convenient for these higher-dim objects (e.g. via HDFStore), and was the reason they were created in the first place.
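As a small sketch of that "hold it multi-indexed, slice to 2-D, reassemble" workflow (axis names invented; Panel and panelnd have since been removed from pandas, so this is the form that keeps working):
import numpy as np
import pandas as pd

# Three "dimensions" (experiment, date, item) live in the row index;
# the measured values are ordinary columns.
idx = pd.MultiIndex.from_product(
    [['exp1', 'exp2'], pd.date_range('2014-01-01', periods=3), ['a', 'b']],
    names=['experiment', 'date', 'item'])
df = pd.DataFrame({'value': np.arange(12.0)}, index=idx)

slab = df.xs('exp1', level='experiment')   # slice down to a (date, item) frame
print(slab.unstack('item'))                # a 2-D table of values for that slab
print((df * 2).head())                     # element-wise ops across the whole structure

df.to_hdf('store.h5', key='cube', format='table')   # storage via HDFStore (needs PyTables)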

Haskell: list/vector/array performance tuning

I am trying out Haskell to compute partition functions of models in statistical physics. This involves traversing quite large lists of configurations and summing various observables - which I would like to do as efficiently as possible.
The current version of my code is here: https://gist.github.com/2420539
Some strange things happen when trying to choose between lists and vectors to enumerate the configurations; in particular, to truncate the list, using V.toList . V.take (3^n) . V.fromList (where V is Data.Vector) is faster than just using take, which feels a bit counter-intuitive. In both cases the list is evaluated lazily.
The list itself is built using iterate; if instead I use Vectors as much as possible and build the list by using V.iterateN, again it becomes slower ...
My question is, is there a way (other than splicing V.toList and V.fromList at random places in the code) to predict which one will be the quickest? (BTW, I compile everything using ghc -O2 with the current stable version.)
Vectors are strict, and have O(1) subsets (e.g. take). They also have an optimized insert and delete. So you will sometimes see performance improvements by switching data structures on the fly. However, it is usually the wrong approach -- keeping all data in either one form or the other is better. (And you're using UArrays as well -- further confusing the issue).
General rules:
If the data is large and transformed only in bulk fashion, dense, efficient structures like vectors make sense.
If the data is small and traversed linearly, and only rarely, then lists make sense.
Remember that operations on lists and vectors have different complexity, so while iterate . replicate on lists is O(n) but lazy, the same on vectors will not necessarily be as efficient (you should prefer the built-in methods in vector to generate arrays).
Generally, vectors should always be better for numerical operations. It might be that you have to use different functions than you do with lists.
I would stick to vectors only. Avoid UArrays, and avoid lists except as generators.