Processing xlsx files using Python Matplotlib and other libraries - dataframe

Python newbie. It looks to me that in order to build a flexible data collection from xlsx it will be well worth my time to learn the Pandas DataFrame approach. More upfront learning time, but a much better platform once I have it working.
I've played around with some code examples, but I really want an expert to give me advice on the best approach.
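For orientation, here is a minimal sketch of what that approach tends to look like (the workbook, sheet, and column names below are hypothetical, and pd.read_excel needs an engine such as openpyxl installed for .xlsx files):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical workbook, sheet, and column names -- adjust to your data.
df = pd.read_excel("measurements.xlsx", sheet_name="Sheet1")

# Quick sanity checks on what was loaded.
print(df.head())
print(df.describe())

# Plotting straight from DataFrame columns uses Matplotlib under the hood,
# assuming the sheet has columns named "time" and "value".
df.plot(x="time", y="value")
plt.show()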

Related

In what situations are Datasets preferred to Dataframes and vice-versa in Apache Spark?

I have been searching for any links, documents, or articles that will help me understand when to go for Datasets over Dataframes and vice versa.
All I find on the internet are headlines about when to use a Dataset, but when opened they just list the differences between a Dataframe and a Dataset. There are so many links that merely list differences under the guise of scenarios.
There is only one question on Stack Overflow that has the right title, but even in that answer the Databricks documentation link is not working.
I am looking for some information that can help me understand, fundamentally, when to go for a Dataset, or in what scenarios a Dataset is preferred over a Dataframe and vice versa.
If not an answer, even a link or documentation that can help me understand would be appreciated.
The page you are looking for has moved here. According to the session, in summary, the Dataset API is available for Scala (and Java) only, and it combines the benefits of both RDDs and Dataframes, which are:
Functional Programming (RDDs)
Type-safe (RDDs)
Relational (Dataframes)
Catalyst query optimization (Dataframes)
Tungsten direct/packed RAM (Dataframes)
JIT code generation (Dataframes)
Sorting/Shuffling without deserializing (Dataframes)
In addition, Datasets consume less memory and can catch analysis errors at compile time, whereas with Dataframes they are only caught at runtime. This is also a good article.
Therefore, the answer is that you are better off using Datasets when you are coding in Scala or Java and want functional programming and lower memory use on top of all the Dataframe capabilities.
Datasets are preferred over Dataframes in Apache Spark when the data is strongly typed, i.e., when the schema is known ahead of time and the data is not necessarily homogeneous. This is because Datasets can enforce type safety, which means that type errors will be caught at compile time rather than at runtime. In addition, Datasets can take advantage of the Catalyst optimizer, which can lead to more efficient execution. Finally, Datasets can be easily converted to Dataframes, so there is no need to choose between the two upfront.

Why Import pandas in PySpark?

Hi. At university, in the data science area, we learned that if we want to work with small data we should use Pandas, and if we work with Big Data we should use Spark (PySpark in the case of Python programmers).
Recently, in a hackathon in the cloud (Azure Synapse, which runs on Spark), I saw pandas being imported in the notebook (I suppose the code is good since it was written by Microsoft people):
import pandas
from azureml.core import Dataset
training_pd = training_data.toPandas().to_csv('training_pd.csv', index=False)
Why do they do that?
Pandas dataframes do not support parallelization. On the other hand, with Pandas, you need no cluster, you have more libraries and easy-to-extend examples. And let's be real, its performance is better for every task that doesn't require scaling.
So, if you start your data engineering life learning Pandas, you're stuck with two things:
Externalized knowledge: ready-made code, snippets, and projects;
Internalized knowledge: API that you know well and prefer much more, patterns, guarantees, and gut feeling how to write this code in general.
To a man with a hammer, everything looks like a nail. And that's not always a bad thing. If you have strict deadlines, done is better than perfect! Better to use Pandas now than to spend years learning proper scalable solutions.
Imagine you want to use an Apache Zeppelin notebook in PySpark mode, with all those cool visualizations. But it doesn't quite meet your requirements, and you're thinking about how to quick-fix that. At the same time, you can instantly google a ready-made solution for Pandas. That is the way to go; you have no other option to meet your deadlines.
Another guess is that if you write your code in Python, you can debug it easily in every good IDE like PyCharm, using the interactive debugger. And that generally isn't valid for online notebooks, especially in Spark mode. Do you know any good debugger for Spark? I know nothing (people from the Big Data Tools plugin for IDEA are trying to fix this for Scala, but not for Python, as far as I know). So you have to write code in the IDE and then copy-paste it into the notebook.
And last but not least, it may be just a mistake. People do not always perfectly know what they're doing, especially in a field as large as Big Data. You're fortunate to have this university course. The average Joe on the internet has no such option.
I should stop here because only speculations lie ahead.
The main difference between working with PySpark and Pandas is the syntax. To show this difference, I provide a simple example of reading in a parquet file and doing some transformations on the data. As you can see, the syntax is completely different between PySpark and Pandas, which means that your Pandas knowledge is not directly transferable to PySpark.
# Pandas
import pandas as pd

pandasDF = pd.read_parquet(path_to_data)
pandasDF['SumOfTwoColumns'] = pandasDF['Column1'] + pandasDF['Column2']
pandasDF.rename({'Column1': 'Col1', 'Column2': 'Col2'}, axis=1, inplace=True)

# PySpark
from pyspark.sql.functions import col

sparkDF = spark.read.parquet(path_to_data)
sparkDF = sparkDF.withColumn('SumOfTwoColumns', col('Column1') + col('Column2'))
sparkDF = sparkDF.withColumnRenamed('Column1', 'Col1').withColumnRenamed('Column2', 'Col2')
These differences in usage, but also in syntax, mean that there is a learning curve when transferring from pure Pandas code to pure PySpark code. This also means that your legacy Pandas code cannot be used directly on Spark with PySpark. Luckily, there are solutions that allow you to use your Pandas code and knowledge on Spark.
Solutions to leverage the power of Spark with Pandas
There are mainly two options to use Pandas code on Spark: Koalas and Pandas UDFs.
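As a rough illustration of the second option, here is a minimal sketch of a Pandas UDF in PySpark (assuming Spark 3.x with PyArrow available; the data and column names are made up):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
sparkDF = spark.createDataFrame([(1, 2), (3, 4)], ['Column1', 'Column2'])

# A Pandas UDF receives and returns pandas Series, so existing vectorized
# Pandas logic can be reused while Spark handles the distribution.
@pandas_udf('long')
def sum_cols(a: pd.Series, b: pd.Series) -> pd.Series:
    return a + b

sparkDF = sparkDF.withColumn('SumOfTwoColumns', sum_cols('Column1', 'Column2'))
sparkDF.show()

Koalas, the other option, has since been folded into PySpark itself as pyspark.pandas (from Spark 3.2 onward), which mimics the Pandas API on top of Spark DataFrames.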
Although it's not recommended to use Pandas while working with PySpark, I have sometimes seen people do the same.
Basically, it seems that the person who did that work feels more comfortable in Pandas. Of course, Pandas doesn't scale, and if your data set grows you need more RAM and probably a faster CPU (faster in terms of single-core performance). While this may be limiting for some scenarios, it seems that in this example the CSV would not be big enough to warrant Spark.
I cannot see any other reason.

Data post-processing after the simulation in Dymola

I am looking for a better way to do data post-processing after a simulation in Dymola. I could use the MATLAB scripts shipped with the Dymola installation, but are there more user-friendly tools for post-processing? For example, I want to get the data between 10 s and 100 s.
One alternative would be using Python and DyMat. To me this proved to be one of the best solutions outside of Matlab.
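For the time-window part of the question, a minimal sketch with DyMat plus NumPy might look like this (the result file and variable names are hypothetical, and the exact DyMat call signatures should be double-checked against its documentation):

import numpy as np
import DyMat

# Hypothetical result file and variable name -- adjust to your model.
res = DyMat.DyMatFile('MyModel.mat')

t = res.abscissa('body.v', valuesOnly=True)   # simulation time values
v = res.data('body.v')                        # the trajectory itself

# Keep only the samples between 10 s and 100 s.
mask = (t >= 10.0) & (t <= 100.0)
t_window, v_window = t[mask], v[mask]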

Curious on how to use some basic machine learning in a web application

A co-worker and I had an idea to create a little web game where a user enters a chunk of data about themselves and then the application would write for them to sound like them in certain structures. (Trying to leave the idea a little vague.) We are both new to ML and thought this could be a fun first dive.
We have a decent bit of background with PHP, JavaScript (FE and Node), Ruby, a little bit of other languages, and have had an interest in learning Python for ML. I'm curious whether you can run a cost-efficient ML library for text well with a web app, given that most servers lack GPUs?
Perhaps you have to pay for one of the cloud-based systems, but I wanted to find the best entry point for this idea without racking up too much cost. (So far I have been reading about running PyTorch or TensorFlow, but it sounds like you lose a lot of efficiency running on CPUs.)
Thank you!
(My other thought is doing it via an iOS app and trying Apple's ML setup.)
It sounds like you are looking for something like TensorFlow.js.
Yes, before jumping into training something with Deep Learning (it might even be unnecessary for your purpose), try to build a nice and simple baseline for this.
Before Deep Learning (just a few years ago), people did similar tasks using n-gram feature-based language models. https://web.stanford.edu/~jurafsky/slp3/3.pdf
Essentially, you try to predict the next few words probabilistically, given a small context (of n words; typically n is small, like 5 or 6).
This should be a lot of fun to work out and will certainly do quite well with a small amount of data. Also, such a model will run blazingly fast, so you don't have to worry about GPUs and compute.
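To make that concrete, here is a toy sketch of such a count-based n-gram generator in plain Python (the training string is obviously made up; a real version would use far more text and smooth the counts):

import random
from collections import defaultdict

def build_ngram_model(text, n=3):
    # Record which word follows each (n-1)-word context.
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        model[context].append(words[i + n - 1])
    return model

def generate(model, seed, length=20, n=3):
    # Repeatedly sample the next word from the observed continuations.
    out = list(seed)
    for _ in range(length):
        choices = model.get(tuple(out[-(n - 1):]))
        if not choices:
            break
        out.append(random.choice(choices))
    return ' '.join(out)

sample_text = "the user writes about the things the user likes and the user writes again"
model = build_ngram_model(sample_text, n=3)
print(generate(model, seed=('the', 'user'), n=3))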
To improve on these results with Deep Learning, you'll need to collect a ton of data first, and it will take work to get it running fast on a web-based platform.

What is better Orange.data.Table or Pandas for data manage in python?

I am doing data mining and I don't know whether to use Table or Pandas.
Any information to help me select the most suitable library for managing my dataset would be welcome. Thanks for any answer that helps me with this.
I am an Orange programmer, and I'd say that if you are writing python scripts to analyze data, start with numpy + sklearn or Pandas.
To create an Orange.data.Table, you need to define a Domain, which Orange uses for data transformations. Thus, tables in Orange are harder to create (but can, for example, provide automatic processing of testing data).
Of course, if you need to interface something specific from Orange, you will have to make a Table.
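For completeness, here is a minimal sketch of what defining a Domain and building a Table can look like in Orange3 (the variable names and values are invented, and the constructor details are worth checking against the Orange documentation):

import numpy as np
from Orange.data import ContinuousVariable, DiscreteVariable, Domain, Table

# Two numeric features and a two-valued class -- all names are made up.
domain = Domain(
    [ContinuousVariable('height'), ContinuousVariable('weight')],
    DiscreteVariable('label', values=('no', 'yes')),
)

X = np.array([[1.7, 65.0], [1.8, 80.0]])
Y = np.array([0, 1])   # indices into the 'label' values

table = Table.from_numpy(domain, X, Y)

In Pandas, by contrast, something like pd.DataFrame(X, columns=['height', 'weight']) would be enough, which is why the answer above suggests starting there for plain scripting.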