I am using df.collect() to get the complete list, but performance-wise it is very slow. Can anyone suggest a better way to convert a complete dataframe to a list, and not just a single column?
I am currently in a scenario in which I have a (really inefficient) for loop that attempts to process a pandas dataframe df with complicated aggregation functions (say complicated_function) over some groupby keys (respectively groupby_keys) as follows:
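final_result = []
for groupby_key in groupby_keys:
    # Process the rows belonging to this key only
    small_df = complicated_function(df.loc[df['groupby_key'] == groupby_key])
    # Collect each partial result so pd.concat has something to combine
    final_result.append(small_df)
df_we_want = pd.concat(final_result)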
I am trying to see if there is a more efficient way of dealing with this, rather than having to use a for loop, especially if there are many groupby_keys.
Is it possible to convert all this into a single function for me to just groupby and agg/concatenate or pipe? Or is this procedure doomed to be constrained to the for loop? I have tried multiple combinations but have been getting an assortment of errors (have not really been able to pin down something specific).
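One way to avoid re-filtering the whole dataframe for every key is to let groupby split it once and then apply the function to each piece. The following is only a minimal sketch: the body of complicated_function and the "value" column are hypothetical stand-ins, not the original code.

```python
import pandas as pd

def complicated_function(group: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical stand-in: keep the rows above the group's mean value.
    return group[group["value"] > group["value"].mean()]

df = pd.DataFrame({
    "groupby_key": ["a", "a", "a", "b", "b"],
    "value": [1, 2, 6, 10, 20],
})

# groupby splits df once; the loop no longer scans the full frame per key.
pieces = [
    complicated_function(group)
    for _, group in df.groupby("groupby_key", sort=False)
]
df_we_want = pd.concat(pieces)
```

df.groupby(...).apply(complicated_function) is the more compact spelling of the same idea, but its handling of the grouping column has changed between pandas versions, so the explicit iteration above is the more conservative choice.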
I have set up Dask with its dashboard (it is really good).
I was looking through its flame graph for the pandas query df.groupby('myId').agg(['min', 'max','mean','std']) and I could see that each aggregation function takes its own time to compute.
Is it possible to improve this by doing everything in one pass?
Note: I also need the index of max and min, so I am looking for a solution where I can get all of this in one pass.
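As a starting point, the index of the minimum and maximum can at least be requested in the same agg call by adding 'idxmin' and 'idxmax' to the list. This is a sketch on made-up data (the "value" column and the index labels are assumptions). It is a single call, not necessarily a single pass: pandas still evaluates each aggregation separately.

```python
import pandas as pd

df = pd.DataFrame(
    {"myId": [1, 1, 1, 2, 2], "value": [5.0, 3.0, 9.0, 4.0, 8.0]},
    index=[10, 11, 12, 13, 14],
)

# idxmin / idxmax return the index label of the extreme row in each group.
stats = df.groupby("myId")["value"].agg(
    ["min", "max", "mean", "std", "idxmin", "idxmax"]
)
```

Each row of stats then holds the four statistics plus the two index labels for one myId.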
Currently, I am working on a single-node Hadoop cluster, and I wrote a job to output a sorted dataframe with only one partition to a single CSV file. I got different results depending on how I used repartition.
At first, I used orderBy to sort the data and then used repartition to output a CSV file, but the output was sorted in chunks rather than globally.
Then I dropped the repartition call, but the output contained only part of the records. I realized that without repartition, Spark writes 200 CSV files instead of 1, even though I am working on a one-partition dataframe.
Next, I tried placing repartition(1), repartition(1, "column of partition"), and repartition(20) before orderBy. The output was still 200 CSV files.
So I used coalesce(1) before orderBy, and the problem was fixed.
I do not understand why a single-partition dataframe needs repartition or coalesce at all, and how the steps above affect the output. I would be grateful if someone could elaborate a little.
Spark has two relevant parameters here:
spark.sql.shuffle.partitions and spark.default.parallelism.
When you perform an operation such as the sort in your case, it triggers something called a shuffle.
That shuffle splits your dataframe into spark.sql.shuffle.partitions partitions.
I struggled with the same problem and did not find an elegant solution.
Spark generally doesn’t have a strong concept of ordered data, because all your data is split across multiple partitions. Every time you call an operation that requires a shuffle, your ordering will change.
For this reason, you’re better off only sorting your data in spark for the operations that really need it.
Forcing your data into a single file will break when the dataset gets larger.
As Miroslav points out, your data gets shuffled between partitions every time you trigger what’s called a shuffle stage (this includes grouping, join, and window operations).
You can set the number of shuffle partitions in the Spark config; the default is 200.
Calling repartition before a groupby operation is kind of pointless, because Spark needs to repartition your data again to execute the groupby.
Coalesce operations sometimes get pushed into the shuffle stage by Spark, so maybe that’s why it worked. Either that, or because you called it after the groupby operation.
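To see why sorting and then repartitioning can give output that is "sorted in chunks", here is a toy model in plain Python. It is not Spark code and makes simplifying assumptions: the sort is modelled as range partitions that are each sorted, and the later shuffle into one partition is modelled as those chunks arriving in arbitrary order. All function names are made up for the illustration.

```python
import random

def range_partition_sort(rows, num_partitions):
    """Model a distributed sort: each partition holds one sorted key range.

    Sorting everything and slicing gives the same result as
    range-partitioning first and sorting each partition afterwards.
    """
    ordered = sorted(rows)
    size = -(-len(ordered) // num_partitions)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def merge_in_arrival_order(partitions, rng):
    """Model a later shuffle into one partition: chunks arrive in any order."""
    arrival = partitions[:]
    rng.shuffle(arrival)
    return [row for chunk in arrival for row in chunk]

rng = random.Random(0)
rows = rng.sample(range(100), 20)

partitions = range_partition_sort(rows, num_partitions=4)
single_file = merge_in_arrival_order(partitions, rng)
```

In this model every chunk inside single_file is sorted, but the chunks themselves are not guaranteed to appear in key order, which matches the behaviour described in the question.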
A good way to understand what’s going on with your query is to start using the Spark UI; it’s normally available at http://localhost:4040
More info here: https://spark.apache.org/docs/3.0.0-preview/web-ui.html
I have a pandas dataframe containing 100 million tweets.
I have extracted URLs from the data and have currently stored them as a list in a pandas column:
I want to run analysis on these URLs (like sorting by domain name, or finding out what type of user posted which domains).
Is it possible to store like this:
where the URL column is a pandas Series with dynamic size so I can easily process it? Otherwise, what would be the best way to store the URLs for efficiency and speed while applying pandas operations?
Yes, if you concatenate the strings with \n, like 'url1\nurl2\nurl3'.
If you have a list of URLs, you can use join:
listurl = ['url1','url2','url3']
joined = '\n'.join(listurl)  # 'url1\nurl2\nurl3'
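For the analysis side, a rough sketch of how the joined strings could be turned back into one row per URL is shown below. The column names ("user", "urls") and the sample data are assumptions, and explode is an addition of this sketch rather than part of the answer above.

```python
import pandas as pd
from urllib.parse import urlparse

df = pd.DataFrame({
    "user": ["alice", "bob"],
    "urls": [["https://a.com/x", "https://b.org/y"], ["https://a.com/z"]],
})

# Storage as suggested: one newline-separated string per tweet.
df["urls_joined"] = df["urls"].str.join("\n")

# For analysis: split again and give every URL its own row.
per_url = df.assign(url=df["urls_joined"].str.split("\n")).explode("url")
per_url["domain"] = per_url["url"].map(lambda u: urlparse(u).netloc)

# Example query: which domains did each user post?
domains_by_user = per_url.groupby("user")["domain"].agg(list)
```

With one URL per row, sorting by domain or grouping by user becomes an ordinary column operation.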
Is there a general explanation for why Spark needs so much more time to calculate the maximum value of a column?
I imported the Kaggle Quora training set (over 400,000 rows) and I like what Spark is doing when it comes to row-wise feature extraction. But now I want to scale a column 'manually': find the maximum value of a column and divide by that value.
I tried the solutions from "Best way to get the max value in a Spark dataframe column" and https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
I also tried df.toPandas() and then calculating the max in pandas (you guessed it, df.toPandas() took a long time).
The only thing I have not tried yet is the RDD way.
Before I provide some test code (I have to find out how to generate dummy data in Spark), I'd like to know:
Can you give me a pointer to an article discussing this difference?
Is Spark more sensitive to memory constraints on my computer than pandas?
As @MattR has already said in the comment, you should use pandas unless there's a specific reason to use Spark.
Usually you don't need Apache Spark unless you encounter MemoryError with Pandas. But if one server's RAM is not enough, then Apache Spark is the right tool for you. Apache Spark has an overhead, because it needs to split your data set first, then process those distributed chunks, then process and join "processed" data, collect it on one node and return it back to you.
@MaxU, @MattR, I found an intermediate solution that also makes me reassess Spark's laziness and understand the problem better.
sc.accumulator helps me define a global variable, and with a separate AccumulatorParam object I can calculate the maximum of the column on the fly.
In testing this I noticed that Spark is even lazier than expected, so this part of my original post, 'I like what spark is doing when it comes to rowwise feature extraction', boils down to 'I like that Spark is doing nothing quite fast'.
On the other hand, much of the time spent calculating the maximum of the column was presumably spent computing the intermediate values.
Thanks for your input; this topic really got me much further in understanding Spark.