Spark DataFrame reduceByKey-like operation - SQL

I have a Spark dataframe with the following data (I use spark-csv to load the data in):
key,value
1,10
2,12
3,0
1,20
Is there anything similar to Spark RDD's reduceByKey that can return a Spark DataFrame as below (basically, summing up the values for the same key):
key,value
1,30
2,12
3,0
(I can transform the data to an RDD and do a reduceByKey operation, but is there a more Spark-DataFrame-API way to do this?)

If you don't care about column names you can use groupBy followed by sum:
df.groupBy($"key").sum("value")
Otherwise it is better to replace sum with agg:
df.groupBy($"key").agg(sum($"value").alias("value"))
Finally you can use raw SQL:
df.registerTempTable("df")
sqlContext.sql("SELECT key, SUM(value) AS value FROM df GROUP BY key")
See also DataFrame / Dataset groupBy behaviour/optimization
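
For completeness, a minimal PySpark sketch of the same aggregation (assuming an active SparkSession named spark; the sample rows mirror the question):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10), (2, 12), (3, 0), (1, 20)], ["key", "value"])

# groupBy + agg stays entirely in the DataFrame API (no RDD conversion needed)
df.groupBy("key").agg(F.sum("value").alias("value")).show()
# expected rows (order may vary): (1, 30), (2, 12), (3, 0)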

I think user goks missed out part of the code; it's not tested code.
.map should have been used to convert the RDD into a pair RDD first, e.g. .map(lambda x: (x, 1)).reduceByKey(...).
reduceByKey is not available on a single-value or regular RDD, only on a pair RDD.
Thx

How about this? I agree it still converts to an RDD and then back to a DataFrame.
df.select('key', 'value').rdd.map(lambda x: (x['key'], x['value'])).reduceByKey(lambda a, b: a + b).toDF(['key', 'value'])

Related

One-to-one mapping of data in the row in pandas

I have a dataset that looks like this:
And I want the output of this data frame to look like this. So it's kind of a one-to-one mapping of row values. Assume option1 and option2 have the same comma-separated values.
Please let me know how I can achieve this.
You can use the zip() function from the standard Python library and the explode() method of the pandas DataFrame, like this:
df["option1"] = df["option1"].str.split(",")
df["option2"] = df["option2"].str.split(",")
df["option3"] = df["option3"]*max(df["option1"].str.len().max(), df["option2"].str.len().max())
new_df = pd.DataFrame(df.apply(lambda x: list(zip(x[0], x[1], x[2])), axis=1).explode().to_list(), columns=df.columns)
new_df
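
Since the original screenshots are not included, here is a small invented example of the kind of input the snippet above expects and the output it produces (column names match the question; the values are assumptions, with option3 holding a single value per row):
import pandas as pd

# hypothetical input: option1/option2 hold comma-separated values of equal length
df = pd.DataFrame({"option1": ["a,b,c"],
                   "option2": ["x,y,z"],
                   "option3": ["k"]})

# after running the snippet above, new_df is:
#   option1 option2 option3
# 0       a       x       k
# 1       b       y       k
# 2       c       z       k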

Can I turn a Pyspark-SQL groupby into a dataframe? I need to join it later

I have a groupby that I want to have as a pyspark dataframe, as I need to join the resulting data with another dataset that I have.
So basically, I just want this table to be a dataframe that I can perform dataframe operations on.
DATE          COUNT
01/12/2019    583
02/14/2020    421
crash_orig.groupBy('Date').count().sort(desc('count')).show()
Just use the assignment operator to save the result to a variable; it is already a DataFrame:
df = crash_orig.groupBy('Date').count().sort(desc('count'))
df.show()
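
Once it is assigned, df is an ordinary DataFrame, so the later join is straightforward. A minimal sketch, assuming a hypothetical second dataset other_df that shares the Date column (crash_orig is recreated here with made-up rows just so the sketch runs end to end):
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

spark = SparkSession.builder.getOrCreate()
crash_orig = spark.createDataFrame([('01/12/2019',), ('01/12/2019',), ('02/14/2020',)], ['Date'])
other_df = spark.createDataFrame([('01/12/2019', 'NYC'), ('02/14/2020', 'LA')], ['Date', 'City'])

df = crash_orig.groupBy('Date').count().sort(desc('count'))
joined = df.join(other_df, on='Date', how='left')
joined.show()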

How to convert pandas dataframe to single index after aggregation?

I have been playing with aggregation in a pandas DataFrame. Consider the following dataframe:
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8],
                   'batch': ['q', 'q', 'q', 'w', 'w', 'w', 'w', 'e'],
                   'c': [4, 1, 3, 4, 5, 1, 3, 2]})
I have to do aggregation on the batch column with mean for column a and min for column c.
I used the following method to do the aggregation:
agg_dict = {'a':{'a':'mean'},'c':{'c':'min'}}
aggregated_df = df.groupby("batch").agg(agg_dict)
The problem is that I want the final data frame to have the same columns as the original data frame with the slight difference of having the aggregated values present in each of the columns.
The result of the above aggregation is a multi-index data frame, and I am not sure how to convert it to a single-index data frame.
I followed the link Reverting from multiindex to single index dataframe in pandas, but this didn't work, and the final output was still a multi-index data frame.
It would be great if someone could help.
You can try the following code:
df.groupby('batch').aggregate({'c': 'min', 'a': 'mean'})
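
For what it's worth, a short sketch showing that a flat aggregation dict (one function per column) already yields single-level columns; passing as_index=False to keep batch as a regular column is an extra assumption on my part:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7, 8],
                   'batch': ['q', 'q', 'q', 'w', 'w', 'w', 'w', 'e'],
                   'c': [4, 1, 3, 4, 5, 1, 3, 2]})

# one function per column -> no nested dict, so the result has a flat column index
aggregated_df = df.groupby('batch', as_index=False).agg({'a': 'mean', 'c': 'min'})
print(aggregated_df)
#   batch    a  c
# 0     e  8.0  2
# 1     q  2.0  1
# 2     w  5.5  1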

Flatten and rename multi-index agg columns

I have some Pandas / cudf code that aggregates a particular column using two aggregate methods, and then renames the multi-index columns to flattened columns.
df = (
    some_df
    .groupby(["some_dimension"])
    .agg({"some_metric": ["sum", "max"]})
    .reset_index()
    .rename(columns={"some_dimension": "some_dimension__id",
                     ("some_metric", "sum"): "some_metric_sum",
                     ("some_metric", "max"): "some_metric_max"})
)
This works great in cudf, but does not work in Pandas 0.25 -- the hierarchy is not flattened out.
Is there a similar approach using Pandas? I like the cudf tuple syntax and how they just implicitly flatten the columns. Hoping to find a similarly easy way to do it in Pandas.
Thanks.
In pandas 0.25.0+ there is something called groupby aggregation with relabeling.
Here is a stab at your code
df = (some_df
      .groupby(["some_dimension"])
      .agg(some_metric_sum=("some_metric", "sum"),
           some_metric_max=("some_metric", "max"))
      .reset_index()
      .rename(columns={"some_dimension": "some_dimension__id"}))
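
If you would rather keep the multi-index agg (for example on older pandas without named aggregation), you can also flatten the columns after the fact. A sketch of that alternative, with a tiny made-up frame reusing the question's column names:
import pandas as pd

some_df = pd.DataFrame({"some_dimension": ["a", "a", "b"],
                        "some_metric": [1, 2, 3]})

agg_df = some_df.groupby(["some_dimension"]).agg({"some_metric": ["sum", "max"]})
# join the two column levels: ("some_metric", "sum") -> "some_metric_sum"
agg_df.columns = ["_".join(col) for col in agg_df.columns.to_flat_index()]
agg_df = agg_df.reset_index().rename(columns={"some_dimension": "some_dimension__id"})
print(agg_df)
#   some_dimension__id  some_metric_sum  some_metric_max
# 0                  a                3                2
# 1                  b                3                3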

How to translate a pandas group by without aggregation to pyspark?

I am trying to convert the following pandas line into pyspark:
df = df.groupby('ID', as_index=False).head(1)
Now, I am familiar with the pyspark df.groupby("col1", "col2") method, as well as the following to get whatever the first element is within a group:
df = df.withColumn("row_num", row_number().over(Window.partitionBy("ID").orderBy("SOME_DATE_COLUMN"))).where(col("row_num") < 2)
However, without an orderBy argument, this grouping and fetching of the first element in each group doesn't work (and I am literally trying to convert from pandas to spark, whatever the pandas line does):
An error occurred while calling o2547.withColumn.
: org.apache.spark.sql.AnalysisException: Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table
Looking at the pandas groupby documentation, I cannot grasp what groupby does without a following sort/agg function applied to the groups; i.e., what is the default order within a group from which .head(1) fetches the first element?
It depends on the order of your pandas dataframe before the groupby. From the pandas groupby documentation:
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
Converting the pandas behaviour exactly to pyspark is impossible, as pyspark dataframes aren't ordered. But if your data source can provide a row number or something like that, it is possible.
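
If the source can attach such a column (here a hypothetical load_order), a minimal sketch of emulating head(1) per group; note that without a real ordering column Spark gives no guarantee about which row counts as "first":
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
# made-up data: load_order stands in for whatever ordering your source provides
df = spark.createDataFrame([(1, "a", 1), (1, "b", 2), (2, "c", 3)], ["ID", "val", "load_order"])

w = Window.partitionBy("ID").orderBy("load_order")
first_per_group = (df.withColumn("row_num", F.row_number().over(w))
                     .where(F.col("row_num") == 1)
                     .drop("row_num"))
first_per_group.show()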