Flatten and rename multi-index agg columns - pandas

I have some Pandas / cudf code that aggregates a particular column using two aggregate methods, and then renames the multi-index columns to flattened columns.
df = (
    some_df
    .groupby(["some_dimension"])
    .agg({"some_metric": ["sum", "max"]})
    .reset_index()
    .rename(columns={
        "some_dimension": "some_dimension__id",
        ("some_metric", "sum"): "some_metric_sum",
        ("some_metric", "max"): "some_metric_max",
    })
)
This works great in cudf, but does not work in Pandas 0.25 -- the hierarchy is not flattened out.
Is there a similar approach using Pandas? I like the cudf tuple syntax and how they just implicitly flatten the columns. Hoping to find a similarly easy way to do it in Pandas.
Thanks.

In pandas 0.25.0+ there is support for groupby aggregation with relabeling, known as named aggregation.
Here is a stab at your code:
df = (some_df
      .groupby(["some_dimension"])
      .agg(some_metric_sum=("some_metric", "sum"),
           some_metric_max=("some_metric", "max"))
      .reset_index()
      .rename(columns={"some_dimension": "some_dimension__id"}))
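If you'd rather keep the dict-style .agg, a common alternative (a sketch reusing the question's column names) is to flatten the MultiIndex columns yourself before renaming:
df = (
    some_df
    .groupby(["some_dimension"])
    .agg({"some_metric": ["sum", "max"]})
)
# join each (column, agg) tuple into one name,
# e.g. ("some_metric", "sum") -> "some_metric_sum"
df.columns = ["_".join(col) for col in df.columns.to_flat_index()]
df = df.reset_index().rename(columns={"some_dimension": "some_dimension__id"})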

Related

One to One mapping of data in the row in pandas

I have a dataset that looks like this
And I want the output of this data frame to look like this. So it's a kind of one-to-one mapping of row values. Assume option1 and option2 have the same number of comma-separated values.
Please let me know how I can achieve this?
You can use the zip() function from the standard Python library and the explode() method of the Pandas DataFrame, like this:
df["option1"] = df["option1"].str.split(",")
df["option2"] = df["option2"].str.split(",")
df["option3"] = df["option3"]*max(df["option1"].str.len().max(), df["option2"].str.len().max())
new_df = pd.DataFrame(df.apply(lambda x: list(zip(x[0], x[1], x[2])), axis=1).explode().to_list(), columns=df.columns)
new_df
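On pandas 1.3+, a shorter alternative sketch is a multi-column explode, which repeats the scalar option3 automatically:
df[["option1", "option2"]] = df[["option1", "option2"]].apply(lambda s: s.str.split(","))
new_df = df.explode(["option1", "option2"], ignore_index=True)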

Grouping filter on Pandas

I was trying to apply a filter to a grouping function, but I am not getting the syntax right. I am looking for the same feature or functionality that a HAVING clause gives us in SQL, but in Pandas.
This is my query, and I want to filter the result to rows where count >= 5:
home.groupby('location').agg({'price_per_sqft':['mean','std','count']})
Could you show me how to filter the result?
First, to avoid a MultiIndex, select the price_per_sqft column after the groupby, then filter by boolean indexing:
df = home.groupby('location')['price_per_sqft'].agg(['mean','std','count'])
df1 = df[df['count']>=5]
Or DataFrame.query:
df1 = df.query("count>=5")
Another idea is to use named aggregation:
df = home.groupby('location').agg(avg=('price_per_sqft', 'mean'),
                                  std=('price_per_sqft', 'std'),
                                  counts=('price_per_sqft', 'count'))
df1 = df[df['counts']>=5]
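If you prefer one chain, the same idea combines named aggregation with query, much like SQL's HAVING (a sketch reusing the names above):
df1 = (home.groupby('location')
           .agg(avg=('price_per_sqft', 'mean'),
                std=('price_per_sqft', 'std'),
                counts=('price_per_sqft', 'count'))
           .query("counts >= 5"))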

pandas groupby returns multiindex with two or more aggregates

When grouping by a single column and using as_index=False, the behavior in pandas is as expected. However, when I use .agg, as_index no longer appears to behave as expected; in short, it doesn't appear to matter.
# imports
import pandas as pd
import numpy as np
# set the seed
np.random.seed(834)
df = pd.DataFrame(np.random.rand(10, 1), columns=['a'])
df['letter'] = np.random.choice(['a','b'], size=10)
summary = df.groupby('letter', as_index=False).agg([np.count_nonzero, np.mean])
summary
returns:
                   a
       count_nonzero      mean
letter
a                6.0  0.539313
b                4.0  0.456702
I would have expected the index to be 0 1, with letter as a column in the dataframe.
In summary, I want to be able to group by one or more columns, summarize a single column with multiple aggregates, and return a dataframe that does not have the group by columns as the index, nor a Multi Index in the column.
The comment from @Trenton did the trick.
summary = df.groupby('letter')['a'].agg([np.count_nonzero, np.mean]).reset_index()
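An equivalent sketch with named aggregation (pandas 0.25+) avoids the MultiIndex entirely, and as_index=False then behaves as expected:
summary = df.groupby('letter', as_index=False).agg(
    count_nonzero=('a', np.count_nonzero),
    mean=('a', 'mean'),
)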

Preferred pandas code for selecting all rows and a subset of columns

Suppose that you have a pandas DataFrame named df with columns ['a','b','c','d','e'] and you want to create a new DataFrame newdf with columns 'b' and 'd'. There are two possible ways to do this:
newdf = df[['b','d']]
or
newdf = df.loc[:,['b','d']]
The first is using the indexing operator. The second is using .loc. Is there a reason to prefer one over the other?
Thanks to @coldspeed, it seems that newdf = df.loc[:,['b','d']] is preferred to avoid the dreaded SettingWithCopyWarning.
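As a quick illustration (hypothetical assignment), the warning typically fires when you later write to a sliced frame; an explicit copy makes the intent unambiguous:
newdf = df[['b', 'd']]
newdf['b'] = 0          # may raise SettingWithCopyWarning: modify df or the slice?
newdf = df.loc[:, ['b', 'd']].copy()
newdf['b'] = 0          # explicit copy, so the write is clearly local: no warning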

Spark dataframe reducebykey like operation

I have a Spark dataframe with the following data (I use spark-csv to load the data in):
key,value
1,10
2,12
3,0
1,20
Is there anything similar to the Spark RDD reduceByKey which can return a Spark DataFrame as follows? (Basically, summing up values for the same key.)
key,value
1,30
2,12
3,0
(I can transform the data to RDD and do a reduceByKey operation, but is there a more Spark DataFrame API way to do this?)
If you don't care about column names you can use groupBy followed by sum:
df.groupBy($"key").sum("value")
otherwise it is better to replace sum with agg:
df.groupBy($"key").agg(sum($"value").alias("value"))
Finally you can use raw SQL:
df.registerTempTable("df")
sqlContext.sql("SELECT key, SUM(value) AS value FROM df GROUP BY key")
See also DataFrame / Dataset groupBy behaviour/optimization
I think user goks missed out on some part in the code; it's not tested code. .map should have been used to convert the RDD to a pair RDD, e.g. .map(lambda x: (x, 1)), before calling .reduceByKey. reduceByKey is not available on a single-value or regular RDD, only on a pair RDD.
Thx
How about this? I agree this still converts to an RDD and then back to a DataFrame.
df.select('key', 'value').rdd.map(lambda x: (x['key'], x['value'])).reduceByKey(lambda a, b: a + b).toDF(['key', 'value'])
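For completeness, a pure DataFrame-API sketch in PySpark (assuming a SparkSession and the same df with columns key and value) avoids the RDD round trip entirely:
from pyspark.sql import functions as F

result = df.groupBy("key").agg(F.sum("value").alias("value"))
result.show()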