How to turn a pandas pivot_table into a simple table - pandas

I have a simple DataFrame defined by:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'year': [2019, 2020, 2019, 2020, 2019, 2020, 2019, 2020],
    'name': ['Alice', 'Alice', 'Alice', 'Alice', 'Bob', 'Bob', 'Bob', 'Bob'],
    'sales': [100, 200, 300, 400, 500, 600, 700, 800]
})
This DataFrame is easily turned into a pivot table using pivot_table:
table = pd.pivot_table(
    df,
    index=['name'],
    columns=['year'],
    aggfunc=np.sum)
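For reference, the resulting table keeps a MultiIndex on the columns, which is what produces the tuple-like JSON keys below:

      sales
year   2019  2020
name
Alice   400   600
Bob    1200  1400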
Now I need to turn this DataFrame into a simple JSON array. Unfortunately, the to_json method doesn't return a simple array:
table.reset_index().to_json(orient="records")
[
  {
    "[\"name\",\"\"]": "Alice",
    "[\"sales\",2019]": 400,
    "[\"sales\",2020]": 600
  },
  {
    "[\"name\",\"\"]": "Bob",
    "[\"sales\",2019]": 1200,
    "[\"sales\",2020]": 1400
  }
]
How can I turn the table DataFrame into a simple DataFrame (without a MultiIndex) whose JSON looks like this?
[
  {
    "name": "Alice",
    "2019": 400,
    "2020": 600
  },
  {
    "name": "Bob",
    "2019": 1200,
    "2020": 1400
  }
]

You need to pass the values parameter to get rid of the MultiIndex; also drop the list wrappers around index and columns when passing a single column:
table = pd.pivot_table(
    df,
    index='name',
    columns='year',
    values='sales',
    aggfunc=np.sum)
table.reset_index().to_json(orient="records")
'[{"name":"Alice","2019":400,"2020":600},{"name":"Bob","2019":1200,"2020":1400}]'
Another alternative, if you like:
out = (df.groupby(['name', 'year'])['sales'].sum().unstack()
         .reset_index().to_json(orient='records'))
'[{"name":"Alice","2019":400,"2020":600},{"name":"Bob","2019":1200,"2020":1400}]'

Related

Create new data frame from unique values of certain columns [duplicate]

Say my data looks like this:
date,name,id,dept,sale1,sale2,sale3,total_sale
1/1/17,John,50,Sales,50.0,60.0,70.0,180.0
1/1/17,Mike,21,Engg,43.0,55.0,2.0,100.0
1/1/17,Jane,99,Tech,90.0,80.0,70.0,240.0
1/2/17,John,50,Sales,60.0,70.0,80.0,210.0
1/2/17,Mike,21,Engg,53.0,65.0,12.0,130.0
1/2/17,Jane,99,Tech,100.0,90.0,80.0,270.0
1/3/17,John,50,Sales,40.0,50.0,60.0,150.0
1/3/17,Mike,21,Engg,53.0,55.0,12.0,120.0
1/3/17,Jane,99,Tech,80.0,70.0,60.0,210.0
I want a new column average, which is the average of total_sale for each (name, id, dept) tuple.
I tried
df.groupby(['name', 'id', 'dept'])['total_sale'].mean()
And this does return a series with the mean:
name id dept
Jane 99 Tech 240.000000
John 50 Sales 180.000000
Mike 21 Engg 116.666667
Name: total_sale, dtype: float64
but how would I reference the data? The result is a one-dimensional Series of shape (3,). Ideally I would like this put back into a DataFrame with proper columns so I can reference rows properly by name/id/dept.
If you call .reset_index() on the series that you have, it will get you a dataframe like you want (each level of the index will be converted into a column):
df.groupby(['name', 'id', 'dept'])['total_sale'].mean().reset_index()
EDIT: to respond to the OP's comment, adding this column back to your original dataframe is a little trickier. You don't have the same number of rows as in the original dataframe, so you can't assign it as a new column yet. However, if you set the index the same, pandas is smart and will fill in the values properly for you. Try this:
cols = ['date', 'name', 'id', 'dept', 'sale1', 'sale2', 'sale3', 'total_sale']
data = [
    ['1/1/17', 'John', 50, 'Sales', 50.0, 60.0, 70.0, 180.0],
    ['1/1/17', 'Mike', 21, 'Engg', 43.0, 55.0, 2.0, 100.0],
    ['1/1/17', 'Jane', 99, 'Tech', 90.0, 80.0, 70.0, 240.0],
    ['1/2/17', 'John', 50, 'Sales', 60.0, 70.0, 80.0, 210.0],
    ['1/2/17', 'Mike', 21, 'Engg', 53.0, 65.0, 12.0, 130.0],
    ['1/2/17', 'Jane', 99, 'Tech', 100.0, 90.0, 80.0, 270.0],
    ['1/3/17', 'John', 50, 'Sales', 40.0, 50.0, 60.0, 150.0],
    ['1/3/17', 'Mike', 21, 'Engg', 53.0, 55.0, 12.0, 120.0],
    ['1/3/17', 'Jane', 99, 'Tech', 80.0, 70.0, 60.0, 210.0]
]
df = pd.DataFrame(data, columns=cols)
mean_col = df.groupby(['name', 'id', 'dept'])['total_sale'].mean() # don't reset the index!
df = df.set_index(['name', 'id', 'dept']) # make the same index here
df['mean_col'] = mean_col
df = df.reset_index() # to take the hierarchical index off again
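A shorter route to the same broadcast column, as a sketch: transform('mean') returns a Series aligned to the original index, so it can be assigned directly without any index juggling.

# transform keeps the original row count, so the result lines up with df
df['mean_col'] = df.groupby(['name', 'id', 'dept'])['total_sale'].transform('mean')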
Another option is adding to_frame:
df.groupby(['name', 'id', 'dept'])['total_sale'].mean().to_frame()
You are very close. You simply need a second set of brackets around [['total_sale']] to tell pandas to return a DataFrame rather than a Series:
df.groupby(['name', 'id', 'dept'])[['total_sale']].mean()
If you want all the grouping keys back as regular columns in one step:
df.groupby(['name', 'id', 'dept'], as_index=False).mean(numeric_only=True)[['name', 'id', 'dept', 'total_sale']]
(numeric_only=True keeps the non-numeric date column from raising an error on newer pandas.)
The answer is in two lines of code:
The first line creates the hierarchical frame.
df_mean = df.groupby(['name', 'id', 'dept'])[['total_sale']].mean()
The second line converts it to a DataFrame with four columns ('name', 'id', 'dept', 'total_sale'):
df_mean = df_mean.reset_index()

How to stack a pd.DataFrame until it becomes pd.Series?

I have the following pd.DataFrame:
df = pd.DataFrame(
    data=[['dog', 'kg', 100, 241], ['cat', 'lbs', 300, 1]],
    columns=['animal', 'unit', 0, 1],
).set_index(['animal', 'unit'])
df.columns = pd.MultiIndex.from_tuples(list(zip(*[['2019', '2018'], ['Apr', 'Oct']])))
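For reference, the constructed frame prints as:

             2019 2018
              Apr  Oct
animal unit
dog    kg     100  241
cat    lbs    300    1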
and I would like to convert it to a 2D table with no multilevel indexes on the rows or the columns:
pd.DataFrame(
    data=[
        ['dog', 'kg', 100, '2019', 'Apr'],
        ['dog', 'kg', 241, '2018', 'Oct'],
        ['cat', 'lbs', 300, '2019', 'Apr'],
        ['cat', 'lbs', 1, '2018', 'Oct']
    ],
    columns=['animal', 'unit', 'value', 'year', 'month']
)
To achieve this, I use df.stack().stack(), which produces a pd.Series, and then I call .reset_index() on that Series to convert it back to a DataFrame.
My question is - how do I avoid the second (or multiple more) stack()?
Is there a way to stack a pd.DataFrame until it becomes a pd.Series?
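One way, as a sketch: DataFrame.stack accepts a list of column levels, so all levels can be stacked in a single call and the index reset afterwards (the column order comes out slightly different from the target above):

# stack both column levels at once, name the index levels, then flatten
out = (df.stack(level=[0, 1])
         .rename_axis(['animal', 'unit', 'year', 'month'])
         .reset_index(name='value'))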

Pandas Groupby: return dict of rows

I would like to group my dataframe by one of the columns and then return a dictionary that has a list of all of the rows per column value. Is there a fast Pandas idiom for doing this?
Example:
test = pd.DataFrame({
    'id': ['alice', 'bob', 'bob', 'charlie'],
    'transaction_date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
    'amount': [50.0, 10.0, 12.0, 13.0]
})
Desired output:
result = {
    'alice': [Series(transaction_date='2020-01-01', amount=50.0)],
    'bob': [Series(transaction_date='2020-01-01', amount=10.0), Series(transaction_date='2020-01-02', amount=12.0)],
    'charlie': [Series(transaction_date='2020-01-02', amount=13.0)],
}
The following approaches do NOT work:
test.groupby('id').agg(list)
Returns a DataFrame where each column (amount and transaction_date) holds a list of values, but that's not what I want. I want one list of rows / pandas Series per unique grouping column value ('id' value).
test.groupby('id').agg(list).to_dict():
{'amount': {'charlie': [13.0], 'bob': [10.0, 12.0], 'alice': [50.0]}, 'transaction_date': {'charlie': ['2020-01-02'], 'bob': ['2020-01-01', '2020-01-02'], 'alice': ['2020-01-01']}}
test.groupby('id').apply(list).to_dict():
{'charlie': ['amount', 'id', 'transaction_date'], 'bob': ['amount', 'id', 'transaction_date'], 'alice': ['amount', 'id', 'transaction_date']}
Use itertuples and zip:
import pandas as pd

test = pd.DataFrame({
    'id': ['alice', 'bob', 'bob', 'charlie'],
    'transaction_date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
    'amount': [50.0, 10.0, 12.0, 13.0]
})

columns = ['transaction_date', 'amount']
grouped = (test
           .groupby('id')[columns]
           .apply(lambda x: list(x.itertuples(name='Series', index=False))))
print(dict(zip(grouped.index, grouped.values)))
{
'alice': [Series(transaction_date='2020-01-01', amount=50.0)],
'bob': [
Series(transaction_date='2020-01-01', amount=10.0),
Series(transaction_date='2020-01-02', amount=12.0)
],
'charlie': [Series(transaction_date='2020-01-02', amount=13.0)]
}
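Note that itertuples(name='Series') yields namedtuples labelled 'Series', not actual pd.Series objects. If that is acceptable, a sketch of the same result built by iterating the groupby directly:

columns = ['transaction_date', 'amount']
result = {key: list(group.itertuples(name='Series', index=False))
          for key, group in test.groupby('id')[columns]}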

pandas: aggregate array during groupby, equivalent of SQL's array_agg?

I've got this dataframe:
df1 = pd.DataFrame([
    {'id': 1, 'spend': 60, 'store': 'Stockport'},
    {'id': 2, 'spend': 68, 'store': 'Didsbury'},
    {'id': 3, 'spend': 70, 'store': 'Stockport'},
    {'id': 4, 'spend': 35, 'store': 'Didsbury'},
    {'id': 5, 'spend': 16, 'store': 'Didsbury'},
    {'id': 6, 'spend': 12, 'store': 'Didsbury'},
])
I've grouped it by store and got the total spend by store:
df.groupby("store").agg({'spend': 'sum'})\
.reset_index().sort_values("spend", ascending=False)
store spend
Didsbury 131
Stockport 130
Is there a way I can get the IDs for each store as a column in the grouped object? Like the equivalent of ARRAY_AGG in Postgres? So the desired output would be:
store spend ids
Didsbury 131 [2,4,5,6]
Stockport 130 [1,3]
We can use named aggregation, which is available since pandas >= 0.25.0.
Notice how we can instantly rename the aggregated column to "ids":
df1.groupby('store').agg(
    spend=('spend', 'sum'),
    ids=('id', list)
).reset_index()
store spend ids
0 Didsbury 131 [2, 4, 5, 6]
1 Stockport 130 [1, 3]
You can pass list as the aggregation function for the id column:
df = (df1.groupby("store").agg({'spend': 'sum', 'id': list})
         .reset_index()
         .sort_values("spend", ascending=False))
print (df)
store spend id
0 Didsbury 131 [2, 4, 5, 6]
1 Stockport 130 [1, 3]
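If you want the Postgres string_agg analogue instead, a comma-joined string, a sketch using a lambda inside the same named-aggregation call:

df1.groupby('store').agg(
    spend=('spend', 'sum'),
    ids=('id', lambda s: ','.join(map(str, s)))
).reset_index()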

Pandas Series .loc() access error after appending

I have a multi-index pandas Series, defined below. I want to add a new entry (new_series) to multi_df, calling the result multi_df_appended. However, I don't understand the change in behaviour between multi_df and multi_df_appended when I try to access a non-existing multi-index combination.
Below is the code that reproduces the problem. I want the penultimate line of code, multi_df_appended.loc['five', 'black', 'hard', 'square'], to return an empty Series like it does with multi_df, but instead I get the error shown. What am I doing wrong here?
df = pd.DataFrame({'id': range(1, 9),
                   'code': ['one', 'one', 'two', 'three',
                            'two', 'three', 'one', 'two'],
                   'colour': ['black', 'white', 'white', 'white',
                              'black', 'black', 'white', 'white'],
                   'texture': ['soft', 'soft', 'hard', 'soft', 'hard',
                               'hard', 'hard', 'hard'],
                   'shape': ['round', 'triangular', 'triangular', 'triangular', 'square',
                             'triangular', 'round', 'triangular']
                   }, columns=['id', 'code', 'colour', 'texture', 'shape'])
multi_df = df.set_index(['code', 'colour', 'texture', 'shape']).sort_index()['id']
# try to access a non-existing multi-index combination:
multi_df.loc['five', 'black', 'hard', 'square']
Series([], dtype: int64) # returns an empty Series as desired/expected.
# append multi_df with a new row
new_series = pd.Series([9], index=[('four', 'black', 'hard', 'round')])
multi_df_appended = multi_df.append(new_series)
# now try again to access a non-existing multi-index combination:
multi_df_appended.loc['five', 'black', 'hard', 'square']
error: 'MultiIndex lexsort depth 0, key was length 4' # now instead of the empty Series, I get an error!?
As @Jeff answered, if I do .sortlevel(0) and then run .loc for an unknown index, it does not raise the "lexsort depth" error:
multi_df_appended_sorted = multi_df.append(new_series).sortlevel(0)
multi_df_appended_sorted.loc['five', 'black', 'hard', 'square']
Series([], dtype: int64)
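For modern pandas, a sketch of the equivalent (an assumption: pandas >= 2.0, where Series.append and sortlevel have been removed):

# pd.concat replaces the removed Series.append; sort_index replaces the
# removed sortlevel and restores the lexsort depth that .loc relies on.
multi_df_appended_sorted = pd.concat([multi_df, new_series]).sort_index()

Note that recent pandas versions raise a KeyError for a fully missing key instead of returning an empty Series, so the empty-Series behaviour shown above is specific to older releases.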