Filter a pivot table in pandas

I am trying to filter a pivot table, following the approach in "adding filter to pandas pivot table", but it doesn't work.
maintenance_part_consumptionDF[
    (maintenance_part_consumptionDF.Asset == 'A006104')
    & (maintenance_part_consumptionDF.Reason == 'R565')
].pivot_table(
    maintenance_part_consumptionDF,
    values=["Quantity", "Part"],
    index=["Asset"],
    columns=["Reason"],
    aggfunc={"Quantity": np.sum, "Part": lambda x: len(x.unique())},
    fill_value=0,
)
But it raises: TypeError: pivot_table() got multiple values for argument 'values'
Update
Creation of the pivot table:
import numpy as np
import pandas as pd

maintenance_part_consumption_pivottable_part = pd.pivot_table(
    maintenance_part_consumptionDF,
    values=["Quantity", "Part"],
    index=["Asset"],
    columns=["Reason"],
    aggfunc={"Quantity": np.sum, "Part": lambda x: len(x.unique())},
    fill_value=0,
)
maintenance_part_consumption_pivottable_part.head(2)
When I slice it manually:
maintenance_part_consumption_pivottable_partDF = pd.DataFrame(maintenance_part_consumption_pivottable_part)
maintenance_part_consumption_pivottable_partDF.iloc[15, [8]]
I get this output:
      Reason
Part  R565      38
Name: A006104, dtype: int64
Which is the exact output I need.
But I don't want to do it this way with iloc, because it's too mechanical: I have to count the row and column positions before getting to the result "38".
Hint: I would rather select using the asset label itself, and likewise for the reason, as in the question below.
How many unique parts were used for the asset A006104 for the failure reason R565?
Sorry, I wanted to upload the table via an image to make it more realistic but I am not allowed.

If you read the documentation for DataFrame.pivot_table, the first parameter is values, so in your code:
.pivot_table(
    maintenance_part_consumptionDF,  # this is `values`
    values=["Quantity", "Part"],     # this is also `values`
    ...
)
Simply drop the first argument:
.pivot_table(
    values=["Quantity", "Part"],
    index=["Asset"],
    columns=["Reason"],
    aggfunc={"Quantity": np.sum, "Part": lambda x: len(x.unique())},
    fill_value=0,
)
There is also a closely related function, pd.pivot_table, whose first parameter is a dataframe. Don't mix up the two.
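As for the hint about selecting by labels instead of iloc: the pivoted columns form a MultiIndex of (value name, reason), so a tuple addresses a single cell. A minimal sketch, assuming the pivot table built in the question (the 38 is the output reported above):
# Label-based lookup instead of positional iloc.
# Row label: the asset; column label: (value name, reason).
maintenance_part_consumption_pivottable_part.loc["A006104", ("Part", "R565")]
# -> 38 for the data shown in the question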

Related

Filtering out None from column in Pandas Dataframe

Let df be a dataframe and assume df["Scores"] is a column of dtype: object.
Some elements of "Scores" have the value None. I would like to filter out the None values from the pandas series df["Scores"].
I tried using the boolean series
df["Scores"] != None
however, this returns True everywhere, so it did not help.
On the other hand, the following will extract the values that are not None.
vals = []
for x in df["Scores"]:
    if x != None:
        vals.append(x)
I would like to know why the first approach does not work.
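For what it's worth, pandas treats a comparison against None like a comparison against a missing value (similar to NaN), which in effect comes back True for every element under !=; the idiomatic filter uses notna()/isna() instead. A minimal runnable sketch on toy data, assuming the goal is simply to drop the missing entries:
import pandas as pd

df = pd.DataFrame({"Scores": [1, None, 3, None, 5]}, dtype=object)

# notna() marks real values True and None/NaN False,
# which is what the != None comparison was meant to do.
filtered = df["Scores"][df["Scores"].notna()]
print(filtered.tolist())  # [1, 3, 5]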

When using 'df.groupby(column).apply()' get the groupby column within the 'apply' context?

I want to get the groupby column, i.e. the column supplied to df.groupby as the by argument (df.groupby(by=column)), from within the apply context that follows (df.groupby(by=column).apply(...)).
For example,
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})
df.groupby(['Animal']).apply(...)  # here I want to know that the groupby column is 'Animal'
df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
Of course, I could add one more line of code, or simply supply the groupby column to the apply context separately (e.g. .apply(lambda df_: some_function(df_, s='Animal'))), but I am curious whether this can be done in a single line, possibly with a pandas function built for this purpose.
I just figured out a one-liner solution:
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})
df.groupby(['Animal']).apply(
    lambda df_: df_.apply(lambda x: all(x == df_.name)).loc[lambda x: x].index.tolist()
)
This returns the groupby column within each groupby.apply context:
Animal
Falcon [Animal]
Parrot [Animal]
Since it is quite a long one-liner (uses 3 lambdas!), it is better to wrap it in a separate function, as shown below:
def get_groupby_column(df_):
    return df_.apply(lambda x: all(x == df_.name)).loc[lambda x: x].index.tolist()

df.groupby(['Animal']).apply(get_groupby_column)
Note of caution: this solution won't work if other columns of the dataframe also contain the items from the groupby column, e.g. if the Max Speed column contained any of the values from Animal, the results would be inaccurate.
You could use grouper.names:
>>> df.groupby('Animal').grouper.names
['Animal']
With apply:
grouped = df.groupby('Animal')
grouped.apply(lambda x: grouped.grouper.names)
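Relatedly, the one-liner above works because each sub-frame passed to apply carries its group key as .name; if what you need is the group's value rather than the column name, you can read it directly. A minimal sketch on the same toy data:
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})

# Each group's key is exposed as `.name` on the sub-frame inside apply.
df.groupby('Animal').apply(lambda df_: df_.name)
# Animal
# Falcon    Falcon
# Parrot    Parrot
# dtype: object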

How to select a value in a dataframe with MultiIndex?

I use the pandas library to analyze data coming from an Excel file.
I used pivot_table to get a pivot table with the information I'm interested in, and I end up with a DataFrame with a MultiIndex.
For "OPE-2016-0001", I would like to obtain the figures for 2017, for example. I've tried lots of things and nothing works. What is the correct method to use? Thank you.
import pandas as pd
import numpy as np
from math import *
import tkinter as tk

pd.set_option('display.expand_frame_repr', False)

df = pd.read_csv('datas.csv')

def tcd_op_dataExcercice():
    global df
    new_df = df.assign(Occurence=1)
    tcd = new_df.pivot_table(index=['Numéro opération', 'Libellé opération'],
                             columns=['Exercice'],
                             values=['Occurence'],
                             aggfunc=[np.sum],
                             margins=True,
                             fill_value=0,
                             margins_name='Total')
    print(tcd)
    print(tcd.xs('ALSTOM 8', level='Libellé opération', drop_level=False))

tcd_op_dataExcercice()
I get the following table (image).
How do I get the value framed in red?
You can use .loc to select rows by a DataFrame's Index's labels. If the Index is a MultiIndex, it will index into the first level of the MultiIndex (Numéro opération in your case), though you can pass a tuple to index into both levels (e.g. if you specifically wanted ("OPE-2016-0001", "ALSTOM 8")).
It's worth noting that the columns of your pivoted data are also a MultiIndex, because you specified the aggfunc, values and columns as lists rather than as individual values (i.e. without the []). Pandas creates a MultiIndex because of these lists, even though each list had only one element.
So you'll also need to pass a tuple to index into the columns to get the value for 2017:
tcd.loc["OPE-2016-0001", ('sum', 'Occurence', 2017)]
If you had instead just specified the aggfunc etc as individual strings, the columns would just be the years and you could select the values by:
tcd.loc["OPE-2016-0001", 2017]
Or if you specifically wanted the value for ALSTOM 8:
tcd.loc[("OPE-2016-0001", "ALSTOM 8"), 2017]
An alternative to indexing into a MultiIndex is to call .reset_index() after pivoting, in which case the levels of the MultiIndex become ordinary columns, and you can then select rows based on the values of those columns. E.g. (assuming you specified aggfunc etc. as strings):
tcd = tcd.reset_index()
tcd.query("`Numéro opération` == 'OPE-2016-0001'")[2017]
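A self-contained sketch of the same selections on toy data (operation names reused from the question; the numbers are invented for illustration):
import pandas as pd

# Toy frame standing in for the raw data.
df = pd.DataFrame({
    'Numéro opération': ['OPE-2016-0001', 'OPE-2016-0001', 'OPE-2016-0002'],
    'Libellé opération': ['ALSTOM 8', 'ALSTOM 9', 'ALSTOM 8'],
    'Exercice': [2017, 2017, 2018],
    'Occurence': [1, 1, 1],
})

# Passing plain strings (not lists) keeps the pivoted columns flat.
tcd = df.pivot_table(index=['Numéro opération', 'Libellé opération'],
                     columns='Exercice',
                     values='Occurence',
                     aggfunc='sum',
                     fill_value=0)

# The row index is a MultiIndex, so a tuple addresses both levels;
# the column label is just the year.
print(tcd.loc[('OPE-2016-0001', 'ALSTOM 8'), 2017])  # 1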

pandas groupby returns multiindex with two more aggregates

When grouping by a single column, and using as_index=False, the behavior is expected in pandas. However, when I use .agg, as_index no longer appears to behave as expected. In short, it doesn't appear to matter.
# imports
import pandas as pd
import numpy as np
# set the seed
np.random.seed(834)
df = pd.DataFrame(np.random.rand(10, 1), columns=['a'])
df['letter'] = np.random.choice(['a','b'], size=10)
summary = df.groupby('letter', as_index=False).agg([np.count_nonzero, np.mean])
summary
returns:
                   a
       count_nonzero      mean
letter
a                6.0  0.539313
b                4.0  0.456702
I would have expected the index to be 0, 1, with letter as a column in the dataframe.
In summary, I want to be able to group by one or more columns, summarize a single column with multiple aggregates, and return a dataframe that has neither the groupby columns as the index nor a MultiIndex in the columns.
The comment from @Trenton did the trick.
summary = df.groupby('letter')['a'].agg([np.count_nonzero, np.mean]).reset_index()
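Spelled out on the same synthetic data (the seed is from the question, so treat exact numbers as illustrative; newer pandas versions may warn about passing raw numpy functions to .agg):
import pandas as pd
import numpy as np

np.random.seed(834)
df = pd.DataFrame(np.random.rand(10, 1), columns=['a'])
df['letter'] = np.random.choice(['a', 'b'], size=10)

# Selecting the single column before .agg keeps the result's columns flat,
# and .reset_index() turns the group keys back into a regular column.
summary = df.groupby('letter')['a'].agg([np.count_nonzero, np.mean]).reset_index()
print(summary.columns.tolist())  # ['letter', 'count_nonzero', 'mean']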

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply the value_counts method to a DataFrame based on the columns selected dynamically in the Streamlit app.
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
    all_columns = df.columns.tolist()
    selected_columns = st.multiselect("Select", all_columns)
    new_df = df[selected_columns]
    st.dataframe(new_df)
The above lets me select columns and displays data for the selected columns. I want to see how I could apply the value_counts/groupby method on this output in the Streamlit app.
If I try the following:
st.table(new_df.value_counts())
I get the below error
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to a dataframe. When you pass a single column name in [] to a dataframe, you get back a pandas.Series object (which has the value_counts method), but when you pass a list of columns, you get back a pandas.DataFrame, which does not have value_counts defined on it (DataFrame.value_counts was only added later, in pandas 1.1).
Can you try st.table(new_df[col_name].value_counts())
I think the error is because value_counts() is applicable to a Series and not to a DataFrame.
You can try "Converting .value_counts output to dataframe".
If you want to apply it to one single column:
def value_counts_df(df, col):
    """
    Returns pd.value_counts() as a DataFrame.

    Parameters
    ----------
    df : pandas DataFrame
        Dataframe on which to run value_counts(); must have column `col`.
    col : str
        Name of column in `df` for which to generate counts.

    Returns
    -------
    pandas DataFrame
        The returned dataframe has a single column named "count" which
        contains the value_counts() for each unique value of df[col].
        The index name of this dataframe is `col`.

    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a': [1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df
val_count_single = value_counts_df(new_df, selected_col)
If you want to apply it to all object columns in the dataframe:
def valueCountDF(df, object_cols):
    # Raw counts per unique value, for every object column.
    c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
    # The same counts as percentages, rounded to two decimals.
    p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
                                                        dropna=False)).T.stack() * 100).round(2)
    cp = pd.concat([c, p], axis=1, keys=["Count", "Percentage %"])
    return cp
val_count_df_cols = valueCountDF(df, selected_columns)
And finally, you can use st.table or st.dataframe to show the dataframe in your Streamlit app.
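Putting the pieces together, a hypothetical wiring of the two helpers into the app (it assumes the df, new_df and selected_columns from the snippets above, and that at least one column has been selected):
import streamlit as st

# Guard against an empty selection before computing counts.
if selected_columns:
    st.table(value_counts_df(new_df, selected_columns[0]))   # one column
    st.dataframe(valueCountDF(df, selected_columns))         # all selected columns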