groupby shows unobserved values of non-categorical columns - pandas

I created this simple example to illustrate my issue:
x = pd.DataFrame({"int_var1": range(3), "int_var2": range(3, 6), "cat_var": pd.Categorical(["a", "b", "a"]), "value": [0.1, 0.2, 0.3]})
it yields this DataFrame:
   int_var1  int_var2 cat_var  value
0         0         3       a    0.1
1         1         4       b    0.2
2         2         5       a    0.3
where the first two columns are integers, the third column is categorical with two levels, and the fourth column is float. The issue is that when I use groupby followed by agg, I seem to have only two options. Either I can show no unobserved values, like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True).agg({"value": "sum"}).fillna(0)
                           value
int_var1 int_var2 cat_var
0        3        a          0.1
1        4        b          0.2
2        5        a          0.3
or I can show unobserved values for all grouping variables like so:
x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=False).agg({"value": "sum"}).fillna(0)
                           value
int_var1 int_var2 cat_var
0        3        a          0.1
                  b          0.0
         4        a          0.0
                  b          0.0
         5        a          0.0
                  b          0.0
1        3        a          0.0
                  b          0.0
         4        a          0.0
                  b          0.2
         5        a          0.0
                  b          0.0
2        3        a          0.0
                  b          0.0
         4        a          0.0
                  b          0.0
         5        a          0.3
                  b          0.0
Is there a way to show unobserved values for the categorical variables only and not every possible permutation of all grouping variables?

You can unstack the level of interest, cat_var in this case:
(x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True)
  .agg({'value': 'sum'})
  .unstack('cat_var', fill_value=0)
)
Output:
                  value
cat_var               a    b
int_var1 int_var2
0        3          0.1  0.0
1        4          0.0  0.2
2        5          0.3  0.0
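If you'd rather have the long format back (every category listed under each observed (int_var1, int_var2) pair), one option is to stack cat_var again after unstacking; a minimal sketch with the same x as above:
(x.groupby(['int_var1', 'int_var2', 'cat_var'], observed=True)
  .agg({'value': 'sum'})
  .unstack('cat_var', fill_value=0)   # full category set becomes the columns
  .stack('cat_var')                   # back to long format; zero-filled rows are kept
)
This keeps only the observed (int_var1, int_var2) pairs while listing both categories for each of them.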

Related

how to create a column with the index of the biggest among other columns AND some condition

I have a dataset with some columns, and I want to create another column whose values are the name of the column with the highest value BUT different from 1.
For Example:
df = pd.DataFrame({'A': [1, 0.2, 0.1, 0],
                   'B': [0.2, 1, 0, 0.5],
                   'C': [1, 0.4, 0.3, 1]},
                  index=['1', '2', '3', '4'])
df
index    A    B    C
1      1.0  0.2  1.0
2      0.2  1.0  0.4
3      0.1  0.0  0.3
4      0.0  0.5  1.0
Should give an output like:
index    A    B    C  NEWCOL
1      1.0  0.2  1.0       B
2      0.2  0.3  0.1       C
3      0.1  0.4  0.2       B
4      0.0  0.5  1.0       B
I tried:
df2['newcol'] = df2.idxmax(axis=1) if df2.max(index=1) != 1
but it didn't work.
Here is one way to do it:
# mask out the values equal to 1, then find the column label of the max remaining value using idxmax
df['newcol'] = df[~df.isin([1])].idxmax(axis=1)
df
     A    B    C newcol
1  1.0  0.2  1.0      B
2  0.2  1.0  0.4      C
3  0.1  0.0  0.3      C
4  0.0  0.5  1.0      B
PS: your input data and expected output don't match. The above is based on the input DF.
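Equivalently, just as a sketch of the same idea on your input df, you can hide the 1s with mask instead of the isin filter:
import pandas as pd

df = pd.DataFrame({'A': [1, 0.2, 0.1, 0],
                   'B': [0.2, 1, 0, 0.5],
                   'C': [1, 0.4, 0.3, 1]},
                  index=['1', '2', '3', '4'])

# mask() turns the cells equal to 1 into NaN, so idxmax(axis=1) returns the
# label of the column holding the largest remaining value in each row
df['newcol'] = df.mask(df.eq(1)).idxmax(axis=1)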

Pandas groupby().sum() is not ignoring None/empty/np.nan values

I am facing a weird problem with Pandas groupby() and sum() used together.
I have a DF with 2 category columns and 3 numeric/float value columns - value columns have Nones as shown below:
cat_1 cat_2 val_1 val_2 val_3
0 b z 0.1 NaN NaN
1 b x 0.1 NaN NaN
2 c y 0.1 1.0 NaN
3 c z 0.1 NaN NaN
4 c x 0.1 1.0 NaN
I want to group-by cat_1 and then sum the value columns: val_1, val_2 and val_3 per category.
This is what the final aggregation should look like:
cat_1 val_1 val_2 val_3
0 b 0.2 NaN NaN
1 c 0.3 2.0 NaN
Problem: when I do df.groupby(["cat_1"], as_index=False).sum(), I get 0.0 for categories where all values are None/null:
cat_1 val_1 val_2 val_3
0 b 0.2 0.0 0.0
1 c 0.3 2.0 0.0
How can I fix this issue? Thank you.
It seems there is no direct way of doing it, such as DataFrame.sum(skipna=False), because skipna is not implemented for the groupby sum. However, as ScootCork mentioned, the required behaviour can be achieved using groupby().sum(min_count=1) (see the documentation).
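For reference, a minimal sketch of that min_count behaviour, reconstructing the example data from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({"cat_1": ["b", "b", "c", "c", "c"],
                   "cat_2": ["z", "x", "y", "z", "x"],
                   "val_1": [0.1, 0.1, 0.1, 0.1, 0.1],
                   "val_2": [np.nan, np.nan, 1.0, np.nan, 1.0],
                   "val_3": [np.nan] * 5})

# min_count=1 requires at least one non-NaN value per group to produce a sum,
# so all-NaN groups stay NaN instead of becoming 0.0
out = df.groupby("cat_1", as_index=False)[["val_1", "val_2", "val_3"]].sum(min_count=1)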

Excel SUMPRODUCT function in pandas DataFrames

OK, as a Python beginner I find matrix multiplication in pandas DataFrames very difficult to carry out.
I have two tables look like:
df1
Id lifetime 0 1 2 3 4 5 .... 30
0 1 4 0.1 0.2 0.1 0.4 0.5 0.4... 0.2
1 2 7 0.3 0.2 0.5 0.4 0.5 0.4... 0.2
2 3 8 0.5 0.2 0.1 0.4 0.5 0.4... 0.6
.......
9 6 10 0.3 0.2 0.5 0.4 0.5 0.4... 0.2
df2
Group lifetime 0 1 2 3 4 5 .... 30
0 2 4 0.9 0.8 0.9 0.8 0.8 0.8... 0.9
1 2 7 0.8 0.9 0.9 0.9 0.8 0.8... 0.9
2 3 8 0.9 0.7 0.8 0.8 0.9 0.9... 0.9
.......
9 5 10 0.8 0.9 0.7 0.7 0.9 0.7... 0.9
I want to perform Excel's SUMPRODUCT function in my code, where the number of columns to be multiplied and summed depends on the lifetime column of both dfs, e.g.:
for row 0 in df1 & df2, lifetime = 4:
sumproduct(df1 row 0 from column 0 to column 3,
           df2 row 0 from column 0 to column 3)
for row 1 in df1 & df2, lifetime = 7:
sumproduct(df1 row 1 from column 0 to column 6,
           df2 row 1 from column 0 to column 6)
.......
How can I do this?
You can use .iloc to access rows and columns by integer position.
For row 0, where lifetime == 4: counting positions, the column labeled 0 is position 2 and the column labeled 3 is position 5, so to get that interval you would enter 2:6.
Once you have the correct data from both data frames with .iloc[0, 2:6], you run np.dot.
See below:
import numpy as np
np.dot(df1.iloc[0, 2:6], df2.iloc[0, 2:6])
Just to make sure you have the right data, first try running
df1.iloc[0, 2:6]
then try the np.dot product. You can read up on "pandas iloc" and "slicing" for more info.
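If you want this for every row without typing the slices by hand, one possible approach (just a sketch with made-up toy data, assuming both frames are row-aligned and the per-period columns are labeled 0, 1, 2, ...) is to loop over the rows and slice by each row's lifetime:
import numpy as np
import pandas as pd

# toy frames with periods 0..4 standing in for the real 0..30
df1 = pd.DataFrame({"Id": [1, 2], "lifetime": [3, 5],
                    0: [0.1, 0.3], 1: [0.2, 0.2], 2: [0.1, 0.5],
                    3: [0.4, 0.4], 4: [0.5, 0.5]})
df2 = pd.DataFrame({"Group": [2, 2], "lifetime": [3, 5],
                    0: [0.9, 0.8], 1: [0.8, 0.9], 2: [0.9, 0.9],
                    3: [0.8, 0.9], 4: [0.8, 0.8]})

period_cols = [c for c in df1.columns if c not in ("Id", "lifetime")]

# for each row, dot the first `lifetime` period columns of df1 with those of df2
df1["sumproduct"] = [
    np.dot(df1.loc[i, period_cols[:int(df1.loc[i, "lifetime"])]],
           df2.loc[i, period_cols[:int(df1.loc[i, "lifetime"])]])
    for i in df1.index
]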

Grouping by and applying lambda with condition for the first row - Pandas

I have a data frame with IDs and the choices made by those IDs.
The alternatives (choices) set is a list of integers: [10, 20, 30, 40].
Note: it's important to use this list. Let's call it 'choice_list'.
This is the data frame:
ID Choice
1 10
1 30
1 10
2 40
2 40
2 40
3 20
3 40
3 10
I want to create a variable for each alternative: '10_Var', '20_Var', '30_Var', '40_Var'.
At the first row of each ID, if the first choice was '10', for example, then the variable '10_Var' will get the value 0.6 (some parameter), and each of the other variables ('20_Var', '30_Var', '40_Var') will get the value (1 - 0.6) / 4.
The number 4 stands for the number of alternatives.
Expected result:
ID Choice 10_Var 20_Var 30_Var 40_Var
1 10 0.6 0.1 0.1 0.1
1 30
1 10
2 40 0.1 0.1 0.1 0.6
2 40
2 40
3 20 0.1 0.6 0.1 0.1
3 40
3 10
You can use np.where to do this. It is more efficient than df.where.
df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40], ['2', 40], ['3', 20], ['3', 40], ['3', 10]], columns=('ID', 'Choice'))
choices = np.unique(df.Choice)
for choice in choices:
    df[f"var_{choice}"] = np.where(df.Choice == choice, 0.6, (1 - 0.6) / 4)
df
Result
ID Choice var_10 var_20 var_30 var_40
0 1 10 0.6 0.1 0.1 0.1
1 1 30 0.1 0.1 0.6 0.1
2 1 10 0.6 0.1 0.1 0.1
3 2 40 0.1 0.1 0.1 0.6
4 2 40 0.1 0.1 0.1 0.6
5 2 40 0.1 0.1 0.1 0.6
6 3 20 0.1 0.6 0.1 0.1
7 3 40 0.1 0.1 0.1 0.6
8 3 10 0.6 0.1 0.1 0.1
Edit
To set values for the 1st row of each group only:
df = pd.DataFrame([['1', 10], ['1', 30], ['1', 10], ['2', 40], ['2', 40], ['2', 40], ['3', 20], ['3', 40], ['3', 10]], columns=('ID', 'Choice'))
df = df.set_index("ID")
## create a unique index for each row if there isn't one already
df = df.reset_index()
choices = np.unique(df.Choice)
## get the positional index of the 1st row of each group
grouped = df.loc[df.reset_index().groupby("ID")["index"].first()]
## set the value for each new variable
for choice in choices:
    grouped[f"var_{choice}"] = np.where(grouped.Choice == choice, 0.6, (1 - 0.6) / 4)
pd.concat([df, grouped.iloc[:, -len(choices):]], axis=1)
We can use insert to create a column for each unique Choice value obtained through Series.unique. We can also create a mask to fill only the first row of each ID, using np.where.
At the beginning, sort_values is used to sort the values based on the ID. You can skip this step if your data frame is already sorted (like the one shown in the example):
df = df.sort_values('ID')
n = df['Choice'].nunique()
mask = df['ID'].ne(df['ID'].shift())
for choice in df['Choice'].sort_values(ascending=False).unique():
    df.insert(2, column=f'{choice}_Var', value=np.nan)
    df.loc[mask, f'{choice}_Var'] = np.where(df.loc[mask, 'Choice'].eq(choice), 0.6, 0.4 / n)
print(df)
ID Choice 10_Var 20_Var 30_Var 40_Var
0 1 10 0.6 0.1 0.1 0.1
1 1 30 NaN NaN NaN NaN
2 1 10 NaN NaN NaN NaN
3 2 40 0.1 0.1 0.1 0.6
4 2 40 NaN NaN NaN NaN
5 2 40 NaN NaN NaN NaN
6 3 20 0.1 0.6 0.1 0.1
7 3 40 NaN NaN NaN NaN
8 3 10 NaN NaN NaN NaN
A mix of numpy and pandas solution:
rows = np.unique(df.ID.values, return_index=1)[1]
df1 = df.loc[rows].assign(val=0.6)
df2 = (pd.crosstab([df1.index, df1.ID, df1.Choice], df1.Choice, df1.val, aggfunc='first')
         .reindex(choice_list, axis=1)
         .fillna((1 - 0.6) / len(choice_list))
         .reset_index(level=[1, 2], drop=True))
pd.concat([df, df2], axis=1)
Out[217]:
ID Choice 10 20 30 40
0 1 10 0.6 0.1 0.1 0.1
1 1 30 NaN NaN NaN NaN
2 1 10 NaN NaN NaN NaN
3 2 40 0.1 0.1 0.1 0.6
4 2 40 NaN NaN NaN NaN
5 2 40 NaN NaN NaN NaN
6 3 20 0.1 0.6 0.1 0.1
7 3 40 NaN NaN NaN NaN
8 3 10 NaN NaN NaN NaN

Converting a flat table of records to an aggregate dataframe in Pandas [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 4 years ago.
I have a flat table of records about objects. Objects have a type (ObjType) and are hosted in containers (ContainerId). The records also have some other attributes about the objects; however, those are not of interest at present. So, basically, the data looks like this:
Id ObjName XT ObjType ContainerId
2 name1 x1 A 2
3 name2 x5 B 2
22 name5 x3 D 7
25 name6 x2 E 7
35 name7 x3 G 7
..
..
92 name23 x2 A 17
95 name24 x8 B 17
99 name25 x5 A 21
What I am trying to do is 're-pivot' this data to further analyze which containers are 'similar' by looking at the types of objects they host in aggregate.
So, I am looking to convert the above data to the form below:
ObjType A B C D E F G
ContainerId
2 2.0 1.0 1.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 1.0 2.0 1.0 1.0
9 1.0 1.0 0.0 1.0 0.0 0.0 0.0
11 0.0 0.0 0.0 2.0 3.0 1.0 1.0
14 1.0 1.0 0.0 1.0 0.0 0.0 0.0
17 1.0 1.0 0.0 0.0 0.0 0.0 0.0
21 1.0 0.0 0.0 0.0 0.0 0.0 0.0
This is how I have managed to do it currently (after a lot of stumbling and using various tips from questions such as this one). I am getting the right results but, being new to Pandas and Python, I feel that I must be taking a long route. (I have added a few comments to explain the pain points.)
import pandas as pd
rdf = pd.read_csv('.\\testdata.csv')
#The data in the below group-by is all that is needed but in a re-pivoted format...
rdfg = rdf.groupby('ContainerId').ObjType.value_counts()
#Remove 'ContainerId' and 'ObjType' from the index
#Had to do reset_index in two steps because otherwise there's a conflict with 'ObjType'.
#That is, just rdfg.reset_index() does not work!
rdx = rdfg.reset_index(level=['ContainerId'])
#Renaming the 'ObjType' column helps get around the conflict so the 2nd reset_index works.
rdx.rename(columns={'ObjType':'Count'}, inplace=True)
cdx = rdx.reset_index()
#After this a simple pivot seems to do it
cdf = cdx.pivot(index='ContainerId', columns='ObjType',values='Count')
#Replacing the NaNs because not all containers have all object types
cdf.fillna(0, inplace=True)
Ask: Can someone please share other possible approaches that could perform this transformation?
This is a use case for pd.crosstab (see the docs).
e.g.
In [539]: pd.crosstab(df.ContainerId, df.ObjType)
Out[539]:
ObjType A B D E G
ContainerId
2 1 1 0 0 0
7 0 0 1 1 1
17 1 1 0 0 0
21 1 0 0 0 0
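If you also need columns for types that never occur in the sample (C and F in the desired output), one small follow-up (a sketch, assuming the full type list is known up front) is to reindex the crosstab columns:
all_types = list('ABCDEFG')   # full set of object types of interest (assumed known)
counts = (pd.crosstab(df.ContainerId, df.ObjType)
            .reindex(columns=all_types, fill_value=0))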