How do columns work in a Pandas Dataframe after using GroupBy - pandas

Basically, I want to use the iterrows method to loop through my grouped DataFrame, but I can't figure out how the columns work. In the example below, it does not create columns called "Group1" and "Group2" as one might expect. Is one of the columns a dtype itself?
import pandas as pd
df = pd.DataFrame({
    "Group1": ["Apple", "Apple", "Apple", "Apple", "Orange", "Orange", "Orange"],
    "Group2": ["Red Delicious", "McIntosh", "McIntosh", "Fuju", "Navel", "Navel", "Mandarin"],
    "Amount": [15, 20, 30, 7, 9, 5, 12],
})
print(df.dtypes)
print(df.to_string())
df_sum = df.groupby(['Group1', 'Group2'])[['Amount']].sum()
print("---- Sum Results----")
print(df_sum.dtypes)
print(df_sum.to_string())
for index, row in df_sum.iterrows():
    # The line below is what I want to do conceptually.
    # print(row.Group1, row.Group2, row.Amount)  # AttributeError: 'Series' object has no attribute 'Group1'
    print(row.Amount)
The part of the output we are interested in is below. I noticed that "Group1" and "Group2" appear on a line below "Amount".
---- Sum Results----
Amount    int64
dtype: object
                      Amount
Group1 Group2
Apple  Fuju                7
       McIntosh           50
       Red Delicious      15
Orange Mandarin           12
       Navel              14

Simply try:
df_sum = df.groupby(['Group1', 'Group2'])['Amount'].sum().reset_index()
or:
df_sum = df.groupby(['Group1', 'Group2'])['Amount'].agg('sum').reset_index()
It can even be simplified as follows, since we are summing based on Group1 and Group2 only:
df_sum = df.groupby(['Group1', 'Group2']).sum().reset_index()
Another way:
df_sum = df.groupby(['Group1', 'Group2']).agg({'Amount': 'sum'}).reset_index()
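Once reset_index has turned the grouping keys back into ordinary columns, the loop from the question works as intended; for the sample data it should print:
df_sum = df.groupby(['Group1', 'Group2'])['Amount'].sum().reset_index()
for index, row in df_sum.iterrows():
    print(row.Group1, row.Group2, row.Amount)
# Apple Fuju 7
# Apple McIntosh 50
# Apple Red Delicious 15
# Orange Mandarin 12
# Orange Navel 14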

Try resetting the index:
df_sum = df.groupby(['Group1', 'Group2'])[['Amount']].sum().reset_index()
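Alternatively, you can keep the MultiIndex and unpack it in the loop: with iterrows, each row's index is a (Group1, Group2) tuple. A minimal sketch:
for (group1, group2), row in df.groupby(['Group1', 'Group2'])[['Amount']].sum().iterrows():
    print(group1, group2, row.Amount)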

Related

Identify change in status due to change in categorical variable in panel data

I have unbalanced panel data (repeated observations per ID at different points in time). I need to identify a change in a variable per person over time.
Here is the code to generate the data frame:
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["C1", "C1", "C2", "C2", "C2"],
        "id": [1, 1, 2, 2, 2],
        "date": ["01/01/2021", "01/02/2021", "01/01/2021", "01/02/2021", "01/03/2021"],
        "job": ["A", "A", "A", "B", "B"],
    }
)
df
I am trying to create a column ("change") that indicates when individual 2 changes job status from A to B on that date (01/02/2021).
I have tried the following, but it is giving me an error:
df['change']=df.groupby(['id'])['job'].diff().fillna(0)
The error happens because you call diff on the 'job' column, but 'job' has dtype object and diff only works with numeric types.
My current answer:
df["change"] = (
    df.groupby("id")["job"]
      .transform(lambda x: x.ne(x.shift().bfill()))
      .astype(int)
)
Here is the (longer) solution that I worked out:
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["C1", "C1", "C2", "C2", "C2"],
        "id": [1, 1, 2, 2, 2],
        "date": [0, 1, 0, 1, 2],
        "job": ["A", "A", "A", "B", "B"],
    }
)
df1 = df.set_index(['id', 'date']).sort_index()
# lag the job within each id, then fill the first observation with the current job
df1['job_lag'] = df1.groupby(level='id')['job'].shift()
df1['job_lag'] = df1['job_lag'].fillna(df1['job'])

def change(x):
    # flag rows where the job differs from the previous observation
    if x['job'] != x['job_lag']:
        return 1
    else:
        return 0

df1['dummy'] = df1.apply(change, axis=1)
df1
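For reference, df1 should then look like this (dummy is 1 only on the row where the job changes):
         region job job_lag  dummy
id date
1  0         C1   A       A      0
   1         C1   A       A      0
2  0         C2   A       A      0
   1         C2   B       A      1
   2         C2   B       B      0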

Highlight distinct cells based on a different cell in the same row in a multiindex pivot table

I have created a pivot table where the column headers have several levels. This is a simplified version:
import pandas as pd

index = ['Person 1', 'Person 2', 'Person 3']
columns = [
    ["condition 1", "condition 1", "condition 1", "condition 2", "condition 2", "condition 2"],
    ["Mean", "SD", "n", "Mean", "SD", "n"],
]
data = [
    [100, 10, 3, 200, 12, 5],
    [500, 20, 4, 750, 6, 6],
    [1000, 30, 5, None, None, None],
]
df = pd.DataFrame(data, index=index, columns=columns)
df
Now I would like to highlight the cells adjacent to SD if SD > 10. This is how it should look:
I found this answer but couldn't make it work for a MultiIndex.
Thanks for any help.
Use Styler.apply with a custom function: select the SD columns with DataFrame.xs, and broadcast the boolean mask back to all columns with DataFrame.reindex:
def highlight(x):
    c1 = 'background-color: red'
    # boolean mask from the SD columns only
    mask = x.xs('SD', axis=1, level=1).gt(10)
    # DataFrame with the same index and column names as the original, filled with empty strings
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    # broadcast the mask across each condition group and set the style where True
    return df1.mask(mask.reindex(x.columns, level=0, axis=1), c1)

df.style.apply(highlight, axis=None)
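The Styler renders in rich front ends such as notebooks; to inspect the result elsewhere, one option is exporting to Excel, which keeps simple styles like background colours (a sketch; the file name is arbitrary and the openpyxl engine is assumed to be installed):
styled = df.style.apply(highlight, axis=None)
styled.to_excel('highlighted.xlsx', engine='openpyxl')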

How to do a pandas comparison with keep_shape=False, but maintain the relationship with the username column

I'm trying to run a pandas DataFrame comparison, df_1.compare(df_2), that returns only the differences between the two frames. With keep_shape=False, only the rows with differences (and their indexes) are displayed, but the relationship with the first column (the users' names) is lost.
How do I keep the name column while using keep_shape=False, so I can identify the username and the changes at the same time?
Example:
import pandas as pd
df_1 = pd.read_excel('../output/spreadsheet_Jan_1.xlsx')
df_2 = pd.read_excel('../output/spreadsheet_Feb_1.xlsx')
df_compare = df_1.compare(df_2, keep_equal=True, keep_shape=False)
I guess the image isn't showing... it's a spreadsheet with the df.compare() result, showing the averages columns with the 'self' and 'other' columns split below them. The index is on the left-hand side in the keep_shape=False format (e.g. 1, 6, 7, 8, 9, 11, etc.).
How do I match the usernames (the first column) with the associated indexes along the left side?
Thanks in advance.
Here is an example of one simple way to do it:
import pandas as pd
df_1 = pd.DataFrame(
    {
        "fruit": {0: "banana", 1: "orange", 2: "apple", 3: "celery"},
        "quantity": {0: 22, 1: 8, 2: 7, 3: 10},
    }
)
df_2 = pd.DataFrame(
    {
        "fruit": {0: "banana", 1: "orange", 2: "apple", 3: "celery"},
        "quantity": {0: 27, 1: 8, 2: 8, 3: 10},
    }
)
In df_compare, we want to show the fruit names for which values are different in df_1 and df_2 (that is to say 'banana' and 'apple'):
df_compare = (
    df_1
    .compare(df_2, keep_equal=True, keep_shape=False)
    .pipe(lambda df_: df_.set_index(df_1.loc[df_.index, "fruit"]))
    .reset_index()
)
print(df_compare)
# Output
    fruit quantity
           self other
0  banana     22    27
1   apple      7     8
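The same pattern should carry over to the spreadsheets in the question, assuming the first column is literally named 'username' (a hypothetical name here, since the real header isn't shown):
df_compare = (
    df_1
    .compare(df_2, keep_equal=True, keep_shape=False)
    .pipe(lambda df_: df_.set_index(df_1.loc[df_.index, "username"]))  # 'username' is assumed
    .reset_index()
)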
Thanks to Laurent for the dataset example:
df_1 = pd.DataFrame({"fruit": {0: "banana", 1: "orange", 2: "apple", 3: "celery"},
"quantity": {0: 22, 1: 8, 2: 7, 3: 10}})
df_2 = pd.DataFrame({"fruit": {0: "banana", 1: "orange", 2: "apple", 3: "celery"},
"quantity": {0: 27, 1: 8, 2: 8, 3: 10}})
df_compare = pd.concat([df_1['fruit'],
df_1.compare(df_2, keep_equal=True, keep_shape=False)],1).dropna()
print(df_compare)
    fruit  (quantity, self)  (quantity, other)
0  banana              22.0               27.0
2   apple               7.0                8.0

Efficient column MultiIndex ordering

I have this DataFrame:
import pandas as pd

df = pd.DataFrame({'A': [2000, 2000, 2000, 2000, 2000, 2000],
                   'B': ["A+", 'B+', "A+", "B+", "A+", "B+"],
                   'C': ["M", "M", "M", "F", "F", "F"],
                   'D': [1, 5, 3, 4, 2, 6],
                   'Value': [11, 12, 13, 14, 15, 16]}).set_index(['A', 'B', 'C', 'D'])
df = df.unstack(['C', 'D']).fillna(0)
And I'm wondering if there is a more elegant way to order the column MultiIndex than the following code:
# rows ordering
df = df.sort_values(by = ['A', "B"], ascending = [True, True])
# col ordering
df = df.transpose().sort_values(by = ["C", "D"], ascending = [False, False]).transpose()
In particular, the last line with the two transposes feels far more complex than it should be. I tried using sort_index but wasn't able to use it in a MultiIndex context (for both rows and columns).
You can use sort_index on both column levels, naming them explicitly:
out = df.sort_index(axis=1, level=['C', 'D'], ascending=[False, False])
I found that sort_values also accepts axis=1, so the last line becomes:
df = df.sort_values(axis = 1, by = ["C", "D"], ascending = [True, False])
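For completeness, both orderings can also be expressed with sort_index alone by naming the index levels; a sketch reproducing the same row and column ordering:
# rows: sort the (A, B) index ascending
df = df.sort_index(axis=0, ascending=True)
# columns: sort the C and D levels without transposing
df = df.sort_index(axis=1, level=['C', 'D'], ascending=[True, False])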

How can I convert my dataset into JSON format like my required format

I want to convert this dataset (shown here as code, since the original was posted as an image):
import pandas as pd
df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'rollno': [1, 2, 3],
                   'teacher': ['xyz', 'xyz', 'xyz'],
                   'year': [1998, 1998, 1998]})
into this JSON format using pandas:
y = {'name':['a','b','c'],"rollno":[1,2,3],"teacher":'xyz',"year":1998}
First create a dictionary with DataFrame.to_dict, then in a dictionary comprehension collapse each list whose values are all identical to a scalar, by checking the length of its set:
d = {k: v if len(set(v)) > 1 else v[0] for k, v in df.to_dict('list').items()}
print (d)
{'name': ['a', 'b', 'c'], 'rollno': [1, 2, 3], 'teacher': 'xyz', 'year': 1998}
And then convert to json:
import json
j = json.dumps(d)
print (j)
{"name": ["a", "b", "c"], "rollno": [1, 2, 3], "teacher": "xyz", "year": 1998}
If the values should stay as (duplicated) lists:
import json
j = json.dumps(df.to_dict(orient='list'))
print (j)
{"name": ["a", "b", "c"], "rollno": [1, 2, 3],
"teacher": ["xyz", "xyz", "xyz"], "year": [1998, 1998, 1998]}