Efficient column MultiIndex ordering - pandas

I have this dataframe :
df = pandas.DataFrame({'A' : [2000, 2000, 2000, 2000, 2000, 2000],
'B' : ["A+", 'B+', "A+", "B+", "A+", "B+"],
'C' : ["M", "M", "M", "F", "F", "F"],
'D' : [1, 5, 3, 4, 2, 6],
'Value' : [11, 12, 13, 14, 15, 16] }).set_index((['A', 'B', 'C', 'D']))
df = df.unstack(['C', 'D']).fillna(0)
And I'm wondering is there is a more elegant way to order the columns MultiIndex that the following code :
# rows ordering
df = df.sort_values(by = ['A', "B"], ascending = [True, True])
# col ordering
df = df.transpose().sort_values(by = ["C", "D"], ascending = [False, False]).transpose()
Especially I feel like the last line with the two transpose si far more complex than it should be. I tried using sort_index but wasn't able to use it in a MultiIndex context (for both lines and columns).

You can use sort index on both levels:
out = df.sort_index(level=[0,1],axis=1,ascending=[True, False])

I can use
axis=1
And therefore the last line become
df = df.sort_values(axis = 1, by = ["C", "D"], ascending = [True, False])

Related

Identify change in status due to change in categorical variable in panel data

I have unbalanced panel data (repeated observations per ID at different points in time). I need to identify for a change in variable per person over time.
Here is the code to generate the data frame:
df = pd.DataFrame(
{
"region": ["C1", "C1", "C2", "C2", "C2"],
"id": [1, 1, 2, 2, 2],
"date": ["01/01/2021", "01/02/2021", "01/01/2021", "01/02/2021", "01/03/2021"],
"job": ["A", "A", "A", "B", "B"],
}
)
df
I am trying to create a column ("change") that indicates when individual 2 changes job status from A to B on that date (01/02/2021).
I have tried the following, but it is giving me an error:
df['change']=df.groupby(['id'])['job'].diff().fillna(0)
In your code error happens because you use 'diff' on 'job' column, but 'job' type is 'object' and 'diff' works only with numeric types.
current answer:
df["change"] = df.groupby(
["id"])["job"].transform(lambda x: x.ne(x.shift().bfill())).astype(int)
Here is the (longer) solution that I worked out:
df = pd.DataFrame(
{
"region": ["C1", "C1", "C2", "C2", "C2"],
"id": [1, 1, 2, 2, 2],
"date": [0, 1, 0, 1, 2],
"job": ["A", "A", "A", "B", "B"],
}
)
df1 = df.set_index(['id', 'date']).sort_index()
df1['job_lag'] = df1.groupby(level='id')['job'].shift()
df1.job_lag.fillna(df1.job, inplace=True)
def change(x):
if x['job'] != x['job_lag'] :
return 1
else:
return 0
df1['dummy'] = df1.apply(change, axis=1)
df1

Highlight distinct cells based on a different cell in the same row in a multiindex pivot table

I have created a pivot table where the column headers have several levels. This is a simplified version:
index = ['Person 1', 'Person 2', 'Person 3']
columns = [
["condition 1", "condition 1", "condition 1", "condition 2", "condition 2", "condition 2"],
["Mean", "SD", "n", "Mean", "SD", "n"],
]
data = [
[100, 10, 3, 200, 12, 5],
[500, 20, 4, 750, 6, 6],
[1000, 30, 5, None, None, None],
]
df = pd.DataFrame(data, columns=columns)
df
Now I would like to highlight the adjacent cells next to SD if SD > 10. This is how it should look like:
I found this answer but couldn't make it work for multiindices.
Thanks for any help.
Use Styler.apply with custom function - for select column use DataFrame.xs and for repeat boolean use DataFrame.reindex:
def hightlight(x):
c1 = 'background-color: red'
mask = x.xs('SD', axis=1, level=1).gt(10)
#DataFrame with same index and columns names as original filled empty strings
df1 = pd.DataFrame('', index=x.index, columns=x.columns)
#modify values of df1 column by boolean mask
return df1.mask(mask.reindex(x.columns, level=0, axis=1), c1)
df.style.apply(hightlight, axis=None)

Merging many multiple dataframes within a list into one dataframe

i have several dataframes, with all the same columns, within one list that i would like to have within one dataframe.
For instance, i have these three dataframes here:
df1 = pd.DataFrame(np.array([[10, 20, 30], [40, 50, 60], [70, 80, 90]]),
columns=['a', 'b', 'c'])
df2 = pd.DataFrame(np.array([[11, 22, 33], [44, 55, 66], [77, 88, 99]]),
columns=['a', 'b', 'c'])
df3 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
columns=['a', 'b', 'c'])
within one list:
dfList = [df1,df2,df3]
I know i can use the following which provides me with exactly what I'm looking for:
df_merge = pd.concat([dfList[0],dfList[1],dfList[2]])
However, my in my actual data i have 100s of dataframes within a list, so I'm trying to find a way to loop through and concat:
dfList_all = pd.DataFrame()
for i in range(len(dfList)):
dfList_all = pd.concat(dfList[i])
I tried the following above, but it provides me with the following error:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"
Any ideas would be wonderful. Thanks

Calculate the difference between all rows and a specific row in the dataframe

This is a similar question to this thread.
Lets consider df as:
df = pd.DataFrame([["a", 2, 3], ["b", 5, 6], ["c", 8, 9],["a", 0, 0], ["a", 8, 7], ["c", 2, 1]], columns = ["A", "B", "C"])
How can you calculate the difference between all rows and the row at Nth index in a group (lowest index for EACH group) for column "B", and put it in column "D"? I want to calculate mean square displacement for my data and I want to calculate the difference of values in a column in each group with the first appeared row in that group.
I tried:
df['D'] = df.groupby(["A"])['B'].sub(df.groupby(['A'])["B"].iloc[0])
Group = df.groupby(["A"])
However using .sub and groupby raise the following error:
AttributeError: 'SeriesGroupBy' object has no attribute 'sub'
the desired result would be like this:
A B C D
0 a 2 3 0 *lowest index in group "a"
1 b 5 6 0 *lowest index in group "b"
2 c 8 9 0 *lowest index in group "c"
3 a 0 0 -2
4 a 8 7 6
5 c 2 1 -6
I guess this answer could be enough of a hint for you:
import pandas as pd
df = pd.DataFrame([["a", 2, 3], ["b", 5, 6], ["c", 8, 9],["a", 0, 0], ["a", 8, 7], ["c", 2, 1]], columns = ["A", "B", "C"])
print("df:")
print(df)
print()
groupA = df.groupby(['A'])
print("groupA:")
print(groupA.groups)
print()
print("lowest indices for each group from columnA:")
lowest_indices = dict()
for k, v in groupA.groups.items():
lowest_indices[k] = v[0]
print(lowest_indices)
print()
columnB = df['B']
print("columnB:")
print(columnB)
print()
df['D'] = df['B']
for i in range(len(df)):
group_at_i = df['A'].iloc[i]
lowest_index_of_that = lowest_indices[group_at_i]
b_element_at_that_index = df['B'].iloc[lowest_index_of_that]
the_difference = df['B'].iloc[i] - b_element_at_that_index
df.loc[i, 'D'] = the_difference
print("df:")
print(df)

How can i convert my dataset into json format like my required format

i want to convert my this dataset
enter image description here
into this json format using pandas
y = {'name':['a','b','c'],"rollno":[1,2,3],"teacher":'xyz',"year":1998}
First create dictionary by DataFrame.to_dict and filter out duplicated lists for scalars in dictionary comprehension with if-else by check length of sets:
d = {k:v if len(set(v)) > 1 else v[0] for k, v in df.to_dict('l').items()}
print (d)
{'name': ['a', 'b', 'c'], 'rollno': [1, 2, 3], 'teacher': 'xyz', 'year': 1998}
And then convert to json:
import json
j = json.dumps(d)
print (j)
{"name": ["a", "b", "c"], "rollno": [1, 2, 3], "teacher": "xyz", "year": 1998}
If values should be duplicated:
import json
j = json.dumps(df.to_dict(orient='l'))
print (j)
{"name": ["a", "b", "c"], "rollno": [1, 2, 3],
"teacher": ["xyz", "xyz", "xyz"], "year": [1998, 1998, 1998]}