Identify change in status due to change in categorical variable in panel data

Identify change in status due to change in categorical variable in panel data - pandas

I have unbalanced panel data (repeated observations per ID at different points in time). I need to identify for a change in variable per person over time.
Here is the code to generate the data frame:
df = pd.DataFrame(
{
"region": ["C1", "C1", "C2", "C2", "C2"],
"id": [1, 1, 2, 2, 2],
"date": ["01/01/2021", "01/02/2021", "01/01/2021", "01/02/2021", "01/03/2021"],
"job": ["A", "A", "A", "B", "B"],
}
)
df
I am trying to create a column ("change") that indicates when individual 2 changes job status from A to B on that date (01/02/2021).
I have tried the following, but it is giving me an error:
df['change']=df.groupby(['id'])['job'].diff().fillna(0)

In your code error happens because you use 'diff' on 'job' column, but 'job' type is 'object' and 'diff' works only with numeric types.
current answer:
df["change"] = df.groupby(
["id"])["job"].transform(lambda x: x.ne(x.shift().bfill())).astype(int)

Here is the (longer) solution that I worked out:
df = pd.DataFrame(
{
"region": ["C1", "C1", "C2", "C2", "C2"],
"id": [1, 1, 2, 2, 2],
"date": [0, 1, 0, 1, 2],
"job": ["A", "A", "A", "B", "B"],
}
)
df1 = df.set_index(['id', 'date']).sort_index()
df1['job_lag'] = df1.groupby(level='id')['job'].shift()
df1.job_lag.fillna(df1.job, inplace=True)
def change(x):
if x['job'] != x['job_lag'] :
return 1
else:
return 0
df1['dummy'] = df1.apply(change, axis=1)
df1

Related

Highlight distinct cells based on a different cell in the same row in a multiindex pivot table

I have created a pivot table where the column headers have several levels. This is a simplified version:
index = ['Person 1', 'Person 2', 'Person 3']
columns = [
["condition 1", "condition 1", "condition 1", "condition 2", "condition 2", "condition 2"],
["Mean", "SD", "n", "Mean", "SD", "n"],
]
data = [
[100, 10, 3, 200, 12, 5],
[500, 20, 4, 750, 6, 6],
[1000, 30, 5, None, None, None],
]
df = pd.DataFrame(data, columns=columns)
df
Now I would like to highlight the adjacent cells next to SD if SD > 10. This is how it should look like:
I found this answer but couldn't make it work for multiindices.
Thanks for any help.

Use Styler.apply with custom function - for select column use DataFrame.xs and for repeat boolean use DataFrame.reindex:
def hightlight(x):
c1 = 'background-color: red'
mask = x.xs('SD', axis=1, level=1).gt(10)
#DataFrame with same index and columns names as original filled empty strings
df1 = pd.DataFrame('', index=x.index, columns=x.columns)
#modify values of df1 column by boolean mask
return df1.mask(mask.reindex(x.columns, level=0, axis=1), c1)
df.style.apply(hightlight, axis=None)

How do columns work in a Pandas Dataframe after using GroupBy

Basically, I want to use iterrows method to loop through my group-by dataframe, but I can't figure out how the columns work. In the example below, it does not create a column Called "Group1" and "Group2" like one might expect. One of the columns is a dtype itself?
import pandas as pd
df = pd.DataFrame(columns=["Group1", "Group2", "Amount"])
df = df.append({"Group1": "Apple", "Group2": "Red Delicious", "Amount": 15}, ignore_index=True)
df = df.append({"Group1": "Apple", "Group2": "McIntosh", "Amount": 20}, ignore_index=True)
df = df.append({"Group1": "Apple", "Group2": "McIntosh", "Amount": 30}, ignore_index=True)
df = df.append({"Group1": "Apple", "Group2": "Fuju", "Amount": 7}, ignore_index=True)
df = df.append({"Group1": "Orange", "Group2": "Navel", "Amount": 9}, ignore_index=True)
df = df.append({"Group1": "Orange", "Group2": "Navel", "Amount": 5}, ignore_index=True)
df = df.append({"Group1": "Orange", "Group2": "Mandarin", "Amount": 12}, ignore_index=True)
print(df.dtypes)
print(df.to_string())
df_sum = df.groupby(['Group1', 'Group2']).sum(['Amount'])
print("---- Sum Results----")
print(df_sum.dtypes)
print(df_sum.to_string())
for index, row in df_sum.iterrows():
# The line below is what I want to do conceptually.
# print(row.Group1, row.Group2. row.Amount) # 'Series' object has no attribute 'Group1'
print(row.Amount) # 'Series' object has no attribute 'Group1'
The part of the output we are interested is here. I noticed that "Group1 and Group2" are on a lin below the Amount.
---- Sum Results----
Amount int64
dtype: object
Amount
Group1 Group2
Apple Fuju 7
McIntosh 50
Red Delicious 15
Orange Mandarin 12
Navel 14

Simply try:
df_sum = df.groupby(['Group1', 'Group2'])['Amount'].sum().reset_index()
OR
df_sum = df.groupby(['Group1', 'Group2'])['Amount'].agg('sum').reset_index()
Even, it Simply can be ad follows, as we are performing the sum based on the Group1 & Group2 only.
df_sum = df.groupby(['Group1', 'Group2']).sum().reset_index()
Another way:
df_sum = df.groupby(['Group1', 'Group2']).agg({'Amount': 'sum'}).reset_index()

Try to reset_index
df_sum = df.groupby(['Group1', 'Group2']).sum(['Amount']).reset_index()

Efficient column MultiIndex ordering

I have this dataframe :
df = pandas.DataFrame({'A' : [2000, 2000, 2000, 2000, 2000, 2000],
'B' : ["A+", 'B+', "A+", "B+", "A+", "B+"],
'C' : ["M", "M", "M", "F", "F", "F"],
'D' : [1, 5, 3, 4, 2, 6],
'Value' : [11, 12, 13, 14, 15, 16] }).set_index((['A', 'B', 'C', 'D']))
df = df.unstack(['C', 'D']).fillna(0)
And I'm wondering is there is a more elegant way to order the columns MultiIndex that the following code :
# rows ordering
df = df.sort_values(by = ['A', "B"], ascending = [True, True])
# col ordering
df = df.transpose().sort_values(by = ["C", "D"], ascending = [False, False]).transpose()
Especially I feel like the last line with the two transpose si far more complex than it should be. I tried using sort_index but wasn't able to use it in a MultiIndex context (for both lines and columns).

You can use sort index on both levels:
out = df.sort_index(level=[0,1],axis=1,ascending=[True, False])

I can use
axis=1
And therefore the last line become
df = df.sort_values(axis = 1, by = ["C", "D"], ascending = [True, False])

How can i convert my dataset into json format like my required format

i want to convert my this dataset
enter image description here
into this json format using pandas
y = {'name':['a','b','c'],"rollno":[1,2,3],"teacher":'xyz',"year":1998}

First create dictionary by DataFrame.to_dict and filter out duplicated lists for scalars in dictionary comprehension with if-else by check length of sets:
d = {k:v if len(set(v)) > 1 else v[0] for k, v in df.to_dict('l').items()}
print (d)
{'name': ['a', 'b', 'c'], 'rollno': [1, 2, 3], 'teacher': 'xyz', 'year': 1998}
And then convert to json:
import json
j = json.dumps(d)
print (j)
{"name": ["a", "b", "c"], "rollno": [1, 2, 3], "teacher": "xyz", "year": 1998}
If values should be duplicated:
import json
j = json.dumps(df.to_dict(orient='l'))
print (j)
{"name": ["a", "b", "c"], "rollno": [1, 2, 3],
"teacher": ["xyz", "xyz", "xyz"], "year": [1998, 1998, 1998]}

Lookup smallest value greater than current

I have an objects table and a lookup table. In the objects table, I'm looking to add the smallest value from the lookup table that is greater than the object's number.
I found this similar question but it's about finding a value greater than a constant, rather than changing for each row.
In code:
import pandas as pd
objects = pd.DataFrame([{"id": 1, "number": 10}, {"id": 2, "number": 30}])
lookup = pd.DataFrame([{"number": 3}, {"number": 12}, {"number": 40}])
expected = pd.DataFrame(
[
{"id": 1, "number": 10, "smallest_greater": 12},
{"id": 2, "number": 30, "smallest_greater": 40},
]
)

First compare each value lookup['number'] by objects['number'] to 2d boolean mask, then add cumsum and compare first value by 1 and get position by numpy.argmax for set value by lookup['number'].
Output is generated with numpy.where for overwrite all not matched values to NaN.
objects = pd.DataFrame([{"id": 1, "number": 10}, {"id": 2, "number": 30},
{"id": 3, "number": 100},{"id": 4, "number": 1}])
print (objects)
id number
0 1 10
1 2 30
2 3 100
3 4 1
m1 = lookup['number'].values >= objects['number'].values[:, None]
m2 = np.cumsum(m1, axis=1) == 1
m3 = np.any(m1, axis=1)
out = lookup['number'].values[m2.argmax(axis=1)]
objects['smallest_greater'] = np.where(m3, out, np.nan)
print (objects)
id number smallest_greater
0 1 10 12.0
1 2 30 40.0
2 3 100 NaN
3 4 1 3.0

smallest_greater = []
for i in objects['number']: smallest_greater.append(lookup['number'[lookup[lookup['number']>i].sort_values(by='number').index[0]])
objects['smallest_greater'] = smallest_greater

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Identify change in status due to change in categorical variable in panel data - pandas

In your code error happens because you use 'diff' on 'job' column, but 'job' type is 'object' and 'diff' works only with numeric types. current answer: df["change"] = df.groupby( ["id"])["job"].transform(lambda x: x.ne(x.shift().bfill())).astype(int)

Related

Highlight distinct cells based on a different cell in the same row in a multiindex pivot table

How do columns work in a Pandas Dataframe after using GroupBy

Efficient column MultiIndex ordering

How can i convert my dataset into json format like my required format

Lookup smallest value greater than current

Categories

Resources