How can I convert my dataset into JSON in my required format - pandas

I want to convert this dataset (originally shown as an image, which is not preserved here)
into this JSON format using pandas:
y = {'name':['a','b','c'],"rollno":[1,2,3],"teacher":'xyz',"year":1998}

First create a dictionary with DataFrame.to_dict(orient='list'), then use a dictionary comprehension with if-else to collapse every list whose values are all identical into a single scalar, checking the length of its set:
d = {k: v if len(set(v)) > 1 else v[0] for k, v in df.to_dict('list').items()}
print(d)
{'name': ['a', 'b', 'c'], 'rollno': [1, 2, 3], 'teacher': 'xyz', 'year': 1998}
And then convert to json:
import json
j = json.dumps(d)
print (j)
{"name": ["a", "b", "c"], "rollno": [1, 2, 3], "teacher": "xyz", "year": 1998}
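The steps above can be tried end to end with a small sketch. The input frame is an assumption, reconstructed from the target output because the original dataset was only shown as an image, and `collapse_constant_columns` is a hypothetical helper name, not a pandas function.

```python
import json

import pandas as pd

# Assumed input: rebuilt from the desired output, not the asker's real data.
df = pd.DataFrame(
    {
        "name": ["a", "b", "c"],
        "rollno": [1, 2, 3],
        "teacher": ["xyz", "xyz", "xyz"],
        "year": [1998, 1998, 1998],
    }
)


def collapse_constant_columns(frame):
    """Map each column to a list, or to a scalar when it holds one distinct value."""
    result = {}
    for column, values in frame.to_dict(orient="list").items():
        # Exactly one distinct value means the column is constant across rows.
        result[column] = values[0] if len(set(values)) == 1 else values
    return result


d = collapse_constant_columns(df)
j = json.dumps(d)
print(j)
```

Note that a column which merely happens to be constant in the current data also collapses to a scalar, so the shape of the output depends on the values and not only on the schema.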
If the repeated values should be kept as lists:
import json
j = json.dumps(df.to_dict(orient='list'))
print(j)
{"name": ["a", "b", "c"], "rollno": [1, 2, 3],
"teacher": ["xyz", "xyz", "xyz"], "year": [1998, 1998, 1998]}

Related

Identify change in status due to change in categorical variable in panel data

I have unbalanced panel data (repeated observations per ID at different points in time). I need to identify a change in a variable per person over time.
Here is the code to generate the data frame:
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["C1", "C1", "C2", "C2", "C2"],
        "id": [1, 1, 2, 2, 2],
        "date": ["01/01/2021", "01/02/2021", "01/01/2021", "01/02/2021", "01/03/2021"],
        "job": ["A", "A", "A", "B", "B"],
    }
)
df
I am trying to create a column ("change") that indicates when individual 2 changes job status from A to B on that date (01/02/2021).
I have tried the following, but it is giving me an error:
df['change']=df.groupby(['id'])['job'].diff().fillna(0)
The error occurs because diff is called on the 'job' column, whose dtype is object; diff only works on numeric types.
Instead, compare each value with the previous one within the same id:
df["change"] = (
    df.groupby("id")["job"]
    .transform(lambda x: x.ne(x.shift().bfill()))
    .astype(int)
)
Here is the (longer) solution that I worked out:
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["C1", "C1", "C2", "C2", "C2"],
        "id": [1, 1, 2, 2, 2],
        "date": [0, 1, 0, 1, 2],
        "job": ["A", "A", "A", "B", "B"],
    }
)
df1 = df.set_index(['id', 'date']).sort_index()
# Previous job within each id (NaN for the first observation)
df1['job_lag'] = df1.groupby(level='id')['job'].shift()
# Fill the first observation with its own job so it never counts as a change
df1['job_lag'] = df1['job_lag'].fillna(df1['job'])

def change(x):
    if x['job'] != x['job_lag']:
        return 1
    else:
        return 0

df1['dummy'] = df1.apply(change, axis=1)
df1
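The same idea can also be written without a row-wise apply. The following is a minimal sketch, not the approach of either answer above; it assumes the dates are day/month/year strings and that the first observation of each id should never count as a change.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "region": ["C1", "C1", "C2", "C2", "C2"],
        "id": [1, 1, 2, 2, 2],
        "date": ["01/01/2021", "01/02/2021", "01/01/2021", "01/02/2021", "01/03/2021"],
        "job": ["A", "A", "A", "B", "B"],
    }
)

# Assumption: dates are day/month/year; adjust the format if they are month/day/year.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")
df = df.sort_values(["id", "date"])

# Job held at the previous observation of the same id (missing for the first one).
previous_job = df.groupby("id")["job"].shift()

# Flag a change only when a previous observation exists and the job differs from it.
df["change"] = (previous_job.notna() & df["job"].ne(previous_job)).astype(int)
print(df)
```

Checking `previous_job.notna()` explicitly also covers ids that have a single observation, for which there is no earlier value to compare against.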

Highlight distinct cells based on a different cell in the same row in a multiindex pivot table

I have created a pivot table where the column headers have several levels. This is a simplified version:
import pandas as pd

index = ['Person 1', 'Person 2', 'Person 3']
columns = [
    ["condition 1", "condition 1", "condition 1", "condition 2", "condition 2", "condition 2"],
    ["Mean", "SD", "n", "Mean", "SD", "n"],
]
data = [
    [100, 10, 3, 200, 12, 5],
    [500, 20, 4, 750, 6, 6],
    [1000, 30, 5, None, None, None],
]
df = pd.DataFrame(data, index=index, columns=columns)
df
Now I would like to highlight the cells adjacent to SD wherever SD > 10. (The expected result was shown in an image that is not preserved here.)
I found this answer but couldn't make it work for multiindices.
Thanks for any help.
Use Styler.apply with a custom function: select the SD columns with DataFrame.xs, then repeat the boolean mask across all columns of each condition with DataFrame.reindex:
def highlight(x):
    c1 = 'background-color: red'
    # True where SD > 10, one column per condition
    mask = x.xs('SD', axis=1, level=1).gt(10)
    # DataFrame with the same index and columns as the original, filled with empty strings
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    # set the style wherever the mask, repeated over the first column level, is True
    return df1.mask(mask.reindex(x.columns, level=0, axis=1), c1)

df.style.apply(highlight, axis=None)
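If only the neighbours of SD (Mean and n) should be coloured and the SD cell itself left plain, the style frame can be filled column by column. This is a sketch of one possible reading of the question, not the answer's method; `highlight_neighbours`, its `threshold` and `css` parameters, and the fixed neighbour names ("Mean", "n") are assumptions made for illustration.

```python
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["condition 1", "condition 2"], ["Mean", "SD", "n"]]
)
data = [
    [100, 10, 3, 200, 12, 5],
    [500, 20, 4, 750, 6, 6],
    [1000, 30, 5, None, None, None],
]
df = pd.DataFrame(data, index=["Person 1", "Person 2", "Person 3"], columns=columns)


def highlight_neighbours(x, threshold=10, css="background-color: red"):
    """Return a frame of CSS strings shaped like x, styling Mean and n where SD > threshold."""
    styles = pd.DataFrame("", index=x.index, columns=x.columns)
    # One SD column per condition; missing values compare as False below.
    sd = x.xs("SD", axis=1, level=1).astype(float)
    for condition in sd.columns:
        rows = sd[condition] > threshold
        for stat in ("Mean", "n"):
            styles.loc[rows, (condition, stat)] = css
    return styles


styles = highlight_neighbours(df)
print(styles)
```

The function returns a frame with the same labels as its input, so it can be passed to `df.style.apply(highlight_neighbours, axis=None)` in the same way as the function above.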

numpy unique over multiple arrays

Numpy.unique expects a 1-D array. If the input is not a 1-D array, it flattens it by default.
Is there a way for it to accept multiple arrays? To keep it simple, let's just say a pair of arrays, and we are unique-ing the pair of elements across the 2 arrays.
For example, say I have two NumPy arrays as inputs:
a = [1, 2, 3, 3]
b = [10, 20, 30, 31]
I'm unique-ing against both of these arrays, i.e. against these 4 pairs: (1, 10), (2, 20), (3, 30), and (3, 31). All 4 are unique, so I want my result to be
[True, True, True, True]
If instead the inputs are as follows
a = [1, 2, 3, 3]
b = [10, 20, 30, 30]
Then the last 2 elements are not unique. So the output should be
[True, True, True, False]
You could use the index array that numpy.unique() returns when called with return_index=True (the indices of the first occurrences):
In [243]: def is_unique(*lsts):
     ...:     arr = np.vstack(lsts)
     ...:     _, ind = np.unique(arr, axis=1, return_index=True)
     ...:     out = np.zeros(shape=arr.shape[1], dtype=bool)
     ...:     out[ind] = True
     ...:     return out
In [244]: a = [1, 2, 2, 3, 3]
In [245]: b = [1, 2, 2, 3, 3]
In [246]: c = [1, 2, 0, 3, 3]
In [247]: is_unique(a, b)
Out[247]: array([ True, True, False, True, False])
In [248]: is_unique(a, b, c)
Out[248]: array([ True, True, True, True, False])
You may also find this thread helpful.
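Applied to the two examples from the question, the approach can be sketched as a standalone script. The helper name `first_occurrence_mask` is hypothetical; the sketch stacks the inputs as columns and deduplicates rows with `axis=0`, and it assumes all inputs have the same length and a common dtype.

```python
import numpy as np


def first_occurrence_mask(*arrays):
    """Return True at each position where the tuple of values appears for the first time."""
    # Shape (n, k): one row per position, one column per input array.
    rows = np.column_stack(arrays)
    # Indices of the first occurrence of every distinct row.
    _, first_idx = np.unique(rows, axis=0, return_index=True)
    mask = np.zeros(rows.shape[0], dtype=bool)
    mask[first_idx] = True
    return mask


mask_a = first_occurrence_mask([1, 2, 3, 3], [10, 20, 30, 31])
mask_b = first_occurrence_mask([1, 2, 3, 3], [10, 20, 30, 30])
print(mask_a)
print(mask_b)
```

Later repeats of a tuple are marked False while its first occurrence stays True, which matches the output requested in the question.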

Efficient column MultiIndex ordering

I have this DataFrame:
import pandas

df = pandas.DataFrame({'A': [2000, 2000, 2000, 2000, 2000, 2000],
                       'B': ["A+", 'B+', "A+", "B+", "A+", "B+"],
                       'C': ["M", "M", "M", "F", "F", "F"],
                       'D': [1, 5, 3, 4, 2, 6],
                       'Value': [11, 12, 13, 14, 15, 16]}).set_index(['A', 'B', 'C', 'D'])
df = df.unstack(['C', 'D']).fillna(0)
I'm wondering whether there is a more elegant way to order the column MultiIndex than the following code:
# rows ordering
df = df.sort_values(by = ['A', "B"], ascending = [True, True])
# col ordering
df = df.transpose().sort_values(by = ["C", "D"], ascending = [False, False]).transpose()
In particular, I feel that the last line with the two transposes is far more complex than it should be. I tried using sort_index but wasn't able to make it work in a MultiIndex context (for either rows or columns).
You can use sort_index on the column levels:
out = df.sort_index(level=[0,1],axis=1,ascending=[True, False])
I can use axis=1, so the last line becomes:
df = df.sort_values(axis = 1, by = ["C", "D"], ascending = [True, False])
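Because the unstacked columns carry three levels here (the unnamed level holding 'Value', then 'C' and 'D'), it can be clearer to refer to the levels by name. The following is a minimal sketch under that assumption; it reproduces the descending order on both 'C' and 'D' that the question's transpose-based line asks for.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [2000, 2000, 2000, 2000, 2000, 2000],
        "B": ["A+", "B+", "A+", "B+", "A+", "B+"],
        "C": ["M", "M", "M", "F", "F", "F"],
        "D": [1, 5, 3, 4, 2, 6],
        "Value": [11, 12, 13, 14, 15, 16],
    }
).set_index(["A", "B", "C", "D"])
df = df.unstack(["C", "D"]).fillna(0)

# Rows: sort by the (A, B) index. Columns: sort by levels C and D, both descending.
out = df.sort_index(axis=0).sort_index(
    axis=1, level=["C", "D"], ascending=[False, False]
)
print(out)
```

Naming the levels avoids having to remember that position 0 is the value level and not 'C'.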

Pandas dataframe to json with key

I have a DataFrame with columns ['a', 'b', 'c']
and would like to export it as a dictionary, as follows:
{ 'value of a' : { 'b': 3, 'c': 7},
  'value2 of a' : { 'b': 7, 'c': 9}
}
I believe you need set_index with DataFrame.to_dict:
df = pd.DataFrame({'a': list('ABC'),
                   'b': [4, 5, 4],
                   'c': [7, 8, 9]})
print(df)
   a  b  c
0  A  4  7
1  B  5  8
2  C  4  9
d = df.set_index('a').to_dict('index')
print (d)
{'A': {'b': 4, 'c': 7}, 'B': {'b': 5, 'c': 8}, 'C': {'b': 4, 'c': 9}}
And for json use DataFrame.to_json:
j = df.set_index('a').to_json(orient='index')
print (j)
{"A":{"b":4,"c":7},"B":{"b":5,"c":8},"C":{"b":4,"c":9}}
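As a quick check that the two outputs agree, the JSON string can be parsed back and compared with the dictionary. This is a small sketch reusing the sample frame from the answer; the uniqueness guard is an addition of this sketch, included because orient='index' requires each key in column 'a' to appear only once.

```python
import json

import pandas as pd

df = pd.DataFrame({"a": list("ABC"), "b": [4, 5, 4], "c": [7, 8, 9]})

# Guard added for illustration: duplicate keys cannot be represented as one dict entry each.
if not df["a"].is_unique:
    raise ValueError("column 'a' must be unique to serve as the dictionary key")

indexed = df.set_index("a")
d = indexed.to_dict(orient="index")
j = indexed.to_json(orient="index")

# Parse the JSON text back into Python objects and compare with the dict.
parsed = json.loads(j)
print(parsed == d)
```

If column 'a' can contain repeated values, a different target structure is needed, for example a list of records per key.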