Convert list of dictionaries in a dataframe to a separate dataframe - pandas

I want to convert a list of dictionaries that is already stored in a column of my dataset into a dataframe.
The dataset looks something like this:
[{'id': 35, 'name': 'Comedy'}]
How do I convert this list of dictionaries to a dataframe?
Specifically, I want to retrieve:
Comedy
from the list of dictionaries.
Thank you for your time!

Use:
import pandas as pd

df = pd.DataFrame({'col':[[{'id': 35, 'name': 'Comedy'}],[{'id': 35, 'name': 'Western'}]]})
print (df)
col
0 [{'id': 35, 'name': 'Comedy'}]
1 [{'id': 35, 'name': 'Western'}]
df['new'] = df['col'].apply(lambda x: x[0].get('name'))
print (df)
col new
0 [{'id': 35, 'name': 'Comedy'}] Comedy
1 [{'id': 35, 'name': 'Western'}] Western
If a list can contain multiple dicts:
df = pd.DataFrame({'col':[[{'id': 35, 'name': 'Comedy'}, {'id':4, 'name':'Horror'}],
[{'id': 35, 'name': 'Western'}]]})
print (df)
col
0 [{'id': 35, 'name': 'Comedy'}, {'id': 4, 'name...
1 [{'id': 35, 'name': 'Western'}]
df['new'] = df['col'].apply(lambda x: [y.get('name') for y in x])
print (df)
col new
0 [{'id': 35, 'name': 'Comedy'}, {'id': 4, 'name... [Comedy, Horror]
1 [{'id': 35, 'name': 'Western'}] [Western]
And if you want to extract all values into a separate dataframe:
df1 = pd.concat([pd.DataFrame(x) for x in df['col']], ignore_index=True)
print (df1)
id name
0 35 Comedy
1 4 Horror
2 35 Western
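The concat approach above does not record which original row each dict came from. A minimal sketch, assuming the same df and pandas 0.25 or later (for Series.explode), that keeps the original row label as the index; the names exploded and flat are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'col': [[{'id': 35, 'name': 'Comedy'}, {'id': 4, 'name': 'Horror'}],
                           [{'id': 35, 'name': 'Western'}]]})

# one dict per row; the index repeats the label of the row it came from
exploded = df['col'].explode()

# build the flat frame and reuse that index so every row stays traceable
flat = pd.DataFrame(exploded.tolist(), index=exploded.index)
print(flat)
```

Rows that share an index label in flat came from the same list in df, so flat can be joined back to df on the index if needed.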

Related

Why doesn't pandas dataframe need full row values?

import pandas as pd

fields = ['name', 'type', 'age']
df = pd.DataFrame(columns=fields)
item1 = {'name': 'john', 'type': 'student', 'age': 21}
item2 = {'name': 'john', 'age': 21}
items = [item1, item2]
for item in items:
    # note: DataFrame.append was removed in pandas 2.0
    df = df.append(item, ignore_index=True)
I had thought only item1 could be appended, not item2, since it has only 2 of the 3 fields. Is this normal?
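This is expected: pandas fills keys that are missing from a row with NaN rather than rejecting the row. A minimal sketch of the same behaviour, assuming a recent pandas where DataFrame.append is no longer available, so the frame is built from the list of dicts directly instead of appending in a loop:

```python
import pandas as pd

fields = ['name', 'type', 'age']
item1 = {'name': 'john', 'type': 'student', 'age': 21}
item2 = {'name': 'john', 'age': 21}  # no 'type' key

# columns=fields fixes the column order; missing keys become NaN
df = pd.DataFrame([item1, item2], columns=fields)
print(df)

# the second row has a missing value in 'type'
print(df['type'].isna().tolist())
```

If rows with missing fields should be rejected, that check has to be done explicitly, for example with dropna on the required columns.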

Pandas - Extract value from Dataframe based on certain key value not in a sequence

I have a Dataframe in the below format:
id, ref
101, [{'id': '74947', 'type': {'id': '104', 'name': 'Sales', 'inward': 'Sales', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-A'}}]
102, [{'id': '74948', 'type': {'id': '105', 'name': 'Return', 'inward': 'Return Order', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-C'}},
{'id': '750001', 'type': {'id': '342', 'name': 'Sales', 'inward': 'Sales', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-X'}}]
103, [{'id': '74949', 'type': {'id': '106', 'name': 'Sales', 'inward': 'Return Order', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-B'}}]
104, [{'id': '67543', 'type': {'id': '106', 'name': 'Other', 'inward': 'Return Order', 'outward': 'PO'}, 'inwardIssue': {'id': '76560', 'key': 'Prod-BA'}}]
I am trying to extract rows that have name = Sales and return back the below output:
101, Prod-A
102, Prod-X
103, Prod-B
I am able to extract the required data if the key-value pair appears in the first element of the list, but not if it appears later, as in the case of id = 102:
df['names'] = df['ref'].str[0].str.get('type').str.get('name')
df['value'] = df['ref'].str[0].str.get('inwardIssue').str.get('key')
df['output'] = np.where(df['names'] == 'Sales', df['value'], 0)
Currently I only get values for id = 101 and 103.
Use explode, keeping the original row index so that the join lines up with df:
e=df.ref.explode()
s=pd.DataFrame(e.tolist(),index=e.index)
s=s.loc[s.type.str.get('name').eq('Sales'),'inwardIssue'].str.get('key')
dfs=df.join(s,how='right')
id ref inwardIssue
0 101 [{'id': '74947', 'type': {'id': '104', 'name':... Prod-A
1 102 [{'id': '74948', 'type': {'id': '105', 'name':... Prod-X
2 103 [{'id': '74949', 'type': {'id': '106', 'name':... Prod-B
If you already have a dataframe in that format, you can convert it to a list of records and use pd.json_normalize to flatten it, then slice/filter the flat dataframe.
df1 = pd.json_normalize(df.to_dict(orient='records'), 'ref')
The output of this flat dataframe df1
Out[83]:
id type.id type.name type.inward type.outward inwardIssue.id \
0 74947 104 Sales Sales PO 76560
1 74948 105 Return Return Order PO 76560
2 750001 342 Sales Sales PO 76560
3 74949 106 Sales Return Order PO 76560
4 67543 106 Other Return Order PO 76560
inwardIssue.key
0 Prod-A
1 Prod-C
2 Prod-X
3 Prod-B
4 Prod-BA
Finally, slicing on df1
df_final = df1.loc[df1['type.name'].eq('Sales'), ['type.id', 'inwardIssue.key']]
Out[88]:
type.id inwardIssue.key
0 104 Prod-A
2 342 Prod-X
3 106 Prod-B
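The flat frame above carries the inner ids only, not the outer id (101, 102, ...) that the expected output asks for. A sketch that passes the outer column through the meta argument of pd.json_normalize; because the records have their own 'id' key, a meta_prefix is needed to avoid a name conflict, and the prefix 'row_' is an arbitrary choice. The frame here is a trimmed copy of the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'id': [101, 102],
    'ref': [
        [{'id': '74947', 'type': {'name': 'Sales'}, 'inwardIssue': {'key': 'Prod-A'}}],
        [{'id': '74948', 'type': {'name': 'Return'}, 'inwardIssue': {'key': 'Prod-C'}},
         {'id': '750001', 'type': {'name': 'Sales'}, 'inwardIssue': {'key': 'Prod-X'}}],
    ],
})

# flatten every dict in 'ref' and carry the outer 'id' along as 'row_id'
flat = pd.json_normalize(df.to_dict(orient='records'), record_path='ref',
                         meta=['id'], meta_prefix='row_')

# keep only the Sales entries, with the outer id next to the product key
out = flat.loc[flat['type.name'].eq('Sales'), ['row_id', 'inwardIssue.key']]
print(out)
```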

Pandas - Extracting elements within a dictionary

I have a Python dictionary in the below structure. I am trying to extract certain elements from the Dictionary and convert them to a Dataframe.
When I try pd.DataFrame(df), I get a summary of the 2 groups (data and PageCount), whereas I only want the elements within Output in the dataframe.
{'code': 200,
'data': {'Output': [
{'id': 58,
'title': 'title1'},
{'id': 59,
'title': 'title2'}],
'PageCount': {'count': 196,
'page': 1,
'perPage': 10,
'totalPages': 20}},
'request_id': 'fggfgggdgd'}
Expected output:
id, title
58, title1
59, title2
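You can do:
df = pd.json_normalize(dct["data"]["Output"])
(pd.io.json.json_normalize is the older, deprecated spelling; pd.json_normalize is the top-level name since pandas 1.0.)
You can also use:
l=[v['Output'] for k,v in dct.items() if isinstance(v,dict) and 'Output' in v]
pd.DataFrame(l[0])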
id title
0 58 title1
1 59 title2
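Since the Output value is already a flat list of dicts, the DataFrame constructor alone may be enough. A minimal sketch, assuming the response is held in a variable named dct (the question does not name it):

```python
import pandas as pd

dct = {'code': 200,
       'data': {'Output': [{'id': 58, 'title': 'title1'},
                           {'id': 59, 'title': 'title2'}],
                'PageCount': {'count': 196, 'page': 1,
                              'perPage': 10, 'totalPages': 20}},
       'request_id': 'fggfgggdgd'}

# index straight into the nested list and let pandas build one row per dict
out = pd.DataFrame(dct['data']['Output'])
print(out)
```

json_normalize becomes the better choice once the dicts inside Output contain nested dicts that should be flattened into columns.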

pandas: aggregate array during groupby, equivalent of SQL's array_agg?

I've got this dataframe:
df1 = pd.DataFrame([
{ 'id': 1, 'spend': 60, 'store': 'Stockport' },
{ 'id': 2, 'spend': 68, 'store': 'Didsbury' },
{ 'id': 3, 'spend': 70, 'store': 'Stockport' },
{ 'id': 4, 'spend': 35, 'store': 'Didsbury' },
{ 'id': 5, 'spend': 16, 'store': 'Didsbury' },
{ 'id': 6, 'spend': 12, 'store': 'Didsbury' },
])
I've grouped it by store and got the total spend by store:
df1.groupby("store").agg({'spend': 'sum'})\
.reset_index().sort_values("spend", ascending=False)
store spend
Didsbury 131
Stockport 130
Is there a way I can get the IDs for each store as a column in the grouped object? Like the equivalent of ARRAY_AGG in Postgres? So the desired output would be:
store spend ids
Didsbury 131 [2,4,5,6]
Stockport 130 [1,3]
We can use named aggregation, which is available since pandas 0.25.0.
Notice how we can rename our column to "ids" right away:
df1.groupby('store').agg(
spend=('spend', 'sum'),
ids=('id', list)
).reset_index()
store spend ids
0 Didsbury 131 [2, 4, 5, 6]
1 Stockport 130 [1, 3]
You can pass list as the aggregation function for the id column:
df = (df1.groupby("store").agg({'spend': 'sum', 'id':list})
.reset_index()
.sort_values("spend", ascending=False))
print (df)
store spend id
0 Didsbury 131 [2, 4, 5, 6]
1 Stockport 130 [1, 3]
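If only the ids are needed and not the spend total, the aggregation can be applied to the single column. A minimal sketch, assuming the same df1 as in the question; the name ids is only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame([
    {'id': 1, 'spend': 60, 'store': 'Stockport'},
    {'id': 2, 'spend': 68, 'store': 'Didsbury'},
    {'id': 3, 'spend': 70, 'store': 'Stockport'},
    {'id': 4, 'spend': 35, 'store': 'Didsbury'},
    {'id': 5, 'spend': 16, 'store': 'Didsbury'},
    {'id': 6, 'spend': 12, 'store': 'Didsbury'},
])

# select the column first, then collect each group's values into a list
ids = df1.groupby('store')['id'].agg(list)
print(ids)
```

The result is a Series indexed by store; reset_index turns it back into a two-column dataframe.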

How to subtract one row from other rows in a grouped dataframe?

I've got this data frame with some 'init' values ('value', 'value2') that I want to subtract from the mid-term value 'mid' and the final value 'final' once I've grouped by ID.
import pandas as pd
df = pd.DataFrame({
'value': [100, 120, 130, 200, 190,210],
'value2': [2100, 2120, 2130, 2200, 2190,2210],
'ID': [1, 1, 1, 2, 2, 2],
'state': ['init','mid', 'final', 'init', 'mid', 'final'],
})
My attempt was to extract the index where I found 'init', 'mid' and 'final' and subtract the value of 'init' from 'mid' and 'final' once I've grouped the values by 'ID':
group = df.groupby('ID')
group['diff_1_f'] = group['value'].iloc[group.index[group['state'] == 'final'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]]
group['diff_2_f'] = group['value2'].iloc[group.index[group['state'] == 'final'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]
group['diff_1_m'] = group['value'].iloc[group.index[group['state'] == 'mid'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]
group['diff_2_m'] = group['value2'].iloc[group.index[group['state'] == 'mid'] - group['value'].iloc[group.index[dfs['state'] == 'init']]]
But of course it doesn't work. How can I obtain the following result:
df = pd.DataFrame({
'diff_value': [20, 30, -10,10],
'diff_value2': [20, 30, -10,10],
'ID': [ 1, 1, 2, 2],
'state': ['mid', 'final', 'mid', 'final'],
})
Also in its grouped form.
Use:
# names of the columns to subtract
cols = ['value', 'value2']
# names of the result columns; the join's lsuffix produces these
new = [c + '_diff' for c in cols]
# mask of the rows that are not 'init'
m = df['state'].ne('init')
# join each row with the init values of its ID, then keep only the non-init rows
df1 = df.join(df[~m].set_index('ID')[cols], lsuffix='_diff', on='ID')[m]
# subtract as a NumPy array (.values) to avoid alignment on column names
df1[new] = df1[new].sub(df1[cols].values)
# remove the helper columns holding the init values
df1 = df1.drop(cols, axis=1)
print (df1)
value_diff value2_diff ID state
1 20 20 1 mid
2 30 30 1 final
4 -10 -10 2 mid
5 10 10 2 final
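The same result can also be reached without the suffix-based join, by looking up each row's init values by ID. A minimal sketch, assuming exactly one 'init' row per ID; the names init and out are only illustrative, and the columns are renamed to match the question's expected output:

```python
import pandas as pd

df = pd.DataFrame({
    'value': [100, 120, 130, 200, 190, 210],
    'value2': [2100, 2120, 2130, 2200, 2190, 2210],
    'ID': [1, 1, 1, 2, 2, 2],
    'state': ['init', 'mid', 'final', 'init', 'mid', 'final'],
})
cols = ['value', 'value2']

# init values per ID, indexed by ID (assumes one 'init' row per ID)
init = df[df['state'].eq('init')].set_index('ID')[cols]

# non-init rows; copy so the assignment below does not touch df
out = df[df['state'].ne('init')].copy()

# fetch the init values for each row's ID and subtract position-wise
out[cols] = out[cols].to_numpy() - init.loc[out['ID'].tolist()].to_numpy()
out = out.rename(columns={c: 'diff_' + c for c in cols})
print(out)
```

For the grouped form the question asks about, out.groupby('ID') can be applied to the result.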