How to expand a nested dictionary in a pandas column?

I have a dataframe with the following nested dictionary in one of the columns:
ID dict
1 {Comp A: {Street: 123 Street}, Comp B: {Street: 456 Street}}
2 {Comp C: {Street: 749 Street}}
3 {Comp D: {Street: }}
I want to expand out the dictionary into the resulting data frame:
ID company_name street
1 Comp A 123 Street
1 Comp B 456 Street
2 Comp C 749 Street
3 Comp D
I have tried the following
df['dict'] = df['dict'].apply(eval)
dft = df.explode('dict')
Which gives me the ID and company_name column correctly, though I haven't been able to figure out how to expand out the street column as well.
This is the data in dictionary form, for reproducibility:
data = [{'ID': 1, 'entity_details': "{'comp a': {'street_address': '123 street'}}"},
        {'ID': 2, 'entity_details': "{'comp b': {'street_address': '456 street'}}"},
        {'ID': 3, 'entity_details': "{'comp c': {'street_address': '555 street'}, 'comp d': {'street_address': '585 street'}, 'comp e': {'street_address': '873 street'}}"},
        {'ID': 4, 'entity_details': "{'comp f': {'street_address': '898 street'}}"}]

A for loop should suffice and be efficient for your use case. The key is to export the frame into a dictionary (you've done that already with your df.to_dict() code) and then iterate with the logic you need. If you are on Python 3.10+, the structural pattern matching syntax can simplify this further.
out = []
for entry in data:
    for key, value in entry.items():
        if key == "entity_details":
            val = eval(value)  # the nested dict is stored as a string
            for k, v in val.items():
                result = (entry['ID'], k, v['street_address'])
                out.append(result)
pd.DataFrame(out, columns=['ID', 'company_name', 'street_address'])
ID company_name street_address
0 1 comp a 123 street
1 2 comp b 456 street
2 3 comp c 555 street
3 3 comp d 585 street
4 3 comp e 873 street
5 4 comp f 898 street
Below is a possible structural pattern matching option; however, I feel the for loop above is more explicit:
out = []
for entry in data:
    match entry:
        case {"entity_details": other}:
            output = eval(other)
            output = [(entry['ID'], key, value['street_address'])
                      for key, value in output.items()]
            out.extend(output)
pd.DataFrame(out, columns=['ID', 'company_name', 'street_address'])
ID company_name street_address
0 1 comp a 123 street
1 2 comp b 456 street
2 3 comp c 555 street
3 3 comp d 585 street
4 3 comp e 873 street
5 4 comp f 898 street
Of course, for more complex destructuring, pattern matching can come in handy.
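For instance, here is a sketch with a hypothetical record that also carries a nested contact mapping (the contact key and phone field are made up for illustration); a single pattern can destructure several levels at once:
record = {"entity_details": {"comp a": {"street_address": "123 street"}},
          "contact": {"phone": "555-0100"}}

match record:
    case {"entity_details": details, "contact": {"phone": phone}}:
        # both the details mapping and the nested phone number are bound at once
        print(details, phone)
    case {"entity_details": details}:
        print(details, "no phone on file")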

Initial data
Since the original data wasn't provided, I'll suppose that we have this:
data = {
    1: "{'Comp A': {'Street': '123 Street'}, 'Comp B': {'Street': '456 Street'}}",
    2: "{'Comp C': {'Street': '749 Street'}}",
    3: "{'Comp D': {'Street': ''}}",
}
df = pd.DataFrame.from_dict(data, orient='index', columns=['dict'])
At least with these data, parsing the strings with an eval-style function is justified.
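As a side note, ast.literal_eval is usually preferable to eval for strings like these: it parses only Python literals and refuses to execute arbitrary expressions. A quick sketch:
from ast import literal_eval

s = "{'Comp A': {'Street': '123 Street'}}"
print(literal_eval(s))  # {'Comp A': {'Street': '123 Street'}}
# anything that is not a plain literal raises ValueError instead of running,
# e.g. literal_eval("__import__('os')") fails rather than importing os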
The main idea
To transform them into the desired (company, street) format, we can use DataFrame.from_dict and concat together with literal_eval, like this:
f = partial(pd.DataFrame.from_dict, orient='index')
df_transformed = pd.concat(map(f, df['dict'].map(literal_eval)))
Here
f converts a dictionary into a DataFrame, using its keys as the index;
.map(literal_eval) converts the stringified dictionaries into real dictionaries;
map(f, ...) supplies the resulting data frames to pd.concat.
The final touch could be setting the index and renaming the columns, which we can do inside pd.concat like this:
pd.concat(..., keys=df.index, names=['id', 'company']).reset_index('company')
The code
import pandas as pd
from functools import partial
from ast import literal_eval

data = {
    1: "{'Comp A': {'Street': '123 Street'}, 'Comp B': {'Street': '456 Street'}}",
    2: "{'Comp C': {'Street': '749 Street'}}",
    3: "{'Comp D': {'Street': ''}}",
}
df = pd.DataFrame.from_dict(data, orient='index', columns=['dict'])

f = partial(pd.DataFrame.from_dict, orient='index')
dft = pd.concat(
    map(f, df['dict'].map(literal_eval)),
    keys=df.index,  # use the original index to identify where each record comes from
    names=['id', 'Company']
).reset_index('Company')
print(dft)
The output:
Company Street
id
1 Comp A 123 Street
1 Comp B 456 Street
2 Comp C 749 Street
3 Comp D
P.S.
Let's say that:
data = \
[{'ID': 1, 'entity_details': "{'comp a': {'street_address': '123 street'}}"},
 {'ID': 2, 'entity_details': "{'comp b': {'street_address': '456 street'}}"},
 {'ID': 3, 'entity_details': "{'comp c': {'street_address': '555 street'}, 'comp d': {'street_address': '585 street'}, 'comp e': {'street_address': '873 street'}}"},
 {'ID': 4, 'entity_details': "{'comp f': {'street_address': '898 street'}}"}]
df = pd.DataFrame(data).set_index('ID')
In this case the only thing we should change in the code is the initial column name. It was dict, and now it's entity_details:
pd.concat(
    map(f, df['entity_details'].map(literal_eval)),
    keys=df.index,
    names=['id', 'Company']
).reset_index('Company')

Related

Assign Random Number between two value conditionally

I have a dataframe:
df = pd.DataFrame({
    'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
    'Line': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'AGE': ['35-44', '20-34', '35-44', '35-44', '45-70', '35-44']})
I wish to replace the values in the AGE column with integers between two values. So, for example, I wish to replace each occurrence of the age range '35-44' with a random integer between 35 and 44.
I tried:
df.loc[df["AGE"]== '35-44', 'AGE'] = random.randint(35, 44)
But it picks the same value for each row. I would like it to randomly pick a different value for each row.
I get:
df = pd.DataFrame({
    'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
    'Line': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'AGE': ['38', '20-34', '38', '38', '45-70', '38']})
But I would like to get something like the following. I don't much care how the values are distributed, as long as they are in the range that I assign:
df = pd.DataFrame({
    'Prod': ['abc', 'qrt', 'xyz', 'xam', 'asc', 'yat'],
    'Line': ['Revenues', 'EBT', 'Expenses', 'Revenues', 'EBT', 'Expenses'],
    'AGE': ['36', '20-34', '39', '38', '45-70', '45']})
The code
random.randint(35, 44)
produces a single random value, making the statement analogous to:
df.loc[df["AGE"] == '35-44', 'AGE'] = 38  # some constant
We need a collection of values that is the same length as the values to fill. We can use np.random.randint instead:
import numpy as np

m = df["AGE"] == '35-44'
# note: np.random.randint's high bound is exclusive, so this draws 35..43;
# pass 45 as the high bound if 44 should be possible too
df.loc[m, 'AGE'] = np.random.randint(35, 44, m.sum())
(Series.sum is used to "count" the number of True values in the Series since True is 1 and False is 0)
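For illustration, on the example frame the mask marks four rows, so four random draws are needed:
m = df["AGE"] == '35-44'
print(m.sum())  # 4 -> one random integer per matching row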
df:
Prod Line AGE
0 abc Revenues 40
1 qrt EBT 20-34
2 xyz Expenses 41
3 xam Revenues 35
4 asc EBT 45-70
5 yat Expenses 36
*Reproducible with np.random.seed(26)
Naturally, using the filter on both sides of the expression with apply would also work:
import random
m = df["AGE"] == '35-44'
df.loc[m, 'AGE'] = df.loc[m, 'AGE'].apply(lambda _: random.randint(35, 44))
df:
Prod Line AGE
0 abc Revenues 36
1 qrt EBT 20-34
2 xyz Expenses 37
3 xam Revenues 43
4 asc EBT 45-70
5 yat Expenses 44
*Reproducible with random.seed(28)
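As a side note (my addition, not part of the answers above), newer NumPy code usually draws integers through a seeded Generator; its integers method can make the upper bound inclusive via endpoint=True, so 44 can actually appear:
import numpy as np

rng = np.random.default_rng(26)  # the modern, explicitly seeded API
m = df["AGE"] == '35-44'
# endpoint=True makes the high bound inclusive: draws fall in 35..44
df.loc[m, 'AGE'] = rng.integers(35, 44, size=m.sum(), endpoint=True)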

Numbering rows in pandas dataframe

I have a dataframe that looks like:
import pandas as pd

df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
    'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
    'number': [0, 29, 0, 52, 0, 0]})
And I am working on a solution to number the 0 values in the number column.
My code, which is not working right:
for i, row in df.iterrows():
    df.loc[df['number'] == 0, 'number'] = i + 1
df
This code replaces every 0 with 1, but it must replace the first 0 with 1, the second 0 with 2, and so on.
I would like a solution based on an iteration method.
Note: the numbers 29, 52, etc. must not be changed.
Try np.where on a Boolean mask of the 0 values in df, replacing them with the cumsum of the mask to enumerate:
m = df['number'].eq(0)
df['number'] = np.where(m, m.cumsum(), df['number'])
Or use Series.mask:
m = df['number'].eq(0)
df['number'] = df['number'].mask(m, m.cumsum())
df:
first_name last_name number
0 John Smith 1
1 Jane Doe 29
2 Marry Jackson 2
3 Victoria Smith 52
4 Gabriel Brown 3
5 Layla Martinez 4
m.cumsum():
0 1
1 1
2 2
3 2
4 3
5 4
Name: number, dtype: int32
Complete Working Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Marry', 'Victoria', 'Gabriel', 'Layla'],
    'last_name': ['Smith', 'Doe', 'Jackson', 'Smith', 'Brown', 'Martinez'],
    'number': [0, 29, 0, 52, 0, 0]
})
m = df['number'].eq(0)
df['number'] = np.where(m, m.cumsum(), df['number'])
print(df)
Try via boolean masking and the loc accessor:
mask = df['number'] == 0  # created boolean mask
df.loc[mask, 'number'] = mask.cumsum()
OR
via the where() method:
df['number'] = df.where(~mask, mask.cumsum(), axis=0)['number']
OR
via boolean masking and the assign() method:
df[mask] = df[mask].assign(number=mask.cumsum())
Output of df:
first_name last_name number
0 John Smith 1
1 Jane Doe 29
2 Marry Jackson 2
3 Victoria Smith 52
4 Gabriel Brown 3
5 Layla Martinez 4
Alternative via replace and fillna:
df.number = df.number.replace(0, np.nan).fillna(df.number.eq(0).cumsum())
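One caveat with this variant (my note, not from the original answer): replace(0, np.nan) upcasts the column to float, so a final astype(int) restores the integer dtype:
df['number'] = (df['number'].replace(0, np.nan)
                            .fillna(df['number'].eq(0).cumsum())
                            .astype(int))  # NaNs forced a float dtype; cast back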

Compare two columns in different data frames in pandas

I have two tables as shown below.
user table:
user_id courses attended_modules
1 [A] {A:[1,2,3,4,5,6]}
2 [A,B,C] {A:[8], B:[5], C:[6]}
3 [A,B] {A:[2,3,9], B:[10]}
4 [A] {A:[3]}
5 [B] {B:[5]}
6 [A] {A:[3]}
7 [B] {B:[5]}
8 [A] {A:[4]}
Course table:
course_id modules
A [1,2,3,4,5,6,8,9]
B [5,8]
C [6,10]
From the above, compare attended_modules in the user table with modules in the course table, and create a new column in the user table, Remaining_modules, as explained below.
Example: user_id = 1 attended course A and completed 6 modules; there are 8 modules in the course, so Remaining_modules = {A: 2}.
Similarly, for user_id = 2, Remaining_modules = {A: 7, B: 1, C: 1}.
And so on...
Expected Output:
user_id attended_modules Remaining_modules
1 {A:[1,2,3,4,5,6]} {A:2}
2 {A:[8], B:[5], C:[6]} {A:7, B:1, C:1}
3 {A:[2,3,9], B:[10]} {A:5, B:2}
4 {A:[3]} {A:7}
5 {B:[5]} {B:1}
6 {A:[3]} {A:7}
7 {B:[5]} {B:1}
8 {A:[4]} {A:7}
The idea is to map each course to its modules, then for each user sum the True values of a generator that tests which course modules were not attended:
df2 = df2.set_index('course_id')
mo = df2['modules'].to_dict()
# print(mo)

def f(x):
    return {k: sum(i not in v for i in mo[k]) for k, v in x.items()}

df1['Remaining_modules'] = df1['attended_modules'].apply(f)
print(df1)
user_id courses attended_modules Remaining_modules
0 1 [A] {'A': [1, 2, 3, 4, 5, 6]} {'A': 2}
1 2 [A,B,C] {'A': [8], 'B': [5], 'C': [6]} {'A': 7, 'B': 1, 'C': 1}
2 3 [A,B] {'A': [2, 3, 9], 'B': [10]} {'A': 5, 'B': 2}
3 4 [A] {'A': [3]} {'A': 7}
4 5 [B] {'B': [5]} {'B': 1}
5 6 [A] {'A': [3]} {'A': 7}
6 7 [B] {'B': [5]} {'B': 1}
7 8 [A] {'A': [4]} {'A': 7}
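An equivalent formulation (a small variant of the same idea) counts the remaining modules with a set difference instead of summing a generator:
def f(x):
    # course modules minus attended modules, per course the user took
    return {k: len(set(mo[k]) - set(v)) for k, v in x.items()}

df1['Remaining_modules'] = df1['attended_modules'].apply(f)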

How to convert a dictionary with lists to a dataframe with default index and column names

How to convert a dictionary to a dataframe with a default index and the given column names?
d = {0: [1, 'Sports', 222], 1: [2, 'Tools', 11], 2: [3, 'Clothing', 23]}
Desired df:
id type value
0 1 Sports 222
1 2 Tools 11
2 3 Clothing 23
Use DataFrame.from_dict with the orient='index' parameter:
d = {0: [1, 'Sports', 222], 1: [2, 'Tools', 11], 2: [3, 'Clothing', 23]}
df = pd.DataFrame.from_dict(d, orient='index', columns=['id','type','value'])
print(df)
id type value
0 1 Sports 222
1 2 Tools 11
2 3 Clothing 23
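Since the keys here are just the default 0, 1, 2, an equivalent construction (a small variant, not from the original answer) passes the values straight to the DataFrame constructor:
df = pd.DataFrame(list(d.values()), columns=['id', 'type', 'value'])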

How to split a column into three columns in pandas

I have a data frame as shown below
ID Name Address
1 Kohli Country: India; State: Delhi; Sector: SE25
2 Sachin Country: India; State: Mumbai; Sector: SE39
3 Ponting Country: Australia; State: Tasmania
4 Ponting State: Tasmania; Sector: SE27
From the above I would like to prepare below data frame
ID Name Country State Sector
1 Kohli India Delhi SE25
2 Sachin India Mumbai SE39
3 Ponting Australia Tasmania None
4 Ponting None Tasmania SE27
I tried the code below:
df[['Country', 'State', 'Sector']] = pd.DataFrame(df['Address'].str.split(';', n=2).tolist(),
                                                  columns=['Country', 'State', 'Sector'])
But after this I still have to clean the data by slicing each column. I would like to know whether there is an easier method than this.
Use a list comprehension with a dict comprehension to build a list of dictionaries, and pass it to the DataFrame constructor:
L = [{k: v for y in x.split('; ') for k, v in dict([y.split(': ')]).items()}
     for x in df.pop('Address')]
df = df.join(pd.DataFrame(L, index=df.index))
print (df)
ID Name Country State Sector
0 1 Kohli India Delhi SE25
1 2 Sachin India Mumbai SE39
2 3 Ponting Australia Tasmania NaN
Or use split with reshaping via stack:
df1 = (df.pop('Address')
         .str.split('; ', expand=True)   # one column per 'key: value' pair
         .stack()                        # long format: one pair per row
         .reset_index(level=1, drop=True)
         .str.split(': ', expand=True)   # separate the key from the value
         .set_index(0, append=True)[1]   # index by (row, key), keep the value
         .unstack()                      # back to wide: one column per key
       )
print (df1)
0 Country Sector State
0 India SE25 Delhi
1 India SE39 Mumbai
2 Australia NaN Tasmania
df = df.join(df1)
print (df)
ID Name Country Sector State
0 1 Kohli India SE25 Delhi
1 2 Sachin India SE39 Mumbai
2 3 Ponting Australia NaN Tasmania
You are almost there:
cols = ['ZONE', 'State', 'Sector']
df[cols] = pd.DataFrame(df['Address'].str.split('; ', n=2).tolist(),
                        columns=cols)
for col in cols:
    df[col] = df[col].str.split(': ').apply(lambda x: x[1])
Original answer
This can also do the job:
import pandas as pd

df = pd.DataFrame(
    [
        {'ID': 1, 'Name': 'Kohli', 'Address': 'Country: India; State: Delhi; Sector: SE25'},
        {'ID': 2, 'Name': 'Sachin', 'Address': 'Country: India; State: Mumbai; Sector: SE39'},
        {'ID': 3, 'Name': 'Ponting', 'Address': 'Country: Australia; State: Tasmania'}
    ]
)
cols_to_extract = ['ZONE', 'State', 'Sector']
list_of_rows = df['Address'].str.split(';', n=2).tolist()
df[cols_to_extract] = pd.DataFrame(
    [[item.split(': ')[1] for item in row] for row in list_of_rows],
    columns=cols_to_extract)
Output would be the following:
>> df[['ID', 'Name', 'ZONE', 'State', 'Sector']]
ID Name ZONE State Sector
1 Kohli India Delhi SE25
2 Sachin India Mumbai SE39
3 Ponting Australia Tasmania None
Edited answer
As #jezrael pointed out very well in the question comments, my original answer was wrong because it aligned values by position, which could lead to wrong key-value pairs when some of the values were missing. The following code should work on the edited data set.
import pandas as pd

df = pd.DataFrame(
    [
        {'ID': 1, 'Name': 'Kohli', 'Address': 'Country: India; State: Delhi; Sector: SE25'},
        {'ID': 2, 'Name': 'Sachin', 'Address': 'Country: India; State: Mumbai; Sector: SE39'},
        {'ID': 3, 'Name': 'Ponting', 'Address': 'Country: Australia; State: Tasmania'},
        {'ID': 4, 'Name': 'Ponting', 'Address': 'State: Tasmania; Sector: SE27'}
    ]
)
cols_to_extract = ['Country', 'State', 'Sector']
list_of_rows = df['Address'].str.split(';', n=2).tolist()
df[cols_to_extract] = pd.DataFrame(
    [{item.split(': ')[0].strip(): item.split(': ')[1] for item in row} for row in list_of_rows],
    columns=cols_to_extract)
df = df.rename(columns={'Country': 'ZONE'})
Output would be:
>> df[['ID', 'Name', 'ZONE', 'State', 'Sector']]
ID Name ZONE State Sector
1 Kohli India Delhi SE25
2 Sachin India Mumbai SE39
3 Ponting Australia Tasmania NaN
4 Ponting NaN Tasmania SE27
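For completeness, here is a regex-based sketch (my addition; it assumes every pair follows the 'key: value' layout with '; ' separators) that avoids the manual splitting via Series.str.extractall, starting from the df constructed above before the extracted columns are added:
# each match captures one "key: value" pair from the Address string
pairs = df['Address'].str.extractall(r'(?P<key>\w+): (?P<val>[^;]+)')
wide = (pairs.droplevel('match')                    # back to the plain row index
             .set_index('key', append=True)['val']
             .unstack('key'))                       # one column per extracted key
print(df.drop(columns='Address').join(wide))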