How to extract data from a column that looks like a dictionary in Pandas?

Hi, I am new to pandas/Python and am trying to read a txt file into pandas.
I want to extract the key-value pairs in each row,
making each key a new column name and its value the entry in that column.
Input
data
{'Name': 'Tim', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'Tom', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'Jim', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'John', 'Class': 'Ninth'}
Expected Output:
Name Class Hobbies
Tim Ninth Football
Tom Ninth Football
Jim Ninth Football
John Ninth NA
import pandas as pd
df1 = pd.read_csv('9data.txt',sep = '\t')
df1['Name'] = df1['data'].apply(lambda x : x.values()[1])
print(df1)
Error: AttributeError: 'str' object has no attribute 'values'
Is there any way I can do this in pandas?

Given how the data was being read, each cell is a string rather than a dict, so I could build a new dataframe using eval(). This iterates over the cells, evaluates each one into a dict, creates a one-row dataframe from it, and then concatenates the results.
import io
import pandas as pd

data = '''data
{'Name': 'Tim', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'Tom', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'Jim', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'John', 'Class': 'Ninth'}'''
df = pd.read_csv(io.StringIO(data), sep='\t', engine='python')
# eval() turns each string cell back into a dict before normalizing it
df1 = pd.concat([pd.json_normalize(eval(x)) for x in df['data']])
Output
Name Class Hobbies
0 Tim Ninth Football
0 Tom Ninth Football
0 Jim Ninth Football
0 John Ninth NaN
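Since eval() executes whatever code the string contains, a more cautious variant for files you do not fully trust is to parse each cell with ast.literal_eval, which only accepts Python literals. The following is a minimal sketch, assuming every cell is a well-formed dict literal; the names raw, parsed and expanded are illustrative and not part of the original answer.

```python
import ast
import io

import pandas as pd

raw = """data
{'Name': 'Tim', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'Tom', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'Jim', 'Class': 'Ninth', 'Hobbies' : 'Football'}
{'Name': 'John', 'Class': 'Ninth'}"""

# Each cell is read as a plain string, exactly as in the question
df = pd.read_csv(io.StringIO(raw), sep="\t")

# literal_eval parses the string into a dict without running arbitrary code
parsed = df["data"].apply(ast.literal_eval)

# One DataFrame built from the list of dicts; missing keys become NaN
expanded = pd.DataFrame(parsed.tolist(), index=df.index)
print(expanded)
```

Building a single dataframe from the list of dicts also keeps the original row index (0 to 3) instead of repeating 0 for every row.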
If you can get your data to look like this, there is a simpler method, which Anurag Dabas alludes to. You might consider reading the file into a list first and then creating the dataframe, rather than creating a dataframe from a dataframe.
datal = [{'Name': 'Tim', 'Class': 'Ninth', 'Hobbies' : 'Football'},
{'Name': 'Tom', 'Class': 'Ninth', 'Hobbies' : 'Football'},
{'Name': 'Jim', 'Class': 'Ninth', 'Hobbies' : 'Football'},
{'Name': 'John', 'Class': 'Ninth'}]
df = pd.DataFrame(datal)
df

Related

Why doesn't pandas dataframe need full row values?

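fields = ['name', 'type', 'age']
df = pd.DataFrame(columns=fields)
item1 = {'name': 'john', 'type': 'student', 'age': 21}
item2 = {'name': 'john', 'age': 21}
items = [item1, item2]
for item in items:
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, pd.DataFrame([item])], ignore_index=True)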
I had thought only 'item1' could be appended, not 'item2', since it has only 2 of the 3 fields. Is this normal?
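As a small sketch of the behaviour being asked about: pandas aligns dict keys to column labels and fills any key that is missing from a row with NaN, so a row with fewer keys is accepted rather than rejected. The example below builds the frame directly from the two dicts (an assumption made to keep it short; it is not the loop from the question).

```python
import pandas as pd

fields = ["name", "type", "age"]
item1 = {"name": "john", "type": "student", "age": 21}
item2 = {"name": "john", "age": 21}  # no "type" key

# Rows are aligned on the column labels; the missing "type" becomes NaN
df = pd.DataFrame([item1, item2], columns=fields)
print(df)
print(df["type"].isna().tolist())
```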

How to parse a nested column in a df column?

Is there a smart pythonic way to parse a nested column in a pandas dataframe like this one to 3 different columns? So for example the column could look like this:
col1
[{'name': 'amount', 'value': 1}, {'name': 'frequency', 'value': 2}, {'name': 'freq_unit', 'value': 'month'}]
[{'name': 'amount', 'value': 3}, {'name': 'frequency', 'value': 1}, {'name': 'freq_unit', 'value': 'month'}]
And the expected result should be these 3 columns:
amount frequency freq_unit
1 2 month
3 1 month
That's just level 1. Now for level 2: what if the elements in the list still have the same names (amount, frequency and freq_unit) but the order could change? Could the code in the answer deal with this?
col1
[{'name': 'amount', 'value': 1}, {'name': 'frequency', 'value': 2}, {'name': 'freq_unit', 'value': 'month'}]
[{'name': 'amount', 'value': 3}, {'name': 'freq_unit', 'value': 'month'}, {'name': 'frequency', 'value': 1}]
Code to reproduce the data. I really look forward to seeing how the community would solve this. Thank you.
data = {'col1':[[{'name': 'amount', 'value': 1}, {'name': 'frequency', 'value': 2}, {'name': 'freq_unit', 'value': 'month'}],
[{'name': 'amount', 'value': 3}, {'name': 'frequency', 'value': 1}, {'name': 'freq_unit', 'value': 'month'}]]}
df = pd.DataFrame(data)
A combination of list comprehension, itertools.chain, and collections.defaultdict could help out here:
from itertools import chain
from collections import defaultdict

# Collect (name, value) pairs for every row, then flatten them into one stream
phase1 = [[(item["name"], item["value"])
           for item in entry]
          for entry in df.col1
         ]
phase1 = chain.from_iterable(phase1)

# Group the values by name; each name becomes a column
data = defaultdict(list)
for key, value in phase1:
    data[key].append(value)
pd.DataFrame(data)
amount frequency freq_unit
0 1 2 month
1 3 1 month
The above is verbose; @piRSquared's comment offers a much simpler approach, using a list comprehension:
pd.DataFrame([{x["name"]: x["value"] for x in lst} for lst in df.col1])
Another idea, though an unnecessarily convoluted one, is to use a list comprehension combined with Pandas' string methods:
outcome = [(df.col1.str[num].str["value"]
.rename(df.col1.str[num].str["name"][0])
)
for num in range(df.col1.str.len()[0])
]
pd.concat(outcome, axis = 'columns')
@piRSquared's solution is the simplest, in my opinion.
You can write a function that parses each cell in your Series and returns a properly formatted Series, then use apply to tuck the iteration away:
>>> def custom_parser(record):
...     clean_record = {rec["name"]: rec["value"] for rec in record}
...     return pd.Series(clean_record)
...
>>> df["col1"].apply(custom_parser)
amount frequency freq_unit
0 1 2 month
1 3 1 month
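On the "level 2" part of the question: approaches that build a dict per row key each value by its name, so the order of the elements inside a list should not matter. A minimal sketch using the reordered data from the question (the variable name result is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"col1": [
    [{"name": "amount", "value": 1},
     {"name": "frequency", "value": 2},
     {"name": "freq_unit", "value": "month"}],
    # Same names as the first row, but in a different order
    [{"name": "amount", "value": 3},
     {"name": "freq_unit", "value": "month"},
     {"name": "frequency", "value": 1}],
]})

# One dict per row, keyed by name; pandas aligns the dicts on their keys
result = pd.DataFrame(
    [{item["name"]: item["value"] for item in row} for row in df["col1"]],
    index=df.index,
)
print(result)
```

The positional variant based on str[num], by contrast, takes its column names from the first row only, so it relies on every row using the same order.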

populating nested dictionaries with rows from Pandas data frame

I'm trying to populate a dictionary of dictionaries with entries from a Pandas data frame in Python by iterating through the nested dictionary and populating the values of each sub-dictionary with entries from a row of a Pandas data frame.
Although there are as many sub-dictionaries as there are rows in the data frame, all dictionaries get populated with the data from the last row of the data frame, instead of using every row for every dictionary.
Here is a toy reproducible example.
import pandas as pd
# initialize an empty df
data = pd.DataFrame()
# populate data frame with entries
data['name'] = ['Joe Smith', 'Mary James', 'Charles Williams']
data['school'] = ["Jollywood Secondary", "Northgate Sixth From", "Brompton High"]
data['subjects'] = [['Maths', 'Art', 'Biology'], ['English', 'French', 'History'], ['Chemistry', 'Biology', 'English']]
# use dictionary comprehensions to set up main dictionary and sub-dictionary templates
# sub-dictionary
keys = ['name', 'school', 'subjects']
record = {key: None for key in keys}
# main dictionary
keys2 = ['cand1', 'cand2', 'cand3']
candidates = {key: record for key in keys2}
# as a result i get something like this
# {'cand1': {'name': None, 'school': None, 'subjects': None},
# 'cand2': {'name': None, 'school': None, 'subjects': None},
# 'cand3': {'name': None, 'school': None, 'subjects': None}}
# iterate through main dictionary and populate each sub-dict with row of df
for i, d in enumerate(candidates.items()):
    d[1]['name'] = data['name'].iloc[i]
    d[1]['school'] = data['school'].iloc[i]
    d[1]['subjcts'] = data['subjects'].iloc[i]
# what i end up with is the last row entry in each sub-dictionary
#{'cand1': {'name': 'Charles Williams',
# 'school': 'Brompton High',
# 'subjects': None,
# 'subjcts': ['Chemistry', 'Biology', 'English']},
# 'cand2': {'name': 'Charles Williams',
# 'school': 'Brompton High',
# 'subjects': None,
# 'subjcts': ['Chemistry', 'Biology', 'English']},
# 'cand3': {'name': 'Charles Williams',
# 'school': 'Brompton High',
# 'subjects': None,
# 'subjcts': ['Chemistry', 'Biology', 'English']}}
How do I need to modify my code to get each dictionary populated with a different row from my data frame?
I did not work through your code to look for the bug, because the solution is a one-liner with the method to_dict.
Here is a minimal working example with your sample data.
import pandas as pd
# initialize an empty df
data = pd.DataFrame()
# populate data frame with entries
data['name'] = ['Joe Smith', 'Mary James', 'Charles Williams']
data['school'] = ["Jollywood Secondary", "Northgate Sixth From", "Brompton High"]
data['subjects'] = [['Maths', 'Art', 'Biology'], ['English', 'French', 'History'], ['Chemistry', 'Biology', 'English']]
# redefine index to match your keys
data.index = ['cand{}'.format(i) for i in range(1,len(data)+1)]
# convert to dict
data_dict = data.to_dict(orient='index')
print(data_dict)
This will look something like this
{'cand1': {
'name': 'Joe Smith',
'school': 'Jollywood Secondary',
'subjects': ['Maths', 'Art', 'Biology']},
'cand2': {
'name': 'Mary James',
'school': 'Northgate Sixth From',
'subjects': ['English', 'French', 'History']},
'cand3': {
'name': 'Charles Williams',
'school': 'Brompton High',
'subjects': ['Chemistry', 'Biology', 'English']}}
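For reference, the behaviour described in the question can be reproduced without pandas. In {key: record for key in keys2} every key points at the same record object, so each assignment in the loop overwrites the one shared dict and the last row wins. A small sketch of the effect, and of copying the template per key instead (the names shared and separate are illustrative):

```python
keys = ["name", "school", "subjects"]
names = ["Joe Smith", "Mary James", "Charles Williams"]
cands = ["cand1", "cand2", "cand3"]

record = {key: None for key in keys}

# Every value is the very same dict object
shared = {cand: record for cand in cands}
for cand, name in zip(cands, names):
    shared[cand]["name"] = name
print(shared["cand1"] is shared["cand2"])
print(shared["cand1"]["name"])

# dict(record) makes an independent copy for each key
separate = {cand: dict(record) for cand in cands}
for cand, name in zip(cands, names):
    separate[cand]["name"] = name
print(separate["cand1"]["name"])
```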
Consider avoiding the roundabout way of building the dictionary, as Pandas provides various methods to render nested structures, such as to_dict and to_json. Specifically, consider adding a new column, cand, and setting it as the index for the to_dict output:
data['cand'] = 'cand' + pd.Series((data.index.astype('int') + 1).astype('str'))
mydict = data.set_index('cand').to_dict(orient='index')
print(mydict)
{'cand1': {'name': 'Joe Smith', 'school': 'Jollywood Secondary',
'subjects': ['Maths', 'Art', 'Biology']},
'cand2': {'name': 'Mary James', 'school': 'Northgate Sixth From',
'subjects': ['English', 'French', 'History']},
'cand3': {'name': 'Charles Williams', 'school': 'Brompton High',
'subjects': ['Chemistry', 'Biology', 'English']}}

Pandas Df.head() does not display when called inside the method()?

I cannot see the output of dataframe.head() or dataframe.describe() when the call is made inside a method.
def develop_df():
    studentData = {
        0 : {
            'name' : 'Aadi',
            'age' : 16,
            'city' : 'New york'
        },
        1 : {
            'name' : 'Jack',
            'age' : 34,
            'city' : 'Sydney'
        },
    }
    print("Now lets print student data")
    print(studentData)
    print("%" * 80)
    print("Create a df and then print head")
    st_df = pd.DataFrame(studentData)
    st_df.head()
    print("%" * 80)

develop_df()
Output:
Now lets print student data
{0: {'name': 'Aadi', 'age': 16, 'city': 'New york'}, 1: {'name': 'Jack', 'age': 34, 'city': 'Sydney'}}
Create a df and then print head
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
But when the same code is run outside the method, it works.
studentData = {
0 : {
'name' : 'Aadi',
'age' : 16,
'city' : 'New york'
},
1 : {
'name' : 'Jack',
'age' : 34,
'city' : 'Sydney'
},
}
print("Now lets print student data")
print(studentData)
print("%" * 80)
print("Create a df and then print head")
st_df = pd.DataFrame(studentData)
st_df.head()
Output:
Now lets print student data
{0: {'name': 'Aadi', 'age': 16, 'city': 'New york'}, 1: {'name': 'Jack', 'age': 34, 'city': 'Sydney'}}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Create a df and then print head
0 1
age 16 34
city New york Sydney
name Aadi Jack
Any suggestion on resolving it?
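Inside a function, the value returned by head() is simply discarded; a notebook only displays the last expression of a cell automatically. To pretty-print within a function, first import the display_html function: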
from IPython.display import display_html
Then wrap display_html around any calls to df.head() within a function definition, for example:
display_html(st_df.head())
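Outside a notebook, or when HTML output is not needed, a plain alternative is to print the result explicitly or return the dataframe to the caller. This is a minimal sketch of that idea rather than the method from the answer above; student_data is just a renamed copy of the question's dict.

```python
import pandas as pd


def develop_df():
    student_data = {
        0: {"name": "Aadi", "age": 16, "city": "New york"},
        1: {"name": "Jack", "age": 34, "city": "Sydney"},
    }
    st_df = pd.DataFrame(student_data)
    # head() only returns a DataFrame; print() is what makes it visible
    print(st_df.head())
    # Returning the frame lets the caller inspect or display it as well
    return st_df


st_df = develop_df()
```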

Convert pandas to dictionary defining the columns used for the key values

There's the pandas dataframe 'test_df'. My aim is to convert it to a dictionary.
id Name Gender Age
0 1 'Peter' 'M' 32
1 2 'Lara' 'F' 45
Therefore I run this:
test_dict = test_df.set_index('id').T.to_dict()
The output is this:
{1: {'Name': 'Peter', 'Gender': 'M', 'Age': 32}, 2: {'Name': 'Lara', 'Gender': 'F', 'Age': 45}}
Now, I want to choose only the 'Name' and 'Gender' columns as the values of the dictionary's keys. I'm trying to modify the above script into something like this:
test_dict = test_df.set_index('id')['Name']['Gender'].T.to_dict()
with no success!
Any suggestion please?!
You were very close; select a subset of columns with [['Name','Gender']]:
test_dict = test_df.set_index('id')[['Name','Gender']].T.to_dict()
print (test_dict)
{1: {'Name': 'Peter', 'Gender': 'M'}, 2: {'Name': 'Lara', 'Gender': 'F'}}
Also, T is not necessary; use the parameter orient='index' instead:
test_dict = test_df.set_index('id')[['Name','Gender']].to_dict(orient='index')
print (test_dict)
{1: {'Name': 'Peter', 'Gender': 'M'}, 2: {'Name': 'Lara', 'Gender': 'F'}}
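For anyone who wants to run this end to end, here is a self-contained sketch. The dataframe is rebuilt from the table in the question, and wanted is an illustrative name for the list of columns to keep.

```python
import pandas as pd

# Rebuild the sample frame shown in the question
test_df = pd.DataFrame({
    "id": [1, 2],
    "Name": ["Peter", "Lara"],
    "Gender": ["M", "F"],
    "Age": [32, 45],
})

wanted = ["Name", "Gender"]

# A list inside the brackets selects several columns at once;
# orient="index" yields one inner dict per id
test_dict = test_df.set_index("id")[wanted].to_dict(orient="index")
print(test_dict)
```

The original attempt, ['Name']['Gender'], fails because the first lookup returns a single Series indexed by id, and the second lookup then searches that Series for a label called 'Gender'.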