Pandas: extracting a value from a column in a DataFrame

I have a column in my Dataframe that has data in the below format
id,value
101,[{'self': 'https://www.web.com/rest/api/101', 'value': 'Yes', 'id': '546'}]
The column (value) is of type pandas.core.series.Series.
I am trying to extract the text corresponding to 'value' from the above dataframe.
Expected output:
id, output
101,Yes

See if this works for you:
a = df['value'].str[0].apply(pd.Series)  # expand the first dict of each list into columns
df['value'] = a['value']
print(df)
Output
id value
0 101 Yes

import pandas as pd

cols = ['id', 'value']
data = [
    [101, [{'self': 'https://www.web.com/rest/api/101', 'value': 'Yes', 'id': '546'}]]
]
df = pd.DataFrame(data=data, columns=cols)
# Pull the 'value' key out of the first dict in each row's list
df.value = df.apply(lambda x: x['value'][0]['value'], axis=1)
print(df)
Result
id value
0 101 Yes
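As an aside, the same extraction can be written with chained .str accessors, since .str indexing works on list elements and .str.get works on dict keys. A minimal sketch, using the same toy data as above:

import pandas as pd

df = pd.DataFrame({
    'id': [101],
    'value': [[{'self': 'https://www.web.com/rest/api/101', 'value': 'Yes', 'id': '546'}]],
})
# .str[0] takes the first element of each list; .str.get('value') reads the dict key
df['value'] = df['value'].str[0].str.get('value')
print(df)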

Related

Transform dictionary value in a single pandas dataframe column into multiple columns

import pandas as pd

data = {'ID': [112, 113],
        'empDetails': [[{'key': 'score', 'value': 2},
                        {'key': 'Name', 'value': 'Ajay'},
                        {'key': 'Department', 'value': 'HR'}],
                       [{'key': 'salary', 'value': 7.5},
                        {'key': 'Name', 'value': 'Balu'}]]}
dataDF = pd.DataFrame(data)
# trials
# dataDF['newColumns'] = dataDF['empDetails'].apply(lambda x: x[0].get('key'))
# dataDF = dataDF['empDetails'].apply(pd.Series)
# create dataframe
# dataDF = pd.DataFrame(dataDF['empDetails'], columns=dataDF['empDetails'].keys())
# create the dataframe
# df = pd.concat([pd.DataFrame(v, columns=[k]) for k, v in dataDF['empDetails'].items()], axis=1)
# print(dataDF['empDetails'].items())
display(dataDF)
I am trying to iterate through the empDetails column and fetch the values of Name, salary and Department into 3 different columns.
Using pd.Series I am able to split the dictionaries into separate columns, but I am not able to rename the columns, because the column order may change.
What would be an effective way to do this?
Expected output:
Use a lambda function to extract keys and values into new dictionaries, then pass them to the DataFrame constructor:
f = lambda x: {y['key']:y['value'] for y in x}
df = dataDF.join(pd.DataFrame(dataDF['empDetails'].apply(f).tolist(), index=dataDF.index))
print (df)
ID empDetails score Name \
0 112 [{'key': 'score', 'value': 2}, {'key': 'Name',... 2.0 Ajay
1 113 [{'key': 'salary', 'value': 7.5}, {'key': 'Nam... NaN Balu
Department salary
0 HR NaN
1 NaN 7.5
Alternative solution:
f = lambda x: pd.Series({y['key']:y['value'] for y in x})
df = dataDF.join(dataDF['empDetails'].apply(f))
print (df)
ID empDetails score Name \
0 112 [{'key': 'score', 'value': 2}, {'key': 'Name',... 2.0 Ajay
1 113 [{'key': 'salary', 'value': 7.5}, {'key': 'Nam... NaN Balu
Department salary
0 HR NaN
1 NaN 7.5
Or use a list comprehension (a pandas-only solution):
df1 = pd.DataFrame([{y['key']:y['value'] for y in x} for x in dataDF['empDetails']],
                   index=dataDF.index)
df = dataDF.join(df1)
If you are using Python 3.5+, you can unroll the dict elements and append the "ID" column in one line:
df.apply(lambda row: pd.Series({**{"ID":row["ID"]}, **{ed["key"]:ed["value"] for ed in row["empDetails"]}}), axis=1)
Update: If you want all columns from the original df, use a dict comprehension:
df.apply(lambda row: pd.Series({**{col:row[col] for col in df.columns}, **{ed["key"]:ed["value"] for ed in row["empDetails"]}}), axis=1)
Full example:
data = {'ID': [112, 113],
        'empDetails': [[{'key': 'score', 'value': 2},
                        {'key': 'Name', 'value': 'Ajay'},
                        {'key': 'Department', 'value': 'HR'}],
                       [{'key': 'salary', 'value': 7.5},
                        {'key': 'Name', 'value': 'Balu'}]]}
df = pd.DataFrame(data)
df = df.apply(lambda row: pd.Series({**{col:row[col] for col in df.columns}, **{ed["key"]:ed["value"] for ed in row["empDetails"]}}), axis=1)
[Out]:
Department ID Name salary score
0 HR 112 Ajay NaN 2.0
1 NaN 113 Balu 7.5 NaN
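As a further alternative on newer pandas (explode needs 0.25+, top-level pd.json_normalize needs 1.0+), the same reshape can be done with explode and pivot. A sketch on the same data:

import pandas as pd

data = {'ID': [112, 113],
        'empDetails': [[{'key': 'score', 'value': 2},
                        {'key': 'Name', 'value': 'Ajay'},
                        {'key': 'Department', 'value': 'HR'}],
                       [{'key': 'salary', 'value': 7.5},
                        {'key': 'Name', 'value': 'Balu'}]]}
df = pd.DataFrame(data)

# One row per (ID, dict) pair, then expand each dict into key/value columns
long = df.explode('empDetails').reset_index(drop=True)
kv = pd.json_normalize(long['empDetails'].tolist())
long = pd.concat([long[['ID']], kv], axis=1)

# Pivot back: one row per ID, one column per key
wide = long.pivot(index='ID', columns='key', values='value').reset_index()
print(wide)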

How to consolidate series data and make a new dataframe in pandas?

I've got a dataframe like this:
[image: original data]
and I hope to have a new dataframe like the one below:
[image: new data]
How can I create code for this modification? It needs to consolidate the first series' data and create a new dataframe.
Some imports:
import pandas as pd
Here we create dataframe from data you provided:
df = pd.DataFrame({
    "a": [
        'A2C02158300', 'D REC/BAS16-03W 100V 250mA SOD323 0s SMD',
        'D201,D206,D218,D219,D222,D302,D308,D408', 'D409,D501,D502,D505,D506,D507,D508',
        'A2C02250500', 'T BIP/PUMD3,SOT363,SMD SOLDERING', 'T209,T501,T502'
    ]
})
df.head(10)
Output:
Then we take the slices we need (all of them from the original df, before it is overwritten) and build a dataframe from the first two:
s1 = df.iloc[::4, :].reset_index(drop=True)
s2 = df.iloc[1::4, :].reset_index(drop=True)
s3 = df.iloc[2::4, :].reset_index(drop=True)  # slice now, before df is replaced below
df = pd.DataFrame({
    'a': s1['a'],
    'b': s2['a']
})
After that, prepare and add the third column:
s3 = s3['a'].str.split(',').apply(pd.Series, 1).stack()
s3.index = s3.index.droplevel(-1)
s3.name = 'c'
df = df.join(s3)
df.reset_index(drop=True, inplace=True)
df
Output:
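Side note: on pandas 0.25+ the split/stack/droplevel steps for the third column can be replaced by Series.explode, which keeps the original index for each fragment. A minimal sketch of just that step, on made-up values:

import pandas as pd

s3 = pd.Series(['D201,D206,D218', 'T209,T501'])
c = s3.str.split(',').explode().rename('c')  # one row per comma-separated code
print(c)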

Get max value of 3 columns from pandas DataFrame?

I've a Pandas DataFrame with 3 columns:
c={'a': [['US']],'b': [['US']], 'c': [['US','BE']]}
df = pd.DataFrame(c, columns = ['a','b','c'])
Now I need the max value of these 3 columns.
I've tried:
df['max_val'] = df[['a','b','c']].max(axis=1)
The result is NaN instead of the expected output: US.
How can I get the max value for these 3 columns? (And what if one of them contains NaN?)
Use:
c = {'a': [['US', 'BE'], ['US']], 'b': [['US'], ['US']], 'c': [['US', 'BE'], ['US', 'BE']]}
df = pd.DataFrame(c, columns=['a', 'b', 'c'])
from collections import Counter
# Row-wise: count the lists (as hashable tuples) and keep the most common one
df = df[['a', 'b', 'c']].apply(lambda x: list(Counter(map(tuple, x)).most_common()[0][0]), 1)
print (df)
0 [US, BE]
1 [US]
dtype: object
If, as @Erfan stated, you want the most common value in a row, then use .agg() with 'mode':
df.agg('mode', axis=1)
0
0 [US, BE]
1 [US]
While your data are lists, you can't use pandas' mode(), because list objects are unhashable, so the mode() function won't work.
A solution is to convert the elements of your dataframe's rows to strings and then use mode().
Check this:
>>> import pandas as pd
>>> c = {'a': [['US','BE']],'b': [['US']], 'c': [['US','BE']]}
>>> df = pd.DataFrame(c, columns = ['a','b','c'])
>>> x = df.iloc[0].apply(lambda x: str(x))
>>> x.mode()
# Answer:
0 ['US', 'BE']
dtype: object
>>> d = {'a': [['US']],'b': [['US']], 'c': [['US','BE']]}
>>> df2 = pd.DataFrame(d, columns = ['a','b','c'])
>>> z = df2.iloc[0].apply(lambda z: str(z))
>>> z.mode()
# Answer:
0 ['US']
dtype: object
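A variant of the same idea that avoids the lossy string conversion: map each list to a tuple, which is hashable, so mode() works directly. A sketch:

import pandas as pd

c = {'a': [['US']], 'b': [['US']], 'c': [['US', 'BE']]}
df = pd.DataFrame(c, columns=['a', 'b', 'c'])
# Tuples are hashable, so the row-wise mode can be computed without str()
modes = df.applymap(tuple).mode(axis=1)
print(modes)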
As I can see, some of your elements are of list type, so I think the code below will work fine.
First, append all values into one array.
Then, find the most frequently occurring element in that array.
from collections import Counter

arr = []
for i in df:                        # each column
    for j in range(len(df[i])):     # each row
        for k in range(len(df[i][j])):  # each element of the list in that cell
            arr.append(df[i][j][k])

b = Counter(arr)
print(b.most_common())
This will give you the answer you want.

How to find pearson correlation between rows in two dataframes

I have a dataframe that I split into two dataframes with the same number of columns and rows (df1 and df2). I want to write a function that goes through each row and feeds the values into the scipy.stats.pearsonr() function. How would I do this?
Something like:
import scipy.stats

for index, row in df1.iterrows():
    print(scipy.stats.pearsonr(df1.loc[index], df2.loc[index]))
If you just want the function, try this:
import pandas as pd
from scipy.stats import pearsonr
df1 = pd.DataFrame(
    {
        'A': [0, 2, 3, 4, 5],
        'B': [2, 3, 4, 5, 6],
        'C': [5, 6, 7, 8, 9],
    }
)
df2 = pd.DataFrame(
    {
        'A': [2, 1, 3, 4, 5],
        'B': [3, 2, 4, 5, 6],
        'C': [7, 7, 7, 3, 3],
    }
)

def pandas_pearsonr(df1, df2):
    assert len(df1) == len(df2)
    coefs = []
    for i in range(0, len(df1)):
        coefs.append(pearsonr(df1.iloc[i].values, df2.iloc[i].values))
    print(coefs)
    return pd.DataFrame(index=df1.index, data=coefs, columns=['coef', 'p-value'])
pandas_pearsonr(df1, df2)
Output looks like this:
coef p-value
0 0.976221 0.139109
1 0.996271 0.054996
2 1.000000 0.000000
3 -0.720577 0.487754
4 -0.838628 0.366717
But I think it can be more pythonic. And maybe you can use pandas.DataFrame.corr.
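If only the coefficients are needed (no p-values), a vectorized row-wise alternative is DataFrame.corrwith with axis=1, which pairs rows up by index. A minimal sketch:

import pandas as pd

df1 = pd.DataFrame({'A': [0, 2, 3], 'B': [2, 3, 4], 'C': [5, 6, 7]})
df2 = pd.DataFrame({'A': [2, 1, 3], 'B': [3, 2, 4], 'C': [7, 7, 7]})
# Pearson is the default method; axis=1 correlates each row of df1
# with the same-index row of df2
print(df1.corrwith(df2, axis=1))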

How do I stop a KeyError in Python from recurring, or create an exception to handle it?

I'm new to Python and working with an API.
My code is below:
import pandas as pd
import json
from pandas.io.json import json_normalize
import datetime
threedaysago = datetime.date.fromordinal(datetime.date.today().toordinal()-3).strftime("%F")
import http.client
conn = http.client.HTTPSConnection("api.sendgrid.com")
payload = "{}"
keys = {
    # "CF" : "SG.UdhzjmjYR**.-",
}
df = []  # list that collects the stats rows (turned into a DataFrame below)
for name, value in keys.items():
    headers = {'authorization': "Bearer " + value}
    conn.request("GET", "/v3/categories/stats/sums?aggregated_by=&start_date={d}&end_date={d}".format(d=threedaysago), payload, headers)
    res = conn.getresponse()
    data = res.read()
    print(data.decode("utf-8"))
    d = json.loads(data.decode("utf-8"))
    c = d['stats']
    # row = d['stats'][0]['name']
    # Add Brand to data row here with 'name'
    df.append(c)  # Load data row into df

df = pd.DataFrame(df[0])
df_new = df[['name']]
df_new.rename(columns={'name':'Category'}, inplace=True)
df_metric =pd.DataFrame(list(df['metrics'].values))
sendgrid = pd.concat([df_new, df_metric], axis=1, sort=False)
sendgrid.set_index('Category', inplace = True)
sendgrid.insert(0, 'Date', threedaysago)
sendgrid.insert(1,'BrandId',99)
sendgrid.rename(columns={
    'blocks': 'Blocks',
    'bounce_drops': 'BounceDrops',
    'bounces': 'Bounces',
    'clicks': 'Clicks',
    'deferred': 'Deferred',
    'delivered': 'Delivered',
    'invalid_emails': 'InvalidEmails',
    'opens': 'Opens',
    'processed': 'Processed',
    'requests': 'Requests',
    'spam_report_drops': 'SpamReportDrops',
    'spam_reports': 'SpamReports',
    'unique_clicks': 'UniqueClicks',
    'unique_opens': 'UniqueOpens',
    'unsubscribe_drops': 'UnsubscribeDrops',
    'unsubscribes': 'Unsubscribes'
}, inplace=True)
When I run this however, I receive an error:
KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]"
I know this happens because there are no stats available for three days ago:
{"date":"2020-02-16","stats":[]}
But how do I handle these exceptions in my code? This will run as a daily report, and it will break if the error is not handled.
Sorry for the late answer.
KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]" means there is no column called name in your dataframe.
But, you believe that error occurred because of "stats" : []. It is also not true. If any of the indexes is empty the error should occur as ValueError: arrays must all be same length
I have recreated this problem and I will show you to get an idea to overcome this problem.
Recreating KeyError: "None of [Index(['name'], dtype='object')] are in the [columns]"
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c']}]
df = pd.DataFrame(df[0])
df = df[['D']]
print(df)
Output:
KeyError: "None of [Index(['D'], dtype='object')] are in the [columns]"
Solution: you can see there is no column called 'D' in the data frame, so recheck your columns.
Now add 'D' and see what happens:
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C':['a','b','c'], 'D': []}]
df = pd.DataFrame(df[0])
df = df[['D']]
print(df)
Output:
ValueError: arrays must all be same length
Solution: column 'D' needs the same number of items as 'A', 'B', and 'C'.
To overcome both problems:
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C': ['a','b','c'], 'D': []}]
# orient='index' tolerates lists of different lengths; missing cells become None
df = pd.DataFrame.from_dict(df[0], orient='index')
print(df)
Output:
0 1 2
A 1 4 5
B 4 5 6
C a b c
D None None None
You can see the original columns are now represented as rows. You can use loc to select each one.
import pandas as pd
df = [{'A': [1,4,5], 'B': [4,5,6], 'C': ['a','b','c'], 'D': []}]
df = pd.DataFrame.from_dict(df[0], orient='index')
df = df.loc[['A']]  # uses loc
print(df)
Output:
0 1 2
A 1 4 5
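To answer the exception-handling part of the question directly: a minimal sketch that guards on the empty stats list before any column selection (the response shape is copied from the question; the variable names are illustrative):

import pandas as pd

response = {"date": "2020-02-16", "stats": []}  # an empty day, as in the question

stats = response["stats"]
if not stats:
    # Nothing to report for this day: skip instead of crashing the daily job
    print("No stats for {}; skipping report.".format(response["date"]))
else:
    df = pd.DataFrame(stats)
    sendgrid = df[["name"]].rename(columns={"name": "Category"})
    print(sendgrid)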